
Written by: Dezső Sima

Reviewed by: the faculty working group

THE ARCHITECTURE OF
PARALLEL SYSTEMS

PARALLEL COMPUTING MODULE

PROACTIVE IT MODULE DEVELOPMENT


1
COPYRIGHT:
2011-2016, Dr. Dezső Sima, Óbuda University, John von Neumann Faculty of Informatics

REVIEWED BY: the faculty working group

Creative Commons NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0)


This work may be freely copied, distributed, published and presented for non-commercial
purposes with attribution to the author, but it may not be modified.

FUNDING:
Prepared within the framework of the project TÁMOP-4.1.2-08/2/A/KMR-2009-0053,
"Proactive IT module development (PRIM1): IT Service Management module and
Multithreaded processors and their programming module"

PUBLISHED: by Typotex Kiadó


RESPONSIBLE MANAGER: Zsuzsa Votisky
ISBN 978-963-279-561-4

2
KEYWORDS:
multicore processors, manycore processors, homogeneous multicore processors,
heterogeneous multicore processors, master-slave type heterogeneous multicore
processors, add-on type heterogeneous multicore processors, Core
2/Penryn/Nehalem/Nehalem-EX/Westmere/Westmere-EX/Sandy Bridge based Intel
architectures, private consumer and enterprise oriented platforms, Intel's vPro
platform, general purpose GPUs (GPGPUs), data parallel accelerators (DPAs), integrated
CPU/GPU architectures

SUMMARY:
In this course students get an overview of the rapid evolution that processor
architectures have undergone in recent years. They become acquainted with the reasons
that made the appearance of multicore processors inevitable, with the main classes of
multicore/manycore processors, namely the homogeneous and the heterogeneous multicore
processors, their subclasses and representative implementations.
The main families of Intel's multicore processors and their main features are presented,
namely the Core 2, Penryn, Nehalem, Nehalem-EX, Westmere, Westmere-EX and
Sandy Bridge based architectures and their characteristics. Students also become acquainted
with multicore desktop platforms, in particular with the private consumer and the enterprise
oriented (vPro) platforms and their specific features. Understanding the material is supported
by presenting a large number of concrete implementations. The lectures then discuss the
general purpose GPUs (GPGPUs) and data parallel accelerators (DPAs) that are spreading
ever more widely in compute-intensive applications. Finally, the architectures of the
representative Nvidia and AMD/ATI GPGPU families are presented, as well as the integrated
CPU/GPU architectures that appeared in the most recent phase of processor evolution,
together with their representative implementations.

3
Tartalomjegyzék

• Multicore-Manycore Processors
• Evolution of Intel’s Basic Microarchitectures
• Intel’s Desktop Platforms
• GPGPUs/DPAs Overview
• GPGPUs/DPAs 5.1
• GPGPUs/DPAs 5.2
• Integrated CPUs/GPUs
• References to all four sections of GPGPUs/DPAs

© Sima Dezső, ÓE NIK 4 www.tankonyvtar.hu


Multicore-Manycore
Processors

Dezső Sima

© Sima Dezső, ÓE NIK 5 www.tankonyvtar.hu


Contents

• 1.The inevitable era of multicores

• 2. Homogeneous multicores

• 2.1 Conventional multicores

• 2.2 Many-core processors

• 3. Heterogeneous multicores

• 3.1 Master-slave type heterogeneous multicores

• 3.2 Add-on type heterogeneous multicores

• 4. Outlook

• 5. References

© Sima Dezső, ÓE NIK 6 www.tankonyvtar.hu


1. The inevitable era of multicores

© Sima Dezső, ÓE NIK 7 www.tankonyvtar.hu


1. The inevitable era of multicores (1)

1. The inevitable era of multicores


Integer performance grows

[Figure: Integer performance (SPECint92, logarithmic scale) of Intel's x86 processors plotted
against the year of introduction (1979-2005), from the 8088/5 up to the Prescott based
Pentium 4/3200. Performance grows by roughly 100x per 10 years and then levels off.]

Figure 1.1: Integer performance growth of Intel's x86 processors


© Sima Dezső, ÓE NIK www.tankonyvtar.hu
1. The inevitable era of multicores (2)

Performance (Pa)

Pa = fC x IPC
(Clock frequency x Instructions Per Cycle)

or equivalently: Pa = clock frequency (fC) x efficiency (Pa/fC)
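As a small worked example of the relation above, here is a minimal host-side C sketch;
the clock and IPC values are illustrative assumptions, not data of any particular processor.

#include <stdio.h>

/* Pa = fC x IPC: performance = clock frequency x instructions per cycle.        */
/* The figures below are illustrative assumptions, not data of a real processor. */
int main(void)
{
    double fc  = 3.2e9;    /* clock frequency in Hz (assumed)            */
    double ipc = 1.5;      /* sustained instructions per cycle (assumed) */

    double pa = fc * ipc;  /* executed instructions per second           */
    printf("Pa           = %.2e instructions/s\n", pa);

    /* With IPC (the efficiency Pa/fC) levelled off, raising fC remains
       the only lever: doubling fC doubles Pa.                            */
    printf("Pa at 2 x fC = %.2e instructions/s\n", 2.0 * fc * ipc);
    return 0;
}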

© Sima Dezső, ÓE NIK 9 www.tankonyvtar.hu


1. The inevitable era of multicores (3)

[Figure: Efficiency (SPECint_base2000 / fC, logarithmic scale) of Intel processors plotted
against the year of introduction (1978-2002), from the 286 up to the 2nd generation
superscalars (Pentium Pro, Pentium II, Pentium III). Efficiency grows by roughly 10x per
10 years and then levels off with the 2nd generation superscalars.]

Figure 1.2: Efficiency of Intel processors

© Sima Dezső, ÓE NIK 10 www.tankonyvtar.hu


1. The inevitable era of multicores (4)

Main sources of processor efficiency (IPC)

• Processor width: pipeline (1 instr./cycle) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity, ...)

Figure 1.3: Main sources of processor efficiency

© Sima Dezső, ÓE NIK 11 www.tankonyvtar.hu


1. The inevitable era of multicores (5)

Figure 1.4: Extent of parallelism available in general purpose applications
for 2nd generation superscalars [37]
© Sima Dezső, ÓE NIK 12 www.tankonyvtar.hu
1. The inevitable era of multicores (6)

Main sources of processor efficiency (IPC)

• Processor width: pipeline (1 instr./cycle) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity, ...)

Figure 1.5: Main sources of processor efficiency

© Sima Dezső, ÓE NIK 13 www.tankonyvtar.hu


1. The inevitable era of multicores (7)

Beginning with the 2nd generation superscalars

• the era of extensively increasing processor efficiency came to an end,

• processor efficiency levelled off.

Pa = fC x IPC
(Clock frequency x Instructions Per Cycle)

Thus further performance increases could basically be achieved only by raising fC.

© Sima Dezső, ÓE NIK 14 www.tankonyvtar.hu


1. The inevitable era of multicores (8)

Shrinking: ~ 0.7/2 Years

Figure 1.6: Evolution of Intel’s process technology [38]

© Sima Dezső, ÓE NIK 15 www.tankonyvtar.hu


1. The inevitable era of multicores (9)

Figure 1.7: The actual rise of IC complexity in DRAMs and microprocessors [39]

© Sima Dezső, ÓE NIK 16 www.tankonyvtar.hu


1. The inevitable era of multicores (10)

Main sources of processor efficiency (IPC)

• Processor width: pipeline (1 instr./cycle) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity, ...)

Transistor counts double about every two years (Moore's law):
what is the best use of the ever increasing number of transistors???

© Sima Dezső, ÓE NIK 17 www.tankonyvtar.hu


1. The inevitable era of multicores (11)

IC fab technology: linear shrink of ~0.7x every 2 years, i.e. ~0.5x area per transistor
(0.7 x 0.7 ≈ 0.5), which amounts to a doubling of transistor counts every 2 years
(Moore's law)

Possible use of surplus transistors:

• Processor width: pipeline (1 instr./cycle) → 1st gen. superscalar (2) → 2nd gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity, ...)

Figure 1.8: Possible use of surplus transistors


© Sima Dezső, ÓE NIK 18 www.tankonyvtar.hu
1. The inevitable era of multicores (12)

Increasing number of transistors → diminishing return in performance

⇒ Use the available surplus transistors for multiple cores

⇒ The inevitable era of multicores

© Sima Dezső, ÓE NIK 19 www.tankonyvtar.hu


1. The inevitable era of multicores (13)

Figure 1.9: Rapid spreading of Intel’s multicore processors [40]


© Sima Dezső, ÓE NIK 20 www.tankonyvtar.hu
2. Homogeneous multicores

• 2.1 Conventional multicores

• 2.2 Manycore processors

© Sima Dezső, ÓE NIK 21 www.tankonyvtar.hu


2. Homogeneous multicores (1)

Multicore processors

• Homogeneous multicores
    - Conventional multicores (2 ≤ n ≤ 8 cores): mobiles, desktops, servers;
      general purpose computing
    - Manycore processors (> 8 cores): MPC;
      prototypes/experimental systems
• Heterogeneous multicores
    - Master/slave type multicores (CPU): MM/3D/HPC, production stage
    - Add-on type multicores (GPU): HPC

Figure 2.1: Major classes of multicore processors

© Sima Dezső, ÓE NIK 22 www.tankonyvtar.hu
2.1 Conventional multicores

• 2.1.1 Example: Intel’s MP servers

© Sima Dezső, ÓE NIK 23 www.tankonyvtar.hu


2.1 Conventional multicores (1)

Multicore processors

• Homogeneous multicores
    - Conventional multicores (2 ≤ n ≤ 8 cores): mobiles, desktops, servers;
      general purpose computing
    - Manycore processors (> 8 cores): MPC;
      prototypes/experimental systems
• Heterogeneous multicores
    - Master/slave type multicores (CPU): MM/3D/HPC, production stage
    - Add-on type multicores (GPU): HPC

Figure 2.1: Major classes of multicore processors

© Sima Dezső, ÓE NIK 24 www.tankonyvtar.hu
2.1.1 Example: Intel’s MP servers

• 2.1.1.1 Introduction

• 2.1.1.2 The Pentium 4 based Truland MP platform

• 2.1.1.3 The Core 2 based Caneland MP platform

• 2.1.1.4 The Nehalem-EX based Boxboro-EX MP platform

• 2.1.1.5 Evolution of MP platforms

© Sima Dezső, ÓE NIK 25 www.tankonyvtar.hu


2.1.1.1 Introduction (1)

2.1.1.1 Introduction

Servers

• Uni-Processors (UP)
• Dual Processors (DP)
• Multi Processors (MP), typically 4 processors
• Servers with more than 8 processors

© Sima Dezső, ÓE NIK 26 www.tankonyvtar.hu


2.1.1.1 Introduction (2)
Basic arch.    Core/technology      Date      MP server processors
Pentium 4      Pentium 4    90 nm   11/2005   Paxville MP: 2x1 C, 2 MB L2/C
(Prescott)     Pentium 4    65 nm   8/2006    7100 (Tulsa): 2x1 C, 1 MB L2/C, 16 MB L3
Core 2         Core 2       65 nm   9/2007    7200 (Tigerton DC): 1x2 C, 4 MB L2/C
                                              7300 (Tigerton QC): 2x2 C, 4 MB L2/C
               Penryn       45 nm   9/2008    7400 (Dunnington): 1x6 C, 3 MB L2/2C, 16 MB L3
Nehalem        Nehalem-EP   45 nm   -         -
               Westmere-EP  32 nm   -         -
               Nehalem-EX   45 nm   3/2010    7500 (Beckton): 1x8 C, ¼ MB L2/C, 24 MB L3
               Westmere-EX  32 nm   4/2011    E7-48xx (Westmere-EX): 1x10 C, ¼ MB L2/C, 30 MB L3
Sandy Bridge   Sandy Bridge 32 nm   2011      -
               Ivy Bridge   22 nm   11/2012   -

Table 2.1: Overview of Intel's multicore MP servers

© Sima Dezső, ÓE NIK 27 www.tankonyvtar.hu
2.1.1.1 Introduction (3)

MP server platforms

• Pentium 4 based MP server platform: Truland (2005)
  (90 nm/65 nm Pentium 4 Prescott MP based)
• Core 2 based MP server platform: Caneland (2007)
• Nehalem-EX based MP server platform: Boxboro-EX (2010)
• Sandy Bridge based MP server platform: to be announced yet

© Sima Dezső, ÓE NIK 28 www.tankonyvtar.hu


2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (1)

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform

Overview

Remark
For presenting a more complete view of the evolution of multicore MP server platforms
we include also the single core (SC) 90 nm Pentium 4 Prescott based Xeon MP (Potomac)
processor that was the first 64-bit MP server processor and gave rise to the Truland platform.

© Sima Dezső, ÓE NIK 29 www.tankonyvtar.hu


2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (2)
MP platforms: Truland (3/2005), Truland updated (11/2005)

MP cores:
• Xeon MP (Potomac), 3/2005: 1C, 90 nm/675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s FSB,
  mPGA 604 (Pentium 4 based/90 nm)
• Xeon 7000 (Paxville MP), 11/2005: 2x1C, 90 nm/2x169 mtrs, 2x1 (2) MB L2, no L3,
  800/667 MT/s FSB, mPGA 604 (Pentium 4 based/90 nm)
• Xeon 7100 (Tulsa), 8/2006: 2C, 65 nm/1328 mtrs, 2x1 MB L2, 16/8/4 MB L3,
  800/667 MT/s FSB, mPGA 604 (Pentium 4 based/65 nm)

MCH:
• E8500 (Twin Castle), 3/2005: 2xFSB at 667 MT/s, HI 1.5, 4 x XMB
  (2 channels/XMB, 4 DIMMs/channel, DDR-266/333, DDR2-400, up to 32 GB)
• E8501 (Twin Castle?), 11/2005: 2xFSB at 800 MT/s, HI 1.5, 4 x XMB
  (2 channels/XMB, 4 DIMMs/channel, DDR-266/333, DDR2-400, up to 32 GB)

ICH: ICH5 (4/2003)
© Sima Dezső, ÓE NIK www.tankonyvtar.hu
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (3)
[Figure: Roadmap (2000-2005) of Intel's Pentium 4 era processor lines: the Xeon MP line
(Foster-MP, Gallatin, Potomac), the Xeon DP line (Foster, Prestonia, Nocona, the cancelled
Jayhawk), the Extreme Edition line, the desktop line (Willamette, Northwood, Prescott,
Prescott-F, the cancelled Tejas) and the Celeron line, with process technology, transistor
count, clock frequency, on-die caches, FSB rate and socket for each core. Shading marks
cores supporting hyperthreading, cores with EM64T implemented but not enabled, and cores
supporting EM64T.]

Figure 2.2: The Potomac processor as Intel's first 64-bit Xeon MP processor, based on the
third core (Prescott core) of the Pentium 4 family of processors

© Sima Dezső, ÓE NIK 31 www.tankonyvtar.hu
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (4)

Basic system architecture of the 90 nm Pentium 4 Prescott MP based Truland
MP server platform

Processors: Xeon MP (Potomac) 1C / Xeon 7000 (Paxville MP) 2x1C / Xeon 7100 (Tulsa) 2C

Four Pentium 4 Xeon (1C/2x1C) processors, two per FSB, connect to the E8500(1)/8501 MCH;
four XMBs provide the memory interface (DDR-266/333, DDR2-400); the ICH5 is attached
via HI 1.5.

XMB (eXternal Memory Bridge): provides a serial link with 5.33 GB/s inbound and
2.65 GB/s outbound bandwidth (simultaneously)
HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate

Pentium 4 Prescott MP based Truland MP server platform (for up to 2 cores)

1 The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac)

© Sima Dezső, ÓE NIK 32 www.tankonyvtar.hu


2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (5)

Expanding the Truland platform to 3 generations of Pentium 4 based Xeon MP servers

Xeon MP Xeon 7000 Xeon 7100


1C 2x1C 2C

Figure 2.3: Expanding the Truland platform [1]

© Sima Dezső, ÓE NIK 33 www.tankonyvtar.hu


2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (6)

Example 1: Block diagram of an 8500 chipset based Truland MP server board [2]

Figure 2.4: Block diagram of an 8500 chipset based Truland MP server board [2]
© Sima Dezső, ÓE NIK 34 www.tankonyvtar.hu
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (7)

Example 2: Block diagram of the E8501 based Truland MP server platform [3]

Xeon DC MP 7000
(4/2005) or later
DC/QC MP 7000 processors

IMI: Independent
Memory Interface
IMI: Serial link
5.33 GB inbound BW
2.67 GB outbound BW
simultaneously

(North Bridge)                     XMB: eXternal Memory Bridge

Intelligent MC
Dual mem. channels
DDR 266/333/400
4 DIMMs/channel

Figure 2.5: Intel’s 8501 chipset based Truland MP server platform (4/ 2006) [3]
© Sima Dezső, ÓE NIK 35 www.tankonyvtar.hu
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (8)

Example 3: E8501 based MP server board implementing the Truland platform

2 x XMB

DDR2 Xeon DC
DIMMs
7000/7100
64 GB

2 x XMB E8501 NB

ICH5R SB

Figure 2.6: Intel E8501 chipset based MP server board (Supermicro X6QT8)
for the Xeon 7000/7100 DC MP processor families [4]
© Sima Dezső, ÓE NIK 36 www.tankonyvtar.hu
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (9)

Figure 2.7: Bandwidth bottlenecks in Intel's 8501 based Truland MP server platform [5]
© Sima Dezső, ÓE NIK 37 www.tankonyvtar.hu
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (10)

Remark

Previous (first generation) MP servers made use of a symmetric topology including only a
single FSB that connects all 4 single core processors to the MCH (north bridge), as shown
below.

Typical system architecture of a first generation Xeon MP based MP server platform

Xeon MP1 Xeon MP1 Xeon MP1 Xeon MP1


SC SC SC SC

FSB

Preceding NBs

E.g. DDR-200/266 E.g. HI 1.5 E.g. DDR-200/266

Preceding ICH HI 1.5 266 MB/s

Figure 2.8: Previous Pentium 4 MP based MP server platform (for single core processors)

© Sima Dezső, ÓE NIK 38 www.tankonyvtar.hu


2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (11)

Example: Block diagram of an MP server board that is based on Pentium 4


(Willamette MP) single core 32-bit Xeon MP processors (called Foster)

• The memory is placed on an extra card. There are 4 memory controllers,
  each supporting 4 DIMMs (DDR-266/200).

• The chipset (CMIC/CSB5) is ServerWorks' Grand Champion HE Classic chipset.

Figure 2.9: Block diagram of an MP server board [6]

© Sima Dezső, ÓE NIK 39 www.tankonyvtar.hu


2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (12)

Evolution from the first generation MP servers supporting SC processors to the
90 nm Pentium 4 Prescott MP based Truland MP server platform (supporting up to 2 cores)

• Previous Pentium 4 MP based MP server platform (for single core processors):
  four SC Xeon MP processors share a single FSB to the preceding NBs; the memory
  (e.g. DDR-200/266) is attached directly to the NB; the preceding ICH is attached
  via HI 1.5 (266 MB/s).

• 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C):
  four Xeon MP / Xeon 7000 / Xeon 7100 (Potomac 1C / Paxville MP 2x1C / Tulsa 2C)
  processors, two per FSB, connect to the E8500/8501 MCH; four XMBs provide the
  memory interface (DDR-266/333, DDR2-400); the ICH5 is attached via HI 1.5.

© Sima Dezső, ÓE NIK 40 www.tankonyvtar.hu


2.1.1.3 The Core 2 based Caneland MP server platform (1)

2.1.1.3 The Core 2 based Caneland MP server platform

© Sima Dezső, ÓE NIK 41 www.tankonyvtar.hu


2.1.1.3 The Core 2 based Caneland MP server platform (2)
MP platform: Caneland (9/2007)

MP cores:
• Xeon 7200 (Tigerton DC), 9/2007: 1x2C, 65 nm/2x291 mtrs, 2x4 MB L2, no L3,
  1066 MT/s FSB, mPGA 604 (Core 2 based/65 nm)
• Xeon 7300 (Tigerton QC), 9/2007: 2x2C, 65 nm/2x291 mtrs, 2x(4/3/2) MB L2, no L3,
  1066 MT/s FSB, mPGA 604 (Core 2 based/65 nm)
• Xeon 7400 (Dunnington), 9/2008: 6C, 45 nm/1900 mtrs, 9/6 MB L2, 16/12/8 MB L3,
  1066 MT/s FSB, mPGA 604 (Penryn based/45 nm)

MCH: 7300 (Clarksboro), 9/2007: 4xFSB at 1066 MT/s, ESI, 4 x FBDIMM channels
(DDR2-533/667, 8 DIMMs/channel), up to 512 GB

ICH: 631xESB/632xESB (5/2006)
© Sima Dezső, ÓE NIK www.tankonyvtar.hu
2.1.1.3 The Core 2 based Caneland MP server platform (3)

Basic system architecture of the Core 2 based Caneland MP server platform

• 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C):
  Xeon MP / Xeon 7000 / Xeon 7100 (Potomac 1C / Paxville MP 2x1C / Tulsa 2C) processors,
  two per FSB, connect to the E8500(1)/8501 MCH; four XMBs provide the memory interface
  (DDR-266/333, DDR2-400); the ICH5 is attached via HI 1.5.

• Core 2 based Caneland MP server platform (for up to 6 C):
  Xeon 7200 / 7300 / 7400 (Tigerton DC 1x2C / Tigerton QC 2x2C / Dunnington 6C) processors
  (Core 2 2C/4C or Penryn 6C based), each with its own FSB to the 7300 MCH; four FBDIMM
  channels (DDR2-533/667, up to 8 DIMMs/channel); the 631xESB/632xESB ICH is attached
  via ESI.

HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate
ESI (Enterprise System Interface): 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface,
providing 1 GB/s transfer rate in each direction)

1 The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac)
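The peak rates quoted for HI 1.5 and ESI follow directly from their width, clock and data
rate; the short C sketch below merely redoes that arithmetic with the figures given above
(the 66 MHz hub clock is nominally 66.6 MHz).

#include <stdio.h>

int main(void)
{
    /* HI 1.5: 8 bits (1 byte) wide, nominally 66.6 MHz clock, quad data rate (QDR) */
    double hi15_bps = 1.0 * 66.6e6 * 4.0;                 /* bytes per second */
    printf("HI 1.5: %.0f MB/s peak\n", hi15_bps / 1e6);   /* ~266 MB/s        */

    /* ESI: 4 PCIe lanes, 0.25 GB/s per lane and direction */
    double esi_bps = 4.0 * 0.25e9;
    printf("ESI   : %.1f GB/s per direction\n", esi_bps / 1e9);  /* 1 GB/s    */
    return 0;
}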
© Sima Dezső, ÓE NIK 43 www.tankonyvtar.hu
2.1.1.3 The Core 2 based Caneland MP server platform (4)

Example 1: Intel’s Nehalem-EP based Tylersburg-EP DP server platform with a single IOH

Xeon
7200 (Tigerton DC, Core 2), 2C
7300 (Tigerton QC, Core 2), QC

FB-DIMM
4 channels
8 DIMMs/channel
up to 512 GB

Figure 2.10: Intel’s 7300 chipset based Caneland platform


for the Xeon 7200/7300 DC/QC processors (9/2007) [7]
© Sima Dezső, ÓE NIK 44 www.tankonyvtar.hu
2.1.1.3 The Core 2 based Caneland MP server platform (5)

Example 3: Caneland MP serverboard

FB-DIMM Xeon
DDR2
7200 DC
192 GB
7300 QC
(Tigerton)

ATI ES1000 graphics with 32 MB video memory
7300 NB

SBE2 SB

Figure 2.11: Caneland MP Supermicro serverboard, with the 7300 (Clarksboro) chipset
for the Xeon 7200/7300 DC/QC MP processor families [4]
© Sima Dezső, ÓE NIK 45 www.tankonyvtar.hu
2.1.1.3 The Core 2 based Caneland MP server platform (6)

Figure 2.12: Performance comparison of the Caneland platform with a quad core Xeon (7300 family)
vs the Bensley platform with a dual core Xeon 7140M [8]
© Sima Dezső, ÓE NIK 46 www.tankonyvtar.hu
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (1)

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform

© Sima Dezső, ÓE NIK 47 www.tankonyvtar.hu


2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (2)
MP platform: Boxboro-EX (3/2010)

MP cores:
• Xeon 7500 (Nehalem-EX/Becton), 3/2010: 8C, 45 nm/2300 mtrs/513 mm2, ¼ MB L2/C,
  24 MB L3, 4 QPI links, 4 SMI links, 2 mem. channels/link, 2 DIMMs/mem. channel,
  DDR3-1067 MT/s, up to 1 TB (64x16 GB), LGA1567
• Xeon E7-4800 (Westmere-EX), 4/2011: 10C, 32 nm/2600 mtrs/584 mm2, ¼ MB L2/C,
  30 MB L3, 4 QPI links, 4 SMI links, 2 mem. channels/link, 2 DIMMs/mem. channel,
  DDR3-1333 MT/s, up to 1 TB (64x16 GB), LGA1567

IOH: 7500 (Boxboro), 3/2010: 2 QPI links, 32x PCIe 2.0 lanes (0.5 GB/s/lane/direction),
ESI (1 GB/s/direction)

ICH: ICH10 (6/2008)

© Sima Dezső, ÓE NIK 48 www.tankonyvtar.hu
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (3)

The 8 core Nehalem-EX (Xeon 7500/Beckton) MP server processor

2 cores

Figure 2.13: The 8 core Nehalem-EX (Xeon 7500/Beckton) MP server processor [9]
© Sima Dezső, ÓE NIK 49 www.tankonyvtar.hu
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (4)

The 10 core Westmere-EX (Xeon E7-4800) MP server processor [10]

© Sima Dezső, ÓE NIK 50 www.tankonyvtar.hu


2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (5)

Block diagram of the Westmere-EX (E7-8800/4800/2800) processors [11]

E7-8800: for 8 P systems


E7-4800: for MP systems
E7-2800: for DP systems

© Sima Dezső, ÓE NIK 51 www.tankonyvtar.hu


2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (6)

Main platform features introduced in the 7500 Boxboro IOH (1)


Along with their Nehalem-EX based Boxboro platform Intel continued their move to
increase system security and manageability by introducing platform features
otherwise provided by their continuously enhanced vPro technology for enterprise oriented
desktops since 2006 and DP servers since 2007.
The platform features introduced in the 7500 IOH are basically the same as described for
the Tylersburg-EP DP platform, which is based on the 5500 IOH, akin to the 7500 IOH of the
Boxboro-EX platform.
They include:
a) Intel Management Engine (ME)
b) Intel Virtualization Technology for Directed I/O (VT-d2)
   VT-d2 is an upgraded version of VT-d.
c) Intel Trusted Execution Technology (TXT).

© Sima Dezső, ÓE NIK 52 www.tankonyvtar.hu


2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (7)

Basic system architecture of the Nehalem-EX based Boxboro-EX MP server platform
(assuming 1 IOH)

Processors: Xeon 7500 (Nehalem-EX/Becton) 8C or Xeon E7-4800 (Westmere-EX) 10C

The four processors are fully interconnected by QPI links; two of them connect via further
QPI links to the 7500 IOH. Each processor drives 2x4 SMI channels to SMBs, which provide
the DDR3-1067 memory interface. The ICH10 is attached to the IOH via ESI.

SMI: Serial link between the processor and the SMB
SMB: Scalable Memory Buffer (parallel/serial conversion)
ME: Management Engine (integrated into the 7500 IOH)

Nehalem-EX based Boxboro-EX MP server platform (for up to 10 C)

© Sima Dezső, ÓE NIK 53 www.tankonyvtar.hu


2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (8)

Wide range of scalability of the 7500/6500 IOH based Boxboro-EX platform [12]

© Sima Dezső, ÓE NIK 54 www.tankonyvtar.hu


2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (9)

Example: Block diagram of a 7500 chipset based Boxboro-EX MP serverboard [13]

ESI

© Sima Dezső, ÓE NIK 55 www.tankonyvtar.hu


2.1.1.5 Evolution of MP server platforms (1)

2.1.1.5 Evolution of MP server platforms

© Sima Dezső, ÓE NIK 56 www.tankonyvtar.hu


2.1.1.5 Evolution of MP server platforms (2)

Evolution from the first generation MP servers supporting SC processors to the
90 nm Pentium 4 Prescott MP based Truland MP server platform (supporting up to 2 cores)

• Previous Pentium 4 MP based MP server platform (for single core processors):
  four SC Xeon MP processors share a single FSB to the preceding NBs; the memory
  (e.g. DDR-200/266) is attached directly to the NB; the preceding ICH is attached
  via HI 1.5 (266 MB/s).

• 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C):
  four Xeon MP / Xeon 7000 / Xeon 7100 (Potomac 1C / Paxville MP 2x1C / Tulsa 2C)
  processors, two per FSB, connect to the E8500/8501 MCH; four XMBs provide the
  memory interface (DDR-266/333, DDR2-400); the ICH5 is attached via HI 1.5.

© Sima Dezső, ÓE NIK 57 www.tankonyvtar.hu


2.1.1.5 Evolution of MP server platforms (3)

Evolution from the 90 nm Pentium 4 Prescott MP based Truland MP platform (up to 2 cores)
to the Core 2 based Caneland MP platform (up to 6 cores)

• 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C):
  Xeon MP / Xeon 7000 / Xeon 7100 (Potomac 1C / Paxville MP 2x1C / Tulsa 2C) processors,
  two per FSB, connect to the E8500(1)/8501 MCH; four XMBs provide the memory interface
  (DDR-266/333, DDR2-400); the ICH5 is attached via HI 1.5.

• Core 2 based Caneland MP server platform (for up to 6 C):
  Xeon 7200 / 7300 / 7400 (Tigerton DC 1x2C / Tigerton QC 2x2C / Dunnington 6C) processors
  (Core 2 2C/4C or Penryn 6C based), each with its own FSB to the 7300 MCH; four FBDIMM
  channels (DDR2-533/667, up to 8 DIMMs/channel); the 631xESB/632xESB ICH is attached
  via ESI.

HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate
ESI (Enterprise System Interface): 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface,
providing 1 GB/s transfer rate in each direction)

1 The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac)
© Sima Dezső, ÓE NIK 58 www.tankonyvtar.hu
2.1.1.5 Evolution of MP server platforms (4)

Evolution to the Nehalem-EX based Boxboro-EX MP platform (that supports up to 10 cores)
(In the basic system architecture we show the single IOH alternative.)

Processors: Xeon 7500 (Nehalem-EX/Becton) 8C or Xeon E7-4800 (Westmere-EX) 10C

The four processors are fully interconnected by QPI links; two of them connect via further
QPI links to the 7500 IOH. Each processor drives 2x4 SMI channels to SMBs, which provide
the DDR3-1067 memory interface. The ICH10 is attached to the IOH via ESI.

SMI: Serial link between the processor and the SMBs
SMB: Scalable Memory Buffer (parallel/serial converter)
ME: Management Engine (integrated into the 7500 IOH)

Nehalem-EX based Boxboro-EX MP server platform (for up to 10 C)


© Sima Dezső, ÓE NIK 59 www.tankonyvtar.hu
2.2 Many-core processors

• 2.2.1 Intel’s Larrabee

• 2.2.2 Intel’s Tile processor


• 2.2.3 Intel’s SCC

© Sima Dezső, ÓE NIK 60 www.tankonyvtar.hu


2.2 Manycore processors (1)

Multicore processors

• Homogeneous multicores
    - Conventional multicores (2 ≤ n ≤ 8 cores): mobiles, desktops, servers;
      general purpose computing
    - Manycore processors (> 8 cores): MPC;
      prototypes/experimental systems
• Heterogeneous multicores
    - Master/slave type multicores (CPU): MM/3D/HPC, production stage
    - Add-on type multicores (GPU): HPC

Figure 2.1: Major classes of multicore processors

© Sima Dezső, ÓE NIK 61 www.tankonyvtar.hu
2.2.1 Intel’s Larrabee

© Sima Dezső, ÓE NIK 62 www.tankonyvtar.hu


2.2.1 Intel’s Larrabee (1)

2.2.1 Larrabee

Part of Intel’s Tera-Scale Initiative.

• Objectives:
High end graphics processing, HPC
Not a single product but a base architecture for a number of different products.

• Brief history:
Project started ~ 2005
First unofficial public presentation: 03/2006 (withdrawn)
First official public presentation: 08/2008 (SIGGRAPH)
Due in ~ 2009
• Performance (targeted):
2 TFlops

© Sima Dezső, ÓE NIK 63 www.tankonyvtar.hu


2.2.1 Intel’s Larrabee (2)

Basic architecture

Figure 2.14: Block diagram of a GPU-oriented Larrabee (2006, outdated) [41]

Update: SIMD processing width: SIMD-64 rather than SIMD-16

© Sima Dezső, ÓE NIK 64 www.tankonyvtar.hu


2.2.1 Intel’s Larrabee (3)

Figure 2.15: Board layout of a GPU-oriented Larrabee (2006, outdated) [42]

© Sima Dezső, ÓE NIK 65 www.tankonyvtar.hu


2.2.1 Intel’s Larrabee (4)

Figure 2.16: Four socket MP server design with 24-core Larrabees connected by the CSI bus [41]
© Sima Dezső, ÓE NIK 66 www.tankonyvtar.hu
2.2.2 Intel’s Tile processor

© Sima Dezső, ÓE NIK 67 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (1)

2.2.2 Tile processor

• First incarnation of Intel’s Tera-Scale Initiative


(more than 100 projects underway)
• Objective: Tera-Scale experimental chip
• Brief history:
Announced at IDF 9/2006
Due in 2009/2010

© Sima Dezső, ÓE NIK 68 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (2)

Figure 2.17: A forerunner: The Raw processor (MIT 2002)


(16 tiles, each tile has a compute element, router, instruction and data memory) [43]

© Sima Dezső, ÓE NIK 69 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (3)

Bisection bandwidth:
If the network is segmented into two equal parts,
this is the bandwidth between the two parts

Figure 2.18: Die photo and chip details of the Tile processor [14]
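To illustrate the definition above: for an R x C 2D mesh, a balanced cut severs at least
min(R, C) links (cutting perpendicular to the longer side), so the bisection bandwidth is that
many links times the per-link bandwidth. The sketch below uses the 80 tiles arranged as a
10 x 8 mesh; the per-link rate is an assumed illustrative value, not the figure reported for
the Tile processor.

#include <stdio.h>

/* Bisection bandwidth of an R x C 2D mesh: the minimal balanced cut severs min(R, C) links. */
static double mesh_bisection_bw(int rows, int cols, double link_bw_gbs)
{
    int cut_links = (rows < cols) ? rows : cols;
    return cut_links * link_bw_gbs;
}

int main(void)
{
    double bw = mesh_bisection_bw(10, 8, 16.0);   /* 16 GB/s per link: assumption */
    printf("Bisection bandwidth: %.0f GB/s (8 links cut)\n", bw);
    return 0;
}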

© Sima Dezső, ÓE NIK 70 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (4)

Figure 2.19: Main blocks of a tile [14]

© Sima Dezső, ÓE NIK 71 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (5)

(Clocks run with the same frequency
but unknown phases)

FP Multiply-Accumulate
(AxB+C)

Figure 2.20: Block diagram of a tile [14]

© Sima Dezső, ÓE NIK 72 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (6)

Figure 2.21: On board implementation of the 80-core Tile Processor [15]

© Sima Dezső, ÓE NIK 73 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (7)

Figure 2.22: Performance and dissipation figures of the Tile-processor [15]


© Sima Dezső, ÓE NIK 74 www.tankonyvtar.hu
2.2.2 Intel’s Tile processor (8)

Performance at 4 GHz:
Peak SP FP: up to 1.28 TFlops (2 FPMA units x 2 FLOPs/cycle each x 80 tiles x 4 GHz = 1.28 TFlops)
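The peak figure above is simply units x FLOPs per unit per cycle x clock; the minimal C
sketch below redoes that multiplication, and the same pattern reproduces e.g. the Cell QS21
peak number quoted later in this material.

#include <stdio.h>

/* Peak GFLOPS = number of FP units x FLOPs per unit per cycle x clock in GHz */
static double peak_gflops(int units, int flops_per_cycle, double clock_ghz)
{
    return units * flops_per_cycle * clock_ghz;
}

int main(void)
{
    /* Tile processor: 80 tiles x 2 FPMA units, each 2 FLOPs (multiply + add)/cycle, at 4 GHz */
    printf("Tile processor: %.2f GFLOPS\n", peak_gflops(80 * 2, 2, 4.0));    /* 1280.00 */

    /* Cell QS21 blade: 2 Cell BE x 8 SPEs, each SPE 2x4 SP FLOPs/cycle, at 3.2 GHz */
    printf("Cell QS21     : %.1f GFLOPS\n", peak_gflops(2 * 8, 2 * 4, 3.2)); /* 409.6   */
    return 0;
}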

© Sima Dezső, ÓE NIK 75 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (9)

Figure 2.23: Programmer’s perspective of the Tile processor [14]

© Sima Dezső, ÓE NIK 76 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (10)

Figure 2.24: The full instruction set of the Tile processor [14]

© Sima Dezső, ÓE NIK 77 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (11)

VLIW

Figure 2.25: Instruction word and latencies of the Tile processor [14]

© Sima Dezső, ÓE NIK 78 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (12)

Figure 2.26: Performance of the Tile processor – the workloads [14]

© Sima Dezső, ÓE NIK 79 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (13)

Figure 2.27: Instruction word and latencies of the Tile processor [14]

© Sima Dezső, ÓE NIK 80 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (14)

Figure 2.28: The significance of the Tile processor [14]

© Sima Dezső, ÓE NIK 81 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (15)

Figure 2.29: Lessons learned from the Tile processor (1) [14]

© Sima Dezső, ÓE NIK 82 www.tankonyvtar.hu


2.2.2 Intel’s Tile processor (16)

Figure 2.30: Lessons learned from the Tile processor (2) [14]

© Sima Dezső, ÓE NIK 83 www.tankonyvtar.hu


2.2.3 Intel’s SCC

© Sima Dezső, ÓE NIK 84 www.tankonyvtar.hu


2.2.3 Intel’s SCC (1)

2.2.3 Intel’s SCC (Single-chip Cloud Computer)

• 12/2009: Announced
• 9/2010: Many-core Application Research Project (MARC) initiative started on the SCC
platform
• Designed in Braunschweig and Bangalore
• 48 core, 2D-mesh system topology, message passing

© Sima Dezső, ÓE NIK 85 www.tankonyvtar.hu


2.2.3 Intel’s SCC (2)

Figure 2.31: The SCC chip [14]

© Sima Dezső, ÓE NIK 86 www.tankonyvtar.hu


2.2.3 Intel’s SCC (3)

Figure 2.32: Hardware view of SCC [14]

© Sima Dezső, ÓE NIK 87 www.tankonyvtar.hu


2.2.3 Intel’s SCC (4)

Figure 2.33: Dual core tile of SCC [14]

© Sima Dezső, ÓE NIK 88 www.tankonyvtar.hu


2.2.3 Intel’s SCC (5)

(Joint Test Action Group)


Standard Test Access Port

Figure 2.34: SCC system overview [14]

© Sima Dezső, ÓE NIK 89 www.tankonyvtar.hu


2.2.3 Intel’s SCC (6)

Figure 2.35: Removing hardware cache coherency [16]

© Sima Dezső, ÓE NIK 90 www.tankonyvtar.hu


2.2.3 Intel’s SCC (7)

Figure 2.36: Improving energy efficiency [16]

© Sima Dezső, ÓE NIK 91 www.tankonyvtar.hu


2.2.3 Intel’s SCC (8)

Figure 2.37: A programmer’s view of SCC [14]

© Sima Dezső, ÓE NIK 92 www.tankonyvtar.hu


2.2.3 Intel’s SCC (9)

(Message Passing Buffer)

Figure 2.38: Operation of SCC [14]

© Sima Dezső, ÓE NIK 93 www.tankonyvtar.hu


3. Heterogeneous multicores

• 3.1 Master-slave type multicores

• 3.2 Add-on type multicores

© Sima Dezső, ÓE NIK 94 www.tankonyvtar.hu


3. Heterogeneous multicores (1)

Multicore processors

• Homogeneous multicores
    - Conventional multicores (2 ≤ n ≤ 8 cores): mobiles, desktops, servers;
      general purpose computing
    - Manycore processors (> 8 cores): MPC;
      prototypes/experimental systems
• Heterogeneous multicores
    - Master/slave type multicores (CPU): MM/3D/HPC, production stage
    - Add-on type multicores (GPU): HPC

Figure 3.1: Major classes of multicore processors

© Sima Dezső, ÓE NIK 95 www.tankonyvtar.hu
3.1 Master-slave type multicores

• 3.1.1 The Cell processor

© Sima Dezső, ÓE NIK 96 www.tankonyvtar.hu


3.1 Master-slave type multicores (1)

Multicore processors

• Homogeneous multicores
    - Conventional MC processors (2 ≤ n ≤ 8 cores): desktops, servers;
      general purpose computing
    - Manycore processors (> 8 cores): MPC;
      prototypes/experimental systems
• Heterogeneous multicores
    - Master/slave architectures (CPU): MM/3D/HPC, production stage
    - Add-on architectures (GPU): HPC, near future

Figure 3.1: Major classes of multicore processors

© Sima Dezső, ÓE NIK 97 www.tankonyvtar.hu
3.1.1 The Cell Processor (1)

3.1.1 The Cell Processor

• Designated also as the Cell BE (Broadband Engine)

• Collaborative effort from Sony, IBM and Toshiba


• Objective: Game/multimedia, HPC apps.

Playstation 3 (PS3) QS2x Blade Server family


(2 Cell BE/blade)
• Brief history:
Summer 2000: High level architectural discussions
02/2006: Cell Blade QS20
08/ 2007 Cell Blade QS21
05/ 2008 Cell Blade QS22

© Sima Dezső, ÓE NIK 98 www.tankonyvtar.hu


3.1.1 The Cell Processor (2)

SPE: Synergistic Processing Element


SPU: Synergistic Processor Unit
SXU: Synergistic Execution Unit
LS: Local Store of 256 KB
SMF: Synergistic Mem. Flow Unit

EIB: Element Interface Bus

PPE: Power Processing Element


PPU: Power Processing Unit
PXU: POWER Execution Unit
MIC: Memory Interface Contr.
BIC: Bus Interface Contr.

XDR: Rambus DRAM

Figure 3.2: Block diagram of the Cell BE [44]

© Sima Dezső, ÓE NIK 99 www.tankonyvtar.hu


3.1.1 The Cell Processor (3)

Figure 3.3: Die shot of the Cell BE (221mm2, 234 mtrs) [44]
© Sima Dezső, ÓE NIK 100 www.tankonyvtar.hu
3.1.1 The Cell Processor (4)

Figure 3.4: Die shot of the Cell BE – PPE [44]


© Sima Dezső, ÓE NIK 101 www.tankonyvtar.hu
3.1.1 The Cell Processor (5)

Figure 3.5: Block diagram of the PPE [44]


© Sima Dezső, ÓE NIK 102 www.tankonyvtar.hu
3.1.1 The Cell Processor (6)

Figure 3.6: Die shot of the Cell BE – SPEs [44]

© Sima Dezső, ÓE NIK 103 www.tankonyvtar.hu
3.1.1 The Cell Processor (7)

Figure 3.7: Block diagram of the SPE [44]

© Sima Dezső, ÓE NIK 104 www.tankonyvtar.hu


3.1.1 The Cell Processor (8)

Figure 3.8: Die shot of the Cell BE – Memory interface [44]


© Sima Dezső, ÓE NIK 105 www.tankonyvtar.hu
3.1.1 The Cell Processor (9)

Figure 3.9: Die shot of the Cell BE – I/O interface [44]


© Sima Dezső, ÓE NIK 106 www.tankonyvtar.hu
3.1.1 The Cell Processor (10)

Figure 3.10: Die shot of the Cell BE – EIB [44]


© Sima Dezső, ÓE NIK 107 www.tankonyvtar.hu
3.1.1 The Cell Processor (11)

Figure 3.11: Principle of the operation of the EIB [44]


© Sima Dezső, ÓE NIK 108 www.tankonyvtar.hu
3.1.1 The Cell Processor (12)

Figure 3.12: Concurrent transactions of the EIB [44]


© Sima Dezső, ÓE NIK 109 www.tankonyvtar.hu
3.1.1 The Cell Processor (13)

• Performance @ 3.2 GHz:

  QS21 Peak SP FP: 409.6 GFlops (3.2 GHz x 2x8 SPE x 2x4 SP FP/cycle)

• Cell BE - NIK
  2007: Faculty Award (Cell 3D app./Teaching)
  2008: IBM - NIK Research Agreement and Cooperation: Performance investigations
  • IBM Böblingen Lab
  • IBM Austin Lab

© Sima Dezső, ÓE NIK 110 www.tankonyvtar.hu


3.1.1 The Cell Processor (14)

Picture 3.1: IBM's Roadrunner (Los Alamos, 2008) [45]

© Sima Dezső, ÓE NIK 111 www.tankonyvtar.hu
3.1.1 The Cell Processor (15)

The Roadrunner

6/2008 : International Supercomputing Conference, Dresden

Top 500 supercomputers

1. Roadrunner: 1 Petaflops (10^15 Flops) sustained Linpack

© Sima Dezső, ÓE NIK 112 www.tankonyvtar.hu


3.1.1 The Cell Processor (16)

Figure 3.13: Key features of Roadrunner [44]

© Sima Dezső, ÓE NIK 113 www.tankonyvtar.hu


3.2 Add-on type multicores

• 3.2.1 GPGPUs

© Sima Dezső, ÓE NIK 114 www.tankonyvtar.hu


3.2 Add-on type multicores (1)

Multicore processors

• Homogeneous multicores
    - Conventional multicores (2 ≤ n ≤ 8 cores): mobiles, desktops, servers;
      general purpose computing
    - Manycore processors (> 8 cores): MPC;
      prototypes/experimental systems
• Heterogeneous multicores
    - Master/slave type multicores (CPU): MM/3D/HPC, production stage
    - Add-on type multicores (GPU): HPC

Figure 3.1: Major classes of multicore processors

© Sima Dezső, ÓE NIK 115 www.tankonyvtar.hu
3.2.1 GPGPUs

• 3.2.1.1 Introduction to GPGPUs

• 3.2.1.2 Overview of GPGPUs

• 3.2.1.3 Example 1: Nvidia’s Fermi GF100

• 3.2.1.4 Example 2: Intel’s on-die integrated


CPU/GPU lines – Sandy Bridge

© Sima Dezső, ÓE NIK 116 www.tankonyvtar.hu


3.2.1.1 Introduction to GPGPUs (1)

3.2.1.1 Introduction to GPGPUs

Unified shader model of GPUs (introduced in the SM 4.0 of DirectX 10.0)

Unified, programmable shader architecture

The same (programmable) processor can be used to implement all shaders:

• the vertex shader,
• the pixel shader and
• the geometry shader (a new feature of SM 4.0)

© Sima Dezső, ÓE NIK 117 www.tankonyvtar.hu


3.2.1.1 Introduction to GPGPUs (2)

Based on its FP32 computing capability and the large number of FP-units available

the unified shader is a prospective candidate for speeding up HPC!

GPUs with unified shader architectures are also termed

GPGPUs
(General Purpose GPUs)

or

cGPUs
(computational GPUs)

© Sima Dezső, ÓE NIK 118 www.tankonyvtar.hu


3.2.1.1 Introduction to GPGPUs (3)

Peak FP32/FP64 performance of Nvidia’s GPUs vs Intel’ P4 and Core2 processors [17]

© Sima Dezső, ÓE NIK 119 www.tankonyvtar.hu


3.2.1.1 Introduction to GPGPUs (4)

Peak FP32 performance of AMD’s GPGPUs [18]

© Sima Dezső, ÓE NIK 120 www.tankonyvtar.hu


3.2.1.1 Introduction to GPGPUs (5)

Evolution of the FP-32 performance of GPGPUs [19]

© Sima Dezső, ÓE NIK 121 www.tankonyvtar.hu


3.2.1.1 Introduction to GPGPUs (6)

Evolution of the bandwidth of Nvidia’s GPU’s vs Intel’s P4 and Core2 processors [20]

© Sima Dezső, ÓE NIK 122 www.tankonyvtar.hu


3.2.1.1 Introduction to GPGPUs (7)

Figure 3.14: Contrasting the utilization of the silicon area in CPUs and GPUs [21]

• Less area for control, since GPGPUs have simplified control (the same instruction for
all ALUs)
• Less area for caches, since GPGPUs support massive multithreading to hide the
latency of long operations, such as memory accesses in case of cache misses.

© Sima Dezső, ÓE NIK 123 www.tankonyvtar.hu


3.2.1.2 Overview of GPGPUs (1)

3.2.1.2 Overview of GPGPUs

Basic implementation alternatives of the SIMT execution

• GPGPUs: programmable GPUs with appropriate programming environments;
  have display outputs.
  E.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx, HD 48xx lines
• Data parallel accelerators (DPAs): dedicated units supporting data parallel execution
  with an appropriate programming environment; no display outputs; have larger
  memories than GPGPUs.
  E.g. Nvidia's Tesla lines, AMD's FireStream lines

Figure 3.15: Basic implementation alternatives of the SIMT execution


© Sima Dezső, ÓE NIK 124 www.tankonyvtar.hu
3.2.1.2 Overview of GPGPUs (2)

GPGPUs

• Nvidia's line:  G80 (90 nm) → G92 (65 nm, shrink) → G200 (65 nm, enhanced arch.)
                  → GF100 "Fermi" (40 nm, enhanced arch.)
• AMD/ATI's line: R600 (80 nm) → RV670 (55 nm, shrink) → RV770 (55 nm, enhanced arch.)
                  → RV870 (40 nm, shrink) → Cayman (40 nm, enhanced arch.)

Figure 3.16: Overview of Nvidia's and AMD/ATI's GPGPU lines

© Sima Dezső, ÓE NIK 125 www.tankonyvtar.hu


3.2.1.2 Overview of GPGPUs (3)

NVidia
• Cores: G80 (11/06, 90 nm/681 mtrs), G92 (10/07, 65 nm/754 mtrs), GT200 (6/08, 65 nm/1400 mtrs)
• Cards: 8800 GTS (96 ALUs, 320-bit), 8800 GTX (128 ALUs, 384-bit), 8800 GT (112 ALUs, 256-bit),
  GTX260 (192 ALUs, 448-bit), GTX280 (240 ALUs, 512-bit)
• CUDA: Version 1.0 (6/07), 1.1 (11/07), 2.0 (6/08), 2.1 (11/08); OpenCL standard (12/08)

AMD/ATI
• Cores: R500 (11/05), R600 (5/07, 80 nm/681 mtrs), R670 (11/07, 55 nm/666 mtrs),
  RV770 (5/08, 55 nm/956 mtrs)
• Cards: (Xbox, 48 ALUs), HD 2900XT (320 ALUs, 512-bit), HD 3850/3870 (320 ALUs, 256-bit),
  HD 4850/4870 (800 ALUs, 256-bit)
• Brook+: Brook+ (SDK v.1.0, 11/07), Brook+ 1.2 (SDK v.1.2, 9/08), Brook+ 1.3 (SDK v.1.3, 12/08);
  RapidMind 3870 support (6/08); OpenCL standard (12/08)

(Timeline: 2005-2008)

Figure 3.17: Overview of GPGPUs and their basic software support (1)
© Sima Dezső, ÓE NIK 126 www.tankonyvtar.hu
3.2.1.2 Overview of GPGPUs (4)

NVidia
• Cores: GF100 (Fermi) (3/10, 40 nm/3000 mtrs), GF104 (Fermi) (7/10, 40 nm/1950 mtrs),
  GF110 (Fermi) (11/10, 40 nm/3000 mtrs)
• Cards: GTX 470 (448 ALUs, 320-bit), GTX 480 (480 ALUs, 384-bit), GTX 460 (336 ALUs, 192/256-bit),
  GTX 580 (512 ALUs, 384-bit), GTX 560 Ti (1/11, 480 ALUs, 384-bit)
• OpenCL: 1.0 early release (SDK 1.0, 6/09), 1.0 (SDK 1.0, 10/09), 1.1 (SDK 1.1, 6/10)
• CUDA: Version 2.2 (5/09), 2.3 (6/09), 3.0 (3/10), 3.1 (6/10), 3.2 (1/11), 4.0 Beta (3/11)

AMD/ATI
• Cores: RV870 (Cypress) (9/09, 40 nm/2100 mtrs), Barts Pro/XT (10/10, 40 nm/1700 mtrs),
  Cayman Pro/XT (12/10, 40 nm/2640 mtrs)
• Cards: HD 5850/70 (1440/1600 ALUs, 256-bit), HD 6850/70 (960/1120 ALUs, 256-bit),
  HD 6950/70 (1408/1536 ALUs, 256-bit)
• OpenCL: 1.0 (SDK v.2.0, 11/09), 1.0 (SDK v.2.01, 3/10), 1.1 (SDK v.2.2, 8/10)
• Brook+: 1.4 (SDK v.1.4 Beta, 3/09); RapidMind: Intel bought RapidMind (8/09)

(Timeline: 2009-2011)

Figure 3.18: Overview of GPGPUs and their basic software support (2)
© Sima Dezső, ÓE NIK 127 www.tankonyvtar.hu
3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (1)

3.2.1.3 Example 1: Nvidia’s Fermi GF 100

Announced: Sept. 30, 2009 at Nvidia's GPU Technology Conference; available: 1Q 2010 [22]

© Sima Dezső, ÓE NIK 128 www.tankonyvtar.hu


3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (2)

Sub-families of Fermi
Fermi includes three sub-families with the following representative cores and features:

GPGPU   Available   Max. no.    Max. no.   No. of        Compute      Aimed at
        since       of cores    of ALUs    transistors   capability
GF100   3/2010      16(1)       512(1)     3200 mtrs     2.0          Gen. purpose
GF104   7/2010      8           384        1950 mtrs     2.1          Graphics
GF110   11/2010     16          512        3000 mtrs     2.0          Gen. purpose

1 In the associated flagship card (GTX 480), however, one of the SMs has been disabled due
to overheating problems, so it actually has only 15 SIMD cores, called Streaming
Multiprocessors (SMs) by Nvidia, and 480 FP32 EUs [69]

© Sima Dezső, ÓE NIK 129 www.tankonyvtar.hu


3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (3)

Overall structure of Fermi GF100 [22], [23]

NVidia: 16 cores
(Streaming Multiprocessors)
(SMs)

Each core: 32 ALUs


512 ALUs

Remark
In the associated flagship card
(GTX 480) however,
one SM has been disabled,
due to overheating problems,
so it has actually
15 SMs and 480 ALUs [a]

6x Dual Channel GDDR5


(6x 64 = 384 bit)

© Sima Dezső, ÓE NIK 130 www.tankonyvtar.hu


3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (4)

High level microarchitecture of Fermi GF100

Figure 3.19: Fermi's system architecture [24]


© Sima Dezső, ÓE NIK www.tankonyvtar.hu
3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (5)

Evolution of the high level microarchitecture of Nvidia's GPGPUs [24]

Fermi GF100

Note

The high level microarchitecture of Fermi evolved from a graphics oriented structure
to a computation oriented one, complemented with the units needed for graphics processing.

© Sima Dezső, ÓE NIK 132 www.tankonyvtar.hu


3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (6)

Layout of a Cuda GF100 core (SM) [25]


(SM: Streaming Multiprocessor)

SFU: Special
Function Unit

1 SM includes 32 ALUs
(called "CUDA cores" by Nvidia)

© Sima Dezső, ÓE NIK 133 www.tankonyvtar.hu


3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (7)

Layout of a Cuda GF100 core


(SM) [25]

Special Function Units


calculate FP32
transcendental functions
(such as trigonometric
functions etc.)

1 SM includes 32 ALUs
(called "CUDA cores" by Nvidia)

© Sima Dezső, ÓE NIK 134 www.tankonyvtar.hu


3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (8)

A single ALU ("CUDA core")

SP FP: 32-bit

FP64:
• First implementation of the IEEE 754-2008 standard.
• Needs 2 clock cycles to issue the entire warp for execution.
• FP64 performance: 1/2 of the FP32 performance!! (Enabled only on Tesla devices!)

Fermi's integer units (INT units):
• are 32-bit wide,
• became stand-alone units, i.e. they are no longer merged with the MAD units as in prior
  designs.
In addition, each floating-point unit (FP unit) is now capable of producing IEEE 754-2008-
compliant double-precision (DP) FP results every 2nd clock cycle, at 1/2 of the performance
of single-precision FP calculations.

Figure 3.20: A single ALU [26]
© Sima Dezső, ÓE NIK 135 www.tankonyvtar.hu
3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (9)

Remark
The Fermi line supports the Fused Multiply-Add (FMA) operation, rather than the Multiply-Add
operation performed in previous generations.

Previous lines

Fermi

Figure 3.21: Contrasting the Multiply-Add (MAD) and the Fused-Multiply-Add (FMA) operations
[27]
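The practical difference is that FMA rounds only once, after the combined a*b + c, so it can
keep low-order bits that a separate multiply-then-add loses. A minimal host-side C sketch
using the standard fma() from <math.h>; the input value is just a convenient illustration.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* a = 1 + 2^-30; the exact product a*a = 1 + 2^-29 + 2^-60 needs 61
     * significand bits, so a separate multiply already rounds away 2^-60. */
    double a = 1.0 + ldexp(1.0, -30);

    double mad   = a * a - 1.0;       /* rounded after the multiply and after the add */
    double fused = fma(a, a, -1.0);   /* single rounding of the exact a*a - 1         */

    printf("multiply-add: %.20e\n", mad);     /* 2^-29         */
    printf("fused (FMA) : %.20e\n", fused);   /* 2^-29 + 2^-60 */
    printf("difference  : %.3e\n", fused - mad);
    return 0;
}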

© Sima Dezső, ÓE NIK 136 www.tankonyvtar.hu


3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (10)

Principle of the SIMT execution in case of serial kernel execution

• Each kernel invocation issued by the host (e.g. kernel0<<<>>>(), then kernel1<<<>>>())
  lets the device execute all thread blocks (Block(i,j)) of that kernel.
• Thread blocks may be executed independently from each other.

Figure 3.22: Hierarchy of threads [28]
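A minimal CUDA sketch of this launch pattern: the host invokes a kernel over a grid of thread
blocks, the blocks are scheduled independently on the available SMs, and kernels launched on
the same (default) stream execute serially. The kernel body and the sizes are illustrative
assumptions, not code from the cited source.

#include <cstdio>

// Each thread block processes its own slice of the data, independently of the others.
__global__ void kernel0(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= 2.0f;                             // some per-element work
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    dim3 block(256);                                 // threads per block (illustrative)
    dim3 grid((n + block.x - 1) / block.x);          // enough blocks to cover n elements

    kernel0<<<grid, block>>>(d_data, n);             // 1st kernel invocation: all its blocks ...
    kernel0<<<grid, block>>>(d_data, n);             // ... complete before the 2nd one starts

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}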
© Sima Dezső, ÓE NIK 137 www.tankonyvtar.hu
3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (11)

Nvidia GeForce GTX 480 and 580 cards [29]

GTX 480 GTX 580


(GF 100 based) (GF 110 based)

© Sima Dezső, ÓE NIK 138 www.tankonyvtar.hu


3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (12)

A pair of GeForce GTX 480 cards [30]


(GF100 based)

© Sima Dezső, ÓE NIK 139 www.tankonyvtar.hu


3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (13)

FP64 performance increase in Nvidia's Tesla and GPGPU lines

Performance is bound by the number of available DP FP execution units.

                          G80/G92          GT200              GF100              GF110
                          (11/06, 10/07)   (06/08)            (03/10)            (11/10)
Avail. FP64 units         no FP64 support  1 FP64 unit        16 FP64 units      16 FP64 units
FP64 operations           -                Add, Mul, MAD      Add, Mul, FMA      Add, Mul, FMA
Peak FP64 load/SM         -                1 FP64 MAD         16 FP64 FMA        16 FP64 MAD
Peak FP64 perf./cycle/SM  -                1x2 ops/SM         16x2 ops/SM        16x2 ops/SM

Tesla cards
Flagship Tesla card       -                C1060              C2070
Peak FP64 perf./card      -                30x1x2x1296        14x16x2x1150
                                           = 77.76 GFLOPS     = 515.2 GFLOPS

GPGPU cards
Flagship GPGPU card       -                GT280              GTX 480(1)         GTX 580(1)
Peak FP64 perf./card      -                30x1x2x1296        15x4x2x1401        16x4x2x1544
                                           = 77.76 GFLOPS     = 168.12 GFLOPS    = 197.632 GFLOPS

1 In their GPGPU Fermi cards Nvidia activates only 4 FP64 units from the available 16
© Sima Dezső, ÓE NIK 140 www.tankonyvtar.hu
3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (1)

3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs – Sandy Bridge

Integration
to the chip

Figure 3.23: Evolution of add-on MC architectures

© Sima Dezső, ÓE NIK 141 www.tankonyvtar.hu


3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (2)

The Sandy Bridge processor [31]


• Shipped in Jan. 2011
• Provides on-die integrated CPU and GPU

© Sima Dezső, ÓE NIK 142 www.tankonyvtar.hu


3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (3)

Main features of Sandy Bridge [32]

© Sima Dezső, ÓE NIK 143 www.tankonyvtar.hu


3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (4)

Key specification data of Sandy Bridge [33]

                   Core i5    Core i5    Core i5    Core i7    Core i7
                   2400       2500       2500K      2600       2600K
Price              $184       $205       $216       $294       $317
TDP                95 W       95 W       95 W       95 W       95 W
Cores / Threads    4/4        4/4        4/4        4/8        4/8
Frequency          3.1 GHz    3.3 GHz    3.3 GHz    3.4 GHz    3.4 GHz
Max Turbo          3.4 GHz    3.7 GHz    3.7 GHz    3.8 GHz    3.8 GHz
DDR3               1333 MHz   1333 MHz   1333 MHz   1333 MHz   1333 MHz
L3 Cache           6 MB       6 MB       6 MB       8 MB       8 MB
Intel HD Graphics  2000       2000       3000       2000       3000
GPU Max freq       1100 MHz   1100 MHz   1100 MHz   1350 MHz   1350 MHz
Hyper-Threading    No         No         No         Yes        Yes
AVX Extensions     Yes        Yes        Yes        Yes        Yes
Socket             LGA 1155   LGA 1155   LGA 1155   LGA 1155   LGA 1155

© Sima Dezső, ÓE NIK 144 www.tankonyvtar.hu


3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (5)

Die photo of Sandy Bridge [34]

Die photo annotations:
4 cores, each with 256 KB L2 (9 clk), 32 KB L1D (3 clk), Hyperthreading, AES instructions,
AVX 256 bit (4 operands), VMX unrestricted guest; ~20 mm2 / core;
@ 1.0-1.4 GHz; connected to the L3 (25 clk); 256 b/cycle ring architecture; PCIe 2.0;
DDR3-1600 (25.6 GB/s)

32 nm process / ~225 mm2 die size / 85 W TDP

Figure 3.24: Die photo of Sandy Bridge [34]

© Sima Dezső, ÓE NIK 145 www.tankonyvtar.hu


3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (6)

Sandy Bridge’s integrated graphics unit [31]

© Sima Dezső, ÓE NIK 146 www.tankonyvtar.hu


3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (7)

Specification data of the HD 2000 and HD 3000 graphics [35]

© Sima Dezső, ÓE NIK 147 www.tankonyvtar.hu


3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (8)

Performance comparison: gaming (frames per second) [36]

(compared: AMD HD 5570 with 400 ALUs, Core i5/i7 2xxx Sandy Bridge, Core i5 6xx Arrandale)


© Sima Dezső, ÓE NIK 148 www.tankonyvtar.hu
4. Outlook

© Sima Dezső, ÓE NIK 149 www.tankonyvtar.hu


4. Outlook (1)

4. Outlook

Heterogeneous multicores

• Master/slave type multicores: 1(Ma):M(S), 2(Ma):M(S), M(Ma):M(S)
• Add-on type multicores:       1(CPU):1(D), M(CPU):1(D), M(CPU):M(D)

with M(Ma) = M(CPU) and M(S) = M(D)

Ma: Master    S: Slave    D: Dedicated (like GPU)    M: Many    H: Homogeneous

Figure 4.1: Expected evolution of heterogeneous multicores
© Sima Dezső, ÓE NIK www.tankonyvtar.hu
4. Outlook (2)

Master-slave type multicores require much more intricate workflow control and
synchronization than add-on type multicores.

It can be expected that add-on type multicores will dominate the future of heterogeneous
multicores.

© Sima Dezső, ÓE NIK 151 www.tankonyvtar.hu


5. References

© Sima Dezső, ÓE NIK 152 www.tankonyvtar.hu


References (1)

[1]: Gilbert J. D., Hunt S. H., Gunadi D., Srinivas G., The Tulsa Processor: A Dual Core Large
Shared-Cache Intel Xeon Processor 7000 Sequence for the MP Server Market Segment,
Aug 21 2006, http://www.hotchips.org/archives/hc18/3_Tues/HC18.S9/HC18.S9T1.pdf
[2]: Intel Server Board Set SE8500HW4, Technical Product Specification, Revision 1.0,
May 2005, ftp://download.intel.com/support/motherboards/server/sb/se8500hw4_board_
set_tpsr10.pdf
[3]: Intel® E8501 Chipset North Bridge (NB) Datasheet, May 2006,
http://www.intel.com/design/chipsets/e8501/datashts/309620.htm
[4]: Supermicro Motherboards, http://www.supermicro.com/products/motherboard/

[5]: Next-Generation AMD Opteron Processor with Direct Connect Architecture – 4P Server
Comparison, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/4P_
Server_Comparison_PID_41461.pdf
[6]: Supermicro P4QH6 / P4QH8 User’s Manual, 2002,
http://www.supermicro.com/manuals/motherboard/GC-HE/MNL-0665.pdf

[7]: Intel® 7300 Chipset Memory Controller Hub (MCH) – Datasheet, Sept. 2007,
http://www.intel.com/design/chipsets/datashts/313082.htm

[8]: Quad-Core Intel® Xeon® Processor 7300 Series Product Brief, Intel, Nov. 2007
http://download.intel.com/products/processor/xeon/7300_prodbrief.pdf
[9]: Mitchell D., Intel Nehalem-EX review, PCPro,
http://www.pcpro.co.uk/reviews/processors/357709/intel-nehalem-ex

[10]: Nagaraj D., Kottapalli S.: Westmere-EX: A 20 thread server CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf
© Sima Dezső, ÓE NIK 153 www.tankonyvtar.hu
References (2)

[11]: Intel Xeon Processor E7-8800/4800/2800 Product Families, Datasheet Vol. 1 of 2,


April 2011, http://www.intel.com/Assets/PDF/datasheet/325119.pdf

[12]: Intel Xeon Processor 7500/6500 Series, Public Gold Presentation, March 30 2010,
http://cache-www.intel.com/cd/00/00/44/64/446456_446456.pdf

[13]: Supermicro X8QB6-F / X8QBE-F User’s Manual, 2010,


http://files.siliconmechanics.com/Documentation/Rackform/iServ/R413/Mainboard/MNL
-X8QB-E-6-F.pdf

[14]: Mattson T., The Future of Many Core Computing: A tale of two processors, March 4 2010,
http://og-hpc.com/Rice2010/Slides/Mattson-OG-HPC-2010-Intel.pdf

[15]: Kirsch N., An Overview of Intel's Teraflops Research Chip, Febr. 13 2007, Legit Reviews,
http://www.legitreviews.com/article/460/1/

[16]: Rattner J., „Single-chip Cloud Computer”, Dec. 2 2009


http://www.pcper.com/reviews/Processors/Intel-Shows-48-core-x86-Processor-Single-
chip-Cloud-Computer
[17]: Nvidia CUDA C Programming Guide, Version 3.2, October 22 2010
http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_
Programming_Guide.pdf

[18]: Chu M. M., GPU Computing: Past, Present and Future with ATI Stream Technology,
AMD, March 9 2010,
http://developer.amd.com/gpu_assets/GPU%20Computing%20-%20Past%20
Present%20and%20Future%20with%20ATI%20Stream%20Technology.pdf

© Sima Dezső, ÓE NIK 154 www.tankonyvtar.hu


References (3)

[19]: Hwu W., Kirk D., Nvidia, Advanced Algorithmic Techniques for GPUs, Berkeley,
January 24-25 2011
http://iccs.lbl.gov/assets/docs/20110124/lecture1_computational_thinking_Berkeley_2011.pdf

[20]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review,
PC Perspective, June 16 2008,
http://www.pcper.com/article.php?aid=577&type=expert&pid=3

[21]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,
June 2008, Nvidia,
http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming
Guide_2.0.pdf
[22]: Next Gen CUDA GPU Architecture, Code-Named “Fermi”, Press Presentation at Nvidia’s
2009 GPU Technology Conference, (GTC), Sept. 30 2009,
http://www.nvidia.com/object/gpu_tech_conf_press_room.html

[23]: Nvidia’s Next Generation CUDATM Compute Architecture: FermiTM, Version 1.1, 2009
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_
Architecture_Whitepaper.pdf

[24]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,
http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT093009110932&
mode=print

[25]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview,
Legit Reviews, Jan 20 2010, http://www.legitreviews.com/article/1193/2/

© Sima Dezső, ÓE NIK 155 www.tankonyvtar.hu


References (4)

[26]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,
Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1

[27]: Glaskowsky P. N., Nvidia’s Fermi: The First Complete GPU Computing Architecture
Sept 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/
P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf

[28]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” Real World Technologies,
Sept. 8 2008, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242

[29]: Hoenig M., Nvidia GeForce 580 Review, HardwareCanucks, Nov. 8, 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/37789-nvidia-
geforce-gtx-580-review-5.html
[30]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors, Tech Report,
March 31 2010, http://techreport.com/articles.x/18682

[31]: Piazza T., Dr. Jiang H., Microarchitecture Codename Sandy Bridge: Processor Graphics,
Presentation ARCS002, IDF San Francisco, Sept. 2010

[32]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor
Innovations, Presentation ARCS001, IDF San Francisco Sept. 2010

[33]: Hagedoorn H. Mohammad S., Barling I. R., Core i5 2500K and Core i7 2600K review,
Jan. 3 2011,
http://www.guru3d.com/article/core-i5-2500k-and-core-i7-2600k-review/2

[34]: Intel Sandy Bridge Review, Bit-tech, Jan. 3 2011,
http://www.bit-tech.net/hardware/cpus/2011/01/03/intel-sandy-bridge-review/1

© Sima Dezső, ÓE NIK 156 www.tankonyvtar.hu


References (5)

[35]: Wikipedia: Intel GMA, 2011, http://en.wikipedia.org/wiki/Intel_GMA

[36]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core i3-2100
Tested, AnandTech, Jan. 3 2011,
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-600k-i5-2500k-
core-i3-2100-tested/11
[37]: Wall D. W.: Limits of Instruction Level Parallelism, WRL TN-15, Dec. 1990

[38]: Bhandarkar D.: „The Dawn of a New Era”, Presentation EMEA, May 11 2006.

[39]: Moore G. E., No Exponential is Forever… ISSCC 2003,
http://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf
[40]: Intel Roadmap 2006, Source Intel

[41]: Davis E.: Tera Tera Tera Presentation, 2008,
http://bt.pa.msu.edu/TM/BocaRaton2006/talks/davis.pdf
[42]: Stokes J.: Clearing up the confusion over Intel’s Larrabee, Part II
http://arstechnica.com/hardware/news/2007/06/clearing-up-the-confusion-over-intels-
larrabee-part-ii.ars

[43]: Taylor M. B. & all: Evaluation of the Raw Microprocessor, Proc. ISCA 2004
http://groups.csail.mit.edu/cag/raw/documents/raw_isca_2004.pdf

[44]: Wright C., Henning P.: Roadrunner Tutorial, An Introduction to Roadrunner and the Cell
Processor, Febr. 7 2008,
http://ebookbrowse.com/roadrunner-tutorial-session-1-web1-pdf-d34334105

© Sima Dezső, ÓE NIK 157 www.tankonyvtar.hu


References (6)

[45]: Seguin S.: IBM Roadrunner Beats Cray’s Jaguar, Tom’s Hardware, Nov. 18 2008
http://www.tomshardware.com/news/IBM-Roadrunner-Top500-Supercomputer,6610.html

© Sima Dezső, ÓE NIK 158 www.tankonyvtar.hu


Evolution of Intel’s
Basic Microarchitectures

Dezső Sima

© Sima Dezső, ÓE NIK 159 www.tankonyvtar.hu


Contents

• 1. Introduction

• 2. Core 2

• 3. Penryn

• 4. Nehalem

• 5. Nehalem-EX

• 6. Westmere

• 7. Westmere-EX

• 8. Sandy Bridge

• 9. Overview of the evolution


© Sima Dezső, ÓE NIK 160 www.tankonyvtar.hu
Remarks

Remarks

1) To keep the overview clear, the discussion of the basic architectures is restricted to
Intel’s “standard voltage” basic lines.
Medium-voltage/low-voltage/ultra-low voltage processors are not included.
2) The release dates given relate to the first processors shipped in a considered line.
Subsequently shipped models of the lines are not taken into account in order to keep
the overviews comprehensible.
3) On the slides the core numbers reflect the max. number of cores.
Usually, manufacturers also provide processors with fewer than the max. number of cores.

© Sima Dezső, ÓE NIK 161 www.tankonyvtar.hu


1. Introduction

© Sima Dezső, ÓE NIK 162 www.tankonyvtar.hu


1. Introduction (1)

The evolution of Intel’s basic microarchitectures

Figure 1.1: Intel’s Tick-Tock development model [1]

© Sima Dezső, ÓE NIK 163 www.tankonyvtar.hu


1. Introduction (2)
Intel’s Tick-Tock model (tick/tock steps follow each other roughly every 2 years)

Key microarchitectural features:

• Pentium 4 / Willamette   180 nm   11/2000   New microarch.
• Pentium 4 / Northwood    130 nm   01/2002   Adv. microarch., hyperthreading
• Pentium 4 / Prescott      90 nm   02/2004   Adv. microarch., hyperthreading, 64-bit
• Pentium 4 / Cedar Mill    65 nm   01/2006   Shrink
• Core 2                    65 nm   07/2006   New microarch., 4-wide core, 128-bit SIMD, no hyperthreading
• Penryn                    45 nm   11/2007   Shrink
• Nehalem                   45 nm   11/2008   New microarch., hyperthreading, (inclusive) L3, integrated MC, QPI
• Westmere                  32 nm   01/2010   Shrink
• Sandy Bridge              32 nm   01/2011   New microarch., hyperthreading, 256-bit AVX, integr. GPU, ring bus

Figure 1.2: Overview of Intel’s Tick-Tock model (based on [3])

© Sima Dezső, ÓE NIK 164 www.tankonyvtar.hu
1. Introduction (3)

Basic architectures and their related shrinks

Considered from the Pentium 4 Prescott (the third core of Pentium 4) on

Basic architectures and their shrinks:

• Pentium 4 (Prescott):   2005  90 nm  Pentium 4;    2006  65 nm  Pentium 4
• Core 2:                 2006  65 nm  Core 2;       2007  45 nm  Penryn
• Nehalem:                2008  45 nm  Nehalem;      2010  32 nm  Westmere
• Sandy Bridge:           2011  32 nm  Sandy Bridge; 2012  22 nm  Ivy Bridge

© Sima Dezső, ÓE NIK 165 www.tankonyvtar.hu


1. Introduction (4)

[Figure content: a timeline (2000–2005) of Intel’s Netburst-based Pentium 4 family, showing
the Xeon MP line (Foster-MP, Gallatin, Potomac), the Xeon DP line (Foster, Prestonia-A/B/C,
Nocona, the cancelled Jayhawk), the Extreme Edition line (Irwindale-A1/B1/C), the desktop
line (Willamette, Northwood-A/B/C, Prescott, Prescott-F, the cancelled Tejas) and the Celeron
line (Willamette-128, Northwood-128, Celeron-D), each with process technology, transistor
count, clock frequency, cache sizes, FSB speed and socket; the legend marks cores supporting
hyperthreading and EM64T.]

Figure 1.3: Intel’s P4 family of processors based on the Netburst architecture


© Sima Dezső, ÓE NIK 166 www.tankonyvtar.hu
1. Introduction (5)

• Prescott (original core; Intel’s first 64-bit x86 desktop processor): 2/2004 (1), 90 nm,
  112 mm2, 125 mtrs; Pentium 4 A/E/F series, Pentium 4 5xx series
• Prescott 2M (L2 increased to 2 MB): 2/2005, 90 nm, 135 mm2, 169 mtrs;
  Pentium 4 6x0/6x2 series, Pentium 4 EE 3.73
• Cedar Mill (65 nm shrink): 1/2006, 65 nm, 81 mm2, 188 mtrs; Pentium 4 6x1 series
• Presler (2 x Cedar Mill): 1/2006, 65 nm, 2 x 81 mm2, 2 x 188 mtrs; Pentium D 9xx,
  Pentium EE 955/965

(1) The original Prescott core included but did not activate the support of 64-bit operation,
called EM64T, and used the µPGA 478 socket. EM64T support was released later, about 6/2004,
while changing to the socket LGA 775.

Figure 1.4: Genealogy of the Cedar Mill core and the DC Presler processor

© Sima Dezső, ÓE NIK 167 www.tankonyvtar.hu


2. Core 2

• 2.1 Introduction

• 2.2 Wide execution

• 2.3 Smart L2 cache

• 2.4 Smart memory accesses

• 2.5 Enhanced digital media support

• 2.6 Intelligent power management

• 2.7 Core based processor lines

© Sima Dezső, ÓE NIK 168 www.tankonyvtar.hu


2.1 Introduction (1)

2.1 Introduction
Core 2 microarchitecture

Developed at Intel’s Haifa Research and Development Center

The Pentium4 line was cancelled due to unmanageable high dissipation figures
of its third core (the 90 nm Prescott core)
(caused primarily by the design philosophy of the line
that preferred clock frequency over core efficiency for increasing performance)

For the development of the next processor line dissipation became the key design issue

The next line became based primarily on the Pentium M – Intel’s first mobile line
since for its designs dissipation reduction was a key issue.
(The Pentium M line was a 32-bit line with 3 subsequent cores, designed at
Intel’s Haifa Research and Development Center)

The Haifa team had no experiences with multithreading

The Core microarchitecture does not support multithreading


© Sima Dezső, ÓE NIK 169 www.tankonyvtar.hu
2.1 Introduction (2)

Key features of the Core 2 microarchitecture

• Intel® Wide Dynamic Execution
• Intel® Intelligent Power Capability
• Intel® Advanced Digital Media Boost
• Intel® Smart Memory Access
• Intel® Advanced Smart Cache

Figure 2.1: Key features of the Core 2 microarchitecture [16]

© Sima Dezső, ÓE NIK 170 www.tankonyvtar.hu
2.2 Wide execution (1)

2.2 Wide execution


• 4-wide core
• Enhanced execution resources
• Micro fusion
• Macro fusion

© Sima Dezső, ÓE NIK 171 www.tankonyvtar.hu


2.2 Wide execution (2)

4-wide core

4-wide front end and retire unit

Key benefit of the Core family

By contrast
both Intel’s previous Pentium 4 family and AMD’s K8 have 3-wide cores.

© Sima Dezső, ÓE NIK 172 www.tankonyvtar.hu


Figure 2.2: Block diagram of Intel’s Core microarchitecture [4]
2.2 Wide execution (3)

173
Figure 2.3: Block diagram of Intel’s Pentium 4 microarchitecture [5]
2.2 Wide execution (4)

174
Retire width: 3 instr./cycle
www.tankonyvtar.hu

Figure 2.4: Block diagram of AMD’s K8 microarchitecture [4]


2.2 Wide execution (5)

175
2.2 Wide execution (6)

Enhanced execution resources

The Core has three complex SSE units


By contrast
• The Pentium 4 provides a single complex SSE unit and
a second simple SSE unit performing only SSE move and store operations.
• The K8 has two SSE units

© Sima Dezső, ÓE NIK 176 www.tankonyvtar.hu


Figure 2.5: Issue ports and execution units of the Core [4]
2.2 Wide execution (7)

177
2.2 Wide execution (8)

Figure 2.6: Issue ports and execution units of the Pentium 4 [9]

Ports 0 and 1 can issue up to two microinstructions per cycle, allowing altogether
up to 6 microinstructions to be issued per cycle.

© Sima Dezső, ÓE NIK 178 www.tankonyvtar.hu


2.2 Wide execution (9)

Remark

Both the Core’s and the Pentium 4’s schedulers can issue 6 operations per cycle, but

• Pentium 4’s schedulers have only 4 ports, with two double pumped simple ALUs,

• by contrast Core has a unified scheduler with 6 ports, allowing more flexibility
for issuing instructions.

© Sima Dezső, ÓE NIK 179 www.tankonyvtar.hu


2.2 Wide execution (10)

Table 2.1: Key features


of x86 processors [4]
180 www.tankonyvtar.hu
2.2 Wide execution (11)

Remark

IBM’s POWER4 and subsequent processors of this line have introduced 5-wide cores
with 8-wide out of order issue.

These processors bundle 5 subsequent instructions into a group, dispatch groups
in order, execute instructions out of order,
and retire groups (one group per cycle) in order.

© Sima Dezső, ÓE NIK 181 www.tankonyvtar.hu


2.2 Wide execution (12)

Micro-op fusion [10]

• Originally introduced in the Pentium M (1. core (Banias) in 2003).


• Combining micro-ops derived from the same macro-operation into a single micro-op.

• Micro-op fusion can reduce the total number of micro-ops to be processed by more than
10 %.

This results in higher processor performance.

© Sima Dezső, ÓE NIK 182 www.tankonyvtar.hu


2.2 Wide execution (13)

Remark

IBM’s POWER4 and subsequent processors provide a 5-wide frontend.

© Sima Dezső, ÓE NIK 183 www.tankonyvtar.hu


2.2 Wide execution (14)

Macro-op fusion [10]

• New feature introduced into the Core.

• Combining common x86 instruction pairs (such as a compare followed by a conditional


jump) into a single micro-op during decoding.

Two x86 instructions can be executed as a single micro-op.


This increases performance.

Example

© Sima Dezső, ÓE NIK 184 www.tankonyvtar.hu


2.2 Wide execution (15)

Figure 2.7: Macro-op fusion example (1) [11]


© Sima Dezső, ÓE NIK 185 www.tankonyvtar.hu
2.2 Wide execution (16)

Figure 2.8: Macro-op fusion example (2) [11]


© Sima Dezső, ÓE NIK 186 www.tankonyvtar.hu
2.2 Wide execution (17)

Figure 2.9: Macro-op fusion example (3) [11]


© Sima Dezső, ÓE NIK 187 www.tankonyvtar.hu
2.2 Wide execution (18)

Table 2.2: Comparing Intel’s and AMD’s fusion techniques [4]

© Sima Dezső, ÓE NIK 188 www.tankonyvtar.hu


2.2 Wide execution (19)

Performance leadership changes between Intel and AMD

• In 2003 AMD introduced their K8-based processors implementing


• the 64-bit x86 ISA and
• the direct connect architecture concept, that includes
• integrated memory controllers and
• high speed point-to-point serial buses (the HyperTransport bus)
used to connect processors to processors and processors to south bridges.

• AMD’s K8-based processors became the performance leader, first of all on the DP and MP
server market, where the 64-bit direct connect architecture has clear benefits
vs Intel’s 32-bit Pentium 4 based processors using shared FSBs to connect processors
to north bridges.

© Sima Dezső, ÓE NIK 189 www.tankonyvtar.hu


2.2 Wide execution (20)

Example 1: DP web-server performance comparison (2003)

Figure 2.10: DP web server performance comparison: AMD Opteron 248 vs. Intel Xeon 2.8 [6]
© Sima Dezső, ÓE NIK 190 www.tankonyvtar.hu
2.2 Wide execution (21)

Example 2: Summary assessment of extensive benchmark tests contrasting


dual Opterons vs dual Xeons (2003) [7]

“In the extensive benchmark tests under Linux Enterprise Server 8 (32 bit as well as
64 bit), the AMD Opteron made a good impression. Especially in the server disciplines,
the benchmarks (MySQL, Whetstone, ARC 2D, NPB, etc.) show quite clearly that the
Dual Opteron puts the Dual Xeon in its place”.

© Sima Dezső, ÓE NIK 191 www.tankonyvtar.hu


2.2 Wide execution (22)

• This situation has completely changed in 2006 when Intel introduced their Core 2
microarchitecture,

• The Core 2 has

 a 4-wide front-end and retire unit compared to the 3-wide K8 or the Pentium 4,

 three complex FP/SSE units compared to two units available in the K8 or


just a single complex unit and a second simple unit performing only FP-move and
FP store operations.

• This and further enhancements of the Core microarchitecture, detailed subsequently,


resulted in record breaking performance figures.

Intel regained performance leadership vs AMD.

© Sima Dezső, ÓE NIK 192 www.tankonyvtar.hu


2.2 Wide execution (23)

Example: DP web-server performance comparison (2006)

Webserver Performance

MSI K2-102A2M MSI K2-102A2M Opteron 280 vs. Extrapolated Xeon 5160
Opteron 275 Opteron 280 Opteron 275 Opteron 3 GHz 3 GHz

Jsp - Peak 144 154 7% 182 230

AMP - Peak 984 1042 6% 1178 1828

Jsp: Java Server Page performance


AMP: Apache/MySQL/PHP

Figure 2.11: DP web server performance comparison: AMD Opteron 275/280 vs. Intel Xeon 5160 [8]

Remark
Both web-server benchmark results were published from the same source (AnandTech)

© Sima Dezső, ÓE NIK 193 www.tankonyvtar.hu


2.3 Smart L2 cache (1)

2.3 Smart L2 cache

Shared L2 instead of private L2 caches associated with the cores.

• Private L2 (Pentium 4-based DCs): Core1 and Core2 each have their own L2 cache
• Shared L2 (Core 2-based DCs): Core1 and Core2 share a single L2 cache

Figure 2.12: Core’s shared L2 cache vs Pentium 4’s private L2 caches

© Sima Dezső, ÓE NIK 194 www.tankonyvtar.hu


2.3 Smart L2 cache (2)

Benefits of shared caches

• Dynamic cache allocation to the individual cores


• Efficient data sharing (no replicated data)

+ 2x bandwidth to L1 caches.

© Sima Dezső, ÓE NIK 195 www.tankonyvtar.hu


2.3 Smart L2 cache (3)

Figure 2.13: Dynamic L2 cache allocation according to cache demand [11]


© Sima Dezső, ÓE NIK 196 www.tankonyvtar.hu
2.3 Smart L2 cache (4)

Figure 2.14: Data sharing in shared and private (independent) L2 cache implementations [11]
© Sima Dezső, ÓE NIK 197 www.tankonyvtar.hu
2.3 Smart L2 cache (5)

Drawbacks of shared caches

Shared caches combine access patterns

Reduce the efficiency of hardware prefetching vs private caches.

Choice between shared and private caches

Design decision depending on

whether benefits or drawbacks dominate as far as performance is concerned.

Trend

• Core 2 prefers a shared L2 cache, Nehalem prefers private L2 caches
• POWER5 prefers a shared L2 cache, POWER6 prefers private L2 caches

© Sima Dezső, ÓE NIK 198 www.tankonyvtar.hu


2.3 Smart L2 cache (6)

Table 2.3: Cache parameters of Intel’s and AMD’s processors [4]


© Sima Dezső, ÓE NIK 199 www.tankonyvtar.hu
2.4 Smart memory accesses (1)

2.4 Smart memory accesses

• Memory disambiguation
• Enhanced hardware prefetchers

(L1 I-Cache not shown)

Figure 2.15: Units involved in implementing memory disambiguation or hardware prefetching [12]
© Sima Dezső, ÓE NIK 200 www.tankonyvtar.hu
2.4 Smart memory accesses (2)

Memory disambiguation

In Intel’s Core 2, memory disambiguation means memory access reordering
by letting Loads bypass both Loads and Stores speculatively.

Aim
Hiding memory latency through reordering of loads.

© Sima Dezső, ÓE NIK 201 www.tankonyvtar.hu


2.4 Smart memory accesses (3)

Example

Figure 2.16: Example for memory reordering (memory disambiguation), Loads may bypass
Stores [13]
© Sima Dezső, ÓE NIK 202 www.tankonyvtar.hu
2.4 Smart memory accesses (4)

An overview of memory access reordering


Memory operations are typically decoupled from other operations in such a way that

• memory operations to be performed (i.e. Loads/Stores) are typically written into


two separate queues, called the Load Queue and the Store Queue,
• sequential consistency among memory operations is preserved according to a chosen
sequential consistency model by e.g. a Memory Order Buffer, as discussed subsequently,
• sequential consistency between memory operations and non memory operations is
maintained by appropriately handling load-use dependencies, such
that dependent operations are scheduled for execution only after a Load instruction
has already delivered the referenced value from the memory.

© Sima Dezső, ÓE NIK 203 www.tankonyvtar.hu


2.4 Smart memory accesses (5)

Sequential consistency models

• Strong sequential consistency: no Load/Store reordering is allowed
• Weak sequential consistency: Load/Store reordering is allowed

Load reordering
• Loads bypass only Loads – typ. examples: Pentium Pro, Pentium II, III, Pentium 4
• Loads bypass both Loads and Stores – see later

Figure 2.17: Main alternatives of load reordering


© Sima Dezső, ÓE NIK 204 www.tankonyvtar.hu
2.4 Smart memory accesses (6)

Example

Figure 2.18: Example for memory reordering (memory disambiguation), Loads may bypass
Stores [13]
© Sima Dezső, ÓE NIK 205 www.tankonyvtar.hu
2.4 Smart memory accesses (7)

When Loads bypass Loads


no dependencies need to be taken into account.

When Loads bypass Stores


data dependencies need to be checked, as follows:

• Loads may bypass only Stores whose target addresses differ from that of the Load,
else the Load would access a former, incorrect value, from the target address.
• However, Store addresses are not always known at the time when the scheduler needs
to decide whether or not the Load considered is allowed to bypass a Store.
• There are two options how to proceed when Store addresses are not yet known.

© Sima Dezső, ÓE NIK 206 www.tankonyvtar.hu


2.4 Smart memory accesses (8)

Loads bypass both Loads and Stores

Deterministic Store bypassing
Loads bypass Stores only if all respective Store addresses are known
and the Load address does not coincide with any of the Store addresses to be bypassed.
Examples:
• A few 1. gen. superscalar RISCs, such as MC88110 (1993), PowerPC 603/604 (1993/1995)
• 2. gen. superscalar RISCs, such as UltraSPARC (1995), Alpha 21264 (1998)

Speculative Store bypassing
Loads may bypass Stores also in cases when respective Store addresses are not yet known,
that is, not yet computed. Then the correctness of the speculative Load needs to be checked,
e.g. as follows: each calculated Store address will be compared to all younger Load addresses.
For a hit, this Load and subsequent instructions will be aborted and re-executed.
Examples:
• 2. gen. superscalar RISCs, such as PowerPC 620 (1996), PA8000 (1996)
• Recent x86 CISCs, such as Core 2 (2006), Penryn (2007)

Figure 2.19: Introduction of Load reordering related to both Loads and Stores
© Sima Dezső, ÓE NIK 207 www.tankonyvtar.hu
2.4 Smart memory accesses (9)

Remark 1

Available literature sources for the UltraSPARC processor do not allow a clear distinction
of the load bypassing option used. Based on these sources, deterministic load bypassing
was assumed.

Remark 2

x86 processors have much more complex addressing modes, requiring a number of
address additions, compared to RISC processors.
This is the reason why load reordering was introduced much later in x86 processors
than in RISC processors.

© Sima Dezső, ÓE NIK 208 www.tankonyvtar.hu


2.4 Smart memory accesses (10)

The benefit arising when Loads may bypass Stores as well

• Runtime overlapping of tight loops that access memory.


• Overlapping: when Loads at the beginning of an iteration may access memory without
having to wait until Stores at the end of the previous iteration are completed [15].

Usually, Stores are not allowed to bypass older Stores or Loads.

Assumed reason
Stores are less frequent than Loads (~ 1/3-1/4), so the obtainable performance gain
would not pay off the additional complexity (and dissipation).

© Sima Dezső, ÓE NIK 209 www.tankonyvtar.hu
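To make the loop-overlap benefit concrete, here is a minimal C sketch (illustrative only,
not taken from the source): the Load of src[i] that opens an iteration may be issued
speculatively before the Store to dst[i-1] that closed the previous iteration has completed,
provided the predictor assumes (and the hardware later verifies) that the addresses do not
collide.

/* Illustrative fragment (not from the source). With speculative Store bypassing,
   the Load of src[i] at the start of an iteration may bypass the still pending
   Store to dst[i-1] of the previous iteration, overlapping the iterations. */
void scale_copy(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++) {
        float x = src[i];    /* Load opening the iteration   */
        dst[i] = x * 0.5f;   /* Store closing the iteration  */
    }
}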


2.4 Smart memory accesses (11)

How to implement speculative Loads?

Figure 2.20: Assumed hardware structure to implement speculative Loads in Intel’s Core 2 [4]
© Sima Dezső, ÓE NIK 210 www.tankonyvtar.hu
2.4 Smart memory accesses (12)

The predictor [14]

• Actually, when a Load is issued from the Reservation Station’s scheduler to the Load
Buffer, a predictor is looked up.

• The predictor makes a prediction, as described subsequently.

• If the prediction is “non colliding” a Store with an unknown store address may be passed
else not.

© Sima Dezső, ÓE NIK 211 www.tankonyvtar.hu


2.4 Smart memory accesses (13)

Figure 2.21: Principle of the implementation of speculative Loads in Intel’s Core 2 [12]
© Sima Dezső, ÓE NIK 212 www.tankonyvtar.hu
2.4 Smart memory accesses (14)

Source: [12]

© Sima Dezső, ÓE NIK 213 www.tankonyvtar.hu


2.4 Smart memory accesses (15)

Source: [12]

© Sima Dezső, ÓE NIK 214 www.tankonyvtar.hu


2.4 Smart memory accesses (16)

Remark 1
There is a further option to hide memory latency, called Store to Load forwarding, or
Store Buffer forwarding.
Store to Load forwarding
Forwarding store data immediately from the Store buffer to a Load without waiting
for the data to be written to the cache, in cases when the last Store writing the
same address as referenced by the Load, is actually available in the Store buffer.

Examples
Pentium 4 (2000)
Core 2 (2006)
Penryn (2007)
Merom (2008)
AMD Athlon64 (2003)
AMD Opteron (2003)

In effect, Store to Load forwarding is a special form of Load reordering in which
Loads are allowed to bypass both Loads and Stores.
It is, however, effective only for Loads whose companion Store (the last Store
referring to the same memory address) is actually available in the Store Buffer.

© Sima Dezső, ÓE NIK 215 www.tankonyvtar.hu
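As a minimal illustration (a sketch, not from the source), in the C fragment below the Load
in the return statement reads the address just written by the preceding Store, so its value
can be forwarded directly from the Store Buffer instead of waiting for the Store to reach
the data cache.

/* Illustrative fragment (not from the source): the companion Store of the final
   Load is still sitting in the Store Buffer, so the Load can be served by
   Store to Load forwarding. */
int increment_and_double(int *p)
{
    *p = *p + 1;     /* Store to *p                                 */
    return *p * 2;   /* Load of *p, forwarded from the Store Buffer */
}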


2.4 Smart memory accesses (17)

Figure 2.22: Example for Store to Load forwarding [12]


© Sima Dezső, ÓE NIK 216 www.tankonyvtar.hu
2.4 Smart memory accesses (18)

Remark 2

There is yet another option to reduce D cache load-use latency, called speculative Loads.

Speculative Loads
• Issuing a load-use dependent instruction before it turns out whether the data cache
hits or misses, optimistically, in expectation of a cache hit. Then data become
available typically 1-2 clock cycles earlier than with traditional processing.

• If the expectation turns out to be wrong, the execution of the instruction in concern
becomes aborted and re-executed after the cache miss is serviced.

Example: Alpha 21264

Speculative Loads do not modify the memory access order.

They do not belong to the memory access reordering schemes.

© Sima Dezső, ÓE NIK 217 www.tankonyvtar.hu


2.4 Smart memory accesses (19)

Memory access reordering

• Load reordering
  - Loads bypass only Loads
  - Loads bypass both Loads and Stores
    - Store to Load forwarding
    - Deterministic Store bypassing
    - Speculative Store bypassing
• Store reordering

Use in x86 processors:
• Pentium lines: Pentium Pro, Pentium II/III (Loads bypass only Loads);
  Pentium 4, Core 2, Penryn, Nehalem (Store to Load forwarding and speculative Store bypassing)
• AMD lines: Athlon 64, Opteron (Store to Load forwarding)

Figure 2.23: Overview of memory access reordering schemes and their use in x86 processors
© Sima Dezső, ÓE NIK 218 www.tankonyvtar.hu
2.4 Smart memory accesses (20)

Hardware prefetchers [9]

Remarks

• Intel’s first hardware prefetcher appeared in the Pentium 4 family,


associated with the L2 cache.

• Intel’s first on-die L2 cache debuted only about one year earlier (10/1999),
in the second core of the Pentium III line (called the Coppermine core,
built on 180 nm technology, with a size of 256 KB).

Principle of operation of the L2 hardware prefetcher


• it monitors data access patterns and prefetches data automatically into the L2 cache,
• it attempts to stay 256 bytes ahead of the current data access location,
• the prefetcher remembers the history of cache misses to detect concurrent,
independent data streams that it tries to prefetch ahead of its use.

© Sima Dezső, ÓE NIK 219 www.tankonyvtar.hu


2.4 Smart memory accesses (21)

Enhanced hardware prefetchers in the Core 2 [11]

8 prefetchers per two-core processor

• 2 data and 1 L1 instruction prefetchers per core,


able to handle multiple simultaneous patterns.
• 2 prefetchers in the L2 cache
tracking multiple access patterns per core.
• Prefetchers monitor demand traffic and regulate “aggression”.

© Sima Dezső, ÓE NIK 220 www.tankonyvtar.hu


2.4 Smart memory accesses (22)

Hardware prefetchers within the Core 2 microarchitecture

Figure 2.24: Hardware prefetchers within the Core 2 microarchitecture [11]


© Sima Dezső, ÓE NIK 221 www.tankonyvtar.hu
2.5 Enhanced digital media support (1)

2.5 Enhanced digital media support


• Widening the FP/SSE Execution units from 64-bit to 128-bit.
• Supplemental enhancement of the SSE3 ISA extension.

© Sima Dezső, ÓE NIK 222 www.tankonyvtar.hu


2.5 Enhanced digital media support (2)

Widening the FP/SSE Execution units from 64-bit to 128-bit

Figure 2.25: Widening the FP/SSE Execution Units from 64-bit to 128-bit [12]
© Sima Dezső, ÓE NIK 223 www.tankonyvtar.hu
2.5 Enhanced digital media support (3)

Single cycle 128-bit execution as a result of widening the FP/SSE Execution units

Figure 2.26: Single cycle execution of 128-bit operations
as a result of widening the FP/SSE Execution Units from 64-bit to 128-bit [17]
© Sima Dezső, ÓE NIK 224 www.tankonyvtar.hu
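As a small illustration of the widened 128-bit data path (a sketch assuming an SSE2-capable
compiler, not taken from the source), the C program below uses the SSE2 addpd operation,
exposed as the _mm_add_pd() intrinsic, to add two packed pairs of 64-bit doubles in a single
128-bit operation.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_pd(3.0, 1.0);      /* packs {1.0, 3.0}            */
    __m128d b = _mm_set_pd(4.0, 2.0);      /* packs {2.0, 4.0}            */
    __m128d c = _mm_add_pd(a, b);          /* addpd: two 64-bit FP adds   */

    double out[2];
    _mm_storeu_pd(out, c);                 /* unaligned store of the pair */
    printf("%f %f\n", out[0], out[1]);     /* prints 3.000000 7.000000    */
    return 0;
}

On Core 2 such a packed 128-bit operation executes in a single cycle, as indicated by
Figure 2.26 above.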
2.5 Enhanced digital media support (4)

• Supplemental enhancement of the SSE3 ISA extension,


as shown in the next Figure.

Overview of the x86 ISA extensions in Intel’ processor lines

© Sima Dezső, ÓE NIK 225 www.tankonyvtar.hu


2.5 Enhanced digital media support (5)

[Timeline annotations in the figure: MultiMedia eXtensions (MMX), Streaming SIMD Extensions
(SSE), Northwood (Pentium 4), Advanced Encryption Standard (AES), Advanced Vector Extension
(AVX), Fused Multiply-Add instructions (Ivy Bridge).]

Figure 2.27: Overview of Intel’s x86 ISA extensions (based on [18])


© Sima Dezső, ÓE NIK 226 www.tankonyvtar.hu
2.5 Enhanced digital media support (6)

[Figure annotations – the SIMD register space: 8 MM registers (64-bit), aliased on the FP
stack registers; 8 XMM registers (128-bit); 16 XMM registers (128-bit); Larrabee: large
number of 512-bit registers; 16 YMM registers (256-bit); timeline marks for Northwood
(Pentium 4) and Ivy Bridge.]

Figure 2.28: Intel’s x86 ISA extensions - the SIMD register space (based on [18])
© Sima Dezső, ÓE NIK 227 www.tankonyvtar.hu
2.5 Enhanced digital media support (7)

Figure 2.29: Evolution of the SIMD processing width [18]
© Sima Dezső, ÓE NIK 228 www.tankonyvtar.hu
2.5 Enhanced digital media support (8)

[Figure annotations – the operations introduced by the successive ISA extensions:
• 64-bit FX SIMD with 32/16/8-bit operands (ops); support of MM
• 128-bit FX, FP SIMD with 32/16/8-bit FX, 32-bit FP ops; support of MM/3D
• 128-bit FX, FP SIMD with 64/32/16/8-bit FX, 64/32-bit FP; support for MPEG-2, video, MP3
  (Northwood / Pentium 4)
• DSP-oriented FP enhancements, enhanced thread manipulation
• Diverse arithmetic enhancements
• Media acceleration (video encoding, MM, gaming)
• Accelerated string/text manipulation, application targeted acceleration
• Accelerated encryption operations
• 256-bit FX, FP SIMD with 64/32/16/8-bit FX, 64/32-bit FP (?) (Ivy Bridge marked on the
  timeline)]

Figure 2.30: Intel’s x86 ISA extensions - the operations introduced (based on [17])
© Sima Dezső, ÓE NIK 229 www.tankonyvtar.hu
2.5 Enhanced digital media support (9)
Arithmetic:
addpd - Adds 2 64bit doubles.
addsd - Adds bottom 64bit doubles.
subpd - Subtracts 2 64bit doubles.
subsd - Subtracts bottom 64bit doubles.
mulpd - Multiplies 2 64bit doubles.
mulsd - Multiplies bottom 64bit doubles.
divpd - Divides 2 64bit doubles.
divsd - Divides bottom 64bit doubles.
maxpd - Gets largest of 2 64bit doubles for 2 sets.
maxsd - Gets largest of 2 64bit doubles to bottom set.
minpd - Gets smallest of 2 64bit doubles for 2 sets.
minsd - Gets smallest of 2 64bit values for bottom set.
paddb - Adds 16 8bit integers.
paddw - Adds 8 16bit integers.
paddd - Adds 4 32bit integers.
paddq - Adds 2 64bit integers.
paddsb - Adds 16 8bit integers with saturation.
paddsw - Adds 8 16bit integers using saturation.
paddusb - Adds 16 8bit unsigned integers using saturation.
paddusw - Adds 8 16bit unsigned integers using saturation.
psubb - Subtracts 16 8bit integers.
psubw - Subtracts 8 16bit integers.
psubd - Subtracts 4 32bit integers.
psubq - Subtracts 2 64bit integers.
psubsb - Subtracts 16 8bit integers using saturation.
psubsw - Subtracts 8 16bit integers using saturation.
psubusb - Subtracts 16 8bit unsigned integers using saturation.
psubusw - Subtracts 8 16bit unsigned integers using saturation.
pmaddwd - Multiplies 16bit integers into 32bit results and adds results.
pmulhw - Multiplies 16bit integers and returns the high 16bits of the result.
pmullw - Multiplies 16bit integers and returns the low 16bits of the result.
pmuludq - Multiplies 2 32bit pairs and stores 2 64bit results.
rcpps - Approximates the reciprocal of 4 32bit singles.
rcpss - Approximates the reciprocal of bottom 32bit single.
sqrtpd - Returns square root of 2 64bit doubles.
sqrtsd - Returns square root of bottom 64bit double.

Figure 2.31: Excerpt of Intel’s SSE2 SIMD ISA extension [19]

230 www.tankonyvtar.hu
2.5 Enhanced digital media support (10)

[Figure annotations – SIMD execution resources: 2 x 32-bit FX MMX EUs; 2 x 32-bit MMX plus
2 x 32-bit SSE EUs; 1 full + 1 simple (moves/stores) 64-bit FP/SSE EUs (Northwood / Pentium 4);
3 x 128-bit FP/SSE EUs; Larrabee: 24-32? x 512-bit FP/SSE EUs; ? x 256-bit SSE EUs??
(Ivy Bridge marked on the timeline).]

Figure 2.32: SIMD execution resources in Intel’s basic processors (based on [18])
© Sima Dezső, ÓE NIK 231 www.tankonyvtar.hu
2.5 Enhanced digital media support (11)

Achieved performance boost in Core 2 for gaming apps

Figure 2.33: Achieved performance boost in Core2 for gaming vs AMD’s Athlon 64 FX60 [13]
© Sima Dezső, ÓE NIK 232 www.tankonyvtar.hu
2.6 Intelligent power management (1)

2.6 Intelligent Power management


• Ultra fine grained power control
• Bus splitting
• Platform Thermal Control

© Sima Dezső, ÓE NIK 233 www.tankonyvtar.hu


2.6 Intelligent power management (2)

Ultra fine grained power control


Shutting down units of the processor that are currently not needed.

(In the figure, units shown in green are not needed for the given operation and can be
shut off.)

Figure 2.34: The operation of the Ultra fine grained power control – an example [11].
© Sima Dezső, ÓE NIK 234 www.tankonyvtar.hu
2.6 Intelligent power management (3)

Bus splitting

Introduced already in the Pentium M.

Principle of operation

Most buses are sized for worst case, activating only needed bus widths saves power.

Figure 2.35: Principle of bus splitting to save power [11]


© Sima Dezső, ÓE NIK 235 www.tankonyvtar.hu
2.6 Intelligent power management (4)

Platform Thermal Control

Digital Thermal Sensors (DTS) on the dies, instead of analog diodes, provide digital data
that is scanned by dedicated logic.

PECI (Platform Environment Control Interface): a single wire proprietary interface with
transfer rates of 2 kbit/s - 2 Mbit/s.
DTSs report the difference between the current temperature and the throttle point,
at which the processor reduces speed.

Figure 2.36: Principle of the Platform Thermal Control [11], [20]
© Sima Dezső, ÓE NIK 236 www.tankonyvtar.hu
2.6 Intelligent power management (5)

PECI-based platform fan speed control

Figure 2.37: Principle of the PECI-based platform fan speed control [42]

Remark
PECI reports the relative temperature values measured below the onset value of the
thermal control circuit (TCC)

© Sima Dezső, ÓE NIK 237 www.tankonyvtar.hu
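A hedged sketch of how a platform fan controller might use the PECI/DTS margin (hypothetical
helper name and constants, not Intel’s or any vendor’s actual algorithm): the DTS value is a
negative offset below the TCC activation point, and the smaller the remaining margin, the
higher the fan duty cycle.

#include <stdio.h>

/* Hedged sketch: map the DTS margin reported over PECI (0 = at the throttle
   point, more negative = cooler) to a fan duty cycle. Thresholds are made up. */
static int fan_duty_from_margin(int margin_below_tcc)   /* e.g. -50 .. 0 */
{
    if (margin_below_tcc >= 0)
        return 100;                  /* at or above the throttle point: full speed */
    if (margin_below_tcc <= -50)
        return 20;                   /* plenty of thermal headroom: minimum duty   */
    /* linear ramp between -50 (20 %) and 0 (100 %) */
    return 20 + (margin_below_tcc + 50) * (100 - 20) / 50;
}

int main(void)
{
    for (int m = -60; m <= 0; m += 10)
        printf("DTS margin %4d -> fan duty %3d%%\n", m, fan_duty_from_margin(m));
    return 0;
}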


2.6 Intelligent power management (6)

Further feature – Loop Stream Detector


• Loops are very common in most applications
• While executing loops
- the same instructions are decoded over and over
- the same branch predictions are made over and over.
• The Loop Stream Detector identifies program loops and
- forwards instructions for decoding from the Loop Stream Detector instead of via
the normal path,
- disables unneeded blocks of logic for power savings.
• Higher performance by removing instruction fetch limitations.

Figure 2.38: Loop Stream Detector in the Core [1]


© Sima Dezső, ÓE NIK 238 www.tankonyvtar.hu
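For illustration (not from the source), a tight loop such as the one below, whose body is
only a handful of instructions, is the kind of candidate a Loop Stream Detector can serve
from its internal buffer, so the same instructions need not be re-fetched and re-decoded on
every iteration.

/* Illustrative example (not from the source): a small, hot loop body that a
   Loop Stream Detector can stream from its buffer instead of the normal
   fetch/decode path. */
long sum_array(const long *a, long n)
{
    long s = 0;
    for (long i = 0; i < n; i++)   /* tight loop executed many times */
        s += a[i];
    return s;
}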
2.6 Intelligent power management (7)

Remark
In the Nehalem Intel modified the Loop Stream Detector as follows:

• Same concept as in the Core but


• Higher performance by expanding the size of the Loop Stream Detector,
• Improved power efficiency by disabling even more logic.

Figure 2.39: The modified loop Stream Detector in the Nehalem [1]

© Sima Dezső, ÓE NIK 239 www.tankonyvtar.hu


2.7 Core based processor lines (1)

2.7 Overview of Core 2 based processor lines


Mobiles
T5xxx/T7xxx, 2C ,Merom 8/2006
Desktops
E6xxx/E4xxx, 2C, Core 2 Duo, (Conroe) 7/2006
X6800 Core 2 Extreme, 2C, (Conroe) 7/2006
E6xxx/E4xxx, Core 2 Duo, 2C, (Allendale) 1/2007
Q6xxx Core 2 Quad, 2x2C (Kentsfield) (2xConroe) 1/2007
QX6xxx Core 2 Extreme Quad, 2x2C, (Kentsfield XE) 11/2006
Servers
UP-Servers
30xx, 2C, Conroe, 9/2006
30xx, 2C, Allendale 1/2007
32xx, 2x2C, Kentsfield (2xConroe) 1/2007
DP-Servers
51xx, 2C Woodcrest, 6/2006
53xx, 2x2C, Clovertown (2xWoodcrest), 11/2006

MP-Servers
72xx, 2C, Tigerton DC, (2xMP-enhanced SC Woodcrest) 9/2007
73xx, 2x2C, Tigerton QC, (2xMP-enhanced DC Woodcrest) 9/2007

Based on [43]
© Sima Dezső, ÓE NIK 240 www.tankonyvtar.hu
3. Penryn

• 3.1 Introduction

• 3.2 Enhanced wide execution

• 3.3 More advanced L2 cache

• 3.4 More advanced digital media support

• 3.5 More advanced power management

• 3.6 Penryn based processor lines

© Sima Dezső, ÓE NIK 241 www.tankonyvtar.hu


3.1 Introduction (1)

3.1 Introduction
Penryn

Basically a shrink (tick) from the 65 nm Core to 45 nm with a few microarchitectural


and ISA enhancements, discussed subsequently.

© Sima Dezső, ÓE NIK 242 www.tankonyvtar.hu


3.1 Introduction (2)

Sub-threshold =
Source-Drain

Figure 3.1: Dynamic and static power dissipation trends in chips [21]
© Sima Dezső, ÓE NIK 243 www.tankonyvtar.hu
3.1 Introduction (3)

Figure 3.2: Structure of a high-k + metal transistor [23]

© Sima Dezső, ÓE NIK 244 www.tankonyvtar.hu


3.1 Introduction (4)

Figure 3.3: Benefits of high-k + metal gate transistors [23], [24]


© Sima Dezső, ÓE NIK 245 www.tankonyvtar.hu
3.1 Introduction (5)

2 x Core 2 x Penryn

Figure 3.4: The 45 nm Penryn is a shrink of the 65 nm Core with a few enhancements [25]
© Sima Dezső, ÓE NIK 246 www.tankonyvtar.hu
3.1 Introduction (6)

Key enhancements of Penryn

Fast Radix-16 Divider

Large shared L2 cache

Intel SSE4 ISA Extension


Super Shuffle Engine

Figure 3.5: Key enhancements introduced into Penryn’s microarchitecture vs the Core
(based on [25])

© Sima Dezső, ÓE NIK 247 www.tankonyvtar.hu


3.2 Enhanced wide execution (1)

3.2 Enhanced wide execution


Fast radix-16 divider [29]

Principle of a conventional binary divider

It iteratively subtracts the divisor from the dividend.

Principle of a fast binary divider

It iteratively subtracts multiples of the divisor from the dividend.
Multiple bits of the quotient can be calculated in each iteration.

Radix-r divider

It computes b bits in each iteration, where r = 2^b.

Core: radix-4 divider (computing 2 bits/iteration).
Penryn: radix-16 divider (computing 4 bits/iteration).

On average > 50 % speed up over the previous generation [27] (see the sketch below).

© Sima Dezső, ÓE NIK 248 www.tankonyvtar.hu
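The sketch below (illustrative C code, not Penryn’s actual hardware algorithm, which relies
on carry-save adders and table-based quotient digit selection) shows the radix-16 principle:
each iteration brings in 4 dividend bits and subtracts the largest fitting multiple (0..15)
of the divisor, retiring 4 quotient bits at a time. A radix-4 divider would retire only 2
bits per iteration in the same manner, needing twice as many iterations.

#include <stdint.h>
#include <stdio.h>

/* Radix-16 long division sketch: each iteration retires 4 quotient bits by
   subtracting the largest multiple (0..15) of the divisor that fits.
   The divisor is assumed to be non-zero. */
static uint32_t div_radix16(uint32_t dividend, uint32_t divisor, uint32_t *rem)
{
    uint64_t r = 0;                              /* partial remainder */
    uint32_t q = 0;                              /* quotient          */
    for (int i = 28; i >= 0; i -= 4) {           /* 8 iterations for 32 bits           */
        r = (r << 4) | ((dividend >> i) & 0xF);  /* bring in the next 4 dividend bits  */
        uint32_t d = 0;
        while (d < 15 && (uint64_t)(d + 1) * divisor <= r)  /* largest multiple <= r   */
            d++;
        r -= (uint64_t)d * divisor;
        q = (q << 4) | d;                        /* 4 quotient bits per iteration      */
    }
    *rem = (uint32_t)r;
    return q;
}

int main(void)
{
    uint32_t rem;
    uint32_t q = div_radix16(1000000u, 37u, &rem);
    printf("%u / %u = %u remainder %u\n", 1000000u, 37u, q, rem);  /* 27027 r 1 */
    return 0;
}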


3.2 Enhanced wide execution (2)

CSA: Carry Save Adder
CPA: Carry Propagate Adder
QSL: Quotient Select
Hybrid: producing Gs and Ps
Figure 3.6: Simplified block diagram and latency of Penryn’s radix-16 divider [27]
© Sima Dezső, ÓE NIK 249 www.tankonyvtar.hu
3.3 More advanced L2 cache (1)

3.3 More advanced L2 cache


Larger shared L2 cache

Core 2 Penryn

Figure 3.7: Penryn’s L2 caches vs Core 2’s caches (Source: Intel)

© Sima Dezső, ÓE NIK 250 www.tankonyvtar.hu


3.4 More advanced digital media support (1)

3.4 More advanced digital media support


• SSE4.1 ISA extension
• Super shuffle engine

SSE4.1 ISA extension: the largest set of ISA extensions introduced since 2000
(47 instructions).

Figure 3.8: Overview of the SSE4.1 ISA extension [33]

© Sima Dezső, ÓE NIK 251 www.tankonyvtar.hu


3.4 More advanced digital media support (2)

SS4.1 ISA Extension details

Figure 3.9: Detailed overview of the SSE4.1 ISA extension [26]


© Sima Dezső, ÓE NIK 252 www.tankonyvtar.hu
3.4 More advanced digital media support (3)

Aim of Streaming Loads

Fast accessing of graphics card memory, e.g. by using two threads for CPU-GPU sharing

Figure 3.10: Aim of streaming loads: [29]


© Sima Dezső, ÓE NIK 253 www.tankonyvtar.hu
3.4 More advanced digital media support (4)

The streaming Load instruction

Figure 3.11: Streaming Load instruction (1) [26]


© Sima Dezső, ÓE NIK 254 www.tankonyvtar.hu
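A hedged sketch of how software can use the streaming Load (assumptions: an SSE4.1-capable
compiler, a 16-byte aligned source pointer mapping write-combining memory such as a graphics
aperture; the function name is hypothetical): the MOVNTDQA instruction is exposed in C as
the _mm_stream_load_si128() intrinsic.

#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hedged sketch (not from the source): streaming loads fetch 16-byte blocks
   from write-combining memory without polluting the caches.
   wc_src is assumed to be 16-byte aligned and to map a WC region. */
void copy_from_wc(uint8_t *dst, uint8_t *wc_src, size_t bytes)
{
    for (size_t i = 0; i + 16 <= bytes; i += 16) {
        __m128i v = _mm_stream_load_si128((__m128i *)(wc_src + i)); /* streaming load */
        _mm_storeu_si128((__m128i *)(dst + i), v);                  /* regular store  */
    }
}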
3.4 More advanced digital media support (5)

Super Shuffle Engine

Figure 3.12: The Super Shuffle Engine of Penryn [30]


© Sima Dezső, ÓE NIK 255 www.tankonyvtar.hu
3.4 More advanced digital media support (6)

Figure 3.13: Operation of the Super Shuffle Engine of Penryn [30]

© Sima Dezső, ÓE NIK 256 www.tankonyvtar.hu


3.4 More advanced digital media support (7)

Latency improvements achieved by Penryn’s Super Shuffle Engine

Figure 3.14: Latency improvements achieved by Penryn’s Super Shuffle Engine [30]

© Sima Dezső, ÓE NIK 257 www.tankonyvtar.hu


3.4 More advanced digital media support (8)

Microarchitecture comparison

Table 3.1: Microarchitecture comparison [25]


© Sima Dezső, ÓE NIK 258 www.tankonyvtar.hu
3.4 More advanced digital media support (9)

Overall performance achievements with Penryn (1)

Figure 3.15: Performance improvements of Penryn vs Core at the same clock frequency [26]
© Sima Dezső, ÓE NIK 259 www.tankonyvtar.hu
3.4 More advanced digital media support (10)

Overall performance achievements with Penryn (2)

Figure 3.16: Extending Intel’s performance leadership in main application segments [26]

© Sima Dezső, ÓE NIK 260 www.tankonyvtar.hu


3.5 More advanced power management (1)

3.5 More advanced power management


• Deep Power Down (DPD) technology
• Enhanced Dynamic Acceleration (EDAT)

Both are available only on mobile platforms.

(Both techniques were later introduced in Nehalem for general use.)

© Sima Dezső, ÓE NIK 261 www.tankonyvtar.hu


3.5 More advanced power management (2)

Deep Power Down technology (DPD)

(First introduced in the Core Duo, the 3rd core of the Pentium M line)

• Intelligent heuristics decide
when to enter this state.

Figure 3.17: Intel’s Deep Power Down technology [26]


262
© Sima Dezső, ÓE NIK www.tankonyvtar.hu
3.5 More advanced power management (3)

(OS API
WAIT)

Figure 3.18: Operation of Intel’s Deep Power Down technology [27]


© Sima Dezső, ÓE NIK 263 www.tankonyvtar.hu
3.5 More advanced power management (4)

Figure 3.19: Power reduction achieved by the Deep Power Down Technology [27]
© Sima Dezső, ÓE NIK 264 www.tankonyvtar.hu
3.5 More advanced power management (5)

Enhanced Dynamic Acceleration Technology (EDAT) (for mobiles)

Principle for dual core Penryn processors

Figure 3.20: Principle of the Enhanced Dynamic Acceleration Technology [27]


© Sima Dezső, ÓE NIK 265 www.tankonyvtar.hu
3.5 More advanced power management (6)

Remark 1
A similar technique, called the Foxton technology, was already developed for the Montecito
(dual core Itanium), but not implemented.

Foxton technology [28]


Depending on the actual usage pattern the chip is able to speed-up or down by
increasing/decreasing core voltage and clock rate.

• Under “low activity workloads”, which generate less heat,
the processor speeds up by increasing core voltage and core frequency
until it reaches the nominal power setting.
• Under “high activity workloads”, which generate more heat,
the processor scales down by decreasing core voltage and core frequency
to stay below the nominal power setting.

“Low activity workloads” typically include integer-intensive applications, such as


commercial database applications.

Foxton technology should increase performance for these applications


by about 10% compared with the same processor running with a "fixed clock."

© Sima Dezső, ÓE NIK 266 www.tankonyvtar.hu


3.5 More advanced power management (7)

Remark 2
Intel’s next basic core, the Nehalem includes a more advanced technology than
the Enhanced Dynamic Acceleration Technology, called the Turbo Boost Technology for
increasing clock frequency in case of inactive cores or light workloads.

© Sima Dezső, ÓE NIK 267 www.tankonyvtar.hu


3.6 Penryn based processor lines (1)

3.6 Overview of Penryn based processor lines (1)

-DP

-MP (6 cores)

Figure 3.21: Overview of the Penryn family [25]


© Sima Dezső, ÓE NIK 268 www.tankonyvtar.hu
3.6 Penryn based processor lines (2)

Overview of Penryn based processor lines (2)


Mobiles
Core 2 Duo T8xxx/T9xxx, Penryn-3M, 2C, 1/2008
Core 2 Duo T6xxx, Penryn-3M, 2C, 1/2009
Core 2 Quad Q9xxx, Penryn QC, 2x2C, (2xPenryn-3M), 8/2008

Desktops
Core 2 Duo E8xxx, Wolfdale, 2C, 1/2008
Core 2 Duo E7xxx, Wolfdale-3M, 2C, 4/2008
Core 2 Quad Q9xxx, Yorkfield-6M, 2x2C, (2x Wolfdale-3M), 3/2008
Core 2 Quad Q8xxx, Yorkfield-6M, 2x2C, (2x Wolfdale-3M), 8/2008
Core 2 Extreme QX9xxx, Yorkfield XE, 2x2C (2x Wolfdale), 11/2007
Servers
UP-Servers
E31xx Wolfdale, 2C, 1/2008
X33xx, Yorkfield-6M, 2x2C, (2xWolfdale), 1/2008
X33xx, Yorkfield, (2xWolfdale), 2x2C, 1/2008
DP-Servers
E52xx, Wolfdale, 2C, 11/2007
E54xx/X54xx, Harpertown 2x2C, (2xWolfdale), 11/2007
MP-Servers
E74xx, 4C/6C, Dunnington, 9/2008

© Sima Dezső, ÓE NIK Based on [43] 269 www.tankonyvtar.hu


4. Nehalem

• 4.1 Introduction

• 4.2 Simultaneous Multithreading (SMT)

• 4.3 New cache architecture

• 4.4 Further advanced digital media support

• 4.5 Integrated memory controller

• 4.6 QuickPath Interconnect bus

• 4.7 More advanced power management

• 4.8 Advanced virtualization

• 4.9 New socket

© Sima Dezső, ÓE NIK 270 www.tankonyvtar.hu


4.1 Introduction (1)

4.1 Introduction
Nehalem

Developed at Hillsboro, Oregon, at the site where the Pentium 4 emerged.

Experiences with HT

Nehalem became a multithreaded design.

The design effort took about five years and required thousands of engineers
(Ronak Singhal, lead architect of Nehalem) [37].

First implementation of the Nehalem microarchitecture in the desktop segment


Core i7-9xx (Bloomfield) 4C in 11/2008

© Sima Dezső, ÓE NIK 271 www.tankonyvtar.hu


4.1 Introduction (2)

Design objective: The same core for all major segments

Figure 4.1: Design objective of Nehalem [1]


© Sima Dezső, ÓE NIK 272 www.tankonyvtar.hu
4.1 Introduction (3)

Nehalem lines

1. generation Nehalem processors
  Desktops: Core i7-9xx (Bloomfield) 4C 11/2008
  Servers
    UP-Servers: 35xx Bloomfield 4C 3/2009
    DP-Servers: 55xx Gainestown (Nehalem-EP) 4C 3/2009

2. generation Nehalem processors
  Mobiles: Core i7-9xxM Clarksfield 4C 9/2009; Core i7-8xxQM Clarksfield 4C 9/2009;
           Core i7-7xxQM Clarksfield 4C 9/2009
  Desktops: Core i7-8xx (Lynnfield) 4C 9/2009; Core i5-7xx (Lynnfield) 4C 9/2009
  Servers
    UP-Servers: 34xx Lynnfield 4C 9/2009; C35xx Jasper forest (1) 4C 2/2010
    DP-Servers: C55xx Jasper forest (1) 2C/4C 2/2010

(1) Jasper forest: Embedded UP or DP server

Based on [44]
© Sima Dezső, ÓE NIK 273 www.tankonyvtar.hu
4.1 Introduction (4)

Die photo of the 1. generation Nehalem desktop processor (Bloomfield) [45]

Bloomfield [45]

Note
• Both the Bloomfield (desktop) chip and the Gainestown (DP server) chip have the same
layout.
• On the Bloomfield die there are two QPI bus controllers however they are not needed
for this desktop part.
One of them is simply not used in the desktop version Bloomfield [45], but both are needed
in the DP alternative (Gainestown).
© Sima Dezső, ÓE NIK 274 www.tankonyvtar.hu
4.1 Introduction (5)

The 2. generation Lynnfield chip as a major redesign of the Bloomfield chip (1) [46]

• It is a cheaper and more effective two-chip system solution instead of the previous three
chip solution.
• It is connected to the P55 chipset by a DMI interface rather than by a QPI interface
used in the previous system solution to connect the Bloomfield chip to the X58 chipset.

Intel's Bloomfield Platform (X58 + LGA-1366) Intel's Lynnfield Platform (P55 + LGA-1156)
© Sima Dezső, ÓE NIK 275 www.tankonyvtar.hu
4.1 Introduction (6)

The 2. generation Lynnfield chip as a major redesign of the Bloomfield chip (2) [46]
• It provides PCIe 2.0 lanes (16 to 32 lanes) to attach a graphics card immediately to the
processor rather than to the north bridge (by 36 lanes) as done in the previous solution.

Intel's Bloomfield Platform (X58 + LGA-1366) Intel's Lynnfield Platform (P55 + LGA-1156)
© Sima Dezső, ÓE NIK 276 www.tankonyvtar.hu
4.1 Introduction (7)

The Lynnfield chip as a major redesign of the Bloomfield chip (3) [46]

• It supports only two DDR3 memory channels instead of three as in the previous solution.
• Its socket needs less connections (LGA-1156) than the Bloomfield (LGA-1366).
• All in all the Lynnfield chip is a cheaper and more effective successor of the Bloomfield chip.

Intel's Bloomfield Platform (X58 + LGA-1366) Intel's Lynnfield Platform (P55 + LGA-1156)
© Sima Dezső, ÓE NIK 277 www.tankonyvtar.hu
4.1 Introduction (8)

Die photos of the 1. and 2. gen.


Nehalem desktop processors

First generation: Bloomfield (11/2008)
(263 mm2, 731 mtrs, LGA-1366) [47] [45] [46]
Bloomfield (desktop/DP server)

Second generation: Lynnfield (9/2009) [45]
(296 mm2, 774 mtrs, LGA-1156) [48] [45] [46]

278

4.1 Introduction (9)

The three alternatives of the


2. generation Nehalem processors [45]

Remarks [49]
• All 3 chips are basically the same design.
• In Jasper Forest the circled part is the QPI
controller, however this part remains blank
in the mobile and desktop versions as
these versions do not provide a QPI link.

Clarksfield (mobile) [49]
Lynnfield (desktop) [49]
Jasper Forest (embedded DP server) [49]

279


4.1 Introduction (10)

Remark
The embedded DP server Jasper Forest (Xeon C5500) is not an “all QPI solution”.
It provides a single QPI bus along with a DMI bus for the 3420 chipset (Picket Post Platform).

Figure 4.2: The Picket Post


platform [47]
© Sima Dezső, ÓE NIK 280 www.tankonyvtar.hu
4.1 Introduction (11)

Major innovations of the 1. generation Nehalem processors

• Native 4C
• Simultaneous Multithreading (SMT)
• New cache architecture
• SSE 4.2 ISA extension
• Integrated memory controller
• QuickPath Interconnect bus (QPI)
• Enhanced power management
• Advanced virtualization
• New socket

Figure 4.3: Overview of the major innovations of 1. generation Nehalem processors (based on [22])
(The die photo is that of the Bloomfield/Gainestown processor)

© Sima Dezső, ÓE NIK 281 www.tankonyvtar.hu


4.1 Introduction (12)

Overview of the microarchitecture of Nehalem [69]

© Sima Dezső, ÓE NIK 282 www.tankonyvtar.hu


4.1 Introduction (13)

The front-end part of the microarchitecture of Nehalem [69]

© Sima Dezső, ÓE NIK 283 www.tankonyvtar.hu


4.1 Introduction (14)

The back-end part of the microarchitecture of Nehalem [69]

© Sima Dezső, ÓE NIK 284 www.tankonyvtar.hu


Figure 4.4: Block diagram of Intel’s Core microarchitecture [4]
4.1 Introduction (15)

285
4.2 Simultaneous Multithreading (SMT) (1)

4.2 Simultaneous Multithreading (SMT)


• Two-way multithreaded (two threads at the same time)

Benefits

• The 4-wide core is fed more efficiently (from 2 threads)
• Hides the latency of a single thread
• More performance with low additional die area cost
• May provide significant performance increase on
dedicated applications.

Figure 4.5: Simultaneous Multithreading (SMT) of Nehalem [1]


© Sima Dezső, ÓE NIK 286 www.tankonyvtar.hu
4.2 Simultaneous Multithreading (SMT) (2)

Details of Nehalem’s SMT implementation

Figure 4.6: Details of Nehalem’s SMT implementation [1]

© Sima Dezső, ÓE NIK 287 www.tankonyvtar.hu


4.2 Simultaneous Multithreading (SMT) (3)

Deeper buffers

Core

Pentium M

Figure 4.7: Deeper buffers


288 to support SMT [1]
© Sima Dezső, ÓE NIK www.tankonyvtar.hu
4.2 Simultaneous Multithreading (SMT) (4)

Performance gains of SMT

Figure 4.8: Performance gains achieved by Nehalem’ SMT [1]

© Sima Dezső, ÓE NIK 289 www.tankonyvtar.hu


4.3 New cache architecture (1)

4.3 Enhanced cache architecture

• 2-level cache hierarchy (Penryn): per core 32 kB/32 kB L1 caches; a 4 MB L2 cache shared
  between two cores
• 3-level cache hierarchy (Nehalem): per core 32 kB/32 kB L1 caches; private 256 KB L2 caches
  per core; an inclusive L3 cache of up to 8 MB shared by all cores

Figure 4.9: The 3-level cache architecture of Nehalem [based on 1]

© Sima Dezső, ÓE NIK 290 www.tankonyvtar.hu


4.3 New cache architecture (2)

Distinguished features of Nehalem’s cache architecture

• The L2 cache is private again rather than shared as in the Core and Penryn processors

• Private L2: Pentium 4, Nehalem
• Shared L2: Core, Penryn

Assumed reason for returning to the private scheme

Private caches allow a more effective hardware prefetching than shared ones.

Reason
• Hardware prefetchers look for memory access patterns.
• Private L2 caches have more easily detectable memory access patterns
than shared L2 caches.

© Sima Dezső, ÓE NIK 291 www.tankonyvtar.hu


4.3 New cache architecture (3)

Remark

The POWER family had the same evolution path as above

• Shared L2: POWER4, POWER5
• Private L2: POWER6

© Sima Dezső, ÓE NIK 292 www.tankonyvtar.hu


4.3 New cache architecture (4)

• The L3 cache is inclusive rather than exclusive

as in a number of competing designs, such as UltraSPARC IV+ (2005), POWER5 (2005),


POWER6 (2007), AMD’s K10-based processors (2007).

Intel’s argumentation for inclusive caches [38]

Inclusive L3 caches prevent L2 snoop traffic for L3 cache misses since

• with inclusive L3 caches an L3 cache miss means that the referenced data
doesn’t exist in any core’s L2 caches, thus no L2 snooping is needed.
• By contrast, with exclusive L3 caches the referenced data may exist in any
of the L2 caches, thus L2 snooping is required.

For higher core numbers L2 snooping becomes a more demanding task
and may overshadow the benefits arising from the more efficient cache use
of the exclusive cache scheme.

Demonstration example

Benefits of inclusive caches vs exclusive ones.

© Sima Dezső, ÓE NIK 293 www.tankonyvtar.hu


4.3 New cache architecture (5)

[Figure: an exclusive and an inclusive L3 cache, each shared by cores 0-3.]

• Data request from Core 0 misses Core 0’s L1 and L2
• Request sent to the L3 cache

Figure 4.10: Comparing exclusive and inclusive cache behavior in case of a L3 cache miss (1) [1]

© Sima Dezső, ÓE NIK 294 www.tankonyvtar.hu


4.3 New cache architecture (6)

[Figure: the lookup misses in both the exclusive and the inclusive L3 cache.]

• Core 0 looks up the L3 Cache
• Data not in the L3 Cache

Figure 4.11: Comparing exclusive and inclusive cache behavior in case of a L3 cache miss (2) [1]

© Sima Dezső, ÓE NIK 295 www.tankonyvtar.hu


4.3 New cache architecture (7)

[Figure: L3 cache miss in both the exclusive and the inclusive case.]

• Exclusive: must check the other cores
• Inclusive: guaranteed that the data is not on-die

Figure 4.12: Comparing exclusive and inclusive cache behavior in case of a L3 cache miss (3) [1]

© Sima Dezső, ÓE NIK 296 www.tankonyvtar.hu


4.3 New cache architecture (8)

[Figure: L3 cache hit in both the exclusive and the inclusive case.]

• Exclusive: no need to check the other cores
• Inclusive: the data could be in another core, BUT the Intel® Core™ microarchitecture
(Nehalem) is smart…

Figure 4.13: Comparing exclusive and inclusive cache behavior in case of a L3 cache hit (1) [1]

© Sima Dezső, ÓE NIK 297 www.tankonyvtar.hu


4.3 New cache architecture (9)

Inclusive L3 with “core valid” bits

• Maintain a set of “core valid” bits per cache line in the L3 cache
• Each bit represents a core
• If the L1/L2 of a core may contain the cache line, then the core valid bit is set to “1”
• No snoops of cores are needed if no bits are set
• If more than 1 bit is set, the line cannot be in Modified state in any core

Core valid bits limit unnecessary snoops (see the sketch after this slide).

Figure 4.14: Comparing exclusive and inclusive cache behavior in case of a L3 cache hit (2) [1]

© Sima Dezső, ÓE NIK 298 www.tankonyvtar.hu
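The C sketch below (an illustration of the principle only, not Intel’s actual implementation)
models the core-valid-bit filtering: an L3 miss needs no snoop at all because of inclusion,
and an L3 hit needs to snoop only the cores whose valid bit is set.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hedged sketch: an inclusive L3 line carries one "core valid" bit per core.
   Bit i set means core i's L1/L2 may hold the line. */
struct l3_line {
    bool    present;      /* line present in the L3 cache           */
    uint8_t core_valid;   /* bit i set: core i's L1/L2 may hold it  */
};

/* Returns a bit mask of the cores that must be snooped for this access. */
static uint8_t cores_to_snoop(const struct l3_line *line, unsigned requesting_core)
{
    if (!line->present)
        return 0;   /* L3 miss: inclusion guarantees the data is not on-die */
    /* L3 hit: snoop only the marked cores, excluding the requester */
    return line->core_valid & (uint8_t)~(1u << requesting_core);
}

int main(void)
{
    struct l3_line line = { .present = true, .core_valid = 0x0A };  /* cores 1 and 3 */
    printf("snoop mask: 0x%02X\n", cores_to_snoop(&line, 0));       /* prints 0x0A   */
    return 0;
}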


4.4 Further advanced digital media support (1)

4.4 SSE4.2 ISA extension

Figure 4.15: Nehalem’s new SSE4.2 ISA extension [33]

299
© Sima Dezső, ÓE NIK www.tankonyvtar.hu
4.5 Integrated memory controller (1)

4.5 Integrated memory controller

Main features

• 3 channels per socket
• Up to 3 DIMMs per channel (impl. dependent)
• DDR3-800, 1066, 1333,…
• Supports both RDIMM and UDIMM (impl. dependent)
• Supports single, dual and quad-rank DIMMs

(Nehalem-EP (Efficient Performance): designation of the server line)

Figure 4.16: Integrated memory controller of Nehalem [33]


© Sima Dezső, ÓE NIK 300 www.tankonyvtar.hu
4.5 Integrated memory controller (2)

Benefit of integrated memory controllers

Low memory access latency

important for memory intensive apps.

Drawback of integrated memory controllers

Processor becomes memory technology dependent

For an enhanced memory solution (e.g. for increased memory speed)


a new processor modification is needed.

© Sima Dezső, ÓE NIK 301 www.tankonyvtar.hu


4.5 Integrated memory controller (3)

Non Uniform Memory Access (NUMA)

a consequence of using integrated memory controllers in connection with multi-socket servers

Local memory access Remote memory access

• Most multi-socket platforms use NUMA


• Remote memory access latency ~ 1.7 x longer than local memory access latency
• Local memory bandwidth is up to 2 x greater than remote bandwidth
• Operating systems differ in allocation strategies + APIs

Figure 4.17: Non Uniform Memory Access (NUMA) in multi-socket servers [1]
© Sima Dezső, ÓE NIK 302 www.tankonyvtar.hu
4.5 Integrated memory controller (4)

Memory latency comparison: Nehalem vs Penryn

Harpertown: Quad-Core Penryn based server (Xeon 5400 series)

Figure 4.18: Memory latency comparison: Nehalem vs Penryn [1]

© Sima Dezső, ÓE NIK 303 www.tankonyvtar.hu


4.5 Integrated memory controller (5)

Remark

Intel’s Timna – a forerunner to integrated memory controllers [34]

Timna (announced in 1999, due to 2H 2000, cancelled in Sept. 2000)

• Developed in Intel’s Haifa Design and Development Center.


• Low cost microprocessor with integrated graphics and memory controller
(for Rambus DRAMs).
• As the price of Rambus memories failed to fall as anticipated, Intel decided
to use a bridge chip (Memory Translation Hub, MTH) to attach less expensive SDRAM
chips to the memory controller.
• Due to design problems with the MTH chip and lack of interest from many vendors,
Intel finally cancelled Timna in Sept. 2000.

Figure 4.19: The low cost (<600 $) Timna PC [40]

© Sima Dezső, ÓE NIK 304 www.tankonyvtar.hu


4.5 Integrated memory controller (6)

Figure 4.20: Intel’s roadmap from 1999 showing Timna due to 2H 2000. [35]

© Sima Dezső, ÓE NIK 305 www.tankonyvtar.hu


4.6 QuickPath Interconnect bus (1)

4.6 QuickPath Interconnect bus (QPI)


• A processor interconnect bus, connecting processors to processors or South Bridges.
Note
(The QPI isn’t an I/O interface, the standard I/O interface remains the PCI-Express bus)

• Formerly designated as the Common System Interface bus (CSI bus)

• A serial, high speed differential point-to-point interconnect


(similar to the HyperTransport bus)

• Consists of 2 unidirectional links, one in each direction, called the TX and RX


unidirectional links.

© Sima Dezső, ÓE NIK 306 www.tankonyvtar.hu


4.6 QuickPath Interconnect bus (2)

Signals of the QuickPath Interconnect bus (QPI bus)

TX Unidirectional link

RX Unidirectional link

16 data
2 protocol
2 CRC

Figure 4.21: Signals of the QuickPath Interconnect bus (QPI-bus) [22]


© Sima Dezső, ÓE NIK 307 www.tankonyvtar.hu
4.6 QuickPath Interconnect bus (3)

QuickPath Interconnect bus (QPI)


• A processor interconnect bus, connecting processors to processors or South Bridges.
Note
(The QPI isn’t an I/O interface, the standard I/O interface remains the PCI-Express bus)

• Formerly designated as the Common System Interface bus (CSI bus)

• A serial, high speed differential point-to-point interconnect


(similar to the HyperTransport bus).

• Consists of 2 unidirectional links, one in each direction, called the TX and RX


unidirectional links.

• Each unidirectional link comprises 20 data lanes and a clock lane, with
each lane consisting of a pair of differential signals.

© Sima Dezső, ÓE NIK 308 www.tankonyvtar.hu


4.6 QuickPath Interconnect bus (4)

Signals of the QuickPath Interconnect bus (QPI)

TX Unidirectional link

RX Unidirectional link

16 data
2 protocol
2 CRC

Figure 4.22: Signals of the QuickPath Interconnect bus (QPI-bus) [22]


© Sima Dezső, ÓE NIK 309 www.tankonyvtar.hu
4.6 QuickPath Interconnect bus (5)

QuickPath Interconnect bus (QPI)


• A processor interconnect bus, connecting processors to processors or South Bridges.

Note
(The QPI isn’t an I/O interface, the standard I/O interface remains the PCI-Express bus)

• Formerly designated as the Common System Interface bus (CSI bus)

• A serial, high speed differential point-to-point interconnect


(similar to the HyperTransport bus).

• Consists of 2 unidirectional links, one in each direction, called the TX and RX links.

• Each unidirectional link comprises 20 data lanes (16 data, 2 protocol, 2 CRC) and a clock lane,
with each lane consisting of a pair of differential signals.

• Number of QPI links: up to 4, implementation dependent.


(Desktop processors: 1, DP processors: 2, MP processors: 4)

© Sima Dezső, ÓE NIK 310 www.tankonyvtar.hu


4.6 QuickPath Interconnect bus (6)

QPI based DP and MP server system architectures

Figure 4.23: QPI based DP and MP server system architectures [31], [33]

© Sima Dezső, ÓE NIK 311 www.tankonyvtar.hu


4.6 QuickPath Interconnect bus (7)

Comparison of the transfer rates of the QPI, FSB and HT buses

QuickPath Interconnect Bus (QPI-bus)


3.2 GHz DDR 2-Byte 12.8 GB/s in each direction

Fastest FSB
400 MHz QDR 8 Byte 12.8 GB/s bidirectional

HyperTransport Bus (HT-bus)

Typical speed and width figures in AMD’s systems


HT 1.0: 0.8 GHz DDR 2-Byte 3.2 GB/s in each direction
HT 2.0: 1.0 GHz DDR 2-Byte 4.0 GB/s in each direction
HT 3.0: 2.6 GHz DDR 2-Byte 10.4 GB/s in each direction

DDR: Double Data Rate
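How the 12.8 GB/s QPI figure above follows from the link parameters (an illustrative calculation only):

#include <stdio.h>

int main(void)
{
    double clock_ghz   = 3.2;  /* QPI link clock */
    double ddr_factor  = 2.0;  /* two transfers per clock (Double Data Rate) */
    double width_bytes = 2.0;  /* 16 data lanes = 2 bytes per transfer */

    /* 3.2 GHz x 2 x 2 B = 12.8 GB/s per direction */
    printf("QPI: %.1f GB/s in each direction\n", clock_ghz * ddr_factor * width_bytes);
    return 0;
}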

© Sima Dezső, ÓE NIK 312 www.tankonyvtar.hu


4.6 QuickPath Interconnect bus (8)

The “Uncore”

Figure 4.24: Interpretation of the notion “Uncore” [1]

© Sima Dezső, ÓE NIK 313 www.tankonyvtar.hu


4.7 More advanced power management (1)

4.7 Enhanced power management


Discussed issues

• Integrated power gates


• Integrated Power Control Unit
• Turbo boost technology

© Sima Dezső, ÓE NIK 314 www.tankonyvtar.hu


4.7 More advanced power management (2)

Power switches

Figure 4.25: Use of integrated power gates [32]

© Sima Dezső, ÓE NIK 315 www.tankonyvtar.hu


4.7 More advanced power management (3)

Allows to manage the power


of the whole chip as an entity.
Used also for the Turbo Boost Mode

Figure 4.26: Overview of the Power Control unit [32]


© Sima Dezső, ÓE NIK 316 www.tankonyvtar.hu
4.7 More advanced power management (4)

Nehalem’s Turbo Mode


Aim

Utilization of the power headroom of inactive cores and that of active cores with light workload
for increasing clock frequency.
Remark
The Penryn core already introduced a less intricate technology for the same purpose,
termed the Enhanced Dynamic Acceleration Technology, which increases the clock frequency
only on the mobile platform and only when other cores are inactive.

© Sima Dezső, ÓE NIK 317 www.tankonyvtar.hu


4.7 More advanced power management (5)

Understanding the notion of TDP and the related potential to boost performance (1) [50]

TDP (Thermal Design Power) is the maximum power consumed at realistic worst case
applications (TDP application).
The thermal solution (cooling system) needs to ensure that the junction temperature (Tj)
at maximum core frequency specified in connection with TDP does not exceed the
junction temperature limit (Tjmax) while the processor runs TDP applications.
Example
The mobile quad core Clarksfield processor i7-920XM has
• a TDP of 55 W
• ACPI P-states from 1.2 GHz (Low Frequency Mode) to 2.0 GHz (High Frequency Mode)
available to implement DBS (Demand Based Switching) of fc and Vcc, and
• allows to increase fc in turbo mode from 2.0 GHz up to 3.20 GHz.

The maximum clock frequency related to TDP (2 GHz in the above example) is determined
while running (worst case) TDP applications that intensively utilize all four cores such that
at this frequency dissipation still remains below TDP (i.e. 55 W in the above example).
© Sima Dezső, ÓE NIK 318 www.tankonyvtar.hu
4.7 More advanced power management (6)

Understanding the notion of TDP and the related potential to boost performance (2) [50]

Typical workloads however, are not intensive enough to push power consumption to the TDP limit.
The remaining power headroom can be utilized to increase fc if

• the OS requests the highest performance ACPI state (P0 state)


• provided that the processor operates within its TDP and temperature limits.

The possible frequency increase depends on the intensity of the workload and the number of
active cores.

© Sima Dezső, ÓE NIK 319 www.tankonyvtar.hu


4.7 More advanced power management (7)

Principle of the turbo boost technology (1) [51]

If the OS requests an active core to increase fc beyond the TDP limited maximum frequency
(i.e. to enter the P0 state),
and there is an available power headroom
• either by having idle cores
• or a lightly threaded workload
the turbo mode controller will increase the core frequency of the active cores
provided that the power consumption of the socket and junction temperatures of the cores
do not exceed the given limits.

In turbo mode all active cores in the processor will operate at the same fc and voltage.

© Sima Dezső, ÓE NIK 320 www.tankonyvtar.hu


4.7 More advanced power management (8)

Principle of the turbo boost technology (2) [52]

Figure 4.27: Turbo mode uses the available power headroom in processor package power limits [52]
© Sima Dezső, ÓE NIK 321 www.tankonyvtar.hu
4.7 More advanced power management (9)

Turbo boost frequencies in case of inactive cores [53]

For inactive cores, the turbo mode controller will increase fc to a maximum turbo frequency
that depends on the number of active cores provided that actual power and temperature values
remain below specified limits.
Maximum turbo frequencies are factory configured and kept in an internal register (MSR 1ADH).
E.g. in case of a single active core the Core i7-920XM will increase fc to 3.2 GHz,
which is 8 frequency bins higher than the TDP limited core frequency of 2.0 GHz.

The turbo boost technology considers a core “active” if it is in ACPI C0 or C1 states,


whereas cores in the C3 or C6 ACPI states are considered as “inactive”.
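The factory configured maximum turbo ratios mentioned above can be inspected on Linux through the msr driver; the sketch below is a hedged illustration (requires root and 'modprobe msr'; the exact layout of MSR 1ADH is generation dependent):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    if (pread(fd, &val, sizeof(val), 0x1AD) != (ssize_t)sizeof(val)) {  /* MSR 1ADH */
        perror("pread");
        close(fd);
        return 1;
    }
    /* On Nehalem class parts each byte holds the maximum turbo ratio
       (in units of the 133 MHz base clock) for 1, 2, ... active cores. */
    for (int active = 1; active <= 4; active++) {
        unsigned ratio = (unsigned)((val >> (8 * (active - 1))) & 0xFF);
        printf("%d active core(s): max turbo = %u x 133 MHz\n", active, ratio);
    }
    close(fd);
    return 0;
}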

© Sima Dezső, ÓE NIK 322 www.tankonyvtar.hu


4.7 More advanced power management (10)

Increasing/decreasing turbo boost frequencies [51]

• If OS is requesting the ACPI P0 state


and the calculated power consumption of the package and measured junction temperatures
(Tj) of the cores remain below factory configured limits
the turbo boost controller automatically steps up core frequency typically by 133 MHz
until it reaches the maximum frequency dictated by the number of active cores or
the intensity of the workload.
• When the power consumption of the package or the junction temperature of any core
exceeds the factory configured limits, the turbo boost controller automatically steps down
core frequency in increments of e.g. 133 MHz.

Remark
In the above example the 133 MHz is the basic frequency that will be multiplied by the PLL
by an appropriate factor to generate the clock frequency fc.
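The stepping behaviour described above can be summarized by the following simplified control-loop sketch (purely illustrative; the real controller is firmware in the Power Control Unit, and all names below are made up for this example):

#include <stdbool.h>

#define BCLK_MHZ 133   /* base clock that the PLL multiplies */

typedef struct {
    double package_power_w, power_limit_w;  /* calculated power vs. factory limit */
    double max_tj_c,        tj_limit_c;     /* hottest core Tj vs. factory limit  */
    int    ratio, tdp_ratio, max_turbo_ratio;
} turbo_state_t;

/* Invoked periodically (the text quotes a 5 ms sampling interval). */
void turbo_step(turbo_state_t *s, bool os_requests_p0)
{
    bool within_limits = (s->package_power_w < s->power_limit_w) &&
                         (s->max_tj_c        < s->tj_limit_c);

    if (os_requests_p0 && within_limits && s->ratio < s->max_turbo_ratio)
        s->ratio++;                         /* step fc up by one bin (~133 MHz) */
    else if (!within_limits && s->ratio > s->tdp_ratio)
        s->ratio--;                         /* step fc back down by one bin     */

    /* resulting core frequency: fc = ratio * BCLK_MHZ */
}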

© Sima Dezső, ÓE NIK 323 www.tankonyvtar.hu


4.7 More advanced power management (11)

Assuring that power and temperature values do not exceed specified limits [53], [50]

A precondition for increasing fc is that the power consumption of the package and
the junction temperature of the cores do not exceed given limits.
To assure this the turbo boost controller samples the current power consumption and
die temperatures in 5 ms intervals [53].
Power consumption is determined by monitoring the processor current at its input pin as well
as the associated voltage (Vcc) and calculating the power consumption as a moving average.
The junction temperature of the cores are monitored by DTSs (Digital Thermal Sensors) with
an error of ± 5 % [50].

© Sima Dezső, ÓE NIK 324 www.tankonyvtar.hu


4.8 Enhanced virtualization (1)

4.8 Enhanced Virtualization


Very wide area, not discussed.
(See e.g. [41])

© Sima Dezső, ÓE NIK 325 www.tankonyvtar.hu


4.9 New socket (1)

Figure 4.28: Die shot of a DP (guessed) Nehalem [36]
© Sima Dezső, ÓE NIK 326 www.tankonyvtar.hu
4.9 New socket (2)

4.9 LGA1366 socket

Figure 4.29: New LGA1366 socket [22]

© Sima Dezső, ÓE NIK 327 www.tankonyvtar.hu


5. Nehalem-EX lines

• 5.1 Introduction

• 5.2 Native 8 cores with 24 MB L3 cache (LLC)

• 5.3 On-die ring interconnect bus

• 5.4 Serial memory channels

• 5.5 Scalable platform configurations

• 5.6 Extending the turbo boost technology to 8 cores


• 5.7 Overview of the 2. generation Nehalem based
(Nehalem-EX based) processor lines

© Sima Dezső, ÓE NIK 328 www.tankonyvtar.hu


5.1 Introduction (1)

5.1 Introduction
• Nehalem-EX based DP/MP servers are designated also as the Beckton family.
• They include only server processors.
• The first Beckton processors were delivered in 3/2010.

Major innovations of the Nehalem-EX processors


• Native 8 cores with 24 MB L3 cache (LLC)
• On-die ring interconnect
• Serial memory channels (designated as the scalable memory interface, SMI)
• Enhanced support for RAS, virtualization and power control (not detailed here)
• Scalable platform configurations
• Extending turbo boost to 8 cores

© Sima Dezső, ÓE NIK 329 www.tankonyvtar.hu


5.1 Introduction (2)

Block diagram of the Nehalem-EX 7500 processor [54]

© Sima Dezső, ÓE NIK 330 www.tankonyvtar.hu


5.2 Native 8 cores with 24 MB L3 cache (LLC)

5.2 Native 8 cores with 24 MB L3 cache (LLC) [55]

© Sima Dezső, ÓE NIK 331 www.tankonyvtar.hu


5.3 On-die ring interconnect bus

5.3 On-die ring interconnect bus [56]

© Sima Dezső, ÓE NIK 332 www.tankonyvtar.hu


5.4 Serial memory channels (1)

5.4 Serial memory channels [55]

Pawlowski IDF 2009

© Sima Dezső, ÓE NIK 333 www.tankonyvtar.hu


5.4 Serial memory channels (2)

Nehalem-EX processors provide an FB-DIMM like memory subsystem [55].

Remark: The SMI interface was formerly designated as the Fully Buffered DIMM2 interface
© Sima Dezső, ÓE NIK 334 www.tankonyvtar.hu
5.5 Scalable platform configurations

5.5 Scalable platform configurations [55]


Nehalem-EX allows platform scaling from 2 to 256 sockets.

© Sima Dezső, ÓE NIK 335 www.tankonyvtar.hu


5.6 Extending the turbo boost technology to 8 cores

5.6. Extending the turbo boost technology to 8 cores [67]

© Sima Dezső, ÓE NIK 336 www.tankonyvtar.hu


5.7 Overview of the 2. generation Nehalem based processor lines (1)

5.7 Overview of the 2. generation Nehalem based (Nehalem-EX based)


processor lines (based on [44])
Servers

DP-Servers
65xx Beckton (Nehalem-EX) 8C 3/2010
MP-Servers
75xx Beckton (Nehalem-EX) 8C 3/2010

© Sima Dezső, ÓE NIK 337 www.tankonyvtar.hu


5.7 Overview of the 2. generation Nehalem based processor lines (2)

Performance features of the 8-core Nehalem-EX based Xeon 7500 vs the Penryn based
6-core Xeon 7400 [67]

© Sima Dezső, ÓE NIK 338 www.tankonyvtar.hu


6. Westmere

• 6.1 Introduction

• 6.2 Native 6 cores with 12 MB L3 cache (LLC)

• 6.3 In-package integrated CPU/GPU

• 6.4 Enhanced turbo boost technology

© Sima Dezső, ÓE NIK 339 www.tankonyvtar.hu


6.1 Introduction (1)

6.1 Introduction
• Westmere (formerly Nehalem-C) is the 32 nm die shrink of Nehalem.
• First Westmere-based processors were launched in 1/2010

© Sima Dezső, ÓE NIK 340 www.tankonyvtar.hu


6.1 Introduction (2)

Announcing Intel’s Westmere lines [140] (from Part 4)

© Sima Dezső, ÓE NIK 341 www.tankonyvtar.hu


6.1 Introduction (3)

Westmere family

Westmere lines Westmere-EX lines


(only servers)

Mobiles
Core i3-3xxM Arrandale 2C+G 1/2010
Core i5-4xxM Arrandale 2C+G 1/2010
Core i5-5xxM Arrandale 2C+G 1/2010
Core i7-6xxM Arrandale 2C+G 1/2010

Desktops
Core i3-5xx Clarkdale 2C+G 1/2010
Core i5-6xx Clarkdale 2C+G 1/2010
Core i7-9xx/9xxX Gulftown 6C 3/2010

Servers

Westmere lines:
  UP-Servers: 36xx Gulftown (Westmere-EP) 6C 3/2010
  DP-Servers: 56xx Gulftown (Westmere-EP) 6C 3/2010

Westmere-EX lines:
  UP-Servers: E7-28xx Westmere-EX 10C 4/2011
  DP-Servers: E7-28xx Westmere-EX 10C 4/2011
  MP-Servers: E7-48xx Westmere-EX 10C 4/2011
              E7-88xx Westmere-EX 10C 4/2011

Data based on [44]

© Sima Dezső, ÓE NIK 342 www.tankonyvtar.hu


6.1 Introduction (4)

1. Westmere 2-core and 6-core die plots [57]

2-core die plot 6-core die plot


Arrandale (mobile) Gulftown (Westmere-EP) (desktop/UP/DP server)
Clarkdale (desktop)

© Sima Dezső, ÓE NIK 343 www.tankonyvtar.hu


6.1 Introduction (5)

1. Westmere 2-core mobile/desktop and 6-core UP/DP server platforms [57]

Westmere 2-core mobile and desktop platform Westmere 6-core UP/DP server platform

© Sima Dezső, ÓE NIK 344 www.tankonyvtar.hu


6.1 Introduction (6)

Key improvements of the Westmere lines vs the Nehalem lines [44]

• Increased number of cores for UP/DP servers.


Native 6 core processors vs 4 cores of the 1. generation Nehalem processors.
• Enlarged L3 cache for UP/DP servers.
Native 12 MB L3 cache vs 8 MB available in Nehalem-based servers.
• In-package integrated graphics for the mobile and desktop segments.
• Enhanced support for AES (Advanced Encryption Standard) by providing a set of
instructions to perform hardware accelerated encryption/decryption (not discussed here).
• Enhanced support for virtualization (not discussed here).
• Over 100 incremental improvements in the microarchitecture [58] (not discussed here).
• Enhanced Turbo Boost technology.

© Sima Dezső, ÓE NIK 345 www.tankonyvtar.hu


6.2 Native 6 cores with 12 MB L3 cache

6.2 Native 6 cores with 12 MB L3 cache (LLC) for UP/DP servers [58]

© Sima Dezső, ÓE NIK 346 www.tankonyvtar.hu


6.3 In-package integrated CPU/GPU (1)

6.3 In-package integrated CPU/GPU processors for the mobile and the
desktop segments
Example

In-package integrated CPU/GPU of the mobile Arrandale line [137] (from Part 4)

32 nm CPU/45 nm discrete GPU

© Sima Dezső, ÓE NIK 347 www.tankonyvtar.hu


6.3 In-package integrated CPU/GPU (2)

In-package integrated CPU/GPU Westmere lines

Mobile lines Desktop lines

Arrandale lines Clarkdale lines

i3 3xxM 2C i3 5xx 2C
i5 4xx/5xx 2C i5 6xx 2C
i7 6xx 2C

CPU/GPU components
CPU: Hillel (32 nm Westmere architecture)
(Enhanced 32 nm shrink of the
45 nm Nehalem architecture)
GPU: Ironlake (45 nm)
Shader model 4, DX10 support

© Sima Dezső, ÓE NIK 348 www.tankonyvtar.hu


6.3 In-package integrated CPU/GPU (3)

Basic components of Intel’s mobile Arrandale line [123] Part4

32 nm CPU (Hillel)
(Mobile implementation of the Westmere
basic architecture,
which is the 32 nm shrink of the
45 nm Nehalem basic architecture) 45 nm GPU (Ironlake)
Intel’s GMA HD (Graphics Media Accelerator)
(12 Execution Units, Shader model 4, no OpenCL support)
© Sima Dezső, ÓE NIK 349 www.tankonyvtar.hu
6.3 In-package integrated CPU/GPU (4)

Key specifications of Intel’s Arrandale line [139] Part4

http://www.anandtech.com/show/2902

350
6.3 In-package integrated CPU/GPU (5)

Intel’s i3/i5 in-package integrated CPU/GPU desktop Clarkdale line

Figure 6.1: The Clarkdale processor with in-package integrated graphics along with the H57 chipset
[140] (from Part 4)

© Sima Dezső, ÓE NIK 351 www.tankonyvtar.hu


6.3 In-package integrated CPU/GPU (6)

Key features of the Clarkdale line [141] Part4

© Sima Dezső, ÓE NIK 352 www.tankonyvtar.hu


6.3 In-package integrated CPU/GPU (7)

Integrated Graphics Media (IGM) architecture of Clarkdale [141] Part4

353
6.3 In-package integrated CPU/GPU (8)

Remark

In Jan. 2011 Intel replaced their in-package integrated CPU/GPU lines with the
on-die integrated Sandy Bridge line.

© Sima Dezső, ÓE NIK 354 www.tankonyvtar.hu


6.4 Enhanced turbo boost technology (1)

6.4 Enhanced turbo boost technology [57]


The 2-core mobile version (Arrandale) with its in-package integrated graphics
extends the turbo boost technology to both the graphics and the integrated memory controller
by using a “two point” thermal design.
The driver controls the power sharing between the cores and the integrated graphics.
The desktop version of Westmere (Clarkdale) and the server lines (Westmere-EP, Westmere-EX)
make use of a Nehalem-like turbo boost technology.

© Sima Dezső, ÓE NIK 355 www.tankonyvtar.hu


6.4 Enhanced turbo boost technology (2)

The two point thermal design (1) [59]

It includes two extreme design points

• The processor cores are operating at maximum thermal power level (which is greater
than their TDP) and the integrated graphics and the integrated memory controller
are operating at their minimum thermal power.
• The integrated graphics operates at its maximum thermal power level, while
the processor cores consume the remaining MCP package power limit.

HFM: High Frequency Mode (highest P-state)


LFM: Low Frequency Mode (lowest P-state)

© Sima Dezső, ÓE NIK 356 www.tankonyvtar.hu


6.4 Enhanced turbo boost technology (3)

The two point thermal design (2) [59]

• Processor core currents are monitored by the processor input pin and calculated using a
moving average.
• When the power limit is reached power sharing control will adaptively remove the turbo boost
states to remain within the MCP thermal power limit.
• Errors in power estimation or measurement can significantly impact or completely eliminate
the performance benefit of the turbo boost technology.

© Sima Dezső, ÓE NIK 357 www.tankonyvtar.hu


6.4 Enhanced turbo boost technology (4)

The two point thermal design (3) [59]

NHM/M

(WSM/M)
NHM/D (WSM/D)

Kahn, Piazza, Valentine IDF 2010


Figure 6.2: Implementation alternatives of Intel’s turbo boost technologies [59]
© Sima Dezső, ÓE NIK 358 www.tankonyvtar.hu
6.4 Enhanced turbo boost technology (5)

The two point thermal design (4) [59]

• For the two point thermal design it must however be ensured that the
component Tjmax limits are not exceeded when either component is operating
at its extreme thermal power limit.

• The junction temperatures of the cores, integrated graphics and memory controller are
monitored by their respective DTS (Digital Thermal Sensor).
A DTS outputs a temperature relative to the maximum supported junction temperature.
The error associated with DTS measurements will not exceed ± 5 % within the operating range.

© Sima Dezső, ÓE NIK 359 www.tankonyvtar.hu


7. Westmere-EX

• 7.1 Introduction

• 7.2 Native 10 cores with 30 MB L3 cache (LLC)

• 7.3 Overview of the 2. generation Westmere based (Westmere-EX based) processor lines

© Sima Dezső, ÓE NIK 360 www.tankonyvtar.hu


7.1 Introduction (1)

7.1 Introduction
• Westmere-EX processors are 32 nm die shrinks of the 45 nm Nehalem-EX line.
• First Westmere-EX processors were shipped in 4/2011.
• They are socket compatible with the Nehalem-EX line (Xeon 75xx or Beckton line).

Key improvements of the Westmere-EX processors vs Nehalem-EX server processors

Native 10 cores with 30 MB of L3 cache (LLC) vs native 8 cores with 24 MB L3 cache (LLC)
in order to compete with AMD’s 2x6 core (dual chip) Magny-Cours processors.

© Sima Dezső, ÓE NIK 361 www.tankonyvtar.hu


7.1 Introduction (2)

Basic building blocks of the Westmere-EX processors [60]

© Sima Dezső, ÓE NIK 362 www.tankonyvtar.hu


7.1 Introduction (3)

Block diagram of the Westmere-EX E7-88xx/48xx/28xx processors [70]

© Sima Dezső, ÓE NIK 363 www.tankonyvtar.hu


7.2 Native 10 cores with 30 MB L3 cache (LLC)

7.2 Native 10 cores with 30 MB L3 cache (LLC) [60]

© Sima Dezső, ÓE NIK 364 www.tankonyvtar.hu


7.3 Overview of the Westmere-EX based processor lines

7.3 Overview of the Westmere-EX based processor lines (based on [44])


Servers

UP-Servers
E7-28xx Westmere-EX 10C 4/2011
DP-Servers
E7-28xx Westmere-EX 10C 4/2011

MP-Servers
E7-48xx Westmere-EX 10C 4/2011
E7-88xx Westmere-EX 10C 4/2011

© Sima Dezső, ÓE NIK 365 www.tankonyvtar.hu


8. Sandy Bridge

• 8.1 Introduction

• 8.2 Advanced Vector Extension (AVX)


• 8.3 Decoded µops cache

• 8.4 On-die ring interconnect bus

• 8.5 On-die integrated graphics unit


• 8.6 Enhanced turbo boost technology

© Sima Dezső, ÓE NIK 366 www.tankonyvtar.hu


8.1 Introduction (1)

8.1 Introduction
• Sandy Bridge is Intel’s new microarchitecture using 32 nm line width.
• First delivered in 1/2011

© Sima Dezső, ÓE NIK 367 www.tankonyvtar.hu


8.1 Introduction (2)

Main functional units of Sandy Bridge [143] Part 4

Slide annotations (main functional units of the die): 4 cores, each with 256 KB L2 (9 clk);
Hyperthreading; 32K L1D (3 clk); AES instructions; AVX 256 bit, 4 operands; VMX unrestricted;
~20 mm2/core; integrated graphics @ 1.0 - 1.4 GHz, connected to the L3; shared L3 (25 clk);
ring architecture, 256 b/cycle; PCIe 2.0; DDR3-1600, 25.6 GB/s;
32 nm process / ~225 mm2 die size / 85W TDP

© Sima Dezső, ÓE NIK 368 www.tankonyvtar.hu


8.1 Introduction (3)

The microarchitecture of Sandy Bridge [61]

© Sima Dezső, ÓE NIK 369 www.tankonyvtar.hu


8.1 Introduction (4)

Key features and benefits of the Sandy Bridge line vs the 1. generation Nehalem line [61]

370
8.1 Introduction (5)

Overview of the Sandy Bridge based processor lines

Mobiles
Core i3-23xxM, 2C, 2/2011
Core i5-24xxM/25xxM, 2C, 2/2011
Core i7-26xxQM/27xxQM/28xxQM, 4C, 1/2011
Core i7 Extreme-29xxXM , 4C, Q1 2011
Desktops
Core i3-21xx, 2C, 2/2011
Core i5-23xx/24xx/25xx, 4C, 1/2011
Core i7-26xx, 4C, 1/2011

Servers
UP-Servers
E3 12xx, Sandy Bridge-H2, 4C, 3/2011
DP-Servers
E5 2xxx, Sandy Bridge-EP, up to 8C, Q4/2011
MP-Servers
E5 4xxx, Sandy Bridge-EX, up to 8C, Q1/2012

Based on [62] and [63]


© Sima Dezső, ÓE NIK 371 www.tankonyvtar.hu
8.2 Advanced Vector Extension (AVX) (1)

8.2 Advanced Vector Extension


(AVX)
Introduction of AVX

Sandy Bridge

Figure 8.1: Evolution of the SIMD processing width [18] (from BMA)
© Sima Dezső, ÓE NIK 372 www.tankonyvtar.hu
8.2 Advanced Vector Extension (AVX) (2)

8 MM registers (64-bit),
aliased on the FP Stack registers

8 XMM registers (128-bit)

16 XMM registers (128-bit)

Northwood (Pentium 4)

Larrabee: large number


of registers (512-bit)

16 YMM registers (256-bit)

Ivy Bridge

Figure 8.2: Intel’s x86 ISA extensions - the SIMD register space (based on [18]) BMA
© Sima Dezső, ÓE NIK 373 www.tankonyvtar.hu
8.2 Advanced Vector Extension (AVX) (3)

Details of AVX [64]

© Sima Dezső, ÓE NIK 374 www.tankonyvtar.hu


8.3 Decoded µops cache (1)

8.3 Decoded µop cache [61]

1.5 K µops

© Sima Dezső, ÓE NIK 375 www.tankonyvtar.hu


8.3 Decoded µops cache (2)

Remark [65]

A cache of decoded µops was already introduced by Intel in the ill-fated Pentium 4 (2000),
designated as the Trace Cache (keeping 12 K µops).

© Sima Dezső, ÓE NIK 376 www.tankonyvtar.hu


8.4 On-die ring interconnect bus (1)

8.4 The on die ring interconnect bus of Sandy Bridge [66]

Six bus agents.

The four cores and the


L3 slices share interfaces.

© Sima Dezső, ÓE NIK 377 www.tankonyvtar.hu


8.4 On-die ring interconnect bus (2)

Details of the on-die ring interconnect bus [64]

© Sima Dezső, ÓE NIK 378 www.tankonyvtar.hu


8.5 On-die integrated graphics unit (1)

8.5 Sandy Bridge’s integrated graphics unit [102] Part4

12 EUs

© Sima Dezső, ÓE NIK 379 www.tankonyvtar.hu


8.5 On-die integrated graphics unit (2)

Specification data of the HD 2000 and HD 3000 graphics [125] Part 4

© Sima Dezső, ÓE NIK 380 www.tankonyvtar.hu


8.5 On-die integrated graphics unit (3)

Performance comparison: gaming [126] part 4

HD5570
400 ALUs

i5/i7 2xxx/3xxx:
Sandy Bridge

i5 6xx
Arrandale

frames per sec


© Sima Dezső, ÓE NIK 381 www.tankonyvtar.hu
8.6 Enhanced turbo boost technology (1)

8.6 Enhanced turbo boost technology [64]


Innovative concept of the 2.0 generation Turbo Boost technology

The concept utilizes the real temperature response of processors to power changes
in order to increase the extent of overclocking [64]

Cooler Thermal capacitance

© Sima Dezső, ÓE NIK 382 www.tankonyvtar.hu


8.6 Enhanced turbo boost technology (2)

Concept: Use thermal energy budget accumulated during idle periods to push the core
beyond the TDP for short periods of time (e.g. for 20 sec).

Multiple algorithms manage in parallel current, power and die temperature. [64]
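A toy model of the energy budget idea above (an illustration only; the real algorithms and their parameters are implemented in the Power Control Unit and are not public in this form):

typedef struct {
    double budget_j;       /* accumulated thermal headroom, in joules                 */
    double budget_max_j;   /* capped by the thermal capacitance of the package/cooler */
    double tdp_w;
} energy_budget_t;

/* Returns 1 if running at 'power_w' for another 'dt_s' seconds is still allowed. */
int allow_power(energy_budget_t *e, double power_w, double dt_s)
{
    double delta = (e->tdp_w - power_w) * dt_s;  /* below TDP refills, above TDP drains */
    double next  = e->budget_j + delta;

    if (next < 0.0)
        return 0;                                /* budget exhausted: fall back to TDP  */
    e->budget_j = (next > e->budget_max_j) ? e->budget_max_j : next;
    return 1;
}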
© Sima Dezső, ÓE NIK 383 www.tankonyvtar.hu
8.6 Enhanced turbo boost technology (3)

Intelligent power sharing between the cores and the integrated graphics [64]

© Sima Dezső, ÓE NIK 384 www.tankonyvtar.hu


8.6 Enhanced turbo boost technology (4)

Intelligent power sharing between the cores and the integrated graphics [68]

© Sima Dezső, ÓE NIK 385 www.tankonyvtar.hu


8.6 Enhanced turbo boost technology (5)

Turbo boost implementation alternatives (NHM/M, WSM/M, NHM/D, WSM/D) [61]
386
8.6 Enhanced turbo boost technology (6)

Remark

• Individual cores may run at different frequencies but all cores share the same power plane.
• Individual cores may be shut down if idle by power gates.

© Sima Dezső, ÓE NIK 387 www.tankonyvtar.hu


9. Overview of the evolution

© Sima Dezső, ÓE NIK 388 www.tankonyvtar.hu


9. Overview of the evolution (1)

Pentium 4 Core Penryn Nehalem


E4xxx/E6xxx E7xxx/E8xxx i7-9xx
(180/130/90 nm) (65 nm) (45 nm) (45 nm)
Processor
3-wide core 4-wide core
width
2 x 64-bit FP/SSE EUs 3 x 128-bit FP/SSE EUs
(1 complex + 1 simple) (3 complex units)
Enhanced Loop Stream
Loop Stream Detector
Detector

Microfusion

Macrofusion

Radix-16 divider

2-way SMT
SMT (with deeper buffers
to support SMT)
Private L2 caches
Cache Private L2 caches Shared L2 caches (256 KB/core)
architecture (up to 2 MB/core) (4 MB/2 core) (6 MB/2 core) + Shared L3 cache
(up to 8 MB)
Memory
Store to Load forwarding
accesses
Loads bypass both Loads
Loads bypass only Loads
and Stores
8 prefetchers
in DC processors
Single L2 prefetchers
(2 x L1 D$, 1 x L1 I$ per
core, 2 x L2)

Table 9.1: Evolution of the main features of Intel’s basic cores (1)
© Sima Dezső, ÓE NIK 389 www.tankonyvtar.hu
9. Overview of the evolution (2)

Pentium 4 Core Penryn Nehalem


E4xxx/E6xxx E7xxx/E8xxx i7-9xx
(180/130/90 nm) (65 nm) (45 nm) (45 nm)
ISA SSE3 Supplemental SSE3 SSE 4.1 SSE 4.2
Extension (DSP-oriented FP, (Diverse arithmetic (Media acceleration: (Accelerated string/text
enhanced thread enhancements) video encoding, gaming) manipulation,
manipulation) appl. targeted acceleration)

Support Super Shuffle Engine


Streaming Loads

- Integrated memory cont.


Support of the
syst. arch. - QuickPath
Interconnect bus

Power
Detailed subsequently
management

Support of
Not discussed
virtualization

Table 9.2: Evolution of the main features of Intel’s basic cores (2)

© Sima Dezső, ÓE NIK 390 www.tankonyvtar.hu


9. Overview of the evolution (3)

Pentium 4 Pentium 4 P4 P4 Core Penryn Nehalem


5xx/ 6xx/
Willamette --- Prescott
5x1 6x1
E4xxx/
E6xxx
E7xxx/
E8xxx
i7-9xx

(180 nm/478 pins) (90 nm/775 pins) (90 nm) (90/65 nm) (65 nm) (45 nm) (45 nm)
Protection Thermal Monitor 1
of (TM1) ---
overheating Hardware controlled
Turning off and on Adaptive
the clock Thermal
(Clock modulation) Monitor
Thermal Monitor 2 First activate
(TM2) TM2 and if not
enough,
Hardware controlled
activate
switching to a
second operating also TM1
state with reduced
fc and VID
(C1E state)

Table 9.3: Evolution of thermal management (1)

© Sima Dezső, ÓE NIK 391 www.tankonyvtar.hu


9. Overview of the evolution (4)

Pentium 4 Pentium 4 P4 P4 Core Penryn Nehalem


5xx/ 6xx/
Willamette
--- Prescott
5x1 6x1
E4xxx/
E6xxx
E7xxx/
E8xxx
i7-9xx

(180 nm/478 pins) (90 nm/775 pins) (90 nm) (90/65 nm) (65 nm) (45 nm) (45 nm)
Reducing EIST
power (Enhanced Intel
consumption Speed Step
of active Technology)
processors OS controlled
--- switching to
multiple P-states
(Power states)
in less active
periods
to reduce power
consumption
Ultra fine
grained
power
control
Shutting down
not needed
proc. units

Bus splitting
Activating not
needed bus
lines

Table 9.4: Evolution of thermal management (2)

© Sima Dezső, ÓE NIK 392 www.tankonyvtar.hu


9. Overview of the evolution (5)

Pentium 4 Core Penryn Nehalem


Willamette --- E4xxx/E6xxx E7xxx/E8xxx i7-9xx
(180 nm/478 pins) (65 nm) (45 nm) (45 nm)
Reducing Multiple C-states/
power S-states ---
consumption (CPU-states/Sleep states)
of less active Os controlled switching to
or idle states with increasingly
processors higher power savings but
longer wake-up times

Deep Power Down State


(DPD)
Saving core state in
dedicated SRAM and
reducing VID further on

(Available only on mobile (Available on all platforms)


platforms)

Integrated Power Gate


Idle cores go to near zero
power consumption

Table 9.5: Evolution of thermal management (3)

© Sima Dezső, ÓE NIK 393 www.tankonyvtar.hu


9. Overview of the evolution (6)
Pentium 4 Core Penryn Nehalem
Willamette --- E4xxx/E6xxx E7xxx/E8xxx i7-9xx
(180 nm/478 pins) (65 nm) (45 nm) (45 nm)
Managing the Power Control Unit
power
consumption Allows to manage the
of a multi- --- power consumption
core of the whole chip as an
processor entity. (Needed for the
as an entity Turbo Boost Mode)

EDAT Turbo Boost Mode


(Enhanced Dynamic In multi-core processors
Acceleration Technology) utilize the power
In multi-core processors headroom of idle cores
utilize the power headroom plus light workloads to
of idle cores in states CC3 boost fc of the active
or deeper to boost fc of the core)
active core)
(Only on mobile platforms)
Managing the Digital Temperature
fan-speed of Sensor (DTS)
a multi-chip
---
DTS readings represent
platform the temperature
difference below the
activation temperature of
the Thermal Control
Circuit (TCC)
Platform Environment
Control Interface (PECI)
A single wire interface to
forward temperature
readings of DTSs placed
on the dies
PECI-based platform
fan speed control

Table 9.6: Evolution of thermal management (4)

© Sima Dezső, ÓE NIK 394 www.tankonyvtar.hu
References (1)

[1]: Singhal R., “Next Generation Intel Microarchitecture (Nehalem) Family:


Architecture Insight and Power Management , IDF Taipeh, Oct. 2008,
http://intel.wingateweb.com/taiwan08/published/sessions/TPTS001/FA08%20IDF
-Taipei_TPTS001_100.pdf
[2]: Bryant D., “Intel Hitting on All Cylinders,” UBS Conf., Nov. 2007,
http://files.shareholder.com/downloads/INTC/0x0x191011/e2b3bcc5-0a37-4d06-
aa5a-0c46e8a1a76d/UBSConfNov2007Bryant.pdf
[3]: Fisher S., “Technical Overview of the 45 nm Next Generation Intel Core Microarchitecture
(Penryn),” IDF 2007, ITPS001, http://isdlibrary.intel-dispatch.com/isd/89/45nm.pdf
[4]: De Gelas J., “Intel Core versus AMD’s K8 architecture,” AnandTech, Mai 1. 2006,
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=2748&p=1
[5]: Carmean D., “Inside the Pentium 4 Processor Micro-architecture,” Aug. 2000,
http://people.virginia.edu/~zl4j/CS854/pda_s01_cd.pdf
[6]: Shimpi A. L. & Clark J., “AMD Opteron 248 vs. Intel Xeon 2.8: 2-way Web Servers
go Head to Head,” AnandTech, Dec. 17. 2003,
http://www.anandtech.com/showdoc.aspx?i=1935&p=1

[7]: Völkel F., “Duel of the Titans: Opteron vs. Xeon : Hammer Time: AMD On The Attack,”
Tom’s hardware, Apr. 22. 2003,
http://www.tomshardware.com/reviews/duel-titans,620.html
[8]: De Gelas J., “Intel Woodcrest, AMD's Opteron and Sun's UltraSparc T1:
Server CPU Shoot-out,” AnandTech, June 17. 2006,
http://www.anandtech.com/IT/showdoc.aspx?i=2772&p=1

© Sima Dezső, ÓE NIK 395 www.tankonyvtar.hu


References (2)

[9]: Hinton G. & al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology
Journal, Q1 2001, pp. 1-13
[10]: Wechsler O., “Inside Intel Core Microarchitecture,” White Paper, Intel, 2006
[11]: Lee V., “Inside the Intel Core Microarchitecture,” IDF, May 2006, Shenzhen,
http://www.prcidf.com.cn/sz/systems_conf/track_sz/SMC/Intel%20Core%20uArch.pdf
[12]: Doweck J., “Inside Intel Core Microarchitecture,” Hot Chips 18, 2006,
http://www.hotchips.org/archives/hc18/
[13]: Gruen H., “Intel’s new Core Microarchitecture,” Develop Brighton, AMD Technical
Day, July 2006,
http://ati.amd.com/developer/brighton/03%20Intel%20MicroArchitecture.pdf
[14]: Doweck J., “Intel Smart Memory access: Minimizing Latency on Intel Core
Microarchitecture, ” Technology @ intel Magazine, Sept. 2006, pp. 1-7,
ftp://download.intel.com/corporate/pressroom/emea/deu/fotos/06-10-Strategie_Tag/
Intel/Intel_Core2_Prozessoren/Texte/ENG-Smart_Memory_Access_Technology@
Intel_Magazine_Article.pdf
[15]: Sima D., Fountain T., Kacsuk P., Advanced Computer Architectures, Addison Wesley,
Harlow etc., 1997
[16]: Jafarjead B., “Intel Core Duo Processor,” Intel, 2006,
http://masih0111.persiangig.com/document/peresentation/behrooz%20jafarnejad.ppt
[17]: Pawlowski S. & Wechsler O., “Intel Core Microarchitecture,” IDF Spring, 2006,
http://www.intel.com/pressroom/kits/core2duo/pdf/ICM_tech_overview.pdf

© Sima Dezső, ÓE NIK 396 www.tankonyvtar.hu


References (3)

[18]: Goto H., “Larrabee architecture can be integrated into CPU”, PC Watch, Oct. 06. 2008,
http://pc.watch.impress.co.jp/docs/2008/1006/kaigai470.htm
[19]: SIMD Instruction Sets, http://softpixel.com/~cwright/programming/simd/index.php
[20]: Platform Environment Control Interface,
http://en.wikipedia.org/wiki/Platform_Environment_Control_Interface
[21]: Kim N. S. et al., „Leakage Current: Moore’s Law Meets Static Power”, Computer,
Dec. 2003, pp. 68-75.
[22]: Ng P. K., “High End Desktop Platform Design Overview for the Next Generation
Intel Microarchitecture (Nehalem) Processor,” IDF Taipei, TDPS001, 2008,
http://intel.wingateweb.com/taiwan08/published/sessions/TDPS001/
FA08%20IDF-Taipei_TDPS001_100.pdf
[23]: Bohr M., Mistry K., Smith S., “Intel Demonstrates High-k + Metal Gate Transistor
Breakthrough in 45 nm Microprocessors,”, Intel, Jan. 2007,
http://download.intel.com/pressroom/kits/45nm/Press45nm107_FINAL.pdf
[24]: Scott D. S., “Toward Petascale and Beyond,” APAC Conference, Oct. 2007,
http://www.apac.edu.au/apac07/pages/program/presentations/
Tuesday%20Harbour%20A%20B/David_Scott.pdf
[25]: Smith S. L., “45nm Product Press Briefing,” IDF Fall, 2007,
http://download.intel.com/pressroom/kits/events/idffall_2007/BriefingSmith45nm.pdf
[26]: Fisher S., “Technical Overview of the 45nm Next Generation Intel Core Microarchitecture
(Penryn),” IPTS001, Fall IDF 2007, http://isdlibrary.intel-dispatch.com/isd/89/45nm.pdf

© Sima Dezső, ÓE NIK 397 www.tankonyvtar.hu


References (4)

[27]: George V., 45nm Next Generation Intel Core Microarchitecture (Penryn),”
Hot Chips 19, 2007,
http://www.hotchips.org/archives/hc19/3_Tues/HC19.08/HC19.08.01.pdf
[28]: Foxton Technology, Wikipedia, http://en.wikipedia.org/wiki/Foxton_Technology
[29]: Coke J. & al., “Improvements in the Intel Core Penryn Processor Family Architecture
and Microarchitecture,” Intel Technology Journal, Vol. 12, No. 3, 2008, pp. 179-192

[30]: Fisher S., “Technical Overview of the 45nm Next Generation Intel Core Microarchitecture
(Penryn),” BMA S004, IDF 2007,
http://my.ocworkbench.com/bbs/attachment.php?attachmentid=318&d=1176911500

[31]: Gelsinger P. P., “Intel Architecture, Press Briefing, March 2008,


http://www.slideshare.net/angsikod/gelsinger-briefing-on-intel-architecture
[32]: Gelsinger P., “Invent the new reality,” IDF Fall 2008, San Francisco
http://download.intel.com/pressroom/kits/events/idffall_2008/PatGelsinger_day1.pdf
[33]: Brayton J.,”Nehalem: Talk of the Tock, Intel Next Generation Microprocessor,” IDF,
April 2008, Shanghai,
http://inteldeveloperforum.com.edgesuite.net/shanghai_2008/ti/TCH001/f.htm
[34]: Intel Timna Microprocessor Family, CPU World, http://www.cpu-world.com/CPUs/Timna/

[35]: Smith T., “Timna - Intel's first system-on-a-chip, Before 'Tolapai', before 'Banias'.
Register Hardware, 6. February 2007,
http://www.reghardware.co.uk/2007/02/06/forgotten_tech_intel_timna/

© Sima Dezső, ÓE NIK 398 www.tankonyvtar.hu


References (5)

[36]: Images, Xtreview,


http://xtreview.com/images/K10%20processor%2045nm%20architec%203.jpg
[37]: Singhal R. Intel’s i7, Podtech, http://www.podtech.net/home/5436/intels-core-i7
[38]: Strong B., “A Look Inside Intel: The Core (Nehalem) Microarchitecture,”
http://www.cs.utexas.edu/users/cart/arch/beeman.ppt
[39]: Valles A. C., Ansari Z., Mehrotra P.: “Tuning Your Software for the Next Generation
Intel Microarchitecture (Nehalem) Family,” IDF 2008, NGMS002,
http://www.benchmark.rs/tests/editorial/Nehalem_munich/presentations/
Software-Tuning_for_Nehalem.pdf
[40]: The low cost Timna CPU, Tom’s hardware, Febr. 25 2000,
http://www.tomshardware.com/reviews/idf-2000,166-3.html
[41]: Intel® Virtualization Technology, Special issue, Vol. 10 No. 03 Aug. 2006,
http://www.intel.com/technology/itj/2006/v10i3/
[42]: Intel Core 2 Duo Processor E8000 and E7000 Series Datasheet, Intel, Jan. 2009

[43]: Wikipedia: List of Intel Core 2 microprocessors


http://en.wikipedia.org/wiki/List_of_Intel_Core_2_microprocessors

[44]: Wikipedia: Nehalem (microarchitecture)


http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)

[45]: Glaskowsky P.: Investigating Intel's Lynnfield mysteries, cnet News, Sept. 21. 2009,
http://news.cnet.com/8301-13512_3-10357328-23.html
© Sima Dezső, ÓE NIK 399 www.tankonyvtar.hu
References (6)

[46]: Shimpi A. L.: Intel's Core i7 870 & i5 750, Lynnfield: Harder, Better, Faster Stronger,
AnandTech, Sept. 8. 2009, http://www.anandtech.com/show/2832
[47]: Intel Xeon Processor C5500/C3500 Series, Datasheet – Volume 1, Febr. 2010,
http://download.intel.com/embedded/processor/datasheet/323103.pdf
[48]: Intel CoreTM i7-800 and i5-700 Desktop Processor Series Datasheet – Volume 1,
July 2010, http://download.intel.com/design/processor/datashts/322164.pdf
[49]: Glaskowsky P.: Intel's Lynnfield mysteries solved, cnet News, Sept. 28. 2009,
http://news.cnet.com/8301-13512_3-10362512-23.html
[50]: Intel CoreTM i7-900 Mobile Processor Extreme Edition Series, Intel Core i7-800 and
i7-700 Mobile Processor Series, Datasheet – Volume One, Sept. 2009
http://download.intel.com/design/processor/datashts/320765.pdf
[51]: Intel Turbo Boost Technology in Intel CoreTM Microarchitecture (Nehalem) Based
Processors, White Paper, Nov. 2008
http://download.intel.com/design/processor/applnots/320354.pdf
[52]: Power Management in Intel Architecture Servers, White Paper, April 2009
http://download.intel.com/support/motherboards/server/sb/power_management_of_intel
architecture_servers.pdf

[53]: Glaskowsky P.: Explaining Intel’s Turbo Boost technology, cnet News, Sept. 28. 2009,
http://news.cnet.com/8301-13512_3-10362882-23.html
[54]: Intel Xeon Processor 7500 Series, Datasheet – Volume 2, March 2010
http://www.intel.com/Assets/PDF/datasheet/323341.pdf

© Sima Dezső, ÓE NIK 400 www.tankonyvtar.hu


References (7)

[55]: Pawlowski S.: Intelligent and Expandable High- End Intel Server Platform, Codenamed
Nehalem-EX, IDF 2009
[56]: Kottapalli S., Baxter J.: Nehalem-EX CPU Architecture, Hot Chips 2009, Sept. 10. 2009
http://www.hotchips.org/archives/hc21/2_mon/HC21.24.100.ServerSystemsI-Epub/HC21.24.
122-Kottapalli-Intel-NHM-EX.pdf
[57]: Kurd N. A. & all: A Family of 32 nm IA Processors, IEEE Journal of Solide-State Circuits,
Vol. 46, Issue 1., Jan. 2011, pp. 119-130
[58]: Hill D., Chowdhury M.: Westmere Xeon-56xx „Tick” CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.620-Hill-Intel-WSM-EP-print.pdf
[59]: Intel CoreTM i7-600, i5-500, i5-400 and i3-300 Mobile Processor Series, Datasheet -
Vol.1, Jan. 2010, http://download.intel.com/design/processor/datashts/322812.pdf
[60]: Nagaraj D., Kottapalli S.: Westmere-EX: A 20 thread server CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf
[61]: Kahn O., Piazza T., Valentine B.: Technology Insight: Intel Next Generation
Microarchitecture Codename Sandy Bridge, IDF 2010extreme.pcgameshardware.de/.../
281270d1288260884-bonusmaterial-pc-games-hardware-12-2010-sf10_spcs001_100.pdf

[62]: Wikipedia: Sandy Bridge, http://en.wikipedia.org/wiki/Sandy_Bridge


[63]: http://ark.intel.com

[64]: Kahn O., Valentine B.: Intel Next Generation Microarchitecture Codename Sandy Bridge:
New Processor Innovations, IDF 2010
© Sima Dezső, ÓE NIK 401 www.tankonyvtar.hu
References (8)

[65]: Shimpi A. L.: Intel Pentium 4 1.4GHz & 1.5GHz, AnandTech, Nov. 20. 2000
http://www.anandtech.com/show/661/5
[66]: Yuffe M., Knoll E., Mehalel M., Shor J., Kurts T.: A fully integrated multi-CPU, GPU and
memory controller 32nm processor, ISSCC, Febr. 20-24. 2011, pp. 264-266

[67]: Intel Xeon Processor 7500/6500 Series, Public Gold Presentation, Data Center Group,
March 30. 2010, http://cache-www.intel.com/cd/00/00/44/64/446456_446456.pdf
[68]: Tang H., Cheng H.: Intel Xeon Processor E3 Family Based Servers: A Smart Investment
for Managing Your Small Business, IDF 2011

[69]: Thomadakis M. E. PhD: The Architecture of the Nehalem Processor and Nehalem-EP SMP
Platforms, Texas A&M University, March 17. 2011
http://alphamike.tamu.edu/web_home/papers/perf_nehalem.pdf
[70]: Intel Xeon Processor E7-8800/4800/2800 Product Families, Datasheet Vol. 1 of 2,
April 2011, http://www.intel.com/Assets/PDF/datasheet/325119.pdf

© Sima Dezső, ÓE NIK 402 www.tankonyvtar.hu


Intel’s Desktop Platforms

Dezső Sima

© Sima Dezső, ÓE NIK 403 www.tankonyvtar.hu


Contents

• 1. Introduction to DT platforms

• 2. Introduction to Intel’s vPro platform family

• 3. Overview of Intel’s DT platforms

• 4. Core 2 and Penryn based DT platforms

• 5. Nehalem and Westmere based DT platforms

• 6. Sandy Bridge based DT platforms


• 7. Overview of distinguished aspects of the evolution of
Intel’s DT platforms
• 8. References

© Sima Dezső, ÓE NIK 404 www.tankonyvtar.hu


1. Introduction to DT platforms

© Sima Dezső, ÓE NIK 405 www.tankonyvtar.hu


1. Introduction to DT platforms (1)

Intel’s desktop (DT) platforms

Platforms
Set of processors and associated chipsets capable of working together.
Traditional three chip platforms consist of
• a single or multiple processors,
• an MCH (Memory Control Hub) and
• an IOH (I/O Control Hub),
whereas Intel’s recent two chip platforms include
• a single or multiple processors and
• a PCH (Platform Control Hub).

Block diagrams (figure): traditional DT platform: Processor, FSB, MCH, DMI, IOH;
recent DT platform: Processor, DMI, PCH

© Sima Dezső, ÓE NIK 406 www.tankonyvtar.hu


1. Introduction to DT platforms (2)
6/2006

Co-design of platform components


Platform components are often co-designed, announced and delivered together, as it was
the case with Intel’s Core 2 based Averill platform.

Interchangeability of platform components

Usually, co-designed components are used as a set to implement a particular system
architecture.
Nevertheless, co-designed components may be substituted by components belonging to their
preceding or subsequent generation, assuming compatibility.
E.g. the Core 2 based Averill platform supports also the Core 2 Quad processor lines, or
instead of the genuine 965 chipset and the ICH8 IOH the previous 975 MCH can be chosen
with the ICH7 IOH that targeted the 90 nm Pentium 4 Prescott (not shown since it is a
single core processor), as indicated partly in the figure.

The Core 2-based Averill platform (65 nm)
(the platform block diagram with its components and dates is repeated on the next slide)


© Sima Dezső, ÓE NIK 407 www.tankonyvtar.hu
1. Introduction to DT platforms (3)
6/2006

Averill

7/2006 11/2006

E6xxx/E4xxx Q6xxx
X6800 QX6xxx

(Conroe (Steppings B2/G0) (Kentsfield)


Allendale: (Steppings L2/M0) Core 2 Extreme Quad 2x2C
Core 2 Extreme 2C Core 2 Quad
Core 2 Duo 2C
65 nm 65 nm
Conroe: 291 mtrs/143 mm2 /2x291 mtrs/2x143 mm2
Allendale: 167 mtrs/111 mm2
2/4 MB L2 2*4 MB L2
E6800X/E6xxx: 1066 MT/s 1066 MT/s
E4xxx: 800MT/s
LGA775 LGA775
6/2006

965 Series

(Broadwater)
FSB
1066/800/566 MT/s
2 DDR2 channels
DDR2: 800/666 MT/s
2 DIMMs/channel
8 GB max.

6/2006

ICH8

Core 2-based (65 nm)


© Sima Dezső, ÓE NIK 408 www.tankonyvtar.hu
1. Introduction to DT platforms (4)

Note

Main features of memory interfaces of recent platforms


a) Width of the memory channels
Memory channels are 64 bit wide
b) Supported memory features
DT memories typically do not support ECC or registered (buffered) DIMMs,
in contrast to servers that typically make use of registered DIMMs with ECC protection.

© Sima Dezső, ÓE NIK 409 www.tankonyvtar.hu


1. Introduction to DT platforms (5)

Typical implementation of ECC protected registered DIMMs (used in servers)

Main components
• Two register chips, for buffering the address- and command lines
• A PLL (Phase Locked Loop) unit for deskewing clock distribution.

ECC

Register PLL Register

Figure 1.1: Typical layout of a registered memory module with ECC [1]

© Sima Dezső, ÓE NIK 410 www.tankonyvtar.hu


1. Introduction to DT platforms (6)

c) Memory types used in recent DT platforms

DDR2 and increasingly DDR3 memories.

SDRAM 168-pin

DDR 184-pin

DDR2 240- pin

DDR3 240-pin

© Sima Dezső, ÓE NIK Figure 1.2: DIMM modules


411 (8-Byte wide) www.tankonyvtar.hu
1. Introduction to DT platforms (7)

Typical memory speeds

• DDR2: 400/667/800 MT/s


• DDR3: 800/1067/1333 MT/s

© Sima Dezső, ÓE NIK 412 www.tankonyvtar.hu


1. Introduction to DT platforms (8)

d) The number of memory channels provided

In a traditional pre-Nehalem based system architecture the memory controller


(which is responsible for the memory channels) is implemented in the MCH.
The MCH, however, has a large number of electrical connections that are implemented as
copper trails on the motherboard.

© Sima Dezső, ÓE NIK 413 www.tankonyvtar.hu


1. Introduction to DT platforms (9)

Example: Core 2 based (private consumer oriented Averill) DT platform [2]

FSB

Display 2 DIMMs/channel

2 DIMMs/channel
card

C-link

© Sima Dezső, ÓE NIK 414 www.tankonyvtar.hu


1. Introduction to DT platforms (10)

Figure 1.3: Copper trails on a motherboard (MSI 915G Combo motherboard)


(The copper trails are equalized to reduce skew)
© Sima Dezső, ÓE NIK 415 www.tankonyvtar.hu
1. Introduction to DT platforms (11)

There are limitations on both the minimum width and the spacing of the copper trails.

• The minimum width of the copper trails is restricted by their non-zero trace impedance.
• The minimum spacing between trails is constrained by parasitic capacitance and crosstalk.
Given
• the huge number of connections to be implemented as copper traces connecting
particular parts to the MCH,
• and the large number of connections each DDR2 or DDR3 memory channel needs,
• as well as the physical restrictions implied on the copper traces,
recent 3-chip platforms are limited typically to 2 DDR2 or DDR3 memory channels.

By contrast, FBDIMM channels make use of serial links and need only about 80 lines,
as a consequence 3-chip platforms may have about 3 times more FBDIMM memory
channels than DDR2 or DDR3 channels.

© Sima Dezső, ÓE NIK 416 www.tankonyvtar.hu


1. Introduction to DT platforms (12)

Beginning with Intel’s Nehalem processor, however the memory controller moved onto the
processor chip, as shown below.

© Sima Dezső, ÓE NIK 417 www.tankonyvtar.hu


1. Introduction to DT platforms (13)

Example: 1. generation Nehalem (called Bloomfield) private consumer oriented
(Tylersburg) DT platform [3]

6.4 GT/s

Tylersburg

© Sima Dezső, ÓE NIK 418 www.tankonyvtar.hu


1. Introduction to DT platforms (14)

In this case the memory channels are attached to the processor whereby the processor
has far fewer connections to other units than the MCH had in the previous design.
As a consequence, Nehalem-based or subsequent platforms may implement more than two
DDR2/DDR3 channels, as illustrated above.

Remark
Typical bandwidth of recent 2-channel memory interfaces

Let’s assume that a particular platform has dual memory channels with DDR3-1333 DIMMs.
Then the resulting memory bandwidth (BW) of the platform amounts to

BW = 2 x 8 B x 1333 MT/s = 21.3 GB/s

© Sima Dezső, ÓE NIK 419 www.tankonyvtar.hu


1. Introduction to DT platforms (15)

e) The number of DIMMs/channel

As memory speeds increase or the number of DIMMs attached to each memory channel
increases the operational tolerances of data transmissions over the memory channels
become narrower due to electrical effects, such as reflections, jitter, skew, crosstalk etc.

© Sima Dezső, ÓE NIK 420 www.tankonyvtar.hu


1. Introduction to DT platforms (16)

The operational margins are effective while capturing transmitted data at the receiver.
They are defined by

• a temporal window, called the Data Valid Window (DVW), and


• a voltage window, given by the VHmin and VLmax values.

V
VH
VHmin

Forbidden V area Data

VLmax
VL
t
DVW

DVW: Min. time data must remain valid

Clock (for capturing data)

© Sima Dezső, ÓE NIK 421 www.tankonyvtar.hu


1. Introduction to DT platforms (17)

Interpretation of the Data Valid Window (DVW)


It is the minimum time interval for which the input signal must remain valid (high or low)
before and after the clock edge in order to capture the data bits correctly.

Data

CK
tS
tH

Min. DVW

Figure 1.4: Interpretation of the DVW for ideal signals

The minimum DVW has two characteristics,

a size, that is the sum of the setup time (tS) and the hold time (tH), and
a correct phase related to the clock edge, to satisfy both tS and tH requirements.
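For orientation, an illustrative calculation (not taken from the original slides): at DDR3-1333 data rates one bit time (unit interval) is 1/(1333 MT/s) ≈ 0.75 ns, so the sum tS + tH must fit into whatever remains of this 0.75 ns after jitter, skew and settling effects are subtracted; this shows why the margins shrink quickly as transfer rates grow.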

© Sima Dezső, ÓE NIK 422 www.tankonyvtar.hu


1. Introduction to DT platforms (18)

The size and fulfillment of the operational tolerances can be visualized by the eye diagram.
It shows the picture of a large number of overlaid data signals.

min
DVW

max

Figure 1.5: Eye diagram of a real signal showing both available DVW and requested voltage levels [4]

DVW: Data Valid Window

© Sima Dezső, ÓE NIK 423 www.tankonyvtar.hu


1. Introduction to DT platforms (19)

Reflections, jitter, skews, crosstalks and other disturbances narrow the operational margins of
the data transfer and limit thereby
• the transfer speed and
• the number of DIMMs allowed to be attached to each channel.

© Sima Dezső, ÓE NIK 424 www.tankonyvtar.hu


1. Introduction to DT platforms (20)

Reflections

At operational speeds of DDR2/DDR3 memories the connection lines, i.e. the copper traces
behave like transmission lines.
Transmission lines need to be terminated by their characteristic impedance (about 50-70 Ω for
copper traces on mainboards) if reflections should be avoided.
In case of a termination mismatch or existing inhomogeneities of the transmission line,
reflections arise and narrow the operational tolerances.

© Sima Dezső, ÓE NIK 425 www.tankonyvtar.hu


1. Introduction to DT platforms (21)

Termination of the transmission lines connecting the memory controller and the DRAM chips

Despite the fact that subsequent memory technologies (SDRAM to DDR3) laid more and more
emphasis on the appropriate termination of the transmission lines (up to on-die, dynamically
adjusted termination of the lines in case of DDR3 memories), a certain termination mismatch
typically remains and reflections arise, as shown in the next slide.

© Sima Dezső, ÓE NIK 426 www.tankonyvtar.hu


1. Introduction to DT platforms (22)

Example for reflections

Figure 1.6: Reflections shown on an eye diagram due to termination mismatch [5]

© Sima Dezső, ÓE NIK 427 www.tankonyvtar.hu


1. Introduction to DT platforms (23)

Inhomogeneity of transmission lines connecting the memory controller and the DRAM chips

Transmission lines connecting the memory controller and the DRAM chips mounted on the
DIMMs are inherently inhomogeneous due to the kind of the connection.

© Sima Dezső, ÓE NIK 428 www.tankonyvtar.hu


1. Introduction to DT platforms (24)

The dataway that connects the memory controller and the DRAM chips

Figure annotations: memory controller, motherboard trace, memory modules;
for higher data rates PCB traces behave like transmission lines;
inhomogeneities arise in the transmission line

Figure 1.7: The copper traces connecting the memory controller and the DRAM chips behaves
like transmission lines (based on [6])

© Sima Dezső, ÓE NIK 429 www.tankonyvtar.hu


1. Introduction to DT platforms (25)

Jitter

• It means phase uncertainty causing ambiguity in the rising and falling edges of a
signal, as shown in the figure below,
• It has a stochastic nature,

Figure 1.8: Jitter of signal edges [7]

The main sources of jitter are

• Crosstalk caused by coupling adjacent traces on the board or in the DRAM device,
• ISI (Inter-Symbol Interference) caused by cycling the bus faster than it can settle,
• Reflection noise due to mismatching termination of signal lines,
• EMI (Electromagnetic Interference) caused by electromagnetic radiation emitted
from external sources.

© Sima Dezső, ÓE NIK 430 www.tankonyvtar.hu


1. Introduction to DT platforms (26)

Skew

It is a time offset of the signal edges


• between different occurrences of the same signal, such as a clock, at different locations
on a chip or a PC board (as shown in the Figure below), or
• between different bit lines of a parallel bus at a given location.

Figure 1.9: Skew due to propagation delay [7]

Skews arise mainly due to


- propagation delays in the PC-board traces, termed also as time of flight (TOF)
(about 170 ps/inch), as indicated above [8],
- capacitive loading of a PC-board trace (about 50 ps per pF) as indicated in the
subsequent figure [8],
- SSO (Simultaneous Switching Output) occurring due to parasitic inductances in case
when a number of bit lines simultaneously change their output states.

© Sima Dezső, ÓE NIK 431 www.tankonyvtar.hu


1. Introduction to DT platforms (27)

CK-1

CK-2

Skew

Figure 1.10: Skew due to capacitve loading of signal lines [8]

© Sima Dezső, ÓE NIK 432 www.tankonyvtar.hu


1. Introduction to DT platforms (28)

Reflections, jitter, skews and further electrical disturbances reduce the operational tolerances
effective at the receiver end of the transmission lines connecting the memory controller and
the DRAM chips, and limit the operational speed of the memory channels as well as
the number of DIMMs attachable per channel.

As a consequence, in recent DT and also server platforms the number of DIMMs that can be
attached to DDR2 or DDR3 memory channels is typically restricted to two.
This restriction is rooted in the parallel style of connecting traditional memory modules to
memory controllers.
By contrast, serially connected memory modules, such as FB-DIMM modules, have much higher
operational tolerances, and in this case more than two (typically up to 6 or 8) FB-DIMM modules
can be attached to a memory channel.

Example
The DRAM capacity C of a DT platform having three memory channels with two 4 GB DIMMs
per channel amounts to

C = 3 x 2 x 4 GB = 24 GB
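
In general, the maximum DRAM capacity is the product of the number of channels, the number
of DIMMs per channel and the DIMM capacity. E.g. for a platform with two memory channels and
two 4 GB DIMMs per channel (an assumed configuration used only for illustration):

C = 2 x 2 x 4 GB = 16 GB

which matches the 16 GB maximum quoted later for the two-channel DDR3 based DT platforms.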

© Sima Dezső, ÓE NIK 433 www.tankonyvtar.hu


1. Introduction to DT platforms (29)

DT platforms

• Private consumer oriented DT platforms: Intel's standard DT platforms
• Enterprise oriented DT platforms: Intel's vPro platforms

© Sima Dezső, ÓE NIK 434 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family

© Sima Dezső, ÓE NIK 435 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (1)

Intel’s vPro platforms

Target market
Enterprise computing
Main goals
• lowering TCO (Total Cost of Ownership)
• increasing system availability and
• enhancing security.

Main goals achieved basically by


• hardware assisted remote maintenance
• hardware assisted virtualization and
• hardware assisted security.

© Sima Dezső, ÓE NIK 436 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (2)

Dedicated technologies constituting the vPro technology

VPro consists of an increasing set of dedicated technologies, such as [30]

• Intel Active Management Technology (AMT)1
• Intel Virtualization Technology (VT) and Virtualization Technology for Directed I/O (VT-d)2
• Intel Trusted Execution Technology (TXT)2
• Intel Turbo Memory Technology (TM)2 (client platforms)
• Intel Anti-Theft Technology (AT)3

1AMT version 1.0 preceded vPro; it was introduced based on the 945 chipset, the ICH7 and
a Gigabit Ethernet controller, and supported the dual core P4 Smithfield processor in 2005.
2Introduced in the 2. gen. vPro based on the Q35 in 2007.

3Introduced in the 4. gen. vPro based on the Q57/QM57 in 2009.

Through the evolution of these technologies vPro provides a continuously expanding feature set.

© Sima Dezső, ÓE NIK 437 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (3)

Brief overview of the basic constituent technologies of vPro

Intel Active Management Technology (AMT) [9]


• Allows system administrators to remotely manage PCs when the PC is shut down or there is
a hardware error (e.g. hard disk failure) or the OS is unavailable.

Intel Virtualization Technology for Directed I/O (VT-d) [11]


• Consists of a set of hardware and software components that support platform virtualization.
That allows running multiple OSs and applications in independent partitions.
• Each partition behaves like a virtual machine (VM).
• VT provides isolation and protection across partitions.
• VT enables among others server consolidation, workload isolation, legacy software migration,
disaster recovery.

Intel Trusted Execution Technology (TXT) [11]


• Hardware based enhanced protection for storing, processing and exchanging data in a PC.
Intel Turbo Memory Technology (TM) [12]
• It provides hard disk caching by using 1 – 4 GB SSD (Solid-State Drives), i.e. flash memory
placed on a card connected typically via the PCIe interface.
Intel Anti-Theft Technology (AT) [10]
• Allows system administrators to protect data stored on missing or stolen laptops, e.g.
by sending an encrypted SMS message (designated as the poison pill) over a 3G network, or
even to request the laptop to send location information (GPS coordinates) to the
central server, as well as to reactivate the data if the laptop is recovered.
© Sima Dezső, ÓE NIK 438 www.tankonyvtar.hu
2. Introduction to Intel’s vPro platform family (4)

Main hardware components of vPro [30]

• AMT, VT-d, TXT, TM and AT enabled processor.


Most recent Intel processors provide support for these technologies.
• AMT enabled MCH or PCH, such as the GM45, Q35, Q45, Q57 or Q67 Express chipsets.
These chipsets include a Manageability Engine (actually a microcontroller),
which is the heart of AMT.
• AMT capable Gigabit LAN controller that provides an independent LAN communication
channel (needed for AMT) and
• an Intel wireless LAN controller that provides an independent wireless communication
channel (for mobiles, needed for AT).

Example
Main hardware components of the 5. generation vPro

© Sima Dezső, ÓE NIK 439 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (5)

Remark

AMT implements an Out-of-band remote management

Out-of-band management (OOB) (Based on [13])


• It means providing a dedicated LAN and/or wireless channel and hardware support
for remote monitoring and maintenance of devices, such as PCs, servers or network
equipment, for system administrators.
• By contrast, in-band management typically makes use of the regular LAN and/or wireless
connection and is based on software, such as remote maintenance software, that must be
installed on the remote device being managed and only works after the OS has been booted.
• Both in-band and out-of-band management require a LAN and/or wireless network connection,
but out-of-band management needs separate dedicated LAN and/or wireless connections,
like a separate network connector.
• The remote management unit usually has an independent power supply and can power the
device on or off through the network.
• In-band management may be cheaper but it does not allow accessing BIOS settings,
reinstalling the OS, etc.

© Sima Dezső, ÓE NIK 440 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (6)

Note

Beyond hardware requirements vPro needs also firmware (BIOS) and OS support, not
detailed here. For details see e.g. [9].

© Sima Dezső, ÓE NIK 441 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (7)

More details about the basic constituent technologies of vPro

© Sima Dezső, ÓE NIK 442 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (8)

Intel Active Management Technology (AMT) [9]

• Key component of vPro.


• Allows system administrators to remotely manage PCs when the PC is shut down or there is
a hardware error (e.g. hard disk failure) or the OS is unavailable.

Main components of AMT [9]

Filters, sensors: provide enhanced security features

FW: Firmware

MAC: Media Access Control (provides addressing and channel access control), part of OSI layer 2

NVM: Non Volatile Memory

Out-of-band: dedicated independent system control without using the OS

3PDS: Third Party Data Storage space (≈ 192 KB) for general use of OEM platform SW or
third party platform applications

© Sima Dezső, ÓE NIK 443 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (9)

Remark

AMT version 1.0 preceded vPro.


It was based on the DT Lyndon platform (built up of the dual core P4 Smithfield processor,
the 945/955 chipset with the ICH7 and a Gigabit Ethernet Controller) in 2005 [14].

© Sima Dezső, ÓE NIK 444 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (10)

The Manageability Engine (ME)

• It is the heart of AMT.


• ME is an embedded microcontroller that is incorporated into the IOH or PCH.
(The ME became part of the PCH with the 2. generation Nehalem processors
(designated as Lynnfield) along with their associated 5 Series chipset, called the PCH.)
• ME provides an out-of-band (dedicated independent) remote management, whereas
software based remote management requires running the underlying OS.

© Sima Dezső, ÓE NIK 445 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (11)

Relocation of the ME as the DT system architecture evolved from the 3-chip solution to
the 2-chip solution (along with the 2. gen. Nehalem processors (Lynnfield) and their associated
5 Series PCHs) [24]

[Figure from [24]: in the previous 3-chip solution the ME resides in the MCH; in the 2-chip
solution it is relocated into the Series 5 PCH, while dedicated graphics remains available via a
graphics card.]

© Sima Dezső, ÓE NIK 446 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (12)

Operation of ME [30]

• ME is an embedded microcontroller integrated into the IOH or PCH.


• ME runs a dedicated microkernel OS that provides an execution engine for out-of-band
processor management.
• At system initialization ME loads its code (ME FW) from the nonvolatile (NVM) system flash
memory.
• This allows the dedicated OS to run before the main OS is started, independently from
the main OS.
• At runtime, ME has access to a protected area of the system memory, located in DIMM 0
of Channel A.
• ME is connected to an independent power plane that allows running it even if the CPU
or many other components of the system are in a deeper (ACPI) sleep state.

© Sima Dezső, ÓE NIK 447 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (13)

Intel Virtualization Technology for Directed I/O (VT-d) [31], [32], [33]
Virtualization technology in general
• consists of a set of hardware and software components that allow running multiple OSs
and applications in independent partitions.
• Each partition is isolated and protected from all other partitions.
• Virtualization enables among others
• Server consolidation
Substituting multiple dedicated servers by a single virtualized platform,
• Legacy software migration
Legacy software: software commonly used previously, often written in languages that are
no longer in common use (such as Cobol) and running under OSs or platforms that are
no longer in common use.
Legacy software migration: moving legacy software to a recent platform,
• Effective disaster recovery.

© Sima Dezső, ÓE NIK 448 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (14)

Overview of the evolution of Intel’s virtualization technology


VT
It is Intel’s general designation for virtualization technology
VT-x
• Intel’s first generation VT implementation for x86 architectures.
• It provides hardware support for processor virtualization.
• Appeared first in 2005 for two Pentium 4 models (662, 672).
VT-i
• Intel’s first generation VT implementation for IA-64 (Itanium) architectures.
• It provides hardware support for processor virtualization.
• First implemented in the Montecito line of Itanium Processors in 2006.
VT-d (Intel Virtualization Technology for Directed I/O)
• Intel’s second generation VT implementation.
• It adds chipset hardware features to enhance I/O performance and robustness
of virtualization.
• First implemented in the desktop oriented Bearlake chipset (x35) in 2007.

© Sima Dezső, ÓE NIK 449 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (15)

Intel Trusted Execution Technology (TXT) [15], [16].

• The preceding designation for the TXT technology was the LaGrande technology.
• It provides hardware based security against hypervisor attacks, BIOS or other
firmware attacks, malicious root kit installations or other software attacks.
• It extends Intel's Virtualization Technology (VT) with a Measured Launch Environment (MLE)
by providing a verifiably secure installation, launch and use of a hypervisor or OS.
• It consists of a number of hardware enhancements to allow the creation of multiple
separated execution environments or partitions.
• One of the components is the TPM (Trusted Platform Module), a special chip which allows for
secure key generation and storage and authenticated access to data encrypted by this key.
The TPM chip is usually connected to the LPC (Low Pin Count) bus.
• TXT became available for DT platforms beginning with the 3 Series MCH model Q35 in 2007.

© Sima Dezső, ÓE NIK 450 www.tankonyvtar.hu


2. Introduction to Intel’s vPro platform family (16)

Intel Turbo Memory [12], [17], [18]

• It is a disk caching technology by using SSD (Solid-State Drives), i.e. flash memory
placed on a card.
• The Turbo memory card provides 1 – 4 GB disk space and is connected to the PC via the
PCIe interface.
• It caches frequently used data or user selected applications.
• Expected results: faster access to data and lower power consumption.
• It was announced first for mobile platforms in 2005 and offered
later on also for DT platforms, along with the 3 and 4 Series chipsets
starting in 2007.
• The Turbo Memory Technology is supported by Microsoft
Windows Vista “Ready Drive” and “Ready Boost” technologies.
• According to related reviews the Turbo Memory technology
did not fulfil the expectations; its benefits were not worth its cost.
• In 2009 Intel announced a successor technology for the
5 Series mobile chipsets, called the Braidwood technology,
but subsequently withdrew it.
• In 2011 Intel introduced a disk caching mechanism in their Z68 chipset
(and mobile derivatives) of the Series 6 PCH family, to provide
disk caching by a SATA SSD.

© Sima Dezső, ÓE NIK 451 www.tankonyvtar.hu


3. Overview of Intel’s DT platforms

© Sima Dezső, ÓE NIK 452 www.tankonyvtar.hu


3. Overview of Intel’s DT platforms (1)

Basic arch. | Techn. | Models | Core | Cores | Intro. | Cache arch. | Interf.
Core 2 | 65 nm | X6800 | Conroe | 2C | 7/2006 | 4 MB L2 | FSB
Core 2 | 65 nm | E6xxx | Conroe | 2C | 7/2006 | 2/4 MB L2 | FSB
Core 2 | 65 nm | E4xxx | Allendale | 2C | 1/2007 | 4 MB L2 | FSB
Core 2 | 65 nm | E6xxx | Allendale | 2C | 7/2007 | 4 MB L2 | FSB
Core 2 | 65 nm | QX67xx | Kentsfield | 2x2C | 11/2006 | 2x4 MB L2 | FSB
Core 2 | 65 nm | Q6xxx | Kentsfield | 2x2C | 1/2007 | 2x4 MB L2 | FSB
Penryn | 45 nm | E8xxx | Wolfdale | 2C | 1/2008 | 6 MB L2 | FSB
Penryn | 45 nm | E7xxx | Wolfdale-3M | 2C | 4/2008 | 3 MB L2 | FSB
Penryn | 45 nm | QX9xxx | Yorkfield XE | 2x2C | 11/2007 | 2x6 MB L2 | FSB
Penryn | 45 nm | Q9xxx | Yorkfield | 2x2C | 1/2008 | 2x6 MB L2 | FSB
Penryn | 45 nm | Q9xxx | Yorkfield-6M | 2x2C | 1/2008 | 2x3 MB L2 | FSB
Penryn | 45 nm | Q8xxx | Yorkfield-4M | 2x2C | 8/2008 | 2x2 MB L2 | FSB
1. gen. Nehalem | 45 nm | i7-920-965 | Bloomfield | 4C | 11/2008 | ¼ MB L2/C, 8 MB L3 | QPI
2. gen. Nehalem | 45 nm | i7-8xxx/i5-7xx | Lynnfield | 4C | 9/2009 | ¼ MB L2/C, 8 MB L3 | DMI
Westmere | 32 nm | i7-9xxX | Gulftown | 6C | 3/2010 | ¼ MB L2/C, 12 MB L3 | QPI
Westmere | 32 nm | i7-9xx | Gulftown | 6C | 7/2010 | ¼ MB L2/C, 12 MB L3 | QPI
Westmere | 32 nm | i5-6xx/i3-5xx | Clarkdale | 2C+G | 1/2010 | ¼ MB L2/C, max. 4 MB L3 | DMI
Sandy Bridge | 32 nm | i7-26/27/28/29xx | Sandy Bridge | 2/4C+G | 1/2011 | ¼ MB L2/C, 4/8 MB L3 | DMI2
Sandy Bridge | 32 nm | i5-23/24/25xx | Sandy Bridge | 2/4C+G | 1/2011 | ¼ MB L2/C, 3/6 MB L3 | DMI2
Sandy Bridge | 32 nm | i3-21/23xx | Sandy Bridge | 2C+G | 1/2011 | ¼ MB L2/C, 3 MB L3 | DMI2

Table 3.1: Intel’s Core 2 based or more recent multicore desktop (DT) processors

© Sima Dezső, ÓE NIK 453 www.tankonyvtar.hu


3. Overview of Intel’s DT platforms (2)

DT platforms

• Core 2 and Penryn based DT platforms
• Nehalem based DT platforms
• Sandy Bridge based DT platforms

© Sima Dezső, ÓE NIK 454 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms

© Sima Dezső, ÓE NIK 455 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (1)

Intel’s Core 2 and Penryn based DT platforms

Core 2 and Penryn based DT platforms

• Private consumer oriented Core 2 and Penryn based DT platforms:
Averill (2006), Salt Creek (2007), Boulder Creek (2008)
• Enterprise oriented Core 2 and Penryn based DT platforms:
Averill professional (2006), Weybridge (2007), McCreary (2008)

© Sima Dezső, ÓE NIK 456 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (2)

a) Intel’s Core 2 and Penryn based private consumer oriented DT platforms

Overview

© Sima Dezső, ÓE NIK 457 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (3)
6/2006: Averill | 6/2007: Salt Creek | 6/2008: Boulder Creek

Averill (6/2006)
• Processors (7/2006): X6800, E6xxx, E4xxx (Conroe, steppings B2/G0; Allendale, steppings L2/M0);
Core 2 Extreme 2C, Core 2 Duo 2C; 65 nm; Conroe: 291 mtrs/143 mm2, Allendale: 167 mtrs/111 mm2;
2/4 MB L2; FSB: X6800/E6xxx: 1066 MT/s, E4xxx: 800 MT/s; LGA775
• MCH (6/2006): 965 Series (Broadwater); FSB 1066/800/566 MT/s;
2 DDR2 channels, DDR2: 800/666 MT/s, 2 DIMMs/channel, 8 GB max.
• ICH (6/2006): ICH8

Salt Creek (6/2007)
• Processors (11/2006): QX6xxx, Q6xxx (Kentsfield); Core 2 Extreme Quad 2x2C, Core 2 Quad 2x2C;
65 nm; 2x291 mtrs/2x143 mm2; 2x4 MB L2; FSB: 1066 MT/s; LGA775
• MCH (6/2007): 3 Series (Bearlake); FSB 1333/1066/800 MT/s;
2 DDR2/DDR3 channels, DDR2: 666/800 MT/s, DDR3: 800/1067 MT/s, 2 DIMMs/channel, 8 GB max.
• ICH (6/2007): ICH9

Boulder Creek (6/2008)
• Processors (11/2007-3/2008): E8xxx, E7xxx, QX9xxx, Q9xxx, Q8xxx1
(Wolfdale/Wolfdale-3M: 2C; Yorkfield/Yorkfield-6M: 2x2C); 45 nm;
Wolfdale: 410 mtrs/107 mm2, Yorkfield: 2x410 mtrs/2x107 mm2;
2C: 6 MB L2 (Wolfdale), 4C: 2x6 MB L2 (Yorkfield); FSB: 1066/1333 MT/s; LGA775
• MCH (6/2008): 4 Series (Eaglelake); FSB 1333/1066/800 MT/s;
2 DDR2/DDR3 channels, DDR2: 800 MT/s, DDR3: 1067 MT/s, 2 DIMMs/channel, 16 GB max.
• ICH (6/2008): ICH10

(Averill and Salt Creek are Core 2/Core 2 Quad based, 65 nm; Boulder Creek is Penryn based, 45 nm)
1Q8000: 8/2008

© Sima Dezső, ÓE NIK 458 www.tankonyvtar.hu
4. Intel’s Core 2 and Penryn based DT platforms (4)

Remark

• The X38 chipset may be considered as belonging to the 3 Series family of chipsets.
It does not support vPro.
• Similarly, the X48 chipset may be considered as belonging to the 4 Series family of chipsets.
It does not support vPro.

© Sima Dezső, ÓE NIK 459 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (5)

Features of different platforms are defined by the particular chipset they include.
E.g. different models of the Averill platform (the Core 2/Penryn based platform incorporating the
965 chipset and the ICH8 south bridge) provide the following features [19]:

© Sima Dezső, ÓE NIK 460 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (6)

Basic system architecture of Core 2/Penryn based private consumer oriented


DT platforms

The Core 2/Penryn (2C/2x2C) processor is connected via the FSB to the 965/3-/4-Series MCH,
which drives DDR2 or DDR3 memory depending on the MCH model used.
The MCH is connected to the ICH8/9/10 via DMI and via the C-link1.

1The C-link (Controller link) is needed basically for loading the ME (Management Engine) firmware from the nonvolatile
system flash memory that is attached to the ICH.
The ME is used to implement particular platform features supported.
© Sima Dezső, ÓE NIK 461 www.tankonyvtar.hu
4. Intel’s Core 2 and Penryn based DT platforms (7)

1. Example: The Core 2/Penryn based private consumer oriented Averill DT platform
with the P965 MCH and ICH8 that does not provide an integrated display controller
[20]


© Sima Dezső, ÓE NIK 462 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (8)

Remark

Intel Matrix Storage Technology


• Reconfigures the SATA controller to support RAID 0, 1, 5 and 10.
• Provides increased data protection and disk performance.

Intel Quiet System Technology (QST)


• Aims at reducing system noise and heat through intelligent fan speed control algorithms.
• Formerly designated as the Advanced Fan Speed Control (AFSC)

© Sima Dezső, ÓE NIK 463 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (9)

2. Example: The Core 2/Penryn based private consumer oriented Averill DT platform
with the G965 MCH and ICH8 that provides an integrated display controller [2]


© Sima Dezső, ÓE NIK 464 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (10)

b) Intel’s Core 2 and Penryn based enterprise oriented DT platforms (vPro platforms)

Overview

© Sima Dezső, ÓE NIK 465 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (11)
6/2006: Averill prof. (vPro) | 6/2007: Weybridge (vPro) | 6/2008: McCreary (vPro)

Averill professional (6/2006)
• Processors (7/2006): X6800, E6xxx, E4xxx (Conroe, steppings B2/G0; Allendale, steppings L2/M0);
Core 2 Extreme 2C, Core 2 Duo 2C; 65 nm; Conroe: 291 mtrs/143 mm2, Allendale: 167 mtrs/111 mm2;
2/4 MB L2; FSB: X6800/E6xxx: 1066 MT/s, E4xxx: 800 MT/s; LGA775
• MCH (6/2006): Q965 (Broadwater); FSB 1066/800/566 MT/s;
2 DDR2 channels, DDR2: 800/666 MT/s, 2 DIMMs/channel, 8 GB max.
• ICH (6/2006): ICH8

Weybridge (6/2007)
• Processors (11/2006): QX6xxx, Q6xxx (Kentsfield); Core 2 Extreme Quad 2x2C, Core 2 Quad 2x2C;
65 nm; 2x291 mtrs/2x143 mm2; 2x4 MB L2; FSB: 1066 MT/s; LGA775
• MCH (6/2007): Q35 (Bearlake); FSB 1333/1066/800 MT/s;
2 DDR2/DDR3 channels, DDR2: 666/800 MT/s, DDR3: 800/1066 MT/s, 2 DIMMs/channel, 8 GB max.
• ICH (6/2007): ICH9

McCreary (6/2008)
• Processors (11/2007-3/2008): E8xxx, E7xxx, QX9xxx, Q9xxx, Q8xxx1
(Wolfdale/Wolfdale-3M: 2C; Yorkfield/Yorkfield-6M: 2x2C); 45 nm;
Wolfdale: 410 mtrs/107 mm2, Yorkfield: 2x410 mtrs/2x107 mm2;
2C: 6 MB L2 (Wolfdale), 4C: 2x6 MB L2 (Yorkfield); FSB: 1066/1333 MT/s; LGA775
• MCH (6/2008): Q45 (Eaglelake); FSB 1333/1066/800 MT/s;
2 DDR2/DDR3 channels, DDR2: 800 MT/s, DDR3: 1067 MT/s, 2 DIMMs/channel, 16 GB max.
• ICH (6/2008): ICH10

(Averill prof. and Weybridge are Core 2/Core 2 Quad based, 65 nm; McCreary is Penryn based, 45 nm)
1Q8000: 8/2008

© Sima Dezső, ÓE NIK 466 www.tankonyvtar.hu
4. Intel’s Core 2 and Penryn based DT platforms (12)

Basic system architecture of Core 2/Penryn based enterprise oriented DT platforms

The Core 2/Penryn (2C/2x2C) processor is connected via the FSB to the Q965/Q35/Q45 MCH,
which drives DDR2 or DDR3 memory depending on the MCH model used.
The MCH is connected to the ICH8/9/10 via DMI and via the C-link1.

1The C-link (Controller link) is needed basically for loading the ME (Management Engine) firmware from the nonvolatile
system flash memory that is attached to the ICH.
The ME is used to implement particular platform features, such as AMT.

© Sima Dezső, ÓE NIK 467 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (13)

1. Example: The Core 2/Penryn based enterprise oriented Averill DT platform


with the Q965 MCH and ICH8 that does not provide an integrated display controller [2]


© Sima Dezső, ÓE NIK 468 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (14)

2. Example: The Core 2/Penryn based enterprise oriented Weybridge DT platform


with the Q35 MCH and ICH9 that does not provide an integrated display controller [21]


© Sima Dezső, ÓE NIK 469 www.tankonyvtar.hu


4. Intel’s Core 2 and Penryn based DT platforms (15)

Remark

• The Weybridge platform is the 2. gen. (Q35 based) vPro platform.


• It provides greatly enhanced features vs the previous 1. gen. vPro platform that is
the (Q965 based) Averill professional vPro platform.
Major enhancements include
• Virtualization Technology for Directed I/O (VT-d)
• Trusted Execution Technology (TXT)
• Turbo Memory Technology (TM)
• AMT version 3.0 vs AMT version 2.0.

© Sima Dezső, ÓE NIK 470 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms

© Sima Dezső, ÓE NIK 471 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms (1)

Intel’s Nehalem based DT platforms

Overview-1

Nehalem based DT platforms

• 1. gen. Nehalem (Bloomfield, 4C) based DT platforms
- Private consumer oriented DT platform: Tylersburg (2008)
• 2. gen. Nehalem (Lynnfield, 4C) based DT platforms
- Private consumer oriented DT platform: Kings Creek (2009)
- Enterprise oriented DT platform: Piketon (2009)

© Sima Dezső, ÓE NIK 472 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms (2)

Overview-2

Tylersburg (11/2008) | Kings Creek (9/2009) | Piketon (vPro)1 (9/2009)

Processors of the Tylersburg platform:
• i7-920-965 (Bloomfield), 1. gen. Nehalem (Nehalem-EP), 4C, 11/2008: 45 nm/731 mtrs/263 mm2;
¼ MB L2/C, 8 MB L3; 1 QPI link; 3 DDR3 channels, 800/1066 MT/s, 2 DIMMs/channel, 24 GB max.; LGA-1366
• i7-970-990X (Gulftown), Westmere, 6C, 3/2010: 32 nm/1170 mtrs/240 mm2;
¼ MB L2/C, 12 MB L3; 1 QPI link; 3 DDR3 channels, 800/1066 MT/s, 2 DIMMs/channel, 24 GB max.; LGA-1366

Processors of the Kings Creek and Piketon platforms:
• i7-8xx/i5-7xx (Lynnfield), 2. gen. Nehalem, 4C, 9/2009: 45 nm/774 mtrs/296 mm2;
¼ MB L2/C, 8 MB L3; 1 DMI link + 1 FDI (for B/H/Q PCHs); 2 DDR3 channels, 1066/1333 MT/s,
2 DIMMs/channel, 16 GB max.; LGA-1156
• i5-6xx/i3-5xx (Clarkdale), Westmere, 2C+G, 1/2010: 32 nm/384 mtrs/81 mm2;
¼ MB L2/C, 4 MB L3; 1 DMI link + 1 FDI (for B/H/Q PCHs); 2 DDR3 channels, 1066/1333 MT/s,
2 DIMMs/channel, 16 GB max.; LGA-1156

Chipsets:
• Tylersburg: X58 IOH (Tylersburg), 11/2008: 1 QPI link/1 DMI link, 36x PCIe 2. gen.;
the ICH10 (6/2008) is attached via DMI
• Kings Creek/Piketon: 5 Series PCH (Ibex Peak), 9/2009, akin to the 34xx chipset:
1 DMI link/PECI, 6-8x PCIe 2. gen.

1Needs the Q57 PCH

© Sima Dezső, ÓE NIK 473 www.tankonyvtar.hu
5. Intel’s Nehalem based DT platforms (3)

a) 1. gen. Nehalem (Bloomfield, 4C) based private consumer oriented Tylersburg


DT platform

Basic system architecture of the platform

A 1. gen. Nehalem (Bloomfield, 4C) or Westmere (Gulftown, 6C) processor with three DDR3
channels is connected via QPI to the X58 IOH, which is connected to the ICH10 via DMI and
via the C-link1.

1The C-link (Controller link) is needed basically for loading the ME (Management Engine) firmware from the nonvolatile
system flash memory that is attached to the ICH.
The ME is used to implement particular platform features supported.

© Sima Dezső, ÓE NIK 474 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms (4)

Example: 1. gen. Nehalem (Bloomfield, 4C) based private consumer oriented Tylersburg
DT platform [22]

2 DIMMs/channel
2 DIMMs/channel
2 DIMMs/channel

Processor:
• Nehalem-EP
(Bloomfield, 4C)
• Westmere-EP
(Gulftown, 6C)

Remark: The platform shown does not include an integrated display controller
© Sima Dezső, ÓE NIK 475 www.tankonyvtar.hu
5. Intel’s Nehalem based DT platforms (5)

[23]

© Sima Dezső, ÓE NIK 476 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms (6)

b) 2. gen. Nehalem (Lynnfield 4C) based DT platforms

These platforms introduced a new kind of system architecture that consists only of two chips.

© Sima Dezső, ÓE NIK 477 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms (7)

Introduction of Intel’s 2-chip DT chipset solution

Based on their 5 Series PCH (Platform Controller Hub), introduced in 9/2009 [24]

[Figure from [24]: the previous 3-chip solution (processor, MCH, ICH) is replaced by a 2-chip
solution consisting of the processor and the 5 Series PCH; dedicated graphics is still provided
via a graphics card.]

© Sima Dezső, ÓE NIK 478 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms (8)

Remarks

The 5 Series (Ibex Peak) PCH family covers all the


• mobile
• desktop and the
• UP and DP server

segments.

For each segment Intel provides a number of PCH lines for different use, among others
• the X line for extreme performance for home use (mostly for gamers),
• the P line for home use,
• the Q and H lines for business use etc.

Each line comprises usually a number of models with different feature sets.

E.g. the desktop segment includes the following line and models with the feature sets given.

© Sima Dezső, ÓE NIK 479 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms (9)

Supported features of different desktop models of the series 5 PCH family

5 series datasheet
© Sima Dezső, ÓE NIK 480 www.tankonyvtar.hu
5. Intel’s Nehalem based DT platforms (10)

Basic system architecture of the 2. gen. Nehalem based DT platforms

2. gen. Nehalem (Lynnfield, 4C) based DT platforms
• Private consumer oriented DT platform: Kings Creek (2009)
• Enterprise oriented DT platform: Piketon (2009)

In both platforms a 2. gen. Nehalem (Lynnfield, 4C) or Westmere (Clarkdale, 2C+G) processor
with two DDR3-1333 channels is connected to the PCH via DMI and FDI1:
• Kings Creek: 5-Series PCH
• Piketon: Q57 PCH (with AMT 6.0)

1FDI is needed for an integrated display controller (included in all 5 Series PCHs except the P55)

© Sima Dezső, ÓE NIK 481 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms (11)

Note

In the 2-chip system architecture the PCH includes the ME (Manageability Engine) and it is also
connected to the nonvolatile system flash memory that keeps the microcode to be read at
boot time.
So there is no need for an extra interface for loading the ME from the nonvolatile system memory,
as was the case in the previous 3-chip system architecture.

© Sima Dezső, ÓE NIK 482 www.tankonyvtar.hu


5. Intel’s Nehalem based DT platforms (12)

Example 1: 2. gen. Nehalem (Lynnfield, 4C) based private consumer oriented


DT platform; the Kings Creek platform [25]


Remark: It does not include an integrated display controller


© Sima Dezső, ÓE NIK 483 www.tankonyvtar.hu
5. Intel’s Nehalem based DT platforms (13)

Example 2: 2. gen. Nehalem (Lynnfield, 4C) based enterprise oriented DT platforms;


the Piketon platform [26]

FDI: Flexible Display Interface; it is needed for an integrated display controller


© Sima Dezső, ÓE NIK 484 www.tankonyvtar.hu
5. Intel’s Nehalem based DT platforms (14)

Remark

Intel Rapid Storage Technology


• This is an enhanced version of Intel’s previous Matrix Storage Technology.
• It configures the SATA controller as a RAID controller that supports RAID 0/1/5/10 resulting
in a more effective and secure data access.

© Sima Dezső, ÓE NIK 485 www.tankonyvtar.hu


6. Intel’s Sandy Bridge based DT platforms

© Sima Dezső, ÓE NIK 486 www.tankonyvtar.hu


6. Intel’s Sandy Bridge based DT platforms (1)

Intel’s Sandy Bridge based DT platforms – Overview-1

Sandy Bridge based DT platforms

• Private consumer oriented Sandy Bridge based DT platform: Sugar Bay (2011)
• Enterprise oriented Sandy Bridge based DT platform: Sugar Bay (vPro)? (2011)

© Sima Dezső, ÓE NIK 487 www.tankonyvtar.hu


6. Intel’s Sandy Bridge based DT platforms (2)
Overview-2

Sugar Bay (1/2011) | Sugar Bay (vPro) (1/2011)

Processors (1/2011), identical for both platforms:
i7-26xx-29xx, i5-23xx-25xx, i3-21xx-23xx (Sandy Bridge, 2C/4C)
• 4C (12 EUs): 32 nm/915? mtrs/216 mm2; 2C (12 EUs): 32 nm/624? mtrs/149 mm2
• ¼ MB L2/C, up to 8 MB L3
• 1 x DMI2 + 1 x FDI (except with the P67 PCH)
• 2 DDR3 channels, 1066/1333 MT/s, 2 DIMMs/channel, max. 32 GB
• LGA-1155

Chipsets (1/2011):
• Sugar Bay: 6 Series PCH (Cougar Point), akin to the C200 PCH
• Sugar Bay (vPro): Q67 PCH (Cougar Point), akin to the C200 PCH
Both: 1 DMI2 link/PECI, 6-8x PCIe 2. gen., 8 GB/s/lane/direction

(Both platforms are Sandy Bridge based, 32 nm)

© Sima Dezső, ÓE NIK 488 www.tankonyvtar.hu
6. Intel’s Sandy Bridge based DT platforms (3)

Supported features of different models of the 6 Series PCH family

AHCI: Advanced Host Controller Interface Specification for Serial ATA
RST: Rapid Storage Technology
HDMI/DVI/VGA/eDP: different display interfaces
SSD: Solid State Drive
(Source: 6 Series datasheet)

© Sima Dezső, ÓE NIK 489 www.tankonyvtar.hu
6. Intel’s Sandy Bridge based DT platforms (4)

Basic system architecture of Sandy Bridge based DT platforms

In both platforms a Sandy Bridge (2C/4C + G) processor with two DDR3-1333 channels is
connected to the PCH via DMI2 and FDI1:

• Private consumer oriented Sandy Bridge based DT platform, Sugar Bay (2011): 6-Series PCH
• Enterprise oriented Sandy Bridge based DT platform, Sugar Bay (vPro)? (2011): Q67 PCH

1FDI is needed for integrated display controllers (included in all 6 Series PCHs except the P67)

© Sima Dezső, ÓE NIK 490 www.tankonyvtar.hu


6. Intel’s Sandy Bridge based DT platforms (5)

Example 1: Sandy Bridge based private consumer oriented DT platform;


the Sugar Bay platform [27]

© Sima Dezső, ÓE NIK 491 www.tankonyvtar.hu


6. Intel’s Sandy Bridge based DT platforms (6)

Example 2: Sandy Bridge based enterprise oriented DT platform;


the Sugar Bay vPro platform [28]

© Sima Dezső, ÓE NIK 492 www.tankonyvtar.hu


6. Intel’s Sandy Bridge based DT platforms (7)

Remark

The Z68 model supports disk caching that allows an SSD to be used to cache a SATA hard disk.
This technology is designated now as the Smart Response Technology.
It is analogous to Intel’s previous Turbo Memory Technology introduced along with the 3 Series
Chipsets but discontinued with the Series 5 PCHs.

© Sima Dezső, ÓE NIK 493 www.tankonyvtar.hu


7. Overview of distinguished aspects
of the evolution of Intel’s DT platforms

© Sima Dezső, ÓE NIK 494 www.tankonyvtar.hu


7. Overview of distinguished features of the evolution of DT platforms (1)

Overview of distinguished aspects of the evolution of Intel’s DT platforms

a) Evolution of the basic system architecture

• Core 2/Penryn (2C/2x2C) proc. → FSB → 965/3-/4-Series MCH → DMI → ICH8/9/10
• 1. gen. Nehalem (4C)/Westmere (6C) proc. → QPI → X58 IOH → DMI → ICH10
• 2. gen. Nehalem (4C)/Westmere (2C)/Sandy Bridge (4C) proc. → FDI + DMI → 5/6-Series PCH

© Sima Dezső, ÓE NIK 495 www.tankonyvtar.hu


7. Overview of distinguished features of the evolution of DT platforms (2)

b) Evolution of the vPro technology (based on [30])

1. gen. vPro: Averill Prof., Core 2, Q965, AMT 2.0
2. gen. vPro: Weybridge prof., Core 2/Penryn, Q35, AMT 3.0
3. gen. vPro: McCreary, Penryn, Q45, AMT 5.0
4. gen. vPro: Piketon, Nehalem/Westmere, Q57, AMT 6.0
5. gen. vPro: Sugar Bay, Sandy Bridge, Q67, AMT 7.0

© Sima Dezső, ÓE NIK 496 www.tankonyvtar.hu
7. Overview of distinguished features of the evolution of DT platforms (3)

Remark

KVM (Keyboard, Video, Mouse) feature [29]


• It gives maintenance personnel complete remote control over the PC by using keyboard and
mouse.
• KVM allows pre-boot access to the remote PC.
• It allows both hard wired and wireless connections, so any mobile device, even an
iPhone, can be used to access the remote PC.

© Sima Dezső, ÓE NIK 497 www.tankonyvtar.hu


8. References

© Sima Dezső, ÓE NIK 498 www.tankonyvtar.hu


References (1)

[1]: DDR SDRAM Registered DIMM Design Specification, JEDEC Standard No. 21-C,
Page 4.20.4-1, Jan. 2002, http://www.jedec.org

[2]: Product Brief: Intel G965 Express Chipset, 2006,


http://szarka.ssgg.sk/Vyuka/Prednaska-3/2009/Prednaska-3_/G965-prod_brief.pdf

[3]: Wikipedia: Intel X58, 2011, http://en.wikipedia.org/wiki/Intel_X58

[4]: Ahn J.-H., „Memory Design Overview,” March 2007, Hynix,


http://netro.ajou.ac.kr/~jungyol/memory2.pdf

[5]: Allan G., „The outlook for DRAMs in consumer electronics”, EETIMES Europe Online,
01/12/2007, http://eetimes.eu/showArticle.jhtml?articleID=196901366&queryText=calibrated
[6]: Jacob B., Ng S. W., Wang D. T., Memory Systems, Elsevier, 2008

[7]: Ebeling C., Koontz T., Krueger R., „System Clock Management Simplified with Virtex-II
Pro FPGAs”, WP190, Febr. 25 2003, Xilinx,
http://www.xilinx.com/support/documentation/white_papers/wp190.pdf

[8]: Kirstein B., „Practical timing analysis for 100-MHz digital design,”, EDN, Aug. 8, 2002,
www.edn.com

[9]: Izaguirre J., Building and Deploying Better Embedded Systems with Intel Active
Management Technology (Intel AMT), Intel Technology Journal, Vol. 13, Issue 1., 2009,
pp. 84-95
[10]: Technology Brief, 2nd generation Intel Core processor family, Intel Anti-Theft Technology,
2011, http://www.intel.com/technology/anti-theft/anti-theft-tech-brief.pdf
© Sima Dezső, ÓE NIK 499 www.tankonyvtar.hu
References (2)

[11]: Intel Technology – Intel vPro Technology,


https://nmso.mdg.ca/WebManuals/Vx_7_8_English/components/mbd/vpro_amt.htm
[12]: Intel Turbo Memory, Intel Corporation, http://www.intel.com/cd/channel/reseller/apac/
eng/products/mobile/mprod/turbo_memory/396715.htm
[13]: Wikipedia: Out-of-band management, 2011,
http://en.wikipedia.org/wiki/Out-of-band_management

[14]: Intel Lyndon Platform & Future TS, VR-Zone, Dec. 9 2004,
http://vr-zone.com/articles/intel-lyndon-platform--future-ts/1520.html
[15]: Greene J., Intel Trusted Execution Technology, White Paper, 2010,
http://www.intel.com/Assets/PDF/whitepaper/323586.pdf
[16]: Wikipedia: Trusted Execution Technology, 2011,
http://en.wikipedia.org/wiki/Trusted_Execution_Technology
[17]: Intel Turbo Memory supported chipsets, Intel Corporation,
http://www.intel.com/support/chipsets/itm/sb/CS-025854.htm

[18]: Wikipedia: Intel Turbo Memory, 2011, http://en.wikipedia.org/wiki/Intel_Turbo_Memory

[19]: Intel 965 Express Chipset Family Datasheet, July 2006, http://ivanlef0u.fr/repo/ebooks/
intel_manuals/Intel%20%20965%20Express%20Chipset%20Family.pdf

[20]: Intel Core 2 Duo Processor, http://www.intel.com/pressroom/kits/core2duo/

[21]: Product Brief: Intel Q35 and Q33 Express Chipsets, 2007,
http://www.intel.com/Assets/PDF/prodbrief/317312.pdf
© Sima Dezső, ÓE NIK 500 www.tankonyvtar.hu
References (3)

[22]: Product Brief: Intel X58 Express Chipset, 2008,


http://www.intel.com/Assets/PDF/prodbrief/x58-product-brief.pdf
[23]: 2nd Generation Intel Core Processors 30/3/30, 2010,
http://cache-www.intel.com/cd/00/00/46/02/460297_460297.pdf

[24]: Smith S. L., Intel Roadmap Overview, Aug. 20 2008,


http://download.intel.com/pressroom/kits/events/idffall_2008/SSmith_briefing_roadmap.pdf

[25]: Product Brief: Intel P55 Express Chipset, 2009,


http://www.intel.com/Assets/PDF/prodbrief/322641.pdf
[26]: Product Brief: Intel Q57 Express Chipset, 2009,
http://www.intel.com/Assets/PDF/prodbrief/323191.pdf
[27]: Product Brief: Intel P67 Express Chipset and 2nd Generation Intel Core Processors, 2010,
http://www.intel.com/Assets/PDF/prodbrief/324585.pdf
[28]: Product Brief: Intel Q67 Express Chipset, 2011,
http://www.intel.com/Assets/PDF/prodbrief/325518.pdf

[29]: Freeman V., Intel refreshes its vPro platform for 2010, March 9 2010, Hardware Central,
http://www.hardwarecentral.com/features/article.php/3869536/Intel-Refreshes-Its-
vPro-Platform-for-2010.htm
[30]: Marek J., Rasheed Y., Watts L., Technical Overview of Next Generation Intel vPro
Technology, PROS001, 2010

© Sima Dezső, ÓE NIK 501 www.tankonyvtar.hu


References (4)

[31]: Neiger G., Santoni A., Leung F., Rodgers D., Uhlig R., Intel® Virtualization Technology:
Hardware support for efficient processor virtualization, Aug. 10 2006, Vol. 10, Issue 3,
http://www.intel.com/technology/itj/2006/v10i3/1-hardware/1-abstract.htm

[32]: Intel Software Networks: Forums,


http://software.intel.com/en-us/forums/showthread.php?t=56802

[33]: Wikipedia: x86 virtualization, 2011,


http://en.wikipedia.org/wiki/X86_virtualization#Intel_virtualization_.28VT-x.29

© Sima Dezső, ÓE NIK 502 www.tankonyvtar.hu


GPGPUs/DPAs
Overview

Dezső Sima

© Sima Dezső, ÓE NIK 503 www.tankonyvtar.hu


Aim

Aim
Brief introduction and overview.

© Sima Dezső, ÓE NIK 504 www.tankonyvtar.hu


Contents

1.Introduction

2. Basics of the SIMT execution

3. Overview of GPGPUs

4. Overview of data parallel accelerators

5. References

© Sima Dezső, ÓE NIK 505 www.tankonyvtar.hu


1. Introduction

© Sima Dezső, ÓE NIK 506 www.tankonyvtar.hu


1. Introduction (1)

Representation of objects by triangles

Objects are represented by a mesh of triangles, given by their vertices, edges and surfaces.

Vertices
• have three spatial coordinates and
• carry supplementary information necessary to render the object, such as
• color
• texture
• reflectance properties
• etc.

© Sima Dezső, ÓE NIK 507 www.tankonyvtar.hu


1. Introduction (2)

Example: Triangle representation of a dolphin [149]

© Sima Dezső, ÓE NIK 508 www.tankonyvtar.hu


1. Introduction (3)

Main types of shaders in GPUs

Shaders

• Vertex shaders: transform each vertex's 3D position in the virtual space to the 2D coordinate
at which it appears on the screen.
• Pixel shaders (fragment shaders): calculate the color of the pixels.
• Geometry shaders: can add or remove vertices from a mesh.

© Sima Dezső, ÓE NIK 509 www.tankonyvtar.hu


1. Introduction (4)

DirectX version | Pixel SM | Vertex SM | Supporting OS
8.0 (11/2000) | 1.0, 1.1 | 1.0, 1.1 | Windows 2000
8.1 (10/2001) | 1.2, 1.3, 1.4 | 1.0, 1.1 | Windows XP/Windows Server 2003
9.0 (12/2002) | 2.0 | 2.0 |
9.0a (3/2003) | 2_A, 2_B | 2.x |
9.0c (8/2004) | 3.0 | 3.0 | Windows XP SP2
10.0 (11/2006) | 4.0 | 4.0 | Windows Vista
10.1 (2/2008) | 4.1 | 4.1 | Windows Vista SP1/Windows Server 2008
11 (10/2009) | 5.0 | 5.0 | Windows 7/Windows Vista SP1/Windows Server 2008 SP2

Table 1.1: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX
and MS’s OSs [18], [21]

DirectX: Microsoft’s API set for MM/3D


© Sima Dezső, ÓE NIK 510 www.tankonyvtar.hu
1. Introduction (5)

Convergence of important features of the vertex and pixel shader models


Subsequent shader models typically introduce a number of new/enhanced features.
There are differences between the vertex and pixel shader models within subsequent shader
models concerning precision requirements, instruction sets and programming resources.

Shader model 2 [19]


• Different precision requirements
Different data types
• Vertex shader: FP32 (coordinates)
• Pixel shader: FX24 (3 colors x 8)

• Different instructions
• Different resources (e.g. registers)

Shader model 3 [19]


• Unified precision requirements for both shaders (FP32)
with the option to specify partial precision (FP16 or FP24)
by adding a modifier to the shader code
• Different instructions
• Different resources (e.g. registers)

© Sima Dezső, ÓE NIK 511 www.tankonyvtar.hu


1. Introduction (6)

Shader model 4 (introduced with DirectX10) [20]


• Unified precision requirements for both shaders (FP32)
with the possibility to use new data formats.
• Unified instruction set
• Unified resources (e.g. temporary and constant registers)

Shader architectures of GPUs prior to SM4

GPUs prior to SM4 (DirectX 10):


have separate vertex and pixel units with different features.

Drawback of having separate units for vertex and pixel shading


• Inefficiency of the hardware implementation
• (Vertex shaders and pixel shaders often have complementary load patterns [21]).

© Sima Dezső, ÓE NIK 512 www.tankonyvtar.hu


1. Introduction (7)

Unified shader model (introduced in the SM 4.0 of DirectX 10.0)

Unified, programable shader architecture

The same (programmable) processor can be used to implement all shaders;


• the vertex shader
• the pixel shader and
• the geometry shader (new feature of the SMl 4)

© Sima Dezső, ÓE NIK 513 www.tankonyvtar.hu


1. Introduction (8)

Figure 1.1: Principle of the unified shader architecture [22]


© Sima Dezső, ÓE NIK 514 www.tankonyvtar.hu
1. Introduction (9)

Based on its FP32 computing capability and the large number of FP units available,

the unified shader is a prospective candidate for speeding up HPC!

GPUs with unified shader architectures are also termed

GPGPUs
(General Purpose GPUs)

or

cGPUs
(computational GPUs)

© Sima Dezső, ÓE NIK 515 www.tankonyvtar.hu


1. Introduction (10)

Peak FP32/FP64 performance of Nvidia’s GPUs vs Intel’ P4 and Core2 processors [43]

© Sima Dezső, ÓE NIK 516 www.tankonyvtar.hu


1. Introduction (11)

Peak FP32 performance of AMD’s GPGPUs [87]

© Sima Dezső, ÓE NIK 517 www.tankonyvtar.hu


1. Introduction (12)

Evolution of the FP-32 performance of GPGPUs [44]

© Sima Dezső, ÓE NIK 518 www.tankonyvtar.hu


1. Introduction (13)

Evolution of the bandwidth of Nvidia’s GPU’s vs Intel’s P4 and Core2 processors [43]

© Sima Dezső, ÓE NIK 519 www.tankonyvtar.hu


1. Introduction (14)

Figure 1.2: Contrasting the utilization of the silicon area in CPUs and GPUs [11]

• Less area for control since GPGPUs have simplified control (same instruction for
all ALUs)
• Less area for caches since GPGPUs support massive multithreading to hide the
latency of long operations, such as memory accesses in case of cache misses.

© Sima Dezső, ÓE NIK 520 www.tankonyvtar.hu


2. Basics of the SIMT execution

© Sima Dezső, ÓE NIK 521 www.tankonyvtar.hu


2. Basics of the SIMT execution (1)

Main alternatives of data parallel execution models

Data parallel execution models

SIMD execution
• One dimensional data parallel execution, i.e. it performs the same operation
on all elements of given FX/FP input vectors.
• Needs an FX/FP SIMD extension of the ISA.
• E.g. 2. and 3. generation superscalars.

SIMT execution
• Two dimensional data parallel execution, i.e. it performs the same operation
on all elements of a given FX/FP input array (matrix).
• Is massively multithreaded, and provides data dependent flow control as well as
barrier synchronization.
• Assumes an entirely new specification that is done at the virtual machine level
(pseudo ISA level).
• E.g. GPGPUs, data parallel accelerators.
superscalars data parallel accelerators

Figure 2.1: Main alternatives of data parallel execution


© Sima Dezső, ÓE NIK 522 www.tankonyvtar.hu
2. Basics of the SIMT execution (2)

Remarks
1) SIMT execution is also termed SPMD (Single Program Multiple Data) execution (Nvidia).
2) The SIMT execution model is a low level execution model that needs to be complemented
with further models, such as the model of computational resources or the memory model,
not discussed here.

© Sima Dezső, ÓE NIK 523 www.tankonyvtar.hu


2. Basics of the SIMT execution (3)

Specification levels of GPGPUs


GPGPUs are specified at two levels
• at a virtual machine level (pseudo ISA level, pseudo assembly level, intermediate level) and
• at the object code level (real GPGPU ISA level).

HLL

Virtual machine
level

Object code
level

© Sima Dezső, ÓE NIK 524 www.tankonyvtar.hu


2. Basics of the SIMT execution (4)

The process of program development


Becomes a two-phase process
• Phase 1: Compiling the HLL application to pseudo assembly code

• Nvidia: the HLL application (written in CUDA or OpenCL) is compiled by the HLL compiler
(nvcc/nvopencc) to the PTX pseudo assembly code (compatible code).
• AMD: the HLL application (written in Brook+ or OpenCL) is compiled by the HLL compiler
(brcc) to the AMD IL pseudo assembly code (compatible code).

The compiled pseudo ISA code (PTX code/IL code) remains independent from the
actual hardware implementation of a target GPGPU, i.e. it is portable over different
GPGPU families.
Compiling a PTX/IL file to a GPGPU that misses features supported by the particular PTX/IL
version however, may need emulation for features not implemented in hardware.
This slows down execution.
© Sima Dezső, ÓE NIK 525 www.tankonyvtar.hu
2. Basics of the SIMT execution (5)

The process of program development-2

• Phase 2: Compiling the pseudo assembly code to GPU specific binary code

• Nvidia: the PTX pseudo assembly code (compatible code) is compiled by the pseudo
assembly-to-GPU compiler (the CUDA driver) to the GPU specific binary code
(a CUBIN file, GPU bound).
• AMD: the AMD IL pseudo assembly code (compatible code) is compiled by the CAL compiler
to the target binary (GPU bound).

The object code (GPGPU code, e.g. a CUBIN file) is forward portable, but forward portability
is provided typically only within major GPGPU versions, such as Nvidia's compute capability
versions 1.x or 2.x.
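
As an illustration of the two-phase process on the Nvidia side, the minimal sketch below shows a
trivial CUDA kernel and, in its comments, how it could be compiled first to PTX and then to a GPU
specific binary with nvcc. The file names and the sm_20 target are arbitrary assumptions chosen
for the example; they are not taken from the lecture material.

```cuda
// vecadd.cu: minimal kernel used only to illustrate the two compilation phases
//
// Phase 1 (HLL -> pseudo assembly):  nvcc -ptx vecadd.cu -o vecadd.ptx
// Phase 2 (pseudo assembly -> GPU):  nvcc -cubin -arch=sm_20 vecadd.cu -o vecadd.cubin
// (In everyday use the CUDA driver performs phase 2 just-in-time from the PTX code.)

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];                          // one element per thread
}
```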

© Sima Dezső, ÓE NIK 526 www.tankonyvtar.hu


2. Basics of the SIMT execution (6)

Benefits of the portability of the pseudo assembly code

• The compiled pseudo ISA code (PTX code/IL code) remains independent from the
actual hardware implementation of a target GPGPU, i.e. it is portable over subsequent
GPGPU families.
Forward portability of the object code (GPGPU code, e.g. CUBIN code) is provided however,
typically only within major versions.
• Compiling a PTX/IL file to a GPGPU that misses features supported by the particular PTX/IL
version however, may need emulation for features not implemented in hardware.
This slows down execution.
• Portability of pseudo assembly code (Nvidia’s PTX code or AMD’s IL code) is highly
advantageous in the recent rapid evolution phase of GPGPU technology as it results in
less costs for code refactoring.
Code refactoring costs are a kind of software maintenance costs that arise when the user
switches from a given generation to a subsequent GPGPU generation (like from GT200
based devices to GF100 or GF110-based devices) or to a new software environment
(like from CUDA 1.x SDK to CUDA 2.x or from CUDA 3.x SDK to CUDA 4.x SDK).

© Sima Dezső, ÓE NIK 527 www.tankonyvtar.hu


2. Basics of the SIMT execution (7)

Remark
The virtual machine concept underlying both Nvidia’s and AMD’s GPGPUs is similar to
the virtual machine concept underlying Java.
• For Java there is also an inherent pseudo ISA definition, called the Java bytecode.
• Applications written in Java will first be compiled to the platform independent Java bytecode.
• The Java bytecode will then either be interpreted by the Java Runtime Environment (JRE)
installed on the end user’s computer or compiled at runtime by the Just-In-Time (JIT)
compiler of the end user.

© Sima Dezső, ÓE NIK 528 www.tankonyvtar.hu


2. Basics of the SIMT execution (8)

Specification of GPGPU computing at the virtual machine level


At the virtual machine level GPGPU computing is specified by
• the SIMT computational model and
• the related pseudo ISA of the GPGPU.

© Sima Dezső, ÓE NIK 529 www.tankonyvtar.hu


2. Basics of the SIMT execution (9)

The SIMT computational model


It covers the following three abstractions

• the model of computational resources,
• the memory model and
• the model of SIMT execution.

Figure 2.2: Key abstractions of the SIMT computational model

© Sima Dezső, ÓE NIK 530 www.tankonyvtar.hu


2. Basics of the SIMT execution (10)

1. The model of computational resources


It specifies the computational resources available at virtual machine level (the pseudo ISA level).
• Basic elements of the computational resources are SIMT cores.
• SIMT cores are specific SIMD cores, i.e. SIMD cores enhanced for efficient multithreading.
Efficient multithreading means zero-cycle penalty context switches, to be discussed later.

First, let’s discuss the basic structure of the underlying SIMD cores.
SIMD cores execute the same instruction stream on a number of ALUs (e.g. on 32 ALUs),
i.e. all ALUs perform typically the same operations in parallel.

Fetch/Decode

SIMD core
ALU ALU ALU ALU ALU

Figure 2.3: Basic structure of the underlying SIMD cores

ALUs operate in a pipelined fashion, to be discussed later.

© Sima Dezső, ÓE NIK 531 www.tankonyvtar.hu


2. Basics of the SIMT execution (11)

SIMD ALUs operate according to the load/store principle, like RISC processors i.e.
• they load operands from the memory,
• perform operations in the “register space” i.e.
• they take operands from the register file,
• perform the prescribed operations and
• store operation results again into the register file, and
• store (write back) final results into the memory.

The load/store principle of operation takes for granted the availability of a register file (RF)
for each ALU.

Load/Store
Memory RF

ALU

Figure 2.4: Principle of operation of a SIMD ALU

© Sima Dezső, ÓE NIK 532 www.tankonyvtar.hu


2. Basics of the SIMT execution (12)

As a consequence of the chosen principle of execution each ALU is allocated a register file (RF)
that is a number of working registers.

Fetch/Decode

ALU ALU ALU ALU ALU ALU

RF RF RF RF RF RF

Figure 2.5: Main functional blocks of a SIMD core

© Sima Dezső, ÓE NIK 533 www.tankonyvtar.hu


2. Basics of the SIMT execution (13)

Remark
The register sets (RF) allocated to each ALU are actually parts of a large enough register file.

RF RF RF RF RF RF

ALU ALU ALU ALU ALU ALU


ALU ALU

Figure 2.6: Allocation of distinct parts of a large register file to the private register sets of the ALUs

© Sima Dezső, ÓE NIK 534 www.tankonyvtar.hu


2. Basics of the SIMT execution (14)

Basic operations of the underlying SIMD ALUs

• They execute basically FP32 Multiply-Add instructions of the form

a x b + c,

• and are pipelined, i.e.
• capable of starting a new operation every new clock cycle
(more precisely, every new shader clock cycle), and
• need a few clock cycles, e.g. 2 or 4 shader cycles,
to present the results of the FP32 Multiply-Add operations to the RF.

Without further enhancements
the peak performance of the ALUs is 2 FP32 operations/cycle.
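
A minimal CUDA sketch of this basic operation and of the resulting peak figure; the device
parameters used in the comment (240 ALUs at a 1.3 GHz shader clock) are an assumed example
and are not taken from the slides.

```cuda
// Each ALU executes one FP32 multiply-add (a*b + c) per shader cycle.
__global__ void madKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = fmaf(a[i], b[i], c[i]);   // a*b + c, counted as 2 FP32 operations
}

// Peak FP32 throughput = number of ALUs x 2 FLOP x shader clock,
// e.g. 240 ALUs x 2 FLOP x 1.3 GHz = 624 GFLOPS (assumed example device).
```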

© Sima Dezső, ÓE NIK 535 www.tankonyvtar.hu


2. Basics of the SIMT execution (15)

Beyond the basic operations the SIMD cores provide a set of further computational capabilities,
such as

• FX32 operations,
• FP64 operations,
• FX/FP conversions,
• single precision trigonometric functions (to calculate reflections, shading etc.).

Note
Computational capabilities specified at the pseudo ISA level (intermediate level) are
• typically implemented in hardware.
Nevertheless, it is also possible to implement some compute capabilities
• by firmware (i.e. microcoded), or
• even by emulation during the second phase of compilation.

© Sima Dezső, ÓE NIK 536 www.tankonyvtar.hu


2. Basics of the SIMT execution (16)

Enhancing SIMD cores to SIMT cores

SIMT cores are enhanced SIMD cores that provide an effective support of multithreading

Aim of multithreading in GPGPUs

Speeding up computations by eliminating thread stalls due to long latency operations.


Achieved by suspending stalled threads from execution and allocating free computational
resources to runnable threads.
This allows laying less emphasis on the implementation of sophisticated cache systems
and utilizing the silicon area saved in this way (used otherwise for implementing caches)
for performing computations.
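
A rough, illustrative estimate of the required degree of multithreading (the latency and issue
figures below are assumed round numbers, not values taken from the slides): if a global memory
access stalls a thread for about 400 cycles and a new operation can be issued every 4 cycles, then
approximately

400 cycles / 4 cycles per issue ≈ 100

runnable threads are needed to keep the ALUs busy while the stalled thread waits, which is why
GPGPUs maintain hundreds to thousands of threads in flight.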

© Sima Dezső, ÓE NIK 537 www.tankonyvtar.hu


2. Basics of the SIMT execution (17)

Effective implementation of multithreading


requires that thread switches, called context switches, do not cause cycle penalties.

Achieved by
• providing and maintaining separate contexts for each thread, and
• implementing a zero-cycle context switch mechanism.

© Sima Dezső, ÓE NIK 538 www.tankonyvtar.hu


2. Basics of the SIMT execution (18)

SIMT cores

= SIMD cores with per thread register files (designated as CTX in the figure)

[Figure: a SIMT core with a Fetch/Decode unit and a row of ALUs; the register file (RF) provides a
separate context (CTX) per thread, and a context switch selects the actual context.]

Figure 2.7: SIMT cores are specific SIMD cores providing separate thread contexts for each thread

© Sima Dezső, ÓE NIK 539 www.tankonyvtar.hu


2. Basics of the SIMT execution (19)

The final model of computational resources of GPGPUs at the virtual machine level
The GPGPU is assumed to have a number of SIMT cores and is connected to the host.

[Figure: the host connected to a GPGPU consisting of several SIMT cores, each with a
Fetch/Decode unit and a number of ALUs.]

Figure 2.8: The model of computational resources of GPGPUs

During SIMT execution 2-dimensional matrices will be mapped to the available SIMT cores.
© Sima Dezső, ÓE NIK 540 www.tankonyvtar.hu
2. Basics of the SIMT execution (20)

Remarks
1) The final model of computational resources of GPGPUs at the virtual machine level is similar
to the platform model of OpenCL, given below assuming multiple cards.

ALU

Card

SIMT core

Figure 2.9: The Platform model of OpenCL [144]

© Sima Dezső, ÓE NIK 541 www.tankonyvtar.hu


2. Basics of the SIMT execution (21)

2) Real GPGPU microarchitectures reflect the model of computational resources discussed


at the virtual machine level.

Figure 2.10: Simplified block diagram of the Cayman core (that underlies the HD 69xx series) [99]
© Sima Dezső, ÓE NIK 542 www.tankonyvtar.hu
2. Basics of the SIMT execution (22)

3) Different manufacturers designate SIMT cores differently, such as

• streaming multiprocessor (Nvidia),


• superscalar shader processor (AMD),
• wide SIMD processor, CPU core (Intel).

© Sima Dezső, ÓE NIK 543 www.tankonyvtar.hu


2. Basics of the SIMT execution (23)

The memory model

The memory model at the virtual machine level declares all data spaces available at this level
along with their features, like their accessibility, access mode (read or write) access width etc.

Key components of available data spaces at the virtual machine level

Available data spaces

Register space Memory space

Per thread Local memory Constant memory Global memory


register file

(Local Data Share) (Constant Buffer) (Device memory)

Figure 2.11: Overview of available data spaces in GPGPUs

© Sima Dezső, ÓE NIK 544 www.tankonyvtar.hu


2. Basics of the SIMT execution (24)

Per thread register files


• Provide the working registers for the ALUs.
• They are private, per thread data spaces available for the execution of threads,
which is a prerequisite of zero-cycle context switches.

[Figure: a SIMT core with per-ALU register files, an instruction unit and Local Memory; the
Constant and Global Memory lie outside the core.]

Figure 2.12: Key components of available data spaces at the level of SIMT cores

© Sima Dezső, ÓE NIK 545 www.tankonyvtar.hu


2. Basics of the SIMT execution (25)

Local memory
• On-die R/W data space that is accessible from all ALUs of a particular SIMT core.
• It allows sharing of data for the threads that are executing on the same SIMT core.


Figure 2.13: Key components of available data spaces at the level of SIMT cores

© Sima Dezső, ÓE NIK 546 www.tankonyvtar.hu


2. Basics of the SIMT execution (26)

Constant Memory
• On-die Read only data space that is accessible from all SIMT cores.
• It can be written from the system memory and is used to provide constants for all threads
that are valid for the duration of a kernel execution, with low access latency.

[Figure: a GPGPU with several SIMT cores, each with its own register files, ALUs and Local
Memory; the Constant Memory and the Global Memory are shared by all SIMT cores.]

Figure 2.14: Key components of available data spaces at the level of the GPGPU
© Sima Dezső, ÓE NIK 547 www.tankonyvtar.hu
2. Basics of the SIMT execution (27)

Global Memory
• Off-die R/W data space that is accessible for all SIMT cores of a GPGPU.
• It can be accessed from the system memory and is used to hold all instructions and data
needed for executing kernels.


Figure 2.15: Key components of available data spaces at the level of the GPGPU
© Sima Dezső, ÓE NIK 548 www.tankonyvtar.hu
2. Basics of the SIMT execution (28)

Remarks
1. AMD introduced Local memories, designated as Local Data Share, only along with their
RV770-based HD 4xxx line in 2008.
2. Beyond the key data space elements available at the virtual machine level, discussed so far,
there may be also other kinds of memories declared at the virtual machine level,
such as AMD’s Global Data Share, an on-chip Global memory introduced with their
RV770-based HD 4xxx line in 2008.
3. Traditional caches are not visible at the virtual machine level, as they are transparent for
program execution.
Nevertheless, more advanced GPGPUs allow an explicit cache management at the
virtual machine level, by providing e.g. data prefetching.
In these cases the memory model needs to be extended with these caches accordingly.
4. Max. sizes of particular data spaces are specified by the related instruction formats
of the intermediate language.
5. Actual sizes of particular data spaces are implementation dependent.
6. Nvidia and AMD designate the different kinds of their data spaces differently, as shown below.

Nvidia AMD

Register file Registers General Purpose Registers


Local Memory Shared Memory Local Data Share
Constant Memory Constant Memory Constant Register
Global memory Global Memory Device memory
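
In CUDA C these data spaces appear as variable qualifiers; a minimal sketch follows
(the kernel and variable names are illustrative, not taken from the cited sources;
the block is assumed to have at most 256 threads):

// Global memory (Device memory): R/W, accessible to all threads and to the host
__device__ float d_table[256];

// Constant memory: read only for kernels, written by the host
__constant__ float d_coeff[16];

__global__ void scaleTable(float *out)       // out also resides in Global memory
{
    __shared__ float tile[256];              // Local memory (Shared Memory / Local Data Share)
    int i = threadIdx.x;                     // automatic variables use the per-thread register file
    tile[i] = d_table[i] * d_coeff[0];
    out[i] = tile[i];
}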

© Sima Dezső, ÓE NIK 549 www.tankonyvtar.hu


2. Basics of the SIMT execution (29)

Example 1: The platform model of PTX vers. 2.3 [147]


Nvidia

A set of SIMT cores


with on-chip
shared memory

A set of ALUs
within the
SIMT cores

© Sima Dezső, ÓE NIK 550 www.tankonyvtar.hu


2. Basics of the SIMT execution (30)

Example 2: Data spaces in AMD’s IL vers. 2.0 (simplified)


Data space Access type Available Remark

Default: (127-2)*4
General Purpose Registers R/W Per ALU 2*4 registers are reserved as Clause
Temporary Registers

On-chip memory that enables sharing of


Local Data Share (LDS) R/W Per SIMD core data between threads executing on a
particular SIMT core

128 x 128 bit


Constant Register (CR) R Per GPGPU
Written by the host

On-chip memory that enables


Global Data Share R/W Per GPGPU sharing of data between threads
executing on a GPGPU

Device Memory R/W GPGPU Read or written by the host

Table 2.1: Available data spaces in AMD’s IL vers. 2.0 [107]

Remarks
• Max. sizes of data spaces are specified along with the instructions formats of the
intermediate language.
• The actual sizes of the data spaces are implementation dependent.

© Sima Dezső, ÓE NIK 551 www.tankonyvtar.hu


2. Basics of the SIMT execution (31)

Example: Simplified block diagram of the Cayman core (that underlies the HD 69xx series) [99]

© Sima Dezső, ÓE NIK 552 www.tankonyvtar.hu


2. Basics of the SIMT execution (32)

The SIMT execution model

Key components of the SIMT execution model

SIMT execution model

• Multi-dimensional domain of execution
• Massive multithreading
• The kernel concept
• Concept of assigning work to execution pipelines
• The model of data sharing
• Data dependent flow control
• Barrier synchronization
• Communication between threads

© Sima Dezső, ÓE NIK 553 www.tankonyvtar.hu


2. Basics of the SIMT execution (33)

1. Multi-dimensional domain of execution

Domain of execution: index space of the execution

Scalar execution
Domain of execution: scalars, no indices
Objects of execution: single data elements
Supported by all processors

SIMD execution
Domain of execution: one-dimensional index space
Objects of execution: data elements of vectors
Supported by 2.G/3.G superscalars

SIMT execution (assuming a 2-dimensional index space)
Domain of execution: two-dimensional index space
Objects of execution: data elements of matrices
Supported by GPGPUs/DPAs

Figure 2.16: Domains of execution in case of scalar, SIMD and SIMT execution

© Sima Dezső, ÓE NIK 554 www.tankonyvtar.hu


2. Basics of the SIMT execution (34)

2. Massive multithreading

For each element of the index space, called the execution domain, the programmer creates
parallel executable threads that will be executed by the GPGPU or DPA.

Threads
(work items)

The same instructions


will be executed
for all elements of the
domain of execution

Domain of
execution

Figure 2.17: Parallel executable threads created and executed for each element of an execution
domain

© Sima Dezső, ÓE NIK 555 www.tankonyvtar.hu


2. Basics of the SIMT execution (35)

3. The kernel concept-1

The programmer describes the set of operations to be done over the entire domain of execution
by kernels.

Threads
(work items)

The same instructions


will be executed
for all elements of the
Operations to be done domain of execution
over the entire
domain of execution
are described Domain of
by a kernel execution

Figure 2.18: Interpretation of the kernel concept

Kernels are specified at the HLL level and compiled to the intermediate level.

© Sima Dezső, ÓE NIK 556 www.tankonyvtar.hu


2. Basics of the SIMT execution (36)

The kernel concept-2


Dedicated HLLs like OpenCL or CUDA C allow the programmer to define kernels that,
when called, are executed n-times in parallel by n different threads,
as opposed to only once like regular C functions.

Specification of kernels
• A kernel is defined by

• using a declaration specifier (like __kernel in OpenCL or __global__ in CUDA C) and


• declaring the instructions to be executed.

• Each thread that executes the kernel is given a unique identifier (thread ID, Work item ID)
that is accessible within the kernel.

© Sima Dezső, ÓE NIK 557 www.tankonyvtar.hu


2. Basics of the SIMT execution (37)

Sample codes for kernels


The subsequent sample codes illustrate two kernels that add two vectors (a/A) and (b/B)
and store the result into vector (c/C).

CUDA C [43] OpenCL [144]

Remark
During execution each thread is identified by a unique identifier that is
• int i in case of CUDA C, accessible through the threadIdx variable, and
• int id in case of OpenCL, accessible through the built-in get_global_id() function.
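
As the original listings are reproduced here only as figures, a minimal CUDA C sketch of
such a vector addition kernel is given below (kernel and variable names are illustrative):

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // each thread processes the single element identified by its unique thread ID
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}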

© Sima Dezső, ÓE NIK 558 www.tankonyvtar.hu


2. Basics of the SIMT execution (38)

Invocation of kernels
The kernel is invoked differently in CUDA C and OpenCL:
• In CUDA C
by specifying the name of the kernel and the domain of execution [43], as sketched below.

• In OpenCL
by specifying the name of the kernel and the related configuration arguments, not detailed
here [144].
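
A minimal CUDA C sketch of such an invocation, assuming the vecAdd kernel sketched
earlier; d_a, d_b and d_c are assumed to be device pointers already allocated and
filled by the host:

void launchVecAdd(float *d_a, float *d_b, float *d_c)
{
    int n = 512 * 512;                         // size of the domain of execution
    int threadsPerBlock = 256;                 // size of one work allocation unit (thread block)
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    // kernel name plus the execution configuration (domain of execution) between <<< >>>
    vecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
}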

© Sima Dezső, ÓE NIK 559 www.tankonyvtar.hu


2. Basics of the SIMT execution (39)

4. Concept of assigning work to execution pipelines of the GPGPU

Typically a four step process

a) Segmenting the domain of execution to work allocation units


b) Assigning work allocation units to SIMT cores for execution
c) Segmenting work allocation units into work scheduling units to be executed on the
execution pipelines of the SIMT cores
d) Scheduling work scheduling units for execution to the execution pipelines of the SIMT cores

© Sima Dezső, ÓE NIK 560 www.tankonyvtar.hu


2. Basics of the SIMT execution (40)

4.a Segmenting the domain of execution to work allocation units-1

• The domain of execution will be broken down into equal sized ranges, called
work allocation units (WAUs), i.e. units of work that will be allocated to the SIMT cores
as an entity.
Domain of execution Domain of execution
Global size m Global size m

WAU WAU

Global size n
Global size n

(0,0) (0,1)

WAU WAU
(1,0) (1,1)

Figure 2.19: Segmenting the domain of execution to work allocation units (WAUs)

E.g. Segmenting a 512 x 512 sized domain of execution into four 256 x 256 sized
work allocation units (WAUs).
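
A hedged CUDA C sketch of expressing such a segmentation; note that the 256 x 256 WAU above
only illustrates the principle, since real thread blocks are limited to 512 (compute
capability 1.x) or 1024 (2.x) threads, so a 16 x 16 block is used here:

dim3 block(16, 16);                          // size of one work allocation unit (thread block)
dim3 grid((512 + block.x - 1) / block.x,     // number of WAUs needed to cover the
          (512 + block.y - 1) / block.y);    // 512 x 512 domain of execution: 32 x 32 blocks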

© Sima Dezső, ÓE NIK 561 www.tankonyvtar.hu


2. Basics of the SIMT execution (41)

4.a Segmenting the domain of execution to work allocation units-2

Domain of execution Domain of execution


Global size m Global size m

WAU WAU

Global size n
Global size n

(0,0) (0,1)

WAU WAU
(1,0) (1,1)

Figure 2.20: Segmenting the domain of execution to work allocation units (WAUs)

• Work allocation units may be executed in parallel on available SIMT cores.


• The way a domain of execution is segmented into work allocation units
is implementation specific; it can be done either by the programmer or by the HLL compiler.
Remark
Work allocation units are designated
by Nvidia as Thread blocks and
by AMD as Thread blocks (Pre-OpenCL term) or Work Groups or Workgroups (OpenCL term).
© Sima Dezső, ÓE NIK 562 www.tankonyvtar.hu
2. Basics of the SIMT execution (42)

4.b Assigning work allocation units to SIMT cores for execution

Work allocation units will be assigned for execution to the available SIMT cores as entities
by the scheduler of the GPGPU/DPA.

© Sima Dezső, ÓE NIK 563 www.tankonyvtar.hu


2. Basics of the SIMT execution (43)

Example: Assigning work allocation units to the SIMT cores in AMD’s Cayman GPGPU [93]

Kernel i: Domain of execution


Global size mi

Work Group Work Group


Array of
Global size ni

(0,0) (0,1)
SIMT cores

Work Group Work Group


(1,0) (1,1)

The work allocation units are


called here Work Groups.

They will be assigned


for execution to the same or to
different SIMT cores.
(ALU)

© Sima Dezső, ÓE NIK 564 www.tankonyvtar.hu


2. Basics of the SIMT execution (44)

Kind of assigning work allocation units to SIMT cores

Serial kernel processing
The GPGPU scheduler assigns work allocation units only from a single kernel
to the available SIMT cores, i.e. the scheduler distributes work allocation units
to available SIMT cores for maximum parallel execution.

Concurrent kernel processing
The GPGPU scheduler is capable of assigning work allocation units to SIMT cores
from multiple kernels concurrently, with the constraint that the scheduler can assign
work allocation units to each particular SIMT core only from a single kernel.

© Sima Dezső, ÓE NIK 565 www.tankonyvtar.hu


2. Basics of the SIMT execution (45)

Serial/concurrent kernel processing-1 [38], [83]


Serial kernel processing
The global scheduler of the GPGPU is capable of assigning work to the SIMT cores only
from a single kernel

© Sima Dezső, ÓE NIK 566 www.tankonyvtar.hu


2. Basics of the SIMT execution (46)

Serial/concurrent kernel processing in Nvidia’s GPGPUs [38], [83]

• A global scheduler, called the Gigathread scheduler assigns work to each SIMT core.
• In Nvidia’s pre-Fermi GPGPU generations (G80-, G92-, GT200-based GPGPUs)
the global scheduler could only assign work to the SIMT cores from a single kernel
(serial kernel execution).
• By contrast, in Fermi-based GPGPUs the global scheduler is able to run up to 16 different
kernels concurrently, presumably one per SM (concurrent kernel execution).

In Fermi up to 16 kernels can run
concurrently, presumably each one
on a different SM.

Compute devices 1.x Compute devices 2.x


(devices before Fermi) (devices starting with Fermi)
© Sima Dezső, ÓE NIK 567 www.tankonyvtar.hu
2. Basics of the SIMT execution (47)

Serial/concurrent kernel processing in AMD’s GPGPUs


• In GPGPUs preceding Cayman-based systems (2010), only a single kernel was allowed to run
on a GPGPU.
In these systems, the work allocation units constituting the NDRange (domain of execution)
were spread over all available SIMD cores in order to speed up execution.
• In Cayman based systems (2010) multiple kernels may run on the same GPGPU, each one
on a single or multiple SIMD cores, allowing a better utilization of the hardware resources
for a more parallel execution.

© Sima Dezső, ÓE NIK 568 www.tankonyvtar.hu


2. Basics of the SIMT execution (48)

Example: Assigning multiple kernels to the SIMT cores in Cayman-based systems

Kernel 1: NDRange1
Global size 10

Work Group Work Group


Global size 11

(0,0) (0,1) DPP Array

Work Group Work Group


(1,0) (1,1)

Kernel 2: NDRange2
Global size 20

Work Group Work Group


Global size 21

(0,0) (0,1)

Work Group Work Group


(1,0) (1,1)

© Sima Dezső, ÓE NIK 569 www.tankonyvtar.hu


2. Basics of the SIMT execution (49)

4.c Segmenting work allocation units into work scheduling units to be executed
on the execution pipelines of the SIMT cores-1
• Work scheduling units are parts of a work allocation unit that will be scheduled for execution
on the execution pipelines of a SIMT core as an entity.
• The scheduler of the GPGPU segments work allocation units into work scheduling units
of given size.

© Sima Dezső, ÓE NIK 570 www.tankonyvtar.hu


2. Basics of the SIMT execution (50)

Example: Segmentation of a 16 x 16 sized Work Group into Subgroups of the size of 8x8
in AMD’s Cayman core [92]
Work Group

One 8x8 block constitutes a wavefront and is executed on one SIMT core;
another 8x8 block constitutes another wavefront and is executed on the same or
another SIMT core.

Wavefront of
64 elements

In the example a SIMT core


has 64 execution pipelines
(ALUs)
Array of SIMT cores

© Sima Dezső, ÓE NIK 571 www.tankonyvtar.hu


2. Basics of the SIMT execution (51)

4.c Segmenting work allocation units into work scheduling units to be executed
on the execution pipelines of the SIMT cores-2
Work scheduling units are called warps by Nvidia or wavefronts by AMD.
Size of the work scheduling units
• In Nvidia’s GPGPUs the size of the work scheduling unit (called warp) is 32.
• AMD’s GPGPUs have different work scheduling sizes (called wavefront sizes)

• High performance GPGPU cards typically have wavefront sizes of 64, whereas
• lower performance cards may have wavefront sizes of 32 or even 16.

The scheduling units created by segmentation are then sent to the scheduler.
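
A hedged host-side CUDA sketch that queries the work scheduling unit size of an Nvidia
device at run time and derives, e.g. for a 16 x 16 thread block, the number of warps it is
segmented into:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                // properties of device 0
    int blockThreads = 16 * 16;                       // threads of one work allocation unit
    int warps = (blockThreads + prop.warpSize - 1) / prop.warpSize;
    printf("warp size: %d, warps per block: %d\n", prop.warpSize, warps);
    return 0;
}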

© Sima Dezső, ÓE NIK 572 www.tankonyvtar.hu


2. Basics of the SIMT execution (52)

Example: Sending work scheduling units for execution to SIMT cores in AMD’s Cayman core [92]

Work Group

One 8x8 block constitutes a wavefront and is executed on one SIMT core;
another 8x8 block constitutes a wavefront and is executed on the same or
another SIMT core.

Subgroup of 64 elements

In the example a SIMT core


has 64 execution pipelines
(ALUs)
Array of SIMT cores

© Sima Dezső, ÓE NIK 573 www.tankonyvtar.hu


2. Basics of the SIMT execution (53)

4.d Scheduling work scheduling units for execution to the execution pipelines of the
SIMT cores
The scheduler assigns work scheduling units to the execution pipelines of the SIMT cores
for execution according to a chosen scheduling policy (discussed in the case example parts
5.1.6 and 5.2.8).

© Sima Dezső, ÓE NIK 574 www.tankonyvtar.hu


2. Basics of the SIMT execution (54)

Work scheduling units will be executed on the execution pipelines (ALUs) of the SIMT cores.

Thread Thread Thread Thread Thread

Fetch/Decode SIMT core

Work scheduling unit


ALU ALU ALU ALU ALU

Thread Thread Thread Thread Thread


SIMT core
Fetch/Decode
Thread Thread Thread Thread Thread
Work scheduling unit
Fetch/Decode
ALU ALU ALU ALU ALU

SIMT core
ALU ALU ALU ALU ALU

Work scheduling unit

© Sima Dezső, ÓE NIK 575 www.tankonyvtar.hu


2. Basics of the SIMT execution (55)

Note
Massive multithreading is a means to prevent stalls occurring during the execution of
work scheduling units due to long latency operations, such as memory accesses caused
by cache misses.

Principle of preventing stalls by massive multithreading


• Suspend the execution of stalled work scheduling units and allocate ready to run
work scheduling units for execution.
• When a large enough number of work scheduling units is available, stalls can be hidden.

Example
Up-to-date (Fermi-based) Nvidia GPGPUs can maintain up to 48 work scheduling units,
called warps per SIMT core.
For instance, the GTX 580 includes 16 SIMT cores, with 48 warps per SIMT core and
32 threads per warp for a total number of 24576 threads.

© Sima Dezső, ÓE NIK 576 www.tankonyvtar.hu


2. Basics of the SIMT execution (56)

5. The model of data sharing-1

• The model of data sharing declares the possibilities to share data between threads.
• This is not an orthogonal concept, but results from both
• the memory concept and
• the concept of assigning work to execution pipelines of the GPGPU.

© Sima Dezső, ÓE NIK 577 www.tankonyvtar.hu


2. Basics of the SIMT execution (57)

The model of data sharing-2


(considering only key elements of the data space, based on [43])

Per-thread
reg. file

Local Memory

Domain of execution 1

Notes
Domain of execution 2
1) Work Allocation Units
are designated in the Figure
as Thread Block/Block

2) The Constant Memory


is not shown due to space limitations.
It has the same data sharing scheme
but provides only Read only accessibility.

© Sima Dezső, ÓE NIK 578 www.tankonyvtar.hu


2. Basics of the SIMT execution (58)

6. Data dependent flow control


Implemented by SIMT branch processing

In SIMT processing both paths of a branch are executed one after the other such that
for each path the prescribed operations are executed only on those data elements which
fulfill the data condition given for that path (e.g. xi > 0).

Example

© Sima Dezső, ÓE NIK 579 www.tankonyvtar.hu


2. Basics of the SIMT execution (59)

Figure 2.21: Execution of branches [24]


The given condition will be checked separately for each thread

© Sima Dezső, ÓE NIK 580 www.tankonyvtar.hu


2. Basics of the SIMT execution (60)

First all ALUs meeting the condition execute the prescribed three operations,
then all ALUs missing the condition execute the next two operations

Figure 2.22: Execution of branches [24]


© Sima Dezső, ÓE NIK 581 www.tankonyvtar.hu
2. Basics of the SIMT execution (61)

Figure 2.23: Resuming instruction stream processing after executing a branch [24]
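
At the HLL level such a data dependent branch may be written e.g. as in the following
minimal CUDA C sketch (the kernel is illustrative):

__global__ void clampAndScale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // threads of a warp taking different paths are executed one after the other,
        // with the lanes of the currently inactive path masked out
        if (x[i] > 0.0f)
            x[i] = 2.0f * x[i];     // executed only by threads fulfilling the condition
        else
            x[i] = 0.0f;            // executed afterwards by the remaining threads
    }
}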

© Sima Dezső, ÓE NIK 582 www.tankonyvtar.hu


2. Basics of the SIMT execution (62)

7. Barrier synchronization

Barrier synchronization

• Synchronization of thread execution
• Synchronization of memory read/writes

© Sima Dezső, ÓE NIK 583 www.tankonyvtar.hu


2. Basics of the SIMT execution (63)

Barrier synchronization of thread execution


It allows threads in a Work Group to be synchronized such that, at a given point
(marked by the barrier synchronization instruction), all threads must have completed
all prior instructions before execution can proceed.
It is implemented
• in Nvidia’s PTX by the “bar” instruction [147] or
• in AMD’s IL by the “fence thread” instruction [10].
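
At the CUDA C level this barrier appears as __syncthreads(); a minimal sketch
(illustrative kernel, assuming 256-thread blocks and at least 256 input elements):

__global__ void reverseTile(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = threadIdx.x;
    tile[i] = in[i];
    __syncthreads();              // all threads of the block must have completed the write above
                                  // before any of them may proceed
    out[i] = tile[255 - i];       // safe: every element of tile has already been stored
}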

© Sima Dezső, ÓE NIK 584 www.tankonyvtar.hu


2. Basics of the SIMT execution (64)

Barrier synchronization of memory read/writes


• It ensures that no read/write instructions can be re-ordered or moved across the memory
barrier instruction in the specified data space (Local Data Space/Global memory/
System memory).
• Thread execution resumes when all the thread’s prior memory writes have been completed
and thus the data became visible to other threads in the specified data space.

It is implemented
• in Nvidia’s PTX by the “membar” instruction [147] or
• in AMD’s IL by the “fence lds”/”fence memory” instructions [10].
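
At the CUDA C level the corresponding intrinsics are __threadfence_block() and
__threadfence(); a minimal hedged sketch of publishing a result to other threads via the
Global memory:

__device__ float result;
__device__ volatile int ready = 0;

__global__ void producer(float value)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        result = value;
        __threadfence();          // the write to result becomes visible device-wide
        ready = 1;                // only then is the flag published
    }
}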

© Sima Dezső, ÓE NIK 585 www.tankonyvtar.hu


2. Basics of the SIMT execution (65)

8. Communication between threads

Discussion of this topic assumes the knowledge of programming details; therefore it is omitted here.
Interested readers are referred to the related reference guides [147], [104], [105].

© Sima Dezső, ÓE NIK 586 www.tankonyvtar.hu


2. Basics of the SIMT execution (66)

The pseudo ISA


• The pseudo ISA part of the virtual machine specifies the instruction set available at this level.
• The pseudo ISA evolves in line with the real ISA in the form of subsequent releases.
• The evolution comprises both the enhancement of the qualitative (functional) and the
quantitative features of the pseudo architecture.

Example
• Evolution of the pseudo ISA of Nvidia’s GPGPUs and their support in real GPGPUs.
• Subsequent versions of both the pseudo- and real ISA are designated as compute capabilities.

© Sima Dezső, ÓE NIK 587 www.tankonyvtar.hu


2. Basics of the SIMT execution (67)

a) Evolution of the qualitative


(functional) features of subsequent
compute capability versions
of Nvidia’s pseudo ISA
(called virtual PTX) [81]

588 www.tankonyvtar.hu
2. Basics of the SIMT execution (68)

Evolution of the device parameters bound to Nvidia’s subsequent compute capability


versions [81]

589 www.tankonyvtar.hu
2. Basics of the SIMT execution (69)

b) Compute capability versions of PTX ISAs generated by subsequent releases of


CUDA SDKs and supported GPGPUs (designated as Targets in the Table) [147]

PTX ISA 1.x/sm_1x


Pre-Fermi implementations

PTX ISA 2.x/sm_2x

Fermi implementations

© Sima Dezső, ÓE NIK 590 www.tankonyvtar.hu


2. Basics of the SIMT execution (70)

c) Supported compute capability versions of Nvidia’s GPGPU cards [81]

Capability vers. GPGPU cores GPGPU devices


(sm_xy)

10 G80 GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870,


FX4/5600, 360M
11 G86, G84, G98, G96, G96b, G94, GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS,
G94b, G92, G92b 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT
120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM,
32/370M, 3/5/770M, 16/17/27/28/36/37/3800M,
NVS420/50
12 GT218, GT216, GT215 GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M,
NVS 2/3100M
13 GT200, GT200b GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX
3/4/5800
20 GF100, GF110 GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro
600,4/5/6000, Plex7000, GTX570, GTX580

21 GF108, GF106, GF104, GF114 GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti,
GTX 560Ti

© Sima Dezső, ÓE NIK 591 www.tankonyvtar.hu


2. Basics of the SIMT execution (71)

d) Forward portability of PTX code [52]


Applications compiled for pre-Fermi GPGPUs that include PTX versions of their kernels
should work as-is on Fermi GPGPUs as well.

e) Compatibility rules of object files (CUBIN files) compiled to a particular GPGPU


compute capability version [52]
The basic rule is forward compatibility within the main versions (versions sm_1x and sm_2x),
but not across main versions.
This is interpreted as follows:
Object files (called CUBIN files) compiled to a particular GPGPU compute capability version
are supported on all devices having the same or higher version number within the
same main version.
E.g. object files compiled to the compute capability 1.0 are supported on all 1.x devices
but not supported on compute capability 2.0 (Fermi) devices.

For more details see [52].
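
A hedged command line illustration of targeting particular compute capability versions
with nvcc (file names are illustrative):

# generate object code for compute capability 1.3 and 2.0 devices in one binary
nvcc -gencode arch=compute_13,code=sm_13 \
     -gencode arch=compute_20,code=sm_20 kernel.cu -o app

# in line with the rule above, the embedded sm_13 object code runs on 1.3 and
# higher 1.x devices, but not on 2.x (Fermi) devices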

© Sima Dezső, ÓE NIK 592 www.tankonyvtar.hu


3. Overview of GPGPUs

© Sima Dezső, ÓE NIK 593 www.tankonyvtar.hu


3. Overview of GPGPUs (1)

Basic implementation alternatives of the SIMT execution

GPGPUs
Programmable GPUs with appropriate programming environments
Have display outputs
E.g. Nvidia’s 8800 and GTX lines
AMD’s HD 38xx, HD48xx lines

Data parallel accelerators
Dedicated units supporting data parallel execution with appropriate programming environment
No display outputs
Have larger memories than GPGPUs
E.g. Nvidia’s Tesla lines
AMD’s FireStream lines

Figure 3.1: Basic implementation alternatives of the SIMT execution

© Sima Dezső, ÓE NIK 594 www.tankonyvtar.hu


3. Overview of GPGPUs (2)

GPGPUs

Nvidia’s line AMD/ATI’s line

90 nm G80
80 nm R600
Shrink Enhanced
arch. Shrink
Enhanced
65 nm G92 G200 arch.
55 nm RV670 RV770
Enhanced
Shrink Enhanced Enhanced
arch. Shrink arch. arch.
40 nm GF100 RV870 Cayman
(Fermi)

Figure 3.2: Overview of Nvidia’s and AMD/ATI’s GPGPU lines

© Sima Dezső, ÓE NIK 595 www.tankonyvtar.hu


3. Overview of GPGPUs (3)

NVidia
11/06 10/07 6/08

Cores G80 G92 GT200


90 nm/681 mtrs 65 nm/754 mtrs 65 nm/1400 mtrs

Cards 8800 GTS 8800 GTX 8800 GT GTX260 GTX280


96 ALUs 128 ALUs 112 ALUs 192 ALUs 240 ALUs
320-bit 384-bit 256-bit 448-bit 512-bit

OpenCL OpenCL
Standard
6/07 11/07 6/08 11/08

CUDA Version 1.0 Version 1.1 Version 2.0 Version 2.1

AMD/ATI
11/05 5/07 11/07 5/08

Cores R500 R600 R670 RV770


80 nm/681 mtrs 55 nm/666 mtrs 55 nm/956 mtrs

Cards (Xbox) HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870


48 ALUs 320 ALUs 320 ALUs 320 ALUs 800 ALUs 800 ALUs
512-bit 256-bit 256-bit 256-bit 256-bit
12/08
OpenCL OpenCL
Standard
11/07 9/08 12/08
Brooks+ Brook+ Brook+ 1.2 Brook+ 1.3
(SDK v.1.0) (SDK v.1.2) (SDK v.1.3)
6/08
RapidMind 3870
support
2005 2006 2007 2008

Figure 3.3: Overview of GPGPUs and their basic software support (1)
© Sima Dezső, ÓE NIK 596 www.tankonyvtar.hu
3. Overview of GPGPUs (4)

NVidia
3/10 07/10 11/10
Cores GF100 (Fermi) GF104 (Fermi) GF110 (Fermi)
40 nm/3000 mtrs 40 nm/1950 mtrs 40 nm/3000 mtrs
1/11
Cards GTX 470 GTX 480 GTX 460 GTX 580 GTX 560 Ti
448 ALUs 480 ALUs 336 ALUs 512 ALUs 480 ALUs
320-bit 384-bit 192/256-bit 384-bit 384-bit

6/09 10/09 6/10

OpenCL OpenCL 1.0 OpenCL 1.0 OpenCL 1.1


SDK 1.0 Early release SDK 1.0 SDK 1.1
5/09 6/09 3/10 6/10 1/11 3/11

CUDA Version 22 Version 2.3 Version 3.0 Version 3.1 Version 3.2 Version 4.0
Beta
AMD/ATI
9/09 10/10 12/10

Cores RV870 (Cypress) Barts Pro/XT Cayman Pro/XT


40 nm/2100 mtrs 40 nm/1700 mtrs 40 nm/2640 mtrs

Cards HD 5850/70 HD 6850/70 HD 6950/70


1440/1600 ALUs 960/1120 ALUs 1408/1536 ALUs
256-bit 256-bit 256-bit

11/09 03/10 08/10

OpenCL OpenCL 1.0 OpenCL 1.0 OpenCL 1.1


(SDK V.2.0) (SDK V.2.01) (SDK V.2.2)
3/09
Brooks+ Brook+ 1.4
(SDK V.1.4 Beta) 8/09
Intel bought RapidMind
RapidMind

2009 2010 2011

Figure 3.4: Overview of GPGPUs and their basic software support (2)
© Sima Dezső, ÓE NIK 597 www.tankonyvtar.hu
3. Overview of GPGPUs (5)

Remarks on AMD-based graphics cards [45], [66]

Beginning with their Cypress-based HD 5xxx line and SDK v.2.0 AMD left Brook+
and started supporting OpenCL as their basic HLL programming language.

AMD/ATI
9/09 10/10 12/10

Cores RV870 (Cypress) Barts Pro/XT Cayman Pro/XT

40 nm/2100 mtrs 40 nm/1700 mtrs 40 nm/2640 mtrs

Cards HD 5850/70 HD 6850/70 HD 6950/70


1440/1600 ALUs 960/1120 ALUs 1408/1536 ALUs
256-bit 256-bit 256-bit

11/09 03/10 08/10

OpenCL OpenCL 1.0 OpenCL 1.0 OpenCL 1.1


(SDK V.2.0) (SDK V.2.01) (SDK V.2.2)
3/09
Brooks+ Brook+ 1.4
(SDK V.2.01) 8/09
Intel bought RapidMind
RapidMind

2009 2010 2011

As a consequence AMD changed also


• both the microarchitecture of their GPGPUs (by introducing Local and Global Data Share
memories) and
• their terminology by introducing Pre-OpenCL and OpenCL terminology, as discussed
in Section 5.2.
© Sima Dezső, ÓE NIK 598 www.tankonyvtar.hu
3. Overview of GPGPUs (6)

Remarks on Fermi-based graphics cards [45], [66]

FP64 speed
• ½ of the FP32 speed for the Tesla 20-series
• 1/8 of the FP32 speed for the GeForce GTX 470/480/570/580 cards
1/12 for other GeForce GTX 4xx cards

ECC
available only on the Tesla 20-series

Number of DMA engines


Tesla 20-series has 2 DMA Engines (copy engines). GeForce cards have 1 DMA Engine.
This means that CUDA applications can overlap computation and communication on Tesla
using bi-directional communication over PCI-e.

Memory size
Tesla 20 products have larger on board memory (3GB and 6GB)

© Sima Dezső, ÓE NIK 599 www.tankonyvtar.hu


3. Overview of GPGPUs (7)

Positioning Nvidia’s discussed GPGPU cards in their entire product portfolio [82]

© Sima Dezső, ÓE NIK 600 www.tankonyvtar.hu


3. Overview of GPGPUs (8)

Nvidia’s compute capability concept [52], [149]


• Nvidia manages the evolution of their devices and programming environment by maintaining
compute capability versions of both
• their intermediate virtual PTX architectures (PTX ISA) (not discussed here) and
• their real architectures (GPGPU ISA).
• Designation of the compute capability versions
• Subsequent versions of GPGPU ISAs are designated as sm_1x/sm_2x or simply by 1.x/2.x.
• The first digit 1 or 2 denotes the major version number, the second digit denotes the
minor version.
• Major versions of 1.x or 1x relate to pre-Fermi solutions whereas those of 2.x or 2x
to Fermi based solutions.
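
A hedged host-side CUDA sketch that reads the compute capability version of the installed
device at run time:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // properties of device 0
    // prop.major/prop.minor hold the compute capability, e.g. 2.0 for GF100/GF110 based cards
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}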

© Sima Dezső, ÓE NIK 601 www.tankonyvtar.hu


3. Overview of GPGPUs (9)

a) Functional features provided by


the compute capability versions
of Nvidia’se GPGPUs [81]

602 www.tankonyvtar.hu
3. Overview of GPGPUs (10)

b) Device parameters bound to the compute capability versions of Nvidia’s GPGPUs [81]

603 www.tankonyvtar.hu
3. Overview of GPGPUs (11)

c) Supported GPGPU compute capability versions of Nvidia’s GPGPU cards [81]

Capability vers. GPGPU cores GPGPU devices


(sm_xy)

10 G80 GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870,


FX4/5600, 360M
11 G86, G84, G98, G96, G96b, G94, GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS,
G94b, G92, G92b 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT
120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM,
32/370M, 3/5/770M, 16/17/27/28/36/37/3800M,
NVS420/50
12 GT218, GT216, GT215 GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M,
NVS 2/3100M
13 GT200, GT200b GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX
3/4/5800
20 GF100, GF110 GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro
600,4/5/6000, Plex7000, GTX570, GTX580

21 GF108, GF106, GF104, GF114 GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti,
GTX 560Ti

© Sima Dezső, ÓE NIK 604 www.tankonyvtar.hu


3. Overview of GPGPUs (12)
8800 GTS 8800 GTX 8800 GT GTX 260 GTX 280
Core G80 G80 G92 GT200 GT200

Introduction 11/06 11/06 10/07 6/08 6/08

IC technology 90 nm 90 nm 65 nm 65 nm 65 nm

Nr. of transistors 681 mtrs 681 mtrs 754 mtrs 1400 mtrs 1400 mtrs

Die area 480 mm2 480 mm2 324 mm2 576 mm2 576 mm2

Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz
Computation
No of SMs (cores) 12 16 14 24 30

No. of FP32 EUs 96 128 112 192 240

Shader frequency 1.2 GHz 1.35 GHz 1.512 GHz 1.242 GHz 1.296 GHz

No. FP32 operations./cycle 21 3 3

Peak FP32 performance 230.4 GFLOPS 345.61 GFLOPS 508 GFLOPS 715 GFLOPS 933 GFLOPS

Peak FP64 performance – – – 59.62 GFLOPS 77.76 GFLOPS


Memory
Mem. transfer rate (eff) 1600 Mb/s 1800 Mb/s 1800 Mb/s 1998 Mb/s 2214 Mb/s

Mem. interface 320-bit 384-bit 256-bit 448-bit 512-bit

Mem. bandwidth 64 GB/s 86.4 GB/s 57.6 GB/s 111.9 GB/s 141.7 GB/s

Mem. size 320 MB 768 MB 512 MB 896 MB 1.0 GB

Mem. type GDDR3 GDDR3 GDDR3 GDDR3 GDDR3

Mem. channel 6*64-bit 6*64-bit 4*64-bit 8*64-bit 8*64-bit


System
Multi. CPU techn. SLI SLI SLI SLI SLI

Interface PCIe x16 PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16

MS Direct X 10 10 10 10.1 subset 10.1 subset

TDP 146 W 155 W 105 W 182 W 236 W

1: Nvidia takes the FP32 capable Texture Processing Units also into consideration and calculates with 3 FP32 operations/cycle
Table 3.1: Main features of Nvidia’s GPGPUs-1
© Sima Dezső, ÓE NIK 605 www.tankonyvtar.hu
3. Overview of GPGPUs (13)

Remarks
In publications there are conflicting statements about whether or not the G80 makes use
of dual issue (including a MAD and a Mul operation) within a period of four shader cycles.
Official specifications [22] declare the capability of dual issue, but other literature sources [64]
and even a textbook, co-authored by one of the chief developers of the G80 (D. Kirk [65]),
deny it.
A clarification could be found in a blog [66], revealing that the higher figure given in Nvidia’s
specifications includes calculations made both by the ALUs in the SMs and by the texture
processing units (TPUs).
Nevertheless, the TPUs cannot be directly accessed by CUDA except for graphical tasks,
such as texture filtering.
Accordingly, in our discussion focusing on numerical calculations it is fair to take only
the MAD operations into account for specifying the peak numerical performance.

© Sima Dezső, ÓE NIK 606 www.tankonyvtar.hu


3. Overview of GPGPUs (14)

Structure of an SM of the G80 architecture

Texture processing Units


consisting of
• TA: Texture Address units
• TF: Texture Filter Units

They are FP32 or FP16 capable [46]

© Sima Dezső, ÓE NIK 607 www.tankonyvtar.hu


3. Overview of GPGPUs (15)

GTX 470 GTX 480 GTX 460 GTX 570 GTX 580
Core GF100 GF100 GF104 GF110 GF110

Introduction 3/10 3/10 7/10 12/10 11/10

IC technology 40 nm 40 nm 40 nm 40 nm 40 nm

Nr. of transistors 3200 mtrs 3200 mtrs 1950 mtrs 3000 mtrs 3000 mtrs

Die area 529 mm2 529 mm2 367 mm2 520 mm2 520 mm2

Core frequency 732 MHz 772 MHz


Computation
No of SMs (cores) 14 15 7 15 16

No. of FP32 EUs 448 480 336 480 512

Shader frequency 1215 MHz 1401 MHz 1350 MHz 1464 MHz 1544 MHz

No. FP32 operations/cycle 2 2 3 2 2

Peak FP32 performance 1088 GFLOPS 1345 GFLOPS 907.2 GFLOPS 1405 GFLOPS 1581 GFLOPS

Peak FP64 performance 136 GFLOPS 168 GFLOPS 75.6 GFLOPS 175.6 GFLOPS 197.6 GFLOPS
Memory
Mem. transfer rate (eff) 3348 Mb/s 3698 Mb/s 3600 Mb/s 3800 Mb/s 4008 Mb/s

Mem. interface 320-bit 384-bit 192/256-bit 320-bit 384-bit

Mem. bandwidth 133.9 GB/s 177.4 GB/s 86.4/115.2 GB/s 152 GB/s 192.4 GB/s

Mem. size 1.28 GB 1.536 GB 0.768/1.024 GB/s 1.28 GB 1.536/3.072 GB

Mem. type GDDR5 GDDR5 GDDR5 GDDR5 GDDR5

Mem. channel 5*64-bit 6*64-bit 3/4 *64-bit 5*64-bit 6*64-bit


System
Multi. CPU techn. SLI SLI SLI SLI SLI

Interface PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16

MS Direct X 11 11 11 11 11

TDP 215 W 250 W 150/160 W 219 W 244 W

Table 3.2: Main features of Nvidia’s GPGPUs-2
© Sima Dezső, ÓE NIK 608 www.tankonyvtar.hu
3. Overview of GPGPUs (16)

Remarks

1) The GDDR3 memory has a double clocked data transfer


Effective memory transfer rate = 2 x memory frequency

The GDDR5 memory has a quad clocked data transfer


Effective memory transfer rate = 4 x memory frequency

2) Both the GDDR3 and GDDR5 memories are 32-bit devices.


Nevertheless, memory controllers of GPGPUs may be designed either to control a single
32-bit memory channel or dual memory channels, providing a 64-bit channel width.

© Sima Dezső, ÓE NIK 609 www.tankonyvtar.hu


3. Overview of GPGPUs (17)

Examples for Nvidia cards

Nvidia GeForce GTX 480 (GF 100 based) [47]

© Sima Dezső, ÓE NIK 610 www.tankonyvtar.hu


3. Overview of GPGPUs (18)

Nvidia GeForce GTX 480 and 580 cards [77]

GTX 480 GTX 580


(GF 100 based) (GF 110 based)

© Sima Dezső, ÓE NIK 611 www.tankonyvtar.hu


3. Overview of GPGPUs (19)

A pair of GeForce GTX 480 cards [47]


(GF100 based)

© Sima Dezső, ÓE NIK 612 www.tankonyvtar.hu


3. Overview of GPGPUs (20)

HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870


Core R600 R670 R670 RV770 (R700-based) RV770 (R700 based)

Introduction 5/07 11/07 11/07 5/08 5/08

IC technology 80 nm 55 nm 55 nm 55 nm 55 nm

Nr. of transistors 700 mtrs 666 mtrs 666 mtrs 956 mtrs 956 mtrs

Die area 408 mm2 192 mm2 192 mm2 260 mm2 260 mm2

Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz
Computation
No. of ALUs 320 320 320 800 800

Shader frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz

No. FP32 operations./cycle 2 2 2 2 2

Peak FP32 performance 471.6 GFLOPS 429 GFLOPS 496 GFLOPS 1000 GFLOPS 1200 GFLOPS

Peak FP64 performance – – – 200 GFLOPS 240 GFLOPS


Memory
Mem. transfer rate (eff) 1600 Mb/s 1660 Mb/s 2250 Mb/s 2000 Mb/s 3600 Mb/s (GDDR5)

Mem. interface 512-bit 256-bit 256-bit 256-bit 256-bit

Mem. bandwidth 105.6 GB/s 53.1 GB/s 72.0 GB/s 64 GB/s 118 GB/s

Mem. size 512 MB 256 MB 512 MB 512 MB 512 MB

Mem. type GDDR3 GDDR3 GDDR4 GDDR3 GDDR3/GDDR5

Mem. channel 8*64-bit 8*32-bit 8*32-bit 4*64-bit 4*64-bit

Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar
System
Multi. CPU techn. CrossFire X CrossFire X CrossFire X CrossFire X CrossFire X

Interface PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16

MS Direct X 10 10.1 10.1 10.1 10.1

TDP Max./Idle 150 W 75 W 105 W 110 W 150 W

Table 3.3: Main features of AMD/ATI’s GPGPUs-1
© Sima Dezső, ÓE NIK 613 www.tankonyvtar.hu
3. Overview of GPGPUs (21)

Evergreen series HD 5850 HD 5870 HD 5970


Core Cypress PRO (RV870-based) Cypress XT (RV870-based) Hemlock XT (RV870-based)

Introduction 9/09 9/09 11/09

IC technology 40 nm 40 nm 40 nm

Nr. of transistors 2154 mtrs 2154 mtrs 2*2154 mtrs

Die area 334 mm2 334 mm2 2*334 mm2

Core frequency 725 MHz 850 MHz 725 MHz


Computation
No. of SIMD cores / VLIW5 ALUs 18/16 20/16 2*20/16

No. of EUs 1440 1600 2*1600

Shader frequency 725 MHz 850 MHz 725 MHz

No. FP32 inst./cycle 2 2 2


Peak FP32 performance 2088 GFLOPS 2720 GFLOPS 4640 GFLOPS

Peak FP64 performance 417.6 GFLOPS 544 GFLOPS 928 GFLOPS


Memory
Mem. transfer rate (eff) 4000 Mb/s 4800 Mb/s 4000 Mb/s

Mem. interface 256-bit 256-bit 2*256-bit

Mem. bandwidth 128 GB/s 153.6 GB/s 2*128 GB/s

Mem. size 1.0 GB 1.0/2.0 GB 2*(1.0/2.0) GB

Mem. type GDDR5 GDDR5 GDDR5

Mem. channel 8*32-bit 8*32-bit 2*8*32-bit


System
Multi. CPU techn. CrossFire X CrossFire X CrossFire X

Interface PCIe 2.1*16 PCIe 2.1*16 PCIe 2.1*16

MS Direct X 11 11 11

TDP Max./Idle 151/27 W 188/27 W 294/51 W

© Sima Dezső, ÓE NIK 614


Table 3.4: Main features of AMD/ATI’s GPGPUs-2 www.tankonyvtar.hu
3. Overview of GPGPUs (22)

Northern Islands series HD 6850 HD 6870


Core Barts Pro Barts XT

Introduction 10/10 10/10

IC technology 40 nm 40 nm

Nr. of transistors 1700 mtrs 1700 mtrs

Die area 255 mm2 255 mm2

Core frequency 775 MHz 900 MHz


Computation
No. of SIMD cores /VLIW5 ALUs 12/16 14/16

No. of EUs 960 1120

Shader frequency 775 MHz 900 MHz

No. FP32 inst./cycle 2 2

Peak FP32 performance 1488 GFLOPS 2016 GFLOPS

Peak FP64 performance - -


Memory
Mem. transfer rate (eff) 4000 Mb/s 4200 Mb/s

Mem. interface 256-bit 256-bit

Mem. bandwidth 128 GB/s 134.4 GB/s

Mem. size 1 GB 1 GB

Mem. type GDDR5 GDDR5

Mem. channel 8*32-bit 8*32-bit


System
Multi. CPU techn. CrossFire X CrossFire X

Interface PCIe 2.1*16 PCIe 2.1*16

MS Direct X 11 11

TDP Max./Idle 127/19 W 151/19 W

© Sima Dezső, ÓE NIK 615


Table 3.5: Main features of AMD/ATI’s GPGPUs-3 www.tankonyvtar.hu
3. Overview of GPGPUs (23)
Northern Islands series HD 6950 HD 6970 HD 6990 HD 6990 unlocked
Core Cayman Pro Cayman XT Antilles Antilles

Introduction 12/10 12/10 3/11 3/11

IC technology 40 nm 40 nm 40 nm 40 nm

Nr. of transistors 2.64 billion 2.64 billion 2*2.64 billion 2*2.64 billion

Die area 389 mm2 389 mm2 2*389 mm2 2*389 mm2

Core frequency 800 MHz 880 MHz 830 MHz 880 MHz
Computation
No. of SIMD cores /VLIW4 ALUs 22/16 24/16 2*24/16 2*24/16

No. of EUs 1408 1536 2*1536 2*1536

Shader frequency 800 MHz 880 MHz 830 MHz 880 MHz

No. FP32 inst./cycle / ALU 4 4 4 4

Peak FP32 performance 2.25 TFLOPS 2.7 TFLOPS 5.1 TFLOPS 5.4 TFLOPS

Peak FP64 performance 0.5625 TFLOPS 0.683 TFLOPS 1.275 TFLOPS 1.35 TFLOPS
Memory
Mem. transfer rate (eff) 5000 Mb/s 5500 Mb/s 5000 Mb/s 5000 Mb/s

Mem. interface 256-bit 256-bit 256-bit 256-bit

Mem. bandwidth 160 GB/s 176 GB/s 2*160 GB/s 2*160 GB/s

Mem. size 2 GB 2 GB 2*2 GB 2*2 GB

Mem. type GDDR5 GDDR5 GDDR5 GDDR5

Mem. channel 8*32-bit 5*32-bit 2*8*32-bit 2*8*32-bit


System
ECC - - - -

Multi. CPU techn. CrossFireX CrossFireX CrossFireX CrossFireX

Interface PCIe 2.1*16-bit PCIe 2.1*16-bit PCIe 2.1*16-bit PCIe 2.1*16-bit

MS Direct X 11 11 11 11

TDP Max./Idle 200/20 W 250/20 W 350/37 W 415/37 W


© Sima Dezső, ÓE NIK 616 www.tankonyvtar.hu
Table 3.6: Main features of AMD/ATIs GPGPUs-4
3. Overview of GPGPUs (24)

Remark
The Radeon HD 5xxx line of cards is designated also as the Evergreen series and
the Radeon HD 6xxx line of cards is designated also as the Northern Islands series.

© Sima Dezső, ÓE NIK 617 www.tankonyvtar.hu


3. Overview of GPGPUs (25)

Examples for AMD cards

HD 5870 (RV870 based) [41]

© Sima Dezső, ÓE NIK 618 www.tankonyvtar.hu


3. Overview of GPGPUs (26)

HD 5970 (actually RV870 based) [80]

ATI HD 5970: 2 x ATI HD 5870 with slightly reduced memory clock

© Sima Dezső, ÓE NIK 619 www.tankonyvtar.hu


3. Overview of GPGPUs (27)

HD 5970 (actually RV870 based) [79]

ATI HD 5970: 2 x ATI HD 5870 with slightly reduced memory clock

© Sima Dezső, ÓE NIK 620 www.tankonyvtar.hu


3. Overview of GPGPUs (28)

AMD HD 6990 (actually Cayman based) [78]

AMD HD 6990: 2 x ATI HD 6970 with slightly reduced memory and shader clock

© Sima Dezső, ÓE NIK 621 www.tankonyvtar.hu


3. Overview of GPGPUs (29)

Price relations (as of 01/2011)

Nvidia
GTX 570 ~ 350 $
GTX 580 ~ 500 $

AMD
HD 6970 ~ 400 $
HD 6990 ~ 700 $
(Dual 6970)

© Sima Dezső, ÓE NIK 622 www.tankonyvtar.hu


4. Overview of data parallel accelerators

© Sima Dezső, ÓE NIK 623 www.tankonyvtar.hu


4. Overview of data parallel accelerators (1)

Data parallel accelerators

Implementation alternatives of data parallel accelerators

On-card implementation
(Recent implementations)
E.g. GPU cards,
data-parallel accelerator cards,
AMD’s Torrenza integration technology

On-die integration
(Emerging implementations)
E.g. Intel’s Heavendahl,
Intel’s Sandy Bridge (2011),
AMD’s Fusion (2008) 2010 integration technology

Trend: from on-card implementation towards on-die integration

Figure 4.1: Implementation alternatives of dedicated data parallel accelerators


© Sima Dezső, ÓE NIK 624 www.tankonyvtar.hu
4. Overview of data parallel accelerators (2)

On-card accelerators

Card implementations
Single cards fitting into a free PCI-E x16 slot of the host computer.
E.g. Nvidia Tesla C870, C1060, C2070;
AMD FireStream 9170, 9250, 9370

Desktop implementations
Usually dual cards mounted into a box, connected to an adapter card that is inserted into
a free PCI-E x16 slot of the host PC through a cable.
E.g. Nvidia Tesla D870

1U server implementations
Usually 4 cards mounted into a 1U server rack, connected to two adapter cards that are
inserted into two free PCI-E x16 slots of a server through two switches and two cables.
E.g. Nvidia Tesla S870, S1070, S2050/S2070

Figure 4.2: Implementation alternatives of on-card accelerators


© Sima Dezső, ÓE NIK 625 www.tankonyvtar.hu
4. Overview of data parallel accelerators (3)

NVidia Tesla-1 (Non-Fermi based DPAs)

G80-based GT200-based

6/07 6/08

Card C870 C1060

1.5 GB GDDR3 4 GB GDDR3


345.6 SP: 345.6 GFLOPS
DP: -
SP: 933 GFLOPS
DP: 77.76 GFLOPS

6/07
Desktop D870
2*C870 incl.
3 GB GDDR3
SP: 691.2 GFLOPS
DP: -

6/07 6/08
IU Server S870 S1070
4*C870 incl. 4*C1060
6 GB GDDR3 16 GB GDDR3
SP: 1382 GFLOPS SP: 3732 GFLOPS
DP: - DP: 311 GFLOPS

6/07 11/07 6/08

CUDA Version 1.0 Version 1.01 Version 2.0

2007 2008

Figure 4.3: Overview of Nvidia’s G80/G200-based Tesla family-1

© Sima Dezső, ÓE NIK 626 www.tankonyvtar.hu


4. Overview of data parallel accelerators (4)

FB: Frame Buffer

Figure 4.4: Main functional units of Nvidia’s Tesla C870 card [2]

© Sima Dezső, ÓE NIK 627 www.tankonyvtar.hu


4. Overview of data parallel accelerators (5)

Figure 4.5: Nvida’s Tesla C870 and


AMD’s FireStream 9170 cards [2], [3]
© Sima Dezső, ÓE NIK 628 www.tankonyvtar.hu
4. Overview of data parallel accelerators (6)

Figure 4.6: Tesla D870 desktop implementation [4]

© Sima Dezső, ÓE NIK 629 www.tankonyvtar.hu


4. Overview of data parallel accelerators (7)

Figure 4.7: Nvidia’s Tesla D870 desktop implementation [4]

© Sima Dezső, ÓE NIK 630 www.tankonyvtar.hu


4. Overview of data parallel accelerators (8)

Figure 4.8: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4]

© Sima Dezső, ÓE NIK 631 www.tankonyvtar.hu


4. Overview of data parallel accelerators (9)

Figure 4.9: Concept of Nvidia’s Tesla S870 1U rack server [5]

© Sima Dezső, ÓE NIK 632 www.tankonyvtar.hu


4. Overview of data parallel accelerators (10)

Figure 4.10: Internal layout of Nvidia’s Tesla S870 1U rack [6]


© Sima Dezső, ÓE NIK 633 www.tankonyvtar.hu
4. Overview of data parallel accelerators (11)

Figure 4.11: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards
inserted into PCI-E x16 slots of the host server [6]
© Sima Dezső, ÓE NIK 634 www.tankonyvtar.hu
4. Overview of data parallel accelerators (12)

NVidia Tesla-2 (Fermi-based DPAs)


GF100 (Fermi)-based

11/09
Card C2050/C2070
3/6 GB GDDR5
SP: 1.03 TFLOPS1
DP: 0.515 TFLOPS

04/10 08/10
Module M2050/M2070 M2070Q

3/6 GB GDDR5 6 GB GDDR5


SP: 1.03 TFLOPS1 SP: 1.03 TFLOPS1
DP: 0.515 TFLOPS DP: 0.515 TFLOPS

11/09
IU Server S2050/S2070

4*C2050/C2070
12/24 GB GDDR5
SP: 4.1 TFLOPS
DP: 2.06 TFLOPS

5/09 6/09 3/10 6/10 1/11


CUDA
CUDA Version 2.2 Version 2.3 Version 3.0 Version 3.1 Version 3.2

6/10

OpenCL+ OpenCL 1.1

2009 2010 2011

1: Without SF (Special Function) operations

Figure 4.12: Overview of Nvidia’s GF100 (Fermi)-based Tesla family


© Sima Dezső, ÓE NIK 635 www.tankonyvtar.hu
4. Overview of data parallel accelerators (13)

Fermi based Tesla devices

Tesla C2050/C2070 Card [71] Tesla S2050/S2070 1U [72]


(11/2009) (11/2009)

Single GPU Card Four GPUs


3/6 GB GDDR5 12/24 GB GDDR5
515 GFLOPS DP 2060 GFLOPS DP
ECC ECC

© Sima Dezső, ÓE NIK 636 www.tankonyvtar.hu


4. Overview of data parallel accelerators (14)

Tesla M2050/M2070/M2070Q Processor Module


(Dual slot board with PCIe Gen. 2 x16 interface)
(04/2010)

Figure 4.13: Tesla M2050/M2070/M2070Q Processor Module [74]

Used in the Tianhe-1A Chinese supercomputer (10/2010)

Remark
The M2070Q is an upgrade of the M2070 providing higher memory clock (introduced 08/2010)
© Sima Dezső, ÓE NIK 637 www.tankonyvtar.hu
4. Overview of data parallel accelerators (15)

Tianhe-1A (10/2010) [48]


• Upgraded version of the Tianhe-1 (China)
• 2.6 PetaFLOPS (fastest supercomputer in the World in 2010)
• 14 336 Intel Xeon 5670
• 7 168 Nvidia Tesla M2050

© Sima Dezső, ÓE NIK 638 www.tankonyvtar.hu


4. Overview of data parallel accelerators (16)

Specification data of the Tesla M2050/M2070/M2070Q modules [74]

(448 ALUs) (448 ALUs)

Remark
The M2070Q is an upgrade of the M2070, providing higher memory clock (introduced 08/2010)

© Sima Dezső, ÓE NIK 639 www.tankonyvtar.hu


4. Overview of data parallel accelerators (17)

Support of ECC
• Fermi based Tesla devices introduced the support of ECC.
• By contrast recently neither Nvidia’s straightforward GPGPU cards nor AMD’s GPGPU or
DPA devices support ECC [76].

© Sima Dezső, ÓE NIK 640 www.tankonyvtar.hu


4. Overview of data parallel accelerators (18)

Tesla S2050/S2070 1U

The S2050/S2070 differ only in the memory size, the S2050 includes 12 GB, the S2070 24 GB.

GPU Specification
 Number of processor cores: 448
 Processor core clock: 1.15 GHz
 Memory clock: 1.546 GHz
 Memory interface: 384 bit

System Specification
 Four Fermi GPUs
 12.0/24.0 GB of GDDR5,
 configured as 3.0/6.0 GB per GPU.

 When ECC is turned on,
available memory is ~10.5 GB
 Typical power consumption: 900 W

Figure 4.14: Block diagram and technical specifications of Tesla S2050/S2070 [75]
© Sima Dezső, ÓE NIK 641 www.tankonyvtar.hu
4. Overview of data parallel accelerators (19)

AMD FireStream-1 (Brook+ programmable DPAs)

RV670-based RV770-based

11/07 6/08

Card 9170 9170


2 GB GDDR3 Shipped
FP32: 500 GFLOPS
FP64: ~200 GFLOPS

6/08 10/08
9250 9250
1 GB GDDR3 Shipped
FP32: 1000 GFLOPS
FP64: ~300 GFLOPS

12/07 09/08
Stream Computing
SDK Version 1.0 Version 1.2
Brook+ Brook+
ACM/AMD Core Math Library ACM/AMD Core Math Library
CAL (Compute Abstraction Layer) CAL (Compute Abstraction Layer)

Rapid Mind

2007 2008

Figure 4.15: Overview of AMD/ATI’s FireStream family-1

© Sima Dezső, ÓE NIK 642 www.tankonyvtar.hu


4. Overview of data parallel accelerators (20)

AMD FireStream-2 (OpenCL programmable DPAs)


In 01/11 Version 2.3
renamed to APP

RV870-based

06/10 10/10

Card 9350/9370 9350/9370

2/4 GB GDDR5 Shipped


FP32: 2016 GFLOPS
FP64: 403/528 GFLOPS

03/09 03/10 05/10 08/10 12/10


Stream Computing
SDK Version 1.4 Version 2.01 Version 2.1 Version 2.2 Version 23
Brooks+ OpenCL 1.0 OpenCL 1.0 OpenCL 1.1 OpenCL 1.1

2009 2010 2011

APP: Accelerated Parallel Processing

Figure 4.16: Overview of AMD/ATI’s FireStream family-2

© Sima Dezső, ÓE NIK 643 www.tankonyvtar.hu


4. Overview of data parallel accelerators (21)

Nvidia Tesla cards


Core type C870 C1060 C2050 C2070

Based on G80 GT200 T20 (GF100-based)

Introduction 6/07 6/08 11/09

Core

Core frequency 600 MHz 602 MHz 575 MHz

ALU frequency 1350 MHz 1296 GHz 1150 MHz

No. of SMs (cores) 16 30 14

No. of ALUs 128 240 448

Peak FP32 performance 345.6 GFLOPS 933 GFLOPS 1030.4 GFLOPS

Peak FP64 performance - 77.76 GFLOPS 515.2 GFLOPS


Memory
Mem. transfer rate (eff) 1600 Mb/s 1600 Mb/s 3000 Mb/s

Mem. interface 384-bit 512-bit 384-bit

Mem. bandwidth 76.8 GB/s 102 GB/s 144 GB/s

Mem. size 1.5 GB 4 GB 3 GB 6 GB

Mem. type GDDR3 GDDR3 GDDR5


System
ECC - - ECC

Interface PCIe *16 PCIe 2.0*16 PCIe 2.0*16

Power (max) 171 W 200 W 238 W 247 W

Table 4.1: Main features of Nvidia’s data parallel accelerator cards (Tesla line) [73]

© Sima Dezső, ÓE NIK 644 www.tankonyvtar.hu


4. Overview of data parallel accelerators (22)

AMD FireStream cards


Core type 9170 9250 9350 9370

Based on RV670 RV770 RV870 RV870

Introduction 11/07 6/08 10/10 10/10

Core

Core frequency 800 MHz 625 MHz 700 MHz 825 MHz

ALU frequency 800 MHz 325 MHz 700 MHz 825 MHz

No. of EUs 320 800 1440 1600

Peak FP32 performance 512 GFLOPS 1 TFLOPS 2016 GFLOPS 2640 GFLOPS

Peak FP64 performance ~200 GFLOPS ~250 GFLOPS 403.2 GFLOPS 528 GFLOPS
Memory
Mem. transfer rate (eff) 1600 Mb/s 1986 Mb/s 4000 Mb/s 4600 Mb/s

Mem. interface 256-bit 256-bit 256-bit 256-bit

Mem. bandwidth 51.2 GB/s 63.5 GB/s 128 GB/s 147.2 GB/s

Mem. size 2 GB 1 GB 2 GB 4 GB

Mem. type GDDR3 GDDR3 GDDR5 GDDR5


System
ECC - - - -

Interface PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16

Power (max) 150 W 150 W 150 W 225 W

Table 4.2: Main features of AMD/ATI’s data parallel accelerator cards (FireStream line) [67]

© Sima Dezső, ÓE NIK 645 www.tankonyvtar.hu


4. Overview of data parallel accelerators (23)

Price relations (as of 1/2011)

Nvidia Tesla
C2050 ~ 2000 $
C2070 ~ 4000 $
S2050 ~ 13 000 $
S2070 ~ 19 000 $

NVidia GTX
GTX580 ~ 500 $

© Sima Dezső, ÓE NIK 646 www.tankonyvtar.hu


GPGPUs/DPAs 5.1
Case example 1:
Nvidia’s Fermi family of cores

Dezső Sima

© Sima Dezső, ÓE NIK 647 www.tankonyvtar.hu


Aim

Aim
Brief introduction and overview.

© Sima Dezső, ÓE NIK 648 www.tankonyvtar.hu


5.1 Nvidia’s Fermi family of cores

5.1.1 Introduction to Nvidia’s Fermi family of cores


5.1.2 Nvidia’s Parallel Thread eXecution (PTX)
Virtual Machine concept

5.1.3 Key innovations of Fermi’s PTX 2.0

5.1.4 Nvidia’s high level data parallel programming model


5.1.5 Major innovations and enhancements of
Fermi’s microarchitecture
5.1.6 Microarchitecture of Fermi GF100
5.1.7 Comparing key features of the microarchitectures of
Fermi GF100 and the predecessor GT200
5.1.8 Microarchitecture of Fermi GF104

5.1.9 Microarchitecture of Fermi GF110


5.1.10 Evolution of key features of the
microarchitecture of Nvidia’s GPGPU lines
© Sima Dezső, ÓE NIK 649 www.tankonyvtar.hu
5.1.1 Introduction to Nvidia’s Fermi family of cores

© Sima Dezső, ÓE NIK 650 www.tankonyvtar.hu


5.1.1 Introduction to Nvidia’s Fermi family of cores (1)

Announced: 30. Sept. 2009 at NVidia’s GPU Technology Conference, available: 1Q 2010 [83]

© Sima Dezső, ÓE NIK 651 www.tankonyvtar.hu


5.1.1 Introduction to Nvidia’s Fermi family of cores (2)

Sub-families of Fermi
Fermi includes three sub-families with the following representative cores and features:

Available Max. no. Max. no. No of Compute


GPGPU Aimed at
since of cores of ALUs transistors capability

GF100 3/2010 161 5121 3200 mtrs 2.0 Gen. purpose

GF104 7/2010 8 384 1950 mtrs 2.1 Graphics

GF110 11/2010 16 512 3000 mtrs 2.0 Gen. purpose

1 In the associated flagship card (GTX 480) however, one of the SMs has been disabled, due to overheating
problems, so it has actually only 15 SIMD cores, called Streaming Multiprocessors (SMs) by Nvidia and 480
FP32 EUs [69]

© Sima Dezső, ÓE NIK 652 www.tankonyvtar.hu


5.1.1 Introduction to Nvidia’s Fermi family of cores (3)
Terminology of these slides Nvidia’s terminology AMD/ATI’s terminology

(4-24) X SIMD Cores


Core Block Array
Streaming Processor Array SIMD Array
CBA in Nvidia’s G80/G92/
(7-10) TPC in Data Parallel Processor Array
GT200 SPA
G80/G92/GT200 DPP Array
CA Core Array (else)
(8-16) SMs in the Fermi line Compute Device
Stream Processor

Texture Processor Cluster


Core Block (2-3) Streaming
CB in Nvidia’s G80/G92/ TPC Multiprocessors
G200 in G80/G92/GT200,
not present in Fermi

Streaming Multiprocessor (SM) SIMD Core


G80-GT200: scalar issue to SIMD Engine (Pre OpenCL term)
a single pipeline Data Parallel Processor (DPP)
SIMT Core
C SM GF100/110: scalar issue to Compute Unit
SIMD core
dual pipelines 16 x (VLIW4/VLIW5) ALUs
GF104: 2-way superscalar
issue to dual pipelines

VLIW4/VLIW5 ALU
Stream core (in OpenCL SDKs)
Streaming Processor
Algebraic Logic Unit Compute Unit Pipeline (6900 ISA)
ALU CUDA Core
(ALU) SIMD pipeline (Pre OpenCL) term
Thread processor (Pre OpenCL term)
Shader processor (Pre OpenCL term)

Stream cores (ISA publ.s)


Execution Units (EUs) FP Units Processing elements
EU
(e.g. FP32 units etc.) FX Units Stream Processing Units
ALUs (in ISA publications)

Table 5.1.1: Terminologies used with GPGPUs/Data parallel accelerators


© Sima Dezső, ÓE NIK 653 www.tankonyvtar.hu
5.1.2 Nvidia’s Parallel Thread eXecution (PTX)
Virtual Machine concept

© Sima Dezső, ÓE NIK 654 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (1)

The PTX Virtual Machine concept consists of two related components


• a parallel computational model and
• the ISA of the PTX virtual machine (Instruction Set architecture), which is a pseudo ISA,
since programs compiled to the ISA of the PTX are not directly executable but need
a further compilation to the ISA of the target GPGPU.
The parallel computational model underlies the PTX virtual machine.

© Sima Dezső, ÓE NIK 655 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (2)

The parallel computational model of PTX


The parallel computational model of PTX underlies both the ISA of the PTX and the CUDA
language.
It is based on three key abstractions
a) The model of computational resources
b) The memory model
c) The data parallel execution model covering
c1) the mapping of execution objects to the execution resources (parallel machine model).
c2) The data sharing concept
c3) The synchronization concept

These models are only outlined here, a detailed description can be found in the related
documentation [147].
Remark
The outlined abstractions remained basically unchanged throughout the life span of PTX
(from version 1.0 (6/2007) to version 2.3 (3/2011)).

© Sima Dezső, ÓE NIK 656 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (3)

a) The model of computational


resources [147]

A set of SIMD cores


with on-chip
shared memory

A set of ALUs
within the
SIMD cores

© Sima Dezső, ÓE NIK 657 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (4)

b) The memory model [147]


Per-thread
reg. space

Main features of the memory spaces

© Sima Dezső, ÓE NIK 658 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (5)

c) The data parallel execution model-1[147]


(SIMT model)

The execution model is based on

• a set of SIMT capable SIMD processors,


designated in our slides as SIMD cores,
(called Multiprocessors in the Figure
and the subsequent description), and
• a set of ALUs (whose capabilities are
declared in the associated ISA),
designated as Processors in the Figure
and the subsequent description.

© Sima Dezső, ÓE NIK 659 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (6)

c1) The data parallel execution model-2[147]


A concise overview of the execution model is given in Nvidia’s PTX ISA description,
which is worth citing.

© Sima Dezső, ÓE NIK 660 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (7)

c1) The data parallel execution model-3 [147]

© Sima Dezső, ÓE NIK 661 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (8)

c2) The data sharing concept-1 [147]


The data parallel model allows threads within a CTA to share data by means of a
Shared Memory, declared in the platform model, that is allocated to each SIMD core.

Per-thread
reg. space

Main features of the memory spaces

© Sima Dezső, ÓE NIK 662 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (9)

c2) The data sharing concept-2 [147]

A set of SIMD cores


with on-chip
shared memory

A set of ALUs
within the
SIMD cores

© Sima Dezső, ÓE NIK 663 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (10)

c3) The synchronization concept [147]

• Sequential consistency is provided by barrier synchronization
(implemented by the bar.sync instruction).
• Threads wait at the barrier until all threads in the CTA have arrived.
• In this way all memory writes prior to the barrier are guaranteed to have stored data before
reads after the barrier access the referenced data (providing memory consistency).

© Sima Dezső, ÓE NIK 664 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (11)

The ISA of the PTX virtual machine


It is the definition of a pseudo ISA for GPGPUs that
• is close to the “metal” (i.e. to the actual ISA of GPGPUs) and
• serves as the hardware independent target code for compilers e.g. for CUDA or OpenCL.

Compilation to
PTX pseudo ISA
instructions
Translation to
executable CUBIN file
at load time
CUBIN FILE
665 Source: [68]
5.1.2 Nvidia’s PTX Virtual Machine Concept (12)

The PTX virtual machine concept gives rise to a two phase compilation process.
1) First, the application, e.g. a CUDA or OpenCL program will be compiled to a pseudo code,
called also as PTX ISA code or PTX code by the appropriate compiler.
The PTX code is a pseudo code since it is not directly executable and needs to be translated
to the actual ISA of a given GPGPU to become executable.

Application
(CUDA C/OpenCL file)

Two-phase compilation

CUDA C compiler
or
OpenCL compiler
• First phase:
Compilation to the PTX ISA format
(pseudo ISA instructions, stored in text format)

CUDA driver
• Second phase (during loading):
JIT-compilation to
executable object code
(called CUBIN file).
CUBIN file
(executable on the GPGPU)
© Sima Dezső, ÓE NIK 666 www.tankonyvtar.hu
5.1.2 Nvidia’s PTX Virtual Machine Concept (13)

2) In order to become executable, the PTX code needs to be compiled to the actual ISA code of a particular GPGPU, resulting in the so-called CUBIN file.
This compilation is performed by the CUDA driver while loading the program (Just-In-Time).
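A hedged sketch of how this load-time JIT step looks from a host program using the CUDA driver API (CUDA 4.x-era calls; error checking omitted). The file name vecAdd.ptx and the kernel name VecAdd are hypothetical; loading a PTX module makes the driver compile it to the ISA of the GPGPU present in the system.

#include <cuda.h>

// Loads a PTX file, lets the driver JIT-compile it, then launches the contained kernel.
void launch_from_ptx(CUdeviceptr dA, CUdeviceptr dB, CUdeviceptr dC, int n)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction kernel;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod, "vecAdd.ptx");             // second phase: PTX is JIT-compiled here
    cuModuleGetFunction(&kernel, mod, "VecAdd");  // handle to the compiled kernel

    void *args[] = { &dA, &dB, &dC, &n };
    cuLaunchKernel(kernel, (n + 255) / 256, 1, 1, // grid dimensions
                   256, 1, 1,                     // thread block dimensions
                   0, 0, args, 0);                // shared memory, stream, parameters
    cuCtxSynchronize();
}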

Application
(CUDA C/OpenCL file)

Two-phase compilation

CUDA C compiler
or
OpenCL compiler

• First phase:
Compilation to the PTX ISA format
(stored in text format)
pseudo ISA instructions)
CUDA driver

• Second phase (during loading):


JIT-compilation to
executable object code
CUBIN file (called CUBIN file).
(runable on the GPGPU )
© Sima Dezső, ÓE NIK 667 www.tankonyvtar.hu
5.1.2 Nvidia’s PTX Virtual Machine Concept (14)

Benefit of the Virtual machine concept

• The compiled pseudo ISA code (PTX code) remains in principle independent of the actual hardware implementation of a target GPGPU, i.e. it is portable across subsequent GPGPU families.
  Porting a PTX file to a GPGPU of a lower compute capability level may, however, require emulation of features not implemented in hardware, which slows down execution.
  Forward portability of GPGPU object code (CUBIN code), by contrast, is provided only within major compute capability versions.
• Forward portability of PTX code is highly advantageous in the recent rapid evolution phase of GPGPU technology as it results in lower code refactoring costs.
  Code refactoring costs are a kind of software maintenance cost that arises when the user switches from a given GPGPU generation to a subsequent one (like from GT200-based devices to GF100 or GF110-based devices) or to a new software environment (like from the CUDA 1.x SDK to the CUDA 2.x or CUDA 3.x SDK).

© Sima Dezső, ÓE NIK 668 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (15)

Remarks [149]

1) • Nvidia manages the evolution of their devices and programming environment by maintaining
compute capability versions of both
• their intermediate virtual PTX architectures (PTX ISA) and
• their real architectures (GPGPU ISA).
• Designation of the compute capability versions
• Subsequent versions of the intermediate PTX ISA are designated as PTX ISA 1.x or 2.x.
• Subsequent versions of GPGPU ISAs are designated as sm_1x/sm_2x or simply by 1.x/2.x.
• The first digit 1 or 2 denotes the major version number, the second or subsequent digit
denotes the minor version.
• Major versions of 1.x or 1x relate to pre-Fermi solutions whereas those of 2.x or 2x
to Fermi based solutions.

© Sima Dezső, ÓE NIK 669 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (16)

Remarks (cont.) [149]


• Correspondence of the PTX ISA and GPGPU ISA compute capability versions
So far there has been a one-to-one correspondence between the PTX ISA versions and the GPGPU ISA versions, i.e. PTX ISA versions and GPGPU ISA versions with the same major and minor version numbers have the same compute capability.
However, there is no guarantee that this one-to-one correspondence will remain valid in the future.
Main facts concerning the compute capability versions are summarized below.

© Sima Dezső, ÓE NIK 670 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (17)

a) Functional features provided by the compute capability versions of the GPGPUs and virtual PTX ISAs [81]

671 www.tankonyvtar.hu
5.1.2 Nvidia’s PTX Virtual Machine Concept (18)

b) Device parameters bound to the compute capability versions of Nvidia’s GPGPUs and virtual PTX ISAs [81]

672 www.tankonyvtar.hu
5.1.2 Nvidia’s PTX Virtual Machine Concept (19)

c) Supported compute capability versions of Nvidia’s GPGPU cards [81]

Capability vers. (sm_xy) | GPGPU cores | GPGPU devices

10 | G80 | GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M

11 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50

12 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M

13 | GT200, GT200b | GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800

20 | GF100, GF110 | GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro 600, 4/5/6000, Plex7000, GTX570, GTX580

21 | GF108, GF106, GF104, GF114 | GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti, GTX 560Ti

© Sima Dezső, ÓE NIK 673 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (20)

d) Compute capability versions of PTX ISAs generated by subsequent releases of CUDA SDKs and supported GPGPUs (designated as Targets in the Table) [147]

(Table not reproduced: PTX ISA 1.x/sm_1x – pre-Fermi implementations; PTX ISA 2.x/sm_2x – Fermi implementations)

© Sima Dezső, ÓE NIK 674 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (21)

e) Forward portability of PTX code [52]


Applications compiled for pre-Fermi GPGPUs that include PTX versions of their kernels should work as-is on Fermi GPGPUs as well.

f) Compatibility rules of object files (CUBIN files) compiled to a particular GPGPU compute capability version [52]
The basic rule is forward compatibility within the main versions (versions sm_1x and sm_2x),
but not across main versions.
This is interpreted as follows:
Object files (called CUBIN files) compiled to a particular GPGPU compute capability version
are supported on all devices having the same or higher version number within the
same main version.
E.g. object files compiled to the compute capability 1.0 are supported on all 1.x devices
but not supported on compute capability 2.0 (Fermi) devices.

For more details see [52].
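As a hedged illustration (not from the original slides), a host program can query the compute capability of the installed device at run time with the CUDA runtime API and check it against the above rule before relying on a prebuilt CUBIN file:

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("Device 0: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    // CUBIN files are forward compatible only within the same major version
    if (prop.major >= 2)
        printf("Fermi-class device: CUBIN files built for 1.x are not supported here.\n");
    return 0;
}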

© Sima Dezső, ÓE NIK 675 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (22)

Remarks (cont.)
2. Contrasting the virtual machine concept with traditional computer technology
Whereas the PTX virtual machine concept is based on a forward compatible but not directly executable compiler target code (pseudo code), in traditional computer technology the compiled code, such as x86 object code, is immediately executable by the processor.
Earlier CISC processors, like Intel’s x86 processors up to the Pentium, executed x86 code immediately in hardware.
Subsequent CISCs, beginning with 2. generation superscalars (like the Pentium Pro) and including current x86 processors, like Intel’s Nehalem (2008) or AMD’s Bulldozer (2011), first map x86 CISC instructions during decoding to internally defined RISC instructions.
In these processors a ROM-based µcode engine (i.e. firmware) supports the decoding of complex x86 instructions (instructions which need more than 4 RISC instructions).
The RISC core of the processor then executes the requested RISC operations directly.

Figure 5.1.1: Hardware/firmware mapping of x86 instructions to directly executable RISC instructions in Intel Nehalem [51]

© Sima Dezső, ÓE NIK 676 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (23)

Remarks (cont.)
3) Nvidia’s CUDA compiler (nvcc) has been designated as the CUDA C compiler beginning with CUDA version 3.0, to stress the support of C.

© Sima Dezső, ÓE NIK 677 www.tankonyvtar.hu


5.1.2 Nvidia’s PTX Virtual Machine Concept (24)

Remarks (cont.)
4) nvcc can be used to generate either architecture-specific object files (CUBIN files) or forward compatible PTX versions of the kernels [52].

(Figure: the two alternative compilation paths of nvcc –
 Application (CUDA C/OpenCL file), compiled by the CUDA C compiler or the OpenCL compiler:
 • Direct compilation to executable object code, called CUBIN file:
   no forward compatibility, the CUBIN file is bound to the given compute capability version (1.x or 2.x).
 • Two-phase compilation: first to PTX code (pseudo ISA instructions), then, during loading, JIT-compilation by the CUDA driver to executable object code, called CUBIN file (GPGPU specific object code):
   forward compatibility is provided within the same major compute capability version (1.x or 2.x).)
© Sima Dezső, ÓE NIK www.tankonyvtar.hu
5.1.2 Nvidia’s PTX Virtual Machine Concept (25)

Remarks (cont.)
The virtual machine concept underlying both Nvidia’s and AMD’s GPGPUs is similar to
the virtual machine concept underlying Java.
• For Java there is also an inherent computational model and a pseudo ISA, called the
Java bytecode.
• Applications written in Java will first be compiled to the platform independent Java bytecode.
• The Java bytecode will then either be interpreted by the Java Runtime Environment (JRE)
installed on the end user’s computer or compiled at runtime by the Just-In-Time (JIT)
compiler of the end user.

© Sima Dezső, ÓE NIK 679 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0

© Sima Dezső, ÓE NIK 680 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0 (1)

Overview of PTX 2.0


• Fermi’s underlying pseudo ISA is the 2. generation PTX 2.x (Parallel Thread eXecution) ISA, introduced along with the Fermi line.
• PTX 2.x is a major redesign of the PTX 1.x ISA towards a more RISC-like load/store architecture rather than a memory-based (x86-like) architecture.

With PTX 2.0 Nvidia states that they have created a longevity ISA for GPUs, much like the x86 ISA for CPUs.
Based on the key innovations and declared goals of Fermi’s ISA (PTX 2.0), and considering the significant innovations and enhancements made in the microarchitecture, it can be expected that Nvidia’s GPGPUs have entered a phase of relative consolidation.

© Sima Dezső, ÓE NIK 681 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0 (2)

Key innovations of PTX 2.0


a) Unified address space for all variables and pointers with a single set of load/store instructions
b) 64-bit addressing capability
c) New instructions to support the OpenCL and DirectCompute APIs
d) Full support of predication
e) Full IEEE 754-2008 support for 32-bit and 64-bit FP precision

These new features greatly improve GPU programmability, accuracy and performance.

© Sima Dezső, ÓE NIK 682 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0 (3)

a) Unified address space for all variables and pointers with a single set of
load/store instructions-1 [58]
• In PTX 1.0 there are three separate address spaces
(thread private local, block shared and global)
with specific load/store instructions to each one of the three address spaces.
• Programs could load or store values in a particular target address space at addresses
that become known at compile time.
It was difficult to fully implement C and C++ pointers since a pointer’s target address
could only be determined dynamically at run time.

© Sima Dezső, ÓE NIK 683 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0 (4)

a) Unified address space for all variables and pointers with a single set of
load/store instructions-2 [58]

• PTX 2.0 unifies all three address spaces into a single continuous address space that
can be accessed by a single set of load/store instructions.
• PTX 2.0 allows the use of unified pointers to pass objects in any memory space, and Fermi’s hardware automatically maps pointer references to the correct memory space.
  Thus the concept of the unified address space enables Fermi to support C++ programs.
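A hedged CUDA C sketch (hypothetical helper and kernel names, not from the original slides) of what unified addressing buys on compute capability 2.x devices: the same __device__ routine, taking a plain pointer, can be handed data residing either in shared or in global memory, and the hardware resolves each access to the proper memory space.

__device__ float sum3(const float *p)      // generic pointer: may point to shared or global memory
{
    return p[0] + p[1] + p[2];
}

__global__ void unified_demo(const float *gdata, float *out)
{
    __shared__ float sdata[3];
    int tid = threadIdx.x;

    if (tid < 3) sdata[tid] = 2.0f * gdata[tid];
    __syncthreads();

    if (tid == 0) {
        out[0] = sum3(gdata);   // argument resides in global memory
        out[1] = sum3(sdata);   // argument resides in shared memory
    }
}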

© Sima Dezső, ÓE NIK 684 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0 (5)

b) 64-bit addressing capability

• Nvidia’s previous generation GPGPUs (G80, G92, GT200) provide 32-bit addressing for load/store instructions.
• PTX 2.0 extends the addressing capability to 64 bits for future growth.
  However, recent Fermi implementations use only 40-bit addresses, allowing access to an address space of 1 Terabyte.

© Sima Dezső, ÓE NIK 685 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0 (6)

c) New instructions to support the OpenCL and DirectCompute APIs


• PTX2.0 is optimized for the OpenCL and DirectCompute programming environments.
• It provides a set of new instructions allowing hardware support for these APIs.

© Sima Dezső, ÓE NIK 686 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0 (7)

d) Full support of predication [56]

• PTX 2.0 supports predication for all instructions.


• Predicated instructions will be executed or skipped depending on the actual values of condition codes.
• Predication allows each thread to perform different operations while execution continues at full speed.
• Predication is a more efficient solution for streaming applications than using conventional conditional branches and branch prediction, as illustrated by the sketch below.
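A hedged illustration (hypothetical kernel, not from the original slides): a conditional with a short body like the one below is a typical candidate for predication, i.e. the compiler may emit the store guarded by a predicate instead of a branch, so the threads of a warp keep executing in lockstep.

__global__ void clamp_to_zero(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Short conditional body: may be compiled into a predicated store
    // rather than a conditional branch.
    if (i < n && data[i] < 0.0f)
        data[i] = 0.0f;
}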

© Sima Dezső, ÓE NIK 687 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0 (8)

e) Full IEEE 754-2008 support for 32-bit and 64-bit FP precision

• Fermi’s FP32 instruction semantics and implementation now support
  • calculations with subnormal numbers (numbers that lie between zero and the smallest normalized number) and
  • all four rounding modes (nearest, zero, positive infinity, negative infinity).
• Fermi provides fused multiply-add (FMA) instructions for both single and double precision FP calculations (retaining full precision in the intermediate stage), instead of using truncation between the multiplication and the addition as done in previous generation GPGPUs for multiply-add (MAD) instructions.
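The difference can be made explicit in CUDA C with intrinsics; the sketch below (not from the original slides) contrasts the fused form, which rounds only once, with an intentionally unfused multiply-then-add that rounds twice.

// Fused: a*x + y evaluated with a single rounding step (maps to an FMA instruction).
__device__ float axpy_fused(float a, float x, float y)
{
    return fmaf(a, x, y);
}

// Unfused: these intrinsics force a separately rounded multiply and add (two roundings).
__device__ float axpy_unfused(float a, float x, float y)
{
    return __fadd_rn(__fmul_rn(a, x), y);
}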

© Sima Dezső, ÓE NIK 688 www.tankonyvtar.hu


5.1.3 Key innovations of Fermi’s PTX 2.0 (9)

Supporting program development for the Fermi line of GPGPUs [58]

• Nvidia provides a development environment, called Nexus, designed specifically to support


parallel CUDA C, OpenCL and DirectCompute applications.
• Nexus brings parallel-aware hardware source code debugging and performance analysis
directly into Microsoft Visual Studio.
• Nexus allows Visual Studio developers to write and debug GPU source code using
exactly the same tools and interfaces that are used when writing and debugging CPU code.
• Furthermore, Nexus extends Visual Studio functionality by offering tools to manage
massive parallelism.

© Sima Dezső, ÓE NIK 689 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel
programming model

© Sima Dezső, ÓE NIK 690 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (1)

CUDA (Compute Unified Device Architecture) [43]


• It is Nvidia’s hardware and software architecture for issuing and managing data parallel computations on a GPGPU without the need to map them to a graphics API.
• It became available starting with the CUDA release 0.8 (in 2/2007) and the GeForce 8800 cards.
• CUDA is designed to support various languages and Application Programming Interfaces (APIs).

Figure 5.1.2: Supported languages and APIs (as of starting with CUDA version 3.0)

© Sima Dezső, ÓE NIK 691 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (2)

Writing CUDA programs [43]


Writing CUDA programs

At the level of CUDA C
• CUDA C exposes the CUDA programming model as a minimal set of C language extensions.
• These extensions allow to define kernels along with the dimensions of the associated grids and thread blocks.
• The CUDA C program must be compiled with nvcc.

At the level of the CUDA driver API
• The CUDA Driver API is a lower level C API that allows to load and launch kernels as modules of binary or assembly CUDA code and to manage the platform.
• Binary and assembly codes are usually obtained by compiling kernels written in C.

Figure 5.1.3: The CUDA software stack [43]
(The stack comprises the application – e.g. CUDA C code to be compiled with nvcc, or libraries such as CUBLAS – on top of the CUDA Driver API, which is used to manage the platform and to load and launch kernels.)
© Sima Dezső, ÓE NIK 692 www.tankonyvtar.hu
5.1.4 Nvidia’s high level data parallel programming model (3)

The high-level CUDA programming model

• It supports data parallelism.


• It is the task of the operating system’s multitasking mechanism to manage access to the GPGPU by several CUDA and graphics applications running concurrently.
• Beyond that, advanced Nvidia GPGPUs (beginning with the Fermi family) are able to run multiple kernels concurrently, as sketched below.
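A hedged sketch of how an application exposes this concurrency (hypothetical kernels, not from the original slides): kernels launched into different CUDA streams may overlap on compute capability 2.x devices (up to 16 concurrent kernels on Fermi), whereas on 1.x devices they are serialized.

#include <cuda_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

// dX and dY are device buffers of at least 256 floats each.
void run_concurrently(float *dX, float *dY)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    kernelA<<<1, 256, 0, s0>>>(dX);   // the two launches may overlap on a
    kernelB<<<1, 256, 0, s1>>>(dY);   // Fermi-class (compute 2.x) device

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}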

© Sima Dezső, ÓE NIK 693 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (4)

Main components of the programming model of CUDA [43],


• The data parallel programming model is based on the following abstractions
a) The platform model
b) The memory model of the platform
c) The execution model including
c1) The kernel concept as a means to utilize data parallelism
c2) The allocation of threads and thread blocks to ALUs and SIMD cores
c3) The data sharing concept
c4) The synchronization concept

• These abstractions will be outlined briefly below.


• A more detailed description of the programming model of the OpenCL standard is given
in Section 3.

© Sima Dezső, ÓE NIK 694 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (5)

a) The platform model [146]

(Figure: the platform model – a host and a device consisting of SIMD cores, each containing a number of ALUs)

© Sima Dezső, ÓE NIK 695 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (6)

b) The memory model of the platform [43]

The Local Memory is an extension of the per-thread register space in the device memory.

A thread has access to the device’s DRAM and on-chip memory through a set of memory spaces.
© Sima Dezső, ÓE NIK 696 www.tankonyvtar.hu
5.1.4 Nvidia’s high level data parallel programming model (7)

Remark
Compute capability dependent memory sizes of Nvidia’s GPGPUs

© Sima Dezső, ÓE NIK 697 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (8)

c) The execution model [43]

Overview

Serial code executes on the host


while parallel code executes on
the device.
© Sima Dezső, ÓE NIK 698 www.tankonyvtar.hu
5.1.4 Nvidia’s high level data parallel programming model (9)

c1) The kernel concept [43]

• CUDA C allows the programmer to define kernels as C functions that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
• A kernel is defined by
  • using the __global__ declaration specifier and
  • declaring the instructions to be executed.
• The number of CUDA threads that execute the kernel for a given call is specified during kernel invocation by using the <<< … >>> execution configuration syntax.
• Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

The subsequent sample code illustrates a kernel that adds two vectors A and B of size N and
stores the result into vector C as well as its invocation.
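The sample code itself is not reproduced in this text version; the following is the standard CUDA C vector-addition example with the same content (the kernel definition and its invocation with N threads), reconstructed here for completeness.

#include <cuda_runtime.h>
#define N 256

// Kernel definition: executed by N CUDA threads in parallel
__global__ void VecAdd(const float *A, const float *B, float *C)
{
    int i = threadIdx.x;          // built-in unique thread ID within the block
    C[i] = A[i] + B[i];
}

int main()
{
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));
    cudaMalloc(&dC, N * sizeof(float));
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation: 1 thread block of N threads, given in the <<< >>> configuration
    VecAdd<<<1, N>>>(dA, dB, dC);

    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}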

© Sima Dezső, ÓE NIK 699 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (10)

c2) The allocation of threads and thread blocks to ALUs and SIMD cores

• Threads are allocated to ALUs for execution.


• The capability of the ALUs depends on the Compute capability version of the GPGPU that needs
to be supported by the compilation (SDK release).
• Thread blocks are allocated for execution to the SIMD cores.
Within thread blocks there is a possibility of data sharing through the Shared Memories,
that are allocated to each SIMD core (to be discussed subsequently).

© Sima Dezső, ÓE NIK 700 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (11)

Available register spaces for threads, thread blocks and grids-1 [43]

© Sima Dezső, ÓE NIK 701 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (12)

Available register spaces for threads, thread blocks and grids-2 [43]

© Sima Dezső, ÓE NIK 702 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (13)

c3) The data sharing concept [43]


Threads within a thread block can cooperate by sharing data through a per-thread-block Shared Memory.

Figure 5.1.4: The memory model [43]

© Sima Dezső, ÓE NIK 703 www.tankonyvtar.hu


5.1.4 Nvidia’s high level data parallel programming model (14)

c4) The synchronization concept [43]

• Within a thread block the execution of threads can be synchronized.


• To achieve this the programmer can specify synchronization points in the kernel by calling the __syncthreads() function, which acts as a barrier at which all threads in the thread block must wait before any is allowed to proceed, as illustrated below.
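A hedged example (hypothetical kernel, not from the original slides) of why the barrier is needed: a per-block sum reduction in shared memory, where every halving step reads values written by other threads in the previous step, so __syncthreads() must separate the steps.

// Assumes blockDim.x == 256 (a power of two).
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float buf[256];
    int tid = threadIdx.x;

    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();          // barrier before the next halving step
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];
}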

© Sima Dezső, ÓE NIK 704 www.tankonyvtar.hu


5.1.5 Major innovations and enhancements of
Fermi’s microarchitecture

© Sima Dezső, ÓE NIK 705 www.tankonyvtar.hu


5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (1)

Major innovations
a) Concurrent kernel execution
b) True two level cache hierarchy
c) Configurable shared memory/L1 cache per SM
d) ECC support

Major enhancements
a) Vastly increased FP64 performance
b) Greatly reduced context switching times
c) 10-20 times faster atomic memory operations

© Sima Dezső, ÓE NIK 706 www.tankonyvtar.hu


5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (2)

Major architectural innovations of Fermi

a) Concurrent kernel execution [39], [83]


• In previous generations (G80, G92, GT200) the global scheduler could only assign work
to the SMs from a single kernel (serial kernel execution).
• The global scheduler of Fermi is able to run up to 16 different kernels concurrently, one per SM
• A large kernel may be spread over multiple SMs.

In Fermi up to 16 kernels can run concurrently, each one on a different SM.

(Figure: serial kernel execution on compute devices 1.x (devices before Fermi) vs. concurrent kernel execution on compute devices 2.x (devices starting with Fermi))

© Sima Dezső, ÓE NIK 707 www.tankonyvtar.hu
5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (3)

b) True two level cache hierarchy [58]

• Traditional GPU architectures support a read-only


“load path” for texture operations and a write-only
“export path” for pixel data output.
  For computational tasks, however, this impedes the reordering of read and write operations that is usually done to speed up computations.
• To eliminate this deficiency Fermi implements a unified memory access path for both loads and stores.
• Fermi furthermore provides a unified L2 cache for speeding up loads, stores and texture requests.

Source: [58] (Fermi white paper)


© Sima Dezső, ÓE NIK 708 www.tankonyvtar.hu
5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (4)

c) Configurable shared memory/L1 cache per SM [58]

• Fermi provides furthermore a configurable


shared memory/L1 cache per SM.
• The shared memory/L1 cache unit is configurable
to optimally support both shared memory and
caching of local and global memory operations.
Supported options are 48 KB shared memory
with 16 KB L1 cache or vice versa.
• The optimal configuration depends on
the application to be run.
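The split can be requested per kernel through the CUDA runtime API; a hedged sketch with a hypothetical kernel name follows.

#include <cuda_runtime.h>

__global__ void myKernel(float *data) { data[threadIdx.x] += 1.0f; }   // hypothetical kernel

void configure_and_launch(float *dData)
{
    // Request the 48 KB shared memory / 16 KB L1 split for this shared-memory-heavy kernel;
    // cudaFuncCachePreferL1 would request the opposite 16 KB / 48 KB split instead.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    myKernel<<<1, 256>>>(dData);
}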

© Sima Dezső, ÓE NIK 709 www.tankonyvtar.hu


5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (5)

d) ECC support [58]

It protects
• DRAM memory
• register files
• shared memories
• L1 and L2 caches.

Remark
ECC support is provided only for Tesla devices.

© Sima Dezső, ÓE NIK 710 www.tankonyvtar.hu


5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (6)

Major architectural enhancements of Fermi

a) Vastly increased FP64 performance


Compared to the previous G80 and GT200-based generations, Fermi provides vastly increased FP64 performance, as shown below for the flagship Tesla and GPGPU cards.

Flagship Tesla cards
                    C870 (G80-based)   C1060 (GT200-based)   C2070 (Fermi T20-based)
FP64 performance    -                  77.76 GFLOPS          515.2 GFLOPS

Flagship GPGPU cards
                    8800 GTX (G80-based)   GTX 280 (GT200-based)   GTX 480 (Fermi GF100-based)   GTX 580 (Fermi GF110-based)
FP64 performance    -                      77.76 GFLOPS            168 GFLOPS                    197.6 GFLOPS

© Sima Dezső, ÓE NIK 711 www.tankonyvtar.hu


5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (7)

Throughput of arithmetic operations per clock cycle per SM [43]


(Table: throughput of arithmetic operations per clock cycle per SM for the GT80, GT200, GF100/110 and GF104; note: the GT80/92 does not support FP64. Not reproduced in this text version.)

712 www.tankonyvtar.hu
5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (8)

b) Greatly reduced context switching times [58]

• Fermi performs context switches between different applications in about 25 µs.


• This is about 10 times faster than context switches in the previous GT200 (200-250 µs).

© Sima Dezső, ÓE NIK 713 www.tankonyvtar.hu


5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (9)

c) 10-20 times faster atomic memory operations [58]


• Atomic operations are widely used in parallel programming to facilitate correct
read-modify-write operations on shared data structures.
• Owing to its increased number of atomic units and the L2 cache added, Fermi performs
atomic operations up to 20x faster than previous GT200 based devices.
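A hedged example (hypothetical kernel, not from the original slides) of the kind of code that benefits: a histogram built with atomic read-modify-write updates. Integer atomicAdd() on global memory has been available since compute capability 1.1, and Fermi (2.x) adds atomicAdd() on 32-bit floats as well.

__global__ void histogram(const unsigned char *in, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);   // correct even when many threads hit the same bin
}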

© Sima Dezső, ÓE NIK 714 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100

© Sima Dezső, ÓE NIK 715 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (1)

Overall structure of Fermi GF100 [83], [58]

Nvidia: 16 cores (Streaming Multiprocessors, SMs)
Each core: 32 ALUs
In total: 512 ALUs

Remark
In the associated flagship card (GTX 480), however, one SM has been disabled due to overheating problems, so it actually has 15 SMs and 480 ALUs [a].

6x Dual Channel GDDR5 (6 x 64 = 384 bit)

© Sima Dezső, ÓE NIK 716 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (2)

High level microarchitecture of Fermi GF100

Figure 5.1.5: Fermi’s system architecture [39]

© Sima Dezső, ÓE NIK 717 www.tankonyvtar.hu
© Sima Dezső, ÓE NIK www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (3)

Evolution of the high level microachitecture of Nvidia’s GPGPUs [39]

Fermi GF100

Note

The high level microarchitecture of Fermi evolved from a graphics oriented structure to a computation oriented one, complemented with the units needed for graphics processing.

© Sima Dezső, ÓE NIK 718 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (4)

Layout of a Cuda GF100 core (SM) [54]


(SM: Streaming Multiprocessor)

SFU: Special Function Unit

1 SM includes 32 ALUs (called “Cuda cores” by Nvidia).

© Sima Dezső, ÓE NIK 719 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (5)
Evolution of the cores (SMs) in Nvidia’s GPGPUs -1: GT80 SM [57], GT200 SM [84], Fermi GF100/GF110 [54]

(Figure: the GT80 Streaming Multiprocessor – Instruction L1, Instruction Fetch/Dispatch, 8 SP units, 2 SFUs and Shared Memory)

GT80 SM
• 16 KB Shared Memory
• 8 K registers x 32-bit/SM
• up to 24 active warps/SM
  up to 768 active threads/SM
  10 registers/thread on average

GT200 SM
• 16 KB Shared Memory
• 16 K registers x 32-bit/SM
• up to 32 active warps/SM
  up to 1 K active threads/SM
  16 registers/thread on average
• 1 FMA FPU (not shown)

Fermi GF100/GF110 SM
• 64 KB Shared Memory/L1 Cache
• up to 48 active warps/SM
• 32 threads/warp
  up to 1536 active threads/SM
  20 registers/thread on average

© Sima Dezső, ÓE NIK 720 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (6)
GF100 [70] GF104 [55]
Further evolution of the cores
(SMs) in Nvidia’s GPGPUs -2

GF104 [55]

Available specifications:

• 64 KB Shared Memory/L1 Cache


• 32 Kx32-bit registers/SM
• 32 threads/warp

Data about
• the number of active warps/SM and
• the number of active threads/SM
are at present (March 2011) not available.

© Sima Dezső, ÓE NIK 721


5.1.6 Microarchitecture of Fermi GF100 (7)

Structure and operation of the Fermi GF100 GPGPU

© Sima Dezső, ÓE NIK 722 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (8)

Layout of a Cuda GF100 core (SM) [54]

Special Function Units calculate FP32 transcendental functions (such as trigonometric functions etc.).

1 SM includes 32 ALUs (called “Cuda cores” by Nvidia).

© Sima Dezső, ÓE NIK 723 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (9)

A single ALU (“CUDA core”)

SP FP: 32-bit

FP64
• First implementation of the IEEE 754-2008 standard.
• Needs 2 clock cycles to issue the entire warp for execution.
• FP64 performance: ½ of the FP32 performance!! (Enabled only on Tesla devices!)

Fermi’s integer units (INT Units)
• are 32-bit wide.
• became stand-alone units, i.e. they are no longer merged with the MAD units as in prior designs.
In addition, each floating-point unit (FP Unit) is now capable of producing IEEE 754-2008-compliant double-precision (DP) FP results in every 2nd clock cycle, at ½ of the performance of single-precision FP calculations.

Figure 5.1.6: A single ALU [40]
© Sima Dezső, ÓE NIK 724 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (10)

Remark
The Fermi line supports the Fused Multiply-Add (FMA) operation, rather than the Multiply-Add
operation performed in previous generations.

Previous lines

Fermi

Figure 5.1.7: Contrasting the Multiply-Add (MAD) and the Fused-Multiply-Add (FMA) operations
[56]

© Sima Dezső, ÓE NIK 725 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (11)

Principle of the SIMT execution in case of serial kernel execution

(Figure 5.1.8: Hierarchy of threads [25] – the host invokes kernel0<<<>>>() and then kernel1<<<>>>() on the device; each kernel invocation executes all of its thread blocks (Block(i,j)), and thread blocks may be executed independently of each other.)
© Sima Dezső, ÓE NIK 726 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (12)

Principle of operation of a Fermi GF100 GPGPU

The key point of operation is work scheduling

Subtasks of work scheduling


• Scheduling kernels to SMs
• Scheduling thread blocks of the kernels to the SMs
• Segmenting thread blocks into warps
• Scheduling warps for execution in SMs

© Sima Dezső, ÓE NIK 727 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (13)

Scheduling kernels to SMs [38], [83]

• A global scheduler, called the Gigathread scheduler assigns work to each SM.
• In previous generations (G80, G92, GT200) the global scheduler could only assign work to the
SMs from a single kernel (serial kernel execution).
• The global scheduler of Fermi is able to run up to 16 different kernels concurrently, one per SM.
• A large kernel may be spread over multiple SMs.

In Fermi up to 16 kernels can run concurrently, each one on a different SM.

(Figure: serial kernel execution on compute devices 1.x (devices before Fermi) vs. concurrent kernel execution on compute devices 2.x (devices starting with Fermi))

© Sima Dezső, ÓE NIK 728 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (14)

The context switch time occurring between kernel switches is greatly reduced compared to
the previous generation, from about 250 µs to about 20 µs (needed for cleaning up TLBs,
dirty data in caches, registers etc.) [39].

© Sima Dezső, ÓE NIK 729 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (15)

Scheduling thread blocks of the kernels to the SMs

• The Gigathread scheduler assigns up to 8 thread blocks of the same kernel to each SM.
  (Thread blocks assigned to a particular SM must belong to the same kernel.)
• Nevertheless, the Gigathread scheduler can assign different kernels to different SMs, so up to 16 concurrent kernels can run on 16 SMs.

(Figure: thread blocks – threads t0 t1 t2 … tm – assigned to the SMs)

© Sima Dezső, ÓE NIK 730


5.1.6 Microarchitecture of Fermi GF100 (16)

The notion and main features of thread blocks in CUDA [57]

• Programmer declares blocks:
  - Block size: 1 to 512 concurrent threads
  - Block shape: 1D, 2D, or 3D
  - Block dimensions in threads
• Each block can execute in any order relative to other blocks!
• All threads in a block execute the same kernel program (SPMD)
• Threads have thread id numbers within the block
  - The thread program uses the thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate

(Figure: a CUDA Thread Block – threads with Thread Id # 0, 1, 2, … m, all executing the same thread program. Courtesy: John Nickolls, NVIDIA)

© Sima Dezső, ÓE NIK 731 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (17)

Segmenting thread blocks into warps [12]

• Threads are scheduled for execution in groups of 32 threads, called warps.
• For scheduling, each thread block is subdivided into warps.
• At any point of time up to 48 warps can be maintained by the schedulers of the SM.

(Figure: the warps (threads t0 t1 t2 … t31) of Block 1 and Block 2; TB: Thread Block, W: Warp)

Remark
The number of threads constituting a warp is an implementation decision and not part of the CUDA programming model.
E.g. in the G80 there are up to 24 warps per SM, whereas in the GT200 there are up to 32 warps per SM.

© Sima Dezső, ÓE NIK 732 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (18)

Scheduling warps for execution in SMs

Nvidia did not reveal details of the microarchitecture of Fermi so the subsequent
discussion of warp scheduling is based on assumptions given in the sources [39], [58].

© Sima Dezső, ÓE NIK 733 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (19)

Assumed block diagram of the Fermi GF100 microarchitecture and its operation

• Based on [39] and [58], Fermi’s front end can be


assumed to be built up and operate as follows:
• The front end consists of dual execution pipelines or, from another point of view, of two tightly coupled thin cores with dedicated and shared resources.
• Dedicated resources per thin core are
  • the Warp Instruction Queues,
  • the Scoreboarded Warp Schedulers and
  • 16 ALUs.
• Shared resources include
  • the Instruction Cache,
  • the 128 KB (32 K registers x 32-bit) Register File,
  • the four SFUs and
  • the 64 KB L1 cache/Shared Memory.

© Sima Dezső, ÓE NIK 734 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (20)

Remark
Fermi’s front end is similar to the basic building block of AMD’s Bulldozer core (2011)
that consists of two tightly coupled thin cores [85].

Figure 5.1.9: The Bulldozer core [85]


© Sima Dezső, ÓE NIK 735 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (21)

Assumed principle of operation-1

• Both warp schedulers are connected through a


partial crossbar to five groups of execution
units, as shown in the figure.
• Up to 48 warp instructions may be held in dual
instruction queues waiting for issue.
• Warp instructions having all needed operands
are eligible for issue.
• Scoreboarding tracks the availability of
operands based on the expected latencies of
issued instructions, and marks instructions
whose operands became already computed as
eligible for issue.
• Fermi’s dual warp schedulers select two eligible warp instructions for issue in every two shader cycles according to a given scheduling
policy.
• Each Warp Scheduler issues one warp
instruction to a group of 16 ALUs
(each consisting of a 32-bit SP ALU and
a 32-bit FPU), 4 SFUs or 16 load/store units
(not shown in the figure).

Figure 5.1.10: The Fermi core [39]


© Sima Dezső, ÓE NIK 736 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (22)

Assumed principle of operation-2

• Warp instructions are issued to the appropriate


group of execution units as follows:
• FX and FP32 arithmetic instructions, including FP FMA instructions, are forwarded to 16 32-bit ALUs, each of them incorporating a 32-bit FX ALU and a 32-bit FP32 ALU (FPU).
  FX instructions will be executed in the 32-bit FX unit, whereas SP FP instructions in the SP FP unit.
• FP64 arithmetic instructions, including FP64 FMA instructions, will be forwarded to both groups of 16 32-bit FPUs at the same time; thus DP FMA instructions enforce single issue.
• FP32 transcendental instructions will be issued to the 4 SFUs.

Figure 5.1.11: The Fermi core [39]


© Sima Dezső, ÓE NIK 737 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (23)

Assumed principle of operation-3

• A warp scheduler needs multiple shader cycles


to issue the entire warp (i.e. 32 threads),
to the available number of execution units of
the target group.

The number of shader cycles needed is determined


by the number of execution units available in
a particular group, e.g. :
• FX or FP32 arithmetic instructions: 2 cycles
• FP64 arithmetic instructions : 2 cycles
(but they prevent dual issue)
• FP32 transcendental instructions: 8 cycles
• Load/store instructions 2 cycles.

Execution cycles of further operations are given


in [43].

Figure 5.1.12: The Fermi core [39]


© Sima Dezső, ÓE NIK 738 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (24)

Example: Throughput of arithmetic operations per clock cycle per SM [43]

(Table: throughput of arithmetic operations per clock cycle per SM for the GT80, GT200, GF100/110 and GF104; note: the GT80/92 does not support DP FP. Not reproduced in this text version.)

739 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (25)

Scheduling policy of warps in an SM of Fermi GF100-1

Official documentation reveals only that Fermi GF100 has dual issue zero overhead prioritized scheduling [58].

© Sima Dezső, ÓE NIK 740 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (26)

Scheduling policy of warps in an SM of Fermi GF100-2

Official documentation reveals only that Fermi GF100 has dual issue zero overhead prioritized scheduling [58].

Nevertheless, based on further sources [86] and early slides discussing warp scheduling in the GT80 in a lecture held by D. Kirk, one of the key developers of Nvidia’s GPGPUs (ECE 498AL, Illinois [12]), the following assumptions can be made for the
warp scheduling in Fermi:
 Warps whose next instruction is ready for execution, that is all its operands are
available, are eligible for scheduling.
 Eligible Warps are selected for execution on a not revealed priority scheme that is
based presumably on the warp type (e.g. pixel warp, computational warp),
instructions type and age of the warp.
 Eligible warp instructions of the same priority are scheduled presumably
according to a round robin policy.
 It is not unambiguous whether Fermi uses fine-grained or coarse-grained scheduling.
Early publications discussing warp scheduling in the GT80 [12] let assume that
warps are scheduled coarse grained but figures in the same publication
illustrating warp scheduling show to the contrary fine grain scheduling,
as shown subsequently.

© Sima Dezső, ÓE NIK 741 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (27)

Remarks

D. Kirk, one of the developers of Nvidia’s GPGPUs details warp scheduling in [12],
but this publication includes two conflicting figures, one indicating to coarse grain and the
other to fine grain warp scheduling as shown below.
Underlying microarchitecture of warp scheduling in an SM of the G80

• The G80 fetches one warp instruction/issue cycle
  - from the instruction L1 cache
  - into any instruction buffer slot.
• It issues one “ready-to-go” warp instruction/issue cycle
  - from any warp – instruction buffer slot.
• Operand scoreboarding is used to prevent hazards
  - An instruction becomes ready after all needed values are deposited.
  - It prevents hazards.
  - Cleared instructions become eligible for issue.
• Issue selection is based on round-robin/age of warp.
• SM broadcasts the same instruction to 32 threads of a warp.

Figure 5.1.13: Warp scheduling in the G80 [12]
(The figure shows the SM blocks involved: the I$ L1 instruction cache, the Multithreaded Instruction Buffer, the register file (RF), the C$ L1 constant cache, the Shared Memory, the Operand Select stage and the MAD and SFU execution units.)

© Sima Dezső, ÓE NIK 742 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (28)

Scheduling policy of warps in an SM of the G80 indicating coarse grain warp scheduling

• The G80 uses decoupled memory/processor pipelines
  - any thread can continue to issue instructions until scoreboarding prevents issue
  - it allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops.

Figure 5.1.14: Warp scheduling in the G80 [12]
(The figure shows the issue sequence over time, with TB = Thread Block, W = Warp:
 TB1/W1: instructions 1-6, then stalls; TB2/W1: 1-2, then stalls; TB3/W1: 1-2; TB3/W2: 1-2, then stalls;
 TB2/W1: 3-4; TB1/W1: 7-8; TB1/W2: 1-2; TB1/W3: 1-2; TB3/W2: 3-4.)

Note
The given scheduling scheme reveals a coarse grain one.

© Sima Dezső, ÓE NIK 743 www.tankonyvtar.hu


5.1.6 Microarchitecture of Fermi GF100 (29)

Scheduling policy of warps in an SM of the G80 indicating fine grain warp scheduling

• SM hardware implements zero-overhead Warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution.
  - Eligible Warps are selected for execution on a prioritized scheduling policy.
  - All threads in a Warp execute the same instruction when selected.
  - 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in the G80.

Figure 5.1.15: Warp scheduling in the G80 [12]
(The figure, titled “SM Warp Scheduling”, shows the SM multithreaded Warp scheduler issuing, over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96 – i.e. instructions of different warps are interleaved.)

Note
The given scheduling scheme reveals a fine grain one.
© Sima Dezső, ÓE NIK 744 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (30)

Estimation of the peak performance of the Fermi GF100 -1

a) Peak FP32 performance per SM


Max. throughput of warp instructions/SM:
• dual issue
• 2 cycles/issue
2 x ½ = 1 warp instruction/cycle

b) Peak FP32 performance (P FP32) of a GPGPU card


• 1 warp instruction/cycle
• 32 FMA/warp
• 2 operations/FMA
• at a shader frequency of fs
• n SM units
P FP32 = 1 x 32 x 2 x fs x n

P FP32 = 2 x 32 x fs x n FP32 operations/s

E.g. in the case of the GTX 480
• fs = 1401 MHz
• n = 15
P FP32 = 2 x 32 x 1401 x 15 = 1344.96 GFLOPS
Figure 5.1.16: The Fermi core [39]
© Sima Dezső, ÓE NIK 745 www.tankonyvtar.hu
5.1.6 Microarchitecture of Fermi GF100 (31)

Estimation of the peak performance of the Fermi GF100 -2

c) Peak FP64 performance per SM


Max. throughput of warp instructions/SM:
• single issue
• 2 cycles/issue
1 x 1/2 = 1/2 warp instruction/cycle

d) Peak FP64 performance (P FP64) of a GPGPU card


• 1 warp instruction/2 cycles
• 32 FMA/warp
• 2 operations/FMA
• at a shader frequency of fs
• n SM units
PFP64 = ½ x 32 x 2 x fs x n

PFP64 = 32 x fs x n FP64 operations/s

E.g. in the case of the GTX 480
• fs = 1401 MHz
• n = 15
PFP64 = 32 x 1401 x 15 = 672.048 GFLOPS

Figure 5.1.17: The Fermi core [39]
(This speed is provided only on Tesla devices; otherwise it is merely 1/8 of the FP32 performance.)
© Sima Dezső, ÓE NIK 746 www.tankonyvtar.hu
5.1.7 Comparing key features of the
microarchitectures of Fermi GF100
and the predecessor GT200

© Sima Dezső, ÓE NIK 747 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (1)

Key differences in the block diagrams of the microarchitectures of the GT200 and Fermi
(Assumed block diagrams [39] without showing result data paths)

(Figure: the GT200 SM has a single execution pipeline, whereas the Fermi SM has dual execution pipelines and vastly increased execution resources)
© Sima Dezső, ÓE NIK 748 www.tankonyvtar.hu
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (2)

Available execution resources of the GT200


• The GT200 includes only a single execution pipeline [39] with the following execution resources:
  • Eight 32-bit ALUs, each incorporating
    • a 32-bit FX and
    • a 32-bit (SP) FP execution unit.
    The 32-bit ALUs have a throughput of one FX32 or FP32 operation (including the MAD operation) per cycle.
  • A single 64-bit FPU, with a throughput of one FP64 arithmetic operation (including the MAD operation) per cycle.
  • Dual SFUs (Special Function Units), each with a throughput of 4 FP32 MUL instructions per cycle, or diverse transcendental functions over multiple cycles.

© Sima Dezső, ÓE NIK 749 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (3)

Throughput of arithmetic operations per clock cycle per SM in the GT200 [43]

(Table: throughput of arithmetic operations per clock cycle per SM for the G80, GT200, GF100/110 and GF104; not reproduced in this text version)

750 www.tankonyvtar.hu
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (4)

Latency, throughput and warp issue rate of the arithmetic pipelines of the GT200 [60]

© Sima Dezső, ÓE NIK 751 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (5)

Scheduling kernels to SMs in the GT200 [38], [83]

• A global scheduler, called the Gigathread scheduler assigns work to each SM.
• In the GT200, and all previous generations (G80, G92) the global scheduler could only assign
work to the SMs from a single kernel (serial kernel execution).
• By contrast, Fermi’s global scheduler is able to run up to 16 different kernels concurrently, presumably one per SM.

(Figure: serial kernel execution on compute devices 1.x (devices before Fermi) vs. concurrent kernel execution on compute devices 2.x (devices starting with Fermi); in Fermi up to 16 kernels can run concurrently, presumably each one on a different SM.)

© Sima Dezső, ÓE NIK 752 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (6)

Assigning thread blocks to the SMs in the GT200

The global scheduler distributes the thread blocks of the running kernel to the available SMs,
by assigning typically multiple blocks to each SM, as indicated in the Figure below.

Figure 5.1.18: Assigning thread blocks to an SM in the GT200 [61]

© Sima Dezső, ÓE NIK 753 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (7)

Each thread block may consist of multiple warps, e.g. of 4 warps, as indicated in the Figure.

Figure 5.1.19: Assigning thread blocks to an SM in the GT200 [61]


© Sima Dezső, ÓE NIK 754 www.tankonyvtar.hu
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (8)

Scheduling warps for issue in the GT200-1


Subsequently, the SM’s instruction schedulers select warps for execution.

The SM’s instruction scheduler is designated as the Warp Scheduler.

Figure 5.1.20: Allocation of thread blocks to an SM in the GT200 [61]
© Sima Dezső, ÓE NIK 755 www.tankonyvtar.hu
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (9)

Scheduling warps for issue in the GT200-2


Nvidia did not reveal the scheduling policy.

© Sima Dezső, ÓE NIK 756 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (10)

Issuing warps to the execution pipeline


The issue process can be characterized by

• the maximal issue rate of warps to a particular group of pipelined execution units of an SM,
called the issue rate by Nvidia and
• the maximal issue rate of warps to the execution pipeline of the SM.

(Figure: the GT200 core [39], indicating the maximal issue rate of warps to the execution pipeline of the SM and the maximal issue rate of warps to particular groups of pipelined execution units of an SM)

© Sima Dezső, ÓE NIK 757 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (11)

The maximal issue rate of warps to a particular group of pipelined execution units
(called the warp issue rate (clocks per warp) in Nvidia’s terminology)
It depends on the number and throughput of the individual execution units in a group,
or from another point of view, on the total throughput of all execution units
constituting a group of execution units, called arithmetic and flow control pipelines.

Table 5.1.2: Throughput of the arithmetic pipelines in the GT 200 [60]

The issue rate of the arithmetic and flow control pipelines will be determined by the warp size
(32 threads) and the throughput (ops/clock) of the arithmetic or flow control pipelines,
as shown below for the arithmetic pipelines.

Table 5.1.3: Issue rate of the arithmetic pipelines in the GT 200 [60]
© Sima Dezső, ÓE NIK 758 www.tankonyvtar.hu
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (12)

Accordingly,
• the issue of an FP32 MUL or MAD warp needs 32/8 = 4 clock cycles, or
• the issue of an FP64 MUL or MAD warp needs 32/1 = 32 clock cycles.

In the GT200 FP64 instructions are executed at 1/8 rate of FP32 instructions.

© Sima Dezső, ÓE NIK 759 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (13)

Example: Issuing basic arithmetic operations, such as an FP32 addition or multiplication


to the related arithmetic pipeline (constituting of 8 FPU units, called SP units in the Figure)

Figure 5.1.21: Issuing warps on an SM [61]

© Sima Dezső, ÓE NIK 760 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (14)

The maximal issue rate of warps to the execution pipeline of the SM -1 (based on [25])
As discussed previously, the Warp Schedulers of the SM can issue warps to the arithmetic
pipelines at the associated issue rates, e.g. in case of FX32 or FP32 warp instruction
in every fourth shader cycle to the FPU units .

(Timing sketch: successive FP32 MAD warps occupy the FPU units for 4 shader cycles each.)

Nevertheless, the scheduler of GT200 is capable of issuing warp instructions in every second
shader cycle to not occupied arithmetic pipelines of the SM if there are no dependencies
with previously issued instructions.
E.g. after issuing an FP32 MAD instruction to the FPU units, the Warp Scheduler can issue
already two cycles later an FP32 MUL instruction to the SFU units, if these units are not busy
and there is no data dependency between these two instructions.

(Timing sketch: while the FPU units are occupied by FP32 MAD warps, FP32 MUL warps issued two cycles later occupy the SFU units.)

The FP32 MUL warp instruction will occupy the SFU units for 4 shader cycles.
© Sima Dezső, ÓE NIK 761 www.tankonyvtar.hu
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (15)

The maximal issue rate of warps to the execution pipeline of an SM -2 (based on [25])
In this way the Warp Schedulers of the SMs may issue up to two warp instructions to the
single execution pipeline of the SM in every four shader cycle, provided that there are
no resource or data dependencies.

Figure 5.1.22: Dual issue of warp instructions in every 4 cycles in the GT200 (based on [25])
(The figure shows the scheduler issuing MAD, MUL, MAD, MUL warps alternately; the MAD warps occupy the FPU units and the MUL warps the SFU units, each for 4 shader cycles.)

© Sima Dezső, ÓE NIK 762 www.tankonyvtar.hu


5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (16)

Estimation of the peak performance of the GT200 -1

a) Peak FP32 performance for arithmetic operations per SM [39]
Max. throughput of FP32 arithmetic instructions per SM, e.g.:
• dual issue of an FP32 MAD and an FP32 MUL warp instruction in 4 cycles
2 x 1/4 = 1/2 warp instruction/cycle

b) Peak FP32 performance (PFP32) of a GPGPU card
• 1 FP32 MAD warp instruction/4 cycles plus
• 1 FP32 MUL warp instruction/4 cycles
• 32 x (2+1) operations in 4 shader cycles
• at a shader frequency of fs
• n SM units
PFP32 = 1/4 x 32 x 3 x fs x n

PFP32 = 24 x fs x n FP32 operations/s

E.g. in the case of the GTX 280
• fs = 1296 MHz
• n = 30
PFP32 = 24 x 1296 x 30 = 933.12 GFLOPS
© Sima Dezső, ÓE NIK 763 www.tankonyvtar.hu
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (17)

Estimation of the peak performance of the GT200 -2

c) Peak FP64 performance for arithmetic [39]


operations per SM
Max. throughput of the FP64 MAD
instructions per SM:
• single issue of a FP64 MAD
warp instruction in 32 cycles
1/32 warp instruction/cycle
d) Peak FP64 performance (PFP64) of a GPGPU card
• 1 FP64 MAD warp instructions/32 cycle
• 32 x 2 operations in 32 shader cycles
• at a shader frequency of fs
• n SM units

PFP64 = 1/32 x 32 x 2 x fs x n

PFP64 = 2 x fs x n FP64 operations/s

E.g. in the case of the GTX 280
• fs = 1296 MHz
• n = 30
PFP64 = 2 x 1296 x 30 = 77.76 GFLOPS
© Sima Dezső, ÓE NIK 764 www.tankonyvtar.hu
5.1.8 Microarchitecture of Fermi GF104

© Sima Dezső, ÓE NIK 765 www.tankonyvtar.hu


5.1.8 Microarchitecture of Fermi GF104 (1)

Introduced 7/2010 for graphic use


Key differences to the GF100
Number of cores
Only 8 in the GF104 vs 16 in the GF100

Figure 5.1.23: Contrasting the overall structures of the GF104 and the GF100 [69]
© Sima Dezső, ÓE NIK 766 www.tankonyvtar.hu
5.1.8 Microarchitecture of Fermi GF104 (2)

Note
In the GF104 based GTX 460 flagship card Nvidia activated only 7 SMs rather than
all 8 SMs available, due to overheating.

© Sima Dezső, ÓE NIK 767 www.tankonyvtar.hu


5.1.8 Microarchitecture of Fermi GF104 (3)

Available per-SM execution resources in the GF104 vs the GF100 (Fermi GF100 [70], Fermi GF104 [55])

                        GF100   GF104
No. of SP FX/FP ALUs    32      48
No. of L/S units        16      16
No. of SFUs             4       8
No. of DP FP ALUs       8       4

© Sima Dezső, ÓE NIK 768


5.1.8 Microarchitecture of Fermi GF104 (4)

Note
The modifications done in the GF104 vs the GF100 aim at increasing graphics performance per
SM at the expense of FP64 performance while halving the number of SMs in order to
reduce power consumption and price.

© Sima Dezső, ÓE NIK 769 www.tankonyvtar.hu


5.1.8 Microarchitecture of Fermi GF104 (5)

Warp issue in the Fermi GF104


SMs of the GF104 have dual thin cores
(execution pipelines), each with
2-wide superscalar issue [62]

It is officially not revealed how the


issue slots (1a – 2b) are allocated to
the groups of execution units.

© Sima Dezső, ÓE NIK 770


5.1.8 Microarchitecture of Fermi GF104 (6)

Peak computational performance data for the Fermi GF104 based GTX 460 card

According to the computational capability data [43] and in harmony with the figure on the
previous slide:

Peak FP32 FMA performance per SM:
48 FMA, i.e. 2 x 48 FP32 operations/shader cycle per SM

Peak FP32 performance of a GTX 460 card while it executes FMA warp instructions:

PFP32 = 2 x 48 x fs x n FP32 operations/s

with
• fs: shader frequency
• n: number of SM units

For the GTX 460 card
• fs = 1350 MHz
• n = 7
PFP32 = 2 x 48 x 1350 x 7 = 907.2 GFLOPS

Peak FP64 performance of a GTX 460 card while it executes FMA instructions:

PFP64 = 2 x 4 x fs x n FP64 operations/s

PFP64 = 2 x 4 x 1350 x 7 = 75.6 GFLOPS


© Sima Dezső, ÓE NIK 771 www.tankonyvtar.hu
5.1.9 Microarchitecture of Fermi GF110

© Sima Dezső, ÓE NIK 772 www.tankonyvtar.hu


5.1.9 Microarchitecture of Fermi GF110 (1)

Microarchitecture of Fermi GF110 [63]


• Introduced 11/2010 for general purpose use
• The GF110 is a redesign of the GF100 resulting in lower power consumption and higher speed.
  Whereas in the GF100-based flagship card (GTX 480) only 15 SMs could be activated due to overheating, the GF110 design allows Nvidia to activate all 16 SMs in the associated flagship card (GTX 580) and at the same time to increase the clock speed.

Key differences between the GF100-based GTX 480 and the GF110-based GTX 580 cards

                    GTX 480     GTX 580
No. of SMs          15          16
No. of SP ALUs      480         512
Shader frequency    1401 MHz    1544 MHz
TDP                 250 W       244 W

• Due to its higher shader frequency and increased number of SMs, the GF110-based GTX 580 card achieves a ~10 % higher peak performance than the GF100-based GTX 480 card at a somewhat reduced power consumption.

© Sima Dezső, ÓE NIK 773 www.tankonyvtar.hu


5.1.10 Evolution of key features of
Nvidia’s GPGPU microarchitectures

© Sima Dezső, ÓE NIK 774 www.tankonyvtar.hu


5.1.10 Evolution of key features of Nvidia’s GPGPU microarchitectures (1)

Evolution of FP32 warp issue efficiency in Nvidia’s GPGPUs

G80/G92 (11/06, 10/07)
• FP32 warp issue: scalar issue to a single pipeline in every 4th cycle
• Max. no. of warps per cycle: 1/4 warp per cycle, with issue restrictions
• Peak warp mix: 1 FP32 MAD per 4 cycles
• Max. no. of operations/cycle: 16

GT200 (6/08)
• FP32 warp issue: scalar issue to a single pipeline in every 2nd cycle
• Max. no. of warps per cycle: 1/2 warp per cycle, with less issue restrictions
• Peak warp mix: (1 FP32 MAD + 1 FP32 MUL) per 4 cycles
• Max. no. of operations/cycle: 24

GF100/GF110 (3/10)
• FP32 warp issue: scalar issue per dual pipelines in every 2nd cycle
• Max. no. of warps per cycle: 1 warp per cycle
• Peak warp mix: (1 FP32 FMA + 1 FP32 FMA) per 2 cycles
• Max. no. of operations/cycle: 64

GF104 (11/10)
• FP32 warp issue: 2-way superscalar issue per dual pipelines in every 2nd cycle
• Max. no. of warps per cycle: 2 warps per cycle, with issue restrictions
• Peak warp mix: (2 FP32 FMA + 1 FP32 FMA) per 2 cycles
• Max. no. of operations/cycle: 96
© Sima Dezső, ÓE NIK 775 www.tankonyvtar.hu


5.1.10 Evolution of key features of Nvidia’s GPGPU microarchitectures (2)

FP64 performance increase in Nvidia’s Tesla and GPGPU lines

Performance is bound by the number of available DP FP execution units.

                           G80/G92 (11/06, 10/07)   GT200 (06/08)        GF100 (03/10)         GF110 (11/10)
Avail. FP64 units          No FP64 support          1 FP64 unit          16 FP64 units         16 FP64 units
  operations                                        (Add, Mul, MAD)      (Add, Mul, FMA)       (Add, Mul, FMA)
Peak FP64 load/SM                                   1 FP64 MAD           16 FP64 FMA           16 FP64 FMA
Peak FP64 perf./cycle/SM                            1x2 operations/SM    16x2 operations/SM    16x2 operations/SM

Tesla cards
Flagship Tesla card                                 C1060                C2070
Peak FP64 perf./card                                30x1x2x1296          14x16x2x1150
                                                    77.76 GFLOPS         515.2 GFLOPS

GPGPU cards
Flagship GPGPU card                                 GTX 280              GTX 480¹              GTX 580¹
Peak FP64 perf./card                                30x1x2x1296          15x4x2x1401           16x4x2x1544
                                                    77.76 GFLOPS         168.12 GFLOPS         197.632 GFLOPS

¹ In their GPGPU Fermi cards Nvidia activates only 4 of the available 16 FP64 units
© Sima Dezső, ÓE NIK 776 www.tankonyvtar.hu
GPGPUs/DPAs 5.2
Case example 2:
AMD’s Cayman core

Dezső Sima

© Sima Dezső, ÓE NIK 777 www.tankonyvtar.hu


Aim

Aim
Brief introduction and overview.

© Sima Dezső, ÓE NIK 778 www.tankonyvtar.hu


5.2 AMD’s Cayman core

5.2.1 Introduction to the Cayman core

5.2.2 AMD’s virtual machine concept

5.2.3 AMD’s high level data and task parallel programming model

5.2.4 Simplified block diagram of the Cayman core

5.2.5 Principle of operation of the Command Processor

5.2.6 The Data Parallel Processor Array

5.2.7 The memory architecture

5.2.8 The Ultra-Threaded Dispatch Processor

5.2.9 Major steps of the evolution of AMD’s GPGPUs

© Sima Dezső, ÓE NIK 779 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core

© Sima Dezső, ÓE NIK 780 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (1)

Remarks

1) The subsequent description of the microarchitecture and its operation focuses on


the execution of computational tasks and disregards graphics ones.
2) To shorten designations, in the following slides the prefixes AMD or ATI as well as Radeon
will be omitted.

© Sima Dezső, ÓE NIK 781 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (2)

Overview of the performance increase of AMD’s GPGPUs [87]

© Sima Dezső, ÓE NIK 782 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (3)

Overview of AMD’s Northern Island series (HD 6xxx)

• The AMD HD 6970 (Cayman XT) was introduced in 12/2010.


• It is part of the AMD Northern Island (HD 6xxx) series.

AMD Northern Island series (HD 6xxx)

AMD HD 68xx (Barts-based) AMD HD 69xx (Cayman based)

• Gaming oriented • GPGPU oriented


• no FP64 • FP64 at 1/4 speed of FP32
• ALUs: 5-way VLIWs • ALUs: 4-way VLIWS

Cards
HD 6850 (Barts Pro) 10/2010 HD 6950 (Cayman Pro) 12/2010
HD 6870 (Barts XT) 10/2010 HD 6970 (Cayman XT) 12/2010
HD 6990 (Antilles) 3/2011

© Sima Dezső, ÓE NIK 783 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (4)

Remarks
1) The Barts core (underlying AMD’s HD 68xx cards) is named after Saint Barthélemy island.
2) The Cayman core (underlying AMD’s HD 69xx cards) is named after the Cayman Islands.
3) Cayman (AMD HD 69xx) was originally planned as a 32 nm device.
But both TSMC and Global Foundries canceled their 32 nm technology efforts (in 11/2009
and 4/2010, respectively) to focus on the 28 nm process, so AMD had to use the 40 nm feature size
for Cayman while eliminating some features already foreseen for that device [88].

© Sima Dezső, ÓE NIK 784 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (5)

Changing the brand naming convention of GPGPU cards and SDKs


along with the introduction of the Northern island series (HD 6xxx) [89]

• For their earlier GPGPUs, including the Evergreen series (HD 5xxx) AMD made use of the
ATI brand, e.g.

ATI Evergreen series: ATI Radeon HD 5xxx


ATI Radeon HD 5850 (Cypress Pro) 9/2009
ATI Radeon HD 5870 (Cypress XT)
ATI Radeon HD 5970 (Hemlock) 11/2009

• But starting with the Northern Island series AMD discontinued using the ATI brand
and began to use the AMD brand to emphasize the correlation with their computing
platforms, e.g.

AMD Northern Island series: AMD Radeon HD 6xxx

AMD Radeon HD 6850 (Barts Pro) 10/2010


AMD Radeon HD 6870 (Barts XT) 10/2010

• At the same time AMD renamed also the new version (v2.3) of their ATI Stream SDK
to AMD Accelerated Parallel Processing (APP).

© Sima Dezső, ÓE NIK 785 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (6)

Changing AMD’s software ecosystem


along with the introduction of the RV870 (Cypress) core based Evergreen series

© Sima Dezső, ÓE NIK 786 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (7)

AMD Stream Software Ecosystem as declared in the AMD Stream Computing


User Guide 1.3 [90]

© Sima Dezső, ÓE NIK 787 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (8)

AMD Stream Software Ecosystem as declared in the ATI Stream Computing


Programming Guide 2.0 [91]

© Sima Dezső, ÓE NIK 788 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (9)

Changing AMD’s software ecosystem

Software ecosystem: a more recent designation of the programming environment

AMD/ATI                 9/09                      10/10                     12/10

Cores                   RV870 (Cypress)           Barts Pro/XT              Cayman Pro/XT
                        40 nm/2100 mtrs           40 nm/1700 mtrs           40 nm/2640 mtrs

Cards                   HD 5850/70                HD 6850/70                HD 6950/70
                        1440/1600 ALUs            960/1120 ALUs             1408/1536 ALUs
                        256-bit                   256-bit                   256-bit

OpenCL                  11/09: OpenCL 1.0         03/10: OpenCL 1.0         08/10: OpenCL 1.1
                        (SDK V.2.0)               (SDK V.2.01)              (SDK V.2.2)

Brook+                  3/09: Brook+ 1.4          (SDK V.2.01)

RapidMind               8/09: Intel bought RapidMind

Beginning with their Cypress-based HD 5xxx line and SDK v.2.0 AMD left Brook+
and started supporting OpenCL.

© Sima Dezső, ÓE NIK 789 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (10)

Implications of changing the software ecosystem

Changing the software ecosystem had considerable implications for the microarchitecture of
AMD’s GPGPUs, for AMD IL and also for the terminology used in connection with AMD’s GPGPUs.

Implications to the microarchitecture

• The microarchitecture of general purpose CPUs is obviously language independent.
By contrast, GPGPUs are typically designed with a dedicated language (such as CUDA or
Brook+) in mind, so there is a close interrelationship between the programming environment
and the microarchitecture of a GPGPU that supports it.

• As a consequence, changing the software ecosystem affects the microarchitecture of the
related cards as well, as discussed in Sections 5.2.2 (AMD’s virtual machine concept)
and 5.2.9 (Evolution of key features of AMD’s GPGPU microarchitectures).

Implications to the AMD IL (Intermediate Language) pseudo ISA


This point will be discussed in Section 5.2.2 (AMD’s virtual machine concept).

© Sima Dezső, ÓE NIK 790 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (11)

Implications to the terminology


While moving from the Brook+ based ecosystem to the OpenCL based ecosystem
AMD also changed their terminology by distinguishing between
• Pre-OpenCL terms (designating them as deprecated), and
• OpenCL terms,
as shown in Table 5.2.1: Terminologies used with GPGPUs/Data parallel accelerators.

Remark
1) In particular, in Pre-OpenCL and OpenCL publications AMD makes use of contradictory
terminology.
In Pre-OpenCL publications (relating to RV770-based HD 4xxx cards or before)
AMD interprets the term “stream core” as the individual execution units within
the VLIW ALUs, whereas
in OpenCL terminology the same term designates the complete VLIW5 or VLIW4 ALU.

© Sima Dezső, ÓE NIK 791 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (12)

Pre-OpenCL terminology [92] OpenCL terminology [93]

(SIMD core)

© Sima Dezső, ÓE NIK 792 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (13)

Further remarks to naming of GPGPU cards

2) AMD designates their RV770 based HD4xxx cards as Terascale Graphics Engines [36]
referring to the fact that the HD4800 card reached a peak FP32 performance of 1 TFLOPS.
3) Beginning with the RV870 based Evergreen line (HD5xxx cards) AMD designated their
GPGPU architecture as the Terascale 2 Architecture referring to the fact that
the peak FP32 performance of the HD 5850/5870 cards surpassed the 2 TFLOPS mark.

© Sima Dezső, ÓE NIK 793 www.tankonyvtar.hu


5.2.1 Introduction to the Cayman core (14)
In these slides: CBA (Core Block Array) in Nvidia’s G80/G92/GT200, CA (Core Array) else
  Nvidia:  Streaming Processor Array (SPA) of (7-10) TPCs in the G80/G92/GT200;
           (8-16) SMs in the Fermi line
  AMD/ATI: SIMD Array / Data Parallel Processor (DPP) Array / Compute Device /
           Stream Processor, built of (4-24) SIMD cores

In these slides: CB (Core Block)
  Nvidia:  Texture Processor Cluster (TPC) of (2-3) Streaming Multiprocessors
           in the G80/G92/GT200; not present in Fermi
  AMD/ATI: -

In these slides: SM
  Nvidia:  Streaming Multiprocessor (SM)
           (G80-GT200: scalar issue to a single pipeline,
            GF100/110: scalar issue to dual pipelines,
            GF104: 2-way superscalar issue to dual pipelines)
  AMD/ATI: SIMD core / SIMD Engine (Pre-OpenCL term) / Data Parallel Processor (DPP) /
           SIMT core / Compute Unit, built of 16 (VLIW4/VLIW5) ALUs

In these slides: ALU
  Nvidia:  Streaming Processor / Algebraic Logic Unit (ALU) / CUDA Core
  AMD/ATI: VLIW4/VLIW5 ALU / Stream core (in OpenCL SDKs) /
           Compute Unit Pipeline (6900 ISA) / SIMD pipeline (Pre-OpenCL term) /
           Thread processor (Pre-OpenCL term) / Shader processor (Pre-OpenCL term)

In these slides: EU (Execution Units, e.g. FP32 units etc.)
  Nvidia:  Execution Units (EUs): FP Units, FX Units
  AMD/ATI: Processing elements / Stream Processing Units / Stream cores (in ISA publications) /
           ALUs (in ISA publications)

Table 5.2.1: Terminologies used with GPGPUs/Data parallel accelerators

794                                                                           www.tankonyvtar.hu
5.2.2 AMD’s virtual machine concept

© Sima Dezső, ÓE NIK 795 www.tankonyvtar.hu


5.2.2 AMD’s virtual machine concept (1)

For their GPGPU technology AMD makes use of the virtual machine concept, like Nvidia does with
their PTX virtual machine.
AMD’s virtual machine is composed of
• the pseudo ISA, called AMD IL and
• its underlying computational model.

© Sima Dezső, ÓE NIK 796 www.tankonyvtar.hu


5.2.2 AMD’s virtual machine concept (2)

AMD IL (AMD’s Intermediate Language)

Main features of AMD IL


• it is an abstract ISA that is close to the real ISA
of AMD’s GPGPUs,
• it hides the hardware details of real GPGPUs
in order to become forward compatible over
subsequent generations of GPGPUs,
• it serves as the backend of compilers.
• Kernels written in a HLL are compiled to
AMD’s IL code.
• IL code cannot be executed directly on
real GPGPUs, but needs a second compilation
before execution.
• IL code becomes executable after a second
compilation by the CAL compiler, which
produces GPGPU-specific binary code
and, if required, also the IL file.

Figure 5.2.1: AMD’s virtual machine concept,


that is based on the IL pseudo ISA [103]

© Sima Dezső, ÓE NIK 797 www.tankonyvtar.hu


5.2.2 AMD’s virtual machine concept (3)

Remarks
1) Originally, the IL (Intermediate Language) was based on the Microsoft 9X Shader Language [104].
2) About 2008 AMD made a far reaching decision to replace their Brook+ software environment
with the OpenCL environment, as already mentioned in the previous Section.

Figure 5.2.2: AMD’s Brook+ based Figure 5.2.3: AMD’s OpenCL based
programming environment [90] programming environment [91]

© Sima Dezső, ÓE NIK 798 www.tankonyvtar.hu


5.2.2 AMD’s virtual machine concept (4)

As a consequence of this decision, AMD had to make a major change in their IL


(and also in the real ISA of their GPGPUs along with the associated microarchitecture).
An example of the induced microarchitectural changes is the introduction of local and global
memories (LDS, GDS) in AMD’s RV770-based HD 4xxx line of GPGPUs in 2008.
Here we note that although Brook+ supported local data sharing, until the appearance of the
RV770-based HD 4xxx line AMD’s GPGPUs did not provide Local Data Share memories.

Figure 5.2.4: Introduction of Local and Global Data Share memories (LDSs, GDS) in AMD’s HD
4800 [36]

© Sima Dezső, ÓE NIK 799 www.tankonyvtar.hu


5.2.2 AMD’s virtual machine concept (5)

3) AMD provides also a low level programming interface to their GPGPU, called the
CAL interface (Compute Abstraction Layer) programming interface [106], [107].

Figure 5.2.5: AMD’s OpenCL based HLL and the low level CAL programming environment [91]
The CAL programming interface [104]
• is actually a low-level device-driver library that allows a direct control of the hardware.
• The set of low-level APIs provided allows programmers to directly open devices,
allocate memory, transfer data and initiate kernel execution and thus optimize
performance.
• An integral part of CAL interface is a JIT compiler for AMD IL.
© Sima Dezső, ÓE NIK 800 www.tankonyvtar.hu
5.2.2 AMD’s virtual machine concept (6)

CAL compilation
from AMD IL to the device specific ISA

Device-Specific ISA disassembled

Figure 5.2.6: Kernel compilation from AMD IL to Device-Specific ISA (disassembled) [148]

© Sima Dezső, ÓE NIK 801 www.tankonyvtar.hu


5.2.2 AMD’s virtual machine concept (7)

The AMD IL pseudo ISA and its underlying parallel computational model together constitute
a virtual machine.
From its conception on, the virtual machine, like Nvidia’s virtual machine, evolved in many
respects, but due to the lacking documentation of previous AMD IL versions this evolution
cannot be tracked.

© Sima Dezső, ÓE NIK 802 www.tankonyvtar.hu


5.2.2 AMD’s virtual machine concept (8)

AMD’s low level (IL) parallel computational model

The following brief overview is based on version 2.0e of the AMD Intermediate Language
Specification (Dec. 2010) [105].
The parallel computational model inherent in AMD IL is built up of three key abstractions:
a) The model of execution resources
b) The memory model, and
c) The parallel execution model (parallel machine model) including
c1) The allocation of execution objects to the execution pipelines
c2) The data sharing concept and
c3) The synchronization concept.
which will be outlined very briefly and simplified below.

© Sima Dezső, ÓE NIK 803 www.tankonyvtar.hu


5.2.2 AMD’s virtual machine concept (9)

a) The model of execution resources

The execution resources include a set of SIMT cores, each incorporating a number of ALUs
that are able to perform a set of given computations.

b) The memory model


AMD IL v.2.0e maintains five non-uniform memory regions:
• the private, constant, LDS, GDS and device memory regions.
• These memory regions map directly to the memory regions supported by OpenCL.
c) The execution model
c1) Allocation of execution objects to the execution units.
The thread model
• A hierarchical thread model is used.
• Threads are grouped together into thread groups which are units of allocation to the
SIMT cores.
• Threads within a thread group can communicate through local shared memory (LDS).
• There is no supported communication between thread groups.

Main features of the allocation


• A kernel running on a GPGPU can be launched with a number of thread groups.
• Threads in a thread group run in units, called wavefronts.
• All threads in a wavefront run in SIMD fashion (in lock steps).
• All wavefronts within a thread group can be synchronized by barrier synchronization.
© Sima Dezső, ÓE NIK 804 www.tankonyvtar.hu
5.2.2 AMD’s virtual machine concept (10)

c2) The data sharing concept

Data sharing is supported between all threads of a thread group.


This is implemented by providing Local Data Stores (actually read/write registers)
allocated to each SIMD core.
c3) The synchronization concept

• There is a barrier synchronization mechanism available to synchronize threads within


a thread group.
• When using a barrier all threads in a thread group must have reached the barrier before any
thread can advance.

© Sima Dezső, ÓE NIK 805 www.tankonyvtar.hu


5.2.2 AMD’s virtual machine concept (11)

Ideally, the same parallel execution model underlies all main components of the
programming environment, such as
• the real ISA of the GPGPU,
• the pseudo ISA and
• the HLL (like Brook+, OpenCL).

Enhancements of a particular component (e.g. the introduction of support for the
global shared memory concept by the HLL (OpenCL)) evoked the enhancement
of both other components.

© Sima Dezső, ÓE NIK 806 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel
programming model

© Sima Dezső, ÓE NIK 807 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (1)

The interpretation of the notion “AMD’s data and task parallel programming model”
A peculiarity of the GPGPU technology is that its high-level programming model is associated
with a dedicated high level language (HLL) programming environment, like that
of CUDA, Brook+ or OpenCL.
(By contrast, the programming model of the traditional CPU technology is associated with an
entire class of HLLs, called the imperative languages, like Fortran, C, C++ etc.,
as these languages share the same high-level programming model.)
With their SDK 2.0 (Software Development Kit) AMD changed the supported high-level
programming environment from Brook+ to OpenCL in 2009.
Accordingly, AMD’s high-level data parallel programming model became that of OpenCL.
• So a distinction is needed between AMD’s Pre-OpenCL and OpenCL
programming models.
• The next section discusses the programming model of OpenCL.

© Sima Dezső, ÓE NIK 808 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (2)

Remarks to the related AMD terminology


The actual designation of SDK 2.0 was ATI Stream SDK 2.0 referring to the fact
that the SDK is part of AMD’s Stream technology.
AMD’s Stream technology [108]
It is a set of hardware and software technologies that enable AMD’s GPGPUs to work in concert
with x86 cores for accelerating the execution of applications that include data parallelism.

Renaming of “AMD stream technology” to “AMD Accelerated Parallel Processing technology”


(APP)
In January 2011, along with the introduction of their SDK 2.3, AMD renamed
their “ATI Stream technology” to “AMD Accelerated Parallel Processing technology”,
presumably to better emphasize the main function of GPGPUs as data processing accelerators.

Changing AMD’s GPGPU terminology with distinction between Pre-OpenCL and OpenCL
terminology
Along with changing their programming model AMD also changed their related terminology
by distinguishing between Pre-OpenCL and OpenCL terminology, as already discussed
in Section 5.2.1.
For example, in their Pre-OpenCL terminology AMD speaks about threads and thread groups,
whereas in OpenCL terminology these terms are designated as Work items and Work Groups.
For a summary of the terminology changes see [109].

© Sima Dezső, ÓE NIK 809 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (3)

Main components of the programming model of OpenCL [93], [109], [144]

Basic philosophy of the OpenCL programming model


OpenCL allows developers to write a single portable program for a heterogeneous platform
including CPUs and GPGPUs (designated as GPUs in the Figure below).

Figure 5.2.7: Example heterogeneous platform targeted by OpenCL [144]


© Sima Dezső, ÓE NIK 810 www.tankonyvtar.hu
5.2.3 AMD’s high level data and task parallel programming model (4)

OpenCL includes
• a language (resembling C99) to write kernels, which allows the utilization of
data parallelism by means of GPGPUs, and
• APIs to control the platform and program execution.

© Sima Dezső, ÓE NIK 811 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (5)

Main components of AMD’s data and task parallel programming model of OpenCL [109]
The data and task parallel programming model is based on the following abstractions
a) The platform model
b) The memory model of the platform
c) The execution model
c1) Command queues
c2) The kernel concept as a means to utilize data parallelism
c3) The concept of NDRanges-1
c4) The concept of task graphs as a means to utilize task parallelism
c5) The scheme of allocation Work items and Work Groups to execution resources
of the platform model
c6) The data sharing concept
c7) The synchronization concept

© Sima Dezső, ÓE NIK 812 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (6)

a) The platform model [109], [144]

An abstract, hierarchical model that allows a unified view of different kinds of processors.

• In this model a Host coordinates execution and data transfers to and from an array
of Compute Devices.
• A Compute Device may be a GPGPU or even a CPU.
• Each Compute Device is composed of an array of Compute Units
(e.g. SIMD cores in case of a GPGPU card),
whereas each Compute Unit incorporates an array of Processing Elements
(e.g. VLIW5 ALUs in case of AMD’s GPGPUs).
© Sima Dezső, ÓE NIK 813 www.tankonyvtar.hu
5.2.3 AMD’s high level data and task parallel programming model (7)

E.g. in case of AMD’s GPGPUs the usual platform components are:

  Compute Device:      the GPGPU card
  Compute Unit:        a SIMD core
  Processing Element:  a VLIW5 or VLIW4 ALU

Figure 5.2.8: The Platform model of OpenCL [144]
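
As an illustration of the platform model, the sketch below (a minimal host-side C fragment using
standard OpenCL 1.x calls; error handling is omitted and the names are only illustrative) picks
the first GPU-type Compute Device and queries its number of Compute Units:

   #include <stdio.h>
   #include <CL/cl.h>

   int main(void) {
       cl_platform_id platform;
       cl_device_id   device;
       cl_uint        num_cus;

       /* Pick the first platform and its first GPU-type Compute Device */
       clGetPlatformIDs(1, &platform, NULL);
       clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

       /* Query the number of Compute Units (SIMD cores), e.g. 24 for an HD 6970 */
       clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                       sizeof(num_cus), &num_cus, NULL);
       printf("Compute Units: %u\n", num_cus);
       return 0;
   }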

© Sima Dezső, ÓE NIK 814 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (8)

b) The memory model of the platform-1 [109]


• Each Compute Device has a separate global memory space that is typically implemented as
an off-chip DRAM.
It is accessible by all Compute units of the Compute Device, typically supported by
a data cache.

• Each Compute Unit is assigned a Local memory that is typically implemented on-chip,
providing lower latency and higher bandwidth than the Global Memory.

© Sima Dezső, ÓE NIK 815 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (9)

b) The memory model of the platform-2 [109]

• There is also an on-chip Constant memory available, accessible to all Compute Units, which
allows the reuse of read-only parameters during computations.
• Finally, a Private Memory space, typically a small register space, is allocated to each
Processing Element, e.g. to each VLIW5 ALU of an AMD GPGPU.
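
In OpenCL C the memory spaces of the platform model are selected by address space qualifiers.
A schematic (purely illustrative) kernel fragment showing their use:

   /* __global   : Global Memory of the Compute Device (off-chip device memory)
      __constant : Constant Memory for read-only parameters
      __local    : Local Memory of a Compute Unit, shared by a Work Group
      __private  : Private Memory (registers) of a single Work item           */
   __kernel void example(__global float *data,
                         __constant float *coeff,
                         __local float *scratch)
   {
       __private float tmp = data[get_global_id(0)] * coeff[0];
       scratch[get_local_id(0)] = tmp;
   }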

© Sima Dezső, ÓE NIK 816 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (10)

The memory model including assigned Work items and Work Groups (Based on [94])

Compute Unit 1/Work Groups Compute Unit N/Work Groups

© Sima Dezső, ÓE NIK 817 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (11)

c) The execution model


• There is a mechanism, called command queues, to specify tasks to be executed, such as
data movements between the host and the compute device or data parallel kernels.
• The execution model supports both data and task parallelism.
• Data parallelism is obtained by the kernel concept which allows to apply a single function
over a range of data elements in parallel.
• Task parallelism is achieved by
• providing a general way to express execution dependencies between tasks, in form of
task graphs,
• and ensuring by an appropriate execution mechanism that the GPGPU will execute
the tasks in the specified order.
• Task graphs can be specified as lists of task related events in the command queues
that must occur before a particular kernel can be executed.

© Sima Dezső, ÓE NIK 818 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (12)

c1) Command queues

• They coordinate data movements between the host and Compute Devices (e.g. GPGPU cards)
as well as launch kernels.
• An OpenCL command queue is created by the developer and is associated with a specific
Compute Device.
• To target multiple OpenCL Compute Devices simultaneously the developer needs to
create multiple command queues.
• Command queues allow specifying dependencies between tasks (in form of a task graph),
ensuring that tasks will be executed in the specified order.
• The OpenCL runtime module will execute tasks in parallel if their dependencies are
satisfied and the platform is capable of doing so.
• In this way command queues, as conceived, allow implementing a task parallel execution
model.
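
A minimal host-side sketch of creating command queues (OpenCL 1.x API; the context ctx and the
device list dev[] are assumed to have been created beforehand, the names are illustrative):

   cl_int err;
   /* One command queue per targeted Compute Device */
   cl_command_queue q0 = clCreateCommandQueue(ctx, dev[0], 0, &err);
   cl_command_queue q1 = clCreateCommandQueue(ctx, dev[1], 0, &err);
   /* Data transfers and kernel launches enqueued into q0 and q1 can now
      be processed independently on the two Compute Devices               */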

© Sima Dezső, ÓE NIK 819 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (13)

Use of multiple Command queues

Aims
• either to parallelize an application across multiple compute devices,
• or to run multiple completely independent streams of computation (kernels) across
multiple compute devices or SIMD cores.
The latter possibility (different kernels on different SIMD cores of the same device) is
available only with the Cayman core.

© Sima Dezső, ÓE NIK 820 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (14)

c2) The kernel concept as a means to utilize data parallelism [109]

• Kernels are high level language constructs that allow expressing data parallelism and
utilizing it for speeding up computations by means of a GPGPU.
• Kernels are written in a C99-like language.
• OpenCL kernels are executed over an index space, which can be 1, 2 or 3 dimensional.
• The index space is designated as the NDRange (N-dimensional Range).
The subsequent Figure shows an example of a 2-dimensional index space, which has
Gx x Gy elements.

Figure 5.2.9: Example 2-dimensional


index space [109]
© Sima Dezső, ÓE NIK 821 www.tankonyvtar.hu
5.2.3 AMD’s high level data and task parallel programming model (15)

c3) The concept of NDRanges-1


• For every element of the kernel index space a Work item will be executed.
• All work items execute the same program, although their execution may differ due to
branching based on actual data values or the index assigned to each Work item.
The NDRange defines the total number of Work items that should be executed in parallel
related to a particular kernel.

Figure 5.2.10: Example 2-dimensional index space [109]


© Sima Dezső, ÓE NIK 822 www.tankonyvtar.hu
5.2.3 AMD’s high level data and task parallel programming model (16)

c3) The concept of NDRanges-2


• The index space is regularly subdivided into tiles, called Work Groups.
As an example, the Figure below illustrates Work Groups of size Sx x Sy elements.

Figure 5.2.11: Example


2-dimensional index
space [109]

• Each Work item in the Work Group will be assigned a Work Group id, labeled as wx, wy
in the Figure, as well as a local id, labeled as sx, sy in the Figure.
• Each Work item also gets a global id, which can be derived from its Work Group and
local ids.
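
Assuming the usual OpenCL convention (and no global offset), the global id of a Work item can be
derived as

   global id x = wx * Sx + sx
   global id y = wy * Sy + sy

which corresponds to get_global_id(0) = get_group_id(0) x get_local_size(0) + get_local_id(0)
in OpenCL C.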
© Sima Dezső, ÓE NIK 823 www.tankonyvtar.hu
5.2.3 AMD’s high level data and task parallel programming model (17)

Segmentation of NDRanges into Work Groups
(also called the NDRange configuration)

Explicit specification: the developer explicitly specifies the sizes of the Work Groups
in kernels by commands.
Implicit specification: the developer does not specify the Work Group sizes;
in this case the OpenCL driver will do the segmentation.

Example for the specification of the global work size and the explicit specification of
Work Groups [96]
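
The example of [96] is not reproduced here; a minimal host-side sketch of such a specification
(OpenCL 1.x, illustrative names, error handling omitted) could look like:

   /* 2-dimensional NDRange of 1024 x 1024 Work items,
      explicitly segmented into 16 x 16 sized Work Groups */
   size_t global_size[2] = {1024, 1024};
   size_t local_size[2]  = {16, 16};

   clEnqueueNDRangeKernel(queue, kernel,
                          2,            /* dimension of the NDRange            */
                          NULL,         /* no global work offset               */
                          global_size,  /* global work size                    */
                          local_size,   /* explicit Work Group size;
                                           passing NULL here would leave the
                                           segmentation to the OpenCL driver   */
                          0, NULL, NULL);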

© Sima Dezső, ÓE NIK 824 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (18)

Example for a simple kernel written for adding two vectors [144]
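
The kernel of [144] is not reproduced here; a minimal sketch of such a vector addition kernel
in OpenCL C could look like:

   __kernel void vec_add(__global const float *a,
                         __global const float *b,
                         __global float *c)
   {
       size_t i = get_global_id(0);    /* one Work item per vector element */
       c[i] = a[i] + b[i];
   }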

© Sima Dezső, ÓE NIK 825 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (19)

Remarks

1) Compilation of the kernel code delivers both the GPU code and the CPU code.

Figure 5.2.12: Compilation of the kernel code [144]

© Sima Dezső, ÓE NIK 826 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (20)

Remarks (cont)

2) Kernels are compiled at the ISA level into a series of clauses.

Clauses
• Groups of instructions of the same clause type that will be executed without preemption,
like ALU instructions, texture instructions etc.
Example [100]
The subsequent example relates to a previous generation device (HD 58xx) that
includes 5 Execution Units, designated as the x, y, z, w and t unit.
By contrast the discussed HD 69xx devices (belonging to the Cayman family) provide
only 4 ALUs (x, y, z, w).

© Sima Dezső, ÓE NIK 827 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (21)

Figure 5.2.13: Examples for ALU and Tex clauses [100]


© Sima Dezső, ÓE NIK 828 www.tankonyvtar.hu
5.2.3 AMD’s high level data and task parallel programming model (22)

c4) The concept of task graphs as a means to utilize task parallelism


• The developer can specify a task graph associated with the command queues in form of
a list of events that must occur before a particular kernel can be executed by the GPGPU.
• Events will be generated by completing the execution of kernels or read, write or copy
commands.
Example task graph [109]

• In the task graph arrows indicate dependencies. E.g. the kernel A is allowed to be executed
only after Write A and Write B have finished etc.
• The OpenCL runtime has the freedom to execute tasks given in the task graph in parallel
as long as the specified dependencies are fulfilled.
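
A possible host-side sketch of expressing such a dependency chain with OpenCL events
(illustrative names; the queue, buffers and kernel are assumed to exist already):

   cl_event wrA, wrB, evA;

   /* Write A and Write B (non-blocking), each signalling an event on completion */
   clEnqueueWriteBuffer(queue, bufA, CL_FALSE, 0, size, hostA, 0, NULL, &wrA);
   clEnqueueWriteBuffer(queue, bufB, CL_FALSE, 0, size, hostB, 0, NULL, &wrB);

   /* Kernel A may only be executed after both writes have completed */
   cl_event deps[2] = {wrA, wrB};
   clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global_size, NULL,
                          2, deps, &evA);

   /* A read that in turn depends on kernel A */
   clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, size, hostC, 1, &evA, NULL);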
© Sima Dezső, ÓE NIK 829 www.tankonyvtar.hu
5.2.3 AMD’s high level data and task parallel programming model (23)

c5) The scheme of allocation Work items and Work Groups to execution resources
of the platform model

As part of the execution model


• Work items will be executed by the Processing Elements of the platform
(e.g. VLIW5 or VLIW4 ALUs in case of AMD’s GPGPUs).
• Each Work Group will be assigned for execution to a particular Compute Unit (SIMD core).

© Sima Dezső, ÓE NIK 830 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (24)

c6) The data sharing concept [145]


It results from the platform memory model and the Work item/Work Group allocation model.
Local data sharing
All Work items in a Work Group can access the Local Memory, which allows data sharing
among the Work items of a Work Group that is allocated to a Compute Unit (SIMD core).
Global data sharing
All Work items in all Work Groups can access the Global Memory, which allows data sharing
among Work items running on the same Compute Device (e.g. GPGPU card).

E.g. the Cayman core has 32 KB Local Memories (per SIMD core) and a Global Memory of 64 KB [99].
© Sima Dezső, ÓE NIK 831 www.tankonyvtar.hu
5.2.3 AMD’s high level data and task parallel programming model (25)

c7) The synchronization concept

Synchronization mechanisms of OpenCL

Synchronization during task parallel execution:
  • Task graph

Synchronization during data parallel (kernel) execution:
  • Synchronization of Work items within a Work Group:
      • Barrier synchronization
      • Memory fences
  • Synchronization of Work items being in different Work Groups:
      • Atomic memory transactions

© Sima Dezső, ÓE NIK 832 www.tankonyvtar.hu


5.2.3 AMD’s high level data and task parallel programming model (26)

Synchronization mechanisms [109]

Task graphs
Discussed already in connection with parallel task execution.

Barrier synchronization
• Allows synchronization of Work items within a Work Group.
• Each Work item in a Work Group must first execute the barrier before execution is allowed
to proceed (see the SIMT execution model, discussed in Section 2).
Memory fences
let synchronize memory operations (load/store sequences).
Atomic memory transactions
• Allow synchronization of Work items being in the same or in different Work Groups.
• Work items may e.g. append variable numbers of results to a shared queue
in global memory to coordinate execution of Work items in different Work Groups.
• Atomic memory transactions are OpenCL extensions supported by some OpenCL runtimes,
such as the ATI Stream SDK OpenCL runtime for x86 processors.
• The use of atomic memory transactions needs special care to avoid deadlocks and allow
scalability.
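
A minimal kernel sketch illustrating barrier synchronization of the Work items within a
Work Group (standard OpenCL C; the kernel itself is only illustrative):

   __kernel void reverse_in_group(__global float *data, __local float *tile)
   {
       size_t lid = get_local_id(0);
       size_t gid = get_global_id(0);
       size_t lsz = get_local_size(0);

       tile[lid] = data[gid];            /* each Work item writes one element      */
       barrier(CLK_LOCAL_MEM_FENCE);     /* all Work items of the Work Group must
                                            reach this point before any of them
                                            may read the shared tile               */
       data[gid] = tile[lsz - 1 - lid];  /* read an element written by another
                                            Work item of the same Work Group       */
   }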

© Sima Dezső, ÓE NIK 833 www.tankonyvtar.hu


5.2.4 Simplified block diagram of the
Cayman core

© Sima Dezső, ÓE NIK 834 www.tankonyvtar.hu


5.2.4 Simplified block diagram of the Cayman core (1)

Simplified block diagram of the Cayman core (used in the HD 69xx series) [97]

L1T/L2T: Texture caches


(Read-only caches)

© Sima Dezső, ÓE NIK 835 www.tankonyvtar.hu


5.2.4 Simplified block diagram of the Cayman core (2)

Comparing the block diagrams of the Cypress (HD 58xx) and the Cayman (HD 69xx)
cores [97]

© Sima Dezső, ÓE NIK 836 www.tankonyvtar.hu


5.2.4 Simplified block diagram of the Cayman core (3)

Comparing the block diagrams of the Cayman (HD 69xx) and the Fermi cores [97]

GF110

Note
Fermi has read/write L1/L2 caches like CPUs.

© Sima Dezső, ÓE NIK 837 www.tankonyvtar.hu


5.2.4 Simplified block diagram of the Cayman core (4)

Simplified block diagram


of the Cayman XT core [98]
(Used in the HD 6970)

2x12 SIMD cores

16 VLIW4 ALUs/core

4 EUs/VLIW4 ALU

1536 EUs

838 www.tankonyvtar.hu
5.2.4 Simplified block diagram of the Cayman core (5)

Simplified block diagram of the Cayman core (that underlies the HD 69xx series) [99]

© Sima Dezső, ÓE NIK 839 www.tankonyvtar.hu


5.2.5 Principle of operation of the Command
Processor

© Sima Dezső, ÓE NIK 840 www.tankonyvtar.hu


5.2.5 Principle of operation of the Command Processor (1)

Principle of operation of the Command Processor-1 [99]

Basic architecture of Cayman (that underlies both the HD 6950/70 GPGPUs)

The developer writes a set of commands that control the execution of the GPGPU program.
These commands
• configure the HD 69xx device (not detailed here),
• specify the data domain on which the HD 69xx device has to operate,
• command the HD 69xx device to copy programs and data between system memory and
device memory,
• cause the HD 69xx device to begin the execution of a GPGPU program (OpenCL program).
The host writes the commands to the memory mapped HD 69xx registers in the
system memory space.
© Sima Dezső, ÓE NIK 841 www.tankonyvtar.hu
5.2.5 Principle of operation of the Command Processor (2)

Principle of operation of the Command Processor-2 [99]

Basic architecture of Cayman (that underlies both the HD 6950/70 GPGPUs)

The Command Processor reads the commands that the host has written to the
memory mapped HD 69xx registers in the system-memory address space,
copies them into the Device Memory of the HD 6900 and launches their execution.
© Sima Dezső, ÓE NIK 842 www.tankonyvtar.hu
5.2.5 Principle of operation of the Command Processor (3)

Principle of operation of the Command Processor-3 [99]

Basic architecture of Cayman (that underlies both the HD 6950/70 GPGPUs)

The Command Processor sends a hardware-generated interrupt to the host when
the commands are completed.
843                                                                           www.tankonyvtar.hu
5.2.6 The Data Parallel Processor Array

© Sima Dezső, ÓE NIK 844 www.tankonyvtar.hu


5.2.6 The Data Parallel Processor Array (1)

The Data Parallel Processor Array (DPP) in the Cayman core [99]

Basic architecture of Cayman (that underlies both the HD 6950/70 GPGPUs)

© Sima Dezső, ÓE NIK 845 www.tankonyvtar.hu


5.2.6 The Data Parallel Processor Array (2)

The Data Parallel Processor Array (DPP)


in more detail [93]

The DPP Array


• The DPP Array is the “heart” of the GPGPU.
• It consists of 24 SIMD cores (designated also as Compute Units).

SIMD cores
• Operate independently from each other.
• Each of them includes 16 4-wide VLIW ALUs, termed VLIW4 ALUs.

Remark
The SIMD cores are also designated as
• Data Parallel Processors (DPP) or
• Compute Units.

(Source: OpenCL Programming Guide 1.2, Jan. 2011)
© Sima Dezső, ÓE NIK 846 www.tankonyvtar.hu
5.2.6 The Data Parallel Processor Array (3)

Operation of the SIMD cores


• Each SIMD core consists of 16 VLIW4 ALUs
• All 16 VLIW4 ALUs of a SIMD core operate in lock step (in parallel)

SIMD core

Figure 5.2.14: Operation of a SIMD core of an HD 5870 [100]

Remark
Both the Cypress-based HD 5870 and the Cayman-based HD 6970 have the same basic structure

© Sima Dezső, ÓE NIK 847 www.tankonyvtar.hu


5.2.6 The Data Parallel Processor Array (4)

The four-wide VLIW ALUs of the Cayman core [98]


(Designated as Stream Processing Units in the Figure below)

FP capability:
• 4x FP32 FMA
• 1x FP64 FMA
per clock.

© Sima Dezső, ÓE NIK 848 www.tankonyvtar.hu


5.2.6 The Data Parallel Processor Array (5)

Operation of Cayman’s VLIW4 ALUs [97]

Throughput and latency of the pipelined VLIW4 ALUs

• The VLIW4 ALUs are pipelined, having a throughput of 1 instruction per shader cycle for the
basic operations, i.e. they accept a new instruction every new cycle for the basic operations.
• ALU operations have a latency of 8 cycles, i.e. they require 8 cycles to be performed.
• The first 4 cycles are needed to read the operands from the register file,
one quarter wavefront at a time.
• The next four cycles are needed to execute the requested operation.

Interleaved mode of execution


To hide the 8-cycle latency two wavefronts (an even and an odd one) are executed in an
interleaved fashion.
While one wavefront accesses the register file, the other one executes.
This is conceptually similar to fine-grained multi-threading, where the two threads
switch every 4 cycles but do not execute simultaneously.
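
A possible way to picture this interleaving (schematic only; the cycle numbers are illustrative):

   Cycles:           0-3             4-7             8-11            12-15
   Even wavefront:   read operands   execute         read operands   execute
   Odd wavefront:    execute         read operands   execute         read operands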

© Sima Dezső, ÓE NIK 849 www.tankonyvtar.hu


5.2.6 The Data Parallel Processor Array (6)

Contrasting AMD’s VLIW4 issue in the Cayman core (HD 69xx) with Nvidia’s scalar issue
(Based on [88])

AMD/ATI:
• VLIW issue
• Static dependency resolution performed by the compiler
  Requires a sophisticated compiler to properly fill the instruction slots

Nvidia:
• Scalar issue
• Dynamic dependency resolution by a scoreboarded warp scheduler
  Requires a sophisticated hardware scheduler

Remark
Nvidia’s Fermi GF104 already introduced 2-way superscalar issue (7/2010)

© Sima Dezső, ÓE NIK 850 www.tankonyvtar.hu


5.2.6 The Data Parallel Processor Array (7)

Peak FP32/64 performance of Cayman based GPGPU cards

a) Peak FP32 performance (PFP32) of a Cayman-based GPGPU card


• n SIMD cores
• 16 SIMD VLIW4 ALUs per SIMD core
• 4 Execution Units (EUs) per SIMD VLIW4 ALU
• 1 FP32 FMA operation per EU
• 2 operations/FMA
• at a shader frequency of fs

PFP32 = n x 16 x 4 x 1 x 2 x fs

PFP32 = n x 128 x fs FP32 operations/s

E.g. in case of the HD 6970


• n = 24
• fs = 880 MHz
PFP32 = 24 x 128 x 880 = 2703 GFLOPS

© Sima Dezső, ÓE NIK 851 www.tankonyvtar.hu


5.2.6 The Data Parallel Processor Array (8)

b) Peak FP64 performance (PFP64) of a Cayman-based GPGPU card


• n SIMD cores
• 16 VLIW4 ALUs per SIMD core
• 4 Execution Units (EUs) per VLIW4 ALU
• 1 FP64 FMA operation per 4 EUs
• 2 operations/FMA
• at a shader frequency of fs
PFP64 = n x 16 x 4 x 1/4 x 2 x fs

PFP64 = n x 32 x fs FP64 operations/s = ¼ PFP32

E.g. in case of the HD 6970


• n = 24
• fs = 880 MHz
PFP64 = 24 x 32 x 880 = 675.84 GFLOPS

© Sima Dezső, ÓE NIK 852 www.tankonyvtar.hu


5.2.7 The memory architecture of the
Cayman core

© Sima Dezső, ÓE NIK 853 www.tankonyvtar.hu


5.2.7 The memory architecture (1)

Overview of the memory architecture of the Cayman core (simplified)

(= SIMD core)
Private memory: 16 K GPRs/SIMD core
4 x 32-bit/GPR

LDS: 32 KB/SIMD core


L1: 8 KB/SIMD core (Read only)

L2: 512 KB/GPGPU (Read only)

Global Memory: 64 KB per GPGPU

Figure 5.2.15: Memory architecture of the Cayman core [99]

© Sima Dezső, ÓE NIK 854 www.tankonyvtar.hu


5.2.7 The memory architecture (2)

Main features of
the memory spaces
available in the
Cayman core [99]

© Sima Dezső, ÓE NIK 855


5.2.7 The memory architecture (3)

Compliance of the Cayman core (HD 6900) memory hierarchy with the OpenCL
memory model

HD 6900 memory architecture [99] OpenCL memory architecture [94]

(= SIMD core)

© Sima Dezső, ÓE NIK 856 www.tankonyvtar.hu


5.2.7 The memory architecture (4)

Memory architecture of OpenCL (more detailed) [93]

The Global memory can be either


in the high-speed GPU memory
(VRAM) or in the host memory,
which is accessed by the
PCIe bus [99].

857 www.tankonyvtar.hu
5.2.7 The memory architecture (5)

The Private memory

• It is implemented as a number of General Purpose Registers (GPRs)


available per SIMD core, e.g. 16 K GPRs for the Cayman core.
• Each GPR includes a set of four 32-bit registers,
to provide register space for three 32-bit input data and a single
32-bit output data.
• As the Cayman core includes 16 VLIW4 ALUs, there are 1 K GPRs
available for each VLIW4 ALU.
• As each VLIW4 ALU incorporates 4 Execution Units
(termed as Processing Element in the Figure), in the Cayman core
there are 1 K/4 = 256 GPRs available for each Execution Unit.

(EU)

© Sima Dezső, ÓE NIK 858 www.tankonyvtar.hu


5.2.7 The memory architecture (6)

The Private memory-2


• All GPRs available for a particular Execution Unit are shared among all Work Groups
allocated to that SIMD core.
• The driver allocates the required number of GPRs for each kernel, this set of GPRs remains
persistent across all Work items during the execution of the kernel.
• The available number of GPRs per Execution Unit (256 GPRs for Cayman) and the requested
number of GPRs for a given kernel, determine the maximum number of active wavefronts
on a SIMD core running the considered kernel, as shown below.

Table 5.2.2: Max. no. of wavefronts per SIMD core vs. no. of GPRs requested per wavefront [93]
859
5.2.7 The memory architecture (7)

Example

If a kernel needs 30 GPRs (30 x 4 x 32-bit registers), up to 8 active wavefronts can run on
each SIMD core.
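
This can be read as a simple budget calculation, assuming that the 256 GPRs available per
Execution Unit are divided among the Work items of the resident wavefronts (one Work item of
each resident wavefront per Execution Unit):

   256 GPRs per EU / 30 GPRs per Work item = 8.5  ->  at most 8 wavefronts of this kernel
                                                      can be active on a SIMD core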

Remarks [93], [94]


• Each work item has access to up to 127 GPRs minus two times the number of
Clause Temporary GPRs.
By default 2x2 GPRs are reserved as Clause Temporary Registers, each two for the even and
the odd instruction slots, used for all instructions of an ALU clause.
• Each SIMD core is limited to 32 active wavefronts.
• The Private Memory has a bandwidth of 48 B/cycle
(3 source registers x 32-bit x 4 Execution Units).

© Sima Dezső, ÓE NIK 860 www.tankonyvtar.hu


5.2.7 The memory architecture (8)

Data sharing in Cayman-1

There are two options for data sharing


• 32 KB Local Data Share (LDS) and
• 64 KB Global Data Share (GDS)
both introduced in the RV770 [36].
The 32 KB Local Data Share (LDS)
[93], [97]

• Used for data sharing among


wavefronts within a Work Group
(running on the same SIMD core).
• Exposed through the OpenCL and
DirectCompute specifications.
• Each VLIW4 ALU can load two
32-bit values from the LDS/cycle.
• The LDS size is allocated
on a per Work Group basis.
Each Work Group specifies
how much LDS it requests.
The Ultra-Threaded Dispatch Proc.
uses this information to determine
which Work Groups can share
the same SIMD core.
Figure 5.2.16: Local data sharing in the Cayman core [99]
© Sima Dezső, ÓE NIK 861 www.tankonyvtar.hu
5.2.7 The memory architecture (9)

Data sharing in Cayman-2


The 64 KB Global Data Share (GDS)

• Introduced in the RV770 [36]
(at a size of 16 KB),
but not documented in the
related ISA Reference Guide [101],
only in the later (Evergreen) ISA and
Microcode Reference Guide [112].
• Provides data sharing across
an entire kernel.
• It is not exposed in the OpenCL
or DirectCompute specifications,
must be accessed through
vendor specific extensions.

Figure 5.2.17: Global data sharing in the


Cayman core [99]

© Sima Dezső, ÓE NIK 862 www.tankonyvtar.hu


5.2.7 The memory architecture (10)

The cache system of the Cayman core [97]

• The Cayman core has 8KB L1 Texture cache


(Read only cache) per SIMD core and
• a 512 KB L2 Texture cache (Read only cache)
shared by all SIMD cores.

L1T/L2T: Texture caches


(Read-only caches)

© Sima Dezső, ÓE NIK 863 www.tankonyvtar.hu


5.2.7 The memory architecture (11)

Comparing the Cayman’s (HD 69xx) and Fermi GF110’s (GTX 580) cache system [97]

GF110

Notes
1) Fermi has read/write L1/L2 caches like CPUs, whereas the Cayman core has read-only caches.
2) Fermi has optionally 16 KB L1 cache and 48 KB shared buffer or vice versa per SIMD core.
3) Fermi has larger L1/L2 caches.

© Sima Dezső, ÓE NIK 864 www.tankonyvtar.hu


5.2.7 The memory architecture (12)

Memory controller [88]

The Cayman core has dual


bidirectional DMA Engines,
for faster system memory
reads and writes,
whereas previous families
provided only one.

© Sima Dezső, ÓE NIK 865 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor

© Sima Dezső, ÓE NIK 866 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (1)

The Ultra-Threaded Dispatch Processor-1 [99]

The Ultra-Threaded Dispatch Processor

© Sima Dezső, ÓE NIK 867 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (2)

The Ultra-Threaded Dispatch Processors-2


In contrast to the previous figure (from [99]), Cayman has actually dual Ultra-Threaded Dispatch
processors, as shown below.

Each Ultra-Threaded Dispatch


Processor manages half of the
SIMD cores i.e. 12 SIMD cores
(called SIMD Engines
in the Figures).

This layout is roughly a dual


(RV770-based) HD 4870 design,
since in the HD 4870
a single Ultra-Threaded Dispatch
Processor cares for
10 SIMD cores.

Figure 5.2.18: Block diagram


of Cayman XT [98]
(Used in the HD 6970)
868 www.tankonyvtar.hu
5.2.8 The Ultra-Threaded Dispatch Processor (3)

Block diagram of AMD’s (RV770-based) 4870 [36]

© Sima Dezső, ÓE NIK 869 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (4)

The Ultra-Threaded Dispatch Processors-3


The main task of the Ultra-Threaded Dispatch Processors is work scheduling

Work scheduling consists of


• Assigning kernels and their Work Groups to SIMD cores
• Scheduling Work items of the Work Groups for execution

© Sima Dezső, ÓE NIK 870 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (5)

Assigning Work Groups to SIMD cores -1

(Figure: two kernels, Kernel 1 (NDRange1) and Kernel 2 (NDRange2), each segmented into
2x2 Work Groups (0,0), (0,1), (1,0), (1,1); the question illustrated is how these Work Groups
get assigned to the SIMD cores of the DPP Array.
Source: OpenCL Programming Guide 1.2, Jan. 2011)

© Sima Dezső, ÓE NIK 871 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (6)

Assigning Work Groups to SIMD cores -2

• The Ultra-Threaded Dispatch Processor allocates each Work Group of a kernel for execution
to a particular SIMD core.
• Up to 8 Work Groups of the same kernel can share the same SIMD core, if there are
enough resources to fit them in.
• Multiple kernels can run in parallel at different SIMD cores, in contrast to previous designs
when only a single kernel was allowed to be active and their Work Groups were spread
over the available SIMD cores.
• Hardware barrier support is provided for up to 8 Work Groups per SIMD core.
• Work items (threads) of a Work Group may communicate to each other through a
local shared memory (LDS) provided by the SIMD core.
This feature has been available since the RV770-based HD 4xxx GPGPUs.

© Sima Dezső, ÓE NIK 872 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (7)

Data sharing and synchronization of Work items within a Work Group


Work items in a Work Group may
• share a local memory, not visible by Work items belonging to different Work Groups.
• synchronize with each other by means of barriers or memory fence operations.
Work items in different Work Groups cannot synchronize with each other.

Remark
Barriers allow to synchronize Work item (thread) execution whereas memory fences
let synchronize memory operations (load/store sequences).

© Sima Dezső, ÓE NIK 873 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (8)

Scheduling Work items (threads) within Work Groups for execution

Work items are scheduled for execution in groups, called wavefronts.


Wavefronts
• Collection of Work items (threads) scheduled for execution as an entity.
• Work items of a wavefront will be executed in parallel on a SIMD core.

SIMD core

Figure 5.2.19: Structure of a SIMD core in the HD 5870 [100]

Remark
Cayman’s SIMD cores have the same structure as the HD 5870

© Sima Dezső, ÓE NIK 874 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (9)

Wavefront size

SIMD core

• In Cayman each SIMD core has 16 VLIW4 (in earlier models VLIW5) ALUs, similarly as
the SIMD cores in the HD 5870, shown in the Figure [100].
• Both VLIW4 and VLIW5 ALUs provide 4 identical Execution Units, so
each group of 4 Work items, collectively called a quad, is processed on the same VLIW ALU.
The wavefront is composed of quads.
The number of quads is identical to the number of VLIW ALUs.

The wavefront size = No. of VLIW4 ALUs x 4

• High performance GPGPU cards provide typically 16 VLIW ALUs,


so for these cards the wavefront size is 64.
• Lower performance cards may have 8 or even 4 VLIW ALUs,
resulting in wavefront sizes of 32 and 16 resp.
© Sima Dezső, ÓE NIK 875 www.tankonyvtar.hu
5.2.8 The Ultra-Threaded Dispatch Processor (10)

Building wavefronts
• The Ultra-Threaded Dispatch Processor segments Work Groups into wavefronts,
and schedules them for execution on a single SIMD core.
• This segmentation is also called rasterization.

© Sima Dezső, ÓE NIK 876 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (11)

Example: Segmentation of a 16 x 16 sized Work Group into wavefronts of the size 8x8
and mapping them to SIMD cores [92]
Work Group
• One 8x8 block maps to one wavefront and is executed on one SIMD core.
• Another 8x8 block maps to another wavefront and is executed on another SIMD core.
• Each quad executes in the same VLIW ALU.
• A wavefront consists of 16 quads.
• All VLIW ALUs in a SIMD core execute the same instruction sequence,
  different SIMD cores may execute different instructions.
© Sima Dezső, ÓE NIK 877 www.tankonyvtar.hu
5.2.8 The Ultra-Threaded Dispatch Processor (12)

Scheduling wavefronts for execution


This is task of the Ultra-Threaded Dispatch Processors [97]

• AMD’s GPGPUs have dual Ultra-Threaded Dispatch Processors, each responsible for
one issue slot of two available.
• The main task of the Ultra-Threaded Dispatch Processors is to assign Work Groups of
currently running kernels to the SIMD cores and schedule wavefronts for execution.
• Each Ultra-Threaded Dispatch processor has a dispatch pool of 248 wavefronts.
• Each Ultra-Threaded Dispatch processor selects two wavefronts for execution for each
SIMD core and dispatches them to the SIMD cores.
• The selected two wavefronts will be executed interleaved.

© Sima Dezső, ÓE NIK 878 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (13)

Scheduling policy of wavefronts


• AMD did not detail the scheduling policy of wavefronts.
• Based on published figures [98] it can be assumed that wavefronts are scheduled
coarse grained.
• Wavefronts can be switched only on clause boundaries.

© Sima Dezső, ÓE NIK 879 www.tankonyvtar.hu


5.2.8 The Ultra-Threaded Dispatch Processor (14)

Example for scheduling wavefronts-1 [93]


• At run time, Work item T0 executes until a stall occurs, e.g. due to a memory fetch request
at cycle 20.
• The Ultra-Threaded Dispatch Scheduler selects then Work item T1 (based e.g. on age).
• Work item T1 runs until it stalls, etc.

Figure 5.2.20: Simplified execution of Work items on a SIMD core with hiding memory stalls [93]
© Sima Dezső, ÓE NIK 880 www.tankonyvtar.hu
5.2.8 The Ultra-Threaded Dispatch Processor (15)

Example for scheduling wavefronts-2 [93]

• When data requested from memory arrive (that is T0 becomes ready) the scheduler
will select T0 for execution again.
• If there are enough work-items ready for execution memory stall times can be hidden,
as shown in the Figure below.

Figure 5.2.21: Simplified execution of Work items on a SIMD core with hiding memory stalls [93]
© Sima Dezső, ÓE NIK 881 www.tankonyvtar.hu
5.2.8 The Ultra-Threaded Dispatch Processor (16)

Example for scheduling wavefronts-3 [93]


• If during scheduling no runable Work items are available, scheduling stalls and the scheduler
waits until one of the Work items becomes ready.
• In the example below Work item T0 is the first returning from the stall and will continue
execution.

Figure 5.2.22: Simplified execution of Work items on a SIMD core without hiding memory stalls [93]
© Sima Dezső, ÓE NIK 882 www.tankonyvtar.hu
5.2.9 Evolution of key features of AMD’s
GPGPU microarchitectures

© Sima Dezső, ÓE NIK 883 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (1)

Evolution of key features of AMD’s GPGPU microarchitectures


(Key features: important qualitative features rather than quantitative features)

a) Changing the underlying programming environment from Brook+ to OpenCL (2009)


b) Introduction of LDS, GDS (2008)
c) Introduction of 2-level segmentation in breaking down the domain of execution/
NDRange into wavefronts (2009)
d) Allowing multiple kernels to run in parallel on different SIMD cores (2010)
e) Introducing FP64 capability and replacing VLIW5 ALUs with VLIW4 ALUs (2007/2010)

© Sima Dezső, ÓE NIK 884 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (2)

a) Changing the underlying programming environment from Brook+ to OpenCL


Starting with their RV870 (Cypress)-based HD 5xxx line and SDK v.2.0 AMD left Brook+
and began supporting OpenCL in 2009.

AMD/ATI                 9/09                      10/10                     12/10

Cores                   RV870 (Cypress)           Barts Pro/XT              Cayman Pro/XT
                        40 nm/2100 mtrs           40 nm/1700 mtrs           40 nm/2640 mtrs

Cards                   HD 5850/70                HD 6850/70                HD 6950/70
                        1440/1600 ALUs            960/1120 ALUs             1408/1536 ALUs
                        256-bit                   256-bit                   256-bit

OpenCL                  11/09: OpenCL 1.0         03/10: OpenCL 1.0         08/10: OpenCL 1.1
                        (SDK V.2.0)               (SDK V.2.01)              (SDK V.2.2)

Brook+                  3/09: Brook+ 1.4          (SDK V.2.01)

RapidMind               8/09: Intel bought RapidMind

Presumably in anticipation of this move AMD modified their IL, enhanced the
microarchitecture of their GPGPU devices with LDS and GDS (as discussed next),
and also changed their terminology, as given in Table 5.2.1.

© Sima Dezső, ÓE NIK 885 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (3)

b) Introduction of LDS, GDS

Interrelationship between the programming environment of a GPGPU and its microarchitecture

The microarchitecture of general purpose CPUs is obviously language independent,
as it has to serve an entire class of HLLs, called the imperative languages.
By contrast, GPGPUs are typically designed with a dedicated language (such as CUDA or
Brook+) in mind.
There is a close interrelationship between the programming environment and the
microarchitecture of a GPGPU, as will be shown subsequently.

© Sima Dezső, ÓE NIK 886 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (4)

The memory concept of Brook+ [92]


Brook+ defined three memory domains (beyond the register space of the ALUs):
• Host (CPU memory or system memory) memory
• PCIe memory (this is a section of the host memory that can be accessed both by the
host and the GPGPU)
• Local (GPGPU) memory.

The memory concept of Brook+ was decisive for the memory architecture of AMD’s first
R600-based GPGPUs, as shown next.

© Sima Dezső, ÓE NIK 887 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (5)

The microarchitecture of R600 family processors [7]

The PCIe memory is part of the System memory that is accessible by both the Host and the GPGPU.

The memory architecture of the R600 reflects the memory concept of Brook+.
© Sima Dezső, ÓE NIK 888 www.tankonyvtar.hu
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (6)

The memory concept of OpenCL

By contrast, the memory model of OpenCL includes also Local and Global memory spaces
in order to allow data sharing among Work items running on the same SIMD core or even
on the GPGPU.

Figure 5.2.23: OpenCL’s memory concept [94]

© Sima Dezső, ÓE NIK 889 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (7)

The microarchitecture of the


Cayman core-1

Two options for data sharing


• 32 KB Local Data Share (LDS)
• 64 KB Global Data Share (GDS)

Both the LDS and the GDS were


introduced in the RV770-based
HD 4xxx series of GPGPUs [36].

Figure 5.2.24: Data sharing in the


Cayman core [99]

© Sima Dezső, ÓE NIK 890 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (8)

Introduction of LDSs and a GDS in the RV770-based HD 4xxx line

Presumably, in anticipation of supporting OpenCL AMD introduced both


• 16 KB Local Data Share (LDS) memories per SIMD core and
• a 16 KB Global Data Share (GDS) memory to be shared for all SIMD cores,
in their RV770-based HD 4xxx lines of GPGPUs (2008), as indicated below.

Figure 5.2.25: Introduction of LDSs and a GDS in RV770-based HD 4xxx GPGPUs [36]

© Sima Dezső, ÓE NIK 891 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (9)

Data sharing in the Cayman core-2


The Local Data Share (LDS)
• Introduced in the RV770-based
4xxx lines [36]
• The LDS has a size of 32 KB
per SIMD core.
• Used for data sharing within a
wavefront or workgroup
(running on the same SIMD core).
• Exposed through the OpenCL and
DirectCompute specifications.
• Each EU can have one read and
one write access (32-bit values)
per clock cycle from/to the LDS.

Figure 5.2.26: Local Data sharing in the


Cayman core [99]

© Sima Dezső, ÓE NIK 892 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (10)

Data sharing in the Cayman core-3


The Global Data Share (GDS)

• Introduced also in the RV770-based


HD 4xxx line, as shown before [36]
(at a size of 16 KB/GPGPU).
• It was however not documented
in the related ISA Reference Guide.
[101].

Figure 5.2.27: Global Data sharing in the


Cayman core [99]

© Sima Dezső, ÓE NIK 893 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (11)

The microarchitecture of the RV770 (HD 4xxx) family of GPGPUs [101]

Remark: AMD designates the RV770 core internally as the R700.
DPP: Data Parallel Processor
© Sima Dezső, ÓE NIK 894 www.tankonyvtar.hu
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (12)

The GDS memory became visible only in the ISA documentation of the Evergreen (HD 5xxx )
[112] and the Northern Island (HD 6xxx) families of GPGPUs [99].

Figure 5.2.28: Basic architecture of Cayman (that underlies both the HD 6950 and 6970 GPGPUs) [99]
© Sima Dezső, ÓE NIK 895 www.tankonyvtar.hu
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (13)

Data sharing in the Cayman core-4

The Global Data Share (GDS)

• Provides data sharing across


an entire kernel.
• It is not exposed in the OpenCL
or DirectCompute specifications,
must be accessed through
vendor specific extensions.

Figure 5.2.29: Global Data sharing in the


Cayman core [99]

© Sima Dezső, ÓE NIK 896 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (14)

Sizes of the introduced LDS, GDS memories


Both the sizes of LDS and GDS became enlarged in subsequent lines, as indicated below.

                  R600-based        RV670-based      RV770-based     RV870-based     Cayman-based
                  HD 2900 XT line   HD 3xxx line     HD 4xxx line    HD 5xxx line    HD 69xx line

LDS/SIMD core     -                 -                16 KB           32 KB           32 KB

GDS/GPGPU         -                 -                16 KB           64 KB           64 KB

© Sima Dezső, ÓE NIK 897 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (15)

c) Introduction of 2-level segmentation in breaking down the domain of execution/


NDRange into wavefronts (2009)

Segmentation of the domains of execution (now NDRanges) into wavefronts:


in Pre-OpenCL GPGPUs (those preceding the RV870-based 5xxx line (2009))

Domain of execution

Wavefronts

In Pre-OpenCL GPGPUs the domain of


execution was segmented into wavefronts
in a single level process.

One wavefront

Figure 5.2.30: Segmenting the domain of execution


in Pre-OpenCL GPGPUs of AMD
(Based on [92])

© Sima Dezső, ÓE NIK 898 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (16)

Segmentation of the domain of execution, called NDRange into wavefronts in


OpenCL-based GPGPUs
• OpenCL based GPGPUs were introduced beginning with the RV870-based HD 5xxx line.
• In these systems the segmentation became a 2-level process.

The process of the 2-level segmentation


• First the NDRange (the renamed domain of execution) is broken down
  into Work Groups of the size n x wavefront size,
  • either explicitly by the developer or
  • implicitly by the OpenCL driver,
• then the Ultra-Threaded Dispatch Processor
  • allocates the Work Groups for execution to the SIMD cores,
  • and segments the Work Groups into wavefronts.

After segmentation the Ultra-Threaded Dispatch Processor
schedules the wavefronts for execution in the SIMD cores.

(Figure: an NDRange of extent Global size 0 x Global size 1, broken down into
Work Groups (0,0), (0,1), (1,0), (1,1))
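On the host side this two-level breakdown corresponds to the global and local work sizes passed
to the OpenCL runtime; a hedged C sketch (queue and kernel are assumed to have been created
already, the sizes are illustrative):

    size_t global_size[2] = {1024, 1024};   /* the NDRange (domain of execution) */
    size_t local_size[2]  = {16, 16};       /* one Work Group = 256 work items = 4 wavefronts of 64 */

    /* Work Group size chosen explicitly by the developer: */
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size, 0, NULL, NULL);

    /* Passing NULL instead lets the OpenCL driver choose the Work Group size implicitly: */
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, NULL, 0, NULL, NULL);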

© Sima Dezső, ÓE NIK 899 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (17)

Example for the second step of the segmentation:


Segmentation of a 16 x 16 sized Work Group into 8x8 sized wavefronts [92]

Work Group (16 x 16 work items)

• One 8x8 block maps to one wavefront and is executed on one SIMD core;
  another 8x8 block maps to another wavefront and is executed on the same or another SIMD core.
• Each quad executes in the same VLIW ALU.
• A wavefront consists of 16 quads.
• All VLIW ALUs in a SIMD core execute the same instruction sequence;
  different SIMD cores may execute different instructions.
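As a worked example of the mapping above (using the wavefront size implied by the figure):

    one quad       = 4 work items
    one wavefront  = 16 quads x 4 work items = 64 work items (one 8x8 block)
    one Work Group = 16 x 16 = 256 work items = 256 / 64 = 4 wavefronts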
© Sima Dezső, ÓE NIK 900 www.tankonyvtar.hu
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (18)

d) Allowing multiple kernels to run in parallel on different SIMD cores (2010)


• In GPGPUs preceding Cayman-based systems, only a single kernel was allowed to run
on a GPGPU.
In these systems, the Work Groups constituting the NDRange (domain of execution) were
spread over all available SIMD cores in order to speed up execution.
• In Cayman based systems multiple kernels may run on the same GPGPU, each one
on a single or multiple SIMD cores, allowing a better utilization of the hardware resources.
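A hedged host-side C sketch of what this enables: two independent kernels enqueued to the same
device (queue1/queue2, kernelA/kernelB and the sizes are illustrative assumptions). On a
Cayman-based GPGPU the two kernels may occupy different SIMD cores concurrently, whereas earlier
GPGPUs would execute them one after the other:

    size_t globalA = 64 * 256, localA = 256;   /* NDRange1, broken into Work Groups of 256 */
    size_t globalB = 16 * 256, localB = 256;   /* NDRange2 */

    clEnqueueNDRangeKernel(queue1, kernelA, 1, NULL, &globalA, &localA, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue2, kernelB, 1, NULL, &globalB, &localB, 0, NULL, NULL);
    clFlush(queue1);
    clFlush(queue2);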

© Sima Dezső, ÓE NIK 901 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (19)

Assigning multiple kernels to the SIMD cores

(Figure: Kernel 1 (NDRange1) and Kernel 2 (NDRange2), each broken down into Work Groups
(0,0), (0,1), (1,0), (1,1) along Global size 0/1, are allocated to the DPP Array;
based on the OpenCL Programming Guide 1.2, Jan. 2011)

902 www.tankonyvtar.hu
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (20)

e) Introducing FP64 capability and replacing VLIW5 ALUs with VLIW4 ALUs

Main steps of the evolution

                   2007                       2008-2009                        2010

                   R600 (HD 2900XT)           RV770 (HD 4850/70),              Cayman Pro/XT
                   RV670 (HD 3850/3870)       RV870 aka Cypress Pro/XT         (HD 6950/70)
                                              (HD 5850/70)

                   VLIW-5 SIMD [34]           VLIW-5 SIMD [49]                 VLIW-4 SIMD [98]

                   5xFP32 MAD                 5xFP32 MAD                       4xFP32 FMA
                   - / FP64 MAD               1xFP64 MAD                       1xFP64 FMA
                   (R600 / RV670)

© Sima Dezső, ÓE NIK 903 www.tankonyvtar.hu


5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (21)

Remark
Reasons for replacing VLIW5 ALUs with VLIW4 ALUs [97]
AMD/ATI chose the VLIW-5 ALU design in connection with DX9, as it allowed a
4 component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) to be calculated in parallel.

In gaming applications with DX10/11 shaders, however, the average slot utilization is only about 3.4,
so on average the 5th EU remains unused.
With Cayman AMD redesigned their ALU by
• removing the T-unit and
• enhancing 3 of the new EUs such that these units together became capable of
performing 1 transcendental operation per cycle as well as
• enhancing all 4 EUs to perform together an FP64 operation per cycle.
The new design can compute
• 4 FX32 or 4 FP32 operations or
• 1 FP64 operation or
• 1 transcendental + 1 FX32 or 1 FP32 operation
per cycle, whereas
the previous design was able to calculate
• 5 FX32 or 5 FP32 operations or
• 1 FP64 operation or
• 1 transcendental + 4 FX/FP operations
per cycle.
© Sima Dezső, ÓE NIK 904 www.tankonyvtar.hu
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (22)

Advantages of the VLIW4 design [97]


• By removing the T-unit but enhancing 3 of the EUs to perform transcendental functions
as well as all 4 EUs to perform together an FP64 operation per cycle,
about 10 % less floor space is needed compared with the previous design.
More ALUs can be implemented in the same space.
• The symmetric ALU design largely simplifies the scheduling task for the VLIW compiler.
• FP64 calculations can now be performed at a ¼ rate of FP32 calculations, rather than
at a 1/5 rate as before.
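As a hedged worked example of what the ¼ FP64 rate means (using the published figures of the
HD 6970: 24 SIMD cores, 16 VLIW4 ALUs per SIMD core, 880 MHz engine clock):

    peak FP32 = 24 x 16 x 4 EUs x 2 FLOPs (FMA) x 0.88 GHz ≈ 2.70 TFLOPS
    peak FP64 = peak FP32 / 4                              ≈ 0.68 TFLOPS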

© Sima Dezső, ÓE NIK 905 www.tankonyvtar.hu


Integrated CPUs/GPUs

Dezső Sima

© Sima Dezső, ÓE NIK 906 www.tankonyvtar.hu


Aim

Aim
Brief introduction and overview

General remark
Integrated CPU/GPU designs are not yet advanced enough to be employed as GPGPUs,
but they mark the direction of the evolution.
For this reason their discussion is included in the description of GPGPUs.

© Sima Dezső, ÓE NIK 907 www.tankonyvtar.hu


Contents

6. Integrated CPUs/GPUs

6.1 Introduction to integrated CPUs/GPUs

6.2 The AMD Fusion APU line

6.3 Intel’s in-package


on-die integrated
integrated
CPU/GPU
CPU/GPU
lineslines

6.4 Intel’s on-die integrated CPU/GPU lines

7. References

© Sima Dezső, ÓE NIK 908 www.tankonyvtar.hu


6.1 Introduction to integrated CPU/GPUs

© Sima Dezső, ÓE NIK 909 www.tankonyvtar.hu


6.1 Introduction to integrated CPUs/GPUs (1)

Basic support of graphics

Integrated traditionally into the north bridge

© Sima Dezső, ÓE NIK 910 www.tankonyvtar.hu


6.1 Introduction to integrated CPUs/GPUs (2)

Remarks
• In early PCs, displays were connected to the system bus (first to the ISA then to the PCI bus)
via graphic cards.
• The spread of internet and multimedia apps at the end of the 1990’s called for enhanced
graphics support from the processors.
This led to the emergence of the 3rd generation superscalars that already provided multimedia (MM) and
graphics support (by means of SIMD ISA enhancements).
• More demanding graphics processing, however, drove the evolution of the system
architecture away from the bus-based one to the hub-based one at the end of the 1990’s.
Along with the appearance of the hub-based system architecture the graphics
controller (if provided) became typically integrated into the north bridge (MCH/GMCH),
as shown below.
(Figure: left – PCI architecture: processor, system controller, graphics card on the PCI bus,
peripheral controller, ISA bus; right – hub architecture: processor, MCH with the display attached,
ICH, PCI bus)

Figure 6.1: Emergence of hub based system architectures at the end of the 1990’s
© Sima Dezső, ÓE NIK 911 www.tankonyvtar.hu
6.1 Introduction to integrated CPUs/GPUs (3)

Example
Integrated graphics controllers first appeared
in Intel chipsets with the 810 chipset
in 1999 [113].

Note
The 810 chipset does not provide an AGP
connection.
© Sima Dezső, ÓE NIK 912 www.tankonyvtar.hu
6.1 Introduction to integrated CPUs/GPUs (4)

Subsequent chipsets (e.g. those developed for P4 processors around 2000) then provided
both an integrated graphics controller intended for connecting the display and,
in addition, a Host-to-AGP Bridge catering for an AGP-bus output in order to achieve high quality
graphics for gaming apps by using a graphics card.


Figure 6.2: Conceptual block diagram of the north bridge of Intel’s 845G chipset [129]

© Sima Dezső, ÓE NIK 913 www.tankonyvtar.hu


6.1 Introduction to integrated CPUs/GPUs (5)

Integration trends of graphics controllers and HPC accelerators

Basic graphics support
  Integrated traditionally into the north bridge
  Integration to the CPU chip: now (Intel, AMD)

Professional graphics support
  Given by high performance graphics cards
  Integration to the CPU chip: just started, with modest performance (Intel, AMD)

HPC support
  Given by
  • high performance GPGPUs or
  • Data accelerators
  Integration to the CPU chip: just started, with modest performance (AMD)

© Sima Dezső, ÓE NIK 914 www.tankonyvtar.hu


6.2 The AMD Fusion APU line
(Accelerated Processing Unit)

© Sima Dezső, ÓE NIK 915 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (1)

Fusion line
Line of processors with on-die integrated CPU and GPU units, designated as APUs
(Accelerated Processing Units)
• Introduced in connection with the AMD/ATI merger in 10/2006.
• Originally planned to ship late 2008 or early 2009

• Actually shipped
• 11/2010: for OEMs
• 01/2011: for retail

First implementations for


• laptops, called the Zacate line, and
• netbooks, called the Ontario line.

© Sima Dezső, ÓE NIK 916 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (2)

Benefits of using APUs instead of using graphics processing units integrated


into the NB, called IGPs [115]

UNB: Unbuffered North Bridge
UVD: Universal Video Decoder
SB: South Bridge
IGP: Integrated Graphics Processing unit


© Sima Dezső, ÓE NIK 917 www.tankonyvtar.hu
6.2 The AMD Fusion APU line (3)

Roadmaps of introducing APUs (published in 11/2010)


(Based on [115], [116])

© Sima Dezső, ÓE NIK 918 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (4)

Source: [115]

© Sima Dezső, ÓE NIK 919 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (5)

Table 6.1: Main features of AMD’s Stars family [130]

© Sima Dezső, ÓE NIK 920 www.tankonyvtar.hu
6.2 The AMD Fusion APU line (6)

Source: [115]

© Sima Dezső, ÓE NIK 921 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (7)

Source: [116]

© Sima Dezső, ÓE NIK 922 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (8)

AMD’s Ontario and Zacate Fusion APUs

© Sima Dezső, ÓE NIK 923 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (9)

Source: [115]

© Sima Dezső, ÓE NIK 924 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (10)

AMD’s Ontario and Zacate Fusion APUs

Targeted market segments


• Zacate: Mainstream notebooks
• Ontario: HD Netbooks

© Sima Dezső, ÓE NIK 925 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (11)

Main features of the Ontario and Zacate APUs [115]

© Sima Dezső, ÓE NIK 926 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (12)

Benefit of CPU/GPU integration [115]

© Sima Dezső, ÓE NIK 927 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (13)

Table 6.2: Main features of AMD’s mainstream Zacate Fusion APU line [131]

© Sima Dezső, ÓE NIK 928 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (14)

Table 6.3: Main features of AMD’s low power Ontario Fusion APU line [131]

© Sima Dezső, ÓE NIK 929 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (15)

OpenCL programming support for both the Ontario and the Zacate lines
AMD’s APP SDK 2.3 (1/2011) (formerly the ATI Stream SDK) provides OpenCL support for
both lines.

New features of APP SDK 2.3 [118]

Improved OpenCL runtime performance:


Improved kernel launch times.
Improved PCIe transfer times.
Enabled DRMDMA for the ATI Radeon 5000 Series and AMD Radeon 6800 GPUs that are
specified in the Supported Devices.
Increased size of staging buffers.
Enhanced Binary Image Format (BIF).
Support for UVD video hardware component through OpenCL (Windows 7).
Support for AMD E-Series and C-Series platforms (AMD Fusion APUs).
Support for Northern Islands family of devices.
Support for AMD Radeon™ HD 6310 and AMD Radeon™ 6250 devices.
Support for OpenCL math libraries: FFT and BLAS-3, available for download at AMD Accelerated
Parallel Processing Math Libraries.
Preview feature: An optimization pragma for unrolling loops.
Preview feature: Support for CPU/X86 image. This enables the support for Image formats, as
described in the Khronos specification for OpenCL, to be run on the x86 CPU. It is enabled by
the following environment variable in your application: CPU_IMAGE_SUPPORT.
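As a hedged illustration of how an application targets these platforms through OpenCL
(plain C against the standard OpenCL 1.1 host API; error handling omitted, names illustrative),
the GPU device of a Fusion APU is selected with a fallback to the x86 CPU device:

    #include <CL/cl.h>

    cl_platform_id platform;
    cl_device_id   device;
    cl_uint        n;

    clGetPlatformIDs(1, &platform, &n);                       /* e.g. the AMD APP platform */
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, &n) != CL_SUCCESS)
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, &n);   /* x86 CPU fallback */

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);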

© Sima Dezső, ÓE NIK 930 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (16)

The Bobcat CPU core


Both the Ontario and Zacate Fusion APUs are based on Bobcat CPU cores.

© Sima Dezső, ÓE NIK 931 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (17)

Target use of the Bobcat CPU core

Source: [115]
© Sima Dezső, ÓE NIK 932 www.tankonyvtar.hu
6.2 The AMD Fusion APU line (18)

Contrasting the Bulldozer and the Bobcat cores [119]

© Sima Dezső, ÓE NIK 933 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (19)

Microarchitecture of a Bobcat CPU core [119]

(Dual issue superscalar)

© Sima Dezső, ÓE NIK 934 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (20)

Microarchitecture of a Bobcat CPU core (detailed) [119]

In-breadth support
of dual issue

© Sima Dezső, ÓE NIK 935 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (21)

Bobcat’s floorplan [119]

© Sima Dezső, ÓE NIK 936 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (22)

AMD’s Ontario APU chip [120]:


Dual Bobcat CPU cores
+
ATI-DX11 GPU

TSMC 40nm
(Taiwan Semiconductor Manufacturing Company)
~ 400 million transistors

© Sima Dezső, ÓE NIK 937 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (23)

Die shot of the Ontario APU chip [127]

© Sima Dezső, ÓE NIK 938 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (24)

Source:

DX9, 2 Execution Units

939
6.2 The AMD Fusion APU line (25)

AMD’s Llano Fusion APU CPU

© Sima Dezső, ÓE NIK 940 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (26)

Target use of the Llano Fusion APU

© Sima Dezső, ÓE NIK 941 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (27)

Llano Fusion APU (Ax APU line)


Target use
Mainstream desktops, but will be replaced by next generation Bulldozer Fusion APUs in 2012
Main components
2-4 CPU cores and a DX11 capable GPU
CPU cores
Revamped Star cores
Main features of the CPU cores [121]
• The cores are 32 nm shrinks of the originally 45 nm Star cores with major enhancements, like
new power gating and digitally controlled power management
• Three-wide out-of-order cores
• 1 MB L2
• Power consumption range of a single core: 2.5 - 25 W
• 35 million transistors per core

Main feature of the GPU


It supports DX11
No. of the EUs in the GPU
Up to 400
Targeted availability
Q3 2011

© Sima Dezső, ÓE NIK 942 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (28)

Supposed parameters of Llano’s GPU [134]

APU/GPU        Stream proc.   GPU clock      Memory        Memory clock        TDP

AMD A8 APU     400 st.        ~ 594 MHz      DDR3          1 866 MHz           65/100 W*
AMD A6 APU     320 st.        ~ 443 MHz      DDR3          1 866 MHz           65/100 W*
AMD A4 APU     160 st.        ~ 594 MHz      DDR3          1 866 MHz           65 W*
AMD E2 APU     80 st.         ~ 443 MHz      DDR3          1 600 MHz           65 W*
AMD HD5670     400 st.        775 MHz        GDDR5         4 000 MHz           64 W
AMD HD5570     400 st.        650 MHz        DDR3/GDDR5    1 800/4 000 MHz     39 W
AMD HD5550     320 st.        550 MHz        DDR3/GDDR5    1 800/4 000 MHz     39 W
AMD HD5450     80 st.         400-650 MHz    DDR2/DDR3     800/1 600 MHz       19 W

*: Total TDP for the CPU and the GPU
© Sima Dezső, ÓE NIK 943 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (29)

Floor plan of the CPU core (revamped Star core) of the Llano APU [133]

© Sima Dezső, ÓE NIK 944 www.tankonyvtar.hu



6.2 The AMD Fusion APU line (30)

Floor plan of the Llano APU [132]

© Sima Dezső, ÓE NIK 945 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (31)

The Bulldozer CPU core

Intended to be used both in CPU processors and Fusion APUs, as shown


in the Notebook, Desktop and Server roadmaps before.

In 2011
Bulldozer CPU cores will be used as the basis for AMD’s desktop and server CPU processors.
In 2012
next generation Bulldozer cores are planned to be used both

• in AMD’s Fusion APUs for notebooks and desktops as well as
• in AMD’s servers, implemented as CPUs.

Remark
No plans have been revealed to continue developing the Llano APU.

© Sima Dezső, ÓE NIK 946 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (32)

Bulldozer’s basic modules each consisting of two tightly coupled cores – 1 [135]

Basic module of the microarchitecture

22 Butler, Dec. 2010

© Sima Dezső, ÓE NIK 947 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (33)

Bulldozer’s basic modules each consisting of two tightly coupled cores – 2 [136]

22

© Sima Dezső, ÓE NIK 948 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (34)

Bulldozer’s basic modules each consisting of two tightly coupled cores – 3 [135]

22 Butler, Dec. 2010

© Sima Dezső, ÓE NIK 949 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (35)

Similarity of the microarchitectures of Nvidia’s Fermi and AMD’s Bulldozer CPU core
Both are using tightly coupled dual pipelines with shared and dedicated units.

AMD’s Bulldozer module [136] Nvidia’s Fermi core [39]

© Sima Dezső, ÓE NIK 950 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (36)

Bulldozer’s cores don’t support multithreading [136]

22

© Sima Dezső, ÓE NIK 951 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (37)

An 8-core Bulldozer chip consisting of four building blocks [136]

© Sima Dezső, ÓE NIK 952 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (38)

Floor plan of
Bulldozer [128]

953 www.tankonyvtar.hu
6.2 The AMD Fusion APU line (39)

Floor plan of a dual core module [128]

© Sima Dezső, ÓE NIK 954 www.tankonyvtar.hu


6.2 The AMD Fusion APU line (40)

Expected evolution of the process technology at Global Foundries [122]

© Sima Dezső, ÓE NIK 955 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines

© Sima Dezső, ÓE NIK 956 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (1)

Intel’s in-package integrated CPU/GPU processors


Introduced in Jan. 2010
Example
In-package integrated CPU/GPU [137]

© Sima Dezső, ÓE NIK 957 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (2)

Announcing Intel’s in-package integrated CPU/GPU lines [140]

© Sima Dezső, ÓE NIK 958 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (3)

Intel’s in-package integrated CPU/GPU processors

Mobile processors: Arrandale      Desktop processors: Clarksdale
  i3 3xx                            i3 5xx
  i5 4xx/5xx                        i5 6xx
  i7 6xx
CPU/GPU components
CPU: Westmere architecture (32 nm)
(Enhanced 32 nm shrink of the
45 nm Nehalem architecture)
GPU: (45 nm)
Shader model 4, DX10 support

© Sima Dezső, ÓE NIK 959 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (4)

Repartitioning the system architecture by introducing the Nehalem based Westmere


with in-package integrated graphics [141]

© Sima Dezső, ÓE NIK 960 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (5)

Intel’s i3/i5/i7 mobile Arrandale line [137]


Announced 1/2010
32 nm CPU/45 nm discrete GPU

© Sima Dezső, ÓE NIK 961 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (6)

The Arrandale processor [138]


Westmere CPU (32 nm shrink of the 45 nm Nehalem) + tightly coupled 45 nm GPU in a package

© Sima Dezső, ÓE NIK 962 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (7)

Basic components of Intel’s mobile Arrandale processor [123]

32 nm CPU:
mobile implementation of the Westmere basic architecture
(the 32 nm shrink of the 45 nm Nehalem basic architecture)

45 nm GPU:
Intel’s GMA HD (Graphics Media Accelerator)
(12 Execution Units, Shader model 4, no OpenCL support)
© Sima Dezső, ÓE NIK 963 www.tankonyvtar.hu
6.3 Intel’s in-package integrated CPU/GPU lines (8)

Key specifications of Intel’s Arrandale line [139] http://www.anandtech.com/show/2902

964
6.3 Intel’s in-package integrated CPU/GPU lines (9)

Intel’s i3/i5 desktop Clarksdale line


Announced 1/2010

Figure 6.3: The Clarksdale processor with in-package integrated graphics along with the H57 chipset
[140]

© Sima Dezső, ÓE NIK 965 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (10)

Key features of the Clarksdale processor [141]

© Sima Dezső, ÓE NIK 966 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (11)

Integrated Graphics Media (IGM) architecture [141]

© Sima Dezső, ÓE NIK 967


6.3 Intel’s in-package integrated CPU/GPU lines (12)

Key features of Intel’s i5-based Clarksdale desktop processors [140]

© Sima Dezső, ÓE NIK 968 www.tankonyvtar.hu


6.3 Intel’s in-package integrated CPU/GPU lines (13)

In Jan. 2011 Intel replaced their in-package integrated CPU/GPU lines with the on-die integrated
Sandy Bridge line.

© Sima Dezső, ÓE NIK 969 www.tankonyvtar.hu


6.4 Intel’s on-die integrated CPU/GPU lines
(Sandy Bridge)

© Sima Dezső, ÓE NIK 970 www.tankonyvtar.hu


6.4 Intel’s on-die integrated CPU/GPU lines (1)

The Sandy Bridge processor


• Shipped in Jan. 2011
• Provides on-die integrated CPU and GPU

© Sima Dezső, ÓE NIK 971 www.tankonyvtar.hu


6.4 Intel’s on-die integrated CPU/GPU lines (2)

Main features of Sandy Bridge [142]

© Sima Dezső, ÓE NIK 972 www.tankonyvtar.hu


6.4 Intel’s on-die integrated CPU/GPU lines (3)

Key specification data of Sandy Bridge [124]

Branding             Core i5     Core i5     Core i5     Core i7     Core i7
Processor            2400        2500        2500K       2600        2600K
Price                $184        $205        $216        $294        $317
TDP                  95W         95W         95W         95W         95W
Cores / Threads      4/4         4/4         4/4         4/8         4/8
Frequency GHz        3.1         3.3         3.3         3.4         3.4
Max Turbo GHz        3.4         3.7         3.7         3.8         3.8
DDR3 MHz             1333        1333        1333        1333        1333
L3 Cache             6MB         6MB         6MB         8MB         8MB
Intel HD Graphics    2000        2000        3000        2000        3000
GPU Max freq         1100 MHz    1100 MHz    1100 MHz    1350 MHz    1350 MHz
Hyper-Threading      No          No          No          Yes         Yes
AVX Extensions       Yes         Yes         Yes         Yes         Yes
Socket               LGA 1155    LGA 1155    LGA 1155    LGA 1155    LGA 1155

© Sima Dezső, ÓE NIK 973 www.tankonyvtar.hu


6.4 Intel’s on-die integrated CPU/GPU lines (4)

Die photo of Sandy Bridge [143]

(Die photo annotations: four cores, each with 32 KB L1D (3 clk) and 256 KB L2 (9 clk);
Hyperthreading; AES instructions; AVX 256 bit; VMX Unrestricted; 4 operands; ~20 mm2 / core;
@ 1.0 - 1.4 GHz; L3 connected (25 clk) by a 256 b/cycle ring architecture; PCIe 2.0;
DDR3-1600 at 25.6 GB/s; 32 nm process / ~225 mm2 die size / 85W TDP)

© Sima Dezső, ÓE NIK 974 www.tankonyvtar.hu


6.4 Intel’s on-die integrated CPU/GPU lines (5)

Sandy Bridge’s integrated graphics unit [102]

© Sima Dezső, ÓE NIK 975 www.tankonyvtar.hu


6.4 Intel’s on-die integrated CPU/GPU lines (6)

Specification data of the HD 2000 and HD 3000 graphics [125]

© Sima Dezső, ÓE NIK 976 www.tankonyvtar.hu


6.4 Intel’s on-die integrated CPU/GPU lines (7)

Performance comparison: gaming [126]

(Chart: frames per second achieved by the HD 5570 (400 ALUs), the i5/i7 2xxx (Sandy Bridge)
and the i5 6xx (Arrandale) integrated graphics)


© Sima Dezső, ÓE NIK 977 www.tankonyvtar.hu
References to all four sections
of GPGPUs/DPAs

Dezső Sima

© Sima Dezső, ÓE NIK 978 www.tankonyvtar.hu


References (1)

References (to all four sections)

[1]: Torricelli F., AMD in HPC, HPC07, 2007


http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf
[2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia

[3] AMD FireStream 9170, 2008


http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html

[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008,
Nvidia, http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf

[5]: Tesla S870 GPU Computing System, Specification, Nvidia, March 13 2008,
http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf

[6]: Torres G., Nvidia Tesla Technology, Nov. 2007,


http://www.hardwaresecrets.com/article/495

[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD

[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU,
ASPLOS 2006, June 2008

[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007
http://ati.amd.com/developer/techpapers.html

© Sima Dezső, ÓE NIK 979 www.tankonyvtar.hu


References (2)

[10]: Compute Abstraction Layer (CAL) Technology – Intermediate Language (IL),


Version 2.0, AMD, Oct. 2008

[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,
June 2008, Nvidia,
http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming
Guide_2.0.pdf

[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007,
University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/
ectures/lecture7-threading%20hardware.ppt

[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008,
http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf

[14]: Goto H., Nvidia G80, PC Watch, April 16 2007,


http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm

[15]: Goto H., GeForce 8800GT (G92), PC Watch, Oct. 31 2007,


http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf

[16]: Goto H., NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008,
http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm

[17]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review,
PC Perspective, June 16 2008,
http://www.pcper.com/article.php?aid=577&type=expert&pid=3
© Sima Dezső, ÓE NIK 980 www.tankonyvtar.hu
References (3)

[18]: http://en.wikipedia.org/wiki/DirectX

[19]: Dietrich S., “Shader Model 3.0,” April 2004, Nvidia,


http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf

[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006,
Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

[21]: Patidar S. & al., “Exploiting the Shader Model 4.0 Architecture, Center for
Visual Information Technology, IIIT Hyderabad, March 2007,
http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf

[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia,
http://www.nvidia.com/page/8800_tech_briefs.html

[23]: Goto H., Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch,
http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf

[24]: Fatahalian K., “From Shader Code to a Teraflop: How Shader Cores Work,”
Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008,

[25]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” Real World Technologies,
Sept. 8 2008, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242

[26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide,


Version 1.1, Nov. 2007, Nvidia,
http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_
Programming_Guide_1.1.pdf 981 www.tankonyvtar.hu
© Sima Dezső, ÓE NIK
References (4)

[27]: Seiler L. & al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,”
ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008

[28]: Kogo H., “Larrabee”, PC Watch, Oct. 17, 2008,


http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm

[29]: Shrout R., IDF Fall 2007 Keynote, PC Perspective, Sept. 18, 2007,
http://www.pcper.com/article.php?aid=453

[30]: Stokes J., Larrabee: Intel’s biggest leap ahead since the Pentium Pro,” Ars Technica,
Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-
ahead-since-the-pentium-pro.html

[31]: Shimpi A. L. & Wilson D., “Intel's Larrabee Architecture Disclosure: A Calculated
First Move,” Anandtech, Aug. 4 2008,
http://www.anandtech.com/showdoc.aspx?i=3367&p=2

[32]: Hester P., “Multi_Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19,
Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf

[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1
http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf

[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007,
http://www.graphicshardware.org/previous/www_2007/presentations/doggett-
radeon2900-gh07.pdf

© Sima Dezső, ÓE NIK 982 www.tankonyvtar.hu


References (5)

[35]: Mantor M., “AMD’s Radeon Hd 2900,” Hot Chips 19, Aug. 2007,
http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf

[36]: Houston M., “Anatomy of AMD’s TeraScale Graphics Engine,” SIGGRAPH 2008,
http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf

[37]: Mantor M., “Entering the Golden Age of Heterogeneous Computing,” PEEP 2008,
http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf

[38]: Kogo H., RV770 Overview, PC Watch, July 02 2008,


http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf

[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,
http://www.realworldtech.com/includes/templates/articles.cfm?
ArticleID=RWT093009110932&mode=print

[40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,
Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1

[41]: Wasson S., AMD's Radeon HD 5870 graphics processor,


Tech Report, Sept 23 2009, http://techreport.com/articles.x/17618/1

[42]: Bell B., ATI Radeon HD 5870 Performance Preview ,


Firing Squad, Sept 22 2009, http://www.firingsquad.com/hardware/
ati_radeon_hd_5870_performance_preview/default.asp

© Sima Dezső, ÓE NIK 983 www.tankonyvtar.hu


References (6)

[43]: Nvidia CUDA C Programming Guide, Version 3.2, October 22 2010


http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/
CUDA_C_Programming_Guide.pdf

[44]: Hwu W., Kirk D., Nvidia, Advanced Algorithmic Techniques for GPUs, Berkeley,
January 24-25 2011
http://iccs.lbl.gov/assets/docs/2011-01-24/lecture1_computational_thinking_
Berkeley_2011.pdf
[45]: Wasson S., Nvidia's GeForce GTX 580 graphics processor
Tech Report, Nov 9 2010, http://techreport.com/articles.x/19934/1

[46]: Shrout R., Nvidia GeForce 8800 GTX Review – DX10 and Unified Architecture,
PC Perspective, Nov 8 2006
http://swfan.com/reviews/graphics-cards/nvidia-geforce-8800-gtx-review-dx10-
and-unified-architecture/g80-architecture

[47]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors
Tech Report, March 31 2010, http://techreport.com/articles.x/18682
[48]: Gangar K., Tianhe-1A from China is world’s fastest Supercomputer
Tech Ticker, Oct 28 2010, http://techtickerblog.com/2010/10/28/tianhe-1a-
from-china-is-worlds-fastest-supercomputer/

[49]: Smalley T., ATI Radeon HD 5870 Architecture Analysis, Bit-tech, Sept 30 2009,
http://www.bit-tech.net/hardware/graphics/2009/09/30/ati-radeon-hd-5870-
architecture-analysis/8

© Sima Dezső, ÓE NIK 984 www.tankonyvtar.hu


References (7)

[50]: Nvidia Compute PTX: Parallel Thread Execution, ISA, Version 1.0, June 2007,
https://www.doc.ic.ac.uk/~wl/teachlocal/arch2/papers/nvidia-PTX_ISA_1.0.pdf

[51]: Kanter D., Intel's Sandy Bridge Microarchitecture, Real World Technologies,
Sept 25 2010 http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4

[52]: Nvidia CUDATM FermiTM Compatibility Guide for CUDA Applications, Version 1.0,
February 2010, http://developer.download.nvidia.com/compute/cuda/3_0/
docs/NVIDIA_FermiCompatibilityGuide.pdf

[53]: Hallock R., Dissecting Fermi, NVIDIA’s next generation GPU, Icrontic, Sept 30 2009,
http://tech.icrontic.com/articles/nvidia_fermi_dissected/
[54]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview,
Legit Reviews, Jan 20 2010, http://www.legitreviews.com/article/1193/2/
[55]: Hoenig M., NVIDIA GeForce GTX 460 SE 1GB Review, Hardware Canucks, Nov 21 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/38178-
nvidia-geforce-gtx-460-se-1gb-review-2.html
[56]: Glaskowsky P. N., Nvidia’s Fermi: The First Complete GPU Computing Architecture
Sept 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/
P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf

[57]: Kirk D. & Hwu W. W., ECE498AL Lectures 4: CUDA Threads – Part 2, 2007-2009,
University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/
al/lectures/lecture4%20cuda%20threads%20part2%20spring%202009.ppt

© Sima Dezső, ÓE NIK 985 www.tankonyvtar.hu


References (8)

[58]: Nvidia’s Next Generation CUDATM Compute Architecture: FermiTM, Version 1.1, 2009
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_
Architecture_Whitepaper.pdf
[59]: Kirk D. & Hwu W. W., ECE498AL Lectures 8: Threading Hardware in G80, 2007-2009,
University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/
al/lectures/lecture8-threading-hardware-spring-2009.ppt
[60]: Wong H., Papadopoulou M.M., Sadooghi-Alvandi M., Moshovos A., Demystifying GPU
Microarchitecture through Microbenchmarking, University of Toronto, 2010,
http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
[61]: Pettersson J., Wainwright I., Radar Signal Processing with Graphics Processors
(GPUs), SAAB Technologies, Jan 27 2010,
http://www.hpcsweden.se/files/RadarSignalProcessingwithGraphicsProcessors.pdf
[62]: Smith R., NVIDIA’s GeForce GTX 460: The $200 King, AnandTech, July 11 2010,
http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2
[63]: Angelini C., GeForce GTX 580 And GF110: The Way Nvidia Meant It To Be Played,
Tom’s Hardware, Nov 9 2010, http://www.tomshardware.com/reviews/geforce-
gtx-580-gf110-geforce-gtx-480,2781.html
[64]: NVIDIA G80: Architecture and GPU Analysis, Beyond3D, Nov. 8 2006,
http://www.beyond3d.com/content/reviews/1/11
[65]: D. Kirk and W. Hwu, Programming Massively Parallel Processors, 2008
Chapter 3: CUDA Threads, http://courses.engr.illinois.edu/ece498/al/textbook/
Chapter3-CudaThreadingModel.pdf

© Sima Dezső, ÓE NIK 986 www.tankonyvtar.hu


References (9)

[66]: NVIDIA Forums: General CUDA GPU Computing Discussion, 2008


http://forums.nvidia.com/index.php?showtopic=73056
[67]: Wikipedia: Comparison of AMD graphics processing units, 2011
http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units
[68]: Nvidia OpenCL Overview, 2009
http://gpgpu.org/wp/wp-content/uploads/2009/06/05-OpenCLIntroduction.pdf
[69]: Chester E., Nvidia GeForce GTX 460 1GB Fermi Review, Trusted Reviews,
July 13 2010, http://www.trustedreviews.com/graphics/review/2010/07/13/
Nvidia-GeForce-GTX-460-1GB-Fermi/p1
[70]: NVIDIA GF100 Architecture Details, Geeks3D, 2008-2010,
http://www.geeks3d.com/20100118/nvidia-gf100-architecture-details/
[71]: Murad A., Nvidia Tesla C2050 and C2070 Cards, Science and Technology Zone,
17 nov. 2009,
http://forum.xcitefun.net/nvidia-tesla-c2050-and-c2070-cards-t39578.html

[72]: New NVIDIA Tesla GPUs Reduce Cost Of Supercomputing By A Factor Of 10,
Nvidia, Nov. 16 2009
http://www.nvidia.com/object/io_1258360868914.html
[73]: Nvidia Tesla, Wikipedia, http://en.wikipedia.org/wiki/Nvidia_Tesla

[74]: Tesla M2050 and Tesla M2070/M2070Q Dual-Slot Computing Processor Modules,
Board Specification, v. 03, Nvidia, Aug. 2010,
http://www.nvidia.asia/docs/IO/43395/BD-05238-001_v03.pdf

© Sima Dezső, ÓE NIK 987 www.tankonyvtar.hu


References (10)

[75]: Tesla 1U GPU Computing System, Product Specification, v. 04, Nvidia, June 2009,
http://www.nvidia.com/docs/IO/43395/SP-04975-001-v04.pdf
[76]: Kanter D., The Case for ECC Memory in Nvidia’s Next GPU, Real World Technologies,
19 Aug. 2009,
http://www.realworldtech.com/page.cfm?ArticleID=RWT081909212132
[77]: Hoenig M., Nvidia GeForce 580 Review, HardwareCanucks, Nov. 8, 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/37789-nvidia-
geforce-gtx-580-review-5.html
[78]: Angelini C., AMD Radeon HD 6990 4 GB Review, Tom’s Hardware, March 8, 2011,
http://www.tomshardware.com/reviews/radeon-hd-6990-antilles-crossfire,2878.html
[79]: Tom’s Hardware Gallery,
http://www.tomshardware.com/gallery/two-cypress-gpus,0101-230369-
7179-0-0-0-jpg-.html
[80]: Tom’s Hardware Gallery,
http://www.tomshardware.com/gallery/Bare-Radeon-HD-5970,0101-230349-
7179-0-0-0-jpg-.html
[81]: CUDA, Wikipedia, http://en.wikipedia.org/wiki/CUDA
[82]: GeForce Graphics Processors, Nvidia, http://www.nvidia.com/object/geforce_family.html

[83]: Next Gen CUDA GPU Architecture, Code-Named “Fermi”, Press Presentation at
Nvidia’s 2009 GPU Technology Conference, (GTC), Sept. 30 2009,
http://www.nvidia.com/object/gpu_tech_conf_press_room.html

© Sima Dezső, ÓE NIK 988 www.tankonyvtar.hu


References (11)

[84]: Tom’s Hardware Gallery,


http://www.tomshardware.com/gallery/SM,0101-110801-0-14-15-1-jpg-.html
[85]: Butler, M., Bulldozer, a new approach to multithreaded compute performance,
Hot Chips 22, Aug. 24 2010
http://www.hotchips.org/index.php?page=hot-chips-22
[86]:. Voicu A., NVIDIA Fermi GPU and Architecture Analysis, Beyond 3D, 23rd Oct 2010,
http://www.beyond3d.com/content/reviews/55/1
[87]: Chu M. M., GPU Computing: Past, Present and Future with ATI Stream Technology,
AMD, March 9 2010, http://developer.amd.com/gpu_assets/GPU%20Computing%20-
%20Past%20Present%20and%20Future%20with%20ATI%20Stream%20Technology.pdf

[88]: Smith R., AMD's Radeon HD 6970 & Radeon HD 6950: Paving The Future For AMD,
AnandTech, Dec. 15 2010,
http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950
[89] Christian, AMD renames ATI Stream SDK, updates its with APU, OpenCL 1.1 support,
Jan. 27 2011, http://www.tcmagazine.com/tcm/news/software/34765/
amd-renames-ati-stream-sdk-updates-its-apu-opencl-11-support
[90]: User Guide: AMD Stream Computing, Revision 1.3.0, Dec. 2008,
http://www.ele.uri.edu/courses/ele408/StreamGPU.pdf
[91]: ATI Stream Computing Compute Abstraction Layer (CAL) Programming Guide,
Revision 2.01, AMD, March 2010, http://developer.amd.com/gpu_assets/ATI_Stream_
SDK_CAL_Programming_Guide_v2.0.pdf
http://developer.amd.com/gpu/amdappsdk/assets/AMD_CAL_Programming_Guide_v2.0.pdf
© Sima Dezső, ÓE NIK 989 www.tankonyvtar.hu
References (12)

[92]: Technical Overview: AMD Stream Computing, Revision 1.2.1, Oct. 2008,
http://www.cct.lsu.edu/~scheinin/Parallel/StreamComputingOverview.pdf

[93]: AMD Accelerated Parallel Processing OpenCL Programming Guide, Rev. 1.2,
AMD, Jan. 2011, http://developer.amd.com/gpu/amdappsdk/assets/
AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

[94]: An Introduction to OpenCL, AMD, http://www.amd.com/us/products/technologies/


stream-technology/opencl/pages/opencl-intro.aspx
[95]: Behr D., Introduction to OpenCL PPAM 2009, Sept. 15 2009,
http://gpgpu.org/wp/wp-content/uploads/2009/09/B1-OpenCL-Introduction.pdf
[96]: Gohara D.W. PhD, OpenCL Episode 2 – OpenCL Fundamentals, Aug. 26 2009,
MacResearch, http://www.macresearch.org/files/opencl/Episode_2.pdf

[97]: Kanter D., AMD's Cayman GPU Architecture, Real World Technologies, Dec. 14 2010,
http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=3
[98]: Hoenig M., AMD Radeon HD 6970 and HD 6950 Review, Hardware Canucks,
Dec. 14 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/
38899-amd-radeon-hd-6970-hd-6950-review-3.html
[99]: Reference Guide: AMD HD 6900 Series Instruction Set Architecture, Revision 1.0,
Febr. 2011, http://developer.amd.com/gpu/AMDAPPSDK/assets/
AMD_HD_6900_Series_Instruction_Set_Architecture.pdf
[100]:Howes L., AMD and OpenCL, AMD Application Engineering, Dec. 2010,
http://www.many-core.group.cam.ac.uk/ukgpucc2/talks/Howes.pdf

© Sima Dezső, ÓE NIK 990 www.tankonyvtar.hu


References (13)

[101]: ATI R700-Family Instruction Set Architecture Reference Guide, Revision 1.0a,
AMD, Febr. 2011, http://developer.amd.com/gpu_assets/R700-Family_Instruction_
Set_Architecture.pdf
[102]: Piazza T., Dr. Jiang H., Microarchitecture Codename Sandy Bridge: Processor
Graphics, Presentation ARCS002, IDF San Francisco, Sept. 2010
[103]: Bhaniramka P., Introduction to Compute Abstraction Layer (CAL),
http://coachk.cs.ucf.edu/courses/CDA6938/AMD_course/M5%20-
%20Introduction%20to%20CAL.pdf
[104]: Villmow M., ATI Stream Computing, ATI Intermediate Language (IL),
May 30 2008, http://developer.amd.com/gpu/amdappsdk/assets/ATI%20Stream
%20Computing%20-%20ATI%20Intermediate%20Language.ppt#547,9
[105]: AMD Accelerated Parallel Processing Technology,
AMD Intermediate Language (IL), Reference Guide, Revision 2.0e, March 2011,
http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_Intermediate_Language
_(IL)_Specification_v2.pdf
[106]: Hensley J., Hardware and Compute Abstraction Layers for Accelerated Computing
Using Graphics Hardware and Conventional CPUs, AMD, 2007,
http://www.ll.mit.edu/HPEC/agendas/proc07/Day3/10_Hensley_Abstract.pdf
[107]: Hensley J., Yang J., Compute Abstraction Layer, AMD, Febr. 1 2008,
http://coachk.cs.ucf.edu/courses/CDA6938/s08/UCF-2008-02-01a.pdf
[108]: AMD Accelerated Parallel Processing (APP) SDK, AMD Developer Central,
http://developer.amd.com/gpu/amdappsdk/pages/default.aspx

© Sima Dezső, ÓE NIK 991 www.tankonyvtar.hu


References (14)

[109]: OpenCL™ and the AMD APP SDK v2.4, AMD Developer Central, April 6 2011,
http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-AMD-APP-
SDK.aspx
[110]: Stone J., An Introduction to OpenCL, U. of Illinois at Urbana-Champign, Dec. 2009,
http://www.ks.uiuc.edu/Research/gpu/gpucomputing.net

[111]: Introduction to OpenCL Programming, AMD, No. 137-41768-10, Rev. A, May 2010,
http://developer.amd.com/zones/OpenCLZone/courses/Documents/Introduction_
to_OpenCL_Programming%20Training_Guide%20(201005).pdf
[112]: Evergreen Family Instruction Set Architecture, Instructions and Microcode Reference
Guide, AMD, Febr. 2011, http://developer.amd.com/gpu/amdappsdk/assets/
AMD_Evergreen-Family_Instruction_Set_Architecture.pdf
[113]: Intel 810 Chipset: Intel 82810/82810-DC100 Graphics and Memory Controller Hub
(GMCH) Datasheet, June 1999
ftp://download.intel.com/design/chipsets/datashts/29065602.pdf
[114]: Huynh A.T., AMD Announces "Fusion" CPU/GPU Program, Daily Tech, Oct. 25 2006,
http://www.dailytech.com/article.aspx?newsid=4696
[115]: Grim B., AMD Fusion Family of APUs, Dec. 7 2010, http://www.mytechnology.eu/wp-
content/uploads/2011/01/AMD-Fusion-Press-Tour_EMEA.pdf
[116]: Newell D., AMD Financial Analyst Day, Nov. 9 2010,
http://www.rumorpedia.net/wp-content/uploads/2010/11/rumorpedia02.jpg
[117]: De Maesschalck T., AMD starts shipping Ontario and Zacate CPUs, DarkVision
Hardware, Nov. 10 2010, http://www.dvhardware.net/article46449.html

© Sima Dezső, ÓE NIK 992 www.tankonyvtar.hu


References (15)

[118]: AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream) with
OpenCL 1.1 Support, APP SDK 2.3 Jan. 2011
[119]: Burgess B., „Bobcat” AMD’s New Low Power x86 Core Architecture, Aug. 24 2010,
http://www.hotchips.org/uploads/archive22/HC22.24.730-Burgess-AMD-Bobcat-x86.pdf

[120]: AMD Ontario APU pictures, Xtreme Systems, Sept. 3 2010,


http://www.xtremesystems.org/forums/showthread.php?t=258499
[121]: Stokes J., AMD reveals Fusion CPU+GPU, to challenge Intel in laptops,
Febr. 8 2010, http://arstechnica.com/business/news/2010/02/amd-reveals-
fusion-cpugpu-to-challege-intel-in-laptops.ars
[122]: AMD Unveils Future of Computing at Annual Financial Analyst Day, CDRinfo,
Nov. 10 2010, http://www.cdrinfo.com/sections/news/Details.aspx?NewsId=28748
[123]: Shimpi A. L., The Intel Core i3 530 Review - Great for Overclockers & Gamers,
AnandTech, Jan. 22 2010, http://www.anandtech.com/show/2921
[124]: Hagedoorn H. Mohammad S., Barling I. R., Core i5 2500K and Core i7 2600K review,
Jan. 3 2011,
http://www.guru3d.com/article/core-i5-2500k-and-core-i7-2600k-review/2
[125]: Wikipedia: Intel GMA, 2011, http://en.wikipedia.org/wiki/Intel_GMA

[126]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core
i3-2100 Tested, AnandTech, Jan. 3 2011,
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-600k-i5-
2500k-core-i3-2100-tested/11

© Sima Dezső, ÓE NIK 993 www.tankonyvtar.hu


References (16)

[127]: Marques T., AMD Ontario, Zacate Die Sizes - Take 2 , Sept. 14 2010,
http://www.siliconmadness.com/2010/09/amd-ontario-zacate-die-sizes-take-2.html

[128]: De Vries H., AMD Bulldozer, 8 core processor, Nov. 24 2010,


http://chip-architect.com/

[129]: Intel® 845G/845GL/845GV Chipset Datasheet: Intel® 82845G/82845GL/82845GV


Graphics and Memory Controller Hub (GMCH), Mai 2002
http://www.intel.com/design/chipsets/datashts/290746.htm
[130]: Huynh A. T., Final AMD "Stars" Models Unveiled, Daily Tech, May 4 2007,
http://www.dailytech.com/Final+AMD+Stars+Models+Unveiled+/article7157.htm
[131]: AMD Fusion, Wikipedia, http://en.wikipedia.org/wiki/AMD_Fusion
[132]: Nita S., AMD Llano APU to Get Dual-GPU Technology Similar to Hybrid CrossFire,
Softpedia, Jan. 21 2011, http://news.softpedia.com/news/AMD-Llano-APU-to-
Get-Dual-GPU-Technology-Similar-to-Hybrid-CrossFire-179740.shtml
[133]: Jotwani R., Sundaram S., Kosonocky S., Schaefer A., Andrade V. F., Novak A.,
Naffziger S., An x86-64 Core in 32 nm SOI CMOS, IEEE Xplore, Jan. 2011,
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5624589
[134]: Karmehed A., The graphical performance of the AMD A series APUs, Nordic
Hardware, March 16 2011,
http://www.nordichardware.com/news/69-cpu-chipset/42650-the-graphical-
performance-of-the-amd-a-series-apus.html

© Sima Dezső, ÓE NIK 994 www.tankonyvtar.hu


References (17)

[135]: Butler M., „Bulldozer” A new approach to multithreaded compute performance,


Aug. 24 2010, http://www.hotchips.org/uploads/archive22/HC22.24.720-Butler
-AMD-Bulldozer.pdf
[136]: „Bulldozer” and „Bobcat” AMD’s Latest x86 Core Innovations, HotChips22,
http://www.slideshare.net/AMDUnprocessed/amd-hot-chips-bulldozer-bobcat
-presentation-5041615
[137]: Altavilla D., Intel Arrandale Core i5 and Core i3 Mobile Unveiled, Hot Hardware,
Jan. 04 2010,
http://hothardware.com/Reviews/Intel-Arrandale-Core-i5-and-Core-i3-Mobile-Unveiled/

[138]: Dodeja A., Intel Arrandale, High Performance for the Masses, Hot Hardware,
Review of the IDF San Francisco, Sept. 2009,
http://akshaydodeja.com/intel-arrandale-high-performance-for-the-mass
[139]: Shimpi A. L., Intel Arrandale: 32nm for Notebooks, Core i5 540M Reviewed,
AnandTech, 1/4/2010
http://www.anandtech.com/show/2902
[140]: Chiappeta M., Intel Clarkdale Core i5 Desktop Processor Debuts, Hot Hardware
Jan. 03 2010,
http://hothardware.com/Articles/Intel-Clarkdale-Core-i5-Desktop-Processor-Debuts/
[141]: Thomas S. L., Desktop Platform Design Overview for Intel Microarchitecture (Nehalem)
Based Platform, Presentation ARCS001, IDF 2009

[142]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor
Innovations, Presentation ARCS001, IDF San Francisco Sept. 2010

© Sima Dezső, ÓE NIK 995 www.tankonyvtar.hu


References (18)

[143]: Intel Sandy Bridge Review, Bit-tech, Jan. 3 2011,


http://www.bit-tech.net/hardware/cpus/2011/01/03/intel-sandy-bridge-review/1

[144]: OpenCL Introduction and Overview, Khronos Group, June 2010,


http://www.khronos.org/developers/library/overview/opencl_overview.pdf

[145]: ATI Stream Computing OpenCL Programming Guide, rev.1.0b, AMD, March 2010,
http://www.ljll.math.upmc.fr/groupes/gpgpu/tutorial/ATI_Stream_SDK_OpenCL
Programming_Guide.pdf

[146]: Nvidia CUDA C Programming Guide, Version 0.8, Febr.2007


http://www.scribd.com/doc/6577212/NVIDIA-CUDA-Programming-Guide-0

[147]: Nvidia Compute PTX: Parallel Thread Execution, ISA, Version 2.3, March 2011,
http://developer.download.nvidia.com/compute/cuda/4_0_rc2/toolkit/docs/ptx_isa_
2.3.pdf
[148]: ATI Stream Computing Compute Abstraction Layer (CAL) Programming Guide,
Revision 2.03, AMD, Dec. 2010
http://developer.amd.com/gpu/amdappsdk/assets/AMD_CAL_Programming_Guide_
v2.0.pdf
[149]: Wikipedia: Dolphin triangle mesh,
http://en.wikipedia.org/wiki/File:Dolphin_triangle_mesh.png

© Sima Dezső, ÓE NIK 996 www.tankonyvtar.hu
