The History of the GPU - New Developments
Jon Peddie
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
Real-time 3D graphics and consumer gaming markets have been responsible for
driving tremendous innovations to feed the insatiable appetite of high-resolution,
photo-realistic gaming technologies. Capturing the interest of computer scientists
and creative hardware developers around the world, the development of the GPU
has led to advancements in the computational capabilities and memory systems to
feed them. Advanced algorithms and APIs to manage large, complex data systems—
along with the move to general-purpose programming models with exploitation for
general-purpose computing, high-performance computing, cryptocurrency, and arti-
ficial intelligence—have further propelled the GPU into an unprecedented pace of
development.
In the early 1990s when I first became involved with the commercialization of
3D graphics technology, Jon Peddie was already a well-known graphics market
analyst. I joined a team of very seasoned hardware and software developers at GE
Visual Systems in Daytona Beach, where large-scale military and NASA training
systems were developed. We created some of the first consumer commercial uses
of the technology with Sega Models 1 and 2 hardware, initially sporting 180 k
polygons per second and 1.2 M pixels per second with a resolution of 496 × 384 in
the arcade gaming space. After acquisition by Martin Marietta and then Lockheed
Martin, Real3D was formed, where I was part of a small team that developed the Intel
740 3D architecture that started Intel's 3D rendering roadmaps. Jon has shared some
unique perspective on I740 development and Intel’s entry into 3D graphics that a
quick search will reveal. His second book of this series will cover industry trends
and struggles during this period. I joined ATI Technologies in 1999, later acquired by
Advanced Micro Devices, Inc. (AMD) where I have had the pleasure of advancing the
Radeon product line, console gaming systems, and our latest RDNA/CDNA products
that power some of the most exciting developments of the century. Over the years, I
have regularly read JPR research reports by Jon to understand his broad perspective of
relevant emerging trends in our industry. I have had the pleasure of meeting with Jon
on several occasions at product introduction events and industry conferences to chat
about trends, motivations, technical detail, and the successes in real-time graphics.
In Jon’s third book of a three-book series on the History of the GPU, he shares an
interesting and knowledgeable history of the chaotic and competitive time that forged
today’s leaders in the 3D gaming hardware industry. Jon draws on the breadth of his
relationships formed over the years and his knowledge to break these contributions
into six eras of GPU development. In each chapter, Jon not only covers innovations
and key products, but also shares his perspective on company strategy, key leaders,
and visionary architects during each era of development. I hope that you will thor-
oughly enjoy this series and the final book while learning about the tremendous
growth of technology and the hard work, risk, and determination of those who have
contributed to today’s GPU success.
Michael Mantor
AMD Chief GPU Architect
and Corporate Fellow
Preface
This is the third book in the three-book series on the History of the GPU.
The first book covered the history of computer graphics controllers and proces-
sors from the 1970s leading up to the introduction of the fully integrated GPU first
appearing in game consoles in 1996, and then the PC in 1999. The second book in
the series covers the developments that led up to the integrated GPU, from the early
1990s to the late 1990s.
The GPU has been employed in many systems (platforms) and evolved since
1996.
This final book in the series covers the second to sixth eras of the development of
GPU on the PC platform, and other platforms. Other platforms include workstations,
game machines, and others, such as various vehicles—GPUs are used everywhere
in almost everything.
Each chapter is designed to be read independently, hence there may be some
redundancy. Hopefully, each one tells an interesting story.
In general, a company is discussed and introduced in the year of its formation.
However, a company may be discussed in multiple time periods in multiple chapters
depending on how significant their developments were and what impact they had on
the industry.
Book 1 | Book 2 | Book 3
4. 1980–1989 Graphics Controllers on PCs | 4. Major Era of GPUs | 4. Mobile GPUs
5. 1990–1995 Graphics Controllers on PCs | 5. First Era of GPUs | 5. Game Console GPUs
6. 1990–1999 Graphics Controllers on Other Platforms | 6. GPU Environment-Hardware | 6. Compute GPUs
8. What is a GPU | 8. GPU Environment-Software Extensions | 8. Sixth Era of GPUs
I mark the GPU’s introduction as the first fully integrated single chip with hardware
geometry processing capabilities—transform and lighting. Nvidia gets that honor on
the PC by introducing their GeForce 256 based on the NV10 chip in October 1999.
However, Silicon Graphics Inc. (SGI) introduced an integrated GPU in the Nintendo
64 in 1996, and ArtX developed an integrated GPU for the PC a month after Nvidia.
As you will learn, Nvidia did not introduce the concept of a GPU, nor did they
develop the first hardware implementation of transform and lighting. But Nvidia was
the first to bring all that together in a mass-produced single chip device.
The evolution of the GPU, however, did not stop with the inclusion of the transform
and lighting (T&L) engine, because the first era of such GPUs had fixed-function
T&L processors—that was all they could do, and when they were not doing that,
they sat idle, consuming power. The GPU kept evolving and has gone through six eras
of evolution ending up today as a universal computing machine capable of almost
anything.
The Author
I have been working in computer graphics since the early 1960s, first as an engineer,
then as an entrepreneur (I founded four companies and ran three others), ending up
in a failed attempt at retiring in 1982 as an industry consultant and advisor. Over
the years, I watched, advised, counseled, and reported on developing companies
and their technology. I saw the number of companies designing or building graphics
controllers swell from a few to over forty-five. In addition, there have been over thirty
companies designing or making graphics controllers for mobile devices.
I’ve written and contributed to several other books on computer graphics (seven
under my name and six co-authored). I’ve lectured at several universities around the
world, written uncountable articles, and acquired a few patents, all with a single,
passionate thread—computer graphics and the creation of beautiful pictures that tell
a story. This book is liberally sprinkled with images—block diagrams of the chips,
photos of the chips, the boards they were put on, and the systems they were put in,
and pictures of some of the people who invented and created these marvelous devices
that impact and enhance our daily lives—many of them I am proud to say are good
friends of mine.
I laid out the book in such a way (I hope) that you can open it up to any page and
start to get the story. You can read it linearly; if you do, you’ll probably find new
information and probably more than you ever wanted to know. My email address is
in various parts of this book, and I try to answer everyone, hopefully within 48 hours.
I’d love to hear comments, your stories, and your suggestions.
The following is an alphabetical list of all the people (at least I hope it’s all of
them) who helped me with this project. A couple of them have passed away, sorry to
say. Hopefully, this book will help keep the memory of them and their contributions
alive.
Thanks for reading
Jon Peddie—Chasing pixels, and finding gems
The following people helped me with editing, interviews, data, photos, and most of
all encouragement. I literally and figuratively could not have done this without them.
Ashraf Eassa—Nvidia
Andrew Wolfe—S3
Anand Patel—Arm
Atif Zafar—Pixilica
Borgar Ljosland—Falanx
Brian Kelleher—DEC, and finally Nvidia
Bryan Del Rizzo—3dfx & Nvidia
Carrell Killebrew—TI/ATI/AMD
Chris Malachowsky—Nvidia
Curtis Priem—Nvidia
Dado Banatao—S3
Dan Vivoli—Nvidia
Dan Wood—Matrox, Intel
Daniel Taranovsky—ATI
Dave Erskine—ATI & AMD
Dave Orton—SGI, ArtX, ATI & AMD
David Harold—Imagination Technologies
Dave Kasik—Boeing
Emily Drake—Siggraph
Edvard Sørgård—Falanx
Eric Demers—AMD/Qualcomm
Frank Paniagua—Video Logic
Gary Tarolli—3dfx
Gerry Stanley—Real3D
George Sidiropoulos—Think Silicon
Henry Chow—Yamaha & Giga Pixel
Henry Fuchs—UNC
Henry C. Lin—Nvidia
Henry Quan—ATI
Hossein Yassaie—Imagination Technologies
Iakovos Istamoulis—Think Silicon
Ian Hutchinson—Arm
Jay Eisenlohr—Rendition
Jay Torborg—Microsoft
Jeff Bush—Nyuzi
Jeff Fischer—Weitek & Nvidia
Jem Davies—Arm
Jensen Huang—Nvidia
Jim Pappas—Intel
Joe Curley—Tseng/Intel
Jonah Alben—Nvidia
John Poulton—UNC & Nvidia
Karl Guttag—TI
Karthikeyan (Karu) Sankaralingam—University of Wisconsin-Madison
Kathleen Maher—JPA & JPR
Ken Potashner—S3 & SonicBlue
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Programmable Vertex and Geometry Shaders (2001–2006) . . . . . . 1
1.1.1 Nvidia NV20—GeForce 3 (February 2001) . . . . . . . . . . . . 2
1.1.2 ATI R200 Radeon 8500 (August 2001) . . . . . . . . . . . . . . . . 4
1.1.3 Nvidia’s NV25–28—GeForce 4 Ti (February 2002) . . . . . 11
1.1.4 ATI’s R300 Radeon 9700 and the VPU (August
2002) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.4.1 First PC GPU with Eight Pipes . . . . . . . . . . . . . 16
1.1.4.2 Z-Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.4.3 Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.1.4.4 Memory Management . . . . . . . . . . . . . . . . . . . . . 19
1.1.4.5 Multiple Displays . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.1.4.6 Along Comes a RenderMonkey . . . . . . . . . . . . . 22
1.1.4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.1.5 SiS Xabre—September 2002 . . . . . . . . . . . . . . . . . . . . . . . . 23
1.1.5.1 SiS 301B Video Processor . . . . . . . . . . . . . . . . . 26
1.1.5.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.1.6 The PC GPU Landscape in 2003 . . . . . . . . . . . . . . . . . . . . . 27
1.1.7 Nvidia NV 30–38 GeForce FX 5 Series (2003–2004) . . . 27
1.1.7.1 CineFX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.1.7.2 Nvidia Enters the AIB Market
with the GeForceFX (2003) . . . . . . . . . . . . . . . . 31
1.1.8 ATI R520 an Advanced GPU (October 2005) . . . . . . . . . . 31
1.1.8.1 Avivo Video Engine . . . . . . . . . . . . . . . . . . . . . . . 44
1.1.8.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.1.8.3 Nvidia’s NV40 GPU (2005–2008) . . . . . . . . . . 45
1.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
List of Figures
Fig. 1.17 Xabre 600 AIB with similar layout to ATI and Nvidia.
Courtesy of Zoltek . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Fig. 1.18 SiS’s Xabre vertex shader data flow between CPU and GPU . . . 25
Fig. 1.19 SiS’s competitive market position . . . . . . . . . . . . . . . . . . . . . . . . . 25
Fig. 1.20 Nvidia’s NV30-based GeForce FX 5900 with heat sink
and fan removed. Courtesy of iXBT . . . . . . . . . . . . . . . . . . . . . . . 28
Fig. 1.21 Nvidia NV30 block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Fig. 1.22 Final Fantasy used subdivision rendering for skin tone.
Courtesy of Nvidia [14] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Fig. 1.23 ATI R520 ring bus memory controller. The GDDR is
connected at the four ring stops. (Source ATI) . . . . . . . . . . . . . . . 34
Fig. 1.24 ATI R520 block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Fig. 1.25 ATI R520 thread size and dynamic branching efficiency
was improved with ultra-threading. Courtesy of ATI . . . . . . . . . . 37
Fig. 1.26 ATI R520 vertex shader engine . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Fig. 1.27 Making things look brighter than they are. Courtesy of ATI . . . . 39
Fig. 1.28 Inside the abandoned church with HDR on. Courtesy
of Valve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Fig. 1.29 Inside the abandoned church with HDR off. Courtesy
of Valve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Fig. 1.30 Different modes of anti-aliasing. Courtesy of Valve . . . . . . . . . . . 41
Fig. 1.31 ATI’s special class of bump mapping . . . . . . . . . . . . . . . . . . . . . . 42
Fig. 1.32 ATI’s Ruby red CrossFire—limited production. Courtesy
of ATI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Fig. 1.33 Nvidia NV40 Curie vertex and fragment processor block
diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Fig. 1.34 Nvidia’s NV40 Curie-based GeForce 6800 XT AIB.
Courtesy of TechPowerUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Fig. 1.35 Nvidia Curie block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Fig. 2.1 Tony Tamasi. Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Fig. 2.2 GPU architecture progression, first and second era.
Courtesy of Tony Tamasi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Fig. 2.3 Evolution from first-era to third-era GPU design . . . . . . . . . . . . . 53
Fig. 2.4 Nvidia’s G80 unified shader GPU—a sea of processors . . . . . . . 54
Fig. 2.5 Nvidia GeForce 8800 Ultra with the heatsink removed
showing the 12 memory chips surrounding the GPU.
Courtesy of Hyins—Public Domain, Wikimedia . . . . . . . . . . . . . 55
Fig. 2.6 Nvidia’s GT200 streaming multiprocessor . . . . . . . . . . . . . . . . . . 56
Fig. 2.7 Evolution of Nvidia’s logo, 1993 to 2006 (left) and 2006
on (right). Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Fig. 2.8 Daniel Pohl demonstrating Quake running ray-traced
in real time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Fig. 2.9 Intel Larrabee AIB. Courtesy of the VGA Museum . . . . . . . . . . 58
Fig. 2.10 General organization of the Larrabee many-core
architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Fig. 2.40 Geometry with red boxes is sufficiently far from the camera,
and therefore, it is of minor importance to the overall
image. Thus, the color shading frequency could be reduced
(using CPS with no noticeable effect on the visual quality
or the frame rate). Courtesy of Intel . . . . . . . . . . . . . . . . . . . . . . . . 97
Fig. 2.41 Position only tile-based rendering (PTBR) block diagram . . . . . . 97
Fig. 3.1 The rise and fall of mobile graphics chip and intellectual
property (IP) suppliers versus market growth . . . . . . . . . . . . . . . . 102
Fig. 3.2 Mobile devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Fig. 3.3 Sources of mobile GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Fig. 3.4 Big jump in GPU power efficiency. Courtesy of Imagination
Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Fig. 3.5 Tile region protection isolates critical functions from each
other. Courtesy of Imagination Technologies . . . . . . . . . . . . . . . . 105
Fig. 3.6 Imagination’s BXT MC4 block diagram. Courtesy
of Imagination Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Fig. 3.7 The B boxes of Imagination. Courtesy of Imagination
Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Fig. 3.8 In 2020, Imagination had the broadest range of GPU IP
designs available. Courtesy of Imagination Technologies . . . . . . 107
Fig. 3.9 Mali in Arm, circa 2006 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Fig. 3.10 Falanx Arm Mali block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Fig. 3.11 Arm Mali’s graphics stack with MIDlets . . . . . . . . . . . . . . . . . . . . 114
Fig. 3.12 The Mali-400 could share the load on fragments . . . . . . . . . . . . . 116
Fig. 3.13 Fujitsu MB86292 GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Fig. 3.14 Fujitsu’s MB86R01 SoC Jade . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Fig. 3.15 MediaQ MQ-200 block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Fig. 3.16 MediaQ MQ-200 drawing engine . . . . . . . . . . . . . . . . . . . . . . . . . 122
Fig. 3.17 Symbolic block diagram of the Nvidia TEGRA 6x0 (2007) . . . . 126
Fig. 3.18 Nvidia’s Tegra road map (2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Fig. 3.19 Nvidia offered its X-Jet software development toolkit
(SDK) software stack for automotive development
on the Jetson platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Fig. 3.20 Mercedes concept car of the future. Courtesy of Nvidia . . . . . . . 130
Fig. 3.21 In the back row from left: Petri Nordlund, Kaj Tuomi,
and Mika Tuomi from Bitboys. In the front row, Falanx,
from left: unknown (guy in blue jeans), Mario Blazevic,
Jørn Nystad, Edvard Sørgård, and Borgar Ljosland.
Courtesy of Borgar Ljosland . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Fig. 3.22 Bitboys’ Acceleon handheld prototype and the art it is
rendering. Courtesy of Petri Nordlund . . . . . . . . . . . . . . . . . . . . . 133
Fig. 3.23 Bitboys’ G40 mobile GPU organization . . . . . . . . . . . . . . . . . . . . 134
Fig. 3.24 Mikko Sarri 2009. Courtesy Mikko Sarri . . . . . . . . . . . . . . . . . . . 138
Fig. 3.25 Ideal’s Joe Palooka punching bag. Source
thepeoplehistory.com . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Fig. 8.3 Final Fantasy 2001 and Tomb Raider 2013. Courtesy
of Wikipedia and Crystal Dynamics . . . . . . . . . . . . . . . . . . . . . . . 362
Fig. 8.4 Death Stranding 2020 and enemies. Courtesy of Sony
Interactive Entertainment and Unity . . . . . . . . . . . . . . . . . . . . . . . 363
Fig. 8.5 Computational fluid dynamics is used to model and test
in a computer to find problems and opportunities. Courtesy
of Siemens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
List of Tables
Microsoft’s introduction of DirectX 8 in November 2000 kicked off the second era
of GPU development which started in early 2001. ATI, Nvidia, and others showed
Microsoft and Khronos plans for future GPUs, and programmability was a big part
of it. Moore’s law was enabling the GPU companies to add more transistors. And
those tens of millions of new transistors were being put to clever use. For instance,
more registers, caches, and control logic were added to the GPU. The advances
would convert the GPU from a fixed function graphics controller that could do a
little geometry to a first-class parallel processor, a major step in the evolution of the
GPU and the computer graphics industry.
Perhaps some of the best news was that the industry had learned its lesson and
was not taking any chances on proprietary APIs such as 3dfx’s, Rendition’s, and
others (discussed in Book two). In the early 2000s, OpenGL was ahead of DirectX in
terms of advanced graphics functions. OpenGL 1.1 (1997) already had programmable
vertex shaders in support of workstations.
But there was a gap between the professional graphics applications and consumer
applications like games. Even though programmable shaders would make the games
look better and run faster, the game developers, never looking for more work and
always fighting deadlines, were slow to adopt the capability, with a few exceptions.
In the case of professional tools, graphics capabilities represented a competitive edge
in some cases.
A vertex is a point of a triangle where two edges meet. A triangle has three vertices.
A vertex shader is a processor that transforms shapes and positions into 3D drawing
coordinates. It transforms a vertex’s attributes—such as position, direction, color,
and texture—from their initial virtual space to the display space. It allows the
original objects to be distorted or reshaped in any manner.
Vertex shaders do not change the type of data; they change the values of the data.
Therefore, a vertex emerges from the process (called shading) with a different color,
different textures, or a different position in space. The vertex shader is sometimes
referred to as an assembler, as is discussed in Book two.
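The core of what a vertex shader does can be sketched as a matrix multiply that carries a model-space position into screen space. The matrix and vertex values below are purely illustrative, not from the book:

```python
# Minimal sketch of what a fixed-function T&L stage or a simple vertex
# shader does: multiply a model-space vertex by a combined
# model-view-projection (MVP) matrix. Pure Python, illustrative values.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vertex."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# Hypothetical MVP: a uniform scale of 2 plus a translation of (1, 0, 0).
mvp = [
    [2, 0, 0, 1],
    [0, 2, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 1],
]

vertex = [1.0, 1.0, 0.0, 1.0]          # homogeneous model-space position
clip = mat_vec(mvp, vertex)            # position after "shading"
ndc = [c / clip[3] for c in clip[:3]]  # perspective divide
print(ndc)                             # [3.0, 2.0, 0.0]
```

The values change, but the vertex stays a vertex, which is the point made above: shading alters the data's values, not its type.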
A triangle is a primitive formed by three vertices.
In both the OpenGL and Direct3D rendering pipelines, the geometry shader
operates after the vertex shader and before the fragment/pixel shader.
The primary purpose of geometry shaders is to create new primitives as part of
the tessellation process. Geometry shaders are started by an incoming primitive (a
triangle), which comes with all the data on a specific primitive as well as adjacent
ones.
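The kind of primitive amplification a geometry shader performs can be sketched with a hypothetical routine that takes one triangle in and emits four, splitting each edge at its midpoint (one step of the subdivision used in tessellation schemes):

```python
# One subdivision step: a single triangle in, four triangles out.
# This mimics a geometry shader's ability to emit new primitives.

def midpoint(a, b):
    return tuple((a[i] + b[i]) / 2 for i in range(3))

def subdivide(tri):
    """tri is a tuple of three (x, y, z) vertices; returns 4 triangles."""
    v0, v1, v2 = tri
    m01, m12, m20 = midpoint(v0, v1), midpoint(v1, v2), midpoint(v2, v0)
    return [
        (v0, m01, m20),   # three corner triangles
        (v1, m12, m01),
        (v2, m20, m12),
        (m01, m12, m20),  # center triangle
    ]

tri = ((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0))
out = subdivide(tri)
print(len(out))  # 4
```

Applied recursively, each pass quadruples the triangle count, which is why hardware tessellation can smooth a coarse model without changing the artwork.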
Before GPUs, vertex and geometry shaders were run in a dedicated coprocessor
or by the floating-point processors in the CPUs. CPUs that had parallel processors
known as single-instruction, multiple-data (SIMD) processing stages were also used.
Examples of second-era GPUs are given in Table 1.1.
Due to advances in semiconductor manufacturing, the pace of change was much
faster in the second era of GPUs. The GPUs ran faster, had more memory internally
and externally, and drove higher resolution displays. It was a golden era, but not the
last one.
Having led the way with the GeForce 256 and its follow-on, GeForce 2, Nvidia
surprised the industry with its powerful GeForce 3. It was the first GPU with
programmable vertex shaders and ushered in new capabilities. The GPU pushed
the display resolution up to 1800 × 1400 × 32 at 70 Hz refresh and 1600 × 1200
× 32 at 85 Hz, a range usually reserved for workstations. Few gamers had such a
1 With the NV04, Nvidia adopted the policy of using famous scientists’ names for the code names
of their GPUs.
Fig. 1.1 Nvidia NV20-based GeForce 3 AIB with AGP4x bus. Courtesy of TechPowerUp
So, the Kelvin-based NV20 GPU on the GeForce 3, with its nfiniteFX engine, was
Nvidia’s first second-era GPU. It was significant because it was Nvidia’s first GPU
with programmable vertex shaders and the first to market with a 150 nm chip.
On August 14, 2001, after the November 2000 introduction of Microsoft’s Direct3D
8.1 Shader Model 1.4, ATI introduced the R200 GPU, code named Morpheus.
Direct3D 8.1 contained many powerful new 3D graphics features, such as vertex
shaders, pixel shaders, fog, bump mapping, and texture mapping. The AIB was also
OpenGL 1.3 compatible.
ATI took Direct3D a step further and introduced TruForm, which added hardware
acceleration of tessellation. ATI included TruForm in Radeon 8500 and later products
(Fig. 1.2).
Manufactured in TSMC’s 150 nm process, the R200 used the Rage7 architecture
(Fig. 1.3). The 120 mm2 GPU had 60 million transistors, four-pixel shaders, two
vertex shaders, two texture-mapping units, and four ROP engines. The GPU drew
92 watts, which was close to the 110 W limit of AGP4x.
However, what made the R200 significant was its TruForm hardware tessellation
acceleration and its use of N-patches.
Fig. 1.2 ATI R200-based Radeon 8500 AIB. Courtesy of TechPowerUp
Fig. 1.3 ATI R200 block diagram. The chip had 60 million transistors, four-pixel shaders, two
vertex shaders, two texture-mapping units, and four ROP engines
Fig. 1.4 Tessellation can reduce or expand the number of triangles (polygons) in a 3D model to
improve realism or increase performance
2 A technique developed by Henri Gouraud in the early 1970s that computes a shaded surface based
on the color and illumination at the corners of every triangle. Gouraud shading is the simplest
rendering method and is computed faster than Phong shading. It does not produce shadows or
reflections. The surface normals at the triangle’s points are used to create RGB values, which are
averaged across the triangle’s surface.
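The interpolation step the footnote describes can be sketched as a barycentric weighting of the three corner colors; the colors and weights below are illustrative only:

```python
# Gouraud shading sketch: compute the RGB of an interior point by
# weighting the three vertex colors with its barycentric coordinates.

def gouraud(colors, bary):
    """colors: three (r, g, b) vertex colors; bary: (w0, w1, w2), sum = 1."""
    return tuple(
        sum(w * c[i] for w, c in zip(bary, colors)) for i in range(3)
    )

corner_colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]  # red, green, blue

# At a vertex, that corner's weight is 1, so its color comes back unchanged.
assert gouraud(corner_colors, (1.0, 0.0, 0.0)) == (255.0, 0.0, 0.0)

# An interior point is a blend of all three corners.
print(gouraud(corner_colors, (0.25, 0.25, 0.5)))  # (63.75, 63.75, 127.5)
```

Because only three lighting evaluations are needed per triangle, this is far cheaper than per-pixel (Phong) shading, which is the trade-off the footnote notes.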
The preprocessor operated in front of the T&L engine and extended the graphics
pipeline further into the application (Fig. 1.8). The first reaction to a software update
was skepticism, but ATI explained that online gaming providers could sell the update
to gamers and make a few extra dollars, so they would love it.
TruForm was a technology announcement, and no product was immediately asso-
ciated with it yet. However, the demos looked good, and the supporting white paper
[3] was also well done, so it was something ATI had been working on and preparing for
quite a while. The company said it was getting good responses from game developers.
The game developers were less resistant to the idea of developing for hardware since
there was so much less unique hardware in the market to create for, and DirectX 8
was the common denominator.
ATI offered TruForm up until the Radeon X1000 series. After that, they no
longer advertised it as a hardware feature. Nonetheless, the Radeon 9500 and subse-
quent AIBs compatible with Shader Model 3.0 had a render-to-vertex buffer capa-
bility, which could be employed for tessellation acceleration. Tessellation returned
in ATI’s Xenos processor in the Xbox 360 and the 2007 Radeon R600 GPUs [4].
Game developers did not embrace ATI’s TruForm because they would have had
to design their games initially with TruForm procedures to get the best results—it
Fig. 1.8 ATI’s TruForm was a preprocessor in an expanding chain of graphics functions
was not just a patch to the game. Models used to create objects and scenes in the
game needed flags assigned to them to indicate which ones should be tessellated.
And since only ATI had it, the feature, though valuable and useful, did not gain the
support of developers.
The R200 was an advanced design and ATI’s second GPU with a vertex shader.
Sadly, because of minimal adoption by game developers, ATI dropped TruForm
support from its future hardware. The world would have to wait until the third era of
GPUs in 2009 when DirectX 11 was released. Being first did not always pay off.
Nvidia implemented Direct3D 7-class fixed function T&L in the vertex shaders.
The company also included an improved dual-monitor capability (TwinView)
adapted from the GeForce 2 MX.
Fig. 1.9 VisionTek Nvidia NV25-based GeForce Ti 4200 AIB. Courtesy of Hyins for Wikipedia
its performance advantage, the sexy name, and Nvidia’s marketing strength, the Ti
4600 quickly became the most popular AIB and reduced the gamer’s interest in ATI’s
Radeon 8500. That situation would change when ATI introduced the Radeon 9700.
1.1.4 ATI’s R300 Radeon 9700 and the VPU (August 2002)
Nvidia promoted the GPU name to enormous success. If asked in the early 2000s
what a GPU was, most people would say a chip made by Nvidia. ATI sought to gain
product and market differentiation and not be the other GPU company. Dave Orton
knew he could not make ATI the GPU company, so he came up with a new term—the
VPU—the visual processing unit. The Radeon 9700 would be a VPU, not a GPU. The
VPU incorporated a technique ATI called vertex skinning for a more fluid movement
of polygons (Refer to the image of the leopards in Book two’s discussion of ATI’s
R100). ATI and 3Dlabs/Creative Labs said the VPU would be the next generation
in graphics and visualization systems.
The ArtX team that developed the first IGP also developed the R300 Radeon 9700,
code named Khan. Built on the Rage 8 architecture, the ATI R300 GPU delivered
exceptional results and shipped on schedule. It was the first to offer DirectX 9.0 and
Shader Model 2.0, Vertex Shader 2.0, and Pixel Shader 2.0 compatibility. The R300
was the industry’s second device to offer AGP 8× capability; SiS’s Xabre 80/200/400
line was the first. And the R300 was the first part to use a flip-chip package, which
enabled high pin counts and faster performance.3
A VPU, said ATI, had to have several programmable elements; lots of memory
bandwidth; wide-bus DACs; and at least eight pipelines for parallel processing.
The R300 VPU (Fig. 1.11) was one of the most advanced graphics processors ever
created. Manufactured in TSMC’s 150 nm process, it had 107 million transistors. It
was a completely new architecture designed around the concepts of high bandwidth,
parallelism, efficiency, precision, and programmability—all the stuff needed to be a
VPU [5].
The chip had a 256-bit crossbar memory controller, multiple display outputs,
video-in and processing, and an AGP8X I/O, as depicted in Fig. 1.12.
The vertex processing engine had four parallel vertex shader pipelines coupled to
an optimized triangle setup engine. ATI said the R300 was the first chip to process
one vertex or triangle in a single clock cycle. Also, it was the first chip to implement
the 2.0 vertex shader specification introduced in DirectX 9.0.
Each vertex shader pipeline in the R300 could handle vector and scalar operations
simultaneously. Vector operations worked on values composed of multiple compo-
nents, such as 3D coordinates (x, y, and z components) and color (red, green, and blue
3 A chip packaging technique in which the active area of the chip is “flipped over” facing downward.
The flip chip allows for many interconnects with shorter distances than wire bonds, greatly reducing
inductance, the enemy of bandwidth. It is a packaging solution for high pin count, high-performance
chip package needs.
14 1 Introduction
Fig. 1.11 ATI R300 Radeon 9700 AIB. Notice heatsinks on the memory and similar layout to
Nvidia NV25-based GeForce Ti 4200 AIB, in Fig. 1.9. Courtesy of Wikimedia
Fig. 1.12 ATI R300 block diagram. The display interface included a multi-input LUTDAC
1.1 Programmable Vertex and Geometry Shaders … 15
color components). Scalar operations worked on values with just a single component.
Since vertex shaders typically included a mixture of vector and scalar operations, the
optimization could improve processing speed by up to 100%. Figure 1.13 illustrates
the organization.
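The co-issue arithmetic above can be sketched in a few lines. This is a toy scheduling model, not ATI's hardware logic; the op mix and the pairing rule are illustrative assumptions.

```python
# Toy model of the R300's vector/scalar co-issue: each clock can retire
# one vector op and one scalar op together (hypothetical simplification).

def serial_cycles(ops):
    """One op per clock: the baseline without co-issue."""
    return len(ops)

def coissue_cycles(ops):
    """Pair each vector op with a pending scalar op when possible."""
    vecs = sum(1 for kind in ops if kind == "vec")
    scalars = len(ops) - vecs
    # Paired ops share a cycle; leftover ops of either kind run alone.
    return max(vecs, scalars)

# A shader with an even mix of vector and scalar work...
mix = ["vec", "scalar"] * 8          # 16 ops
assert serial_cycles(mix) == 16
assert coissue_cycles(mix) == 8      # up to 100% speedup, as ATI claimed
```

With an all-vector workload the co-issue gains nothing, which is why the claimed speedup was "up to" 100% for typical mixed shaders.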
As mentioned, the device had four vertex engines, and all four outputs went to
the setup engine.
The vertex processing engine of R300-based Radeon 9700 also included support
for ATI’s latest version of TruForm. The higher-order surface technology smoothed
the curved surfaces of 3D characters, objects, and terrain by increasing the polygon
count through tessellation. By taking advantage of the parallel vertex processing of
the R300, such an algorithm could deliver more natural-looking 3D scenes without
requiring any changes to existing artwork.
TruForm also supported displacement mapping, which provided more control
over the shape of 3D objects and surfaces. It worked by modifying the positions
of vertices based on values stored in a displacement map.
ATI’s R300 was the first graphics processor able to render up to eight pixels
simultaneously. The VPU accomplished that with eight parallel, 96-bit rendering pipelines,
each with an independent texture and pixel shader engine. The texture unit of each
rendering pipeline could sample up to 16 textures in a single rendering pass.
Depending on the desired quality level, those
textures could be one, two, or three dimensional, with bilinear, trilinear, or anisotropic
filtering applied. The R300’s block diagram is shown in Fig. 1.14.
ATI said its DirectX 9.0 compatible pixel shader engines were designed to handle
floating-point operations and provide increased range and precision compared to the
integer operations used in earlier designs. That was because the engines had up to
96 bits of precision for all calculations, which was necessary for recreating
Hollywood studio-quality visual effects. Figure 1.14 shows a generalized block diagram
of one of eight R300 shaders.
The R300’s pixel shader engines also achieved greater efficiency by simultaneously
processing up to three instructions: one texture lookup, one texture address
operation, and one color operation. Since pixel shaders typically contain a mixture
Fig. 1.14 ATI’s R300 pixel shader engine; the chip had eight of these “pipes”
of those three procedures, that capability would ensure maximum engine utilization
and performance.
The R300 in the 9700 also used ATI’s SmartShader 2.0, the company’s second
generation of programmable vertex and pixel shader technology. The first
implementation of DirectX 9.0, SmartShader 2.0 was running a little ahead of the industry
(and especially of the game developers). ATI said it would be fully compatible with
current and future revisions of OpenGL. ATI positioned its new shader as enabling
movie-quality effects in real-time games and other interactive applications.
Of course, the R300 rendering engine also included ATI’s latest generation of
Smoothvision (also kicked up to version 2.0), the company’s anti-aliasing and
texture-filtering technology.
1.1.4.2 Z-Buffer
Reading and updating the z-buffer typically consumes more memory bandwidth
than any other part of the 3D rendering process, making it a significant performance
bottleneck.
One trick to reduce data transfers and avoid bandwidth starvation was z
compression. It was tricky because a lossless algorithm was needed to compress the
data sent to the z-buffer. ATI, Nvidia, and 3Dlabs each had a unique name for the
technique; ATI called its version HyperZ III.
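A minimal sketch of the idea behind lossless z compression: depth values within a screen tile are usually close together, so a tile can be stored as a base value plus small deltas. This is an illustrative delta scheme, not ATI's actual HyperZ algorithm.

```python
def compress_tile(zvals):
    """Store a tile as (base, delta bit-width, deltas) -- losslessly."""
    base = min(zvals)
    deltas = [z - base for z in zvals]
    width = max(d.bit_length() for d in deltas)
    return base, width, deltas

def decompress_tile(packed):
    """Exactly reconstruct the original depth values."""
    base, _, deltas = packed
    return [base + d for d in deltas]

tile = [100000, 100001, 100003, 100002]    # nearly planar depths
packed = compress_tile(tile)
assert decompress_tile(packed) == tile      # round trip is exact
# Raw: 4 values x 24 bits = 96 bits. Packed: one 24-bit base + 4 x 2-bit deltas.
assert packed[1] == 2
```

Because the reconstruction is exact, the depth test behaves identically to an uncompressed z-buffer, which is why losslessness was a hard requirement.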
The goal of ATI’s HyperZ technology was to reduce the memory bandwidth
consumed by the z-buffer, thereby increasing performance. It achieved a minimum
1.1.4.3 Video
If one was going to build a VPU, one had better do video and visualization, too.
If anyone knew that, it was ATI. Therefore, the R300 had an integrated video processing
engine. Using its video shader technology, ATI claimed to offer the first product that
could apply the power of programmable pixel shaders to enhance video capture and
playback in real time. It was one of the first, if not the first, true multimedia AIBs.
ATI used it for a wide range of new applications. It could do deblocking of
streaming Internet video, noise removal filtering for captured video, and adaptive
de-interlacing. It had algorithms for sharper and clearer TV and DVD playback.
And it could apply Photoshop-style filters in real time for video-editing applications.
Figure 1.15 shows the way ATI went about it.
3D rendering and video functions got processed by two separate sections of the
VPU. Each part had its own set of features.
One example of an application for video shader technology that ATI liked to
use was streaming video deblocking. Most streaming Internet video exhibits blocky
compression artifacts, which are especially noticeable in low bandwidth connections.
The R300, said ATI, could automatically filter out those artifacts, providing smoother,
clearer video images.
ATI pointed out that its programmable pixel shaders would enhance TV and DVD
playback quality by improving the adaptive de-interlacing algorithms used in other
ATI graphics chips. On captured video signals, video shader could apply real-time
noise filtering to produce cleaner video. And, the company added, the R300 also
offered interesting new possibilities for video editing. Image filtering effects such as
blurring, embossing, and outlining would get applied to video streams in real time.
The VPUs and GPUs of 2002 started looking increasingly like CPUs, and not the
least of that was how they managed memory. The R300 incorporated a new high-
performance 256-bit DDR memory interface, which ATI said could provide over
20 GB/second of graphics memory bandwidth. It was composed of four independent
64-bit memory channels, each of which could simultaneously write data to memory
or read data back into the graphics processor.
ATI followed Nvidia’s example, which copied it from SGI, and designed a crossbar
memory controller that divided the 256-bit-wide memory interface into four subunits
that could access the memory separately, making memory accesses more efficient.
The sophisticated sequencer logic ensured the utilization of all four channels for
maximum efficiency.
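The benefit of splitting one wide interface into independent channels can be sketched with address interleaving: consecutive bursts land on different channels, so several accesses can proceed in parallel instead of queuing on a single bus. The 32-byte granularity here is an assumption for illustration, not ATI's documented mapping.

```python
CHANNELS = 4          # four independent 64-bit channels
GRANULARITY = 32      # assumed interleave granularity in bytes

def channel_for(addr):
    """Map a byte address to one of the four memory channels."""
    return (addr // GRANULARITY) % CHANNELS

# A sequential stream of 32-byte bursts round-robins over all channels,
# so four accesses can be in flight at once.
stream = [channel_for(a) for a in range(0, 256, GRANULARITY)]
assert stream == [0, 1, 2, 3, 0, 1, 2, 3]
```

For reference, a 256-bit interface at roughly 620 MHz effective DDR rate works out to about 19.8 GB/s, consistent with ATI's "over 20 GB/second" marketing figure.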
ATI was a leader in support of multiple displays, and with its acquisition of Appian
Graphics’ HydraVision technology for $2 million in 2001, it put together a
comprehensive package. The R300 display interface supported a wide range of display
configurations and new technologies that improved the quality of displayed images.
The general organizational block diagram is shown in Fig. 1.16.
First to offer HDR. For one thing, the R300 had a new, high-precision 10-bit-per-
color-channel frame buffer format enabled by DirectX 9.0. That enhancement of the
standard 32-bpp color format (which supports just 8 bits per color channel), known
as HDR, could represent over one billion distinct colors. Although used in high-
end professional graphics systems and workstations, HDR only came to consumer
computers and TVs in 2019.
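The billion-color claim follows directly from the bit budget: 10 bits each for red, green, and blue give 2^30 combinations. A sketch of packing such a value into a 32-bit word follows; the field order is an assumption for illustration.

```python
def pack_rgb10a2(r, g, b, a):
    """Pack 10-bit R, G, B and a 2-bit alpha into one 32-bit word."""
    assert all(0 <= c <= 1023 for c in (r, g, b)) and 0 <= a <= 3
    return (a << 30) | (b << 20) | (g << 10) | r

def unpack_rgb10a2(word):
    """Recover the (r, g, b, a) fields from the packed word."""
    return word & 0x3FF, (word >> 10) & 0x3FF, (word >> 20) & 0x3FF, word >> 30

assert 2 ** 30 == 1_073_741_824            # over one billion distinct colors
assert unpack_rgb10a2(pack_rgb10a2(512, 256, 1023, 3)) == (512, 256, 1023, 3)
```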
Also, the R300 had two integrated display controllers, allowing it to drive two
displays with entirely independent images, resolutions, and refresh rates. The chip
also included the following:
Fig. 1.16 ATI R300 video processing engine showing all the outputs
• Two 10-bit, 400 MHz DACs for driving high-resolution analog displays through
the VGA port
• A 165-MHz TMDS transmitter for driving digital displays through the DVI port
• TV output capability at resolutions up to 1024 × 768, supporting NTSC/PAL/
SECAM formats.
The R300 was a significant leap from the previous chip (R200; consider Table 1.2).
The other level of comparison is the API. The company was proud to say that the
R300 was designed for Microsoft’s upcoming DirectX 9 specification (Table 1.3).
The step from DirectX 8.1 to DX9 was a significant improvement in exposing
and exploiting the power of the new VPUs and GPUs.
In DirectX 9, vertex shader programs could be much more complex than before.
The new vertex shader specifications added flow control, more constants, and 1024
vertex shader instructions per program. While the new pixel shaders would not allow
flow control, the maximum number of pixel shader instructions had grown to 160.
However, the significant feature of DirectX 9 was the introduction of RGBA values in
64-bit (16-bit FP per color) and 128-bit (32-bit FP per color) floating-point precision.
That tremendous increase in color precision enabled an incredible number of new
visual effects and improved picture quality.
Although DirectX 9 was not introduced until later that year, there were games that
used OpenGL and could take advantage of the R300 to accelerate those features.
Rick Bergman, ATI’s Senior Vice President of Marketing and General Manager,
said ATI was moving DirectX 8.1 into the mainstream and making DirectX 9 available
for the enthusiast segment when it came out.
ATI partnered with graphics board companies to bring various products and
configurations to market in all parts of the world. Those product introductions marked
the first time ATI had introduced a family of products simultaneously with its board
partners. AIB manufacturers worked with ATI to market the Radeon 9000, Radeon
9000 PRO, and Radeon 9700-based products.
The company also introduced a mainstream AIB version with a slower chip and
less memory, called the Radeon 9000 Pro. It featured 64 MB of DDR memory and
flexible dual-display support with DVI-I, VGA, and TV-out, with a suggested retail
price of $149.
RenderMonkey was an open, extensible shader development tool for current and
future hardware that ATI said enabled programmers and artists to collaborate on
creating real-time shader effects.
ATI showed the tool publicly for the first time at SIGGRAPH 2002 in San Antonio,
and a beta version was available for download from ATI shortly after SIGGRAPH—
free of charge.
The toolkit could be used as a plug-in with any of the 3D-development suites of
the day and would generate vertex and pixel shader code.
It had a real-time viewer that was comfortable for developers and artists to use
and made the development of titles that used vertex and pixel shaders a lot easier
than it had been. Additionally, ATI included a compiler for RenderMan, and another
compiler for Maya was in the works. Nvidia offered its Cg; 3Dlabs also had a
high-level programming language, and thus began the shader compiler wars.
1.1.4.7 Summary
ATI said it was taking back the high ground and believed the R300 was ample proof.
“The crown is ours to take,” said Dave Orton, ATI’s CEO.
Orton knew Asia was where the battle would be. ATI did an excellent job
repositioning itself as a semiconductor supplier to the original design manufacturers
(ODMs) and original equipment manufacturers (OEMs) while still leaving room to
maneuver in the ATI high-end AIB retail market. But there were penalties for being
first to market; still, the R300 gave Nvidia a severe challenge and forced Nvidia into
a more aggressive position. But the most exciting part was the remarkable
comeback of ATI. It took almost two years. During that time, the company continued to
produce excellent products, expand into new markets, reorganize the management
and engineering teams, and eke out a modest profit most of the time—overall, one of
the most impressive corporate moves ever seen. A few years later, ATI was acquired
by AMD.
The Radeon 9700 was further honored by being selected by venerable E&S for
its simFUSION 6000q used in simulation and platinum systems [6]. And Silicon
Graphics used up to 32 ATI R3xx VPUs in their high-end Onyx 4 “UltimateVision”
systems, which came out in 2003 [7].
SiS released its Xabre 400/200/80 in September 2002 and was the first to market a
GPU supporting the (1997) Accelerated Graphics Port (AGP) 8X bus, a variation
on the (1992) PCI bus standard that provided a dedicated path between the graphics
slot and the processor, enabling faster graphics performance. In November, at the Comdex
conference in Las Vegas, the company showed its Xabre 600, which supported
AGP 8X and DirectX 8.1. The 400 was built on a 150 nm process, whereas the 600
series was made in SiS’s new 130 nm process.
SiS claimed that its 30-million-transistor Xabre 600 was the most AGP-compliant
GPU on the market, supporting AGP 8X 533 MHz with a 16-stage pipeline, full
sideband function, and dual 300 MHz clocks for the engine and the 128-bit memory
I/O. The chip also integrated a 256-bit 3D graphics engine that the company claimed
was loaded with new SiS proprietary technologies for the mainstream gaming market.
The chip had bump mapping, cubic mapping, and volume texture. It offered texture
transparency, blending, wrapping, mirror, clamping, fogging, alpha blending, and
2X/3X/4X full-scene anti-aliasing. The 256-bit engine had four programmable pixel
rendering pipelines and eight texture units.
Peak polygon rate was specified at 30 M polygons/sec at 1 pixel/polygon with
Gouraud shaded, point-sampled, linear, and bilinear texture mapping. The peak fill
rate was 1200 M pixel/sec and 2400 M texture/sec at 10,000 pixels/polygon with
Gouraud shading. The following are the highlights of the chip.
• Four pipelines could do eight textures per pass
• Up to 11.2 GB/sec of memory bandwidth
• 4 × 32-bit DDR2 memory controllers running at ~700 MHz
• AGP 8X support
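SiS's headline numbers are consistent with the clocks and unit counts above, which a quick check confirms (treating the ~700 MHz DDR2 figure as an effective data rate):

```python
# Peak rates follow from units x clock: four pixel pipelines and eight
# texture units, each producing one result per 300 MHz core clock.
pipelines, texture_units, core_mhz = 4, 8, 300
assert pipelines * core_mhz == 1200        # Mpixels/s peak fill rate
assert texture_units * core_mhz == 2400    # Mtexels/s peak texture rate

# Four 32-bit DDR2 controllers at ~700 MHz effective form a 128-bit bus.
bus_bits, effective_mhz = 4 * 32, 700
bandwidth_gb_s = bus_bits / 8 * effective_mhz / 1000
assert abs(bandwidth_gb_s - 11.2) < 1e-9   # matches the 11.2 GB/s figure
```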
Fig. 1.17 Xabre 600 AIB with similar layout to ATI and Nvidia. Courtesy of Zoltek
Fig. 1.18 SiS’s Xabre vertex shader data flow between CPU and GPU
As GPUs became more powerful, the need for memory bandwidth increased even
faster. To meet that need, in late 2000, ATI announced HyperZ, and Nvidia announced
its Lightspeed Memory Architecture in the GeForce 3 in February 2001. SiS also
offered a similar capability and increased its memory bus to 128-bit with 64 MB
of 300 MHz DDR memory. The Xabre 600 realized a peak of 9.6 GB per second
memory throughput.
The chip supported DirectX 8.1 (Pixel Shader version 1.3). Although it had a
pixel engine (“Pixelizer”), it did not support DirectX 9.0. The company targeted the
chip at the mainstream user and assumed DirectX 9 would not be available for a year.
As Fig. 1.19 shows, based on SiS data, SiS had a competitive market and product
plan.
Chris Lin, Vice President of the Multimedia Product Division at SiS, said in
an interview, “With the Xabre family, our goal was to break all the bottlenecks
gamers experience and maximize bandwidth speed, graphics clarity, reliability, and
performance. The Xabre 600 built on that foundation and took the experience up
another notch with even greater speed and performance” [8].
The SiS Xabre was offered in a 37.5 mm2 package and priced in the $30 range.
The company said the chip would be available in small quantities (sampling) in Q1
2003. SiS planned to introduce its DirectX 9.0 part in Q3 2003 and said it would
have eight graphic pipelines and use DDR2.
Thomas Tsui, Director of SiS’s Multimedia Division, said in June 2001 that the
math was in his favor in that Intel’s competitors struggled to make money on the
unpleasant ASPs that were coming out of the integrated graphics chipsets (IGCs)
market [9]. Tsui believed the only way to achieve the kind of margins that made
building IGCs worthwhile was by having control over the manufacturing to achieve
the required economies of scale. In contrast to Nvidia and ATI, SiS had its own
fab. SiS’s integrated 370S and 730SE had integrated 3D and I/O and were compatible
with AMD’s 266 MHz FSB. In addition to the 256-bit SiS 315, the company
produced a low-cost discrete graphics chip with T&L for the same market that the
STMicroelectronics—Imagination Technologies Kyro II—was after. SiS expected
the integrated 730 product line to increase the company’s revenues.
The SiS 301B did video processing and output in a separate companion chip, another
cost-cutting decision because not every customer wanted or needed video. Also, as
Lin pointed out, the video output sections of a chip did not scale with Moore’s law
and required a certain amount of floor space for resistors and capacitors (RCs). The
video bridge of the SiS 301B chip’s output section had a TV encoder and a 375 MHz
DAC. The chip’s output included a motion-adaptive video processor, with de-
interlacing, half-downscaling functions, and a four-field per-pixel motion-detection
de-interlace function.
On the input side, the Xabre 600 had an MPEG-2/1 video decoder with a motion
compensation layer decoding architecture that the company said could deliver up to
20 Mb/sec bit rate decoding, making it capable of VCD, DVD, and HDTV
decoding.
At the beginning of 2003, SiS reported record sales attributed mainly to the Xabre
600. In March 2003, the company announced its 130 nm, DirectX 9, AGP 8X,
300 MHz Xabre 660 (AKA Xabre II), which would sell for $60, twice the price of
the original Xabre 600.
1.1.5.2 Summary
The Xabre line was the last dGPU and AIB from SiS; the company spun off its
graphics division (renamed XGI), which merged with Trident Graphics a couple of
months later, as discussed earlier in this chapter (SiS’s first PC-based IGP).
Most of the XGI products were disappointing and underdelivered. The one
exception was the entry-level V3, which offered performance equal to the GeForce FX
5200 Ultra and Radeon 9200.
XGI introduced the Volari 8300 in late 2005, which was more competitive with
the Radeon X300SE and the GeForce 6200. However, XGI could not sell enough to
sustain itself, and in October 2010, the company was reabsorbed back into SiS.
By 2003, the PC GPU market had consolidated into two main suppliers: ATI and
Nvidia. All traces of other suppliers had faded away except for integrated graphics in
chipsets. ATI and Nvidia kept introducing new GPUs on a relatively regular schedule,
mostly in sync with new process nodes introduced by TSMC and new versions of
Microsoft’s DirectX API. ATI and Nvidia would show and discuss advance plans
with Microsoft for forthcoming GPUs. And if Microsoft liked the ideas, it would
incorporate them into the next version of the DirectX API. In that way, when a new
GPU with new features came out, the API was waiting for it. Game developers also
received advanced information about the new GPUs and APIs, but game development
always took longer than anyone estimated or wanted. Therefore, when a new GPU
was introduced, the AIB, PC suppliers, and Microsoft would have to wait for games
to exploit the new features. Everything would sync up occasionally, and all three
elements—API, GPU, and games—would show up simultaneously, but that was
a rarity. Often ATI and Nvidia would go to game developers and help them with
programming the new features.
Nvidia’s NV30 to NV38 GPUs (code-named Rankine), used in the GeForce FX
series, were the fifth generation of the popular line.
The NV30 should have been released in August 2002, about the same time as
ATI’s Radeon 9700. However, Nvidia experienced start-up problems and high defect
rates with TSMC’s low-K 130 nm process, and that delayed Nvidia’s release. Also,
while Nvidia was trying to get the NV30 out, it was developing a special version of
the chip for Microsoft’s Xbox, which spread Nvidia’s engineering resources a little
thin. The Nvidia GeForce FX 5900 AIB with NV30 is shown in Fig. 1.20.
Nvidia switched fabs and went to IBM, which had a more conventional
fluorosilicate glass (FSG) low-K 130 nm process [10]. That eliminated the defects but
delayed the introduction of the new chip.
Fig. 1.20 Nvidia’s NV30-based GeForce FX 5900 with heat sink and fan removed. Courtesy of
iXBT
The company announced the NV30-based GeForce FX 5800, its first GDDR2-based
AIB, in January 2003, several months after ATI released its Radeon 9700 DirectX 9
architecture in August 2002. Nvidia shipped the 500 MHz, 128 MB GeForce FX
5800 AIB on March 6, 2003. Built on the 130 nm process, the 135 million transistor
GPU had a 128-bit memory bus and supported DirectX 9.0a with Vertex Shader 2.0
and Pixel Shader 2.0. The AIB was compatible with AGP 8x. The NV30’s block
diagram is shown in Fig. 1.21.
The NV30-based GeForce 5x00 AIBs were Nvidia’s response to ATI’s Radeon
9700 Pro, but Nvidia was several months late to market. When it arrived, it
outperformed ATI’s R300-based Radeon 9700 Pro by 10% in some tests and
underperformed relative to the R300 in many other tests [11]. The problems that beset the
NV30’s launch were not totally under Nvidia’s control. It takes a long time to design
and build a GPU, and Nvidia decided to use TSMC’s 130 nm process one or two
years earlier. ATI was conservative with the R300 and stayed with a proven 150 nm
process.
1.1.7.1 CineFX
One of the most noteworthy features to be introduced with the NV30 was CineFX.
Nvidia postulated at the time that interactive PC graphics were approaching the
realism of the computer-generated imagery (CGI) and cinematic shading used in
cinema for special effects and even feature-length movies.
One such movie was Final Fantasy: The Spirits Within, a 2001 computer-animated
science fiction film directed by Hironobu Sakaguchi, the creator of the Final Fantasy
franchise, at Square Co., Ltd. Square was a well-known and respected Japanese video
game company founded in September 1986 by Masafumi Miyamoto. The 2001 film
was the first computer-animated feature film with photo-realistic effects and images.
It was also the most expensive video game-inspired film for the next nine years.
Square rendered the movie with the most advanced processing capabilities for film
animation at that time [12]. The movie received poor reviews but was lauded for its
characters’ realism. Nonetheless, it did not do well at the box office and has been
blamed for the failure of Square Pictures, resulting in the merger of Square and Enix
[13] (Fig. 1.22).
Fig. 1.22 Final Fantasy used subdivision rendering for skin tone. Courtesy of Nvidia [14]
Real-time cinematic shading required new levels of features and performance such
as advanced programmability, high-precision color, high-level shading language,
highly efficient architecture, and high bandwidth to system memory and the CPU.
With the introduction of programmable vertex shaders, it was possible to issue over
65,000 instructions, giving great control to game developers.
Along with the expanded programming range, Nvidia introduced support for
16- or 32-bit floating-point components and 64-bit and 128-bit FP color precision.
That 16-bit floating-point format was the same format that Pixar and ILM used for
films, the so-called s10e5 representation. And, of course, it had high-dynamic-range
illumination.
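The s10e5 layout (1 sign bit, 5 exponent bits, 10 fraction bits) matches the IEEE 754 half-precision format, which Python's struct module can expose directly:

```python
import struct

def half_bits(x):
    """Return the 16-bit pattern of x in the s10e5 (half-float) format."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

bits = half_bits(1.0)
sign, exponent, fraction = bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF
# 1.0 encodes as sign 0, biased exponent 15 (bias 15), fraction 0.
assert (sign, exponent, fraction) == (0, 15, 0)

# The format spans roughly 2**-14 up to 65504 -- far more dynamic range
# than the 0-255 integers of 8-bit-per-channel color.
assert struct.unpack("<e", struct.pack("<e", 65504.0))[0] == 65504.0
```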
CineFX could render a vertex array and offered displacement mapping, particle
systems, and even ray tracing. Nvidia heralded it as the convergence of film and
real-time rendering.
Nvidia also developed a programming language based on C that it called Cg. It
had a compiler, could be optimized into vertex and pixel shader
assembly code, and was compatible with DX9’s high-level shading language (HLSL).
NV30 was Nvidia’s first GPU that enabled unified shaders for the Direct3D 9 API
and the unified Shader Model of Direct3D. However, it was not a true unified shader
system; that would come later with DirectX 10 and the third era of GPUs discussed
in Chapters ten and thirteen.
The NV30 matched the ATI R300 on the number of textures that could be filtered
(bi-, tri-, or anisotropic) per clock tick; however, the NV30 ran at 500 MHz while
the ATI R300 was only at 325 MHz, giving the NV30 a considerable potential
performance gain.
There was speculation that significant input came from the 3dfx/Gigapixel engineers
for the NV30 design. And people tried to read symbolism into the FX name of the
AIBs, thinking it was a gesture to honor the designers from 3dfx who joined Nvidia.
Also, people noted that 3dfx had promoted cinematic effects (motion blur and other
effects done in hardware) and wondered if that was where Nvidia got the idea for the
name CineFX. But no one who was at Nvidia at the time could corroborate any of
those speculations.
The G70 GPU, based on the CineFX architecture and implemented in the GeForce
7800 series of AIBs, was introduced in 2005. It was significant because it offered
three kinds of programmable engines for various stages of the 3D pipeline plus
several additional stages of configurable and fixed-function logic. Researchers at
universities used it as a general-purpose parallel processor but had to program it with
the arcane OpenGL language, which showed how excited they were about the potential
payoff and acceleration. It also signaled the need for better software tools if the GPU
was to become a first-class compute accelerator.
1.1.7.2 Nvidia Enters the AIB Market with the GeForceFX (2003)
In early 2003, Nvidia said it would sell and certify the quality of NV30-, NV31-, and
NV34-based GeForce FX AIBs to its AIB partners [15]. That step was being taken,
Nvidia said, to control the quality of NV30-based boards. Some OEMs had used
cheap knockoff capacitors that failed after a few hours of operation, thus damaging
Nvidia’s reputation with customers. Nvidia’s tightened control over manufacturing
elevated warranty costs, but Nvidia has always been protective of its image, brand,
and standing in the community. Nvidia said it would still sell low-end chips (which
did not require such tight tolerances) to certain OEM board builders and required
them to tell Nvidia where they sourced their parts.
Naturally, rumors circulated about the move, and the gamer press and fans
speculated the move was motivated by the late arrival of the NV30 to the market.
Nvidia countered those rumors by saying that tighter control on the manufacturing
process of complete boards would result in faster availability. Ten years later, one of
Nvidia’s largest AIB partners, EVGA, had had enough of Nvidia’s competition and
margin squeezing and called it quits, refusing to produce or market Nvidia’s newest
generation, the RTX 4000 series AIBs.
API in the series, and the last one until Vista came out in November 2006. However,
Microsoft revealed Shader Model 3 (SM3) in May 2004.
Nvidia’s GeForce 6 series (code name NV40) launched on April 14, 2004, with
built-in support for Shader Model 3.0. At the Windows Hardware Engineering
Conference (WinHEC), on May 4, 2004, Nvidia demonstrated its 130 nm GeForce
6 Series with Shader Model 3.0. The company declared that the GeForce 6 was the
first and only GPU to take full advantage of Shader Model 3.0—and it was.
ATI also had a GPU that would be SM3 capable, the R520. ATI had worked with
Microsoft on the SM3 definition and design and knew everything about it. It was the
foundation for a line of DirectX 9.0c and OpenGL 2.0 Radeon X1000 AIBs. The
R520 was ATI’s first major architectural overhaul since the R300 and was optimized
for Shader Model 3.0. But ATI could not ship it until October 5, 2005.
Regardless of the timing, the R520 was a genuinely new and powerful graphics
architecture with a new memory manager design, ultra-threading, and flow control.
It also had expanded floating-point and video capabilities that enhanced HDR.
By this time, Dave Orton, the former CEO of ArtX and VP at Silicon Graphics,
was running ATI. The ArtX team within ATI came from SGI and had brought several
advanced computer graphics ideas and techniques. Orton wanted to apply them to
the GPU. The R520 was to be the showcase GPU for ATI.
ATI was aggressive and bold and the first to move to the new 90 nm fabrication.
New process nodes require new design libraries. Design libraries define how long
connecting wires can be and what, if any, capacitors are needed—at microscopic
sizes. It was tricky and hard to get right the first time. ATI did not get it right for
months. Maddening and insidious timing errors that defied synthesizers, analyzers,
and simulators, refusing to match up with the fab process and not staying in one place,
drove the engineering and manufacturing staff in Toronto, Boston, Santa Clara, and
Taipei nuts for six months.
R520 test chips coming from TSMC’s fab failed in ATI’s laboratory. The ATI
engineers discovered a problem after several very costly trials. Each time they fixed
it, they had to have a new mask generated. The mask is like a stencil that tells the
deposition process in the fab where to put the silicon. The problem was a faulty
90 nm chip design library from a supplier. ATI had to make multiple masks, called
re-spins, before it rid the design of the problem. The bug was finally tracked
down after 22 mask tries. So, where ATI (and then AMD) could have been the
first to market with their revolutionary unified shader design, they were the last.
However, ATI could take credit for being the first company in the graphics industry
to introduce three new chips in a new process simultaneously.
One of the side effects of a late rollout is that it shortens the product lifetime and
impacts the return on the engineering investment, which influences margins. But ATI
had no choice other than to get the next version (R580) out on schedule.
When ATI moved to the TSMC 90 nm process, it incorporated dynamic voltage
control (DVC). DVC allows the software to adjust the voltage level used by the GPU.
DVC is used to adjust the performance level of functions not needed for the current
application. It is a feature used today on all GPUs and SoCs.
The R520 was code-named Fudo, after a colorful industry reporter.
Here is an overview of what the GPU had:
• Ultra-threaded shader engine with support for DirectX 9 programmable vertex
and pixel shaders
• Fast dynamic branching
• DirectX Vertex Shader 3.0 functionality, with 1024 instructions
(unlimited with flow control)
• Single-cycle trigonometric operations (SIN & COS)
• Pixel Shader 3.0 running on ATI’s ultra-threaded pixel shader engine
• Single-precision 128-bit floating-point (FP32) processing, as well as 128-, 64-,
and 32-bit per-pixel floating-point color formats
• 16 textures per rendering pass
• 32 temporary and constant registers per pixel
• Facing register for two-sided lighting
• Multiple render target support
• Shadow volume rendering acceleration.
Manufactured in 90 nm on a 288 mm2 die, the R520 had 320 million transistors,
a core clock of 500 MHz, and a memory clock of 1 GHz. ATI always had leading
memory design capabilities, and its Vice President, Joe Macri, was a memory expert.
The chip employed a dual ring architecture, shown in Fig. 1.23.
ATI claimed the ring design would reduce routing complexity, permit higher clock
speeds (one ring-stop per pair of memory channels), and link directly to the memory
interface. It supported the fastest graphics memory of the time, GDDR3, at 48+
GB/second with the 512-bit ring bus (two 256-bit rings).
It had a new, larger cache design (one of the reasons for all the transistors) and
was fully associative for best performance. It also gave the R520 improved HyperZ
performance, better compression, hidden surface removal, and had programmable
arbitration logic.
The memory controller split the memory interface into eight 32-bit channels. That
provided a tighter coupling between the GDDR and the caches and further improved
efficiency. It also allowed for cache lines to map to any location in the external
GDDR. The associative cache reduced memory bandwidth requirements for all
operations (texture, color, Z, and stencil), and ATI took advantage of that.
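As a rough illustration of that kind of channel split, here is a toy interleaving scheme in Python; the 32-byte granule size and the mapping function are assumptions for illustration, not ATI's documented design:

```python
# Toy sketch of interleaving addresses across eight 32-bit memory channels.
# The 32-byte granule is an assumed block size, not ATI's documented mapping.

GRANULE = 32          # bytes per contiguous block sent to one channel (assumed)
NUM_CHANNELS = 8      # eight 32-bit channels, per the R520 description

def channel_for(address: int) -> int:
    """Map a byte address to a channel by interleaving 32-byte granules."""
    return (address // GRANULE) % NUM_CHANNELS

# Consecutive granules land on consecutive channels, spreading a linear
# stream (e.g., a texture scanline) across all eight channels.
spread = [channel_for(a) for a in range(0, 8 * GRANULE, GRANULE)]
```

The point of such interleaving is that any linear access pattern keeps all channels busy instead of hammering one.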
The R520 had an improved hierarchical z-buffer that detected and discarded
hidden pixels before shading (e.g., overlapping objects). The company claimed it
had developed a new technique for using floating point for improved precision,
which caught more hidden pixels than the previous design.
There was also improved Z compression in the R520. Z-buffer data is typically the
largest user of memory bandwidth, and bandwidth could be reduced by up to 8:1 using
lossless compression. ATI said that its new method achieved higher compression
ratios more often.
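The flavor of such lossless Z compression can be sketched with a toy base-plus-delta scheme (an illustrative stand-in, not ATI's actual method):

```python
# Toy lossless compression of a 4x4 tile of Z values: store the minimum plus
# per-sample deltas. Smooth surfaces have tiny deltas, so the tile shrinks;
# reconstruction is exact, i.e., lossless.

def compress_tile(z_tile):
    base = min(z_tile)
    deltas = [z - base for z in z_tile]
    bits = max(deltas).bit_length()      # bits needed per delta
    return base, bits, deltas

def decompress_tile(base, bits, deltas):
    return [base + d for d in deltas]

tile = [1000, 1001, 1002, 1003] * 4      # smooth surface: tiny deltas
base, bits, deltas = compress_tile(tile)
ratio = (len(tile) * 24) / (24 + bits * len(tile))  # vs. 24-bit raw Z
restored = decompress_tile(base, bits, deltas)
```

On this smooth tile the scheme stores 2-bit deltas instead of 24-bit values, giving a compression ratio of better than 6:1 with no loss.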
34 1 Introduction
Fig. 1.23 ATI R520 ring bus memory controller. The GDDR is connected at the four ring stops.
(Source ATI)
The dispatch processor maintained the same count of texture address and texture
sample units but removed their tie to the pixel hardware. Pixel threads that needed
to texture could do so independently. In addition, texturing operations could be
scheduled separately from pixel vector and scalar processing arithmetic. Typically,
texture operations take many more cycles than math ops.
Pixel Shader 3.0 inefficiencies. ATI’s ultra-threaded pixel shader engine offered
PS 3.0 support and fast dynamic branching. The vertex shader engine also supported
VS3.0, up to two vertex transformations per clock cycle, and full-speed 128-bit
floating-point processing for all calculations, the company said.
ATI improved pixel shader efficiency by eliminating or hiding latency and
avoiding wasted cycles. One of the significant sources of inefficiency in pixel shaders
is texture fetching. If a pixel shader needed to look up a texture value not located in
the texture cache, it had to look in graphics memory, which introduced hundreds of
cycles of latency.
Dynamic branching was a source of inefficiency within Pixel Shader 3.0. It allowed
a pixel shader program to execute different branches or loops depending on calculated
values. Therefore, cleverly implemented, dynamic branching could provide signif-
icant opportunities for optimization. For example, it allowed for early outs, where
large portions of shader code could be skipped for specific pixels when determined
unnecessary. Unfortunately, dynamic branching in pixel shaders destroys traditional
graphics architecture’s parallelism, which could often eliminate any performance
benefits.
So, ATI attacked those issues by developing their ultra-threaded pixel shader
engine. Ultra-threading was a kind of large-scale multi-threading, which breaks down
the pixel-processing workload required to render an image into many small tasks or
threads. Each thread consisted of a small four-by-four block of pixels, and the same
shader code ran on that block of 16 pixels. Ultra-threading also relied on dedicated
branch execution units to eliminate flow control overhead in the shader processors
and on large, multi-ported register arrays to enable fast thread switching.
The R520 pixel shader engine incorporated a central dispatch unit that tracked
and distributed up to 512 threads across an array of shader processors, arranged into
four identical groups called quad pixel shader cores. Each core was an autonomous
processing unit that could execute shader code on a two-by-two block of pixels.
Dynamic flow control. A unique feature found in the R520 was dynamic flow
control. It allowed different paths through the same shader run on adjacent pixels.
That provided optimization opportunities, allowed parts of a shader to be skipped
(early out), and avoided state change overhead by combining multiple related shaders
into one. In general, it allowed the GPU to execute CPU-style branching code more
effectively.
On the negative side, it could interfere with parallelism, and redundant computa-
tion could often reverse any flow control benefits. So, programmers still had to keep
an eye on their threads.
No one sits around. Whenever the dispatch processor sensed a core was idle (either
having completed a task or waiting for data), it immediately assigned a new thread to
execute. The dispatcher suspended an idle thread waiting for data, freeing its ALUs
to work on other threads. That enabled the GPU pixel shader cores to achieve over
90% utilization in practice.
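A minimal scheduling sketch shows why that works; the stall latency, thread count, and per-thread workload are invented for illustration:

```python
# Minimal sketch of ultra-threaded dispatch: when a thread stalls on a texture
# fetch, the dispatcher parks it and hands the core a ready thread, so the
# ALUs rarely idle. All timings here are made up for illustration.

from collections import deque

def simulate(num_threads=8, stall_cycles=4, work_per_thread=6):
    ready = deque(range(num_threads))          # runnable threads
    stalled = {}                               # thread -> cycles left to wait
    remaining = {t: work_per_thread for t in range(num_threads)}
    busy = total = 0
    while remaining:
        total += 1
        # stalled threads get one cycle closer to having their texture data
        for t in list(stalled):
            stalled[t] -= 1
            if stalled[t] == 0:
                del stalled[t]
                ready.append(t)
        if ready:
            t = ready.popleft()
            busy += 1
            remaining[t] -= 1
            if remaining[t] == 0:
                del remaining[t]
            else:
                stalled[t] = stall_cycles      # every op stalls on a fetch
        # with no ready thread, the core idles this cycle
    return busy / total                        # utilization

utilization = simulate()
```

With many threads in flight, utilization approaches 100%; run it with `num_threads=1` and the same core idles most of the time waiting on fetches.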
ATI tried to illustrate this in the diagram in Fig. 1.25.
The ability to manage many threads, each containing a relatively small number
of pixels, allowed dynamic branching to occur (see Fig. 1.25). The goal was to
minimize the cases where different pixels in the same thread could branch down
different shader code paths. Each time that happened, all the pixels in the thread
had to run each possible code path, which eliminated the performance benefit of
branching.
Fig. 1.25 ATI R520 thread size and dynamic branching efficiency was improved with ultra-threading. Courtesy of ATI
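That divergence penalty can be sketched in a few lines; the per-path cycle costs are invented for illustration:

```python
# Sketch of why branch divergence inside a 16-pixel thread hurts: every pixel
# in the thread pays for every code path any of its pixels takes. The cycle
# costs per path are invented.

COST = {"cheap": 4, "expensive": 40}

def thread_cost(branches):
    """Cycles for one 16-pixel thread, given each pixel's chosen branch."""
    taken = set(branches)
    return sum(COST[path] for path in taken)   # divergent threads pay for all paths

coherent = thread_cost(["cheap"] * 16)                    # all pixels agree
divergent = thread_cost(["cheap"] * 15 + ["expensive"])   # one pixel diverges
```

One straggler pixel makes the whole thread pay the expensive path on top of the cheap one, which is why small thread sizes make divergence cheaper to contain.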
Two-by-two. Each pixel shader processor in the R520 could perform two vector
operations and two scalar operations each clock cycle. The pixel shader engine
included a set of 16-texture address ALUs that could perform texture operations
without tying up the pixel shader cores. The dispatch processor contained sequencing
logic that automatically arranged shader code to maximize the use of the ALUs. Each
core had a dedicated branch execution unit, which could execute one flow control
operation.
The R520 could run six pixel shader instructions per clock cycle on 16 pixels
simultaneously. The pixel shader processors could simultaneously perform two scalar
instructions, two three-component vector instructions, and a flow control instruction.
In addition, a bank of independent texture address units (TAU) could process texture
address instructions in parallel with the shader processors.
Vertex shader engine. The R520 had eight vertex shader units supporting the
VS3.0 Shader Model from DirectX 9. Each was capable of processing one 128-bit
vector instruction plus one 32-bit scalar instruction per clock cycle. Combined, the
eight vertex shader units could transform up to two vertices every clock cycle.
The R520’s vertex shader (Fig. 1.26) was the first GPU capable of processing
10 billion vertex shader instructions per second. The vertex shader units also
supported dynamic flow control instructions, including branches, loops, and subrou-
tines.
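The 10-billion figure can be sanity-checked with simple arithmetic. Note the 625 MHz engine clock below is an assumption that makes the claim work out; the 500 MHz core clock quoted earlier yields 8 billion:

```python
# Back-of-envelope check of the "10 billion vertex shader instructions per
# second" claim: eight vertex units, each issuing one 128-bit vector plus one
# 32-bit scalar instruction per clock.

def vertex_instr_rate(clock_hz, units=8, ops_per_clock=2):
    return units * ops_per_clock * clock_hz

at_500mhz = vertex_instr_rate(500e6)   # using the 500 MHz core clock quoted above
at_625mhz = vertex_instr_rate(625e6)   # an assumed higher-clocked part
```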
Four-by-thirty-two float. One of the significant new features of DirectX 9.0 was
the support for floating-point processing and data formats. Compared to the integer
formats used in previous API versions, floating-point formats provided much higher
precision, range, and flexibility.
ATI said its new shader engine would perform all calculations with FP preci-
sion, vertex, and pixel shader operations executed on 128-bit floating-point data
formats. The internal data paths within the shader engine were wide enough to
process those formats at full speed, so there was no need to reduce precision to
optimize performance.
HDR—high-dynamic range. Nvidia got credit for being the first to market with full
32-bit floating-point shaders. And it spoke about high-dynamic range compensating
for overdriving the display’s limited dynamic range.
On the other hand, ATI said it had HDR with its natural light feature it had
demonstrated on the 9700 almost two years earlier.
It came down to shifting and compressing portions of a scene’s intensity so that
adjoining pixels did not mask out neighbors—and to do that, one needed 32-bit
floating-point processing (32 float).
ATI had that in its R520, and it extended that to the alpha element as well, so it
spoke about four-component 32-bit float HDR or 128-bit float HDR.
Dynamic range defines the ratio between the highest and lowest value repre-
sented—i.e., more bits of data equal a greater dynamic range. And the floating point
had a much greater range than the integer. For example:
• 8-bit integer—256:1
• 10-bit integer—1024:1
• 16-bit integer—65,536:1
• 16-bit floating point—2.2 trillion:1.
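Those ratios follow directly from the bit widths. The fp16 figure below uses the largest normal value over the smallest denormal, which lands near a trillion to one; published figures vary with the exact convention used:

```python
# Dynamic-range ratios behind the bullet list: an n-bit integer format spans
# 2^n values, while fp16's ratio of largest finite value to smallest positive
# denormal runs into the trillions.

def int_dynamic_range(bits: int) -> int:
    return 2 ** bits                      # e.g., 8-bit: 256:1

fp16_max = 65504.0                        # largest finite half-precision value
fp16_min = 2.0 ** -24                     # smallest positive denormal
fp16_range = fp16_max / fp16_min          # on the order of a trillion to one
```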
However, LCDs and some CRTs could only recognize values between zero and
255 (i.e., 8 bits per color component); therefore, they require tone-mapping to
preserve or show detail.
Fig. 1.27 Making things look brighter than they are. Courtesy of ATI
Tone-mapping is used with light bloom and lens flare effects to help convey a high
brightness. HDR rendering takes advantage of color formats with greater dynamic
range, tricking the eye and brain into seeing what we believe are more realistic
images. Computer graphics are tricks to imitate nature with limited computational
and display resources—we are just faking it all the time. A case in point is the
illustration in Fig. 1.27.
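A minimal tone-mapping sketch, using the well-known Reinhard operator rather than anything ATI-specific, shows how an unbounded HDR luminance is squeezed into an 8-bit display range:

```python
# Tone mapping with the textbook Reinhard operator L/(1+L), which compresses
# an unbounded HDR luminance into [0, 1) so it fits an 8-bit display. This is
# a generic operator, not ATI's circuit.

def reinhard_to_8bit(luminance: float) -> int:
    mapped = luminance / (1.0 + luminance)      # HDR -> [0, 1)
    return round(mapped * 255)                   # quantize for the display

dim = reinhard_to_8bit(0.05)        # shadow detail survives quantization
bright = reinhard_to_8bit(500.0)    # a 10,000x brighter highlight still fits
```

Both the dim shadow and the extreme highlight land inside the display's 0–255 range without clipping either end, which is exactly the trick Fig. 1.27 illustrates.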
One of the most used images to show off the effects of HDR was from Valve’s
highly anticipated Half-Life 2: Lost Coast. In it, the hero, Dr. Freeman, fights his
way into an abandoned church, where he must look for threats and possibly goodies.
Seeing in the muted light is especially important, as any FPS player will attest (see
Fig. 1.28).
Notice the bright windows, and yet detail was still visible in the walls and floor.
Notice in Fig. 1.29 that the muted windows balance the overall brightness of the
scene, which is clearly not as realistic.
ATI offered several HDR modes with the R520, including 64-bit (FP16, Int16)—
a maximum range that includes HDR with anti-aliasing, blending, and filtering, 32-bit
(Int10) for full speed that also offers HDR texture compression with DXT & 3Dc+,
and custom formats (e.g., Int10 + L16) for tone-mapping acceleration.
Fig. 1.28 Inside the abandoned church with HDR on. Courtesy of Valve
Fig. 1.29 Inside the abandoned church with HDR off. Courtesy of Valve
The technology used ray tracing techniques with dynamic branching for improved
performance. Thanks to its ultra-threading technology for SM 3.0, ATI said it was
now possible at real-time frame rates.
Crossing over fire. ATI had its share of problems, some due to management issues
and others to just bad luck. The delay of the R520 was a significant problem, and the
net result was that management was distracted for almost three calendar quarters.
That distraction let things slip through the cracks, including effective marketing, and
nothing suffered more than the highly anticipated CrossFire.
ATI showed its dual AIB system, CrossFire, for the first time at a private showing
at the 2005 E3 conference. It used two X850 XT PE AIBs with master control logic
(an FPGA) attached.
Analysts and reporters thought CrossFire got delayed with the R520, which made
sense since you could not get a dual AIB operation without a master R520 AIB
(Fig. 1.32).
CrossFire required a regular AIB with an R520, plus a custom AIB with a
controller (the FPGA) and an R520, and a motherboard with an ATI Xpress 200
chipset CrossFire model, which provided 16 PCI Express lanes for the AIBs (ATI
offered an Xpress chipset for Intel and AMD CPUs).
CrossFire was probably one of the most mismanaged product launches in ATI’s
history, and Nvidia and its fans loved it. Every day Nvidia shipped a hundred or so
more SLI AIBs, while ATI struggled to get the R520 out.
CrossFire promised features beyond those of Nvidia’s SLI. CrossFire allowed
the use of different AIB configurations, such as combining an X800 with X1800.
CrossFire also offered more dual rendering modes, such as alternate frame rendering
(AFR, which ATI said it invented). In addition, there were scissors or split screen,
tiling (subdividing the screen into a checkerboard-like pattern of 32 × 32-pixel
squares, with each AIB rendering every other square), and SuperAA.
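The tiling mode's checkerboard assignment is easy to sketch; the 32 × 32 tile size comes from the text, while the mapping function itself is an illustrative assumption:

```python
# Sketch of CrossFire's tiling mode: the screen is cut into 32x32-pixel
# squares assigned to the two AIBs in a checkerboard pattern.

TILE = 32

def aib_for_pixel(x: int, y: int) -> int:
    """Return 0 or 1: which AIB renders the tile containing (x, y)."""
    return ((x // TILE) + (y // TILE)) % 2

# Adjacent tiles alternate between the two boards, both horizontally
# and vertically, so each AIB gets roughly half the screen's work.
row = [aib_for_pixel(x, 0) for x in range(0, 4 * TILE, TILE)]
```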
ATI’s CrossFire did not use a strap connector between the two AIBs as Nvidia’s
SLI did. Instead, it used an external cable between the AIB’s video output connectors
(reminiscent of the Voodoo 2 and original SLI). The DVI out of the slave Radeon
AIB got fed into the DMS port on the master CrossFire Edition Radeon AIB.
The CrossFire Edition (master) was a compositing engine that used a Xilinx FPGA
combined with a Texas Instruments TMDS DVI receiver/deserializer and a Silicon
Image 1161 DVI panel link receiver. It was a little clunky. That was because ATI
did not think SLI was useful. It had tried similar dual AIB operations with some
Rage products when 3Dfx promoted the idea, and ATI did not find much traction
or consumer interest. Of course, in those days, ATI was not the darling of graphics
and gamers that 3Dfx was, so it was no wonder ATI’s customers did not show any
interest.
Put bluntly, ATI missed it, and when they realized dual AIBs were desirable,
the R520, 530, and 515 were already in production (taped out). Therefore, the
compositing engine had to be put on a master AIB instead of inside the chip as
Nvidia did, not that Nvidia got it perfectly correct.
The X1950 Pro, released on October 17, 2006, used the 80 nm RV570 GPU.
It had 12 texture units and 36 pixel shaders and was the first ATI AIB that supported
native CrossFire implementation. Those AIBs also displayed 2048 × 1536 at 70 Hz
on the Radeon X1800, X1600, and X1300 (14X AA).
Although CrossFire worked with any application or game, like Nvidia’s SLI, it
could not help when CPU bound. For those cases, CrossFire could bring a higher
degree of anti-aliasing to improve the looks of the images, if not the performance.
And with the combined AIB’s AA working, the user could get 16X AA and improved
anisotropic filtering.
Also, CrossFire was not quite as robust as ATI suggested concerning mixing and
matching AIBs. Only AIBs within a family would work, i.e., only X800-class AIBs
would work with an X800 master and only X850 AIBs with an X850 master.
Before the launch of R520-based AIBs, ATI released its new Avivo video and display
platform. With the PC’s ever-expanding role as a family entertainment unit, ATI
realized that improving brightness, resolution, de-interlacing, and video playback
were essential.
ATI felt that before HD and terrestrial HD delivery to PC was available, consumer
demand for rich video display platforms was not crucial. But with the growing
integration of media into the PC, consumers would demand better video capabilities
and performance from the PC. With Avivo, ATI hoped to satisfy those expectations.
Capture. Avivo used 3D comb filtering, hardware noise reduction, and multi-path
cancelation to enhance the capture of terrestrial, satellite, and cable signals. Avivo
examined the incoming video and enhanced it to ensure it entered the rest of the
pipeline at the highest possible quality.
Avivo had 12-bit A/Ds to convert analog signals into digital. The company
emphasized that accuracy level because anything in the analog signal not accurately
converted would be permanently lost, jeopardizing the work of the rest of the video
pipeline.
ATI used 3D comb filtering because 2D filtering separates the signals from within
a single image. 3D filtering uses the two dimensions of the image plus the third
dimension of time for the best separation of the signals.
Before the demodulation process, ATI used multi-path cancelation to eliminate
echoes and shadows caused by urban environments.
Encoding. Avivo playback products (ATI’s discrete graphics and integrated
graphics chipsets) had hardware-assisted compression and transcoding to facilitate
media interchange. ATI had hardware MPEG-2 compression with its tuner encoder
products, and ATI claimed that its encoder products would reduce CPU utilization
to as low as 3–4% while encoding live TV signals.
ATI believed that transcoding capability was a vital capability with the introduc-
tion of multiple video-capable devices (PDAs, cell phones, portable game consoles,
etc.) that had a wide variety of formats and storage capacity.
Therefore, the company equipped its Avivo products with encoder support for
MPEG-2, WMV9, and H.264. ATI’s VPU had dedicated circuitry for video decode.
Avivo Display. The Avivo display engine consisted of two symmetrical display
pipelines with dual 10-bit end-to-end display processing. The pipelines ensured that
the output image (video or other) matched the device’s display.
1.1.8.2 Summary
ATI was forming a special relationship with Microsoft. The two companies had been
partners for years, and with the introduction of the Xbox 360, they developed an
even tighter connection. The Xbox 360 had advanced capabilities not yet available on
the PC, which ATI developed with Microsoft.
Hoping to leverage that information, ATI pushed the fab (TSMC) to build a 90 nm
next-generation GPU. ATI took several chances, adding new features, APIs, video
processing, and memory management. It was bold, and some might even say reckless,
but admirable and impressive. But the weak link in the chain got them, and the
company lost its leading position with all its technology. Nonetheless, the R520 was
an outstanding and astonishing design, and ATI would leverage the developments
for several generations.
Shortly after the R520 finally came out, AMD acquired the company. That put a
brake on development, marketing, management planning, and oversight and started
ATI—now AMD—on a downward path that would take over a decade to correct, as
you will see in subsequent chapters.
One of Nvidia’s most successful GPU architectures was the Curie series that spawned
over a hundred AIB models across almost four years. Curie launched in September
2005 with the NV40 in the GeForce 6800 XE (Fig. 1.33). Nvidia ended it in August
2008 with the 65 nm RSX that went into the Sony PlayStation 3. There were even two
process shrinks for the PlayStation: 40 nm in October 2012 and 28 nm in June 2013.
Nvidia was able to scale the DirectX 9.0c design from 130 to 90 nm. It also
increased from 12 shaders in the NV40 on the GeForce 6800 XT to 27 shaders in
the G71 on the GeForce 7800 GS+. All the products of the NV4x series offered the
same feature set for DirectX 9.
Fig. 1.33 Nvidia NV40 Curie vertex and fragment processor block diagram
The Curie architecture could be built with 4, 8, 12, or 16-pixel pipelines.
Nvidia extensively redesigned the architecture of the NV40’s pixel shader engine
from its predecessor. For example, the NV40 could calculate 32 pixels per clock,
while the NV35/38 could only render eight. Also, the NV40’s pixel shaders were
32-bit floating-point precision. And although the GPU could execute the half-
precision modes of the previous NV3x series, it was not dependent on them to realize
its peak performance.
With the Curie architecture, Nvidia introduced PureVideo. Based on the GeForce
FX’s video engine (VPE), PureVideo reused the MPEG-1/MPEG-2 decoding
pipeline and improved the quality of de-interlacing and overlay resizing. It included
limited hardware acceleration for VC-1 (motion compensation and post-processing)
and H.264 video compatibility with DirectX 9’s VMR9.
PureVideo offloaded the MPEG-2 pipeline starting from the inverse discrete cosine
transform, leaving the CPU to perform the initial run-length decoding, variable-
length decoding, and inverse quantization.
PureVideo became a foundational element of Nvidia GPUs, and the company kept
enhancing, expanding, and improving it through over eleven generations.
Nvidia introduced its UltraShadow technology with the Rankine NV35 GPU
found on the GeForce FX 5900. Correct shadows are crucial for realistic and believ-
able images [16]. However, the complex interactions between the light sources, char-
acters, and objects could require complicated programming. In an application like
a game, each light source must be analyzed relative to every object for each image
or frame. The more passes the GPU had to make for the lighting and shadow calcu-
lations in a scene, the more the performance of the PC and gameplay would slow
down.
With UltraShadow II in the GeForce 6 and GeForce 7 Series of GPUs, Nvidia
made improvements so that complex scenes achieved noticeably better perfor-
mance. The improvements in UltraShadow II produced four times the performance
(compared to the previous generation) for passes involving shadow volumes
(Fig. 1.34).
UltraShadow II gave developers the ability to calculate shadows quickly and elim-
inate areas unnecessary for consideration. It defined a bounded portion of a scene
(called depth bounds), and limited calculations only in the area affected most by
the light source. Now developers could accelerate the shadow generation process.
UltraShadow II offered the ability to fine-tune shadows within critical regions. That
enabled developers to create great visualizations that mimicked reality while still
realizing good performance for fast-action games. It also worked with Nvidia’s
Intellisample technology for anti-aliasing shadow edges.
Intellisample Version 4.0 was used in the GeForce 6 and GeForce 7 series and
included two new methods: Transparency Supersampling (TSAA) and the faster but
lower-quality Transparency Multi-sampling (TMAA). Those methods improved the
anti-aliasing quality of scenes with partially transparent textures (such as chain link
fences) and anisotropic filtering of textures at oblique angles to the viewing screen.
Fig. 1.34 Nvidia’s NV40 Curie-based GeForce 6800 XT AIB. Courtesy of TechPowerUp
Nvidia upgraded its CineFX engine for complex visual effects to Microsoft’s
DirectX 9.0c Shader Model 3.0 and OpenGL 1.5 APIs.
With CineFX, introduced in 2002 on the GeForce FX, programmers could develop
shader programs utilizing those technologies and techniques.
DirectX 9.0 Shader Model 3.0 capabilities provided by the Nvidia NV40 GPU
included the following:
• Infinite length shader programs. CineFX 3.0 imposed no hardware limitations
on shader program length, and long shader programs ran faster than on previous
GPUs.
• Dynamic flow control. Additional looping/branching options provided subrou-
tine call/return functions, giving programmers more choices for writing efficient
shader programs.
• Displacement mapping. CineFX 3.0 allowed vertex processing with textures
which provided new realism and depth to every component. Displacement
mapping enabled developers to make subtle changes in a model’s geometry with
minimal computational cost.
• Vertex frequency stream divider. Effects could be applied to multiple characters or
objects in a scene, providing individuality where models were otherwise identical.
• Multiple render target (MRT). MRTs allowed for deferred shading. After
rendering all the geometry, eliminating multiple passes through the scene, the
scene’s lighting could be manipulated. Photo-realistic lighting could be created
while avoiding unnecessary processing time for pixels that did not contribute to
the visible portions of an image.
New effects included subsurface scattering, which provided depth and realistic
translucence to skin and other surfaces. It could generate soft shadows for sophis-
ticated lighting effects, accurately represent environmental and ground shadows,
and create photo-realistic lighting with global illumination. The Nvidia Curie block
diagram is shown in Fig. 1.35.
The GPU could use DDR and GDDR3 memory via a 256-bit-wide memory inter-
face (bus). The device offered 16x anisotropic filtering, rotating grid anti-aliasing,
and transparency anti-aliasing with high-precision dynamic range (HPDR). With its
dual 400 MHz LUT-DACs, the NV40 could display 2048 × 1536 up to 85 Hz.
The GPU included an integrated TV encoder with TV output up to 1024 × 768
resolutions and video scaling and filtering, and the HQ filtering technique could
operate up to HDTV resolutions.
1.2 Conclusion
Programmable vertex shaders introduced by ATI and then Nvidia expanded the capa-
bilities of the GPU significantly and ushered in a new era of computer graphics effects
that executed completely and only on the GPU. It was the precursor of what the GPU
would become—a totally programmable compute engine.
During this period, ATI and Nvidia advanced computer graphics features formerly
found in high-priced workstations. This era was marked by additional consolidation
as the number of suppliers dropped from 32 to 8 and the concentration of engineering
talent ATI and Nvidia accelerated setting the stage for yet even more advanced
developments in the next-generation and future GPUs.
References
1. Peddie, J. True to form, ATI raises the ante in graphics processing, The Peddie Report, Volume
XIV, Number 22 (May 28, 2001).
2. Chung, J, and Kim, L-S. A PN Triangle Generation Unit for Fast and Simple Tessellation
Hardware, Proceedings of the 2003 International Symposium on Circuits and Systems, 2003.
ISCAS ‘03. (June 25, 2003). https://tinyurl.com/28nfku3v
3. ATI, TruForm White paper, August 2001. https://tinyurl.com/4tzbyxw4
4. Witheiler, M. ATI TruForm—Powering the next generation Radeon, AnandTech, (May 29,
2001). https://www.anandtech.com/show/773
5. Jon Peddie’s Tech Watch, Volume 2 Number 15, July 22, 2002.
6. De Maesschalck, T. The simFUSION 6000q the force of four Radeon 9700 cores!. (November
5, 2003). https://www.dvhardware.net/article2076.html
7. Shankland, S. SGI uses ATI for graphics behemoths. (July 14, 2003). https://www.cnet.com/
tech/computing/sgi-uses-ati-for-graphics-behemoths/
8. Peddie, J. SiS previews Xabre600 GPU, TechWatch Volume 2, Number 24 (November 25,
2002).
9. Maher, K. Computex 2001, Chipsets, The Peddie Report, Volume XIV, Number 24, pp 989.
(June 11, 2001).
10. Singer, G. History of the Modern Graphics Processor, Part 3: Market Consolidation, The Nvidia
vs. ATI Era Begins, TechSpot. (December 6, 2020). https://www.techspot.com/article/657-his
tory-of-the-gpu-part-3/
11. Shimpi, A. Nvidia GeForce FX 5800 Ultra: It’s Here, but is it Good? AnandTech (January 27,
2003). https://www.anandtech.com/show/1062/19
12. Transmedia storytelling: A Demise Caused By A Film, Film Tv Moving Image University of
Westminster. (March 6, 2017)
13. Briscoe, D. Final Fantasy’ flop causes studio to fold, Chicago Sun-Times. (February 4, 2002).
https://article.wn.com/view/2002/02/04/Final_Fantasy_flop_causes_studio_to_fold/
14. CineFX Architecture, Siggraph 2002, Nvidia. http://developer.download.nvidia.com/assets/
gamedev/docs/CineFX_1-final.pdf
15. Wang, E and Lee, C. Nvidia said to sell NV30 cards, NV31 and NV34 chips to down-
stream clients, Digitimes. (January 13, 2003). https://www.digitimes.com/news/a20030113
01002.html
16. UltraShadow II Technology. https://www.nvidia.com/en-us/drivers/feature-ultrashadow2/
Chapter 2
The Third- to Fifth-Era GPUs
Due to Moore’s law, the increase in transistor density opened up new possibilities
for GPU designers and CG scientists. The concepts, math, and algorithms remained
well understood. However, efficiently executing the algorithms was the tricky part.
It was what set one GPU architectural design apart from others (Fig. 2.1).
At Nvidia’s Nvision 2008 conference in San Jose, Tony Tamasi, a former senior
engineer at 3dfx and a senior vice president of content and technology at Nvidia,
took a look back and traced the development of GPUs up to the third era [1]. Tamasi
produced the chart in Fig. 2.2.
Previously, fixed functions constrained the first-generation GPUs. For instance,
game developers could only use a limited number of characters in a game. There
were limited animation capabilities. And using multi-textures required multi-pass
rendering. To economize on graphics resources, game developers relied on simplistic
scenes, and environments were usually indoors. With the arrival of DirectX 10,
Microsoft would usher in a new period of innovation enabling cinematic effects
and realistic graphics.
Fig. 2.2 GPU architecture progression, first and second era. Courtesy of Tony Tamasi
The defining technology of this era was the introduction of unified shaders, which
combined vector and pixel shaders. The first instance of the unified shader model
was realized in the ATI Xenos processor in the Xbox 360 in June 2005.
The shift opened the door to programmable graphics and is symbolized in Fig. 2.3.
It allowed armies of characters, complex physical simulations, sophisticated AI,
procedural generation, custom renderers, and lighting. In Microsoft’s parlance, this
strategy was called Shader Model 4, or SM 4.
Unified shader design made all shaders equal, and their capabilities were available
simultaneously. In the past, the differences between vertex and pixel shaders could
cause situations where the processes got out of sync so that vertex shaders were idle,
and pixel shaders got backlogged.
The following are some examples of third-era GPUs.
2.1 The Third Era of GPUs—DirectX 10 (2006–2009) 53
On November 8, 2006, Nvidia launched its first unified shader architecture and the
first DirectX 10, Shader Model 4-compatible GPU, the G80. The G80 redefined what
GPUs were capable of and what they would become.
ATI’s Radeon 8500 and Nvidia’s GeForce 3, both from 2001, could execute small programs via
specialized, programmable vertex and pixel shaders. Nvidia and ATI carried on with
that basic design up until March 9, 2006, when Nvidia released the G71. The G80
(Fig. 2.4) broke the mold and ushered in a new era of GPU capabilities.
The G80 was based on Nvidia’s Tesla architecture (code named NV50) and had
128 shaders (grouped into 16 streaming multiprocessors (SMs)). The 484 mm² die
was fabricated in TSMC’s 90 nm process and had 681 million transistors. The GPU had
32 TMUs and eight texture processor clusters (TPCs). It used GDDR3 memory
clocked at 792 MHz, while the GPU ran at 513 MHz.
Nvidia’s G80, which powered the GeForce 8800 family of AIBs, was the first to
replace dedicated pixel and vertex shaders with an array of standard (unified) stream
processors (SPs—Nvidia would later re-brand them as CUDA cores).
Nvidia’s previous GPUs were SIMD vector processors and could run concurrently
on RGB+A color components of a pixel. The G80 had a scalar processor design such
that each streaming processor could handle one color component. Nvidia moved
from a GPU architecture with dedicated hardware for specific shader programs
to an array of relatively simple cores. Those (seemingly) simpler cores could be
programmed to perform whatever shader calculations an application required. It was
a clear breakthrough and a break-away design.
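Why the scalar organization helped can be sketched with a toy utilization comparison; the instruction mix below is invented for illustration:

```python
# Why scalar (G80-style) cores beat fixed vec4 SIMD on scalar-heavy shaders:
# a vec4 unit issues all four lanes per clock whether or not the instruction
# uses them, while scalar cores pack components from any instructions into
# the available lanes. The shader's instruction mix is invented.

from math import ceil

def vec4_cycles(instr_widths):
    return len(instr_widths)            # one vec4 issue per instruction

def scalar_cycles(instr_widths, lanes=4):
    # same four lanes, but components from different instructions fill them
    return ceil(sum(instr_widths) / lanes)

shader = [4, 1, 1, 3, 1, 2]             # component counts per instruction
vec4 = vec4_cycles(shader)              # many lanes sit idle
scalar = scalar_cycles(shader)          # 12 components packed into 4 lanes
```

For this mix the scalar organization finishes in half the cycles, which is the essence of the G80's break from dedicated vector hardware.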
In an interview with Joel Hruska of ExtremeTech in November 2016, Jonah Alben,
Nvidia’s Senior VP of GPU Engineering, said, “I think that one of the biggest chal-
lenges with G80 was the creation of the brand new ‘SM’ processor design at the
core of the GPU. We pretty much threw out the entire shader architecture from
NV30/NV40 and made a new one from scratch with a new [single-instruction,
multiple-threads] general processor architecture (SIMT), that also introduced new
processor design methodologies.” [3].
Nvidia designed the G80 to run much more complicated pixel shaders with more
branching, dependencies, and resource requirements than previous chips. The new
cores could also operate much faster. The previous GeForce 7900 GTX’s GPUs
were manufactured in a 90 nm process at TSMC and ran at 650 MHz. The G80-
based GeForce 8800 GTX (Fig. 2.5) was built in the same 90 nm process, yet its
shader cores ran at 1.35 GHz.
The new chip debuted in two new AIBs, the $599 GeForce 8800 GTX and the
$449 GeForce 8800 GTS.
The G80 was the threshold processor that would lead Nvidia to general computing
acceleration beyond gaming—a big evolutionary step that would have consequences
on the entire computing industry for decades.
Demonstrating the scalability of its primary texture processor cluster (TPC), Nvidia
took the G80 design (that was built in 90 nm), shrank it to 65 nm, and then made
the whole chip larger by increasing the G80's 681 million transistors to an astonishing
1.4 billion in the GT200, in the process creating the biggest chip built to date.
2.1 The Third Era of GPUs—DirectX 10 (2006–2009) 55
Fig. 2.5 Nvidia GeForce 8800 Ultra with the heatsink removed showing the 12 memory chips
surrounding the GPU. Courtesy of Hyins—Public Domain, Wikimedia
Nvidia would make that claim on several future GPUs as well—bigger is better at
Nvidia, or as Jensen Huang has said, “Moore’s law is our friend.” [4].
Nvidia started with a basic design for a streaming processor (SP), known as a
shader—see the CUDA core callout in Fig. 2.6. Nvidia then built a streaming multi-
processor (SM) from an array of SPs. In the GT200, there were eight SPs in an SM;
the SPs are depicted as cores in the block diagram. An SM in the GT200 consists of
eight SPs (cores) and a special function unit (SFU).
Nvidia designed its GPU architecture to be scalable, so a texture processing core
(TPC) could be made of any number of SMs. In the G80, there were two SMs per
TPC, and in the scaled up GT200, there were three SMs.
Nvidia continued the modular theme by grouping several TPCs to form a streaming
processor array (SPA).
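The modular hierarchy described above reduces to simple multiplication. A sketch using the counts in the text (the G80's eight-TPC total is the commonly cited configuration and is an assumption here, not stated above):

```python
# Sketch of Nvidia's modular shader hierarchy (SP -> SM -> TPC -> SPA),
# using the per-level counts given for the G80 and GT200.

def total_sps(sps_per_sm, sms_per_tpc, tpcs):
    """Total streaming processors in a streaming processor array (SPA)."""
    return sps_per_sm * sms_per_tpc * tpcs

# G80: 8 SPs per SM, 2 SMs per TPC, 8 TPCs -> 128 stream processors
g80 = total_sps(sps_per_sm=8, sms_per_tpc=2, tpcs=8)

# GT200: same SPs per SM, but 3 SMs per TPC and 10 TPCs -> 240 stream processors
gt200 = total_sps(sps_per_sm=8, sms_per_tpc=3, tpcs=10)

assert g80 == 128
assert gt200 == 240
```

Scaling the chip thus meant turning one or two knobs (SMs per TPC, number of TPCs) rather than redesigning the core.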
The result was a chip that was 576 mm2, with 240 shaders or cores, 80 TMUs,
60 special function units (SFUs), and 10 TPCs. The chip drew a whopping
236 W, another biggest number for Nvidia. The GPU ran at 600 MHz, while the
GDDR3 memory ran at an effective 2.2 GHz from a 1.1 GHz clock. Like the G80, it
was DirectX 10 (feature level 10_0) compatible as well as OpenGL 3.3. The chip
was used on Nvidia's GTX 2x0 series AIBs, and the top of the line, the GTX 280,
had 1 GB of memory, another biggest for Nvidia.
2.1.2.1 Summary
What was significant about the chip is that it demonstrated two substantial aspects of
Nvidia’s Tesla design, scalability in moving to a smaller process node (90–65 nm)
and scalability in the modular assembly capability of the shaders and their subsequent
larger building blocks.
Fig. 2.7 Evolution of Nvidia’s logo, 1993 to 2006 (left) and 2006 on (right). Courtesy of Nvidia
Also in 2006, Nvidia changed their logo font from italics to bold (Fig. 2.7).
Nvidia likes to write its name in all caps. However, the name is a proper noun and
not an acronym and should not be all caps.
Intel launched the Larrabee project in 2005, code named SMAC. Paul Otellini, Intel’s
CEO, hinted about the project in 2007 during his Intel Developer’s Forum (IDF)
keynote. Otellini said it would be a 2010 release and compete against AMD and
Nvidia in the realm of high-end graphics.
Intel announced Larrabee in 2008, first at SIGGRAPH in early August, then at
the IDF in mid-August, and finally at the Hot Chips conference in late August. The
company said Larrabee would have dozens of small, in-order x86 cores and run as
many as 64 threads. The chip would be a coprocessor suitable for graphics processing
or scientific computing. Intel said programmers could decide, at any given time,
how they would use those cores.
At Intel’s Research @Intel Day, held in the Computer History Museum on June
14, 2009, the company showed a ray-traced version of Enemy Territory: Quake Wars
at 1024 × 720. Intel research scientist Dr. Daniel Pohl demonstrated his software
on a 16-core 2.93 GHz Xeon Tigerton system, with four processors, each with four
cores (Fig. 2.8). The implication was that the multi-processor Larrabee would deliver
similar performance, but the processors in Larrabee were not of Xeon class.
Intel did not release the details on Larrabee, but rumors swirled that it would have
64 processors—four times as many as Pohl used.
Larrabee was scheduled to launch in the 2009–2010 timeframe; then in December
2009, Intel surprised the industry and canceled it. Rumors circulated in late
2009 that Larrabee was not performing as well as expected. And in 2010, Intel
acknowledged the power density of x86 cores did not scale as well as a GPU. The
AIB was big, as indicated in Fig. 2.9.
To salvage all the work that had gone into Larrabee, Intel pivoted to the Knights
family of Xeon Phi compute coprocessors based on the Larrabee chip.
Larrabee used multiple in-order x86 CPU cores augmented by a wide vector
processor unit and a few fixed function logic blocks. Intel said this would provide
Fig. 2.8 Daniel Pohl demonstrating Quake running ray-traced in real time
dramatically higher performance per watt and unit of area than out-of-order CPUs
on highly parallel workloads. The company asserted that Larrabee’s highly parallel
architecture would make the rendering pipeline completely programmable. It could
run an extended version of the x86 instruction set, including wide vector processing
operations and specialized scalar instructions. The three diagrams, Figs. 2.10, 2.11,
and 2.12, are based on Intel’s presentation at SIGGRAPH.
The diagram was symbolic of the number of CPU cores and the number and type
of coprocessors and I/O blocks, which were implementation dependent, as were the
locations of the CPU and non-CPU blocks on the chip. Each core could access a
subset of a coherent L2 cache to provide high bandwidth access and simplify data
sharing and synchronization.
Intel claimed Larrabee would be more flexible than current GPUs. Its CPU-like
x86-based architecture supported subroutines and page faulting. Some operations that
GPUs traditionally perform with fixed function logic, such as rasterization and post-
shader blending, were implemented entirely in software in Larrabee. As with GPUs,
Larrabee used fixed function logic for texture filtering, but the cores assisted the
fixed function logic (e.g., by supporting page faults). Larrabee's core block diagram
is shown in Fig. 2.12.
Larrabee’s programmability offered support for traditional graphics APIs such
as DirectX and OpenGL via tile-based deferred rendering that ran as software
layers. Intel ran the renderers using a tile-based deferred rendering approach. Tile-
based deferred rendering can be very bandwidth-efficient, but it presented some
compatibility problems at that time—the PCs of the day were not using tiling.
Each core had fast access to its 256 kB local subset of the coherent second-level
cache. The L1 cache sizes were 32 kB for I cache and 32 kB for D cache. Ring
network accesses passed through the L2 cache for coherency. Intel manufactured the
Knights Ferry chip in its 45 nm high-performance process and the Knights Corner
chip in 22 nm.
In Pohl’s ray-tracing demo, he used four, four-core, 2.93 GHz Xeon 7300 Tigerton.
Assuming perfect scaling, they could produce 358 GTFLOPS per processor or 1.4
TGFLOPS total. The re-purposed coprocessor version of Larrabee, Knights Ferry
(Larrabee 1), had 32 cores at up to 1.2 GHz, each producing 38 GFLOPS, for a
total of 1.2 TFLOPS. However, its x86 cores were based on the much simpler P54C
Pentium. A public demonstration of the Larrabee architecture took place at the IDF
in San Francisco on September 22, 2009. Pohl ported his Quake Wars ray-traced
demo, and it ran in real time. The scene included a ray-traced water surface that
accurately reflected the surrounding objects, like a ship and several flying vehicles.
Intel used the Larrabee chip for its Knights series many integrated core (MIC)
coprocessors. Former Larrabee team member Tom Forsyth said, “They were the
exact same chip on very nearly the exact same board. As I recall, the only physical
difference was that one of them did not have a DVI connector soldered onto it.” [5].
Knights Ferry had a die size of 684 mm2 and a transistor count of 2.3 billion—
a large chip. It had 256 shading units, 32 texture-mapping units, and 4 ROPs, and
it supported DirectX 11.1. For GPU-compute applications, it was compatible with
OpenCL version 1.2.
The cores had a 512-bit vector processing unit, able to process 16 single-precision,
floating-point numbers simultaneously. Larrabee was different from the conventional
GPUs of the day. Larrabee used the x86 instruction set with specific extensions. It had
cache coherency across all its cores. It performed tasks like z-buffering, clipping, and
blending in software using a tile-based rendering approach (refer to the simplified
DirectX pipeline diagram in Book two). Knights Ferry, aka Larrabee 1, was mainly
an engineering sample, but a few went out as developer devices. Knights Ferry
D-step, aka Larrabee 1.5, looked like it could be a proper development vehicle. The
Intel team had lots of discussions about whether to sell it and, in the end, decided
not to. Finally, Knights Corner, aka Larrabee 2, was sold as the Xeon Phi.
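The per-core and aggregate throughput figures quoted above can be cross-checked with back-of-envelope peak-FLOPS arithmetic. The two-operations-per-lane assumption (a multiply-add per clock) is an inference that makes the quoted numbers line up, not a published Intel specification:

```python
# Peak-FLOPS sanity check for the Knights Ferry figures quoted in the text.
# Assumption (not an Intel spec): each 32-bit lane retires two floating-point
# operations per clock (a fused multiply-add).

def peak_gflops(cores, ghz, flops_per_clock_per_core):
    """Theoretical peak = cores x clock x per-core per-clock throughput."""
    return cores * ghz * flops_per_clock_per_core

lanes = 512 // 32                           # 512-bit vector unit / 32-bit floats = 16 lanes
per_core = peak_gflops(1, 1.2, lanes * 2)   # one core at 1.2 GHz
total = peak_gflops(32, 1.2, lanes * 2)     # all 32 cores

assert lanes == 16
assert round(per_core, 1) == 38.4           # matches the ~38 GFLOPS per-core figure
assert round(total / 1000, 2) == 1.23       # ~1.2 TFLOPS, matching the quoted total
```

The same formula applied to the four Tigerton Xeons reproduces the 1.4 TFLOPS aggregate the demo assumed under perfect scaling.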
The Intel developers (Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth,
Pradeep Dubey, Stephen Junkins, Adam Lake, Robert Cavin, Roger Espasa, Ed
Grochowski, and Toni Juan) believed Larrabee was more programmable than GPUs
of the time [6]. Additionally, Larrabee had fewer fixed function units, so they thought
Larrabee would be an appropriate platform for the convergence of GPU and CPU
applications.
Larrabee used general-purpose CPUs and could theoretically have run its own
operating system. However, the graphics performance wasn’t good enough compared
to competing products.
The Larrabee/MIC/XeonPhi split was purely marketing, branding, and pricing.
Same team, same chips.
In 2009, Intel announced it had canceled the Larrabee product. However, the first
run of devices was re-purposed as a software development vehicle for ISVs and the
high-performance computing (HPC) community.
Unlike conventional GPUs, the hardware in Larrabee was half of the solution.
The other was the software that translated the DirectX graphics pipeline elements
into x86-compatible code. Unfortunately, it took too long to write, test,
debug, and fix that mountain of software, especially for legacy systems in the
OpenGL realm. The corner
cases of old OpenGL extensions were especially insidious. And because Intel is Intel,
it had to guarantee backward compatibility.
The result was that Larrabee missed its launch window. If the product were
released later (assuming they could get all the software cases to work—and
there was no certainty about that), Larrabee would compete against newer, more
powerful competition than initially planned. Moreover, designing and manufacturing
a next-generation chip would take three years. It was an untenable situation.
Moreover, to use a cliché, Intel was fixing the engine while the airplane was flying.
Naturally, for a CPU company, Intel did not have a very deep team of GPU experts
as Larrabee ramped up. Intel staffed up in 2007 and 2008, bringing in dozens, maybe
hundreds, of new people from a dozen or more companies and universities. Getting
them integrated into a team took more time than Intel anticipated. Then, continuing
to fix the engine in flight, Intel launched the chip with a 32-port coherent cache. The
largest coherent cache Intel had built to date supported only eight CPUs.
In May 2019, Intel told customers it would no longer accept orders for the Phi
products after August 2019 and that the Xeon Phi 7295, 7285, and 7235 would be
end-of-life (EOL) July 31, 2020—from inception (2005) to termination (2020) is a
long run in this industry.
Larrabee was significant because it tested the notion that a SIMD construction
of any CISC or RISC processor with an FPU could be used as a GPU. The evolutionary
legacy of CISC processors like Intel's x86 carried so many functions and features
that, even when made small enough to be replicated in large quantities, the core
wasn't powerful enough to be that much better than several shaders in a conventional
GPU. It also wasn't power efficient, or inexpensive.
Although Intel disparaged the GPU every chance it got, it always wanted one and
tried a few times before Larrabee (e.g., 82786, i860, i740). In 2018, it kicked off
another GPU project code named Xe, discussed in a later chapter.
With the G965 and GM965 chipset evolution, Intel expanded the GPU in the G45
Graphics and Memory Controller Hub (GMCH) to incorporate a DirectX 10, Shader
Model (SM) 4.0, unified shader architecture. Figure 2.13 shows the block diagram of the G45.
The GPU had 10 EUs, five threads per EU, and it offered HD Decode (high-quality
video), with, said the company, a focus on game compatibility. The G45 was one of
Intel’s last external chipsets.
Intel was the first to introduce a CPU with a built-in GPU in January 2010—five
years after AMD and ATI announced their plan for a CPU with an integrated GPU and
a year before AMD actually managed to accomplish it (Fig. 2.14).
Worried that AMD would beat it to the market with an integrated CPU and GPU,
Intel marshaled its forces and went into skunk-works mode to build such a device.
At the same time, AMD hit one roadblock after another, whittling away its time-to-
market advantage and costing it first-mover status for a concept it created.
The first iGPU
Westmere was Intel’s first CPU with an encapsulated shared memory integrated
GPU—the GPU and CPU were in the same package but not the same die (Fig. 2.15).
Westmere was Intel’s latest microarchitecture and was not the name of the processors
that used it. The Westmere (formerly Nehalem-C) architecture was a 32 nm die shrink
of the Nehalem architecture. The Westmere design could use the same CPU sockets
as Nehalem-based CPUs.
The 32 nm Clarkdale (80616) was the first processor to use the Westmere archi-
tecture and incorporate a GPU die in the same package. The GPU was Intel’s 45 nm
fifth-generation HD series, Ironlake graphics GPU. It ran at 500–900 MHz. It had
177 million transistors in a 114 mm2 die with 24 shaders—12 execution units (EUs)
and a four MB L3 cache—and it was DirectX 10.1, Shader Model 4.0, and OpenGL
2.1 compatible. A block diagram of the Ironlake iGPU is shown in Fig. 2.16. The
GPU could produce up to 43.2 GFLOPS at 900 MHz (24 GFLOPS at 500 MHz).
The iGPU could decode an H264 1080p video at up to 40 fps.
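The two Ironlake throughput figures quoted above imply a constant per-clock rate—peak FLOPS scale linearly with the GPU clock. A quick check (the per-EU breakdown is inferred from those numbers, not an Intel specification; working in MFLOPS and MHz keeps the arithmetic exact):

```python
# The Ironlake figures above (43.2 GFLOPS at 900 MHz, 24 GFLOPS at 500 MHz)
# imply a fixed FLOPs-per-clock throughput that scales linearly with clock.

def flops_per_clock(mflops, mhz):
    """Per-clock floating-point throughput from peak MFLOPS and clock in MHz."""
    return mflops / mhz

high = flops_per_clock(43_200, 900)   # 43.2 GFLOPS at 900 MHz
low = flops_per_clock(24_000, 500)    # 24 GFLOPS at 500 MHz

assert high == low == 48.0            # same silicon, only the clock differs
assert high / 12 == 4.0               # 12 EUs -> 4 FLOPs per EU per clock (inferred)
```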
The Ironlake GPU could provide 2560 × 1600 resolution capability via a Display-
Port connection. For DVI (due to its single-link connection), the resolution was
limited to 2048 × 1536 and, for HDMI, 1920 × 1200. Intel believed DisplayPort,
which it helped develop, was the future—they were right.
Intel branded the Clarkdale processors as Celeron, Pentium, or Core with HD
Graphics.
Intel initially sold Clarkdale as desktop Intel Core i5, Core i3, and Pentium. It was
closely related to the mobile Arrandale processor. The most significant difference
between Clarkdale and Arrandale was that the latter targeted mobile systems.
A year later, in February 2011, Intel introduced the second-generation Core brand
with the Sandy Bridge architecture and the GPU integrated into the CPU.
Intel built the fully integrated 131 mm2 die, four-core, 2.27 GHz processor Sandy
Bridge, with the iGPU in its 32 nm fab. Depending upon the model, the GPU’s clock
was 650–1100 MHz (turbo to 1350 MHz).
2.2 The Fourth Era of GPUs. October 2009 65
The Sandy Bridge GPU, branded HD Graphics 3000 (where HD stood for
high-definition), did not have dedicated memory and shared the Level 3 cache and
part of the main memory with the CPU.
The GPU had 12 EUs with 96 shaders and 12 texture-mapping units (TMUs), but
Sandy Bridge’s GPU was faster than Clarkdale due to architectural changes and a
process shrink. The GPU was DirectX 10.1, OpenGL 3.0, and DirectCompute 4.1
compatible. The GPU also incorporated dedicated units for decoding and encoding
HD videos.
The GPU was impressive for the number of shaders and other processors it had
within the power envelope of an integrated system with an x86 CPU. It was
significant in that it shifted T&L work from the CPU to the GPU's shaders, making
it truly DirectX 8 to 11 compatible.
The fourth era of GPUs was launched with Microsoft’s D3D11/DirectX 11 and the
introduction of compute shaders. DirectX 11 brought programmable geometry shaders
and tessellation with its Hull and Domain Shaders. It was a significant improvement
and fostered a whole new era of games with never-before-seen quality approaching
ray tracing.
Fig. 2.17 AMD graphics logos, circa 1985, 2006, 2010. Courtesy of AMD
AMD made seven variations of the first AMD-branded GPUs, code named Northern
Islands and Vancouver.
• 292M 40 nm (Cedar)
• 370M 40 nm (Caicos)
• 716M 40 nm (Turks)
• 1,040M 40 nm (Juniper)
• 1,700M 40 nm (Barts)
• 2,640M 40 nm (Cayman).
The Northern Islands GPUs, introduced in October 2010, were the first new GPUs
developed by ATI under the management of AMD. They formed part of AMD's 40 nm
Radeon brand. AMD based some versions on the second generation of the TeraScale
architecture (VLIW5) and some on the third-generation TeraScale 3 (VLIW4).
With the Northern Islands GPUs introduction, AMD discontinued the ATI brand.
AMD wanted to tighten the correlation between the graphics products and the AMD
CPU and chipset branding. For the most part, the AMD badging was just a replace-
ment for the ATI badge, and people continued to refer to the graphics products as
ATI for almost a decade afterward because the name was so admired—and some say
loved.
The former ATI logo received a renovation and took on some of the 2010 AMD
Vision logo design elements (Fig. 2.17).
AMD also retired the Mobility Radeon name for laptop GPUs and used an M
suffix at the end of the GPU model number.
Jerry Sanders founded AMD in 1969 to compete with Intel, which it eventually did
very successfully in the x86 CPU business. Sanders stepped down in 2002, and Hector Ruiz took
over and ran AMD until 2008. During his tenure, AMD acquired ATI and entered
the GPU market.
Ruiz and Dave Orton, ATI’s former CEO, planned to design and build an integrated
device with an x86 CPU and a powerful GPU they called Fusion. However, AMD
was top-heavy, and Orton didn't want to move to Texas, so he resigned in 2007,
which sapped the momentum of the Fusion project. In late 2006, Nvidia reached parity
with AMD in market share and then continued to gain.
AMD’s CPUs were not as good as they should have been, and the company was
rapidly losing market share to Intel and Nvidia and started losing money. When Ruiz
resigned in 2008, AMD's BOD elected VP Dirk Meyer, a protégé of Sanders, to the
CEO position. But AMD continued to lose money, and Meyer looked for things to cut
or sell off.
In 2008, Meyer sold the ATI TV business to Broadcom, which helped Broadcom's
position in the set-top box market. Meyer also sold off AMD's fab in 2008, and it
became Global Foundries. That stopped some of the losses and brought in some cash.
But the debt the company was carrying was crushing it, and the funding for GPU
investment almost disappeared. Meyer sold the ATI mobile group to Qualcomm in
2009, helping that company establish its long-running and highly successful Snapdragon
product line and become the largest independent iGPU supplier in the world.
Meyer was a CPU guy and didn't appreciate the GPU. Rumor had it that although
he helped develop the logic for the acquisition, he actually opposed it. Meyer focused
the company on notebook PCs and the data center server market. As a result, funding
for R&D for GPUs was secondary to CPUs. GPUs are complicated devices to design
and build and take three to five years to develop. In 2009, after Meyer took over, the
ATI group lost several vital engineers, and GPU development stalled.
With the company relying primarily on the CPU for revenue, the under-performing
Bulldozer CPU (designed while Meyer was a VP) dragged it down. AMD suffered years
of losses and accumulated debt, even though the GPU business brought in money.
ATI had new designs in the pipeline when AMD bought them, and in the summer
of 2008, AMD launched the Radeon HD 4870. It was an excellent product, and
AMD’s AIB OEMs were able to offer an AIB for $300 that had comparable or better
performance to Nvidia boards selling for $400–$450. It was an incredible value at the
time and an instant hit, and the GPU revenue significantly boosted AMD's sales and
helped sustain the company.
Nonetheless, the company was losing money, and rumors circulated that AMD
would be sold to Intel, Nvidia, or VCs, or would file for bankruptcy. After possibly
three of the worst years of AMD's history, and not totally his fault, Meyer resigned in 2011.
AMD hired Rory Read to be the new CEO. Read had been President and COO at
Lenovo, and at the time, Lenovo had just marked its 7th straight quarter as the
fastest-growing PC maker in the world and had become the world's third-largest
global PC manufacturer. The BOD thought if anyone could get AMD back on track
and into the OEMs, it was Read.
Read was the first CEO of AMD who wasn't an engineer. Read's legacy at AMD
is that he hired the best. Read brought in Mark Papermaster, CTO at Cisco, as AMD's
CTO. And he hired Lisa Su, a distinguished engineer and manager who had been
at IBM and Freescale, where she was Senior Vice President and General Manager.
She assumed the same positions at AMD under Read.
AMD continued to sink, in 2015 reporting its lowest revenue in over ten years. But
despite that, Read, Su, Papermaster, and the brilliant CFO Devinder Kumar reduced
AMD's debt, and under Su's direction, started hiring top-notch engineers, some of
whom had left AMD earlier, including Raja Koduri, who had been at Apple after he
left AMD.
Under Su’s leadership, AMD began investing in and developing the next gener-
ation of CPUs and GPUs. In 2014, Read was asked to step down, and the BOD
appointed Lisa Su as the company's CEO, a position she has held for over nine years
(as of this writing), making her the second longest-serving CEO at AMD after Jerry Sanders.
But the damage of underinvestment had taken its toll. In 2007, AMD launched
the HD 2000 series, the company’s last high-end AIB.
In 2010, AMD had to give up its position in the high-end of the GPU market. With
the introduction of DirectX 11, AMD came to the market with cost-effective mid-
range GPUs that had excellent performance. The company was still in the game,
so to speak, just not in the high-end. The kickoff products for the fourth era and
DirectX 11 were the mid-range TeraScale GPUs, code-named Barts (October 2010),
and Turks (February 2011), discussed in the following sections.
The entry-level Radeon HD 6500/6600, code-named Turks, was released on
February 7, 2011. The Turks family included Turks PRO and Turks XT, and AMD
marketed them as HD 6570 and HD 6670. Originally released only to OEMs, they
proved so popular and cost-effective that AMD released them to retail. Figure 2.18
is a block diagram of the AMD Turks GPU.
The Turks GPU had 480 shaders, 24 TMUs, and 8 ROPs and could drive four
independent monitors. The 118 mm2 chip had 716 million transistors and drew about
75 W. It complied with DirectX 11.2, Shader Model 5, and OpenGL 4.4.
2.2.2.1 Summary
AMD’s Radeon HD 6570 and 6670 were minor upgrades of their Evergreen prede-
cessor, the HD 5570 and 5670. The Turks GPUs contained 80 more stream processors
and four more texture units. AMD also upgraded the design to include the new tech-
nologies in the Northern Islands GPUs such as HDMI 1.4a, UVD3, and stereoscopic
3D.
Its direct competitor was Nvidia’s GeForce 500 Series, which Nvidia launched
approximately a month later.
What made the Turks GPU significant was that it demonstrated the scalability of
the TeraScale architecture. Typically, entry-level GPUs are salvaged higher-end parts
that did not meet specs—a practice known as binning. The Turks was a unique design
and stood on its own merits.
Nvidia introduced its Fermi microarchitecture in late March 2010. It was the
next-generation GPU, branded as the GF100 family, superseding the popular
Tesla architecture-based products such as the G80 GPUs.
Manufactured at TSMC in a 40 nm process, the three billion transistor Fermi GPU
contained 512 processors—now called stream processors—organized into 16 groups
of 32. The Fermi processors were Nvidia’s first GPU compatible with OpenGL 4.0
and Direct3D 11. Nvidia released over 70 products with the Fermi architecture.
According to Nvidia, the GF100’s unified shader architecture incorporated tessel-
lation shaders into the same vertex, geometry, and pixel shader architecture. Thus,
the benefits of the Fermi, asserted the company, were improved computing, physics
processing, and computational graphics capabilities, and the addition of tessellation
[7].
The die size of the Fermi was 529 mm2 . Even with a reduced process (40 nm
vs. 90), Fermi was larger than the 484 mm2 Tesla chip; those extra shaders took up
space.
The company had introduced its CUDA parallel processing software based on
C++, and the Fermi GPU was the first to exploit it. CUDA would prove an essential
tool for the industry and Nvidia especially. It became a de facto standard, finding
its way into hundreds of programs and leading Nvidia into the realm of artificial
intelligence for which it would become famous.
Large GPUs were not sold with all elements enabled, especially with the initial
production run. As GPUs increased in complexity, additional backup circuits were
built in and used in case of failures during the manufacturing cycle. In the case of
Nvidia’s Fermi, the company did ship the GTX 580 in 2010 with all units enabled
(the 580 was the same layout as the 480 in most respects but with the texture and
z-cull units updated to add performance optimizations).
2.2.3.1 Summary
The Fermi architecture was Nvidia’s move to offer a GPU designed for GPU
computing, a vector accelerator for HPC, and servers. The GTX 480 initially shipped
with 15 streaming multiprocessors and six memory controllers. (One streaming
multiprocessor was disabled.) The remainder of the stack of products was filled out
by the GTX 470, with 14 streaming multiprocessors and five memory controllers.
The GTX 465 had 11 streaming multiprocessors and four memory controllers.
A year later, a new top-end model, the GTX 580, was introduced with the full 16
streaming multiprocessors and six memory controllers.
Nvidia retired the Fermi GPU line in April 2018. That is a very long life for a
semiconductor and a testament to its versatility.
When AMD bought ATI in 2006, Hector Ruiz was CEO of AMD, and Dave Orton
was CEO of ATI. Dirk Meyer, President of AMD, and Orton became the architects
of the acquisition, primarily based on the idea of building a CPU with integrated
graphics.
When Dave Orton was at SGI in 1998, he championed the development of Cobalt,
the first integrated GPU chipset (discussed in Book two). Then, when he was running
ArtX, he pushed the idea into game consoles and carried on that effort after ATI
acquired ArtX in 2000. Orton had a vision, understood Moore’s law, and saw the
inevitable assimilation of the GPU by the CPU.
The board promoted Orton to CEO of ATI, and, in 2004, he explored how hetero-
geneous processors like a GPU and a CPU could work harmoniously and deliver a
synergistic result—1 + 1 = 3. When AMD visited him for exploratory discussions
about a merger or acquisition, Orton was more interested in their vision of heteroge-
neous integration of processors than their grand scheme of taking over the computer
market. Fortunately, AMD had the right people, who agreed with Orton and shared
some of the work they had been doing on the concept. This, Orton thought, could be
a fantastic and highly successful merger.
AMD acquired ATI in 2006, and the keystone of the deal was to build a single chip
with a powerful CPU and a vector accelerator processor—a GPU. The code name for
the project was Fusion, and it represented not just the merging of GPUs and CPUs but
the merging of two companies with divergent cultures from two different countries,
using two different types of processor manufacturing and technology—what could
go wrong? Orton would need every hard-won skill he had in his long and storied
career to pull that off.
The Fusion project sputtered, and fiefdoms were threatened. AMD's old board
members, who believed in central control, insisted that California-based Orton move
to Austin. Orton had just spent the last six years commuting from Silicon Valley
to Markham, Canada, and now, with a son in high school, he did not want to
repeat the process, so he resigned. It was among AMD's dumbest moves. They let
Orton walk away with his decades of experience in precisely what AMD would now
have to learn the hard way by trial and error. They did not even have the humility or
wisdom to engage him as a consultant. To be fair, keeping Orton would have added an
extra management layer, and he knew that would not be in the company's best
interest.
Orton left AMD in 2007, and Ruiz left in 2008.
In July 2007, Rick Bergman, formerly ATI/AMD's Senior Vice President and GM
for the Graphics Division, announced he was moving up to run the entire graphics
business, reporting directly to the AMD CEO, eventually running all product groups
at AMD.
AMD would have two new CEOs after Ruiz and before AMD could introduce an
integrated GPU–CPU in the following three years.
Four torturous years after Orton left, Bergman announced the Llano Fusion
processor, renamed the APU (accelerated processing unit) for copyright reasons and
on outside marketing advice. Seven years after Orton's 2004 vision, AMD's
implementation finally came to production in June 2011. Then in September 2011,
Bergman departed to take the CEO role at Synaptics, a small, public semiconductor company.
AMD had been building its CPUs using silicon on insulator (SOI); the designs
of the ATI GPUs were in bulk-CMOS and had to be converted to SOI-compatible
layouts—not trivial. Then, the model had to be simulated, and, from that, the software
developers could start building drivers.
It might be useful to compare Intel's journey to an integrated CPU–GPU with AMD's.
Intel approached the problem with baby steps, using two dies and then integrating
the GPU. AMD went for the whole thing at once. And AMD did it while it was
hemorrhaging money and switching CEOs. AMD lost money four out of seven years
from 2006 to 2011, with two of the most significant losses ever in 2007 and 2008
[8].
To try and mitigate some of its losses, in October 2008, AMD announced it would
go fabless and spin-off its semiconductor manufacturing business into a new company
temporarily called The Foundry Company. It later became GlobalFoundries (GF),
nicknamed GloFo. GF continued to make AMD's chips but had difficulty manufacturing
the 32 nm process node Llano needed. GF’s difficulties, unfortunately, truncated the
success of Llano. Avoiding that problem, the newer Brazos Fusion/APU processor
was built on TSMC's 40 nm process. That APU launched in 2010. It was a more
modest architecture but enjoyed high-volume success as a mainstream solution.
The company was cash-flow strapped, laid off over 10% of its employees, and morale was at an all-time low. There was also a small culture war going on, as is typical after an acquisition. Each camp was frustrated by the intransigence of the other. Cell libraries, the basic design of the transistors, did not match up. IC layout policies did not match, and even the internal nomenclature did not agree. It took top management decrees to force one group to accept the other's policies and procedures. Meanwhile, the homogeneous and monolithic Intel charged ahead, running scared that AMD would beat it once again. AMD did not win this race.
Intel beat AMD to it and introduced Clarkdale in January 2010. Intel already had a graphics group and did not have to overcome cultural barriers, library mismatches, or process differences. It just had to make it work and convince management and the fab that using 60% of the die for a GPU that it could not charge extra for was a good idea.
AMD began revealing details about the Llano in February 2011 at the ISSCC in
San Francisco. The company spoke about power management enhancements made to
the x86 cores in Llano to increase performance per watt and help make the CPU/GPU
combination even more compelling. Core power gating, a processor feature that
disconnects power to the core when not in use, was employed. AMD claimed its
silicon on insulator (SOI) process allowed it to use more efficient nFET transistors for
power gating instead of the pFET transistors used with a bulk silicon manufacturing
process [9]. Both AMD and Intel’s design used shared memory (between the CPU
and GPU) and so cache design was critical to performance. AMD’s was twice the
size of Intel’s.
The photograph of the die in Fig. 2.19 shows the core's more than 35 million transistors that fit within 9.7 mm² (not counting the 1 MB of L2 cache shown on the right).
AMD said its first APU would have the following:
• Four CPU cores, DDR3 memory, and a DirectX 11 capable SIMD engine
integrated on die.
• Llano was the first design from AMD using the 32 nm SOI process technology.
• The APU used AMD’s Sabine platform for mainstream notebooks.
• The APU ran above 3 GHz. The graphics could also handle Blu-Ray playback.
Getting the right balance is always tricky. Intel thought the answer lay in x86 as the panacea for all problems. Nvidia said a GPU could solve all one's needs, whereas AMD sought balance with its Fusion design (Fig. 2.20).
There was little specific GPU functionality in Intel's Larrabee or Clarkdale processors. There was no x86 functionality in Nvidia's Fermi. And there was quite a bit of GPU and x86 functionality in AMD's Llano Fusion product.
2.2 The Fourth Era of GPUs. October 2009 73
AMD said Llano offered observable and controllable processing with context
switching, single-lane programming, and support for x86 virtual memory. It also provided support for C++ constructs, virtual functions, and dynamic-link libraries (DLLs).
There was a multitude of elements in Llano:
• Multiple instructions, multiple data (MIMD): Four threads per cycle per vector,
from different apps, per compute unit.
• SIMD: 64 FMAD vectors for four waves per cycle (floating-point fused multiply–add vectors).
The Llano used AMD’s latest processor, code named Steamroller, a four-core
K10 x86 CPU, and a Radeon HD 6000-series GPU on a 228 mm2 die. AMD had it
fabricated at Global Foundries in 32 nm.
The GPU in the Llano was AMD’s Redwood GPU core (Radeon HD 5570,
discussed later in this chapter) with some enhancements, code named Sumo. The
DirectX 11 GPU had five SIMD arrays with 80 cores (400 shader processors).
The APU presented a single face to the OS. The GPU in an APU was an actual
application accelerator, but the application had to be written correctly to take advan-
tage of it. That was different from Intel’s approach: it had an iGPU, but its only
interaction with the CPU was through main memory (and possibly the L2). An
application could only access the Intel GPU via the graphics API. AMD stated that the OS would service both the memory-management unit (MMU) and the IO-MMU under a unified address space—the CPU and GPU would use the same 64-bit pointers.
The MMUs were available to the CPU and GPU and could pass a memory pointer between them. The OS, of course, had to provide for it, and manual synchronization was required. There was no coherent view of the memory—the GPU did not snoop, meaning the GPU did not check the cache's activity to see what the CPU was up to. But the x86 also did not snoop GPU cache writes unless the memory was explicitly marked; the processors were not working in harmony because they were unaware of each other. Snooping is how one processor checks the condition of another processor's cache activity.
Back to the Future
The team that developed the chip for the Nintendo 64 in 1996 set the stage for the development of the APU. Seventeen years later, in 2013, it would become the processor for the PlayStation 4 and Xbox One game consoles. SGI to ArtX to ATI to AMD was a long, sometimes bumpy ride, but one that can be looked at with great pride and satisfaction by the developers and their customers.
Today’s SOCs have striking similarities to previous devices like AMD’s APU.
2.2.4.1 Summary
Llano did not do that well for AMD initially; its x86 cores were not as competitive as
they would later become, but it certainly set the stage for many other processors and
led to powerful game consoles (2013) and a sweep of the console market for AMD.
The second-generation APUs from AMD, announced in June 2012, were the Trinity for high-performance devices and the Brazos-2 for low-power devices. The third-generation Kaveri, for high-performance devices, was launched in January 2014.
The Fusion design and subsequent APUs were significant because the disclosure of the design in 2006 changed the way the industry thought about integrated GPUs—they did not have to be power starved and low performance. And the design fueled the console market for the next decade and beyond.
The Kepler architecture enjoyed a double life, first as an upgrade or mid-life kicker
of the GTX 600 series, and then as the GTX 700 series. Kepler was designed to
establish Nvidia’s entry into the GPU-compute market segment as well as graphics.
Built on 28 nm (from TSMC) and Nvidia’s new Kepler architecture, it would be
deployed in the GeForce GTX 780 add-in board. With 3 to 6 GB of GDDR5 memory on a 384-bit bus running at 1.5 GHz, the board had plenty of bandwidth (Fig. 2.23).
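The bandwidth claim can be checked directly: GDDR5 transfers four bits per pin per memory clock, so 1.5 GHz gives an effective 6 Gbps per pin, and a 384-bit bus then delivers 288 GB/s. A minimal sketch of the arithmetic (the function name is illustrative):

```python
def gddr5_bandwidth_gb_s(bus_width_bits: int, mem_clock_ghz: float) -> float:
    """Peak bandwidth in GB/s: GDDR5 moves 4 bits per pin per memory
    clock (double data rate on a doubled write clock)."""
    effective_gbps_per_pin = mem_clock_ghz * 4
    return (bus_width_bits / 8) * effective_gbps_per_pin

# GTX 780: 384-bit bus at 1.5 GHz -> 288 GB/s peak
```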
Kepler was a major upgrade and a new design. The transistor count more than doubled, from 3.5 billion in the GTX 680's GK104 to seven billion in the GTX 780's GK110. The GPU clock was slightly lower, dropping from 1 GHz to 900 MHz to keep the temperature down, but the number of shaders increased from 1536 to 2304, moving the performance from 3.2 TFLOPS to 4.16 TFLOPS. The chip was 398 mm² and the board drew 250 W. With GPU Boost 2.0, the GPU could boost to the highest clock speed it could sustain while operating at 80 °C; Boost 2.0 dynamically adjusted the GPU fan speed up or down as needed to maintain this temperature.
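The TFLOPS figures follow from the shader count and clock: each shader can retire one fused multiply–add (two floating-point operations) per clock. A rough sketch of that arithmetic; note the quoted figures assume boost clocks slightly above the base clocks given here:

```python
def peak_sp_tflops(shaders: int, clock_ghz: float) -> float:
    """Theoretical single-precision peak: one FMA (2 FLOPs)
    per shader per clock."""
    return shaders * 2 * clock_ghz / 1000.0

# GTX 780 (GK110): 2304 shaders near 900 MHz -> ~4.15 TFLOPS
# GTX 680 (GK104): 1536 shaders near 1 GHz  -> ~3.1 TFLOPS
```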
The significant aspect of the new GPU was the introduction of Nvidia's PhysX physics software, which allowed a game's 3D models to be destructible in a realistic and non-baked way, meaning elements would break and fall differently each time (Fig. 2.24).
Surface tension and viscous forces were modeled, as well as the density and weight of various elements in a model or scene. It also introduced real-time fluid dynamics and particles for realistic smoke, waves, and water ripples.
The GPU was designed for DirectX 11 (the fourth era of GPUs), but when DirectX 12 was introduced (in early 2015), the GTX 780 could run a lot of its features.
The Kepler GK110 and GK210 were also designed to be a parallel processing
powerhouse for Tesla and the HPC market (Table 2.1).
GK110 and GK210 provided fast double-precision computing performance to
accelerate professional HPC compute workloads; that is a key difference from the
Nvidia Maxwell GPU architecture, which was designed primarily for fast graphics
performance and single-precision consumer compute tasks. While the Maxwell architecture performed double-precision calculations at a rate of 1/32 that of single-precision calculations, the GK110 and GK210 Kepler-based GPUs could perform double-precision calculations at up to 1/3 of single-precision compute performance.
Each of the Kepler GK110/210 SMX units featured 192 single-precision CUDA
cores, and each core had fully pipelined floating-point and integer arithmetic logic units. Kepler retained the IEEE 754-2008 compliant single- and double-precision
arithmetic introduced in Fermi, including the fused multiply–add (FMA) operation.
One of the design goals for the Kepler GK110/210 SMX was to significantly increase
the GPU’s delivered double-precision performance since double-precision arithmetic
is at the heart of many HPC applications. Kepler GK110/210’s SMX also retained
the special function units (SFUs) for fast approximate transcendental operations
as in previous-generation GPUs, providing 8× the number of SFUs of the Fermi generation.
Intel continued to improve the iGPU in its CPUs. In 2012, the company introduced the 22 nm Ivy Bridge CPU with the HD 4000 iGPU. It had 16 EUs, 128 shaders, 16 TMUs, and two ROPs. The 1.2 billion transistor GPU ran at 600–1000 MHz, could generate 256 (fp32) GFLOPS, and was compatible with DirectX 12, although only with basic functions, not ray tracing, and it still had just a 512 KB L2 cache.
The company expanded the HD to 10 EUs (80 shaders—184 GFLOPS) when it
introduced the Haswell CPU in 2013. Subsequent Haswell processors had the HD
4200 (20 EUs, 160 shaders—640–768 GFLOPS) and the HD 5000 (40 EUs, 320
shaders—832 GFLOPS).
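The EU-to-shader and GFLOPS figures in this section follow a common rule of thumb: each Intel Gen execution unit contains eight fp32 ALUs (two 4-wide SIMD FPUs), each capable of one FMA per clock. A sketch of that arithmetic (the function is illustrative, not an Intel API):

```python
def intel_igpu_gflops(eus: int, clock_ghz: float) -> float:
    """fp32 peak for an Intel Gen iGPU: 8 shader ALUs per EU,
    each retiring one FMA (2 FLOPs) per clock."""
    shaders = eus * 8
    return shaders * 2 * clock_ghz

# HD 4000: 16 EUs at its 1.0 GHz top clock -> 256 GFLOPS
```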
The Intel Broadwell CPU brought out in 2014 had the Haswell DirectX 12 GT1
iGPU, with 12 EUs (96 shaders—163 GFLOPS). Subsequent versions had HD Graphics 5300, 5500, 5600, and P5700, which used the GT2 chip with 24 EUs (192 shaders—384 GFLOPS), and the first Iris Pro graphics P6300 iGPU with 48 EUs (384 shaders—883 GFLOPS).
In 2015, Intel brought out the popular i7 6700k (code named Skylake) using a
Broadwell iGPU with 24–72 EUs (576 shaders—1152 GFLOPS).
In 2017, the seventh-gen CPU i7-7700K (code named Kaby Lake) came out with the new HD 630 iGPU with 24 EUs. Kaby Lake was also available with Iris Plus graphics 650 with 48 EUs (384 shaders—883 GFLOPS).
Then in 2018, Intel released the ninth-gen Core i9-9900K (code named Coffee Lake), with as many as eight CPU cores. The UHD 630 iGPU (GT2) had 24 unified EUs. Intel brought out a particular version, the i9-9900KF, that did not have an iGPU.
Intel introduced its Gen 11 integrated graphics processor Core H series Tiger
Lake processor in May 2021 and claimed the iGPU had enhanced execution units
(Fig. 2.25).
The Gen 11 iGPU had 64 EUs (512 shaders—1126 GFLOPS), more than double the Intel Gen 9 iGPU's 24 EUs. Intel said the Gen 11 GPU would break the 1 TFLOPS barrier. The iGPU was released in early 2019 and built with an Intel 10 nm process using a new SuperFin process [11], shown in Fig. 2.26. When announced in the fall of 2020, Intel said it was the largest single intranode enhancement in Intel's history.
Tiger Lake used Intel’s Willow Cove cores, the successor to Sunny Cove.
Tiger Lake was the first processor family to use Intel’s 10 nm SuperFin transis-
tors. According to Intel, the SuperFin, an intranode enhancement, would improve
performance compared to a full-node transition [12].
What made Intel’s move to integrate the GPU with the CPU was the commitment
of silicon. Over 50% of the Gen 11 integrated processor got devoted to the iGPU and
associated image processing and display as shown in Fig. 2.27.
In 2013, when Intel was offering the Haswell processor with 10 execution units,
Senior Vice President Mooly Eden told a group of analysts he hated GPUs.
It was, as you can imagine, a shocking statement. It was not that Eden hated GPU
technology; he hated the cost of the GPUs. “Sixty percent of the die,” he said, “goes
Fig. 2.27 Die shot of Intel’s 11th Gen Core processor showing the amount of die used by the GPU.
Courtesy of Intel
to the GPU—60 percent!” “And you know what we get paid for that?” he asked.
“Nothing, zero dollars, not a dime.” [13].
And so it had been since Q1'10, when Intel first put the GPU in the CPU. From 50 to 60% of an Intel processor's die area was given away for free. It was not a good business proposition. Intel had painted itself into a corner just to beat AMD. But, as Intel found out too late, AMD had a different business model for its integrated GPU processor, the APU—it charged for the GPU contribution.
Intel’s integrated GPUs were always unfairly compared to discrete GPUs, and
the performance difference was embarrassing at times—but it was (and is) an unrea-
sonable evaluation (see Flops Versus Fraps: Cars and GPUs charts in Book two).
The actual performance was quite good, and given that it was for free, it was disin-
genuous of benchmarkers to make such comparisons without including the price.
One tester, Jon Peddie Research, used price, power consumption, and performance (Pmark [14]) in its Mt. Tiburon Testing Labs reports to evaluate AIBs. Refer to Book
two, Why Good Enough Is Not, for an explanation as to why iGPUs cannot match a
dGPU’s performance.
Furthermore, Intel was unhappy with the big hunk of the high-performance and server market Nvidia and AMD had taken by providing GPUs as compute and AI accelerators.
Therefore, in early 2017, Intel decided it would end its attempted processor hege-
mony and not only acknowledge the value of GPUs but launch a project to create an
entire top-to-bottom dGPU product line. In late 2017, Intel shocked the industry and
hired AMD’s Senior Vice President and Chief Architect of the Radeon Technologies
Group, Raja Koduri for the job (Fig. 2.28).
Koduri became Intel’s Chief Architect and Senior Vice President of the newly
formed Core and Visual Computing Group. He also was appointed General Manager
of a new initiative to drive edge computing solutions [15]. Under Koduri’s leadership,
Intel would launch the Xe GPU product line, discussed later in this chapter.
The follow-on GPU to Nvidia's 2013 Kepler architecture was the Maxwell design introduced in 2014. Built on a 28 nm TSMC process like Kepler, it was a completely new architecture, designed for computer graphics rather than GPU compute as Kepler had been. Maxwell was introduced in the GeForce 700 series, GeForce 800M series, and GeForce 900 series.
Second-generation Maxwell GPUs introduced several new technologies:
Dynamic Super Resolution, Third-Generation Delta Color Compression, Multi-
Pixel Programming Sampling, Nvidia VXGI (Real-Time Voxel-Global Illumination),
VR Direct, Multi-Projection Acceleration, Multi-Frame Sampled Anti-Aliasing
(MFAA), and Direct3D12 Feature Level 12_1. HDMI 2.0 support was also added.
The GPU could run voxel illumination, and the company produced an amazing
demo showing the Apollo lander on the moon (Fig. 2.29).
The GM204 was a large chip with a die area of 398 mm² and 5.2 billion transistors. It had 2048 shading units, 128 texture-mapping units, and 64 ROPs. The GPU was used in the Nvidia GeForce GTX 980 and came with 4 GB of GDDR5 memory on a 256-bit memory interface running at 1.753 GHz (7 Gbps effective). The GPU ran at 1.127 GHz and could be boosted up to 1.216 GHz. A dual-slot AIB, the GTX 980 drew power from two 6-pin power connectors (on the top of the AIB) and consumed 165 W maximum. Display outputs included DVI, HDMI 2.0, and three DisplayPort 1.4a connectors. The board used a PCI Express 3.0 x16 interface and ran DirectX 12.
2.3 The Fifth Era of GPUs (July 2015) 83
Fig. 2.29 Nvidia Maxwell GPU running voxel illumination. Courtesy of Nvidia
The fifth era of GPUs was marked by the introduction of DirectX 12 (D3D12) and
featured advanced low-level programming which reduced driver overhead.
DirectX 12 differed from DirectX 11 in that it was closer to the GPU (low level), like AMD's Mantle, and it gave developers fine-grained control over how games could interact with the CPU and GPU.
AMD introduced its new RX 480 (code named Ellesmere) with 2304 stream processors delivering 5.8 TFLOPS, built with 14 nm FinFETs at GlobalFoundries. The GPU was based on AMD's graphics core next (GCN) 4.0 architecture and had a number of fundamental features that defined it, including:
• Primitive discard accelerator
• Hardware scheduler
• Instruction prefetch
The GCN shader compiler created wavefronts (also called simply waves) that
contained 64 work-items. When every work-item in a wavefront was executing the
same instruction, the organization was very efficient. Each GCN compute unit (CU)
included four SIMD units that consisted of 16 ALUs; each SIMD executed a full
wavefront instruction over four clock cycles (Fig. 2.30). The main challenge then
became maintaining enough active wavefronts to saturate the four SIMD units in a
CU.
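The relationship between wavefront width, SIMD width, and issue rate can be sketched as follows; the helper below is illustrative, not AMD code:

```python
def cycles_per_instruction(wave_size: int, simd_lanes: int) -> int:
    """A SIMD issues one wavefront instruction across its lanes each
    cycle, so a full wavefront takes wave_size / simd_lanes cycles."""
    if wave_size % simd_lanes != 0:
        raise ValueError("wavefront must be a multiple of the SIMD width")
    return wave_size // simd_lanes

# GCN: a 64-wide wavefront on a 16-lane SIMD takes 4 cycles per
# instruction, which is why a CU's four SIMDs need many wavefronts
# in flight to stay saturated.
```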
A GCN CU had four SIMDs, each with a 64 KiB register file of 32-bit vector general-purpose registers (VGPRs), for a total of 65,536 VGPRs per CU. Each SIMD also had a register file of 32-bit scalar general-purpose registers (SGPRs). Until GCN3, each SIMD contained 512 SGPRs, and from GCN3 on, the count was bumped to 800. That yields 3200 SGPRs total per CU, or 12.5 KiB.
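These totals multiply out as stated: 64 KiB of 4-byte VGPRs per SIMD across four SIMDs gives 65,536 VGPRs, and 800 SGPRs per SIMD (implied by the 3200-per-CU total) gives 12.5 KiB. A quick check, with constants taken from the text:

```python
KIB = 1024

def vgprs_per_cu(simds: int = 4, file_kib: int = 64,
                 reg_bytes: int = 4) -> int:
    """Total 32-bit VGPRs in a GCN compute unit."""
    return simds * (file_kib * KIB // reg_bytes)

def sgpr_kib_per_cu(simds: int = 4, sgprs_per_simd: int = 800,
                    reg_bytes: int = 4) -> float:
    """Total SGPR storage per CU in KiB (GCN3 and later)."""
    return simds * sgprs_per_simd * reg_bytes / KIB
```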
The RDNA architecture was designed for a new narrower wavefront with 32 work-
items, called wave32, that was optimized for efficient compute. Wave32 offered
several critical advantages for compute and complemented the graphics-focused
wave64 mode.
The GPU also incorporated H.265 decode at up to 4K and encode at 4K and
60 FPS. It did not incorporate HBM as the previous generation did (Fig. 2.31).
AMD’s Vega GPU which had also been known as Greenland featured 4096 stream
processors. The stream processors utilized the advancements made in the IP v9.0
generation of graphics SOCs by AMD. The Vega 10 GPU could be configured with
as much as 32 GB of HBM2 VRAM and use 18 billion transistors.
AMD also announced the Embedded Radeon E9260 and E9550 Polaris GPUs for embedded markets.
And it introduced the A12 APU with an R7 GPU with 512 stream processors. It was the last APU based on the previous-generation GCN architecture.
Intel surprised the industry by announcing a CPU with an embedded AMD GPU—code named Kaby Lake G. The company built the CPU in its 14 nm fab. It was Intel's eighth-generation Intel Core processor and had a Radeon RX Vega M GH GPU co-processor. The company offered multiple versions—the i7-8809G, i7-8808G, 8709G, 8706G, 8705G, and i5-8305G series.
Fig. 2.32 Intel multi-chip Kaby Lake G. The chip on the left is the 4 GB HBM2, the middle chip is the Radeon RX Vega, and the chip on the right is the eighth-gen core. Courtesy of Intel
The high-end GPU had 24 compute units (1536 shaders) and 96 texture units and ran at a 1.06 GHz clock (boost to 1.19 GHz); all the others were a 20 CU (1280 shaders), 80 texture unit version running at a 0.93 GHz clock (boost to 1.01 GHz) (Fig. 2.32).
The GPU was DirectX 12 and OpenGL 4.5 compatible and had 4 GB of internal
HBM.
Intel discontinued the product line in January 2020, two years after announcing
it would build its own dGPU. But it was significant because Intel was able to refine
its chip-to-chip interconnection scheme which led to Intel’s embedded multi-die
interconnect bridge (EMIB) for its chiplet Xe dGPU design.
2.3.3 Nvidia
Nvidia added the Nvidia GeForce GTX 1060 with a starting price of $249 to its
Pascal family of gaming GPUs, complementing the GTX 1080 and 1070 following
their launch two months earlier. The GTX 1060 had 1280 CUDA cores, 6 GB of GDDR5 memory running at 8 Gbps, and a boost clock of 1.7 GHz, which could easily be overclocked to 2 GHz for further performance.
The GTX 1060 also supported Nvidia Ansel technology, a game-capture tool that
allowed users to explore, capture, and compose gameplay shots, pointing the camera
in any direction, from any vantage point within a gaming world, and then capture
360° stereo photospheres for viewing with a VR headset or Google Cardboard.
Nvidia also announced Xavier, a new SoC based on the company's next-gen Volta GPU, which Nvidia hoped would be the processor in future self-driving
cars. Xavier featured a high-performance GPU and the latest ARM CPU, yet had great energy efficiency, according to the company (Fig. 2.33).
Using the expanded 512-core Volta GPU in Xavier, the chip was designed to
support deep learning features important to the automotive market, said the company.
A single Xavier-based AI car supercomputer would be able to replace cars configured
with Drive PX 2 with two Parker SoCs and two Pascal GPUs. Xavier was to be built using a 16 nm FinFET process and had seven billion transistors—it was probably the biggest chip ever built anywhere at the time.
AMD introduced its new Navi GPU architecture in mid-2019. New product
announcements are typically made at CES in January and Computex in June or July, where OEM meetings can be arranged in one central place, reducing travel time and expense for everyone.
AMD also introduced a new name for its architecture—Radeon DNA—RDNA.
Navi GPUs were the first to use AMD’s new Navi RDNA (1.0) architecture, and
they had redesigned compute units with improved efficiency and instructions per
clock (IPC) capability. They had a multi-level cache hierarchy, which offered higher
performance, lower latency, and less power consumption than the previous series.
The new architectural design provided 25% better performance per clock per core
and 50% better power efficiency than AMD’s previous Vega generation architecture.
The first AIB announced with the Navi RDNA was the Radeon RX 5700 XT,
introduced in July 2019.
The Navi 10 also had an updated memory controller with GDDR6 support—
AMD’s first use of GDDR6 (Nvidia had employed the higher-speed memory in its
RTX 2080 a year earlier). The memory had a 256-bit bus, giving the GPU 448 GB/s of memory bandwidth with a 1.75 GHz memory clock.
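These figures are self-consistent: GDDR6 transfers eight bits per pin per memory clock (16n prefetch on a double data rate interface), so 1.75 GHz gives 14 Gbps per pin, and a 256-bit bus delivers 448 GB/s. A sketch of the arithmetic:

```python
def gddr6_bandwidth_gb_s(bus_width_bits: int, mem_clock_ghz: float) -> float:
    """Peak GDDR6 bandwidth in GB/s: 8 bits per pin per memory clock."""
    effective_gbps_per_pin = mem_clock_ghz * 8
    return (bus_width_bits / 8) * effective_gbps_per_pin

# Navi 10: 256-bit bus at 1.75 GHz (14 Gbps effective) -> 448 GB/s
```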
The Navi 10 ran at 1.68 GHz (a special anniversary unit ran at 1.98 GHz) and had 2560 shaders, 160 texture mapping units (TMUs), 64 render units, and 40 compute units. The chip had 10.3 billion transistors in a 251 mm² die; its block
diagram is shown in Fig. 2.34.
One of the significant differences in the Navi architecture was AMD’s use of a
communications fabric among the various elements in the GPU.
AMD organized the GPU into several main blocks, connected by AMD’s Infinity
Fabric. The command processor and PCI Express interface connected the GPU
to the PC system and controlled various functions. The two shader engines held
the programmable compute resources and some dedicated graphics hardware. Each
shader engine included two shader arrays, which comprised the new dual compute
units, a shared graphics L1 cache, a primitive unit, a rasterizer, and four render
back-ends (RBs). In addition, the GPU included a dedicated logic for multimedia
and display processing. The partitioned L2 cache and memory controllers routed
memory access. This was an on-die predeessor to a future chiplets design.
The command processor received API commands and, in turn, operated different
processing pipelines in the GPU, as illustrated in Fig. 2.35. The graphics command
processor managed the traditional graphics pipeline (e.g., DirectX, Vulkan, OpenGL) shader tasks and fixed function hardware. The Asynchronous Compute Engines
(ACE) implemented compute tasks that managed compute shaders. Each ACE main-
tained an independent stream of commands and could dispatch compute shader wave-
fronts to the compute units. Similarly, the graphics command processor had a stream
for each shader type (e.g., vertex and pixel). The command processor spread work
across the fixed function units and shader arrays for maximum performance.
The RDNA architecture introduced a new scheduling and quality-of-service
feature known as Asynchronous Compute Tunneling, enabling compute and graphics
workloads to co-exist in the GPU. Different shaders could execute on the RDNA
compute unit in a typical operation. However, a task could be more sensitive to
latency than other jobs. The RDNA architecture could suspend the execution of
shaders and free up compute units for high-priority tasks.
The command processor and scheduling logic partitioned graphics and compute work to facilitate dispatching to the arrays to improve performance. For example, the graphics pipeline was partitioned for screen space and then sent to each partition independently. Developers could also create scheduling algorithms for compute-based effects.
The RDNA architecture consisted of multiple independent arrays, fixed function
hardware, and programmable dual compute units. AMD could scale performance
Fig. 2.34 Block diagram of the AMD Navi 10, one of its first GPUs powered by the RDNA
architecture
from the high-end to the low-end by increasing the number of shader arrays and
altering the balance of resources within each shader array. The Radeon RX 5700
XT included four shader arrays, and each one had a primitive unit, a rasterizer, four
render back-ends (RBs), five dual compute units, and a graphics L1 cache.
The primitive units assembled triangles from vertices and were responsible for
fixed function tessellation. Each primitive unit provided culling of two primitives per
clock, making it twice as fast as the previous generation.
The rasterizer in each shader engine performed the mapping from the geometry
stages to the pixel stages. AMD subdivided the screen with other fixed function
hardware, each portion distributed to one rasterizer.
The GPU’s dual compute unit had a dedicated front-end, as shown in Fig. 2.36.
The L0 instruction cache was shared between all four SIMDs within the dual compute unit—previous instruction caches were shared between four CUs, or 16 graphics core next (GCN) SIMDs. The instruction cache of the RDNA architecture was 32 KB and four-way set-associative; it consisted of four banks of 128 cache lines that were 64 bytes long. Each of the four SIMDs could request instructions every cycle, and the instruction cache could deliver 32 bytes (typically two to four instructions) every clock to each SIMD—roughly 4× greater bandwidth than GCN.
The fetched instructions went to wavefront controllers. Each SIMD had a private
instruction pointer and a 20-entry wavefront controller for 80 wavefronts per dual
compute unit. Wavefronts could be from different work-groups or kernels, and the dual compute unit could maintain 32 work-groups simultaneously and operate in wave32 or wave64 mode.
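The cache geometry and wavefront capacity quoted above check out arithmetically; the helpers below simply multiply out the figures from the text:

```python
def l0_icache_bytes(banks: int = 4, lines_per_bank: int = 128,
                    line_bytes: int = 64) -> int:
    """RDNA L0 instruction cache size: banks x lines x line size."""
    return banks * lines_per_bank * line_bytes

def wavefronts_per_dual_cu(simds: int = 4,
                           entries_per_simd: int = 20) -> int:
    """In-flight wavefront capacity of a dual compute unit."""
    return simds * entries_per_simd

# 4 banks x 128 lines x 64 B = 32 KB; 4 SIMDs x 20 entries = 80 waves
```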
The architecture had a hypervisor agent, allowing the GPU to be virtualized
and shared between different operating systems. That was useful for cloud gaming
services in data centers where virtualization was crucial from a security and oper-
ational standpoint. Although consoles focused on gaming, many offered a suite of
communication and media capabilities and benefited from virtualizing the hardware
to deliver performance for all tasks.
The GPU could reach a theoretical performance level of 8.6 TFLOPS (the
anniversary unit could get to 10.13 TFLOPS).
The new GPU was fabricated at TSMC using its 7 nm manufacturing process, and AMD was the first to produce GPUs at that node. Nvidia went to Samsung for its chips and made them at 8 nm.
The new RDNA Navi parts employed PCI Express 4.0 as well. The Navi 10 was
AMD’s second GPU with PCIe 4.0 capability; the Vega 20 also had it, but AMD
restricted PCIe 4.0 to its high-end Vega parts, whereas with Navi, all segments had
PCIe 4.0. The new PCIe 4.0 interface operated at 16 GT/s, double the throughput of
earlier 8 GT/s PCIe 3.0-based GPUs.
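The doubling is at the link level; usable bandwidth also depends on line encoding. Both PCIe 3.0 and 4.0 use 128b/130b encoding, so a x16 Gen 4 link delivers roughly 31.5 GB/s each way versus about 15.8 GB/s for Gen 3. A sketch of the arithmetic:

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int = 16) -> float:
    """Usable one-way PCIe bandwidth in GB/s for PCIe 3.0/4.0,
    which use 128b/130b line encoding."""
    return gt_per_s * lanes * (128 / 130) / 8

# PCIe 4.0 x16: 16 GT/s -> ~31.5 GB/s; PCIe 3.0 x16: 8 GT/s -> ~15.8 GB/s
```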
2.3.4.2 Summary
In October 2019, AMD launched the RX 5500 series AIBs based on its 7 nm Navi
14 RDNA 1.0 GPU. Introduced with the AIB were three new gaming features: an anti-lag feature that shortened the time from a mouse click to screen action, an image-sharpening feature, and a variable resolution feature called Boost.
In August 2016, AMD acquired HiAlgo, a developer of PC gaming tools designed
to improve the gaming experience without overclocking the GPU. The company was
founded in 2011 in Sunnyvale by Eugene Fainstain and Alex Tsodikov.
HiAlgo applications were plug-ins that helped hardware perform better. In 3D
games, the application allocated computer resources by dynamically changing frame
rates and picture resolution. The applications used code-injection techniques to attach
to a game.
The company introduced three tools that gamers and developers could use: Boost,
Chill, and Switch.
Boost was a utility that made gameplay smoother with less lag. It intercepted
and, on the fly, modified commands sent from the game to the graphics AIB, which
the company claimed optimized performance frame by frame. During fast-paced moments of the game, it lowered the rendering resolution, causing the frame rate and responsiveness to increase, noticeably raising the effective performance of the GPU.
Chill was a smart frame-rate limiter utility that reduced GPU and CPU churning
(and subsequently overheating). The application tracked what was happening in the
game and allocated computational resources for the best game performance. When
there was not much action in the game, the application lowered the frame rate. When
the action picked up, the frame rate went up. The company said Chill prevented
underclocking and saved power. That increased gaming time on laptops.
Switch was a utility that changed the game resolution from 100 to 50% with a
button push, which increased the frame rate. It worked by intercepting and modifying commands sent from the game to the AIB on the fly. Rendering resolutions could be adjusted even if the game could not do so without a restart.
When AMD acquired the firm, its software ran only on games compatible with
DirectX 9; DirectX 11 was on the to-do list.
With the RX 5500 XT launch in December 2019, AMD introduced three new soft-
ware features: Anti-Lag, Boost, and Chill. In HiAlgo terms, these were the equivalent
of Boost, Switch, and Chill.
AMD’s Radeon Boost dynamically lowered the resolution of the entire frame
when fast on-screen character motion was detected (from the user’s mouse activity).
That allowed higher FPS with little perceived impact on quality. The feature reduced
screen resolution on a linear scale (down to a 50% minimum). AMD also referred to
that feature as Motion Adaptive Resolution.
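The behavior described (a resolution ramp triggered by fast input, with a 50% floor) can be sketched as a simple heuristic. The threshold and function below are hypothetical, not AMD's actual driver logic:

```python
def boost_resolution_scale(mouse_speed_px_s: float,
                           threshold_px_s: float = 1500.0,
                           min_scale: float = 0.5) -> float:
    """Hypothetical Radeon Boost-style heuristic: keep full resolution
    below the motion threshold, then ramp linearly down to the floor."""
    if mouse_speed_px_s <= threshold_px_s:
        return 1.0
    excess = min(mouse_speed_px_s / threshold_px_s - 1.0, 1.0)
    return 1.0 - (1.0 - min_scale) * excess
```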
Radeon Chill was a power-saving feature that dynamically regulated the frame rate based on in-game character and camera movements. As activity decreased, Radeon Chill reduced the frame rate and saved power, helping lower the GPU's temperature. Radeon Chill worked for most titles using DirectX 9, 10, 11, and 12, and Vulkan.
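Conceptually, such a limiter maps an activity level onto a frame-rate cap between a floor and a ceiling; the numbers and function here are illustrative, not AMD's implementation:

```python
def chill_target_fps(activity: float, fps_min: int = 60,
                     fps_max: int = 144) -> float:
    """Hypothetical Chill-style limiter: interpolate the frame-rate cap
    from fps_min (idle) to fps_max (full action); activity is in [0, 1]."""
    activity = max(0.0, min(1.0, activity))
    return fps_min + (fps_max - fps_min) * activity
```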
Radeon Anti-Lag was a feature that helped reduce input-to-response latency (input
lag) by reducing the time between the game’s sampling of user controls and the output
appearing on the display.
Radeon Chill, Radeon Anti-Lag, and Radeon Boost were mutually exclusive, and
only one could be enabled at a time.
2.3.5 Summary
The AMD RDNA Navi line of GPUs was significant because AMD introduced a new, more efficient, and powerful architecture with RDNA and at the same time took advantage of a process shrink to 7 nm to add several new features.
The Gen 9.5-integrated UHD Graphics 620 (GT2) was in processors from the Whiskey Lake generation. The DirectX 12-compatible GT2 version of the Skylake GPU had 24 EUs (192 shaders) clocked at up to 1150 MHz, and it had a dedicated 3 MB L3 cache. The HD 620 used a shared memory architecture with the CPU (DDR4-2133).
A block diagram of the GT2 iGPU is shown in Fig. 2.37.
The video engine supported H.265/HEVC Main10 profile in hardware with 10-bit
color. Google’s VP9 codec could be hardware decoded. The Core i7 chips supported
HDCP 2.2 and, therefore, Netflix 4K. HDMI 2.0, however, only worked if the TV
had a high-speed level shifter and protocol converter (LSPCon) chip in it.
What made it significant was its large number of shaders—and that the processor
and iGPU had been designed for low-power operation in notebooks. It also showed
the direction the Xe basic design would likely follow.
A prelude to Xe?
Intel’s integrated Gen 11 GPU was a monolithic design with significant
microarchitectural enhancements (over earlier generations) that improved
performance per watt. The Gen 11 GPU graphics technology (GT) architecture,
said the company, targeted modern thin-and-light mainstream and premium PC
designs. At the time, speculation was that the Gen 11 graphics architecture
would be the basis for Intel’s upcoming Xe discrete GPU architecture.
94 2 The Third- to Fifth-Era GPUs
The Gen 11 GPU GT2 architectural enhancements (over Gen 9) improved perfor-
mance per FLOP by removing bottlenecks and increasing the efficiency of the
pipeline.
The design had 64 EUs (512 shaders), 32 TMUs, and 8 ROPs; it was DirectX
12 (12_1) compatible and generated about 1.1 TFLOPS. It also supported 3D
rendering, GPU computing, and programmable and fixed function media
capabilities. Intel split the iGPU architecture into several areas: the Global
Assets, which had some fixed function blocks that interfaced to the rest of the
SoC; the Media fixed function slice; the 2D blitter; and the Graphics
Technology Interface (GTI). The Slice housed the 3D fixed function geometry,
eight subslices containing the EUs, and a slice common that held the L3 cache
and the rest of the fixed function blocks supporting the render pipeline.
For the Gen 11 GPU-based products, Intel aggregated eight subslices into one
slice. Thus, a single slice aggregated a total of 64 EU. Aside from grouping subslices,
the slice integrated additional logic for the geometry, L3 cache, and the slice common.
The Gen 11 GPU’s 3D fixed function geometry had a typical render front-end that
mapped to the graphics pipeline in the OpenGL, DirectX, Vulkan, or Metal APIs.
Additionally, it included the Position Only Shading pipeline, or POSH pipeline,
used to implement position only tile-based rendering (PTBR), mentioned above.
Vertex fetch (VF), one of the first stages in the geometry pipe, was, as its
name implied, used to fetch vertex data from memory. That data was then
reformatted, written into a buffer, and used by later stages. A vertex
typically has more than one attribute (e.g., position, normals, texture
coordinates, color). As graphics workload complexity increased, the number of
vertex attributes grew too. The Gen 11 GPU increased the VF input rate from
four attributes/clock to six and improved the input data cache efficiency.
Another significant VF change in the Gen 11 GPU was the increased number of
draw calls it could handle at the same time, enabling streaming of
back-to-back draw calls. Newer APIs like DirectX 12 and Vulkan reduced the
overhead for draw calls significantly, which increased the number of draw
calls that could be made per frame, improving the visual quality [16]. Shown
in the block diagram in Fig. 2.38 are the iGPU and CPU with a connecting ring.
The Gen 11 GPU also made tessellation improvements. It delivered up to twice
the Hull Shader thread dispatch rate, increasing the efficiency of output topology,
especially for patch primitives subject to low tessellation factors. Another notable
new feature of the Gen11 iGPU was variable rate shading (VRS).
This was another significant integrated GPU introduced by Intel. The company
devoted over 50% of its precious die to the free GPU in the CPU. And it wasn’t
just the cost of the silicon; there were added costs for testing and driver
writing. Those were big investments, and Intel would not have made them if it
did not see a long-term strategic advantage in doing so.
The chip used a ring-based topology bus between CPU cores, caches, and the GPU. It
had dedicated local interfaces for each connected agent. Intel first introduced the ring
architecture for graphics in the Larrabee design in 2006. The ring connected a system
agent for off-chip system memory transactions to/from CPU cores and to/from the
iGPU. Intel processors included a shared Last-Level Cache (LLC) connected to the
ring, and the integrated GPU shared the LLC. So the ring design for GPUs by Intel
had been around for many years.
The Gen 11 GPU integrated within Intel’s Core processors implemented multiple
clocks. Intel partitioned the clocks into a processor graphics clock domain, a per-
CPU core clock domain, and a ring interconnect clock domain. Intel used that
partitioning in power management scenarios.
Since before the Larrabee project in 2006, Intel had been a regular contributor and
presenter at the ACM SIGGRAPH conferences worldwide. The papers were always
well received, frequently referenced, and respected. However, most of the advanced
concepts Intel presented did not seem to find their way into Intel’s GPUs. The Gen
11 iGPU was an exception and showed some of the power Intel would bring to its
Xe dGPUs, announced in 2018 and known unofficially as Gen 12.
Two significant features of Gen 11 were coarse pixel shading and POSH.
Fig. 2.39 CPS added two more steps in the GPU’s pipeline
Decoupled sampling techniques such as coarse pixel shading (CPS) can lower the
shading rate while resolving visibility at full resolution, preserving details along
geometric edges [17]. Coarse pixel shading decreased the workload on GPUs by
reducing the number of color samples used to render an image. Decreasing pixel
shader runs also saved power. Coarse pixel shading provided application developers
with a new rate control on pixel shading operations. CPS is better than upscaling. It
lets developers work at the render target resolution while sampling the more slowly
varying color values at the coarse pixel rate. AMD and Nvidia introduced similar
features in their GPUs and drivers (Fig. 2.39).
CPS cut in half the number of shader invocations, yet there was almost no
noticeable difference on a high-pixel-density display (Fig. 2.40).
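The core idea can be illustrated with a toy loop, assuming a 2 × 2 coarse pixel size (the function names and the stand-in shader are invented for illustration): pixels are still produced at full resolution, but the expensive color shader runs once per coarse block and its result is reused by the pixels the block covers.

```python
# Toy illustration of coarse pixel shading (CPS): one shader invocation per
# coarse block, replicated across the full-resolution pixels it covers.

def coarse_shade(width: int, height: int, coarse: int = 2,
                 shader=lambda x, y: (x % 256, y % 256, 128)):
    """Run the (stand-in) pixel shader once per coarse block; every
    full-resolution pixel in the block reuses that color."""
    invocations = 0
    frame = [[None] * width for _ in range(height)]
    for cy in range(0, height, coarse):
        for cx in range(0, width, coarse):
            invocations += 1                     # one shader run per block
            color = shader(cx, cy)
            for y in range(cy, min(cy + coarse, height)):
                for x in range(cx, min(cx + coarse, width)):
                    frame[y][x] = color          # pixel grid stays full res
    return frame, invocations

frame, n = coarse_shade(16, 16, coarse=2)
print(n)  # 64 coarse invocations instead of 256 full-rate ones
```

With a 2 × 1 coarse size the invocation count halves; with 2 × 2 it drops to a quarter, which is where the power savings come from.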
Intel first described coarse pixel shading as a technique in its 2014
High-Performance Graphics paper [18].
Fig. 2.40 Geometry with red boxes is sufficiently far from the camera and,
therefore, of minor importance to the overall image. Thus, the color shading
frequency could be reduced using CPS with no noticeable effect on the visual
quality or the frame rate. Courtesy of Intel
The POSH pipeline, Intel’s position only tile-based rendering (PTBR) system,
deployed two geometry pipelines—a standard rendering pipeline and the POSH
pipeline.
The POSH pipeline ran the position shader in parallel with the main render pipeline.
Still, it typically generated results much faster, as it only shaded position attributes
and skipped the rendering of pixels. The POSH pipeline ran ahead of the rendering
pipeline and used attributes from the shaded position to compute visibility informa-
tion for triangles and determine if they were culled. The object visibility recording
unit of the POSH pipeline calculated the visibility, compressed the data, and recorded
it in memory, as illustrated in Fig. 2.41.
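A position-only culling pass can be sketched like this. It is a simplified 2D model, not Intel's hardware, and all names are invented: backface and trivial off-screen tests use only vertex positions, and the pass records one visibility bit per triangle for the full pipeline to consume later.

```python
# Sketch of the POSH/PTBR idea: a cheap pass reads only vertex positions,
# culls back-facing or entirely off-screen triangles, and records a
# per-triangle visibility bit for the main render pipeline.

def signed_area(p0, p1, p2) -> float:
    """Twice the signed area; positive for counter-clockwise winding."""
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])

def off_screen(tri) -> bool:
    """Trivially rejected: entirely outside the [-1, 1] clip square."""
    return any(all(v[axis] < -1.0 for v in tri) or
               all(v[axis] > 1.0 for v in tri) for axis in (0, 1))

def posh_pass(positions, triangles):
    """Return one visibility bit per triangle using positions only."""
    bits = []
    for i0, i1, i2 in triangles:
        tri = (positions[i0], positions[i1], positions[i2])
        bits.append(signed_area(*tri) > 0.0 and not off_screen(tri))
    return bits

verts = [(-0.5, -0.5), (0.5, -0.5), (0.0, 0.5),
         (2.0, 2.0), (3.0, 2.0), (2.5, 3.0)]
tris = [(0, 1, 2),   # on screen, counter-clockwise -> visible
        (0, 2, 1),   # clockwise -> back-facing, culled
        (3, 4, 5)]   # entirely off screen -> culled
print(posh_pass(verts, tris))  # [True, False, False]
```

Because the pass touches no colors, normals, or textures, it needs far less bandwidth than full shading and can run well ahead of the render pipeline.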
In theory, POSH was a faster and more power-efficient way to handle certain types
of geometry processing. Overall performance and applicability to workloads would
depend on the rendering mode used by games. However, Intel was thinking about
maximizing memory bandwidth and introducing more advanced features around the
idea. The company patented the concept [19].
2.3.8 Summary
The Gen 11 iGPU signaled what Intel would include in its discrete GPU. The Xe
architecture was designed to span from integrated GPUs to high-end workstation and
data center accelerator GPUs.
Intel open-source developers began preparing the graphics compiler back-end
changes for the Gen 12 Xe GPU, starting with Tiger Lake processors. Significant
architectural changes were revealed compared to the Ice Lake Gen 11 GPU. The
patches showed that the Gen 12 GPU ISA was one of the biggest reworks of the
Intel EU ISA since the original i965 graphics a decade earlier.
Intel updated nearly every instruction field, opcode, and register type. Other
significant changes included removing the hardware register scoreboard logic,
which left it up to the compiler to ensure data coherency between register
reads and writes by inserting synchronization instructions.
Intel’s 11th-gen desktop CPUs, Rocket Lake, launched in March 2021 and
competed with AMD’s Ryzen 5000 series.
2.4 Conclusion
The third era of the GPU saw the introduction of unified shaders. Prior to
that, the shaders were semi-fixed function and would sit idle while other
semi-fixed function shaders might be overburdened, which was not an efficient
use of processing power. Unified shaders were one step closer to the ultimate
all-compute GPU.
Also during the era, we saw the introduction of the GPU integrated with the
CPU, creating the iGPU. Intel was the first to market with such a device and
quickly rose to become the number one GPU supplier, since almost every CPU now
had a GPU. The iGPU would still trail the discrete GPU (dGPU) in performance,
albeit getting more powerful with every introduction, and you couldn’t beat
the price—free.
References
3. Hruska, J. 10 years ago, Nvidia launched the G80-powered GeForce 8800 and changed PC
gaming, computing forever, Extreme Tech, November 8, 2016, https://www.extremetech.com/
gaming/239078-ten-years-ago-today-nvidia-launched-g80-powered-geforce-8800-changed-
pc-gaming-computing-forever
4. Peddie, J. Chasing the nanometer: When does it become indivisible? (April 30, 2020), https://
www.jonpeddie.com/news/chasing-the-nanometer/
5. Forsyth, T. Why didn’t Larrabee fail?, TomF’s Tech Blog, (August 15, 2016), https://tomfor
syth1000.github.io/blog.wiki.html[[Whydidn’tLarrabeefail?]]
6. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S.,
Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.
Larrabee: A Many-Core x86 Architecture for Visual Computing (PDF). ACM Transactions on
Graphics. Proceedings of ACM SIGGRAPH, (August 2008), https://dl.acm.org/doi/pdf/10.1145/1360612.1360617
7. Wittenbrink, C. M., Kilgariff, M., and Prabhu, A. Fermi GF100 GPU Architecture, IEEE Micro,
pp. 50–59, vol. 31, (March/April 2011), DOI Bookmark: https://doi.org/10.1109/MM.2011.24,
https://www.computer.org/csdl/magazine/mi/2011/02/mmi2011020050/13rRUILtJih
8. AMD Net Income/Loss 2006–2021|AMD, https://www.macrotrends.net/stocks/charts/AMD/
amd/net-income-loss
9. Transistors as Switches, https://learn.digilentinc.com/Documents/312
10. Smith, R. AMD’s Radeon HD 6970 & Radeon HD 6950: Paving The Future For AMD,
(December 15, 2010), https://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-
hd-6950/4
11. Dillinger, T. A “Super” Technology Mid-life Kicker for Intel, August 17, 2020, Semi-
Wiki, https://semiwiki.com/semiconductor-manufacturers/intel/289716-a-super-technology-
mid-life-kicker-for-intel/
12. Intel Architecture Day 2020, https://www.intel.com/content/www/us/en/newsroom/resources/
press-kits-architecture-day-2020.html#gs.cbozbp
13. Peddie, J. Free GPU — looking a gift horse in the mouth, TechWatch, (January 22, 2019),
https://www.jonpeddie.com/editorials/free-gpu-looking-a-gift-house-in-the-mouth/
14. Dow, R. AMD’s RDNA 2.0 Add-in board delivers performance in the mid-range, Jon Peddie
Research, (August 9, 2021), https://www.jonpeddie.com/reviews/6600-xt
15. Raja Koduri Joins Intel as Chief Architect to Drive Unified Vision across Cores and Visual
Computing, (November 8, 2017), https://newsroom.intel.com/news-releases/raja-koduri-joins-
intel/#gs.f2gsw5
16. Intel Processor Graphics Gen11 Architecture, Version 1.0, https://www.intel.com/content/
dam/develop/external/us/en/documents/the-architecture-of-intel-processor-graphics-gen11-
r1new-810410.pdf
17. Xiao, K., Liktor, G., and Vaidyanathan, K. Coarse Pixel Shading with Temporal Super-
sampling, ACM SIGGRAPH I3D ’18 Symposium on Interactive 3D Graphics and Games,
(May 4–6, 2018), https://software.intel.com/content/dam/develop/external/us/en/documents/CPST_preprint.pdf
18. Vaidyanathan, K., Salvi, M., Toth, R., Foley, T., Akenine-Möller, T., Nilsson, J., Munkberg, J.,
Hasselgren, J., Sugihara, M., Clarberg, P., Janczak., T., and Lefohn, A. Coarse Pixel Shading,
HPG ‘14: Proceedings of High Performance Graphics’, (June 2014), https://www.researchg
ate.net/publication/288811329_Coarse_pixel_shading
19. Position Only Shader Context Submission Through a Render Command Streamer, US patent
20170091989A1, https://patents.google.com/patent/US20170091989A1/en
Chapter 3
Mobile GPUs
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 101
J. Peddie, The History of the GPU - New Developments,
https://doi.org/10.1007/978-3-031-14047-1_3
Fig. 3.1 The rise and fall of mobile graphics chip and intellectual property (IP) suppliers versus
market growth
The GPUs that go into mobile devices come from various suppliers, as shown
in Fig. 3.3. Some specialize in market segments such as automotive or gaming, and
some serve all markets and platforms.
This chapter will look at the companies in the mobile market, which is
dominated by makers of smartphone and tablet GPUs.
3.2 Mobiles: The First Decade (2000–2010) 103
3.1 Organization
This chapter has two parts—the first decade (2000–2010) and the second decade
and on (2010 and on). There is, of course, a slight overlap; the companies
didn’t know at the time they were making history.
The mobile market doesn’t follow API advancements as the PC market does
and, therefore, doesn’t have eras like the PC. The APIs for the mobile market
tend to move in fits and starts, and their adoption is equally irregular. Most
mobile devices use the Android operating system or some Linux variant. Apple
has its own Unix-based OS, iOS. The Android/Linux OSes use the OpenGL ES
API and in the 2020s began adopting Khronos’ Vulkan API. Apple has its own API,
Metal.
Up until 2003, with the establishment of OpenGL ES 1.0, the mobile market was
chaotic with various APIs. That made it difficult (to impossible) for software
developers to get their applications to run on various devices. This is
discussed more fully in book two, Eras and Environment, in the chapter on The
GPU Environment—APIs. By 2010, OpenGL ES 2.0 was in place and the mobile
market had, for the most part, stabilized. Apple continued to be an outlier
and offer only its own API for its own OS. But Linux and various versions of
Android used OpenGL ES, and applications could run on any platform.
From 1999 to 2003, the mobile market had several independent and proprietary
APIs such as Nokia’s Symbian, RIM’s BlackBerry API, and Qualcomm’s BREW,
as well as Java ME mobile 3D graphics APIs.
104 3 Mobile GPUs
The situation was chaotic. In those early years, Nokia’s Symbian OS was the
most popular, and there were four different UIs that ran on top of the Symbian
OS builds: S60, S80, S90, and UIQ. Despite their common OS, no app written for
one UI could be used on the others. Nevertheless, Nokia had almost 50% of the
market by 2007.
Its numbers began to decline following the introduction of the iPhone. Nokia finally
gave up and dumped Symbian in favor of Android and OpenGL ES in 2010.
OpenGL ES stabilized the mobile market, and the SoC developers began following
it as PC developers followed DirectX.
Imagination was one of the pioneers of GPUs in mobile, and in 2000, it introduced
its first transform and lighting unit, Elan, which drove two CLX2s for the Naomi 2
arcade machines (the second-generation PowerVR2’s code name was CLX2).
Imagination enjoyed several years as the GPU in Texas Instruments’ (TI’s) Open
Multimedia Application Platform (OMAP) smartphone processors, Samsung’s SoCs,
and those of other smartphone SoC suppliers.
In 2019, Imagination Technologies introduced its Imagination A-Series. The
company said the A-Series was its most crucial GPU launch since the mobile
PowerVR GPU 15 years before.
Up to 6 TFLOPS and low power in a mobile device
In November 2020, the company said it had topped itself and introduced the
Imagination B-Series—an expanded range of GPU IP (see chart in Fig. 3.4). It
delivered up to 6 TFLOPS with a reduction in power up to 30%, a 25% area reduction
over earlier generations, and the company claimed its fill rate was up to 2.5 times
higher than competing IP cores.
Fig. 3.4 Big jump in GPU power efficiency. Courtesy of Imagination Technologies
3.3 Imagination Technologies First GPU IP (2000) 105
With Imagination A, the company said it made an extraordinary leap over earlier
generations. That resulted in an industry-leading position for performance and power
characteristics. The B-Series was a further evolution. The company said that it
delivered the highest performance per mm² for GPU IP with new configurations for lower
power. It reduced the bandwidth by up to 35% for a given performance target. All
that, said the company, made it a compelling solution for top-tier designs.
The B-Series offered a wide range of configurations, which expanded the options
available for its customers. Its scalability meant the B-Series was suitable for many
markets, including entry-level to premium mobile phones, consumer devices, Internet
of things, microcontrollers, digital television, and automotive. Add multi-core to the
mix and the B-Series was adaptable to data center performance levels. That,
said the company, made it a unique range of GPU IP.
The B-Series Imagination BXS had the first ISO 26262-capable cores in 2020,
which opened a range of automotive options, from small, safety-fallback cores to
many teraflops of computing. The high TFLOPS targeted advanced driver-assistance
systems and high-performance autonomy applications such as the dashboard display
shown in Fig. 3.5. Imagination was immensely proud of its ISO certification.
Imagination believed it had picked the sweet spot between high-performance and
power-optimized cores. It incorporated an innovative decentralized approach to
multi-core design. The company claimed the design would allow it to deliver
high-efficiency scaling. BXT offered compatibility with industry trends such as chiplet
architectures. That enabled the company to provide a range of performance levels
and configurations, a capability that Imagination said had not before been possible
in GPU IP.
Imagination claimed it had optimized the new multi-core architecture for each
product family. The BXT and BXM cores featured primary full-core scaling capa-
bilities (refer to the block diagram in Fig. 3.6). Combined with all the cores’ power, it
Fig. 3.5 Tile region protection isolates critical functions from each other. Courtesy of Imagination
Technologies
106 3 Mobile GPUs
Fig. 3.6 Imagination’s BXT MC4 block diagram. Courtesy of Imagination Technologies
Fig. 3.8 In 2020, Imagination had the broadest range of GPU IP designs available. Courtesy of
Imagination Technologies
25% area saving over previous-generation cores. And, said the company, it had
up to 2.5x the fill rate density compared to the competition.
• Imagination BXM—High-efficiency cores with fill rate and compute balance.
The company designed the BXM for a compact silicon area in mid-range mobile
gaming, DTV, and complex UI solutions.
• Imagination BXT—The company claimed its four-core part, designed for hand-
held devices to the data center, could generate 6 TFLOPS while producing
192 gigapixels per second. It could supply 24 trillion operations per second
(TOP/s) for AI, delivering the industry’s highest performance per mm².
• Imagination BXS—The BXS family were ISO 26262-capable GPUs. The
company claimed they would enable next-generation HMI, UI display, infotain-
ment, digital cockpit, and surround view.
Several OEMs selected the Imagination BXT core for its scalability. The company
claimed its GPUs could provide up to 70% higher compute density than desktop
GPUs and that the BXT core IP offered flexible control over the OEM’s
configuration and layout of individual cores in SoCs and multi-die packages.
3.3.1 Summary
In 2017, Imagination Technologies was acquired by the Chinese private equity
firm Canyon Bridge. Like others, it suffered through a worldwide pandemic and
unrelenting competitors. And through all that, maybe because of it, the
company developed a string of new architectures with long legs, scalability,
and continually surprising features and results.
Acorn Computers was founded in Cambridge, UK, in 1978 and was the genesis of
Arm. Acorn produced several computers that were especially popular in the UK,
including the Electron and the Archimedes. The company had been using
commercial off-the-shelf processors and decided it should build its own. The
reduced instruction set computer (RISC) was a popular design approach for its
speed, low power consumption, and minimalistic design. So Acorn decided it
would build a RISC processor.
The Acorn RISC project started in October 1983, and Acorn had spent over $8.6
million on it by 1987. VLSI Technology was chosen to build the chip. VLSI produced
the first Arm processor on April 26, 1985, 4 years after the IBM PC [1]. Arm started
life in 1987 as part of the BBC computer made by Acorn [2].
In 1990, Acorn and Apple began collaborating to develop a low-power processor
for Apple’s Newton. The two firms concluded that a single company should do
processor development. Acorn transferred most of its advanced research and
development section to the new company and formed Arm Ltd. in November 1990.
Acorn and Apple each had a 43% shareholding in Arm (in 1996) [3]. VLSI was
also an investor and the first Arm licensee.
Arm became the processor of choice for handheld devices, from PDAs (Newton,
Palm, etc.) to smartphones and early pretablet slate-like devices.
Arm had been working indirectly with Imagination Technologies for GPU IP. In
January 2001, the two announced a strategic development deal: They would integrate
Imagination’s compact PowerVR graphics core with Arm’s 16/32-bit RISC cores.
The companies said their first markets would be mobile phones, PDAs, digital set-top
boxes, tethered Internet appliances, and mobile gaming markets.
That arrangement lasted until June 2006 when Arm announced it would acquire
the Norway-based firm Falanx Microsystems. In addition to the GPU technology,
Arm acquired the Mali product brand.
Imagination quickly announced that it would continue to market its next-
generation graphics technology, PowerVR SGX, directly to customers and inde-
pendently of Arm. SGX was Imagination’s newest design and supported OpenGL
ES 2.0, and they hoped Arm would pick it up. But time moved on, and Imagination
had been re-evaluating its marketing of PowerVR SGX technology for a while.
Imagination had always had a close and frank relationship with Arm with good
executive- and working-level contacts, and they were both aware of each other’s
views and plans. Arm didn’t decide to go its separate way based on the
technical aspects of Imagination’s designs; the driving factors were customer
input and support trends, both commercial and strategic.
3.4 Arm’s Path to GPUs (2001) 109
So, in 2005, Arm began to look for another approach and launched the Flag
project to explore the options available to the company (Flag was not an
acronym, just a project name). With a directive from the board of directors,
the Arm team started a due-diligence investigation and visited almost all the
actual and self-proclaimed suppliers of graphics coprocessor chips and IP. The
Flag team did not begin with any preconceived notions. They looked at the
candidate companies’ cultures, philosophies, direction, technology, and
resources. Then in June 2005, they sent some representatives to the northern
university town of Trondheim, the third-largest city in Norway.
Imagination Technologies was not out of consideration, but the Flag team found
the architectural directions taken in the development of the SGX did not completely
line up with Arm’s plans. Also, Arm management recognized graphics had become
a significant part of an SoC solution and having a design of its own would give the
company more leverage and control of its future.
After lots of investigation and several more meetings, Falanx agreed it should
become the new graphics IP business unit within Arm.
Announcing the acquisition in June 2006, Arm said Falanx’s GPU family, Mali,
would allow Arm to meet the demand for more sophisticated graphics in mobile,
automotive, and home applications. Figure 3.9 shows the organization of the
Falanx GPU.
Falanx said it was not really looking to be acquired; it had investors with
deep pockets and long horizons. Arm knew better: Falanx was, in fact, pretty
desperate and looking to find a partner or a buyer. Falanx did have a
development deal with Zoran and others in the works, all of which could be
moved over to Arm.
Arm acquired Falanx in June 2006, and Mike Inglis, executive vice president at
Arm, said, “The estimated total available market for embedded 3D hardware is set
to grow from 135 million units in 2006 to more than 435 million units in 2010.” His
forecast was a little off—in 2010, 325 million mobile devices with Mali shipped. By
2012, the market had jumped to 850 million, and in 2021, it reached 1.3 billion units
[4].
Arm won one socket after another during that time, and the Mali GPU found its
way into most low-end and mid-range phones and most tablets. Mali became the
unquestioned market leader.
The Arm ecosystem ships billions of chips per quarter, and in the fourth quarter of
2020, Arm reported that its silicon partners shipped a record 6.7 billion Arm-based
chips in the prior quarter. That equates to about 842 chips shipped per second. By
2021, Arm partners shipped more than 180 billion Arm-based chips to all platforms
(including autos, toys, TVs, and refrigerators)—chances are you have three or more
Mali GPUs in your collection of electronic stuff.
3.4.1 Falanx
In 1998, Jørn Nystad and Borgar Ljosland met as students at the Norwegian Univer-
sity of Science and Technology (NTNU) in Trondheim. They started to discuss what
held clock frequency performance back in current GPU designs. Jørn had architec-
tural ideas from his CPU designs that formed the basis for a new type of architecture
for a low-power, high-performance GPU (Nystad has over 100 patents and is an Arm
fellow). Ljosland and Nystad assembled a team that included Mario Blazevic (soft-
ware engineering) and Edvard Sørgård (hardware engineering). They were thinking
they could commercialize the design and enter the PC add-in board (AIB) market,
but later changed their plan to being an IP supplier.
Backed by prize money from a university business plan contest and governmental
research grants, and equipped with a promising field-programmable gate array
(FPGA) prototype, they formed Falanx Microsystems. Early in their careers,
they quickly saw what was happening in the PC and AIB markets and changed
their strategy to offering IP for SoCs, a path not too unlike the one
Imagination Technologies had taken. From that move came the original Mali
design.
Later, as the business began to develop and the company got customers, Imagination
Technologies spoke about suing them for patent violation. However, no suit was ever
filed; neither company wanted to take on the burden of lawyer fees [5].
Falanx was incorporated on April 4, 2001, finished its RTL (register-transfer
level, a design abstraction) GPU IP model, and began marketing its IP graphics
core. The company targeted the mobile market and said its design could be used
in mobile phones, PDAs, set-top boxes, handheld gaming devices, and
infotainment systems. Falanx hoped to license its family of IP cores directly
to equipment and SoC suppliers, as Arm and Imagination Technologies had.
Falanx emphasized its independence from the university by noting that the Mali
graphics solution needed no third-party licenses.
The Mali family of graphics IP cores offered 4× and 16× full-scene anti-aliasing
(FSAA) to improve the mobile gaming and multimedia experience. Arm claimed the
4× mode came with no performance, power, or bandwidth penalties, making its use
practical in mobile phones and other handheld devices. Initially, the Mali family had
three members:
• Mali-50
– Entry-level and basic mobile phones
• Mali-100
– For gamepads, feature, and smartphones
• Geometry engine
– An onboard geometry processor offered lower power consumption and
increased T&L performance.
The relative differences between the three designs are shown in Table 3.1.
The Mali-110 and Mali-55 were improvements of partly reimplemented versions
of Mali-100 and Mali-50. As the data in Table 3.1 indicates, the design scaled very
well. It was safe to say that the Mali GPU design represented the smallest GPU with
integrated T&L available.
Shown in Table 3.2 are the feature sets available in the primary family members.
In 2005, Falanx Microsystems added a video capability to its lineup of IP, offering
customers another feature for handheld devices operating in low-power configura-
tions: the Mali Video Series IP cores. Those cores provided H.264 encode/decode,
enabling customers to build high-frame-rate video capture and playback devices.
Historically, the video-enabled cores are a footnote. They were video
acceleration features integrated into the GPUs. The approach worked well for
the resolutions of the time, but dedicated video accelerators won out as video
resolution increased rapidly.
The company also reduced traffic between the processor and memory and created a
pipeline based on computational patterns rather than fixed functions. That approach
let Falanx take advantage of the similarities between video compression and 3D
Table 3.1 Configuration and performance parameters for Mali family of graphics cores

                                           Mali geometry   Mali-110      Mali-55
Gate count, logic (NAND 2×2, full scan)    150k            230k          190k
SRAM                                       10 kb           7 kb          4 kb
Die area (130 nm)                          1.5 mm²         3.0 mm²       2.0 mm²
Max clock                                  150 MHz         200 MHz       200 MHz
Mpix/s (4× FSAA, bilinear)                 n/a             300           100
Mtri/s (transform)                         5               5 (consume)   1 (consume)
graphics. As a result, the company could use the same gates for graphics as for
video display, reducing the gate count and simplifying the design process. Although
that approach was innovative and generated a lot of interest, in the end, it was not
commercially successful. Video and graphics requirements exploded, and power
and performance were ultimately more important than cost optimizations. A block
diagram of the Falanx Arm Mali GPU is shown in Fig. 3.10.
Fig. 3.10 Mali-50/100 block diagram: pixel processor, SoC interface, and external memory over
the AHB and APB buses
The Mali graphics core communicated with the Arm processor and system
memory through the advanced high-performance bus (AHB), and with peripherals
through the Advanced Peripheral Bus (APB).
The Mali Video Series included three cores: two pixel-processing cores (the
Mali-110V and Mali-55V) and a geometry processing core (the Mali-GP-V). These
processors worked with the Arm CPU to accelerate video encoding and decoding
up to 30 frames per second, with motion estimation, motion compensation, video
scaling, image differencing, and color-space conversion.
With the introduction of OpenGL ES 2.0, the Mali-200 (originally Mali-120)
used a split programmable shader architecture. Its development spanned the
acquisition and marked the step up to Arm quality standards. It took a lot
longer to complete than initially envisioned, but that investment in time and
resources from Arm created the foundation for success 5–10 years later.
Because Arm had led the explosion of the handheld market and had done quite well
in a few others, the company thought it was time for a multi-core GPU in a handheld
device. In 2008, it announced the Utgard 4-core architecture, the Mali-400 MP GPU.
Arm had already demonstrated that it could design multi-core CPUs with its
successful MPCore product. The company leveraged that technology (design tools,
test-and-measurement, etc.) and exploited the Mali architecture’s inherent scalability.
Its results produced a design that delivered more than a billion pixels a second while
using little memory bandwidth, which was the key to reducing power consumption.
Part of the Arm announcement was its software stack. Arm had been adding to
its stack for years. But before Mali belonged to Arm, the clever Trondheimers had
already grafted several important APIs, interfaces, and miscellaneous standards onto
the GPU core. Back then, it looked like the diagram in Fig. 3.11.
At the time, Arm said it had shipped 50 million graphics software engines before
the Falanx acquisition. A software stack ran JSR-184 on the CPU via an API and
contained a software rendering engine. JSR 184 is a specification that defines the
Mobile 3D Graphics (M3G) API for the J2ME (Java 2 Platform, Micro Edition),
a technology that allows programmers to use the Java programming language and
related tools to develop programs for mobile devices. LG and Motorola used it at the
time in tens of millions of handsets [6].
Arm sold drivers to the semiconductor suppliers and an abstracted driver to
OEMs. The OEMs' experience and feedback gave Arm information about how to
tune the drivers. Arm then fed that information to the chip suppliers, which allowed
Arm to understand the OEMs' use cases and other issues and helped the chip suppliers
deliver better-tuned, higher-performing parts, getting the right CPU into the right
OEM with the suitable applications.
Licensing the software to OEMs was an essential feature for Arm. It added signif-
icant value by prevalidating the software with Arm’s hardware. That helped ensure
Arm’s partners did not end up with complications when the lower-level software stack
was integrated with Java, which was hampering other companies. The important thing
to see in Fig. 3.11 is all the “J” numbers and “Open” words and numbers that informed
the OEMs where that software could be applied and how broad the applications and
opportunities were for it. That work did not come easy and certainly not overnight.
And as part of Arm’s offering, it was a substantial claim—preverified—that was, it
should work.
The software stack needed something to run on, and that was where the Mali-
400 MP came in.
The 2008 design tackled one of the main power drain issues—memory band-
width—one wants as much as possible, but it comes at the expense of joules. Arm
claimed the Mali-400 MP GPU reduced memory bandwidth by combining the best
immediate-mode and tile-based rendering and clever use of a shared L2 cache with
unified memory access. Like other designs, it supported multiple levels of power
gating, all of which, said Arm at the time, would provide a 30–45% power savings.
But the concept had been a Falanx design philosophy since 1998. With the L2 cache,
Arm improved it even more.
Arm used a large set of different applications and use cases for its measurements
to support those claims. The partners supplied test cases developed internally and
included UI engines, 2D and 3D games, industry-standard benchmarks, and tech
demos. Bandwidth measurements were performed inside and outside the L2 cache
using Arm’s profiling tools and real (ASIC and FPGA) platforms or by running RTL
simulations of the Mali-400 MP. How quickly the Mali-400 MP could power down
1 A MIDlet is an application that uses the mobile information device profile (MIDP) for the Java
Platform, Micro Edition (Java ME) environment.
3.4 Arm’s Path to GPUs (2001) 115
portions of the core depended on how fast the OEM’s silicon implementation could
provide stable power. The latencies imposed by Mali were just tens of cycles.
Arm also claimed that the design scaled linearly with the number of cores and
offered Table 3.3 as an illustration.
The fill rate was measured using supersampling anti-aliasing (SSAA), also called
full-scene anti-aliasing (FSAA) without any overdraw assumptions.
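Supersampling of the kind used in that fill-rate measurement can be sketched in a few lines: shade at twice the resolution in each dimension, then box-filter each 2×2 block down to one pixel. This is a toy model of 4× SSAA, not Arm's implementation; the `shade` function stands in for fragment shading:

```python
def ssaa_4x(shade, width, height):
    """Toy 4x supersampling: evaluate a shading function at 2x2
    sub-pixel positions per output pixel and box-filter them down.

    `shade(x, y)` is any function returning a scalar intensity at a
    sub-sample coordinate in the 2x-wider, 2x-taller virtual frame.
    """
    out = []
    for py in range(height):
        row = []
        for px in range(width):
            # Four sub-samples per output pixel, averaged.
            s = sum(shade(2 * px + dx, 2 * py + dy)
                    for dy in (0, 1) for dx in (0, 1))
            row.append(s / 4.0)
        out.append(row)
    return out

# A hard vertical edge: 0 on the left, 255 from sub-sample x=3 onward.
edge = lambda x, y: 255 if x >= 3 else 0
img = ssaa_4x(edge, 4, 1)
print(img[0])  # the edge pixel averages to 127.5, softening the stairstep
```

The cost is visible in the arithmetic: four shading evaluations per displayed pixel, which is why a quoted "Mpix/s with 4×FSAA" figure implies four times that many samples internally.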
Arm said the Mali driver reduced complexity because it was the same API and driver
for all configurations. The hardware details were hidden from developers, and the cores
worked autonomously on different parts of the frame buffer in parallel (see Fig. 3.12).
Therefore, the rendering was distributed for optimal load balancing and bandwidth
efficiency.
Because the L2 cache was crucial to the multi-core design and efficiency, the
fragment processors were unaware of one another and worked on separate parts of
the final frame buffer. The granularity of the load balancing was on a tile-by-tile basis.
Only one fragment processor touched any pixel in the frame buffer (in that frame).
Refer to Fig. 3.12 illustration where the screen grid represents the tiles making up
the full-frame buffer.
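The tile-by-tile distribution described above can be modeled in a few lines. The round-robin policy below is a hypothetical scheduler invented for illustration (Arm did not publish its exact one), but it shows the key invariant: every tile, and therefore every pixel, is owned by exactly one fragment processor per frame:

```python
def assign_tiles(fb_width, fb_height, tile=16, cores=4):
    """Split a frame buffer into square tiles and hand each tile to
    one of `cores` fragment processors, round-robin.

    Returns a dict mapping core id -> list of (tile_x, tile_y) pairs.
    The 16-pixel tile size and round-robin order are assumptions for
    this sketch, not documented Mali-400 behavior.
    """
    tiles_x = (fb_width + tile - 1) // tile   # ceiling division
    tiles_y = (fb_height + tile - 1) // tile
    work = {c: [] for c in range(cores)}
    for i in range(tiles_x * tiles_y):
        work[i % cores].append((i % tiles_x, i // tiles_x))
    return work

work = assign_tiles(640, 480)  # a VGA frame on a 4-core configuration
# 40 x 30 = 1200 tiles split evenly: the linear-scaling claim in miniature.
print([len(v) for v in work.values()])  # [300, 300, 300, 300]
```

Because no tile appears in two lists, the cores never contend for the same frame-buffer pixels, which is what lets the design scale without inter-core synchronization.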
On its face, Falanx looked like it had a tough challenge as it took its IP licensing
model for handheld devices up against established suppliers like ATI, Nvidia, and
Imagination Technologies. It was a tough challenge, but it did not bother Falanx.
CEO Borgar Ljosland said his company was young and built from the start on an IP
licensing model, and as a result, it was positioned to supply IP for SoCs. Although
the competitors had IP strategies, the people at Falanx reasoned that those competitors
were not really organized to enjoy the margins of an IP licensing model. They said
that the Mali architecture was flexible and scalable in ways that the competitors'
were not. Ljosland said the Mali architecture enabled chip builders to use two and
four cores to create more powerful platforms.
Utgard was the final design based on the original Falanx architecture. The Mali-
50/55/100/110 fixed-function architecture (never given a code name) was the only
purely Falanx architecture. The start of Utgard (Mali-200) came before Arm, but the
majority of the work was done under Arm. Arm managed to keep most of the original
design team and added significantly to the Trondheim and Cambridge teams. The
company developed three more architectures: the Midgard, the Bifrost, and the Valhall.
3.5 Fujitsu’s MB86292 GPU (2002–) 117
translucent display. In addition to analog RGB output, the controller supported digital
RGB output and picture-in-picture video data.
The MB86292 supported 3D rendering, such as perspective texture mapping
with perspective correction, Gouraud shading, alpha blending, and anti-aliasing for
rendering smooth lines.
The MB86292's host CPU interface could transfer display lists and texture pattern
data from the main memory to the Orchid graphics memory or internal registers
using an external direct memory access (DMA) controller. The device had a 32-bit,
33 MHz PCI interface that enabled data transfer at 70 MB/s, more than any other
GDC, the company claimed. The GDC incorporated an external memory interface
to allow off-chip connections to synchronous dynamic random-access memory
(SDRAM) or fast cycle RAM (FCRAM).
Fujitsu manufactured the GDC in its 250 nm CMOS technology fab in Japan.
In 2004, Fujitsu did a process shrink and introduced the MB86296 Coral PA GPU,
manufactured in a 180 nm process. The Coral PA ran on 1.8 V at 500 mA and 3.3 V
at 100 mA. It was compatible with all Fujitsu graphic display controller integrated
circuits (ICs). The GDC worked with different host CPU buses, including the FR
series from Fujitsu, the SH3 and SH4 from Hitachi, and the V83 from NEC Electronics.
Coral PA required no external glue logic.
In 2008, Fujitsu changed its design strategy and built an SoC with an ARM926 RISC
CPU, providing developers with a popular and well-known ISA.
The MB86R01 incorporated Fujitsu’s MB86296 3D graphics core the company
upgraded to support high-speed double data rate (DDR) memory. Fabricated in
Fujitsu’s 90 nm process, the SoC offered an optimal balance of power (low leakage
current) and performance.
The MB86R01 graphics SoC featured a hierarchical bus system that isolated high-
performance functions like 3D graphics processing from low-speed I/O and routine jobs.
Fujitsu designed the Arm processor to run at twice the rate of the graphics core to
reduce memory bus contention between those two primary functions. (The Arm ran
at 333 MHz and the graphics core at 166 MHz.)
Central to MB86R01’s architecture was a 3D geometry processing unit capable
of performing all primary 3D operations, including transformations, rotations, back-
face culling, view-plane clipping, and hidden surface management. See the block
diagram in Fig. 3.14.
A display controller supported two capture sources (YUV/ITU656 or RGB) and
enabled both upscaling and downscaling of video images. It was possible to map
the video to any of the six display layers and have it texture-mapped to polygons
to create special effects. The display controller was also capable of dual digital
outputs, supporting multiple monitor configurations in different resolutions. The
content could be either the same or unique to each panel. For example, a single
3.5 Fujitsu’s MB86292 GPU (2002–) 119
MB86R01 controller could support a 1024 × 768 resolution center console featuring
a navigation system and an 800 × 480 resolution rear-seat display for video, and the
menu buttons could be shared between the displays.
The MB86R01’s six display layers could be six individual frame buffers or indi-
vidual canvases; each could contain unique content. The layers could be optimally
sized to save memory and improve system throughput and graphics performance.
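Layered display controllers like that composite their layers back to front. The sketch below models the idea with a simple "over" blend on a single pixel position; it is illustrative only and does not model the MB86R01's actual blend modes:

```python
def composite(layers):
    """Back-to-front 'over' compositing of display layers.

    Each layer is a (value, alpha) pair for one pixel position; real
    hardware runs this per pixel across all enabled layers. A toy
    model, not the MB86R01's documented blending.
    """
    out = 0.0
    for value, alpha in layers:  # layers supplied back-to-front
        out = alpha * value + (1.0 - alpha) * out
    return out

# An opaque video layer with a half-transparent menu drawn on top.
print(composite([(200, 1.0), (60, 0.5)]))  # 130.0
```

Doing this in the display controller means the frame buffers for each layer never have to be merged in memory, which is where the memory and throughput savings mentioned above come from.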
Psion generally gets credit for producing the first personal digital assistant (PDA)—
the 1984 Organizer. That was followed seven years later with the Series 3, in 1991,
which established the familiar PDA form factor with a full keyboard.
John Sculley, Apple’s CEO, introduced the term PDA on January 7, 1992. At the
Consumer Electronics Show in Las Vegas, Nevada, he introduced the Apple Newton.
Apple chose a 20 MHz ARM610 processor made by VLSI (the first silicon partner
for Arm) and chose the VY86C610 to power the Newton. Apple had discovered Arm
in September 1990 and bought a 43% stake in the company.
IBM introduced the Simon in 1994, the first PDA/smartphone [7], and in 1996,
Nokia introduced the 9000 Communicator, its PDA with digital cellphone function-
ality. That same year Palm released its famous PDA products, powered by a Motorola
68328. It had built-in functions, like a color and grayscale display controller, and
3.6 Nvidia’s Tegra—From PDAs to Autonomous Vehicles (2003–) 121
drove a 160 × 160-pixel monochrome touchscreen LCD. Palm quickly became the
dominant vendor of PDAs until the introduction of consumer smartphones in the
early 2000s. All those devices had simple 2D LCD screens.
Realizing the emergence of the handheld market, in 1996, Microsoft introduced its
CE operating system, code name Pegasus. It was a scaled-down version of Windows
optimized for devices with minimal memory; a Windows CE kernel could run on
one megabyte of memory.
Seeing that development, Ramesh Singh and Ignatius Tjandrasuwita, who were at
S3, along with four others at Chips & Technologies and Cirrus Logic, formed MediaQ
in 1997. They had entered the market to develop a dedicated graphics controller for
PDAs, mobile phones, and handheld game machines.
In 1998, another variant of Windows was introduced, Jupiter, a stripped-down
version of the Microsoft Windows CE operating system for low-power RISC
processors like Arm. It never went into production.
In 1999, MediaQ launched its first device, the MQ-200, a 128-bit graphics display
controller with 2 MB of embedded dynamic random-access memory (DRAM), illus-
trated in Fig. 3.15. The company had grown to 23 people from just about every
graphics semiconductor company in Silicon Valley by that time. The device used a
250 nm process and had a 144 mm2 die.
The Graphics Engine was a specialized logic processor for 2D graphics opera-
tions such as BitBlts and Raster-Ops (ROP), area fill, and vertical and horizontal
line drawing. It also provided hardware support for clipping, transparency, and font
color expansion. MediaQ offered a Windows CE device driver package that used
the graphics engine to accelerate graphics and windowing performance in 8, 16, and
32 bpp graphics modes. The graphics controller freed the CPU from most of the
display rendering function with three main benefits (Fig. 3.16):
• If the CPU was busy, it did not halt the graphics controller’s accelerated operations.
The end-user did not see screen operations start and stop or slow down in most
cases.
• The graphics controller consumed less power because its display buffer and engine
were in the same device.
• The CPU was free to perform time-critical or real-time operations such as
functioning as a software modem, while the graphics engine performed the
rendering.
MediaQ did well in the market and won several designs with follow-on projects.
The company became the leader in micrographics controllers. Its customers included
handset and PDA manufacturers such as Mitsubishi, Siemens, DBTel, Dell, HP,
Palm, Philips, Sharp, and Sony. Its only serious competitor was Epson, primarily in
Japan. MediaQ was in devices using different processors such as StrongARMs and
SH-4s, and it supported the major operating systems, including EPOC, WindRiver’s
VxWorks, Linux, and Palm OS.
By the mid-2000s, most PDAs had morphed into smartphones. Stand-alone PDAs
without cellular radios had died off.
The smartphone market was just getting started, Nokia was the leader, Apple had
just introduced the iPod in 2001, and rumors were swirling about its interest in the
phone market.
3.6 Nvidia’s Tegra—From PDAs to Autonomous Vehicles (2003–) 123
Nvidia had also been eyeing the mobile market and looking for an entry point to
leverage its graphics technology and reputation. The internal microdevice projects
Nvidia was running were not progressing fast enough, and management was worried
it would miss a window of opportunity.
In August of 2003, Nvidia announced it had signed a definitive agreement to
acquire MediaQ for $70 million in cash. That was a big deal at the time and caught
a few companies off guard. Nvidia named the new division GoForce and rebranded
MediaQ's graphics processors under the GoForce name.
“This acquisition supports Nvidia’s strategy of extending our platform reach and
accelerates our entry into the wireless mobile markets,” said Jensen Huang, Nvidia’s
CEO. “The MediaQ acquisition extends Nvidia’s competencies in ultra-low-power
design methodologies and system-on-chip designs, as well as in the Microsoft Pocket
PC, Microsoft SmartPhone, Palm, and Symbian operating systems.”
Just before Nvidia’s acquisition, in July 2003 the Khronos Group released
OpenGL ES 1.0. OpenGL ES 1.0, based on the original OpenGL 1.3 API, had some
functionality, most extensions, and overhead removed. The members of Khronos,
including Nvidia and MediaQ, of course, knew all about OpenGL ES. It promised
to stabilize the mobile and handheld market and allow apps to migrate from one
platform or device to another easily—and it did just that.
The leading developer of OpenGL ES code examples at the time was Hybrid
Graphics, a small company in Finland backed by an advertising agency. Its drivers
were small, tight, and very efficient. It developed and licensed graphics technology
solutions for handheld consumer devices. Its customers were key players in the
mobile industry, including Nokia, Ericsson, Philips, Samsung, and Symbian, who
held well over 50% of the existing handheld market.
The company also had an operation for making 3D graphics images called Fake
Graphics, which it spun off in 2004. Before that, the company's most important product
had been SurRender 3D (1996–2000), its OpenGL graphics library.
In 2000, Hybrid launched visibility optimization middleware, dPVS (dynamic
Potentially Visible Set), which massively multiplayer online role-playing games
(MMORPGs) such as EverQuest II and Star Wars Galaxies used. In 2006, Hybrid sold
its dPVS technology to Umbra Software, founded that same year. Umbra specialized
in occlusion culling and visibility solution technology and provided middleware
for video games.
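The idea behind a potentially visible set is simple to sketch: visibility is computed per viewer region ahead of time, so at draw time the renderer only submits objects listed for the current cell. The cell names and table below are invented for illustration; dPVS itself computed visibility dynamically and far more cleverly:

```python
# Hypothetical precomputed map: viewer cell -> objects possibly visible
# from anywhere in that cell. In a PVS system this table comes from an
# offline (or, in dPVS, a runtime) occlusion analysis.
PVS = {
    "courtyard": {"fountain", "gate", "tower"},
    "cellar":    {"barrel", "stairs"},
    "tower":     {"gate", "fountain", "courtyard"},
}

def objects_to_draw(viewer_cell, scene_objects):
    """Cull the scene to the precomputed potentially visible set."""
    return sorted(scene_objects & PVS.get(viewer_cell, set()))

scene = {"fountain", "gate", "tower", "barrel", "stairs", "dragon"}
print(objects_to_draw("cellar", scene))  # ['barrel', 'stairs']
```

For an MMORPG with thousands of objects per zone, replacing per-object visibility tests with one set lookup is what made such middleware attractive.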
Hybrid Graphics provided the first commercial implementations for the OpenGL
ES and OpenVG mobile graphics APIs. The company also was actively involved in
the development of the M3G (JSR-184) Java standard.
Nvidia was still trying to get into the mobile business and realized its driver
development for OpenGL ES was not progressing fast enough. To remedy that,
Nvidia bought Hybrid Graphics for an undisclosed amount (which meant less than
$20 million) in March of 2006.
“Since its inception, Hybrid has successfully delivered graphics solutions to
hundreds of millions of handheld devices,” said Mikael Honkavaara, CEO of Hybrid
Graphics. “We provide innovative graphics technology, and we make it work in the
real world, in real devices.”
Nvidia now had two of the four components needed to enter the mobile market;
it still needed Arm processor technology and experience and some radio capability.
And although the company had made complex integrated graphics processors (IGPs)
and GPUs, it had no experience with heterogeneous SoCs or radios.
In 1999, the same year Nvidia would introduce its PC-based GPU, six guys
formed a media chip semiconductor company they named PortalPlayer. In 1998,
the backer and mentor of 3Dfx, Gordon (Gordie) Campbell, visited chip pioneer
National Semiconductor, hoping to interest them in building a chip system for an
MP3 player. National was not interested, as it was pursuing its own Cyrix media
processor. However, John Mallard, National’s CTO, was impressed, and as the
legend goes, followed Campbell out to the parking lot after the meeting to discuss it
further. Mallard and four other National engineers, with $5 million in funding from
Campbell’s Techfund and J. P. Morgan, formed PortalPlayer in 1999.
The company developed an MP3 player with a flexible design that allowed
customers to use different functions that would create a unique custom chip. Word
of the company’s flexible design got to Apple, where Steve Jobs was trying to get
an MP3 player project off the ground—the timing was perfect. Several companies
were bidding on the Apple project, including mighty Texas Instruments, considered
the logical winner with its strong DSP background. But the Texas Instruments part
was too big and used too much power. In the summer of 2001, Apple picked the
start-up PortalPlayer [8]. The young, energetic company poured its collective heart
and soul into the Apple project, and Apple introduced the iPod in November of that
same year. That put the company in the headlines, and soon after, Samsung, RCA,
Rio, and others would place orders with the firm. However, Steve Jobs did not want
any information about Apple to get out, and he was not happy about the notoriety
PortalPlayer was getting.
In 2002, founder CEO Mallard stepped down, and Gary Johnson, who had been
running S3, came in as CEO. MP3 players were taking off; over 4 million of them
sold in 2002, and by 2004, that had climbed to 7 million. Information about PortalPlayer's
relationship with Apple dried up. Johnson could not concede the relationship even
existed for fear of offending the company's biggest, and notoriously secretive,
customer. "I'm not even going to refer to those guys," he told Forbes magazine [9].
PortalPlayer had a unique design and maybe a year's lead on the competition, and
everyone wanted Apple's business.
In November 2004, PortalPlayer filed for an IPO and went public. In its disclosure,
it revealed that Apple represented 90% of the company's $48 million revenue and
that PortalPlayer had 85% of the MP3 market. Furthermore, the company revealed to
investors anxious to get a piece of the action that PortalPlayer was likely to get the
design win for Apple's follow-on product, the iPod nano, expected out in 2005.
But PortalPlayer had committed the unforgivable sin of mentioning Apple's name.
When asked about PortalPlayer, Jobs would only say it was one of many suppliers.
Jobs could not stand sharing the spotlight with anyone and was notoriously famous
for keeping his key component suppliers a secret from the competition. The damage
had been done, and Apple selected Samsung for the MP3 processor for the iPod
nano in 2006. With the new device coming out, orders from Apple for the old device
3.6 Nvidia’s Tegra—From PDAs to Autonomous Vehicles (2003–) 125
fell off drastically. In June, PortalPlayer announced it would reduce its workforce
by 14% following its failure in April to secure the Apple business. A month later,
PortalPlayer’s President and CEO Gary Johnson said he would resign by the end of
the year. PortalPlayer’s quarterly earnings of $34.6 million were down 50% compared
to the previous year.
And then in November, Nvidia reported it would acquire PortalPlayer to beef up
its development of personal media players (PMPs). Nvidia said it would pay about
$357 million for PortalPlayer. Now, Nvidia would have three of the four components
for a smartphone chip. But why the company paid so much for a failing company
remained a mystery. People thought if Nvidia had waited a few months, it could have
gotten the company for a third of what it paid. But Jensen Huang saw PortalPlayer’s
IP and technology as too valuable to pass up. PortalPlayer’s product mix of semicon-
ductors, firmware, and software platforms for PMPs would fit neatly into Nvidia’s
family of graphics processing units, Nvidia said [10].
Nvidia was smart and always on the hunt for talent. By acquiring PortalPlayer,
Nvidia would pick up approximately 300 employees, including more than 125 in
Hyderabad, India; about 200 of PortalPlayer’s employees were engineers. Nvidia
had been expanding its operation in India lately, so it all tied together nicely.
Why would Nvidia want PortalPlayer? Because Nvidia was going to design and
sell an application processor for the handheld market. The company was already an
Arm licensee and had started its own applications processor project. But the project
was not moving as fast as management wanted, and the PortalPlayer team was a
lucky pickup to expand and accelerate the project.
PortalPlayer was also an Arm licensee for the ARM7 cores and provided not
only the iPod's sound chip but also its CPU. As Nvidia was acquiring PortalPlayer,
rumors of Apple's iPhone were circulating. The stories said PortalPlayer would be
the SoC supplier for it. It was speculated that $150 million of the $357 million
Nvidia offered for PortalPlayer was expected to come from the Apple deal. However,
PortalPlayer/Nvidia never got the contract, and Apple's iPhone SoC was built
by Samsung and used Imagination Technologies' GPU IP. Once again, PortalPlayer
had committed the sin of speaking about Apple. Until late 2005, it was seriously
being considered for the iPhone, but when Jobs heard the rumors, he went back to
Samsung. PortalPlayer had ignored the first rule of the Apple Fight Club: never talk
about Apple.
Nvidia began developing its GoForce Arm-based applications processor in 2006
and showed it at the 3GSM conference in Nice, France, in February 2007. The GoForce
6100 was a legacy part from PortalPlayer (so it was not really Nvidia's first apps
processor), but Nvidia formally launched it and had a customer (SanDisk, for the
Sansa View PMP).
In the summer of 2007, in Taipei, Nvidia launched a new brand and product line
around the design and named it Tegra. Tegra joined GeForce, Quadro, and Tesla, and
Nvidia quietly put the mobile GoForce product line to sleep.
Nvidia introduced the Tegra APX 2500 SoC in 2008 with a 300–400 MHz integrated
GPU and a 600 MHz Arm 11 processor. Audi incorporated it in its entertainment
systems, and other car companies followed. In March 2017, Nintendo announced it
would use the Tegra in the new Switch game console.
Tegra was Nvidia’s Arm-based family of small-package, low-power, high-
performance application processors targeting the handset market (APX 2500) and
the mobile Internet devices (MID) market (Tegra 650 and 600), which were SoCs,
illustrated in Fig. 3.17. Nvidia had also taken the step to define the MID market.
Nvidia’s first apps processor was the APX 2500, announced at Mobile World
Congress in February 2008. In December, Nvidia pushed its SoC, Tegra, initially
planned for late 2008, back to late spring of 2009. These SoCs are difficult, Nvidia
said.
In February 2009, Nvidia announced the APX 2600.
Microsoft’s Zune HD media player was one of the first products to use the Tegra
in September 2009. Samsung employed it in its Vodafone M1, and Microsoft’s Kin
was the first cellular phone to use the Tegra. Microsoft did not have an app store, so
the phone did not sell very well.
Then, in May 2009, Tegra was the weak sister in Nvidia’s suite of products,
although not for lack of investment or patience. It was difficult to understand the
passion the company had for Tegra. It was a low ASP (and margin) part; the category
had huge competitors (Qualcomm, TI, Freescale, Broadcom, Marvell, Samsung, etc.),
and the carriers controlled the market. Also, Intel was going to re-enter the market
and disturb it even more. The HTC Android phone that Nvidia showed at the Work
Group Meeting (WGM) conference did not appear either. A couple of no-name white
box navigation units and one stock-keeping unit (SKU) in the Audi did not make for
a sustainable business.2
Fig. 3.17 Symbolic block diagram of the Nvidia TEGRA 6x0 (2007)
However, in Nvidia’s defense, the company had always said Tegra would not ramp
up meaningfully until 2H 09 and privately said it never expected big-name wins in
the first half of the year. Nvidia still believed Tegra should be at least $25 m–$50 m
in FY10 and would exit 2010 at a run rate > $100 M/year, with a fast ramp. That did
not happen.
In January 2010, at the Consumer Electronics Show (CES), Nvidia demonstrated
its next-generation Tegra SoC, the Nvidia Tegra 250, and showed off devices from
NotionInk, ICD, Compal, MSI, and Foxconn. The Tegra platform used a dual-core
Arm Cortex-A9 CPU running at speeds up to 1 GHz, with an eight-core GPU. The
company announced that it supported Android on Tegra 2 and that booting other
Arm-supported operating systems was possible on devices where the bootloader was
open. The company also announced that Ubuntu Linux distribution support was
available at the Nvidia developer forum.
In August 2010, at the IFA conference in Berlin, Nvidia OEMs introduced a
series of tablets based on the Tegra 2. IFA, the largest consumer electronics trade
show in Europe and known as the European CES, was an important conference with
lots of new product announcements, making it challenging to get noticed. However,
Nvidia had such a strong brand and leading GPU technology that it did get noticed
in its partners' exhibits.
In late September 2010, Nvidia said it had almost finished the Tegra 3 and was
working on the Tegra 4 (Fig. 3.18).
Nvidia announced its quad-core SoC, code named Kal-El, which would become
the Tegra 3, at the Mobile World Congress in Barcelona, in February 2011.
2 SKU, a stock-keeping unit: a number or string of alpha and numeric characters that uniquely
identifies a product.
Then in May 2011, Nvidia announced it would acquire Icera, a baseband chip-
making company in the UK, for $367 million. Now, Nvidia had everything it needed
to make an SoC for smartphones.
In 2012, Nvidia unveiled its Project Kai, a Tegra 3 reference design called
Tegra Tab, for low-cost tablets. The Google Nexus 7 was based on it. Tegra 3 was
in many Android 7-in. tablets, Windows Surface tablets, and other Android tablets.
It also found a home in many infotainment devices. Revenue for the second half of
2012 showed big growth for Nvidia.
Just after CES 2013, in February and ahead of the Mobile World Congress event
in Barcelona, Nvidia announced the Tegra 4i (code named Grey). It had hardware
support for the same audio and video formats as Wayne but used the Cortex-A9
instead of a Cortex-A15. The Tegra 4i was a low-power variant of the Tegra 4 designed
for phones and tablets and had an integrated long-term evolution (LTE) modem. Then
in 2013, at Computex in Taipei, Nvidia showed the Tegra Note. It was a $199 Android
reference design tablet (a technical blueprint for others to copy) that used the
Tegra 4 processor.
Although many OEMs embraced Nvidia’s tablet designs based on various gener-
ations of Tegra, those tablets did not sell well. They were generally a little thicker and
heavier than the Apple iPad and Samsung tablets based on Qualcomm. The Nvidia
tablets also did not get very good battery life compared with the competition.
The Tegra 4 shipped in some Chinese phones (Xiaomi). Following the T4 was the TK1
(integrating the same CUDA GPU from GeForce into Tegra). The TK1 went into Android
tablets.
Nvidia got a few design wins in phones and tablets but not enough to sustain a
business and support the necessary R&D.
Nvidia started investigating the automotive market in 2008, thinking its Tegra
processor would provide a powerful solution for the new console navigation and
multi-seat entertainment screens.
In June of 2009, Audi announced Nvidia would power the display system in the
2009 Q5.
Nvidia knew the automotive market would not come close to the expectations of
the mobile market but would provide some ROI. Here, Nvidia surprised itself with
its success.
The company’s Logan Tegra chip had 192 shaders and could drive two displays.
It had four Arm A15 processors with Neon SIMD and shadow LP cores.
Nvidia introduced image signal processors (ISPs) and support for OpenCV in
its Tegra (T124) Logan chip. The chip made full use of the Khronos OpenVX API
designed as an enabler to the popular OpenCV libraries.
Machine Vision was a logical application for vehicles. Nvidia’s Danny Shapiro,
Senior Director of Automotive, said:
3.6 Nvidia’s Tegra—From PDAs to Autonomous Vehicles (2003–) 129
Fig. 3.19 Nvidia offered its X-Jet software development toolkit (SDK) software stack for
automotive development on the Jetson platform
Progress in the automotive field could be deceptive because the process from design to
implementation takes a long time. Nvidia established itself in the automotive field, with over
23 automotive brands signing on as of 2014.
Shapiro added that Nvidia had shipped over 4 million chips to date 2014, and
he forecasted 25 million in shipments when the recent design wins came to market
(Fig. 3.19).
In 2014, Nvidia embedded its Jetson computing boards as part of its Advanced
Driver Assistance Systems program. It offered imaging for panoramic view, night
vision (infrared), autonomous parking, collision detection, and pedestrian detection
and tools for developers.
Nvidia was active in the OpenCV group, initially formed by Intel in 1999 to
explore and promote applications for computer vision. At that time, Intel was inter-
ested in how computer vision could help sell CPUs. The group, however, devel-
oped into a broad-based and active community of cross-platform developers, and an
OpenCV for Android subset emerged with features that could extend the usefulness
to mobile devices and cars. Nvidia contributed algorithms to OpenCV, and developed
its offshoot, OpenCV for Tegra. Functions included algorithms for image processing,
video, stereo, and 3D.
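As a flavor of the image-processing primitives such libraries accelerate, here is a 3×3 box blur in plain Python. It is only an illustrative reimplementation of the kind of kernel OpenCV provides (e.g., its blur/smoothing functions), with heavily optimized SIMD and GPU paths doing the same arithmetic in the real library:

```python
def box_blur3(img):
    """3x3 box blur over a grayscale image (list of rows of ints).

    Edge pixels average only the neighbors that exist, one common
    border policy; real libraries offer several border modes.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) // len(vals)
    return out

flat = [[10] * 4 for _ in range(4)]
print(box_blur3(flat)[1][1])  # a flat image blurs to itself: 10
```

Kernels like this touch every pixel with the same independent arithmetic, which is exactly the shape of work a Tegra-class GPU handles far better than the host CPU.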
In January 2015 at CES, Nvidia announced its Drive platform for autonomous
car and driver-assistance functionality. In January 2017, at CES, Nvidia announced
a partnership with Audi to put the world’s most advanced AI car on the road by
2020. As discussed in the Games Consoles chapter in this section, Nintendo chose
the Tegra for its very popular Switch portable game console.
In June 2020, Nvidia announced it had formed a partnership with Mercedes-Benz
to build advanced, software-defined, and autonomous vehicles. Mercedes said its
fleet would become perpetually upgradable, backed by teams of software engineers
and AI experts, starting in 2024 (Fig. 3.20).
In April 2021, at Nvidia’s virtual GPU Technology Conference, Nvidia announced
the next-generation SoC, code named Atlan.
running, and then we can show it and maybe get some new investors.” A plan was
made, and another round of beers was ordered.
In March 2003, Mika Tuomi visited NEC in Minato-ku, Tokyo, Japan. While there,
he was wrestling with how to efficiently run a 2D engine in minimum memory and
with minimum power. He stepped out on a fire escape to get some fresh air, and while
looking at the distant Tokyo skyline, he had an epiphany. That brainstorm led him
and the team to the VG engine, a highly efficient 2D controller.
The photo in Fig. 3.21 was taken at the Bitboys office in Noormarkku, Finland,
around 2003.
“The Falanx guys paid us a friendly visit,” said Nordlund. “We were competitors
back then but good friends nevertheless.”
By January 2004, Bitboys had established new relationships with mobile phone
suppliers NEC and Nokia. Nokia talked with and watched Bitboys’ progress for a
while, as had NEC through regular meetings at Khronos.
As a result of the Japan epiphany and some funding from NEC, that year NEC
built a mobile phone display integrating Bitboys’ recently developed G10 vector
graphics processor into the display driver silicon. NEC had financed the company
with significant monthly payments until Bitboys ended that arrangement so they
could work on 3D hardware.
By that time, Khronos was operating and drawing in everyone and anyone who
wanted to be in the handheld space. Bitboys were a member of Khronos, and Bitboys’
VG became the foundation for Khronos’ OpenVG API (Fig. 3.22).
The OpenVG group was formed on July 6, 2004, by several firms: 3Dlabs,
Bitboys, Ericsson, Hybrid Graphics, Imagination Technologies, Motorola,
Nokia, PalmSource, Symbian, and Sun Microsystems. Other firms including
chip manufacturers ATI, LG Electronics, Mitsubishi Electric, Nvidia, and
Texas Instruments and software- and/or IP vendors DMP, Esmertec, ETRI,
Falanx Microsystems, Futuremark, HI Corporation, Ikivo, HUONE (formerly
MTIS), Superscape, and Wow4M also participated. The first draft specification
appeared at the end of 2004, and version 1.0 of the specification was released
on August 1, 2005.
Bitboys had long used SIGGRAPH as a place to try out new ideas and as a
friendly venue for new product announcements. In 2004, Bitboys planned to show
its repositioned graphics technology for wireless and embedded graphics processors.
The company had been using the name Acceleon for its mobile graphics processor
line.
Fig. 3.21 In the back row from left: Petri Nordlund, Kaj Tuomi, and Mika Tuomi from Bitboys. In
the front row, Falanx, from left: unknown (guy in blue jeans), Mario Blazevic, Jørn Nystad, Edvard
Sørgård, and Borgar Ljosland. Courtesy of Borgar Ljosland
In their initial announcements, the Acceleon line included plans for three cores,
the G10, G20, and G30. True to Bitboys’ conviction that less is more, the G10 was
designed with just 60,000 transistors and yet included full-screen anti-aliasing, hardware
texture decompression, programmable geometry, pixel shaders, and hardware
vector graphics. The company hoped to license the new design to semiconductor and
mobile phone manufacturers as IP cores for SoC manufacturers’ products.
However, no matter how good Bitboys’ products looked at SIGGRAPH, the
general feeling, including among customers, was that they had better have something
concrete, given that they had made so many announcements at so many SIGGRAPHs
in the past. Bitboys was a member of the Khronos Group and put its support behind the
new Futuremark benchmark, which at least demonstrated the company’s commitment
to standards and that it would aim for commonly agreed upon benchmark goals.
The new Bitboys product line covered the entire range of mobile phone platforms.
The low-end G32 graphics accelerator, which formed the basis of Bitboys’ new
product line, covered volume-market wireless devices. It was designed to be
compatible with the new OpenGL ES 1.1 API, with emphasis on minimal design
size and very low power consumption.
The mid-range G34 targeted wireless gaming. An evolution of the G32 design,
the G34 was a 2D/3D graphics processor core for the OpenGL ES 1.1 feature set. It
added performance through a programmable geometry engine and support for
programmable vertex shaders, qualifying it as a GPU. The G34 could support scenes
with numerous animated characters and tens of thousands of polygons, rendered
with anti-aliasing in 32-bit color at more than 30 frames per second. The floating-point
geometry processor allowed for advanced scene and object complexity.
G40 was a bigger version of the G34 (see Fig. 3.23). It also had programmable
2D, 3D with T&L, vector graphics acceleration, and OpenGL ES 2.0 compatibility.
Hardware acceleration was provided for all forms of graphics, including 2D bitmap
graphics, vector graphics, and 3D graphics. It could be used for everything from UIs
to applications and games.
Bitboys based the rendering pipeline on OpenGL ES 2.0 shader architecture [11].
They also emulated fixed functions using the programmable pipeline.
G40 rendering features included the following:
• 2D graphics rendering
– BitBlt, fills, ROPS (256)
Most of the handset market, more than 85%, was in the mid-range and entry-level
segments. So, at the 2005
Games Developer Conference in San Francisco, Bitboys announced the G12, its
vector graphics processor for mobile devices [13].
The Bitboys G12 was an extremely compact vector graphics processor that
supported OpenVG 1.0 [14] and SVG Tiny 1.2 [15] graphics hardware rendering. The
company claimed it would deliver over 60 frames per second, more than a 100-fold
improvement over software-based vector graphics rendering. The processor, said
Bitboys, operated at extremely low power consumption levels and did not require
much CPU capacity.
Vector graphics technology offered small file sizes for high-quality graphics, scal-
ability to any display size, and lossless compression without JPEG artifacts. Vector
graphics would allow content to be rendered to the greatest possible detail and quality,
animation was efficiently executed, and it used very little memory. Also, vector
graphics facilitated the smooth rendering of anti-aliased text and complex non-Latin
fonts such as Kanji.
The advanced vector graphics features of Bitboys G12 enabled the following:
• Fast, ultra-clear user interfaces
• Animated, interactive advertisements for network operators
• Scalable, rotating maps with small file size, including animations and corporate
logos
• Cartoons, anime, greeting cards, games, and other mobile entertainment applica-
tions.
Because vector graphics content took very little memory space and its
data compressed well, it could easily get distributed over a wireless network, such
as CDMA (code division multiple access) and GSM (global system for mobile
communication).
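The scalability argument can be seen in miniature: a vector glyph stores only its control points, and rendering at any display size rescales the same few numbers. The sketch below, a generic illustration in Python and not Bitboys’ G12 pipeline, flattens one quadratic Bézier curve at two display scales from an identical six-number description.

```python
def quad_bezier(p0, p1, p2, t):
    """Evaluate a quadratic Bezier curve at parameter t."""
    x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * p1[0] + t ** 2 * p2[0]
    y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * p1[1] + t ** 2 * p2[1]
    return (x, y)

def rasterize(path, scale, steps=16):
    """Scale the three control points, then flatten the curve into sample points.
    The stored description (three points) is the same at every display size."""
    p0, p1, p2 = [(x * scale, y * scale) for x, y in path]
    return [quad_bezier(p0, p1, p2, i / steps) for i in range(steps + 1)]

glyph = [(0, 0), (5, 10), (10, 0)]   # one curve: six numbers stored
small = rasterize(glyph, scale=1)    # e.g., a feature-phone screen
large = rasterize(glyph, scale=40)   # the same data on a much larger display
```

The curve stays perfectly smooth at the larger size because it is re-evaluated, not resampled from pixels; a bitmap scaled 40× would show blocky artifacts while occupying far more storage.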
As much as we all love 3D graphics, it is 2D that does most of the work, but it
must be good 2D with sharp lines and anti-aliased edges. To do it right without using
excessive amounts of power or transistors takes real skill. The market potential for
2D or vector graphics acceleration in 2005 was over 425 million units, and more if
one added dedicated vector acceleration to high-end phone 3D engines.
The Bitboys G40, introduced in August 2004 and targeted at high-end multimedia
mobile phones, also included hardware acceleration of vector graphics in a
2D/3D/vector graphics processor.
Things were rocking along for the boys, sales were picking up, profits were almost
there, and then in late 2005, Nokia Growth Partners invested $4 million for 14% of
Bitboys (which came back to them at about 1.2x, in about a year—not bad) [16].
And then, in May 2006, word came out that Bitboys, with about 40 employees,
would be acquired by ATI for $44 million, integrated into ATI’s handheld business
unit, and made the nucleus of a key design center for graphics software design in
Europe.
“Bitboys dovetails into ATI perfectly,” said Paul Dal Santo, general manager of
ATI’s handheld business unit [17].
Also, ATI had just announced a new strategic relationship with … Nokia! Nokia
and ATI were working on 3D gaming and mobile TV.
Nokia said that it had started a long-term strategic relationship with ATI to develop
music playback, 3D gaming, mobile TV, and video functions for its handsets for
Nokia’s customers worldwide. ATI would provide the tools chain and software
development toolkit (SDK) for multimedia developers in the fall of 2006.
“Our role,” said Dal Santo, “is to enable all content, from ultra-high-quality music playback
to 3D gaming, and we will jointly guide and support the members of the content …”
“We want to make sure that with the NSeries devices, we are pushing the multi-
media experience,” said Damian Stathonikos, a spokesman for Nokia. Nokia had
recently introduced new features to the NSeries line for multimedia facilities such
as video capture.
Nokia had been increasingly emphasizing its multimedia devices, and with good
reason: The Nokia N70 multimedia phone was the highest revenue generator for the
company.
As close as Bitboys was to Nokia, Bitboys just was not big enough or deep enough
(in terms of tools, middleware, and range of solutions) to make a giant like Nokia
completely comfortable giving Bitboys a design win. Big companies like to deal
with big companies, knowing they would be around and providing support and a
sustainable road map. The big guys cannot afford point products.
Nokia pretty much dictated what hardware it wanted integrated into chips it bought
from STMicroelectronics and TI. So Bitboys talked with all those companies and
eventually made a licensing deal with TI on the G40 GPU. News of the deals rippled
across Wall Street, and the financial analysts were tripping all over themselves to
point out how they had always liked ATI, even when they were calling for shorting
it and dumping it.
As all this was going on, Bitboys were pretty far along in merging with Hybrid. At
the end of the day, the deal was canceled by the Hybrid side, and in March of 2006,
Nvidia bought Hybrid Graphics for an undisclosed sum (estimated by analysts as
about $10 million). As a result, ATI was expected to quickly pull Bitboys stuff from
Hybrid and replace all that software with its own. If Nvidia bought Hybrid because
they thought that would be their path to Nokia, they made a mistake. The employee
at Nokia who was on Hybrid’s board of directors (as a Nokia representative) was working for
ATI when the ATI–Nokia deal went through [18].
Then, in May 2006, ATI announced it had acquired privately held Bitboys Oy (for
about $44 million). Bitboys brought valuable engineering experience, technology,
and customer relationships that enhanced ATI’s existing mobile phone multimedia
offerings. Based in Finland, the Bitboys team became part of ATI’s handheld business
unit and formed the nucleus for a critical design center for ATI in Europe—all that
technology flowed to Qualcomm through their licensing deal [19].
A few months after ATI acquired Bitboys and won the deal with Nokia, in July
2006, AMD acquired ATI for $5.6 billion. The mobile unit that included Bitboys
personnel was renamed Imageon. But the market for discrete GPUs in mobile devices
was disappearing. The mobile group in AMD moved into IP, and their biggest
customer was Qualcomm. ATI sold IP and tech support to Qualcomm and, in January
2009, sold the Imageon group to Qualcomm for $65 million. Qualcomm established
the operations as its Finnish development center.
Before and after the acquisition, Bitboys helped develop Qualcomm’s Adreno
GPU.
As the British IT journal The Register put it, Bitboys became “known chiefly for
its ebullient performance claims and a string of missed deadlines” [20].
Commenting on that in Juha-Pekka’s interview [21], Mikko Saari, a
Bitboys cofounder who was then country manager of Qualcomm Finland and a
friend of the Tuomi brothers since childhood, conceded:
The criticism was not unfounded. Bitboys were overoptimistic about what could be
achieved—not technologically, but outside of the engineering work. We could have tried
to fix our reputation, but we thought we would handle it with just getting on with our work.
We have learned our lesson. Luckily pros did not care. Nokia funded us in 2004, and they
would not have thrown their money into a sink.
Bitboys were part of ATI, then AMD, and then Qualcomm. The Qualcomm Adreno
220/225 GPU was based on Bitboys’ GPUs, developed in Finland. The architecture
was code named “Leia” (Fig. 3.24).
Two years later, due to internal disagreements on design philosophies between
Qualcomm’s San Diego GPU design group and the Finnish GPU design group,
Qualcomm shut down the Finland operations and laid everyone off.
A former Qualcomm executive commented in 2021 that “the Bitboys mythology
was highly overrated.”
In the 1950s, a prize-fighter comic book character named Joe Palooka (the strip
debuted in 1930 and ran till 1984) could take a punch and always get up and win the
fight. The Ideal toy company made a 48-in.-tall inflatable punching bag in 1952 that
had a weighted bottom. A child could punch the bag as hard as he wanted and knock
it down, and then as if nothing had happened, the bag would pop back up, just like
Joe Palooka (Fig. 3.25).
The Bitboys team was Joe Palooka. You could knock them down, and they would
pop right back up.
In 2008, Qualcomm launched the Snapdragon S1 with the Adreno 200, which
incorporated the ATI-Yamato core IP (Z430)—see Fig. 3.27. Yamato was a cutdown
version of the Xbox 360 GPU for the mobile market. The S1/Adreno 200 marked
the first fully 3D hardware-accelerated GPU in a Qualcomm SoC. The 337-million-transistor
Xenos was one of two dies in a package with eDRAM. Xenos was ATI’s
first implementation of the unified shader. It was also a quasi-tiled renderer, which
was why Adreno was a tiled renderer.
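Tiled rendering splits the screen into small rectangles, bins the geometry by tile, and then shades one tile at a time entirely in fast on-chip memory, saving external bandwidth. The binning pass can be sketched as follows; this is an illustrative Python sketch with an assumed 32-pixel tile size, not Adreno’s implementation.

```python
TILE = 32  # tile edge in pixels; real tile sizes vary by GPU

def bin_triangles(triangles, width, height):
    """First pass of a tiled renderer: assign each triangle to every screen
    tile its bounding box overlaps.  The second pass would then shade one
    tile at a time entirely in fast on-chip memory."""
    cols = (width + TILE - 1) // TILE
    rows = (height + TILE - 1) // TILE
    bins = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for tri_id, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
        for ty in range(int(y0) // TILE, int(y1) // TILE + 1):
            for tx in range(int(x0) // TILE, int(x1) // TILE + 1):
                bins[(tx, ty)].append(tri_id)
    return bins

# One small triangle in the top-left corner of a 64 x 64 screen:
bins = bin_triangles([[(2, 2), (10, 2), (2, 10)]], 64, 64)
```

Because each tile’s pixels are touched only once, from on-chip memory, the approach suits power-constrained mobile GPUs especially well.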
On January 20, 2009, the story ended when AMD announced it would sell its strug-
gling handset division to Qualcomm for $65 million [24]. The agreement meant that
Qualcomm would acquire AMD’s handset division and that the company also inher-
ited the graphics, multimedia technology, and other IP from the business. Qualcomm
planned to use that technology and IP to develop new types of 2D and 3D graphics
technologies for handset devices and cell phones and enhanced audio, video, and
display capabilities. Qualcomm also extended job offers to some of AMD’s handset
division employees.
Qualcomm had worked with AMD and ATI before. Since 2004, the Snapdragon
Adreno GPU used ATI’s Xenos GPU IP (as did Microsoft’s Xbox 360).
“This acquisition of assets from AMD’s handheld business brings us strong multi-
media technologies, including graphics cores that we have been licensing for several
years,” Steve Mollenkopf, executive vice president of Qualcomm and president of
Qualcomm CDMA Technologies, said [25].
Qualcomm expanded and enlarged Snapdragon in the subsequent years, adding
multiple DSPs and ISPs.
3.10 Siru (2011–2022)
By 2010, the number of SoC suppliers in the mobile market was shrinking and by
2017 would be reduced to just five as shown in Fig. 3.1. OpenGL ES was providing
leadership to the market, and in 2015, OpenGL ES 3.2 was introduced with tessellation
and geometry shaders, floating-point render targets, and texture compression. Then, in
January 2016, Vulkan was introduced, and mobile devices had access to the same API
as PCs. New opportunities opened up for developers everywhere.
The Bitboys had been subsumed into Qualcomm and then seemingly abandoned
in Finland when Qualcomm shut down their operations—but much of the team
regrouped in 2011 and formed Siru. Among them, Joonas Torkkeli, Jani Huhtanen,
Kari Malmberg, Mikko Nurmi, and Juuso Heikkila went to Siru Innovations. Marko
Laiho, the former chief software architect at Bitboys, left Qualcomm for another
stealth start-up, Vire Labs.
The Siru team was still flying low enough under the radar that it was unknown
where all of the members ended up. But brothers Mika and Kaj Tuomi
were the cofounders, along with Mikko Alho. Alho, Siru’s CEO, was the former
graphics processor hardware project manager for Qualcomm Finland. Importantly,
although Alho was in a management position, keeping things on track with project
planning, resourcing, design definition, day-to-day project leading, and project status
reporting, he was doing the block design for Siru, including the C- and RTL-model
implementation. That meant an engineer and not a suit ran Siru’s management
(Fig. 3.28).
3.10.1 Samsung
In 2008, Samsung began developing its Hummingbird SoC, the S5PC110, for its
smartphone line. The company had been using Qualcomm SoCs and wanted
to have more freedom and, hopefully, reduced costs. In 2010, the company rebranded
the SoC as Exynos and launched the first smartphone with an Exynos SoC in 2011.
Samsung used its SoC in international markets and Qualcomm in the U.S. The
Exynos SoC used an Arm CPU and the Mali GPU, and as such was not as powerful
as Qualcomm’s SoC with the Adreno GPU (which at the time was based on the
Bitboys design).
Wanting to catch up with Qualcomm, and most importantly Apple, Samsung
decided to develop its own GPU and launched an R&D project in 2013. Samsung
had been using Imagination Technologies’ SGX 544 GPU IP to compete with Apple,
also an Imagination Technologies customer and an investor in Imagination.
Samsung moved the project out of R&D in Korea and into engineering in Austin
and San Jose in 2017. And then in a surprise to everyone, in January 2019, Yongin
Park, President of System LSI Business at Samsung Electronics, announced Samsung
would build the Xclipse mobile GPU with AMD’s RDNA 2 graphics IP technology.
The Exynos 2200 series of chips was announced on January 18, 2022, for the Galaxy
S22, Galaxy Z Fold3, and Flip3 phones. Based on Samsung’s 4 nm processing node,
the SoC’s Xclipse 920 GPU included AMD’s ray tracing intersection processors—
making Exynos 2200 SoC the first mobile processor with hardware-accelerated ray
tracing. The Xclipse could also handle variable rate shading and came with a multi-IP
governor system to enhance efficiency.
Samsung hoped to set a standard for the mobile gaming experience, along with
improving social media apps and photo usage. The Xclipse GPU was positioned
between a console and a mobile graphics processor. The name Xclipse combines the
X of Exynos with eclipse, as in shading all that came before. It marked
the beginning of a new era in mobile gaming.
In the past, Samsung sold its phones with its SoC into the international market
and used Qualcomm’s Snapdragon chips for U.S. models.
The Exynos 2200 used Arm’s octa-core Armv9 CPU cores in a tri-cluster
configuration consisting of a single Arm Cortex-X2 flagship core, three balanced
Cortex-A710 big cores, and four power-efficient Cortex-A510 little cores.
Samsung said the Exynos 2200’s neural processing unit (NPU) performance was
doubled compared to its predecessor. That allowed more calculations to be
made in parallel, which enhanced the overall AI-related performance and experience
on the smartphone. The NPU also offered greater precision with FP16 support and
power-efficient INT8 and INT16.
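The INT8 trade-off works by mapping floating-point values onto an 8-bit integer grid plus one shared scale factor, cutting memory traffic and multiplier width at a small cost in precision. Below is a minimal sketch of symmetric linear quantization, a generic illustration rather than Samsung’s NPU scheme.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to INT8 (-127..127).
    Returns the quantized integers and the scale needed to recover them."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats: one multiply per value."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)   # close to the originals, at 8 bits each
```

The NPU can then do its multiply–accumulate work on the small integers, which is what makes INT8 and INT16 modes power-efficient relative to FP16.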
The Exynos 2200 SoC included an advanced ISP (Image Signal Processor) that
could support image sensors up to 200 MP. That, said the company, allowed a creator
to record and edit 8k videos. With an advanced MFC (Multi Format Codec), the
Exynos 2200 could encode and decode high-resolution videos at faster frame rates
than previous Exynos SoCs.
The Xclipse 920 used the Vulkan 1.2 API to deliver the ray tracing features, and
also OpenGL ES 3.1 for legacy apps.
The ISP had zero shutter lag at 108 MP. Samsung said it could connect up to seven
cameras and run four of them concurrently. With the help of its AI engine, the ISP
also provided tailored choice of color, white balance, exposure, and dynamic range
for each individual scene. Samsung’s SoC could do 8K video recording at 30 fps and
was capable of decoding 4K videos at up to 240 fps and 8K videos at up to 60 fps.
In addition, the MFC included a power-efficient AV1 decoder that helped to enable
a longer playback time.
On the display side, the SoC offered HDR10+, and a high refresh rate up to
4K—3840 × 2400 (WQUXGA) at 120 Hz, or HD+ at 144 Hz.
Samsung invested millions of dollars in R&D, went through several managers and
engineers, and never gave up. The company was determined to have its own GPU
and be vertically integrated like Apple and Qualcomm.
Unfortunately, the Exynos 2200’s Armv9 CPUs and AMD GPU design
consumed more power than forecast, which necessitated lowering frequencies
below the original specifications. Samsung had projected improved power efficiency
from its 4 nm node, but the process shrink failed to meet the company’s expectations.
Samsung trimmed clock speeds through firmware changes, and that delayed
the Galaxy S22 launch by more than a month.
Because of the performance reductions, the new Exynos was made available in only
a few countries (compared to previous years): price-conscious regions such as
India and Latin America. For its premium markets (the US, Canada, South Korea,
Hong Kong, and most of Europe), Samsung used Qualcomm’s Snapdragon 8 Gen 1
processor.
Texas Instruments (TI) had been steadily expanding and improving its OMAP
product line since introducing it in May 1999 and beginning shipments in Q4
2000—and it had a string of successes for the company.
TI entered the mobile phone market with DSP-based baseband processors for
radio modems. Early OMAP models featured the TI TMS320 series DSP. TI OMAP
devices also included a general-purpose Arm processor plus one or more specialized
coprocessors.
On December 12, 2002, STMicroelectronics and TI jointly announced the Open
Mobile Application Processor Interfaces (OMAPI) initiative [26]. OMAPI was
designed for 2.5G and 3G mobile phones expected during 2003. Shortly afterward, in
2003, TI and STMicroelectronics joined with Arm, Intel, Nokia, and Samsung and
formed the larger initiative Mobile Industry Processor Interface (MIPI) Alliance,
which incorporated OMAPI. TI, however, carried on with the OMAP product line
branding. OMAP would be MIPI compatible and was the model of what a MIPI
device should be for a while. Such devices were also called mobile Internet devices
(MIDs).
The first OMAP SoC had a camera processor and an early version of an ISP.
In 2003, TI incorporated Imagination Technologies’ GPU IP and introduced the
OMAP2400 series.
TI segmented the expanding product line into three main categories:
• High-performance multimedia
• Basic multimedia
• Integrated modem and applications.
TI differentiated its SoCs by the Arm processor and the overall number of processors.
The top of the line, the OMAP2420, had four processors, as illustrated in Fig. 3.30,
which shows the 2410 and 2420.
The OMAP2410 included a 330 MHz Arm1136JF-S core, a 220 MHz TI DSP,
an integrated camera interface, a 2D/3D graphics accelerator offering up to 2 million
polygons per second, and a direct memory access (DMA) controller.
The OMAP2420 processor added a TI programmable imaging and video accel-
erator supporting 6-megapixel still capture, full-motion video encode, 640 × 480
video encoding at 30 fps, or decode in common intermediate format (CIF) to VGA
resolution on top of the OMAP2410. The OMAP2420 also could output images onto
an external TV. There were 5 MBytes of on-die SRAM. It was this component that
delivered the 6-megapixel camera support.
TI updated its OMAP with a second-generation architecture that integrated a
3D graphics acceleration engine, an imaging accelerator, support for hi-resolution
digicams and camcorders, and a TV output. Imagination Technologies’ PowerVR
MBX graphics core was the major contributor to the video I/O, 2D/3D, and display
capabilities and could process up to 2 million polygons per second.
Midgard was the third architectural design from the Falanx Arm team. It consisted of
the T604 and T658 series (first generation), the T622, T624, T628, and T678 series (second
generation), the T700 series (third generation), and the T800 series (fourth and last
generation).
Architecturally, Midgard was the progeny of Utgard, the final design based on
the original Falanx architecture. There was still a difference in how the unified and
discrete shaders operated. While there were clear elements of relation, the step change
was perhaps the largest the team had ever made. In one step, they moved to unified shaders
and added full compute, a whole new tiling system, and a new memory subsystem, to
mention a few features. The shader design for Midgard inherited many of Utgard’s
design elements and features. However, those differences mattered more to
programmers than to users.
In 2011, Arm gave its forecast of the mobile processor demand for the next five
years, and the growth and demand for performance were impressive. Its unique rela-
tionship with all the semiconductor and smartphone builders gave Arm a surprisingly
good view of future needs.
In late 2011, Arm announced the latest development in its Mali line of GPU IP
cores. The company predicted they would start appearing in various mobile devices
by 2013. Arm claimed the new Mali-T658 design would have four times the GPU
compute performance of the quad-core Mali-T604 GPU and up to 10 times the
graphics performance of the Mali-400 GPUs. The Mali-T604, Arm’s previous
top-of-the-line, was launched at the Arm TechCon in 2010. That core appeared in
silicon and then in devices by 2012 [28].
“It is simple,” said Steve Steele at the time. Steele was the senior product manager
of Arm’s media processing division. “It is all about higher performance—twice as
many shader cores (eight, compared with the Mali-T604’s four) and doubles the
arithmetic pipelines per core (from two to four). For graphics, this means that the
Mali-T658 GPU offers up to 10 times the performance of the Mali-400 MP GPU.”
The Mali-T658 delivered what Arm called desktop-class performance. Arm said
it got the performance by doubling the number of arithmetic pipelines within each
GPU core and improving the compiler and pipeline efficiency.
The T658 was compliant with multiple compute APIs, including Khronos OpenCL
1.1 (Full Profile), Google Renderscript compute, and Microsoft DirectCompute. The
GPU design had hardware support for 64-bit scalar and vector, integer, and floating-
point data types, which were fundamental for accelerating complex and computation-
ally intensive algorithms. Compatibility with Khronos APIs was maintained across
the Mali-T600 Series of GPUs.
“By showing off some of the versatility of the Midgard architecture, it brings in
a compute punch of up to 350 GFLOPS,” said Edvard Sørgård, Arm’s Consultant
Graphics Architect in Trondheim, Norway [29]. “And with over 5 GPixel/s real fill
rate to external memory to power high-end mobile devices with visual computing and
augmented reality, the 4k DTV revolution, and exascale high-performance compute,
it was another fantastic component in the joined-up computing story from Arm.”
As would be expected, the Mali-T658 GPU was compatible with other popular
graphics APIs, including Khronos OpenGL ES, OpenVG, and Microsoft DirectX
11. The overall organization of the Mali-T658 is shown in Fig. 3.31.
Arm added some additional functionality to Midgard enabled by the Android
Extension Pack (AEP) for Android L. The AEP extended OpenGL ES 3.1 by enabling
features such as tessellation and geometry shaders, features that did not make it into
OpenGL ES 3.1.
Arm also offered Direct3D support on Midgard. That functionality was never used
because all Windows Phone and Windows RT devices at the time used Qualcomm
or Nvidia SoCs. Only Mali-T760 was Direct3D Feature Level 11_1 capable.
The T700 series of Midgard added more shader cores, scaling from 1 to 16.
The T658 was designed to work with the latest version (4) of the Advanced Micro-
controller Bus Architecture, which featured cache-coherent interconnect (CCI).
“Data shared between processors in the system—a natural occurrence in heteroge-
neous computing—no longer required costly synchronization via external memory
and explicit cache maintenance operations,” said Roberto Mijat, GPU Computing
Marketing Manager for Arm at the time [30] (Table 3.4).
“All of this is now performed in hardware and is enabled transparently inside the drivers,”
added Mijat. “In addition to reduced memory traffic, CCI avoids superfluous sharing of data:
Only data genuinely requested by another master is transferred to it, to the granularity of a
cache line. No need to flush a whole buffer or data structure anymore.”
Like its predecessor, the Mali-T658, shown in Fig. 3.32, could perform
parallel processing operations on appropriate applications, including physics, image
processing and stabilization, augmented reality, and transcoding. The T658 had eight
cores and four pipelines (refer to Fig. 3.32).
“The difference between a core and a pipeline,” said Ed Plowman, Arm’s Technical
Marketing Manager, “is that a core is a self-contained entity—which had its own self-
contained resources—so you can build SoCs with different numbers of cores. You
can also scale dynamically by powering cores up and down—independently of other
elements. The Mali-T658 scales by numbers of cores.” [31].
Each arithmetic pipeline (Figs. 3.33 and 3.34) was independent, working on its
own set of threads. Midgard could process one bilinear filtered texel per clock or
one trilinear filtered texel over two clocks. In the high-end T760, the number of
texture units and render output units (ROPS) per shader core was the same, so all
configurations had a 1:1 ratio between texels and pixels.
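Bilinear filtering blends the four texels surrounding a sample point using the fractional part of the coordinates; trilinear filtering does the same on two mipmap levels and blends the results, which is why it costs a second clock. Below is a minimal sketch in illustrative Python, not Mali hardware.

```python
def bilinear_sample(tex, u, v):
    """Sample a 2D texture (list of rows) at fractional coordinates (u, v)
    by blending the four nearest texels -- the single-cycle case on Midgard.
    A trilinear sample would do this on two mipmap levels and blend them."""
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0          # fractional weights
    x1 = min(x0 + 1, len(tex[0]) - 1)  # clamp at the texture edge
    y1 = min(y0 + 1, len(tex) - 1)
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bottom = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

tex = [[0.0, 1.0],
       [1.0, 0.0]]
center = bilinear_sample(tex, 0.5, 0.5)   # average of all four texels
```

Each sample needs four texel reads and three linear blends, which is the work a texture unit performs in its one-texel-per-clock bilinear case.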
Arm claimed the Mali-T658 would give the company performance leadership over
its rival graphics core licensor, Imagination Technologies. Imagination’s top-of-the-
line GPU then was the PowerVR Series 6 design that went by the code name Rogue.
“Imagination and Vivante; they are the competition,” said Ian Smythe, marketing
director for the media processing division.
However, Arm also claimed that Mali, in all its versions, with 57 licenses and 29
SoCs designed, was then the most widely licensed GPU architecture—which was
true.
Arm emphasized that with the Mali-T658 SoC, designers could use a carefully
crafted system-level approach to multi-core design. That approach included Arm
Cortex processor cores, the big.LITTLE power-efficiency technology, and CCI.
According to Jem Davies, vice president of technology for the media processing
division at Arm, “Designers are expected to target high-end smartphones on 28 nm
silicon with quad-core Mali-T658. They will be coming to market in 2013, and
eight-core Mali-T658 graphics units to be fabricated in 20 nm silicon in 2015.” Arm
also expected to find applications in tablet computers, smart-TVs, and automotive
infotainment systems for the core (Fig. 3.35).
“Mali-T658,” added Davies, “will be able to take on computation tasks in applications such
as image processing or augmented reality. The core had been made compatible with the
recently announced A7-A15 little–big coupling so that as computation was moved on to the
T658, it may accompany the movement of the core program down from the A15 to the A7.”
[32]
The Mali Job Manager was autonomous and could carry on graphics processing
with a reduced load on the CPU, which meant it was well suited to working with
a big.LITTLE CPU system. Arm’s big.LITTLE technology is a heterogeneous
processing architecture that uses two types of processors. LITTLE processors are
designed for maximum power efficiency while big processors are designed to provide
maximum compute performance.
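The placement idea behind big.LITTLE can be sketched as a simple policy: keep light, always-on work on the power-efficient LITTLE cores and migrate heavy bursts to the big cores. A minimal illustration in Python; the threshold, task names, and load estimates are hypothetical, not Arm's actual scheduler logic:

```python
# Hypothetical sketch of big.LITTLE task placement: light tasks run on
# power-efficient LITTLE cores, heavy tasks migrate to fast big cores.
def place_task(estimated_load: float, migrate_threshold: float = 0.6) -> str:
    """Return the cluster for a task, given a 0..1 utilization estimate."""
    return "big" if estimated_load > migrate_threshold else "LITTLE"

tasks = {"background sync": 0.1, "UI scrolling": 0.3, "game physics": 0.9}
placement = {name: place_task(load) for name, load in tasks.items()}
# Only "game physics" is heavy enough to migrate to a big core.
```

In a real system the kernel's scheduler makes this decision continuously from measured load, but the policy shape is the same: a utilization estimate compared against a migration threshold.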
Using the correct processor for the right task, the Mali-T658 handled GPU
compute tasks in parallel with the CPU running the always-on, always-connected
tasks. “Arm’s CoreLink system IP enables system-level cache coherency across
clusters of multi-core processors, including the Cortex-A15 and Mali-T658,” added
Davies.
Also, the Mali-T658 was compatible with Armv8's entire 64-bit ISA, as was the Mali-T604.
The lead partners signing up to develop the Mali-T658 were Fujitsu Semicon-
ductor, LG Electronics, Nufront, and Samsung.
In October 2011, Peter Hutton, general manager of the media processing division
at Arm, said, “We are looking at a little–big approach for Mali too.”
With more than six types of portable devices (smartphones, tablets, game consoles,
cameras, navigation units, and notebooks) needing powerful low-energy graphics
engines with GPU compute capability, competition increased to meet this demand.
The lines were being drawn on who Arm's clients for Mali would be: customers who did not have an Arm architectural license. It was exceedingly difficult for a single company without a large, powerful GPU and CPU design team to compete against the architectural licensees and Arm itself. Licensees like Freescale, Marvell, Nvidia, Qualcomm, and TI could create differentiated products, while others had to rely on the Cortex/Mali menu options. The net result was
that some amazingly innovative products emerged from the Arm community and
Intel in the following years. They all competed for the potential of the two-billion-
unit market that all the mobile devices represented. However, that volume was not
realized due to competition and a slow-down in the market.
Arm steadily improved the Mali GPU after acquiring it from Falanx. The little GPU
had found its way into TVs, phones and tablets, automobiles, and various consumer
and industrial devices. The little GPU that could and did went through several archi-
tectural evolutions. In early 2016, the company introduced its Bifrost architecture.
154 3 Mobile GPUs
Fig. 3.36 The 12-year history of Arm Mali architectures over time
Bifrost was implemented first in the Mali-G71 in the second quarter of 2016, then in the Mali-G72 in October 2017, followed by the G76, G52, and G31 in 2018.
The Bifrost architecture was significantly different from the previous Midgard
design, and Bifrost positioned the GPU for new tasks in AI, AR, and rendering. The
multi-generational history of Mali is illustrated in Fig. 3.36.
On the last day of May 2018, at the Arm Tech Day conference, the company
revealed its newest upgrade to the Bifrost family and showed the Mali-G76. Arm
skipped the numbering sequence from G72 to G76 to align Mali with the new CPU
Cortex-A76.
Mali-G76, the third-generation GPU based on the Bifrost architecture, featured three execution engines per shader core, a dual texture mapper, and from 4 to 20 shader cores, and could be configured with 2–4 slices of L2 cache—from 512 KB to 4 MB total. The G72's execution engines could run four scalar threads in lockstep; a vec3 FP32 add took only three cycles. The G76 employed wide execution engines that doubled the number of lanes, so eight threads doing a vec3 FP32 add also completed in three cycles, and it provided int8 dot product support for machine learning. The core design is illustrated in Fig. 3.37.
Arm said the G76 could work faster than the previous generation because its eight
lanes use the same energy as four lanes to process a given workload.
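The lane arithmetic behind that claim can be checked directly. In this lockstep model each lane processes one vector component per cycle, so a vec3 add occupies an engine for three cycles regardless of width; doubling the lanes from four to eight doubles the threads retired in those three cycles. An idealized sketch, not a cycle-accurate model:

```python
# Idealized lockstep execution: each lane handles one vector component per
# cycle, so a vec3 op takes 3 cycles and retires one thread per lane.
def vec3_threads_per_cycle(lanes: int, components: int = 3) -> float:
    cycles = components              # one component per cycle
    return lanes / cycles            # threads completed per cycle, averaged

g72_style = vec3_threads_per_cycle(lanes=4)   # four-lane engine
g76_style = vec3_threads_per_cycle(lanes=8)   # eight-lane engine
assert g76_style == 2 * g72_style             # twice the throughput per engine
```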
A simple comparison of the total compute capability is shown in Table 3.5.
The G76 also had dual texture units, so applications could process two texture operations per core, which effectively gave the G76 twice the texture throughput.
At the time, Arm said its microarchitecture improvements would deliver area and power savings from register bank optimization: half the number of register banks, but larger ones. The G76 also increased compute performance through changes to the thread-local storage area, where the stack was used for register spilling in shaders; the data for a single thread was grouped into chunks at the same location. As a result, said the company, gaming performance would improve, and it offered the comparison in Fig. 3.38 to make the point (based on Arm's testing).
Because of the efficiency improvements, games could also be played for longer periods, said the company.
In October 2019, Arm expanded its Mali graphics portfolio with its G57 GPU based
on the new Valhall architecture, the fourth generation of Mali GPUs [33]. The G57
targeted the mainstream market, and the G77 targeted the premium mobile market.
Arm said the Mali-G57 enabled capabilities not usually associated with mainstream smartphones and home appliances (e.g., set-top boxes and DTVs). That included
high fidelity content, 4K and 8K user interfaces, console-like smartphone graphics,
and more complex ML, AR, and VR workloads.
The heart of Mali-G57 was based on the company’s Mali-G77 (see below).
Compared with the previous Bifrost generation Mali-G52, the Valhall G57 had 30%
greater performance density and was designed to run content like Fortnite at high
resolution. Also, said Arm, the G57 provided double the texturing performance of
the G52, which would improve high-resolution UI performance in 4K and 8K DTVs,
AR, VR, and gaming. The increase in compute and texture capabilities also made the Mali-G57 a good candidate for HDR rendering, physically based rendering (PBR), and volumetric effects, which were becoming standard features on mobile devices. That would, they said, translate to enhanced and smoother experiences and faster, greater responsiveness on mainstream devices for the end-user [34].
Fig. 3.39 Arm's comparison of performance between the Mali-G52 and the new Mali-G57. Courtesy of Arm
Efficiency. Along with performance, Arm said it made significant energy effi-
ciency improvements with Valhall, up to 30% over Bifrost. The company offered
some comparisons of the G57 and the G52 (Fig. 3.39).
Whereas the premium Mali-G77 had at least seven cores, the Mali-G57 had one
to six cores depending on the configuration.
ML performance. The Mali-G77 brought a significant improvement in ML performance—up to 60% over the G76. Arm claimed the Mali-G57 would show similar improvements, taking more complex ML workloads to mainstream devices. The 60% increase in on-device ML performance was made possible by twice as many fused multiply-accumulate (FMA) processors compared with the Mali-G52 (depending on the configuration) and architectural optimizations. That provided faster responsiveness to a wide
range of ML use cases common on smartphones, such as face detection, image
quality enhancement, and speech recognition. Moreover, Arm claimed Mali-G57’s
flexibility to perform different ML workloads would ensure that the next generation
of mainstream devices would provide future and emerging use cases based on ML.
3.12.2.1 AR and VR
Arm believed consumers and the mainstream market wanted more immersive AR and VR experiences. However, AR and VR games and applications were often limited to premium devices. Using VR as an example, the Mali-G57 offered foveated rendering, allowing VR developers to reduce their application's workload. Foveated rendering was typically achieved by selectively reducing the shading rate for regions of the screen that are less visible through the lenses of VR headsets. As a
result, claimed the company, the industry was likely to see users with access to more
immersive VR games, apps, and experiences. Similar improvements applied to AR, said Arm, where the Mali-G57's performance increase would support more immersive, higher quality AR content, games, and features on mainstream devices.
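Foveated rendering as described, full shading rate where the lens center points and coarser rates toward the periphery, can be sketched as a per-tile rate selection. The radii and rates below are illustrative, not Mali-specific values:

```python
import math

# Illustrative foveated shading-rate selection: tiles near the fovea (lens
# center) are shaded at full rate, peripheral tiles at coarser rates.
def shading_rate(tile_center, fovea, inner_radius=0.25, outer_radius=0.6):
    """Fraction of full shading rate for a tile; coordinates normalized 0..1."""
    d = math.dist(tile_center, fovea)
    if d < inner_radius:
        return 1.0       # one shade per pixel
    if d < outer_radius:
        return 0.25      # one shade per 2x2 pixel block
    return 0.0625        # one shade per 4x4 block in the far periphery

# With a centered fovea, a corner tile is shaded at 1/16 the full rate.
corner_rate = shading_rate((0.0, 0.0), fovea=(0.5, 0.5))
```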
The Mali-G77 did 33% more math in parallel than the G76.
The significant architectural changes could be found in the execution unit inside
the core, the part of the GPU responsible for number crunching.
Inside the execution engine
Each GPU core in Bifrost had three execution engines (or two in some lower-end Mali-G52 models). Each engine had an instruction cache, a warp control unit, and a register file. In the Mali-G72, each engine handled four instructions per cycle, which increased to eight in the Mali-G76. Spread across the three engines, that allowed for 12 and 24 32-bit floating-point (FP32) fused multiply-accumulate (FMA) instructions per cycle, respectively.
With the Valhall Mali-G77, only a single execution engine was inside each GPU core (Fig. 3.41). That engine held the warp control unit, register file, and instruction cache, shared across two processing units. Each processing unit handled 16 warp instructions per cycle, for a total throughput of 32 FP32 FMA instructions per core (the 33% instruction throughput boost over the Mali-G76).
With Valhall, Arm transitioned from three execution engines to a single one per GPU core, with two processing units inside that engine on the G77.
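The per-core arithmetic behind the 33% figure checks out: three Bifrost engines issuing eight FP32 FMAs each give 24 per cycle, while Valhall's two 16-wide processing units give 32. Verified from the numbers in the text:

```python
# Per-core FP32 FMA issue rates quoted in the text.
bifrost_g76 = 3 * 8     # three execution engines, eight FMAs each
valhall_g77 = 2 * 16    # one engine with two 16-wide processing units

uplift = valhall_g77 / bifrost_g76 - 1
assert (bifrost_g76, valhall_g77) == (24, 32)
assert round(uplift * 100) == 33    # the "33% more math" claim
```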
Each processing unit had two new math function blocks: a convert unit (CVT) that handled basic integer, logic, branch, and conversion instructions, and a special function unit (SFU) that accelerated integer multiplication, division, square roots, logarithms, and other complex integer functions.
The FMA unit supported 16 FP32 instructions per cycle, 32 FP16, or 64 INT8
dot product instructions. Those optimizations produced a 60% performance uplift in
machine learning applications.
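The int8 dot product capability matters for machine learning because quantized inference reduces each neuron's work to exactly that operation: a dot product of 8-bit weights and activations accumulated in a wider integer. A minimal sketch of the primitive in plain Python, not Mali intrinsics:

```python
# Quantized inference reduces each neuron to an int8 dot product accumulated
# in a wider integer -- the operation the FMA unit issues 64 of per cycle.
def int8_dot(weights, activations):
    """Dot product of two int8 sequences with a wide accumulator."""
    assert all(-128 <= w <= 127 for w in weights)
    assert all(-128 <= a <= 127 for a in activations)
    return sum(w * a for w, a in zip(weights, activations))

acc = int8_dot([127, -3, 5, 0], [1, 2, 3, 4])
# 127*1 - 3*2 + 5*3 + 0*4 = 136, already wider than int8 can hold,
# which is why hardware accumulates these products at higher precision.
```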
At the time, Arm said the Valhall architecture would align with modern APIs
such as Vulkan, which was considered the new standard for graphics on mobile and
other platforms. Thanks to the performance and energy efficiency improvements of the Mali-G57, the company said games would run smoother and for longer on
devices. That, in turn, said Arm, would help reinforce the reputation of developers
in the gaming ecosystem to device manufacturers. Arm saw the Chinese and Asian
markets, where mainstream devices and mobile gaming were most prevalent, as
an excellent opportunity for mainstream gaming apps. According to a 2018 Newzoo
report on the China games market, China would generate $23 billion in mobile games
revenue alone [35]. Therefore, game developers were keen to access these regions
where the mainstream market was robust (Fig. 3.42).
The Valhall architecture was the basis of Arm’s latest generation GPUs and
contained various improvements and new features over the previous Bifrost architec-
ture. It was a scalable architecture that made the high-end and complex features in the
G77 possible on both premium and mainstream devices. In addition to aligning better
with the Vulkan API, the key elements of Valhall were a new superscalar engine,
a simplified scalar ISA, and new dynamic scheduling of instructions. All of those
brought about the performance and energy efficiency improvements in the Mali-
G57. A companion to the Mali-G57 was the Mali-D57 display processor (discussed
below).
Arm coined the expression total compute as a solution-focused approach to
system-on-chip design, moving beyond individual IP elements to design and opti-
mize the system as a whole. It was an apt description and easy to remember. Along
with 5G, the acceleration of AI, extended reality (XR), and the Internet of Things (IoT), compute requirements were changing. The performance needed for digital immersion, said Arm at the time, would have to push beyond what existed then toward the
world of total compute.
In the spring of 2019, Arm announced its latest development in GPU land—the
Mali-G77. Arm said the new GPU was 40% faster in overall graphics than the G76
and 60% faster at ML tasks. The company also claimed the G77 was 30% more
energy-efficient and used 40% less bandwidth [36].
The company also introduced its second-generation ML processor design, which
was twice as efficient as the original. The peak performance improved slightly from
4.6 to 5 TOP/s, but that performance required only 1 W instead of 2 W for the original
ML processor.
Fig. 3.43 Arm has the whole suite of engines for 5G AI, ML, and VR. Courtesy of Arm
Arm said the new ML processor had improved memory compression techniques
by three times that of the previous generation, and it could scale up to eight cores
for a total performance of 32 TOP/s. However, such designs were unlikely to make
it into mobile devices due to higher power requirements (Fig. 3.43).
In addition to the ML processor, Arm introduced a companion display processor for the G77: the Mali-D77, a counterpart to the Mali-V61 from 2017, shown in Fig. 3.44 [37]. Looking to compete with Qualcomm's XR platform, Arm claimed its D77 display processor IP delivered superior head-mounted display (HMD) VR performance, helping eliminate motion sickness, and was optimized for 3K resolution at a 120 Hz refresh rate. The D77 had new fixed-function hardware, which Arm said achieved more than 40% system bandwidth savings and 12% power savings for VR workloads. With that, the company hoped to enable lighter, smaller, and more comfortable untethered VR HMDs based on standard premium mobile displays.
Arm claimed that the D77 was a display processor that could significantly improve
the VR user experience with dedicated hardware functions for VR HMDs, namely
lens distortion correction, chromatic aberration correction, and asynchronous time
warp (ATW).
VR HMDs require displays close to the eyes; therefore, to maintain the perceived quality of the images (e.g., reducing display artifacts like the screen-door effect), more pixels were needed in the same area. Unlike regular smartphone displays, VR HMDs required at least six times more pixels per unit area to maintain the same perceived quality. The state-of-the-art VR devices in 2020 were 2880 × 1440 pixels. However, the trend was moving toward increased resolutions within ever-shrinking power budgets on lighter, more comfortable headsets (Fig. 3.45).
Fig. 3.45 Arm said an SoC that could drive the level of performance for wearable VR HMDs did not exist (in 2019). That presented a significant challenge to SoC vendors who needed to achieve the above requirements. Courtesy of Arm
Arm claimed the Mali-D77 would also eliminate motion sickness through the
higher frame rates and instantly respond to real-world head movements by repro-
jecting the VR scene according to the latest pose due to its just-in-time single pass
composition process. Moreover, it would deliver crisper images free from artifacts
through its more advanced hardware-based filtering and image processing functions.
Key features:
• R63455 VR DDIC optimized for 2160 × 2400 at 90 Hz head-mounted display
• 1000 ppi, 2K per eye image quality
• Foveal transport provided a clear visual where the user needs it
• VXR7200 VR Bridge supported DP1.4 bandwidth with AMD/Nvidia GPUs over
USB-C
• Support for panels > 2K resolution or faster refresh rates, without image loss due
to cabling.
Although virtual reality devices were becoming more common, there were still
plenty of issues, as anyone who has endured too many demos can attest. For instance,
there could be problems with getting the level of resolution needed to avoid motion
sickness and the screen-door effect.
Companies continued to invest in and promote VR for consumers. Although games
were not the only application for consumer VR, they still comprised 96% of consumer
VR applications. As such, VR was still a novelty and a snack for consumers. A few
diehard VR gamers would spend an hour or more, but most normal people gave it
up after 10 or 15 min and did not go rushing back to it later.
Arm was in a unique marketing position as an SoC IP component supplier. It
was the only company that offered CPU, GPU, display, and other designs to OEM
chip builders. Other GPU designers, such as Nvidia and Qualcomm, used their designs only internally. AMD used its GPU design internally and also licensed it. Other companies such as Digital Media Professionals (DMP), Imagination Technologies, Think Silicon, and VeriSilicon only licensed their GPU designs. Intel could have offered CPU and GPU IP but did not participate in the IP market. The IP GPU suppliers did
not have a CPU design to offer, so Arm was uniquely positioned to know about all
sorts of companies’ needs and ambitions.
In September 2016, SoftBank bought Arm for $32 billion [38]. Many in England
were unhappy that the country’s largest technology company would be in the hands
of the Japanese. Some newspapers in England saw the sale as an unfortunate result
of Brexit, which reduced the pound’s value by 11% against the yen and the dollar
and made Arm more of a bargain for international buyers.
However, SoftBank chairman Masayoshi Son said that he had been following
Arm for many years and that the reduced value of the British pound was not a factor.
Ironically, Prime Minister Theresa May, commenting on a disputed and eventually
thwarted sale of pharmaceutical company AstraZeneca to Pfizer, noted that it was
important for Britain to have control over its business assets. The implication was the
UK might move to block the sale of valuable companies and technologies outside
the country. However, that was not her position after the sale. May said that the deal
had proven that England was open for business. In her official statement, May said,
“The announcement of investment this morning from SoftBank into Arm Holdings
was clearly a vote of confidence in Britain.”
It helped that SoftBank pledged to grow Arm, keep the headquarters in Cambridge,
and promised to “at least double the employee headcount in England.” Arm CEO
Simon Segars stayed on to run the company, and SoftBank said: “management will
stay in place.” In an interview on the Arm Web site, Segars said that Arm was
not looking to sell the company and was confident about their opportunities in the
future. He said then that for Arm to consider a sale, the price would have to be very
compelling, and the deal would have to open more opportunities than the company
could achieve on its own.
Arm certainly had not lost its way since being bought by SoftBank, nor had it been auctioned off in pieces, which some people feared would happen. When Son
told the Arm executives that he wanted to buy the company, he made them a series
of promises: The company would remain an independent subsidiary of SoftBank, he
would not interfere in the day-to-day management of Arm, and the company would
be allowed to invest all the profits into research and development [39].
And SoftBank brought a boatload of money to the party and lots of new contacts.
The workforce had grown from 4500 to over 6000 by 2019, and more hands meant
more and faster development. As a result, Arm’s reach into new market areas was
impressive, and its new concepts of what a GPU should do expanded.
In February 2018, Arm launched Project Trillium, now known as the Arm AI
Platform, a new machine learning-powered platform to provide advanced computing
capabilities to connected devices.
Four years later, in September 2020, Nvidia announced it would buy Arm from Soft-
Bank for $40 billion [40]. In April 2021, the UK government issued an intervention
notice over the sale of Arm by Japan’s SoftBank to Nvidia [41]. Then in January
2022, the U.S. Federal Trade Commission sued to block Nvidia’s purchase of chip
designer Arm, saying the deal would create a powerful company that could hurt the
growth of new technologies. On February 8, 2022, Nvidia withdrew its bid to acquire
Arm. SoftBank then positioned Arm for a public offering, and the company laid off 1,200 people.
3.13 Nvidia Leaves Smartphone Market, 2014
Nvidia led with its strength in the mobile market and promoted its graphics performance. However, in the mobile market, OEMs and consumers valued power efficiency more, and with the smaller screens, high-performance graphics was not that beneficial or appreciated. Power-efficient, lower-performance GPUs from Qualcomm, Imagination, and Arm took over the market.
Then, in May 2015, Nvidia said that instead of wasting its resources on a budget
or even mainstream mobile phone, it would instead carve out a new market for what
it called superphones.
In January 2013, the company introduced the Tegra 4 SoC with a 72-core GPU, a video
engine, and a dual-channel DDR3L 1833 memory controller. It was manufactured
at TSMC in 28 nm HPL (low power with High-K + metal gates optimized for low
leakage).
In February 2013, Nvidia introduced its Phoenix 5-in. superphone platform based on the Tegra 4i, an application processor and LTE modem on the same silicon. Then, in September 2013, the company announced Xiaomi had introduced the Mi3 super phone powered by a Tegra 4.
Fifteen months later in May 2014, Jensen Huang told CNet that the company
would withdraw from the smartphone and tablet market altogether and concentrate
on automotive and games machines.
For the second time in his career and in the company's history, Nvidia pivoted from a losing situation to a winning one. In May 2015, the company wrote off the Icera operation it had paid $352 million for in June 2011, incurring expenses of $100 to $125 million in severance and other employee termination benefit costs [42] (Fig. 3.46).
Research firm Jon Peddie Research estimated Nvidia had invested over a billion
dollars in developing the Tegra product line beginning with the acquisition of MediaQ
in 2003. Mobile phones were always the goal, and Nvidia did not waver from trying to
get into the market—but the market changed faster than Nvidia could. Qualcomm was
always the big player and seeing how rich it was getting prompted smaller companies
(like MediaTek, Rockchip, and others) and giant Intel to enter the market. Failing
to develop a viable processor, Intel began buying its way into the market, a strategy
Nvidia could not match. At the same time, the Chinese suppliers, offering nothing more than an Arm reference design, cut prices to a few dollars for a mobile SoC. That was not a price structure Nvidia could or wanted to live with.
In 2013, Nvidia showed its roadmap for the Tegra product line, revealing the Logan and Parker SoCs. Logan was Nvidia's first SoC with CUDA-compatible shaders based
on the Kepler architecture GPU and OpenGL 4.3. Kepler had a shader block granu-
larity of 1 SMX (192 CUDA cores). Logan demos appeared in 2013 and production
devices in early 2014.
After Logan came Parker (with its code-named Denver CPU cores), adding 64-bit capabilities and a Maxwell GPU. Parker was also built using TSMC's 3D FinFET transistors.
In September 2016, Nvidia announced Xavier, a new SoC based on the company's next-gen Volta GPU, which Nvidia hoped would be the processor in future self-driving cars. Xavier featured a high-performance GPU and the latest Arm CPU, yet had great energy efficiency, according to the company.
Using the expanded 512-core Volta GPU in Xavier, the chip was designed to support deep learning features important to the automotive market, said the company.
A single Xavier-based AI car supercomputer would be able to replace the company’s
fully configured Drive PX 2 with two Parker SoCs and two Pascal GPUs. Xavier was
built using a 16 nm FinFET process and had seven billion transistors, making it the biggest chip built to date.
Xavier was featured again at CES 2017 and then at Nvidia's GPU Technology Conference (GTC) in March 2017, where Nvidia's CEO Jensen Huang announced Toyota would use the Xavier processor for its autonomous car in 2020. Quite a claim and commitment for a product that had not been built yet.
The Nvidia Drive AV (autonomous vehicle) platform used neural networks to let
cars drive themselves and had two new software platforms: Drive IX and Drive AR.
Drive IX, said the company, was an intelligent experience software development
kit that would enable AI assistants for sensors inside and outside the car and for
drivers and passengers in the car.
Xavier ran the Drive software stack, which had been expanded to a trio of AI platforms covering every aspect of the experience inside next-generation automobiles.
Later, in June 2017, Nvidia released (via the Linley report) [43] an updated high-level block diagram that replaced the computer vision accelerator (CVA) with a deep learning accelerator (DLA), a much sexier name. The chip was symbolized in the block diagram in Fig. 3.47.
Fig. 3.47 Nvidia's high-level (circa 2017) Xavier block diagram—DLA is the deep learning accelerator
At CES 2018, the company said Xavier had more than nine billion transistors, which included an eight-core (Arm64) CPU, a deep learning accelerator, a 512-core Volta GPU, an 8K high-dynamic-range (HDR) video processor, and new computer
vision accelerators. The SoC could perform 30 trillion operations per second on 30
W of power. Nvidia claimed Xavier processors were being delivered to customers
that quarter and that it was the most complex SoC the company ever created [44].
Xavier was a key part of the Nvidia Drive Pegasus AI computing platform. It
offered, the company said, the equivalent amount of processing power as a trunk
full of PCs. It was nothing less, claimed Nvidia, than the world’s first AI car
supercomputer, designed for fully autonomous Level 5 robotaxis (Fig. 3.48).
Pegasus was built on two Xavier SoCs and two next-generation Nvidia GPUs.
Nvidia claimed more than 25 companies were already using Nvidia technology to
develop fully autonomous robotaxis, and Pegasus would be its path to production.
Xavier was announced in 2016 and appeared in products available by the end of 2017 and the beginning of 2018.
The next SoC from Nvidia was the Orin, illustrated in Fig. 3.49.
The details of the devices are shown in Table 3.6.
Nvidia continued to pursue the automotive and autonomous vehicle market and also saw success with its Tegra chips in Nintendo Switch consoles. In 2022, the company introduced its Grace Hopper SoC with a 72-core Arm CPU designed for giant-scale AI and HPC systems.
Fig. 3.48 Nvidia's Xavier-based Pegasus board (circa 2018) offered 320 TOPS and the ability to run multiple deep neural networks at the same time. Courtesy of Nvidia
connections for sophisticated photo and video capture, and immersive entertainment
experiences.
The performance enhancements of the Snapdragon 678 over Snapdragon 675
included
• Kryo 460 CPU clock speed up to 2.2 GHz
• Adreno 612 GPU performance increase.
3.14 Qualcomm Snapdragon 678 (2020) 169
The 612 GPU ran at 700–750 MHz, had two execution units with 96 shading units, and could produce 328.2 Gigaflops. It could drive a 2520 × 1080 display with up to 10-bit color depth. The GPU supported Vulkan 1.0, OpenGL ES 3.2, OpenCL 2.0, and DirectX 11 (FL 11_1). In addition to those performance upgrades,
the 678 supported dynamic photography and videography capabilities, immersive
entertainment experiences, fast connectivity, and long battery life.
Dynamic photography and videography came from the dual 14-bit Spectra 250L ISPs, which could process sensors with up to 48 MP and zero shutter lag (ZSL).
• Dual Camera, Multi-Frame Noise Removal (MFNR), ZSL, 30 fps: up to 16 MP
• Single Camera, MFNR, ZSL, 30 fps: up to 25 MP
• Single Camera, MFNR: up to 48 MP
• Single Camera: up to 192 MP.
3.15 Qualcomm Snapdragon 888 (2020)
In December 2020, the company introduced its flagship Snapdragon 888 5G SoC at the Qualcomm Snapdragon Tech Summit. The company said at the time it hoped the 888 would set the benchmark for flagship smartphones in 2021. The SoC integrated 5G along with Wi-Fi 6 and Bluetooth audio.
Qualcomm claimed the 888's Adreno 660 GPU would deliver a 35% increase in graphics rendering (measured probably in fps). With that, it was the company's most
significant performance leap for its GPUs yet. The GPU’s classification number
showed that the Adreno 660 was not the highest performance GPU Qualcomm
offered. The 2019 Snapdragon 8cx SoC (for Always-On, Always-Connected PCs)
had an Adreno 680 GPU.
Compared with the Snapdragon 865, the 888 CPU and GPU were more power-efficient, the company said. Qualcomm claimed a 25% improvement for the Kryo 680 CPU (over the Kryo 585) and a 20% improvement for the Adreno 660 (over the 650). (The 888 was what people had expected to be a Snapdragon 875.) The Snapdragon 865 used LPDDR5 2750 MHz/LPDDR4X 2133 MHz; the 888 used LPDDR5. The SoC was manufactured in Samsung's 5 nm process.
Based on Qualcomm's numbering system, speculation was that the Adreno 660 would still fall below the performance of Apple's four-core A14 GPU. However, compared with the Arm Mali-G78 GPU, the Adreno 660 should have had a significant advantage.
The Snapdragon 888 introduced Qualcomm's variable rate shading (VRS) solution, making the 888 the first mobile device GPU to have such capability. Qualcomm said its VRS would improve game rendering by up to 30% for mobile experiences, with improved power consumption at the same time. The increased graphics processing also enabled HDR features in mobile gaming and allowed frame rates of up to 144 fps.
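The way VRS saves work can be estimated directly: shading once per 2 × 2 block broadcasts one fragment-shader result to four pixels. With an illustrative mix, hypothetical rather than Qualcomm's measurements, of 60% of a 1080p frame at full rate and 40% coarsened to 2 × 2, invocations drop by 30%, in line with the quoted figure:

```python
# Illustrative VRS saving: shading once per NxN block cuts fragment-shader
# invocations for that region by the block area.
def invocations(pixels: int, block: int = 1) -> int:
    return pixels // (block * block)

frame = 1920 * 1080                       # a 1080p frame, for illustration
full_rate = invocations(frame)            # every pixel shaded individually
# Hypothetical mix: 60% of the frame at full rate, 40% coarsened to 2x2.
with_vrs = invocations(int(frame * 0.6)) + invocations(int(frame * 0.4), block=2)
saving = 1 - with_vrs / full_rate         # ~0.30, a 30% reduction
```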
The company also introduced Game Quick Touch, Qualcomm’s anti-lag cursor
control. That, said the company, would increase responsiveness by up to 20% by
lowering touch latency. With the speeds and low latencies that 5G and Wi-Fi 6
deliver, Qualcomm believed elite gamers could unite or compete in real-time for
unmatched global competition.
The 888 had impressive camera improvements. The Spectra 580 ISP was the
first from Qualcomm to feature a triple ISP. It could capture three simultaneous 4K
HDR video streams or three 28-megapixel photos at once at up to 2.7 gigapixels
per second (35% faster than the 865). It also offered improved burst capabilities and
could capture up to 120 photos in a single second at a 10-megapixel resolution. Lastly,
the upgraded ISP added computational HDR to 4K videos and had an improved low-
light capture architecture. It also offered the option to shoot photos in 10-bit color in
high-efficiency image file format.
Qualcomm’s Hexagon 780 AI processor in the Snapdragon 888 was a sixth-
generation AI engine that the company claimed would help improve everything from
computational photography to gaming to voice assistant performance. Qualcomm
said the 888 could perform 26 TOP/s, or trillion operations per second (compared with 15 TOP/s on the 865),
and would do it while delivering three times the power efficiency. Additionally,
Qualcomm promised significant improvements in both scalar and tensor AI tasks as
part of those upgrades.
The Snapdragon 888 also had the second-generation sensing hub, a dedicated
low-power AI processor for smaller hardware-based tasks, such as identifying when
the user raised the phone to light up the display. The new sensing hub relied less on
the Hexagon processor for those tasks.
In late 2020, Apple introduced its latest SoC, the M1, as the heart of its new Macs: the MacBook Air, 13-in. MacBook Pro, and Mac mini, with the iMac and iPad Pro to follow in 2021.
Apple had built SoCs for its smartphones and MP3 players since the early 2000s, so it was not a new adventure for the company. The M1 differed from the iPhone's A13
Bionic in that the A13 had extensive camera processors (ISPs), AI processors, and
memory (Fig. 3.51). However, the two SoCs had some CPU and GPU similarities.
The M1 SoC had an Arm big.LITTLE 8-core CPU with an ultra-wide execution architecture. The four big cores had a 192 KB instruction cache, a 128 KB data cache, and a shared 12 MB L2 cache. The four little cores had a 128 KB instruction cache, a 64 KB data cache, and shared a 4 MB L2 cache. Apple said the little cores only used
10% as much power as the big cores.
The 8-core GPU had 128 EUs that could execute 24,576 concurrent threads.
In compute mode, the GPU was capable of 2.6 TFLOPS, 82 Gtex/s, and 41 Gpix/s.
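Those throughput numbers hang together arithmetically. As an illustration (the ALU count per EU and the clock below are assumptions for the sketch, not Apple-published figures), 1,024 FP32 ALUs each issuing one fused multiply-add per clock at roughly 1.28 GHz lands on the quoted figure:

```python
# Back-of-envelope check of the M1 GPU's 2.6 TFLOPS claim.
eus = 128                 # execution units, per the text
alus_per_eu = 8           # assumed: 8 FP32 ALUs per EU (1,024 total)
flops_per_clock = 2       # one fused multiply-add = 2 floating-point ops
clock_hz = 1.278e9        # assumed GPU clock of ~1.278 GHz

tflops = eus * alus_per_eu * flops_per_clock * clock_hz / 1e12
print(round(tflops, 2))   # ~2.62, in line with the quoted 2.6 TFLOPS
```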
Apple said it could deliver 200% of the performance of the iGPU in the Intel processor used in the previous-generation Air computer when operating at 10 W. The iGPU used a unified (shared) memory architecture (UMA), and the memory, which was on the same substrate, was tightly coupled to the SoC. The company said the M1 could accomplish
up to 390% faster video processing than a 1.2 GHz quad-core Intel Core i7-based
MacBook Air system (both configured with 16 GB RAM and 2 TB SSD) and up to
710% faster image processing than a 3.6 GHz quad-core Intel Core i3-based Mac
mini (Fig. 3.52).
There was also a 16-core neural engine that could reach 11 trillion operations/s.
That meant the Apple processor could provide 150% faster ML performance than a
3.6 GHz quad-core Intel Core i3-based Mac mini.
Fig. 3.52 Floor plan of Apple’s M1 substrate with chip and memory. Courtesy of Apple
The chip was built in TSMC’s 5 nm fab and had 16 billion transistors, the most ever for an Apple semiconductor. Overall, Apple said the M1 would provide twice the performance at maximum power, or the same performance at a third of the power, of an Intel-based Apple computer.
With its 8-core GPU, ML accelerators, and neural engine, the entire M1 chip
would excel at ML. Final Cut Pro could intelligently frame a clip in a fraction of
the time. Pixelmator Pro could magically increase sharpness and detail at incredible
speeds. And every app with ML-powered features would benefit from performance
never before seen on Mac, claimed the company.
Using an Arm processor for Apple’s OS was not a big step. The basic OS was
already running on an Arm processor in the iPhone. And Qualcomm demonstrated it
was possible to run Windows on an Arm processor. The Intel processor had an iGPU,
so moving to an Arm-based processor with an iGPU was not a significant change.
Apple had been building its own SoCs with its iGPU for many years.
Apple’s slide presentation clarified that the GPU was a tile-based deferred rendering architecture, which only Imagination Technologies’ PowerVR offered. Many of the developer documents, especially those for the A14, made this clear.
Apple’s licenses were architectural, so although the GPU was PowerVR-based, it was probably best considered a branch of that design. Arguably, someone could do the GPU part better using Imagination’s higher-end A-Series or B-Series than Apple’s implementation did.
Who could match Apple’s CPU was another question. However, the other archi-
tectural Arm licensees mostly went back to taking cores (Fujitsu and Nvidia being
notable exceptions).
Apple excelled at integration, overall chip design, and verification. Like Qualcomm, it was a master of these heterogeneous SoCs, extracting the maximum speed (and minimum latency) from its integrations.
Apple’s floor-plan diagram showed the general layout and organization of the SoC; the CPU cores are in the upper left.
The 16-core GPU was equally impressive, and Apple claimed it offered twice the performance of the M1. Seen in the center of the top, the GPU had 16 cores with 2048 execution units and could run up to 49,152 concurrent threads. The GPU was capable of 5.2 TFLOPS, 164 Gtexels/s, and 82 Gpixels/s.
The SoC had three media engines (upper right blocks, Fig. 3.54) that ran multiple streams of 4K and 8K video, and it also had a ProRes (RAW) codec.
The M1 Max, based on the M1 Pro, was even more impressive. It doubled the M1 Pro’s memory bandwidth to 400 GB/s and had a whopping 57 billion transistors, 1.75x the count of the M1 Pro and 3.5x that of the M1. Most of the additional transistors were used for the GPU, as shown in Fig. 3.55.
The M1 Max also doubled the embedded unified memory to 64 GB (see Fig. 3.56), again largely for the benefit of the GPU.
Apple said its processors were incredibly power efficient. Compared to a laptop with a discrete GPU, the M1 Pro reached the same performance at 70% less power, or seven times the performance at the same power. That stung Intel, and Intel’s CEO, Pat Gelsinger, said at the time that the company hoped to win back Apple’s business but would need to create a better chip than Apple silicon to do it [47].
The M1 Pro SoC powered Apple’s 14-in. MacBook Pro notebook, which had a 14.2-in. active diagonal screen with 3024 × 1964 resolution (5.9 Mpix) and a refresh rate of up to 120 Hz that dynamically adjusted to the content, slowing the refresh for static images.
The M1 Max powered the 16-in. MacBook Pro, with a 16.2-in. diagonal Liquid Retina XDR display of 3456 × 2234 pixels (7.7 Mpix) and 1 billion colors (10-bit) that put out 1,000 nits sustained (1,600 peak). It had thousands of LEDs in the backlight with dozens of zones and offered a 1,000,000:1 contrast ratio (Fig. 3.57).
Fig. 3.55 The M1 Max offers 4x faster GPU performance than M1. Courtesy of Apple
Fig. 3.56 The M1 Max with its unified embedded memory. Courtesy of Apple
In the fine print, Apple said the test systems were a 4-core MSI Prestige 14 EVO PC laptop with an iGPU and an 8-core MSI GP66 Leopard 11UG, which used the Intel Core i7-1185G7 and the Core i7-11800H, respectively, 4-core and 8-core models of Intel’s Tiger Lake 10 nm SuperFin CPUs.
3.16 Apple’s M1 GPU and SoC (2020) 177
The M1 had five GPU cores in an 8-256 configuration—8 TMUs and 256 shaders per core, or 40 TMUs and 1280 shaders in total.
Apple was now FP32-centric, and FP16 ran at the same rate (no speed-up for half precision, but versus the previous generation this meant FP32 ran at twice the rate).
The M1 Pro had 14–16 cores, so 16 such cores would be a 128-4096 design, likely running at similar or higher clocks than the mobile parts, which ran at up to 1.3 GHz.
The M1 Max had 24–32 cores, so 32 cores would be a 256-8192 design, again likely running at up to 1.3 GHz.
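The totals follow directly from the per-core figures; a sketch of the arithmetic (treating 8 TMUs and 256 shaders per core as the configuration implied by the 40-TMU/1280-shader total):

```python
# Per-core GPU configuration implied by the totals in the text.
TMUS_PER_CORE = 8
SHADERS_PER_CORE = 256   # 1280 shaders / 5 cores

def config(cores):
    """Return (total TMUs, total shaders) for a given core count."""
    return cores * TMUS_PER_CORE, cores * SHADERS_PER_CORE

print(config(5))    # M1 as described: (40, 1280)
print(config(16))   # M1 Pro at 16 cores: (128, 4096)
print(config(32))   # M1 Max at 32 cores: (256, 8192)
```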
In March 2022, at its annual new products conference, Apple announced a new version of the M1—the M1 Ultra, a dual-die chiplet device that introduced Apple’s UltraFusion interconnect technology (Fig. 3.58).
Apple’s UltraFusion used a silicon interposer to connect the chips across more than 10,000 signals, providing 2.5 TB/s of low-latency inter-processor bandwidth. That, Apple claimed at the time, was four times the bandwidth of the leading multi-chip interconnect technology (see Chiplets in Book Two).
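Dividing those headline numbers out gives a feel for the per-wire speed (a rough estimate; Apple did not publish per-signal rates, and 10,000 is treated here as a lower bound):

```python
# Aggregate UltraFusion bandwidth spread across its interposer signals.
total_bytes_per_s = 2.5e12   # 2.5 TB/s
signals = 10_000             # "more than 10,000 signals"

per_signal_mb = total_bytes_per_s / signals / 1e6   # 250 MB/s per signal
per_signal_gbit = per_signal_mb * 8 / 1000          # 2 Gbit/s per signal
print(per_signal_mb, per_signal_gbit)
```

A rate of about 2 Gbit/s per signal is modest by SerDes standards, consistent with a very wide, relatively slow interposer link rather than a few fast serial lanes.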
Apple said the M1 Ultra would behave as, and be recognized by software as, one chip, so developers would not need to rewrite code to take advantage of its performance (Fig. 3.59).
Fig. 3.58 Apple’s UltraFusion packaging architecture connects two M1 Max die to create the M1 Ultra. Courtesy of Apple
Fig. 3.59 Apple said the 20-core CPU of the M1 Ultra could deliver 90% higher multi-threaded performance than the fastest 2022 16-core PC desktop chip in the same power envelope. Courtesy of Apple
The M1 Ultra featured a 20-core CPU composed of 16 high-performance and four high-efficiency cores. Apple claimed it delivered 90% higher multi-threaded performance than the fastest available 16-core PC desktop chip in the same power envelope. Apple also claimed the M1 Ultra could reach the same peak performance
as PC chips with 100 fewer watts.3 Less power consumption means fans would run quieter, even while running power-demanding apps like Logic Pro and processing massive amounts of virtual instruments, audio plug-ins, and effects (Fig. 3.60).
Fig. 3.60 Apple claimed its M1 Ultra 64-core GPU produced faster performance than the highest-end PC GPU available while using 200 fewer watts of power. Courtesy of Apple
3 Testing was conducted by Apple in February 2022 using preproduction Mac Studio systems with Apple M1 Max, 10-core CPU and 32-core GPU, and preproduction Mac Studio systems with Apple M1 Ultra, 20-core CPU and 64-core GPU. Performance measured using select industry-standard benchmarks. 10-core PC desktop CPU performance data tested from Core i5-12600K and DDR5 memory. 16-core PC desktop CPU performance data tested from Core i9-12900K and DDR5 memory. Performance tests were conducted using specific computer systems and reflect the approximate performance of Mac Studio.
For graphics needs, like 3D rendering and complex image processing, the M1
Ultra had a 64-core GPU—eight times the size of M1. Apple claimed it could deliver
faster performance than even the highest-end PC GPU available while using 200
fewer watts of power. Performance was measured using Apple-selected benchmarks
and compared against the performance of a Core i9-12900K with DDR5 memory
and a GeForce RTX 3060 Ti and GeForce RTX 3090.
Apple said its unified memory architecture also scaled up with the M1 Ultra. Memory bandwidth increased to 800 GB/s, which Apple claimed was more than ten times that of the latest PC desktop chip. The M1 Ultra could address up to 128 GB of unified memory. Apple compared that shared memory configuration with a dGPU’s dedicated GDDR6 and cited dGPU-based AIBs as being limited to 48 GB, claiming the M1 Ultra offered more memory to support GPU-intensive workloads such as 3D geometry and rendering massive scenes. However, such claims could not be proved and did not consider the OS’s and applications’ use of the shared memory, nor did Apple address the performance difference of DDR5 versus GDDR6. A PC of the day could be configured with 128 GB of DDR5 plus the 48 GB of GDDR6 for a total of 176 GB, so Apple’s claim of more memory than a PC was not defensible. Subsequent testing showed Nvidia’s RTX 3090 soundly beating the M1 Ultra [48]. Was it all just a publicity stunt?
The new SoC contained 114 billion transistors, the most ever in a personal
computer chip. Apple claimed the performance would be appreciated by artists
working in large 3D environments previously challenging to render, developers compiling code, and video professionals. The company said users could transcode video in ProRes up to 5.6x faster than with a 28-core Mac Pro with Afterburner.
Although Apple designed its own GPU after separating from Imagination Technologies and nearly destroying that company, Apple has since returned to Imagination. Even though neither company will say so in public, there is likely quite a bit of Imagination IP in the new GPU.
Apple uses its own API, called Metal, a low-overhead design in the same spirit as AMD’s Mantle, which was the basis for Khronos’s Vulkan and influenced Microsoft’s DirectX 12. Therefore, the API performance of Apple’s drivers is likely to be on par with PCs using a Khronos or Microsoft API.
Apple’s claims of superior GPU performance while using a shared memory archi-
tecture and DDR5 are sure to be challenged and probably disproven by the discrete
GPU suppliers as soon as they can get their hands on an M1 Ultra.
The M1 Ultra shows up as a single Metal device, so from an API point of view it is one GPU, not two. With Metal, Apple has more control than other vendors, who have to deal with standardized (multi-vendor) APIs; that has some benefits (e.g., Apple can avoid difficult corner cases).
The M1 Ultra’s 2.5 TB/s interconnect is enough for memory access and also enough die-to-die bandwidth to make the two GPUs work as one (even on a single die, a big GPU has to distribute work, and not everything talks to everything, so making it scale over a 2.5 TB/s link is not impossible).
SLI is not employed. However, Apple’s GPU is based on Imagination’s architecture, and Imagination has had multi-chip designs since the Dreamcast days. Distributing work in a tile-based system is far easier than in immediate-mode renderer (IMR)-like systems; spreading tiles across cores is relatively easy. Imagination’s Series5XT was multi-core (Imagination connected SGX5xxMPs, such as MP4 designs, back in 2012), and it scaled linearly.
Apple would have you believe it has somehow broken the laws of physics by bolting two die together and increasing the memory address space. Shared DDR memory is still shared DDR memory, and DDR5 has a clock limit. Clocks also determine how hot a chip gets. And although Apple claims it reaches those performance levels at 100 W, that’s still 100 W, which is not cool; it’s hot (put your hand on a 100 W light bulb). That’s not to diminish what Apple did, but it didn’t do anything revolutionary. AMD and Intel have been building multi-chip devices for ten years. Qualcomm has been building higher-performance, low-power GPUs even longer. AMD is building processors using 5 nm, and so has Samsung, so Apple hasn’t broken any Moore’s law barriers or been first to that node either—although it was first with the M1 Max at 5 nm.
The Apple M1 Ultra was a huge chip, about three times larger than AMD’s Ryzen
APU (Fig. 3.61).
The competition in the PC market is brutal, and Apple has never been shy about praising itself. Its Mac Studio would be an able performer, but it could not replace a powerful desktop system.
In mid-2022, Apple introduced the second-generation M-series processor, the M2.
Fig. 3.61 Apple M1 compared to AMD Ryzen chip size. Courtesy of Max Tech/YouTube [49]
4 Testing conducted by Apple in May 2022 using preproduction 13-in. MacBook Pro systems with
Apple M2, 8-core CPU, 10-core GPU, and 16 GB of RAM; and production 13-in. MacBook Pro
systems with Apple M1, 8-core CPU, 8-core GPU, and 16 GB of RAM. Performance measured
using select industry-standard benchmarks. Performance tests are conducted using specific computer
systems and reflect the approximate performance of MacBook Pro.
5 Testing conducted by Apple in May 2022 using preproduction 13-in. MacBook Pro systems with Apple M2, 8-core CPU, 10-core GPU, and 16 GB of RAM. Performance measured using select industry-standard benchmarks. 10-core PC laptop chip performance data from testing Samsung Galaxy Book2 360 (NP730QED-KA1US) with Core i7-1255U and 16 GB of RAM. Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Pro.
The company also claimed the new SoC’s Neural Engine could process up to
15.8 trillion operations per second—over 40% more than M1. The media engine
included a higher-bandwidth video decoder supporting 8K H.264 and HEVC video
and Apple’s ProRes video engine that enabled playback of multiple streams of both
4K and 8K video. The company said there was also a new image signal processor (ISP) that delivered better image noise reduction.
MetalFX
With the M2, Apple introduced its MetalFX Upscaling Technology—a mix of spatial and temporal upscaling algorithms. Metal is Apple’s API, similar to DirectX and Vulkan, and is discussed in Book Two, The History of the GPU—the Eras and Environment. With the introduction of the M2, Apple also introduced Metal 3.
Nvidia was the first to introduce AI-based scaling with DLSS in August 2019, also
discussed in Book Two. AMD released its FidelityFX Super Resolution (FSR) as an
efficient scaling alternative to DLSS in June 2021, and then a significant improvement
of it with FSR 2.0 in May 2022. Intel announced its XeSS in May 2022, which ran on its Xe GPU’s Matrix Extensions (Intel XMX).
Nvidia DLSS 1 and AMD’s FSR 1 used spatial upscaling, and the second-
generation versions of those technologies included superior and more compute-
intensive temporal upscaling. Nvidia introduced DLSS 2.0 in March 2020, and AMD
introduced FSR 2.0 in May 2022.
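A minimal illustration of what “spatial” means here (a toy bilinear filter, nowhere near the edge-aware filtering FSR 1 actually uses): each output pixel is computed only from the current low-resolution frame, whereas a temporal upscaler additionally blends in reprojected samples from previous frames.

```python
# Toy single-frame (spatial) upscaler: bilinear interpolation of a
# grayscale image, represented as a list of rows of floats.
def bilinear_upscale(img, out_w, out_h):
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        # Map the output row back into source coordinates.
        sy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy); y1 = min(y0 + 1, in_h - 1); fy = sy - y0
        row = []
        for x in range(out_w):
            sx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx); x1 = min(x0 + 1, in_w - 1); fx = sx - x0
            # Blend the four neighboring source pixels.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out

low = [[0.0, 1.0],
       [1.0, 0.0]]
high = bilinear_upscale(low, 4, 4)  # 2x2 -> 4x4 with smooth gradients
```

The corner pixels of the output match the source exactly while interior pixels are blends, which is why purely spatial scalers soften detail and why the second-generation, temporal approaches recover sharpness at higher compute cost.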
3.16.3 Summary
3.17 Conclusion
The mobile market exploded and then contracted. In the end, Apple, a company with
no history in communications, became the market leader. Companies with massive
funding and resources, like NEC, were driven out or, like Nokia, shrunk. Others
with fabulous technology, like Nvidia, found out they did not have the right stuff.
The Chinese mobile phone market expanded and initially embraced foreign suppliers, but by 2020 it had driven most of them out in favor of Chinese suppliers, Apple and Samsung being exceptions.
References
1. Garnsey, E., Lorenzoni, G., Ferriani, S. Speciation through entrepreneurial spin-off: The Acorn-
Arm story, (March 2008). Research Policy. 37 (2): 210–224. doi:https://doi.org/10.1016/j.res
pol.2007.11.006. Retrieved 2 June 2011.
2. History of Arm: from Acorn to Apple, The Telegraph, (January 6, 2011), https://www.telegraph.
co.uk/finance/newsbysector/epic/arm/8243162/History-of-Arm-from-Acorn-to-Apple.html
3. Acorn Group and Apple Computer Dedicate Joint Venture to Transform IT in UK Education,
Archived 3 March 2016 at the Wayback Machine, press release from Acorn Computers, 1996.
4. Clarke, P. Arm acquires Norwegian graphics company, EE Times, (June 23, 2006), https://
www.eetimes.com/arm-acquires-norwegian-graphics-company/#
5. Falanx Microsystems rolls out new multimedia accelerator cores for handheld SoCs, Tech-
Watch, Volume 5, Number 3, (February 14, 2005).
6. 3D on Java?—ask Arm, TechWatch, Volume 8, Number 4, (February 25, 2008).
7. Peddie, J. The History of Visual Magic in Computers, Springer Science & Business Media,
(June 13, 2013), https://link.springer.com/book/10.1007/978-1-4471-4932-3
8. Kahney, L. Inside Look at Birth of the IPOD, Wired, (July 21, 2004), https://www.wired.com/
2004/07/inside-look-at-birth-of-the-ipod/
9. Apple seed, Forbes, (February 16, 2004), https://www.forbes.com/forbes/2004/0216/050.html?
sh=1cb8705b234b
10. Gardner, D. Nvidia Acquires PortalPlayer For $357 Million, InformationWeek (November 6,
2006), https://www.informationweek.com/nvidia-acquires-portalplayer-for-$357-million/d/d-
id/1048542
11. Nordlund, P. Bitboys G40 Embedded graphics processor, HotChips, Graphics Hardware 2004
Hot3D presentations, https://www.graphicshardware.org/previous/www_2004/Presentations/
gh2004.hot3d.bitboys.pdf
12. Clarke, P. Bitboys licenses G34 graphics processor to NEC Electronics, EETimes
(August 10, 2004), https://www.eetimes.com/bitboys-licenses-g34-graphics-processor-to-nec-
electronics/#
13. Bitboys Introduces Vector Graphics Processor for Mobile Devices at Game Developers
Conference, Design & Reuse, (March 7, 2005), https://tinyurl.com/yph4wwnb
14. Scalable Vector Graphics (SVG) Tiny 1.2 Specification, W3C Recommendation (December 22,
2008), https://www.w3.org/TR/SVGTiny12/
15. Rice, D. (Editor), OpenVG Specification, Version 1.0.1 (August 1, 2005), https://www.khronos.
org/registry/OpenVG/specs/openvg_1_0_1.pdf
16. Nokia Growth Partners invests four million euros in Bitboys to support the company’s growth
and product development Press release, (February 9, 2006), https://tinyurl.com/2t8jwfk7
17. ATI acquires Bitboys Oy, Press release (May 4, 2006), https://evertiq.com/news/3786
18. Peddie, J. Bitboys acquires ATI and leads them to Nokia, TechWatch - Volume 6, Number 10,
(May 8, 2006).
19. Peddie, J. Bitboys acquires ATI and leads them to Nokia, TechWatch, Volume 6, Number 10,
May 8, 2006, (page 12).
20. Smith, T. Bitboys offers next-gen mobile 3D chips, (August 10, 2004), https://www.theregister.
com/2004/08/10/bitboys_g40/
21. Tikka, J-P. The Ups and Downs of Bitboys, Now Known As Qualcomm Finland, (April
1st, 2009), https://xconomy.com/san-diego/2009/04/01/the-ups-and-downs-of-bitboys-now-
known-as-qualcomm-finland/2/
22. ATI to exhibit handheld products at International CES, TechWatch, Volume 4, Number 1,
(January 12, 2004), (page 30).
23. Peddie, J. Qualcomm demonstrates its Q3D gaming architecture, TechWatch, Volume 5,
Number 4, page 16 (February 28, 2005).
24. Peddie, J. Make a left, and the exit door will be on the right in front of you, TechWatch, Volume
9, Number 3, page 4, (February 2, 2009).
25. Qualcomm Acquires AMD’s Handheld Business, Press release (January 21, 2009), https://news.
softpedia.com/news/Qualcomm-Acquires-AMD-039-s-Handheld-Business-102577.shtml
26. STMicroelectronics and Texas Instruments Team Up to Establish an Open Standard for Wireless
Applications. STMicroelectronics Press release (December 12, 2002), https://tinyurl.com/89j
kt3cw
27. Goddard, L. Texas Instruments admits defeat, moves focus away from smartphone processors, The Verge (September 26, 2012), https://www.theverge.com/2012/9/26/3411212/texas-instruments-omap-smartphone-shift
28. Smith, R. Arm’s Mali Midgard Architecture Explored (July 3, 2014), https://www.anandtech.
com/show/8234/arms-Mali-midgard-architecture-explored
29. Sørgård, E Launching Mali-T658: “Hi Five-Eight, welcome to the party!”, Arm blog,
(September 11, 2013), https://tinyurl.com/y9r763rp
30. Mijat, R GPU Computing in Android? With Arm Mali-T604 & RenderScript Compute You
Can!, Arm Blog, (September 11, 2013), https://tinyurl.com/xrw952bt
31. Plowman, E. Multicore or Multi-pipe GPUs: Easy steps to becoming multi-frag-gasmic, Arm
Blog, (September 11, 2013), https://tinyurl.com/3kshz7my
32. Clarke, P. Arm announces 8-way graphics core, EE Times (November 10, 2011), https://www.
eetimes.com/arm-announces-8-way-graphics-core/#
33. Peddie, J. Arm’s new Valhall-based Mali-G57, TechWatch, https://www.jonpeddie.com/report/
arms-new-valhall-based-Mali-g57/
34. Peddie, J. Company claims highest-performing Valhall Mali GPU, (May 26, 2020), https://
www.jonpeddie.com/report/arm-Mali-g78-gpu/
35. China Games Market 2018 (August 3, 2018), https://newzoo.com/insights/infographics/china-
games-market-2018/
36. Peddie, J. Arm introduces a display processor and enhances its Mali GPU: Enabling VR with
a display processor and an AMP-sipping little GPU (June 3, 2019), https://www.jonpeddie.
com/report/arm-introduces-a-display-processor-and-enhances-its-Mali-
37. Peddie, J. Arm’s Mali-Cetus display processor (May 31, 2017), https://www.jonpeddie.com/report/arms-Mali-cetus-display-processor1/
38. SoftBank Offers to Acquire Arm Holdings for GBP 24.3 Billion (USD 31.4 Billion) in Cash,
BusinessWire, (July 18, 2016), https://www.businesswire.com/news/home/20160717005081/
en/SoftBank-Offers-to-Acquire-Arm-Holdings-for-GBP-24.3-Billion-USD-31.4-Billion-in-
Cash
39. Medeiros, J. How SoftBank ate the world, Wired, (July 2, 2019), https://www.wired.co.uk/art
icle/softbank-vision-fund
40. Peddie, J. Nvidia to buy Arm: The next wave of IT is underway, (September 15, 2020), https://
www.jonpeddie.com/report/nvidia-to-buy-arm/
41. Sandle, P. UK invokes national security to investigate Nvidia’s Arm deal, Reuters,
(April 19, 2021), https://www.reuters.com/world/uk/uk-intervenes-nvidias-takeover-arm-nat
ional-security-grounds-2021-04-19/
42. TechWatch, Volume 15, Number 10 (May 12, 2015).
43. Demler, M. Xavier Simplifies Self-Driving Cars. Linley Newsletter (Formerly Processor Watch,
Linley Wire, and Linley on Mobile), Issue #533, (June 22, 2017).
44. Shapiro, D. Nvidia Drive Xavier, World’s Most Powerful SoC, Brings Dramatic New AI
Capabilities, (January 7, 2018), https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-proces
sor/
45. Apple Computer, Apple announces Mac transition to Apple silicon, [Press release], (June 22,
2020), https://www.apple.com/newsroom/2020/06/apple-announces-mac-transition-to-apple-
silicon/
46. Apple Computer, Introducing M1 Pro and M1 Max: the most powerful chips Apple has ever
built [Press Release], (October 18, 2021), https://www.apple.com/newsroom/2021/10/introd
ucing-m1-pro-and-m1-max-the-most-powerful-chips-apple-has-ever-built/
Gallagher, W. Intel CEO hopes to win back Apple with a ‘better chip’, (October
18, 2021), https://appleinsider.com/articles/21/10/18/intel-ceo-hopes-to-win-back-apple-with-
a-better-chip
48. Clover, J. M1 Ultra Doesn’t Beat Out Nvidia’s RTX 3090 GPU Despite Apple’s Charts, (March
17, 2022), https://www.macrumors.com/2022/03/17/m1-ultra-nvidia-rtx-3090-comparison/
49. Max Tech Shorts Channel, M1 Max Mac Studio FULL Teardown & Thermal Throttle Test!
(March 21, 2022), https://www.youtube.com/channel/UCptwuAv0XQHo1OQUSaO6NHw
Chapter 4
Game Console GPUs
When game consoles were first introduced, they used the best processors available, hoping they would be good enough for at least five years (Fig. 4.1).
Many consoles had custom-made coprocessors and accelerators to give them unique and proprietary features. Also, many game developers were employees of the console makers, so the makers did not have to worry about pleasing third-party developers and often exploited custom features for competitive advantage (Fig. 4.2).
Consoles have had GPU-quality technology since the Nintendo 64 in 1996. As
processors and their support became more complicated, it became apparent that
custom systems (producing tens of millions of units) were no competition for the
commercial-off-the-shelf (COTS) devices (producing hundreds of millions of units).
As a result, the industry shifted to COTS and semi-custom devices based on COTS
designs. This shift began first with GPUs and then included CPUs (Table 4.1).
As will be discovered in this chapter, the console market has a shrinking number
of semiconductor choices and a growing number of console suppliers.
In the late 1990s, Ken Kutaragi, the head of Sony Computer Entertainment, convinced Sony’s senior management to take a considerable risk on his vision by authorizing
the PlayStation 2 project [1]. The project was risky because Kutaragi wanted to create
an entirely new design instead of using components from the first PlayStation. The
costs of developing new designs and parts can grow unexpectedly, and it was more
than Sony could take on alone. Sony executives told Kutaragi to find a partner. They
suggested he work with Microsoft to produce an online video game business. In 1999,
Kutaragi met with Microsoft’s CEO and chairman, Bill Gates, but no agreement or
partnership came out of it. Kutaragi did not know at the time that Microsoft was
planning to compete with Sony, and Kutaragi’s offer may have inadvertently given
Microsoft insight into Sony’s plans.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 187
J. Peddie, The History of the GPU - New Developments,
https://doi.org/10.1007/978-3-031-14047-1_4
188 4 Game Console GPUs
Fig. 4.1 Rise and fall of console supplier versus market growth
Kutaragi’s team took the popular and well-respected 32-bit MIPS R3000 processor design used in the first PlayStation and extended it, creating the first 128-bit MIPS processor. Kutaragi named his creation the Emotion Engine. Toshiba manufactured the 13.5 million transistor chips in 175 nm at its Oita fab. The Emotion Engine could reach 6.2 GFLOPS of performance.
The floating-point processors in the Emotion Engine processed geometric data
(T&L) and sent the transformed vertices to the Sony Graphics Synthesizer (GS)
250 GPU. The CPU also created primitive instructions and sent them to the GPU.
4.1 Sony PlayStation 2 (2000) 189
The GPU took the vertex data and clipped it, where necessary, to the viewing window. Then it conducted a z-sort to remove any occluded polygons that would not appear in the viewport. Figure 4.3 shows the block diagram of the PS2.
Toshiba also designed and manufactured a custom GPU for Sony called the GS. It
had a fill rate of 2.4 gigapixels per second and could render up to 75 million polygons
per second. When the GPU used functions such as texture mapping, lighting, and
anti-aliasing, the average game performance dropped from 75 million polygons per
second to only 3–16 million polygons per second—which was still very respectable
for that time.
With a full diffuse texture map, Gouraud shaded, the chip could generate 1.2 Gpix per second (37,500,000 32-bit pixel raster triangles). Moreover, Sony claimed the chip could generate 0.6 Gpix per second (18,750,000 32-bit pixel raster triangles) with two full textures (diffuse map and specular, alpha, or other).
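For readers unfamiliar with the term, Gouraud shading lights a triangle only at its three vertices and linearly interpolates the resulting colors across the face; a compact sketch (illustrative only, not the GS hardware path):

```python
# Interpolate vertex colors across a 2D triangle with barycentric weights.
def gouraud_sample(p, tri, colors):
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    w2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    w3 = 1.0 - w1 - w2
    # Each color channel is the weighted blend of the three vertex colors.
    return tuple(w1 * a + w2 * b + w3 * c
                 for a, b, c in zip(*colors))

tri = [(0, 0), (10, 0), (0, 10)]
rgb = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]    # red, green, blue vertices
print(gouraud_sample((0, 0), tri, rgb))    # (1.0, 0.0, 0.0) at vertex 1
```

Because the hardware only interpolates rather than re-lighting every pixel, Gouraud shading was cheap enough for consoles of that era to run at full fill rate.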
The chip had 4 MB of embedded DRAM and an external dual-channel Direct Rambus DRAM interface with 3.2 GB/s of bandwidth. Kutaragi claimed it was ten times faster than any graphics accelerator available. Kutaragi even suggested the next iteration of the graphics synthesizer might have embedded flash memory, but that never occurred.
The GPU had 53.5 million transistors in a 279 mm² die, built in a 180 nm embedded DRAM CMOS process. The GPU ran at 147.5 MHz and could display up to 1920 × 1080 at 32 bits (RGBA: 8 bits each).
At the 2001 IEEE International Solid-State Conference in San Francisco, Sony
Computer Entertainment engineers presented a GPU for a development system
named GScube. Created in parallel with the GS-250 PlayStation chip, the GScube
used multiple Emotion Engines and Graphics Synthesizers.
The chip resulted from engineering cooperation between Sony Computer Enter-
tainment, the Sony Corporation Semiconductor Network Company, the Sony Kihara
Research Center, and Altius Solutions, Inc. (Altius merged with Simplex Solutions
in October 2000). Sony Computer Entertainment, the Sony Kihara Research Center,
and Sony Semiconductor developed the chip’s architecture and functional design.
Altius and Sony Semiconductor developed the electrical and physical design of the
chip.
The GScube system’s architecture was an enhanced version of the PlayStation 2 computer entertainment system’s architecture. The GScube included 16 graphics units, each combining a 128-bit microprocessor and a graphics rendering processor.
In April 2003, Sony announced the Emotion Engine and graphics synthesizer chips would be integrated and manufactured in 90 nm at Sony’s new fab. The new,
highly integrated semiconductor would have 53.5 million transistors, enhanced low
power consumption, and reduced cost. According to the International Technology
Road map for Semiconductors, mass production with a 90 nm embedded DRAM
process put Sony in a leadership position six months ahead of others in the industry
[2]. Consuming only 8 W, the new integrated chip would have an 86 mm2 die and
run at 250 MHz [3].
Production of the new semiconductor started in the spring of 2003 in Oita TS
Semiconductor, a joint venture between SCEI and Toshiba Corporation. SCEI’s semi-
conductor fabrication facility in Isahaya City, Nagasaki Prefecture, began producing
the chip in the fall.
The PlayStation 2 proved to be even more successful than the first PlayStation
and became the bestselling video game console of all time. As of September 2020,
Sony had sold almost 158 million PlayStation 2 units worldwide (compared to 102
million for the PlayStation 1).
4.2 Microsoft Xbox (2001) 191
When Ken Kutaragi met with Bill Gates in 1999 to propose a partnership, Gates
politely turned him down. Gates privately agreed with Kutaragi that a console would
become the living room entertainment center because a console would be less intim-
idating and more like a consumer device—easy to use, limited, and specific. Candid
and forthright, Kutaragi explained his vision of how the PlayStation 2 could become
a home entertainment center, including using its internet connection to do email. That
frightened Gates and Intel because they planned to make the PC a home entertain-
ment center, but Microsoft struggled to get Outlook established as the email client for
the PC. Moreover, Gates was concerned that game developers would stop developing
PC games in favor of the PlayStation. If game developers did a PC version (a port) of
a game, it usually came out later and did not exploit all the PC’s power capabilities.
However, Gates also had a backup plan in the form of a Microsoft console code-named Midway—a reference to the World War II Battle of Midway, where the USA defeated Japan, a jab at Japanese-owned Sony.
The idea for project Midway was developed in 1998 and championed by Seamus
Blackley and Kevin Bachus [5]. Otto Berkes, team leader of DirectX, and Ted Hase
later joined the team. They decided the console would use Microsoft’s DirectX API
so PC games could be ported quickly to the console (Fig. 4.4).
Fig. 4.4 Original Xbox team Ted Hase, Nat Brown, Otto Berkes, Kevin Bachus, and Seamus
Blackley. Courtesy of Microsoft
4 Game Console GPUs
In The Making of the Xbox: How Microsoft Unleashed a Video Game Revolution, Dean Takahashi wrote, "DirectX enabled the PC to take advantage of
the enormous boosts in 3D graphics and keep up with consoles such as the Sony
PlayStation. Were it not for DirectX, Microsoft would have had no foundation to
build a software-based games business” [6].
Blackley had met Gates during a DreamWorks tech demo of Trespasser, the game tied into Jurassic Park. The game impressed Gates, who helped Blackley secure a job at Microsoft as program manager for Entertainment Graphics.
The team pitched the console product idea to Bill Gates, and Gates approved
it [7]. The product was named the DirectX Box. It would use Windows 2000 and
an x86 CPU. Kutaragi had not revealed all his plans to Gates. When asked which
processor would be in the PS2, Kutaragi only said it was a custom CPU. Gates and
the team thought a PC processor would be superior because Microsoft had a special
relationship with Intel.
While the design for the Xbox was coming together, two factions formed within the
company about whose semiconductors to use. In August 1997, Microsoft completed the acquisition of WebTV, a company founded in 1995 by Steve Perlman, Bruce Leak, and Phil Goldman, and established a hardware design and development team within Microsoft. WebTV later became MSN TV.
When the Midway DirectX box project became known within Microsoft, Perlman and his team lobbied to design the GPU for the project, but Kevin Bachus and his team lobbied to use commercial off-the-shelf (COTS) parts. It soon became clear that Microsoft could not design, build, and debug an in-house part in time, given its budget constraints. In addition to this disagreement, Perlman also wanted to use a different API instead of DirectX. The project was dealing with too many variables and too little time. Jay Torborg, the architect and a proponent of the Talisman project, sided with Bachus.
Negotiations began with Nvidia and briefly with ATI.
Microsoft decided to go with Nvidia if the price was favorable. Negotiations between Nvidia and Microsoft went on for months: Nvidia wanted the deal, Microsoft wanted the part, and all that stood between them was the price. Eventually, Nvidia gave in, and Microsoft confirmed that an Nvidia GPU would be in the Xbox. But the transaction, subsequent communications, and tech support between the two companies did not go smoothly. Unhappy with the margins, Nvidia did not pursue any follow-up business and would not receive a second deal with Microsoft after the game console project concluded.
Nvidia supplied a 233 MHz NV2A, a variant of its nForce IGP, which combined a GeForce 3 GPU and a northbridge UMA chip. The GPU had a 128-bit memory interface capable of running at 200 MHz. The Xbox had four banks of 16 MB DDR SDRAM; the GPU was allotted half of the 64 MB to avoid conflicts with the CPU and other delays.
The Xbox used a 733 MHz, 32-bit, one-GFLOPS Intel Pentium III—less raw floating-point throughput than the Sony PlayStation 2's 300 MHz, 128-bit, 6.2 GFLOPS Emotion Engine. However, the Xbox had twice the RAM (64 MB versus the PS2's 32 MB) [8]. Figure 4.5 shows a block diagram of the Xbox system.
Fig. 4.6 Halo, developed by Bungie, was an exclusive Xbox title and credited with the machine’s
success. Courtesy of Microsoft
particles per second. It had four pixel pipelines, each with two texture units. The chip's peak pixel fill rate was 932 megapixels per second (233 MHz times four pipelines), its texture fill rate was 1,864 megatexels per second (932 megapixels times two texture units), and it had four ROPS.
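The fill-rate arithmetic above is easy to verify. The following is a minimal sketch (the function names are ours, not from any GPU API): a pixel pipeline emits at most one pixel per clock, and each pipeline here carries two texture units.

```python
# Peak fill-rate arithmetic for the Xbox's NV2A, as described in the text.

def pixel_fill_rate(clock_hz: float, pipelines: int) -> float:
    """Peak pixel fill rate in pixels per second: one pixel per pipeline per clock."""
    return clock_hz * pipelines

def texel_fill_rate(clock_hz: float, pipelines: int, texture_units: int) -> float:
    """Peak texture fill rate in texels per second."""
    return pixel_fill_rate(clock_hz, pipelines) * texture_units

NV2A_CLOCK = 233e6  # 233 MHz core clock

pixels = pixel_fill_rate(NV2A_CLOCK, pipelines=4)          # 932 megapixels/s
texels = texel_fill_rate(NV2A_CLOCK, 4, texture_units=2)   # 1,864 megatexels/s
print(f"{pixels / 1e6:.0f} Mpix/s, {texels / 1e6:.0f} Mtex/s")
```

Running the sketch reproduces the 932 Mpix/s and 1,864 Mtex/s figures quoted above.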
The NV2A graphics processor was an average-sized chip with a die area of 142 mm² and 57 million transistors. Built on TSMC's 150 nm process, the integrated GeForce 3 variant was compatible with DirectX 8.1.
A popular story at the time held that Nvidia could not have gotten the Xbox GPU out on time if not for the 3dfx engineers working at Nvidia. However, those engineers' contributions would have had to arrive late in the development of the nForce derivative of the Xbox chipset; Nvidia did not absorb 3dfx until 2001, during the last 25% of the Xbox project. 3dfx contributions to the nForce2, which came out in July 2002, seem more likely.
4.2.1 Epilogue
Perlman left Microsoft in 1999 and started Rearden Steel (later renamed Rearden,
Limited), a business incubator for new media and entertainment technology compa-
nies. The members of Perlman’s group who stayed at Microsoft produced the
Microsoft TV platforms and later helped develop the web browsing capabilities
for Microsoft’s next-generation console, the Xbox 360.
Although it was more expensive to develop than Microsoft expected, the Xbox was a huge commercial success. Still, Microsoft did not make a profit from the first Xbox, a fact the company tried to hide for a long time. By 2002, the original team had broken up and gone to other companies or projects.
Seamus Blackley was the cocreator and technical director of the Xbox. Blackley
was one of the people who convinced Bill Gates to risk a potential $3.3 billion in
losses. Blackley argued it was a smart investment for Microsoft to gain the desired
foothold in living rooms, but he was too outspoken about his opinion, so he was booted
out of Microsoft in 2002. Notably, Blackley’s actions were not unlike Kutaragi’s
actions at Sony, which also led to his removal.
Of the other four original Xbox team members, Kevin Bachus left Microsoft and the industry in 2001. In 2005, he reappeared and took over the start-up console company Infinium Labs, which became Phantom Labs and struggled to introduce the much-hyped and therefore controversial Phantom console. It was never officially launched, and Bachus left shortly after joining. Ted Hase left Microsoft in 2006, and Otto Berkes stayed at Microsoft and worked on other projects until May 2011. The other supporters of the Xbox, such as Ed Fries, Cameron Ferroni, J Allard, and Robbie Bach, have also left Microsoft.
Nvidia never got over the lowball price it had given Microsoft, and the two companies entered into arbitration over the dispute in 2002 [9]. Nvidia stated in its SEC filing that Microsoft wanted a $13 million discount on shipments for the 2002 fiscal year. The two companies arrived at a private settlement on February 6, 2003.
4.3 Sony PSP (2004)
In December 2004, Sony released the PlayStation Portable (PSP) handheld game
console in Japan. Then in March 2005, the company introduced the PSP in North
America. Later that same year, the game console was also released in the PAL regions.
It was the first and only handheld device in the PlayStation line of consoles.
Legend has it that Ken Kutaragi came up with the idea for the PSP on the back of
a beer mat just before E3 2003 [11]. The graphics comprised a custom rendering engine plus a surface engine GPU built by Toshiba using GPU IP from Imagination Technologies. The GPU ran at 166 MHz, as indicated in the block diagram in Fig. 4.7.
When the PSP was released, Shinichi Ogasawara led its design team at Sony.
The PSP was the most powerful handheld on the market [12]. The GPU had a 2 MB
VRAM frame buffer, which was quite a lot for such a system at that time. The display
resolution was 480 × 272 pixels, with 24-bit color (8 bits more than other handheld
devices). It had the largest handheld display of its time: a 30:17 widescreen TFT LCD.
The system was the first real competitor to Nintendo’s handheld dominance.
finesse with the transistor radio, Walkman, and Trinitron TV. Sony also has been
hurt by its insistence on making its content proprietary,” Kutaragi said [13].
Thus, an era of Kutaragi’s influence at Sony ended, but Sony’s success in the
console business would continue.
At the 2014 Game Developers Conference, Ken Kutaragi received a Lifetime Achievement Award. The award was presented to Kutaragi by Mark Cerny, the lead architect of the PlayStation 4 console [14].
4.4 Xbox 360—Unified Shaders and Integration (November 2005)
By 2002, Microsoft knew it had created a valuable and desirable brand with the Xbox
franchise. Even though several of the original team members had left, others in the
company began to talk about a follow-up product and discuss its road map. Microsoft
and Sony were the dominant console suppliers at this time and competed to obtain
exclusive deals on new games.
Although the Nvidia IGP in the original Xbox performed well, hard feelings
remained between Microsoft and Nvidia. Therefore, Microsoft contacted ATI. ATI
had a good relationship with the DirectX team and often brought new concepts and
technology to Microsoft.
In addition to using ATI for the GPU, Microsoft would also change the central
processor in the Xbox 360 to the XCPU (a code-named Xenon processor designed
by IBM). The XCPU’s 3.2 GHz triple-core was an advanced multi-processor; each
core could process two threads simultaneously [15] (Figs. 4.9 and 4.10).
The Xbox 360's CPU had three identical cores running at 3.2 GHz that shared an eight-way set-associative, 1 MB L2 cache. Each core contained a complement of four-way SIMD vector units. IBM customized the CPU's L2 cache, cores, and vector units.
The front-side bus ran at 5.4 Gbit per pin per second with 16 logical pins in
each direction. That provided a 10.8-GB/s read bandwidth and a 10.8-GB/s write
bandwidth. The FSB design combined with the CPU L2 provided the additional
support needed for the GPU to read directly from the CPU L2 cache.
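The quoted FSB bandwidth follows from the pin count and signaling rate: 16 logical pins per direction, each at 5.4 Gbit/s, divided by 8 bits per byte. A quick check (the helper name is ours):

```python
# Xbox 360 front-side-bus bandwidth, per direction, from the figures above.

def fsb_bandwidth_gbytes(gbits_per_pin: float, pins: int) -> float:
    """Aggregate bandwidth in GB/s for one direction of the bus."""
    return gbits_per_pin * pins / 8  # convert bits to bytes

read_bw = fsb_bandwidth_gbytes(5.4, 16)   # 10.8 GB/s read
write_bw = fsb_bandwidth_gbytes(5.4, 16)  # 10.8 GB/s write, the other direction
print(read_bw, write_bw)
```

The product matches the 10.8 GB/s read and 10.8 GB/s write figures in the text.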
The CPU used the same Direct3D (D3D) compressed data formats as the GPU. D3D allowed the developer to store compressed graphics data generated by the CPU directly in the L2. The compressed formats allowed an approximate 50% savings in required bandwidth and memory size.
Xbox 360 Gave Game Developers a Cross-Platform Advantage
The GPU operated at 500 MHz and had 48 combined vector and scalar shader ALUs.
The GPU shaders’ dynamic allocation was of significant interest; there was no distinct
vertex or pixel shader—the hardware automatically adjusted to the load on a fine-grained basis. This dissolution of dedicated shader types occurred before the public release of DirectX 10 (known in Direct3D 10 as Shader Model 4.0). Microsoft described the
GPU in the Xbox 360 as compatible with the High-Level Shader Language of D3D
9.0 with extensions. Indeed, those extensions gave the Xbox 360 a year’s head start
on the unified shaders capability that would appear in Windows Vista’s DirectX 10
in late 2006.
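The dynamic allocation described above can be sketched in a few lines. This is an illustrative toy model, not ATI's actual scheduler: it simply splits the pool of 48 ALUs between vertex and pixel work in proportion to the pending jobs each cycle, which is the essence of why a unified design wastes less hardware than fixed vertex/pixel partitions.

```python
# Toy model of fine-grained unified-shader allocation (our sketch, not
# ATI's scheduler): the ALU pool is re-divided every cycle based on load.

def allocate_alus(total_alus: int, vertex_jobs: int, pixel_jobs: int) -> tuple[int, int]:
    """Split the ALU pool proportionally; returns (vertex_alus, pixel_alus)."""
    jobs = vertex_jobs + pixel_jobs
    if jobs == 0:
        return 0, 0
    vertex_share = round(total_alus * vertex_jobs / jobs)
    return vertex_share, total_alus - vertex_share

# Geometry-heavy frame: most ALUs go to vertex work...
print(allocate_alus(48, vertex_jobs=900, pixel_jobs=100))  # (43, 5)
# ...fill-rate-heavy frame: the same hardware flips to pixel work.
print(allocate_alus(48, vertex_jobs=50, pixel_jobs=950))   # (2, 46)
```

A fixed 8-vertex/24-pixel split would leave units idle in both of these frames; the unified pool keeps all 48 busy either way.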
Although Sony had its own game studio, Microsoft was in a contest with Sony and wanted to attract independent game developers away from the PS2 and toward the Xbox. Thanks to the success of the PlayStation, most game developers wanted to be on the PS2 platform. It takes years and millions of dollars to develop a game; if Sony attracted the game developers first, it could be a year or two before they could come out with an Xbox version, and even later for the PC. If Microsoft could offer game developers advance information and a development platform for the next generation of PCs, the team could offer developers a double payoff that Sony could not match.
The gambit paid off: Microsoft had 14 games available in North America and 13 in Europe at launch, while Sony had only 10. Microsoft's premier game, Halo 3, was released two years later.
On May 12, 2005, Microsoft announced the Xbox 360 on MTV during MTV Presents: The Next Generation Xbox Revealed. The Xbox 360 shipped on November 22, 2005.
The GPU in the Xbox 360 was a customized version of ATI’s R520, a revolutionary
design at the time. In keeping with the “X” prefix and echoing the IBM processor,
ATI called this GPU the Xenos.
The ATI Xenos, code-named C1, had 10 MB of internal eDRAM and 512 MB of 700 MHz GDDR3 RAM. ATI's R520 GPU used the R500 architecture, made with a 90 nm production process at TSMC, with a die size of 288 mm² and a transistor count of 321 million. See Fig. 4.10 for a block diagram of the system.
The GPU's ALUs were 32-bit IEEE 754 floating-point compliant, with the typical graphics simplifications of rounding modes, denormalized-number handling (flushed to zero on reads), exception handling, and not-a-number handling. They were capable of vector (including dot product) and scalar operations with single-cycle throughput—that is, all operations issued every cycle. That gave the ALUs a peak of 96 shader calculations per cycle while fetching textures and vertices.
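The 96-operations figure follows directly from the ALU count: 48 ALUs, each co-issuing one vector and one scalar operation per cycle. A quick arithmetic check (the constant names are ours):

```python
# Xenos peak shader throughput, from the figures in the text.

XENOS_ALUS = 48
OPS_PER_ALU_PER_CYCLE = 2  # one vector + one scalar op, issued every cycle

peak_ops_per_cycle = XENOS_ALUS * OPS_PER_ALU_PER_CYCLE
assert peak_ops_per_cycle == 96  # matches the "96 shader calculations per cycle"

# At the 500 MHz clock given above, that is 48 billion shader calculations/s.
print(peak_ops_per_cycle * 500e6 / 1e9, "Gops/s")  # 48.0 Gops/s
```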
The GPU had eight vertex shader units supporting the VS3.0 Shader Model of
DirectX 9. Each was capable of processing one 128-bit vector instruction plus one
32-bit scalar instruction for each clock cycle. Combined, the eight vertex shader units
could transform up to two vertices every clock cycle. The Xenos was the first GPU
to process 10 billion vertex shader instructions per second. The vertex shader units
supported dynamic flow control instructions such as branches, loops, and subroutines.
One of the significant new features of DirectX 9.0 was support for floating-point processing and data formats, known as 4 × 32 float. Compared with the integer formats used in previous API versions, floating-point formats provided much higher precision, range, and flexibility.
The Xenos introduced the precursor to the third era of GPUs—the unified shader.
ATI/AMD incorporated it in its TeraScale microarchitecture for PCs in 2007.
An essential requirement for the Xbox 360 was supporting a 720p (progressive
scan), high definition (HD), 16:9 aspect ratio screen. That meant that the Xbox 360
needed a significant and reliable fill rate.
Red Ring of Death
Early in the life of the Xbox 360, a problem appeared: components in the system
were overheating. When this happened, temperature sensors in the system would
attempt to alert the user by flashing a ring of red around the power button. Usually,
the red ring was a death signal, and the system died. Gamers called it the Red Ring
of Death.
To mitigate the issue, Microsoft extended the Xbox 360's warranty to three years for all models released before the Xbox 360 S. (Microsoft's original Xbox 360 warranty period was 90 days from the date of purchase.)
The Xbox 360 launched in November 2005 with two configurations: the Xbox 360
(known as the 20 GB Pro or Premium) and the Xbox 360 Core. The Xbox 360 Arcade
later replaced the 360 Core in October 2007. Next came a 60 GB version of the Xbox
360 Pro on August 1, 2008. On August 28, 2009, Microsoft discontinued the Pro
package.
There were significant hardware revisions: the Xbox 360 S (also referred to as the Slim) replaced the Elite and Arcade models in 2010. All told, across the lifetime of the Xbox 360—from late 2005 to early 2016—the company introduced nine different versions ranging in price from $400 down to $200.
Microsoft wanted to reduce the costs of the Xbox 360 and authorized Rune Jensen, Nick Baker, and Jeff Andrews to design an SoC using IP from ATI and IBM. It was a bold move because Microsoft did not have much depth in semiconductor design or manufacturing beyond the design team from WebTV. Considering the battles over who should produce the GPU for the first Xbox, this authorization was a surprising move. Nonetheless, the company moved forward with the plan, and its success exceeded all expectations. Baker and the team did such a good job that the updated APU (a die shrink from 65 to 45 nm) ran a lot faster; Microsoft had to slow it down to stay within the original specification range. Console manufacturers guarantee game developers a stable (i.e., unchanging) platform so all users get the same experience regardless of when they bought the console.
Months after launching the revised Xbox 360, the company revealed the console's integrated CPU–GPU (APU) configuration, illustrated in Fig. 4.11. The Slim 360 had a 45 nm SoC code-named Vejle. The chip had 372 million transistors, was 50% smaller, and drew 60% less power than the original CPU/GPU combination.
Integrating the significant components meant the console required fewer chips, heatsinks, and fans (see the chip layout in Fig. 4.12). Therefore, the console could use a smaller motherboard and power supply. Reducing power and heat also decreased the chance users would experience the Red Ring of Death.
IBM manufactured the SoC at its advanced 45 nm SOI fab in East Fishkill,
NY [16].
With all the processors integrated on a smaller process node, efficiency improved, and with the clocks turned down to save power, the chip ran even cooler. The Vejle SoC also used an FSB replacement block with the same bandwidth as the bus used by the stand-alone CPU and GPU, which kept the system from being faster than its
Fig. 4.12 Microsoft Xbox 360 SoC chip floor plan. Courtesy of Microsoft
predecessors [17]. Microsoft sold over 84 million Xbox 360 consoles throughout its
life from 2005 to 2016.
When Nintendo decided to develop the Wii, Satoru Iwata, president of Nintendo, told the engineers to avoid competing on graphics and power with Microsoft and Sony and to aim at a broader demographic of players with unique and novel gameplay. Legendary game developer Shigeru Miyamoto (Mario, The Legend of Zelda, Donkey Kong, Star Fox, and Pikmin) and Genyo Takeda led the development of the new console, code-named Revolution.
Initial Wii models included full backward compatibility with the GameCube. Later in its lifecycle, two lower-cost Wii models were produced: a revised model with the same design as the original Wii but without the GameCube compatibility features, and the Wii Mini, a compact, budget redesign that further removed features, including online connectivity and SD card storage.
4.6 Sony PlayStation 3 (2006)
The Sony PlayStation 3 ended up using an Nvidia GPU based on the company's Curie architecture, but that was not the original plan. Sony had planned to use a GPU design jointly developed by Sony and Toshiba, as the companies had done for the PlayStation 2. However, Sony was pushing too many variables at once. First, the company decided to abandon the 128-bit MIPS-based Emotion Engine processor used in the PlayStation 2 in favor of a newly developed multi-core IBM processor known as the Cell.
The Cell processor had an uncommon bus and instruction set. Therefore, the GPU
Sony picked had to match the Cell, and a unique software driver had to be written.
That was not impossible, but it was time-consuming and had a steep learning curve.
At the same time, Toshiba was experiencing difficulties with its GPU design.
Sony had a deadline based on established introduction dates for the new console,
and that date was rapidly approaching. Kutaragi and the senior management made
the difficult decision to cancel Toshiba’s GPU and went looking for a replacement.
They decided Nvidia had the best GPU and would perform well for the next five
to seven years, but could it be married to the Cell? Having lost the Xbox contract,
Nvidia was anxious to re-establish itself in the console market. Nvidia wanted to
be involved in all gaming production. Therefore, the company put aside its newest
GPU design project and accepted Sony’s challenge. Sony announced the partnership
in December 2004. People familiar with the Cell knew Nvidia was heading into a
challenge; Nvidia knew it too but realized there would be no gain without risk [18].
The switch from Toshiba to Nvidia, combined with other problems inherent in shifting to a new CPU, cost Sony time, and the company missed the desired introduction date. Because of the missed deadline, Microsoft beat Sony to market with the ATI-powered Xbox 360 by almost a year.
Nvidia took an existing design, the NV47, and adapted its I/O structure to match the IBM Cell processor and XDR memory system. It named this repurposed design the G70/G71, which later became known as the RSX Reality Synthesizer. The Cell had a unique OS and a somewhat primitive graphics API, PSGL (OpenGL ES 1.1 plus Nvidia Cg). Nvidia had to develop a companion chip to translate from the PCI structure to the Cell's FlexIO bus, learn a new OS and API, and write a driver that would work in that environment. It was not a simple "Hello World" software hack; it had to be a high-performance driver that exposed everything possible in the GPU to potential apps—games. Somehow, Nvidia and Sony did it. The PS3 was a very popular console, and Sony sold over 90 million units—more than Microsoft's 88 million, despite the Xbox 360's year-earlier introduction.
Sony built the semi-custom, 300-million-transistor RSX GPU on an eight-layer 90 nm process in its own fab. The combination of the RSX GPU and the Cell CPU required almost 600 million transistors—the most for any console to date.
Dr. Lisa Su was a senior engineer on the IBM Cell project in 2006 and worked with Nvidia and Jensen Huang. Ironically, she would take over AMD eight years later, becoming Huang and Nvidia's biggest competitor and Sony's processor supplier for the PlayStation (Fig. 4.14).
The RSX GPU ran at 550 MHz and had 256 MB of local GDDR3 memory clocked at 700 MHz, with separate vertex and pixel shader pipelines. There were 24 parallel pixel shader ALU pipes, each performing five ALU operations per cycle, and eight parallel vertex pipelines running at 500 MHz.
The GPU could process 136 shader operations per cycle compared to the 53
shader ops per cycle processed by Nvidia’s GeForce 6 GPU. The RSX was clearly
more powerful than even two GeForce 6 GPUs in shader performance, and it only
drew 80 W. At Sony's pre-show press conference at E3 2005, Nvidia founder and CEO Jensen Huang said the RSX was twice as powerful as the GeForce 6800 Ultra [19].
Fig. 4.14 IBM technologist Dr. Lisa Su holds the new Cell microprocessor. The processor was jointly developed by IBM, Sony, and Toshiba. IBM claimed the Cell provided vastly improved graphics and visualization capabilities, in many cases 10 times the performance of PC processors. Courtesy of Business Wire
The GPU had eight vertex shaders (10 ops per clock), making it capable of 40 GFLOPS of vertex floating-point operations when running at 500 MHz. The RSX also had 24 TMUs.
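The RSX figures above are self-consistent and worth cross-checking. Assuming each vertex pipeline contributes two of the per-cycle shader operations (the split implied by the totals; the text does not state it explicitly), the arithmetic works out:

```python
# Cross-checking the RSX throughput figures quoted in the text.

PIXEL_PIPES, OPS_PER_PIXEL_PIPE = 24, 5    # 24 pixel pipes, 5 ALU ops each
VERTEX_PIPES, OPS_PER_VERTEX_PIPE = 8, 2   # assumed 2 ops per vertex pipe

shader_ops_per_cycle = (PIXEL_PIPES * OPS_PER_PIXEL_PIPE
                        + VERTEX_PIPES * OPS_PER_VERTEX_PIPE)
assert shader_ops_per_cycle == 136  # the "136 shader operations per cycle"

# Vertex throughput: 8 pipelines x 10 FLOPS per clock x 500 MHz clock.
vertex_gflops = 8 * 10 * 500e6 / 1e9
assert vertex_gflops == 40.0  # the "40 GFLOPS" vertex figure
```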
The RSX GPU offered dual-screen output, with a resolution of up to 1080p (1920 × 1080) for both screens. Its 128-bit pixel precision enabled scenes with high-dynamic-range (HDR) rendering.
Sony discontinued the PS3 in June 2017, and it was the last PlayStation Nvidia
would help create. At that time, it looked like Nvidia was out of the game console
business.
4.7 Nintendo 3DS (June 2011)
In 2011, Nintendo introduced the 3DS, a 3D handheld gaming system that did not require 3D glasses.
The 3DS's most distinguishing and innovative feature (shown in Fig. 4.15) was its 3D top screen, which used a clever feat of engineering to achieve the illusion of depth.
Sharp made the 3D display, which measured 3.5 in. diagonally with an 800 × 240 resolution (400 × 240 per eye). The display used an integrated LCD-based parallax barrier panel sandwiched to the back of the color LCD, which rapidly alternated between left and right images.
On one side of the glass was a conventional color TFT panel, whereas the other
side had a monochrome LCD element. The monochrome LCD parallax barrier in the
back acted as a gate that allowed or denied light to pass through some screen regions.
Switching the gate in the correct patterns at a high frequency created the illusion of
3D depth.
Sharp also made the lower, secondary display that doubled as a touch screen.
The significant subsystem in the 3DS was the applications processor (AP); it cost approximately $10 and was manufactured by Sharp using TSMC's fab. The GPU in the AP was DMP's 286 MHz PICA200. It could drive 800 megapixels per second at 200 MHz. It also offered hardware transformation of 40 million triangles per second at 100 MHz and vertex lighting performance of 15.3 million polygons per second at 200 MHz. The frame buffer was 4095 × 4095 pixels.
The GPU had DMP’s Maestro-2G technology with per-pixel lighting, fake
subsurface scattering, procedural texture, refraction mapping, subdivision primitive,
shadow, and gaseous object rendering. Refer to the block diagram in Fig. 4.16.
The 3DS had 128 Mbytes of RAM, eight times the 16 Mbytes in the DSi.
The 3DS subsystem also had a microelectromechanical system gyroscope and
an accelerometer, which allowed the game system to operate using motion-sensitive
control.
The tiny but mighty gaming machine had three cameras to deliver 3D photography.
The stereo vision system employed two parallel VGA cameras in a module and a third
VGA camera. Users could record and view moments from sporting events, birthday
parties, and holiday events or create original 3D productions to show others.
Because the 3DS had larger, more sophisticated displays and a higher-performance application processor than the DSi, and added new features such as a gyroscope and an accelerometer, it needed a higher-capacity battery. The original DSi battery was an 840 milliampere-hour (mAh) model; the battery in the 3DS was 1300 mAh.
On March 27, 2011, the 3DS went on sale in the United States for $249.99, and
15 games were made available for purchase on the same day. Nintendo slashed the
price to $169.99 in July because the reported sales had not met expectations.
Nintendo was caught by surprise when the industry shifted toward casual
smartphone gaming, and the iPad became one of the fastest-growing gaming
platforms.
Nintendo sought to keep its loyal customers, so it offered 20 free downloadable
games to any 3DS owner who purchased the console at the original price.
Nintendo discontinued the 3DS in September 2020 after selling over 76 million
units.
4.8 Sony PS Vita (December 2011)
Sony unveiled the Vita at the E3 Expo in June 2011. It was a follow-up to its PlayStation Portable (PSP) and a competitor to the Nintendo 3DS. Although Nintendo and
Sony still held a large share of the portable gaming market, smartphones were rising
in popularity as handheld gaming platforms.
Sony equipped the PlayStation Vita with a 5-in. OLED multi-touch screen that
could display 960 × 544 pixels with 16 million colors. The Vita also had an
innovative, touch-sensitive back panel expected to usher in new gameplay, but it
didn’t.
The PlayStation Vita had all the features of modern luxury smartphones, including
a GPS device and a three-axis gyroscope, accelerometer, and electronic compass.
Sony even offered two different models: a version with 3G and Wi-Fi and a Wi-Fi-
only model. The PS Vita is shown in Fig. 4.17.
The PlayStation Vita was the first handheld device to use a custom, quad-core
ARM processor. Sony hoped the processor would differentiate Vita from other
handheld gaming consoles and gaming experiences offered on tablets and high-end
smartphones.
The PS Vita’s increased performance came from the highly integrated Sony
CXD5315GG processor composed of a quad-core ARM Cortex-A9 device with an
embedded Imagination SGX543MP4+ quad-core GPU. Sony used IBM and Toshiba
to manufacture the SoC.
The GPU had four pixel shaders, two vertex shaders, eight TMUs, and four ROPS. The bus bandwidth was an impressive 5.3 GB per second. The GPU had a pixel fill rate of 664 million pixels per second and could process a maximum of 33 million polygons per second (T&L). Figure 4.18 shows a block diagram of the design.
The GPU supported texture compression, hardware clipping, morphing, and hardware tessellation, and it could evaluate Bézier and B-spline (NURBS) surfaces—overkill for a handheld game machine.
In 2012, Dick James of Chipworks took X-rays of the chip and discovered it was a five-die stack. The processor die sat at the base, facing a Samsung 1 Gb wide-I/O SDRAM, and the top three dies were two Samsung 2 Gb mobile DDR2 SDRAMs separated by a spacer die. The base die was ~250 µm thick, and the others had thicknesses of ~100–120 µm [20].
Toshiba also supplied the device with multi-chip memory. Qualcomm provided
the Vita with an MDM6200 HSPA+ GSM modem.
The PS Vita went on sale worldwide in 2012, and Sony sold over seven million units before discontinuing the device in March 2019.
4.9 Eighth-Generation Consoles (2012)
Beginning with the eighth generation of consoles, the three leading suppliers
(Microsoft, Nintendo, and Sony) selected AMD for their graphics. Nintendo used an IBM PowerPC (the tri-core Espresso) for the CPU, whereas Microsoft and Sony chose AMD's integrated CPU–GPU APU chips. Because the APUs in the Microsoft Xbox One and the Sony PlayStation 4 were relatively similar, they are examined together in the following section. The same convention is followed for the section discussing the
ninth-generation consoles too.
The APUs used in the eighth-generation consoles from Microsoft and Sony contained 20 GCN GPU compute units (two of which were reserved for redundancy to improve manufacturing yield). As shown in Fig. 4.19, the GPU cores dominated the chip's die area (the big bluish glob in the middle; memory is to its right).
Manufactured using TSMC's 28 nm process, AMD's APU and the Durango GPU used the 2013 GCN 1.0 architecture. It was not a large chip, with a die size of 363 mm² (approximately three-quarters of an inch on a side). The APU had five billion transistors and was compatible with DirectX 11.2 (Feature Level 11.0). It included 768 shading units, 48 texture-mapping units, and 16 ROPS. For GPU compute applications, the Durango GPU could use OpenCL version 1.2. A block diagram of the chip is shown in Fig. 4.20.
The AMD APU, shown in Fig. 4.21, replaced at least three chips from the previous
generation. That replacement reduced costs, increased performance, decreased power
consumption, and gave developers a platform they recognized from the PC—an x86
CPU and a DirectX GPU.
The GPU had 18 operational shader units and could produce a theoretical peak
performance of 1.84 TFLOPS.
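The 1.84 TFLOPS figure is the standard GCN peak-rate product: 18 compute units, 64 shader lanes per compute unit (a GCN architectural constant, not stated in the text), 2 FLOPS per lane per clock (a fused multiply-add), and the 800 MHz GPU clock. A small sketch with a function name of our own choosing:

```python
# Peak single-precision throughput for a GCN GPU, as a product of
# compute units, lanes per CU, FLOPS per lane per clock, and clock rate.

def gcn_peak_tflops(compute_units: int, clock_hz: float,
                    lanes_per_cu: int = 64, flops_per_lane: int = 2) -> float:
    """Theoretical peak TFLOPS for a GCN-architecture GPU."""
    return compute_units * lanes_per_cu * flops_per_lane * clock_hz / 1e12

print(gcn_peak_tflops(18, 800e6))  # 1.8432 -> the "1.84 TFLOPS" in the text
```

The same formula with 12 active compute units and the Xbox One's 853 MHz clock reproduces that console's commonly cited ~1.31 TFLOPS.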
Fig. 4.21 Tiny but mighty, AMD’s Jaguar-based APU powered the most popular eighth-generation
game consoles. Courtesy of AMD
Nintendo developed the Wii U video game console as the successor to the Wii and released it to the public in late 2012. The Wii U was the first eighth-generation game console and would compete with Microsoft's Xbox One and Sony's PlayStation 4.
The system used a 1.24 GHz tri-core IBM PowerPC Espresso CPU and had 2 GB of DDR3 RAM plus internal flash memory (8 GB in the Basic Set, 32 GB in the Deluxe Set).
Based on AMD's TeraScale 2 architecture, the Wii U's GPU was a 550 MHz Radeon-derived unified shader design (code-named Latte). It had 160 shaders, 16 TMUs, and eight ROPS, and its memory ran at 800 MHz.
The GPU had 32 MB of high-bandwidth eDRAM and supported 720p with 4x MSAA (or 1080p), enabling the GPU to render a frame in a single pass, with output via HDMI and component video. Nintendo had the chip fabricated in 40 nm at Renesas. It had 880 million transistors and a small 146 mm² die.
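A rough buffer-size estimate shows why 32 MB of eDRAM permits single-pass 720p rendering with 4x MSAA: the whole multisampled color-plus-depth buffer fits on-chip. The 4-byte color and 4-byte depth/stencil per sample below are our assumed formats; actual render-target formats vary.

```python
# Rough on-chip framebuffer footprint, assuming 4 B color + 4 B
# depth/stencil per sample (our assumption, not a stated Wii U format).

def framebuffer_bytes(width: int, height: int, msaa: int,
                      bytes_per_sample: int = 4 + 4) -> int:
    """Total bytes for a multisampled color + depth framebuffer."""
    return width * height * msaa * bytes_per_sample

MIB = 1024 * 1024
print(framebuffer_bytes(1280, 720, 4) / MIB)   # 28.125 MiB -- fits in 32 MB
print(framebuffer_bytes(1920, 1080, 1) / MIB)  # ~15.8 MiB -- 1080p also fits
```

By contrast, the Xbox 360's 10 MB of eDRAM could not hold a full 720p 4x MSAA target under the same assumptions, which is why large on-chip buffers mattered here.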
Although the GPU was compatible with DirectX, the Wii U used a proprietary OS code-named Café and a proprietary 3D graphics API, GX2. With AMD's support, Nintendo designed the API to be as efficient as the GX API used in the Nintendo GameCube and Wii systems. The API adopted features from OpenGL and AMD's r7xx-series GPUs. Nintendo referred to the Wii U's graphics processor as GPU7.
In January 2017, Nintendo discontinued production of the Wii U because the console was not considered a success. As of December 31, 2016, the company had sold only 13.56 million Wii U units, which paled in comparison to the over 101 million units of the first Wii console. Although this appeared to be a disappointment, matching the initial Wii's success was always going to be a challenge for the Wii U. The Wii U was more like a handheld than the multi-user Wii console. It had minimal third-party support, was underpowered relative to the competition, and did not have a hard drive.
When Sony introduced the PlayStation 4 (PS4), Microsoft launched the Xbox One; both consoles were based on a custom version of AMD's Jaguar APU.
Sony used an eight-core AMD x86-64 Jaguar 1.6 GHz CPU (2.13 GHz on PS4
Pro) APU with an 800 MHz (911 MHz on PS4 Pro) GCN (graphics core next) Radeon
GPU.
Microsoft used an eight-core 1.75 GHz APU (two quad-core Jaguar modules),
and the X model had a 2.3 GHz AMD eight-core APU. The Xbox One GPU ran at
853 MHz, the Xbox One S at 914 MHz, and the Xbox One X at 1.172 GHz, using
AMD Radeon GCN architecture.
The integrated GPU (iGPU) was the most popular device in terms of unit ship-
ments. It was cost-effective (free) and powerful enough (good enough) for most tasks
[21]. The iGPU had even found use in performance-demanding workstation market
applications.
The iGPU was the dominant GPU used in PCs. It was found in 100% of x86-based game consoles; 100% of tablets, smartphones, and Chromebooks; and about 60% of automobiles. As of 2020, over four billion iGPU units had been sold.
GPUs are incredibly complex devices, with hundreds of 32-bit floating-point processors (called shaders) built from millions of transistors. It was
because of Moore’s law that such density was realized. Every day one engages with
multiple GPUs: in one’s phone, PC, TV, car, watch, game console, and cloud account.
The world would not have progressed to its current technological level without the
venerable and ubiquitous GPU.
Nvidia introduced the Shield handheld game console in 2013 (Fig. 4.22). It was powered by Nvidia's latest Arm-based processor, the Tegra 4, which ran Google's Jelly Bean version of Android.
It was an appealing product and was irresistible to pick up because of its controller
appearance and flip-up screen, but was that enough to make it disruptive? Mobile
phones certainly existed before Apple introduced the iPhone, yet not many people
would dispute the iPhone was a transformational and disruptive product.
In Professor Clayton M. Christensen’s 1997 best-selling book, The Innovator’s
Dilemma, he separated new technology into sustainable technology and disruptive
technology. Disruptive technology often has novel performance problems, appeals
to a limited audience, and may lack a proven practical application.
The use of disruptive as a description of Nvidia's Shield or the iPhone is hardly adequate. It is doubtful the Shield would have any more performance problems than any other new product, and the same can be said of the iPhone. The Shield's game player audience was limited, but it still comprised some 30 million consumers as of 2015, and the number of gamers across all platforms approached 300 million. Therefore, scale must be considered when determining what should be considered limited. Moreover, the iPhone and the Shield certainly proved practical.
Announced at its February 2012 GPU Technology Conference in San Jose, Nvidia’s
Grid was an on-demand gaming service. Nvidia claimed it would provide advantages
over traditional console gaming systems and was an “any-device” gaming service.
According to Nvidia, it offered high-quality, low-latency gaming on the PC, Mac,
tablet, smartphone, and TV.
Nvidia combined its game development on an x86-based PC with a GeForce
graphics add-in board (AIB) and a non-x86 streaming device. This technique used
the non-x86 device as a thin client (TC) so that the TC could send commands and
display the streamed results. Except for the latency, a network function, the user’s
experience was as if the powerful AIB was in their local device. Nvidia schematized
the concept in the diagram shown in Fig. 4.23.
The diagram in Fig. 4.23 does not include the equipment that formed the green,
cloud-like Grid in the center.
That green cloud could be a generic server with an Nvidia GeForce AIB, an Nvidia
low-latency encoder, and fast frame buffer capture technology. According to Nvidia’s
calculations, Grid would deliver the same throughput and latency as a console.
The Grid could also connect to the Shield. Furthermore, the Shield could drive a
large screen TV display through its HDMI output. This feature made the Grid even
more similar to a console.
Was the Grid disruptive? No, it was not. Interesting? Yes. Clever? Yes. Well executed? Time would tell; Nvidia's track record suggested it would be, but ultimately it wasn't.
Nvidia’s charismatic president and founder, Jensen Huang, said he built the Shield
because no one else was making a mobile device like it. He told GamesBeat in an
interview:
We are not trying to build a console. We are trying to build an Android digital device in the
same way that Nexus 7 enjoys books, magazines, and movies. That is an Android device for
enjoying games. It is part of your collection of Android devices. The reason why I built this
device is because only we can build this device [22].
Because only Nvidia could build a device that ran exclusively on Nvidia Grid servers and the Nvidia Tegra processor, the company had established a proprietary walled garden. If other game delivery services such as Amazon, Google, Steam, or Origin adopted the Grid, it would be more universal. However, until other providers embraced Grid, the acceptance of the Shield would be less than ideal. The only other
option to expand the Grid’s popularity was for Nvidia to become a streaming game
service like Microsoft and Sony. That was a potential licensing nightmare and would
drain enormous resources from the company—Huang was unlikely to adopt such a
plan.
Assuming the content distribution question could be resolved, the Shield adoption
equation would become one of economics.
At the time, Huang indicated he was not interested in the hardware-subsidized
game console model, nor should he be since he did not own any content. Therefore,
the Shield had to sell for cost-of-goods (COG) plus a margin (Fig. 4.24).
With 5.3-in. and larger 1280 × 768 screens, smartphones would challenge the Nvidia Shield's 5-in., 1280 × 720 screen. Although Huang said the Shield would be "part of your collection of Android devices," a lightweight controller that attached to a smartphone would have been a better choice. Moreover, smartphones could drive an HDTV too.
Nvidia discontinued the original controller-style Shield handheld in 2015.
Disruptive to Nvidia’s Business?
By introducing the Shield, Nvidia put itself in competition with its customers who
used Nvidia products in their console products. Nvidia argued this was incorrect
because no one else made a device like the Shield. Sony, one of Nvidia’s customers
at the time, would have undoubtedly seen the competition for console and handheld
players that came with the release of the Shield. Nvidia argued that Sony offered Sony
handhelds and consoles games, whereas the Shield was designed as an Android game
player. The release of the Shield only enhanced the existing competition between
Sony and Android. Nvidia quietly withdrew the product after it introduced the Shield
tablet.
Starting in 1994 with the first CD-ROM-based console using a 32-bit MIPS CPU,
Sony has always been bold in its processor selection for the PlayStation. It followed
that selection with the 128-bit (MIPS-based) Emotion Engine PS2 and the first DVD-
based console in 2000. The PS3 came out in 2006 as the first machine with a Blu-
ray player and the new IBM-Power-based SIMD CPU structure called the Cell.
In 2013, Sony introduced the PS4 with an AMD x86-based architecture and an
embedded powerful GPU that resembled a PC. See Fig. 4.25 for Sony’s PlayStation
introductions and life span.
The common thread that connects these designs is the lack of backward compati-
bility. That is a brave step for a company to take, and Sony should have been credited
with being so courageous. But instead, they were criticized for it. Why? You don’t
need backward compatibility—you never did. If you have old games, you probably
also have an old machine, so use it. Who wants to play clunky old games with the
availability of new, high-quality performance games?
Rumors circulated in May 2011 that Sony chose Nvidia to create the processor
for the PS4. These rumors were inflamed when a hiring requisition at Sony (SCEA)
became public and revealed that Sony was looking for someone with (Nvidia) CUDA
experience. That could have simply been related to the game development of the PS3,
which had a 136 shader G70 GPU (code named RSX). Regardless, the information
encouraged people to speculate.
The rumors and speculation ended in January 2012, when Forbes reported AMD
and Sony were in discussions for the PlayStation 4 [23].
The PlayStation 4 used an AMD Jaguar-based APU. The system had 8 GB of
GDDR5 unified memory, 16 times the amount found in the PS3. The PS4’s GDDR5
memory could run up to 2.75 GHz (5500 MT per second) and had a maximum
memory bandwidth of 176 GB per second.
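Those memory numbers are mutually consistent: GDDR5 transfers data twice per clock, so a 2.75 GHz memory clock gives 5500 MT/s, and across the PS4's 256-bit bus (per Table 4.4) that yields the quoted 176 GB per second. A quick check of the arithmetic:

```python
# GDDR5 effective bandwidth = transfer rate x bus width in bytes.
mem_clock_ghz = 2.75
transfers_per_s = mem_clock_ghz * 2 * 1e9   # DDR: two transfers per clock -> 5500 MT/s
bus_width_bytes = 256 // 8                  # 256-bit bus = 32 bytes per transfer
bandwidth_gb_s = transfers_per_s * bus_width_bytes / 1e9
print(bandwidth_gb_s)  # -> 176.0 GB/s
```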
Fig. 4.25 Sony’s eighth-generation PlayStation 4 with controller changed the design rules for
consoles. Courtesy of Sony Computer Entertainment
The PlayStation 4 Pro added 1 GB of conventional memory to handle nongaming tasks such as multitasking and background apps, which freed roughly half a gigabyte of extra RAM for game developers.
The console also contained an audio module offering in-game chat rooms and
audio streams. And all PlayStation 4 models would have high dynamic range color
profiles.
Leveraging the game streaming technology acquired from Gaikai in July 2012 (for $380 million), the PS4 allowed games to be downloaded and updated in the
background or standby mode. The system also enabled digital titles to be playable
while downloading.
Vita owners could also play PS4 games remotely. Sony saw the PS Vita as the
companion device for the PS4. Sony had long-term visions to make most PS4 games
playable on PS Vita and transferrable over Wi-Fi.
A new application called the PlayStation App enabled mobile devices like iPhones, iPads, and Android-based smartphones and tablets to become second screens for the PS4. However, this ability depended on the version of the OS that users had; without the correct OS version, they could not use the PlayStation App.
An AMD Jaguar APU also powered the Xbox One. It used 8 GB of DDR3 RAM,
with a memory bandwidth of 68.3 GB per second instead of the GDDR5 Sony used
in the PS4.
The Xbox One memory subsystem also had 32 MB of embedded static RAM
(ESRAM). It had a memory bandwidth of 109 GB per second. The ESRAM was
capable of a theoretical memory bandwidth of 192 GB per second and a memory
bandwidth of 133 GB per second while using alpha transparency blending. A block
diagram of the Xbox One can be seen in Fig. 4.26.
The Xbox One was Microsoft’s successor to the 2005 ATI Xenos GPU-powered
Xbox 360, introduced eight years earlier. Because of the gaming PC’s much shorter
product cycle, its primary competition (consoles) had to remain ahead of the curve.
The sheer magnitude of the SoC created by Microsoft demonstrated its dedication
to this goal.
With 47 MB of on-chip SRAM storage and five billion transistors, the console's main SoC housed the bulk of the Xbox One's functionality. AMD co-engineered the Xbox
One SoC with Microsoft; its block diagram is shown in Fig. 4.27.
Xbox One processing used an AMD eight-core x86 CPU, a Graphics Core Next
GPU, and 15 special-purpose processors.
As illustrated in Fig. 4.27, the custom APU saved CPU and GPU cycles
by offloading well-known processing tasks to dedicated hardwired engines and
DSPs. The offloaded processing tasks were typically things like video encoding
and decoding, audio, and display. Several swizzle engines handled unaligned image
copies in memory.
All internal processors had access to shared, unified memory via host/guest
MMUs. That allowed low overhead coprocessing by the CPU and the GPU (and
other priority agents). Shared unified memory let multiple processors work together
to pass pointers to data structures. The alternative would have been making each
processor copy the structure itself. When large memory transfers were required,
the SoC’s memory subsystem was up to the challenge; it offered an impressive
200 GB per second of realizable bandwidth. Only 30–68 GB per second of that
bandwidth (depending on whether coherency enforcement was used) came from
external DRAM, whereas the majority (204 GB per second peak) came from the 32 MB of embedded SRAM.
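The pointer-passing idea, where multiple agents read and write one shared buffer instead of exchanging copies, can be illustrated loosely in software. The sketch below uses Python's multiprocessing.shared_memory module as an analogy only; the Xbox One did this in hardware through its host/guest MMUs.

```python
# Loose software analogy for shared unified memory: two "agents" attach
# to the same buffer by name and see each other's writes with no copying.
from multiprocessing import shared_memory

producer = shared_memory.SharedMemory(create=True, size=16)
producer.buf[:4] = b"ping"              # agent 1 writes in place

consumer = shared_memory.SharedMemory(name=producer.name)  # attach by name, no copy
data = bytes(consumer.buf[:4])          # agent 2 reads the same bytes
print(data)                             # -> b'ping'

consumer.close()
producer.close()
producer.unlink()                       # free the shared segment
```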
Reacting to the motion-sensing capabilities of the Wii and PlayStation, Microsoft introduced its Kinect image sensing system as an accessory to the Xbox. It was entirely CPU-based and did not use any GPU resources for capture.
In the mid-2010s, Nvidia steadily invested in expanding its Shield product line and
Grid (GPU) server system capabilities. The original product, the Shield Portable,
was a game controller with an attached screen. This product was Nvidia’s first taste
of the consumer electronics business. The next product released was the Shield Tablet, which Nvidia shared with its partners (e.g., EVGA).
At the 2015 GDC in San Francisco, Nvidia unveiled its latest Shield product: the
Shield Console. It was an Android TV-like device with a game controller and an
optional remote-control stick.
Nvidia also set up a game store with over 50 titles, and users could sign up for a subscription. For easy comparison, Nvidia's Shield console was similar to Amazon's Fire TV.
However, Nvidia had been shaving milliseconds since it brought out the first Shield
Portable. The difference between Nvidia’s Shield console and all other Android
TV/consoles was the response time, latency, and throughput; Nvidia’s was faster
because Nvidia controlled both ends of the pipe—the server and the client.
The company worked closely with game developers and reported several new
games. These games ran well at 4K and even better at HD.
The cabinet was attractive, with sharp diagonal lines and multi-reflective surfaces.
Measuring 8 × 5 in. and 1 in. thick, the Shield was roughly the size of a thin book
(Fig. 4.28).
The system sold for $199 and came with the Shield processor cabinet (see
Table 4.2), a stand, a game controller, and a power supply. The optional TV controller
was $30. The cabinet stand had an amazingly sticky nano surface on the bottom; once
put on a flat surface such as a table, it stayed put. The cabinet slipped neatly and
firmly into the stand.
In the demonstration at Nvidia's facilities in Santa Clara, the system was surprisingly responsive, even though the server was in Seattle.
The GPU had 256 shader cores (2 SMMs) and ran at 1000 MHz (see Fig. 4.29).
The memory interface offered a maximum bandwidth of 25.6 GB per second (2x 32-bit LPDDR4-3200).
The product marked a significant step for Nvidia that it had been building toward
for decades. With the Shield’s release, Nvidia became a full-service consumer elec-
tronics company. For several years, Nvidia had been saying, “We are not a semi-
conductor company.” Their Grid, Tesla, Quadro, and other products are ample proof
of this claim. Yes, they sold semiconductors, and they also sold components (like
automotive subsystems). The main difference between Apple and Nvidia (other than
size) was Nvidia’s use of a common OS. Nvidia saw this as an advantage because it
could leverage all app development work, the APIs, and the OS without carrying all
the expense.
Polymega was a modular multi-system game console that ran original game cartridges
and CDs for classic game consoles on the user’s HDTV. Developed by LA-based
Playmaji, Polymega was initially named Retroblox when first announced in 2017.
The company offered several modules that accepted cartridges from almost all of the old consoles, while the Base Unit provided CD compatibility. Each module plugged into the base platform (Fig. 4.30).
Polymega had one of the widest ranges of emulation systems on the market. The system used a 2.9 GHz Intel Coffee Lake-S Pentium CPU with UHD Graphics 610. It had 2 GB of DDR memory and 32 GB of NanoSSD storage, expandable with an SSD or a microSD card. It had a Realtek Wi-Fi and Bluetooth combo module, HDMI 1.4, Gigabit Ethernet, 2x USB 2.0, and a Polymega Expansion Bus. The system ran a proprietary Linux-based OS.
The integrated UHD Graphics 610 (GT1) had appeared in processors since the Whiskey Lake generation. The GT1 version of the Skylake-era GPU had 12 execution units (EUs), which could clock up to 950 MHz. The UHD 610's architecture shared memory with the CPU (2x 64-bit DDR3L-1600/DDR4-2133).
The video engine supported H.265/HEVC Main10 profiles in hardware with 10-bit
color. Google’s VP9 codec could be hardware decoded. The Pentium chips supported
HDCP 2.2 and Netflix 4K. HDMI 2.0, however, could only be supported with an
external converter chip (LSPCon).
Fig. 4.30 Artist’s rendition of the Polymega system. The final version was a dark, flat gray. Courtesy
of Polymega
The system’s games were limited in color and resolution, so the processor (an
iGPU) was more than adequate. The system’s magic was in the emulations that Play-
maji developed and the physical hardware interfaces for cartridges and controllers.
The system shipped to consumers in September 2021.
Two years after Nvidia withdrew from the mobile market, in March 2017, Nintendo released the Switch, shown in Figs. 4.31 and 4.32. The Nintendo Switch was a portable game console that used a Tegra X1 processor, making it essentially a repackaged Nvidia Tegra tablet. Tegra processors are covered in the Nvidia Shield section, and the 256-shader-core Tegra GPU is shown in Fig. 4.29.
The Switch had two innovative controllers that attached to the tablet's sides for game control. Alternatively, the tablet could be placed in a desk stand, and the controllers could be detached and used independently as handheld devices without being directly attached to the Switch.
A timeline of Nintendo's consoles showed the Nintendo 64 (1996, fifth generation), GameCube (2001), Wii (2006, seventh generation), Wii U (2012, eighth generation), Switch (2017, eighth generation), and Switch OLED (2021).
Atari withdrew from the console market in 1996 when the Atari Jaguar CD game
console failed to reach its sales goals (see Section 3: The First GPU and What It
Led to on Other Platforms). In 1998, the company liquidated, and Hasbro Interactive
acquired Atari’s intellectual property and brand [25].
In January 2001, Infogrames Entertainment SA acquired Hasbro Interactive for $100 million and renamed it Infogrames Interactive [26]. In May 2003, Infogrames renamed itself Atari SA and its Interactive subsidiary Atari Interactive, which has licensed the Atari consoles produced since 2004.
At the E3 gaming conference in June 2017, Atari’s CEO, Fred Chesnais, told
GamesBeat that the company was working on a new console code named AtariBox.
The hearts of older gamers began to race. Courtesy of Atari, rumors about the
AtariBox and dark model pictures appeared in the news.
Feargal Mac Conuladh (Mac Conuladh), who joined Atari in September 2017 as
the general manager, was inspired by the concept of the Atari VCS (code named
AtariBox). Conuladh saw that Atari’s game catalog had strong brand recognition,
and the console would be a perfect vehicle to exploit that famous catalog.
Conuladh said in a 2017 interview that seeing gamers connect their laptops to
televisions to play games on a larger screen inspired him to develop the unit [27]
(Fig. 4.34).
Conuladh’s goal for the AtariBox was to satisfy the nostalgia for the former Atari
consoles and indie games without needing a PC. Following the lead of both Microsoft
and Sony, Conuladh said Atari SA would use a custom version of AMD’s APU (code
named Kestrel).
Fig. 4.34 Feargal Mac (left) of Atari and former Microsoft games executive Ed Fries. Courtesy of
Dean Takahashi
After announcing the AtariBox at the 2017 E3, the Atari management then went to
San Francisco during the 2018 GDC to show a mockup to a few reporters (Fig. 4.35).
The company claimed that the Atari VCS platform would provide 4K resolution,
HDR, 60 FPS content, onboard and expandable storage options, dual-band Wi-Fi,
Bluetooth 5.0, and USB 3.0.
Atari said the VCS would include the Atari Vault, with more than 100 classic
games, such as arcade and home entertainment favorites like Asteroids, Centipede,
Breakout, Missile Command, Gravitar, and Yars’ Revenge.
Designed to be reminiscent of the 1977 Atari 2600, which was called the Atari Video Computer System or VCS, the new VCS was introduced in a YouTube teaser in 2017. Some thought Atari's 2021 summer debut of the VCS console was about four years late; others thought the console should have been released more than four decades ago. The company launched a successful crowdfunding campaign that promised delivery in the summer of 2019. Several press releases and news leaks appeared in 2019 and 2020, but the product did not launch. The VCS unit was finally released in late 2021.
Atari VCS Specifications (Table 4.3).
Was there room for the Atari VCS in a society obsessed with smartphones,
streaming, and powerful gaming PCs? Such retro enthusiasm had been seen before,
with limited success. Other nostalgia gaming devices have been attempted, but it
was not easy for nostalgic games to compete with the modern games available for
PlayStation and Xbox. So, the answer is no, or maybe. Nintendo has proven several
times that millions of loyal and enthusiastic gamers still love classic games and
characters.
In 2015, China lifted a 14-year ban on console gaming. During those 14 years, with
no gaming consoles, the gaming scene in China became much more PC-focused. But
there was also an active black market for consoles in China. When China lifted the
ban, Microsoft’s Xbox One was the first console to be offered. Sony was next and
ended up being the preferred console. But China required both companies to manu-
facture some of the consoles in China. And there was a desire from the government
for a Chinese-made console.
In 2013, Xiaobawang and Ali Group formed a strategic cooperation, announcing a home game console equipped with Ali TVOS and a variety of Xiaobawang games on the Ali TVOS platform.
In 2016, Xiaobawang reached an agreement with AMD for a customized VR console chip, becoming the fourth company, after Microsoft, Sony, and Nintendo, and the first in China to have a high-end custom console chip. Xiaobawang invested about $60 million with AMD to develop a custom processor for its upcoming console and PC hybrid system.
In a virtual presentation called "The Road to PS5," Sony previewed the main features of the future console. The new console would have three major features: a high-speed drive, a GPU with ray tracing, and 3D spatial audio [28] (Fig. 4.37).
The GPU had 36 compute units (CUs) consisting of 2304 shaders. When the GPU ran at 2.23 GHz, it had a theoretical performance level of 10.3 TFLOPS.
The GPU had hardware-accelerated ray tracing capability, AMD's variable rate shading (VRS), and AMD's FidelityFX Super Resolution upscaling. The console came
with 16 GB of GDDR6 SDRAM with 448 GB per second peak bandwidth. It was
equipped with Bluetooth 5.1, 802.11ax (Wi-Fi 6), USB Type-A Hi-Speed, USB
type-C, and HDMI 2.1.
The PS5 GPU used AMD’s Primitive Shaders from RDNA 1; AMD introduced
primitive Shaders in June 2019.
The PS5 had backward compatibility features that were a function of an x86-
based APU. Sony looked at the top 100 PS4 titles as ranked by playtime and expected
almost all of them to be playable on PS5 when it launched. With more than 4000
games published on the PS4, the company said it would continue the testing process
and expand backward compatibility coverage over time (Fig. 4.38).
Although the Microsoft Xbox Series X (XSX) and the Sony PlayStation 5 (PS5)
both used a custom variant of the same basic APU from AMD, the two machines
had significantly different GPUs and mass storage systems (see Table 4.4).
A timeline of Sony's consoles showed the PS1 (1994, fifth generation), PS2 (2000, sixth), PS3 (2006, seventh), PS4 (2013, eighth), PS5 (2020, ninth), and a projected PS6 (tenth).
Table 4.4 Comparison of Sony PlayStation 5 and Microsoft Xbox Series X key specifications
Spec | PS5 | Xbox Series X
CPU | 8-core 3.5 GHz AMD Zen 2 | 8-core 3.8 GHz AMD Zen 2
GPU | 10.3 TFLOPS AMD RDNA 1 at 2.23 GHz | 12.0 TFLOPS AMD RDNA 2 at 1.825 GHz
Shaders | 2304 (36 CUs) | 3328 (52 CUs)
Resolution | Up to 8K | Up to 8K
Frame rate | Up to 120 FPS | Up to 120 FPS
RAM | 16 GB GDDR6, 256-bit bus, 448 GB/s | 16 GB GDDR6, 320-bit bus, 560 GB/s
Storage | 825 GB custom SSD, 5.5 GB/s (8–9 GB/s compressed) | 1 TB custom NVMe SSD, 2.4 GB/s (4.8 GB/s compressed)
Optical disc drive | 4K UHD Blu-ray (standard PS5 only) | 4K UHD Blu-ray
Backward compatibility | Almost all PS4 games, including optimized PS4 Pro titles | All Xbox One games; select Xbox 360 and original Xbox games
Price | $500 (PS5); $400 (PS5 Digital Edition) | $500
The SSD was a significant feature. Loading 1 GB of data took six to seven seconds, whereas reading the same data from an HDD took approximately 20 s. Games in 2020 were using 5–6 GB, which meant an SSD was a necessity for gamers; using one could cut load times dramatically.
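Using the chapter's own figures (roughly 20 s per GB from an HDD, and the PS5's 5.5 GB per second raw SSD rate from Table 4.4), the gap for a hypothetical 5 GB load is easy to quantify:

```python
# Rough load-time comparison using the figures quoted in the text.
game_data_gb = 5.0          # typical 2020 working set (5-6 GB)
hdd_secs_per_gb = 20.0      # ~20 s to read 1 GB from an HDD
ssd_gb_per_sec = 5.5        # PS5 raw SSD throughput

hdd_time = game_data_gb * hdd_secs_per_gb      # 100.0 s
ssd_time = game_data_gb / ssd_gb_per_sec       # ~0.91 s
print(hdd_time, round(ssd_time, 2))            # -> 100.0 0.91
```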
Rather than a larger GPU with more shaders, Sony opted for a smaller, higher-clocked GPU and a faster SSD. However, the PS5 could not use INT8 or INT4 instructions for ML; it could only use FP16 for basic AI.
Even though it used primitive shaders, the PS5 ran Epic’s extraordinary Lumen
in the Land of Nanite demo (discussed in Book one).
The GPU’s compute units included an intersection engine for ray tracing, which
calculated the intersection of rays with boxes and triangles. To use the engine, a BVH
acceleration structure needed to be built. BVH was a popular ray tracing acceleration
technique that used a bounding volume hierarchy. The technique used a tree-based
acceleration structure that contained multiple hierarchically arranged bounding boxes
(bounding volumes) that encompass different amounts of scene geometry.
The shader program used a new instruction inside the compute unit that checked
the intersection engine against the bounding volume hierarchy.
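As an illustration of what the intersection engine accelerates, the sketch below implements the classic slab test for a ray against an axis-aligned bounding box and a recursive BVH walk that skips subtrees the ray misses. This is illustrative Python only, with an invented node layout; it is not Sony's or AMD's implementation.

```python
# Minimal BVH traversal sketch: each node has an axis-aligned bounding
# box (lo, hi) and either children or a list of triangle ids.
def ray_hits_box(origin, inv_dir, lo, hi):
    """Slab test: does the ray enter all three axis slabs before leaving any?"""
    tmin, tmax = 0.0, float("inf")
    for axis in range(3):
        t1 = (lo[axis] - origin[axis]) * inv_dir[axis]
        t2 = (hi[axis] - origin[axis]) * inv_dir[axis]
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

def traverse(node, origin, inv_dir, hits):
    """Descend the hierarchy, pruning subtrees whose boxes the ray misses."""
    if not ray_hits_box(origin, inv_dir, node["lo"], node["hi"]):
        return
    if "tris" in node:                       # leaf: candidate triangles
        hits.extend(node["tris"])
    else:
        for child in node["children"]:
            traverse(child, origin, inv_dir, hits)

# Tiny two-leaf BVH: one box in front of the ray, one far off to the side.
bvh = {"lo": (-2, -2, -2), "hi": (10, 10, 10), "children": [
    {"lo": (-1, -1, 1), "hi": (1, 1, 3), "tris": ["near"]},
    {"lo": (8, 8, 8), "hi": (10, 10, 10), "tris": ["far"]},
]}
# Ray along +z; 1e9 stands in for the (infinite) inverse of the zero components.
origin, inv_dir = (0.0, 0.0, 0.0), (1e9, 1e9, 1.0)
hits = []
traverse(bvh, origin, inv_dir, hits)
print(hits)  # -> ['near']
```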
A PS5 compute unit (CU) was not the same as a CU in the PS4. Sony chose
fewer compute units at a higher frequency rather than more compute units at lower
frequencies. To demonstrate the trade-off, 36 compute units running at 1 GHz resulted
in the same 4.6 TFLOPS as 48 compute units running at 0.75 GHz. Moreover, using
a higher clock allowed everything else to run faster. It was easier to use 36 compute
units in parallel than to use 48 simultaneously. It was harder to fill all available CUs
with useful work when triangles were small.
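The trade-off is easy to verify. With GCN/RDNA-style compute units of 64 shaders each, and each shader retiring a fused multiply-add (two FLOPs) per clock, 36 CUs at 1 GHz and 48 CUs at 0.75 GHz produce identical peak throughput (the shaders-per-CU count is an architectural detail assumed here, not stated in the text):

```python
# Peak FP32 = CUs x 64 shaders x 2 FLOPs (multiply-add) x clock (GHz) / 1000 -> TFLOPS
def tflops(cus, clock_ghz):
    return cus * 64 * 2 * clock_ghz / 1000.0

print(tflops(36, 1.0))    # -> 4.608  (36 CUs at 1.00 GHz)
print(tflops(48, 0.75))   # -> 4.608  (48 CUs at 0.75 GHz)
print(tflops(36, 2.23))   # -> ~10.3  (the PS5's shipping configuration)
```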
The CPU supported 256-bit native instructions, and the GPU's 36 compute units (i.e., 2304 shaders, or the equivalent of 58 PS4 compute units) employed variable frequencies for a continuous boost. The clock increased until the system reached its cooling solution's capacity; therefore, the system ran at constant power, and the
frequency varied according to the load. Rather than looking only at chip temperature,
Sony also noted the CPU and GPU’s actual activities and set the frequency accord-
ingly. To maximize efficiency, Sony used AMD’s Smartshift technology to send any
unused power from the CPU to the GPU to squeeze out a few more pixels. The CPU
could run as high as 3.5 GHz. The GPU frequency was capped at 2.23 GHz to allow
the on-chip logic to run correctly. With 36 compute units running at 2.23 GHz, the
system produced 10.3 TFLOPS.
The Xbox Series X had a custom 7 nm AMD APU with an iGPU based on AMD’s
RDNA 2 architecture (see block diagram in Fig. 4.39). The iGPU had 56 compute
units with 3584 cores, but only 52 CUs and 3328 cores were enabled and ran at the
fixed 1.825 GHz rate. The iGPU was capable of 12 TFLOPS. Series X and S had
completely programmable front ends (Mesh Shaders, VRS, SFS).
The GPU had ray tracing capability through a new processor element called an intersection engine in each compute unit.
The Xbox S had AMD’s FidelityFX Super Resolution image upscaling tech-
nology, which enabled higher resolutions and frame rates.
The APU also had an eight-core Zen 2 CPU that ran at 3.8 GHz; however, the
CPU ran at 3.6 GHz when simultaneous multi-threading (SMT) was used. Microsoft
dedicated one CPU core to the operating system. Xbox Series X could use INT8 and
INT4 instructions, but the PS5 could not.
As the die photo in Fig. 4.40 reveals, the GPU occupied most of the silicon in the die.
The console had 16 GB of GDDR6 SDRAM, with 10 GB running at 560 GB per
second for the iGPU and 6 GB at 336 GB per second for other computing functions.
The custom APU could drive 4K TVs at up to 120 Hz and 8K TVs at up to 60 Hz. The Xbox Series X performance goal was to render games in 4K resolution at 60 frames per second. Microsoft planned to achieve this with a CPU roughly four times as powerful as the Xbox One X's and a GPU twice as powerful (Fig. 4.41).
Microsoft had two machines with different hardware capabilities to play the same
games. Table 4.5 compares the components of Microsoft’s fourth generation of Xbox
consoles.
The consoles were compatible with the new features of HDMI 2.1, including variable refresh rate (VRR) and auto low-latency mode (ALLM), which began appearing in televisions in 2019.
In 2021, AMD was the supplier for four of the five available game consoles: Atari
VCS, Microsoft’s Xbox Series X, Sony’s PS5, and Valve’s Steam Deck. Each of the
consoles was an x86-based game machine.
Valve’s Steam Deck was a handheld PC with an integrated 7-in. screen and game
controls instead of a keyboard. The software supported an external keyboard, which could be attached via Bluetooth or USB-C. As with Nintendo's Switch, an external display could be connected via the USB-C port.
The Steam Deck portable console was the first PC to use AMD's Van Gogh APU. The entry-level APU ran SteamOS 3.0. Van Gogh had a Zen 2 CPU with an RDNA
2 GPU, which promised over 2 TFLOPS of capability. This level of performance was
higher than the performance capabilities of the AMD-powered PlayStation 4 or the
Xbox One. The Steam Deck could run AMD’s FidelityFX Super Resolution. It also
had hardware support for DirectX 12 Ultimate features such as variable rate shading
(VRS) and acceleration for ray tracing. Lastly, the Steam Deck was also capable of
running Linux and using the Vulkan API.
Valve said the Steam Deck would run AAA games. That was a safe claim given
the APU that powered it. The Steam Deck’s display was limited to 1280 × 800 pixels,
with a 16:10 aspect ratio.
Specifications of the Steam Deck:
• Processor: AMD APU
• CPU: Zen 2, 4 cores/8 threads, 2.4–3.5 GHz (up to 448 GFLOPS FP32)
• GPU: 8 AMD RDNA 2 CUs, 1.0–1.6 GHz (up to 1.6 TFLOPS FP32)
• RAM: 16 GB LPDDR5 (5500 MT/s)
• Storage: 64 GB eMMC (PCIe Gen 2 x1), 256 GB NVMe SSD (PCIe Gen 3 x4), or 512 GB high-speed NVMe SSD (PCIe Gen 3 x4); all models included a high-speed microSD card slot
• Display: 7-in. diagonal LCD, 400 nits, 60 Hz refresh
• Power: 4–15 W
• Connectors: USB-C with DisplayPort 1.4, 45 W USB Type-C PD 3.0 charging, and a 3.5 mm combo audio jack
• Communications: Bluetooth 5.0, dual-band Wi-Fi.
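The FLOPS figures in the list can be cross-checked. A Zen 2 core's two 256-bit FMA units retire 32 FP32 operations per cycle, and each RDNA 2 CU's 64 shaders retire two FP32 operations per clock (architectural details assumed here, not stated in the list); at the quoted boost clocks, the 448 GFLOPS and roughly 1.6 TFLOPS figures fall out:

```python
# CPU: cores x FP32 ops/cycle x clock (GHz) -> GFLOPS.
# Zen 2: 2x 256-bit FMA units = 16 FP32 FMAs = 32 FLOPs per cycle per core.
cpu_gflops = 4 * 32 * 3.5              # -> 448.0 GFLOPS at the 3.5 GHz boost

# GPU: CUs x 64 shaders x 2 FLOPs (fused multiply-add) x clock (GHz) -> TFLOPS.
gpu_tflops = 8 * 64 * 2 * 1.6 / 1000   # -> ~1.64 TFLOPS at the 1.6 GHz boost

print(cpu_gflops, round(gpu_tflops, 2))  # -> 448.0 1.64
```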
Valve’s Steam Deck allowed users to take their Steam library with them; therefore,
it commonly gets compared to Nintendo’s portable Switch. Table 4.6 shows the
characteristics of both devices.
Nintendo’s Switch ran on its self-contained battery for approximately two-and-a-
half to six-and-a-half hours. The Deck’s 40 W/h battery provided two to eight hours
of gameplay, depending on the games played.
The Switch was a less powerful device since it used a 2015 Arm-based Nvidia
Tegra X1 chipset, so it did not seem fair or logical to compare it to the Steam Deck.
Valve said the Steam Deck emulated the regular Steam app on a desktop. The
emulation included chat, notifications, cloud save support, and all of one’s synced
library, collections, and favorites. It could also stream games to the Steam Deck from
a PC using Valve’s Remote Play feature. In that it was similar to the original handheld
Nvidia Shield (Fig. 4.42).
Only one question remained: what problem did this machine solve? Who needed
or wanted the Steam Deck? What did the Deck do that any laptop with a game
controller could not do as well, if not better?
When Nvidia introduced the Shield, it referred to it as another Android device.
That reference did not pay off for them. Steam Deck was just a PC in a different
package. The main difference was the price. A $400 PC with a built-in game controller
was an excellent idea, but that $400 investment would spend most of its time turned
off on a table or inside a backpack. No one would want to replace their 13-in.
notebook with a low-resolution, 800-line, 7-in. display. Web browsing would be
just as much fun as it was on a smartphone with a higher-resolution display.
had foundations in mobile and elite gaming but was a gaming-specific configuration
with a better GPU, top-end CPU, and memory (see Fig. 4.43).
It ran Android and, therefore, Android games. The company was consid-
ering making it Windows compatible too. That would have been relatively easy
because they already provided chips to OEMs like Asus, HP, and Lenovo for
always-connected PCs running Windows.
The dev kit used 6 GB of LPDDR5 RAM and 128 GB of UFS 3.1 flash storage.
Commercial device developers could do whatever they wanted concerning
memory.
It had a battery life of four to eight hours, 5G capabilities, Snapdragon sound, and
remote TV driving capabilities.
The dimensions were 293 mm × 116 mm × 21.5–52 mm, and it weighed just a
little over 500 g.
Although it had 5G and could do web browsing, game downloading, and online
gaming, it could not be used as a phone but could do Voice Over Internet Protocol
(VoIP).
Razer was Qualcomm’s first partner and distributor of reference kits to game developers.
4.24 Conclusion
The fifth-generation Sony PlayStation, introduced in 1994, had a separate T&L
coprocessor within the CPU. That machine, along with the fifth-generation Nintendo
64 introduced in 1996, kicked off the era of game consoles having GPU-like
characteristics. Microsoft’s first game console, the sixth-generation Xbox introduced
in 2001, had a dedicated Nvidia GPU. The consoles evolved, as did their GPUs, until
2020, when Sony and Microsoft were offering ninth-generation devices with GPUs
capable of mesh shading.
The number of suppliers and products offered swelled from 1972 to seven companies
and 19 products in 1996. It then declined to three companies by 1998 and stayed at
that level until 2017, when it grew to five companies, and then to seven in 2022.
Consoles used to be a leading technology platform until 2013, when Sony adopted
AMD’s APU and Microsoft did the same, and in 2017 Nintendo switched from
AMD to Nvidia’s Tegra for the Switch. The consoles’ components were derived from
products already in the market. The PC leaped ahead in GFLOPS and other features,
but in 2020 Microsoft and Sony surprised the industry by introducing consoles that
could run ray tracing and mesh shaders. The unified memory architecture of the APUs
in the Microsoft and Sony consoles would keep them behind the PC in performance,
but clever game developers have demonstrated that they can get enormous performance
out of those devices, on par with the PC at a fraction of the cost.
Chapter 5
Compute Accelerators and Other GPUs
Very large-scale integration (VLSI) enabled the development and expansion of the
GPU thanks to Moore’s law. Every two years, process nodes shrank while costs
stayed relatively constant, and even in later years, as prices did rise, the advantages
of node shrinkage outweighed the added costs. As the GPU became denser, with
more computational power in the same area, its cost–benefit ratio was irresistible,
and people began adapting the GPU to every imaginable problem and application
(Fig. 5.1).
One would be challenged to find a platform or electronic device in their life that
did not have a GPU somewhere in it, and if not in it, in its manufacturing and design.
That is a beautiful story but one that is difficult to tell. So many companies have
come and gone and made tremendous contributions to the development of the GPU.
There are so many platforms to examine. And, in many cases, a single GPU model
will find itself on multiple platforms.
In this chapter, I try to cover all the special cases not discussed so far.
In 1999, the professional graphics workstation market began a major shift from
proprietary and purpose-built systems to systems based on commercial-off-the-shelf
semiconductors (COTSS). Manufacturers recognized the advantage of moving to
COTSS and the industry segment changed completely in a matter of two years, but
of course the transition for customers was a lot more complicated. The workstation
vendors had to support their customers that had older equipment. That meant that the
older, established workstation suppliers like Evans & Sutherland, Fujitsu, HP, NEC,
SGI, and Sun had the financial burden of supporting UNIX-proprietary workstations
as well as the new Microsoft Windows and Intel systems (WinTel)-COTSS work-
stations. The older suppliers had difficulty competing against newer companies like
Boxx, Compaq, and Dell.
HP offered its last proprietary workstation AIB, the Visualize FX10, in June 2000.
In 2002, computer graphics pioneer Evans & Sutherland began to slowly wind
down and exited the professional graphics market in 2006 to concentrate on the
planetarium projector market. SGI became a server supplier, and Sun introduced its
Fig. 5.1 The GPU scales faster than any other processor
long-awaited XVR-4000 (Zulu) in 2002, a massive AIB too late and too expensive
to compete.
Workstation graphics shifted to 3Dlabs, ATI, and Nvidia, while workstation CPUs
moved to Advanced Micro Devices (AMD) and Intel.
The following sections will examine some of the most interesting and, in some
cases, significant developments in professional graphics and compute accelerators.
5.1 Sun’s XVR-4000 Zulu (2002): The End of an Era
Like its competitors HP, SGI, and others, Sun could see Moore’s law changing
the landscape in the workstation segment. Sun tried to carve out the high-end, high-
performance segment for the company’s value-add to avoid being seen as just another
COTSS box supplier.
In late 2002, the company belatedly introduced the Zulu supergraphics subsystem
it had been working on for about three years. It had four pipelines, each with a MAJC
5200 geometry engine and raster engine (which Sun called a frame buffer controller).
That was followed by schedulers that transferred the pixels to the 3DRAM buffers.
Data then went to the routers, which fed it into the convolve engines and the output
digital-to-analog converters (DACs).
The design reflected the high-level contributions of many others, such as Mike
Lavelle, Dave Neagle, and Scott Nelson.
“Zulu was my last Sun 3D graphics accelerator product (a.k.a. Sage a.k.a. XVR-4000). The
SIGGRAPH 2002 paper [1] gives an overview of the architecture of the graphics pipeline
used and goes into detail on its most novel aspect: real-time five-by-five-pixel convolution
of super-sampled rendered images with higher-order anti-aliasing and reconstruction filters.
This product still has by far the highest final anti-aliased pixel quality of any real-time
machine ever built.” [2].
Sun would follow the industry and later offer the Sun XVR-1200 graphics accel-
erator, which featured a 3Dlabs Wildcat graphics processing unit. However, in April
2009, Oracle bought Sun for $7.4B after IBM dropped its bid. Oracle wanted the
Java software and discarded the graphics.
5.2 SiliconArts Ray Tracing Chip and Intellectual Property (IP) (2019)
SiliconArts was founded in 2010 in Seoul by Dr. Hyung Min Yoon, formerly at
Samsung; Hee-Jin Shin from LG; Byoung Ok Lee from MtekVision; and Woo Chan
Park from Sejong University. The company took on the formidable task of designing
and manufacturing a ray tracing hardware accelerator coprocessor called RayCore.
The company showed its first implementation in a field-programmable gate array
(FPGA) in 2014, and it was impressive then [3]. During the four years that followed,
the company made steady improvements, expanding its product line and developing
an ambitious and impressive road map.
The first implementation of SiliconArts’ ray tracing hardware accelerator was the
RayCore 1000, shown in Fig. 5.3.
The RayCore design had several novel and interesting features, as shown in
Table 5.1.
The company targeted the RayCore 1000 at smartphones and tablets and
embedded and industrial processors for virtual reality (VR) and augmented reality
(AR).
In a test using Autodesk 3DS Max 2019, the company achieved the results, shown
in Table 5.2 with the RayCore 1000 (Fig. 5.4).
The host system was an Intel Pentium Gold G5600 at 3.9 GHz (4 CPU).
The next generation was the RayCore 2000, described in the following section.
In 2018, the company introduced its RayCore 2000 real-time ray tracing IP design,
capable of 250 Mrays/s at 500 MHz (four cores) and running at 2048 × 2048 resolution.
A block diagram of the design is shown in Fig. 5.5.
The RayCore 2000 offered displacement mapping, the effect of actual movement
of geometric points according to a given height field. It enhanced the level of realism
while cutting out complicated modeling processes. The processor also supported
ambient occlusion, calculating how exposed each point in a scene was to ambient
lighting. It further offered multi-texturing, the overlapping of multiple texture images
and light binding on a target object. Although the frame rate was not real time, the
megarays were impressive.
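Ambient occlusion, as described, measures how exposed a point is to ambient light. A minimal Monte Carlo sketch of the idea (my own illustration, not SiliconArts code): cast uniformly sampled hemisphere rays from a surface point and count the fraction that escape the occluding geometry.

```python
import math
import random

def ray_hits_sphere(origin, direction, center, radius):
    # Nearest positive intersection test; `direction` is unit length.
    oc = [origin[i] - center[i] for i in range(3)]
    b = 2.0 * sum(oc[i] * direction[i] for i in range(3))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return False
    return (-b - math.sqrt(disc)) / 2.0 > 1e-6

def ambient_occlusion(point, normal, occluders, samples=2000):
    # Fraction of hemisphere rays that reach the sky unoccluded.
    rng = random.Random(1)
    unoccluded = 0
    for _ in range(samples):
        while True:  # rejection-sample a unit direction
            d = [rng.uniform(-1.0, 1.0) for _ in range(3)]
            n2 = sum(x * x for x in d)
            if 1e-6 < n2 <= 1.0:
                break
        d = [x / math.sqrt(n2) for x in d]
        if sum(d[i] * normal[i] for i in range(3)) < 0.0:
            d = [-x for x in d]  # flip into the hemisphere above `normal`
        if not any(ray_hits_sphere(point, d, c, r) for c, r in occluders):
            unoccluded += 1
    return unoccluded / samples

# A ground point directly below a sphere is more occluded than a far one.
sphere = ([0.0, 2.0, 0.0], 1.0)
near = ambient_occlusion([0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [sphere])
far = ambient_occlusion([50.0, 0.0, 0.0], [0.0, 1.0, 0.0], [sphere])
print(near < far)  # True: the shadowed point receives less ambient light
```

A hardware implementation would batch these occlusion rays through the same traversal and intersection units used for visibility rays.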
The hardware traversal and intersection (T&I) test for ray/path-tracing performance
was 760 Mrays/s (16 cores at 190 MHz). Those processes were the standard steps in
ray/path tracing: the traversal unit found the object hit by the ray (searching the tree
data structure), and the intersection test unit found the exact or closest point of the
object intersected by the ray.
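The two units can be sketched in software (an illustrative toy, not RayCore's actual design): a traversal step culls bounding boxes with a ray-AABB slab test, and an intersection step runs an exact ray-sphere test on the primitives in any box the ray touches.

```python
import math

def ray_aabb(orig, dirn, lo, hi):
    # Traversal-style slab test: does the ray overlap the box?
    t0, t1 = 1e-6, float("inf")
    for i in range(3):
        if abs(dirn[i]) < 1e-12:
            if not (lo[i] <= orig[i] <= hi[i]):
                return False
            continue
        ta = (lo[i] - orig[i]) / dirn[i]
        tb = (hi[i] - orig[i]) / dirn[i]
        t0, t1 = max(t0, min(ta, tb)), min(t1, max(ta, tb))
    return t0 <= t1

def ray_sphere(orig, dirn, center, radius):
    # Intersection-style exact test: nearest positive hit distance or None.
    oc = [orig[i] - center[i] for i in range(3)]
    b = sum(oc[i] * dirn[i] for i in range(3))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 1e-6 else None

# A one-level tree: two bounding boxes, each holding sphere primitives.
tree = [
    (([-3, -1, 4], [-1, 1, 6]), [([-2.0, 0.0, 5.0], 0.5)]),
    (([1, -1, 9], [3, 1, 11]), [([2.0, 0.0, 10.0], 0.5)]),
]

def closest_hit(orig, dirn):
    best = None
    for (lo, hi), prims in tree:
        if not ray_aabb(orig, dirn, lo, hi):    # traversal: cull the box
            continue
        for center, radius in prims:            # intersection: exact test
            t = ray_sphere(orig, dirn, center, radius)
            if t is not None and (best is None or t < best):
                best = t
    return best

print(closest_hit([0, 0, 0], [0.0, 0.0, 1.0]))  # None: both boxes culled
n = math.sqrt(104.0)
print(closest_hit([0, 0, 0], [2 / n, 0.0, 10 / n]) is not None)  # True
```

Dedicated hardware pipelines exactly these two loops, which is why traversal and intersection rates are quoted separately from frame rates.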
That was the ideal performance based on the clock speed and the clock cycles
required to complete the logic pipeline. The effective performance was 160 Mrays/s,
benchmarked on an Intel Arria 10 GX PAC FPGA AIB rendering the Cornell Box.
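The relationship between the ideal and effective numbers is simple arithmetic; using the figures quoted above (my own calculation):

```python
# Ideal throughput = cores * clock / cycles-per-ray.
cores, clock_hz, peak_rays = 16, 190e6, 760e6
cycles_per_ray = cores * clock_hz / peak_rays
print(cycles_per_ray)               # 4.0 clocks per ray per core
print(round(160e6 / peak_rays, 3))  # 0.211: effective fraction of ideal
```

In other words, the pipeline could ideally retire one ray every four clocks per core, and the Cornell Box benchmark achieved about a fifth of that peak.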
The RayCore 2000 feature list is shown in Table 5.3.
RayCore Lite became available in 2020.
In addition to the three implementations listed above, the company also developed
other variants and extensions of the design.
Multi-core. The company introduced a multi-core version of the design called
RayCore MC, a photorealistic GPU IP offering with Monte-Carlo path-tracing, ray
generation, and direct/indirect illumination. Shown in Table 5.4 are the RayCore
MC’s proposed specifications.
RayCore MC was also available as IP.
RayTree. RayTree was a novel approach to ray tracing using fast KD-tree
acceleration structure generation hardware for dynamic object rendering. A KD-tree
is a strategy for organizing data in 3D space: each node holds a k-dimensional point,
and the resulting structure can be navigated and searched efficiently.
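A KD-tree build amounts to recursively splitting the point set at the median along a cycling axis. A minimal sketch (illustrative only; a production builder such as RayTree's would use bounding boxes and cost heuristics):

```python
# Minimal KD-tree over 3D points: recursive median split, cycling the
# split axis with depth.
def build_kdtree(points, depth=0):
    if len(points) <= 2:
        return ("leaf", points)
    axis = depth % 3                    # x, y, z, x, ...
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                 # median split keeps the tree balanced
    return ("node", axis, pts[mid][axis],
            build_kdtree(pts[:mid], depth + 1),
            build_kdtree(pts[mid:], depth + 1))

def depth_of(tree):
    if tree[0] == "leaf":
        return 1
    return 1 + max(depth_of(tree[3]), depth_of(tree[4]))

# 64 pseudo-random points halve at every level: 64 -> 32 -> ... -> 2.
pts = [((x * 0.37) % 1.0, (x * 0.61) % 1.0, (x * 0.89) % 1.0)
       for x in range(64)]
print(depth_of(build_kdtree(pts)))  # 6
```

Because the whole tree must be rebuilt whenever objects move, doing this on a mobile CPU every frame is expensive, which is the bottleneck SiliconArts aimed RayTree at.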
The company offered a dedicated KD-tree generation design IP for hardware-implemented
ray tracing. The company believed KD-tree regeneration was compulsory
for any application that had to deliver high-quality dynamic 3D content and
guarantee real-time interactivity. Despite the overhead it imposed on systems, the
CPU had been primarily responsible for KD-tree generation, which caused process
delays and high power consumption. SiliconArts said RayTree could replace the
CPU’s role and maximize KD-tree regeneration performance. It would regenerate
KD-trees in real time, thereby realizing on-the-fly dynamic scene processing without
CPU use and saving power.
Shown in Fig. 5.6 is the generalized concept of the pipeline.
SiliconArts designed RayTree for implementation in dedicated KD-tree generation
hardware. RayTree scanned primitives and generated an acceleration structure (a
KD-tree) to support real-time dynamic scene processing. The company said it solved
the bottleneck between rendering and tree-building tasks by load balancing and
distributing resources, yielding efficient ray tracing rendering.
The company predicted that, compared to KD-tree generation on a mobile
CPU, RayTree had 35x faster KD-tree generation capability. Furthermore, offloading
the CPU improved power efficiency. Combining RayTree with RayCore effectively
cut the KD-tree generation process out of mobile CPUs, the company said, which
would reduce power consumption at the system level.
The generalized architectural organization is shown in Fig. 5.7.
SiliconArts’ RayTree offered a parallel hybrid tree architecture with a single
scan-tree unit and n KD-tree units.
5.2.5 Summary
SiliconArts defied predictions of its demise and found further funding to carry on
its R&D, which appeared well invested; the company’s product line and road map
were impressive. The company claimed to have a large OEM customer. When that
customer brought a product to market with SiliconArts’ technology, the company
could stop using investor money and move toward positive cash flow. 2019 marked
the year of ray tracing, with Nvidia’s huge commitment and introduction of ray
tracing products, Sony’s announcement that the PS5 would offer ray tracing, and
the expected introduction of hardware-accelerated ray tracing capabilities in AMD’s
and Intel’s GPUs. SiliconArts was in the right place at the right time.
Intel made significant and interesting announcements at the Denver 2019 supercom-
puter conference (SC19). The company officially launched its oneAPI, a unified
and scalable programming model for heterogeneous computing architectures. It also
announced a general-purpose GPU-compute design based on the Xe architecture,
code named Ponte Vecchio, and optimized for HPC/AI acceleration. And, it revealed
more architectural details of the exascale Aurora Supercomputer at Argonne National
Laboratory.
Intel said Ponte Vecchio would be manufactured on Intel’s 7 nm technology. The
GPU would use Intel’s Foveros 3D and embedded multi-die interconnect bridge
(EMIB) packaging (introduced December 2019). Other packaging technologies
included high bandwidth memory, Compute Express Link interconnect, and other
intellectual property.
Then, in August 2021, the company conceded its manufacturing was not up to
par and announced that the Ponte Vecchio chips would be manufactured by Taiwan
Semiconductor Manufacturing Company (TSMC) in Taiwan (Fig. 5.8).
Intel said its data-centric silicon portfolio and oneAPI initiative laid the foundation
for converging HPC and AI workloads at exascale within the Aurora system at
Argonne National Laboratory. Aurora was the first U.S. exascale system (Fig. 5.9).
Intel said it used the full breadth of its data-centric technology portfolio, building
on the Intel Xeon Scalable platform and using Xe-based GPUs, Intel Optane DC
Persistent Memory, and connectivity technologies.
The compute node architecture of Aurora featured two 10 nm-based Intel Xeon
processors (code named Sapphire Rapids) and six Ponte Vecchio GPUs. Aurora had
over 10 petabytes of memory and over 230 petabytes of storage and used the Cray
Slingshot fabric to connect nodes across more than 200 racks.
Intel’s entry into a top-to-bottom, scalable GPU architecture called Xe promised
a lot (Fig. 5.10), and the Ponte Vecchio device was the first reveal.
Fig. 5.10 The Xe architecture is scaled by ganging together tiles of primary GPU cores. Courtesy
of Intel
Fig. 5.11 Intel’s Xe-HPC 2-stack shows the configurable and scalable aspects of its Xe-core design.
Courtesy of Intel
The HPC version had eight 512-bit vector and eight 4096-bit matrix engines,
showing the scaling mix-and-match capability of the Xe design. And Intel showed a
diagram (Fig. 5.11) of 64 of them stacked up and interconnected via 8 MB L1 caches
[4].
In addition, the Xe-HPC 2-stack had eight slices with 128 cores, 128 ray tracing
units, and eight hardware contexts, as well as two media engines, eight HBM2e
controllers, and 16 Xe links (Fig. 5.12).
In late October 2021, Raja Koduri, senior vice president and general manager of
Intel’s accelerated computing systems and graphics (AXG) group, said in a Tweet:
We deployed Xe-HP in our oneAPI devCloud and leveraged it as a SW development vehicle
for oneAPI and Aurora. We currently do not intend to productize Xe-HP commercially;
it evolved into high-performance graphics (HPG) and HPC that was on general market
production path [5].
Intel ended its plans in late 2021 to bring its Xe-HP server GPUs to the commercial
market, saying that Xe-HP had evolved into the Xe-HPC (Ponte Vecchio) and
Xe-HPG (Intel Arc gaming GPU) products. The company no longer saw the benefit
or return on investment in releasing a second set of server GPUs based on Xe-HP.
Intel’s first family of server GPUs, also known by the code name Arctic Sound,
had been the most high-profile product under development from Intel’s reorganized
GPU group. Koduri had often shown chips housing the silicon as Intel brought up
prototypes in its labs. Except for the Xe-LP/DG1 AIB, Xe-HP was the first Xe
silicon Intel developed. It would have been the only high-performance Xe silicon
manufactured by Intel, but TSMC subsequently built the Xe-HPC’s compute tiles
and Xe-HPG dies.
The Xe-HPC-based Ponte Vecchio was the most complex SoC Intel had ever
designed and a showcase example of the integrated device manufacturing (IDM)
2.0 strategy it announced in March 2021. Ponte Vecchio took advantage of several
advanced semiconductor processes, such as Intel’s EMIB technology and Foveros
3D packaging. Intel was bringing to life what it called its moonshot project with that
product. The 100 billion transistor device delivered lots of floating-point operations
per second.
Fig. 5.12 The Xe link allowed even more extensive subsystems to be created. Courtesy of Intel
Fig. 5.13 Ponte Vecchio with >100 billion transistors, 47 active tiles, and five process nodes.
Courtesy of Intel
The base tile, the connective tissue of Ponte Vecchio, was where all the complex
I/O and high bandwidth components would come together with the SoC infrastruc-
ture—PCIe Gen5, HBM2e memory, and MDFI links to connect tile-to-tile and EMIB
bridges.
High bandwidth 3D connect with high 2D interconnect and low latency made
Ponte Vecchio an infinite connectivity machine. Intel said the technology develop-
ment team worked to match the requirements on bandwidth, bump pitch, and signal
integrity.
The Xe link tile was critical for scale-up in HPC and AI, providing the connectivity
between GPUs with eight links per tile. Intel added that tile to enable the
scale-up solution for the Aurora exascale supercomputer.
Ponte Vecchio was scheduled for release in 2022 for HPC and AI markets.
Intel disclosed IP block information for the Xe-HPC microarchitecture. It had
eight vector and matrix engines (XMX, Xe Matrix eXtensions) per Xe-core; slice
and stack information; and tile information, including process nodes for the compute,
base, and Xe link tiles. The system had eight Xe-cores per tile and 4 MB of L1 cache
per tile (Fig. 5.14).
The base tile was built on Intel’s Foveros substrate and was 640 mm2 in size
(Fig. 5.15). It included PCIe Gen5 I/O and a 144 MB L2 cache.
At Architecture Day 2021, Intel showed A0 Ponte Vecchio silicon that generated
more than 45 TFLOPS of FP32 throughput, greater than five terabytes per second
(TBps) of memory fabric bandwidth, and greater than two TBps of connectivity
bandwidth. Ponte Vecchio would use the oneAPI standards-based, cross-architecture,
cross-vendor unified software stack, as with all of Intel’s Xe architectures.
Fig. 5.14 Ponte Vecchio chips in a carrier from the fab. Courtesy of Stephen Shankland/CNET
Fig. 5.15 Intel’s Ponte Vecchio circuit board revealing the tiles in the package. Courtesy of Intel
Then, Intel stacked up the Ponte Vecchio circuit boards shown in Fig. 5.14 to
build the 2S Sapphire Rapids accelerator board shown in Fig. 5.16.
Tying all the scalability together with CPUs was done via Intel’s oneAPI
(Fig. 5.17). Koduri said oneAPI was open, standards-based with a unified soft-
ware stack. That would allow freedom from proprietary programming models while
giving developers full performance from the hardware and peace of mind. It was
an apparent reference to Nvidia’s proprietary and very successful compute unified
device architecture (CUDA) programming environment.
The oneAPI initiative offered an open, standards-based unified software stack
that Intel said was cross-architecture and cross-vendor, allowing developers to break
free from proprietary languages and programming models. It had data-parallel C++
(DPC++) and oneAPI library implementations for Nvidia GPUs, AMD GPUs, and
Arm CPUs. Intel said that oneAPI was being adopted broadly by independent software
vendors, operating system vendors, end users, and academics. Key industry
leaders were helping to evolve the specification to support additional use cases and
architectures. Intel also had a commercial product offering with the foundational
oneAPI base toolkit, which added compilers, analyzers, debuggers, and porting tools
beyond the specification language and libraries.
Fig. 5.17 Intel’s oneAPI allowed heterogeneous processors to communicate and cooperate
The scalability of the Xe design was impressive. Intel had organized major subsystems
into tidy blocks that could be added or subtracted as needed for a particular
market segment. The tiling approach would work very well if the Foveros substrate
had good bandwidth and low latency (as Intel claimed it did). The design would
have long legs and benefit from node improvements. If Intel could pull that off, the
company’s ROI would be a win. The counterargument was economy of scale:
building several parts and then bolting them together vs. building a monolithic chip
and manufacturing a large quantity to drive down costs and accelerate ROI. It would
take at least three years to determine whether Intel made the right bet. Intel’s
next-generation GPU, code-named Rialto Bridge and slated to launch in 2023, would
succeed Ponte Vecchio.
oneAPI was also promising and challenging. Intel did not include discrete
GPUs, DSPs, or neural network processors in its early unveiling. They all had unique
instruction set architectures (ISAs) and programming rules.
5.4 Compute GPU Zhaoxin (2020)
Strong murmurs and rumors from China indicated that Zhaoxin, a joint venture
startup between Via Technologies and the Shanghai Municipal Government founded
in 2013, might introduce a dGPU. How did it manage to do that? Via Technologies
was founded in the late 1980s by Cher Wang, heir to the Formosa Plastics Group. She
established the company to supply motherboard chipsets and components, but she
had a much bigger vision from the start. Via was initially based in California to be
near the fast-growing PC community, and as that business increasingly moved to
Asian suppliers, Via Technologies moved to Taipei in 1992.
Meanwhile, Cyrix Semiconductor started in 1988 in Richardson, Texas. In 1992,
it released a 486 CPU, and in 1995, it introduced its 586 to compete with Intel’s
Pentium class processors. In 1996, it introduced the MediaGX, one of the first CPUs
with integrated graphics. National Semiconductor bought Cyrix in 1997 for $550
million. And then, in July 1999, Via Technologies bought the Cyrix x86 division
from National Semiconductor for considerably less. National, however, hung on to
the MediaGX part—for a while.
Also, in Texas, Centaur Technologies was formed in 1995 to design low-power
versions of the x86 processor. Funded by IDT, Centaur introduced the WinChip
in 1997. When Centaur ran out of money in 1999, Via bought it. The design was
subsequently renamed, becoming the Via C3 through C7. Via was amassing CPU expertise.
In 2001, at Comdex, Via said it was enjoying growing sales for its processors and
expected to sell 10 million Via-Cyrix III processors. In 2011, the company announced
a quad-core x86 processor.
5.5 MetaX (2020–)
Chen Weiliang started MetaX (Mu Xi) Integrated Circuit Co. in the Pudong Lingang
Special Area of Shanghai’s Free Trade Zone in September 2020. With a master’s
degree from the Institute of Microelectronics of Tsinghua University, Chen had been
a senior GPU researcher at AMD China. Chen said MetaX would develop artificial
intelligence processors consisting of application-specific integrated circuits (ASICs),
FPGAs, and GPUs.
Chen commented that “GPU chip’s versatility and parallel computing capabilities
help it effectively apply to AI, cloud computing, and high-performance computing”
[9].
The company conducted four rounds of financing, with nearly 100 million yuan
(~$15 million); Heli Capital led the financing. By August 2021, the company had
raised CNY1 billion (~$150 million) Series A round of funding. Two state-owned
equity investment platforms, The China Structural Reform Fund and the China
Internet Investment Fund, plus Lenovo’s venture capital, led the latest round. Existing
shareholders, including Jingwei China, Heli Capital, Sequoia China, and Lightspeed
China Partners, continued to invest in the firm.
Fig. 5.20 GlenFly’s AIB running the Unigine Heaven benchmark. Courtesy of GlenFly Technology
The MetaX team came to the GPU market with an average of nearly 20 years of
research and development experience, including developing GPU products from 55 to
7 nm. The company claimed its team members had led the research and development
of more than ten world mainstream high-performance GPU products. The company
developed a concept of “extending reconfigurable hardware architecture based on
the traditional GPU architecture.”
In 2006, AMD invested $16 million in setting up its R&D center in Shanghai,
its most significant overseas R&D investment [10]. Weiliang and the core team had
worked for AMD for years before leaving to start MetaX. In late 2021, the startup
had more than 300 employees, over 80% of whom worked in R&D.
Ms. Peng Li, chief architect of hardware at MetaX, was the first Chinese woman
to be made a Fellow at AMD and had 15 years of experience in high-performance
GPU design at AMD. Dr. Yang Jian, MetaX’s chief software architect, was the first
scientist at AMD China. He had served as the chief architect for AMD and HiSilicon
and had 20 years of experience in large-scale chip and GPU software and hardware
design. Other senior management had worked at chip manufacturers such as Trident,
Arm, Nvidia, Intel, Spreadtrum, Cadence, Synopsys, byte, and LanChi (Fig. 5.21).
“The GPU,” said Chen, “has achieved many improvements in its microarchi-
tecture from the initial fixed pipeline ASIC to today’s general GPU processor.
Mu Xi proposed a software-defined hardware-reconfigurable architecture for big
260 5 Compute Accelerators and Other GPUs
data computing. The granular hardware reconfiguration enabled more accurate and
flexible calculations to achieve a better energy efficiency ratio”.
Reflecting China’s ambitions for its technology sector, Chen said:
Only by redefining the GPU instruction set and building the microarchitecture, software
drivers, compilers, and library functions from scratch can we achieve completely independent
intellectual property rights and higher product performance.
Chen and his backers saw that overseas companies such as AMD, Intel, and Nvidia
dominated the GPU market and believed there was a huge opportunity for Chinese
companies. Chinese developers and a few investors had seen the opportunities, but
the high investment, slow return, lengthy R&D pace, and lack of scientific and tech-
nological talents made many companies and investors hesitate. The semiconductor
firms that did launch in China encountered bottlenecks. China imported 543.5 billion
integrated circuits in 2020, a year-on-year increase of 22.1%; imports amounted to
$350 billion, with a year-on-year increase of 14.6%. However, although China’s chip
demand continued to rise, domestic chip production accounted for less than 20% of
the domestic chip market. Regardless of the industry, China has a market advantage
due to its population base and market size.
In March 2021, Tian Yulong, a member of the Party Leadership Group, Chief
Engineer, and Spokesperson of the Ministry of Industry and Information Technology,
stated at a State Council Information Office [11] press conference that the government
had issued policies to promote the integrated circuit industry and supporting software. He also said the government had pledged long-term support for chip corporate
income tax, the entire chip industry chain, talent reserves, and training.
5.5 MetaX (2020–)
In addition to policy support, the Chinese government would help with capital to
develop the industry. By August 2021, there had been 27 financing rounds for chip design companies, 22 of which were above 100 million yuan (~$15 million).
The benefits of the increased funding for industrial development were apparent to
all, and startups appeared on a large scale.
Many chip companies entered the market in China, but the failure rate was high.
Chen Weiliang said that the lack of talent was a significant problem encountered in
the development of semiconductor companies. Still, it was not the root cause of the
high bankruptcy rate in the industry. A lack of core technology, weak products, poor liquidity, a mismatch of strategy and capability, and weak team cohesion could all cause business failure [11].
In 2019, market research firm International Data Corp (IDC) predicted China’s
GPU server market would reach 30 billion yuan ($4.32 billion) by 2023 [12]. As a
result, armed with backing from government-controlled companies and the govern-
ment, Chinese GPU startups, including MetaX, Zhaoxin, XiangDiXian, and others,
chased after the 30 billion yuan IDC predicted. The GPU server market in China was
predicted to reach about 45 billion yuan ($6.75 billion) in 2024.
GPU chips accounted for about 50% of the cost of GPU servers. The total market
size for GPUs in the Chinese GPU server market in 2025 was estimated to be about
27 billion yuan.
The exodus from AMD's China R&D center was a blow to the company, and as of this writing, it remained to be seen where the Chinese GPUs would be manufactured.
At the time, China was pushing to get 12 nm fabs up, but to compete in the server
market in 2021, the startups needed to find a 5 or 3 nm fab. That meant going offshore
to Taiwan or Korea.
AMD restaffed its R&D center and, in 2022, had more people working there than
ever before.
AMD had struggled with China for a while. In June 2019, The Wall Street Journal reported “How a Big US Chip Maker Gave China the ‘Keys to the Kingdom’” [13].
The story referred to AMD forming a joint venture called Tianjin Haiguang Advanced
Technology Investment Co. Ltd. (THATIC). THATIC was owned by AMD (51%) and
public and private Chinese companies, including the Chinese Academy of Sciences.
THATIC was chartered to enable and allow Haiguang Microelectronics Co. Ltd.
(a.k.a. HMC) to build X86 processors using AMD’s IP [14].
AMD would license SoC and high-performance processor IP to THATIC for
China’s fast-growing server market. AMD stood to gain up to $293 million in revenue
from the deal without incurring investment exposure. The THATIC deal involved
licensing x86 IP and not AMD graphics or Arm-based designs.
5.7 Bolt Graphics (2021–)
Bolt Graphics was a startup in stealth mode in 2020 and obtained funding in 2021 to
bring the company out of the shadows. The company was started by Darwesh Singh,
a cloud architect in Minneapolis, with the goal of building a ray-tracing accelerator for rendering/VFX. The company claimed it had the first fully hardware-accelerated
pipeline in the world. Bolt said its GPU was optimized for scenes involving reflective
surfaces, diffuse materials, and complex geometry. Scenes developed for 8K or higher resolutions, hundreds of samples per pixel, and multiple bounces would render
quickly on the Bolt Platform, while other systems would struggle (Fig. 5.22).
Bolt argued that workloads were scaling faster than platforms—a new architecture
was needed. Figure 5.23 is a symbolized diagram of Bolt’s approach.
Since the Bolt 1 chip was designed for scale-out systems, the company claimed it could perform equally well alone or with many chips. The Bolt Platform shared memory between 16 chips, which reduced data transfer overheads when a data set grew past one terabyte.
For enterprise customers, Bolt Graphics's data center GPU used technologies such as HBM, TSMC's 5 nm process, and PCIe Gen5, and got its first silicon in the second half of 2022. Bolt's GPU was projected to offer more than three times the performance of other solutions in rendering and AI workloads. The second-generation part would be trimmed for gaming, DCC, and general-purpose computing and graphics.
One Bolt server, said the company in late 2021, would provide the same perfor-
mance as 16 high-end dual-CPU servers at a similar cost (Table 5.5). In situations
where render deadlines were tight, Bolt claimed its platform could perform 11.5x
faster, enabling users to meet aggressive render deadlines.
Bolt argued that workloads were scaling faster than platforms and that a new
architecture like theirs was needed (Fig. 5.24).
Fig. 5.22 Darwesh Singh, CEO and founder of Bolt Graphics. Courtesy of Singh
Bolt said its servers consumed less than one-fourth of the total power of CPU-
based servers, significantly reducing the environmental footprint. No special cooling
treatment was needed, enabling simple integration within existing data centers.
Fig. 5.24 Bolt targeted industries characterized by exponentially expanding workloads. Courtesy
of Bolt Graphics
5.8 Jingjia Micro Series GPUs (2014)
Changsha Jingjia Microelectronics Co., Ltd (Jingjia Micro) was founded in the
Yuelu District of Changsha Hunan, China, on April 5, 2006, to design military elec-
tronic products. Its products included integrated circuits, intelligent display modules, graphics controller boards, signal processing boards, graphics chips, and related software. Jing Jiawei was the founder and chairman of the board, and in 2022, the
company had over 870 employees (Fig. 5.25).
In April 2014, we successfully developed the first domestic high-reliability, low-power GPU chip, the JM5400, with completely independent intellectual property rights, breaking the long-term monopoly of foreign products in the GPU market in our country. —Jing Jiawei
In late 2021, the company announced it was preparing its latest GPU, the JM7000-
series.
Information about the company’s pursuit to develop a GPU was first announced
in August 2019 [15] when the company projected its 28 nm-based GPU would
compete with Nvidia's GTX 1050 and 1080. The company based its projections on the Jingjia JM5400 series GPUs used in Chinese military aircraft. At the time, the
company said the JM7200 series had already obtained some orders.
The JM7200 had the following specifications:
Clock frequency: 1300 MHz and supports dynamic frequency modulation
Fig. 5.26 Jingjia Micro's JM7200-based PCIe AIB. Courtesy of Changsha Jingjia Microelectronics
Co.
to eight TFLOP/s. The performance was equal to or greater than that of Nvidia's GTX 1080.
In 2019, the Chinese government announced it had set up a national semiconductor
fund (China Integrated Circuit Industry Investment Fund) to develop and expand its
domestic chip industry and close the technology gap with the U.S. [16]. The fund
was budgeted at 204.2 billion yuan ($28.9 billion). It was at the time the largest such fund, exceeding a similar fund launched in China in 2014 that had raised 139 billion yuan.
Although Jingjia Micro had targeted the GTX 1080, it was building its GPU at 28 nm, whereas the GTX 1000-series was built on 16 nm, giving Nvidia a considerable advantage in performance headroom and cost. China had not yet been able to build advanced-node fabs, and it would take several years to get the necessary manufacturing equipment and experience to do so. Clearly, though, the Chinese were
committed to their semiconductor programs.
Changsha Jingjia Microelectronics was a pure-play fabless semiconductor
company similar in organization, goals, and ambitions to AMD, Nvidia, and Medi-
aTek. The company received sales from many key national Chinese projects, giving
it a considerable home market advantage and a favorable ROI for the immense costs
of developing a GPU. The only drivers available were for the UOS and Kylin OS.
UnionTech developed the Unity Operating System (UOS), a Chinese Linux distribution. It was used in the People's Republic of China as part of a government initiative
beginning in 2019 to replace foreign-made software such as Microsoft Windows with
domestic products.
Kylin was an operating system developed in 2001 at the National University of
Defense Technology in China and named after the mythical beast Qilin [17]. The
first versions were based on FreeBSD and intended for use by the Chinese military
and other government organizations.
The major Chinese chipmakers, including the SMIC foundry, have continued to
develop 14 and 12 nm process nodes based on fin field-effect transistor technology,
with solid support and encouragement from the government. China’s Vice Premier
Liu started a program in 2020 focused on using the country’s semiconductor manu-
facturing resources and talent to make China independent and a potential world leader
in compound semiconductors.
Apple video iPod to handle video record and playback, image capture and processing,
audio capture and processing, graphics, games, and ringtones.
The second-generation chip, VC02, ran at 150 MHz (almost twice the speed of
the 85-MHz operation of the VC01), displayed video on quarter video graphics array
(QVGA) screens, and captured images up to 8 MP from image sensors. The new chip
also had an input for TV tuners and TV-out. There was more internal SRAM (10 Mbits versus 8 Mbits in the earlier part) and advanced image filters (Fig. 5.27).
The VC02 had features useful for video [18].
The VC02 also had dual 32-bit RISC processors and a dual-issue compiler. As in all mobile devices of the time, the CPUs were fixed-point only. However, like
the predecessor single CPU part, the VideoCore II was a dual video DSP, with a
16 parallel data path very long instruction word (VLIW) vector processor tightly
coupled (by shared registers) to 32-bit RISC scalar processors.
The photo shows the development board; the VC02 is the chip just above the
display.
In September 2004, Alphamosaic was acquired by Broadcom for $123 million,
forming its Mobile Multimedia group on the Cambridge Science Park site.
Broadcom launched its first new chip in the VideoCore line at 3GSM in February
2005. The chip was designed for mainstream phones but enabled advanced features
such as video, 3D gaming, and multimedia, previously associated with high-end
phones. Broadcom branded it the BCM2705.
Broadcom reduced the amount of memory, reduced the JPEG encode and decode
from 8 to 4 MP, and removed some peripheral support, including TV-out and USB.
The chip would enable phones to display video at 30 fps on a 2-inch color LCD and
capture 4 MP images. The chip was 100% programmable and had MPEG-4 encode
and decode.
The VideoCore processor was a 150 MHz dual-ALU design, allowing the chip to function as a coprocessor or stand-alone processor. The chip was manufactured in 130 nm
CMOS and packaged in a 281-pin thin profile fine-pitch ball grid array (TFBGA)
package (10.9 × 10.1 mm).
Swann said, “We showed some pretty good 3D games before, and we have got
better games now.”
Though pricing was a very relative thing, the chip was in the range of $10.
In 2012, hobbyists began exploiting a powerful little development kit known as the Raspberry Pi, created in the UK by the Raspberry Pi Foundation in association with
Broadcom. As of May 2021, over forty million boards had been sold (Fig. 5.28).
The Raspberry Pi had a Broadcom SoC with a VideoCore IV 3D graphics core
and used a closed-source binary driver (called a blob) that communicated with the
hardware. The blob ran on the BCM2835 SoC’s vector processing unit (VPU) of the
Raspberry Pi. Open-source graphics drivers were a thin shim running on the ARM11
via a driver in the Linux kernel. But the lack of an open-source graphics driver and
documentation was a problem for Linux on Arm—it prevented users from fixing
Fig. 5.28 Raspberry Pi 4 Model B development board. Courtesy of Michael Henzler / Wikimedia Commons
driver bugs, adding features, and generally understanding what their hardware was
doing.
Then in February 2014, Broadcom announced it would give the VC4 to the
community. It released all the documentation for the graphics core and the complete source
code of the graphics stack under a three-clause Berkeley Source Distribution (BSD)
license—anyone could use it.
The source release targeted the BCM21553 cellphone chip, but it was straightfor-
ward to port it to the BCM2835 on the Pi. That allowed access to the graphics core
without using the blob. As an incentive to do that work, the Raspberry Pi organization
offered a $10,000 prize for the first person to demonstrate to them that Quake III
could run successfully at a playable framerate on Raspberry Pi using those drivers
(Fig. 5.29).
In April 2014, only a month after the prize was offered, it was claimed by Simon
Hall, a longtime Pi hacker.
In 2022, Raspberry Pi 4 kits could be bought for as little as $25.
5.10 The Other IP Providers
Arm and Imagination Technologies were the most prominent suppliers of GPU IP
but were far from the only ones. As of 2022, there were four others: AMD, Digital
Media Professionals (DMP), Think Silicon, and VeriSilicon, plus the open GPU
organizations. The GPU IP suppliers serviced the other platform suppliers. Mobile
devices were the largest market, followed by automotive, and digital TV (DTV) and set-top box (STB) SoCs. Arm claimed in 2021 that it had an 80% market share in DTV with its Mali GPU.
The following are some brief stories of those IP suppliers. I have put them all in
this mobile section for convenience.
The problem with IP is that you cannot see it. You can see representations of it,
such as a block diagram, a register-transfer level (RTL) netlist, or an SoC with the IP
buried inside it, but the pure IP is just a bunch of ones and zeros on a disk somewhere.
AMD is discussed throughout this book. However, it may not be evident that the
company has been an IP provider and a discrete and integrated GPU supplier.
ATI officially entered the IP market in 2003 with a project for Microsoft for the
Xbox 360 and, in 2004, with a deal with Qualcomm. Through various acquisitions
and partnerships, ATI was on a quest for dominance in all sectors of the graphics
market. Qualcomm would embed ATI's Imageon graphics (discussed in Chapter fifteen) in Qualcomm's next-generation baseband processor. Qualcomm would use ATI's graphics technology to compete against TI and Intel, which used Imagination Technologies' graphics.
The relationship went well, and to augment it and add more value, in May 2006,
ATI bought the renowned BitBoys and incorporated its newly developed 2D engine.
AMD acquired ATI, and when AMD got into financial difficulties in 2007 and 2008, it started looking for things to sell to lower operating costs. In January 2009, AMD sold the Imageon group to Qualcomm, and AMD temporarily exited the IP business.
The company re-entered the IP market in 2012 when it developed a custom chip
for the Sony PlayStation 4 and then licensed Sony to build subsequent versions.
AMD did the same thing with Microsoft on the Xbox One.
One of the biggest surprises came in 2019 when word leaked out that AMD had provided the IP for the GPU in Samsung's next-generation Exynos SoC for Samsung
smartphones. Samsung had started an internal GPU project several years earlier. After
several manager changes and other setbacks, the senior management of Samsung had
had enough and decided to go with AMD.
Founded in Japan in 2002, Digital Media Professionals (DMP) was well known for its VR and multimedia systems work. In 2004, DMP decided to change direction from a
high-end chip supplier to an IP provider.
The company was best known for its power-efficient hardware-accelerated
graphics core, which it offered as IP and in LSI.
The company had been developing its proof-of-concept Ultray design based on physical model rendering and experimenting with algorithms that defied the limits
of miniaturization. The company planned to release Pica, its first IP core based on
Ultray architecture, in 2006.
Tatsuo Yamamoto, the president and CEO of DMP, said Ultray would allow
real-time photorealistic rendering with physically correct lighting and shadows,
such as soft shadow casting and position-dependent environmental mapping [19].
Furthermore, he pointed out that the physical model rendering achieved real-time and
high-quality images, significantly reduced the dependency on textures, and offered
excellent memory efficiency.
The architecture was very scalable and offered functionality from high-end appli-
cations such as VR and content creation to low-end, low-power applications such
as mobile devices. A pixel-level shader with components such as bidirectional
reflectance distribution function (BRDF) and Phong built into the hardware pipeline
achieved excellent quality and high-performance graphics with far fewer polygons, even in low-end embedded applications.
The Ultray supported the following features in the first-generation high-end DMP
product.
Hardware-accelerated shading:
Phong shading
Cook/Torrance shading
BRDF shading
Multi-layer reflection
Hardware-accelerated effects:
Texture mapping (bilinear, trilinear, programmable 4 × 4 tap filtering)
Bump mapping
Refraction mapping position-dependent cube mapping
Vector and boundary-edge anti-aliasing
Anti-aliased soft shadow casting
Hair generator
Glare and flare renderer
Gaseous object renderer (see image)
Polygon subdivision to lower processor bus bandwidth
Per-vertex subsurface scattering
Although the Ultray had many specifications in common with similar products, the
Ultray had unique functions. Notably, a hardware parametric engine did the gaseous
object rendering, not a shader, saving transistors, power consumption, and time. With
that unique feature, clouds, smoke, gas, and other fuzzy objects could be shaded and
rendered at an interactive rate.
Many techniques have been proposed to model gaseous objects, using numerical simulations, volumetric functions, fractals, cellular automata, and particle systems,
among several others. DMP looked at all those approaches, and to DMP, it appeared
the common denominator of the techniques could be defined, at a given level, as a
point set with attributes, or in other words, as a hyper-volume or a fiber bundle.
DMP’s Pica was a 3D/2D graphics IP core for embedded systems. DMP planned
to start licensing Pica in the fall of 2006. In addition, DMP offered an OEM chip
development service based on its chip development capabilities proven in the launch
of Ultray2000 in 2005.
In May 2009, the company released its SMAPH-F vector graphics IP core. DMP
said it had licensed the core to a major Japanese Tier-1 automotive supplier.
SMAPH-F was designed to accelerate graphical user interface (GUI) applications
for entry-level embedded products such as mobiles, TVs, digital cameras, graphics
meters, navigation, gaming, and office products. The company said the applications
would be offered at a low cost and low power consumption while achieving excellent
vector graphics performance. SMAPH-F was compliant with the Khronos Group’s
OpenVG 1.1 and offered acceleration of vector graphics content, including Adobe
Flash Lite and SVG. SMAPH-F included DMP’s Gradient Extension hardware accel-
eration for gradient animations and additional procedural texturing such as a woody
pattern (Fig. 5.30).
The company said it believed SMAPH-F was an integration-friendly IP with respect to meeting performance goals in complex SoCs. That was partly due to its
support of industry-standard Open Core Protocol (OCP) and Advanced eXtensible
Interface (AXI) interconnect and its design for optimum system performance with
DDR burst accesses [20].
In June 2010, at the E3 conference in LA, Nintendo surprised the industry by
introducing the 3DS, a handheld game machine with a glasses-free 3D stereo screen.
It provided a 400 × 240 image per eye, which looked very sharp. Nintendo stock
rose about 11% during the first two days of the conference (Fig. 5.31).
Just after E3, Nintendo announced it had chosen the Pica200 graphics technology
from DMP.
The Pica200 used DMP's proprietary Maestro extensions for 3D graphics. By
hardware implementation of complex shader functionality, those extensions allowed
high-performance rendering found on high-end products to be realized on mobile
devices with low power consumption requirements.
The company went public on the Tokyo stock exchange in June 2011.
In the spring of 2014, the company bought the sole rights to sell Cognivue’s
embedded computer vision IP Apex Image Cognition Processor cores to the Taiwan
and Japanese markets. That enabled customers to buy multiple components for an
SoC from DMP in Taiwan, where there are many semiconductor and SoC makers
in mobile, automotive, and consumer electronics.
Think Silicon was founded in Patras, Greece, in 2007 by George Sidropoulos and Dr.
Iakovos Stamoulis. Dr. Stamoulis had been working at the ray tracing pioneer hard-
ware company Advanced Rendering Technologies in Cambridge, U.K., and wanted
to come home. In 2005, his old friend Sidropoulos convinced him to join Atmel's IC design team. They worked together in Atmel's MMC (multimedia and communications) group in Patras until Atmel closed the operation in late 2006. The two and
a few others from Atmel started Think Silicon (TSi). The Athens-based Metavallon
VC group backed Think Silicon. The company planned a wide range of graphics and
display processors for the Internet of things (IoT), wearables, and broader display
device markets (Fig. 5.32).
Think Silicon developed the tiny GPU design after they left Atmel—or more
accurately after Atmel left them. In 2009, Think Silicon released its first graphics
accelerator, named the Think2D, and a display controller, the ThinkLCD. In 2016,
Microchip Technology bought Atmel, and TSi licensed the GPU and display design
to Microchip and several other firms.
One of the first challenges the group took on was to provide usable graphics for a phone vendor's early IoT system with just 128 KB of memory, which seemed impossible. However, Stamoulis had used Commodore Amigas and Atari STs, where such graphics had been possible in systems with that amount of memory and main CPUs running at just 8 MHz (Fig. 5.33).
Fig. 5.32 Think Silicon founders George Sidropoulos and Iakovos Stamoulis. Courtesy of Think Silicon
Fig. 5.33 Think Silicon's whiteboard from 2015. Courtesy of Think Silicon
“The challenge was accepted,” said Stamoulis. “And the designed IP used many old tech-
niques almost forgotten in the graphics world, combined with many modern techniques, and
the result was a graphics processor unit that was ideal for the emerging IoT and Smartwatch
market.”
In 2011, the company released a vector graphics GPU, Think VG, which it licensed
to Dialog Semiconductor, and in 2016, it introduced its multicore and multithreaded
Nema GPU design (Νήμα in Greek). Nema became a leading GPU for low-power
SoCs from Ambiq, ST Microelectronics, and other Tier-1 companies, including chip-
scale package SoC suppliers. Soon several devices in the market were using Nema.
Ambiq's founders developed the Subthreshold Power Optimized Technology (SPOT) processor at the University of Michigan and founded the company in 2010.
TSi supplied the GPU IP design to Ambiq, who built the tiny SoCs that went into
millions of fitness and smartwatches, smart thermometers, and home devices (i.e.,
end-point devices).
In 2021, Ambiq revealed its newest Apollo4 SoC family incorporating Think
Silicon’s Nema pico GPU and Nema display controller IP, shown in Fig. 5.34.
A lean version of the Nema ISA, the Nema pico XS microarchitecture combined VLIW, low-level vector processing, and hardware-level support for multithreading in a power-efficient way. The Nema interface was not compatible with standard or open APIs such as OpenGL ES or Vulkan. It was a lightweight proprietary interface designed for use in embedded devices.
The tiny but mighty GPU boasted the following elements:
Hardware elements:
Texture mapping unit
Programmable VLIW instruction set shader engine
Primitives Rasterizer
Command list-based DMAs to minimize CPU overhead
Display controller (optional)
Image transformation:
Texture mapping
Point sampling
Bilinear filtering
Blit support
Rotation by any angle
Mirroring
Stretch (independently on x and y-axis)
Source and destination color keying
Format conversions on the fly
Blending capabilities:
Fully programmable alpha blending modes (source and destination)
Source/destination color keying
Drawing primitives:
Pixel/line drawing
Filled rectangles
Triangles (Gouraud Shaded)
Quadrilateral
Text rendering supports:
Bitmap anti-aliased (A1/A2/A4/A8)
Font Kerning
Unicode (UTF8)
The Nema pico XL/XS series was an extended version with a display controller designed for SoCs. The company specifically targeted the market for high-end wearables and embedded IoT display devices. The popular Fitbit used a TSi GPU and display controller.
The Nema product family had four major segments that could be organized into
seven configurations.
Nema pico GPU Family:
Fig. 5.36 Comparison of Think Silicon's Nema and Neox GPUs. Courtesy of Think Silicon
the need for the host processor. Each core had an execution shader core, a texture-
mapping unit (TMU), and a ROP unit. Figure 5.37 shows an SoC implementation
with Neox cores (Fig. 5.38).
Multithreading maximized efficiency in systems with long memory latency. It was theoretically possible to achieve 100% compute utilization in memory-intensive applications. A thread scheduler kept thread status and issued commands from a ready-to-run pool of threads.
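That latency-hiding claim can be illustrated with a little arithmetic. The sketch below is not Think Silicon's scheduler; it is a generic model that assumes each thread alternates a fixed number of compute cycles with a fixed memory-stall period, and the cycle counts are invented for illustration.

```python
import math

def utilization(n_threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles the execution unit stays busy when every thread
    alternates `compute_cycles` of work with `stall_cycles` of memory wait,
    and a scheduler always issues from the ready-to-run pool."""
    period = compute_cycles + stall_cycles
    return min(1.0, n_threads * compute_cycles / period)

def threads_to_hide(compute_cycles: int, stall_cycles: int) -> int:
    """Minimum number of threads needed for 100% utilization."""
    return math.ceil((compute_cycles + stall_cycles) / compute_cycles)

# Example: 4 compute cycles of work per 100-cycle memory stall.
print(f"{utilization(8, 4, 100):.0%}")   # 8 threads keep the unit ~31% busy
print(threads_to_hide(4, 100))           # 26 threads fully hide the stall
```

The model makes the trade-off concrete: the longer the memory stall relative to the compute burst, the more threads the scheduler must hold in its pool to keep the pipeline full.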
Table 5.7 shows a comparison of the Nema and Neox processors.
The Neox had a lightweight pipeline. That prevented hazards and complex interlocks; it also avoided branch prediction and did not have any feedback paths. As a result, the pipeline was used as much as possible, with no lost cycles
waiting for data, reducing power consumption. Nema XL and Neox supported ROP
operations, and there was one unit per core. In Nema XS, it was emulated by shader
code.
TSi positioned the Neox and Nema processors as well suited for various devices
and applications, as depicted in Fig. 5.39.
The company also offered 7 nm (TSMC) models, which scaled the clock to
700 MHz and reduced the area proportionally.
5.10.5 VeriSilicon
GiQuila was founded in 2004 by former SGI, ATI, and Nvidia GPU designer Mike
Cai to design and develop a GPU for mobile devices. In 2007, GiQuila changed its
name to Vivante and changed its direction from a chip builder and seller to focus on
Fig. 5.39 Think Silicon's application and device range. Courtesy of Think Silicon
designing and licensing embedded graphics processing unit designs. That same year
Wei-Jin Dai left Cadence Design and took over as president and CEO of Vivante.
By 2009, over 15 companies used Vivante GPU IP in twenty embedded designs
[22].
In 2010, Vivante demonstrated a low-power multicore GPU called Scalarmor-
phic that exceeded 1 GHz [23]. The company said its multicore GPUs had been
proven in multiple tier-one SoC vendors' products and would drive next-generation
game consoles, tablets, smartphones, automotive displays, and home entertainment
applications.
Vivante’s multicore GPUs were multithreaded extensions of the OpenGL ES
single-core GC series architecture first launched in 2007. The multicore GPUs
were capable of more than 200 M triangles per second on industry-standard GPU
benchmark polygon throughput tests.
In August 2013, Vivante launched its highly granular Vega series of IP GPUs for mobile devices, which included the Vega 1X, 2X, 4X, and 8X, based on target performance and market requirements.
Vivante described the Vega as an ultra-threaded GPU, with each GPU core able
to handle up to 256 threads. It supported switching between threads in a round-robin
mode unless there were dependencies for a thread. If that was the case, the thread
got skipped until the dependency cleared.
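A round-robin issue policy with dependency skipping can be sketched in a few lines. This is a toy model, not Vivante's hardware: the thread names and the per-cycle readiness flags are invented for illustration.

```python
from collections import deque

def schedule(threads: dict, cycles: int) -> list:
    """Issue one thread per cycle in round-robin order, skipping any
    thread whose dependency has not cleared (its readiness flag for
    that cycle is False). Returns the thread issued each cycle."""
    ring = deque(threads)            # round-robin order of thread ids
    issued = []
    for cycle in range(cycles):
        for _ in range(len(ring)):
            tid = ring[0]
            ring.rotate(-1)          # advance the ring either way
            if threads[tid][cycle]:  # dependency cleared: issue it
                issued.append(tid)
                break
        else:
            issued.append(None)      # every thread stalled this cycle
    return issued

# "B" has an outstanding dependency in cycle 1, so it is skipped
# and "A" is issued twice in a row.
ready = {"A": [True, True, False, True], "B": [True, False, True, True]}
print(schedule(ready, 4))  # ['A', 'A', 'B', 'A']
```

The key behavior is in the inner loop: a stalled thread costs nothing but its slot, and the scheduler immediately tries the next thread in the ring rather than idling.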
Many threads in a GPU, even one with 16 shader cores, could be consumed in a few clock cycles. The ultra-threaded shader unit (see the block diagram in Fig. 5.40) could dynamically balance the load, passing tasks to whichever shader was free. The pool of 256 possible threads could service the several tasks in flight.
Each shader core had five units: two 64-bit-wide adders (ADD), two 64-bit-wide multipliers (MUL), and one transcendental unit. It was like AMD/ATI's VLIW5 architecture, with four of the pipes somewhat limited. Each 64-bit unit could do 1 × 64, 2 × 32, or 4 × 16 operations, making the core capable of 5–17 ops per cycle.
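The 5–17 range follows directly from the lane packing. The tally below is a hypothetical back-of-the-envelope check, assuming each of the four ADD/MUL units independently picks one packing mode per cycle and the transcendental unit issues one op.

```python
def ops_per_cycle(packings: list) -> int:
    """Ops issued in one cycle by the five-unit core: four 64-bit
    ADD/MUL units, each packed as one 64-bit, two 32-bit, or four
    16-bit lanes, plus one op from the transcendental unit."""
    lanes = {64: 1, 32: 2, 16: 4}
    assert len(packings) == 4, "four ADD/MUL units"
    return sum(lanes[w] for w in packings) + 1  # +1 for the transcendental

print(ops_per_cycle([64, 64, 64, 64]))  # minimum: 5 ops
print(ops_per_cycle([16, 16, 16, 16]))  # maximum: 17 ops
```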
The dash-outlined blocks (3D pipeline, vector graphics pipeline, 2D pipeline)
could be segmented or stacked into various configurations. A GPU could have more
vector graphics engines or more ultra-threaded shaders, up to 32 cores. The number
of graphics front-ends depended on how many shader cores were connected to each
graphics core; the counts could be confusing.
Vega’s fine-grain approach was fundamentally different from that of other
suppliers. Other GPU suppliers tended to balance pixel, shader, and other resources
in one of two ways: either a simple architecture like older GPUs like Nvidia’s Tegra
4 (partition in advance for a fixed number of pixel and vertex shaders) or unified
GPUs like Nvidia’s Kepler, which could be programmed for multiple tasks. Vivante’s
GPUs had a unified shader architecture and were more granular. That concept led
the company to its virtualized GPU.
The feature set of the Vega was as follows:
• GPU speeds over 1 GHz while reducing overall power consumption
GPU virtualization refers to technologies that allow the use of a GPU to accel-
erate graphics or GPGPU applications running on a virtual machine. GPU
virtualization is used in various applications such as desktop virtualization,
cloud gaming, and computational science.
A few years earlier, in 2001, Dr. Wei-Ming Dai had founded VeriSilicon to supply
IP-centric, platform-based custom silicon solutions and end-to-end semiconductor
turnkey services. The company began its IP portfolio with ASIC standard cell libraries
and other foundry foundation IPs. Also added were larger IP blocks such as A/D
converters, USB2.0, and peripherals.
In 2006, the company expanded its IP portfolio by buying LSI’s ZSP unit, offering
DSP and associated technology. Based on ZSP, the company developed Dolby and
DTS-certified HD audio processes, voice quality enhancement, and multimode and
multiband wireless baseband platforms. VeriSilicon established a licensing agree-
ment with Google and became the exclusive supplier of the Hantro G1 multi-format
video encoder and decoder supporting both H.264 and VP8. Later, VeriSilicon and
Google codeveloped the Hantro G2 multi-format video decoder IP to support ultra-
HD 4K video decoding for integrated high-efficiency video coding (HEVC) and
VP9. (Google bought On2 Technologies for $125 million in 2010. In 2007, On2
Technologies bought Hantro Products, and as a result, Hantro became part of the
Google portfolio.)
Without really planning it, Vivante and VeriSilicon were headed in the same
direction, with pretty much the same goal in mind: keep adding IP cores to become
a one-stop-shopping point for companies designing SoCs. They were not the only
companies with that idea.
So, both companies were, independently and unknown to each other, on the hunt
for innovative technology to fill out their portfolios—get big or die.
The irony was that Wei-Ming Dai is the genius big brother of Wei-Jin Dai, yet the
two never discussed the obvious. Two board members and investors in VeriSilicon
suggested the companies merge.
Click, click, and click—it was so obvious everyone slapped their foreheads. The
addition of GPU and other cores from Vivante would provide VeriSilicon just what
it needed for its growing IP portfolio, the stickiness of the SoC design service. The
deal would also give VeriSilicon more exposure to the larger tier-one customer base
and open opportunities in the automotive market that Vivante was serving and the
IoT market that both VeriSilicon and Vivante wanted to enter.
In late 2014, VeriSilicon, described as a SiPaaS (Silicon Platform as a Service)
company, filed for an IPO on the U.S. Nasdaq under the symbol VERI.
Then, in October 2015, VeriSilicon announced it would buy Vivante in an all-
stock deal, and the result would be a new, larger company approaching $200 million
in sales [25].
VeriSilicon had two operations: design and turnkey services, and IP cores. The
company accordingly split itself into two business units, IP and services. In addition to being
an officer and executive VP of the company, Wei-Jin Dai took over the IP business
unit. Another senior executive took over the services business unit, reporting to
Dr. Wayne Wei-Ming Dai, the board chairman, and the president and CEO. Mike
Cai was the CTO and went on to patent anti-aliasing techniques that he assigned to
the company.
That was one of those genuinely synergistic deals that do not come around too
often. In addition to mixed-signal and radio frequency (RF) IP, DSPs, and
video cores, VeriSilicon supplied custom silicon solutions for microelectromechanical
system (MEMS) sensors found in over a billion devices, including tier-one
smartphones and tablets. VeriSilicon kept the Vivante GPU brand. Also, in 2015
Vivante introduced a partitioned GPU design well suited for virtualization, a vision
processor, and automotive safety applications.
After joining VeriSilicon, Vivante introduced the Arcturus GC8000 series based
on the Vega architecture. It was compatible with OpenGL ES 3.2, OpenVG 1.1,
OpenVX 1.1, OpenGL 4.0, Vulkan 1.0, and OpenCL 2.0 (Fig. 5.41).
The GC8000 added early culling, improved hardware virtualization, and lossless
data compression features. There were many improvements to the old Vega architec-
ture, with the new code name Arcturus. The GC8000 offered flexible configurations
and customizable RTL for specific applications. VeriSilicon began shipping GC8000
RTL in 2Q16.
The GC8000 had doubled the triangle throughput of the GC7000. Vivante accom-
plished that by enhancing the fixed function pipeline, including the primitive assem-
bler and setup. That added die area, but with RTL enhancements, it yielded a 30%
performance-per-area increase for triangle and pixel throughput. An early culling
stage discarded unneeded geometry before the processing stage, which helped
increase throughput. The GC8000 could do dynamic load balancing and assign
threads to less busy clusters, as the original Vega architecture could.
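The GC8000’s early culling stage discarded unneeded geometry before further processing. One common form of such culling is a screen-space backface test, which the following Python sketch illustrates; it is an assumption about what "unneeded geometry" means here, not Vivante’s actual hardware logic:

```python
def is_front_facing(v0, v1, v2):
    """Signed area of a screen-space triangle: positive for counter-clockwise
    (front-facing) winding, negative for clockwise, zero for degenerate."""
    (x0, y0), (x1, y1), (x2, y2) = v0, v1, v2
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)

def early_cull(triangles):
    """Discard back-facing and zero-area triangles before shading."""
    return [t for t in triangles if is_front_facing(*t) > 0.0]

tris = [((0, 0), (1, 0), (0, 1)),   # counter-clockwise: kept
        ((0, 0), (0, 1), (1, 0))]   # clockwise: culled
print(len(early_cull(tris)))  # 1
```

Dropping such triangles before the shader stage is what yields the throughput gain: work that would produce no visible pixels never enters the pipeline.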
Fig. 5.41 VeriSilicon’s GPU could scale from IoT and wearables to AI training systems. Courtesy
of VeriSilicon
5.11 Nvidia’s Ampere (May 2020)
Nvidia had pushed the envelope on graphics chips since it integrated the geometry
processor and pixel shader into one chip and called it a GPU in 1999. The potential
for GPUs then vastly expanded in 2003, when a branch of development split off
from graphics applications and took advantage of the GPU’s parallel processing for
pure computation using Brook, a stream-programming extension of C developed
at Stanford. That work morphed into Nvidia’s famous CUDA and started the era of
GPU compute, also called general-purpose graphics processing unit (GPGPU) or
accelerated computing. Brook freed the GPU from OpenGL, which until then had
been used to program a GPU.
Nvidia—Getting to Ampere.
Over the next 17 years, Nvidia continued to develop new innovative and powerful
GPUs, targeted primarily at computer graphics and gaming, but with capabilities that
could be exploited for GPU computing.
The Pascal generation brought FP16 and INT8 support (the latter found in the Pascal
P4 and P40), as well as NVLink technology, which Nvidia deployed in a hybrid
cube-mesh topology for multi-GPU systems (Fig. 5.42). The GP100, the first
100-series GPU based on Pascal, was explicitly a data center part with 2× FP16 and
HBM2, announced at GTC in 2016.
Pascal was followed by the Volta GV100, released in March 2018, which introduced
a new SM (streaming multiprocessor) design and first-generation Tensor Cores for AI.
The GPU had 5120 shaders (80 SMs), 320 TMUs, 128 ROPs, and 640 tensor cores.
Volta was built on a 12 nm process and had 21.1 billion transistors and a die
size of 815 mm². It was implemented in an Nvidia Quadro AIB with 32 GB of HBM2
and a 4096-bit bus with 864 GB/s bandwidth and was capable of 16.6 TFLOPS (FP32).
The massive AIB sold for $9000 in 2018.
TU102 (Turing) came out after the GV100 (Volta), but the TU102 was primarily a
GeForce GPU, while the GV100 was a data center-only chip.
Volta introduced Tensor Core technology, a breakthrough feature that dramatically
accelerated matrix math operations, delivering 20× more compute than the previous generation.
The Volta generation also brought the first generation of NVSwitch devices and was
a key enabler of Nvidia’s 16-GPU DGX-2 AI servers.
Turing added INT8 support to Tensor Core technology, bringing great speedups to
AI inference applications, and included the introduction of the company’s T4 Tensor
Core GPU for AI inference and visual applications, running at just 70 W.
Fig. 5.42 Nvidia GPU growth in transistors and die size over time
Nvidia released the Ampere GA100 in May 2020. It was the biggest GPU ever
made in terms of transistors and area—54 billion transistors on an 826 mm² die.
Ampere added double-precision floating point (FP64) to Tensor Core technology to solve HPC
challenges. It introduced multi-instance GPU (MIG) technology to partition a single
GPU into multiple GPU instances (seven for the A100, four for the A30). Nvidia also
introduced TensorFloat-32 (TF32), an AI-optimized precision format which today is
the default FP32 precision for both the TensorFlow and PyTorch frameworks.
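TF32 keeps FP32’s 8-bit exponent range but carries only a 10-bit mantissa. As a rough illustration (an assumption-laden NumPy model that truncates where real Tensor Cores round to nearest), the format can be mimicked by zeroing the low 13 mantissa bits of an FP32 value:

```python
import numpy as np

def to_tf32(x):
    """Model TF32 by truncation: keep the sign bit, the full 8-bit FP32
    exponent, and the top 10 mantissa bits; zero the low 13 mantissa bits.
    (Real Tensor Cores round to nearest; truncation is close enough to
    show the precision loss.)"""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    mask = np.uint32(0xFFFFE000)  # 1 sign + 8 exponent + 10 mantissa bits kept
    return (bits & mask).view(np.float32)

# 1.0 is exactly representable in 10 mantissa bits, so it survives unchanged;
# 1/3 loses its low-order mantissa bits but stays within about 2^-11 relative error.
print(float(to_tf32(np.float32(1.0))))   # 1.0
print(float(to_tf32(np.float32(1 / 3))))
```

The small mantissa is what makes the "out-of-the-box" speedup possible: most FP32 networks tolerate the precision loss, so frameworks can route FP32 matrix math through the tensor cores without code changes.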
Ampere had an improved streaming multiprocessor (SM) design that increased
power efficiency and CUDA compute capability. It marked Nvidia’s full commitment
to the data center and the use of GPUs as compute accelerators. One could argue
that Fermi in 2009 or Kepler in 2012 were bigger steps (beyond the original G80 CUDA
introduction in 2006) in providing features for compute workloads. Wherever one
puts the marker, with the added GPU-compute features and subsequent larger caches,
the transistor count (and die size) of Nvidia’s GPUs began to increase.
Gamer sites had been publishing leaks and speculation about Ampere for months,
expecting it to be the next generation of ray tracing gaming GPUs. They were
surprised by what they discovered: the GA100 Ampere was no gamer chip; it was a
killer GPU-compute chip with special emphasis and features for AI training and
server-based inferencing, in addition to HPC. It was a supercomputer in a chip.
There would be GA10x chips based on the Ampere architecture targeting the
gaming/graphics market as well. Table 5.8 is a summary of the chip’s most salient
features.
One big feature of the chip alluded to in the table was sparsity. Fine-grained
structured sparsity was a method to double compute throughput for deep neural
networks. It is important, and it needs a lot of transistors.
Sparsity was essential and possible in deep learning because individual weights
evolve during the learning process in a neural net. Only a subset of weights acquired a
meaningful purpose in determining the learned output by the end of network training;
the remaining weights were no longer needed.
A fine-grained sparsity structure imposes a constraint on the pattern that makes it
easier and more efficient for hardware to align inputs. Deep learning networks could
adapt weights during the training process based on training feedback. Building on
that, Nvidia figured out that, in general, the structure constraint did not affect the accu-
racy of the trained network for inferencing. That enabled inferencing acceleration
with sparsity—something of a breakthrough.
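Ampere’s structured-sparsity pattern is commonly described as 2:4: out of every four consecutive weights, two may be nonzero. The NumPy sketch below only illustrates the constraint itself (keeping the two largest-magnitude weights per group of four); it is not Nvidia’s pruning recipe:

```python
import numpy as np

def prune_2_of_4(w):
    """Enforce the 2:4 fine-grained structured sparsity pattern: in every
    contiguous group of four weights, keep the two largest in magnitude
    and zero the other two. The weight count must be a multiple of four."""
    w = np.asarray(w, dtype=np.float32)
    groups = w.reshape(-1, 4)
    # indices of the two smallest-magnitude entries in each group of four
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.array([1.0, -3.0, 0.5, 2.0, 4.0, 0.1, -0.2, 5.0], dtype=np.float32)
print(prune_2_of_4(w))  # zeros land on the two smallest-magnitude slots of each group
```

The fixed 2-of-4 pattern is what lets the hardware help: because the position of the zeros is constrained, the tensor cores can skip them with simple indexing instead of a general gather, which is where the doubled throughput comes from.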
For training acceleration, sparsity must be introduced early in the process to offer
a performance benefit, and methodologies for training acceleration without accuracy
loss are an active research area.
The headline numbers for Ampere were 3456 FP64 CUDA cores (32 per SM),
plus 432 Tensor (AI) cores (Fig. 5.43). The 3456 FP64 cores and a separate block
of 6912 FP32 cores made 10,368 cores in total; a further separate set of 6912 INT32
cores brought the count to 17,280 cores, plus the 432 tensor cores.
The A100 had 40 GB of high-speed HBM2 memory delivering 1555 GB/s of
memory bandwidth—a 73% increase over the Tesla V100. The chip could drive six
stacks of HBM2 but had only five enabled in its first release.
A third-generation tensor core also boosted FP16 throughput 2.5x over the V100
GPU. It added integer types for deep learning (DL) inferencing that could deliver a
further doubling of throughput with the sparsity feature. A100’s tensor core added
new support for TensorFloat-32 (TF32) for an out-of-the-box acceleration of FP32
networks in DL frameworks, at a rate 10x faster than FP32 on V100. A100 also
brought tensor cores to HPC, adding IEEE-compliant FP64 support at a rate 2.5x
faster than V100 FP64.
The GA100 was built on TSMC’s 7 nm process, and the die was 826 mm² (Fig. 5.44).
There were 12 Nvidia NVLinks per GPU, which offered up to 600 GB/s of inter-
GPU bandwidth (300 GB/s per direction). The GPU was accessible through six
Nvidia NVSwitches, which offered 4.8 TB/s of bidirectional bandwidth.
In addition to being big, and hungry for data, it was also hungry for watts and
would like 400, please. That clearly put it out of the realm of a gaming chip, but not
a supercomputer or data center, and given all the processors and memory it had, it
was pretty efficient.
A family of Amperes.
As with all Nvidia’s designs, the company was able to create several smaller
versions of the A100 GPU, as illustrated in Table 5.9.
As the table indicates, the model number sequence does not follow a sequential
calendar introduction schedule.
5.11.1 A Supercomputer
Nvidia took eight A100s and a pair of AMD Epyc 7742 64-core, 128-thread CPUs,
along with 1 TB of RAM, to create the DGX A100 supercomputer. The DGX A100
offered up to 10 POPS of INT8 performance, 5 PFLOPS of FP16, 2.5 PFLOPS of
TF32, and 156 TFLOPS of FP64 compute performance. To put that in perspective,
the previous-generation Volta-powered DGX-2 with 16 GPUs offered two PFLOPS
of mixed-precision performance (Fig. 5.45).
Within the DGX A100 were eight single-port Mellanox Connect-X6 AIBs
for clustering (with 200 GB/s of peak throughput), a dual-port ConnectX-6 for
In November 2021, Imagination Technologies introduced its latest ray tracing IP, the
Imagination CXT, for its flagship B-Series GPU IP. The announcement marked the
debut of Imagination’s PowerVR Photon ray tracing architecture.
Photon, said Imagination, was the industry’s most advanced ray tracing archi-
tecture, bringing desktop-quality visuals to mobile and embedded applications. The
biggest news was that it had already been licensed for multiple markets.
The Photon architecture represented over a decade of development by Imagination
in making ray tracing workable in low-power devices. The company
box structure and the triangle geometry. The RAC fully offloads these expensive
operations to dedicated hardware, delivering significant area and power-efficiency
benefits (Fig. 5.47).
The RAC consisted of the ray store, ray task scheduler, and coherency gatherer.
It was coupled to two 128-wide unified shading clusters (USCs) that Imagination
said had high-speed, dedicated data paths for efficient and low-power ray-traced
deployment.
The box tester unit performed the search for rays intersecting with objects in 3D
space. It tested rays against axis-aligned boxes from the scene hierarchy. The RAC
also had dedicated per-triangle testing hardware (dual triangle tester units). Those
blocks got the RAC to a Level 2 RT solution.
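In software terms, the test each box tester unit performs is the classic slab method: intersect the ray with each pair of axis-aligned planes and check that the resulting intervals overlap. A minimal Python sketch (with hypothetical names, not Imagination’s implementation):

```python
import math

def ray_box_hit(origin, inv_dir, box_min, box_max):
    """Slab-method ray vs. axis-aligned box test. inv_dir holds the
    per-component reciprocals of the ray direction (precomputed, as a
    hardware tester would). Only forward hits (t >= 0) count."""
    t_near, t_far = 0.0, math.inf
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0, t1 = (lo - o) * inv, (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near <= t_far

# A ray aimed at the unit box hits it; one displaced well off-axis misses.
print(ray_box_hit((-5, -5, -5), (1, 1, 1), (0, 0, 0), (1, 1, 1)))   # True
print(ray_box_hit((-5, 10, -5), (1, 1, 1), (0, 0, 0), (1, 1, 1)))   # False
```

Each ray may run this test against thousands of boxes per frame, which is why offloading it to fixed-function hardware pays off in both area and power.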
Ray traversal was in hardware and used a dedicated task scheduling and USC
interface. Those blocks used dedicated ray store and dedicated ray state tracking
units.
The predictive box scheduler handled ray traversal, tracking, and checking in
hardware with tight integration with the CXT USCs. The ray store kept ray data
structures on-chip during processing, which provided high bandwidth read–write
access to all units in the RAC. That, claimed the company, avoided slowdowns or
power increases from storing or reading ray data to DRAM. The ray task scheduler
offloaded the shader clusters, deploying and tracking ray workloads with dedicated
hardware, keeping ray throughput high and power consumption low. The packet
coherency gatherer unit analyzed all rays in flight and bundled rays from across the
scene into coherent groups enabling them to be processed efficiently. Imagination
said they had patented that technology, and it got the RAC up to Level 4.
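A toy model of what the coherency gatherer does is to bundle in-flight rays by a coarse direction key so that rays marching the same way through the scene are processed together. The sketch below uses the sign pattern (octant) of the direction as that key; this is an illustrative assumption, not Imagination’s patented grouping scheme:

```python
def gather_coherent(rays):
    """Toy coherency gatherer: bundle in-flight rays into packets keyed by
    direction octant (the sign pattern of dx, dy, dz), so rays travelling
    roughly the same way are grouped for processing."""
    packets = {}
    for ray_id, (dx, dy, dz) in rays:
        octant = (dx >= 0, dy >= 0, dz >= 0)
        packets.setdefault(octant, []).append(ray_id)
    return packets

rays = [(0, (1.0, 2.0, 0.5)),    # +++ octant
        (1, (-1.0, 0.3, 0.9)),   # -++ octant
        (2, (0.2, 0.1, 0.4))]    # +++ octant
print(gather_coherent(rays))
```

Grouping like this matters because coherent rays touch the same acceleration-structure nodes and memory lines, so a packet amortizes fetches that divergent rays would each pay for separately.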
The company claimed that a RAC could deliver up to 433 MRay/s of ray throughput
and up to 16 GBoxTests/s, all scaling with additional RAC units and Imagination
multicore. It was compliant with the Vulkan RT ray query and ray pipeline
and represented Level 4 RTS in hardware. An implementation of RACs within a
GPU is shown in Fig. 5.48.
Imagination CXT, said Imagination, was a significant step forward for rasterized
graphics performance, with 50% more compute, texturing, and geometry perfor-
mance than Imagination’s previous-generation GPU IP. The company claimed its
low-power superscalar architecture delivered high performance at low clock frequen-
cies for exceptional FPS/W efficiency, while Imagination Image Compression
(IMGIC) reduced bandwidth requirements.
Imagination said the RTL-based IP Photon architecture could be scaled to the
cloud, data center, and PC markets through Imagination’s multicore technology.
That, claimed the company, could generate up to 9 TFLOPS of FP32 rasterized
performance and over 7.8 GRay/s of ray tracing performance while offering up to 2.5x
greater power efficiency than current Level 2 or 3 ray tracing solutions.
Bounding volume hierarchy (BVH) was a popular ray tracing acceleration technique
that used a tree-based acceleration structure containing multiple hierarchically
arranged bounding boxes (bounding volumes) that encompassed or surrounded
different amounts of scene geometry or primitives.
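The traversal of such a hierarchy can be sketched in a few lines: test the ray against a node’s box and descend only into children whose boxes the ray enters. This toy Python version (illustrative only, far simpler than a production ray tracer) uses the same slab test hardware box testers perform:

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

def _hit_box(box_min, box_max, origin, inv_dir):
    # Slab test: the ray enters the box if the per-axis t-intervals overlap.
    t_near, t_far = 0.0, math.inf
    for lo, hi, o, inv in zip(box_min, box_max, origin, inv_dir):
        t0, t1 = (lo - o) * inv, (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near <= t_far

@dataclass
class BVHNode:
    box_min: tuple
    box_max: tuple
    left: Optional["BVHNode"] = None
    right: Optional["BVHNode"] = None
    primitives: List[int] = field(default_factory=list)  # leaf payload

def traverse(node, origin, inv_dir, hits):
    """Depth-first BVH traversal: skip whole subtrees whose bounding box the
    ray never enters; collect candidate primitive ids at the leaves."""
    if node is None or not _hit_box(node.box_min, node.box_max, origin, inv_dir):
        return
    if node.left is None and node.right is None:
        hits.extend(node.primitives)
        return
    traverse(node.left, origin, inv_dir, hits)
    traverse(node.right, origin, inv_dir, hits)

# Two leaf boxes along the main diagonal under one root box.
root = BVHNode((0, 0, 0), (4, 4, 4),
               left=BVHNode((0, 0, 0), (2, 2, 2), primitives=[0]),
               right=BVHNode((2, 2, 2), (4, 4, 4), primitives=[1]))
hits = []
traverse(root, (-1, -1, -1), (1, 1, 1), hits)  # diagonal ray, direction (1, 1, 1)
print(hits)  # [0, 1]
```

Because whole subtrees are skipped when their box is missed, a ray touches only a logarithmic slice of the scene rather than every primitive, which is what makes the hierarchy an acceleration structure.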
Coherent ray tracing solved the integer BVH node decompression overhead
problem by spreading the decompression cost over many rays.
Imagination showed a demonstration video that featured global illumination (GI),
per-pixel ray tracing, denoising, lighting, tone-mapping, and TAA. RT GI added
ambient lighting, grounding the objects in the scene and giving the most realistic
lighting. The demo ran at 1080p at between 30 and 60 fps within a mobile power budget.
Imagination also offered a ray tracing software tool, PVRTune, which allowed
developers to see low-level ray tracing counters such as rays per second, box tester
load, cache hit rate, and the number of recursive traversal rays per second (Fig. 5.49).
The PVR ray tracing simulation Vulkan layer in the Imagination SDK could be
used to simulate the capabilities and behavior of the CXT block (Fig. 5.50).
Fig. 5.49 Rays per second monitor in PVRTune. Courtesy of Imagination Technologies
Fig. 5.50 Imagination created the industry’s first real-time ray tracing silicon in 2014. It showed
the R6500 test chip code named Plato. Courtesy of Imagination Technologies
As Imagination’s first Level 4 IP, CXT offered developers a hardware block dedi-
cated to accelerating ray tracing that promised more power and area efficiency than
other solutions. Consumers had come to appreciate the realism ray tracing provided
and would demand it in all their devices in the near future.
5.12.1 Summary
software transformer engine that automatically switches between FP8 and FP16
formats, based on the workload and some AI cleverness.
During his keynote speech, Jensen Huang, Nvidia’s CEO, said the Hopper H100
provides a nine-times boost in training performance over Nvidia’s A100 and thirty
times more large-language-model inference throughput (Fig. 5.52).
Nvidia has continued with HBM, and the H100 is the first to use Gen5 memory,
reaching, claims Nvidia, 40 terabits per second of I/O bandwidth, 1.5× faster than
the A100’s HBM2E. While the A100 was available in 40 and 80 GB models, the
H100 comes with only 80 GB (so far).
Huang claimed twenty H100s could sustain the equivalent of the entire world’s
Internet traffic (Fig. 5.53).
With 80 GB per AIB, eight AIBs give a pool of 640 GB that, to a program (or
programmer), looks like one contiguous memory pool. The NVLinks, with super high
bandwidth and very low latency, make that possible. Tying eight AIBs together like
that is a kind of super-sizing of the chiplet concept.
Chiplets? Just say no.
Nvidia was asked about that—why has the company continued with the megachip
approach when its competitors are going with the chiplet approach?
Huang said everyone would acknowledge that a large chip can do interprocess
communication better than multiple chips that have to go through I/O circuitry—wires
on die are orders of magnitude better than wires off die. Big chips are hard, he said,
but we are really good at it, and it’s a competitive advantage. Do a superchip before
you do a chiplet, said Huang.
We asked Nvidia if it had any comparative data on Hopper versus Intel’s Ponte
Vecchio AIB, and Nvidia said no. Nvidia has launched the Nvidia H100 Tensor Core
GPU, providing an order of magnitude leap in performance versus prior generation
on the widest array of HPC and AI applications including up to 30X more inference
throughput and up to 9X faster training.
Fig. 5.53 Nvidia’s H100 Hopper AIB with NVLinks (upper left) supports a unified cluster of eight
GPUs. Courtesy of Nvidia
The company is taking the Hopper subassembly (shown above) and using multiple
copies of it to build the DGX H100 supercomputer (Fig. 5.54).
Then, Nvidia is taking 16 DGX H100s to build a super-duper supercomputer they
are calling Earth 2 that will model and predict world weather patterns (Fig. 5.55).
The company also introduced new and updated versions of cloud applications,
such as Merlin 1.0 for recommender systems and version 2.0 of both its Riva speech
recognition and text-to-speech services.
5.13.1 Summary
Nvidia has cleverly leveraged its GPU prowess and propelled itself into the AI
and supercomputer industry. Although built for the data center, Hopper had texture
processors, which were used in CUDA programs, not just for graphics. They performed
auxiliary data fetches and could improve cache availability.
Twenty twenty-one and the first part of 2022 were interesting for Nvidia, to say
the least. It made a gallant attempt to acquire Arm and probably underestimated the
blowback from other companies and their lobbying agents. A few weeks ago, it
suffered a major hack attack and, like everyone else, has seen its sales restrained
by supply issues.
Huang believed the data center is fundamentally changing and becoming AI
factories, where raw data in becomes intelligence out.
Nvidia has embraced the digital twin and describes Earth 2 as a digital twin of
the Earth.
5.14 Conclusion
Researchers at Stanford University in California were among the first to exploit the
computational power of the parallel processor capabilities of the GPU in 2000. ATI
was the leader in the effort, but Nvidia somehow took over and made things easier
with CUDA, its C-like programming environment, which hid the tediousness of the
architecture.
GPUs were employed in the scientific segments as compute accelerators for problems
that had massive amounts of data that could be processed in parallel. Then, in
the mid-2010s, the application of GPUs to AI training took off; suddenly everyone
was an AI expert, and Nvidia was king of the hill.
The concepts extended to auto-drive systems for vehicles, natural language recog-
nition via the web, and advanced rendering techniques involving ray tracing. AI
would solve all our problems—if you had the data for it to learn from.
This chapter is not finished as we are still exploring the depths and possibilities
of AI.
References
1. Deering, M. F. and Naegle, D. N. The SAGE Graphics Architecture, in Proc. SIGGRAPH 2002.
2. Deering, M. F. Hardware: Zulu/Sage/XVR-4000, http://www.michaelfrankdeering.com/Pro
jects/HardWare/Zulu/Zulu.html
3. Peddie, J. SiliconArt’s RayCore ray-tracing processor: Disruptive technology from a little
startup, TechWatch volume 14, number 17, page. 19, (August 26, 2014)
4. Peddie, J. The Arc of the Alchemist, TechWatch, (August 20, 2021), https://www.jonpeddie.
com/report/the-arc-of-the-alchemist/
5. Koduri, R. @Rajaontheedge, https://twitter.com/Rajaontheedge/status/1453808598283210752
6. Compete with Intel Core i5! Exclusive first test of domestic 3.0GHz x86 processor performance,
Weibo (October 17, 2018), https://itw01.com/QCT3JEW.html
7. Shilov, A. Zhaoxin’s First Discrete GPU Pictured, (June 25, 2021), https://www.tomshardw
are.com/news/zhaoxin-discrete-gpu
8. Announcement for a material resolution by the Company’s Board, (November 4, 2021), https://
tinyurl.com/td8npxre
9. Chen Weiliang, founder of Muxi: There is no shortcut to making cores, (August 9, 2021), https://
min.news/en/economy/8cd295abfeb24bc8bdedbc88809b5b01.html
10. AMD Opens US$16m Shanghai R&D Center, China Daily, (August 23, 2006) http://www.
china.org.cn/english/BAT/178811.htm
11. SCIO briefing on development of industry, and information technology (March 1, 2021), http://
english.scio.gov.cn/pressroom/node_8022792.htm
12. IDC Artificial Intelligence Drives China’s GPU Server Market (June 03, 2019), https://www.
ictransistors.com/news/idc-artificial-intelligence-drives-china-s-gpu-23897463.html
13. O’Keeffe, K and Spegele, B. How a Big U.S. Chip Maker Gave China the ‘Keys to the Kingdom,’
(June 27, 2019), https://www.wsj.com/articles/u-s-tried-to-stop-china-acquiring-world-class-
chips-china-got-them-anyway-11561646798
14. Pirzada, U. No, AMD Did Not Sell The Keys To The x86 Kingdom—Here’s How The
Chinese Joint Venture Works, (June 29, 2019), https://wccftech.com/no-amd-did-not-sell-keys-
kingdom-how-thatic-jv-works/
15. Peddie, J. Jingjia Microelectronics introduces new GPU, TechWatch, (August 27, 2019), https://
www.jonpeddie.com/report/jingjia-microelectronics-introduces-new-gpu/
16. Kubota, Y. China Sets Up New $29 Billion Semiconductor Fund, The Wall Street Journal,
(October 25, 2019), https://www.wsj.com/articles/china-sets-up-new-29-billion-semicondu
ctor-fund-11572034480
17. Qilin, Mythical Chinese creatures, https://about-mythical-creatures.weebly.com/qilin.html
18. Alphamosaic’s dual CPU media processor, TechWatch, V.4.3, p. 6. (February 9, 2004)
19. Peddie, J. Introducing DMP, TechWatch, Volume 5, Number 7 (April 11, 2005)
20. Shuler, K. AMBA AXI and OCP: Behind the Standards, (April 04, 2011), https://www.arteris.
com/blog/bid/59646/AMBA-AXI-and-OCP-Behind-the-Standards
21. Peddie, J. Think Silicon shows early preview of industry’s first RISC-V GPU, TechWatch,
(December 04, 2019)
22. Langhi, R. Vivante signs 15th GPU licensees, Industry news Vivante, (June 8, 2009), https://
tinyurl.com/ymd3644t
23. Peddie, J. Vivante Multicore GPU IP TechWatch, Volume 10, number 25, pp.13, (December
21, 2010)
24. Vivante Vega IP Enables Full GPU Hardware Virtualization on Mobile and Home Entertain-
ment Devices, Vivante Vega IP Enables Full GPU Hardware Virtualization on Mobile and Home
Entertainment Devices (design-reuse.com)
25. Peddie, J. Vivante acquired by VeriSilicon in synergistic deal, TechWatch Volume 15, Number
21, pp.11, (October 20, 2015)
26. Parkerson, S. Imagination Technologies Ray Tracing Graphics Cores Provide New Opportu-
nities for Unity Devs, (March 19, 2014), https://appdevelopermagazine.com/imagination-tec
hnologies-ray-tracing-graphics-cores-provide-new-opportunities-for-unity-devs/
Chapter 6
Open GPU Projects (2000–2018)
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
J. Peddie, The History of the GPU - New Developments,
https://doi.org/10.1007/978-3-031-14047-1_6
In 2000, Timothy Miller was designing a graphics chip for his employer. A recent
computer science graduate of the University of South Florida, and never having been
told he could not do it, he created a complete semiconductor design using whatever
tools he could get his hands on.
Later, in 2004, Timothy Miller had his eureka moment. Surely, he thought, I cannot
be the only person in the world trying to design a GPU. So, he decided he would
share his developments, simulators, software tools, and most importantly, the ISA—
the instruction set architecture of his GPU design. He set up an open-source platform
under the MIT license. Miller named it the Open Graphics Project (OGP) and
launched it in October 2004 [1]. It, in turn, spawned other open organizations
such as the Open Hardware organization (Fig. 6.1).
As people joined the OGP, they wanted to find a manufacturer who could build the
chip so they could have hardware to work with. Miller learned that a small production
run of Open Graphics’s chips would cost about $2M. So Miller decided to create an
offshoot company called Traversal Technology Inc.
6.1 Open Graphics Project (2000)
One of Miller’s concerns was how the community would interact with the
company. To avoid any conflicts or even the impression of conflict and to safeguard
the interests of the OGP community, he suggested the creation of a new organization.
Based on that recommendation and supported by the rest of the community,
electrical engineer Patrick McNamara and Traversal Technology Inc. founded the
Open Hardware Foundation (OHF) in 2007. The goal was to facilitate the design,
development, and production of free and open hardware [2].
Another goal of the OHF was to help fund the production of open graphics prod-
ucts by providing Traversal with a known number of sales. Traversal benefited by
having less financial risk associated with producing the graphics chip. The open-
source community benefited by having hardware available at reduced or no cost
for developers who could contribute further to the project. But in 2009, McNamara
announced that to better support the Open Graphics Project, the foundation would
apply its funds (the product of donations) toward the Linux Fund [3].
Thanks to the funding help from the Linux Fund, Traversal built 25 boards and
distributed them to researchers worldwide [4] (Fig. 6.2).
The open graphics processor project was an idea that would not die. Since the
introduction of the commercial GPU by Nvidia in 1999, the usefulness and scalability
of GPUs had been irresistible. People wanted to democratize the GPU and put it into
the hands of more users and researchers, but without the cost and restrictions of
commercial devices and APIs.
In late 2012, Jeff Bush heard about Timothy Miller and sent him an email. Bush
told Miller of his work to establish an open GPU project. Bush pointed Miller to
the work he had already open sourced (which he had been working on for around
a year). Bush was in the process of proving the design by building an FPGA. As it
turned out, Miller was looking at some similar concepts. After a few conversations,
the two decided to collaborate and coauthor some academic papers.
Nyuzi had been in development since 2010 and became a fully functional open-
source GPU inspired by Larrabee. Although architecturally different from the SIMD
architectures from AMD and Nvidia, researchers at Binghamton University and
several other places used it to investigate GPUs.
In 2015, Jeff Bush collaborated with Timothy Miller and Aaron Carpenter, through
Binghamton University to further study and promote Open GPU projects. Bush gave
a presentation at the IEEE International Symposium on Performance Analysis of
Systems and Software (ISPASS) on “Nyami: A Synthesizable GPU” [5].
Nyuzi (previously named Nyami) was an experimental GPU processor hardware
design focused on compute-intensive tasks (also referred to as GPGPU or GPU
compute). Bush and Miller said it was optimized for use cases like deep learning
and image processing.
That project included a synthesizable hardware design written in System Verilog,
an instruction set emulator, an LLVM-based C/C++ compiler, software libraries, and
tests. The design could be used to experiment with micro-architectural and instruction
set design trade-offs.
6.3 MIAOW (2015)
6.4 GPUOpen (2015)
AMD announced GPUOpen on December 15, 2015 [12], and released it on January
26, 2016. It started as a middleware software suite of visual effects for computer
games developed by AMD for its Radeon processors. It was initially released
to compete against Nvidia’s GameWorks; however, over time, AMD expanded its
scope to include past generations of its GPUs’ ISAs. The majority of GPUOpen
content was open-source software, unlike GameWorks, which was criticized for its
proprietary and closed nature.
AMD had released ISA manuals for its GPUs since the Radeon R600 (a GPU that
helped usher in the DirectX 10 era in 2006). According to AMD, the main reasons
for publishing an ISA document were the following:
• To specify the language constructs and behavior, including the organization of
each type of instruction in both text syntax and binary format.
• To provide a reference of instruction operation that compiler writers could use to
maximize the processor’s performance.
The ISA information was intended for programmers writing application and
system software, including operating systems, compilers, loaders, linkers, device
drivers, and system utilities. It assumed that programmers were writing compute-
intensive parallel applications (streaming applications) and that they had an under-
standing of requisite programming practices.
AMD posted its RDNA ISA in September 2020 [13]. The first product featuring
RDNA was the Radeon RX 5000 AIBs, introduced in July 2019. The processor
implemented a parallel micro-architecture suitable for computer graphics applica-
tions and general purpose data-parallel applications. Data-intensive applications that
require high bandwidth or are computationally intensive can run on an AMD RDNA
processor. Figure 6.5 shows the organization of the AMD RDNA series processors.
The document described the environment, organization, and program state of
AMD RDNA generation devices and detailed the instruction set and the microcode
formats.

6.5 SCRATCH (2017)
In October 2017, Pedro Duarte, Pedro Tomas, and Gabriel Falcao from the Universidade
de Coimbra and Universidade de Lisboa in Portugal presented a paper at the
ACM MICRO-50 '17 symposium on SCRATCH, a soft-GPGPU architecture and
trimming tool [14]. (There is also Scratch, a visual programming environment that
lets users, primarily ages 8–16, learn computer programming while working
on personally meaningful projects such as animated stories and games.)
The researchers postulated that power consumption limitations constrained
advanced signal processing and artificial intelligence algorithms in high-performance
and embedded supercomputing devices and systems. GPUs helped improve
throughput-per-watt in many compute-intensive applications. However,
dealing more efficiently with the autonomy requirements of intelligent systems
demanded power-oriented, customized architectures. Such designs had to be
specially tuned for each application without manual redesign of the entire hardware
while also being capable of supporting legacy code (Fig. 6.6).
In kernel B, integer and floating-point instructions were detected. Thus, the archi-
tecture was trimmed (e.g., by considering available resources and power consumption
limitations) to support both FU types (Fig. 6.7).
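The trimming step described above can be sketched roughly as follows: scan a kernel's instruction stream and instantiate only the functional-unit (FU) types the kernel actually needs. The instruction names and FU categories below are invented for illustration and are not SCRATCH's actual tables.

```python
# Sketch of architecture trimming: decide which functional-unit (FU) types
# to instantiate on the FPGA fabric from the opcodes a kernel uses.
# Opcode names and categories are illustrative only.
INT_OPS   = {"add_i32", "mul_i32", "and_b32"}
FLOAT_OPS = {"add_f32", "mul_f32", "fma_f32"}

def required_fus(kernel_instructions):
    fus = set()
    for op in kernel_instructions:
        if op in INT_OPS:
            fus.add("integer FU")
        elif op in FLOAT_OPS:
            fus.add("floating-point FU")
    return fus

kernel_a = ["add_i32", "and_b32"]    # integer only -> FP units trimmed away
kernel_b = ["add_i32", "mul_f32"]    # mixed -> both FU types kept
print(required_fus(kernel_a))
print(sorted(required_fus(kernel_b)))
```

In the real framework this analysis happens at compile time, so each application kernel yields its own trimmed-down, FPGA-implementable architecture.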
The researchers proposed a new SCRATCH framework that automatically iden-
tified the specific requirements of each application kernel regarding instruction set
and computing unit demands. That allowed for the generation of application-specific
and FPGA-implementable trimmed-down GPU-inspired architectures. Their work
was based on an improved version of the original MIAOW system. The researchers
extended the design to support 156 instructions and enhanced it to support a fast
Fig. 6.6 Two different trimmed architectures were generated for two distinct soft kernels. Courtesy
of Pedro Duarte and Gabriel Falcao from the Universities of Coimbra and Lisboa
Fig. 6.7 During compile-time, the instructions present in kernel A indicated that only scalar and
vectorized integer FUs should be instantiated on the reconfigurable fabric. Courtesy of Pedro Duarte
and Gabriel Falcao from the Universities of Coimbra and Lisboa

6.6 Libre-GPU (2018)
The Libre-GPU open-source project was initiated in 2018 by Luke Kenneth Casson
Leighton in The Hague. It was a formidable undertaking (Fig. 6.8).
After 12 years of studying SoCs, more than a hundred of them, he could not find one
that was open to the bedrock. They all had either closed GPU firmware, closed VPU
firmware, closed BIOS, DRM built in with e-fuses, or often blatant GPL copyright
violations (e.g., Amlogic, Allwinner, and others). Leighton was discouraged and, in
his eureka moment, decided the only way to solve that was to do it himself, properly.
Leighton described libre RISC-V M-Class as a 100% libre RISC-V + 3D GPU
chip for mobile devices. The project began its life because Leighton wanted there to
be a completely free Libre system on a chip offering. At the time, he had access to
$250,000 in funding to make it happen. The design would use a RISC-V processor
ISA with extensions to increase the ability to run as parallel processors—and the
GPU would essentially be software-based and use the Khronos Vulkan API structure
[15].
Leighton emphasized that his plan called for a hybrid CPU, VPU, and GPU. It
was not, as some suggested, a dedicated exclusive GPU. However, he pointed out,
the option existed to create a stand-alone GPU product.
The primary goal was to design a complete SoC that included libre-licensed VPU-
and GPU-accelerated instructions as part of the actual main CPU. That, reasoned
Leighton, would greatly simplify driver development, applications integration, and
debugging, reducing costs and time to market.
Leighton said he had no illusions about the project's cost, estimating it at more
than $6 million, with a contingency of up to $10 million. Leighton sought backers
to carry the project forward and hoped to have a tape-out (RTL code) by late 2020.
It did not happen.
The project started as a RISC-V Vulkan accelerator design. However, Leighton
dropped RISC-V and switched to the OpenPOWER ISA due to NDAs and other
organizational issues. The original design goal in 2018 was modest: 1280 × 720 at
25 FPS and 5–6 GFLOPS. Chromebook-type laptops were envisioned as the first
platform for the SoC (Fig. 6.9).
In February 2021, at the FOSDEM conference, Leighton presented his ongoing
work for the hybrid CPU/VPU/GPU OpenPOWER [16]. Leighton declared that the
initiative would be fully open source, including the hardware design. Figure 6.9 is
based on his presentation.
Leighton’s focus in 2021 was on developing an embedded SoC and not building
a PCIe-based GPU for an AIB. Those past open-source graphics AIB efforts, such
as Project VGA, became school science projects.
For the Vulkan implementation, Leighton continued to use the Rust-based Kazan
implementation with Simple-V extension for RISC-V and other design elements for
making a Vulkan software implementation more efficient [17].
6.8 RV64X (2019)

A group of enthusiasts led by Dr. Atif Zafar proposed a new set of graphics
instructions designed for 3D graphics and media processing. These new instructions were
built on the RISC-V base vector instruction set. They added support for new data types
that were graphics specific as layered extensions in the spirit of the core RISC-V ISA.
Vectors, transcendental math, pixel and textures, and z/frame buffer operations were
supported. It could be a fused CPU–GPU ISA. The group called itself the RV64X
Group because instructions would be 64 bits long (32 bits would not be enough to
support a robust ISA, even though the data paths were 32 bits wide) (Fig. 6.11).
The world has plenty of GPUs to choose from. Why this? The group believed
commercial GPUs were less effective at meeting unusual needs such as dual-phase
3D frustum clipping, adaptable HPC (arbitrary bit depth FFTs), and hardware SLAM.
The RV64X Group thought collaboration provided flexible standards, reduced the
10–20 man-year effort otherwise needed, and helped with cross-verification to avoid
mistakes.
The team said their motivation and goals were driven by the desire to create a
small, area-efficient design with custom programmability and extensibility. It should
offer low-cost IP ownership and development and not compete with commercial
offerings. It could be implemented in FPGA and ASIC targets and would be free and
open source. The initial design was targeted at low-power microcontrollers. It was
Khronos Vulkan compliant and would support other APIs (OpenGL, DirectX, etc.).
The design was to be a RISC-V core with a GPU functional unit. It would look like
a single piece of hardware with 64-bit long instructions coded as scalar instructions
to the programmer. The programming model was an apparent SIMD; the compiler
would generate SIMD from prefixed scalar opcodes. It would include variable-issue,
predicated SIMD backend, vector front-end, precise exceptions, branch shadowing,
and more. There wouldn’t be any need for an RPC/IPC-calling mechanism to send
3D API calls to and from unused CPU memory space to GPU memory space and
vice-versa, said the team. And it would be available in 16-bit fixed-point (ideal for
FPGAs) and 32-bit floating-point (ASICs or FPGAs) versions.
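The "apparent SIMD" model described above can be sketched in a few lines: the programmer sees a scalar opcode, a vector-length prefix marks it, and the compiler or decoder expands it into element-wise operations. The opcode syntax and prefix form below are invented for illustration; this is not the actual RV64X or Simple-V encoding.

```python
# Sketch of apparent-SIMD expansion: a prefixed scalar op is expanded by
# the compiler into vlen element-wise scalar operations. The mnemonic and
# register syntax are illustrative only.
def expand_prefixed_op(op, vlen, dst, src0, src1):
    """Expand one prefixed scalar op into vlen element-wise ops."""
    return [f"{op} {dst}[{i}], {src0}[{i}], {src1}[{i}]"
            for i in range(vlen)]

ops = expand_prefixed_op("add_f32", 4, "v0", "v1", "v2")
for line in ops:
    print(line)  # add_f32 v0[0], v1[0], v2[0]  ... through element 3
```

The appeal of the model is that the programmer-visible ISA stays scalar while the backend is free to issue the expanded elements across whatever SIMD width the hardware provides.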
The design employed the Vblock format from the Libre-GPU effort (see Libre-GPU
(2018)).
6.9 Conclusion
The open-source GPU movement took its idea from the success of the open-source
software communities. Open-source software is computer software released under a
license in which the copyright holder grants users the rights to use, study, change,
and distribute the software and its source code to anyone and for any purpose.
Open-source software may be developed in a collaborative public manner. One
excellent example of open-source software is Khronos and its ever-growing library
of APIs and tools. Khronos works by engaging the community of interested parties to
help develop initiatives, standards, and definitions. Then, people from all industries
participate. The result is a robust, far-reaching, and usually long-lasting standard
developed by the industry's best minds.
Open-source GPU projects hope to emulate that model and be as successful.
References

1. https://wiki.p2pfoundation.net/Open_Graphics_Project
2. McNamara, P.: Open Hardware. Technology Innovation Management Review (September 2007). https://timreview.ca/article/76
3. Open Hardware Foundation goodbye message. https://tinyurl.com/54v62hfc
4. LinuxFund, OGP Supply Developers with Open Graphics Cards – OSnews (April 12, 2009). https://www.osnews.com/story/21299/linuxfund-ogp-supply-developers-with-open-graphics-cards/
5. Bush, J., Miller, T.: Nyami: A Synthesizable GPU Architectural Model for General-Purpose and Graphics-Specific Workloads. IEEE (April 2015). http://www.cs.binghamton.edu/~millerti/nyami-ispass2015.pdf
6. jbush001/NyuziProcessor. https://github.com/jbush001/NyuziProcessor/commit/63b77a515c1658b4514b13764e0afdc5ba9ecda6
7. jbush001/NyuziProcessor, Hello world test program works on Cyclone II starter board. jbush001/NyuziProcessor@f189e8e · GitHub (September 2015). https://www.intel.com/content/dam/altera-www/global/en_US/uploads/d/db/Hello_World_Lab_Manual_CVE.pdf
8. Larabel, M.: AMD Publishes "Southern Islands" ISA Documentation (August 15, 2012). https://www.phoronix.com/scan.php?page=news_item&px=MTE2MDg
9. AMD GPU ISA documentation. https://gpuopen.com/documentation/amd-isa-documentation/
10. Balasubramanian, R., et al.: Enabling GPGPU Low-Level Hardware Explorations with MIAOW: An Open-Source RTL Implementation of a GPGPU. https://doi.org/10.1145/2764908
11. MIAOW: An Open-source GPGPU (June 24, 2015). https://tinyurl.com/cww8j7s7
12. Smith, R.: AMD's 2016 Linux Driver Plans & GPUOpen Family of Dev Tools: Investing in Open Source. AnandTech (December 15, 2015). https://www.anandtech.com/show/9853/amd-gpuopen-linux-open-source
13. RDNA 1.0 Instruction Set Architecture Reference Guide. AMD (September 25, 2020). https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf
14. Duarte, P., Tomas, P., Falcao, G.: SCRATCH: An End-to-End Application-Aware Soft-GPGPU Architecture and Trimming Tool. MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA (October 14–18, 2017). https://doi.org/10.1145/3123939.3123953
15. Hybrid 3D GPU/CPU/VPU. https://libre-soc.org/3d_gpu/
16. Leighton, L. K. C.: The Libre-SOC Hybrid 3D CPU. FOSDEM 2021 (February 3, 2021). https://tinyurl.com/2kbpty7a
17. kazan-3d/kazan (2018). https://github.com/kazan-3d/kazan/tree/master/docs
18. Tine, B., Elsabbagh, F., Yalamarthy, K., Kim, H.: Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics Research. MICRO '21 (October 18–22, 2021), Virtual Event, Greece. https://vortex.cc.gatech.edu/publications/vortex_micro21_final.pdf
19. Vortex: A Reconfigurable RISC-V GPGPU Accelerator for Architecture Research. https://vortex.cc.gatech.edu/publications/hotchips-poster.pdf
20. Toan, N. V. H., Duc, T. Q.: Design an open DSP-based system to acquire and process the bioelectric signal in realtime. 2016 International Conference on Biomedical Engineering (BME-HUST), pp. 90–94 (2016). https://ieeexplore.ieee.org/document/7782108
Chapter 7
The Sixth Era GPUs: Ray Tracing and Mesh Shaders

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
J. Peddie, The History of the GPU - New Developments,
https://doi.org/10.1007/978-3-031-14047-1_7
7.1 Miners and Taking a Breath

Also in 2018, the availability and pace of new GPUs and AIBs slowed. The sales
of AIBs did not slow, just the introductions of new ones. AMD and Nvidia were
slowing down because design, development, and release were increasingly expensive
activities. Both companies had to pick their priorities carefully. And even though there
was a lot of fanfare when Koduri rejoined AMD in 2013, the Vega product line was
less than stellar, although it did have a surprising advantage for crypto-miners due
to its efficient, high-bandwidth memory subsystem. Meanwhile, Nvidia introduced
its award-winning Pascal GTX 10 series in the summer of 2016 and expanded and
extended its ray tracing capabilities further.

1 RayQuery objects are used in inline ray tracing, an alternative form of ray tracing that does not use
separate dynamic shaders or shader tables. It is available in any shader stage, including compute
shaders, pixel shaders, etc. Inline ray tracing in a shader starts with instantiating a RayQuery object
as a local variable, which acts as a state machine for the ray query.
The crypto craze caught both companies by surprise, and even though AMD
AIBs were more efficient and less expensive than Nvidia’s AIBs, Nvidia sold more
to the crypto crowd, demonstrating the power of brand. Selling is not exactly the
right word: neither AMD nor Nvidia sold their products specifically for mining, but
regardless of the sales and marketing dynamics, both companies ran out of stock as
prices in the channel soared.
AMD was able to increase production a little faster than Nvidia, and in January
2018 it announced it would increase the supply of the Radeon product line.
When Nvidia does something, it does it with flourish and enthusiasm; it amped
up production even more, but not until May.
Then at Computex, AMD's new Senior Vice President of Engineering for the
Radeon Technologies Group, David Wang, said he was committed to delivering a
new product every year, like clockwork.
Wang showed AMD's graphics roadmap at a presentation at Computex, and
although it went out three years from 2017, including its next-gen Navi architecture
and an un-named 7 nm+ architecture debuting in 2020, it didn't look any different
from the roadmap shown at CES or GDC. Wang said at a round table discussion
that "AMD would be bringing out a new graphics product every year, via a new
architecture, process changes, or maybe incremental architecture changes".2
Meanwhile, also at a Computex press briefing, Nvidia CEO Jensen Huang said
in response to a question about the next GPU release that gamers may not see next-
generation graphics boards for “a long time from now”.
That caught a few people by surprise because a press release from the Hot Chips
conference originally reported Nvidia would feature “their next-gen GPU.” That
statement was removed and shifted to “TBD.” The Hot Chips press release then said,
“We will hear from the CPU and GPU giants: AMD featuring their next-gen client
chip and Intel with an interesting die-stacked CPU with iGPU plus stacked dGPU”.
Speculation rippled through the conference and the net. One theory was if it ain’t
broke, don’t fix it. Nvidia had the performance advantage, with multiple SKUs and
the price difference between Nvidia and AMD did not seem to matter much. AMD
did not have a new part coming that year, so Nvidia had some breathing room to
devote R&D to other ambitions.
Another theory was that Nvidia had ramped up production of 1080s and below,
and it needed to run that inventory down before announcing any new part (the
Osborne effect).3 So Nvidia could take its time to roll out its next-generation GPU,
code-named Turing. But it didn't.

2 Wang and AMD lived up to his promise. 2018: November RX 590; 2019: December RX 5500X,
November RX 5500, July RX 5700 Series, January Radeon VII; 2020: November RX 6800 XT and
6900 XT; 2021: November RX 6600, July RX 6600 XT, and March RX 6700 XT.
7.2 Nvidia's Turing GPU (September 2018)

The Turing architecture introduced the first consumer products to deliver ray tracing
in real time—a capability long thought to be years away. The elements included
artificial intelligence processors (tensor cores) and dedicated ray tracing processors. Turing
was also the code name for Nvidia’s microarchitecture used in its Quadro RTX AIBs
and GeForce RTX 20 series AIBs.
With the Turing design, Nvidia also introduced new terms into the GPU vocab-
ulary: TIPS—tensor instructions per second, TOPS—tensor operations per second,
and TPU—tensor processing unit.
Nvidia had the Turing GPU manufactured with TSMC’s 12 nm FinFET process.
The top of the line, the TU102 GPU, had 18.6 billion transistors, 4608 shaders, 288
TMUs, 96 ROPs, and 72 streaming multi-processors (SMs). At 754 mm2, it was the
largest GPU ever made, and it could generate 14.2 TFLOPS. The Turing GPU
featured independent tensor cores for AI inferencing like Nvidia’s Volta GPU and
had dedicated ray tracing cores.
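The quoted 14.2 TFLOPS is consistent with the usual peak-FP32 arithmetic: shader count × 2 operations per clock (a fused multiply-add counts as two) × clock rate. Only the shader count and TFLOPS figure come from the text; the boost clock of roughly 1545 MHz below is an assumption chosen to match them.

```python
# Peak FP32 throughput = shader cores x 2 ops per clock (FMA) x clock rate.
# The ~1545 MHz boost clock is an assumption; 4608 shaders and 14.2 TFLOPS
# are the figures given in the text.
shaders = 4608
ops_per_clock = 2          # one fused multiply-add = 2 FP operations
clock_hz = 1.545e9         # assumed boost clock
tflops = shaders * ops_per_clock * clock_hz / 1e12
print(f"{tflops:.1f} TFLOPS")  # 14.2 TFLOPS
```

The same formula, applied to any GPU's shader count and clock, recovers the peak (not sustained) FP32 number vendors quote.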
The tensor cores processed deep learning (DL) and AI inferencing such as DL
anti-aliasing (DLAA), video and image denoising and resolution scaling, and video
re-timing. The GPU was capable of up to 500 trillion (tensor) operations per second
(TOPS), almost ten times more than the Pascal GPU.
Each streaming multi-processor contained 64 CUDA cores (4608 shaders across
72 SMs), each an execution unit with one floating-point (FP) and one integer (Int)
compute processor. The streaming multi-processor (SM) scheduled threads in groups
of 32 threads called warps. The warp schedulers could issue two warps at the same
time [5].
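The warp bookkeeping above can be sketched as a toy model: threads are grouped into warps of 32, and a dual-issue scheduler picks up to two ready warps per cycle. This is a simplification of the real scheduler, which also tracks dependencies and instruction types.

```python
# Toy model of warp formation and dual-issue scheduling. Warp size (32) and
# issue width (2) follow the text; the scheduling policy is a simplification.
import math

WARP_SIZE = 32

def num_warps(thread_count: int) -> int:
    """Number of 32-thread warps needed to cover thread_count threads."""
    return math.ceil(thread_count / WARP_SIZE)

def issue(ready_warps, issue_width=2):
    """Warps issued this cycle: up to issue_width of the ready ones."""
    return ready_warps[:issue_width]

print(num_warps(1000))   # 32 warps for 1000 threads (last warp partial)
print(issue([4, 7, 9]))  # [4, 7] - two warps dual-issued, warp 9 waits
```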
The TU102 had a 384-bit wide GDDR6 memory bus and employed fast 14 Gbps
memory. There were also two NVLink channels, which Nvidia planned to use in its
next-generation multi-GPU technology.
The GPUs of this era had become so large (physically and logically) that block
diagrams became nothing more than a sea of little boxes representing shaders and
began to take on the look of a die shot, as shown in Fig. 7.1.
In addition to ray tracing, Turing had features for the data center. Nvidia claimed
it would deliver up to ten times more performance and had 25 times better energy
efficiency than CPU-based servers.
3 The Osborne Effect is a reduction in sales of current products after the announcement of a future
product.

Fig. 7.1 Nvidia's Turing TU102 GPU die photo and block diagram. Courtesy of Nvidia

Turing also brought several new rendering features:
– Deep-learning anti-aliasing
– Deep learning supersampling
– Hybrid-rendering
• variable rate shading
• mesh shaders
In addition to the Turing tensor cores, the Turing GPU architecture had several
features to improve the performance of data center applications, such as:
• An improved video engine. Turing could run additional video decode formats
such as HEVC 4:4:4 (8/10/12 bit) and VP9 (10/12 bit).
• Multi-process service. The Turing GPU inherited an improved multi-process
service (MPS) feature introduced in the Volta architecture. Compared to Pascal-
based GPUs, the MPS on Turing-based Tesla boards, Nvidia claimed, improved
inference performance for small batch sizes, reduced launch latency, and improved
quality of service (QoS) while servicing a higher number of concurrent client
requests.
• Higher memory bandwidth and larger memory size. According to Nvidia, the
Turing-based Tesla AIBs had larger memory capacity and higher memory band-
width than prior generation Pascal-based Tesla AIBs, which targeted similar server
segments. They provided a greater user density for VDI applications.
Fig. 7.2 Ray tracing features supported in Nvidia’s Turing GPU. Courtesy of Nvidia
In March 2018, at the GDC Conference, to show the power of its Volta-
based Quadro workstation AIBs, Nvidia linked four Volta-based AIBs together to
demonstrate real-time ray tracing [7].
Since the early 1980s, the industry had been on a mission to improve ray tracing
performance to make a GPU that could produce photorealistic graphics in real time.
Generation over generation, GPU suppliers improved their rendering capabilities,
using technologies such as physically based rendering and photogrammetry. Ray
tracing would enable the last piece of the puzzle—global illumination (Fig. 7.2).
Before the 2018 SIGGRAPH announcement, it was commonly thought GPU
architectures did not have the processing power necessary to handle the ray tracing
workload in real time. Based on Moore’s law projections, the industry thought it
would be five or six years before that workload would be possible with a single GPU.
However, Nvidia’s investments in deep learning helped accelerate the trajectory to
realize that goal. With Turing, the dream of real-time ray tracing (RTRT) was realized.
Nvidia introduced a new AI-enhanced rendering technique with Turing, which took
advantage of the GPU's tensor and RT cores (see Fig. 7.3). Nvidia introduced a
hybrid-rendering technology that combined the ray tracing capabilities of the RT
cores with image denoising and image scaling capabilities. It increased ray tracing
performance by approximately six times in Epic's RTRT and Microsoft's DXRT ray
tracing APIs [8].
The RT cores accelerated the bounding volume hierarchy (BVH) traversal and
ray/triangle intersection testing. These two processes had to be performed iteratively
because the BVH traversal would otherwise require thousands of intersection tests
to finally calculate the color of each pixel. Since the RT cores were specialized to
take on that load, the shader cores in the streaming multi-processor (a sub-section
of the GPU) were freed for other aspects of the scene.

Fig. 7.3 Nvidia's hybrid-rendering technology combining the ray tracing capabilities of the RT
cores and the image denoising
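The iterative traversal the RT cores accelerate can be illustrated with a toy one-dimensional BVH: each node stores bounds that enclose its children, and whole subtrees are pruned when the query misses those bounds. Real hardware tests rays against 3-D boxes and triangles, not interval membership; the structure and counts here are illustrative only.

```python
# Toy 1-D BVH: nodes bound their children; leaves hold primitives.
# Traversal uses an explicit stack and counts bounds tests performed.
class Node:
    def __init__(self, lo, hi, children=None, prims=None):
        self.lo, self.hi = lo, hi
        self.children = children or []
        self.prims = prims or []

def traverse(root, x):
    """Return (primitives whose bounds contain x, number of bounds tests)."""
    hits, stack, tests = [], [root], 0
    while stack:
        node = stack.pop()
        tests += 1
        if not (node.lo <= x <= node.hi):
            continue                     # prune this whole subtree
        if node.prims:
            hits.extend(node.prims)
        stack.extend(node.children)
    return hits, tests

leaf_a = Node(0, 4, prims=["triA"])
leaf_b = Node(6, 9, prims=["triB"])
root = Node(0, 9, children=[leaf_a, leaf_b])
print(traverse(root, 2))   # (['triA'], 3)
print(traverse(root, 20))  # ([], 1) - the root test prunes everything
```

In a deep tree, the pruning at interior nodes is what turns a brute-force test of every triangle into a logarithmic number of bounds tests per ray.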
Without this advancement, Nvidia would still have been on a path to releasing a
real-time ray tracing GPU in approximately five to ten years. The company's ability
to combine rasterization and compute-based techniques with hardware-accelerated
ray tracing and deep learning enabled real-time ray tracing sooner. That hybrid
rendering approach has since been adopted in a few games and engines as well [9].
GPUs vary in performance by generation and segment (high to low) within a gener-
ation. As a result, some GPUs cannot consistently deliver the same level of quality
on every part of the output image. Turing’s independent integer and floating-point
pipelines meant it could simultaneously address and process numeric calculations.
Variable rate shading (VRS) increased rendering speed and quality by varying
the shading rate for different regions of the frame. Variable rate shading came from
developments for operations such as foveated rendering—a significant component
of VR headsets—and motion-adaptive shading. Foveation allows the processor to save
resources by concentrating on the data at the center of focus, where it presents a
high-resolution image, and processing the rest at lower resolution.
With variable rate shading, the pixel shading rate of blocks of triangles varied.
Turing offered developers seven options for each 16 × 16-pixel region: a shading
result could be used to color a single pixel, four pixels (2 × 2), 16 pixels (4 × 4), or
non-square footprints like 1 × 2 or 2 × 4.
Turing’s variable rate shading enabled a scene to be shaded with rates varying
from once per visibility sample (supersampling) to once per 16 visibility samples.
The developer could specify the shading rate spatially (via a texture). As a result, a
7.2 Nvidia’s Turing GPU (September 2018) 331
single triangle could be shaded using multiple rates, providing the developer with
fine-grained control.
VRS allowed the developer to control the shading rate without changing the
visibility rate. The ability to decouple shading and visibility rates made VRS more
useful than techniques such as multi-resolution shading (MRS) and lens-matched
shading (LMS), which lower total rendering resolution in specified regions. At the
same time, VRS, MRS, and LMS could be used in combination because they are
independent techniques enabled by separate hardware paths.
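The arithmetic behind those rates is straightforward: a 16 × 16 region shaded at a w × h rate needs 256 / (w × h) shader invocations instead of 256. The sketch below counts the savings for an invented mix of per-region rates; the frame layout is illustrative, not from the text.

```python
# VRS bookkeeping sketch: each 16x16-pixel region gets a shading rate, and
# one shading result colors a w x h block of pixels. The rates follow the
# text; the per-region assignment below is invented for illustration.
TILE = 16 * 16  # pixels per region

def invocations(rate):
    """Shader invocations needed for one 16x16 region at a (w, h) rate."""
    w, h = rate
    return TILE // (w * h)

frame_rates = [(1, 1), (2, 2), (4, 4), (2, 4)]   # one rate per region
total = sum(invocations(r) for r in frame_rates)
full = TILE * len(frame_rates)                    # shading every pixel
print(total, full, f"{100 * (1 - total / full):.0f}% saved")
```

Even this tiny four-region example shades 368 pixels' worth of work instead of 1024, which is why coarse rates in peripheral or fast-moving regions buy so much performance.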
DLSS takes a frame or two at 4 K and uses it as a reference. It then renders the
scene at HD (1080p) and upscales the image to 4 K using what it learned from the
original 4 K reference.
Nvidia’s RTX AIBs introduced at SIGGRAPH 2018 was a sensational surprise
and instant hit. Initially, as is to be expected, there were not many games for it, but
the population steadily grew. However, the frame rate when running RTRT was not
optimal. In addition to having almost 3000 shader cores, the AIB also had 272 special-
ized ray tracing intersection cores, and 68 AI tensor cores. DLSS was a temporal
image upscaling technology. It used deep learning to upscale lower-resolution images
to a higher resolution for display on higher-resolution computer monitors.
Then, in March 2020, the company announced version 2.0 of its DLSS technology.
The 2.0 version was so different in its construction from the 1.0 version that several
fans and analysts thought Nvidia should have renamed it [10].
The basic concept of deep learning supersampling—DLSS—was just what its
name implied. Nvidia took a 3D model (a group of objects created by a game
developer for a scene in the game’s story) and rendered it using ray tracing techniques.
Unlike traditional scanline rendering, which bogs down as objects are added, ray
tracing is impacted mainly by the resolution of the final image. So the trick of using
a lower resolution for ray tracing and then scaling up the image is clever and efficient.
All rendering lives under the rules of performance vs. quality, with quality being
an arbitrary horizontal scale and performance being either frames-per-second on the
vertical axis for ray tracing or polygons-per-second for scan-line rendering.
Since DLSS was developed for ray tracing, the first step was to set the resolution
at a reference level. Through various empirical and experiential work, Nvidia settled
on 1080p as the smallest resolution to use as a base.
Nvidia ran a game at 1080p, fully ray-traced it, and then fed the ray-traced
images to a specially designed convolutional autoencoder neural network. The
motion vectors obtained from the game’s engine were then sent to and through the
network; they would get used later. The network then segmented and segregated key
elements, reassembled them, and created a 4 K output file. That file, along with a
sample file from a 16 K reference file, was then run through the network and iterated
a few times to arrive at the final finished, polished, fully ray-traced frames (Fig. 7.4).
The iteration eliminated the image noise ray tracing produces while processing.
Fig. 7.4 Data flow of Nvidia’s DLSS 2.0 process. Courtesy of Nvidia
Figure 7.4 shows an active component (the game during run time) and the offline
training.
DLSS achieved its image quality by using four inputs to arrive at the final frame
seen by the user:
1. The game engine’s base resolution image (e.g., 1080p) was rendered (far left in
the diagram).
2. The image generated by the game engine also produced motion vectors, which got
extracted (center of the diagram). Motion vectors informed the DLSS algorithm
about which direction objects in the scene were moving from frame to frame—
that data was used to direct the supersampling algorithm later.
3. The high-resolution output of the previous DLSS-enhanced frame (4 K) was then
created.
4. And an extensive data set of 16 K-resolution ground truth images Nvidia had
acquired from various game content was used to train the AI network running on
an Nvidia supercomputer.
A convolutional autoencoder AI network received the current 1080p base reso-
lution frame, motion vectors, and a previous high-resolution frame. It determined
what was needed to generate a higher resolution version of the current frame on a
pixel-by-pixel basis.
By examining the motion vectors and the prior high-resolution frame, the DLSS
algorithm could track objects from frame to frame. That information provided
stability for motion and reduced flickering, popping, and scintillation artifacts. That
process is known as temporal feedback, as it uses history to inform the algorithm
about the future (Fig. 7.5).
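The temporal feedback loop can be caricatured in a few lines: combine a naively upscaled low-resolution frame with the previous high-resolution frame warped by a motion vector. Real DLSS uses a convolutional autoencoder rather than the fixed blend below, and the frames here are toy 2 × 2 and 4 × 4 grids.

```python
# Toy sketch of temporal feedback: naive 2x upscale of the current frame,
# blended with the previous high-res frame shifted by a motion vector.
# The blend weight and nearest-neighbour upscale are simplifications.
def upscale2x(frame):
    """Nearest-neighbour 2x upscale of a 2-D list of pixel values."""
    return [[px for px in row for _ in (0, 1)]
            for row in frame for _ in (0, 1)]

def warp(frame, dx, dy):
    """Shift a frame by an integer motion vector, clamping at the border."""
    h, w = len(frame), len(frame[0])
    return [[frame[min(max(y - dy, 0), h - 1)][min(max(x - dx, 0), w - 1)]
             for x in range(w)] for y in range(h)]

def combine(low_res, prev_high, motion, weight=0.5):
    up = upscale2x(low_res)
    hist = warp(prev_high, *motion)
    return [[weight * u + (1 - weight) * h for u, h in zip(ur, hr)]
            for ur, hr in zip(up, hist)]

low = [[1.0, 2.0], [3.0, 4.0]]         # current 2x2 low-res frame
prev = [[1.0] * 4 for _ in range(4)]   # previous 4x4 high-res frame
out = combine(low, prev, motion=(0, 0))
print(out[0])  # [1.0, 1.0, 1.5, 1.5]
```

The point of the warp step is exactly the "temporal feedback" the text describes: history reprojected along the motion vectors stabilizes the result between frames.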
The DLSS algorithm had access to prior frames and motion vectors, which allowed
it to track each pixel and take samples of the same pixel from several frames (known
as temporal supersampling). That could deliver greater detail and edge quality than
traditional upscaling solutions.
Offline, during the training process, the output 4 K super-resolution image (from
the network) was compared to an ultra-high-quality 16 K reference image (referred
to above as the ground truth). Then the difference was fed back into the network
so it could continue to learn and improve its results. The 16 K reference images
came from different types of game content (with and without ray tracing) that Nvidia
rendered at 16 K and compared to the DLSS algorithm's output. It was an iterative
learning cycle that ran until the network could consistently reproduce a similar
image. The iteration was repeated tens of thousands of times on the supercomputer
until the network reliably outputted high-quality, high-resolution images.

Fig. 7.5 Nvidia's DLSS used motion vectors to improve the supersampling of the enhanced image.
Courtesy of Nvidia
The DLSS algorithm learned to predict high-resolution frames with greater accu-
racy by training against a large dataset of 16 K-resolution images. And, through
continual training on Nvidia’s supercomputers, DLSS could learn how to deal with
new content classes—from fire to smoke to particle effects—at a rate that engineers
doing hand-coding of non-AI algorithms could not maintain (Fig. 7.6).
DLSS successfully exploited the tensor cores in GeForce RTX AIBs. These cores
could deliver up to 285 teraflops of dedicated AI processing. As a result, DLSS could
be run in real-time simultaneously with an intensive 3D game.
Fig. 7.6 An Nvidia demo of ray tracing used in a game. Courtesy of Nvidia
out to the display. Applications and games dealing with high-geometric complexity
would benefit from the flexibility of the two-stage approach, which allowed efficient
culling, level-of-detail (LOD) techniques, and procedural generation. There is an
extensive discussion of mesh shaders in Book two, the GPU Environment—APIs.
Suffice it to say, mesh shaders opened up a new era in GPU image processing and
generation.
7.2.3 Summary
Nvidia showed that the GPU could evolve to meet the dream of photo-realism at
interactive frame rates. Adding specialized processors or engines to the
GPU was not a new concept. Even the earliest graphics controllers had video codecs,
audio, and unique function accelerators. The Turing GPU added AI and dedicated ray
tracing cores to the ever-expanding GPU. The Turing processor was a revolutionary
product and set the threshold for what other GPU suppliers would have to meet.
The Nvidia Turing GPU was a breakthrough device. It had the most shaders and
was the largest chip made at the time. It was designed for two markets, gaming and
the data center, and so included parts not needed by each segment. That gave Nvidia
the benefit of economies of scale but inflated the price. Nonetheless, the GPU was
very successful for Nvidia.
7.3 Intel–Xe GPU (2018)
Intel said the Xe LP engine would support 1080p gameplay and had a 12-bit
video pipeline end-to-end. The desktop models did not have an image processing
unit (IPU); those features were only available on mobile devices.
Alder Lake’s integrated GPU could drive up to five display outputs (eDP, dual
HDMI, and dual DP++) and offered the same encoding/decoding features as both
Rocket Lake and Tiger Lake CPUs, including AV1 8-bit and 10-bit decode, 12-bit
VP9, and 12-bit HEVC.
Intel has a long history in PC graphics chips beginning in 1983, and, in late 2020, it
announced a new discrete GPU (dGPU), the Xe Max. The company had taken several
runs at building a discrete graphics chip to take on the market leaders, but it had a
challenging time; it never seemed to address building a dGPU with the same
seriousness and resources as the x86 CPU.
That attitude changed when Intel plotted the development and introduction of its
Xe dGPU line. It was careful to communicate its confidence while keeping a tight
grip on its messaging. Intel finally announced a thin and light notebook discrete
GPU (dGPU), the Iris Xe Max, on Halloween, and it may have been scary news for
some of the incumbents.
The basic specifications of the 2020 mobile dGPU are in Table 7.1.
Intel paired the Iris Xe Max dGPU with its 11th-gen Intel Core mobile processors.
The company claimed the new dGPU delivered additive AI. Additive meant both
GPUs (the new dGPU and the CPU’s iGPU) could work together on inferencing and
rendering. And that, said Intel, could speed up content creation workloads as much
as seven times.
Intel compared its first product against a 10th-gen Intel Core i7-1065G7 with an
Nvidia GeForce MX350.
The 2020 dGPU offered Hyper Encode for up to 1.78 times faster encoding than
a high-end desktop graphics AIB. For that test, Intel used a 10th-gen Intel Core
i9-10980HK with Nvidia GeForce RTX 2080 Super.
Additionally, the Iris Xe Max worked with Intel’s Deep Link. Deep Link enabled
dynamic power-sharing. With Deep Link, the CPU could have all the power and
thermal resources dedicated to it when the discrete graphics was idle, resulting, said
Intel, in up to 20% better CPU performance.
Intel was no stranger to graphics. It had deep and hard-won experience. Intel had
some of the finest graphics engineers in the business. And yet, with the best fabs,
a bank account that others could only fantasize about, and a brand that could sell
used eight-track players, the company had continuously failed to launch a successful
discrete graphics processor product line (Fig. 7.7).
For all that, Intel was the largest seller of integrated graphics processors and
shipped more GPUs than all its competitors combined. Maybe another company
would have been happy with that accomplishment, but not Intel.
Intel offered a more robust dGPU, DG2, in early 2021. It was a desktop part,
and versions had 128 and 512 EUs and used GDDR6. But remember, there were
four discrete segments in the desktop dGPU market: low-end, mid-range, high-end,
and workstation. Meeting the demands of each of those segments required a design
with the ability to scale, and the lack of scalability was one of the issues that killed
the Larrabee project.
Intel acknowledged the issue:
“No single transistor is optimal across all design points,” said chief architect Raja Koduri.
“The transistor we need for a performance desktop CPU, to hit super-high frequencies, is
very different from the transistor we need for high-performance integrated GPUs.”
This time, however, things were going to be different. Since Larrabee, Intel had
developed its Embedded Multi-Die Interconnect Bridge (EMIB).
Before EMIB, designers put heterogeneous dies onto a single package using an
interposer, which had wires through its substrate for communication. Designers and
engineers used multiple dies for maximum performance or feature set.
Through-silicon vias (TSVs) passed through the interposer into a substrate that
formed the package’s base, an arrangement often referred to as 2.5D packaging.
EMIB abandoned the interposer in favor of tiny silicon bridges embedded in the
substrate layer (Fig. 7.9). The bridges contained microbumps that enabled die-to-die
connections. Intel demonstrated it with an FPGA implementation called Stratix.
Silicon bridges are less expensive than interposers. One of Intel’s first products
with embedded bridges was Kaby Lake G. Laptops based on Kaby Lake G were consid-
ered expensive. However, they demonstrated Intel’s EMIB would work with hetero-
geneous dies in one package. For one thing, it consolidated valuable board space. It
could also improve performance and reduce cost compared to discrete components.
Fig. 7.9 EMIB created a high-density connection between the Stratix 10 FPGA and two transceiver
dies. Courtesy of Intel
340 7 The Sixth Era GPUs: Ray Tracing and Mesh Shaders
Fig. 7.10 Intel plans to span the entire dGPU market. Courtesy of Intel
Kaby Lake G used dies from three different foundries. That was the foundational
work Intel had done for chiplet designs, and chiplets were Intel’s plan for scaling
Xe dGPU processors (tiles) for the various segments. Additionally, since Intel was
building DG2 in 7 nm at TSMC, multi-die, multi-vendor interoperability was critical.
Intel referred to the link between its core fabric and each tile as the Advanced
Interface Bus (AIB).
In late 2018, Intel introduced its Foveros technology, a 3D stacking approach that
allowed it to pick the best process technology for each layer in a stack. The Lakefield
processor had the first implementation of Foveros. It incorporated processing cores,
memory control, and graphics using a 10-nm die. That chiplet sat on top of the base
die, which included the functions usually found in a platform controller hub (e.g.,
audio, storage, PCIe). Intel used low-power 14 nm for those processors. Microbumps
connected power and communications through TSVs in the base die. Intel then put
LPDDR4X memory from one of its partners on the top of the stack (Fig. 7.10).
Intel’s Xe may become what the company promised in 2019—a scalable archi-
tecture that could satisfy everything from high-end GPU compute to low-end thin
and light notebooks [12, 13], with a common architecture that could share one driver
and live atop Intel’s oneAPI concept [13].
From 2016 to 2021, Intel had a succession of manufacturing failures at its Hills-
boro research and manufacturing center that delayed three generations of new Intel
processors. At the same time, AMD introduced a new range of powerful x86 CPUs.
Intel’s difficulties gave AMD an opening to overtake Intel with a more advanced
processor, costing Intel sales and profits.
In January 2021, Intel acted. The company recruited Pat Gelsinger, Intel’s former
CTO, who had joined the company back in 1979 and left in 2009 to take over EMC
and later VMware (Fig. 7.11).
During Gelsinger’s earlier time at Intel, he launched Intel’s annual developer
conference in 1997. In 2017, former CEO Brian Krzanich scrapped it. When
Gelsinger returned, he revived the event, which was virtual in 2021 due to the
pandemic. Announcing the conference, Gelsinger said, “The geek is back at Intel”
[14].
After Gelsinger’s return, Intel hired hundreds more workers in preparation for
a $3 billion Hillsboro factory expansion planned for 2022. The company had also
been expanding its employment at its Folsom, California, facility, where the central
dGPU Xe team was located.
7.3.3 DG1
At CES in January 2020, Intel showed off its DG1 (discrete graphics one, Fig. 7.12)
Xe-based AIB. Intel built it using a next-gen integrated GPU removed from its CPU
and made into a discrete part. As such, it used conventional DDR RAM, not the high-
bandwidth GDDR that had been developed specifically for graphics processors.
Its performance was not much better than an iGPU, but that was not the point.
The architecture of the GPU was from Intel’s Xe design, and, as such, developers
could use it to run test code; it also established the DG product name.
In June 2021, while Intel was preparing to launch its Xe series of GPUs, Intel’s
chief architect Raja Koduri tweeted about the DG2 (discrete graphics two) chip and
showed a picture [15]. It was built from the foundation created by the Intel DG1 and
was presumed to have improved performance.
The Arc client graphics road map included Alchemist (previously known as DG2).
Intel retired the DG# naming scheme and renamed the line Arc. It then added genera-
tional code names such as Alchemist, Battlemage, Celestial, and Druid.
During the presentations, Koduri said, “We added a fourth microarchitecture to
the Xe family: Xe HPG optimized for gaming, with many new graphics features
including ray tracing support. We expect to ship this microarchitecture in 2021”
[16]. Figure 7.13 shows Intel’s road map.
Koduri said Intel’s Xe-HPG architecture employed energy-efficient processing
blocks from the Xe-LP architecture and high-frequency optimizations developed for
the Xe-HP/Xe-HPC GPUs for data centers and supercomputers. The GPUs had high-
bandwidth internal interconnections, a GDDR6-powered memory sub-system, and
hardware-accelerated ray tracing support. In a bold move, Intel manufactured the
DG2 family GPUs at TSMC.
The Xe-HPG microarchitecture powered the Alchemist family of SoCs, and the
first related products came to market in the first quarter of 2022 under the Intel
Arc brand. The Xe-HPG microarchitecture featured a new core, a compute-focused,
programmable, and scalable element.
Fig. 7.13 Intel’s Xe Arc HPG road map circa 2021. Courtesy of Intel
The scalability of the Xe architecture spanned from iGPUs, which were classified
Xe LP, to high-performance gaming-optimized GPUs, classified as Xe HPG, and up
to data center and supercomputer GPUs, classified as Xe HP and Xe HPC.
XMX was Intel’s branding for tensor cores, and Intel used them to overcome the
classic performance-quality conflict illustrated in Fig. 7.15.
7.3.4 Summary
Intel assembled a massive team of engineers and marketing people from all over
the company and the industry. They also had a national lab commitment hanging
over their heads, and they could not afford to default on that; the repercussions and
embarrassment would be too great. One of Intel’s challenges was learning how to
deal with an outside fab. The cultural and procedural processes were so different
from the Intel way.
The other thing that Intel had to deal with was backward compatibility. That was
one of the things that killed Larrabee. Intel was not some start-up that went to market
just because it got some sample chips that worked. Intel was not going into the dGPU
market; it was going into the AIB market. The supply chain, QA, marketing, tech-
support, legal, and standards compliance issues were huge, and one could not do
everything overnight.
Fig. 7.16 Intel had a new SDK for its recent supersampling and scaling algorithm. Courtesy of
Intel
In late 2020, AMD introduced its latest Radeon boards based on the Navi 21 GPU
architecture. There had been rumors, and AMD gave hints about Big Navi for over
a year. The consensus of the industry was that it was worth the wait.
7.4 AMD Navi 21 RDNA 2 (October 2020)
Fig. 7.17 Example of the quality of Intel’s Xe SS—notice the Caution sign. Courtesy of Intel
AMD had not been a contender in the high-end sector for a while. As a result,
Nvidia had been enjoying the enthusiast space to itself. Big Navi changed the land-
scape. Not only was the Radeon RX 6800 XT a powerful competitor, but AMD had
become a much stronger company. AMD could mount a robust marketing program
to back up the product. That was an element that had been missing in the past.
The heart of the Navi 21 was AMD’s RDNA 2. It, said AMD, featured significant
architectural advancements over the previous RDNA architecture. It had an enhanced
compute unit and a new visual pipeline with ray accelerators. The company claimed
up to 1.54× higher performance-per-watt on various games they tested. Additionally,
RDNA 2 offered a 1.3× higher frequency at the same per-CU power than an RX
5700 XT. The RDNA 2 provided DXR, VRS, and AMD’s FidelityFX capabilities.
Compare the original 2019 RDNA (discussed in the chapter on the third-to-fifth eras)
with the expanded and evolved version in Fig. 7.18, just one year later.
AMD also introduced its Infinity Cache that sped up performance. The company
said it focused on delivering breakthrough speeds with power efficiency in RDNA 2.
The on-die cache resulted in frame data with much lower energy per bit, said AMD.
According to AMD, the 128 MB Infinity Cache provided up to 3.25× the effective
bandwidth of 256-bit GDDR6. And, when adding power to the equation, it achieved
up to 2.4× more effective bandwidth per watt versus 256-bit GDDR6 alone.
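AMD's multiplier can be turned into rough numbers. A sketch, assuming a 16 Gbps GDDR6 pin speed (a typical rate for the era; AMD's exact baseline is not stated here):

```python
# Rough numbers behind the "3.25x effective bandwidth" claim. The
# 16 Gbps per-pin speed is an assumed, typical GDDR6 rate for the
# era; AMD did not state its exact baseline.

def dram_bandwidth_gbs(bus_bits, gbps_per_pin):
    """Peak DRAM bandwidth in GB/s: bus width * per-pin rate / 8 bits per byte."""
    return bus_bits * gbps_per_pin / 8

raw_gddr6 = dram_bandwidth_gbs(256, 16)   # 512 GB/s from the DRAM alone
effective = raw_gddr6 * 3.25              # AMD's claimed multiplier with Infinity Cache
```

Under that assumption, the raw 512 GB/s becomes roughly 1,664 GB/s of effective bandwidth when cache hits are counted, which is the sense in which a 256-bit bus could feed a high-end GPU.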
Also new to the AMD RDNA 2 compute unit was the ray accelerator (Fig. 7.19).
Ray accelerators, said AMD, provided a massive acceleration for intersecting rays.
The RX 6800 XT had 72 ray-accelerator units. Each ray accelerator could calculate
up to four ray/box intersections or one ray/triangle intersection in
every clock. The ray accelerators calculated the intersections of the rays with the
scene geometry in a bounding volume hierarchy. Then it sorted them and returned
the information to the shaders for further scene traversal or result shading.
Variable Rate Shading (VRS) adjusted the shading rate for different regions of
an image. VRS was initially developed for gaze tracking in VR HMDs for foveated
rendering, in which the GPU could concentrate the rendering work where it was
needed most. VRS applied higher shading rates to the most complex parts of the
image, the areas that usually held the most important visual cues. Nvidia introduced
the concept in its Turing architecture.
AMD said its RDNA 2 VRS functionality ran throughout the entire pixel pipe
and offered shading rates of 1×1, 2×1, 1×2, and 2×2.
The RDNA 2 VRS provided a unique shading rate to every 8×8 region of pixels.
That granularity enabled developers to make appropriate decisions about the shading
rate for a given region.
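The work saved by coarse shading falls directly out of those rates: a 2×2 rate shades one sample per four pixels in its tile. A small sketch, with an illustrative per-tile rate assignment:

```python
# Pixel-shader invocations for an image shaded per 8x8 tile at the
# RDNA 2 VRS rates quoted above (1x1, 2x1, 1x2, 2x2). The per-tile
# rate assignment below is illustrative.

def invocations(width, height, tile_rates, tile=8):
    """Sum shader invocations, one (rx, ry) coarse rate per 8x8 tile."""
    assert len(tile_rates) == (width // tile) * (height // tile)
    # a (2, 2) rate shades one sample per 2x2 pixel block in its tile
    return sum((tile // rx) * (tile // ry) for rx, ry in tile_rates)

# A 16x16 image is four 8x8 tiles: full rate on the two "important"
# tiles and 2x2 on the two others cuts the latter's shading cost by 4x.
work = invocations(16, 16, [(1, 1), (1, 1), (2, 2), (2, 2)])  # 160 vs 256
```

In this toy case the image is shaded with 160 invocations instead of 256, while the two full-rate tiles keep every visual cue intact.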
The RX 6800 XT included the Radeon Media Engine. It offered hardware-
accelerated encode/decode capabilities. The engine was compatible with popular
codecs, such as H.264, H.265, VP9 (decode only), and AV1 (decode only).
With the introduction of the RX 6800 XT, AMD resurrected its Rage branding
introduced in 1996. The ATI Rage series was the first 3D graphics accelerator devel-
oped by ATI and it ushered in a new era of PC gaming. To remind the industry of
ATI’s early triumph and to underline the importance and confidence AMD felt for
the RX 6800 XT AIBs, AMD introduced a new Rage Mode for the Radeon RX 6800
XT AIBs.
Rage Mode was one of three one-click, performance-tuning presets available on
the Radeon RX 6800 XT, along with Quiet and Balanced. Those presets automatically
adjusted power and fan levels to allow quick and easy customization of the GPU’s
behavior.
AMD released the Radeon 6x00 series AIB at the height of the crypto coin demand
for GPUs and AIBs and the demand from the pandemic. Prices soared, users were
frustrated, and only the middlemen and scalpers made a hefty profit (Table 7.3).
The GPU was significant for its physical size. The Nvidia A100 was 1.6 times
larger, with a corresponding increase in shaders (6912 versus 4608)—but the AMD
GPU got 20.7 TFLOPS while the Nvidia A100 got 19.49. So, with roughly 40% less
silicon, AMD was able to squeeze out an equivalent number of TFLOPS.
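The arithmetic behind that comparison can be checked with the figures quoted above (shader counts stand in as a rough proxy for silicon, since die areas are not given here):

```python
# Checking the TFLOPS-per-silicon comparison with the figures quoted
# above. Shader counts stand in for die area, which is not given here.

nvidia_shaders, nvidia_tflops = 6912, 19.49   # Nvidia A100
amd_shaders, amd_tflops = 4608, 20.7          # the AMD part discussed

shader_ratio = nvidia_shaders / amd_shaders            # A100 has 1.5x the shaders
amd_per_shader = amd_tflops / amd_shaders
nvidia_per_shader = nvidia_tflops / nvidia_shaders
advantage = amd_per_shader / nvidia_per_shader         # AMD throughput per shader
```

By this proxy, the AMD part delivered roughly 1.6 times the FP throughput per shader, which is the substance of the text's "equivalent TFLOPS from less silicon" observation.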
In 2020, AMD and its console partner Sony announced AMD’s customized APUs
were in the new Sony PS5, and they also had ray tracing intersection shaders and
could do real-time ray tracing (RTRT) on consoles and PCs. In February 2021, AMD
announced that its RDNA2-based Radeon RX 6000 XT AIB could do RTRT.
With the RX 6000 series GPUs, AMD introduced a fixed-function state
machine, an intersection detection engine, into the texture shader. It was a hybrid,
software-hardware approach to ray tracing, and AMD said it improved upon purely
hardware-based solutions.
In AMD’s patent, the company said the hybrid approach addressed the issues with
hardware-based or software-based solutions. Using a shader unit to schedule the
processing while performing fixed-function acceleration for a single node of the
bounding volume hierarchy (BVH) tree resolved the shortcomings of each approach
on its own [18]. AMD said it could preserve flexibility by controlling the overall
calculation with the shader. It could bypass the fixed-function hardware when
necessary while still getting the performance advantage of fixed-function hardware.
Additionally, the texture processor infrastructure eliminated the large buffers for
ray storage and BVH caching typically required in a hardware ray tracing solution;
the vector general-purpose registers and texture cache could be used in their place.
That saved die area and reduced the complexity of the hardware solution.
AMD used a specialized fixed-function hardware ray intersection engine to handle
bounding volume hierarchy intersections (BVH calculations run through a stream
processor in software have to deal with execution divergence and time-consuming
error corrections). AMD’s fixed-function hardware was simpler than
Nvidia’s RT cores and was in parallel with the texture filter pipeline in the GPU’s
texture processor.
The system included an interconnected texture processor (TP), shader, and cache.
The TP had a texture address unit (TA), a texture cache processor (TCP), a filter
pipeline unit, and a ray intersection engine.
The shader sent a texture instruction, which contained ray data and a pointer to
a BVH node, to the texture address unit (TAU). The TCP used an address provided
by the TAU to fetch BVH node data from the cache. The ray intersection engine
performed ray-BVH node type intersection testing using the ray and BVH data. The
intersection testing indications and results for BVH traversal were returned to the
shader via a texture data return path. The shader reviewed the intersection results
and indications to decide how to traverse to the next BVH node (Fig. 7.20).
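That round trip, in which the shader issues a query, the fixed-function engine answers, and the shader decides where to go next, can be modeled with a toy one-dimensional BVH (the node layout and names here are illustrative, not AMD's):

```python
# Toy one-dimensional model of the shader-driven BVH walk described
# above: the "shader" owns the traversal stack, while a separate
# routine plays the fixed-function intersection engine and answers
# one node query at a time. Node layout and names are illustrative.

def intersection_engine(nodes, ray_x, idx):
    """Fixed-function stand-in: intersect one node, return the result."""
    node = nodes[idx]
    lo, hi = node["bounds"]
    if not lo <= ray_x <= hi:
        return []                        # miss: nothing to visit or report
    if node["kind"] == "box":
        return node["children"]          # surviving children to traverse
    return [node["tri"]]                 # leaf: a triangle hit

def traverse(nodes, ray_x, root=0):
    """Shader side: keeps the stack and decides where to go next."""
    stack, hits = [root], []
    while stack:
        idx = stack.pop()
        result = intersection_engine(nodes, ray_x, idx)  # one engine round trip
        if nodes[idx]["kind"] == "box":
            stack.extend(result)         # shader schedules further traversal
        else:
            hits.extend(result)
    return hits
```

The division of labor mirrors the patent's description: all control flow (the stack, the choice of next node) stays in the programmable shader, while the per-node intersection test is a black box that could be bypassed or replaced.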
Using fixed-function acceleration for a single node of the BVH tree was a hybrid
approach; utilizing a shader unit to schedule the processing addressed the issues
associated with hardware-only or software-only solutions.
Because the shader still controlled the overall calculation and could bypass the
fixed-function hardware, flexibility was preserved, and fixed-function hardware’s
performance advantage was still realized. Additionally, using the texture processor
infrastructure, large buffers for ray storage and BVH caching were eliminated, as the
existing VGPRs and texture cache could be used in their place, which saved die area
and reduced the complexity of the hardware solution.
The ray tracer could provide the anti-aliased front end for FSR in the 6000 XT
series (although, as mentioned, it was not required). FSR used conventional anti-
aliasing techniques for older GPUs, so the performance was not the same. Moreover,
it was not the same on an Nvidia GPU because of the front-end processing.
In March 2021, AMD announced an upgrade called FidelityFX. The company added
hardware acceleration and scaling to its GPUs and AIBs to enhance and speed up
gaming. The company offered the enhancement to developers via its OpenGPU
program. That made it useful for any GPU of almost any vintage. AMD had branded
it FidelityFX Super Resolution (FSR) [19].
Raster-scan rendering speed was governed by the number of polygons in the image
and was not affected much by screen resolution. Ray tracing was the opposite—it
functioned almost independently of the polygon count and was challenged more by
screen resolution.
Nvidia developed a method of reducing the image’s resolution, using AI filtering,
and then scaling it up to speed up ray tracing’s real-time capabilities. AMD used a
similar approach but without AI filtering and its accompanying overhead. AMD’s
approach is explained in more detail below.
AMD’s FSR was an open-source, cross-platform technology designed to increase
frame rates and, at the same time, deliver high-quality, high-resolution gaming expe-
riences. AMD offered the following pipeline description of its process (Fig. 7.21).
The spatial upscaling technology utilized an algorithm to analyze features in
the source image. It then performed edge reconstruction and recreated the images
at a higher target resolution. Then the image was run through a sharpening filter,
which further improved image quality by enhancing texture details. However, the
sharpening step added edge noise and other artifacts. To fix that, AMD followed the
sharpening pass with a post-processing step to compensate for chromatic aberration
effects, film grain, and other clean-up functions. AMD said its FidelityFX Super
Resolution (FSR) could increase frame rates by as much as two and a half times.
AMD claimed the results produced an image with super high-quality edges
and distinctive pixel detail—especially when compared to other basic upscaling
techniques available at the time.
“Our goal with FidelityFX Super Resolution was to develop an advanced, world-class
upscaling solution based on industry standards that can quickly and easily be implemented by
game developers, and is designed for how gamers really play games,” said Scott Herkelman,
corporate vice president and general manager, Graphics Business Unit at AMD. “FSR is the
industry’s ideal upscaler – it does not require any specialized, proprietary hardware and is
supported across a broad spectrum of platforms and ecosystems” [20].
FSR had four quality settings: Ultra Quality, Quality, Balanced, and Performance.
The balance between image quality and performance was adjustable by the user. In
the Ultra Quality mode (see Table 7.4), AMD claimed that the FSR image quality was
almost indistinguishable from the native resolution. Native resolution referred to the
game’s image quality at the monitor’s advertised resolution, without any sharpening
filters or upscaling techniques.
When added to a game, one would get very close to full image fidelity (quality)
of any target resolution (AMD said 1440p and 4 K were the best examples) from the
upscaling of FSR. FSR rendered at a lower resolution for improved performance,
and then the image was upscaled and sharpened to get back to the target resolution,
with near-native image quality. There was minimal image quality impact using FSR.
AMD said it had worked with game developers and studios to develop FSR and
claimed over 40 developers pledged to support and integrate FSR into their games
and game engines. It was a small patch and did not require special AI hardware or
training.
Dan Ginsburg, a graphics developer at Valve, said at the time, “With AMD’s
FidelityFX Super Resolution, we are able to offer customers improved image quality
at a lower performance cost than full resolution rendering. This is particularly attrac-
tive for users with mid-range GPUs wanting to target higher resolutions. We are
very pleased that it is designed for use with all GPUs and with AMD’s open-source
approach with FSR [21]”.
FSR was compatible with a broad range of GPUs, including legacy AMD GPUs,
such as GCN, as well as Nvidia AIBs. FSR ran on RX 400 GPUs forward, Ryzen
APUs, and all Nvidia GPUs from the GTX 10 series on. DirectX 11 was the minimum
officially supported API, although it should be relatively straightforward to port FSR
to DirectX 9.
FSR consisted of two consecutive compute shaders: one shader did upscaling with
edge reconstruction, and another shader sharpened the resulting upscaled image to
extract pixel detail.
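Those two passes can be illustrated with a drastically simplified one-dimensional stand-in. The real shaders (AMD calls them EASU and RCAS) perform edge-adaptive reconstruction and robust contrast-adaptive sharpening; this sketch replaces them with linear interpolation and a basic unsharp step:

```python
# A drastically simplified one-dimensional stand-in for FSR's two
# compute passes. Pass 1 upscales (plain linear interpolation here;
# the real EASU pass is edge-adaptive), and pass 2 sharpens local
# contrast (standing in for RCAS). Purely illustrative.

def upscale_pass(row, factor=2):
    """Pass 1: reconstruct a higher-resolution signal from the source."""
    out = []
    for i in range(len(row) - 1):
        for k in range(factor):
            t = k / factor
            out.append(row[i] * (1 - t) + row[i + 1] * t)
    out.append(row[-1])
    return out

def sharpen_pass(row, amount=0.5):
    """Pass 2: push each sample away from its neighbors' average."""
    out = [row[0]]
    for i in range(1, len(row) - 1):
        local_avg = (row[i - 1] + row[i + 1]) / 2
        out.append(row[i] + amount * (row[i] - local_avg))
    out.append(row[-1])
    return out

low_res = [0.0, 0.0, 1.0, 1.0]                    # a soft edge
high_res = sharpen_pass(upscale_pass(low_res))    # upscaled, crisper edge
```

Even in this crude form the structure shows why the order matters: the upscale pass softens edges, and the sharpening pass restores local contrast around them, which is the behavior the text describes for the full-resolution FSR pipeline.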
As the FSR pipeline diagram (Fig. 7.21) indicates, processed and anti-aliased data
were fed to the FSR. That data could be from any conventional game, including 2D
games or a ray-traced game or application.
Ray-traced games could use AMD’s ray tracing hardware (discussed elsewhere
in this chapter)—however, it must be pointed out that ray tracing was not required,
which is one reason AMD could offer the FSR software for any GPU and multiple
versions of DirectX.
7.4.3 Summary
Nvidia brought real-time ray tracing (RTRT) to the forefront in 2018 and lit up
everyone’s imagination. By 2021, ray tracing ran on the Xbox series X, the PS5,
AMD’s RX6000 and Pro AIBs, and all Nvidia’s RTX AIBs. At that time, over 60
games incorporated RTRT—with more coming. RTRT was forecasted to be available
in 2025, based on Moore’s law, but AI and clever intersection work by AMD and
Nvidia pulled that date in by seven years—which was startling, remarkable, and
very welcome. With ray tracing, games simply felt better and more realistic, with
fewer to no artifacts to distract the player. Players were more immersed and genuinely
part of the game. These were exciting times, and they were going to get better.
AMD offered FSR, which improved image quality and performance by rendering
frames at a lower resolution and then using a spatial upscaling algorithm with a
sharpening filter; it was not a direct competitor to Nvidia’s AI-based DLSS.
FSR was a post-process shader, which made it easy for game developers to imple-
ment across various AIBs. FSR was not the equivalent of DLSS—and especially not
DLSS 2.0—but FSR was a good non-AI, non-temporal upscaler that offered good-
quality performance at a low price.
FSR was more than just a Lanczos implementation [21] plus sharpening,
and it offered free performance improvements with a minimal effect on the visual
quality.
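For reference, the Lanczos kernel mentioned above is a windowed sinc. A minimal implementation with a = 2 lobes (a common choice; FSR's actual filter is related but not identical):

```python
import math

# The Lanczos kernel referred to above: sinc(x) windowed by sinc(x/a),
# zero outside |x| < a. Two lobes (a = 2) is a common choice.

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def lanczos(x, a=2):
    """Windowed-sinc reconstruction kernel with 'a' lobes."""
    return sinc(x) * sinc(x / a) if abs(x) < a else 0.0

def resample(samples, pos, a=2):
    """Evaluate a signal at fractional position pos via Lanczos weights."""
    lo = max(0, math.floor(pos) - a + 1)
    hi = min(len(samples) - 1, math.floor(pos) + a)
    return sum(samples[i] * lanczos(pos - i, a) for i in range(lo, hi + 1))
```

The negative lobes of the kernel are what give Lanczos resampling its characteristic edge crispness relative to plain bilinear filtering, and also what the sharpening-artifact cleanup pass has to contend with.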
To make the next leap in its growth, Innosilicon decided to use Imagination’s
GPU, its own memory manager, and tensor cores to build a high-end GPU/AIB,
which it named Fantasy One.
In October 2020, Imagination and Innosilicon announced a collaboration. In late
2021, Imagination confirmed it had licensed its BXT design to Innosilicon and
claimed it would deliver up to six TFLOPS of single-precision compute power.
Imagination’s specifications were slightly higher than what Innosilicon quoted for the
Fantasy One GPU.
Fantasy One had nine GPU blocks with an undeclared number of cores, which could
be as high as 32 per block. The company did not disclose the process technology
for the Fantasy One GPU. There could be one of two reasons for that: either it was
embarrassed by the fab it got, or the fab deal had not been closed yet. However,
it did have chips and was building AIBs.
The company showed four products at a press event in China in December 2021,
including a dual-GPU AIB and single-GPU AIBs.
The type A AIB was a consumer/workstation board featuring a multi-chip (chiplet)
single Fantasy One GPU design.
According to Innosilicon at the time, the GPU could deliver a fill rate of 160
GPixel/s and up to five TFLOPS of single-precision compute power. The AIB
had HDMI 2.1, DisplayPort 1.4, and VGA outputs. The AIB had up to 16 GB of
GDDR6(X) memory (using Innosilicon’s PHY) with a 128-bit interface. At the time,
Nvidia had been the only user of Micron’s G6X technology. Innosilicon developed its
own PAM4 signaling and was able to squeeze up to 19 Gbps per pin out of its
GDDR6X implementation. However, the 128-bit memory bus limited the
bandwidth, which could theoretically reach 304 GB/s (somewhere between a Radeon
RX 6700 XT and a 6600 XT).
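The quoted bandwidth checks out with simple arithmetic: 128 bits per transfer at 19 Gbps per pin, divided by eight bits per byte:

```python
# Checking the bandwidth figure quoted above: a 128-bit interface at
# 19 Gbps per pin, with 8 bits per byte.

bus_bits = 128
gbps_per_pin = 19
bandwidth_gbs = bus_bits * gbps_per_pin / 8   # = 304 GB/s, matching the text
```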
The small orange AIB shown in Fig. 7.22 was an entry-level AIB based on the
Fantasy One GPU and had less memory than the dual-fan solutions (Fig. 7.23).
The Type B could be a fanless design or a triple-fan as in Fig. 7.24, and was a dual-
GPU Fantasy One GPU design connected by an Innolink interface. The company
claimed it could hit up to ten TFLOPS of computing power and 320 GPixel/s fill rates
with two GPUs. The AIB offered 32 1080p/60 fps streams or 64 streams at 720p/30
fps. It featured up to 32 GB of GDDR6(X) memory via dual 128-bit interfaces from
each GPU. All the AIBs included a PCI-Express 4.0 interface at full x16 width.
Notice there is no power connector on the top. The typical power consumption
was only 20 watts.
The AIBs supported OpenGL, OpenGL ES, OpenCL, Vulkan, and DirectX,
though the company didn’t reveal which version of DirectX.
In December 2021, the company said it was working on the next-generation
Fantasy 2 and 3 GPU families and would unveil them in 2022. Innosilicon planned
to utilize 5 nm process technology for those GPUs.
Innosilicon said its Innolink IP chiplet solution allowed “massive amounts of
low-latency data to pass seamlessly between smaller chips as if they were all on the
same bus.” In other words, it was a chiplet design, defined as independent functional
blocks making up a large chip (Fig. 7.25).
Fig. 7.23 An HDMI, display port with a VGA connector on the back of the AIB. Courtesy of
Innosilicon
The Innolink delivered 56 Gbps per pair with 30 dB insertion loss. The company said
it was scalable to 4/8/16/32/64/128 lanes, PHY-independent, and had a very low
power mode.
It’s unlikely the company will do a specific miner product based on Imagination’s
IP. Instead, Innosilicon said it was focused on the data center, then desktop, then
laptop (Fig. 7.26).
The data center was the priority, then the desktop. The performance range varied:
FP32 performance was five to six TFLOPS with a 160 Gpixel/s fill rate (AMD’s
RX 6600 was capable of seven TFLOPS, and Nvidia’s GTX 1660 Super could reach
a hair over five TFLOPS, so Innosilicon landed between those).
The AI (INT8) performance was 25 TOPS, with up to 16 GB of GDDR6(X)
memory at 19 Gbit/s connected to the 128-bit memory interface, for a memory
bandwidth of 304 GB/s.
The desktop graphics AIB was PCIe 4.0 x16. The typical power consumption
was only 20 watts.
Two GPUs were connected via Innolink in the desktop and server. And, as
mentioned, the performance was projected to be ten TFLOPS and 320 Gpixel/s
and the memory 32 GB. Typical power consumption was fifty watts.
7.5.2 Summary
Innosilicon officially joined the other Chinese GPU makers (or promisers) Jinga,
MetaX, XianDiXian, and Zhaoxin. But Innosilicon was the first company in 22 years
to use Imagination IP to build a PC graphics processor, the last being NEC. Apple was
the first to use Imagination in a mobile phone built using PowerVR. Over the years,
Imagination demonstrated and improved on its tiling engine. And the company
always offered a lot of pixel GFLOPS for not too many milliwatts. It was exciting to
see Innosilicon take Imagination back to mainstream PC and into the data center. The
company always felt like it belonged there and was just looking for the right partner.
It could be speculated that if Canyon Bridge had not acquired Imagination, the deal
with Innosilicon might not have come about. It is also ironic that Apple’s public and
almost cruel abandonment of Imagination, and its poaching of Imagination employees,
led to a drop in the company’s valuation, making it a takeover target. And now
Apple is again a client of Imagination but will never have the stranglehold on the
company it once did.
It’s unlikely Innosilicon will push out of China for a while. China is a plenty big
market, and there are no cultural barriers to cross or associated expenses. However,
it also means AMD, Intel, and Nvidia will have more headwinds trying to penetrate
further into the Chinese GPU market, especially the data center, since three Chinese
companies have declared that they will own it.
China wants to be self-sufficient and is building the foundations to make that happen
with Western consumer dollars.
7.6 Conclusion
When the integrated T&L engine launched the first era of GPUs, the shaders were
semi-fixed function and would sit idle while other semi-fixed-function shaders might
be overburdened. The second era of GPUs introduced programmable shaders, one
step closer to the ultimate all-compute GPU. Unified shaders arrived in the third era
of the GPU, and the fourth era introduced compute shaders. The fifth era brought us
ray tracing and AI.
The sixth era was marked by the introduction of DirectX 12 Ultimate in
November 2020. It brought a flood of new compute techniques and aligned the PC
with game consoles, which greatly enhanced and sped up game development. DX12U
introduced mesh shaders and sampler feedback. Mesh shading would transform
the detail and quality of computer images without penalizing performance.
References
11. Peddie, J. Intel unveils Xe-architecture-based discrete GPU for HPC, (November 17, 2019),
https://www.jonpeddie.com/report/intel-unveils-xe-architecture-based-discrete-gpu-for-hpc/
12. Peddie, J. Intel launches hybrid notebook processor: 3D stacking and very low power hallmarks
of new, (June 11, 2020), https://www.jonpeddie.com/report/intel-launches-hybrid-notebook-
processor/
13. Peddie, J. Intel’s stacked chip is sexy, (February 12, 2020), https://www.jonpeddie.com/report/
intels-stacked-chip-is-sexy/
14. Rogoway, M. Intel promises another decade of Moore’s Law as it strives to reconnect with
‘geeks,’ The Oregonian/OregonLive, (October 27, 2021), https://www.oregonlive.com/silicon-forest/2021/10/intel-promises-another-decade-of-moores-law-as-it-strives-to-reconnect-with-geeks.html
15. Koduri, R. @Rajaontheedge, Xe-HPG (DG2) real candy - very productive time at the
Folsom lab couple of weeks ago (June 1, 2021), https://twitter.com/Rajaontheedge/status/1399966271182045184
16. Koduri, R. Intel Delivers Advances Across 6 Pillars of Technology, Powering Our Leadership
Product Road map, (August 13, 2021), https://www.intel.com/content/www/us/en/newsroom/
opinion/advances-across-6-pillars-technology.html
17. Peddie, J. The Arc of the story—Intel brands its GPU, TechWatch, (August 20, 2021), https://
www.jonpeddie.com/report/the-arc-of-the-alchemist
18. Saleh, S.J., Kazakov, M.V., Goe, V. Texture Processor Based Ray Tracing Acceleration Method
and System, US 2019/0197761, (June 27, 2019), https://www.freepatentsonline.com/20190197761.pdf
19. GPUOpen, https://gpuopen.com/
20. AMD. With AMD FidelityFX Super Resolution, AMD Brings High-Quality, High-Resolution
Experiences to Gamers Worldwide [Press release]. (June 22, 2021), https://ir.amd.com/news-
events/press-releases/detail/1011/with-amd-fidelityfx-super-resolution-amd-brings
21. Lanczos resampling, https://en.wikipedia.org/wiki/Lanczos_resampling
22. GPU GDDR6/6X PHY & Controller, Innosilicon, https://www.innosilicon.com/html/ip-solution/14.html
Chapter 8
Concluding Remarks
The goal of computer graphics has been to present data in an informative and realistic
way. Molecular modeling was one of the first applications to present data and use
graphics to gain new insights and see problems. CAD was the biggest impetus
to CG, and then games and movies. GPU compute, although used in visualization
calculations, is not considered a CG application for this discussion.
The goal of visualization and simulation used for digital modeling and entertain-
ment has been to create an image that is indistinguishable from reality—to remove
any feeling of disbelief or discomfort.
In the case of engineering and design, ray tracing with global illumination has
been the Holy Grail, and efforts to speed it up and make it a real-time tool have been
underway since Whitted’s breakthrough development that helped accelerate ray tracing
in 1979 [1]. Real-time ray tracing was realized in 2014 by Imagination Technologies
[2] and again, most famously, in 2018 by Nvidia [3].
The beauty of real-time ray tracing was brought to games, and some of the best
showcase examples were racing games with beautiful car models (Fig. 8.1).
Traditionally, one of the biggest challenges in computer graphics has been
rendering and simulating humans. Getting a lifelike, believable human image has
been a sought-after objective for decades. Image quality improved from crude cartoon
characters in games like Doom in the early 1990s and Lara Croft in Tomb Raider in
1996 (Fig. 8.2), to refined and believable images beginning in 2001 with the animated
movie Final Fantasy: The Spirits Within and Tomb Raider 2013 (Fig. 8.3). Skin, eyes,
and hair were the major challenges, followed by natural mechanics of movement. In
2020, facial realism powered by GPUs took another giant step as real people were
modeled for games such as Death Stranding. And then in 2022, Unity applied new
techniques, including ray tracing, after acquiring Weta. More realistic eyes
with caustics on the iris, a new hair system, and a new skin shader with peach fuzz
and wrinkle maps were added (Fig. 8.4).
As GPUs scaled up in the number of shaders, added AI matrix math cores, and
enlarged their memory size while increasing clock speeds, APIs expanded to mesh
shaders and ray tracing, and software developers saw opportunities to do things that
were previously only possible on big computers running for days.
Fig. 8.1 Slick car from Forza Horizon 5. Courtesy of Xbox Game Studios
Fig. 8.2 Game characters in the 1990s: Doom and Tomb Raider. Source Wikipedia
Fig. 8.3 Final Fantasy 2001 and Tomb Raider 2013. Courtesy of Wikipedia and Crystal Dynamics
Fig. 8.4 Death Stranding 2020 and enemies. Courtesy of Sony Interactive Entertainment and Unity
Fig. 8.5 Computational fluid dynamics is used to model and test in a computer to find problems
and opportunities. Courtesy of Siemens
In the movie world, computer graphics have brought dead actors back to life,
including well-known people such as Steve McQueen, Carrie Fisher, and Peter
Cushing—a challenge complicated by the close relationships audiences have with
these actors.
In the areas of engineering, CAD has been a mainstay user of GPUs and
rendering. In areas of simulation, finite element analysis (FEA) and computational
fluid dynamics (CFD) have been a challenge (Fig. 8.5).
References
1. Whitted T. (1979) An improved illumination model for shaded display. Proceedings of the 6th
annual conference on Computer Graphics and Interactive Techniques
2. Triggs, R. Imagination shows off real time ray tracing demos at MWC, (February 23, 2016).
https://www.androidauthority.com/imagination-ray-tracing-demo-mwc-2016-675829/
3. Takahashi, D. Nvidia unveils real-time ray tracing Turing graphics architecture, (August
13, 2018). https://venturebeat.com/2018/08/13/nvidia-unveils-real-time-ray-tracing-turing-graphics-architecture/
Appendix A
Acronyms
Common acronyms used in this book and the computer graphics industry (product
names are not included)
VF Vertex fetch
VPU Vector processing unit
ALLM Auto low latency mode
NURBS Non-uniform rational basis spline
VoIP Voice Over Internet Protocol
AHB Advanced high-performance bus (Arm)
APB Advanced Peripheral Bus
ATW Asynchronous time warp
BREW Binary Runtime Environment for Wireless
CDMA Code division multiple access
CIF Common intermediate format
DLA Deep learning accelerator
FMA Fused multiply-accumulate
GSM Global System for mobile communication
IMGIC Imagination Image Compression
IoT Internet of things
MID Mobile Internet devices
MIPI Mobile Industry Processor Interface
OMAPI Open Mobile Application Processor Interfaces
PBR Physically based rendering
SKU Stock keeping unit
AXI Advanced eXtensible Interface
BRDF Bidirectional reflectance distribution function
BSD Berkeley Source Distribution
COTSS Commercial-off-the-shelf semiconductors
CSP Chip-scale package
DPC Data-parallel C
DVFS Dynamic voltage and frequency scaling
GCC GNU Compiler Collection
HPM High-performance mobile (TSMC process)
IDM Integrated device manufacturing
MDFI Multi-die fabric interconnect (Intel)
MIG Multi-instance GPU
MUL Multiply
NNP Neural network processors
OCP Open Core Protocol
RAC Ray acceleration cluster
TBps Terabytes per second
UOS Unity Operating System
WinTel Microsoft Windows and Intel systems
XMX Xe Matrix eXtensions
AIB Advanced Interface Bus (Intel)
DLAA Deep learning anti-aliasing
DS Domain Shader
MPS Multi-process service
PAM Pulse amplitude modulation
QoS Quality of service
TIPS Tensor instructions per second
TOPS Tensor operations per second
TSV Through silicon vias
VDI Virtual desktop infrastructure
VGPR Vector general-purpose registers
VS Vertex shader
XESS Xe supersampling
Appendix B
Definitions
Does anyone really read a glossary? Hopefully yes. They take a lot of time and
research to write and can inform, clear up ambiguities, and even cause some people
to change their perspective. The trick is knowing what to put in and what to leave out.
Terms
Throughout this book, specific terms will be used that assume the reader understands
and is familiar with the industry.
• Model—The marketing name for a GPU assigned by the manufacturer, for
example, AMD’s Radeon, Intel’s Xe, and Nvidia’s GeForce. A model can also be
a 3D object. For example, the design of a car is a 3D model.
• Code name—The GPU manufacturer’s engineering code name for the device.
• CAD—Computer-aided design.
• CAE—Computer-aided engineering.
• CAGR—Compound annual growth rate.
• CFD—Computational fluid dynamics.
• CGI—Computer-generated imagery.
• Launch—There is no standard. It can be the date the GPU first shipped, or the
date of the announcement.
• Architecture—The name of the design, the microarchitecture used for the GPU.
It, too, will be a proper noun such as AMD’s Radeon DNA or Nvidia’s Hopper.
• Fab—The fabrication process. The average feature size of the transistors in the
GPU expressed in nanometers (nm).
• Transistors—The number of transistors in the GPU or chip.
• Die size—The area of the chip, typically measured in square millimeters (mm²).
• Core clock—The GPU’s reference or base frequency (and boost if available) is
expressed in MHz or GHz.
• Fill rate
– Pixel—The rate at which the raster operators can render pixels to a display,
measured in pixels/s.
– Texture—The rate at which the texture-mapping units can map surfaces onto
a polygon mesh, measured in texels/s.
• Performance
– Shader operations—How many operations the pixel shaders (or unified
shaders) can perform, measured in operations/s.
– Vertex operations—The number of operations processed on the vertex shaders
in Direct3D 9.0c and older GPUs, expressed in vertices/s.
• Memory
– Bus width—The bit width of the memory bus.
– Size—Size of the graphics memory expressed in gigabytes (GB).
– Clock—The reference or base frequency of the memory clock, expressed in
MHz or GHz.
– Bandwidth—The maximum rate of data transfer across the memory, expressed
in megabytes or gigabytes per second (MB/s or GB/s).
• TDP (thermal design power)—The maximum heat generated by the GPU,
expressed in watts.
• TBP—The typical AIB power consumption, measured in watts.
• Bus interface—The connection that attaches the graphics processor to the system
(typically an expansion slot, such as PCI, AGP, or PCIe).
• API support—Rendering and computing APIs supported by the GPU and the
driver.
• Image generation—The image generation stage in a GPU where the final,
displayed pictures are created before being sent to the screen. It is where the
user engages with the results of the entire system. In the case of movies or TV, it
is passive. In the case of computers, it is interactive, such as playing a game. In
the case of interactive images, they can be for content creation work or content
consumption. There is a relentless demand and need for high quality, fast response,
and image generation in all cases.
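The memory figures above combine arithmetically: peak bandwidth follows from the bus width and the effective (data-rate) memory clock. A minimal sketch, with illustrative figures not tied to any particular GPU:

```python
def memory_bandwidth_gbps(bus_width_bits: int, effective_clock_mhz: float) -> float:
    """Peak memory bandwidth in GB/s: bytes per transfer times transfer rate.

    effective_clock_mhz is the effective data rate; a GDDR6 device quoted
    at 14 Gbps per pin corresponds to 14,000 MHz effective.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_clock_mhz * 1e6 / 1e9

# Illustrative: a 256-bit bus at a 14 Gbps/pin effective rate
print(memory_bandwidth_gbps(256, 14000))  # → 448.0 GB/s
```

The same arithmetic, run in reverse, is how spec-sheet bandwidth numbers such as the 304 GB/s cited earlier in the chapter are derived.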
Terminology and conventions change over time.
2D—Two dimensional, used to refer to “flat” graphics which have only two axes,
X and Y, along which drawing occurs, such as those used in normal Windows
applications. Includes drawing functions such as line drawing, BitBlts, text
display, and polygons. The most common form of computer graphics, since displays are
2D as well.
3D—Three dimensional, used to refer to the rendering/display of graphics which
are 3D in nature (i.e., exist along three axes, X, Y, and Z). In existing PC graphics
systems, this 3D data needs to be rendered into a 2D surface, namely the display.
This is something that graphics chips that offer 3D acceleration specialize in, offering
features such as 3D lines, texture mapping, perspective correction, alpha blending,
and color interpolation for smooth shading (used in simulating lighted scenes).
and quality, resulting in a minimal increase in image quality (IQ) compared to the
same scene without AO.
Anaglyph 3D—Unrelated to 3D. This is a method of simulating a depth image on
a flat 2D display by overlaying colored images representing the view from left and
right eyes, then filtering the image presented to each eye through an appropriately
colored lens.
Anisotropic filtering (AF)—A method of enhancing the image quality of textures
on surfaces that are at oblique viewing angles with respect to the camera, where
the projection of the texture (not the polygon or other primitive on which it is
rendered) appears to be non-orthogonal. Hence the origin of the word: “an” for not,
“iso” for same, and “tropic” from tropism, relating to direction; anisotropic
filtering does not filter the same in every direction.
API—Acronym for application programming interface. A series of functions
(located in a specialized programming library), which allow an application to perform
certain specialized tasks. In computer graphics, APIs are used to expose or access
graphics hardware functionality in a uniform way (i.e., for a variety of graphics hard-
ware devices) so that applications can be written to take advantage of that function-
ality without needing to completely understand the underlying graphics hardware,
while maintaining some level of portability across diverse graphics hardware. Exam-
ples of these types of APIs include OpenGL and Microsoft’s Direct3D. An API is
a software program that interfaces an application (Word, Excel, a game, etc.) to the
GPU as well as the CPU and operating system of the PC. The API informs the appli-
cation of the resources available to it, which is called exposing the functionality. If
a GPU or CPU has certain capabilities and the API doesn’t expose them, then the
application will not be able to take advantage of them. The leading graphics APIs
are DirectX and OpenGL.
APU—The AMD Accelerated Processing Unit (APU), formerly known as Fusion,
is the marketing term for a series of 64-bit microprocessors from Advanced Micro
Devices (AMD), designed to act as a central processing unit (CPU) and graphics
accelerator unit (GPU) on a single chip.
ARIB STD-B67—Hybrid Log Gamma (HLG) is a high-dynamic-range (HDR)
standard that was jointly developed by the BBC and NHK. HLG defines a nonlinear
transfer function in which the lower half of the signal values uses a gamma curve
and the upper half of the signal values uses a logarithmic curve.
ASP—Average selling price.
Aspect ratio—The ratio of width to height of computer and TV screens, video,
film, or still images. Nearly all TV screens are 4:3 aspect ratio. Digital TVs are
moving to widescreen, which is 16:9 aspect ratio.
Attach rate—An attach rate (also called an attach ratio) measures how many add-
on products are sold with each unit of the basic product or platform and is expressed
as a percentage.
AU, Arithmetic Unit—The circuits in a microprocessor where all arithmetic
instructions are carried out. Often found in combination with separate logic and
other units, controlled by a long, or very long, instruction word.
several thousand color images of a material sample taken at different camera and
light positions.
Bilinear filtering—When a small texture is used as a texture map on a large surface,
stretching will occur and large blocky pixels will appear. Bilinear filtering smooths
out this blocky appearance by applying a blur.
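The “blur” is a weighted average of the four nearest texels. A minimal sketch of one bilinear sample, assuming a single-channel texture stored as a 2D list with clamped edges (real hardware also handles wrapping modes and multiple channels):

```python
def bilinear_sample(tex, u: float, v: float) -> float:
    """Sample a 2D list `tex` at fractional coordinates (u, v).

    Blends the four surrounding texels, weighted by the fractional
    distance to each, which smooths the blocky look of
    nearest-neighbor magnification.
    """
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0
    x1 = min(x0 + 1, len(tex[0]) - 1)   # clamp at the texture edge
    y1 = min(y0 + 1, len(tex) - 1)
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bottom = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

texture = [[0.0, 1.0],
           [1.0, 0.0]]
print(bilinear_sample(texture, 0.5, 0.5))  # midpoint of all four texels → 0.5
```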
Binary—A counting system in which only two digits exist, “0” and “1.” Also
known as the base-2 counting system. Each digit represents an additive magnitude
of a power of 2, based on its position, with the rightmost digit representing 2 to the
0th power (2⁰), the next digit representing 2 to the 1st power (2¹), etc. For example,
the binary number 1001B converts to a decimal or base-10 number as follows: 1×2³
+ 0×2² + 0×2¹ + 1×2⁰ = 8 + 0 + 0 + 1 = 9. The binary system is the basis for all
digital computing.
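The worked example (1001B → 9) can be checked mechanically; a minimal sketch of the positional sum:

```python
def binary_to_decimal(bits: str) -> int:
    """Sum each digit times its power of two; the rightmost digit is 2**0."""
    value = 0
    for position, digit in enumerate(reversed(bits)):
        value += int(digit) * 2 ** position
    return value

print(binary_to_decimal("1001"))  # → 9, matching 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0
```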
Binary digits—The numbers “0” and “1” in the binary counting system. Also
called a bit.
Binary Notation—In various graphics hardware reference documents, as well as
in some programming languages, it’s common to see binary numbers (a combination
of binary digits) listed as the binary digits followed by the letter “B” or “b,” as in the
example listed under the term “binary.”
Binary Units—One or more bits.
Binning—A sorting process in which superior-performing chips are sorted
from specified and lower-performing chips. It can be used for CPUs, GPUs
(graphics cards), and RAM. The manufacturing process is never perfect, especially
given the incredible precision necessary and the number of transistors in GPUs
and other semiconductors. Manufacturing high-performance and expensive GPUs
results in some parts that cannot run at the specified frequencies. Those parts,
however, may be able to run at slower speeds and can be sold as less expensive
GPUs.
Bit—Acronym derived from the term “Binary digIT” (see definition above).
Bit-depth-BPP—See bits per-pixel and BPP.
Bitmap—A bitmap image is a dot matrix data structure that represents a generally
rectangular grid of pixels (points of color), viewable via a monitor, paper, or other
display medium. A bitmap is a way of describing a surface, such as a computer screen
(display) as having several bits or points that can be individually illuminated and at
various levels of intensity. A bit-mapped 4k monitor would have over 8-million bits
or pixels.
Bits per channel—See bits per-pixel.
Bits per-pixel—Bits per channel are the number of bits used to represent one of
the color channels (red, green, blue). The “bit depth” setting when editing images
specifies the number of bits used for each color channel—bits per channel (BPC).
The human eye can only discern about 10 million different colors. An 8-bit neutral
(single color) gradient can only have 256 different values, which is why similar tones
in an image can cause artifacts. Those artifacts are called posterization. A 16-bit
setting (BPC) would result in 48 bits per-pixel (BPP), for 2⁴⁸ available pixel
values.
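The bits-to-values arithmetic above is simple powers of two; a minimal sketch (function names are illustrative):

```python
def levels_per_channel(bits_per_channel: int) -> int:
    """Distinct values one color channel can represent."""
    return 2 ** bits_per_channel

def total_pixel_values(bits_per_channel: int, channels: int = 3) -> int:
    """Distinct pixel values across all channels, e.g. 16 BPC * 3 = 48 BPP."""
    return 2 ** (bits_per_channel * channels)

print(levels_per_channel(8))    # → 256 values in an 8-bit gradient
print(total_pixel_values(16))   # → 281474976710656 (2**48) values at 48 BPP
```

The small count of 256 levels per 8-bit channel is exactly why smooth gradients posterize.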
has the fastest possible access to the data it needs at any one time. Many processors
have a hierarchy of progressively larger and slower on-chip caches in an attempt to
match the speed and data locality requirements of the processor with the external
DRAM array. These are referred to as Level one (L1), Level two (L2), etc.
Calligraphic display—See vector scope.
Chipset—Typically, a pair of chips that manage the data flows and traffic between
the system memory, CPU, disk drives, keyboard and mouse, and various I/O ports
(e.g., USB, Ethernet, etc.)—see southbridge and northbridge.
Chrominance—Chrominance (chroma or C for short) is the signal used in video
systems to convey the color information of the picture, separately from the accom-
panying luma signal (or Y for short). Chrominance is usually represented as two
color-difference components: U = B' − Y' (blue − luma) and V = R' − Y' (red
− luma). Each of these difference components may have scale factors and offsets
applied to it, as specified by the applicable video standard.
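The color-difference components can be sketched directly from the formulas above. This example assumes the Rec. 601 luma weights; other standards use different coefficients, scale factors, and offsets:

```python
def chrominance(r: float, g: float, b: float):
    """Return (Y', U, V) for gamma-corrected RGB values in [0, 1].

    Luma Y' is a weighted sum of R', G', B' (Rec. 601 weights here);
    the chroma components are the blue and red differences from luma.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = b - y   # U = B' - Y' (blue minus luma)
    v = r - y   # V = R' - Y' (red minus luma)
    return y, u, v

# A pure gray carries no color information: both differences are (near) zero.
print(chrominance(0.5, 0.5, 0.5))
```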
Complementary metal–oxide–semiconductor (CMOS) sensor—A CMOS sensor
is an array of active pixel sensors in complementary metal–oxide–semiconductor
(CMOS) or N-type metal-oxide-semiconductor (NMOS, Live MOS) technologies.
Clamp—A clamp is a device which takes an input and produces an output which
is bounded. A traditional clamp will have two or three different inputs: the signal or
number to be clamped; the upper bound to clamp to; and possibly a lower bound to
clamp to. When the signal/numeric input to be clamped is received, it is compared
against the upper bound and, if it exceeds it, is replaced by the upper bounding value.
Similarly, if there is a lower bound, the signal/numeric input is compared, and if
found lower than the lower bound, it’s replaced with the lower bound. The result of
all the bounding is then passed on to the output of the device.
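The two-or-three-input behavior described above maps directly to code; a minimal sketch:

```python
def clamp(value, upper, lower=None):
    """Bound `value` to at most `upper` and, if given, at least `lower`."""
    if value > upper:
        value = upper                       # replaced by the upper bound
    if lower is not None and value < lower:
        value = lower                       # replaced by the lower bound
    return value

print(clamp(300, 255))      # → 255 (exceeds the upper bound)
print(clamp(-20, 255, 0))   # → 0 (below the lower bound)
print(clamp(128, 255, 0))   # → 128 (already in range)
```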
Clone mode—Duplicates the computer’s screen on the other monitor(s); it is
referred to as “Duplicate” in the multiple-displays pull-down menu. It can
be useful for presentations and sometimes to provide a different representation of
the same output.
Color—In current computer graphics systems, color display information is gener-
ated as a blend of three colored light components: red, green, and blue (RGB).
The combination of all three of these color components at full intensity produces a
white output, while the absence of all three produces black output. Blending these
three color components at different intensities can produce a near-infinite number
of distinct colors. While a display monitor tends to require each color component to
have a voltage from 0 (off) to 0.7 V (full intensity), a computer graphics subsystem
tends to deal with color in digital terms, on a pixel per-pixel basis. Each pixel has a
specific depth, also known as BPP. Each pixel, in the process of being displayed from
video memory, passes through a component called a RAMDAC. For 8 BPP or less,
the pixel value read from video memory is usually passed through the LUT portion
of a RAMDAC in order to produce the requisite RGB information. For greater than 8
BPP modes, pixels generally bypass the LUTs and go directly to the DACs. In order
to do this, such pixels must be defined with fixed possible ranges of RGB. Therefore,
it is standard that 15-bit pixels have 5 bits each of R, G, and B, with one bit left
unused; 16-bit pixels have 5 bits each of R and B and 6 bits of G; and 24 and 32-bit
pixels have 8 bits each of R, G, and B (with 8 bits unused in 32-bit pixels). 5 bits
gives 32 distinct intensity levels of a color component, 6 bits gives 64 levels, and 8
bits gives 256 intensity levels. It should be noted that pixel modes that go through
the LUT are called “indexed” color modes, while those that don’t are referred to as
“direct color” or “true color” modes.
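The 16-bit 5:6:5 layout described above can be expressed as bit packing. A sketch, assuming the common convention of red in the high bits (some hardware orders the fields differently):

```python
def pack_rgb565(r: int, g: int, b: int) -> int:
    """Pack 5-bit R, 6-bit G, 5-bit B into one 16-bit pixel (R high)."""
    return (r & 0x1F) << 11 | (g & 0x3F) << 5 | (b & 0x1F)

def unpack_rgb565(pixel: int):
    """Recover the (R, G, B) components from a 16-bit 5:6:5 pixel."""
    return (pixel >> 11) & 0x1F, (pixel >> 5) & 0x3F, pixel & 0x1F

pixel = pack_rgb565(31, 63, 31)   # full-intensity white
print(hex(pixel))                 # → 0xffff
print(unpack_rgb565(pixel))       # → (31, 63, 31)
```

The masks make the 32/64/32 intensity levels per component concrete: 5 bits allow values 0 to 31, and 6 bits allow 0 to 63.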
Color gamut—The entire range of colors available on a particular device such as
a monitor or printer. A monitor, which displays RGB signals, typically has a greater
color gamut than a printer, which uses CMYK inks. Also see gamut and wide color
gamut.
Color space—See color gamut and gamut.
Combine—The verb used to describe an operation in which two or more values
or signals are added or concatenated with each other in order to produce a combined
output.
Comparator—A comparator is a device which generally takes two inputs,
compares them, and based on the result of the comparison, produces a binary output
or signal to indicate the result of the comparison. For example, for a “greater-than”
comparator, the first input would be compared against the second input, and if the
first is larger, a TRUE (usually a binary 1) would be output.
Computational photography—Processing of still or moving images with the
objective of modifying, enhancing, or manipulating the images themselves.
Conformal rendering—Foveation that offers a smoothly varying transition from
the high-acuity region to the low-acuity region. Considered more efficient than
traditional foveated rendering because it requires fewer rendered pixels than other
techniques.
Conservative raster—Rasterization that marks a pixel as covered if any part of
the pixel intersects a primitive. Standard rasterization can fail to compute the desired
result: one green and one blue triangle may overlap geometrically, yet the standard
rasterization process does not detect this fact because no sampled pixel centers are
covered.
used in CD players to convert CD data into sounds. DACs are also a key component of
any graphics subsystem, since they convert the pixel values into colors on the screen.
Graphics boards typically use a device known as a RAMDAC, which combines DACs
with Look-Up Tables (LUTs). RAMDACs typically contain three LUTs and three
DACs, one each for the red, green, and blue color components of a pixel. See “color”
and “LUT” for more information.
DCI P3—DCI-P3 is a color space, introduced in 2007 by SMPTE. It is used
in digital cinema and has a much wider gamut than sRGB.
dGPU—The basic, discrete (stand-alone) processor that always has its own private
high-speed (GDDR) memory. dGPUs are used on AIBs and on system boards in
notebooks.
Desktop GPU segments—The desktop is segmented into five categories, and the
desktop discrete GPUs follow the same designations.
• Workstation
• Enthusiast
• Performance
• Mainstream
• Value
Device driver—A device driver is a low-level (i.e., close to the hardware) piece
of software which allows operating systems and/or applications to access hardware
functionality without actually having to understand exactly how the hardware oper-
ates. Without the appropriate device drivers, one would not be able to install a new
graphics board, for example, to use with Windows, because Windows wouldn’t know
how to communicate with the graphics board to make it work.
Dewarp—In vision systems, this refers to the process of correcting the spherical
distortion introduced by the optical components of the system. Especially where a
single camera is capturing a very wide field of view, significant distortion can be
present. This is usually, but not always, removed in the ISP before any significant
vision processing or further computational photography is done.
Direct3D—Also known as D3D, Direct3D is the 3D graphics API that’s part of
Microsoft’s DirectX foundation library for hardware support. Direct3D actually has
two APIs, one which calls the other (called Direct3D Retained Mode or D3D RM)
and hides the complexity of the lower level API (called Direct3D Immediate Mode
or D3D IM). Direct3D is becoming increasingly popular as a method used by games
and application developers to create 3D graphics, because it provides a reasonable
level of hardware independence, while still supporting a large variety of 3D graphics
functionality (see “3D”).
DisplayPort—DisplayPort is a VESA digital display interface standard for a
digital audio/video interconnect, between a computer and its display monitor, or
a computer and a home-theater system. DisplayPort is designed to replace digital
(DVI) and analog component video (VGA) connectors in the computer monitors and
video cards.
Dithering—Used to hide the banding of colors when rendering with a low number
of colors (e.g., 16 bits). Banding is what happens when there are not enough shades
of colors, resulting in the eye being able to see a distinct change of colors between
two shades. Dithering is also a way to visually simulate a larger number of colors
on a display monitor by interleaving pixels of more limited colors in a small grid or
matrix pattern, much in the way a magazine’s color pictures are actually composed
of small colored dots. Dithering takes advantage of the human eye’s capability to
blend regions of color. For example, if you could only display red and blue pixels,
but wanted to give the visual impression of purple, you would create a matrix of
interleaved red and blue pixels, as depicted using letters below (B = Blue, R = Red):
BRBRBRBR
RBRBRBRB
BRBRBRBR
RBRBRBRB
When viewed from a distance, the human eye would blend the red and blue pixels
in this pattern, making the area appear to be a shade of purple. This technique allows
one to simulate thousands of colors in exchange for a small loss in detail, even when
there are only 16 or 256 colors available for display, as might be the case when a
graphics subsystem is configured to display in an indexed color mode (see “color”).
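The interleaved grid shown above can be generated mechanically; a minimal sketch:

```python
def checker_dither(width: int, height: int) -> list:
    """Build the interleaved red/blue grid shown above: 'B' where the
    pixel's x + y coordinate sum is even, 'R' where it is odd."""
    return ["".join("B" if (x + y) % 2 == 0 else "R" for x in range(width))
            for y in range(height)]

for row in checker_dither(8, 4):
    print(row)
# BRBRBRBR
# RBRBRBRB
# BRBRBRBR
# RBRBRBRB
```

Viewed from a distance, the eye averages the alternating pixels toward purple, which is exactly the blend the text describes.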
DMCVT—Dynamic Metadata for Color Volume Transforms, SMPTE ST 2094.
Dolby Vision—12-bit HDR, BT.2020, PQ, Dolby Vision dynamic metadata.
DVI (Digital Visual Interface)—DVI is a VESA (Video Electronics Standards
Association) standard interface for a digital display system. DVI sockets are found
on the back panel of AIBs and some PCs and also on flat panel monitors and TVs,
DVD players, data projectors, and cable TV set-top boxes. DVI uses TMDS
signaling. DVI supports High-bandwidth Digital Content Protection,
which enforces digital rights management (see HDCP).
Dynamic contrast—The dynamic contrast shows the ratio between the brightest
and the darkest color, which the display can reproduce over time, for example, in the
course of playing a video.
EDF—Emissive Distribution Functions.
eGPU—An AIB with a dGPU located in a stand-alone cabinet (typically called a
breadbox) and used as an external booster and docking station for a notebook.
Electronic imaging—Electronic imaging is a broad term that defines a system of
image capture using a focusing lens with a sensor behind it to translate the
image into electronic signals. Those signals are then filtered, processed, and made
available for storage and/or display. A technique for inputting, recording, processing,
storing, transferring, and using images (ISO 12651-1). Using computers and/or
specialized hardware/software to capture (copy), store, process, manipulate, and
distribute “flat information” such as documents, photographs, paintings, drawings,
and plans, through digitization.
End-to-end latency—See Motion-To-Photon Latency.
Energy conservation—The concept of energy conservation states that an object
cannot reflect more light than it receives.
For practical purposes, more diffuse and rough materials will reflect dimmer and
wider highlights, while smoother and more reflective materials will reflect brighter
and tighter highlights.
Error correction model (ECM)—Belongs to a category of multiple time series
models most commonly used for data where the underlying variables have a long-
run stochastic trend, also known as co-integration. ECMs are a theoretically driven
approach useful for estimating both short-term and long-term effects of one time
series on another. The term error correction relates to the fact that the last period’s
deviation from a long-run equilibrium, the error, influences its short-run dynamics. Thus,
ECMs directly estimate the speed at which a dependent variable returns to equilibrium
after a change in other variables.
Extended mode—Extended mode creates one virtual display with the resolution
of all participating monitors. Depending on the hardware and software employed,
the monitors may have to have the same resolution (there’s more on this in the next
sections). Both of these modes present the display space to the user as a contiguous
area, allowing objects to be moved between, or even straddled across displays as if
they are one.
FEA—Finite element analysis.
Field of view—The field of view (also field of vision, abbreviated FOV) is the
extent of the observable world that is seen at any given moment. In case of optical
instruments or sensors, it is a solid angle through which a sensor detects the presence
of light.
Fixed function—Fixed function accelerator AIBs take some of the load off the
CPU by executing specific graphics functions, such as BitBlt operations and line
draws. That makes them better than frame buffers for environments that heavily
load the system CPU, such as Windows. Those types of AIBs have also been called
Windows and graphical user interface (GUI) accelerators.
A fixed function can also apply to the graphics pipeline, such as a T&L stage or
a tessellation stage.
Flat shading—A rendering method that determines brightness from the normal vector of a polygon and the position of the light source and shades the entire surface of the polygon with a single color of that brightness. This rendering method produces a clear
difference in the colors of adjacent polygons, making their boundary lines visible,
so it is unsuitable for rendering smooth surfaces.
Floating-point unit—An Arithmetic Unit which operates on floating-point data.
Most general-purpose floating-point units observe the IEEE 754 standard which
governs formats, precision, rounding, handling of exceptions, etc. Special purpose
AUs found in GPUs and other DSPs optimized for specific tasks do not always do
so and hence different results can be obtained for the same instructions executed on
different AUs. This is one of the challenges of heterogeneous computing.
[Figure: Plot of the sRGB standard gamma-expansion nonlinearity (red), and its local gamma value, slope in log–log space (blue).]
In most computer systems, images are encoded with a gamma of about 0.45
and decoded with a gamma of 2.2. The sRGB color space standard used with most
cameras, PCs, and printers does not use a simple power-law nonlinearity as above, but
has a decoding gamma value near 2.2 over much of its range. Gamma is sometimes confused with and/or improperly used as "gamut."
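The sRGB curve mentioned above is not a simple power law; it has a short linear segment near black and a power segment elsewhere. A small sketch using the standard IEC 61966-2-1 constants:

```python
def srgb_encode(c: float) -> float:
    """Linear light -> sRGB signal (gamma compression, ~1/2.2 overall)."""
    if c <= 0.0031308:
        return 12.92 * c          # linear segment near black
    return 1.055 * c ** (1 / 2.4) - 0.055

def srgb_decode(v: float) -> float:
    """sRGB signal -> linear light (gamma expansion, ~2.2 overall)."""
    if v <= 0.04045:
        return v / 12.92
    return ((v + 0.055) / 1.055) ** 2.4
```

Encoding then decoding a value round-trips to within floating-point error, which is why the pair is used as a matched camera/display transfer function.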
[Figure: Typical gamut map. The grayed-out horseshoe shape is the entire range of possible chromaticities, displayed in the CIE 1931 chromaticity diagram format.]
Gamut—The most common usage refers to the subset of colors which can be accurately
represented in a given circumstance, such as within a given color space or by a certain
output device.
Also see color gamut and wide color gamut.
GDDR—An abbreviation for graphics double data rate synchronous random-access memory. GDDR6, for example, is a modern type of synchronous graphics random-access
memory (SGRAM) with a high bandwidth (“double data rate”) interface designed
for use in graphics cards, game consoles, and high-performance computation.
Geometry engine—Geometric manipulation of modeling primitives and transfor-
mations are applied to the vertices of polygons, or other geometric objects used as
modeling primitives, as part of the first stage in a classical geometry-based graphic
image rendering pipeline, which is referred to as the geometry engine. Geometry
transformations were originally implemented in software on the CPU or a dedicated
floating-point unit, or a DSP. In the early 1980s, a device called the geometry engine
was developed by Jim Clark and Marc Hannah at Stanford University.
Geometry shaders—Geometry shaders, introduced in Direct3D 10 and OpenGL
3.2, generate graphics primitives, such as points, lines, and triangles, from primi-
tives sent to the beginning of the graphics pipeline. Executed after vertex shaders,
geometry shader programs take as input a whole primitive, possibly with adjacency
information. For example, when operating on triangles, the three vertices are the
geometry shader’s input. The shader can then emit zero or more primitives, which
are rasterized and their fragments ultimately passed to a pixel shader.
Global illumination—"Global illumination" (GI) is a term for lighting systems that model indirect lighting: light that reaches a surface after bouncing off other surfaces. Without indirect lighting, scenes can look harsh and artificial.
However, while light received directly is fairly simple to compute, indirect lighting
computations are highly complex and computationally heavy.
Gouraud shading—A rendering method that produces gradual color shading over the entire surface of a polygon by determining brightness with the normal vector at each vertex of the polygon and the position of the light source, then performing linear interpolation between vertices.
The normal vector at each vertex can be determined by taking an average of
the normal vectors of all the polygons having the common vertex. For a triangular
polygon, the brightness at each vertex is determined by the normal vector obtained
for each vertex and the position of the light source. Therefore, the brightness of pixels
inside a triangle is determined by interpolation. This rendering method represents gradual color variations between adjacent polygons, so it is suitable for rendering
smooth surfaces.
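The per-vertex brightness and interpolation steps above can be sketched as follows. The Lambert diffuse term and barycentric weights are standard, but this toy code is only an illustration, not production rasterizer code:

```python
import numpy as np

def lambert(normal, light_dir):
    # Diffuse brightness: clamped dot product of the unit normal
    # and the unit direction toward the light source.
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    return max(float(n @ l), 0.0)

def gouraud(vertex_normals, light_dir, bary):
    # Brightness is computed once per vertex, then linearly interpolated
    # across the triangle using barycentric weights (b0 + b1 + b2 = 1).
    per_vertex = [lambert(n, light_dir) for n in vertex_normals]
    return sum(w * b for w, b in zip(bary, per_vertex))
```

Because only three brightness evaluations are needed per triangle, interior pixels are cheap: they are pure interpolation.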
GPC—A graphics processing cluster (GPC) is a group, or collection, of specialized processors known as shaders, streaming multiprocessors, or stream processors. Organized as a SIMD processor, they can execute (process) the same instruction (program or kernel) simultaneously, or in parallel; hence, they are known as a parallel processor. (A shader is a computer program that is used to do shading: the production of appropriate levels of color within an image.)
GPU (graphics processing unit)—The GPU is the chip that drives the display
(monitor) and generates the images on the screen (and has also been called a Visual
Processing Unit or VPU). The GPU processes the geometry and lighting effects and
transforms objects every time a 3D scene is redrawn—these are mathematically intensive tasks, and hence the GPU has upwards of hundreds of floating-point processors (also called shaders or stream processors). Because the GPU has so many powerful
32-bit floating-point processors, it has been employed as a special purpose processor
for various scientific calculations other than display and is referred to as a GPGPU in
that case. The GPU has its own private memory on a graphics AIB which is called a
frame buffer. When a small (less than five processors) GPU is put inside a northbridge
(making it an IGP), the frame buffer is dropped and the GPU uses system memory.
The GPU has to be compatible with several interface standards including software
APIs such as OpenGL and Microsoft’s DirectX, physical I/O standards within the
PC such as Intel’s Accelerated Graphics Port (AGP) technology and PCI Express,
and output standards known as VGA, DVI, HDMI, and DisplayPort.
GPU compute (GPGPU—General-Purpose Graphics Processor Unit)—The term
“GPGPU” is a bit misleading in that general-purpose computing such as the type
an x86 CPU might perform cannot be done on a GPU. However, because GPUs
have so many (hundreds in some cases) powerful (32-bit) floating-point processors,
they have been employed in certain applications requiring massive vector operations
and mathematically intensive problems in science, finance, and aerospace.
The application of a GPU can yield several orders of magnitude higher performance
than a conventional CPU.
GPU Preemption—The ability to interrupt or halt an active task (a context switch) on a processor, replace it with another task, and then later resume the previous task. In the era of single-core CPUs, preemption was how multitasking was accomplished. Interruption in a GPU, which is designed for streaming
processing, is problematic in that it could necessitate a restart of a process and
thereby delay a job. Modern GPUs can save state and resume a process as soon as
the interruptive job is finished.
Graphics adapters—A graphics adapter is the device, subsystem, add-in board,
chip, or adapter used to generate a synthetic image and drive a display. It has been
called many things over the decades. Here are the names used in this book. The
differences may seem subtle, but they are used to differentiate one device from
another. For example, it is common to see the acronym GPU used when speaking or
writing about an add-in board. They are not synonyms, and a GPU is a component
of an AIB. That is not a pedantic diatribe. It would be like referring to an engine
or transmission to denote an entire automobile or truck. Part of the reason for the
misuse of terms is misunderstanding, another reason is the ease of speech (like calling
someone Tom instead of Thomas), and the third is that it is more fun and exciting to
use. People like to say GPGPU, an initialism for general-purpose GPU, as a shorthand notation for GPU compute. So, we cannot be the terminology police, but we can try
to clarify the differences. Generally, an acronym should be a pronounceable word.
Graphics controller—A graphics controller or graphics chip is a non-
programmable device designed primarily to drive a screen. More advanced versions
have some primitive drawing or shading graphic capabilities. The primary differ-
entiation between a controller and coprocessor or GPU is the programmable
capability.
Graphics coprocessors—Coprocessors (also written as co-processors) can serve as programmable processors, such as the Texas Instruments TMS34010 and TI TMS34020 series. Coprocessors can run all the graphics functions of an API and
display lists for applications such as CAD.
Graphics driver—A device driver is a software stack that controls computer graphics hardware and supports graphics rendering APIs; it may be released under a free and open-source software license. Graphics device drivers are written for specific
hardware to work within the context of a specific operating system kernel and to
support a range of APIs used by applications to access the graphics hardware. They
may also control output to the display, if the display driver is part of the graphics
hardware.
G-Sync—A proprietary adaptive sync technology developed by Nvidia aimed
primarily to eliminate screen tearing and the need for software deterrents such as
V-sync. G-Sync eliminates screen tearing by forcing a video display to adapt to the frame rate of the outputting device rather than the other way around. Traditionally, the display could refresh halfway through the process of a frame being output by the device, resulting in two or more frames being shown at once.
HLG—Hybrid Log Gamma Transfer Function for HDR signals (ITU-R BT.2100).
HLG defines a nonlinear transfer function in which the lower half of the signal values
uses a gamma curve (SD and HD) and the upper half of the signal values uses a
logarithmic curve. HLG is backwards compatible with SDR.
HPU—(Heterogeneous Processor Unit)—An integrated multi-core processor
with two or more x86 cores and four or more programmable GPU cores.
Hull Shaders—See tessellation shaders.
IGP (integrated graphics processor)—An IGP is a chip that is the result of inte-
grating a graphics processor with the northbridge chip (see northbridge and chipset).
An IGP may refer to enhanced video capabilities, such as 3D acceleration, in contrast
to an IGC (integrated graphics controller) that is a basic VGA controller. When a
small (less than five processors) GPU is put inside a northbridge (making it an IGP),
the frame buffer is dropped and the GPU uses system memory; this is also known as
a UMA—unified memory architecture.
iGPU—A scaled-down GPU with fewer shaders (processors) than a discrete GPU; it shares local RAM (DDR) with the CPU.
Image sensor—An image sensor, photo-sensor, or imaging sensor is a device,
which detects the presence of visible light, infrared transmission (IR), and/or ultra-
violet (UV) energy. That information constitutes an image. It does so by converting
the variable attenuation of waves of light (as they pass through or reflect off objects)
into electrical signals. Image sensors are used in electronic imaging devices of both
analog and digital types, which include digital cameras, camera modules, medical
imaging equipment, and night vision equipment such as thermal imaging devices,
radar, sonar, and others. The digital image sensor is an integrated circuit chip which has an array of light-sensitive components on its surface. The array is formed by the individual photosensitive points. Each photosensitive point inside the image circle converts the light to an electrical signal. The full set of electrical signals is converted into an image by the on-board computer.
ISP, Image Signal Processor—An ISP refers to a processing unit which accepts
as input the raw samples from an imaging sensor and converts them into a human-
viewable image. The samples may have undergone some preprocessing by the sensor
circuitry to abstract certain details of the sensor operation but in general, they are
presented in the form of a “mosaic” of color samples without correction for things
like lens distortion, defective pixels, and temporal sampling artifacts. These things as
well as extracting the image from the color sample mosaic and encoding the output
into a standard format are the responsibility of the ISP.
ITU-R BT.2020—AKA Rec2020 defines various aspects of ultra-high-definition
television (UHDTV) with standard dynamic range (SDR) and wide color gamut
(WCG), including picture resolutions, frame rates with progressive scan, bit depths,
and color primaries.
ITU-R BT.2100—Defines various aspects of high-dynamic-range (HDR) video
such as display resolution (HDTV and UHDTV), bit depth, Bit Values (Files), frame
rate, chroma subsampling, and color space.
ITU-R BT.709—AKA Rec709 standardizes the format of high-definition television, having a 16:9 (widescreen) aspect ratio.
LUT—Acronym for Look-Up Table. LUTs are part of the RAMDAC of a graphics
subsystem and in modern graphics chips are usually located within the chip itself.
The LUT is the part of the output section of a graphics board which translates a pixel
value (primarily in 4 or 8 BPP indexed color modes) into its red, green, and blue
components. Once the components have been determined, they are passed through
the three DACs (red, green, and blue) to generate displayable signals. [Figure: an 8-bit pixel, with a value of 250, passing through the LUTs.]
the one chosen for rendering. In a linearly interpolated mip-mapping operation (also known as "trilinear filtering"), a weighted average of the two nearest mip-maps, based on the LOD value, is used. The term mip is an acronym for the Latin expression "multum in parvo" (much in a small space), implying the presence of many images in a small package.
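The trilinear blend amounts to a weighted average of the two mip levels bracketing the LOD value. In this sketch, `sample` and the texture representation are hypothetical placeholders (a real sampler would also do bilinear filtering within each level):

```python
import math

def trilinear_sample(mip_levels, lod, sample):
    """Blend the two mip levels bracketing the LOD value.

    mip_levels: list of textures, finest first.
    sample:     hypothetical function fetching a filtered value from one level.
    """
    lo = min(int(math.floor(lod)), len(mip_levels) - 1)
    hi = min(lo + 1, len(mip_levels) - 1)
    f = lod - math.floor(lod)  # fractional LOD = blend weight
    return (1.0 - f) * sample(mip_levels[lo]) + f * sample(mip_levels[hi])
```

For example, with each level represented by a single average value, an LOD of 0.5 returns a 50/50 mix of levels 0 and 1.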
Mixed Reality—Mixed Reality (MR) seamlessly blends a user’s real-world envi-
ronment with digitally created content, where both environments coexist to create
a hybrid experience. In MR, the virtual objects behave in all aspects as if they are
present in the real world, e.g., they are occluded by physical objects, their lighting is
consistent with the actual light sources in the environment, and they sound as though
they are in the same space as the user. As the user interacts with the real and virtual
objects, the virtual objects will reflect the changes in the environment as would any
real object in the same space.
Motherboard—The main circuit board in a PC, also known as a system board or
a planar (by IBM). Graphics AIBs and other cards (i.e., audio, gigabyte Ethernet,
etc.), as well as memory, the CPU, and disk drive cables plug into the motherboard.
Motion-To-Photon Latency (MTPL)—Also known as the end-to-end latency is
the delay between the movement of the user’s head and the change of the VR device’s
display reflecting the user’s movement. As soon as the user’s head moves, the VR
scenery should match the movement. The more delay (latency) between the two
actions, the more unrealistic the VR world seems. To make the VR world realistic,
VR systems want low latency of <20 ms.
Multi-Frame Noise Reduction (MFNR)—Automatically takes multiple images continuously, combines them, reduces the noise, and records them as one image. With
multi-frame noise reduction, one can select larger ISO numbers than the maximum
ISO sensitivity. The image recorded is one combined image.
Multiplexer—A multiplexer, also known as a MUX, is an electronic device that
acts as a switching circuit. A MUX has two or more data inputs, along with a switch
or select input which determines which of the data inputs are passed to the output
portion of the device.
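The select-line behavior can be modeled in a few lines of software as an illustration of the hardware (a 4-to-1 mux with two select bits is assumed here):

```python
def mux4(d0, d1, d2, d3, s1, s0):
    # 4-to-1 multiplexer: the two select bits (s1 s0) form an index
    # that picks one of the four data inputs to pass to the output.
    return (d0, d1, d2, d3)[(s1 << 1) | s0]
```

So select bits (1, 0) — binary 10, index 2 — route the third data input to the output.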
Multi-projection—Multi-projection can refer to an image created using multiple
projectors mapped on to a screen, or set of screens (as in a CAVE) for 3D projection
mapping using multiple projectors. It can also refer to multiple projections within a
screen in computer graphics.
Multiplayer game—Multiplayer games have traditionally meant that humans are
playing with other humans cooperatively, competing against each other, or both.
Artificial intelligence controlled players have historically been excluded from the
traditional definition of multi-player game. However, as AI technology progresses
this is likely to change. In the future, a human-controlled player's skill and behavior, tracked over time, could program the skill and behavior of a unique AI that can be
substituted for the human’s participation in the game.
Battle Royale is a game mode that creates a translucent dome (or other demarcation)
over/around the entire playing area. As the match progresses, the dome starts to shrink
toward a random point on the map. Players must stay within the bounds of the dome or take
damage leading to death. The shrinking dome “herds” players into smaller and smaller areas,
eventually ensuring that they will be in “close combat.” In summary, Battle Royale mode
allows large-scale combat using long range weapons and vehicles over large distances, but
eventually forces the remaining players into CQC (close quarter combat), ensuring that the
round time does not extend too long, and a new round can begin.
Permadeath is a video game and simulation feature where the player’s death eliminates
them from the ability to continue participation in the game or continue as the specific entity
they were playing. Permadeath can come in a number of forms. In multi-player combat
games, this usually means having to wait until the next round starts if killed. In most multi-
player combat games, rounds last 5 to 30 minutes; however, it is theoretically possible that
death in a game would permanently exclude the player from further participation.
In other multi-player combat games, permadeath can mean losing all your equipment
and your position on the map, forcing the player to respawn with no equipment as a new
“entity.” Even though there is no waiting period, the ramifications of dying are significant,
as the player has often spent significant time equipping themselves and moving to strategic
areas of the map.
Persistent world games track (or attempt to track) the entire game universe as individual
objects and the state of each object in one single instance. For example, in a massive multiplayer persistent world game, if a tree is cut down, the tree will forever be cut down for all
players, and for all of time. In single player games, the user sometimes has the ability to
“restart” the universe or run multiple iterations of the universe. Running multiple iterations
is known as "sharding" the universe. In the former case, the tree would reappear, or in the latter case have various states of being depending on the shard being played.
Sharded world games can have multiple simultaneous existences of the same base game
universe in varying degrees of state. Sharded world games that employ procedurally gener-
ated universes can have multiple simultaneous versions of non-matching universes. In either
case for multi-player, this is usually done to reduce the server load of players and reduce
latency by grouping players from geographical regions into the most optimal “shard.” There
can be hundreds of servers running the same universe but with unique player participation
and parametric states of being.
MUX—See multiplexer.
Nit—A nit is one candela per square meter (cd/m²).
Normal map—A normal map can be regarded as a newer, better type of bump map. A normal map creates the illusion of depth detail on the surface of a model, but it does it differently than a bump map, which uses grayscale values to provide either
up or down information. It is a technique used for faking the lighting of bumps and
dents—an implementation of bump mapping. It is used to add details without using
more polygons.
Northbridge—The northbridge is the controller that interconnects the CPU to
memory via the frontside bus (FSB). It also connects peripherals via high-speed
channels such as PCI Express and the AGP bus.
NTSC (National Television Systems Committee)—Analog color television
system standard used in the U.S.A., Canada, Mexico, and Japan. Other standards include PAL and SECAM.
NURBS—Non-uniform rational basis spline (NURBS) is a mathematical model
commonly used in computer graphics for generating and representing curves and
surfaces. It offers great flexibility and precision for handling both analytic (surfaces
defined by common mathematical formulae) and modeled shapes.
Oscilloscope—Early oscilloscopes used cathode ray tubes (CRTs) as their display
element. Storage oscilloscopes used special storage CRTs to maintain a steady display
of a signal briefly presented. Storage scopes (e.g., Tektronix 4010 series) were often
used in computer graphics as a vector scope.
ODM—Original design manufacturer.
OLED (organic light-emitting diode)—A light-emitting diode (LED) in which
the emissive electroluminescent layer is a film of organic compound that emits light
in response to an electric current. This layer of organic semiconductor is situated
between two electrodes; typically, at least one of these electrodes is transparent.
OLEDs are used to create digital displays in devices such as television screens,
computer monitors, and portable systems such as mobile phones.
Open Graphics Library (OpenGL)—A cross-language, cross-platform application
programming interface (API) for rendering 2D and 3D vector graphics. The API is
typically used to interact with a graphics processing unit (GPU), to achieve hardware-
accelerated rendering.
OpenVDB—OpenVDB is an Academy Award-winning open-source C++ library
comprising a novel hierarchical data structure and a suite of tools for the efficient
storage and manipulation of sparse volumetric data discretized on three-dimensional
grids. It was developed by DreamWorks Animation for use in volumetric applica-
tions typically encountered in feature film production and is now maintained by the
Academy Software Foundation (ASWF). https://github.com/AcademySoftwareFou
ndation/openvdb.
Ordered dither—An ordered dither is the application of a series of dither values which change according to a particular pattern, most often in the form of a matrix
(also referred to as a “dither matrix”), over the course of a set of dithering operations.
An ordered dither is traditionally applied positionally, using the modulus of the
destination pixel X and Y position as an index into the dither matrix.
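A positional ordered dither using the classic 4×4 Bayer matrix might look like the sketch below (the matrix values are the standard Bayer pattern; the image representation is a simple list of rows, assumed for illustration):

```python
# Classic 4x4 Bayer dither matrix (values 0..15).
BAYER4 = [
    [ 0,  8,  2, 10],
    [12,  4, 14,  6],
    [ 3, 11,  1,  9],
    [15,  7, 13,  5],
]

def ordered_dither(gray, width, height):
    """Threshold each pixel against the matrix entry selected by the
    modulus of its x/y position; gray values are in [0, 1]."""
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            threshold = (BAYER4[y % 4][x % 4] + 0.5) / 16.0
            row.append(1 if gray[y][x] > threshold else 0)
        out.append(row)
    return out
```

A uniform 50% gray tile turns on exactly half the pixels, arranged in the checkerboard-like pattern the matrix encodes, which is how ordered dithering simulates intermediate shades.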
Outside-In-Tracking—Outside-In-Tracking is a form of positional tracking where
fixed external sensors placed around the viewer are used to determine the position
of the headset and any associated tracked peripherals. Various methods of tracking
can be used, including, but not limited to, optical and IR.
PAL—Analog TV system used in Europe and elsewhere.
Palette—The computer graphics term “palette” is derived from the concept of an
artist’s palette, the flat piece of material upon which the artist would select and blend
his colors to create the desired shades. The palette on a graphics board specifies the
range of colors available in any one pixel. For example, standard VGAs tend to have
a palette of 262,144 colors, stemming from the fact that each color in the palette is composed of 6 bits each of red, green, and blue (a total of 18 bits, and 2¹⁸ = 262,144).
However, since the VGA can only display 16 or 256 colors on-screen at any one
time, it means that each one of these 16 or 256 colors must be chosen from the larger
palette via a set of LUTs. See “LUT” for details.
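The arithmetic above can be illustrated in Python; the `lut` contents here are an arbitrary, hypothetical 256-entry palette, not a real VGA register dump:

```python
# 6 bits per component -> 2**18 = 262,144 selectable palette colors,
# but an 8-bit pixel can index only 256 of them at a time via the LUT.
PALETTE_SIZE = 2 ** 18

def expand_6bit(r6, g6, b6):
    # Scale each 6-bit component (0-63) to the familiar 8-bit range (0-255).
    return tuple(round(c * 255 / 63) for c in (r6, g6, b6))

# A hypothetical 256-entry LUT of 6-bit-per-component colors.
lut = [(i % 64, (i * 2) % 64, (i * 3) % 64) for i in range(256)]

def pixel_to_rgb(pixel_value):
    # Indexed color: the pixel value selects a LUT entry, whose 6-bit
    # components then drive the red, green, and blue DACs.
    return expand_6bit(*lut[pixel_value])
```

The point of the split is economy: the frame buffer stores only 8 bits per pixel, while the LUT decides which 256 of the 262,144 possible colors those indices mean.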
PAM—Potential available market.
PCI—Acronym for Peripheral Component Interconnect. PCI is a bus standard which
Intel developed to overcome the performance bottlenecks inherent in the ISA bus
design, and most modern graphics boards are PCI-based (i.e., they need to be inserted
into the PCI bus in order to work).
The term “dots per inch” (dpi), extended from the print medium, is sometimes used
instead of pixels per inch. The dot pitch determines the absolute limit of the possible
pixels per inch. However, the displayed resolution of pixels (picture elements) that
is set up for the display is usually not as fine as the dot pitch.
Projection mapping—Projection mapping, also known as video mapping and
spatial augmented reality, is a projection technology used to turn objects, often irreg-
ularly shaped, into a display surface for video projection. The technique dates back to
the late 1960s, when it was referred to as video mapping, spatial augmented reality,
or shader lamps. This technique is used by artists and advertisers alike who can add
extra dimensions, optical illusions, and notions of movement onto previously static
objects.
PQ—Perceptual Quantizer Transfer Function for HDR signals (SMPTE ST 2084,
ITU-R BT.2100).
PvP—See player versus player.
RAMDAC—Acronym for Random Access Memory Digital to Analog Converter.
The “RAM” portion of a RAMDAC refers to the LUTs, which by necessity are RAMs,
while the “DAC” refers to the Digital to Analog Converters. See “DAC” and “LUT”
for more details.
Raster graphics—Also called scanline or bitmap graphics, a type of digital display that uses tiny four-sided, but not necessarily square, pixels, or picture elements,
arranged in a grid formation to represent an image. Raster-scan graphics has origins in
television technology, with images constructed much like the pictures on a television
screen.
Raster-scan display—A CRT uses a raster scan. Developed for television tech-
nology, an electron beam sweeps across the screen, from top to bottom covering one
row at a time. The beam's intensity is turned on and off as it moves across each row
to create images. The screen points are referred to as pixels.
Recurrent neural network (RNN)—A class of neural networks whose connections
form a directed cyclic graph. In other words, unlike a feedforward network such as a
CNN, the connections include feedback so that outputs can affect subsequent inputs,
giving rise to temporal behaviors. An example of an RNN is the Long Short-Term
Memory (LSTM) network popular in speech recognition.
SaaS—Software as a service.
SAM—Served available market.
Scanline display—See raster graphics display.
Scanline rendering—An algorithm for visible surface determination, in 3D
computer graphics, that works on a row-by-row basis rather than a polygon-by-
polygon or pixel-by-pixel basis.
Screen size—On 2D displays, such as computer monitors and TVs, the display
size (or viewable image size or VIS) is the physical size of the area where pictures
and videos are displayed. The size of a screen is usually described by the length of
its diagonal, which is the distance between opposite corners, usually in inches.
Screen tearing—A visual artifact in video display where a display device shows
information from multiple frames in a single screen draw. The artifact occurs when
the video feed to the device is not in sync with the display’s refresh rate. This can be
due to non-matching refresh rates—in which case the tear line moves as the phase
difference changes (with speed proportional to difference of frame rates).
SDK—Software development kit.
SECAM—Analog TV system used in France and parts of Russia and the Mideast.
SDR—Standard dynamic range TV (Rec.601, Rec.709, Rec.2020).
Shaders—Shaders is a broadly used term in graphics and can pertain to the
processing of specialized programs for geometry (known as vertex shading or
transform and lighting), or pixels shading.
Shifter—A device which shifts numbers one or more bit positions. For example, the decimal number 14 (1110b), when passed through a shifter which shifts one bit to the right, would produce decimal 7 (0111b). Each bit shift to the right is equivalent to an
integer divide by 2, while each bit shift to the left is equivalent to an integer multiply
by 2. Shifters are normally used to scale values up or down.
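The divide-by-2 / multiply-by-2 equivalence is easy to demonstrate with Python's shift operators:

```python
def shift_right(value, bits):
    # Each right shift is an integer divide by 2.
    return value >> bits

def shift_left(value, bits):
    # Each left shift is an integer multiply by 2.
    return value << bits

print(shift_right(14, 1))  # 14 (1110b) -> 7 (0111b)
print(shift_left(14, 1))   # 14 (1110b) -> 28 (11100b)
```

Shifting by n bits scales by 2ⁿ, which is why shifters are the cheap way to scale binary values up or down in hardware.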
SIMD—Single Instruction, Multiple Data describes computers with multiple
processing elements that perform the same operation on multiple data points simul-
taneously. Such machines exploit data level parallelism, but not concurrency: There
are simultaneous (parallel) computations, but only a single process (instruction) at a
given moment. SIMD is particularly applicable to common tasks like adjusting the
contrast and colors in a digital image.
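The contrast example can be illustrated with NumPy, whose whole-array operations map naturally onto SIMD-style execution (the tiny 2×2 "image" and the contrast/brightness values are arbitrary illustrations):

```python
import numpy as np

# One "instruction" (a multiply-add, then a clamp) applied to every
# pixel at once -- the data-parallel pattern SIMD hardware exploits.
image = np.array([[10.0, 50.0], [100.0, 200.0]], dtype=np.float32)
contrast, brightness = 1.5, 8.0
adjusted = np.clip(image * contrast + brightness, 0, 255)
```

There is a single operation per step, but it touches all four pixels in parallel; no per-pixel loop or branching is needed, which is exactly the "data-level parallelism without concurrency" the definition describes.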
SOM—Share of market.
Southbridge—The southbridge controller handles the remaining I/O, including
the PCI bus, parallel and Serial ATA drives (IDE), USB, FireWire, serial and parallel
ports, and audio ports. Earlier chipsets supported the ISA bus in the southbridge.
Starting with Intel’s 8xx chipsets, northbridge and southbridge were changed to
memory controller and I/O controller (see Intel Hub Architecture).
Span mode—Some applications, such as games, have an explicit screen resolution
setting. They will typically default to monitor’s registered resolution. In span mode,
which is a feature of the driver provided by the GPU supplier, it is possible to make
one contagious display that spans across all the monitors you choose. Then, when
the application is opened, it will fill the screens.
of a frame being output by the GPU, resulting in two or more frames being shown
at once.
Telecine—Telecine is the process of transferring motion picture film into video.
The most complex part of telecine is the synchronization of the mechanical film
motion and the electronic video signal. Normally, best results are then achieved by
using a smoothing (interpolating) algorithm rather than a frame duplication algorithm
(such as 3:2 pulldown).
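The 3:2 pulldown mentioned above duplicates alternate film frames for three video fields, then two, so that four 24 fps film frames fill ten ~60 Hz fields. A sketch (the frame labels are illustrative):

```python
def pulldown_32(frames):
    """3:2 pulldown: hold alternate film frames for 3 and 2 video fields,
    turning 4 film frames (24 fps) into 10 fields (5 frames at ~30 fps)."""
    fields = []
    for i, frame in enumerate(frames):
        fields.extend([frame] * (3 if i % 2 == 0 else 2))
    return fields

print(pulldown_32(["A", "B", "C", "D"]))
# ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D']
```

The uneven 3-then-2 cadence is what smoothing/interpolating telecine algorithms try to hide, since straight duplication makes motion judder.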
Tessellation shaders—A tessellation shader adds two new shader stages to the
traditional model. Tessellation Control Shaders (also known as Hull Shaders) and
Tessellation Evaluation Shaders (also known as Domain Shaders), which together
allow simpler meshes to be subdivided into finer meshes at run-time according to a
mathematical function. The function can be related to a variety of variables, most
notably the distance from the viewing camera to allow active level-of-detail scaling.
This allows objects close to the camera to have fine detail, while further away ones can
have coarser meshes, yet seem comparable in quality. It also can drastically reduce
mesh bandwidth by allowing meshes to be refined once inside the shader units instead
of down-sampling very complex ones from memory. Some algorithms can up-sample
any arbitrary mesh, while others allow for “hinting” in meshes to dictate the most
characteristic vertices and edges. Tessellation shaders were introduced in OpenGL
4.0 and Direct3D 11.
One cannot use tessellation to implement subdivision schemes that require the previous vertex positions to compute the next vertex positions.
Texel—Acronym for TEXture ELement or TEXture pixEL—the unit of data
which makes up each individually addressable part of a texture. A texel is the texture
equivalent of a pixel.
Texture mapping—The act of applying a texture to a surface during the rendering
process. In simple texture mapping, a single texture is used for the entire surface,
no matter how visually close or distant the surface is from the viewer. A somewhat
more visually appealing form of texture mapping involves using a single texture with
bilinear filtering, while an even more advanced form of texture mapping uses multiple
textures of the same image but with different levels of detail, also known as mip-
mapping. See also “bilinear filtering,” “level of detail,” “mip-map,” “mip-mapping,”
and “trilinear filtering.”
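A minimal sketch of the simplest case described above, nearest-texel sampling, assuming a texture stored as rows of texels and UV coordinates in [0, 1):

```python
# Minimal nearest-texel lookup (illustrative only): map UV coordinates
# onto a tiny texture stored as rows of texels.
def sample_nearest(texture, u, v):
    height = len(texture)
    width = len(texture[0])
    # Wrap coordinates so the texture tiles across the surface.
    x = int(u % 1.0 * width)
    y = int(v % 1.0 * height)
    return texture[y][x]

checker = [["black", "white"],
           ["white", "black"]]
print(sample_nearest(checker, 0.25, 0.25))  # -> black
print(sample_nearest(checker, 0.75, 0.25))  # -> white
```

Bilinear filtering would instead fetch the four texels surrounding (u, v) and blend them by the fractional coordinates.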
Texture Map—Same thing as “texture.”
Texture—A texture is a special bitmap image, much like a pattern, but which is
intended to be applied to a 3D surface in order to quickly and efficiently create a
realistic rendering of a 3D image without having to simulate the contents of the image
in 3D space. That sounds complicated, but in fact it's very simple. For example, if
you have a sphere and want to make it look like the planet Earth, you
have two options. The first is that you meticulously plot each nuance in the land
and sea onto the surface of the sphere. The second option is that you take a picture
of the Earth as seen from space, use it as a texture, and apply it to the surface of
the sphere. While the first option could take days or months to get right, the second
option can be nearly instantaneous. In fact, texture mapping is used broadly in all
sorts of real-time 3D programs and their subsequent renderings, because of its speed
and efficiency. 3D games are certainly among the biggest beneficiaries of textures,
but other 3D applications, such as simulators, virtual reality, and even design tools,
take advantage of textures too.
Tile-based deferred rendering (TBDR)—Defers the lighting calculations until all
objects have been rendered and then shades the whole visible scene in one pass.
This is done by rendering information about each object to a set of render targets
that contain data about the surface of the object; this set of render targets is normally
called the G-buffer.
Tiled rendering—The process of subdividing a computer graphics image by a
regular grid in optical space and rendering each section of the grid, or tile, separately.
The advantage to this design is that the amount of memory and bandwidth is reduced
compared to immediate mode rendering systems that draw the entire frame at once.
This has made tile rendering systems particularly common for low-power handheld
device use. Tiled rendering is sometimes known as a “sort middle” architecture,
because it performs the sorting of the geometry in the middle of the graphics pipeline
instead of near the end.
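The grid subdivision can be sketched as follows; the 32 × 32 tile size is illustrative (real tilers pick sizes that fit their on-chip memory):

```python
# Sketch of how a tiler subdivides the framebuffer into a regular grid
# and visits each tile separately; tile sizes here are illustrative.
def tiles(width, height, tile_w=32, tile_h=32):
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            # Each tile is rendered on-chip, then written out once,
            # which is where the memory-bandwidth saving comes from.
            yield (x, y, min(tile_w, width - x), min(tile_h, height - y))

tile_list = list(tiles(80, 48))
print(len(tile_list))   # 3 columns x 2 rows = 6 tiles
print(tile_list[-1])    # -> (64, 32, 16, 16)
```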
ToF—An acronym for Time of Flight. Used to refer to active sensors which
measure distance to objects in a scene by emitting infrared pulses and measuring the
time taken to detect the reflection. These sensors simplify the computational task of
producing a point cloud from image data but are more expensive and lower resolution
than regular CMOS sensors.
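The underlying arithmetic is simple: the pulse travels out to the object and back, so the measured round-trip time corresponds to twice the distance. A minimal sketch:

```python
# Time-of-flight distance: the pulse travels to the object and back,
# so distance is the speed of light times half the round-trip time.
C = 299_792_458.0  # speed of light, m/s

def tof_distance(round_trip_seconds):
    return C * round_trip_seconds / 2.0

# A 20 ns round trip corresponds to roughly 3 m.
print(round(tof_distance(20e-9), 3))  # -> 2.998
```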
Tone-mapping—A technique used in image processing and computer graphics to
map one set of colors to another to approximate the appearance of high-dynamic-
range images in a medium that has a more limited dynamic range.
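One common global tone-mapping curve is the Reinhard operator, shown here purely as an illustration (many other curves exist); it compresses unbounded luminance values into the displayable range [0, 1):

```python
# Reinhard global tone-mapping operator: L_out = L / (1 + L).
def reinhard(luminance):
    return luminance / (1.0 + luminance)

for lum in (0.5, 1.0, 10.0, 1000.0):
    print(round(reinhard(lum), 4))
# High-dynamic-range values are compressed into [0, 1):
# 0.3333, 0.5, 0.9091, 0.999
```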
Transcoding—Transcoding is the process of converting a media file or object
from one format to another. Transcoding is often used to convert video formats (e.g.,
Beta to VHS, VHS to QuickTime, QuickTime to MPEG). But it is also used to fit
HTML files and graphics files to the unique constraints of mobile devices and other
Web-enabled products.
Trilinear filtering—A combination of bilinear filtering and mip-mapping, which
enhances the quality of texture mapped surfaces. For each surface that is rendered,
the two mip-maps closest to the desired level of detail will be used to compute pixel
colors that are the most realistic by bilinearly sampling each mip-map and then using
a weighted average between the two results to produce the rendered pixel.
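The final blend step can be sketched as follows, assuming a hypothetical sample_from_mip function (not a real API) that returns an already bilinearly filtered value for a given mip level:

```python
# Trilinear filtering sketch: take one (already bilinearly filtered)
# sample from each of the two nearest mip levels, then blend them by
# the fractional level of detail.
def lerp(a, b, t):
    return a + (b - a) * t

def trilinear(sample_from_mip, lod):
    lower = int(lod)            # finer of the two mip levels
    frac = lod - lower          # how far toward the coarser level
    c0 = sample_from_mip(lower)
    c1 = sample_from_mip(lower + 1)
    return lerp(c0, c1, frac)

# Hypothetical per-mip samples: mip 2 returns 0.8, mip 3 returns 0.4.
samples = {2: 0.8, 3: 0.4}
print(round(trilinear(samples.__getitem__, 2.25), 3))  # -> 0.7
```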
Trilinear mip-mapping—See above, trilinear filtering.
Truncation—An arithmetic operation which simply removes the fractional portion
of a number in integer-fraction format to produce an integer, without regard for the
magnitude of the fractional portion. Therefore, 2.99 and 2.01 truncated are both 2.
See also “rounding.”
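In Python terms (math.trunc truncates, the built-in round rounds to nearest):

```python
import math

# Truncation drops the fraction outright; rounding picks the nearest
# integer. Both 2.99 and 2.01 truncate to 2, while rounding differs.
print(math.trunc(2.99), math.trunc(2.01))  # -> 2 2
print(round(2.99), round(2.01))            # -> 3 2
# For negative numbers, truncation moves toward zero:
print(math.trunc(-2.99))                   # -> -2
```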
UDIM—An enhancement to the UV mapping and texturing workflow that makes
UV map generation easier and assigning textures simpler. The term UDIM comes
from U-Dimension and designates UV ranges. UDIM is an automatic UV offset system
that assigns an image onto a specific UV tile, which allows one to use multiple lower
resolution texture maps for neighboring surfaces, producing a higher resolution result.
Vector display—A display that draws images directly as lines rather than as a
raster of pixels. Vector displays were so accurate that physical measurements could
be taken from the screen. For that reason, they were also called calligraphic displays.
Vector graphics—Refers to a method of generating electronic images using math-
ematical formulae to calculate the start, end, and path of a line. Images of varying
complexity can be produced by combining lines into curved and polygonal shapes,
resulting in infinitely scalable objects with no loss of definition.
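As an illustration, a quadratic Bézier curve, one of the standard parametric path primitives in vector graphics, can be evaluated at any resolution from just three control points, which is why vector images scale with no loss of definition:

```python
# Evaluate a quadratic Bezier curve defined by three control points
# at parameter t in [0, 1]: B(t) = (1-t)^2 P0 + 2(1-t)t P1 + t^2 P2.
def quad_bezier(p0, p1, p2, t):
    x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * p1[0] + t ** 2 * p2[0]
    y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * p1[1] + t ** 2 * p2[1]
    return (x, y)

# Endpoints are hit exactly at t = 0 and t = 1:
print(quad_bezier((0, 0), (1, 2), (2, 0), 0.0))  # -> (0.0, 0.0)
print(quad_bezier((0, 0), (1, 2), (2, 0), 0.5))  # -> (1.0, 1.0)
```

Sampling t more densely yields an arbitrarily smooth rendering of the same mathematical path.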
Vector unit (SIMD vector unit)—An Arithmetic Unit or Arithmetic Logic Unit
which operates on one or more vectors at a time, using the same instruction for all
values in the vector.
Verilog/HDL—A “Hardware Description Language” is a textual representation
of logic gates and registers. It differs from a programming language mainly in that
it describes a parallel structure in space rather than a sequence of actions in time.
Verilog is one of the most popular HDLs and resembles C or C++ in its syntax.
VESA—Video Electronics Standards Association, a technical standards organiza-
tion for computer display, PC, workstation, and computing environment standards.
The organization was incorporated in California in July 1989.
VFX—Visual effects.
VGA (video graphics array)—VGA is a resolution and electrical interface stan-
dard originally developed by IBM. It was the de facto display standard for the PC. VGA
has three analog signals, red, green, and blue (RGB), and uses an analog monitor.
Graphics AIBs output analog signals. All CRTs and most flat panel monitors accept
VGA signals, although flat panels may also have a DVI interface for display adapters
that output digital signals.
vGPU—An AIB with a powerful dGPU located remotely in the cloud or a campus
server.
Vignetting—A reduction of an image’s brightness or saturation at the periphery
compared to the image center.
VPNA—See Visual Processing Unit.
Virtual reality—Virtual reality (VR) is a fully immersive user environment
affecting or altering the sensory input(s) (e.g., sight, sound, touch, and smell) and
allowing interaction with those sensory inputs by the user’s engagement with the
virtual world. Typically, but not exclusively, the interaction is via a head-mounted
display, use of spatial or other audio, and/or hand controllers (with or without tactile
input or feedback).
VR Video and VR Images—VR Video and VR Images are still or moving imagery
specially formatted as separate left and right eye images usually intended for display
in a VR headset. VR Video capture and subsequent display are not exclusive to 360°
formats and may also include content formatted to 180° or 270°; content does not
need to visually surround a user to deliver a sense of depth and presence.
Vision processing—Processing of still or moving images with the objective of
extracting semantic or other information.
VLIW (very long instruction word)—A microprocessor instruction format that
combines multiple low-level instruction words and presents them simultaneously to
control multiple execution units in parallel.
Index

A
Accelerated Processor Unit (APU), 71
Acorn, 108
Acorn Computers, 108
Adreno GPU, 170
Adreno GPU, Qualcomm, 141
Advanced eXtensible Interface (AXI), 273
Advanced High-performance Bus (AHB), 113
Advanced Interface Bus, 340
Advanced Micro Devices (AMD), 144, 145, 163, 261
Advanced Peripheral Bus (APB), 113
Alben, Jonah, 53
Alchemist GPU, 342
Alder Lake, 335
Alho, Mikko, 144
Alphamosaic, 267
Ambiq, 276
AMD ray accelerator, 347
AMD TeraScale architecture, 66
Ampere, Nvidia, 288
Andrews, Jeff, 200
Anisotropic filtering, 46
Anti-aliasing, 41, 46
Apple, 107
Apple M1, 172
AR and VR, 157
Arc client graphics, 342
Arc, Intel AIB, 250
Argonne National Laboratory, 248
Arm, 108, 109
Arm-Nvidia, 164
Asynchronous Compute Engines (ACE), 88
Asynchronous Compute Tunneling, 88
Atari VCS, 225
ATI, 136
Axe Technology, 227

B
Bachus, Kevin, 191
Baker, Nick, 200
Barlow, Steve, 267
Battlemage GPU, 342
BBC Computer, 108
Bergman, Rick, 22
Berkes, Otto, 191
Bezier curve, 8
Bidirectional Reflectance Distribution Function (BRDF), 272
Bifrost, Mali-, 154
Big.LITTLE, 153
Binary Runtime Environment for Wireless (BREW), 139
Bitboys, 130, 137
Blackley, Seamus, 191
Bolt Graphics, 262
Boost, AMD, 91
Bounding Volume Hierarchy (BVH), 329, 350
Broadway IBM PowerPC, 209
Brown, Nat, 191
Bulldozer, AMD, 67
Bush, Jeff, 308

C
Cai, Mike, 281
Campbell, Gordon (Gordie), 123
Carmean, Doug, 61
Celestial GPU, 342

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
J. Peddie, The History of the GPU - New Developments, https://doi.org/10.1007/978-3-031-14047-1
I
Icera, 127
Imageon, 101, 137
Imageon, ATI, 139
Imageon processor, 140
Imagination BXS, 105
Imagination Image Compression (IMGIC), 106, 296
Imagination Technologies, 163
Imagination Technologies’ B-series, 104
Infinity Cache, 347
Inglis, Mike, 109
Inline Raytracing, 328
Innosilicon, 354
Input-to-response latency, 92
Instructions Per Clock (IPC), 87
Integrated Graphics Chipsets (IGCs), 26
Intel, 163
Intel, iGPUs, 79
Intel Kaby Lake G, 85
Intellectual Property (IP), 133
Intellisample, Nvidia, 46
Intel Xe, 335
Interactive PC computer graphics (CGI), 28
iPad, 101
Iris Xe dGPU, 335
Ironlake GPU, 63
Ironlake graphics, 62
Iwata, Satoru, 202

J
Jade, Fujitsu, 119
Jaguar APU, 212
James, Dick, 208
Jensen, Rune, 200
Jiawei, Jing, 264
Jingjia Microelectronics, 265
Jobs, Steve, 124
Joe Palooka, 138
Johnson, Gary, 124

K
Kal-El, 127
Kelvin architecture, 193
Khan, ATI, 13
Koduri, Raja, 82, 250
Komppa, Jari, 144
Kumar, Devinder, 68
Kutaragi, Ken, 191, 196

L
Larrabee project, 57
Lee, Byoung Ok, 242
Leia, Bitboys Qualcomm, 138
Leighton, Luke Kenneth Casson, 313
Leland, Tim, 141
Lens-Matched Shading (LMS), 331
Lightspeed Memory Architecture, 3
Lin, Chris, 25
Liverpool/Durango APU, 210
Ljosland, Borgar, 110, 115
Logan, Nvidia, 166
Logo, Nvidia, 57
Lumen in the Land of Nanite, 229

M
M1 Max, 175
M1 Pro, 174
M1 Ultra, 177
Machine Learning, 157
Macri, Joe, 33
Makivaara, Jarkko, 144
Mali, 109
Mali Java stack, 113
Mallard, John, 124
Many-core Integrated Accelerator of Waterdeep/Wisconsin (MIAOW), 309
Masayoshi, Son, 163
McNamara, Patrick, 307
MediaGX, 255
MediaQ, 120
Memory-Management Units (MMU), 75
Mesh interconnect architecture (MDFI), 251
Mesh shader, 333
MetalFX Upscaling, 182
MetaX, 258
Meyer, Dirk, 70
Microsoft’s Zune, 126
Midgard, 148
Midgard, Mali, 149
Midway, 191
Mijat, Roberto, 150
Miller, Timothy, 306
Min Yoon, Hyung, 242
Miyamoto, Masafumi, 29
Miyamoto, Shigeru, 202
Mobile GPUs, 103
O
Ogasawara, Shinichi, 195
Open Core Protocol (OCP), 273
OpenGL-ES, 103
OpenGL ES 1.0, 123
OpenGL ES 3.2, 143
OpenGPU, AMD, 352
Open Graphics Project, 306
Open Hardware Foundation, 307
Open Mobile Application Processor Interfaces (OMAPI), 146
Open Multimedia Application Platform (OMAP), 146
OpenVG, 132
Original Device Manufacturers (ODM), 23
Orton, Dave, 13, 22, 32, 67, 70
Otellini, Paul, 57

Q
Q3D Dimension platform, 140
Qi, Nick, 262
Qualcomm, 163
Qualcomm handheld, 235
Qualcomm’s variable rate shading, 171
Quincunx anti-aliasing, 3

R
R300, ATI, 13
Radeon HD 4870, 67
Radeon Media Engine, 349
Raspberry Pi, 269
RayCore, 242
RayQuery, 324
Ray tracing, 323
RayTree, 245
RDNA 2, 347
W
Waves, 84
WebTV, 192

Z
Zhaoxin, 256
Zhongshan, 227
Zulu, Sun, 240