
Jon Peddie

The History of the GPU - New Developments

Jon Peddie Research
Tiburon, CA, USA

ISBN 978-3-031-14046-4 ISBN 978-3-031-14047-1 (eBook)


https://doi.org/10.1007/978-3-031-14047-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword

Real-time 3D graphics and the consumer gaming market have been responsible for
driving tremendous innovation to feed the insatiable appetite for high-resolution,
photo-realistic gaming technologies. Capturing the interest of computer scientists
and creative hardware developers around the world, the development of the GPU
has led to advancements in computational capabilities and in the memory systems
that feed them. Advanced algorithms and APIs to manage large, complex data systems,
along with the move to general-purpose programming models exploited for general-purpose
computing, high-performance computing, cryptocurrency, and artificial intelligence,
have further propelled the GPU to an unprecedented pace of development.
In the early 1990s, when I first became involved with the commercialization of
3D graphics technology, Jon Peddie was already a well-known graphics market
analyst. I joined a team of very seasoned hardware and software developers at GE
Visual Systems in Daytona Beach, where large-scale military and NASA training
systems were developed. We created some of the first consumer commercial uses
of the technology with the Sega Model 1 and Model 2 arcade hardware, initially
sporting 180 K polygons per second and 1.2 M pixels per second at a resolution of
496 × 384. After acquisition by Martin Marietta and then Lockheed Martin, Real3D
was formed, where I was part of a small team that developed the Intel 740 3D
architecture that started Intel's 3D rendering roadmap. Jon has shared some
unique perspectives on i740 development and Intel's entry into 3D graphics that a
quick search will reveal. His second book of this series will cover industry trends
and struggles during this period. I joined ATI Technologies in 1999, later acquired by
Advanced Micro Devices, Inc. (AMD), where I have had the pleasure of advancing the
Radeon product line, console gaming systems, and our latest RDNA/CDNA products
that power some of the most exciting developments of the century. Over the years, I
have regularly read Jon's JPR research reports to understand his broad perspective on
relevant emerging trends in our industry. I have had the pleasure of meeting with Jon
on several occasions at product introduction events and industry conferences to chat
about trends, motivations, technical details, and successes in real-time graphics.


In Jon’s third book of a three-book series on the History of the GPU, he shares an
interesting and knowledgeable history of the chaotic and competitive time that forged
today's leaders in the 3D gaming hardware industry. Jon draws on the breadth of his
relationships formed over the years and his knowledge to break these contributions
into six eras of GPU development. In each chapter, Jon not only covers innovations
and key products, but also shares his perspective on company strategy, key leaders,
and visionary architects during each era of development. I hope that you will thor-
oughly enjoy this series and the final book while learning about the tremendous
growth of technology and the hard work, risk, and determination of those who have
contributed to today’s GPU success.

Michael Mantor
AMD Chief GPU Architect
and Corporate Fellow
Preface

This is the third book in the three-book series on the History of the GPU.
The first book covered the history of computer graphics controllers and processors
from the 1970s, leading up to the introduction of the fully integrated GPU, which first
appeared in game consoles in 1996 and then in the PC in 1999. The second book in
the series covers the developments that led up to the integrated GPU, from the early
1990s to the late 1990s.
The GPU has been employed in many systems (platforms) and has evolved continuously
since 1996.
This final book in the series covers the second to sixth eras of GPU development
on the PC platform and on other platforms, including workstations, game machines,
and even various vehicles; GPUs are used everywhere in almost everything.
Each chapter is designed to be read independently, so there may be some
redundancy. Hopefully, each one tells an interesting story.
In general, a company is introduced and discussed in the year of its formation.
However, a company may be discussed in multiple time periods and multiple chapters,
depending on how significant its developments were and what impact they had on
the industry.


History of the GPU

Book 1: Steps to Invention
1. Preface
2. History of the GPU
3. 1980–1990 Graphics Controllers on Other Platforms
4. 1980–1989 Graphics Controllers on PCs
5. 1990–1995 Graphics Controllers on PCs
6. 1990–1999 Graphics Controllers on Other Platforms
7. 1996–1999 Graphics Controllers on PCs
8. What is a GPU

Book 2: Eras and Environment
1. Preface
2. Race to Build the First GPU
3. GPU Functions
4. Major Era of GPUs
5. First Era of GPUs
6. GPU Environment-Hardware
7. Application Program Interface (API)
8. GPU Environment-Software Extensions

Book 3: New Developments
1. Preface
2. Second Era of GPUs (2001–2006)
3. Third to Fifth Era of GPUs
4. Mobile GPUs
5. Game Console GPUs
6. Compute GPUs
7. Open GPUs
8. Sixth Era of GPUs

The History of the GPU - New Developments

I mark the GPU’s introduction as the first fully integrated single chip with hardware
geometry processing capabilities—transform and lighting. Nvidia gets that honor on
the PC by introducing their GeForce 256 based on the NV10 chip in October 1999.
However, Silicon Graphics Inc. (SGI) introduced an integrated GPU in the Nintendo
64 in 1996, and ArtX developed an integrated GPU for the PC a month after Nvidia.
As you will learn, Nvidia did not introduce the concept of a GPU, nor did they
develop the first hardware implementation of transform and lighting. But Nvidia was
the first to bring all that together in a mass-produced single chip device.
The evolution of the GPU, however, did not stop with the inclusion of the transform
and lighting (T&L) engine, because the first era of such GPUs had fixed-function
T&L processors: that was all they could do, and when they were not doing it, they
sat idle, consuming power. The GPU kept evolving and has gone through six eras
of evolution, ending up today as a universal computing machine capable of almost
anything.

The Author

A Lifetime of Chasing Pixels

I have been working in computer graphics since the early 1960s, first as an engineer,
then as an entrepreneur (I founded four companies and ran three others), and, after
a failed attempt at retiring in 1982, as an industry consultant and advisor. Over
the years, I watched, advised, counseled, and reported on developing companies
and their technology. I saw the number of companies designing or building graphics
controllers swell from a few to over forty-five. In addition, there have been over thirty
companies designing or making graphics controllers for mobile devices.
I've written and contributed to several other books on computer graphics (seven
under my name and six co-authored). I've lectured at several universities around the
world, written uncountable articles, and acquired a few patents, all with a single,
passionate thread: computer graphics and the creation of beautiful pictures that tell
a story. This book is liberally sprinkled with images: block diagrams of the chips,
photos of the chips, the boards they were put on, and the systems they went into,
as well as pictures of some of the people who invented and created these marvelous
devices that impact and enhance our daily lives. Many of them, I am proud to say,
are good friends of mine.
I laid out the book in such a way (I hope) that you can open it to any page and
start to get the story. You can also read it linearly; if you do, you'll probably find new
information, and perhaps more than you ever wanted to know. My email address is
in various parts of this book, and I try to answer everyone, hopefully within 48 hours.
I'd love to hear your comments, your stories, and your suggestions.
The following is an alphabetical list of all the people (at least I hope it's all of
them) who helped me with this project. A couple of them have passed away, I am
sorry to say. Hopefully, this book will help keep the memory of them and their
contributions alive.
Thanks for reading
Jon Peddie—Chasing pixels, and finding gems

Acknowledgments and Contributors

The following people helped me with editing, interviews, data, photos, and most of
all encouragement. I literally and figuratively could not have done this without them.
Ashraf Eassa—Nvidia
Andrew Wolfe—S3
Anand Patel—Arm
Atif Zafar—Pixilica
Borgar Ljosland—Falanx
Brian Kelleher—DEC, and finally Nvidia
Bryan Del Rizzo—3dfx & Nvidia
Carrell Killebrew—TI/ATI/AMD
Chris Malachowsky—Nvidia
Curtis Priem—Nvidia
Dado Banatao—S3
Dan Vivoli—Nvidia
Dan Wood—Matrox, Intel
Daniel Taranovsky—ATI
Dave Erskine—ATI & AMD
Dave Orton—SGI, ArtX, ATI & AMD
David Harold—Imagination Technologies
Dave Kasik—Boeing
Emily Drake—Siggraph
Edvard Sørgård—Falanx
Eric Demers—AMD/Qualcomm
Frank Paniagua—VideoLogic
Gary Tarolli—3dfx
Gerry Stanley—Real3D
George Sidiropoulos—Think Silicon
Henry Chow—Yamaha & Giga Pixel
Henry Fuchs—UNC
Henry C. Lin—Nvidia
Henry Quan—ATI
Hossein Yassaie—Imagination Technologies
Iakovos Stamoulis—Think Silicon
Ian Hutchinson—Arm
Jay Eisenlohr—Rendition
Jay Torborg—Microsoft
Jeff Bush—Nyuzi
Jeff Fischer—Weitek & Nvidia
Jem Davies—Arm
Jensen Huang—Nvidia
Jim Pappas—Intel
Joe Curley—Tseng/Intel
Jonah Alben—Nvidia
John Poulton—UNC & Nvidia
Karl Guttag—TI
Karthikeyan (Karu) Sankaralingam—University of Wisconsin-Madison
Kathleen Maher—JPA & JPR
Ken Potashner—S3 & SonicBlue

Tiburon, USA
Jon Peddie


Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Programmable Vertex and Geometry Shaders (2001–2006) . . . . . . 1
1.1.1 Nvidia NV20—GeForce 3 (February 2001) . . . . . . . . . . . . 2
1.1.2 ATI R200 Radeon 8500 (August 2001) . . . . . . . . . . . . . . . . 4
1.1.3 Nvidia’s NV25–28—GeForce 4 Ti (February 2002) . . . . . 11
1.1.4 ATI’s R300 Radeon 9700 and the VPU (August
2002) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.4.1 First PC GPU with Eight Pipes . . . . . . . . . . . . . 16
1.1.4.2 Z-Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.4.3 Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.1.4.4 Memory Management . . . . . . . . . . . . . . . . . . . . . 19
1.1.4.5 Multiple Displays . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.1.4.6 Along Comes a RenderMonkey . . . . . . . . . . . . . 22
1.1.4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.1.5 SiS Xabre—September 2002 . . . . . . . . . . . . . . . . . . . . . . . . 23
1.1.5.1 SiS 301B Video Processor . . . . . . . . . . . . . . . . . 26
1.1.5.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.1.6 The PC GPU Landscape in 2003 . . . . . . . . . . . . . . . . . . . . . 27
1.1.7 Nvidia NV30–38 GeForce FX 5 Series (2003–2004) . . . 27
1.1.7.1 CineFX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.1.7.2 Nvidia Enters the AIB Market
with the GeForceFX (2003) . . . . . . . . . . . . . . . . 31
1.1.8 ATI R520 an Advanced GPU (October 2005) . . . . . . . . . . 31
1.1.8.1 Avivo Video Engine . . . . . . . . . . . . . . . . . . . . . . . 44
1.1.8.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.1.8.3 Nvidia’s NV40 GPU (2005–2008) . . . . . . . . . . 45
1.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


2 The Third- to Fifth-Era GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


2.1 The Third Era of GPUs—DirectX 10 (2006–2009) . . . . . . . . . . . . . 51
2.1.1 Nvidia G80 First Unified Shader GPU (2006) . . . . . . . . . . 53
2.1.2 Nvidia GT200 Moving to Compute (2008) . . . . . . . . . . . . 54
2.1.2.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.1.3 Intel Larrabee to Phi (2006–2009) . . . . . . . . . . . . . . . . . . . . 57
2.1.4 Intel’s GM45 iGPU Chipset (2007–2008) . . . . . . . . . . . . . 62
2.1.5 Intel’s Westmere (2010) Its First iGPU . . . . . . . . . . . . . . . . 62
2.2 The Fourth Era of GPUs (October 2009) . . . . . . . . . . . . . . . . . . . . . 65
2.2.1 The End of the ATI Brand (2010) . . . . . . . . . . . . . . . . . . . . 66
2.2.2 AMD’s Turks GPU (2011) . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.2.2.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.2.3 Nvidia’s Fermi (2010) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.2.3.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.2.4 AMD Fusion GPU with CPU (January 2011) . . . . . . . . . . 70
2.2.4.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.2.5 Nvidia Kepler (May 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.2.6 Intel’s iGPUs (2012–2021), the Lead Up to dGPU . . . . . . 79
2.2.7 Nvidia Maxwell (2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
2.3 The Fifth Era of GPUs (July 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2.3.1 AMD's GCN RX380 (June 2016) . . . . . . . . . . . . . . . . . . . . 83
2.3.2 Intel’s Kaby Lake G (August 2016) . . . . . . . . . . . . . . . . . . . 85
2.3.3 Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.3.4 AMD’s Navi RDNA Architecture (July 2019) . . . . . . . . . . 87
2.3.4.1 Radeon RX 5700 XT AIB (July 2019) . . . . . . . 88
2.3.4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.3.4.3 RX 5500 Series (2019) . . . . . . . . . . . . . . . . . . . . 91
2.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.3.6 Intel's Whiskey Lake 620 GT2 iGPU (2018) . . . . . . . . . . . . 92
2.3.7 Intel’s Gen 11 iGPU (March 2019) . . . . . . . . . . . . . . . . . . . 93
2.3.7.1 Intel’s GPU’s Geometry Engine . . . . . . . . . . . . . 94
2.3.7.2 Intel Updates Its Ring Topology . . . . . . . . . . . . 95
2.3.7.3 Coarse Pixel Shading . . . . . . . . . . . . . . . . . . . . . . 96
2.3.7.4 Position Only Shading Tile-Based
Rendering (POSH) . . . . . . . . . . . . . . . . . . . . . . . . 96
2.3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3 Mobile GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.1 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.2 Mobiles: The First Decade (2000–2010) . . . . . . . . . . . . . . . . . . . . . . 103
3.3 Imagination Technologies First GPU IP (2000) . . . . . . . . . . . . . . . . 104
3.3.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.4 Arm’s Path to GPUs (2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

3.4.1 Falanx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110


3.4.2 Mali Family (2005) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.4.3 More Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.4.4 Balanced, Scalable, and Fragmented . . . . . . . . . . . . . . . . . . 115
3.4.5 More Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.5 Fujitsu’s MB86292 GPU (2002–) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.5.1 MB86R01 Jade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.5.2 Several Name Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.6 Nvidia’s Tegra—From PDAs to Autonomous Vehicles
and consoles (2003–) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.6.1 Tegra is Born . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.6.2 Nvidia Enters the Automotive Market (2009) . . . . . . . . . . 128
3.7 Bitboys 3.0 (2002–2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.7.1 End Game: Bitboys’ VG (2003) . . . . . . . . . . . . . . . . . . . . . . 131
3.8 Qualcomm’s Path to the Snapdragon GPU (2004–) . . . . . . . . . . . . . 139
3.8.1 The Adreno GPU (2006) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
3.9 Second Decade of Mobile GPU Developments (2010
and on) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
3.10 Siru (2011–2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
3.10.1 Samsung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.11 Texas Instruments OMAP (1999–2012) . . . . . . . . . . . . . . . . . . . . . . . 146
3.12 Arm’s Midgard (2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.12.1 Arm’s Bifrost (2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.12.2 Arm’s Valhall (2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
3.12.2.1 AR and VR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.12.3 Valhall Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.12.3.1 ML and Display . . . . . . . . . . . . . . . . . . . . . . . . . . 160
3.12.3.2 Mali-D77 Display Processor (2019) . . . . . . . . . 161
3.12.4 Arm Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
3.12.5 Second Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.13 Nvidia Leaves Smartphone Market, 2014 . . . . . . . . . . . . . . . . . . . . . 165
3.13.1 Xavier Introduced (2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
3.14 Qualcomm Snapdragon 678 (2020) . . . . . . . . . . . . . . . . . . . . . . . . . . 167
3.15 Qualcomm Snapdragon 888 (2020) . . . . . . . . . . . . . . . . . . . . . . . . . . 170
3.16 Apple’s M1 GPU and SoC (2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
3.16.1 Apple’s M1 Pro GPU (2021) . . . . . . . . . . . . . . . . . . . . . . . . 174
3.16.2 Apple’s M1 Ultra (2022) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
3.16.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
3.17 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

4 Game Console GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187


4.1 Sony PlayStation 2 (2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
4.2 Microsoft Xbox (2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
4.2.1 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.3 Sony PSP (2004) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.4 Xbox 360—Unified Shaders and Integration (November
2005) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.4.1 The Xbox 360 GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.4.2 The Many Versions of Xbox 360 . . . . . . . . . . . . . . . . . . . . . 200
4.4.3 Updated Xbox 360—Integrated SoC (August 2010) . . . . . 200
4.5 Nintendo Wii (November 2006) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
4.6 Sony PlayStation 3 (2006) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.7 Nintendo 3DS (June 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.8 Sony PS Vita (December 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
4.9 Eighth-Generation Consoles (2012) . . . . . . . . . . . . . . . . . . . . . . . . . . 209
4.10 Nintendo Wii U (November 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.11 CPUs with GPUs Lead to Powerful Game Consoles (2013) . . . . . . 212
4.12 Nvidia Shield (January 2013–2015) . . . . . . . . . . . . . . . . . . . . . . . . . . 212
4.12.1 A Grid Peripheral? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.12.2 But Was It Disruptive? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
4.13 Sony PlayStation 4 (November 2013) . . . . . . . . . . . . . . . . . . . . . . . . 216
4.14 Microsoft Xbox One (November 2013) . . . . . . . . . . . . . . . . . . . . . . . 217
4.15 Nvidia Shield 2 (March 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4.16 Playmaji Polymega (February 2017) . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.17 Nintendo Switch (March 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
4.18 Atari VCS (June 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
4.19 Zhongshan Subor Z-Plus Almost Console (2018–2020) . . . . . . . . . 226
4.20 Sony PlayStation 5 (November 2020) . . . . . . . . . . . . . . . . . . . . . . . . 227
4.21 Microsoft Xbox Series X and S (November 2020) . . . . . . . . . . . . . . 230
4.22 Valve Steam Deck Handheld (July 2021) . . . . . . . . . . . . . . . . . . . . . . 232
4.23 Qualcomm Handheld (December 2021) . . . . . . . . . . . . . . . . . . . . . . . 234
4.24 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
5 Compute Accelerators and Other GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . 239
5.1 Sun's XVR-4000 Zulu (2002), the End of an Era . . . . . . . . . . . . . . 240
5.2 SiliconArts Ray Tracing Chip and Intellectual Property (IP)
(2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.2.1 RayCore 1000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.2.2 RayCore 2000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
5.2.3 RayCore Lite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
5.2.4 Road Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
5.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
5.3 Intel Xe Architecture-Discrete GPU for High-Performance
Computing (HPC) (2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

5.4 Compute GPU Zhaoxin (2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255


5.5 MetaX (2020–) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
5.5.1 MetaX Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
5.6 XiangDiXian Computing Technology (2020) . . . . . . . . . . . . . . . . . . 262
5.7 Bolt Graphics (2021–) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
5.8 Jingjia Micro Series GPUs (2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
5.9 Alphamosaic to Pi via Broadcom (2000–2021) . . . . . . . . . . . . . . . . 267
5.10 The Other IP Providers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
5.10.1 AMD 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
5.10.2 Digital Media Professionals Inc. (DMP Inc.) 2002 . . . . . . 271
5.10.3 Imagination Technologies 2002 . . . . . . . . . . . . . . . . . . . . . . 274
5.10.4 Think Silicon (2007) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
5.10.5 VeriSilicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
5.11 Nvidia’s Ampere (May 2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
5.11.1 A Supercomputer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
5.12 Imagination Technologies' Ray Tracing IP (2021) . . . . . . . . . . . . . 293
5.12.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
5.13 Nvidia’s Mega Data Center GPU Hopper (2022) . . . . . . . . . . . . . . . 298
5.13.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
5.14 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
6 Open GPU Projects (2000–2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
6.1 Open Graphics Project (2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
6.2 Nyuzi/Nyami (2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
6.3 MIAOW (2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
6.4 GPUOpen (2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
6.5 SCRATCH (2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
6.6 Libre-GPU (2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
6.7 Vortex: RISC-V GPU (2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
6.8 RV64X (2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
6.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
7 The Sixth Era GPUs: Ray Tracing and Mesh Shaders . . . . . . . . . . . . . . 323
7.1 Miners and Taking a Breath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
7.2 Nvidia’s Turing GPU (September 2018) . . . . . . . . . . . . . . . . . . . . . . 326
7.2.1 Ray Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
7.2.2 Hybrid-Rendering: AI-Enhanced Real-Time Ray
Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
7.2.2.1 Variable Rate Shading . . . . . . . . . . . . . . . . . . . . . 330
7.2.2.2 Nvidia’s New DLSS (March 2020) . . . . . . . . . . 331
7.2.2.3 Mesh Shaders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
7.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
7.3 Intel–Xe GPU (2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

7.3.1 Intel's Xe Max (2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336


7.3.2 Intel's dGPU Family (2021) . . . . . . . . . . . . . . . . . . . . . . . . . 340
7.3.3 DG1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
7.3.3.1 Hello Arc, Goodbye DG . . . . . . . . . . . . . . . . . . . 342
7.3.3.2 Intel’s Supersampling (XeSS) . . . . . . . . . . . . . . 344
7.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
7.4 AMD Navi 21 RDNA 2 (October 2020) . . . . . . . . . . . . . . . . . . . . . . 346
7.4.1 AMD Ray Tracing (October 2020) . . . . . . . . . . . . . . . . . . . 349
7.4.2 FidelityFX Super Resolution (March 2021) . . . . . . . . . . . . 351
7.4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
7.5 Innosilicon (2021) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
7.5.1 The GPU Population Continued to Expand in 2021 . . . . . 354
7.5.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364

Appendix A: Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365


Appendix B: Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
List of Figures

Fig. 1.1 Nvidia NV20-based GeForce 3 AIB with AGP4x bus.
Courtesy of TechPowerUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Fig. 1.2 ATI R200-based Radeon 8500 AIB. Courtesy
of TechPowerUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Fig. 1.3 ATI R200 block diagram. The chip had 60 million
transistors, four pixel shaders, two vertex shaders, two
texture-mapping units, and four ROP engines . . . . . . . . . . . . . . . . 5
Fig. 1.4 Tessellation can reduce or expand the number of triangles
(polygons) in a 3D model to improve realism or increase
performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Fig. 1.5 Normal(s) generation within a TruForm N-patch. Courtesy
of ATI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Fig. 1.6 Generation of control points with N-patches. Courtesy
of ATI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Fig. 1.7 Subdivision and tessellation add realism. Courtesy of ATI . . . . . 10
Fig. 1.8 ATI’s TruForm was a preprocessor in an expanding chain
of graphics functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Fig. 1.9 VisionTek Nvidia NV25-based GeForce Ti 4200 AIB.
Courtesy of Hyins for Wikipedia . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Fig. 1.10 Nvidia GeForce 4 pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Fig. 1.11 ATI R300 Radeon 9700 AIB. Notice heatsinks
on the memory and similar layout to Nvidia NV25-based
GeForce Ti 4200 AIB, in Fig. 1.9. Courtesy of Wikimedia . . . . . 14
Fig. 1.12 ATI R300 block diagram. The display interface included
a multi-input LUTDAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Fig. 1.13 ATI’s R300 vertex setup engine (one of four) . . . . . . . . . . . . . . . . 15
Fig. 1.14 ATI's R300 pixel shader engine; the chip had eight of these
“pipes” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Fig. 1.15 ATI R300 video processing engine block diagram . . . . . . . . . . . . 19
Fig. 1.16 ATI R300 video processing engine showing all the outputs . . . . . 20


Fig. 1.17 Xabre 600 AIB with similar layout to ATI and Nvidia.
Courtesy of Zoltek . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Fig. 1.18 SiS’s Xabre vertex shader data flow between CPU and GPU . . . 25
Fig. 1.19 SiS’s competitive market position . . . . . . . . . . . . . . . . . . . . . . . . . 25
Fig. 1.20 Nvidia's NV30-based GeForce FX 5900 with heat sink
and fan removed. Courtesy of iXBT . . . . . . . . . . . . . . . . . . . . . . . 28
Fig. 1.21 Nvidia NV30 block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Fig. 1.22 Final Fantasy used subdivision rendering for skin tone.
Courtesy of Nvidia [14] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Fig. 1.23 ATI R520 ring bus memory controller. The GDDR is
connected at the four ring stops. (Source ATI) . . . . . . . . . . . . . . . 34
Fig. 1.24 ATI R520 block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Fig. 1.25 ATI R520 thread size and dynamic branching efficiency
was improved with ultra-threading. Courtesy of ATI . . . . . . . . . . 37
Fig. 1.26 ATI R520 vertex shader engine . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Fig. 1.27 Making things look brighter than they are. Courtesy of ATI . . . . 39
Fig. 1.28 Inside the abandoned church with HDR on. Courtesy
of Valve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Fig. 1.29 Inside the abandoned church with HDR off. Courtesy
of Valve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Fig. 1.30 Different modes of anti-aliasing. Courtesy of Valve . . . . . . . . . . . 41
Fig. 1.31 ATI’s special class of bump mapping . . . . . . . . . . . . . . . . . . . . . . 42
Fig. 1.32 ATI’s Ruby red CrossFire—limited production. Courtesy
of ATI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Fig. 1.33 Nvidia NV40 Curie vertex and fragment processor block
diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Fig. 1.34 Nvidia's NV40 Curie-based GeForce 6800 XT AIB.
Courtesy of TechPowerUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Fig. 1.35 Nvidia Curie block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Fig. 2.1 Tony Tamasi. Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Fig. 2.2 GPU architecture progression, first and second era.
Courtesy of Tony Tamasi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Fig. 2.3 Evolution from first-era to third-era GPU design . . . . . . . . . . . . . 53
Fig. 2.4 Nvidia’s G80 unified shader GPU—a sea of processors . . . . . . . 54
Fig. 2.5 Nvidia GeForce 8800 Ultra with the heatsink removed
showing the 12 memory chips surrounding the GPU.
Courtesy of Hyins—Public Domain, Wikimedia . . . . . . . . . . . . . 55
Fig. 2.6 Nvidia’s GT200 streaming multiprocessor . . . . . . . . . . . . . . . . . . 56
Fig. 2.7 Evolution of Nvidia’s logo, 1993 to 2006 (left) and 2006
on (right). Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Fig. 2.8 Daniel Pohl demonstrating Quake running ray-traced
in real time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Fig. 2.9 Intel Larrabee AIB. Courtesy of the VGA Museum . . . . . . . . . . 58
Fig. 2.10 General organization of the Larrabee many-core
architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Fig. 2.11 Larrabee’s simplified DirectX 10 pipeline. The gray


components were programmable by the user, and blue
were fixed. Omitted from the diagram are memory access,
stream output, and texture-filtering stages . . . . . . . . . . . . . . . . . . . 59
Fig. 2.12 Larrabee CPU core and associated system blocks. The
CPU was an in-order Pentium processor design, plus 64-bit
instructions, multi-threading, and a wide vector processing
unit (VPU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Fig. 2.13 Intel’s G45 chipset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Fig. 2.14 Block diagram of an iGPU within a CPU . . . . . . . . . . . . . . . . . . . 64
Fig. 2.15 Intel’s Westmere dual-chip package. Courtesy of Intel . . . . . . . . 64
Fig. 2.16 Intel’s Ironlake-integrated HD GPU . . . . . . . . . . . . . . . . . . . . . . . 65
Fig. 2.17 AMD graphics logos, circa 1985, 2006, 2010. Courtesy
of AMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Fig. 2.18 AMD’s Turks entry-level GPU (2011) . . . . . . . . . . . . . . . . . . . . . 69
Fig. 2.19 Portion of the Llano chip. Courtesy of AMD . . . . . . . . . . . . . . . . 73
Fig. 2.20 Comparison of GPU balance philosophy of semiconductor
suppliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Fig. 2.21 AMD’s APU road map. Courtesy of AMD . . . . . . . . . . . . . . . . . . 74
Fig. 2.22 AMD’s integrated Llano CPU–GPU . . . . . . . . . . . . . . . . . . . . . . . 75
Fig. 2.23 Nvidia’s GeForce GTX 780. Courtesy of Wikipedia
GBPublic_PR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Fig. 2.24 Nvidia demo of a crumbling building. Courtesy of Nvidia . . . . . 77
Fig. 2.25 Intel’s Gen 11 Tiger Lake CPU with iGPU . . . . . . . . . . . . . . . . . . 80
Fig. 2.26 Intel’s SuperFin transistor. Courtesy of Intel . . . . . . . . . . . . . . . . . 80
Fig. 2.27 Die shot of Intel’s 11th Gen Core processor showing
the amount of die used by the GPU. Courtesy of Intel . . . . . . . . . 81
Fig. 2.28 Raja Koduri, Intel’s Chief Architect and Senior Vice
President. Courtesy of Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Fig. 2.29 Nvidia Maxwell GPU running voxel illumination.
Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Fig. 2.30 AMD's GCN CU block diagram . . . . . . . . . . . . . . . . . . . . . . . . 85
Fig. 2.31 AMD revealed their GPU roadmap. Courtesy of AMD . . . . . . . . 85
Fig. 2.32 Intel multi-chip Kaby Lake G. The chip on the left is
the 4 GB HMB2, the middle chip is the Radeon RX Vega,
and the chip on the right is the eighth-gen core. Courtesy
of Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Fig. 2.33 Nvidia’s GPU roadmap. Courtesy of Nvidia . . . . . . . . . . . . . . . . . 87
Fig. 2.34 Block diagram of the AMD Navi 10, one of the first GPUs
powered by the RDNA architecture . . . . . . . . . . . . . . . . . . . . . . . . 89
Fig. 2.35 AMD’s RDNA command processor and scan converter . . . . . . . 90
Fig. 2.36 AMD’s RDNA compute unit front-end and SIMD . . . . . . . . . . . . 90
Fig. 2.37 Intel GT2 iGPU block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Fig. 2.38 Intel Gen 11 iGPU block diagram . . . . . . . . . . . . . . . . . . . . . . . . . 95
Fig. 2.39 CPS added two more steps in the GPU’s pipeline . . . . . . . . . . . . 96

Fig. 2.40 Geometry with red boxes is sufficiently far from the camera
and is therefore of minor importance to the overall image.
Thus, the color shading frequency could be reduced using
CPS with no noticeable effect on the visual quality
or the frame rate. Courtesy of Intel . . . . . . . . . . . . . . . . . . . . . . . 97
Fig. 2.41 Position only tile-based rendering (PTBR) block diagram . . . . . . 97
Fig. 3.1 The rise and fall of mobile graphics chip and intellectual
property (IP) suppliers versus market growth . . . . . . . . . . . . . . . . 102
Fig. 3.2 Mobile devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Fig. 3.3 Sources of mobile GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Fig. 3.4 Big jump in GPU power efficiency. Courtesy of Imagination
Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Fig. 3.5 Tile region protection isolates critical functions from each
other. Courtesy of Imagination Technologies . . . . . . . . . . . . . . . . 105
Fig. 3.6 Imagination’s BXT MC4 block diagram. Courtesy
of Imagination Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Fig. 3.7 The B boxes of Imagination. Courtesy of Imagination
Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Fig. 3.8 In 2020, Imagination had the broadest range of IP GPU
designs available. Courtesy of Imagination Technologies . . . . . . 107
Fig. 3.9 Mali in Arm, circa 2006 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Fig. 3.10 Falanx Arm Mali block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Fig. 3.11 Arm Mali’s graphics stack with MIDlets . . . . . . . . . . . . . . . . . . . . 114
Fig. 3.12 The Mali-400 could share the load on fragments . . . . . . . . . . . . . 116
Fig. 3.13 Fujitsu MB86292 GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Fig. 3.14 Fujitsu’s MB86R01 SoC Jade . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Fig. 3.15 MediaQ MQ-200 block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Fig. 3.16 MediaQ MQ-200 drawing engine . . . . . . . . . . . . . . . . . . . . . . . . . 122
Fig. 3.17 Symbolic block diagram of the Nvidia Tegra 6x0 (2007) . . . . 126
Fig. 3.18 Nvidia’s Tegra road map (2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Fig. 3.19 Nvidia offered its X-Jet software development kit
(SDK) software stack for automotive development
on the Jetson platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Fig. 3.20 Mercedes concept car of the future. Courtesy of Nvidia . . . . . . . 130
Fig. 3.21 In the back row from left: Petri Nordlund, Kaj Tuomi,
and Mika Tuomi from Bitboys. In the front row, Falanx,
from left: unknown (guy in blue jeans), Mario Blazevic,
Jørn Nystad, Edvard Sørgård, and Borgar Ljosland.
Courtesy of Borgar Ljosland . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Fig. 3.22 Bitboys’ Acceleon handheld prototype and the art it is
rendering. Courtesy of Petri Nordlund . . . . . . . . . . . . . . . . . . . . . 133
Fig. 3.23 Bitboys’ G40 mobile GPU organization . . . . . . . . . . . . . . . . . . . . 134
Fig. 3.24 Mikko Sarri 2009. Courtesy Mikko Sarri . . . . . . . . . . . . . . . . . . . 138
Fig. 3.25 Ideal’s Joe Palooka punching bag. Source
thepeoplehistory.com . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

Fig. 3.26 Qualcomm's MSM6550 SoC . . . . . . . . . . . . . . . . . . . . . . . . . . . 140


Fig. 3.27 Qualcomm’s Snapdragon SoC with Adreno GPU . . . . . . . . . . . . 142
Fig. 3.28 Mikko Alho. Courtesy of Siru . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Fig. 3.29 The many lives of the Bitboys . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Fig. 3.30 Texas Instruments early OMAP SoCs . . . . . . . . . . . . . . . . . . . . . . 147
Fig. 3.31 Arm Mali-T658 organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Fig. 3.32 Mali-T658 program management . . . . . . . . . . . . . . . . . . . . . . . . . 151
Fig. 3.33 Pipelines in Arm’s Mali architecture . . . . . . . . . . . . . . . . . . . . . . . 151
Fig. 3.34 Arm Midgard block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Fig. 3.35 Jem Davies, VP of Arm. Courtesy of Arm, 2020 . . . . . . . . . . . 152
Fig. 3.36 The 12-year history of Arm Mali architectures over time . . . . . . 154
Fig. 3.37 Arm’s Mali-G76’s core design block diagram . . . . . . . . . . . . . . . 154
Fig. 3.38 Improved gaming performance with Mali-G76. Courtesy
of Arm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Fig. 3.39 Arm’s comparison of performance between the Mali-G52
and the new Mali-G57. Courtesy of Arm . . . . . . . . . . . . . . . . . . . 156
Fig. 3.40 Arm’s Valhall shader architecture block diagram . . . . . . . . . . . . . 158
Fig. 3.41 Arm Valhall microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Fig. 3.42 Arm’s Mali Valhall architecture. Courtesy of Arm . . . . . . . . . . . . 159
Fig. 3.43 Arm has the whole suite of engines for 5G AI, ML,
and VR. Courtesy of Arm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Fig. 3.44 Arm’s D77 display processor block diagram . . . . . . . . . . . . . . . . 161
Fig. 3.45 Arm said an SoC that could drive the level of performance
for wearable VR HMDs did not exist (in 2019). That
presented a significant challenge to SoC vendors who needed
to achieve the above requirements. Courtesy of Arm . . . . . . . . . 162
Fig. 3.46 Nvidia’s route to and from the mobile market . . . . . . . . . . . . . . . . 165
Fig. 3.47 Nvidia’s high-level (circa 2017) Xavier block
diagram—DLA is the deep learning accelerator . . . . . . . . . . . . . . 167
Fig. 3.48 Nvidia’s Xavier-based Pegasus board (circa 2018) offered
320 TOPS and the ability to run deep neural networks
at the same time. Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . 168
Fig. 3.49 Nvidia’s Tegra SoC roadmap 2022. Courtesy of Nvidia . . . . . . . 168
Fig. 3.50 Qualcomm Snapdragon 6xx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Fig. 3.51 Block diagram of Apple’s M1 PC SoC . . . . . . . . . . . . . . . . . . . . . 172
Fig. 3.52 Floor plan of Apple’s M1 substrate with chip and memory.
Courtesy of Apple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Fig. 3.53 Apple’s M-Series SoCs. Courtesy of Apple . . . . . . . . . . . . . . . . . 174
Fig. 3.54 The CPUs of the M1 Pro. Courtesy of Apple . . . . . . . . . . . . . . . . 175
Fig. 3.55 The M1 Max offers 4x faster GPU performance than M1.
Courtesy of Apple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Fig. 3.56 The M1 Max with its unified embedded memory. Courtesy
of Apple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Fig. 3.57 Performance comparison. Courtesy of Apple . . . . . . . . . . . . . . . . 177

Fig. 3.58 Apple’s UltraFusion packaging architecture connects two


M1 Max die to create the M1 Ultra. Courtesy of Apple . . . . . . . . 178
Fig. 3.59 Apple said the 20-core CPU of the M1 Ultra could deliver
90% higher multi-threaded performance than the fastest
2022 16-core PC desktop chip in the same power envelope.
Courtesy of Apple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Fig. 3.60 Apple claimed its M1 Ultra 64-core GPU produced faster
performance than the highest-end PC GPU available
while using 200 fewer watts of power. Courtesy of Apple . . . . . . 179
Fig. 3.61 Apple M1 compared to AMD Ryzen chip size. Courtesy
of Max Tech/YouTube [49] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Fig. 4.1 Rise and fall of console suppliers versus market growth . . . . . . 188
Fig. 4.2 Number of consoles offered per year over time . . . . . . . . . . . . . . 188
Fig. 4.3 Sony PlayStation 2 block diagram . . . . . . . . . . . . . . . . . . . . . . . . . 189
Fig. 4.4 Original Xbox team Ted Hase, Nat Brown, Otto Berkes,
Kevin Bachus, and Seamus Blackley. Courtesy of Microsoft . . . 191
Fig. 4.5 Xbox block diagram with Nvidia IGP . . . . . . . . . . . . . . . . . . . . . . 193
Fig. 4.6 Halo, developed by Bungie, was an exclusive Xbox
title and credited with the machine’s success. Courtesy
of Microsoft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Fig. 4.7 Sony PSP block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Fig. 4.8 Ken Kutaragi at E3 2003 telling the audience about the PSP . . . . 196
Fig. 4.9 Microsoft Xbox 360 block diagram . . . . . . . . . . . . . . . . . . . . . . . . 197
Fig. 4.10 Microsoft Xbox 360 GPU block diagram . . . . . . . . . . . . . . . . . . . 198
Fig. 4.11 Microsoft’s Xbox 360 Vejle SoC block diagram . . . . . . . . . . . . . 201
Fig. 4.12 Microsoft Xbox 360 SoC chip floor plan. Courtesy
of Microsoft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Fig. 4.13 Nintendo Wii Hollywood chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Fig. 4.14 IBM technologist Dr. Lisa Su holds the new Cell
microprocessor. The processor was jointly developed
by IBM, Sony, and Toshiba. IBM claimed the Cell provided
vastly improved graphics and visualization capabilities,
in many cases 10 times the performance of PC processors.
Courtesy of Business Wire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Fig. 4.15 Nintendo 3DS handheld game machine. Courtesy
of Nintendo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Fig. 4.16 DMP PICO GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Fig. 4.17 Sony PlayStation Vita. Courtesy of Sony . . . . . . . . . . . . . . . . . . . 208
Fig. 4.18 Imagination technologies’ SGX543 IP GPU . . . . . . . . . . . . . . . . . 209
Fig. 4.19 CPUs plus caches take up approximately 15% of the chip
area. The GPUs (center) take up about 33% of the 348
mm2 die area; the rest of the chip area was the memory.
Courtesy of Wikipedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Fig. 4.20 Block diagram of AMD Liverpool (PS4) and Durango
(Xbox One) APU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

Fig. 4.21 Tiny but mighty, AMD’s Jaguar-based APU powered


the most popular eighth-generation game consoles.
Courtesy of AMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Fig. 4.22 Nvidia’s Shield game controller/player. Courtesy of Nvidia . . . . 213
Fig. 4.23 Nvidia’s grid. Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . 214
Fig. 4.24 An Nvidia Shield look-alike, the MOGA Pro controller
with smartphone holder. Courtesy of MOGA . . . . . . . . . . . . . . . . 215
Fig. 4.25 Sony’s eighth-generation PlayStation 4 with controller
changed the design rules for consoles. Courtesy of Sony
Computer Entertainment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Fig. 4.26 Xbox One system architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Fig. 4.27 Internals of Xbox One’s 5+ billion transistor SoC . . . . . . . . . . . . 218
Fig. 4.28 Nvidia’s Shield Console in its holder with controller.
Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Fig. 4.29 Nvidia Tegra X1 block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Fig. 4.30 Artist’s rendition of the Polymega system. The final
version was a dark, flat gray. Courtesy of Polymega . . . . . . . . . . 222
Fig. 4.31 Nintendo’s Switch with controls attached. Courtesy
of Nintendo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Fig. 4.32 Nintendo Switch desk mount. Courtesy of Nintendo . . . . . . . . . . 223
Fig. 4.33 Nintendo console introduction timeline . . . . . . . . . . . . . . . . . . . . . 224
Fig. 4.34 Feargal Mac (left) of Atari and former Microsoft games
executive Ed Fries. Courtesy of Dean Takahashi . . . . . . . . . . . . . 225
Fig. 4.35 Atari 2600 and VCS. Courtesy of Wikipedia . . . . . . . . . . . . . . . . 225
Fig. 4.36 Xiaobawang Zhongshan Subor Z-plus console. Courtesy
of Xiaobawang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Fig. 4.37 Sony PlayStation 5 block diagram . . . . . . . . . . . . . . . . . . . . . . . . . 228
Fig. 4.38 PlayStation introduction timeline . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Fig. 4.39 Microsoft Xbox Series X block diagram . . . . . . . . . . . . . . . . . . . . 230
Fig. 4.40 Microsoft’s Xbox series APU. Courtesy of Microsoft . . . . . . . . . 231
Fig. 4.41 Microsoft Xbox series introduction timeline . . . . . . . . . . . . . . . . . 231
Fig. 4.42 Valve’s Steam Deck game console. Courtesy of Valve . . . . . . . . . 234
Fig. 4.43 Qualcomm handheld game console reference design . . . . . . . . 235
Fig. 5.1 The GPU scales faster than any other processor . . . . . . . . . . . . . . 240
Fig. 5.2 Sun Microsystems' XVR-4000 graphics subsystem.
Courtesy of forms.irixnet.org . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Fig. 5.3 SiliconArts' RayCore 1000 block diagram . . . . . . . . . . . . . . . . . 243
Fig. 5.4 Autodesk 3DS Max 2019 test with two omni-directional
lights and 12,268 triangles. Courtesy of SiliconArts . . . . . . . . . . 244
Fig. 5.5 SiliconArts' RayCore 2000 block diagram . . . . . . . . . . . . . . . . . 245
Fig. 5.6 SiliconArts' RayTree structure . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Fig. 5.7 SiliconArts' RayTree architecture . . . . . . . . . . . . . . . . . . . . . . . . 247
Fig. 5.8 Features of Intel's Ponte Vecchio GPU. Courtesy of Intel . . . . . 249
Fig. 5.9 Aurora exploited a lot of Intel technology. Courtesy of Intel . . . 249

Fig. 5.10 The Xe architecture is scaled by ganging together tiles


of primary GPU cores. Courtesy of Intel . . . . . . . . . . . . . . . . . . . . 249
Fig. 5.11 Intel's Xe-HPC 2-stack shows the configurable and scalable
aspects of its Xe-core design. Courtesy of Intel . . . . . . . . . . . . . 250
Fig. 5.12 The Xe link allowed even more extensive subsystems to be
created. Courtesy of Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Fig. 5.13 Ponte Vecchio with >100 billion transistors, 47 active tiles,
and five process nodes. Courtesy of Intel . . . . . . . . . . . . . . . . . . . 252
Fig. 5.14 Ponte Vecchio chips in a carrier from the fab. Courtesy
of Stephen Shankland/CNET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Fig. 5.15 Intel’s Ponte Vecchio circuit board revealing the tiles
in the package. Courtesy of Intel . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Fig. 5.16 Intel’s accelerated compute system. Courtesy of Intel . . . . . . . . . 254
Fig. 5.17 Intel’s oneAPI allowed heterogeneous processors
to communicate and cooperate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Fig. 5.18 Zhaoxin’s road map showed a dGPU. Courtesy
of CNTechPost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Fig. 5.19 Texas CPU to new GPU—a long, tortuous path . . . . . . . . . . . . . . 258
Fig. 5.20 GlenFly’s AIB running the Unigine Heaven benchmark.
Courtesy of GlenFly Technology . . . . . . . . . . . . . . . . . . . . . . . . . 259
Fig. 5.21 Muxi’s CEO Chen Weiliang has worked in the GPU field
for 20 years. Courtesy of Muxi . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
Fig. 5.22 Darwesh Singh, CEO and founder of Bolt Graphics. Courtesy
of Singh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Fig. 5.23 Bolt’s HPC GPU. Courtesy of Bolt . . . . . . . . . . . . . . . . . . . . . . . . 263
Fig. 5.24 Bolt targeted industries characterized by exponentially
expanding workloads. Courtesy of Bolt Graphics . . . . . . . . . . . . . 264
Fig. 5.25 Jing Jiawei, founder of Changsha Jingjia Microelectronics
Co., Ltd. Courtesy of Changsha Jingjia . . . . . . . . . . . . . . . . . . . . . 265
Fig. 5.26 Jingjia Micro’s JM7200-based PCIe AIB. Courtesy
of Changsha Jingjia Microelectronics Co. . . . . . . . . . . . . . . . . . . . 266
Fig. 5.27 Alphamosaic’s Dr. Robert Swann shows off the VC02’s
development board . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Fig. 5.28 Raspberry Pi 4 model B development board. Courtesy
of Michael Henzler for Wikimedia Commons . . . . . . . . . . . . 269
Fig. 5.29 Doom III running on a Raspberry Pi 4. Courtesy of Hexus . . . . . 270
Fig. 5.30 Rendering examples using only OpenVG features.
Courtesy of DMP Inc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Fig. 5.31 DMP CEO Tatsuo Yamamoto and his dog Momo at E3 . . . . . . . . . 274
Fig. 5.32 Think Silicon founders George Sidropoulos and Iakovos
Stamoulis. Courtesy of Think Silicon . . . . . . . . . . . . . . . . . . . . . . 275
Fig. 5.33 Think Silicon’s whiteboard from 2015. Courtesy of Think
Silicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Fig. 5.34 Think Silicon’s Nema pico GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Fig. 5.35 The Think Silicon team. Courtesy of Think Silicon . . . . . . . . . . . . 278
Fig. 5.36 Comparison of Think Silicon’s Nema and Neox GPUs.
Courtesy of Think Silicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Fig. 5.37 System diagram using the Think Silicon IP blocks . . . . . . . . . . . . . 279
Fig. 5.38 An example of an SoC with Neox IP cores . . . . . . . . . . . . . . . . . . 280
Fig. 5.39 Think Silicon’s application and device range. Courtesy
of Think Silicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Fig. 5.40 Vivante’s Vega IP GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Fig. 5.41 VeriSilicon’s GPU could scale from IoT and wearables
to AI training systems. Courtesy of VeriSilicon . . . . . . . . . . . . . . 286
Fig. 5.42 Nvidia GPU growth in transistors and die size over time . . . . . . . 288
Fig. 5.43 The Nvidia GA100 streaming multiprocessor . . . . . . . . . . . . . . . . 291
Fig. 5.44 Nvidia’s A100 Ampere chip on a circuit board . . . . . . . . . . . . . . . 292
Fig. 5.45 Nvidia’s DGX A100 supercomputer. Courtesy of Nvidia . . . . . . 293
Fig. 5.46 Conventional ray tracing organization . . . . . . . . . . . . . . . . . . . . . . 294
Fig. 5.47 Imagination Technologies’ Photon RAC . . . . . . . . . . . . . . . . . . . . 295
Fig. 5.48 Imagination Technologies’ GPU with RAC . . . . . . . . . . . . . . . . . . 296
Fig. 5.49 Rays per second monitor in PVRTune. Courtesy
of Imagination Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Fig. 5.50 Imagination created the industry’s first real-time ray
tracing silicon in 2014. It showed the R6500 test chip code
named Plato. Courtesy of Imagination Technologies . . . . . . . . . . 297
Fig. 5.51 Nvidia’s Hopper subsystem board. Courtesy of Nvidia . . . . . . . . 299
Fig. 5.52 Time to train the mixture of experts transformer network
for H100 versus A100. Courtesy of Nvidia . . . . . . . . . . . . . . . . . . 300
Fig. 5.53 Nvidia’s H100 Hopper AIB with NVLinks (upper left)
supports a unified cluster of eight GPU. Courtesy of Nvidia . . . . 301
Fig. 5.54 Nvidia’s DGX H100 supercomputer. Courtesy of Nvidia . . . . . . 301
Fig. 5.55 Nvidia’s Earth 2 supercomputer. Courtesy of Nvidia . . . . . . . . . . 302
Fig. 6.1 Timothy Miller. Courtesy of Binghamton University . . . . . . . . . . 307
Fig. 6.2 OGP test board. Courtesy of en.wikipedia . . . . . . . . . . . . . . . . . . 308
Fig. 6.3 Karu Sankaralingam. Courtesy of University
of Wisconsin–Madison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Fig. 6.4 MIAOW block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Fig. 6.5 AMD RDNA generation series block diagram . . . . . . . . . . . . . . . 311
Fig. 6.6 Two different trimmed architectures were generated
for two distinct soft kernels. Courtesy of Pedro Duarte
and Gabriel Falcao from the Universities of Coimbra
and Lisboa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Fig. 6.7 During compile-time, the instructions present in kernel
A indicated that only scalar and vectorized integer
FUs should be instantiated on the reconfigurable
fabric. Courtesy of Pedro Duarte and Gabriel Falcao
from the Universities of Coimbra and de Lisboa . . . . . . . . . . . . . 313
Fig. 6.8 Luke Kenneth Casson Leighton. Courtesy of Leighton . . . . . . . . 314
Fig. 6.9 The libre-SOC hybrid 3D CPU-VPU-GPU . . . . . . . . . . . . . . . . . . 314
Fig. 6.10 Vortex block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
Fig. 6.11 Dr. Atif Zafar. Courtesy of Zafar . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Fig. 6.12 RV64X block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Fig. 6.13 RV64X’s scalable design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Fig. 7.1 Nvidia’s Turing TU102 GPU die photo and block diagram.
Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Fig. 7.2 Ray tracing features supported in Nvidia’s Turing GPU.
Courtesy of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Fig. 7.3 Nvidia’s hybrid-rendering technology combining the ray
tracing capabilities of the RT cores and the image denoising . . . 330
Fig. 7.4 Data flow of Nvidia’s DLSS 2.0 process. Courtesy of Nvidia . . . 332
Fig. 7.5 Nvidia’s DLSS used motion vectors to improve
the supersampling of the enhanced image. Courtesy
of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Fig. 7.6 An Nvidia demo of ray tracing used in a game. Courtesy
of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
Fig. 7.7 History of Intel graphics devices . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Fig. 7.8 Intel’s product range for GPUs. Courtesy of Intel . . . . . . . . . . . . 339
Fig. 7.9 EMIB created a high-density connection between the Stratix
10 FPGA and two transceiver dies. Courtesy of Intel . . . . . . . . . . 339
Fig. 7.10 Intel plans to span the entire dGPU market. Courtesy of Intel . . . 340
Fig. 7.11 Pat Gelsinger, Intel’s CEO. Courtesy of Intel . . . . . . . . . . . . . . . . 341
Fig. 7.12 Intel’s DG1 AIB. Courtesy of Intel . . . . . . . . . . . . . . . . . . . . . . . . 342
Fig. 7.13 Intel’s Xe Arc HPG road map circa 2021. Courtesy of Intel . . . . 343
Fig. 7.14 Intel’s HPG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Fig. 7.15 Classic performance versus quality relationship . . . . . . . . . . . . . . 346
Fig. 7.16 Intel had a new SDK for its recent supersampling
and scaling algorithm. Courtesy of Intel . . . . . . . . . . . . . . . . . . . . 346
Fig. 7.17 Example of the quality of Intel’s XeSS—notice the Caution
sign. Courtesy of Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
Fig. 7.18 AMD RDNA 2 Big Navi GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Fig. 7.19 AMD RDNA 2 compute unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Fig. 7.20 AMD’s intersection block diagram . . . . . . . . . . . . . . . . . . . . . . . . 351
Fig. 7.21 AMD’s FSR pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
Fig. 7.22 Innosilicon’s series one family. Courtesy of Innosilicon . . . . . . . 356
Fig. 7.23 An HDMI, display port with a VGA connector on the back
of the AIB. Courtesy of Innosilicon . . . . . . . . . . . . . . . . . . . . . . . . 356
Fig. 7.24 Fantasy one type B AIBs. Courtesy of Innosilicon . . . . . . . . . . . . 357
Fig. 7.25 Innosilicon’s innolink IP Chiplet block diagram. Courtesy
of Innosilicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Fig. 7.26 Innosilicon’s roadmap. Courtesy of Innosilicon . . . . . . . . . . . . . . 358
Fig. 8.1 Slick car from Forza Horizon 5. Courtesy of Xbox Game
Studios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
Fig. 8.2 Game characters in the 1990s: Doom and Tomb Raider.
Source Wikipedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
Fig. 8.3 Final Fantasy 2001 and Tomb Raider 2013. Courtesy
of Wikipedia and Crystal Dynamics . . . . . . . . . . . . . . . . . . . . . . . 362
Fig. 8.4 Death Stranding 2020 and enemies. Courtesy of Sony
Interactive Entertainment and Unity . . . . . . . . . . . . . . . . . . . . . . . 363
Fig. 8.5 Computational fluid dynamics is used to model and test
in a computer to find problems and opportunities. Courtesy
of Siemens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
List of Tables

Table 1.1 Second-era GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2


Table 1.2 Comparison of three generations of ATI chips . . . . . . . . . . . . . . 21
Table 1.3 Comparison of DirectX features . . . . . . . . . . . . . . . . . . . . . . . . . 21
Table 2.1 Compute capability of Fermi and Kepler GPUs . . . . . . . . . . . . . 78
Table 3.1 Configuration and performance parameters for Mali
family of graphics cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Table 3.2 Mali function list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Table 3.3 2008 Mali-400 MP GPU process-fill rate . . . . . . . . . . . . . . . . . . 115
Table 3.4 Arm Mali Midgard arithmetic unit per pipeline (per core) . . . . 150
Table 3.5 Comparison of Mali-G72 to Mali-G76 . . . . . . . . . . . . . . . . . . . . 155
Table 3.6 Nvidia Tegra SoC product line . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Table 4.1 Game consoles introduced after the GPU . . . . . . . . . . . . . . . . . . 189
Table 4.2 Nvidia’s Shield Console specifications. Courtesy
of Nvidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Table 4.3 Atari VCS specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Table 4.4 Comparison of Sony PlayStation 5 and Microsoft Xbox
Series X key specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Table 4.5 Microsoft’s Xbox Series X and S game consoles . . . . . . . . . . . . 232
Table 4.6 Steam Deck’s specifications compared to Nintendo’s
Switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Table 5.1 SiliconArts feature set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
Table 5.2 RayCore versus CPU ray tracing . . . . . . . . . . . . . . . . . . . . . . . . . 244
Table 5.3 SiliconArts’ RayCore Lite specifications . . . . . . . . . . . . . . . . . . 246
Table 5.4 SiliconArts’ RayCore MC specifications . . . . . . . . . . . . . . . . . . 246
Table 5.5 Bolt’s specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
Table 5.6 Comparison of JM9000-series GPUs to Nvidia
GTX1000-series GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Table 5.7 A comparison of features of Nema and Neox . . . . . . . . . . . . . . . 281
Table 5.8 Nvidia’s Ampere A100 specifications . . . . . . . . . . . . . . . . . . . . . 289
Table 5.9 Nvidia Ampere GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Table 5.10 Nvidia’s Hopper H100 GPU compared to previous GPUs . . . . . 299

Table 7.1 Intel’s 2020 discrete mobile GPU . . . . . . . . . . . . . . . . . . 337
Table 7.2 Intel’s Arc alchemist mobile dGPU product line . . . . . . . . . . . . 345
Table 7.3 AMD’s Radeon series AIBs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
Table 7.4 AMD’s quality settings versus performance . . . . . . . . . . . . . . . . 353
Chapter 1
Introduction

Microsoft’s introduction of DirectX 8 in November 2000 kicked off the second era
of GPU development which started in early 2001. ATI, Nvidia, and others showed
Microsoft and Khronos plans for future GPUs, and programmability was a big part
of it. Moore’s law was enabling the GPU companies to add more transistors. And
those tens of millions of new transistors were being put to clever use. For instance,
more registers, caches, and control logic were added to the GPU. The advances
would convert the GPU from a fixed function graphics controller that could do a
little geometry to a first-class parallel processor, a major step in the evolution of the
GPU and the computer graphics industry.
Perhaps some of the best news was that the industry had learned its lesson and
was not taking any chances on proprietary APIs such as 3dfx’s, Rendition’s, and
others (discussed in Book two). In the early 2000s, OpenGL was ahead of DirectX in
terms of advanced graphics functions. OpenGL 1.1 (1997) already had programmable
vertex shaders in support of workstations.
But there was a gap between the professional graphics applications and consumer
applications like games. Even though programmable shaders would make the games
look better and run faster, the game developers, never looking for more work and
always fighting deadlines, were slow to adopt the capability, with a few exceptions.
In the case of professional tools, graphics capabilities represented a competitive edge
in some cases.

1.1 Programmable Vertex and Geometry Shaders


(2001–2006)

A vertex is a point of a triangle where two edges meet. A triangle has three vertices.
A vertex shader is a processor that transforms shape and positions into 3D drawing
coordinates. It transforms the points of a triangle’s attributes—such as position,

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 1


J. Peddie, The History of the GPU - New Developments,
https://doi.org/10.1007/978-3-031-14047-1_1
2 1 Introduction

direction, color, and texture—from their initial virtual space to the display space. It
allows the original objects to be distorted or reshaped in any manner.
Vertex shaders do not change the type of data; they change the values of the data.
Therefore, a vertex emerges from the process (called shading) with a different color,
different textures, or a different position in space. The vertex shader is sometimes
referred to as an assembler, as is discussed in Book two.
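To make that concrete, the following is a minimal sketch in plain C (an illustration, not any vendor’s actual shader code) of the core per-vertex operation a programmable vertex shader performs: multiplying an object-space position by a combined transformation matrix to produce a new position.

#include <stdio.h>

/* Hypothetical 4x4 matrix and vertex types, for illustration only. */
typedef struct { float m[4][4]; } Mat4;
typedef struct { float x, y, z, w; } Vec4;

/* Transform one vertex position from object space toward screen space,
   the core job of an early programmable vertex shader. */
static Vec4 transform_vertex(const Mat4 *mvp, Vec4 v)
{
    Vec4 r;
    r.x = mvp->m[0][0]*v.x + mvp->m[0][1]*v.y + mvp->m[0][2]*v.z + mvp->m[0][3]*v.w;
    r.y = mvp->m[1][0]*v.x + mvp->m[1][1]*v.y + mvp->m[1][2]*v.z + mvp->m[1][3]*v.w;
    r.z = mvp->m[2][0]*v.x + mvp->m[2][1]*v.y + mvp->m[2][2]*v.z + mvp->m[2][3]*v.w;
    r.w = mvp->m[3][0]*v.x + mvp->m[3][1]*v.y + mvp->m[3][2]*v.z + mvp->m[3][3]*v.w;
    return r;
}

int main(void)
{
    Mat4 identity = {{{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}}};
    Vec4 v = {1.0f, 2.0f, 3.0f, 1.0f};
    Vec4 out = transform_vertex(&identity, v);
    printf("%.1f %.1f %.1f %.1f\n", out.x, out.y, out.z, out.w);
    return 0;
}

Real vertex shaders also transform attributes such as normals and texture coordinates; the position transform above is just the simplest case.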
A triangle is a primitive formed by three vertices.
In both the OpenGL and Direct3D rendering pipelines, the geometry shader
operates after the vertex shader and before the fragment/pixel shader.
The primary purpose of geometry shaders is to create new primitives as part of
the tessellation process. Geometry shaders are started by an incoming primitive (a
triangle), which comes with all the data on a specific primitive as well as adjacent
ones.
Before GPUs, vertex and geometry shaders were run in a dedicated coprocessor
or by the floating-point processors in the CPUs. CPUs that had parallel processors
known as single-instruction, multiple-data (SIMD) processing stages were also used.
Examples of second-era GPUs are given in Table 1.1.
Due to advances in semiconductor manufacturing, the pace of change was much
faster in the second era of GPUs. The GPUs ran faster, had more memory internally
and externally, and drove higher resolution displays. It was a golden era, but not the
last one.

1.1.1 Nvidia NV20—GeForce 3 (February 2001)

Having led the way with the GeForce 256 and its follow-on, GeForce 2, Nvidia
surprised the industry with its powerful GeForce 3. It was the first GPU with
programmable vertex shaders and ushered in new capabilities. The GPU pushed
the display resolution up to 1800 × 1400 × 32 at 70 Hz refresh and 1600 × 1200
× 32 at 85 Hz, a range usually reserved for workstations. Few gamers had such a

Table 1.1 Second-era GPUs


Era     Year        Product name  Arch  Process (nm)  Transistors (M)  Fill rate Mpix/s  Polygon rate M tri
Second  2001        Radeon 8500   R200  150           60               800               75
Second  Aug 2002    Radeon 9700   R300  150           107              2200              300
Second  Early 2001  GeForce 3     NV20  150           57               800               30
Second  Early 2002  GeForce 4 Ti  NV25  150           63               1000              60
Second  Early 2003  GeForce FX    NV30  130           125              2000              200
monitor, so it became an aspirational specification and helped drive some monitor
sales. Along with that higher resolution, Nvidia introduced full-screen anti-aliasing.
The GeForce 256 offered supersampling anti-aliasing (SSAA), a demanding
process that renders the image at a large size internally and then scales it down to the
end-output resolution for the best visual quality. GeForce 3 added multi-sampling
anti-aliasing (MSAA) and Quincunx anti-aliasing methods, both of which performed
significantly better than supersampling.
Multi-sample anti-aliasing (MSAA) is a spatial anti-aliasing technique, a tech-
nique used to remove edge jaggies. MSAA was a major improvement over super-
sampling (SSAA). The boost was achieved by sampling two or more adjacent pixels
together, instead of rendering the entire scene at a very high (i.e., super) resolution.

A Quincunx is a geometric pattern consisting of five points arranged in a cross like


the number five face of a die. Quincunx AA takes four pixels surrounding the pixel
of interest.
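A rough sketch of the kind of reconstruction filter Quincunx AA implies is shown below; the 1/2 center and 1/8 corner weights are the commonly cited values and should be read as an assumption rather than Nvidia’s published specification.

/* Quincunx-style resolve: blend the pixel's center sample with its four
   diagonal neighbors. The weights here are illustrative. */
float quincunx_resolve(float center, float ul, float ur, float ll, float lr)
{
    return 0.5f * center + 0.125f * (ul + ur + ll + lr);
}

The same five-tap pattern is applied per color channel when resolving a full frame.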
The GeForce 3 was so feature rich that some wondered if the game developers
would take advantage of it given their lowest-common-denominator mentality. It was
and still is expensive to develop a game. As a result, developers wanted to reach as
large an audience as possible and hit prime release dates, so they were reluctant to
use only one supplier’s features.
Nvidia launched the NV20 (code named Kelvin1 ) on February 27, 2001. Built on
the 150 nm TSMC process, it had a die area of 128 mm2 and 57 million transistors. The
GPU ran at 200 MHz and had four pixel shaders, one vertex shader, eight texture-
mapping units, and four ROPs. It supported DirectX 8.1 and OpenGL 1.3. The
GeForce 3 had 64 MB of 230 MHz DDR (Fig. 1.1).
The NV20 nFinite FX engine also had Lightspeed Memory Architecture (LMA)—
an ATI-like HyperZ method for culling occluded pixels. It also included a Z
compression/decompression function to optimize bandwidth usage.
The GeForce 3 had 64 MB of 230 MHz DDR memory connected via a 128-bit
memory interface.
The NV20 was Nvidia’s third-generation GPU, and the company used it in several
versions of the GeForce 3 series AIBs.
Nvidia introduced the NV20 three months after acquiring the assets of 3dfx and
marketed the GPU as the nFinite FX Engine. Nvidia’s first Microsoft Direct3D 8.0
compliant AIB added programmable pixel and vertex shaders and multi-sampling
anti-aliasing.

1 With the NV04, Nvidia adopted the policy of using famous scientists’ names for the code names
of their GPUs.
Fig. 1.1 Nvidia NV20-based GeForce 3 AIB with AGP4x bus. Courtesy of TechPowerUp

So, the Kelvin-based nFinite NV20 GPU on the GeForce 3 was Nvidia’s first
second-era GPU which made it significant because it was Nvidia’s first GPU with
programmable vertex shaders and the first to market with a 150 nm chip.

1.1.2 ATI R200 Radeon 8500 (August 2001)

On August 14, 2001, after the November 2000 introduction of Microsoft’s Direct3D
8.1 Shader Model 1.4, ATI introduced the R200 GPU, code named Morpheus.
Direct3D 8.1 contained many powerful new 3D graphics features, such as vertex
shaders, pixel shaders, fog, bump mapping, and texture mapping. The AIB was also
OpenGL 1.3 compatible.
ATI took Direct3D a step further and introduced TruForm, which added hardware
acceleration of tessellation. ATI included TruForm in Radeon 8500 and later products
(Fig. 1.2).
Manufactured in TSMC’s 150 nm process, the R200 used the Rage7 architecture
(Fig. 1.3). The 120 mm2 GPU had 60 million transistors, four pixel shaders, two
vertex shaders, two texture-mapping units, and four ROP engines. The GPU drew
92 watts, which was close to the 110 W limit of AGP4x.
However, what made the R200 significant was its TruForm hardware tessellation
acceleration and its use of a number (n)-patches.
Fig. 1.2 ATI R200-based Radeon 8500 AIB. Courtesy of TechPowerUp

Fig. 1.3 ATI R200 block diagram. The chip had 60 million transistors, four pixel shaders, two
vertex shaders, two texture-mapping units, and four ROP engines
A patch is a collection of control points used to construct a curve or surface.


A control point is any data needed during the creation of the curve or surface. For
example, a control point can represent positional and color information. A line prim-
itive is a patch consisting of two control points, and a triangle primitive is a patch
containing three control points. The control points have positional information and
other things useful in the programmable shading stages, such as color and surface
normal data as discussed previously.
TruForm was a semiconductor intellectual property (SIP) block developed by
ATI for the hardware acceleration of tessellation. Figure 1.4 is a simple example of a
tessellation pipeline rendering a sphere from a crude cubic vertex set. You may recall
the discussion in Book two and the illustration of the Catmull–Clark subdivision. It
is reorganized here to show its association with the GPU’s pipeline.
For anisotropic filtering, the Radeon 8500 used a technique like the R100 but
improved with trilinear filtering and some other refinements. (Refer to Book one, 3D
Texture Filtering).
ATI hoped TruForm would give it a competitive edge. At the R200 rollout, ATI
touted TruForm for its ability to create smoother, more detailed models with a
minimum of work on the part of developers. It would employ N-patches (AKA,
PN triangles), a new higher-order surface composed of curved rather than flat trian-
gles. ATI introduced it, and Microsoft DirectX and OpenGL incorporated it. ATI
called their new front-end preprocessor TruForm (Fig. 1.5).
Hit movies such as Toy Story, Antz, and Shrek used N-patches with great success.
N-patches are a function typically done in software before rendering.

Fig. 1.4 Tessellation can reduce or expand the number of triangles (polygons) in a 3D model to
improve realism or increase performance
Fig. 1.5 Normal(s) generation within a TruForm N-patch. Courtesy of ATI

ATI developed a preprocessor to recognize the normals associated with the


Gouraud shading,2 then subdivide a polygon one, two, or three times, and then inter-
polate the normals between the new vertices. Then the surface was given a Bezier-like
smoothing, which resulted in excellent, rounded surfaces that otherwise would be
squared-off edges. The technique worked very well in almost every situation and was
backward compatible with existing games with normals associated with its polygons
[1].
Although existing games could use the new processor, ATI developed a fix for
game developers that took advantage of the TruForm processor to ensure the devel-
oper had control over how the game looked and that it was true to the artist’s intent.
The patch allowed the game developer to check the artwork to ensure that the TruForm
preprocessor’s effects were what the artist desired in the original illustration or image.
The number of subdivisions or tessellations of each polygon was adjustable. Also,
the operation (and patches) could be turned on or off. In addition, the procedure could

2 A technique developed by Henri Gouraud in the early 1970s that computes a shaded surface based
on the color and illumination at the corners of every triangle. Gouraud shading is the simplest
rendering method and is computed faster than Phong shading. It does not produce shadows or
reflections. The surface normals at the triangle’s points are used to create RGB values, which are
averaged across the triangle’s surface.
be identified or used on a per-object basis within a scene. That degree of flexibility


gave the game developer a wide range of manipulation for existing code and an even
broader range of capabilities on new code.
Because the system was an interpolator, it allowed sending models with fewer
polygons to the rendering pipeline. That had a beneficial effect on bandwidth demand
in both the AGP pipeline and in Internet data transmission.
It was easy to see how small models could be sent in web-based online games and
then reconstructed with more richness on the client side using the technique.
To implement TruForm, ATI chose a third-order (cubic) Bezier curve, which
provided a basis for many higher-order surface methods. (It is mathematically more
straightforward but more difficult to blend than a b-spline curve.)
A Bezier curve is a particular type of curve described by control points. The
control point has local control over its portion of the curve. If a control point was
moved, the part of the curve closest to the control point was affected the most. A
cubic Bezier curve has four control points, one for each endpoint and two inside the
curve. The concept of the Bezier curve can be extended to construct a Bezier surface
in which the surface curvature is also described using a control point.
ATI generated the control points by subdividing polygons sent initially to the GPU
from the game or any graphics application (Fig. 1.6).
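For reference, a cubic Bezier curve with control points P0 through P3 evaluates as B(t) = (1 − t)^3 P0 + 3(1 − t)^2 t P1 + 3(1 − t) t^2 P2 + t^3 P3. A small C sketch of that evaluation follows (the vector type is purely illustrative):

typedef struct { float x, y, z; } Vec3;

/* Evaluate a cubic Bezier curve at parameter t in [0, 1] using the
   Bernstein basis weights. */
Vec3 bezier3(Vec3 p0, Vec3 p1, Vec3 p2, Vec3 p3, float t)
{
    float u  = 1.0f - t;
    float b0 = u * u * u;
    float b1 = 3.0f * u * u * t;
    float b2 = 3.0f * u * t * t;
    float b3 = t * t * t;
    Vec3 r = {
        b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x,
        b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y,
        b0 * p0.z + b1 * p1.z + b2 * p2.z + b3 * p3.z
    };
    return r;
}

Evaluating the curve at t = 0 and t = 1 returns the two endpoints, while the two interior control points pull the curve toward them without lying on it.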
N-patches require only the vertices and normals to create some visually impressive
models. As a result, N-patches allow for a higher level of scalability, meaning the
number of triangles and 3D models can be scaled according to the available graphics
hardware. Software developers tended to design for the lowest common denominator.
They created low triangle count models to run well on the low-end systems that make
up most of the market and seldom created high detail models for users with more
powerful systems. N-patches gave developers a bridge to high-end users. N-patches
can take those small polygon models and generate smooth, highly detailed models.
The final patch-lighting technique used two different algorithms to generate
normals. The algorithm used depended on the complexity of the curved surface.
If the curvature of the patch was relatively simple, the vertex normals could be
linearly interpolated (i.e., blended) across the entire curved surface. That algorithm
got invoked when normal linear generation occurred. When the curvature of a triangle
was more complex, a quadratic Bezier surface could be used to define the difference
of normals across the surface of the patch [2]. Six control points represent a quadratic
triangular surface: three for the vertices and one for each edge midpoint. Taking the
average of the vertex normals at either end of the edge generated each mid-edge
control point.
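A tiny sketch of that mid-edge step is shown below; renormalizing the averaged normal is an added assumption, since the text only says the two vertex normals are averaged.

#include <math.h>

typedef struct { float x, y, z; } Vec3;

/* Mid-edge control normal for an N-patch edge: average the two vertex
   normals and renormalize the result to unit length. */
Vec3 mid_edge_normal(Vec3 n0, Vec3 n1)
{
    Vec3 n = { n0.x + n1.x, n0.y + n1.y, n0.z + n1.z };
    float len = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    if (len > 0.0f) { n.x /= len; n.y /= len; n.z /= len; }
    return n;
}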
The one case was where the artist had to do some hand-tweaking on sharp edges
or corners. The TruForm preprocessor attempted to round them (Fig. 1.7), changing
the image (the corner of a floor and wall, for example). ATI developed a trick that
generated a very thin curved polygon and inserted it at the edge.
ATI demonstrated the algorithm with some models (~500 to 5 k polygons) and
animations with Gouraud shading. Then ATI turned on TruForm, and a fantastic
realism popped out. Lights appeared that were not there in the flatter models, and the
overall effect of smoothing and curves was much more realistic. It looked like ATI
had a real winner feature.
Fig. 1.6 Generation of control points with N-patches. Courtesy of ATI

The preprocessor operated in front of the T&L engine and extended the graphics
pipeline further into the application (Fig. 1.8). The first reaction to a software update
was received with skepticism, but ATI explained that online gaming providers would
sell the update to gamers and make a few extra dollars, so they would love it.
TruForm was a technology announcement, and no product was immediately asso-
ciated with it yet. However, the demos looked good, and the supporting white paper
[3] was also well done, so it was something ATI had been working on and preparing for
quite a while. The company said it was getting good responses from game developers.
The game developers were less resistant to the idea of developing for hardware since
there was so much less unique hardware in the market to create for, and DirectX 8
was the common denominator.
ATI offered TruForm up and until the Radeon X1000 series. After that, they no
longer advertised it as a hardware feature. Nonetheless, the Radeon 9500 and subse-
quent AIBs compatible with Shader Model 3.0 had a render-to-vertex buffer capa-
bility, which could be employed for tessellation acceleration. Tessellation returned
in ATI’s Xenos processor in the Xbox 360 and the 2007 Radeon R600 GPUs [4].
Game developers did not embrace ATI’s TruForm because they would have had
to design their games initially with TruForm procedures to get the best results—it
Fig. 1.7 Subdivision and tessellation add realism. Courtesy of ATI

Fig. 1.8 ATI’s TruForm was a preprocessor in an expanding chain of graphics functions
was not just a patch to the game. Models used to create objects and scenes in the
game needed flags assigned to them to indicate which ones should be tessellated.
And since only ATI had it, the feature, though valuable and useful, did not gain the
support of developers.
The R200 was an advanced design and ATI’s second GPU with a vertex shader.
Sadly, because of minimal adoption by game developers, ATI dropped TruForm
support from its future hardware. The world would have to wait until the third era of
GPUs in 2009 when DirectX 11 was released. Being first did not always pay off.

1.1.3 Nvidia’s NV25–28—GeForce 4 Ti (February 2002)

Launched in February 2002, Nvidia’s NV25-based GeForce 4 Ti (Fig. 1.9) was


an updated GeForce 3 (NV20). Like its predecessor, it had some significant
improvements:
• A higher clock rate (300 MHz versus 200 MHz for the NV20)
• An additional vertex shader
• The pixel shaders had updated instructions for Direct3D 8.0a support and were
rebranded as nFinite FX Engine II
• An updated memory controller called Lightspeed Memory Architecture II as well
as higher memory clock rates—325 MHz
• Hardware anti-aliasing (Accuview AA)
• DVD playback
• DirectX 8.0.

Nvidia implemented Direct3D 7-class fixed function T&L in the vertex shaders.
The company also included an improved dual-monitor capability (TwinView)
adapted from the GeForce 2 MX.4.

Fig. 1.9 VisionTek Nvidia NV25-based GeForce Ti 4200 AIB. Courtesy of Hyins for Wikipedia
The GeForce 4 Ti was superior to the GeForce 4 MX in virtually every respect


except manufacturing cost—it was the same chip, AIB and other components, it just
ran faster. The MX also had the Nvidia video processing engine (VPE), which Nvidia
did not add to the Ti. The GeForce 4 pipeline is shown in Fig. 1.10.
That chip was like the GeForce 3. The most significant changes were simple: the
chip ran at speeds as high as 300 MHz, and memory operated as fast as 325 MHz
(or 650 MHz DDR).
Nvidia used TSMC’s 150 nm process for the NV25 GPUs. The company could
have brought out the GeForce 4 two months earlier than it did, but the GeForce 3
was selling so well during the holiday season. Nvidia delayed the GeForce 4 launch
to avoid cannibalizing the very profitable GeForce 3 sales.
Nvidia created several products from the NV25 design. At the high-end, two
titanium (Ti) models (Ti 4400 and Ti 4600) were the first to be introduced. Then
Nvidia announced three mobile GPUs based on the MX design.
The MX GPUs and AIBs based on them were not impressive compared to the
earlier GeForce 2. However, the titanium models delivered a significant performance
improvement over the previous generation, as much as 50% in some tests. Due to

Fig. 1.10 Nvidia GeForce 4 pipeline


its performance advantage, the sexy name, and Nvidia’s marketing strength, the Ti
4600 quickly became the most popular AIB and reduced the gamer’s interest in ATI’s
Radeon 8500. That situation would change when ATI introduced the Radeon 9700.

1.1.4 ATI’s R300 Radeon 9700 and the VPU (August 2002)

Nvidia promoted the GPU name to enormous success. If asked in the early 2000s
what a GPU was, most people would say a chip made by Nvidia. ATI sought to gain
product and market differentiation and not be the other GPU company. Dave Orton
knew he could not make ATI the GPU company, so he came up with a new term—the
VPU—the visual processor unit. The Radeon 9700 would be a VPU, not a GPU. The
VPU incorporated a technique ATI called vertex skinning for a more fluid movement
of polygons (Refer to the image of the leopards in Book two’s discussion of ATI’s
R100). ATI and 3Dlabs/Creative Labs said the VPU would be the next generation
in graphics and visualization systems.
The ArtX team that developed the first IGP also developed the R300 Radeon 9700,
code named Khan. Built on the Rage 8 architecture, the ATI R300 GPU delivered
exceptional results and shipped on schedule. It was the first to offer DirectX 9.0 and
Shader Model 2.0, Vertex Shader 2.0, and Pixel Shader 2.0 compatibility. The R300
was the industry’s second device to offer AGP 8 × capability; SiS’s Xabre 80/200/400
line was the first. And the R300 was the first GPU to use a flip-chip package, which
enabled high pin counts and faster performance.3
A VPU, said ATI, had to have several programmable elements; lots of memory
bandwidth; wide-bus DACs; and at least eight pipelines for parallel processing.
The R300 VPU (Fig. 1.11) was one of the most advanced graphics processors ever
created. Manufactured in TSMC’s 150 nm process, it had 107 million transistors. It
was a completely new architecture designed around the concepts of high bandwidth,
parallelism, efficiency, precision, and programmability—all the stuff needed to be a
VPU [5].
The chip had a 256-bit crossbar memory controller, multiple display outputs,
video-in and processing, and an AGP8X I/O, as depicted in Fig. 1.12.
The vertex processing engine has four parallel vertex shader pipelines coupled to
an optimized triangle setup engine. ATI said the R300 was the first chip to process
one vertex or triangle in a single clock cycle. Also, it was the first chip to implement
the 2.0 vertex shader specification introduced in DirectX 9.0.
Each vertex shader pipeline in the R300 could handle vector and scalar operations
simultaneously. Vector operations worked on values composed of multiple compo-
nents, such as 3D coordinates (x, y and z components) and color (red, green, and blue

3 A chip packaging technique in which the active area of the chip is “flipped over” facing downward.
The flip chip allows for lots of interconnects with shorter distances than wire, greatly reducing
inductance and the enemy of bandwidth. It is a package solution for high pin count and high-
performance chip package needs.
Fig. 1.11 ATI R300 Radeon 9700 AIB. Notice heatsinks on the memory and similar layout to
Nvidia NV25-based GeForce Ti 4200 AIB, in Fig. 1.9. Courtesy of Wikimedia

Fig. 1.12 ATI R300 block diagram. The display interface included a multi-input LUTDAC
color components). Scalar operations worked on values with just a single component.
Since vertex shaders typically included a mixture of vector and scalar operations, the
optimization could improve processing speed by up to 100%. Figure 1.13 illustrates
the organization.
As mentioned, the device had four vertex engines, and all four outputs went to
the setup engine.
The vertex processing engine of R300-based Radeon 9700 also included support
for ATI’s latest version of TruForm. The higher-order surface technology smoothed
the curved surfaces of 3D characters, objects, and terrain by increasing the polygon
count through tessellation. By taking advantage of the parallel vertex processing of
the R300, such an algorithm could deliver more natural-looking 3D scenes without
requiring any changes to existing artwork.
TruForm also supported displacement mapping, which provided more control
over the shape of 3D objects and surfaces. It worked by modifying the positions

Fig. 1.13 ATI’s R300 vertex setup engine (one of four)

of vertices according to values sampled from a particular type of texture called a


displacement map. The visual effect was like bump mapping but much more realistic
and detailed.
HyperZ consists of three mechanisms:
Z compression. The z-buffer was stored in a lossless compressed format to mini-
mize z-buffer bandwidth as z reads or writes took place. ATI’s compression scheme
operated 20% more effectively than on the original Radeon and Radeon 7500.
Fast Z clear: Rather than writing zeros throughout the entire z-buffer, and thus
using the bandwidth of another z-buffer write, a fast Z clear technique was used that
could tag entire blocks of the z-buffer as cleared. ATI claimed that process could
clear the Z-buffer up to approximately 64 times faster than an AIB without fast
Z clear.
Hierarchical Z-buffer: This feature allowed for the pixel being rendered to be
checked against the z-buffer before the pixel arrived in the rendering pipelines. That
allowed useless pixels to be thrown out early (early Z reject), before the Radeon has
to render them.
HyperZ was announced in November 2000 and was available in the TeraScale-
based Radeon HD 2000 AIBs and in Graphics Core Next-based GPU products. It is
described, in more detail, further in this chapter.
After all the triangles that made up a 3D scene were arranged and lit as necessary,
the next step in rendering the scene was to fill in the triangles with individual pixels.
The applied textures, the lighting conditions, and the material properties assigned to
the triangle determined the color of each pixel. The pixel shaders controlled the pixel
coloring process via programs uploaded to the VPU and executed by the rendering
engine.

1.1.4.1 First PC GPU with Eight Pipes

ATI’s R300 was the first graphics processor able to render up to eight pixels simulta-
neously. The VPU accomplished that with eight parallel, 96-bit rendering pipelines,
each had an independent texture and pixel shader engine. The texture unit of each
rendering pipeline could sample up to 16 textures of up to 1024 instructions with
flow control in a single rendering pass. Depending on the desired quality level, those
textures could be one, two, or three dimensional, with bilinear, trilinear, or anisotropic
filtering applied. The R300’s block diagram is shown in Fig. 1.14.
ATI said its DirectX 9.0 compatible pixel shader engines were designed to handle
floating-point operations and provide increased range and precision compared to the
integer operations used in earlier designs. That was because the engines had up to
96 bits of precision for all calculations, which was necessary for recreating Holly-
wood studio-quality visual effects. Figure 1.14 shows a generalized block diagram
of one of eight R300 shaders.
The R300’s pixel shader engines also achieved greater efficiency by simultane-
ously processing up to three instructions: one texture lookup, one texture address
operation, and one-color operation. Since pixel shaders typically contain a mixture
Fig. 1.14 ATI’s R300 pixel shader engine; the chip had eight of these “pipes”

of those three procedures, that capability would ensure maximum engine utilization
and performance.
The R300 in the 9700 also used ATI’s SmartShader 2.0, the company’s second
generation of programmable vertex, and pixel shader technology. The first imple-
mentation of DirectX 9.0, SmartShader 2.0, was running a little ahead of the industry
(and especially of the game developers). ATI said it would be fully compatible with
current and future revisions of OpenGL. ATI positioned its new shader as enabling
movie-quality effects in real-time games and other interactive applications.
Of course, the R300 rendering engine also included ATI’s latest generation of
Smoothvision (also kicked up to version 2.0), the company’s anti-aliasing, and
texture-filtering technology.

1.1.4.2 Z-Buffer

Reading and updating the z-buffer typically consumes more memory bandwidth
than any other part of the 3D rendering process, making it a significant performance
bottleneck.
A trick to reduce data transfers and avoid bandwidth starving was to use z compres-
sion. It was tricky, and a lossless algorithm was needed to compress data sent to the
z-buffer. So, ATI (and Nvidia and 3Dlabs) had a unique name for those things, and
ATI called its version HyperZ III.
The goal of ATI’s HyperZ technology was to reduce the memory bandwidth
consumed by the z-buffer, thereby increasing performance. It achieved a minimum
of a 2 to 1 compression ratio and up to 4 to 1 in some cases, and ATI claimed that
its implementation worked particularly well with full-scene anti-aliasing (FSAA).
With six times FSAA, the compression ratio could be as high as 24 to 1, and that
characteristic was vital to the anti-aliasing performance of the R300.
The first component of HyperZ was called hierarchical Z. That technique subdi-
vided the z-buffer into blocks of rapidly checked pixels to see if they would be
visible in the final image (also known as occlusion culling). The processor discarded
an entire block if hidden, and the renderer moved on to the next block. If some
portions were visible, the block got subdivided. Subdivision processing continued
until it discarded all hidden pixels. Then the remaining visible pixels got sent to the
pixel shader engines. ATI liked to brag about that, but all the latest era GPUs and
VPUs did that.
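A highly simplified sketch of such a block-level early-Z test follows (an illustration, not ATI’s actual hardware logic), assuming smaller depth values are closer to the viewer: an incoming block can be rejected outright if its nearest depth is still farther than the farthest depth already stored in the tile.

#include <stdbool.h>

/* Per-tile summary: the farthest (maximum) depth currently stored in that
   tile of the z-buffer. Smaller depth means closer to the viewer. */
typedef struct { float max_stored_depth; } ZTile;

/* Conservative block rejection: if the nearest point of the incoming block
   is still behind everything already stored in the tile, no pixel in the
   block can pass the depth test, so the whole block is discarded early. */
bool reject_block(const ZTile *tile, float block_min_depth)
{
    return block_min_depth >= tile->max_stored_depth;
}

If the test fails, the block is subdivided and re-tested, exactly as described above, until only potentially visible pixels remain.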
After the data was culled and compressed, it got flushed for the next frame.
Clearing out the z-factors was extremely important in GPU/VPU design, especially
at higher resolutions. For example, at 1600 × 1200 resolution the system had to write
7.7 MB of data to the z-buffer every frame to clear it. ATI said its HyperZ III required
just a sixty-fourth amount of data to empty the z-buffer, providing dramatically
improved performance at higher resolutions.
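That 7.7 MB figure is consistent with simple arithmetic, assuming a 32-bit depth value per pixel (the depth format is an assumption here):

#include <stdio.h>

int main(void)
{
    const long width       = 1600;
    const long height      = 1200;
    const long bytes_per_z = 4;   /* assumed 32-bit z value per pixel */
    const double mb = (double)(width * height * bytes_per_z) / 1e6;
    printf("z-buffer clear per frame: %.2f MB\n", mb);  /* ~7.68 MB */
    return 0;
}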

1.1.4.3 Video

If one was going to build a VPU, one had better do video and visualization, too, and if
anyone knew that, it was ATI. Therefore, the R300 had an integrated video processing
engine. Using its video shader technology, ATI claimed to offer the first product that
could apply the power of programmable pixel shaders to enhance video capture and
playback in real time. It was one of, if not the, first true multimedia AIBs.
ATI used it for a wide range of new applications. It could do deblocking of
streaming Internet video, noise removal filtering for captured video, and adaptive
de-interlacing. It had algorithms for sharper and clearer TV and DVD playback.
And it could apply Photoshop-style filters in real time for video-editing applications.
Figure 1.15 shows the way ATI went about it.
3D rendering and video functions got processed by two separate sections of the
VPU. Each part has its own set of features.
One example of an application for video shader technology that ATI liked to
use was streaming video deblocking. Most streaming Internet video exhibits blocky
compression artifacts, which are especially noticeable in low bandwidth connections.
The R300, said ATI, could automatically filter out those artifacts, providing smoother,
clearer video images.
ATI pointed out that its programmable pixel shaders would enhance TV and DVD
playback quality by improving the adaptive de-interlacing algorithms used in other
ATI graphics chips. On captured video signals, video shader could apply real-time
noise filtering to produce cleaner video. And, the company added, the R300 also
offered interesting new possibilities for video editing. Image filtering effects such as
blurring, embossing, and outlining would get applied to video streams in real time.
Fig. 1.15 ATI R300 video processing engine block diagram

1.1.4.4 Memory Management

The VPUs and GPUs of 2002 started looking increasingly like CPUs, not least in
how they managed memory. The R300 incorporated a new high-
performance 256-bit DDR memory interface, which ATI said could provide over
20 GB/second of graphics memory bandwidth. It was composed of four independent
64-bit memory channels, each of which could simultaneously write data to memory
or read data back into the graphics processor.
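The quoted figure follows from bus width multiplied by the effective data rate; the short check below assumes an effective DDR rate of about 620 MT/s, close to the shipping Radeon 9700 Pro, which lands at roughly 20 GB/s:

#include <stdio.h>

int main(void)
{
    /* Rough peak-bandwidth estimate: bytes per transfer x transfers per
       second. The 620e6 effective rate is an assumption for illustration. */
    const double bus_bytes         = 256.0 / 8.0;  /* 32 bytes per transfer */
    const double transfers_per_sec = 620e6;
    printf("peak bandwidth: %.1f GB/s\n",
           bus_bytes * transfers_per_sec / 1e9);    /* ~19.8 GB/s */
    return 0;
}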
ATI followed Nvidia’s example, which copied it from SGI, and designed a crossbar
memory controller that divided the 256-bit-wide memory interface into four subunits
that could access the memory separately, making memory accesses more efficient.
The sophisticated sequencer logic ensured the utilization of all four channels for
maximum efficiency.
1.1.4.5 Multiple Displays

ATI was a leader in support of multiple displays, and with its acquisition of Appian
Graphics HydraVision technology for $2 million in 2001, it put together a compre-
hensive package. The R300 display interface supported a wide range of display
configurations and new technologies that improved the quality of displayed images.
The general organizational block diagram is shown in Fig. 1.16.
First to offer HDR. For one thing, the R300 had a new, high-precision 10-bit-per-
color channel frame buffer format enabled by DirectX 9.0. That enhancement of the
standard 32-bpp color format (which supports just 8 bits per color channel) known
as HDR could represent over one billion distinct colors. Although used in high-
end professional graphics systems and workstations, HDR only came to consumer
computers and TVs in 2019.
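The “over one billion” claim is simply 2^30: three color channels at 10 bits each. A one-line check (illustrative):

#include <stdio.h>

int main(void)
{
    unsigned long long colors = 1ULL << (10 * 3);  /* 10 bits x R, G, B */
    printf("%llu distinct colors\n", colors);      /* 1,073,741,824 */
    return 0;
}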
Also, the R300 had two integrated display controllers, allowing it to drive two
displays with entirely independent images, resolutions, and refresh rates. The chip
also included the following:

Fig. 1.16 ATI R300 video processing engine showing all the outputs
Table 1.2 Comparison of three generations of ATI chips


Parameter                                                      Radeon (Apr 00)  R200 (Aug 01)  R300 (Jul 02)
Transistors                                                    30               60             107
Fab process (nm)                                               180              150            150
Memory width                                                   128              128            256
Memory speed MHz                                               200              230            600
Graphic pipes                                                  2                4              8
Basic clock MHz                                                183              270            325
Triangles per second                                           30               75             325
Mpixels/second (Gouraud shaded, z-buffered, and non-textured)  500              1500           2600
API                                                            DX7              DX8            DX9

Table 1.3 Comparison of DirectX features

                          DX 8.0   DX 8.1   DX 9.0
Vertex shaders            1.1      1.1      2.0
Max instructions          128      128      1024
Max constants             96       96       256
Flow control              No       No       Yes
Higher-order surfaces     Yes      Yes      Yes
N-patches                 Yes      Yes      Yes
Continuous tessellation   No       No       Yes
Displacement mapping      No       No       Yes
Pixel shaders             1.1      1.4      2.0
Texture maps              4        6        16
Max texture instructions  4        8        32
Max color instructions    8        8        64
Data type                 Integer  Integer  Floating point
Data precision            32 bits  48 bits  128 bits

• Two 10-bit, 400 MHz DACs for driving high-resolution analog displays through
the VGA port
• A 165-MHz TMDS transmitter for driving digital displays through the DVI port
• TV output capability at resolutions up to 1024 × 768, supporting NTSC/PAL/
SECAM formats.
The R300 was a significant leap from the previous chip (R200; consider Table 1.2).
The other level of comparison is the API. The company was proud to say that the
R300 was designed for Microsoft’s upcoming DirectX 9 specification (Table 1.3).
The step from DirectX 8.1 to DX9 was a significant improvement in exposing
and exploiting the power of the new VPUs and GPUs.
In DirectX 9, vertex shader programs could be much more complex than before.
The new vertex shader specifications added flow control, more constants, and 1024
vertex shader instructions per program. While the new pixel shaders would not allow
flow control, the maximum number of pixel shader instructions had grown to 160.
However, the significant feature of DirectX 9 was the introduction of RGBA values in
64-bit (16-bit FP per color) and 128-bit (32-bit FP per color) floating-point precision.
That tremendous increase of color precision allowed an incredible new number of
visual effects and picture quality.
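As a sketch of what those two pixel formats look like in memory (C has no native 16-bit float type, so the half-precision channels are shown as raw 16-bit fields; the layouts are illustrative, not a DirectX definition):

#include <stdint.h>

/* 64-bit RGBA pixel: four 16-bit floating-point channels, stored here as
   raw bit patterns because C has no built-in half-precision type. */
typedef struct { uint16_t r, g, b, a; } RGBA16F;   /* 8 bytes = 64 bits  */

/* 128-bit RGBA pixel: four 32-bit floating-point channels. */
typedef struct { float r, g, b, a; } RGBA32F;      /* 16 bytes = 128 bits */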
Although DirectX 9 was not introduced until later that year, there were games that
used OpenGL and could take advantage of the R300 to accelerate those features.
Rick Bergman, ATI’s Senior Vice President of marketing and General Manager,
said ATI was moving DirectX 8.1 into the mainstream and making DirectX 9 available
for the enthusiast segment when it came out.
ATI partnered with graphics board companies to bring various products and
configurations to market in all parts of the world. Those product introductions marked
the first time ATI had introduced a family of products simultaneously with its board
partners. AIB manufacturers worked with ATI to market the Radeon 9000, Radeon
9000 PRO, and Radeon 9700-based products.
The company also introduced a mainstream AIB version with a slower R300 and
less memory, called the Radeon 9000 Pro. It featured 64 MB of DDR memory and
flexible dual-display support with DVI-I, VGA, and TV-out, with a suggested retail
price of $149.

1.1.4.6 Along Comes a RenderMonkey

RenderMonkey was an open, extensible shader development tool for current and
future hardware that ATI said enabled programmers and artists to collaborate on
creating real-time shader effects.
ATI showed the tool publicly for the first time at SIGGRAPH 2002 in San Antonio,
and a beta version was available for download from ATI shortly after SIGGRAPH—
free of charge.
The toolkit could be used as a plug-in with any of the 3D-development suites of
the day and would generate vertex and pixel shader code.
It had a real-time viewer that was comfortable for developers and artists to use
and made the development of titles that used vertex and pixel shaders a lot easier
than it had been. Additionally, ATI included a compiler for RenderMan, and another
compiler for Maya was in the works. Nvidia offered its Cg; 3Dlabs also had a high-
level programming language and thus began the shader compiler wars.

1.1.4.7 Summary

ATI said it was taking back the high ground and believed the R300 was ample proof.
“The crown is ours to take,” said Dave Orton, ATI’s CEO.
Orton knew Asia was where the battle would be. ATI did an excellent job repo-
sitioning itself as a semiconductor supplier to the original design manufacturers
(ODMs) and original equipment manufacturers (OEMs) while still leaving room to
maneuver in the ATI high-end AIB retail market. But there were penalties for being
first to market; still, the R300 gave Nvidia a severe challenge and forced Nvidia into
a more aggressive position. But the most exciting part was the remarkable come-
back of ATI. It took almost two years. During that time, the company continued to
produce excellent products, expand into new markets, reorganize the management
and engineering teams, and eke out a modest profit most of the time—overall, one of
the most impressive corporate moves ever seen. A few years later, ATI was acquired
by AMD.
The Radeon 9700 was further honored by being selected by venerable E&S for
its simFUSION 6000q used in simulation and planetarium systems [6]. And Silicon
Graphics used up to 32 ATI R3xx VPUs in their high-end Onyx 4 “UltimateVision”
systems, which came out in 2003 [7].

1.1.5 SiS Xabre—September 2002

SiS released its Xabre 400/200/80 in September 2002 and was the first to market a
GPU supporting the (1997) Accelerated Graphics Port (AGP) 8X bus, a variation
on the (1992) PCI bus standard that provided a dedicated path between the graphics
slot and the processor, enabling faster graphics performance. In November, at the Comdex
conference in Las Vegas, the company showed its Xabre 600, which supported
AGP 8X and DirectX 8.1, whereas the 400 was built on 150 nm and the 600 series
was made in SiS’s new 130 nm process.
SiS claimed that its 30 million transistor Xabre 600 was the most AGP compliant
GPU on the market, supporting AGP 8X 533 MHz with a 16-stage pipeline, full
sideband function, and dual 300 MHz clocks for the engine and the 128-bit memory
I/O. The chip also integrated a 256-bit 3D graphics engine, and the company claimed
was loaded with new SiS proprietary technologies for the mainstream gaming market.
The chip had bump mapping, cubic mapping, and volume texture. It offered texture
transparency, blending, wrapping, mirror, clamping, fogging, alpha blending, and
2X/3X/4X full-scene anti-aliasing. The 256-bit engine had four programmable pixel
rendering pipelines and eight texture units.
Peak polygon rate was specified at 30 M polygons/sec at 1 pixel/polygon with
Gouraud shaded, point-sampled, linear, and bilinear texture mapping. The peak fill
rate was 1200 M pixel/sec and 2400 M texture/sec at 10,000 pixels/polygon with
Gouraud shading. The following are the highlights of the chip.
• Four pipelines could do eight textures per pass
• Up to 11.2 GB/sec of memory bandwidth
• 4 × 32-bit DDR2 memory controllers running at ~700 MHz
• AGP 8X support
Fig. 1.17 Xabre 600 AIB with similar layout to ATI and Nvidia. Courtesy of Zoltek

• DX9 pixel and hybrid vertex shader support
• 35 million transistors
• 130 nm GPU.
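Those peak rates are consistent with the 300 MHz engine clock, four pixel pipelines, and eight texture units quoted above; a quick sanity check, assuming one pixel per pipe and one texel per texture unit per clock:

#include <stdio.h>

int main(void)
{
    const double core_mhz      = 300.0;  /* engine clock quoted by SiS */
    const double pixel_pipes   = 4.0;
    const double texture_units = 8.0;

    printf("peak fill rate:    %.0f Mpixels/s\n", pixel_pipes   * core_mhz);  /* 1200 */
    printf("peak texture rate: %.0f Mtexels/s\n", texture_units * core_mhz);  /* 2400 */
    return 0;
}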
SiS included a scene sensing (polygon count) feature in the driver called Xmart-
Drive. It calculated if the system needed more speed and increased the clocks
when entering complex 3D environments. When the power demand decreased, the
XmartDrive lowered the processing speed. The Xabre 600 AIB is shown in Fig. 1.17.
The Xabre 600 also incorporated SiS’s proprietary hardware/software Vertexlizer
Engine, which used the CPU for the floating-point operations and handled the fixed-
point function on the chip, which saved the company a lot of transistors and, therefore,
costs (see Fig. 1.18).
The design offered two benefits. The first possibility was that the vertex shading
operation could be upgraded independently of the GPU. That meant the shading
engine could be upgraded to DirectX 9’s Vertex Shader 2.0 specification. The second
benefit was that as CPUs became faster, the GPU would scale appropriately.
The company said that approach increased efficiency while reducing crucial GPU
loading. In addition, the vertex processor could handle point, spot, specular, and
fog active light sources (up to ten) and object space-to-screen space transform and
back-face culling. It also provided hardware-view volume clipping, OpenGL two-
side lighting, and OpenGL primary and secondary color merging. The Xabre 600
also offered vertex fogging (with fog table) and T&L-based vertex fog range.
Fig. 1.18 SiS’s Xabre vertex shader data flow between CPU and GPU

As GPUs became more powerful, the need for memory bandwidth increased even
faster. To meet that need, in late 2000, ATI announced HyperZ, and Nvidia announced
its Lightspeed Memory Architecture in the GeForce 3 in February 2001. SiS also
offered a similar capability and increased its memory bus to 128-bit with 64 MB
of 300 MHz DDR memory. The Xabre 600 realized a peak of 9.6 GB per second
memory throughput.
The chip supported DirectX 8.1 (Pixel Shader version. 1.3). Although it had a
pixel engine (“Pixelizer”), it did not support DirectX 9.0. The company targeted the
chip at the mainstream user and assumed DirectX 9 would not be available for a year.
As Fig. 1.19 shows, based on SiS data, SiS had a competitive market and product
plan.
Chris Lin, Vice President of the Multimedia Product Division at SiS, said in
an interview, “With the Xabre family, our goal was to break all the bottlenecks
gamers experience and maximize bandwidth speed, graphics clarity, reliability, and
performance. The Xabre 600 built on that foundation and took the experience up
another notch with even greater speed and performance" [8].

Fig. 1.19 SiS's competitive market position
The SiS Xabre was offered in a 37.5 mm2 package and priced in the $30 range.
The company said the chip would be available in small quantities (sampling) in Q1
2003. SiS planned to introduce its DirectX 9.0 part in Q3 2003 and said it would
have eight graphic pipelines and use DDR2.
Thomas Tsui, Director of SiS’s Multimedia Division, said in June 2001 that the
math was in his favor in that Intel’s competitors struggled to make money on the
unpleasant ASPs that were coming out of the integrated graphics chipsets (IGCs)
market [9]. Tsui believed the only way to achieve the kind of margins that made
building IGCs worthwhile was by having control over manufacturing to achieve
the required economies of scale. In contrast to Nvidia and ATI, SiS had its own
fab. SiS’s integrated 370S and 730SE had integrated 3D and I/O and were compat-
ible with AMD’s 266 MHz FSB. In addition to the 256-bit SiS 315, the company
produced a low-cost discrete graphics chip with T&L for the same market that the
STMicroelectronics—Imagination Technologies Kyro II—was after. SiS expected
the integrated 730 product line to increase the company’s revenues.

1.1.5.1 SiS 301B Video Processor

The SiS 301B did video processing and output in a separate companion chip, another
cost-cutting decision because not every customer wanted or needed video. Also, as
Lin pointed out, the video output sections of a chip did not scale with Moore’s law
and required a certain amount of floor space for resistors and capacitors (RCs). The
video bridge of the SiS 301B chip’s output section had a TV encoder and a 375 MHz
DAC. The chip’s output included a motion-fixing video processor. The chip had de-
interlacing, half-downscaling functions, and four fields per-pixel motion-detection
de-interlace function.
On the input side, the Xabre 600 had an MPEG-2/1 video decoder with a motion
compensation layer decoding architecture that the company said could deliver up to
a 20 Mb/sec bit rate, making it capable of VCD, DVD, and HDTV
decoding.
At the beginning of 2003, SiS reported record sales attributed mainly to the Xabre
600. In March 2003, the company announced its 130 nm, DirectX 9, AGP 8X,
300 MHz Xabre 660 (AKA Xabre II), which would sell for $60, twice the price of
the original Xabre 600.

1.1.5.2 Summary

The Xabre 600’s weak points were no DirectX 9 compatibility, no anisotropic


filtering, slow speed of pixel shaders, and the hybrid T&L. The company also offered
a turbo texturing feature that allowed the user to adjust quality (with texture detailing
and bilinear filtering) and performance.
1.1 Programmable Vertex and Geometry Shaders … 27

The Xabre line was the last dGPU and AIB from SiS because the company spun
off its graphics division (renamed XGI), and it merged with Trident Graphics a couple
of months later, as discussed earlier in this chapter (SiS’s first PC-based IGP).
Most of the XGI products were disappointing and underdelivered. The one excep-
tion was the entry-level V3, which offered performance equal to the GeForce FX 5200
Ultra and Radeon 9200.
XGI introduced the Volari 8300 in late 2005, which was more competitive with
the Radeon X300SE and the GeForce 6200. However, XGI could not sell enough to
sustain itself, and in October 2010, the company was reabsorbed back into SiS.

1.1.6 The PC GPU Landscape in 2003

By 2003, the PC GPU market had consolidated into two main suppliers: ATI and
Nvidia. All traces of other suppliers had faded away except for integrated graphics in
chipsets. ATI and Nvidia kept introducing new GPUs on a relatively regular schedule,
mostly in sync with new process nodes introduced by TSMC and new versions of
Microsoft’s DirectX API. ATI and Nvidia would show and discuss advance plans
with Microsoft for forthcoming GPUs. And if Microsoft liked the ideas, it would
incorporate them into the next version of the DirectX API. In that way, when a new
GPU with new features came out, the API was waiting for it. Game developers also
received advanced information about the new GPUs and APIs, but game development
always took longer than anyone estimated or wanted. Therefore, when a new GPU
was introduced, the AIB, PC suppliers, and Microsoft would have to wait for games
to exploit the new features. Everything would sync up occasionally, and all three
elements—API, GPU, and games—would show up simultaneously, but that was
a rarity. Often ATI and Nvidia would go to game developers and help them with
programming the new features.

1.1.7 Nvidia NV 30–38 GeForce FX 5 Series (2003–2004)

Nvidia’s NV30 to NV38 GPUs (code named Rankine), used in the GeForce FX
series, was the fifth generation of the popular line.
The NV30 should have been released in August 2002, about the same time as
ATI’s Radeon 9700. However, Nvidia experienced start-up problems and high defect
rates with TSMC’s low-K 130 nm process and that delayed Nvidia’s release. Also,
while Nvidia was trying to get the NV30 out, it was developing a special version of
the chip for Microsoft’s Xbox, which spread Nvidia’s engineering resources a little
thin. The Nvidia GeForce FX 5900 AIB with NV30 is shown in Fig. 1.20.
Nvidia switched fabs and went to IBM, which had a more conventional (fluo-
rosilicate glass-FSG) low-K 130 nm process [10]. That eliminated the defects but
delayed the introduction of the new chip.

Fig. 1.20 Nvidia’s NV30-based GeForce Fx 5900 with heat sink and fan removed Courtesy of
iXBT

The company announced the NV30-based GeForce FX 5800, its first GDDR2-based
AIB, in January 2003, several months after ATI released its Radeon 9700 DirectX 9
architecture in August 2002. The 500 MHz, 128 MB GeForce FX
5800 AIB shipped on March 6, 2003. Built on the 130 nm process, the 135 million transistor
GPU had a 128-bit memory bus and supported DirectX 9.0a with Vertex Shader 2.0
and Pixel Shader 2.0. The AIB was compatible with AGP 8x. The NV30’s block
diagram is shown in Fig. 1.21.
The NV30-based GeForce 5x00 AIBs were Nvidia's response to ATI's Radeon
9700 Pro, but Nvidia was several months late to market. When it arrived, it outper-
formed ATI’s R300-based Radeon 9700 Pro by 10% in some tests and underper-
formed relative to the R300 in many other tests [11]. The problems that beset the
NV30’s launch were not totally under Nvidia’s control. It takes a long time to design
and build a GPU, and Nvidia decided to use TSMC’s 130 nm process one or two
years earlier. ATI was conservative with the R300 and stayed with a proven 150 nm
process.

1.1.7.1 CineFX

One of the most noteworthy features to be introduced with the NV30 was CineFX.
Nvidia postulated at the time that interactive PC graphics were approaching the
realism of the computer graphics (CGI) and cinematic shading used in cinema for
special effects and even feature-length movies.

Fig. 1.21 Nvidia NV30 block diagram

One such movie was Final Fantasy: The Spirits Within, a 2001 computer-animated
science fiction film directed by Hironobu Sakaguchi, the creator of the Final Fantasy
franchise, at Square Co., Ltd. Square was a well-known and respected Japanese video
game company founded in September 1986 by Masafumi Miyamoto. The 2001 film
was the first computer-animated feature film with photo-realistic effects and images.
It was also the most expensive video game-inspired film for the next nine years.
Square rendered the movie with the most advanced processing capabilities for film
animation at that time [12]. The movie received poor reviews but was lauded for its
characters' realism. Nonetheless, it did not do well at the box office and has been
blamed for the failure of Square Pictures, resulting in the merger of Square and Enix
[13] (Fig. 1.22).

Fig. 1.22 Final Fantasy used subdivision rendering for skin tone. Courtesy of Nvidia [14]

Real-time cinematic shading required new levels of features and performance such
as advanced programmability, high-precision color, high-level shading language,
highly efficient architecture, and high bandwidth to system memory and the CPU.
With the introduction of programmable vertex shaders, it was possible to issue over
65,000 instructions, giving great control to game developers.
Along with the expanded programming range, Nvidia introduced support for
16- or 32-bit floating-point components and 64-bit and 128-bit FP color precision.
That 16-bit floating-point format was the same format that Pixar and ILM use for
films, the so-called s10e5 representation. And, of course, it had high-dynamic-range
illumination.
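To make the s10e5 format concrete, here is a plain C reference decoder for that 16-bit floating-point layout (1 sign bit, 5 exponent bits with a bias of 15, 10 mantissa bits); it is a textbook decode, not Nvidia's hardware path.

```c
#include <math.h>
#include <stdint.h>

/* Decode the s10e5 half-float format: 1 sign bit, 5 exponent bits
 * (bias 15), 10 mantissa bits. A plain reference decoder. */
float half_to_float(uint16_t h)
{
    int      sign = (h >> 15) & 0x1;
    int      exp  = (h >> 10) & 0x1F;
    unsigned frac =  h        & 0x3FF;
    float    value;

    if (exp == 0)                 /* zero or subnormal */
        value = ldexpf((float)frac, -24);
    else if (exp == 31)           /* infinity or NaN   */
        value = frac ? NAN : INFINITY;
    else                          /* normal numbers    */
        value = ldexpf(1.0f + frac / 1024.0f, exp - 15);

    return sign ? -value : value;
}
```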
CineFX could render a vertex array and offered displacement mapping, particle
systems, and even ray tracing. Nvidia heralded it as the convergence of film and
real-time rendering.
Nvidia also developed a programming language based on C that it called Cg
(C for graphics). Its compiler optimized Cg programs into vertex and pixel shader
assembly code, and the language was compatible with DX9's high-level shading language (HLSL).
NV30 was Nvidia’s first GPU that enabled unified shaders for the Direct3D 9 API
and the unified Shader Model of Direct3D. However, it was not a true unified shader
system that would come later with DirectX 10 and the third era of GPUs discussed
in Chapters ten and thirteen.

The NV30 matched the ATI R300 on the number of textures that could be filtered
(bi-, tri-, or anisotropic) per clock tick; however, the NV30 ran at 500 MHz while
the ATI R300 was only at 325 MHz, giving the NV30 a considerable potential
performance gain.
There was speculation that significant input came from the 3dfx/Gigapixel engi-
neers for the NV30 design. And people tried to read symbolism into the FX name
of the AIBs, thinking it was a gesture to honor the designers from 3dfx that joined
Nvidia. Also, people noted that 3dfx had promoted cinematic effects (motion blur and
other effects done in hardware) and wondered if that was where Nvidia got the idea
for the name CineFX. But no one who was at Nvidia at the time could corroborate
any of those speculations.
The G70 GPU, based on the CineFX architecture and implemented in the GeForce
7800 series of AIBs, was introduced in 2005. It was significant because it offered
three kinds of programmable engines for various stages of the 3D pipeline plus
several additional stages of configurable and fixed function logic. Researchers at
universities used it as a general-purpose parallel processor but had to program it in
the arcane OpenGL language, which showed how excited they were about the potential
payoff and acceleration. It also signaled the need for better software tools if the GPU
was to become a first-class compute accelerator.

1.1.7.2 Nvidia Enters the AIB Market with the GeForce FX (2003)

In early 2003, Nvidia said it would sell quality-certified NV30-, NV31-, and
NV34-based GeForce FX AIBs to its AIB partners [15]. That step was being taken,
Nvidia said, to control the quality of NV30-based boards. Some OEMs had used
knockoff cheap capacitors that failed after a few hours of operation, thus damaging
Nvidia’s reputation with customers. Nvidia’s tightened control over manufacture and
the elevated warranty costs, but Nvidia has always been protective of its image, brand,
and standing in the community. Nvidia said it would still sell low-end chips (which
did not require such tight tolerance) to certain OEM board builders and required
them to tell Nvidia where they sourced their parts.
Naturally, rumors circulated about the move and the gamer press and fans spec-
ulated the move was motivated by the late arrival of the NV30 to the market.
Nvidia countered those rumors by saying that tighter control on the manufacturing
process of complete boards would result in faster availability. Ten years later, one of
Nvidia’s largest AIB partners, EVGA, had had enough of Nvidia’s competition and
margin squeezing and called it quits, refusing to produce or market Nvidia’s newest
generation, the RTX 4000 series AIBs.

1.1.8 ATI R520 an Advanced GPU (October 2005)

Microsoft released service pack 1 of Windows XP, the successor
to Windows 2000, in September 2002. In mid-December 2002, Microsoft released DirectX 9.0. The
company had released several updates of DirectX 9 until August 4, 2004, when
Microsoft released DirectX 9.0c with Shader Model 3.0, which was the most robust
API in the series, and the last one until Vista came out in November 2006. However,
Microsoft revealed Shader Model 3 (SM3) in May 2004.
Nvidia’s GeForce 6 series (code name NV40) launched on April 14, 2004, with
built-in support for Shader Model 3.0. At the Windows Hardware Engineering
Conference (WinHEC), on May 4, 2004, Nvidia demonstrated its 130 nm GeForce
6 Series with Shader Model 3.0. The company declared that the GeForce 6 was the
first and only GPU to take full advantage of Shader Model 3.0—and it was.
ATI also had a GPU that would be SM3 capable, the R520. ATI had worked with
Microsoft on the SM3 definition and design and knew everything about it. It was the
foundation for a line of DirectX 9.0c and OpenGL 2.0 3D Radeon X1000 AIBs. The
R520 was ATI’s first major architectural overhaul since the R300 and optimized for
Shader Model 3.0. But ATI could not ship it until October 5, 2005.
Regardless of the timing, the R520 was genuinely new and powerful graphics
architecture with a new memory manager design, ultra-threading, and flow control.
It also had expanded floating-point and video capabilities that enhanced HDR.
By this time, Dave Orton, the former CEO of ArtX and a VP at Silicon Graphics,
was running ATI. The ArtX team within ATI came from SGI and had brought several
advanced computer graphics ideas and techniques. Orton wanted to apply them to
the GPU. The R520 was to be the showcase GPU for ATI.
ATI was aggressive and bold and the first to move to the new 90 nm fabrication.
New process nodes require new design libraries. Design libraries define how long
connecting wires can be and what, if any, capacitors are needed—at microscopic
sizes. It was tricky and hard to get it right the first time. ATI did not get it right for
months. Maddening and insidious timing errors that defied synthesizers, analyzers,
and simulators, refusing to match up with the fab process and not staying in one place,
drove the engineering and manufacturing staff in Toronto, Boston, Santa Clara, and
Taipei nuts for six months.
R520 test chips coming from TSMC’s fab failed in ATI’s laboratory. The ATI
engineers discovered a problem after several very costly trials. Each time they fixed
it, they had to have a new mask generated. The mask is like a stencil that tells the
deposition process in the fab where to put the silicon. The problem was a faulty
90 nm chip design library from a supplier. ATI had to make multiple masks, called
re-spins, before it rid the design of the problem. The bug was finally tracked
down—after 22 mask tries. So, where ATI (and then AMD) could have been the
first to market with their revolutionary unified shader design, they were the last.
However, ATI could take credit for being the first company in the graphics industry
to introduce three new chips in a new process simultaneously.
One of the side effects of a late rollout is that it shortens the product lifetime and
impacts the return on the engineering investment, which influences margins. But ATI
had no choice other than to get the next version (R580) out on schedule.
When ATI moved to the TSMC 90 nm process, it incorporated dynamic voltage
control (DVC). DVC allows the software to adjust the voltage level used by the GPU.
DVC is used to adjust the performance level of functions not needed for the current
application. It is a feature used today on all GPUs and SoCs.

The R520 (code named Fudo after a colorful industry reporter) combined that new
memory manager design, ultra-threading, and flow control with expanded floating-point
and video capabilities that would enhance HDR.
Here is an overview of what the GPU had:
• Ultra-threaded shader engine with support for DirectX 9 programmable vertex
and pixel shaders
• Fast dynamic branching
• DirectX Vertex Shader 3.0 vertex shader functionality, with 1024 instructions
(unlimited with flow control)
• Single-cycle trigonometric operations (SIN & COS)
• Pixel Shader 3.0 running on ATI’s ultra-thread, pixel shader engine
• Single-precision 128-bit floating-point (fp32) processing as well as 128, 64 and
32-bit per-pixel floating-point color formats
• 16 textures per rendering pass
• 32 temporary and constant registers per-pixel
• Facing register for two-sided lighting
• Multiple render target support
• Shadow volume rendering acceleration.
Manufactured in 90 nm on a 288 mm2 die, the R520 had 320 million transistors,
a core clock of 500 MHz, and a memory clock of 1 GHz. ATI always had leading
memory design capabilities, and its vice president, Joe Macri, was a memory expert.
The chip employed a dual ring architecture, shown in Fig. 1.23.
ATI claimed the ring design would reduce routing complexity, permit higher clock
speeds (with one ring-stop per pair of memory channels), and link directly to the memory
interface. It supported the fastest graphics memory of the time, GDDR3, at 48+
GB/sec over the 512-bit ring bus (two 256-bit rings).
It had a new, larger cache design (one of the reasons for all the transistors) and
was fully associative for best performance. It also gave the R520 improved HyperZ
performance, better compression, hidden surface removal, and had programmable
arbitration logic.
The memory controller broke the memory interface into eight 32-bit channels. That
provided a tighter coupling between the GDDR and the caches and further improved
efficiency. It also allowed cache lines to map to any location in the external
GDDR. Associative caching reduced memory bandwidth requirements for all
operations (texture, color, Z, and stencil), and ATI took advantage of that.
The R520 had an improved hierarchical z-buffer that detected and discarded
hidden pixels before shading (e.g., overlapping objects). The company claimed it
had developed a new technique for using floating point for improved precision,
which caught more hidden pixels than the previous design.
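A minimal C sketch of the coarse rejection idea behind a hierarchical z-buffer follows, assuming a conventional less-than depth test; the data structure and names are invented for illustration and are not ATI's design.

```c
#include <stdbool.h>

/* Coarse hierarchical-Z test (assuming a "less than" depth test, where
 * smaller z is closer). Each tile stores the farthest depth already
 * written; a block of incoming pixels can be rejected without shading
 * if even its nearest point lies behind everything in the tile. */
typedef struct {
    float max_z;   /* farthest depth currently stored in this tile */
} hi_z_tile;

bool tile_can_reject(const hi_z_tile *tile, float incoming_min_z)
{
    /* If the closest incoming depth is still behind the tile's farthest
     * stored depth, every covered pixel would fail the depth test. */
    return incoming_min_z >= tile->max_z;
}
```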
There was also improved Z compression in the R520. Z-buffer data is typically the
largest user of memory bandwidth, and bandwidth could be reduced by up to 8:1 using
lossless compression. ATI said that its new method achieved higher compression
ratios more often.

Fig. 1.23 ATI R520 ring bus memory controller. The GDDR is connected at the four ring stops.
(Source ATI)

Those technology improvements were most apparent in bandwidth-demanding
situations such as high resolutions (1600 × 1200 and up). When adding anti-aliasing (4×
and 6× modes and adaptive AA) and anisotropic filtering (8× and 16× quality AF
modes), the R520 could reach frame rates over twice as fast.
Ultra-threading. One of the improvements in the R520 was the GPU’s ability to do
what ATI called ultra-threading. The ultra-threaded pixel shader engine was capable
of handling up to 512 threads at once. The design minimized wasted processing
cycles and, claimed ATI, delivered over 90% efficiency for shader processing. ATI
optimized the shader engine for Shader Model 3.0, with fast dynamic branching and
full-speed 128-bit floating-point rendering.
The diagram in Fig. 1.24 shows the R520’s organization, with the quad pixel
shader core processors outlined in the center.

Fig. 1.24 ATI R520 block diagram

The dispatch processor maintained the same count of texture address and texture
sample units but removed their tie to the pixel hardware. Pixel threads that needed
to texture could do so independently. In addition, texturing operations could be
scheduled separately from pixel vector and scalar processing arithmetic. Typically,
texture operations take many more cycles than math ops.
Pixel Shader 3.0 inefficiencies. ATI’s ultra-threaded pixel shader engine offered
PS 3.0 support and fast dynamic branching. The vertex shader engine also supported
VS3.0, up to two vertex transformations per clock cycle, and full-speed 128-bit
floating-point processing for all calculations, said the company.
ATI improved pixel shader efficiency by eliminating or hiding latency and
avoiding wasted cycles. One of the significant sources of inefficiency in pixel shaders
is texture fetching. If a pixel shader needed to look up a texture value not located in
the texture cache, it had to look in graphics memory, which introduced hundreds of
cycles of latency.
Dynamic branching was a source of inefficiency within Pixel Shader 3.0. It allowed
a pixel shader program to execute different branches or loops depending on calculated
values. Therefore, cleverly implemented, dynamic branching could provide signif-
icant opportunities for optimization. For example, it allowed for early outs, where
large portions of shader code could be skipped for specific pixels when determined
unnecessary. Unfortunately, dynamic branching in pixel shaders destroys traditional
graphics architecture’s parallelism, which could often eliminate any performance
benefits.
So, ATI attacked those issues by developing their ultra-threaded pixel shader
engine. Ultra-threading was a kind of large-scale multi-threading, which breaks down
the pixel-processing workload required to render an image into many small tasks or
threads. Such threads consist of small four-by-four blocks of pixels. That block of
16-pixels had the same shader code run on it. Ultra-threading was also used in dedi-
cated branch execution units to eliminate flow control overhead in shader processors
and large, multi-ported register arrays to enable fast thread switching.
The R520 pixel shader engine incorporated a central dispatch unit that tracked
and distributed up to 512 threads across an array of shader processors, arranged into
four identical groups called quad pixel shader cores. Each core was an autonomous
processing unit that could execute shader code on a two-by-two block of pixels.
Dynamic flow control. A unique feature found in the R520 was dynamic flow
control. It allowed different paths through the same shader run on adjacent pixels.
That provided optimization opportunities, allowed parts of a shader to be skipped
(early out), and avoided state change overhead by combining multiple related shaders
into one. In general, it allowed the GPU to execute CPU-like code more effectively.
On the negative side, it could interfere with parallelism, and redundant computa-
tion could often reverse any flow control benefits. So, programmers still had to keep
an eye on their threads.
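The early-out idea can be sketched in plain C as a shader-style routine; the light-radius test and names below are hypothetical and stand in for whatever per-pixel condition a real shader would branch on.

```c
typedef struct { float r, g, b; } color;

/* Hypothetical pixel-shader-style routine (written as plain C) showing
 * the "early out" that dynamic flow control enabled: pixels outside a
 * light's radius skip the expensive shading path entirely. */
color shade_pixel(float dist_to_light, float light_radius, color albedo)
{
    if (dist_to_light > light_radius) {          /* early out */
        color black = { 0.0f, 0.0f, 0.0f };
        return black;                            /* unlit: no heavy math */
    }

    /* Expensive path, executed only where the light can contribute. */
    float atten = 1.0f - dist_to_light / light_radius;
    color lit = { albedo.r * atten, albedo.g * atten, albedo.b * atten };
    return lit;
}
```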
No one sits around. Whenever the dispatch processor sensed a core was idle (either
having completed a task or waiting for data), it immediately assigned a new thread to
execute. The dispatcher suspended an idle thread waiting for data, freeing its ALUs
to work on other threads. That enabled the GPU pixel shader cores to achieve over
90% utilization in practice.
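A toy C model of that dispatch behavior is sketched below—stalled threads are skipped and ready ones are handed to the ALUs; the thread states and the round-robin policy are assumptions made for illustration, not ATI's actual scheduler.

```c
/* Toy model of the dispatch behavior described above: a thread stalled
 * on a texture fetch is set aside, and the ALUs are given the next
 * thread that is ready to run. Thread state fields are invented. */
enum thread_state { READY, WAITING_ON_TEXTURE, DONE };

typedef struct {
    enum thread_state state;
    int               pc;      /* next shader instruction for this thread */
} pixel_thread;

int pick_next_thread(pixel_thread t[], int count, int last)
{
    for (int i = 1; i <= count; i++) {
        int idx = (last + i) % count;            /* round-robin scan    */
        if (t[idx].state == READY)
            return idx;                          /* ALUs never sit idle */
    }
    return -1;                                   /* everything stalled or done */
}
```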
ATI tried to illustrate this in the diagram in Fig. 1.25.
The ability to manage many threads, each containing a relatively small number
of pixels, allowed dynamic branching to occur (see Fig. 1.25). The goal was to
minimize the cases where different pixels in the same thread could branch down
different shader code paths. Each time that happened, all the pixels in the thread
had to run each possible code path, which eliminated the performance benefit of
branching.

Fig. 1.25 ATI R520 thread size and dynamic branching efficiency was improved with ultra-threading. Courtesy of ATI
Two-by-two. Each pixel shader processor in the R520 could perform two vector
operations and two scalar operations each clock cycle. The pixel shader engine
included a set of 16-texture address ALUs that could perform texture operations
without tying up the pixel shader cores. The dispatch processor contained sequencing
logic that automatically arranged shader code to maximize the use of the ALUs. Each
core had a dedicated branch execution unit, which could execute one flow control
operation.
The R520 could run six pixel shader instructions per clock cycle on 16 pixels
simultaneously. The pixel shader processors could simultaneously perform two scalar
instructions, two three-component vector instructions, and a flow control instruction.
In addition, a bank of independent texture address units (TAU) could process texture
address instructions in parallel with the shader processors.
Vertex shader engine. The R520 had eight vertex shader units supporting the
VS3.0 Shader Model from DirectX 9. Each was capable of processing one 128-bit
vector instruction plus one 32-bit scalar instruction per clock cycle. Combined, the
eight vertex shader units could transform up to two vertices every clock cycle.
The R520’s vertex shader (Fig. 1.26) was the first GPU capable of processing
10 billion vertex shader instructions per second. The vertex shader units also
supported dynamic flow control instructions, including branches, loops, and subrou-
tines.

Fig. 1.26 ATI R520 vertex shader engine

Four-by-thirty-two float. One of the significant new features of DirectX 9.0 was
the support for floating-point processing and data formats. Compared to the integer
formats used in previous API versions, floating-point formats provided much higher
precision, range, and flexibility.
ATI said its new shader engine would perform all calculations with FP preci-
sion, vertex, and pixel shader operations executed on 128-bit floating-point data
formats. The internal data paths within the shader engine were wide enough to
process those formats at full speed, so there was no need to reduce precision to
optimize performance.
HDR—high-dynamic range. Nvidia got credit for being the first to market with full
32-bit floating-point shaders. And it spoke about high-dynamic range compensating
for overdriving the display’s limited dynamic range.
On the other hand, ATI said it had HDR with its natural light feature it had
demonstrated on the 9700 almost two years earlier.
It came down to shifting and compressing portions of a scene’s intensity so that
adjoining pixels did not mask out neighbors—and to do that, one needed 32-bit
floating-point processing (32 float).
ATI had that in its R520, and it extended that to the alpha element as well, so it
spoke about four-component 32-bit float HDR or 128-bit float HDR.
Dynamic range defines the ratio between the highest and lowest value repre-
sented—i.e., more bits of data equal a greater dynamic range. And the floating point
had a much greater range than the integer. For example:
• 8-bit integer—256:1
• 10-bit integer—1024:1
• 16-bit integer—65,536:1
• 16-bit floating point—2.2 trillion:1.
However, LCDs and some CRTs could only recognize values between zero and
255 (i.e., 8 bits per color component); therefore, they require tone-mapping to
preserve or show detail.

Fig. 1.27 Making things look brighter than they are. Courtesy of ATI

Tone-mapping is used with light bloom and lens flare effects to help convey a high
brightness. HDR rendering takes advantage of color formats with greater dynamic
range, tricking the eye and brain into seeing what we believe are more realistic
images. Computer graphics are tricks to imitate nature with limited computational
and display resources—we are just faking it all the time. A case in point is the
illustration in Fig. 1.27.

Tone-mapping is used in image processing and computer graphics to map one
set of colors to another to approximate the appearance of high-dynamic-range
images in a medium that has a more limited dynamic range.
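As a concrete, hedged example, the widely used Reinhard operator below compresses an HDR luminance into the 0–255 range of an 8-bit display; it is shown only to make the idea tangible and is not necessarily the curve ATI or Valve used.

```c
#include <math.h>

/* One common tone-mapping operator (Reinhard): an HDR luminance in
 * [0, inf) is compressed into [0, 1) before quantizing for an
 * 8-bit-per-channel display. */
float tonemap_reinhard(float hdr_luminance)
{
    return hdr_luminance / (1.0f + hdr_luminance);
}

unsigned char to_display_8bit(float hdr_luminance, float gamma)
{
    float ldr = tonemap_reinhard(hdr_luminance);
    ldr = powf(ldr, 1.0f / gamma);               /* gamma for the display */
    return (unsigned char)(ldr * 255.0f + 0.5f);
}
```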

One of the most used images to show off the effects of HDR was from Valve’s
highly anticipated Half-Life 2: Lost Coast. In it, the hero, Dr. Freeman, fights his
way into an abandoned church, where he must look for threats and possibly goodies.
Seeing in the muted light is especially important, as any FPS player will attest (see
Fig. 1.28).
Notice the bright windows, and yet you could still see detail in walls and floor.
Notice in Fig. 1.29 that the muted windows balance the overall brightness of the
scene, clearly not as realistic.
ATI offered several HDR modes with the R520, including 64-bit (FP16, Int16)—
a maximum-range mode that included HDR with anti-aliasing, blending, and filtering; 32-bit
(Int10) for full speed, which also offered HDR texture compression with DXT and 3Dc+;
and custom formats (e.g., Int10 + L16) for tone-mapping acceleration.

Fig. 1.28 Inside the abandoned church with HDR on. Courtesy of Valve

Fig. 1.29 Inside the abandoned church with HDR off. Courtesy of Valve

ATI’s R520 supported an alternative high-precision 32-bit integer color format


that could be usefully called 10:10:10:2. It could represent 1024 red, green, and blue
levels, using 10 bits each, for a total of over one billion distinct colors. That mode
also supported limited transparency, with just two bits allocated for four distinct
levels. In cases where that was sufficient, the 10:10:10:2 format could provide four
times the dynamic range and precision of standard 32-bit modes, with no additional
1.1 Programmable Vertex and Geometry Shaders … 41

performance or memory cost. Blending and multi-sample anti-aliasing were possible


with that format.
The 10:10:10:2 format is an example of an alternative HDR format that is useful for
applications that can trade a limited dynamic range for improved performance without impacting
image quality.
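A packing routine for such a format is a few lines of bit shifting; the C sketch below assumes a particular channel order within the 32-bit word, which real APIs define differently.

```c
#include <stdint.h>

/* Packing a color into a 10:10:10:2 layout: 10 bits each for red,
 * green, and blue (0-1023) and 2 bits of alpha (0-3). The channel
 * ordering within the 32-bit word is an assumption for this sketch. */
uint32_t pack_rgb10a2(unsigned r, unsigned g, unsigned b, unsigned a)
{
    return  (r & 0x3FFu)
         | ((g & 0x3FFu) << 10)
         | ((b & 0x3FFu) << 20)
         | ((a & 0x3u)   << 30);
}

/* Example: full-intensity red, fully opaque.        */
/* uint32_t c = pack_rgb10a2(1023, 0, 0, 3);          */
```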
Anti-aliasing. The R520 was the first GPU that allowed HDR to be used in conjunction
with multi-sample anti-aliasing. That meant it was no longer necessary to give up one
image quality enhancement to get another. The result was a considerable increase in
lighting quality and realism, said the company.
The R520 offered the usual anti-aliasing techniques such as 4x multi-sample
and adaptive sampling in 4X and 6X modes (see Fig. 1.30). Adaptive anti-aliasing
worked by rendering most of each frame with multi-sample anti-aliasing and then
switching to a supersample approach for only those polygons expected to benefit from
it visually. Each pixel got rendered in multiple passes for supersampled polygons,
with the pixel center shifted to a new sample location on each pass. The results of
each pass were then blended to produce the final pixel color.
That method, said ATI, worked with all ATI’s other anti-aliasing technologies,
including programmable sample patterns, gamma correct blending, temporal anti-
aliasing, and Super AA. It also worked together with HDR rendering to produce
stunning image quality, with only marginally lower frame rates than standard multi-
sample anti-aliasing in most cases.
Flat things that look bumpy. ATI called it Parallax Occlusion Mapping, and it
added 3D detail to flat surfaces. As a result, details could hide and cast shadows on
each other (Fig. 1.31).

Fig. 1.30 Different modes of anti-aliasing. Courtesy of Valve



Fig. 1.31 ATI’s special class of bump mapping

The technology used ray tracing techniques with dynamic branching for improved
performance. Thanks to its ultra-threading technology for SM 3.0, ATI said it was
now possible at real-time frame rates.
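The ray-marching core of parallax occlusion mapping can be sketched in a few lines of C; sample_height() is a hypothetical height-map lookup, and a production shader would also interpolate between the last two steps and reuse the hit point for self-shadowing.

```c
/* Minimal sketch of the ray-marching idea behind parallax occlusion
 * mapping, written as plain C for illustration. */
extern float sample_height(float u, float v);   /* hypothetical lookup, returns [0,1] */

void parallax_occlusion_uv(float u, float v,
                           float view_x, float view_y, float view_z, /* tangent-space view dir, z > 0 */
                           float height_scale, int steps,
                           float *out_u, float *out_v)
{
    float layer_step = 1.0f / (float)steps;
    float du = -(view_x / view_z) * height_scale * layer_step;
    float dv = -(view_y / view_z) * height_scale * layer_step;

    float ray_height = 1.0f;                 /* march from the top surface down */
    float h = sample_height(u, v);

    for (int i = 0; i < steps && ray_height > h; i++) {
        u += du;                             /* step the ray across the surface */
        v += dv;
        ray_height -= layer_step;            /* descend one height layer        */
        h = sample_height(u, v);             /* data-dependent loop exit: the   */
    }                                        /* dynamic branching the text notes */

    *out_u = u;
    *out_v = v;
}
```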
Crossing over fire. ATI had its share of problems, some due to management issues
and others to just bad luck. The delay of the R520 was a significant problem, and the
net result was that management was distracted for almost three calendar quarters.
That distraction let things slip through the cracks, including effective marketing, and
nothing suffered more than the highly anticipated CrossFire.
ATI showed its dual AIB system, CrossFire, for the first time at a private showing
at the 2005 E3 conference. It used two X850 XT PE AIBs with master control logic
(an FPGA) attached.
Analysts and reporters thought CrossFire got delayed with the R520, which made
sense since you could not get a dual AIB operation without a master R520 AIB
(Fig. 1.32).
CrossFire required a regular AIB with an R520, plus a custom AIB with a
controller (the FPGA) and an R520, and a motherboard with an ATI Xpress 200
chipset CrossFire model, which provided 16 PCI Express lanes for the AIBs (ATI
offered an Xpress chipset for Intel and AMD CPUs).
CrossFire was probably one of the most mismanaged product launches in ATI’s
history, and Nvidia and its fans loved it. Every day Nvidia shipped a hundred or so
more SLI AIBs, while ATI struggled to get the R520 out.
CrossFire promised features beyond those of Nvidia’s SLI. CrossFire allowed
the use of different AIB configurations, such as combining an X800 with X1800.
CrossFire also offered more dual rendering modes, such as alternate frame rendering
(AFR, which ATI said it invented). In addition, there were scissors or split screen,
tiling (subdividing the screen into a checkerboard-like pattern of 32 × 32-pixel
squares, with each AIB rendering every other square), and SuperAA.

Fig. 1.32 ATI’s Ruby red CrossFire—limited production. Courtesy of ATI

ATI’s CrossFire did not use a strap connector between the two AIBs as Nvidia’s
SLI did. Instead, it used an external cable between the AIB’s video output connectors
(reminiscent of the Voodoo 2 and original SLI). The DVI out of the slave Radeon
AIB got fed into the DMS port on the master CrossFire Edition Radeon AIB.
The CrossFire Edition (master) was a compositing engine that used a Xilinx FPGA
combined with a Texas Instruments TMDS DVI receiver/deserializer and a Silicon
Image 1161 DVI panel link receiver. It was a little clunky. That was because ATI
did not think SLI was useful. It had tried similar dual AIB operations with some
Rage products when 3dfx promoted the idea, and ATI did not find much traction
or consumer interest. Of course, in those days, ATI was not the darling of graphics
and gamers that 3dfx was, so it was no wonder ATI's customers did not show any
interest.
Put bluntly, ATI missed it, and when they realized dual AIBs were desirable,
the R520, 530, and 515 were already in production (taped out). Therefore, the
compositing engine had to be put on a master AIB instead of inside the chip as
Nvidia did, not that Nvidia got it perfectly correct.
The X1950 Pro GPU, released on October 17, 2006, used the 80 nm RV570 GPU.
It had 12 texture units and 36-pixel shaders and was the first ATI AIB that supported
native CrossFire implementation. Those AIBs also displayed 2048 × 1536 at 70 Hz
on the Radeon X1800, X1600, and X1300 (14X AA).
Although CrossFire worked with any application or game, like Nvidia’s SLI, it
could not help when CPU bound. For those cases, CrossFire could bring a higher
degree of anti-aliasing to improve the looks of the images, if not the performance.

And with the combined AIB’s AA working, the user could get 16X AA and improved
anisotropic filtering.
Also, CrossFire was not quite as robust as ATI suggested concerning mixing and
matching AIBs. Only AIBs within a family would work, i.e., only X800-class AIBs
would work with an X800 master and only X850 AIBs with an X850 master.

1.1.8.1 Avivo Video Engine

Before the launch of R520-based AIBs, ATI released its new Avivo video and display
platform. With the PC’s ever-expanding role as a family entertainment unit, ATI
realized that improving brightness, resolution, de-interlacing, and video playback
were essential.
ATI felt that before HD and terrestrial HD delivery to PC was available, consumer
demand for rich video display platforms was not crucial. But with the growing
integration of media into the PC, consumers would demand better video capabilities
and performance from the PC. With Avivo, ATI hoped to satisfy those expectations.
Capture. Avivo used 3D comb filtering, hardware noise reduction, and multi-path
cancelation to enhance the capture of terrestrial, satellite, and cable signals. Avivo
examined the incoming video and enhanced it to ensure it entered the rest of the
pipeline at as high a quality as possible.
Avivo had 12-bit A/Ds to convert analog signals into digital. The company
emphasized that accuracy level because anything in the analog signal not accurately
converted would be permanently lost, jeopardizing the work of the rest of the video
pipeline.
ATI used 3D comb filtering because 2D filtering separates the signals from within
a single image. 3D filtering uses the two dimensions of the image plus the third
dimension of time for the best separation of the signals.
Before the demodulation process, ATI used multi-path cancelation to eliminate
echoes and shadows caused by urban environments.
Encoding. Avivo playback products (ATI’s discrete graphics and integrated
graphics chipsets) had hardware-assisted compression and transcoding to facilitate
media interchange. ATI had hardware MPEG-2 compression with its tuner encoder
products, and ATI claimed that its encoder products would reduce CPU utilization
to as low as 3−4% while encoding live TV signals.
ATI believed that transcoding was a vital capability with the introduc-
tion of multiple video-capable devices (PDAs, cell phones, portable game consoles,
etc.) that had a wide variety of formats and storage capacity.
Therefore, the company equipped its Avivo products with encoder support for
MPEG-2, WMV9, and H.264. ATI’s VPU had dedicated circuitry for video decode.
Avivo Display. The Avivo display engine consisted of two symmetrical display
pipelines with dual 10-bit end-to-end display processing. The pipelines ensured that
the output image (video or other) matched the device’s display.

1.1.8.2 Summary

ATI was forming a special relationship with Microsoft. The two companies had been
partners for years, and with the introduction of the Xbox 360, they developed even
a tighter connection. The Xbox 360 had advanced capabilities not yet available on
the PC, which ATI developed with Microsoft.
Hoping to leverage that information, ATI pushed the fab (TSMC) to build a 90 nm
next-generation GPU. ATI took several chances, adding new features, API, video
processing, and memory management. It was bold, and some might even say reckless,
but admirable and impressive. But the weak link in the chain got them, and the
company lost its leading position with all its technology. Nonetheless, the R520 was
an outstanding and astonishing design, and ATI would leverage the developments
for several generations.
Shortly after the R520 finally came out, AMD acquired the company. That put a
brake on development, marketing, management planning, and oversight and started
ATI—now AMD—on a downward path that would take over a decade to correct, as
you will see in subsequent chapters.

1.1.8.3 Nvidia’s NV40 GPU (2005–2008)

One of Nvidia’s most successful GPU architectures was the Curie series that spawned
over a hundred AIB models across almost four years. Curie launched in September
2005 with the NV40 in the GeForce 6800 XE (Fig. 1.33). Nvidia ended it in August
2008 with the RSX 65 that went into the Sony PlayStation 3. There were even two
process shrinks for the PlayStation: 40 nm in October 2012 and 28 nm in June 2013.
Nvidia was able to scale the DirectX 9.0c design from 130 to 90 nm. It also
increased from 12 shaders in the NV40 on the GeForce 6800 XT to 27 shaders in
the G71 on the GeForce 7800 GS+. All the products of the NV4x series offered the
same feature set for DirectX 9.

Fig. 1.33 Nvidia NV40 Curie vertex and fragment processor block diagram
The Curie architecture could be built with 4, 8, 12, or 16-pixel pipelines.
Nvidia extensively redesigned the architecture of the NV40’s pixel shader engine
from its predecessor. For example, the NV40 could calculate 32 pixels per clock,
while the NV35/38 could only render eight. Also, the NV40’s pixel shaders were
32-bit floating-point precision. And although the GPU could execute the half-
precision modes of the previous NV3x series, it was not dependent on them to realize
its peak performance.
With the Curie architecture, Nvidia introduced PureVideo. Based on the GeForce
FX’s video engine (VPE), PureVideo reused the MPEG-1/MPEG-2 decoding
pipeline and improved the quality of de-interlacing and overlay resizing. It included
limited hardware acceleration for VC-1 (motion compensation and post-processing)
and H.264 video compatibility with DirectX 9’s VMR9.
PureVideo offloaded the MPEG-2 pipeline starting from the inverse discrete cosine
transform, leaving the CPU to perform the initial run-length decoding, variable-
length decoding, and inverse quantization.
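The stage PureVideo took over begins at the inverse DCT; for reference, a naive 8-point 1-D IDCT is shown below in C (an 8 × 8 block IDCT applies it to rows, then columns). It is a textbook form, not Nvidia's hardware implementation.

```c
#include <math.h>

/* Reference 8-point inverse DCT (orthonormal DCT-II inverse), the
 * transform from which PureVideo took over the MPEG-2 pipeline.
 * Naive O(N^2) form, written for clarity rather than speed. */
void idct_8(const float in[8], float out[8])
{
    const float PI = 3.14159265358979f;

    for (int n = 0; n < 8; n++) {
        float sum = 0.0f;
        for (int k = 0; k < 8; k++) {
            float ck = (k == 0) ? sqrtf(1.0f / 8.0f) : sqrtf(2.0f / 8.0f);
            sum += ck * in[k] * cosf(PI * (float)(2 * n + 1) * (float)k / 16.0f);
        }
        out[n] = sum;   /* reconstructed sample n */
    }
}
```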
PureVideo became a foundational element of Nvidia GPUs, and the company kept
enhancing, expanding, and improving it through over eleven generations.
Nvidia introduced its UltraShadow technology with the Rankine NV35 GPU
found on the GeForce FX 5900. Correct shadows are crucial for realistic and believ-
able images [16]. However, the complex interactions between the light sources, char-
acters, and objects could require complicated programming. In an application like
a game, each light source must be analyzed relative to every object for each image
or frame. The more passes the GPU had to make for the lighting and shadow calcu-
lations in a scene, the more significantly the performance would slow down the PC
and gameplay.
With UltraShadow II in the GeForce 6 and GeForce 7 Series of GPUs, Nvidia
made improvements so that complex scenes achieved noticeably better perfor-
mance. The improvements in UltraShadow II produced four times the perfor-
mance (compared to the previous generation) for passes involving shadow volumes
(Fig. 1.34).
UltraShadow II gave developers the ability to calculate shadows quickly and elim-
inate areas unnecessary for consideration. It defined a bounded portion of a scene
(called depth bounds) and limited calculations to only the area affected most by
the light source. Now developers could accelerate the shadow generation process.
UltraShadow II offered the ability to fine-tune shadows within critical regions. That
enabled developers to create great visualizations that mimicked reality while still
realizing good performance for fast-action games. It also worked with Nvidia’s
Intellisample technology for anti-aliasing shadow edges.
Intellisample Version 4.0 was used in the GeForce 6 and GeForce 7 series and
included two new methods: Transparency Supersampling (TSAA) and the faster but
lower-quality Transparency Multi-sampling (TMAA). Those methods improved the
anti-aliasing quality of scenes with partially transparent textures (such as chain link
fences) and anisotropic filtering of textures at oblique angles to the viewing screen.

Fig. 1.34 Nvidia’s NV40 curie-based GeForce 6800 Xt AIB. Courtesy tech Power Up

Nvidia upgraded its CineFX engine for complex visual effects to Microsoft’s
DirectX 9.0c Shader Model 3.0 and OpenGL 1.5 APIs.
Introduced in 2002 with GeForceFX’s CineFX, programmers could now develop
shader programs utilizing those technologies and techniques.
DirectX 9.0 Shader Model 3.0 capabilities provided by the Nvidia NV40 GPU
included the following:
• Infinite length shader programs. The CineFX 3.0 had no hardware-imposed limi-
tations on shader programs, and longer programs ran faster than on previous
GPUs.
• Dynamic flow control. Additional looping/branching options provided subrou-
tine call/return functions, giving programmers more choices for writing efficient
shader programs.
• Displacement mapping. CineFX 3.0 allowed vertex processing with textures
which provided new realism and depth to every component. Displacement
mapping enabled developers to make subtle changes in a model’s geometry with
minimal computational cost.
• Vertex frequency stream divider. Effects could be applied to multiple characters or
objects in a scene, providing individuality where models were otherwise identical.
• Multiple render target (MRT). MRTs allowed for deferred shading. After
rendering all the geometry once, the scene's lighting could be manipulated,
eliminating multiple passes through the scene. Photo-realistic lighting could be created
while avoiding unnecessary processing time for pixels that did not contribute to
the visible portions of an image.

New effects included subsurface scattering, which provided depth and realistic
translucence to skin and other surfaces. It could generate soft shadows for sophis-
ticated lighting effects, accurately represent environmental and ground shadows,
and create photo-realistic lighting with global illumination. The Nvidia Curie block
diagram is shown in Fig. 1.35.
The GPU could use DDR and GDDR3 memory via a 256-bit-wide memory inter-
face (bus). The device offered 16x anisotropic filtering, rotating grid anti-aliasing,
and transparency anti-aliasing with high-precision dynamic range (HPDR). With its
dual 400 MHz LUT-DACs, the NV40 could display 2048 × 1536 up to 85 Hz.
The GPU included an integrated TV encoder with TV output up to 1024 × 768
resolutions and video scaling and filtering, and the HQ filtering technique could
operate up to HDTV resolutions.
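A rough, hedged calculation shows why a 400 MHz DAC comfortably covers 2048 × 1536 at 85 Hz; the 25% blanking overhead below is an assumed, typical CRT timing figure, not a value from Nvidia's documentation.

```c
#include <stdio.h>

/* Rough check that a 400 MHz DAC covers 2048 x 1536 at 85 Hz.
 * The 25% blanking overhead is an assumed, typical CRT timing value. */
int main(void)
{
    double active_pixels = 2048.0 * 1536.0 * 85.0;   /* ~267 Mpixels/s  */
    double blanking      = 1.25;                     /* assumed overhead */
    double pixel_clock   = active_pixels * blanking; /* ~334 MHz         */

    printf("Required pixel clock: %.0f MHz (DAC limit: 400 MHz)\n",
           pixel_clock / 1e6);
    return 0;
}
```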

Fig. 1.35 Nvidia curie block diagram



1.2 Conclusion

Programmable vertex shaders introduced by ATI and then Nvidia expanded the capa-
bilities of the GPU significantly and ushered in a new era of computer graphics effects
that executed completely and only on the GPU. It was the precursor of what the GPU
would become—a totally programmable compute engine.
During this period, ATI and Nvidia advanced computer graphics features formerly
found in high-priced workstations. This era was marked by additional consolidation
as the number of suppliers dropped from 32 to 8 and the concentration of engineering
talent at ATI and Nvidia accelerated, setting the stage for yet more advanced
developments in next-generation and future GPUs.

References

1. Peddie, J. True to form, ATI raises the ante in graphics processing, The Peddie Report, Volume
XIV, Number 22 (May 28, 2001).
2. Chung, J, and Kim, L-S. A PN Triangle Generation Unit for Fast and Simple Tessellation
Hardware, Proceedings of the 2003 International Symposium on Circuits and Systems, 2003.
ISCAS ‘03. (June 25, 2003). https://tinyurl.com/28nfku3v
3. ATI, TruForm White paper, August 2001. https://tinyurl.com/4tzbyxw4
4. Witheiler, M. ATI TruForm—Powering the next generation Radeon, AnandTech, (May 29,
2001). https://www.anandtech.com/show/773
5. Jon Peddie’s Tech Watch, Volume 2 Number 15, July 22, 2002.
6. De Maesschalck, T. The simFUSION 6000q the force of four Radeon 9700 cores!. (November
5, 2003). https://www.dvhardware.net/article2076.html
7. Shankland, S. SGI uses ATI for graphics behemoths. (July 14, 2003). https://www.cnet.com/tech/computing/sgi-uses-ati-for-graphics-behemoths/
8. Peddie, J. SiS previews Xabre600 GPU, TechWatch Volume 2, Number 24 (November 25,
2002).
9. Maher, K. Computex 2001, Chipsets, The Peddie Report, Volume XIV, Number 24, pp 989.
(June 11, 2001).
10. Singer, G. History of the Modern Graphics Processor, Part 3: Market Consolidation, The Nvidia
vs. ATI Era Begins, TechSpot. (December 6, 2020). https://www.techspot.com/article/657-history-of-the-gpu-part-3/
11. Shimpi, A. Nvidia GeForce FX 5800 Ultra: It’s Here, but is it Good? AnandTech (January 27,
2003). https://www.anandtech.com/show/1062/19
12. Transmedia storytelling: A Demise Caused By A Film, Film Tv Moving Image University of
Westminster. (March 6, 2017)
13. Briscoe, D. Final Fantasy’ flop causes studio to fold, Chicago Sun-Times. (February 4, 2002).
https://article.wn.com/view/2002/02/04/Final_Fantasy_flop_causes_studio_to_fold/
14. CineFX Architecture, Siggraph 2002, Nvidia. http://developer.download.nvidia.com/assets/gamedev/docs/CineFX_1-final.pdf
15. Wang, E and Lee, C. Nvidia said to sell NV30 cards, NV31 and NV34 chips to downstream
clients, Digitimes. (January 13, 2003). https://www.digitimes.com/news/a2003011301002.html
16. UltraShadow II Technology. https://www.nvidia.com/en-us/drivers/feature-ultrashadow2/
Chapter 2
The Third- to Fifth-Era GPUs

Due to Moore’s law, the increase in transistor density opened up new possibilities
for GPU designers and CG scientists. The concepts, math, and algorithms remained
well understood. However, efficiently executing the algorithms was the tricky part.
It was what set one GPU architectural design apart from others (Fig. 2.1).
At Nvidia’s Nvision 2008 conference in San Jose, Tony Tamasi, a former senior
engineer at 3dfx and a senior vice president of content and technology at Nvidia,
took a look back and traced the development of GPUs up to the third era [1]. Tamasi
produced the chart in Fig. 2.2.
Previously fixed functions constrained the first-generation GPUs. For instance,
game developers could only use a limited number of characters in a game. There
were limited animation capabilities. And using multi-textures required multi-pass
rendering. To economize on graphics resources, game developers relied on simplistic
scenes and environments were usually indoors. With the arrival of DirectX 10,
Microsoft would usher in a new period of innovation enabling cinematic effects
and realistic graphics.

2.1 The Third Era of GPUs—DirectX 10 (2006–2009)

The Unified Shader is Introduced


DirectX 8 and 9 and up to DirectX 9c brought programmable shading and the first
efforts at tessellation. The second era of GPUs ran from late 2000 to the end of
2006.
Dynamic programmable shading was introduced in the second generation of GPUs
and allowed an increased number of characters, thousands of polygon skeletons, rigid
bodies, and outdoor lighting with programmable pixels [2].
DirectX 10, released with Microsoft's Windows Vista in late 2006, marked
the beginning of the third era of PC GPUs.


Fig. 2.1 Tony Tamasi. Courtesy of Nvidia

Fig. 2.2 GPU architecture progression, first and second era. Courtesy of Tony Tamasi

The defining technology of this era was the introduction of unified shaders, which
combined vector and pixel shaders. The first instance of the unified shader model
was realized in the ATI Xenos processor in the Xbox 360 in June 2005.
The shift opened the door to programmable graphics and is symbolized in Fig. 2.3.
It allowed armies of characters, complex physical simulations, sophisticated AI,
procedural generation, custom renderers, and lighting. In Microsoft’s parlance, this
strategy was called Shader Model 4, or SM 4.
Unified shader design made all shaders equal, and their capabilities were available
simultaneously. In the past, the differences between vertex and pixel shaders could
cause situations where the processes got out of sync so that vertex shaders were idle,
and pixel shaders got backlogged.
The following are some examples of third-era GPUs.

Fig. 2.3 Evolution from first-era to third-era GPU design

2.1.1 Nvidia G80 First Unified Shader GPU (2006)

On November 8, 2006, Nvidia launched its first unified shader architecture and the
first DirectX 10, Shader Model 4-compatible GPU, the G80. The G80 redefined what
GPUs were capable of and what they would become.
Radeon 8500 in 2001 and Nvidia’s GeForce 3 could execute small programs via
specialized, programmable vertex and pixel shaders. Nvidia and ATI carried on with
that basic design up until March 9, 2006, when Nvidia released the G71. The G80
(Fig. 2.4) broke the mold and ushered in a new era of GPU capabilities.
The G80 was based on Nvidia’s Tesla architecture (code named NV50) and had
128 shaders (grouped in 16 streaming multiprocessors (SMs)). The 484 mm2 die was fabri-
cated in TSMC’s 90 nm process and had 681 million transistors. The GPU had
32 TMUs and eight texture processor clusters (TPCs). It used GDDR3 memory
clocked at 792 MHz, while the GPU ran at 513 MHz.
Nvidia’s G80, which powered the GTX 8800 family of AIBs, was the first to
replace dedicated pixel and vertex shaders with an array of standard (unified) stream
processors (SPs—Nvidia would later re-brand them as CUDA cores).
Nvidia’s previous GPUs were SIMD vector processors and could run concurrently
on RGB+A color components of a pixel. The G80 had a scalar processor design such
that each streaming processor could handle one color component. Nvidia moved
from a GPU architecture with dedicated hardware for specific shader programs
to an array of relatively simple cores. Those (seemingly) simpler cores could be
programmed to perform whatever shader calculations an application required. It was
a clear breakthrough and a break-away design.
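The contrast between the older SIMD-vector organization and the G80's scalar stream processors can be pictured with a toy C sketch; it is purely illustrative and does not represent either architecture's actual instruction set.

```c
/* Toy contrast between the two organizations described above. Earlier
 * SIMD-vector GPUs applied one instruction to all four components of a
 * pixel at once; scalar stream processors each work on a single
 * component, so oddly sized math wastes no lanes. Plain C, for
 * illustration only. */
typedef struct { float r, g, b, a; } vec4;

/* Vector style: one "instruction" touches all four components. */
vec4 scale_vec4(vec4 p, float s)
{
    vec4 out = { p.r * s, p.g * s, p.b * s, p.a * s };
    return out;
}

/* Scalar style: the same work expressed as independent per-component
 * operations, each of which could be scheduled on its own processor. */
void scale_scalar(float components[], int count, float s)
{
    for (int i = 0; i < count; i++)
        components[i] *= s;       /* one component per stream processor */
}
```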
In an interview with Joel Hruska of ExtremeTech in November 2016, Jonah Alben,
Nvidia's Senior VP of GPU Engineering, said, "I think that one of the biggest chal-
lenges with G80 was the creation of the brand new 'SM' processor design at the
core of the GPU. We pretty much threw out the entire shader architecture from
NV30/NV40 and made a new one from scratch with a new [single-instruction,
multiple-threads] general processor architecture (SIMT), that also introduced new
processor design methodologies" [3].

Fig. 2.4 Nvidia's G80 unified shader GPU—a sea of processors
Nvidia designed the G80 to run much more complicated pixel shaders with more
branching, dependencies, and resource requirements than previous chips. The new
cores could also operate much faster. The previous GeForce 7900 GTX's GPU
was manufactured in a 90 nm process at TSMC and ran at 650 MHz. The shader cores
of the G80-based GeForce 8800 GTX (Fig. 2.5) ran at 1.35 GHz.
The new chip debuted in two new AIBs, the $599 GeForce 8800 GTX and the
$449 GeForce 8800 GTS.
The G80 was the threshold processor that would lead Nvidia to general computing
acceleration beyond gaming—a big evolutionary step that would have consequences
on the entire computing industry for decades.

2.1.2 Nvidia GT200 Moving to Compute (2008)

Demonstrating the scalability of its primary texture processor cluster (TPC), Nvidia
took the G80 design (that was built in 90 nm) and shrank it to 65 nm and then made
the whole chip larger by increasing the 681 million transistors in the G80 to an astonishing
1.4 billion in the GT200 and in the process creating the biggest chip built to date.

Fig. 2.5 Nvidia GeForce 8800 Ultra with the heatsink removed showing the 12 memory chips
surrounding the GPU. Courtesy of Hyins—Public Domain, Wikimedia

Nvidia would make that claim on several future GPUs as well—bigger is better at
Nvidia, or as Jensen Huang has said, “Moore’s law is our friend.” [4].
Nvidia started with a basic design for a streaming processor (SP), known as a
shader—see the CUDA core call out in Fig. 2.6. Nvidia then built a streaming multi-
processor (SM) from an array of SPs. In the GT200, an SM consisted of eight SPs
(depicted as cores in the block diagram) plus a special function unit (SFU).
Nvidia designed its GPU architecture to be scalable, so a texture processing core
(TPC) could be made of any number of SMs. In the G80, there were two SMs per
TPC, and in the scaled up GT200, there were three SMs.
Nvidia continued the modular theme by grouping several TPCs to form a streaming
processor array (SPA).
The result was a 576 mm2 chip with 240 shaders (cores), 80 TMUs, 60 special
function units (SFUs), and 10 TPCs. The chip drew a whopping
236 W, another biggest number for Nvidia. The GPU ran at 600 MHz, while the
GDDR3 memory ran at an effective 2.2 GHz from a 1.1 GHz clock. Like the G80, it supported
Direct3D feature level 10_0 (under DirectX 11) as well as OpenGL 3.3. The chip was used on Nvidia's GTX 200-
series AIBs, and the top-of-the-line GTX 280 had 1 GB of memory, another
biggest for Nvidia.
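The modular arithmetic behind that configuration is simple; the short C program below (illustrative only) multiplies out the hierarchy described above.

```c
#include <stdio.h>

/* The modular arithmetic behind the GT200 configuration given above:
 * 8 streaming processors per SM, 3 SMs per TPC, 10 TPCs per chip. */
int main(void)
{
    int sp_per_sm  = 8;
    int sm_per_tpc = 3;
    int tpcs       = 10;

    int total_sps = sp_per_sm * sm_per_tpc * tpcs;   /* 240 shader cores */
    printf("GT200 stream processors: %d\n", total_sps);
    return 0;
}
```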

2.1.2.1 Summary

What was significant about the chip was that it demonstrated two substantial aspects of
Nvidia's Tesla design: scalability in moving to a smaller process node (90–65 nm)
and scalability in the modular assembly of the shaders and their subsequent
larger building blocks.

Fig. 2.6 Nvidia’s GT200 streaming multiprocessor



Fig. 2.7 Evolution of Nvidia’s logo, 1993 to 2006 (left) and 2006 on (right). Courtesy of Nvidia

Also in 2006, Nvidia changed its logo font from italics to bold (Fig. 2.7). Nvidia likes to write its name in all caps; however, the name is a proper noun, not an acronym, and should not be in all caps.

2.1.3 Intel Larrabee to Phi (2006–2009)

Intel launched the Larrabee project in 2005, code named SMAC. Paul Otellini, Intel’s
CEO, hinted about the project in 2007 during his Intel Developer’s Forum (IDF)
keynote. Otellini said it would be a 2010 release and compete against AMD and
Nvidia in the realm of high-end graphics.
Intel announced Larrabee in 2008, first at SIGGRAPH in early August, then at the IDF in mid-August, and finally at the Hot Chips conference in late August. The company said Larrabee would have dozens of small, in-order x86 cores and run as many as 64 threads. The chip would be a coprocessor suitable for graphics processing or scientific computing, and Intel said programmers could decide, at any given time, how they would use those cores.
At Intel’s Research @Intel Day, held in the Computer History Museum on June
14, 2009, the company showed a ray-traced version of Enemy Territory: Quake Wars
at 1024 × 720. Intel research scientist Dr. Daniel Pohl demonstrated his software
on a 16-core 2.93 GHz Xeon Tigerton system, with four processors, each with four
cores (Fig. 2.8). The implication was that the multi-processor Larrabee would deliver
similar performance, but the processors in Larrabee were not of Xeon class.
Intel did not release the details on Larrabee, but rumors swirled that it would have
64 processors—four times as many as Pohl used.
Larrabee was scheduled to launch in the 2009–2010 timeframe; then in December
2009, Intel surprised the industry and canceled it. Rumors circulated in the late
2009 that Larrabee was not performing as well as expected. And in 2010, Intel
acknowledged the power density of x86 cores did not scale as well as a GPU. The
AIB was big, as indicated in Fig. 2.9.
To salvage all the work that had gone into Larrabee, Intel pivoted to the Knights family of compute coprocessors (later sold as Xeon Phi) based on the Larrabee chip.
Larrabee used multiple in-order x86 CPU cores augmented by a wide vector
processor unit and a few fixed function logic blocks. Intel said this would provide

Fig. 2.8 Daniel Pohl demonstrating Quake running ray-traced in real time

Fig. 2.9 Intel Larrabee AIB. Courtesy of the VGA Museum

dramatically higher performance per watt and unit of area than out-of-order CPUs
on highly parallel workloads. The company asserted that Larrabee’s highly parallel
architecture would make the rendering pipeline completely programmable. It could
run an extended version of the x86 instruction set, including wide vector processing
operations and specialized scalar instructions. The three diagrams, Figs. 2.10, 2.11,
and 2.12, are based on Intel’s presentation at SIGGRAPH.

Fig. 2.10 General organization of the Larrabee many-core architecture

Fig. 2.11 Larrabee’s


simplified DirectX 10
pipeline. The gray
components were
programmable by the user,
and blue were fixed. Omitted
from the diagram are
memory access, stream
output, and texture-filtering
stages

Fig. 2.12 Larrabee CPU core and associated system blocks. The CPU was a Pentium processor in-order design, plus 64-bit instructions, multi-threading, and a wide vector processor unit (VPU)

The diagram was schematic: the number of CPU cores, the number and type of coprocessors and I/O blocks, and the locations of the CPU and non-CPU blocks on the chip were all implementation dependent. Each core could access a
subset of a coherent L2 cache to provide high bandwidth access and simplify data
sharing and synchronization.
Intel claimed Larrabee would be more flexible than current GPUs. Its CPU-like
x86-based architecture supported subroutines and page faulting. Some operations that
GPUs traditionally perform with fixed function logic, such as rasterization and post-
shader blending, were implemented entirely in software in Larrabee. As with GPUs,
Larrabee used fixed function logic for texture filtering, but the cores assisted the
fixed function logic (e.g., by supporting page faults). Larrabee's core block diagram is shown in Fig. 2.12.
Larrabee’s programmability offered support for traditional graphics APIs such
as DirectX and OpenGL via tile-based deferred rendering that ran as software
layers. Intel ran the renderers using a tile-based deferred rendering approach. Tile-
based deferred rendering can be very bandwidth-efficient, but it presented some
compatibility problems at that time—the PCs of the day were not using tiling.
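A minimal sketch of the binning idea behind tile-based rendering (illustrative Python, not Larrabee's renderer; the 64-pixel tile size is an assumption) is shown below: triangles are first sorted into screen tiles, and each tile is then shaded from its own bin.

# Bin triangle bounding boxes into screen tiles, then shade tile by tile.
TILE = 64   # tile size in pixels (assumed for illustration)

def bin_triangles(triangles):
    """triangles: list of (xmin, ymin, xmax, ymax) screen-space bounding boxes."""
    bins = {}
    for tri_id, (x0, y0, x1, y1) in enumerate(triangles):
        for ty in range(int(y0) // TILE, int(y1) // TILE + 1):
            for tx in range(int(x0) // TILE, int(x1) // TILE + 1):
                bins.setdefault((tx, ty), []).append(tri_id)
    return bins

tris = [(10, 10, 50, 40), (60, 20, 130, 90)]
for tile, tri_ids in bin_triangles(tris).items():
    print(f"tile {tile}: shade triangles {tri_ids}")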
Each core had fast access to its 256 kB local subset of the coherent second-level
cache. The L1 cache sizes were 32 kB for I cache and 32 kB for D cache. Ring
network accesses passed through the L2 cache for coherency. Intel manufactured the
Knights Ferry chip in its 45 nm high-performance process and the Knights Corner
chip in 22 nm.
In Pohl’s ray-tracing demo, he used four, four-core, 2.93 GHz Xeon 7300 Tigerton.
Assuming perfect scaling, they could produce 358 GTFLOPS per processor or 1.4
TGFLOPS total. The re-purposed coprocessor version of Larrabee, Knights Ferry
(Larrabee 1), had 32 cores at up to 1.2 GHz, each producing 38 GFLOPS, for a
total of 1.2 TFLOPS. However, its x86 cores were based on the much simpler P54C
Pentium. A public demonstration of the Larrabee architecture took place at the IDF
in San Francisco on September 22, 2009. Pohl ported his Quake Wars ray-traced
demo, and it ran in real time. The scene included a ray-traced water surface that
accurately reflected the surrounding objects, like a ship and several flying vehicles.
Intel used the Larrabee chip for its Knights series many integrated core (MIC)
coprocessors. Former Larrabee team member Tom Forsyth said, “They were the
exact same chip on very nearly the exact same board. As I recall, the only physical
difference was that one of them did not have a DVI connector soldered onto it.” [5].
Knights Ferry had a die size of 684 mm2 and a transistor count of 2300 million—
a large chip. It had 256 shading units, 32 texture-mapping units, and 4 ROPS, and
it supported DirectX 11.1. For GPU-compute applications, it was compatible with
OpenCL version 1.2.
The cores had a 512-bit vector processing unit, able to process 16 single-precision,
floating-point numbers simultaneously. Larrabee was different from the conventional
GPUs of the day. Larrabee used the x86 instruction set with specific extensions. It had
cache coherency across all its cores. It performed tasks like z-buffering, clipping, and
blending in software using a tile-based rendering approach (refer to the simplified
DirectX pipeline diagram in Book two). Knights Ferry, aka Larrabee 1, was mainly

an engineering sample, but a few went out as developer devices. Knights Ferry
D-step, aka Larrabee 1.5, looked like it could be a proper development vehicle. The
Intel team had lots of discussions about whether to sell it and, in the end, decided
not to. Finally, Knights Corner, aka Larrabee 2, was sold as XeonPhi.
The Intel developers (Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth,
Pradeep Dubey, Stephen Junkins, Adam Lake, Robert Cavin, Roger Espasa, Ed
Grochowski, and Toni Juan) believed Larrabee was more programmable than GPUs
of the time [6]. Additionally, Larrabee had fewer fixed function units, so they thought
Larrabee would be an appropriate platform for the convergence of GPU and CPU
applications.
Larrabee used general-purpose CPUs and could theoretically have run its own
operating system. However, the graphics performance wasn’t good enough compared
to competing products.
The Larrabee/MIC/XeonPhi split was purely marketing, branding, and pricing.
Same team, same chips.
In 2009, Intel announced it had canceled the Larrabee product. However, the first
run of devices was re-purposed as a software development vehicle for ISVs and the
high-performance computing (HPC) community.
Unlike conventional GPUs, the hardware in Larrabee was half of the solution.
The other was the software that translated the DirectX graphics pipeline elements
into x86-compatible code. Unfortunately, that mountain of software—especially the legacy code paths in the OpenGL realm—took too long to write, test, debug, and fix. The corner
cases of old OpenGL extensions were especially insidious. And because Intel is Intel,
it had to guarantee backward compatibility.
The result was that Larrabee missed its launch window. If the product were
released later (assuming they could get all the software cases to work—and
there was no certainty about that), Larrabee would compete against newer, more
powerful competition than initially planned. Moreover, designing and manufacturing
a next-generation chip would take three years. It was an untenable situation.
Moreover, to use a cliché, Intel was fixing the engine while the airplane was flying.
Naturally, for a CPU company, Intel did not have a very deep team of GPU experts
as Larrabee ramped up. Intel staffed up in 2007 and 2008, bringing in dozens, maybe
hundreds, of new people from a dozen or more companies and universities. Getting
them integrated into a team took more time than Intel anticipated. Then, continuing
to fix the engine in flight, Intel launched the chip with a 32-port coherent cache. The largest coherent cache Intel had built to date had spanned only eight CPUs.
In May 2019, Intel told customers it would no longer accept orders for the Phi
products after August 2019 and that the Xeon Phi 7295, 7285, and 7235 would be
end-of-life (EOL) July 31, 2020—from inception (2006) to termination (2020) is a long run in this industry.
Larrabee was significant because it tested the notion that a SIMD construction of any CISC or RISC processor with an FPU could be used as a GPU. The evolutionary legacy of CISC processors like Intel's x86 carried so many functions and features that, even though the core could be made small enough to be replicated in large quantities, it wasn't powerful enough to be that much better than several shaders in a conventional GPU. It also wasn't power efficient or inexpensive.
Although Intel disparaged the GPU every chance it got, it always wanted one and
tried a few times before Larrabee (e.g., the 82786, i860, and i740). In 2018, it kicked off another GPU project code named Xe, discussed in a later chapter.

2.1.4 Intel’s GM45 iGPU Chipset (2007–2008)

With the G965 and GM965 chipset evolution, Intel expanded the GPU in the G45 Graphics and Memory Controller Hub (GMCH) to incorporate a DirectX 10, Shader Model (SM) 4.0, unified shader architecture. Figure 2.13 shows the block diagram of the G45. The GPU had 10 EUs, five threads per EU, and it offered HD decode (high-quality
video), with, said the company, a focus on game compatibility. The G45 was one of
Intel’s last external chipsets.

2.1.5 Intel’s Westmere (2010) Its First iGPU

Intel was the first to introduce a CPU with a built-in GPU in January 2010—five
years after AMD and ATI announced their plan for a CPU with an integrated GPU and a year before AMD actually managed to accomplish it (Fig. 2.14).
Worried that AMD would beat it to market with an integrated CPU and GPU, Intel marshaled its forces and went into skunk-works mode to build such a device. At the same time, AMD hit one roadblock after another, whittling away its time-to-market advantage and costing it first-mover status for a concept it created.
The first iGPU
Westmere was Intel’s first CPU with an encapsulated shared memory integrated
GPU—the GPU and CPU were in the same package but not the same die (Fig. 2.15).
Westmere was Intel’s latest microarchitecture and was not the name of the processors
that used it. The Westmere (formerly Nehalem-C) architecture was a 32 nm die shrink
of the Nehalem architecture. The Westmere design could use the same CPU sockets
as Nehalem-based CPUs.
The 32 nm Clarkdale (80616) was the first processor to use the Westmere archi-
tecture and incorporate a GPU die in the same package. The GPU was Intel’s 45 nm
fifth-generation HD series, Ironlake graphics GPU. It ran at 500–900 MHz. It had
177 million transistors in a 114 mm2 die with 24 shaders—12 execution units (EUs)
and a four MB L3 cache—and it was DirectX 10.1, Shader Model 4.0, and OpenGL
2.1 compatible. A block diagram of the Ironlake iGPU is shown in Fig. 2.16. The
GPU could produce up to 43.2 GFLOPS at 900 MHz (24 GFLOPS at 500 MHz).
The iGPU could decode an H264 1080p video at up to 40 fps.
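As a rough check, the quoted GFLOPS figures follow directly from the shader count and clock if one assumes each shader retires a multiply-add (two floating-point operations) per clock; the sketch below is an illustration of that arithmetic, not Intel's specification.

# Peak-throughput arithmetic for the Ironlake iGPU figures quoted above.
SHADERS = 24
FLOPS_PER_SHADER_PER_CLOCK = 2  # assumes one multiply-add per shader per clock

def peak_gflops(clock_ghz):
    return SHADERS * FLOPS_PER_SHADER_PER_CLOCK * clock_ghz

print(peak_gflops(0.9))  # 43.2 GFLOPS at 900 MHz
print(peak_gflops(0.5))  # 24.0 GFLOPS at 500 MHz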

Fig. 2.13 Intel’s G45 chipset

The Ironlake GPU could provide 2560 × 1600 resolution capability via a Display-
Port connection. For DVI (due to its single-link connection), the resolution was
limited to 2048 × 1536 and, for HDMI, 1920 × 1200. Intel believed DisplayPort,
which it invented, was the future—they were right.
Intel branded the Clarkdale processors as Celeron, Pentium, or Core with HD Graphics.
Intel initially sold Clarkdale as desktop Intel Core i5, Core i3, and Pentium parts. It was closely related to the mobile Arrandale processor. The most significant difference between Clarkdale and Arrandale was that the latter had integrated graphics.

Fig. 2.14 Block diagram of an iGPU within a CPU

Fig. 2.15 Intel’s Westmere dual-chip package. Courtesy of Intel

A year later, in February 2011, Intel introduced the second-generation Core brand
with the Sandy Bridge architecture and the GPU integrated into the CPU.
Intel built the fully integrated 131 mm2 die, four-core, 2.27 GHz processor Sandy
Bridge, with the iGPU in its 32 nm fab. Depending upon the model, the GPU’s clock
was 650–1100 MHz (turbo to 1350 MHz).

Fig. 2.16 Intel’s Ironlake-integrated HD GPU

The Sandy Bridge GPU, branded HD Graphics 3000 (where HD stood for
high-definition), did not have dedicated memory and shared the Level 3 cache and
part of the main memory with the CPU.
The GPU had 12 EUs with 96 shaders and 12 texture-mapping units (TMUs), but
Sandy Bridge’s GPU was faster than Clarkdale due to architectural changes and a
process shrink. The GPU was DirectX 10.1, OpenGL 3.0, and DirectCompute 4.1
compatible. The GPU also incorporated dedicated units for decoding and encoding
HD videos.
The GPU was impressive for the number of shaders and other processors it packed within the power envelope of an integrated system with an x86 CPU. It was significant in that it shifted T&L work from the CPU to the GPU's shaders, making it truly DirectX 8 to 11 compatible.

2.2 The Fourth Era of GPUs (October 2009)

The fourth era of GPUs was launched with Microsoft’s D3D11/DirectX 11 and the
introduction of compute shaders. DirectX 11 brought programmable geometry shaders and tessellation with its Hull and Domain Shaders. It was a significant improvement and fostered a whole new era of games with never-before-seen quality approaching ray tracing.

Fig. 2.17 AMD graphics logos, circa 1985, 2006, 2010. Courtesy of AMD

2.2.1 The End of the ATI Brand (2010)

AMD made seven variations of the first AMD-branded GPUs, code named Northern
Islands and Vancouver.
• 292 million transistors, 40 nm (Cedar)
• 370 million transistors, 40 nm (Caicos)
• 716 million transistors, 40 nm (Turks)
• 1040 million transistors, 40 nm (Juniper)
• 1700 million transistors, 40 nm (Barts)
• 2640 million transistors, 40 nm (Cayman).
The Northern Islands GPUs, introduced in October 2010, were the first new GPUs developed by ATI under the management of AMD. They formed part of AMD's 40 nm Radeon brand. AMD based some versions on the second generation of the TeraScale
architecture (VLIW5) and some on the third-generation TeraScale 3 (VLIW4).
With the Northern Islands GPUs introduction, AMD discontinued the ATI brand.
AMD wanted to tighten the correlation between the graphics products and the AMD
CPU and chipset branding. For the most part, the AMD badging was just a replace-
ment for the ATI badge, and people continued to refer to the graphics products as
ATI for almost a decade afterward because the name was so admired—and some say
loved.
The former ATI logo received a renovation and took on some of the 2010 AMD
Vision logo design elements (Fig. 2.17).
AMD also retired the Mobility Radeon name for laptop GPUs and used an M
suffix at the end of the GPU model number.

2.2.2 AMD’s Turks GPU (2011)

Jerry Sanders founded AMD in 1969 to compete with Intel in the x86 CPU business,
which it did very successfully. Sanders stepped down in 2002, and Hector Ruiz took
over and ran AMD until 2008. During his tenure, AMD acquired ATI and entered
the GPU market.

Ruiz and Dave Orton, ATI’s former CEO, planned to design and build an integrated
device with an x86 CPU and a powerful GPU they called Fusion. However, AMD
was top-heavy, and Orton didn’t want to move to Texas, so he resigned in 2007,
which slowed the energy of the Fusion product. In late 2006, Nvidia reached parity
with AMD in market share and then continued to gain.
AMD’s CPUs were not as good as they should have been, and the company was
rapidly losing market share to Intel and Nvidia and started losing money. When Ruiz
resigned in 2008, AMD’s BOD elected VP Dirk Myer, a protegee of Sanders to the
CEO position. But AMD continued to lose money, and Myer looked for things to cut
or sell off.
In 2008, Meyer sold the ATI TV business to Broadcom, which helped that company's position in the set-top box market. Meyer also sold off AMD's fab in 2008, and it became Global
Foundries. That stopped some of the losses and brought in some cash. But the debt
the company was carrying was crushing it, and the funding for GPU investment
almost disappeared. Meyer sold the ATI mobile group to Qualcomm in 2009, helping that company establish its long-running and highly successful Snapdragon product
line and become the largest independent iGPU supplier in the world.
Meyer was a CPU guy and didn't appreciate the GPU. Rumor had it that although he helped develop the logic for the acquisition, he actually opposed it. Meyer focused the company on notebook PCs and the data center server market. As a result, funding for GPU R&D was secondary to CPUs. GPUs are complicated devices to design and build and take three to five years to develop. In 2009, after Meyer took over, the
ATI group lost several vital engineers, and GPU development stalled.
The company relied primarily on the CPU for revenue, and the under-performing Bulldozer CPU (designed during Meyer's days as a VP) was dragging it down. AMD suffered years of losses and accumulated debt, even though the GPU business brought in money.
ATI had new designs in the pipeline when AMD bought them, and in the summer
of 2008, AMD launched the Radeon HD 4870. It was an excellent product, and
AMD’s AIB OEMs were able to offer an AIB for $300 that had comparable or better
performance to Nvidia boards selling for $400–$450. It was an incredible value at the
time, and an instant hit, and the GPU revenue significantly helped AMD’s sales and
sustain the company.
Nonetheless, the company was losing money, and rumors circulated that AMD
would be sold to Intel, Nvidia, or VCs, or file for bankruptcy. After what were possibly three of the worst years in AMD's history, and not totally his fault, Meyer resigned in 2011.
AMD hired Rory Read to be the new CEO. Read had been President and COO at Lenovo, and at the time, Lenovo had just marked its 7th straight quarter as the fastest-growing PC maker in the world and had become the world's third-largest global PC manufacturer. The BOD thought if anyone could get AMD back on track and into the OEMs, it was Read.
Read was the first CEO of AMD who wasn't an engineer. Read's legacy at AMD is that he hired the best. Read brought in Mark Papermaster, CTO at Cisco, as AMD's CTO. And he hired Lisa Su, a distinguished engineer and manager who had been at IBM and Freescale, where she was Senior Vice President and General Manager. She assumed the same positions at AMD under Read.

AMD continued to sink, reporting in 2015 its lowest revenue in over ten years. But despite that, Read, Su, Papermaster, and the brilliant CFO Devinder Kumar reduced AMD's debt and, under Su's direction, started hiring top-notch engineers, some of whom had left AMD earlier, including Raja Koduri, who had been at Apple after he left AMD.
Under Su’s leadership, AMD began investing in and developing the next gener-
ation of CPUs and GPUs. In 2014, Reed was asked to step down, and the BOD
appointed Lisa Su to be the company’s CEO, a position she held for over nine years
(as of this writing) and the second longest-running CEO at AMD since Jerry Sanders.
But the damage of underinvestment had taken its toll. In 2007, AMD launched
the HD 2000 series, the company’s last high-end AIB.
In 2010, AMD had to give up its position in the high-end of the GPU market. With
the introduction of DirectX 11, AMD came to the market with cost-effective mid-
range GPUs that had excellent performance. The company was still in the game,
so to speak, just not in the high-end. The kickoff products for the fourth era and
DirectX 11 were the mid-range TeraScale GPUs, code-named Barts (October 2010),
and Turks (February 2011), discussed in the following sections.
The entry-level Radeon HD 6500/6600, code-named Turks, was released on
February 7, 2011. The Turks family included Turks PRO and Turks XT, and AMD
marketed them as HD 6570 and HD 6670. Originally released only to OEMs, they
proved so popular and cost-effective that AMD released them to retail. Figure 2.18
is a block diagram of the AMD Turks GPU.
The Turks GPU had 480 shaders, 24 TMUs, and 8 ROPs and could drive four independent monitors. The 118 mm2 chip had 716 million transistors and drew about 75 W. It complied with DirectX 11.2, Shader Model 5, and OpenGL 4.4.

2.2.2.1 Summary

AMD’s Radeon HD 6570 and 6670 were minor upgrades of their Evergreen prede-
cessor, the HD 5570 and 5670. The Turks GPUs contained 80 more stream processors
and four more texture units. AMD also upgraded the design to include the new tech-
nologies in the Northern Islands GPUs such as HDMI 1.4a, UVD3, and stereoscopic
3D.
Its direct competitor was Nvidia’s GeForce 500 Series, which Nvidia launched
approximately a month later.
What made the Turks GPU significant was that it demonstrated the scalability of
the TeraScale architecture. Typically, entry-level GPUs are scraps: higher-end parts that did not meet specs, a practice known as binning. The Turks was a unique design and stood on its own merits.

Fig. 2.18 AMD’s Turks entry-level GPU (2011)

2.2.3 Nvidia’s Fermi (2010)

Nvidia introduced its Fermi microarchitecture in late March 2010. It was the
next-generation GPU and branded as the GF100 family, superseding the popular
Tesla architecture-based products such as the G80 GPUs.
Manufactured at TSMC in a 40 nm process, the three billion transistor Fermi GPU
contained 512 processors—now called stream processors—organized into 16 groups
of 32. The Fermi processors were Nvidia’s first GPU compatible with OpenGL 4.0
and Direct3D 11. Nvidia released over 70 products with the Fermi architecture.
According to Nvidia, the GF100’s unified shader architecture incorporated tessel-
lation shaders into the same vertex, geometry, and pixel shader architecture. Thus,

the benefits of the Fermi, asserted the company, were improved computing, physics
processing, and computational graphics capabilities, and the addition of tessellation
[7].
The die size of the Fermi was 529 mm2 . Even with a reduced process (40 nm
vs. 90), Fermi was larger than the 484 mm2 Tesla chip; those extra shaders took up
space.
The company had introduced its CUDA parallel processing software based on
C++, and the Fermi GPU was the first to exploit it. CUDA would prove an essential
tool for the industry and Nvidia especially. It became a de facto standard, finding
its way into hundreds of programs and leading Nvidia into the realm of artificial
intelligence for which it would become famous.
Large GPUs were not sold with all elements enabled, especially with the initial
production run. As GPUs increased in complexity, additional backup circuits were built in and used in case of failures during the manufacturing cycle. In the case of
Nvidia’s Fermi, the company did ship the GTX 580 in 2010 with all units enabled
(the 580 was the same layout as the 480 in most respects but with the texture and
z-cull units updated to add performance optimizations).

2.2.3.1 Summary

The Fermi architecture was Nvidia’s move to offer a GPU designed for GPU
computing, a vector accelerator for HPC, and servers. The GTX 480 initially shipped
with 15 streaming multiprocessors and six memory controllers. (One streaming
multiprocessor was disabled.) The remainder of the stack of products was filled out
by the GTX 470, with 14 streaming multiprocessors and five memory controllers.
The GTX 465 had 11 streaming multiprocessors and four memory controllers.
A year later, a new top-end model, the GTX 580, was introduced with the full 16
streaming multiprocessors and six memory controllers.
Nvidia retired the Fermi GPU line in April 2018. That is a very long life for a
semiconductor and a testament to its versatility.

2.2.4 AMD Fusion GPU with CPU (January 2011)

When AMD bought ATI in 2006, Hector Ruiz was CEO of AMD, and Dave Orton
was CEO of ATI. Dirk Meyer, President of AMD, and Orton became the architects
of the acquisition, primarily based on the idea of building a CPU with integrated graphics.
When Dave Orton was at SGI in 1998, he championed the development of Cobalt,
the first integrated GPU chipset (discussed in Book two). Then, when he was running
ArtX, he pushed the idea into game consoles and carried on that effort after ATI
acquired ArtX in 2000. Orton had a vision, understood Moore’s law, and saw the
inevitable assimilation of the GPU by the CPU.

The board promoted Orton to CEO of ATI, and, in 2004, he explored how heterogeneous processors like a GPU and a CPU could work harmoniously and deliver a synergistic result—1 + 1 = 3.
about a merger or acquisition, Orton was more interested in their vision of heteroge-
neous integration of processors than their grand scheme of taking over the computer
market. Fortunately, AMD had the right people, who agreed with Orton and shared
some of the work they had been doing on the concept. This, Orton thought, could be
a fantastic and highly successful merger.
AMD acquired ATI in 2006, and the keystone of the deal was to build a single chip
with a powerful CPU and a vector accelerator processor—a GPU. The code name for
the project was Fusion, and it represented not just the merging of GPUs and CPUs but
the merging of two companies with divergent cultures from two different countries,
using two different types of processor manufacturing and technology—what could
go wrong. Orton would need every hard-won skill he had in his long and storied
career to pull that off.
The Fusion project sputtered, and fiefdoms were threatened. Its old board
members, who believed in central control, insisted California-based Orton move
to Austin. Orton had just spent the last six years commuting from Silicon Valley
to Markham, Canada, and now, with a son in high school, Orton did not want to
repeat the process, so he resigned. It was among AMD’s dumbest moves. They let
Orton walk away with his decades of experience in precisely what AMD would now
have to learn the hard way by trial and error. They did not even have the humility or
wisdom to engage him as a consultant. To be fair, if Orton had stayed, he would have represented an additional management layer, and he knew that would not have been in the company's best interest.
Orton left AMD in 2007, and Ruiz left in 2008.
In July 2007, Rick Bergman, formerly ATI/AMD's Senior Vice President and GM for the Graphics Division, announced he was moving up to run the entire Graphics
business, reporting directly to the AMD CEO, eventually running all product groups
at AMD.
AMD would go through two new CEOs after Ruiz in the three years before it could introduce an integrated GPU–CPU.
Four torturous years after Orton left, Bergman announced the Llano Fusion processor, renamed the APU (accelerated processing unit) for copyright reasons and on outside marketing advice. Seven years after Orton's 2004 vision, AMD's implementation finally came to production in June 2011. Then, in September 2011, Bergman departed to take the CEO role at Synaptics, a small, public semiconductor company.
AMD had been building its CPUs using silicon on insulator (SOI); the designs
of the ATI GPUs were in bulk-CMOS and had to be converted to SOI-compatible
layouts—not trivial. Then, the model had to be simulated, and, from that, the software
developers could start building drivers.
It might be useful to compare Intel’s journey to an integrated CPU to AMD’s.
Intel approached the problem with baby steps using two dies and then integrating
the GPU. AMD went for the whole thing at once. And AMD did it while it was
hemorrhaging money and switching CEOs. AMD lost money four out of seven years
from 2006 to 2011, with two of the most significant losses ever in 2007 and 2008
[8].

To try to mitigate some of its losses, in October 2008, AMD announced it would go fabless and spin off its semiconductor manufacturing business into a new company temporarily called The Foundry Company. It later became GlobalFoundries (GF), often nicknamed GloFo. GF continued to make AMD's chips but had difficulty manufacturing
the 32 nm process node Llano needed. GF’s difficulties, unfortunately, truncated the
success of Llano. Avoiding that problem, the newer Brazos Fusion/APU processor
got built on the TSMC 40 nm process. The APU launched in 2010. It was a more
modest architecture but enjoyed high volume success as a mainstream solution.
The company was cash-flow strapped, laid off over 10% of its employees, and
morale was at an all-time low. There was also a small cultural war going on, as is
typical after an acquisition. Each camp was frustrated by the intransigence of the
other. Cell libraries, the basic design of the transistors, did not match up. IC layout
policies did not match, and even the internal nomenclature did not agree. It took top
management decrees to force one group to accept the other’s policies and procedures.
Meanwhile, homogeneous and monolithic Intel charged ahead, running scared that AMD would beat it once again. AMD did not win this race.
Intel beat AMD to it and introduced Clarkdale in 2010. Intel already had a
graphics group and did not have to overcome cultural barriers, library mismatches,
or process differences. They just had to make it work and convince management and
the fab that using 60% of the die for a GPU that they could not charge extra for was
a good idea.
AMD began revealing details about the Llano in February 2011 at the ISSCC in
San Francisco. The company spoke about power management enhancements made to
the x86 cores in Llano to increase performance per watt and help make the CPU/GPU
combination even more compelling. Core power gating, a processor feature that
disconnects power to the core when not in use, was employed. AMD claimed its
silicon on insulator (SOI) process allowed it to use more efficient nFET transistors for
power gating instead of the pFET transistors used with a bulk silicon manufacturing
process [9]. Both AMD and Intel’s design used shared memory (between the CPU
and GPU) and so cache design was critical to performance. AMD’s was twice the
size of Intel’s.
The photograph of the die in Fig. 2.19 shows the core's more than 35 million transistors that fit within 9.7 mm2 (not counting the 1 MB of L2 cache shown on the right).
AMD said its first APU would have the following:
• Four CPU cores, DDR3 memory, and a DirectX 11 capable SIMD engine
integrated on die.
• Llano was the first design from AMD using the 32 nm SOI process technology.
• The APU used AMD’s Sabine platform for mainstream notebooks.
• The APU ran above 3 GHz. The graphics could also handle Blu-Ray playback.
Getting the right balance is always tricky. Intel thought the answer lay in x86 as the panacea to all problems. Nvidia said a GPU could solve all one's needs, whereas AMD sought balance with its Fusion design (Fig. 2.20).
There was little specific GPU functionality in Intel's Larrabee or Clarkdale processors. There was no x86 functionality in Nvidia's Fermi. And there was quite a bit of GPU and x86 functionality in AMD's Llano Fusion product.

Fig. 2.19 Portion of the Llano chip. Courtesy of AMD

Fig. 2.20 Comparison of GPU balance philosophy of semiconductor suppliers

AMD said Llano offered observable and controllable processing with context
switching, single-lane programming, and support for x86 virtual memory. It also
provided support for C++ constructs, virtual functions, and dynamic-link libraries (DLLs).
There was a multitude of elements in Llano:
• Multiple instructions, multiple data (MIMD): Four threads per cycle per vector,
from different apps, per compute unit.
• SIMD: 64 FMAD vectors for four waves per cycle (floating-point fused Multiply–
ADd vectors).

• Simultaneous multi-threading (SMT): 40 waves per CU active (the vector units in the CU did most of the processing. Each unit contained four cores and allowed for the processing of four wavefronts at any one time).
• Vector unit and scalar unit coprocessor (a significant change from the VLIW5/4
architectures).
The nomenclature VLIW4/5 (very long instruction word) refers to the organization of AMD's SIMD processor. Very long instruction word processors bundle several operations into one wide instruction and issue them all simultaneously; the number designates how many slots the bundle has. In AMD's case, VLIW5 meant the hardware could process a four-component operation (e.g., w, x, y, z) plus a fifth scalar operation (e.g., a lighting calculation) at once, while VLIW4 dropped the fifth slot in favor of four symmetric lanes. Ryan Smith of Anandtech wrote an excellent description of the use of VLIW by AMD and Nvidia [10].
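As a rough illustration of the work one such bundle covered, the sketch below expresses a four-component multiply plus a separate scalar operation in plain Python; it is not AMD ISA, and the specific operations chosen are hypothetical.

# Illustrative only: the kind of work one TeraScale VLIW bundle issued at once.
def vliw_bundle(v, w, light):
    # Four vector lanes: one multiply each (the components of a dot product)
    partial = [v[i] * w[i] for i in range(4)]
    # Fifth (scalar/transcendental) lane in VLIW5: e.g., a lighting term
    scalar = light ** 0.5
    return sum(partial), scalar

print(vliw_bundle([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5], 0.25))  # (5.0, 0.5)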
Llano ran multiple, asynchronous, and independent command streams. It was indeed a heterogeneous processor, one of the first of its kind.
Llano offered four Star x86 cores and up to 400 GPU cores that delivered over 400
GFLOPS of single-precision compute. AMD said the next-generation APU, Trinity,
would double that, with critical new features that would make using the APU easier
and more efficient (Fig. 2.21).
A Trinity-based system was demonstrated in a live, working notebook running
Windows and playing video. AMD did not plan to ship Trinity until 2012, but
Bergman showed it working in 2011; that was quite an accomplishment given all the
trials and tribulations the company and the Fusion team had gone through.
At the time, Eric Demers, AMD’s CTO of graphics (he too would be one to
leave, going to Qualcomm), gave a historical sketch of the evolution of graphics
engines to GPUs and ultimately HPUs. He showed a summary block diagram of
Llano reconstructed below.
At Computex 2011 in June in Taipei, AMD introduced the Fusion Llano, shown
in Fig. 2.22.

Fig. 2.21 AMD’s APU road map. Courtesy of AMD



Fig. 2.22 AMD’s integrated Llano CPU–GPU

The Llano used AMD’s latest processor, code named Steamroller, a four-core
K10 x86 CPU, and a Radeon HD 6000-series GPU on a 228 mm2 die. AMD had it
fabricated at Global Foundries in 32 nm.
The GPU in the Llano was AMD’s Redwood GPU core (Radeon HD 5570,
discussed later in this chapter) with some enhancements, code named Sumo. The DirectX 11 GPU had five SIMD arrays of 80 cores each (400 shader processors).
The APU presented a single face to the OS. The GPU in an APU was an actual
application accelerator, but the application had to be written correctly to take advan-
tage of it. That was different from Intel’s approach: it had an iGPU, but its only
interaction with the CPU was through main memory (and possibly the L2). An
application could only access the Intel GPU via the graphics API. AMD stated that
the OS would service both the memory-management unit (MMU) and the IO-MMU under a unified address space; the CPU and GPU would use the same 64-bit pointers.
The MMUs were available to the CPU and GPU and could pass a memory pointer
between them. The OS, of course, had to provide for it, and manual synchronization
was required. There was no coherent view of the memory—the GPU did not snoop,
meaning the GPU did not check the cache’s activity to see what the CPU was up to.
But the x86 also did not snoop GPU cache writes unless it was explicitly marked;
the processors were not working in harmony because they were unaware of each

other. Snooping is how one processor checks the cache’s condition regarding another
processor’s cache activity.
Back to the Future
The team that developed the chip for the Nintendo 64 in 1996 set the stage for
the development of the APU. Seventeen years later, in 2013, the APU would become the processor for the PlayStation 4 and Xbox One game consoles. SGI to ArtX to ATI
to AMD was a long, sometimes bumpy ride, but one that can be looked at with great
pride and satisfaction by the developers and their customers.
Today’s SOCs have striking similarities to previous devices like AMD’s APU.

2.2.4.1 Summary

Llano did not do that well for AMD initially; its x86 cores were not as competitive as
they would later become, but it certainly set the stage for many other processors and
led to powerful game consoles (2013) and a sweep of the console market for AMD.
The second-generation APUs from AMD, announced in June 2012, were Trinity for high-performance and Brazos-2 for low-power devices. The third-generation Kaveri, for high-performance devices, was launched in January 2014.
The Fusion design and subsequent APUs were significant because the disclosure of the design in 2006 changed the way the industry thought about integrated GPUs—they didn't have to be power-starved and low performance. And the design fueled the console market for the next decade and beyond.

2.2.5 Nvidia Kepler (May 2013)

The Kepler architecture enjoyed a double life, first as an upgrade or mid-life kicker
of the GTX 600 series, and then as the GTX 700 series. Kepler was designed to
establish Nvidia’s entry into the GPU-compute market segment as well as graphics.
Built on 28 nm (from TSMC) and Nvidia’s new Kepler architecture, it would be
deployed in the GeForce GTX 780 add-in board. The board carried three to six GB of GDDR5 memory on a 384-bit bus running at 1.5 GHz, giving it plenty of bandwidth (Fig. 2.23).
Kepler was a major upgrade and a new design. The transistor count more than doubled, from the GTX 680's GK104 at 3.5 billion to seven billion in the GK110-based Kepler GTX 780. The GPU clock was slightly lower, from 1 GHz down to 900 MHz, to keep the temperature down, but the number of shaders increased from 1536 to 2304, moving the performance from 3.2 TFLOPS to 4.16 TFLOPS. The chip was 561 mm2, and the board drew 250 W. With GPU Boost 2.0, the GPU could boost to the highest clock speed it could sustain while operating at 80 °C; Boost 2.0 dynamically adjusted the GPU fan speed up or down as needed to try to maintain that temperature.
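The TFLOPS figures above can be sanity-checked with the usual peak-throughput formula, assuming one fused multiply-add (two operations) per shader per clock; the small differences from the quoted numbers come from boost clocks. A minimal sketch:

# Peak FP32 throughput = shaders x 2 FLOPs x clock (GHz), in TFLOPS.
def peak_tflops(shaders, clock_ghz):
    return shaders * 2 * clock_ghz / 1000.0

print(peak_tflops(1536, 1.0))  # GTX 680 (GK104): ~3.1 TFLOPS
print(peak_tflops(2304, 0.9))  # GTX 780 (GK110): ~4.15 TFLOPS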

Fig. 2.23 Nvidia’s GeForce GTX 780. Courtesy of Wikipedia GBPublic_PR

Fig. 2.24 Nvidia demo of a crumbling building. Courtesy of Nvidia

A significant aspect of the new GPU was its support for Nvidia's PhysX physics software, which allowed a game's 3D models to be destructible in a realistic, non-baked way, meaning elements would break and fall differently each time (Fig. 2.24).

Table 2.1 Compute capability of Fermi and Kepler GPUs


FERMI FERMI KEPLER KEPLER KEPLER
GF100 GF104 GK104 GK110 GK210
Compute capability 2.0 2.1 3.0 3.5 3.7
Threads/warp 32
Max threads/thread block 1024
Max warps/multiprocessor 48 64
Max threads/multiprocessor 1536 2048
Max thread blocks/multiprocessor 8 16
32-bit registers/multiprocessor 32768 65536 131072
Max registers/thread block 32768 65536 65536
Max registers/thread 63 255
Max shared memory/multiprocessor 48K 112K
Max shared memory/thread block 48K
Max X grid dimension 2^16−1 2^32−1
Hyper-Q No Yes
Dynamic parallelism No Yes

Surface tension and viscous forces were modeled, as well as the density and weight of various elements in a model or scene. It also introduced real-time fluid dynamics and particles for realistic smoke, waves, and water ripples.
The GPU was designed for DirectX 11 (the fourth era of GPUs), but when DirectX 12 was introduced (in early 2015), the GTX 780 could run a lot of its features.
The Kepler GK110 and GK210 were also designed to be a parallel processing
powerhouse for Tesla and the HPC market (Table 2.1).
GK110 and GK210 provided fast double-precision computing performance to
accelerate professional HPC compute workloads; that is a key difference from the
Nvidia Maxwell GPU architecture, which was designed primarily for fast graphics
performance and single-precision consumer compute tasks. While the Maxwell architecture performed double-precision calculations at a rate of 1/32 that of single-precision calculations, the GK110 and GK210 Kepler-based GPUs were capable of performing double-precision calculations at a rate of up to 1/3 of single-precision compute performance.
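Expressed as a simple helper (a sketch of the peak ratios above, with a hypothetical 5 TFLOPS FP32 part used only as an example), the difference looks like this:

# Architectural peak FP64:FP32 ratios described above.
DP_RATIO = {"Kepler GK110/GK210": 1 / 3, "Maxwell (consumer)": 1 / 32}

def peak_dp_tflops(fp32_tflops, arch):
    """Estimate peak double-precision throughput from peak single-precision."""
    return fp32_tflops * DP_RATIO[arch]

# A hypothetical 5 TFLOPS (FP32) GPU under each architecture:
print(round(peak_dp_tflops(5.0, "Kepler GK110/GK210"), 2))  # ~1.67 TFLOPS FP64
print(round(peak_dp_tflops(5.0, "Maxwell (consumer)"), 2))  # ~0.16 TFLOPS FP64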
Each of the Kepler GK110/210 SMX units featured 192 single-precision CUDA cores, and each core had fully pipelined floating-point and integer arithmetic logic units. Kepler retained the IEEE 754-2008 compliant single- and double-precision
arithmetic introduced in Fermi, including the fused multiply–add (FMA) operation.
One of the design goals for the Kepler GK110/210 SMX was to significantly increase
the GPU’s delivered double-precision performance since double-precision arithmetic
is at the heart of many HPC applications. Kepler GK110/210’s SMX also retained
the special function units (SFUs) for fast approximate transcendental operations
as in previous-generation GPUs, providing 8x the number of SFUs of the Fermi

GF110 SM. The SMX scheduled threads in groups of 32 parallel threads called warps. Each SMX featured four warp schedulers and eight instruction dispatch
units, allowing four warps to be issued and executed concurrently. Kepler’s quad-
warp scheduler selected four warps, and two independent instructions per warp could
be dispatched for each cycle. Unlike Fermi, which did not permit double-precision
instructions to be paired with other instructions, Kepler GK110/210 allowed double-
precision instructions to be paired with other instructions.
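Working through those issue numbers (a sketch of the arithmetic, not a simulator of Nvidia's scheduler):

# SMX issue arithmetic from the figures above.
THREADS_PER_WARP = 32
SCHEDULERS_PER_SMX = 4
INSTRUCTIONS_PER_SCHEDULER = 2   # two independent instructions per selected warp

warp_instructions_per_cycle = SCHEDULERS_PER_SMX * INSTRUCTIONS_PER_SCHEDULER
thread_operations_per_cycle = warp_instructions_per_cycle * THREADS_PER_WARP

print(warp_instructions_per_cycle)   # 8 warp-instructions per cycle per SMX
print(thread_operations_per_cycle)   # 256 thread-level operations per cycle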
The Kepler architecture was one of Nvidia’s most successful designs, and several
products were developed from it.

2.2.6 Intel’s iGPUs (2012–2021), the Lead Up to dGPU

Intel continued to improve the iGPU in its CPU. In 2012, the company introduced
the 22 nm Ivy Bridge CPU with the HD 4000 iGPU. It had 16 EUs, 128 shaders, 16
TMUs, and two ROPs. The 1.2-billion-transistor GPU ran at 600–1000 MHz, could generate 256 (fp32) GFLOPS, and was compatible with DirectX 12, but only with basic functions, not ray tracing, and it still had just a 512 KB L2 cache.
The company expanded the HD to 10 EUs (80 shaders—184 GFLOPS) when it
introduced the Haswell CPU in 2013. Subsequent Haswell processors had the HD
4200 (20 EUs, 160 shaders—640–768 GFLOPS) and the HD 5000 (40 EUs, 320
shaders—832 GFLOPS).
The Intel Broadwell CPU brought out in 2014 had the Haswell DirectX 12 GT1
iGPU, with 12 EUs (96 shaders—163 GFLOPS), and the subsequent three versions
had HD Graphics 5300, 5500, 5600, and P5700 that used the GT2 chip with 24 EUs
(192 shaders—384 GFLOPS), and the first Iris Pro graphics P6300 iGPU with 48
EUs (384 shaders—883 GFLOPS).
In 2015, Intel brought out the popular i7 6700k (code named Skylake) using a
Broadwell iGPU with 24–72 EUs (576 shaders—1152 GFLOPS).
In 2017, the seventh-gen i7-7700K CPU (code named Kaby Lake) came out with the new HD 630 iGPU with 24 EUs. Kaby Lake was also available with Iris Plus Graphics 650 with 48 EUs (384 shaders—883 GFLOPS).
Then in 2018, Intel released the ninth-gen Core i9-9900K (code named Coffee Lake), with as many as eight CPU cores. The UHD 630 iGPU (GT2) had 24 unified EUs. Intel brought out a particular version, the i9-9900KF, that did not have an iGPU.
Intel introduced its Gen 11 integrated graphics processor Core H series Tiger
Lake processor in May 2021 and claimed the iGPU had enhanced execution units
(Fig. 2.25).
The Gen 11 CPU had 64 EUs (512 shaders—1126 GFLOPS), more than double
Intel Gen 9 iGPU’s 24 EUs. Intel said the Gen 11 GPU would break the 1 TFLOPS
barrier. The iGPU was released in early 2019 and built with an Intel 10 nm process
using a new SuperFin process [11] shown in Fig. 2.26. When announced in the fall
of 2020, Intel said it was the largest single intranode enhancement in Intel’s history.
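The break-the-1-TFLOPS claim falls out of the EU count if each EU is counted as eight shader ALUs doing a multiply-add per clock; the roughly 1.1 GHz peak clock used below is an assumption inferred from the quoted 1126 GFLOPS, not a figure given in the text.

# Gen 11 iGPU peak-throughput arithmetic.
EUS = 64
SHADERS_PER_EU = 8
CLOCK_GHZ = 1.1   # assumed peak clock, back-solved from the quoted GFLOPS

gflops = EUS * SHADERS_PER_EU * 2 * CLOCK_GHZ
print(round(gflops))  # ~1126 GFLOPS, i.e., past the 1 TFLOPS mark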

Fig. 2.25 Intel’s Gen 11 Tiger Lake CPU with iGPU

Fig. 2.26 Intel’s SuperFin transistor. Courtesy of Intel

Tiger Lake used Intel’s Willow Cove cores, the successor to Sunny Cove.
Tiger Lake was the first processor family to use Intel’s 10 nm SuperFin transis-
tors. According to Intel, the SuperFin, an intranode enhancement, would improve
performance compared to a full-node transition [12].
What made Intel’s move to integrate the GPU with the CPU was the commitment
of silicon. Over 50% of the Gen 11 integrated processor got devoted to the iGPU and
associated image processing and display as shown in Fig. 2.27.
In 2013, when Intel was offering the Haswell processor with 10 execution units,
Senior Vice President Mooly Eden told a group of analysts he hated GPUs.
It was, as you can imagine, a shocking statement. It was not that Eden hated GPU
technology; he hated the cost of the GPUs. “Sixty percent of the die,” he said, “goes

Fig. 2.27 Die shot of Intel’s 11th Gen Core processor showing the amount of die used by the GPU.
Courtesy of Intel

to the GPU—60 percent!” “And you know what we get paid for that?” he asked.
“Nothing, zero dollars, not a dime.” [13].
And so it had been since Q1’10 when Intel first put the GPU in the CPU. From
50 to 60% of an Intel processor’s die area was for free. It was not a good business
proposition. Intel painted themselves into a corner just to beat AMD. But, as Intel
found out too late, AMD had a different business model for their integrated GPU
processor, the APU—they charged for the GPU contribution.
Intel’s integrated GPUs were always unfairly compared to discrete GPUs, and
the performance difference was embarrassing at times—but it was (and is) an unrea-
sonable evaluation (see Flops Versus Fraps: Cars and GPUs charts in Book two).
The actual performance was quite good, and given that it was for free, it was disin-
genuous of benchmarkers to make such comparisons without including the price.
One tester, Jon Peddie Research, used price, power consumption, and performance (Pmark [14]) in its Mt. Tiburon Testing Labs reports to evaluate AIBs. Refer to Book
two, Why Good Enough Is Not, for an explanation as to why iGPUs cannot match a
dGPU’s performance.
Furthermore, Intel was unhappy with the big hunk of the high-performance and server market that Nvidia and AMD had taken by providing GPUs as compute and AI accelerators.
Therefore, in early 2017, Intel decided it would end its attempted processor hege-
mony and not only acknowledge the value of GPUs but launch a project to create an
entire top-to-bottom dGPU product line. In late 2017, Intel shocked the industry and
hired AMD’s Senior Vice President and Chief Architect of the Radeon Technologies
Group, Raja Koduri for the job (Fig. 2.28).
Koduri became Intel’s Chief Architect and Senior Vice President of the newly
formed Core and Visual Computing Group. He also was appointed General Manager
of a new initiative to drive edge computing solutions [15]. Under Koduri’s leadership,
Intel would launch the Xe GPU product line, discussed later in this chapter.

Fig. 2.28 Raja Koduri, Intel's Chief Architect and Senior Vice President. Courtesy of Intel

2.2.7 Nvidia Maxwell (2014)

The follow-on GPU to Nvidia’s 2013 Kepler architecture was the Maxwell design
introduced in 2014. Built on a 28 nm TSMC process like Kepler, it was a completely
new architecture and designed for computer graphics rather than GPU compute as
Kepler was. Maxwell was introduced in the GeForce 700 series, GeForce 800M
series, GeForce 900 series.
Second-generation Maxwell GPUs introduced several new technologies:
Dynamic Super Resolution, Third-Generation Delta Color Compression, Multi-Pixel Programmable Sampling, Nvidia VXGI (Real-Time Voxel Global Illumination),
VR Direct, Multi-Projection Acceleration, Multi-Frame Sampled Anti-Aliasing
(MFAA), and Direct3D12 Feature Level 12_1. HDMI 2.0 support was also added.
The GPU could run voxel illumination, and the company produced an amazing
demo showing the Apollo lander on the moon (Fig. 2.29).
The GM204 was a large chip with a die area of 398 mm2 and 5.2 billion transistors.
It had 2048 shading units, 128 texture-mapping units, and 64 ROPs. The GPU was
used in the Nvidia GeForce GTX 980, which came with 4 GB of GDDR5 memory on a 256-bit memory interface running at 1.753 GHz (7 Gbps effective). The GPU ran at 1.127 GHz and could be boosted up to 1.216 GHz. Being a dual-slot AIB, the GTX 980 drew power from two 6-pin power connectors (on the top of the AIB) and consumed 165 W maximum. Display outputs included DVI, HDMI 2.0, and three DisplayPort 1.4a connectors. The board used a PCI Express 3.0 x16 interface and ran DirectX 12.

Fig. 2.29 Nvidia Maxwell GPU running voxel illumination. Courtesy of Nvidia

2.3 The Fifth Era of GPUs (July 2015)

The fifth era of GPUs was marked by the introduction of DirectX 12 (D3D12) and
featured advanced low-level programming which reduced driver overhead.
DirectX 12 differed from DirectX 11 in that it was closer to the GPU (low level), like AMD's Mantle, and it gave developers fine-grained control over how games could interact with the CPU and GPU.

2.3.1 AMD’s CGN RX380 (June 2016)

AMD introduced its new RX 480 (code named Ellesmere) with 2304 stream processors delivering 5.8 TFLOPS, built with 14 nm FinFETs at GlobalFoundries. The GPU was based on AMD's Graphics Core Next (GCN) 4.0 architecture and had a number of fundamental features that defined it, including:
• Primitive discard accelerator
• Hardware scheduler
• Instruction prefetch

• Improved shader efficiency


• Memory compression.
In AMD graphics architectures, a kernel was a single stream of instructions that
operated on a large number of data parallel work-items. The work-items were orga-
nized into architecturally visible workgroups that could communicate through an
explicit local data share (LDS). The shader compiler subdivided workgroups into
microarchitectural wavefronts that were scheduled and executed in parallel.

A fin field-effect transistor (FinFET) is a multi-gate device, a MOSFET (metal-


oxide-semiconductor field-effect transistor) built on a substrate where the gate
is placed on two, three, or four sides of the channel or wrapped around the
channel, forming a double or even multi-gate structure.

Wikipedia

The GCN shader compiler created wavefronts (also called simply waves) that
contained 64 work-items. When every work-item in a wavefront was executing the
same instruction, the organization was very efficient. Each GCN compute unit (CU)
included four SIMD units that consisted of 16 ALUs; each SIMD executed a full
wavefront instruction over four clock cycles (Fig. 2.30). The main challenge then
became maintaining enough active wavefronts to saturate the four SIMD units in a
CU.
A GCN CU had four SIMDs, each with a 64 KiB register file of 32-bit vector general-purpose registers (VGPRs), for a total of 65,536 VGPRs per CU. Every CU also had a register file of 32-bit scalar general-purpose registers (SGPRs). Until GCN3, each SIMD contained 512 SGPRs, and from GCN3 on, the count was bumped to 800. That yielded 3200 SGPRs total per CU, or 12.5 KiB.
The RDNA architecture was designed for a new narrower wavefront with 32 work-
items, called wave32, that was optimized for efficient compute. Wave32 offered
several critical advantages for compute and complemented the graphics-focused
wave64 mode.
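The register-file and wavefront numbers above can be checked with a few lines of arithmetic (a sketch based only on the figures given in the text):

# GCN compute-unit arithmetic from the figures above.
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16          # ALUs per SIMD
WAVEFRONT = 64               # work-items per wave64 wavefront
VGPR_KIB_PER_SIMD = 64       # 64 KiB of 32-bit vector registers per SIMD
SGPRS_PER_SIMD = 800         # from GCN3 onward

print(WAVEFRONT // LANES_PER_SIMD)                   # 4 cycles per wave64 instruction
print(SIMDS_PER_CU * VGPR_KIB_PER_SIMD * 1024 // 4)  # 65536 VGPRs per CU
sgprs = SIMDS_PER_CU * SGPRS_PER_SIMD
print(sgprs, sgprs * 4 / 1024)                       # 3200 SGPRs, 12.5 KiB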
The GPU also incorporated h.265 decode at up to 4K and encode at 4K and
60 FPS. It did not incorporate HBM as the previous generation did (Fig. 2.31).

Fig. 2.30 AMD’s CGN CU block diagram

Fig. 2.31 AMD revealed their GPU roadmap. Courtesy of AMD

AMD’s Vega GPU which had also been known as Greenland featured 4096 stream
processors. The stream processors utilized the advancements made in the IP v9.0
generation of graphics SOCs by AMD. The Vega 10 GPU could be configured with
as much as 32 GB of HBM2 VRAM and use 18 billion transistors.
AMD also announced the Embedded Radeon E9260 and E9550 Polaris GPUs for embedded markets.
It also introduced the A12 APU with an R7 GPU with 512 stream processors. It was the last APU based on the previous-generation GCN architecture.

2.3.2 Intel’s Kaby Lake G (August 2016)

Intel surprised the industry by announcing a CPU with an embedded AMD GPU—code named Kaby Lake G. The company built the CPU in its 14 nm fab. It was an eighth-generation Intel Core processor and had a Radeon RX Vega
M GH GPU co-processor. The company offered multiple versions—the i7-8809G,
i7-8808G, 8709G, 8706G, -705G, and i5-8305G series.

Fig. 2.32 Intel multi-chip Kaby Lake G. The chip on the left is the 4 GB HMB2, the middle chip
is the Radeon RX Vega, and the chip on the right is the eighth-gen core. Courtesy of Intel

The high-end GPU had 24 compute units (1536 shaders) and 96 texture units, while all the others had a 20 CU (1280 shaders), 80 texture unit version. The high-end version ran at a 1.06 GHz clock (boost to 1.19 GHz) and the others at a 0.93 GHz clock (boost to 1.01 GHz) (Fig. 2.32).
The GPU was DirectX 12 and OpenGL 4.5 compatible and had 4 GB of internal
HBM.
Intel discontinued the product line in January 2020, two years after announcing
it would build its own dGPU. But it was significant because Intel was able to refine
its chip-to-chip interconnection scheme which led to Intel’s embedded multi-die
interconnect bridge (EMIB) for its chiplet Xe dGPU design.

2.3.3 Nvidia

Nvidia added the Nvidia GeForce GTX 1060 with a starting price of $249 to its
Pascal family of gaming GPUs, complementing the GTX 1080 and 1070 following
their launch two months earlier. The GTX 1060 had 1280 CUDA cores, 6 GB of
GDDR5 memory running at 8 Gbps, and a boost clock of 1.7 GHz, which could be easily overclocked to 2 GHz for further performance.
The GTX 1060 also supported Nvidia Ansel technology, a game-capture tool that
allowed users to explore, capture, and compose gameplay shots, pointing the camera
in any direction, from any vantage point within a gaming world, and then capture
360° stereo photospheres for viewing with a VR headset or Google Cardboard.
Nvidia also announced Xavier, a new SoC based on the company’s next-gen
Volta GPU, which Nvidia hoped would be the processor in future self-driving

Fig. 2.33 Nvidia’s GPU roadmap. Courtesy of Nvidia

cars. Xavier featured a high-performance GPU and the latest Arm CPU, yet, according to the company, had great energy efficiency (Fig. 2.33).
Using the expanded 512-core Volta GPU in Xavier, the chip was designed to support deep learning features important to the automotive market, said the company. A single Xavier-based AI car supercomputer would be able to replace a Drive PX 2 configuration with two Parker SoCs and two Pascal GPUs. Xavier was to be built using a 16 nm FinFET process and had seven billion transistors—it was probably the biggest chip ever built anywhere at the time.

2.3.4 AMD’s Navi RDNA Architecture (July 2019)

AMD introduced its new Navi GPU architecture in mid-2019. New product
announcements are typically made at CES in January and Computex in June or
July, where OEM meetings can be arranged in one central place, reducing travel
time and expense for everyone.
AMD also introduced a new name for its architecture—Radeon DNA—RDNA.
Navi GPUs were the first to use AMD’s new Navi RDNA (1.0) architecture, and
they had redesigned compute units with improved efficiency and instructions per
clock (IPC) capability. They had a multi-level cache hierarchy, which offered higher
performance, lower latency, and less power consumption than the previous series.
The new architectural design provided 25% better performance per clock per core
and 50% better power efficiency than AMD’s previous Vega generation architecture.

2.3.4.1 Radeon RX 5700 XT AIB (July 2019)

The first AIB announced with the Navi RDNA was the Radeon RX 5700 XT,
introduced in July 2019.
The Navi 10 also had an updated memory controller with GDDR6 support—
AMD’s first use of GDDR6 (Nvidia had employed the higher-speed memory in its
RTX 2080 a year earlier). The memory had a 256-bit bus, giving the GPU 448 GB/s of
memory bandwidth with a 1.75 GHz memory clock.
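Those bandwidth figures can be reproduced with simple arithmetic, assuming the standard GDDR6 signaling of eight data transfers per memory-clock cycle per pin (an assumption, not a statement from the text):

# Back-of-the-envelope check of the Navi 10 memory bandwidth figures quoted above.
memory_clock_ghz = 1.75        # memory clock, GHz
transfers_per_clock = 8        # GDDR6 effective transfers per clock per pin (assumed)
bus_width_bits = 256

data_rate_gbps = memory_clock_ghz * transfers_per_clock        # 14 Gbps per pin
bandwidth_gb_s = data_rate_gbps * bus_width_bits / 8           # ~448 GB/s
print(f"{data_rate_gbps:.0f} Gbps per pin, {bandwidth_gb_s:.0f} GB/s total")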
The Navi 10 ran at 1.68 GHz (a special anniversary unit ran at 1.98 GHz) and had
2560 shaders, 160 texture mapping units (TMUs), 64 render units, and 40 compute
units. The chip had 10.3 billion transistors and was in a 251 mm2 die; its block
diagram is shown in Fig. 2.34.
One of the significant differences in the Navi architecture was AMD’s use of a
communications fabric among the various elements in the GPU.
AMD organized the GPU into several main blocks, connected by AMD’s Infinity
Fabric. The command processor and PCI Express interface connected the GPU
to the PC system and controlled various functions. The two shader engines held
the programmable compute resources and some dedicated graphics hardware. Each
shader engine included two shader arrays, which comprised the new dual compute
units, a shared graphics L1 cache, a primitive unit, a rasterizer, and four render
back-ends (RBs). In addition, the GPU included a dedicated logic for multimedia
and display processing. The partitioned L2 cache and memory controllers routed
memory access. This was an on-die predecessor to a future chiplet design.
The command processor received API commands and, in turn, operated different
processing pipelines in the GPU, as illustrated in Fig. 2.35. The graphics command
processor managed the traditional graphic pipeline (e.g., DirectX, Vulkan, OpenGL)
shaders tasks and fixed function hardware. The Asynchronous Compute Engines
(ACE) implemented compute tasks that managed compute shaders. Each ACE main-
tained an independent stream of commands and could dispatch compute shader wave-
fronts to the compute units. Similarly, the graphics command processor had a stream
for each shader type (e.g., vertex and pixel). The command processor spread work
across the fixed function units and shader arrays for maximum performance.
The RDNA architecture introduced a new scheduling and quality-of-service
feature known as Asynchronous Compute Tunneling, enabling compute and graphics
workloads to co-exist in the GPU. Different shaders could execute on the RDNA
compute unit in a typical operation. However, a task could be more sensitive to
latency than other jobs. The RDNA architecture could suspend the execution of
shaders and free up compute units for high-priority tasks.
The command processor and scheduling logic partitioned graphics and compute
work to facilitate dispatching to the arrays to improve performance. For example,
the graphics pipeline was partitioned for screen space and then sent to each partition
independently. Developers could also create scheduling algorithms for compute-
based effects.
The RDNA architecture consisted of multiple independent arrays, fixed function
hardware, and programmable dual compute units. AMD could scale performance
from the high-end to the low-end by increasing the number of shader arrays and
altering the balance of resources within each shader array. The Radeon RX 5700
XT included four shader arrays, and each one had a primitive unit, a rasterizer, four
render back-ends (RBs), five dual compute units, and a graphics L1 cache.

Fig. 2.34 Block diagram of the AMD Navi 10, one of its first GPUs powered by the RDNA
architecture
The primitive units assembled triangles from vertices and were responsible for
fixed function tessellation. Each primitive unit provided culling of two primitives per
clock, making it twice as fast as the previous generation.

Fig. 2.35 AMD’s RDNA command processor and scan converter

The rasterizer in each shader engine performed the mapping from the geometry
stages to the pixel stages. AMD subdivided the screen with other fixed function
hardware, each portion distributed to one rasterizer.
The GPU’s dual compute unit had a dedicated front-end, as shown in Fig. 2.36.
The L0 instruction cache was shared between all four SIMDs within the dual compute
unit—previous instruction caches were shared between four CUs, or 16 Graphics Core
Next (GCN) SIMDs. The instruction cache of the RDNA architecture was 32 KB and
four-way set-associative; it consisted of four banks of 128 cache lines that were 64
bytes long. Each of the four SIMDs could request instructions every cycle. And the
instruction cache could deliver 32 bytes (typically two to four instructions) every
clock to each SIMD—roughly 4X greater bandwidth than GCN.
The fetched instructions went to wavefront controllers. Each SIMD had a private
instruction pointer and a 20-entry wavefront controller for 80 wavefronts per dual
compute unit. Wavefronts could be from a different work-group or kernel, although
the dual compute unit could maintain 32 work groups simultaneously and operate in
wave32 or wave64 mode.
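A rough occupancy sketch based on the figures above (four SIMDs per dual compute unit, 20 wavefront slots per SIMD); the work-item totals are illustrative arithmetic, not AMD-published limits:

# Wavefront capacity of one RDNA dual compute unit, from the numbers quoted above.
simds_per_dual_cu = 4
wavefront_slots_per_simd = 20
wavefronts = simds_per_dual_cu * wavefront_slots_per_simd      # 80 wavefronts in flight

for wave_width in (32, 64):                                    # wave32 and wave64 modes
    print(f"wave{wave_width}: up to {wavefronts * wave_width} work-items resident")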
The architecture had a hypervisor agent, allowing the GPU to be virtualized
and shared between different operating systems. That was useful for cloud gaming
services in data centers where virtualization was crucial from a security and oper-
ational standpoint. Although consoles focused on gaming, many offered a suite of
communication and media capabilities and benefited from virtualizing the hardware
to deliver performance for all tasks.

Fig. 2.36 AMD’s RDNA compute unit front-end and SIMD
The GPU could reach a theoretical performance level of 8.6 TFLOPS (the
anniversary unit could get to 10.13 TFLOPS).
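Those peak figures follow from the shader count and clocks given earlier, assuming the usual two floating-point operations per stream processor per clock for a fused multiply-add (a standard rule of thumb, not a number from the text):

# Peak FP32 estimate for the RX 5700 XT configurations described above.
shaders = 2560
flops_per_clock = 2            # one fused multiply-add per shader per clock (assumed)

for label, clock_ghz in (("standard", 1.68), ("anniversary", 1.98)):
    tflops = shaders * flops_per_clock * clock_ghz / 1000
    print(f"{label}: {tflops:.2f} TFLOPS")    # ~8.6 and ~10.1 TFLOPS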
The new GPU was fabricated at TSMC using their 7 nm manufacturing process,
and AMD was the first GPU supplier to produce chips at that node. Nvidia went to
Samsung for its chips and made them at 8 nm.
The new RDNA Navi parts employed PCI Express 4.0 as well. The Navi 10 was
AMD’s second GPU with PCIe 4.0 capability; the Vega 20 also had it, but AMD
restricted PCIe 4.0 to its high-end Vega parts, whereas with Navi, all segments had
PCIe 4.0. The new PCIe 4.0 interface operated at 16 GT/s, double the throughput of
earlier 8 GT/s PCIe 3.0-based GPUs.
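For reference, the raw transfer rates translate into usable bandwidth roughly as follows for an x16 graphics slot; both generations use 128b/130b line encoding, and protocol overhead beyond that is ignored in this sketch:

# Approximate per-direction bandwidth of an x16 PCIe link at the two transfer rates.
def pcie_x16_gb_s(transfer_rate_gt_s):
    encoding = 128 / 130           # 128b/130b line encoding
    lanes = 16
    return transfer_rate_gt_s * encoding / 8 * lanes

print(f"PCIe 3.0 x16: ~{pcie_x16_gb_s(8):.1f} GB/s")    # ~15.8 GB/s
print(f"PCIe 4.0 x16: ~{pcie_x16_gb_s(16):.1f} GB/s")   # ~31.5 GB/s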

2.3.4.2 Summary

Combined with the architectural improvements and TSMC’s 7 nm process, AMD
claimed it had achieved a 50% increase in performance per watt with the Navi-based
GPUs. The anniversary version GPU drew 225 W.
AMD further distinguished the new RX 5000 series AIBs based on the Navi
architecture by going to a four-digit nomenclature (i.e., the Vega series were the RX
500 series).

2.3.4.3 RX 5500 Series (2019)

In October 2019, AMD launched the RX 5500 series AIBs based on its 7 nm Navi
14 RDNA 1.0 GPU. Introduced with the AIB were three new gaming features, an
anti-lag feature that shortened the time from a mouse click to screen action, an
image-sharpening feature, and a variable resolution feature called Boost.
In August 2016, AMD acquired HiAlgo, a developer of PC gaming tools designed
to improve the gaming experience without overclocking the GPU. The company was
founded in 2011 in Sunnyvale by Eugene Fainstain and Alex Tsodikov.
HiAlgo applications were plug-ins that helped hardware perform better. In 3D
games, the application allocated computer resources by dynamically changing frame
rates and picture resolution. The applications used code-injection techniques to attach
to a game.
The company introduced three tools that gamers and developers could use: Boost,
Chill, and Switch.
Boost was a utility that made gameplay smoother with less lag. It intercepted
and, on the fly, modified commands sent from the game to the graphics AIB, which
the company claimed optimized performance frame by frame. During fast-paced
moments of the game, it lowered the rendering resolution, causing the frame rate
and responsiveness to increase, effectively increasing the performance of the GPU
noticeably.

Chill was a smart frame-rate limiter utility that reduced GPU and CPU churning
(and subsequently overheating). The application tracked what was happening in the
game and allocated computational resources for the best game performance. When
there was not much action in the game, the application lowered the frame rate. When
the action picked up, the frame rate went up. The company said Chill prevented
underclocking and saved power. That increased gaming time on laptops.
Switch was a utility that changed the game resolution from 100 to 50% with a
button push that increased the frame rate. It worked by intercepts, which modified
commands sent from the game to the AIB on the fly. Rendering resolutions could be
adjusted, even if the game could not do it without a restart.
When AMD acquired the firm, its software ran only on games compatible with
DirectX 9; DirectX 11 was on the to-do list.
With the RX 5500 XT launch in December 2019, AMD introduced three new soft-
ware features: Anti-Lag, Boost, and Chill. In HiAlgo terms, these were the equivalent
of Boost, Switch, and Chill.
AMD’s Radeon Boost dynamically lowered the resolution of the entire frame
when fast on-screen character motion was detected (from the user’s mouse activity).
That allowed higher FPS with little perceived impact on quality. The feature reduced
screen resolution on a linear scale (down to a 50% minimum). AMD also referred to
that feature as Motion Adaptive Resolution.
Radeon Chill was a power-saving feature that dynamically regulated the frame
rate based on your character and camera movements in game. As activity decreased,
Radeon Chill reduced frame rate and saved power, helping lower the GPU’s temper-
ature. Radeon Chill worked for most titles using DirectX 9, 10, 11, 12, and
Vulkan.
Radeon Anti-Lag was a feature that helped reduce input-to-response latency (input
lag) by reducing the time between the game’s sampling of user controls and the output
appearing on the display.
Radeon Chill, Radeon Anti-Lag, and Radeon Boost were mutually exclusive, and
only one could be enabled at a time.

2.3.5 Summary

The AMD RDNA Navi line of GPUs was significant because AMD introduced a
new, more efficient and powerful architecture with RDNA and at the same time took
advantage of a process shrink to 7 nm to add several new features.

2.3.6 Intel’s Whiskey Lake 620 GT2 iGPU (2018)

The Gen 9.5-integrated UHD Graphics 620 (GT2) was in processors from the Whiskey
Lake generation. The DirectX 12-compatible GT2 version of the Skylake GPU had
24 EUs (192 shaders) clocked at up to 1150 MHz, and it had a dedicated 3 MB L3
cache. The UHD 620 used a shared memory architecture with the CPU (DDR4-2133).
A block diagram of the GT2 iGPU is shown in Fig. 2.37.

Fig. 2.37 Intel GT2 iGPU block diagram
The video engine supported H.265/HEVC Main10 profile in hardware with 10-bit
color. Google’s VP9 codec could be hardware decoded. The Core i7 chips supported
HDCP 2.2 and, therefore, Netflix 4K. HDMI 2.0, however, only worked if the TV
had a high-speed level shifter and active-protocol converter (LSPCon) converter chip
in it.
What made it significant was its large number of shaders—and that the processor
and iGPU had been designed for low-power operation in notebooks. It also showed
the direction the Xe basic design would likely follow.

2.3.7 Intel’s Gen 11 iGPU (March 2019)

A prelude to Xe?
Intel’s integrated Gen 11 GPU was a monolithic design, with significant microar-
chitectural enhancements (over earlier generations) that improved performance per
watt efficiency. The Gen 11 GPU graphics technology (GT) architecture, said the
company, targeted modern thin and light mainstream and premium PC designs. At
the time, speculation was the Gen 11 graphics architecture would be the basis for
Intel’s upcoming Xe discrete GPU architecture.

The Gen 11 GPU GT2 architectural enhancements (over Gen 9) improved perfor-
mance per FLOP by removing bottlenecks and increasing the efficiency of the
pipeline.
The design had 64 EUs (512 shaders), 32 TMUs, and 8 ROPs; it was DirectX
12 (12_1) compatible and generated roughly 1100 GFLOPS. It also supported 3D rendering,
GPU computing, and programmable and fixed function media capabilities. Intel split
the iGPU architecture into four subslices: The Global Assets slice, which had some
fixed function blocks that interfaced to the rest of the SoC, the Media fixed function
slice, the 2D blitter, and the Graphics Technology Interface (GTI) slice. The GTI
slice housed the 3D fixed function geometry, eight subslices containing the EUs, and
a slice common that held the rest of the fixed function blocks supporting the render
pipeline and L3 cache.
For the Gen 11 GPU-based products, Intel aggregated eight subslices into one
slice. Thus, a single slice aggregated a total of 64 EU. Aside from grouping subslices,
the slice integrated additional logic for the geometry, L3 cache, and the slice common.
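The 512-shader and roughly 1.1 TFLOPS figures are consistent with the commonly cited Gen 11 EU organization of two 4-wide FP32 ALUs per EU, each doing a fused multiply-add per clock; the clock used below is an assumed, typical boost frequency rather than a number from the text:

# Illustrative peak-throughput arithmetic for the 64-EU Gen 11 GT2 configuration.
eus = 64
lanes_per_eu = 2 * 4                 # two SIMD-4 FP32 ALUs per EU (assumed organization)
flops_per_lane = 2                   # fused multiply-add
clock_ghz = 1.1                      # assumed typical boost clock

shaders = eus * lanes_per_eu                              # 512
gflops = shaders * flops_per_lane * clock_ghz             # ~1126 GFLOPS
print(f"{shaders} shader lanes, ~{gflops:.0f} GFLOPS peak FP32")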

2.3.7.1 Intel’s GPU’s Geometry Engine

The Gen 11 GPU’s 3D fixed function geometry had a typical render front-end that
mapped to the graphics pipeline in OpenGL DirectX, Vulkan, or Metal APIs. Addi-
tionally, it included the Position Only Shading pipeline, or POSH pipeline, used to
implement position only tile-based rendering (PTBR) mentioned above.
Vertex fetch (VF), one of the first stages in the geometry pipe, was, as its name
implied, used to fetch vertex data from memory. That data was reformatted and
written into a buffer for use by later stages. A vertex typically has more
than one attribute (e.g., position, normals, texture coordinates, color). As graphics
workload complexity has increased, the number of vertex attributes has grown too. The Gen
11 GPU increased the VF input rate from four attributes/clock to six and improved the
input data cache efficiency. Another significant VF change in the Gen 11 GPU was the
increased number of draw calls it could handle at the same time to enable streaming
of back-to-back draw calls. Newer APIs like DirectX 12 and Vulkan reduced the
overhead significantly for draw calls, which increased the number of draw calls that could
be made per frame, improving the visual quality [16]. Shown in the block diagram in Fig. 2.38
are the iGPU and CPU with a connecting ring.
The Gen 11 GPU also made tessellation improvements. It delivered up to twice
the Hull Shader thread dispatch rate, increasing the efficiency of output topology,
especially for patch primitives subject to low tessellation factors. Another notable
new feature of the Gen11 iGPU was variable rate shading (VRS).
This was another significant integrated GPU introduced by Intel. The company
devoted over 50% of its precious die to the free GPU in the CPU. And it wasn’t just
the cost of the silicon; there were added costs for testing and driver writing. Those
are big investments, and Intel would not have made them if it did not see a long-term
strategic advantage in doing so.

Fig. 2.38 Intel Gen 11 iGPU block diagram

2.3.7.2 Intel Updates Its Ring Topology

The chip used a ring-based topology bus between CPU cores, caches, and the GPU. It
had dedicated local interfaces for each connected agent. Intel first introduced the ring
architecture for graphics in the Larrabee design in 2006. The ring connected a system
agent for off-chip system memory transactions to/from CPU cores and to/from the
iGPU. Intel processors included a shared Last-Level Cache (LLC) connected to the
ring, and the integrated GPU shared the LLC. So the ring design for GPUs by Intel
had been around for many years.
The Gen 11 GPU integrated within Intel’s Core processors implemented multiple
clocks. Intel partitioned the clocks into a processor graphics clock domain, a per-
CPU core clock domain, and a ring interconnect clock domain. Intel used those
segmentations in power management scenarios.
Since before the Larrabee project in 2006, Intel had been a regular contributor and
presenter at the ACM SIGGRAPH conferences worldwide. The papers were always
well received, frequently referenced, and respected. However, most of the advanced
concepts Intel presented did not seem to find their way into Intel’s GPUs. The Gen
11 iGPU was an exception and showed some of the power Intel would bring to its
Xe dGPUs, announced in 2018 and known unofficially as Gen 12.
Two significant features of Gen 11 were coarse pixel shading and POSH.

Fig. 2.39 CPS added two more steps in the GPU’s pipeline

2.3.7.3 Coarse Pixel Shading

Decoupled sampling techniques such as coarse pixel shading (CPS) can lower the
shading rate while resolving visibility at full resolution, preserving details along
geometric edges [17]. Coarse pixel shading decreased the workload on GPUs by
reducing the number of color samples used to render an image. Decreasing pixel
shader runs also saved power. Coarse pixel shading provided application developers
with a new rate control on pixel shading operations. CPS is better than upscaling. It
lets developers work at the render target resolution while sampling the more slowly
varying color values at the coarse pixel rate. AMD and Nvidia introduced similar
features in their GPUs and drivers (Fig. 2.39).
CPS cut in half the number of shader invocations, yet there was almost no
noticeable difference on a high pixel density display (Fig. 2.40).
Intel first described coarse pixel shading as a technique in its 2014 High-
Performance Graphics Paper [18].
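The savings come directly from shading at a coarser granularity while visibility stays per-pixel; the frame size and coarse-pixel rates below are example values, not Intel data:

# Pixel-shader invocation counts at different coarse pixel rates for a 4K frame.
width, height = 3840, 2160                     # example render-target size

for coarse_w, coarse_h in ((1, 1), (2, 1), (2, 2)):
    invocations = (width // coarse_w) * (height // coarse_h)
    print(f"{coarse_w}x{coarse_h} coarse pixels: {invocations:,} invocations")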

2.3.7.4 Position Only Shading Tile-Based Rendering (POSH)

Tile-based rendering reduced memory bandwidth by managing multiple render
passes to data per tile in the chip (see Book one, Microsoft Talisman—the Chip
That Never Was).
The Gen 11 GPU added a parallel geometry pipeline that acted as a tile binning
engine to support tile-based rendering. Used in front of the render pipeline for a visi-
bility binning prepass per tile, it looped over geometry per tile and consumed the
visibility stream for that tile.

Fig. 2.40 Geometry with red boxes is sufficiently far from the camera, and therefore, it is of minor
importance to the overall image. Thus, the color shading frequency could be reduced (using CPS
with no noticeable effect on the visual quality or the frame rate). Courtesy of Intel

The POSH pipeline, Intel’s position only tile-based rendering (PTBR) system,
deployed two geometry pipelines—a standard rendering pipeline and the POSH
pipeline.
The POSH pipeline ran the position shader in parallel with the main application.
Still, it typically generated results much faster, as it only shaded position attributes
and skipped the rendering of pixels. The POSH pipeline ran ahead of the rendering
pipeline and used attributes from the shaded position to compute visibility informa-
tion for triangles and determine if they were culled. The object visibility recording
unit of the POSH pipeline calculated the visibility, compressed the data, and recorded
it in memory, as illustrated in Fig. 2.41.
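The flow can be sketched as two cooperating passes; the code below is a highly simplified conceptual model under assumed data structures (triangles as 2D vertex lists, tiles as grid cells), not Intel's implementation:

# Conceptual sketch of position only tile-based rendering (PTBR).
TILE = 64                                           # tile size in pixels (illustrative)

def posh_prepass(triangles, screen_w, screen_h):
    """Shade positions only, cull, and record a per-tile visibility stream."""
    visibility = {}
    for tri_id, verts in triangles.items():
        xs, ys = [v[0] for v in verts], [v[1] for v in verts]
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= screen_w or min(ys) >= screen_h:
            continue                                # trivially culled: entirely off-screen
        for tx in range(max(0, int(min(xs))) // TILE, min(screen_w - 1, int(max(xs))) // TILE + 1):
            for ty in range(max(0, int(min(ys))) // TILE, min(screen_h - 1, int(max(ys))) // TILE + 1):
                visibility.setdefault((tx, ty), []).append(tri_id)
    return visibility

def render_pass(triangles, visibility):
    for tile, tri_ids in visibility.items():
        for tri_id in tri_ids:                      # only triangles recorded as visible
            _ = triangles[tri_id]                   # full vertex and pixel shading would go here

tris = {0: [(10, 10), (100, 10), (10, 100)], 1: [(-500, -500), (-400, -500), (-500, -400)]}
vis = posh_prepass(tris, 1920, 1080)
print(vis)                                          # triangle 1 is culled; triangle 0 is binned
render_pass(tris, vis)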
In theory, POSH was a faster and more power-efficient way to handle certain types
of geometry processing. Overall performance and applicability to workloads would
depend on the rendering mode used by games. However, Intel was thinking about
maximizing memory bandwidth and introducing more advanced features around the
idea. The company patented the concept [19].

Fig. 2.41 Position only tile-based rendering (PTBR) block diagram

2.3.8 Summary

The Gen 11 iGPU signaled what Intel would include in its discrete GPU. The Xe
architecture was designed to span from integrated GPUs to high-end workstation and
data center accelerator GPUs.
Intel open-source developers began preparing the graphics compiler back-end
changes for the Gen 12 Xe GPU, starting with Tiger Lake processors. Significant
architectural changes were revealed when compared to Ice Lake Gen 11 GPU. The
patches showed that the Gen 12 GPU ISA was one of the biggest reworks of the
Intel EU ISA since the original i965 graphics more than a decade earlier.
Intel updated nearly every instruction field, opcode, and register type. Other signif-
icant changes included removing the hardware register scoreboard logic, which left
it up to the compiler to ensure data coherency between register reads and writes
through software synchronization of instructions.
Intel’s 11th-gen desktop CPUs, Rocket Lake, launched in March 2021 and
competed with AMD’s Ryzen 5000 series.

2.4 Conclusion

The third era of the GPU saw the introduction of unified shaders. Prior to that, the
shaders were semi-fixed function and would sit idle, while other semi-fixed function
shaders might have been overburdened, not an efficient use of processing power.
Unified shaders were one step closer to the ultimate all compute GPU.
Also during the era, we saw the introduction of the GPU integrated with the
CPU, creating the iGPU. Intel was the first to market with such a device and quickly
rose to the number one GPU supplier since almost every CPU now had a GPU. The
iGPU would still trail the discrete GPU (dGPU) in performance, albeit getting more
powerful with every introduction, and you couldn’t beat the price—free.

References

1. Tamasi, T. The Evolution of Computer Graphics, Nvision keynote, (2008), https://www.nvidia.


com/content/nvision2008/tech_presentations/Technology_Keynotes/NVISION08-Tech_K
eynote-GPU.pdf
2. Luebke, D. and Humphreys, G. How GPUs Work, IEEE Computer, pp. 96–100,
(February 2007), https://ieeexplore.ieee.org/document/4085637?tp=&arnumber=4085637&
isnumber=4085604

3. Hruska, J. 10 years ago, Nvidia launched the G80-powered GeForce 8800 and changed PC
gaming, computing forever, Extreme Tech, November 8, 2016, https://www.extremetech.com/
gaming/239078-ten-years-ago-today-nvidia-launched-g80-powered-geforce-8800-changed-
pc-gaming-computing-forever
4. Peddie, J. Chasing the nanometer: When does it become indivisible? (April 30, 2020), https://
www.jonpeddie.com/news/chasing-the-nanometer/
5. Forsyth, T. Why didn’t Larrabee fail?, TomF’s Tech Blog, (August 15, 2016), https://tomfor
syth1000.github.io/blog.wiki.html[[Whydidn’tLarrabeefail?]]
6. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A.,
Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P. Larrabee: A Many-
Core x86 Architecture for Visual Computing (PDF). ACM Transactions on Graphics. Proceed-
ings of ACM SIGGRAPH, (August 2008), https://dl.acm.org/doi/pdf/10.1145/1360612.136
0617
7. Wittenbrink, C. M., Kilgariff, M., and Prabhu, A. Fermi GF100 GPU Architecture, IEEE Micro,
pp. 50–59, vol. 31, (March/April 2011), DOI Bookmark: https://doi.org/10.1109/MM.2011.24,
https://www.computer.org/csdl/magazine/mi/2011/02/mmi2011020050/13rRUILtJih
8. AMD Net Income/Loss 2006–2021|AMD, https://www.macrotrends.net/stocks/charts/AMD/
amd/net-income-loss
9. Transistors as Switches, https://learn.digilentinc.com/Documents/312
10. Smith, R. AMD’s Radeon HD 6970 & Radeon HD 6950: Paving The Future For AMD,
(December 15, 2010), https://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-
hd-6950/4
11. Dillinger, T. A “Super” Technology Mid-life Kicker for Intel, August 17, 2020, Semi-
Wiki, https://semiwiki.com/semiconductor-manufacturers/intel/289716-a-super-technology-
mid-life-kicker-for-intel/
12. Intel Architecture Day 2020, https://www.intel.com/content/www/us/en/newsroom/resources/
press-kits-architecture-day-2020.html#gs.cbozbp
13. Peddie, J. Free GPU — looking a gift horse in the mouth, TechWatch, (January 22, 2019),
https://www.jonpeddie.com/editorials/free-gpu-looking-a-gift-house-in-the-mouth/
14. Dow, R. AMD’s RDNA 2.0 Add-in board delivers performance in the mid-range, Jon Peddie
Research, (August 9, 2021), https://www.jonpeddie.com/reviews/6600-xt
15. Raja Koduri Joins Intel as Chief Architect to Drive Unified Vision across Cores and Visual
Computing, (November 8, 2017), https://newsroom.intel.com/news-releases/raja-koduri-joins-
intel/#gs.f2gsw5
16. Intel Processor Graphics Gen11 Architecture, Version 1.0, https://www.intel.com/content/
dam/develop/external/us/en/documents/the-architecture-of-intel-processor-graphics-gen11-
r1new-810410.pdf
17. Xiao, K., Liktor, G., and Vaidyanathan, K. Coarse Pixel Shading with Temporal Super-
sampling, ACM SIGGRAPH I3d ’18 Symposium on Interactive 3D Graphics and Games,
(May 4–6, 2018), https://software.intel.com/content/dam/develop/external/us/en/documents/
CPST_preprint.pdf
18. Vaidyanathan, K., Salvi, M., Toth, R., Foley, T., Akenine-Möller, T., Nilsson, J., Munkberg, J.,
Hasselgren, J., Sugihara, M., Clarberg, P., Janczak., T., and Lefohn, A. Coarse Pixel Shading,
HPG ‘14: Proceedings of High Performance Graphics’, (June 2014), https://www.researchg
ate.net/publication/288811329_Coarse_pixel_shading
19. Position Only Shader Context Submission Through a Render Command Streamer, US patent
20170091989A1, https://patents.google.com/patent/US20170091989A1/en
Chapter 3
Mobile GPUs

In 1996, Nokia introduced its PDA/cellphone device, the 9000 Communicator.


Also, in 1996 Palm introduced its legendary PalmPilot PDA with a 160 × 160-
pixel monochrome touchscreen LCD. The number of mobile graphics chip and IP suppliers increased to
25 in 2004 and then steadily declined to five by 2018, as shown in Fig. 3.1.
Shortly after that, in November, Qualcomm introduced the Snapdragon S1
MSM7227 system on a chip (SoC). Several companies had developed SoCs with
integrated GPUs, primarily for the smartphone market. Apple used Imagination
Technologies’ GPU design, and Qualcomm used ATI’s Imageon GPU technology.
In January 2009, AMD sold its Imageon handheld device graphics division to
Qualcomm.
Apple surprised the market again in 2010 by introducing the iPad, a 9.7-in. device
with a 1024 × 768 resolution screen that looked like a giant iPhone. Tablets had been
around since 1987 (the Cambridge research Z88 Write Top), and in the early 2000s,
there was a surge in tablet introductions based on Microsoft’s Windows Mobile OS.
But it was Apple that ignited the interest in a portable computer that looked like a
tablet.
However, the tablet introduction affected the PC market. Consumers using a PC
for minor tasks such as email and simple photo editing found the thin, lightweight,
and affordable device an ideal solution. Simultaneously, smartphones were getting
larger, higher-resolution screens driven by amazing SoCs with surprisingly good
GPUs.
The steady growth in mobile device sales has tapered off a bit, but there is still
strong interest in and desire for a sleek, modern mobile device, a truly personal
computer.
A mobile device is defined by its battery dependence—I have excluded mobile
PCs (notebooks or laptops) from that definition; they are covered elsewhere. So, for
this chapter, mobile devices are those shown in the vertical stack in Fig. 3.2.
Most mobile devices use an Arm processor.


Fig. 3.1 The rise and fall of mobile graphics chip and intellectual property (IP) suppliers versus
market growth

Fig. 3.2 Mobile devices

The GPUs that go into mobile devices come from various suppliers, as shown
in Fig. 3.3. Some specialize in market segments such as automotive or gaming, and
some serve all markets and platforms.
This chapter will look at the companies in the mobile market, which the makers of
smartphone and tablet GPUs dominate.

Fig. 3.3 Sources of mobile GPUs

3.1 Organization

This chapter has two parts—the first decade (2000–2010) and the second decade
and beyond (2010 on). There is, of course, a slight overlap; the companies didn’t know
at the time they were making history.
The mobile market doesn’t follow the API advancements as the PC market does
and, therefore, doesn’t have eras like the PC. The APIs for the mobile market tend
to move in fits and starts, and the adoption of them is equally irregular. Most of the
mobile devices use the Android operating system or some Linux variant. Apple has
its own Unix-based OS it calls iOS. The Android/Linux OSs use the OpenGL ES
API and in the 2020s began adopting Khronos’ Vulkan API. Apple has its own API,
Metal.

3.2 Mobiles: The First Decade (2000–2010)

Up until 2003 with the establishment of OpenGL ES 1.0, the mobile market was
chaotic with various APIs. That made it difficult (to impossible) for software devel-
opers to get their applications to run on various devices. This is discussed more fully
in book two, Eras and Environment, in the chapter on The GPU Environment—
APIs. By 2010, OpenGL ES 2.0 was in place and the mobile market had, for the
most part, stabilized. Apple continued to be an outlier and offer only its own API
for its own OS. But Linux and various versions of Android used OpenGL ES and
applications could run on any platform.
From 1999 to 2003, the mobile market had several independent and proprietary
APIs such as Nokia’s Symbian, RIM’s BlackBerry API, and Qualcomm’s BREW,
as well as Java ME mobile 3D graphics APIs.

The situation was chaotic. In those early years, Nokia’s Symbian OS was the most
popular and there were four different UIs that ran on top of the Symbian OS builds:
S60, S80, S90, and UIQ. Despite their common OS, no app written for one UI could
be used on another. Nevertheless, Nokia had almost 50% of the market by 2007.
Its numbers began to decline following the introduction of the iPhone. Nokia finally
gave up and dumped Symbian in favor of Android and OpenGL ES in 2010.
OpenGL ES stabilized the mobile market, and the SoC developers began following
it as PC developers followed DirectX.

3.3 Imagination Technologies First GPU IP (2000)

Imagination was one of the pioneers of GPUs in mobile, and in 2000, it introduced
its first transform and lighting unit, Elan, which drove two CLX2s for the Naomi 2
arcade machines (the second-generation PowerVR2’s code name was CLX2).
Imagination enjoyed several years as the GPU in Texas Instruments (TI’s) Open
Multimedia Application Platform (OMAP) smartphone processors, Samsung’s, and
other smartphone SoC suppliers.
In 2019, Imagination Technologies introduced its Imagination A-Series. The
company said the A-Series was its most crucial GPU launch since the mobile
PowerVR GPU 15 years before.
Up to 6 TFLOPS and low power in a mobile device
In November 2020, the company said it had topped itself and introduced the
Imagination B-Series—an expanded range of GPU IP (see chart in Fig. 3.4). It
delivered up to 6 TFLOPS with a reduction in power up to 30%, a 25% area reduction
over earlier generations, and the company claimed its fill rate was up to 2.5 times
higher than competing IP cores.

Fig. 3.4 Big jump in GPU power efficiency. Courtesy of Imagination Technologies

With Imagination A, the company said it made an extraordinary leap over earlier
generations. That resulted in an industry-leading position for performance and power
characteristics. The B-Series was a further evolution. The company said that it deliv-
ered the highest performance per mm2 for GPU IP with new configurations for lower
power. It reduced the bandwidth by up to 35% for a given performance target. All
that, said the company, made it a compelling solution for top-tier designs.
The B-Series offered a wide range of configurations, which expanded the options
available for its customers. Its scalability meant the B-Series was suitable for many
markets, including entry-level to premium mobile phones, consumer devices, Internet
of things, microcontrollers, digital television, and automotive. Add multi-core to the
mix and the B-Series was adaptable to data center performance levels. That, said the
company, made it a unique range of GPU IP.
The B-Series Imagination BXS had the first ISO 26262-capable cores in 2020,
which opened a range of automotive options, from small, safety-fallback cores to
many teraflops of computing. The high TFLOPS targeted advanced driver-assistance
systems and high-performance autonomy applications such as the dashboard display
shown in Fig. 3.5. Imagination was immensely proud of its ISO certification.
Imagination believed they picked the sweet spot between high-performance and
power-optimized cores. They incorporated an innovative decentralized approach to
multi-core design. The company claimed the design would allow it to deliver high-
efficiency scaling. BXT offered compatibility with industry trends such as chiplet
architectures. That enabled the company to provide a range of performance levels
and configurations, a capability that Imagination said had not before been possible
in GPU IP.
Imagination claimed it had optimized the new multi-core architecture for each
product family. The BXT and BXM cores featured primary full-core scaling capa-
bilities (refer to the block diagram in Fig. 3.6). Combined with all the cores’ power, it
delivered impressive single-app performance, enabling the cores to run independent
applications.

Fig. 3.5 Tile region protection isolates critical functions from each other. Courtesy of Imagination
Technologies

Fig. 3.6 Imagination’s BXT MC4 block diagram. Courtesy of Imagination Technologies

Fig. 3.7 The B boxes of Imagination. Courtesy of Imagination Technologies
BXE cores offered primary-secondary scaling. BXE was an area-optimized
solution that presented a single high-performance GPU and used Imagination’s
HyperLane for multi-tasking.
Imagination said BXS Automotive cores also featured multi-core with a multi-
primary design. That enabled performance scaling. It also permitted safety checks
to run across cores to confirm correct execution. Figure 3.7 shows Imagination
Technologies’ B-Series of GPU designs.
Imagination was particularly proud of its image compression technology, IMGIC,
which it claimed supplied new bandwidth-saving options. It offered up to four
compression levels, from pixel-perfect lossless modes to an extreme bandwidth-
saving mode. Imagination guaranteed a 4:1 or better compression rate, which
provided flexibility for SoC designers—they could optimize performance or reduce
system costs. IMGIC, said the company, was compatible with every core across the
B-Series range.
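To put a guaranteed 4:1 ratio in perspective, here is an illustrative bandwidth calculation for writing a 4K RGBA8 frame at 60 FPS; the frame parameters are example values, not Imagination data:

# Bandwidth for one write of a frame buffer, uncompressed versus 4:1 compressed.
width, height, bytes_per_pixel, fps = 3840, 2160, 4, 60     # example 4K RGBA8 at 60 FPS

uncompressed_gb_s = width * height * bytes_per_pixel * fps / 1e9    # ~2.0 GB/s per pass
compressed_gb_s = uncompressed_gb_s / 4                             # with 4:1 compression
print(f"uncompressed: {uncompressed_gb_s:.2f} GB/s, 4:1 compressed: {compressed_gb_s:.2f} GB/s")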
The Imagination B-Series cores
Imagination B-Series spread across four product families, as illustrated in Fig. 3.8
offered specialized cores for specific market needs:
• Imagination BXE—From one to 16 pixels per clock, BXE scaled from 720p to
8K. It was designed for UI rendering and entry-level gaming and offered up to
25% area saving over previous-generation cores. And, said the company, it had
up to 2.5x the fill rate density compared to the competition.
• Imagination BXM—High-efficiency cores with fill rate and compute balance.
The company designed the BXM for a compact silicon area in mid-range mobile
gaming, DTV, and complex UI solutions.
• Imagination BXT—The company claimed its four-core part, designed for hand-
held devices to the data center, could generate 6 TFLOPS while producing
192 gigapixels per second. It could supply 24 trillion operations per second
(TOP/s) for AI, delivering the industry’s highest performance per mm2.
• Imagination BXS—The BXS family were ISO 26262-capable GPUs. The
company claimed they would enable next-generation HMI, UI display, infotain-
ment, digital cockpit, and surround view.

Fig. 3.8 In 2020, Imagination had the broadest range of IP GPU designs available. Courtesy of
Imagination Technologies

Several OEMs selected the Imagination BXT core for its scalability. The company
claimed its GPUs could provide up to a 70% higher compute density than desktop
GPUs. Imagination said the BXT core IP offered flexible control over the OEM’s
configuration and layout of individual cores in SoCs and multi-die packages.

3.3.1 Summary

If Imagination proved nothing else, it demonstrated its resiliency, perseverance, and
tenacity. The company was whipsawed by the bad faith of its partners, including
Intel and Apple. It could be argued that Imagination was also the victim of personnel
pirating. Apple continued forward with designs developed initially by Imagina-
tion through a new licensing agreement. The company went
through foreign acquisitions and major management shakeups. It is now owned by
Chinese equity company Canyon Bridge. Like others, it suffered through a world-
wide pandemic and unrelenting competitors. And through all that, maybe because of
it, the company developed a string of new architectures with long legs, scalability,
and continually surprising features and results.

3.4 Arm’s Path to GPUs (2001)

Acorn Computers was founded in Cambridge, UK, in 1978 and was the genesis of
Arm. Acorn produced several computers, which were especially popular in the UK
including the Electron and the Archimedes. The company had been using commercial
off-the-shelf processors and decided it should build its own. The new reduced instruc-
tion set computer (RISC) was a popular design for its speed, low power consumption,
and minimalistic design. So Acorn decided it would build a RISC processor.
The Acorn RISC project started in October 1983, and Acorn had spent over $8.6
million on it by 1987. VLSI Technology was chosen to build the chip. VLSI produced
the first Arm processor on April 26, 1985, 4 years after the IBM PC [1]. Arm started
life in 1987 as part of the BBC computer made by Acorn [2].
In 1990, Acorn and Apple began collaborating to develop a low-power processor
for Apple’s Newton. The two firms concluded that a single company should do
processor development. Acorn transferred most of its advanced research and devel-
opment section to the new company and formed Arm Ltd. in November 1990. Acorn
and Apple had a 43% shareholding in Arm (in 1996) [3]. VLSI was also an investor
and the first Arm licensee.
Arm became the processor of choice for handheld devices, from PDAs (Newton,
Palm, etc.) to smartphones and early pretablet slate-like devices.
Arm had been working indirectly with Imagination Technologies for GPU IP. In
January 2001, the two announced a strategic development deal: They would integrate
Imagination’s compact PowerVR graphics core with Arm’s 16/32-bit RISC cores.
The companies said their first markets would be mobile phones, PDAs, digital set-top
boxes, tethered Internet appliances, and mobile gaming markets.
That arrangement lasted until June 2006 when Arm announced it would acquire
the Norway-based firm Falanx Microsystems. In addition to the GPU technology,
Arm acquired the Mali product brand.
Imagination quickly announced that it would continue to market its next-
generation graphics technology, PowerVR SGX, directly to customers and inde-
pendently of Arm. SGX was Imagination’s newest design and supported OpenGL
ES 2.0, and they hoped Arm would pick it up. But time moved on, and Imagination
had been re-evaluating its marketing of PowerVR SGX technology for a while.
Imagination had always had a close and frank relationship with Arm with good
executive- and working-level contacts, and they were both aware of each other’s
views and plans. Arm didn’t decide to go its separate way based on the technical
aspects of Imagination’s designs. The driving factors were customer input/support
trends, commercial and strategic.

So, in 2005, Arm began to look for another approach and launched the Flag project
to find the options available to the company (Flag was not an acronym, just a project
name). With a directive from the board of directors, the Arm team started a due-
diligence investigation, visited almost all the suppliers, and proclaimed suppliers of
graphics coprocessor chips and IP. The FLAG team did not begin with any precon-
ceived notions. They looked at the candidate companies’ cultures, philosophies,
direction, technology, and resources. Then in June 2005, they sent some repre-
sentatives to the northern university town of Trondheim, the third-largest city in
Norway.
Imagination Technologies was not out of consideration, but the Flag team found
the architectural directions taken in the development of the SGX did not completely
line up with Arm’s plans. Also, Arm management recognized graphics had become
a significant part of an SoC solution and having a design of its own would give the
company more leverage and control of its future.
After lots of investigation and several more meetings, Falanx agreed it should
become the new graphics IP business unit within Arm.
Announcing the acquisition in June 2006, Arm said the Falanx’s GPU family,
Mali, would allow Arm to meet the demand for more sophisticated graphics in
mobile, automotive, and home applications. Figure 3.9 shows the organization of the
Falanx GPU.
Falanx said it was not really looking to be acquired; it had investors with deep
pockets and long horizons. Arm knew better: Falanx was, in fact, pretty
desperate and looking to find a partner or a buyer. Falanx did have a development
deal with Zoran and others in the works, all of which could be moved over to Arm.
Arm acquired Falanx in June 2006, and Mike Inglis, executive vice president at
Arm, said, “The estimated total available market for embedded 3D hardware is set
to grow from 135 million units in 2006 to more than 435 million units in 2010.” His
forecast was a little off—in 2010, 325 million mobile devices with Mali shipped. By
2012, the market had jumped to 850 million, and in 2021, it reached 1.3 billion units
[4].

Fig. 3.9 Mali in Arm, circa 2006
Arm won one socket after another during that time, and the Mali GPU found its
way into most low-end and mid-range phones and most tablets. Mali became the
unquestioned market leader.
The Arm ecosystem ships billions of chips per quarter, and in the fourth quarter of
2020, Arm reported that its silicon partners shipped a record 6.7 billion Arm-based
chips in the prior quarter. That equates to about 842 chips shipped per second. By
2021, Arm partners shipped more than 180 billion Arm-based chips to all platforms
(including autos, toys, TVs, and refrigerators)—chances are you have three or more
Mali GPUs in your collection of electronic stuff.

3.4.1 Falanx

In 1998, Jørn Nystad and Borgar Ljosland met as students at the Norwegian Univer-
sity of Science and Technology (NTNU) in Trondheim. They started to discuss what
held clock frequency performance back in current GPU designs. Jørn had architec-
tural ideas from his CPU designs that formed the basis for a new type of architecture
for a low-power, high-performance GPU (Nystad has over 100 patents and is an Arm
fellow). Ljosland and Nystad assembled a team that included Mario Blazevic (soft-
ware engineering) and Edvard Sørgård (hardware engineering). They were thinking
they could commercialize the design and enter the PC add-in board (AIB) market,
but later changed their plan to being an IP supplier.
Backed by a business plan contest prize money from the university, govern-
mental research grants, and equipped with a promising field-programmable gate
array (FPGA) prototype, they formed Falanx Microsystems. In their young careers,
they quickly saw what was happening in the PC and AIB markets and changed their
strategy to offering IP for SoCs. It was a strategy not too unlike what Imagination
Technologies had gone through. From that move, came the original Mali design.
Later, as the business began to develop and the company got customers, Imagination
Technologies spoke about suing them for patent violation. However, no suit was ever
filed; neither company wanted to take on the burden of lawyer fees [5].
Falanx was incorporated on April 4, 2001, finished its RTL (register-transfer
level, a design abstraction) GPU IP model, and began marketing its IP Graphics
Core. The company targeted the mobile market and said its design could be used
in mobile phones, PDAs, set-top boxes, handheld gaming devices, and infotainment
systems. Falanx hoped to license its family of IP cores directly to equipment, and
SoC suppliers like Arm and Imagination Technologies had. Falanx spoke about its
independence from the university by emphasizing that the Mali graphics solution
needed no third-party licenses.

3.4.2 Mali Family (2005)

The Mali family of graphics IP cores offered 4× and 16× full-scene anti-aliasing
(FSAA) to improve the mobile gaming and multimedia experience. Arm claimed the
4× mode came with no performance, power, or bandwidth penalties, making its use
practical in mobile phones and other handheld devices. Initially, the Mali family had
three members:
• Mali-50
– Entry-level and basic mobile phones
• Mali-100
– For gamepads, feature phones, and smartphones
• Geometry engine
– An onboard geometry processor offered lower power consumption and
increased T&L performance.
The relative differences between the three designs are shown in Table 3.1.
The Mali-110 and Mali-55 were improved, partly reimplemented versions
of the Mali-100 and Mali-50. As the data in Table 3.1 indicates, the design scaled very
well. It was safe to say that the Mali GPU design represented the smallest GPU with
integrated T&L available.
Shown in Table 3.2 are the feature sets available in the primary family members.
In 2005, Falanx Microsystems added a video capability to its lineup of IP, offering
customers another feature for handheld devices operating in low-power configura-
tions: the Mali Video Series IP cores. Those cores provided H.264 encode/decode,
enabling customers to build high-frame-rate video capture and playback devices.
Historically, the video-enabled cores are a footnote. They were video acceleration
features integrated into the GPUs. It worked well for the resolutions of the time, but
the dedicated video accelerators won out as video resolution increased rapidly.
The company also reduced traffic between the processor and memory and created a
pipeline based on computational patterns rather than fixed functions. That approach
let Falanx take advantage of the similarities between video compression and 3D

Table 3.1 Configuration and performance parameters for Mali family of graphics cores
                                           Mali geometry   Mali-110      Mali-55
Gate count, logic (nand 2 × 2, full scan)  150k            230k          190k
SRAM                                       10 kb           7 kb          4 kb
Die area (130 nm)                          1.5 mm2         3.0 mm2       2.0 mm2
Max clock                                  150 MHz         200 MHz       200 MHz
Mpix/s (4× FSAA, bilinear)                 na              300           100
Mtri/s (transform)                         5               5 (consume)   1 (consume)

Table 3.2 Mali function list
3D Graphics: 4×/16× FSAA; T&L (SW or HW); Triangle setup; Flat/Gouraud shading; Point
sampling/bilinear/trilinear filtering; Dot3 bump mapping; Multi-texturing; Alpha blending; Stencil
buffering (4-bit); Mipmapping; Early Z-test; Perspective correct texturing; 18 bits internal color
precision; 2 bits per texel texture compression
Video (e.g., MPEG-4): Motion estimation; Motion compensation; Video scaling
2D Graphics: Lines, squares, triangles, points; Point sprites; ROP3/4; Rotation/scaling in any
degree; Font rendering; Alpha blending

graphics. As a result, the company could use the same gates for graphics as for
video display, reducing the gate count and simplifying the design process. Although
that approach was innovative and generated a lot of interest, in the end, it was not
commercially successful. Video and graphics requirements exploded, and power
and performance were ultimately more important than cost optimizations. A block
diagram of the Falanx Arm Mali GPU is shown in Fig. 3.10.


Fig. 3.10 Falanx Arm Mali block diagram



The Mali graphics core communicated with the Arm processor and system
memory through the advanced high-performance bus (AHB), and with peripherals
through the Advanced Peripheral Bus (APB).
The Mali Video Series included three pixel-processing cores: the Mali-110V,
Mali-55V, and a geometry processing core, the Mali-GP-V. These processors worked
with the Arm CPU to accelerate video encoding and decoding up to 30 frames
per second, with motion estimation, motion compensation, video scaling, image
differencing, and color-space conversion.
With the introduction of OpenGL ES 2.0, the Mali-200 (originally the Mali-120) used
a split programmable shader architecture. Its development spanned the acquisition
and marked the step up to Arm quality standards. It took a lot longer to complete
than initially envisioned, but that investment in time and resources from Arm created
the foundation for success 5–10 years later.
Because Arm had led the explosion of the handheld market and had done quite well
in a few others, the company thought it was time for a multi-core GPU in a handheld
device. In 2008, it announced the Utgard 4-core architecture, the Mali-400 MP GPU.

3.4.3 More Cores

Arm had already demonstrated that it could design multi-core CPUs with its
successful MPCore product. The company leveraged that technology (design tools,
test-and-measurement, etc.) and exploited the Mali architecture’s inherent scalability.
Its results produced a design that delivered more than a billion pixels a second while
using little memory bandwidth, which was the key to reducing power consumption.
Part of the Arm announcement was its software stack. Arm had been adding to
its stack for years. But before Mali was Arm’s, the clever Trondheimers had already
grafted several important APIs, interfaces, and miscellaneous standards to the GPU
core. Back then, it looked like the diagram in Fig. 3.11.
At the time, Arm said it had shipped 50 million graphics software engines before
the Falanx acquisition. A software stack ran JSR-184 on the CPU via an API and
contained a software rendering engine. JSR 184 is a specification that defines the
Mobile 3D Graphics (M3G) API for the J2ME (Java 2 Platform, Micro Edition),
a technology that allows programmers to use the Java programming language and
related tools to develop programs for mobile devices. LG and Motorola used it at the
time in tens of millions of handsets [6].
Arm sold drivers to the semiconductor suppliers with an abstracted driver to
OEMs. The OEM’s experience and feedback gave Arm information about how to
tune the drivers. Arm then fed that information to the chip suppliers, which allowed
Arm to understand the OEM’s use cases and other issues and help the chip suppliers
deliver better-tuned, higher-performing parts to get the right CPU into the right OEM
with the suitable applications.
Licensing the software to OEMs was an essential feature for Arm. It added signif-
icant value by prevalidating the software with Arm’s hardware. That helped ensure
Arm’s partners did not end up with complications when the lower-level software stack
was integrated with Java, which was hampering other companies. The important thing
to see in Fig. 3.11 is all the “J” numbers and “Open” words and numbers that informed
the OEMs where that software could be applied and how broad the applications and
opportunities were for it. That work did not come easy and certainly not overnight.
And as part of Arm’s offering, it was a substantial claim—preverified—that is, it
should work.

Fig. 3.11 Arm Mali’s graphics stack with MIDlets1

1 A MIDlet is an application that uses the mobile information device profile (MIDP) for the Java
Platform, Micro Edition (Java ME) environment.
The software stack needed something to run on, and that was where the Mali-
400 MP came in.
The 2008 design tackled one of the main power drain issues—memory band-
width—one wants as much as possible, but it comes at the expense of joules. Arm
claimed the Mali-400 MP GPU reduced memory bandwidth by combining the best
immediate-mode and tile-based rendering and clever use of a shared L2 cache with
unified memory access. Like other designs, it supported multiple levels of power
gating, all of which, said Arm at the time, would provide a 30–45% power savings.
But the concept had been a Falanx design philosophy since 1998. With the L2 cache,
Arm improved it even more.
Arm used a large set of different applications and use cases for its measurements
to support those claims. The partners supplied test cases developed internally and
included UI engines, 2D and 3D games, industry-standard benchmarks, and tech
demos. Bandwidth measurements were performed inside and outside the L2 cache
using Arm’s profiling tools and real (ASIC and FPGA) platforms or by running RTL
simulations of the Mali-400 MP. How quickly the Mali-400 MP could power down
portions of the core depended on how fast the OEM’s silicon implementation could
provide stable power. The latencies imposed by Mali were just tens of cycles.
Arm also claimed that the design scaled linearly with the number of cores and
offered Table 3.3 as an illustration.

Table 3.3 2008 Mali-400 MP GPU process-fill rate
Fragment processors     Fill rate (Mpix) at 275 MHz
1                       275
2                       550
3                       825
4                       > 1000
Source Arm
The fill rate was measured using supersampling anti-aliasing (SSAA), also called
full-scene anti-aliasing (FSAA) without any overdraw assumptions.
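The linear scaling in Table 3.3 is simply one pixel per clock per fragment processor at the quoted 275 MHz; a small sketch of that arithmetic follows (an illustration of the table, not additional Arm data):

# Fill rate scaling of the Mali-400 MP with the number of fragment processors.
clock_mhz = 275
pixels_per_clock_per_core = 1        # assumed: one pixel per clock per fragment processor

for cores in (1, 2, 3, 4):
    fill_rate_mpix = cores * pixels_per_clock_per_core * clock_mhz
    print(f"{cores} fragment processor(s): {fill_rate_mpix} Mpix/s")   # 275, 550, 825, 1100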

3.4.4 Balanced, Scalable, and Fragmented

Arm said the Mali driver reduced complexity because it was the same API and driver
for all configurations. The hardware details were hidden from developers, and the cores
worked autonomously on different parts of the frame buffer in parallel (see Fig. 3.12).
Therefore, the rendering was distributed for optimal load balancing and bandwidth
efficiency.
Because the L2 cache was crucial to the multi-core design and efficiency, the
fragment processors were unaware of one another and worked on separate parts of
the final frame buffer. The granularity of the load balancing was on a tile-by-tile basis.
Only one fragment processor touched any pixel in the frame buffer (in that frame).
Refer to Fig. 3.12 illustration where the screen grid represents the tiles making up
the full-frame buffer.
On its face, Falanx looked like it had a tough challenge as it took its IP licensing
model for handheld devices up against established and well-trained suppliers like
ATI, Nvidia, and Imagination Technologies. It was a tough challenge, but it did not
bother Falanx. CEO Borgar Ljosland said his company was young and built from
the start on an IP licensing model, and as a result, it was positioned to serve
IP-enabled SoCs. Although the competitors had IP strategies, the people at Falanx
reasoned that those competitors were not really organized to enjoy the margins of an
IP licensing model. They said that the Mali architecture was flexible and scalable in
ways that the competitors were not. Ljosland said the Mali architecture enabled chip
builders to use two and four cores to create more powerful platforms.

Fig. 3.12 The Mali-400 could share the load on fragments

3.4.5 More Designs

Utgard was the final design based on the original Falanx architecture. The Mali-
50/55/100/110 fixed function architecture (never given a code name) was the only
original Falanx architecture. The start of Utgard (Mali-200) came before Arm, but the
majority of the work was done under Arm. Arm managed to keep most of the original design
team and added significantly to the Trondheim and Cambridge teams. The company
developed three more architectures, the Midgard, the Bifrost, and the Valhall.

Fig. 3.13 Fujitsu MB86292 GPU

3.5 Fujitsu’s MB86292 GPU (2002–)

First GPU for automobiles


Fujitsu had been building 2D graphics display controllers (GDCs) since the early
1990s for its PCs and game machines and as an OEM supplier to the automotive
industry in Japan.
Its first 2D/3D graphics display controller was the MB86290A, code name
Cremson, introduced in January 2000 and designed for car navigation and infotainment
systems. The controller could display graphical information on four separate memory
layers, obviating the need to redraw the map each time new data was added. It also
had 2D and 3D graphic functions for displaying information. The overall graphical
performance was 100 Mpixels/s, about ten times greater, said the company, than
competitive products.
The graphics controller provided an analog RGB output with various screen resolutions
from 320 × 234 to 1024 × 768. It also offered line pattern, chroma-key,
shading, texture mapping, double buffer management, and a cursor.
The MB86290A Cremson had 3D features but relied on a Hitachi SH3/4 CPU for
geometry and T&L processing.
In October 2001, Fujitsu integrated a floating-point (FP) engine for geometry processing
and coordinate transformation and released the MB86292, code named Orchid (Fig. 3.13).
The geometry engine, said the company, would enable a significant reduction in the
numerical computation that graphics processing imposed on the CPU in embedded
systems.
The Orchid GDC had digital video capture and could store digital video data such
as TV in graphics memory. It could display rendered graphics and video graphics on
the same screen.
The controller provided functions such as XGA display (1024 × 768 pixels),
4-layer overlay, left/right split display, wrap-around scrolling, double buffers, and
translucent display. In addition to analog RGB output, the controller supported digital
RGB output and picture-in-picture video data.
The MB86292 supported 3D rendering features such as perspective texture mapping
with perspective correction, Gouraud shading, alpha blending, and anti-aliasing for
rendering smooth lines.
The MB86292's host CPU interface could transfer display lists and texture pattern
data from main memory to the Orchid's graphics memory or internal registers
using an external direct memory access (DMA) controller. The device had a 32-bit,
33 MHz PCI interface that enabled data transfer at 70 MB/s, more, the company claimed,
than any other GDC. The GDC incorporated an external memory inter-
face to allow off-chip connections to synchronous dynamic random-access memory
(SDRAM) or fast cycle RAM (FCRAM).
Fujitsu manufactured the GDC in its 250 nm CMOS technology fab in Japan.
In 2004, Fujitsu did a process shrink and introduced the MB86296 Coral PA GPU,
manufactured in a 180 nm process. The Coral PA ran on 1.8 V at 500 mA and 3.3 V
at 100 mA. It was compatible with all Fujitsu graphic display controller integrated
circuits (ICs). The GDC worked with different host CPU buses, including the FR
series from Fujitsu, the SH3 and SH4 from Hitachi, and the V83 from NEC Electronics. Coral
PA required no external glue logic.

3.5.1 MB86R01 Jade

In 2008, Fujitsu changed its design strategy and built an SoC with an ARM926 RISC
CPU, providing developers with a popular and well-known ISA.
The MB86R01 incorporated Fujitsu's MB86296 3D graphics core, which the company
upgraded to support high-speed double data rate (DDR) memory. Fabricated in
Fujitsu's 90 nm process, the SoC offered an optimal balance of power (low leakage
current) and performance.
The MB86R01 graphics SoC featured a hierarchical bus system that isolated high-
performance functions like 3D graphics processing from low-speed I/O and routine jobs.
Fujitsu designed the Arm processor to run at twice the rate of the graphics core to
reduce memory bus contention between those two primary functions. (The Arm ran
at 333 MHz and the graphics core at 166 MHz.)
Central to MB86R01’s architecture was a 3D geometry processing unit capable
of performing all primary 3D operations, including transformations, rotations, back-
face culling, view-plane clipping, and hidden surface management. See the block
diagram in Fig. 3.14.
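As a reminder of what one of those fixed-function stages computes, here is a small, generic back-face culling test on screen-space vertices; it is the textbook signed-area formulation, not Fujitsu's implementation.

#include <stdio.h>
#include <stdbool.h>

typedef struct { float x, y; } Vec2;   /* vertex already projected to screen space */

/* With a counter-clockwise front-face convention, a triangle whose signed
 * area is negative faces away from the viewer and can be discarded before
 * rasterization.                                                           */
static bool is_backfacing(Vec2 a, Vec2 b, Vec2 c)
{
    float signed_area = (b.x - a.x) * (c.y - a.y) -
                        (c.x - a.x) * (b.y - a.y);
    return signed_area < 0.0f;
}

int main(void)
{
    Vec2 a = {0, 0}, b = {10, 0}, c = {0, 10};
    printf("CCW triangle culled? %s\n", is_backfacing(a, b, c) ? "yes" : "no");
    printf("CW  triangle culled? %s\n", is_backfacing(a, c, b) ? "yes" : "no");
    return 0;
}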
A display controller supported two capture sources (YUV/ITU656 or RGB) and
enabled both upscaling and downscaling of video images. It was possible to map
the video to any of the six display layers and have it texture-mapped to polygons
to create special effects. The display controller was also capable of dual digital
outputs, supporting multiple monitor configurations in different resolutions. The
content could be either the same or unique to each panel. For example, a single
MB86R01 controller could support a 1024 × 768 resolution center console featuring
a navigation system and an 800 × 480 resolution rear-seat display for video, and the
menu buttons could be shared between the displays.

Fig. 3.14 Fujitsu's MB86R01 SoC Jade
The MB86R01’s six display layers could be six individual frame buffers or indi-
vidual canvases; each could contain unique content. The layers could be optimally
sized to save memory and improve system throughput and graphics performance.

3.5.2 Several Name Changes

Founded in 1979 and headquartered in Sunnyvale, California, Fujitsu Electronics
America, Inc. (FEA) was a subsidiary of Fujitsu Electronics, Inc. (FEI) of Japan.
FEI was a Fujitsu Semiconductor Limited (FSL) company, one of the world’s largest
semiconductor and electronics component distributors.

On 2nd October 2001, Fujitsu Microelectronics, Inc. (FMI) restructured in the
U.S. The company reorganized its business and became a new company, named
Fujitsu Microelectronics America, Inc. (FMA). FMA said it would focus its efforts
on system design, product marketing, sales, service, and support in the Americas but
would not include a manufacturing division. FMI’s manufacturing plant in Gresham,
Oregon, would continue under the FMI name. The company said negotiations were
in the works to transfer the facility to a joint venture under AMD and Fujitsu, Ltd.
Then, in July 2010, FMA announced the company would change its
name to Fujitsu Semiconductor America, Inc. (FSA). The company said the change was
in keeping with the name change of its parent company, FSL, which had changed its
name from Fujitsu Microelectronics Limited as of 1st April 2010. Under its new
company name, the Fujitsu Semiconductor group hoped to gain greater recognition
from customers in the global marketplace and clarify its business in semiconductor-
related products and services.
In February 2013, Fujitsu announced the elimination of 5000 jobs, 3000 in Japan
and the rest overseas, from its 170,000 employees. Fujitsu also merged its large-scale
integration (LSI) chip design business with Panasonic Corporation's, resulting in
the establishment of Socionext.
In 2014, after severe losses, Fujitsu also spun off its LSI chip manufacturing division
as Mie Fujitsu Semiconductor, which was bought in 2018 by
United Semiconductor Japan Co., Ltd., wholly owned by United Microelectronics
Corporation.
Then, in December 2015, the company changed its name from Fujitsu Semiconductor
America (FSA) to Fujitsu Semiconductor, Ltd. (FSL) to reflect its broader range of
products and services to customers. FSL became one of the world's largest semiconductor
and electronics component distributors.

3.6 Nvidia's Tegra—From PDAs to Autonomous Vehicles and Consoles (2003–)

Psion generally gets credit for producing the first personal digital assistant (PDA)—
the 1984 Organizer. That was followed seven years later with the Series 3, in 1991,
which established the familiar PDA form factor with a full keyboard.
John Sculley, Apple’s CEO, introduced the term PDA on January 7, 1992. At the
Consumer Electronics Show in Las Vegas, Nevada, he introduced the Apple Newton.
Apple chose a 20 MHz ARM610 processor made by VLSI (the first silicon partner
for Arm), the VY86C610, to power the Newton. Apple had discovered Arm
in September 1990 and bought a 43% stake in the company.
IBM introduced the Simon in 1994, the first PDA/smartphone [7], and in 1996,
Nokia introduced the 9000 Communicator, its PDA with digital cellphone function-
ality. That same year Palm released its famous PDA products, powered by a Motorola
68328. It had built-in functions, like a color and grayscale display controller, and
drove a 160 × 160-pixel monochrome touchscreen LCD. Palm quickly became the
dominant vendor of PDAs until the introduction of consumer smartphones in the
early 2000s. All those devices had simple 2D LCD screens.
Realizing the emergence of the handheld market, in 1996, Microsoft introduced its
CE operating system, code name Pegasus. It was a scaled-down version of Windows
optimized for devices with minimal memory; a Windows CE kernel could run on
one megabyte of memory.
Seeing that development, Ramesh Singh and Ignatius Tjandrasuwita, who were at
S3, along with four others at Chips & Technologies and Cirrus Logic, formed MediaQ
in 1997. They had entered the market to develop a dedicated graphics controller for
PDAs, mobile phones, and handheld game machines.
In 1998, another variant of Windows was introduced, Jupiter, a stripped-down
version of the Microsoft Windows CE operating system for low-power RISC
processors like Arm. It never went into production.
In 1999, MediaQ launched its first device, the MQ-200, a 128-bit graphics display
controller with 2 MB of embedded dynamic random-access memory (DRAM), illus-
trated in Fig. 3.15. The company had grown to 23 people from just about every
graphics semiconductor company in Silicon Valley by that time. The device used a
250 nm process and had a 1442 mm die.

Fig. 3.15 MediaQ MQ-200 block diagram



Fig. 3.16 MediaQ MQ-200 drawing engine

The Graphics Engine was a specialized logic processor for 2D graphics opera-
tions such as BitBlts and Raster-Ops (ROP), area fill, and vertical and horizontal
line drawing. It also provided hardware support for clipping, transparency, and font
color expansion. MediaQ offered a Windows CE device driver package that used
the graphics engine to accelerate graphics and windowing performance in 8, 16, and
32 bpp graphics modes. The graphics controller freed the CPU from most of the
display rendering function with three main benefits (Fig. 3.16):
• If the CPU was busy, it did not halt the graphics controller’s accelerated operations.
The end-user did not see screen operations start and stop or slow down in most
cases.
• The graphics controller consumed less power because its display buffer and engine
were in the same device.
• The CPU was free to perform time-critical or real-time operations such as
functioning as a software modem, while the graphics engine performed the
rendering.
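The register-level interface of the MQ-200 is not reproduced here, so the following is only a behavioral sketch, in C, of what a BitBlt with a raster operation (ROP) does; on the real device the graphics engine would perform the equivalent loop out of its embedded DRAM while the CPU attended to other work.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FB_W 16
#define FB_H  8

/* A toy 8 bpp framebuffer and a BitBlt that combines source and destination
 * pixels with a selectable raster operation.                               */
typedef uint8_t (*rop_fn)(uint8_t src, uint8_t dst);

static uint8_t rop_copy(uint8_t s, uint8_t d) { (void)d; return s; }
static uint8_t rop_xor (uint8_t s, uint8_t d) { return (uint8_t)(s ^ d); }

static void bitblt(uint8_t *fb, int dx, int dy,
                   const uint8_t *src, int sw, int sh, rop_fn rop)
{
    for (int y = 0; y < sh; y++)
        for (int x = 0; x < sw; x++) {
            uint8_t *d = &fb[(dy + y) * FB_W + (dx + x)];
            *d = rop(src[y * sw + x], *d);
        }
}

int main(void)
{
    uint8_t fb[FB_W * FB_H];
    uint8_t sprite[4 * 4];
    memset(fb, 0x00, sizeof fb);
    memset(sprite, 0xFF, sizeof sprite);

    bitblt(fb, 2, 2, sprite, 4, 4, rop_copy);   /* draw the sprite           */
    bitblt(fb, 2, 2, sprite, 4, 4, rop_xor);    /* XOR with itself erases it */

    printf("pixel at (3,3) after copy then xor: 0x%02X\n", fb[3 * FB_W + 3]);
    return 0;
}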

MediaQ did well in the market and won several designs with follow-on projects.
The company became the leader in micrographics controllers. Its customers included
handset and PDA manufacturers such as Mitsubishi, Siemens, DBTel, Dell, HP,
Palm, Philips, Sharp, and Sony. Its only serious competitor was Epson, primarily in
Japan. MediaQ was in devices using different processors such as StrongARMs and
SH-4s, and it supported the major operating systems, including EPOC, WindRiver’s
VxWorks, Linux, and Palm OS.
By the mid-2000s, most PDAs had morphed into smartphones. Stand-alone PDAs
without cellular radios had died off.
The smartphone market was just getting started, Nokia was the leader, Apple had
just introduced the iPod in 2001, and rumors were swirling about its interest in the
phone market.

Nvidia had also been eyeing the mobile market and looking for an entry point to
leverage its graphics technology and reputation. The internal microdevice projects
Nvidia was running were not progressing fast enough, and management was worried
it would miss a window of opportunity.
In August of 2003, Nvidia announced it had signed a definitive agreement to
acquire MediaQ for $70 million in cash. That was a big deal at the time and caught
a few companies off guard. Nvidia named the new division GoForce and rebranded
MediaQ's graphics processors under the GoForce name.
“This acquisition supports Nvidia’s strategy of extending our platform reach and
accelerates our entry into the wireless mobile markets,” said Jensen Huang, Nvidia’s
CEO. “The MediaQ acquisition extends Nvidia’s competencies in ultra-low-power
design methodologies and system-on-chip designs, as well as in the Microsoft Pocket
PC, Microsoft SmartPhone, Palm, and Symbian operating systems.”
Just before Nvidia’s acquisition, in July 2003 the Khronos Group released
OpenGL ES 1.0. OpenGL ES 1.0, based on the desktop OpenGL 1.3 API, had some
functionality, most extensions, and much of the overhead removed.
including Nvidia and MediaQ, of course, knew all about OpenGL ES. It promised
to stabilize the mobile and handheld market and allow apps to migrate from one
platform or device to another easily—and it did just that.
The leading developer of OpenGL ES code examples at the time was Hybrid
Graphics, a small company in Finland backed by an advertising agency. Its drivers
were small, tight, and very efficient. It developed and licensed graphics technology
solutions for handheld consumer devices. Its customers were key players in the
mobile industry, including Nokia, Ericsson, Philips, Samsung, and Symbian, who
held well over 50% of the existing handheld market.
The company also had an operation for making 3D graphics images called Fake
Graphics, which it spun off in 2004. After that, the company’s most important product
was SurRender 3D (1996–2000), its OpenGL graphics library.
In 2000, Hybrid launched visibility optimization middleware, the dPVS (dynamic
Potentially Visible Set) that massively multi-player online role-playing games
(MMORPGs) such as EverQuest II and Star Wars Galaxies used. In 2006, Hybrid sold
its dPVS technology to Umbra Software, founded that same year. Umbra specialized
in occlusion culling and visibility solution technology and provided middleware
for video games.
Hybrid Graphics provided the first commercial implementations for the OpenGL
ES and OpenVG mobile graphics APIs. The company also was actively involved in
the development of the M3G (JSR-184) Java standard.
Nvidia was still trying to get into the mobile business and realized its driver
development for OpenGL ES was not progressing fast enough. To remedy that,
Nvidia bought Hybrid Graphics for an undisclosed amount (which meant less than
$20 million) in March of 2006.
“Since its inception, Hybrid has successfully delivered graphics solutions to
hundreds of millions of handheld devices,” said Mikael Honkavaara, CEO of Hybrid
Graphics. “We provide innovative graphics technology, and we make it work in the
real world, in real devices.”

Nvidia now had two of the four components needed to enter the mobile market;
it still needed Arm processor technology and experience and some radio capability.
And although the company had made complex integrated graphics processors (IGPs)
and GPUs, it had no experience with heterogeneous SoCs or radios.
In 1999, the same year Nvidia would introduce its PC-based GPU, six guys
formed a media chip semiconductor company they named PortalPlayer. In 1998,
the backer and mentor of 3Dfx, Gordon (Gordie) Campbell, visited chip pioneer
National Semiconductor, hoping to interest them in building a chip system for an
MP3 player. National was not interested, as it was pursuing its own Cyrix media
processor. However, John Mallard, National’s CTO, was impressed, and as the
legend goes, followed Campbell out to the parking lot after the meeting to discuss it
further. Mallard and four other National engineers, with $5 million in funding from
Campbell’s Techfund and J. P. Morgan, formed PortalPlayer in 1999.
The company developed an MP3 player chip with a flexible design that allowed
customers to select different functions to create a unique custom chip. Word
of the company’s flexible design got to Apple, where Steve Jobs was trying to get
an MP3 player project off the ground—the timing was perfect. Several companies
were bidding on the Apple project, including mighty Texas Instruments, considered
the logical winner with its strong DSP background. But the Texas Instruments part
was too big and used too much power. In the summer of 2001, Apple picked the
start-up PortalPlayer [8]. The young, energetic company poured its collective heart
and soul into the Apple project, and Apple introduced the iPod in November of that
same year. That put the company in the headlines, and soon after, Samsung, RCA,
Rio, and others would place orders with the firm. However, Steve Jobs did not want
any information about Apple to get out, and he was not happy about the notoriety
PortalPlayer was getting.
In 2002, founder CEO Mallard stepped down, and Gary Johnson, who had been
running S3, came in as CEO. MP3 players were taking off, and over 4 million of them
sold in 2002; by 2004, it had climbed to 7 million. Information about PortalPlayers’
relationship with Apple dried up. Johnson would not concede the relationship even
existed out of fear of offending the company's biggest, and notoriously secretive,
customer. “I’m not even going to refer to those guys,” he told Forbes magazine [9].
PortalPlayer had a unique design and maybe a year's lead on the competition, and
everyone wanted Apple’s business.
In November 2004, PortalPlayer filed an IPO and went public. In its disclosure,
it revealed that Apple represented 90% of the company’s $48 million revenue, and
PortalPlayer had 85% of the MP3 market. Furthermore, the company revealed to
investors anxious to get a piece of the action that PortalPlayer was likely to get the
design win for Apple's follow-on product, the nano iPod, expected out in 2005.
But PortalPlayer had committed the unforgivable sin of mentioning Apple’s name.
When asked about PortalPlayer, Jobs would only say it was one of many suppliers.
Jobs could not stand sharing the spotlight with anyone and was notoriously famous
for keeping his key component suppliers a secret from the competition. The damage
had been done, and Apple selected Samsung for the MP3 processor for the Nano
iPod in 2006. With the new device coming out, orders from Apple for the old device
fell off drastically. In June, PortalPlayer announced it would reduce its workforce
by 14% following its failure in April to secure the Apple business. A month later,
PortalPlayer’s President and CEO Gary Johnson said he would resign by the end of
the year. PortalPlayer’s quarterly earnings of $34.6 million were down 50% compared
to the previous year.
And then in November, Nvidia reported it would acquire PortalPlayer to beef up
its development of personal media players (PMPs). Nvidia said it would pay about
$357 million for PortalPlayer. Now, Nvidia would have three of the four components
for a smartphone chip. But why the company paid so much for a failing company
remained a mystery. People thought if Nvidia had waited a few months, it could have
gotten the company for a third of what it paid. But Jensen Huang saw PortalPlayer’s
IP and technology as too valuable to pass up. PortalPlayer’s product mix of semicon-
ductors, firmware, and software platforms for PMPs would fit neatly into Nvidia’s
family of graphics processing units, Nvidia said [10].
Nvidia was smart and always on the hunt for talent. By acquiring PortalPlayer,
Nvidia would pick up approximately 300 employees, including more than 125 in
Hyderabad, India; about 200 of PortalPlayer’s employees were engineers. Nvidia
had been expanding its operation in India lately, so it all tied together nicely.
Why would Nvidia want PortalPlayer? Because Nvidia was going to design and
sell an application processor for the handheld market. The company was already an
Arm licensee and had started its own applications processor project. But the project
was not moving as fast as management wanted, and the PortalPlayer team was a
lucky pickup to expand and accelerate the project.
PortalPlayer was also an Arm licensee for the ARM7 cores and provided not
only the iPod's sound chip but also its CPU. As Nvidia was acquiring PortalPlayer,
rumors of Apple's iPhone were circulating. The stories said PortalPlayer would be
the SoC supplier for it. It was speculated that $150 million of the $357 million
Nvidia offered for PortalPlayer was expected to come from the Apple deal. However,
PortalPlayer/Nvidia never got the contract, and Apple's iPhone SoC was built
by Samsung and used Imagination Technologies' GPU IP. Once again, PortalPlayer
had committed the sin of speaking about Apple. Until late 2005, it was seriously
being considered for the iPhone, but when Jobs heard the rumors, he went back to
Samsung. PortalPlayer had ignored the first rule of the Apple Fight Club: never talk
about Apple.
Nvidia began developing its GoForce Arm-based applications processor in 2006
and showed it at the 3GSM conference in Nice, France, in February 2007. The GoForce
6100 was a legacy part from PortalPlayer (so it was not really Nvidia's first apps
processor), but Nvidia formally launched it and had a customer (SanDisk, for the Sansa
View PMP).

3.6.1 Tegra is Born

In the summer of 2008, in Taipei, Nvidia launched a new brand and product line
around the design and named it Tegra. Tegra joined GeForce, Quadro, and Tesla, and
Nvidia quietly put the mobile GoForce product line to sleep.
Nvidia introduced the Tegra APX 2500 SoC in 2008 with a 300–400 MHz integrated
GPU and a 600 MHz Arm 11 processor. Audi incorporated it in its entertainment
systems, and other car companies followed. In March 2017, Nintendo announced it
would use the Tegra in the new Switch game console.
Tegra was Nvidia’s Arm-based family of small-package, low-power, high-
performance application processors targeting the handset market (APX 2500) and
the mobile Internet devices (MID) market (Tegra 650 and 600), which were SoCs,
illustrated in Fig. 3.17. Nvidia had also taken the step to define the MID market.
Nvidia’s first apps processor was the APX 2500, announced at Mobile World
Congress in February 2008. In December, Nvidia pushed its SoC, Tegra, initially
planned for late 2008, back to late spring of 2009. These SoCs are difficult, Nvidia
said.
In February 2009, Nvidia announced the APX 2600.
Microsoft’s Zune HD media player was one of the first products to use the Tegra
in September 2009. Samsung employed it in its Vodafone M1, and Microsoft’s Kin
was the first cellular phone to use the Tegra. Microsoft did not have an app store, so
the phone did not sell very well.
Then, in May 2009, Tegra was the weak sister in Nvidia’s suite of products,
although not for lack of investment or patience. It was difficult to understand the
passion the company had for Tegra. It was a low ASP (and margin) part; the category
had huge competitors (Qualcomm, TI, Freescale, Broadcom, Marvell, Samsung, etc.),
and the carriers controlled the market. Also, Intel was going to re-enter the market
and disturb it even more. The HTC Android phone that Nvidia showed at the Work
Group Meeting (WGM) conference did not appear either. A couple of no-name white
box navigation units and one stock-keeping unit (SKU) in the Audi did not make for
a sustainable business.2

Fig. 3.17 Symbolic block diagram of the Nvidia TEGRA 6x0 (2007)

Fig. 3.18 Nvidia's Tegra road map (2011)
However, in Nvidia’s defense, the company had always said Tegra would not ramp
up meaningfully until 2H 09 and privately said it never expected big-name wins in
the first half of the year. Nvidia still believed Tegra should be at least $25 m–$50 m
in FY10 and would exit 2010 at a run rate > $100 M/year, with a fast ramp. That did
not happen.
In January 2010, at the Consumer Electronics Show (CES), Nvidia demonstrated
its next-generation Tegra SoC, the Nvidia Tegra 250, and showed off devices from
NotionInk, ICD, Compal, MSI, and Foxconn. The Tegra platform used a dual-core
Arm Cortex-A9 CPU running at speeds up to 1 GHz, with an eight-core GPU. The
company announced that it supported Android on Tegra 2 and that booting other
Arm-supported operating systems was possible on devices where the bootloader was
open. The company also announced that Ubuntu Linux distribution support was
available at the Nvidia developer forum.
In August 2010, at the IFA conference in Berlin, Nvidia OEMs introduced a
series of tablets based on the Tegra 2. IFA was the largest consumer electronics trade
show in Europe, known as the European CES, an important conference with lots of
new product announcements, making it challenging to get noticed. However, Nvidia
had such a strong brand and leading GPU technology that it did get noticed in its partners'
exhibits.
In late September 2010, Nvidia said it had almost finished the Tegra 3 and was
working on the Tegra 4 (Fig. 3.18).
Nvidia announced its quad-core SoC, code named Kal-El, which would become
the Tegra 3, at the Mobile World Congress in Barcelona, in February 2011.

2 SKU—a stock-keeping unit—a number or string of alpha and numeric characters that uniquely
identify a product.

Then in May 2011, Nvidia announced it would acquire Icera, a baseband chip-
making company in the UK, for $367 million. Now, Nvidia had everything it needed
to make an SoC for smartphones.
In 2012, Nvidia unveiled its Project Kai, a Tegra 3 reference design, called
Tegra Tab, for low-cost tablets. The Google Nexus 7 was based on it. Tegra 3 was
in many Android 7-in. tablets, Windows Surface tablets, and other Android tablets.
It also found a home in many infotainment devices. Revenue for the second half of
2012 showed big growth for Nvidia.
Just after CES 2013, in February and ahead of the Mobile World Congress event
in Barcelona, Nvidia announced the Tegra 4i (code named Grey). It had hardware
support for the same audio and video formats as Wayne (the Tegra 4) but used the Cortex-A9
instead of a Cortex-A15. The Tegra 4i was a low-power variant of the Tegra 4 designed
for phones and tablets and had an integrated long-term evolution (LTE) modem. Then
in 2013, at Computex in Taipei, Nvidia showed the Tegra Note. It was a $199 Android
reference design tablet (a technical blueprint for others to copy) that used the
Tegra 4 processor.
Although many OEMs embraced Nvidia’s tablet designs based on various gener-
ations of Tegra, those tablets did not sell well. They were generally a little thicker and
heavier than the Apple iPad and Samsung tablets based on Qualcomm. The Nvidia
tablets also did not get very good battery life compared with the competition.
The Tegra 4 shipped in some China phones (Xiaomi). Following T4 was TK1
(integrating the same CUDA GPU from GeForce into Tegra). TK1 went into Android
tablets.
Nvidia got a few design wins in phones and tablets but not enough to sustain a
business and support the necessary R&D.

3.6.2 Nvidia Enters the Automotive Market (2009)

Nvidia started investigating the automotive market in 2008, thinking its Tegra processor
would provide a powerful solution for the new console navigation and multi-seat
entertainment screens.
In June of 2009, Audi announced Nvidia would power the display system in the
2009 Q5.
Nvidia knew the automotive market would not come close to the expectations of
the mobile market but would provide some ROI. Here, Nvidia surprised itself with
its success.
The company’s Logan Tegra chip had 192 shaders and could drive two displays.
It had four Arm A15 processors with Neon SIMD and shadow LP cores.
Nvidia introduced image signal processors (ISPs) and support for OpenCV in
its Tegra (T124) Logan chip. The chip made full use of the Khronos OpenVX API
designed as an enabler to the popular OpenCV libraries.
Machine Vision was a logical application for vehicles. Nvidia’s Danny Shapiro,
Senior Director of Automotive, said:
Progress in the automotive field could be deceptive because the process from design to
implementation takes a long time. Nvidia established itself in the automotive field, with over
23 automotive brands signing on as of 2014.

Shapiro added that Nvidia had shipped over 4 million chips to date in 2014, and
he forecasted 25 million in shipments when the recent design wins came to market
(Fig. 3.19).

Fig. 3.19 Nvidia offered its X-Jet software development toolkit (SDK) software stack for
automotive development on the Jetson platform
In 2014, Nvidia embedded its Jetson computing boards as part of its Advanced
Driver Assistance Systems program. It offered imaging for panoramic view, night
vision (infrared), autonomous parking, collision detection, and pedestrian detection
and tools for developers.
Nvidia was active in the OpenCV group, initially formed by Intel in 1999 to
explore and promote applications for computer vision. At that time, Intel was inter-
ested in how computer vision could help sell CPUs. The group, however, devel-
oped into a broad-based and active community of cross-platform developers, and an
OpenCV for Android subset emerged with features that could extend the usefulness
to mobile devices and cars. Nvidia contributed algorithms to OpenCV, and developed
its offshoot, OpenCV for Tegra. Functions included algorithms for image processing,
video, stereo, and 3D.
In January 2015 at CES, Nvidia announced its Drive platform for autonomous
car and driver-assistance functionality. In January 2017, at CES, Nvidia announced
a partnership with Audi to put the world’s most advanced AI car on the road by
2020. As discussed in the Games Consoles chapter in this section, Nintendo chose
the Tegra for its very popular Switch portable game console.
In June 2020, Nvidia announced it had formed a partnership with Mercedes-Benz
to build advanced, software-defined, and autonomous vehicles. Mercedes said its
fleet would become perpetually upgradable, backed by teams of software engineers
and AI experts, starting in 2024 (Fig. 3.20).

Fig. 3.20 Mercedes concept car of the future. Courtesy of Nvidia
In April 2021, at Nvidia’s virtual GPU Technology Conference, Nvidia announced
the next-generation SoC, code named Atlan.

3.7 Bitboys 3.0 (2002–2011)

Founded in Noormarkku, Finland, in 1991, Bitboys was an innovative and dedicated


developer of 3D graphics processors. The company could have been the inventor of
the integrated GPU with their Pyramid 3D chip had things gone a little better for
it in 1996 (See book one, The Steps to Invention). It kept trying and introduced the
Glaze3D GPU in 1999, but it too failed to reach liftoff, and so the company quit
trying to enter the PC 3D GPU race and in late 2002 turned to mobile devices.
Over a couple of beers one night in late 2002, one of the founders asked, what do we
do now? Petri Nordlund, then CTO of Bitboys, said he had been thinking about how to
reduce some of their designs so they could be smaller and more efficient and maybe run
in a phone. Finland was the leader in mobile phones at the time, and everyone there
already had one. Coming from the demo days when people were giving impressive
demos using only 64 kB of RAM, Petri always wondered why other companies were
using so many transistors. Mikko Saari, who had been running the place, said, “We
have a few million in the bank; we can live off that until you guys get something
running, and then we can show it and maybe get some new investors.” A plan was
made, and another round of beers was ordered.

3.7.1 End Game: Bitboys’ VG (2003)

In March 2003, Mika Tuomi visited NEC in Minato-ku, Tokyo, Japan. While there,
he was wrestling with how to efficiently run a 2D engine in minimum memory and
with minimum power. He stepped out on a fire escape to get some fresh air, and while
looking at the distant Tokyo skyline, he had an epiphany. That brainstorm led him
and the team to the VG engine, a highly efficient 2D controller.
The photo in Fig. 3.21 was taken at the Bitboys office in Noormarkku, Finland,
around 2003.
“The Falanx guys paid us a friendly visit,” said Nordlund. “We were competitors
back then but good friends nevertheless.”
By January 2004, Bitboys had established new relationships with mobile phone
suppliers NEC and Nokia. Nokia talked with and watched Bitboys’ progress for a
while, as had NEC through regular meetings at Khronos.
As a result of the Japan epiphany and some funding from NEC, that year NEC
built a mobile phone display integrating Bitboys' recently developed G10 vector
graphics processor into the display driver silicon. NEC had financed the company
with significant monthly payments until Bitboys ended that funding so they could
work on 3D hardware.
By that time, Khronos was operating and drawing in everyone and anyone who
wanted to be in the handheld space. Bitboys were a member of Khronos, and Bitboys’
VG became the foundation for Khronos’ OpenVG API (Fig. 3.22).

The OpenVG group was formed on July 6, 2004, by several firms: 3Dlabs,
Bitboys, Ericsson, Hybrid Graphics, Imagination Technologies, Motorola,
Nokia, PalmSource, Symbian, and Sun Microsystems. Other firms including
chip manufacturers ATI, LG Electronics, Mitsubishi Electric, Nvidia, and
Texas Instruments and software- and/or IP vendors DMP, Esmertec, ETRI,
Falanx Microsystems, Futuremark, HI Corporation, Ikivo, HUONE (formerly
MTIS), Superscape, and Wow4M also participated. The first draft specification
was at the end of 2004, and the 1.0 version of the specification was released
on August 1, 2005.

Bitboys had long used SIGGRAPH as a place to try out new ideas and as a
friendly venue for new product announcements. In 2004, Bitboys planned to show
its repositioned graphics technology for wireless and embedded graphics processors.
The company had been using the name Acceleon for its mobile graphics processors
line. In their initial announcements, the Acceleon line included plans for three cores,
the G10, G20, and G30. True to Bitboys' conviction that less is more, the G10 was
designed with just 60,000 transistors and yet included full-screen anti-aliasing, hardware
texture decompression, programmable geometry, pixel shaders, and hardware
vector graphics. The company hoped to license the new design to semiconductor and
mobile phone manufacturers as IP cores for SoC manufacturers' products.

Fig. 3.21 In the back row from left: Petri Nordlund, Kaj Tuomi, and Mika Tuomi from Bitboys. In
the front row, Falanx, from left: unknown (guy in blue jeans), Mario Blazevic, Jørn Nystad, Edvard
Sørgård, and Borgar Ljosland. Courtesy of Borgar Ljosland
However, no matter how good Bitboys’ products looked at SIGGRAPH, the
general feeling, including among customers, was that they had better have something
concrete, given that they had made so many announcements at so many SIGGRAPHs
in the past. Bitboys was a member of the Khronos Group and put its support behind the
new Futuremark benchmark, which at least demonstrated the company's commitment
to standards and that it would aim for commonly agreed upon benchmark goals.

Fig. 3.22 Bitboys' Acceleon handheld prototype and the art it is rendering. Courtesy of Petri Nordlund
The new Bitboys product line covered the entire range of mobile phone platforms.
The low-end G32 graphics accelerator covered the volume-market wireless
devices. It formed the basis of Bitboys' new product line. It had been designed to
be compatible with the new OpenGL ES 1.1 API. They put emphasis on producing
a minimal design size and very low power consumption to make it suitable for
volume-market wireless devices.
The mid-range G34 targeted wireless gaming. An evolution from the G32 design,
the G34, was a 2D/3D graphics processor core for the OpenGL ES 1.1 feature set. It
had added performance through a programmable geometry engine and support for
programmable vertex shaders, qualifying it as a GPU. The G34 could support scenes
with numerous animated characters with tens of thousands of polygons, rendered
with anti-aliasing, 32-bit color at more than 30 frames per second. The floating-point
geometry processor allowed for advanced scene and object complexity.
G40 was a bigger version of the G34 (see Fig. 3.23). It also had programmable
2D, 3D with T&L, vector graphics acceleration, and OpenGL ES 2.0 compatibility.
Hardware acceleration was provided for all forms of graphics, including 2D bitmap
graphics, vector graphics, and 3D graphics. It could be used for everything from UIs
to applications and games.

Fig. 3.23 Bitboys' G40 mobile GPU organization

Bitboys based the rendering pipeline on OpenGL ES 2.0 shader architecture [11].
They also emulated fixed functions using the programmable pipeline.
G40 rendering features included the following:
• 2D graphics rendering
– BitBlt, fills, ROPS (256)
– Small separate core for rendering bitmap-based user interfaces


• Vector graphics rendering
– SVG Basic level feature set, targeting OpenVG
– Anti-aliased rendering of concave and convex polygons
– Rasterization integrated into the 3D pipeline
– Support for linear and radial gradients
– Arbitrary clip paths
– 10–50x performance over software rendering
• 3D graphics
– Transformation and lighting in hardware
– Floating-point vertex and pixel shaders
– Multi-texturing: Four textures per-pixel
– Fully programmable architecture, no fixed function pipeline
– Full-screen anti-aliasing
– Packman hardware texture decompression.
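As noted above, Bitboys emulated the fixed-function stages on the G40's fully programmable pipeline. A rough illustration of the idea: the per-fragment arithmetic of the classic OpenGL ES 1.x modulate texture stage is just a component-wise multiply, shown below as reference C; on a programmable part, the same expression would simply be compiled into a fragment shader rather than wired into dedicated logic. This is a generic sketch, not Bitboys' code.

#include <stdio.h>

typedef struct { float r, g, b, a; } Color;

/* The OpenGL ES 1.x fixed-function GL_MODULATE stage combines the
 * interpolated (lit) vertex color with the sampled texel by a
 * component-wise multiply. A programmable pipeline reproduces the
 * same result with one instruction in a fragment shader.              */
static Color modulate(Color vertex, Color texel)
{
    Color out = { vertex.r * texel.r, vertex.g * texel.g,
                  vertex.b * texel.b, vertex.a * texel.a };
    return out;
}

int main(void)
{
    Color lit_vertex = { 0.8f, 0.8f, 0.8f, 1.0f };   /* output of T&L/lighting */
    Color texel      = { 1.0f, 0.5f, 0.25f, 1.0f };  /* sampled texture color  */
    Color frag = modulate(lit_vertex, texel);
    printf("fragment = (%.2f, %.2f, %.2f, %.2f)\n",
           frag.r, frag.g, frag.b, frag.a);
    return 0;
}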
During a presentation at the 2004 Hot Chips conference in Palo Alto, CA, Petri
Nordlund said the design was targeted for 90 or 65 nm (the process technology used
for mobile phone SoCs in that timeframe). The estimated peak clock frequency was
200 MHz.
The scene complexity and performance targets were as follows:
• 60 FPS
• 20–30k polygons/frame
• QVGA or VGA display resolution
• Depth complexity 5
• Relatively complex pixel shaders
• High sustained pixel fill rate.
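Taken together, those targets implied a sustained rate of roughly 1.2–1.8 million polygons per second (60 frames/s × 20,000–30,000 polygons per frame) and, at VGA resolution with a depth complexity of five, a fill rate on the order of 640 × 480 × 5 × 60 ≈ 92 Mpixels/s, well within reach of a one-pixel-per-clock design at the estimated 200 MHz peak clock.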


In the 2007–2010 timeframe, the target market was mobile phones. “We expect
3D graphics breakthrough in mobile phones in the 2006 timeframe,” said Nordlund,
“Japan first, then Europe, followed by the U.S.”
Bitboys’ view of the competitive landscape for embedded graphics hardware was
as follows:
The early contenders
• Bitboys G10: SVG Tiny vector graphics acceleration
• Other proprietary, non-standard 3D graphics hardware.
Designs conforming to OpenGL ES 1.0
• ATI Imageon, Nvidia GeForce
• Bitboys G30
• Imagination MBX
• Mali series from Falanx
• Sanshin’s G-Shark.
Designs conforming to OpenGL ES 1.1
• Bitboys G32 and G34.
Future standards
• Targeting programmability, OpenGL (ES) 2.0
• Bitboys G40.
The programmability of Bitboys’ processors provided for future graphics API
compatibility, which would extend support for the next-generation 3D graphics
APIs and shading languages. The programmable pixel processor shaders were per-
pixel executed programs that allowed developers to generate realistic-looking object
surfaces like metals, woods, water, lighting effects, and reflections.
Right after SIGGRAPH, in August 2004, Bitboys announced it had licensed its
G34 graphics processor design to NEC Electronics [12]. In addition to the features
mentioned above, the graphics processor offered multi-texturing, trilinear texture
filtering, and 32-bit internal and external color. Furthermore, it supported industry-
standard graphics APIs such as OpenGL ES, Microsoft Direct3D Mobile, and M3G
for J2ME.
Bitboys enhanced the performance of the G34 further by a programmable
geometry engine that supported programmable floating-point vertex shaders, which
enabled detailed, lifelike scenes and animations.
At the time, Bitboys had 33 dedicated graphics professionals, less than a tenth of
what ATI or Nvidia had. “What do they need all those people for?” asked a member
of the Bitboys team.
Looking at the product segmentation of the mobile phone market, Bitboys
concluded (with a bit of encouragement from NEC) that as good as their graphics
processor design was, it only satisfied the high-end segment. Most of the market,
more than 85%, was in the mid-range and entry-level segments. So, at the 2005
Games Developer Conference in San Francisco, Bitboys announced the G12, its
vector graphics processor for mobile devices [13].
The Bitboys G12 was an extremely compact vector graphics processor that
supported OpenVG 1.0 [14] and SVG Tiny 1.2 [15] graphics hardware rendering. The
company claimed it would deliver over 60 frames per second, more than a 100-fold
improvement over software-based vector graphics rendering. The processor, said
Bitboys, operated at extremely low power consumption levels and did not require
much CPU capacity.
Vector graphics technology offered small file sizes for high-quality graphics, scal-
ability to any display size, and lossless compression without JPEG artifacts. Vector
graphics would allow content to be rendered to the greatest possible detail and quality,
animation was efficiently executed, and it used very little memory. Also, vector
graphics facilitated the smooth rendering of anti-aliased text and complex non-Latin
fonts such as Kanji.
The advanced vector graphics features of Bitboys G12 enabled the following:
• Fast, ultra-clear user interfaces
• Animated, interactive advertisements for network operators
• Scalable, rotating maps with small file size, including animations and corporate
logos
• Cartoons, anime, greeting cards, games, and other mobile entertainment applica-
tions.
Because vector graphics content took very little memory space and its
data compressed well, it could easily get distributed over a wireless network, such
as CDMA (code division multiple access) and GSM (global system for mobile
communication).
As much as we all love 3D graphics, it is 2D that does most of the work, but it
must be good 2D, with sharp lines and anti-aliased edges. To do it right without using
excessive amounts of power or transistors takes real skill. The market potential for
2D or vector graphics acceleration in 2005 was over 425 million units, and more if
one added dedicated vector acceleration to high-end phone 3D engines.
Bitboys G40 also included hardware acceleration of vector graphics by a
2D/3D/vector graphics processor; Bitboys introduced it in August 2004 and targeted
high-end multimedia mobile phones.
Things were rocking along for the boys, sales were picking up, profits were almost
there, and then in late 2005, Nokia Growth Partners invested $4 million for 14% of
Bitboys (which came back to them at about 1.2x, in about a year—not bad) [16].
And then, in May 2006, word came out that Bitboys, with about 40 employees,
would be acquired by ATI for $44 million and integrated into ATI’s handheld business
unit and form the nucleus for a key design center for graphics software design in
Europe.
“Bitboys dovetails into ATI perfectly,” said Paul Dal Santo, general manager of
ATI’s handheld business unit [17].

Also, ATI had just announced a new strategic relationship with … Nokia! Nokia
and ATI were working on 3D gaming and mobile TV.
Nokia said that it had started a long-term strategic relationship with ATI to develop
music playback, 3D gaming, mobile TV, and video functions for its handsets for
Nokia’s customers worldwide. ATI would provide the tools chain and software
development toolkit (SDK) for multimedia developers in the fall of 2006.
“Our role,” said Dal Santo, “is to enable all content, from ultra-high-quality music playback
to 3D gaming, and we will jointly guide and support the members of the content

“We want to make sure that with the NSeries devices, we are pushing the multi-
media experience,” said Damian Stathonikos, a spokesman for Nokia. Nokia had
recently introduced new features to the NSeries line for multimedia facilities such
as video capture.
Nokia had been increasingly emphasizing its multimedia devices, and with good
reason: The Nokia N70 multimedia phone was the highest revenue generator for the
company.
As close as Bitboys was to Nokia, Bitboys just was not big enough or deep enough
(in terms of tools, middleware, and range of solutions) to make a giant like Nokia
completely comfortable, giving Bitboys a design win. Big companies like to deal
with big companies, knowing they would be around and providing support and a
sustainable road map. The big guys cannot afford point products.
Nokia pretty much dictated what hardware it wanted integrated into chips it bought
from ST Microelectronics and TI. So Bitboys talked with all those companies and
eventually made a licensing deal with TI on the G40 GPU. News of the deals rippled
across Wall Street, and the financial analysts were tripping all over themselves to
point out how they had always liked ATI, even when they were calling for shorting
it and dumping it.
As all this was going on, Bitboys were pretty far along in merging with Hybrid. At
the end of the day, the deal was canceled from the Hybrid side, and in March of 2006,
Nvidia bought Hybrid Graphics for an undisclosed sum (estimated by analysts as
about $10 million). As a result, ATI was expected to quickly pull Bitboys stuff from
Hybrid and replace all that software with its own. If Nvidia bought Hybrid because
they thought that would be their path to Nokia, they made a mistake. The employee
at Nokia who was on Hybrid’s BOD (as a Nokia representative) was working for
ATI when the ATI–Nokia deal went through [18].
Then, in May 2006, ATI announced it had acquired privately held Bitboys Oy (for
about $44 million). Bitboys brought valuable engineering experience, technology,
and customer relationships that enhanced ATI’s existing mobile phone multimedia
offerings. Based in Finland, the Bitboys team became part of ATI’s handheld business
unit and formed the nucleus for a critical design center for ATI in Europe—all that
technology flowed to Qualcomm through their licensing deal [19].
A few months after ATI acquired Bitboys and won the deal with Nokia, in July
2006, AMD acquired ATI for $5.6 billion. The mobile unit that included Bitboys
personnel was renamed Imageon. But the market for discrete GPUs in mobile devices
was disappearing. The mobile group in AMD moved into IP, and their biggest
customer was Qualcomm. ATI sold IP and tech support to Qualcomm and, in January
2009, sold the Imageon group to Qualcomm for $65 million. Qualcomm established
the operations as its Finnish development center.
Before and after the acquisition, Bitboys helped develop Qualcomm’s Adreno
GPU.
As the British IT journal The Register put it, Bitboys became “known chiefly for
its ebullient performance claims and a string of missed deadlines.” [20].
Commenting on that in an interview with Juha-Pekka [21], Mikko Saari, a
Bitboys cofounder who was then country manager of Qualcomm Finland and a
friend of the Tuomi brothers since childhood, conceded:
The criticism was not unfounded. Bitboys were overoptimistic about what could be
achieved—not technologically, but outside of the engineering work. We could have tried
to fix our reputation, but we thought we would handle it with just getting on with our work.
We have learned our lesson. Luckily pros did not care. Nokia funded us in 2004, and they
would not have thrown their money into a sink.

Bitboys were part of ATI, then AMD, and then Qualcomm. The Qualcomm Adreno
220/225 GPU was based on Bitboys’ GPUs, developed in Finland. The architecture
was code named “Leia.” (Fig. 3.24).
Two years later, due to internal disagreements on design philosophies between
Qualcomm’s San Diego GPU design group, and the Finnish GPU design group,
Qualcomm shut down the Finland operations and laid everyone off.
A former Qualcomm executive commented in 2021 that “the Bitboys mythology
was highly overrated.”
In the 1950s, a prize-fighter comic book character named Joe Palooka (the strip
debuted in 1930 and ran till 1984) could take a punch and always get up and win the
fight. The Ideal toy company made a 48-in.-tall inflatable punching bag in 1952 that
had a weighted bottom. A child could punch the bag as hard as he wanted and knock
it down, and then as if nothing had happened, the bag would pop back up, just like
Joe Palooka (Fig. 3.25).

Fig. 3.24 Mikko Saari, 2009. Courtesy Mikko Saari

Fig. 3.25 Ideal's Joe Palooka punching bag. Source thepeoplehistory.com

The Bitboys team was Joe Palooka. You could knock them down, and they would
pop right back up.

3.8 Qualcomm’s Path to the Snapdragon GPU (2004–)

Qualcomm initially offered display support in the pre-Snapdragon days through
a fixed-function 2D display controller with an integrated DSP (upstream of the
integrated display controller) that was part of the CDMA modem.
The smartphone market began to blossom in the early 2000s, replacing feature
phones. Screens and display resolution increased every year, and it became apparent
that a 2D controller and DSP could not scale to the projected future displays.
In 1999, Qualcomm developed the Binary Runtime Environment for Wireless
(BREW), an application development platform for CDMA mobile phones, featuring
third-party applications such as mobile games. Some feature phones and mid-to-
high-end mobile phones used it, but not in smartphones. It debuted in September
2001 as a platform for wireless applications on CDMA-based mobile phones.
ATI had been in the feature phone market since 2000 with the Imageon proces-
sors, which were simply camera image signal processors (ISPs), multimedia MPEG
decompressors, and display drivers. In early 2004, Qualcomm signed a licensing
agreement with ATI for Imageon technology. At the time, the leading smartphone
chip suppliers were TI, with its Imagination-based OMAP processor, and Intel, with
its XScale Arm-based chip.

At the Consumer Electronics Show (CES) in January 2004, claiming to deliver
the first 3D gaming chip for cell phones, ATI demonstrated its new Imageon 2300.
It showed surprising 3D performance on mobile phones with little impact on power
consumption. Playing 3D-enabled games on mobile phones was now possible, said
ATI, thanks to its new multimedia and graphics family of coprocessors [22].
Qualcomm said in 2004 that it would use ATI’s Imageon technology for its next-
generation Mobile Station Modem baseband solutions, the MSM7xxx family of
chipsets, and would interface ATI Imageon ASICs to its then-current MSM. In July
2004, Qualcomm introduced its first integrated graphics processor in the MSM6150
and MSM6550 using ATI's Imageon graphics processor, shown in the block diagram
in Fig. 3.26.
The graphics processor could support the functionality of 100k triangles/s and
7 Mpixels/s for console-quality gaming and graphics.
LG was one of the most aggressive handset suppliers in games, and it adopted the
Qualcomm MSM 7500. In 2005, LG Electronics said it had the world's fastest 3D
GamePhone—the first cell phone ever armed with a million polygons-per-second
graphics accelerator chip (which was ATI's).

Fig. 3.26 Qualcomm's MSM6550 SoC
In February 2005, at the annual 3GSM conference, Qualcomm introduced its new
Q3D Dimension platform and MSM6xxx hardware [23]. In addition to developing
application processors for the handheld market, Qualcomm developed a software
architecture to complement its products.
The Q3Dimension gaming solution was part of Qualcomm’s Launchpad suite
of software technologies. Q3Dimension was a development tool that enabled the
adaptation of PC and Console games to the wireless platform. The problem was that
Q3D did not have any 3D hardware to speak of; it was an aspirational title.
Just as was seen in the PC market, in 2005, there was an explosion in semi-
conductor suppliers for mobile devices who either were building or threatening to
build 3D hardware accelerators for mobile devices. There were three ways to get
3D hardware acceleration to a handheld system: With a coprocessor like the ones
Alphamosaic, ATI, and Nvidia were offering; with an application processor with
dedicated graphics acceleration like Qualcomm and Renesas offered; or with an SoC
like the ones TI and Renesas offered and Freescale, Intel, and Qualcomm and were
working on. They had embedded IP from graphics IP companies like ATI, Bitboys,
DMP, Falanx, Imagination, Nvidia, or Vivante. In 2005, 33 companies were in one
or more of those categories.
At about the same time Qualcomm signed its licensing arrangement with ATI, it
also launched a program to adopt the Arm processor. Then, in 2006, Qualcomm
announced a new platform, Snapdragon, in which the Scorpion CPU would be
used alongside several other processors and coprocessors. According to Qualcomm,
Snapdragon would serve a range of high-performance mobile applications, such as
high-end smartphones and mobile Internet devices.

3.8.1 The Adreno GPU (2006)

Adreno (thought to be an anagram of AMD's graphics card brand Radeon) started as
Qualcomm's in-house brand of graphics technologies and was announced in 2006 as
part of the S1 MSM7200. Qualcomm's Snapdragon was now part of the GPU club.
In March 2007, Qualcomm launched the 100 Million Gaming Phone Alliance. The
alliance’s goal was to have 100 million hardware-enabled 3D gaming phones in the
market by 2009. The alliance consisted of telecom operators, handset manufacturers,
content developers, and Qualcomm.
Also, in 2007, Tim Leland joined Qualcomm and took over the graphics group.
Leland had worked at Nvidia for the previous six years and was running its Tegra
graphics group. Tegra was an SoC Nvidia had developed with the hope of getting
into the booming smartphone market. Tegra could not meet the power requirements,
and Nvidia withdrew from the market in 2014 and later repositioned the chip as an
automotive processor for dashboards and ultimately autonomous driving, and low
power game consoles.

Fig. 3.27 Qualcomm’s Snapdragon SoC with Adreno GPU

In 2008, Qualcomm launched the Snapdragon S1 with the Adreno 200, which
incorporated the ATI-Yamato core IP (Z430)—see Fig. 3.27. Yamato was a cutdown
version of the Xbox 360 GPU for the mobile market. The S1/Adreno 200 marked
the first fully 3D hardware-accelerated GPU in a Qualcomm SoC. The 337-million-
transistor Xenos was one of two dies in a package with eDRAM. Xenos was ATI's
first implementation of the unified shader architecture. It was also a quasi-tiled renderer, which
was why Adreno was a tiled renderer.
On January 20, 2009, the story ended when AMD announced it would sell its strug-
gling handset division to Qualcomm for $65 million [24]. The agreement meant that
Qualcomm would acquire AMD’s handset division and that the company also inher-
ited the graphics, multimedia technology, and other IP from the business. Qualcomm
planned to use that technology and IP to develop new types of 2D and 3D graphics
technologies for handset devices and cell phones and enhanced audio, video, and
display capabilities. Qualcomm also extended job offers to some of AMD’s handset
division employees.
Qualcomm had worked with AMD and ATI before. Since 2004, the Snapdragon
Adreno GPU used ATI’s Xenos GPU IP (as did Microsoft’s Xbox360).
“This acquisition of assets from AMD’s handheld business brings us strong multi-
media technologies, including graphics cores that we have been licensing for several
years,” Steve Mollenkopf, executive vice president of Qualcomm and president of
Qualcomm CDMA Technologies, said [25].
Qualcomm expanded and enlarged Snapdragon in the subsequent years, adding
multiple DSPs and ISPs.

3.9 Second Decade of Mobile GPU Developments (2010 and On)

By 2010, the number of SoC suppliers in the mobile market was shrinking and by
2017 would be reduced to just five as shown in Fig. 3.1. OpenGL ES was providing
leadership to the market, and in 2015, OpenGL ES 3.2 was introduced with tessella-
tion and geometry, floating-point render targets, and texture compression. Then, in
January 2016 Vulkan was introduced and mobile devices had access to the same API
as PCs. New opportunities opened up for developers everywhere.

3.10 Siru (2011–2022)

The Bitboys had been subsumed into Qualcomm and then seemingly abandoned
in Finland when Qualcomm shut down their operations—but much of the team
regrouped in 2011 and formed Siru. Among them, Joonas Torkkeli, Jani Huhtanen,
Kari Malmberg, Mikko Nurmi, and Juuso Heikkila went to Siru Innovations. Marko
Laiho, the former chief software architect at Bitboys, left Qualcomm for another
stealth start-up, Vire Labs.
The Siru team was still flying low enough under the radar that it was unclear where all of the members ended up. But brothers Mika and Kaj Tuomi
were the cofounders, along with Mikko Alho. Alho, Siru’s CEO, was the former
graphics processor hardware project manager for Qualcomm Finland. Importantly,
although Alho was in a management position, keeping things on track with project
planning, resourcing, design definition, day-to-day project leading, and project status
reporting, he was doing the block design for Siru, including the C- and RTL-model
implementation. That meant an engineer and not a suit ran Siru’s management
(Fig. 3.28).

Fig. 3.28 Mikko Alho. Courtesy of Siru

Jarkko Makivaara, Qualcomm Finland's former director of engineering, was also with Siru. Jari Komppa, a senior engineer with Qualcomm Finland, joined Siru too. There were other engineers at Siru as well, all from Bitboys.
The end? Not yet….
Siru went on to design low-power GPUs and sell the IP, and by 2020 it had grown to over 20 people. Ironically, one of its customers was AMD, which in turn licensed the mobile GPU IP. One of AMD's customers for the mobile GPU IP was Samsung (Fig. 3.29).
Siru did development work on low-power mobile designs for AMD, Samsung, and others.
Another twist: in January 2022, AMD terminated its relationship with Siru. The company had several unsolicited bidders for an acquisition. Then, in mid-2022, Intel acquired Siru and folded it into the growing empire of Executive Vice President and former ATI/AMD veteran Raja Koduri—Bitboys 5.0.

Fig. 3.29 The many lives of the Bitboys



3.10.1 Samsung

In 2008, Samsung began developing its Hummingbird SoC, the S5PC110, for its smartphone line. The company had been using Qualcomm SoCs and wanted more freedom and, it hoped, lower costs. In 2010, the company rebranded
the SoC as Exynos and launched the first smartphone with an Exynos SoC in 2011.
Samsung used its SoC in international markets and Qualcomm in the U.S. The
Exynos SoC used an Arm CPU and the Mali GPU, and as such was not as powerful
as Qualcomm’s SoC with the Adreno GPU (which at the time was based on the
Bitboys design).
Wanting to catch up with Qualcomm, and most importantly Apple, Samsung
decided to develop its own GPU and launched an R&D project in 2013. Samsung
had been using Imagination Technologies' SGX 544 GPU IP to compete with Apple, also an Imagination Technologies customer and an investor in Imagination.
Samsung moved the project out of R&D in Korea and into engineering in Austin
and San Jose in 2017. Then, in a surprise to everyone, in January 2019, Yongin Park, President of the System LSI Business at Samsung Electronics, announced Samsung would build the Xclipse mobile GPU with AMD's RDNA 2 graphics IP technology.
The Exynos 2200 series of chips was announced on January 18, 2022, for the Galaxy S22, Galaxy Z Fold3, and Flip3 phones. Based on Samsung's 4 nm process node, the SoC's Xclipse 920 GPU included AMD's ray tracing intersection processors—making the Exynos 2200 SoC the first mobile processor with hardware-accelerated ray
tracing. The Xclipse could also handle variable rate shading and came with a multi-IP
governor system to enhance efficiency.
Samsung hoped to set a standard for mobile gaming experience, along with
improving social media apps and photo usage. The Xclipse GPU was positioned
between a console and a mobile graphics processor. The name Xclipse took the X from Exynos and referenced eclipse, as in shading all that had gone before. It marked the beginning of a new era in mobile gaming.
In the past, Samsung sold its phones with its SoC into the international market
and used Qualcomm’s Snapdragon chips for U.S. models.
The Exynos 2200 used Arm’s octa-core Armv9 CPU cores in a tri-cluster
configuration consisting of a single Arm Cortex-X2 flagship core, three balanced
Cortex-A710 big cores, and four power-efficient Cortex-A510 little cores.
Samsung said the Exynos 2200's neural processing unit (NPU) performance had been doubled compared with its predecessor. That allowed more calculations to be made in parallel, which enhanced the overall AI-related performance and experience on the smartphone. The NPU also offered greater precision with FP16 support alongside power-efficient INT8 and INT16.
The Exynos 2200 SoC included an advanced ISP (Image Signal Processor) that
could support image sensors up to 200 MP. That, said the company, allowed a creator
to record and edit 8K videos. With an advanced MFC (Multi-Format Codec), the Exynos 2200 could encode and decode high-resolution videos at faster frame rates than previous Exynos SoCs.

The Xclipse 920 used the Vulkan 1.2 API to deliver the ray tracing features, and
also OpenGL ES 3.1 for legacy apps.
The ISP had zero shutter lag at 108 MP. Samsung said it could connect up to seven
cameras and run four of them concurrently. With the help of its AI engine, the ISP
also provided tailored choice of color, white balance, exposure, and dynamic range
for each individual scene. Samsung’s SoC could do 8K video recording at 30 fps and
was capable of decoding 4K videos at up to 240 fps and 8K videos at up to 60 fps.
In addition, the MFC included a power-efficient AV1 decoder that helped to enable
a longer playback time.
On the display side, the SoC offered HDR10+ and high refresh rates: up to 4K (3840 × 2400, WQUXGA) at 120 Hz, or HD+ at 144 Hz.
Samsung had invested millions of dollars in R&D, gone through several managers and engineers, and never given up. The company was determined to have its own GPU and be vertically integrated like Apple and Qualcomm.
Unfortunately, the Exynos 2200 SoC's Armv9 CPUs and AMD GPU design consumed more power than forecast, which necessitated lowering frequencies below the original specifications. Samsung had projected improved power efficiency from its 4 nm node, but the process shrink failed to meet the company's expectations. Samsung trimmed clock speeds through firmware changes, which delayed the Galaxy S22 launch by more than a month.
Because of the performance reductions, the new Exynos was only made available in a few countries (compared with previous years): price-conscious regions such as India and Latin America. For its premium markets (the US, Canada, South Korea, Hong Kong, and most of Europe), Samsung used Qualcomm's Snapdragon 8 Gen 1 processor.

3.11 Texas Instruments OMAP (1999–2012)

Texas Instruments (TI) had been steadily expanding and improving its OMAP product line since first introducing it in May 1999 and beginning shipments in Q4 2000—and it had a string of successes for the company.
TI entered the mobile phone market with DSP-based baseband processors for
radio modems. Early OMAP models featured the TI TMS320 series DSP. TI OMAP
devices also included a general-purpose Arm processor plus one or more specialized
coprocessors.
On December 12, 2002, STMicroelectronics and TI jointly announced the Open
Mobile Application Processor Interfaces (OMAPI) initiative [26]. OMAPI was
designed for 2.5 and 3G mobile phones expected during 2003. Shortly afterward, in
2003, TI and STMicroelectronics joined with Arm, Intel, Nokia, and Samsung and
formed the larger initiative Mobile Industry Processor Interface (MIPI) Alliance,
which incorporated OMAPI. TI, however, carried on with the OMAP product line
branding. OMAP would be MIPI compatible and was the model of what a MIPI device should be for a while. Such devices were also called mobile Internet devices (MIDs).

Fig. 3.30 Texas Instruments early OMAP SoCs
The first OMAP SoC had a camera processor and an early version of an ISP.
In 2003, TI incorporated Imagination Technologies' GPU IP and introduced the OMAP2400 series.
TI segmented the expanding product line into three main categories:
• High-performance multimedia
• Basic multimedia
• Integrated modem and applications.
TI differentiated its SoCs by the Arm processor and the number of processors overall. The top of the line, the OMAP2420, had four processors, as illustrated in Fig. 3.30, which shows the 2410 and 2420.
The OMAP2410 included a 330 MHz Arm1136JS-FTM core, a 220 MHz TI DSP, an integrated camera interface, a 2D/3D graphics accelerator offering up to 2 million polygons per second, and a direct memory access (DMA) controller.
The OMAP2420 processor added a TI programmable imaging and video accel-
erator supporting 6-megapixel still capture, full-motion video encode, 640 × 480
video encoding at 30 fps, or decode in common intermediate format (CIF) to VGA
resolution on top of the OMAP2410. The OMAP2420 also could output images onto
an external TV. There were 5 MBytes of on-die SRAM. It was this component that
delivered the 6-megapixel camera support.
TI updated its OMAP with a second-generation architecture that integrated a
3D graphics acceleration engine, an imaging accelerator, support for hi-resolution
digicams and camcorders, and a TV output. Imagination Technologies’ PowerVR
MBX graphics core was the major contributor to the video I/O, 2D/3D, and display
capabilities and could process up to 2 million polygons per second.

TI emphasized TV on a cell phone as an essential feature. In November 2003, the company said that it was only a matter of time before TV viewing on cell phone screens would be a reality, and TI was not alone in this view. It never happened, but video streaming did.
The OMAP product line enjoyed success in the smartphone and tablet market until
2011 when it began losing market share to Qualcomm’s Snapdragon. On September
26, 2012, TI announced it would shift from the smartphone market and try its luck
in the embedded and automotive space [27]. A month later, the company eliminated
1700 jobs when it shifted from mobile to embedded platforms; the last OMAP5 chips
were released in Q2 2013.
The problem TI faced was that the value-add component of the OMAP SoC was just its DSP; other firms like Arm and Imagination supplied 70–80% of the SoC.
Whereas TI had been a pioneer in high-performance graphics processors (see book
one, The Steps to Invention), it lost that IP through various business decisions and
tried to get it back by buying IP. But TI could not compete with vertically integrated
SoC builders like Qualcomm. Neither could other vertically integrated suppliers like
Intel and Nvidia.

3.12 Arm’s Midgard (2012)

Midgard was the third architectural design from the Falanx Arm team. It consisted of the T604 and T658 series (first generation), the T622, T624, T628, and T678 series (second generation), the T700 series (third generation), and the T800 series (fourth and last generation).
Architecturally, Midgard was the progeny of Utgard, the final design based on the original Falanx architecture. There was still a difference in how the unified and discrete shaders operated. While there were clear elements of relation, the step change was perhaps the largest the team had ever made. In one step, they moved to unified shaders and added full compute, a whole new tiling system, and a new memory subsystem, to mention a few features. The shader design for Midgard inherited many of Utgard's design elements and features. However, those differences were more important for programmers than users.
In 2011, Arm gave its forecast of the mobile processor demand for the next five
years, and the growth and demand for performance were impressive. Its unique rela-
tionship with all the semiconductor and smartphone builders gave Arm a surprisingly
good view of future needs.
In late 2011, Arm announced the latest development in its Mali line of GPU IP
cores. The company predicted they would start appearing in various mobile devices
by 2013. Arm claimed the new Mali-T658 design would have four times the GPU
compute performance of the quad-core Mali-T604 GPU and had up to 10 times
greater graphics performance of the Mali-400 GPUs. The Mali-T604, Arm’s previous
top-of-the-line, was launched at the Arm TechCon in 2010. That core appeared in
silicon and then in devices by 2012 [28].

“It is simple,” said Steve Steele at the time. Steele was the senior product manager
of Arm’s media processing division. “It is all about higher performance—twice as
many shader cores (eight, compared with the Mali-T604’s four) and doubles the
arithmetic pipelines per core (from two to four). For graphics, this means that the
Mali-T658 GPU offers up to 10 times the performance of the Mali-400 MP GPU.”
The Mali-T658 delivered what Arm called desktop-class performance. Arm said
it got the performance by doubling the number of arithmetic pipelines within each
GPU core and improving the compiler and pipeline efficiency.
The T658 was compliant with multiple compute APIs, including Khronos OpenCL
1.1 (Full Profile), Google Renderscript compute, and Microsoft DirectCompute. The
GPU design had hardware support for 64-bit scalar and vector, integer, and floating-
point data types, which were fundamental for accelerating complex and computation-
ally intensive algorithms. Compatibility with Khronos APIs was maintained across
the Mali-T600 Series of GPUs.
“By showing off some of the versatility of the Midgard architecture, it brings in
a compute punch of up to 350 GFLOPS,” said Edvard Sørgård, Arm’s Consultant
Graphics Architect in Trondheim, Norway [29]. “And with over 5 GPixel/s real fill
rate to external memory to power high-end mobile devices with visual computing and
augmented reality, the 4k DTV revolution, and exascale high-performance compute,
it was another fantastic component in the joined-up computing story from Arm.”
As would be expected, the Mali-T658 GPU was compatible with other popular
graphics APIs, including Khronos OpenGL ES, OpenVG, and Microsoft DirectX
11. The overall organization of the Mali-T658 is shown in Fig. 3.31.
Arm added some additional functionality to Midgard enabled by the Android
Extension Pack (AEP) for Android L. The AEP extended OpenGL ES 3.1 by enabling
features such as tessellation and geometry shaders, features that did not make it into
OpenGL ES 3.1.

Fig. 3.31 Arm Mali-T658 organization



Table 3.4 Arm Mali Midgard arithmetic units per pipeline (per core)

Model | ALU/pipe
T628 | 2
T678 | 4
T720 | 1
T760 | 2

Arm also offered Direct3D support on Midgard. That functionality was never used
because all Windows Phone and Windows RT devices at the time used Qualcomm
or Nvidia SoCs. Only Mali-T760 was Direct3D Feature Level 11_1 capable.
The T700 series of the Midgard added more shader cores, going from 1 to 16.
The T658 was designed to work with the latest version (4) of the Advanced Micro-
controller Bus Architecture, which featured cache-coherent interconnect (CCI).
“Data shared between processors in the system—a natural occurrence in heteroge-
neous computing—no longer required costly synchronization via external memory
and explicit cache maintenance operations,” said Roberto Mijat, GPU Computing
Marketing Manager for Arm at the time [30] (Table 3.4).
“All of this is now performed in hardware and is enabled transparently inside the drivers,”
added Mijat. “In addition to reduced memory traffic, CCI avoids superfluous sharing of data:
Only data genuinely requested by another master is transferred to it, to the granularity of a
cache line. No need to flush a whole buffer or data structure anymore.”

Like its predecessor, the Mali-T658, shown in Fig. 3.32, could perform
parallel processing operations on appropriate applications, including physics, image
processing and stabilization, augmented reality, and transcoding. The T658 had eight
cores and four pipelines (refer to Fig. 3.32).
“The difference between a core and a pipeline,” said Ed Plowman, Arm’s Technical
Marketing Manager, “is that a core is a self-contained entity—which had its own self-
contained resources—so you can build SoCs with different numbers of cores. You
can also scale dynamically by powering cores up and down—independently of other
elements. The Mali-T658 scales by numbers of cores.” [31].
Each arithmetic pipeline (Figs. 3.33 and 3.34) was independent, working on its
own set of threads. Midgard could process one bilinear filtered texel per clock or
one trilinear filtered texel over two clocks. In the high-end T760, the number of
texture units and render output units (ROPS) per shader core was the same, so all
configurations had a 1:1 ratio between texels and pixels.
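To put that texel-per-clock figure in context, a rough peak-fill-rate estimate (the core count and clock below are hypothetical examples, not Arm specifications) is simply

\[ \text{bilinear texel rate} = N_{\text{cores}} \times 1\,\tfrac{\text{texel}}{\text{clock}} \times f_{\text{clk}} \]

so a hypothetical 16-core T760 running at 600 MHz would peak at about 16 × 0.6 GHz ≈ 9.6 Gtexels/s with bilinear filtering, and roughly half that with trilinear filtering, which takes two clocks per texel.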
Arm claimed the Mali-T658 would give the company performance leadership over
its rival graphics core licensor, Imagination Technologies. Imagination’s top-of-the-
line GPU then was the PowerVR Series 6 design that went by the code name Rogue.
“Imagination and Vivante; they are the competition,” said Ian Smythe, marketing
director for the media processing division.
However, Arm also claimed that Mali, in all its versions, with 57 licenses and 29
SoCs designed, was then the most widely licensed GPU architecture—which was
true.

Fig. 3.32 Mali-T658 program management

Fig. 3.33 Pipelines in Arm’s Mali architecture



Fig. 3.34 Arm Midgard block diagram

Arm emphasized that with the Mali-T658 SoC, designers could use a carefully
crafted system-level approach to multi-core design. That approach included Arm
Cortex processor cores, the little-big power-efficiency technology, and CCI.
According to Jem Davies, vice president of technology for the media processing
division at Arm, “Designers are expected to target high-end smartphones on 28 nm
silicon with quad-core Mali-T658. They will be coming to market in 2013, and
eight-core Mali-T658 graphics units to be fabricated in 20 nm silicon in 2015.” Arm
also expected to find applications in tablet computers, smart-TVs, and automotive
infotainment systems for the core (Fig. 3.35).

Fig. 3.35 Jem Davies, VP, Arm. Courtesy of Arm 2020



“Mali-T658,” added Davies, “will be able to take on computation tasks in applications such
as image processing or augmented reality. The core had been made compatible with the
recently announced A7-A15 little–big coupling so that as computation was moved on to the
T658, it may accompany the movement of the core program down from the A15 to the A7.”
[32]

The Mali Job Manager was autonomous and could carry on graphics processing
with a reduced load on the CPU, which meant it was well suited to working with
a big.LITTLE CPU system. Arm’s big.LITTLE technology is a heterogeneous
processing architecture that uses two types of processors. LITTLE processors are
designed for maximum power efficiency while big processors are designed to provide
maximum compute performance.
Using the correct processor for the right task, the Mali-T658 handled GPU
compute tasks in parallel with the CPU running the always-on, always-connected
tasks. “Arm’s CoreLink system IP enables system-level cache coherency across
clusters of multi-core processors, including the Cortex-A15 and Mali-T658,” added
Davies.
Also, the Mali-T658 was compatible with the Armv8’s entire 64-bit ISA, as was
the Mali-T604.
The lead partners signing up to develop the Mali-T658 were Fujitsu Semicon-
ductor, LG Electronics, Nufront, and Samsung.
In October 2011, Peter Hutton, general manager of the media processing division
at Arm, said, “We are looking at a little–big approach for Mali too.”
With more than six types of portable devices (smartphones, tablets, game consoles,
cameras, navigation units, and notebooks) needing powerful low-energy graphics
engines with GPU compute capability, competition increased to meet this demand.
The lines were becoming divided on who Arm’s clients for Mali would be: those
customers who did not have an Arm architectural license. It was exceedingly difficult
for a single company that did not have a large, powerful GPU and CPU design team
to compete against the architectural licensees and Arm itself. The licensees like
Freescale, Marvell, Nvidia, Qualcomm, and TI would create a differentiated product,
while others would have to rely on the Cortex/Mali menu options. The net result was
that some amazingly innovative products emerged from the Arm community and
Intel in the following years. They all competed for the potential of the two-billion-
unit market that all the mobile devices represented. However, that volume was not
realized due to competition and a slow-down in the market.

3.12.1 Arm’s Bifrost (2018)

Arm steadily improved the Mali GPU after acquiring it from Falanx. The little GPU had found its way into TVs, phones and tablets, automobiles, and various consumer and industrial devices. The little GPU that could, and did, went through several architectural evolutions. In early 2016, the company introduced its Bifrost architecture.

Fig. 3.36 The 12-year history of Arm Mali architectures over time

The architecture was implemented in the Mali-G71 in the second quarter of 2016, then in the Mali-G72 in October 2017, and in the G76, G52, and G31 in 2018.
The Bifrost architecture was significantly different from the previous Midgard
design, and Bifrost positioned the GPU for new tasks in AI, AR, and rendering. The
multi-generational history of Mali is illustrated in Fig. 3.36.
On the last day of May 2018, at the Arm Tech Day conference, the company
revealed its newest upgrade to the Bifrost family and showed the Mali-G76. Arm
skipped the numbering sequence from G72 to G76 to align Mali with the new CPU
Cortex-A76.
Mali-G76, the third-generation GPU based on the Bifrost architecture, featured three execution engines per shader core, a dual texture mapper, and 4–20 configurable shader cores, and it could be configured with two to four slices of L2 cache—from 512 KB to 4 MB total. The G72's execution engines could run four scalar threads in lockstep, and a vec3 FP32 add took only three cycles. The G76 employed wide execution engines, which doubled the number of lanes, so eight threads doing a vec3 FP32 add still completed in three cycles, and it provided int8 dot product support for machine learning. The core design is illustrated in Fig. 3.37.

Fig. 3.37 Arm's Mali-G76's core design block diagram

Arm said the G76 could work faster than the previous generation because its eight lanes used the same energy as four lanes to process a given workload.
A simple comparison of the total compute capability is shown in Table 3.5.

Table 3.5 Comparison of Mali-G72 to Mali-G76

Parameter | G72 | G76
Execution lanes per engine | 4 | 8
Engines per core | 3 | 3
Cores | 32 | 20
Execution lanes | 384 | 480
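As a back-of-the-envelope check of Table 3.5 (the clock rate below is a hypothetical example, not an Arm figure), the lane totals and peak single-precision throughput follow directly from the configuration:

\[ \text{lanes} = \text{cores} \times \text{engines per core} \times \text{lanes per engine}, \qquad \text{peak FP32 FLOPS} = \text{lanes} \times 2\,(\text{FMA}) \times f_{\text{clk}} \]

That gives 32 × 3 × 4 = 384 lanes for the G72 and 20 × 3 × 8 = 480 lanes for the G76; at a hypothetical 700 MHz, a 20-core G76 would therefore peak at roughly 480 × 2 × 0.7 GHz ≈ 672 GFLOPS.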
The G76 also had dual texture units, so applications could sample two textures per core per clock, which effectively gave the G76 twice the texturing throughput.
At the time, Arm said its microarchitecture improvements would provide area and power savings from register bank optimization: half the number of register banks, but larger ones. The G76 also had increased compute performance through the thread-local storage area, where the stack area was used for register spilling in shaders. That provided the data for a single thread, which then got grouped into chunks at the same location. As a result, said the company, gaming performance would improve, and it offered the comparison in Fig. 3.38 to make the point (based on Arm's testing).
Because of the efficiency improvements, games can also be played for more
extended periods, said the company.

Fig. 3.38 Improved gaming performance with Mali-G76. Courtesy of Arm



For AI applications where ML inference on the device effectively used general matrix multipliers, the Mali-G76 added dedicated 8-bit dot product support for inference.
Summarizing it all, Arm claimed the Mali-G76 would offer 30% greater energy
improvement over the G72, 30% greater density, and 2.7 times the ML compute
density. The net of all that, Arm said, would deliver 1.5 times the overall performance
of a G72.

3.12.2 Arm’s Valhall (2019)

In October 2019, Arm expanded its Mali graphics portfolio with its G57 GPU based
on the new Valhall architecture, the fourth generation of Mali GPUs [33]. The G57
targeted the mainstream market, and the G77 targeted the premium mobile market.
Arm said the Mali-G57 enabled capabilities not usually associated with mainstream smartphones and home appliances (e.g., set-top boxes and DTVs). That included high-fidelity content, 4K and 8K user interfaces, console-like smartphone graphics, and more complex ML, AR, and VR workloads.
The heart of Mali-G57 was based on the company’s Mali-G77 (see below).
Compared with the previous Bifrost generation Mali-G52, the Valhall G57 had 30% greater performance density and was designed to run content like Fortnite at high resolution. Also, said Arm, the G57 provided double the texturing performance of the G52, which would improve high-resolution UI performance in 4K and 8K DTVs, AR, VR, and gaming. The increase in compute and texture capabilities also made the Mali-G57 a good candidate for HDR rendering, physically based rendering (PBR), and volumetric effects, which were becoming standard features on mobile devices. That would, Arm said, translate to enhanced, smoother experiences and faster, more responsive mainstream devices for the end-user [34].

Fig. 3.39 Arm's comparison of performance between the Mali-G52 and the new Mali-G57. Courtesy of Arm
Efficiency. Along with performance, Arm said it made significant energy effi-
ciency improvements with Valhall, up to 30% over Bifrost. The company offered
some comparisons of the G57 and the G52 (Fig. 3.39).
Whereas the premium Mali-G77 had at least seven cores, the Mali-G57 had one
to six cores depending on the configuration.
ML performance. The Mali-G77 brought a significant improvement in ML performance—up to 60% over the G52. Arm claimed the Mali-G57 would show similar improvements, taking more complex ML workloads to mainstream devices. The 60% increase in on-device ML performance was made possible by twice as many fused multiply-accumulate (FMA) processors compared with the Mali-G52 (depending on the configuration) and architectural optimizations. That provided faster responsiveness to a wide
range of ML use cases common on smartphones, such as face detection, image
quality enhancement, and speech recognition. Moreover, Arm claimed Mali-G57’s
flexibility to perform different ML workloads would ensure that the next generation
of mainstream devices would provide future and emerging use cases based on ML.

3.12.2.1 AR and VR

Arm believed consumers and the mainstream market wanted more immersive AR and VR experiences. However, AR and VR games and applications were often limited to premium devices. Using VR as an example, the Mali-G57 offered foveated rendering, allowing VR developers to reduce their application's workload. Foveated rendering was typically achieved by selectively reducing the shading rate for the regions of the screen that were less visible through the lenses of VR headsets. As a result, claimed the company, the industry was likely to see users with access to more immersive VR games, apps, and experiences. Similar improvements applied for AR, said Arm, where Mali-G57's performance increase would support more immersive and higher quality AR content, games, and features on mainstream devices.

3.12.3 Valhall Architecture

Valhall was Arm's third-generation scalar GPU architecture. It had a 16-wide-warp execution engine (the GPU executed 16 instructions in parallel per cycle, per processing unit, per core). That was an increase from 4 and 8 wide in Bifrost. Other new architectural features included hardware-managed dynamic instruction scheduling and new instructions that retained functional equivalency to Bifrost. It also supported Arm's AFBC 1.3 compression format, FP16 render targets, layered rendering, and vertex shader outputs. A block diagram of the Valhall architecture is shown in Fig. 3.40.

Fig. 3.40 Arm’s Valhall shader architecture block diagram

The Mali-G77 did 33% more math in parallel than the G76.
The significant architectural changes could be found in the execution unit inside
the core, the part of the GPU responsible for number crunching.
Inside the execution engine
Each GPU core in Bifrost had three execution engines (or two in some lower-end Mali-G52 models). Each engine had an i-cache, a warp control unit, and a register file. In the Mali-G72, each engine handled four instructions per cycle, which increased to eight in the Mali-G76. Spread across the three engines, that allowed for 12 and 24 32-bit floating-point (FP32) fused multiply-accumulate (FMA) instructions per cycle, respectively.

Fig. 3.41 Arm Valhall microarchitecture



With the Valhall Mali-G77, only a single execution engine was inside each GPU core (Fig. 3.41). That engine held the warp control unit, register file, and i-cache, shared across two processing units. Each processing unit handled 16 warp instructions per cycle, for a total throughput of 32 FP32 FMA instructions per core (the 33% instruction throughput boost over the Mali-G76). So, with Valhall, Arm transitioned from three execution engines to a single execution engine per GPU core, but with two processing units within a G77 core.
Each processing unit had two new math function blocks. A convert unit (CVT)
handled basic integer, logic, branch, and conversion instructions. And the special
function unit (SFU) accelerated integer multiplication, divisions, square root,
logarithms, and other complex integer functions.
The FMA unit supported 16 FP32 instructions per cycle, 32 FP16, or 64 INT8
dot product instructions. Those optimizations produced a 60% performance uplift in
machine learning applications.
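A quick arithmetic check ties those per-core numbers together (a worked comparison derived from the lane counts above, not an Arm-published formula):

\[ \text{G76: } 3 \times 8 = 24 \ \text{FP32 FMA/cycle per core}, \qquad \text{G77: } 2 \times 16 = 32 \ \text{FP32 FMA/cycle per core} \]

and 32/24 ≈ 1.33, which is the 33% per-core throughput increase quoted above. The FP16 and INT8 figures follow the same pattern, since each FP32 lane packs two FP16 operations or four INT8 dot-product operations.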
At the time, Arm said the Valhall architecture would align with modern APIs
such as Vulkan, which was considered the new standard for graphics on mobile and
other platforms. Thanks to the performance and energy efficiency improvements of the Mali-G57, the company said, games would run smoother and for longer on devices. That, in turn, said Arm, would help reinforce the reputation of developers
in the gaming ecosystem to device manufacturers. Arm saw the Chinese and Asian
markets, where mainstream devices and mobile gaming were most prevalent, as
an excellent opportunity for mainstream gaming apps. According to a 2018 Newzoo
report on the China games market, China would generate $23 billion in mobile games
revenue alone [35]. Therefore, game developers were keen to access these regions
where the mainstream market was robust (Fig. 3.42).
The Valhall architecture was the basis of Arm’s latest generation GPUs and
contained various improvements and new features over the previous Bifrost architec-
ture. It was a scalable architecture that made the high-end and complex features in the
G77 possible on both premium and mainstream devices. In addition to aligning better with the Vulkan API, the key elements of Valhall were a new superscalar engine, a simplified scalar ISA, and new dynamic scheduling of instructions. All of those brought about the performance and energy efficiency improvements in the Mali-G57. A companion to the Mali-G57 was the Mali-D57 display processor (discussed below).

Fig. 3.42 Arm's Mali Valhall architecture. Courtesy of Arm
Arm coined the expression total compute as a solution-focused approach to
system-on-chip design, moving beyond individual IP elements to design and opti-
mize the system as a whole. It was an apt description and easy to remember. Along
with 5G, the acceleration of AI, extended reality (XR), and the Internet of things (IoT)
compute requirements were changing. The performance needed for digital immer-
sion, said Arm at the time, would have to push beyond what was then available toward the world of total compute.

3.12.3.1 ML and Display

In the spring of 2019, Arm announced its latest development in GPU land—the
Mali-G77. Arm said the new GPU was 40% faster in overall graphics than the G76
and 60% faster at ML tasks. The company also claimed the G77 was 30% more
energy-efficient and used 40% less bandwidth [36].
The company also introduced its second-generation ML processor design, which
was twice as efficient as the original. The peak performance improved slightly from
4.6 to 5 TOP/s, but that performance required only 1 W instead of 2 W for the original
ML processor.

Fig. 3.43 Arm has the whole suite of engines for 5G AI, ML, and VR. Courtesy of Arm

Arm said the new ML processor had improved memory compression techniques
by three times that of the previous generation, and it could scale up to eight cores
for a total performance of 32 TOP/s. However, such designs were unlikely to make
it into mobile devices due to higher power requirements (Fig. 3.43).
In addition to the ML processor, Arm introduced a companion display processor,
the Mali-D77.

3.12.3.2 Mali-D77 Display Processor (2019)

Arm also brought out a companion for the G77, the D77 display processor (the counterpart of the Mali-V61 from 2017), shown in Fig. 3.44 [37]. Looking to compete with Qualcomm's XR platform, Arm claimed its D77 display processor IP delivered superior head-mounted display (HMD) VR performance, helping eliminate motion sickness, and was optimized for 3K resolution at a 120 Hz refresh rate. The D77 had new fixed-function hardware, which Arm said achieved more than 40% system bandwidth savings and 12% power savings for VR workloads. With that, the company hoped to enable lighter, smaller, and more comfortable untethered VR HMDs using standard premium mobile displays.
Arm claimed that the D77 was a display processor that could significantly improve
the VR user experience with dedicated hardware functions for VR HMDs, namely
lens distortion correction, chromatic aberration correction, and asynchronous time
warp (ATW).
VR HMDs required displays close to the eyes; therefore, to maintain the perceived quality of the images (e.g., reducing artifacts in the display like the screen-door effect), more pixels were needed in the same area. Unlike regular smartphone displays, VR HMDs required at least six times more pixels per unit area to maintain the same perceived quality. The state-of-the-art VR devices in 2020 were 2880 × 1440 pixels. However, the trend was moving toward increased resolutions within ever-shrinking power budgets on lighter, more comfortable headsets (Fig. 3.45).

Fig. 3.44 Arm's D77 display processor block diagram

Fig. 3.45 Arm said an SoC that could drive the level of performance for wearable VR HMDs did not exist (in 2019). That presented a significant challenge to SoC vendors who needed to achieve the above requirements. Courtesy of Arm
Arm claimed the Mali-D77 would also eliminate motion sickness through the
higher frame rates and instantly respond to real-world head movements by repro-
jecting the VR scene according to the latest pose due to its just-in-time single pass
composition process. Moreover, it would deliver crisper images free from artifacts
through its more advanced hardware-based filtering and image processing functions.
Key features:
• R63455 VR DDIC optimized for 2160 × 2400 at 90 Hz head-mounted display
• 1000 ppi, 2K per eye image quality
• Foveal transport provided a clear visual where the user needs it
• VXR7200 VR Bridge supported DP1.4 bandwidth with AMD/Nvidia GPUs over
USB-C
• Support for panels > 2K resolution or faster refresh rates, without image loss due
to cabling.
Although virtual reality devices were becoming more common, there were still
plenty of issues, as anyone who has endured too many demos can attest. For instance,
there could be problems with getting the level of resolution needed to avoid motion
sickness and the screen-door effect.
Companies continued to invest in and promote VR for consumers. Although games
were not the only application for consumer VR, they still comprised 96% of consumer

VR applications. As such, VR was still a novelty and a snack for consumers. A few
diehard VR gamers would spend an hour or more, but most normal people gave it
up after 10 or 15 min and did not go rushing back to it later.
Arm was in a unique marketing position as an SoC IP component supplier. It
was the only company that offered CPU, GPU, display, and other designs to OEM
chip builders. Other GPU designers either use their designs internally as Nvidia and
Qualcomm do. AMD also used its GPU design internally and licensed it. And other
companies such as Digital Media Professionals (DMP), Imagination Technologies,
Think Silicon, and VeriSilicon only licensed their GPU designs. Intel could offer
CPU and GPU IP but did not participate in the IP market. The IP GPU suppliers did
not have a CPU design to offer, so Arm was uniquely positioned to know about all
sorts of companies’ needs and ambitions.

The screen-door effect


The screen-door effect (SDE) or fixed-pattern noise (FPN) is a visual arti-
fact of displays, where the fine lines separating pixels become visible in
the displayed image. It is sometimes incorrectly referred to as pixel-pitch.
Screen door effect is a phrase used to describe a display that has a visible
gap between individual pixels. However, there is no technical specification
for this gap.

3.12.4 Arm Epilogue

In September 2016, SoftBank bought Arm for $32 billion [38]. Many in England
were unhappy that the country’s largest technology company would be in the hands
of the Japanese. Some newspapers in England saw the sale as an unfortunate result
of Brexit, which reduced the pound’s value by 11% against the yen and the dollar
and made Arm more of a bargain for international buyers.
However, SoftBank chairman Masayoshi Son said that he had been following
Arm for many years and that the reduced value of the British pound was not a factor.
Ironically, Prime Minister Theresa May, commenting on a disputed and eventually
thwarted sale of pharmaceutical company AstraZeneca to Pfizer, noted that it was
important for Britain to have control over its business assets. The implication was the
UK might move to block the sale of valuable companies and technologies outside

the country. However, that was not her position after the sale. May said that the deal
had proven that England was open for business. In her official statement, May said,
“The announcement of investment this morning from SoftBank into Arm Holdings
was clearly a vote of confidence in Britain.”
It helped that SoftBank pledged to grow Arm, keep the headquarters in Cambridge,
and promised to “at least double the employee headcount in England.” Arm CEO
Simon Segars stayed on to run the company, and SoftBank said: “management will
stay in place.” In an interview on the Arm Web site, Segars said that Arm was
not looking to sell the company and was confident about their opportunities in the
future. He said then that for Arm to consider a sale, the price would have to be very
compelling, and the deal would have to open more opportunities than the company
could achieve on its own.
Arm certainly did not lose its way since being bought by SoftBank, nor had it
been auctioned off in pieces which some people feared would happen. When Son
told the Arm executives that he wanted to buy the company, he made them a series
of promises: The company would remain an independent subsidiary of SoftBank, he
would not interfere in the day-to-day management of Arm, and the company would
be allowed to invest all the profits into research and development [39].
And SoftBank brought a boatload of money to the party and lots of new contacts.
The workforce had grown from 4500 to over 6000 by 2019, and more hands meant
more and faster development. As a result, Arm’s reach into new market areas was
impressive, and its new concepts of what a GPU should do expanded.
In February 2018, Arm launched Project Trillium, now known as the Arm AI
Platform, a new machine learning-powered platform to provide advanced computing
capabilities to connected devices.

3.12.5 Second Epilogue

Four years later, in September 2020, Nvidia announced it would buy Arm from Soft-
Bank for $40 billion [40]. In April 2021, the UK government issued an intervention
notice over the sale of Arm by Japan’s SoftBank to Nvidia [41]. Then in January
2022, the U.S. Federal Trade Commission sued to block Nvidia’s purchase of chip
designer Arm, saying the deal would create a powerful company that could hurt the
growth of new technologies. On February 8, 2022, Nvidia withdrew its bid to acquire
Arm. SoftBank positioned Arm for a public offering, and the company laid off 1,200 people.

3.13 Nvidia Leaves Smartphone Market, 2014

Nvidia led with its strength in the mobile market and promoted its graphics perfor-
mance. However, in the mobile market, OEMs and consumers valued power effi-
ciency more. And with the smaller screens, high-performance graphics was not
that beneficial or appreciated. Power-efficient and lower performance GPUs from
Qualcomm, Imagination, and Arm took over the market.
Then, in May 2015, Nvidia said that instead of wasting its resources on budget or even mainstream mobile phones, it would carve out a new market for what it called superphones.
In January 2013, the company introduced the Tegra 4 SoC with a 72-core GPU, a video engine, and a dual-channel DDR3L-1833 memory controller. It was manufactured at TSMC in 28 nm HPL (low power with high-k + metal gates, optimized for low leakage).
In February 2013, Nvidia introduced its Phoenix 5-in. superphone reference platform based on the Tegra 4i, an application processor and LTE modem on the same silicon. Then, in September 2013, the company announced Xiaomi had introduced the Mi3 superphone powered by a Tegra 4.
Fifteen months later, in May 2014, Jensen Huang told CNET that the company would withdraw from the smartphone and tablet market altogether and concentrate on automotive and games machines.
For the second time in his career and in the company's history, it pivoted from a losing situation to a winning one. In May 2015, the company wrote off the Icera operation it had paid $352 million for in June 2011. It incurred expenses of $100 to $125 million in severance and other employee termination benefit costs [42] (Fig. 3.46).

Fig. 3.46 Nvidia's route to and from the mobile market

Research firm Jon Peddie Research estimated Nvidia had invested over a billion
dollars in developing the Tegra product line beginning with the acquisition of MediaQ
in 2003. Mobile phones were always the goal, and Nvidia did not waver from trying to
get into the market—but the market changed faster than Nvidia could. Qualcomm was
always the big player and seeing how rich it was getting prompted smaller companies
(like MediaTek, Rockchip, and others) and giant Intel to enter the market. Failing
to develop a viable processor, Intel began buying its way into the market, a strategy
Nvidia could not match. At the same time, the Chinese suppliers, offering nothing more than an Arm reference design, cut prices to a few dollars for a mobile SoC. That was not a price structure Nvidia could or wanted to live with.
In 2013, Nvidia showed its roadmap for the Tegra product line, revealing the Logan and Parker SoCs. Logan was Nvidia's first SoC with CUDA-compatible shaders, based on the Kepler architecture GPU and OpenGL 4.3. Kepler had a shader block granularity of 1 SMX (192 CUDA cores). Logan demos appeared in 2013 and production devices in early 2014.
After Logan came Parker (with the code-named Denver CPU), which added 64-bit capabilities and a Maxwell GPU. Parker was also built using TSMC's 3D FinFET transistors.

3.13.1 Xavier Introduced (2016)

Announced in September 2016, Xavier was a new SoC based on the company's next-generation Volta GPU, which Nvidia hoped would be the processor in future self-driving cars. Xavier featured a high-performance GPU and the latest Arm CPU, yet had great energy efficiency, according to the company.
With its expanded 512-core Volta GPU, the chip was designed to support deep learning features important to the automotive market, said the company. A single Xavier-based AI car supercomputer would be able to replace the company's fully configured Drive PX 2 with two Parker SoCs and two Pascal GPUs. Xavier was built using a 16 nm FinFET process and had seven billion transistors—which made it the biggest chip built to date.
Nvidia's Xavier SoC was featured again at CES 2017 and then at Nvidia's GPU Technology Conference (GTC) in March 2017, where Nvidia's CEO Jensen Huang announced Toyota would use the Xavier processor for its autonomous cars in 2020. Quite a claim and commitment for a product that had not been built yet.
The Nvidia Drive AV (autonomous vehicle) platform used neural networks to let
cars drive themselves and had two new software platforms: Drive IX and Drive AR.
Drive IX, said the company, was an intelligent experience software development
kit that would enable AI assistants for sensors inside and outside the car and for
drivers and passengers in the car.
Xavier ran the Drive software stack, which had been expanded to a trio of AI platforms covering every aspect of the experience inside next-generation automobiles.
Later, in June 2017, Nvidia released (via the Linley report) [43] an updated high-level block diagram that replaced the computer vision accelerator (CVA) with a deep learning accelerator (DLA)—a much sexier name. The chip was symbolized in the block diagram in Fig. 3.47.

Fig. 3.47 Nvidia's high-level (circa 2017) Xavier block diagram—DLA is the deep learning accelerator
At CES 2018, the company said Xavier had more than nine billion transistors and included an eight-core Arm64 CPU, a deep learning accelerator, a 512-core Volta GPU, an 8K high-dynamic-range (HDR) video processor, and new computer
vision accelerators. The SoC could perform 30 trillion operations per second on 30
W of power. Nvidia claimed Xavier processors were being delivered to customers
that quarter and that it was the most complex SoC the company ever created [44].
Xavier was a key part of the Nvidia Drive Pegasus AI computing platform. It
offered, the company said, the equivalent amount of processing power as a trunk
full of PCs. It was nothing less, claimed Nvidia, than the world’s first AI car
supercomputer, designed for fully autonomous Level 5 robotaxis (Fig. 3.48).
Pegasus was built on two Xavier SoCs and two next-generation Nvidia GPUs.
Nvidia claimed more than 25 companies were already using Nvidia technology to
develop fully autonomous robotaxis, and Pegasus would be its path to production.
Xavier was announced in 2016 and appeared in products available by end of 2017
and the beginning of 2018.
The next SoC from Nvidia was the Orin, illustrated in Fig. 3.49.
The details of the devices are shown in Table 3.6.
Nvidia continued to pursue the automotive and autonomous vehicle market and also saw success with its Tegra chips in Nintendo Switch consoles. In 2022, the company introduced its Grace Hopper SoC with a 72-core Arm CPU designed for giant-scale AI and HPC systems.

3.14 Qualcomm Snapdragon 678 (2020)

Qualcomm Technologies introduced its Snapdragon 678 Mobile Platform in December 2020 as a follow-on to the Snapdragon 675 introduced in 2018 (Fig. 3.50). The chip, said Qualcomm, delivered overall performance upgrades, high-speed connections for sophisticated photo and video capture, and immersive entertainment experiences.

Fig. 3.48 Nvidia's Xavier-based Pegasus board (circa 2018) offered 320 TOPS and the ability to run deep neural networks at the same time. Courtesy of Nvidia

Fig. 3.49 Nvidia's Tegra SoC roadmap 2022. Courtesy of Nvidia
The performance enhancements of the Snapdragon 678 over Snapdragon 675
included
• Kryo 460 CPU clock speed up to 2.2 GHz
• Adreno 612 GPU performance increase.

Table 3.6 Nvidia Tegra SoC product line

Product name | Drive PX | Drive PX 2 | Drive Xavier | Drive Pegasus | Drive AGX Orin
SoC name | Tegra X1 | Parker | Xavier | Xavier | Orin
Process technology | 20 nm SoC | 16 nm FinFET | 12 nm FinFET | 12 nm FinFET | 6 nm FinFET
SoC transistors | 2 billion (Tegra X1) | N/A | 7 billion (Xavier) | 7 billion (Xavier) | 7 billion (Xavier)
GPU architecture | Maxwell (256 core) | Pascal (256 core) | Volta (512 core) | Volta (512 core) | Ampere (1024 CUDA cores and 32 Tensor Cores)
CPU | 16-core Arm CPU | 12-core Arm CPU | 8-core Arm CPU | 16-core Arm CPU | 8-core Arm CPU
CPU architecture | 8x Cortex A57, 8x Cortex A53 | 4x Denver, 8x Cortex A57 | Carmel ARM64 8-core CPU (8 MB L2 + 4 MB L3) | Carmel ARM64 8-core CPU (8 MB L2 + 4 MB L3) | 8-core Arm Cortex-A78AE v8.2 64-bit CPU (2 MB L2 + 4 MB L3)
Compute (DL TOPs) | N/A | 20 DL TOPs | 30 TOPs | 320 TOPs | 200 TOPs
Total chips | 2 × Tegra X1 | 2 × Tegra X2, 2 × Pascal MXM GPUs | 1 × Xavier | 2 × Volta, 2 × Turing | 1 × Ampere
System memory | LPDDR4 (50+ GB/s) | 8 GB LPDDR4 | 16 GB 256-bit LPDDR4 | LPDDR4 + GDDR6 | 12 GB 128-bit LPDDR5, 102.4 GB/s
Graphics memory | N/A | 4 GB GDDR5 (80+ GB/s) | 137 GB/s | 1 TB/s | 200 GB/s
TDP | 20 W | 80 W | 30 W | 500 W | 10 W/15 W/25 W

Courtesy of Nvidia

The 612 GPU ran at 700–750 MHz, had two execution units with 96 shading units, and could produce 328.2 GFLOPS. It could drive a 2520 × 1080 display with a color depth of up to 10 bits. The GPU supported Vulkan 1.0, OpenGL ES 3.2, OpenCL 2.0, and DirectX 11 (FL 11_1). In addition to those performance upgrades,
the 678 supported dynamic photography and videography capabilities, immersive
entertainment experiences, fast connectivity, and long battery life.
Dynamic photography and videography came from the dual 14-bit Spectra 250L ISPs, which could process sensors with up to 48 MP and zero shutter lag (ZSL).
• Dual Camera, Multi-Frame Noise Removal (MFNR), ZSL, 30 fps: up to 16 MP
• Single Camera, MFNR, ZSL, 30 fps: up to 25 MP
• Single Camera, MFNR: up to 48 MP
• Single Camera: up to 192 MP.

Fig. 3.50 Qualcomm Snapdragon 6xx

The third-generation Qualcomm AI Engine featured portrait mode, low-light capture, and laser autofocus capabilities. Users could capture HD 4K video with
recording features such as slo-mo recording (1080p at 120 FPS, 720p at 240 FPS),
5x optical zoom, and portrait mode dual-camera support to 16 MP. The camera also
featured accelerated electronic image stabilization. The video codec supported H.265
(HEVC), H.264 (AVC), VP8, and VP9.
Qualcomm claimed the Snapdragon 678's 2.2 GHz octa-core Kryo 460 CPU and Adreno 612 GPU could drive faster graphics rendering, providing sharp, lifelike visuals at high frame rates with fewer frame drops. And, claimed the company, the Snapdragon 678 was optimized for Unity, Messiah, NeoX, and Unreal Engine 4.
The SoC's X12 LTE modem supported advanced carrier aggregation with downloads up to 600 Mbps and uploads up to 150 Mbps.
The Adreno 612 was not a new GPU, having been introduced in 2018 for mid-range smartphones and tablets. The Snapdragon 675 SoCs included it, and like the 678, it was made using the 11 nm low power plus (LPP) process at Samsung. The performance resembled the old Adreno 512 GPU and was only found in the lower mid-range of modern smartphone SoCs.

3.15 Qualcomm Snapdragon 888 (2020)

In December 2020, the company introduced its flagship Snapdragon 888 5G SoC at
the Qualcomm Snapdragon Tech Summit. The company said at the time it hoped the
888 would set the benchmark for flagship smartphones in 2021. The SoC integrated
5G along with Wi-Fi 6 and Bluetooth audio.

Qualcomm claimed the 888's Adreno 660 GPU would deliver a 35% increase in graphics rendering (measured probably in fps). That, it said, was the company's most significant performance leap for its GPUs yet. The GPU's classification number
showed that the Adreno 660 was not the highest performance GPU Qualcomm
offered. The 2019 Snapdragon 8cx SoC (for Always-On, Always-Connected PCs)
had an Adreno 680 GPU.
Compared with the Snapdragon 865, the 888 CPU and GPU were more power-
efficient, the company said. Qualcomm claimed a 25% improvement for the Kryo 680 CPU (over the Kryo 585). There was a 20% improvement with the Adreno 660 (over
the 650). (The 888 was what people thought would be a Snapdragon 875.) The
Snapdragon 865 used LPDDR5 2750 MHz/LPDDR4X 2133 MHz. The 888 used
LPDDR5. The SoC was manufactured in Samsung’s 5 nm process.
Based on Qualcomm’s numbering system, speculation was that the Adreno
660 would still fall below Apple’s M1 four-core GPU’s performance. However,
compared with the Arm Mali-G78 GPU, the Adreno 660 should have had a significant
advantage.
The Snapdragon 888 introduced Qualcomm's variable rate shading (VRS) solution. That made the 888 the first mobile device GPU to have such a capability. Qual-
comm said its VRS would improve game rendering by up to 30% for mobile experi-
ences, with improved power consumption at the same time. The increased graphics
processing also enabled HDR features in mobile gaming and allowed frame rates of
up to 144 fps.
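The text does not describe the interface Qualcomm's drivers expose for VRS, but to make the idea concrete, here is a minimal, hypothetical sketch of how a game could request a coarser shading rate through the cross-vendor Vulkan VK_KHR_fragment_shading_rate extension, shading once per 2 × 2 pixel block for draws where full-rate shading adds little visible detail. Everything below is illustrative and not Qualcomm-specific.

#include <vulkan/vulkan.h>

/* Illustrative sketch: request one fragment shader invocation per 2x2 pixel
 * block for subsequent draws in a command buffer, using the cross-vendor
 * VK_KHR_fragment_shading_rate extension. A real application must create the
 * device with that extension, enable its pipelineFragmentShadingRate feature,
 * and list VK_DYNAMIC_STATE_FRAGMENT_SHADING_RATE_KHR as dynamic state in the
 * graphics pipeline. */
static void set_coarse_shading_rate(VkDevice device, VkCommandBuffer cmd)
{
    /* Extension entry points are fetched at runtime rather than linked. */
    PFN_vkCmdSetFragmentShadingRateKHR pfnSetRate =
        (PFN_vkCmdSetFragmentShadingRateKHR)
            vkGetDeviceProcAddr(device, "vkCmdSetFragmentShadingRateKHR");
    if (pfnSetRate == NULL)
        return; /* extension not available on this device */

    VkExtent2D fragmentSize = { 2, 2 }; /* shade once per 2x2 pixels */
    VkFragmentShadingRateCombinerOpKHR combiners[2] = {
        VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR, /* keep pipeline rate vs. per-primitive rate */
        VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR  /* keep result vs. attachment rate */
    };
    pfnSetRate(cmd, &fragmentSize, combiners);
}

Foveated rendering for VR headsets (mentioned earlier for the Mali-G57) uses the same idea, typically through the extension's per-attachment rate image rather than a single per-draw rate.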
The company also introduced Game Quick Touch, Qualcomm’s anti-lag cursor
control. That, said the company, would increase responsiveness by up to 20% by
lowering touch latency. With the speeds and low latencies that 5G and Wi-Fi 6
deliver, Qualcomm believed elite gamers could unite or compete in real-time for
unmatched global competition.
The 888 had impressive camera improvements. The Spectra 580 ISP was the
first from Qualcomm to feature a triple ISP. It could capture three simultaneous 4K
HDR video streams or three 28-megapixel photos at once at up to 2.7 gigapixels
per second (35% faster than the 865). It also offered improved burst capabilities and
could capture up to 120 photos in a single second at a 10-megapixel resolution. Lastly,
the upgraded ISP added computational HDR to 4K videos and had an improved low-
light capture architecture. It also offered the option to shoot photos in 10-bit color in
high-efficiency image file format.
Qualcomm’s Hexagon 780 AI processor in the Snapdragon 888 was a sixth-
generation AI engine that the company claimed would help improve everything from
computational photography to gaming to voice assistant performance. Qualcomm said the 888 could perform 26 TOP/s (trillion operations per second), compared with 15 TOP/s on the 865, and would do it while delivering three times the power efficiency. Additionally,
Qualcomm promised significant improvements in both scalar and tensor AI tasks as
part of those upgrades.
The Snapdragon 888 also had the second-generation sensing hub, a dedicated
low-power AI processor for smaller hardware-based tasks, such as identifying when

the user raised the phone to light up the display. The new sensing hub relied less on
the Hexagon processor for those tasks.

3.16 Apple’s M1 GPU and SoC (2020)

In late 2020, Apple introduced its latest SoC, the M1, as the heart of its MacBook Air, 13-in. MacBook Pro, and Mac mini, and later its iMac and iPad Pro.
Apple had built SoCs for its smartphones and MP3 players since the late 1990s, so
it was not a new adventure for the company. The M1 differed from the iPhone’s A13
Bionic in that the A13 had extensive camera processors (ISPs), AI processors, and
memory (Fig. 3.51). However, the two SoCs had some CPU and GPU similarities.
The M1 SoC had an Arm big.LITTLE 8-core CPU with an ultra-wide execution architecture. The four big cores had a 192 KB instruction cache and a 128 KB data cache and shared a 12 MB L2 cache. The four little cores had a 128 KB instruction cache and a 64 KB data cache and shared a 4 MB L2 cache. Apple said the little cores used only 10% as much power as the big cores.
The 8-core GPU had 128 EUs that could execute 24,576 concurrent threads. In compute mode, the GPU was capable of 2.6 TFLOPS, 82 Gtex/s, and 41 Gpix/s. Apple said it could deliver 200% of the performance of the iGPU in the Intel processor used in the previous-generation Air computer when operating at 10 W. The iGPU used a unified (shared) memory architecture (UMA), and the memory, which was on the substrate, was tightly coupled to the SoC. The company said the M1 could accomplish
up to 390% faster video processing than a 1.2 GHz quad-core Intel Core i7-based
MacBook Air system (both configured with 16 GB RAM and 2 TB SSD) and up to
710% faster image processing than a 3.6 GHz quad-core Intel Core i3-based Mac
mini (Fig. 3.52).
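The 2.6 TFLOPS figure is consistent with simple arithmetic if one assumes (as widely reported, but not stated in Apple's material) eight FP32 ALUs per EU and a clock of roughly 1.28 GHz:

\[ 128 \ \text{EUs} \times 8 \ \text{ALUs} \times 2 \ \text{FLOPs (FMA)} \times 1.28 \ \text{GHz} \approx 2.6 \ \text{TFLOPS} \]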
There was also a 16-core neural engine that could reach 11 trillion operations/s.
That meant the Apple processor could provide 150% faster ML performance than a
3.6 GHz quad-core Intel Core i3-based Mac mini.

Fig. 3.51 Block diagram of Apple's M1 PC SoC

Fig. 3.52 Floor plan of Apple’s M1 substrate with chip and memory. Courtesy of Apple

The chip was built in TSMC's 5-nm fab and had 16 billion transistors, the most
ever for an Apple semiconductor. Overall, Apple said the M1 would provide 200%
of the performance at max power or the same performance on 33% as much power
as an Intel-based Apple computer.
With its 8-core GPU, ML accelerators, and neural engine, the entire M1 chip
would excel at ML. Final Cut Pro could intelligently frame a clip in a fraction of
the time. Pixelmator Pro could magically increase sharpness and detail at incredible
speeds. And every app with ML-powered features would benefit from performance
never before seen on Mac, claimed the company.
Using an Arm processor for Apple’s OS was not a big step. The basic OS was
already running on an Arm processor in the iPhone. And Qualcomm demonstrated it
was possible to run Windows on an Arm processor. The Intel processor had an iGPU,
so moving to an Arm-based processor with an iGPU was not a significant change.
Apple had been building its own SoCs with its iGPU for many years.
Apple’s slide presentation clarified that the GPU was a tile-based deferred
rendering architecture, which only Imagination Technologies’ PowerVR offered.
A lot of the developer docs, especially the A14 ones, made this clear.
Apple's Arm license was architectural, so although the GPU was PowerVR,
it was probably best considered a branch of that design. Someone could arguably do the
GPU part better using Imagination's higher-end A-Series or B-Series than Apple's
implementation did.
Who could match Apple’s CPU was another question. However, the other archi-
tectural Arm licensees mostly went back to taking cores (Fujitsu and Nvidia being
notable exceptions).
Apple excelled at integration, overall chip design, and verification. Like Qualcomm,
it was a master of these heterogeneous SoCs and of getting the maximum
speed (or, rather, the minimum latency) from its integrations.

3.16.1 Apple’s M1 Pro GPU (2021)

More performance, less power


Apple knows how to make transitions. The company proved it when it moved
from the PowerPC platform to Intel, and it showed the same discipline in
2021 when it moved away from Intel and on to its own Arm-based family of
chips (Fig. 3.53).
Tim Cook, CEO of Apple, announced in June 2020 that the company was beginning a
two-year transition to its own silicon, which started with the all-Apple M1 SoC [45]. In October 2021, the
company introduced its M1 Pro, and M1 Max SoCs, and the lineup was impressive
[46]. In addition to knowing how to make a transition, Apple also knows how to roll
out products.
The M1 Pro memory bus was twice as wide as the M1's and provided 200 GB/s—roughly
3x the M1's bandwidth. Apple coupled 32 GB of unified memory to the SoC, and the M1
Pro had 33.7 billion transistors—more than twice the M1's count.
Built on TSMC's 5 nm process, the 10-core CPU had eight high-performance
cores and two high-efficiency cores and delivered up to 70% more CPU performance than the M1. Figure 3.54

Fig. 3.53 Apple’s M-Series SoCs. Courtesy of Apple



Fig. 3.54 The CPUs of the M1 Pro. Courtesy of Apple

shows the general layout and organization of the SoC, the CPU cores are in the upper
left.
The 16-core GPU was equally impressive, and Apple claimed it offered twice the
performance of the M1. Seen at the top center of the die, the GPU had 16 cores with
2048 execution units and could run up to 49,152 concurrent threads. The GPU was
capable of 5.2 TFLOPS, 164 Gtexels/s, and 82 Gpixels/s.
The SoC had three media engines (upper right blocks, Fig. 3.54) that ran multiple
streams of 4K and 8K video, and the SoC included a ProRes (RAW) codec.
The M1 Max, based on the M1 Pro, was even more impressive. It doubled the M1
Pro's bandwidth to 400 GB/s and had a whopping 57 billion transistors, 1.7x the
count of the M1 Pro and 3.5x that of the M1. Most of the additional transistors were
used for the GPU as shown in Fig. 3.55.
The M1 Max also had double the embedded unified memory to 64 GB (see
Fig. 3.56), again for the GPU.
Apple said its processors were incredibly power efficient. Compared to a laptop
with a discrete GPU, the M1 Pro reached the same performance at 70% less power,
or 7x the performance at the same power. That stung Intel, and Intel's CEO, Pat
Gelsinger, said at the time that the company hoped to win back Apple's business
but would need to create a better chip than Apple Silicon to do it [47].
The M1 Pro SoC powered Apple's 14-in. MacBook Pro notebook, which had a 14.2-in. active
diagonal screen with 3024 × 1964 resolution (5.9 Mpixels) and a refresh rate of up to
120 Hz that dynamically adjusted to the content, slowing the refresh for static content.
The M1 Max was used to power the 16-in. MacBook Pro with a 16.2-in. diagonal display
of 3456 × 2234 pixels (7.7 Mpixels) and 1 billion colors (10-bit)—a Liquid Retina XDR display
that put out 1000 nits (1600 peak). It had thousands of LEDs in the backlight, grouped into
local dimming zones, and offered a 1,000,000:1 contrast ratio (Fig. 3.57).

Fig. 3.55 The M1 Max offers 4x faster GPU performance than M1. Courtesy of Apple

Fig. 3.56 The M1 Max with its unified embedded memory. Courtesy of Apple

In the fine print, Apple said the test systems were a 4-core MSI Prestige 14 EVO
PC laptop with an iGPU (Intel Core i7-1185G7) and an 8-core MSI GP66 Leopard 11UG
(Core i7-11800H)—4-core and 8-core models of Intel's 10 nm SuperFin Tiger Lake CPUs.

Fig. 3.57 Performance comparison. Courtesy of Apple

The M1 had its GPU cores in an 8-256 configuration—8 TMUs and 256 shaders
per core—so five such cores would give 40 TMUs and 1280 shaders.
Apple's design is FP32-centric, and FP16 runs at the same rate (no FP16 speed-up,
but versus the previous generation that means FP32 runs at twice the rate).
The M1 Pro offered 14–16 GPU cores, so 16 such cores would be a 128-4096 design, likely
running at clocks similar to or higher than Apple's mobile parts, which ran at up to 1.3 GHz.
The M1 Max offered 24–32 GPU cores, so 32 cores would be a 256-8192 design, again likely
running at up to 1.3 GHz.

3.16.2 Apple’s M1 Ultra (2022)

In March 2022, at its annual new products conference, Apple announced a new
version of the M1—the M1 Ultra, a dual die chiplet device that introduced Apple’s
UltraFusion interconnect technology (Fig. 3.58).
Apple’s UltraFusion used a silicon interposer to connect the chips across more
than 10,000 signals, providing 2.5 TB/s of low-latency inter-processor bandwidth.
That was, Apple claimed at the time, four times the bandwidth of the leading
multi-chip interconnect technology (see Chiplets in Book two).
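Dividing the quoted aggregate bandwidth by the quoted signal count gives a feel for the per-wire data rate. The sketch below is a rough derivation using only the figures in the text; the even division across signals is a simplifying assumption.

```python
# Rough per-signal data rate for UltraFusion, using only the quoted figures.
aggregate_bytes_per_s = 2.5e12      # 2.5 TB/s across the silicon interposer
signal_count = 10_000               # "more than 10,000 signals"

per_signal_bytes = aggregate_bytes_per_s / signal_count
per_signal_gbit = per_signal_bytes * 8 / 1e9
print(f"~{per_signal_bytes/1e6:.0f} MB/s, or ~{per_signal_gbit:.0f} Gbit/s, per signal")
```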
Apple said the M1 Ultra would behave and be recognized by software as one
chip, so developers would not need to rewrite code to take advantage of its performance
(Fig. 3.59).
The M1 Ultra featured a 20-core CPU composed of 16 high-performance and
four high-efficiency cores. Apple claimed it delivered 90% higher multi-threaded

Fig. 3.58 Apple’s UltraFusion packaging architecture connects two M1 Max die to create the M1
Ultra. Courtesy of Apple

Fig. 3.59 Apple said the 20-core CPU of the M1 Ultra could deliver 90% higher multi-threaded
performance than the fastest 2022 16-core PC desktop chip in the same power envelope. Courtesy
of Apple

performance than the fastest available 16-core PC desktop chip in the same power
envelope. Apple also claimed the M1 Ultra could reach the same peak performance
as PC chips with 100 fewer watts.3 Less power consumption means fans would run

3 Testing was conducted by Apple in February 2022 using preproduction Mac Studio systems with
Apple M1 Max, 10-core CPU and 32-core GPU, and preproduction Mac Studio systems with
Apple M1 Ultra, 20-core CPU and 64-core GPU. Performance measured using select industry-
standard benchmarks. 10-core PC desktop CPU performance data tested from Core i5-12600K
and DDR5 memory. 16-core PC desktop CPU performance data tested from Core i9-12900K and
DDR5 memory. Performance tests were conducted using specific computer systems and reflect the
approximate performance of Mac Studio.

Fig. 3.60 Apple claimed its M1 Ultra 64-core GPU produced faster performance than the highest-
end PC GPU available while using 200 fewer watts of power. Courtesy of Apple

quieter, even while running power-demanding apps like Logic Pro and processing
massive amounts of virtual instruments, audio plug-ins, and effects (Fig. 3.60).
For graphics needs, like 3D rendering and complex image processing, the M1
Ultra had a 64-core GPU—eight times the size of M1. Apple claimed it could deliver
faster performance than even the highest-end PC GPU available while using 200
fewer watts of power. Performance was measured using Apple-selected benchmarks
and compared against the performance of a Core i9-12900K with DDR5 memory
and a GeForce RTX 3060 Ti and GeForce RTX 3090.
Apple said its unified memory architecture also scaled up with the M1 Ultra.
Memory bandwidth increased to 800 GB/s—more, claimed Apple, than ten times that of the
latest PC desktop chip. The M1 Ultra could address up to 128 GB of unified memory.
Apple compared that shared memory configuration with a dGPU's dedicated GDDR6
and cited dGPU-based AIBs as being limited to 48 GB, claiming the M1 Ultra
offered more memory to support GPU-intensive workloads such as 3D geometry
and rendering massive scenes. However, such claims could not be proved and did not
consider the OS's and applications' use of the shared memory, nor did
Apple address the performance difference of DDR5 versus GDDR6. A PC of the
day could be configured with 128 GB of DDR5 plus the 48 GB of GDDR6 for a
total of 176 GB, so Apple's claim of more memory than a PC was not defensible.
Subsequent testing showed Nvidia's RTX 3090 soundly beating the M1 Ultra [48].
Was it all just a publicity stunt?
The new SoC contained 114 billion transistors, the most ever in a personal
computer chip. Apple claimed the performance would be appreciated by artists
working in large 3D environments previously challenging to render, developers

compiling code, and video professionals. The company said users could transcode
video in ProRes up to 5.6x faster than with a 28-core Mac Pro with Afterburner.
Although Apple designed its own GPU after separating from Imagination Technologies—
a split that nearly destroyed that company—Apple has since returned to Imagination as a licensee.
Even though neither company will say it in public, there is likely quite a bit of
Imagination IP in the new GPU.
Apple uses its API called Metal, derived from AMD’s Mantle, which is also
the basis for Khronos’s Vulkan and Microsoft’s DirectX 12. Therefore, the API
performance of Apple’s drivers is likely to be on par with PCs using a Khronos or
Microsoft API.
Apple’s claims of superior GPU performance while using a shared memory archi-
tecture and DDR5 are sure to be challenged and probably disproven by the discrete
GPU suppliers as soon as they can get their hands on an M1 Ultra.
From an API point of view, the M1 Ultra shows up as a single Metal device, so it is one GPU,
not two. With Metal, Apple has more control than vendors
who must support standardized (multi-vendor) APIs, which has some benefits
(e.g., avoiding difficult corner cases).
The M1 Ultra's 2.5 TB/s interconnect is enough not only for memory access
but also for the die-to-die traffic needed to make the combined GPU work (even on a single
die, a big GPU has to distribute work, and not everything talks to everything, so scaling
over a 2.5 TB/s link is not impossible).
SLI is not employed. However, Apple's GPU is based on Imagination's architec-
ture, and Imagination has had multi-chip designs since the Dreamcast days. Distributing
work in a tile-based system is far easier than in immediate-mode renderer (IMR)-
like systems—spreading out tiles is relatively easy. Imagination's Series5XT
had multi-core configurations; Imagination connected SGX5xxMP cores back in 2012 (e.g., MP4
designs), and they scaled linearly.
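A minimal sketch of why tile-based work splits cleanly across dies: once the screen is binned into tiles, the tiles are independent and can be dealt out to whichever GPU die (or core) is free. This is a conceptual illustration only, not Apple's or Imagination's actual scheduler, and the screen and tile sizes are arbitrary choices.

```python
# Conceptual illustration: distributing screen tiles across two GPU dies.
# Not Apple's or Imagination's actual scheduler--just why tiling scales easily.
SCREEN_W, SCREEN_H, TILE = 3840, 2160, 32   # a 4K screen, 32x32-pixel tiles

def make_tiles(width, height, tile):
    """Return the (x, y) origin of every screen tile."""
    return [(x, y) for y in range(0, height, tile)
                   for x in range(0, width, tile)]

def assign_round_robin(tiles, num_dies=2):
    """Deal tiles out to dies; each tile is rendered independently."""
    buckets = {die: [] for die in range(num_dies)}
    for i, t in enumerate(tiles):
        buckets[i % num_dies].append(t)
    return buckets

tiles = make_tiles(SCREEN_W, SCREEN_H, TILE)
work = assign_round_robin(tiles)
for die, batch in work.items():
    # An even split, with no per-tile chatter between dies.
    print(f"die {die}: {len(batch)} tiles")
```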
Apple would have you believe they have somehow broken the laws of physics
by bolting two die together and increasing the memory address space. Shared DDR
memory is still shared DDR memory, and DDR5 has a clock limit. Clocks also
determine how hot a chip gets. And although Apple claims it gets to those performance
levels at 100 W, that's still 100 W, which is not cool; it's hot (put your hand on a
100 W light bulb). That's not to diminish what Apple did, but they didn't do anything
revolutionary. AMD and Intel have been building multi-chip devices for ten years.
Qualcomm has been building higher-performance, low-power GPUs even longer.
AMD is building processors using 5 nm, and so has Samsung, so Apple hasn’t
broken any Moore’s law barriers or been first in that node either—although they
were first with the M1 Max at 5 nm.
The Apple M1 Ultra was a huge chip, about three times larger than AMD’s Ryzen
APU (Fig. 3.61).
The competition in the PC market is brutal, and Apple has never been shy about
praising itself. Their Mac Studio will be an able performer, but it can’t replace a
powerful desktop system.
In mid-2022, Apple introduced the second-generation M-series processor, the M2.

Fig. 3.61 Apple M1 compared to AMD Ryzen chip size. Courtesy of Max Tech/YouTube [49]

Built using TSMC's 5-nm process, the M2 took the M1's performance per watt
even further with an 18% faster CPU, a 35% more powerful GPU, and a 40%
faster Neural Engine.4 The company also asserted it delivered 50% more memory
bandwidth than the M1 and up to 24 GB of fast unified memory.
The SoC had 20 billion transistors—25% more than the M1. The additional transistors
improved features across the entire chip, said the company, including the memory
controller that delivered 100 GB/s of unified memory bandwidth—Apple said that
was 50% more than the M1. The M1 had about 68 GB/s of memory bandwidth, so the
claim works out to roughly 47% more—close to Apple's rounded figure. The M2 had 24 GB of unified memory, enabling the
M2 to handle even larger and more complex workloads, bragged Apple.
Unified memory, you will recall, means shared memory—the GPU and the CPU
use the same memory and memory bus simultaneously, so one of them must wait on
the other to get access. The M2 used LPDDR5-6400 memory, whereas the M1 used
a slower LPDDR4X-4266.
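The quoted bandwidth figures follow directly from the memory data rates if one assumes a 128-bit (16-byte-wide) unified memory bus on both chips; that bus width is an assumption here rather than an Apple-published spec. The short sketch below shows the arithmetic.

```python
# Bandwidth = memory data rate x bus width (assumes a 128-bit bus on both chips).
BUS_BYTES = 128 // 8                      # 16 bytes per transfer (assumption)

def bandwidth_gb_s(mega_transfers_per_s):
    return mega_transfers_per_s * 1e6 * BUS_BYTES / 1e9

m1 = bandwidth_gb_s(4266)   # LPDDR4X-4266 -> ~68 GB/s
m2 = bandwidth_gb_s(6400)   # LPDDR5-6400  -> ~102 GB/s
print(f"M1: {m1:.1f} GB/s, M2: {m2:.1f} GB/s, gain: {m2/m1 - 1:.0%}")
```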
Apple also claimed the M2's GPU was 25% more powerful than the M1's (up to
35% better performance at its maximum power) and had up to ten cores (two more than
the M1). Compared with the integrated graphics of the latest PC laptop chip, Apple
claimed the GPU in the M2 delivered 2.3x faster performance at the same power level and
matched its peak performance using a fifth of the power.5

4 Testing conducted by Apple in May 2022 using preproduction 13-in. MacBook Pro systems with
Apple M2, 8-core CPU, 10-core GPU, and 16 GB of RAM; and production 13-in. MacBook Pro
systems with Apple M1, 8-core CPU, 8-core GPU, and 16 GB of RAM. Performance measured
using select industry-standard benchmarks. Performance tests are conducted using specific computer
systems and reflect the approximate performance of MacBook Pro.
5 Testing conducted by Apple in May 2022 using preproduction 13-in. MacBook Pro systems with
Apple M2, 8-core CPU, 10-core GPU, and 16 GB of RAM. Performance measured using select
industry-standard benchmarks. 10-core PC laptop chip performance data from testing Samsung
Galaxy Book2 360 (NP730QED-KA1US) with Core i7-1255U and 16 GB of RAM. Performance
tests are conducted using specific computer systems and reflect the approximate performance of
MacBook Pro.

The company also claimed the new SoC’s Neural Engine could process up to
15.8 trillion operations per second—over 40% more than M1. The media engine
included a higher-bandwidth video decoder supporting 8K H.264 and HEVC video
and Apple’s ProRes video engine that enabled playback of multiple streams of both
4K and 8K video. The company said there was also a new image signal processor (ISP)
that delivered better image noise reduction.
MetalFX
With the M2, Apple introduced its MetalFX Upscaling Technology—a mix of
spatial and temporal upscaling algorithms. Metal is Apple's API, similar to DirectX
and Vulkan, and is discussed in Book two, The History of the GPU—the Eras and
Environment. With the introduction of the M2, Apple also introduced Metal 3.
Nvidia was the first to introduce AI-based scaling with DLSS in August 2019, also
discussed in Book Two. AMD released its FidelityFX Super Resolution (FSR) as an
efficient scaling alternative to DLSS in June 2021, and then a significant improvement
of it with FSR 2.0 in May 2022. Intel announced its XeSS in May 2022 in its Xe
GPU’s Matrix Extensions (Intel XMX).
Nvidia DLSS 1 and AMD’s FSR 1 used spatial upscaling, and the second-
generation versions of those technologies included superior and more compute-
intensive temporal upscaling. Nvidia introduced DLSS 2.0 in March 2020, and AMD
introduced FSR 2.0 in May 2022.
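The difference between the two approaches can be sketched in a few lines: spatial upscaling reconstructs each output frame only from the current low-resolution frame, while temporal upscaling also blends in history accumulated from previous frames. The sketch below is a bare-bones conceptual model (nearest-neighbor spatial filter, simple exponential history blend); it is not MetalFX, DLSS, FSR, or XeSS, all of which add motion-vector reprojection and far more sophisticated filtering.

```python
# Conceptual spatial vs. temporal upscaling--not MetalFX, DLSS, FSR, or XeSS.
import numpy as np

def spatial_upscale(frame, factor=2):
    """Nearest-neighbor spatial upscale: uses only the current frame."""
    return frame.repeat(factor, axis=0).repeat(factor, axis=1)

def temporal_upscale(frame, history, factor=2, blend=0.8):
    """Blend the spatially upscaled frame with accumulated history
    (real temporal upscalers also reproject history using motion vectors)."""
    up = spatial_upscale(frame, factor)
    if history is None:
        return up
    return blend * history + (1 - blend) * up

low_res_frames = [np.random.rand(270, 480) for _ in range(4)]  # stand-in 480x270 frames
history = None
for frame in low_res_frames:
    history = temporal_upscale(frame, history)
print("output resolution:", history.shape)   # (540, 960)
```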

3.16.3 Summary

Apple clearly achieved world-class semiconductor design status and succeeded in
being the first such company to get the benefits of TSMC's 5 nm process. The power-
performance curves shown in Fig. 3.57 are an engineer’s dream and the epitome of
Moore’s law.

3.17 Conclusion

The mobile market exploded and then contracted. In the end, Apple, a company with
no history in communications, became the market leader. Companies with massive
funding and resources, like NEC, were driven out or, like Nokia, shrunk. Others
with fabulous technology, like Nvidia, found out they did not have the right stuff.
The China mobile phone market expanded and initially embraced foreign suppliers,
but by 2020 had driven most of them out in favor of Chinese suppliers, Apple and
Samsung being exceptions.


References

1. Garnsey, E., Lorenzoni, G., Ferriani, S. Speciation through entrepreneurial spin-off: The Acorn-
Arm story, (March 2008). Research Policy. 37 (2): 210–224. doi:https://doi.org/10.1016/j.res
pol.2007.11.006. Retrieved 2 June 2011.
2. History of Arm: from Acorn to Apple, The Telegraph, (January 6, 2011), https://www.telegraph.
co.uk/finance/newsbysector/epic/arm/8243162/History-of-Arm-from-Acorn-to-Apple.html
3. Acorn Group and Apple Computer Dedicate Joint Venture to Transform IT in UK Education,
Archived 3 March 2016 at the Wayback Machine, press release from Acorn Computers, 1996.
4. Clarke, P. Arm acquires Norwegian graphics company, EE Times, (June 23, 2006), https://
www.eetimes.com/arm-acquires-norwegian-graphics-company/#
5. Falanx Microsystems rolls out new multimedia accelerator cores for handheld SoCs, Tech-
Watch, Volume 5, Number 3, (February 14, 2005).
6. 3D on Java?—ask Arm, TechWatch, Volume 8, Number 4, (February 25, 2008).
7. Peddie, J. The History of Visual Magic in Computers, Springer Science & Business Media,
(June 13, 2013), https://link.springer.com/book/10.1007/978-1-4471-4932-3
8. Kahney, L. Inside Look at Birth of the IPOD, Wired, (July 21, 2004), https://www.wired.com/
2004/07/inside-look-at-birth-of-the-ipod/
9. Apple seed, Forbes, (February 16, 2004), https://www.forbes.com/forbes/2004/0216/050.html?
sh=1cb8705b234b
10. Gardner, D. Nvidia Acquires PortalPlayer For $357 Million, InformationWeek (November 6,
2006), https://www.informationweek.com/nvidia-acquires-portalplayer-for-$357-million/d/d-
id/1048542
11. Nordlund, P. Bitboys G40 Embedded graphics processor, HotChips, Graphics Hardware 2004
Hot3D presentations, https://www.graphicshardware.org/previous/www_2004/Presentations/
gh2004.hot3d.bitboys.pdf
12. Clarke, P. Bitboys licenses G34 graphics processor to NEC Electronics, EETimes
(August 10, 2004), https://www.eetimes.com/bitboys-licenses-g34-graphics-processor-to-nec-
electronics/#
13. Bitboys Introduces Vector Graphics Processor for Mobile Devices at Game Developers
Conference, Design & Reuse, (March 7, 2005), https://tinyurl.com/yph4wwnb
14. Scalable Vector Graphics (SVG) Tiny 1.2 Specification, W3C Recommendation (December 22,
2008), https://www.w3.org/TR/SVGTiny12/
15. Rice, D. (Editor), OpenVG Specification, Version 1.0.1 (August 1, 2005), https://www.khronos.
org/registry/OpenVG/specs/openvg_1_0_1.pdf
16. Nokia Growth Partners invests four million euros in Bitboys to support the company’s growth
and product development Press release, (February 9, 2006), https://tinyurl.com/2t8jwfk7
17. ATI acquires Bitboys Oy, Press release (May 4, 2006), https://evertiq.com/news/3786
18. Peddie, J. Bitboys acquires ATI and leads them to Nokia, TechWatch - Volume 6, Number 10,
(May 8, 2006).
19. Peddie, J. Bitboys acquires ATI and leads them to Nokia, TechWatch, Volume 6, Number 10,
May 8, 2006, (page 12).
20. Smith, T. Bitboys offers next-gen mobile 3D chips, (August 10, 2004), https://www.theregister.
com/2004/08/10/bitboys_g40/
21. Tikka, J-P. The Ups and Downs of Bitboys, Now Known As Qualcomm Finland, (April
1st, 2009), https://xconomy.com/san-diego/2009/04/01/the-ups-and-downs-of-bitboys-now-
known-as-qualcomm-finland/2/
22. ATI to exhibit handheld products at International CES, TechWatch, Volume 4, Number 1,
(January 12, 2004), (page 30).
23. Peddie, J. Qualcomm demonstrates its Q3D gaming architecture, TechWatch, Volume 5,
Number 4, page 16 (February 28, 2005).
24. Peddie, J. Make a left, and the exit door will be on the right in front of you, TechWatch, Volume
9, Number 3, page 4, (February 2, 2009).

25. Qualcomm Acquires AMD’s Handheld Business, Press release (January 21, 2009), https://news.
softpedia.com/news/Qualcomm-Acquires-AMD-039-s-Handheld-Business-102577.shtml
26. STMicroelectronics and Texas Instruments Team Up to Establish an Open Standard for Wireless
Applications. STMicroelectronics Press release (December 12, 2002), https://tinyurl.com/89j
kt3cw
27. Goddard, L. Texas Instruments admits defeat, moves focus away from smartphone proces-
sors, Reuters (September 26, 2012), https://www.theverge.com/2012/9/26/3411212/texas-ins
truments-omap-smartphone-shift
28. Smith, R. Arm’s Mali Midgard Architecture Explored (July 3, 2014), https://www.anandtech.
com/show/8234/arms-Mali-midgard-architecture-explored
29. Sørgård, E Launching Mali-T658: “Hi Five-Eight, welcome to the party!”, Arm blog,
(September 11, 2013), https://tinyurl.com/y9r763rp
30. Mijat, R GPU Computing in Android? With Arm Mali-T604 & RenderScript Compute You
Can!, Arm Blog, (September 11, 2013), https://tinyurl.com/xrw952bt
31. Plowman, E. Multicore or Multi-pipe GPUs: Easy steps to becoming multi-frag-gasmic, Arm
Blog, (September 11, 2013), https://tinyurl.com/3kshz7my
32. Clarke, P. Arm announces 8-way graphics core, EE Times (November 10, 2011), https://www.
eetimes.com/arm-announces-8-way-graphics-core/#
33. Peddie, J. Arm’s new Valhall-based Mali-G57, TechWatch, https://www.jonpeddie.com/report/
arms-new-valhall-based-Mali-g57/
34. Peddie, J. Company claims highest-performing Valhall Mali GPU, (May 26, 2020), https://
www.jonpeddie.com/report/arm-Mali-g78-gpu/
35. China Games Market 2018 (August 3, 2018), https://newzoo.com/insights/infographics/china-
games-market-2018/
36. Peddie, J. Arm introduces a display processor and enhances its Mali GPU: Enabling VR with
a display processor and an AMP-sipping little GPU (June 3, 2019), https://www.jonpeddie.
com/report/arm-introduces-a-display-processor-and-enhances-its-Mali-
37. Peddie, J. Arm's Mali-Cetus display processor (May 31, 2017), https://www.jonpeddie.com/
report/arms-Mali-cetus-display-processor1/
38. SoftBank Offers to Acquire Arm Holdings for GBP 24.3 Billion (USD 31.4 Billion) in Cash,
BusinessWire, (July 18, 2016), https://www.businesswire.com/news/home/20160717005081/
en/SoftBank-Offers-to-Acquire-Arm-Holdings-for-GBP-24.3-Billion-USD-31.4-Billion-in-
Cash
39. Medeiros, J. How SoftBank ate the world, Wired, (July 2, 2019), https://www.wired.co.uk/art
icle/softbank-vision-fund
40. Peddie, J. Nvidia to buy Arm: The next wave of IT is underway, (September 15, 2020), https://
www.jonpeddie.com/report/nvidia-to-buy-arm/
41. Sandle, P. UK invokes national security to investigate Nvidia’s Arm deal, Reuters,
(April 19, 2021), https://www.reuters.com/world/uk/uk-intervenes-nvidias-takeover-arm-nat
ional-security-grounds-2021-04-19/
42. TechWatch, Volume 15, Number 10 (May 12, 2015).
43. Demler, M. Xavier Simplifies Self-Driving Cars. Linley Newsletter (Formerly Processor Watch,
Linley Wire, and Linley on Mobile), Issue #533, (June 22, 2017).
44. Shapiro, D. Nvidia Drive Xavier, World’s Most Powerful SoC, Brings Dramatic New AI
Capabilities, (January 7, 2018), https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-proces
sor/
45. Apple Computer, Apple announces Mac transition to Apple silicon, [Press release], (June 22,
2020), https://www.apple.com/newsroom/2020/06/apple-announces-mac-transition-to-apple-
silicon/
46. Apple Computer, Introducing M1 Pro and M1 Max: the most powerful chips Apple has ever
built [Press Release], (October 18, 2021), https://www.apple.com/newsroom/2021/10/introd
ucing-m1-pro-and-m1-max-the-most-powerful-chips-apple-has-ever-built/
47. Gallagher, W. Intel CEO hopes to win back Apple with a ‘better chip’, (October
18, 2021), https://appleinsider.com/articles/21/10/18/intel-ceo-hopes-to-win-back-apple-with-
a-better-chip

48. Clover, J. M1 Ultra Doesn’t Beat Out Nvidia’s RTX 3090 GPU Despite Apple’s Charts, (March
17, 2022), https://www.macrumors.com/2022/03/17/m1-ultra-nvidia-rtx-3090-comparison/
49. Max Tech Shorts Channel, M1 Max Mac Studio FULL Teardown & Thermal Throttle Test!
(March 21, 2022), https://www.youtube.com/channel/UCptwuAv0XQHo1OQUSaO6NHw
Chapter 4
Game Console GPUs

When game consoles were first introduced, they used the best processors available,
hoping they would be good enough for at least five years or more (Fig. 4.1).
Many consoles had custom-made coprocessors and accelerators to give them
unique and proprietary features. Also, several of the game developers were
employees of the console makers, so the makers did not have to worry about pleasing
outside developers and often exploited custom features for competitive advantage (Fig. 4.2).
Consoles have had GPU-quality technology since the Nintendo 64 in 1996. As
processors and their support became more complicated, it became apparent that
custom systems (producing tens of millions of units) were no competition for the
commercial-off-the-shelf (COTS) devices (producing hundreds of millions of units).
As a result, the industry shifted to COTS and semi-custom devices based on COTS
designs. This shift began first with GPUs and then included CPUs (Table 4.1).
As will be discovered in this chapter, the console market has a shrinking number
of semiconductor choices and a growing number of console suppliers.

4.1 Sony PlayStation 2 (2000)

In the late 1990s, Ken Kutaragi, Sony’s CEO of American operations, convinced
Sony’s senior management to take a considerable risk on his vision by authorizing
the PlayStation 2 project [1]. The project was risky because Kutaragi wanted to create
an entirely new design instead of using components from the first PlayStation. The
costs of developing new designs and parts can grow unexpectedly, and it was more
than Sony could take on alone. Sony executives told Kutaragi to find a partner. They
suggested he work with Microsoft to produce an online video game business. In 1999,
Kutaragi met with Microsoft’s CEO and chairman, Bill Gates, but no agreement or
partnership came out of it. Kutaragi did not know at the time that Microsoft was
planning to compete with Sony, and Kutaragi’s offer may have inadvertently given
Microsoft insight into Sony’s plans.


Fig. 4.1 Rise and fall of console supplier versus market growth

Fig. 4.2 Number of consoles offered per year over time

Kutaragi’s team took the popular and well-respected 64-bit MIPS R3000 processor
design and doubled it, making it the first 128-bit MIPS processor. Kutaragi named
his creation the Emotion Engine. Toshiba manufactured the 13.5 million transistor
chips in 175 nm at its Oita fab. With the transistor chips, the Emotion Engine could
reach 6.2 GFLOPS performance.
The floating-point processors in the Emotion Engine processed geometric data
(T&L) and sent the transformed vertices to the Sony Graphics Synthesizer (GS)
250 GPU. The CPU also created primitive instructions and sent them to the GPU.

Table 4.1 Game consoles introduced after the GPU


Generation | Company | Model | GPU | Date
Sixth | Sony | PlayStation 2 | Toshiba | 2000
Sixth | Microsoft | Xbox | Nvidia GeForce 3 | 2001
Seventh | Sony | PSP | Sony custom | 2004
Seventh | Microsoft | Xbox 360 | AMD Xenos | Nov 22, 2005
Seventh | Nintendo | Wii | AMD Hollywood | Nov 19, 2006
Seventh | Sony | PlayStation 3 | Nvidia RSX NV47 | 2006
Eighth | Nintendo | 3DS | DMP PICA200 | Jun 6, 2011
Eighth | Nintendo | Wii U | AMD Latte | Nov 18, 2012
Eighth | Sony | PS Vita | Imagination SGX543 | Dec 17, 2012
Eighth | Sony | PlayStation 4 | AMD Liverpool | Nov 15, 2013
Eighth | Microsoft | Xbox One | AMD Durango | Nov 22, 2013
Eighth | Polymega | RetroBox | Intel Pentium GTI | Feb 12, 2017
Eighth | Nintendo | Switch | Nvidia Tegra | Mar 3, 2017
Ninth | Atari | VCS | AMD APU Vega | 2017–2021
Ninth | Sony | PlayStation 5 | AMD RDNA 2 | Nov 5, 2020
Ninth | Microsoft | Xbox Series X and S | AMD RDNA 2 | Nov 10, 2020

The GPU took the vertices data and clipped, where necessary, for the viewing window.
Then it conducted a z-sort to remove any occluded polygons that would not appear
in the viewport. Figure 4.3 shows the block diagram of the PS2.
Toshiba also designed and manufactured a custom GPU for Sony called the GS. It
had a fill rate of 2.4 gigapixels per second and could render up to 75 million polygons
per second. When the GPU used functions such as texture mapping, lighting, and
anti-aliasing, the average game performance dropped from 75 million polygons per
second to only 3–16 million polygons per second—which was still very respectable
for that time.

Fig. 4.3 Sony PlayStation 2 block diagram



With a full diffuse texture map and Gouraud shading (see Chap. 4), the chip could
generate 1.2 Gpix per second (37,750,000 32-bit pixel raster triangles). Moreover,
Sony claimed the chip could generate 0.6 Gpix per second (18,750,000 32-bit
pixel raster triangles) with two full textures (a diffuse map plus specular, alpha, or
other).
The chip had 4 Mbytes of embedded SDRAM and an external dual-channel Direct
Rambus DRAM with 3.2-GB/s bandwidth. Kutaragi claimed it was ten times faster
than any graphics accelerator available. Kutaragi even suggested the next iteration
of the graphics synthesizers might have embedded flash memory, but that never
occurred.
The GPU had 53.5 million 180 nm transistors in a 279 mm2 die built in an
embedded-DRAM CMOS process. The GPU could run at 147.5 MHz and display
up to 1920 × 1080, 32-bits (RGBA: 8 bits each).
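A quick calculation, using only the figures above, shows how tight the 4 MB of embedded memory was for the highest display mode the chip could drive; in practice, PS2 games rendered at far lower resolutions than that maximum. The sketch below is illustrative arithmetic (a single uncompressed color buffer), not a description of any particular game's buffer layout.

```python
# How much of the GS's 4 MB of eDRAM a single 32-bit color buffer would need.
def buffer_mb(width, height, bytes_per_pixel=4):
    return width * height * bytes_per_pixel / (1024 * 1024)

print(f"1920x1080, 32-bit: {buffer_mb(1920, 1080):.1f} MB")  # ~7.9 MB > 4 MB eDRAM
print(f" 640x 480, 32-bit: {buffer_mb(640, 480):.1f} MB")    # ~1.2 MB, leaving room for Z and textures
```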
At the 2001 IEEE International Solid-State Conference in San Francisco, Sony
Computer Entertainment engineers presented a GPU for a development system
named GScube. Created in parallel with the GS-250 PlayStation chip, the GScube
used multiple Emotion Engines and Graphics Synthesizers.
The chip resulted from engineering cooperation between Sony Computer Enter-
tainment, the Sony Corporation Semiconductor Network Company, the Sony Kihara
Research Center, and Altius Solutions, Inc. (Altius merged with Simplex Solutions
in October 2000). Sony Computer Entertainment, the Sony Kihara Research Center,
and Sony Semiconductor developed the chip’s architecture and functional design.
Altius and Sony Semiconductor developed the electrical and physical design of the
chip.
The GScube system’s architecture was an enhanced version of the PlayStation 2
computer entertainment system’s architecture. The GScube system included 16 sets
of graphics units combined with a 128-bit microprocessor and a graphics rendering
processor.
In April 2003, Sony announced the Emotion Engine and graphics synthesizer chips
would be integrated and manufactured in 90 nm at Sony's new fab. The new,
highly integrated semiconductor would have 53.5 million transistors, enhanced low
power consumption, and reduced cost. According to the International Technology
Road map for Semiconductors, mass production with a 90 nm embedded DRAM
process put Sony in a leadership position six months ahead of others in the industry
[2]. Consuming only 8 W, the new integrated chip would have an 86 mm2 die and
run at 250 MHz [3].
Production of the new semiconductor started in the spring of 2003 in Oita TS
Semiconductor, a joint venture between SCEI and Toshiba Corporation. SCEI’s semi-
conductor fabrication facility in Isahaya City, Nagasaki Prefecture, began producing
the chip in the fall.
The PlayStation 2 proved to be even more successful than the first PlayStation
and became the bestselling video game console of all time. As of September 2020,
Sony had sold almost 158 million PlayStation 2 units worldwide (compared to 102
million for the PlayStation 1).

Kutaragi was described as a rare breed in Corporate Japan: an engineer with
vision and marketing smarts; winning the respect of other Sony executives took a
long time. “They used to say I was merely lucky with the PlayStation,” he told the
magazine [4].
Notably, the PlayStation was not without competition, making its success even
more impressive.

4.2 Microsoft Xbox (2001)

When Ken Kutaragi met with Bill Gates in 1999 to propose a partnership, Gates
politely turned him down. Gates privately agreed with Kutaragi that a console would
become the living room entertainment center because a console would be less intim-
idating and more like a consumer device—easy to use, limited, and specific. Candid
and forthright, Kutaragi explained his vision of how the PlayStation 2 could become
a home entertainment center, including using its internet connection to do email. That
frightened Gates and Intel because they planned to make the PC a home entertain-
ment center, but Microsoft struggled to get Outlook established as the email client for
the PC. Moreover, Gates was concerned that game developers would stop developing
PC games in favor of the PlayStation. If game developers did a PC version (a port) of
a game, it usually came out later and did not exploit all the PC’s power capabilities.
However, Gates also had a backup plan in the form of a Microsoft console code
named Midway—a jab referencing the WWII Battle of Midway and the hoped-for defeat of Japan-based Sony.
The idea for project Midway was developed in 1998 and championed by Seamus
Blackley and Kevin Bachus [5]. Otto Berkes, team leader of DirectX, and Ted Hase
later joined the team. They decided the console would use Microsoft’s DirectX API
so PC games could be ported quickly to the console (Fig. 4.4).

Fig. 4.4 Original Xbox team Ted Hase, Nat Brown, Otto Berkes, Kevin Bachus, and Seamus
Blackley. Courtesy of Microsoft

When writing The Making of the Xbox: How Microsoft Unleashed a Video Game
Revolution, Dean Takahashi claimed, “DirectX enabled the PC to take advantage of
the enormous boosts in 3D graphics and keep up with consoles such as the Sony
PlayStation. Were it not for DirectX, Microsoft would have had no foundation to
build a software-based games business” [6].
Blackley had met Gates during a DreamWorks tech demo for Trespasser, the game
that tied into Jurassic Park. The game impressed Gates, and he invited Blackley to
join Microsoft. Gates helped Blackley secure a job at Microsoft as program manager
for Entertainment Graphics.
The team pitched the console product idea to Bill Gates, and Gates approved
it [7]. The product was named the DirectX Box. It would use Windows 2000 and
an x86 CPU. Kutaragi had not revealed all his plans to Gates. When asked which
processor would be in the PS2, Kutaragi only said it was a custom CPU. Gates and
the team thought a PC processor would be superior because Microsoft had a special
relationship with Intel.
While the design for the Xbox was coming together, two factions formed within the
company about whose semiconductors to use. In August 1997, Microsoft completed
the acquisition of WebTV, a company founded in 1995 by Steve Perlman, Bruce
Leak, and Phil Goldman, and established a hardware design and development team
within Microsoft. WebTV later became MSN TV.
When the Midway DirectX box project became known within Microsoft,
Perlman and his team lobbied to design the GPU for the project, but Kevin Bachus
and his team lobbied to use COTS parts. It soon became clear that Microsoft could
not design, build, and debug an in-house part in time and within budget.
In addition to this disagreement, Perlman also wanted to use a different API instead
of DirectX. The project was dealing with too many variables and too little time. Jay
Torborg, the architect and a proponent of the Talisman project, sided with Bachus.
Negotiations began with Nvidia and briefly with ATI.
Microsoft decided to go with Nvidia if the price was favorable. Negotiations
between Nvidia and Microsoft went on for months. Nvidia wanted the deal, Microsoft
wanted the part, and all that stood between them was the price. Eventually, Nvidia
gave in; and Microsoft confirmed that an Nvidia GPU would be in the Xbox. But the
transaction, subsequent communications, and tech support between the two compa-
nies did not go smoothly, and Nvidia would not receive a second deal with Microsoft
after the game console project concluded. Nvidia was unhappy with the margins and
did not pursue any follow-up business.
Nvidia supplied a 233 MHz NV2A, a variant of its Nforce IGP, which consisted
of a GeForce 3 GPU and a Northbridge UMA chip. The GPU had a 128-bit memory
interface capable of running at 200 MHz. The Xbox had four banks of 16 MB DDR
SDRAM. The GPU had half of the memory to avoid conflicts with the CPU and
other delays.
The Xbox used a 733 MHz, 32-bit, 1-GFLOPS Intel Pentium III, less than the
Sony PlayStation 2's 300 MHz, 128-bit, 6.2 GFLOPS Emotion Engine. However, the Xbox had twice
the RAM storage (64 MB rather than the PS2’s 32) [8]. Figure 4.5 shows a block
diagram of the Xbox system.

Fig. 4.5 Xbox block diagram with Nvidia IGP

Based on Nvidia's Kelvin Architecture, the NV2A incorporated Nvidia's Lightspeed
Memory Architecture and nfiniteFX engine. And according to Nvidia's
marketing department, the nfiniteFX engine offered developers the capability to
“program virtually an infinite number of special effects.” Developers could specify
personalized combinations of graphics operations and create their custom effects.
That was made possible by Nvidia’s vertex and pixel shaders (Fig. 4.6).
The GPU had a theoretical performance capacity of 76 GFLOPS. The geometry
engine’s peak performance was 115 million vertices per second and 125 million

Fig. 4.6 Halo, developed by Bungie, was an exclusive Xbox title and credited with the machine’s
success. Courtesy of Microsoft

particles per second. It had four pixel pipelines with two texture units. Each unit
had a peak fill rate of 932 megapixels per second (233 MHz times four pipelines),
a texture fill rate of 1864 megatexels per second (932 MP times two texture units),
and four ROPS.
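The fill-rate figures follow from the simple clock-times-pipelines formula quoted in the text; a small sketch of that arithmetic follows.

```python
# Fill-rate arithmetic for the NV2A, using the figures quoted above.
clock_hz = 233e6
pixel_pipes = 4
texture_units_per_pipe = 2

pixel_fill = clock_hz * pixel_pipes                  # ~932 Mpixels/s
texel_fill = pixel_fill * texture_units_per_pipe     # ~1864 Mtexels/s
print(f"{pixel_fill/1e6:.0f} Mpixels/s, {texel_fill/1e6:.0f} Mtexels/s")
```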
The NV2A graphics processor was an average-sized chip with a die area of
142 mm2 and 57 million transistors. Built on TSMC’s 150 nm process, the
GeForce 3 IGP-integrated variant was compatible with DirectX 8.1.
A popular story at the time had it that Nvidia could not have gotten the Xbox
GPU out on time if not for the 3dfx engineers working at Nvidia. However, those
engineers’ contributions would have had to arrive late in the development of the
nForce derivative of the Xbox chipset. Nvidia did not absorb 3dfx until 2001, which
was during the last 25% of the Xbox project. 3dfx contributions to the nForce2,
which came out in July 2002, seem more likely.

4.2.1 Epilogue

Perlman left Microsoft in 1999 and started Rearden Steel (later renamed Rearden,
Limited), a business incubator for new media and entertainment technology compa-
nies. The members of Perlman’s group who stayed at Microsoft produced the
Microsoft TV platforms and later helped develop the web browsing capabilities
for Microsoft’s next-generation console, the Xbox 360.
Although it was more expensive to develop than Microsoft expected, the Xbox
was a huge commercial success. Still, Microsoft did not make a profit from the
success of the first Xbox, a fact the company tried to hide for a long time. By 2002,
the original team had broken up and gone to other companies or projects.
Seamus Blackley was the cocreator and technical director of the Xbox. Blackley
was one of the people who convinced Bill Gates to risk a potential $3.3 billion in
losses. Blackley argued it was a smart investment for Microsoft to gain the desired
foothold in living rooms, but he was too outspoken about his opinion, so he was booted
out of Microsoft in 2002. Notably, Blackley’s actions were not unlike Kutaragi’s
actions at Sony, which also led to his removal.
Of the other four original Xbox team members, Kevin Bachus left Microsoft
and the industry in 2001. In 2005, he reappeared and took over the start-up console
company Infinium Labs, which became Phantom Labs and struggled to introduce
the much-hyped and therefore controversial Phantom console. It was never officially
launched, and Bachus left shortly after joining. Ted Hase left Microsoft in 2006, and
Otto Berkes stayed at Microsoft and worked on other projects until May 2011. All
the other supporters of the Xbox such as Ed Fries, Cameron Ferroni, J Allard, and
Robbie Bach have left Microsoft.
Nvidia never got over the lowball price given to Microsoft, and the two companies
entered into arbitration over the dispute in 2002 [9]. Nvidia stated in its SEC filing
that Microsoft wanted a $13 million discount on shipments for the 2002 fiscal year.
The two companies arrived at a private settlement on Feb 6, 2003.

In December 2021, Microsoft produced a six-part video of the history of the Xbox [10].

4.3 Sony PSP (2004)

In December 2004, Sony released the PlayStation Portable (PSP) handheld game
console in Japan. Then in March 2005, the company introduced the PSP in North
America. Later that same year, the game console was also released in the PAL regions.
It was the first and only handheld device in the PlayStation line of consoles.
Legend has it that Ken Kutaragi came up with the idea for the PSP on the back of
a beer mat just before E3 2003 [11]. The graphics were a custom rendering engine
plus a surface engine GPU built by Toshiba using Imagination Technologies GPU
IP design. The GPU ran at 166 MHz as indicated in the block diagram in Fig. 4.7.
When the PSP was released, Shinichi Ogasawara led its design team at Sony.
The PSP was the most powerful handheld on the market [12]. The GPU had a 2 MB
VRAM frame buffer, which was quite a lot for such a system at that time. The display
resolution was 480 × 272 pixels, with 24-bit color (8 bits more than other handheld
devices). It had the largest handheld display for the time, with a 30:17 widescreen
that used a TFT LCD.
The system was the first real competitor to Nintendo’s handheld dominance.

Fig. 4.7 Sony PSP block diagram



The PSP's graphics capabilities made it a robust mobile entertainment device. It
was not only a stand-alone device; it could also connect to the internet, the PlayStation
2 and 3, or any computer with a USB interface.
The PSP also had impressive multimedia features such as video playback. There-
fore, several people described the PSP as a portable media player. It used a custom,
miniature optical disc—Universal Media Disc (UMD)—as its primary content
delivery. PSP games and movies were released on those UMDs. The system sold
for $249.99.
A user could update the custom operating system using the internet, a USB flash
drive, or a UMD. There were rumors that the PSP was built on the Linux operating
system, which was a new and sexy alternative at the time to embedded operating
systems and, many thought, a potential rival to Windows. Some still
think that, but contrary to the rumors, the OS was proprietary software stored in the
firmware.
At the E3 2003 game conference, CEO Ken Kutaragi called the PSP “the Walkman
of the twenty-first century.”
However, even devoted Japanese game players directed criticisms at Kutaragi over
problems with the PSP after its initial release in Japan. The criticisms were severe
blows to his reputation. The UMD format was limited, and there were not many games or
videos available. And the UMD drive used a lot of power, limiting the battery (play)
time (Fig. 4.8).
Over 76 million PSPs sold worldwide throughout its eight-year life.
In 2005, however, the 54-year-old Kutaragi’s outspoken personality put him at
odds with Sony’s board of directors (BOD), who were more accustomed to making
decisions through consensus and polite manners. Shockingly frank by Japanese
standards, Kutaragi’s public criticisms of the company’s decisions embarrassed the
company. In a dramatic management reorganization in 2005, the BOD demoted
Kutaragi from CEO and promoted Howard Stringer from Sony’s music and movie
business. Asked what he would do if he were running Sony, Kutaragi said, “The
company must revive its original, innovative spirit when it boasted engineering

Fig. 4.8 Ken Kutaragi at E3 2003 telling the audience about the PSP

finesse with the transistor radio, Walkman, and Trinitron TV. Sony also has been
hurt by its insistence on making its content proprietary,” Kutaragi said [13].
Thus, an era of Kutaragi’s influence at Sony ended, but Sony’s success in the
console business would continue.
At the 2014 Game Developer’s Conference, Ken Kutaragi received a Lifetime
Achievement Award. The award was presented to Kutaragi by Mark Cerny, the lead
architect on the PlayStation 4 console [14].

4.4 Xbox 360—Unified Shaders and Integration (November 2005)

By 2002, Microsoft knew it had created a valuable and desirable brand with the Xbox
franchise. Even though several of the original team members had left, others in the
company began to talk about a follow-up product and discuss its road map. Microsoft
and Sony were the dominant console suppliers at this time and competed to obtain
exclusive deals on new games.
Although the Nvidia IGP in the original Xbox performed well, hard feelings
remained between Microsoft and Nvidia. Therefore, Microsoft contacted ATI. ATI
had a good relationship with the DirectX team and often brought new concepts and
technology to Microsoft.
In addition to using ATI for the GPU, Microsoft would also change the central
processor in the Xbox 360 to the XCPU (a processor code-named Xenon, designed
by IBM). The XCPU's 3.2 GHz triple-core was an advanced multi-processor; each
core could process two threads simultaneously [15] (Figs. 4.9 and 4.10).

Fig. 4.9 Microsoft Xbox 360 block diagram

Fig. 4.10 Microsoft Xbox 360 GPU block diagram

The Xbox 360's CPU had three identical cores that shared an eight-way set-associative,
1-Mbyte L2 cache and ran at 3.2 GHz. Each core contained a
complement of four-way SIMD vector units. IBM customized the CPU’s L2 cache,
cores, and vector units.
The front-side bus ran at 5.4 Gbit per pin per second with 16 logical pins in
each direction. That provided a 10.8-GB/s read bandwidth and a 10.8-GB/s write
bandwidth. The FSB design combined with the CPU L2 provided the additional
support needed for the GPU to read directly from the CPU L2 cache.
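The quoted front-side bus bandwidth is just the per-pin rate times the pin count, converted from bits to bytes; a one-line check of that arithmetic follows.

```python
# FSB bandwidth check: 5.4 Gbit/s per pin x 16 pins per direction / 8 bits per byte.
per_pin_gbit = 5.4
pins_per_direction = 16
bandwidth_gb_s = per_pin_gbit * pins_per_direction / 8
print(f"{bandwidth_gb_s:.1f} GB/s each way")   # 10.8 GB/s read plus 10.8 GB/s write
```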
The CPU used Direct3D (D3D) compressed data formats—the same formats used
by the GPU. D3D allowed the user to store compressed graphics data generated by
the CPU directly in the L2. The compressed formats allowed an approximate 50%
savings in required bandwidth and memory size.
Xbox 360 Gave Game Developers a Cross-Platform Advantage
The GPU operated at 500 MHz and had 48 combined vector and scalar shader ALUs.
The GPU shaders’ dynamic allocation was of significant interest; there was no distinct
vertex or pixel shader—the hardware automatically adjusted to the load on a fine-
grained basis. The dissolution of unique shaders occurred before the public release
of DirectX 10 (known in Direct3D 10 as Shader Model 4.0). Microsoft described the
GPU in the Xbox 360 as compatible with the High-Level Shader Language of D3D
9.0 with extensions. Indeed, those extensions gave the Xbox 360 a year’s head start
on the unified shaders capability that would appear in Windows Vista’s DirectX 10
in late 2006.
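A minimal sketch of the idea behind unified shaders: instead of fixed pools of vertex and pixel hardware, a single pool of ALUs is handed whichever work is queued. This is a conceptual model only, not ATI's actual Xenos arbitration logic; the proportional-split policy is an assumption chosen for illustration.

```python
# Conceptual unified-shader scheduler: one pool of ALUs serves both work types.
# Not ATI's actual Xenos arbitration logic--just the load-balancing idea.
from collections import deque

ALU_COUNT = 48                      # the Xenos's pool of combined shader ALUs

def schedule(vertex_q, pixel_q, alus=ALU_COUNT):
    """Each cycle, split the ALU pool in proportion to the queued work."""
    total = len(vertex_q) + len(pixel_q)
    if total == 0:
        return 0, 0
    v_share = round(alus * len(vertex_q) / total)
    v_share = min(v_share, len(vertex_q), alus)
    p_share = min(alus - v_share, len(pixel_q))
    for _ in range(v_share):
        vertex_q.popleft()
    for _ in range(p_share):
        pixel_q.popleft()
    return v_share, p_share

# A geometry-heavy moment: most ALUs go to vertex work; pixel-heavy frames flip the ratio.
vq, pq = deque(range(300)), deque(range(60))
print(schedule(vq, pq))   # e.g., (40, 8)
```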

Although Sony had its game studio, Microsoft was in a contest with Sony and
wanted to attract the independent game developers away from the PS2 and toward the
Xbox. Thanks to the success of the PlayStation most game developers wanted to get
on the PS2 platform. It takes years and millions of dollars to develop a game. If Sony
attracted the game developers first, it could be a year or two before the developers
could come out with an Xbox version and even later for the PC. If Microsoft could
offer the game developers advanced information and a development platform for the
next generation of PCs, the team could offer developers a double payoff that Sony
could not match.
The gambit paid off, and Microsoft had 14 games available in North America and
13 in Europe at its launch; Sony only had 10. Microsoft’s premier game, Halo 3, was
released two years later.
On May 12, 2005, Microsoft announced the Xbox 360 on MTV during MTV
Presents: The Next Generation Xbox Revealed; The Xbox360 shipped on Nov 22,
2005.

4.4.1 The Xbox 360 GPU

The GPU in the Xbox 360 was a customized version of ATI’s R520, a revolutionary
design at the time. In keeping with the “X” prefix and echoing the IBM processor,
ATI called this GPU the Xenos.
The ATI Xenos, code named C1, had 10 MB of internal eDRAM and 512 MB of
700 MHz GDDR3 RAM. ATI’s R520 GPU used the R500 architecture, made with
a 90 nm production process at TSMC, a die size of 288 mm2 , and a transistor count
of 321 million. See Fig. 4.10 for a block diagram of the system.
The GPU’s ALUs were 32-bit IEEE 754 floating-point compliant (with typical
graphics simplifications of rounding modes), denormalized numbers (flush to zero
on reads), exception handling, and not-a-number handling. They were capable of
vector (including dot product) and scalar operations with single-cycle throughput—
that is, all operations issued every cycle. That allowed the ALUs a peak processing of
96 shader calculations per cycle while fetching textures and vertices.
The GPU had eight vertex shader units supporting the VS3.0 Shader Model of
DirectX 9. Each was capable of processing one 128-bit vector instruction plus one
32-bit scalar instruction for each clock cycle. Combined, the eight vertex shader units
could transform up to two vertices every clock cycle. The Xenos was the first GPU
to process 10 billion vertex shader instructions per second. The vertex shader units
supported dynamic flow control instructions such as branches, loops, and subroutines.
One of the significant new features of DirectX 9.0 was the support for floating-
point processing and data formats known as 4 ×-32 float. Compared with the integer
formats used in previous API versions, floating-point formats provided much higher
precision, range, and flexibility.
The Xenos introduced the precursor to the third era of GPUs—the unified shader.
ATI/AMD incorporated it in its TeraScale microarchitecture for PCs in 2007.

An essential requirement for the Xbox 360 was supporting a 720p (progressive
scan), high definition (HD), 16:9 aspect ratio screen. That meant that the Xbox 360
needed a significant and reliable fill rate.
Red Ring of Death
Early in the life of the Xbox 360, a problem appeared: components in the system
were overheating. When this happened, temperature sensors in the system would
attempt to alert the user by flashing a ring of red around the power button. Usually,
the red ring was a death signal, and the system died. Gamers called it the Red Ring
of Death.
To mitigate the issue, Microsoft extended the Xbox 360’s warranty to three
years for all models released before the Xbox 360 S. (Microsoft's original Xbox 360
warranty period was 90 days from the date of purchase.)

4.4.2 The Many Versions of Xbox 360

The Xbox 360 launched in November 2005 with two configurations: the Xbox 360
(known as the 20 GB Pro or Premium) and the Xbox 360 Core. The Xbox 360 Arcade
later replaced the 360 Core in October 2007. Next came a 60 GB version of the Xbox
360 Pro on August 1, 2008. On August 28, 2009, Microsoft discontinued the Pro
package.
There were two significant hardware revisions: the Xbox 360 S (also referred
to as the Slim), which replaced the Elite and Arcade models in 2010, and the Xbox 360 E, which followed in 2013. All told across the
lifetime of the Xbox 360—from late 2005 to early 2016—the company introduced
nine different versions ranging in price from $400 down to $200.

4.4.3 Updated Xbox 360—Integrated SoC (August 2010)

Microsoft wanted to reduce the costs of the Xbox 360 and boldly authorized Rune
Jensen, Nick Baker, and Jeff Andrews to design an SoC using IP from ATI and IBM.
It was a bold move because Microsoft did not have much depth in semiconductor
design or manufacturing, except for the design team from WebTV. Considering the
battles over who should produce the GPU for the first Xbox, this authorization was
a surprising move. Nonetheless, the company moved forward with this plan, and
its success exceeded all expectations. Baker and the team did such a good job that the
updated AMD APU (referred to as a shrink, going from 65 to 45 nm) ran a lot faster.
Microsoft had to slow it down to keep within the original specification range. Console
manufacturers guarantee a stable (i.e., no changes) platform to game developers so
all users get the same experience regardless of when they bought the console.

Months after launching the revised Xbox 360, the company revealed the console’s
integrated CPU and GPU (APU) configuration features, illustrated in Fig. 4.11. The
Slim 360 would have a 45 nm SoC code named Vejle. The chip had 372 million transistors,
was 50% smaller, and drew 60% less power than the original CPU/GPU combo.
Integrating the significant components meant the console would require fewer
chips, heatsinks, and fans (see chip layout in Fig. 4.12). Therefore, the console
could use a smaller motherboard and power supply. Reducing power and heat also
decreased the chance users would experience the red circle of death.
IBM manufactured the SoC at its advanced 45 nm SOI fab in East Fishkill,
NY [16].
With all the processors integrated into a smaller process node, efficiency improved,
and with the clocks turned down to save power, the chip ran even cooler. The Vejle
SoC also used an FSB replacement block with the same bandwidth as the bus used
by the stand-alone CPU and GPU, which kept the system from being faster than its

Fig. 4.11 Microsoft’s Xbox 360 Vejle SoC block diagram



Fig. 4.12 Microsoft Xbox 360 SoC chip floor plan. Courtesy of Microsoft

predecessors [17]. Microsoft sold over 84 million Xbox 360 consoles throughout its
life from 2005 to 2016.

4.5 Nintendo Wii (November 2006)

When Nintendo decided to develop the Wii, Satoru Iwata, president of Nintendo, told
the engineers to avoid competing on graphics and power with Microsoft and Sony
and to aim at a broader demographic of players with unique and novel gameplay.
Legendary game developer Shigeru Miyamoto (Mario, The Legend of Zelda, Donkey
Kong, Star Fox and Pikmin) and Genyo Takeda led the development of the new
console and gave it the codename Revolution.
Initial Wii models included full backward compatibility support for the Game-
Cube. Later in its lifecycle, two lower-cost Wii models were produced: a revised
model with the same design as the original Wii but with the GameCube compatibility
features removed, and the Wii Mini, a compact, budget redesign of the Wii that further
removed features, including online connectivity and SD card storage.

Fig. 4.13 Nintendo Wii Hollywood chip

In 2006, Nintendo updated the Wii's internal hardware, a modernized derivative
of the GameCube processor designed by ArtX. Hollywood was the code name of the
GPU in the Wii and Wii U (Fig. 4.13).
The Hollywood also contained an ARM926 core, unofficially nicknamed Starlet.
That processor managed many of the input/output functions, and the wireless
networking, USB, and optical disc drive hardware.
Compared with the other systems in the seventh-generation console, the Wii was
behind in graphics capabilities. Most significantly, it lacked the high-definition video
output that was a standard feature of the competitors. However, the Wii did have the
Nintendo Wi-Fi Connection service that allowed supported games to offer online
multi-player and other features.
The Wii introduced new motion control gaming, and it was such a success many
competitors copied the system. Nintendo sold over 100 million units. In October
2013, Nintendo ceased production of the Wii, its first motion-control console. The
Wii U that was later released was incompatible with the Wii’s motion controllers and
games.

4.6 Sony PlayStation 3 (2006)

The Sony PlayStation 3 ended up using an Nvidia GPU, based on the company’s
Curie architecture, but that was not the original plan. Sony had planned to use a
GPU design jointly developed by Sony and Toshiba, as they had done in the
PlayStation 2. However, Sony was pushing too many variables at once.
First, the company decided to abandon the 128-bit MIPS-based Emotion Engine
processor used in the PlayStation 2 in favor of a newly developed multi-core
IBM processor known as the Cell.

The Cell processor had an uncommon bus and instruction set. Therefore, the GPU
Sony picked had to match the Cell, and a unique software driver had to be written.
That was not impossible, but it was time-consuming and had a steep learning curve.
At the same time, Toshiba was experiencing difficulties with its GPU design.
Sony had a deadline based on established introduction dates for the new console,
and that date was rapidly approaching. Kutaragi and the senior management made
the difficult decision to cancel Toshiba’s GPU and went looking for a replacement.
They decided Nvidia had the best GPU and would perform well for the next five
to seven years, but could it be married to the Cell? Having lost the Xbox contract,
Nvidia was anxious to re-establish itself in the console market. Nvidia wanted to
be involved in all gaming production. Therefore, the company put aside its newest
GPU design project and accepted Sony’s challenge. Sony announced the partnership
in December 2004. People familiar with the Cell knew Nvidia was heading into a
challenge; Nvidia knew it too but realized there would be no gain without risk [18].
The switch from Toshiba to Nvidia, combined with other problems inherent
in shifting to a new CPU, cost Sony time, and the company missed the desired
introduction date. Because of the missed deadline, Microsoft beat Sony to market
with the AMD-powered Xbox 360 by almost a year.
Nvidia took an existing design, the NV47, and adapted its I/O structure to match the
IBM Cell processor and XDR memory system. It named this repurposed design
the G70/G71, which later became known as the RSX Reality Synthesizer. The
Cell had a unique OS and a somewhat primitive graphics API, PSGL (OpenGL
ES 1.1 plus Nvidia Cg). Nvidia had to develop a companion chip to translate from
the PCI structure to the Cell's FlexIO bus, learn a new OS and API, and write a
driver that would work in that environment. It was not just a simple "Hello World"
software hack; it had to be a high-performance driver that exposed everything possible
in the GPU to potential apps, that is, games. Somehow, Sony did it. The PS3 was a
very popular console, and Sony sold over 90 million units. That was more than
Microsoft's 88 million, despite Microsoft's nearly year-long head start.
Sony built the semi-custom, 300-million-transistor RSX GPU in its eight-layer 90 nm
fab. The combination of the RSX GPU and the Cell CPU required almost 600 million
transistors, the most for any console to date.
Dr. Lisa Su was a senior engineer on the IBM Cell project in 2006 and worked
with Nvidia and Jensen Huang. Ironically, she would take over AMD eight years later
and become Huang and Nvidia’s biggest competitor, and Sony’s processor supplier
for PlayStation (Fig. 4.14).
The RSX GPU ran at 550 MHz and had 256 MB of local GDDR3 memory at
700 MHz, with separate vertex and pixel shader pipelines. There were 24 parallel
pixel shader ALU pipes, each performing 5 ALU operations per cycle, and eight
parallel vertex pipelines running at 500 MHz.
The GPU could process 136 shader operations per cycle compared to the 53
shader ops per cycle processed by Nvidia’s GeForce 6 GPU. The RSX was clearly
more powerful than even two GeForce 6 GPUs in shader performance, and it only
drew 80 W. At Sony’s pre-show press conference at E3 in 2005, Jensen Huang,

Fig. 4.14 IBM technologist Dr. Lisa Su holds the new Cell microprocessor. The processor was
jointly developed by IBM, Sony, and Toshiba. IBM claimed the Cell provided vastly improved
graphics and visualization capabilities, in many cases 10 times the performance of PC processors.
Courtesy of Business Wire

Nvidia’s founder, and CEO said the RSX was twice as powerful as the GeForce 6800
Ultra [19].
The GPU had eight vertex shaders (10 ops per clock each), making it capable
of 40 GFLOPS of vertex floating-point operations when running at 500 MHz. The
RSX also had 24 TMUs.
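As a back-of-envelope check on those figures (an illustration, not Sony's published derivation):

8 vertex pipes × 10 ops per clock × 0.5 GHz = 40 GFLOPS of vertex throughput
24 pixel pipes × 5 ALU ops per pipe = 120 ops per cycle, most of the quoted 136 shader operations per cycle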
The RSX GPU offered dual-screen output, with a resolution of up to 1080p
(1920 × 1080) for both screens. Its 128-bit pixel precision provided scenes with
high-dynamic-range (HDR) rendering.
Sony discontinued the PS3 in June 2017, and it was the last PlayStation Nvidia
would help create. At that time, it looked like Nvidia was out of the game console
business.

4.7 Nintendo 3DS (June 2011)

In 2011, Nintendo introduced its 3DS, a 3D handheld gaming system that did not
require 3D glasses.
The Nintendo’s 3DS’s most distinguishing and innovative feature (shown in
Fig. 4.15) was its 3D top screen, which used a clever feat of engineering to achieve
the illusion of depth.
Sharp made the 3D display, which measured 3.5 in. diagonally with an 800 × 240
resolution (400 × 240 per eye). The display used an integrated LCD-based parallax
barrier panel sandwiched to the back of the color LCD, which rapidly alternated
between left and right images.

Fig. 4.15 Nintendo 3DS handheld game machine. Courtesy of Nintendo

On one side of the glass was a conventional color TFT panel, whereas the other
side had a monochrome LCD element. The monochrome LCD parallax barrier in the
back acted as a gate that allowed or denied light to pass through some screen regions.
Switching the gate in the correct patterns at a high frequency created the illusion of
3D depth.
Sharp also made the lower, secondary display that doubled as a touch screen.
The significant subsystem in the 3DS was the applications processor; it cost
approximately $10 and was manufactured by Sharp using the TSMC fab. The GPU
in the AP was the DMP 286 MHz PICA200. It could drive 800 megapixels per second
at 200 MHz. It also offered hardware transformation of 40 million triangles per second
at 100 MHz and vertex lighting performance of 15.3 million polygons per second at
200 MHz. The frame buffer was 4095 × 4095 pixels.
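Dividing those throughput numbers by the quoted clocks gives the per-clock rates (a simple check, using only the figures above):

800 megapixels per second ÷ 200 MHz = 4 pixels per clock
40 million triangles per second ÷ 100 MHz = 0.4 triangles per clock (one triangle every 2.5 clocks)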
The GPU had DMP’s Maestro-2G technology with per-pixel lighting, fake
subsurface scattering, procedural texture, refraction mapping, subdivision primitive,
shadow, and gaseous object rendering. Refer to the block diagram in Fig. 4.16.
The 3DS had 2 Gbytes of RAM, eight times the 256 Mbytes in the DSi.
The 3DS subsystem also had a microelectromechanical system gyroscope and
an accelerometer, which allowed the game system to operate using motion-sensitive
control.
The tiny but mighty gaming machine had three cameras to deliver 3D photography.
The stereo vision system employed two parallel VGA cameras in a module and a third
VGA camera. Users could record and view moments from sporting events, birthday
parties, and holiday events or create original 3D productions to show others.

Fig. 4.16 DMP PICA200 GPU

Because the 3DS had larger, more sophisticated displays, a higher-performance
application processor, and new features such as a gyroscope and an accelerometer,
it needed a higher-capacity battery than the DSi. The original DSi battery was an
840 milliampere-hour (mAh) model, and the battery in the 3DS was 1300 mAh.
On March 27, 2011, the 3DS went on sale in the United States for $249.99, and
15 games were made available for purchase on the same day. Nintendo slashed the
price to $169.99 in July because the reported sales had not met expectations.
Nintendo was caught by surprise when the industry shifted toward casual
smartphone gaming, and the iPad became one of the fastest-growing gaming
platforms.
Nintendo sought to keep its loyal customers, so it offered 20 free downloadable
games to any 3DS owner who purchased the console at the original price.
Nintendo discontinued the 3DS in September 2020 after selling over 76 million
units.

4.8 Sony PS Vita (December 2011)

Sony unveiled the Vita at the E3 Expo in June 2011. It was a follow-up to its PlaySta-
tion Portable (PSP) and a competitor to the Nintendo 3DS. Although Nintendo and
Sony still held a large share of the portable gaming market, smartphones were rising
in popularity as handheld gaming platforms.
Sony equipped the PlayStation Vita with a 5-in. OLED multi-touch screen that
could display 960 × 544 pixels with 16 million colors. The Vita also had an
innovative, touch-sensitive back panel expected to usher in new gameplay, but it
didn’t.
The PlayStation Vita had all the features of modern luxury smartphones, including
a GPS device and a three-axis gyroscope, accelerometer, and electronic compass.

Fig. 4.17 Sony PlayStation Vita. Courtesy of Sony

Sony even offered two different models: a version with 3G and Wi-Fi and a Wi-Fi-
only model. The PS Vita is shown in Fig. 4.17.
The PlayStation Vita was the first handheld device to use a custom, quad-core
ARM processor. Sony hoped the processor would differentiate Vita from other
handheld gaming consoles and gaming experiences offered on tablets and high-end
smartphones.
The PS Vita’s increased performance came from the highly integrated Sony
CXD5315GG processor composed of a quad-core ARM Cortex-A9 device with an
embedded Imagination SGX543MP4+ quad-core GPU. Sony used IBM and Toshiba
to manufacture the SoC.
The GPU had four pixel shaders, two vertex shaders, eight TMUs, and four ROPS.
The bus bandwidth was an impressive 5.3 GB per second. The GPU also had a pixel
fill rate of 664 million pixels per second and a maximum of 33 million polygons per second
(T&L). Figure 4.18 shows a block diagram of the design.
The GPU had texture compression, hardware clipping, morphing, and hardware
tessellation, and it could evaluate Bezier and B-spline (NURBS) surfaces, which was
overkill for a handheld game machine.
In 2012, Dick James of Chipworks took X-rays of the chip and discovered it was a
five-die stack. The processor die sat at the base of the stack, facing a Samsung 1 Gb
wide-I/O SDRAM, and the top three dies were two Samsung 2 Gb mobile DDR2
SDRAMs separated by a spacer die. The base die was ~250 µm thick, and the others
had thicknesses of ~100–120 µm [20].

Non-uniform rational basis spline (NURBS) is a mathematical model using basis splines (B-splines) for representing curves and surfaces. It offers great flexibility and precision for handling both analytic and modeled shapes.

Toshiba also supplied the device with multi-chip memory. Qualcomm provided
the Vita with an MDM6200 HSPA+ GSM modem.

Fig. 4.18 Imagination technologies’ SGX543 IP GPU

Sony released the PS Vita in 2012 and sold over seven million units before
discontinuing the device in March 2019.

4.9 Eighth-Generation Consoles (2012)

Beginning with the eighth generation of consoles, the three leading suppliers
(Microsoft, Nintendo, and Sony) selected AMD for their graphics. Nintendo used an
IBM PowerPC (code named Espresso) for the CPU, whereas Microsoft and Sony chose AMD's
integrated CPU–GPU APU chips. Because the APUs in the Microsoft Xbox One and
the Sony PlayStation 4 were relatively similar, they are examined together in the
following section. This convention will be followed for the section discussing the
ninth-generation consoles too.
The APUs used in the eighth-generation consoles from Microsoft and Sony contained
20 GCN GPU compute units (two of which were for redundancy to improve manufacturing
yield). As shown in Fig. 4.19, GPU cores dominated the chip's die area (the big
bluish glob in the middle; memory is to its right).
Manufactured using TSMC's 28 nm process, AMD's APU and the Durango GPU
used the 2013 GCN 1.0 architecture. It was not a large chip, with a die size of
363 mm2 (approximately ¾ of an inch on a side). The APU had five billion transistors

Fig. 4.19 CPUs plus caches take up approximately 15% of the chip area. The GPUs (center) take up about 33% of the 348 mm2 die area; the rest of the chip area was the memory. Courtesy of Wikipedia

and was compatible with DirectX 11.2 (Feature Level 11.0). It included 768 shading
units, 48 texture-mapping units, and 16 ROPS. For GPU compute applications, the
Durango GPU could use OpenCL version 1.2. A block diagram of the chip is shown
in Fig. 4.20.
The AMD APU, shown in Fig. 4.21, replaced at least three chips from the previous
generation. That replacement reduced costs, increased performance, decreased power
consumption, and gave developers a platform they recognized from the PC—an x86
CPU and a DirectX GPU.
The GPU had 18 operational compute units and could produce a theoretical peak
performance of 1.84 TFLOPS.

Fig. 4.20 Block diagram of AMD Liverpool (PS4) and Durango (Xbox One) APU

Fig. 4.21 Tiny but mighty, AMD’s Jaguar-based APU powered the most popular eighth-generation
game consoles. Courtesy of AMD

Therefore, these changes made by leading console suppliers created another inflection point in the gaming industry.

4.10 Nintendo Wii U (November 2012)

Nintendo developed the Wii U video game console as the successor to the Wii and
released it to the public in late 2012. The Wii U was the first eighth-generation game
console and would compete with Microsoft's Xbox One and Sony's PlayStation 4.
The system used a 1.24 GHz tri-core IBM PowerPC Espresso CPU and had
2 GB of DDR3 RAM with internal flash memory (8 GB in the Basic Set, 32 GB in
the Deluxe Set).
Based on AMD’s TeraScale 2 architecture, the Wii U’s GPU used an AMD
550 MHz Radeon-based unified shader architecture (code named Latte). It had
160 shaders, 16 TMUs, eight ROPS, and it ran memory at 800 MHz.
The GPU had 32 MB of high-bandwidth eDRAM and supported 720p with 4x MSAA
(or 1080p). That enabled the GPU to render a frame in a single pass, with output via
HDMI and component video. Nintendo had the chip fabricated in 40 nm at Renesas. It
had 880 million transistors and a small 146 mm2 die.
Although the GPU was compatible with DirectX, the Wii U used a proprietary OS
(code named Café) with a proprietary 3D graphics API known as GX2. With AMD's
support, Nintendo designed the API to be as efficient as the original GX API used in
the Nintendo GameCube and Wii systems. The API adopted features from OpenGL
and the AMD r7xx series GPUs. Nintendo referred to the Wii U's graphics processor
as GPU7.
In January 2017, Nintendo discontinued production of the Wii U because the
console was not considered a success. As of December 31, 2016, the company had
sold only 13.56 million units of the Wii U, which paled in comparison to the over

101 million units sold of the first Wii console. Although this appeared to be a disap-
pointment, matching the original Wii's success was always going to be a challenge for the Wii U.
The Wii U was more like a handheld than the multi-user Wii console. It had minimal
third-party support, was underpowered relative to the competition, and didn’t have
a hard drive.

4.11 CPUs with GPUs Lead to Powerful Game Consoles (2013)

When Sony introduced the PlayStation 4 (PS4), Microsoft launched the Xbox One;
both were based on custom versions of AMD's Jaguar APU.
Sony used an eight-core AMD x86-64 Jaguar 1.6 GHz CPU (2.13 GHz on PS4
Pro) APU with an 800 MHz (911 MHz on PS4 Pro) GCN (graphics core next) Radeon
GPU.
Microsoft used an eight-core 1.75 GHz APU (two quad-core Jaguar modules),
and the X model had a 2.3 GHz AMD eight-core APU. The Xbox One GPU ran at
853 MHz, the Xbox One S at 914 MHz, and the Xbox One X at 1.172 GHz, using
AMD Radeon GCN architecture.
The integrated GPU (iGPU) was the most popular device in terms of unit ship-
ments. It was cost-effective (free) and powerful enough (good enough) for most tasks
[21]. The iGPU had even found use in performance-demanding workstation market
applications.
The iGPU was the dominant GPU used in PCs; it was found in 100% of all x86-based
game consoles; 100% of all tablets, smartphones, and Chromebooks; and about 60% of
all automobiles. As of 2020, over four billion iGPU units had been sold.
GPUs are incredibly complicated and complex devices, with hundreds of 32-bit
floating-point processors (called shaders) made with millions of transistors. It was
because of Moore’s law that such density was realized. Every day one engages with
multiple GPUs: in one’s phone, PC, TV, car, watch, game console, and cloud account.
The world would not have progressed to its current technological level without the
venerable and ubiquitous GPU.

4.12 Nvidia Shield (January 2013–2015)

The Shield was Nvidia’s Android-based, Tegra-powered, handheld game controller


with an attached screen. It was a dedicated design, but the configuration was not new.
It looked like a game controller that held a mobile phone or a tablet.
The Nvidia Shield Portable (Nvidia Shield or NSP) consisted of a console-like
game controller with a dedicated, unattachable, 5-in., 720p, multi-touch display

Fig. 4.22 Nvidia’s Shield


game controller/player.
Courtesy of Nvidia

(Fig. 4.22). It was powered by Nvidia’s latest ARM-based processor, Tegra 4, which
ran Google’s Jellybean version of Android.
It was an appealing product and was irresistible to pick up because of its controller
appearance and flip-up screen, but was that enough to make it disruptive? Mobile
phones certainly existed before Apple introduced the iPhone, yet not many people
would dispute the iPhone was a transformational and disruptive product.
In Professor Clayton M. Christensen’s 1997 best-selling book, The Innovator’s
Dilemma, he separated new technology into sustainable technology and disruptive
technology. Disruptive technology often has novel performance problems, appeals
to a limited audience, and may lack a proven practical application.
The use of disruptive as a description of the Nvidia Shield or the iPhone is hardly
adequate. It is doubtful the Shield would have any more performance problems than
any other new product would have. The same can be said about the iPhone. Its game
player audience was limited but consisted of 30 million consumers overall as of 2015,
and the number of gamers for all platforms approached 300 million. Therefore, scale
must be considered when determining what should be considered limited. Moreover,
the iPhone and the Shield certainly proved practical.

4.12.1 A Grid Peripheral?

Announced at its February 2012 GPU Technology Conference in San Jose, Nvidia’s
Grid was an on-demand gaming service. Nvidia claimed it would provide advantages
over traditional console gaming systems and was an “any-device” gaming service.
According to Nvidia, it offered high-quality, low-latency gaming on the PC, Mac,
tablet, smartphone, and TV.
Nvidia combined its game development on an x86-based PC with a GeForce
graphics add-in board (AIB) and a non-x86 streaming device. This technique used
the non-x86 device as a thin client (TC) so that the TC could send commands and
display the streamed results. Except for the latency, a network function, the user’s

Fig. 4.23 Nvidia’s grid.


Courtesy of Nvidia

experience was as if the powerful AIB was in their local device. Nvidia schematized
the concept in the diagram shown in Fig. 4.23.
The diagram in Fig. 4.23 does not include the equipment that formed the green,
cloud-like Grid in the center.
That green cloud could be a generic server with an Nvidia GeForce AIB, an Nvidia
low-latency encoder, and fast frame buffer capture technology. According to Nvidia’s
calculations, Grid would deliver the same throughput and latency as a console.
The Grid could also connect to the Shield. Furthermore, the Shield could drive a
large screen TV display through its HDMI output. This feature made the Grid even
more similar to a console.

4.12.2 But Was It Disruptive?

No, it was not. Interesting? Yes. Clever? Yes. Well-executed? Time would tell;
Nvidia's track record suggested it would be, but ultimately it was not.
Nvidia’s charismatic president and founder, Jensen Huang, said he built the Shield
because no one else was making a mobile device like it. He told GamesBeat in an
interview:
We are not trying to build a console. We are trying to build an Android digital device in the
same way that Nexus 7 enjoys books, magazines, and movies. That is an Android device for
enjoying games. It is part of your collection of Android devices. The reason why I built this
device is because only we can build this device [22].

Because only Nvidia could build the device that would only run on Nvidia Grid
servers and the Nvidia Tegra processor, the company had established a proprietary
closed garden. If other game delivery services such as Amazon, Google, Steam, or
Origin adopted the Grid, it would be more universal. However, until other providers
embrace Grid, the acceptance of Shield would be less than ideal. The only other
option to expand the Grid’s popularity was for Nvidia to become a streaming game

Fig. 4.24 An Nvidia Shield look-alike, the MOGA Pro controller with smartphone holder. Courtesy of MOGA

service like Microsoft and Sony. That was a potential licensing nightmare and would
drain enormous resources from the company—Huang was unlikely to adopt such a
plan.
Assuming the content distribution question could be resolved, the Shield adoption
equation would become one of economics.
At the time, Huang indicated he was not interested in the hardware-subsidized
game console model, nor should he be since he did not own any content. Therefore,
the Shield had to sell for cost-of-goods (COG) plus a margin (Fig. 4.24).
With 5.3+ in., 1280 × 768 screens, smartphones would challenge the Nvidia
Shield’s 5-in., 1280 × 720 screen. Although Huang said, the Shield would be “part
of your collection of Android devices.” A lightweight controller that attached to a
smartphone would have been a better choice. Moreover, smartphones could drive an
HDTV too.
Nvidia discontinued the handheld controller with design Shield in 2015.
Disruptive to Nvidia’s Business?
By introducing the Shield, Nvidia put itself in competition with its customers who
used Nvidia products in their console products. Nvidia argued this was incorrect
because no one else made a device like the Shield. Sony, one of Nvidia’s customers
at the time, would have undoubtedly seen the competition for console and handheld
players that came with the release of the Shield. Nvidia argued that Sony offered games
for its handhelds and consoles, whereas the Shield was designed as an Android game
player. The release of the Shield only enhanced the existing competition between
Sony and Android. Nvidia quietly withdrew the product after it introduced the Shield
tablet.

4.13 Sony PlayStation 4 (November 2013)

Starting in 1994 with its first CD-ROM-based console using a 32-bit MIPS CPU,
Sony has always been bold in its processor selection for the PlayStation. It followed
that selection with the 128-bit (MIPS-based) Emotion Engine PS2 and the first DVD-
based console in 2000. The PS3 came out in 2006 as the first machine with a Blu-
ray player and the new IBM-Power-based SIMD CPU structure called the Cell.
In 2013, Sony introduced the PS4 with an AMD x86-based architecture and a
powerful embedded GPU, a design that resembled a PC. See Fig. 4.25 for Sony's PlayStation
introductions and life span.
The common thread that connects these designs is the lack of backward compati-
bility. That is a brave step for a company to take, and Sony should have been credited
with being so courageous. But instead, they were criticized for it. Why? You don’t
need backward compatibility—you never did. If you have old games, you probably
also have an old machine, so use it. Who wants to play clunky old games with the
availability of new, high-quality performance games?
Rumors circulated in May 2011 that Sony chose Nvidia to create the processor
for the PS4. These rumors were inflamed when a hiring requisition at Sony (SCEA)
became public and revealed that Sony was looking for someone with (Nvidia) CUDA
experience. That could have simply been related to the game development of the PS3,
which had a 136 shader G70 GPU (code named RSX). Regardless, the information
encouraged people to speculate.
The rumors and speculation ended in January 2012, when Forbes reported AMD
and Sony were in discussions for the PlayStation 4 [23].
The PlayStation 4 used an AMD Jaguar-based APU. The system had 8 GB of
GDDR5 unified memory, 16 times the amount found in the PS3. The PS4’s GDDR5
memory could run up to 2.75 GHz (5500 MT per second) and had a maximum
memory bandwidth of 176 GB per second.

Fig. 4.25 Sony’s eighth-generation PlayStation 4 with controller changed the design rules for
consoles. Courtesy of Sony Computer Entertainment

The PlayStation 4 Pro freed up an extra half gigabyte of RAM for game developers
by adding 1 GB of memory for nongaming uses like multitasking and reports.
The console also contained an audio module offering in-game chat rooms and
audio streams, and all PlayStation 4 models supported high-dynamic-range color
profiles.
Leveraging the game streaming technology acquired from Gaikai in July 2012
(for $380 million), the PS4 allowed games to be downloaded and updated in the
background or standby mode. The system also enabled digital titles to be playable
while downloading.
Vita owners could also play PS4 games remotely. Sony saw the PS Vita as the
companion device for the PS4. Sony had long-term visions to make most PS4 games
playable on PS Vita and transferrable over Wi-Fi.
A new application called the PlayStation App enabled mobile devices like
iPhones, iPads, and Android-based smartphones and tablets to become second
screens for the PS4. However, this ability depended on the version of the OS that
users had; without the correct OS version, users could not use the PlayStation App.

4.14 Microsoft Xbox One (November 2013)

An AMD Jaguar APU also powered the Xbox One. It used 8 GB of DDR3 RAM,
with a memory bandwidth of 68.3 GB per second instead of the GDDR5 Sony used
in the PS4.
The Xbox One memory subsystem also had 32 MB of embedded static RAM
(ESRAM) with a memory bandwidth of 109 GB per second. The ESRAM was
capable of a theoretical memory bandwidth of 192 GB per second, and 133 GB
per second while using alpha transparency blending. A block diagram of the Xbox
One can be seen in Fig. 4.26.
The Xbox One was Microsoft’s successor to the 2005 ATI Xenos GPU-powered
Xbox 360, introduced eight years earlier. Because of the gaming PC’s much shorter
product cycle, its primary competition (consoles) had to remain ahead of the curve.
The sheer magnitude of the SoC created by Microsoft demonstrated its dedication
to this goal.
With 47 MB of on-chip SRAM storage and five billion transistors, the console's
main SoC housed the bulk of the Xbox One's functionality. AMD co-engineered the Xbox
One SoC with Microsoft; its block diagram is shown in Fig. 4.27.
Xbox One processing used an AMD eight-core x86 CPU, a Graphics Core Next
GPU, and 15 special-purpose processors.
As illustrated in Fig. 4.27, the custom APU saved CPU and GPU cycles
by offloading well-known processing tasks to dedicated hardwired engines and
DSPs. The offloaded processing tasks were typically things like video encoding
and decoding, audio, and display. Several swizzle engines handled unaligned image
copies in memory.

Fig. 4.26 Xbox One system architecture

Fig. 4.27 Internals of Xbox One’s 5+ billion transistor SoC



All internal processors had access to shared, unified memory via host/guest
MMUs. That allowed low overhead coprocessing by the CPU and the GPU (and
other priority agents). Shared unified memory let multiple processors work together
to pass pointers to data structures. The alternative would have been making each
processor copy the structure itself. When large memory transfers were required,
the SoC’s memory subsystem was up to the challenge; it offered an impressive
200 GB per second of realizable bandwidth. Only 30–68 GB per second of that
bandwidth (depending on whether coherency enforcement was used) came from
external DRAM, whereas the majority (204 GB per second peak) came from 32
Mbytes of embedded SRAM.
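The sketch below illustrates the zero-copy idea in plain C; the types and function names are invented for the example and are not Microsoft's APIs. With one coherent address space, a producer hands a consumer a pointer to a structure; with split memory pools, the whole structure has to be copied first.

#include <string.h>
#include <stdint.h>

/* An illustrative shared data structure, e.g., a frame one processor prepares
   and a coprocessor then reads. */
typedef struct { uint32_t width, height; uint8_t pixels[1024 * 1024]; } Frame;

/* Unified, coherent memory: pass the pointer; no bulk copy is needed. */
void process_shared(const Frame *frame)
{
    (void)frame;   /* a GPU or DSP would read 'frame' in place */
}

/* Separate memory pools: each processor needs its own copy first. */
void process_copied(const Frame *src, Frame *local_copy)
{
    memcpy(local_copy, src, sizeof *local_copy);   /* costs bandwidth and time */
    /* ... work on local_copy ... */
}

The copy in the second case is exactly the traffic that a unified memory design with host/guest MMUs avoids.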
Reacting to the motion-sensing capabilities of the Wii and PlayStation, Microsoft
introduced Intel’s RealSense image sensing system, Kinect, as an accessory to the
Xbox. It was entirely CPU-based and did not use any GPU resources for capture.

4.15 Nvidia Shield 2 (March 2015)

In the mid-2010s, Nvidia steadily invested in expanding its Shield product line and
Grid (GPU) server system capabilities. The original product, the Shield Portable,
was a game controller with an attached screen. This product was Nvidia’s first taste
of the consumer electronics business. The following product released was the Shield
Tablet, which Nvidia shared with its partners (e.g., EVGA).
At the 2015 GDC in San Francisco, Nvidia unveiled its latest Shield product: the
Shield Console. It was an Android TV-like device with a game controller and an
optional remote-control stick.
Nvidia also set up a game store with over 50 titles, and users could sign up
for a subscription. For an easy comparison, Nvidia's Shield console was similar to
Amazon's Fire TV.
However, Nvidia had been shaving milliseconds since it brought out the first Shield
Portable. The difference between Nvidia’s Shield console and all other Android
TV/consoles was the response time, latency, and throughput; Nvidia’s was faster
because Nvidia controlled both ends of the pipe—the server and the client.
The company worked closely with game developers and reported several new
games. These games ran well at 4K and even better at HD.
The cabinet was attractive, with sharp diagonal lines and multi-reflective surfaces.
Measuring 8 × 5 in. and 1 in. thick, the Shield was roughly the size of a thin book
(Fig. 4.28).
The system sold for $199 and came with the Shield processor cabinet (see
Table 4.2), a stand, a game controller, and a power supply. The optional TV controller
was $30. The cabinet stand had an amazingly sticky nano surface on the bottom; once
put on a flat surface such as a table, it stayed put. The cabinet slipped neatly and
firmly into the stand.
In the demonstration at Nvidia’s facilities in Santa Clara, the system was very
responsive, even though the server was in Seattle. That was a surprise because it

Fig. 4.28 Nvidia’s Shield


Console in its holder with
controller. Courtesy of
Nvidia

Table 4.2 Nvidia’s Shield Console specifications


Processor Nvidia Tegra X1 Processor with 256-Core Maxwell GPU with 3 GB RAM
Video features 4K Ultra-HD Ready with 4K playback and captures up to 60 FPS (VP9,
H265, H264)
Audio 7.1 and 5.1 surround-sound pass-through over HDMI, High-resolution audio
playback up to 24-bit/192 kHz over HDMI and USB
Storage 16 GB
Wireless 802.11ac 2 × 2 MIMO 2.4 GHz and 5 GHz Wi-Fi, Bluetooth 4.1/BLE
Interfaces HDMI 2.0, Gigabit Ethernet, Two USB 3.0 (Type A), Micro-USB 2.0
Gaming features Nvidia GRID streaming service
Power 40 W power adapter
Weight and size Weight: 654 g (23 oz), 130 mm (5.1 in) high, 210 mm (8.3 in) wide, 25 mm
(1 in) deep
Operating system Android TV, Google Cast Ready
Options SHIELD controller, SHIELD remove, SHIELD stand
Courtesy of Nvidia

was so fast. There were no noticeable latency, sound, or synchronization problems. It wirelessly drove a full 4K screen (the system used 802.11ac, the latest Wi-Fi networking standard, which provides high-throughput wireless local area networks (WLAN) on the 5 GHz band).
The Shield utilized Nvidia’s Tegra X1 SoC, based on the ARM Cortex-A57 CPU.
It also incorporated Nvidia’s Maxwell microarchitecture GPU with 3 GB of RAM.
The device supported 4K resolution output at 60 FPS over an HDMI 2.0 output
with compatibility to HEVC-encoded video. The Shield could contain 16 GB of
internal flash storage or a 500 GB hard drive and be expanded via a microSD card
or removable storage. The 2015 and 2017 Shield models with a 500 GB hard drive
were branded as the Shield Pro because of their additional storage space.

Fig. 4.29 Nvidia Tegra X1 block diagram

The GPU had 256 shader cores (2 SMMs) and ran at 1000 MHz (see Fig. 4.29).
The memory interface offered a maximum bandwidth of 25.6 GB per second (2x 32
Bit LPDDR4-3200).
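That bandwidth figure follows directly from the memory configuration given in parentheses:

2 channels × 32 bits × 3200 MT/s = 204,800 Mbit/s ÷ 8 bits per byte = 25.6 GB per second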
The product marked a significant step for Nvidia that it had been building toward
for decades. With the Shield’s release, Nvidia became a full-service consumer elec-
tronics company. For several years, Nvidia had been saying, “We are not a semi-
conductor company." Their Grid, Tesla, Quadro, and other products were ample proof
of this claim. Yes, they sold semiconductors, and they also sold components (like
automotive subsystems). The main difference between Apple and Nvidia (other than
size) was Nvidia’s use of a common OS. Nvidia saw this as an advantage because it
could leverage all app development work, the APIs, and the OS without carrying all
the expense.

4.16 Playmaji Polymega (February 2017)

Polymega was a modular multi-system game console that ran original game cartridges
and CDs for classic game consoles on the user’s HDTV. Developed by LA-based
Playmaji, Polymega was initially named Retroblox when first announced in 2017.

The company offered several modules that accepted cartridges from almost all old
consoles and CD compatibility with the Base Unit. Each box in the system plugged
into a base platform (Fig. 4.30).
Polymega had one of the widest ranges of emulation systems on the market. The
system used a 2.9 GHz Intel Pentium Coffee Lake S CPU with UHD Graphics 610.
It had 2 GB of DDR and 32 GB of NanoSSD storage, expandable with an SSD
or a microSD card. It had a Realtek Wi-Fi and Bluetooth combo module, HDMI 1.4,
Gigabit Ethernet, 2x USB 2.0, and a Polymega expansion bus. The system used a
proprietary Linux-based OS.
The integrated UHD Graphics 610 (GT1) had been in processors from the Whiskey
Lake generation. The GT1 version of the Skylake-era GPU had 12 execution units (EUs),
which could clock up to 950 MHz. The UHD 610's architecture shared memory with
the CPU (2x 64-bit DDR3L-1600/DDR4-2133).
The video engine supported H.265/HEVC Main10 profiles in hardware with 10-bit
color. Google’s VP9 codec could be hardware decoded. The Pentium chips supported
HDCP 2.2 and Netflix 4K. HDMI 2.0, however, could only be supported with an
external converter chip (LSPCon).

Fig. 4.30 Artist’s rendition of the Polymega system. The final version was a dark, flat gray. Courtesy
of Polymega

The system’s games were limited in color and resolution, so the processor (an
iGPU) was more than adequate. The system’s magic was in the emulations that Play-
maji developed and the physical hardware interfaces for cartridges and controllers.
The system shipped to consumers in September 2021.

4.17 Nintendo Switch (March 2017)

Two years after Nvidia withdrew from the mobile market, in March 2017, Nintendo
announced the release of the Switch, shown in Figs. 4.31 and 4.32. The Nintendo
Switch was a portable game console that used a Tegra X1 processor, in effect a repack-
aged Nvidia Tegra tablet. Tegra processors are covered in the Nvidia Shield sections,
and the 256-shader-core Tegra X1 GPU is shown in Fig. 4.29.
The Switch had two innovative controllers attached to the tablet’s sides for game
control. Alternatively, the tablet could be put in a desk stand, and the controllers could
be detached and used independently as handheld devices without being directly
attached to the Switch.

Fig. 4.31 Nintendo’s Switch


with controls attached.
Courtesy of Nintendo

Fig. 4.32 Nintendo Switch desk mount. Courtesy of Nintendo



Fig. 4.33 Nintendo console introduction timeline

In mid-2021, Nintendo upgraded the Switch with a higher-quality OLED screen but continued to use the six-year-old Tegra X1 processor.
The Switch was an extremely successful game console. As of 2021, since its
launch in 2017, over 90 million units had been shipped worldwide. The Switch was
on track to meet or exceed the Wii console’s 101.63 million lifetime sales (Fig. 4.33).
Although the Switch was both popular and economical, many gamers still used
multiple devices. In mid-2021, NPD, a consumer research firm, estimated over 40%
of U.S. Switch users also owned a PS4 and Xbox One [24].

4.18 Atari VCS (June 2017)

Atari withdrew from the console market in 1996 when the Atari Jaguar CD game
console failed to reach its sales goals (see Section 3: The First GPU and What It
Led to on Other Platforms). In 1998, the company liquidated, and Hasbro Interactive
acquired Atari’s intellectual property and brand [25].
In January 2001, Infogrames Entertainment SA acquired Hasbro Interactive for
$100 million and renamed it Infogrames Interactive [26]. In May 2003, Infogrames
renamed itself Atari SA and renamed the Interactive subsidiary Atari Interactive,
which has provided the licensing for the Atari consoles produced since 2004.
At the E3 gaming conference in June 2017, Atari’s CEO, Fred Chesnais, told
GamesBeat that the company was working on a new console code named AtariBox.
The hearts of older gamers began to race. Courtesy of Atari, rumors about the
AtariBox and dark model pictures appeared in the news.
Feargal Mac Conuladh (Mac Conuladh), who joined Atari in September 2017 as
the general manager, was inspired by the concept of the Atari VCS (code named
AtariBox). Conuladh saw that Atari’s game catalog had strong brand recognition,
and the console would be a perfect vehicle to exploit that famous catalog.
Conuladh said in a 2017 interview that seeing gamers connect their laptops to
televisions to play games on a larger screen inspired him to develop the unit [27]
(Fig. 4.34).
Conuladh’s goal for the AtariBox was to satisfy the nostalgia for the former Atari
consoles and indie games without needing a PC. Following the lead of both Microsoft
and Sony, Conuladh said Atari SA would use a custom version of AMD’s APU (code
named Kestrel).

Fig. 4.34 Feargal Mac (left) of Atari and former Microsoft games executive Ed Fries. Courtesy of
Dean Takahashi

After announcing the AtariBox at the 2017 E3, the Atari management then went to
San Francisco during the 2018 GDC to show a mockup to a few reporters (Fig. 4.35).
The company claimed that the Atari VCS platform would provide 4K resolution,
HDR, 60 FPS content, onboard and expandable storage options, dual-band Wi-Fi,
Bluetooth 5.0, and USB 3.0.
Atari said the VCS would include the Atari Vault, with more than 100 classic
games, such as arcade and home entertainment favorites like Asteroids, Centipede,
Breakout, Missile Command, Gravitar, and Yars’ Revenge.
Designed to be reminiscent of the 1977 Atari 2600, which was called the Atari
Video Computer System or VCS, the new VCS was introduced in a YouTube teaser
in 2017. Some thought Atari’s 2021 summer debut of the VCS console was about
four years late; others thought the console should have been released more than four
decades ago. Having promised a delivery date in the summer of 2019, the company
launched a successful crowdfunding campaign for the product release. Several press
releases and news leaks appeared in 2019 and 2020, but the product did not launch.
The VCS unit was finally released in late 2021.
Atari VCS Specifications (Table 4.3).

Fig. 4.35 Atari 2600 and VCS. Courtesy of Wikipedia



Table 4.3 Atari VCS specifications

Software: Atari custom Linux OS (Debian)
CPU/GPU: AMD Ryzen Embedded R1606, 2.6–3.5 GHz dual-core (4 threads)
GPU: AMD Radeon Vega 3, 3 compute units, 1.2 GHz
Internal storage: 32 GB eMMC (fixed internal)
Internal SATA M.2 SSD slot: for storage expansion
RAM: 8 GB DDR4 RAM (upgradable)
Compatible operating systems: Ubuntu-based Linux, Windows, Chrome OS
Wireless: Wi-Fi 802.11 b/g/n 2.4/5 GHz, Bluetooth 4.0
Ports: HDMI 2.0, Gigabit Ethernet, 4x USB 3.1 (2x front, 2x rear)
Included controllers: Classic Joystick, Modern Controller (mice and keyboards also supported via USB or Bluetooth)

Was there room for the Atari VCS in a society obsessed with smartphones,
streaming, and powerful gaming PCs? Such retro enthusiasm had been seen before,
with limited success. Other nostalgia gaming devices have been attempted, but it
was not easy for nostalgic games to compete with the modern games available for
PlayStation and Xbox. So, the answer is no, or maybe. Nintendo has proven several
times that millions of loyal and enthusiastic gamers still love classic games and
characters.

4.19 Zhongshan Subor Z-Plus Almost Console (2018–2020)

In 2015, China lifted a 14-year ban on console gaming. During those 14 years, with
no gaming consoles, the gaming scene in China became much more PC-focused. But
there was also an active black market for consoles in China. When China lifted the
ban, Microsoft’s Xbox One was the first console to be offered. Sony was next and
ended up being the preferred console. But China required both companies to manu-
facture some of the consoles in China. And there was a desire from the government
for a Chinese-made console.
In 2013, Xiaobawang and Ali Group formed a strategic partnership, announcing
a home game console that would run Ali TVOS and carry a variety of Xiaobawang
games on that platform.
In 2016, Xiaobawang and AMD reached an agreement for a customized VR host
chip, becoming the fourth company after Microsoft, Sony, and Nintendo and the first
company in China to have a high-end host chip. Xiaobawang invested about $60
million in AMD to develop a custom processor for Xiaobawang’s upcoming console
and PC hybrid system.

Fig. 4.36 Xiaobawang Zhongshan Subor Z-plus console. Courtesy of Xiaobawang

As a result, Xiaobawang introduced the Zhongshan Subor Z-Plus console in 2018. Subor, an electronics company in Xiqu Subdistrict, Zhongshan, Guangdong, had made a clone of the Nintendo Famicom that was popular in the 1980s and 1990s. The company hoped there would be a market for a console with the Subor name.
At the massive Chinese game conference, ChinaJoy, in 2018, AMD and
Xiaobawang jointly announced the launch of game PCs and game consoles for the
Chinese market. According to statistics, in 2018, the number of gamers in China
reached 626 million, a year-on-year increase of 7.3%.
The Zhongshan Subor Z-Plus was a Windows 10 game console powered by an
AMD APU combining a Zen CPU with a 24-CU Vega GPU. It had 8 GB of memory
and a 128 GB SSD plus 1 TB of additional storage (Fig. 4.36).
Subor had financial difficulties and missed its August 2018 launch target. Costs
kept rising, which meant that the console would have to launch at a higher price than
a comparable PC; as each month ticked by, the situation got worse. The company’s
investors became pessimistic about the project’s future. The company went bankrupt
in 2020.
Another casualty of the Chinese console market was Axe Technology. In 2016 the
company released a game console called Tomahawk F1. Many Chinese gamers had
the hope of the rise of a domestic game console market. However, the Tomahawk F1
did not sell well through its e-commerce platform. This company ceased operations
in less than a year.

4.20 Sony PlayStation 5 (November 2020)

Sony’s PlayStation 5 used another custom AMD APU, sometimes referred to as


an SoC. Manufactured at TSMC in 7 nm, the eight-core AMD Zen 2 CPU runs at
3.5 GHz. The lead system architect for the upcoming PS5, Mark Cerny, revealed

the main features of the future console in a virtual presentation called “The Road to
PS5.” The new console would have three major features: a high-speed drive, GPU
with ray tracing, and 3D spatial audio [28] (Fig. 4.37).
The GPU had 36 compute units (CU) consisting of 2304 shaders. When the GPU
ran at 2.23 GHz, it had a theoretical performance level of 10 TFLOPS.
The GPU had hardware-accelerated ray tracing capability, AMD's VRS (variable
rate shading), and AMD's FidelityFX Super Resolution upscaling. The console came
with 16 GB of GDDR6 SDRAM with 448 GB per second peak bandwidth. It was
equipped with Bluetooth 5.1, 802.11ax (Wi-Fi 6), USB Type-A Hi-Speed, USB
type-C, and HDMI 2.1.
The PS5 GPU used AMD’s Primitive Shaders from RDNA 1; AMD introduced
primitive Shaders in June 2019.
The PS5 had backward compatibility features that were a function of an x86-
based APU. Sony looked at the top 100 PS4 titles as ranked by playtime and expected
almost all of them to be playable on PS5 when it launched. With more than 4000
games published on the PS4, the company said it would continue the testing process
and expand backward compatibility coverage over time (Fig. 4.38).
Although the Microsoft Xbox Series X (XSX) and the Sony PlayStation 5 (PS5)
both used a custom variant of the same basic APU from AMD, the two machines
had significantly different GPUs and mass storage systems (see Table 4.4).

Fig. 4.37 Sony PlayStation 5 block diagram

Fig. 4.38 PlayStation introduction timeline



Table 4.4 Comparison of Sony PlayStation 5 and Microsoft Xbox Series X key specifications

CPU: PS5: 8-core 3.5 GHz AMD Zen 2 | Xbox Series X: 8-core 3.8 GHz AMD Zen 2
GPU: PS5: 10.3 TFLOPS AMD RDNA 1 at 2.23 GHz | Xbox Series X: 12.0 TFLOPS AMD RDNA 2 at 1.82 GHz
Shaders: PS5: 2304 (36 CU) | Xbox Series X: 3328 (52 CU)
Resolution: PS5: up to 8K | Xbox Series X: up to 8K
Frame rate: PS5: up to 120 FPS | Xbox Series X: up to 120 FPS
RAM: PS5: 16 GB GDDR6, 256-bit bus, 448 GB/s | Xbox Series X: 16 GB GDDR6, 320-bit bus, 560 GB/s
Storage: PS5: 825 GB custom SSD, 5.5 GB/s raw, 8–9 GB/s compressed | Xbox Series X: 1 TB custom NVMe SSD, 2.4 GB/s raw, 4.8 GB/s compressed
Optical disc drive: PS5: 4K UHD Blu-ray (standard PS5 only) | Xbox Series X: 4K UHD Blu-ray
Backward compatibility: PS5: almost all PS4 games, including optimized PS4 Pro titles | Xbox Series X: all Xbox One games, select Xbox 360 and original Xbox games
Price: PS5: $500 ($400 for the PS5 Digital Edition) | Xbox Series X: $500

The SSD was a significant feature. It took six to seven seconds to load 1 GB of
data and approximately 20 s to read the data on an HDD. Games in 2020 were using
5–6 GB, which meant an SSD was a necessity for gamers; using an SSD could cut
those times dramatically.
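To put the SSD's raw 5.5 GB per second throughput (Table 4.4) in perspective with a simple, back-of-envelope calculation:

5.5 GB ÷ 5.5 GB per second ≈ 1 s to stream a typical 5–6 GB working set,

a small fraction of the HDD load times quoted above.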
Sony decided against a faster GPU with fewer shaders in favor of a faster SSD.
However, the PS5 could not use INT8 or INT4 instructions for ML. It could only use
FP16 for basic AI.
Even though it used primitive shaders, the PS5 ran Epic’s extraordinary Lumen
in the Land of Nanite demo (discussed in Book one).
The GPU’s compute units included an intersection engine for ray tracing, which
calculated the intersection of rays with boxes and triangles. To use the engine, a BVH
acceleration structure needed to be built. BVH was a popular ray tracing acceleration
technique that used a bounding volume hierarchy. The technique used a tree-based
acceleration structure that contained multiple hierarchically arranged bounding boxes
(bounding volumes) that encompass different amounts of scene geometry.
The shader program used a new instruction inside the compute unit that invoked
the intersection engine to test rays against the bounding volume hierarchy.
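To make the intersection engine's job concrete, the sketch below shows, in plain C, the kind of ray-versus-bounding-box test a BVH traversal performs at every node; the console performs this class of test in fixed-function hardware. The code is illustrative only; the types and function names are invented and are not Sony's or AMD's actual instructions or APIs.

#include <stdbool.h>

/* Illustrative types: a ray stored with the reciprocal of its direction, and
   an axis-aligned bounding box (AABB) such as a BVH node would hold. */
typedef struct { float origin[3]; float inv_dir[3]; } Ray;
typedef struct { float lo[3]; float hi[3]; } AABB;

/* Classic "slab" test: intersect the ray with the box's three pairs of
   parallel planes and check that the resulting intervals overlap. */
static bool ray_hits_box(const Ray *r, const AABB *b, float t_max)
{
    float t_near = 0.0f, t_far = t_max;
    for (int i = 0; i < 3; ++i) {
        float t0 = (b->lo[i] - r->origin[i]) * r->inv_dir[i];
        float t1 = (b->hi[i] - r->origin[i]) * r->inv_dir[i];
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > t_near) t_near = t0;
        if (t1 < t_far)  t_far  = t1;
        if (t_near > t_far) return false;   /* the slabs never overlap */
    }
    return true;   /* descend into this BVH node (or test its triangles) */
}

A traversal loop applies this test from the root of the tree downward, skipping entire branches whose boxes the ray misses; that pruning is what makes the BVH an acceleration structure.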
A PS5 compute unit (CU) was not the same as a CU in the PS4. Sony chose
fewer compute units at a higher frequency rather than more compute units at lower
frequencies. To demonstrate the trade-off, 36 compute units running at 1 GHz resulted
in the same 4.6 TFLOPS as 48 compute units running at 0.75 GHz. Moreover, using
a higher clock allowed everything else to run faster. It was easier to use 36 compute
units in parallel than to use 48 simultaneously. It was harder to fill all available CUs
with useful work when triangles were small.
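The arithmetic behind that trade-off is straightforward (a sketch using the shader counts given above, and counting a fused multiply-add as two floating-point operations per shader per clock):

36 CUs × 64 shaders × 2 FLOPs per clock × 1.0 GHz ≈ 4.6 TFLOPS
48 CUs × 64 shaders × 2 FLOPs per clock × 0.75 GHz ≈ 4.6 TFLOPS
36 CUs × 64 shaders × 2 FLOPs per clock × 2.23 GHz ≈ 10.3 TFLOPS, the PS5's peak figure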

The CPU supported 256-bit native instructions, and the GPU's 36 compute units
(i.e., 2304 shaders or 58 PS4 compute units) employed variable frequencies for a
continuous boost. The clock was increased until the system reached the system’s
cooling solution capacity; therefore, the system ran at constant power, and the
frequency varied according to the load. Rather than looking only at chip temperature,
Sony also noted the CPU and GPU’s actual activities and set the frequency accord-
ingly. To maximize efficiency, Sony used AMD’s Smartshift technology to send any
unused power from the CPU to the GPU to squeeze out a few more pixels. The CPU
could run as high as 3.5 GHz. The GPU frequency was capped at 2.23 GHz to allow
the on-chip logic to run correctly. With 36 compute units running at 2.23 GHz, the
system produced 10.3 TFLOPS.

4.21 Microsoft Xbox Series X and S (November 2020)

The Xbox Series X had a custom 7 nm AMD APU with an iGPU based on AMD’s
RDNA 2 architecture (see block diagram in Fig. 4.39). The iGPU had 56 compute
units with 3584 cores, but only 52 CUs and 3328 cores were enabled and ran at the
fixed 1.825 GHz rate. The iGPU was capable of 12 TFLOPS. Series X and S had
completely programmable front ends (Mesh Shaders, VRS, SFS).
The GPU had ray tracing capability through a new processor element called an
intersection engine in the GPU's compute units.
The Xbox S had AMD’s FidelityFX Super Resolution image upscaling tech-
nology, which enabled higher resolutions and frame rates.

Fig. 4.39 Microsoft Xbox Series X block diagram



Fig. 4.40 Microsoft’s Xbox


series APU. Courtesy of
Microsoft

The APU also had an eight-core Zen 2 CPU that ran at 3.8 GHz; however, the
CPU ran at 3.6 GHz when simultaneous multi-threading (SMT) was used. Microsoft
dedicated one CPU core to the operating system. Xbox Series X could use INT8 and
INT4 instructions, but the PS5 could not.
As the die photo in Fig. 4.40 reveals, the GPU occupied most of the silicon in the
die.
The console had 16 GB of GDDR6 SDRAM, with 10 GB running at 560 GB per
second for the iGPU and 6 GB at 336 GB per second for other computing functions.
The custom APU could drive 4K TVs at up to 120 Hz and 8K TVs at up
to 60 Hz. The Xbox Series X performance goal was to render games in 4K resolution
at 60 frames per second. Microsoft planned to achieve this by using a CPU roughly
four times more powerful than the Xbox One X's and a GPU twice as powerful
(Fig. 4.41).
Microsoft had two machines with different hardware capabilities to play the same
games. Table 4.5 compares the components of Microsoft’s fourth generation of Xbox
consoles.
The consoles were compatible with the new features of HDMI 2.1, including
variable refresh rate and auto low latency mode (ALLM), which began appearing
in televisions in 2019.

Fig. 4.41 Microsoft Xbox series introduction timeline



Table 4.5 Microsoft’s Xbox Series X and S game consoles


Component Series X Series S
Processors CPU Custom AMD Zen 2 8 Cores at Custom AMD Zen 2 8 Cores
3.8 GHz (3.66 GHz with SMT) at 3.6 GHz (3.4 GHz with
SMT)
GPU Custom RDNA 2 52 CUs at Custom RDNA 2 20 CUs at
1.825 GHz 12 TFLOPS 1.565 GHz 4 TFLOPS
Transistors (B) 15.3 6.6
Semiconductor technology TSMC 7 TSMC 16
(nm)
Memory 16 GB GDDR6 with 320-bit bus 10 GB GDDR6 with 128-bit
10 GB at 560 GB/s, 6 GB at bus 8 GB at 224 GB/s, 2 GB
336 GB/s at 56 GB/s
Storage Internal 1 TB PCIe Gen 4 custom NVMe 512 GB PCIe Gen 4 custom
SSD 2.4 GB/s raw, 4.8 GB/s NVMe SSD 2.4 GB/s raw,
compressed 4.8 GB/s compressed
Expandable 1–2 TB expansion card (rear)
External USB 3.1 external HDD support
Optical drive Ultra-HD Blu-ray None
Performance target 4K resolution at 60 FPS, up to 1440p at 60 FPS, up to 120
120 FPS FPS
Dimensions Size 301 mm × 151 mm × 151 mm 275 mm × 151 mm × 65 mm
(12 in × 5.9 in × 5.9 in) (11 in × 5.9 in × 2.6 in)
Weight 4.45 kg (9.8 lb) 1.93 kg (4.3 lb)
Model 1882 1883/1881
Price $499 $299

4.22 Valve Steam Deck Handheld (July 2021)

In 2021, AMD was the supplier for four of the five available game consoles: the Atari
VCS, Microsoft's Xbox Series X, Sony's PS5, and Valve's Steam Deck. Each of those
four consoles was an x86-based game machine.
Valve’s Steam Deck was a handheld PC with an integrated 7-in. screen and game
controls instead of a keyboard. The software supported an external keyboard, which
could be attached via Bluetooth or USB-C. As with Nintendo's Switch, an external
display could be connected via the USB-C port.
The Steam Deck Portable console was the first PC to use AMD’s Van Gogh APU.
The entry-level APU ran SteamOS 3.0. Van Gogh had a Zen 2 CPU with an RDNA
2 GPU, which promised over 2 TFLOPs of capability. This level of performance was
higher than the performance capabilities of the AMD-powered PlayStation 4 or the
Xbox One. The Steam Deck could run AMD’s FidelityFX Super Resolution. It also
had hardware support for DirectX 12 Ultimate features such as variable rate shading

(VRS) and acceleration for ray tracing. Lastly, the Steam Deck was also capable of
running Linux and using the Vulkan API.
Valve said the Steam Deck would run AAA games. That was a safe claim given
the APU that powered it. The Steam Deck’s display was limited to 1280 × 800 pixels,
with a 16:10 aspect ratio.
Specifications of the Steam Deck:
• Processor: AMD APU
• CPU: Zen 2 4c/8t, 2.4–3.5 GHz (up to 448 GFLOPS FP32)
• GPU: AMD 8 RDNA 2 CUs, 1.0–1.6 GHz (up to 1.6 TFLOPS FP32)
• RAM: 16 GB LPDDR5 RAM (5500 MT/s)
• Storage (depending on model): 64 GB eMMC (PCIe Gen 2 x1), 256 GB NVMe SSD (PCIe Gen 3 x4), or 512 GB high-speed NVMe SSD (PCIe Gen 3 x4)
• Display: 7-in. diagonal LCD with 400 nits of brightness and a 60 Hz refresh rate
• All models included a high-speed microSD card slot
• Power: 4–15 W
• Connectors: USB-C with DisplayPort 1.4, and a 45 W USB Type-C PD 3.0, a
3.5 mm combo audio jack
• Communications: Bluetooth 5.0, dual-band Wi-Fi.
Valve’s Steam Deck allowed users to take their Steam library with them; therefore,
it commonly gets compared to Nintendo’s portable Switch. Table 4.6 shows the
characteristics of both devices.
Nintendo’s Switch ran on its self-contained battery for approximately two-and-a-
half to six-and-a-half hours. The Deck's 40 Wh battery provided two to eight hours
of gameplay, depending on the games played.
The Switch was a less powerful device since it used a 2015 Arm-based Nvidia
Tegra X1 chipset, so it did not seem fair or logical to compare it to the Steam Deck.

Table 4.6 Steam Deck’s


Parameter Steam deck Nintendo switch
specifications compared to
Nintendo’s Switch GPU AMD RDNA Tegra X1 (T210)
Shaders/Clk 256/3.5 GHz 256/1 GHz
GPU TFLOPS 1.6 0.393
Screen size 7-in 7-in
Resolution 1280 × 800 1280 × 720
Width mm/in 239/9.4 298/11.7
Height 102/4 117/4.6
Thickness 14/0.55 49/1.9
Weight gm/lb 299/0.66 322/0.71
Price $399–$649 $349

Fig. 4.42 Valve’s Steam Deck game console. Courtesy of Valve

Valve said the Steam Deck emulated the regular Steam app on a desktop. The
emulation included chat, notifications, cloud save support, and all of one’s synced
library, collections, and favorites. Users could also stream games to the Steam Deck
from a PC using Valve's Remote Play feature. In that respect, it was similar to the
original handheld Nvidia Shield (Fig. 4.42).
Questions remained: What problem did this machine solve? Who needed
or wanted the Steam Deck? What could the Deck do that any laptop with a game
controller could not do as well, if not better?
When Nvidia introduced the Shield, it referred to it as just another Android device,
and that positioning did not pay off. The Steam Deck was likewise just a PC in a
different package; the main difference was the price. A $400 PC with a built-in game
controller was an excellent idea, but that $400 investment would spend most of its
time turned off on a table or inside a backpack. No one would want to replace their
13-in. notebook with a low-resolution, 800-line, 7-in. display, and web browsing would
be no more fun than it was on a smartphone with a higher-resolution display.

4.23 Qualcomm Handheld (December 2021)

In December 2021, Qualcomm announced a handheld game console reference
design. Qualcomm had a long and influential history of developing reference
designs to illustrate how its SoCs could be successfully employed. Some OEMs
would adopt an entire reference design, either because they had limited design
resources or time or simply because it was an excellent design.
The handheld's screen was a 6.65-in., 16:9, 400-nit, FHD+ (~2000 × 1080), 120 Hz
display, and the gaming-specific platform was called Snapdragon G. It had foundations
in mobile and elite gaming but was a gaming-specific configuration with a better GPU,
a top-end CPU, and more memory (see Fig. 4.43).

Fig. 4.43 Qualcomm handheld, game console, reference design
It ran Android and, therefore, Android games. The company was considering
making it Windows compatible too. That would have been relatively easy because
Qualcomm already provided chips to OEMs like Asus, HP, and Lenovo for
always-connected PCs running Windows.
The dev kit used 6 GB of LPDDR5 memory and 128 GB of UFS 3.1 flash storage;
commercial device developers could configure memory however they wanted.
It had a battery life of four to eight hours, 5G capability, Snapdragon Sound, and
the ability to drive a TV remotely.
The dimensions were 293 mm × 116 mm × 21.5–52 mm, and it weighed just a
little over 500 g.
Although it had 5G and could do web browsing, game downloading, and online
gaming, it could not be used as a phone but could do Voice Over Internet Protocol
(VoIP).
Razer was Qualcomm’s first partner and distributor of ref kits to game developers.

4.24 Conclusion

The fifth-generation Sony PlayStation, introduced in 1994, had a separate T&L copro-
cessor within the CPU. That machine, along with the fifth-generation Nintendo 64
introduced in 1996, kicked off the era of game consoles having GPU-like character-
istics. Microsoft's first game console, the sixth-generation Xbox introduced in 2001,
had a dedicated Nvidia GPU. The consoles evolved, as did their GPUs, until 2020,
when Sony and Microsoft were offering ninth-generation devices with GPUs capable
of mesh shading.
The number of suppliers and products offered swelled from 1972 to seven compa-
nies and 19 products in 1996. It then declined to three companies by 1998 and stayed
at that level until 2017, when it grew to five companies and then to seven in 2022.
Consoles used to be a leading technology platform. That changed when Sony adopted
AMD's APU, Microsoft did the same, and, in 2017, Nintendo switched from AMD
graphics to Nvidia's Tegra for the Switch; the consoles' components were derived from
products already in the market. The PC leaped ahead in GFLOPS and other features,
but in 2020 Microsoft and Sony surprised the industry by introducing consoles that
could run ray tracing and mesh shaders. The unified memory architecture of the APUs
in the Microsoft and Sony consoles will keep them behind the PC in performance,
but clever game developers have demonstrated that they can get enormous performance
out of those devices, approaching the PC at a fraction of the cost.

Chapter 5
Compute Accelerators and Other GPUs

Very large-scale integration (VLSI) enabled the development and expansion of the
GPU thanks to Moore’s law. Every two years, process nodes shrank while costs
stayed relatively constant, and even in later years, as prices did rise, the advantages
of node shrinkage outweighed the added costs. As the GPU became denser, with
more computational power in the same area, its cost–benefit ratio was irresistible,
and people began adapting the GPU to every imaginable problem and application
(Fig. 5.1).
One would be challenged to find a platform or electronic device in their life that
did not have a GPU somewhere in it, and if not in it, in its manufacturing and design.
That is a beautiful story but one that is difficult to tell. So many companies have
come and gone and made tremendous contributions to the development of the GPU.
There are so many platforms to examine. And, in many cases, a single GPU model
will find itself on multiple platforms.
In this chapter, I try to cover all the special cases not discussed so far.
In 1999, the professional graphics workstation market began a major shift from
proprietary and purpose-built systems to systems based on commercial-off-the-shelf
semiconductors (COTSS). Manufacturers recognized the advantage of moving to
COTSS and the industry segment changed completely in a matter of two years, but
of course the transition for customers was a lot more complicated. The workstation
vendors had to support their customers that had older equipment. That meant that the
older, established workstation suppliers like Evans & Sutherland, Fujitsu, HP, NEC,
SGI, and Sun had the financial burden of supporting UNIX-proprietary workstations
as well as the new Microsoft Windows and Intel systems (WinTel)-COTSS work-
stations. The older suppliers had difficulty competing against PC-centric newcomers
to the workstation segment such as Boxx, Compaq, and Dell.
HP offered its last proprietary workstation AIB, the Visualize FX10, in June 2000.
In 2002, computer graphics pioneer Evans & Sutherland began to slowly wind
down; it exited the professional graphics market in 2006 to concentrate on the
planetarium projector market. SGI became a server supplier, and Sun introduced its
long-awaited XVR-4000 (Zulu) in 2002, a massive AIB that arrived too late and was
too expensive to compete.

Fig. 5.1 The GPU scales faster than any other processor
Workstation graphics shifted to 3Dlabs, ATI, and Nvidia, while workstation CPUs
moved to Advanced Micro Devices (AMD) and Intel.
The following sections will examine some of the most interesting and, in some
cases, significant developments in professional graphics and compute accelerators.

5.1 Sun’s XVR-4000 Zulu (2002) the End of an Era

Like its competitors HP, SGI, and others, Sun could see Moore’s law changing
the landscape in the workstation segment. Sun tried to carve out the high-end, high-
performance segment for the company’s value-add to avoid being seen as just another
COTSS box supplier.
In late 2002, the company belatedly introduced the Zulu supergraphics subsystem
it had been working on for about three years. It had four pipelines, each with a MAJC
5200 geometry engine and raster engine (which Sun called a frame buffer controller).
That was followed by schedulers that transferred the pixels to the 3DRAM buffers.
Data then went to the routers, which fed it into the convolve engines and the output
digital-to-analog converters (DACs).

Fig. 5.2 Sun microsystem’s XVR-4000 graphics subsystem. Courtesy of forms.irixnet.org

The large subsystem obtained impressive results:


• Greater than 80 M fully lit, textured, anti-aliased triangles/second
• Graphics interconnect bandwidth of greater than 2 GB/s
• Master controller drove four rendering subunits, providing a single-pipe interface
to OpenGL applications
• Each rendering subunit had 256 MB texture memory (1 GB total onboard)
• One TFLOP/s anti-aliasing filter pipeline.
The subsystem was really large (see Fig. 5.2). It was 533 mm (21 in.) long, 234 mm
(9.2 in.) deep, and 69 mm (2.7 in.) high, and it drew 238 W. A graphics AIB of the
2020s measured 242 mm (9.5 in.) long, 112 mm (4.4 in.) high, and 38 mm (1.5 in.)
wide and drew 200–300 W.
Sun was very proud of its real-time, full-scene anti-aliasing with a programmable
5 × 5 radial filter and color precision increased to 30-bit RGB. It employed a
supersampled anti-aliasing technique using the convolution filters and a randomizing
technique based on stochastic sampling.
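As an illustration of the technique, and not a reproduction of Sun's actual filter weights or sample layout, the following C++ sketch resolves a stochastically supersampled buffer with a radial kernel over a 5 × 5-pixel footprint, which is the kind of convolution Zulu performed in hardware.

```cpp
// Simplified sketch of a supersampled resolve: each output pixel is a radially
// weighted sum of the jittered samples falling inside a 5x5-pixel footprint.
// Illustrative only; the kernel below is a hypothetical Gaussian.
#include <cmath>
#include <vector>

struct Sample { float dx, dy, r, g, b; };   // jittered offset within its pixel, plus color

static float radialWeight(float dist) {
    if (dist > 2.5f) return 0.0f;           // kernel support: 2.5-pixel radius
    return std::exp(-2.0f * dist * dist);
}

// samples[y * width + x] holds the stochastic samples rendered for pixel (x, y).
std::vector<float> resolve(const std::vector<std::vector<Sample>>& samples,
                           int width, int height) {
    std::vector<float> out(3 * width * height, 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float r = 0, g = 0, b = 0, wsum = 0;
            for (int ny = y - 2; ny <= y + 2; ++ny) {        // 5x5 pixel footprint
                for (int nx = x - 2; nx <= x + 2; ++nx) {
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                    for (const Sample& s : samples[ny * width + nx]) {
                        float sx = nx + s.dx, sy = ny + s.dy;  // absolute sample position
                        float d = std::hypot(sx - (x + 0.5f), sy - (y + 0.5f));
                        float w = radialWeight(d);
                        r += w * s.r; g += w * s.g; b += w * s.b; wsum += w;
                    }
                }
            }
            float inv = wsum > 0.0f ? 1.0f / wsum : 0.0f;
            out[3 * (y * width + x) + 0] = r * inv;
            out[3 * (y * width + x) + 1] = g * inv;
            out[3 * (y * width + x) + 2] = b * inv;
        }
    }
    return out;
}
```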
Each rendering subunit had 256 MB of texture memory, for 1 GB of onboard texture
memory in total. The system could generate up to 1920 × 1200 at 75 Hz with 30-bit
color. The XVR-4000 graphics accelerator listed for $30,000.
Michael Frank Deering was the Zulu product designer and manager. He said in
later years that the XVR-4000 made use of the 3DRAM-64 product (3DRAM) and
many internal circuits from the XVR-1000 product (FFB3), the first product to use
3DRAM-64. External vendors supplied custom CMOS circuits. The project also
called on back-end support, chip fabrication, and testing from outside vendors. On
the software side, other teams supplied OS drivers. Each shader core had five units:
two 64-bit-wide ADDs, two 64-bit-wide MULs, and one transcendental unit. Although
Deering described his role on Zulu as that of chief architect, he also pointed out the
high-level contributions of many others, such as Mike Lavelle, Dave Neagle, and
Scott Nelson.
“Zulu was my last Sun 3D graphics accelerator product (a.k.a. Sage a.k.a. XVR-4000). The
SIGGRAPH 2002 paper [1] gives an overview of the architecture of the graphics pipeline
used and goes into detail on its most novel aspect: real-time five-by-five-pixel convolution
of super-sampled rendered images with higher-order anti-aliasing and reconstruction filters.
This product still has by far the highest final anti-aliased pixel quality of any real-time
machine ever built.” [2].

Sun would follow the industry and later offer the Sun XVR-1200 graphics accel-
erator, which featured a 3Dlabs Wildcat graphics processing unit. However, in April
2009, Oracle bought Sun for $7.4B after IBM dropped its bid. Oracle wanted the
Java software and discarded the graphics.

5.2 SiliconArts Ray Tracing Chip and Intellectual Property (IP) (2019)

SiliconArts was founded in 2010 in Seoul by Dr. Hyung Min Yoon, formerly at
Samsung; Hee-Jin Shin from LG; Byoung Ok Lee from MtekVision; and Woo Chan
Park from Sejong University. The company took on the formidable task of designing
and manufacturing a ray tracing hardware accelerator coprocessor called RayCore.
The company showed its first implementation in a field-programmable gate array
(FPGA) in 2014, and it was impressive then [3]. During the four years that followed,
the company made steady improvements, expanding its product line and developing
an ambitious and impressive road map.

5.2.1 RayCore 1000

The first implementation of SiliconArt’s ray tracing hardware accelerator was the
RayCore 1000, shown in Fig. 5.3.
The RayCore design had several novel and interesting features, as shown in
Table 5.1.
The company targeted the RayCore 1000 at smartphones and tablets and
embedded and industrial processors for virtual reality (VR) and augmented reality
(AR).
In a test using Autodesk 3DS Max 2019, the company achieved the results shown
in Table 5.2 with the RayCore 1000 (Fig. 5.4).
The host system was an Intel Pentium Gold G5600 at 3.9 GHz (4 CPU).
The next generation was the RayCore 2000, described in the following section.

Fig. 5.3 SiliconArt’s RayCore 1000 block diagram

Table 5.1 SiliconArts feature set

Natural expression of light (e.g., reflection, refraction, transmission, shadow) | Global refraction
Based on specular reflection | Global transmission
Ray tracing-specific graphics effects | Optical effects
Moving light | Various special effects (with shader)
Global lighting | Defocus
Global shadow | Motion blur
Colored shadow | Depth of field
Textured shadow | Light shaft
2nd shadow | Other filtering techniques
Global reflection | Ray tracing optics effects
Courtesy of SiliconArts

Table 5.2 RayCore versus CPU ray tracing

CPU only | VRay S/W | Default option: 61 s
CPU only | Art S/W | Draft: 20 s
CPU only | Arnold S/W | Default option: 38 s
RayCore on Intel Arria 10 FPGA | RayCore plug-in S/W | Max option: 1 s

Fig. 5.4 Autodesk 3DS Max 2019 test with two omni-directional lights and 12,268 triangles. Courtesy of SiliconArts

5.2.2 RayCore 2000

In 2018, the company introduced its RayCore 2000 real-time ray tracing IP design,
capable of 250 Mrays/s at 500 MHz (4 cores) and running at 2048 × 2048 resolution.
A block diagram of the design is shown in Fig. 5.5.
The RayCore 2000 offered displacement mapping, the actual movement of geometric
points according to a given height field. It enhanced the level of realism while
cutting out complicated modeling processes. The processor also supported ambient
occlusion, calculating how exposed each point in a scene was to ambient lighting.
It also offered multi-texturing, with overlapping of multiple texture images and light
binding on a target object. Although the frame rate was not real time, the megaray
figures were impressive.
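As a generic illustration of that ambient occlusion calculation (not SiliconArts' implementation; the traceRay occlusion query below is a hypothetical stand-in for whatever the underlying ray tracer provides), a C++ sketch: exposure is estimated as the fraction of cosine-weighted hemisphere rays that escape nearby geometry.

```cpp
// Monte Carlo ambient occlusion sketch: AO at a surface point is the fraction
// of cosine-weighted hemisphere rays that do not hit anything within 'radius'.
#include <cmath>
#include <cstdlib>

struct Vec3 { float x, y, z; };

// Assumed to be provided by the ray tracing core (hypothetical): returns true
// if a ray from 'origin' along 'dir' hits anything closer than maxDist.
bool traceRay(const Vec3& origin, const Vec3& dir, float maxDist);

// Build an arbitrary orthonormal basis (t, b) around the surface normal n.
static void basis(const Vec3& n, Vec3& t, Vec3& b) {
    t = std::fabs(n.x) > 0.5f ? Vec3{n.z, 0, -n.x} : Vec3{0, -n.z, n.y};
    float len = std::sqrt(t.x*t.x + t.y*t.y + t.z*t.z);
    t = {t.x/len, t.y/len, t.z/len};
    b = {n.y*t.z - n.z*t.y, n.z*t.x - n.x*t.z, n.x*t.y - n.y*t.x};
}

float ambientOcclusion(const Vec3& p, const Vec3& n, int numRays, float radius) {
    Vec3 t, b;
    basis(n, t, b);
    int unoccluded = 0;
    for (int i = 0; i < numRays; ++i) {
        // Cosine-weighted direction in the hemisphere above n.
        float u1 = std::rand() / (float)RAND_MAX;
        float u2 = std::rand() / (float)RAND_MAX;
        float r = std::sqrt(u1), phi = 6.2831853f * u2;
        float lx = r * std::cos(phi), ly = r * std::sin(phi), lz = std::sqrt(1.0f - u1);
        Vec3 dir{ lx*t.x + ly*b.x + lz*n.x,
                  lx*t.y + ly*b.y + lz*n.y,
                  lx*t.z + ly*b.z + lz*n.z };
        if (!traceRay(p, dir, radius)) ++unoccluded;
    }
    return unoccluded / (float)numRays;   // 1 = fully exposed, 0 = fully occluded
}
```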

5.2.3 RayCore Lite

In 2019, the company introduced a powerful path-tracing acceleration GPU IP
solution called RayCore Lite. It featured a unified traversal and intersection (T&I)
test engine based on a high-performance MIMD architecture and was applicable to
servers and high-end GPU chips.

Fig. 5.5 SiliconArts' RayCore 2000 block diagram

The hardware T&I test rate for ray/path tracing was 760 Mray/s (16 cores at
190 MHz). Those two processes are the core of ray and path tracing: the traversal
unit finds the object hit by the ray (by searching the tree data structure), and the
intersection test unit finds the exact, or closest, point of the object intersected by the ray.
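To make the two roles concrete, here is a minimal C++ sketch of the intersection side using the well-known Moller-Trumbore ray-triangle test. It is a generic illustration, not SiliconArts' hardware algorithm; the traversal side would walk an acceleration structure such as the KD-tree discussed later in this section to decide which triangles to feed into this test.

```cpp
// Moller-Trumbore ray-triangle intersection: returns whether the ray
// origin + t*dir hits triangle (v0, v1, v2) and, if so, the distance t.
#include <cmath>

struct Vec3 { float x, y, z; };
static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
static Vec3  cross(Vec3 a, Vec3 b) { return {a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x}; }
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

bool intersectTriangle(Vec3 origin, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, float& tOut) {
    const float eps = 1e-7f;
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 p  = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < eps) return false;      // ray is parallel to the triangle
    float invDet = 1.0f / det;
    Vec3 s = sub(origin, v0);
    float u = dot(s, p) * invDet;
    if (u < 0.0f || u > 1.0f) return false;      // outside the triangle (u barycentric)
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * invDet;
    if (v < 0.0f || u + v > 1.0f) return false;  // outside the triangle (v barycentric)
    float t = dot(e2, q) * invDet;
    if (t <= eps) return false;                  // hit is behind the ray origin
    tOut = t;                                    // exact parametric hit distance
    return true;
}
```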
That 760 Mray/s figure was the ideal rate based on the clock speed and the clock
cycles required to complete the logic pipeline. The effective performance was 160 Mray/s,
benchmarked on an Intel PAC with an Arria 10 GX FPGA AIB using the Cornell Box scene.
The RayCore Lite feature list is shown in Table 5.3.
RayCore Lite became available in 2020.

5.2.4 Road Map

In addition to the three implementations listed above, the company also developed
other variants and extensions of the design.
Multi-core. The company introduced a multi-core version of the design called
RayCore MC, a photorealistic GPU IP offering with Monte-Carlo path-tracing, ray
generation, and direct/indirect illumination. Shown in Table 5.4 are the RayCore
MC’s proposed specifications.
RayCore MC was also available as IP.
RayTree. RayTree was a novel approach to ray tracing using a fast KD-tree
acceleration structure generation H/W for dynamic object rendering. KD-trees are

Table 5.3 SiliconArts' RayCore Lite specifications

Item | Description
Path-tracing functions | T&I test; input data: ray information; output data: hit-point calculation results; path-tracing and ray tracing functions with S/W
Others | Scalable architecture (multi-core support); S/W-controllable T&I; block-based ray generation and transmission; pipelined ray block transfer
API/software | Intel Embree; MS DirectX (under development); 3DS Max plug-in for path-tracing rendering (under development); Blender plug-in for ray tracing rendering (under development)
FPGA platform | Intel OPAE/PAC support
Courtesy of SiliconArts

Table 5.4 SiliconArts' RayCore MC specifications

Functionality | Description
Path-tracing functions | Path-tracing support; Monte-Carlo ray generation; real-time diffuse reflection/refraction/transmission/soft shadow; glossy reflection; colored shadow on transparent objects; textured shadow, multi-shadows; depth of field, motion blur
Lighting | Point light, spotlight, directional light; multiple light source support, global lighting
Others | Anti-aliasing; dynamic/static scene support; scalable architecture (multi-core support)
API | RayCore MC API
Courtesy of SiliconArts

a strategy for organizing data in 3D space. Each node holds a point in k-dimensional
space, and the resulting tree can be navigated and searched efficiently, as the sketch below illustrates.
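As a concrete illustration of what generating a KD-tree involves, the following C++ sketch is a generic median-split builder over points. It is not SiliconArts' generator, which splits scene triangles and uses cost heuristics, but the data structure being regenerated each frame is the same kind of tree.

```cpp
// Minimal KD-tree build sketch: recursively split a set of 3D points at the
// median along alternating axes (x, y, z). Illustrative only.
#include <algorithm>
#include <memory>
#include <vector>

struct Point3 { float p[3]; };

struct KdNode {
    Point3 point;                          // the point stored at this node
    int axis;                              // 0 = x, 1 = y, 2 = z
    std::unique_ptr<KdNode> left, right;   // points below / above the split plane
};

std::unique_ptr<KdNode> build(std::vector<Point3>& pts, int begin, int end, int depth) {
    if (begin >= end) return nullptr;
    int axis = depth % 3;
    int mid = begin + (end - begin) / 2;
    // Partially sort so the median along the current axis lands at 'mid'.
    std::nth_element(pts.begin() + begin, pts.begin() + mid, pts.begin() + end,
                     [axis](const Point3& a, const Point3& b) { return a.p[axis] < b.p[axis]; });
    auto node = std::make_unique<KdNode>();
    node->point = pts[mid];
    node->axis = axis;
    node->left  = build(pts, begin, mid, depth + 1);
    node->right = build(pts, mid + 1, end, depth + 1);
    return node;
}

// Usage: std::vector<Point3> pts = ...; auto root = build(pts, 0, (int)pts.size(), 0);
```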
The company offered dedicated KD-tree generation design IP for hardware-implemented
ray tracing. The company believed KD-tree regeneration was compulsory for any
application that had to deliver high-quality dynamic 3D content and guarantee real-time
interactivity. Despite the overhead it imposed on the system, the CPU had been primarily
responsible for KD-tree generation, which caused processing delays and high power
consumption. The company said RayTree could replace the CPU's role and maximize
KD-tree regeneration performance. It would regenerate KD-trees in real time, thereby
realizing on-the-fly dynamic scene processing without CPU involvement and saving
power. Shown in Fig. 5.6 is the generalized concept of the pipeline.

Fig. 5.6 SiliconArts' RayTree structure
SiliconArts designed RayTree for implementation in dedicated KD-tree generation
hardware. RayTree scanned primitives and generated the acceleration structure (KD-tree)
to support real-time dynamic scene processing. The company said it solved the
bottleneck between rendering and tree-building tasks by load balancing and
distributing resources, yielding efficient ray tracing rendering.
The company predicted that, compared with KD-tree generation on a mobile CPU,
RayTree had 35x faster KD-tree generation capability. Furthermore, offloading that
overhead from the CPU improved power efficiency. Combining RayTree with RayCore
effectively removed the KD-tree generation process from mobile CPUs, the company
said, which would reduce power consumption at the system level.
The generalized architectural organization is shown in Fig. 5.7.
SiliconArts' RayTree offered a parallel hybrid tree architecture with a single scan-tree
unit and n KD-tree units.

Fig. 5.7 SiliconArt’s RayTree architecture



5.2.5 Summary

SiliconArts defied predictions of its demise and found other funding to carry on
its R&D, which appeared well invested, and the company’s product line and road
map were impressive. The company claimed to have a large OEM customer. When
that customer brought a product to market with SiliconArts' technology, the company
could stop using investor money and move toward positive cash flow. 2019 marked
the year of ray tracing, with Nvidia's huge commitment to and introduction of ray
tracing products, Sony's announcement that the PS5 would offer ray tracing, and the
expected introduction of hardware-accelerated ray tracing capabilities in AMD's and
Intel's GPUs. SiliconArts was in the right place at the right time.

5.3 Intel Xe Architecture-Discrete GPU for High-Performance Computing (HPC) (2019)

Intel made significant and interesting announcements at the Denver 2019 supercom-
puter conference (SC19). The company officially launched its oneAPI, a unified
and scalable programming model for heterogeneous computing architectures. It also
announced a general-purpose GPU-compute design based on the Xe architecture,
code named Ponte Vecchio, and optimized for HPC/AI acceleration. And, it revealed
more architectural details of the exascale Aurora Supercomputer at Argonne National
Laboratory.
Intel said Ponte Vecchio would be manufactured on Intel’s 7 nm technology. The
GPU would use Intel’s Foveros 3D and embedded multi-die interconnect bridge
(EMIB) packaging (introduced December 2019). Other packaging technologies
included high bandwidth memory, Compute Express Link interconnect, and other
intellectual property.
Then, in August 2021, the company conceded its manufacturing was not up to
par and announced that the Ponte Vecchio chips would be manufactured by Taiwan
Semiconductor Manufacturing Company (TSMC) in Taiwan (Fig. 5.8).
Intel said its data-centric silicon portfolio and oneAPI initiative laid the foundation
for converging HPC and AI workloads at exascale within the Aurora system at
Argonne National Laboratory. Aurora was the first U.S. exascale system (Fig. 5.9).
Intel said it used the full breadth of its data-centric technology portfolio, building
on the Intel Xeon Scalable platform and using Xe -based GPUs and Intel Optane DC
Persistent Memory and connectivity technologies.
The compute node architecture of Aurora featured two 10 nm-based Intel Xeon
processors (code named Sapphire Rapids) and six Ponte Vecchio GPUs. Aurora had
over 10 petabytes of memory and over 230 petabytes of storage and used the Cray
Slingshot fabric to connect nodes across more than 200 racks.
Intel’s entry into a top-to-bottom, scalable GPU architecture called Xe promised
a lot (Fig. 5.10), and the Ponte Vecchio device was the first reveal.

Fig. 5.8 Features of Intel's Ponte Vecchio GPU. Courtesy of Intel

Fig. 5.9 Aurora exploited a lot of Intel technology. Courtesy of Intel

Fig. 5.10 The Xe architecture is scaled by ganging together tiles of primary GPU cores. Courtesy
of Intel

Fig. 5.11 Intel’s Xe -HPC 2-stack shows the configurable and scalable aspects of its Xe -core design.
Courtesy of Intel

The HPC version had eight 512-bit vector and eight 4096-bit matrix engines,
showing the scaling mix-and-match capability of the Xe design. Intel also showed a
diagram (Fig. 5.11) of 64 of them stacked up and interconnected via 8 MB L1 caches
[4].
In addition, the Xe-HPC 2-stack had eight slices with 128 cores, 128 ray tracing
units, and eight hardware contexts, as well as two media engines, eight HBM2e
controllers, and 16 Xe links (Fig. 5.12).
In late October 2021, Raja Koduri, senior vice president and general manager of
Intel’s accelerated computing systems and graphics (AXG) group, said in a Tweet:
We deployed Xe-HP in our oneAPI devCloud and leveraged it as a SW development vehicle
for oneAPI and Aurora. We currently do not intend to productize Xe-HP commercially;
it evolved into high-performance graphics (HPG) and HPC that was on general market
production path [5].

Intel ended its plans in late 2021 to bring its Xe-HP server GPUs to the commercial
market, saying that Xe-HP had evolved into the Xe-HPC (Ponte Vecchio) and Xe-HPG
(Intel Arc gaming GPU) products. The company no longer saw the benefit or return
on investment from releasing a second set of server GPUs based on Xe-HP.
Intel's first family of server GPUs, also known by the code name Arctic Sound,
had been the most high-profile product under development from Intel's reorganized
GPU group. Koduri had often shown chips housing the silicon as Intel brought up
prototypes in its labs. And, except for the Xe-LP/DG1 AIB, Xe-HP was the first Xe
silicon Intel developed. It would have been the only high-performance Xe silicon
manufactured by Intel, but TSMC subsequently built the Xe-HPC's compute tiles
and the Xe-HPG dies.
The Xe-HPC-based Ponte Vecchio was the most complex SoC Intel had ever
designed and a showcase example of the integrated device manufacturing (IDM)
2.0 strategy it announced in March 2021. Ponte Vecchio took advantage of several
advanced semiconductor processes, such as Intel's EMIB technology and Foveros
3D packaging. Intel was bringing to life what it called its moonshot project with that
product. The 100-billion-transistor device delivered enormous floating-point operations
per second (FLOPS) and compute density to accelerate AI, high-performance
computing, and advanced analytics workloads.

Fig. 5.12 The Xe link allowed even more extensive subsystems to be created. Courtesy of Intel
Ponte Vecchio included several complex designs in the EMIB tile that enabled a
low-power, high-speed connection between the tiles. They used Foveros packaging
for the 3D stacking of active silicon for power and interconnect density. A high-speed
mesh interconnect architecture (MDFI) allowed scaling from one to two stacks.
A compute tile was a dense package of Xe -cores and was the heart of Ponte
Vecchio. One tile had eight Xe -cores with 4 MB L1 cache; it was key, said Intel,
to deliver power-efficient compute. The tiles were built on TSMC’s most advanced
process technology at the time, 5 nm N5. Intel had set up the design infrastructure,
tool flows, and methodology to be able to test and verify tiles for that node. The tile
had a tight 36-micron bump pitch for 3D stacking with Foveros (refer to Fig. 5.13).

Fig. 5.13 Ponte Vecchio with >100 billion transistors, 47 active tiles, and five process nodes.
Courtesy of Intel

The base tile, the connective tissue of Ponte Vecchio, was where all the complex
I/O and high bandwidth components would come together with the SoC infrastruc-
ture—PCIe Gen5, HBM2e memory, and MDFI links to connect tile-to-tile and EMIB
bridges.
High bandwidth 3D connect with high 2D interconnect and low latency made
Ponte Vecchio an infinite connectivity machine. Intel said the technology develop-
ment team worked to match the requirements on bandwidth, bump pitch, and signal
integrity.
Xe link tile was critical for scale-up for HPC and AI and provided the connectivity
between GPUs, providing eight links per tile. Intel added that tile to enable the
scale-up solution for the Aurora exascale supercomputer.
Ponte Vecchio was scheduled for release in 2022 for HPC and AI markets.
Intel disclosed IP block information for the Xe-HPC microarchitecture: eight vector
and eight matrix engines (XMX, Xe Matrix eXtensions) per Xe-core; slice and stack
organization; and tile information, including the process nodes for the compute, base,
and Xe link tiles. The system had eight Xe-cores and 4 MB of L1 cache per compute
tile (Fig. 5.14).
The base tile was built on Intel's Foveros substrate and was 640 mm² in size
(Fig. 5.15). It included the PCIe Gen5 I/O and a 144 MB L2 cache.
At Architecture Day 2021, Intel showed A0 Ponte Vecchio silicon that generated
more than 45 TFLOPS of FP32 throughput, greater than five terabytes per second
(TBps) of memory fabric bandwidth, and greater than two TBps of connectivity
bandwidth. Ponte Vecchio would use the oneAPI standards-based, cross-architecture,
cross-vendor unified software stack, as with all of Intel's Xe architectures.
Then, Intel stacked up the Ponte Vecchio circuit boards shown in Fig. 5.14 to
build the 2S Sapphire Rapids accelerator board shown in Fig. 5.16.

Fig. 5.14 Ponte Vecchio chips in a carrier from the fab. Courtesy of Stephen Shankland/CNET

Fig. 5.15 Intel's Ponte Vecchio circuit board revealing the tiles in the package. Courtesy of Intel
Tying all the scalability together with CPUs was done via Intel's oneAPI
(Fig. 5.17). Koduri said oneAPI was open and standards-based, with a unified
software stack. That would allow freedom from proprietary programming models while
giving developers full performance from the hardware and peace of mind. It was
an apparent reference to Nvidia's proprietary and very successful compute unified
device architecture (CUDA) programming environment.

Fig. 5.16 Intel's accelerated compute system. Courtesy of Intel
The oneAPI initiative offered an open, standards-based unified software stack
that Intel said was cross-architecture and cross-vendor, allowing developers to break
free from proprietary languages and programming models. It had data-parallel C++
(DPC++) and oneAPI library implementations for Nvidia GPUs, AMD GPUs, and
Arm CPUs. Intel said that oneAPI was being adopted broadly by independent software
vendors, operating system vendors, end users, and academics. Key industry leaders
were helping to evolve the specification to support additional use cases and
architectures. Intel also had a commercial product offering with the foundational
oneAPI Base Toolkit, which added compilers, analyzers, debuggers, and porting tools
beyond the specification language and libraries.

Fig. 5.17 Intel's oneAPI allowed heterogeneous processors to communicate and cooperate
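To make the single-source, cross-architecture claim concrete, the following is a minimal DPC++/SYCL-style vector-add sketch. It is illustrative rather than Intel sample code: the same C++ kernel is submitted to whichever device the runtime selects, whether a CPU, an Intel GPU, or another vendor's accelerator with a suitable backend.

```cpp
// Minimal DPC++/SYCL sketch of the oneAPI programming model: one C++ source,
// dispatched to whatever device the runtime selects. Illustrative only.
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    const size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    sycl::queue q;   // default device selection: a GPU if present, else the CPU
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";
    {
        sycl::buffer<float> A(a.data(), sycl::range<1>(n));
        sycl::buffer<float> B(b.data(), sycl::range<1>(n));
        sycl::buffer<float> C(c.data(), sycl::range<1>(n));
        q.submit([&](sycl::handler& h) {
            sycl::accessor ra(A, h, sycl::read_only);
            sycl::accessor rb(B, h, sycl::read_only);
            sycl::accessor wc(C, h, sycl::write_only, sycl::no_init);
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                wc[i] = ra[i] + rb[i];     // the kernel: element-wise add
            });
        });
    }   // buffers go out of scope here, copying results back into c
    std::cout << "c[0] = " << c[0] << "\n";   // expect 3
    return 0;
}
```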
The scalability of the Xe design was impressive. Intel had organized major subsystems
into tidy blocks that could be added or subtracted as needed for a particular
market segment. The tiling approach would work very well if the Foveros substrate
had good bandwidth and low latency (as Intel claimed it did). The design would
have long legs and benefit from node improvements. If Intel could do that, the
company's ROI would be a win. The counterargument was economy of scale:
building several parts and then bolting them together versus building a monolithic chip
and manufacturing a large quantity to drive down costs and accelerate ROI. It would
take at least three years to determine if Intel made the right bet. Intel's next-generation
GPU, code named Rialto Bridge and planned for a 2023 launch, would succeed Ponte Vecchio.
oneAPI was also promising and challenging. Intel did not include discrete
GPUs, DSPs, or neural network processors in its early unveiling; they all had unique
instruction set architectures (ISAs) and programming rules.

5.4 Compute GPU Zhaoxin (2020)

Strong murmurs and rumors from China indicated that Zhaoxin, a joint venture
startup between Via Technologies and the Shanghai Municipal Government founded
in 2013, might introduce a dGPU. How did it manage to do that? Via Technologies
was founded in the late 1980s by Cher Wang, heir to the Formosa Plastics Group. She
established the company to supply motherboard chipsets and components, but she
had a much bigger vision from the start. Via was initially founded in California and
the UK to be near the fast-growing PC community, and as that business increasingly
moved to Asian suppliers, Via Technologies moved to Taipei in 1992.
Meanwhile, Cyrix Semiconductor started in 1988 in Richardson, Texas. In 1992,
it released a 486 CPU, and in 1995, it introduced its 586 to compete with Intel’s
Pentium class processors. In 1996, it introduced the MediaGX, one of the first CPUs
with integrated graphics. National Semiconductor bought Cyrix in 1997 for $550
million. And then, in July 1999, Via Technologies bought the Cyrix x86 division
from National Semiconductor for considerably less. National, however, hung on to
the MediaGX part—for a while.
Also in Texas, Centaur Technology was formed in 1995 to design low-power
versions of the x86 processor. Funded by IDT, Centaur introduced the WinChip
in 1997. When the WinChip business ran out of money in 1999, Via bought Centaur. The design
subsequently became the Via C3 through C7 processors. Via was amassing CPU expertise.
In 2001, at Comdex, Via said it was enjoying growing sales for its processors and
expected to sell 10 million Via-Cyrix III processors. In 2011, the company announced
a quad-core x86 processor.

National renamed the MediaGX the Geode, hoping to sell it as an integrated processor.
But, in 2003, National sold Geode to AMD. Then, in June 2006, AMD unveiled
the world's lowest-power x86-compatible processor, which consumed only 0.9 W.
That processor, based on the Geode core, demonstrated that Cyrix's architecture was
long-lasting.
Via's graphics development began in 1999 when it acquired graphics pioneer S3,
founded in 1989. S3 had been a solid and innovative contender when the market was
awash in graphics hardware developers. S3 was among the first companies to offer
3D acceleration. It began developing for the nascent VR market, which did not take
off. After almost running out of cash, S3 and Via teamed up and announced the
formation of S3–Via, a partnership. Then in 2000, during the crash of the Internet
Bubble, S3–Via bought another graphics chip pioneer, Number Nine, founded in
1982. Via now had a treasure trove of graphics IP and patents from Cyrix, S3, and
Number Nine.
Via struggled along, always strapped for cash, and in mid-2011 announced that
HTC (also founded by Cher Wang) would purchase S3 Graphics from Via for $300
million. With the acquisition, HTC bought S3 Graphic’s portfolio of patents and
pending applications; however, HTC said it would provide a perpetual license of the
S3 patents to Via.
Finally, in 2013, running low on cash again, Via formed a joint venture with the
Shanghai Municipal Government and created Zhaoxin Semiconductors (a.k.a. Via
Alliance Semiconductor Co., Ltd.). The new joint venture, based in China, now had
all the patents and most of the engineers needed to design and build an X86 processor
and GPUs.
In September 2018, at the China International Industrial Expo, the company
announced the KX-6000 with performance comparable to that of a seventh-
generation (2016) Intel Core i5-7400 processor. It was built on a 16 nm TSMC
process with a clock frequency of 3.0 GHz, and the chip had a DirectX 11.1-compatible
iGPU based on S3 Graphics technology.
The Kaixian KX-6000 series x86 processor integrated the memory controller,
PCIe 3.0 controller, USB, and SATA controller; it incorporated a DirectX 11 display
core, HD audio, and video codec. The HDMI/DP video output module did not require
a motherboard chipset—a true single chip solution [6].
In 2020, CNTechPost reported that Zhaoxin would develop a dGPU, citing the
road map slide shown in Fig. 5.18.
In 2019, Zhaoxin established a subsidiary, GlenFly Fei (Glenfield Intelligent Tech-
nology Co., Ltd.), chartered with the goal of building a graphics GPU. GlenFly
Intelligent Technology Co., Ltd. (GlenFly) said on its web page it was committed
to helping customers in the fields of computer software and hardware, autonomous
driving, online games, computer software, smart office, etc. It would supply software-
and hardware-integrated GPU graphics and image solutions and AMOLED display
solutions. In October 2019, GlenFly received a U.S. patent (11,030,937) for a subpixel
rendering method and device.

Fig. 5.18 Zhaoxin’s road map showed a dGPU. Courtesy of CNTechPost

Shortly after the transfer of S3 technology to Shanghai SASAC, the company


announced a GPU-compute AIB. In early 2021, Shanghai Tianshu Zhaoxin Semi-
conductor Co. released the first 7-nm chip in China based on a proprietary GPU
architecture, code named Big Island GPGPU. The long and arduous path of the x86
CPU technology is illustrated in Fig. 5.19.
Then, in June 2021, images of GlenFly’s graphics AIB appeared on the web
(Fig. 5.20) [7].
In 2020, Zhaoxin promised a GPU that would support a DirectX 11.1 or DirectX
12-level feature set within a 70 W thermal power envelope, built on TSMC's 28 nm
fabrication process. The photo in Fig. 5.20 shows no auxiliary power cables at the top
of the AIB, which suggests the board drew all of its power from the PCIe slot (75 W
maximum) and that the power target was met.

Fig. 5.20 GlenFly's AIB running the Unigine Heaven benchmark. Courtesy of GlenFly Technology
Then, after 22 years of ignoring Via, Intel suddenly took interest. In late 2021, Intel
hired all of Via’s x86 engineers. “Intel will recruit some of Centaur’s employees …
with certain covenants from the Company… As consideration, Intel will pay Centaur
U.S.$125 m,” read the Via investors note [8].
Although Via bought Centaur in 1999 in the hopes of competing with Intel with a
less expensive and more power-efficient x86 processor, Via never obtained more than
2% market share. In July 2021, Centaur said it was developing a high-performance
deep learning coprocessor integrated into a server-class x86 processor. The company
claimed the design would achieve 20 tera-operations per second through eight high-
performance cores.

Fig. 5.19 Texas CPU to new GPU—a long, tortuous path

5.5 MetaX (2020–)

Chen Weiliang started MetaX (Mu Xi) Integrated Circuit Co. in the Pudong Lingang
Special Area of Shanghai’s Free Trade Zone in September 2020. With a master’s
degree from the Institute of Microelectronics of Tsinghua University, Chen had been
a senior GPU researcher at AMD China. Chen said MetaX would develop artificial
intelligence processors consisting of application-specific integrated circuits (ASICs),
FPGAs, and GPUs.
Chen commented that “GPU chip’s versatility and parallel computing capabilities
help it effectively apply to AI, cloud computing, and high-performance computing”
[9].
The company conducted four rounds of financing, raising nearly 100 million yuan
(~$15 million); Heli Capital led the financing. By August 2021, the company had
raised a CNY 1 billion (~$150 million) Series A round of funding. Two state-owned
equity investment platforms, the China Structural Reform Fund and the China
Internet Investment Fund, plus Lenovo's venture capital arm, led the latest round.
Existing shareholders, including Jingwei China, Heli Capital, Sequoia China, and
Lightspeed China Partners, continued to invest in the firm.
The MetaX team came to the GPU market with an average of nearly 20 years of
research and development experience, including developing GPU products from 55 to
7 nm. The company claimed its team members had led the research and development
of more than ten mainstream high-performance GPU products worldwide. The company
developed a concept of “extending reconfigurable hardware architecture based on
the traditional GPU architecture.”
In 2006, AMD invested $16 million in setting up its R&D center in Shanghai,
its most significant overseas R&D investment [10]. Chen and the core team had
worked for AMD for years before leaving to start MetaX. In late 2021, the startup
had more than 300 employees, over 80% of whom worked in R&D.
Ms. Peng Li, chief architect of hardware at MetaX, was the first Chinese woman
to be made a Fellow at AMD and had 15 years of experience in high-performance GPU
design at AMD. Dr. Yang Jian, MetaX's chief software architect, had been the first
scientist at AMD China. He had served as a chief architect for AMD and HiSilicon
and had 20 years of experience in large-scale chip and GPU software and hardware
design. Other senior managers had worked at chip companies such as Trident,
Arm, Nvidia, Intel, Spreadtrum, Cadence, Synopsys, byte, and LanChi (Fig. 5.21).
"The GPU," said Chen, "has achieved many improvements in its microarchitecture
from the initial fixed pipeline ASIC to today's general GPU processor. Mu Xi
proposed a software-defined hardware-reconfigurable architecture for big data
computing. The granular hardware reconfiguration enabled more accurate and
flexible calculations to achieve a better energy efficiency ratio."

Fig. 5.21 Muxi's CEO Chen Weiliang has worked in the GPU field for 20 years. Courtesy of Muxi
Reflecting China’s ambitions for its technology sector, Chen said:
Only by redefining the GPU instruction set and building the microarchitecture, software
drivers, compilers, and library functions from scratch can we achieve completely independent
intellectual property rights and higher product performance.

Chen and his backers saw that overseas companies such as AMD, Intel, and Nvidia
dominated the GPU market and believed there was a huge opportunity for Chinese
companies. Chinese developers and a few investors had seen the opportunities, but
the high investment, slow return, lengthy R&D pace, and lack of scientific and tech-
nological talents made many companies and investors hesitate. The semiconductor
firms that did launch in China encountered bottlenecks. China imported 543.5 billion
integrated circuits in 2020, a year-on-year increase of 22.1%; imports amounted to
$350 billion, with a year-on-year increase of 14.6%. However, although China’s chip
demand continued to rise, domestic chip production accounted for less than 20% of
the domestic chip market. Regardless of the industry, China has a market advantage
due to its population base and market size.
In March 2021, Tian Yulong, a member of the Party Leadership Group, Chief
Engineer, and Spokesperson of the Ministry of Industry and Information Technology,
stated at a State Council Information Office press conference [11] that the government
had issued policies to promote the integrated circuit industry and supporting software.
He also said the government had pledged long-term support for preferential corporate
income tax for chip firms, the entire chip industry chain, talent reserves, and training.

In addition to policy support, the Chinese government would help with capital to
develop the industry. By August 2021, there were 27 financing operations for chip
design companies, twenty-two of which were above 100 million yuan (~$15 million).
The benefits of the increased funding for industrial development were apparent to
all, and startups appeared on a large scale.
Many chip companies entered the market in China, but the failure rate was high.
Chen Weiliang said that the lack of talent was a significant problem encountered in
the development of semiconductor companies. Still, it was not the root cause of the
high bankruptcy rate in the industry. A lack of core technology, weak products, poor
liquidity, a mismatch of strategy and capability, and weak team cohesion could all cause
business failure [11].
In 2019, market research firm International Data Corp (IDC) predicted China’s
GPU server market would reach 30 billion yuan ($4.32 billion) by 2023 [12]. As a
result, armed with backing from government-controlled companies and the govern-
ment, the Chinese GPU startups including MetaX, Zhaoxin, XiangDiXian, and others
chased after the 30 billion yuan IDC predicted. The GPU server market in China was
predicted to reach about 45 billion yuan ($6.75 billion) in 2024.
GPU chips accounted for about 50% of the cost of GPU servers. The total market
size for GPUs in the Chinese GPU server market in 2025 was estimated to be about
27 billion yuan.
The hollowing out of AMD's China R&D center was a blow to the company, and, as of
this writing, it remained to be seen where the Chinese GPUs would be manufactured.
At the time, China was pushing to get 12 nm fabs up, but to compete in the server
market in 2021, the startups needed to find a 5 or 3 nm fab. That meant going offshore
to Taiwan or Korea.
AMD restaffed its R&D center and, in 2022, had more people working there than
ever before.

5.5.1 MetaX Epilogue

AMD had struggled with China for a while. In June 2019, The Wall Street Journal
reported “How a Big US Chip Maker Gave China the ‘Keys to the Kingdom [13].”’
The story referred to AMD forming a joint venture called Tianjin Haiguang Advanced
Technology Investment Co. Ltd. (THATIC). THATIC was owned by AMD (51%) and
public and private Chinese companies, including the Chinese Academy of Sciences.
THATIC was chartered to enable and allow Haiguang Microelectronics Co. Ltd.
(a.k.a. HMC) to build X86 processors using AMD’s IP [14].
AMD would license SoC and high-performance processor IP to THATIC for
China’s fast-growing server market. AMD stood to gain up to $293 million in revenue
from the deal without incurring investment exposure. The THATIC deal involved
licensing x86 IP and not AMD graphics or Arm-based designs.

5.6 XiangDiXian Computing Technology (2020)

Xiangdixian Computing Technology (Chongqing) Co., Ltd. was established in 2020


and positioned itself as a high-end processor chip design company. The company’s
headquarters were in Chongqing, and it claimed to have R&D centers in Beijing,
Shanghai, Chengdu, Suzhou, and other places.
The senior product development manager, Nick Qi, worked at Hycon and AMD
before joining XiangDiXian.
The company joined Khronos, the PCI SIG, VESA, and JEDEC.
The company said on its web page that it intended to develop a GPU that could
serve graphics and image processing, AI, VR and AR, gaming, and industrial appli-
cations. As of the end of 2021, there was no further information on or about the
company.

5.7 Bolt Graphics (2021–)

Bolt Graphics was a startup in stealth mode in 2020 and obtained funding in 2021 to
bring the company out of the shadows. The company was started by Darwesh Singh,
a cloud architect in Minneapolis, with the goal of building a ray tracing accelerator for
the rendering/VFX market. The company claimed it had the first fully hardware-accelerated
pipeline in the world. Bolt said its GPU was optimized for scenes involving reflective
surfaces, diffuse materials, and complex geometry. Scenes developed for 8 K or
higher resolutions, hundreds of samples per-pixel, and multiple bounces would render
quickly on the Bolt Platform, while other systems would struggle (Fig. 5.22).
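To put that workload claim in rough perspective, here is a back-of-the-envelope ray count, assuming an 8K frame (7680 × 4320), 256 samples per pixel, and four bounces per sample; these figures are chosen for illustration and are not Bolt specifications, and the 230 Gray/s sustained rate is the one Bolt later quoted (see Table 5.5).

```latex
% Rough, illustrative estimate only; resolution, sample count, and bounce depth are assumptions.
\begin{align*}
\text{rays per frame} &\approx 7680 \times 4320 \times 256 \times (1 + 4) \approx 4.2 \times 10^{10}\\
\text{time per frame} &\approx \frac{4.2 \times 10^{10}~\text{rays}}{230 \times 10^{9}~\text{rays/s}} \approx 0.18~\text{s}, \text{ i.e., roughly five frames per second}
\end{align*}
```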
Bolt argued that workloads were scaling faster than platforms—a new architecture
was needed. Figure 5.23 is a symbolized diagram of Bolt’s approach.
Since the Bolt 1 chip was designed for scale-out systems, the company claimed
it could perform equally well alone or with many chips. The Bolt Platform shared
memory between 16 chips, which reduced data transfer overheads when a data set
grew past one terabyte.
For enterprise customers, Bolt Graphics' data center GPU used technologies like
HBM, TSMC's 5 nm process, and PCIe Gen5, and it got its first silicon in the second half
of 2022. Bolt's GPU was projected to offer more than three times the performance of
other solutions in rendering and AI workloads. The second-generation part would
rendering and AI workloads than other solutions. The second-generation part would
be trimmed for gaming, DCC, and general-purpose computing and graphics.
One Bolt server, said the company in late 2021, would provide the same perfor-
mance as 16 high-end dual-CPU servers at a similar cost (Table 5.5). In situations
where render deadlines were tight, Bolt claimed its platform could perform 11.5x
faster, enabling users to meet aggressive render deadlines.
Bolt argued that workloads were scaling faster than platforms and that a new
architecture like theirs was needed (Fig. 5.24).

Fig. 5.22 Darwesh Singh, CEO and founder bolt graphics. Courtesy of Singh

Fig. 5.23 Bolt’s HPC GPU. Courtesy of Bolt

Bolt said its servers consumed less than one-fourth of the total power of CPU-
based servers, significantly reducing the environmental footprint. No special cooling
treatment was needed, enabling simple integration within existing data centers.

Table 5.5 Bolt's specifications

Single-precision performance | 590 TFLOPS
Ray-trace performance | 230 Grays/s sustained
Memory capacity | Up to 8 TB across 16 DDR5 DIMMs
Memory bandwidth | Up to 716.8 GB/s
Network ports | Quad 400 GbE QSFP-DD
System size | 2U 19-in. rackmount
Maximum power requirements | 5000 W (200–240 Vac)
Courtesy of Bolt Graphics

Fig. 5.24 Bolt targeted industries characterized by exponentially expanding workloads. Courtesy
of Bolt Graphics

5.8 Jingjia Micro Series GPUs (2014)

Changsha Jingjia Microelectronics Co., Ltd (Jingjia Micro) was founded in the
Yuelu District of Changsha Hunan, China, on April 5, 2006, to design military elec-
tronic products. Its products included integrated circuits, intelligent display modules,
graphics controller boards, signal processing boards, graphics chips, and related soft-
ware. Jing Jiawei was the founder and chairman of the Board, and in 2022, the
company had over 870 employees (Fig. 5.25).
In April 2014, we successfully developed the first domestic high-reliability, low-power GPU
chip-JM5400, with completely independent intellectual property rights, breaking the long-
term monopoly of foreign products in the GPU market in my country—Jing Jiawei

In late 2021, the company announced it was preparing its latest GPU, the JM7000-
series.
Information about the company’s pursuit to develop a GPU was first announced
in August 2019 [15] when the company projected its 28 nm-based GPU would
compete with Nvidia’s GTX 1050 and 1080. The company based its speculations on
the Jingmei JM5400 series GPUs used in Chinese military aircraft. At the time, the
company said the JM7200 series had already obtained some orders.
Fig. 5.25 Jing Jiawei, founder of Changsha Jingjia Microelectronics Co., Ltd. Courtesy of Changsha Jingjia

The JM7200 had the following specifications:
Clock frequency: 1300 MHz, with support for dynamic frequency modulation
Host interface: PCIe 2.0 ×8
Video memory: two groups of on-chip DDR3 memory; each group was 32 bits wide and supported a maximum capacity of 4 GB
Video memory bandwidth: 17 GB/s
Rendering capability: four rendering pipelines; at 1300 MHz, the pixel fill rate was 5.2 Gpixels/s and the texture fill rate was 10.4 GT/s (the arithmetic is sketched after this list)
HD decoding: H.264, VC-1, VP8, MPEG-2, and MPEG-4 hardware decoding, with the ability to display the decoded video as a texture
Display output: four independent display outputs across 4 HDMI, 2 VGA, 2 LVDS, and 2 DVO interfaces
Video input: four LVTTL inputs
API support: OpenGL
Power consumption: less than 20 W for desktop applications
Manufacturing: fabricated at Semiconductor Manufacturing International Corporation (SMIC) and packaged in a 23 mm² FCBGA
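The fill-rate figures are consistent with the pipeline count and clock. A worked check follows; the implied eight texture samples per clock is an inference from the published numbers, not a stated specification.

```latex
% Worked check of the JM7200 fill-rate figures (texture-unit count inferred).
\begin{align*}
\text{Pixel fill rate} &= 4~\text{pipelines} \times 1.3~\text{GHz} = 5.2~\text{Gpixels/s}\\
\text{Texture fill rate} &= 10.4~\text{GT/s} \;\Rightarrow\; 10.4~\text{GT/s} \div 1.3~\text{GHz} = 8~\text{texture samples per clock}
\end{align*}
```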
The AIB was designed for the Unity Operating System (UOS).
The company offered the GPU in two form factors, an MSM board and a PCIe AIB,
shown in Fig. 5.26.
In 2021, the company began preparing its next-generation GPU, the 9000 series,
and offered the comparison shown in Table 5.6.
The JM9 series' performance target was to reach the level of AMD and Nvidia GPUs
from the end of 2017 and the beginning of 2018.

Fig. 5.26 Jingjia micro’s JM7200-based PCIe AIB. Courtesy of Changsha Jingjia Microelectronics
Co.

Table 5.6 Comparison of JM9000-series GPUs to Nvidia GTX 1000-series GPUs

Parameter | JM9231 | GTX 1050 | JM9271 | GTX 1080
API support | OpenGL 4.5, OpenCL 1.2 | OpenGL 4.6, DX12 | OpenGL 4.5, OpenCL 2.0 | OpenGL 4.6, DX12
Boost clock rate | >1500 MHz | 1455 MHz | >1800 MHz | 1733 MHz
Bus width | PCIe 3.0 | PCIe 3.0 | PCIe 4.0? | PCIe 3.0
Memory bandwidth | 256 GB/s | 112 GB/s | 512 GB/s | 320 GB/s
Memory capacity/type | 8 GB GDDR5 | 2 GB GDDR5 | 16 GB HBM | 8 GB GDDR5X
Pixel rate | >32 GPixel/s | 46.56 GPixel/s | >128 GPixel/s | 110.9 GPixel/s
FP32 performance | 2 TFLOPS | 1.8 TFLOPS | 8 TFLOPS | 8.9 TFLOPS
Output options | HDMI 2.0, DisplayPort 1.3 | HDMI 2.0, DisplayPort 1.4 | HDMI 2.0, DisplayPort 1.3 | HDMI 2.0, DisplayPort 1.4
Video encoding | H.265/4K 60 FPS | H.265/4K 60 FPS | H.265/4K 60 FPS | H.265/4K 60 FPS
TDP | 150 W | 75 W | 200 W | 180 W
Courtesy of Jingjia Micro

According to the company's specifications, the performance of the JM9231 could
reach the level of low-end products from 2016. The JM9271's core frequency was
greater than 1.8 GHz; it was compatible with PCIe 4.0 ×16, carried 16 GB of HBM
memory with a bandwidth of 512 GB/s, and offered floating-point performance of up
to eight TFLOPS. That performance was equal to or greater than Nvidia's GTX
1080.
In 2019, the Chinese government announced it had set up a national semiconductor
fund (China Integrated Circuit Industry Investment Fund) to develop and expand its
domestic chip industry and close the technology gap with the U.S. [16]. The fund
was budgeted at 204.2 billion yuan ($28.9 billion). It was, at the time, the largest such
fund, exceeding a similar fund launched in China in 2014 that had raised 139 billion
yuan.
Although Jingjia Micro had targeted the GTX 1080, it was building its GPU at
28 nm, whereas the GTX 1000 series was built on 16 nm, giving Nvidia a considerable
advantage in performance headroom and cost. China had not yet been able to
build leading-edge, low-nanometer fabs, and it would take several years to acquire
the necessary manufacturing equipment and experience to do so. Clearly, though, the
Chinese were committed to their semiconductor programs.
Changsha Jingjia Microelectronics was a pure-play fabless semiconductor
company similar in organization, goals, and ambitions to AMD, Nvidia, and Medi-
aTek. The company received sales from many key national Chinese projects, giving
it a considerable home market advantage and a favorable ROI for the immense costs
of developing a GPU. The only drivers available were for the UOS and Kylin OS.
UnionTech developed the Unity Operating System (UOS), a Chinese Linux distribution. It was used in the People's Republic of China as part of a government initiative
beginning in 2019 to replace foreign-made software such as Microsoft Windows with
domestic products.
Kylin was an operating system developed in 2001 at the National University of
Defense Technology in China and named after the mythical beast Qilin [17]. The
first versions were based on FreeBSD and intended for use by the Chinese military
and other government organizations.
The major Chinese chipmakers, including the SMIC foundry, have continued to
develop 14 and 12 nm process nodes based on fin field-effect transistor technology,
with solid support and encouragement from the government. China’s Vice Premier
Liu started a program in 2020 focused on using the country’s semiconductor manu-
facturing resources and talent to make China independent and a potential world leader
in compound semiconductors.

5.9 Alphamosaic to Pi via Broadcom (2000–2021)

Robert Swann and Steve Barlow founded Alphamosaic Ltd, a UK semiconductor


company, in 2000 in Cambridge. They had worked at Cambridge Consultants, devel-
oping low-power mobile multimedia processors based on its VideoCore architecture.
Alphamosaic was a spin-out from the consulting firm.
The founders Swann and Barlow developed the VC01, a novel low-power
processor for 2D video and images using a DSP architecture. They thought it would
be used for consumer devices. And it was. Samsung used the chip, and so did the first

Apple video iPod to handle video record and playback, image capture and processing,
audio capture and processing, graphics, games, and ringtones.
The second-generation chip, VC02, ran at 150 MHz (almost twice the speed of
the 85-MHz operation of the VC01), displayed video on quarter video graphics array
(QVGA) screens, and captured images up to 8 MP from image sensors. The new chip
also had an input for TV tuners and TV-out. There was more internal SRAM (10
Mbits versus 8 Mbits in the earlier part) and advanced image filters (Fig. 5.27).
The VC02 had features useful for video [18].
The VC02 also had dual 32-bit RISC processors and a dual-issue compiler. Like
all mobile devices of the time, the CPU(s) were fixed-point only. However, like
the predecessor single CPU part, the VideoCore II was a dual video DSP, with a
16-way parallel data path very long instruction word (VLIW) vector processor tightly
coupled (by shared registers) to 32-bit RISC scalar processors.
The photo shows the development board; the VC02 is the chip just above the
display.
In September 2004, Alphamosaic was acquired by Broadcom for $123 million,
forming its Mobile Multimedia group on the Cambridge Science Park site.

Fig. 5.27 Alphamosaic's Dr. Robert Swann shows off the VC02's development board

Broadcom launched its first new chip in the VideoCore line at 3GSM in February
2005. The chip was designed for mainstream phones but enabled advanced features
such as video, 3D gaming, and multimedia, previously associated with high-end
phones. Broadcom branded it the BCM2705.
Broadcom reduced the amount of memory, reduced the JPEG encode and decode
from 8 to 4 MP, and removed some peripheral support, including TV-out and USB.
The chip would enable phones to display video at 30 fps on a 2-inch color LCD and
capture 4 MP images. The chip was 100% programmable and had MPEG-4 encode
and decode.
The VideoCore processor was a 150 MHz dual-ALU, allowing the BCM2702 to
function as a coprocessor or as a stand-alone. The chip was manufactured in 130 nm
CMOS and packaged in a 281-pin thin profile fine-pitch ball grid array (TFBGA)
package (10.9 × 10.1 mm).
Swann said, “We showed some pretty good 3D games before, and we have got
better games now.”
Though pricing was a very relative thing, the chip was in the range of $10.
In 2012, hobbyists began exploiting a powerful little development kit known as
Raspberry Pi created in the UK by the Raspberry Pi Foundation in association with
Broadcom. As of May 2021, over forty million boards had been sold (Fig. 5.28).
The Raspberry Pi had a Broadcom SoC with a VideoCore IV 3D graphics core
and used a closed-source binary driver (called a blob) that communicated with the
hardware. The blob ran on the BCM2835 SoC’s vector processing unit (VPU) of the
Raspberry Pi. Open-source graphics drivers were a thin shim running on the ARM11
via a driver in the Linux kernel. But the lack of an open-source graphics driver and
documentation was a problem for Linux on Arm—it prevented users from fixing

Fig. 5.28 Raspberry Pi 4 Model B development board. Courtesy of Michael Henzler, Wikimedia Commons

Fig. 5.29 Doom III running on a Raspberry Pi 4. Courtesy of Hexus

driver bugs, adding features, and generally understanding what their hardware was
doing.
Then in February 2014, Broadcom announced it would give the VC4 to the
community. It released all the documentation for the graphics core and the complete source code of the graphics stack under a three-clause Berkeley Software Distribution (BSD)
license—anyone could use it.
The source release targeted the BCM21553 cellphone chip, but it was straightfor-
ward to port it to the BCM2835 on the Pi. That allowed access to the graphics core
without using the blob. As an incentive to do that work, the Raspberry Pi organization
offered a $10,000 prize for the first person to demonstrate to them that Quake III
could run successfully at a playable framerate on Raspberry Pi using those drivers
(Fig. 5.29).
In April 2014, only a month after the prize was offered, it was claimed by Simon
Hall, a longtime Pi hacker.
In 2022, Raspberry Pi 4 kits could be bought for as little as $25.

5.10 The Other IP Providers

Arm and Imagination Technologies were the most prominent suppliers of GPU IP
but were far from the only ones. As of 2022, there were four others: AMD, Digital
Media Professionals (DMP), Think Silicon, and VeriSilicon, plus the open GPU
organizations. The GPU IP suppliers serviced the other platform suppliers. Mobile
devices were the largest market, followed by automotive and then digital TV (DTV) and set-top box (STB) SoCs. Arm claimed in 2021 that it had an 80% market share in DTV with its Mali GPU.
The following are some brief stories of those IP suppliers. I have put them all in
this mobile section for convenience.
The problem with IP is that you cannot see it. You can see representations of it,
such as a block diagram, a register-transfer level (RTL) netlist, or an SoC with the IP
buried inside it, but the pure IP is just a bunch of ones and zeros on a disk somewhere.

5.10.1 AMD 2004

AMD is discussed throughout this book. However, it may not be evident that the
company has been an IP provider and a discrete and integrated GPU supplier.
ATI officially entered the IP market in 2003 with a project for Microsoft for the
Xbox 360 and, in 2004, with a deal with Qualcomm. Through various acquisitions
and partnerships, ATI was on a quest for dominance in all sectors of the graphics
market. Qualcomm would embed ATI’s Imageon graphics (discussed in Chapter
fifteen) in Qualcomm’s next-generation baseband processor. Qualcomm would use
ATI's graphics technology to compete against TI and Intel, which used Imagination Technologies' graphics.
The relationship went well, and to augment it and add more value, in May 2006,
ATI bought the renowned BitBoys and incorporated its newly developed 2D engine.
AMD acquired ATI, and when AMD got into financial difficulties in 2007 and
2008, it started looking for things to sell and get rid of to lower operating costs. In January 2009, AMD sold the Imageon group to Qualcomm, and AMD temporarily
exited the IP business.
The company re-entered the IP market in 2012 when it developed a custom chip
for the Sony PlayStation 4 and then licensed Sony to build subsequent versions.
AMD did the same thing with Microsoft on the Xbox One.
One of the biggest surprises was in 2019 when word leaked out that AMD had provided the GPU IP for Samsung's next-generation Exynos SoC for Samsung
smartphones. Samsung had started an internal GPU project several years earlier. After
several manager changes and other setbacks, the senior management of Samsung had
had enough and decided to go with AMD.

5.10.2 Digital Media Professionals Inc. (DMP Inc.) 2002

Founded in 2002 in Japan, Digital Media Professionals was well known for its VR
and multimedia systems work. In 2004, DMP decided to change direction from a
high-end chip supplier to an IP provider.
The company was best known for its power-efficient hardware-accelerated
graphics core, which it offered as IP and in LSI.

The company had been developing its proof-of-concept Ultray design based on the
physical model rendering and experimenting with algorithms that defied the limits
of miniaturization. The company planned to release Pica, its first IP core based on
Ultray architecture, in 2006.
Tatsuo Yamamoto, the president and CEO of DMP, said Ultray would allow
real-time photorealistic rendering with physically correct lighting and shadows,
such as soft shadow casting and position-dependent environmental mapping [19].
Furthermore, he pointed out that the physical model rendering achieved real-time and
high-quality images, significantly reduced the dependency on textures, and offered
excellent memory efficiency.
The architecture was very scalable and offered functionality from high-end appli-
cations such as VR and content creation to low-end, low-power applications such
as mobile devices. A pixel-level shader with components such as bidirectional
reflectance distribution function (BRDF) and Phong built into the hardware pipeline
achieved excellent quality and high-performance graphics with much fewer polygons
even in low-end embedded applications.
The Ultray supported the following features in the first-generation high-end DMP
product.
Hardware-accelerated shading:
Phong shading
Cook/Torrance shading
BRDF shading
Multi-layer reflection
Hardware-accelerated effects:
Texture mapping (bilinear, trilinear, programmable 4 × 4 tap filtering)
Bump mapping
Refraction mapping position-dependent cube mapping
Vector and boundary-edge anti-aliasing
Anti-aliased soft shadow casting
Hair generator
Glare and flare renderer
Gaseous object renderer (see image)
Polygon subdivision to lower processor bus bandwidth
Per-vertex subsurface scattering
Although the Ultray had many specifications in common with similar products, the
Ultray had unique functions. Notably, a hardware parametric engine did the gaseous
object rendering, not a shader, saving transistors, power consumption, and time. With
that unique feature, clouds, smoke, gas, and other fuzzy objects could be shaded and
rendered at an interactive rate.
Many techniques have been proposed to model gaseous objects, using numerical
simulations, volumetric functions, fractal, cellular automaton, and particle systems,
among several others. DMP looked at all those approaches, and to DMP, it appeared
the common denominator of the techniques could be defined, at a given level, as a
point set with attributes, or in other words, as a hyper-volume or a fiber bundle.

DMP’s Pica was a 3D/2D graphics IP core for embedded systems. DMP planned
to start licensing Pica in the fall of 2006. In addition, DMP offered an OEM chip
development service based on its chip development capabilities proven in the launch
of Ultray2000 in 2005.
In May 2009, the company released its SMAPH-F vector graphics IP core. DMP
said it had licensed the core to a major Japanese Tier-1 automotive supplier.
SMAPH-F was designed to accelerate graphical user interface (GUI) applications
for entry-level embedded products such as mobiles, TVs, digital cameras, graphics
meters, navigation, gaming, and office products. The company said the applications
would be offered at a low cost and low power consumption while achieving excellent
vector graphics performance. SMAPH-F was compliant with the Khronos Group’s
OpenVG 1.1 and offered acceleration of vector graphics contents, including Adobe
Flash Lite and SVG. SMAPH-F included DMP’s Gradient Extension hardware accel-
eration for gradient animations and additional procedural texturing such as a woody
pattern (Fig. 5.30).
The company said it believed SMAPH-F was friendly IP in terms of integration and meeting performance goals in complex SoCs. That was partly due to its
support of industry-standard Open Core Protocol (OCP) and Advanced eXtensible
Interface (AXI) interconnect and its design for optimum system performance with
DDR burst accesses [20].
In June 2010, at the E3 conference in LA, Nintendo surprised the industry by
introducing the 3DS, a handheld game machine with a glasses-free 3D stereo screen.
It provided a 400 × 240 image per eye, which looked very sharp. Nintendo stock
rose about 11% during the first two days of the conference (Fig. 5.31).
Just after E3, Nintendo announced it had chosen the Pica200 graphics technology
from DMP.
The Pica200 used DMP proprietary Maestro extensions for the 3D graphics. By
hardware implementation of complex shader functionality, those extensions allowed
high-performance rendering found on high-end products to be realized on mobile
devices with low power consumption requirements.
The company went public on the Tokyo stock exchange in June 2011.

Fig. 5.30 Rendering examples using only OpenVG features. Courtesy of DMP Inc.

Fig. 5.31 DMP CEO Tatsuo Yamamoto and his dog Momo at E3

In the spring of 2014, the company bought the sole rights to sell Cognivue’s
embedded computer vision IP Apex Image Cognition Processor cores to the Taiwan
and Japanese markets. That enabled customers to buy multiple components for an
SoC from DMP in Taiwan, where there are many semiconductor and SoC makers
in mobile, automotive, and consumer electronics.

5.10.3 Imagination Technologies 2002

Imagination Technologies is covered in several chapters and sections in this book.


Those sections were about GPUs for the PC and game consoles. However, Imagina-
tion was also an active supplier to the automotive industry. The company provided
PowerVR GPU IP to Hitachi, which merged into Renesas for the SuperH SoCs.
Those SoCs were used for car entertainment systems and put Imagination on the
path to becoming a significant supplier to the automotive industry.
Imagination Technologies steadily increased its sales of IP and signed up Sunplus
Technology Company in 2004. Sunplus was building application processors for
automotive applications with Imagination’s PowerVR MBX.

By 2022, sales to the automotive industry accounted for a significant amount of


Imagination’s revenue. The company achieved safety certification for its parts and
was powering dashboard displays, entertainment systems, and other systems found
in a modern vehicle.

5.10.4 Think Silicon (2007)

Think Silicon was founded in Patras, Greece, in 2007 by George Sidropoulos and Dr.
Iakovos Stamoulis. Dr. Stamoulis had been working at the ray tracing pioneer hard-
ware company Advanced Rendering Technologies in Cambridge, U.K., and wanted
to come home. In 2005, his old friend Sidropoulos convinced him to join Atmel IC
Design Team. They worked together in the MMC (multimedia and communications)
group of Atmel in Patras until Atmel closed the operation in late 2006. The two and
a few others from Atmel started Think Silicon (TSi). The Athens-based Metavallon
VC group backed Think Silicon. The company planned a wide range of graphics and
display processors for the Internet of things (IoT), wearables, and broader display
device markets (Fig. 5.32).
Think Silicon developed the tiny GPU design after they left Atmel—or more
accurately after Atmel left them. In 2009, Think Silicon released its first graphics
accelerator, named the Think2D, and a display controller, the ThinkLCD. In 2016,
Microchip Technology bought Atmel, and TSi licensed the GPU and display design
to Microchip and several other firms.
One of the first challenges the group took on was to provide usable graphics
in an early IoT system from a phone vendor with just 128 KB of memory, which seemed to be impossible. However, Stamoulis had used Commodore Amigas and Atari STs, where it was possible in systems with that amount of memory and main CPUs running at just 8 MHz (Fig. 5.33).

Fig. 5.32 Think silicon founders George Sidropoulos and Iakovos Stamoulis. Courtesy of Think
Silicon

Fig. 5.33 Think silicon’s whiteboard from 2015. Courtesy of Think Silicon

“The challenge was accepted,” said Stamoulis. “And the designed IP used many old tech-
niques almost forgotten in the graphics world, combined with many modern techniques, and
the result was a graphics processor unit that was ideal for the emerging IoT and Smartwatch
market.”

In 2011, the company released a vector graphics GPU, Think VG, which it licensed
to Dialog Semiconductor, and in 2016, it introduced its multicore and multithreaded
Nema GPU design (Nήμα in Greek). Nema became a leading GPU for low-power
SoCs from Ambiq, ST Microelectronics, and other Tier-1 companies, including chip-
scale package SoC suppliers. Soon several devices in the market were using Nema.
Ambiq's founders developed the Subthreshold Power Optimized Technology (SPOT) processor at the University of Michigan and founded the company in 2010.
TSi supplied the GPU IP design to Ambiq, who built the tiny SoCs that went into
millions of fitness and smartwatches, smart thermometers, and home devices (i.e.,
end-point devices).
In 2021, Ambiq revealed its newest Apollo4 SoC family incorporating Think
Silicon’s Nema pico GPU and Nema display controller IP, shown in Fig. 5.34.
The technology was a lean version of the Nema ISA; the Nema pico XS microarchitecture combined VLIW, low-level vector processing, and hardware-level support for multithreading in a power-efficient way. The Nema interface was not compatible with standard or open APIs such as OpenGL ES or Vulkan. It was a lightweight proprietary interface designed for embedded devices.
The tiny but mighty GPU boasted the following elements:
Hardware elements:
Texture mapping unit
Programmable VLIW instruction set shader engine

Fig. 5.34 Think silicon’s Nema pico GPU

Primitives Rasterizer
Command list-based DMAs to minimize CPU overhead
Display controller (optional)
Image transformation:
Texture mapping
Point sampling
Bilinear filtering
Blit support
Rotation any angle
Mirroring
Stretch (independently on x and y-axis)
Source and destination color keying
Format conversions on the fly
Blending capabilities:
Fully programmable alpha blending modes (source and destination)
Source/destination color keying
Drawing primitives:
Pixel/line drawing
Filled rectangles
Triangles (Gouraud Shaded)
Quadrilateral
Text rendering supports:
Bitmap anti-aliased (A1/A2/A4/A8)
Font Kerning
Unicode (UTF8)
The Nema pico XL/XS series was an extended version with a display controller
designed for SoC. The company specifically targeted the market for high-end wear-
ables and embedded IoT display devices. The popular Fitbit used a TSi GPU and the
display controller.
The Nema product family had four major segments that could be organized into
seven configurations.
Nema pico GPU Family:

• Nema pico XS 2D GPU (1 core) (1 configuration)


• Nema pico XL 1000/2000/4000—2.5D GPU, three cores (three configurations)
• Nema pico XL VG-lite 1000/2000/4000—2.5D, vector graphics GPU (3 config-
urations)
• Nema dc display controller (four configurations)
• Nema dc 100/200/300/400—display controller with one- to four-layer configurations
The scalable multicore GPU IP platform could run on what is known as a bare-
metal real-time operating system (RTOS), which only required a small amount of
on-chip memory and system resources. That made it ideal for memory and power-
limited SoCs. TSi designed the Nema pico XL/XS series to run on low power by
minimizing memory and display access without sacrificing battery life, graphics
quality, or performance (Fig. 5.35).
TSi conducted a small financing round in mid-2019 and was actively seeking a lead investor for a Series A round in the first half of 2020. Coincidentally, Applied Materials was looking for a low-power GPU design. So instead of raising a Series A, in May 2020, Santa Clara, California-based Applied Materials bought TSi in what was one of Greece's most significant acquisitions in the deep-tech sector. Sources said the
price exceeded 20 million euros.
In 2021, and now as part of Applied Materials, the company introduced Neox,
shown in Fig. 5.36, a new scalable and extensible processor with graphics and AI
extensions based on RISC-V.
The company first revealed its plans to develop a GPU based on RISC-V at the
RISC-V Summit in December 2019 [21].
The Neox borrowed some of the graphic features from Nema, which gave it its
versatility, as illustrated in the block diagram in Fig. 5.37.
The Neox architecture included AI-specific ISA extensions, SIMD vector operations on variable-length data types down to 8-bit, and, optionally, graphics ISA extensions/coprocessors such as a unified shader architecture, tile-based rendering,

Fig. 5.35 The think silicon team. Courtesy of Think Silicon



Fig. 5.36 Comparison of think silicon’s Nema and Neox GPUs. Courtesy of Think Silicon

Fig. 5.37 System diagram using the think silicon IP blocks

color/vertex, vector support, and dedicated hardware modules, such as a rasterizer,


texture unit, tile management unit, and texture caches, shown in Fig. 5.37. Neox
was tile-based without deferred rendering. Cores communicated via an interprocess
communication (IPC) side channel for task work scheduling and cache/scratchpads
for data sharing. It offered a dedicated interface that allowed SoC architects
to augment the instruction set with user-defined instructions to enable product
differentiations, which gave them the ability to create custom and unique designs.
The Neox was a RISC-V ISA coprocessor array suitable for AI and graphics or imaging workloads. It was a scalable multicore and multithreaded design with one to 64 cores, and the company said it suited everything from the embedded market to high-performance solutions. TSi also developed an ecosystem for the design that included compilers, other software tools, and TensorFlow libraries. The design could
be implemented in just 450 k gates and draw 5 mW and was extensible through user-
defined instructions. On Neox, T&L was done on the unified RISC-V shaders without

Fig. 5.38 An example of an SoC with Neox IP cores

the need for the host processor. Each core had an execution shader core, a texture-
mapping unit (TMU), and a ROP unit. Figure 5.38 shows an example of an SoC implementation with Neox cores.
Multi-threading maximizes efficiency in systems with long memory latency. It was
theoretically possible to achieve 100% compute use in memory-intensive applica-
tions. A thread-scheduler kept thread status and issued commands from a ready-to-run
pool of threads.
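A rough rule of thumb (my sketch, not TSi's published material) shows why a large pool of resident threads hides memory latency: if each thread does C cycles of compute between memory requests and a request takes L cycles to return, keeping the ALUs busy requires roughly

N_threads ≳ 1 + L / C

so, for example, a 200-cycle memory latency with 10 compute cycles between fetches calls for about 21 resident threads; the scheduler simply rotates through whichever of them are ready to run.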
Table 5.7 shows a comparison of the Nema and Neox processors.
The Neox had a lightweight pipeline. That prevented it from having hazards or
complex interlocks; it also avoided branch prediction and did not have any feedback
paths. As a result, the pipeline was used as much as possible with no lost cycles
waiting for data, reducing power consumption. Nema XL and Neox supported ROP
operations, and there was one unit per core. In Nema XS, it was emulated by shader
code.

Table 5.7 A comparison of features of Nema and Neox

                          Nema pico series XS and XL             Neox series AI & Gfx
                          (32-bit CPU, 2D/2.5D GPU)              (32/64-bit CPU, AI accelerator/3D GPU)
                          Nema XL                                Neox Gfx                               Neox AI DLA
Application engine/compute 2.5D GPU                              3D GPU/AI DLA                          AI DLA
Cores                     1–4                                    1–64                                   4–64
Clock range (at 28 nm)    500 MHz                                500 MHz                                500 MHz
Area (k gates)            350 k/core                             450 k + 26 kB mem/core                 800 k + 80 kB (4 core)
Performance               1 pixel/clock/cycle/core               1 pixel/clock/cycle/core               Na
ISA                       Nema VLIW                              RV64 IMFC + P ext + custom Gfx ext     RV64 IMFC + P ext
Shader processor          Nema fragment processor                Programmable GCC/LLVM RISC-V RV64      Programmable GCC/LLVM RISC-V RV64
Frame buffer compression  Yes                                    Yes                                    Na
Texture compression       Yes                                    Yes                                    Na
Display resolution        Up to 1024 × 756, 4 K video overlay    Up to 1024 × 756, 4 K video overlay    Na
Memory system             AHB 32-bit, AXI4 64/128-bit            AXI4 64/128/256-bit                    AXI4 64/128/256-bit
AI framework              Na                                     Neox AI SDK, TensorFlow lite, C/C++    Neox AI SDK, TensorFlow lite, C/C++
Graphics framework        NemaGFX + SDK, GUI builder, PixPresso  NemaGFX + SDK, GUI builder, PixPresso  Na
Power                     2–8 mW                                 ?                                      ?

TSi positioned the Neox and Nema processors as well suited for various devices
and applications, as depicted in Fig. 5.39.
The company also offered 7 nm (TSMC) models, which scaled the clock to
700 MHz and reduced the area proportionally.

5.10.5 VeriSilicon

GiQuila was founded in 2004 by former SGI, ATI, and Nvidia GPU designer Mike
Cai to design and develop a GPU for mobile devices. In 2007, GiQuila changed its
name to Vivante and changed its direction from a chip builder and seller to focus on

Fig. 5.39 Think silicon’s application and device range. Courtesy of Think Silicon

designing and licensing embedded graphics processing unit designs. That same year
Wei-Jin Dai left Cadence Design and took over as president and CEO of Vivante.
By 2009, over 15 companies used Vivante GPU IP in twenty embedded designs
[22].
In 2010, Vivante demonstrated a low-power multicore GPU called Scalarmor-
phic that exceeded 1 GHz [23]. The company said its multicore GPUs had been
proven in multiple tier-one SoC vendor’s products and would drive next-generation
game consoles, tablets, smartphones, automotive displays, and home entertainment
applications.
Vivante’s multicore GPUs were multithreaded extensions of the OpenGL ES
single-core GC series architecture first launched in 2007. The multicore GPUs
were capable of more than 200 M triangles per second on industry-standard GPU
benchmark polygon throughput tests.
In August 2013, Vivante launched its highly granular Vega series IP GPUs for
mobile devices and included Vega 1X, 2X, 4X, and 8X, based on target performance
and market requirements.
Vivante described the Vega as an ultra-threaded GPU, with each GPU core able
to handle up to 256 threads. It supported switching between threads in a round-robin
mode unless there were dependencies for a thread. If that was the case, the thread
got skipped until the dependency cleared.
Many threads in a GPU, even one with 16 shader cores, could be consumed in
a few clock cycles. The ultra-shader unit (see block diagram in Fig. 5.40) could
dynamically balance the load, dispatching tasks to whichever shader was free. The pool of 256 possible threads could be filled from several available tasks.
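The following is a minimal sketch, in C++, of the round-robin, dependency-skipping thread issue described above. It is illustrative only—the thread count matches Vivante's stated 256 threads per core, but the structure and names are my own, not Vivante's hardware design.

```cpp
#include <array>
#include <cstdint>

// Each slot tracks one resident thread; a pending dependency (e.g., an
// outstanding texture or memory fetch) keeps the thread from being issued.
struct Thread {
    bool active = false;              // slot holds a live thread
    bool dependencyPending = false;   // waiting on a fetch result
    uint32_t pc = 0;                  // next instruction for this thread
};

class RoundRobinScheduler {
public:
    static constexpr int kThreads = 256;   // threads resident per shader core

    // Returns the index of the next thread ready to issue, or -1 if every
    // live thread is stalled on a dependency this cycle.
    int pickNext() {
        for (int i = 0; i < kThreads; ++i) {
            int idx = (cursor_ + i) % kThreads;
            const Thread& t = pool_[idx];
            if (t.active && !t.dependencyPending) {
                cursor_ = (idx + 1) % kThreads;   // resume the scan after this thread
                return idx;
            }
        }
        return -1;   // all live threads blocked: the core would idle this cycle
    }

    std::array<Thread, kThreads> pool_{};

private:
    int cursor_ = 0;   // current round-robin position
};
```

In hardware, the equivalent logic is a rotation/priority network rather than a loop, but the policy is the same: skip any thread whose dependency has not cleared and come back to it on a later pass.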

Fig. 5.40 Vivante’s Vega IP GPU

Each shader core had five units, two 64-bit wide adders (ADD), two 64-bit wide
multipliers (MUL), and one transcendental. It was like AMD/ATI’s VLIW5 archi-
tecture with four of the pipes somewhat limited. Each 64-bit unit could operate as 1 × 64, 2 × 32, or 4 × 16, making a core capable of 5–17 ops per cycle (five units issuing one wide operation each at minimum, or the four 64-bit units issuing four 16-bit operations each plus the transcendental at maximum).
The dash-outlined blocks (3D pipeline, vector graphics pipeline, 2D pipeline)
could be segmented or stacked into various configurations. A GPU could have more
vector graphics engines or more ultra-threaded shaders, up to 32 cores. The number
of graphics front-ends depended on how many shader cores were connected to each
graphics core; the counts could be confusing.
Vega’s fine-grain approach was fundamentally different from that of other
suppliers. Other GPU suppliers tended to balance pixel, shader, and other resources
in one of two ways: either a simple architecture, as in older GPUs such as Nvidia's Tegra 4 (partitioned in advance with a fixed number of pixel and vertex shaders), or unified
GPUs like Nvidia’s Kepler, which could be programmed for multiple tasks. Vivante’s
GPUs had a unified shader architecture and were more granular. That concept led
the company to its virtualized GPU.
The feature set of the Vega was as follows:
• GPU speeds over 1 GHz while reducing overall power consumption
• Dynamic, reconfigurable shaders that supported double, single, and half-precision formats
• Patented math units that worked in the logarithmic space to minimize area, power,
and bandwidth
• On-chip GPU processing of HDR with Stream-Out geometry shaders and multi-
way pipelines
• Immediate hidden surface removal reduced render processing by an average of 30%
• Power savings up to 65% over previously certified cores through Vega dynamic voltage and frequency scaling (DVFS) and incremental low-power architecture enhancements
• Industry's smallest graphics driver memory footprint, which enabled graphics and compute in systems from DDR-cost-constrained designs to DDR-less MCUs/MPUs
• Vivante’s proprietary Vega lossless compression, which reduced on-chip band-
width by an average of 3.2:1 and streamlined the graphics subsystem including
the GPU, composition processor (CPC), interconnect, and display and memory
subsystems
• Compatible with OpenGL ES 3.0, OpenGL, OpenCL, OpenVG, Microsoft
DirectX 11 (9_3), WebGL, and Renderscript compute/filter script
The GPU could be configured with up to 32 SIMD Vec-4 shaders and offered:
• Balanced performance/bandwidth
• Tile rasterization
• Memory-localized scan conversion
• Many caches
• Fast depth culling
• Fast clear
• Texture compression
The GPU was designed for 28 nm high-performance mobile (HPM) silicon implementations. Two shader units (which the company claimed were smaller than a competitor's single shader) fit in the same silicon area while delivering more math operations per cycle—256 independent threads per shader unit.
The GPU offered high-quality 4x multisampling anti-aliasing (MSAA). It had
bandwidth characteristics comparable to tiling architectures at low to medium data
rates with dramatically lower bandwidth with complex data. The company claimed
design wins in six of the top 10 mobile OEMs.
Shortly after its announcement, the company showed its Vega GC7000 and
GC8000 GPUs. They were running photorealistic rendering with geometry and
tessellation shaders. Then, the company announced its virtualized GPU, used in
automotive systems [24]. Vivante was the first GPU IP company to offer a virtual-
ized design, encouraged to do so by its customer Freescale, which had been quite
successful in the automotive market. NXP acquired Freescale in December 2015.

GPU virtualization refers to technologies that allow the use of a GPU to accel-
erate graphics or GPGPU applications running on a virtual machine. GPU
virtualization is used in various applications such as desktop virtualization,
cloud gaming, and computational science.

A few years earlier, in 2001, Dr. Wei-Ming Dai founded VeriSilicon to supply
IP-centric, platform-based custom silicon solutions and end-to-end semiconductor
turnkey services. The company began its IP portfolio with ASIC standard cell libraries
and other foundry foundation IPs. Also added were larger IP blocks such as A/D
converters, USB2.0, and peripherals.
In 2006, the company expanded its IP portfolio by buying LSI’s ZSP unit, offering
DSP and associated technology. Based on ZSP, the company developed Dolby and
DTS-certified HD audio processes, voice quality enhancement, and multimode and
multiband wireless baseband platforms. VeriSilicon established a licensing agree-
ment with Google and became the exclusive supplier of the Hantro G1 multi-format
video encoder and decoder supporting both H.264 and VP8. Later, VeriSilicon and
Google codeveloped the Hantro G2 multi-format video decoder IP to support ultra-
HD 4 K video decoding for integrated high-efficiency video coding (HEVC) and
VP9. (Google bought On2 Technologies for $125 million in 2010. In 2007, On2
Technologies bought Hantro Products, and as a result, Hantro became part of the
Google portfolio.)
Without really planning it, Vivante and VeriSilicon were headed in the same
direction, with pretty much the same goal in mind: keep adding IP cores to become
a one-stop-shopping point for companies designing SoCs. They were not the only
companies with that idea.
So, both companies were, independently and unknown to each other, on the hunt
for innovative technology to fill out their portfolios—get big or die.
The irony was that Wei-Ming Dai is the genius big brother of Wei-Jin Dai, yet the two never discussed the obvious. Two board members and investors in VeriSilicon suggested the companies merge.
Click, click, and click—it was so obvious everyone slapped their foreheads. The
addition of GPU and other cores from Vivante would provide VeriSilicon just what
it needed for its growing IP portfolio, the stickiness of the SoC design service. The
deal would also give VeriSilicon more exposure to the larger tier-one customer base
and open opportunities in the automotive market that Vivante was serving and the
IoT market that both VeriSilicon and Vivante wanted to enter.
In late 2014, VeriSilicon, described as a SiPaaS (Silicon Platform as a Service)
company, filed an IPO in the U.S. Nasdaq under the symbol VERI.
Then, in October 2015, VeriSilicon announced it would buy Vivante in an all-
stock deal, and the result would be a new, larger company approaching $200 million
in sales [25].
VeriSilicon had two operations, design and turnkey services and IP cores. The
company split itself into two business units, IP and services. In addition to being

an officer and executive VP of the company, Wei-Jin Dai took over the IP business
unit. Another senior executive took over the services business unit, reporting to
Dr. Wayne Wei-Ming Dai, the board chairman, and the president and CEO. Mike
Cai was the CTO and went on to patent anti-aliasing techniques that he assigned to
the company.
That was one of those genuinely synergistic deals that do not come around too
often. In addition to mixed-signal and radio frequency (RF) IP, DSPs, and video cores, VeriSilicon supplied custom silicon solutions for microelectromechan-
ical system (MEMS) sensors found in over a billion devices, including tier-one
smartphones and tablets. VeriSilicon kept the Vivante GPU brand. Also, in 2015
Vivante introduced a partitioned GPU design well suited for virtualization, a vision
processor, and automotive safety applications.
After joining VeriSilicon, Vivante introduced the Arcturus GC8000 series based
on the Vega architecture. It was compatible with OpenGL ES 3.2, OpenVG 1.1,
OpenVX 1.1, OpenGL 4.0, Vulkan 1.0, and OpenCL 2.0 (Fig. 5.41).
The GC8000 added early culling, improved hardware virtualization, and lossless
data compression features. There were many improvements to the old Vega architec-
ture, with the new code name Arcturus. The GC8000 offered flexible configurations
and customizable RTL for specific applications. VeriSilicon began shipping GC8000
RTL in 2Q16.
The GC8000 had doubled the triangle throughput of the GC7000. Vivante accom-
plished that by enhancing the fixed function pipeline, including the primitive assem-
bler and setup. That added die area, but with RTL enhancements, it yielded a 30%
performance-per-area increase for triangle and pixel throughput. An early culling
stage discarded unneeded geometry before the processing stage, which helped
increase throughput. The GC8000 could do dynamic load balancing and assign
threads to less busy clusters like the original Vega architecture.

Fig. 5.41 VeriSilicon’s GPU could scale from IoT and wearables to AI training systems. Courtesy
of VeriSilicon

The company could make a 2× GFLOPS version for low-end products. That option doubled the compute core resources, which VeriSilicon called hyper-pipe. However, it only increased GFLOPS performance by 15%.
The GC8000 included Vivante’s hardware virtualization, using the memory
management unit (MMU) and hypervisor design that the company called
GraphiVisor.

5.11 Nvidia’s Ampere (May 2020)

Nvidia had pushed the envelope on graphics chips since it integrated the geometry processor and pixel shader into one chip and called it a GPU in 1999. The potential for GPUs then vastly expanded in 2003 when a branch of development split off from graphics applications, taking advantage of the GPU's parallel processing for pure computation using Brook, a stream programming extension of C developed at Stanford. That work morphed into Nvidia's famous CUDA and started the era of
GPU compute, also called a general-purpose graphics processing unit (GPGPU) and
accelerated computing. Brook freed the GPU from OpenGL, which until then had
been used for programming a GPU.
Nvidia—Getting to Ampere.
Over the next 17 years, Nvidia continued to develop new innovative and powerful
GPUs, targeted primarily at computer graphics and gaming, but with capabilities that
could be exploited for GPU computing.
The Pascal generation brought FP16 and INT8 support (the latter found in the Pascal P40 and P4), as well as NVLink technology, which Nvidia deployed as a hybrid cube mesh topology for multi-GPU systems; refer to Fig. 5.42 (Nvidia GPU growth in transistors and die size over time). The GP100, the first 100-series GPU based on Pascal, was explicitly a data center part with 2xFP16 and HBM2 and was announced at GTC in 2016.
Pascal was followed by the Volta GV100 released in March 2018 and introduced
a new SM (streaming multiprocessor) and first-generation Tensor Cores for AI. The
GPU had 5120 shaders (80 SMs), 320 TMUs, 128 ROPS, and 640 tensor cores.
Volta was built on a 12 nm process and had 21,100 million transistors and a die size of 815 mm2. It was implemented in an Nvidia Quadro AIB with 32 GB of HBM2 on a 4096-bit bus with 864 GB/s of bandwidth and was capable of 16.6 (FP32) TFLOPS.
The massive AIB sold for $9000 in 2018.
TU102 (Turing) came out after GV100 (Volta), but TU102 was primarily a
GeForce GPU while GV100 was a data center only chip.
Volta introduced Tensor Core technology, a breakthrough feature that dramatically
accelerated matrix math operations, delivering 20× more compute than the previous generation.
The Volta generation also brought the first generation of NVSwitch devices and was
a key enabler of Nvidia’s 16-GPU DGX-2 AI servers.
Turing added INT8 support to tensor core technology, bringing great speedups to
AI inference applications and included the introduction of the company’s T4 Tensor
Core GPU for AI inference and visual applications, running in just 70 W.

Fig. 5.42 Nvidia GPU growth in transistors and die size over time

Nvidia released the Ampere GA100 in May 2020. It was the biggest GPU ever
made in terms of transistors and area—54 billion transistors on an 826 mm2 die.
Ampere added floating point (FP64) to tensor core technology to solve HPC
challenges. It introduced multi-instance GPU (MIG) technology to partition a single
GPU into multiple GPUs (7 for A100, 4 for A30). Nvidia also introduced TensorFloat-
32 (TF32), an AI-optimized FP32 precision which today is the default FP32 precision
for both TensorFlow and PyTorch frameworks.
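TF32 keeps FP32's 8-bit exponent (so FP32's dynamic range) but only 10 explicit mantissa bits (FP16's precision). A minimal sketch of how that truncated precision can be emulated on a CPU—my illustration, not Nvidia's implementation, and it ignores NaN/infinity corner cases—is to round away the 13 low mantissa bits of an IEEE-754 float:

```cpp
#include <cstdint>
#include <cstring>

// Round an FP32 value to TF32 precision: the 8 exponent bits are kept as-is,
// and the 23-bit mantissa is rounded to its top 10 bits.
float toTF32(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));   // reinterpret the float's bit pattern
    bits += 0x00001000u;                    // round to nearest at bit 13
    bits &= 0xFFFFE000u;                    // clear the 13 low mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}
```

On the A100, the tensor cores apply this conversion to their matrix inputs while still accumulating results in full FP32, which is why frameworks can enable it by default without code changes.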
Ampere had an improved streaming multiprocessor (SM) design that increased
power efficiency and CUDA compute capability. It marked Nvidia’s full commitment
to the data center and the use of GPUs as compute accelerators. One could argue that Fermi in 2009 or Kepler in 2012 were bigger steps (beyond the original G80 CUDA introduction in 2006) in providing features for compute workloads. Wherever one
puts the marker, with the added GPU-compute features and subsequent larger caches,
the transistor count (and die size) of Nvidia’s GPUs began to increase.
Gamer sites had been publishing leaks and speculation about Ampere for months, expecting it to be the next generation of ray tracing gaming GPUs. They were surprised by what they discovered: the GA100 Ampere was no gamer chip; it was a killer GPU-compute chip with special emphasis and features for AI training and server-based inferencing, in addition to HPC. It was a supercomputer
in a chip. There would be GA 10x chips based on the Ampere architecture targeting

Table 5.8 Nvidia's Ampere A100 specifications
Transistor count 54 billion
Die size 826 mm2
FP64 CUDA cores 3456
FP32 CUDA cores 6912**
Tensor cores 432
SMs 108
Performance
FP64 9.7 TFLOPS
FP64 tensor core 19.5 TFLOPS
FP32 19.5 TFLOPS
TF32 tensor core 156/312 TFLOPS*
BFLOAT16 tensor core 312/624 TFLOPS*
FP16 tensor core 312/624 TFLOPS*
INT8 tensor core 624 TOPS/1248 TOPS*
INT4 tensor core 1248 TOPS/2496 TOPS*
GPU memory 40 GB
Memory bandwidth 1.6 TB/s
Interconnect NVLink 600 GB/s
PCIe Gen4 64 GB/s
Form factor 4/8 SXM GPUs in HGX A100
Max power 400 W (SXM)
*With structured sparsity enabled  **3456 FP64 cores plus a separate 6912 FP32 cores
Courtesy of Nvidia

the gaming/graphics market as well. Table 5.8 is a summary of the chip's most salient
features.
One big feature of the chip, alluded to in the above table, was sparsity. Fine-grained structured sparsity was a method to double compute throughput for deep neural networks. It is important, and it needs a lot of transistors.
Sparsity was essential and possible in deep learning because individual weights
evolve during the learning process in a neural net. Only a subset of weights acquired a
meaningful purpose in determining the learned output by the end of network training;
the remaining weights were no longer needed.
A fine-grained sparsity structure imposes a constraint on the pattern that makes it
easier and more efficient for hardware to align inputs. Deep learning networks could
adapt weights during the training process based on training feedback. Building on
that, Nvidia figured out that, in general, the structure constraint did not affect the accu-
racy of the trained network for inferencing. That enabled inferencing acceleration
with sparsity—something of a breakthrough.
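Nvidia's published description of this fine-grained structure is a 2:4 pattern: in every group of four consecutive weights, two must be zero, so the hardware stores only the two survivors plus small indices and skips the zeroed multiplications. The sketch below (illustrative pruning code, not Nvidia's tooling) shows how a trained weight array can be forced into that pattern by keeping the two largest-magnitude values in each group of four:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Prune a weight vector to the 2:4 structured-sparsity pattern: within each
// group of four weights, zero the two smallest-magnitude values.
void pruneTwoOfFour(std::vector<float>& weights) {
    for (size_t g = 0; g + 4 <= weights.size(); g += 4) {
        int idx[4] = {0, 1, 2, 3};
        // Order the four positions in this group from largest to smallest magnitude.
        std::sort(idx, idx + 4, [&](int a, int b) {
            return std::fabs(weights[g + a]) > std::fabs(weights[g + b]);
        });
        // Keep the two largest; zero the two smallest.
        weights[g + idx[2]] = 0.0f;
        weights[g + idx[3]] = 0.0f;
    }
}
```

In Nvidia's published recipe for inference, the pruned network is then fine-tuned so the surviving weights compensate, which is how accuracy is retained.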

For training acceleration, sparsity must be introduced early in the process to offer
a performance benefit, and methodologies for training acceleration without accuracy
loss are an active research area.
The headline numbers for Ampere were 3456 FP64 CUDA cores (32 per SM) plus 432 tensor (AI) cores. The 3456 FP64 cores and a separate block of 6912 FP32 cores made 10,368 cores in total; a further separate set of 6912 INT32 cores brought the count to 17,280 cores, plus the 432 tensor cores (Fig. 5.43).
The A100 had 40 GB of high-speed HBM2 memory delivering 1555 GB/s of
memory bandwidth—a 73% increase over Tesla V100. The chip could drive six
banks of HBM2 and had five such banks in its first release.
A third-generation tensor core also boosted FP16 throughput 2.5x over the V100
GPU. It added integer types for deep learning (DL) inferencing that could deliver a
further doubling of throughput with the sparsity feature. A100’s tensor core added
new support for TensorFloat-32 (TF32) for an out-of-the-box acceleration of FP32
networks in DL frameworks, at a rate 10x faster than FP32 on V100. A100 also
brought tensor cores to HPC, adding IEEE-compliant FP64 support at a rate 2.5x
faster than V100 FP64.
The GA100 was built on TSMC’s 7 nm fab, and the die was 826 mm2 (Fig. 5.44).
There were 12 Nvidia NVLinks per GPU, which offered up to 600 GB/s of inter-
GPU bandwidth (300 GB/s per direction). The GPU was accessible through six
Nvidia NVSwitches, which offered 4.8 TB/s of bidirectional bandwidth.
In addition to being big, and hungry for data, it was also hungry for watts and
would like 400, please. That clearly put it out of the realm of a gaming chip, but not
a supercomputer or data center, and given all the processors and memory it had, it
was pretty efficient.
A family of Amperes.
As with all Nvidia’s designs, the company was able to create several smaller
versions of the A100 GPU, as illustrated in Table 5.9.
As the table indicates, the model number sequence does not follow a sequential
calendar introduction schedule.

5.11.1 A Supercomputer

Nvidia took eight A100s and a couple of AMD Epyc 7742 64-core, 128-thread CPUs,
along with 1 TB of RAM, to create the DGX A100 supercomputer. The DGX A100
offered up to 10 POPS of INT8 performance, 5 PFLOPS of FP16, 2.5 PFLOPS of
TF32, and 156 TFLOPS of FP64 compute performance. To put that in perspective,
the previous-generation Volta-powered DGX-2 with 16 GPUs offered two PFLOPS
of mixed-precision performance (Fig. 5.45).
Within the DGX A100 were eight single-port Mellanox Connect-X6 AIBs
for clustering (with 200 GB/s of peak throughput), a dual-port ConnectX-6 for

Fig. 5.43 The Nvidia GA100 streaming multiprocessor

data/storage networking, and a Mellanox ConnectX-6 VPI HDR InfiniBand/200


GigE HCA.
DGX A100 systems had been shipping, with the first multi-system order going
to the Department of Energy (DOE) Argonne National Laboratory. The DOE would
use the new DGX A100 cluster to better understand and fight COVID-19, among

Fig. 5.44 Nvidia’s A100 Ampere chip on a circuit board

Table 5.9 Nvidia ampere GPUs


Model GA100 GA102 GA103 GA104 GA106 GA107
Introduced May Sep 2020 Jan 2022 Sep 2020 Jan 2021 Apr 2021
2020
Architecture 7 nm 8 nm 8 nm 8 nm 8 nm 8 nm
Ampere Ampere Ampere Ampere Ampere Ampere
Streaming 128 84 58 48 30 24
multiprocessors
CUDA cores 8192 10,752 7424 6144 3840 3072
TMUs 512 336 232 192 120 96
ROPs 192 112 96 96 48 48
RT cores – 84 58 48 30 24
Tensor cores 512 336 232 192 120 96
Transistors 54.2B 28.3B ? 17.4B 12.0B ?
Die size (mm2 ) 826 628 ~ 496 392 276 ?
Memory bus 6144-bit 384-bit 256-bit 256-bit 192-bit 128-bit
Max Mem. Spec 80 GB 24 GB 16 GB 16 GB 12 GB 4 GB
HBM2e GDDR6X GDDR6 GDDR6X GDDR6 GDDR6
AIBs A100 RTX3090 RTX3080 RTX3070 RTX3060 RTX3050 Ti
Tensor Ti Ti Ti Laptop and Laptop
core & Laptop RTX 3080
RTX 3060 Laptop
Ti

Fig. 5.45 Nvidia’s DGX A100 supercomputer. Courtesy of Nvidia

other research efforts powered by high-performance computing. Nvidia DGX A100


systems started at $199,000.00.
Nvidia, which had been just getting traction with its RTX series, was running a bit
of a risk with gamers thinking the A100 was introducing a new family of GPUs with
a gaming version just around the corner. Nvidia had done that before—introducing
a high-end part and then developing a consumer version, and one could expect them
to do it again, but the new gaming part could take a while.
And then, there is the issue of the yield of a 54 billion transistor device—how
many processors would Nvidia be able to make? That question calls for a bit of
perspective: even if Nvidia got only a 50% yield from a wafer, the cost of those chips relative to their price (and performance) would be perfectly acceptable. And who knows what it could do with those marginal parts that
fell out of the bin?

5.12 Imagination Technologies' Ray Tracing IP (2021)

In November 2021, Imagination Technologies introduced its latest ray tracing IP, the
Imagination CXT, for its flagship B-Series GPU IP. The announcement marked the
debut of Imagination’s PowerVR Photon ray tracing architecture.
Photon, said Imagination, was the industry’s most advanced ray tracing archi-
tecture, bringing desktop-quality visuals to mobile and embedded applications. The
biggest news was that it had already been licensed for multiple markets.
The Photon architecture represented over a decade of development by Imagina-
tion in making ray tracing workable in low-power-enabled devices. The company

intended to deliver a significant advance in the visual possibilities for smartphones,


tablets, laptops, and automotive solutions.
Imagination Technologies had been firing rays in computers since 1994 when
it first introduced deferred rendering. One must do a visibility test, the same as
ray tracing (but without bounces and material considerations), to get it right. In
terms of ray tracing as it was generally understood (physically accurate renderings),
Imagination had been in the ray tracing business since 2010 and demonstrated real-
time ray tracing in 2014 [26] with a simulator and then 2016 showed hardware. The
company defined six levels of ray tracing, which it used to chart its development
path.
The six levels of ray tracing acceleration
1. Legacy solutions
2. Software on traditional GPUs
3. Ray/box and ray/triangle testers
4. Bounding volume hierarchy (BVH) processing in hardware
5. BVH processing with coherency sort in hardware
6. Coherent BVH processing with BVH hardware builder.
The Photon architecture inside Imagination CXT was at Level 4 of that scale, making it one of the most advanced architectures available. It enhanced ray
tracing performance and efficiency to deliver a desktop-quality experience for mobile
gamers and developers.
The Imagination CXT-48–1536 RT3 featured the ray acceleration cluster (RAC), a
new low-power, dedicated hardware GPU block that accelerated and offloaded more
ray tracing computations from the shader cores than less-efficient Level 2 architec-
tures. Imagination said the Imagination CXT RT3 offered up to 1.3 Gy/s. That said,
the company delivered photorealistic ray-traced shadows, reflections, global illumi-
nation, and Ambient Occlusion—with high frame rates—in a low-power budget. A
conventional ray tracing system is shown in Fig. 5.46.
Ray tracing is dominated by countless intersection tests between the millions of
rays emitted in each frame and the acceleration structure, which holds a hierarchical

Fig. 5.46 Conventional ray tracing organization



Fig. 5.47 Imagination Technologies' Photon RAC

box structure and the triangle geometry. The RAC fully offloads these expensive
operations to dedicated hardware, delivering significant area and power-efficiency
benefits (Fig. 5.47).
The RAC consisted of the ray store, ray task scheduler, and coherency gatherer.
It was coupled to two 128-wide unified shading clusters (USCs) that Imagination
said had high-speed, dedicated data paths for efficient and low-power ray-traced
deployment.
The box tester unit performed the search for rays intersecting with objects in 3D
space. It tested rays against axis-aligned boxes from the scene hierarchy. The RAC
had dedicated box and triangle testing hardware (dual triangle tester units). Those
blocks got the RAC to a Level 2 RT solution.
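A box tester of this kind implements, in fixed-function logic, what is usually written in software as the "slab" test: intersect the ray with the three pairs of axis-aligned planes and check that the resulting parameter intervals overlap. The sketch below is a generic software version for illustration, not Imagination's hardware design:

```cpp
#include <algorithm>
#include <utility>

struct Vec3 { float x, y, z; };

// Ray/axis-aligned-box slab test. invDir holds 1/direction per axis so the
// divisions are hoisted out of the per-box test, much as hardware testers do.
bool rayHitsAABB(const Vec3& origin, const Vec3& invDir,
                 const Vec3& boxMin, const Vec3& boxMax, float tMax) {
    float t0 = 0.0f, t1 = tMax;
    const float o[3]  = {origin.x, origin.y, origin.z};
    const float id[3] = {invDir.x, invDir.y, invDir.z};
    const float mn[3] = {boxMin.x, boxMin.y, boxMin.z};
    const float mx[3] = {boxMax.x, boxMax.y, boxMax.z};
    for (int axis = 0; axis < 3; ++axis) {
        float tNear = (mn[axis] - o[axis]) * id[axis];
        float tFar  = (mx[axis] - o[axis]) * id[axis];
        if (tNear > tFar) std::swap(tNear, tFar);
        t0 = std::max(t0, tNear);    // entry point moves later
        t1 = std::min(t1, tFar);     // exit point moves earlier
        if (t0 > t1) return false;   // intervals no longer overlap: miss
    }
    return true;                     // the ray passes through the box
}
```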
Ray traversal was in hardware and used a dedicated task scheduling and USC
interface. Those blocks used dedicated ray store and dedicated ray state tracking
units.
The predictive box scheduler handled ray traversal, tracking, and checking in
hardware with tight integration with the CXT USCs. The ray store kept ray data
structures on-chip during processing, which provided high bandwidth read–write
access to all units in the RAC. That, claimed the company, avoided slowdowns or
power increases from storing or reading ray data to DRAM. The ray task scheduler
offloaded the shader clusters, deploying and tracking ray workloads with dedicated
hardware, keeping ray throughput high and power consumption low. The packet
coherency gatherer unit analyzed all rays in flight and bundled rays from across the
scene into coherent groups enabling them to be processed efficiently. Imagination
said they had patented that technology, and it got the RAC up to Level 4.

The company claimed that each RAC could deliver up to 433 MRay/s of ray throughput and up to 16 GBoxTests/s, all scaling with additional RAC units and Imagination's multicore technology. It was compliant with the Vulkan RT ray query and ray pipeline
and represented Level 4 RTS in hardware. An implementation of RACs within a
GPU is shown in Fig. 5.48.
Imagination CXT, said Imagination, was a significant step forward for rasterized
graphics performance, with 50% more compute, texturing, and geometry perfor-
mance than Imagination’s previous-generation GPU IP. The company claimed its
low-power superscalar architecture delivered high performance at low clock frequen-
cies for exceptional FPS/W efficiency, while Imagination Image Compression
(IMGIC) reduced bandwidth requirements.
Imagination said the RTL-based IP Photon architecture could be scaled to the
cloud, data center, and PC markets through Imagination’s multicore technology.
That, claimed the company, could generate up to 9 TFLOPS of FP32 rasterized
performance and over 7.8 GRay/s of ray tracing performance while offering up to 2.5x
greater power efficiency than current Level 2 or 3 ray tracing solutions.
Bounding volume hierarchy (BVH) was a popular ray tracing acceleration tech-
nique that used a tree-based acceleration structure that contained multiple hierarchi-
cally arranged bounding boxes (bounding volumes) that encompass or surrounded
different amounts of scene geometry or primitives.
Coherent ray tracing solved the integer BVH node decompression overhead
problem by spreading the decompression cost over many rays.
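Putting the pieces together, BVH traversal is a loop of exactly those box tests: start at the root, descend only into children whose bounding boxes the ray hits, and hand triangles to the triangle testers at the leaves. The sketch below is a conventional stack-based software traversal (it reuses the Vec3 type and rayHitsAABB test from the earlier slab-test sketch); it illustrates the workload the RAC moves into hardware, not Imagination's actual data structures:

```cpp
#include <vector>

struct Vec3 { float x, y, z; };             // same layout as the slab-test sketch
bool rayHitsAABB(const Vec3& origin, const Vec3& invDir,
                 const Vec3& boxMin, const Vec3& boxMax, float tMax);   // defined earlier

struct BVHNode {
    Vec3 boxMin, boxMax;    // bounding volume of this node
    int  left  = -1;        // child node indices; -1 marks a leaf
    int  right = -1;
    int  firstTri = 0;      // leaf only: range of triangles to test
    int  triCount = 0;
};

// Counts the triangles whose enclosing boxes the ray reached (a stand-in
// for dispatching real ray/triangle intersection tests).
int traverseBVH(const std::vector<BVHNode>& nodes,
                const Vec3& origin, const Vec3& invDir, float tMax) {
    int candidates = 0;
    std::vector<int> stack = {0};           // start at the root node
    while (!stack.empty()) {
        int idx = stack.back();
        stack.pop_back();
        const BVHNode& n = nodes[idx];
        if (!rayHitsAABB(origin, invDir, n.boxMin, n.boxMax, tMax))
            continue;                       // prune this whole subtree
        if (n.left < 0) {
            candidates += n.triCount;       // leaf: triangles go to the triangle testers
        } else {
            stack.push_back(n.left);        // interior: visit both children
            stack.push_back(n.right);
        }
    }
    return candidates;
}
```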
Imagination showed a demonstration video that featured global illumination (GI),
per-pixel ray tracing, denoising, lighting, tone-mapping, and TAA. RT GI added
ambient lighting, grounding the objects in the scene and giving the most realistic
lighting. The demo ran at 1080p between 30 and 60 fps in a mobile power budget.
Imagination also offered a ray tracing software tool, PVRTune, which allowed
developers to see low-level ray tracing counters such as rays per second, box tester
load, cache hit rate, and the number of recursive transverse rays per second (Fig. 5.49).
The PVR ray tracing simulation Vulkan layer in the Imagination SDK could be
used to simulate the capabilities and behavior of the CXT block (Fig. 5.50).

Fig. 5.48 Imagination technologie’s GPU with RAC



Fig. 5.49 Rays per second monitor in PVRTune. Courtesy of Imagination Technologies

Fig. 5.50 Imagination created the industry’s first real-time ray tracing silicon in 2014. It showed
the R6500 test chip code named Plato. Courtesy of Imagination Technologies

As Imagination’s first Level 4 IP, CXT offered developers a hardware block dedi-
cated to accelerating ray tracing that promised more power and area efficiency than
other solutions. Consumers had come to appreciate the realism ray tracing provided
and would demand it in all their devices in the near future.

5.12.1 Summary

Imagination Technologies pioneered hardware-accelerated ray tracing in 2010 and


scaled it for the mobile market with its PowerVR Photon architecture. Ray tracing
has always been a someday technology. In 2017 based on a demo Nvidia made
using four Quadro AIBs, it was predicted (by extrapolating Moore’s law) that we
could see real-time ray tracing by 2021 or 2022. Then in late 2018, Nvidia lit up
the world and showed real-time ray tracing (RTRT) running on a single AIB. That was a defining moment for the industry, and, combined with Nvidia's powerful marketing, it taught consumers about ray tracing. The next challenge was to get RTRT
to work on a mobile device. Adshir in Israel showed RTRT on a mobile device the
same time Nvidia showed theirs on a PC. Adshir’s solution was all software, and
although elegant and highly effective, Apple, Samsung, Qualcomm, and others in the
mobile market did not employ it. In the meantime, Adshir was acquired. Nonetheless,
consumers knew there was such a thing as ray tracing, and, by golly, they wanted it.
Three companies in the mobile space were in a race to provide RTRT: AMD,
Imagination Technologies, and SiliconArts.

5.13 Nvidia’s Mega Data Center GPU Hopper (2022)

Nvidia introduced their long-anticipated Hopper GPU with startling compute results—the chip is actually more of a compute engine than a GPU per se. Nvidia is claiming a 6x improvement over the previous-generation Ampere: 2x of that comes from getting the chip to do FP8 calculations for inferencing, another 2x comes from improvements in CUDA, and further factors of roughly 1.3x and 1.2x come from other architectural gains. The factors do not add up to six, but multiplied together (2 × 2 × 1.3 × 1.2 ≈ 6.2) they account for the claim.
The generation-to-generation numbers are impressive and interesting (Table 5.10). Hopper has almost 2.5 times as many shaders (2.44 to be exact) as the last-generation A100 (Ampere).
The Hopper GPU, which is implemented on a subsystem with NVLinks in lieu of PCIe, is targeted at data centers. Consumer versions of it, if any, will be introduced later, as Nvidia did with the Ampere series (Fig. 5.51).
Hopper is designed to run big applications such as AI, supercomputing, and 3D
universes and is produced using TSMC’s 4 nm process. And, Nvidia has a new

Table 5.10 Nvidia's Hopper H100 GPU compared to previous GPUs


H100 A100 V100
Architecture Hopper Ampere Volta
FP32 CUDA cores 16,896 6912 5120
Tensor cores 528 432 640
Boost clock ~1.78 GHz
HBM 80 GB 80 GB 16 GB/32 GB
Memory type clock 4.8 Gbps HBM3 3.2 Gbps HBM2e 1.75 Gbps HBM2
Memory bus width 5120-bit 5120-bit 4096-bit
Memory bandwidth 3 TB/sec 2 TB/sec 900 GB/sec
FP32 vector 60 TFLOPS 19.5 TFLOPS 15.7 TFLOPS
INT8 tensor 2000 TOPS 624 TOPS N/A
FP16 tensor 1000 TFLOPS 312 TFLOPS 125 TFLOPS
TF32 tensor 500 TFLOPS 156 TFLOPS N/A
FP64 tensor 60 TFLOPS 19.5 TFLOPS N/A
Interconnect NVLink 4
Die size 814 mm2 826 mm2 815 mm2
Transistor count 80B 54.2B 21.1B
TDP 700 W 400 W 300 W/350 W
Process TSMC 4 N TSMC 7 N TSMC 12 nm FFN
Courtesy of Nvidia

Fig. 5.51 Nvidia's Hopper subsystem board. Courtesy of Nvidia



software transformer engine that automatically switches between FP8 and FP16
formats, based on the workload and some AI cleverness.
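As a purely conceptual sketch of that kind of automatic format selection—this is not Nvidia's Transformer Engine logic—the snippet below chooses between FP8 (E4M3) and FP16 for a tensor by computing a per-tensor scale from its absolute maximum and checking whether the smallest non-zero values would still survive in FP8's narrower range. The function names and the decision rule are assumptions made only for illustration.

```cuda
// format_pick.cu -- hypothetical sketch of automatic FP8/FP16 format selection.
// This is NOT Nvidia's Transformer Engine; it only illustrates the idea of
// scaling a tensor by its absolute maximum and keeping the narrow FP8 (E4M3)
// format only when the tensor's dynamic range still fits after scaling.
#include <cmath>
#include <vector>

enum class Format { FP8_E4M3, FP16 };

// E4M3 facts: largest finite magnitude is 448; smallest positive subnormal is 2^-9.
constexpr float kFp8Max          = 448.0f;
constexpr float kFp8MinSubnormal = 1.0f / 512.0f;

struct Choice { Format format; float scale; };

Choice chooseFormat(const std::vector<float>& tensor)
{
    float amax = 0.0f, aminNonzero = INFINITY;
    for (float v : tensor) {
        float a = std::fabs(v);
        if (a > amax) amax = a;
        if (a > 0.0f && a < aminNonzero) aminNonzero = a;
    }
    if (amax == 0.0f) return {Format::FP8_E4M3, 1.0f};   // all zeros: trivial case

    // Per-tensor scale factor that maps amax to the top of the FP8 range.
    float scale = kFp8Max / amax;

    // If the smallest non-zero value would underflow FP8 even after scaling,
    // the tensor's dynamic range is too wide for E4M3: fall back to FP16.
    bool fitsFp8 = (aminNonzero * scale) >= kFp8MinSubnormal;
    return {fitsFp8 ? Format::FP8_E4M3 : Format::FP16, fitsFp8 ? scale : 1.0f};
}
```

The design point the sketch illustrates is that narrower formats buy throughput only when the data's dynamic range permits them, which is why the switching has to be automatic and per-workload.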
During his keynote speech, Jensen Huang, Nvidia’s CEO, said the Hopper H100
provides a nine-times boost in training performance over Nvidia’s A100 and thirty
times more large-language-model inference throughput (Fig. 5.52).
Nvidia has continued with HBM, and the H100 is the first to use Gen5, reaching, claims Nvidia, 40 terabits per second of I/O bandwidth—1.5x faster than the A100's HBM2E. While the A100 was available in 40 and 80 GB models, the H100 comes with only 80 GB (so far).
Huang claimed twenty H100s could sustain the equivalent of the entire world’s
Internet traffic (Fig. 5.53).
With 80 GB per AIB, eight AIBs give a pool of 640 GB that, to a program (or programmer), looks like one contiguous memory pool. The NVLinks, with super high bandwidth and very low latency, make that possible. Tying eight AIBs together like that is kind of a supersizing of the chiplet concept.
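As a hedged illustration of what such a unified pool means to a programmer, the sketch below uses standard CUDA runtime calls to enable peer-to-peer access between two GPUs so that one device can address memory allocated on the other. It is a generic two-GPU example, not Nvidia's DGX or NVSwitch software stack, and error checking is omitted for brevity.

```cuda
// peer_pool.cu -- minimal sketch of peer-to-peer GPU memory access with the
// CUDA runtime. Over NVLink/NVSwitch, peer access like this is what lets
// several AIBs appear to a program as one large pool of device memory.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) { std::printf("Need at least two GPUs\n"); return 0; }

    // Check and enable bidirectional peer access between devices 0 and 1.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { std::printf("No peer path\n"); return 0; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // device 0 may dereference device 1 memory
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);   // and vice versa

    // Allocate a buffer on each device; with peer access enabled, a kernel on
    // device 0 could read or write bufOn1 directly through its pointer.
    float *bufOn0 = nullptr, *bufOn1 = nullptr;
    cudaSetDevice(0);  cudaMalloc((void**)&bufOn0, 1 << 20);
    cudaSetDevice(1);  cudaMalloc((void**)&bufOn1, 1 << 20);

    // Direct device-to-device copy (travels over NVLink when available).
    cudaMemcpyPeer(bufOn0, 0, bufOn1, 1, 1 << 20);

    cudaFree(bufOn1);
    cudaSetDevice(0);
    cudaFree(bufOn0);
    return 0;
}
```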
Chiplets? Just say no.
Nvidia was asked about that—why has the company continued with the megachip
approach when its competitors are going with the chiplet approach?
Huang said everyone will acknowledge that a large chip can handle interprocess communication better than multiple chips that have to go through I/O circuitry—wires on die are orders of magnitude better than wires off die. Big chips are hard, he said, but we are really good at it, and it's a competitive advantage. Do a superchip before you do a chiplet,
said Huang.
We asked Nvidia if they had any data comparing Hopper with Intel's Ponte Vecchio AIB, and Nvidia said no. Nvidia positioned the H100 Tensor Core GPU as providing an order-of-magnitude leap in performance versus the prior generation on a wide array of HPC and AI applications, including up to 30X more inference throughput and up to 9X faster training.

Fig. 5.52 Time to train the mixture of experts transformer network for H100 versus A100. Courtesy of Nvidia

Fig. 5.53 Nvidia’s H100 Hopper AIB with NVLinks (upper left) supports a unified cluster of eight
GPU. Courtesy of Nvidia

The company is taking the Hopper subassembly (shown above) and using multiples of it to build the DGX H100 supercomputer (Fig. 5.54).
Then, Nvidia is taking 16 DGX H100s to build a super-duper supercomputer they
are calling Earth 2 that will model and predict world weather patterns (Fig. 5.55).
The company also introduced new and updated versions of cloud applications
such as Merlin 1.0 for recommender systems and version 2.0 of both its Riva speech
recognition and text-to-speech service.

Fig. 5.54 Nvidia’s DGX H100 supercomputer. Courtesy of Nvidia



Fig. 5.55 Nvidia’s Earth 2 supercomputer. Courtesy of Nvidia

In mid-2022, GDep Advance, a retailer specializing in HPC and workstation systems in Japan, sold Nvidia's H100 80 GB AIBs for $36,400.

5.13.1 Summary

Nvidia has cleverly leveraged its GPU prowess and propelled itself into the AI
and supercomputer industry. Although built for the data center, Hopper had texture processors that could be used in CUDA programs, not just for graphics. They performed auxiliary data fetches and could improve cache availability.
Twenty twenty-one and the first part of 2022 were interesting for Nvidia to say
the least. It made a gallant attempt to acquire Arm and probably underestimated the blowback from other companies and their lobbying agents. In early 2022, it suffered a major hack and, like everyone else, saw its sales restrained by supply issues.
Huang believed data centers are fundamentally changing, becoming AI factories where raw data goes in and intelligence comes out.
Nvidia has embraced the digital twin and describes Earth 2 as a digital twin of
the Earth.

5.14 Conclusion

Researchers at Stanford University in California were among the first to exploit the computational power of the GPU's parallel processors in 2000. ATI was the leader in the effort, but Nvidia somehow took over and made things easier with its CUDA C-like programming environment, which hid the tediousness of programming the architecture.
GPUs were employed in the scientific segments as compute accelerators for prob-
lems that had massive amounts of data that could be processed in parallel. And then, in the mid-2010s, the application of GPUs to AI training took off; suddenly everyone was an AI expert, and Nvidia was king of the hill.
The concepts extended to auto-drive systems for vehicles, natural language recog-
nition via the web, and advanced rendering techniques involving ray tracing. AI
would solve all our problems—if you had the data for it to learn from.
This chapter is not finished as we are still exploring the depths and possibilities
of AI.

Chapter 6
Open GPU Projects (2000–2018)

An open-source GPU is a concept espoused by many but never quite fulfilled.


Although often started through a university, this type of project has trouble maintaining the funding necessary to survive beyond the initial enthusiasm. Often, the founders lack the skills to raise capital, and as a result, the project doesn't grow into a product or a sustained organization. One of the things that
made the Pixel Planes project successful was its long-term and substantial funding.
Developing an ASIC GPU (i.e., a non-FPGA GPU) compatible with modern
technologies such as PCIe, DisplayPort, DirectX 11+, OpenGL 4.x, and Vulkan
APIs is not a trivial task. Companies designing and manufacturing GPUs, like AMD, Intel, Nvidia, or Qualcomm, invest millions of dollars and hire skilled people to develop a GPU architecture. In addition, there is the minefield of patents
held by all those companies, plus the IP providers and the patent portfolio companies
and lawyers.
And yet, as formidable as that is, people and organizations have tried to develop
and maintain an open-source GPU. Encouraged by the success of the RISC-V and
FreeDSP open-source efforts, enthusiasts for an open GPU have looked at extending
the RISC-V ISA as a base point. That allows the hardware to be almost anything
available, so FPGAs have been used to demonstrate concepts and run benchmark
tests.
Several projects have produced an open-source GPU or a graphics controller, such
as Project VGA in 2007. This chapter reports on a few noteworthy, and in some cases,
ongoing, open GPU projects—apologies to those left out.
The software industry has enjoyed growth due to developers’ assertive use of
open-source software (OSS). Some macro examples are Pinterest's use of Hadoop; Twitter and Shopify using Ruby on Rails; Facebook's original use of PHP; and Uber's use of Node.js—all open-source development platforms. The open-source
community has made innovation and time to market quicker while reducing risk and
minimizing investment.


However, regardless of its demonstrated benefits to the software industry, open source was underused in the hardware market. Although open-source utilization had
been rewarding for system hardware and circuit board design, semiconductors were
lacking, particularly for system-on-chip (SoC) and application-specific integrated
circuit (ASIC) design. Several technology observers have said those are the areas
where it was needed the most.
There are good reasons to believe open-source hardware could drive SoC, ASIC,
and FPGA innovation and industry growth by enabling low-cost development,
making it possible to adopt device-level technological breakthroughs quickly.
It started with front-end architectural designs implemented at the Register Transfer Level (RTL). Such designs may have incorporated existing modules like bus and
memory controllers, SRAM, and I/O pads. Once the front-end register transfer logic
was finalized, it was tested and verified using simulators. Verification could dominate
the cost and effort of the initial hardware design step.
There are obvious opportunities for open source in hardware development. For instance, complex components and intellectual property (IP), such as processor cores, are designed once by experts and then reused.
There is an ample supply of front-end toolsets for semiconductor development. In addition, there is front-end design IP in the form of powerful cores, such as RISC-V, Rocket (riscv.org), OpenRISC, and LEON.
The first open-source GPU design appeared in 2004 from Timothy Miller. Then
came the OpenVGA project by Wacco in the Netherlands in 2007. Soon, others like
Nyuzi, MIAOW, the Libre 3D GPU project, and AMD’s GPUOpen followed. This
section discusses these open-source hardware projects.

6.1 Open Graphics Project (2000)

In 2000, Timothy Miller was designing a graphics chip for his employer. As a recent computer science graduate of the University of South Florida, and never having been told he could not do it, he created a complete semiconductor design using whatever tools he could get his hands on.
Later in 2004, Timothy Miller had his eureka moment. Surely, he thought, I cannot
be the only person in the world trying to design a GPU. So, he decided he would
share his developments, simulators, software tools, and most importantly, the ISA—
the instruction set architecture of his GPU design. He set up an open-source platform
using the MIT licensing policy. Miller named it the Open Graphics Project (OGP)
and launched it in October 2004 [1]. It, in turn, spawned other open organizations
such as the Open Hardware organization (Fig. 6.1).
As people joined the OGP, they wanted to find a manufacturer who could build the
chip so they could have hardware to work with. Miller learned that a small production
run of Open Graphics’s chips would cost about $2M. So Miller decided to create an
offshoot company called Traversal Technology Inc.

Fig. 6.1 Timothy Miller. Courtesy of Binghamton University

One of Miller’s concerns was how the community would interact with the
company. To avoid any conflicts or even the impression of conflict and to safeguard
the interests of the OGP community, he suggested the creation of a new organization.
Based on that recommendation and supported by the rest of the community,
electrical engineer Patrick McNamara and Traversal Technology Inc. founded the
Open Hardware Foundation (OHF) in 2007. The goal was to facilitate the design,
development, and production of free and open hardware [2].
Another goal of the OHF was to help fund the production of open graphics prod-
ucts by providing Traversal with a known number of sales. Traversal benefited by
having less financial risk associated with producing the graphics chip. The open-
source community benefited by having hardware available at reduced or no cost
for developers who could contribute further to the project. But in 2009, McNamara
announced that to better support the Open Graphics Project, the foundation would
apply its funds (the product of donations) toward the Linux Fund [3].
Thanks to the funding help from the Linux Fund, Traversal built 25 boards and
distributed them to researchers worldwide [4] (Fig. 6.2).
The open graphics processor project was an idea that would not die. Since the
introduction of the commercial GPU in 1999 by Nvidia, the usefulness and scalability
of a GPU had been irresistible. People wanted to democratize the GPU and put it into the hands of more users and researchers, but without the cost and restrictions of commercial devices and APIs.

Fig. 6.2 OGP test board. Courtesy of en.wikipedia

6.2 Nyuzi/Nyami (2012)

In late 2012, Jeff Bush heard about Timothy Miller and sent him an email. Bush
told Miller of his work to establish an open GPU project. Bush pointed Miller to
the work he had already open sourced (which he had been working on for around
a year). Bush was in the process of proving the design by building an FPGA. As it
turned out, Miller was looking at some similar concepts. After a few conversations,
the two decided to collaborate and coauthor some academic papers.
Nyuzi had been in development since 2010 and became a fully functional open-
source GPU inspired by Larrabee. Although architecturally different from the SIMD
architectures from AMD and Nvidia, researchers at Binghamton University and
several other places used it to investigate GPUs.
In 2015, Jeff Bush collaborated with Timothy Miller and Aaron Carpenter, through
Binghamton University to further study and promote Open GPU projects. Bush gave
a presentation at the IEEE International Symposium on Performance Analysis of
Systems and Software (ISPASS) on Nyami—“A Synthesizable GPU [5].”
Nyuzi (previously named Nyami) was an experimental GPU processor hardware
design focused on compute-intensive tasks (also referred to as a GPGPU and GPU
compute). Said Bush and Miller, it was optimized for use cases like deep learning
and image processing.
That project included a synthesizable hardware design written in System Verilog,
an instruction set emulator, an LLVM-based C/C++ compiler, software libraries, and
tests. The design could be used to experiment with micro-architectural and instruction
set design trade-offs.

Open-source hardware advocates and supporters have had access to open-source processors, including OpenRISC, OpenCores, RISC-V, and OpenDSP.
In 2011, Bush published the initial RTL code (Verilog) for Nyami on GitHub [6].
One month later, he had an FPGA model running [7].

6.3 MIAOW (2015)

Karu Sankaralingam, an associate professor of computer science at the University of Wisconsin–Madison, established a research unit at Madison named the Vertical Research Group. One of their projects, started in 2014, was MIAOW (pronounced
Research Group. One of their projects, started in 2014, was MIAOW (pronounced
“me-ow”)—Many-core Integrated Accelerator of Waterdeep/Wisconsin—an open-
source GPU (Fig. 6.3).
MIAOW was based on the publicly released Southern Islands ISA by AMD [8]
in 2012—AMD had subsequently removed it and added newer ISAs [9] under its
GPUOpen initiative.
The MIAOW design implemented a compute unit suitable for architecture anal-
ysis and experimentation with GPU-compute workloads. In addition to the Verilog
HDL composing the compute unit, MIAOW also included a suite of unit tests and
benchmarks for regression testing.
MIAOW had thirty-two compute units and is described in Fig. 6.4.
A primary motivator for MIAOW’s creation was that software simulators of hard-
ware like CPUs and GPUs often miss many subtle aspects that may distort the
performance or power and other results they produce. The researchers believed
implementing a GPU’s logic, like MIAOW, would be a useful tool for producing

Fig. 6.3 Karu Sankaralingam. Courtesy of University of Wisconsin–Madison

Fig. 6.4 MIAOW block diagram

more accurate quantitative results when benchmarking GPU-compute workloads and providing context for the architectural complexities of implementing newly proposed algorithms and designs intended to improve performance or other desired characteristics.
MIAOW represented a GPU’s compute unit. It did not include the auxiliary logic
required to produce actual graphical output, and it did not have logic to connect it to
a specific memory interface or system bus. Outside contributors could develop those
extensions, and students at Madison were adding a graphics pipeline to the design;
they expected it to take about six months. MIAOW was created as a research tool,
and the presence of the other functions was not considered a necessity in running
benchmarks and experiments. MIAOW was licensed under the three-clause BSD
license (Berkeley Software Distribution). The group contributed an article to the ACM (Association for Computing Machinery) journal TACO (Transactions on Architecture and Code Optimization). The publication showed how the MIAOW GPU implementation could be used to further investigate hardware reliability and timing speculation [10].
In August 2015, Sankaralingam presented MIAOW at the Hot Chips conference
[11]. A few groups picked up the MIAOW GPU work; however, Sankaralingam’s
research interests took on a different direction. Based partly on lessons learned from
doing the MIAOW project, Sankaralingam authored an IEEE computer article in 2017
on open-source hardware, drawing upon the virtuous cycle of innovation and personal
satisfaction that drives the open-source software community. His other research had
pioneered dataflow designs and compilers for dataflow hardware. His former advisor,
Doug Burger, described him as a person in search of the universal accelerator.

6.4 GPUOpen (2015)

AMD announced GPUOpen on December 15, 2015 [12], and released it on January
26, 2016. It started as a middleware software suite of visual effects for computer
games developed by AMD for its Radeon processor. It was initially released in 2016
to compete against Nvidia’s GameWorks; however, over time, AMD expanded its
scope to include past generations of their GPU’s ISA. The majority of GPUOpen
content was open-source software, unlike GameWorks, which was criticized for its
proprietary and closed nature.
AMD had released ISA manuals for its GPUs since the Radeon R600 (a GPU that
helped usher in the DirectX 10 era in 2006). According to AMD, the main reasons for publishing an ISA document were the following:
• To specify the language constructs and behavior, including the organization of
each type of instruction in both text syntax and binary format.
• To provide a reference of instruction operation that compiler writers could use to
maximize the processor’s performance.
The ISA information was intended for programmers writing application and
system software, including operating systems, compilers, loaders, linkers, device
drivers, and system utilities. It assumed that programmers were writing compute-
intensive parallel applications (streaming applications) and that they had an under-
standing of requisite programming practices.
AMD posted its RDNA ISA in September 2020 [13]. The first product featuring
RDNA was the Radeon RX 5000 AIBs, introduced in July 2019. The processor
implemented a parallel micro-architecture suitable for computer graphics applica-
tions and general purpose data-parallel applications. Data-intensive applications that
require high bandwidth or are computationally intensive can run on an AMD RDNA
processor. Figure 6.5 shows the organization of the AMD RDNA series processors.
The document describes the environment, organization, and program state of
AMD RDNA generation devices. It details the instruction set and the microcode

Fig. 6.5 AMD RDNA generation series block diagram



formats native to that family of processors accessible to programmers and compilers.


The document specifies the instructions (including the format of each type of instruc-
tion) and the relevant program state (including how the program state interacts with
the instructions).

6.5 SCRATCH (2017)

In October 2017, Pedro Duarte, Pedro Tomas, and Gabriel Falcao from the Universidade de Coimbra and Universidade de Lisboa in Portugal presented a paper at the ACM MICRO-50 '17 symposium on SCRATCH, a soft-GPGPU architecture and trimming tool [14]. (There is also Scratch, a visual programming environment that allows users, primarily ages 8–16, to learn computer programming while working on personally meaningful projects such as animated stories and games.)
The researchers postulated that power consumption limitations constrained
advanced signal processing and artificial intelligence algorithms in high-performance
and embedded supercomputing devices and systems. GPUs helped improve throughput-per-watt in many compute-intensive applications. However,
dealing more efficiently with the autonomy requirements of intelligent systems
demanded power-oriented, customized architectures. Such designs had to be
specially tuned for each application without manual redesign of the entire hardware
while also being capable of supporting legacy code (Fig. 6.6).
In kernel B, integer and floating-point instructions were detected. Thus, the archi-
tecture was trimmed (e.g., by considering available resources and power consumption
limitations) to support both FU types (Fig. 6.7).
The researchers proposed a new SCRATCH framework that automatically iden-
tified the specific requirements of each application kernel regarding instruction set
and computing unit demands. That allowed for the generation of application-specific
and FPGA-implementable trimmed-down GPU-inspired architectures. Their work
was based on an improved version of the original MIAOW system. The researchers
extended the design to support 156 instructions and enhanced it to support a fast

Fig. 6.6 Two different trimmed architectures were generated for two distinct soft kernels. Courtesy
of Pedro Duarte and Gabriel Falcao from the Universities of Coimbra and Lisboa

Fig. 6.7 During compile-time, the instructions present in kernel A indicated that only scalar and
vectorized integer FUs should be instantiated on the reconfigurable fabric. Courtesy of Pedro Duarte
and Gabriel Falcao from the Universities of Coimbra and de Lisboa

prefetch memory system within a dual-clock domain. Experimental results (with


integer and floating-point arithmetic benchmarks) demonstrated the ability to achieve
an average of 140 × speedup and 115 × higher energy efficiency levels (instructions-
per-Joule) compared to the original MIAOW system. Additionally, they achieved 2.4
× speedup and 2.1 × energy efficiency gains compared to their optimized version
without pruning.

6.6 Libre-GPU (2018)

The Libre-GPU open-source project was initiated in 2018 by Luke Kenneth Casson Leighton in The Hague. It was a formidable undertaking (Fig. 6.8).
After 12 years studying SoCs, more than a hundred of them, he could not find one
that was open to the bedrock. They all had either closed GPU firmware, closed VPU
firmware, closed BIOS, DRM built-in with e-fuses, or usually blatant GPL copyright
violations (e.g., Amlogic, Allwinner, and others). Leighton was discouraged and, in
his eureka moment, decided the only way to solve that was to do it himself properly.
Leighton described libre RISC-V M-Class as a 100% libre RISC-V + 3D GPU
chip for mobile devices. The project began its life because Leighton wanted there to
be a completely free Libre system on a chip offering. At the time, he had access to
$250,000 in funding to make it happen. The design would use a RISC-V processor
ISA with extensions to increase the ability to run as parallel processors—and the
GPU would essentially be software-based and use the Khronos Vulkan API structure
[15].

Fig. 6.8 Luke Kenneth Casson Leighton. Courtesy of Leighton

Leighton emphasized that his plan called for a hybrid CPU, VPU, and GPU. It was not, as some suggested, a dedicated exclusive GPU. However, he pointed out, the option existed to create a stand-alone GPU product.
The primary goal was to design a complete SoC that included libre-licensed VPU-
and GPU-accelerated instructions as part of the actual main CPU. That, reasoned
Leighton, would greatly simplify driver development, applications integration, and
debugging, reducing costs and time to market.
Leighton said he did not have any illusions about the project’s cost and estimated
it would cost more than $6 million, with a contingency of up to $10 million. Leighton
sought backers to carry the project forward and hoped to have a tape out (RTL code)
by late 2020. It did not happen.
The project started as a RISC-V Vulkan accelerator design. However, Leighton
dropped RISC-V and switched to OpenPOWER ISA due to NDAs and other orga-
nizational issues. The original design goal in 2018 was modest at 1280 × 720 at
25 FPS and 5–6 GFLOPS. Chromebook-type laptops were envisioned as the first
platform for the SoC (Fig. 6.9).
In February 2021, at the FOSDEM conference, Leighton presented his ongoing
work for the hybrid CPU/VPU/GPU OpenPOWER [16]. Leighton declared that the

Fig. 6.9 The Libre-SOC hybrid 3D CPU-VPU-GPU

initiative would be fully open source, including the hardware design. Figure 6.9 is
based on his presentation.
Leighton’s focus in 2021 was on developing an embedded SoC and not building
a PCIe-based GPU for an AIB. Those past open-source graphics AIB efforts, such
as Project VGA, became school science projects.
For the Vulkan implementation, Leighton continued to use the Rust-based Kazan
implementation with Simple-V extension for RISC-V and other design elements for
making a Vulkan software implementation more efficient [17].

Rust was a multi-paradigm programming language designed for performance and safety, especially safe concurrency.
The language grew out of a personal project begun in 2006 by Mozilla
employee Graydon Hoare. Mozilla began sponsoring the project in 2009
and announced it in 2010.
When asked about the origin or meaning of the name, Hoare said, “I don’t
have a really good explanation. It seemed like a good name (also a substring
of ‘trust,’ ‘frustrating,’ ‘rustic,’ and ‘thrust’?)”
Originally designed by Hoare, additional people such as Dave Herman,
Brendan Eich, and others refined the language and developed the browser
engine and the Rust compiler. It gained increasing use in industry. Rust
was voted the “most loved programming language” in the Stack Overflow
Developer Survey year after year since 2016.

6.7 Vortex: RISC-V GPU (2019)

In 2019, researchers at the Georgia Institute of Technology began investigating the parallel processing capabilities of a GPU for computing applications. They reasoned
that if one could build a GPU’s SIMD function in FPGA, it might lead to lower costs
and a more flexible computer accelerator. Also, they wanted to make the design and
supporting software open source. They called their project Vortex: A Reconfigurable
RISC-V GPGPU Accelerator for Architecture Research [18].
Although open-source hardware and software were always interesting and impor-
tant, there was little open-source GPU infrastructure in the public domain. The
researchers believed that one of the reasons for the lack of open-source infrastructure
for GPUs was because of the complexity of the ISA and software stacks.
So, the first thing they did was propose an ISA extension to RISC-V that would
support GPGPUs and graphics.
The main goal of the ISA extension was to minimize the ISA changes so that corre-
sponding changes to the open-source ecosystem were also minimal. That, thought
the researchers, would create a sustainable development ecosystem.

Fig. 6.10 Vortex block diagram

To demonstrate the feasibility of a minimally extended RISC-V ISA, they implemented the complete software and hardware stacks of Vortex on FPGA (Fig. 6.10).
Vortex was a PCIe-based soft GPU that supported OpenCL and OpenGL. Vortex
could be used in various applications, including machine learning, graph analytics,
and graphics rendering, and they believed Vortex could scale. The researchers predicted it could scale up to thirty-two cores on an Altera Stratix 10 FPGA and would deliver a peak performance of 25.6 GFLOPS at 200 MHz.
Hoping to leverage the fast-growing open-source community around RISC-V
and the open-source LLVM and POCL compilers, Vortex tried to present a holistic
approach for GP-compute (GPGPU). The research explored new ideas of the hard-
ware and software stacks. Vortex implemented GPGPU functionality and 3D graphics
acceleration with its minimal ISA extensions. Along with its high bandwidth caches
and elastic pipeline, that enabled a design that achieved a high frequency on FPGAs.
A configurable RTL and a tightly coupled runtime stack allowed for quick yet
flexible experimentation, which the researchers hoped was evident from the variety
of evaluation metrics they presented. The Georgia Tech group believed it would allow
increasingly diverse and complex workloads to be deployed on Vortex, leading to
research on more realistic and meaningful scenarios.
For future work, they planned to extend Vortex's compiler and runtime software to support the CUDA and Vulkan APIs. Support for an ASIC design flow was also an essential part of the roadmap to chip fabrication.
The team presented a poster on Vortex at the 2020 Hot Chips conference [19].

6.8 RV64X (2019)

A group of enthusiasts led by Dr. Atif Zafar proposed a new set of graphics instruc-
tions designed for 3D graphics and media processing. These new instructions were
built on the RISC-V base vector instruction set. They added support for new data types
that were graphics specific as layered extensions in the spirit of the core RISC-V ISA.
Vectors, transcendental math, pixel and textures, and z/frame buffer operations were
supported. It could be a fused CPU–GPU ISA. The group called itself the RV64X
Group because instructions would be 64 bits long (32 bits would not be enough to
support a robust ISA, even though the data paths were 32 bits wide) (Fig. 6.11).

Fig. 6.11 Dr. Atif Zafar. Courtesy of Zafar

The world has plenty of GPUs to choose from. Why this? The group believed
commercial GPUs were less effective at meeting unusual needs such as dual-phase
3D frustum clipping, adaptable HPC (arbitrary bit depth FFTs), and hardware SLAM.
The RV64X Group thought collaboration provided flexible standards, reduced the
10–20 man-year effort otherwise needed, and helped with cross-verification to avoid
mistakes.
The team said their motivation and goals were driven by the desire to create a
small, area-efficient design with custom programmability and extensibility. It should
offer low-cost IP ownership and development and not compete with commercial
offerings. It could be implemented in FPGA and ASIC targets and would be free and
open source. The initial design was targeted at low-power microcontrollers. It was
Khronos Vulkan compliant and would support other APIs (OpenGL, DirectX, etc.).
The design was to be a RISC-V core with a GPU functional unit. It would look like
a single piece of hardware with 64-bit long instructions coded as scalar instructions
to the programmer. The programming model was an apparent SIMD; the compiler
would generate SIMD from prefixed scalar opcodes. It would include variable-issue,
predicated SIMD backend, vector front-end, precise exceptions, branch shadowing,
and more. There wouldn't be any need for an RPC/IPC-calling mechanism to send 3D API calls back and forth between CPU memory space and GPU memory space, said the team. And it would be available as a 16-bit fixed point (ideal for
FPGAs) and 32-bit floating point (ASICs or FPGAs).
The design employed the Vblock format (from the Libre-GPU effort; see Libre-GPU (2018)):

• It was a bit like VLIW.


• A block of instructions was prefixed with register tags, giving extra context to the
block’s scalar instructions.
• Sub-blocks included vector length, swizzling, vector and width overrides, and
predication.
• All this was added to scalar opcodes!
• There were no vector opcodes (and no need for any).
• In the vector context, if a register was used by a scalar opcode and the register
was listed in the vector context, vector mode was activated.
• Activation resulted in a hardware-level for-loop issuing multiple contiguous scalar
operations (instead of only one).
• Implementers were free to implement the loop in any fashion they desired: SIMD,
multi-issue, single execution, anything.
The design would employ scalars (8-, 16-, 24-, and 32-bit fixed and floats) and
transcendentals (sincos, atan, pow, exp, log, rcp, rsq, sqrt, etc.). The vectors (RV32-
V) would support two- to four-element (8, 16, or 32 bits/element) vector operations
along with specialized instructions for a general 3D graphics rendering pipeline for
points, pixels, and texels (essentially special vectors):
• XYZW points (64- and 128-bit fixed and floats)
• RGBA pixels (8-, 16-, 24-, and 32-bit pixels)
• UVW texels (8, 16 bits per component)
• Lights and materials (Ia, ka, Id, kd, Is, ks...).
Matrices of size 2 × 2, 3 × 3, and 4 × 4 would be supported as native data types and memory structures. One could also represent an attribute vector, essentially, in a 4 × 4 matrix.
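To illustrate the kind of operation such native matrix and point types would accelerate, the short sketch below transforms an XYZW point by a 4 × 4 matrix in plain software—the 16 multiply-adds that a fused graphics instruction could collapse into one or a few opcodes. The type and function names are invented purely for illustration.

```cuda
// mat4_transform.cu -- what a native 4x4 matrix / XYZW point type would compute.
// A fused CPU-GPU ISA could express this whole loop as a matrix-vector opcode.
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };              // row-major 4x4 matrix

Vec4 transform(const Mat4& M, const Vec4& p)
{
    float in[4]  = { p.x, p.y, p.z, p.w };
    float out[4] = { 0, 0, 0, 0 };
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out[row] += M.m[row][col] * in[col];   // 16 multiply-adds in software
    return { out[0], out[1], out[2], out[3] };
}
```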
The advantages of a fused CPU–GPU ISA are that it could implement a standard
graphics pipeline using assembly code, design new instructions in microcode, provide
support for custom shaders, and implement ray tracing extensions. It could also
support vectors for numerical simulations and 8-bit integer data types for artificial
intelligence and machine learning (Fig. 6.12).
Custom rasterizers could be implemented, such as splines, SubDiv surfaces, and
patches.
The design would be flexible enough to implement custom pipeline stages, custom
geometry/pixel/frame buffer stages, custom tessellators, and custom instancing
operations.
The RV64X reference implementation will include:
• Instruction/data SRAM cache (32 KB)
• Microcode SRAM (8 KB)
• Dual-function instruction decoder
– Hardwired implementing RV32V and X
– Microcoded instruction decoder for custom ISA.

Fig. 6.12 RV64X block diagram

• Quad vector ALU (32 bits/ALU—fixed/float)


• 136-bit Register Files (1 K elements)
• Special function unit
• Texture unit
• Configurable local frame buffer.
The design is meant to be scalable as indicated in Fig. 6.13.
The RV64X design has several novel ideas, including fused unified CPU–GPU
ISA, configurable registers for custom data types, and user-defined SRAM-based
microcode for application-defined custom hardware extensions for:
• Custom rasterizer stages
• Ray tracing
• Machine learning
• Computer vision.

Fig. 6.13 RV64X's scalable design

The same design serves as a stand-alone graphics microcontroller or scalable shader unit, and data formats support FPGA-native or ASIC implementations.
Why is there a need for open graphics? The developers thought most graphics processors covered the high end, such as gaming, high-frequency trading, computer vision, and machine learning. They believed the ecosystem lacked a scalable graphics
core for more mainstream applications like kiosks, billboards, casino gaming, toys,
robotics, appliances, wearables, industrial HMI, infotainment, automotive gauge
clusters, and so on. Specialty programming languages must also be used to program
GPU cores for OpenGL, OpenCL, CUDA, DirectCompute, and DirectX.
A graphics extension for RISC-V would resolve the scalability and multi-language
burdens, enabling a higher level of use case innovation.
The team planned to have a discussion forum set up. They also had the immediate
goal of building a sample implementation with an instruction set simulator, an FPGA
implementation using open-source IP, and a custom IP designed as an open-source
project.
As for the Libre-RISC 3D GPU, the organization has said its goal was to design a hybrid CPU, VPU, and GPU. It was not, as many news articles implied, a dedicated exclusive GPU, although the option existed to create a stand-alone GPU product.
Their primary goal was to design a complete all-in-one processor SoC that happens
to include a libre-licensed VPU and GPU.
One application not listed above as a potential use of a free, flexible, small GPU was crypto-currency mining.
If it was the goal of the RISC-V community to emulate the IP suppliers such as
Arm and Imagination, then the next things to expect to see are DSP, ISP, and DP
designs. There was at least one open DSP proposal, and perhaps it could be brought
into the RISC-V community [20].

6.9 Conclusion

The open-source GPU takes its idea from the success of the open-source software community. Open-source software is computer software released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose.
Open-source software may be developed in a collaborative public manner. One
excellent example of open-source software is Khronos and its ever-growing library
of APIs and tools. Khronos works by engaging the community of interested parties to
help develop initiatives, standards, and definitions. Then, people from all industries
participate. The result is a robust, far-reaching, and usually long-lasting standard
developed by the industry’s best minds.
Open-source GPU efforts hope to emulate that model and be as successful.

References

1. https://wiki.p2pfoundation.net/Open_Graphics_Project
2. McNamara, P. Open Hardware, Technology Innovation Management Review, (September
2007). https://timreview.ca/article/76
3. Open Hardware Foundation goodbye message. https://tinyurl.com/54v62hfc
4. LinuxFund, OGP Supply Developers with Open Graphics Cards– OSnews, (April 12, 2009).
https://www.osnews.com/story/21299/linuxfund-ogp-supply-developers-with-open-graphics-
cards/
5. Bush, J., Miller, T. Nyami: A Synthesizable GPU Architectural Model for General-Purpose
and Graphics-Specific Workloads, IEEE, (April, 2015). http://www.cs.binghamton.edu/~mil
lerti/nyami-ispass2015.pdf
6. jbush001/NyuziProcessor. https://github.com/jbush001/NyuziProcessor/commit/63b77a515
c1658b4514b13764e0afdc5ba9ecda6
7. jbush001/NyuziProcessor, Hello world test program works on cyclone II starter board.
jbush001/NyuziProcessor@f189e8e · GitHub, (September 2015). https://www.intel.com/con
tent/dam/altera-www/global/en_US/uploads/d/db/Hello_World_Lab_Manual_CVE.pdf
8. Larabel, M.: AMD Publishes “Southern Islands” ISA Documentation, (August 15, 2012).
https://www.phoronix.com/scan.php?page=news_item&px=MTE2MDg
9. AMD GPU ISA documentation. https://gpuopen.com/documentation/amd-isa-documentation/
10. Balasubramanian, R. et al. Enabling GPGPU Low-Level Hardware Explorations with MIAOW:
An Open-Source RTL Implementation of a GPGPU. https://doi.org/10.1145/2764908
11. MIAOW: An Open-source GPGPU, (June 24, 2015). https://tinyurl.com/cww8j7s7
12. Smith, R. AMD’s 2016 Linux Driver Plans & GPUOpen Family of Dev Tools: Investing In Open
Source, AnandTech, (December 15, 2015). https://www.anandtech.com/show/9853/amd-gpu
open-linux-open-source
13. RDNA 1.0 Instruction Set Architecture Reference Guide, AMD, (September 25, 2020), https://
developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf
14. Duarte, P. Tomas, P. and Falcao, G. SCRATCH: An End-to-End Application-Aware Soft-
GPGPU Architecture and Trimming Tool, MICRO-50’17: Proceedings of the 50th Annual
IEEE/ACM International Symposium on Microarchitecture, Cambridge, MA, (October 14–18,
2017). https://doi.org/10.1145/3123939.3123953
15. Hybrid 3D GPU/CPU/VPU. https://libre-soc.org/3d_gpu/
16. Leighton, Luke K. C. The Libre-SOC Hybrid 3D CPU, FOSDEM2021 (February 3, 2021),
https://tinyurl.com/2kbpty7a
17. kazan-3d/kazan, (2018), https://github.com/kazan-3d/kazan/tree/master/docs
18. Tine, B., Elsabbagh, F., Yalamarthy, K., Kim, H. Vortex: Extending the RISC-V ISA for GPGPU
and 3D-Graphics Research, MICRO ’21, (October 18–22, 2021), Virtual Event, Greece, https://
vortex.cc.gatech.edu/publications/vortex_micro21_final.pdf
19. Vortex: A Reconfigurable RISC-V GPGPU Accelerator for Architecture Research, https://vor
tex.cc.gatech.edu/publications/hotchips-poster.pdf
20. N. V. H. Toan and T. Q. Duc, “Design an open DSP-based system to acquire and process
the bioelectric signal in realtime,” 2016 International Conference on Biomedical Engineering
(BME-HUST), pp. 90–94, (2016), https://ieeexplore.ieee.org/document/7782108
Chapter 7
The Sixth Era GPUs: Ray Tracing
and Mesh Shaders

The big bang (2020–2023)


At GDC in San Francisco, in March 2014, Microsoft announced DirectX 12 and
said it would be released the following year. Microsoft officially launched DirectX
12 along with Windows 10 on July 29, 2015. The primary feature highlight for the
new release of DirectX was the introduction of advanced low-level programming
APIs for Direct3D 12, which could reduce driver overhead—a thinner API. (See
Book two, The GPU Environment—APIs, for additional information on DirectX 12
and its thinness).
It was not uncommon for AMD or Nvidia to introduce a new feature and for
Microsoft to follow up by incorporating it in the next release of DirectX or a new level
of the current generation of DirectX. However, 2018 saw an unusual breakthrough
led by Nvidia with its Turing GPU (discussed in the next section) and the arrival of
real-time ray tracing.
Ray tracing, you will recall from Chap. 3, can simulate a variety of optical effects,
such as reflection, refraction, soft shadows, scattering, depth of field, motion blur,
caustics, ambient occlusion and dispersion phenomena (such as chromatic aberration)
by following the path of a light ray [1]. That is a different approach than the traditional
overlaying of polygons to arrive at an approximation of the object. Hence, ray tracing
is referred to as being physically correct both in geometry and lighting.
In 2018, Nvidia introduced three new features at once: ray tracing, variable rate
shading, and mesh shaders. Each one was evolutionary, with mesh shading perhaps
the most significant.
In August 2018, at SIGGRAPH 2018, Nvidia revealed its Turing GPU and demon-
strated real-time ray tracing. To implement it, application developers had to use
Nvidia’s unsanctioned extensions to DirectX 12, and many did. But Microsoft was not
far behind. In October 2018, Microsoft announced it was taking DirectX Raytracing
(DXR) out of experimental mode. Users and developers that updated to the next
release of Windows 10 would get DirectX Raytracing on supported hardware that
would be fully operational.


Then, in November 2019, Microsoft announced it would be releasing DirectX 12 Ultimate, with mesh shader capability [2], VRS, an updated level of ray tracing (to
tier 1.1) [3], and sampler feedback [4]. It also revealed the DirectX 12 Ultimate logo,
the first such logo in the history of DirectX.
Microsoft said in November 2019 that mesh shaders were reinventing the
geometry pipeline.
In November 2019, Khronos announced it would use Nvidia mesh shader code as
an extension to Vulkan and OpenGL. Then finally, in March 2020, DirectX 12 Ulti-
mate with Mesh Shaders was officially announced, and, in January 2021, Microsoft
released code samples—and computer graphics would never be the same.
In 2020, AMD released RDNA 2, and Nvidia introduced its Ampere microarchitecture, both of which supported mesh shading through DirectX 12 Ultimate. Mesh shaders allowed the GPUs to handle more complex algorithms and transfer work from the CPU to the GPU. Mesh shading allowed for increasing the frame rate or the number of triangles in a scene by orders of magnitude. Intel announced that Intel Arc Alchemist
GPUs shipping in Q1 2022 would support mesh shaders.
A mesh shader was a new type of shader that combined vertex and primitive processing. It replaced the vertex shader (VS), hull shader (HS), domain shader (DS), and geometry shader (GS) stages with an amplification shader and a mesh shader. Roughly, mesh shaders replaced VS + GS or DS + GS, and amplification shaders replaced VS + HS.
The amplification shader allowed programmers to decide how many mesh shader
groups to run, and it passed data to those groups. The amplification shader eventually
replaced hardware tessellation. As a result, the GPU became a compute engine.
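Mesh shaders typically consume small, precomputed chunks of geometry often called meshlets. The sketch below shows one plausible—hypothetical—meshlet layout on the application side; it is not the DirectX 12 or any vendor's required format, only an illustration of how a mesh can be split into self-contained groups of vertices and triangles that a single mesh shader workgroup can process.

```cuda
// meshlet_sketch.cu -- a hypothetical meshlet layout for mesh-shader pipelines.
// Each meshlet references a small, bounded set of vertices and the triangles
// built from them, so one shader workgroup can process it independently.
#include <cstdint>
#include <vector>

struct Meshlet {
    uint32_t vertexOffset;    // start into the meshlet-vertex index list
    uint32_t vertexCount;     // e.g., up to 64 unique vertices
    uint32_t triangleOffset;  // start into the packed triangle list
    uint32_t triangleCount;   // e.g., up to 126 triangles
};

struct MeshletMesh {
    std::vector<float>    positions;        // x,y,z per vertex of the full mesh
    std::vector<uint32_t> meshletVertices;  // indices into positions, per meshlet
    std::vector<uint8_t>  meshletTriangles; // local 3-index triplets (0..vertexCount-1)
    std::vector<Meshlet>  meshlets;         // one record per mesh-shader workgroup
};

// At draw time the application dispatches one mesh-shader workgroup per meshlet
// (e.g., a DispatchMesh(meshletCount, 1, 1) call in a D3D12-style API); each
// workgroup reads its Meshlet record, fetches the vertices, and emits triangles.
```

The design point is that the fixed-function vertex-fetch and primitive-assembly stages are replaced by ordinary data structures like this one, which is what lets the GPU treat geometry processing as general compute.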
DirectX 12 Ultimate also included DXR 1.1, which enabled RayQuery [3] from
any shader stage.1

7.1 Miners and Taking a Breath

Also in 2018, the availability and pace of new GPUs and AIBs slowed. The sales
of AIBs did not slow, just the introductions of new ones. AMD and Nvidia were
slowing down because design, development, and release were increasingly expensive

1 RayQuery objects are used in inline raytracing, an alternative form of raytracing that doesn't use separate dynamic shaders or shader tables. It is available in any shader stage, including compute shaders, pixel shaders, etc. Inline raytracing in a shader starts with instantiating a RayQuery object as a local variable, which acts as a state machine for the ray query.

activities. Both companies had to pick their priorities carefully. And even though there
was a lot of fanfare when Koduri rejoined AMD in 2013, the Vega product line was
less than stellar, although it did have a surprising advantage for crypto-miners due
to its efficient and high-bandwidth memory manager. Meanwhile, Nvidia introduced
their award-winning Pascal GTX10 series in the summer of 2016 and expanded and
extended its ray tracing capabilities further.
The crypto craze caught both companies by surprise, and even though AMD
AIBs were more efficient and less expensive than Nvidia’s AIBs, Nvidia sold more
to the crypto crowd, demonstrating the power of brand. Selling is not exactly the right word—neither AMD nor Nvidia sold their products specifically for mining—but regardless of the sales and marketing dynamics, both companies ran out of stock as prices in
the channel soared.
AMD was able to increase production a little faster than Nvidia and in January 2018 announced it would be increasing the supply of the Radeon product line. But when Nvidia does something, it does it with flourish and enthusiasm; it amped up production even more, but not until May.
Then at Computex, AMD’s new Senior Vice President of Engineering for the
Radeon Technologies Group, David Wang, said he was committed to delivering a
new product every year, like clockwork.
Wang showed AMD's graphics roadmap at a presentation at Computex, and although it went out three years from 2017, including its next-gen Navi architecture and an unnamed 7 nm+ architecture debuting in 2020, it didn't look any different
than the roadmap shown at CES or GDC. Wang said at a round table discussion
that “AMD would be bringing out a new graphics product every year, via a new
architecture, process changes, or maybe incremental architecture changes”.2
Meanwhile, also at a Computex press briefing, Nvidia CEO Jensen Huang said
in response to a question about the next GPU release that gamers may not see next-
generation graphics boards for “a long time from now”.
That caught a few people by surprise because a press release from the Hot Chips
conference originally reported Nvidia would feature “their next-gen GPU.” That
statement was removed and shifted to “TBD.” The Hot Chips press release then said,
“We will hear from the CPU and GPU giants: AMD featuring their next-gen client
chip and Intel with an interesting die-stacked CPU with iGPU plus stacked dGPU”.
Speculation rippled through the conference and the net. One theory was if it ain’t
broke, don’t fix it. Nvidia had the performance advantage, with multiple SKUs and
the price difference between Nvidia and AMD did not seem to matter much. AMD
did not have a new part coming that year, so Nvidia had some breathing room to
devote R&D to other ambitions.
Another theory was that Nvidia had ramped up production of 1080s and below, and they needed to run that inventory down before announcing any new part (Osborne

2 Wang and AMD lived up to his promise; 2018: November RX 590; 2019: December RX 5500X, November RX 5500, July RX 5700 Series, January Radeon VII; 2020: November RX 6800 XT and 6900 XT; 2021: November RX 6600, July RX 6600 XT, and March RX 6700 XT.

effect).3 So Nvidia could take its time to roll out its next-generation GPU, code named
Turing. But they didn’t.

7.2 Nvidia’s Turing GPU (September 2018)

The Turing architecture introduced the first consumer products to deliver ray tracing
in real time—long thought to be available someday. The elements included artificial
intelligence processors (tensor cores) and dedicated ray tracing processors. Turing
was also the code name for Nvidia’s microarchitecture used in its Quadro RTX AIBs
and GeForce RTX 20 series AIBs.
With the Turing design, Nvidia also introduced new terms into the GPU vocab-
ulary: TIPS—tensor instructions per second, TOPS—tensor operations per second,
and TPU—tensor processing unit.
Nvidia had the Turing GPU manufactured with TSMC’s 12 nm FinFET process.
The top of the line, the TU102 GPU, had 18.6 billion transistors, 4608 shaders, 288 TMUs, 96 ROPs, and 72 streaming multi-processors (SMs). The GPU was the largest ever made at 754 mm2, and it could generate 14.2 TFLOPS. The Turing GPU
featured independent tensor cores for AI inferencing like Nvidia’s Volta GPU and
had dedicated ray tracing cores.
The tensor cores processed deep learning (DL) and AI inferencing such as DL
anti-aliasing (DLAA), video and image denoising and resolution scaling, and video
re-timing. The GPU was capable of up to 500 trillion (tensor) operations per second
(TOPS), almost ten times more than the Pascal GPU.
Each streaming multi-processor contained 64 CUDA cores—execution units with one floating-point (FP) and one integer (Int) compute path. The streaming multi-processor (SM) scheduled threads in groups of 32 threads called warps. The warp schedulers could issue two warps at the same time [5].
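To show what a warp of 32 threads means in practice, here is a small, generic CUDA example (not specific to Turing) in which the threads of each warp cooperate on a sum using register-to-register shuffles; warpSize, __shfl_down_sync, and atomicAdd are standard CUDA, while the kernel and buffer names are ours.

```cuda
// warp_sum.cu -- generic CUDA illustration of the 32-thread warp execution model.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void warpSum(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // The 32 threads of a warp execute this loop together; each shuffle moves a
    // register value from a higher lane to a lower one without touching memory.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    // Lane 0 of each warp now holds that warp's partial sum.
    if ((threadIdx.x % warpSize) == 0)
        atomicAdd(out, v);
}

int main()
{
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged((void**)&in, n * sizeof(float));
    cudaMallocManaged((void**)&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    warpSum<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    std::printf("sum = %f (expected %d)\n", *out, n);   // prints 1048576

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The point of the example is only the execution model: the 32 lanes of a warp advance together, and warp-wide operations such as shuffles exchange data without a round trip through memory.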
The TU102 had a 384-bit wide GDDR6 memory bus and employed fast 14 Gbps
memory. There were also two NVLink channels, which Nvidia planned to use in its
next-generation multi-GPU technology.
The GPUs of this era had become so large (physically and logically) that block
diagrams became nothing more than a sea of little boxes representing shaders and
began to take on the look of a die shot, as shown in Fig. 7.1.
In addition to ray tracing, Turing had features for the data center. Nvidia claimed
it would deliver up to ten times more performance and had 25 times better energy
efficiency than CPU-based servers.
The Turing GPU architecture included features to improve the performance of data
center applications, such as an improved video engine and a multi-process service.
Nvidia claimed Turing improved inference performance for small batch sizes,

3 The Osborne Effect is a reduction in sales of current products after the announcement of a future product.
7.2 Nvidia’s Turing GPU (September 2018) 327

Fig. 7.1 Nvidia’s Turing TU102 GPU die photo and block diagram. Courtesy of Nvidia

reduced launch latency, and enhanced quality-of-service (QoS) while supporting


a more significant number of concurrent client requests.
The Turing-based Tesla AIBs offered a higher memory bandwidth and larger
memory size than the prior generation Pascal-based Tesla AIBs and targeted
similar server segments. It also provided a greater user density for virtual desktop
infrastructure (VDI) applications.
With Turing, Nvidia introduced three significant new developments that rocked
the industry:
• real-time ray tracing (RTRT)

– Deep-learning anti-aliasing
– Deep learning supersampling
– Hybrid-rendering
• variable rate shading
• mesh shaders
In addition to the Turing Tensor Cores, the Turing GPU architecture had several
features to improve the performance of data center applications, such as:
• An improved video engine. Turing could run additional video decode formats
such as HEVC 4:4:4 (8/10/12 bit) and VP9 (10/12 bit).
• Multi-process service. The Turing GPU inherited an improved multi-process
service (MPS) feature introduced in the Volta architecture. Compared to Pascal-
based GPUs, the MPS on Turing-based Tesla boards, Nvidia claimed, improved
inference performance for small batch sizes, reduced launch latency, and improved
quality of service (QoS) while servicing a higher number of concurrent client
requests.
• Higher memory bandwidth and larger memory size. According to Nvidia, the
Turing-based Tesla AIBs had larger memory capacity and higher memory band-
width than prior generation Pascal-based Tesla AIBs, which targeted similar server
segments. They provided a greater user density for VDI applications.

7.2.1 Ray Tracing

AI and ray tracing show the scalability of the GPU


The Turing architecture introduced the first consumer products to deliver real-time
ray tracing—a long-anticipated, long-wished for capability that suddenly became
available. Turing used artificial intelligence processors, dedicated ray tracing
cores, and a sea of universal shaders to accomplish much faster ray tracing.
Ray tracing maps the trajectory of light as it reflects and refracts off material
surfaces while passing through a virtual environment. It produces a compelling 3D
image with realistic lighting and shadows if done correctly. However, the process is
highly demanding computationally, and for a long time, it was impossible to render
light rays with such precision in real time.
In October 2018, Microsoft released a Windows 10 OS update and software devel-
opment kit (SDK) to support DirectX Raytracing (DXR tier 1.0). Game developers
used DXR and introduced several games with ray tracing
features. The advancements in ray tracing performance brought by Turing also had
the potential to transform the professional graphics industry.
A single pipeline state object (PSO) can contain any number of shaders in ray
tracing. DXR tier 1.1, released in November 2019, provided the ability to add extra
shaders to an existing ray tracing PSO, increasing the efficiency of dynamic PSO
additions. Microsoft added Execute Indirect, enabling adaptive algorithms to decide
the number of rays on the GPU execution timeline. Inline ray tracing was added,
which allowed for more direct control of the ray traversal algorithm and shader
scheduling.
Nvidia used the tensor processors in Turing for AI denoising of ray-traced images
to speed up ray tracing by reducing how many rays it took to obtain a clean image.
As a result, the tensor cores were underutilized, which led Nvidia to develop
DLSS—deep learning supersampling. DLSS used an AI algorithm specific to a game
to generate higher-resolution images from lower-resolution ones. That helped boost
performance without degrading the image.
The Turing GPU could generate up to 10 Giga rays (10 billion rays) per second,
the comparable ray tracing processing power of dozens of high-end CPUs in a render
farm.
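To put 10 billion rays per second in perspective, the sketch below works out a per-pixel ray budget at an assumed 4K, 60 frames-per-second target; the resolution and frame rate are illustrative choices, not Nvidia figures.

```python
# Rough per-pixel ray budget at an assumed 4K/60 fps target.
rays_per_second = 10e9          # 10 Giga rays/s claimed for Turing
pixels = 3840 * 2160            # 4K frame, about 8.3 million pixels
fps = 60

rays_per_frame = rays_per_second / fps
rays_per_pixel = rays_per_frame / pixels
print(f"~{rays_per_pixel:.0f} rays per pixel per frame")   # ~20
```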
Turing also had a deep learning anti-aliasing (DLAA) temporal convolutional
autoencoder. It enabled the GPU to render at low resolution to conserve resources.
It would then use AI to upscale the images for output on the screen. Nvidia trained
its DLAA algorithm to determine an image’s ground truth (correct properties) by
feeding 64 jittered samples of the same frame to create a ground truth frame. The
AI system could automatically produce a higher-quality and higher-resolution image
for that frame [6].
DLAA was DLSS without the upscaling portion; it addressed only the jaggies in the image.
Instead of upscaling the image, Nvidia used its AI-assisted tech to create better
anti-aliasing at native resolution.
Real-Time Ray Tracing Five Years Ahead of Schedule.

Fig. 7.2 Ray tracing features supported in Nvidia’s Turing GPU. Courtesy of Nvidia

In March 2018, at the GDC Conference, to show the power of its Volta-
based Quadro workstation AIBs, Nvidia linked four Volta-based AIBs together to
demonstrate real-time ray tracing [7].
Since the early 1980s, the industry had been on a mission to improve ray tracing
performance to make a GPU that could produce photorealistic graphics in real time.
Generation over generation, GPU suppliers improved their rendering capabilities,
using technologies such as physically based rendering and photogrammetry. Ray
tracing would enable the last piece of the puzzle—global illumination (Fig. 7.2).
Before the 2018 SIGGRAPH announcement, it was commonly thought GPU
architectures did not have the processing power necessary to handle the ray tracing
workload in real time. Based on Moore’s law projections, the industry thought it
would be five or six years before that workload would be possible with a single GPU.
However, Nvidia’s investments in deep learning helped accelerate the trajectory to
realize that goal. With Turing, the dream of real-time ray tracing (RTRT) was realized.

7.2.2 Hybrid-Rendering: AI-Enhanced Real-Time Ray Tracing

Nvidia introduced a new AI-enhanced rendering technique with Turing, which took
advantage of the GPU's tensor and RT cores (see Fig. 7.3). The hybrid-rendering
technology combined the ray tracing capabilities of the RT cores with the tensor cores'
image denoising and image scaling capabilities. It increased ray tracing performance
by approximately six times under Epic's RTRT and Microsoft's DXR ray tracing APIs
[8].
The RT cores accelerated the bounding volume hierarchy (BVH) traversal and
ray/triangle intersection testing. These two processes need to be performed iteratively
because the BVH traversal otherwise would require thousands of intersection tests to
finally calculate the color of each pixel. Since RT cores
were specialized to take on this load, the shader cores in the streaming
multi-processor (a sub-section of the GPU) had capacity for other aspects of the scene.

Fig. 7.3 Nvidia's hybrid-rendering technology combining the ray tracing capabilities of the RT
cores and the image denoising

Without this advancement, Nvidia would still have been on the path to releasing a real-
time ray tracing GPU in approximately five to ten years. Instead, the company's
ability to combine rasterization and compute-based techniques with hardware-
accelerated ray tracing and deep learning enabled real-time ray tracing sooner. This
hybrid rendering approach has since been adopted in several games and engines as well [9].

7.2.2.1 Variable Rate Shading

GPUs vary in performance by generation and segment (high to low) within a gener-
ation. As a result, some GPUs cannot consistently deliver the same level of quality
on every part of the output image. Turing’s independent integer and floating-point
pipelines meant it could simultaneously address and process numeric calculations.
Variable rate shading (VRS) increased rendering speed and quality by varying
the shading rate for different regions of the frame. Variable rate shading came from
developments for operations such as foveated rendering—a significant component
of VR headsets—and motion-adaptive shading. Foveation allows processors to save
resources by concentrating on the data at the center of focus where it presents a
high-resolution image and processing the rest in lower resolution.
With variable rate shading, the pixel shading rate of blocks of triangles varied.
Turing offered developers seven options for each 16 × 16-pixel region, including using
one shading result to color four pixels (two-by-two: 2 × 2), 16 pixels (4 × 4), or
non-square footprints like 1 × 2 or 2 × 4.
Turing’s variable rate shading enabled a scene to be shaded with rates varying
from once per visibility sample (supersampling) to once per 16 visibility samples.
The developer could specify the shading rate spatially (via a texture). As a result, a
single triangle could be shaded using multiple rates, providing the developer with
fine-grained control.
VRS allowed the developer to control the shading rate without changing the
visibility rate. The ability to decouple shading and visibility rates made VRS more
useful than techniques such as multi-resolution shading (MRS) and lens-matched
shading (LMS), which lower total rendering resolution in specified regions. At the
same time, VRS, MRS, and LMS could be used in combination because they are
independent techniques enabled by separate hardware paths.
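As a rough illustration of what that buys, the sketch below counts pixel-shader invocations for a hypothetical frame whose 16 × 16 regions use a mix of shading rates; the frame size and the rate mix are made-up numbers, not measurements.

```python
# Illustrative pixel-shader invocation count under variable rate shading.
# The frame size and the mix of shading rates are hypothetical.
frame_w, frame_h = 2560, 1440
region = 16                       # Turing applies a rate per 16x16 region

# Fraction of regions using each rate (coarser rates cover more pixels per shade).
rate_mix = {(1, 1): 0.40, (2, 2): 0.35, (4, 4): 0.15, (2, 4): 0.10}

regions = (frame_w // region) * (frame_h // region)
pixels_per_region = region * region

invocations = 0
for (rx, ry), share in rate_mix.items():
    shades_per_region = pixels_per_region // (rx * ry)
    invocations += share * regions * shades_per_region

baseline = frame_w * frame_h      # one shade per pixel
print(f"Shader invocations: {invocations:,.0f} vs {baseline:,} baseline")
print(f"Saving: {(1 - invocations / baseline):.0%}")      # roughly half
```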

7.2.2.2 Nvidia’s New DLSS (March 2020)

DLSS takes a frame or two at 4 K and uses it as a reference; it then renders the scene
at HD (1080p) and upscales the image to 4 K using references learned from the
original 4 K image.
Nvidia's RTX AIBs introduced at SIGGRAPH 2018 were a sensational surprise
and an instant hit. Initially, as was to be expected, there were not many games for it, but
the population steadily grew. However, the frame rate when running RTRT was not
optimal. In addition to having almost 3000 shader cores, the AIB also had 272 special-
ized ray tracing intersection cores, and 68 AI tensor cores. DLSS was a temporal
image upscaling technology. It used deep learning to upscale lower-resolution images
to a higher resolution for display on higher-resolution computer monitors.
Then, in March 2020, the company announced version 2.0 of its DLSS technology.
The 2.0 version was so different in its construction from the 1.0 version that several
fans and analysts thought Nvidia should have renamed it [10].
The basic concept of deep learning supersampling—DLSS—was just what its
name implied. Nvidia took a 3D model (a group of objects created by a game
developer for a scene in the game’s story) and rendered it using ray tracing techniques.
Unlike traditional scanline rendering, which bogs down as objects are added, ray
tracing is impacted by the resolution of the final image. So the trick of using lower
resolution for ray tracing and then scaling up the image is clever and efficient.
All rendering lives under the rules of performance vs. quality, with quality being
an arbitrary horizontal scale and performance being either frames-per-second on the
vertical axis for ray tracing or polygons-per-second for scan-line rendering.
Since DLSS was developed for ray tracing, the first step was to set the resolution
at a reference level. Through various empirical and experiential work, Nvidia settled
on 1080p as the smallest resolution to use as a base.
Nvidia ran a game at 1080, and fully ray-traced it, and then fed the ray-traced
images to a specially designed convolutional autoencoder neural network. The
motion vectors obtained from the game’s engine were then sent to and through the
network; they would get used later. The network then segmented and segregated key
elements, reassembled them, and created a 4 K output file. That file, along with a
sample file from a 16 K reference file, was then run through the network and iterated
a few times to arrive at the final finished, polished, fully ray-traced frames (Fig. 7.4).
The iteration eliminated the image noise ray tracing produces while processing.

Fig. 7.4 Data flow of Nvidia’s DLSS 2.0 process. Courtesy of Nvidia

Figure 7.4 shows an active component (the game during run time) and the offline
training.
DLSS achieved its image quality by using four inputs to arrive at the final frame
seen by the user:
1. The game engine’s base resolution image (e.g., 1080p) was rendered (far left in
the diagram).
2. The image generated by the game engine also produced motion vectors, which got
extracted (center of the diagram). Motion vectors informed the DLSS algorithm
about which direction objects in the scene were moving from frame to frame—
that data was used to direct the supersampling algorithm later.
3. The high-resolution output of the previous DLSS-enhanced frame (4 K) was then
created.
4. And an extensive data set of 16 K-resolution ground truth images Nvidia had
acquired from various game content was used to train the AI network running on
an Nvidia supercomputer.
A convolutional autoencoder AI network received the current 1080p base reso-
lution frame, motion vectors, and a previous high-resolution frame. It determined
what was needed to generate a higher resolution version of the current frame on a
pixel-by-pixel basis.
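A toy sketch of that per-frame data flow appears below. The function names and the simple blend at the end are inventions for illustration only; the real combination step is the trained convolutional autoencoder running on the tensor cores.

```python
import numpy as np

# Toy stand-ins for the DLSS 2.0 per-frame data flow; the blend at the end
# replaces the trained convolutional autoencoder running on the tensor cores.

def reproject(prev_high_res, motion_vectors):
    # Warp the previous high-resolution output toward the current frame using
    # per-pixel motion vectors (integer offsets here, for simplicity).
    h, w = prev_high_res.shape
    ys, xs = np.indices((h, w))
    src_y = np.clip(ys - motion_vectors[..., 1], 0, h - 1)
    src_x = np.clip(xs - motion_vectors[..., 0], 0, w - 1)
    return prev_high_res[src_y, src_x]

def upscale(low_res, factor):
    # Nearest-neighbour upscale standing in for the learned upsampling.
    return np.kron(low_res, np.ones((factor, factor)))

def dlss_like_frame(low_res, motion_vectors, prev_high_res, factor=2):
    history = reproject(prev_high_res, motion_vectors)   # temporal feedback
    current = upscale(low_res, factor)                   # spatial upsample
    return 0.7 * current + 0.3 * history                 # stand-in "network"

low = np.random.rand(540, 960)                  # base-resolution render
prev = np.random.rand(1080, 1920)               # previous high-res output
mv = np.zeros((1080, 1920, 2), dtype=int)       # motion vectors (static scene)
print(dlss_like_frame(low, mv, prev).shape)     # (1080, 1920)
```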
By examining the motion vectors and the prior high-resolution frame, the DLSS
algorithm could track objects from frame to frame. That information provided
stability for motion and reduced flickering, popping, and scintillation artifacts. That
process is known as temporal feedback, as it uses history to inform the algorithm
about the future (Fig. 7.5).
The DLSS algorithm had access to prior frames and motion vectors, which allowed
it to track each pixel and take samples of the same pixel from several frames (known
as temporal supersampling). That could deliver greater detail and edge quality than
traditional upscaling solutions.

Fig. 7.5 Nvidia's DLSS used motion vectors to improve the supersampling of the enhanced image.
Courtesy of Nvidia

Offline, during the training process, the output 4 K super-resolution image (from
the network) was compared to an ultra-high-quality 16 K reference image (referred
to above as the ground truth). Then the difference was communicated back into the
network so it could continue to learn and improve its results. The 16 K reference
images were from different types of game content (with and without ray tracing)
that Nvidia rendered to 16 K and then compared to the DLSS algorithm’s output. It
was an iterative learning cycle that ran until the network could consistently repro-
duce a similar image. The iteration was repeated tens of thousands of times on the
supercomputer until the network reliably outputted high-quality, high-resolution
images.
The DLSS algorithm learned to predict high-resolution frames with greater accu-
racy by training against a large dataset of 16 K-resolution images. And, through
continual training on Nvidia’s supercomputers, DLSS could learn how to deal with
new content classes—from fire to smoke to particle effects—at a rate that engineers
doing hand-coding of non-AI algorithms could not maintain (Fig. 7.6).
DLSS successfully exploited the tensor cores in GeForce RTX AIBs. These cores
could deliver up to 285 teraflops of dedicated AI processing. As a result, DLSS could
be run in real-time simultaneously with an intensive 3D game.

Fig. 7.6 An Nvidia demo of ray tracing used in a game. Courtesy of Nvidia

7.2.2.3 Mesh Shaders

In August 2018, at the ACM SIGGRAPH conference in Vancouver, Nvidia introduced
its Turing architecture and a new programmable geometric shading pipeline with
mesh shaders. The new shaders brought the compute programming model to the
graphics pipeline and used threads cooperatively to generate compact meshes (which
Nvidia named meshlets) within the GPU. Then the data was sent to the rasterizer and
out to the display. Applications and games dealing with high-geometric complexity
would benefit from the flexibility of the two-stage approach, which allowed efficient
culling, level-of-detail (LOD) techniques, and procedural generation. There is an
extensive discussion on mesh shaders in Book two, the GPU Environment—APIs.
Suffice it to say, mesh shaders opened up a new era in GPU image processing and
generation.
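As a rough illustration of the meshlet idea, the sketch below chops a triangle index list into groups small enough for a cooperative thread group to process. The limits used (64 vertices and 126 triangles per meshlet) are Nvidia's commonly cited guidance; the partitioning itself is deliberately naive, since real tooling also optimizes for vertex reuse and locality.

```python
# Naive meshlet builder: split a triangle list into meshlets bounded by a
# vertex budget and a triangle budget. Real builders also optimize reuse.

MAX_VERTS, MAX_TRIS = 64, 126

def build_meshlets(indices):
    """indices: flat list of vertex indices, three per triangle."""
    meshlets, verts, tris = [], {}, []
    for i in range(0, len(indices), 3):
        tri = indices[i:i + 3]
        new_verts = [v for v in tri if v not in verts]
        # Close the current meshlet if this triangle would bust a budget.
        if len(verts) + len(new_verts) > MAX_VERTS or len(tris) + 1 > MAX_TRIS:
            meshlets.append({"vertices": list(verts), "triangles": tris})
            verts, tris = {}, []
            new_verts = tri
        for v in new_verts:
            verts[v] = len(verts)               # local (meshlet-relative) index
        tris.append([verts[v] for v in tri])    # triangle in local indices
    if tris:
        meshlets.append({"vertices": list(verts), "triangles": tris})
    return meshlets

# Example: a strip of 200 triangles over 202 vertices.
idx = [v for t in range(200) for v in (t, t + 1, t + 2)]
print(len(build_meshlets(idx)), "meshlets")      # 4 meshlets
```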

7.2.3 Summary

Nvidia showed that the GPU could evolve to meet the dream of photo-realism and
interactive frame rates. Adding additional specialized processors or engines to the
GPU was not a new concept. Even the earliest graphics controllers had video codecs,
audio, and unique function accelerators. The Turing GPU added AI and dedicated ray
tracing cores to the ever-expanding GPU. The Turing processor was a revolutionary
product and set the threshold for what other GPU suppliers would have to meet.
The Nvidia Turing GPU was a breakthrough device. It had the most shaders and
was the largest chip made at the time. It was designed for two markets, gaming and
the data center, which meant it included parts not needed by each segment. That gave Nvidia
the benefit of economy of scale but inflated the price. Nonetheless, the GPU was
very successful for Nvidia.

7.3 Intel–Xe GPU (2018)

As described in earlier chapters, Intel made several attempts at launching a graphics
processor. In 2017, the company kicked off another GPU project code named Xe.
In November 2017, Intel surprised the world, especially AMD, and announced it
had hired AMD’s Radeon Technologies Group leader (and chili pepper aficionado)
Raja Koduri. The implications were clear. In 2018, Intel hired several other AMD
people. Intel unofficially announced the company would introduce a discrete GPU,
possibly at CES 2019.
At the IEEE International Solid-State Circuits Conference (ISSCC) in February
2018 in San Francisco, Intel revealed an energy-efficient graphics processor featuring
fine-grain dynamic voltage and frequency scaling with integrated voltage regulators,
an execution-unit turbo, and retentive sleep in 14 nm tri-gate CMOS. That was, or
maybe could be, a GPU prototype based on the 14 nm process. The prototype chip
had 1.5 billion transistors and a frequency range from 50 MHz at 0.51 V to 400 MHz
at 1.2 V.
In June 2018, Intel confirmed it planned to launch a discrete GPU in 2020, code
named Arctic Sound. The chip would span from data centers to entry-level PC gaming
applications. Intel also announced a GPU-compute device, code named Ponte Vecchio,
for supercomputers.
Later at Intel’s Architecture Day in December 2018, Koduri announced Intel’s
new Xe GPU architecture would scale from integrated to GPU compute. The first
mobile version would appear in 2019, he said. The industry assumed Intel would
announce it at CES. CES came and went, and then word came out that Intel would
release the Xe -based GPU sometime in 2020. During the Stuttgart FMX conference
in May 2019, Intel announced that high-end visualization systems would use Xe and
support ray tracing. Then at Intel’s investor meeting that same week, Intel said it
would manufacture Xe in 7 nm and bring it out in 2021. The first Xe GPU would
come as a general-purpose variant built on the 10 nm process, date unspecified.
In October 2019, Intel reported (via Twitter) that it had tested its first dGPU, code
named DG1. And, in late October, it announced the first dGPU product, the Intel
iRISXe Max.
In 2020, the company introduced its 11th gen CPU (code named Tiger Lake) with
Xe iGPU. Intel said the architecture of the 11th gen iGPU with 96 EUs was the same
as would be in all future Xe dGPUs.
And in November 2021, Intel introduced Alder Lake, its 12th Gen CPU. In January
2022, the new processors were available with an Xe iGPU.
The Gen12 Xe LP media engine was the same as in Tiger Lake but moved to Intel's
7 nm process, giving it more headroom. It came in two versions: the GT1 with 32
EUs for desktops and the GT2 version with 96 EUs for notebooks. The desktop
iGPUs had 33% more EUs than the Gen9.5 UHD 630 Graphics, but that was less
than the 96 EUs found in 11th Gen Tiger Lake. The UHD 770 engine had clock rates
from 1450 to 1550 MHz.

Intel said the Xe LP engine would support 1080p gameplay and had a 12-bit
video pipeline end-to-end. The desktop models did not have an image processing
unit (IPU); those features were only available on mobile devices.
Alder Lake’s integrated GPU could drive up to five display outputs (eDP, dual
HDMI, and Dual DP++), and offered the same encoding/decoding features as both
Rocket Lake and Tiger Lake CPUs, including AV1 8-bit and 10-bit decode, 12-bit
VP9, and 12-bit HEVC.

7.3.1 Intel’s Xe Max (2020)

Intel has a long history in PC graphics chips beginning in 1983, and, in late 2020, it
announced a new discrete GPU (dGPU), the Xe Max. The company had taken several
runs at building a discrete graphics chip to take on the market leaders, but it had a
challenging time. The company never seemed to address building a dGPU with the
same seriousness and resources as the x86 CPU.
That attitude changed when Intel plotted the development and introduction of its
Xe dGPU line. It was careful to communicate its confidence while keeping a tight
grip on its messaging. Intel finally announced a thin and light notebook discrete
GPU (dGPU), the iRISXe Max, on Halloween, and it may have been scary news for some
of the incumbents.
The basic specifications of the 2020 mobile dGPU are in Table 7.1.
Intel paired the iRISXe Max dGPU with its 11th-gen Intel Core mobile processors.
The company claimed the new dGPU delivered additive AI. Additive meant both
GPUs (the new dGPU and the CPU’s iGPU) could work together on inferencing and
rendering. And that, said Intel, can speed up content creation workloads as much as
seven times.
Intel compared its first product against a 10th-gen Intel Core i7-1065G7 with an
Nvidia GeForce MX350.
The 2020 dGPU offered Hyper Encode for up to 1.78 times faster encoding than
a high-end desktop graphics AIB. For that test, Intel used a 10th-gen Intel Core
i9-10980HK with Nvidia GeForce RTX 2080 Super.
Additionally, iRISXe Max worked with Intel’s Deep Link. Deep Link enabled
dynamic power-sharing. With Deep Link, the CPU could have all the power and
thermal resources dedicated to it when the discrete graphics was idle, resulting, said
Intel, in up to 20% better CPU performance.
Intel was no stranger to graphics. It had deep and hard-won experience. Intel had
some of the finest graphics engineers in the business. And yet, with the best fabs,
a bank account that others could only fantasize about, and a brand that could sell
used eight-track players, the company had continuously failed to launch a successful
discrete graphics processor product line (Fig. 7.7).
For all that, Intel was the largest seller of integrated graphics processors and
shipped more GPUs than all its competitors combined. Maybe another company
would have been happy with that accomplishment, but not Intel.

Table 7.1 Intel’s 2020 discrete mobile GPU


Technical specifications
Product name Intel iRISXe max graphics
EUs 96
Frequency 1.65 GHz
Lithography 10 nm SuperFin
Graphics memory type LPDDR4x
Graphics capacity 4 GB LPDDR4X
Graphics memory bus width 128
Graphics memory bandwidth 68 GB/s
PCI Express Gen 4
Al support Intel DLBoost: DP4A *; see section “Conclusion” below
Media 2 Multi-format Codec (MFC) engines
Intel deep link technology Yes
Number of displays supported eDP 1.4b, DP 1.4, HDMI 2.0b
Graphics output 4096 × 2304 at 60 Hz (HDMI/eDP)
Max resolution 7680 × 4320 at 60 Hz DP
Pixel depth 12-bit HDR
Graphics features Variable rate shading, Adaptive Sync, Async Compute
DirectX support (Beta)
OpenGL support 12.1
OpenCL support 4.6
Power 25 W
Courtesy of Intel

In most of Intel's previous adventures with graphics, specifically the discrete
graphics parts, scaling was an issue. The products that made it to the market were spot
products with average-to-mediocre performance. They had no headroom and no way
to scale. Intel’s last three generations of iGPUs were different. They demonstrated
scaling and process exploitation very well.
Intel said it would offer Xe dGPUs from entry-level (like the iRISXe Max) up to
supercomputer accelerators like Ponte Vecchio (Fig. 7.8).
That looked good on paper, but it was almost impossible to accomplish. The only
thing the original iRISXe Max and Ponte Vecchio might share was the basic ALU in
the scaler. Caches, memory controllers, bus managers, video outputs, clock gating,
and many other parts are different from one product segment to the next.
However, the DG1 iRISXe Max's roots were in the iGPUs of Intel's 10th, 11th,
and 12th gen CPUs, as shown by the type of memory used by iRISXe Max: LPDDR4X
instead of GDDR.
No single transistor is optimal across all design points.

Fig. 7.7 History of Intel graphics devices

Intel offered a more robust dGPU, DG2, in early 2021. It was a desktop part,
and versions had 128 and 512 EUs and used GDDR6. But remember, there were
four discrete segments in the desktop dGPU market: low-end, mid-range, high-end,
and workstation. Meeting the demands of each of those segments required a design
with the ability to scale. Scalability was one of the key issues that killed the
Larrabee project.
Intel acknowledged the issue:
“No single transistor is optimal across all design points,” said chief architect Raja Koduri.
“The transistor we need for a performance desktop CPU, to hit super-high frequencies, is
very different from the transistor we need for high-performance integrated GPUs.”

Fig. 7.8 Intel’s product range for GPUs. Courtesy of Intel

This time, however, things were going to be different. Since Larrabee, Intel had
developed its Embedded Multi-Die Interconnect Bridge (EMIB).
Before EMIB, designers put heterogeneous dies onto a single package using an
interposer. Designers and engineers used multiple dies for maximum performance or
feature set. The interposer had wires through the substrate for communication.
Through-silicon vias (TSVs) passed through the interposer into a substrate, which formed
the package's base, an arrangement often referred to as 2.5D packaging.
EMIB abandoned the interposer in favor of tiny silicon bridges embedded in the
substrate layer (Fig. 7.9). The bridges contained microbumps that enabled die-to-die
connections. Intel demonstrated it with a Stratix FPGA implementation.
Silicon bridges are less expensive than interposers. One of Intel’s first products
with embedded bridges was Kaby Lake. Laptops based on Kaby Lake G were consid-
ered expensive. However, they demonstrated Intel’s EMIB would work with hetero-
geneous dies in one package. For one thing, it consolidated valuable board space. It
could also improve performance and reduce cost compared to discrete components.

Fig. 7.9 EMIB created a high-density connection between the Stratix 10 FPGA and two transceiver
dies. Courtesy of Intel

Fig. 7.10 Intel plans to span the entire dGPU market. Courtesy of Intel

Kaby Lake G used dies from three different foundries. That was the foundational
work Intel had done for chiplet designs, and chiplets were Intel’s plan for scaling
Xe dGPU processors (tiles) for the various segments. Additionally, since Intel was
building DG2 in 7 nm at TSMC, multi-die, multi-vendor interoperability was critical.
Intel referred to it as the Advanced Interface Bus (AIB) between its core fabric
and each tile.
In late 2018, Intel introduced its Foveros technology, a 3D stacking approach that
allowed it to pick the best process technology for each layer in a stack. The Lakefield
processor had the first implementation of Foveros. It incorporated processing cores,
memory control, and graphics using a 10-nm die. That chiplet sat on top of the base
die, which included the functions usually found in a platform controller hub (e.g.,
audio, storage, PCIe). Intel used low-power 14 nm for those processors. Microbumps
connected power and communications through TSVs in the base die. Intel then put
LPDDR4X memory from one of its partners on the top of the stack (Fig. 7.10).
Intel's Xe may become what the company promised in 2019: a scalable archi-
tecture that could satisfy everything from high-end GPU compute to low-end thin and
light notebooks [12, 13], with a common architecture that could share one driver and
live atop Intel's oneAPI concept [13].

7.3.2 Intel’s dGPU Family (2021)

From 2016 to 2021, Intel had a succession of manufacturing failures at its Hills-
boro research and manufacturing center that delayed three generations of new Intel
processors. At the same time, AMD introduced a new range of powerful x86 CPUs.
Intel’s difficulties gave AMD an opening to overtake Intel with a more advanced
processor, costing Intel sales and profits.
7.3 Intel–Xe GPU (2018) 341

Fig. 7.11 Pat Gelsinger,


Intel’s CEO. Courtesy of
Intel

In January 2021, Intel acted. The company recruited Pat Gelsinger, Intel's former
CTO, who had joined the company back in 1979 and left Intel in 2009 to take over EMC
and later VMware (Fig. 7.11).
During Gelsinger’s earlier time at Intel, he launched Intel’s annual developer
conference in 1997. In 2017, former CEO Brian Krzanich scrapped it. When
Gelsinger returned, he revived the event, which was virtual in 2021 due to the
pandemic. Announcing the conference, Gelsinger said, “The geek is back at Intel”
[14].
After Gelsinger's return, Intel hired hundreds more workers in preparation for
a $3 billion Hillsboro factory expansion planned for 2022. The company had also
been expanding its employment at its Folsom California facility, where the central
dGPU Xe team was located.

7.3.3 DG1

At CES in January 2020, Intel showed off its DG1 (discrete graphics one, Fig. 7.12)
Xe-based AIB. Intel built it using a next-gen integrated GPU removed from its CPU
and made into a discrete part. As such, it used conventional DDR RAM, not the high-
bandwidth GDDR that had been developed specifically for graphics processors.
Its performance was not much better than an iGPU, but that was not the point.
The architecture of the GPU was from Intel’s Xe design, and, as such, the developers
could use it to run test code and it established the DG product name.
In June 2021, while Intel was preparing to launch its Xe series of GPUs, Intel’s
chief architect Raja Koduri tweeted about the DG2 (discrete graphics two) chip and
showed a picture [15]. It was built from the foundation created by the Intel DG1 and
was presumed to have improved performance.

Fig. 7.12 Intel’s DG1 AIB. Courtesy of Intel

At its re-established Architecture Day virtual conference in August 2021, the
company announced the product span and new naming nomenclature.

7.3.3.1 Hello Arc, Goodbye DG

The Arc client graphics road map included Alchemist (previously known as DG2).
Intel decommissioned the DG# naming and renamed the line Arc. Then it added generational
subsets such as Alchemist, Battlemage, Celestial, and Druid.
During the presentations, Koduri said, “We added a fourth microarchitecture to
the Xe family: Xe HPG optimized for gaming, with many new graphics features
including ray tracing support. We expect to ship this microarchitecture in 2021”
[16]. Figure 7.13 shows Intel’s road map.
Koduri said Intel’s Xe -HPG architecture employed energy-efficient processing
blocks from the Xe -LP architecture and high-frequency optimizations developed for
Xe -HP/Xe -HPC GPUs for data centers and supercomputers. The GPUs had high
bandwidth internal interconnections, a GDDR6-powered memory sub-system, and
hardware-accelerated ray tracing support. In a bold move, Intel manufactured the
DG2 family GPUs at TSMC.
The Xe -HPG microarchitecture powered the Alchemist family of SoCs, and the
first related products came to market in the first quarter of 2022 under the Intel
Arc brand. The Xe -HPG microarchitecture featured a new core, a compute-focused,
programmable, and scalable element.

Fig. 7.13 Intel’s Xe Arc HPG road map circa 2021. Courtesy of Intel

The scalability of the Xe architecture spanned from iGPUs, which were classified
Xe LP, to high-performance gaming optimized GPUs, classified as Xe HP, and up to
GPU-compute accelerators designated Xe HPC.


The Arc graphics products were based on the Xe -HPG microarchitecture
(Fig. 7.14). It combined Intel’s Xe LP, HP, and HPC microarchitectures, delivering
scalability and computing efficiency with advanced graphics features. The first gener-
ation of Intel Arc products, Alchemist, featured hardware-based ray tracing and arti-
ficial intelligence-driven supersampling and complete compatibility with DirectX 12
Ultimate [17].
Intel's Alchemist GPU code name was akin to AMD's Navi or Nvidia's
Ampere code names.
The Xe -core contained 16 Vector Engines (formerly called Execution Units), and
each operated on a 256-bit SIMD kernel and data. The vector engine could process
eight FP32 instructions simultaneously in what was traditionally called a GPU core.
At the InnovatiON event in late October 2021, Intel re-introduced its family of
discrete GPUs. Intel confirmed its Arc Alchemist GPU had 32 Xe cores. Each Xe
core had 16 Vector Engines and 16 Matrix Engines, equaling 512 EUs or 4096 shader
cores. TSMC built the chips at its N6 (refined N7) fab in Taiwan.
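Those figures are internally consistent, as the short arithmetic below shows: each 256-bit vector engine provides eight 32-bit lanes, and 32 Xe cores of 16 vector engines give the quoted 512 EUs and 4096 shader cores.

```python
# Checking the Arc Alchemist shader-count arithmetic quoted above.
vector_width_bits = 256
fp32_bits = 32
lanes_per_engine = vector_width_bits // fp32_bits       # 8 FP32 ops per clock

xe_cores = 32
vector_engines_per_core = 16

total_eus = xe_cores * vector_engines_per_core          # 512 "EUs"
total_fp32_lanes = total_eus * lanes_per_engine         # 4096 shader cores
print(total_eus, total_fp32_lanes)                      # 512 4096
```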
Intel announced two chips in the Arc Alchemist line, and a possible third was
expected between them that used a binned part from the larger die.
Table 7.2 shows the specifications of the Intel Arc Alchemist mobile models
announced in March 2022.
The new Xe -based GPUs included matrix engines (referred to by Intel as Xe
Matrix eXtensions, XMX). The XMX could accelerate AI workloads such as XeSS,
Intel’s upscaling technology that enabled high-performance and high-fidelity gaming
(Fig. 7.15).

Fig. 7.14 Intel’s HPG

XMX was Intel’s branding for tensor cores, and Intel used them to overcome the
classic performance-quality conflict illustrated in Fig. 7.15.

7.3.3.2 Intel’s Supersampling (XeSS)

Intel was confident in the capabilities of its Xe supersampling (XeSS), shown in
Fig. 7.16, and predicted it would meet the sweet spot of quality and performance at
reasonable render rates.
XeSS was like AMD's FSR and Nvidia's DLSS. It started with a low-res HD
image and used AI to determine the in-between values (the tweens) needed to scale it up to 4 K. At the conference
in 2021, Intel said it would make the code and tools available for developers in an
SDK (Fig. 7.17).
Xe SS took advantage of the Alchemist’s XMX AI acceleration to do the upscaling.
XMX AI used deep learning to synthesize images close to the quality of native
high-resolution rendering. Using Xe SS, games that would only be playable at lower-
quality settings or lower resolutions could run smoothly at higher quality settings
and resolutions.
Xe SS worked by reconstructing sub-pixel details from neighboring pixels and
motion-compensated previous frames. A neural network trained to deliver high
performance and excellent quality performed reconstruction, with up to twice the
performance. Intel said the Xe SS worked on a broad set of hardware, including
integrated graphics, by leveraging the DP4a instruction set. Intel said several game
developers had engaged Xe SS. The SDK for the initial XMX version was available
for ISVs in November 2021, and the DP4a version was available later.
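For readers unfamiliar with DP4a, it is a packed dot-product instruction: each issue multiplies four 8-bit integers against four others and accumulates the result into a 32-bit sum. The minimal sketch below shows the operation the DP4a path of XeSS builds on.

```python
# What a single DP4a instruction computes: a 4-way int8 dot product
# accumulated into a 32-bit integer.
def dp4a(a, b, acc=0):
    assert len(a) == len(b) == 4          # four packed 8-bit values each
    return acc + sum(x * y for x, y in zip(a, b))

print(dp4a([1, 2, 3, 4], [5, 6, 7, 8]))   # 70
```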

Table 7.2 Intel’s Arc alchemist mobile dGPU product line


Arc entry 3 Arc mid-range 5 Arc high-end 7
GPU A350M A370M A550M A730M A770M
Process (nm) TSMC N6 TSMC N6 TSMC N6 TSMC N6 TSMC N6
Transistors 7.2 7.2 21.7 (partial) 21.7 21.7
(billion)
Die size (mm2 ) 157 mm2 157 mm2 406 mm2 406 mm2 406 mm2
Vector engines 96 128 256 384 512
GPU cores (FP32 768 1024 2048 3072 4096
ALUs)
GPU Clock 1.15 1.55 0.9 1.1 1.65
(GHz)
RT cores 6 8 16 24 32
VRAM speed 14 14 14 14 16
(Gbps)
VRAM (GB) 4 4 8 12 16
GDDR6
Bus width 64 64 128 192 256
ROPS 24 32 64 96 128
TMUs 48 64 128 192 256
TFLOPS 1.8 3.2 3.7 6.8 13.5
Bandwidth 112 112 224 336 512
(GB/s)
TBP (watts) 25–35 35–50 60–80 80–120 120–150
Launch date Q1 2022 Q1 2022 Early summer Early summer Early summer
Launch price $199 $399 $599
Courtesy of Intel

7.3.4 Summary

Intel assembled a massive team of engineers and marketing people from all over
the company and the industry. They also had a national lab commitment hanging
over their heads, and they could not afford to default on that; the repercussions and
embarrassment would be too great. One of Intel’s challenges was learning how to
deal with an outside fab. The cultural and procedural processes were so different
from the Intel way.
The other thing that Intel had to deal with was backward compatibility. That was
one of the things that killed Larrabee. Intel was not some start-up that went to market
just because it got some sample chips that worked. Intel was not going into the dGPU
market; it was going into the AIB market. The supply chain, QA, marketing, tech-
support, legal, and standards compliance issues were huge, and one could not do
everything overnight.

Fig. 7.15 Classic performance versus quality relationship

Fig. 7.16 Intel had a new SDK for its recent supersampling and scaling algorithm. Courtesy of
Intel

7.4 AMD Navi 21 RDNA 2 (October 2020)

In late 2020, AMD introduced its latest Radeon boards based on the Navi 21 GPU
architecture. There had been rumors, and AMD gave hints about Big Navi for over
a year. The consensus of the industry was that it was worth the wait.

Fig. 7.17 Example of the quality of Intel’s Xe SS—notice the Caution sign. Courtesy of Intel

AMD had not been a contender in the high-end sector for a while. As a result,
Nvidia had been enjoying the enthusiast space to itself. Big Navi changed the land-
scape. Not only was the Radeon RX 6800 XT a powerful competitor, but AMD had
become a much stronger company. AMD could mount a robust marketing program
to back up the product. That was an element that had been missing in the past.
The heart of the Navi 21 was AMD’s RDNA 2. It, said AMD, featured significant
architecture advancements from previous RDNA architecture. It had an enhanced
compute unit and a new visual pipeline with ray accelerators. The company claimed
up to 1.54× higher performance-per-watt on various games they tested. Additionally,
RDNA 2 offered a 1.3× higher frequency at the same per-CU power than an RX
5700 XT. The RDNA 2 provided DXR, VRS, and AMD’s FidelityFX capabilities.
Compare the original 2019 RDNA (discussed in the chapter on the third-to-fifth eras)
with the expanded and evolved version in Fig. 7.18, just one year later.
AMD also introduced its Infinity Cache that sped up performance. The company
said it focused on delivering breakthrough speeds with power efficiency in RDNA 2.
The on-die cache resulted in frame data with much lower energy per bit, said AMD.
According to AMD, the 128 MB Infinity Cache provided up to 3.25 × effective
bandwidth of 256-bit GDDR6. And, when adding power to the equation, it achieved
up to 2.4 × more effective bandwidth/watt versus 256-bit GDDR6 alone.
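Taking the figures in Table 7.3 at face value, the arithmetic behind that claim is straightforward; the 16 Gbps GDDR6 data rate used below is implied by the 512 GB/s figure on a 256-bit bus rather than stated explicitly here.

```python
# Effective-bandwidth arithmetic behind AMD's Infinity Cache claim.
bus_width_bits = 256
gddr6_gbps = 16                      # implied by 512 GB/s over a 256-bit bus
raw_bandwidth = bus_width_bits * gddr6_gbps / 8
print(raw_bandwidth)                 # 512 GB/s

effective = raw_bandwidth * 3.25     # AMD's claimed multiplier with 128 MB cache
print(effective)                     # 1664 GB/s effective
```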
Also new to the AMD RDNA 2 compute unit was the ray accelerator (Fig. 7.19).
Ray accelerators, said AMD, provided a massive acceleration for intersecting rays.
The RX 6800 XT had 72 ray-accelerator units. Each ray accelerator could calculate
up to four ray/box intersections or one ray/triangle intersection every clock. The ray
accelerators calculated the intersections of the rays with the
scene geometry in a bounding volume hierarchy. Then it sorted them and returned
the information to the shaders for further scene traversal or result shading.
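The two operations the ray accelerator performs are standard intersection tests. The sketch below shows software versions of both, a slab-method ray/axis-aligned-box test and a Möller-Trumbore ray/triangle test, purely to illustrate the arithmetic the fixed-function unit replaces.

```python
# Software versions of the two tests a ray accelerator handles each clock:
# a slab-method ray/box test and a Moller-Trumbore ray/triangle test.

def ray_box(origin, inv_dir, box_min, box_max):
    # inv_dir holds 1/direction per axis; a large value stands in for 1/0.
    tmin, tmax = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t1, t2 = (lo - o) * inv, (hi - o) * inv
        tmin, tmax = max(tmin, min(t1, t2)), min(tmax, max(t1, t2))
    return tmin <= tmax                       # True if the ray hits the box

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-7):
    def sub(a, b): return [x - y for x, y in zip(a, b)]
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def cross(a, b): return [a[1]*b[2] - a[2]*b[1],
                             a[2]*b[0] - a[0]*b[2],
                             a[0]*b[1] - a[1]*b[0]]
    e1, e2 = sub(v1, v0), sub(v2, v0)
    h = cross(direction, e2)
    det = dot(e1, h)
    if abs(det) < eps:
        return None                           # ray parallel to the triangle
    f, s = 1.0 / det, sub(origin, v0)
    u = f * dot(s, h)
    if not 0.0 <= u <= 1.0:
        return None
    q = cross(s, e1)
    v = f * dot(direction, q)
    if v < 0.0 or u + v > 1.0:
        return None
    t = f * dot(e2, q)
    return t if t > eps else None             # hit distance, or None for a miss

# A ray pointing down -Z hits the unit box and a triangle lying in the z = 0 plane.
print(ray_box((0.5, 0.5, 5.0), (1e12, 1e12, -1.0), (0, 0, 0), (1, 1, 1)))          # True
print(ray_triangle((0.2, 0.2, 5.0), (0, 0, -1), (0, 0, 0), (1, 0, 0), (0, 1, 0)))  # 5.0
```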

Fig. 7.18 AMD RDNA 2 Big Navi GPU

Fig. 7.19 AMD RDNA 2 compute unit



Variable Rate Shading (VRS) adjusted the shading rate for different regions of
an image. VRS was initially developed for gaze tracking in VR HMDs for foveated
rendering, where the GPU could concentrate the rendering work where it was needed most.
VRS applied higher shading to the most complex parts of the image. These areas
usually had the most important visual cues in an image. Nvidia introduced the concept
in its Turing architecture.
AMD said its RDNA 2 VRS functionality ran throughout the entire pixel pipe
and offered shading rates of 1 × 1, 2 × 1, 1 × 2, and 2 × 2.
The RDNA 2 VRS provided a unique shading rate to every 8 × 8 region of pixels.
That granularity enabled developers to make appropriate decisions about the shading
rate for a given region.
The RX 6800 XT included the Radeon Media Engine. It offered hardware-
accelerated encode/decode capabilities. The engine was compatible with popular
codecs, such as H.264, H.265, VP9 (decode only), and AV1 (decode only).
With the introduction of the RX 6800 XT, AMD resurrected its Rage branding
introduced in 1996. The ATI Rage series was the first 3D graphics accelerator devel-
oped by ATI and it ushered in a new era of PC gaming. To remind the industry of
ATI’s early triumph and to underline the importance and confidence AMD felt for
the RX 6800 XT AIBs, AMD introduced a new Rage Mode for the Radeon RX 6800
XT AIBs.
Rage Mode was one of three one-click, performance-tuning presets available on
the Radeon RX 6800 XT, along with Quiet and Balanced. Those presets automatically
adjusted power and fan levels to allow quick and easy customization of the GPU’s
behavior.
AMD released the Radeon 6x00 series AIB at the height of the crypto coin demand
for GPUs and AIBs and the demand from the pandemic. Prices soared, users were
frustrated, and only the middlemen and scalpers made a hefty profit (Table 7.3).
The GPU was also significant for its physical size: the Nvidia A100 was 1.6 times
larger, with a correspondingly higher shader count (6912 versus 4608), yet the AMD GPU
delivered 20.7 TFLOPS to the Nvidia A100's 19.49. So with roughly 40% less silicon, AMD
was able to squeeze out an equivalent amount of TFLOPS.

7.4.1 AMD Ray Tracing (October 2020)

In 2020, AMD and its console partner Sony announced AMD’s customized APUs
were in the new Sony PS5, and they also had ray tracing intersection shaders and
could do real-time ray tracing (RTRT) on consoles and PCs. In February 2021, AMD
announced that its RDNA2-based Radeon RX 6000 XT AIB could do RTRT.
With the RX 6000 series GPUs, AMD introduced a fixed-function state
machine, an intersection detection engine, into the texture processor. It was a hybrid
software-hardware approach to ray tracing, and AMD said it improved upon purely
hardware-based solutions.

Table 7.3 AMD’s Radeon series AIBs


Radeon RX 6800 XT Radeon RX 6800 RX 5700 XT
Architecture RDNA 2 RDNA 2 RDNA
Manufacturing 7 nm 7 nm 7 nm
process
Transistor count 26.8 billion 26.8 billion 10.3 billion
Die size 519 mm2 519 mm2 251 mm2
Compute units 72 60 40
Ray accelerators 72 60
Stream processors 4608 3840 2560
Game GPU clock Up to 2015 MHz Up to 1815 MHz Up to 1755 MHz
Boost GPU clock (up Up to 2250 MHz Up to 2105 MHz Up to 1905 MHz
to)
Peak single precision Up to 20.74 TFLOPS Up to 16.17 TFLOPS Up to 9.75 TFLOPS
performance
Peak half precision Up to 41.47 TFLOPS Up to 32.33 TFLOPS Up to 19.5 TFLOPS
performance
Peak texture fill-rate Up to 648.0 GT/s Up to 505.2 GT/s Up to 304.8 GT/s
ROPS 128 96 64
Peak pixel fill-rate Up to 288.0 GP/s Up to 202.1 GP/s Up to 121.9 GP/s
AMD infinity cache 128 MB 128 MB
Memory (up to) 16 GB GDDR6 16 GB GDDR6 8 GB GDDR6
Memory bandwidth 512 GB/s 512 GB/s 448 GB/s
Memory interface 256-bit 256-bit 256-bit
Board power 300 W 250 W 225 W
Courtesy of AMD

In AMD's patent, the company said the hybrid approach addressed the issues with
hardware-only or software-only solutions: it used a shader unit to schedule the
processing while doing fixed function acceleration for a single node of the bounding
volume hierarchy tree [18]. AMD said it could preserve flexibility by controlling the overall
calculation with the shader. It could bypass the fixed function hardware when necessary
while still getting the performance advantage of fixed function hardware. Additionally,
the texture processor infrastructure eliminated the large buffers for ray storage and BVH
caching typically required in a hardware ray tracing solution. The vector general-purpose
registers and texture cache could be used in their place. That saved die area and reduced
the complexity of the hardware solution.
AMD used a specialized fixed-function hardware ray intersection engine to handle
bounding volume hierarchy intersections (BVH calculations run through a stream
processor in software have to deal with execution divergence and time-consuming
corrections). AMD's fixed function hardware was simpler than
Nvidia's RT cores and was in parallel with the texture filter pipeline in the GPU's
texture processor.

Fig. 7.20 AMD's intersection block diagram

The system included an interconnected texture processor (TP), shader, and cache. The TP had a texture
address unit (TA), a texture cache processor (TCP), a filter pipeline unit, and a ray
intersection engine.
The shader sent a texture instruction, which contained ray data and a pointer to
a BVH node, to the texture address unit (TAU). The TCP used an address provided
by the TAU to fetch BVH node data from the cache. The ray intersection engine
performed ray-BVH node type intersection testing using the ray and BVH data. The
intersection testing indications and results for BVH traversal were returned to the
shader via a texture data return path. The shader reviewed the intersection results
and indications to decide how to traverse to the next BVH node (Fig. 7.20).
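The division of labor described above, shader-driven traversal wrapped around fixed-function intersection, can be pictured as a loop like the toy sketch below. Everything here is invented for illustration and reduced to one dimension; it shows only the control flow, not AMD's actual interface.

```python
# Toy 1-D analogue of shader-controlled BVH traversal. The per-node tests are
# the "fixed-function" steps; the surrounding loop is what the shader runs.
# Node layout and names are invented for illustration; this is not AMD's ISA.

def box_test(ray_range, node):
    # Stand-in for the ray engine's box test: does the ray's [t_min, t_max]
    # range overlap the node's extent along the ray?
    lo, hi = node["extent"]
    return lo <= ray_range[1] and hi >= ray_range[0]

def traverse(ray_range, root):
    stack, closest = [root], None
    while stack:                               # the shader drives traversal
        node = stack.pop()
        if "children" in node:                 # interior (box) node
            if box_test(ray_range, node):      # fixed-function box test
                stack.extend(node["children"])
        else:                                  # leaf node
            t = node["hit_t"]                  # distance a triangle test would return
            if t is not None and ray_range[0] <= t <= ray_range[1]:
                if closest is None or t < closest:
                    closest = t                # keep the nearest hit
    return closest                             # handed back for result shading

scene = {"extent": (0.0, 10.0), "children": [
    {"hit_t": 6.5},
    {"extent": (1.0, 5.0), "children": [{"hit_t": 3.2}]},
]}
print(traverse((0.0, 10.0), scene))            # 3.2
```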
Using fixed function acceleration for a single node of the BVH tree was a hybrid
approach. Utilizing a shader unit to schedule the processing eliminated the issues associ-
ated with hardware-only or software-only solutions.
Because the shader still controlled the overall calculation and could bypass the
fixed function hardware, flexibility was preserved, and fixed function hardware’s
performance advantage was still realized. Additionally, using the texture processor
infrastructure, large buffers for ray storage and BVH caching were eliminated as the
existing VGPRs and texture cache could be used in its place, which saved die area
and complexity of the hardware solution.
The ray tracer could provide the anti-aliased front end for the FSR in the 6000 XT
series (although, as mentioned, it was not required). The FSR used conventional anti-
aliasing techniques for older GPUs, so the performance was not the same. Moreover,
it was not the same on an Nvidia GPU because of the front-end processing.

7.4.2 FidelityFX Super Resolution (March 2021)

In March 2021, AMD announced an upgrade called FidelityFX. The company added
hardware acceleration and scaling to its GPUs and AIBs to enhance and speed up
gaming. The company offered the enhancement to developers via its GPUOpen
program. That made it useful for any GPU of almost any vintage. AMD had branded
it FidelityFX Super Resolution (FSR) [19].
Raster-scan rendering speed was governed by the number of polygons in the image
and was not affected much by screen resolution. Ray tracing was the opposite: it
functioned almost independently of the polygon count and was challenged more by
screen resolution.
Nvidia developed a method of reducing the image’s resolution, using AI filtering,
and then scaling it up to speed up ray tracing’s real-time capabilities. AMD used a
similar approach but without AI filtering and its accompanying overhead. AMD’s
approach is explained in more detail below.
AMD’s FSR was an open-source, cross-platform technology designed to increase
frame rates and, at the same time, deliver high-quality, high-resolution gaming expe-
riences. AMD offered the following pipeline description of its process (Fig. 7.21).
The spatial upscaling technology utilized an algorithm to analyze features in
the source image. It then performed edge reconstruction and recreated the images
at a higher target resolution. Then the image was run through a sharpening filter,
which further improved image quality by enhancing texture details. However, the
sharpening step added edge noise and other artifacts. To fix that, AMD followed the
sharpening pass with a post-processing step to compensate for chromatic aberration
effects, film grain, and other clean-up functions. AMD said its FidelityFX Super
Resolution (FSR) could increase frame rates by as much as two and a half times.
AMD claimed the results produced an image with super high-quality edges
and distinctive pixel detail—especially when compared to other basic upscaling
techniques available at the time.
Our goal with FidelityFX Super Resolution was to develop an advanced, world-class
upscaling solution based on industry standards that can quickly and easily be implemented by
game developers, and is designed for how gamers really play games, said Scott Herkelman,
corporate vice president and general manager, Graphics Business Unit at AMD. FSR is the
industry’s ideal upscaler – it does not require any specialized, proprietary hardware and is
supported across a broad spectrum of platforms and ecosystems [20].

FSR had four quality settings: Ultra Quality, Quality, Balanced, and Performance.
The balance between image quality and performance was adjustable by the user. In
the Ultra Quality mode (see Table 7.4), AMD claimed that the FSR image quality was
almost indistinguishable from the native resolution. Native resolution referred to the
game's image quality at the monitor's advertised resolution, without any sharpening
filters or upscaling techniques.

Fig. 7.21 AMD's FSR pipeline

When added to a game, one would get very close to full image fidelity (quality)
of any target resolution (AMD said 1440p and 4 K were the best examples) from the
upscaling of FSR. FSR rendered at a lower resolution for improved performance,
and then the image was upscaled and sharpened to get back to the target resolution,
with near-native image quality. There was minimal image quality impact using FSR.
AMD said it had worked with game developers and studios to develop FSR and
claimed over 40 developers pledged to support and integrate FSR into their games
and game engines. It was a small patch and did not require special AI hardware or
training.
Dan Ginsburg, a graphics developer at Valve, said at the time, “With AMD’s
FidelityFX Super Resolution, we are able to offer customers improved image quality
at a lower performance cost than full resolution rendering. This is particularly attrac-
tive for users with mid-range GPUs wanting to target higher resolutions. We are
very pleased that it is designed for use with all GPUs and with AMD’s open-source
approach with FSR [21]”.
FSR was compatible with a broad range of GPUs, including legacy AMD GCN
GPUs and Nvidia AIBs. FSR ran on RX 400 GPUs forward, Ryzen
APUs and all Nvidia GPUs from GTX 10 series on. DirectX 11 was the minimum
officially supported API, although it should be relatively straightforward to port FSR
to DirectX 9.
FSR consisted of two consecutive compute shaders: one shader did upscaling with
edge reconstruction, and another shader sharpened the resulting upscaled image to
extract pixel detail.
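A highly simplified stand-in for those two passes is sketched below: a spatial upscale followed by a sharpening filter. FSR 1.0's real passes are an edge-adaptive upscale (EASU) and robust contrast-adaptive sharpening (RCAS); here plain bilinear resampling and a basic unsharp mask show only the pipeline shape, not AMD's actual filters.

```python
import numpy as np

# Two-pass sketch of the FSR pipeline shape: upscale, then sharpen.
# Bilinear resampling and an unsharp mask stand in for AMD's EASU and RCAS.

def upscale(img, factor):
    h, w = img.shape
    ys = np.linspace(0, h - 1, h * factor)
    xs = np.linspace(0, w - 1, w * factor)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    fy, fx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - fx) + img[y0][:, x1] * fx
    bot = img[y1][:, x0] * (1 - fx) + img[y1][:, x1] * fx
    return top * (1 - fy) + bot * fy

def sharpen(img, amount=0.5):
    # Unsharp mask: boost the difference between a pixel and its local average.
    blur = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1) + img) / 5.0
    return np.clip(img + amount * (img - blur), 0.0, 1.0)

def fsr_like(low_res, factor=2):
    return sharpen(upscale(low_res, factor))   # pass 1, then pass 2

frame = np.random.rand(540, 960)               # toy low-resolution frame
print(fsr_like(frame).shape)                   # (1080, 1920)
```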
As the FSR pipeline diagram (Fig. 7.21) indicates, processed and anti-aliased data
were fed to the FSR. That data could be from any conventional game, including 2D
games or a ray-traced game or application.
Ray-traced games could use AMD’s ray tracing hardware (discussed elsewhere
in this chapter)—however, it must be pointed out that ray tracing was not required,
which is one reason AMD could offer the FSR software for any GPU and multiple
versions of DirectX.

Table 7.4 AMD’s quality settings versus performance


FSR quality mode Scale factor Input resolution for Input resolution for
1440p FSR output 4 K FSR output
Ultra-quality 1.3 × per dimension 1970 × 1108 2954 × 1662
Quality 1.5 × per dimension 1706 × 960 2560 × 1440
Balanced 1.7 × per dimension 1506 × 847 2259 × 1270
Performance 2.0 × per dimension 1280 × 720 1920 × 1080
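The input resolutions in Table 7.4 follow from the scale factors: dividing the output resolution by the per-dimension factor reproduces the table to within a pixel of rounding, as the short check below shows.

```python
# Input resolution ~= output resolution / scale factor (Table 7.4),
# give or take a pixel from AMD's exact rounding.
modes = {"Ultra Quality": 1.3, "Quality": 1.5, "Balanced": 1.7, "Performance": 2.0}
for out_w, out_h in [(2560, 1440), (3840, 2160)]:
    for mode, factor in modes.items():
        print(f"{out_h}p {mode:13s}: {round(out_w / factor)} x {round(out_h / factor)}")
```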

7.4.3 Summary

Nvidia brought real-time ray tracing (RTRT) to the forefront in 2018 and lit up
everyone’s imagination. By 2021, ray tracing ran on the Xbox series X, the PS5,
AMD’s RX6000 and Pro AIBs, and all Nvidia’s RTX AIBs. At that time, over 60
games incorporated RTRT—with more coming. RTRT was forecasted to be available
in 2025, based on Moore’s law, but AI and clever intersection work by AMD and
Nvidia pulled that date closer by seven years—which was startling, remarkable, and
very welcomed. With ray tracing, games simply felt better and more realistic, with
fewer to no artifacts to distract the player. Players were more immersed and genuinely
part of the game. These were exciting times, and they were going to get better.
AMD offered FSR, which improved performance by rendering frames at a lower
resolution and then using a spatial upscaling algorithm with a sharpening filter, but
it was not a direct competitor to Nvidia's DLSS.
FSR was a post-process shader that made it easy for game developers to imple-
ment across various AIBs. FSR was not the equivalent of DLSS, and especially not
DLSS 2.0, but FSR was a good non-AI, non-temporal upscaler that offered good-quality
performance at a low price.
Ultra FSR was more than just a Lanczos implementation [21] plus sharpening,
and it offered free performance improvements with a minimal effect on the visual
quality.

7.5 Innosilicon (2021)

7.5.1 The GPU Population Continued to Expand in 2021

Cryptocurrency ASIC builder Innosilicon applied their talents to building a GPU
using Imagination Technologies' BXT 32-1024 MC4 RTL IP.
Innosilicon was founded in 2006 in Zhuhai, China, a modern city in China’s
southern Guangdong province, on the border with Macau, with R&D in Shanghai
and Wuhan. The company had over 700 employees all over the world in 2022. Its first
products were RFID and custom LTD chips. It expanded to satellite communications
and then AI memory chip controllers. That led them to custom ARM CPUs and
into the mining market using clever GDDR memory management. Unfortunately,
testing revealed that their own GPU design wasn’t quite up to the competition in the
mainstream market. What makes a GPU good at mining is the memory bandwidth—
those big fat busses with high clock rates. Innosilicon had mastered that with their
AI memory controller. The company developed a super-high-speed PHY [22] for
GDDR6 that could realize 64 GB/s and used four-level pulse amplitude modulation
(PAM4) signaling.

To make the next leap in the company’s growth, it decided to use Imagination’s
GPU, Innosilicon’s memory manager, and tensor cores to build a high-end GPU/AIB,
which they named Fantasy One.
In October 2020, Imagination and Innosilicon announced a collaboration. In late
2021, Imagination confirmed it had licensed its BXT design to Innosilicon and
claimed it would deliver up to six TFLOPS of single-precision compute power.
Imagination’s specifications are slightly higher than what Innosilicon quoted for the
Fantasy One GPU.
Fantasy One had nine GPU blocks with an undeclared number of cores but could
be as high as 32 cores per block. The company did not disclose the process technology
for the Fantasy One GPU. There could be one of two reasons for that: either they were
embarrassed by the fab they got, or the fab deal had not been closed yet. However,
they did have chips and were building AIBs.
The company showed four products at a press event in China in December 2021, including a dual-GPU AIB and single-GPU AIBs.
The Type A AIB was a consumer/workstation board featuring a single Fantasy One GPU built as a multi-chip (chiplet) design.
According to Innosilicon at the time, the GPU could deliver a fill rate of 160 GPixel/s and up to five TFLOPS of single-precision compute power. The AIB had HDMI 2.1, DisplayPort 1.4, and VGA outputs. The AIB had up to 16 GB of GDDR6(X) memory (using Innosilicon's PHY) with a 128-bit interface. At the time, Nvidia had been the only user of Micron's G6X technology. Innosilicon developed its own PAM4 signaling and was able to squeeze up to 19 Gbps per pin out of its GDDR6X implementation. However, the 128-bit memory bus limited the bandwidth, which could theoretically reach 304 GB/s (somewhere between a Radeon RX 6700 XT and a 6600 XT).
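That figure follows directly from the per-pin data rate and the bus width; a quick check of the arithmetic in Python:

data_rate_gbps = 19      # per-pin data rate quoted for the GDDR6X implementation
bus_width_bits = 128     # memory interface width
bandwidth_gbs = data_rate_gbps * bus_width_bits / 8   # convert bits to bytes
print(bandwidth_gbs)     # 304.0 GB/s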
The small orange AIB shown in Fig. 7.22 was an entry-level AIB based on the
Fantasy One GPU and had less memory than the dual-fan solutions (Fig. 7.23).
The Type B could be a fanless or a triple-fan design, as in Fig. 7.24, and used two Fantasy One GPUs connected by an Innolink interface. The company claimed it could hit up to ten TFLOPS of computing power and 320 GPixel/s fill rates with the two GPUs. The AIB offered 32 1080p/60 fps streams or 64 streams at 720p/30 fps. It featured up to 32 GB of GDDR6(X) memory via dual 128-bit interfaces, one from each GPU. All the AIBs included a PCI Express 4.0 interface at full x16 width.
Notice there is no power connector on the top. The typical power consumption
was only 20 watts.
The AIBs supported OpenGL, OpenGL ES, OpenCL, Vulkan, and DirectX,
though the company didn’t reveal which version of DirectX.
In December 2021, the company said it was working on the next-generation Fantasy 2 and 3 GPU families and would unveil them in 2022. Innosilicon planned to use a 5 nm process technology for those GPUs.
Innosilicon said its Innolink IP chiplet solution allows "massive amounts of low-latency data to pass seamlessly between smaller chips as if they were all on the same bus." In other words, it is a chiplet design, defined as independent functional blocks making up a large chip (Fig. 7.25).

Fig. 7.22 Innosilicon’s series one family. Courtesy of Innosilicon

Fig. 7.23 An HDMI, display port with a VGA connector on the back of the AIB. Courtesy of
Innosilicon

The Innolink delivered 56 Gbps per pair with 30 dB insertion loss. The company said it was scalable to 4/8/16/32/64/128 lanes, PHY-independent, and had a very low power mode.
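Assuming "pair" means one differential pair per lane (an assumption for illustration; Innosilicon did not publish aggregate figures here), the raw per-direction throughput would scale linearly with the lane count:

per_lane_gbps = 56                      # quoted 56 Gbps per lane pair
for lanes in (4, 8, 16, 32, 64, 128):   # the lane counts Innosilicon said the link scales to
    raw_gbs = per_lane_gbps * lanes / 8 # raw gigabytes per second, before encoding overhead
    print(lanes, raw_gbs)               # e.g., 128 lanes -> 896.0 GB/s raw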
It’s unlikely the company will do a specific miner product based on Imagination’s
IP Instead, Innosilicon has said it is focused on the data center, then desktop, then
laptop (Fig. 7.26).
The data center was the priority, then the desktop. Performance varied across the range: FP32 performance was five to six TFLOPS with a 160 GPixel/s fill rate. For comparison, AMD's RX 6600 is capable of seven TFLOPS and Nvidia's GTX 1660 SUPER can reach a hair over five TFLOPS, so Innosilicon landed between those.

Fig. 7.24 Fantasy One type B AIBs. Courtesy of Innosilicon

Fig. 7.25 Innosilicon's Innolink IP chiplet block diagram. Courtesy of Innosilicon
The AI (INT8) performance was 25 TOPS, with up to 16 GB of GDDR6(X) memory at 19 Gbit/s per pin connected to the 128-bit memory interface, for a memory bandwidth of 304 GB/s.

Fig. 7.26 Innosilicon’s roadmap. Courtesy of Innosilicon

The desktop graphics AIB was PCIe 4.0 × 16. The typical power consumption
was only 20 watts.
Two GPUs were connected via Innolink in the desktop and server. And, as
mentioned, the performance was projected to be ten TFLOPS and 320 Gpixel/s
and the memory 32 GB. Typical power consumption was fifty watts.

7.5.2 Summary

Innosilicon officially joined the other Chinese GPU makers (or promisers) Jinga,
MetaX, XianDiXian, and Zhaoxin. But Innosilicon was the first company in 22 years
to use Imagination IP to build a PC graphics processor, the last being NEC. Apple was
the first to use Imagination in a mobile phone built using PowerVR. Over the years,
Imagination demonstrated and improved on their tiling engine. And the company
always offered a lot of pixel GFLOPS for not too many milliwatts. It was exciting to
see Innosilicon take Imagination back to mainstream PC and into the data center. The
company always felt like it belonged there and was just looking for the right partner.
It could be speculated that if Canyon Bridge had not acquired Imagination, the deal with Innosilicon might not have come about. It is also ironic that Apple's public and almost cruel abandonment of Imagination, and its poaching of Imagination employees, led to a drop in the company's valuation that made it a takeover target. And now Apple is again a client of Imagination but will never have the stranglehold on the company it once did.
It’s unlikely Innosilicon will push out of China for a while. China is a plenty big
market, and there are no cultural barriers to cross or associated expenses. However,
it also means AMD, Intel, and Nvidia will have more headwinds trying to penetrate
further into the Chinese GPU market, especially the data center, since three Chinese
companies have declared that they will own it.

China wants to be self-sufficient and is building the foundations to make that happen with Western consumer dollars.

7.6 Conclusion

When the integrated T&L engine was introduced, launching the first era of GPUs, the shaders were semi-fixed function and would sit idle while other semi-fixed-function shaders might be overburdened. The second era of GPUs introduced programmable shaders, one step closer to the ultimate all-compute GPU. Unified shaders arrived in the third era of the GPU, and the fourth era introduced compute shaders. The fifth era brought us ray tracing and AI.
The sixth era was marked by the introduction of DirectX 12 Ultimate in November 2020. It brought a flood of new compute techniques and aligned the PC with the game consoles, which greatly enhanced and sped up game development. DX12U introduced mesh shaders and sampler feedback. Mesh shading would transform the detail and quality of computer images without penalizing performance.

References

1. Peddie, J. Ray Tracing: A Tool for All, Springer, (2019), https://doi.org/10.1007/978-3-030-17490-3
2. Jobalia, S. Coming to DirectX 12— Mesh Shaders and Amplification Shaders: Reinventing the
Geometry Pipeline, (November 8, 2019), https://devblogs.microsoft.com/directx/coming-to-
directx-12-mesh-shaders-and-amplification-shaders-reinventing-the-geometry-pipeline/
3. Patel, A. DirectX Raytracing (DXR) Tier 1.1, (November 6, 2019), https://devblogs.microsoft.
com/directx/dxr-1-1/
4. Andrews, C. Coming to DirectX 12— Sampler Feedback: some useful once-hidden data, unlocked,
(November 4th, 2019), https://devblogs.microsoft.com/directx/coming-to-directx-12-sampler-
feedback-some-useful-once-hidden-data-unlocked/
5. Kilgariff, E., Moreton, H., Stam, N., and Bell, B. Nvidia Turing Architecture In-Depth, Tech-
nical Walkthrough, Nvidia Developer Blog, (September 14, 2018), https://developer.nvidia.
com/blog/nvidia-turing-architecture-in-depth/
6. Roach, J. What is Nvidia DLAA? New anti-aliasing technology explained, (September 28,
2021), https://www.digitaltrends.com/computing/what-is-nvidia-dlaa/
7. Burnes, A. Nvidia RTX Technology: Making Real-Time Ray Tracing A Reality For
Games, (March 19, 2018), https://www.nvidia.com/en-us/geforce/news/nvidia-rtx-real-time-
game-ray-tracing/
8. Einig, M. Hybrid rendering for real-time lighting: ray tracing vs rasterization, (January 2,
2017), https://www.imaginationtech.com/blog/hybrid-rendering-for-real-time-lighting/
9. Barré-Brisebois, C., et al. Hybrid Rendering for Real-Time Ray Tracing. Chapter 25 in Ray
Tracing Gems, edited by Eric Haines and Tomas Akenine-Möller, Apress, (2019), https://ray
tracinggems.com
10. Burnes, A. Nvidia DLSS 2.0: A Big Leap In AI Rendering, (March 23, 2020), https://www.nvi
dia.com/en-us/geforce/news/nvidia-dlss-2-0-a-big-leap-in-ai-rendering/

11. Peddie, J. Intel unveils Xe-architecture-based discrete GPU for HPC, (November 17, 2019),
https://www.jonpeddie.com/report/intel-unveils-xe-architecture-based-discrete-gpu-for-hpc/
12. Peddie, J. Intel launches hybrid notebook processor: 3D stacking and very low power hallmarks
of new, (June 11, 2020), https://www.jonpeddie.com/report/intel-launches-hybrid-notebook-
processor/
13. Peddie, J. Intel’s stacked chip is sexy, (February 12, 2020), https://www.jonpeddie.com/report/
intels-stacked-chip-is-sexy/
14. Rogoway, M. Intel promises another decade of Moore’s Law as it strives to reconnect with
‘geeks,’ The Oregonian/OregonLive, (October. 27, 2021), https://www.oregonlive.com/sil
icon-forest/2021/10/intel-promises-another-decade-of-moores-law-as-it-strives-to-reconnect-
with-geeks.html
15. Koduri, R. (@Rajaontheedge), Xe-HPG (DG2) real candy - very productive time at the Folsom lab couple of weeks ago, (June 1, 2021), https://twitter.com/Rajaontheedge/status/1399966271182045184
16. Koduri, R. Intel Delivers Advances Across 6 Pillars of Technology, Powering Our Leadership
Product Road map, (August 13, 2021), https://www.intel.com/content/www/us/en/newsroom/
opinion/advances-across-6-pillars-technology.html
17. Peddie, J. The Arc of the story—Intel brands its GPU, TechWatch, (August 20, 2021), https://
www.jonpeddie.com/report/the-arc-of-the-alchemist
18. Saleh, S.J., Kazakov, M.V., Goe, V. Texture Processor Based Ray Tracing Acceleration Method and System, US 2019/0197761, (June 27, 2019), https://www.freepatentsonline.com/20190197761.pdf
19. GPUOpen, https://gpuopen.com/
20. AMD. With AMD FidelityFX Super Resolution, AMD Brings High-Quality, High-Resolution
Experiences to Gamers Worldwide [Press release]. (June 22, 2021), https://ir.amd.com/news-
events/press-releases/detail/1011/with-amd-fidelityfx-super-resolution-amd-brings
21. Lanczos resampling, https://en.wikipedia.org/wiki/Lanczos_resampling
22. GPU GDDR6/6X PHY & Controller, Innosilicon, https://www.innosilicon.com/html/ip-sol
ution/14.html
Chapter 8
Concluding Remarks

The goal of computer graphics has been to present data in an informative and realistic
way. Molecular modeling was one of the first applications of presenting data and using
the graphics to gain new insights and see problems. CAD was the biggest impetus
to CG and then games and movies. GPU compute, although used in visualization
calculations, is not considered a CG application for this discussion.
The goal of visualization and simulation used for digital modeling and entertain-
ment has been to create an image that is indistinguishable from reality—to remove
any feeling of disbelief or discomfort.
In the case of engineering and design, ray tracing with global illumination has
been the Holy Grail and efforts to speed it up and make it a real-time tool have been
expended since Whitted’s breakout development that helped accelerate ray tracing
in 1979 [1]. Real-time ray tracing was realized in 2014 by Imagination Technologies
[2] and again and most famously in 2018 by Nvidia [3].
The beauty of real-time ray tracing was brought to games, and some of the best
show off examples were racing games with beautiful car models (Fig. 8.1).
Traditionally, one of the biggest challenges in computer graphics has been rendering and simulating humans. Getting a lifelike, believable human image has been a sought-after objective for decades. Image quality improved from crude cartoon characters in games like Doom in the early 1990s and Lara Croft in Tomb Raider in 1996 (Fig. 8.2), to refined and believable images beginning in 2001 with the animated movie Final Fantasy: The Spirits Within and Tomb Raider 2013 (Fig. 8.3). Skin, eyes, and hair were the major challenges, followed by natural mechanics of movement. In 2020, facial realism powered by GPUs took another giant step as real people were modeled for games such as Death Stranding. And then in 2022, new techniques, as well as ray tracing, were applied by Unity after it acquired Weta. More realistic eyes with caustics on the iris, a new hair system, and a new skin shader with peach fuzz and wrinkle maps were added (Fig. 8.4).
As GPUs scaled up in the number of shaders, added AI matrix math cores, and enlarged their memory size while increasing clock speeds, APIs expanded to mesh shaders and ray tracing, and software developers saw opportunities to do things that were only possible on big computers running for days.

Fig. 8.1 Slick car from Forza Horizon 5. Courtesy of Xbox Game Studios

Fig. 8.2 Game characters in the 1990s: Doom and Tomb Raider. Source Wikipedia

Fig. 8.3 Final Fantasy 2001 and Tomb Raider 2013. Courtesy of Wikipedia and Crystal Dynamics

Fig. 8.4 Death Stranding 2020 and enemies. Courtesy of Sony Interactive Entertainment and Unity

Fig. 8.5 Computational fluid dynamics is used to model and test in a computer to find problems and opportunities. Courtesy of Siemens
In the movie world, computer graphics have brought dead actors back to life, including well-known people such as Steve McQueen, Carrie Fisher, and Peter Cushing—a challenge complicated by the close relationships audiences have with these actors.
In the areas of engineering, CAD has been a mainstay user of GPUs and rendering. In areas of simulation, finite element analysis (FEA) and computational fluid dynamics (CFD) have been a challenge (Fig. 8.5).

Computational fluid dynamics (CFD) simulation is a powerful tool for design and is used extensively for automotive, aeronautic, and marine design, where air
and water flows are vital to the performance of cars, planes, and boats. It has much
broader applications as well. CFD comes in handy in content creation for simulating
fire, smoke, lava flows, and even crowd movements. Being able to put CFD to work
is a superpower, but it ain’t easy. It’s a matter of predicting the behavior of millions
of particles that are always changing in 4D with complex interdependencies.
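As a toy illustration only, nothing like a production CFD solver, the basic idea of advecting many particles through a velocity field over many small time steps can be sketched in a few lines of Python with NumPy. Real solvers couple the particles or cells through pressure and viscosity terms, which is where GPU parallelism pays off. The field, particle count, and step size below are arbitrary assumptions.

import numpy as np

def velocity(p):
    # A toy swirling velocity field; real CFD derives this from the Navier-Stokes equations.
    x, y = p[:, 0], p[:, 1]
    return np.stack([-y, x], axis=1)

particles = np.random.rand(100_000, 2) - 0.5   # 100,000 particles scattered around the origin
dt = 0.01
for _ in range(100):                           # 100 explicit Euler time steps
    particles += dt * velocity(particles)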
And this is not the end of the story, just the story to here. Between the time this
was written, the book published, and you reading it, probably newer and even more
amazing things have been done with GPUs. And now you know how we got here
and why. Thanks for reading.

References

1. Whitted T. (1979) An improved illumination model for shaded display. Proceedings of the 6th
annual conference on Computer Graphics and Interactive Techniques
2. Triggs, R. Imagination shows off real time ray tracing demos at MWC, (February 23, 2016), https://www.androidauthority.com/imagination-ray-tracing-demo-mwc-2016-675829/
3. Takahashi, D. Nvidia unveils real-time ray tracing Turing graphics architecture, (August
13, 2018). https://venturebeat.com/2018/08/13/nvidia-unveils-real-time-ray-tracing-turing-gra
phics-architecture/
Appendix A
Acronyms

Common acronyms used in this book and the computer graphics industry (product
names are not included)

HPDR High-precision dynamic range


ODM Original design manufacturer
SIP Semiconductor Intellectual Property
TAU Texture address unit
DLL Dynamic-link library
EMIB Embedded Multi-die Interconnect Bridge
FMAD Floating-point fused Multiply-ADd vectors
GMCH Graphics memory and control hub
GT Graphics technology
GTI Graphics technology Interface
HEVC High-Efficiency Video Coding
IPC Instructions per clock
MIC Many integrated cores
MIMD Multiple instructions, multiple data
MMU Memory-management unit
POSH Position Only Shading pipeline
PSO Pipeline state object, also single pipeline state object
PTBR Position only tile-based rendering
RB Render backend
SMT Simultaneous multi-threading
SOI Silicon on insulator
TPC Texture processor cluster
TPU Tensor processing unit
VF Vertex fetch
VPU Vector processing unit
ALLM Auto low latency mode
NURBS Non-uniform rational basis spline
VoIP Voice Over Internet Protocol
AHB Advanced high-performance bus (Arm)
APB Advanced Peripheral Bus
ATW Asynchronous time warp
BREW Binary Runtime Environment for Wireless
CDMA Code division multiple access
CIF Common intermediate format
DLA Deep learning accelerator
FMA Fused multiply-accumulate
GSM Global System for mobile communication
IMGIC Imagination Image Compression
IoT Internet of things
MID Mobile Internet devices
MIPI Mobile Industry Processor Interface
OMAPI Open Mobile Application Processor Interfaces
PBR Physically based rendering
SKU Stock keeping unit
AXI Advanced eXtensible Interface
BRDF Bidirectional reflectance distribution function
BSD Berkeley Source Distribution
COTSS Commercial-off-the-shelf semiconductors
CSP Chip-scale package
DPC Data-parallel C
DVFS Dynamic voltage and frequency scaling
GCC GNU Compiler Collection (C and C++ compilers)
HPM High-performance multi-crystalline
IDM Integrated device manufacturing
MDFI Multi-die fabric interconnect (Intel)
MIG Multi-instance GPU
MUL Multiply
NNP Neural network processors
OCP Open Core Protocol
RAC Ray acceleration cluster
TBps Terabytes per second
UOS Unity Operating System
WinTel Microsoft Windows and Intel systems
XMX Xe Matrix eXtensions
AIB Advanced Interface Bus (Intel)
DLAA Deep learning anti-aliasing
DS Domain Shader
MPS Multi-process service
PAM Pulse amplitude modulation
QoS Quality of service
TIPS Tensor instructions per second
TOPS Tensor operations per second
TSV Through silicon vias
VDI Virtual desktop infrastructure
VGPR Vector general-purpose registers
VS Vertex shader
XESS Xe supersampling
Appendix B
Definitions

Does anyone really read a glossary? Hopefully yes. They take a lot of time and research to write and can inform, clear up ambiguities, and even cause some people to change their perspective. The trick is to know what to put in and leave out.
Terms
Throughout this book, specific terms will be used that assume the reader understands
and is familiar with the industry.
• Model—The marketing name for a GPU assigned by the manufacturer, for example, AMD's Radeon, Intel's Xe, and Nvidia's GeForce. A model can also be a 3D object. For example, the design of a car is a 3D model.
• Code name—The GPU manufacturer’s engineering code name for the device.
• CAD—Computer-aided design.
• CAE—Computer-aided engineering.
• CAGR—Compound average growth rate.
• CFD—Computational fluid dynamics.
• CGI—Computer-generated imagery.
• Launch—There is no standard. It can be the date the GPU first shipped, or the
date of the announcement.
• Architecture—The name of the design, the microarchitecture used for the GPU.
It, too, will be a proper noun such as AMD’s Radeon DNA or Nvidia’s Hopper.
• Fab—The fabrication process. The average feature size of the transistors in the
GPU expressed in nanometers (nm).
• Transistors—The number of transistors in the GPU or chip.
• Die size—The area of the chip, typically measured in square millimeters (mm²).
• Core clock—The GPU’s reference or base frequency (and boost if available) is
expressed in MHz or GHz.
• Fill rate


– Pixel—The rate at which the raster operators can render pixels to a display,
measured in pixels/s.
– Texture—The rate at which the texture-mapping units can map surfaces onto
a polygon mesh, measured in texels/s.
• Performance
– Shader operations—How many operations the pixel shaders (or unified
shaders) can perform, measured in operations/s.
– Vertex operations—The number of operations processed on the vertex shaders
in Direct3D 9.0c and older GPUs, expressed in vertices/s.
• Memory
– Bus width—The bit width of the memory bus.
– Size—Size of the graphics memory expressed in gigabytes (GB).
– Clock—The reference or base frequency of the memory clock, expressed in
MHz or GHz.
– Bandwidth—The maximum rate of data transfer across the memory expressed
in mega- or gigabytes per second (GB/s or MB/s).
• TDP (thermal design power)—The maximum heat generated by the GPU,
expressed in watts.
• TBP—The typical AIB power consumption, measured in watts.
• Bus interface—The connection that attaches the graphics processor to the system
(typically an expansion slot, such as PCI, AGP, or PCIe).
• API support—Rendering and computing APIs supported by the GPU and the
driver.
• Image generation—The image generation stage in a GPU where the final,
displayed pictures are created before being sent to the screen. It is where the
user engages with the results of the entire system. In the case of movies or TV, it
is passive. In the case of computers, it is interactive, such as playing a game. In
the case of interactive images, they can be for content creation work or content
consumption. There is a relentless demand and need for high quality, fast response,
and image generation in all cases.
Terminology and conventions change over time.
2D—Two dimensional, used to refer to “flat” graphics which only have two axes
(plural of axis), X and Y, along which drawing occurs, such as those used in normal
Windows applications. Includes drawing functions such as line drawing, BitBlts, text
display, and polygons. Most common form of computer graphics, since displays are
2D as well.
3D—Three dimensional, used to refer to the rendering/display of graphics which
are 3D in nature (i.e., exist along three axes, X, Y, and Z). In existing PC graphics
systems, this 3D data needs to be rendered into a 2D surface, namely the display.
This is something that graphics chips that offer 3D acceleration specialize in, offering
features such as 3D lines, texture mapping, perspective correction, alpha blending,
and color interpolation for smooth shading (used in simulating lighted scenes).

3D scene—A 3D scene is composed of interlocking groups of triangles that make up all visible surfaces. By performing mathematical operations on the vertices at the
corners of each triangle, the geometry-processing engine can place, orient, animate,
color, and light every object and surface that needs to be drawn. Small programs called
vertex shaders, uploaded to the graphics chip and executed by the vertex-processing
engine, control the process.
ASIC—An “Application Specific Integrated Circuit” is similar to an FPGA, but
fixed at the factory, and much cheaper to produce in quantity.
Adaptive sync—Technology for LCD displays that support a dynamic refresh rate
aimed at reducing screen tearing. In 2015, VESA announced Adaptive-Sync as an
ingredient component of the DisplayPort 1.2a specification. See FreeSync.
Adder—A device with two or more inputs which performs the operation of adding
the inputs and outputting the result. Traditional use in computing is a binary adder,
in which the inputs and output are binary numbers. Inputs can range in width from
one bit to many bits. The output of an adder is typically one bit wider than the largest
input to account for a possible carry situation.
Adobe RGB—Adobe RGB (1998) is a color space, developed by Adobe Systems
in 1998. It has a wider gamut than the sRGB (mainly in the cyan-green range of
colors) and is widely used in professional printing.
AGP—Acronym for Accelerated Graphics Port. This is a new bus technology
that Intel introduced in the mid-1990s to provide faster access to graphics boards
and ultimately allow these graphics boards to utilize system memory for storage
of additional off-screen graphics elements. However, while some PCs with AGP
support (usually in the form of a single slot for a graphics board) have been shipping
since late in 1997, there’s no commercially available popular software that currently
takes any real advantage of AGP’s system memory sharing ability. This is probably
because so few installed systems currently offer AGP support, and because memory
prices have dropped enough so that graphics board makers can offer huge amounts
of memory (4, 8, and even 12 MB) on graphics boards at very low prices, eliminating
the need to use system memory for additional graphics storage.
AIB (add-in board)—An add-in board, also known as a card, is a board that gets plugged into the PC. When an AIB contains a GPU and memory, it is known as a graphics AIB or graphics card. It plugs into either PCI Express or the older AGP bus.
ALU, Arithmetic Logic Unit—The circuits in a microprocessor where all arith-
metic and logical instructions are carried out. Distinguished from an Arithmetic Unit
by the inclusion of logical functions (shift, compare, etc.) as well as arithmetic (add,
subtract, multiply, etc.) in its repertoire of functions.
Ambient Occlusion—To create realistic shadowing around objects, developers
use an effect called Ambient Occlusion (AO), sometimes called “poor man’s ray
tracing.” AO can account for the occlusion of light, creating non-uniform shadows
that add depth to the scene. Most commonly, games use Screen-Space Ambient
Occlusion (SSAO) for the rendering of AO effects. There are many variants, though
all are based on early AO tech, and as such suffer from a lack of shadow definition

and quality, resulting in a minimal increase in image quality (IQ) compared to the
same scene without AO.
Anaglyph 3D—Unrelated to 3D. This is a method of simulating a depth image on
a flat 2D display by overlaying colored images representing the view from left and
right eyes, then filtering the image presented to each eye through an appropriately
colored lens.
Anisotropic filtering (AF)—A method of enhancing the image quality of textures
on surfaces of computer graphics that are at oblique viewing angles with respect to
the camera where the projection of the texture (not the polygon or other primitive on
which it is rendered) appears to be non-orthogonal (thus the origin of the word: “an”
for not, “iso” for same, and “tropic” from tropism, relating to direction; anisotropic
filtering does not filter the same in every direction).
API—Acronym for application programming interface. A series of functions
(located in a specialized programming library), which allow an application to perform
certain specialized tasks. In computer graphics, APIs are used to expose or access
graphics hardware functionality in a uniform way (i.e., for a variety of graphics hard-
ware devices) so that applications can be written to take advantage of that function-
ality without needing to completely understand the underlying graphics hardware,
while maintaining some level of portability across diverse graphics hardware. Exam-
ples of these types of APIs include OpenGL and Microsoft’s Direct3D. An API is
a software program that interfaces an application (Word, Excel, a game, etc.) to the
GPU as well as the CPU and operating system of the PC. The API informs the appli-
cation of the resources available to it, which is called exposing the functionality. If
a GPU or CPU has certain capabilities and the API doesn’t expose them, then the
application will not be able to take advantage of them. The leading graphics APIs
are DirectX and OpenGL.
APU—The AMD Accelerated Processing Unit (APU), formerly known as Fusion,
is the marketing term for a series of 64-bit microprocessors from Advanced Micro
Devices (AMD), designed to act as a central processing unit (CPU) and graphics
accelerator unit (GPU) on a single chip.
ARIB STD-B67—Hybrid Log Gamma (HLG) is a high-dynamic-range (HDR)
standard that was jointly developed by the BBC and NHK. HLG defines a nonlinear
transfer function in which the lower half of the signal values uses a gamma curve
and the upper half of the signal values uses a logarithmic curve.
ASP—Average selling price.
Aspect ratio—The ratio of length to height of computer and TV screens, video,
film, or still images. Nearly, all TV screens are 4:3 aspect ratio. Digital TVs are
moving to widescreen which is 16:9 aspect ratio.
Attach rate—An attach rate (also called an attach ratio) measures how many add-
on products are sold with each of the basic product or platform and is expressed as
a percentage.
AU, Arithmetic Unit—The circuits in a microprocessor where all arithmetic
instructions are carried out. Often found in combination with separate logic and
other units, controlled by a long, or very long, instruction word.

Augmented reality—Augmented reality (AR) overlays digitally created content into the user's real-world environment. AR experiences can range from informa-
tional text overlaid on objects or locations to interactive photorealistic virtual objects.
AR differs from Mixed Reality in that AR objects (e.g., graphics, sounds) are
superimposed on, and not integrated into, the user’s environment.
Backlight—The backlight is the source of light of the LCD display panels. The
type of backlight determines the image quality and the color space of the display.
There are various backlights such as CCFL, LED, WLED, and RGB-LED.
BGA—Ball-grid array, a type of surface-mount packaging (a chip carrier) used
for integrated circuits.
Bidirectional reflectance distribution function (BRDF)—A function of four real variables that defines how light is reflected at an opaque surface. It is employed in the optics of real-world light, in computer graphics algorithms, and in computer vision algorithms. The function takes an incoming light direction and an outgoing direction (taken in a coordinate system where the surface normal lies along the z-axis) and returns the ratio of the reflected radiance exiting along the outgoing direction to the irradiance incident on the surface from the direction of the light source.
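In the notation commonly used in the literature (standard definition, not specific to this book), that ratio is written as

f_r(\omega_i, \omega_o) = \frac{\mathrm{d}L_r(\omega_o)}{\mathrm{d}E_i(\omega_i)} = \frac{\mathrm{d}L_r(\omega_o)}{L_i(\omega_i)\,\cos\theta_i\,\mathrm{d}\omega_i}

where \omega_i and \omega_o are the incoming and outgoing directions, L is radiance, E is irradiance, and \theta_i is the angle between \omega_i and the surface normal.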

Bidirectional scattering distribution function (BSDF)—Introduced in 1980 by Bartell, Dereniak, and Wolfe, it is often used to name the general mathematical func-
tion which describes the way in which the light is scattered by a surface. However, in
practice this phenomenon is usually split into the reflected and transmitted compo-
nents, which are then treated separately as BRDF (bidirectional reflectance distribu-
tion function) and BTDF (bidirectional transmittance distribution function). BSDF
is a superset and the generalization of the BRDF and BTDF.
Bidirectional scattering-surface reflectance distribution function (BSSRDF)—or
B surface scattering RDF describes the relation between outgoing radiance and
the incident flux, including the phenomena like subsurface scattering (SSS). The
BSSRDF describes how light is transported between any two rays that hit a surface.
Bidirectional texture functions (BTF)—Bidirectional texture function is a 6-
dimensional function depending on planar texture coordinates as well as on view
and illumination spherical angles. In practice, this function is obtained as a set of

several thousand color images of material sample taken during different camera and
light positions.
Bilinear filtering—When a small texture is used as a texture map on a large surface, stretching will occur and large blocky pixels will appear. Bilinear filtering smooths out this blocky appearance by applying a blur.
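A minimal Python sketch of the interpolation itself, with texture wrapping, mipmapping, and fixed-point details omitted:

def bilinear_sample(tex, u, v):
    # tex is a 2D list (rows of texel values); u, v are continuous texel coordinates.
    h, w = len(tex), len(tex[0])
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx      # blend along x on the upper row
    bottom = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx   # blend along x on the lower row
    return top * (1 - fy) + bottom * fy                  # blend the two rows along y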
Binary—A counting system in which only two digits exist, "0" and "1." Also known as the base-2 counting system. Each digit represents an additive magnitude of a power of 2, based on its position, with the rightmost digit representing 2 to the 0th power (2^0), the next digit representing 2 to the 1st power (2^1), etc. For example, the binary number 1001B converts to a decimal or base-10 number as follows: 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0 = 8 + 0 + 0 + 1 = 9. The binary system is the basis for all digital computing.
Binary digits—The numbers “0” and “1” in the binary counting system. Also
called a bit.
Binary Notation—In various graphics hardware reference documents, as well as
in some programming languages, it’s common to see binary numbers (a combination
of binary digits) listed as the binary digits followed by the letter “B” or “b,” as in the
example listed under the term “binary.”
Binary Units—One or more bits.
Binning—Binning is a sorting process in which superior-performing chips are
sorted from specified and lower-performing chips. It can be used for CPUs, GPUs
(graphics cards), and RAM. The manufacturing process is never perfect, especially
given the incredible precision necessary and number of transistors to produce GPUs
and other semiconductors. Manufacturing high-performance and expensive GPUs
results in getting some that cannot run at the specified frequencies. Those parts
however may be able to run at slower speeds and can be sold as less expensive
GPUs.
Bit—Acronym derived from the term “Binary digIT” (see definition above).
Bit-depth-BPP—See bits per-pixel and BPP.
Bitmap—A bitmap image is a dot matrix data structure that represents a generally
rectangular grid of pixels (points of color), viewable via a monitor, paper, or other
display medium. A bitmap is a way of describing a surface, such as a computer screen
(display) as having several bits or points that can be individually illuminated and at
various levels of intensity. A bit-mapped 4k monitor would have over 8-million bits
or pixels.
Bits per channel—See bits per-pixel.
Bits per-pixel—Bits per channel are the number of bits used to represent one of the color channels (red, green, blue). The "bit depth" setting when editing images specifies the number of bits used for each color channel—bits per channel (BPC). The human eye can only discern about 10 million different colors. An 8-bit neutral (single color) gradient can only have 256 different values, which is why similar tones in an image can cause artifacts. Those artifacts are called posterization. A 16-bit setting (BPC) would result in 48 bits per-pixel (BPP). The available number of pixel values is then 2^48.

Blitter—A BitBlt process or engine. BitBlt is a data operation commonly used in computer graphics in which several bitmaps are combined into one using a Boolean function. The operation involves at least two bitmaps, one source and one destination, possibly a third that is often called the "mask," and sometimes a fourth used to create a stencil.
BPP—BPP is an acronym for bits per-pixel. The number of bits per-pixel defines
the depth of the color space usable by a graphics device. The following table shows
the relationship between BPP and colors:

BPP    Number of available colors
1      2
2      4
4      16
8      256
15     32,768
16     65,536
24     16,777,216

Also see bits per-pixel.


Brightness—An attribute of visual perception in which a source appears to be
radiating or reflecting light. In other words, brightness is the perception elicited by
the luminance of a visual target. It is not necessarily proportional to luminance. This
is a subjective attribute/property of an object being observed and one of the color
appearance parameters of color appearance models. Brightness refers to an absolute
term and should not be confused with Lightness.
Bump-mapped—Bump mapping is a technique for creating the appearance of
depth from a 2D image or texture map. Bump mapping gives the illusion of depth
by adding surface detail by responding to light direction—it assumes brighter parts
are closer to the viewer. It was developed by Jim Blinn and is based on Lambertian
reflectance which postulates the apparent brightness of a Lambertian surface to an
observer is the same regardless of the observer’s angle of view.
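A minimal Python sketch (an illustration, not Blinn's original formulation) of the Lambertian diffuse term that bump mapping perturbs: brightness depends only on the angle between the possibly bump-perturbed surface normal and the light direction.

import math

def lambert(normal, light_dir):
    # Both vectors are assumed normalized; brightness is clamped at zero for back-facing light.
    n_dot_l = sum(n * l for n, l in zip(normal, light_dir))
    return max(0.0, n_dot_l)

print(lambert((0.0, 0.0, 1.0), (0.0, 0.0, 1.0)))                          # 1.0, light head-on
print(lambert((0.0, 0.0, 1.0), (0.0, math.sqrt(0.5), math.sqrt(0.5))))    # ~0.707 at 45 degrees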
Byte (kbyte, Mbyte, Gbyte, Tbyte)—1 byte = 8 bits (1 byte = 256 discrete values (brightness, color, etc.)). A collection of 8 bits, accessible as a single unit. As such, a byte may represent one of 256 (2^8) numbers.
• 1 kilobyte = ~1000 bytes (1024 bytes)
• 1 megabyte = ~1000 kilobytes (1,048,576 bytes)
• 1 gigabyte = ~1000 megabytes
• 1 terabyte = ~1000 gigabytes.
Cache, Cache Memory—Many processor chips depend on external memory to
store the bulk of their data. Since access to external memory is slow compared to
processor speeds, a smaller, faster on-chip memory called a cache is used to improve
performance. Since the cache holds only a small part of the required data, the cache
controller runs one of a set of algorithms that attempt to ensure that the processor

has the fastest possible access to the data it needs at any one time. Many processors
have a hierarchy of progressively larger and slower on-chip caches in an attempt to
match the speed and data locality requirements of the processor with the external
DRAM array. These are referred to as Level one (L1), Level two (L2), etc.
Calligraphic display—See vector scope.
Chipset—Typically, a pair of chips that manage the data flows and traffic between
the system memory, CPU, disk drives, keyboard and mouse, and various I/O ports
(e.g., USB, Ethernet, etc.)—see southbridge and northbridge.
Chrominance—Chrominance (chroma or C for short) is the signal used in video
systems to convey the color information of the picture, separately from the accom-
panying luma signal (or Y for short). Chrominance is usually represented as two
color-difference components: U = B' − Y' (blue − luma) and V = R' − Y' (red
− luma). Each of these difference components may have scale factors and offsets
applied to it, as specified by the applicable video standard.
Complementary metal–oxide–semiconductor (CMOS) sensor—A CMOS sensor
is an array of active pixel sensors in complementary metal–oxide–semiconductor
(CMOS) or N-type metal-oxide-semiconductor (NMOS, Live MOS) technologies.
Clamp—A clamp is a device which takes an input and produces an output which
is bounded. A traditional clamp will have two or three different inputs: the signal or
number to be clamped; the upper bound to clamp to; and possibly a lower bound to
clamp to. When the signal/numeric input to be clamped is received, it is compared
against the upper bound and, if it exceeds it, is replaced by the upper bounding value.
Similarly, if there is a lower bound, the signal/numeric input is compared, and if
found lower than the lower bound, it’s replaced with the lower bound. The result of
all the bounding is then passed on to the output of the device.
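In code, a clamp reduces to a pair of comparisons; a minimal Python sketch (parameter names are illustrative):

def clamp(value, lower=None, upper=None):
    # Replace the value with the nearest bound when it falls outside [lower, upper].
    if upper is not None and value > upper:
        return upper
    if lower is not None and value < lower:
        return lower
    return value

print(clamp(300, 0, 255))   # 255
print(clamp(-7, 0, 255))    # 0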
Clone mode—Duplicates the computer's screen on the other monitor(s); it is referred to as "Duplicate" in the multiple displays pull-down menu. It can be useful for presentations and sometimes to provide a different representation of the same output.
Color—In current computer graphics systems, color display information is gener-
ated as a blend of three colored light components: red, green, and blue (RGB).
The combination of all three of these color components at full intensity produces a
white output, while the absence of all three produces black output. Blending these
three-color components at different intensities can produce a near infinite number
of distinct colors. While a display monitor tends to require each color component to
have a voltage from 0 (off) to 0.7 V (full intensity), a computer graphics subsystem
tends to deal with color in digital terms, on a pixel per-pixel basis. Each pixel has a
specific depth, also known as BPP. Each pixel, in the process of being displayed from
video memory, passes through a component called a RAMDAC. For 8 BPP or less,
the pixel value read from video memory is usually passed through the LUT portion
of a RAMDAC in order to produce the requisite RGB information. For greater than 8
BPP modes, pixels generally bypass the LUTs and go directly to the DACs. In order
to do this, such pixels must be defined with fixed possible ranges of RGB. Therefore,
it is standard that 15-bit pixels have 5 bits each of R, G, and B, with one bit left
unused; 16-bit pixels have 5 bits each of R and B and 6 bits of G; and 24 and 32-bit

pixels have 8 bits each of R, G, and B (with 8 bits unused in 32-bit pixels). 5 bits
gives 32 distinct intensity levels of a color component, 6 bits gives 64 levels, and 8
bits gives 256 intensity levels. It should be noted that pixel modes that go through
the LUT are called “indexed” color modes, while those that don’t are referred to as
“direct color” or “true color” modes.
Color gamut—The entire range of colors available on a particular device such as
a monitor or printer. A monitor, which displays RGB signals, typically has a greater
color gamut than a printer, which uses CMYK inks. Also see gamut and wide color
gamut.
Color space—See color gamut and gamut.
Combine—The verb used to describe an operation in which two or more values
or signals are added or concatenated with each other in order to produce a combined
output.
Comparator—A comparator is a device which generally takes two inputs,
compares them, and based on the result of the comparison, produces a binary output
or signal to indicate the result of the comparison. For example, for a “greater-than”
comparator, the first input would be compared against the second input, and if the
first is larger, a TRUE (usually a binary 1) would be output.
Computational photography—Processing of still or moving images with the
objective of modifying, enhancing, or manipulating the images themselves.
Conformal rendering—Foveation that offers a smoothly varying transition from
the high acuity region and the low acuity region. Considered more efficient than
traditional foveated rendering because it requires fewer rendered pixels than other
techniques.
Conservative raster—Standard rasterization does not always compute the desired result. In the example shown, one green and one blue triangle have been rasterized. These triangles overlap geometrically, but the standard rasterization process does not detect this fact.

Comparing Standard and conservative rasterization

With conservative rasterization, the overlap is always properly detected, no matter what resolution is used. This property can enable collision detection.

Constant dither—A constant dither is the application of a dither value which doesn't change over the course of a set of dithering operations.
Contrast ratio—The contrast ratio is a property of a display system, defined as
the ratio of the luminance of the brightest color (white) to that of the darkest color
(black) that the system is capable of producing. A high contrast ratio is a desired
aspect of any display. It has similarities with dynamic range.
Convolution—Convolution is a mathematical operation on two functions (f and
g); it produces a third function, that is typically viewed as a modified version of
one of the original functions, giving the integral of the pointwise multiplication of
the two functions as a function of the amount that one of the original functions is
translated. Convolution is similar to cross-correlation. It has applications that include
probability, statistics, computer vision, natural language processing, image and signal
processing, engineering, and differential equations.

Illustration of convolution (by Cmglee, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=20206883)
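As a discrete, one-dimensional illustration (the continuous integral becomes a finite sum), using NumPy's convolve:

import numpy as np

signal = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
kernel = np.array([0.25, 0.5, 0.25])              # a simple smoothing kernel
smoothed = np.convolve(signal, kernel, mode="same")
print(smoothed)   # each output sample is a weighted sum of its neighbors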
CNN (convolutional neural network)—A Deep Neural Network (DNN) that has
the connectivity in one or more of its layers arranged so that each node in layer
N is a convolution between a rectangular subset of the nodes in layer N − 1 and a
convolution kernel whose weights are found by training. The arrangement is designed
to mimic the human visual system and has proven to be very successful at image
classification as long as very large training data sets are available.
CPU—Acronym for central processing unit. In PC terms, this refers to the
microprocessor that runs the PC, such as an Intel Pentium chip.
Crossbar—A crossbar switch or matrix switch is an assembly of individual
switches between multiple inputs and multiple outputs that connects the inputs to
the outputs in a matrix manner. Crossbar switches were developed for information
processing applications such as telephony and circuit switching.
CRT—Cathode ray tube. Technical name for a display, screen, and/or monitor.
Most commonly associated with computer displays.
DAC—Digital to Analog Converter. A DAC is used to translate a digital (integer)
input, such as a pixel value, into an analog (non-integer) voltage signal. DACs are

used in CD players to convert CD data into sounds. DACs are also a key component of
any graphics subsystem, since they convert the pixel values into colors on the screen.
Graphics boards typically use a device known as a RAMDAC, which combines DACs
with Look-Up Tables (LUTs). RAMDACs typically contain three LUTs and three
DACs, one each for the red, green, and blue color components of a pixel. See “color”
and “LUT” for more information.
DCI P3—DCI P3 is a color space, introduced in 2007 by the SMPTE. It is used in digital cinema and has a much wider gamut than the sRGB.
dGPU—The basic, discrete (stand-alone) processor that always had its own private
high-speed (GDDR) memory. dGPUs are applied to AIBs and system boards in
notebooks.
Desktop GPU segments—The desktop is segmented into five categories, and the
desktop discrete GPUs follow the same designations.
• Workstation
• Enthusiast
• Performance
• Mainstream
• Value
Device driver—A device driver is a low-level (i.e., close to the hardware) piece
of software which allows operating systems and/or applications to access hardware
functionality without actually having to understand exactly how the hardware oper-
ates. Without the appropriate device drivers, one would not be able to install a new
graphics board, for example, to use with Windows, because Windows wouldn’t know
how to communicate with the graphics board to make it work.
Dewarp—In vision systems, this refers to the process of correcting the spherical
distortion introduced by the optical components of the system. Especially where a
single camera is capturing a very wide field of view, significant distortion can be
present. This is usually, but not always, removed in the ISP before any significant
vision processing or further computational photography is done.
Direct3D—Also known as D3D, Direct3D is the 3D graphics API that’s part of
Microsoft’s DirectX foundation library for hardware support. Direct3D actually has
two APIs, one which calls the other (called Direct3D Retained Mode or D3D RM)
and hides the complexity of the lower level API (called Direct3D Immediate Mode
or D3D IM). Direct3D is becoming increasingly popular as a method used by games
and application developers to create 3D graphics, because it provides a reasonable
level of hardware independence, while still supporting a large variety of 3D graphics
functionality (see “3D”).
DisplayPort—DisplayPort is a VESA digital display interface standard for a
digital audio/video interconnect, between a computer and its display monitor, or
a computer and a home-theater system. DisplayPort is designed to replace digital
(DVI) and analog component video (VGA) connectors in the computer monitors and
video cards.
Dithering—Used to hide the banding of colors when rendering with a low number
of colors (e.g., 16 bits). Banding is what happens when there are not enough shades

of colors, resulting in the eye being able to see a distinct change of colors between
two shades. Dithering is also a way to visually simulate a larger number of colors
on a display monitor by interleaving pixels of more limited colors in a small grid or
matrix pattern, much in the way a magazine’s color pictures are actually composed
of small colored dots. Dithering takes advantage of the human eye’s capability to
blend regions of color. For example, if you could only display red and blue pixels,
but wanted to give the visual impression of purple, you would create a matrix of
interleaved red and blue pixels, as depicted using letters below (B = Blue, R = Red):
BRBRBRBR
RBRBRBRB
BRBRBRBR
RBRBRBRB
When viewed from a distance, the human eye would blend the red and blue pixels
in this pattern, making the area appear to be a shade of purple. This technique allows
one to simulate thousands of color in exchange for a small loss in detail, even when
there are only 16 or 256 colors available for display as might be the case when a
graphics subsystem is configured to display in an indexed color mode (see “color”).
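A minimal Python sketch of one common variant, ordered (Bayer) dithering, which compares each pixel against a small repeating threshold matrix before quantizing to one bit; it is illustrative only and not tied to any particular hardware:

BAYER_2X2 = [[0.0, 0.5],
             [0.75, 0.25]]    # normalized 2x2 Bayer threshold matrix

def dither_to_1bit(gray):
    # gray is a 2D list of values in [0, 1]; returns 0/1 pixels with dithered shading.
    out = []
    for y, row in enumerate(gray):
        out_row = []
        for x, value in enumerate(row):
            threshold = BAYER_2X2[y % 2][x % 2]
            out_row.append(1 if value > threshold else 0)
        out.append(out_row)
    return out

print(dither_to_1bit([[0.2, 0.2], [0.2, 0.2]]))   # a flat 20% gray lights one pixel per 2x2 cell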
DMCVT—Dynamic Metadata for Color Volume Transforms, SMPTE ST 2094.
Dolby Vision—12-bit HDR, BT.2020, PQ, Dolby Vision dynamic metadata.
DVI (Digital Visual Interface)—DVI is a VESA (Video Electronics Standards
Association) standard interface for a digital display system. DVI sockets are found
on the back panel of AIBs and some PCs and also on flat panel monitors and TVs,
DVD players, data projectors, and cable TV set-top boxes. DVI was introduced in 1999 and uses TMDS signaling. DVI supports High-bandwidth Digital Content Protection,
which enforces digital rights management (see HDCP).
Dynamic contrast—The dynamic contrast shows the ratio between the brightest
and the darkest color, which the display can reproduce over time, for example, in the
course of playing a video.
EDF—Emissive Distribution Functions.
eGPU—An AIB with a dGPU located in a stand-alone cabinet (typically called a
breadbox) and used as an external booster and docking station for a notebook.
Electronic imaging—Electronic imaging is a broad term that defines a system of
image capture using a focusing lens sensor with a sensor behind it to translate the
image into electronic signals. Those signals are then filtered, processed, and made
available for storage and/or display. A technique for inputting, recording, processing,
storing, transferring, and using images (ISO 12651-1). Using computers and/or
specialized hardware/software to capture (copy), store, process, manipulate, and
distribute “flat information” such as documents, photographs, paintings, drawings,
and plans, through digitization.
End-to-end latency—See Motion-To-Photon Latency.
Energy conservation—The concept of energy conservation states that an object
cannot reflect more light than it receives.

Energy conservation scales

For practical purposes, more diffuse and rough materials will reflect dimmer and wider highlights, while smoother and more reflective materials will reflect brighter and tighter highlights.
Error correction model (ECM)—Belongs to a category of multiple time series
models most commonly used for data where the underlying variables have a long-
run stochastic trend, also known as co-integration. ECMs are a theoretically driven
approach useful for estimating both short-term and long-term effects of one-time
series on another. The term error correction relates to the fact that last-periods devia-
tion from a long-run equilibrium, the error, influences its short-run dynamics. Thus,
ECMs directly estimate the speed at which a dependent variable returns to equilibrium
after a change in other variables.
Extended mode—Extended mode creates one virtual display with the resolution
of all participating monitors. Depending on the hardware and software employed,
the monitors may have to have the same resolution (there’s more on this in the next
sections). Both of these modes present the display space to the user as a contiguous
area, allowing objects to be moved between, or even straddled across displays as if
they are one.
FEA—Finite element analysis.
Field of view—The field of view (also field of vision, abbreviated FOV) is the
extent of the observable world that is seen at any given moment. In case of optical
instruments or sensors, it is a solid angle through which a sensor detects the presence
of light.
Fixed function—Fixed function accelerator AIBs take some of the load off the
CPU by executing specific graphics functions, such as BitBlt operations and line
draws. That makes them better than frame buffers for environments that heavily
load the system CPU, such as Windows. Those types of AIBs have also been called
Windows and graphical user interface (GUI) accelerators.
A fixed function can also apply to the graphics pipeline, such as a T&L stage or
a tessellation stage.
Flat shading—A rendering method to determine brightness by the normal vector
on a polygon and the position of the light source and to shade the entire surface of
a polygon with the color of the brightness. This rendering method produces a clear
difference in the colors of adjacent polygons, making their boundary lines visible,
so it is unsuitable for rendering smooth surfaces.
Floating-point unit—An Arithmetic Unit which operates on floating-point data.
Most general-purpose floating-point units observe the IEEE 754 standard which
governs formats, precision, rounding, handling of exceptions, etc. Special purpose
AUs found in GPUs and other DSPs optimized for specific tasks do not always do
so and hence different results can be obtained for the same instructions executed on
different AUs. This is one of the challenges of heterogeneous computing.

FLOPS—An acronym for Floating-point Operations Per Second, used as a measure of the computational throughput of a floating-point arithmetic unit.
FOV—See field of view.
Foveated imaging—A digital image processing technique in which the image
resolution, or amount of detail, varies across the image according to one or more
“fixation points.” A fixation point indicates the highest resolution region of the image
and corresponds to the center of the eye’s retina, the fovea.
Foveated rendering—A graphics rendering technique which uses an eye tracker
integrated with a virtual reality headset to reduce the rendering workload by limiting
the image quality in the peripheral vision (outside of the zone gazed by the fovea).
FPGA—A Field Programmable Gate Array is a reprogrammable logic gate chip
whose internal gate connections can be altered by downloading a bitstream to the
card with a special program written for that purpose.
FPU—A floating-point unit (FPU) is a part of a computer system specially
designed to carry out operations on floating-point numbers. Typical operations are
addition, subtraction, multiplication, division, and square root. FPUs can be found
within a CPU, in GPU shaders, and in DSPs and stand-alone coprocessors.
Fragment shader—Pixel shaders, also known as fragment shaders, compute color
and other attributes of each fragment. The simplest kinds of pixel shaders output one
screen pixel as a color value; more complex shaders with multiple inputs/outputs are
also possible. Pixel shaders range from always outputting the same color, to applying
a lighting value, to doing bump mapping, shadows, specular highlights, translucency,
and other phenomena. They can alter the depth of the fragment for z-buffering.
Frame buffer—The separate and private local memory for a GPU on a graphics
AIB. The term frame buffer is a bit out of date since the GPU’s local memory holds
much more than just a frame or an image for the display as they did when originally
developed. Today, the GPU’s local memory holds programs (known as shaders) and
various textures, as well as partial results from various calculations, and two to three
sets of images for the display as well as depth information known as a z-buffer.
Frame-rate control (FRC)—A method which allows the pixels to show more color tones. With quick cyclic switching between different color tones, an illusion of a new intermediate color tone is created. For example, by using FRC, a 6-bit display panel can show the 16.7 million colors typical of 8-bit display panels, instead of the standard 262,144 colors. There are different FRC algorithms.
Frame-rate converter (FRC)—Frame rate, also known as frame frequency and
frames per second (FPS), is the frequency (rate) at which an imaging device produces
unique consecutive images called frames. FRC (Frame-Rate Conversion) algorithms
are used in compression, video format conversion, quality enhancement, stereo
vision, etc. An FRC algorithm increases the total number of frames in a video sequence by inserting new (interpolated) frames between each pair of neighboring frames of the original video sequence.
FreeSync—The brand name for an adaptive synchronization technology for LCD displays that support a dynamic refresh rate aimed at reducing screen tearing.
FreeSync was initially developed by AMD. FreeSync is a hardware/software solution
that utilizes DisplayPort Adaptive-Sync protocols to enable smooth, tearing-free and
low-latency gameplay.
Frustum, viewing—A viewing frustum is the 3D volume in a scene relative to
the viewer. The shape of the volume affects how models are projected from camera
space onto the screen. The most common type of projection, a perspective projection,
is responsible for making objects near the camera appear bigger than objects in
the distance. For perspective viewing, the viewing frustum can be visualized as a
pyramid, with the camera positioned at the tip. This pyramid is intersected by a front
and back clipping plane. The volume within the pyramid between the front and back
clipping planes is the viewing frustum. Objects are visible only when they are in this
volume.
G-buffer—See tile-based deferred rendering (TBDR).
Gamma correction—Gamma correction, gamma nonlinearity, gamma encoding,
or often simply gamma, is the name of a nonlinear operation used to code and decode
luminance or tristimulus values in video or still image systems. Gamma correction
is, in the simplest cases, defined by the power-law expression V_out = A · V_in^γ, where the non-negative input value V_in is raised to the power γ and multiplied by the constant A to get the output value V_out.
[Figure: Plot of the sRGB standard gamma-expansion nonlinearity (red), and its local gamma value, slope in log–log space (blue).]
In most computer systems, images are encoded with a gamma of about 0.45
and decoded with a gamma of 2.2. The sRGB color space standard used with most
cameras, PCs, and printers does not use a simple power-law nonlinearity as above, but
has a decoding gamma value near 2.2 over much of its range. Gamma is sometimes
confused and/or improperly used as “gamut.”
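As an illustration of the simple power-law case above (not the full piecewise sRGB transfer function), a minimal C++ sketch with an assumed display gamma of 2.2; the function names are illustrative:

```cpp
#include <cmath>
#include <cstdio>

// Encode a linear intensity value in [0, 1] with a simple power law
// (an encoding gamma of about 1/2.2, i.e., roughly 0.45).
double gamma_encode(double linear, double display_gamma = 2.2) {
    return std::pow(linear, 1.0 / display_gamma);
}

// Decode (expand) an encoded value back to linear light with a gamma of 2.2.
double gamma_decode(double encoded, double display_gamma = 2.2) {
    return std::pow(encoded, display_gamma);
}

int main() {
    double linear = 0.5;
    double encoded = gamma_encode(linear);
    std::printf("linear %.3f -> encoded %.3f -> decoded %.3f\n",
                linear, encoded, gamma_decode(encoded));
    return 0;
}
```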
Gamut—In color reproduction, including computer graphics and photography, the gamut or color gamut is a certain complete subset of colors.
[Figure: Typical gamut map. The grayed-out horseshoe shape is the entire range of possible chromaticities, displayed in the CIE 1931 chromaticity diagram format.]
The most common usage refers to the subset of colors which can be accurately
represented in a given circumstance, such as within a given color space or by a certain
output device.
Also see color gamut and wide color gamut.
GDDR—An abbreviation for graphics double data rate memory, a type of synchronous graphics random-access memory (SGRAM) with a high-bandwidth ("double data rate") interface designed for use in graphics cards, game consoles, and high-performance computation.
Geometry engine—Geometric manipulation of modeling primitives and transfor-
mations are applied to the vertices of polygons, or other geometric objects used as
modeling primitives, as part of the first stage in a classical geometry-based graphic
image rendering pipeline, which is referred to as the geometry engine. Geometry
transformations were originally implemented in software on the CPU or a dedicated
floating-point unit, or a DSP. In the early 1980s, a device called the geometry engine
was developed by Jim Clark and Marc Hannah at Stanford University.
Geometry shaders—Geometry shaders, introduced in Direct3D 10 and OpenGL
3.2, generate graphics primitives, such as points, lines, and triangles, from primi-
tives sent to the beginning of the graphics pipeline. Executed after vertex shaders,
geometry shader programs take as input a whole primitive, possibly with adjacency
information. For example, when operating on triangles, the three vertices are the
geometry shader’s input. The shader can then emit zero or more primitives, which
are rasterized and their fragments ultimately passed to a pixel shader.
Global illumination—"Global illumination" (GI) is a term for lighting systems that model indirect lighting, i.e., light that reaches a surface after bouncing off other surfaces rather than arriving directly from a light source. Without indirect lighting, scenes can look harsh and artificial. However, while light received directly is fairly simple to compute, indirect lighting computations are highly complex and computationally heavy.
Gouraud shading—A rendering method that produces gradual color shading over the entire surface of a polygon by determining brightness from the normal vector at each vertex of the polygon and the position of the light source, and then performing linear interpolation between vertices.
The normal vector at each vertex can be determined by taking an average of
the normal vectors of all the polygons having the common vertex. For a triangular
polygon, the brightness at each vertex is determined by the normal vector obtained
for each vertex and the position of the light source. Therefore, the brightness of pixels
inside a triangle is determined by interpolation. This rendering method represents
gradual color variations between adjacent polygons, so it is suitable for rendering
smooth surfaces.
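A minimal C++ sketch of the interpolation step described above, assuming the per-vertex brightness values and the pixel's barycentric weights have already been computed; the names are illustrative:

```cpp
// Gouraud shading, per-pixel step: the brightness computed at the three vertices
// (from their normals and the light position) is linearly interpolated across the
// triangle. b0, b1, b2 are the vertex brightness values; w0, w1, w2 are the pixel's
// barycentric weights inside the triangle (they sum to 1).
float gouraud_pixel(float b0, float b1, float b2, float w0, float w1, float w2) {
    return w0 * b0 + w1 * b1 + w2 * b2;
}
```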
GPC—A graphics processing cluster (GPC) is a group, or collection, of specialized processors known as shaders, or simultaneous multiprocessors, or stream processors. Organized as a SIMD processor, they can execute (process) the same instruction (program or kernel) simultaneously, or in parallel. Hence, they are known as a parallel processor. (A shader is a computer program that is used to do shading: the production of appropriate levels of color within an image.)
GPU (graphics processing unit)—The GPU is the chip that drives the display
(monitor) and generates the images on the screen (and has also been called a Visual
Processing Unit or VPU). The GPU processes the geometry and lighting effects and
transforms objects every time a 3D scene is redrawn—these are mathematically inten-
sive tasks, and hence the GPU has upwards of hundreds of floating-point processors
(also called shaders or stream processors). Because the GPU has so many powerful
32-bit floating-point processors, it has been employed as a special purpose processor
for various scientific calculations other than display and is referred to as a GPGPU in
that case. The GPU has its own private memory on a graphics AIB which is called a
frame buffer. When a small (less than five processors) GPU is put inside a northbridge
(making it an IGP), the frame buffer is dropped and the GPU uses system memory.
The GPU has to be compatible with several interface standards including software
APIs such as OpenGL and Microsoft’s DirectX, physical I/O standards within the
PC such as Intel’s Accelerated Graphics Port (AGP) technology and PCI Express,
and output standards known as VGA, DVI, HDMI, and DisplayPort.
GPU compute (GPGPU—General-Purpose Graphics Processor Unit)—The term
“GPGPU” is a bit misleading in that general-purpose computing such as the type
an x86 CPU might perform cannot be done on a GPU. However, because GPUs
have so many (hundreds in some cases) powerful (32-bit) floating-point processors,
they have been employed in certain applications requiring massive vector operations
and mathematical intensive problems in science, finance, and aerospace applications.
The application of a GPU can yield several orders of magnitude higher performance
than a conventional CPU.
GPU Preemption—The ability to interrupt or halt an active task (context switch) on a processor, replace it with another task, and then later resume the previous task. In the era of single-core CPUs, preemption was how multitasking was accomplished. Interruption in a GPU, which is designed for streaming
processing, is problematic in that it could necessitate a restart of a process and
thereby delay a job. Modern GPUs can save state and resume a process as soon as
the interruptive job is finished.
Graphics adapters—A graphics adapter is the device, subsystem, add-in board,
chip, or adapter used to generate a synthetic image and drive a display. It has been
called many things over the decades. Here are the names used in this book. The
differences may seem subtle, but they are used to differentiate one device from
another. For example, it is common to see the acronym GPU used when speaking or
writing about an add-in board. They are not synonyms, and a GPU is a component
of an AIB. That is not a pedantic diatribe. It would be like referring to an engine
or transmission to denote an entire automobile or truck. Part of the reason for the
misuse of terms is misunderstanding, another reason is the ease of speech (like calling
someone Tom instead of Thomas), and the third is that it is more fun and exciting to
use. People like to say GPGPU, an initialism for general-purpose GPU, as a shorthand
notation for GPU compute. So, we cannot be the terminology police, but we can try
to clarify the differences. Generally, an acronym should be a pronounceable word.
Graphics controller—A graphics controller or graphics chip is a non-
programmable device designed primarily to drive a screen. More advanced versions
have some primitive drawing or shading graphic capabilities. The primary differ-
entiation between a controller and coprocessor or GPU is the programmable
capability.
Graphics coprocessors—Coprocessors (also written as co-processors) can serve as programmable processors, such as the Texas Instruments TI TMS34010 and TI TMS34020 series. Coprocessors can run all the graphics functions of an API and
display lists for applications such as CAD.
Graphics driver—A device driver is a software stack that controls computer graphics hardware and supports graphics rendering APIs; it may be released under a proprietary or a free and open-source software license. Graphics device drivers are written for specific
hardware to work within the context of a specific operating system kernel and to
support a range of APIs used by applications to access the graphics hardware. They
may also control output to the display, if the display driver is part of the graphics
hardware.
G-Sync—A proprietary adaptive sync technology developed by Nvidia aimed
primarily to eliminate screen tearing and the need for software deterrents such as
V-sync. G-Sync eliminates screen tearing by forcing a video display to adapt to the
framerate of the outputting device rather than the other way around, which could
traditionally be refreshed halfway through the process of a frame being output by
the device, resulting in two or more frames being shown at once.
HBAO+—Developed by Nvidia, HBAO+, the company claims, improves upon existing Ambient Occlusion (AO) techniques and adds richer, more detailed, more
realistic shadows around objects that occlude rays of light. Compared to previous
techniques, Nvidia claims HBAO+ is faster, more efficient, and significantly better.
HBM (high bandwidth memory)—HBM is a high-performance RAM interface for 3D-stacked DRAM from AMD and Hynix. It is used in conjunction with high-performance graphics accelerators and network devices. The first devices to use HBM were the AMD Fiji GPUs.
HDCP (High-bandwidth Digital Content Protection)—HDCP is an encryption
system for enforcing digital rights management (DRM) over DVI and HDMI inter-
faces. The copy protection system (DRM) resides in the computer and prevents the
user of the PC from copying the video content.
HDMI (High-Definition Multimedia Interface)—HDMI is a digital, point-to-point
interface for audio and video signals designed as a single-cable solution for home
theater and consumer electronics equipment and also supported in graphics AIBs
and some PC motherboards. Introduced in 2002 by the HDMI consortium, HDMI is
electrically compatible with video-only DVI.
Heterogeneous processors—Heterogeneous computing refers to systems that use
more than one kind of processor or cores. These systems gain performance or energy
efficiency not just by adding the same type of processors, but by adding dissim-
ilar coprocessors, usually incorporating specialized processing capabilities to handle
particular tasks.
Hexadecimal—Hexadecimal is the base-16 number system, which has the
following digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F. Each hexadec-
imal digit therefore can also be represented by 4 bits (also called a “Nibble”), with
two hexadecimal digits fully occupying a byte (8 bits). Also referred to as “Hex.”
Hexadecimal notation is frequently used in low-level programming, such as accessing
a graphics chip or writing device drivers. In the C and C++ programming languages,
hexadecimal numbers are designated by prefixing them with a “0x” (zero “x”), while
in Intel assembly language, hexadecimal numbers have a suffix of “H” and may
have a prefix of “0” (zero) if the first digit is greater than “9.” For example, the hex
number notation for “E988” would appear as 0xE988 in C or C++ and as 0E988H
in assembly language.
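The two notations can be illustrated with a short C++ fragment (the assembly-style form appears only in the comment):

```cpp
#include <cstdio>

int main() {
    unsigned int value = 0xE988;   // C/C++ hex literal; written as 0E988H in Intel assembly
    std::printf("hex 0x%X = decimal %u\n", value, value);   // prints: hex 0xE988 = decimal 59784
    return 0;
}
```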
HDR10—10-bit HDR using BT.2020, PQ, and static metadata.
High dynamic range (HDR)—A dynamic range higher than what is considered to be standard dynamic range. The term is often used in discussing displays, photography, 3D rendering, and high-dynamic-range TV (ITU-R BT.2100). Also see wide color gamut.
High-dynamic-range imaging (HDRI)—The compositing and tone-mapping of
images to extend the dynamic range beyond the native capability of the capturing
device.
HEVC—High-Efficiency Video Coding (ITU-T H.265), approximately 2× more efficient than AVC.
HFR—high frame rate (100 and 120 fps).
HLG—Hybrid Log Gamma Transfer Function for HDR signals (ITU-R BT.2100).
HLG defines a nonlinear transfer function in which the lower half of the signal values
uses a gamma curve (SD and HD) and the upper half of the signal values uses a
logarithmic curve. HLG is backwards compatible with SDR.
HPU (Heterogeneous Processor Unit)—An integrated multi-core processor
with two or more x86 cores and four or more programmable GPU cores.
Hull Shaders—See tessellation shaders.
IGP (integrated graphics processor)—An IGP is a chip that is the result of inte-
grating a graphics processor with the northbridge chip (see northbridge and chipset).
An IGP may refer to enhanced video capabilities, such as 3D acceleration, in contrast
to an IGC (integrated graphics controller) that is a basic VGA controller. When a
small (less than five processors) GPU is put inside a northbridge (making it an IGP),
the frame buffer is dropped and the GPU uses system memory; this is also known as
a UMA—unified memory architecture.
iGPU—A scaled down version, with fewer shaders (processors) than a discrete
GPU which uses shared local RAM (DDR) with the CPU.
Image sensor—An image sensor, photo-sensor, or imaging sensor is a device,
which detects the presence of visible light, infrared transmission (IR), and/or ultra-
violet (UV) energy. That information constitutes an image. It does so by converting
the variable attenuation of waves of light (as they pass through or reflect off objects)
into electrical signals. Image sensors are used in electronic imaging devices of both
analog and digital types, which include digital cameras, camera modules, medical
imaging equipment, and night vision equipment such as thermal imaging devices,
radar, sonar, and others. The Digital Image Sensor is an Integrated Circuit Chip which
has an array of light sensitive components on the surface. The array is formed by the
individual photosensitive points. Each photosensitive sensor point inside the image
circle acts to convert the light to an electrical signal. The full set of electrical signals
are converted into an image by the on-board computer.
ISP, Image Signal Processor—An ISP refers to a processing unit which accepts
as input the raw samples from an imaging sensor and converts them into a human-
viewable image. The samples may have undergone some preprocessing by the sensor
circuitry to abstract certain details of the sensor operation but in general, they are
presented in the form of a “mosaic” of color samples without correction for things
like lens distortion, defective pixels, and temporal sampling artifacts. These things as
well as extracting the image from the color sample mosaic and encoding the output
into a standard format are the responsibility of the ISP.
ITU-R BT.2020—AKA Rec2020 defines various aspects of ultra-high-definition
television (UHDTV) with standard dynamic range (SDR) and wide color gamut
(WCG), including picture resolutions, frame rates with progressive scan, bit depths,
and color primaries.
ITU-R BT.2100—Defines various aspects of high-dynamic-range (HDR) video
such as display resolution (HDTV and UHDTV), bit depth, Bit Values (Files), frame
rate, chroma subsampling, and color space.
ITU-R BT.709—AKA Rec709 standardizes the format of high-definition televi-
sion, having a 16:9 (widescreen) aspect ratio.
Jitter—In computer graphics, to "jitter a pixel" means to place it off to the side of
its normal placement by some random amount in order to achieve a more natural
appearance. It is also described as shaking. The term is used in several ways, but it
always refers to some offset of time and space from the norm.
JPR—Jon Peddie Research.
Judder—Vertical synchronization can also cause artifacts in video and movie
presentations, as they are generally recorded at frame rates significantly lower than
the typical monitor frame rates (24–30 frame/s). When such a movie is played on a
monitor set for a typical 60 Hz refresh rate, the video player misses the monitor’s
deadline fairly frequently, in addition to the interceding frames being displayed at a slightly higher rate than intended, resulting in an effect similar to judder (see telecine: frame rate differences).
KB—Kilobyte. 1024 bytes, where each byte consists of 8 bits of data.
LCD (liquid crystal display)—The technology used for displays in notebook and
other smaller computers. Like light-emitting diode (LED) and gas-plasma tech-
nologies, LCDs allow displays to be much thinner than cathode ray tube (CRT)
technology.
Least significant bit—In a number in binary representation, the least significant
bit is the one with the lowest value, i.e., the rightmost bit when a number is shown
in traditional binary form. When this term is in plural form, it refers to multiple bits
of least significance.
Level of detail—A term used to describe one or more different sets of detail
defining a particular geometry or raster image. A low level-of-detail geometry or
image is one that would be used for rendering an image at a great viewing distance,
while a high level of detail would be used when at a short viewing distance. In raster
images where texture mapping is used for rendering, level of detail is used inter-
changeably with the term “mip-map.” In normal 3D graphics usage, a low numbered
level of detail refers to higher detail, with a level of detail 0 (zero) being the highest
level of detail. See “mip-map.”
LOD—See level of detail.
LOD Value—A value, used mostly in raster rendering, to define a calculated
viewing distance in terms of LODs. For example, an LOD value of 1.585 would
indicate a view distance located between LOD 1 and LOD 2.
Long Short-Term Memory (LSTM)—A component of a recurrent neural network
that includes a memory cell. The component can be used to “remember” events over
arbitrary periods of time. LSTM can also refer to any network that makes use of
LSTM components.
LSB—See least significant bit.
Luminance—A photometric measure of the luminous intensity per unit area of
light traveling in a given direction. It describes the amount of light that passes through,
is emitted or reflected from a particular area, and falls within a given solid angle.
The SI unit for luminance is candela per square meter (cd/m²). A non-SI term for the same unit is the "nit." The CGS unit of luminance is the stilb, which is equal to one candela per square centimeter or 10 kcd/m².
LUT—Acronym for Look-Up Table. LUTs are part of the RAMDAC of a graphics
subsystem and in modern graphics chips are usually located within the chip itself.
The LUT is the part of the output section of a graphics board which translates a pixel
value (primarily in 4 or 8 BPP indexed color modes) into its red, green, and blue
components. Once the components have been determined, they are passed through
the three DACs (red, green, and blue) to generate displayable signals. A diagram of this operation would show an 8-bit pixel, with a value of 250, being used as an index into the LUTs to select its red, green, and blue components.
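A minimal C++ sketch of that indexed-color lookup, assuming a 256-entry palette and an 8-bit pixel value; the names are illustrative:

```cpp
#include <array>
#include <cstdint>

struct RGB { std::uint8_t r, g, b; };

// The LUT (palette): 256 entries, one per possible 8-bit indexed pixel value.
std::array<RGB, 256> lut{};

// Translate an indexed pixel value into its red, green, and blue components,
// which are then passed to the three DACs (or the digital display interface).
RGB lookup(std::uint8_t pixel_value) {
    return lut[pixel_value];   // e.g., a pixel value of 250 selects palette entry 250
}
```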

LUX—A Lux is one lumen per square meter.


M&A—Mergers and acquisitions.
M&E—Media and entertainment.
Mapped—A term that is used often in computer graphics, which loosely means to
be fitted to something. One maps to a spatial distribution of (something). A texture
map is a 2D image of something, bricks, or wood paneling for example.
MB—Megabyte. 1 megabyte is equal to 1024 KB = 1024 × 1024 bytes = 1,048,576 bytes.
MDL—Material definition library.
Mip-map—A mip-map is one of a series of different versions of the same texture,
each at a different resolution. Each version is generally one-quarter the size of the
version preceding it. See also “mip-mapping,” “bilinear filtering,” “trilinear filtering,”
“texel,” and “texture mapping.” Mip-mapping is the use of mip-maps during the
rendering process. For example, when an image is rendered using nearest mip-map
selection, the version of the texture that most closely matches the size of the image is
the one chosen for rendering. In a linearly interpolated mip-mapping operation (also known as "trilinear filtering"), a weighted average of the two nearest mip-maps is computed, based on the LOD value. The term mip is an acronym for the Latin expression "multum in parvo" (much in a small space), implying the presence of many images in a small package.
Mixed Reality—Mixed Reality (MR) seamlessly blends a user’s real-world envi-
ronment with digitally created content, where both environments coexist to create
a hybrid experience. In MR, the virtual objects behave in all aspects as if they are
present in the real world, e.g., they are occluded by physical objects, their lighting is
consistent with the actual light sources in the environment, and they sound as though
they are in the same space as the user. As the user interacts with the real and virtual
objects, the virtual objects will reflect the changes in the environment as would any
real object in the same space.
Motherboard—The main circuit board in a PC, also known as a system board or a planar (by IBM). Graphics AIBs and other cards (i.e., audio, gigabit Ethernet,
etc.), as well as memory, the CPU, and disk drive cables plug into the motherboard.
Motion-To-Photon Latency (MTPL)—Also known as the end-to-end latency is
the delay between the movement of the user’s head and the change of the VR device’s
display reflecting the user’s movement. As soon as the user’s head moves, the VR
scenery should match the movement. The more delay (latency) between the two
actions, the more unrealistic the VR world seems. To make the VR world realistic,
VR systems want low latency of <20 ms.
Multi-Frame Noise Reduction (MFNR)—Automatically take multiple images
continuously, combine them, reduce the noise, and record them as one image. With
multi-frame noise reduction, one can select larger ISO numbers than the maximum
ISO sensitivity. The image recorded is one combined image.
Multiplexer—A multiplexer, also known as a MUX, is an electronic device that
acts as a switching circuit. A MUX has two or more data inputs, along with a switch
or select input which determines which of the data inputs are passed to the output
portion of the device.
Multi-projection—Multi-projection can refer to an image created using multiple
projectors mapped on to a screen, or set of screens (as in a CAVE) for 3D projection
mapping using multiple projectors. It can also refer to multiple projections within a
screen in computer graphics.
Multiplayer game—Multiplayer games have traditionally meant that humans are
playing with other humans cooperatively, competing against each other, or both.
Artificial intelligence controlled players have historically been excluded from the
traditional definition of multi-player game. However, as AI technology progresses
this is likely to change. In the future, human controlled player’s skill and behavior
tracked over time could program the skill and behavior of a unique AI that can be
substituted for the human’s participation in the game.
Battle Royale is a game mode that creates a translucent dome (or other demarcation)
over/around the entire playing area. As the match progresses, the dome starts to shrink
toward a random point on the map. Players must stay within the bounds of the dome or take
damage leading to death. The shrinking dome “herds” players into smaller and smaller areas,
eventually ensuring that they will be in “close combat.” In summary, Battle Royale mode
allows large-scale combat using long range weapons and vehicles over large distances, but
eventually forces the remaining players into CQC (close quarter combat), ensuring that the
round time does not extend too long, and a new round can begin.
Permadeath is a video game and simulation feature where the player’s death eliminates
them from the ability to continue participation in the game or continue as the specific entity
they were playing. Permadeath can come in a number of forms. In multi-player combat
games, this usually means having to wait until the next round starts if killed. In most multi-
player combat games, rounds last 5 to 30 minutes; however, it is theoretically possible that
death in a game would permanently exclude the player from further participation.
In other multi-player combat games, permadeath can mean losing all your equipment
and your position on the map, forcing the player to respawn with no equipment as a new
“entity.” Even though there is no waiting period, the ramifications of dying are significant,
as the player has often spent significant time equipping themselves and moving to strategic
areas of the map.
Persistent world games track (or attempt to track) the entire game universe as individual
objects and the state of each object in one single instance. For example, in a massive multi-
player persistent world games if a tree is cut down, the tree will forever be cut down for all
players, and for all of time. In single player games, the user sometimes has the ability to
“restart” the universe or run multiple iterations of the universe. Running multiple iterations
is known as “sharding” the universe. In this former case, the tree would reappear or in the
latter case have various states of being dependent on the shard being played.
Sharded world games can have multiple simultaneous existences of the same base game
universe in varying degrees of state. Sharded world games that employ procedurally gener-
ated universes can have multiple simultaneous versions of non-matching universes. In either
case for multi-player, this is usually done to reduce the server load of players and reduce
latency by grouping players from geographical regions into the most optimal “shard.” There
can be hundreds of servers running the same universe but with unique player participation
and parametric states of being.

MUX—See multiplexer.
Nit—A nit is one candela per square meter (cd/m²).
Normal map—A normal map can be referred to as a newer, better type of bump map. A normal map creates the illusion of depth detail on the surface of a model, but it does it differently than a bump map, which uses grayscale values to provide either up or down information. It is a technique used for faking the lighting of bumps and
dents—an implementation of bump mapping. It is used to add details without using
more polygons.
Northbridge—The northbridge is the controller that interconnects the CPU to
memory via the frontside bus (FSB). It also connects peripherals via high-speed
channels such as PCI Express and the AGP bus.
NTSC (National Television Systems Committee)—Analog color television system standard used in the U.S.A., Canada, Mexico, and Japan. Other standards include PAL and SECAM.
NURBS—Non-uniform rational basis spline (NURBS) is a mathematical model
commonly used in computer graphics for generating and representing curves and
surfaces. It offers great flexibility and precision for handling both analytic (surfaces
defined by common mathematical formulae) and modeled shapes.
Oscilloscope—Early oscilloscopes used cathode ray tubes (CRTs) as their display
element. Storage oscilloscopes used special storage CRTs to maintain a steady display
of a signal briefly presented. Storage scopes (e.g., Tektronix 4010 series) were often
used in computer graphics as vector scopes.
ODM—Original design manufacturer.
OLED (organic light-emitting diode)—A light-emitting diode (LED) in which
the emissive electroluminescent layer is a film of organic compound that emits light
in response to an electric current. This layer of organic semiconductor is situated
between two electrodes; typically, at least one of these electrodes is transparent.
OLEDs are used to create digital displays in devices such as television screens,
computer monitors, and portable systems such as mobile phones.
Open Graphics Library (OpenGL)—A cross-language, cross-platform application
programming interface (API) for rendering 2D and 3D vector graphics. The API is
typically used to interact with a graphics processing unit (GPU), to achieve hardware-
accelerated rendering.
OpenVDB—OpenVDB is an Academy Award-winning open-source C++ library
comprising a novel hierarchical data structure and a suite of tools for the efficient
storage and manipulation of sparse volumetric data discretized on three-dimensional
grids. It was developed by DreamWorks Animation for use in volumetric applica-
tions typically encountered in feature film production and is now maintained by the
Academy Software Foundation (ASWF). https://github.com/AcademySoftwareFoundation/openvdb.
Ordered dither—An ordered dither is the application of a series of dither values which change according to a particular pattern, most often in the form of a matrix (also referred to as a "dither matrix"), over the course of a set of dithering operations.
An ordered dither is traditionally applied positionally, using the modulus of the
destination pixel X and Y position as an index into the dither matrix.
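A minimal C++ sketch of a positional ordered dither using a common 4 × 4 Bayer matrix; the threshold scaling and function name are illustrative:

```cpp
#include <cstdint>

// A standard 4 x 4 Bayer dither matrix (threshold pattern, values 0..15).
static const int kDither4x4[4][4] = {
    { 0,  8,  2, 10},
    {12,  4, 14,  6},
    { 3, 11,  1,  9},
    {15,  7, 13,  5},
};

// Reduce an 8-bit value to on/off using an ordered dither: the threshold is taken
// from the matrix, indexed positionally by the destination pixel's x % 4 and y % 4.
bool dither_to_1bit(std::uint8_t value, int x, int y) {
    int threshold = (kDither4x4[y % 4][x % 4] * 255) / 16;
    return value > threshold;
}
```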
Outside-In-Tracking—Outside-In-Tracking is a form of positional tracking where
fixed external sensors placed around the viewer are used to determine the position
of the headset and any associated tracked peripherals. Various methods of tracking
can be used, including, but not limited to, optical and IR.
PAL—Analog TV system used in Europe and elsewhere.
Palette—The computer graphics term “palette” is derived from the concept of an
artist’s palette, the flat piece of material upon which the artist would select and blend
his colors to create the desired shades. The palette on a graphics board specifies the
range of colors available in any one pixel. For example, standard VGAs tend to have
a palette of 262,144 colors, stemming from the fact that each color in the palette is composed of 6 bits each of red, green, and blue (a total of 18 bits, and 2¹⁸ = 262,144).
However, since the VGA can only display 16 or 256 colors on-screen at any one
time, it means that each one of these 16 or 256 colors must be chosen from the larger
palette via a set of LUTs. See “LUT” for details.
PAM—Potential available market.
PCI—Acronym for Peripheral Component Interconnect. PCI is a bus standard which Intel developed to overcome the performance bottlenecks inherent in the ISA bus
design, and most modern graphics boards are PCI-based (i.e., they need to be inserted
into the PCI bus in order to work).
Phong shading—Refers to an interpolation technique for surface shading in 3D
computer graphics. It is also called Phong interpolation or normal-vector interpola-
tion shading. Specifically, it interpolates surface normals across rasterized polygons
and computes pixel colors based on the interpolated normals and a reflection model.
Phong shading may also refer to the specific combination of Phong interpolation and
the Phong reflection model.
Ping-pong buffering—A technique for managing the sharing of real-time streaming data between software threads or hardware units. Two or more buffers are allocated, and the thread or hardware unit responsible for acquiring the data is given control of the first buffer. As soon as that buffer is filled, control is handed to the thread or unit responsible for processing the data. While processing is happening, control of the second buffer is given to the input thread or unit and is filled with the streaming data. Filling and processing continue to alternate, or ping-pong, between buffers in this fashion indefinitely. Variations in the rate of data input and processing speed can result in overflow, which can sometimes be prevented by allocating more than two buffers, in which case the system is referred to as a circular buffer.
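A minimal, single-threaded C++ sketch of the alternation described above; in a real system the fill and process steps would overlap on separate threads or hardware units, and all names here are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-ins for the producer and consumer.
void acquire_samples(std::vector<float>& buffer) { (void)buffer; /* fill with streaming data */ }
void process_samples(const std::vector<float>& buffer) { (void)buffer; /* consume the data */ }

void ping_pong_loop(std::size_t buffer_size, int iterations) {
    std::vector<float> buffers[2] = {std::vector<float>(buffer_size),
                                     std::vector<float>(buffer_size)};
    int fill_index = 0;   // which buffer the input side currently owns
    for (int i = 0; i < iterations; ++i) {
        acquire_samples(buffers[fill_index]);       // input side fills one buffer...
        process_samples(buffers[1 - fill_index]);   // ...while the other is processed
        fill_index = 1 - fill_index;                // swap roles: ping-pong
    }
}
```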
Pixel—Acronym for PIX ELement (“Pix” is a shortened version of “Picture”).
The name given to one sample of picture information; it can refer to an individual
sample of RGB luminance or chrominance information. A pixel is the smallest unit
of display that a computer can access to display information but may consist of one
or more bits (see “BPP”).
Pixel density—Information of the number of pixels in a unit of length. With
the decrease of the display size and the increase of its resolution, the pixel density
increases.
Pixel pitch—The pixel pitch shows the distance from the centers of two neigh-
boring pixels. In displays, which have a native resolution (e.g., the TFT ones), the
pixel pitch depends on the resolution and the size of the screen.
Player versus player (PvP)—Player versus player (PvP) refers to a game that is
designed for gamers to compete against other gamers, rather than against the game’s
artificial intelligence (AI). PvP games generally feature an AI that acts as a second
player if the gamer plays solo. PvP games are the opposite of player versus envi-
ronment (PvE) games, where the player contends largely with computer-controlled
characters or situations.
Polygonal modeling—In 3D computer graphics, polygonal modeling is an
approach for modeling objects by representing or approximating their surfaces using
polygon meshes. Polygonal modeling is well suited to scanline rendering and is
therefore the method of choice for real-time computer graphics.
Power Island—In chip design, it is common practice to isolate unrelated parts of
the circuitry from each other and supply the power to each isolate region separately
so that they can be individually powered down when not required. This saves power
because CMOS transistors in particular “leak” charge into the substrate even when
inactive, as long as they are powered. It is necessary to take special steps to isolate
circuits in MOS devices because, unless modified, all transistors are (somewhat
weakly) connected together via the substrate.
PPI (pixels per inch)—The pixel density (resolution) of an electronic image
device, such as a computer monitor or television display, or image digitizing device
such as a camera or image scanner also referred to as pixels per centimeter (PPCM).
The term “dots per inch” (dpi), extended from the print medium, is sometimes used
instead of pixels per inch. The dot pitch determines the absolute limit of the possible
pixels per inch. However, the displayed resolution of pixels (picture elements) that
is set up for the display is usually not as fine as the dot pitch.
Projection mapping—Projection mapping, also known as video mapping and
spatial augmented reality, is a projection technology used to turn objects, often irreg-
ularly shaped, into a display surface for video projection. The technique dates back to
the late 1960s, where it was referred to as video mapping, spatial augmented reality,
or shader lamps. This technique is used by artists and advertisers alike who can add
extra dimensions, optical illusions, and notions of movement onto previously static
objects.
PQ—Perceptual Quantizer Transfer Function for HDR signals (SMPTE ST 2084,
ITU-R BT.2100).
PvP—See player versus player.
RAMDAC—Acronym for Random Access Memory Digital to Analog Converter.
The “RAM” portion of a RAMDAC refers to the LUTs, which by necessity are RAMs,
while the “DAC” refers to the Digital to Analog Converters. See “DAC” and “LUT”
for more details.
Raster graphics—Also called scanline or bitmap graphics, a type of digital display that uses tiny four-sided (but not necessarily square) pixels, or picture elements,
arranged in a grid formation to represent an image. Raster-scan graphics has origins in
television technology, with images constructed much like the pictures on a television
screen.
Raster-scan display—A CRT uses a raster scan. Developed for television tech-
nology, an electron beam sweeps across the screen, from top to bottom covering one
row at a time. The beam's intensity is turned on and off as it moves across each row
to create images. The screen points are referred to as pixels.
Recurrent neural network (RNN)—A class of neural networks whose connections
form a directed cyclic graph. In other words, unlike a feedforward network such as a
CNN, the connections include feedback so that outputs can affect subsequent inputs,
giving rise to temporal behaviors. An example of an RNN is the Long Short-Term
Memory (LSTM) network popular in speech recognition.
Reflective shadow maps—Reflective shadow maps (RSMs) are an extension to a
standard shadow map, where every pixel is considered as an indirect light source.
The illumination due to these indirect lights is evaluated on the fly using adaptive
sampling in a fragment shader. By using screen-space interpolation of the indirect
lighting, it is possible to achieve interactive rates, even for complex scenes. Since
visualizations and games mainly work in screen space, the additional effort is largely
independent of scene complexity. The resulting indirect light is approximate, but
leads to plausible results and is suited for dynamic scenes.
Register file—Microprocessors hold data for immediate processing in a small
amount of fast, local memory referred to as a register file. This memory is closely
managed by the compiler, and the amount and characteristics of it are crucial to the
performance of the processor. There is typically, but not always, one register file for
each ALU or execution unit in the processor, and the organization of the register file
closely matches the organization of the execution unit. For example, a register file
associated with the scalar unit of a 32-bit processor would be 32 bits wide, whereas
one associated with a SIMD vector unit with four 32-bit paths would be 128-bits
wide to hold the required four 32-bit operands.
Relative luminance—Relative luminance is formed as a weighted sum of linear
RGB components, not gamma-compressed ones. Even so, luma is often erroneously
called luminance. SMPTE EG 28 recommends the symbol Y' to denote luma and
the symbol Y to denote relative luminance.
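For example, using the Rec. 709 coefficients, relative luminance can be computed from linear RGB as a weighted sum:

```cpp
// Relative luminance (Y) from linear RGB components, using the Rec. 709 / sRGB
// primaries. The inputs must be linear; applying these weights to gamma-compressed
// R'G'B' values yields luma (Y'), not relative luminance.
double relative_luminance(double r, double g, double b) {
    return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}
```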
Render farm—A render farm is a high-performance computer system, e.g., a
computer cluster, built to render computer-generated imagery (CGI), typically for
film and television visual effects.
Resolution, screen resolution—The number of horizontal and vertical pixels on
a display screen. The more pixels, the more information is visible without scrolling.
Screen resolutions have a pixel count such as 1600 × 1200, which means 1600
horizontal pixels and 1200 vertical pixels.
RGB—Red, green, and blue. Color components of a pixel blended to create a
specific color on a display monitor. See “color” for additional details.
Room-Scale VR—Room-Scale VR is an implementation of 6 DoF including the
required use of Spherical Video tracking, where the user is able to move around a room
sized environment using real-world motion as reflected in the virtual environment.
ROP—ROP stands for raster operator; raster operators (ROPs) handle several
chores near the end of the pixel pipeline. ROPs handle anti-aliasing, Z and
color compression, and the actual writing of the pixel to the output buffer.
Rounding—An arithmetic operation which adjusts a number up or down relative
to its magnitude in relation to some defined magnitude. Normally, rounding is used
to adjust a number in integer-fractional format to integer, with fractional values of 0
to less than 0.5 (or 1/2) being adjusted downward and fractional values of 0.5 to less
than 1 being adjusted upward. Programmatically, rounding is usually accomplished
by adding 0.5 to a number and then truncating the resulting fractional amount. See
“truncation.” For example, 1.585 rounded is 2, while 1.499 rounded is 1.
RSMs—See reflective shadow maps.
RT—Ray tracer or ray tracing.
SaaS—Software as a service.
SAM—Served available market.
Scanline display—See raster graphics display.
Scanline rendering—An algorithm for visible surface determination, in 3D
computer graphics, that works on a row-by-row basis rather than a polygon-by-
polygon or pixel-by-pixel basis.
Screen size—On 2D displays, such as computer monitors and TVs, the display
size (or viewable image size or VIS) is the physical size of the area where pictures
and videos are displayed. The size of a screen is usually described by the length of
its diagonal, which is the distance between opposite corners, usually in inches.
Screen tearing—A visual artifact in video display where a display device shows
information from multiple frames in a single screen draw. The artifact occurs when
the video feed to the device is not in sync with the display’s refresh rate. This can be
due to non-matching refresh rates—in which case the tear line moves as the phase
difference changes (with speed proportional to difference of frame rates).
SDK—Software development kit.
SECAM—Analog TV system used in France and parts of Russia and the Mideast.
SDR—Standard dynamic range TV (Rec.601, Rec.709, Rec.2020).
Shaders—Shaders is a broadly used term in graphics and can pertain to the
processing of specialized programs for geometry (known as vertex shading or
transform and lighting), or pixel shading.
Shifter—A device which shifts numbers 1 or more bit positions. For example, the
decimal number 14 (1110b), when passed through a shifter which shifts one bit to the right, would produce decimal 7 (0111b). Each bit shift to the right is equivalent to an
integer divide by 2, while each bit shift to the left is equivalent to an integer multiply
by 2. Shifters are normally used to scale values up or down.
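A short C++ illustration of those two shifts:

```cpp
#include <cstdio>

int main() {
    unsigned int n = 14;           // 1110 in binary
    std::printf("%u\n", n >> 1);   // shift right by 1: 7 (0111b), an integer divide by 2
    std::printf("%u\n", n << 1);   // shift left by 1: 28 (11100b), an integer multiply by 2
    return 0;
}
```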
SIMD—Single Instruction, Multiple Data describes computers with multiple
processing elements that perform the same operation on multiple data points simul-
taneously. Such machines exploit data level parallelism, but not concurrency: There
are simultaneous (parallel) computations, but only a single process (instruction) at a
given moment. SIMD is particularly applicable to common tasks like adjusting the
contrast and colors in a digital image.
SOM—Share of market.
Southbridge—The southbridge controller handles the remaining I/O, including
the PCI bus, parallel and Serial ATA drives (IDE), USB, FireWire, serial and parallel
ports, and audio ports. Earlier chipsets supported the ISA bus in the southbridge.
Starting with Intel’s 8xx chipsets, northbridge and southbridge were changed to
memory controller and I/O controller (see Intel Hub Architecture).
Span mode—Some applications, such as games, have an explicit screen resolution setting. They will typically default to the monitor's registered resolution. In span mode, which is a feature of the driver provided by the GPU supplier, it is possible to make one contiguous display that spans across all the monitors you choose. Then, when the application is opened, it will fill the screens.
sRGB—sRGB is a color space, developed jointly by Hewlett-Packard and
Microsoft in 1996. It is used in different devices such as printers, displays, TV sets,
and cameras. The sRGB color space covers about 72% of the NTSC color space.
Static contrast—The static contrast shows the ratio between the brightest and the
darkest color, which the display can reproduce simultaneously, for example, within
one and the same frame/scene.
Stream processors—A stream processor is a floating-point processor found in a
GPU and is also known as a shader processor.
Stuttering—A term used to describe a quality defect that manifests as irregular
delays between frames rendered by the GPU(s), causing the instantaneous frame
rate of the longest delay to be significantly lower than the frame rate reported (by
a benchmarking application). At lower frame rates, when this effect is apparent,
the moving video appears to stutter, resulting in a degraded gameplay experience in
the case of a video game, even though the frame rate seems high enough to provide
a smooth experience. Single-GPU configurations do not suffer from this defect in
most cases and can in some cases output a subjectively smoother video compared
to a multi-GPU setup using the same video card model. Microstuttering is inherent
to multi-GPU configurations using alternate frame rendering (AFR), such as Nvidia
SLi and AMD CrossFireX but can also exist in certain cases in single-GPU systems.
Subdivision surface—Subdivision smooths and adds extra resolution to curves
and surfaces at display and/or render time. The renderer subdivides the surface until
it’s smooth down to the pixel level. The smooth surface can be calculated from the
coarse mesh as the limit of recursive subdivision of each polygonal face into smaller
faces that better approximate the smooth surface. This lets one work with efficient
low-polygon models and only add the smoothing “on demand” on the graphics card
(for display) or in the renderer. The tradeoff is that subdivision curves/surfaces take
slightly longer to render. However, smoothing low-resolution polylines using curve
subdivision is still much faster than working with inherently smooth primitives such
as NURBS curves.
Subsurface scattering (SSS)—Also known as subsurface light transport (SSLT),
is a mechanism of light transport in which light penetrates the surface of a translucent
object, is scattered by interacting with the material, and exits the surface at a different
point.
Sub-pixel Morphological Anti-aliasing (SMAA)—This filter detects edges in a
rendered image and classifies edge crossings into various shapes and shades, in an
attempt to make the edges or lines look smoother. Almost every GPU developer has
their own version of anti-aliasing.
Superray—A grouping of rays within and across views, as a key component of a
light field processing pipeline.
TAM—Total available market.
Tearing and frame dropping—When the GPU's frame output is not synchronized to the monitor's refresh (which historically was tied to the powerline frequency), the screen can be refreshed halfway through the process of a frame being output by the GPU, resulting in two or more frames being shown at once.
Telecine—Telecine is the process of transferring motion picture film into video.
The most complex part of telecine is the synchronization of the mechanical film
motion and the electronic video signal. Normally, best results are then achieved by
using a smoothing (interpolating algorithm) rather than a frame duplication algorithm
(such as 3:2 pulldown).
Tessellation shaders—A tessellation shader adds two new shader stages to the
traditional model. Tessellation Control Shaders (also known as Hull Shaders) and
Tessellation Evaluation Shaders (also known as Domain Shaders), which together
allow simpler meshes to be subdivided into finer meshes at run-time according to a
mathematical function. The function can be related to a variety of variables, most
notably the distance from the viewing camera to allow active level-of-detail scaling.
This allows objects close to the camera to have fine detail, while further away ones can
have coarser meshes, yet seem comparable in quality. It also can drastically reduce
mesh bandwidth by allowing meshes to be refined once inside the shader units instead
of down-sampling very complex ones from memory. Some algorithms can up-sample
any arbitrary mesh, while others allow for “hinting” in meshes to dictate the most
characteristic vertices and edges. Tessellation shaders were introduced in OpenGL
4.0 and Direct3D 11.
One cannot use tessellation to implement subdivision schemes that require the
previous vertex position to compute the next vertex positions.
Texel—Acronym for TEXture ELement or TEXture pixEL—the unit of data
which makes up each individually addressable part of a texture. A texel is the texture
equivalent of a pixel.
Texture mapping—The act of applying a texture to a surface during the rendering
process. In simple texture mapping, a single texture is used for the entire surface,
no matter how visually close or distant the surface is from the viewer. A somewhat
more visually appealing form of texture mapping involves using a single texture with
bilinear filtering, while an even more advanced form of texture mapping uses multiple
textures of the same image but with different levels of detail, also known as mip-
mapping. See also “bilinear filtering,” “level of detail,” “mip-map,” “mip-mapping,”
and “trilinear filtering.”
Texture Map—Same thing as “texture.”
Texture—A texture is a special bitmap image, much like a pattern, but which is
intended to be applied to a 3D surface in order to quickly and efficiently create a
realistic rendering of a 3D image without having to simulate the contents of the image
in 3D space. That sounds complicated, but in fact it’s very simple. For example, if
you have a sphere (a 3D circle) and want to make it look like the planet Earth, you
have two options. The first is that you meticulously plot each nuance in the land
and sea onto the surface of the sphere. The second option is that you take a picture
of the Earth as seen from space, use it as a texture, and apply it to the surface of
the sphere. While the first option could take days or months to get right, the second
option can be nearly instantaneous. In fact, texture mapping is used broadly in all
sorts of real-time 3D programs and their subsequent renderings, because of its speed
and efficiency. 3D games are certainly among the biggest beneficiaries of textures,
but other 3D applications, such as simulators, virtual reality, and even design tools,
take advantage of textures too.
Tile-based deferred rendering (TBDR)—Defers the lighting calculations until all
objects have been rendered, and then, it shades the whole visible scene in one pass.
This is done by rendering information about each object to a set of render targets
that contain data about the surface of the object; this set of render targets is normally
called the G-buffer.
Tiled rendering—The process of subdividing a computer graphics image by a
regular grid in optical space and rendering each section of the grid, or tile, separately.
The advantage to this design is that the amount of memory and bandwidth is reduced
compared to immediate mode rendering systems that draw the entire frame at once.
This has made tile rendering systems particularly common for low-power handheld
device use. Tiled rendering is sometimes known as a “sort middle” architecture,
because it performs the sorting of the geometry in the middle of the graphics pipeline
instead of near the end.
ToF—An acronym for Time of Flight. Used to refer to active sensors which
measure distance to objects in a scene by emitting infrared pulses and measuring the
time taken to detect the reflection. These sensors simplify the computational task of
producing a point cloud from image data but are more expensive and lower resolution
than regular CMOS sensors.
Tone-mapping—A technique used in image processing and computer graphics to
map one set of colors to another to approximate the appearance of high-dynamic-
range images in a medium that has a more limited dynamic range.
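As one common example (one of many possible operators), the simple Reinhard mapping compresses an unbounded HDR luminance value into a limited range:

```cpp
// Reinhard tone mapping: compresses an HDR luminance value L >= 0 into the
// limited range [0, 1), preserving detail in both dark and bright regions.
double reinhard_tonemap(double L) {
    return L / (1.0 + L);
}
```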
Transcoding—Transcoding is the process of converting a media file or object
from one format to another. Transcoding is often used to convert video formats (i.e.,
Beta to VHS, VHS to QuickTime, QuickTime to MPEG). But it is also used to fit
HTML files and graphics files to the unique constraints of mobile devices and other
Web-enabled products.
Trilinear filtering—A combination of bilinear filtering and mip-mapping, which
enhances the quality of texture mapped surfaces. For each surface that is rendered,
the two mip-maps closest to the desired level of detail will be used to compute pixel
colors that are the most realistic by bilinearly sampling each mip-map and then using
a weighted average between the two results to produce the rendered pixel.
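A minimal C++ sketch of that weighting step, assuming the two nearest mip levels have already been bilinearly sampled; the names are illustrative:

```cpp
// Trilinear filtering: blend the bilinear samples taken from the two mip levels
// that bracket the desired level of detail. 'lod' is the fractional LOD value;
// e.g., an LOD of 1.585 blends mip 1 and mip 2 with weights 0.415 and 0.585.
float trilinear_blend(float sample_mip_lower, float sample_mip_higher, float lod) {
    float frac = lod - static_cast<float>(static_cast<int>(lod));   // fractional part of the LOD
    return (1.0f - frac) * sample_mip_lower + frac * sample_mip_higher;
}
```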
Trilinear mip-mapping—See above, trilinear filtering.
Truncation—An arithmetic operation which simply removes the fractional portion
of a number in integer-fraction format to produce an integer, without regard for the
magnitude of the fractional portion. Therefore, 2.99 and 2.01 truncated are both 2.
See also “rounding.”
UDIM—An enhancement to the UV mapping and texturing workflow that makes
UV map generation easier and assigning textures simpler. The term UDIM comes
from U-Dimension and designates UV ranges. UDIM is an automatic UV offset system
that assigns an image onto a specific UV tile, which allows one to use multiple lower
resolution texture maps for neighboring surfaces, producing a higher resolution result
without having to resort to using a single ultra-high-resolution image. UDIM was
invented by Richard Addison-Wood and came from Weta Digital (circa 2002).
UI—User interface.
UHD Alliance Premium Logo—High-end HDR TV requirements: Rec.709, P3, or Rec.2020.
Ultra HD Blu-ray—HDR disk format using HEVC, HDR10, and optionally Dolby
Vision.
UMA (unified memory architecture)—When an IGP is employed in a PC, it needs
memory (sometimes called a frame buffer). One of the benefits of an IGP is the
reduced cost realized by eliminating a separate frame buffer, and to replace that extra
memory a portion of the PC’s main or system memory is used for the frame buffer.
When that is done, the organization is known as a unified memory architecture.
USB, Universal Serial Bus—The Universal Serial Bus (USB) is a common inter-
face that enables communication between devices and a host controller such as
a personal computer (PC). It connects peripheral devices such as digital cameras,
mice, keyboards, printers, scanners, media devices, external hard drives, and flash
drives.
VAR—Value-added reseller.
Vblank—In a raster graphics display, the vertical blanking interval (VBI), also
known as the vertical interval or VBLANK, is the time between the end of the final
line of a frame or field and the beginning of the first line of the next frame. It is
present in analog television, VGA, DVI, and other signals.
Vector—In computer programming, a vector quantity refers to any group of similar
values which are grouped together and processed as a unit, either serially or in parallel.
A vector can contain any number of elements. An example from computer graphics
is the vector which describes the location of a point in four-dimensional space, P = (x, y, z, w). Commonly referred to as a Vec4, as it has four elements, it can be efficiently processed by a four-wide SIMD vector unit.
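A minimal C++ sketch of such a Vec4 processed element-wise, as a SIMD vector unit would do in hardware (plain scalar code here, with illustrative names):

```cpp
struct Vec4 {
    float x, y, z, w;   // a point in homogeneous coordinates: P = (x, y, z, w)
};

// Element-wise addition: one operation applied to all four components, which is
// what a four-wide SIMD vector unit performs in a single instruction.
Vec4 add(const Vec4& a, const Vec4& b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}
```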
Vector error correction model (VECM)—The basic error correction model (ECM) approach suffers from a number of weaknesses. Namely, it is restricted to only a single
equation with one variable designated as the dependent variable, explained by another
variable that is assumed to be weakly exogenous for the parameters of interest. It
also relies on pretesting the time series to find out whether variables are I(0) or I(1).
These weaknesses can be addressed through the use of Johansen’s procedure. Its
advantages include that pretesting is not necessary, there can be numerous cointe-
grating relationships, all variables are treated as endogenous, and tests relating to
the long-run parameters are possible. The resulting model is known as a vector error
correction model (VECM), as it adds error correction features to a multi-factor model
known as vector auto-regression (VAR).
Vector display/scope—A display used for computer graphics up through the
1970s. A type of CRT, like an oscilloscope. In a vector display, the image is composed
of drawn lines rather than an array of pixels as in raster graphics. The CRT’s electron
beam draws lines along an arbitrary path between two points, rather than following
the same horizontal raster path for all images. Vector displays had no aliasing and
were so accurate that physical measurements could be taken from the screen. For that
reason, they were also called calligraphic displays.
Vector graphics—Refers to a method of generating electronic images using math-
ematical formulae to calculate the start, end, and path of a line. Images of varying
complexity can be produced by combining lines into curved and polygonal shapes,
resulting in infinitely scalable objects with no loss of definition.
Vector unit (SIMD vector unit)—An Arithmetic Unit or Arithmetic Logic Unit
which operates on one or more vectors at a time, using the same instruction for all
values in the vector.
Verilog/HDL—A “Hardware Description Language” is a textual representation
of logic gates and registers. It differs from a programming language mainly in that
it describes a parallel structure in space rather than a sequence of actions in time.
Verilog is one of the most popular HDLs and resembles C or C++ in its syntax.
VESA—Video Electronics Standards Association, a technical standards organiza-
tion for computer display, PC, workstation, and computing environment standards.
The organization was incorporated in California in July 1989.
VFX—Visual effects.
VGA (video graphics array)—VGA is a resolution and electrical interface stan-
dard originally developed by IBM. It was the de facto display standard for the PC. VGA
has three analog signals, red, green, and blue (RGB), and uses an analog monitor.
Graphics AIBs output analog signals. All CRTs and most flat panel monitors accept
VGA signals, although flat panels may also have a DVI interface for display adapters
that output digital signals.
vGPU—An AIB with a powerful dGPU located remotely in the cloud or a campus
server.
Vignetting—A reduction of an image’s brightness or saturation at the periphery
compared to the image center.
VPNA—See Visual Processing Unit.
Virtual reality—Virtual reality (VR) is a fully immersive user environment
affecting or altering the sensory input(s) (e.g., sight, sound, touch, and smell) and
allowing interaction with those sensory inputs by the user’s engagement with the
virtual world. Typically, but not exclusively, the interaction is via a head-mounted
display, use of spatial or other audio, and/or hand controllers (with or without tactile
input or feedback).
VR Video and VR Images—VR Video and VR Images are still or moving imagery
specially formatted as separate left and right eye images usually intended for display
in a VR headset. VR Video capture and subsequent display are not exclusive to 360°
formats and may also include content formatted to 180° or 270°; content does not
need to visually surround a user to deliver a sense of depth and presence.
Vision processing—Processing of still or moving images with the objective of
extracting semantic or other information.
VLIW (very long instruction word)—Often abbreviated to VLIW. A micropro-
cessor instruction format that combines multiple low-level operations into one long
instruction word and issues them simultaneously to control multiple execution units in parallel.
Voxel—A voxel is a value in three-dimensional space. Voxel is a combination of
“volume” and “pixel” where pixel is a combination of “picture” and “element.” This
is analogous to a texel, which represents 2D image data in a bitmap (also referred
to as a pixmap). Voxels are used in the visualization and analysis of medical and
scientific data. (Some volumetric displays use voxels to describe their resolution.
For example, a display might be able to show 512 × 512 × 512 voxels.) Both ray
tracing and ray-casting, as well as rasterization, can be applied to voxel data to obtain
2D raster graphics to depict on a monitor.
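As an illustrative sketch (the VoxelGrid type below is hypothetical, not a real library), a dense voxel volume is commonly stored as a flat array indexed by the three coordinates:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Dense voxel grid: one scalar value (density, opacity, etc.) per (x, y, z) cell.
struct VoxelGrid {
    std::size_t dim;          // cells per axis, e.g., 512 for a 512 x 512 x 512 volume
    std::vector<float> data;  // flat storage, dim * dim * dim values

    explicit VoxelGrid(std::size_t d) : dim(d), data(d * d * d, 0.0f) {}

    // Flattened index: x varies fastest, then y, then z.
    float& at(std::size_t x, std::size_t y, std::size_t z) {
        return data[x + dim * (y + dim * z)];
    }
};

int main() {
    VoxelGrid grid(64);          // kept small here; medical volumes are often larger
    grid.at(10, 20, 30) = 0.75f; // write one voxel
    std::printf("%g\n", grid.at(10, 20, 30));
    return 0;
}
```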
VGPR (vector general-purpose registers)—General-purpose registers are used to
store temporary data within the GPU.
VPU (vector processing unit)—A vector processor or array processor implements
an instruction set containing instructions that operate on one-dimensional arrays of
data called vectors.
Today’s CPU architectures have instructions for a form of vector processing
on multiple (vectorized) data sets, typically known as SIMD (Single Instruction,
Multiple Data). Common examples include Intel x86’s MMX, SSE and AVX instruc-
tions, AMD’s 3DNow! Extensions, and Arm’s Neon and its scalable vector extension
(SVE).
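For example, the short x86 SSE sketch below adds four pairs of floats with a single vector instruction (illustrative only; it assumes an SSE-capable compiler and CPU):

```cpp
#include <xmmintrin.h>   // x86 SSE intrinsics
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float r[4];

    __m128 va = _mm_load_ps(a);       // load four floats into one 128-bit register
    __m128 vb = _mm_load_ps(b);
    __m128 vr = _mm_add_ps(va, vb);   // one SIMD instruction adds all four lanes
    _mm_store_ps(r, vr);

    std::printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}
```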
VPU (Visual Processing Unit)—A Visual Processing Unit (VPU) is a silicon chip
or IP block dedicated to computational photography and/or vision processing.
A vision processing unit (VPU) is an emerging class of microprocessor; it is a
specific type of AI accelerator, designed to accelerate machine vision tasks. Vision
processing units are distinct from video processing units (which are specialized
for video encoding and decoding) in their suitability for running machine vision
algorithms such as CNN (convolutional neural networks). The name belies the real
importance of the function and should include neural network accelerator, which
results in the acronym VPNA.
VR—Virtual reality.
V-sync—Vertical synchronization: locking frame buffer updates (buffer swaps) to the
monitor’s vertical refresh rate, historically derived from the power line frequency of 60 or 50 Hz.
VXGI (Voxel Global Illumination)—Developed by Nvidia, VXGI is an approach to
computing a fast, approximate form of global illumination (GI) dynamically in real
time on the GPU. It uses a voxel grid to store scene and lighting information and a
novel voxel cone tracing process to gather indirect lighting from that grid. Because
full ray tracing of a scene is too computationally intense to run in real time, such
approximations are required. VXGI features one-bounce indirect diffuse and specular
light, reflections, and area lights, an advancement in realistic lighting, shading, and
reflections. It is a three-step process: voxelization, light injection, and final gathering,
and it is employed in next-generation games and game engines.
WCG (wide color gamut)—Any color gamut wider than Rec.709, such as DCI-P3 or
Rec.2020. See wide color gamut.
Wave (wavefront)—AMD’s term for a group of threads (work items) that a GPU
compute unit executes together in SIMD lockstep, typically 64 (Wave64) or 32
(Wave32 on RDNA); Nvidia’s equivalent is the warp.
Wide color gamut—High dynamic range (HDR) displays a greater difference in
light intensity from white to black; wide color gamut (WCG) provides a greater
range of colors. The wide-gamut RGB color space (or Adobe Wide Gamut RGB)
is an RGB color space developed by Adobe Systems that offers a large gamut by
using pure spectral primary colors. It is able to store a wider range of color values
than sRGB or Adobe RGB color spaces.
Also see HDR and color gamut.
X Reality—X Reality (XR) is a general term to cover the multiple types of expe-
riences and technologies across VR, AR, MR, and any future similar areas. All of
these systems have in common some level of display technology (e.g., video, audio)
mixed with a method to track where the user is looking or moving (e.g., up/down,
side-to-side, turning around). How those systems work individually, and together,
determines which of the more defined experiences the product would be named—VR,
AR, MR, or some future XR.
Z-buffer—A memory buffer used by the GPU that holds the depth (Z axis) of each pixel.
When an image is drawn, each (X, Y) pixel is tested against the corresponding z-buffer
location. If the next pixel in line to be drawn lies behind (farther from the viewer than)
the one already stored there, it is ignored.
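A minimal sketch of that depth test (illustrative C++; the Framebuffer type and writePixel function are hypothetical, and it assumes the common convention that a smaller z value means closer to the viewer):

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct Framebuffer {
    int width, height;
    std::vector<float> depth;          // the z-buffer: one depth value per pixel
    std::vector<std::uint32_t> color;  // packed RGBA per pixel

    Framebuffer(int w, int h)
        : width(w), height(h),
          depth(static_cast<std::size_t>(w) * h, std::numeric_limits<float>::max()),
          color(static_cast<std::size_t>(w) * h, 0) {}

    // Write the pixel only if it is closer than what the z-buffer already holds.
    void writePixel(int x, int y, float z, std::uint32_t rgba) {
        std::size_t i = static_cast<std::size_t>(y) * width + x;
        if (z < depth[i]) {        // nearer than the stored fragment: keep it
            depth[i] = z;
            color[i] = rgba;
        }                          // otherwise the incoming fragment is ignored
    }
};

int main() {
    Framebuffer fb(640, 480);
    fb.writePixel(100, 100, 0.5f, 0xFF0000FFu);  // drawn: buffer held the far-plane value
    fb.writePixel(100, 100, 0.9f, 0x00FF00FFu);  // ignored: farther than the stored 0.5
    return 0;
}
```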
Index

A ATI, 136
Accelerated Processor Unit (APU), 71 Axe Technology, 227
Acorn, 108
Acorn Computers, 108
Adler Lake, 335 B
Adreno GPU, 170 Bachus, Kevin, 191
Adreno GPU, Qualcomm, 141 Baker, Nick, 200
Advanced eXtensible Interface (AXI), 273 Barlow, Steve, 267
Advanced High-performance Bus (AHB), Battlemage GPU, 342
113 BBC Computer, 108
Advanced Interface Bus, 340 Bergman, Rick, 22
Advanced Micro Devices (AMD), 144, Berkes, Otto, 191
145, 163, 261 Bezier curve, 8
Advanced Peripheral Bus (APB), 113 Bidirectional Reflectance Distribution
Alben, Jonah, 53 Function (BRDF), 272
Alchemist GPU, 342 Bifrost, Mali-, 154
Alho, Mikko, 144 Big.LITTLE, 153
Alphamosaic, 267 Binary Runtime Environment for Wireless
Ambiq, 276 (BREW), 139
AMD ray accelerator, 347 Bitboys, 130, 137
AMD TeraScale architecture, 66 Blackley, Seamus, 191
Ampere, Nvidia, 288 Bolt Graphics, 262
Andrews, Jeff, 200 Boost, AMD, 91
Anisotropic filtering, 46 Bounded Volume Hierarchy (BVH), 350
Anti-aliasing, 41, 46 Bounding Volume Hierarchy (BVH), 329
Apple, 107 Broadway IBM PowerPC, 209
Apple M1, 172 Brown, Nat, 191
AR and VR, 157 Bulldozer, AMD, 67
Arc client graphics, 342 Bush, Jeff, 308
Arc, Intel AIB, 250
Argonne National Laboratory, 248
Arm, 108, 109 C
Arm-Nvidia, 164 Cai, Mike, 281
Asynchronous Compute Engines (ACE), 88 Campbell, Gordon (Gordie), 123
Asynchronous Compute Tunneling, 88 Carmean, Doug, 61
Atari VCS, 225 Celestial GPU, 342
Cell, The, 204 F
Changsha Jingjia Microelectronics, 264 Fainstain, Eugene, 91
Chesnais, Fred, 224 Fake Graphics, 123
Chill. AMD, 92 Falanx, 109
China Integrated Circuit Industry Falanx Microsystems, 109
Investment Fund, 267 FidelityFX, 352
Chiplets, 340 FMAD, 73
Chip-Scale Package (CSP), 276 Forsyth, Tom, 60
Cinematic shading, 28 Foveated rendering, 330
Coarse Pixel Shading (CPS), 96 Foveros, 340
Cognivue, 274 Foveros, Intel, 248
Commercial-Off-The-Shelf (COTS), 187 Fudo, 33
Compute Unified Device Architecture Fujitsu Cremson, 116
(CUDA), 70, 254 Fujitsu Jade, 118
Full-Scene Anti-Aliasing (FSAA), 115
Conuladh, Feargal Mac, 225
Fused Multiply-Accumulate (FMA), 157
Cook, Tim, 174
Fusion—AMD, 70
Cremson, Fujitsu, 117
CrossFire, 42
G
Gates, Bill, 191
D GC8000 GPU, 286
D77 display processor, 161 Gen 11 GPU, Intel, 94
Dai, Wei Jin, 282 Geode, 256
Ginsburg, Dan, 353
Dai, WeiMing, 285
GiQuila, 281
Dal Santo, Paul, 136
GlenFly, 259
Davies, Jem, 152
Global illumination, 329
Deep-Learning Accelerator (DLA), 167
GoForce 6100, 125
Deep Learning Super Sampling (DLSS), Graphics and Memory Controller Hub
331 (GMCH), 62
Deering, Michael Frank, 241 Graphics Core Next (GCN), 83, 212
Deferred rendering, 173 Greenland (AMD), 85
Demers, Eric, 74
Digital Media Professionals (DMP), 163,
271 H
Digital TV (DTV), 270 Haiguang Microelectronics, 261
Direct-Link Library (DLL), 73 Hase, Ted, 191
DirectX 12, 94 Haswell GT1 iGPU, 79
DLAA, Nvidia, 328 Herkelman, Scott, 352
DLSS, Nvidia, 328 HiAlgo, 91
Druid GPU, 342 High Dynamic Range (HDR), 20, 38
Dynamic voltage control, 32 High-Level Shading Language (HLSL), 30
High-Performance Multi-crystalline
(HPM), 284
Hoare, Graydon, 315
E Hollywood, Wii, 203
Eden, Mooly, 80 Honkavaar, Mikael, 123
802.11 a/c, 220 Hruska, Joel, 53
Ellesmere (AMD), 83 Hummingbird SoC, 145
Embedded Multi-die Interconnect Bridge Hutton, Peter, 153
(EMIB), 86, 250, 339 Hybrid Graphics, 123
Execute Indirect, 328 Hybrid-Rendering, 330
Exynos SoC, 145 HyperLane, 106
HyperZ, ATI, 16 Kylin, 267
I L
Icera, 127 Larrabee project, 57
Imageon, 101, 137 Lee, Byoung Ok, 242
Imageon, ATI, 139 Leia, Bitboys Qualcomm, 138
Imageon processor, 140 Leighton, Luke Kenneth Casson, 313
Imagination BXS, 105 Leland, Tim, 141
Imagination Image Compression (IMGIC), Lens-Matched Shading (LMS), 331
106, 296 Lightspeed Memory Architecture, 3
Imagination Technologies, 163 Lin, Chris, 25
Imagination Technologies’ B-series, 104 Liverpool/Durango APU, 210
Infinity Cache, 347 Ljosland, Borgar, 110, 115
Inglis, Mike, 109 Logan, Nvidia, 166
Inline Raytracing, 328 Logo, Nvidia, 57
Innosilicon, 354 Lumen in the Land of Nanite, 229
Input-to-response latency, 92
Instructions Per Clock (IPC), 87
Integrated Graphics Chipsets (IGCs), 26
M
Intel, 163
M1 Max, 175
Intel, iGPUs, 79
Intel Kaby Lake G, 85 M1 Pro, 174
Intellectual Property (IP), 133 M1 Ultra, 177
Intellisample, Nvidia, 46 Machine Learning, 157
Intel Xe , 335 Makivaara, Jarkko, 144
Interactive PC computer graphics (CGI), 28 Mali, 109
IPad, 101 Mali Java stack, 113
IRISxe dGPU, 335 Mallard, John, 124
Ironlake GPU, 63 Many-core Integrated Accelerator of
Ironlake graphics, 62 Waterdeep/Wisconsin (MIAOW),
Iwata, Satoru, 202 309
Marci, Joe, 33
Masayoshi, Son, 163
J McNamara, Patrick, 307
Jade, Fujitsu, 119 MediaGX, 255
Jaguar APU, 212 MediaQ, 120
James, Dick, 208 Memory-Management Units (MMU), 75
Jensen, Rune, 200 Mesh interconnect architecture (MDFI),
Jiawei, Jing, 264 251
Jingjia Microelectronics, 265 Mesh shader, 333
Jobs, Steve, 124 MetalFX Upscaling, 182
Joe Palooka, 138 MetaX, 258
Johnson, Gary, 124 Meyer, Dirk, 70
Microsoft’s Zune, 126
Midgard, 148
K Midgard, Mali, 149
Kal-El, 127 Midway, 191
Kelvin architecture, 193 Mijat, Roberto, 150
Khan. ATI, 13 Miller, Timothy, 306
Koduri, Raja, 82, 250 Min Yoon, Hyung, 242
Komppa, Jari, 144 Miyamoto, Masafumi, 29
Kumar, Devinder, 68 Miyamoto, Shigeru, 202
Kutaragi, Ken, 191, 196 Mobile GPUs, 103
Mobile Industry Processor Interface P
(MIPI), 146 Palm PDA, 101
Mobile Internet Device (MID), 147 Papermaster, Mark, 67
Mollenkopf, Steve, 142 Parallax Occlusion Mapping, 41
Multi Format Codec (MFC), 145 Parker, Nvidia, 166
Multipliers (MUL), 283 Park, Woo Chan, 242
Multiprocess Service (MPS), 327 Park, Yongin, 145
Multi-Resolution Shading (MRS), 331 Pascal, Nvidia, 287
Multisampling Anti-Aliasing (MSAA), 3 Pearlman, Steve, 192
Myer, Dirk, 67 Pegasus board, 168
Photon architecture, 294
Physically based rendering (PNR), 156
Pica200, 273
N PICA200, DMP, 206
Nano iPod, 124 PlayStation 2, 189
Natural Light, ATI, 38 PlayStation 4, 216
NEC, 132 PlayStation 5, 228
Nema GPU, 276 Plowman, Ed, 150
Neox, 278 Pmark, 81
Neural Network Processors (NNP), 255 Pohl, Daniel, 57
Neural Processing Unit (NPU), 145 Polymega, 222
Newton computer, 108 Ponte Vecchi, 248
NfiniteFX engine, 193 Ponte Vecchio circuit board, 253
9000 Communicator, Nokia, 101 PortalPlayer, 124
Nintendo Switch, 223 Position Only Shading pipeline (POSH), 94
Nokia, 132 Position only Tile-Based Rendering
(PTBR), 94, 97
Non-Uniform Rational Basis Spline
PowerPC Espresso CPU, 211
(NURBS), 208
PowerVR, 173
Nordlund, Petri, 130
Project Trillium, Arm, 164
N-Patch, 6
PSP, Sony, 195
NV2A, Nvidia Xbox, 192
PS Vita, 208
Nvidia, 163 PureVideo, Nvidia, 46
Nystad, Jørn, 110

Q
O Q3D Dimension platform, 140
Ogasawara, Shinichi, 195 Qi, Nick, 262
Qualcomm, 163
Open Core Protocol (OCP), 273
Qualcomm handheld, 235
OpenGL-ES, 103
Qualcomm’s variable rate shading, 171
OpenGL ES 1.0, 123
Quincunx anti-aliasing, 3
OPenGL ES 3.2, 143
OpenGPU, AMD, 352
Open Graphics Project, 306 R
Open Hardware Foundation, 307 R300, ATI, 13
Open Mobile Application Processor Radeon HD 4870, 67
Interfaces (OMAPI), 146 Radeon Media Engine, 349
Open Multimedia Application Platform Raspberry Pi, 269
(OMAP), 146 RayCore, 242
OpenVG, 132 RayQuery, 324
Original Device Manufacturers (ODM), 23 Ray tracing, 323
Orton, Dave, 13, 22, 32, 67, 70 RayTree, 245
Otellini, Paul, 57 RDNA 2, 347
Rearden Steel, 194 Streaming Processor (SP), 55
Reduced Instruction Set Computer (RISC), Subor, 227
108 Subor Z-plus, 227
Reed, Rory, 67 Su, Lisa, 67, 204
Register, The, 137 Superfin, 79
RenderMonkey, ATI, 22 Super Resolution (FSR), 352
Revolution, Nintendo Wii, 202 Super sampling, 41
RISC-V, 278 Super Sampling Anti-Aliasing (SSAA), 3,
Ruiz, Hector, 66, 70 115
Rust, 315 Swann, Robert, 267
Switch, AMD, 92
System-on-Chip (SoC), 133
S
Sakaguchi, Hironobu, 29
T
Samsung, 124, 145
Taiwan Semiconductor Manufacturing
Sanders, Jerry, 66
Company (TMSC), 248
Sankaralingam, Karu, 309
Takeda, Genyo, 202
Sapphire Rapid accelerator board, 253
Tamasi, Tony, 52
Sarri, Mikko, 130
Tammilehto, Sami, 132
Scalarmorphic GPU, 282
Tegra APX 250, 125
Screen-door effect, 162
Tensor Processing Unit (TPU), 88
Sculley, John, 120
TeraScale GPU, 68
Second era GPUs, 1
Texas Instruments, 124
Segars, Simon, 164
Texture Address Units (TAU), 37
Set-Top Box (STB), 271
Texture Processor Cluster (TPC), 54
Shanghai Zhaoxin, 257 ThinkLCD, 275
Shapiro, Danny, 128 Think Silicon, 163
Shield 2, 219 Third-era GPUs, 51
Shield, Nvidia, 212 Tianjin Haigua, 261
Shin, Hee Jin, 242 Tianjin Haiguang Advanced Technology
Sidropoul, George, 275 Investment Co. Ltd. (THATIC), 261
Silicon interposer, 177 Tiger Lake, 335
Silicon On Insulator (SOI), 72 Tile-based rendering, 94
Silicon Platform as a Service (SiPaaS), 285 Tone mapping, 38
Singh, Darwesh, 262 Torborg, Jay, 192
Singh, Ramesh, 120 TruForm, 6, 15
Single Pipeline State Object (PSO), 328 Tsodikov, Alex, 91
Six Levels of Ray Tracing acceleration Tsui, Thomas, 26
(RTLS), 294 Turing, Nvidia, 287, 326, 333
SMAPH-F vector Graphics, 273 Turks, AMD, 68
Smith, Ian, 150 Turks, GPU, 68
Smoothvision, ATI, 17
Snapdragon, 141
Snapdragon S1 MSM7227, 101 U
SoftBank, 163 UltraFusion, 177
Sony CXD5315GG processor, 208 UltraShadow, Nvidia, 46
Sørgård, Edvard, 110, 131, 149 Ultray architecture, 272
Stamoulis, Iakovos, 275 Umbra Software, 123
Stathonikos, Damian, 137 Unified Memory Architecture (UMA), 172
Steam Deck, 232 Unified Operating System (UOS), 267
Steele, Steve, 149 Unified shader, 52
Stock-Keeping Unit (SKU), 126 Unified Shading Cluster array (USC), 295
Streaming Multiprocessor (SM), 288, 326 UnionTech, 267
Unity Operating System (UOS), 265 Weiliang, Chen, 260
Utgard, Mali, 113 Westmere, 62
Wii, 203
Wii U, 211
V WinTel, 239
Valhall, Mali-, 156
Variable Rate Shading (VRS), 94, 330
VC01, 267 X
Vector General Purpose Register (VGPR), Xavier, 167
84, 350 Xbox, 192
Vector Processing Unit (VPU), 59 Xbox 360, 52, 142, 201
Vega GPU, 282 Xbox One, 217
Vejle, Microsoft, 201 Xclipse GPU, 145
VeriSilicon, 163, 285 Xe Link Tile, 252
Vertex Fetch (VF), 94 Xe Matrix eXtensions (XMX), 252
VideoCore, 269 Xenos, ATI, 141, 142
Vire Labs, 143 Xe -Super Sampling (XeSS), 344
Virtual Desktop Infrastructure (VDI), 327 Xiangdixian Computing Technology, 262
Virtualized GPU, 284 Xiaobawang, 226
Visualize FX10, Sun, 239 X Tjandrasuwit, Ignatius, 120
Visual Processor Unit (VPU), 13
Vivante, 281
Voice Over Internet Protocol (VoIP), 235 Y
Volta, Nvidia, 287 Yamamoto, Tatsuo, 274
VR HMDs, 161 Yamato, ATI, 141
Vulkan, 94, 143

Z
W Zhaoxin, 256
Waves, 84 Zhongshan, 227
WebTV, 192 Zulu, Sun, 240
