ComputingEdge
Your one-stop resource for industry hot topics,
technical overviews, and in-depth articles.
IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters: Three Park Ave., 17th Floor, New York,
NY 10016; IEEE Computer Society Headquarters: 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office: 10662
Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Membership
Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses to
4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854 or pubs-permissions@ieee.org. ©2021 by IEEE. All rights reserved. Abstracting and
library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
GUEST EDITORS' INTRODUCTION
Best Papers From Hot Chips 32
Priyanka Raina and Cliff Young

MARCH/APRIL 2021

Theme Articles
General Interest
The Design Process for Google's Training Chips: TPUv2 and TPUv3
Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman Jouppi, and David Patterson

www.computer.org/micro
ISSN: 0272-1732
FROM THE EDITOR-IN-CHIEF

Welcome to the March/April 2021 issue of IEEE Micro. This issue features selected articles from the Hot Chips '32 Symposium, held virtually in August 2020. COVID-19 forced Hot Chips '32 to be a virtual event; however, the chips are more interesting and powerful than ever! Whether it is graphics acceleration or a sheer increase in traditional compute capability, the chip design arena has become hotter than ever. A lot of money is pouring into designing both special-purpose and general-purpose chips. IEEE Micro is pleased to present seven selected articles based on the presentations at the Hot Chips Symposium for our readers. Priyanka Raina of Stanford University and Cliff Young of Google served as guest editors for this special issue. They have compiled an excellent selection of articles on emerging chips and systems from the Symposium, including articles on IBM Power10, Marvell ThunderX3, Xbox Series X, NVIDIA A100, the Manticore 4096-core chiplet, the Pensando Distributed Services Architecture, and the TensTorrent compute substrate for Software 2.0. Please read the Guest Editors' Introduction to get a preview of the seven articles. Thanks to the editors, authors, and reviewers who worked hard to put this issue together.

In addition to the aforementioned seven Hot Chips articles, there are two General Interest articles. The first one is on the design process for Google's TPUv2 and TPUv3 training chips, written by Norrie et al. from Google. This was also originally presented at Hot Chips; however, because the authors included one of the guest editors, it was considered a General Interest article, and the review process was coordinated separately. The article describes Google's approach to machine learning (ML) hardware, and provides details on the scalar computation unit, the vector computation unit, the matrix computation units, the memory system, the interconnect, and the floor plan of TPUv2. The enhancements in TPUv3 compared to TPUv2 are discussed next, and the performance is compared using roofline plots.

The second General Interest article is "Klessydra-T: Designing Vector Coprocessors for Multithreaded Edge-Computing Cores" by Cheikh et al. of Sapienza University of Rome. This work addresses the introduction of coprocessor acceleration in interleaved multithreading (IMT) cores for extreme edge computing. Specifically, it explores possible alternatives to implement vector coprocessing units in RISC-V cores, showing the synergy between IMT and data-level parallelism in edge-computing applications.

This issue also features three department articles. Michael Mattioli of Goldman Sachs, who joined the IEEE Micro Editorial Board as a new Department Editor, has coauthored an article with Atte Lahtiranta on the hidden capabilities of video game consoles. The authors describe non-video-game capabilities of video game consoles such as web browsing, video conferencing, and audio/video/document content creation. They posit that video game consoles are ideal for enterprise deployment, with security features and defense against a wide range of threats. The Microsoft Xbox Series X architecture is described with emphasis on the security processor, which is housed in the security complex with a crypto engine, random number generator, secure RAM, secure ROM, security fuse bank, and side-channel monitors. Please read the article to appreciate these less-discussed features of video game consoles.

This issue additionally includes an Award article from Luiz Andre Barroso of Google, who received the 2020 Eckert-Mauchly Award. In his article, the author provides a brief history of warehouse-scale computing. He describes the progression of datacenter computing as it evolved during the last two decades, and also describes his personal journey as a computer engineer. He concludes the article by acknowledging how the pandemic has made many realize the importance of computing technology and cloud-based services, and how these have allowed us to continue to work and live.

Another article presented in this issue is a Micro Economics column by Shane Greenstein, "The Economics of Confrontational Conversation," discussing how confrontational conversations are commonplace on the internet. Greenstein focuses on the economics relevant to such confrontational conversations. One economic factor is that it is inexpensive to host terabytes of data. Additionally, simple focal platforms attract more users, which then attracts more apps and content. The third relevant economic fact is that the mechanisms to address confrontation, whether human moderation or algorithmic processes, will be expensive.

There have been some additions/enhancements to the IEEE Micro Editorial Board this month. Dr. Vijaykrishnan Narayanan of Penn State has been promoted to Associate Editor-in-Chief of IEEE Micro. Michael Mattioli of Goldman Sachs will serve as a Department Editor for Security and Product Reviews. Prof. Guido Araujo of the University of Campinas (Brazil) joins the Editorial Board this month as an Associate Editor. I look forward to working with all of them and bringing you an even more interesting reading experience.

Let me also provide an overview of what to expect in upcoming issues. The May/June issue will be the popular "Top Picks" Special Issue, which presents the best of the best from articles in computer architecture conferences in 2020. Prof. Daniel Jimenez of Texas A&M University and a selection committee from industry and academia have selected 12 papers from more than 100 articles that were submitted in response to the Top Picks call for papers. Readers can look forward to an amazing collection of excellent articles in May/June.

Many thematic special issues are planned for the remainder of 2021. Themes include quantum computing, FPGA computing, in-memory computing, and smart agriculture. The July/August issue will be a Special Issue on Quantum Computing, guest edited by Ulya Karpuzcu of the University of Minnesota. The FPGA Computing Special Issue will be guest edited by Maya Gokhale of Lawrence Livermore National Laboratory and Lesley Shannon of Simon Fraser University, Canada. The In-Memory Computing Special Issue will be guest edited by Reetuparna Das of the University of Michigan. We also have an open call for smart agriculture, focusing on the use of artificial intelligence and IoT in agriculture, guest edited by Neeraj Kumar of Thapar University and Sudip Misra of IIT Kharagpur.

We invite readers to submit to these Special Issues. Please find the open calls at:

› https://www.computer.org/digital-library/magazines/mi/call-for-papers-special-issue-on-processing-in-memory
› https://www.computer.org/digital-library/magazines/mi/call-for-papers-special-issue-on-ai-edge-and-iot-for-smart-agriculture

IEEE Micro is interested in submissions on any aspect of chip/system design or architecture. Hope you enjoy the articles presented in this issue. Happy reading!

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3064094
Date of current version 26 March 2021.

LIZY KURIAN JOHN is a Cullen Trust for Higher Education Endowed Professor with the Electrical and Computer Engineering Department, University of Texas at Austin, Austin, TX, USA. Contact her at ljohn@ece.utexas.edu.
Welcome to our special issue of IEEE Micro, which highlights the best presentations from Hot Chips 32, held virtually on August 16–18, 2020. Like many things in 2020, Hot Chips was unprecedented, going virtual for the first time in its history. Presentations were done by video, with a small production team working in a studio and a virtual conference supplemented by Zoom chatrooms. Despite the switch to the virtual format and the challenges to the global economy, attendance was the highest ever, and the technical program was robust, with strong representation from traditional CPU, GPU, and FPGA manufacturers, and strong offerings from startups in both communication networks and neural networks. This issue collects articles derived from the best talks, chosen after the conference by the Program Committee. It is a great time to work in computer architecture, where the combination of approaching limits to Moore's Law and new transformative application areas means that both incumbent computer architectures and new startups have large contributions to make.

Two of the articles focus on server processors, one from a long-time mainframe manufacturer and one from a potential disruptor of the server business. "IBM's Power10 Processor" describes the newest instance of the POWER architecture from the seminal computing company. Power10 was designed for general-purpose enterprise computing with interconnection across 16 chips in a multiprocessor, with 1-TB/s memory bandwidth per CPU and high-bandwidth links to accelerators including GPUs. By contrast, "Marvell ThunderX3: Next-Generation Arm-Based Server Processor" represents the new wave of ARM-based server-class chips that aim to change the performance and price/performance of the datacenter server market.

Two articles describe chips with significant graphics capability. "The Xbox Series X System Architecture" details the system-on-a-chip that powers Microsoft's latest gaming console, dedicating over two-thirds of its die to the GPU that delivers 4K at 120 frames per second. While graphics remain central to the mission of the article titled "NVIDIA A100 Tensor Core GPU: Performance and Innovation," GPUs have become the default for high performance and programmability in one solution, supporting a huge variety of scientific computing workloads and powering both neural network training and inference.

Bridging between the general-purpose floating-point power of GPUs and the specialized application focus of neural network accelerators, "Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing" uses the extensibility of the RISC-V ISA to reduce control overheads and energy costs for neural-network workloads.

Our final two articles come from startups, both with networking in their roots. "Pensando Distributed Services Architecture" describes their domain-specific architecture (including chips and programmable software) for building new applications and services within a datacenter network. "Compute Substrate for Software 2.0" explains startup TensTorrent's unique architecture for neural network acceleration, which takes a packet-network-inspired approach to control and flexibility, allowing better support for sparsity, varying numerical precision, and compression than older, more monolithic neural network accelerators.

All of the talks from Hot Chips 32 (https://www.hotchips.org/archives/hc32/) are available at the Hot Chips website. Hot Chips is run by a great set of volunteers, including a sophisticated logistical and marketing team in the Organizing Committee and the wonderful set of academic and professional computer architects on our Program Committee. No other conference has the same focus on production computer systems, presented by their designers, sharing how and why they built their chips. We hope you find this issue, and future Hot Chips, as informative and fun as we have.
The IBM POWER10 processor represents the 10th generation of the POWER family of
enterprise computing engines. It is built on a balance of computation and
bandwidth, delivered by powerful processing cores and intrachip interconnect,
respectively. Multiple system interconnect infrastructures support configurations
with up to 16 processor chips and up to 1920 simultaneous threads of execution, as
well as an expansive memory system with up to 2 Petabytes of addressing space.
Cross-system memory sharing and coherent accelerator attach are also supported.
The POWER10 processing core has been significantly enhanced over its POWER9
predecessor, including the addition of an all-new matrix math engine. Throughput
gains from POWER9 to POWER10 average 30% at the core level and three-fold at the
socket level. Those gains can reach ten- or twenty-fold at the socket level for
matrix-intensive computations.
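The capacity figures quoted above follow from simple per-chip arithmetic; a quick sketch, using the 16-socket maximum together with the 15-active-cores-per-die and SMT8 figures given in the body of the article:

```python
# Sanity-check the headline capacity figures for a maximal POWER10 system.
chips = 16                 # largest single-system configuration (16 SCMs)
active_cores_per_chip = 15 # 16 physical cores, one kept as a manufacturing spare
threads_per_core = 8       # each core runs eight simultaneous threads (SMT8)

threads = chips * active_cores_per_chip * threads_per_core
print(threads)  # 1920 simultaneous threads of execution
assert threads == 1920
```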
The IBM POWER10 processor delivers significant gains in capacity and capability over its immediate POWER9 predecessor1,2: an average 20% single-thread performance boost, and a 30% gain in core throughput over a wide range of applications. Combined with a two-and-a-half-fold increase in the number of cores per package, these improvements result in three times or better per-socket throughput on popular integer, floating-point, and commercial workloads, and 2–4 times increased memory bandwidth, depending on memory technology. For matrix math, the gains in performance can reach 10 or 20 times through a new computational engine.

Additional breakthroughs include: a new PowerAXON system interconnect with 1 TB/s of bandwidth per POWER10 chip and support for cross-system memory clustering; a new Open Memory Interface (OMI) that supports multiple industry-standard memory technologies on the same processor chip; a modular building block die that enables systems with up to 1920 simultaneous threads of execution; hardware-enforced security to protect sensitive code and data from attacks; and AI-optimized machine instructions to address the increased computing demands of modern machine learning/deep learning business applications.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3058632
Date of publication 10 February 2021; date of current version 26 March 2021.

POWER10 PROCESSOR CHIP

A POWER10 processor die (see Figure 1) consists of 18 billion transistors in 602 mm² of silicon, compared to 8 billion transistors in POWER9, and is built in Samsung's 7-nm technology with 18 metal layers. The central part of the die, approximately 300 mm², is occupied by 16 enterprise-grade cores, each capable of running eight simultaneous threads of execution (SMT8), and their associated level-2 and level-3 cache regions of 2 and 8 MB, respectively. To better match the supply and demand of processor chips with the maximum number of cores, we cap the number of active cores in a die to 15, keeping one core as a manufacturing spare. This results in up to 120 simultaneous threads of execution, backed by 120 MB of level-3 cache.

The remaining half of the POWER10 processor chip area is dedicated to the system interconnect, including the two protocol spines to the left and right of the core/cache complex, supporting the various interconnect protocols for memory, multiple processors, accelerators, clusters, and I/O. The periphery of the die is filled with high-bandwidth, power-efficient signaling circuits that implement the PowerAXON,3 OMI,3 and PCI Gen5 I/O infrastructures.

Not visible in Figure 1 are the large numbers of communication trunk lines, which run horizontally over the two L3 hemispheres and vertically over the protocol spines. The placement of the L3 hemispheres, the protocol spines, the trunk lines, the intercept of these vertical and horizontal trunk lines, and the location of the protocol spines next to the signaling infrastructure are the result of rearchitecting the chip floorplan around a computation-to-bandwidth balance.

FIGURE 1. POWER10 processor chip. Approximately half the die area is dedicated to cores and caches. The other half is for the various system interconnects, including memory interfaces, SMP, accelerators, clustering, and I/O.

POWER10 processor chips can be packaged in either single- or dual-chip modules (SCM/DCM). The SCM configuration is optimized for scale-up systems and maximizes power, interconnect bandwidth, and memory capacity delivered to each core. It also supports more flexible topologies, allowing configurations with up to 16 processor chips. The DCM configuration is optimized for scale-out systems and maximizes computational and I/O density while trading off the power and memory per core compared with the SCM. It limits configurations to a maximum of four DCMs (eight processor chips).

The POWER10 chip introduces new security features for cloud paradigms that extend trusted virtualization environments to include protected containers and include in-line memory encryption and application-level protections against attacks.

POWER10 SYSTEMS

POWER10 systems are built around the three interconnect infrastructures shown in Figure 2: the OMI, for connecting processors to memory; the PowerAXON interface, for interconnecting processor chips to other processors and accelerators and for implementing cross-system memory clustering; and PCI Gen 5 for I/O and other system interconnect.

PowerAXON and OMI signaling runs at rates of up to 32 GT/s. With a combined 256 bidirectional lanes, this results in up to 2 TB/s of total bandwidth on a processor chip, with 128 lanes and 1 TB/s for each interface. These are shown to the left and right of the POWER10 chip in Figure 2, respectively. (See Figure 1 for physical placement of the interfaces. Each PowerAXON corner has 32 lanes plus 4 spares.)

FIGURE 2. POWER10 system interconnect. OMI is used for attaching memory to the processor. PowerAXON provides the SMP, clustering, and accelerator interfaces. PCIe Gen5 is used for I/O and other interconnect.
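The 2-TB/s figure follows directly from the lane counts and signaling rate; a quick check, assuming each lane carries one bit per transfer in each direction (raw rate, ignoring any encoding overhead):

```python
# Per-interface bandwidth for PowerAXON or OMI on a POWER10 chip.
lanes_per_interface = 128  # 256 bidirectional lanes combined across both interfaces
gt_per_s = 32              # gigatransfers per second per lane, 1 bit per transfer

gb_per_s_one_direction = lanes_per_interface * gt_per_s / 8  # bits -> bytes
tb_per_s_bidirectional = 2 * gb_per_s_one_direction / 1000   # both directions

print(gb_per_s_one_direction)  # 512.0 GB/s each way per interface
print(tb_per_s_bidirectional)  # ~1 TB/s per interface, so ~2 TB/s for both
assert gb_per_s_one_direction == 512.0
```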
OMI is a technology-agnostic memory interface based on open standards. Memory is attached to the processor chip through an OMI-compliant buffer chip, which encapsulates technology-specific requirements, as first introduced in IBM's POWER8 processor and its companion Centaur memory buffer chip.4 POWER10 systems will initially use DDR4 memory, through a buffer chip built by Microchip.3 This buffer chip implements 25.6-GT/s signaling over an 8-bit interface, which matches its 8-byte DDR4 channel operating at 3.2 GT/s. With 16 channels in use, up to 410 GB/s of peak DDR4 bandwidth can be achieved per POWER10 chip, with a latency that is only 10 ns over traditional DDR4 DIMMs. DDR5 memory DIMMs can be supported later through a future buffer chip.

Alternative memory technologies can also be deployed with POWER10 processors using OMI. As shown in the right half of Figure 2, both high-bandwidth GDDR memory and high-capacity nonvolatile storage-class memory can be connected to the same OMI channels through corresponding buffer chips and using standard OMI DIMM form factors. A fully populated 16-channel GDDR configuration would achieve over 800 GB/s of memory bandwidth to a single POWER10 processor. This approaches the bandwidth achieved with high-bandwidth memory (HBM), but at higher capacities and lower cost. Alternatively, a storage-class (nonvolatile) memory solution could achieve capacities in the terabytes-per-DIMM range.

The PowerAXON infrastructure is used for system scaling, including multiprocessor interconnect, device attach, and memory clustering. The largest single-system configuration consists of 16 SCMs, interconnected as shown in the lower left corner of Figure 2. Each module is at most two hops away from any other module. A system with this configuration can run up to 1920 simultaneous threads of execution and contain up to 256 OMI DIMMs (16 DIMMs attached to each of 16 SCMs), with a maximum capacity of 1 Petabyte of memory. (POWER10 processors have a 2-Petabyte physical memory address space.)

PowerAXON can also be used to support OpenCAPI, an open, asymmetric protocol for coherently attaching compute accelerators, memory devices, network interfaces, and storage controllers, either in a device slot or a cabled external enclosure. Since its introduction with POWER9, a variety of vendors have provided OpenCAPI-attached devices that expand and enhance the functionality of POWER systems. POWER10 OpenCAPI provides a new level of performance and functionality over the prior version.

The third and final functionality of PowerAXON in POWER10 that we discuss in this article is memory clustering, shown in the top left corner of Figure 2. This new feature of POWER10, which is called memory inception, delivers the long-sought functionality of server disaggregation. Memory inception enables systems to directly share each other's main memory. The latency through memory inception is 50–100 ns over that of a remote (2-hop) socket within a server, and it is still low enough to be used as main memory. The protocol for memory inception is built on top of the OpenCAPI protocol and
different from the SMP protocol used to build (up to) 16-socket systems. Memory inception does not implement a cache coherence scheme and is not meant to enable larger single-system-image configurations. Rather, the goal is to allow one server to map its address space to the physical memory of another server.

As a scenario for using memory inception, consider the case of a cluster of homogeneous servers, each with enough memory for the average workload. By borrowing memory from other machines, a hosting system can run large memory workloads that go beyond the capacity of any single server. Another scenario is a hub-and-spoke configuration, in which a very large central server has a big pool of memory, distributed as needed across a large set of much smaller machines. This combines the cost efficiency of small machines with the memory capacity of a much larger server.

Memory inception can also be used as the message layer for a large cluster of POWER10 servers. Combined with the processor's 2-Petabyte address space, memory inception can use the address translation facilities in each server to create a multihop interconnect, with messages delivered simply by writing to the target memory. A robust, fully hardware-managed end-to-end message capability is possible in clusters with thousands of nodes, delivering high bandwidth, low latency, and flexible topologies.

The final component of the POWER10 system interconnect, shown in the bottom of Figure 2, is the set of PCIe Gen 5 interfaces. PCIe is central to the I/O infrastructure of POWER10 systems, and up to 64 lanes are available in a DCM (32 lanes per chip). With a signaling rate of 32 GT/s, a single DCM can achieve 252 GB/s of I/O bandwidth in each direction.

POWER10 CORE

The POWER10 core is the processing engine that runs both system and user software, responsible for the computational capacity and capability of POWER10 systems. There are two focus areas in the design of the POWER10 core: performance strength and power efficiency. A 30% average increase in core throughput while cutting power consumption in half combine to deliver a 2.6-fold average increase in energy efficiency for computations. The increased energy efficiency has allowed the implementation of DCMs with up to 30 SMT8 cores, and up to a three-fold throughput over current POWER9 modules with similar power consumption. The POWER10 core retains the modular architecture from POWER9 that provides a second variant of the chip with twice as many SMT4 cores per chip (up to 60 per DCM).

The microarchitecture of the POWER10 core, together with key factors affecting its performance and power efficiency, is shown in Figure 3. The block diagram shows those microarchitecture resources available for the execution of one to four simultaneous threads, corresponding to half of the total resources in an SMT8 core. POWER10 core components colored in green were somewhat improved in capacity over the predecessor POWER9 core. Those colored in blue had their capacity at least doubled, and those in red had their capacity at least quadrupled. These additional resources, along with various other improvements in latency and microarchitecture, are responsible for the 30% average increase in core throughput and a much higher boost in performance in some cases.

Each POWER10 SMT8 core has an associated 2-MiB L2 cache that provides both instructions and data and is four times the capacity of POWER9. For each half of the core, instructions are fetched at a sustained rate of up to 32 bytes per cycle and predecoded before being installed in a 48-KiB instruction cache (50% more capacity than POWER9). During the predecode stage, select pairs of instructions can be identified for fusion into a single internal operation of the microarchitecture, which leads to a faster and more efficient execution of those instructions. The new 64-bit prefix instructions in Power ISA 3.15 are also identified in that stage. POWER10 then decodes and dispatches to the execution slices up to 8 instructions per cycle per thread, or 16 instructions per cycle per SMT8 core. This represents a 33% increase in dispatch rate when compared to POWER9. Over a thousand instructions can be in flight, from dispatch to commit, in a POWER10 SMT8 core, representing a doubling of the out-of-order execution capabilities over POWER9. The translation lookaside buffer (TLB) has been increased four-fold to 8192 entries per SMT8 core, while at the same time reducing the latency and increasing throughput over POWER9.

The four execution slices of POWER9 have been widened to 128 bits each. This has resulted in a doubling of the general SIMD rate of execution, to a maximum of four SIMD instructions per cycle per thread or up to 8 SIMD instructions per cycle per SMT8 core. Crypto processing in the execution slices has also been enhanced, with an overall four-fold gain in throughput from the POWER9 to the POWER10 core.

A single thread of execution can load up to two 32-byte data chunks per cycle from the L1 cache,
FIGURE 3. POWER10 core microarchitecture. The boxes on the left show the improvements over POWER9 on performance and power efficiency, respectively. The latency numbers include both absolute values and improvements over POWER9.
with a total SMT8 core load bandwidth of 128 bytes per cycle. (The same bandwidth can also be achieved from the L2 cache.) A single thread of execution can store up to four instructions per cycle by gathering from up to two store queue entries when each entry includes a fused store operation. Stores always target the L2 cache, and the maximum bandwidth is 32 bytes per cycle per thread or 64 bytes per cycle per SMT8 core.

Complementing the four general-purpose execution slices, POWER10 introduces a new matrix math accelerator (MMA) unit, optimized for the execution of new matrix instructions in Power ISA 3.1. The instructions perform BLAS2- and BLAS3-class operations on eight 512-bit accumulator registers that are added to the architecture. The instructions use either two or three 128-bit vector-scalar registers to perform rank-1, -2, -4, or -8 updates on either a 4×2 or 4×4 matrix stored in an accumulator. Each input vector-scalar register contains either a 2×1 vector of double-precision elements, a 4×1 vector of single-precision elements, a 4×2 matrix of 16-bit elements (half-precision floating-point,6 bfloat16,7 or signed integer), a 4×4 matrix of 8-bit elements (signed/unsigned integer), or a 4×8 matrix of 4-bit elements (signed integer).

The MMA microarchitecture reduces data switching by storing the accumulators locally in the unit itself, significantly reducing the total data movement (bits × distance) when compared to an equivalent 512-bit SIMD operation. The result is improved power efficiency and higher frequency, enabling the POWER10 core to achieve a four-fold increase in matrix math throughput compared to the POWER9 core.

In addition, a focus on power efficiency dominated many other elements of the POWER10 core microarchitecture and design. When compared to the POWER9 core, there is more use of clock gating and an emphasis on reducing data switching. The branch prediction accuracy has been improved, which results in less wasted work and improves thread latency. Instruction fusion also helps with both performance and power efficiency, by combining multiple instructions into fewer operations. POWER10 supports both independent and dependent forms of fusion. Dependent fusion combines the execution of two instructions that share a register dependence into a single operation (with no dependent latency) or a latency-optimized pair of operations, whereas independent fusion enables the combining of loads or stores to adjacent memory locations into a single wider access, reducing resource consumption and conflicts.

The register file for the general-purpose and vector-scalar registers requires four times fewer write-
FIGURE 4. POWER10 core speeds and feeds. Load/store and SIMD bandwidth have been doubled over POWER9, matching SIMD and load throughputs. The matrix math unit offers increased throughput in computationally intensive operations.
FIGURE 5. POWER10 general purpose socket performance gains. The three-fold improvement in performance comes from a combination of an increased number of cores and more powerful cores. DDR5 will double memory bandwidth in the future.
…derived from presilicon simulations and have been correlated against first-pass silicon. We do not yet have the final version of the chips and the results reflect the projected frequency of operation for production POWER10 parts. The figures are for a dual-socket POWER10 system relative to a dual-socket POWER9 S924 server. We observe a three-fold improvement in performance across integer (SPECint2017_rate), floating-point (SPECfp2017_rate), and commercial benchmarks. For memory streaming benchmarks, the POWER10 gains over POWER9 range from two- to four-fold, using DDR4 and DDR5 memory, respectively.

For computations that are heavy on matrix math, the gains from POWER9 to POWER10 are even more substantial, as shown in Figure 6. LINPACK is expected to run ten times faster in POWER10 than POWER9, when compared socket-to-socket. The same is expected for a single-precision floating-point implementation of the Resnet-50 benchmark. When some of the new mixed-precision math features of POWER10 are taken into account, our evaluation shows that Resnet-50 will execute up to 15 (with the bfloat16 data type) or 20 (with the 8-bit integer data type) times faster than the standard single-precision Resnet-50 in POWER9.
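These matrix-math gains come from the MMA's rank-update instructions described earlier. As an illustrative sketch only (this is not IBM's implementation, and register packing is simplified away), a rank-k update accumulates the sum of k outer products into a 4×4 accumulator:

```python
import numpy as np

# Hypothetical sketch of an MMA-style rank-k outer-product accumulation.
# In hardware the inputs come packed in 128-bit vector-scalar registers;
# here they are plain arrays.
def rank_k_update(acc, a, b):
    """acc: 4x4 accumulator; a: 4xk input; b: kx4 input; k = 1, 2, 4, or 8."""
    acc += a @ b          # sum of k outer products, accumulated in place
    return acc

acc = np.zeros((4, 4), dtype=np.float32)
a = np.ones((4, 2), dtype=np.float32)   # e.g., a 4x2 tile of 16-bit-class elements
b = np.ones((2, 4), dtype=np.float32)
rank_k_update(acc, a, b)                # each accumulator element gathers k=2 products
```

Because one pair of register reads produces a full 4×4 tile of multiply-adds, each operand fetch is amortized across 16 results, which is the data-movement saving the article attributes to keeping the accumulators local to the MMA unit.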
FIGURE 6. POWER10 SIMD/AI socket performance gains. The matrix math accelerator delivers four times the throughput of POWER9 SIMD. Combined with two-and-a-half times the number of cores, it results in a ten-fold improvement in socket throughput for computationally intensive operations. Further gains are possible with the new reduced-precision operations.
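The ten-fold socket figure in the caption is simple multiplicative scaling of the two ratios quoted (four times the per-core matrix throughput, two-and-a-half times the cores); the arithmetic:

```python
# Socket-level projection from the two ratios quoted in the caption
# (illustrative arithmetic only).
per_core_matrix_speedup = 4.0   # MMA vs. POWER9 SIMD, per core
core_count_ratio = 2.5          # 2.5x the cores per socket
socket_speedup = per_core_matrix_speedup * core_count_ratio
print(socket_speedup)   # 10.0
```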
ThunderX3 is the latest server chip from Marvell based on the Arm instruction set architecture (ISA), manufactured in TSMC 7 nm.

Server processors constitute a segment of the overall processor market. They are typically deployed in racks in temperature-controlled warehouse settings called datacenters and are accessed from client devices for a variety of purposes including checking email, web search, and cloud services. Historically, several ISAs have been used in server processors including IBM mainframe ISAs, SPARC, PA-RISC, and Itanium. While there are still important residual islands that use these other architectures, most servers today run on the x86 ISA created by Intel. The shift to x86 was driven by the low cost of x86 servers and the availability of Linux, a mature, open source operating system. The low cost of x86 servers was enabled by the economies of scale resulting from the use of the x86 architecture in PCs. However, processor shipment volumes have now shifted from PCs to cell phones where the Arm ISA is almost universally used. Marvell and other companies are riding the mobile wave to enter the server market with the Arm ISA.

The design of server processors is focused on reducing the total cost of ownership (TCO) at the datacenter—that is, to provide the required service at the lowest total dollar cost, including both the cost to buy and the cost to run. From the perspective of processor design, TCO optimization translates to optimizing performance per dollar and performance per watt at the platform level. TCO optimization is a multidimensional problem and different datacenters take different approaches. Typically, there is a maximum power [referred to as thermal design power (TDP)] that datacenters specify for a processor—the datacenter design requires that the processor stay under this TDP under all circumstances. The design goal for a server processor is then to achieve the maximum possible performance while staying under this TDP. There are die area limits as well to manage the manufacturing cost of the die. The workload that is often used as an initial performance gate is the CPU benchmark from the SPEC organization.1 Both single thread and rate performance of the CPU benchmark are important—single thread as a measure of the responsiveness of the system and rate as a measure of the throughput capability. The CPU benchmark from SPEC is a good overall measure of the capability of a server chip but may not always correlate directly to performance seen on customer applications.

Focus areas of server processor design, thus, include optimizing single thread performance as measured by the Integer CPU single thread benchmark from SPEC, optimizing socket level performance as measured by the Integer CPU rate benchmark from SPEC, optimizing performance on workloads commonly run in the datacenter such as web servers, databases, web search, Java application servers, and other such workloads, optimizing die area and power, …

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3055451
Date of publication 29 January 2021; date of current version 26 March 2021.
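The TDP-bounded design goal stated above, maximize performance subject to a power cap, can be sketched as a simple selection over candidate design points. All names and numbers below are hypothetical:

```python
# Pick the best design point under a TDP cap (all numbers hypothetical).
design_points = [
    # (name, SPECrate-style performance, power in watts)
    ("low-voltage",    300, 140),
    ("nominal",        360, 180),
    ("high-frequency", 400, 225),
]
TDP_WATTS = 200

feasible = [p for p in design_points if p[2] <= TDP_WATTS]
best = max(feasible, key=lambda p: p[1])
print(best[0])   # "nominal": the fastest point that stays under the cap
```

The high-frequency point is faster in isolation but exceeds the cap, so it is infeasible; that is the sense in which the datacenter's TDP, not raw performance, sets the design target.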
…performance counters, and in conjunction with input from software (OS or Hypervisor) makes decisions on frequency and voltage settings and configuration to optimize chip performance and power.

THUNDERX3 CORE
The ThunderX3 core is a deep out-of-order core with support for four-way multithreading. On single thread performance, it is competitive with the highest end cores from x86 competitors, and the support for four-way threading allows it to outperform high-end competitor cores on a wide variety of workloads.

FIGURE 2. ThunderX3 core block diagram.

Figure 2 has the core block diagram. Up to eight instructions per cycle are fetched from the instruction cache—way prediction is used at the instruction cache to simplify the design and to reduce power. Concurrent with instruction cache access, various branch prediction structures are also read to decide the next bundle of instructions to be fetched. There are separate prediction structures for conditional branches, indirect branches, and returns. Support for decoupled fetch was added in ThunderX3—this is a mechanism to keep fetching following an instruction cache miss, using BTBs and branch predictors to predict future PCs—it is quite effective, showing 1.5× to 2× gain on datacenter codes such as webservers and databases that tend to have a large instruction footprint and predictable basic block sequences. Fetch bundles break on a 64B cache line boundary and on a taken branch. Front-end inefficiencies introduced by fetch breaks and branch resteer latencies are a major challenge in high-end CPU design since IPCs on many cache resident codes may reach 3 or more, and even a single cycle additional delay on a frequently executed branch resteer could lower IPC by a high percentage. ThunderX3 implements resteer at different levels to reduce these penalties and does bundle merging to smooth out the instruction stream.

The eight instructions that are fetched go into the decoder. The decoder maps instructions to micro-ops—most instructions map to a single micro-op, but there are a few that map to multiple micro-ops. Micro-op expansion was reduced significantly going from ThunderX2 to ThunderX3. ThunderX2 was derived from an earlier MIPS-based architecture, and during the transition to Arm, most instructions that did not map to a MIPS instruction were broken into micro-ops. In ThunderX3, instructions such as loads and stores with register-plus-register addressing that see widespread use by Arm compilers were mapped to a single micro-op. Reducing micro-op expansion results in additional complexity in the execution unit to execute these more complex operations, but the performance gain was worth it. Decode also fuses certain instruction pairs such as simple integer instructions and branches into a single micro-op. On average across the SPEC integer suite about 0.95 micro-ops are output by decode per instruction.

Decoded micro-ops go into a skid buffer—the skid buffer is the separation point between the front end of the pipe and the back end of the pipe. The skid buffer is one of the few structures in the core that is statically partitioned among the threads—to simplify thread arbitration at dispatch. The renaming unit picks up micro-ops from the skid buffer, does renaming, allocates various backend structures for the micro-op, and writes the micro-op into the scheduler, the reorder buffer, and other queues such as the load and store queue as needed. Renaming is done at four micro-ops per cycle and there are optimizations around merging to reduce instruction delivery inefficiencies created by breaks in fetch bundles. Note that renaming width is four while fetch and decode width are eight—it may seem that the pipeline is out of balance, but in practice most workloads do not run anywhere close to peak rates in a sustained fashion. Performance and design studies showed that increasing rename width had low performance benefit at the current design point, but high design cost.
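The decode-time fusion just described, where a simple integer instruction and a following branch become one micro-op, can be sketched as a scan over the instruction stream. The instruction classes and the stream below are illustrative assumptions; the article's 0.95 micro-ops/instruction figure is a measured average over SPEC integer, not something this toy reproduces:

```python
# Sketch of decode-time pairing: a fusible ALU op followed by a branch
# becomes one micro-op (instruction classes are illustrative).
FUSIBLE_FIRST = {"cmp", "add"}

def micro_ops_after_fusion(instrs):
    uops, i = 0, 0
    while i < len(instrs):
        if instrs[i] in FUSIBLE_FIRST and i + 1 < len(instrs) and instrs[i + 1] == "branch":
            uops += 1           # fused pair -> a single micro-op
            i += 2
        else:
            uops += 1           # everything else maps 1:1 in this sketch
            i += 1
    return uops

stream = ["load", "cmp", "branch", "add", "store", "cmp", "branch"]
ratio = micro_ops_after_fusion(stream) / len(stream)
print(round(ratio, 2))   # 0.71: 5 micro-ops for 7 instructions
```

A ratio below 1.0 is the point of the exercise: fusion lets an eight-wide decoder feed the four-wide renamer with fewer micro-ops than instructions.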
…1.5× and on datacenter codes we are seeing 2–3× on a variety of codes.

THUNDERX3 THREADING
There are four hardware threads per core on ThunderX3. Each hardware thread has the full Arm CPU state—so the OS sees four CPUs per core. The four hardware threads share almost all core resources including caches, execution units, pipeline stages, structures such as the reorder buffer and the scheduler, as well as interfaces to the external world. The core was designed with four threads from the beginning—there was never a single thread variant of the core. A back-of-the-envelope analysis of the area penalty of threading was done and the penalty came out to around 5%. The gain from threading on a variety of codes is well over 5%—so threading is quite area and power efficient.

Figure 4 describes thread arbitration, which provides a flavor of how threading is implemented in the core. There are four points of arbitration: once at fetch, once at rename, once in the scheduler, and once at retire. At fetch, the goal of the arbitration algorithm is to treat the four threads uniformly while achieving the maximum execution rate. On each cycle, the algorithm looks at the active threads and picks the thread that has the fewest instructions/micro-ops in flight further down the pipe. That ensures that among active threads, pipeline utilization is balanced to some extent. Similarly, at rename, among the threads that have at least one micro-op in the skid buffer, arbitration picks the thread that has the fewest micro-ops further down the pipeline. The scheduler is threading agnostic—it picks the oldest micro-op in the scheduler for each issue port on each cycle. The execution units are threading agnostic as well, executing whatever micro-op is delivered to them by the scheduler. Finally, the retire unit picks the thread that has the most micro-ops to retire, with a mechanism to prevent starvation when a single micro-op in a thread has not retired for a while. With this arbitration scheme, when running similar workloads on all threads, performance is uniform among the threads. Of course, when the threads are running workloads with vastly different profiles, the notion of what is fair itself is not clear, but results are reasonable with no thread experiencing dramatic slowdowns.

Figure 5 shows single core performance gains from threading—threading gains often correlate with the instructions per clock (IPC) that a single thread is able to achieve, which is a measure of how efficiently the thread is using the pipeline—the more slack there is in execution, the more opportunity there is for threading gains. But cache pressure is a factor in some cases. On MySQL, which runs at a low IPC, we see more than 2× gain from going to four threads. For a benchmark such as leela (a go playing code from cpu2017), which achieves mid-range IPC, the gain from threading is still good at 1.7× to 1.8× at four threads. For x264, which is a video encoder also from cpu2017, the IPC is high but there are still decent gains.

THUNDERX3 SOC
Figure 6 shows the overall SOC—it is a switched ring with a switch at the top and bottom. Each leg is bidirectional, and the traffic is routed through the switches to the different legs. During the design of ThunderX3, a mesh was considered for the interconnect but given that DDR bandwidth was not increasing dramatically from ThunderX2 to ThunderX3, a mesh did not seem necessary, and the switched ring architecture was adopted to minimize design changes. Cores are grouped in four-core clusters along with an interconnect slice, L3-cache, and coherence control logic

FIGURE 5. Threading performance gains.
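The fetch/rename arbitration heuristic described in the threading section, pick the eligible thread with the fewest micro-ops already in flight downstream, can be sketched in a few lines. The function name and the per-thread counts are hypothetical:

```python
# Fetch/rename arbitration sketch: among eligible threads, pick the one
# with the fewest micro-ops in flight downstream (numbers are made up).
def pick_thread(in_flight, eligible):
    """in_flight: micro-op counts per thread id; eligible: ids that may dispatch."""
    return min(eligible, key=lambda t: in_flight[t])

in_flight = {0: 42, 1: 7, 2: 19, 3: 30}
print(pick_thread(in_flight, eligible={0, 1, 2, 3}))   # 1: least work in flight
print(pick_thread(in_flight, eligible={0, 2}))         # 2: thread 1 has nothing to dispatch
```

Biasing toward the least-occupied thread is what balances pipeline utilization among active threads; the scheduler and execution units downstream can then stay threading agnostic.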
to form tiles. Tiles are replicated to form the Core and L3-cache cluster in the middle of the die. DDR, PCIe, and other I/O tap into the rings and are on the periphery. Support logic such as the interrupt controller, the system management unit, and debug assist also tap into the rings and are on the periphery.

The L3-cache is 90 MB (1.5 MB per core) on ThunderX3 versus 32 MB (1 MB per core) on ThunderX2. Physical addresses are mapped statically to L3-cache tiles and the location of the requesting core has no impact on tile selection. The large, shared L3-cache allows large datasets and instructions of highly multithreaded datacenter codes to be held in the …

FIGURE 7. Socket level performance.

ACKNOWLEDGMENTS
The authors would like to thank their colleagues, hundreds of engineers over the years, who worked days, nights, and weekends, sometimes under extremely challenging circumstances, to build ThunderX3 and its predecessors. They would also like to thank the anonymous reviewers for the helpful feedback.

REFERENCES
1. 2017. [Online]. Available: https://www.spec.org/cpu2017/Docs/overview.html
2. Arm Developer, Understanding the Armv8.x extensions, 2019. [Online]. Available: https://developer.arm.com/architectures/learn-the-architecture/understanding-the-armv8-x-extensions/single-page
3. 2016. [Online]. Available: http://www.cs.virginia.edu/stream/ref.html
4. 2020. [Online]. Available: https://github.com/akopytov/sysbench
RABIN SUGUMAR is a Chief Architect with Marvell, Sunnyvale, CA, USA and leads the architecture group for the ThunderX server processor line. In his role, he participates in a variety of aspects of ThunderX development and productization. He joined Marvell from Cavium, which Marvell acquired in 2018. Prior to this, he was with Broadcom, where he was one of the lead architects on the server processor that later became ThunderX2 when the team from Broadcom moved to Cavium. During his career, he has worked on architecture and design of vector processors at Cray Research; early multithreaded and out-of-order SPARC processors with Sun Microsystems; and Arm-based processors with Broadcom, Cavium, and now, Marvell. Sugumar received a Ph.D. in computer science and engineering from the University of Michigan, Ann Arbor, MI, USA. Contact him at rsugumar@marvell.com.

MEHUL SHAH is a Principal Architect involved in the design of server processors with Marvell, Sunnyvale, CA, USA. He came to Marvell through the acquisition of Cavium. Prior to that, he was one of the designers of the server processors with Broadcom that later became ThunderX2. Before Broadcom, he was involved in architecture and verification of high-performance embedded processors at two startups, PA Semi and SiByte. Shah received a B.S. in electrical engineering and computer science from UC Berkeley, Berkeley, CA, USA and an M.S. in electrical engineering from UCLA, Los Angeles, CA, USA. Contact him at mehuls@marvell.com.

RICARDO RAMIREZ is a Principal Engineer with Marvell, Santa Clara, CA, USA and is the lead designer on the ThunderX CPU logic design team. He joined Marvell with the Cavium acquisition in 2018. Before Cavium, he was with Broadcom as the lead designer on the server processor that became ThunderX2 when the team moved to Cavium. Throughout his career, he has been involved in developing many high-performance CPU cores, including Intel's Itanium processor and Broadcom's XLR and XLP line of MIPS-based processors. Ramirez received an M.S.E.E. from Stanford University, Stanford, CA, USA. Contact him at rramirez@marvell.com.
The Xbox Series X console, released in November 2020, contains a System on Chip
(SoC) created in partnership with AMD. This article describes its architecture
including the intelligence for input processing, rendering game graphics and audio,
managing storage, user services, and security, all under a tiered operating system.
The Xbox 2020 console generation features numerous upgrades to the previous architecture that powered console products from 2013 to 2017, including Xbox One, Xbox One S, and Xbox One X. Some of the key architectural features of the Series X1 are listed in Table 1.

…subsystems for multimedia, security, and I/O. Media-related blocks are fed by the Media Hub interface, while security and audio blocks are fed by the System Hub for best quality of service. The primary I/O interface is the 8-lane Gen4 PCI Express interface, which connects to a South Bridge chip, internal and (optional) external NVMe flash SSDs, and a gigabit ethernet NIC.
AUDIO UNITS
Overall, the three hardware audio engines in Xbox
Series X have more peak single-precision floating-
point performance than all eight of the Xbox One X
CPU cores running at 2.3 GHz. The CPFU2 is a new
engine focused on efficient audio convolution, FFTs,
reverb, and complex arithmetic for audio algorithms. It
enables new realistic audio experiences such as in Project Acoustics,4 where environments are modeled, and 3-D audio sources are simulated in real time.

FIGURE 1. The SoC die.

MOVAD is a hyper real-time Opus5 audio decoder, with a throughput-matched high-quality sample rate converter. It can process more than 300 real-time channels of Opus. Because of its optimal quality and compression ratio, the Opus CODEC was chosen for hardware implementation. Opus allows different SILK voice CODEC and CELT music CODEC mixes per audio frame. The SRC engine in MOVAD has a >100 dB signal-to-noise ratio across game use cases, which are much more difficult and varied than traditional music/voice audio.

PLUTON HSP AND MSP
The integrated Hardware Security Platform6 (HSP) raises Xbox's already robust hardware security threshold with the addition of Secure Hardware Crypto Keys (SHACK). The HSP orchestrates security operations without other software or firmware involvement.

The Media Security Platform unit (MSP) offloads simultaneous cryptographic and compression-related processing of streams of storage data to/from the NVMe SSD. The MSP performs LZ lossless decompression and custom, texture-optimized BCPack decompression, giving an average 2:1 space and bandwidth boost.
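Audio convolution of the kind the CPFU accelerates (reverb is essentially a long FIR filter applied to the input signal) is classically done in the frequency domain. A NumPy sketch of the standard FFT-convolution equivalence follows; the engine's actual algorithm is not described in the article, so this is generic DSP, not the hardware's implementation:

```python
import numpy as np

# Fast convolution via FFT: zero-pad to the full output length,
# multiply the spectra, and transform back.
def fft_convolve(signal, impulse_response):
    n = len(signal) + len(impulse_response) - 1
    spectrum = np.fft.rfft(signal, n) * np.fft.rfft(impulse_response, n)
    return np.fft.irfft(spectrum, n)

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.5, 0.5])          # tiny stand-in for a reverb impulse response
y = fft_convolve(x, h)            # equals np.convolve(x, h) up to float tolerance
```

For long impulse responses the FFT route turns an O(N·M) time-domain filter into O(N log N) work, which is why convolution engines and reverbs lean on FFTs in the first place.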
THE GPU
The goal for the GPU was to create a console-class
design that significantly advances gamers' sense of
immersion in realistic worlds. A generational increase in
raw graphics operations per second is natural, but
because of the cost considerations described previ-
ously, little real estate could be devoted to brand new
GPU functions, so enhancements were added judi-
ciously. Figure 3 shows the overall RDNA-based GPU
structure. It fully supports Direct3D12 feature level 12_2.
There are four Shader Arrays, each with 6 or 7 dual CUs for 26 total active. Each Shader Array has its own …

FIGURE 4. Evolution of performance and capacity.
increase about ninefold since 2013. This growth has enabled developers to create ever more stunning visuals. Note that memory space and bandwidth have grown much more slowly—only 2–3×.

The brown line tracks the number of TV screen pixels that must be filled. Because consumers have been able to increase their TV resolution from FHD (1920×1080) to 4K UHD at 120 frames/s, the pixel count has gone up almost as fast as shader power. Taking the average of GPU compute and memory capability, the useable graphics performance increase is in the 4–6× range depending on the game title.

By these metrics, the GPU is falling behind the performance-per-pixel curve; but developers and players alike want ever nicer, more realistic pixels. A big part of the solution for Xbox Series X involves architectural enhancements that amplify the raw performance. The actual increase depends on several factors, including adoption of new techniques by game programmers and the specific content.

Variable Rate Shading
Traditionally, GPUs had to run a shader thread on every rendered pixel, or fragment, to generate a color value—and with antialiasing, multiple fragments are depth-buffered and shaded per screen pixel. We observed that for most scenes that amount of unique work is overkill, since there is not high spatial frequency variation everywhere. A single fragment color can serve for multiple subpixel samples or multiple pixels without noticeable degradation. Proposed in 2018 by Microsoft, VRS is now widespread.

Figure 5 shows an analysis of a typical rendered frame—in this case from the Sponza scene model. The black areas are where shading can occur at quarter resolution, i.e., one fragment for a two-by-two pixel area. Turquoise areas are shaded once per two-by-one pixel area; yellow per one-by-two. Red areas are shaded at the normal 1×1 pixel rate. This shows that, on average, we only need to shade every other pixel; but we need to be judicious about the distribution of fragments to not lose visual detail. VRS addresses this using a set of bias controls. The rate can be determined based on knowing which objects have high detail, which primitives within objects, or based on individual 8×8-pixel screen tiles. For instance, since the game title performs multiple rendering passes, it is possible to predict which areas of the screen that might ordinarily have high detail will be blurred in later passes.

A programmable combination of the different types of rates is supported, which increases or decreases the nominal rate, which is limited in fineness to the global antialiasing level set for a given rendering pass. Object edge detail is preserved, and VRS can be used alongside other resolution-enhancing techniques, including temporal antialiasing, super resolution, and even checkerboarding. The actual amount of dedicated hardware for this feature is tiny but can have a payoff of around 10%–30% in improved performance, allowing higher frame rates and more math per pixel.

Sampler Feedback Streaming
Over many years, game engines have tried multiple approaches to loading only enough texture detail to satisfy the demands of the frames about to be rendered. In Series X there are two new structures in the GPU to assist with tile-by-tile management of a modest texture working set placed in RAM just before or just after needed. There is a residency map per texture that clamps the level of detail (LOD) of each tile, and a request map that records the finest MIPMAP level that was requested for each tile since it was last reset. The tile size is flexible. The sequence of SFS operations is as follows:

1. The first step is to allocate virtual memory space for the entire texture. Then, the title loads all coarsest mipmap levels—in this case, from the coarsest level up to level 2, which is 1/4 by 1/4 the dimensions of the finest level, requiring just 6.7% of all pixels to be resident in memory. Finer levels are divided into tiles.
2. The next step is to render. Closer-up portions of a texture require more detail. With SFS, the shader executes a single sample macroinstruction that combines residency map lookup to determine the current detail level with the fetch of the actual texture data. Since the shader sample instruction, as in the past, has already calculated which LOD tiles should have been fetched, those values are captured in the separate request map. The closer tiles need more detail and the farther tiles may need less than what is resident. The latter are candidates for eviction if the cache is full.
3. After rendering, the application reads back the request map, compares it with its saved copy of the residency map, and uses the XVA API to bring in the higher detail tiles it decides are needed from flash. For each fine LOD tile, the corresponding region of the coarser LODs is also loaded to provide the right detail everywhere in that region.
4. Finally, the residency map is updated to reflect the current state of the tile cache and uploaded to the texture unit. The next time that texture is accessed the finer detail is ready to use.

There is a second mode of the request map to support texture-space shading, which saves GPU work by deferring rendering passes that generate texture tiles until it is known which are needed. In this mode, instead of one detail value per tile for an entire texture, accesses are tracked using a single bit per tile for every MIPMAP level.

Ideally these new tile maps, treated just like other textures, stay on-die for low latency access, so they are designed to be as small as possible—meaning tiles should be as large as practicable. But we also do not want to see seams between tiles with different levels of available detail. In the example in Figure 6, the red tile has LOD 0 resident, orange is LOD 1, etc. With bilinear filtering on the left, fractional LOD values greater than zero leak over the boundary into the lower detail tiles, meaning that pixels sampled in those regions would need nonresident LOD 0 pixels to be blended in, causing visual errors. With a new biased filter function shown on the right, the transition zones are moved toward the coarser LOD map texel, so that the nonresidency problem is avoided. Overall, with a very small incremental hardware cost, Sampler Feedback Streaming gives the same or better level of visual detail with up to 60% savings in I/O and memory footprint costs.

FIGURE 6. Xbox Series X texture LOD filtering.

Ray Tracing
The Series X GPU supports DirectX ray tracing acceleration, allowing the most physically realistic rendering techniques to be used in real time. But in this console generation, developers still want to use traditional rendering techniques evolved over decades
without a performance penalty. They can apply ray tracing selectively where materials and environments demand. This means the GPU needs a good balance of die resources dedicated to the two techniques.

We have added hardware embedded in the compute units to perform intersections of rays with acceleration structures that represent the scene geometry hierarchy. That task is a sizeable fraction of the overall specialized ray tracing workload. The rest can be performed with the baseline shader design and memory hierarchy with good real-time quality. The hardware performs up to 380 billion ray-box intersections or 95 billion ray-triangle intersections per second, or any mix of the two. The overall speedup varies with scene complexity and lighting characteristics, but for the intersection task it can be up to ten times the performance of a pure shader-based implementation.

Machine Learning Support
Game engines increasingly make use of machine learning inference for a variety of game-related tasks, from character behavior and animation to super resolution detail enhancement. Xbox Series X includes a small increment for small-integer operations in the compute units for inference,9 accelerating some tasks up to 10×.

CONCLUSION
The Series X SoC, in conjunction with software innovations provided by the software teams at Microsoft and game development companies, helps "get technology out of the way" and deliver developers' vision to gamers worldwide.

REFERENCES
1. 2020. [Online]. Available: https://www.xbox.com/en-US/consoles/xbox-series-x
2. 2021. [Online]. Available: https://en.wikipedia.org/wiki/High-dynamic-range_video
3. "HDMI 2.1 specs and features: Everything you need to know," TechHive, 2017. [Online].
4. 2019. [Online]. Available: https://docs.microsoft.com/en-us/gaming/acoustics/what-is-acoustics
5. 2020. [Online]. Available: https://opus-codec.org/
6. 2020. [Online]. Available: https://www.microsoft.com/security/blog/2020/11/17/meet-the-microsoft-pluton-processor-the-security-chip-designed-for-the-future-of-windows-pcs/
7. 2020. [Online]. Available: https://news.xbox.com/en-us/2020/07/14/a-closer-look-at-xbox-velocity-architecture
8. 2015. [Online]. Available: https://gamingbolt.com/xbox-one-gpu-has-8-graphics-contexts-uses-multiple-gpu-command-streams-to-reduce-cpu-gpu-latency
9. "Inside Xbox Series X: the full specs," Eurogamer, 2020. [Online].

MARK GROSSMAN is a Principal Architect with Microsoft, working on GPUs and display processing for multiple generations of consoles and headsets. His main interests are high-performance graphics systems and hardware-accelerated algorithms. He is a Founder of Silicon Graphics. Grossman received a B.A. in information and computer science from the University of California, Santa Cruz, CA, USA. Contact him at mark.grossman@microsoft.com.

JEFFREY ANDREWS is a Distinguished Engineer with Microsoft, directing a silicon IP architecture team within Azure. His team's focus areas include: silicon security, machine learning, storage, graphics, audio, display, image processing, and embedded CPUs. He has worked on every Xbox console, plus four game consoles, and four startups before Microsoft. Andrews received a B.Sc. in computer architecture from the University of Illinois, Urbana-Champaign, IL, USA. Contact him at jeffrey.andrews@microsoft.com.
NVIDIA A100 Tensor Core GPU is NVIDIA’s latest flagship GPU. It has been designed
with many new innovative features to provide performance and capabilities for
HPC, AI, and data analytics workloads. Feature enhancements include a Third-
Generation Tensor Core, new asynchronous data movement and programming
model, enhanced L2 cache, HBM2 DRAM, and third-generation NVIDIA NVLink I/O.
NVIDIA A100 TENSOR CORE GPU OVERVIEW

The diversity of compute-intensive applications in modern cloud data centers has driven the explosion of GPU-accelerated cloud computing. Such applications include AI deep learning (DL) training and inference, data analytics, scientific computing, genomics, edge video analytics and 5G services, graphics rendering, and cloud gaming. The NVIDIA Ampere architecture-based A100 GPU brings record-setting scale and novel capabilities to these workloads. The A100 GPU is 54 billion transistors built on TSMC's 7-nm process. It has 108 streaming multiprocessors (SMs) with 6912 CUDA cores, 40 MB of L2 cache, 600 GB/s of NVIDIA NVLink interconnect bandwidth, and 1.6 TB/s of HBM2 memory bandwidth. It also has new elastic GPU capabilities, including scale-out support with multi-instance GPU (MIG) virtualization and scale-up support with a third-generation 50-Gb/s NVLink I/O interface connecting multiple A100s directly or through NVIDIA's NVSwitch. Inside the A100 SM are new third-generation tensor cores with support for fine-grain sparsity and new BFloat16 (BF16), TensorFloat-32 (TF32), and FP64 datatypes. The SM also adds new asynchronous data movement instructions and barriers, which work together to efficiently stream data in a programmer-friendly way.

A comparison shows that A100 provides dramatically higher performance than currently available commercial designs and NVIDIA's previous-generation V100 GPU.1 A100 runs approximately 1.5× to over 2× faster than V100 on important HPC applications within the molecular dynamics, physics, engineering, and geoscience areas (see Figure 1). For DL workloads, DGX super-pods with A100 have set records for the MLPerf benchmark,2 handily surpassing all other commercially available systems, including Google's TPUv3 and Huawei's Ascend systems (see Figure 2). The benchmark also demonstrates A100's breadth of support for AI networks: it was the only system able to run all benchmarks, and it ran them with high performance.

Many new and innovative features in A100 contribute to its high performance and capabilities. This article will cover a few of those improvements: top-to-bottom features to support strong scaling, elastic GPU capabilities to support scale up and scale out, and asynchronous programming features to enable efficiency and programmer productivity.

STRONG SCALING: TOP TO BOTTOM
A typical deep neural network consists of long chains of interconnected layers (see Figure 3). While there is a massive amount of compute in a network, the parallelism is broken up into layers of smaller, sequentially dependent chunks of work. Each layer takes an input activation tensor and a weight tensor, performs an operation similar to a matrix multiplication, and outputs an activation tensor. To leverage the large amount of compute available in a GPU, the output tensor is broken down into smaller tiles which are distributed across the different SMs.
A100’s Tensor Core throughput is 2.5 times higher
on dense FP16 data than the previous generation V100
0272-1732 ß 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3061394 GPU. In weak scaling, both the network size and paral-
Date of publication 23 February 2021; date of current version lelism must grow to match the increased throughput.
26 March 2021. In strong scaling, both the network size and available
FIGURE 1. NVIDIA A100 HPC performance HPC relative to NVIDIA V100, normalized to per chip.
FIGURE 2. MLPerf v0.7 performance for NVIDIA A100 and other commercially available DL systems, relative to NVIDIA V100 &
normalized to per chip.
FIGURE 4. NVIDIA A100 & V100 tensor core formats and performance.
parallelism remain fixed, and A100 runs the network reduces the number of times data needs to be loaded
faster. and reduces SMEM bandwidth by half.
To achieve strong scaling, A100 needed to scale Data transfer efficiency from the memory system
performance at all levels: from the tensor cores, was also improved. In V100, data must first be loaded
through the cache hierarchy, through DRAM, and chip into the register file and then stored into SMEM. In
interconnect. A100, a new asynchronous combined load-global-
store-shared instruction was added which transfers
data directly into SMEM, bypassing the register file
SM Core and increasing efficiency.
A100’s new tensor cores have increased the process- Combined with the Tensor Core organization
ing of dense FP16 data by 2x per SM and 2.5 per GPU changes, A100 reduces 6 L1þSMEM accesses (L1 read,
over V100 (see Figure 4). The tensor cores added sup- SMEM write, 4 SMEM reads) down to only two SMEM
port for additional formats, including the TensorFloat- reads. The asynchronous capability of the transfer of
32 operation that improves the processing of FP32 memory also helps the SM to continuously stream data,
data by 10 per GPU. A100’s tensor cores also support improving utilization throughout the memory system.
fine-grain sparsity, which doubles the throughput
when processing sparse data.
The 2 increase in FP16 math throughput per SM A100 L2 Cache
for dense data requires 2 more data bandwidth, and The A100 L2 cache is a shared resource for the SMs.
the effective 4 increase for sparse data requires 3 Additional bandwidth from the L2 cache is necessary
more data bandwidth. to achieve strong scaling, as the number of SMs is
In the SM core, multiple improvements in data increased in A100 and each SM processes data at a
delivery provide this increase in data bandwidth (see faster rate. A100 delivers a 2.3x L2 bandwidth increase
Figure 5). Inside the V100 and A100 SM, the network over V100, supported by a new structure to efficiently
layer tile is further broken down into four smaller tiles move this data.
which are each processed by a 32-thread warp. V100’s The L2 cache is divided into two partitions to enable
tensor cores were designed to work at 8-thread granu- higher bandwidth and lower latency memory access
larity, and required the tiles to be further broken down (Figure 6). Each L2 partition localizes and caches data
to four smaller tiles per warp. Each of these tiles loads for memory accesses from SMs directly connected to
the tensor data from shared memory (SMEM), which the partition. Hardware cache-coherence maintains
in aggregate requires all data to be loaded four times. the CUDA programming model across the full GPU,
In A100, the tensor cores were reorganized and and applications will automatically leverage the band-
enhanced to work at 32-thread granularity. This width and latency benefits of A100’s new L2 cache.
L2 Cache Residency Controls
A100 gives applications the capability to influence the persistence of data in the L2 cache, allowing higher bandwidth and lower latency accesses to global memory. The so-called persistent accesses control the replacement policy to effectively set aside a portion of the L2 cache. Normal or streaming accesses to global memory can only utilize this portion of L2 when it is unused by persistent accesses.

There are two primary mechanisms that allow persistent data to be resident in the L2. In the first, an address-based window is specified in which all read/write accesses are persistently cached in the L2; individual accesses are not tagged in this scheme. Alternatively, controls can be specified on a finer-grained, per-memory-operation basis.
access will take advantage of the compressed data bandwidth. Compression helps both write and read accesses to DRAM and increases the effective DRAM bandwidth available.

A100 Gen 3 NVLink
The third generation of NVIDIA's high-speed NVLink interconnect is implemented in A100 and the new NVSwitch. NVLink is a reliable, high-bandwidth, low-latency memory interconnect, and includes resiliency features such as link-level error detection and packet replay mechanisms to guarantee successful transmission of data. The new NVLink has a data rate of 50 Gb/s per signal pair, nearly doubling the 25.78-Gb/s rate in Tesla V100.3 Each link uses four differential signal pairs (four lanes) in each direction, compared to eight signal pairs (eight lanes) in V100. A single link provides 25 GB/s of bandwidth in each direction, similar to V100, but uses only half the signals. The total number of NVLink links is increased to 12 in A100, versus six in V100, yielding 600 GB/s of total bandwidth for an entire A100 versus 300 GB/s for Tesla V100.

All writes in the third-generation NVLink now require an acknowledgement from the destination. This allows synchronization to be performed at the requester, and error attribution to be returned to a specific execution context. Writes are allowed to be pipelined to the destination while the requestor is waiting on a response. New features to improve the efficiency of small-payload writes and data-less responses were also added.

ELASTIC GPU: MULTI-GPU SCALE UP
The twelve NVLink links in each A100 allow a variety of configurations with high-speed connections to other GPUs and switches. To meet the growing computational demands of larger and more complex DNNs and HPC simulations, the new NVIDIA DGX A100 system (Figure 7) includes eight A100 GPUs connected by the new NVLink-enabled NVSwitch.

Multiple DGX A100 systems can be connected via a networking fabric like NVIDIA Mellanox InfiniBand and Mellanox Ethernet to scale out data centers, creating powerful supercomputer-class systems. More powerful NVIDIA DGX POD and NVIDIA DGX SuperPOD systems will include multiple DGX A100 systems to provide much greater compute power with strong scaling.

ELASTIC GPU: MULTI-INSTANCE GPU (MIG)
While many data center workloads continue to scale, both in size and complexity, some acceleration tasks are not as demanding, such as early-stage development or inference on simple models at low batch sizes. Data center managers aim to keep resource utilization high, so an ideal data center accelerator not only needs to handle one large workload efficiently; it must also efficiently accelerate many smaller workloads.
The new MIG feature can partition each A100 into as many as seven GPU Instances for optimal utilization, effectively expanding access to every user and application. The A100 GPU's new MIG capability can divide a single GPU into multiple GPU partitions called GPU Instances. Each instance's SMs have separate and isolated paths through the entire memory system: the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address busses are all assigned uniquely to an individual instance. This ensures that an individual user's workload can run with predictable throughput and latency, with the same L2 cache allocation and DRAM bandwidth, even if other tasks are thrashing their own caches or saturating their DRAM interfaces. Using this capability, MIG can partition available GPU compute resources to provide a defined quality of service (QoS) with fault isolation for different clients (such as VMs, containers, and processes). It enables multiple GPU Instances to run in parallel on a single physical A100 GPU. MIG also keeps the CUDA programming model unchanged to minimize programming effort.

MIG enables users to see and schedule jobs on virtual GPU Instances as if they were physical GPUs. MIG works with Linux operating systems and their hypervisors.

PRODUCTIVITY: ASYNCHRONOUS PROGRAMMING
The goal of CUDA is to compile intuitive C++ sources into high-performance executable programs for the GPU. In this pursuit, a compiler's goals are in tension: to maximize operation-level parallelism, but never to alter the program's semantics. NVIDIA's work on joint hardware and programming-system codesign directly addresses this tension.

Updates to CUDA for A100 expand the expressiveness of C++ for data and computation pipelining, which we identified as a growing source of difficulty for CUDA programmers. The goal of pipelining in software is the same as in hardware: to keep execution resources busy by overlapping the latency of different phases of computation. This is difficult to express in C++ due to its conservative requirements around memory consistency.

The example program in Figure 8 shows that operation-level parallelism is prevented by synchronization semantics. Software pipelining depends on having the opportunity to execute independent work while synchronization is resolving at the boundaries of stages.

The first programming-model innovation borrows from asynchronous programming: separate the arrival and waiting steps, as in phasers.4 Early in development, NVIDIA recognized this innovation would benefit programmers beyond CUDA, so NVIDIA offered both specifications and implementations to the community; it is now part of ISO C++20,5 and is available in LLVM (libcxx) today.

The second innovation extends this foundation with asynchronous data movement capabilities. The example program in Figure 9 combines both asynchronous barrier and data movement operations, resulting in a remarkably precise expression of programming intent. By leveraging the relaxed semantics of asynchronous operations, there is a net reduction in the difficulty of compiling the program, and performance is both higher and more predictable.
CONCLUSION
NVIDIA's A100 GPU is the largest and most advanced GPU developed by NVIDIA, and builds on the groundwork laid by previous generations of NVIDIA GPUs. It is the result of the work of thousands of engineers who worked together from transistors to standards and everything in between, such as system integration and programming-model design. A100 provides unprecedented acceleration at every scale, and adds powerful new features which deliver dramatically faster performance for HPC, AI, and data analytics workloads.

REFERENCES
1. J. Choquette, O. Giroux, and D. Foley, "Volta: Performance and programmability," IEEE Micro, vol. 38, no. 2, pp. 42–52, Mar./Apr. 2018.
2. P. Mattson et al., "MLPerf training benchmark," 2019, arXiv:1910.01500.
3. A. Ishii et al., "NVSwitch and DGX-2: NVLink-switching chip and scale-up compute server," in Proc. Hot Chips, 2018.
4. J. Shirako, D. M. Peixotto, V. Sarkar, and W. N. Scherer, "Phasers: A unified deadlock-free construct for collective and point-to-point synchronization," in Proc. 22nd Annu. Int. Conf. Supercomput., 2008, pp. 277–288.
5. International Standard ISO/IEC 14882:2020 – Programming Language C++.

JACK CHOQUETTE is a Senior Distinguished Engineer with NVIDIA, where he has led the architecture development of NVIDIA's GPGPU streaming multiprocessors for multiple generations. He has been leading CPU and system designs for over 25 years. Choquette received an M.S. in computer engineering from the University of Illinois Urbana-Champaign, Champaign, IL, USA. Contact him at jchoquette@nvidia.com.

WISHWESH GANDHI is a Senior Director of architecture at NVIDIA, Singapore. He has led the architecture development of the GPU memory system for multiple generations. He has been working with integrated and discrete GPU memory architecture for more than 20 years. Contact him at wgandhi@nvidia.com.

OLIVIER GIROUX is a Distinguished Architect at NVIDIA, and the ISO C++ Concurrency and Parallelism Chair. He has worked on ten GPU and six SM architecture generations, with a focus on clarifying the programming model of GPU threads. Giroux received an M.S. in computer science from McGill University, Montreal, QC, Canada. Contact him at ogiroux@nvidia.com.

NICK STAM is a Senior Technical Marketing Director with NVIDIA. His team provides tech support to press, and also produces our GPU white papers. Before NVIDIA, he worked at PC Magazine USA, and cofounded the ExtremeTech website. Stam received an M.S. in computer science from SUNY Binghamton, NY, USA. Contact him at nstam@nvidia.com.

RONNY KRASHINSKY is a Distinguished Engineer with NVIDIA, where he has architected GPUs for 11 years. He began his NVIDIA career in Research, and later joined the Streaming Multiprocessor team. He now focuses on deep-learning compute architecture. Krashinsky received a Ph.D. in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. Contact him at rkrashinsky@nvidia.com.
Domains such as data analytics, machine learning, and scientific computing depend on increasing compute resources.3 Increasing technology-node densities result in systems that are mainly limited by thermal design power, and the most feasible way to increase the number of active compute units is to design more energy-efficient architectures. While many emerging architectures,4 especially in the machine learning domain, trade off floating-point (FP) precision for higher throughput and efficiency, algorithms such as stencils and linear differential equations require higher precision arithmetic. Domain-specific accelerators are a prominent example of how to leverage specialization.5 Unfortunately, they are hard to adjust to algorithmic changes and are tied to a specific application domain.

The trend in leading-edge general-purpose computer architectures paints a similar picture on the importance of increasing energy efficiency. Two prominent examples of recent high-performance architectures are Fujitsu's A64FX6 and NVIDIA's A100.7 Both systems strive to control their 32-lane (A64FX) and 16-lane (A100) multilane single-precision (SP) data paths with as few instructions as possible.

0272-1732 © 2020 IEEE
Digital Object Identifier 10.1109/MM.2020.3045564
Date of publication 17 December 2020; date of current version 26 March 2021.

With the proposed Manticore system, we pursue a similar goal. We achieve this goal by pairing a simple, in-order, 32-bit RISC-V integer core with a large floating-point unit (FPU). Two instruction set architecture (ISA) extensions, SSRs and floating-point repetition (FREP), make it possible for the single-issue integer
die serial links.8

FIGURE 3. …clusters form a quadrant and share an uplink into the next stage. Four S1 quadrants form an S2 quadrant, which shares an uplink into the next stage. Two S2 quadrants form an S3 quadrant. Four S3 quadrants per chiplet share access to the HBM memory.

core to saturate the bandwidth of its FPU, achieving utilization higher than 90% for compute-bound kernels.
Memory Hierarchy
Each quadrant (see Figure 3) is further subdivided into multiple stages, in a tree structure using an interconnect tuned for burst-based direct memory access (DMA) transfers. Four clusters share an instruction cache and an uplink into the next stage. These four clusters have a high aggregate bandwidth of 64 TB/s among each other and can perform low-latency, high-bandwidth intracluster data transfers. As shown in Figure 3, clusters share
FIGURE 4. Simplified block diagram of a Snitch-based compute cluster. The core complex (CC) contains the integer core and the FPU as well as the necessary hardware for the SSRs and FREP. The cluster contains eight core complexes, which share an instruction cache and a tightly coupled data memory. A DMA engine is used for efficient bulk data movement.

FIGURE 5. Effect of SSRs and FREP on the hot loop of a dot-product kernel. (a) Left: baseline simplified RISC-V implementation, with address calculation and pointer increment omitted for brevity. Right: SSR implementation with memory loads encoded as reads from stream registers; additional stream configuration instructions are required ahead of the loop. (b) Left: implementation with loop bookkeeping using baseline RISC-V instructions. Right: implementation with an FREP hardware loop, with all bookkeeping occurring implicitly in hardware.
the uplink into the next higher stage, the bandwidth to the other S1 quadrants becomes smaller. Bandwidth is subsequently thinned as four S1 quadrants share an instruction cache and an uplink into the S2 quadrant, and two S2 quadrants share an uplink into the S3 quadrant. In the last stage of the hierarchy, 16 S3 quadrants, distributed over four chiplets (nonuniform memory access), share four HBMs with an aggregated peak bandwidth of 1 TB/s. This bandwidth-thinning scheme allows us to have a very low diameter, low-latency interconnect topology, which can sustainably saturate the HBM bandwidth while being benign to floorplanning and physical design. The interconnect also allows for a very high cluster-to-cluster internal bandwidth, through multiple stages, which by far exceeds the bandwidth into the memory. With this model, we efficiently support cluster-to-cluster traffic while, at the same time, fully loading the memory system.

Compute Cluster
The compute cluster consists of eight small, 22-kGE, single-stage, 32-bit RISC-V processor cores1 (see Figure 4). Each Snitch core contains a double-precision (DP) FPU, which can be used to compute one DP fused multiply–add (FMA) operation or two SP FMAs per cycle. When running at 1 GHz, a cluster with eight Snitch cores is able to compute 16 DP or 32 SP flop, resulting in 4 TDPflop/s for the entire Manticore system. All eight cores have elementwise, low-latency access into a 128-KiB tightly coupled and shared scratchpad memory. Moreover, a DMA engine is in charge of moving blocks of data into the scratchpad memory over a 512-bit data bus. The cores are clocked at 1 GHz, thus delivering more than 4 TDPflop/s of peak compute per chiplet.

With this architecture, we achieve a very high compute/control ratio: 44% of the system consists of compute units, another 44% is spent on the L1 memory, and just 12% of the area is spent on the control parts.

PROGRAMMING
We leverage two custom RISC-V ISA extensions to achieve extremely high FP utilization and efficiency: Xssr and Xfrep.

Stream Semantic Registers (Xssr)
SSRs2 offer a means to elide explicit load/store instructions in a program. This is achieved by giving a subset of the processor core's registers stream semantics. When enabled, a read from such an SSR is translated in hardware into a load from memory, and conversely, a register write becomes a store to memory. Since an in-order single-issue core can only execute a single instruction every cycle, the presence of loads and stores in a hot loop of the program diminishes FPU utilization significantly. For example, consider a dot product, which has to issue two loads from memory for every FMA operation, as shown in Figure 5(a). In this scenario, even if the loop is fully unrolled, we achieve at most 33% FPU utilization. In theory, eliding these loads allows the FPU to be 100% utilized, and even a simple processor can achieve >90% utilization in many relevant kernels without resorting to complex and energy-inefficient wide-issue superscalar or very long instruction word (VLIW) architectures.2 SSRs offer a way to elide memory
FIGURE 6. Typical execution of a matrix-vector multiplication implementation leveraging the SSR and FREP extensions. The 16 instructions are fetched and decoded once by the integer pipeline of the processor core (b) and expanded to 204 executed instructions in the FPU (c). (a) Reference C implementation with a square matrix A of fixed size 48. (b) Resulting assembly implementation as stored in the binary and fetched/decoded by the processor core. (c) Execution traces of the integer pipeline (left) and the FP pipeline (right).
accesses and address computation in hot loops, which in many cases leaves no integer instructions in the loop body.

Floating-Point Repetition (Xfrep)
The FREP1 extension implements an FPU-only hardware loop. Consider a dot product utilizing SSRs, for example, as shown in Figure 5(b). Besides the essential FMA operation running on the FPU, the loop only consists of a trip count increment (addi) and a back branch (bne). This loop can be replaced by an FREP instruction, which loops a range of subsequent FP instructions (one in this case) for a configurable number of times. The RISC-V ISA makes the integration of such an extension very straightforward, as most instructions operate either entirely on integer or entirely on FP registers. Only a handful, such as comparisons or moves between the integer and FP domains, exchange information from one domain to the other. We leverage this separation and insert a microloop sequence buffer of 16 instructions between the Snitch core and the FPU. FREP instructions configure this buffer to emit a range of buffered instructions multiple times into the FPU, which essentially implements the hardware loop. Since this happens entirely in the FPU subsystem outside of the Snitch core, the core's integer pipeline can run in parallel, enabling a pseudo-dual-issue mode of operation that would not be achievable with a traditional hardware loop. This allows the core to perform nontrivial bookkeeping and address calculation while the FPU is running, without incurring a reduction of the FPU utilization.

Typical SSR/FREP Execution
As a concrete example, let us consider the matrix-vector multiplication operation shown in Figure 6(a). A typical implementation leveraging Manticore's SSR and FREP extensions is shown in Figure 6(b). The address computation and memory accesses of A and x are entirely performed by the SSRs ft0 and ft1. The inner loop is implemented using an FREP instruction and unrolled to compute four results in parallel in order to avoid pipeline stalls due to FPU latency. The outer loop is executed by the integer core. It stores the results (fsd), implements the loop bookkeeping (addi, bltu), and initializes a (fmv.d).

As shown in Figure 6(c), the 16 instructions of the assembly implementation are fetched and decoded once by the integer pipeline of the processor core and expand to 204 executed instructions in the FPU through the use of FREP. This leaves 188 cycles of the integer pipeline free for other tasks, such as preparing the next loop iteration or coordinating data movement. In case no other work is required, the 16 instructions fetched over 204 cycles of execution amount to roughly one instruction every 13 cycles, mitigating the von Neumann bottleneck by reducing instruction fetch bandwidth by more than one order of magnitude. Since the FPU can execute the loop iterations back-to-back, and 192 of the 204 instructions perform actual computation, this kernel can achieve up to 94% FPU utilization.

Compilers can leverage these new instructions through scalar evolution and loop analysis to detect loops with the appropriate structure, and matching
FIGURE 7. Floorplan of the prototype silicon. The two Ariane cores as well as the Snitch clusters have been designed hierarchically. The cores follow a star-shaped layout around the shared instruction cache.
PROTOTYPE
A 3 × 3-mm² prototype containing the logic core of the chiplet architecture was manufactured and characterized in GlobalFoundries 22FDX technology. The prototype in Figure 7 contains three Snitch clusters with eight cores each (configured with an 8-KiB L1 instruction cache and a 128-KiB L1 data memory organized in 32 banks), a dual-core Ariane (with a 16-KiB L1 instruction cache and a 32-KiB data cache), 1.25 MiB of L2 memory, and a 400-MHz, double-data-rate, 2.56-GB/s, digital-only chip-to-chip link.
achieves this despite these chips having a substantial technology advantage due to their 7-, 12-, and 14-nm FinFET processes. Regarding the A100 GPU, our initial estimates based on data published by Nvidia7 suggest that it achieves a 25% improvement on SP and DP over the V100 in terms of speed at similar power consumption. This indicates that Manticore has just 25% lower efficiency on SP than A100, but outperforms it on DP by 5×, despite the A100's significant 7-nm FinFET technology advantage. Manticore delivers significantly higher peak FP performance than comparable RISC-V architectures11 in 16 nm.

OVERALL, WE OBSERVE THAT THE MANTICORE ARCHITECTURE IS VERY EFFICIENT AT TRACKING THE PERFORMANCE AND BANDWIDTH ROOFLINE, WITH A DETACHMENT DOWN TO 5% FOR LOW-INTENSITY AND 14% FOR HIGH-INTENSITY OPTIMIZED KERNELS.
ACKNOWLEDGMENTS
This work was supported by the European Union's H2020 program under Grant 826647 (European Processor Initiative, EPI) and Grant 732631 (Open Transprecision Computing, "OPRECOMP").

REFERENCES
1. F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, "Snitch: A tiny pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads," IEEE Trans. Comput., to be published.
2. F. Schuiki, F. Zaruba, T. Hoefler, and L. Benini, "Stream semantic registers: A lightweight RISC-V ISA extension achieving full compute utilization in single-issue cores," IEEE Trans. Comput., to be published.
3. "AI and compute," 2020. Accessed: Oct. 5, 2020. [Online]. Available: https://openai.com/blog/ai-and-compute/
4. N. P. Jouppi et al., "A domain-specific supercomputer for training deep neural networks," Commun. ACM, 2020.
5. A. Yang, "Deep learning training at scale: Spring Crest deep learning accelerator," in Proc. Symp. High Performance Chips, vol. 31, 2019.
6. T. Yoshida, "Fujitsu high performance CPU for the Post-K computer," in Proc. Symp. High Performance Chips, vol. 30, 2018.
7. Nvidia, "NVIDIA Ampere GA102 GPU architecture - The ultimate play," 2020.
8. P. Vivet et al., "A 220GOPS 96-core processor with 6 chiplets 3D-stacked on an active interposer offering 0.6ns/mm latency, 3TBit/s/mm2 inter-chiplet interconnects and 156mW/mm2@82% peak-efficiency DC-DC converters," in Proc. IEEE Int. Solid-State Circuits Conf., 2020.
9. F. Zaruba and L. Benini, "The cost of application-class processing: Energy and performance analysis of a Linux-ready 1.7-GHz 64-bit RISC-V core in 22-nm FDSOI technology," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 11, pp. 2629–2640, Nov. 2019.
10. R. Christy et al., "A 3GHz Arm Neoverse N1 CPU in 7nm FinFET for infrastructure applications," in Proc. IEEE Int. Solid-State Circuits Conf., 2020.
11. S. Davidson et al., "The Celerity open-source 511-core RISC-V tiered accelerator fabric: Fast architectures and design methodologies for fast chips," IEEE Micro, vol. 38, no. 2, pp. 30–41, Mar./Apr. 2018.

FLORIAN ZARUBA received a B.Sc. from TU Wien, Vienna, Austria, in 2014 and an M.Sc. in 2017 from the Swiss Federal Institute of Technology Zürich, Zürich, Switzerland, where he is currently working toward a Ph.D. with the Digital Circuits and Systems group of Luca Benini. Contact him at zarubaf@iis.ee.ethz.ch.

FABIAN SCHUIKI received a B.Sc. and an M.Sc. in electrical engineering in 2014 and 2016, respectively, from ETH Zürich, Zürich, Switzerland, where he is currently working toward a Ph.D. with the Digital Circuits and Systems group of Luca Benini. Contact him at fschuiki@iis.ee.ethz.ch.

LUCA BENINI holds the Chair of Digital Circuits and Systems at ETH Zürich and is a Full Professor with the Università di Bologna. He is a Fellow of the ACM and a member of the Academia Europaea. Contact him at lbenini@iis.ee.ethz.ch.
Data center growth in scale, bandwidth, application diversity, and security requirements has stretched traditional networking and storage IO solutions beyond their design targets.2 Server connectivity speeds have moved to 100 Gb/s and beyond, while the number of interconnected servers in a data center surges. Virtualization, storage disaggregation, and service mesh architectures are driving the amount of east–west (intra-data center) traffic up, along with the need for more sophisticated network segmentation, overlays, and telemetry. Security requirements, including encryption for data-in-flight and data-at-rest, and stateful firewalls with connection tracking, further increase the complexity and computational load of data center networking services.

To address these challenges, the Pensando distributed services architecture uses P41 domain-specific processing coupled with general-purpose CPUs to deliver networking, security, and storage services in a scalable and flexible solution. Distributed services cards (DSCs) are deployed at the edge of compute and storage resources in the data center and managed by a central policy services manager (PSM). This article will focus on the architecture and performance of the first two generations of Pensando ASICs, implemented in 16- and 7-nm processes.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3058560
Date of publication 10 February 2021; date of current version 26 March 2021.

ARCHITECTURE
The distributed services ASIC connects directly to the data center Ethernet network near the compute edge. It can operate as a standalone, bump-in-the-wire device or as an endpoint controller, connecting servers or endpoint devices to the data center network via PCIe ports. Inside the ASIC, network traffic flows through multiple P4 pipelines interconnected via a central packet buffer. P4 programs process packets in-line and may divert packets or flows to on-chip offload engines and CPUs, as well as to PCIe-attached hosts or target devices. In the ASIC block diagram (Figure 1), the blue blocks (lower half) primarily handle packet-based traffic and services, while the green blocks (upper half) are primarily memory-transaction based, providing offloads and data-stream processing to memory buffers.

Life of a Packet
Packets received through the network enter on one of the Ethernet MAC interfaces and are placed in the packet buffer according to their L2 or L3 class of service (COS). The packet buffer provides pause-absorption buffering and COS-aware arbitration for the P4 ingress pipeline. Consistent with the P4 model, packets pass through the P4 ingress pipeline for flow classification, firewall, tunnel endpoint processing, and other ingress services; return to the packet buffer in case of
64-bit MPU Instructions
Internally, the MPU instruction fetch, branch, and load/store blocks are similar to existing CPU designs. The MPU differs from most CPU designs in that its instruction set architecture is based on 8-byte-wide instructions, allowing it to process networking domain jobs with higher efficiency due to the richer encoding available from wider instructions. For example, the MPU ALU operand load aligner is based on bit fields instead of byte fields, allowing any header field pair of arbitrary length and offset to be loaded with a single instruction, whereas a traditional CPU would need to mask and shift header fields for alignment. MPU ALU instructions also accept header fields directly as ALU inputs, allowing ADD, XOR, COMPARE, and similar instructions to specify operands directly from header fields, whereas a traditional CPU must load general purpose registers before operating with its ALU. For updating the PHV, the MPU has a PHV Write unit to process custom instructions capable of updating or appending to the PHV. Again, one or more header fields can be transferred directly from table data to the PHV using a single instruction, with no alignment restrictions on source or destination fields.

Hardware Queue Management
Packets which enter the P4 pipeline from the wire or which originate from internal events are placed in hardware queues. Hardware queues are used to manage interfaces, flows, connections, and any other objects which require ordered tracking and scheduling. Software can configure up to 16 million hardware queues (Figure 5). Each queue stores its current state in a DRAM-based qstate record, which includes a count of enqueued objects, pointers to arrays or linked lists of enqueued objects, connection state and peer information, and a process ID used for memory protection.

Events which enqueue new objects are signaled with a doorbell mechanism, allowing one or more objects to be added to any queue with a single doorbell action. Events which require processing in the future, such as TCP timers or activity checks, set hardware time markers for future scheduling. The queue scheduler organizes all 16 million queues into a DRAM array, tracking scheduling groups, interface groups, and COS groups. Priority scheduling, min/max data rates, and deficit weighted round robin scheduling are available across scheduling groups. When a queue is scheduled, a PHV token with the scheduled queue ID and COS is inserted into the assigned P4 pipeline for processing. The TE then accesses the qstate array, launching programs depending on queue type and state, which in turn fetch queued objects and descriptors as entries of tables in the P4 program.

Central Packet Buffer
The multiple independent P4 pipelines are interconnected via a central packet buffer. The packet buffer design is a shared memory switch, with virtual output queueing for multiple classes of service and enhanced transmission selection5 scheduling at the outputs. Multicast replication, SPAN (switch port analyzer) queues, and network pause buffering are provided by the central packet buffer. In addition, an ingress packet burst overflow feature allows short-term bursts to be written to overflow regions in DRAM memory. The central packet buffer operates in the packet domain, whereas the system-on-chip (SOC) operates in the memory transaction domain. The P4 DMA engines bridge these domains, converting packets to memory transactions.
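The mask-and-shift contrast described above can be sketched in Python. Here `load_bits` is an illustrative stand-in for what the MPU's bit-field operand load aligner accomplishes in a single instruction; the header word and field offsets below are hypothetical, not an actual Pensando encoding.

```python
def load_bits(header: int, bit_offset: int, width: int, total_bits: int) -> int:
    """Extract `width` bits starting `bit_offset` bits from the MSB end.

    A traditional CPU issues the shift and mask as separate instructions
    after loading the word into a general-purpose register; the MPU does
    the equivalent as part of one instruction, for any offset and width.
    """
    shift = total_bits - bit_offset - width
    mask = (1 << width) - 1
    return (header >> shift) & mask

# Example: a 32-bit word holding a hypothetical 12-bit field at bit offset 4.
word = 0x0ABC0000
field = load_bits(word, 4, 12, 32)   # 0xABC
```

Because the extraction is parameterized by bit offset and width rather than byte boundaries, any header field pair of arbitrary length and alignment can be fed directly to the ALU, which is the property the text attributes to the MPU.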
SOC, NOC, ARM Processors
A full SOC subsystem with multicore ARM A-72 CPUs is connected to a hardware coherent, shared cache and memory hierarchy. Processors, memories, P4 pipelines, offload engines, and external devices communicate through a high-speed network-on-chip (NOC). The P4 pipelines are tightly coupled to the ARM processors, supporting direct delivery of packets, headers, and/or metadata to the ARM L2 caches. This tight coupling allows packet and flow processing to be passed back and forth between P4 pipelines, offload engines, and ARM processing services with low overhead. Tasks which are well suited to P4 domain-specific processing are performed in the P4 pipelines and offload engines, including packet parsing, encapsulation, cryptography, compression, classification, stateful firewalls, telemetry, and data stream segmentation and reassembly. Higher level tasks better suited to general purpose processing, such as exception handling, application level processing, and storage volume management, are performed by the ARM cores or host CPUs. P4 programming determines which portions of data buffers, header buffers, and completion information to deliver to the DRAM, SOC last level cache, or ARM L2 on a per-application, per-packet basis. For chained offload operations, which may include encryption, compression, and data integrity operations mixed in with P4 control functions, a 4-MB chaining buffer is attached directly to the NOC to support high bandwidth, multihop offload chaining.

Root of Trust and Cryptography
Pensando ASICs provide a root of trust based on a physical unclonable function and an ARM TrustZone core running from protected memory. Boot loaders and operating systems are authenticated before execution, providing a secure root of trust that is not dependent on or vulnerable to attached hosts. Within the data center, central controller software securely communicates with trusted agents running on the ARMs at each DSC to enforce data-center-wide policies.

Secure protocols including IPsec and TLS proxy can be initialized by ARM processes; P4 programming then controls packet-by-packet flow processing using keys stored in secure on-chip memory. Cryptography operations are performed by internal hardware offload engines under the direction of P4-created descriptor lists. Cryptography hardware includes a hardware entropy source driving a NIST SP 800-90-compliant deterministic random bit generator, and public key exchange hardware to assist with connection setup. Cryptography offloads for AES-GCM, AES-XTS, AES-CBC, AES-CCM, ChaChaPoly, HMAC, SHA3-512, and SHA2-256 are available to provide high-performance security processing.

Storage Offloads
Storage virtualization and management is provided with a combination of P4 programs and ARM code. Individual packet and transaction processing is handled by P4, while ARM programs handle connection establishment and volume management. P4 device driver programs present standard NVMe device interfaces to the host and DMA descriptors and data to the ASIC for offload processing or transport. Storage offload engines are available to P4 and ARM processors, and can be mapped to external hosts as a PCIe-based accelerator endpoint. Dedicated storage offload engines include compression and decompression at 100 Gb/s using the LZRW1A, GZIP, or Deflate algorithms. Reed–Solomon erasure coding engines with up to 12 data and 4 parity blocks operate at 100 Gb/s. Data integrity engines operate at 200 Gb/s for multiple algorithms, including CRC64, M-Adler32, CRC32C, and Adler-32. Deduplication support is based on SHA2 and SHA3 engines.

PCIe Ports and Virtualization
PCIe virtualization hardware allows software running on the SOC to configure the number and type of devices presented to the host, including configuration space, BAR (base address register) ranges, and virtual function counts. This allows multiple network, storage, and host-visible offload devices to be presented and attached to different processes or guest OS instances. PCIe lanes can be bifurcated into multiple ports; each port can operate as a virtual endpoint switch or a root complex. If multiple virtual switch ports are configured, multiple hosts can attach over separate PCIe lanes and share networking services as if they were local to each host. Internal ASIC COS buffering resources enforce fairness between hosts, preventing noisy neighbors from impacting adjacent hosts. Alternatively, if a PCIe port is configured as a root complex, storage devices such as NVMe drives are controlled by the SOC and virtualized to local and remote hosts. Other PCIe devices, including GPUs and machine learning accelerators, can also be controlled by the ASIC.

P4-16 COMPILER
Pensando has developed a compiler which accepts P4-16 imperative code and generates parser sequence commands, TE key generation, and table match
parameters, as well as MPU action instruction sequences. In order to quantitatively measure the compiler's capability of generating optimal code for the P4 pipelines, we compared the performance of two implementations of the same network functionality: termination of a VXLAN overlay with flow-based packet routing. A production-grade implementation consists of hand-written MPU assembly code. The compiler implementation is coded in P4. Several performance tests were run with various traffic patterns presented to the ASIC using both implementations, and the number of forwarded packets per second was measured. Overall, the P4 implementation achieves 85% of the performance of the hand-written assembly implementation. This demonstrates that the domain-specific compiler is an effective and efficient tool to turn P4 programs into machine code for Pensando's packet forwarding pipelines.

As generations of the Pensando domain-specific architecture progress, hardware features are added to improve capabilities, performance, and efficiency. These improvements can be in the MPU instruction set, conditions causing pipeline hazards or stalls, programmable parser control, hash table options, or other areas. The P4 compiler abstracts these implementation-related changes away, allowing imperative, user-created P4 code to take advantage of new ASIC generations without source code changes.

ASIC IMPLEMENTATIONS
Pensando has completed the design of two generations of ASIC implementations and is currently architecting a third generation. All ASICs are software compatible and features are forward compatible. Table 1 summarizes implementation details of the first ASIC generation, codenamed Capri, and the second ASIC generation, codenamed Elba. Architectural fidelity is preserved across generations, so many improvements are focused on increased scale and performance. In particular, the move from HBM memory in Capri to DDR memory in Elba was done to significantly increase the number and size of P4 tables supported, while the reduced memory bandwidth was compensated for with larger caches and more latency tolerance built into the P4 pipelines.

PERFORMANCE
The performance results in Table 2 apply to a "bump in the wire" full service forwarding application, which includes a stateful firewall per connection, telemetry collection on every flow, and connection establishment/teardown managed by ARM CPUs while datapath operations are handled in the P4 pipeline. Latency and jitter were measured on the wire by lab testing devices.

Comparing performance with available solutions is difficult, as few publish results with identical service features, but solutions offering a subset of these features show a fraction of the delivered MPPS6 as compared to the DSC. Connection-per-second establishment and tracking performance of the DSC is one order of magnitude higher than available solutions7 and similar to large, multiport service appliances.8

CONCLUSION/ACKNOWLEDGMENTS
Modern data centers are facing rapidly evolving needs in security, networking, telemetry, storage, and scale which are not addressed by current equipment. Pensando has developed a distributed services architecture to address these data center needs while providing user programmability based on a domain-specific, open source, permissively licensed P4 language. The first two generations of this architecture are realized in ASICs fabricated in 16- and 7-nm processes.

… infrastructure, testing, and support which make these products possible. We look forward to delivering future generations of this architecture to meet the evolving data center scale, security, and innovation challenges.
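As a rough illustration of the match-action model that the P4 pipelines discussed in this article implement, the following Python sketch builds a lookup key from parsed header fields, matches it against a flow table, and applies the bound action. The key fields, actions, and table contents are hypothetical, not Pensando's actual P4 programs.

```python
def make_flow_key(pkt):
    """Build a lookup key from parsed header fields (illustrative fields)."""
    return (pkt["src_ip"], pkt["dst_ip"], pkt["proto"])

def forward(pkt, port):
    """Action: send the packet out a given egress port."""
    pkt["egress_port"] = port
    return pkt

def drop(pkt):
    """Default action when no table entry matches."""
    pkt["egress_port"] = None
    return pkt

def apply_stage(pkt, table):
    """One match-action stage: look up the key, apply the bound action."""
    action, params = table.get(make_flow_key(pkt), (drop, {}))
    return action(pkt, **params)

# A hypothetical flow table installed by the control plane.
flow_table = {
    ("10.0.0.1", "10.0.0.2", "tcp"): (forward, {"port": 3}),
}

pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": "tcp"}
out = apply_stage(pkt, flow_table)
```

In the real hardware, the table lookup is performed by the table engine and the action body runs as an MPU program; chaining several such stages gives the ingress pipeline's classification, firewall, and tunnel-termination steps.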
Compute demand of AI is skyrocketing at a rate that far outpaces the compute density improvements that can be gained by Moore's Law alone3,4 and approaches based on monolithic shared memory models. We have chosen to attack this challenge via two approaches: dynamic computation and massive scaleout. We design dynamic computation to enable a wide range of techniques that intelligently "forgo unnecessary computation" or "compute only what is relevant to input," akin to what our brains do.

Large clusters are already the norm for training of AI models, while inference for some large models also requires multi-device execution. The shared memory paradigm cannot enable the required scale, which necessitates a paradigm shift to a multicore private-memory model, a foundation of our scaleout architecture. On top of this, we build a push-based data movement in which data transfers are explicitly planned and controlled.

These two approaches can be synergistically combined to take the current steep slope of increasing AI compute and storage requirements and reduce it down to something much more compatible with Moore's Law.

In the "Hardware" section, we present the details of our hardware, with primary focus on the Grayskull device. The "Software" section presents our software stack. The "Full Stack Performance Optimizations" section deep dives into several full stack performance optimizations enabled by our hardware and software. The "Dynamic Execution" section summarizes various dynamic execution approaches enabled by our architecture. Performance results are presented in the "Results" section. Finally, the "Conclusion" section concludes this article.

HARDWARE
Devices
Over the last four years, Tenstorrent designed three chips, shown in Figure 2 and summarized in Table 1. Jawbridge is a small 14-nm test chip, containing six first-generation Tensix cores.

Grayskull, shown in Figure 1, is our first production chip in 12-nm technology, and is currently in evaluation with multiple customers. It is the first incarnation of our large cluster-on-a-chip multicore architecture and is composed of 120 compute cores. The physical area of the 10×12 grid of cores is 477 mm². Each core operates independently; it has its own unique instruction queue and progresses at its own pace, in contrast to monolithic chip-scale SIMD, VLIW or single-kernel

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. Digital Object Identifier 10.1109/MM.2021.3061912. Date of publication 9 March 2021; date of current version 26 March 2021.
FIGURE 2. Tenstorrent devices. (a) Jawbridge (2019). (b) Grayskull (2020). (c) Wormhole (2021).
384-GB/s read/write bandwidth. The memory space is primarily used by the local core, but it is directly accessible by remote cores as well.

Packet Compute Engine
The packet compute is a SIMD-based matrix and vector engine with a high degree of flexibility and programmability. Peak compute density is 3 TOPs at 8-bit precision, or 0.75 TFLOPs at 16-bit floating-point precision. The packet compute engine is software-programmable via the associated RISC cores, which execute kernels written in standard C language and issue matrix and vector instructions to the engine. A large number of PyTorch and TensorFlow deep learning instructions are supported. The matrix and vector engine supports operations on a range of integer and floating-point formats. Most importantly, it natively handles sparse computation to achieve speedup, reduce power, and shrink memory footprint.

The packet compute engine does not have a global view of execution across the multicore system. Its operation is driven primarily by the packet manager: incoming packets from the packet manager are computed and returned to the packet manager for storage or data transfer.

Packet Manager Engine
The packet manager, depicted in Figure 3, is composed of a data transfer engine, a router, and a tensor manipulation engine.

The data transfer engine is responsible for executing all data movement and synchronization among the compute engines, as well as between on-chip SRAM, off-chip memory, and I/O. The packet manager and compute engine each receive their own unique instruction queues from the compiler, and they execute concurrently. The packet manager completely de-burdens the compute engine from the complexity of data movement and multicore synchronization. These features of the packet manager realize the push-based transfer model, which maximizes the overlap between compute and data transfers.

The router moves packets across the NoC. It provides guaranteed ordering, manages flow control and backpressure, and has deadlock-free operation. It is also optimized for the way AI workloads are parallelized across our multicore architecture and has efficient multicast and gather capabilities.

Finally, the tensor manipulation engine can perform dynamic packet compression; the smaller memory footprint enabled by compression results in faster data transfers and an increase in data locality. Furthermore, tensor manipulation instructions can be executed by this engine, as described in the "Optimization of Tensor Manipulation Instructions" section.

SOFTWARE
The software is composed of three main pieces:

1. The machine learning framework integration and plugin software.
2. The runtime software, executing on the RISC processors.
3. The ahead-of-time graph compiler.

Framework Integration
The Tenstorrent compiler and runtime have been natively integrated into PyTorch, and support both inference and training flows. The user can target execution on a single device or a multidevice cluster. In either case, the hardware is visible to the user as a single device. The multidevice scheduling and parallelization are orchestrated behind the scenes by the software stack. In addition, the software stack can execute
ONNX networks as well, enabling a funnel from the frameworks that export into the ONNX format.

Graph Compiler
The graph compiler is composed of three main components: the front end, the optimizer, and the back end. The primary role of the front end is to lower a wide range of instructions to a smaller number of optimized instructions supported by the hardware. Instructions are parallelized and scheduled onto the device cores by the optimizer, which maximizes performance by balancing compute, data locality, and data movement. The back end translates the compiled graph down into instruction queues for each core.

The packet managers and the NoC connecting the cores are visible to the software, and the data movement and synchronization are both controlled explicitly by the compiler. To schedule the data movement, the compiler packetizes each tensor by splitting it into "mini-tensors," and each mini-tensor is combined with a packet header. Each packet header contains a unique packet ID, and all data is referenced via unique packet IDs. The header also contains routing information, enabling the packet manager to perform the desired data transfers between the cores across the NoC.

Runtime Software
The runtime software runs concurrently on RISC processors within every core. The compiled executable contains instruction queues for the packet processor and the packet manager of every core. The runtime software manages the queues, and dispatches instructions to the packet compute and the packet manager. Buffers containing packets are dynamically allocated and de-allocated during runtime. The runtime software works in tight collaboration with the packet manager to store packets into the allocated buffers. The runtime also controls the storage target, allowing buffers that do not fit into a core's local SRAM to spill to either remote SRAM or to an off-chip memory.

The architecture also supports various types of conditional execution, such as if-statements and for and while loops. The runtime software interprets the instruction queues generated for each core and can execute jumps to a specific instruction in the instruction queues to reflect control flow decisions.

FULL STACK PERFORMANCE OPTIMIZATIONS
Optimization of Data Transfers: The Push-Based Model
Traditional multicore devices operate on a pull-based data transfer model. For example, when a compute core is ready to begin computing, as a first step it issues a request to copy remotely stored data (from another core's cache, or from DRAM) into its local memory or cache. After the copy has been completed, the compute core starts computing. The read request latency, combined with a data transfer through a potentially congested NoC or memory port, could result in the consumer core being idle while waiting for its data to arrive.

In contrast, our architecture operates on a push-based data transfer model. A core that produces an output buffer is aware of the consumer core that needs to receive it. Instead of waiting for the consumer core to issue a remote read request, the producer core proactively copies the buffer to the consumer core. This approach minimizes the idle time of the consumer core.

The data transfer engine executes all the required flow control for the push-based data transfer model. It receives an instruction queue from the graph compiler containing information about the producer-consumer connectivity. The instructions enable the data transfer engine to execute data transfers using a number of multicore synchronization instructions, including exchange of data transfer status, such as data-ready or memory-space-ready.

Optimization of Tensor Manipulation Instructions
Instructions that make up deep neural networks fall into two main categories: 1) math instructions, and 2) tensor manipulation (TM) instructions. TM instructions do not modify the data inside the tensor, but simply reshuffle the tensor contents. Common TM instructions in NLP networks include reshape, transpose, flatten, and permute.

The TM reshuffling is performed on the intermediate activation data, and hence must be executed during runtime. One implementation approach is to issue and execute each TM instruction independently to hardware during runtime. This typically involves a specific read/write pattern from/to memory, where the patterns match the particular TM to be implemented. GPUs execute TMs using this approach, shown in Figure 5(c), which idles compute cores while performing potentially complex memory access.

In contrast, our architecture overlaps the execution of math instructions performed by the compute engine and the TM instructions performed by the tensor manipulation engine. This process is facilitated by the graph compiler. The execution trace in Figure 5(b) shows the overlapping of compute and TMs. The
MM_1 compute instruction and the Reshape and Transpose instructions execute concurrently, in a pipelined fashion.

The tensor manipulation engine is programmable: it receives its own unique instruction queue from the compiler. TM instructions are executed using a combination of two methods. First, the engine contains a small storage that it uses as a scratch pad to load a packet and reshuffle it in place. Second, it can execute complex memory read/write patterns. Using a combination of these approaches, any tensor manipulation instruction can be implemented inline as the packets are being streamed out of the packet compute engine and being written into local SRAM.

Flexible Scheduling and Parallelization
The Tenstorrent architecture unlocks a tremendous amount of concurrency. All building blocks receive their own unique instruction queues from the compiler and can progress at their own pace. As a result, the overlap between compute and data transfers is maximized.

However, any single parallelization approach eventually plateaus, hence the desire to support flexible parallelization approaches along all available dimensions for any given compute layer. Each individual deep learning operation can be parallelized across a variable number of cores, combining a number of parallelization approaches. In addition, operations can run in parallel, be pipelined, or run sequentially across the many cores of a device, as shown in Figure 6.

DYNAMIC EXECUTION
Dynamic execution is an umbrella term representing various approaches that reduce the computational complexity of a network at runtime. Some approaches can be represented within the topology of the network itself, such as Mixture-of-Experts (MoE), while others can be used to augment the network execution during runtime. Four approaches enabled by the Tenstorrent architecture are described next.

Block Sparsity
Tensors feeding into math operations within networks contain a variable amount of sparsity within them. Certain models have been tuned to take advantage of sparsity of trained parameters,2 or take advantage of block sparsity in model parameters.5 However, these approaches do not tap into a large potential of sparsity in activations, which can be inherent or induced at runtime.7 To fully utilize this potential, in addition to model parameter sparsity, our architecture supports block sparsity of activations, which enables quadratic gains from run-time activation sparsity.

Dynamic Precision
Similar to scientific computing applications, numerical precision can be traded off for an increase in performance and a reduction in power. The Tenstorrent architecture enables numerical precision to be set at a fine-grain level, per each packet in the neural network. The setting can be specified both ahead-of-time by the compiler, as well as during runtime.
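The "quadratic gains" from combining parameter and activation sparsity can be illustrated with a small sketch: a block-level multiply can be skipped whenever either the weight block or the activation block is all zero, so at density d in each operand only about d² of the block multiplies survive. The 0/1 block masks below are illustrative, not a real network's sparsity pattern.

```python
def blocks_computed(weight_blocks, act_blocks):
    """Count block multiplies that cannot be skipped.

    A block is skipped when either operand block is all zero
    (1 = nonzero block, 0 = all-zero block).
    """
    return sum(1 for w, a in zip(weight_blocks, act_blocks) if w and a)

# Illustrative 50% block density in each operand.
weights = [1, 0, 1, 0, 1, 0, 1, 0]
acts    = [1, 1, 0, 0, 1, 1, 0, 0]
done = blocks_computed(weights, acts)   # only 2 of 8 block multiplies remain
```

With 50% density in each operand, roughly 0.5 × 0.5 = 25% of the block-level work survives, which is the compounding effect the text describes for run-time activation sparsity on top of parameter sparsity.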
RESULTS
We measured a baseline result for BERT-base in BFLOAT16 precision at 2830 sequences/s. Significant speedup is achievable by applying two optimizations on top of this baseline: dynamic activation sparsity and use of 8-bit floating point. In our experiments, we observe that 75% sparsity in activations (induced dynamically at runtime) results in a 4x speedup on BERT layers. Similarly, we observe that using block-based 8-bit floating point precision provides a factor of two. The two optimizations can be combined synergistically: they both reduce the activation memory footprint linearly for a total of 8x reduction, and 8-bit floats reduce the model parameter memory footprint by 2x. This enables the majority of the model parameters to fit on chip, allowing the sparsified layers to be fed from local SRAM. Realizing this on an entire BERT-base is work in progress, and we project a score of 23 345 sequences/s.

CONCLUSION
We solve the private memory parallel computing problem and tensor manipulations in a way that removes communication, synchronization, and data shuffle bottlenecks and enables keeping all the compute units highly utilized.

REFERENCES
1. N. Shazeer et al., "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," in Proc. Int. Conf. Learn. Representations, 2017. [Online]. Available: https://arxiv.org/abs/1701.06538
2. V. Sanh et al., "Movement pruning: Adaptive sparsity by fine-tuning," 2020. [Online]. Available: https://arxiv.org/abs/2005.07683
3. OpenAI, "AI and compute," 2018. [Online]. Available: https://openai.com/blog/ai-and-compute/
4. Stanford HAI, "Artificial intelligence index report 2019," 2019. [Online]. Available: https://hai.stanford.edu/sites/default/files/ai_index_2019_report.pdf
5. T. Gale et al., "Sparse GPU kernels for deep learning," 2020. [Online]. Available: https://arxiv.org/pdf/2006.10901.pdf
6. "Fast block sparse matrices for PyTorch," 2020. [Online]. Available: https://github.com/huggingface/pytorch_block_sparse
7. Z. Chen et al., "You look twice: GaterNet for dynamic filter selection in CNNs," 2019. [Online]. Available: https://arxiv.org/pdf/1811.11205.pdf
8. A. Karpathy, "Software 2.0," 2017. [Online]. Available: https://medium.com/@karpathy/software-2-0-a64152b37c35
Five years ago, few would have predicted that a software company like Google would build its own chips. Nevertheless, Google has been deploying chips for machine learning (ML) training since 2017, powering key Google services. These tensor processing units (TPUs) are composed of chips, systems, and software, all codesigned in-house. This article details the circumstances that led to this outcome, the challenges and opportunities observed, the approach taken for the chips, a quick review of performance, and finally a retrospective on the results. A companion paper describes the supercomputers built from these chips, the compiler, and its performance in detail.6

FORCES PUSHING TOWARDS CUSTOM ML HARDWARE
In 2013, only one year after AlexNet swept the ImageNet competition,4 Google leaders predicted ML would soon be ready to attack difficult problems like production versions of image and speech recognition. Alas, these ML models were so computationally expensive that a sustainable service was nearly impossible, as Internet service costs scale by the number of users, who grow as the service improves. The motivation to slash ML inference serving costs led to the TPUv1 inference system, deployed successfully in 2015.5

TPUv1 exposed the next bottleneck: ML training. The team quickly pivoted into building the TPUv2 training system. Two years later, TPUv2 powered key Google services with fast and cost-effective training.

CHALLENGES AND OPPORTUNITIES OF BUILDING ML HARDWARE
ML training brings challenges relative to ML inference:

› More computation. More means both the types of computation—for example, backpropagation requires matrix transposition and loss functions—and the amount of computation. An inference calculation typically executes on the order of 10^9 floating point operations, but Google's production training applications require 10^19–10^22; more than 10 orders of magnitude larger!
› More memory. During training, temporary data are kept longer for use during backpropagation. With inference, there is no backpropagation, so data are more ephemeral.
› Wider operands. Inference can tolerate int8 numerics relatively easily, but training is more sensitive to dynamic range due to the accumulation of small gradients during weight update.
› More programmability. Much of training is experimentation, meaning unstable workload targets such as new model architectures or optimizers. The operating mode for handling a long tail of training workloads can be quite different from a heavily optimized inference system.
› Harder parallelization. For inference, one chip can hit most latency targets. Beyond that, chips can be scaled out for greater throughput. In contrast, exaflops-scale training runs need to produce a single, consistent set of parameters across the full system, which is easily bottlenecked by off-chip communication.

These problems felt daunting initially, plus we had constraints on time and staffing. Time matters because each day saved during development accelerates our production training pipeline a day. And as for staffing: while Google is teeming with engineers, they are not all available for our project. Ultimately, the TPU team had only a cup from Google's large engineering pool. Thus, we had to be ambitious to overcome the complexities of training, but the time and staffing budget set constraints.
0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3058217
Date of publication 9 February 2021; date of current version 26 March 2021.

To prioritize, we sorted activities into two buckets: those where we must do a great job, and those that we only have to do good enough. The first bucket included the following:
① Build quickly.
② Achieve high chip performance.
③ Scale efficiently to numerous chips.
④ Work for new workloads out of the box.
⑤ Be cost effective.

Everything else was in the second bucket. While tempting to brush these second-bucket issues aside as minor embarrassments, the reality is that building and delivering a good system is as much about what you decide not to do as what you decide to do. In retrospect, these decisions are not embarrassing after all!

We refer to the relevant goals using the circled numbers (e.g., ②) throughout the discussion to highlight "first bucket" design decisions.

OUR APPROACH TO ML HARDWARE

TPUv1 provided a familiar starting point for our training chip [see Figure 1(a)]. The high-bandwidth loop (red) identifies the important part of TPUv1: the core data and computation loop that crunches neural network layers quickly. DDR3 DRAM feeds the loop at much lower bandwidth with model parameters. The PCIe connection to the host CPU exchanges model inputs and outputs at even lower bandwidth.

Figure 1 shows five piecewise edits that turn TPUv1 into a training chip. First, splitting on-chip SRAM makes sense when buffering data between sequential fixed-function units, but that is bad for training, as on-chip memory requires more flexibility. The first edit merges these into a single vector memory [see Figure 1(b) and (c)]. For the activation pipeline, we moved away from the fixed-function datapath (containing pooling units or hard-coded activation functions) and built a more programmable vector unit [see Figure 1(d)] ④. The matrix multiply unit (MXU) attaches to the vector unit as a coprocessor [Figure 1(e)].

Loading the read-only parameters into the MXU for inference does not work for training. Training writes those parameters, and it needs significant buffer space for temporary per-step variables. Hence, DDR3 moves behind Vector Memory so that the pair form a memory hierarchy [also in Figure 1(e)]. Adopting in-package HBM DRAM instead of DDR3 upgrades bandwidth twentyfold, critical to utilizing the core ②.

Last is scale. These humongous computations are much bigger than any one chip. We connect the memory system to a custom interconnect fabric (ICI, for InterChip Interconnect) for multichip training [see Figure 1(f)] ③. And with that final edit, we have a training chip!

FIGURE 2. TPUv2 datapath in more detail, showing both cores, which appear as a single PCIe device.

Figure 2 provides a cleaner diagram, showing the two-core configuration. The TPUv2 core datapath is blue, HBM is green, host connectivity is purple, and the interconnect router and links are yellow. The TPUv2 Core contains the building blocks of linear algebra: scalar, vector, and matrix computation.

Why two cores? The simpler answer is that we could fit a second core into a reasonably sized chip ②⑤. But then why not build a single, bigger core? Wire latencies grow as the core gets bigger, and the two-core configuration hits the right balance between reasonable pipeline latency and additional per-chip computation capacity. We stopped at two bigger cores because they are easier to program, as they allow the developer to reason about one unified instruction stream operating on big chunks of data ④, rather than exposing lots of tiny cores that need to be lashed together. We took advantage of training fundamentally being a big-data problem.

The following sections dive deeper into key chip components.

… (The compiler's name is XLA, for accelerated linear algebra.6) In these discussions, we landed on two important themes. First, a VLIW architecture was the simplest way for the hardware to express instruction-level parallelism and allowed us to utilize known compiler techniques ①②. Second, we could ensure generality by architecting within the principled language of linear algebra ④. That meant focusing on the computational primitives for scalar, vector, and matrix data types.

Scalar Computation Unit

The scalar unit is where computation originates. It fetches complete VLIW bundles from a local instruction memory, executes the scalar operation slots locally, and then forwards decoded instructions on to the vector and matrix units, where execution happens later, decoupled from scalar execution. The VLIW bundle is 322 bits and is composed of two scalar slots, four vector slots (two used for vector load/store), two matrix slots (a push and a pop), one miscellaneous slot (a simple example would be a delay instruction), and six immediates.

Figure 3(a) shows a diagram of the scalar unit. At the top left is the instruction bundle memory. While an instruction cache backed by HBM would have been nice, a DMA target for software-managed instruction overlays was easier ①. It is not the flashiest solution, …
FIGURE 3. (a) TPUv2 core scalar unit and (b) single vector lane. The vector unit contains 128 vector lanes.
… the memory system, primarily to HBM. That feeds into local Scalar Memory SRAM to issue loads and stores against. They feed a 32-deep register file containing 32-bit words, which finally feeds into a dual-issue ALU at the top right. Execution interlock is managed by standard hold conditions on the instructions, while synchronization flags provide a way to interlock against software-managed DMA operations. The memory hierarchy is under the control of the compiler, which simplifies the hardware design while delivering high performance ①②⑤.

Vector Computation Unit

After scalar execution, the instruction bundle and up to three scalar register values are forwarded to the vector unit. Figure 3(b) shows a single vector lane; the entire vector unit contains 128 such vector lanes. Each lane contains an additional 8-way execution dimension called the sublane. Each sublane contains a dual-issue 32-bit ALU connected into a 32-deep register file. All together, the vector computation unit allows operation on eight sets of 128-wide vectors per clock cycle. Sublanes let TPUv2 increase the vector-versus-matrix compute ratio, which is useful for batch normalization.6

Each lane's register files perform loads and stores against the lane's local slice of vector memory, and that memory connects into the DMA system (primarily providing access to HBM).

On the right of Figure 3(b) is the connectivity into the matrix units. The push instruction slots send data vectors out to the matrix units. A result FIFO captures any returning result vectors from the matrix units, and these can be popped into vector memory using the pop instruction slots. The result FIFO lets us avoid strict execution-schedule constraints for the long-latency matrix operations and shorten register lifetimes, simplifying the compiler ①.

Matrix Computation Units

The matrix multiply unit is the computational heart of the TPU. It is a 128 × 128 systolic array of multipliers and adders, delivering 32,768 operations per cycle ②. The inputs are two matrices: a left-hand-side matrix and a right-hand-side matrix. The left-hand-side matrix streams over a preloaded right-hand-side matrix to create a streaming result matrix, which is sent directly to the vector unit's result FIFO. The right-hand-side matrix can be optionally transposed when loaded into the matrix unit.

Critically, the systolic array structure provides high computational density ②. While it performs most of the FLOPS per second, it is not the largest contributor to chip area (see Figure 4) ⑤.

Numerics shape computation density. Multiplications use bfloat16: this is a 16-bit format that has the same exponent range as float32 but with fewer bits of mantissa. The accumulation happens in full 32-bit floating point.

Bfloat16 works seamlessly for almost all ML training, while reducing hardware and energy costs ②⑤. Our estimate for the more recent 7 nm is that bfloat16 has a 1.5× energy advantage over the IEEE 16-bit float: 0.11 versus 0.16 pJ for add and 0.21 versus 0.31 pJ for multiply. Moreover, bfloat16 is easier for ML software to use than fp16, since developers need to perform loss scaling to get fp16 to work.9 Many are willing to do the extra programming to go faster, but some are not. For example, all but 1 of the 200 ML experts at the Vector Institute rely on the slower 32-bit floating point in GPUs.8 We know of no downsides for bfloat16 versus fp16 for ML. As it takes less area and energy and is easier for ML software to use ②④, bfloat16 is catching on. TPUv2 was the first, but ARM, Intel, and NVIDIA have subsequently embraced bfloat16.
Matrix multiplication is not the only important matrix transformation, so another set of units efficiently perform other matrix operations like transposes, row reductions, or column permutations.

TPUV2 MEMORY SYSTEM

As discussed earlier, the TPUv2 Core has a variety of SRAM-based scratchpad memories. These are architecturally visible to software and accessed using loads and stores. This approach gives the compiler predictable execution control over both computation and SRAM and simplifies the hardware ②④.

But the memory story cannot stop with SRAM, because most models would not fit. Figure 2 shows that high-bandwidth, in-package HBM backs up the SRAM. The compiler moves data between HBM and SRAM using asynchronous DMAs. Since HBM holds vectors and matrices, DMAs can stride through memory to allow slicing off subdimensions of larger dimensional structures to reduce the DMA descriptor overhead. When a DMA finishes, a completion notification lands in the core's sync flags, allowing the program to stall until data arrives.

Taking a step back, this approach provides a clean partitioning of concerns. The core (blue in Figure 2) provides predictable, tightly scheduled execution ②. The memory system (in green) handles asynchronous prefetch DMA execution from the larger HBM memory space ⑤. The hand-off between the regimes is managed with sync flags.

Ultimately, the primary goal of the memory system is to keep the datapath fed ②, so HBM's high bandwidth is critical. At 700 GB/s per chip, the HBM of TPUv2 provides 20× the bandwidth of the pair of DDR3 channels in TPUv1. This increase allows full computation utilization at much lower data reuse rates ④.

Zooming out to the chip-level memory system, Figure 2 shows each core is connected to half of the chip's HBM. The split HBM memory space might be a bit surprising, but the reason is simple: we wanted to build this chip quickly ①, this approach simplified the design, and it was good enough.

Each core also has a set of PCIe queues to exchange data with the host. Between the two cores is the interconnect router that also connects to the off-chip links.

TPUV2 INTERCONNECT

A dedicated interconnect is the foundation of the TPUv2 supercomputer ("pod"). TPUv1 was a single-chip system built as a coprocessor, which works for inference. Training Google production models would require months on a single chip. Hence, TPUv2s can connect into a supercomputer, with many chips working together to train a model.

The chip includes four off-chip links and two on-chip links connected with an on-chip router. These four links enable the 2-D torus system interconnect, which matches many common ML communication patterns, like all-reduce. The four external links deliver 500 Gb/s, and the two internal links are twice as fast. Some torus links are within the TPU tray, and the rest are through cables across trays and racks.

An important property of interconnect design is ease of use ④: DMAs to other chips look just like DMAs to local HBM, albeit with a push-only restriction for simplicity ①. This common DMA interface allows the compiler to treat the full system's memory space consistently.

The dedicated TPU network enables scalable synchronous training across TPUs. Asynchronous training was the state of the art previously, but our studies showed synchronous converges better ④; async primarily allowed broader scaling when networking was limited.

We connect the TPU pod to storage over the datacenter network to feed the input data for the model via a PCIe-connected CPU host. The system balance across CPU, network, and storage is critical to achieve end-to-end performance at scale ③. The PCIe straw is tiny (16 GB/s per chip), the in-package bandwidth is huge (700 GB/s per chip), and the interconnect links are somewhere in the middle (4 × 60 GB/s). Building TPUv2 to be flexible and programmable allows operation on all data locally instead of moving data back to the host through a bandwidth-constricted PCIe bus ②.

TPUV2 FLOOR PLAN

The floorplan in Figure 4 is not stylish, but it was good enough and allowed us to focus on more important things ①. Most area is for the computation cores in blue, with noticeable area for the memory system and interconnect. The two cores are split across the top and bottom, with the interconnect router sitting in the donut hole in the middle. The white areas are not empty, but filled with wiring.

The two matrix multiply units are at the top center and bottom center. The FLOP/second delivered in such a small area demonstrates the computation density available with the bfloat16 systolic array ⑤.

TPUV3

We hoped to avoid the second-system effect2; we did not want to blow everything we worked hard for in TPUv2 by building the kitchen sink into TPUv3. TPUv3 is a "mid-life kicker" that leveraged what we already built—both use 16-nm technology—but let us pick the low-hanging fruit left behind by the quick development of TPUv2 ①. The most important enhancements were the following.

› It seems quaint now, but we thought the 256-chip TPUv2 system was huge. ML's voracious appetite continues,1 so moving to 1024-chip systems was critical ③.
FIGURE 5. (a) Roofline Models11 for TPUv2 and TPUv3 (left). (b) Roofline Models of V100 for fp16 and fp32 (right). RNN0/RNN1
move up and to the right in (a) since the larger HBM capacity of TPUv3 enabled bigger batch sizes; all others use the same batch
size. MLP0/MLP1 could not run on Volta in (b) because the embedding tables were too big.
… convolution. However, Figure 5(b) assumes the same computation for comparison, lifting CNN0 through …

… unwilling to do the extra work of loss scaling for fp16⁹ ④.

… (see Figure 4), so doubling the MXUs worked well ⑤.

① Build quickly.
… chip layout, and using FIFOs to simplify XLA compiler scheduling.

② Achieve high performance.
Matrix computation density, high HBM bandwidth, and XLA compiler optimizations deliver excellent performance.

③ Scale up.
Our system-first approach and simple-to-use interconnect let TPUv3 scale natively to 1024 chips and deliver nearly linear speedup for production applications.

④ Work for new workloads out of the box.
To support the deluge of training workloads, we built a core grounded in linear algebra that works well with the XLA compiler, and HBM ensures we have enough capacity and bandwidth to keep pace with growing models.

⑤ Be cost effective.
The matrix units are efficient, the design was simple without gratuitous bells and whistles, and we got our money's worth in performance.

As deep learning continues to evolve, and as we understand the workloads and scaling limits better, opportunities continue for further codesign across ML models, software, and hardware to improve next-generation TPUs and other domain-specific architectures.

ACKNOWLEDGMENTS

TPUv2/v3 exist thanks to the creativity and hard work of the TPU team, from chips to systems to software. The author list represents a small slice, and we are grateful for the opportunity to work together. Thanks go to Paul Barham, Eli Bendersky, Dehao Chen, Clifford Chao, Chiachen Chou, Jeff Dean, Brian Fosco, Ben Gelb, Jesse Guss, Peter Hawkins, Blake Hechtman, Mark Heffernan, Richard Ho, Robert Hundt, Michael Isard, Terry Kang, Fritz Kruger, Naveen Kumar, Sameer Kumar, Steve Lacy, Chris Leary, Hyouk-Joong Lee, David Majnemer, Lifeng Nai, Rahul Nagarajan, Tayo Oguntebi, Andy Phelps, Paul Rodman, Bjarke Roune, Brennan Saeta, Amir Salek, Julian Schrittwieser, Dan Steinberg, Andy Swing, Horia Toma, Shibo Wang, Tao Wang, Yujing Zhang, and many more.

REFERENCES

1. D. Amodei and D. Hernandez, "AI and compute," 2018. [Online]. Available: blog.openai.com/ai-and-compute
2. F. P. Brooks, Jr., The Mythical Man-Month: Essays on Software Engineering. London, U.K.: Pearson Education, 1975.
3. J. J. Dongarra et al., "The LINPACK benchmark: Past, present and future," Concurrency Comput., Pract. Experience, vol. 15, no. 9, pp. 803–820, 2003.
4. A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," in Proc. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
5. N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 1–12.
6. N. P. Jouppi et al., "A domain-specific supercomputer for training deep neural networks," Commun. ACM, vol. 63, no. 7, pp. 67–78, 2020.
7. N. Kumar, "Google breaks AI performance records in MLPerf with world's fastest training supercomputer," 2020. [Online]. Available: cloud.google.com/blog/products/ai-machine-learning/google-breaks-ai-performance-records-in-mlperf-with-worlds-fastest-training-supercomputer
8. J. Lin, X. Li, and G. Pekhimenko, "Multi-node BERT-pretraining: Cost-efficient approach," 2020, arXiv:2008.00177.
9. P. Micikevicius et al., "Mixed precision training," 2017, arXiv:1710.03740.
10. D. Silver et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
11. S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, 2009.
Interleaved multithreading (IMT) or barrel-processing is a simple and widely known program execution paradigm that alternates instructions belonging to different execution threads in the stages of a single-issue in-order processor pipeline.1,2,3 In this scheme, while the throughput is limited to one instruction per cycle (IPC), pipeline stalls due to interinstruction dependence are avoided without any hardware overhead for dependence management. As long as the application workload can be programmed as multiple threads, the IMT approach can sustain IPC = 1 with relatively high clock frequency and high energy efficiency, thanks to the hardware simplicity, which is a desirable goal in embedded edge-computing processors.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3050962
Date of publication 12 January 2021; date of current version 26 March 2021.

Nonetheless, to execute computationally heavy applications on the extreme edge, any processor core needs hardware acceleration support. Two broad classes of hardware acceleration exist: hardware units that autonomously execute entire computation kernels upon memory-mapped commands from the processor core, and instruction acceleration units, sometimes referred to as coprocessors, that take over complex instructions and thus are directly sequenced by the core instruction stream. Coprocessors imply less communication overhead, yet they can be efficiently exploited only within instruction set architectures (ISA) that allow extensions dedicated to particular computation domains, such as RISC-V.4

Edge computing devices regard energy efficiency as the prime concern. This article addresses the introduction of vector coprocessor acceleration in IMT cores for extreme edge computing, showing that an IMT processor has an architectural design advantage over other cores with similar IPC, which allows exploiting hardware acceleration with higher energy efficiency and speed.

In this context, we specifically address supporting accelerated vector operations, to execute ubiquitous computation kernels in edge computing applications:

› 2-D convolution, covering the broad area of deep neural network applications5;
› fast Fourier transform (FFT), typical of signal processing applications, for example, in 5G IoT devices6;
› matrix multiplication (MatMul), used in a variety of fields, predominantly in cryptography.

A typical scenario is to run homogeneous workloads on all the threads, applying the same algorithm on different input data, e.g., convoluting multiple image frames. Otherwise, one can take advantage of the multiple contexts provided in an IMT core and run a composite workload running different algorithms, e.g., transmitting an encrypted stream of preprocessed video/audio, by convoluting an image while analyzing an audio stream via FFT, then encrypting the processed data using an algorithm that heavily relies on MatMul.
In this article, we design, implement, and evaluate a whole taxonomy of coprocessor acceleration schemes for IMT cores, analyzing them for performance, area, and energy efficiency on the above application cases. The contributions of this work are as follows:

› We provide designers with a quantitative comparison between different coprocessing schemes referring to different computation kernels.
› Specifically, we identify the optimal balance between thread-level parallelism (TLP) and data-level parallelism (DLP) in the addressed scenarios.
› We demonstrate the performance and energy efficiency of the IMT approach in the target application contexts by comparing it with processor cores in the same complexity range.
› We show the potentials of an open hardware design based on the RISC-V instruction set along with its open programming environment.

BACKGROUND

Many previous works reported the design of hardware-accelerated cores in edge-computing applications.

Lim et al.7 report the design details of a low-voltage microcontroller with subword-SIMD support. This article is more general in investigating various SISD-SIMD-MIMD combinations in coprocessor design. The work by Botman et al.8 is similar and investigates ad hoc ISA encoding and pipeline stage balancing for power efficiency and introduces a dedicated coprocessor interface. Yet, the authors do not elaborate on coprocessor architectures and performance. Our work further differs from Gautschi et al.7 and Luo and Zhang8 in targeting RISC-V compliance.

Gautschi et al.9 describe a RISC-V processor with DSP hardware support, targeting near-threshold voltage operation, and in the Diet-SODA design,10 a SIMD-oriented DSP accelerator also runs in a near-threshold regime. Our study is agnostic about supply or bias voltage tuning, purely addressing DLP and TLP balancing for energy efficiency in any physical implementation, including soft-cores on FPGA, as shown in our results.

A hardware convolution engine for image processing is presented by Conti and Benini,11 focusing on the optimal buffer design to store selected portions of the input image. Chen et al.12 and Du et al.13 also present convolution accelerators, based on parallel hardware units and local data reuse. Our study adopts a different approach, based on multipurpose vector coprocessors equipped with scratchpad memories, coupled with an IMT processor, to hide memory latency.

This article builds on the activity reported by Cheikh et al.,14 which was an initial effort into designing a mathematical accelerator for a RISC-V core, and by Cheikh et al.,3 who addressed the best performing pipeline organization for an IMT RISC-V core.

THE KLESSYDRA-T IMT ARCHITECTURE

The processing core discussed in this article, named Klessydra-T13, is a parametric design implementing an IMT four-stage-pipeline RISC-V processor. It supports the RV32IMA instruction set,4 augmented by a custom extension composed of a small subset of mathematical vector instructions. The Klessydra-T13 core (see Figure 1) realizes a pure IMT paradigm as defined by the following points:

› thread context switch at each clock cycle;
› in-order, single-issue instruction execution;
› feed-forward pipeline (no hardware support for branching-hazard and data-hazard handling);
› bare-metal execution (RISC-V M mode).

The core interleaves three hardware threads (harts4) in the instruction pipeline. The register file, program counter, and CSR unit are replicated per hart. A hardware context counter (harc) switches between the hart program counters on a rotation basis to fetch instructions from the program memory. The three harts in the four pipeline stages provide a register file access fence, so that it is never possible for any two instructions to manifest a dependence hazard in the pipeline.

The T13 core includes multiple units in the execution stage, namely a load/store unit (LSU), a scalar execution unit (EXEC), and a vector-oriented multipurpose functional unit (MFU), which implements the coprocessing features. The LSU works in parallel with other units when executing store instructions, which cannot cause a write-back conflict on the register file.
The MFU is allowed to read operands from the register file, but can write results only to local scratchpad memories (SPMs). The LSU manages data transfers to/from the data memory from/to the SPMs via dedicated instructions.

The MFU executes vector arithmetic instructions, whose latency is proportional to the vector length. A hart requesting access to the busy MFU executes a self-referencing jump until the MFU becomes free, avoiding unnecessary stalls of other harts in the pipeline that are independent from the MFU being busy.

The custom instruction extension supported by the MFU and LSU is summarized in Table 1. The instructions implement vector operations without relying on a vector register file, but rather on a memory space mapped on the local SPMs, for maximum flexibility. The programmer can move vector data at any point of the SPM address space with no constraint except the total capacity of the SPMs, which in turn is a parameter of the microarchitecture design.

The coprocessor instructions are exposed to the programmer as very simple intrinsic functions, fully integrated into the RISC-V GCC compiler toolchain.

HARDWARE ACCELERATION SCHEMES

The MFU and SPMs are accessed through a scratchpad-memory interface (SPMI). The user can configure the number of parallel lanes D in the MFU, the number of MFUs F, the SPM capacity, the number of SPMs N, the number of SPMIs M, and the sharing scheme of MFUs and SPMIs among harts. The MFU is the engine that accelerates vector computations. It can operate on different integer data element widths (8, 16, 32 bits) in subword-SIMD fashion, and also in element-SIMD fashion when D is configured to multiply the execution lanes for DLP. A typical vector arithmetic operation has an initial latency between 4 and 8 cycles to access the SPM.

Each SPM has one read and one write port. The parameter D that defines the MFU lanes also corresponds to the number of SPM banks; all the banks of an SPM are accessed together as a single SPM line. When the MFU executes a vector operation, it fetches an entire SPM data line in every clock cycle, composed of multiple vector elements. A bank read rotator aligns the source operands coming from the SPM line, and a bank write rotator aligns the destination data to the correct banks in an SPM line. When the LSU fills the SPM banks with data from the 32-bit data memory port, a bank interleaver switches between the banks. The reader may refer to the work by Botman et al.14 for internal details of the units inside the MFU and SPMs.

Furthermore, the coprocessor can be configured to implement the following sharing schemes among harts.
Shared coprocessor: All the harts share a single MFU/SPM subsystem. In the case of a busy MFU, any hart wanting to access it is stalled until the MFU becomes free. In this scheme, parallel execution may occur between coprocessor and noncoprocessor instructions. Yet, the MFU/SPM may exploit pure DLP acceleration, by multilane SIMD execution.

Thread-dedicated coprocessors: A complete MFU/SPM subsystem is appointed to each hart, eliminating coprocessor contention. Stalls can only happen if two instructions of the same hart request MFU operation. This scheme can exploit DLP by multilane SIMD execution and TLP by fully symmetric MIMD execution, allowing multiple vector instructions to execute in parallel.

Thread-dedicated SPMI/shared MFU: A dedicated SPM address space is kept for each hart, while the harts share one MFU at the functional unit level. This scheme still allows interhart parallel execution of coprocessor instructions, provided they use different internal functional units of the MFU (e.g., adder, multiplier). Harts requesting a busy internal unit in the MFU get stalled until the contended unit becomes free. This scheme can exploit DLP by multilane SIMD execution, and also TLP in the form of a heterogeneous MIMD execution.

TABLE 1. Custom vector instruction extension ((r) denotes memory addressing via register r).

kmemld (rd),(rs1),(rs2): load vector into scratchpad region
kmemstr (rd),(rs1),(rs2): store vector into main memory
kaddv (rd),(rs1),(rs2): add vectors in scratchpad region
ksubv (rd),(rs1),(rs2): subtract vectors in scratchpad region
kvmul (rd),(rs1),(rs2): multiply vectors in scratchpad region
kvred (rd),(rs1): reduce vector by addition
kdotp (rd),(rs1),(rs2): vector dot product into register
ksvaddsc (rd),(rs1),(rs2): add vector + scalar into scratchpad
ksvaddrf (rd),(rs1),rs2: add vector + scalar into register
ksvmulsc (rd),(rs1),(rs2): multiply vector × scalar into scratchpad
ksvmulrf (rd),(rs1),rs2: multiply vector × scalar into register
kdotpps (rd),(rs1),(rs2): vector dot product and post scaling
ksrlv (rd),(rs1),rs2: vector logic shift within scratchpad
ksrav (rd),(rs1),rs2: vector arithmetic shift within scratchpad
krelu (rd),(rs1): vector ReLU within scratchpad
kvslt (rd),(rs1),(rs2): compare vectors and create mask vector
ksvslt (rd),(rs1),rs2: compare vector-scalar and create mask
kvcp (rd),(rs1): copy vector within scratchpad region

The explored design parameters and corresponding configurations, for reference in reporting performance results, are the following:

› M = 1, F = 1, D = 1: SISD
› M = 1, F = 1, D = 2, 4, 8: Pure SIMD
› M = 3, F = 3, D = 1: Symmetric MIMD
› M = 3, F = 3, D = 2, 4, 8: Symmetric MIMD + SIMD
› M = 3, F = 1, D = 1: Heterogeneous MIMD
› M = 3, F = 1, D = 2, 4, 8: Heterogeneous MIMD + SIMD

We use N = 3 in MatMul and convolutions, and N = 4 in FFT.

Finally, we refer to the T13 microarchitecture configured with no hardware acceleration as Klessydra T03.

PERFORMANCE RESULTS

We run a set of test programs composed of 2-D convolution, FFT, and MatMul kernels. We adopted the widely used 3 × 3 filter size on matrix sizes of 4 × 4, 8 × 8, 16 × 16, and 32 × 32 elements for convolutions. FFT was run on 256 samples, and MatMul on 64 × 64 element matrices. The element width was kept 32 bit in fixed-point representation. The tests were organized as follows:

› homogeneous workload, running multiple instances of the same kernel on multiple harts, on different input data;
› composite workload, running convolutions, FFTs, and MatMul repeatedly on three respective harts.

The performance was measured by taking the average cycle count to execute one computation kernel. Table 2 summarizes the results, which are discussed below.

TABLE 2. Summary of performance results and synthesis results. Green = best case; red = worst case.

Cycle count: With small matrix convolutions, the accelerated core reached up to 3× cycle-count speed-up over a nonaccelerated IMT core (Klessydra T03), and 2× speed-up over a single-threaded, DSP-extended core (RI5CY9). As expected, large matrix convolutions and MatMul obtain a more considerable advantage from vector-accelerated cores, quantified in 13× cycle-count speed-up relative to Klessydra T03, 9× relative to the RI5CY core, and 19× relative to ZeroRiscy. In contrast, FFT takes benefit from TLP and reduced data memory accesses rather than from DLP.

Figure 2 quantifies the contribution of DLP and TLP …

Maximum clock frequency: All the cores under analysis were implemented as FPGA soft-cores. The clock speed exhibited the sharpest drops as the TLP grew larger: in the heterogeneous MIMD scheme, the crossbar mapping the SPMI output data on the shared MFU units became the critical path for D = 4, 8. Pipe…
for convolutions on different matrix sizes. For small lining the crossbar to reduce the critical path introdu-
vectors, TLP inherently exhibits better contributions ces hardware overhead, compromising the area
to speed-up than DLP, while as the vector size grows, advantage of the heterogeneous MIMD configuration.
the DLP boost dominates. Implementations exploiting Absolute execution time: The cycle count and
both TLP and DLP performed much better than pure the operating frequency allow calculating the total
DLP also with large matrices. A key outcome is that a execution time. Figure 3 compares the actual execu-
single core IMT processor can exploit both DLP and tion time speed-up relative to the ZeroRiscy core
TLP and follow the gray curve, while a single-threaded taken as the reference when each core operates at
core exploiting only DLP acceleration follows the blue its maximum frequency. In pure SIMD configurations,
curve. the speed-up grows linearly with the DLP for the
Notably, the heterogeneous MIMD coprocessor, explored DLP range. Yet, exploiting TLP, by going
which has three times less functional units than the from a SISD/SIMD to symmetric and heterogenous
fully symmetric MIMD, employed only 1% – 7% more MIMD, improved the speedup in all cases, despite
cycles than the latter. the frequency drop associated with the MIMD copro-
cessor. Thanks to exploiting both TLP and DLP, the
symmetric MIMDþSIMD schemes exhibit the lowest
execution times, reaching up to 17 speed-up over Zer-
oriscy for Convolution 32 32 and up to 13 speed-up
for the composite workload. Notably, the heteroge-
neous MIMD configurations maintain an almost perfect
overlap with the symmetric MIMD.
The nonaccelerated Klessydra-T03, while employ-
ing a higher cycle count than RI5CY due to the
absence of DSP and hardware-loop extensions, exhib-
its an absolute performance advantage over RI5CY
thanks to a more than double frequency attained by
FIGURE 2. DLP and TLP cycle-count boost in 2-D convolu- the pure IMT microarchitecture. When compared to
tions for different matrix sizes. ZeroRiscy, T03 exhibits both lower cycle count and
higher frequency.
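The absolute-time comparison above reduces to simple arithmetic on cycle count and attainable clock frequency. A minimal sketch, using illustrative numbers rather than the article's measured data:

```python
def exec_time_us(cycles: int, fmax_mhz: float) -> float:
    """Total execution time (microseconds) from cycle count and clock."""
    return cycles / fmax_mhz

def speedup(cycles_ref: int, fmax_ref_mhz: float,
            cycles: int, fmax_mhz: float) -> float:
    """Speed-up of a core over the reference, each at its own fmax."""
    return exec_time_us(cycles_ref, fmax_ref_mhz) / exec_time_us(cycles, fmax_mhz)

# Illustrative only: a core needing 4x fewer cycles but clocking 20% slower
# than the reference is still 3.2x faster in absolute time.
print(speedup(100_000, 100.0, 25_000, 80.0))  # → 3.2
```

This is why the MIMD schemes can win overall despite their frequency drop: the cycle-count reduction outweighs the slower clock.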
Hardware Resource Utilization: In cost-constrained applications, it is crucial to find an optimal balance between speed-up and area overhead. The heterogeneous MIMD + SIMD scheme with D = 2 proved to be a possible best choice with all test programs.

The nonaccelerated T03 exhibits only a slightly larger footprint than the tiny ZeroRiscy core, despite the replicated register file to support multithreading, thanks to the LUT-RAM implementation of the registers.

Energy Efficiency: The average energy per algorithmic operation (multiplications and additions) is a general measure of the energy efficiency attained by a processor core in implementing an algorithm computation. Figure 4 reports the outcome of this analysis, referring to the soft-core implementations. The results are presented as the reduction in nJ/op relative to ZeroRiscy, taken as reference, which exhibited 4.24 nJ/op as the best case in the analyzed workloads.

The most energy-efficient designs proved to be the symmetric MIMD and heterogeneous MIMD schemes, again exhibiting an almost complete overlap and reaching over 85% energy saving relative to the reference ZeroRiscy. Despite having the smallest area footprint, the pure SIMD schemes resulted in larger energy consumption, due to low exploitation of TLP.

Larger Filters: Convolutional neural networks primarily employ 3 × 3 filters (VGG16) but also larger ones (e.g., 11 × 11 in AlexNet, 5 × 5 in GoogLeNet). Large masks such as 7 × 7 are used in Sobel filtering, Gaussian smoothing, and median filtering. We evaluated the vector coprocessor schemes with filters ranging from 5 × 5 to 11 × 11, on 32 × 32 element matrices.

TABLE 3. Higher order filter evaluation results for cycle count, total time at max frequency, and total energy. Green = best case; red = worst case.

Table 3 shows that the speed-up and energy efficiency trends continue as the filter dimensions grow larger, favoring higher DLP. The improvement relative to ZeroRiscy grows up to 35× when using 11 × 11 filters.

The symmetric and heterogeneous MIMD+SIMD schemes, with D = 2, maintain similar performance and energy results throughout the analyzed cases. The results confirm that an IMT core capable of MIMD acceleration increasingly outperforms single-threaded SIMD acceleration.

CONCLUSION
The scientific outcome of this article can be summarized in the following list of evidence:

› The MIMD-SIMD vector coprocessor schemes enable tuning the TLP and DLP contribution and obtain the best results in absolute performance and energy efficiency, reaching >15× speed-up and -85% energy per operation.
› Kernels that are less effectively vectorizable can still benefit from acceleration through SPMs and TLP, in an IMT core, reaching 2×-3× speed-up.
› Fully symmetric and heterogeneous MIMD give very similar results, showing that coprocessor contention can be effectively mitigated by functional unit heterogeneity, allowing hardware resource saving. From the same observation, we can state that functional unit contention is less impacting than SPM contention, in all the kernels.
› Pure DLP acceleration always gives inferior results to a balanced TLP/DLP acceleration. An IMT microarchitecture can benefit from TLP and DLP acceleration in a single core.
› In the absence of hardware acceleration, IMT still exhibits an absolute performance advantage over single-thread execution thanks to the simplified hardware structure.

The Klessydra-T parametric cores are available as open-source designs on GitHub at https://perma.cc/6FYD-AF68.

REFERENCES
1. C. Bechara et al., "A small footprint interleaved multithreaded processor for embedded systems," in Proc. 18th IEEE Int. Conf. Electron., Circuits, Syst., 2011, pp. 685–690.
2. A. Cheikh et al., "The microarchitecture of a multi-threaded RISC-V compliant processing core family for IoT end-nodes," in Proc. Int. Conf. Appl. Electron. Pervading Ind., Environ. Soc., 2017, pp. 89–97.
3. M. Olivieri et al., "Investigation on the optimal pipeline organization in RISC-V multi-threaded soft processor cores," in Proc. IEEE New Gener. CAS, 2017, pp. 45–48.
4. RISC-V Unprivileged Instruction Set Specifications, Dec. 2019. [Online]. Available: https://riscv.org/specifications/
5. F. Samie, L. Bauer, and J. Henkel, "From cloud down to things: An overview of machine learning in Internet of Things," IEEE Internet Things J., vol. 6, no. 3, pp. 4921–4934, Jun. 2019.
6. F. Luo and C. J. Zhang, Signal Processing for 5G: Algorithms and Implementations. New York, NY, USA: Wiley, 2016.
7. K. Lim, S. Jeong, Y. Kim, and H. S. Yang, "CalmRISC: A low power microcontroller with efficient coprocessor interface," Microprocessors Microsyst., vol. 25, pp. 247–261, 2001.
8. F. Botman, J. de Vos, S. Bernard, F. Stas, J. Legat, and D. Bol, "Bellevue: A 50 MHz variable-width SIMD 32-bit microcontroller at 0.37 V for processing-intensive wireless sensor nodes," in Proc. IEEE Int. Symp. Circuits Syst., Melbourne, Australia, 2014, pp. 1207–1210.
9. M. Gautschi et al., "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2700–2713, Oct. 2017.
10. S. Seo et al., "Diet SODA: A power-efficient processor for digital cameras," in Proc. 16th ACM/IEEE Int. Symp. Low Power Electron. Des., 2010, pp. 79–84.
11. F. Conti and L. Benini, "A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters," in Proc. IEEE Des., Autom. Test Eur. Conf. Exhib., 2015, pp. 683–688.
12. Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
13. L. Du et al., "A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018.
14. A. Cheikh et al., "Efficient mathematical accelerator design coupled with an interleaved multi-threading RISC-V microprocessor," in Proc. Int. Conf. Appl. Electron. Pervading Ind., Environ. Soc., 2019, pp. 529–539.
ABDALLAH CHEIKH is currently a Postdoctoral Researcher with Sapienza University of Rome, Italy. His interest in computer organization and design drove him to pursue a PhD fellowship at Sapienza University of Rome, majoring in computer architecture. His research activities cover design, implementation, and verification of a wide range of microprocessor architectures, vector accelerators, and morphing processors. Cheikh received a B.S. and an M.S. in electrical engineering from Rafik Hariri University, Lebanon, in 2014 and 2016, respectively, and a Ph.D. in 2020. Contact him at abdallah.cheikh@uniroma1.it.

STEFANO SORDILLO has been a Doctorate student with the Sapienza University of Rome, Italy, since 2019. His research activities cover microprocessor core design, hardware accelerators, and neural network algorithms for IoT devices. Sordillo received an M.S. (Laurea) "cum laude" in electronics engineering from Sapienza University of Rome, Italy, in 2019. Contact him at stefano.sordillo@uniroma1.it.

…electronic engineering "cum laude" in 2001 and a Ph.D. in electronic engineering from the University of Rome "La Sapienza" in 2005. He coauthored more than 40 publications in international journals and conference proceedings. Contact him at francesco.menichelli@uniroma1.it.

GIUSEPPE SCOTTI became a Researcher (Assistant Professor) with the DIET Department, University of Rome "La Sapienza," Italy, in 2010 and became an Associate Professor with the same department in 2015. His research activity was mainly concerned with integrated circuit design and focused on design methodologies able to guarantee robustness with respect to parameter variations in both analog circuits and digital VLSI circuits. Scotti received an M.S. and a Ph.D. in electronic engineering from the University of Rome "La Sapienza," in 1999 and 2003, respectively. He has coauthored about 50 publications in international journals and more than 70 contributions in conference proceedings. Contact him at giuseppe.scotti@uniroma1.it.
DEPARTMENT: SECURITY
Video game consoles share many of the characteristics of an ideal device for use in
enterprise deployments. In comparison to many desktop and notebook PCs
available in the market, modern video game consoles are actually quite powerful
and capable. They provide an excellent user experience with simple and intuitive
setup and operation. At the heart of the design of many modern video game
consoles is security; they are remarkably resilient against very sophisticated
hardware and software attacks. They are also rather cost-effective in comparison
to modern PCs.
Video game consoles are ideal devices for enterprise deployments; they are powerful, versatile, easy to use, cost-effective, and extremely secure. Systems suitable for use in an enterprise must be able to handle a variety of workloads to ensure that users remain productive; such workloads include, but are not limited to, video conferencing, web browsing, content creation (e.g., spreadsheets, presentations, documents, audio, video, etc.), and audio/video streaming. Security is equally as important, if not more important, than user productivity; enterprise systems routinely process sensitive information, which is critical to the organization's success. Of course, all enterprise systems strive to reduce the overall total cost of ownership (TCO); this can be achieved by simply reducing the cost of the hardware itself as well as reducing the operational expenses (OpEx) incurred by managing and maintaining the systems (e.g., patching, updates, user training/support, etc.).

The purpose of this article is to describe the ideal characteristics of a device suitable for use in enterprise deployments and demonstrate how video game consoles are designed with these characteristics and traits in mind, which, therefore, makes them an excellent fit. It is extremely important to understand that while this article may use specific consoles as examples of such devices, many of the points being made apply to almost all modern video game consoles including, but not limited to, the Nintendo Switch, the Sony PlayStation 4, the Sony PlayStation 5, and the Microsoft Xbox Series S.

PERFORMANCE
Relative to desktop and notebook PCs available at their time of release, modern video game consoles are quite powerful. As described in Table 1, video game consoles feature the latest technologies available at their respective times of release. Unsurprisingly, performance is critical when considering devices for use in the enterprise. It is fairly common for enterprise users to be working across multiple applications and contexts at once; high-performance devices aid user workflow and productivity.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3055681
Date of current version 26 March 2021.
TABLE 1. Hardware details of various video game consoles released over several years.

System                   | Release Date  | CPU Cores | Memory       | Process Tech. | CPU Microarch. | GPU Microarch.
Microsoft Xbox One       | November 2013 | 8x x86-64 | 8 GB DDR3    | TSMC 28 nm    | AMD Jaguar     | AMD GCN 2
Sony PlayStation 4       | November 2013 | 8x x86-64 | 8 GB GDDR5   | TSMC 28 nm    | AMD Jaguar     | AMD GCN 2
Microsoft Xbox One S     | August 2016   | 8x x86-64 | 8 GB DDR3    | TSMC 28 nm    | AMD Jaguar     | AMD GCN 2
Sony PlayStation 4 Pro   | November 2016 | 8x x86-64 | 8 GB GDDR5   | TSMC 28 nm    | AMD Jaguar     | AMD GCN 4
Nintendo Switch          | March 2017    | 4x ARMv8  | 4 GB LPDDR4  | TSMC 20 nm    | ARM Cortex-A57 | NVIDIA Maxwell
Microsoft Xbox One X     | November 2017 | 8x x86-64 | 12 GB GDDR5  | TSMC 16FF+    | AMD Jaguar     | AMD GCN 4
Microsoft Xbox Series X  | November 2020 | 8x x86-64 | 16 GB GDDR6  | TSMC N7e      | AMD Zen 2      | AMD RDNA 2
Microsoft Xbox Series S  | November 2020 | 8x x86-64 | 10 GB GDDR6  | TSMC N7e      | AMD Zen 2      | AMD RDNA 2
Sony PlayStation 5       | November 2020 | 8x x86-64 | 16 GB GDDR6  | TSMC N7e      | AMD Zen 2      | AMD RDNA 2
The Microsoft Xbox One X features a semicustom system on a chip (SoC) developed in partnership with Microsoft and Advanced Micro Devices (AMD).1 The SoC is implemented using Taiwan Semiconductor Manufacturing Company's (TSMC) 16-nm Fin field-effect transistor (FinFET) Plus (16FF+) technology; it features a CPU composed of 8 64-bit x86 cores operating at 2.3 GHz and a GPU composed of 40 compute units operating at 1.172 GHz. The SoC uses a unified memory pool, shared by both the CPU and the GPU, which consists of 12 GB of GDDR5 DRAM; the total memory bandwidth is 326.4 GB/s. The console supports HDMI 2.0b display output with high-bandwidth digital content protection (HDCP) 2.2, 10-bit HDR, and a resolution of 3840 × 2160 at 60 Hz. The GPU is further optimized for a version of Microsoft's DirectX 12 graphics API specific to the system.

The Microsoft Xbox Series X, shown in Figure 1, features a semicustom SoC developed in partnership with Microsoft and AMD.2 The SoC is implemented using TSMC's 7-nm FinFET Enhanced (N7e) technology; it features a CPU composed of 8 64-bit x86 cores operating at 3.8 GHz, or 3.6 GHz with simultaneous multithreading (SMT) enabled, and a GPU composed of 52 compute units operating at 1.825 GHz. The SoC uses a unified memory pool, shared by both the CPU and the GPU, which consists of 16 GB of GDDR6 DRAM; 10 GB, reserved for the GPU, operate at 560 GB/s and 6 GB, reserved for the CPU, operate at 336 GB/s. The console supports HDMI 2.1 display output with the same features as the Xbox One X SoC in addition to fixed rate link (FRL), variable refresh rate (VRR), display stream compression (DSC), 4:4:4 chroma subsampling, and a resolution of either 3840 × 2160 at 120 Hz or 7680 × 4320 at 60 Hz. The GPU is further optimized for a version of Microsoft's DirectX 12 Ultimate graphics API specific to the system.

FIGURE 1. Microsoft Xbox Series X video game console and controller.3
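The headline numbers above follow from a few interface parameters. A quick sanity check is possible, assuming GCN/RDNA-style compute units with 64 shader ALUs each (2 FLOPs per ALU per clock) and the memory bus widths and per-pin data rates commonly reported for these parts; none of these parameters are stated explicitly in the article:

```python
def peak_tflops(compute_units: int, clock_ghz: float,
                alus_per_cu: int = 64, flops_per_alu: int = 2) -> float:
    """Peak single-precision GPU throughput in TFLOPS."""
    return compute_units * alus_per_cu * flops_per_alu * clock_ghz / 1000

def peak_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak DRAM bandwidth in GB/s: pins x per-pin rate / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8

print(round(peak_tflops(40, 1.172), 1))        # Xbox One X GPU: 6.0 TFLOPS
print(round(peak_tflops(52, 1.825), 1))        # Xbox Series X GPU: 12.1 TFLOPS
print(round(peak_bandwidth_gbs(384, 6.8), 1))  # One X, 384-bit GDDR5: 326.4 GB/s
print(round(peak_bandwidth_gbs(320, 14), 1))   # Series X GPU pool: 560.0 GB/s
```

Under these assumed parameters the bandwidth results reproduce the 326.4-GB/s and 560-GB/s figures quoted in the text.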
For comparison, the Intel Core i5-8600T, an Intel eighth-generation desktop SoC released several months after the Xbox One X, features a CPU composed of 6 64-bit x86 cores operating at a base frequency of 2.3 GHz and a GPU operating at 1.15 GHz; this SoC is implemented using Intel's third-generation 14-nm++ technology.4 It supports HDMI display output with a resolution of 4096 × 2304 at 24 Hz; it also supports the mainstream version of Microsoft's DirectX 12 graphics API.

It is important to understand that while these SoCs differ drastically in regards to cost, intended use, process technology, thermal design power (TDP), instructions per cycle (IPC), and various other aspects of SoC and CPU and GPU design, the intent here is to highlight that video game consoles are powerful and feature-rich in comparison to mainstream compute devices available at the time of their release.

USER EXPERIENCE
Video game consoles provide an elegant and engaging experience for all ages and skill levels. This characteristic is ideal for enterprises, because it enables all of its users to be functional and productive without requiring additional training, learning, etc.

Ease of Use
Content (games) aside, the systems are designed for both children and adults with limited experience or understanding of technology. Only two cables are required to use the system: power and display (usually HDMI). User input and haptic feedback are performed through an ergonomically designed controller; however, modern systems also support traditional keyboard and mouse input.5 System and software (game) updates and purchases are obtained through a single source (e.g., Microsoft Store, Nintendo eShop, Sony PlayStation Store, etc.), which is tightly integrated into the system, making it easy and intuitive to find, purchase, and download software.

Versatility
The systems are also rather versatile; developers can write and release a variety of different software titles. Aside from the obvious (games), this includes, but is not limited to, video streaming applications, music streaming applications, video conferencing applications, web browsers, and cloud storage applications. For example, Hulu is available on the Nintendo Switch via the Nintendo eShop and Spotify is available on the PlayStation consoles via the Sony PlayStation Store. Enterprises have the ability to write their own custom internal applications, and independent software vendors (ISVs) can also develop and distribute applications for public consumption.

SECURITY
At the heart of their design is thoughtful and practical defense against a wide range of threats. Without question, security is paramount in enterprise contexts. It is common for users to access and store sensitive information, which is crucial to the success and well-being of the organization and its stakeholders (e.g., employees, customers, clients, shareholders, etc.).

Identity and Access Management
Centralized identity and access management (IAM) is used throughout the entire ecosystem. The Microsoft consoles require users to authenticate using a Microsoft account; similarly, the Sony consoles require a Sony account and the Nintendo consoles require a Nintendo account. These identities are then used for access and privilege management. Identity is required to associate and maintain licenses for software (games and applications) and subscription services. It is also used to control communication and interaction with other users; access to user information (e.g., online status, currently running software, etc.) and interactions (e.g., text messages, voice messages, in-game chat, etc.) can be explicitly granted to or revoked from other users.

Patching and Updates
The systems are designed such that only fully patched and updated systems and software can access protected resources (e.g., Xbox LIVE, Nintendo Online, game servers, etc.) and interact with other compliant (patched and updated) systems and users.

Upon boot, the system attempts to connect to a single, trusted authority (e.g., Xbox LIVE, PlayStation Network, etc.). Upon connection, it then checks for any system updates and prompts the user to download and install them. If the user chooses to skip/decline any pending updates or if a connection to the trusted authority cannot be made (for whatever reason), they can continue to use the system offline and use software (games) that is already installed. Simply, the system will not allow a user to connect to the trusted authority (e.g., to play games, to chat with friends, etc.) unless the system is fully patched and updated. If the system is fully patched and updated but the software (game) the user wishes to launch is not fully updated, the user is prompted to download and install any pending updates. If the user chooses to skip/decline the pending updates, they can continue to use the software offline.
Hardware Security
Video game consoles are designed to be remarkably resilient against various hardware and software attacks. The entire business model of a modern video game console is centered around software sales, not hardware sales. They are designed around the premise that the end-user cannot be trusted; an end-user's motivation is to play games for free (piracy) and/or modify the game to achieve an unfair advantage over other players (cheat). Therefore, extreme measures must be taken to prevent physical attacks against the system. However, the end-user is not the only untrusted entity.

The Xbox One X is an excellent example of such design;6 many of these security features have been carried forward to the Xbox Series X. Quite literally, the only trusted entity of the entire Xbox One X system is the SoC itself; the internal storage, DRAM, optical drive, etc., are considered untrusted. Therefore, all information which leaves the SoC must be encrypted and all information which enters the SoC must be decrypted and integrity checked.

All data are stored in nonvolatile memory using a format known as the Xbox Virtual Disk (XVD). As illustrated in Figure 2, all data are stored in an NT File System (NTFS) virtual disk and then encrypted and hashed (for both confidentiality and integrity); finally, the root digest of the hash tree is signed using Microsoft RSA (for integrity of the hash tree itself).

FIGURE 2. High-level depiction of an Xbox Virtual Disk (XVD) structure.6

The system SoC, illustrated in Figure 3, features a custom-designed element referred to as the Streaming Crypto Engine (SCE), which is able to decrypt information loaded from the internal storage as fast as it can be read from the underlying I/O bus (SATA III in the Xbox One X and NVMe in the Xbox Series X). Keys used to decrypt information are fed into the SCE through a dedicated hardware pin connecting it to the Crypto Engine inside of another custom-designed element within the SoC referred to as the Security Complex; this ensures that the keys are never exposed to software at any point in time. The Security Complex also closely monitors the system clock, voltage, temperature, and reset; these are commonly manipulated to attack a system.

FIGURE 3. High-level depiction of the Xbox One X SoC architecture.6

One of the core tenets of the console's security design is defense in depth; in other words, an attacker must break through multiple layers of security. In addition to encrypting and integrity checking all information which passes through the SoC, the system uses a three-OS architecture,7 as illustrated in Figure 4.

FIGURE 4. High-level depiction of the Xbox One X three-OS architecture.6
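The XVD integrity scheme, a hash tree whose root digest is signed, can be illustrated with a simplified sketch (arbitrary pages, SHA-256, binary tree; the real on-disk format differs in layout and page size):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list) -> bytes:
    """Root digest of a binary hash tree over data blocks. Hash each block,
    then repeatedly hash pairs of digests until one root remains."""
    level = [sha256(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                # duplicate last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

pages = [b"page0", b"page1", b"page2", b"page3"]
root = merkle_root(pages)
# Tampering with any single page changes a leaf hash and hence the root,
# so one RSA signature over the root digest authenticates every page.
assert merkle_root([b"page0", b"pageX", b"page2", b"page3"]) != root
```

This is why signing only the root suffices: verifying a page requires recomputing hashes along one tree path, while the signature check binds the whole tree at once.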
TABLE 2. Cost comparison of various video game consoles at their time of release.

System                               | Release Date  | Internal Storage | Cost (USD)
Microsoft Xbox One                   | November 2013 | 500 GB           | 499
Sony PlayStation 4                   | November 2013 | 500 GB           | 399
Microsoft Xbox One S                 | August 2016   | 500 GB           | 299
Sony PlayStation 4 Pro               | November 2016 | 1 TB             | 399
Nintendo Switch                      | March 2017    | 32 GB            | 299
Microsoft Xbox One X                 | November 2017 | 500 GB           | 499
Microsoft Xbox Series X              | November 2020 | 1 TB             | 499
Microsoft Xbox Series S              | November 2020 | 512 GB           | 299
Sony PlayStation 5 w/o optical drive | November 2020 | 825 GB           | 399
Sony PlayStation 5 w/ optical drive  | November 2020 | 825 GB           | 499

For comparison, USD 499 spent today can buy a Lenovo ThinkCentre M720q, which includes 128 GB of internal storage, an Intel Pentium Gold G5400T SoC, and 4 GB of DDR4 DRAM.8 Clearly, the consoles are rather competitively priced compared to modern PCs. However, the true cost of any hardware deployment in the enterprise extends far beyond the device itself. One must consider patching, maintenance, and management of the device throughout its entire lifecycle. Considering that patches and software are released and distributed directly through the trusted authority (e.g., Microsoft via Xbox LIVE, Nintendo via Nintendo Online, etc.), there is less operational overhead for an enterprise, which would otherwise have to build its own infrastructure to do so.

The notion of using a video game console in the enterprise may seem laughable at first glance. However, as discussed, video game consoles actually embody many of the characteristics of an ideal device for use in the enterprise.
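The lifecycle-cost argument above is just acquisition cost plus recurring operational expense; a sketch with entirely hypothetical figures (the article quotes no opex numbers):

```python
def tco_usd(device_cost: float, annual_opex: float, years: int) -> float:
    """Total cost of ownership: acquisition cost plus recurring
    management/maintenance expense over the deployment lifetime."""
    return device_cost + annual_opex * years

# Hypothetical: a $499 console whose patching is handled by the platform's
# trusted authority vs. a $499 PC needing in-house update infrastructure.
print(tco_usd(499, 20, 5))   # → 599
print(tco_usd(499, 150, 5))  # → 1249
```

The point of the sketch is that identical sticker prices can diverge sharply in TCO once the opex term dominates.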
…AMD's Zen 2 CPU cores and RDNA 2 GPU cores, while security has improved simply because threats (piracy and cheating) and attacks only become more advanced over time.

REFERENCES
1. J. Sell, "The Xbox One X Scorpio engine," IEEE Micro, vol. 38, no. 2, pp. 53–60, Mar./Apr. 2018.
2. J. Andrews and M. Grossman, "Xbox Series X system architecture," Aug. 2020. Accessed: Jan. 24, 2021. [Online]. Available: https://hotchips.org/assets/program/conference/day1/HotChips2020_GPU_Microsoft_Jeff_Andrews_v2.pdf
3. Wikimedia Commons, "File:Xbox Series X 2.jpg," Dec. 2020. Accessed: Jan. 24, 2021. [Online]. Available: https://commons.wikimedia.org/wiki/File:Xbox_Series_X_2.jpg
4. Intel, "Intel Core i5-8600T processor," 2020. Accessed: Sep. 27, 2020. [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/129938/intel-core-i5-8600t-processor-9m-cache-up-to-3-70-ghz.html
5. Microsoft, "Mouse and keyboard support on Xbox One," 2020. Accessed: Sep. 27, 2020. [Online]. Available: https://support.xbox.com/help/hardware-network/accessories/mouse-keyboard
6. T. Chen, "Guarding against physical attacks: The Xbox One story," Oct. 2019. Accessed: Sep. 27, 2020. [Online]. Available: https://www.platformsecuritysummit.com/2019/speaker/chen/
7. E. Brechner, "Getting diagonal on the Xbox One trifecta," Computer, vol. 46, no. 8, pp. 77–78, 2013.
8. Newegg, "Lenovo ThinkCentre M720q 10T8SDJ900 desktop computer - Pentium Gold G5400T - 4 GB RAM - 128 GB SSD - Tiny - Black," 2020. Accessed: Sep. 26, 2020. [Online]. Available: https://www.newegg.com/p/N82E16883994324
9. CRN, "AMD's Xbox, PlayStation work led to a big security feature in EPYC," Aug. 2019. Accessed: Aug. 2019. [Online]. Available: https://www.crn.com/news/components-peripherals/amd-s-xbox-playstation-work-led-to-a-big-security-feature-in-epyc
10. Microsoft, "Meet the Microsoft Pluton processor, the security chip designed for the future of Windows PCs," Nov. 2020. Accessed: Nov. 23, 2020. [Online]. Available: https://www.microsoft.com/security/blog/2020/11/17/meet-the-microsoft-pluton-processor-the-security-chip-designed-for-the-future-of-windows-pcs/
11. M. Mattioli, "History of video game distribution," IEEE Consum. Electron. Mag., vol. 12, no. 3, pp. 312–322, Sep. 2020.

MICHAEL MATTIOLI leads the Hardware Engineering team within Goldman Sachs, New York, NY, USA. He is responsible for the design and engineering of the firm's digital experiences and technologies. He is also responsible for the overall strategy and execution of hardware innovation both within the firm and within the broader technology industry. Contact him at michael.mattioli@gs.com.

ATTE LAHTIRANTA is currently a Chief Technology Officer and head of Core Engineering with Goldman Sachs, New York, NY, USA. Contact him at atte.lahtiranta@gs.com.
Luiz André Barroso, Google, Mountain View, CA, 94043, USA
Receiving the 2020 ACM-IEEE Eckert-Mauchly Award this past June was among the most rewarding experiences of my career. I am grateful to IEEE Micro for giving me the opportunity to share here the story behind the work that led to this award, a short version of my professional journey so far, as well as a few things I learned along the way.

THE PRACTICE OF COMPUTER SCIENCE
For many of us our earliest models of professionalism come from observing our parents' approach to their work. That was the case for me observing my father, a surgeon working in public hospitals in Rio de Janeiro. Throughout his career, he was continually investigating new treatments, collecting case studies, and participating and publishing in medical conferences, despite never having held an academic or research position. He was dedicated to the practice of medicine but always made time to help advance knowledge in his area of expertise.

Without really being aware of it, I ended up following my father's path and became a practitioner myself. As a practitioner, my list of peer-reviewed publications is notably shorter than most of the previous winners of this award, but every time I had something valuable to share with the academic community, I felt welcomed by our top research conferences, and those articles tended to be well received. Practitioners like myself tend to publish papers in the past tense, reporting on ideas that have been implemented and launched as products. Practitioners can contribute to our community by looking back and showing us how those ideas played out (or not) in practical applications. Commercial success or the lack thereof can be an objective judge of the merits of research ideas; even if cruelly so at times. In giving me this award, the IEEE Computer Society and ACM are highlighting the role of practitioners in our field.

Now, as this award is about the practice of warehouse-scale computing, I should get to that with no further delay.

A BRIEF HISTORY OF WAREHOUSE-SCALE COMPUTING
If it is indeed true that "great poets imitate and improve,"1 poetry and computing may have something in common after all. Warehouse-scale computers (the name we eventually gave to the computers we began to design at Google in the early 2000s) are the technical descendants of numerous distributed computing systems that aimed to make multiple independent computers behave as a single unit. That family begins with VAXclusters2 in the 1980s, a networked collection of VAX computers with a distributed lock manager that attempted to present itself as a single system to the user. In the 1990s, the concept of computing clusters began to be explored using lower end or desktop computers and local area networks, with systems such as NASA's Beowulf clusters3 and UC Berkeley's NOW project.4
Administered jointly by ACM and the IEEE Computer Soci- FOR MANY OF US OUR EARLIEST
ety, the award is given for contributions to computer and digi- MODELS OF PROFESSIONALISM
tal systems. In 2020, my award was given for pioneering the
design of warehouse-scale computing and driving it from COME FROM OBSERVING OUR
concept to industry. PARENTS’ APPROACH TO THEIR
WORK. THAT WAS THE CASE FOR ME
OBSERVING MY FATHER, A SURGEON
WORKING IN PUBLIC HOSPITALS IN
0272-1732 ß 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3055379 RIO DE JANEIRO.
Date of current version 26 March 2021.
When I arrived at Google, in 2001, I found a company of brilliant programmers that was short on cash but not on confidence, as they had already committed to a strategy of systems built from inexpensive desktop-class components. Cheap might be a fairer characterization of those early systems than inexpensive. The first generation of those computer racks, tenderly nicknamed "corkboards," consisted of desktop motherboards loosely resting on sheets of cork that isolated the printed circuit boards from the metal tray, with disk drives themselves loosely resting on top of DIMMs.

Despite my hardware background, I had joined Google to try to become a software engineer. In my early years, I was not involved in building computers but instead worked developing our index searching software and related software infrastructure components such as load balancers and remote procedure call libraries. Three years later, Urs Hölzle asked me to build a hardware team capable not only of building sound server-class systems but of inventing new technologies in the datacenter space. The years I had spent in software development turned out to be extremely useful in this new role, since my first-hand understanding of Google's software stack was essential to architecting the machinery needed to run it. We published some of those early insights into the architectural requirements for Google-class workloads in an IEEE Micro paper in 2003.6

co-location facilities, so we had to build our own facilities in order to continue to grow our services.

At that point, it became evident to us how much room for improvement there was in the design of datacenters. In the third-party hosting business, datacenters were put together by groups of disjoint engineering crafts that knew little of each other's disciplines: civil engineers built the building, mechanical engineers provisioned cooling, electrical engineers distributed power, hardware designers built servers, and software engineers wrote internet services. The lack of cross-disciplinary coordination resulted in facilities that were both expensive and incredibly energy inefficient. Our team's lack of experience in datacenter design may have been an asset as we set out to question nearly every aspect of how these facilities were designed. Perhaps most importantly, we had the chance to look at the entire system design, from cooling towers to compilers, and that perspective quickly revealed significant opportunities for improvement.

Speed of deployment was also a critical factor in those days, as we were often running dangerously close to exhausting our computing capacity as our traffic grew, so our initial approach was to prefabricate ready-to-deploy computer rooms inside forty-foot shipping containers. Containers gave us a datacenter floor where we could isolate the hot (exhaust) plenum from the cold aisle and shorten the total distance the air needed to be moved; both were factors that improved cooling efficiency. All that the container needed to function was power, cold water, and networking, and we had a 1200-server machine room ready to deploy.

That original container-based deployment also introduced other innovations that led to cost, performance, and energy efficiency improvements. Here are some of the most notable ones:
step. We instead eliminated the UPS room and introduced per-tray (and later per-rack) batteries. That way, power entering the building only needed to be rectified once per machine.
› Single-voltage rail power supplies: Every server used to be outfitted with a power supply that converted ac voltage into a number of dc voltage rails (12 V, 5 V, 3.3 V, etc.) based on old standards for electronic components. By 2005, most electronic components did not use any of the standard dc rails, so yet another set of dc/dc conversions needed to happen onboard. The allocation of power among multiple rails also lowered power supply efficiency, sometimes below 70%. We introduced a single-rail power supply that reached 90% efficiency and created on board only the voltages actually used by components.
› 1000-port GigE Ethernet switch: Datacenter networking bandwidth was beginning to become a bottleneck for many warehouse-scale applications, but enterprise-grade switches were not only very expensive but also lacked offerings for large numbers of high-bandwidth endpoints. Using a collection of inexpensive edge switches configured as a multistage network, our team created the first of a family of distributed datacenter networking products (codenamed Firehose) that could deliver a gigabit of nonoversubscribed bandwidth to up to a thousand servers.

Although our adventure with shipping containers lasted only that one generation, and soon after we found ways to obtain the same efficiencies with more traditional building shells, the innovations from that first program have continued to evolve into industry-leading solutions over generations of warehouse-scale machines. Figure 1 shows a birds-eye view of a modern warehouse-scale computer.

MY JOURNEY
I knew I wanted to be an electrical engineer when I was 8 years old and got to help my grandfather work on his HAM radio equipment. Putting aside the fact that eight-year-olds should not be making career choices, I find it difficult to question that decision to this date. Although I had always been a good student, I struggled a bit during my Ph.D. and graduated late. I did have a few things going for me: an ability to focus, stamina for hard work, and a lot of luck. As an example, after a 24-year drought, the Brazilian men's national soccer team chose to win a World Cup during my hardest year in graduate school, delivering a degree of joy that was badly needed to get me to the finish line. Less than a year after that World Cup I was working in my grad student office on a
Saturday afternoon when I got a call from Norm Jouppi inviting me to interview for a research job at Digital Equipment's Western Research Lab (WRL). At the time, Norm was already one of the most highly respected computer architects in the world, and perhaps nothing in my career since has compared to the feeling I had that day—Norm Jouppi knew who I was!

I joined DEC WRL and had the chance to learn from top researchers like Kourosh Gharachorloo and collaborate with leading computer architects such as Sarita Adve, Susan Eggers, Mateo Valero, and Josep Larriba-Pey. During that time, I also met Mark Hill, who would become a friend and a mentor. Later, at Google, I would also have the chance to coauthor papers with other leading figures in our field, such as Tom Wenisch, Wolf Weber, David Patterson, and Christos Kozyrakis.

Perhaps nothing summarizes the impact that friends and luck can have in your life more than the story of how I came to join Google. As I was trying to make a decision between two options, Jeff Dean asked me whether the other company I was considering had also served me crème brûlée during my interviews. I thanked Jeff and accepted the Google offer the very next morning.

The brilliance and generosity of countless people at Google have been essential to the work that led to this award, but I must highlight three here: Urs Hölzle, who has been a close collaborator and possibly the single person most to blame for Google's overall systems infrastructure successes; Bart Sano, who managed the Platforms team that built out the infrastructure we have today (I was the technical lead for Bart's team for many years); and Partha Ranganathan, who is our computing technical lead today and is taking Google's architectural innovation into the future.

One part of my career I have no hesitation to brag about is the quality of the students I have had a chance to host as interns at DEC and Google. They were (to date) Partha Ranganathan, Rob Stets, Jack Lo, Sujay Parekh, Ed Bugnion, Alex Ramirez, Gautham Thambidorai, Karthik Sankaranarayanan, David Meisner, and David Lo. We worked together on many fun projects, and I hope for more in the future. Although my dad is no longer with us, I am also fortunate to count on the love and support of my family: my mom Cecilia, my godmother Margarida, my siblings Paula, Tina, and Carlos and their families, and my wife Catherine Warner, who is the award life gives me every single day.

THREE LESSONS
I will finish this essay by sharing with you three lessons I have learned in this first half of my career, in the hope that they may be useful to engineers who are at an earlier stage in their journey.

Consider the Winding Road
As an engineer you stand on a foundation of knowledge that enables you to branch into many different kinds of work. Although there is always risk when you take on something new, the upside of being adventurous with your career can be amazingly rewarding. I for one never let my complete ignorance about a new field stop me from giving it a go.

As a result, I have worked in areas ranging from chip design to datacenter design; from writing software for web search to witnessing my team launch satellites into space; from writing software for Google Scholar to using ML to automatically update Google Maps; from research in compiler optimizations to deploying exposure notification technology to curb the spread of Covid-19.8

It seems a bit crazy, but not going in a straight line has worked out really well for me and resulted in a rich
set of professional experiences. Whatever the outcome, you will be inoculated from boredom.

Develop Respect for the Obvious
The surest way to waste a career is to work on unimportant things. I have found that big, important problems have one feature in common: they tend to be straightforward to grasp even if they are hard to solve. Those problems stare you right in the face. They are obvious, and they deserve your attention.

Let me give you some examples by listing some of my more well-cited papers next to the formulation of the problems they address:

Publication: ISCA '98, "Memory System Characterization of Commercial Workloads,"10 with Kourosh Gharachorloo and Edouard Bugnion
Problem addressed: "High-end microprocessors are being sold to run commercial workloads, so why are we designing them for number crunching?"

Publication: ISCA '00, "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,"5 with Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese
Problem addressed: "Thread-level parallelism is easy. Instruction-level parallelism is hard. Should we bet on thread-level parallelism then?"

Publication: CACM '17, "The Attack of the Killer Microsecond,"11 with Mike Marty, Dave Patterson, and Partha Ranganathan
Problem addressed: "If datacenter-wide events run at microsecond speeds, why do we only optimize for millisecond and nanosecond latencies?"

Publication: CACM '13, "The Tail at Scale,"12 with Jeff Dean
Problem addressed: "Large-scale services should be resilient to performance hiccups in any of their subcomponents."

Publication: IEEE Computer '07, "A Case for Energy-Proportional Computing,"13 with Urs Hölzle
Problem addressed: "Shouldn't servers use little energy when they are doing little work?"

If it takes you much more than a couple of sentences to explain the problem you are trying to solve, you should seriously consider the possibility that it is not important enough to be solved.

Even Successes Have a "Sell-By" Date
Some of the most intellectually stimulating moments in my career have come about when I was forced to revisit my position on technical matters that I had invested significant time and effort in, especially when the original position had a track record of success. I will present just one illustrative example.

I joined Google after a failed multiyear chip design project, and as such I immediately embraced Google's design philosophy of staying away from silicon design ourselves. Later, as the technical lead of Google's datacenter infrastructure, I consistently avoided using exotic or specialized silicon even when it could demonstrate performance or efficiency improvements for some workloads, since betting on the low cost base of general-purpose components consistently proved to be the winning choice. Year after year, betting on general-purpose solutions proved successful.

Then, deep learning acceleration for large ML models arose as the first opportunity in my career to build specialized components that would have both broad applicability and dramatic efficiency advantages when compared to general-purpose designs. Our estimates indicated that large fractions of Google's emerging AI workloads could be executed in these specialized accelerators with as much as a 40× cost/efficiency advantage over general-purpose computing.

That was a time to ignore the past successes of betting on general-purpose off-the-shelf components and invest heavily in the design and deployment of our own silicon to accelerate ML workloads. Coming full circle, this meant that it was now my time to call Norm Jouppi and ask him to join us to become the lead architect for what was to become our TPU accelerators program.

CONCLUDING
Before the onset of the current pandemic, some of us may have underappreciated how important computing technology and cloud-based services have become to our society. In this last year, these technologies have allowed many of us to continue to work, to connect with loved ones, and to support each other. I am grateful to all of those at Google and everywhere in our industry who have built such essential technologies, and I am inspired to be working in a field with still so much potential to improve people's lives.
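The "Tail at Scale" problem statement in the table above rewards a quick back-of-the-envelope check. The sketch below is my own illustration, not code from the essay; it computes how often a request that fans out to many leaf servers sees at least one slow leaf, using the paper's well-known example of a 1% per-server tail:

```python
# Illustration of the "tail at scale" effect: a fanned-out request is slow
# whenever ANY of the leaf servers it touches is slow.
def p_request_slow(p_leaf_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent leaves is slow."""
    return 1.0 - (1.0 - p_leaf_slow) ** fanout

# One leaf is slow only 1% of the time...
print(f"{p_request_slow(0.01, 1):.2f}")    # -> 0.01
# ...but a request touching 100 leaves is slow about 63% of the time.
print(f"{p_request_slow(0.01, 100):.2f}")  # -> 0.63
```

This arithmetic is why the paper argues that large services must tolerate component-level hiccups rather than try to eliminate them.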
DEPARTMENT: MICRO ECONOMICS

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3060295
Date of current version 26 March 2021.

My favorite panel from Randall Munroe's one-panel comic, xkcd, is labeled "duty calls." It shows a lone stick figure at a desk, hunched over, intensely staring at his computer screen, while engaging in a staccato conversation with an offscreen companion. She says: "Are you coming to bed?" Him: "I can't. This is important." Her: "What?" Him: "Someone is wrong on the internet."

Like all good satire, this comic elicits both laughs and winces. Nobody would ever engage in this behavior in any physical place where a veneer of social politeness predominates, such as standing in line at a cashier or sitting in an airplane seat. On the internet, surfers jettison much of their social restraint, confronting and correcting perfect strangers. It leads to, for example, edit wars on Wikipedia, condescending insults on Reddit, and righteous putdowns on Twitter.

This behavior invites plenty of legal analysis, angry editorializing, and technological proposals, but rarely economic analysis. Let us address that gap. What economic factors make confrontational conversation more or less likely in our era?

TENSIONS WITH COMPROMISE
The first piece of relevant economics is the low cost of scale. It is inexpensive to host terabytes of data, and dirt-cheap to serve millions of users. Software can be replicated many times at almost no cost, making it possible for a platform to scale.

The second relevant economic factor complements scale, and goes by the label "network effects." These are self-reinforcing advantages affiliated with being a focal platform. Simplifying, a platform becomes focal in one of two ways. In one form, a platform attracts more apps or content, and that attracts more users, which then attracts more apps or content, and so on. In another form, a platform attracts more sellers, which attracts more buyers, which attracts more sellers, and so on. In either case, those attractions become self-reinforcing.

A third crucial piece of economics shapes focal platforms with large scale. Some confrontation is inevitable, and it undermines the functionality of most (not all!) platforms. It is possible to write volumes on how to address confrontation with human moderation or algorithmic processes, and whether specific practices are legal or effective. Take a step back from that discussion, and recognize the broad economic facts: no matter how it gets implemented, the processes are expensive and imperfect.

Two features of the present era drive up those costs and exacerbate the imperfections. Anything with long video is inherently expensive to moderate, as managers at YouTube and Facebook can attest. In addition, bots and misinformation farms have flooded the modern experience, especially at focal platforms—just ask Twitter and Facebook. Holding those in check has become an endless and expensive game of whack-a-mole for large platforms.

More to the point, addressing content moderation has become this era's key to achieving scale. Different platforms have taken different approaches to this challenge, and each approach comes with different upsides and drawbacks.
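The self-reinforcing loop described under "network effects" can be made concrete with a toy simulation. The model and its parameters below are my own illustrative assumptions, not anything from this column: each round, content mirrors the user split and users drift toward the platform with more content, so even a small early lead compounds into focal-platform dominance.

```python
# Toy model of network effects (all parameters are illustrative assumptions):
# two competing platforms; content mirrors the user split, and users drift
# toward the platform with more content, so a small lead is self-reinforcing.
def final_share(share_a: float, rounds: int = 20, pull: float = 0.3) -> float:
    """Platform A's user share after `rounds` of the feedback loop."""
    for _ in range(rounds):
        content_gap = share_a - (1.0 - share_a)  # content mirrors users
        share_a += pull * content_gap * share_a * (1.0 - share_a)
        share_a = min(max(share_a, 0.0), 1.0)    # shares stay in [0, 1]
    return share_a

print(round(final_share(0.50), 2))  # a dead heat stays a dead heat: 0.5
print(round(final_share(0.55), 2))  # a small lead compounds toward dominance
```

The point of the sketch is only qualitative: with positive feedback, near-ties are unstable, which is why focal platforms tend to emerge at all.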
and it may not be yours, but there is a large supply of these breakaways.

During the election, a new type of breakaway emerged when political opinion shapers couched their appeal in righteous patriotism. Love them or hate them, their experience illustrates the economics.

It started with, say, Alex Jones and Infowars, who got banned for going beyond the pale. Many mainstream sites thought it was the right thing to do, and it aligned with the economics because the losses of users were small compared to the risks of offending too many disgusted customers. Once banned, Alex Jones took his users with him and formed his own community elsewhere.

ban Parler's app, consistent with its sanitizing approach. AWS eventually took a similar action after notifying Parler of its concerns about a lack of moderation. Once Parler gained prominence as a coordinator of the Capitol riot, AWS saw no benefit to having its brand associated with this group, and that was that.

Parler's breakaway did not go well because management cut many corners and did not design a hack-proof site, but here is the rub: Parler is trying again. Even if they fail again, it is a good bet that somebody else will get a version of this up and running.

Summarizing, the increasing frequency of breakaways is a symptom that they are becoming cheaper to build. Ergo, we should expect mainstream sites to face increasing pressures towards fragmentation.