ComputingEdge
Your one-stop resource for industry hot topics,
technical overviews, and in-depth articles.
IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters: Three Park Ave., 17th Floor, New York,
NY 10016; IEEE Computer Society Headquarters: 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office: 10662
Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Membership
Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses to
4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854 or pubs-permissions@ieee.org. ©2021 by IEEE. All rights reserved. Abstracting and
library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
GUEST EDITORS' INTRODUCTION
Best Papers From Hot Chips 32
Priyanka Raina and Cliff Young

MARCH/APRIL 2021

Theme Articles
General Interest
The Design Process for Google's Training Chips: TPUv2 and TPUv3
Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman Jouppi, and David Patterson

www.computer.org/micro
ISSN: 0272-1732
FROM THE EDITOR-IN-CHIEF

Welcome to the March/April 2021 issue of IEEE Micro. This issue features selected articles from the Hot Chips '32 Symposium, held virtually in August 2020. COVID-19 forced Hot Chips '32 to be a virtual event; however, the chips are more interesting and powerful than ever! Whether it is graphics acceleration or a sheer increase in traditional compute capability, the chip design arena has become hotter than ever. A lot of money is pouring into designing both special-purpose and general-purpose chips. IEEE Micro is pleased to present seven selected articles based on the presentations at the Hot Chips Symposium for our readers. Priyanka Raina of Stanford University and Cliff Young of Google served as guest editors for this special issue. They have compiled an excellent selection of articles on emerging chips and systems from the Symposium, including articles on IBM Power10, Marvell ThunderX3, Xbox Series X, NVIDIA A100, the Manticore 4096-core chiplet, the Pensando Distributed Services Architecture, and the TensTorrent compute substrate for Software 2.0. Please read the Guest Editors' Introduction to get a preview of the seven articles. Thanks to the editors, authors, and reviewers who worked hard to put this issue together.

In addition to the aforementioned seven Hot Chips articles, there are two General Interest articles. The first one is on the design process for Google's TPUv2 and TPUv3 training chips, written by Norrie et al. from Google. This was also originally presented at Hot Chips; however, because the authors included one of the guest editors, it was considered a General Interest article, and the review process was coordinated separately. The article describes Google's approach to machine learning (ML) hardware, and provides details on the scalar computation unit, the vector computation unit, the matrix computation units, the memory system, the interconnect, and the floor plan of TPUv2. The enhancements in TPUv3 compared to TPUv2 are discussed next, and the performance is compared using roofline plots.

The second General Interest article is "Klessydra-T: Designing Vector Coprocessors for Multithreaded Edge-Computing Cores" by Cheikh et al. of Sapienza University of Rome. This work addresses the introduction of coprocessor acceleration in interleaved multithreading (IMT) cores for extreme edge computing. Specifically, it explores possible alternatives to implement vector coprocessing units in RISC-V cores, showing the synergy between IMT and data-level parallelism in edge-computing applications.

This issue also features three department articles. Michael Mattioli of Goldman Sachs, who joined the IEEE Micro Editorial Board as a new Department Editor, has coauthored an article with Atte Lahtiranta on the hidden capabilities of video game consoles. The authors describe non-video-game capabilities of video game consoles such as web browsing, video conferencing, and audio/video/document content creation. They posit that video game consoles are ideal for enterprise deployment, with security features and defense against a wide range of threats. The Microsoft Xbox Series X architecture is described with emphasis on the security processor, which is housed in the security complex with a crypto engine, random number generator, secure RAM, secure ROM, security fuse bank, and side-channel monitors. Please read the article to appreciate these less-discussed features of video game consoles.

This issue additionally includes an Award article from Luiz Andre Barroso of Google, who received the 2020 Eckert-Mauchly Award. In his article, the author provides a brief history of warehouse-scale computing. He describes the progression of datacenter computing as it evolved during the last two decades, and also describes his personal journey as a computer engineer. He concludes the article by acknowledging how the pandemic has made many realize the importance of computing technology and cloud-based services, and how these have allowed us to continue to work and live.

Another article presented in this issue is a Micro Economics column by Shane Greenstein, "The Economics of Confrontational Conversation," discussing how confrontational conversations are commonplace on the internet. Greenstein focuses on the economics relevant to such confrontational conversations. One economic factor is that it is inexpensive to host terabytes of data. Additionally, simple focal platforms attract more users, which then attracts more apps and content. The third relevant economic fact is that the mechanisms to address confrontation, whether human moderation or algorithmic processes, will be expensive.

There have been some additions/enhancements to the IEEE Micro Editorial Board this month. Dr. Vijaykrishnan Narayanan of Penn State has been promoted to Associate Editor-in-Chief of IEEE Micro. Michael Mattioli of Goldman Sachs will serve as a Department Editor for Security and Product Reviews. Prof. Guido Araujo of the University of Campinas (Brazil) joins the Editorial Board this month as an Associate Editor. I look forward to working with all of them and bringing you an even more interesting reading experience.

Let me also provide an overview of what to expect in upcoming issues. The May/June issue will be the popular "Top Picks" Special Issue, which presents the best of the best from articles in computer architecture conferences in 2020. Prof. Daniel Jimenez of Texas A&M University and a selection committee from industry and academia have selected 12 papers from more than 100 articles that were submitted in response to the Top Picks call for papers. Readers can look forward to an amazing collection of excellent articles in May/June.

Many thematic special issues are planned for the remainder of 2021. Themes include quantum computing, FPGA computing, in-memory computing, and smart agriculture. The July/August issue will be a Special Issue on Quantum Computing, guest edited by Ulya Karpuzcu of the University of Minnesota. The FPGA Computing Special Issue will be guest edited by Maya Gokhale of Lawrence Livermore National Laboratory and Lesley Shannon of Simon Fraser University, Canada. The In-Memory Computing Special Issue will be guest edited by Reetuparna Das of the University of Michigan. We also have an open call for smart agriculture, focusing on the use of artificial intelligence and IoT in agriculture, guest edited by Neeraj Kumar of Thapar University and Sudip Misra of IIT Kharagpur.

We invite readers to submit to these Special Issues. Please find the open calls at:

› https://www.computer.org/digital-library/magazines/mi/call-for-papers-special-issue-on-processing-in-memory
› https://www.computer.org/digital-library/magazines/mi/call-for-papers-special-issue-on-ai-edge-and-iot-for-smart-agriculture

IEEE Micro is interested in submissions on any aspect of chip/system design or architecture. Hope you enjoy the articles presented in this issue. Happy reading!

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3064094
Date of current version 26 March 2021.

LIZY KURIAN JOHN is a Cullen Trust for Higher Education Endowed Professor with the Electrical and Computer Engineering Department, University of Texas at Austin, Austin, TX, USA. Contact her at ljohn@ece.utexas.edu.
Welcome to our special issue of IEEE Micro, which highlights the best presentations from Hot Chips 32, held virtually on August 16–18, 2020. Like many things in 2020, Hot Chips was unprecedented, going virtual for the first time in its history. Presentations were done by video, with a small production team working in a studio and a virtual conference supplemented by Zoom chatrooms. Despite the switch to the virtual format and the challenges to the global economy, attendance was the highest ever, and the technical program was robust, with strong representation from traditional CPU, GPU, and FPGA manufacturers, and strong offerings from startups in both communication networks and neural networks. This issue collects articles derived from the best talks, chosen after the conference by the Program Committee. It is a great time to work in computer architecture, where the combination of approaching limits to Moore's Law and new transformative application areas means that both incumbent computer architectures and new startups have large contributions to make.

Two of the articles focus on server processors, one from a long-time mainframe manufacturer and one from a potential disruptor of the server business. "IBM's Power10 Processor" describes the newest instance of the POWER architecture from the seminal computing company. Power10 was designed for general-purpose enterprise computing with interconnection across 16 chips in a multiprocessor, with 1-TB/s memory bandwidth per CPU and high-bandwidth links to accelerators including GPUs. By contrast, "Marvell ThunderX3: Next-Generation Arm-Based Server Processor" represents the new wave of ARM-based server-class chips that aim to change the performance and price/performance of the datacenter server market.

Two articles describe chips with significant graphics capability. "The Xbox Series X System Architecture" details the system-on-a-chip that powers Microsoft's latest gaming console, dedicating over two-thirds of its die to the GPU that delivers 4K at 120 frames per second. While graphics remain central to the mission of the article titled "NVIDIA A100 Tensor Core GPU: Performance and Innovation," GPUs have become the default for high performance and programmability in one solution, supporting a huge variety of scientific computing workloads and powering both neural network training and inference.

Bridging between the general-purpose floating-point power of GPUs and the specialized application focus of neural network accelerators, "Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing" uses the extensibility of the RISC-V ISA to reduce control overheads and energy costs for neural-network workloads.

Our final two articles come from startups, both with networking in their roots. "Pensando Distributed Services Architecture" describes their domain-specific architecture (including chips and programmable software) for building new applications and services within a datacenter network. "Compute Substrate for Software 2.0" explains startup TensTorrent's unique architecture for neural network acceleration, which takes a packet-network-inspired approach to control and flexibility, allowing better support for sparsity, varying numerical precision, and compression than older, more monolithic neural network accelerators.

All of the talks from Hot Chips 32 (https://www.hotchips.org/archives/hc32/) are available at the Hot Chips website. Hot Chips is run by a great set of volunteers, including a sophisticated logistical and marketing team in the Organizing Committee and the wonderful set of academic and professional computer architects on our Program Committee. No other conference has the same focus on production computer systems, presented by their designers, sharing how and why they built their chips. We hope you find this issue, and future Hot Chips, as informative and fun as we have.
The IBM POWER10 processor represents the 10th generation of the POWER family of
enterprise computing engines. It is built on a balance of computation and
bandwidth, delivered by powerful processing cores and intrachip interconnect,
respectively. Multiple system interconnect infrastructures support configurations
with up to 16 processor chips and up to 1920 simultaneous threads of execution, as
well as an expansive memory system with up to 2 Petabytes of addressing space.
Cross-system memory sharing and coherent accelerator attach are also supported.
The POWER10 processing core has been significantly enhanced over its POWER9
predecessor, including the addition of an all-new matrix math engine. Throughput
gains from POWER9 to POWER10 average 30% at the core level and three-fold at the
socket level. Those gains can reach ten- or twenty-fold at the socket level for
matrix-intensive computations.
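The capacity figures quoted above follow from simple per-chip arithmetic; a quick sketch, using the 16-socket maximum together with the 15-active-cores-per-die and SMT8 figures given in the body of the article:

```python
# Sanity-check the headline capacity figures for a maximal POWER10 system.
chips = 16                 # largest single-system configuration (16 SCMs)
active_cores_per_chip = 15 # 16 physical cores, one kept as a manufacturing spare
threads_per_core = 8       # each core runs eight simultaneous threads (SMT8)

threads = chips * active_cores_per_chip * threads_per_core
print(threads)  # 1920 simultaneous threads of execution
assert threads == 1920
```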
The IBM POWER10 processor delivers significant gains in capacity and capability over its immediate POWER9 predecessor1,2: an average 20% single-thread performance boost, and a 30% gain in core throughput over a wide range of applications. Combined with a two-and-a-half-fold increase in the number of cores per package, these improvements result in three times or better per-socket throughput on popular integer, floating-point, and commercial workloads, and 2–4 times increased memory bandwidth, depending on memory technology. For matrix math, the gains in performance can reach 10 or 20 times through a new computational engine.

Additional breakthroughs include: a new PowerAXON system interconnect with 1 TB/s of bandwidth per POWER10 chip and support for cross-system memory clustering; a new Open Memory Interface (OMI) that supports multiple industry-standard memory technologies on the same processor chip; a modular building block die that enables systems with up to 1920 simultaneous threads of execution; hardware-enforced security to protect sensitive code and data from attacks; and AI-optimized machine instructions to address the increased computing demands of modern machine learning/deep learning business applications.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3058632
Date of publication 10 February 2021; date of current version 26 March 2021.

POWER10 PROCESSOR CHIP

A POWER10 processor die (see Figure 1) consists of 18 billion transistors in 602 mm² of silicon, compared to 8 billion transistors in POWER9, and is built in Samsung's 7-nm technology with 18 metal layers. The central part of the die, approximately 300 mm², is occupied by 16 enterprise-grade cores, each capable of running eight simultaneous threads of execution (SMT8), and their associated level-2 and level-3 cache regions of 2 and 8 MB, respectively. To better match the supply and demand of processor chips with the maximum number of cores, we cap the number of active cores in a die to 15, keeping one core as a manufacturing spare. This results in up to 120 simultaneous threads of execution, backed by 120 MB of level-3 cache.

The remaining half of the POWER10 processor chip area is dedicated to the system interconnect, including the two protocol spines to the left and right of the core/cache complex, supporting the various interconnect protocols for memory, multiple processors, accelerators, clusters, and I/O. The periphery of the die is filled with high-bandwidth, power-efficient signaling circuits that implement the PowerAXON,3 OMI,3 and PCI Gen5 I/O infrastructures.

Not visible in Figure 1 are the large numbers of communication trunk lines, which run horizontally over the two L3 hemispheres and vertically over the protocol spines. The placement of the L3 hemispheres, the protocol spines, the trunk lines, the intercept of these vertical and horizontal trunk lines, and the location of the protocol spines next to the signaling infrastructure are the result of rearchitecting the chip floorplan around a computation-to-bandwidth balance.

FIGURE 1. POWER10 processor chip. Approximately half the die area is dedicated to cores and caches. The other half is for the various system interconnects, including memory interfaces, SMP, accelerators, clustering, and I/O.

POWER10 processor chips can be packaged in either single- or dual-chip modules (SCM/DCM). The SCM configuration is optimized for scale-up systems and maximizes power, interconnect bandwidth, and memory capacity delivered to each core. It also supports more flexible topologies, allowing configurations with up to 16 processor chips. The DCM configuration is optimized for scale-out systems and maximizes computational and I/O density while trading off the power and memory per core compared with the SCM. It limits configurations to a maximum of four DCMs (eight processor chips).

The POWER10 chip introduces new security features for cloud paradigms that extend trusted virtualization environments to include protected containers and include in-line memory encryption and application-level protections against attacks.

POWER10 SYSTEMS

POWER10 systems are built around the three interconnect infrastructures shown in Figure 2: the OMI, for connecting processors to memory; the PowerAXON interface, for interconnecting processor chips to other processors and accelerators and for implementing cross-system memory clustering; and PCI Gen 5 for I/O and other system interconnect.

PowerAXON and OMI signaling runs at rates of up to 32 GT/s. With a combined 256 bidirectional lanes, this results in up to 2 TB/s of total bandwidth on a processor chip, with 128 lanes and 1 TB/s for each interface. These are shown to the left and right of the POWER10 chip in Figure 2, respectively. (See Figure 1 for physical placement of the interfaces. Each PowerAXON corner has 32 lanes plus 4 spares.)

FIGURE 2. POWER10 system interconnect. OMI is used for attaching memory to the processor. PowerAXON provides the SMP, clustering, and accelerator interfaces. PCIe Gen5 is used for I/O and other interconnect.
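The 2-TB/s figure follows directly from the lane counts and signaling rate; a quick check, assuming each lane carries one bit per transfer in each direction (raw rate, ignoring any encoding overhead):

```python
# Per-interface bandwidth for PowerAXON or OMI on a POWER10 chip.
lanes_per_interface = 128  # 256 bidirectional lanes combined across both interfaces
gt_per_s = 32              # gigatransfers per second per lane, 1 bit per transfer

gb_per_s_one_direction = lanes_per_interface * gt_per_s / 8  # bits -> bytes
tb_per_s_bidirectional = 2 * gb_per_s_one_direction / 1000   # both directions

print(gb_per_s_one_direction)  # 512.0 GB/s each way per interface
print(tb_per_s_bidirectional)  # ~1 TB/s per interface, so ~2 TB/s for both
assert gb_per_s_one_direction == 512.0
```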
OMI is a technology-agnostic memory interface based on open standards. Memory is attached to the processor chip through an OMI-compliant buffer chip, which encapsulates technology-specific requirements, as first introduced in IBM's POWER8 processor and its companion Centaur memory buffer chip.4 POWER10 systems will initially use DDR4 memory, through a buffer chip built by Microchip.3 This buffer chip implements 25.6-GT/s signaling over an 8-bit interface, which matches its 8-byte DDR4 channel operating at 3.2 GT/s. With 16 channels in use, up to 410 GB/s of peak DDR4 bandwidth can be achieved per POWER10 chip, with a latency that is only 10 ns over traditional DDR4 DIMMs. DDR5 memory DIMMs can be supported later through a future buffer chip.

Alternative memory technologies can also be deployed with POWER10 processors using OMI. As shown in the right half of Figure 2, both high-bandwidth GDDR memory and high-capacity nonvolatile storage-class memory can be connected to the same OMI channels through corresponding buffer chips and using standard OMI DIMM form factors. A fully populated 16-channel GDDR configuration would achieve over 800 GB/s of memory bandwidth to a single POWER10 processor. This approaches the bandwidth achieved with high-bandwidth memory (HBM), but at higher capacities and lower cost. Alternatively, a storage-class (nonvolatile) memory solution could achieve capacities in the terabytes-per-DIMM range.

The PowerAXON infrastructure is used for system scaling, including multiprocessor interconnect, device attach, and memory clustering. The largest single-system configuration consists of 16 SCMs, interconnected as shown in the lower left corner of Figure 2. Each module is at most two hops away from any other module. A system with this configuration can run up to 1920 simultaneous threads of execution and contain up to 256 OMI DIMMs (16 DIMMs attached to each of 16 SCMs), with a maximum capacity of 1 Petabyte of memory. (POWER10 processors have a 2-Petabyte physical memory address space.)

PowerAXON can also be used to support OpenCAPI, an open, asymmetric protocol for coherently attaching compute accelerators, memory devices, network interfaces, and storage controllers, either in a device slot or a cabled external enclosure. Since its introduction with POWER9, a variety of vendors have provided OpenCAPI-attached devices that expand and enhance the functionality of POWER systems. POWER10 OpenCAPI provides a new level of performance and functionality over the prior version.

The third and final functionality of PowerAXON in POWER10 that we discuss in this article is memory clustering, shown in the top left corner of Figure 2. This new feature of POWER10, which is called memory inception, delivers the long-sought functionality of server disaggregation. Memory inception enables systems to directly share each other's main memory. The latency through memory inception is 50–100 ns over that of a remote (2-hop) socket within a server, and it is still low enough to be used as main memory. The protocol for memory inception is built on top of the OpenCAPI protocol and
different from the SMP protocol used to build (up to) 16-socket systems. Memory inception does not implement a cache coherence scheme and is not meant to enable larger single-system-image configurations. Rather, the goal is to allow one server to map its address space to the physical memory of another server.

As a scenario for using memory inception, consider the case of a cluster of homogeneous servers, each with enough memory for the average workload. By borrowing memory from other machines, a hosting system can run large memory workloads that go beyond the capacity of any single server. Another scenario is a hub-and-spoke configuration, in which a very large central server has a big pool of memory, distributed as needed across a large set of much smaller machines. This combines the cost efficiency of small machines with the memory capacity of a much larger server.

Memory inception can also be used as the message layer for a large cluster of POWER10 servers. Combined with the processor's 2-Petabyte address space, memory inception can use the address translation facilities in each server to create a multihop interconnect, with messages delivered simply by writing to the target memory. A robust, fully hardware-managed end-to-end message capability is possible in clusters with thousands of nodes, delivering high bandwidth, low latency, and flexible topologies.

The final component of the POWER10 system interconnect, shown in the bottom of Figure 2, is the set of PCIe Gen 5 interfaces. PCIe is central to the I/O infrastructure of POWER10 systems, and up to 64 lanes are available in a DCM (32 lanes per chip). With a signaling rate of 32 GT/s, a single DCM can achieve 252 GB/s of I/O bandwidth in each direction.

POWER10 CORE

The POWER10 core is the processing engine that runs both system and user software, responsible for the computational capacity and capability of POWER10 systems. There are two focus areas in the design of the POWER10 core: performance strength and power efficiency. A 30% average increase in core throughput while cutting power consumption in half combine to deliver a 2.6-fold average increase in energy efficiency for computations. The increased energy efficiency has allowed the implementation of DCMs with up to 30 SMT8 cores, and up to a three-fold throughput over current POWER9 modules with similar power consumption. The POWER10 core retains the modular architecture from POWER9 that provides a second variant of the chip with twice as many SMT4 cores per chip (up to 60 per DCM).

The microarchitecture of the POWER10 core, together with key factors affecting its performance and power efficiency, is shown in Figure 3. The block diagram shows those microarchitecture resources available for the execution of one to four simultaneous threads, corresponding to half of the total resources in an SMT8 core. POWER10 core components colored in green were somewhat improved in capacity over the predecessor POWER9 core. Those colored in blue had their capacity at least doubled, and those in red had their capacity at least quadrupled. These additional resources, along with various other improvements in latency and microarchitecture, are responsible for the 30% average increase in core throughput and a much higher boost in performance in some cases.

Each POWER10 SMT8 core has an associated 2-MiB L2 cache that provides both instructions and data and is four times the capacity of POWER9. For each half of the core, instructions are fetched at a sustained rate of up to 32 bytes per cycle and predecoded before being installed in a 48-KiB instruction cache (50% more capacity than POWER9). During the predecode stage, select pairs of instructions can be identified for fusion into a single internal operation of the microarchitecture, which leads to a faster and more efficient execution of those instructions. The new 64-bit prefix instructions in Power ISA 3.15 are also identified in that stage. POWER10 then decodes and dispatches to the execution slices up to 8 instructions per cycle per thread, or 16 instructions per cycle per SMT8 core. This represents a 33% increase in dispatch rate when compared to POWER9. Over a thousand instructions can be in flight, from dispatch to commit, in a POWER10 SMT8 core, representing a doubling of the out-of-order execution capabilities over POWER9. The translation lookaside buffer (TLB) has been increased four-fold to 8192 entries per SMT8 core, while at the same time reducing the latency and increasing throughput over POWER9.

The four execution slices of POWER9 have been widened to 128 bits each. This has resulted in a doubling of the general SIMD rate of execution, to a maximum of four SIMD instructions per cycle per thread or up to 8 SIMD instructions per cycle per SMT8 core. Crypto processing in the execution slices has also been enhanced, with an overall four-fold gain in throughput from the POWER9 to the POWER10 core.

A single thread of execution can load up to two 32-byte data chunks per cycle from the L1 cache,
FIGURE 3. POWER10 core microarchitecture. The boxes on the left show the improvements over POWER9 on performance and power efficiency, respectively. The latency numbers include both absolute values and improvements over POWER9.
with a total SMT8 core load bandwidth of 128 bytes per cycle. (The same bandwidth can also be achieved from the L2 cache.) A single thread of execution can store up to four instructions per cycle by gathering from up to two store queue entries when each entry includes a fused store operation. Stores always target the L2 cache, and the maximum bandwidth is 32 bytes per cycle per thread or 64 bytes per cycle per SMT8 core.

Complementing the four general-purpose execution slices, POWER10 introduces a new matrix math accelerator (MMA) unit, optimized for the execution of new matrix instructions in Power ISA 3.1. The instructions perform BLAS2- and BLAS3-class operations on eight 512-bit accumulator registers that are added to the architecture. The instructions use either two or three 128-bit vector-scalar registers to perform rank-1, -2, -4, or -8 updates on either a 4×2 or 4×4 matrix stored in an accumulator. Each input vector-scalar register contains either a 2×1 vector of double-precision elements, a 4×1 vector of single-precision elements, a 4×2 matrix of 16-bit elements (half-precision floating-point,6 bfloat16,7 or signed integer), a 4×4 matrix of 8-bit elements (signed/unsigned integer), or a 4×8 matrix of 4-bit elements (signed integer).

The MMA microarchitecture reduces data switching by storing the accumulators locally in the unit itself, significantly reducing the total data movement (bits × distance) when compared to an equivalent 512-bit SIMD operation. The result is improved power efficiency and higher frequency, enabling the POWER10 core to achieve a four-fold increase in matrix math throughput compared to the POWER9 core.

In addition, a focus on power efficiency dominated many other elements of the POWER10 core microarchitecture and design. When compared to the POWER9 core, there is more use of clock gating and an emphasis on reducing data switching. The branch prediction accuracy has been improved, which results in less wasted work and improves thread latency. Instruction fusion also helps with both performance and power efficiency, by combining multiple instructions into fewer operations. POWER10 supports both independent and dependent forms of fusion. Dependent fusion combines the execution of two instructions that share a register dependence into a single operation (with no dependent latency) or a latency-optimized pair of operations, whereas independent fusion enables the combining of loads or stores to adjacent memory locations into a single wider access, reducing resource consumption and conflicts.

The register file for the general-purpose and vector-scalar registers requires four times fewer write-
FIGURE 4. POWER10 core speeds and feeds. Load/store and SIMD bandwidth have been doubled over POWER9, matching SIMD and load throughputs. The matrix math unit offers increased throughput in computationally intensive operations.
FIGURE 5. POWER10 general purpose socket performance gains. The three-fold improvement in performance comes from a combination of an increased number of cores and more powerful cores. DDR5 will double memory bandwidth in the future.
…derived from presilicon simulations and have been correlated against first-pass silicon. We do not yet have the final version of the chips and the results reflect the projected frequency of operation for production POWER10 parts. The figures are for a dual-socket POWER10 system relative to a dual-socket POWER9 S924 server. We observe a three-fold improvement in performance across integer (SPECint2017_rate), floating-point (SPECfp2017_rate), and commercial benchmarks. For memory streaming benchmarks, the POWER10 gains over POWER9 range from two- to four-fold, using DDR4 and DDR5 memory, respectively.

For computations that are heavy on matrix math, the gains from POWER9 to POWER10 are even more substantial, as shown in Figure 6. LINPACK is expected to run ten times faster in POWER10 than POWER9, when compared socket-to-socket. The same is expected for a single-precision floating-point implementation of the Resnet-50 benchmark. When some of the new mixed-precision math features of POWER10 are taken into account, our evaluation shows that Resnet-50 will execute up to 15 (with the bfloat16 data type) or 20 (with the 8-bit integer data type) times faster than the standard single-precision Resnet-50 in POWER9.
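These matrix-math gains come from the MMA's rank-update instructions described earlier. As an illustrative sketch only (this is not IBM's implementation, and register packing is simplified away), a rank-k update accumulates the sum of k outer products into a 4×4 accumulator:

```python
import numpy as np

# Hypothetical sketch of an MMA-style rank-k outer-product accumulation.
# In hardware the inputs come packed in 128-bit vector-scalar registers;
# here they are plain arrays.
def rank_k_update(acc, a, b):
    """acc: 4x4 accumulator; a: 4xk input; b: kx4 input; k = 1, 2, 4, or 8."""
    acc += a @ b          # sum of k outer products, accumulated in place
    return acc

acc = np.zeros((4, 4), dtype=np.float32)
a = np.ones((4, 2), dtype=np.float32)   # e.g., a 4x2 tile of 16-bit-class elements
b = np.ones((2, 4), dtype=np.float32)
rank_k_update(acc, a, b)                # each accumulator element gathers k=2 products
```

Because one pair of register reads produces a full 4×4 tile of multiply-adds, each operand fetch is amortized across 16 results, which is the data-movement saving the article attributes to keeping the accumulators local to the MMA unit.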
FIGURE 6. POWER10 SIMD/AI socket performance gains. The matrix math accelerator delivers four times the throughput of POWER9 SIMD. Combined with two-and-a-half times the number of cores, it results in a ten-fold improvement in socket throughput for computationally intensive operations. Further gains are possible with the new reduced-precision operations.
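The ten-fold socket figure in the caption is simple multiplicative scaling of the two ratios quoted (four times the per-core matrix throughput, two-and-a-half times the cores); the arithmetic:

```python
# Socket-level projection from the two ratios quoted in the caption
# (illustrative arithmetic only).
per_core_matrix_speedup = 4.0   # MMA vs. POWER9 SIMD, per core
core_count_ratio = 2.5          # 2.5x the cores per socket
socket_speedup = per_core_matrix_speedup * core_count_ratio
print(socket_speedup)   # 10.0
```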
ThunderX3 is the latest server chip from Marvell based on the Arm instruction set architecture (ISA), manufactured in TSMC 7 nm.

Server processors constitute a segment of the overall processor market. They are typically deployed in racks in temperature-controlled warehouse settings called datacenters and are accessed from client devices for a variety of purposes including checking email, web search, and cloud services. Historically, several ISAs have been used in server processors including IBM mainframe ISAs, SPARC, PA-RISC, and Itanium. While there are still important residual islands that use these other architectures, most servers today run on the x86 ISA created by Intel. The shift to x86 was driven by the low cost of x86 servers and the availability of Linux, a mature, open source operating system. The low cost of x86 servers was enabled by the economies of scale resulting from the use of the x86 architecture in PCs. However, processor shipment volumes have now shifted from PCs to cell phones where the Arm ISA is almost universally used. Marvell and other companies are riding the mobile wave to enter the server market with the Arm ISA.

The design of server processors is focused on reducing the total cost of ownership (TCO) at the datacenter—that is, to provide the required service at the lowest total dollar cost, including both the cost to buy and the cost to run. From the perspective of processor design, TCO optimization translates to optimizing performance per dollar and performance per watt at the platform level. TCO optimization is a multidimensional problem and different datacenters take different approaches. Typically, there is a maximum power [referred to as thermal design power (TDP)] that datacenters specify for a processor—the datacenter design requires that the processor stay under this TDP under all circumstances. The design goal for a server processor is then to achieve the maximum possible performance while staying under this TDP. There are die area limits as well to manage the manufacturing cost of the die. The workload that is often used as an initial performance gate is the CPU benchmark from the SPEC organization.1 Both single thread and rate performance of the CPU benchmark are important—single thread as a measure of the responsiveness of the system and rate as a measure of the throughput capability. The CPU benchmark from SPEC is a good overall measure of the capability of a server chip but may not always correlate directly to performance seen on customer applications.

Focus areas of server processor design, thus, include optimizing single thread performance as measured by the Integer CPU single thread benchmark from SPEC, optimizing socket level performance as measured by the Integer CPU rate benchmark from SPEC, optimizing performance on workloads commonly run in the datacenter such as web servers, databases, web search, Java application servers, and other such workloads, optimizing die area and power, …

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3055451
Date of publication 29 January 2021; date of current version 26 March 2021.
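The TDP-bounded design goal stated above, maximize performance subject to a power cap, can be sketched as a simple selection over candidate design points. All names and numbers below are hypothetical:

```python
# Pick the best design point under a TDP cap (all numbers hypothetical).
design_points = [
    # (name, SPECrate-style performance, power in watts)
    ("low-voltage",    300, 140),
    ("nominal",        360, 180),
    ("high-frequency", 400, 225),
]
TDP_WATTS = 200

feasible = [p for p in design_points if p[2] <= TDP_WATTS]
best = max(feasible, key=lambda p: p[1])
print(best[0])   # "nominal": the fastest point that stays under the cap
```

The high-frequency point is faster in isolation but exceeds the cap, so it is infeasible; that is the sense in which the datacenter's TDP, not raw performance, sets the design target.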
…performance counters, and in conjunction with input from software (OS or Hypervisor) makes decisions on frequency and voltage settings and configuration to optimize chip performance and power.

THUNDERX3 CORE
The ThunderX3 core is a deep out-of-order core with support for four-way multithreading. On single thread performance, it is competitive with the highest end cores from x86 competitors, and the support for four-way threading allows it to outperform high-end competitor cores on a wide variety of workloads.

FIGURE 2. ThunderX3 core block diagram.

Figure 2 has the core block diagram. Up to eight instructions per cycle are fetched from the instruction cache—way prediction is used at the instruction cache to simplify the design and to reduce power. Concurrent with instruction cache access, various branch prediction structures are also read to decide the next bundle of instructions to be fetched. There are separate prediction structures for conditional branches, indirect branches, and returns. Support for decoupled fetch was added in ThunderX3—this is a mechanism to keep fetching following an instruction cache miss, using BTBs and branch predictors to predict future PCs—it is quite effective, showing 1.5× to 2× gain on datacenter codes such as webservers and databases that tend to have a large instruction footprint and predictable basic block sequences. Fetch bundles break on a 64B cache line boundary and on a taken branch. Front-end inefficiencies introduced by fetch breaks and branch resteer latencies are a major challenge in high-end CPU design since IPCs on many cache resident codes may reach 3 or more, and even a single cycle additional delay on a frequently executed branch resteer could lower IPC by a high percentage. ThunderX3 implements resteer at different levels to reduce these penalties and does bundle merging to smooth out the instruction stream.

The eight instructions that are fetched go into the decoder. The decoder maps instructions to micro-ops—most instructions map to a single micro-op, but there are a few that map to multiple micro-ops. Micro-op expansion was reduced significantly going from ThunderX2 to ThunderX3. ThunderX2 was derived from an earlier MIPS-based architecture, and during the transition to Arm, most instructions that did not map to a MIPS instruction were broken into micro-ops. In ThunderX3, instructions such as loads and stores with register-plus-register addressing that see widespread use by Arm compilers were mapped to a single micro-op. Reducing micro-op expansion results in additional complexity in the execution unit to execute these more complex operations, but the performance gain was worth it. Decode also fuses certain instruction pairs such as simple integer instructions and branches into a single micro-op. On average across the SPEC integer suite about 0.95 micro-ops are output by decode per instruction.

Decoded micro-ops go into a skid buffer—the skid buffer is the separation point between the front end of the pipe and the back end of the pipe. The skid buffer is one of the few structures in the core that is statically partitioned among the threads—to simplify thread arbitration at dispatch. The renaming unit picks up micro-ops from the skid buffer, does renaming, allocates various backend structures for the micro-op, and writes the micro-op into the scheduler, the reorder buffer, and other queues such as the load and store queue as needed. Renaming is done at four micro-ops per cycle and there are optimizations around merging to reduce instruction delivery inefficiencies created by breaks in fetch bundles. Note that renaming width is four while fetch and decode width are eight—it may seem that the pipeline is out of balance, but in practice most workloads do not run anywhere close to peak rates in a sustained fashion. Performance and design studies showed that increasing rename width had low performance benefit at the current design point, but high design cost.
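The decode-time fusion just described, where a simple integer instruction and a following branch become one micro-op, can be sketched as a scan over the instruction stream. The instruction classes and the stream below are illustrative assumptions; the article's 0.95 micro-ops/instruction figure is a measured average over SPEC integer, not something this toy reproduces:

```python
# Sketch of decode-time pairing: a fusible ALU op followed by a branch
# becomes one micro-op (instruction classes are illustrative).
FUSIBLE_FIRST = {"cmp", "add"}

def micro_ops_after_fusion(instrs):
    uops, i = 0, 0
    while i < len(instrs):
        if instrs[i] in FUSIBLE_FIRST and i + 1 < len(instrs) and instrs[i + 1] == "branch":
            uops += 1           # fused pair -> a single micro-op
            i += 2
        else:
            uops += 1           # everything else maps 1:1 in this sketch
            i += 1
    return uops

stream = ["load", "cmp", "branch", "add", "store", "cmp", "branch"]
ratio = micro_ops_after_fusion(stream) / len(stream)
print(round(ratio, 2))   # 0.71: 5 micro-ops for 7 instructions
```

A ratio below 1.0 is the point of the exercise: fusion lets an eight-wide decoder feed the four-wide renamer with fewer micro-ops than instructions.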
…1.5× and on datacenter codes we are seeing 2–3× on a variety of codes.

THUNDERX3 THREADING
There are four hardware threads per core on ThunderX3. Each hardware thread has the full Arm CPU state—so the OS sees four CPUs per core. The four hardware threads share almost all core resources including caches, execution units, pipeline stages, structures such as the reorder buffer and the scheduler, as well as interfaces to the external world. The core was designed with four threads from the beginning—there was never a single thread variant of the core. A back-of-the-envelope analysis of the area penalty of threading was done and the penalty came out to around 5%. The gain from threading on a variety of codes is well over 5%—so threading is quite area and power efficient.

Figure 4 describes thread arbitration, which provides a flavor of how threading is implemented in the core. There are four points of arbitration: once at fetch, once at rename, once in the scheduler, and once at retire. At fetch, the goal of the arbitration algorithm is to treat the four threads uniformly while achieving the maximum execution rate. On each cycle, the algorithm looks at the active threads and picks the thread that has the fewest instructions/micro-ops in flight further down the pipe. That ensures that among active threads, pipeline utilization is balanced to some extent. Similarly, at rename, among the threads that have at least one micro-op in the skid buffer, arbitration picks the thread that has the fewest micro-ops further down the pipeline. The scheduler is threading agnostic—it picks the oldest micro-op in the scheduler for each issue port on each cycle. The execution units are threading agnostic as well, executing whatever micro-op is delivered to them by the scheduler. Finally, the retire unit picks the thread that has the most micro-ops to retire, with a mechanism to prevent starvation when a single micro-op in a thread has not retired for a while. With this arbitration scheme, when running similar workloads on all threads, performance is uniform among the threads. Of course, when the threads are running workloads with vastly different profiles, the notion of what is fair itself is not clear, but results are reasonable with no thread experiencing dramatic slowdowns.

Figure 5 shows single core performance gains from threading—threading gains often correlate with the instructions per clock (IPC) that a single thread is able to achieve, which is a measure of how efficiently the thread is using the pipeline—the more slack there is in execution, the more opportunity there is for threading gains. But cache pressure is a factor in some cases. On MySQL, which runs at a low IPC, we see more than 2× gain from going to four threads. For a benchmark such as leela (a go playing code from cpu2017), which achieves mid-range IPC, the gain from threading is still good at 1.7× to 1.8× at four threads. For x264, which is a video encoder also from cpu2017, the IPC is high but there are still decent gains.

THUNDERX3 SOC
Figure 6 shows the overall SOC—it is a switched ring with a switch at the top and bottom. Each leg is bidirectional, and the traffic is routed through the switches to the different legs. During the design of ThunderX3, a mesh was considered for the interconnect but given that DDR bandwidth was not increasing dramatically from ThunderX2 to ThunderX3, a mesh did not seem necessary, and the switched ring architecture was adopted to minimize design changes. Cores are grouped in four-core clusters along with an interconnect slice, L3-cache, and coherence control logic

FIGURE 5. Threading performance gains.
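The fetch/rename arbitration heuristic described in the threading section, pick the eligible thread with the fewest micro-ops already in flight downstream, can be sketched in a few lines. The function name and the per-thread counts are hypothetical:

```python
# Fetch/rename arbitration sketch: among eligible threads, pick the one
# with the fewest micro-ops in flight downstream (numbers are made up).
def pick_thread(in_flight, eligible):
    """in_flight: micro-op counts per thread id; eligible: ids that may dispatch."""
    return min(eligible, key=lambda t: in_flight[t])

in_flight = {0: 42, 1: 7, 2: 19, 3: 30}
print(pick_thread(in_flight, eligible={0, 1, 2, 3}))   # 1: least work in flight
print(pick_thread(in_flight, eligible={0, 2}))         # 2: thread 1 has nothing to dispatch
```

Biasing toward the least-occupied thread is what balances pipeline utilization among active threads; the scheduler and execution units downstream can then stay threading agnostic.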
to form tiles. Tiles are replicated to form the Core and L3-cache cluster in the middle of the die. DDR, PCIe, and other I/O tap into the rings and are on the periphery. Support logic such as the interrupt controller, the system management unit, and debug assist also tap into the rings and are on the periphery.

The L3-cache is 90 MB (1.5 MB per core) on ThunderX3 versus 32 MB (1 MB per core) on ThunderX2. Physical addresses are mapped statically to L3-cache tiles and the location of the requesting core has no impact on tile selection. The large, shared L3-cache allows large datasets and instructions of highly multithreaded datacenter codes to be held in the …

FIGURE 7. Socket level performance.

ACKNOWLEDGMENTS
The authors would like to thank their colleagues, hundreds of engineers over the years, who worked days, nights, and weekends, sometimes under extremely challenging circumstances, to build ThunderX3 and its predecessors. They would also like to thank the anonymous reviewers for the helpful feedback.

REFERENCES
1. 2017. [Online]. Available: https://www.spec.org/cpu2017/Docs/overview.html
2. Arm Developer, Understanding the Armv8.x extensions, 2019. [Online]. Available: https://developer.arm.com/architectures/learn-the-architecture/understanding-the-armv8-x-extensions/single-page
3. 2016. [Online]. Available: http://www.cs.virginia.edu/stream/ref.html
4. 2020. [Online]. Available: https://github.com/akopytov/sysbench
RABIN SUGUMAR is a Chief Architect with Marvell, Sunnyvale, CA, USA and leads the architecture group for the ThunderX server processor line. In his role, he participates in a variety of aspects of ThunderX development and productization. He joined Marvell from Cavium, which Marvell acquired in 2018. Prior to this, he was with Broadcom, where he was one of the lead architects on the server processor that later became ThunderX2 when the team from Broadcom moved to Cavium. During his career, he has worked on architecture and design of vector processors at Cray Research; early multithreaded and out-of-order SPARC processors with Sun Microsystems; and Arm-based processors with Broadcom, Cavium, and now, Marvell. Sugumar received a Ph.D. in computer science and engineering from the University of Michigan, Ann Arbor, MI, USA. Contact him at rsugumar@marvell.com.

MEHUL SHAH is a Principal Architect involved in the design of server processors with Marvell, Sunnyvale, CA, USA. He came to Marvell through the acquisition of Cavium. Prior to that, he was one of the designers of the server processors with Broadcom that later became ThunderX2. Before Broadcom, he was involved in architecture and verification of high-performance embedded processors at two startups, PA Semi and SiByte. Shah received a B.S. in electrical engineering and computer science from UC Berkeley, Berkeley, CA, USA and an M.S. in electrical engineering from UCLA, Los Angeles, CA, USA. Contact him at mehuls@marvell.com.

RICARDO RAMIREZ is a Principal Engineer with Marvell, Santa Clara, CA, USA and is the lead designer on the ThunderX CPU logic design team. He joined Marvell with the Cavium acquisition in 2018. Before Cavium, he was with Broadcom as the lead designer on the server processor that became ThunderX2 when the team moved to Cavium. Throughout his career, he has been involved in developing many high-performance CPU cores, including Intel's Itanium processor and Broadcom's XLR and XLP line of MIPS-based processors. Ramirez received an M.S.E.E. from Stanford University, Stanford, CA, USA. Contact him at rramirez@marvell.com.
The Xbox Series X console, released in November 2020, contains a System on Chip
(SoC) created in partnership with AMD. This article describes its architecture
including the intelligence for input processing, rendering game graphics and audio,
managing storage, user services, and security, all under a tiered operating system.
The Xbox 2020 console generation features numerous upgrades to the previous architecture that powered console products from 2013 to 2017, including Xbox One, Xbox One S, and Xbox One X. Some of the key architectural features of the Series X1 are listed in Table 1.

…subsystems for multimedia, security, and I/O. Media-related blocks are fed by the Media Hub interface, while security and audio blocks are fed by the System Hub for best quality of service. The primary I/O interface is the 8-lane Gen4 PCI Express interface, which connects to a South Bridge chip, internal and (optional) external NVMe flash SSDs, and a gigabit ethernet NIC.
AUDIO UNITS
Overall, the three hardware audio engines in Xbox
Series X have more peak single-precision floating-
point performance than all eight of the Xbox One X
CPU cores running at 2.3 GHz. The CPFU2 is a new
engine focused on efficient audio convolution, FFTs,
reverb, and complex arithmetic for audio algorithms. It
enables new realistic audio experiences such as in Project Acoustics,4 where environments are modeled, and 3-D audio sources are simulated in real time.

FIGURE 1. The SoC die.

MOVAD is a hyper real-time Opus5 audio decoder, with a throughput-matched high-quality sample rate converter. It can process more than 300 real-time channels of Opus. Because of its optimal quality and compression ratio, the Opus CODEC was chosen for hardware implementation. Opus allows different SILK voice CODEC and CELT music CODEC mixes per audio frame. The SRC engine in MOVAD has a >100 dB signal-to-noise ratio across game use cases, which are much more difficult and varied than traditional music/voice audio.

PLUTON HSP AND MSP
The integrated Hardware Security Platform6 (HSP) raises Xbox's already robust hardware security threshold with the addition of Secure Hardware Crypto Keys (SHACK). The HSP orchestrates security operations without other software or firmware involvement.

The Media Security Platform unit (MSP) offloads simultaneous cryptographic and compression-related processing of streams of storage data to/from the NVMe SSD. The MSP performs LZ lossless decompression and custom, texture-optimized BCPack decompression, giving an average 2:1 space and bandwidth boost.
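Audio convolution of the kind the CPFU accelerates (reverb is essentially a long FIR filter applied to the input signal) is classically done in the frequency domain. A NumPy sketch of the standard FFT-convolution equivalence follows; the engine's actual algorithm is not described in the article, so this is generic DSP, not the hardware's implementation:

```python
import numpy as np

# Fast convolution via FFT: zero-pad to the full output length,
# multiply the spectra, and transform back.
def fft_convolve(signal, impulse_response):
    n = len(signal) + len(impulse_response) - 1
    spectrum = np.fft.rfft(signal, n) * np.fft.rfft(impulse_response, n)
    return np.fft.irfft(spectrum, n)

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.5, 0.5])          # tiny stand-in for a reverb impulse response
y = fft_convolve(x, h)            # equals np.convolve(x, h) up to float tolerance
```

For long impulse responses the FFT route turns an O(N·M) time-domain filter into O(N log N) work, which is why convolution engines and reverbs lean on FFTs in the first place.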
THE GPU
The goal for the GPU was to create a console-class
design that significantly advances gamers' sense of
immersion in realistic worlds. A generational increase in
raw graphics operations per second is natural, but
because of the cost considerations described previ-
ously, little real estate could be devoted to brand new
GPU functions, so enhancements were added judi-
ciously. Figure 3 shows the overall RDNA-based GPU
structure. It fully supports Direct3D12 feature level 12_2.
There are four Shader Arrays, each with 6 or 7 dual CUs for 26 total active. Each Shader Array has its own …

FIGURE 4. Evolution of performance and capacity.
increase about ninefold since 2013. This growth has enabled developers to create ever more stunning visuals. Note that memory space and bandwidth have grown much more slowly—only 2–3×.

The brown line tracks the number of TV screen pixels that must be filled. Because consumers have been able to increase their TV resolution from FHD (1920×1080) to 4K UHD at 120 frames/s, the pixel count has gone up almost as fast as shader power. Taking the average of GPU compute and memory capability, the useable graphics performance increase is in the 4–6× range depending on the game title.

By these metrics, the GPU is falling behind the performance-per-pixel curve; but developers and players alike want ever nicer, more realistic pixels. A big part of the solution for Xbox Series X involves architectural enhancements that amplify the raw performance. The actual increase depends on several factors, including adoption of new techniques by game programmers and the specific content.

Variable Rate Shading
Traditionally, GPUs had to run a shader thread on every rendered pixel, or fragment, to generate a color value—and with antialiasing, multiple fragments are depth-buffered and shaded per screen pixel. We observed that for most scenes that amount of unique work is overkill, since there is not high spatial frequency variation everywhere. A single fragment color can serve for multiple subpixel samples or multiple pixels without noticeable degradation. Proposed in 2018 by Microsoft, VRS is now widespread.

Figure 5 shows an analysis of a typical rendered frame—in this case from the Sponza scene model. The black areas are where shading can occur at quarter resolution, i.e., one fragment for a two-by-two pixel area. Turquoise areas are shaded once per two-by-one pixel area; yellow per one-by-two. Red areas are shaded at the normal 1×1 pixel rate. This shows that, on average, we only need to shade every other pixel; but we need to be judicious about the distribution of fragments to not lose visual detail. VRS addresses this using a set of bias controls. The rate can be determined based on knowing which objects have high detail, which primitives within objects, or based on individual 8×8-pixel screen tiles. For instance, since the game title performs multiple rendering passes, it is possible to predict which areas of the screen that might ordinarily have high detail will be blurred in later passes.

A programmable combination of the different types of rates is supported, which increases or decreases the nominal rate, which is limited in fineness to the global antialiasing level set for a given rendering pass. Object edge detail is preserved, and VRS can be used alongside other resolution-enhancing techniques, including temporal antialiasing, super resolution, and even checkerboarding. The actual amount of dedicated hardware for this feature is tiny but can have a payoff of around 10%–30% in improved performance, allowing higher frame rates and more math per pixel.

Sampler Feedback Streaming
Over many years, game engines have tried multiple approaches to loading only enough texture detail to satisfy the demands of the frames about to be rendered. In Series X there are two new structures in the GPU to assist with tile-by-tile management of a modest texture working set placed in RAM just before or just after needed. There is a residency map per texture that clamps the level of detail (LOD) of each tile, and a request map that records the finest MIPMAP level that was requested for each tile since it was last reset. The tile size is flexible. The sequence of SFS operations is as follows:

1. The first step is to allocate virtual memory space for the entire texture. Then, the title loads all coarsest mipmap levels—in this case, from the coarsest level up to level 2, which is 1/4 by 1/4 the dimensions of the finest level, requiring just 6.7% of all pixels to be resident in memory. Finer levels are divided into tiles.
2. The next step is to render. Closer-up portions of a texture require more detail. With SFS, the shader executes a single sample macroinstruction that combines residency map lookup to determine the current detail level with the fetch of the actual texture data. Since the shader sample instruction, as in the past, has already calculated which LOD tiles should have been fetched, those values are captured in the separate request map. The closer tiles need more detail and the farther tiles may need less than what is resident. The latter are candidates for eviction if the cache is full.
3. After rendering, the application reads back the request map, compares it with its saved copy of the residency map, and uses the XVA API to bring in the higher detail tiles it decides are needed from flash. For each fine LOD tile, the corresponding region of the coarser LODs is also loaded to provide the right detail everywhere in that region.
4. Finally, the residency map is updated to reflect the current state of the tile cache and uploaded to the texture unit. The next time that texture is accessed the finer detail is ready to use.

There is a second mode of the request map to support texture-space shading, which saves GPU work by deferring rendering passes that generate texture tiles until it is known which are needed. In this mode, instead of one detail value per tile for an entire texture, accesses are tracked using a single bit per tile for every MIPMAP level.

Ideally these new tile maps, treated just like other textures, stay on-die for low latency access, so they are designed to be as small as possible—meaning tiles should be as large as practicable. But we also do not want to see seams between tiles with different levels of available detail. In the example in Figure 6, the red tile has LOD 0 resident, orange is LOD 1, etc. With bilinear filtering on the left, fractional LOD values greater than zero leak over the boundary into the lower detail tiles, meaning that pixels sampled in those regions would need nonresident LOD 0 pixels to be blended in, causing visual errors. With a new biased filter function shown on the right, the transition zones are moved toward the coarser LOD map texel, so that the nonresidency problem is avoided. Overall, with a very small incremental hardware cost, Sampler Feedback Streaming gives the same or better level of visual detail with up to 60% savings in I/O and memory footprint costs.

FIGURE 6. Xbox Series X texture LOD filtering.

Ray Tracing
The Series X GPU supports DirectX ray tracing acceleration, allowing the most physically realistic rendering techniques to be used in real time. But in this console generation, developers still want to use traditional rendering techniques evolved over decades
without a performance penalty. They can apply ray tracing selectively where materials and environments demand. This means the GPU needs a good balance of die resources dedicated to the two techniques.

We have added hardware embedded in the compute units to perform intersections of rays with acceleration structures that represent the scene geometry hierarchy. That task is a sizeable fraction of the overall specialized ray tracing workload. The rest can be performed with the baseline shader design and memory hierarchy with good real-time quality. The hardware performs up to 380 billion ray-box intersections or 95 billion ray-triangle intersections per second, or any mix of the two. The overall speedup varies with scene complexity and lighting characteristics, but for the intersection task it can be up to ten times the performance of a pure shader-based implementation.

Machine Learning Support
Game engines increasingly make use of machine learning inference for a variety of game-related tasks, from character behavior and animation to super resolution detail enhancement. Xbox Series X includes a small increment for small-integer operations in the compute units for inference,9 accelerating some tasks up to 10×.

CONCLUSION
The Series X SoC, in conjunction with software innovations provided by the software teams at Microsoft and game development companies, helps "get technology out of the way" and deliver developers' vision to gamers worldwide.

REFERENCES
1. 2020. [Online]. Available: https://www.xbox.com/en-US/consoles/xbox-series-x
2. 2021. [Online]. Available: https://en.wikipedia.org/wiki/High-dynamic-range_video
3. "HDMI 2.1 specs and features: Everything you need to know," TechHive, 2017. [Online].
4. 2019. [Online]. Available: https://docs.microsoft.com/en-us/gaming/acoustics/what-is-acoustics
5. 2020. [Online]. Available: https://opus-codec.org/
6. 2020. [Online]. Available: https://www.microsoft.com/security/blog/2020/11/17/meet-the-microsoft-pluton-processor-the-security-chip-designed-for-the-future-of-windows-pcs/
7. 2020. [Online]. Available: https://news.xbox.com/en-us/2020/07/14/a-closer-look-at-xbox-velocity-architecture
8. 2015. [Online]. Available: https://gamingbolt.com/xbox-one-gpu-has-8-graphics-contexts-uses-multiple-gpu-command-streams-to-reduce-cpu-gpu-latency
9. "Inside Xbox Series X: the full specs," Eurogamer, 2020. [Online].

MARK GROSSMAN is a Principal Architect with Microsoft, working on GPUs and display processing for multiple generations of consoles and headsets. His main interests are high-performance graphics systems and hardware-accelerated algorithms. He is a Founder of Silicon Graphics. Grossman received a B.A. in information and computer science from the University of California, Santa Cruz, CA, USA. Contact him at mark.grossman@microsoft.com.

JEFFREY ANDREWS is a Distinguished Engineer with Microsoft, directing a silicon IP architecture team within Azure. His team's focus areas include: silicon security, machine learning, storage, graphics, audio, display, image processing, and embedded CPUs. He has worked on every Xbox console, plus four game consoles, and four startups before Microsoft. Andrews received a B.Sc. in computer architecture from the University of Illinois, Urbana-Champaign, IL, USA. Contact him at jeffrey.andrews@microsoft.com.
NVIDIA A100 Tensor Core GPU is NVIDIA’s latest flagship GPU. It has been designed
with many new innovative features to provide performance and capabilities for
HPC, AI, and data analytics workloads. Feature enhancements include a Third-
Generation Tensor Core, new asynchronous data movement and programming
model, enhanced L2 cache, HBM2 DRAM, and third-generation NVIDIA NVLink I/O.
NVIDIA A100 TENSOR CORE GPU OVERVIEW

The diversity of compute-intensive applications in modern cloud data centers has driven the explosion of GPU-accelerated cloud computing. Such applications include AI deep learning (DL) training and inference, data analytics, scientific computing, genomics, edge video analytics and 5G services, graphics rendering, and cloud gaming. The NVIDIA Ampere architecture-based A100 GPU brings record-setting scale and novel capabilities to these workloads. The A100 GPU is 54 billion transistors built on TSMC's 7-nm process. It has 108 streaming multiprocessors (SMs) with 6912 CUDA cores, 40 MB of L2 cache, 600 GB/s of NVIDIA NVLink interconnect bandwidth, and 1.6 TB/s of HBM2 memory bandwidth. It also has new elastic GPU capabilities, including scale-out support with multi-instance GPU (MIG) virtualization and scale-up support with a third-generation 50-Gb/s NVLink I/O interface connecting multiple A100s directly or through NVIDIA's NVSwitch. Inside the A100 SM are new third-generation tensor cores with support for fine-grain sparsity and new BFloat16 (BF16), TensorFloat-32 (TF32), and FP64 datatypes. The SM also adds new asynchronous data movement instructions and barriers, which work together to efficiently stream data in a programmer-friendly way.

A comparison shows that A100 provides dramatically higher performance than currently available commercial designs and NVIDIA's previous-generation V100 GPU.1 A100 runs approximately 1.5× to over 2× faster than V100 on important HPC applications within the molecular dynamics, physics, engineering, and geoscience areas (see Figure 1). For DL workloads, DGX super-pods with A100 have set records for the MLPerf benchmark,2 handily surpassing all other commercially available systems, including Google's TPUv3 and Huawei's Ascend systems (see Figure 2). The benchmark also demonstrates A100's breadth of support for AI networks: it was the only system able to run all benchmarks, and it ran them with high performance.

Many new and innovative features in A100 contribute to its high performance and capabilities. This article will cover a few of those improvements: top-to-bottom features to support strong scaling, elastic GPU capabilities to support scale up and scale out, and asynchronous programming features to enable efficiency and programmer productivity.

STRONG SCALING: TOP TO BOTTOM
A typical deep neural network consists of long chains of interconnected layers (see Figure 3). While there is a massive amount of compute in a network, the parallelism is broken up into layers of smaller, sequentially dependent chunks of work. Each layer takes an input activation tensor and a weight tensor, performs an operation similar to a matrix multiplication, and outputs an activation tensor. To leverage the large amount of compute available in a GPU, the output tensor is broken down into smaller tiles which are distributed across the different SMs.
A100’s Tensor Core throughput is 2.5 times higher
on dense FP16 data than the previous generation V100
0272-1732 ß 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3061394 GPU. In weak scaling, both the network size and paral-
Date of publication 23 February 2021; date of current version lelism must grow to match the increased throughput.
26 March 2021. In strong scaling, both the network size and available
FIGURE 1. NVIDIA A100 HPC performance HPC relative to NVIDIA V100, normalized to per chip.
FIGURE 2. MLPerf v0.7 performance for NVIDIA A100 and other commercially available DL systems, relative to NVIDIA V100 &
normalized to per chip.
FIGURE 4. NVIDIA A100 & V100 tensor core formats and performance.
parallelism remain fixed, and A100 runs the network reduces the number of times data needs to be loaded
faster. and reduces SMEM bandwidth by half.
To achieve strong scaling, A100 needed to scale Data transfer efficiency from the memory system
performance at all levels: from the tensor cores, was also improved. In V100, data must first be loaded
through the cache hierarchy, through DRAM, and chip into the register file and then stored into SMEM. In
interconnect. A100, a new asynchronous combined load-global-
store-shared instruction was added which transfers
data directly into SMEM, bypassing the register file
SM Core and increasing efficiency.
A100’s new tensor cores have increased the process- Combined with the Tensor Core organization
ing of dense FP16 data by 2x per SM and 2.5 per GPU changes, A100 reduces 6 L1þSMEM accesses (L1 read,
over V100 (see Figure 4). The tensor cores added sup- SMEM write, 4 SMEM reads) down to only two SMEM
port for additional formats, including the TensorFloat- reads. The asynchronous capability of the transfer of
32 operation that improves the processing of FP32 memory also helps the SM to continuously stream data,
data by 10 per GPU. A100’s tensor cores also support improving utilization throughout the memory system.
fine-grain sparsity, which doubles the throughput
when processing sparse data.
The 2 increase in FP16 math throughput per SM A100 L2 Cache
for dense data requires 2 more data bandwidth, and The A100 L2 cache is a shared resource for the SMs.
the effective 4 increase for sparse data requires 3 Additional bandwidth from the L2 cache is necessary
more data bandwidth. to achieve strong scaling, as the number of SMs is
In the SM core, multiple improvements in data increased in A100 and each SM processes data at a
delivery provide this increase in data bandwidth (see faster rate. A100 delivers a 2.3x L2 bandwidth increase
Figure 5). Inside the V100 and A100 SM, the network over V100, supported by a new structure to efficiently
layer tile is further broken down into four smaller tiles move this data.
which are each processed by a 32-thread warp. V100’s The L2 cache is divided into two partitions to enable
tensor cores were designed to work at 8-thread granu- higher bandwidth and lower latency memory access
larity, and required the tiles to be further broken down (Figure 6). Each L2 partition localizes and caches data
to four smaller tiles per warp. Each of these tiles loads for memory accesses from SMs directly connected to
the tensor data from shared memory (SMEM), which the partition. Hardware cache-coherence maintains
in aggregate requires all data to be loaded four times. the CUDA programming model across the full GPU,
In A100, the tensor cores were reorganized and and applications will automatically leverage the band-
enhanced to work at 32-thread granularity. This width and latency benefits of A100’s new L2 cache.
L2 Cache Residency Controls
A100 gives applications the capability to influence the persistence of data in the L2 cache, allowing higher bandwidth and lower latency accesses to global memory. The so-called persistent accesses control the replacement policy to effectively set aside a portion of the L2 cache. Normal or streaming accesses to global memory can only utilize this portion of L2 when it is unused by persistent accesses.

There are two primary mechanisms that allow persistent data to be resident in the L2. In the first, an address-based window is specified in which all read/write accesses are persistently cached in the L2; individual accesses are not tagged in this scheme. Alternatively, controls can be specified on a finer-grained, per-memory-operation basis.
access will take advantage of the compressed data bandwidth. Compression helps both write and read accesses to DRAM and increases the effective DRAM bandwidth available.

A100 Gen 3 NVLink
The third generation of NVIDIA's high-speed NVLink interconnect is implemented in A100 and the new NVSwitch. NVLink is a reliable, high-bandwidth, low-latency memory interconnect, and includes resiliency features such as link-level error detection and packet replay mechanisms to guarantee successful transmission of data. The new NVLink has a data rate of 50 Gb/s per signal pair, nearly doubling the 25.78-Gb/s rate in Tesla V100.3 Each link uses four differential signal pairs (four lanes) in each direction, compared to eight signal pairs (eight lanes) in V100. A single link provides 25 GB/s of bandwidth in each direction, similar to V100, but uses only half the signals. The total number of NVLink links is increased to 12 in A100, versus six in V100, yielding 600 GB/s of total bandwidth for an entire A100 versus 300 GB/s for Tesla V100.

All writes in the third-generation NVLink now require an acknowledgement from the destination. This allows synchronization to be performed at the requester, and error attribution to be returned to a specific execution context. Writes are allowed to be pipelined to the destination while the requestor is waiting on a response. New features to improve the efficiency of small-payload writes and data-less responses were also added.

ELASTIC GPU: MULTI-GPU SCALE UP
The twelve NVLink links in each A100 allow a variety of configurations with high-speed connections to other GPUs and switches. To meet the growing computational demands of larger and more complex DNNs and HPC simulations, the new NVIDIA DGX A100 system (Figure 7) includes eight A100 GPUs connected by the new NVLink-enabled NVSwitch.

Multiple DGX A100 systems can be connected via a networking fabric like NVIDIA Mellanox InfiniBand and Mellanox Ethernet to scale out data centers, creating powerful supercomputer-class systems. More powerful NVIDIA DGX POD and NVIDIA DGX SuperPOD systems will include multiple DGX A100 systems to provide much greater compute power with strong scaling.

ELASTIC GPU: MULTI-INSTANCE GPU (MIG)
While many data center workloads continue to scale, both in size and complexity, some acceleration tasks are not as demanding, such as early-stage development or inference on simple models at low batch sizes. Data center managers aim to keep resource utilization high, so an ideal data center accelerator not only needs to handle one large workload efficiently; it must also efficiently accelerate many smaller workloads.
The new MIG feature can partition each A100 into as many as seven GPU Instances for optimal utilization, effectively expanding access to every user and application. The A100 GPU's new MIG capability can divide a single GPU into multiple GPU partitions called GPU Instances. Each instance's SMs have separate and isolated paths through the entire memory system: the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address busses are all assigned uniquely to an individual instance. This ensures that an individual user's workload can run with predictable throughput and latency, with the same L2 cache allocation and DRAM bandwidth, even if other tasks are thrashing their own caches or saturating their DRAM interfaces. Using this capability, MIG can partition available GPU compute resources to provide a defined quality of service (QoS) with fault isolation for different clients (such as VMs, containers, and processes). It enables multiple GPU Instances to run in parallel on a single physical A100 GPU. MIG also keeps the CUDA programming model unchanged to minimize programming effort.

MIG enables users to see and schedule jobs on virtual GPU Instances as if they were physical GPUs. MIG works with Linux operating systems and their hypervisors.

PRODUCTIVITY: ASYNCHRONOUS PROGRAMMING
The goal of CUDA is to compile intuitive C++ sources into high-performance executable programs for the GPU. In this pursuit, a compiler's goals are in tension: to maximize operation-level parallelism, but never to alter the program's semantics. NVIDIA's work on joint hardware and programming-system codesign directly addresses this tension.

Updates to CUDA for A100 expand the expressiveness of C++ for data and computation pipelining, which we identified as a growing source of difficulty for CUDA programmers. The goal of pipelining in software is the same as in hardware: to keep execution resources busy by overlapping the latency of different phases of computation. This is difficult to express in C++ due to its conservative requirements around memory consistency.

The example program in Figure 8 shows that operation-level parallelism is prevented by synchronization semantics. Software pipelining depends on having the opportunity to execute independent work while synchronization is resolving at the boundaries of stages.

The first programming-model innovation borrows from asynchronous programming: separate the arrival and waiting steps, as in phasers.4 Early in development, NVIDIA recognized this innovation would benefit programmers beyond CUDA, so NVIDIA offered both specifications and implementations to the community; it is now part of ISO C++20,5 and is available in LLVM (libcxx) today.

The second innovation extends this foundation with asynchronous data movement capabilities. The example program in Figure 9 combines both asynchronous barrier and data movement operations, resulting in a remarkably precise expression of programming intent. By leveraging the relaxed semantics of asynchronous operations, there is a net reduction in the difficulty of compiling the program, and performance is both higher and more predictable.
CONCLUSION
NVIDIA's A100 GPU is the largest and most advanced GPU developed by NVIDIA, and builds on the groundwork laid by previous generations of NVIDIA GPUs. It is the result of the work of thousands of engineers who worked together from transistors to standards and everything in between, such as system integration and programming-model design. A100 provides unprecedented acceleration at every scale, and adds powerful new features which deliver dramatically faster performance for HPC, AI, and data analytics workloads.

REFERENCES
1. J. Choquette, O. Giroux, and D. Foley, "Volta: Performance and programmability," IEEE Micro, vol. 38, no. 2, pp. 42–52, Mar./Apr. 2018.
2. P. Mattson et al., "MLPerf training benchmark," 2019, arXiv:1910.01500.
3. A. Ishii et al., "NVSwitch and DGX-2: NVLink-switching chip and scale-up compute server," in Proc. Hot Chips, 2018.
4. J. Shirako, D. M. Peixotto, V. Sarkar, and W. N. Scherer, "Phasers: A unified deadlock-free construct for collective and point-to-point synchronization," in Proc. 22nd Annu. Int. Conf. Supercomput., 2008, pp. 277–288.
5. International Standard ISO/IEC 14882:2020 – Programming Language C++.

JACK CHOQUETTE is a Senior Distinguished Engineer with NVIDIA, where he has led the architecture development of NVIDIA's GPGPU streaming multiprocessors for multiple generations. He has been leading CPU and system designs for over 25 years. Choquette received an M.S. in computer engineering from the University of Illinois Urbana-Champaign, Champaign, IL, USA. Contact him at jchoquette@nvidia.com.

WISHWESH GANDHI is a Senior Director of architecture at NVIDIA, Singapore. He has led the architecture development of the GPU memory system for multiple generations. He has been working with integrated and discrete GPU memory architecture for more than 20 years. Contact him at wgandhi@nvidia.com.

OLIVIER GIROUX is a Distinguished Architect at NVIDIA, and the ISO C++ Concurrency and Parallelism Chair. He has worked on ten GPU and six SM architecture generations, with a focus on clarifying the programming model of GPU threads. Giroux received an M.S. in computer science from McGill University, Montreal, QC, Canada. Contact him at ogiroux@nvidia.com.

NICK STAM is a Senior Technical Marketing Director with NVIDIA. His team provides tech support to press, and also produces our GPU white papers. Before NVIDIA, he worked at PC Magazine USA, and cofounded the ExtremeTech website. Stam received an M.S. in computer science from SUNY Binghamton, NY, USA. Contact him at nstam@nvidia.com.

RONNY KRASHINSKY is a Distinguished Engineer with NVIDIA, where he has architected GPUs for 11 years. He began his NVIDIA career in Research, and later joined the Streaming Multiprocessor team. He now focuses on deep-learning compute architecture. Krashinsky received a Ph.D. in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. Contact him at rkrashinsky@nvidia.com.
Domains such as data analytics, machine learning, and scientific computing depend on increasing compute resources.3 Increasing technology-node densities result in systems that are mainly limited by thermal design power, and the most feasible way to increase the number of active compute units is to design more energy-efficient architectures. While many emerging architectures,4 especially in the machine learning domain, trade off floating-point (FP) precision for higher throughput and efficiency, algorithms such as stencils and linear differential equations require higher precision arithmetic. Domain-specific accelerators are a prominent example of how to leverage specialization.5 Unfortunately, they are hard to adjust to algorithmic changes and are tied to a specific application domain.

The trend in leading-edge general-purpose computer architectures paints a similar picture on the importance of increasing energy efficiency. Two prominent examples of recent high-performance architectures are Fujitsu's A64FX6 and NVIDIA's A100.7 Both systems strive to control their 32-lane (A64FX) and 16-lane (A100) multilane single-precision (SP) data paths with as few instructions as possible.

0272-1732 © 2020 IEEE
Digital Object Identifier 10.1109/MM.2020.3045564
Date of publication 17 December 2020; date of current version 26 March 2021.

With the proposed Manticore system, we pursue a similar goal. We achieve this goal by pairing a simple, in-order, 32-bit RISC-V integer core with a large floating-point unit (FPU). Two instruction set architecture (ISA) extensions, SSRs and floating-point repetition (FREP), make it possible for the single-issue integer
die serial links.8

FIGURE 3. …clusters form a quadrant and share an uplink into the next stage. Four S1 quadrants form an S2 quadrant, which shares an uplink into the next stage. Two S2 quadrants form an S3 quadrant. Four S3 quadrants per chiplet share access to the HBM memory.

core to saturate the bandwidth of its FPU, achieving utilization higher than 90% for compute-bound kernels.
Memory Hierarchy
Each quadrant (see Figure 3) is further subdivided into multiple stages, in a tree structure using an interconnect tuned for burst-based direct memory access (DMA) transfers. Four clusters share an instruction cache and an uplink into the next stage. These four clusters have a high aggregate bandwidth of 64 TB/s among each other and can perform low-latency, high-bandwidth intracluster data transfers. As shown in Figure 3, clusters share
FIGURE 4. Simplified block diagram of a Snitch-based compute cluster. The core complex (CC) contains the integer core and the FPU as well as the necessary hardware for the SSRs and FREP. The cluster contains eight core complexes, which share an instruction cache and a tightly coupled data memory. A DMA engine is used for efficient bulk data movement.

FIGURE 5. Effect of SSRs and FREP on the hot loop of a dot-product kernel. (a) Left: baseline simplified RISC-V implementation, with address calculation and pointer increment omitted for brevity. Right: SSR implementation with memory loads encoded as reads from stream registers; additional stream configuration instructions are required ahead of the loop. (b) Left: implementation with loop bookkeeping using baseline RISC-V instructions. Right: implementation with an FREP hardware loop, with all bookkeeping occurring implicitly in hardware.
the uplink into the next higher stage, the bandwidth to the other S1 quadrants becomes smaller. Bandwidth is subsequently thinned as four S1 quadrants share an instruction cache and an uplink into the S2 quadrant, and two S2 quadrants share an uplink into the S3 quadrant. In the last stage of the hierarchy, 16 S3 quadrants, distributed over four chiplets (nonuniform memory access), share four HBMs with an aggregated peak bandwidth of 1 TB/s. This bandwidth-thinning scheme allows us to have a very low diameter, low-latency interconnect topology, which can sustainably saturate the HBM bandwidth while being benign to floorplanning and physical design. The interconnect also allows for a very high cluster-to-cluster internal bandwidth, through multiple stages, which by far exceeds the bandwidth into the memory. With this model, we efficiently support cluster-to-cluster traffic while, at the same time, fully loading the memory system.

Compute Cluster
The compute cluster consists of eight small, 22-kGE, single-stage, 32-bit RISC-V processor cores1 (see Figure 4). Each Snitch core contains a double-precision (DP) FPU, which can be used to compute one DP fused multiply–add (FMA) operation or two SP FMAs per cycle. When running at 1 GHz, a cluster with eight Snitch cores is able to compute 16 DP or 32 SP flop, resulting in 4 TDPflop/s for the entire Manticore system. All eight cores have elementwise, low-latency access into a 128-KiB tightly coupled and shared scratchpad memory. Moreover, a DMA engine is in charge of moving blocks of data into the scratchpad memory over a 512-bit data bus. The cores are clocked at 1 GHz, thus delivering more than 4 TDPflop/s of peak compute per chiplet.

With this architecture, we achieve a very high compute/control ratio: 44% of the system consists of compute units, another 44% is spent on the L1 memory, and just 12% of the area is spent on the control parts.

PROGRAMMING
We leverage two custom RISC-V ISA extensions to achieve extremely high FP utilization and efficiency: Xssr and Xfrep.

Stream Semantic Registers (Xssr)
SSRs2 offer a means to elide explicit load/store instructions in a program. This is achieved by giving a subset of the processor core's registers stream semantics. When enabled, a read from such an SSR is translated in hardware into a load from memory, and conversely, a register write becomes a store to memory. Since an in-order single-issue core can only execute a single instruction every cycle, the presence of loads and stores in a hot loop of the program diminishes FPU utilization significantly. For example, consider a dot product, which has to issue two loads from memory for every FMA operation, as shown in Figure 5(a). In this scenario, even if the loop is fully unrolled, we achieve at most 33% FPU utilization. In theory, eliding these loads allows the FPU to be 100% utilized, and even a simple processor can achieve >90% utilization in many relevant kernels without resorting to complex and energy-inefficient wide-issue superscalar or very long instruction word (VLIW) architectures.2 SSRs offer a way to elide memory
FIGURE 6. Typical execution of a matrix-vector multiplication implementation leveraging the SSR and FREP extensions. The 16 instructions are fetched and decoded once by the integer pipeline of the processor core (b) and expanded to 204 executed instructions in the FPU (c). (a) Reference C implementation with a square matrix A of fixed size 48. (b) Resulting assembly implementation as stored in the binary and fetched/decoded by the processor core. (c) Execution traces of the integer pipeline (left) and the FP pipeline (right).
accesses and address computation in hot loops, which in many cases leaves no integer instructions in the loop body.

Floating-Point Repetition (Xfrep)
The FREP1 extension implements an FPU-only hardware loop. Consider a dot product utilizing SSRs, for example, as shown in Figure 5(b). Besides the essential FMA operation running on the FPU, the loop only consists of a trip count increment (addi) and a back branch (bne). This loop can be replaced by an FREP instruction, which loops a range of subsequent FP instructions (one in this case) for a configurable number of times. The RISC-V ISA makes the integration of such an extension very straightforward, as most instructions operate either entirely on integer or entirely on FP registers. Only a handful, such as comparisons or moves between the integer and FP domains, exchange information from one domain to the other. We leverage this separation and insert a microloop sequence buffer of 16 instructions between the Snitch core and the FPU. FREP instructions configure this buffer to emit a range of buffered instructions multiple times into the FPU, which essentially implements the hardware loop. Since this happens entirely in the FPU subsystem outside of the Snitch core, the core's integer pipeline can run in parallel, enabling a pseudo-dual-issue mode of operation that would not be achievable with a traditional hardware loop. This allows the core to perform nontrivial bookkeeping and address calculation while the FPU is running, without incurring a reduction of the FPU utilization.

Typical SSR/FREP Execution
As a concrete example, let us consider the matrix-vector multiplication operation shown in Figure 6(a). A typical implementation leveraging Manticore's SSR and FREP extensions is shown in Figure 6(b). The address computation and memory accesses of A and x are entirely performed by the SSRs ft0 and ft1. The inner loop is implemented using an FREP instruction and unrolled to compute four results in parallel in order to avoid pipeline stalls due to FPU latency. The outer loop is executed by the integer core. It stores the results (fsd), implements the loop bookkeeping (addi, bltu), and initializes a (fmv.d).

As shown in Figure 6(c), the 16 instructions of the assembly implementation are fetched and decoded once by the integer pipeline of the processor core and expand to 204 executed instructions in the FPU through the use of FREP. This leaves 188 cycles of the integer pipeline free for other tasks, such as preparing the next loop iteration or coordinating data movement. In case no other work is required, the 16 instructions fetched over 204 cycles of execution amount to roughly one instruction every 13 cycles, mitigating the von Neumann bottleneck by reducing instruction fetch bandwidth by more than one order of magnitude. Since the FPU can execute the loop iterations back-to-back, and 192 of the 204 instructions perform actual computation, this kernel can achieve up to 94% FPU utilization.

Compilers can leverage these new instructions through scalar evolution and loop analysis to detect loops with the appropriate structure, and matching
FIGURE 7. Floorplan of the prototype silicon. The two Ariane cores as well as the Snitch clusters have been designed hierarchically. The cores follow a star-shaped layout around the shared instruction cache.
PROTOTYPE
A 3 × 3-mm² prototype containing the logic core of the chiplet architecture was manufactured and characterized in GlobalFoundries 22FDX technology. The prototype in Figure 7 contains three Snitch clusters with eight cores each (configured with an 8-KiB L1 instruction cache and a 128-KiB L1 data memory organized in 32 banks), a dual-core Ariane (with a 16-KiB L1 instruction cache and a 32-KiB data cache), 1.25 MiB of L2 memory, and a 400-MHz, double-data-rate, 2.56-GB/s, digital-only chip-to-chip link.
achieves this despite these chips having a substantial technology advantage due to their 7-, 12-, and 14-nm FinFET processes. Regarding the A100 GPU, our initial estimates based on data published by Nvidia7 suggest that it achieves a 25% improvement on SP and DP over the V100 in terms of speed at similar power consumption. This indicates that Manticore has just 25% lower efficiency on SP than A100, but outperforms it on DP by 5×, despite the A100's significant 7-nm FinFET technology advantage. Manticore delivers significantly higher peak FP performance than comparable RISC-V architectures11 in 16 nm.

OVERALL, WE OBSERVE THAT THE MANTICORE ARCHITECTURE IS VERY EFFICIENT AT TRACKING THE PERFORMANCE AND BANDWIDTH ROOFLINE, WITH A DETACHMENT DOWN TO 5% FOR LOW-INTENSITY AND 14% FOR HIGH-INTENSITY OPTIMIZED KERNELS.
ACKNOWLEDGMENTS
This work was supported by the European Union's H2020 program under Grant 826647 (European Processor Initiative, EPI) and Grant 732631 (Open Transprecision Computing, "OPRECOMP").

REFERENCES
1. F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, "Snitch: A tiny pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads," IEEE Trans. Comput., to be published.
2. F. Schuiki, F. Zaruba, T. Hoefler, and L. Benini, "Stream semantic registers: A lightweight RISC-V ISA extension achieving full compute utilization in single-issue cores," IEEE Trans. Comput., to be published.
3. "AI and compute," 2020. Accessed: Oct. 5, 2020. [Online]. Available: https://openai.com/blog/ai-and-compute/
4. N. P. Jouppi et al., "A domain-specific supercomputer for training deep neural networks," Commun. ACM, 2020.
5. A. Yang, "Deep learning training at scale: Spring Crest deep learning accelerator," in Proc. Symp. High Performance Chips, vol. 31, 2019.
6. T. Yoshida, "Fujitsu high performance CPU for the Post-K computer," in Proc. Symp. High Performance Chips, vol. 30, 2018.
7. Nvidia, "NVIDIA Ampere GA102 GPU architecture - The ultimate play," 2020.
8. P. Vivet et al., "A 220GOPS 96-core processor with 6 chiplets 3D-stacked on an active interposer offering 0.6ns/mm latency, 3TBit/s/mm2 inter-chiplet interconnects and 156mW/mm2@82% peak-efficiency DC-DC converters," in Proc. IEEE Int. Solid-State Circuits Conf., 2020.
9. F. Zaruba and L. Benini, "The cost of application-class processing: Energy and performance analysis of a Linux-ready 1.7-GHz 64-bit RISC-V core in 22-nm FDSOI technology," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 11, pp. 2629–2640, Nov. 2019.
10. R. Christy et al., "A 3GHz Arm Neoverse N1 CPU in 7nm FinFET for infrastructure applications," in Proc. IEEE Int. Solid-State Circuits Conf., 2020.
11. S. Davidson et al., "The Celerity open-source 511-core RISC-V tiered accelerator fabric: Fast architectures and design methodologies for fast chips," IEEE Micro, vol. 38, no. 2, pp. 30–41, Mar./Apr. 2018.

FLORIAN ZARUBA received a B.Sc. from TU Wien, Vienna, Austria, in 2014 and an M.Sc. in 2017 from the Swiss Federal Institute of Technology Zürich, Zürich, Switzerland, where he is currently working toward a Ph.D. with the Digital Circuits and Systems group of Luca Benini. Contact him at zarubaf@iis.ee.ethz.ch.

FABIAN SCHUIKI received a B.Sc. and an M.Sc. in electrical engineering in 2014 and 2016, respectively, from ETH Zürich, Zürich, Switzerland, where he is currently working toward a Ph.D. with the Digital Circuits and Systems group of Luca Benini. Contact him at fschuiki@iis.ee.ethz.ch.

LUCA BENINI holds the Chair of Digital Circuits and Systems at ETH Zürich and is a Full Professor with the Università di Bologna. He is a Fellow of the ACM and a member of the Academia Europaea. Contact him at lbenini@iis.ee.ethz.ch.
Data center growth in scale, bandwidth, application diversity, and security requirements has stretched traditional networking and storage IO solutions beyond their design targets.2 Server connectivity speeds have moved to 100 Gb/s and beyond, while the number of interconnected servers in a data center surges. Virtualization, storage disaggregation, and service mesh architectures are driving the amount of east–west (intra-data center) traffic up, along with the need for more sophisticated network segmentation, overlays, and telemetry. Security requirements, including encryption for data-in-flight and data-at-rest, and stateful firewalls with connection tracking, further increase the complexity and computational load of data center networking services.

To address these challenges, the Pensando distributed services architecture uses P41 domain-specific processing coupled with general-purpose CPUs to deliver networking, security, and storage services in a scalable and flexible solution. Distributed services cards (DSCs) are deployed at the edge of compute and storage resources in the data center and managed by a central policy services manager (PSM). This article will focus on the architecture and performance of the first two generations of Pensando ASICs, implemented in 16- and 7-nm processes.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3058560
Date of publication 10 February 2021; date of current version 26 March 2021.

ARCHITECTURE
The distributed services ASIC connects directly to the data center Ethernet network near the compute edge. It can operate as a standalone, bump-in-the-wire device or as an endpoint controller, connecting servers or endpoint devices to the data center network via PCIe ports. Inside the ASIC, network traffic flows through multiple P4 pipelines interconnected via a central packet buffer. P4 programs process packets in-line and may divert packets or flows to on-chip offload engines and CPUs, as well as to PCIe-attached hosts or target devices. In the ASIC block diagram (Figure 1), the blue blocks (lower half) primarily handle packet-based traffic and services, while the green blocks (upper half) are primarily memory-transaction based, providing offloads and data-stream processing to memory buffers.

Life of a Packet
Packets received through the network enter on one of the Ethernet MAC interfaces and are placed in the packet buffer according to their L2 or L3 class of service (COS). The packet buffer provides pause-absorption buffering and COS-aware arbitration for the P4 ingress pipeline. Consistent with the P4 model, packets pass through the P4 ingress pipeline for flow classification, firewall, tunnel endpoint processing, and other ingress services; return to the packet buffer in case of
64-bit MPU Instructions
Internally, the MPU instruction fetch, branch, and load/store blocks are similar to existing CPU designs. The MPU differs from most CPU designs in that its instruction set architecture is based on 8-byte-wide instructions, allowing it to process networking domain jobs with higher efficiency due to the richer encoding available from wider instructions. For example, the MPU ALU operand load aligner is based on bit fields instead of byte fields, allowing any header field pair of arbitrary length and offset to be loaded with a single instruction, whereas a traditional CPU would need to mask and shift header fields for alignment. MPU ALU instructions also accept header fields directly as ALU inputs, allowing ADD, XOR, COMPARE, and similar instructions to specify operands directly from header fields, whereas a traditional CPU must load general purpose registers before operating with its ALU. For updating the PHV, the MPU has a PHV Write unit to process custom instructions capable of updating or appending to the PHV. Again, one or more header fields can be transferred directly from table data to the PHV using a single instruction, with no alignment restrictions on source or destination fields.

Hardware Queue Management
Packets which enter the P4 pipeline from the wire or which originate from internal events are placed in hardware queues. Hardware queues are used to manage interfaces, flows, connections, and any other objects which require ordered tracking and scheduling. Software can configure up to 16 million hardware queues (Figure 5). Each queue stores its current state in a DRAM-based qstate record, which includes a count of enqueued objects, pointers to arrays or linked lists of enqueued objects, connection state and peer information, and a process ID used for memory protection.

Events which enqueue new objects are signaled with a doorbell mechanism, allowing one or more objects to be added to any queue with a single doorbell action. Events which require processing in the future, such as TCP timers or activity checks, set hardware time markers for future scheduling. The queue scheduler organizes all 16 million queues into a DRAM array, tracking scheduling groups, interface groups, and COS groups. Priority scheduling, min/max data rates, and deficit weighted round robin scheduling are available across scheduling groups. When a queue is scheduled, a PHV token with the scheduled queue ID and COS is inserted into the assigned P4 pipeline for processing. The TE then accesses the qstate array, launching programs depending on queue type and state, which in turn fetch queued objects and descriptors as entries of tables in the P4 program.

Central Packet Buffer
The multiple independent P4 pipelines are interconnected via a central packet buffer. The packet buffer design is a shared memory switch, with virtual output queueing for multiple classes of service and enhanced transmission selection5 scheduling at the outputs. Multicast replication, SPAN (switch port analyzer) queues, and network pause buffering are provided by the central packet buffer. In addition, an ingress packet burst overflow feature allows short-term bursts to be written to overflow regions in DRAM memory. The central packet buffer operates in the packet domain, whereas the system-on-chip (SOC) operates in the memory transaction domain. The P4 DMA engines bridge these domains, converting packets to memory transactions.
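The mask-and-shift contrast described above can be sketched in Python. Here `load_bits` is an illustrative stand-in for what the MPU's bit-field operand load aligner accomplishes in a single instruction; the header word and field offsets below are hypothetical, not an actual Pensando encoding.

```python
def load_bits(header: int, bit_offset: int, width: int, total_bits: int) -> int:
    """Extract `width` bits starting `bit_offset` bits from the MSB end.

    A traditional CPU issues the shift and mask as separate instructions
    after loading the word into a general-purpose register; the MPU does
    the equivalent as part of one instruction, for any offset and width.
    """
    shift = total_bits - bit_offset - width
    mask = (1 << width) - 1
    return (header >> shift) & mask

# Example: a 32-bit word holding a hypothetical 12-bit field at bit offset 4.
word = 0x0ABC0000
field = load_bits(word, 4, 12, 32)   # 0xABC
```

Because the extraction is parameterized by bit offset and width rather than byte boundaries, any header field pair of arbitrary length and alignment can be fed directly to the ALU, which is the property the text attributes to the MPU.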
SOC, NOC, ARM Processors
A full SOC subsystem with multicore ARM A-72 CPUs is connected to a hardware coherent, shared cache and memory hierarchy. Processors, memories, P4 pipelines, offload engines, and external devices communicate through a high-speed network-on-chip (NOC). The P4 pipelines are tightly coupled to the ARM processors, supporting direct delivery of packets, headers, and/or metadata to the ARM L2 caches. This tight coupling allows packet and flow processing to be passed back and forth between P4 pipelines, offload engines, and ARM processing services with low overhead. Tasks which are well suited to P4 domain-specific processing are performed in the P4 pipelines and offload engines, including packet parsing, encapsulation, cryptography, compression, classification, stateful firewalls, telemetry, and data stream segmentation and reassembly. Higher level tasks better suited to general purpose processing, such as exception handling, application level processing, and storage volume management, are performed by the ARM cores or host CPUs. P4 programming determines which portions of data buffers, header buffers, and completion information to deliver to the DRAM, SOC last level cache, or ARM L2 on a per-application, per-packet basis. For chained offload operations, which may include encryption, compression, and data integrity operations mixed in with P4 control functions, a 4-MB chaining buffer is attached directly to the NOC to support high bandwidth, multihop offload chaining.

Root of Trust and Cryptography
Pensando ASICs provide a root of trust based on a physical unclonable function and an ARM TrustZone core running from protected memory. Boot loaders and operating systems are authenticated before execution, providing a secure root of trust that is not dependent on or vulnerable to attached hosts. Within the data center, central controller software securely communicates with trusted agents running on the ARMs at each DSC to enforce data-center-wide policies.

Secure protocols including IPsec and TLS proxy can be initialized by ARM processes; P4 programming then controls packet-by-packet flow processing using keys stored in secure on-chip memory. Cryptography operations are performed by internal hardware offload engines under the direction of P4-created descriptor lists. Cryptography hardware includes a hardware entropy source driving a NIST SP 800-90-compliant deterministic random bit generator, and public key exchange hardware to assist with connection setup. Cryptography offloads for AES-GCM, AES-XTS, AES-CBC, AES-CCM, ChaChaPoly, HMAC, SHA3-512, and SHA2-256 are available to provide high-performance security processing.

Storage Offloads
Storage virtualization and management is provided with a combination of P4 programs and ARM code. Individual packet and transaction processing is handled by P4, while ARM programs handle connection establishment and volume management. P4 device driver programs present standard NVMe device interfaces to the host and DMA descriptors and data to the ASIC for offload processing or transport. Storage offload engines are available to P4 and ARM processors, and can be mapped to external hosts as a PCIe-based accelerator endpoint. Dedicated storage offload engines include compression and decompression at 100 Gb/s using the LZRW1A, GZIP, or Deflate algorithms. Reed–Solomon erasure coding engines with up to 12 data and 4 parity blocks operate at 100 Gb/s. Data integrity engines operate at 200 Gb/s for multiple algorithms, including CRC64, M-Adler32, CRC32C, and Adler-32. Deduplication support is based on SHA2 and SHA3 engines.

PCIe Ports and Virtualization
PCIe virtualization hardware allows software running on the SOC to configure the number and type of devices presented to the host, including configuration space, BAR (base address register) ranges, and virtual function counts. This allows multiple network, storage, and host-visible offload devices to be presented and attached to different processes or guest OS instances. PCIe lanes can be bifurcated into multiple ports; each port can operate as a virtual endpoint switch or a root complex. If multiple virtual switch ports are configured, multiple hosts can attach over separate PCIe lanes and share networking services as if they were local to each host. Internal ASIC COS buffering resources enforce fairness between hosts, preventing noisy neighbors from impacting adjacent hosts. Alternatively, if a PCIe port is configured as a root complex, storage devices such as NVMe drives are controlled by the SOC and virtualized to local and remote hosts. Other PCIe devices, including GPUs and machine learning accelerators, can also be controlled by the ASIC.

P4-16 COMPILER
Pensando has developed a compiler which accepts P4-16 imperative code and generates parser sequence commands, TE key generation, and table match
parameters, as well as MPU action instruction sequences. In order to quantitatively measure the compiler's capability of generating optimal code for the P4 pipelines, we compared the performance of two implementations of the same network functionality: termination of a VXLAN overlay with flow-based packet routing. A production-grade implementation consists of hand-written MPU assembly code. The compiler implementation is coded in P4. Several performance tests were run with various traffic patterns presented to the ASIC using both implementations, and the number of forwarded packets per second was measured. Overall, the P4 implementation achieves 85% of the performance of the hand-written assembly implementation. This demonstrates that the domain-specific compiler is an effective and efficient tool to turn P4 programs into machine code for Pensando's packet forwarding pipelines.

As generations of the Pensando domain-specific architecture progress, hardware features are added to improve capabilities, performance, and efficiency. These improvements can be in the MPU instruction set, conditions causing pipeline hazards or stalls, programmable parser control, hash table options, or other areas. The P4 compiler abstracts these implementation-related changes away, allowing imperative, user-created P4 code to take advantage of new ASIC generations without source code changes.

ASIC IMPLEMENTATIONS
Pensando has completed the design of two generations of ASIC implementations and is currently architecting a third generation. All ASICs are software compatible and features are forward compatible. Table 1 summarizes implementation details of the first ASIC generation, codenamed Capri, and the second ASIC generation, codenamed Elba. Architectural fidelity is preserved across generations, so many improvements are focused on increased scale and performance. In particular, the move from HBM memory in Capri to DDR memory in Elba was done to significantly increase the number and size of P4 tables supported, while the reduced memory bandwidth was compensated for with larger caches and more latency tolerance built into the P4 pipelines.

PERFORMANCE
The performance results in Table 2 apply to a "bump in the wire" full service forwarding application, which includes a stateful firewall per connection, telemetry collection on every flow, and connection establishment/teardown managed by ARM CPUs while datapath operations are handled in the P4 pipeline. Latency and jitter were measured on the wire by lab testing devices.

Comparing performance with available solutions is difficult, as few publish results with identical service features, but solutions offering a subset of these features show a fraction of the delivered MPPS6 as compared to the DSC. Connection-per-second establishment and tracking performance of the DSC is one order of magnitude higher than available solutions7 and similar to large, multiport service appliances.8

CONCLUSION/ACKNOWLEDGMENTS
Modern data centers are facing rapidly evolving needs in security, networking, telemetry, storage, and scale which are not addressed by current equipment. Pensando has developed a distributed services architecture to address these data center needs while providing user programmability based on a domain-specific, open source, permissively licensed P4 language. The first two generations of this architecture are realized in ASICs fabricated in 16- and 7-nm processes.

… infrastructure, testing, and support which make these products possible. We look forward to delivering future generations of this architecture to meet the evolving data center scale, security, and innovation challenges.
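As a rough illustration of the match-action model that the P4 pipelines discussed in this article implement, the following Python sketch builds a lookup key from parsed header fields, matches it against a flow table, and applies the bound action. The key fields, actions, and table contents are hypothetical, not Pensando's actual P4 programs.

```python
def make_flow_key(pkt):
    """Build a lookup key from parsed header fields (illustrative fields)."""
    return (pkt["src_ip"], pkt["dst_ip"], pkt["proto"])

def forward(pkt, port):
    """Action: send the packet out a given egress port."""
    pkt["egress_port"] = port
    return pkt

def drop(pkt):
    """Default action when no table entry matches."""
    pkt["egress_port"] = None
    return pkt

def apply_stage(pkt, table):
    """One match-action stage: look up the key, apply the bound action."""
    action, params = table.get(make_flow_key(pkt), (drop, {}))
    return action(pkt, **params)

# A hypothetical flow table installed by the control plane.
flow_table = {
    ("10.0.0.1", "10.0.0.2", "tcp"): (forward, {"port": 3}),
}

pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": "tcp"}
out = apply_stage(pkt, flow_table)
```

In the real hardware, the table lookup is performed by the table engine and the action body runs as an MPU program; chaining several such stages gives the ingress pipeline's classification, firewall, and tunnel-termination steps.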
Compute demand of AI is skyrocketing at a rate that far outpaces the compute density improvements that can be gained by Moore's Law alone3,4 and approaches based on monolithic shared memory models. We have chosen to attack this challenge via two approaches: dynamic computation and massive scaleout. We design dynamic computation to enable a wide range of techniques that intelligently "forgo unnecessary computation" or "compute only what is relevant to input," akin to what our brains do.

Large clusters are already the norm for training of AI models, while inference for some large models also requires multi-device execution. The shared memory paradigm cannot enable the required scale, which necessitates a paradigm shift to a multicore private-memory model, a foundation of our scaleout architecture. On top of this, we build a push-based data movement in which data transfers are explicitly planned and controlled.

These two approaches can be synergistically combined to take the current steep slope of increasing AI compute and storage requirements and reduce it down to something much more compatible with Moore's Law.

In the "Hardware" section, we present the details of our hardware, with primary focus on the Grayskull device. The "Software" section presents our software stack. The "Full Stack Performance Optimizations" section deep dives into several full stack performance optimizations enabled by our hardware and software. The "Dynamic Execution" section summarizes various dynamic execution approaches enabled by our architecture. Performance results are presented in the "Results" section. Finally, the "Conclusion" section concludes this article.

HARDWARE
Devices
Over the last four years, Tenstorrent designed three chips, shown in Figure 2 and summarized in Table 1. Jawbridge is a small 14-nm test chip, containing six first-generation Tensix cores.

Grayskull, shown in Figure 1, is our first production chip in 12-nm technology, and is currently in evaluation with multiple customers. It is the first incarnation of our large cluster-on-a-chip multicore architecture and is composed of 120 compute cores. The physical area of the 10×12 grid of cores is 477 mm². Each core operates independently; it has its own unique instruction queue and progresses at its own pace, in contrast to monolithic chip-scale SIMD, VLIW or single-kernel

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. Digital Object Identifier 10.1109/MM.2021.3061912. Date of publication 9 March 2021; date of current version 26 March 2021.
FIGURE 2. Tenstorrent devices. (a) Jawbridge (2019). (b) Grayskull (2020). (c) Wormhole (2021).
384-GB/s read/write bandwidth. The memory space is primarily used by the local core, but it is directly accessible by remote cores as well.

Packet Compute Engine
The packet compute is a SIMD-based matrix and vector engine with a high degree of flexibility and programmability. Peak compute density is 3 TOPs at 8-bit precision, or 0.75 TFLOPs at 16-bit floating-point precision. The packet compute engine is software-programmable via the associated RISC cores, which execute kernels written in standard C language and issue matrix and vector instructions to the engine. A large number of PyTorch and TensorFlow deep learning instructions are supported. The matrix and vector engine supports operations on a range of integer and floating-point formats. Most importantly, it natively handles sparse computation to achieve speedup, reduce power, and shrink memory footprint.

The packet compute engine does not have a global view of execution across the multicore system. Its operation is driven primarily by the packet manager: incoming packets from the packet manager are computed and returned to the packet manager for storage or data transfer.

Packet Manager Engine
The packet manager, depicted in Figure 3, is composed of a data transfer engine, a router, and a tensor manipulation engine.

The data transfer engine is responsible for executing all data movement and synchronization among the compute engines, as well as between on-chip SRAM, off-chip memory, and I/O. The packet manager and compute engine each receive their own unique instruction queues from the compiler, and they execute concurrently. The packet manager completely de-burdens the compute engine from the complexity of data movement and multicore synchronization. These features of the packet manager realize the push-based transfer model, which maximizes the overlap between compute and data transfers.

The router moves packets across the NoC. It provides guaranteed ordering, manages flow control and backpressure, and has deadlock-free operation. It is also optimized for the way AI workloads are parallelized across our multicore architecture and has efficient multicast and gather capabilities.

Finally, the tensor manipulation engine can perform dynamic packet compression; the smaller memory footprint enabled by compression results in faster data transfers and an increase in data locality. Furthermore, tensor manipulation instructions can be executed by this engine, as described in the "Optimization of Tensor Manipulation Instructions" section.

SOFTWARE
The software is composed of three main pieces:

1. The machine learning framework integration and plugin software.
2. The runtime software, executing on the RISC processors.
3. The ahead-of-time graph compiler.

Framework Integration
The Tenstorrent compiler and runtime have been natively integrated into PyTorch, and support both inference and training flows. The user can target execution on a single device or a multidevice cluster. In either case, the hardware is visible to the user as a single device. The multidevice scheduling and parallelization are orchestrated behind the scenes by the software stack. In addition, the software stack can execute
ONNX networks as well, enabling a funnel from the frameworks that export into the ONNX format.

Graph Compiler
The graph compiler is composed of three main components: the front end, the optimizer, and the back end. The primary role of the front end is to lower a wide range of instructions to a smaller number of optimized instructions supported by the hardware. Instructions are parallelized and scheduled onto the device cores by the optimizer, which maximizes performance by balancing compute, data locality, and data movement. The back end translates the compiled graph down into instruction queues for each core.

The packet managers and the NoC connecting the cores are visible to the software, and the data movement and synchronization are both controlled explicitly by the compiler. To schedule the data movement, the compiler packetizes each tensor by splitting it into "mini-tensors," and each mini-tensor is combined with a packet header. Each packet header contains a unique packet ID, and all data is referenced via unique packet IDs. The header also contains routing information, enabling the packet manager to perform the desired data transfers between the cores across the NoC.

Runtime Software
The runtime software runs concurrently on RISC processors within every core. The compiled executable contains instruction queues for the packet processor and the packet manager of every core. The runtime software manages the queues, and dispatches instructions to the packet compute and the packet manager. Buffers containing packets are dynamically allocated and de-allocated during runtime. The runtime software works in tight collaboration with the packet manager to store packets into the allocated buffers. The runtime also controls the storage target, allowing buffers that do not fit into a core's local SRAM to spill to either remote SRAM or to an off-chip memory.

The architecture also supports various types of conditional execution, such as if-statements and for and while loops. The runtime software interprets the instruction queues generated for each core and can execute jumps to a specific instruction in the instruction queues to reflect control flow decisions.

FULL STACK PERFORMANCE OPTIMIZATIONS
Optimization of Data Transfers: The Push-Based Model
Traditional multicore devices operate on a pull-based data transfer model. For example, when a compute core is ready to begin computing, as a first step it issues a request to copy remotely stored data (from another core's cache, or from DRAM) into its local memory or cache. After the copy has been completed, the compute core starts computing. The read request latency, combined with a data transfer through a potentially congested NoC or memory port, could result in the consumer core being idle while waiting for its data to arrive.

In contrast, our architecture operates on a push-based data transfer model. A core that produces an output buffer is aware of the consumer core that needs to receive it. Instead of waiting for the consumer core to issue a remote read request, the producer core proactively copies the buffer to the consumer core. This approach minimizes the idle time of the consumer core.

The data transfer engine executes all the required flow control for the push-based data transfer model. It receives an instruction queue from the graph compiler containing information about the producer-consumer connectivity. The instructions enable the data transfer engine to execute data transfers using a number of multicore synchronization instructions, including exchange of data transfer status, such as data-ready or memory-space-ready.

Optimization of Tensor Manipulation Instructions
Instructions that make up deep neural networks fall into two main categories: 1) math instructions, and 2) tensor manipulation (TM) instructions. TM instructions do not modify the data inside the tensor, but simply reshuffle the tensor contents. Common TM instructions in NLP networks include reshape, transpose, flatten, and permute.

The TM reshuffling is performed on the intermediate activation data, and hence must be executed during runtime. One implementation approach is to issue and execute each TM instruction independently to hardware during runtime. This typically involves a specific read/write pattern from/to memory, where the patterns match the particular TM to be implemented. GPUs execute TMs using this approach, shown in Figure 5(c), which idles compute cores while performing potentially complex memory access.

In contrast, our architecture overlaps the execution of math instructions performed by the compute engine and the TM instructions performed by the tensor manipulation engine. This process is facilitated by the graph compiler. The execution trace in Figure 5(b) shows the overlapping of compute and TMs. The
MM_1 compute instruction and the Reshape and Transpose instructions execute concurrently, in a pipelined fashion.

The tensor manipulation engine is programmable: it receives its own unique instruction queue from the compiler. TM instructions are executed using a combination of two methods. First, the engine contains a small storage that it uses as a scratch pad to load a packet and reshuffle it in place. Second, it can execute complex memory read/write patterns. Using a combination of these approaches, any tensor manipulation instruction can be implemented inline as the packets are being streamed out of the packet compute engine and being written into local SRAM.

Flexible Scheduling and Parallelization
The Tenstorrent architecture unlocks a tremendous amount of concurrency. All building blocks receive their own unique instruction queues from the compiler and can progress at their own pace. As a result, the overlap between compute and data transfers is maximized.

However, any single parallelization approach eventually plateaus, hence the desire to support flexible parallelization approaches along all available dimensions for any given compute layer. Each individual deep learning operation can be parallelized across a variable number of cores, combining a number of parallelization approaches. In addition, operations can run in parallel, be pipelined, or run sequentially across the many cores of a device, as shown in Figure 6.

DYNAMIC EXECUTION
Dynamic execution is an umbrella term representing various approaches that reduce the computational complexity of a network at runtime. Some approaches can be represented within the topology of the network itself, such as Mixture-of-Experts (MoE), while others can be used to augment the network execution during runtime. Four approaches enabled by the Tenstorrent architecture are described next.

Block Sparsity
Tensors feeding into math operations within networks contain a variable amount of sparsity within them. Certain models have been tuned to take advantage of sparsity of trained parameters,2 or take advantage of block sparsity in model parameters.5 However, these approaches do not tap into a large potential of sparsity in activations, which can be inherent or induced at runtime.7 To fully utilize this potential, in addition to model parameter sparsity, our architecture supports block sparsity of activations, which enables quadratic gains from run-time activation sparsity.

Dynamic Precision
Similar to scientific computing applications, numerical precision can be traded off for an increase in performance and a reduction in power. The Tenstorrent architecture enables numerical precision to be set at a fine-grain level, per each packet in the neural network. The setting can be specified both ahead-of-time by the compiler, as well as during runtime.
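The "quadratic gains" from combining parameter and activation sparsity can be illustrated with a small sketch: a block-level multiply can be skipped whenever either the weight block or the activation block is all zero, so at density d in each operand only about d² of the block multiplies survive. The 0/1 block masks below are illustrative, not a real network's sparsity pattern.

```python
def blocks_computed(weight_blocks, act_blocks):
    """Count block multiplies that cannot be skipped.

    A block is skipped when either operand block is all zero
    (1 = nonzero block, 0 = all-zero block).
    """
    return sum(1 for w, a in zip(weight_blocks, act_blocks) if w and a)

# Illustrative 50% block density in each operand.
weights = [1, 0, 1, 0, 1, 0, 1, 0]
acts    = [1, 1, 0, 0, 1, 1, 0, 0]
done = blocks_computed(weights, acts)   # only 2 of 8 block multiplies remain
```

With 50% density in each operand, roughly 0.5 × 0.5 = 25% of the block-level work survives, which is the compounding effect the text describes for run-time activation sparsity on top of parameter sparsity.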
RESULTS
We measured a baseline result for BERT-base in BFLOAT16 precision at 2830 sequences/s. Significant speedup is achievable by applying two optimizations on top of this baseline: dynamic activation sparsity and use of 8-bit floating point. In our experiments, we observe that 75% sparsity in activations (induced dynamically at runtime) results in a 4x speedup on BERT layers. Similarly, we observe that using block-based 8-bit floating point precision provides a factor of two. The two optimizations can be combined synergistically: they both reduce the activation memory footprint linearly for a total of 8x reduction, and 8-bit floats reduce the model parameter memory footprint by 2x. This enables the majority of the model parameters to fit on chip, allowing the sparsified layers to be fed from local SRAM. Realizing this on an entire BERT-base is work in progress, and we project a score of 23 345 sequences/s.

CONCLUSION
We solve the private memory parallel computing problem and tensor manipulations in a way that removes communication, synchronization, and data shuffle bottlenecks and enables keeping all the compute units highly utilized.

REFERENCES
1. N. Shazeer et al., "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," in Proc. Int. Conf. Learn. Representations, 2017. [Online]. Available: https://arxiv.org/abs/1701.06538
2. V. Sanh et al., "Movement pruning: Adaptive sparsity by fine-tuning," 2020. [Online]. Available: https://arxiv.org/abs/2005.07683
3. OpenAI, "AI and compute," 2018. [Online]. Available: https://openai.com/blog/ai-and-compute/
4. Stanford HAI, "Artificial intelligence index report 2019," 2019. [Online]. Available: https://hai.stanford.edu/sites/default/files/ai_index_2019_report.pdf
5. T. Gale et al., "Sparse GPU kernels for deep learning," 2020. [Online]. Available: https://arxiv.org/pdf/2006.10901.pdf
6. "Fast block sparse matrices for PyTorch," 2020. [Online]. Available: https://github.com/huggingface/pytorch_block_sparse
7. Z. Chen et al., "You look twice: GaterNet for dynamic filter selection in CNNs," 2019. [Online]. Available: https://arxiv.org/pdf/1811.11205.pdf
8. A. Karpathy, "Software 2.0," 2017. [Online]. Available: https://medium.com/@karpathy/software-2-0-a64152b37c35
Five years ago, few would have predicted that a software company like Google would build its own chips. Nevertheless, Google has been deploying chips for machine learning (ML) training since 2017, powering key Google services. These tensor processing units (TPUs) are composed of chips, systems, and software, all codesigned in-house. This article details the circumstances that led to this outcome, the challenges and opportunities observed, the approach taken for the chips, a quick review of performance, and finally a retrospective on the results. A companion paper describes the supercomputers built from these chips, the compiler, and its performance in detail.6

FORCES PUSHING TOWARDS CUSTOM ML HARDWARE
In 2013, only one year after AlexNet swept the ImageNet competition,4 Google leaders predicted ML would soon be ready to attack difficult problems like production versions of image and speech recognition. Alas, these ML models were so computationally expensive that a sustainable service was nearly impossible, as Internet service costs scale by the number of users, who grow as the service improves. The motivation to slash ML inference serving costs led to the TPUv1 inference system, deployed successfully in 2015.5

TPUv1 exposed the next bottleneck: ML training. The team quickly pivoted into building the TPUv2 training system. Two years later, TPUv2 powered key Google services with fast and cost-effective training.

CHALLENGES AND OPPORTUNITIES OF BUILDING ML HARDWARE
ML training brings challenges relative to ML inference:

› More computation. More means both the types of computation—for example, backpropagation requires matrix transposition and loss functions—and the amount of computation. An inference calculation typically executes on the order of 10^9 floating point operations, but Google's production training applications require 10^19–10^22; more than 10 orders of magnitude larger!
› More memory. During training, temporary data are kept longer for use during backpropagation. With inference, there is no backpropagation, so data are more ephemeral.
› Wider operands. Inference can tolerate int8 numerics relatively easily, but training is more sensitive to dynamic range due to the accumulation of small gradients during weight update.
› More programmability. Much of training is experimentation, meaning unstable workload targets such as new model architectures or optimizers. The operating mode for handling a long tail of training workloads can be quite different from a heavily optimized inference system.
› Harder parallelization. For inference, one chip can hit most latency targets. Beyond that, chips can be scaled out for greater throughput. In contrast, exaflops-scale training runs need to produce a single, consistent set of parameters across the full system, which is easily bottlenecked by off-chip communication.

These problems felt daunting initially, plus we had constraints on time and staffing. Time matters because each day saved during development accelerates our production training pipeline a day. And as for staffing: while Google is teeming with engineers, they are not all available for our project. Ultimately, the TPU team had only a cup from Google's large engineering pool. Thus, we had to be ambitious to overcome the complexities of training, but the time and staffing budget set constraints.
0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3058217
Date of publication 9 February 2021; date of current version 26 March 2021.

To prioritize, we sorted activities into two buckets: those where we must do a great job, and those that we only have to do good enough. The first bucket included the following:
① Build quickly.
② Achieve high chip performance.
③ Scale efficiently to numerous chips.
④ Work for new workloads out of the box.
⑤ Be cost effective.

Everything else was in the second bucket. While tempting to brush these second-bucket issues aside as minor embarrassments, the reality is that building and delivering a good system is as much about what you decide not to do as what you decide to do. In retrospect, these decisions are not embarrassing after all!

We refer to the relevant goals using the circled numbers (e.g., ②) throughout the discussion to highlight "first bucket" design decisions.

OUR APPROACH TO ML HARDWARE

TPUv1 provided a familiar starting point for our training chip [see Figure 1(a)]. The high-bandwidth loop (red) identifies the important part of TPUv1: the core data and computation loop that crunches neural network layers quickly. DDR3 DRAM feeds the loop at much lower bandwidth with model parameters. The PCIe connection to the host CPU exchanges model inputs and outputs at even lower bandwidth.

Figure 1 shows five piecewise edits that turn TPUv1 into a training chip. First, splitting on-chip SRAM makes sense when buffering data between sequential fixed-function units, but that is bad for training, as on-chip memory requires more flexibility. The first edit merges these into a single vector memory [see Figure 1(b) and (c)]. For the activation pipeline, we moved away from the fixed-function datapath (containing pooling units or hard-coded activation functions) and built a more programmable vector unit [see Figure 1(d)] ④. The matrix multiply unit (MXU) attaches to the vector unit as a coprocessor [Figure 1(e)].

Loading the read-only parameters into the MXU for inference does not work for training. Training writes those parameters, and it needs significant buffer space for temporary per-step variables. Hence, DDR3 moves behind Vector Memory so that the pair form a memory hierarchy [also in Figure 1(e)]. Adopting in-package HBM DRAM instead of DDR3 upgrades bandwidth twentyfold, critical to utilizing the core ②.

Last is scale. These humongous computations are much bigger than any one chip. We connect the memory system to a custom interconnect fabric (ICI, for InterChip Interconnect) for multichip training [see Figure 1(f)] ③. And with that final edit, we have a training chip!

FIGURE 2. TPUv2 datapath in more detail, showing both cores, which appear as a single PCIe device.

Figure 2 provides a cleaner diagram, showing the two-core configuration. The TPUv2 core datapath is blue, HBM is green, host connectivity is purple, and the interconnect router and links are yellow. The TPUv2 Core contains the building blocks of linear algebra: scalar, vector, and matrix computation.

Why two cores? The simpler answer is that we could fit a second core into a reasonably sized chip ②⑤. But then why not build a single, bigger core? Wire latencies grow as the core gets bigger, and the two-core configuration hits the right balance between reasonable pipeline latency and additional per-chip computation capacity. We stopped at two bigger cores because they are easier to program, as they allow the developer to reason about one unified instruction stream operating on big chunks of data ④, rather than exposing lots of tiny cores that need to be lashed together. We took advantage of training fundamentally being a big-data problem.

The following sections dive deeper into key chip components.

… (The compiler's name is XLA, for accelerated linear algebra.6) In these discussions, we landed on two important themes. First, a VLIW architecture was the simplest way for the hardware to express instruction-level parallelism and allowed us to utilize known compiler techniques ①②. Second, we could ensure generality by architecting within the principled language of linear algebra ④. That meant focusing on the computational primitives for scalar, vector, and matrix data types.

Scalar Computation Unit

The scalar unit is where computation originates. It fetches complete VLIW bundles from a local instruction memory, executes the scalar operation slots locally, and then forwards decoded instructions on to the vector and matrix units, where execution happens later, decoupled from scalar execution. The VLIW bundle is 322 bits and is composed of two scalar slots, four vector slots (two used for vector load/store), two matrix slots (a push and a pop), one miscellaneous slot (a simple example would be a delay instruction), and six immediates.

Figure 3(a) shows a diagram of the scalar unit. At the top left is the instruction bundle memory. While an instruction cache backed by HBM would have been nice, a DMA target for software-managed instruction overlays was easier ①. It is not the flashiest solution, …
FIGURE 3. (a) TPUv2 core scalar unit and (b) single vector lane. The vector unit contains 128 vector lanes.
… the memory system, primarily to HBM. That feeds into local Scalar Memory SRAM to issue loads and stores against. They feed a 32-deep register file containing 32-bit words, which finally feeds into a dual-issue ALU at the top right. Execution interlock is managed by standard hold conditions on the instructions, while synchronization flags provide a way to interlock against software-managed DMA operations. The memory hierarchy is under the control of the compiler, which simplifies the hardware design while delivering high performance ①②⑤.

Vector Computation Unit

After scalar execution, the instruction bundle and up to three scalar register values are forwarded to the vector unit. Figure 3(b) shows a single vector lane; the entire vector unit contains 128 such vector lanes. Each lane contains an additional 8-way execution dimension called the sublane. Each sublane contains a dual-issue 32-bit ALU connected into a 32-deep register file. All together, the vector computation unit allows operation on eight sets of 128-wide vectors per clock cycle. Sublanes let TPUv2 increase the vector-versus-matrix compute ratio, which is useful for batch normalization.6

Each lane's register files perform loads and stores against the lane's local slice of vector memory, and that memory connects into the DMA system (primarily providing access to HBM).

On the right of Figure 3(b) is the connectivity into the matrix units. The push instruction slots send data vectors out to the matrix units. A result FIFO captures any returning result vectors from the matrix units, and these can be popped into vector memory using the pop instruction slots. The result FIFO lets us avoid strict execution-schedule constraints for the long-latency matrix operations and shorten register lifetimes, simplifying the compiler ①.

Matrix Computation Units

The matrix multiply unit is the computational heart of the TPU. It is a 128 × 128 systolic array of multipliers and adders, delivering 32,768 operations per cycle ②. The inputs are two matrices: a left-hand-side matrix and a right-hand-side matrix. The left-hand-side matrix streams over a preloaded right-hand-side matrix to create a streaming result matrix, which is sent directly to the vector unit's result FIFO. The right-hand-side matrix can be optionally transposed when loaded into the matrix unit.

Critically, the systolic array structure provides high computational density ②. While it performs most of the FLOPS per second, it is not the largest contributor to chip area (see Figure 4) ⑤.

Numerics shape computation density. Multiplications use bfloat16: this is a 16-bit format that has the same exponent range as float32 but with fewer bits of mantissa. The accumulation happens in full 32-bit floating point.

Bfloat16 works seamlessly for almost all ML training, while reducing hardware and energy costs ②⑤. Our estimate for the more recent 7 nm is that bfloat16 has a 1.5× energy advantage over the IEEE 16-bit float: 0.11 versus 0.16 pJ for add and 0.21 versus 0.31 pJ for multiply. Moreover, bfloat16 is easier for ML software to use than fp16, since developers need to perform loss scaling to get fp16 to work.9 Many are willing to do the extra programming to go faster, but some are not. For example, all but 1 of the 200 ML experts at the Vector Institute rely on the slower 32-bit floating point in GPUs.8 We know of no downsides for bfloat16 versus fp16 for ML. As it takes less area and energy and is easier for ML software to use ②④, bfloat16 is catching on. TPUv2 was the first, but ARM, Intel, and NVIDIA have subsequently embraced bfloat16.
Matrix multiplication is not the only important matrix transformation, so another set of units efficiently perform other matrix operations like transposes, row reductions, or column permutations.

TPUV2 MEMORY SYSTEM

As discussed earlier, the TPUv2 Core has a variety of SRAM-based scratchpad memories. These are architecturally visible to software and accessed using loads and stores. This approach gives the compiler predictable execution control over both computation and SRAM and simplifies the hardware ②④.

But the memory story cannot stop with SRAM, because most models would not fit. Figure 2 shows that high-bandwidth, in-package HBM backs up the SRAM. The compiler moves data between HBM and SRAM using asynchronous DMAs. Since HBM holds vectors and matrices, DMAs can stride through memory to allow slicing off subdimensions of larger dimensional structures to reduce the DMA descriptor overhead. When a DMA finishes, a completion notification lands in the core's sync flags, allowing the program to stall until data arrives.

Taking a step back, this approach provides a clean partitioning of concerns. The core (blue in Figure 2) provides predictable, tightly scheduled execution ②. The memory system (in green) handles asynchronous prefetch DMA execution from the larger HBM memory space ⑤. The hand-off between the regimes is managed with sync flags.

Ultimately, the primary goal of the memory system is to keep the datapath fed ②, so HBM's high bandwidth is critical. At 700 GB/s per chip, the HBM of TPUv2 provides 20× the bandwidth of the pair of DDR3 channels in TPUv1. This increase allows full computation utilization at much lower data reuse rates ④.

Zooming out to the chip-level memory system, Figure 2 shows each core is connected to half of the chip's HBM. The split HBM memory space might be a bit surprising, but the reason is simple: we wanted to build this chip quickly ①, this approach simplified the design, and it was good enough.

Each core also has a set of PCIe queues to exchange data with the host. Between the two cores is the interconnect router that also connects to the off-chip links.

TPUV2 INTERCONNECT

A dedicated interconnect is the foundation of the TPUv2 supercomputer ("pod"). TPUv1 was a single-chip system built as a coprocessor, which works for inference. Training Google production models would require months on a single chip. Hence, TPUv2s can connect into a supercomputer, with many chips working together to train a model.

The chip includes four off-chip links and two on-chip links connected with an on-chip router. These four links enable the 2-D torus system interconnect, which matches many common ML communication patterns, like all-reduce. The four external links deliver 500 Gb/s, and the two internal links are twice as fast. Some torus links are within the TPU tray, and the rest are through cables across trays and racks.

An important property of interconnect design is ease of use ④: DMAs to other chips look just like DMAs to local HBM, albeit with a push-only restriction for simplicity ①. This common DMA interface allows the compiler to treat the full system's memory space consistently.

The dedicated TPU network enables scalable synchronous training across TPUs. Asynchronous training was the state of the art previously, but our studies showed synchronous converges better ④; async primarily allowed broader scaling when networking was limited.

We connect the TPU pod to storage over the datacenter network to feed the input data for the model via a PCIe-connected CPU host. The system balance across CPU, network, and storage is critical to achieve end-to-end performance at scale ③. The PCIe straw is tiny (16 GB/s per chip), the in-package bandwidth is huge (700 GB/s per chip), and the interconnect links are somewhere in the middle (4 × 60 GB/s). Building TPUv2 to be flexible and programmable allows operation on all data locally instead of moving data back to the host through a bandwidth-constricted PCIe bus ②.

TPUV2 FLOOR PLAN

The floorplan in Figure 4 is not stylish, but it was good enough and allowed us to focus on more important things ①. Most area is for the computation cores in blue, with noticeable area for the memory system and interconnect. The two cores are split across the top and bottom, with the interconnect router sitting in the donut hole in the middle. The white areas are not empty, but filled with wiring.

The two matrix multiply units are at the top center and bottom center. The FLOP/second delivered in such a small area demonstrates the computation density available with the bfloat16 systolic array ⑤.

TPUV3

We hoped to avoid the second-system effect2; we did not want to blow everything we worked hard for in TPUv2 by building the kitchen sink into TPUv3. TPUv3 is a "mid-life kicker" that leveraged what we already built—both use 16-nm technology—but let us pick the low-hanging fruit left behind by the quick development of TPUv2 ①. The most important enhancements were the following.

› It seems quaint now, but we thought the 256-chip TPUv2 system was huge. ML's voracious appetite continues,1 so moving to 1024-chip systems was critical ③.
FIGURE 5. (a) Roofline Models11 for TPUv2 and TPUv3 (left). (b) Roofline Models of V100 for fp16 and fp32 (right). RNN0/RNN1
move up and to the right in (a) since the larger HBM capacity of TPUv3 enabled bigger batch sizes; all others use the same batch
size. MLP0/MLP1 could not run on Volta in (b) because the embedding tables were too big.
… convolution. However, Figure 5(b) assumes the same computation for comparison, lifting CNN0 through …

… unwilling to do the extra work of loss scaling for fp16⁹ ④.

… (see Figure 4), so doubling the MXUs worked well ⑤.

① Build quickly.
… chip layout, and using FIFOs to simplify XLA compiler scheduling.

② Achieve high performance.
Matrix computation density, high HBM bandwidth, and XLA compiler optimizations deliver excellent performance.

③ Scale up.
Our system-first approach and simple-to-use interconnect let TPUv3 scale natively to 1024 chips and deliver nearly linear speedup for production applications.

④ Work for new workloads out of the box.
To support the deluge of training workloads, we built a core grounded in linear algebra that works well with the XLA compiler, and HBM ensures we have enough capacity and bandwidth to keep pace with growing models.

⑤ Be cost effective.
The matrix units are efficient, the design was simple without gratuitous bells and whistles, and we got our money's worth in performance.

As deep learning continues to evolve, and as we understand the workloads and scaling limits better, opportunities continue for further codesign across ML models, software, and hardware to improve next-generation TPUs and other domain-specific architectures.

ACKNOWLEDGMENTS

TPUv2/v3 exist thanks to the creativity and hard work of the TPU team, from chips to systems to software. The author list represents a small slice, and we are grateful for the opportunity to work together. Thanks go to Paul Barham, Eli Bendersky, Dehao Chen, Clifford Chao, Chiachen Chou, Jeff Dean, Brian Fosco, Ben Gelb, Jesse Guss, Peter Hawkins, Blake Hechtman, Mark Heffernan, Richard Ho, Robert Hundt, Michael Isard, Terry Kang, Fritz Kruger, Naveen Kumar, Sameer Kumar, Steve Lacy, Chris Leary, Hyouk-Joong Lee, David Majnemer, Lifeng Nai, Rahul Nagarajan, Tayo Oguntebi, Andy Phelps, Paul Rodman, Bjarke Roune, Brennan Saeta, Amir Salek, Julian Schrittwieser, Dan Steinberg, Andy Swing, Horia Toma, Shibo Wang, Tao Wang, Yujing Zhang, and many more.

REFERENCES

1. D. Amodei and D. Hernandez, "AI and compute," 2018. [Online]. Available: blog.openai.com/ai-and-compute
2. F. P. Brooks, Jr., The Mythical Man-Month: Essays on Software Engineering. London, U.K.: Pearson Education, 1975.
3. J. J. Dongarra et al., "The LINPACK benchmark: Past, present and future," Concurrency Comput., Pract. Experience, vol. 15, no. 9, pp. 803–820, 2003.
4. A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," in Proc. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
5. N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 1–12.
6. N. P. Jouppi et al., "A domain-specific supercomputer for training deep neural networks," Commun. ACM, vol. 63, no. 7, pp. 67–78, 2020.
7. N. Kumar, "Google breaks AI performance records in MLPerf with world's fastest training supercomputer," 2020. [Online]. Available: cloud.google.com/blog/products/ai-machine-learning/google-breaks-ai-performance-records-in-mlperf-with-worlds-fastest-training-supercomputer
8. J. Lin, X. Li, and G. Pekhimenko, "Multi-node BERT-pretraining: Cost-efficient approach," 2020, arXiv:2008.00177.
9. P. Micikevicius et al., "Mixed precision training," 2017, arXiv:1710.03740.
10. D. Silver et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
11. S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, 2009.
Interleaved multithreading (IMT) or barrel-processing is a simple and widely known program execution paradigm that alternates instructions belonging to different execution threads in the stages of a single-issue in-order processor pipeline.1,2,3 In this scheme, while the throughput is limited to one instruction per cycle (IPC), pipeline stalls due to interinstruction dependence are avoided without any hardware overhead for dependence management. As long as the application workload can be programmed as multiple threads, the IMT approach can sustain IPC = 1 with relatively high clock frequency and high energy efficiency, thanks to the hardware simplicity, which is a desirable goal in embedded edge-computing processors.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3050962
Date of publication 12 January 2021; date of current version 26 March 2021.

Nonetheless, to execute computationally heavy applications on the extreme edge, any processor core needs hardware acceleration support. Two broad classes of hardware acceleration exist: hardware units that autonomously execute entire computation kernels upon memory-mapped commands from the processor core, and instruction acceleration units, sometimes referred to as coprocessors, that take over complex instructions and thus are directly sequenced by the core instruction stream. Coprocessors imply less communication overhead, yet they can be efficiently exploited only within instruction set architectures (ISA) that allow extensions dedicated to particular computation domains, such as RISC-V.4

Edge computing devices regard energy efficiency as the prime concern. This article addresses the introduction of vector coprocessor acceleration in IMT cores for extreme edge computing, showing that an IMT processor has an architectural design advantage over other cores with similar IPC, which allows exploiting hardware acceleration with higher energy efficiency and speed.

In this context, we specifically address supporting accelerated vector operations, to execute ubiquitous computation kernels in edge computing applications:

› 2-D convolution, covering the broad area of deep neural network applications5;
› fast Fourier transform (FFT), typical of signal processing applications, for example, in 5G IoT devices6;
› matrix multiplication (MatMul), used in a variety of fields, predominantly in cryptography.

A typical scenario is to run homogeneous workloads on all the threads, applying the same algorithm on different input data, e.g., convoluting multiple image frames. Otherwise, one can take advantage of the multiple contexts provided in an IMT core and run a composite workload running different algorithms, e.g., transmitting an encrypted stream of preprocessed video/audio, by convoluting an image while analyzing an audio stream via FFT, then encrypting the processed data using an algorithm that heavily relies on MatMul.
In this article, we design, implement, and evaluate a whole taxonomy of coprocessor acceleration schemes for IMT cores, analyzing them for performance, area, and energy efficiency on the above application cases. The contributions of this work are as follows:

› We provide designers with a quantitative comparison between different coprocessing schemes referring to different computation kernels.
› Specifically, we identify the optimal balance between thread-level parallelism (TLP) and data-level parallelism (DLP) in the addressed scenarios.
› We demonstrate the performance and energy efficiency of the IMT approach in the target application contexts by comparing it with processor cores in the same complexity range.
› We show the potentials of an open hardware design based on the RISC-V instruction set along with its open programming environment.

BACKGROUND

Many previous works reported the design of hardware-accelerated cores in edge-computing applications.

Lim et al.7 report the design details of a low-voltage microcontroller with subword-SIMD support. This article is more general in investigating various SISD-SIMD-MIMD combinations in coprocessor design. The work by Botman et al.8 is similar and investigates ad hoc ISA encoding and pipeline stage balancing for power efficiency and introduces a dedicated coprocessor interface. Yet, the authors do not elaborate on coprocessor architectures and performance. Our work further differs from Gautschi et al.7 and Luo and Zhang8 in targeting RISC-V compliance.

Gautschi et al.9 describe a RISC-V processor with DSP hardware support, targeting near-threshold voltage operation, and in the Diet-SODA design,10 a SIMD-oriented DSP accelerator also runs in a near-threshold regime. Our study is agnostic about supply or bias voltage tuning, purely addressing DLP and TLP balancing for energy efficiency in any physical implementation, including soft-cores on FPGA, as shown in our results.

A hardware convolution engine for image processing is presented by Conti and Benini,11 focusing on the optimal buffer design to store selected portions of the input image. Chen et al.12 and Du et al.13 also present convolution accelerators, based on parallel hardware units and local data reuse. Our study adopts a different approach, based on multipurpose vector coprocessors equipped with scratchpad memories, coupled with an IMT processor, to hide memory latency.

This article builds on the activity reported by Cheikh et al.,14 which was an initial effort into designing a mathematical accelerator for a RISC-V core, and by Cheikh et al.,3 who addressed the best performing pipeline organization for an IMT RISC-V core.

THE KLESSYDRA-T IMT ARCHITECTURE

The processing core discussed in this article, named Klessydra-T13, is a parametric design implementing an IMT four-stage-pipeline RISC-V processor. It supports the RV32IMA instruction set,4 augmented by a custom extension composed of a small subset of mathematical vector instructions. The Klessydra-T13 core (see Figure 1) realizes a pure IMT paradigm as defined by the following points:

› thread context switch at each clock cycle;
› in-order, single-issue instruction execution;
› feed-forward pipeline (no hardware support for branching-hazard and data-hazard handling);
› bare-metal execution (RISC-V M mode).

The core interleaves three hardware threads (harts4) in the instruction pipeline. The register file, program counter, and CSR unit are replicated per hart. A hardware context counter (harc) switches between the hart program counters on a rotation basis to fetch instructions from the program memory. The three harts in the four pipeline stages provide a register file access fence, so that it is never possible for any two instructions to manifest a dependence hazard in the pipeline.

The T13 core includes multiple units in the execution stage, namely a load/store unit (LSU), a scalar execution unit (EXEC), and a vector-oriented multipurpose functional unit (MFU), which implements the coprocessing features. The LSU works in parallel with other units when executing store instructions, which cannot cause a write-back conflict on the register file.
The MFU is allowed to read operands from the register file, but can write results only to local scratchpad memories (SPMs). The LSU manages data transfers to/from the data memory from/to the SPMs via dedicated instructions.

The MFU executes vector arithmetic instructions, whose latency is proportional to the vector length. A hart requesting access to the busy MFU executes a self-referencing jump until the MFU becomes free, avoiding unnecessary stalls of other harts in the pipeline that are independent from the MFU being busy.

The custom instruction extension supported by the MFU and LSU is summarized in Table 1. The instructions implement vector operations without relying on a vector register file, but rather on a memory space mapped on the local SPMs, for maximum flexibility. The programmer can move vector data at any point of the SPM address space with no constraint except the total capacity of the SPMs, which in turn is a parameter of the microarchitecture design.

The coprocessor instructions are exposed to the programmer as very simple intrinsic functions, fully integrated into the RISC-V GCC compiler toolchain.

HARDWARE ACCELERATION SCHEMES

The MFU and SPMs are accessed through a scratchpad-memory interface (SPMI). The user can configure the number of parallel lanes D in the MFU, the number of MFUs F, the SPM capacity, the number of SPMs N, the number of SPMIs M, and the sharing scheme of MFUs and SPMIs among harts. The MFU is the engine that accelerates vector computations. It can operate on different integer data element widths (8, 16, 32 bits) in subword-SIMD fashion, and also in element-SIMD fashion when D is configured to multiply the execution lanes for DLP. A typical vector arithmetic operation has an initial latency between 4 and 8 cycles to access the SPM.

Each SPM has one read and one write port. The parameter D that defines the MFU lanes also corresponds to the number of SPM banks; all the banks of an SPM are accessed together as a single SPM line. When the MFU executes a vector operation, it fetches an entire SPM data line in every clock cycle, composed of multiple vector elements. A bank read rotator aligns the source operands coming from the SPM line, and a bank write rotator aligns the destination data to the correct banks in an SPM line. When the LSU fills the SPM banks with data from the 32-bit data memory port, a bank interleaver switches between the banks. The reader may refer to the work by Botman et al.14 for internal details of the units inside the MFU and SPMs.

Furthermore, the coprocessor can be configured to implement the following sharing schemes among harts.
Shared coprocessor: All the harts share a single MFU/SPM subsystem. In the case of a busy MFU, any hart wanting to access it is stalled until the MFU becomes free. In this scheme, parallel execution may occur between coprocessor and noncoprocessor instructions. Yet, the MFU/SPM may exploit pure DLP acceleration, by multilane SIMD execution.

Thread-dedicated coprocessors: A complete MFU/SPM subsystem is appointed to each hart, eliminating coprocessor contention. Stalls can only happen if two instructions of the same hart request MFU operation. This scheme can exploit DLP by multilane SIMD execution and TLP by fully symmetric MIMD execution, allowing multiple vector instructions to execute in parallel.

Thread-dedicated SPMI/shared MFU: A dedicated SPM address space is kept for each hart, while the harts share one MFU at the functional unit level. This scheme still allows interhart parallel execution of coprocessor instructions, provided they use different internal functional units of the MFU (e.g., adder, multiplier). Harts requesting a busy internal unit in the MFU get stalled until the contended unit becomes free. This scheme can exploit DLP by multilane SIMD execution, and also TLP in the form of a heterogeneous MIMD execution.

TABLE 1. Custom vector instruction extension ((r) denotes memory addressing via register r).

kmemld (rd),(rs1),(rs2): load vector into scratchpad region
kmemstr (rd),(rs1),(rs2): store vector into main memory
kaddv (rd),(rs1),(rs2): add vectors in scratchpad region
ksubv (rd),(rs1),(rs2): subtract vectors in scratchpad region
kvmul (rd),(rs1),(rs2): multiply vectors in scratchpad region
kvred (rd),(rs1): reduce vector by addition
kdotp (rd),(rs1),(rs2): vector dot product into register
ksvaddsc (rd),(rs1),(rs2): add vector + scalar into scratchpad
ksvaddrf (rd),(rs1),rs2: add vector + scalar into register
ksvmulsc (rd),(rs1),(rs2): multiply vector × scalar into scratchpad
ksvmulrf (rd),(rs1),rs2: multiply vector × scalar into register
kdotpps (rd),(rs1),(rs2): vector dot product and post scaling
ksrlv (rd),(rs1),rs2: vector logic shift within scratchpad
ksrav (rd),(rs1),rs2: vector arithmetic shift within scratchpad
krelu (rd),(rs1): vector ReLU within scratchpad
kvslt (rd),(rs1),(rs2): compare vectors and create mask vector
ksvslt (rd),(rs1),rs2: compare vector-scalar and create mask
kvcp (rd),(rs1): copy vector within scratchpad region

The explored design parameters and corresponding configurations, for reference in reporting performance results, are the following:

› M = 1, F = 1, D = 1: SISD
› M = 1, F = 1, D = 2, 4, 8: Pure SIMD
› M = 3, F = 3, D = 1: Symmetric MIMD
› M = 3, F = 3, D = 2, 4, 8: Symmetric MIMD + SIMD
› M = 3, F = 1, D = 1: Heterogeneous MIMD
› M = 3, F = 1, D = 2, 4, 8: Heterogeneous MIMD + SIMD

We use N = 3 in MatMul and convolutions, and N = 4 in FFT.

Finally, we refer to the T13 microarchitecture configured with no hardware acceleration as Klessydra T03.

PERFORMANCE RESULTS

We run a set of test programs composed of 2-D convolution, FFT, and MatMul kernels. We adopted the widely used 3 × 3 filter size on matrix sizes of 4 × 4, 8 × 8, 16 × 16, and 32 × 32 elements for convolutions. FFT was run on 256 samples, and MatMul on 64 × 64 element matrices. The element width was kept 32 bit in fixed-point representation. The tests were organized as follows:

› homogeneous workload, running multiple instances of the same kernel on multiple harts, on different input data;
› composite workload, running convolutions, FFTs, and MatMul repeatedly on three respective harts.

The performance was measured by taking the average cycle count to execute one computation kernel. Table 2 summarizes the results, which are discussed below.

TABLE 2. Summary of performance results and synthesis results. Green = best case; red = worst case.

Cycle count: With small matrix convolutions, the accelerated core reached up to 3× cycle-count speed-up over a nonaccelerated IMT core (Klessydra T03), and 2× speed-up over a single-threaded, DSP-extended core (RI5CY9). As expected, large matrix convolutions and MatMul obtain a more considerable advantage from vector-accelerated cores, quantified in 13× cycle-count speed-up relative to Klessydra T03, 9× relative to the RI5CY core, and 19× relative to ZeroRiscy. In contrast, FFT takes benefit from TLP and reduced data memory accesses rather than from DLP.

Figure 2 quantifies the contribution of DLP and TLP …

Maximum clock frequency: All the cores under analysis were implemented as FPGA soft-cores. The clock speed exhibited the sharpest drops as the TLP grew larger: in the heterogeneous MIMD scheme, the crossbar mapping the SPMI output data on the shared MFU units became the critical path for D = 4, 8. Pipe…
for convolutions on different matrix sizes. For small lining the crossbar to reduce the critical path introdu-
vectors, TLP inherently exhibits better contributions ces hardware overhead, compromising the area
to speed-up than DLP, while as the vector size grows, advantage of the heterogeneous MIMD configuration.
the DLP boost dominates. Implementations exploiting Absolute execution time: The cycle count and
both TLP and DLP performed much better than pure the operating frequency allow calculating the total
DLP also with large matrices. A key outcome is that a execution time. Figure 3 compares the actual execu-
single core IMT processor can exploit both DLP and tion time speed-up relative to the ZeroRiscy core
TLP and follow the gray curve, while a single-threaded taken as the reference when each core operates at
core exploiting only DLP acceleration follows the blue its maximum frequency. In pure SIMD configurations,
curve. the speed-up grows linearly with the DLP for the
Notably, the heterogeneous MIMD coprocessor, explored DLP range. Yet, exploiting TLP, by going
which has three times less functional units than the from a SISD/SIMD to symmetric and heterogenous
fully symmetric MIMD, employed only 1% – 7% more MIMD, improved the speedup in all cases, despite
cycles than the latter. the frequency drop associated with the MIMD copro-
cessor. Thanks to exploiting both TLP and DLP, the
symmetric MIMDþSIMD schemes exhibit the lowest
execution times, reaching up to 17 speed-up over Zer-
oriscy for Convolution 32 32 and up to 13 speed-up
for the composite workload. Notably, the heteroge-
neous MIMD configurations maintain an almost perfect
overlap with the symmetric MIMD.
The nonaccelerated Klessydra-T03, while employ-
ing a higher cycle count than RI5CY due to the
absence of DSP and hardware-loop extensions, exhib-
its an absolute performance advantage over RI5CY
thanks to a more than double frequency attained by
FIGURE 2. DLP and TLP cycle-count boost in 2-D convolu- the pure IMT microarchitecture. When compared to
tions for different matrix sizes. ZeroRiscy, T03 exhibits both lower cycle count and
higher frequency.
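The absolute-time comparison above reduces to simple arithmetic on cycle count and attainable clock frequency. A minimal sketch, using illustrative numbers rather than the article's measured data:

```python
def exec_time_us(cycles: int, fmax_mhz: float) -> float:
    """Total execution time (microseconds) from cycle count and clock."""
    return cycles / fmax_mhz

def speedup(cycles_ref: int, fmax_ref_mhz: float,
            cycles: int, fmax_mhz: float) -> float:
    """Speed-up of a core over the reference, each at its own fmax."""
    return exec_time_us(cycles_ref, fmax_ref_mhz) / exec_time_us(cycles, fmax_mhz)

# Illustrative only: a core needing 4x fewer cycles but clocking 20% slower
# than the reference is still 3.2x faster in absolute time.
print(speedup(100_000, 100.0, 25_000, 80.0))  # → 3.2
```

This is why the MIMD schemes can win overall despite their frequency drop: the cycle-count reduction outweighs the slower clock.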
Hardware Resource Utilization: In cost-constrained applications, it is crucial to find an optimal balance between speed-up and area overhead. The heterogeneous MIMD + SIMD scheme with D = 2 proved to be a possible best choice with all test programs.

The nonaccelerated T03 exhibits only a slightly larger footprint than the tiny ZeroRiscy core, despite the replicated register file to support multithreading, thanks to the LUT-RAM implementation of the registers.

Energy Efficiency: The average energy per algorithmic operation (multiplications and additions) is a general measure of the energy efficiency attained by a processor core in implementing an algorithm computation. Figure 4 reports the outcome of this analysis, referring to the soft-core implementations. The results are presented as the reduction in nJ/op relative to ZeroRiscy, taken as reference, which exhibited 4.24 nJ/op as the best case in the analyzed workloads.

The most energy-efficient designs proved to be the symmetric MIMD and heterogeneous MIMD schemes, again exhibiting an almost complete overlap and reaching over 85% energy saving relative to the reference ZeroRiscy. Despite having the smallest area footprint, the pure SIMD schemes resulted in larger energy consumption, due to low exploitation of TLP.

Larger Filters: Convolutional neural networks primarily employ 3 × 3 filters (VGG16) but also larger ones (e.g., 11 × 11 in AlexNet, 5 × 5 in GoogLeNet). Large masks such as 7 × 7 are used in Sobel filtering, Gaussian smoothing, and median filtering. We evaluated the vector coprocessor schemes with filters ranging from 5 × 5 to 11 × 11, on 32 × 32 element matrices.

TABLE 3. Higher order filter evaluation results for cycle count, total time at max frequency, and total energy. Green = best case; red = worst case.

Table 3 shows that the speed-up and energy efficiency trends continue as the filter dimensions grow larger, favoring higher DLP. The improvement relative to ZeroRiscy grows up to 35× when using 11 × 11 filters.

The symmetric and heterogeneous MIMD+SIMD schemes, with D = 2, maintain similar performance and energy results throughout the analyzed cases. The results confirm that an IMT core capable of MIMD acceleration increasingly outperforms single-threaded SIMD acceleration.

CONCLUSION
The scientific outcome of this article can be summarized in the following list of evidence:

› The MIMD-SIMD vector coprocessor schemes enable tuning the TLP and DLP contribution and obtain the best results in absolute performance and energy efficiency, reaching >15× speed-up and -85% energy per operation.
› Kernels that are less effectively vectorizable can still benefit from acceleration through SPMs and TLP, in an IMT core, reaching 2×-3× speed-up.
› Fully symmetric and heterogeneous MIMD give very similar results, showing that coprocessor contention can be effectively mitigated by functional unit heterogeneity, allowing hardware resource saving. From the same observation, we can state that functional unit contention is less impacting than SPM contention, in all the kernels.
› Pure DLP acceleration always gives inferior results to a balanced TLP/DLP acceleration. An IMT microarchitecture can benefit from TLP and DLP acceleration in a single core.
› In the absence of hardware acceleration, IMT still exhibits an absolute performance advantage over single-thread execution thanks to the simplified hardware structure.

The Klessydra-T parametric cores are available as open-source designs on GitHub at https://perma.cc/6FYD-AF68.

REFERENCES
1. C. Bechara et al., "A small footprint interleaved multithreaded processor for embedded systems," in Proc. 18th IEEE Int. Conf. Electron., Circuits, Syst., 2011, pp. 685–690.
2. A. Cheikh et al., "The microarchitecture of a multi-threaded RISC-V compliant processing core family for IoT end-nodes," in Proc. Int. Conf. Appl. Electron. Pervading Ind., Environ. Soc., 2017, pp. 89–97.
3. M. Olivieri et al., "Investigation on the optimal pipeline organization in RISC-V multi-threaded soft processor cores," in Proc. IEEE New Gener. CAS, 2017, pp. 45–48.
4. RISC-V Unprivileged Instruction Set Specifications, Dec. 2019. [Online]. Available: https://riscv.org/specifications/
5. F. Samie, L. Bauer, and J. Henkel, "From cloud down to things: An overview of machine learning in Internet of Things," IEEE Internet Things J., vol. 6, no. 3, pp. 4921–4934, Jun. 2019.
6. F. Luo and C. J. Zhang, Signal Processing for 5G: Algorithms and Implementations. New York, NY, USA: Wiley, 2016.
7. K. Lim, S. Jeong, Y. Kim, and H. S. Yang, "CalmRISC: A low power microcontroller with efficient coprocessor interface," Microprocessors Microsyst., vol. 25, pp. 247–261, 2001.
8. F. Botman, J. de Vos, S. Bernard, F. Stas, J. Legat, and D. Bol, "Bellevue: A 50 MHz variable-width SIMD 32-bit microcontroller at 0.37 V for processing-intensive wireless sensor nodes," in Proc. IEEE Int. Symp. Circuits Syst., Melbourne, Australia, 2014, pp. 1207–1210.
9. M. Gautschi et al., "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2700–2713, Oct. 2017.
10. S. Seo et al., "Diet SODA: A power-efficient processor for digital cameras," in Proc. 16th ACM/IEEE Int. Symp. Low Power Electron. Des., 2010, pp. 79–84.
11. F. Conti and L. Benini, "A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters," in Proc. IEEE Des., Autom. Test Eur. Conf. Exhib., 2015, pp. 683–688.
12. Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
13. L. Du et al., "A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018.
14. A. Cheikh et al., "Efficient mathematical accelerator design coupled with an interleaved multi-threading RISC-V microprocessor," in Proc. Int. Conf. Appl. Electron. Pervading Ind., Environ. Soc., 2019, pp. 529–539.
ABDALLAH CHEIKH is currently a Postdoctoral Researcher with Sapienza University of Rome, Italy. His interest in computer organization and design drove him to pursue a PhD fellowship at Sapienza University of Rome, majoring in computer architecture. His research activities cover design, implementation, and verification of a wide range of microprocessor architectures, vector accelerators, and morphing processors. Cheikh received a B.S. and an M.S. in electrical engineering from Rafik Hariri University, Lebanon, in 2014 and 2016, respectively, and a Ph.D. in 2020. Contact him at abdallah.cheikh@uniroma1.it.

STEFANO SORDILLO has been a Doctorate student with the Sapienza University of Rome, Italy, since 2019. His research activities cover microprocessor core design, hardware accelerators, and neural network algorithms for IoT devices. Sordillo received an M.S. (Laurea) "cum laude" in electronics engineering from Sapienza University of Rome, Italy, in 2019. Contact him at stefano.sordillo@uniroma1.it.

…electronic engineering "cum laude" in 2001 and a Ph.D. in electronic engineering from the University of Rome "La Sapienza" in 2005. He coauthored more than 40 publications in international journals and conference proceedings. Contact him at francesco.menichelli@uniroma1.it.

GIUSEPPE SCOTTI became a Researcher (Assistant Professor) with the DIET Department, University of Rome "La Sapienza," Italy, in 2010 and became an Associate Professor with the same department in 2015. His research activity was mainly concerned with integrated circuit design and focused on design methodologies able to guarantee robustness with respect to parameter variations in both analog circuits and digital VLSI circuits. Scotti received an M.S. and a Ph.D. in electronic engineering from the University of Rome "La Sapienza," in 1999 and 2003, respectively. He has coauthored about 50 publications in international journals and more than 70 contributions in conference proceedings. Contact him at giuseppe.scotti@uniroma1.it.
DEPARTMENT: SECURITY
Video game consoles share many of the characteristics of an ideal device for use in
enterprise deployments. In comparison to many desktop and notebook PCs
available in the market, modern video game consoles are actually quite powerful
and capable. They provide an excellent user experience with simple and intuitive
setup and operation. At the heart of the design of many modern video game
consoles is security; they are remarkably resilient against very sophisticated
hardware and software attacks. They are also rather cost-effective in comparison
to modern PCs.
Video game consoles are ideal devices for enterprise deployments; they are powerful, versatile, easy to use, cost-effective, and extremely secure. Systems suitable for use in an enterprise must be able to handle a variety of workloads to ensure that users remain productive; such workloads include, but are not limited to, video conferencing, web browsing, content creation (e.g., spreadsheets, presentations, documents, audio, video, etc.), and audio/video streaming. Security is equally as important, if not more important, than user productivity; enterprise systems routinely process sensitive information, which is critical to the organization's success. Of course, all enterprise systems strive to reduce the overall total cost of ownership (TCO); this can be achieved by simply reducing the cost of the hardware itself as well as reducing the operational expenses (OpEx) incurred by managing and maintaining the systems (e.g., patching, updates, user training/support, etc.).

The purpose of this article is to describe the ideal characteristics of a device suitable for use in enterprise deployments and demonstrate how video game consoles are designed with these characteristics and traits in mind, which, therefore, makes them an excellent fit. It is extremely important to understand that while this article may use specific consoles as examples of such devices, many of the points being made apply to almost all modern video game consoles including, but not limited to, the Nintendo Switch, the Sony PlayStation 4, the Sony PlayStation 5, and the Microsoft Xbox Series S.

PERFORMANCE
Relative to desktop and notebook PCs available at their time of release, modern video game consoles are quite powerful. As described in Table 1, video game consoles feature the latest technologies available at their respective times of release. Unsurprisingly, performance is critical when considering devices for use in the enterprise. It is fairly common for enterprise users to be working across multiple applications and contexts at once; high-performance devices aid user workflow and productivity.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3055681
Date of current version 26 March 2021.
TABLE 1. Hardware details of various video game consoles released over several years.

System                   | Release Date  | CPU Cores | Memory       | Process Tech. | CPU Microarch. | GPU Microarch.
Microsoft Xbox One       | November 2013 | 8x x86-64 | 8 GB DDR3    | TSMC 28 nm    | AMD Jaguar     | AMD GCN 2
Sony PlayStation 4       | November 2013 | 8x x86-64 | 8 GB GDDR5   | TSMC 28 nm    | AMD Jaguar     | AMD GCN 2
Microsoft Xbox One S     | August 2016   | 8x x86-64 | 8 GB DDR3    | TSMC 28 nm    | AMD Jaguar     | AMD GCN 2
Sony PlayStation 4 Pro   | November 2016 | 8x x86-64 | 8 GB GDDR5   | TSMC 28 nm    | AMD Jaguar     | AMD GCN 4
Nintendo Switch          | March 2017    | 4x ARMv8  | 4 GB LPDDR4  | TSMC 20 nm    | ARM Cortex-A57 | NVIDIA Maxwell
Microsoft Xbox One X     | November 2017 | 8x x86-64 | 12 GB GDDR5  | TSMC 16FF+    | AMD Jaguar     | AMD GCN 4
Microsoft Xbox Series X  | November 2020 | 8x x86-64 | 16 GB GDDR6  | TSMC N7e      | AMD Zen 2      | AMD RDNA 2
Microsoft Xbox Series S  | November 2020 | 8x x86-64 | 10 GB GDDR6  | TSMC N7e      | AMD Zen 2      | AMD RDNA 2
Sony PlayStation 5       | November 2020 | 8x x86-64 | 16 GB GDDR6  | TSMC N7e      | AMD Zen 2      | AMD RDNA 2
The Microsoft Xbox One X features a semicustom system on a chip (SoC) developed in partnership with Microsoft and Advanced Micro Devices (AMD).1 The SoC is implemented using Taiwan Semiconductor Manufacturing Company's (TSMC) 16-nm Fin field-effect transistor (FinFET) Plus (16FF+) technology; it features a CPU composed of 8 64-bit x86 cores operating at 2.3 GHz and a GPU composed of 40 compute units operating at 1.172 GHz. The SoC uses a unified memory pool, shared by both the CPU and the GPU, which consists of 12 GB of GDDR5 DRAM; the total memory bandwidth is 326.4 GB/s. The console supports HDMI 2.0b display output with high-bandwidth digital content protection (HDCP) 2.2, 10-bit HDR, and a resolution of 3840 × 2160 at 60 Hz. The GPU is further optimized for a version of Microsoft's DirectX 12 graphics API specific to the system.

The Microsoft Xbox Series X, shown in Figure 1, features a semicustom SoC developed in partnership with Microsoft and AMD.2 The SoC is implemented using TSMC's 7-nm FinFET Enhanced (N7e) technology; it features a CPU composed of 8 64-bit x86 cores operating at 3.8 GHz, or 3.6 GHz with simultaneous multithreading (SMT) enabled, and a GPU composed of 52 compute units operating at 1.825 GHz. The SoC uses a unified memory pool, shared by both the CPU and the GPU, which consists of 16 GB of GDDR6 DRAM; 10 GB, reserved for the GPU, operate at 560 GB/s and 6 GB, reserved for the CPU, operate at 336 GB/s. The console supports HDMI 2.1 display output with the same features as the Xbox One X SoC in addition to fixed rate link (FRL), variable refresh rate (VRR), display stream compression (DSC), 4:4:4 chroma subsampling, and a resolution of either 3840 × 2160 at 120 Hz or 7680 × 4320 at 60 Hz. The GPU is further optimized for a version of Microsoft's DirectX 12 Ultimate graphics API specific to the system.

FIGURE 1. Microsoft Xbox Series X video game console and controller.3
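The headline numbers above follow from a few interface parameters. A quick sanity check is possible, assuming GCN/RDNA-style compute units with 64 shader ALUs each (2 FLOPs per ALU per clock) and the memory bus widths and per-pin data rates commonly reported for these parts; none of these parameters are stated explicitly in the article:

```python
def peak_tflops(compute_units: int, clock_ghz: float,
                alus_per_cu: int = 64, flops_per_alu: int = 2) -> float:
    """Peak single-precision GPU throughput in TFLOPS."""
    return compute_units * alus_per_cu * flops_per_alu * clock_ghz / 1000

def peak_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak DRAM bandwidth in GB/s: pins x per-pin rate / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8

print(round(peak_tflops(40, 1.172), 1))        # Xbox One X GPU: 6.0 TFLOPS
print(round(peak_tflops(52, 1.825), 1))        # Xbox Series X GPU: 12.1 TFLOPS
print(round(peak_bandwidth_gbs(384, 6.8), 1))  # One X, 384-bit GDDR5: 326.4 GB/s
print(round(peak_bandwidth_gbs(320, 14), 1))   # Series X GPU pool: 560.0 GB/s
```

Under these assumed parameters the bandwidth results reproduce the 326.4-GB/s and 560-GB/s figures quoted in the text.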
For comparison, the Intel Core i5-8600T, an Intel eighth-generation desktop SoC released several months after the Xbox One X, features a CPU composed of 6 64-bit x86 cores operating at a base frequency of 2.3 GHz and a GPU operating at 1.15 GHz; this SoC is implemented using Intel's third-generation 14-nm++ technology.4 It supports HDMI display output with a resolution of 4096 × 2304 at 24 Hz; it also supports the mainstream version of Microsoft's DirectX 12 graphics API.

It is important to understand that while these SoCs differ drastically in regards to cost, intended use, process technology, thermal design power (TDP), instructions per cycle (IPC), and various other aspects of SoC and CPU and GPU design, the intent here is to highlight that video game consoles are powerful and feature-rich in comparison to mainstream compute devices available at the time of their release.

USER EXPERIENCE
Video game consoles provide an elegant and engaging experience for all ages and skill levels. This characteristic is ideal for enterprises, because it enables all of its users to be functional and productive without requiring additional training, learning, etc.

Ease of Use
Content (games) aside, the systems are designed for both children and adults with limited experience or understanding of technology. Only two cables are required to use the system: power and display (usually HDMI). User input and haptic feedback are performed through an ergonomically designed controller; however, modern systems also support traditional keyboard and mouse input.5 System and software (game) updates and purchases are obtained through a single source (e.g., Microsoft Store, Nintendo eShop, Sony PlayStation Store, etc.), which is tightly integrated into the system, making it easy and intuitive to find, purchase, and download software.

Versatility
The systems are also rather versatile; developers can write and release a variety of different software titles. Aside from the obvious (games), this includes, but is not limited to, video streaming applications, music streaming applications, video conferencing applications, web browsers, and cloud storage applications. For example, Hulu is available on the Nintendo Switch via the Nintendo eShop and Spotify is available on the PlayStation consoles via the Sony PlayStation Store. Enterprises have the ability to write their own custom internal applications, and independent software vendors (ISVs) can also develop and distribute applications for public consumption.

SECURITY
At the heart of their design is thoughtful and practical defense against a wide range of threats. Without question, security is paramount in enterprise contexts. It is common for users to access and store sensitive information, which is crucial to the success and well-being of the organization and its stakeholders (e.g., employees, customers, clients, shareholders, etc.).

Identity and Access Management
Centralized identity and access management (IAM) is used throughout the entire ecosystem. The Microsoft consoles require users to authenticate using a Microsoft account; similarly, the Sony consoles require a Sony account and the Nintendo consoles require a Nintendo account. These identities are then used for access and privilege management. Identity is required to associate and maintain licenses for software (games and applications) and subscription services. It is also used to control communication and interaction with other users; access to user information (e.g., online status, currently running software, etc.) and interactions (e.g., text messages, voice messages, in-game chat, etc.) can be explicitly granted to or revoked from other users.

Patching and Updates
The systems are designed such that only fully patched and updated systems and software can access protected resources (e.g., Xbox LIVE, Nintendo Online, game servers, etc.) and interact with other compliant (patched and updated) systems and users.

Upon boot, the system attempts to connect to a single, trusted authority (e.g., Xbox LIVE, PlayStation Network, etc.). Upon connection, it then checks for any system updates and prompts the user to download and install them. If the user chooses to skip/decline any pending updates or if a connection to the trusted authority cannot be made (for whatever reason), they can continue to use the system offline and use software (games) that is already installed. Simply, the system will not allow a user to connect to the trusted authority (e.g., to play games, to chat with friends, etc.) unless the system is fully patched and updated. If the system is fully patched and updated but the software (game) the user wishes to launch is not fully updated, the user is prompted to download and install any pending updates. If the user chooses to skip/decline the pending updates, they can continue to use the software offline.
Hardware Security
Video game consoles are designed to be remarkably resilient against various hardware and software attacks. The entire business model of a modern video game console is centered around software sales, not hardware sales. They are designed around the premise that the end-user cannot be trusted; an end-user's motivation is to play games for free (piracy) and/or modify the game to achieve an unfair advantage over other players (cheat). Therefore, extreme measures must be taken to prevent physical attacks against the system. However, the end-user is not the only untrusted entity.

The Xbox One X is an excellent example of such design;6 many of these security features have been carried forward to the Xbox Series X. Quite literally, the only trusted entity of the entire Xbox One X system is the SoC itself; the internal storage, DRAM, optical drive, etc., are considered untrusted. Therefore, all information which leaves the SoC must be encrypted and all information which enters the SoC must be decrypted and integrity checked.

All data are stored in nonvolatile memory using a format known as the Xbox Virtual Disk (XVD). As illustrated in Figure 2, all data are stored in an NT File System (NTFS) virtual disk and then encrypted and hashed (for both confidentiality and integrity); finally, the root digest of the hash tree is signed using Microsoft RSA (for integrity of the hash tree itself).

FIGURE 2. High-level depiction of an Xbox Virtual Disk (XVD) structure.6

The system SoC, illustrated in Figure 3, features a custom-designed element referred to as the Streaming Crypto Engine (SCE), which is able to decrypt information loaded from the internal storage as fast as it can be read from the underlying I/O bus (SATA III in the Xbox One X and NVMe in the Xbox Series X). Keys used to decrypt information are fed into the SCE through a dedicated hardware pin connecting it to the Crypto Engine inside of another custom-designed element within the SoC referred to as the Security Complex; this ensures that the keys are never exposed to software at any point in time. The Security Complex also closely monitors the system clock, voltage, temperature, and reset; these are commonly manipulated to attack a system.

FIGURE 3. High-level depiction of the Xbox One X SoC architecture.6

One of the core tenets of the console's security design is defense in depth; in other words, an attacker must break through multiple layers of security. In addition to encrypting and integrity checking all information which passes through the SoC, the system uses a three-OS architecture,7 as illustrated in Figure 4.

FIGURE 4. High-level depiction of the Xbox One X three-OS architecture.6
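The XVD integrity scheme, a hash tree whose root digest is signed, can be illustrated with a simplified sketch (arbitrary pages, SHA-256, binary tree; the real on-disk format differs in layout and page size):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list) -> bytes:
    """Root digest of a binary hash tree over data blocks. Hash each block,
    then repeatedly hash pairs of digests until one root remains."""
    level = [sha256(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                # duplicate last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

pages = [b"page0", b"page1", b"page2", b"page3"]
root = merkle_root(pages)
# Tampering with any single page changes a leaf hash and hence the root,
# so one RSA signature over the root digest authenticates every page.
assert merkle_root([b"page0", b"pageX", b"page2", b"page3"]) != root
```

This is why signing only the root suffices: verifying a page requires recomputing hashes along one tree path, while the signature check binds the whole tree at once.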
TABLE 2. Cost comparison of various video game consoles at their time of release.

System                               | Release Date  | Internal Storage | Cost (USD)
Microsoft Xbox One                   | November 2013 | 500 GB           | 499
Sony PlayStation 4                   | November 2013 | 500 GB           | 399
Microsoft Xbox One S                 | August 2016   | 500 GB           | 299
Sony PlayStation 4 Pro               | November 2016 | 1 TB             | 399
Nintendo Switch                      | March 2017    | 32 GB            | 299
Microsoft Xbox One X                 | November 2017 | 500 GB           | 499
Microsoft Xbox Series X              | November 2020 | 1 TB             | 499
Microsoft Xbox Series S              | November 2020 | 512 GB           | 299
Sony PlayStation 5 w/o optical drive | November 2020 | 825 GB           | 399
Sony PlayStation 5 w/ optical drive  | November 2020 | 825 GB           | 499

For comparison, USD 499 spent today can buy a Lenovo ThinkCentre M720q, which includes 128 GB of internal storage, an Intel Pentium Gold G5400T SoC, and 4 GB of DDR4 DRAM.8 Clearly, the consoles are rather competitively priced compared to modern PCs. However, the true cost of any hardware deployment in the enterprise extends far beyond the device itself. One must consider patching, maintenance, and management of the device throughout its entire lifecycle. Considering that patches and software are released and distributed directly through the trusted authority (e.g., Microsoft via Xbox LIVE, Nintendo via Nintendo Online, etc.), there is less operational overhead for an enterprise, which would otherwise have to build its own infrastructure to do so.

The notion of using a video game console in the enterprise may seem laughable at first glance. However, as discussed, video game consoles actually embody many of the characteristics of an ideal device for use in the enterprise.
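The lifecycle-cost argument above is just acquisition cost plus recurring operational expense; a sketch with entirely hypothetical figures (the article quotes no opex numbers):

```python
def tco_usd(device_cost: float, annual_opex: float, years: int) -> float:
    """Total cost of ownership: acquisition cost plus recurring
    management/maintenance expense over the deployment lifetime."""
    return device_cost + annual_opex * years

# Hypothetical: a $499 console whose patching is handled by the platform's
# trusted authority vs. a $499 PC needing in-house update infrastructure.
print(tco_usd(499, 20, 5))   # → 599
print(tco_usd(499, 150, 5))  # → 1249
```

The point of the sketch is that identical sticker prices can diverge sharply in TCO once the opex term dominates.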
…AMD's Zen 2 CPU cores and RDNA 2 GPU cores, while security has improved simply because threats (piracy and cheating) and attacks only become more advanced over time.

REFERENCES
1. J. Sell, "The Xbox One X Scorpio engine," IEEE Micro, vol. 38, no. 2, pp. 53–60, Mar./Apr. 2018.
2. J. Andrews and M. Grossman, "Xbox Series X system architecture," Aug. 2020. Accessed: Jan. 24, 2021. [Online]. Available: https://hotchips.org/assets/program/conference/day1/HotChips2020_GPU_Microsoft_Jeff_Andrews_v2.pdf
3. Wikimedia Commons, "File:Xbox Series X 2.jpg," Dec. 2020. Accessed: Jan. 24, 2021. [Online]. Available: https://commons.wikimedia.org/wiki/File:Xbox_Series_X_2.jpg
4. Intel, "Intel Core i5-8600T processor," 2020. Accessed: Sep. 27, 2020. [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/129938/intel-core-i5-8600t-processor-9m-cache-up-to-3-70-ghz.html
5. Microsoft, "Mouse and keyboard support on Xbox One," 2020. Accessed: Sep. 27, 2020. [Online]. Available: https://support.xbox.com/help/hardware-network/accessories/mouse-keyboard
6. T. Chen, "Guarding against physical attacks: The Xbox One story," Oct. 2019. Accessed: Sep. 27, 2020. [Online]. Available: https://www.platformsecuritysummit.com/2019/speaker/chen/
7. E. Brechner, "Getting diagonal on the Xbox One trifecta," Computer, vol. 46, no. 8, pp. 77–78, 2013.
8. Newegg, "Lenovo ThinkCentre M720q 10T8SDJ900 desktop computer - Pentium Gold G5400T - 4 GB RAM - 128 GB SSD - Tiny - Black," 2020. Accessed: Sep. 26, 2020. [Online]. Available: https://www.newegg.com/p/N82E16883994324
9. CRN, "AMD's Xbox, PlayStation work led to a big security feature in EPYC," Aug. 2019. Accessed: Aug. 2019. [Online]. Available: https://www.crn.com/news/components-peripherals/amd-s-xbox-playstation-work-led-to-a-big-security-feature-in-epyc
10. Microsoft, "Meet the Microsoft Pluton processor, the security chip designed for the future of Windows PCs," Nov. 2020. Accessed: Nov. 23, 2020. [Online]. Available: https://www.microsoft.com/security/blog/2020/11/17/meet-the-microsoft-pluton-processor-the-security-chip-designed-for-the-future-of-windows-pcs/
11. M. Mattioli, "History of video game distribution," IEEE Consum. Electron. Mag., vol. 12, no. 3, pp. 312–322, Sep. 2020.

MICHAEL MATTIOLI leads the Hardware Engineering team within Goldman Sachs, New York, NY, USA. He is responsible for the design and engineering of the firm's digital experiences and technologies. He is also responsible for the overall strategy and execution of hardware innovation both within the firm and within the broader technology industry. Contact him at michael.mattioli@gs.com.

ATTE LAHTIRANTA is currently a Chief Technology Officer and head of Core Engineering with Goldman Sachs, New York, NY, USA. Contact him at atte.lahtiranta@gs.com.
Luiz André Barroso, Google, Mountain View, CA, 94043, USA
Receiving the 2020 ACM-IEEE Eckert-Mauchly Award this past June was among the most rewarding experiences of my career. I am grateful to IEEE Micro for giving me the opportunity to share here the story behind the work that led to this award, a short version of my professional journey so far, as well as a few things I learned along the way.

THE PRACTICE OF COMPUTER SCIENCE
For many of us our earliest models of professionalism come from observing our parents' approach to their work. That was the case for me observing my father, a surgeon working in public hospitals in Rio de Janeiro. Throughout his career, he was continually investigating new treatments, collecting case studies, and participating and publishing in medical conferences, despite never having held an academic or research position. He was dedicated to the practice of medicine but always made time to help advance knowledge in his area of expertise.

Without really being aware of it, I ended up following my father's path and became a practitioner myself. As a practitioner, my list of peer-reviewed publications is notably shorter than most of the previous winners of this award, but every time I had something valuable to share with the academic community, I felt welcomed by our top research conferences, and those articles tended to be well received. Practitioners like myself tend to publish papers in the past tense, reporting on ideas that have been implemented and launched as products. Practitioners can contribute to our community by looking back and showing us how those ideas played out (or not) in practical applications. Commercial success or the lack thereof can be an objective judge of the merits of research ideas; even if cruelly so at times. In giving me this award, the IEEE Computer Society and ACM are highlighting the role of practitioners in our field.

Now, as this award is about the practice of warehouse-scale computing, I should get to that with no further delay.

A BRIEF HISTORY OF WAREHOUSE-SCALE COMPUTING
If it is indeed true that "great poets imitate and improve,"1 poetry and computing may have something in common after all. Warehouse-scale computers (the name we eventually gave to the computers we began to design at Google in the early 2000s) are the technical descendants of numerous distributed computing systems that aimed to make multiple independent computers behave as a single unit. That family begins with VAXclusters2 in the 1980s, a networked collection of VAX computers with a distributed lock manager that attempted to present itself as a single system to the user. In the 1990s, the concept of computing clusters began to be explored using lower end or desktop computers and local area networks, with systems such as NASA's Beowulf clusters3 and UC Berkeley's NOW project.4
Administered jointly by ACM and the IEEE Computer Soci- FOR MANY OF US OUR EARLIEST
ety, the award is given for contributions to computer and digi- MODELS OF PROFESSIONALISM
tal systems. In 2020, my award was given for pioneering the
design of warehouse-scale computing and driving it from COME FROM OBSERVING OUR
concept to industry. PARENTS’ APPROACH TO THEIR
WORK. THAT WAS THE CASE FOR ME
OBSERVING MY FATHER, A SURGEON
WORKING IN PUBLIC HOSPITALS IN
0272-1732 ß 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3055379 RIO DE JANEIRO.
Date of current version 26 March 2021.
When I arrived at Google, in 2001, I found a company of brilliant programmers that was short on cash but not on confidence, as they had already committed to a strategy of systems built from inexpensive desktop-class components. Cheap might be a fairer characterization of those early systems than inexpensive. The first generation of those computer racks, tenderly nicknamed "corkboards," consisted of desktop motherboards loosely resting on sheets of cork that isolated the printed circuit boards from the metal tray, with disk drives themselves loosely resting on top of DIMMs.

Despite my hardware background, I had joined Google to try to become a software engineer. In my early years, I was not involved in building computers but instead worked developing our index searching software and related software infrastructure components such as load balancers and remote procedure call libraries. Three years later, Urs Hölzle asked me to build a hardware team capable not only of building sound server-class systems but of inventing new technologies in the datacenter space. The years I had spent in software development turned out to be extremely useful in this new role, since my first-hand understanding of Google's software stack was essential to architecting the machinery needed to run it. We published some of those early insights into the architectural requirements for Google-class workloads in an IEEE Micro paper in 2003.6

co-location facilities, so we had to build our own facilities in order to continue to grow our services.

At that point, it became evident to us how much room for improvement there was in the design of datacenters. In the third-party hosting business, datacenters were put together by groups of disjoint engineering crafts that knew little of each other's disciplines: civil engineers built the building, mechanical engineers provisioned cooling, electrical engineers distributed power, hardware designers built servers, and software engineers wrote internet services. The lack of cross-disciplinary coordination resulted in facilities that were both expensive and incredibly energy inefficient. Our team's lack of experience in datacenter design may have been an asset as we set out to question nearly every aspect of how these facilities were designed. Perhaps most importantly, we had the chance to look at the entire system design, from cooling towers to compilers, and that perspective quickly revealed significant opportunities for improvement.

Speed of deployment was also a critical factor in those days, as we were often running dangerously close to exhausting our computing capacity as our traffic grew, so our initial approach was to prefabricate ready-to-deploy computer rooms inside forty-foot shipping containers. Containers gave us a datacenter floor where we could isolate the hot (exhaust) plenum from the cold aisle and shorten the total distance the air needed to be moved; both were factors that improved cooling efficiency. All that the container needed to function was power, cold water, and networking, and we had a 1200-server machine room ready to deploy.

That original container-based deployment also introduced other innovations that led to cost, performance, and energy efficiency improvements. Here are some of the most notable ones:
step. We instead eliminated the UPS room and introduced per-tray (and later per-rack) batteries. That way, power entering the building only needed to be rectified once per machine.
› Single-voltage rail power supplies: Every server used to be outfitted with a power supply that converted ac voltage into a number of dc voltage rails (12 V, 5 V, 3.3 V, etc.) based on old standards for electronic components. By 2005, most electronic components did not use any of the standard dc rails, so yet another set of dc/dc conversions needed to happen onboard. The allocation of power among multiple rails also lowered power supply efficiency, sometimes below 70%. We introduced a single-rail power supply that reached 90% efficiency and created on board only the voltages actually used by components.
› 1000-port GigE Ethernet switch: Datacenter networking bandwidth was beginning to become a bottleneck for many warehouse-scale applications, but enterprise-grade switches were not only very expensive but also lacked offerings for large numbers of high-bandwidth endpoints. Using a collection of inexpensive edge switches configured as a multistage network, our team created the first of a family of distributed datacenter networking products (codenamed Firehose) that could deliver a gigabit of nonoversubscribed bandwidth to up to a thousand servers.

Although our adventure with shipping containers lasted only that one generation, and soon after we found ways to obtain the same efficiencies with more traditional building shells, the innovations from that first program have continued to evolve into industry-leading solutions over generations of warehouse-scale machines. Figure 1 shows a birds-eye view of a modern warehouse-scale computer.

MY JOURNEY
I knew I wanted to be an electrical engineer when I was 8 years old and got to help my grandfather work on his HAM radio equipment. Putting aside the fact that eight-year-olds should not be making career choices, I find it difficult to question that decision to this date. Although I had always been a good student, I struggled a bit during my Ph.D. and graduated late. I did have a few things going for me: an ability to focus, stamina for hard work, and a lot of luck. As an example, after a 24-year drought, the Brazilian men's national soccer team chose to win a World Cup during my hardest year in graduate school, delivering a degree of joy that was badly needed to get me to the finish line. Less than a year after that World Cup I was working in my grad student office on a
Saturday afternoon when I got a call from Norm Jouppi inviting me to interview for a research job at Digital Equipment's Western Research Lab (WRL). At the time, Norm was already one of the most highly respected computer architects in the world, and perhaps nothing in my career since has compared to the feeling I had that day—Norm Jouppi knew who I was!

I joined DEC WRL and had the chance to learn from top researchers like Kourosh Gharachorloo and collaborate with leading computer architects such as Sarita Adve, Susan Eggers, Mateo Valero, and Josep Larriba-Pey. During that time, I also met Mark Hill, who would become a friend and a mentor. Later, at Google, I would also have the chance to coauthor papers with other leading figures in our field, such as Tom Wenisch, Wolf Weber, David Patterson, and Christos Kozyrakis.

Perhaps nothing summarizes the impact that friends and luck can have in your life more than the story of how I came to join Google. As I was trying to make a decision between two options, Jeff Dean asked me whether the other company I was considering had also served me crème brûlée during my interviews. I thanked Jeff and accepted the Google offer the very next morning.

The brilliance and generosity of countless people at Google have been essential to the work that led to this award, but I must highlight three here: Urs Hölzle, who has been a close collaborator and possibly the single person most to blame for Google's overall systems infrastructure successes; Bart Sano, who managed the Platforms team that built out the infrastructure we have today (I was the technical lead for Bart's team for many years); and Partha Ranganathan, who is our computing technical lead today and is taking Google's architectural innovation into the future.

One part of my career I have no hesitation to brag about is the quality of the students I have had a chance to host as interns at DEC and Google. They were (to date) Partha Ranganathan, Rob Stets, Jack Lo, Sujay Parekh, Ed Bugnion, Alex Ramirez, Gautham Thambidorai, Karthik Sankaranarayanan, David Meisner, and David Lo. We worked together on many fun projects, and I hope for more in the future. Although my dad is no longer with us, I am also fortunate to count on the love and support of my family: my mom Cecilia, my godmother Margarida, my siblings Paula, Tina, and Carlos and their families, and my wife Catherine Warner, who is the award life gives me every single day.

THREE LESSONS
I will finish this essay by sharing with you three lessons I have learned in this first half of my career, in the hope that they may be useful to engineers who are at an earlier stage in their journey.

Consider the Winding Road
As an engineer you stand on a foundation of knowledge that enables you to branch into many different kinds of work. Although there is always risk when you take on something new, the upside of being adventurous with your career can be amazingly rewarding. I for one never let my complete ignorance about a new field stop me from giving it a go.

As a result, I have worked in areas ranging from chip design to datacenter design; from writing software for web search to witnessing my team launch satellites into space; from writing software for Google Scholar to using ML to automatically update Google Maps; from research in compiler optimizations to deploying exposure notification technology to curb the spread of Covid-19.8

It seems a bit crazy, but not going in a straight line has worked out really well for me and resulted in a rich
set of professional experiences. Whatever the outcome, you will be inoculated from boredom.

Develop Respect for the Obvious
The surest way to waste a career is to work on unimportant things. I have found that big, important problems have one feature in common: they tend to be straightforward to grasp even if they are hard to solve. Those problems stare you right in the face. They are obvious, and they deserve your attention.

Let me give you some examples by listing some of my more well-cited papers next to the formulation of the problems they address:

Publication: ISCA '98, "Memory System Characterization of Commercial Workloads,"10 with Kourosh Gharachorloo and Edouard Bugnion
Problem addressed: "High-end microprocessors are being sold to run commercial workloads, so why are we designing them for number crunching?"

Publication: ISCA '00, "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,"5 with Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese
Problem addressed: "Thread-level parallelism is easy. Instruction-level parallelism is hard. Should we bet on thread-level parallelism then?"

Publication: CACM '17, "The Attack of the Killer Microsecond,"11 with Mike Marty, Dave Patterson, and Partha Ranganathan
Problem addressed: "If datacenter-wide events run at microsecond speeds, why do we only optimize for millisecond and nanosecond latencies?"

Publication: CACM '13, "The Tail at Scale,"12 with Jeff Dean
Problem addressed: "Large-scale services should be resilient to performance hiccups in any of their subcomponents."

Publication: IEEE Computer '07, "A Case for Energy-Proportional Computing,"13 with Urs Hölzle
Problem addressed: "Shouldn't servers use little energy when they are doing little work?"

If it takes you much more than a couple of sentences to explain the problem you are trying to solve, you should seriously consider the possibility that it is not important enough to be solved.

Even Successes Have a "Sell-By" Date
Some of the most intellectually stimulating moments in my career have come about when I was forced to revisit my position on technical matters that I had invested significant time and effort in, especially when the original position had a track record of success. I will present just one illustrative example.

I joined Google after a failed multiyear chip design project, and as such I immediately embraced Google's design philosophy of staying away from silicon design ourselves. Later, as the technical lead of Google's datacenter infrastructure, I consistently avoided using exotic or specialized silicon even when it could demonstrate performance or efficiency improvements for some workloads, since betting on the low cost base of general-purpose components consistently proved to be the winning choice. Year after year, betting on general-purpose solutions proved successful.

Then, deep learning acceleration for large ML models arose as the first opportunity in my career to build specialized components that would have both broad applicability and dramatic efficiency advantages when compared to general-purpose designs. Our estimates indicated that large fractions of Google's emerging AI workloads could be executed in these specialized accelerators with as much as a 40× cost/efficiency advantage over general-purpose computing.

That was a time to ignore the past successes of betting on general-purpose off-the-shelf components and invest heavily in the design and deployment of our own silicon to accelerate ML workloads. Coming full circle, this meant that it was now my time to call Norm Jouppi and ask him to join us to become the lead architect for what was to become our TPU accelerators program.

CONCLUDING
Before the onset of the current pandemic, some of us may have underappreciated how important computing technology and cloud-based services have become to our society. In this last year, these technologies have allowed many of us to continue to work, to connect with loved ones, and to support each other. I am grateful to all of those at Google and everywhere in our industry who have built such essential technologies, and I am inspired to be working in a field with still so much potential to improve people's lives.
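The "Tail at Scale" problem statement in the table above rewards a quick back-of-the-envelope check. The sketch below is my own illustration, not code from the essay; it computes how often a request that fans out to many leaf servers sees at least one slow leaf, using the paper's well-known example of a 1% per-server tail:

```python
# Illustration of the "tail at scale" effect: a fanned-out request is slow
# whenever ANY of the leaf servers it touches is slow.
def p_request_slow(p_leaf_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent leaves is slow."""
    return 1.0 - (1.0 - p_leaf_slow) ** fanout

# One leaf is slow only 1% of the time...
print(f"{p_request_slow(0.01, 1):.2f}")    # -> 0.01
# ...but a request touching 100 leaves is slow about 63% of the time.
print(f"{p_request_slow(0.01, 100):.2f}")  # -> 0.63
```

This arithmetic is why the paper argues that large services must tolerate component-level hiccups rather than try to eliminate them.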
DEPARTMENT: MICRO ECONOMICS

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3060295
Date of current version 26 March 2021.

My favorite panel from Randall Munroe's one-panel comic, xkcd, is labeled "duty calls." It shows a lone stick figure at a desk, hunched over, intensely staring at his computer screen, while engaging in a staccato conversation with an offscreen companion. She says: "Are you coming to bed?" Him: "I can't. This is important." Her: "What?" Him: "Someone is wrong on the internet."

Like all good satire, this comic elicits both laughs and winces. Nobody would ever engage in this behavior in any physical place where a veneer of social politeness predominates, such as standing in line at a cashier or sitting in an airplane seat. On the internet, surfers jettison much of their social restraint, confronting and correcting perfect strangers. It leads to, for example, edit wars on Wikipedia, condescending insults on Reddit, and righteous putdowns on Twitter.

This behavior invites plenty of legal analysis, angry editorializing, and technological proposals, but rarely economic analysis. Let us address that gap. What economic factors make confrontational conversation more or less likely in our era?

TENSIONS WITH COMPROMISE
The first piece of relevant economics is the low cost of scale. It is inexpensive to host terabytes of data, and dirt-cheap to serve millions of users. Software can be replicated many times at almost no cost, making it possible for a platform to scale.

The second relevant economic factor complements scale, and goes by the label "network effects." These are self-reinforcing advantages affiliated with being a focal platform. Simplifying, a platform becomes focal in one of two ways. In one form, a platform attracts more apps or content, and that attracts more users, which then attracts more apps or content, and so on. In another form, a platform attracts more sellers, which attracts more buyers, which attracts more sellers, and so on. In either case, those attractions become self-reinforcing.

A third crucial piece of economics shapes focal platforms with large scale. Some confrontation is inevitable, and it undermines the functionality of most (not all!) platforms. It is possible to write volumes on how to address confrontation with human moderation or algorithmic processes, and whether specific practices are legal or effective. Take a step back from that discussion, and recognize the broad economic facts: no matter how it gets implemented, the processes are expensive and imperfect.

Two features of the present era drive up those costs and exacerbate the imperfections. Anything with long video is inherently expensive to moderate, as managers at YouTube and Facebook can attest. In addition, bots and misinformation farms have flooded the modern experience, especially at focal platforms—just ask Twitter and Facebook. Holding those in check has become an endless and expensive game of whack-a-mole for large platforms.

More to the point, addressing content moderation has become this era's key to achieving scale. Different platforms have taken different approaches to this challenge, and each approach comes with different upsides and drawbacks.
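The self-reinforcing loop described under "network effects" can be made concrete with a toy simulation. The model and its parameters below are my own illustrative assumptions, not anything from this column: each round, content mirrors the user split and users drift toward the platform with more content, so even a small early lead compounds into focal-platform dominance.

```python
# Toy model of network effects (all parameters are illustrative assumptions):
# two competing platforms; content mirrors the user split, and users drift
# toward the platform with more content, so a small lead is self-reinforcing.
def final_share(share_a: float, rounds: int = 20, pull: float = 0.3) -> float:
    """Platform A's user share after `rounds` of the feedback loop."""
    for _ in range(rounds):
        content_gap = share_a - (1.0 - share_a)  # content mirrors users
        share_a += pull * content_gap * share_a * (1.0 - share_a)
        share_a = min(max(share_a, 0.0), 1.0)    # shares stay in [0, 1]
    return share_a

print(round(final_share(0.50), 2))  # a dead heat stays a dead heat: 0.5
print(round(final_share(0.55), 2))  # a small lead compounds toward dominance
```

The point of the sketch is only qualitative: with positive feedback, near-ties are unstable, which is why focal platforms tend to emerge at all.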
and it may not be yours, but there is a large supply of these breakaways.

During the election, a new type of breakaway emerged when political opinion shapers couched their appeal in righteous patriotism. Love them or hate them, their experience illustrates the economics.

It started with, say, Alex Jones and Infowars, who got banned for going beyond the pale. Many mainstream sites thought it was the right thing to do, and it aligned with the economics because the losses of users were small compared to the risks of offending too many disgusted customers. Once banned, Alex Jones took his users with him and formed his own community elsewhere.

ban Parler's app, consistent with its sanitizing approach. AWS eventually took a similar action after notifying Parler of its concerns about a lack of moderation. Once Parler gained prominence as a coordinator of the Capitol riot, AWS saw no benefit to having its brand associated with this group, and that was that.

Parler's breakaway did not go well because management cut many corners and did not design a hack-proof site, but here is the rub: Parler is trying again. Even if they fail again, it is a good bet that somebody else will get a version of this up and running.

Summarizing, the increasing frequency of breakaways is a symptom that they are becoming cheaper to build. Ergo, we should expect mainstream sites to face increasing pressures towards fragmentation.