You are on page 1of 116

VOLUME 41, NUMBER 3 MAY/JUNE 2021

Top Picks From the 2020 Computer


Architecture Conferences

www.computer.org/micro
mpute
r.org ²ƧǞƵȁ
ɈǞ
Comp ˛Ƨ
uting
nted Data
and Visual
eality Inform izatio
n
on
ce4c1.
indd Techn ation
1
ology
on Quan
tum
y Comp
uting
putin
g

Intern
et of
Ethic Thing
s s
Mach
ine L
Quan ea rning
tu
Comp m
uting
JUNE
2020

JULY 20
ww w.c 20
ompu
ter.org

ww w.c
ompu
ter.org

ce7c1.ind
d 1

Secu
rity an
Priva d
cy High-P
Auto
matio
n Comp erforman
Block uting ce
chain Hard
Digit ware
al Affect
Transf iv
ormat Comp e
ion uting
Educa
tion

MAY 20
20

w w w.c
ompu
ter.org

ce5c1.
indd
1

S
ww w.c NOVE
ompu MBER
ter.org 2020

ww w.c
ompu
ter.org

ce11c1.in
dd 1

ndd 1

ComputingEdge
Secu
riit
Priva y and
cy
Data
Intern
et
ȲɈǞ˛
ƧǞƊǶ
XȁɈƵǶǶ
ǞǐƵȁƧ
Ƶ
Your one-stop resource for industry hot topics,
technical overviews, and in-depth articles.

Cutting-edge Unique original Keeps you up to


articles from the FRQWHQWbE\ date on what you
IEEE Computer computing need to know across
Society’s portfolio thought leaders, the technology
of 12 magazines. innovators, spectrum.
DQGbH[SHUWV

Subscribe for free


www.computer.org/computingedge
EDITOR-IN-CHIEF IEEE MICRO STAFF
Journals Production Manager: Joanna Gojlik,
Lizy K. John, University of Texas, Austin, USA
j.gojlik@ieee.org
ASSOCIATE EDITOR-IN-CHIEF Cover Design: Janet Dudar
Peer Review Administrator: micro-ma@computer.org
Vijaykrishnan Narayanan, The Pennsylvania State Publications Staff Editor: Cathy Martin
University, USA Publications Operations Project Specialist: Christine
Anthony
EDITORIAL BOARD Content Quality Assurance Manager: Jennifer
Guido Araujo, University of Campinas, Brazil Carruth
Publications Portfolio Manager: Kimberly Sperka
R. Iris Bahar, Brown University, USA
Publisher: Robin Baldwin
Christopher Batten, Cornell University, USA
Executive Director: Melissa Russell
Mauricio Breternitz, University of Lisbon, Portugal
Senior Advertising Coordinator: Debbie Sims
Bronis de Supinski, Lawrence Livermore
National Laboratory, USA IEEE PUBLISHING OPERATIONS
Yasuko Eckert, AMD Research, USA Senior Director, Publishing Operations: Dawn Melley
Maya Gokhale, Lawrence Livermore National Director, Editorial Services: Kevin Lisankie
Laboratory, USA Director, Production Services: Peter M. Tuohy
Shane Greenstein, Harvard Business School, Associate Director, Editorial Services: Jeffrey E. Cichocki
USA Associate Director, Information Conversion
Russ Joseph, Northwestern University, USA and Editorial Support: Neelam Khinvasara
Hyesoon Kim, Georgia Institute of Technology, USA Senior Manager, Journals Production: Patrick Kempf
John Kim, Korea Advanced Institute of Science and
Technology, South Korea CS MAGAZINE OPERATIONS COMMITTEE
Hsien-Hsin (Sean) Lee, Taiwan Semiconductor Diomidis Spinellis (Chair), Lorena Barba,
Manufacturing Company, USA Irena Bojanova, Shu-Ching Chen, Gerardo Con Diaz,
Michael Mattioli, Goldman Sachs & Co., USA Lizy K. John, Marc Langheinrich, Torsten Möller,
Tulika Mitra, National University of Singapore, Ipek Ozkaya, George Pallis, Sean Peisert,
Singapore VS Subrahmanian, Jeff Voas
Sreenivas Subramoney, Intel Corporation, India
CS PUBLICATIONS BOARD
Carole-Jean Wu, Arizona State University, USA
Lixin Zhang, Chinese Academy of Sciences, M. Brian Blake (VP for Publications),
China David Ebert, Elena Ferrari, Hui Lei, Timothy Pinkston,
Antonio Rubio Sola, Diomidis Spinellis, Tao Xie,
ADVISORY BOARD Ex officio: Robin Baldwin, Sarah Malik, Melissa Russell,
Forrest Shull
David H. Albonesi, Cornell University, USA
Erik R. Altman, IBM, USA COMPUTER SOCIETY OFFICE
Pradip Bose, IBM T.J. Watson Research Center, USA IEEE MICRO
Kemal Ebcioglu, Global Supercomputing c/o IEEE Computer Society
Corporation, USA 10662 Los Vaqueros Circle
Lieven Eeckhout, ELIS – Ghent University, Belgium Los Alamitos, CA 90720 USA; +1 (714) 821-8380
Ruby B. Lee, Princeton University, USA
Trevor Mudge, University of Michigan, Ann Arbor, Subscription change of address:
USA address.change@ieee.org
Yale Patt, University of Texas at Austin, USA Missing or damaged copies:
James E. Smith, University of Wisconsin–Madison, USA help@computer.org

IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters: Three Park Ave., 17th Floor, New York,
NY 10016; IEEE Computer Society Headquarters: 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office: 10662
Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Membership
Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses to
4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Author and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854 or pubs-permissions@ieee.org. ©2021 by IEEE. All rights reserved. Abstracting and
library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
6
GUEST EDITOR’S INTRODUCTION
Top Picks From the 2020 Computer
Architecture Conferences
Daniel A. Jiménez

MAY/JUNE 2021

10 19 27
Theme Articles

The Vision Behind MLPerf: Superconductor Leaking Secrets


Understanding AI Inference Computing for Through Compressed
Performance Neural Networks Caches
Vijay Janapa Reddi, Christine Cheng, Koki Ishida , Ilkwon Byun, Po-An Tsai, Andres Sanchez,
David Kanter, Peter Mattson, Ikki Nagaoka, Kosuke Fukumitsu, Christopher W. Fletcher, and
Guenther Schmuelling, and Carole-Jean Wu Masamitsu Tanaka, Daniel Sanchez
Satoshi Kawakami,
Teruo Tanimoto, Takatsugu Ono,
Jangwoo Kim, and Koji Inoue
Theme Articles Continued
 34  Understanding Acceleration Opportunities at Hyperscale
Akshitha Sriraman and Abhishek Dhanotia

 42 Accelerating Genomic Data Analytics With Composable


Hardware Acceleration Framework
Tae Jun Ham, David Bruns-Smith, Brendan Sweeney, Yejin Lee, Seong Hoon Seo,
U Gyeong Song, Young H. Oh, Krste Asanovic, Jae W. Lee, and Lisa Wu Wills

50  uGEMM: Unary Computing for GEMM Applications


Di Wu, Jingjie Li, Ruokai Yin, Younghyun Kim, Joshua San Miguel, and Hsuan Hsiao

57  BabelFish: Fusing Address Translations for Containers


Dimitrios Skarlatos, Umur Darbaz, Bhargava Gopireddy, Nam Sung Kim,
and Josep Torrellas

63  Characterizing and Modeling Nonvolatile Memory Systems


Zixuan Wang, Xiao Liu, Jian Yang, Theodore Michailidis, Steven Swanson,
and Jishen Zhao

71  Temporal Computing With Superconductors


Georgios Tzimpragos, Jennifer Volk, Dilip Vasudevan, Nestan Tsiskaridze,
George Michelogiannakis, Advait Madhavan, John Shalf, and Timothy Sherwood

80  A Next-Generation Cryogenic Processor Architecture


Ilkwon Byun, Dongmoon Min, Gyuhyeon Lee, Seongmin Na, and Jangwoo Kim

87  Balancing Specialized Versus Flexible Computation in Brain–


Computer Interfaces
Ioannis Karageorgos, Karthik Sriram, Ján Veselý, Nick Lindsay, Xiayuan Wen, Michael
Wu, Marc Powell, David Borton, Rajit Manohar and
Abhishek Bhattacharjee

95  Virtual Logical Qubits: A Compact Architecture for Fault-Tolerant


Quantum Computing
Jonathan M. Baker, Casey Duckering, David I. Schuster, and Frederic T. Chong

Columns and Departments


From the Editor-in-Chief
4  Top Picks From Year 2020
Lizy Kurian John
Security
Image credit:
Image licensed
103 The Next Security Frontier: Taking the Mystery Out of the Supply Chain
by Ingram Michael Mattioli, Tom Garrison, and Baiju V. Patel
Publishing.
Micro Economics
110   R emote Work
Shane Greenstein

Also in this Issue


1   Masthead
108  IEEE Computer Society Info

www.computer.org/micro

ISSN: 0272-1732
FROM THE EDITOR-IN-CHIEF

Top Picks From Year 2020


Lizy Kurian John, The University of Texas at Austin, TX, 78712, USA

W
e have lived more than one year in the fields. A second goal of Top Picks is to recognize
special circumstances created by COVID- excellent research in the field and bestow this honor
19. Microchips and the systems powered on researchers who conducted the outstanding
by them have helped the world to survive this chal- research that resulted in these articles. It is critically
lenging year. At the same time, the computer architec- important for our field to honor our budding research-
ture conferences saw many interesting articles on ers and help them to shape their careers. The Top
various aspects of computer system design. It is my Picks honor has been seen to be instrumental in
pleasure to present to you the IEEE Micro Top Picks achieving faculty positions, leading research positions
issue, with the very best articles from 2020. in industry, and prestigious research grants. Third, the
For more than a decade, IEEE Micro has had this writing style intended for a broader audience makes it
tradition of evaluating articles from the previous year’s easy for beginner graduate students to understand
computer architecture or computer systems confer- the state of the art of the field as they are pondering
ences and selecting the articles with the most novelty on topics to work for their PhDs. Above all, I expect
and potential for long-term impact. The magazine is these articles to be enjoyable reads for all the readers
upholding the tradition this year as well. Any article of IEEE Micro.
published in top architecture or systems conferences
during 2020 was eligible to compete for the Top Picks
honor. In total, 120 submissions were received, from
FOR MORE THAN A DECADE, IEEE
which 12 articles were chosen to represent the cream
MICRO HAS HAD THIS TRADITION OF
of the crop of 2020.
Professor Daniel Jimenez of Texas A&M University EVALUATING ARTICLES FROM THE
chaired this year’s selection committee. Jimenez and PREVIOUS YEAR’S COMPUTER
42 experts from academia and industry worked hard ARCHITECTURE OR COMPUTER
to identify 12 Top Picks and 12 honorable mention SYSTEMS CONFERENCES AND
articles. An article recognized as a Top Pick was SELECTING THE ARTICLES WITH THE
invited to prepare a submission for inclusion in this MOST NOVELTY AND POTENTIAL FOR
Special Issue. The articles in this Special Issue are LONG-TERM IMPACT. THE MAGAZINE
intended to be for a broader audience than the origi- IS UPHOLDING THE TRADITION THIS
nal conference articles. These articles also focus more YEAR AS WELL.
on their potential impact. The honorable mentions are
high-quality articles that unfortunately could not be
included in this Special Issue due to space constraints.
They are listed in the guest editor’s introduction. Inter- I take this opportunity to express my gratitude to
ested readers can locate them in the original confer- Prof. Jimenez and the selection committee members,
ence proceedings or the IEEE/ACM Digital Library. who spent countless hours during the Christmas and
The purpose of the Top Picks issue has been multi- New Year season to evaluate the submissions, and
fold. First and foremost, Top Picks was originally insti- conducted a face-to-face meeting and deliberated a
tuted to present the “best of the best” of the whole day to perform this important selection. Prof.
preceding year’s architecture research contributions Jimenez and the committee conducted a multistep
to a broader audience including industry and other selection process that tried to reduce the impact of
the discussion order of the articles. I wish to express
my special thanks to them for the thoughtful process
and the hard work.
0272-1732 ß 2021 IEEE The Top Picks articles belong to the following five
Digital Object Identifier 10.1109/MM.2021.3074759 themes: 1) machine learning benchmarking; 2) secu-
Date of current version 25 May 2021. rity; 3) accelerators; 4) address translation; and 5) new

4 IEEE Micro Published by the IEEE Computer Society May/June 2021


FROM THE EDITOR-IN-CHIEF

technologies. One may recognize that these are highly equipment manufacturers. Read this article to get an
relevant topics for advancing the state of the art of overview of the challenges and potential paths for a
computer architecture. A comprehensive article writ- solution.
ten by Daniel Jimenez serves as an excellent introduc- The Micro Economics column by Shane Greenstein
tion to the compendium. of Harvard Business School, titled “Remote work,” dis-
I want to personally congratulate all the Top Picks cusses how exercise during the pandemic has become
authors for their fantastic work. It is a significant regular operation. Read this article to see how the
honor to rise to the top in a competition where each workplace is likely to shift even after the pandemic
candidate article is already recognized as an excellent and how the distinction between the workplace and
piece of work. I hope that these works have significant the home place is going to blur. Many thoughts on the
impact on future computer systems. likely economic and societal changes are outlined in
In addition to the Top Picks articles, this issue also this article.
features a Micro Security article on supply chain secu- Year 2021 is a special year. It is the 50th anniversary
rity from Mattioli et al., and a Micro Economics column of the microprocessor. IEEE Micro will be publishing a
by Shane Greenstein of Harvard Business School, special issue on this theme in November/December
titled “Remote Work.” 2021. Other upcoming special issues are on FPGA com-
puting, quantum computing, commercial products, in-
memory computing, and smart agriculture.
YEAR 2021 IS A SPECIAL YEAR. IT IS THE I hope this Special Issue is thought provoking for
50TH ANNIVERSARY OF THE our readers and will help shape the field for many
MICROPROCESSOR. IEEE MICRO WILL years to come. I also hope that researchers in the field
BE PUBLISHING A SPECIAL ISSUE ON intensify their efforts to design better chips and sys-
THIS THEME IN NOVEMBER/ tems to help medical research and ordinary daily lives.
DECEMBER 2021. Additionally, I encourage readers to submit to IEEE
Micro. The magazine is interested in submissions on
any aspect of chip or system design or architecture.
From the Micro Security department, we present Enjoy the award-winning articles presented in this
an interesting article titled “The next security frontier: issue. All the 12 Top Picks articles can be thought of
Taking the mystery out of the supply chain,” coau- as award-winning articles. When IEEE Micro selects its
thored by Mattioli, Garrison, and Patel. When a prod- Annual Best Paper Award at the end of each year, the
uct goes through several warehouses and gets Top Picks articles are not eligible to compete for the
handled by many before finally being received or used Best Paper Award, because they are already catego-
by the end client, there are many risks and vulnerabil- rized as prestigious. Hope our industry and academic
ities regardless of whether the end product has been readers enjoy these articles.
altered from the intended one either unintentionally Happy reading!
or maliciously. This is true for all products, but there
are increased risks in computer system components
LIZY KURIAN JOHN is a Cullen Trust for Higher Education
and systems. The authors outline some solutions,
including a centralized database or ledger, or a block- Endowed Professor with the Electrical and Computer Engi-
chain-based approach, but participation will be neering Department, University of Texas at Austin. Contact
required from foundries, device manufacturers, and her at ljohn@ece.utexas.edu.

May/June 2021 IEEE Micro 5


GUEST EDITOR'S INTRODUCTION

Top Picks From the 2020 Computer


Architecture Conferences
nez, Texas A&M University, College Station, TX, 77843, USA
Daniel A. Jime

Every year, IEEE Micro selects some of the most notable articles published in the
previous year’s computer architecture conferences to highlight in this Special Issue. In
early 2021, the selection committee chose 12 articles to appear as “Top Picks” and an
additional 12 deserving an “Honorable Mention” from among the 2020 conferences.

I
t is my pleasure to introduce to you IEEE Micro’s I also invited a diverse set of other researchers who regu-
Top Picks from the 2020 computer architecture larly publish in our top conferences. The final selection
conferences. Every year, IEEE Micro publishes a committee had 42 people.
special issue with the previous year’s Top Picks as
chosen by a selection committee. It was my privilege
this year to be chosen as the selection committee
chair and special issue guest editor. This year’s selec-
THIS YEAR’S SELECTIONS SHOWCASE A
tions showcase a great diversity of research topics GREAT DIVERSITY OF RESEARCH
making an impact in our community. From traditional TOPICS MAKING AN IMPACT IN OUR
microarchitecture, to machine learning, nonvolatile COMMUNITY. FROM TRADITIONAL
memory management, security, and quantum comput- MICROARCHITECTURE, TO MACHINE
ing, our community has been very busy redefining and LEARNING, NONVOLATILE MEMORY
reinventing computer architecture research. IEEE MANAGEMENT, SECURITY, AND
Micro imposes a hard limit of 12 articles that can be QUANTUM COMPUTING, OUR
chosen as Top Picks. It was very hard to summarize all COMMUNITY HAS BEEN VERY BUSY
the great research of 2020 in just a dozen articles but REDEFINING AND REINVENTING
the selection committee did an excellent job. The
COMPUTER ARCHITECTURE
committee chose another 12 articles as “Honorable
RESEARCH.
Mentions” described in the “Honorable Mentions”
sidebar. I encourage you to read each of the Top Picks
and Honorable Mentions.
Authors were asked to prepare three-page descrip-
tions of their articles, including a citation for a hypo-
REVIEW PROCESS
thetical “test of time” award their article could win in
The review process for Top Picks was similar to previous
ten years. Authors submitted these three-pagers
years with the exception that the selection committee
along with the original conference article. The intent
meeting was held remotely for the first time. I chose a
of Top Picks is to choose noteworthy articles from the
somewhat larger committee than previous years to
top computer architecture conferences, but I allowed
reduce individual workload, take advantage of more spe-
a rather broad interpretation as our community has
cialized expertise, and select some recently graduated
grown significantly. For example, this year one of the
rising stars to give them an insight into the process. I
Honorable Mentions is from OSDI and another is from
invited half of the previous year’s committee as well as
the IEEE Symposium on Security and Privacy.
the program chairs of 2020’s top architecture conferen-
The review process took place in two rounds. In the
ces and recent Top Picks selection committee chairs.
first round, each article was assigned four reviews. After
the reviews were finished, each article was assigned a dis-
cussion lead to guide an online discussion during which
0272-1732 ß 2021 IEEE each group of reviewers decided whether the article
Digital Object Identifier 10.1109/MM.2021.3074795 should proceed to the second round. Out of 120 submis-
Date of current version 25 May 2021. sions, 47 went on to the second round. These articles

6 IEEE Micro Published by the IEEE Computer Society May/June 2021


GUEST EDITOR'S INTRODUCTION

were assigned an additional five reviews, so that every concern. One article from researchers at Illinois, MIT,
article was assigned nine reviews. The vast majority of EPFL, and NVIDIA gives techniques for attacking the
these articles actually received nine reviews; three security of compressed caches, showing that com-
received only eight due to unforeseen reviewer circum- pressed caches are vulnerable to side-channel attacks.
stances. After the second round of reviewers, there was
another online discussion period led again by discussion Accelerators
leads. This time, articles were classified into categories to
One article shows the potential for accelerators to
help decide which articles should be discussed. During
improve performance in datacenters. This study from
the selection committee meeting, articles were discussed
Facebook and the University of Michigan character-
in an order determined by an aggregate of merit scores
izes workloads at Facebook in terms of the time spent
normalized by an algorithm that takes into account
in portions that could be offloaded to accelerators,
reviewer bias. At the beginning of each discussion, those
and then projects the performance improvement
with conflicts were moved to a Zoom break-out room
achievable by acceleration. Unlike previous work, the
and were not able to hear or see the discussion of the
study is done in the context of increasingly common
conflicted article. At the end of each discussion, a deci-
microservices. Another article from Seoul National
sion was made to make the article a “Top Pick,” “Honor-
University, Berkeley, Duke, and SungKyunKwan Uni-
able Mention,” or “Reject.” After the decision was
versity looks at acceleration in the genomics domain.
recorded and the shared screen was cleared of informa-
It presents a framework for flexible genomic data
tion from the discussion, conflicts were allowed to re-
processing using SQL-like queries as a domain-spe-
enter the main Zoom room. According to guidelines intro-
cific language targeting FPGAs. Another fascinating
duced at the beginning of the meeting, a procedure was
article from the University of Wisconsin and the Uni-
in place to upgrade Honorable Mentions or Rejects if
versity of Toronto provides a novel microarchitecture
there were fewer than 12 Top Picks or Honorable Men-
for unary computing focused on matrix multiply for
tions at the end of the discussion. I was advised by the
power-constrained environments, showing a very dif-
IEEE Micro Editor-In-Chief that 12 is the absolute limit for
ferent way to compute with simple circuits.
Top Picks and another 12 for Honorable Mentions, and I
wanted to award as many of these distinguished awards
as possible to highlight the great research going on in the Address Translation
architecture community. I am gratified but not surprised One article from a team at Illinois and NVIDIA gives a
that we filled all the slots for 12 Top Picks and an addi- new more efficient method for address translation for
tional 12 Honorable Mentions. containerized workloads by deduplicating page table
entries within translation-lookaside buffers and smar-
ter page table sharing techniques.
SELECTED ARTICLES
This issue has an eclectic set of articles reflecting our
New Technologies
diverse community.
Several articles describe architectural aspects of new
and emerging technologies. One article from the Uni-
Machine Learning versity of California, San Diego, characterizes the first
Interest in machine learning has blossomed in recent widely available nonvolatile memory technology, com-
years, and architecture is on the forefront of this new pares its behavior against previous less accurate mod-
revolution. One article from a large team spanning els used in academia, and proposes modifications to
many companies including Google, Microsoft, Face- make those models more accurate. Another article
book, Intel, NVIDIA, ARM, and others introduces what from UC Santa Barbara, Lawrence Berkeley National
has already become an industry-standard benchmark Labs, NIST, and the University of Maryland shows how
inference suite: MLPerf, addressing the need for an to design logic and processors using superconducting
industry-wide, fair, and reproducible machine learning substrates, where pulses rather than voltage levels
benchmarking methodology. Another article from are the fundamental physical phenomenon that carry
Kyushu University, Seoul University, and Nagoya Uni- information. Another article from Seoul National Uni-
versity proposes an optimized neural processing unit versity examines the design of cryogenically cooled
made with single-flux quantum technology. processors, looking at the performance benefits and
power tradeoffs and how to design a processor to bet-
Security ter leverage a cryogenic environment. Another article
As industry explores more complex processor designs, from Yale, Rutgers, and Brown describes adapting a
inadvertent exposure of security risks remains a top traditional architecture to a new extreme domain:

May/June 2021 IEEE Micro 7


GUEST EDITOR'S INTRODUCTION

HONORABLE MENTIONS

Article Summary
“NeuMMU: Architectural Support for Efficient This article explores the design of memory
Address Translations in Neural Processing Units” management units for emerging neural network
by Hyun et al. (ASPLOS 2020). accelerators.
“Hermes: A Fast, Fault-Tolerant and Linearizable Combining insights from cache coherence and
Replication Protocol” by Katsarakis et al. (ASPLOS consensus protocols, Hermes is a fault-tolerant
2020) replication protocol that achieves high
performance and strong consistency.
“DSAGEN: Synthesizing Programmable Spatial This article addresses the cost of developing
Accelerators” by Weng et al. (ISCA 2020) specialized accelerators, proposing a framework
for programmable accelerator synthesis for a
decoupled spatial architecture.
“Printed Machine Learning Classifiers” by Husnain Inkjet printed circuits are an interesting new
Mubarik et al. (MICRO 2020). technology. This article gives a deep dive into in a
synergistic application domain: classifiers.
“AGAMOTTO: How Persistent is your Persistent AGAMOTTO is a persistent memory debugging
Memory Application?" by Neal et al. (OSDI 2020) framework that uses symbolic execution to test
persistent memory systems.
“Graphene: Strong yet Lightweight Row Hammer This article introduces a technique to prevent
Protection” by Park (MICRO 2020) DRAM Rowhammer attacks based on detecting
frequently accessed elements. The idea features
low overhead and strong security guarantees.
“Systematic Crosstalk Migitation for This article mitigates crosstalk in quantum
Superconducting Qubits via Frequency-Aware computing systems with a hardware–software
Compilation” by Ding et al. (MICRO 2020) solution that tunes the frequency of each qubit.
“TRRespass: Exploiting the Many Sides of Target This article gives a comprehensive study of Target
Row Refresh” by Frigo et al. (SP (Oakland) 2020) Row Refresh, a DRAM Rowhammer mitigation
technique, and shows that new DRAM chips
continue to be vulnerable to Rowhammer attacks.
“SQUARE: Strategic Quantum Ancilla Reuse for This article shows it is possible to use ancilla (i.e.,
Modular Quantum Programs via Cost-Effective scratch) qubits by doing uncomputation to set the
Uncomputation” by Ding et al. (ISCA 2020) ancilla back to the original state, allowing longer
programs to be run on a quantum computer.
“Leaking Information Through Cache LRU States” This article shows how to use LRU replacement
by Xiong and Szefer (HPCA 2020) state to create a cache covert channel between
agents sharing a cache. It works for a variety of
scenarios and pseudo-LRU designs.
“Elastic Cuckoo Page Tables: Rethinking Virtual This article gives a novel page table design based
Memory Translation for Parallelism” by Skarlatos on cuckoo hashing. Sequentially walked radix
et al. (ASPLOS 2020) tables are replaced by a set of cuckoo tables
searched in parallel. The scheme allows page
tables to scale efficiently.
“The Architectural Implications of Facebook’s This article gives a characterization of Facebook
DNN-based Personalized Recommendation” by servers running synthetic recommendation
Gupta et al. (HPCA 2020) models. It introduces open source
recommendation models representing a large
class of services important in modern datacenters.

8 IEEE Micro May/June 2021


GUEST EDITOR'S INTRODUCTION

brain–computer interfaces. It gives optimizations to experiences with Zoom PC meetings that made our
make power-efficient hardware by designing a chip for meeting more enjoyable; Lizy Kurian John, IEEE
this unique application domain. Finally, an article from Micro EIC, for asking me to be the selection chair
the University of Chicago proposes a 2.5D virtualized and guest editor, and for her helpful advice and
architecture for enabling the efficient implementation guidance throughout this process; and the selection
of surface codes, which are error correcting codes committee whose hard work over the Christmas
that enable fault-tolerant quantum computing. and New Year’s holidays made this selection pro-
cess possible. Most of all, I would like to thank the
authors of the 120 submissions, whose remarkable
THANKS contributions to research continue to make com-
I would like to give special thanks to Elba Garza and puter architecture an adventure for all of us.
Gino Chacon, doctoral students at Texas A&M Uni-
versity, who were tremendously helpful running the
DANIEL A. JIMÉNEZ is a Professor in the Department of Com-
HotCRP and Zoom software during the selection
committee meeting. I would also like to thank Emery puter Science & Engineering at Texas A&M University. His
Berger, ASPLOS 2021 co-PC-chair who shared with research is in microarchitecture, including microarchitectural
me scripts for parsing HotCRP output to provide prediction and cache management. He invented perceptron-
files for managing conflicts of interest in Zoom; based branch prediction which is used in millions of process-
Boris Grot and Paul Gratz who acted as selection ors. He is an NSF CAREER award winner, and a member of
chair for papers with which I was conflicted; Kathryn the MICRO, HPCA, and (as of June 2021) ISCA Halls of Fame.
McKinley for helpful suggestions based on previous Contact him at djimenez@tamu.edu.

SELECTION COMMITTEE

› Nael Abu-Ghazaleh, University of California › Kathryn S McKinley, Google


Riverside › Onur Mutlu, ETH Zurich
› Abhishek Bhattacharjee, Yale University › Prashant Nair, University of British Columbia
› Marc Casas, Barcelona Supercomputing › Gilles A. Pokam, Intel
Center › Moinuddin Qureshi, Georgia Tech
› Fred Chong, University of Chicago › Parthasarathy Ranganathan, Google
› Reetuparna Das, University of Michigan › Adrian Sampson, Cornell University
› Christina Delimitrou, Cornell University › Daniel Sanchez, MIT
› Ronald Dreslinski, University of Michigan › Joshua San Miguel, University of Wisconsin-
› Lieven Eeckhout, Ghent University Madison
› Hadi Esmaeilzadeh, University of California, › Simha Sethumadhavan, Columbia Univer-
San Diego sity/Chip Scan
› Michael Ferdman, Stony Brook University › Sophia Shao, UC Berkeley
› Christopher Fletcher, University of Illinois- › Yan Solihin, University of Central Florida
Urbana Champaign › Viji Srinivasan, IBM Research
› Paul Gratz, Texas A&M University › Josep Torrellas, University of Illinois-Urbana
› Boris Grot, University of Edinburgh Champaign
› Engin Ipek, Qualcomm › Caroline Trippel, Stanford University
› Akanksha Jain, Arm Research › Thomas F. Wenisch, University of Michigan
› Aamer Jaleel, NVIDIA Research and Google
› Jose Joao, Arm Research › Lisa Wu Wills, Duke University
› Samira Khan, University of Virginia › Yuan Xie, University of California, Santa
› Martha Kim, Columbia University Barbara
› Hyesoon Kim, Georgia Tech › Mengjia Yan, MIT
› John Kim, KAIST › Jishen Zhao, University of California, San
› Gabriel Loh, Advanced Micro Devices Diego

May/June 2021 IEEE Micro 9


THEME ARTICLE: TOP PICKS

The Vision Behind MLPerf: Understanding AI


Inference Performance
Vijay Janapa Reddi, Harvard University, Cambridge, MA, 02138, USA
Christine Cheng, Intel, Santa Clara, CA, 95054-1549, USA
David Kanter, MLCommons, Mountain View, CA, 94105, USA
Peter Mattson , Google, Mountain View, CA, 94043, USA
Guenther Schmuelling, Microsoft Corp, Redmond, WA, 98052, USA
Carole-Jean Wu , Facebook Inc., Menlo Park, CA, 342996, USA

Deep learning has sparked a renaissance in computer systems and architecture. Despite
the breakneck pace of innovation, there is a crucial issue concerning the research and
industry communities at large: how to enable neutral and useful performance
assessment for machine learning (ML) software frameworks, ML hardware accelerators,
and ML systems comprising both the software stack and the hardware. The ML field
needs systematic methods for evaluating performance that represents real-world use
cases and useful for making comparisons across different software and hardware
implementations. MLPerf answers the call. MLPerf is an ML benchmark standard driven
by academia and industry (70+ organizations). Built out of the expertise of multiple
organizations, MLPerf establishes a standard benchmark suite with proper metrics and
benchmarking methodologies to level the playing field for ML system performance
measurement of different ML inference hardware, software, and services.

I
n this article, we describe the MLPerf Inference the launch of MLPerf Inference in November 2019, over
benchmark’s design principles. Inference spans 3,500 individual results from 100+ system configurations
across the full spectrum of machine learning (ML) have been submitted: the number of inference submis-
systems, ranging from cloud-scale data centers to sions has tripled from 600+ to 1,800+ between the Octo-
edge devices like smartphones and endpoint systems ber 2019 and March 2021 submission rounds. Submitting
based on embedded hardware. We present the chal- organizations include large organizations and small start-
lenges and opportunities in developing a benchmark ups, reflecting the diversity in hardware innovation
that tackles the complexity, heterogeneity, and scale (CPUs, GPUs, FPGAs, and other accelerators) and their
of ML systems. Specifically, we describe the design of respective software stacks. The performance of the slow-
benchmarking scenarios. These scenarios reflect real- est and fastest ML inference systems span over five
world ML deployment use cases to address the various orders of magnitude, representing the broad spectrum of
inference deployment services and the evaluation ML deployments and use cases, as summarized in Table 1.
methodologies that enable measurement results of Although the results include various systems, the variety
various optimization techniques to reflect realistic per- of results are comparable because of the standardized
formance effects from inference execution conditions. ML benchmarks, data sets, metrics, and execution rules.
The vision behind MLPerf11 is to provide fair and useful
benchmarks for measuring training10 and inference13 per-
formance of ML hardware, software, and services. Since MLPERF INFERENCE
Building an ML system benchmark requires us to
address three challenges: the diversity of models, the
0272-1732 ß 2021 IEEE variety of deployment scenarios, and the array of infer-
Digital Object Identifier 10.1109/MM.2021.3066343
Date of publication 17 March 2021; date of current version
ence systems. To this end, the essential products of
25 May 2021. MLPerf Inference are as follows.

10 IEEE Micro Published by the IEEE Computer Society May/June 2021


TOP PICKS

TABLE 1. MLPerf covers a wide range of ML deployments, empower a variety of real-world ML use cases, each of
spanning from cloud datacenters to ultralow-power systems. which requires scenario-specific performance metrics.
Appropriate metrics, reflecting production use cases,
MLPerf Machine learning tasks are not specific to MLPerf but will benefit the ML and
Inference systems community to make fair and neutral system
Cloud/ Image classification, object detection, performance comparison.
Datacenter image segmentation, speech, language MLPerf Inference consists of four evaluation sce-
processing, recommendation,
narios: single-stream, multistream, server, and offline.
reinforcement learning
The scenarios capture many critical inference applica-
Edge Image classification, object detection,
tions. As demonstrated in the original article,13 the
image segmentation, speech, language
processing performance of an ML system can vary drastically
Mobile Image classification, object detection,
under these different scenarios using the correspond-
image segmentation, language processing ing metrics. MLPerf Inference provides a way to emu-
Tiny Image classification, Object detection, late the inference system’s realistic behavior under
Anomaly detection, Speech test; such a feature is unique amongst existing AI
MLPerf benchmarks.
training
Cloud/ Image classification, object detection,
Datacenter image segmentation, natural language
THE VISION BEHIND MLPERF IS TO
processing, recommendation,
reinforcement learning PROVIDE FAIR AND USEFUL
HPC Climate segmentation, cosmology BENCHMARKS FOR MEASURING
parameter prediction TRAINING AND INFERENCE
PERFORMANCE OF ML HARDWARE,
Identify Representative ML Inference SOFTWARE, AND SERVICES.
Workloads for Reproducibility,
Accessibility, and Fair Comparison
The ML ecosystem is rife with models. Comparing and
contrasting ML system performance is nontrivial
because implementations vary in model complexity and
Prescribe Target Qualities and Tail-
execution characteristics. A name such as ResNet-50 Latency Bounds in Accordance With
does not uniquely or portably describe a model. Conse- Real-World Use Cases
quently, understanding improvements in system perfor- Quality and performance are intimately connected for
mance with such an unstable baseline is challenging. all forms of ML. System architectures and their opera-
One of the significant contributions of this work is tors sometimes choose to trade off model quality to
identifying and defining representative models and achieve lower latency, lower total cost of ownership,
describing an inclusive benchmarking methodology or higher throughput. The tradeoffs are application-
that enables result reproducibility. Based on consen- specific. To reflect this important aspect of real-world
sus of the working groups, MLPerf inference comprises deployments, benchmarks must provide representa-
mature models and have earned broad community tive model quality targets.
support. MLPerf models, the reference implementa- MLPerf defines machine learning model accuracy
tions, and the benchmarking infrastructure and meth- or quality targets. For example, for accuracy-sensitive
odology are all based on open-source software and can tasks, such as recommendation systems and medical
be easily leveraged for research and further develop- imaging, stringent quality targets of 99% and 99.9% of
ment (https://github.com/mlcommons/inference). the reference implementation are specified and
enforced. We established per-model and scenario tar-
gets for inference latency and model quality. The
Define Different Usage Scopes for latency bounds and target model qualities are based
Realistic Evaluation and Meaningful on input gathered and consensus reached from end-
Measurement Results users, ML systems practitioners, and the correspond-
ML inference systems cover a broad spectrum, rang- ing working groups. As MLPerf evolves these in the
ing from deeply embedded devices to smartphones to future based on industry needs, the broader research
edge servers and data centers. These systems community can track these changes to stay relevant.

May/June 2021 IEEE Micro 11


TOP PICKS

Synthesize Rules That Enable original article, our design enables the benchmark to
Benchmark Users to Showcase extend the scope to include more areas and tasks.
Hardware and Software Capabilities Between 2019 and 2020, MLPerf Inference introduced
The ML community has embraced a variety of DL lan- four new inference benchmarks for data centers—
guages and libraries. Hence, there is a need for a DLRM for recommendation,7,12 3D U-NET for medical
semantic-level benchmark that specifies the task to segmentation,8 RNN-T9 and BERT6 for natural lan-
be accomplished and the general rules to follow but guage processing, and additional inference bench-
leaves implementation choices to the benchmark marks for mobile—Mobile-BERT15 for natural language
submitters. processing, MobileNetEdgeTPU1 for image classifica-
MLPerf Inference is a semantic-level benchmark tion, and DeepLabV3+5 for image segmentation.
that provides flexibility for submitters to optimize the
reference models, i.e., to run the reference code we BENCHMARK DESIGN
provide through their preferred software toolchains The ML system diversity presents a unique challenge
and measure the system performance on the hard- to deriving a robust and useful ML benchmark that
ware of choice. This allows submitters to perform cus- meets industry needs. MLPerf Inference adopted a set
tom optimizations, such as use different numerical of principles for developing a robust yet flexible
formats and various quantization schemes. within benchmark suite based on community-driven develop-
some restricted set of rules. MLPerf Inference pro- ment. Here, we describe the benchmarks, the quality
vides two result divisions: closed and open. Strict rules targets, and the scenarios under which the benchmark
govern the closed division, whereas the open division tasks can be evaluated.
allows submitters to change the model and demon-
strate different latency and accuracy targets. The
Representative, Broadly Accessible
closed division addresses the lack of a standard infer-
Workloads
ence-benchmarking workflow. The open division
Designing ML benchmarks is different from designing
relaxes a subset of the benchmark rules to encourage
traditional non-ML benchmarks. MLPerf defines high-
a wide variety of software-level innovations.
level tasks that a machine-learning system can per-
form. For each one, we provide a canonical reference
model(s) in a few widely used frameworks. Any imple-
MLPERF INFERENCE FOCUSES FIRST mentation that is mathematically equivalent to the
AND FOREMOST ON MODULAR reference model is considered valid, and certain other
BENCHMARK DESIGN TO ADD NEW deviations (e.g., numerical formats) are allowed.
MODELS AND TASKS LESS COSTLY The concept of a reference model and a valid class
of equivalent implementations creates freedom for
WHILE PRESERVING THE USAGE
most ML systems while still enabling relevant compar-
SCENARIOS, TARGET QUALITIES, AND
isons of inference systems. MLPerf provides reference
INFRASTRUCTURE.
models using 32-bit floating-point weights and, for
convenience, also provides carefully implemented
equivalent models to address three formats: Tensor-
Flow, PyTorch, and ONNX.
Present a Modular and Flexible Table 2 summarizes the vision, speech, language,
Inference Benchmarking Methodology and commerce tasks in MLPerf Inference v1.0 (exclud-
That Allows Components to Get ing Mobile14 and tinyMLPerf4). From 2019 to 2021,
Updated Individually MLPerf Inference doubled in size, adding speech, com-
ML applications, tasks, and models evolve quickly. merce, and medical imaging to the suite. This trend
Therefore, the ultimate challenge for any ML bench- will continue as new benchmark tasks, models and
marks is not the tasks themselves, rather the underly- versions become available.
ing benchmarking methodology that can withstand
the rapid pace of change. Realistic End-User Scenarios
MLPerf Inference focuses first and foremost on ML applications have a variety of usage models and fig-
modular benchmark design to add new models and ures of merit, which, in turn, require multiple perfor-
tasks less costly while preserving the usage scenarios, mance metrics. We surveyed MLPerf’s membership,
target qualities, and infrastructure. As we show in the which includes customers and vendors. Based on that

12 IEEE Micro May/June 2021


TOP PICKS

TABLE 2. ML tasks in the edge- and datacenter-class for MLPerf inference v1.0. Each one reflects critical commercial and
research use cases for a large class of ML practitioners; together they also capture a broad set of computing motifs. The top
four tasks and reference models were introduced in v0.5, while the lower four were added in v0.7.

v0.5 v0.7 v1.0 Area Task Reference Data Set Quality Target (Top-1
Model Accuracy)
@ @ @ Vision Image ResNet-50 ImageNet (224x224) 99% of FP32
classification v1.5
(heavy)
@ @ @ Vision Object detection SSD- COCO (1,200x1,200) 99% of FP32
(heavy) ResNet34
@ @ @ Vision Object detection SSD- COCO (300x300) 99% of FP32
(light) MobileNet-
v1
@ Language Machine GNMT WMT16 EN-DE 99% of FP32
translation
@ @ Commerce Recommendation DLRM 1TB Click Logs 99% of FP32 and
99.9% of FP32
@ @ Language Language BERT SQuAD v1.1 (max_seq_len=384) 99% of FP32 and
processing 99.9% of FP32
@ @ Speech Speech-to-text RNN-T Librispeech dev-clean 99% of FP32
(samples < 15 seconds)
@ @ Vision Medical image 3D U-Net BraTS 2019 (224x224x160) 99% of FP32 and
segmentation 99.9% of FP32

feedback, we identified four scenarios that represent a delays the remaining queries by one interval. No more
variety of critical inference applications: single-stream, than 1% of the queries may produce one or more
multistream, server, and offline. These scenarios emulate skipped intervals. A query’s N input samples are contig-
the ML workload behavior of mobile devices, autono- uous in memory, which accurately reflects production
mous vehicles, robotics, and cloud-based setups. input pipelines and avoids penalizing systems that
would otherwise require that samples be copied to a
Single-stream contiguous memory region before starting inference.
This scenario represents one inference-query stream
with a query sample size of one, reflecting the many Server
client applications where response latency is critical. This scenario represents online applications where
One example is the on-device voice transcription task query arrival is random and latency is important.
on Google’s Pixel 4 smartphone. To measure perfor- Almost every consumer-facing website is a good
mance, we inject a single query into the inference sys- example, including services such as online translation
tem; when the query is complete, we record the from Baidu, Google, and Microsoft. For this scenario,
completion time and inject the next query. The metric queries are sent with one sample each, in accordance
is the query stream’s 90th-percentile latency. with a Poisson distribution. The system under test
(SUT) responds to each query within a benchmark-
Multistream specific latency bound that varies from 15 to 250 ms.
This scenario represents applications with a stream of No more than 1% of queries may exceed the latency
queries, but each query comprises multiple inferences, bound for the vision tasks and no more than 3% may
reflecting a variety of industrial-automation and do so for translation. The scenario’s performance met-
remote-sensing applications. To model a concurrent ric is the Poisson parameter that indicates the queries
scenario, we send a new query comprising N input sam- per second (QPS) achievable while meeting the quality
ples at a fixed time interval (e.g., 50 ms). The interval is of service (QoS) requirement.
benchmark specific and also acts as a latency bound
that ranges from 50 to 100 ms. If the system is available, Offline
it processes the incoming query. If it is still processing This scenario represents batch-processing applica-
the prior query in an interval, it skips the interval and tions where all the data are immediately available and

May/June 2021 IEEE Micro 13


TOP PICKS

latency is unconstrained. An example is identifying the the latency constraint, each benchmark scenario
people and locations in a photo album. For the offline must run for at least N amount of time (e.g., 60 s in
scenario, we send a single query that includes all sam- v0.5 and 10 min in v1.0) and process additional queries
ple-data IDs to be processed, and the system is free to and/or samples as the scenarios requires.
process the input data in any order. Similar to the mul-
tistream scenario, neighboring samples in the query
INFERENCE SUBMISSION SYSTEM
are contiguous in memory. The metric for the offline
An MLPerf inference submission system contains mul-
scenario is throughput measured in samples per
tiple components: an SUT, the Load Generator
second.
(LoadGen), a data set, and an accuracy script. In this
For the multistream and server scenarios, latency
section, we describe these various components. The
is a critical component of the system behavior and
data set, LoadGen, and accuracy script are fixed for all
constrains various performance optimizations. For
submissions and are provided by MLPerf. Submitters
example, most inference systems require a minimum
have wide discretion to implement an SUT according
(and architecture-specific) batch size to achieve full
to requirements and engineering judgment.
utilization of the underlying computational resources.
But in a server scenario, the arrival rate of queries is
random, so systems must optimize for tail latency and System Under Test
potentially process inferences with a suboptimal The goal of MLPerf Inference is to measure system
batch size. performance across a wide variety of SUTs. But the
To reflect real-world inference use cases for data- properties of realism, comparability, architecture neu-
center and edge systems, each task and scenario has trality, and friendliness to small submission teams
specific quality and quality-of-service constraints. For require careful tradeoffs. Therefore, we set the model-
MLPerf v1.0, we require datacenter system submis- equivalence rules to allow submitters to re-implement
sions to cover Server and Offline whereas Single- models on different architectures. The rules provide a
stream and Offline are required for edge system sub- complete list of disallowed techniques and a list of
missions, and multistream is optional. We additionally allowed technique examples. The list illustrates the
require that all datacenter submissions must use ECC boundaries of the blacklist while also encouraging
for off-chip memory to reflect real-world requirements. common and appropriate optimizations.

Robust Quality Targets Load Generator


The tradeoffs between accuracy, latency, and The LoadGen is a traffic generator that loads the SUT
throughput are application specific. We require that and measures performance that is provided by MLPerf
most implementations achieve a quality target within Inference. The LoadGen produces the query traffic
1% of the FP32 reference model’s accuracy. For the according to the rules of the previously described sce-
medical imaging, language processing, and recom- narios (i.e., single-stream, multistream, server, and off-
mendation tasks in datacenter submissions, we addi- line). Additionally, the LoadGen collects information
tionally require achieving a quality target that is within for logging, debugging, and postprocessing the data. It
0.1% of the reference, reflecting realistic deployment records queries and responses from the SUT, and at
needs. the end of the run, it reports statistics, summarizes
the results, and determines whether the run was valid.
Statistically Confident Tail-Latency The LoadGen has two primary operating modes: accu-
racy and performance. It measures the holistic perfor-
Bounds
mance of the entire SUT rather than any individual
Each task and scenario combination requires a mini-
part. Finally, this condition enhances the benchmark’s
mum number of queries to ensure results are statisti-
realism: inference engines typically serve as black-box
cally robust and adequately capture steady-state
components of larger systems.
system behavior. That number is determined by the
tail-latency percentile, the desired margin, and the
desired confidence interval. We selected a margin that Data Set
is one-twentieth of the difference between the tail- We use standard and publicly available data sets
latency percentage and 100%. For scenarios with (Table 2) to ensure that the community can partici-
latency constraints, our goal is to ensure a 99% confi- pate. Data sets are downloaded before LoadGen uses
dence interval that the constraints hold. In addition to it to run the benchmark.

14 IEEE Micro May/June 2021


TOP PICKS

FIGURE 2. Systems that perform well for throughput (offline


FIGURE 1. Performance range across systems for models in
mode) may perform poorly under latency-bounded through-
single-stream (SS), multistream (MS), server (S), and offline
put (server mode).
(O) scenarios.

Accuracy Checker
The LoadGen also ensures the submission system complies with the rules. In addition, it can self-check to determine whether its source code has been modified during the submission process. To facilitate validation, the submitter provides an experimental config file that allows use of nondefault LoadGen features.

Audits
To provide additional verification, we introduced an audit process in MLPerf Inference v0.7. The audits are designed to ensure compliance with the rules and to help to understand any potential performance anomalies. One submission is randomly selected for an audit, and a second system is selected by the submitters collectively. During an audit, an auditor will examine, reproduce, and experiment with a submitted system to verify the performance.

Power Measurement
We built and introduced a set of tools to support optional full-system power measurement for wall-powered systems. We licensed the PTDaemon software from SPEC3 to interface with a wide variety of power analyzers, and built a set of tools to automate the collection, correlation, and verification of power consumption during performance runs, thereby enabling direct measurement of power and energy efficiency. Power measurement also influenced our choice of the 10-minute minimum run time.

RESULT HIGHLIGHTS
In October 2019, the first version of the benchmark (v0.5) was put to the test. We received over 600 submissions in all three categories (available, preview, and research/products under development) across the closed and open divisions from 14 organizations. The results are the most extensive corpus of inference performance data available to the public, covering a range of ML tasks and scenarios, hardware architectures, and software runtimes. Here, we briefly assess the major results on the basis of our benchmark objectives.

Benchmark Must Capture and Reflect the Broad Power and Performance Range, From Mobile and Edge Devices to Cloud Computing Systems
The performance delta between the smallest and largest inference systems is four orders of magnitude, or about 10,000x. Figure 1 shows the results across various inference tasks and scenarios. In the MobileNet-v1 single-stream scenario (SS), ResNet50 v1.5 (SS), and SSD-MobileNet-v1 offline (O), systems exhibit a large performance difference (100x). Because these models have many applications, the systems that target them cover everything from low-power embedded devices to high-performance servers.

Accurate Modeling of Inference Deployment Scenarios Is Key to Producing Representative System Performance Benchmarking Results
Figure 2 shows the throughput degradation from the server scenario, i.e., the latency-bounded throughput constraint that is normally ignored in evaluations. Across all inference benchmarks (x-axis), the queries-per-second throughput ratio between the Server and Offline scenarios is up to 0.55 for GNMT and on average 0.8 for ResNet50 and SSD-ResNet34. This 20% performance difference comes from the ability of the Inference LoadGen to accurately reflect query serving
scenarios—implementing query arrival patterns using the Poisson distribution for the Server scenario. This is important, as query arrival characteristics determine the degree of input batching by balancing queuing effects with compute parallelism. Also, as the throughput of AI systems increases, the effect of query arrival patterns becomes much less pronounced.
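To make the role of the arrival model concrete, here is a small simulation sketch (our own illustration, not the LoadGen implementation): it draws Poisson query arrivals for a hypothetical fixed-batch accelerator and counts only queries that meet a latency bound, which is what separates Server-mode from Offline-mode throughput. The batch size, batch latency, target rate, and latency bound are assumed values.

```python
import random

def server_scenario_qps(target_qps, batch_size, batch_latency_s, latency_bound_s,
                        num_queries=20000, seed=0):
    """Toy model: queries arrive with Poisson (exponential inter-arrival) timing,
    the device processes them in fixed-size batches, and a query only counts if
    it finishes within the latency bound. Returns the achieved (valid) QPS."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(num_queries):
        t += rng.expovariate(target_qps)          # Poisson process: exponential gaps
        arrivals.append(t)

    device_free, last_finish, on_time = 0.0, 0.0, 0
    for i in range(0, num_queries, batch_size):
        batch = arrivals[i:i + batch_size]
        start = max(device_free, batch[-1])       # wait until the batch is full
        finish = start + batch_latency_s
        device_free = last_finish = finish
        on_time += sum(1 for a in batch if finish - a <= latency_bound_s)

    return on_time / (last_finish - arrivals[0])

# Offline mode ignores arrival times entirely, so its throughput is simply
# batch_size / batch_latency_s; under Poisson arrivals and a latency bound
# the same hardware sustains a lower query rate.
if __name__ == "__main__":
    print("offline throughput:", 32 / 0.010)      # 32-query batches, 10 ms per batch
    print("server throughput :", server_scenario_qps(
        target_qps=2500, batch_size=32, batch_latency_s=0.010, latency_bound_s=0.050))
```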
Results Must Reflect Real-World ML Solutions Deployed on Many Platforms, Ranging From General-Purpose CPUs to Programmable GPUs and DSPs, FPGAs, and Fixed-Function Accelerators
The MLPerf Inference results reflect this diversity. Overall, the MLPerf Inference submissions cover most hardware categories, as summarized in https://mlperf.org/inference-results/. The system diversity indicates that the inference benchmark suite and the described methodologies are broadly applicable to evaluating any processor architecture.

FUTURE DIRECTIONS

Demand for Expanding the Benchmark Scope
There is a strong demand to expand the breadth and scope of ML use cases that reflect industry diversity and relevance. The industry has grown MLPerf Inference by adding new benchmarks, creating new benchmark suites, and establishing new metrics and methodologies driven by ML use cases in less than a year.

New Networks
MLPerf has expanded the benchmark suite with four new inference benchmarks (DLRM, 3D U-Net, BERT, and RNN-T). New MLPerf benchmarks go through advisory boards to steer future benchmark construction and development. The collaboration between the advisory boards and the MLPerf submitters produces the benchmark reference code and specification. DLRM for recommendation is the first benchmark produced using this process.16

New Benchmark Suites
Inference is deployed on a wide variety of systems, ranging from data centers to edge servers to smartphones to tiny devices. Each of these domains has its unique use-case-driven models and requirements. To this end, driven by the demand from the industry, MLPerf Inference has launched three new suites in 2020: 1) one for data center systems, 2) one for edge systems, and 3) one for mobile devices such as smartphones (via an Android App) and laptops.14 Another benchmark, tinyMLPerf for ultralow-power ML devices,4 is slated for release in 2021 with power measurements.

End-to-end AI Performance
There is growing awareness that ML hardware is developed and evaluated in a vacuum without considering the full application environment. AI applications rely heavily on preprocessing and postprocessing code, storage, and communication systems, which must be considered while assessing end-to-end AI system performance. To this end, MLPerf Inference is maturing into measuring AI system performance by developing an end-to-end (E2E) AI benchmarking methodology to capture the increasing levels of abstraction in understanding AI applications.

Stringent Metrics
MLPerf Inference has evolved quickly to not only establish best practices for assessing ML performance but also point the industry in the right direction by establishing proper metrics. For instance, with medical image segmentation (3D U-Net), MLPerf Inference increased the accuracy threshold from 99% of FP32 accuracy to 99.9% of FP32 (0.85300 mean DICE score). This rationale was driven by community feedback and input that a 1% accuracy drop could mean a significant number of misdiagnoses in brain tumors. In real life, the standards are even more stringent, so the community is using MLPerf Inference to set "gold standards" for what it means to achieve performance while avoiding arbitrary numerical precision tradeoffs.

Accelerating AI Innovation
MLPerf Inference currently captures two of the three fundamental pillars fueling ML—algorithms and systems—with the open and closed divisions. The third pillar is the data. Data sets are the rocket fuel of
machine learning. The ultimate demand for new and improved ML systems arises out of new ML algorithms that emerge to tackle complexity in data sets, both in scale and diversity.
To enable us to build better ML models that drive systems activities, the MLPerf community has expanded into creating large, high-quality, fair data sets to drive ML innovations. Specifically, we are spawning off new data-set initiatives, such as People's Speech and 1000 Words in 1000 Languages, to fuel and accelerate language-based advancement in ML algorithms and system development. Taking a step further, the community is also exploring best practices to make it easier to share and publish ML models using our MLCube approach.

Streamlining and Fostering Research
We are beginning to see MLPerf Inference's impact on the experimental methodologies used to assess AI systems' performance and efficiency. For instance, the LoadGen we developed is being adopted as a testing harness to evaluate ML system performance across the scenarios. We hope to see the influence of MLPerf Inference on academic and industrial research by enabling future works to adopt the benchmarks, metrics, quality, and accuracy target thresholds as standard "baselines" for systematically evaluating their research.

CONCLUSION
The explosion in ML hardware and software requires a standard benchmark to focus on machine learning systems with clear metrics. MLPerf enables fair apples-to-apples comparisons—it is a neutral system performance benchmark suite that levels the playing field for ML systems, from programming frameworks and compiler construction to system architectures and hardware implementations.

ACKNOWLEDGMENTS
In addition to the leads, the MLPerf Inference Benchmark13 was carried by the following primary co-authors: Brian Anderson (Google), Maximilien Breughe (NVIDIA), Ramesh Chukka (Intel), Cody Coleman (Stanford), Itay Hubara (Habana), Thomas B. Jablin (Google), Pankaj Kanwar (Google), Anton Lokhmotov (dviditi), Francisco Massa (Facebook), Gennady Pekhimenko (University of Toronto & Vector Institute), Dilip Sequeira (NVIDIA), Ashish Sirasao (Xilinx), Tom St. John (Cruise), and George Yuan (NVIDIA). MLPerf Inference is also the work of several collaborators and submitters who helped produce the first large set of benchmark results. To see the complete list, please refer to the original MLPerf Inference Benchmark13 and MLPerf Training Benchmark10 papers. We are also thankful for the MLCommons advisory boards' contribution and to the colleagues who provided the data sets and model reference implementations: https://mlcommons.org/en/credits/.

REFERENCES
1. "Introducing the next generation of on-device vision models: MobileNetV3 and MobileNetEdgeTPU," 2019. [Online]. Available: https://ai.googleblog.com/2019/11/introducing-next-generation-on-device.html
2. "MLCommons: Machine learning innovation to benefit everyone," 2020. [Online]. Available: https://mlcommons.org/en/
3. "A new version of the PTDaemon tool is now available," 2020. [Online]. Available: https://www.spec.org/power/docs/SPECpower-PTD-Update_Process.html
4. C. R. Banbury et al., "Benchmarking TinyML systems: Challenges and direction," 2020, arXiv:2003.04821.
5. L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," 2017, arXiv:1706.05587.
6. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2019, arXiv:1810.04805.
7. U. Gupta et al., "The architectural implications of Facebook's DNN-based personalized recommendation," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2020, pp. 488–501, doi: 10.1109/HPCA47549.2020.00047.
8. F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein, "nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation," Nature Methods, vol. 18, no. 2, pp. 203–211, Dec. 2020.
9. M. Johnson et al., "Google's multilingual neural machine translation system: Enabling zero-shot translation," 2017, arXiv:1611.04558.
10. P. Mattson et al., "MLPerf training benchmark," in Proc. 3rd Conf. Mach. Learn. Syst., 2020, arXiv:1910.01500.
11. P. Mattson et al., "MLPerf: An industry standard benchmark suite for machine learning performance," IEEE Micro, vol. 40, no. 2, pp. 8–16, Mar.–Apr. 2020, doi: 10.1109/MM.2020.2974843.
12. M. Naumov et al., "Deep learning recommendation model for personalization and recommendation systems," 2019, arXiv:1906.00091.
13. V. J. Reddi et al., "MLPerf inference benchmark," in Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit., 2020, pp. 446–459, doi: 10.1109/ISCA45697.2020.00045.
14. V. J. Reddi et al., "MLPerf mobile inference benchmark: Why mobile AI benchmarking is hard and what to do about it," 2020, arXiv:2012.02328.

15. Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou, "MobileBERT: A compact task-agnostic BERT for resource-limited devices," 2020, arXiv:2004.02984.
16. C.-J. Wu et al., "Developing a recommendation benchmark for MLPerf training and inference," 2020, arXiv:2003.07336.

VIJAY JANAPA REDDI is currently an Associate Professor at Harvard University, Cambridge, MA, USA. Before joining Harvard, he was an Associate Professor with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA. His research interests include computer architecture and runtime systems, specifically in the context of autonomous machines and mobile and edge computing systems. He received multiple honors and awards, including the National Academy of Engineering (NAE) Gilbreth Lecturer Honor, and has been inducted into the MICRO and HPCA Halls of Fame. Reddi received a B.S. degree from Santa Clara University, an M.S. degree from the University of Colorado at Boulder, and a Ph.D. degree in computer science from Harvard University. Contact him at vj@ece.utexas.edu.

CHRISTINE CHENG is one of the engineering leads for deep learning benchmarking and optimization at Intel, Santa Clara, CA, USA. She has been involved with shaping MLPerf benchmarks from the beginning and has also led teams at Intel to submit results to every MLPerf training and inference round. Before joining MLPerf/MLCommons, she was a data scientist in sports analytics. Cheng received a B.S. degree from Caltech and an M.S. degree from Stanford University. Contact her at christine.cheng@intel.com.

DAVID KANTER is a founder and the executive director of MLCommons, Mountain View, CA, USA, where he helps lead the MLPerf benchmarks and other initiatives. He has more than 16 years of experience in semiconductors, computing, and machine learning. He founded a microprocessor and compiler startup, was an early employee at Aster Data Systems, and has consulted for industry leaders such as Intel, Nvidia, KLA, Applied Materials, Qualcomm, Microsoft, and many others. Kanter received a B.S. degree with honors in mathematics with a specialization in computer science and a B.A. degree with honors in economics from the University of Chicago. Contact him at dkanter@gmail.com.

PETER MATTSON leads ML Metrics at Google, Mountain View, CA, USA. He co-founded and is the president of MLCommons, and co-founded and was the general chair of the MLPerf Consortium that preceded it. Previously, he founded the Programming Systems and Applications Group at NVIDIA Research, was the VP of software infrastructure for Stream Processors Inc. (SPI), and was a Managing Engineer at Reservoir Labs. His research focuses on understanding machine learning models and data through quantitative metrics and analysis. Mattson received a B.S. degree from the University of Washington and M.S. and Ph.D. degrees from Stanford University. Contact him at petermattson@google.com.

GUENTHER SCHMUELLING is currently an Inference Chair at MLPerf. He is a Principal Tech Lead with Microsoft Azure AI Infrastructure, Redmond, WA, USA. Contact him at guschmue@microsoft.com.

CAROLE-JEAN WU is currently a Research Scientist at Facebook AI Research, Menlo Park, CA, USA, and an Associate Professor at Arizona State University, Tempe, AZ, USA. Her research lies in the domain of computer systems. Her work has pivoted into designing systems for machine learning execution at scale and tackling system challenges to enable efficient, responsible AI execution. She chairs the MLPerf Recommendation Benchmark Advisory Board and co-chaired MLPerf Inference. She was the recipient of the NSF CAREER Award, the Facebook AI Infrastructure Mentorship Award, the IEEE Young Engineer of the Year Award, the Science Foundation Arizona Bisgrove Early Career Scholarship, and the Intel PhD Fellowship, among a number of best paper awards. Wu received a B.Sc. degree from Cornell University, and M.A. and Ph.D. degrees from Princeton University. Contact her at carolejeanwu@fb.com.



THEME ARTICLE: TOP PICKS

Superconductor Computing for Neural Networks

Koki Ishida, Department of Advanced Information Technology, Kyushu University, Fukuoka, 464-8601, Japan
Ilkwon Byun, Department of Electrical and Computer Engineering, Seoul National University, Gwanak-gu, 08826, South Korea
Ikki Nagaoka, Department of Electronics, Nagoya University, Nagoya, 464-8601, Japan
Kosuke Fukumitsu, Department of Advanced Information Technology, Kyushu University, Fukuoka, 464-8601, Japan
Masamitsu Tanaka, Department of Electronics, Nagoya University, Nagoya, 464-8601, Japan
Satoshi Kawakami, Teruo Tanimoto, and Takatsugu Ono, Department of Advanced Information Technology, Kyushu University, Fukuoka, 464-8601, Japan
Jangwoo Kim, Department of Electrical and Computer Engineering, Seoul National University, Gwanak-gu, 08826, South Korea
Koji Inoue, Department of Advanced Information Technology, Kyushu University, Fukuoka, 464-8601, Japan

The superconductor single-flux-quantum (SFQ) logic family has been recognized as a promising solution for the post-Moore era, thanks to the ultrafast and low-power switching characteristics of superconductor devices. Researchers have made tremendous efforts in various aspects, especially in device and circuit design. However, there has been little progress in designing a convincing SFQ-based architectural unit due to a lack of understanding about its potentials and limitations at the architectural level. This article provides the design principles for SFQ-based architectural units with an extremely high-performance neural processing unit (NPU). To achieve our goal, we developed and validated a simulation framework to identify critical architectural bottlenecks in designing a performance-effective SFQ-based NPU. We propose SuperNPU, which outperforms a conventional state-of-the-art NPU by 23 times in terms of computing performance and 1.23 times in power efficiency even with the cooling cost of the 4K environment.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Digital Object Identifier 10.1109/MM.2021.3070488
Date of publication 5 April 2021; date of current version 25 May 2021.

We are about to enter an era where both Moore's law and Dennard scaling do not hold anymore. We are running out of effective options to improve the performance of CMOS-based computer systems while maintaining their power and temperature budgets.1 Therefore, we believe that now is the right time to exploit emerging device technologies with significant potential and make a serious effort to improve their feasibility.
Among the several candidates, the superconductor single-flux-quantum (SFQ) logic family2 has emerged as a highly promising solution for the post-Moore era. SFQ technology, which utilizes superconductor devices operating at 4 K, has significant potential for both high performance and energy efficiency. Specifically, SFQ logic gates use low-voltage impulse-shaped signals for their operations and achieve ultrafast (~10^-12 s) and low-energy (~10^-19 J)
switching. By focusing on these strong points, many researchers have contributed to SFQ-related research in various aspects, especially in device and circuit design.
However, little research has been conducted on SFQ-based architectures3 due to a lack of understanding about its architecture-level potentials and limitations. As we show later, SFQ logic, with its unique pulse-driven nature, requires completely different architecture designs from conventional CMOS technology. In consideration of SFQ-specific architectural tradeoffs, the following questions must be clearly addressed. 1) Which architecture is most promising for this technology? 2) How can we implement various microarchitectural units in an SFQ-friendly manner? 3) How can we fully exploit its potential at the architecture level while considering its limitations? 4) How do we simulate and validate SFQ architectural units?
Our article, presented at MICRO'20, provides straightforward answers to the above questions in the form of SuperNPU, our design for an SFQ-based neural processing unit (NPU). The main contributions of this work are as follows. 1) Architecting an SFQ-based NPU: to the best of our knowledge, this is the first work to design an NPU that addresses the architectural tradeoffs of SFQ technology. 2) Simulation framework: it is also the first work to model and validate a simulator for SFQ-based architectures. 3) SFQ-specific architectural optimizations: we identify critical architectural bottlenecks and optimizations, which can cause a performance variance. 4) Significant results: SuperNPU provides extremely high performance and power efficiency; it outperforms a conventional design by 23 times in performance and 1.23 times in power efficiency, even with the cooling cost of the 4K environment. Our thorough architectural analysis with a validated modeling tool clearly shows the impact of our SFQ-specific optimization process and the potential of SFQ computing as a post-Moore-era solution.

FIGURE 1. Circuit elements and working principle of SFQ logic technology. (a) Superconductor ring with SFQ. (b) Electrical characteristics of JJ. (c) Circuit diagram of an SFQ-based DFF. (d) Determination of SFQ circuits' frequency with an operation example of a two-input AND gate.

BACKGROUND ON SFQ LOGIC TECHNOLOGY
SFQ circuits utilize a small-voltage pulse as an information carrier, which can be stored as a single magnetic flux quantum (SFQ) in a superconductor ring [Figure 1(a)]. To store and transfer the SFQ, the superconductor ring contains superconducting devices called Josephson junctions (JJs). Each JJ consists of a thin insulator sandwiched by the superconductors, and it has unique electrical characteristics with which to generate a voltage pulse [Figure 1(b)].
Figure 1(c) shows the working principle of SFQ logic gates with an SFQ-based delay flip flop (DFF), which consists of a single superconductor ring and a clock line. First, when the input pulse enters the ring, it is stored in the ring as an SFQ by switching JJ1 (1). Next, by receiving a clock pulse (2), JJ2 is activated, and the stored SFQ is transferred to the output as a voltage pulse (3). In this manner, SFQ gates can represent the logical value "1" (or "0") by the existence (or absence) of the stored SFQ between the two adjacent clock pulses.
The SFQ logic gates are implemented using storage rings and pulse interactions in a similar way. Figure 1(d) shows an operation example of a two-input AND gate. The SFQ circuits' frequency is determined by

f = 1/CCT = 1/(SetupTime + max(dt1, dt2))

where SetupTime is the timing constraint, and dt1 and dt2 are the timing gaps between the input pulses and the first clock pulse arrival. dt1, dt2 >= HoldTime must be satisfied, where HoldTime is the other timing constraint. dt1 and dt2 can be shortened by delaying the clock pulse arrival. Therefore, the timing gap between the two inputs' arrivals (dt3) is important. If there is a large difference between the arrivals, the clock frequency decreases because both inputs must arrive at the destination gate in the same clock cycle period. Because the timing constraints are fixed for each SFQ gate, minimizing these two kinds of timing gaps is essential for achieving a high clock frequency in an SFQ design.
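As a worked illustration of the frequency relation above, the following sketch plugs assumed picosecond timings into f = 1/(SetupTime + max(dt1, dt2)); the numbers are invented for illustration and show how a large gap between the two input arrivals lowers the achievable clock.

```python
def sfq_gate_frequency(setup_time_ps, dt1_ps, dt2_ps, hold_time_ps):
    """Clock cycle time of an SFQ gate pair: CCT = SetupTime + max(dt1, dt2),
    where dt1/dt2 are the gaps between each input pulse and the next clock pulse.
    Both gaps must respect the hold-time constraint."""
    assert dt1_ps >= hold_time_ps and dt2_ps >= hold_time_ps, "hold-time violation"
    cct_ps = setup_time_ps + max(dt1_ps, dt2_ps)
    return 1.0 / (cct_ps * 1e-12)                 # convert ps to seconds -> Hz

# Illustrative numbers only: a well-balanced gate (inputs arrive close together)
# versus a gate whose two inputs arrive far apart, forcing a longer cycle time.
balanced = sfq_gate_frequency(setup_time_ps=5, dt1_ps=3, dt2_ps=4, hold_time_ps=2)
skewed   = sfq_gate_frequency(setup_time_ps=5, dt1_ps=3, dt2_ps=15, hold_time_ps=2)
print(f"balanced inputs: {balanced / 1e9:.1f} GHz")   # ~111 GHz
print(f"skewed inputs:   {skewed / 1e9:.1f} GHz")     # ~50 GHz
```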

FIGURE 2. Our baseline SFQ-based NPU with each microarchitecture unit’s design alternatives. (a) Overview of our baseline
architecture. (b) Network designs. (c) PE designs.

SFQ-FAVORABLE ARCHITECTURAL CHARACTERISTICS
The aim of this article is to contribute to the architecture community by introducing SFQ technology from the architects' perspective. To achieve this goal, we describe below SFQ-favorable architectural characteristics by fully considering the SFQ technology's unique features, which originate from its pulse-driven nature.
First, SFQ technology favors simple control flows due to its deeply pipelined nature. Architects can naturally apply gate-level pipelining to achieve a high frequency,4,5 because all SFQ logic gates consist of superconductor rings, i.e., all SFQ gates have a latch functionality and, thus, can be pipelined without additional DFFs. However, the deep pipeline structure makes it difficult to avoid data (or control) hazards and, thus, may suffer from huge pipeline stalls. Therefore, applications with streaming execution are more favorable than those with a complex control flow.
Second, SFQ technology favors sequential memory access patterns due to its on-chip memory implementation. There are two options for an SFQ-based on-chip memory design: random access memory (RAM) and shift-register-based memory. However, SFQ-based RAM suffers from poor driving capability and scalability, mainly because of the difficulty of driving the word lines and bit lines with the small pulses. On the other hand, a shift-register-based memory does not have such problems, so it is a much more practical option for on-chip memory. This means that applications with sequential memory accesses are much more suitable for SFQ technology.
Third, SFQ technology favors fewer off-chip memory accesses due to a lack of scalable off-chip memory technology. It has been a long-standing challenge to implement a large-scale and high-speed off-chip memory able to operate in a 4K environment. Although there have been a few studies on JJ-based memories and they are currently being developed, these technologies have not been put to practical use yet. For this reason, it is currently more practical to use CMOS memory technology, which is slower than the JJ-based memory but is large and reliable. Thus, computation-oriented applications with minimal off-chip memory accesses are suited to the current SFQ technology.

BASELINE SFQ-BASED NPU DESIGN
After considering the characteristics presented in the "SFQ-Favorable Architectural Characteristics" section, we chose an NPU as an example of an SFQ-favorable architecture and designed the baseline SFQ-based NPU shown in Figure 2(a). Specifically, we implemented the on-chip buffer as a shift-register-based memory (1), the network unit (NW unit) as a 2D systolic network (2), and the processing element (PE) with weight-stationary dataflow (3) in an SFQ-friendly manner.

Network Unit for Systolic Array
To design an SFQ-friendly on-chip network, we compared two representative NW unit designs: a fan-out network (e.g., splitter tree) and a store-and-forward chain (e.g., systolic array), as shown in Figure 2(b). We selected the systolic array because it is superior to the splitter tree in both clock frequency and area.
FIGURE 3. Simulation framework and validation. (a) SFQ-NPU overview. (b) Validation results in terms of frequency, power consumption, and area of microarchitectural and architectural units. (c) Validation setup. (d) Chip microphotograph of 2 x 2 PE-arrayed NPU prototype design.

The splitter tree significantly suffers from low frequency due to the increasing difference between the arrival timings of the two PE inputs. As shown in Figure 2(b)(1), the timing gap between input 1 and input 2 increases in proportion to the PE array width due to the large difference in clock arrival at two splitter trees.
On the other hand, as shown in Figure 2(b)(2), the systolic network has a smaller timing difference between input 3 and input 4, which does not scale with the PE array width. Thus, the systolic array can achieve a higher clock frequency than the splitter tree can. In addition, a systolic network has a simpler structure than the splitter tree and, thus, occupies less area. For these reasons, we decided that the systolic array is more SFQ-friendly and chose it to be part of our on-chip network design.

Processing Element With Weight-Stationary Dataflow
For an SFQ-friendly PE design, we compared the designs with two major dataflows in a systolic network, i.e., weight stationary (WS), where the PE stores weight pixels in its local register, and output stationary (OS), where the PE stores ofmap pixels in its local register, as shown in Figure 2(c). We chose the PE with WS dataflow to maximize the frequency, because it does not include any feedback loop. Unlike in CMOS technology, in which the clocking scheme synchronizes all the gates, the SFQ logic employs point-to-point (or gate-to-gate) synchronization. SFQ circuits can achieve a higher clock frequency without a feedback loop, because the data propagation delay can be hidden by making the clock pulse flow in the same direction as the data, resulting in a small timing gap between the arrivals of the clock and data. However, a circuit with a feedback loop cannot apply such a frequency speed-up because the clock and data pulses cannot flow in the same direction. Thus, we concluded that a PE design without a feedback loop, the PE with WS, is a more SFQ-friendly choice.
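The contrast between the two dataflows can be sketched behaviorally as below (this is our own toy model, not the article's PE circuit): the output-stationary update feeds its result back into the same local register, while the weight-stationary update only passes a partial sum forward to its neighbor.

```python
def output_stationary_pe(weights, ifmap_stream):
    """OS PE: keeps an ofmap accumulator locally, so each MAC feeds its result
    back into the same register (a feedback loop in hardware)."""
    acc = 0
    for w, x in zip(weights, ifmap_stream):
        acc += w * x            # result loops back into the local accumulator
    return acc

def weight_stationary_pe(weight, ifmap_pixel, psum_in):
    """WS PE: the weight stays resident; the partial sum arrives from the
    upstream PE and the updated partial sum is passed downstream, so nothing
    feeds back into the same stage."""
    return psum_in + weight * ifmap_pixel

# A column of WS PEs forms a feed-forward chain: the partial sum flows in one
# direction together with the clock pulses, which is the SFQ-friendly property.
weights = [2, 3, 5]
ifmap = [1, 4, 7]
psum = 0
for w, x in zip(weights, ifmap):
    psum = weight_stationary_pe(w, x, psum)
print(psum, output_stationary_pe(weights, ifmap))   # both print 49
```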
SIMULATION FRAMEWORK
To enable exploration and optimization of the SFQ-based architectures, we have developed the architectural simulation framework for SFQ technology. Figure 3(a) shows an overview of our tool targeting SFQ-based NPU architecture, which consists of two engines: an SFQ-NPU estimator and an SFQ-NPU simulator. In what follows, we describe these engines, their implementation, and validation.

SFQ-NPU Estimator
The SFQ-NPU estimator predicts the clock frequency, static power, access energy, and area of the target NPU configuration. To carefully consider the unique features of SFQ logic ranging from the device to architecture, the estimator takes a three-layer abstraction strategy: gate-level, microarchitecture-level, and architecture-level estimations.
First, in regard to the device-level parameters, the gate-level estimation layer derives the timing parameters, power information, and area for all of
the SFQ logic gates and wire cells. The gate models are compatible with two SFQ technologies: rapid single-flux-quantum (RSFQ)2 and energy-efficient RSFQ (ERSFQ).6 For the RSFQ gates, all gate parameters are extracted by running JSIM7 simulations with the RSFQ cell library.8 On the other hand, the ERSFQ gate parameters are estimated from the RSFQ gate parameters and ERSFQ gate features9 because of the lack of detailed fabrication information.
Next, the microarchitecture-level layer estimates the frequency, static power, access energy, and area of each microarchitectural unit by utilizing the gate-level layer's output and the input configuration parameters. For an accurate frequency estimation, this layer generates the intra-unit gate pair information consisting of all source and destination gate pairs in each unit, because the two adjacent gates determine the SFQ circuit frequency. With the intra-unit gate pair information, this layer calculates the frequencies of all gate pairs in the target unit and takes the minimum value as the unit frequency. This layer also calculates the static power, access energy, and area of each unit on the basis of the gate count information and the gate-level layer's output.
Lastly, the architecture-level layer reports the estimation results regarding the area, static power, access energy, and clock frequency of the target NPU configuration. For an accurate prediction, this layer not only integrates the microarchitecture-level estimations based on the unit counts but also considers the inter-unit connections. For instance, it calculates all the inter-unit communication latencies on the basis of the interfacing gates' timing parameters and accounts for them when deriving the frequency of the target NPU. The layer also calculates the area of the wire cells required to connect each unit and includes it in the final area result.
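A minimal sketch of the frequency-aggregation steps just described, using invented unit names and gate-pair timings: each unit's clock is limited by its slowest source-destination gate pair, and the final NPU clock also accounts for the inter-unit interface pairs. The structure follows the text; the numbers are illustrative only.

```python
def pair_cycle_time(setup_ps, dt1_ps, dt2_ps):
    # Same relation as the gate-level equation: CCT = SetupTime + max(dt1, dt2).
    return setup_ps + max(dt1_ps, dt2_ps)

def unit_frequency_ghz(gate_pairs):
    """Microarchitecture-level step: the unit runs at the frequency of its
    slowest (largest cycle time) intra-unit gate pair."""
    worst_cct_ps = max(pair_cycle_time(*p) for p in gate_pairs)
    return 1e3 / worst_cct_ps                 # 1/ps -> THz, so *1e3 gives GHz

# Hypothetical units, each described by (setup, dt1, dt2) tuples in picoseconds.
units = {
    "pe_array": [(5, 3, 4), (5, 2, 6)],
    "nw_unit":  [(5, 4, 4), (5, 3, 7)],
    "sr_mem":   [(5, 6, 6)],
}
unit_freqs = {name: unit_frequency_ghz(pairs) for name, pairs in units.items()}

# Architecture-level step: the inter-unit interface gate pairs are also
# considered when deriving the final NPU clock.
interface_pairs = [(5, 5, 9)]
npu_freq = min(min(unit_freqs.values()), unit_frequency_ghz(interface_pairs))
print(unit_freqs, f"NPU clock: {npu_freq:.1f} GHz")
```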
We thoroughly validated our SFQ-NPU estimator's accuracy in terms of frequency, power, and area by comparing it with a fabricated 4-bit MAC unit measured in a 4K environment [Figure 3(c)] and with post-layout characterizations of an 8-bit 8-entry shift-register-based memory (SRmem), an 8-bit NW unit, and a 2-bit, 2 x 2 PE-arrayed NPU. We have also fabricated a prototype chip of the NPU [Figure 3(d)] and plan to measure it in detail. As Figure 3(b) shows, the SFQ-NPU estimator accurately predicts the frequency, power, and area of each microarchitecture unit and target NPU. Even though our validation was conducted with a small NPU prototype, the estimation accuracy is rather convincing, thanks to the systolic network's scalable structure.

SFQ-NPU Simulator
For the given SFQ-based NPU design running DNN applications, the SFQ-NPU simulator reports the effective performance and power consumption based on the frequency and power information obtained from the SFQ-NPU estimator. The SFQ-NPU simulator first analyzes all of the required weight mappings by taking a DNN description file as an input. Next, it runs a cycle-based simulation for each weight mapping to derive the consumed cycles and hardware activation ratio. Finally, it aggregates the mapping results and reports the performance and power results.

OPTIMIZING SFQ-BASED NPU DESIGN
By using our validated simulation framework, we identified and resolved architectural performance bottlenecks in our baseline SFQ-based NPU design. Then, we devised our SFQ-optimal NPU architecture, SuperNPU, which resolves the bottlenecks with architecture-level solutions.
To make observations and SFQ-specific optimizations, we conducted performance analyses by running six CNN workloads (i.e., AlexNet, FasterRCNN, GoogLeNet, MobileNet, ResNet50, and VGG16). As input information on the fabrication process, we used the currently available AIST 1.0-um process in order to show the SFQ technology's potential conservatively. In addition, we assumed a memory bandwidth of 300 GB/s, which is the typical value of HBM used in the recent TPUv2.10 Note that the estimated area of the baseline SFQ-based NPU design might be comparable to the TPU core (< 330 mm2) if JJs are equivalently scaled to 28 nm, because CMOS technology is used in the TPU design.

Architectural Bottlenecks and Design Implications
We identified performance bottlenecks and design implications by conducting analyses with the baseline SFQ-based NPU design in the "Baseline SFQ-Based NPU Design" section (called the Baseline from here on). To show the design implications, we started from the Baseline by following the TPU core's architectural specifications (e.g., number of PEs, on-chip memory capacity), focusing on their similar hardware structures.11
Our performance analyses identified two architecture-level performance bottlenecks. First, we found that the data movement overhead among different on-chip buffers (or within a single buffer) can significantly degrade performance.

FIGURE 4. Architectural optimization summary. (a) Overview of SuperNPU with three architectural optimizations. (b) Performance evaluation result. (c) Power efficiency evaluation result normalized to the efficiency of the TPU.

As the Baseline uses shift-register-based on-chip buffers, it should consume a huge number of cycles, corresponding to the buffers' length, for moving the data from each buffer's tail to head. Second, we found that the PEs are significantly underutilized because the SFQ computing units are too fast compared with the slow off-chip memory access. If these bottlenecks are not resolved, even our SFQ-friendly baseline design with a 52.6-GHz clock frequency cannot outperform a conventional CMOS design [Figure 4(b)]. Therefore, we should minimize the wasteful data movements while maximizing the PE utilization at the architectural level.
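A back-of-the-envelope sketch of this first bottleneck and of the buffer optimization applied below (the buffer depth, access count, and the simplified cost model in which reaching an entry requires shifting it to the buffer head are all assumptions of ours):

```python
def shift_cycles_monolithic(num_entries, accesses):
    """Reading an arbitrary entry from a single long shift register requires
    shifting it all the way to the head, so each access can cost up to
    num_entries cycles (worst case)."""
    return accesses * num_entries

def shift_cycles_chunked(num_entries, accesses, num_chunks):
    """With the buffer split into num_chunks shift registers selected by a
    multiplexer/demultiplexer, an access only traverses one chunk."""
    return accesses * (num_entries // num_chunks)

ENTRIES = 4096     # assumed buffer depth
ACCESSES = 10000   # assumed number of accesses
print("monolithic buffer:", shift_cycles_monolithic(ENTRIES, ACCESSES), "cycles")
print("256 buffer chunks:", shift_cycles_chunked(ENTRIES, ACCESSES, 256), "cycles")
```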
SuperNPU: SFQ-Optimal NPU Architecture
In light of the above implications, we devised our SFQ-optimal NPU architecture, SuperNPU. Figure 4(a) shows an overview of SuperNPU, driven by three SFQ-specific architectural optimizations: buffer optimization (1), resource balancing (2), and increasing the number of registers in the PE (3). We briefly introduce our optimization techniques and their performance impact here; interested readers should refer to our original paper for more details.12
First, we almost completely eliminated the data movement overhead among the on-chip buffers by optimizing the buffer architecture. Specifically, we divided each on-chip buffer into small buffer chunks (256 and 64 divisions for the ofmap and ifmap buffers) and connected them with a multiplexer and demultiplexer. The optimized buffer architecture eliminates unnecessary data movements and improves buffer utilization. As a result, the Baseline's performance was significantly improved (by 19 times), as shown in Figure 4(b)(1).
Next, we efficiently narrowed the gap between the computation and memory speeds by increasing the on-chip buffer capacity while reducing the number of PEs [Figure 4(a)(2)]. Our key idea for this design choice is to reduce off-chip memory accesses by sacrificing excessive computation speed without a performance loss. By using resource balancing, we can increase each workload's computational intensity (i.e., the number of MAC operations executed with one weight data mapped on the PE) by increasing the batch size without additional off-chip memory accesses. As a result, we improved performance a further 2.1 times [Figure 4(b)(2)].
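The resource-balancing argument can be made concrete with a rough model (all parameters below except the 300-GB/s bandwidth are assumptions of ours, not evaluated configurations): raising the batch size raises the MACs performed per weight fetched, so even a smaller PE array stays busy behind the same memory bandwidth.

```python
def computational_intensity(batch, ofmap_pixels):
    """MAC operations executed per weight element mapped onto a PE: every
    weight is reused for each output pixel of every image in the batch."""
    return batch * ofmap_pixels

def is_compute_bound(batch, ofmap_pixels, peak_macs_per_s, mem_bw_bytes_per_s,
                     bytes_per_weight=1):
    """A weight-stationary accelerator stops waiting on off-chip memory once
    MACs-per-weight exceeds the machine's MACs-per-byte-of-bandwidth ratio."""
    macs_per_byte = peak_macs_per_s / mem_bw_bytes_per_s
    return computational_intensity(batch, ofmap_pixels) / bytes_per_weight >= macs_per_byte

MEM_BW = 300e9    # 300 GB/s off-chip bandwidth, as assumed in the text
for peak, batch in [(100e12, 1), (100e12, 64), (25e12, 64)]:
    verdict = "compute-bound" if is_compute_bound(batch, 49, peak, MEM_BW) else "memory-bound"
    print(f"peak={peak/1e12:.0f} TMAC/s, batch={batch:3d} -> {verdict}")
```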
Finally, to increase the PE utilization, we increased the number of weight registers in each PE. With the larger number of local weight registers, SuperNPU achieves higher PE utilization by filling several PE pipeline stages with a single input data. For example, if each PE holds two different weights from different weight filters, as in Figure 4(a)(3), the PE can compute two different MAC operations with one ifmap pixel. As a result, we get an additional performance improvement of 1.3 times [Figure 4(b)(3)].
In summary, our SuperNPU outperforms the Baseline by 52 times at a 52.6-GHz clock frequency while clearly showing the right directions for architecture-level optimizations. Besides, in our performance evaluation,
SuperNPU outperformed TPU by 23 times, even with its immature device technology (i.e., 1-um niobium process). Meanwhile, our power-efficiency evaluation [Figure 4(c)] showed that SuperNPU with ERSFQ technology achieves 490 times higher power efficiency without considering cooling power requirements. Even with the enormous cost of cooling to 4K (i.e., 400 times the power consumption of the device), SuperNPU attains 1.23 times higher power efficiency compared with TPU.
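The net efficiency figure follows directly from the two factors stated above; the one-line check below uses only those numbers.

```python
# Device-level power-efficiency gain of SuperNPU (ERSFQ) over the TPU baseline,
# before accounting for the cryocooler.
device_level_gain = 490

# Cooling a 4K system is taken to cost ~400x the power dissipated by the device.
cooling_overhead = 400

print(device_level_gain / cooling_overhead)   # ~1.23x net efficiency advantage
```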
IMPACTS AND PROSPECTIVE
The major impacts of our work are as follows.

› Our validated modeling framework for SFQ logic enables architects to conduct fast and accurate SFQ architecture design space explorations. It can be used in other SFQ-promising domains. The model will encourage researchers to study SFQ architectures.
› The SFQ-based hardware design was optimized at the microarchitecture and architecture levels. The optimization process for the NPU provided several crucial insights, e.g., on the timing adjustment of SFQ pulses at the circuit level and on increasing buffer and pipeline utilization. These insights will guide the designs of follow-up SFQ-based architectures.
› Our work shows the true potential of SFQ logic by evaluating an SFQ-based architecture in comparison with its state-of-the-art CMOS counterpart. With our model-driven analysis and optimization, SuperNPU outperforms TPU by 23 times on average while achieving 1.23 times higher power efficiency, even including the enormous cooling cost at 4K. These significant results will motivate industry and academia to work on SFQ technology to prepare for the post-Moore era.

Meanwhile, there are critical future challenges regarding scalability that have to be addressed before SFQ computing platforms come into practical use.

› Scale up: Although our work shows the potential for high performance even with immature device technology (i.e., 1.0-um niobium process), advances in device integration technology are essential for constructing sophisticated computer systems. In addition to device shrinkage as with conventional CMOS technology, a 3D-stacked SFQ design would be promising due to the ultra-low-power feature of superconductor logic devices. Besides, manually adjusting the timing of SFQ pulses on the picosecond order is a challenging problem. Thus, placement and routing automation is indispensable to design large-scale SFQ circuits.
› Scale out: The superconductor transmission line in the circuit enables low-latency lossless signal propagation since it does not need charge/discharge processes. It can be used for both on-chip and off-chip communications, which means that SFQ circuits will be suitable for large-scale multichip architectures. Therefore, SFQ computing platforms have a potential for continuous growth independently of device shrinkage.

ACKNOWLEDGMENTS
This work was supported in part by the JST-Mirai Program under Grant JPMJMI18E1; in part by JSPS KAKENHI under Grant JP19H01105, Grant JP18H05211, and Grant JP18J21274; and in part by the National Research Foundation of Korea (NRF) under Grant NRF-2019R1A5A1027055 and Grant NRF-2021R1A2C3014131. The circuit was designed with the support of VDEC of the University of Tokyo in collaboration with Cadence Design Systems, Inc., and was fabricated in the CRAVITY of AIST.

REFERENCES
1. H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proc. 38th Annu. Int. Symp. Comput. Archit., 2011, pp. 365–376, doi: 10.1109/77.80745.
2. K. K. Likharev and V. K. Semenov, "RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz-clock-frequency digital systems," IEEE Trans. Appl. Supercond., vol. 1, no. 1, pp. 3–28, Mar. 1991, doi: 10.1145/2000064.2000108.
3. G. Tzimpragos et al., "A computational temporal logic for superconducting accelerators," in Proc. 24th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2020, pp. 435–448, doi: 10.1145/3373376.3378517.
4. I. Nagaoka, M. Tanaka, K. Inoue, and A. Fujimaki, "A 48GHz 5.6mW gate-level-pipelined multiplier using single-flux quantum logic," in Proc. IEEE Int. Solid-State Circuits Conf., 2019, pp. 460–462, doi: 10.1109/ISSCC.2019.8662351.
5. K. Ishida et al., "32 GHz 6.5 mW gate-level-pipelined 4-bit processor using superconductor single-flux-quantum logic," in Proc. IEEE Symp. VLSI Circuits, 2020, pp. 1–2, doi: 10.1109/VLSICircuits18222.2020.9162826.
6. D. E. Kirichenko, S. Sarwana, and A. F. Kirichenko, "Zero static power dissipation biasing of RSFQ circuits," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 776–779, Jun. 2011, doi: 10.1109/TASC.2010.2098432.

7. E. Fang and T. V. Duzer, "A Josephson integrated circuit simulator (JSIM) for superconductive electronics application," in Proc. Extended Abstr. Int. Supercond. Electron. Conf., 1989, pp. 407–410. [Online]. Available: https://ci.nii.ac.jp/naid/10006481720/
8. Y. Yamanashi et al., "100 GHz demonstrations based on the single-flux-quantum cell library for the 10 kA/cm2 Nb multi-layer process," IEICE Trans. Electron., vol. 93, no. 4, pp. 440–444, 2010.
9. O. A. Mukhanov, "Energy-efficient single flux quantum technology," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 760–769, Jun. 2011, doi: 10.1109/TASC.2010.2096792.
10. "Hot Chips 2017: A closer look at Google's TPU v2." [Online]. Available: https://www.tomshardware.com/news/tpu-v2-google-machine-learning,35370.html
11. N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 1–12, doi: 10.1145/3079856.3080246.
12. K. Ishida et al., "SuperNPU: An extremely fast neural processing unit using superconducting logic devices," in Proc. 53rd Annu. IEEE/ACM Int. Symp. Microarchit., 2020, pp. 58–72, doi: 10.1109/MICRO50266.2020.00018.

KOKI ISHIDA is currently working at Kyushu University. His research interests include computer system architecture using superconductor single flux quantum logic. Ishida received a Ph.D. degree from Kyushu University in 2021. He is one of the co-first authors of this article. Contact him at koki.ishida@cpc.ait.kyushu-u.ac.jp.

ILKWON BYUN is currently working toward a Ph.D. degree with the Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea. His research focuses on architecting cryogenic CMOS and superconductor-based computer systems by using computer architecture modeling and simulation techniques. He is a Student Member of IEEE and ACM. He is one of the co-first authors of this article. Contact him at ik.byun@snu.ac.kr.

IKKI NAGAOKA is currently working toward a Ph.D. degree with the Department of Electronics, Nagoya University, Nagoya, Japan. His research interests include designing LSIs using the superconductor single-flux quantum logic. Contact him at nagaoka@super.nuee.nagoya-u.ac.jp.

KOSUKE FUKUMITSU is currently working toward an M.E. degree with the Department of Advanced Information and Technology, Kyushu University, Fukuoka, Japan. His research interests include designing superconductor single-flux-quantum circuits. Contact him at kosuke.fukumitsu@cpc.ait.kyushu-u.ac.jp.

MASAMITSU TANAKA is currently an Assistant Professor with the Department of Electronics, Graduate School of Engineering, Nagoya University, Nagoya, Japan. His research interests include ultra-fast/energy-efficient computing using the SFQ-based technology and logic design methodologies. He is a senior member of the IEEE. Contact him at masami_t@ieee.org.

SATOSHI KAWAKAMI is currently an Assistant Professor with the Department of Advanced Information and Technology, Kyushu University, Fukuoka, Japan. His research interests include computer architecture with emerging devices. He is a Member of IEEE and ACM. Contact him at satoshi.kawakami@cpc.ait.kyushu-u.ac.jp.

TERUO TANIMOTO is currently an Assistant Professor at the Research Institute for Information Technology, Kyushu University, Fukuoka, Japan. His research interests include edge computing systems, secure computer architecture, and quantum computer system architecture. He is a Member of IEEE and ACM. Contact him at tteruo@kyudai.jp.

TAKATSUGU ONO is currently an Associate Professor with the Department of Advanced Information and Technology, Kyushu University, Fukuoka, Japan. His research interests include the areas of computer architecture with particular emphasis on secure computing, high-performance computing, and memory systems (including nonvolatile memory). Ono received a Ph.D. degree from Kyushu University in 2009. He is a Member of IEEE. Contact him at takatsugu.ono@cpc.ait.kyushu-u.ac.jp.

JANGWOO KIM is currently a Professor with the Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea. His primary research interests lie in computer architecture, servers and datacenters, system modeling, and intelligent systems. Kim received a Ph.D. degree in electrical and computer engineering from Carnegie Mellon University in 2008. He is a Member of IEEE and ACM. Contact him at jangwoo@snu.ac.kr.

KOJI INOUE is a Professor with the Department of Advanced Information and Technology, Kyushu University, Fukuoka, Japan. He is currently the Director of the System LSI Research Center. His research interests include computer architecture and low-power system designs. Inoue received a Ph.D. degree from Kyushu University. He is a Member of IEEE and ACM. He is the corresponding author of this article. Contact him at inoue@ait.kyushu-u.ac.jp.



THEME ARTICLE: TOP PICKS

Leaking Secrets Through Compressed Caches

Po-An Tsai, NVIDIA, Santa Clara, CA, 95051, USA
Andres Sanchez, École Polytechnique Fédérale de Lausanne, 27218, Lausanne, Switzerland
Christopher W. Fletcher, University of Illinois at Urbana-Champaign, Champaign, IL, 14589, USA
Daniel Sanchez, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA

We offer the first security analysis of cache compression, a promising architectural technique that is likely to appear in future mainstream processors. We find that cache compression has novel security implications because the compressibility of a cache line reveals information about its contents. Compressed caches introduce a new side channel that is especially insidious, as simply storing data transmits information about the data. We present two techniques that make attacks on compressed caches practical. Pack+Probe allows an attacker to learn the compressibility of victim cache lines, and Safecracker leaks secret data efficiently by strategically changing the values of nearby data. Our evaluation on a proof-of-concept application shows that, on a representative compressed cache architecture, Safecracker lets an attacker compromise an 8-byte secret key in under 10 ms. Even worse, Safecracker can be combined with latent memory safety vulnerabilities to leak a large fraction of program memory.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3069158
Date of publication 26 March 2021; date of current version 25 May 2021.

Over the past three years, computer architecture has suffered a major security crisis. Researchers have uncovered critical security flaws in billions of deployed processors related to speculative execution, starting with Spectre1 and Meltdown,2 generating significant interest and progress in microarchitectural side and covert channel research.
While microarchitectural side channel attacks have been around for over a decade, speculative execution attacks are significantly more dangerous because of their ability to leak program data directly. In the worst case, these attacks let the attacker construct a universal read gadget,3 capable of leaking data at attacker-specified addresses. For example, the Spectre V1 attack—if (i < N) { B[A[i]]; }—exploits branch misprediction to leak the data at address &A + i given an attacker-controlled i.
Yet, speculative execution is only one optimization in modern microprocessors. It is critical to ask: Are there other microarchitectural optimizations that enable a similarly large amount of data leakage? In this work, we provide an answer in the affirmative by analyzing the security of memory hierarchy compression, specifically cache compression.
Compression is an attractive technique to improve memory performance and has received intense development from both academia and industry. Some early adopter processors already feature memory-hierarchy compression, including IBM's z15,4 Qualcomm's Centriq, and NVIDIA's A100. As data movement becomes increasingly critical, we expect to see general-purpose cache compression become widely used. Nonetheless, despite strong interest from both academia and industry, prior research in this area has focused on performance and ignored security.
We present the first security analysis of cache compression. The key insight that our analysis builds on is that the compressibility of data reveals information about the data itself. Similar to speculative execution attacks, we show how this allows an attacker to leak program data directly and, in the worst case, create a new universal read gadget that can leak large portions of program memory. In short, we show that cache compression—without speculative execution—can leak as much program privacy as speculative execution.

FIGURE 1. Overview of our compressed cache attacks. (a) Simple attack on a compressed cache, where the attacker exploits colocation with secret data to leak it. (b) Comparison between Spectre and our proposed attacks.

CACHE COMPRESSION INTRODUCES A NEW CHANNEL
Figure 1(a) shows a simple attack on compressed caches. The attacker seeks to steal the victim's encryption key and can submit encryption requests to the victim. On each request, the victim's encryption function stores the key and the attacker's plaintext consecutively, so they fall on the same cache line.
Colocating secret data and attacker-controlled data is safe with conventional caches, but it is unsafe with a compressed cache. Suppose we run this program on a system with a compressed cache that tries to shrink each cache line by removing duplicate bytes. If the attacker can observe the line's size, it can leak all individual bytes of the key by trying different chosen plaintexts, as the compressed line's size changes when a byte of the key matches a byte of the plaintext.
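To see why compressibility alone is enough to leak the key, consider a toy compressor that simply stores each distinct byte of a line once (our own illustration of the "removing duplicate bytes" idea above, not a real cache design): the line shrinks exactly when an attacker-chosen plaintext byte collides with a secret key byte.

```python
def dedup_compressed_size(line: bytes) -> int:
    """Toy intra-line compressor: stores each distinct byte value once.
    Real designs are more involved; this only captures the intuition that
    matching attacker data against secret data changes the line's size."""
    return len(set(line))

secret_key = bytes([0x4B, 0x9E, 0x21, 0xD7])

def victim_line(attacker_plaintext: bytes) -> bytes:
    # The victim colocates the key and the attacker-controlled plaintext
    # in the same cache line, as in Figure 1(a).
    return secret_key + attacker_plaintext

# The attacker sweeps one byte value; the line gets one byte smaller whenever
# the guess collides with a key byte, so observing the size reveals key bytes.
baseline = dedup_compressed_size(victim_line(bytes([0x00] * 4)))
for guess in range(1, 256):
    if dedup_compressed_size(victim_line(bytes([guess] * 4))) < baseline:
        print(f"guess 0x{guess:02X} matches a key byte")
```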
The general principle in the above example is that when the attacker is able to colocate its own data alongside secret data, it can learn the secret data. Beyond cases where the victim itself facilitates colocation (e.g., by pushing arguments onto the stack), we observe that latent security vulnerabilities related to memory safety, such as buffer overflows, heap spraying, and uninitialized memory, further enable the attacker to colocate its data with secret data. Combined with the new side channel in compressed caches, these can enable a read gadget that leaks a significant amount of data, as we will show later.

CACHE COMPRESSION ENABLES A NEW TYPE OF ACTIVE ATTACK
All compression techniques seek to store data efficiently by using a variable-length code, where the length of the encoded message approaches the information content, or entropy, of the data being encoded. It trivially follows that the compressibility of a data chunk, i.e., the compression ratio achieved, reveals information about the data.
Hence, compressed caches introduce a new, fundamentally different type of side channel.

FIGURE 2. Comparison of VSC and conventional caches and example showing how Pack+Probe measures compressibility on VSC. (a) Comparison of VSC (right) versus an uncompressed set-associative cache (left). VSC divides each set of the data array into small segments (8 bytes in this example), stores each variable-sized line as a contiguous run of segments in the data array, modifies tags to point to the blocks' data segments, and increases the number of tags per set relative to the uncompressed cache (by 2x in this example) to allow tracking more, smaller lines per set. (b) Simplified example of the Pack+Probe attack on VSC. The attacker first packs the target set to leave exactly S segments left. After the victim accesses the secret, the attacker probes the set to see if the compressed size of the secret is larger than S segments.

As a point of comparison, consider conventional cache-based side channels. Conventional cache channels are based on the presence or absence of a line in the cache. Thus, conventional attacks make a strong assumption, namely that the victim is written in a way that encodes the secret as a load address.
Compressed cache attacks relax this assumption. Compressed cache channels are based on data compressibility in the cache. Data can leak regardless of how the program is written, just based on what data is written to memory. In this sense, our attacks on compressed caches are more similar to Spectre and the recent RAMBleed5 attacks than conventional side-channel attacks.
Figure 1(b) compares Spectre and the new attacks in this article, using an abstract view of a side-channel attack from prior work.6 Attacker and victim reside in different protection domains, so the attacker resorts to a side channel to extract the secret from the victim. To exploit the side channel, a transmitter in the victim's protection domain encodes the secret into the channel, which is read and interpreted by a receiver in the attacker's protection domain.
Spectre attacks allow the attacker to create different transmitters by arranging various sequences of mispredicted speculations. Analogously, attacks based on cache compression allow the attacker to create different transmitters by writing different data into the cache. Exploiting the compressed cache side channel requires a new receiver. In Spectre, the receiver uses techniques like Prime+Probe to detect the timing difference due to a line's presence. In compressed cache attacks, the receiver has to detect the compressibility information from the channel. To this end, we present Pack+Probe, a general technique that leverages the timing difference due to a line's presence to also infer its compressibility.

PACK+PROBE: MEASURING COMPRESSIBILITY
Whereas conventional caches manage fixed-size cache lines, compressed caches manage variable-sized lines. Thus, compressed caches divide the data array among variable-sized blocks and track their tags in a way that 1) enables fast lookups and insertions; 2) allows high compression ratios; and 3) avoids high tag storage overheads. While prior work has proposed various compressed cache architectures, Pack+Probe is general and applies broadly. For concreteness, we explain it using a commonly used organization, Variable-Sized Cache (VSC).7
VSC extends a set-associative design to store compressed, variable-sized cache lines.

FIGURE 3. Example of the Safecracker attack on BDI. The attacker fills the cache line with attacker-controlled data and brute-forces the first 2 bytes in every 4 bytes until the guess is close to the secret value and incurs compression. The change in the line size can be observed using Pack+Probe.

Figure 2(a) illustrates VSC and compares it with a set-associative design. VSC divides each set of the data array into small segments [8 bytes in Figure 2(a)]. It stores each variable-size line as a contiguous run of segments in the data array. Each tag includes a pointer to identify the block's data segments within the set, and VSC increases the number of tags per set relative to the uncompressed cache [e.g., by 2x in Figure 2(a)] to support smaller lines.
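A minimal functional model of one VSC set as described above, with 8-byte segments; the default parameters and the oldest-first eviction policy are simplifying assumptions of ours for illustration.

```python
class VSCSet:
    """One set of a Variable-Sized Cache: a fixed budget of data segments shared
    by up to num_tags variable-sized lines."""
    SEGMENT_BYTES = 8

    def __init__(self, num_segments=8, num_tags=8):
        self.num_segments = num_segments
        self.num_tags = num_tags
        self.lines = {}                          # address -> size in segments

    def segments_used(self):
        return sum(self.lines.values())

    def insert(self, addr, compressed_bytes):
        """Install a line, evicting oldest-first until both the segment budget
        and the tag budget are satisfied. Returns the evicted addresses."""
        need = -(-compressed_bytes // self.SEGMENT_BYTES)   # ceiling division
        evicted = []
        while (self.segments_used() + need > self.num_segments
               or len(self.lines) + 1 > self.num_tags):
            victim = next(iter(self.lines))                  # FIFO victim
            evicted.append(victim)
            del self.lines[victim]
        self.lines[addr] = need
        return evicted

# Whether inserting the victim's line evicts an attacker line depends only on
# the unused segments, which is exactly what Pack+Probe measures.
s = VSCSet()
for i in range(3):
    s.insert(f"attacker_{i}", 16)    # 3 lines x 2 segments = 6 segments used
print(s.insert("victim_a", 8))       # 1 more segment still fits -> no eviction
print(s.insert("victim_b", 24))      # 3 more segments do not fit -> eviction
```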
Pack+Probe exploits that, in compressed caches, a victim's access to a cache line may cause different evictions of other lines depending on the compressibility (i.e., compressed size) of the victim's line, in addition to the replacement policy. Since the compressed cache can store variable-sized lines, whether a newly installed line evicts other lines is determined by the unused capacity in the compressed cache.

Figure 2(b) shows a single step of Pack+Probe. Pack+Probe first packs the compressed cache with attacker-controlled lines to leave exactly S segments unused. Once the victim accesses the secret line, if its compressed size is S segments, no evictions will happen, whereas if it is larger than S segments, at least one of the attacker-controlled lines will be evicted. Finally, the attacker probes (accesses) the lines it inserted again, uses timing differences to infer which lines hit or miss, and thus infers whether the victim's line fits within X bytes. Repeating these steps with a simple binary search over values of X, the attacker can precisely determine the compressed size of the victim's line.
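The following C sketch outlines one Pack+Probe step and the surrounding binary search. The helpers (pack_target_set, trigger_victim_access, attacker_lines_all_hit) are hypothetical placeholders for the attacker's eviction-set construction and cache-timing machinery; they are not part of any real API, and the sketch assumes the victim's line maps to a known set.

#include <stdbool.h>

/* Hypothetical attacker primitives: placeholders for eviction-set and
 * timing machinery, not a real interface.                              */
static void pack_target_set(int segments_to_leave_free) { (void)segments_to_leave_free; }
static void trigger_victim_access(void) { /* e.g., send a request to the victim */ }
static bool attacker_lines_all_hit(void) { return true; /* time the probe accesses */ }

/* One Pack+Probe step: leave exactly S segments free, let the victim
 * install its line, then probe.  If every attacker line still hits,
 * the victim's line fit within S segments.                              */
static bool victim_line_fits(int s_segments) {
    pack_target_set(s_segments);
    trigger_victim_access();
    return attacker_lines_all_hit();
}

/* Binary search over S to pin down the victim line's compressed size.   */
static int measure_compressed_size(int max_segments) {
    int lo = 0, hi = max_segments;          /* size is in (lo, hi]            */
    while (lo + 1 < hi) {
        int mid = (lo + hi) / 2;
        if (victim_line_fits(mid))
            hi = mid;                       /* line is at most mid segments   */
        else
            lo = mid;                       /* line is larger than mid        */
    }
    return hi;
}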
Our full paper8 shows Pack+Probe on VSC takes less than 10 K cycles to determine the compressed size of the victim's line. Pack+Probe's efficiency allows us to further develop Safecracker, an active attack to leak secret data.

SAFECRACKER: LEARNING SECRETS THROUGH COLOCATION
Safecracker is named after the process used to crack combination locks in (classic) safes, where the attacker cycles through each digit of the combination and listens for changes in the lock that signal when the digit is correct. Similarly, Safecracker provides a guess and learns indirect outcomes (compressibility) of the guess. The attacker then uses the outcome to guide the next guess.

In the context of compressed caches, the attacker first makes a guess about the secret data and then builds a data pattern that, when colocated with the secret data, will cause the cache line to be compressed in a particular way if the guess is correct [like in Figure 1(a)]. By measuring the line's compressibility using Pack+Probe, the attacker knows whether the guess was correct.

Depending on the compression algorithm used, Safecracker needs different search strategies. We showcase and implement the Safecracker attack on the base-delta-immediate (BDI) compression algorithm,9 a simple and common baseline algorithm that relies on delta encoding. The BDI algorithm performs intra-cache-line compression by storing a common base value and small deltas. For example, for eight 4-byte integer values ranging from 1,280 to 1,287, BDI will store a 4-byte base of 1,280 and eight 1-byte values from 0 to 7, compressing a 32-byte cache line into a 12-byte compressed line. Depending on the base value and the ranges of the deltas, BDI compresses a cache line into eight different sizes. For example, a 4-byte base and eight 2-byte values result in a 20-byte compressed line.
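As an illustration of how BDI assigns one of a few fixed sizes, the following simplified C routine checks whether a 32-byte line compresses with a 4-byte base and 1-byte or 2-byte deltas. Real BDI implementations consider additional base widths and an immediate (zero) base, so treat this only as a sketch of the idea.

#include <stdint.h>

/* Simplified BDI size check for a 32-byte line viewed as eight 4-byte words:
 * try a 4-byte base with 1-byte deltas (12-byte line), then 2-byte deltas
 * (20-byte line); otherwise report the line as uncompressed (32 bytes).     */
static int bdi_compressed_size(const uint32_t words[8]) {
    uint32_t base = words[0];
    int fits_1byte = 1, fits_2byte = 1;

    for (int i = 0; i < 8; i++) {
        int64_t delta = (int64_t)words[i] - (int64_t)base;
        if (delta < INT8_MIN  || delta > INT8_MAX)  fits_1byte = 0;
        if (delta < INT16_MIN || delta > INT16_MAX) fits_2byte = 0;
    }
    if (fits_1byte) return 4 + 8 * 1;   /* base + eight 1-byte deltas = 12 B */
    if (fits_2byte) return 4 + 8 * 2;   /* base + eight 2-byte deltas = 20 B */
    return 32;                          /* stored uncompressed               */
}

For the example in the text, the eight values 1,280–1,287 all lie within a 1-byte delta of the base 1,280, so this check would report the 12-byte size.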


TABLE 1. Worst-case time for Safecracker to leak secrets of different sizes.

Time (ms) to leak a secret of size:

Attack                        | 1 byte | 2 bytes | 3 bytes | 4 bytes | 5 bytes | 6 bytes | 7 bytes | 8 bytes
Safecracker                   | 0.11   | 0.30    | 1.05    | 54.0    | 54.4    | 256     | -       | -
Safecracker + buffer overflow | 0.76   | 1.53    | 2.30    | 3.11    | 3.91    | 4.32    | 5.61    | 6.46

Figure 3 shows how our Safecracker attack exploits BDI. Assume the attacker wishes to steal a 4-byte secret, located in a cache line where all other data, 28 bytes in this example, is attacker controlled [e.g., it is part of a request buffer like in Figure 1(a)]. Safecracker starts by targeting a compressed pattern of a 4-byte base and 2-byte values (a 20-byte line). It brute-forces the first 2 bytes of the victim's word by changing the first 2 bytes of attacker-controlled words (i.e., trying patterns 0x00000000, 0x00010000, ..., 0xFFFF0000, at most 2^16 guesses), and uses Pack+Probe to see if the cache line compresses into a 20-byte line. Since the secret in this example is 0x0F00BA20, when the attacker guesses 0x0F000000, the line compresses into a 20-byte line, and the attacker records the 2 bytes it tried as part of the secret. The attacker targets the next compressed pattern, a 4-byte base and 1-byte values, and brute-forces the third byte (taking at most 2^8 guesses), and then brute-forces the last byte to learn all 4 bytes of the secret.

Our ASPLOS'20 paper8 details the algorithm to steal secrets of various sizes on BDI, and we also discuss how to extend the idea to other cache compression algorithms. Safecracker on BDI can leak up to 8 bytes of secret data, which can be devastating. For example, a 128-bit AES key is considered secure, but leaking 64 bits degrades it to 64-bit protection, which is insecure in many contexts.

Enhancing Safecracker With Latent Memory-Safety Violations
The previous example assumes that the attacker-controlled data is located right next to the secret in the same cache line. This limits the attacker to only leak contents contiguous to attacker-controlled data. However, if the attacker can find other latent memory-safety violations in the victim program, then this vulnerability can significantly increase the amount of data leaked and enhance the efficiency of Safecracker.

For example, by combining Safecracker with buffer overflows, the attacker can control where the attacker-controlled data is located. In the worst case, if the compression algorithm allows leaking up to X bytes per line, memory-safety violations let the attacker leak X bytes from every cache line. That is, if the victim has a memory footprint of M, then Safecracker with a buffer overflow can leak O(M) bytes of memory, where different compression algorithms have different constant factors (e.g., 8/64 = 1/8th of program memory for BDI).

Moreover, buffer overflows also allow the attacker to control how many bytes the attacker-controlled data consists of. Leveraging this, the attacker can leak the secret much more efficiently by making all partial guesses at byte granularity. The attacker can allocate a buffer that leaves only one byte of secret data in the line. By brute force, the attacker quickly learns this remaining byte in the same manner as the previous example. Once the last byte is known, the attacker learns the second-to-last byte by allocating a smaller buffer that does not overwrite it and brute-forcing only this byte. This requires the victim to restore the data over multiple invocations, which is the case with local, stack-allocated variables. By repeating these steps, the attacker can learn the 8 bytes of secret data even faster.

SAFECRACKER LEAKS SECRETS EFFICIENTLY
We evaluate the effectiveness of Safecracker using proof-of-concept workloads (PoCs) and architectural simulation (simulating a compressed cache with VSC+BDI). The first PoC has two separate processes, victim and attacker. The victim is a login server with a vulnerability that lets attacker-controlled input be stored next to a secret key. The attacker can provide input to the cache line where the key is located, without modifying the key. The attacker can also invoke victim accesses to the secret by issuing encryption requests that use the secret key. This lets the attacker perform Pack+Probe.

The attacker first finds the set that holds the secret cache line using standard Prime+Probe.10 Once the conflicting set is found, the attacker uses Safecracker to steal the victim's secret key.

Table 1 reports the worst-case execution time needed to steal different numbers of bytes. Safecracker requires less than a second to crack a 6-byte secret value. Though Safecracker can steal up to 8 bytes when applied to BDI, trying to steal more than 6 bytes requires much longer runtimes (hours for 8 bytes), because the complexity of Safecracker on BDI grows exponentially. Nonetheless, as we explained earlier, Safecracker can be combined with latent memory-safety violations to enhance its efficiency.

Buffer-Overflow-Based Attack
The second PoC builds on top of the first PoC. However, this time, the victim has a buffer-overflow vulnerability. The vulnerable function is as follows:

void encrypt(char *plaintext) {
    char result[LINESIZE];
    char data[DATASIZE]; // can be any size
    char key[KEYSIZE];
    memcpy(key, KEYADDR, KEYSIZE);
    strcpy(result, plaintext);
    {...}
}


The buffer overflow stems from the unsafe call to strcpy, which causes out-of-bounds writes when the plaintext input string exceeds LINESIZE bytes. The attacker exploits this buffer overflow to scribble over the stack and overwrite some of the bytes of the key. After encrypt returns, the scribbled-over line remains in the stack, and the attacker is then able to measure its compressibility with Pack+Probe and to run Safecracker as described before.

Buffer overflows give Safecracker much higher bandwidth by allowing it to guess a single byte on each step. Using a buffer overflow, Safecracker steals 8 bytes of secret data in under 10 ms. Table 1 shows that the attack time with a buffer overflow grows linearly rather than exponentially, as it did with the first PoC.

While Safecracker applied to BDI can steal only 8 bytes per line, the above-mentioned process can be repeated for different-sized attacker-controlled buffers to steal data in other lines, e.g., 8 bytes from multiple (potentially all) lines.

GENERALIZATION AND DEFENSES
Our ASPLOS'20 paper8 also discusses how to generalize these attacks to other compressed cache architectures and algorithms, and presents other opportunities to colocate attacker-controlled data with secret data. We find that 1) Pack+Probe is applicable to most compressed cache architectures with a decoupled tag store with extra tags, and a data array divided in fixed-size sets where variable-sized blocks are laid over; 2) variants of Safecracker can be constructed case by case for other compression algorithms, and the better the algorithm compresses, the more information it can leak; and 3) there are many more ways to colocate attacker-controlled data near sensitive data, both spatially (e.g., heap spraying) and temporally (e.g., uninitialized data), and widely used software, such as the Linux kernel, suffers from these attack vectors.

Finally, we present multiple ways (e.g., obfuscation) to defend against Pack+Probe, Safecracker, and other attacks on compressed caches. We evaluate one of them, partitioning the compressed cache, to understand the tradeoff between security and performance. Our analysis shows that even though it is possible to make compressed caches secure, the straightforward solution that partitions both the tag and data array comes at a cost. How to limit this performance impact while minimizing leakage would be interesting future work.

LESSONS LEARNED
We have presented the first security analysis of cache compression and found that cache compression is insecure because the compressibility of a cache line reveals information about its contents.

With our proposed attacks, Pack+Probe and Safecracker, we have also shown how cache compression can potentially leak as much program privacy as speculative execution. This has significant implications for architects. It suggests we as a community need to revisit our vast literature and reexamine other microarchitectural optimizations through a security lens.

Beyond the above-mentioned immediate and long-term message, our paper makes two other microarchitectural side channel "firsts."

The first work to show that memory-safety violations can enhance microarchitectural attacks: Memory-safety vulnerabilities, e.g., buffer overflows, and microarchitectural attacks are both prominent vulnerability classes in their own right. Fortunately, however, we have traditionally been able to treat them as orthogonal concerns with orthogonal sets of attacks and defenses.

In this work, we show that memory-safety violations can enhance microarchitectural attacks. In other words, memory-safety vulnerabilities can be used to mount (and exacerbate) microarchitectural attacks. Worse, defenses against a given memory-safety vulnerability are not sufficient to block its "mirror image" in the microarchitectural-attack world. Our original paper illustrates this idea by showing how a "code-reuse buffer overflow" defense (StackGuard) is insufficient for preventing a "microarchitectural-attack buffer overflow," which, when combined with compressed caches, can leak (asymptotically) all of the program memory.

The first data-centric, differential microarchitectural attack: There is a rich heritage in the security community for performing chosen-plaintext attacks. In this model, the victim program P takes as input sensitive data S and attacker-controlled data C, and produces an observation O, i.e., O = View(P | S, C). The attacker more precisely learns S by varying C and monitoring changes in O. This style is also called a differential attack; it is used to perform cryptanalysis and amplify traditional side-channel, e.g., DPA, attacks.

This article presents the first data-centric differential attack in the microarchitectural attack setting. Specifically, C is attacker-controlled data that colocates with S. We show how, by modulating C, the attacker can perform a guided search to recover S by observing changes in compressibility O.
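To make the differential-attack structure explicit, here is a minimal C sketch of the guided search: the attacker modulates its colocated data C, observes the compressibility O of the shared line, and keeps any guess that makes the line compress. The helpers (set_colocated_data, observed_compressed_size) are hypothetical stand-ins for the victim's request interface and the Pack+Probe measurement, and the 2-byte loop mirrors the BDI example above rather than a general algorithm.

#include <stdint.h>

/* Hypothetical stand-ins for the victim's request interface and the
 * Pack+Probe measurement from the previous sections.                    */
static void set_colocated_data(const uint32_t attacker_words[7]) { (void)attacker_words; }
static int  observed_compressed_size(void) { return 32; /* measure with Pack+Probe */ }

/* Recover the upper 2 bytes of a 4-byte secret colocated with 7 attacker
 * words: when the guess is close enough, every delta fits in 2 bytes and
 * the line compresses to 20 bytes (4-byte base + eight 2-byte deltas).   */
static int guess_upper_half(void) {
    for (uint32_t g = 0; g <= 0xFFFF; g++) {
        uint32_t guess = g << 16;        /* candidate C: 0x00000000 ... 0xFFFF0000 */
        uint32_t attacker_words[7];
        for (int i = 0; i < 7; i++)
            attacker_words[i] = guess;   /* fill the rest of the line with the guess */

        set_colocated_data(attacker_words);
        if (observed_compressed_size() <= 20)  /* O changed: guess matches the secret's top bytes */
            return (int)g;
    }
    return -1;                           /* no compressible pattern observed */
}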


Making matters worse, depending on the cache compression algorithm, this search can be performed in asymptotically fewer steps than brute-force guessing S.

We hope this work prevents insecure cache compression techniques from reaching mainstream processors. More importantly, we hope our work catalyzes a line of work in proactive security analysis of microarchitectural optimizations.

ACKNOWLEDGMENTS
We would like to thank Maleen Abeydeera, Joel Emer, Mark Jeffrey, Anurag Mukkara, Quan Nguyen, Victor Ying, Guowei Zhang, and the ASPLOS'20 reviewers for their feedback. We would also like to thank Joel Emer for his insights on the taxonomy, and Paul Kocher for sharing his concerns about the security of memory compression. This work was supported in part by the National Science Foundation under Grant CAREER-1452994 and Grant SaTC-1816226, in part by the Google faculty research award, and in part by an Intel ISRA grant. The work of Andres Sanchez was supported through a MISTI grant by the Technical University of Madrid.

REFERENCES
1. P. Kocher et al., "Spectre attacks: Exploiting speculative execution," in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 1–19, doi: 10.1109/SP.2019.00002.
2. M. Lipp et al., "Meltdown: Reading kernel memory from user space," in Proc. 27th USENIX Secur. Symp., 2018, pp. 973–990, doi: 10.1145/3357033.
3. R. McIlroy, J. Sevcik, T. Tebbi, B. L. Titzer, and T. Verwaest, "Spectre is here to stay: An analysis of side-channels and speculative execution," CoRR, vol. abs/1902.05178, 2019. [Online]. Available: http://arxiv.org/abs/1902.05178
4. B. Abali et al., "Data compression accelerator on IBM POWER9 and z15 processors: Industrial product," in Proc. 47th Annu. Int. Symp. Comput. Archit., 2020, pp. 1–14, doi: 10.1109/ISCA45697.2020.00012.
5. A. Kwong, D. Genkin, D. Gruss, and Y. Yarom, "RAMBleed: Reading bits in memory without accessing them," in Proc. IEEE Symp. Secur. Privacy, 2020, pp. 695–711, doi: 10.1109/SP40000.2020.00020.
6. V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, and J. Emer, "DAWG: A defense against cache timing attacks in speculative execution processors," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 974–987, doi: 10.1109/MICRO.2018.00083.
7. A. R. Alameldeen and D. A. Wood, "Adaptive cache compression for high-performance processors," in Proc. 31st Annu. Int. Symp. Comput. Archit., 2004, pp. 212–223, doi: 10.1109/ISCA.2004.1310776.
8. P.-A. Tsai, A. Sanchez, C. W. Fletcher, and D. Sanchez, "Safecracker: Leaking secrets through compressed caches," in Proc. 25th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2020, pp. 1125–1140, doi: 10.1145/3373376.3378453.
9. G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, "Base-delta-immediate compression: Practical data compression for on-chip caches," in Proc. 21st Int. Conf. Parallel Archit. Compilation Techn., 2012, pp. 377–388, doi: 10.1145/2370816.2370870.
10. D. A. Osvik, A. Shamir, and E. Tromer, "Cache attacks and countermeasures: The case of AES," in Proc. 7th Cryptographers' Track RSA Conf. Topics Cryptology, 2006, pp. 1–20, doi: 10.1007/11605805_1.

PO-AN TSAI is currently a Research Scientist with NVIDIA, Santa Clara, CA, USA. His current research focuses on tensor accelerator design/modeling and memory hierarchy optimizations for domain-specific accelerators. Tsai received a B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2012, and S.M. and Ph.D. degrees in computer science from MIT, Cambridge, MA, USA, in 2015 and 2019, respectively. He is a Member of IEEE. Contact him at poant@nvidia.com.

ANDRES SANCHEZ is currently working toward a graduate degree with the École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. His research topics include side-channel detection and mitigation, security-aware systems layering, compilation techniques, and characterization of sensitive information. Sanchez received a B.S. degree in mathematics and computer engineering from the Technical University of Madrid, Madrid, Spain, in 2019. Contact him at andres.sanchez@epfl.ch.

CHRISTOPHER W. FLETCHER is currently an Assistant Professor in computer science with the University of Illinois at Urbana-Champaign, Champaign, IL, USA. He has interests ranging from computer architecture to security to high-performance computing (ranging from theory to practice, algorithm to software to hardware). Fletcher received a Ph.D. degree from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2016. Contact him at cwfletch@illinois.edu.

DANIEL SANCHEZ is currently an Associate Professor with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA. His research interests include scalable memory hierarchies, architectural support for parallelization, and accelerators for sparse computations. Sanchez received a Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA. Contact him at sanchez@csail.mit.edu.



THEME ARTICLE: TOP PICKS

Understanding Acceleration Opportunities at Hyperscale

Akshitha Sriraman, University of Michigan, Ann Arbor, MI, 48109, USA
Abhishek Dhanotia, Facebook, Menlo Park, CA, 94025, USA

Modern web services run across hundreds of thousands of servers in a data center,
i.e., at hyperscale. With the end of Moore’s Law and Dennard scaling, successive
server generations running these web services exhibit diminishing performance
returns, resulting in architects adopting hardware customization. An important
question arises: Which web service software operations are worth building custom
hardware for? To answer this question, we comprehensively analyze important
Facebook production services and identify key acceleration opportunities. We then
develop an open-source analytical model, Accelerometer, to help make well-
informed hardware decisions for the acceleration opportunities we identify.

Modern web services such as social media, online messaging, web search, video streaming, and online banking often support billions of users, requiring data centers that scale to hundreds of thousands of servers, i.e., hyperscale. Whereas hyperscale web services once had largely monolithic software architectures, modern web services are composed of numerous independent, specialized, distributed microservices (e.g., key-value serving in a social media service).1,2 Several companies such as Amazon, Netflix, Gilt, LinkedIn, Facebook, and SoundCloud have adopted microservice architectures to improve web service development and scalability.3

While at face value, hyperscale web systems seem instantaneously available at the touch of a button, existing microservices barely meet performance requirements. In reality, microservices have much more stringent performance constraints than their monolithic counterparts, since numerous microservices must be invoked, often serially, to serve a user's query. For example, a Facebook news feed service query may flow through a pipeline of many microservices invoked via remote procedure calls (RPCs), such as 1) Sigma: a spam filter; 2) McRouter: a protocol router; 3) Feed: a news feed stories extractor; 4) Tao: a distributed social graph data store; and 5) MyRocks: a user database. Complex microservice interactions place stringent performance constraints on individual microservices.

As hyperscale computing grows to drive more sophisticated applications (e.g., virtual reality and conversational AI), existing microservice systems will face greater efficiency challenges due to these more complex tasks. Increasingly complex microservices can be efficiently supported if the hardware rises to meet efficiency requirements. However, with the end of Moore's Law and Dennard scaling, a key challenge to realizing microservice efficiency is that successive server generations running microservices exhibit diminishing performance returns.

To improve hardware efficiency, several architects today work on developing numerous specialized hardware accelerators for important microservice domains [e.g., machine learning (ML) tasks]. However, large-scale internet operators have strong economic incentives to limit hardware platform diversity to: 1) maintain fungibility of hardware resources; 2) preserve procurement advantages that arise from economies of scale; and 3) limit the overhead of developing and testing on myriad specialized hardware platforms. Hence, an important question arises: Which microservice operations consume the most CPU cycles and are worth accelerating?

To build specialized accelerators for these key microservice operations, it is important to first identify which type of accelerator meets microservice requirements and is worth designing and deploying.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
Digital Object Identifier 10.1109/MM.2021.3066615
Date of publication 30 March 2021; date of current version 25 May 2021.


Deploying specialized hardware is risky at hyperscale, as the hardware might underperform due to performance bounds from the microservice's software interaction with the hardware, resulting in high monetary losses. To make well-informed hardware decisions, it is crucial to answer the following question early in the design phase of a new accelerator: How much can the accelerator realistically improve its targeted microservice overhead?

To answer both the above questions, our article,4 presented at ASPLOS 2020, first undertakes a comprehensive characterization of how microservices spend their CPU cycles. We study seven important hyperscale Facebook microservices in four diverse service domains that run across hundreds of thousands of servers, occupying a large portion of Facebook's global server fleet. Our detailed breakdown of CPU cycles consumed by various microservice operations identifies key overheads and potential design optimizations. To make well-informed hardware decisions for these microservice overheads, we contribute Accelerometer, an analytical model that projects realistic microservice speedup for various hardware acceleration strategies. We also demonstrate Accelerometer's utility in Facebook's production microservices via three retrospective case studies conducted when serving live user traffic.

HOW DO MICROSERVICES SPEND THEIR CPU CYCLES?
To identify microservice overheads, we comprehensively characterize how seven important Facebook production microservices spend their CPU cycles when serving live user traffic. Very few prior works study how cycles are spent in data centers. Kanev et al.1 investigate the "data center tax" across Google's server fleet by studying cycles spent in seven types of leaf functions invoked at the end of a call trace [e.g., memcpy()]. However, a leaf function study alone does not holistically provide insight into whether acceleration might improve a microservice functionality (e.g., encryption).

To analyze microservice functionalities, we must comprehensively characterize a microservice's entire call stack to measure the CPU cycles spent in each phase of the microservice's operation after it receives a request. Characterizing microservice functionalities helps determine 1) whether diverse microservices execute common types of operations (e.g., compression, serialization, and encryption) and 2) the overheads such operations induce. Analyzing both leaf functions and microservice functionalities helps identify key acceleration opportunities that might inform future software and hardware designs.

We characterize the CPU cycles spent by Facebook's production microservices in both leaf functions and microservice functionalities.

FIGURE 1. Breakdown of cycles spent in various leaf functions (leaf categories defined in the table to the right): Memory func-
tions consume a significant portion of total cycles.


We study 1) Web: a front-end microservice that implements PHP and Hack; 2) Feed1 and Feed2: news feed microservices that aggregate, rank, and display stories; 3) Ads1 and Ads2: advertisement microservices that compute user-specific and ad-specific data; and 4) Cache1 and Cache2: large distributed-memory object caching microservices.

Leaf Function Characterization
We present key leaf function breakdowns for Facebook's microservices in Figure 1, comparing them with Google's services1 and SPEC CPU2006 benchmarks.5 We find that many leaf function overheads are significant and common across microservices. We detail our observations for dominant leaf categories.

Memory. Most microservices spend a significant fraction of cycles on memory functions that include memory copy, free, allocation, move, set, and compare. Memory copies are by far the greatest consumers of memory cycles. Data is primarily copied during microservice operations such as 1) I/O pre- or postprocessing, 2) I/O sends and receives, 3) RPC serialization/deserialization, and 4) microservice business logic execution (e.g., executing key-value stores in Cache).

We observe significant diversity in dominant service functionalities that invoke memory copies across microservices. This diversity suggests a strategy to specialize copy optimizations to suit each microservice's distinct needs. For example, Web can benefit from reducing copies in I/O pre- or postprocessing, whereas Cache2 can gain from fewer copies in network protocol stacks.

Freeing memory incurs a high overhead for several microservices, as the free() function does not take a memory block size parameter, performing extra work to determine the size class to return the block to. TCMalloc performs a hash lookup to get the size class. This hash tends to cache poorly, especially in the TLB, leading to performance losses. Although C++11 ameliorates this problem by allowing compilers to invoke delete() with a parameter for memory block size, overheads still arise from 1) removing pages faulted in when memory was written to and 2) merging neighboring freed blocks to produce a more valuable large free block. While numerous prior works optimize memory allocations,6 very few recognize that optimizing free() can result in significant performance wins.
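To illustrate why the missing size parameter is costly, the toy C fragment below contrasts a free path that must look up the block's size class with a hypothetical sized variant that skips the lookup. The size-class table, lookup routine, and sized_free interface are illustrative inventions for this sketch; they are not TCMalloc's actual data structures or API.

#include <stddef.h>

#define NUM_SIZE_CLASSES 64

/* Per-class free lists of a toy size-class allocator. */
static void *free_lists[NUM_SIZE_CLASSES];

/* Hypothetical metadata lookup mapping a pointer to its size class (real
 * allocators walk a page map or hash here; that walk pollutes caches/TLBs). */
static int size_class_of(void *ptr) { (void)ptr; return 0; }

/* Maps a request size to a size class: cheap arithmetic, no memory walk.    */
static int size_class_for(size_t size) {
    int cls = 0;
    for (size_t sz = 8; sz < size && cls < NUM_SIZE_CLASSES - 1; sz *= 2)
        cls++;
    return cls;
}

static void push_free(int cls, void *ptr) {
    *(void **)ptr = free_lists[cls];   /* intrusive singly linked free list  */
    free_lists[cls] = ptr;
}

/* free() receives only the pointer, so it must pay for the metadata lookup. */
void toy_free(void *ptr) {
    push_free(size_class_of(ptr), ptr);
}

/* A sized deallocation interface lets the caller supply the size, so the
 * lookup (and its cache/TLB misses) disappears from the fast path.          */
void toy_sized_free(void *ptr, size_t size) {
    push_free(size_class_for(size), ptr);
}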
Kernel. Microservices with high OS kernel overhead, Cache1 and Cache2, invoke OS scheduler functions frequently. Software/hardware optimizations that reduce scheduler latency (e.g., intelligent thread switching7 and coalescing I/O) might considerably improve their performance. Cache2 also spends significant cycles in I/O and network interactions, and can benefit from optimizations such as kernel-bypass and multiqueue NICs.

Synchronization. Microservices such as Cache oversubscribe threads to improve throughput.8 Hence, such microservices spend significant cycles synchronizing frequent communication between distinct thread pools. Cache also spends a large fraction of cycles in spin locks that are typically deemed performance inefficient.9 However, Cache implements spin locks since it is a microsecond-scale microservice,8 and is hence more prone to microsecond-scale performance penalties that can otherwise arise from thread re-scheduling, wakeups, and context switches.7

C Libraries. We observe that Feed2, Ads1, and Ads2 invoke C libraries for vector operations, as they deal with large ML feature vectors. Web spends significant cycles parsing and transforming strings to process queries from many URL endpoints. Interestingly, unlike memory or kernel leaf functions, C libraries' instructions per cycle (IPC) scales well across CPU generations, as many hardware vendors primarily rely on open-source SPEC benchmarks that heavily use C libraries to make architecture design decisions.

Other observations. ML microservices such as Ads2 and Feed2 spend only up to 13% of cycles on mathematical operations that constitute ML inference using multilayer perceptrons. Cache2 spends 6% of cycles in leaf encryption functions since it encrypts a high number of queries per second (QPS). Additionally, Google's breakdown for a few leaf function categories, such as memory or kernel, is similar to Facebook's breakdowns. In contrast, SPEC CPU2006 benchmarks do not capture key leaf overheads faced by our microservices.

FIGURE 2. Breakdown of cycles spent in the main application logic versus orchestration work: Orchestration overheads significantly dominate.

Service Functionality Characterization
We show a broad microservice functionality breakdown in Figure 2.

36 IEEE Micro May/June 2021


TOP PICKS

FIGURE 3. Breakdown of CPU cycles spent in various microservice functionalities (service functionality categories defined in the
table to the right): Orchestration overheads are significant and fairly common across microservices.

We find that application logic disaggregation across microservices has resulted in significant microservice functionality overheads. Several microservices spend only a small fraction of their execution time serving their main application logic (e.g., ML-based ads recommendation or key-value serving), squandering significant cycles facilitating the main logic via orchestration work that is not critical to the main application logic (e.g., compression, serialization, and I/O processing). For example, microservices that perform ML inference—Feed1, Feed2, Ads1, and Ads2—spend as few as 33% of cycles on ML inference, consuming 42%–67% of cycles in orchestrating inference. Hence, even if modern inference accelerators10 were to offer an infinite inference speedup, the net microservice performance would only improve by 1.49x–2.38x. There is hence an urgent need to accelerate the significant orchestration work that facilitates the main application logic.
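These bounds follow directly from Amdahl's law. Taking f as the fraction of cycles spent on inference (33% at the low end; roughly 58% would be needed to reach the reported 2.38x, a value inferred here from the speedup rather than stated in the article):

Speedup_max = 1 / (1 - f), so 1 / (1 - 0.33) ≈ 1.49x and 1 / (1 - 0.58) ≈ 2.38x.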
Orchestration overheads arise since a microservice, upon receiving an RPC, must often perform operations such as I/O processing, decompression, deserialization, and decryption before executing its main functionality. Hence, many microservices face common orchestration overheads despite great diversity in microservices' main application logic, as shown in Figure 3.

We make several observations about these significant and common orchestration overheads. First, Web, Cache1, and Cache2 spend a large portion of cycles executing I/O, i.e., sending and receiving RPCs, and consequent I/O compression and serialization overheads dominate. Web incurs a high I/O overhead since it implements many URL endpoints and communicates with a large back-end microservice pool. Cache1 and Cache2 are leaf microservices that support a high request rate8—they frequently invoke RPCs to communicate with mid-tier microservices. These microservices can benefit from I/O optimizations such as kernel-bypass, multiqueue NICs, and efficient I/O notification paradigms.

Second, Web spends only 18% of cycles in its main web serving logic (parsing and processing client requests), consuming 23% of cycles in reading and updating logs. It is unusual for applications to incur such high logging overheads; only a few academic studies focus on optimizing them in hardware.

Third, Ads1, Feed1, Feed2, and Cache1 incur a high thread pool management overhead. Intelligent thread scheduling and tuning7 can help these microservices.

We conclude that application logic disaggregation across microservices and the consequent increase in inter-service communication at hyperscale have resulted in significant and common orchestration overheads in modern data centers. In Table 1, we report acceleration opportunities that might inform future software and hardware designs. We believe that our rich overhead characterization and taxonomy of existing optimizations will guide researchers in mitigating these overheads.

ACCELEROMETER: AN ANALYTICAL MODEL FOR HARDWARE ACCELERATION
Accelerating key common orchestration overheads in production requires 1) designing new hardware; 2) testing it; and 3) carefully planning capacity to provision the hardware to match projected load.


TABLE 1. Summary of findings and suggestions for future optimizations.

Finding                                      | Acceleration opportunity
Significant orchestration overheads          | Software and hardware acceleration for orchestration rather than just app. logic
Several common orchestration overheads       | Accelerating common overheads (e.g., compression) can provide fleet-wide wins
Poor IPC scaling for several functions       | Optimizations for specific leaf/service categories
Memory copies & allocations are significant  | Dense copies via SIMD, copying in DRAM, Intel's I/O AT, DMA via accelerators, PIM
Memory frees are computationally expensive   | Faster software libraries, hardware support to remove pages
High kernel overhead and low IPC             | Coalesce I/O, user-space drivers, in-line accelerators, kernel-bypass
Logging overheads can dominate               | Optimizations to reduce log size or number of updates
High compression overhead                    | Bit-plane compression, Buddy compression, dedicated compression hardware
Cache synchronizes frequently                | Better thread pool tuning and scheduling, Intel's TSX, coalesce I/O, vDSO
High event notification overhead             | Hardware support for notifications (e.g., RDMA-style), spin versus block hybrids

Given the uncertainties inherent in projecting customer demand, deploying diverse custom hardware is risky at scale as the hardware might underperform due to performance bounds from the microservice's software interactions with the hardware.

To easily identify performance bounds early in the hardware design phase and estimate realistic gains from hardware acceleration, there is an urgent need to develop a simple, yet accurate analytical model for hardware acceleration. The state-of-the-art analytical model for acceleration, LogCA,11 falls short for microservices as it assumes that the CPU synchronously waits while the offload operates. However, for many microservice functionalities, offload is asynchronous; the processor continues doing useful work concurrent with the offload. Capturing these concurrency-induced performance bounds will help realistically model microservice speedup for various hardware acceleration strategies.

To easily and accurately model whether an accelerator is worth designing and deploying for a microservice operation, we develop an analytical model, Accelerometer. Accelerometer models both synchronous and asynchronous offloads for three hardware acceleration strategies—on-chip, off-chip, and remote.

Accelerometer assumes an abstract system with three components: 1) host: a general-purpose CPU; 2) accelerator: custom hardware to accelerate a kernel (or microservice operation); and 3) interface: the communication layer between the host and the accelerator (e.g., a PCIe link). Accelerometer models both the microservice throughput speedup (referred to as "speedup") and the per-request latency speedup (referred to as "latency reduction"). Modeling both speedup and latency reduction ensures that acceleration enables a higher throughput (i.e., more QPS) without violating microservice latency service level objectives (SLOs). When work is offloaded to an accelerator, the speedup and latency reduction depend on the acceleration strategy and the threading design used to offload, i.e., synchronous versus asynchronous offload.

FIGURE 4. Example timeline of host and accelerator.

Synchronous Offload


In a synchronous offload (Sync), the host awaits the accelerator's response before resuming execution (see Figure 4), putting the accelerator's operation cycles (aC_A) in the critical path of the host's execution, impacting speedup and per-request latency. The host can consume additional cycles to 1) prepare the kernel for offload, o0; 2) transfer the kernel to the accelerator, L; and 3) wait in a queue for the accelerator to become available, Q.

Synchronous Offload With Thread Oversubscription
In reality, several microservices (e.g., Web and Cache) oversubscribe threads to improve throughput by having more threads than available cores. With synchronous offload, a microservice oversubscribing threads (Sync-OS) allows the host to schedule an available thread to process new work, while the thread that offloaded work blocks awaiting the accelerator's response. Hence, the accelerator's cycles do not critically affect speedup, but impact the per-request latency. Moreover, OS overheads from the host switching to an available thread after an offload, o1, affect both speedup and latency reduction. The microsecond-scale o1 overhead dominates in microsecond-scale microservices (e.g., Caching), making it feasible to incur a throughput gain at the cost of a per-request latency slowdown. In such cases, Accelerometer can help ensure that the microservice still meets its latency SLO.

Asynchronous Offload
In an asynchronous offload (Async), the host does useful work concurrent with the accelerator's operation on the offload, removing the accelerator's cycles from the critical path. Depending on whether the response is picked up by the same thread that sent the request or a different thread, OS thread switch penalties can impact speedup and latency reduction.

Accelerometer models all these cases as well as other nuanced scenarios to project realistic gains early in the hardware design phase and make well-informed hardware investments. We expect Accelerometer to have the following use cases. 1) Data center operators can estimate fleet-wide gains from optimizing key service overheads. 2) Architects can make better accelerator design decisions and estimate realistic gains by considering offload overheads due to microservice software design.
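To give a flavor of the bookkeeping such a model performs, the following C sketch estimates per-request service time for the three threading scenarios using simple critical-path reasoning. The parameter names follow the article's notation (o0, L, Q, o1, accelerator cycles), but the formulas are an illustrative simplification written for this sketch; they are not Accelerometer's actual equations.

#include <stdio.h>

/* Per-request cycle counts (illustrative): o0 = offload preparation,
 * L = transfer, Q = queueing delay, o1 = thread-switch overhead,
 * acc = accelerator operation cycles.                                   */
struct offload_params {
    double kernel_on_host;  /* cycles the kernel takes without acceleration */
    double other_work;      /* host cycles unrelated to the offloaded kernel */
    double o0, L, Q, o1, acc;
};

/* No acceleration: the host runs everything. */
static double time_baseline(const struct offload_params *p) {
    return p->kernel_on_host + p->other_work;
}

/* Sync: the host blocks, so the whole offload path sits on the critical path. */
static double time_sync(const struct offload_params *p) {
    return p->o0 + p->L + p->Q + p->acc + p->other_work;
}

/* Sync-OS: another thread runs while the offload is in flight, so the
 * accelerator's cycles leave the throughput-critical path, but each offload
 * pays the thread-switch cost o1 (per-request latency still sees the offload). */
static double time_sync_os_throughput(const struct offload_params *p) {
    return p->o0 + p->o1 + p->other_work;
}

/* Async: the host overlaps its remaining work with the offload. */
static double time_async(const struct offload_params *p) {
    double offload_path = p->L + p->Q + p->acc;
    double host_path    = p->other_work;
    return p->o0 + (offload_path > host_path ? offload_path : host_path);
}

int main(void) {
    struct offload_params p = { 4000, 6000, 200, 300, 100, 400, 1500 };
    double base = time_baseline(&p);
    printf("baseline: %.0f cycles\n", base);
    printf("sync    : %.0f cycles (speedup %.2fx)\n", time_sync(&p), base / time_sync(&p));
    printf("sync-OS : %.0f cycles (speedup %.2fx)\n", time_sync_os_throughput(&p), base / time_sync_os_throughput(&p));
    printf("async   : %.0f cycles (speedup %.2fx)\n", time_async(&p), base / time_async(&p));
    return 0;
}

Even this toy version exhibits the qualitative behavior the article describes: synchronous offload keeps the accelerator on the critical path, while oversubscription and asynchrony hide it at the cost of switch overheads.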
VALIDATING AND APPLYING ACCELEROMETER
We validate Accelerometer's utility via three retrospective case studies on production systems, by comparing model-estimated speedup with real microservice speedup determined via A/B testing. Each study covers a distinct microservice threading scenario (i.e., Sync, Sync-OS, and Async). We analyze 1) an on-chip accelerator: a specialized hardware instruction for encryption, AES-NI; 2) an off-chip accelerator: an encryption device connected to the host CPU via a PCIe link; and 3) a remote accelerator: a general-purpose CPU that solely performs ML inference and is connected to the host CPU via commodity network. In all three studies, we show that Accelerometer estimates the real microservice speedup with ≤3.7% error.

Finally, we use Accelerometer to project speedup for the acceleration recommendations derived from three key common overheads identified by our characterization: compression, memory copy, and memory allocation.

LONG-TERM IMPLICATIONS
We discuss long-term implications, highlighting the impact this work has already had.

Accelerometer in Production
As microservices evolve, Accelerometer's generality makes it even more suitable in determining new hardware requirements early in the design phase. Since we validated Accelerometer in production and made it open-source,12 we are happy to report that it has been adopted by multiple hyperscale companies (e.g., in developing their encryption and compression accelerators) to make well-informed hardware decisions. We expect Accelerometer to trigger research in developing more complex models that account for overheads induced by offloading to specific accelerators (e.g., software batching implications on FPGA memory bandwidth versus latency).

Influence on Real Hardware Designs
In this work, we take a step back and answer the Amdahl's Law question of: Which overheads prevail even after offloading a microservice's main functionality to accelerators? Our comprehensive study of real-world microservices definitively indicates the need for a qualitatively different approach to future accelerator efforts. So far, data center hardware acceleration efforts have primarily focused on the most costly operations of a few "killer" applications (e.g., ML inference). However, accelerating orchestration overheads can offer greater benefits as they are significant and common across microservices.


As web service architectures grow more fragmented (e.g., deeper microservice pipelines and serverless architectures), it becomes more important to optimize the increasingly ubiquitous orchestration overheads. However, accelerating orchestration overheads is nontrivial as 1) orchestration libraries are already well-optimized in software and 2) orchestration function invocations are frequent, involve small data granularity, and are interspersed between other microservice code. Hence, accelerating orchestration overheads will require different techniques than those used in throughput-based specialization blocks with coarse-grained offloads (e.g., video processing).

Although Accelerometer provides the first step in determining required acceleration strategies, we expect significant academic and industrial interest in rethinking accelerators for fine-grained orchestration operations. Already, a few hardware vendors have used our study's insights to influence hardware customization for orchestration operations.

Characterization Approach and Tool
While it is relatively simple to measure the CPU cycles spent in leaf functions, it is extremely difficult to categorize every path's functionality in a microservice's entire call stack, as microservices have deep, complex software stacks that are hard to parse and classify. We developed a methodology to systematically classify each call trace path: We applied expert insights to identify service functionality classification rules that we then used to categorize cycles spent in various microservice functionalities.

We integrated this characterization tool into our fleet-wide performance monitoring infrastructure; it currently assimilates statistics from hundreds of thousands of servers from around the world to help developers visualize the performance impact of their code changes at hyperscale. With the decline of hardware performance scaling, there is a greater need for researchers to develop such tools for performance monitoring and optimization at all levels of the systems stack.

Industry-Academia Collaborative Benchmarking Efforts
Many hardware vendors rely on open-source benchmarks such as SPEC that heavily use C libraries to make architecture decisions. Hence, in our characterization, we observe that only C libraries' IPC scales well across CPU generations, but the other overheads (e.g., memory movement and encryption) show little to no improvement.

There is immense value in validating commonly used benchmarks with real-world application behaviors. Our characterization drove a hardware vendor to consider more representative benchmarks (in place of traditional ones they used for decades) when evaluating hardware designs. This work has resulted in an industry–academia joint collaborative effort to design and open-source scale-out cloud benchmarks that represent the hyperscale behaviors identified in our characterization. We expect our comprehensive study to drive continued benchmarking efforts that represent the severity of overheads in production-grade software.

End-to-End Thinking in Accelerator Design
Oftentimes, when designing accelerators, architects tend to miss the end-to-end picture, i.e., overheads that might arise from other system parts. When trying to adopt these accelerators at hyperscale, we have often found that they degrade performance due to overlooked offload-induced overheads. Accelerometer is a simple, powerful tool to help architects analytically estimate offload-induced overheads that arise from the end-to-end path, projecting realistic gains early in the hardware design phase.

REFERENCES
1. S. Kanev et al., "Profiling a warehouse-scale computer," in Proc. Int. Symp. Comput. Archit., 2015, pp. 158–169, doi: 10.1145/2749469.2750392.
2. A. Mirhosseini, A. Sriraman, and T. F. Wenisch, "Enhancing server efficiency in the face of killer microseconds," in Proc. Int. Symp. High Perform. Comput. Archit., 2019, pp. 185–198, doi: 10.1109/HPCA.2019.00037.


3. A. Sriraman and T. F. Wenisch, "μSuite: A benchmark suite for microservices," in Proc. IEEE Int. Symp. Workload Characterization, 2018, pp. 1–12, doi: 10.1109/IISWC.2018.8573515.
4. A. Sriraman and A. Dhanotia, "Accelerometer: Understanding acceleration opportunities for data center overheads at hyperscale," in Proc. Int. Conf. Archit. Support Program. Lang. Operating Syst., 2020, pp. 733–750, doi: 10.1145/3373376.3378450.
5. J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Comput. Archit. News, vol. 34, no. 4, 2006, pp. 1–17, doi: 10.1145/1186736.1186737.
6. S. Kanev, S. L. Xi, G.-Y. Wei, and D. Brooks, "Mallacc: Accelerating memory allocation," in Proc. Int. Conf. Archit. Support Program. Lang. Operating Syst., 2017, pp. 33–45, doi: 10.1145/3037697.3037736.
7. A. Sriraman and T. F. Wenisch, "μTune: Auto-tuned threading for OLDI microservices," in Proc. USENIX Conf. Operating Syst. Des. Implementation, 2018, pp. 177–194. [Online]. Available: https://www.usenix.org/conference/osdi18/presentation/sriraman
8. A. Sriraman, A. Dhanotia, and T. F. Wenisch, "SoftSKU: Optimizing server architectures for microservice diversity @scale," in Proc. Int. Symp. Comput. Archit., 2019, pp. 513–526, doi: 10.1145/3307650.3322227.
9. L. Luo et al., "LASER: Light, Accurate Sharing dEtection and Repair," in Proc. Int. Symp. High Perform. Comput. Archit., 2016, pp. 261–273.
10. N. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. Int. Symp. Comput. Archit., 2017, pp. 1–12.
11. M. S. B. Altaf and D. A. Wood, "A high-level performance model for hardware accelerators," in Proc. Int. Symp. Comput. Archit., 2017, pp. 375–388.
12. "Accelerometer," doi: 10.5281/zenodo.3612796.

AKSHITHA SRIRAMAN is currently a Ph.D. candidate in computer science and engineering at the University of Michigan, Ann Arbor, MI, USA. Her research bridges computer architecture and software systems, demonstrating the importance of that bridge in realizing efficient hyperscale web services via solutions that span the systems stack. Sriraman received an M.S. degree in Embedded Systems from the University of Pennsylvania. She is the corresponding author of this article. Contact her at akshitha@umich.edu.

ABHISHEK DHANOTIA is currently a Performance Engineer at Facebook, Menlo Park, CA, USA, where he works on designing their next-generation systems and improving efficiency of data center workloads. His research interests include computer architecture, performance analysis, and energy-efficient system architectures for data centers. Dhanotia received an M.S. degree in computer engineering from North Carolina State University. Contact him at abhishekd@fb.com.



THEME ARTICLE: TOP PICKS

Accelerating Genomic Data Analytics With Composable Hardware Acceleration Framework

Tae Jun Ham, Seoul National University, Seoul, 130-743, South Korea
David Bruns-Smith and Brendan Sweeney, University of California Berkeley, Berkeley, CA, 94720-5800, USA
Yejin Lee, Seong Hoon Seo, and U Gyeong Song, Seoul National University, Seoul, 130-743, South Korea
Young H. Oh, Sungkyunkwan University, Seoul, 03063, South Korea
Krste Asanovic, University of California Berkeley, Berkeley, CA, 94720-5800, USA
Jae W. Lee, Seoul National University, Seoul, 130-743, South Korea
Lisa Wu Wills, Duke University, Durham, NC, 27708-0187, USA

This article presents a framework, Genesis (genome analysis), to efficiently and flexibly accelerate generic data manipulation operations that have become performance bottlenecks in the genomic data processing pipeline utilizing FPGAs-as-a-service. Genesis conceptualizes genomic data as a very large relational database and uses extended SQL as a domain-specific language to construct data manipulation queries. To accelerate the queries, we designed a Genesis hardware library of efficient coarse-grained primitives that can be composed into a specialized dataflow architecture. This approach explores a systematic and scalable methodology to expedite domain-specific end-to-end accelerated system development and deployment.

As the democratization of wet lab sequencing technology drives down sequencing cost, the cost and runtime of data analysis are becoming more significant. An article published in PLoS Biology quantitatively claimed that genomics is projected to produce over 250 exabytes of sequence data per year by 2025, far surpassing the current major generators of big data such as YouTube (1–2 exabytes/year) and Twitter (1.36 petabytes/year). With the aforementioned big data generation come challenges in genomic data acquisition, storage, distribution, and analysis. We focus our effort on addressing the efficient analysis of genomic data, in particular, identifying genomic variants in each individual genome, as it is one of the most computationally demanding pipelines.

Genomic data processing algorithms are composed of a mixture of specific algorithms as well as generic data manipulation operations. For example, the most popular genome sequencing workflow, Broad Institute's Genome Analysis ToolKit 4 (GATK4) Best Practices, consists of stages implementing specific algorithms such as read alignment and variant calling as well as stages performing generic data manipulations such as mark duplicates and base quality score recalibration. Thus far, most prior work focused on the hardware acceleration of specific algorithms such as read alignment1–3 or pair-HMM (hidden Markov model) in variant calling. Such specialized accelerators, targeting a specific implementation of a particular genome sequencing pipeline stage, have demonstrated orders of magnitude speedups and energy-efficiency improvements. With these specific algorithm accelerations in place, the remaining unaccelerated analysis stages that contain data manipulation operations become the bottleneck and a large portion of the genomic analysis execution time, making them good targets for acceleration pursuant to Amdahl's law.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3072385
Date of publication 12 April 2021; date of current version 25 May 2021.


An important aspect of genomic data analysis is that the algorithms are still being refined and special care is needed when proposing hardware acceleration. For example, INDEL realignment was the major performance bottleneck in the now deprecated GATK3 and thus a hardware accelerator targeting the stage was proposed.3 However, GATK4 does not utilize this stage with its updated variant calling algorithms, rendering the proposal suitable largely for legacy pipelines. Similarly, accelerators targeting the pair-HMM algorithms used in the variant calling stages of GATK4 are likely being replaced by the DNN-based algorithm for the same stage. Noting the rapid changes in specific algorithms, we argue that designing accelerators for the generic data manipulation portions of the pipelines is just as important, if not more, than designing accelerators for the specific algorithms.

Thus, we architect and evaluate a flexible acceleration framework that targets generic data manipulation operations commonly used in genomic data processing called Genesis. We view genomic data as traditional data tables and use extended SQL as a domain-specific language to process genomic analytics. Conceptualizing the genomic data as a very large relational database allows us to reason about the algorithms and transforms genomic data processing stages into simple extended SQL-style queries. Once the queries are constructed, Genesis facilitates the translation of the queries into hardware accelerator pipelines using the Genesis hardware library that accelerates primitive operations in database and genomic data processing. As a proof of concept, we accelerate the data preprocessing phase in GATK4 Best Practices and deploy Genesis-generated accelerators on Amazon EC2 F1 instances. We demonstrate that the accelerated system targeting these queries provides a significant performance improvement and cost savings over a commodity CPU.

GENOMIC DATA ANALYTICS
A genome is an organism's complete set of DNAs. For a human genome, each chromosome is represented as a sequence of DNA base pairs expressed as a single character, A, T, C, G, representing a DNA nucleotide base. Genomic analysis uses a DNA sequence to identify variations from a biological sample against a reference genome. Our work focuses on genomic analysis through the next generation sequencing (NGS) technology, the de facto technology for whole genome analysis. In this process, fragmented DNA samples are read by an NGS wet lab instrument. Raw sensor data from the instrument are processed through equipment-specific proprietary software (or hardware), and the instrument outputs processed data called reads. Reads contain multiple fragments from a sequence of base pairs and a sequence of quality scores, where a single quality score represents the machine's confidence in the corresponding base pair measurement. This process of postmeasurement analysis is called the primary analysis, and the outcome of the primary analysis is an input to the secondary analysis. Secondary analysis is a process of identifying genomic variants. Since it is very computationally demanding, this is what most computer software/hardware research (including ours) focuses on. Once these genomic variants are identified, they can be used to analyze the specific characteristics of this DNA (e.g., disease risk).

GENESIS ACCELERATION FRAMEWORK

Representing Genomic Data Analysis as Relational Database Queries
Genesis conceptualizes genomic data as relational data tables and uses SQL as a domain-specific language to represent the target genomic analysis operations and pipeline stages for acceleration. Genomic reads are represented as rows in a table, and attributes associated with each read are represented as columns. We use Illumina sequencer short reads (up to 151 base pairs per read) for a specific human in our evaluated dataset. A reference sequence is fragmented into many segments and each segment is represented as a row in the reference table. We configure a single row in the reference table to have about 1 M base pairs. For the efficient management of those tables, we partition each table into multiple tables by chromosome identifiers, and then again by the mapped position of the reads or the reference data. For both tables, we assign a unique partition ID to the partition. Genesis supports common SQL operations such as Select, Where, GroupBy, Join, Limit (used to select a subset of rows), Count, and Sum. In addition, we support two additional operations, PosExplode and ReadExplode.

May/June 2021 IEEE Micro 43


TOP PICKS

FIGURE 1. Example query, its execution flow, and the Genesis-generated HW pipeline.

for every row that is exploded). ReadExplode converts a that allows us to extract base matching information (Q2).
read, stored as a single row in the read tables, to multiple, Step 5: the number of matching base pairs (i.e., a read’s
separate rows where each row contains the base, the base pair is identical to the reference’s base pair) are
corresponding quality score, and its position. This opera- computed and inserted into the output table (Q3). Gene-
tion converts individual base pairs and corresponding sis Hardware Library: Genesis framework lets a user
quality scores into separate rows utilizing its alignment easily construct a dataflow pipeline that accelerates
information recorded in the metadata called CIGAR the desired target query [e.g., Figure 1(a)]. The key idea
(CIGAR contains base pair alignment information such behind this framework is that a relational query can be
as substitution, matching, insertion, deletion, etc.). decomposed into a series of relational operators. For
Finally, we support iteration over rows with the FOR Row IN example, it is well known that SQL queries can be easily
Table clause, which is similar to that of Oracle PL/SQL. parsed into a tree graph where each node represents a
We use an example to illustrate how to construct table (leaf node) or a relational/computational operator
queries for a genomic data analysis operation and walk (nonleaf node).4 In such a case, if there exists a set of
through the execution of the query using a high-level configurable hardware modules where each of them
block diagram. In this example, the user wants to find can be directly mapped to each relational/computa-
the number of bases that matches the reference for all tional operator, constructing a dataflow pipeline for the
reads whose partition ID is equal to the constant P. In query becomes rather simple. Specifically, each node in
this case, the user can represent this operation as a the graph can be mapped to a Genesis hardware mod-
sequence of SQL queries as shown in Figure 1(a), which ule, and each edge in the graph is mapped to a hardware
essentially follows the execution flow depicted in queue connecting these modules.
Figure 1(b). Step 1: the set of reads and the relevant ref- Each Genesis hardware library module operates
erence with the partition ID (P) are first extracted (I1). with a sequence of data called streams. A stream con-
Step 2: the relevant reference row’s base pair sequence sists of many data items, each of which can contain
is expanded into multiple rows with PosExplode (I2). Step multiple different types of fields. Each data item is
3: for each read in the ReadPartition, its base pairs are con- also divided into multiple flits, where a single flit repre-
verted to a multirow table with ReadExplode (Q1). Step 4: sents the atomic unit of data communication and
inner-join the ReadExplode’ed table and the subset of the operation. For example, when a sequence of reads
PosExplode’ed reference row table (the subset is obtained forms a single stream, each read is a data item, and
with the LIMIT base offset clause) to obtain a joined table each base pair (or multiple base pairs), which is part of
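To make the tables-plus-queries view concrete, the five steps can be mocked up in a few lines of pandas. Everything below (the tiny table contents, the column names, and the simplified ReadExplode that ignores CIGAR and quality scores) is an illustrative stand-in for the article's READS/REFS tables and extended-SQL operators, not Genesis code.

import pandas as pd

# Hypothetical reference table: one row per reference segment, keyed by a partition ID.
refs = pd.DataFrame({"PARTITION": [7], "INITPOS": [1000], "SEQ": ["ACGTA"]})
# Hypothetical read table: one short read per row (CIGAR/QUAL omitted; all bases assumed aligned).
reads = pd.DataFrame({"PARTITION": [7, 7], "POS": [1000, 1002], "SEQ": ["ACG", "TAC"]})

def pos_explode(ref_row):
    # PosExplode: expand one reference row into (POS, BASE) rows starting at INITPOS.
    return pd.DataFrame({"POS": range(ref_row.INITPOS, ref_row.INITPOS + len(ref_row.SEQ)),
                         "BASE": list(ref_row.SEQ)})

def read_explode(read):
    # ReadExplode (simplified): expand one read into (POS, BASE) rows from its mapped position.
    return pd.DataFrame({"POS": range(read.POS, read.POS + len(read.SEQ)),
                         "BASE": list(read.SEQ)})

# I1/I2: select the reference partition P = 7 and explode it into per-base rows.
ref_rows = pos_explode(refs[refs.PARTITION == 7].iloc[0])
# Q1/Q2/Q3: for each read in the partition, explode it, inner-join on POS, and count matches.
matches = 0
for read in reads[reads.PARTITION == 7].itertuples(index=False):
    joined = read_explode(read).merge(ref_rows, on="POS", suffixes=("_READ", "_REF"))
    matches += int((joined.BASE_READ == joined.BASE_REF).sum())
print(matches)  # number of read base pairs identical to the reference (3 for this toy data)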


FIGURE 2. Block diagrams of Genesis data manipulation and computation modules.

a base pair sequence in a read, is a flit. In general, each (2) Memory Access Modules
module consumes (or inspects) a single flit from its Memory Reader reads contiguous data from memory
input queue(s) and generates a single output flit. The and streams the read data to the next module. Given
output flit is then inserted into the output queue, a starting address and the total amount of data to
which will work as an input queue for the next module. read from memory, it continuously sends memory
requests to memory at a memory access granularity
(1) Data Manipulation and Computation (e.g., 64B) as long as its internal prefetch buffer is not
Modules full. At the same time, this module supplies the
Figure 2 shows the block diagrams for four data manip- returned data from memory to the next module at a
ulation and computation modules in Genesis hardware throughput of a single flit per cycle.
library. The figure visualizes the operations of each Memory Writer writes the data coming from an input
module with the example input/output values. Detailed queue to memory. It takes a single flit from the previous
explanations for each module are provided below. module per cycle and temporarily stores it in its internal
Joiner merges flits from two input queues and pro- buffer. Once its internal buffer size reaches the size of
duces a single output. For this module, a flit in an input the memory access granularity (or a specific termination
queue should consist of a key field and a data field. condition), it sends a write request to memory starting
Every cycle, this module compares keys of the flits from the preconfigured starting address.
from two input queues and either outputs or discards SPM (Scratchpad Memory) Reader simply takes an
a single flit with the smaller key while leaving the other address from the input queue and outputs the
one intact. scratchpad read result to the output queue. It can
Filter takes input data from a single queue, checks also be configured to read all elements in the interval
whether it matches the specified comparison condi- when the starting address and the finishing address
tion (across fields or for a field and a constant), and are provided. This module is also used to drain all of
outputs the item if and only if the item satisfies the its content to the output queue when a drain signal is
specified condition. provided.
Reducer takes a sequence of data and performs a SPM Updater takes an address and the value from
reduction operation (e.g., Sum, Max, Min, Count) with an input queue and updates the scratchpad memory.
a reduction tree. For this module, a reduction tree is This module supports three operating modes. First, it
utilized to obtain a reduction result at a throughput of can work like a memory writer, which performs
a single flit per cycle. Note that this module can also sequential writes to the SPM buffer when provided a
support reduction across multiple flits (i.e., reduction starting address. It can also be configured to perform
at an item granularity). a random SPM write, which simply writes the value to
Stream ALU takes input data from a single or two the provided address. Finally, it can be configured to
input queues (or a single input queue and a constant perform a read-modify-write update with the provided
item) and performs a relatively simple unary/binary modify function (e.g., add/subtract a constant).
ALU operation (e.g., NOT, ADD, SUB, CMP, AND, OR,
etc.) with data from those queues. When a single item (3) Genomic Data Processing Modules
contains multiple values, the unary/binary operation is ReadToBases supports the ReadExplode operation. This
performed in an elementwise manner. module takes a sequence of CIGAR, POS, SEQ, and
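The per-flit streaming behavior of these modules can be imitated with ordinary Python generators. The sketch below is only a behavioral reading of the module descriptions (a merge-style Joiner, a predicate Filter, and a folding Reducer); the real modules are Chisel hardware, and the flit format here is a made-up (key, value) pair.

def joiner(left, right):
    # Joiner: compare the key fields of two input streams; on a key match emit the
    # joined flit, otherwise drop the flit with the smaller key and keep the other.
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        (lk, lv), (rk, rv) = l, r
        if lk == rk:
            yield (lk, lv, rv)
            l, r = next(left, None), next(right, None)
        elif lk < rk:
            l = next(left, None)
        else:
            r = next(right, None)

def filt(stream, pred):
    # Filter: forward a flit only if it satisfies the comparison condition.
    for flit in stream:
        if pred(flit):
            yield flit

def reducer(stream, op, init):
    # Reducer: fold the whole stream into a single value (e.g., Count or Sum).
    acc = init
    for flit in stream:
        acc = op(acc, flit)
    return acc

# Usage: join read bases with reference bases by position, keep matches, count them.
read_bases = iter([(1000, "A"), (1001, "C"), (1002, "G")])
ref_bases = iter([(1000, "A"), (1001, "T"), (1002, "G"), (1003, "A")])
matched = filt(joiner(read_bases, ref_bases), lambda f: f[1] == f[2])
print(reducer(matched, lambda acc, _: acc + 1, 0))  # -> 2 matching positions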


optionally QUAL values from the input queues and pro- hardware pipelines targeting different operations to
duces a ReadExplode’ed table. Each cycle, this module work together. Input/output ports of all hardware
outputs the tuple of the reference position, the corre- pipelines’ memory modules are first arbitrated by a
sponding base, and the quality score. The reference local arbiter and then arbitrated again by one of the
position field can be Ins if the base is an inserted base. global arbiters, each of which is connected to one out
Similarly, the base and quality score fields can be Del if of four memory channels in the system. A set of stock
the base is deleted. Genesis modules are often enough to design accelera-
tors for data manipulation operations in the existing
gene processing pipeline. However, different genomics
Constructing Hardware Accelerator data often need different treatment, and thus Genesis
with Genesis Hardware Library is designed to support user-defined modules. Genesis
Genesis accelerates the user-provided query by con- provides a standardized stream-based I/O interface
structing a hardware pipeline using multiple Genesis for all of its modules; a user only has to specify the
hardware modules written in Chisel. For example, the user-defined internal computation hardware in Chisel
SQL query in Figure 1(a) is translated to the hardware to utilize the framework at ease.
pipeline shown in Figure 1(c). The hardware pipeline
has five memory readers, and each reader reads the
data streams from READS.POS, READS.ENDPOS, READS.CIGAR,
EVALUATION
READS.SEQ, and REFS.SEQ. Three of these memory readers Methodology
are connected to the ReadToBases module, which Figure 3(a) shows the runtime breakdown of the
generates a sequence of flits where each flit is a pair GATK4 Best Practices data preprocessing pipeline
of a base and the corresponding reference position. with (bottom) and without (top) recently developed
This generated sequence is then provided as an input hardware alignment accelerator.2 To demonstrate
to the Joiner. Unlike the reads data, the relevant refer- Genesis’s capability to accelerate data manipulation
ence data are mapped to an on-chip SPM to facilitate operations in genomic data analysis, we architected
data reuse. A single memory reader is connected to and implemented hardware accelerators (we call each
the SPM Updater module so that it can initialize the hardware pipeline(s) constructed for a particular algo-
SPM with data from memory. The contents from this rithm an accelerator for the rest of this article to avoid
SPM is retrieved with the SPM Reader, which takes confusion) for three key data manipulation opera-
two inputs from the memory readers (the ones reading tions,5 namely Mark Duplicates, Metadata Update,
READS.POS and READS.ENDPOS), reads the SPM contents for and the table construction phase of BQSR, which
the corresponding interval, and supplies the read data together account for the majority of the runtime in
(i.e., reference base pairs) to the Joiner. The Joiner the data preprocessing phase of GATK4 Best Practi-
takes these two input sequences (i.e., one from the ces. Figure 3(b) shows the block diagrams for these
read, another from the reference), performs an inner- Genesis-generated accelerators.
join, and passes the joined sequence to the Filter, We deployed Genesis-generated accelerators on
which compares two data fields (i.e., the base pair the commercial cloud using the Amazon EC2
from the read and the base pair from the reference), f1.2xlarge instances. Each F1 instance contains a Xilinx
and only outputs the matching items. Finally, the Virtex UltraScale+ VU9P FPGA card. We use a
Reducer module accumulates the number of matched 250 MHz clock for all three accelerators. We configure
base pairs and passes the outcome to the memory the number of pipelines to 1) the resource limit we can
writer, which stores the outcome to memory. The con- fit on one FPGA card or 2) the performance limit where
structed pipeline is fully pipelined and can process an accelerator can no longer get more speedup from
one base pair per cycle. A single pipeline is often insuf- parallelism due to memory or communication bottle-
ficient to fully utilize the available memory bandwidth necks. We used 16 pipelines for mark duplicates, 16
provided to the system. In order to fully utilize the pipelines for metadata update, and 8 pipelines for
available memory bandwidth and achieve high base quality score recalibration.
throughput, it is necessary to exploit abundant paral- To compare our design with the software-only
lelism in genomic data processing operations through implementation, we run GATK version 4.1.3 on an
the use of multiple pipelines. Genesis treats each pipe- Amazon EC2 r5.4xlarge instance that is memory-opti-
line to be independent of each other except that they mized. Large memory is crucial to obtain high perfor-
share memory interfaces and the command interfa- mance for genomic data analysis workloads. For the
ces. This separation allows the utilization of different reads input dataset, we use a well-characterized
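Putting the pieces together, the translation of the example query into a dataflow pipeline can be pictured as plain function composition. The stage and variable names below merely echo the modules of Figure 1(c) (memory readers feeding ReadToBases, a reference partition held in the SPM, then Joiner, Filter, and Reducer) and the use of multiple independent pipelines; they are not Genesis's actual software interface.

def read_to_bases(reads):
    # ReadToBases: expand each read (POS, SEQ) into per-base flits (refpos, base),
    # ignoring CIGAR insertions/deletions for brevity.
    for pos, seq in reads:
        for offset, base in enumerate(seq):
            yield (pos + offset, base)

def pipeline(reads, spm_reference):
    # One fully pipelined accelerator instance: ReadToBases -> Joiner (against the
    # reference held in the scratchpad) -> Filter (base match) -> Reducer (count).
    joined = ((p, b, spm_reference.get(p)) for p, b in read_to_bases(reads))
    return sum(1 for _, b, r in joined if b == r)

# The reference partition is loaded once into the scratchpad (SPM Updater) and then
# reused for every read mapped to this partition (SPM Reader).
spm = {1000 + i: base for i, base in enumerate("ACGTA")}

# Genesis instantiates many independent pipelines; here each one handles a slice of
# the reads and the partial counts are summed, as a stand-in for that parallelism.
read_slices = [[(1000, "ACG")], [(1002, "TAC")]]
print(sum(pipeline(s, spm) for s in read_slices))  # total matching base pairs -> 3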


FIGURE 3. GATK4 Best Practices runtime breakdown and block diagrams for the Genesis-generated accelerators.

Illumina sequencing result of patient NA12878 obt- of the Genesis framework runtime for the three stages
ained from the Broad Institute Public Dataset, and we in the GATK4 data preprocessing phase. The figure
use GRCh38 as the reference genome. shows that the relatively low speedup of mark dupli-
cates stage is due to the unaccelerated software por-
tion of the stage, which is responsible for about 50%
Performance Results of the baseline runtime. Furthermore, the figure indi-
Figure 4(a) shows the speedup of three Genesis accel- cates the metadata update and BQSR speedups are
erators designed to accelerate various stages of the partially limited by the host-FPGA communication
GATK4 data preprocessing phase over the GATK4 (takes 53.4% and 29.5% of the runtime).
software implementations run on a carefully config-
ured 8-core memory-optimized CPU instance. Genesis
achieves an overall speedup of 2× on mark duplicates Cost Comparison
stage, 19.3× on metadata update, and 12.6× on BQSR Many genomic data processing workloads exhibit a
(covariate table construction). For metadata update plethora of parallelism and thus often scale relatively
and BQSR stages, per-chromosome speedups are also well with the increased amount of resources. In such a
presented in Figure 4(c) and (d). Considering that scenario, the cost can be a more meaningful metric
these three stages take about three and a half hours than the raw speedup itself since it considers the
for a single genome to execute (assuming that meta- amount of resources the system utilizes. We compare
data update perfectly scales), Genesis reduces the the cost of running each accelerated stage in the
computation time to process a single person's gene AWS f1 instance ($1.69/hr) with the cost of running
by roughly 140 min. Figure 4(b) shows the breakdown baseline software implementations on r5.4xlarge


FIGURE 4. Performance comparison of the Genesis accelerators over baseline for three GATK4 data preprocessing stages.

instance ($1.29/hr). Compared to the baseline, Genesis queries. In addition, we observe that genomic process-
reduces the cost of genomic data processing by ing algorithms are composed of a mixture of specific
2.08×, 15.05×, and 9.84× for Mark Duplicates, Meta- algorithms as well as generic data manipulation opera-
data Update, and BQSR (table construction) stages. tions and that the generic data manipulation opera-
tions are the common primitives that can be used
CONCLUSION AND IMPLICATIONS across analytics domains. These commonalities allow
Genesis acceleration framework explores a systematic the sharing, reusing, and composition of hardware
and scalable methodology to democratize end-to-end modules across domains, lowering the development
accelerated system development and deployment costs of highly efficient accelerated systems. These
using a software interface that is a standardized lan- commonalities can be applied to many other big data
guage as a domain-specific language and a hardware analytics domains such as graph analytics.
library that is composed of efficient coarse-grained Genesis hardware library demonstrates that con-
primitives. While the work specifically showcases a set structing hardware accelerators by utilizing a set of com-
of de facto algorithms in genomic analytics, the con- posable hardware modules enables a flexible hardware
cepts articulated and demonstrated in this work accelerator design that can be easily extended or
can be applied to many other domains and inspire updated. We present a systematic way for accelerator
accelerated systems research that offers a degree of researchers to decompose the algorithms into primitive
generality without sacrificing the efficiency brought operations and build hardware modules that directly
upon via specialization. map to those primitives as composable hardware blocks
that form a hardware library. The algorithms are then
composed using the hardware modules and the frame-
work for ease of development. In domains where the
GENESIS HARDWARE LIBRARY algorithms are constantly changing, such as genomics,
DEMONSTRATES THAT algorithm changes can be reflected quickly by simply
CONSTRUCTING HARDWARE updating or adding new hardware library components
ACCELERATORS BY UTILIZING A SET OF and recomposing the algorithm using the framework for
COMPOSABLE HARDWARE MODULES rapid deployment. This development methodology can
ENABLES A FLEXIBLE HARDWARE be adopted for various domains such as machine learn-
ACCELERATOR DESIGN THAT CAN BE ing, having hardware library components to execute
matrix–matrix multiply, matrix–vector multiply, etc.
EASILY EXTENDED OR UPDATED.
In the world of accelerator research, efficiently map-
ping software onto custom hardware is a challenging
problem. Researchers solve this problem by either using
In this article, we investigate commonalities high-level synthesis to generate hardware or inventing
between database and genomic analytics domains by new domain-specific languages that ease the mis-
conceptualizing genomic data as a very large rela- matches between software and hardware. In this work,
tional database and mapping genomic data analytics we advocate to leverage an already-standardized lan-
algorithms into one or more relational database guage as the domain-specific language and construct
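As a back-of-the-envelope view of how the cost savings on this page arise, cloud cost is simply the hourly price times the rented time. The snippet below uses the two instance prices quoted in the text but a placeholder runtime and speedup; the reported 2.08x, 15.05x, and 9.84x figures come from the measured runtimes, which are not reproduced here.

F1_PRICE = 1.69   # $/hr for the f1.2xlarge FPGA instance, as quoted above
R5_PRICE = 1.29   # $/hr for the r5.4xlarge CPU baseline, as quoted above

def cost(price_per_hour, runtime_hours):
    # Cost model used for the comparison: hourly price times rented time.
    return price_per_hour * runtime_hours

def cost_saving(baseline_hours, speedup):
    # Cost-saving factor when the accelerated run finishes `speedup` times faster
    # on the FPGA instance than the baseline does on the CPU instance.
    return cost(R5_PRICE, baseline_hours) / cost(F1_PRICE, baseline_hours / speedup)

# Placeholder example: a stage that takes 1 hour on the CPU baseline and is 12.6x faster.
print(round(cost_saving(1.0, 12.6), 2))  # about 9.6x cheaper under these assumptions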


primitive operators that directly map software primi- DAVID BRUNS-SMITH is currently working toward a Ph.D.
tives to hardware blocks to allow efficient mapping of degree with the Department of Electrical Engineering and
software to hardware. The effect of this approach is real-
Computer Sciences, University of California Berkeley, Berke-
ized in 1) the resultant accelerated systems achieving
ley, CA, USA. Contact him at bruns-smith@berkeley.edu.
one order of magnitude better performance and cost-
efficiency on FPGAs-in-the-cloud compared to multi-
BRENDAN SWEENEY is currently working toward a Ph.D
threaded software, and 2) the ease of adoption of the
accelerated systems and the lowering of the barrier to degree with the Electrical and Computer Engineering Depart-
entry for non-hardware-savvy scientists to use the ment, The University of Texas at Austin, Austin, TX, USA. This
accelerated systems, creating broader impacts. work was done when he was an undergraduate student at
UC Berkeley. Contact him at brs@berkeley.edu.

ACKNOWLEDGMENTS YEJIN LEE is currently working toward a Ph.D. degree with the
This work was supported in part by the Korean govern- Computer Science and Engineering Department, Seoul National
ment grant (NRF-2016M3C4A7952587). This work was University, Seoul, South Korea. Contact her at yejinlee@snu.ac.kr.
also funded in part by the Advanced Research Proj-
ects Agency-Energy (ARPA-E), U.S. Department of SEONG HOON SEO is currently working toward a Ph.D
Energy (Award Number DE-AR0000849), ADEPT Lab degree with the Computer Science and Engineering Depart-
industrial sponsor Intel, RISE Lab and APEX Lab spon-
ment, Seoul National University, Seoul, South Korea. Contact
sor Amazon Web Services, and ADEPT Lab affiliates
him at andyseo247@snu.ac.kr.
Google, Siemens, and SK Hynix.

U GYEONG SONG is an undergraduate student with the


Computer Science and Engineering Department, Seoul
REFERENCES
National University, Seoul, South Korea. Contact him at
1. Y. Turakhia, G. Bejerano, and W. J. Dally, “Darwin: A
thddnrud2010@gmail.com.
genomics coprocessor provides up to 15000x
acceleration on long read assembly,” in Proc. Int. Conf.
Archit. Support Program. Lang. Oper. Syst., 2018, YOUNG H. OH is currently working toward a Ph.D degree with
pp. 199–213. the Semiconductor System Engineering Department, Sung-
2. D. Fujiki et al., “GenAX: A genome sequencing kyunkwan University, Seoul, South Korea. Contact him at
accelerator,” in Proc. Annu. Int. Symp. Comput. Archit., younghwan@skku.edu.
Jun. 2018, pp. 69–82.
3. L. Wu et al., “FPGA accelerated INDEL realignment in KRSTE ASANOVIC is a Professor with the Computer Science
the cloud,” in Proc. Int. Symp. High-Perform. Comput.
Division, Electrical Engineering and Computer Science Depa-
Archit., 2019, pp. 277–290.
rtment, University of California Berkeley, Berkeley, CA, USA.
4. Oracle, “Database SQL Tuning Guide—SQL Process,”
Contact him at krste@berkeley.edu.
[Online]. Available: https://docs.oracle.com/database/
121/TGSQL/tgsql_interp.htm#TGSQL94618
5. T. J. Ham et al., “Genesis: A. hardware acceleration JAE W. LEE is an Associate Professor of computer science
framework for genomic data analysis,” in Proc. Int. and engineering (CSE) with Seoul National University (SNU),
Symp. Comput. Archit., Jun. 2020, pp. 254–267. Seoul, South Korea. Contact him at jaewlee@snu.ac.kr.

TAE JUN HAM is a Postdoctoral Researcher with Seoul LISA WU WILLS is the Clare Boothe Luce Assistant Professor
National University, Seoul, South Korea. Contact him at of Computer Science and ECE with Duke University, Durham,
taejunham@snu.ac.kr. NC, USA. Contact her at lisa@cs.duke.edu.



THEME ARTICLE: TOP PICKS

uGEMM: Unary Computing for GEMM


Applications
Di Wu , Jingjie Li , Ruokai Yin , Younghyun Kim , and Joshua San Miguel , University of
Wisconsin–Madison, Madison, WI, 53706, USA
Hsuan Hsiao, University of Toronto, Toronto, ON, M5S 1A1, Canada

General matrix multiplication (GEMM) is pervasive in various domains, such as


signal processing, computer vision, and machine learning. Conventional binary
architectures for GEMM exhibit poor scalability in area and energy efficiency, due to
the spatial nature of number representation and computing. On the contrary, unary
computing processes data in temporal domain with extremely simple logic.
However, to date, there rarely exist efficient architectures for unary GEMM. In this
work, we first present uGEMM, a hardware-efficient unary GEMM architecture
enabled by universally compatible arithmetic units, which simultaneously achieves
input-insensitivity and high output accuracy. Next, we demonstrate that the
proposed uGEMM can reliably early terminate the computation and offers dynamic
energy-accuracy scaling for real-world applications via an accuracy-aware metric.
Finally, to propel the future research for unary computing, we open source our unary
computing simulator, UnarySim.

General matrix multiplication (GEMM) is ubiqui- computing,4 temporal-coded race logic3), with each
tous and essential in many applications, partic- designed for application-specific use cases.1 Stochas-
ularly emerging deep learning. Conventional tic computing supports both arithmetic and relational
binary GEMM implementations can leverage either hard- operations but can often be inaccurate,5 while race
ware for efficiency or software for flexibility.1 However, logic, though accurate, supports only limited arithme-
as the parallelism increases, binary GEMM implementa- tic operations and is not applicable to GEMM.1
tions, which compute on multiple parallel bits, suffer Such disparities and limitations pose three funda-
from poor hardware area, power, and energy efficiency mental challenges to an ideal unary GEMM architec-
due to exponentially growing wire congestion. ture. 1) How can a designer figure out which unary
To enable efficient GEMM processing on extremely computing units would work? 2) How can a designer
area-, power-, or energy-constrained devices, unary ensure reliability, i.e., accuracy, while improving energy
computing has been leveraged in prior works,2,3 which efficiency? 3) How can a designer simulate and
employ extremely simple logic that consumes unary explore the disparate design space? We propose the
data in the form of serial bit streams. Unary computing unified unary GEMM architecture, dubbed uGEMM,
converts computations from the spatial domain (i.e., along with our UnarySim simulator to offer answers to
traditional parallel binary computing) to the temporal these questions, striving to make unary computing a
domain, enabling ultra-low-power hardware at the first-class citizen in resource-constrained systems.
cost of longer latency. There are, however, a widely To tackle the first challenge, we present new
disparate set of unary computing schemes proposed mechanisms for unary arithmetic units and design uni-
in the literature (e.g., rate-coded stochastic versally compatible microarchitectures. These units
are highly accurate and support arbitrary input bit
streams, i.e., input-insensitivity, as indicated in the
input in Figure 1. Furthermore, they process bit
streams in a fully streaming manner without unary–
binary data interconversion, allowing for streaming
consecutive uGEMM blocks in Figure 1. The resultant


1) and 2 × P(S = 1) - 1, respectively. Any real number


can be scaled and mapped into a bit stream.
Stochastic computing performs computation by
manipulating input bit streams statistically. In sto-
chastic computing, an adder with two inputs A and B
can be implemented by a two-input multiplexer (MUX)
FIGURE 1. Illustrative example of uGEMM. whose select signal is a stochastic bit stream with
P(S = 1) = 0.5. This computes Vout = (VA + VB)/2,
where a scaling factor of 2 is introduced to prevent
uGEMM, a multiply-accumulate (MAC) array of these overflow. Besides the scaled addition using MUX, uni-
novel units, naturally inherits all microarchitecture- polar nonscaled addition can be done with an OR gate.
level benefits and outperforms counterparts. Then, Multiplication can be achieved using an AND gate for
the second challenge is solved via a novel metric, unipolar bit streams and an XNOR gate for bipolar bit
named stability, which monitors the output accuracy streams.
in the temporal domain. Understanding the accuracy Although such extreme simplicity offers benefits in
evolution through time, we can early terminate the hardware efficiency, stochastic computing suffers the
computation, as shown in the output in Figure 1, to correlation problem5 (i.e., two bit streams are more
achieve a desired efficiency reliably, i.e., in a controlla- correlated when they are more similar), leading to
ble manner. The final challenge on the design space inaccuracy. For example, an AND gate only accurately
exploration is also addressed in this work by open- computes the product of two input bit streams when
sourcing our unary computing simulator, UnarySim, they have zero correlation. Otherwise, if more paired
based on which the evaluations at microarchitecture, ones appear than expected, it ends up computing the
architecture, and application levels can be minimum function,7 as shown in Figure 2. Existing sol-
performed.1,6,7 utions raise the demands for costly RNGs and
increase the latency to achieve accurate results.8,9
BACKGROUND
According to the data encoding methods, we catego- Temporal-Coding-Based Unary
rize unary computing into rate-coded stochastic com- Computing
puting4 and temporal-coded race logic,3 with Temporal coding is applied in race logic, which enco-
examples of data representations and limitations des data into the timing of a signal’s transition (or
shown in Figure 2. edge), with the bit stream as a chain of ones followed
by a chain of zeros, or vice versa, and each bit gener-
ated by comparing source data with a counter out-
Rate-Coding-Based Unary Computing
put.3 The polarity categorization of temporal-coded bit
Stochastic computing adopts rate coding, whose data
streams can follow that in stochastic computing.
value relies on the frequency of ones and zeros in the
In race logic, an AND gate and an OR gate now per-
bit stream, with each bit generated by comparing
form minimum and maximum functions, respectively,
source data with a random number generator (RNG).4
unlike the unipolar multiplication and unipolar non-
According to the polarity, bit streams can be in unipo-
scaled addition in stochastic computing. As the signal
lar or bipolar (unsigned or signed) formats. Given a bit
edges are deterministic, race logic can compute the
stream with the probability/frequency of ones as
minimum and maximum accurately. However, prior to
P(S = 1) ∈ [0, 1], unipolar and bipolar values are P(S =
our work, multipliers and adders have never been pro-
posed for temporal computing.
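A few lines of Python make the two codings and the correlation problem concrete. This sketch is our own illustration rather than material from the uGEMM artifact, and the stream length and seeds are arbitrary.

import random

def rate_stream(value, length, rng):
    # Rate coding: each bit is 1 with probability `value` (unipolar, value in [0, 1]).
    return [1 if rng.random() < value else 0 for _ in range(length)]

def temporal_stream(value, length):
    # Temporal coding: a run of ones whose length encodes the value, followed by zeros.
    ones = round(value * length)
    return [1] * ones + [0] * (length - ones)

def and_gate(a, b):
    return [x & y for x, y in zip(a, b)]

def decode(stream):
    return sum(stream) / len(stream)

L = 1024
# Independent RNGs: AND approximates the product 0.5 * 0.5 = 0.25.
a = rate_stream(0.5, L, random.Random(1))
b = rate_stream(0.5, L, random.Random(2))
print(round(decode(and_gate(a, b)), 2))   # close to 0.25

# Same RNG sequence (fully correlated streams): AND degenerates toward min(0.5, 0.5).
a = rate_stream(0.5, L, random.Random(3))
b = rate_stream(0.5, L, random.Random(3))
print(round(decode(and_gate(a, b)), 2))   # close to 0.50, not the product

# Temporal coding: the same AND gate computes the minimum exactly (an OR gives the maximum).
t1, t2 = temporal_stream(0.3, L), temporal_stream(0.7, L)
print(round(decode(and_gate(t1, t2)), 2)) # 0.3 (minimum)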

UGEMM MICROARCHITECTURE
AND ARCHITECTURE
In this section, we present novel linear functional unit
designs in Figure 3, including multiplication (uMUL),
scaled, and nonscaled additions (uSADD and
uNSADD), which are universally compatible to varying
input codings, and the integrated uGEMM architec-
ture in Figure 4.

FIGURE 2. Unary computing data representations and limitations.


FIGURE 3. uGEMM functional units. Thick line: binary signal; and thin line: unary signal.

uMUL counts. Therefore, the proposed uSADD first collects all


The proposed unipolar and bipolar uMUL are shown in input bits with the parallel counter and compute the
Figure 3(a) and (b). We recognize that unary multiplica- sum in the accumulator as in block  1 . Then, the output

tion using a naive AND gate [ 2 in Figure 3(a)] conditionally is set to the carry bit of the accumulator in block  2.

produces a valid output bit when input 0 is logic one, This means that only when there are N ones in the input,
implying a conditional bit stream generation for input 1 a logic one at the output will be generated, exactly the
[1 in Figure 3(a)]. More specifically, only when input 0 is output scaling. Such a carry bit overflow mechanism no
logic one, the RNG inside the bit stream generator for longer considers the correlation between the input bit
input 1 will update, and the generator will eventually gen- streams, and obtains accurate results.
erate a new bit, i.e., input 0 is an enable signal to the bit
stream generator. As such, we thoroughly eliminate the
uNSADD
correlation problem and achieve high accuracy with
Nonscaled addition calculates the clipped sum of all
merely an extra enable signal. Then, for bipolar multipli-
inputs, and the clipping happens when the sum over-
cation using an XNOR gate, we decompose the XNOR gate
flows/underflows the legal unary data range. The par-
to two AND gates, with each leveraging the conditional
allel counter and accumulator in block  1 are the same
bit stream generation for high accuracy.
as in uSADD, except that now the accumulation out-
put is entirely utilized, rather than only the carry bit.
uSADD Next, a subtraction between the offset and the accu-
Scaled addition calculates the average of all inputs, mulation result is performed in block  2 to retrieve the

where the average comes via scaling the output. For anticipated count of output ones. The offset is of
scaled addition, the unipolar and bipolar microarchitec- value 0 and ðN  1Þ=2 for unipolar and bipolar bit
tures are identical, as in Figure 3 (c). The takeaway here streams, respectively. Finally, block 3 generates the

is that the input average is the mean of input-one output based on the anticipated one count and the

FIGURE 4. uGEMM architecture and its PE. Thick line: multibit stream; and thin line: single-bit stream.
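The uMUL idea can be read as an enable-gated bit-stream generator: input 0 gates when the generator for input 1 may advance, so even a shared RNG sequence cannot correlate the two operands. The following is a behavioral sketch of that mechanism as described in the text, not the actual RTL.

import random

def bitstream(value, length, rng):
    # Unipolar stream: a bit is 1 when the RNG sample falls below the encoded value.
    return [1 if rng.random() < value else 0 for _ in range(length)]

def umul_unipolar(in0_bits, value1, rng):
    # uMUL: the generator for input 1 is clocked only on cycles where the input-0 bit
    # is 1 (input 0 acts as an enable), so correlation with input 0 cannot distort it.
    out = []
    for bit0 in in0_bits:
        if bit0:
            out.append(1 if rng.random() < value1 else 0)  # fresh bit of input 1, ANDed with 1
        else:
            out.append(0)                                   # AND with 0 is 0; the generator stalls
    return out

L = 4096
in0 = bitstream(0.6, L, random.Random(0))
product = umul_unipolar(in0, 0.5, random.Random(0))  # same seed on purpose: harmless here
print(round(sum(product) / L, 2))                    # close to 0.6 * 0.5 = 0.30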


actual one count in the accumulator via comparison.
When there are more anticipated ones than actual,
logic ones will be output every cycle until they are
equal. Similar to uSADD, uNSADD computes based on
one counts and solves the correlation issue. Note that
we are the first to support bipolar nonscaled addition
in unary computing.

Integrated uGEMM Architecture
Our uGEMM can simply be built on above units, as
they are universally compatible to varying inputs (i.e.,
input-insensitive) and directly take in/produce bit
streams as inputs and outputs (i.e., fully streaming
process). We use O = A × B + C as an example to illus-
trate the uGEMM architecture, where O, A, B and C are
of size (m × n), (m × k), (k × n), and (m × n), respec-
tively. Figure 4(a) shows the uGEMM architecture.
Input A, B, and C go through the m-by-n processing
element (PE) array in the center of the figure. As
shown in Figure 4(b), the (i, j)th PE performs the MAC
of the ith row from A and the jth column from B, with k
elements for each row and column, then adds the
(i, j)th element from C as the result. Inheriting from
the features of uMUL, uSADD, and uNSADD, uGEMM
achieves 1) high parallelism due to simple unary logic,
2) input-insensitivity and high output accuracy by
diminishing the correlation problem, and 3) fully
streaming process to enable reliable early termination
for controllable energy efficiency boost. More details
on related mathematical theories and walkthrough
examples can be found in Wu et al.'s work.1

TABLE 1. Hardware comparison of unary GEMM implementations.

GEMM     Area (mm²)   Power (W)   Latency (μs)   Energy (μJ)
Unipolar scaled
uGEMM    0.43         0.15        0.64           0.07
Gaines   1.57         0.50        0.64           0.32
Sim      0.52         0.18        0.64           0.11
Jenson   0.08         0.02        163.84         3.91
Najafi   1.26         0.46        0.64           0.29
Unipolar nonscaled
uGEMM    0.44         0.12        0.64           0.07
Gaines   1.55         0.49        0.64           0.31
Sim      0.50         0.15        0.64           0.09
Bipolar scaled
uGEMM    0.76         0.21        0.64           0.13
Gaines   1.56         0.50        0.64           0.32
Sim      0.53         0.18        0.64           0.12
Jenson   0.08         0.02        163.84         3.90
Najafi   1.25         0.45        0.64           0.29
Bipolar nonscaled
uGEMM    0.77         0.20        0.64           0.13

ENERGY-ACCURACY SCALING
EVALUATION
We list the hardware implementation result of uGEMM
in Table 1. Here, we fix the GEMM shape to m = k =
n = 16, and evaluate uGEMM against Gaines',4 Sim's,10
Jenson's8 and Najafi's9 unary schemes. Note that
uGEMM is the only one supporting bipolar nonscaled
addition and that Najafi's design does not support
temporal coding. We find that uGEMM consumes
small area and low power, outperforming Gaines',
Sim's and Najafi's designs. Jenson's unary GEMM is
significantly more lightweight than the others due to
its resource sharing scheme. However, Jenson's
design introduces long latency, resulting in higher
energy consumption.
Early terminating a bit stream in the temporal
domain leads to a partial bit stream, which represents
a less accurate value, but reduces the latency to offer
higher energy efficiency. As a newly proposed metric
for the capability of early termination in this work,
stability measures how early a bit stream's value sta-
bilizes/converges, given a specific error budget. For a
bit stream of length L, VL represents its expected
value and Vl represents the value of the partial bit
stream based on the first l bits (l ≤ L). If starting from
the lth bit, the bias ΔVl = |Vl - VL| is consistently
smaller than a user-defined threshold VTHD, then sta-
bility is calculated as follows:

    Stability = 1 - max{l | ΔVl > VTHD} / L.    (1)

Stability ranges in [0,1], with a higher value indicat-
ing earlier termination. Usually, rate coding yields
higher stability than temporal coding. Given a VTHD,
the resultant maximum l is defined as the stable point,
after which the output accuracy, statistically, will
never drop below the threshold. More details on the
usage of this metric at the microarchitecture, archi-
tecture, and application levels can be found in Wu
et al.'s works.1,6
In addition to high accuracy, our unary multiplication
and addition designs have consistently higher


simulation of general unary architectures and applica-


tions. Given how diverse and specialized prior
unary works are, the gap between the demand for
ultra-low-power architecture exploration and the lack of
off-the-shelf toolchains motivated us to provide a stan-
dard means for characterizing different unary designs
and fairly quantifying their tradeoffs by open-sourcing
our unary computing simulator: UnarySim.12

IN ADDITION TO HIGH ACCURACY, OUR


UNARY MULTIPLICATION AND
ADDITION DESIGNS HAVE
CONSISTENTLY HIGHER STABILITY
THAN THAT OF PRIOR COUNTERPARTS.
FIGURE 5. Progressive accuracy (curves) and stability (dots)
comparison of GEMMs. Cycle ranges from 1 to 256 in
uGEMM, Gaines,’ Sim’s and Najafi’s; and from 1 to 2562 in Jen- UnarySim is a cycle-accurate simulator to capture
son’s. RC: rate-coded input; TC: temporal-coded input. the precise behavior in a unary computing architec-
ture, as shown in Figure 6. The entire simulator con-
sists of two parts, including the software simulation
stability than that of prior counterparts. Such high sta- and the hardware implementation. The backbone of
bility indicates that the maximum l is smaller for our the software simulation is the PyTorch deep learning
designs, implying that under the same error threshold, framework from Facebook; as such, UnarySim natu-
our design can early terminate the computation earlier rally supports deep learning applications and is highly
than others to boost energy efficiency. By setting scalable and extensible. UnarySim inherits the high
VTHD ¼ 0:05, we present the accuracy evolution scalability of PyTorch by decoupling the high-level
through time (i.e., progressive accuracy) and the stable architecture design and the low-level performance
point for various unary GEMM implementations in simulation. The key simulated components are cate-
Figure 5. We observe that uGEMM outperforms all gorized into stream, kernel, and metric as follows.
others in the final accuracy for both inputs, as well as
their difference, suggesting input-insensitivity. Then, 1) The stream components are used to generate
the stable points of uGEMM are also closer to the y- the bit streams using either rate or temporal
axis in this plot, demonstrating that with the same error coding and manipulate a pair of bit streams to
budget, early termination in uGEMM happens earlier exhibit a specific correlation, so as to cover
than others, providing even higher energy efficiency major use cases in existing literature.1,4,5,7
benefits on top of the results in Table 1. Therefore, for 2) The kernel components refer to the unary com-
each configuration and input type, uGEMM outper- puting units for both arithmetic (linear and non-
forms all other approaches in terms of how early its linear) and logical operations. The proposed
accuracy can stabilize, i.e., high output stability. uGEMM microarchitectures and architecture
Above results demonstrate that uGEMM is are all covered in this category, as well as some
more suitable than its counterparts for ultra-low-
power architectures, especially considering its reliable
energy efficiency boost via the accuracy-aware metric.

UNARYSIM: A UNARY COMPUTING


SIMULATOR
Although there exist tools for synthesizing unary cir-
cuits,11 there are still little to no publicly available tool-
FIGURE 6. UnarySim diagram.
chains for rapid design space exploration and


other general operations for deep learning and demonstrate the accuracy-aware energy-accuracy
computer vision. scaling for uGEMM using our stability metric, which
3) The metric components include key unary com- offers further opportunities for energy savings via
puting metrics in existing literature, like correla- early termination. Finally, we introduce our dedicated
tion,5 accuracy, and our proposed stability simulator for unary computing, UnarySim, which can
metric.1,6 Those metrics can be seamlessly pinned significantly lower the learning curve toward rapid and
to any node in the system during simulation and reliable unary computing research.
monitor the system status over time, which can
be used to tune the system accuracy and latency.
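In the same spirit, a unary pipeline can be mocked up directly in PyTorch. The class names below (RateStream, UnaryMul, ProgressiveAccuracy) are hypothetical stand-ins for the stream, kernel, and metric roles; they are not UnarySim's actual API, for which the repository cited in reference 12 is the authoritative source.

import torch

class RateStream(torch.nn.Module):
    # Hypothetical stream component: emits one rate-coded bit tensor per call.
    def __init__(self, value):
        super().__init__()
        self.value = value
    def forward(self):
        return (torch.rand_like(self.value) < self.value).float()

class UnaryMul(torch.nn.Module):
    # Hypothetical kernel component: bitwise AND of two unipolar bit tensors.
    def forward(self, a, b):
        return a * b

class ProgressiveAccuracy(torch.nn.Module):
    # Hypothetical metric component: tracks the running value of an output stream.
    def __init__(self):
        super().__init__()
        self.ones = 0.0
        self.cycles = 0
    def forward(self, bit):
        self.ones = self.ones + bit
        self.cycles += 1
        return self.ones / self.cycles

a_gen = RateStream(torch.tensor([0.6]))
b_gen = RateStream(torch.tensor([0.5]))
mul, probe = UnaryMul(), ProgressiveAccuracy()
for _ in range(1024):                  # one simulated cycle per loop iteration
    value = probe(mul(a_gen(), b_gen()))
print(value)                           # tensor close to 0.6 * 0.5 = 0.3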
REFERENCES
1. D. Wu et al., “UGEMM: Unary computing architecture
for GEMM applications,” in Proc. 47th Annu. Int. Symp.
WE INTRODUCE OUR DEDICATED
Comput. Archit., 2020, pp. 377–390, doi: 10.1109/
SIMULATOR FOR UNARY COMPUTING,
ISCA45697.2020.00040.
UNARYSIM, WHICH CAN
2. V. T. Lee, A. Alaghi, R. Pamula, V. S. Sathe, L. Ceze, and
SIGNIFICANTLY LOWER THE LEARNING
M. Oskin, “Architecture considerations for stochastic
CURVE TOWARD RAPID AND RELIABLE computing accelerators,” IEEE Trans. Comput.-Aided
UNARY COMPUTING RESEARCH. Design Integr. Circuits Syst., vol. 37, no. 11, pp. 2277–
2289, Nov. 2018, doi: 10.1109/TCAD.2018.2858338.
3. A. Madhavan, T. Sherwood, and D. Strukov, “Race logic:
A hardware acceleration for dynamic programming
After the system is fine-tuned, these metric compo- algorithms,” in Proc. 41st Annu. Int. Symp. Comput.
nents can be removed to speed up simulation. All of Archit., 2014, pp. 517–528, doi: 10.1109/
these components in software simulation are callable ISCA.2014.6853226.
PyTorch modules, which follows the PyTorch program- 4. B. R. Gaines, “Stochastic computing systems,” in Proc.
ming rules, and are provided with test examples for clar- Adv. Inf. Syst. Sci., 1969, pp. 37–172, doi: 10.1007/978-1-
ity. With the growing popularity of PyTorch in both the 4899-5841-9_2.
academia and industry, we believe that UnarySim offers 5. A. Alaghi and John P. Hayes, “Exploiting correlation in
a low barrier-to-entry for researchers: further extending stochastic circuit design,” in Proc. 31st Int. Conf.
UnarySim to support more stream, kernel and metric Comput. Des., 2013, pp. 39–46, doi: 10.1109/
components simply requires constructing additional ICCD.2013.6657023.
PyTorch modules. In terms of the hardware implementa- 6. D. Wu, R. Yin, and J. SanMiguel, “Normalized stability: A
tion, we provide component-level implementation exam- cross-level design metric for early termination in
ples for the proposed and evaluated designs in this work, stochastic computing,” in Proc. 26th Asia South Pacific
as well as those in prior works.7 Note that current metric Des. Autom. Conf., 2021, pp. 254–259, doi: 10.1145/
components are merely simulated in software for perfor- 3394885.3431549.
mance evaluation and do not have correspondent hard- 7. D. Wu and J. SanMiguel, “In-stream stochastic division
ware implementations. Future works for this simulator and square root via correlation,” in Proc. IEEE/ACM
will be the extension of the components to better sup- 56th Des. Autom. Conf., 2019, pp. 1–6, doi: 10.1145/
port broader applications and the automation of execut- 3316781.3317844.
ing arbitrary algorithms on unary hardware. 8. D. Jenson and M. Riedel, “A deterministic approach to
stochastic computation,” in Proc. 35th Int. Conf.
CONCLUSION Comput.-Aided Des., 2016, pp. 1–8, doi: 10.1145/
Unary computing has gained growing research atten- 2966986.2966988.
tion in the past decade in the fields of signal process- 9. M. H. Najafi, D. Jenson, D. J. Lilja, and M. D. Riedel,
ing, computer vision, and machine learning, etc., due “Performing stochastic computation deterministically,”
to its ultra-low power consumption. In this work, to IEEE Trans. Very Large Scale Integration (VLSI) Syst.,
further promote the research of unary computing into vol. 27, no. 12, pp. 2925–2938, Dec. 2019, doi: 10.1109/
broader disciplines, we focus on architecting GEMM, TVLSI.2019.2929354.
which is ubiquitous in applications. We first present 10. H. Sim and J. Lee, “A new stochastic computing
our uGEMM microarchitecture and architecture with multiplier with application to deep convolutional
benefits in input-insensitivity, output-stability, and neural networks,” in Proc. 54th Des. Autom. Conf.,
hardware efficiency compared to prior art. Then, we 2017, pp. 1–6, doi: 10.1145/3061639.3062290.


11. K. Daruwalla, H. Zhuo, R. Shukla, and M. Lipasti, “BitSAD Prof. Joshua San Miguel on unary computing and applying
v2: Compiler optimization and analysis for bitstream this paradigm to multiple real-world applications. Contact
computing,” ACM Trans. Archit. Code Optim., vol. 16,
him at ryin25@wisc.edu.
no. 4, pp. 1–25, 2019, doi: 10.1145/3364999.
12. D. Wu and R. Yin, “UnarySim.” Accessed: Jan. 31, 2021.
YOUNGHYUN KIM is currently an Assistant Professor of Elec-
[Online]. Available: https://github.com/diwu1990/
UnarySim trical and Computer Engineering at the University of Wiscon-
sin-Madison, Madison, WI, USA. His research interests include
DI WU is currently a Ph.D. candidate with the Department of energy-efficient computing, machine learning at the edge, and
Electrical and Computer Engineering, University of Wiscon- cyber-physical systems. Kim received a Ph.D. degree in elec-
sin–Madison, Madison, WI, USA. His research interests include trical engineering and computer science from Seoul National
stochastic and approximate computing, as well as numerical University in 2013. Before joining University of Wisconsin-
optimization for deep neural networks. Wu received B.S. and Madison in 2016, he was a postdoc at Purdue University,
M.Eng. degrees from Fudan University in 2012 and 2015, West Lafayette, IN, USA. He is a member of IEEE and ACM.
respectively. He is a member of ACM. He is the corresponding Contact him at younghyun.kim@wisc.edu.
author of this article. Contact him at di.wu@ece.wisc.edu.
JOSHUA SAN MIGUEL is currently an Assistant Professor at
JINGJIE LI is currently a Ph.D. candidate with the Department the University of Wisconsin-Madison, Madison, WI, USA. His
of Electrical and Computer Engineering, University of Wiscon- research interests include stochastic and approximate com-
sin–Madison, Madison, WI, USA. His research interests include puting, intermittent computing, and interconnection networks.
human-centered computing, Internet of Things, and embed- San Miguel received a Ph.D. degree in electrical and computer
ded systems. Li received a B.S. degree from Beijing Institute of engineering from the University of Toronto. He is a member of
Technology and a B.Eng. (R&D) degree (Hons.) from The Aus- IEEE and ACM. Contact him at jsanmiguel@wisc.edu.
tralian National University in 2017. He is a Student Member of
IEEE and ACM. Contact him at jingjie.li@wisc.edu. HSUAN (JULIE) HSIAO is currently a Ph.D. candidate with the
Edward S. Rogers Sr. Department of Electrical and Computer
RUOKAI YIN is currently a senior undergraduate at the Uni- Engineering, University of Toronto, Toronto, ON, Canada. Her
versity of Wisconsin-Madison, Madison, WI, USA, majored in research interests include stochastic computing and com-
electrical engineering and computer science. His research puter-aided design. Hsiao received an M.A.Sc. degree in elec-
interests include design high-performance computer archi- trical and computer engineering from the University of
tecture for machine learning. Currently, he is working with Toronto. Contact her at julie.hsiao@mail.utoronto.ca.



THEME ARTICLE: TOP PICKS

BabelFish: Fusing Address Translations for


Containers
Dimitrios Skarlatos, Umur Darbaz, Bhargava Gopireddy, Nam Sung Kim , and Josep Torrellas, University of
Illinois at Urbana-Champaign, Champaign, IL, 61820, USA

Cloud computing has begun a transformation from using virtual machines to using
containers. Containers are attractive because of their “build once, run anywhere”
computing model and their minimal performance overhead. Cloud providers
leverage the lean nature of containers to run hundreds of them or more on a few
cores. Furthermore, containers enable the serverless paradigm, which involves the
creation of short-lived processes. In this work, we identify that containerized
environments create page translations that are extensively replicated across
containers in the TLB and in page tables. The result is high TLB pressure and
redundant kernel work during page table management. To remedy this situation,
this article proposes BabelFish, a novel architecture to share page translations
across containers in the TLB and in page tables. BabelFish reduces the mean and
tail latency of containerized workloads, cold-start effects of function execution, and
container bring-up time. This work also advocates for the need to provide more
hardware support for containerized and serverless environments.

Cloud computing has been undergoing a radical foundation for Serverless computing,4 a new cloud
transformation with the emergence of Con- computing paradigm provided by services like Ama-
tainers.1 Like a virtual machine (VM), a con- zon’s Lambda, Microsoft’s Azure Functions, Google’s
tainer packages an application and all of its Cloud Functions, and IBM’s Cloud Functions. The
dependencies, libraries, and configurations, and iso- most popular use of serverless computing is known as
lates it from the system it runs on. However, while Function-as-a-Service (FaaS). In this environment, the
each VM requires a guest operating system (OS), mul- user runs small code snippets called functions, which
tiple containers share a single kernel. As a result, con- are triggered by specified events. The cloud provider
tainers require significantly fewer memory resources automatically scales the number and type of functions
and have lower overheads than VMs. For these rea- executed based on demand, and users are charged
sons, cloud providers such as Google’s Compute only for the amount of time a function spends
Engine, Amazon’s ECS, IBM’s Cloud, and Microsoft’s computing.5,6
Azure now provide container-based solutions. The Our detailed analysis of containerized environ-
most prominent container solution is Docker contain- ments reveals that, very often, the same virtual page
ers. In addition, there are management frameworks, number (VPN) to physical page number (PPN) transla-
such as Google’s Kubernetes2 and Facebook’s Twine,3 tion, with the same permission bit values, is replicated
which automate the deployment, scaling, and mainte- in the TLB and in page tables. One reason for this is
nance of containerized applications. that containerized applications are encouraged to cre-
Container environments are typically oversub- ate many containers, as doing so simplifies scale-out
scribed, with many more containers running than management, load balancing, and reliability.7,8 In such
cores. Moreover, container technology has laid the environments, applications scale with additional con-
tainers, which run the same application on different
sections of a common dataset. While each container
serves different requests and accesses different data,
a large number of the pages accessed is the same
across containers.


Another reason for the replication is that contain-


ers are created with forks, which replicate translations.
Further, since containers are stateless, data are usually
accessed through the mounting of directories and the
memory mapping of files, which further creates transla-
tion sharing. Also, both within and across applications,
containers often share middleware. Finally, the light-
weight nature of containers encourages cloud pro- FIGURE 1. Page table walk.
viders to deploy many containers in a single host.9 All
of this leads to numerous replicated page translations.
Unfortunately, state-of-the-art TLB and page table
hardware and software are designed for an environ-
ment with few and diverse application processes. This
WE PROPOSE BABELFISH, A NOVEL
has resulted in per-process tagged TLB entries, sepa-
ARCHITECTURE TO SHARE
rate per-process page tables, and lazy page table man-
TRANSLATIONS ACROSS CONTAINERS
agement, where rather than updating the page
IN THE TLB AND IN PAGE TABLES, translations at process creation time, they are
WITHOUT SACRIFICING THE updated later on demand. In containerized environ-
ISOLATION PROVIDED BY THE ments, this approach causes high TLB pressure,
VIRTUAL MEMORY ABSTRACTION. redundant kernel work during page table management
and, generally, substantial overheads.

HANDLING TLB MISSES IN X86 LINUX

When a processor access misses in both L1 and
L2 TLBs, a page table walk begins. This is a
multistep process performed in hardware. Figure 1
uploads into the L1 TLB to proceed with the translation
of the virtual address.
In theory, a page walk involves four cache hierarchy
shows the page walk for an address in the x86-64 accesses. In practice, a core has a translation cache
architecture. The hardware reads the CR3 control called the page walk cache (PWC) that stores a few
register, which contains the physical address of the recently accessed entries of the first three tables (PGD,
Page Global Directory (PGD) of the currently running PUD, and PMD). The hardware checks the PWC before
process. The hardware adds the 40-bit CR3 register to going to the cache hierarchy. If it hits there, it avoids a
bits 47-39 of the virtual address. The result is the cache hierarchy access.
physical address of an entry in the PGD. The hardware When this translation process fails, a page fault
reads such address from the memory hierarchy— occurs and the OS is invoked. There are two relevant
accessing first the data caches and, if they declare a types of page faults: major and minor. A major one
miss, the main memory. The data in that address occurs when the page for one of these physical
contain the physical address of the page upper directory addresses requested during the walk is not in memory.
(PUD), which is the next level of the translation. Such In this case, the OS fetches the page from disk into
physical address is then added to bits 38-30 of the memory and resumes the translation. A minor page fault
virtual address. The contents of the resulting address occurs when the page is in memory, but the
are the physical address of the next-level table, the page corresponding entry in the tables says that the page is
middle directory (PMD). The process is repeated using not present in memory. In this case, the OS simply marks
bits 29-21 of the virtual address to reach the next table, the entry as present, and resumes the translation. This
the Page Table (PTE). In this table, using bits 20-12 of the happens, for example, when multiple processes share
virtual address, the hardware obtains the target physical the same physical page. Even though the physical page
table entry (pte_t). The pte_t provides the physical page is present in memory, a new process incurs a minor page
number (PPN) and additional flags that the hardware fault on its first access to the page.
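The walk itself is just four indexed lookups. The sketch below, which is ours and not from BabelFish, extracts the 9-bit indices named in this sidebar and walks a dictionary-based page table; it omits the PWC, permission bits, and real page-fault handling beyond a placeholder error.

def walk_indices(vaddr):
    # x86-64 4-level indices: bits 47-39 (PGD), 38-30 (PUD), 29-21 (PMD), 20-12 (PTE).
    return [(vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12)]

def translate(cr3, vaddr, memory):
    # `memory` maps a table's physical base address to a 512-entry dict, emulating the
    # page-table pages; each walk step reads one entry from the current table.
    table = cr3
    for index in walk_indices(vaddr):
        entry = memory[table].get(index)
        if entry is None:
            raise LookupError("page fault")        # the OS would be invoked here
        table = entry                               # next-level table base, or the PPN at the last level
    return (table << 12) | (vaddr & 0xFFF)          # physical address = PPN | page offset

# Tiny example: one mapping for virtual page 0x7f0000000 -> physical page number 0x1234.
vaddr = 0x00007F0000000ABC
pgd_i, pud_i, pmd_i, pte_i = walk_indices(vaddr)
memory = {0x1000: {pgd_i: 0x2000}, 0x2000: {pud_i: 0x3000},
          0x3000: {pmd_i: 0x4000}, 0x4000: {pte_i: 0x1234}}
print(hex(translate(0x1000, vaddr, memory)))        # -> 0x1234abc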


OUR PROPOSAL: BABELFISH


To remedy this problem, we propose BabelFish, a
novel architecture to share translations across con-
tainers in the TLB and in page tables, without sacrific-
ing the isolation provided by the virtual memory
abstraction. BabelFish eliminates the replication of
{VPN, PPN} translations in two ways. First, it modifies
the TLB to dynamically share identical {VPN, PPN}
pairs and permission bits across containers. Second, it FIGURE 2. Two-way set-associative BabelFish TLB.
merges page table entries of different processes with
the same {VPN, PPN} translations and permission bits.
As a result, BabelFish reduces the pressure on the
TLB, reduces the cache space taken by translations,
and eliminates redundant minor page faults. In addi-
tion, it effectively prefetches shared translations into
the TLB and caches. The end result is the higher per-
formance of containerized applications and functions, FIGURE 3. Ownership-PrivateCopy (O-PC) field. The Private-
and faster container bring-up. Copy (PC) bitmask has a bit set for each process in the CCID
BabelFish has two parts. One enables TLB entry group that has its own private copy of the page. The ORPC bit
sharing, and the other enables page table entry sharing. is the logic OR of all the bits in the PC bitmask.

ENABLING TLB ENTRY SHARING

In containerized environments, TLBs may contain multiple entries with the same {VPN, PPN} pair, the same permission bits, and different PCID tags. Such replication can lead to TLB thrashing. To solve this problem, BabelFish combines these entries into a single one with the use of a new identifier called container context identifier (CCID). All of the containers created by a user for the same application are given the same CCID. It is expected that the processes in the same CCID group will want to share many TLB and page table entries.

BabelFish adds a CCID field to each entry in the TLB. Further, when the OS schedules a process, the OS loads the process' CCID into a register—like it currently does for the process' PCID. Later, when the TLB is accessed, the hardware will look for an entry with a matching VPN tag and a matching CCID. If such an entry is found, the translation succeeds, and the corresponding PPN is read. Figure 2 shows an example for a two-way set-associative TLB. This support allows all the processes in the same CCID group to share entries.

The processes of a CCID group may not want to share some pages. In this case, a given VPN should translate to different PPNs for different processes. To support this case, we retain the PCID in the TLB, and add an Ownership (O) bit in the TLB. If O is set, it indicates that this page is owned rather than shared, and a TLB hit also requires a PCID match.

We also want to support the more advanced case where many of the processes of the CCID group want to share the same {VPN0, PPN0} translation, but a few processes of the CCID group do not, and have made their own private copies. For example, one process created {VPN0, PPN1} and another one created {VPN0, PPN2}. This situation occurs when a few of the processes of the CCID group have written to a copy-on-write (CoW) page and have made their own private copy of the page, while most of the other processes still share the original clean page. To support this case, we integrate the ownership bit into a new TLB field called Ownership-PrivateCopy (O-PC) (Figure 2).

Ownership-PrivateCopy Field. The O-PC field is expanded in Figure 3. It contains the PrivateCopy (PC) bitmask, one bit that is the logic OR of all the bits in the PC bitmask (ORPC), and the Ownership (O) bit. The PC bitmask has a bit set for each process of the CCID group that has its own private copy of this page. The rest of the processes of the CCID group, which can be an unlimited number, still share the clean shared page.

The BabelFish TLB is indexed as a regular TLB, using the VPN Tag. The hardware looks for a match in the VPN and CCID. All of the potentially matching TLB entries will be in the same TLB set, and more than one match may occur. On a match, the O-PC and PCID fields are checked, and two cases are possible. First, if the O bit is set, this is a private entry. Hence, the entry can be used only if the process' PCID matches the TLB entry's PCID field.


Alternately, if O is clear, this is a shared entry. In


this case, before the process can use it, the process
needs to check whether the process itself has its own
private copy of the page. To do so, the process checks
its own bit in the PC bitmask. If the bit is set, the pro-
cess cannot use this translation because the process
already has its own private copy of the page. (An entry
for such page may or may not exist in the TLB.) Other-
wise, since the process’ bit in the PC bitmask is clear,
the process can use this translation.
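In software terms, the complete hit rule (match the VPN tag and CCID, then either match the PCID for an owned entry or check the process's PC bitmask bit for a shared one) can be sketched as follows. The field names and widths are invented for illustration and do not reflect the hardware encoding.

    #include <bitset>
    #include <cstdint>

    // A software restatement of the BabelFish TLB hit rule described above.
    struct BabelFishEntry {
        uint64_t vpn_tag;
        uint32_t ccid;
        uint32_t pcid;            // owner PCID, meaningful when the entry is owned
        bool     owned;           // O bit: set => private entry
        std::bitset<32> pc_mask;  // PC bitmask: which group members made a CoW copy
        uint64_t ppn;
    };

    // 'pc_index' is the requesting process's position in the CCID group's pid_list.
    bool tlb_hit(const BabelFishEntry& e, uint64_t vpn, uint32_t ccid,
                 uint32_t pcid, unsigned pc_index) {
        if (e.vpn_tag != vpn || e.ccid != ccid) return false;  // must match VPN and CCID
        if (e.owned) return e.pcid == pcid;                    // private entry: PCID must match
        return !e.pc_mask.test(pc_index);                      // shared entry: usable only if the
                                                               // process has no private copy
    }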
The O-PC information of a page is part of a TLB entry, but only the O and ORPC bits are stored in the page table entry. The PC bitmask is not stored in the page table entry to avoid changing the data layout of the page tables. Instead, it is stored in an OS software structure, which also includes an ordered list (pid_list) of processes that performed a CoW in the CCID group. The order of the pids in this list encodes the mapping of PC bitmask bits to processes. For example, the second pid in the pid_list is the process that uses the second bit in the PC bitmask. More details are given in the conference paper.10

ENABLING PAGE TABLE ENTRY SHARING

In current systems, two processes that have the same {VPN, PPN} mapping and permission bits still need to keep separate page table entries. This situation is common in containerized environments, where the processes of a CCID group may share many pages (e.g., a large library) using the same {VPN, PPN} mappings. Keeping separate page table entries has two costs. First, the many pte_ts requested from memory can thrash the cache hierarchy.11 Second, every single process in the group that accesses the page may suffer a minor page fault, rather than only one process suffering a fault.

To solve this problem, BabelFish changes the page table structures so that processes of the same CCID can share one or more levels of the page tables. In the most common case, multiple processes will share the table in the last level of the translation. This is shown in Figure 4. The figure shows the translation of an address for two processes of the same CCID group that map it to the same physical address. The two processes (one with CR30 and the other with CR31) use the same last-level table (PTE). In the corresponding entries of their previous tables (PMD), both processes place the base address of the same PTE table. Now, both processes together suffer only one minor page fault (rather than two), and reuse the cache line that contains the target pte_t.

FIGURE 4. Page table sharing in BabelFish.

The default sharing level in BabelFish is a PTE table, which maps 512 4-KB pages in x86-64. Sharing can also occur at other levels. For example, it can occur at the PMD level—i.e., entries in multiple PUD tables point to the base of the same PMD table. In this case, multiple processes can share the mapping of 512 × 512 4-KB pages or 512 2-MB huge pages. Further, processes can share a PUD table, in which case they can share even more mappings. We always keep the first level of the tables (PGD) private to the process.

PUTTING IT ALL TOGETHER

To understand the impact of BabelFish, we describe an example. Consider three containers (A, B, and C) that have the same {VPN0, PPN0} translation. First, A runs on Core 0, then B runs on Core 1, and then C runs on Core 0. Figure 5 shows the timeline of the translation process, as each container, in order, accesses VPN0 for the first time. The top three rows of the figure correspond to a conventional architecture, and the lower three to BabelFish. To save space, we show the timelines of the three containers on top of each other; in reality, they take place in sequence.

We assume that PPN0 is in memory but not yet marked as present in memory in any of the A, B, or C pte_ts. We also assume that none of these translations is currently cached in the page walk cache (PWC) of any core.

Conventional Architecture. The top three rows of Figure 5 show the conventional process. As container A accesses VPN0, the access misses in the L1 and L2 TLBs, and in the PWC. Then, the page walk requires a memory access for each level of the page table (we assume that, once the PWC has missed, it will not be accessed again in this page walk). First, as the entry in the PGD is accessed, the page walker issues a cache hierarchy request. The request misses in the L2 and L3 caches and hits in main memory. The location is read from memory. Then, the entry in the PUD is accessed. The process repeats for every level, until the entry in the PTE is accessed. Since we assume that PPN0 is in
memory but not marked as present, A suffers a minor page fault as it completes the translation (see Figure 5). Finally, A's page table is updated and a {VPN0, PPN0} translation for A is loaded into the TLB.

FIGURE 5. Timeline of the translation process in a conventional (top) and BabelFish (bottom) architecture. In the figure, container A runs on Core 0, then container B on Core 1, and then container C on Core 0.

After that, container B running on another core accesses VPN0. The hardware and OS follow exactly the same process as for A. At the end, B's page table is updated and a {VPN0, PPN0} translation for B is loaded into the TLB.

Finally, container C running on the same core as A accesses VPN0. Again, the hardware and OS follow exactly the same process. C's page table is updated, and a {VPN0, PPN0} translation for C is loaded into the TLB. The system does not take advantage of the state that A loaded into the TLB, PWC, or caches because the state was for a different process.

BabelFish Architecture. The lower three rows of Figure 5 show the behavior of BabelFish. Container A's access follows the same translation steps as in the conventional architecture. After that, container B running on another core is able to perform the translation substantially faster. Specifically, its access still misses in the TLBs and in the PWC; this is because these are per-core structures. However, during the page walk, the multiple requests issued to the cache hierarchy miss in the local L2 but hit in the shared L3 (except for the PGD access). This is because BabelFish enables container B to reuse the page-table entries of container A—at any level except at the PGD level. Also, container B does not suffer any page fault.

Finally, as C runs on the same core as A, it performs a very fast translation. It hits in the TLB because it can reuse the TLB translation that container A brought into the TLB. Recall that, in the x86 architecture, writes to CR3 do not flush the TLB. This example highlights the benefits in a scenario where multiple containers are coscheduled on the same physical core, either in SMT mode, or due to an oversubscribed system.

EVALUATION

We evaluate BabelFish with simulations of an 8-core processor running a set of Docker containers in an environment with conservative container colocation. We evaluate two types of containerized workloads (data serving and compute) and two types of FaaS workloads (dense and sparse). On average, under BabelFish, 53% of the translations in containerized workloads and 93% of the translations in FaaS workloads are shared.

Figure 6 shows the latency or time reduction obtained by the extensions proposed by BabelFish. BabelFish reduces the mean and tail (95th percentile) latency of containerized data-serving workloads by 11% and 18%, respectively. It also lowers the execution time of containerized compute workloads by 11%.

FIGURE 6. Latency or time reduction attained by BabelFish.


Finally, it reduces FaaS function execution time by 10%–55% and bring-up time by 8%.

FUTURE RESEARCH DIRECTIONS AND APPLICATIONS

The core concepts of BabelFish can be extended for other environments and resources. In this section, we present some possible directions.

Overcommitted Environments

While container environments place substantial pressure on the TLB and caches due to translations, as the working set of applications continues to increase, other workloads will also suffer from the same effects. BabelFish's novel design, which allows processes and containers within a group to share replicated translations in the TLB and in the cache hierarchy, is a general primitive that can be used in other environments to reduce the translation-induced pressure. Furthermore, BabelFish also supports the more advanced case where many of the processes of a group want to share the same translation, but a few other processes do not, and have made their own private copies. This overall design minimizes the context-switch overhead and effectively prefetches shared translations into the TLB and the caches.

Virtualized Environments

Conventional virtualized environments are slowed down by nested address translation. For example, a nested address translation may require up to 24 sequential memory accesses (with four-level guest and host page tables, each of the four guest page-table references costs a four-step host walk plus the access to the guest table itself, and translating the final guest physical address costs four more: 4 × 5 + 4 = 24). In such deployments, content-aware page deduplication is a prevalent technique to reduce the memory pressure caused by memory overcommitment.

This page deduplication process creates page sharing in virtualized environments, generating more replicated translations. BabelFish can be extended to reduce this translation replication in virtualized environments.

Container Context Identifiers (CCIDs)

CCIDs are a new way to logically group processes and containers to enable resource sharing. CCIDs are a useful abstraction that allows the hardware and kernel to reason about the cooperation and isolation of execution contexts—enabling higher performance and security guarantees. BabelFish proposes first-class support of CCIDs in both the kernel and the hardware, which represents a first step toward container-aware computing. Looking forward, it will be beneficial to provide more hardware and systems support for containerized and serverless environments.

REFERENCES

1. Docker, "What is a container?" [Online]. Available: https://www.docker.com/what-container
2. Google, "Production grade container orchestration." [Online]. Available: https://kubernetes.io
3. C. Tang et al., "Twine: A unified cluster management system for shared infrastructure," in Proc. 14th USENIX Symp. Operating Syst. Des. Implementation, USENIX Assoc., Nov. 2020.
4. N. Savage, "Going serverless," Commun. ACM, vol. 61, no. 2, pp. 15–16, Jan. 2018.
5. A. Klimovic, Y. Wang, P. Stuedi, A. Trivedi, J. Pfefferle, and C. Kozyrakis, "Pocket: Elastic ephemeral storage for serverless analytics," in Proc. 13th USENIX Symp. Operating Syst. Des. Implementation, Carlsbad, CA, USA, USENIX Assoc., 2018, pp. 427–444.
6. M. Shahrad, J. Balkind, and D. Wentzlaff, "Architectural implications of Function-as-a-Service computing," in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 1063–1075.
7. B. Burns and D. Oppenheimer, "Design patterns for container-based distributed systems," in Proc. 8th USENIX Workshop Hot Top. Cloud Comput., Denver, CO, USA, Jun. 2016, pp. 108–113.
8. B. Ibryam, "Principles of container-based application design," Tech. Rep., Red Hat, Inc., 2017. [Online]. Available: https://www.redhat.com/cms/managed-files/cl-cloud-native-container-design-whitepaper-f8808kc-201710-v3-en.pdf
9. IBM, "Docker at insane scale on IBM power systems." [Online]. Available: https://www.ibm.com/blogs/bluemix/2015/11/dockerinsane-scale-on-ibm-power-systems
10. D. Skarlatos, U. Darbaz, B. Gopireddy, N. S. Kim, and J. Torrellas, "BabelFish: Fusing address translations for containers," in Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit., 2020, pp. 501–514.
11. Y. Marathe, N. Gulur, J. H. Ryoo, S. Song, and L. K. John, "CSALT: Context switch aware large TLB," in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchit., 2017, pp. 449–462.



THEME ARTICLE: TOP PICKS

Characterizing and Modeling Nonvolatile Memory Systems
Zixuan Wang and Xiao Liu, University of California San Diego, La Jolla, CA, 92093, USA
Jian Yang, Google Inc., Mountain View, CA, 94043, USA
Theodore Michailidis, Steven Swanson , and Jishen Zhao , University of California San Diego, La Jolla, CA,
92093, USA

Scalable server-grade nonvolatile RAM (NVRAM) DIMMs are commercially available


with the release of Intel’s Optane DIMM. Recent studies on Optane DIMM-based
systems unveil discrepant performance characteristics, compared with what many
researchers thought before the product release. To thoroughly analyze the source
of the discrepancy and facilitate real-NVRAM-aware system design, we develop an
NVRAM microarchitecture characterization and modeling framework, consisting of
a Low-level profilEr for Non-volatile memory Systems (LENS) and a Validated cycle-
Accurate NVRAM Simulator (VANS). LENS allows users to comprehensively analyze
NVRAM performance attributes and reverse engineer NVRAM microarchitectures.
We use LENS to reverse engineer the sophisticated microarchitecture design of
Optane DIMM and generate a set of architecture implications of industrial
NVRAMs. VANS models Optane DIMM microarchitecture and is validated by
comparing with the detailed performance characteristics of Optane DIMM-
attached servers. VANS adopts a modular design that can be easily modified to
extend to other NVRAM architecture designs.

Nonvolatile RAMs (NVRAMs)1 have been envisioned as a new tier of memory in server systems. They offer byte-addressable access through CPU load/store instructions, with comparable performance to DRAM and the durability property of storage devices. Seeing these great values, a large body of prior studies investigated how to exploit NVRAM to benefit future computer systems.2 Yet only recently has the first server-grade NVRAM DIMM product come to market, namely Intel Optane DC persistent memory (also known as Optane DIMM).3

Figure 1 illustrates an example of Optane DIMM-based server systems. Optane DIMMs—denoted NVRAMs—sit on the memory bus along with DRAM DIMMs and are controlled by the processors' integrated memory controllers (iMCs). Optane DIMMs operate in one of the two modes: memory mode or app direct mode. In memory mode, Optane DIMMs are configured as volatile memory with DRAM DIMMs as data cache. In app direct mode, Optane DIMMs are used as persistent memory, which unifies the fast-access interface of memory with the persistence property of storage.4 For example, programmers can directly issue load/store instructions to access in-memory data structures while NVRAM system software and hardware ensure that the data structures are recoverable in the face of system crashes and power loss.2

0272-1732 © 2021 IEEE. Digital Object Identifier 10.1109/MM.2021.3065305. Date of publication 11 March 2021; date of current version 25 May 2021.

The latest characterization studies5 exposed substantial discrepant performance characteristics compared to what we thought before the real products were released. For instance, Figure 2 compares results of the widely used memory simulators, DRAMSim26 and Ramulator,7 and our Optane DIMM server measurement on single-thread read and write bandwidth [see Figure 2(a)] and latency per cache line (CL) access [see Figure 2(b)] with a pointer chasing microbenchmark. The microbenchmark randomly accesses a contiguous data region with fixed 64 B objects. By varying the data
region size from 64 B to 64 KB, we observe a clear inconsistency between the simulation and real-system results. Other previous NVRAM emulators and simulators also generate different performance characteristics compared to real-machine results. As such, previous NVRAM emulation and simulation tools are insufficient to model modern real NVRAM systems.

FIGURE 1. Optane DIMM-based server system configurations.

In order to develop a memory simulator that models the sophisticated performance behavior and microarchitecture of real NVRAM systems, we need to collect sufficient information about detailed performance characteristics. This requires a comprehensive architecture-level memory system profiling. Most previous performance profiling tools focus on investigating basic memory system performance characteristics, such as latency, bandwidth, and access counts. Beyond these characteristics, DRAMA8 examines certain address mapping schemes. However, none of the widely used performance profiling tools allows us to analyze the detailed on-DIMM buffering and management schemes of Optane DIMM.

Our article, presented at MICRO 2020,9 offers a set of open-sourced NVRAM profiling and simulation tools (available at: https://github.com/TheNetAdmin/LENS-VANS) to address the above-mentioned challenges and facilitate future research on architecture and system designs of NVRAM-based memory systems.

FIGURE 2. Comparison between memory simulators and Optane DIMM system profiling. (a) Simulator average accuracy wrt. Optane DIMM load/store bandwidth (bw-ld and bw-st) and latency (lat-ld and lat-st). (b) Comparison between Ramulator and Optane DIMM on read latency per CL with the pointer chasing test.

LENS: LOW-LEVEL NVRAM PROFILER

To address the challenge of insufficient profiling tools, this article proposes a Low-level profilEr for Non-volatile memory Systems (LENS). LENS consists of a set of NVRAM profiling tools and low-level microbenchmarks. With detailed performance profiling, LENS also allows researchers to reverse engineer the microarchitecture design of NVRAM systems with on-DIMM buffers and various control schemes. In this article, we use LENS to profile the detailed architecture design of Optane DIMM.

LENS Framework

LENS adopts three key components—probers—to examine three aspects of NVRAM architecture design, respectively. Each prober employs customized microbenchmarks to trigger specific hardware behaviors, which leads to different latency and bandwidth patterns with different memory architecture properties. By analyzing the performance patterns, we identify the corresponding microarchitecture properties and parameters.

Buffer Prober detects the on-DIMM buffer capacity, entry size, and organization, by measuring the latency change caused by buffer overflow and read/write amplification.

Policy Prober investigates NVRAM control policies, including data migration (e.g., for wear-leveling) and the multi-DIMM interleaving policy. Typical NVRAM wear-leveling schemes migrate data from one NVRAM media location to another to maintain evenly distributed wear out. The prober detects the frequency, granularity, and latency overhead of such data migration procedures.

Performance Prober facilitates the analysis of the other two probers to measure the device bandwidth and latency for on-DIMM architecture components.
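The Buffer Prober's overflow detection boils down to locating points where a measured latency-versus-region-size curve jumps. A minimal C++ sketch of that post-processing step follows; the threshold and the sample values are invented for illustration and are not LENS code or LENS output.

    #include <cstddef>
    #include <cstdio>
    #include <utility>
    #include <vector>

    // Find the first "knee" in a (region size, average latency) curve: the point
    // where latency grows by more than `jump` relative to the previous size.
    // Returns 0 if no such point exists.
    std::size_t find_overflow_point(const std::vector<std::pair<std::size_t, double>>& curve,
                                    double jump = 1.5) {
        for (std::size_t i = 1; i < curve.size(); ++i)
            if (curve[i].second > jump * curve[i - 1].second)
                return curve[i].first;  // region size at which the buffer overflowed
        return 0;
    }

    int main() {
        // Hypothetical measurements (bytes, ns per cache line), not real Optane data.
        std::vector<std::pair<std::size_t, double>> curve = {
            {4096, 170.0}, {8192, 172.0}, {16384, 171.0}, {32768, 300.0}, {65536, 305.0}};
        std::printf("overflow at %zu B\n", find_overflow_point(curve));
    }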


Microbenchmarks. LENS provides three microbenchmarks: pointer chasing, overwrite, and stride.

Pointer chasing is a random memory access benchmark: it divides a contiguous memory region—referred to as a pointer chasing region (PC-Region)—into equal-sized blocks (PC-Blocks); it reads/writes all PC-Blocks in a PC-Region in random order, and sequentially accesses data within each PC-Block. In order to bypass CPU caches while still generating CL-sized memory accesses, all pointer chasing tests are implemented using nontemporal AVX512 load/store instructions. Pointer chasing has three variants to detect various buffer architecture characteristics: 1) collecting average latency per CL with a fixed PC-Block size across various PC-Region sizes; 2) quantifying read and write amplification by using a fixed PC-Region size, while varying the size of the PC-Block; and 3) issuing read-after-write requests, which issue writes in a pointer chasing order, followed by reads in the same order.

Overwrite repeatedly generates sequential writes to a fixed memory region, and then measures the execution time of each iteration. It has two variants: 1) collecting the execution time of each write iteration with a fixed memory region size and 2) measuring the frequency of long tail latency by changing the memory region size.

Stride sequentially reads or writes to a set of CLs with a fixed striding distance. It has two variants: 1) measuring bandwidth by using a fixed striding distance and increasing the access size, and 2) characterizing multi-DIMM interleaving by a fixed total access size and a variable striding distance.

LENS Implementation. We build LENS as a Linux kernel module to avoid the overhead of switching between user and kernel spaces. We implement all our microbenchmarks in assembly language to enforce a well-controlled memory access behavior. We also disable process preemption and hardware prefetchers to avoid noise in the probing results.

Architecture Implications of Industrial NVRAMs

Figure 3 shows an overview of a real-machine characterization with LENS. Each prober identifies certain memory system architecture and performance characteristics, as illustrated by arrows in the figure. We identified the write-pending-queue (WPQ) size and multi-DIMM interleaving scheme in the iMC. We also identified two on-DIMM buffers, a 16 KB SRAM-based read–modify–write (RMW) buffer and a 16 MB DRAM-based address indirection translation (AIT) buffer, with 256 B and 4 KB access granularity, respectively. We find that these two buffers are organized as a two-level inclusive buffer hierarchy rather than independent buffers. We also identified a 4 KB load-store-queue (LSQ), which reorders the incoming requests to perform write combining. Finally, we identified a long tail-latency effect, which may be caused by wear-leveling data migration. We identified the frequency and the block size of this data migration.

FIGURE 3. LENS probers and Optane DIMM parameters. Yellow numbers are obtained from Intel documents; other numbers are characterized by LENS.

Buffer Capacity

Figure 4 shows representative profiling results of the LENS buffer prober and policy prober. Figure 4(a) shows a pointer chasing read and write test with a 64 B PC-Block and various PC-Region sizes. The experiments measure latency per CL access. The shape of the curves exposes the read requests' overflow points at 16 KB and 16 MB, where the read latency drastically changes. This indicates that the Optane DIMM has two levels of read buffers, one 16 KB and the other 16 MB. Figure 4(a) also shows that the write curve (denoted by st) has two overflow points at 512 B and 4 KB, respectively. This indicates that the memory has two write buffers, namely WPQ and LSQ, of two distinct capacities.

We refer to the 16 MB buffer as the AIT buffer. Because its read latency is approximately 100 ns, as shown in Figure 4(a), it is likely to be located inside the on-DIMM DRAM, where the AIT resides. We believe that the 512 B buffer is the WPQ, given its small size; the 4 KB buffer is an on-DIMM LSQ that reorders the read/write requests.

Access Granularity

Figure 4(c) shows a read amplification test. The results indicate that the RMW buffer and AIT buffer adopt 256 B and 4 KB access granularities, respectively. Figure 4(d) shows a buffer write amplification test. It demonstrates that the two write queues, WPQ and LSQ, adopt 512 and 256 B granularities, respectively.

As a result, an mfence will cause the WPQ to flush in total 512 B of data; the LSQ combines 64 B writes into 256 B in order to reduce RMW operations.

FIGURE 4. LENS profiling details. (a)–(d) Buffer prober analysis for buffer size, hierarchy, and granularity. (e)–(h) Policy prober analysis for multi-DIMM interleaving and wear-leveling. (a) Load/store latency with 64 B PC-Block. (b) RaW latency and R+W latency with 64 B PC-Block. (c) Read amplification score. (d) Write amplification score. (e) Execution time of sequential write test. (f) Tail latency in overwrite test. Each iteration is one 256 B write. (g) Ratio of long tail latency with various overwrite granularities. (h) L2 TLB miss per millisecond in the overwrite test of (f).

Buffer Hierarchy Organization

Figure 4(b) shows a read-after-write (RaW) experiment to characterize the buffer hierarchy organization. Figure 4(b) indicates that the RMW buffer and AIT buffer form a two-level inclusive hierarchy—if they were independent of each other, they would have demonstrated a RaW latency speedup at 16 MB by fast-forwarding data in parallel. They do not appear to be exclusive either, because they adopt different access granularities and entry sizes.

Figure 4(b) also demonstrates that the read-after-write latency (denoted as RaW) is significantly higher than the total latency of performing individual reads and writes (denoted as R+W) for small PC-Regions. The key reasons are that small-sized RaW requests i) trigger frequent memory bus redirection and ii) underutilize the WPQ and LSQ capacity. Therefore, due to the identified architectural design, small-sized requests in Optane DIMMs tend to perform better with pure reads or writes than with mixed read and write access patterns. In addition, the high RaW latency with small PC-Region sizes also indicates that mfence flushes the LSQ, because i) RaW latency reduces when the PC-Region size increases and ii) RaW latency equals the R+W latency when the PC-Region size reaches 4 KB, the size of the LSQ.

Multi-DIMM Interleaving Analysis

Figure 4(e) shows a multi-DIMM interleaving policy analysis. LENS measures the execution time of various sized sequential writes on interleaved and noninterleaved DIMMs, respectively. In the interleaved experiment, the first 4 KB has a similar execution time as in the noninterleaved case, indicating that the first 4 KB is written to a single DIMM. We also observe a repeated execution-time pattern every 4 KB, indicating that different 4 KB data writes are directed to different DIMMs. These observations indicate a 4 KB granularity for multi-DIMM interleaving. We consider the reason for this interleaving granularity is to fully utilize the 4 KB LSQ and the 4 KB entries in the AIT buffer.

Tail Latency Analysis

Figure 4(f) shows the overwrite experiment for tail latency analysis. The policy prober constantly writes data to the same 256 B memory region, and measures the latency of each 256 B write. We observe a long tail latency approximately every 14,000 write iterations (i.e., every ~3.4 MB for the 256 B overwrite test), which incurs over a 100× latency penalty on average. We do not observe similar trends on DRAM. Moreover, the L2 TLB miss rate remains stable in the overwrite experiment, as shown in Figure 4(h). Therefore, we consider wear-leveling to be a major contributor to the tail latency.

To study the size of the blocks that wear-leveling tracks, we increase the size of the overwrite test region and count the frequency of the long tail latency. Each test writes the same amount of data to NVRAM. As shown in Figure 4(g), the frequency of long tail latency dramatically drops once we overwrite 64 KB or larger memory regions. This indicates that a possible block size for wear-leveling is 64 KB.

FIGURE 5. VANS overview of (a) full system simulation model with VANS reconfigurable components connected to gem5 and (b)
detailed NVRAM DIMM model.

VANS: VALIDATED NVRAM SIMULATOR

Based on our profiling observations, we build a Validated cycle-Accurate NVRAM Simulator (VANS) that models Optane DIMM's microarchitecture design and its parameters. Figure 5 depicts an overview of the VANS model and example usage cases.

Figure 5(a) shows a full system simulation model with VANS attached to gem5. The CPU cores and LLC are provided by gem5, while VANS acts as a drop-in replacement of gem5's memory controller and main memory. VANS provides a set of controller and memory components, which can be used in combination to provide a conventional DRAM-only model or an NVRAM-involved memory mode and app direct mode. VANS's iMC controls multiple DRAM and NVRAM DIMMs with various interleaving schemes. The DRAM model supports DDR4 that complies with the JEDEC standard, and can be extended to support more DDR models with slight modifications to the source code. The NVRAM model supports all the hardware components discovered by LENS and can configure them in different combinations to simulate various possible NVRAM architectures.

Figure 5(b) depicts a detailed NVRAM DIMM model with an LSQ, an RMW buffer, an AIT, and NVRAM media. The LSQ serves as the highest level storage in the DIMM, directly queuing load/store requests from the iMC. During each scheduling epoch, the LSQ performs reordering and write combining to reduce the number of RMW operations. The RMW buffer receives read and write requests from the LSQ, and accesses the AIT buffer at a certain granularity (by default 256 B, based on our Optane DIMM analysis). The AIT consists of a translation table and a data buffer (AIT buffer). The translation table stores the records of CPU address to media address translation. It also stores the media wear-leveling records in each table entry. If one media block is wearing out, the AIT stalls the inflight CPU writes to this block, migrates the data into another media block, updates the translation record, and then resumes the CPU write. The AIT data buffer stores data from the media to accommodate read and write requests from the RMW buffer. We place the AIT table and buffer in the on-DIMM DRAM, based on our Optane DIMM profiling.

VANS is written in C++ and takes advantage of the C++17 standard to improve performance and reduce code complexity. It adopts a modularized software design that is flexible and reconfigurable, where each component has a consistent input and output interface and can be configured to work with any other component. This modular interface enables users to configure VANS component organization and parameters solely from a configuration file, without any source code change. Figure 5(a) shows such an example where multiple NVRAM models use various internal components.

FIGURE 6. VANS performance validation with microbenchmarks: (a)–(c) pointer chasing and (d) overwrite. (a) Pointer chasing test (64 B PC-Block size) on a noninterleaved DIMM. (b) Pointer chasing test (64 B PC-Block size) on interleaved 6 DIMMs. (c) RMW buffer's read amplification in (a). (d) Overwrite (256 B region), each iteration is one 256 B write.
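The modular organization described above can be pictured with a small sketch. The class names, interface, and configuration strings below are invented for illustration; they are not VANS's actual classes or configuration format.

    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <string>
    #include <vector>

    // Hypothetical sketch of a consistent input/output interface between memory
    // components, so that a DIMM model can be assembled as a chain such as
    // LSQ -> RMW buffer -> AIT -> media purely from a configuration list.
    struct Request { std::uint64_t addr; bool is_write; };

    class Component {
     public:
        virtual ~Component() = default;
        virtual void issue(const Request& r) = 0;        // accept a request
        void connect(Component* next) { next_ = next; }  // downstream component
     protected:
        void forward(const Request& r) { if (next_) next_->issue(r); }
        Component* next_ = nullptr;
    };

    class PassThrough : public Component {               // stand-in for LSQ/RMW/AIT/media
     public:
        explicit PassThrough(std::string name) : name_(std::move(name)) {}
        void issue(const Request& r) override { forward(r); }
     private:
        std::string name_;
    };

    // Build a chain from an ordered list of component names (e.g., parsed from a
    // configuration file), without touching the component implementations.
    std::vector<std::unique_ptr<Component>> build_chain(const std::vector<std::string>& names) {
        std::vector<std::unique_ptr<Component>> chain;
        for (const auto& n : names) chain.push_back(std::make_unique<PassThrough>(n));
        for (std::size_t i = 0; i + 1 < chain.size(); ++i) chain[i]->connect(chain[i + 1].get());
        return chain;
    }

    int main() {
        auto dimm = build_chain({"lsq", "rmw_buffer", "ait", "media"});
        dimm.front()->issue({0x1000, true});  // the request flows down the configured chain
    }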


Verification and Validation

To verify the VANS DRAM model, we run SPEC2006 and SPEC2017 benchmarks on VANS and capture internal DRAM command traces. Then, we employ Micron's DDR4 verification model with the Cadence toolchain to test these command traces. The results demonstrate that our model does not generate any illegal DDR4 command, showing that our on-DIMM DRAM model complies with the DDR4 specification.

To validate the VANS NVRAM model, we run LENS microbenchmarks on VANS and compare the simulated load/store latency, bandwidth, and read amplification factors with Optane DIMM-based real-machine results. Figure 6(a) and (b) shows the pointer chasing latency on a single DIMM and interleaved six DIMMs. The simulated load/store latency curves match the curves of real-system profiling. Furthermore, VANS's read amplification [see Figure 6(c)] and overwrite latency [see Figure 6(d)] match the Optane DIMM's. Overall, VANS achieves over 86.5% accuracy across all metrics.

LONG-TERM IMPLICATIONS

We hope to draw attention to the need for research on architecture and systems based on the sophisticated NVRAM DIMM microarchitectures found in modern and future devices. This work offers an open-source toolset to make such research widely accessible.

An End-to-End Toolchain for Memory System Research

Memory microarchitecture is becoming increasingly complex in order to i) adapt to new memory media and ii) embrace new computation functionality and communication protocols.10 Yet, many of these new memory techniques appear to be a black box or a grey box and lack publicly available specifications. Profiling tools, which can capture detailed memory architecture designs, are critical to understanding the memory performance characteristics and exploring the full potential of the new techniques. The existing profiling tools examine low-level hardware counters,11 high-level software schemes,12 or coarse-grained architecture characteristics8 of the memory system. Moreover, NVRAM research relies heavily on simulators and emulators, while many of the existing main memory simulators and emulators lack the flexibility and accuracy to model the emerging nonvolatile memory.

LENS and VANS form an end-to-end toolchain that enables researchers to characterize and model new memory products without prior knowledge of their microarchitecture. LENS fills in the gap with the capability of reverse engineering the detailed structures of memory architecture components, management policies, and interfaces. VANS helps researchers to model the emerging nonvolatile memory and communication protocols with minimum engineering effort. This toolchain is beneficial to not only architecture research but also system and software research, such as compiler optimizations and file system developments. Furthermore, the toolchain can be easily extended to support future research with different memory technologies and architectures.

Need for Rethinking NVRAM System Design

This article unveils the microarchitecture designs in current and future NVRAM products. Conventional DIMM architecture designs are incompatible with novel NVRAMs, due to the discrepant performance characteristics of NVRAM media compared with traditional DRAM devices. This requires imperative investigations on NVRAM-based computer systems and attention from both academia and industry on new software and hardware designs.

Modeling State-of-the-Art NVRAM Architecture

Due to the scarcity of access to NVRAM devices, NVRAM system research typically relies on publicly available or open-source simulation and emulation models. Therefore, it is imperative for the research community to have simulation models that keep up with the advancement of NVRAM architecture designs. VANS is verified and validated with the latest commercialized NVRAM product. It is the first design to offer the research community the new NVRAM architecture design and enable NVRAM systems studies without access to the real hardware.

CONCLUSION

This article presents LENS and VANS, a low-level performance profiler for NVRAM systems and an
architecture-level memory simulator that models the recently released Optane DIMM memory system. Using LENS, we perform a detailed characterization of the Optane DIMM microarchitecture design. LENS allows users to characterize the performance and reverse engineer the architecture design of real NVRAM DIMM systems. VANS allows researchers who do not have access to Optane DIMM physical devices to explore and evaluate new design ideas on architecture and systems. Furthermore, both LENS and VANS are flexible to be used with other NVRAM systems beyond the current Optane DIMM design.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation under Grant 1829524 and Grant 1817077 and in part by the SRC/DARPA Center for Research on Intelligent Storage and Processing-in-memory.

REFERENCES

1. B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in Proc. Int. Symp. Comput. Archit., 2009, pp. 2–13, doi: 10.1145/1555754.1555758.
2. J. Xu and S. Swanson, "NOVA: A log-structured file system for hybrid volatile/non-volatile main memories," in Proc. USENIX Conf. File Storage Technol., 2016, pp. 323–338, doi: 10.5555/2930583.2930608.
3. "Intel Optane Persistent Memory," 2019. [Online]. Available: https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory
4. J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi, "Kiln: Closing the performance gap between systems with and without persistence support," in Proc. Int. Symp. Microarchit., 2013, pp. 421–432, doi: 10.1145/2540708.2540744.
5. J. Yang, J. Kim, M. Hoseinzadeh, J. Izraelevitz, and S. Swanson, "An empirical guide to the behavior and use of scalable persistent memory," in Proc. USENIX Conf. File Storage Technol., 2020, pp. 169–182. [Online]. Available: https://www.usenix.org/conference/fast20/presentation/yang
6. P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "DRAMSim2: A cycle accurate memory system simulator," IEEE Comput. Archit. Lett., vol. 10, no. 1, pp. 16–19, Jan.–Jun. 2011, doi: 10.1109/L-CA.2011.4.
7. Y. Kim, W. Yang, and O. Mutlu, "Ramulator: A fast and extensible DRAM simulator," IEEE Comput. Archit. Lett., vol. 15, no. 1, pp. 45–49, Jan.–Jun. 2016, doi: 10.1109/LCA.2015.2414456.
8. P. Pessl, D. Gruss, C. Maurice, M. Schwarz, and S. Mangard, "DRAMA: Exploiting DRAM addressing for cross-CPU attacks," in Proc. USENIX Secur. Symp., 2016, pp. 565–581, doi: 10.5555/3241094.3241139.
9. Z. Wang, X. Liu, J. Yang, T. Michailidis, S. Swanson, and J. Zhao, "Characterizing and modeling non-volatile memory systems," in Proc. Int. Symp. Microarchit., 2020, pp. 496–508, doi: 10.1109/MICRO50266.2020.00049.
10. "Compute Express Link," 2020. [Online]. Available: https://www.computeexpresslink.org/
11. "Linux Perf Examples," 2020. [Online]. Available: http://www.brendangregg.com/perf
12. "Valgrind," 2020. [Online]. Available: https://valgrind.org/

ZIXUAN WANG is currently working toward a Ph.D. degree with the Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA. His research focuses on operating system and computer architecture design for memory systems. He is the corresponding author for this article. Contact him at zxwang@ucsd.edu.

XIAO LIU is currently working toward a Ph.D. degree with the Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA. His research concerns reliability support for memory hierarchy, in-memory machine learning acceleration, persistent memory system design, and fault tolerance in the storage system of data centers. Contact him at x1liu@ucsd.edu.

JIAN YANG is currently a software engineer at Google, Mountain View, CA, USA, where he works on the host networking team. This work was done while he was with University of California San Diego, La Jolla, CA, USA. Yang received a Ph.D. degree in computer science from the University of California San Diego. Contact him at jianyang@google.com.

THEODORE MICHAILIDIS is currently working toward a Ph.D. degree with the Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA. His research focuses on building efficient memory and storage systems through cooperative software–hardware techniques. Contact him at tmichail@eng.ucsd.edu.


STEVEN SWANSON is currently a professor with the Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA, the Director of the Non-Volatile Systems Laboratory, and the Halicioğlu Chair in Memory Systems. His research interests include the systems, architecture, security, and reliability issues surrounding heterogeneous memory/storage systems, especially those that incorporate nonvolatile, solid-state memories. He received the NSF CAREER Award, Google Faculty Awards, and a Facebook Faculty Award. He has been a NetApp Faculty Fellow. He is a co-founder of the Non-Volatile Memories Workshop. Swanson received a Ph.D. from the University of Washington in 2006. Contact him at swanson@cs.ucsd.edu.

JISHEN ZHAO is currently an associate professor of computer science and engineering at the University of California San Diego, La Jolla, CA, USA. Her research interests include a broad range of computer architecture and system topics that bridge system software and hardware design, with an emphasis on memory and storage systems, and machine learning for systems and programming. Zhao received a Ph.D. degree in computer science and engineering from Pennsylvania State University. Contact her at jzhao@ucsd.edu.


THEME ARTICLE: TOP PICKS

Temporal Computing With Superconductors


Georgios Tzimpragos and Jennifer Volk, University of California, Santa Barbara, Santa Barbara,
CA, 93106, USA
Dilip Vasudevan , Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
Nestan Tsiskaridze, Stanford University, Stanford, CA, 94305, USA
George Michelogiannakis, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
Advait Madhavan, University of Maryland, College Park, MD, 20742, USA
John Shalf, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
Timothy Sherwood, University of California, Santa Barbara, Santa Barbara, CA, 93106, USA

Creating computing systems able to address our ever-increasing needs, especially


as we reach the end of CMOS transistor scaling, will require truly novel methods of
computing. However, the traditional logic abstractions and the digital design
patterns we understand so well have coevolved with the hardware technology that
has embodied them. As we look past CMOS, there is no reason to think that those
same abstractions best serve to encapsulate the computational potential inherent
to emerging devices. We posit that a new and radically more efficient foundation for
computing lies at the intersection of superconductor electronics and delay-coded
computation. Building on recent work in race logic, we show that superconducting
circuits can naturally compute directly over temporal relationships between pulse
arrivals; that the computational relationships between those pulse arrivals can be
formalized through a functional extension to a temporal predicate logic used in the
verification community; and that the resulting architectures can operate
asynchronously and describe real and useful computations. We verify our
hypothesis through a combination of detailed analog circuit models and layout
designs, a formal analysis of our abstractions, and an evaluation of several
superconducting temporal accelerators.

Superconductor electronics offer the potential to perform computations at tremendous speeds and energy savings—especially on large scales.1 Unfortunately, a semantic gap exists between the level-driven logic, which CMOS designs accept as a foundation, and the pulse-driven logic that is naturally supported by the most compelling superconducting technologies. This gap creates a variety of challenges across the full hardware stack, from the circuits up through the CAD and architecture levels.

0272-1732 © 2021 IEEE. Digital Object Identifier 10.1109/MM.2021.3066377. Date of publication 17 March 2021; date of current version 25 May 2021.

A pulse, unlike a stable voltage level, will fire through a channel for only an instant. Arranging the network of superconducting components so that input pulses—driven by the transfer of magnetic flux quanta—always arrive simultaneously to logic gates to maintain the illusion of a strictly Boolean evaluation is a significant engineering hurdle and results in unavoidable overheads. If we instead think about these pulses as the representation of nonbinary data, the natural language for expressing computations over those data would be one that can efficiently describe the temporal relationships between these pulses. Here, we draw upon two distinct lines of research, both currently disconnected from superconducting.

First, recent work has shown that delay-based encoding has impressive expressive power and
practical utility in implementing important classes of accelerators.2–4 The principles of the delay-coded logic described in that prior work apply directly to problems in superconducting. However, the fact that its primitive operations have been so far implemented only in CMOS under specific assumptions—e.g., edges are used to denote event occurrences—brings up questions about their implementability and efficiency in the much different superconducting technology.

Second, there is a long history of work in temporal logic. Temporal logic systems allow for the representation and reasoning of propositions qualified in terms of time. Although expressing and evaluating constraints on the order of events is particularly useful for formal verification purposes, the predicate nature of existing temporal logics makes them insufficient for desired computing tasks. We instead need a temporal logic with computational capabilities that takes events as inputs and creates new events as outputs based on the input relationships.

In this article, we address these issues and explore the potential of superconducting temporal computing by proposing a new computational temporal logic. This computational temporal logic subsumes classical temporal predicate logic and gives clear, precise, and useful semantics to delay-based computations. Moreover, we demonstrate how this approach enables a tradeoff between implementation complexity and delay; show how one can realize a functionally complete set of temporal operations in superconducting circuits; and create useful designs that effectively capture the potential of these circuits. Finally, we identify critical timing constraints and develop proof-of-concept accelerator designs that demonstrate the functional correctness and high performance of the proposed logic scheme, methodology, and architectures. Open-source implementations of our superconducting temporal primitives and accelerators can be found on GitHub.

BACKGROUND

Computing With Superconductors

Pulse-based superconductor electronics are characterized by three basic features: 1) the absence of resistance in static circuits at superconducting temperatures, 2) the Josephson effect, which governs the fundamental switching element in superconductor circuits, the Josephson junction (JJ), and 3) the propagation of single flux quanta (SFQ), appearing as ps-duration, mV-amplitude pulses across switching JJs, instead of static-voltage levels.

Unlike semiconductor devices, the resistance of superconducting circuits is zero, and thus, their speed is not limited by RC time constants. Moreover, the propagation of digital data (in superconducting interconnects) is near the speed of light, lossless, and consumes no power. These speed and power benefits do not come without challenges, though. One of the most profound is the pulse-based nature of computation. Developing logic using only pulses presents intellectual and application challenges alike. Pulses cannot be sampled like voltage levels because they do not coincide with ps precision (because of this reason, we usually end up with synchronous superconducting gates and fine-grained clocking). Also, the relatively low integration density of JJs along with the need for Splitters to fan out can inflate the circuit size considerably. Finally, the lack of reliable and high-capacity superconducting memories imposes its own distinctive constraints.

Because of these unique characteristics (and limitations) of superconducting, innovation requires that research efforts focus on computing paradigms and architectures that depart from classic CMOS-inspired solutions. Examples of such alternative solutions are those that 1) use fewer JJs than transistors for the same information throughput, 2) allow for easier clocking, and 3) have lower memory requirements.5 We claim that temporal computing satisfies all these "constraints" and paves a promising path forward.

Race Logic

Race logic is a prime example of computing in the time domain.2,4 Under race logic, information is represented as a delay; and a set of basic temporal operations—FirstArrival (FA), LastArrival (LA), Delay (D), and Inhibit (IS)—replaces the OR, AND, and NOT we know from Boolean logic.

FIGURE 1. Basic operations of race logic are FirstArrival (FA), LastArrival (LA), Delay (D), and strict Inhibit (IS). Δt corresponds to the delay associated with one time step.

Figure 1 illustrates the functionality of these temporal operations. As their names indicate, FA and LA
"fire" as soon as the first and last high inputs arrive, respectively. Under the assumption that smaller delays in rise time encode smaller magnitudes, whereas larger magnitudes are encoded as longer delays, FA realizes a Min function and LA a Max function. The D operation delays an input event by c units of time; in the shown example, c = 2. Since the arrival time of the rising edge is what encodes information, delaying the 0 → 1 transition by a fixed amount of time is equivalent to Constant Addition. Finally, the IS operation, inspired by the behavior of inhibitory postsynaptic potentials in the neurons of the neocortex,3 works as a nonlinear filter that has two inputs: 1) an inhibiting signal and 2) a data signal. If the data signal arrives first, it is allowed to pass through the gate unchanged; otherwise, no event occurs on the output line, which denotes ∞. Prior to the next computation, race logic-driven circuitry must be reset.

Space–Time Algebra

Space–time algebra provides a mathematical underpinning for temporal processing.3 Any function defined over the domain S = {0, N+, ∞}, where N+ is the set of positive natural numbers, that satisfies the properties of invariance and causality complies with this algebra. Interestingly, one can always make an arbitrary function satisfy these properties with the use of a reference signal r. Thus, any n-ary function defined over S is expressible in race logic with at most n + 1 variables.

FORMALIZATION

The transition from Boolean to race logic gives us the freedom to set up a clean infrastructure and avoid retrofitting superconductor electronics to existing abstractions and architectures. To establish a solid foundation and lay the groundwork for the development of systematic methods at all levels, we formally define the above-described temporal operations.

Computational Temporal Logic

Given the long history of temporal logics in the verification community, we use the well-established setting of linear temporal logic (LTL) as a starting point. The basic LTL operators are: ◇ sometime in the future, □ always in the future, ○ next time (tomorrow), U until, and R release. Past LTL (PLTL) extends LTL with past-time operators, which are the temporal duals of LTL's future-time operators, and allows one to concisely express statements on the past time instances, such as: ◆ sometime in the past, ■ always in the past (historically), ● previous time (yesterday), S since, and T trigger.6,7

TABLE 1. Semantics of PLTL.

A definition of the semantics of PLTL is provided in Table 1. The notation ⟨S, t⟩ is used to signify a system S at time step t. We say that an event φ occurs at time step t in the system S, if φ holds at time step t in S, denoted by ⟨S, t⟩ ⊨ φ.

These operators allow for the efficient representation and reasoning of propositions qualified in terms of time; e.g., an event in a system S has happened sometime in the past. However, they are incapable of encapsulating when a formula φ holds, which is essential for race logic. To address this issue, we introduce the earliest-occurrence function E⟨S,t⟩(φ), defined as

E⟨S,t⟩(φ) = t_min, if there exists t_min ∈ [[φ]]⟨S,t⟩ such that ⟨S, t_min⟩ ⊨ φ and, for all j ∈ [0, t_min), ⟨S, j⟩ ⊭ φ;
E⟨S,t⟩(φ) = ∞, otherwise.

This earliest-occurrence function receives as input a formula φ and returns the earliest time step t_min ∈ [[φ]]⟨S,t⟩, where [[φ]]⟨S,t⟩ is the scope of φ at time step t in the system S (the scope of φ denotes an interval of time steps φ operates on at time step t), such that ⟨S, t_min⟩ ⊨ φ. If φ does not hold at any time step within [[φ]]⟨S,t⟩, the earliest-occurrence function returns ∞, which represents an unreachable time step.

Race Logic Semantics

The semantics of the FA, LA, D, and IS operations can be formally defined using existing PLTL operators, as shown in Table 2 (○^c denotes the application of the ○ operator c times). To extract the step at which these formulas evaluate to True for the first time in their scope, E⟨S,t⟩(·) can be used. For example, E⟨S,t⟩(FA(φ, ψ)) will return the first time step at which either φ or ψ holds.

TABLE 2. PLTL-based semantics of race logic operations.
trigger.6,7

May/June 2021 IEEE Micro 73


TOP PICKS

Note that an important property of race logic is that important property of edge-based event encoding is
at most one event is allowed to occur per “wire” across that it automatically keeps track of the input state at
the entire computation. Bounding switching activity all times—a signal that has made a transition from a
leads to simple and power-efficient hardware designs, “low” to a “high” state will not make a transition back
and it may even allow for an overhead-free transition to low in the same computation. This feature breaks
between temporal and binary domains.4 In the case of down, though, when dealing with pulses—as dis-
superconducting, verifying that this property holds is cussed in the “Background” section, superconducting
crucial to the correct operation of the circuits presented technologies encode digital data as SFQ voltage
in the section that follows. For example, if more than pulses. Pulses naturally return back to their low state,
one pulse appears per line, the proposed IS design will preventing downstream nodes from implicitly knowing
not satisfy its specification anymore. A formal definition the state of its predecessors.
of this property—represented here by A—using PLTL A potential solution to this problem is to embed
operators follows: the state into each gate. Interestingly, the majority
of superconducting single flux quantum (SFQ) ele-
hS; ti  Af iff nðf ! n:fÞ: mentary cells have both logic and storage abilities;8
thus, they can be thought of as simple state
machines. To facilitate the mapping between our
temporal operations and existing stateful SFQ ele-
SUPERCONDUCTING TEMPORAL
ments, we draw FA, LA, D, and IS as Mealy
ARCHITECTURES
machines—as shown in Figure 2. Following this
Realizing Temporal Operations in representation, we build FA with a Merge and a
Superconducting Destructive Read Out (DRO) cell, LA with a
The way in which events are encoded plays a critical C-element cell, D with a sequence of
role in selecting the hardware that most efficiently Josephson Transmission Lines (JTL), and IS with an
implements logic operators. For example, in CMOS, in Inverter cell. The length of the JTL chain depends
which (under race logic) temporal events are repre- on the selected Dt. In contrast to the other shown
sented by rising edges, FA and LA functions are real- cells, JTLs do not hold state; thus, more than one
ized with a single OR and AND gate, respectively.2,4 An SFQ pulses can propagate through a JTL chain

FIGURE 2. Block diagrams, Mealy machine representations, and WRSPICE simulations of race logic operations implemented in
Rapid SFQ.

74 IEEE Micro May/June 2021


TOP PICKS

TABLE 3. Area and latency estimates. is implemented by a Merger. Obviously, the shown
DDST design is not suitable for Boolean circuits; e.g., a
Function Area (#JJs) Latency (ps) NOT gate must be clocked even in the absence of an
FA 10 13 incoming pulse. However, this is not the case for tem-
LA 6 8 poral systems, as temporal operators can be safely
considered idle for the time steps that no input pulse
D 2/JTL 5/JTL
arrives. To extend the “evaluation” window of a syn-
Is 8 11 chronous logic block (defined by the delay between
data and clock pulse arrivals), JTLs can be added after
the Merger cell.
concurrently. To highlight this difference, we color
the “Init” node in the Dcf Mealy machine gray. EVALUATION
Area and latency results for each of these opera- For the evaluation of the proposed logic scheme and
tors are provided in Table 3. The shown estimates are methodology, we build, simulate, and measure the per-
based on our WRSPICE simulations using the MIT-LL formance of various superconducting temporal accel-
SFQ5ee 10-kA/cm2 process. Note that one of the goals erators. A detailed description of our race trees4
of this article is to repurpose existing superconducting implementation follows. Interested readers can find
cells where possible, like those from Stony Brook.8 additional designs and information in the original
paper9 and our github repository: https://github.com/
UCSBarchlab/Superconducting-Temporal-Logic.
Self-Clocked Temporal SFQ Circuits
In contrast to more traditional superconducting
approaches, where each gate is synchronous and fine-
Experimental Setup and Design
grained clocking is necessary, the proposed scheme Principles
relies on asynchronous operations. That is, even in the We perform circuit simulations and analysis in both
cases of FA and Is , where synchronous cells are used, the open-source WRSPICE and Cadence Spectre plat-
a data signal is routed to their “clock” port. forms using MIT-Lincoln Lab’s SFQ5ee 10-kA/cm2 pro-
However, the use of synchronous components may cess node. The layout is completed in Cadence
sometimes be beneficial. For example, although Virtuoso version 16.1. For gate isolation, delay balanc-
Coincidence—which gets satisfied if and only if both f ing, and interconnects implementation, JTLs are used.
and y arrive within the same interval—can be built For fan out, we use Splitters—denoted by s in the
from the above-described operations,3 a more effi- figure that follows. Finally, we define minimum Dt in a
cient implementation is possible: all that is needed for way that is similar to finding the cycle-time in wave-
its realization is a synchronous AND gate. pipelined architectures.
To avoid costly clock trees and the clock skew
problems that come with them, we propose a data- Proof-of-Concept Design: Race Trees
driven self-timing (DDST) scheme—as shown in In the case of race trees (ensembles of decision trees
Figure 3. In a DDST system, timing information is car- implemented in race logic), each tree node is consid-
ried by data. More specifically, the required clock sig- ered an independent temporal threshold function and
nal is generated locally by a logical OR function realized with a single Inhibit operator. Figure 4(ii)
between the data lines, which in the provided example shows the block diagram of an SFQ race tree corre-
sponding to the decision tree seen in Figure 4(i). For
the construction of NOT gates, which are required for
the implementation of the label decoder, Inhibits and
an upper bound reference signal tub are used. This ref-
erence signal denotes the end of a specific time inter-
val of interest (directly related to the inputs resolution
in this case); thus, NOT will fire at t ¼ tub if and only if
FIGURE 3. Proposed DDST scheme. The clock signal can be the gate has received no input spikes from time refer-
locally generated from input data at each gate. If no input ence 0 until that moment.
Figure 4(iii) and (iv) provides simulation results and
pulse arrives, it is safe to assume the operator idle; thus no
a layout diagram of this design. In the shown example,
clock pulse is required.
x and y are set to 4 and 1, and tub is equal to 5. As

May/June 2021 IEEE Micro 75


TOP PICKS

FIGURE 4. Panel (i): Decision tree with three nodes. Panel (ii): Block diagram of an SFQ race tree. Panel (iii): Simulation results for
x ¼ 4 and y ¼ 1. Panel (iv): Layout diagram (unlabeled JJs are used for interconnection, splitting, routing, and testing purposes).

TABLE 4. Estimated latency results for hardwired race trees important role in the future of high-performance com-
in both CMOS (f = 1 GHz)4 and SFQ (Dt ¼ 25 ps). puting. While there are certainly challenges remaining,
the high-speed and ultralow energy operation of
Depth Input CMOS SFQ Improvement superconductor electronics makes them a promising
res. Latency Latency candidate for this new era.
6 4 bits 17 ns 0.464 ns 37 Continuous and extended effort, mostly at the
6 8 bits 257 ns 6.464 ns 40 device level, has already carried the superconducting
8 4 bits 17 ns 0.490 ns 35
field from the first fabricated JJ in 1970 through the
development of RSFQ logic in 1985 to chips with densi-
8 8 bits 257 ns 6.490 ns 40
ties on the order of several million JJs today. With the
realization of self-shunted JJs in 2017,11 chips with
10 M JJs are now in sight. The excitement around
expected, the final outcome is Label C. The total quantum computing further drives the demand for
latency, for Dt ¼ 25 ps, is 150 ps and the design con- improvement in superconducting circuit fabrication.
sists of 164 JJs; 72 JJs for logic elements, 24 JJs for But these efforts to advance superconducting-based
Splitters, and 68 JJs for JTLs. The number of JJs in computation also highlight a fundamental mismatch
the layout is greater than 164 because of the addi- between the computational abstractions provided by
tional cost of routing and the inclusion of testing cir- these pulse-based devices and the core “logic” we
cuitry. More results and a comparison with CMOS can often assert prematurely as the basis of any comput-
be found in Table 4. ing system.
A change in the underlying technology can dis-
rupt convention and necessitate a complete ret-
CONCLUSIONS AND FUTURE hinking of computers from logic to circuits to
IMPACT architecture. While it remains to be seen if such a
CMOS scaling has been a driving factor for computing radical reconsideration (of computing) is necessary,
technology for many decades and the transistors it is clearly far more likely now than at any other
manufactured today are multiple orders of magnitude time since the creation of the first integrated cir-
more efficient, compact, and less expensive than cuits. The approach taken in this article, which
those built 30 years ago. However, as CMOS integrates ideas from languages, logic design, verifi-
approaches its thermal limits, keeping this trend going cation, circuits, and formal methods, and is info-
becomes increasingly difficult.10 Given this reality, rmed by the physical phenomena underlying the
post-Moore technologies are likely to play an operation of these novel devices, is a useful model

76 IEEE Micro May/June 2021


TOP PICKS

for exploring other approaches to computation ACKNOWLEDGMENTS


as well. This work was supported by the National Science Foun-
The versatility of this method is a broader vision for
dation under Grant 1763699, Grant 1730309, Grant
post-Moore technological evaluation than is realized 1717779, and Grant 1563935. The research conducted at
here. However, in line with this objective, we do specif- LBNL was supported by the Director, Office of Science of
ically propose and evaluate the idea that a radically the U.S. Department of Energy under Contract DE-AC02-
more efficient foundation for computing lies at the 05CH11231. We would like to thank James E. Smith, David
intersection of superconductor electronics and delay- Donofrio, Dmitri Strukov, Alexander Wynn, Isaac Mackey,
coded computation. In particular, we claim that the and the anonymous reviewers for their helpful comments.
natural language for expressing computations in
superconducting is one that can precisely and effi-
ciently describe the temporal relationships between
SFQ pulses. To support this argument, we present a REFERENCES
functional extension to LTL that provides the needed 1. D. S. Holmes, A. M. Kadin, and M. W. Johnson,
abstractions to formally capture the capabilities of “Superconducting computing in large-scale hybrid
computing in the time domain; we show how existing systems,” Computer, vol. 48, pp. 34–42, Dec. 2015,
RSFQ elementary logic cells can be repurposed to real- doi: 10.1109/MC.2015.375.
ize the desired temporal operations; and we develop 2. A. Madhavan, T. Sherwood, and D. Strukov, “Race logic:
architectures that can safely leverage the extremely A hardware acceleration for dynamic programming
tight timing of superconducting circuits while avoiding algorithms,” SIGARCH Comput. Archit. News, vol. 42,
the clocks and memories that shackle more incremen- pp. 517–528, Jun. 2014, doi: 10.1145/2678373.2665747.
tal approaches. 3. J. E. Smith, “Space-time algebra: A model for
neocortical computation,” in Proc. 45th Annu. Int.
Symp. Comput. Archit., 2018, Art. no. 289–300,
THE CONNECTION THAT THIS ARTICLE
doi: 10.1109/ISCA.2018.00033.
ESTABLISHES BETWEEN PREDICATE 4. G. Tzimpragos, A. Madhavan, D. Vasudevan, D. Strukov,
TEMPORAL LOGICS AND COMPUTING and T. Sherwood, “Boosted race trees for low energy
PROVIDES A FRESH PERSPECTIVE ON classification,” in Proc. 24th Int. Conf. Archit. Support
THE DEVELOPMENT OF Program. Lang. Operating Syst., 2019, pp. 215–228,
ABSTRACTIONS THAT CAN LEVERAGE doi: 10.1145/3297858.3304036.
THE UNIQUE PROPERTIES OF 5. S. K. Tolpygo, “Superconductor digital electronics:
SUPERCONDUCTOR ELECTRONICS Scalability and energy efficiency issues,” Low Temp.
AND ARE SEMANTICALLY CLOSE TO Phys., vol. 42, no. 5, pp. 361–379, 2016, doi: 10.1063/
THE THEORIES USED BY MODEL 1.4948618.
6. M. Benedetti and A. Cimatti, “Bounded model checking
CHECKERS AND FORMAL ANALYSIS
for past LTL,” in Tools and Algorithms for the
TOOLS.
Construction and Analysis of Systems, H. Garavel and J.
Hatcliff, Eds. Berlin, Germany: Springer, 2003, pp. 18–33,
doi: 10.1007/3-540-36577-X_3.
7. F. Laroussinie, N. Markey, and P. Schnoebelen,
Looking forward, we expect a growing interest in “Temporal logic with forgettable past,” in Proc. 17th IEEE
emerging technologies, less-traditional computing Annu. Symp. Logic Comput. Sci., 2002, pp. 383–392,
paradigms, and device-specific architectures. We see doi: 10.1109/LICS.2002.1029846.
this work as an important step toward the transition 8. K. K. Likharev and V. K. Semenov, “RSFQ logic/memory
from strictly Boolean circuits and the von Neumann family: A new Josephson-junction technology for sub-
computer model to “languages” and designs that bet- terahertz-clock-frequency digital systems,” IEEE Trans.
ter exploit the unique characteristics of post-Moore Appl. Supercond., vol. 1, no. 1, pp. 3–28, Mar. 1991,
electronics. Moreover, the connection that this article doi: 10.1109/77.80745.
establishes between predicate temporal logics and 9. G. Tzimpragos et al., “A computational temporal
computing provides a fresh perspective on the devel- logic for superconducting accelerators,” in Proc.
opment of abstractions that are semantically close to 25th Int. Conf. Archit. Support Program. Lang.
the theories used by model checkers and formal analy- Operating Syst., 2020, pp. 435–448, doi: 10.1145/
sis tools. 3373376.3378517.

May/June 2021 IEEE Micro 77


TOP PICKS

10. V. V. Zhirnov, R. K. Cavin, J. A. Hutchby, and G. I. NESTAN TSISKARIDZE is currently a Research Engineer with
Bourianoff, “Limits to binary logic switch scaling—A the Department of Computer Science, Stanford University,
Gedanken model,” Proc. IEEE, vol. 91, no. 11,
Stanford, CA, USA. From 2011 to 2019, she was a Researcher
pp. 1934–1939, 2003, doi: 10.1109/JPROC.2003.818324.
with Princeton University, Princeton, NJ, USA, the University of
11. S. K. Tolpygo et al., “Developments toward a 250-nm,
Iowa, Iowa City, IA, USA, and the University of California, Santa
fully planarized fabrication process with ten
superconducting layers and self-shunted Josephson Barbara, Santa Barbara, CA, USA. Her research work focuses
junctions,” in Proc. 16th Int. Supercond. Electron. Conf., on developing and adapting cutting-edge formal verification
Jun. 2017, pp. 1–3, doi: 10.1109/ISEC.2017.8314189. techniques to automate the design and analysis of hardware
and software systems. Tsiskaridze received a Ph.D. degree in
GEORGIOS TZIMPRAGOS is currently working toward a Ph.D. computer science from the University of Manchester, Man-
degree with the Department of Computer Science, University chester, U.K. Contact her at nestan@stanford.edu.
of California, Santa Barbara, Santa Barbara, CA, USA, and is a
Research Affiliate with Lawrence Berkeley National Labora-
GEORGE MICHELOGIANNAKIS is currently a Research Sci-
tory, Berkeley, CA, USA. His research interest includes com-
entist with Computer Architecture Group, Lawrence Berkeley
puter architecture, currently focusing on emerging computing
National Laboratory, Berkeley, CA, USA. He has extensive
paradigms and technologies for energy-efficient processing—
experience in on- and off-chip networking and computer
ranging from tiny embedded systems to exotic supercom-
architecture. His current research work focuses on emerging
puters. Tzimpragos received an M.S. degree in electrical and
technologies and 3-D integration in the post-Moore era, as
computer engineering from University of California, Davis,
well as optics and architecture for HPC and datacenter net-
Davis, CA, USA. His alma mater is the National Technical Uni-
works. Michelogiannakis received a Ph.D. degree from Stan-
versity of Athens, Greece. He is the corresponding author of
ford University, Stanford, CA, USA, in 2012. He is a Member of
this article. Contact him at gtzimpragos@cs.ucsb.edu.
IEEE and ACM. Contact him at mihelog@lbl.gov.

JENNIFER VOLK is currently working toward a Ph.D. degree


with the Department of Electrical and Computer Engineering, ADVAIT MADHAVAN is currently a Faculty Specialist with the
University of California, Santa Barbara, Santa Barbara, CA, University of Maryland, College Park, MD, USA, affiliated with
USA. Her research interests include the exploitation of cir- the National Institute of Standards and Technology. His
cuit-based quirks for architectural advancements in novel research interests include novel encoding schemes, emerging
technologies, as well as the exploration of tradeoffs between technologies and architectures, and CMOS VLSI aimed toward
area, energy, robustness, and sustainability. Volk received a building next generation computing systems. Madhavan
B.S. degree in electrical engineering from the University of received a Ph.D. degree from the University of California, Santa
California, Santa Cruz, Santa Cruz, CA, USA. She is a Member Barbara, Santa Barbara, CA, USA, in 2016. He is a Member of
of IEEE and ACM. Contact her at jevolk@ucsb.edu. IEEE. Contact him at advait.madhavan@nist.gov.

JOHN SHALF is currently with the Department Head for


DILIP VASUDEVAN is currently a Research Scientist with the Computer Science, Lawrence Berkeley National Laboratory,
Computer Architecture Group, Lawrence Berkeley National Berkeley, CA, USA, and was formerly the Deputy Director of
Laboratory, Berkeley, CA, USA. His research interests include Hardware Technology for the DOE Exascale Computing Proj-
post-Moore architectures, reconfigurable spintronic devices, ect and the Leader of the Green Flash Project. He is a coau-
superconductor electronics, and neuromorphic computing. thor of more than 90 publications in the field of parallel
Vasudevan received a B.E. degree in electronics and communi- computing software and HPC technology and corecipient of
cations engineering from the University of Madras, Chennai, three best paper awards. Before joining Lawrence Berkeley
India, an M.S. degree in computer systems engineering from the National Laboratory in 2000, he was with the National Center
University of Arkansas, Fayetteville, AR, USA, and a Ph.D. degree for Supercomputing Applications, University of Illinois, and
in informatics (computer engineering) from the University of was a Visiting Scientist with the Max-Planck-Institute for
Edinburgh, Edinburgh, U.K. He is a Professional Member of Gravitational Physics/Albert Einstein Institute, Potsdam, Ger-
ACM. Contact him at dilipv@lbl.gov. many. Contact him at jshalf@lbl.gov.

78 IEEE Micro May/June 2021


TOP PICKS

TIMOTHY SHERWOOD is currently a Professor of Computer computer science and engineering from the University of
Science with the University of California, Santa Barbara, California, Davis, Davis, CA, USA, in 1998, and M.S. and Ph.D.
Santa Barbara, CA, USA, specializing in the development of degrees in computer science and engineering from the Uni-
processors that exploit novel technologies (e.g., supercon-
versity of California San Diego, San Diego, CA, USA, in 2003.
ductors and memristors), provable properties (e.g., security,
He is a recipient of the Northrop Grumman Teaching Excel-
privacy, and correctness), and hardware-accelerated algo-
lence Award and the ACM SIGARCH Maurice Wilkes Award,
rithms (e.g., high-throughput scanning and new logic repre-
sentations). In 2013, he cofounded Tortuga Logic to bring an ACM Distinguished Scientist, and corecipient of 17 differ-

rich security analysis to hardware and embedded system ent “best paper” and “top pick” article awards. Contact him
design processes. Sherwood received a B.S. degree in at sherwood@cs.ucsb.edu.

May/June 2021 IEEE Micro 79


THEME ARTICLE: TOP PICKS

A Next-Generation Cryogenic Processor


Architecture
Ilkwon Byun , Dongmoon Min , Gyuhyeon Lee , Seongmin Na, and Jangwoo Kim, Department of Electrical
Computer Engineering, Seoul National University, Seoul, 08826, South Korea

Cryogenic computing can achieve high performance and power efficiency by


dramatically reducing the device’s leakage power and wire resistance at low
temperatures. Recent advances in cryogenic computing focus on developing
cryogenic-optimal cache and memory devices to overcome memory capacity,
latency, and power walls. However, little research has been conducted to develop a
cryogenic-optimal core architecture even with its high potentials in performance,
power, and area efficiency. In this article, we first develop CryoCore-Model, a
cryogenic processor modeling framework that can accurately estimate the
maximum clock frequency of processor models running at 77 K. Next, driven by the
modeling tool, we design CryoCore, a 77 K-optimal core microarchitecture to
maximize the core’s performance and area efficiency while minimizing the cooling
cost. The proposed cryogenic processor architecture, in this article, achieves the
large performance improvement and power reduction and, thus, contributes to the
future of high-performance and power-efficient computer systems.

H
igh-performance computing and datacenter and, thus, build much faster computing devices. For
industries always require the fastest and the example, the copper’s resistivity at 77 K is six times
most power-efficient computer systems. lower than the resistivity at 300 K. As a result, cryo-
However, due to the prohibitively increasing CPU genic computing can be the breakthrough of the
power with clock frequency (i.e., power wall problem) power wall and memory wall problems and, thus, sig-
and the increasing performance gap between the pro- nificantly improve the computer’s performance.
cessors and memory devices (i.e., memory wall prob- Recent advances in cryogenic computing mainly
lem), the server performance and power efficiency focus on developing cryogenic-optimal cache and mem-
have not meaningfully improved nowadays. ory devices to overcome the memory wall problem. For
Cryogenic computing, which aims to run com- example, our cryogenic memory research1 proposed the
puter devices at extremely low temperatures (e.g., cryogenic-optimal dynamic random access memory
196  C; 77 K), has emerged as a highly promising (DRAM) design, which is 3.80 times faster than the con-
solution to resolve the power wall and memory wall ventional DRAM with the less power consumption. In
problems. Cryogenic computing allows us to increase
addition, our cryogenic cache research2 proposed the
the server’s performance and power efficiency as fol-
cryogenic-optimal cache design, which provides twice
lows. First, with the voltage scaling enabled by the
higher speed with twice larger capacity while achieving
almost eliminated leakage current, we can increase
34.1% of power reduction.
the clock frequency while achieving the much lower
Therefore, it is straightforward to develop the cryo-
power consumption. Second, as the wire resistivity
genic-optimal core design as a next step to overcome
linearly decreases with the temperature, we can cor-
the power wall problem. However, despite its huge
respondingly benefit from the reduced wire latency
potential, little research has been conducted to
develop the cryogenic-optimal processors. Note that
once the cryogenic-optimal core becomes available, it
0272-1732 ß 2021 IEEE will also take full advantage of cryogenic-optimal
Digital Object Identifier 10.1109/MM.2021.3070133
Date of publication 31 March 2021; date of current version
cache and memory devices, which leads to a full-
25 May 2021. cryogenic computer system.

80 IEEE Micro Published by the IEEE Computer Society May/June 2021


TOP PICKS

(Ileak )] at cryogenic temperatures by taking a MOSFET


model card as an input. The model card is a set of low-
level MOSFET variables related to the fabrication pro-
cess (e.g., gate-oxide thickness and doping concentra-
tion). Cryo-MOSFET predicts both Ion and Ileak at low
temperatures by automatically adjusting the three
most temperature-dependent MOSFET variables [i.e.,
effective carrier mobility (meff ), saturation velocity
(vsat ), and threshold voltage (Vth )].
To build the MOSFET model, we utilize our previous
model1 as a baseline and improve its accuracy at
smaller technology nodes. The model extension is
based on the industry-provided MOSFET data, which
show that the temperature dependence of meff and vsat
significantly changes with the MOSFET’s gate length.

Wire Model
Cryo-wire generates the on-chip wire characteristic
(i.e., wire resistivity) at low temperatures, based on the
FIGURE 1. (a) Cryogenic processor model (i.e., CC-Model) given metal layer’s geometry information (i.e., wire
overview and (b) validation setup. width and height). The wire resistivity is mainly deter-
mined by the three physical mechanisms: geometry-
independent scattering (rbulk ), grain-boundary scatter-
Our article, published at ISCA’20,3 presents the
ing (rgb ), and surface scattering (rsf ). rbulk depends only
first architecture exploration and proposes a fast,
on the temperature, whereas rgb and rsf are primarily
dense, and power-efficient cryogenic-optimal proces-
determined by the wire geometry. For the accurate pre-
sor architecture running at 77 K. To achieve the goal,
diction, we model both temperature and geometry
the paper tackles following three critical challenges:
dependency based on the previous works.4,5
1) absence of a cryogenic-core performance model;
2) unclear direction for the architecture optimization; Processor Model
3) lack of the cryogenic-optimal core design. Cryo-pipeline predicts the critical-path delay of each
pipeline stage at cryogenic temperatures, by utilizing
the low-temperature MOSFET and wire properties
CRYOGENIC PROCESSOR
from cryo-MOSFET and cryo-wire, respectively. In
MODELING FRAMEWORK
addition, to help the detailed reasoning and analysis,
As the first step to enable the architecture-level study
cryo-pipeline can decompose each critical-path delay
for the cryogenic processor, we build a modeling tool
into the transistor and the wire delay portions. There-
to predict the performance of processors running at
fore, using our modeling tool, architects can predict
77 K. Figure 1(a) illustrates the overview of our vali-
the frequency speed-up at cryogenic temperatures
dated cryogenic processor modeling framework, Cryo-
and analyze how the low temperatures affect the
Core-Model (CC-Model). CC-Model takes a target
delay of each pipeline stage in detail.
processor design and its fabrication-process informa-
For implementing cryo-pipeline, we utilize Syn-
tion (i.e., MOSFET model card, wire height, and width)
opsys Design Compiler Topographical Mode.6 Design
as inputs, and then predicts the peak clock frequency
Compiler can synthesize a Verilog design based on the
for a wide range of temperatures including 77 K. For
logical library (i.e., transistor information) and the
this purpose, we build CC-Model with three submo-
physical library (i.e., metal-layer information) while
dels: 1) MOSFET model (cryo-MOSFET), 2) wire model
reporting the critical-path delay of each stage. By
(cryo-wire), and 3) processor model (cryo-pipeline).
using Design Compiler, cryo-pipeline first synthesizes
a processor layout for the target processor design
MOSFET Model written in Verilog. With the processor layout, cryo-
Cryo-MOSFET predicts the major MOSFET character- pipeline then derives the critical-path delay of each
istics [i.e., on-channel current (Ion ), leakage current pipeline stage at 77 K by taking the logical and

May/June 2021 IEEE Micro 81


TOP PICKS

physical libraries generated from our MOSFET and


wire models. Finally, cryo-pipeline decomposes the
critical-path delay into the transistor and wire delay
portions by applying the no-wire option to Design
Compiler.

Model Validation
We fully validate CC-Model’s accuracy by validating
each submodel. First, we validate our MOSFET model
by comparing it with the HSPICE simulation results
obtained from our industry-provided MOSFET model
card. The industry model card was prevalidated by
FIGURE 2. Our analysis with two reference core models.
actual measurements for the 77–300 K temperature
range. Second, we validate our wire model by compar- (a) Power consumption of hp-cores at 300 K and 77 K. (b) Peak
ing it with the measured data from the previous litera- frequency and power consumption of lp-cores at 77 K.
€ gl et al.7). Finally, we validate the
ture (e.g., Steinho
processor model by comparing its frequency predic-
tion with our measurement from the experimental Figure 2(a) shows the power consumption of hp-
setup shown in Figure 1(b). We verify that all the sub- cores operating at various temperatures and voltages.
models achieve high accuracy with low error rates (up In the graph, 300 K hp and 77 K hp indicate the
to 4.5%). processors with hp-core design running at 300 K and
77 K without voltage optimization. In our analysis, we
observe that the huge amount of dynamic power (83%
DESIGN PRINCIPLES FOR in 300 K hp) remains at 77 K, which incurs the prohibi-
CRYOGENIC PROCESSOR tively expensive cooling cost (800% in 77 K hp). To
Using our validated modeling framework, we reveal reduce the dynamic power, we can simultaneously
the necessity of microarchitecture-level optimizations decrease the Vdd and Vth levels thanks to the nearly
and introduce the design principles for 77 K-optimal eliminated leakage current at 77 K. However, even
core microarchitectures. Specifically, we emphasize with the aggressive voltage scaling, we cannot miti-
the importance of minimizing the dynamic power con- gate the huge cooling cost while maintaining the
sumption and maximizing the peak clock frequency at same performance, as shown in 77 K hp (power opt.).
the microarchitecture level. That is, we cannot realize the power efficiency of cryo-
For this purpose, we conduct performance, power, genic processors only with the voltage optimization.
and area analysis for the various core models. We Therefore, we should minimize the dynamic power at
implement the target processor designs by customiz- the microarchitectural level.
ing RISC-V BOOM.8 For the performance analysis, we
utilize CC-Model to analyze the maximum clock fre- Principle 2: Maximize the Clock
quency of processors running at 77 K. For the power Frequency
and die-area analysis, we utilize McPAT9 integrated We then emphasize the importance of achieving the
with cryo-MOSFET. We also include the power con- high frequency at the microarchitectural level. To
sumption for the cryogenic cooling, as 9.65 times of draw the principle, we conduct a case study with the
the device power consumption at 77 K.1,2,10 low-power reference core model (i.e., ARM Cortex
core; lp-core) as we highlighted the importance of the
Principle 1: Minimize the Dynamic low power consumption in the previous analysis. Lp-
Power Consumption core has the shallower and narrower pipeline with a
We first emphasize the importance of reducing the lower operation voltage and clock frequency com-
dynamic power at the microarchitectural level. To pared to hp-core.
draw the principle, we perform a case study with the Figure 2(b) shows the result of frequency and
high-performance reference core model (i.e., Intel Sky- power analysis for three lp-core designs running at
lake core; hp-core), which represents the high-perfor- 77 K. The three designs share the same core design
mance datacenter processor. Hp-core has a deep and but adopt the different voltage levels to adjust their
wide pipeline, and operates with a high voltage and frequencies. First, lp-core with the nominal voltage
clock frequency. (77 K lp) consumes less power but operates with a

82 IEEE Micro May/June 2021


TOP PICKS

TABLE 1. Hardware specifications of hp, lp, and cryocore.

Hp-core Lp-core
(i7-6700) (Cortex-A15) CryoCore

# cache load/ 4 1 1
store ports
Pipeline width 8 4 4
Load queue size 72 24 24
Store queue size 56 24 24
FIGURE 3. Optimization process for our processor designs.
Issue queue size 97 72 72
Reorder buffer 224 96 96
much lower frequency compared to 300 K hp-core. In size
other words, we should increase 77 K lp’s frequency by # physical 180 100 100
scaling its operating voltages. However, in our analy- integer registers
sis, we observe that the frequency improvement with # physical float 168 96 96
the voltage scaling is limited due to the transistor registers
speed saturation. As evidence, even though we apply Max frequency 4.0 GHz 2.5 GHz 4.0 GHz
much higher Vdd to 77 K lp (freq. opt.) and 77 K lp
Power per core 24 W 1.5 W 5.5 W
(extreme freq.), lp-core cannot achieve a higher clock (45 nm)
frequency than hp-core within the power budget. That
Core area (45 44.3 mm2 11.54 mm2 22.89
is, we cannot meaningfully improve the performance
nm) mm2
of cryogenic processors only with the voltage scaling.
Therefore, we should also maximize the clock fre- Core & L1/L2 97.51 mm2 17.51 mm2 38.89
area (45 nm) mm2
quency at the microarchitectural level.
Supply voltage 1.25 V 1.0 V 1.25 V
(Vdd )
CRYOGENIC-OPTIMAL
PROCESSOR DESIGN
Following our design principles, we present 77 K-opti- That is, we can integrate twice the number of cores
mized processor designs in terms of both performance under the same area budget.
and power efficiency. Figure 3 summarizes our optimi-
zation process in two steps: 1) microarchitecture opti-
 
mization ( 1 ) and voltage scaling ( 2 ). Vdd and Vth Scaling
Based on CryoCore, we derive two 77 K-optimal pro-
cessors by applying Vdd and Vth scaling. We explore a
Microarchitecture Optimization large number of design points with the different Vdd
We follow the two identified principles and propose and Vth levels, and obtain the power-frequency Par-
our cryogenic-optimal core microarchitecture, called eto-optimal curve, as shown in Figure 3. Among the
CryoCore. Table 1 illustrates CryoCore compared with optimal design points, we choose the two representa-
hp-core and lp-core. First, to achieve the high peak- tive designs: 1) the power-optimal design (cryogenic
frequency at the microarchitectural level, CryoCore low-power core; CLP-core) and 2) the frequency-opti-
adopts the deep pipeline and high operating voltage mal design (cryogenic high-performance core; CHP-
of hp-core. Second, to minimize the dynamic power core). CLP-core is the power-minimized processor
consumption at the microarchitectural level, we take design whose performance is the same as that of
the smaller (or fewer) microarchitecture units and, 300 K hp-core. On the other hand, CHP-core is the fre-
thus, the narrow-width pipeline of lp-core. With these quency-maximized design whose total power con-
optimizations, CryoCore running at 77 K (77 K Cryo- sumption (i.e., sum of the device and cooling power
Core) achieves the much higher peak-frequency and consumption) is the same as that of 300 K hp-core.
the lower power consumption compared to 300 K Note that we intentionally choose the same Vth value
hp-core. In addition, CryoCore consumes only 50% for both designs to implement them in a single hard-
area compared to hp-core, thanks to its narrow pipe- ware with dynamic voltage and frequency scaling
line and the reduced sizes of microarchitectural units. (DVFS). As a result, we achieve both the 1.5 times

May/June 2021 IEEE Micro 83


TOP PICKS

FIGURE 4. Multithread performance of various systems including the full-cryogenic computer system (i.e., CHP-core with 77 K
memory).

higher clock frequency (CHP-core) and the 97% lower


power consumption (CLP-core) with DVFS.

POTENTIAL OF FULL-CRYOGENIC
COMPUTER SYSTEMS
By expanding the cryogenic-computing coverage into FIGURE 5. Total power consumption of various processor
the CPU pipeline, our work enables the evaluation of designs including CLP-core.
full cryogenic computer systems for the first time.
With our proposed cryogenic-optimal processor, we
evaluate the performance and power consumption of Performance Improvement
the full-cryogenic computer where the CPU, cache, We first show the performance gain of the full-cryogenic
and memory devices are optimized for the 77 K computer systems. Figure 4 shows the performance of
operation. the full-cryogenic computer. The performance values
are the inverse of the execution time and normalized to
Evaluation Setup that of the baseline.
We evaluate the performance and power consumption The full-cryogenic computer systems equipped
of full cryogenic computer systems with two represen- with CHP-core and 77 K-optimized memories (i.e.,
tative cryogenic servers: 1) the performance-optimized CHP-core with 77 K memory) achieve 2.39 times of
77 K server that consumes the same total power com- speed-up on average, up to 3.41 times in blackscholes.
pared to 300 K servers, and 2) the power-optimized Note that the average speed-up of the full-cryogenic
77 K server that achieves the same performance com- computer system (139%) is clearly higher than the
pared to 300 K servers. speed-up of the systems with the cryogenic-core only
First, for the performance-optimized 77 K server, (CHP-core with 300 K memory; 83%) or the cryogenic-
we utilize CHP-core equipped with CryoCache2 and memory only (300 K hp-core with 77 K memory; 21%).
CLL-DRAM.1 CryoCache achieves twice larger capac-
More importantly, the full system’s speed-up is even
ity by using 3T-eDRAM technology at 77 K while
much higher than the aggregated speed-up of 77 K
achieving twice faster access speed compared to
core only and 77 K memory only (104%), which indi-
the conventional static random access memory
cates the synergetic effect of the cryogenic core and
(SRAM-based) 300 K caches. CLL-DRAM is the
memory system together.
latency-optimized 77 K DRAM that is 3.80 times faster
In summary, when designing the full-cryogenic
than the conventional 300 K DRAM.
computer to maximize the performance, architects
Next, for the power-optimized 77 K server, we eval-
can achieve up to 3.41 times of the speed-up within
uate the power consumption of CLP-core. As the pre-
the same power budget, even including the huge cool-
vious works1,2 already show the significant power
ing cost.
reduction of cryogenic memories (i.e., 34.1% and 9.2%
reduction for the 77 K-optimal cache and DRAM,
respectively), we focus on the power reduction of the Power Reduction
CPU pipeline in this article. Next, we evaluate the power consumption of CLP-
In our evaluation, we utilize Intel i7-6700 equipped core. Figure 5 shows the power consumption of vari-
with commodity DDR4-2400 DRAM for the baseline ous processor designs including CLP-core. The power
design. For the simulation, we run 10 PARSEC 2.1 work- values are normalized to the baseline 300 K hp-core’s
loads11 with gem5 simulator12 and McPAT.9 power consumption. As CryoCore is twice denser

84 IEEE Micro May/June 2021


TOP PICKS

than 300 K hp-core, we doubled the power consump- performance or the 37.5% of power reduction, which
tion of CryoCore and CLP-core for the conservative has not been seen in other computer-architectural
comparison. studies for the general-purpose computing. Thanks
The total power consumption of CLP-core to the large potential, we believe that cryogenic
(62.5%) is much lower than the 300 K hp-core’s computing will be considered as a practical and
power consumption. First, by adopting our proposed promising candidate for future computer systems,
77 K-optimal microarchitecture design, CLP-core especially for the performance-critical and power-
reduces the device power consumption by 71%. hungry HPC and datacenter industries. We are also
Next, with the low-power-targeted voltage scaling, convinced that the significant potential will moti-
CLP-core further reduces the device power con- vate architects to engage in the cryogenic comput-
sumption. As a result, CLP-core consumes 94.1% ing research area.
less power than the 300 K baseline. Even including
the huge cooling cost at 77 K (i.e., 9.65 times of the First Architectural Guidelines of
device power consumption10), CLP-core reduces the Cryogenic Computing
total power consumption by 37.5% compared to the This article is also the first to provide fundamental
300 K baseline. guidelines to build cryogenic hardware and architecture.
To run the CPU core with the highest performance and
power efficiency, we propose the 77 K-targeted design
guidelines, which are generally applicable to any types
THIS ARTICLE IS THE FIRST TO SHOW
of hardware components. Our architecture guidelines
THE POTENTIAL OF FULL-CRYOGENIC
will inspire follow-up studies to design and develop vari-
COMPUTER SYSTEMS THAT
ous cryogenic-optimal computer devices, especially
INTEGRATE 77 K OPTIMAL PIPELINE, compute-intensive hardware, such as CPU, GPU, and
CACHE, AND DRAM DEVICES. accelerators.

First Widely Applicable and Accurate


Note that CLP-core achieves the same single- Modeling Framework
thread performance with doubled throughput thanks Finally, this article is also the first to propose an accu-
to the twice higher core density. That is, architects rate and widely applicable modeling framework for
can achieve the same single-core performance and cryogenic computing. As our framework is developed
doubled throughput with the 37.5% lower power con- on top of the commercial-grade digital design tool (i.e.,
sumption by using CLP-core. Design Compiler of Synopsys), it can model any type
of computer device written in Verilog. In addition, as
we carefully validate the tool with the actual measure-
IMPACT OF OUR WORK ments and industry data, the framework guarantees
We believe that our work will have various long-term high accuracy. We believe that our modeling frame-
impacts to initiate the active research and develop- work will contribute to future cryogenic computer
ment of cryogenic computer systems. We summarize architecture exploration.
the major impacts of our work as follows:

1) the impact on the next-generation computer ACKNOWLEDGMENTS


systems; This work was supported in part by the Samsung Elec-
2) the first microarchitectural guidelines of cryo- tronics, SK Hynix, National Research Foundation of
genic computing; Korea (NRF) funded by the Korean Government under
3) the first widely applicable and accurate model- Grant NRF-2019R1A5A1027055, NRF-2019-Global Ph.D.
ing framework for cryogenic computing. Fellowship Program, and Grant NRF-2021R1A2C3014131;
in part by the Institute of Information & Communica-
Impact on the Next-Generation tions Technology Planning & Evaluation funded by the
Computer Systems Korean Government under Grant 1711080972; in part by
This article is the first to show the potential of full- the Creative Pioneering Researchers Program through
cryogenic computer systems that integrate 77 K-opti- Seoul National University; and in part by the Automation
mal pipeline, cache, and DRAM devices. The full- and Systems Research Institute (ASRI) and Inter-univer-
cryogenic computer achieves the 3.41 times higher sity Semiconductor Research Center at Seoul National

May/June 2021 IEEE Micro 85


TOP PICKS

University. Ilkwon Byun and Dongmoon Min contributed 12. N. Binkert et al., “The gem5 simulator,” ACM SIGARCH
equally to this work. Comput. Archit. News, vol. 39, no. 2, pp. 1–7, 2011, doi:
10.1145/2024716.2024718.

REFERENCES
1. G.-H. Lee, D. Min, I. Byun, and J. Kim, “Cryogenic ILKWON BYUN is currently working toward a Ph.D. degree
computer architecture modeling with memory-side with the Department of Electrical and Computer Engineer-
case studies,” in Proc. 46th Int. Symp. Comput. Archit., ing, Seoul National University, Seoul, South Korea. His
2019, pp. 774–787, doi: 10.1145/3307650.3322219. research interests include architecting cryogenic CMOS and
2. D. Min, I. Byun, G.-H. Lee, S. Na, and J. Kim, “CryoCache: superconductor-based computer systems by using com-
A fast, large, and cost-effective cache architecture for puter architecture modeling and simulation techniques. He
cryogenic computing,” in Proc. 25th Int. Conf. Archit.
is a Student Member of IEEE and ACM. Contact him at
Support Programm. Lang. Oper. Syst., 2020,
ik.byun@snu.ac.kr.
pp. 449–464, doi: 10.1145/3373376.3378513.
3. I. Byun, D. Min, G.-H. Lee, S. Na, and J. Kim, “CryoCore: A
fast and dense processor architecture for cryogenic DONGMOON MIN is currently working toward a Ph.D. degree
computing,” in Proc. ACM/IEEE 47th Annu. Int. Symp. with the Department of Electrical and Computer Engineering,
Comput. Archit., 2020, pp. 335–348, doi: 10.1109/ Seoul National University, Seoul, South Korea. His current
ISCA45697.2020.00037. research interests include cryogenic computing for both
4. R. A. Matula, “Electrical resistivity of copper, gold, CMOS-based and superconductor-based computer systems.
palladium, and silver,” J. Phys. Chem. Reference Data, He is a Student Member of IEEE and ACM. Contact him at
vol. 8, no. 4, pp. 1147–1298, 1979, doi: 10.1063/1.555614.
dongmoon.min@snu.ac.kr.
5. C.-K. Hu et al., “Future on-chip interconnect
metallization and electromigration,” in Proc. IEEE Int.
Rel. Phys. Symp., 2018, pp. 4F.1-1–4F.1-6, doi: 10.1109/ GYUHYEON LEE is currently working toward a Ph.D. degree
IRPS.2018.8353597. with the Department of Electrical and Computer Engineering,
6. Synopsys, “Synopsys DC ultra,” 2019. [Online]. Available:
Seoul National University, Seoul, South Korea. His research
https://www.synopsys.com/implementation-and-
interests include performance modeling methodologies for
signoff/rtl-synthesis-test/dc-ultra.html
€ gl, G. Schindler, G. Steinlesberger, both conventional and cryogenic computer architectures. He
7. W. Steinho
M. Traving, and M. Engelhardt, “Comprehensive study is a Student Member of IEEE and ACM. Contact him at
of the resistivity of copper wires with lateral guhylee@snu.ac.kr.
dimensions of 100 nm and smaller,” J. Appl. Phys., vol.
97, no. 2, 2005, Art. no. 023706, doi: 10.1063/1.1834982.
SEONGMIN NA is currently working toward a Ph.D. degree
8. C. Celio, D. A. Patterson, and K. Asanovic, “The Berkeley
out-of-order machine (BOOM): An industry- with the Department of Electrical and Computer Engineering,
competitive, synthesizable, parameterized RISC-V Seoul National University, Seoul, South Korea. His current
processor,” EECS Dept., Univ. California, Berkeley, CA, research interests include hardware optimization techniques
USA, Tech. Rep. UCB/EECS-2015-167, 2015. in various areas including cryogenic computing and deep
9. S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, learning acceleration. He is a Student Member of IEEE and
D. M. Tullsen, and N. P. Jouppi, “McPAT: An integrated ACM. Contact him at seongmin.na@snu.ac.kr.
power, area, and timing modeling framework for
multicore and manycore architectures,” in Proc.
42nd Annu. IEEE/ACM Int. Symp. Microarchit., 2009, JANGWOO KIM is currently a Professor with the Department
pp. 469–480, doi: 10.1145/1669112.1669172. of Electrical and Computer Engineering, Seoul National Uni-
10. Y. Iwasa, Case Studies in Superconducting Magnets: versity, Seoul, South Korea. His primary research interests
Design and Operational Issues. Berlin, Germany:
include computer architecture, server and datacenter, sys-
Springer, 2009.
tem modeling, and intelligent systems. Kim received a Ph.D.
11. C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC
degree in electrical and computer engineering from Carnegie
benchmark suite: Characterization and architectural
implications,” in Proc. 17th Int. Conf. Parallel Archit. Mellon University, Pittsburgh, PA, USA, in 2008. He is a Mem-
Compilation Techn., 2008, pp. 72–81, doi: 10.1145/ ber of IEEE and ACM. He is the corresponding author of this
1454115.1454128. article. Contact him at jangwoo@snu.ac.kr.

86 IEEE Micro May/June 2021


THEME ARTICLE: TOP PICKS

Balancing Specialized Versus Flexible


Computation in Brain–Computer Interfaces
Ioannis Karageorgos and Karthik Sriram , Yale University, New Haven, CT, 06520, USA
n Vesely , Rutgers—The State University of New Jersey, New Brunswick, NJ, 08901-8554, USA, and Yale
Ja
University, New Haven, CT, 06520, USA
Nick Lindsay and Xiayuan Wen, Yale University, New Haven, CT, 06520, USA
Michael Wu, Rutgers—The State University of New Jersey, New Brunswick, NJ, 08901-8554, USA
Marc Powell, University of Pittsburgh, Pittsburgh, PA, USA
David Borton, Brown University, Providence, RI, 02912, USA
Rajit Manohar and Abhishek Bhattacharjee, Yale University, New Haven, CT, 06520, USA

We are building HALO, a flexible ultralow-power processing architecture for


implantable brain– computer interfaces (BCIs) that directly communicate with
biological neurons in real time. This article discusses the rigid power, performance,
and flexibility tradeoffs that BCI designers must balance, and how we overcome
them via HALO’s palette of domain-specific hardware accelerators, general-
purpose microcontroller, and configurable interconnect. Our evaluations using
neuronal data collected in vivo from a nonhuman primate, along with full-stack
algorithm to chip codesign, show that HALO achieves flexibility and superior
performance per watt versus existing implantable BCIs.

B
y enabling direct brain-computer communica- with Neuralink, Kernel, Neuropace, and Medtronic to
tion, brain–computer interfaces (BCIs) can build BCIs that read/stimulate an ever-increasing
accelerate the process of scientific discovery, number of biological neurons with high signal fidelity.5
restore sensory capabilities, mitigate symptoms of Modern BCIs designs are of two types. While
movement disorders like Parkinson’s disease, treat some are noninvasive in the form of headsets or
pharmacologically resistant depression and anxiety, other external devices,1 invasive BCIs surgically
and even restore motor capabilities for spinal cord implanted on, around, and in the brain tissue are able
injury, brain strokes, and amyotropic lateral sclero- to record and stimulate large numbers of neurons
sis.1–3 BCIs interrogate biological neurons and decode with higher signal fidelity, spatial resolution, and
pathological behavior or the user’s intent, guiding tighter real-time characteristics.6 Low-power hard-
stimulation of the brain to mitigate seizures, control ware for onboard processing is critical to the success
prostheses, actuate assistive devices, and more. BCIs of implantable BCIs, especially because elevating tis-
have even been shown to augment human capabili- sue temperature by just 1 can damage the brain’s
ties; e.g., enhancing short-term memory capacity, cellular structure.7
monitoring attention and mental state to enhance
performance, navigating augmented realities via sig-
nals from the motor cortex, and reading signals from CHALLENGES OF BCI DESIGN
the visual cortex to infer words, pictures, and videos.4 BCI applications must read the electrophysiological activ-
Consequently, Facebook and Microsoft are competing ity of as many biological neurons as possible with high spa-
tial and temporal resolution to be useful. Modern BCIs
extract neuronal activity at data rates of 10–50 Mbps, with
Neuralink demonstrating even orders of magnitude higher
0272-1732 ß 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3065455 data rates,5 and DARPA’s NESD program targeting com-
Date of publication 11 March 2021; date of current version munication with millions of neurons.8 These large volumes
25 May 2021. of data may need to be processed in real time. For

May/June 2021 Published by the IEEE Computer Society IEEE Micro 87


TOP PICKS

TABLE 1. Existing commercial and research BCIs meet target power budgets by either restricting their scope to a single use
case, or by dropping brain–computer communication bandwidth. HALO is the first flexible implantable BCI architecture to
overcome this tradeoff.

Medtronic Neuropace Aziz Kassiri Neuralink NURIP HALO


2 2 10 2 5 11

Tasks supported
Spike detection • • • • • • @
Compression • • @ • • • @
Seizure prediction • @ • @ • @ @
Movement intent @ • • • • • @
Encryption • • • • • • @
Technical capabilities
Programmable @ Limited • @ • Limited @
Read channels 4 8 256 24 3072 32 96
Data rate (Mbps) 0.01 0.02 9.76 1.32 545 0.13 46
Safety ( < 15 mW) @ @ @ @ • @ @

example, BCIs that treat seizures must read the activity of summarizing the limitations of the current state-of-
biological neurons, process it to detect signs of a seizure the-art commercial and research BCIs.
or its imminent arrival, predict the movement of the sei-
zure through different brain regions, determine where to
apply electrical stimulus (and for how long) to mitigate sei- OUR APPROACH: THE HALO
zure symptoms, and then stimulate brain tissue.9 All of PROJECT
this must be done accurately, necessitating significant sig- An ideal BCI must be flexible as its operation may
nal processing of many neuronal channels of data, and need to be personalized, there may be multiple neuro-
quickly, within milliseconds of seizure detection.9 At the logical conditions to treat, and several brain–com-
same time, BCIs may not exceed 15 mW for safe chronic puter interactions to support. In response, we are
implantation, a target that is notoriously challenging given building HALO, a high-performance, ultralow-power,
the high data rates that BCIs must support. and flexible BCI processing architecture. HALO is a
full-stack design effort that uses electrophysiological
data collected in vivo from a nonhuman primate’s
BCI APPLICATIONS MUST READ THE motor cortex (specifically, from the regions responsi-
ELECTROPHYSIOLOGICAL ACTIVITY ble for arm and leg movement) to evaluate a BCI archi-
tecture that balances a palette of power-efficient
OF AS MANY BIOLOGICAL NEURONS
accelerators with configurable dataflow to support
AS POSSIBLE WITH HIGH SPATIAL
frequently used neural processing kernels. We are
AND TEMPORAL RESOLUTION TO BE realizing HALO via several tape-outs, with Figure 1(a)
USEFUL. illustrating a chip diagram of our HALO architecture in
a 12-nm technology, after augmenting over the 28-nm
technology from the original paper.3 Furthermore,
Designers have responded by building BCIs that Figure 1(b) shows how the processing architecture
either achieve power efficiency via specialization for a integrates with the remainder of a typical implantable
restricted set of applications/treatments for specific BCI device.
disorders in specific brain regions, or more flexible In realizing HALO, we make several research contri-
multiuse designs that achieve power efficiency but butions. First, we systematically map the design space of
only by restricting the number of neurons they read/ BCI applications to identify a list of target capabilities to
stimulate. Consequently, the modern BCI ecosystem support. Because commercial BCIs are generally single-
is fragmented, with many different single-use devices, use devices, identification of a canonical set of applica-
and lacks standardization of computational capabili- tions that more flexible BCIs should strive to support has
ties. Table 1 captures this predicament by hitherto remained unanswered. This list includes disease

88 IEEE Micro May/June 2021


TOP PICKS

FIGURE 1. Chip diagram on the left shows our HALO tape-out in a 12-nm technology. The block diagram on the right shows other key
components of implantable BCIs, including the sensors, which consists of conductive needles that penetrate millimeters of cortical tissue,
analog components, a radio, and power sources. Implantable BCIs are packaged in a hermetically fused silica capsule or titanium capsule.

treatment, signal processing, and secure transmission of neuronal data (e.g., compression and encryption of extracellular voltage streams). As BCIs are an active area of research, this list is nonexhaustive. Nevertheless, it identifies a broader set of tasks needed for a flexible BCI platform as a starting point, while also offering a viable path to integrate these tasks.

Second, we navigate a large design space of architecture and integration options to realize this list of BCI capabilities by using principled hardware–software codesign. Standard low-power design dictates that we realize one accelerator per BCI task in the form of a dedicated ASIC. We refer to this as a monolithic ASIC design, and find that such ASICs often exceed the 15 mW power budget. In response, we refactor the underlying algorithm of the original BCI tasks into distinct pieces that realize different phases of the algorithm. We refer to these pieces as kernels, and show that they facilitate design of ultralow-power hardware processing elements (PEs) via novel hardware–software codesign approaches. We round out the design with a low-power RISC-V microcontroller to configure PEs into processing pipelines and support computation for which there are currently no PEs. The result is an unconventional style of heterogeneity, where a family of accelerator PEs, each of which is identified in our chip tape-out diagram in Figure 1(a), operates in unrelated clock domains with low-power asynchronous circuit-switched communication.

Third, we devise several hardware–software codesign techniques that raise the level of abstraction of BCI design from "bits and wires" to architectural choices that take inspiration from the world of software engineering. Table 2 summarizes these techniques, which we discuss in the next section. These approaches enable HALO to achieve 4–57× and 2× lower power dissipation than software alternatives and monolithic ASICs, respectively.

TABLE 2. Overview of hardware–software codesign techniques used to realize HALO.

Technique                    Direction
Kernel PE decomposition      SW → HW
PE reuse generalization      SW → HW
PE locality refactoring      SW ← HW
Spatial reprogramming        SW ← HW
Counter saturation           SW ↔ HW
NoC route selection          SW → HW

COMPUTATIONAL TASKS SUPPORTED BY HALO
Figure 2 presents an overview of the HALO architecture. The block diagram on the left shows the PEs in our design and the configurable interconnect used to assemble PEs to realize the task pipelines shown on the right. HALO supports BCI tasks ranging from those that require real-time closed-loop support for treatment of neurological disorders to those that exfiltrate neural recordings to external systems for postprocessing and batch analysis. The first category consists of support for seizure treatment and amelioration of movement disorders. Seizure prediction/stimulation pipelines that break the neuronal feedback loops responsible for seizure severity present cutting-edge capabilities of FDA-approved clinical BCIs. So do algorithms to detect/stimulate the brain to counteract movement disorders associated with essential tremor and Parkinson's disease. HALO supports FFTs, cross-correlation, and bandpass filters over linear models to support closed-loop treatment of these neurological disorders.

FIGURE 2. HALO consists of low-power hardware PEs and a RISC-V microcontroller. The PEs are configured into pipelines to
realize tasks ranging from compression (in blue) to spike detection (in green). Optional PEs (e.g., AES encryption) are shown in
square brackets. PEs operating in parallel (e.g., FFT, XCOR, and BBF for seizure prediction) are shown in curly brackets.

The second workload category includes compression to reduce radio transmission bandwidth. Apart from some specific and well-understood forms of lossy compression—such as spike sorting—BCIs generally require lossless compression. HALO supports spike detection via the near-energy operator (NEO) PE, and also implements several lossless compression variants as their effectiveness can change across brain regions and patient activity. We support lossless LZ4 and LZMA compression, as well as discrete wavelet transform (DWT) compression. Compression ratios vary by as much as 40% depending on compression algorithm and target brain region.3

Finally, although state-of-the-art BCIs do not currently support encryption, we foresee it as necessary in future BCIs for secure data exfiltration. HIPAA, NIST, and NSA require using AES with an encryption key of at least 128 bits.

HALO ARCHITECTURE
HALO supports five tasks, and can set up two of them in multiple ways, leading to a total of eight distinct pipelines configurable by a doctor or technician. The RISC-V microcontroller is used to configure these pipelines via the programmable switches. With the conventional monolithic ASIC approach, this means that we would implement eight ASICs. However, we decompose these pipelines into the PEs of Figure 2.

Decomposing BCI Tasks Into PEs
Kernel PE Decomposition: Some BCI tasks consist of distinct computational kernels naturally amenable to PE decomposition. For example, seizure prediction combines kernels for FFT, cross-correlation (XCOR), Butterworth bandpass filtering (BBF), and a support vector machine (SVM). We realize each as a PE, as shown in Figure 2. As FFT, XCOR, and BBF have no data dependencies, they can operate in parallel. This approach saves power because XCOR contains complex computation (e.g., divisions and square roots) that scales quadratically with channel count. In contrast, BBF is a simple filter with minimal arithmetic that scales linearly with channel count. Separating XCOR and BBF into separate PEs ensures that BBF's filtering logic is clocked over an order of magnitude slower than the logic for cross-correlation.

PE Reuse Generalization: Many BCI tasks use computational kernels slightly differently. We develop configurable PEs that can be shared among BCI tasks. Consider movement intent, which can be decomposed into FFT, followed by logic that checks whether the FFT output is in a particular spectral range. We create a threshold PE (THR) to determine when a PE's output is within a specified numerical range and enable sharing of the FFT between movement intent and seizure prediction tasks. The FFT PE is configurable because movement intent requires 14–25-point FFTs to detect drops in signal power, while seizure prediction requires 1024-point FFTs.
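To make the sharing concrete, the following Python sketch (ours, not from the HALO codebase; the band edges, thresholds, and window lengths are hypothetical placeholders) shows how one configurable FFT feeding a simple threshold check could serve a movement-intent-style query, with the same FFT routine reused at a larger size for seizure prediction.

import numpy as np

def fft_power(window, n_points):
    """Configurable FFT 'PE': magnitude-squared spectrum of one channel window."""
    spectrum = np.fft.rfft(window, n=n_points)
    return np.abs(spectrum) ** 2

def thr(values, lo, hi):
    """Threshold 'PE' (THR): is the summed band power inside a numerical range?"""
    total = values.sum()
    return lo <= total <= hi

# Hypothetical movement-intent check: a small FFT, then THR on a spectral band.
window = np.random.randn(25)              # one short window of samples
power = fft_power(window, n_points=25)    # movement intent uses 14-25-point FFTs
intent = thr(power[2:6], lo=0.0, hi=5.0)  # band edges/limits are made-up values

# The same FFT routine, configured to 1024 points, feeds seizure prediction.
long_window = np.random.randn(1024)
seizure_features = fft_power(long_window, n_points=1024)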
Algorithm 1. LZMA pseudocode

function LZMA_compress_block(input)
    output = list(lzma_header)
    while data = input.get() do
        best_match = find_best_match(data)
        Prob_match = count(table_match, best_match) / count_total(table_match)
        r1 = range_encode(Prob_match)
        output.push_back(r1)
        increment_counter(table_match, best_match)
    end while
    return output
end function

Major Refactoring: PE decomposition can require significant refactoring of the original algorithm. Consider LZMA and DWTMA compression. Both algorithms use Markov (MA) chains to calculate the probability of the current input value based on observed history, which is used to pick more efficient encoding of the input signal. We found that using the

combined MA PE overshoots the 15-mW power budget. To solve this problem, we refactored the original algorithm to make it more amenable for PE decomposition. To separate algorithmic phases, we realize that data locality (i.e., following routines that manipulate major data structures) is a good indicator of kernel boundaries within programs. This observation is tied to the fact that PEs in HALO have only local memories and cannot share large amounts of data. Locality refactoring highlights how design decisions about the architecture (i.e., use of PE-local memories) guided refactoring of our algorithms.

Algorithm 1 demonstrates how we use this insight to change LZMA. The second half of this algorithm can be separated into probability calculations and frequency information updates centered around the maintenance of the core MA data structure, the frequency table (in green), as well as efficient encoding (in blue). This refactoring permits bringing together phases that operate on the same data structures within the hardware, allowing us to separate the PEs since they can now operate independently with minimal data movement. This permits clocking each component at a significantly lower frequency, leading to power savings of 2×.

PE Optimizations
Unchanged PE Output: Some of the PEs (e.g., XCOR and LZ) process data in blocks instead of samples and wait for all inputs in the block to arrive before computing in a bursty manner. Bursty computation is problematic as it requires either large buffers to sink the bursts or high PE frequency to meet data rates while sustaining periods of bursty activity. Neither is ideal from the perspective of saving power. To achieve power improvements, we spatially reprogram the original algorithm and codesign it with the hardware. Consider the XCOR PE. The original algorithm performs computation at the end once all data have been filled into the block. We refactor the algorithm to process inputs as early as they are available. The final form in Algorithm 2 reduces the amount of computation needed in the final step, as well as the number of buffers needed to store the inputs. This translates to a power savings of 2.2× over the original algorithm. This technique also extends to other PEs like LZ to achieve 1.5× power reduction.

Algorithm 2. XCOR spatial programming refactoring

function XCOR(input, output)
    // channel[][] stores input in appropriate channel location
    channel[channel_num][sample_num] = input
    // data[] stores sums of input received so far
    data[count] += input
    // data_lag[] stores sums of input till LAG
    if count * 2 == LAG then
        data_lag[count] = data[count]
    end if
    // Finish correlation computation
    if channel.filled() then
        for each i, j in channels do
            avg_i = data[i] / SIZE
            avg_j = (data[j] - data_lag[j]) / SIZE
            output.push_back(avg_i, avg_j)
        end for
        return output
    end if
end function
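As a software analogy for this spatial reprogramming (our own sketch, not the HALO implementation; SIZE and LAG are illustrative constants), the refactored version below folds each arriving sample into running sums so that only cheap averaging remains once the block is full, instead of buffering the whole block and computing everything in one burst at the end.

SIZE = 8          # samples per block (illustrative)
LAG = 4           # lag used by the lagged sums (illustrative)

def xcor_batch(block):
    """Original style: buffer the whole block, then compute in one burst."""
    full = sum(block) / SIZE
    lagged = sum(block[: LAG // 2]) / SIZE
    return full, lagged

class XcorStreaming:
    """Refactored style: update running sums as each sample arrives."""
    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_lag = 0.0

    def push(self, sample):
        self.total += sample
        self.count += 1
        if self.count * 2 == LAG:        # snapshot the running sum at the lag point
            self.total_lag = self.total
        if self.count == SIZE:           # block complete: only cheap division remains
            return self.total / SIZE, self.total_lag / SIZE
        return None

samples = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 1.0]
pe = XcorStreaming()
results = [r for s in samples if (r := pe.push(s)) is not None]
assert results[0] == xcor_batch(samples)   # both styles agree on the block result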
Finally, LZ and MA PEs require initialization of data structures at the beginning of every compressed block. We found that dedicated circuits are necessary to meet the 15 mW power budget. These circuits use only combinational logic and reduce PE power consumption by 1.8×.

Modified PE Output: Although initialization circuits decrease the direct power/performance cost of starting a new compression block, there is also an indirect cost of using uninitialized internal structures, which leads to lower compression rates. This presents a problem with respect to the choice of block size. Large block sizes lead to better estimates of frequencies, but small block sizes allow the use of smaller data types and reduce the memory footprint and power of the MA PE. One might balance power/compression ratio for an ideal design, but such an approach does not find a design point that fits within the constrained power budget. Instead, we observe that the frequencies of values within a block remain largely unchanged after they have stabilized. Consequently, we allow the frequency counters to saturate and set block size independently of counter bit width. Overall, the counter saturation modification allows HALO to benefit both from the reduced memory footprint of 16-bit counters and the better compression ratio of larger blocks.
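A minimal sketch of the idea (ours; the bit width and block length are just example values): the frequency counters clamp at their maximum instead of forcing a block boundary, so block size and counter width can be chosen independently.

COUNTER_BITS = 16
COUNTER_MAX = (1 << COUNTER_BITS) - 1   # 65535 for 16-bit counters

def update_frequency(table, symbol):
    """Saturating update: once a counter hits its ceiling it simply stops growing,
    so a block far longer than 2**16 samples never overflows the small counter."""
    if table[symbol] < COUNTER_MAX:
        table[symbol] += 1

table = {}
block = [7] * 100_000 + [3] * 10        # a block much longer than the counter range
for sym in block:
    table.setdefault(sym, 0)
    update_frequency(table, sym)

print(table[7], table[3])   # 65535 10 -- relative frequencies are still usable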
On-Chip Network
Each PE operates at the lowest frequency needed for its data processing rates, and is synthesized with established synchronous design flows. While running PEs in separate clock domains saves power, it can potentially complicate inter-PE communication. Prior work on globally asynchronous locally synchronous (GALS) architectures12 encountered these issues for packet-switched on-chip networks. Unfortunately, we cannot repurpose their solutions, as our analysis with the DSENT tool

estimates that a simple packet-switched mesh network consumes over 50 mW, well over our 15-mW power budget. Instead, we codesign inter-PE communication with the BCI algorithms. The decomposition of BCI tasks into kernels creates static and well-defined data-flows between PEs. NoC route selection allows replacement of a packet-switched network with a far lower power circuit-switched network built on an asynchronous communication fabric.

FIFO Interfaces
Since the publication of our original paper on HALO,3 one challenge that we encountered is that our GALS-based approach requires careful data rate matching between PEs in separate clock domains. We use per-PE FIFO buffers to transfer data from the network into the form expected by the PE and to perform this rate matching. Consider PEs f and g. If their computation is regular—i.e., the functions produce and consume data in a perfectly periodic fashion—then a simple interface between the two for clock domain conversion suffices. However, if f produces bursty data or g consumes data in bursts, then a FIFO is needed between f and g to smooth out producer–consumer patterns. The size of this FIFO is determined by the computational properties of f and g, and the frequencies at which they operate. Increasing the frequency of g beyond the minimum operating point to meet data throughput needs would reduce the FIFO size required. We have found that balancing FIFO size with PE frequency is key to meeting the 15-mW power budget.
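A simple way to see the tradeoff (our own sketch; the burst sizes and rates are hypothetical, and a real sizing would come from the PEs' actual schedules): replay a bursty producer against a periodic consumer and record the largest backlog, which is the FIFO depth that configuration needs.

def required_fifo_depth(produced_per_cycle, consumed_per_cycle):
    """Replay per-cycle production/consumption and return the peak backlog."""
    occupancy, peak = 0, 0
    for p, c in zip(produced_per_cycle, consumed_per_cycle):
        occupancy = max(0, occupancy + p - c)
        peak = max(peak, occupancy)
    return peak

cycles = 32
# f emits a burst of 8 items every 8 cycles; g drains 1 item per cycle.
bursty_f = [8 if t % 8 == 0 else 0 for t in range(cycles)]
steady_g = [1] * cycles
print(required_fifo_depth(bursty_f, steady_g))   # deep FIFO at g's minimum rate

# Running g at four items per cycle shrinks the FIFO, trading a higher clock
# (more power in g) for less buffering.
faster_g = [4] * cycles
print(required_fifo_depth(bursty_f, faster_g))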
EVALUATION
Our 15-mW target power budget includes the HALO chip, sensors, ADC, amplifier, and radio technologies. We assume a microelectrode array with 96 channels, each of which records samples encoded in 16 bits at a frequency of 30 kHz, yielding a data rate of 46 Mbps. After accounting for all analog components, HALO's processing pipelines (including the radio) must consume no more than 12 mW. All results presented use a commercial 28-nm fully depleted silicon-on-insulator (FD-SOI) CMOS process except when noted otherwise. Synthesis and power analysis are performed using Cadence synthesis tools with standard cell libraries from STMicroelectronics.
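The aggregate rate follows directly from these recording parameters; the check below is just that arithmetic.

channels = 96
bits_per_sample = 16
sample_rate_hz = 30_000

data_rate_mbps = channels * bits_per_sample * sample_rate_hz / 1e6
print(data_rate_mbps)   # 46.08, i.e., roughly the 46 Mbps quoted above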
We use electrophysiological data collected from the brain of a non-human primate. Microelectrode arrays were implanted in two locations in the motor cortex, corresponding to the left upper and lower limbs. We use recordings of brain activity while the animal performed tasks such as walking on a treadmill, reaching for a treat, and overcoming a moving styrofoam obstacle. All research protocols were approved and monitored by Brown University's Institutional Animal Care and Use Committee, and all research was performed in accordance with relevant guidelines and regulations.

FIGURE 3. Power (in log-scale) of PEs, control logic, and radios for HALO versus RISC-V and monolithic ASICs. To meet the 15-mW device power budget, these components (without ADCs and amplifiers) need to be under 12 mW (the red line). We compare HALO against the lowest power RISC-V and HALO-no-NoC, which shows how much power would be saved if HALO's configurability were sacrificed.

Figure 3 compares HALO's power versus ASICs and software alternatives on RISC-V. Software tasks can execute on microcontroller cores in both single-core and multicore designs, where we divide the 96 channel data streams and operate on them in parallel. We study 1–64 RISC-V core counts and report the best configuration per task. We also show an idealized version of HALO where the on-chip interconnect is removed to quantify the power penalty for the configurability that the network offers. Both HALO variants use the optimizations described in the "PE Optimizations" section. HALO uses less power than monolithic ASICs and RISC-V approaches.

Finally, as we have been extending our chip design efforts, we have discovered the crucial impact of FIFO design on total power. For each PE, we have evaluated the power utilized for various configurations of frequency and input and output FIFO buffers. For each frequency, we select the lowest FIFO size required for the design, and report its power. For example, for the LIC PE, we have found that the lowest power configuration is achieved at 24 MHz, with an 8-entry input FIFO and no output FIFO. We also note that power consumed can vary by as much as 1 mW depending on these configuration options.

CONCLUSION
HALO presents a wet lab-to-chip design project that explores the question of how to build a flexible ultralow-power processing architecture for next-generation BCIs. While this work performs an initial exploration of workloads that are important for neuroscience, the list of tasks can be expanded. Future BCIs will implement other workloads, with different pipelines targeting different research and medical objectives. Because of its modular design, HALO will be able to support such workloads seamlessly.

REFERENCES
1. A. L. S. Ferreira, L. C. D. Miranda, and E. E. Cunha de Miranda, "A survey of interactive systems based on brain-computer interfaces," SBC J. Interactive Syst., vol. 4, no. 1, pp. 3–13, 2013, doi: 10.5753/jis.2013.623.
2. H. Kassiri et al., "Closed-loop neurostimulators: A survey and a seizure-predicting design example for intractable epilepsy treatment," IEEE Trans. Biomed. Circuits Syst., vol. 11, no. 5, pp. 1026–1040, Oct. 2017, doi: 10.1109/TBCAS.2017.2694638.
3. I. Karageorgos et al., "Hardware-software codesign for brain-computer interfaces," in Proc. 47th Annu. Int. Symp. Comput. Archit., 2020, pp. 391–404, doi: 10.1109/ISCA45697.2020.00041.
4. C. Cinel, D. Valeriani, and R. Poli, "Neurotechnologies for human cognitive augmentation: Current state of the art and future prospects," Front. Hum. Neurosci., vol. 13, 2019, Art. no. 13, doi: 10.3389/fnhum.2019.00013.
5. E. Musk and Neuralink, "An integrated brain-machine interface platform with thousands of channels," J. Med. Internet Res., vol. 21, no. 10, 2019, Paper e16194, doi: 10.1101/703801.
6. I. Stevenson and K. Kording, "How advances in neural recording affect data analysis," Nature Neurosci., vol. 14, pp. 139–142, 2011, doi: 10.1038/nn.2731.
7. P. D. Wolf, "Thermal considerations for the design of an implanted cortical brain-machine interface (BMI)," in Indwelling Neural Implants: Strategies for Contending With the in Vivo Environment, W. M. Reichert, Ed. Boca Raton, FL: CRC Press/Taylor & Francis, 2008, ch. 3, doi: 10.1201/9781420009309-11.
8. DARPA, Bridging the Bio-Electronic Divide. Accessed: Aug. 10, 2019. [Online]. Available: https://www.darpa.mil/news-events/2015-01-19
9. S. Li, W. Zhou, Q. Yuan, and Y. Liu, "Seizure prediction using spike rate of intracranial EEG," IEEE Trans. Neural Syst., vol. 21, no. 6, pp. 880–886, Nov. 2013, doi: 10.1109/TNSRE.2013.2282153.
10. J. N. Y. Aziz et al., "256-Channel neural recording and delta compression microsystem with 3D electrodes," IEEE J. Solid-State Circuits, vol. 44, no. 3, pp. 995–1005, Mar. 2009, doi: 10.1109/JSSC.2008.2010997.
11. G. O'Leary et al., "NURIP: Neural interface processor for brain-state classification and programmable-waveform neurostimulation," IEEE J. Solid-State Circuits, vol. 53, no. 11, pp. 3150–3162, Nov. 2018, doi: 10.1109/JSSC.2018.2869579.
12. M. Krstic and E. Grass, "New GALS technique for datapath architectures," in Proc. Int. Workshop Power Timing Modeling, Optim. Simulation, 2003, pp. 161–170, doi: 10.1007/978-3-540-39762-5_18.

IOANNIS KARAGEORGOS is currently an associate research scientist at Yale University, New Haven, CT, USA. His primary research interests focus on the general area of VLSI, including GALS architectures, logical and physical SoC/ASIC design, and DTCO. Karageorgos received a Ph.D. degree in electrical engineering from KU Leuven and IMEC, Belgium. Contact him at ikarageo@aya.yale.edu.

KARTHIK SRIRAM is currently a Ph.D. student at Yale University, New Haven, CT, USA. His research interests include computer systems and architecture, and hardware-software codesign, especially in the design of brain–computer interfaces. Sriram received a B.S. degree in computer science from Rutgers University. He is the corresponding author of this article. Contact him at karthik.sriram@yale.edu.

JAN VESELÝ is currently a software engineer with Nvidia. His interests span the areas of architecture, operating systems, and compiler techniques for accelerators. Vesely contributed to this work while he was a visiting student at Yale. Vesely graduated in 2021 from Rutgers University. His thesis focused on hardware and software methods of integrating accelerators into heterogeneous systems. Contact him at jan.vesely@rutgers.edu.

NICK LINDSAY is currently a graduate student at Yale University, New Haven, CT, USA. His interests include building secure, safe, and high-performance heterogeneous systems. Lindsay received a B.Eng. degree in electrical engineering from the University of Glasgow. Contact him at Nick.Lindsay@yale.edu.

XIAYUAN WEN is currently a Ph.D. student at Yale University, New Haven, CT, USA. Her research interests include computer architecture and circuit design. Wen received the B.S. degree from Nanjing University and an M.S. degree from Yale University. Contact her at xiayuan.wen@yale.edu.

MICHAEL WU is an incoming Ph.D. student at Yale University, New Haven, CT, USA. His research focuses on the applications of machine learning in computer systems. Wu received a bachelor's degree in computer science from Rutgers University, New Brunswick, NJ, USA. Contact him at mw811@cs.rutgers.edu.

MARC POWELL is currently a postdoctoral associate with the Department of Neurological Surgery, University of Pittsburgh. His research focuses on the development of advanced neural technologies designed to provide unprecedented access to the nervous system and apply these tools to the treatment of neurological dysfunction caused by injury or disease. A major goal of his work is to facilitate the clinical translation of these devices and ensure that they are implemented safely and reliably. Powell received a bachelor's degree in biomedical engineering from Georgia Institute of Technology in 2014 and a Ph.D. degree in biomedical engineering from Brown University in 2021. Contact him at marc_powell@pitt.edu.

DAVID BORTON is currently an assistant professor of biomedical engineering at the Brown University School of Engineering and the Carney Institute for Brain Science, and is also a biomedical engineer at the Providence Veterans Affairs Center for Neurorestoration and Neurotechnology, New Haven, CT, USA. He leads an interdisciplinary team of researchers focused on the design, development, and deployment of novel neural recording and stimulation technologies. His team leverages engineering principles to untangle the underpinnings of sensorimotor and neuropsychiatric disease and injury. Borton received a B.S. degree in biomedical engineering from Washington University in St. Louis in 2006 and a Ph.D. degree in bioengineering from Brown University in 2012. He was a Marie Curie Postdoctoral Fellow at the Ecole Polytechnique Fédérale de Lausanne. Contact him at david_borton@brown.edu.

RAJIT MANOHAR is currently a John C. Malone Professor of Electrical Engineering and a professor of computer science at Yale University, New Haven, CT, USA. His research focuses on the design and implementation of asynchronous circuits and systems. Manohar received a Ph.D. degree in computer science from Caltech. Contact him at rajit.manohar@yale.edu.

ABHISHEK BHATTACHARJEE is currently an associate professor of computer science at Yale University, New Haven, CT, USA. His research focuses on computer architecture and systems at all scales of computing, ranging from server systems for large-scale data centers to embedded systems for implantable brain–computer interfaces. Bhattacharjee received a bachelor's degree in engineering from McGill University in 2005 and a Ph.D. from Princeton University in 2010. Contact him at abhishek@cs.yale.edu.



THEME ARTICLE: TOP PICKS

Virtual Logical Qubits: A Compact


Architecture for Fault-Tolerant Quantum
Computing
Jonathan M. Baker , Casey Duckering , David I. Schuster , and Frederic T. Chong , University of Chicago,
Chicago, IL, 60637, USA

Fault-tolerant quantum computing is required to execute many of the most


promising quantum applications. In recent years, numerous error correcting codes,
such as the surface code, have emerged which are well suited for current and future
limited connectivity 2-D devices. We find quantum memory, particularly resonant
cavities with transmon qubits arranged in a 2.5-D architecture, can efficiently
implement surface codes with around 20× fewer transmons via this work. We
virtualize 2-D memory addresses by storing the code in layers of qubit memories
connected to each transmon. Distributing logical qubits across many memories has
minimal impact on fault tolerance and results in substantially more efficient logical
operations. Virtualized logical qubit (VLQ) systems can achieve fault tolerance
comparable to conventional 2-D transmon-only architectures while putting within
reach a proof-of-concept experimental demonstration of around ten logical qubits,
requiring only 11 transmons and 9 attached cavities.

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3072789
Date of publication 13 April 2021; date of current version 25 May 2021.

Quantum devices have improved significantly in the last several years both in terms of physical error rates and number of usable quantum bits (qubits). Concurrently, great progress has been made at the software level such as improved compilation procedures reducing required overhead for program execution. These efforts are directed at enabling noisy intermediate-scale quantum (NISQ)1 algorithms to demonstrate the power of quantum computing and are expected to run some important programs.

Despite this, these machines will be too small for error correction and unable to run large-scale programs due to unreliable qubits. The ultimate goal is to construct fault-tolerant machines capable of executing thousands of gates and in the long term to execute large-scale algorithms with speed-ups over classical algorithms. There are a number of promising error correction schemes which have been proposed such as the surface code.2,3 The surface code is a particularly appealing candidate because of its low overhead, high error threshold, and its reliance on few nearest-neighbor interactions in a 2-D array of qubits, a common feature of currently popular hardware like superconducting transmon qubits.

Current architectures for both NISQ and fault-tolerant quantum computers make no distinction between the memory and processing of quantum information. While currently viable, as larger devices are built, the engineering challenges of scaling to hundreds of qubits become readily apparent. For transmon technology, some of these issues include fabrication consistency and crosstalk during parallel operations. Every qubit needs dedicated control wires and signal generators, which fill the refrigerator the device runs in. To scale to the millions of qubits needed for useful fault-tolerant machines,4 we need to adopt a memory-based architecture to decouple qubit count from transmon count.

We use a recently realized qubit memory technology, which stores qubits in a superconducting cavity.5 Stored in a cavity, qubits have a significantly longer lifetime (coherence time), but must be loaded into a transmon for computation. We design and evaluate a system-level organization of these components within

the context of a novel surface code embedding and fault-tolerant quantum operations.

FIGURE 1. Our fault-tolerant architecture with random-access memory local to each transmon. On top is the typical 2-D grid of transmon qubits. Attached below each data transmon is a resonant cavity storing error-prone data qubits (shown as black circles). This pattern is tiled in 2-D to obtain a 2.5-D array of logical qubits. Our key innovation here is storing the qubits that make up each logical qubit (shown as checkerboards) spread across many cavities to enable efficient computation.

FIGURE 2. Typical 2-D superconducting qubit architecture. The dots are transmon qubits where black are used as data and gray are used as ancilla for error correction. The lines indicate physical connections between qubits that allow operations between them. Four logical qubits, each consisting of nine error-prone data qubits, are shown here in the rotated surface code with distance 3. Z parity checks are shaded yellow (light) and X parity checks are shaded blue (dark) where checks on only two data are drawn as half circles.

Our proposed 2.5-D memory-based design is a typical 2-D grid of transmons with memory added, Figure 1. This can be compared with the traditional 2-D error correction implementation in Figure 2, where the checkerboards represent error-corrected logical qubits. The logical qubits in this system are stored at unique virtual addresses in memory cavities when not in use. They are loaded to a physical address in the transmons and made accessible for computation on request, and are periodically loaded to correct errors, similar to DRAM refresh. This design allows for more efficient operations such as the transversal CNOT between logical qubits sharing the same physical address, i.e., colocated in the same cavities. This is not possible on the surface code in 2-D, which requires methods such as braiding or lattice surgery for a CNOT operation.

We develop an embedding from the standard representation to this new architecture, which reduces the required number of physical transmon qubits by a factor of approximately k, the number of resonant modes per cavity, which is expected to be at least 10 and is expected to improve over time. We also develop a Compact variant saving an additional 2×. This means we can obtain a code distance √(2k) times greater or use hardware with only 1/(2k) the required physical transmons for a given algorithm. In the near-to-intermediate term, when qubits are a highly constrained resource, this will accelerate a path toward fault-tolerant computation. In fact, the smallest instance of Compact requires only 11 transmons and 9 cavities for about ten logical qubits. Via simulation, we determine the error correction threshold rates for each and find they are all close to the baseline threshold, meaning the additional error sources do not significantly impact the performance.
BACKGROUND
Superconducting Qubit Architectures
In contrast to other leading qubit technologies such as trapped ion devices with one or more fully connected qubit chains, superconducting qubits are typically connected in nearest neighbor topologies, often a 2-D mesh on a regular square grid. This limitation makes engineering these devices easier but results in high communication costs, increasing the chance of errors on NISQ devices, and communication congestion for error corrected operations. More background on superconducting hardware can be found in Krantz et al.6

Qubit Memory Technology
Recently, studies have demonstrated random access memory for quantum information.5 Qubit states can be stored in the resonant modes of physical superconducting cavities attached to a transmon qubit and depicted as the individual cylinders in Figure 1. Currently demonstrated error rates are promising, and there is nothing fundamental preventing this technology from becoming competitive with other transmon devices as it matures. We expect operation error rates to improve, cavity sizes and coherence times to increase, and in general expect performance to improve as it has with other quantum technologies.

Local memory is not free. Stored qubits cannot be operated on directly. Instead, operations are mediated through the transmon. To operate on qubits stored in memory, we first load the qubit from memory. Then, we perform the desired operation on the transmons, and store the qubit back in its original location. A two-qubit operation such as a CNOT can also be performed directly between the transmon and a qubit in its connected cavity by manipulating higher states of the transmon. Qubits stored in the same cavity cannot be operated on in parallel. There are two primary benefits of this technology. First, we are able to quickly perform two-qubit interactions between any pair of qubits stored in the same cavity. Second, qubits stored in the cavity are expected to have longer coherence times by about one order of magnitude.

Surface Codes
The surface code2 is one of the most promising quantum error correction protocols because it requires only nearest neighbor connectivity between physical qubits, and improvements continue to be made.7 The surface code is implemented on a 2-D array of physical qubits shown in Figure 2. These qubits are either data, where the state of the logical qubit is stored, or ancilla used for syndrome extraction (parity checks). These ancilla qubits are measured to stabilize the entangled state of the data. These ancilla fall into two categories, measure-Z and measure-X, for Z syndromes and X syndromes designed to detect bit and phase errors, respectively.

Each X (Z) plaquette corresponds to a single measure-X (Z) qubit and the four data qubits which it interacts with. The corners of each plaquette are the data qubits. For the baseline, we use standard Z and X syndrome extraction (parity measurement) circuits where the qubits of this circuit are physical qubits. The Z-syndrome measures the bit-parity of its corner qubits and the X-syndrome measures their phase parity. By repeatedly performing syndrome extraction and detecting parity changes, we are able to locate errors. This repeated syndrome extraction collapses any error to a correctable Pauli error and forces the data to remain in what is called the code state. We may detect errors which occur as changes in measurement outcomes of the parity checks.
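In the bit-error (Z-check) picture, a syndrome change is just the parity of the neighboring data qubits flipping between rounds. A toy sketch of that idea (ours, purely classical and only for intuition; real extraction uses ancilla qubits and CNOT circuits) is shown below.

def z_syndrome(data_bits, plaquette):
    """Parity of the (up to four) data qubits on one Z plaquette."""
    return sum(data_bits[i] for i in plaquette) % 2

data = [0, 0, 0, 0]            # four data qubits on one plaquette, initially error-free
plaquette = (0, 1, 2, 3)
before = z_syndrome(data, plaquette)   # 0

data[2] ^= 1                   # a bit-flip error strikes data qubit 2
after = z_syndrome(data, plaquette)    # 1

# The syndrome changing between rounds is what flags the error for the decoder.
print(before, after)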
There are two primary ways to manipulate the logical qubits of the surface code to perform desired logical operations: braiding and lattice surgery. In this article, we will primarily consider lattice surgery, which has been shown to have some advantages over braiding, like using fewer physical qubits. For a more thorough introduction to lattice surgery, we refer the reader to Horsman et al.3,8,9 In our proposed scheme, all primitive lattice surgery operations can be used, such as split and merge, which together perform a logical CNOT. For universal quantum computation in surface codes, we allow for the creation and use of magic states such as |T⟩ or |CCZ⟩.

VIRTUALIZED LOGICAL QUBITS
Our proposed architecture is an embedding of the surface code, which virtualizes logical qubits, saving in the required number of transmons. This takes advantage of quantum resonant cavity memory technology to store logical qubits, in the form of surface code patches, in memory local to the computational transmons.

Natural Surface Code Embedding
Our embedding slices the plane of surface code tiles into many pieces, storing them flat in memory to enable them to stitch together on-demand. This embedding enables the fast transversal CNOT and high connectivity.

For every transmon in this architecture (the compute qubits in the top layer of Figure 1), there is a cavity attached with a fixed number of resonant modes k. Each cavity can store k qubits, one per mode. Each transmon can load and store qubits from its attached cavity. All transmons can be operated on in parallel, as is the case in most superconducting hardware. We expect this technology to allow cavity size k on the order of 10 to 100 qubits.

Consider the rotated surface code of Figure 2 and the high level view of this architecture in Figure 1. We map each of the physical qubits of this logical qubit to the same mode z of each cavity in this memory architecture. Another logical qubit can be mapped to a different mode of the same set of cavities. We view this as stacking the surface code patches, the logical

qubits, together under the same set of transmon qubits. The transmons themselves are only used for logical operations and error correction cycles performed on the patches.

In this memory architecture, we are unable to operate on qubits stored in the same cavity in parallel; however, we are permitted to operate on qubits stored in different cavities in parallel. In order to detect measurement errors, we require d, the distance of the code, rounds of syndrome extraction before we perform our decoding algorithm and correct errors. We can load a logical qubit (meaning load all data in parallel to each transmon), perform all d rounds of extraction, then store the qubit; this is our All-at-once strategy. Alternatively, we can interleave the extraction cycles by loading the logical qubit in index 0, performing one syndrome extraction step, then storing. We execute this same procedure for every logical qubit in the stack and repeat d times.
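Seen as software, the two strategies differ only in which loop is outermost. The sketch below is our own illustration with stub load/store/extract operations, not code from the evaluation.

def all_at_once(num_logical, d, load, extract, store):
    """Finish all d syndrome-extraction rounds for one logical qubit before the next."""
    for q in range(num_logical):
        load(q)
        for _ in range(d):
            extract(q)
        store(q)

def interleaved(num_logical, d, load, extract, store):
    """One round per logical qubit, cycling through the whole stack d times."""
    for _ in range(d):
        for q in range(num_logical):
            load(q)
            extract(q)
            store(q)

trace = []
stub = lambda name: lambda q: trace.append((name, q))
interleaved(num_logical=3, d=2, load=stub("load"), extract=stub("extract"), store=stub("store"))
print(trace[:6])   # load/extract/store qubit 0, then qubit 1, ... repeated d times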
Up to k logical qubits share the same set of transmons, thereby more efficiently storing these qubits than on a single large surface. To interact logical qubits in different stacks, we load them in parallel to the transmons and then interact them via a lattice surgery operation. In these cases, all of the other stacks' transmons between the interacting logical qubits act as a single logical ancilla. Furthermore, physical operations between qubits in the same cavity enable our system to perform fast transversal two-qubit interactions if the logical qubits are colocated in the same stack.

Transversal CNOT
A major advantage of this 2.5-D architecture is the ability to do two-qubit operations transversally using the third dimension. The logical operation is performed directly by doing the same physical gate to every data qubit and correcting any resulting errors. For typical 2-D error correcting codes like the surface code, transversal two-qubit operations are not possible because the corresponding data qubits of two logical patches cannot be made adjacent. However, with memory, it is possible to load one patch into the transmons and apply two-qubit gates mediated by each transmon onto the data qubits for a second qubit stored in one mode of the cavities. The transversal CNOT can be performed in a single round of d error correction cycles, while the lattice surgery CNOT takes six rounds. This can translate to major savings in runtime for algorithms.

The transversal CNOT is not limited to logical qubits currently stored in the same 2-D address. With an extra step, it is possible to transversally interact any two logical qubits. To do this, one of the qubits must be moved to the same 2-D address as the other using a move operation described by Litinski.8 Once the two qubits are in the same 2-D address, the transversal CNOT can be applied.

Compact Surface Code Embedding
In the previous scheme, half of the transmons did not have attached cavities (or they did not make use of them). An ancilla and data qubit could share a transmon because the data are stored in the cavity the majority of the time and the ancilla are reset every cycle. This leads to a more efficient, Compact embedding which halves the required number of transmons. This comes at the cost of additional loads and stores from memory due to contention during error correction, effectively trading some error and time for significant space savings.

This mapping results in plaquettes which resemble triangles rather than squares, where the center of the hypotenuse of each triangle corresponds to both the ancilla qubit and the data qubit, stored "beneath" in its cavity. Every data qubit is still mapped to the same index. We illustrate this transformation from our undistorted Natural surface code patch to Compact in Figure 3.

This new mapping also requires a new syndrome extraction procedure because data cannot be loaded while a transmon is in use as an ancilla. A single round of syndrome extraction can be executed by dividing the plaquettes into four groups, with each group containing noninterfering plaquettes. Two plaquettes are noninterfering if they do not share their ancilla with any data qubits of the other plaquette. It is imperative that this process use both the minimum number of loads and stores and keep data qubits loaded for as short a time as possible, as the error incurred during this circuit directly impacts the error threshold for the code. Error correction can be performed interleaved or all-at-once, just as with Natural.

Beyond the Surface Code
The surface code is an appealing choice for currently available superconducting devices because of its relatively low overhead and because it requires only limited nearest neighbor connectivity. For this new memory-based architecture, there is a fortuitous match with the surface code, making it an even more appealing candidate. However, there are many other candidate error correction codes such as the color code.

One particularly relevant class of codes for this architecture are Bosonic codes.10 The surface and

FIGURE 3. Transformation from natural to compact. (a) Natural embedding: Only data have attached cavities (not shown). (b)
The transformation: Z ancilla (over yellow/light areas) merge with the upper-right data transmon and X ancilla (over blue/dark
areas) merge with the lower left data transmon. The opposite pairings are key to keeping 4-way grid connectivity. (c) Compact
embedding: All ancilla transmons without attached cavities have been removed. All remaining transmons have cavities and are
used as both data and ancilla.

color codes protect quantum information by using many physical qubits to construct a single logical qubit. For Bosonic codes, we can instead create redundancy by using many modes of a single physical system. The modes of the cavities in the underlying hardware of our systems can be used to implement Bosonic codes directly. Unfortunately, Bosonic codes in practice only approximately correct errors and do not have the property that as you scale the size of the code you can obtain an exponential reduction in the logical error rate of the system. Bosonic codes can be used effectively for error mitigation.

It is unclear, given an architecture, what the best error correction scheme is. Ideally, we want a code which takes full advantage of the high connectivity between information stored in the same cavity. For practical demonstrations, we also want a code which requires a small number of transmons. Bosonic codes are viable options for this architecture and have a somewhat natural fit. It is yet to be seen what is the best choice; however, we have shown the surface code is a fortuitous match with this new memory-based architecture.

EVALUATION
Error Threshold Results
We detail our threshold results in Figure 4. We study five different code distances in order to obtain the physical error threshold value. The threshold value indicates at which point increasing the code distance d improves the logical error rate instead of hurting it. This threshold is a function of the physical system model, the chosen syndrome extraction circuit, and the specific decoding procedure. The major difference in each procedure is the additional error sources and different syndrome extraction procedures. The slope for each code distance compared across the various schemes is stable, indicating each scheme improves at a similar rate, post error threshold, and

FIGURE 4. Error thresholds for the baseline 2-D architecture and natural and compact variants of our 2.5-D architecture. The
thresholds are comparable to the baseline indicating the space savings obtained in our system does not substantially reduce
the error thresholds. The slopes of the lines in this figure indicate, postthreshold, how much improvements in physical error rates
improve the logical error rate. Except for the baseline, all use a cavity size of 10.

FIGURE 5. Sensitivity of logical error rate to various error sources in compact (Interleaved). The logical error rates are most sen-
sitive to physical error of loads/stores and SC-SC (transmon–transmon) gates. The logical error rate is less sensitive to transmon
and cavity coherence times (not shown) and mostly insensitive to effects of cavity size.

showing that the logical error rate decays exponentially with d as desired. This is significant because it means we will be able to save on the total number of transmons without major degradation of the error threshold.
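One way to read "decays exponentially with d" is through the standard rule-of-thumb surface-code scaling, p_logical ≈ A (p/p_th)^((d+1)/2). The sketch below uses that textbook model with made-up numbers, not the simulation data behind Figure 4.

def logical_error_rate(p, p_th, d, prefactor=0.1):
    """Textbook surface-code scaling model: below threshold, larger d helps exponentially."""
    return prefactor * (p / p_th) ** ((d + 1) / 2)

p_th = 0.01            # assumed threshold (illustrative)
p = 0.002              # assumed physical error rate, below threshold
for d in (3, 5, 7, 9, 11):
    print(d, logical_error_rate(p, p_th, d))
# Each +2 in distance multiplies the logical error rate by roughly (p/p_th),
# which is why operating below threshold matters so much.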

Error Sensitivity Results
Different system-level details affect the threshold of the code. Here we focus on Compact, Interleaved as the most efficient physical qubit mapping and subject it to a wide variety of errors. The results of these sensitivity studies are found in Figure 5. The logical error rate is sensitive to a particular error source's probability if the slope of the line is pronounced at the marked reference value. The logical error rate for Compact, Interleaved is sensitive to all changes in system-level details to some degree. The gate error rates show the highest sensitivity, indicating improvement in these will give the greatest benefit. Coherence times (plots not shown) are not quite as sensitive, but the slightly over 10× offset between the cavity and transmon plots shows that there is no benefit in the transmon T1 being longer than 1/10 of the cavity T1 when the cavity size is 10. The lines taper off, indicating other error sources eventually dominate. Initially, we expected the cavity size to have a large impact on the logical error rate. However, when coherence times are high and gate error rates are fairly low below the threshold, the logical error rate does increase proportionally to the length of the cavity, but the effect is very minor. Given cavities with good coherence times, this indicates our proposed system will be able to scale smoothly into the future as cavity sizes increase.

While larger cavity sizes will make this architecture even more advantageous, there will be a point at which it has a vanishing benefit because the delay between error correction becomes too long and decoherence error dominates. For the error rates used in the evaluation, we find that cavity decoherence error starts dominating after a cavity size of roughly k = 150. After this point, it would be more beneficial to improve the cavity coherence time.

CONCLUSION
Current NISQ machines are powerful demonstrations, but fall short of many serious applications without error correction. This article makes an error-corrected machine 20× easier to build by exploiting quantum memory with a codesigned architecture to enable medium-scale quantum machines (100–1000 transmons) and allow the industry to realize the long-term potential of scalable quantum machines.

Current quantum computers are noisy and they are incapable of running sizable programs accurately. There is currently a major gap between what is available on the market and what is required to execute the famed quantum algorithms with quantum error correction. Currently, approaches toward bridging this gap fall into a few categories: either push the error rates of current devices down below the error threshold of known codes, or design new codes. Our work bridges this gap in a completely different way by exploring the use of new technology, resonant cavities, for the design of new architectures which better support error correction codes already available.

Locally accessible quantum memory can be used to create a 2.5-D architecture better suited for the code than traditional approaches. Our architecture enables the execution of the surface code with lattice surgery operations with significantly lower physical requirements, resulting in higher distance codes with fewer total transmon qubits. What does this mean? Given a fixed physical error rate below the threshold, we can use larger codes to obtain strictly

better logical error rates. This enables the execution of longer input programs. Conversely, for a fixed desired logical error rate, determined by the application, we can run on hardware with worse error rates, which will be available years sooner. This demonstrates the benefit of codesigning quantum architectures alongside the applications and technologies.

REFERENCES
1. J. Preskill, "Quantum computing in the NISQ era and beyond," Quantum, vol. 2, 2018, Art. no. 79.
2. A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, "Surface codes: Towards practical large-scale quantum computation," Phys. Rev. A, vol. 86, no. 3, 2012, Art. no. 032324.
3. C. Horsman, A. G. Fowler, S. Devitt, and R. Van Meter, "Surface code quantum computing by lattice surgery," New J. Phys., vol. 14, no. 12, 2012, Art. no. 123011.
4. C. Gidney and M. Ekerå, "How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits," Quantum, vol. 5, p. 433, Apr. 2021.
5. R. Naik et al., "Random access quantum information processors using multimode circuit quantum electrodynamics," Nature Commun., vol. 8, no. 1, 2017, Art. no. 1904.
6. P. Krantz, M. Kjaergaard, F. Yan, T. P. Orlando, S. Gustavsson, and W. D. Oliver, "A quantum engineer's guide to superconducting qubits," Appl. Phys. Rev., vol. 6, no. 2, 2019, Art. no. 021318.
7. J. P. Bonilla-Ataides, D. K. Tuckett, S. D. Bartlett, S. T. Flammia, and B. J. Brown, "The XZZX surface code," 2020, arXiv:2009.07851.
8. D. Litinski, "A game of surface codes: Large-scale quantum computing with lattice surgery," Quantum, vol. 3, 2019, Art. no. 128.
9. L. Lao et al., "Mapping of lattice surgery-based quantum circuits on surface code architectures," Quantum Sci. Technol., vol. 4, no. 1, 2018, Art. no. 015005.
10. W. Cai, Y. Ma, W. Wang, C.-L. Zou, and L. Sun, "Bosonic quantum error correction codes in superconducting quantum circuits," Fundam. Res., vol. 1, pp. 50–67, 2021.

JONATHAN M. BAKER is currently working toward a Ph.D. degree with the University of Chicago, Chicago, IL, USA. His research is primarily focused on vertical integration of the quantum computing hardware-software stack. Contact him at jmbaker@uchicago.edu.

CASEY DUCKERING is currently working toward a Ph.D. degree with the University of Chicago, Chicago, IL, USA, aiming to efficiently bring together quantum algorithms and error correction with their physical implementations on quantum computers. Contact him at cduck@uchicago.edu.

DAVID I. SCHUSTER is an Associate Professor with the Department of Physics, University of Chicago, Chicago, IL, USA. His work is currently centered on superconducting quantum circuits to make quantum memories, perform error correction, and create topologically protected qubits. Schuster received a Ph.D. degree from Yale University in 2007. Contact him at david.schuster@uchicago.edu.

FREDERIC T. CHONG is the Seymour Goodman Professor with the Department of Computer Science, University of Chicago, Chicago, IL, USA. He is also Lead Principal Investigator for the EPiQC Project (Enabling Practical-scale Quantum Computing), an NSF Expedition in Computing. He was a faculty member and Chancellor's fellow at UC Davis from 1997 to 2005. He was also a Professor of Computer Science, the Director of Computer Engineering, and the Director of the Greenscale Center for Energy-Efficient Computing at UCSB from 2005 to 2015. Chong received a Ph.D. degree from MIT in 1996. He is a recipient of the NSF CAREER Award, the Intel Outstanding Researcher Award, and ten best paper awards. He currently serves on the National Quantum Initiative Advisory Committee and is the Chief Scientist and Co-Founder of Super.tech, a quantum software company. Contact him at chong@cs.uchicago.edu.

EDITOR: Michael Mattioli, michael.mattioli@gs.com

DEPARTMENT: SECURITY

The Next Security Frontier: Taking the


Mystery Out of the Supply Chain
Michael Mattioli , Goldman Sachs & Co., New York, NY, 10282, USA
Tom Garrison and Baiju V. Patel, Intel Corporation, Mountain View, CA, 94040, USA

The modern technology supply chain is highly complex. Throughout the various
stages of design, manufacture, assembly, transport, and operation, a compute
device is subject to tampering (malicious or otherwise). This exposes end users and
consumers of compute devices to a variety of risks at varying levels of impact. Two
potential technology solutions, which aim to provide transparency and insight into
the systems in use, are proposed. The success and adoption of these choices, or any
other solutions developed, is highly dependent on the participation of ecosystem
partners such as foundries, original device manufacturers (ODMs), and original
equipment manufacturers (OEMs).

0272-1732 © 2021 IEEE
Digital Object Identifier 10.1109/MM.2021.3072174
Date of current version 25 May 2021.

By the time a computing system (e.g., laptop, desktop, server, smart watch, tablet, etc.) is delivered to its intended end user, the sum of its parts has traveled through a highly complex supply chain. This supply chain, illustrated in Figure 1, includes diverse component suppliers, subsystem manufacturers, integrators, and original equipment manufacturers (OEMs).

The final product may go through several warehouses and may be transported/handled via several shipping companies before finally being received, unboxed, and deployed/used.

Considering ever-increasing threats to this supply chain, end users have a growing need to know that the final product they received is indeed the product they ordered. Unintentional mistakes/errors, poor handling, or intentional fraud are key risks at play. Additional risks may come from malicious actors, including nation states and well-funded criminal organizations who are motivated to tamper with systems in the supply chain. The consequences of these risks may include financial or reputational harm.

Frankly, many treat an arbitrary computing system as a "black box" and blindly trust the supplier(s) and transport methods. However, a growing portion of the population, such as financial services or government institutions and even consumers, are keen to understand and gain insight into the systems they use to store and process sensitive information as well as conduct high-stakes transactions (e.g., financial, health care, education, etc.). They are actively taking steps to ensure that the computing system, as delivered, meets their risk profile and can fulfill their compliance, security, and performance requirements.

It is important for the modern technology supply chain ecosystem to take measures to ensure a growing portion of the population can trust the supply chain with increased accuracy and decreased cost. Improving transparency in the supply chain will help meet the need for security and quality assurance among broader portions of the population as both awareness and risks continue to grow.

KEY RISKS TO SECURITY
Any typical component or system changes hands dozens of times from inception to deployment and, ultimately, retirement. The supply chain is a continuous, evolving process and, as illustrated in Figure 2, may not end even when it leaves the end user's hands (e.g., recycle or donate).

Participants in the supply chain, such as the OEMs, also treat the role they play as intellectual property


FIGURE 1. High-level, simplified depiction of the modern technology supply chain.

(IP) or some sort of "business secret," which makes it even more challenging to evaluate and manage risk.

FIGURE 2. High-level depiction of the supply chain lifecycle.

Design
The first window of opportunity to insert risk into a computing system is by attacking the design and manufacturing of the individual components that will eventually comprise the system. Schematics can be altered, design tools can be compromised, and collaboration solutions can be manipulated to alter a component from its intended composition. Subsequent risks are introduced when end users work with the supplier, such as an OEM, to design and/or configure a platform (e.g., build-to-order systems). During such engagements, there is room for unintentional human error, such as misinterpretation of phone calls and emails. Risks might also include intentional malicious action that can cause harm in the process. Behind the scenes, the vendor also works with their respective suppliers, such as original design manufacturers (ODMs), to do similar functions such as source subcomponents and/or assemble components; this further exacerbates the potential for human error.

Assembly and Manufacturing
Further down the line, during assembly and manufacturing, there are more opportunities for malicious actions such as circuit design modification (e.g., hardware trojans, scan chain attacks, etc.), firmware modifications, and even counterfeiting.1 A disruption to the supply chain, such as a factory fire (local) or the 2020 COVID-19 pandemic (global), may also force OEMs to substitute parts to ensure timely delivery of computing systems to end users. The many levels and layers of the supply chain make it nearly impossible to keep a watchful eye on every single party involved and ensure that no harm, either intentional or unintentional, was done.

Shipment
After a system has left the factory, there are many opportunities for tampering, modifications, or changes within the hardware, firmware, and software. Once a device is received and deployed, it may be serviced (e.g., components replaced) in different locations or by different parties throughout its lifecycle. It is practically impossible to have complete confidence that a system has not been tampered with in some manner. The potential for tampering applies to each of the functional components of a computing system. For example, a solid-state drive (SSD) shipped to an ODM for integration into a computing system could be tampered with by having the firmware within the drive replaced with a malicious version.

Operation
Transparency in the supply chain helps the end user ensure that they received exactly what was ordered. This is a significant step in maintaining continuous trust in the computing ecosystem. Since most computing systems are frequently upgraded with the latest functional and security updates, it is extremely important to maintain the current state of the system throughout its entire lifecycle. Ideally, one should have the complete and latest information (e.g., firmware version updates) about each critical component of the computing system, and should be able to validate the version updates and configurations against the

104 IEEE Micro May/June 2021


SECURITY
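To make the validation step concrete, the short Python sketch below compares a device's reported component inventory against an expected ("golden") manifest and flags any deviation. It is only a sketch under assumed conventions: the component names, manifest format, and fields are hypothetical, not part of any existing tool described in this article.

# Minimal sketch: compare a device's reported component state against an
# expected manifest. All names and fields are hypothetical.

EXPECTED = {
    "ssd": {"vendor": "VendorA", "fw_version": "2.1.7"},
    "wifi": {"vendor": "VendorB", "fw_version": "11.4.0"},
    "embedded_controller": {"vendor": "VendorC", "fw_version": "1.0.3"},
}

def check_against_expected(reported):
    """Return a list of human-readable deviations from the expected state."""
    findings = []
    for name, expected in EXPECTED.items():
        actual = reported.get(name)
        if actual is None:
            findings.append(f"{name}: component missing from report")
            continue
        for field, want in expected.items():
            got = actual.get(field)
            if got != want:
                findings.append(f"{name}.{field}: expected {want!r}, reported {got!r}")
    for name in reported:
        if name not in EXPECTED:
            findings.append(f"{name}: unexpected component reported")
    return findings

reported_state = {
    "ssd": {"vendor": "VendorA", "fw_version": "2.1.5"},  # downgraded firmware
    "wifi": {"vendor": "VendorB", "fw_version": "11.4.0"},
}
for finding in check_against_expected(reported_state):
    print(finding)

Run against this toy report, the check would flag the downgraded SSD firmware and the missing embedded controller.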

Impact
Just as causes of unintended changes to the platform vary, the impact to the end user can vary from benign to more serious consequences rooted in malicious intent, such as the following.

› Financial—Stealing sensitive information can lead to financial gain through access to nonpublic information.
› Reputational—Exposing and fabricating information, or interrupting operations, can cause severe reputational damage to a person/company.
› Revenge—Personal attacks against a company or organization by a disgruntled employee or end user can cause potential damages ranging from physical alterations, compromise of IP, to destruction of reputation.
› Chaotic—Some individuals and criminal organizations simply want to "watch the world burn."

Different entities, private and public, are exposed to these and other threats at varying levels. It is imperative that supply chain transparency is designed to scale not just to different size businesses but to consumers as well over time.

POWER OF TRANSPARENCY
As might be expected, transparency has a cost. Beyond the obvious, managing the supply chain means keeping track of each component through manufacturing, warehousing, and delivery. This, in turn, limits the flexibility of the manufacturer.

For example, a manufacturer may wish to have multiple sources for a keyboard and would want to be flexible in selecting the keyboard to be integrated into a laptop based on the availability of parts and the cost at any given time. However, one may require or specify a keyboard from a specific supplier, which may impact timely delivery and also introduce challenges and additional cost due to inventory management.

Inventory may also be impacted by unexpected natural calamities such as the 2020 COVID-19 pandemic. In order to deal with such disruptions, as well as those caused by smaller unforeseen events, it is typical to include multiple sources for every significant component to support business continuity.

At the same time, suppliers also need to protect their confidential business information, including complex business relationships, procurement processes, and inventory management. This type of information in the wrong hands can have a significant impact on the business and can lead to compliance issues. Therefore, the ecosystem needs to align on the right technology as well as processes to minimize misuse of transparency while meeting end user requirements. Additionally, transparency needs to maintain a balance between the needs of the supply chain and the needs of the end user and consider a baseline that may be available to everyone, with additional levels of detail requiring different agreements.

TRUST REQUIRES INDUSTRY-WIDE PARTICIPATION
A computing system is commonly composed of various components. Many of these components can be classified as "smart"; these are components that typically contain their own firmware. Firmware has significant functional capabilities, including the ability to access, and potentially manipulate, critical or sensitive (e.g., personal, mission-critical, etc.) data.

The smart components are typically manufactured by different suppliers and, in turn, are composed of both active and passive subcomponents. It is important that one has full transparency of active components due to their significant capabilities to access and modify data. Passive components, such as resistors, capacitors, physical packaging, printed circuit boards (PCBs), or displays, can also pose a risk to one's data if not implemented correctly. For example, if PCB routing had certain wires close to the surface or implemented sockets for expansion or options, an adversary might be able to easily substitute or even add a component without the knowledge of the end user.

While it may not be practical to have full transparency at each passive component level, it is important, at a minimum, to have transparency at the active component level and to establish trust in the entire supply chain; this requires an industry-wide initiative and agreement on key technical underpinnings in order for the initiative to be successful. With the ultimate goal of achieving full transparency, we recommend starting with small yet critical components of the computing system first and then developing technology and processes to progress from there.

While we do not attempt to specify a comprehensive list of smart components for initial implementation, some examples that the industry should consider as starting points are the CPU, SSD, wireless (e.g., Bluetooth, WiFi, etc.) interfaces, motherboard, and any/all embedded controllers (ECs). It is not enough to know the manufacturer of these components or subsystems; one needs to know details such as the firmware running on these components, the date and location of manufacture, and even the design revision.

PROPOSED SOLUTIONS
Technology choices we make to establish transparency will have a long-term impact on the industry from both a longevity and a cost perspective. We propose two models, which are 1) use of a ledger or database and 2) a method of self-reporting.

Ledger or Database
As a computing system goes through assembly and delivery in the supply chain, each entity makes an entry in a ledger, thus creating a record that can be retrieved later by the end user to meet their need for transparency. Once a computing system is delivered to the end user, the same process can be used to record changes, such as a change of ownership, to maintain a continuous record for the entire lifespan of the computing system. Two fundamental approaches exist here.

Centralized Approach
A trusted third party maintains the ledger or database of all the transactions and manages updates to and retrieval of information per agreement between the trusted third party and members of the supply chain. The fundamental trust model is based on the idea that without collaboration between multiple member companies in the supply chain, any inconsistency (intentional or otherwise) might not be detected. For example, if a system manufacturer entry shows they shipped 1M computing systems with a specific SSD from a specific supplier, a corresponding entry by the supplier in their database is necessary to corroborate it.
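That cross-checking idea can be illustrated with a few lines of Python. The sketch below accepts a manufacturer's shipment claim only when a matching supplier entry exists; the record format, field names, and party names are hypothetical and only stand in for whatever schema a trusted third party would actually define.

# Minimal sketch of corroborating a manufacturer claim against supplier records.
# All record formats and names are hypothetical.

manufacturer_entries = [
    {"supplier": "SSD-Maker", "part": "SSD-X1", "qty": 1_000_000, "lot": "2020-Q4"},
]

supplier_entries = [
    {"customer": "System-OEM", "part": "SSD-X1", "qty": 1_000_000, "lot": "2020-Q4"},
]

def corroborated(claim, counterpart_entries):
    """A claim is corroborated only if the counterpart recorded the same part, quantity, and lot."""
    return any(
        entry["part"] == claim["part"]
        and entry["qty"] == claim["qty"]
        and entry["lot"] == claim["lot"]
        for entry in counterpart_entries
    )

for claim in manufacturer_entries:
    status = "corroborated" if corroborated(claim, supplier_entries) else "INCONSISTENT"
    print(claim["part"], "x", claim["qty"], ":", status)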
Blockchain Approach
Application of blockchain technology to ingredient supply chains has been widely discussed. A 2019 Fortune article outlined Bumble Bee Tuna's use of blockchain to create an electronic and distributed ledger chronicling a fish's capture, processing, and travel history to ensure freshness and product quality.2 Beyond this ability to track ingredients, the distributed ledger can be expanded to keep a running and growing record over the operational lifecycle of each system. Tightly controlled and permissioned ledger entries can continue for updates, changes, and ownership transitions that occur as the system travels through distribution and integration and then, ultimately, provisioning, operation, and modification by the end user or owner. The private and public cryptographic controls inherent in existing blockchain infrastructures can be applied to balance the simultaneous needs of all ecosystem participants. Component suppliers can contribute product information into the blockchain without the worry of sharing sensitive product details with competitors. Similarly, system manufacturers can create system-level ledgers from component and subsystem vendors in private transactions shared only with those whose access permissions have been cryptographically proven. Ownership can be transferred to operators who can manage ledger updates to track key information regarding location, application, upgrades, updates, and usage statistics. The usefulness of this ledger can extend from the resale of the device into the secondary market; the ledger can provide sellers with better insights into possible IP loss based on the usage and application while providing buyers a better understanding of the provenance of the device.
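The following Python sketch shows the core property such a ledger relies on: each lifecycle entry is cryptographically chained to everything recorded before it, so a tampered or deleted entry is detectable. It deliberately omits consensus, permissioning, and per-entry signatures, which a real blockchain deployment would add; the event fields are hypothetical.

# Minimal sketch of an append-only, hash-chained ledger for a single system.

import hashlib
import json

class SystemLedger:
    def __init__(self, system_id):
        self.system_id = system_id
        self.entries = []

    def append(self, event):
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else "0" * 64
        body = {"system_id": self.system_id, "event": event, "prev_hash": prev_hash}
        entry_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        entry = dict(body, entry_hash=entry_hash)
        self.entries.append(entry)
        return entry

    def verify_chain(self):
        prev_hash = "0" * 64
        for entry in self.entries:
            body = {k: entry[k] for k in ("system_id", "event", "prev_hash")}
            recomputed = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != recomputed:
                return False
            prev_hash = entry["entry_hash"]
        return True

ledger = SystemLedger("SN-0001")
ledger.append({"type": "assembled", "odm": "ODM-1"})
ledger.append({"type": "ownership_transfer", "to": "End-User-A"})
ledger.append({"type": "firmware_update", "component": "ssd", "version": "2.1.7"})
print("chain intact:", ledger.verify_chain())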

Self-Reporting
Tracking the components in the supply chain while the device is manufactured and readied to be delivered to the end user helps ensure that intended components are used in the device. Once the device is in the hands of the end user, the device can provide a cryptographic report (i.e., attestation) of the current configuration of all of the smart components in the device. The configuration can include the current firmware version, security version, hardware ID, etc. Each report may further include information about nonactive or passive components that are part of the smart component or even include a report-out of smart subcomponents as well as a revision or change log for additional transparency.

In order to establish trust in the report, the smart component needs to include a cryptographic key and associated certificate, as depicted in Figure 3, and, when queried, produce a signed report along with the certificate so that the end user can validate the authenticity of and trust in the certificate and subsequently trust the report.
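The report-and-verify flow can be sketched in a few lines of Python. A real attestation scheme would use an asymmetric per-component key and a vendor-issued certificate chain; the sketch below substitutes an HMAC over a shared secret purely to stay self-contained, and all field names are illustrative.

# Minimal sketch of a component report-out and its verification.
# HMAC stands in for a real signature with a device key and certificate.

import hashlib
import hmac
import json

DEVICE_KEY = b"per-component-secret"  # stand-in for the component's private key

def produce_report(component_id, config):
    """Component side: emit the current configuration plus a signature over it."""
    payload = json.dumps({"component_id": component_id, "config": config}, sort_keys=True).encode()
    tag = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    return {"component_id": component_id, "config": config, "signature": tag}

def verify_report(report, key):
    """Verifier side: recompute the signature before trusting the contents."""
    payload = json.dumps({"component_id": report["component_id"], "config": report["config"]}, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, report["signature"])

report = produce_report("ssd0", {"fw_version": "2.1.7", "security_version": 3, "hardware_id": "SSD-X1"})
print("report trusted:", verify_report(report, DEVICE_KEY))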
Finally, the entire computing system can either collect the report-out from each of the smart components and produce a comprehensive report for the entire system, or it can leave it up to the end user to query each smart component of interest. The benefit of a system-level report-out is that the end user gets the full picture without necessarily having to understand how to query each of the components (as the manufacturer of the system has the best understanding). This would require the system to also have a trusted smart component that can not only collect all the information but is also able to produce a cryptographically signed comprehensive report.

The shortcoming of this type of approach is that each component has the additional cost of obtaining a certificate and the complexity of providing a report-out. Additionally, the end user does not have the benefit of transparency from any component that does not implement this capability.

FIGURE 3. Simplified illustration of computer components with a certificate of authenticity.

GOVERNANCE
There are two potential paths toward solution adoption. The first is a self-regulated or market-driven approach where each participant in the supply chain may choose their own degree of participation in transparency. End users, based on their purchasing preference, will drive the supply chain to adopt the "right" level of transparency to meet market needs. The other is by means of an industry group or consortium in which computing system supply chain participants form a consortium to establish technical direction and a governance model to provide consistent and sufficient transparency to meet broad industry needs.

RECOMMENDATION
With increasing reliance on information technology for business-critical and personal data, having transparency in the supply chain, as well as information on the current state of the computing systems in use, is vital to the economic health of the information technology ecosystem. Transparency by itself is of limited value without the information needed to assess risk or compliance. Beyond the ability to verify that what the end user received is indeed what they ordered, there is additional value in transparency post deployment as well. For example, if a security risk is identified (e.g., a publication in the common vulnerabilities and exposures system), the end user can use transparency to precisely pinpoint impacted systems and take appropriate remediation steps.
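As a concrete illustration of that pinpointing step, the Python sketch below matches a published advisory against a fleet inventory built from the transparency data discussed above. The advisory format, the CVE identifier, and the inventory schema are hypothetical placeholders.

# Minimal sketch: pinpoint systems whose reported firmware matches a published advisory.
# All identifiers and formats are hypothetical.

advisory = {
    "cve_id": "CVE-XXXX-YYYY",  # placeholder identifier
    "component": "wifi",
    "affected_fw_versions": {"11.3.0", "11.3.1"},
}

fleet_inventory = {
    "SN-0001": {"wifi": "11.3.1", "ssd": "2.1.7"},
    "SN-0002": {"wifi": "11.4.0", "ssd": "2.1.7"},
    "SN-0003": {"wifi": "11.3.0", "ssd": "2.0.9"},
}

def impacted_systems(advisory, inventory):
    """Return serial numbers whose reported firmware falls in the affected set."""
    hits = []
    for serial, components in inventory.items():
        if components.get(advisory["component"]) in advisory["affected_fw_versions"]:
            hits.append(serial)
    return hits

print(advisory["cve_id"], "impacts:", impacted_systems(advisory, fleet_inventory))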
Due to the diverse and complex nature of the ecosystem, no one company or institution can single-handedly solve this problem. The choice of technologies and collaboration will have long-term implications for the evolution of capabilities to meet current and future needs as well as for the ability to scale and manage costs. We strongly recommend that industry leaders come together and take a comprehensive, long-term approach to address this important problem.

REFERENCES
1. M. Mattioli, "Consumer exposure to counterfeit hardware," IEEE Consum. Electron. Mag., to be published, doi: 10.1109/MCE.2020.3023873.
2. Fortune, "Bumble bee foods aims to put all its fish on a blockchain. It's starting with 'fair trade' tuna," Mar. 2019. Accessed: Feb. 17, 2021. [Online]. Available: https://fortune.com/2019/03/08/tuna-blockchain-bumble-bee-sap/

MICHAEL MATTIOLI currently leads the Hardware Engineering team within Goldman Sachs, New York, NY, USA. He is responsible for the design and engineering of the firm's digital experiences and technologies. He is also responsible for the overall strategy and execution of hardware innovation both within the firm and within the broader technology industry. Contact him at michael.mattioli@gs.com.

TOM GARRISON is currently the Vice President and General Manager of Security Strategy and Initiatives within the Client Computing Group, Intel Corporation, Mountain View, CA, USA. He leads efforts to help customers and manufacturers deploy tooling and processes for greater security assurance, supply chain transparency, and cybersecurity innovation. He also launches industry-wide initiatives and research with ecosystem partners and academia to accelerate cybersecurity product assurance. Contact him at tom.garrison@intel.com.

BAIJU V. PATEL is currently an Intel Fellow within the Client Computing Group, Intel Corporation, Mountain View, CA, USA. He is responsible for setting the technical direction for the Client Computing Group and Intel's security technologies. Contact him at baiju.v.patel@intel.com.



DEPARTMENT: MICRO ECONOMICS

Remote Work
Shane Greenstein, Harvard Business School, Boston, MA, 02163, USA

We are living through a monumental change in work. Experiments during the pandemic turned into regular operations, and have had a profound impact on perceptions of the viability of remote work. I do not lightly use the word "profound." The trend reverses more than two centuries of separating the location of work and residences. That separation has deep roots, and reflects something fundamental about the gains to society from separating work from other activities. All developed countries have this feature.

The pandemic brought about an acceleration of remote work. To appreciate how suddenly this arrived, consider that two years ago less than 25% of the U.S. workforce participated in some sort of remote work at home, and it was typically less than half their work hours. During the height of the pandemic, however, approximately 40% of the U.S. labor force participated in remote work, and it was full time at home.

Surveys of the U.S. experience indicate the shift to remote work started with some initial variance and hesitance in March 2020, but by the end of April, this hesitance had evaporated. The practice emerged quickly, with anybody working remotely who could do so. Once it started, few went back, as firms put in place expensive investments to support remote work, and established routines. A large amount of travel got cancelled, and managers explored how much they gained or lost from substituting to online meetings. Those investments and drastic behavioral changes would never have been made in normal times.

Now, the lessons have been learned, and nobody will forget the experience, nor tear out the software that supports it, which raises the question: How much of this will persist after the pandemic? Today's column reviews some of the latest research on this question.

BEYOND PUBLIC DISCUSSION
Two extreme examples tend to receive attention in public conversation. On the one hand, some remote work took place out of necessity, and, therefore, reluctantly. On the other hand, many employees and their employers discovered unexpected benefits from remote employment, and have considered continuing the practice.

Start with reluctant remote work. Nobody expects a reluctant employee to continue to work at home once the pandemic declines. The reluctance has many origins. For example, many daily work experiences can best be performed face-to-face with coworkers and in person, and no Zoom call can replicate that. In addition, remote education does not work well for the vast majority of K-12 students, and many parents unsuccessfully had to navigate their work and their child's travails with virtual school. All would happily return to the traditional arrangements.

The source of gains also varied, but two circumstances dominated. For one, less commute time led to real gains in time devoted to work, and that arose across a wide variety of jobs. Second, some firms could get out of real-estate contracts. Less real-estate expense leads to real savings in the budget.

To be sure, these gains arose unevenly, with surveys showing massive differences across different occupations and product markets. Skilled employees at technology firms tended to be among the biggest users of remote work, for example, while most services tended not to be. Something subtle also emerged, namely, a shifting of norms, especially with respect to travel. Many business trips did not occur. Expenses and time were saved, albeit many sales also did not occur that a more personal touch could have pushed over the line.

Just in case there was any doubt, productivity gains at the individual level did not make up for the decline in the macro economy. Only in rare instances did a firm find itself in a better place, and most of those experiences arose in technology companies whose online businesses faced explosive demand.

MAKE A FORECAST
To get clues on the likelihood and direction of the anticipated shift to remote work after the pandemic, forecasters needed to get beyond anecdote and characterize broad patterns. Two professors from the University of Chicago, Jonathan Dingel and Brent Neiman, took a clever approach. They made an educated guess at the start of the pandemic about how remote work would affect some occupations and not others. Their speculation turned out to be close to the actual experience.1

They reasoned that it was feasible for many to work from home, and social distancing might push the reluctant to try it. In spring 2020, they estimated the maximum percentage of the workforce that could work from home, landing on an estimate of 37% of the workforce.

This estimate arose out of a close reading of the Occupational Information Network, a database maintained by the Bureau of Labor Statistics. It provides detailed descriptions of over one thousand pre-pandemic occupations in the U.S. economy. Dingel and Neiman classified occupations by whether an employee performed physical activity, used heavy machinery at work, required constant communications with others, and several other criteria that precluded remote work. Based on surveys of the prevalence of those occupations, they estimated what fraction, at most, could work from home.
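The logic of that estimate is simple enough to sketch in a few lines of Python: classify each occupation as remote-feasible or not, then weight by how many people hold it (and, for the wage share discussed below, by what they earn). The occupations and numbers here are invented for illustration and are not Dingel and Neiman's data.

# Toy illustration of an employment- and wage-weighted remote-feasibility share.
# The data below are made up; only the weighting logic mirrors the approach described.

occupations = [
    # (name, remote_feasible, employment, average_wage)
    ("software developer", True, 1_500_000, 110_000),
    ("financial analyst", True, 300_000, 85_000),
    ("nurse", False, 3_000_000, 75_000),
    ("warehouse worker", False, 1_200_000, 35_000),
]

total_emp = sum(emp for _, _, emp, _ in occupations)
total_wages = sum(emp * wage for _, _, emp, wage in occupations)

remote_emp = sum(emp for _, feasible, emp, _ in occupations if feasible)
remote_wages = sum(emp * wage for _, feasible, emp, wage in occupations if feasible)

print(f"share of workers who could work from home: {remote_emp / total_emp:.0%}")
print(f"share of total wages earned by that group: {remote_wages / total_wages:.0%}")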
This forecast yielded novel insights that differed from pre-pandemic experiments with remote work. Prior studies tended to stress the compatibility of remote work with individualistic and asynchronous activity, such as that found in contract programming. As it turned out, however, much more was feasible across a much wider range of design, engineering, and financial activities than anybody thought possible.

In addition, this approach yielded several additional observations. Dingel and Neiman calculated that this workforce accounted for 46% of the wages in the economy. To say it bluntly, with the exception of "essential workers" such as doctors and other medical personnel, the best paid workers in the economy were most able to adjust to the pandemic, while the worst paid workers in the economy were required to face the health risks associated with reporting to work at a traditional location.

Most remarkably, this simple prediction turned out to be a reasonable estimate for many other countries. In other words, focusing on the feasibility of moving occupations to remote work helps explain why countries with high income and difficulties managing the pandemic shared a similar economic experience.

The broad trend hides a great deal of variance in experience, to be sure. For example, numerous surveys of the U.S. labor experience found that professional women with school-age children had a more difficult time with remote work, regardless of the occupation. The type of employer also shaped the experience, with larger firms managing it better in general, with the notable exceptions of retailing and entertainment.2

What does this mean for the economy when it does go back to normal? Stated simply, remote work is not a possibility in the entire economy, but it has become a strong probability in some of the highest paid occupations, and especially where it is possible to separate childcare from work at home.

VARIANCE
Forecasters have started to think hard about the distinction between the transient and permanent changes wrought by remote work. Again, it involves considerable guesswork, since nobody can fully anticipate what part of remote work will persist in the long run.

Some forecasts are easy to make. Some jobs, such as event planners, are devoted to administrative efforts to support conventions and face-to-face meetings. Another fraction of jobs goes into supporting commuters to downtown areas—e.g., attendants of vehicles for mobile lunches, and security guards of parking lots. The number of these jobs will decline.

More difficult to forecast, new jobs will emerge to meet the new demands from remote work. For example, it will become somebody's job to find ways to facilitate the rotation of desks and meeting rooms among those showing up every few days, or among those visiting a remote location where another branch of the firm resides. Some of these staff will coordinate new types of meetings among workers from the same firm, who merely want to meet once a month, but do so in a bigger room, or a setting that accommodates larger teams. Some of these jobs are difficult to imagine because many firms have never experienced this activity at such scale.

The rise of remote work also will enable some workers to become more of a contractor and less of an employee, accelerating a trend that already has gotten a foothold in technology labor markets. This will generate a new type of job, namely, extending IT services into the fraction of the remote workforce that becomes closer to a contractor. Some of these staff will take on the administrative tasks just like IT staff within large organizations, but with a more dispersed clientele.

HR and many managerial tasks also will be redefined by this shift. Especially difficult to predict is how organizational cultures will change with a high fraction of employment in remote work. Employees must develop enough trust to transact with one another by sharing the mission and a list of common experiences. Deep organizational cultures, however, need to get beyond transactional trust. It is essential to share a mission in order to share lessons and informal knowledge, and coordinate long-term plans. In other words, if a large fraction of a company's employees alter where they perform their job, then that also will lead to changes in the daily tasks of many of that company's other employees.

REFERENCES
1. J. I. Dingel and B. Neiman, "How many jobs can be done at home?," J. Public Econ., vol. 189, Sep. 2020, Art. no. 104235. [Online]. Available: https://www.sciencedirect.com/science/article/abs/pii/S0047272720300992
2. N. Bloom, R. S. Fletcher, and E. Yeh, "The impact of COVID-19 on US firms," National Bureau of Economic Research, Cambridge, MA, USA, Working Paper 28314. [Online]. Available: https://www.nber.org/papers/w28314

SHANE GREENSTEIN is a Professor at Harvard Business School. Contact him at sgreenstein@hbs.edu.

