You are on page 1of 4

V

viewpoints

DOI:10.1145/3457191 Sean Peisert


• Terry Benzel, Column Editor

Security
Trustworthy Scientific
Computing
Addressing the trust issues underlying the
current limits on data sharing.

D
science is
ATA U S E F U L TO hindered, because data is not made
not shared as much as it available and used in ways that maxi-
should or could be, partic- Hardware mize its value.
ularly when that data con- trusted execution And yet, as emphasized widely in
tains sensitivities of some scientific communities,3,5 by the Na-
kind. In this column, I advocate the environments tional Academies, and via the U.S.
use of hardware trusted execution en- can form the basis government’s initiatives for “respon-
vironments (TEEs) as a means to sig- sible liberation of Federal data,” find-
nificantly change approaches to and for platforms ing ways to make sensitive data avail-
trust relationships involved in se- that provide able is vital for advancing scientific
cure, scientific data management. discovery and public policy. When
There are many reasons why data may strong security data is not shared, certain research
not be shared, including laws and benefits while may be prevented entirely, be signifi-
regulations related to personal priva- cantly more costly, take much longer,
cy or national security, or because maintaining or might simply not be as accurate
data is considered a proprietary trade computational because it is based on smaller, poten-
secret. Examples of this include elec- tially more biased datasets.
tronic health records, containing performance. Scientific computing refers to the
protected health information (PHI); computing elements used in scien-
IP addresses or data representing the tific discovery. Historically, this has
locations or movements of individu- emphasized modeling and simula-
als, containing personally identifi- tion, but with the proliferation of in-
able information (PII); the properties the risks of sharing sensitive data, struments that produce and collect
of chemicals or materials, and more. and concerns of providers of comput- data, now significantly also includes
Two drivers for this reluctance to ing systems about the risks of hosting data analysis. Computing systems
share, which are duals of each other, such data. As barriers to data sharing used in science include desktop sys-
are concerns of data owners about are imposed, data-driven results are tems and clusters run by individual

18 COMM UNICATIO NS O F THE ACM | M AY 2021 | VO L . 64 | NO. 5


viewpoints

investigators, institutional comput- tectures like this are becoming more physical presence in a particular fa-
ing resources, commercial clouds, and more common as means for sci- cility for analysis would be a public
and supercomputers such as those entific computing involving sensitive health risk.
present in high-performance com- data.8 However, even with these secu-
puting (HPC) centers sponsored by rity protections, traditional enclaves Reducing Data Sensitivity Using
U.S. Department of Energy’s Office of still require implicitly trusting sys- “Anonymization” Techniques
Science and the U.S. National Science tem administrators and anyone with Sometimes attempts are made to
Foundation. Not all scientific com- physical access to the system con- avoid security requirements by mak-
puting is large, but at the largest taining the sensitive data, thereby in- ing data less sensitive by applying
scale, scientific computing is charac- creasing the risk to and liability of an “anonymization” processes in which
terized by massive datasets and dis- institution for accepting responsibil- data is masked or made more gener-
tributed, international collabora- ity for hosting data. This security al. Examples of this approach remove
tions. However, when sensitive data limitation can significantly weaken distinctive elements from datasets
is used, computing options available the trust relationships involved in such as birthdates, geographical lo-
are much more limited in computing sharing data, particularly when cations, or IP network addresses.
scale and access.8 groups are large and distributed. Indeed, removing 18 specific identi-
These concerns can be partially miti- fiers from electronic health records
Current Secure gated by requiring data analysts to be satisfies the HIPAA Privacy Rule’s
Computing Environments physically present in a facility owned “Safe Harbor” provisions to provide
Today, where remote access to data is by the data provider in order to access legal de-identification. However, on
IMAGERY BY KENTOH /SHUT TERSTO CK

permitted at all, significant technical data. However, in all these cases, anal- a technical level, these techniques
and procedural constraints may be ysis is hindered for the scientific com- have repeatedly been shown to fail to
put in place, such as instituting in- munity whose abilities and tools are preserve privacy, typically by merg-
gress/egress “airlocks,” requiring optimized for working in open, collab- ing external information containing
“two-person” rules to move software orative, and distributed environ- identifiable information with quasi-
and data in or out, and requiring the ments. Further, consider the current identifiers in the dataset to re-identi-
use “remote desktop” systems. Archi- pandemic in which a requirement of fy “anonymized” records.6 Therefore,

MAY 2 0 2 1 | VO L. 6 4 | N O. 5 | C OM M U N IC AT ION S OF T HE ACM 19


viewpoints

A portion of a system leveraging a trusted execution environment in which data is stored used or generated in the computation,
encrypted on disk; a policy engine, controlled by the data owner, and running in the TEE, including even from certain “physical
contains the mapping for what data is to be made available for computing by each authen- attacks” against the computing sys-
ticated user; and an output policy, also specified by the data owner, dictates what informa-
tem. They can implement similar
tion is permitted to be returned to the user. An output policy might be based on differential
privacy, or be access-control based, or be some combination of these or other functions. functionality as software-based homo-
morphic and mutiparty computation2
approaches, but without the usability
issues and with dramatically smaller
“ Trusted Execution Environment (protected data is always performance penalties.
encrypted) and computation isolated (cannot see other processes)
The use of TEEs to protect against
Encrypted Data
Repository Compute System untrustworthy data centers is not a
novel idea, as seen by the creation of
Policy Engine
“ (contains decrypt Output Policy the Linux Foundation’s Confidential
key) Computing Consortium10 and
Google’s recent “Move to Secure the
Encrypted Data
Repository Cloud From Itself.”7 Google has com-
Input Data, Query, Permitted paring the importance of the use of
or Instructions Data Output
“ TEEs in its cloud platform to the in-
vention of email.9 However, TEEs
Encrypted Data have not yet seen broad interest and
Repository End “User” (Person or Automated Process)
adoption by scientists or scientific
computing facilities.
The envisioned approach is to le-
de-identification does not necessar- Nested Paging) in 2020. All three verage TEEs when data processing
ily address the risk and trust issues vendors take extremely different ap- environments are out of the direct
involved in data sharing because re- proaches and have extremely differ- control of the data owner, such as in
identification attacks can still result ent strengths, weaknesses, use cases, third-party (including DOE or NSF)
in significant embarrassment, if not and threat models. HPC facilities or commercial cloud
legal sanctions. In addition, the same TEEs can be used to maintain or environments, in order to prevent ex-
masking used in these processes also even increase security over traditional posure of sensitive data to other us-
removes data that is critical to the enclaves, at minimal cost to perfor- ers of those systems or even the ad-
analysis.6 Consider public health re- mance in comparison to computing ministrators of those systems. Data
search for which the last two digits over plaintext. TEEs can isolate com- providers can specify the configura-
of a ZIP code, or the two least signifi- putation, preventing even system ad- tion of the system, even if they are not
cant figures of a geographic coordi- ministrators of the machine in which directly the hosts of the computing
nate are vital to tracking viral spread. the computation is running from ob- environment, to specify access con-
serving the computation or data being trol policies, a permitted list of soft-
Confidential Scientific Computing ware or analyses that can be per-
Hardware TEEs can form the basis for formed, and output policies to
platforms that provide strong securi- Trusted execution prevent data exfiltration by the user.
ty benefits while maintaining compu- The notion of being able to leverage
tational performance (see the accom- environments can community HPC and cloud environ-
panying figure). TEEs are portions of be used to maintain ments also enables the use of data
certain modern microprocessors that from multiple providers simultane-
enforce strong separation from other or even increase ously while protecting the raw data
processes on the CPU, and some can security over from all simultaneously, each poten-
even encrypt memory and computa- tially with their own distinct policies.
tion. TEEs have roots in the concepts traditional enclaves, Researchers at the Berkeley Lab
of Trusted Platform Modules (TPMs) at a minimal cost and UC Davis have been empirically
and Secure Boot, but have evolved to evaluating Intel SGX and AMD SEV
have significantly greater functional- to performance TEEs for their performance under
ity. Common commercial TEEs today in comparison typical HPC workloads. Our results1
include ARM’s TrustZone, introduced show that AMD’s SEV generally im-
in 2013; Intel’s Secure Guard Exten- to computing poses minimal performance degra-
sions (SGX), introduced in 2015; and over plaintext. dation for single-node computation
AMD’s Secure Encrypted Virtualiza- and represents a performant solution
tion (SEV), introduced in 2016 and for scientific computing with lower
revamped several times since then ratios of communication to computa-
to include SEV-ES (Encrypted State) tion. However, Intel’s SGX is not per-
in 2017 and SEV SEV-SNP (Secure formant at all for HPC due to TEE

20 COM MUNICATIO NS O F TH E ACM | M AY 2021 | VO L . 64 | NO. 5


viewpoints

memory size limitations. Important- batch scheduling systems in HPC; HPC


ly, NERSC-9 and a number of other I/O subsystems; custom scientific work-
modern HPC centers will contain Trusted execution flows; highly specialized scientific in-
AMD processors that support the SEV environments struments; community data reposito-
TEE, and thus it is our hope that our ries, and so on. Therefore what is
results will provide some of the evi- enable sensitive needed is a conversation between pro-
dence needed to justify the use of data to be leveraged cessor manufacturers, system vendors
TEEs in scientific computing. (for example, Cray, HPE), and scientific
without having computing operators regarding en-
Looking to the Future to trust system abling the TEE functionality already
Although numerous commercial TEEs present in the AMD EPYC processors—
exist, no TEEs yet exist in processors administrators and presumably in other, future proces-
other than CPUs, such as in GPUs and and computing sors—into scientific computing envi-
accelerators, although Google has in- ronments. However, the path forward is
dicated that it plans to expand “Con- providers. not solely technical. It requires the com-
fidential Computing” to GPUs, TPUs, munity to build infrastructure around
and FPGAs.9 There are also issues with TEE technology and integrate that infra-
low-latency communication between structure into scientific computing fa-
TEEs, and also the cost of virtualiza- cilities and workflows, and into the
tion, that must be addressed to enable mind-set of operators of such facilities.
HPC at scale.1 In addition, promising highly useful today, albeit in a limited I hope this column helps to start that
RISC-V efforts such as Keystone4 exist set of situations for datasets that have conversation. For more on TEEs, see the
that carry both the promise of broad- sufficiently wide use to justify the Singh et al. article on p. 42.  —Ed.
ening the scope of processors that time and expense required. Work is
contain TEEs, while also being open needed to advance the usability of dif- References
1. Akram, A. et al. Performance analysis of scientific
source and possible to formally verify. ferential privacy so it can more easily computing workloads on trusted execution
However, RISC-V based TEEs have not be broadly leveraged. environments. In Proceedings of the 35th IEEE
International Parallel & Distributed Processing
yet been developed that target scien- Symposium. (2021).
2. Choi, J.I. and Butler, K. Secure multiparty
tific computing. Most likely, an entire- Summary and Next Steps computation and trusted hardware: Examining
ly new TEE architecture tailored for In contrast to traditional secure en- adoption challenges and opportunities. Security and
Communication Networks 1368905 (2019).
scientific computing and data analy- claves, TEEs enable sensitive data to 3. Hastings, J.S. Unlocking data to improve public policy.
sis applications will be needed. be leveraged without having to trust Commun. ACM 62, 10 (Sept. 2019), 48–53.
4. Lee, D. Keystone: An open framework for architecting
Output policies are another area system administrators and comput- trusted execution environments. In Proceedings of the
that deserve investigation. While TEEs ing providers. However, while the Fifteenth European Conference on Computer Systems
(EuroSys) (Heraklion, Greece, 2020), Article 38.
protect against untrusted computing application of TEEs has now been 5. Macfarlane, J. When apps rule the road: The proliferation
providers, and can provide certain widely heralded in cloud environ- of navigation apps is causing traffic chaos. It’s time to
restore order. IEEE Spectrum 56, 10 (2019), 22–27.
measures of protection from mali- ments, TEEs have not been discussed 6. Narayanan, A. and Felten, E.W. No Silver Bullet:
cious users, output policies determine for use in scientific computing en- De-identification Still Doesn’t Work. (2014); https://bit.
ly/3loBvMd
what data is returned to the user. Dif- vironments, despite the significant 7. Newman, L.H. Google moves to secure the Cloud from
ferential privacy is a particularly inter- concerns frequently expressed by itself. Wired (July 14, 2020); https://bit.ly/2NnLLru
8. Peisert, S. An examination and survey of data
esting approach to providing strong both data providers and computing confidentiality issues and solutions in academic
privacy protection of data output. Dif- facilities about hosting sensitive data. research computing. Trusted CI Report (2020);
https://bit.ly/30Oajx9
ferential privacy is a statistical tech- Operators of scientific computing fa- 9. Potti, S. and Manor, E. Expanding Google Cloud’s
Confidential Computing portfolio. (2020); https://bit.
nique that can guarantee the bounds cilities are notoriously conservative ly/38Nozuo
on the amount of information about a for good reason—they are frequently 10. Rashid, F.Y. The rise of confidential computing. IEEE
Spectrum 57, 6 (2020), 8–9.
dataset that can be leaked to a data evaluated on the degree of utiliza-
analyst as a result of a query or compu- tion and amount of uptime of the sys- Sean Peisert (sppeisert@lbl.gov) leads computer security
tation by adding “noise” and enforcing tems they run, and so the margin for R&D at Lawrence Berkeley National Laboratory and is an
Associate Adjunct Professor at University of California,
a “privacy budget” that bounds infor- error is low. But TEEs are here, they Davis, USA.
mation leakage. It is now a main- are available, and until we start mak-
stream solution, with production use ing use of them in scientific comput- The author thanks Venkatesh Akella, Ayaz Akram, Jim
by Apple, Google, and the U.S. Census ing, data is not shared as much as it Basney, Jason Lowe-Power, and Von Welch for their
valuable feedback on this Viewpoint and the ideas in it.
Bureau, the existence of several open should or could be by leveraging TEEs This work was supported by the Director, Office of Science,
source distributions, and successful to address the trust issues underlying Office of Advanced Scientific Computing Research, of the
U.S. Department of Energy, and by Contractor Supported
application to a diverse range of data current limits on data sharing. Research (CSR) funds provided by Lawrence Berkeley
types. However, differential privacy is National Laboratory, operated for the U.S. Department
What is missing is a connection to the of Energy under Contract No. DE-AC02-05CH11231.
not appropriate everywhere, and ap- particular infrastructure used in scien- Any opinions, findings, conclusions, or recommendations
expressed in this material are those of the authors and do
plying it is currently challenging, re- tific computing, including identity, ac- not necessarily reflect those of the sponsors of this work.
quiring a high degree of expertise and cess, and authentication systems; re-
effort. Thus, differential privacy is mote direct memory access (RDMA); Copyright held by author.

MAY 2 0 2 1 | VO L. 6 4 | N O. 5 | C OM M U N IC AT ION S OF T HE ACM 21

You might also like