Professional Documents
Culture Documents
viewpoints
Security
Trustworthy Scientific
Computing
Addressing the trust issues underlying the
current limits on data sharing.
D
science is
ATA U S E F U L TO hindered, because data is not made
not shared as much as it available and used in ways that maxi-
should or could be, partic- Hardware mize its value.
ularly when that data con- trusted execution And yet, as emphasized widely in
tains sensitivities of some scientific communities,3,5 by the Na-
kind. In this column, I advocate the environments tional Academies, and via the U.S.
use of hardware trusted execution en- can form the basis government’s initiatives for “respon-
vironments (TEEs) as a means to sig- sible liberation of Federal data,” find-
nificantly change approaches to and for platforms ing ways to make sensitive data avail-
trust relationships involved in se- that provide able is vital for advancing scientific
cure, scientific data management. discovery and public policy. When
There are many reasons why data may strong security data is not shared, certain research
not be shared, including laws and benefits while may be prevented entirely, be signifi-
regulations related to personal priva- cantly more costly, take much longer,
cy or national security, or because maintaining or might simply not be as accurate
data is considered a proprietary trade computational because it is based on smaller, poten-
secret. Examples of this include elec- tially more biased datasets.
tronic health records, containing performance. Scientific computing refers to the
protected health information (PHI); computing elements used in scien-
IP addresses or data representing the tific discovery. Historically, this has
locations or movements of individu- emphasized modeling and simula-
als, containing personally identifi- tion, but with the proliferation of in-
able information (PII); the properties the risks of sharing sensitive data, struments that produce and collect
of chemicals or materials, and more. and concerns of providers of comput- data, now significantly also includes
Two drivers for this reluctance to ing systems about the risks of hosting data analysis. Computing systems
share, which are duals of each other, such data. As barriers to data sharing used in science include desktop sys-
are concerns of data owners about are imposed, data-driven results are tems and clusters run by individual
investigators, institutional comput- tectures like this are becoming more physical presence in a particular fa-
ing resources, commercial clouds, and more common as means for sci- cility for analysis would be a public
and supercomputers such as those entific computing involving sensitive health risk.
present in high-performance com- data.8 However, even with these secu-
puting (HPC) centers sponsored by rity protections, traditional enclaves Reducing Data Sensitivity Using
U.S. Department of Energy’s Office of still require implicitly trusting sys- “Anonymization” Techniques
Science and the U.S. National Science tem administrators and anyone with Sometimes attempts are made to
Foundation. Not all scientific com- physical access to the system con- avoid security requirements by mak-
puting is large, but at the largest taining the sensitive data, thereby in- ing data less sensitive by applying
scale, scientific computing is charac- creasing the risk to and liability of an “anonymization” processes in which
terized by massive datasets and dis- institution for accepting responsibil- data is masked or made more gener-
tributed, international collabora- ity for hosting data. This security al. Examples of this approach remove
tions. However, when sensitive data limitation can significantly weaken distinctive elements from datasets
is used, computing options available the trust relationships involved in such as birthdates, geographical lo-
are much more limited in computing sharing data, particularly when cations, or IP network addresses.
scale and access.8 groups are large and distributed. Indeed, removing 18 specific identi-
These concerns can be partially miti- fiers from electronic health records
Current Secure gated by requiring data analysts to be satisfies the HIPAA Privacy Rule’s
Computing Environments physically present in a facility owned “Safe Harbor” provisions to provide
Today, where remote access to data is by the data provider in order to access legal de-identification. However, on
IMAGERY BY KENTOH /SHUT TERSTO CK
permitted at all, significant technical data. However, in all these cases, anal- a technical level, these techniques
and procedural constraints may be ysis is hindered for the scientific com- have repeatedly been shown to fail to
put in place, such as instituting in- munity whose abilities and tools are preserve privacy, typically by merg-
gress/egress “airlocks,” requiring optimized for working in open, collab- ing external information containing
“two-person” rules to move software orative, and distributed environ- identifiable information with quasi-
and data in or out, and requiring the ments. Further, consider the current identifiers in the dataset to re-identi-
use “remote desktop” systems. Archi- pandemic in which a requirement of fy “anonymized” records.6 Therefore,
A portion of a system leveraging a trusted execution environment in which data is stored used or generated in the computation,
encrypted on disk; a policy engine, controlled by the data owner, and running in the TEE, including even from certain “physical
contains the mapping for what data is to be made available for computing by each authen- attacks” against the computing sys-
ticated user; and an output policy, also specified by the data owner, dictates what informa-
tem. They can implement similar
tion is permitted to be returned to the user. An output policy might be based on differential
privacy, or be access-control based, or be some combination of these or other functions. functionality as software-based homo-
morphic and mutiparty computation2
approaches, but without the usability
issues and with dramatically smaller
“ Trusted Execution Environment (protected data is always performance penalties.
encrypted) and computation isolated (cannot see other processes)
The use of TEEs to protect against
Encrypted Data
Repository Compute System untrustworthy data centers is not a
novel idea, as seen by the creation of
Policy Engine
“ (contains decrypt Output Policy the Linux Foundation’s Confidential
key) Computing Consortium10 and
Google’s recent “Move to Secure the
Encrypted Data
Repository Cloud From Itself.”7 Google has com-
Input Data, Query, Permitted paring the importance of the use of
or Instructions Data Output
“ TEEs in its cloud platform to the in-
vention of email.9 However, TEEs
Encrypted Data have not yet seen broad interest and
Repository End “User” (Person or Automated Process)
adoption by scientists or scientific
computing facilities.
The envisioned approach is to le-
de-identification does not necessar- Nested Paging) in 2020. All three verage TEEs when data processing
ily address the risk and trust issues vendors take extremely different ap- environments are out of the direct
involved in data sharing because re- proaches and have extremely differ- control of the data owner, such as in
identification attacks can still result ent strengths, weaknesses, use cases, third-party (including DOE or NSF)
in significant embarrassment, if not and threat models. HPC facilities or commercial cloud
legal sanctions. In addition, the same TEEs can be used to maintain or environments, in order to prevent ex-
masking used in these processes also even increase security over traditional posure of sensitive data to other us-
removes data that is critical to the enclaves, at minimal cost to perfor- ers of those systems or even the ad-
analysis.6 Consider public health re- mance in comparison to computing ministrators of those systems. Data
search for which the last two digits over plaintext. TEEs can isolate com- providers can specify the configura-
of a ZIP code, or the two least signifi- putation, preventing even system ad- tion of the system, even if they are not
cant figures of a geographic coordi- ministrators of the machine in which directly the hosts of the computing
nate are vital to tracking viral spread. the computation is running from ob- environment, to specify access con-
serving the computation or data being trol policies, a permitted list of soft-
Confidential Scientific Computing ware or analyses that can be per-
Hardware TEEs can form the basis for formed, and output policies to
platforms that provide strong securi- Trusted execution prevent data exfiltration by the user.
ty benefits while maintaining compu- The notion of being able to leverage
tational performance (see the accom- environments can community HPC and cloud environ-
panying figure). TEEs are portions of be used to maintain ments also enables the use of data
certain modern microprocessors that from multiple providers simultane-
enforce strong separation from other or even increase ously while protecting the raw data
processes on the CPU, and some can security over from all simultaneously, each poten-
even encrypt memory and computa- tially with their own distinct policies.
tion. TEEs have roots in the concepts traditional enclaves, Researchers at the Berkeley Lab
of Trusted Platform Modules (TPMs) at a minimal cost and UC Davis have been empirically
and Secure Boot, but have evolved to evaluating Intel SGX and AMD SEV
have significantly greater functional- to performance TEEs for their performance under
ity. Common commercial TEEs today in comparison typical HPC workloads. Our results1
include ARM’s TrustZone, introduced show that AMD’s SEV generally im-
in 2013; Intel’s Secure Guard Exten- to computing poses minimal performance degra-
sions (SGX), introduced in 2015; and over plaintext. dation for single-node computation
AMD’s Secure Encrypted Virtualiza- and represents a performant solution
tion (SEV), introduced in 2016 and for scientific computing with lower
revamped several times since then ratios of communication to computa-
to include SEV-ES (Encrypted State) tion. However, Intel’s SGX is not per-
in 2017 and SEV SEV-SNP (Secure formant at all for HPC due to TEE