Security of FPGA-Accelerated Cloud Computing Environments

Editors
Jakub Szefer, Yale University, New Haven, CT, USA
Russell Tessier, University of Massachusetts Amherst, Amherst, MA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2024
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Acknowledgements
The preparation of this book was supported in part by NSF grants 1901901 and
1902532.
Chapter 1
Authentication and Confidentiality
in FPGA-Based Clouds
Semih Ince, David Espes, Julien Lallet, Guy Gogniat, and Renaud Santoro
1.1 Introduction
FPGAs have become popular as accelerators in cloud computing. They sustain high
computational loads and deliver high performance compared to general-purpose
processors [44] and graphics processing units (GPUs). FPGAs have been deployed for
applications requiring intensive computations (e.g., fully homomorphic encryption
algorithms). Cloud providers (CPs), like AWS [35] and Alibaba [2], provide FPGA-
based cloud computing services to fulfill users’ acceleration needs.
Cloud security is critical for cloud users. They expect secure remote computation
and access to FPGA accelerators with minimal impact on design performance.
Security mechanisms must be adapted to the cloud usage model. The user must
ensure that their data are kept private. They do not want to disclose sensitive
intellectual property (IP) and data to the cloud provider. To ensure privacy, the
user needs an encrypted channel with the FPGA isolated from the CP. Furthermore,
authentication between end-user and FPGA accelerator must be guaranteed. The
user must be sure to use a specific FPGA and know that another user cannot access
their device. Authentication is thus necessary to manage FPGAs and cloud service
accesses.
Alibaba F3 instances [2]. Other types of architectures and use cases also exist. IBM
and Microsoft have opted for a different type of FPGA acceleration approach, as
explained in the following subsection.
In all solutions presented above, FPGAs are managed by the CP. The latter allocates
resources and establishes security. Figure 1.3 shows the high-level architecture and
associated mechanisms in current cloud solutions.
Upon receiving a user request, the CP creates a work environment and allocates
resources. The user has access to resources through a virtualized environment. The
CP can track FPGA usage and check for security issues. To achieve this, AWS and
Alibaba include management functions inside the FPGA [2, 35]. Due to a lack
of transparency, the CP's privileges within the FPGA are unknown. Since the CP can
communicate with the FPGA, it can breach the user's privacy and access
resources allocated to a user.
Figure 1.4 shows the mechanisms used in AWS F1 instances. The virtual machine
is set up with an Amazon Machine Image provided by Amazon according to a
specification selected by the user [35]. In order to program the FPGA, the user must
load an Amazon FPGA Image (AFI) generated by AWS. The user must disclose
their IP to program the FPGA. Before generating the AFI file, AWS checks the user
IP for malicious design patterns such as power wasters and side-channel analyzers
based on ring oscillators, short circuits, and long wires [26]. Under this scheme,
the user and the cloud provider (CP) distrust each other. The CP protects devices
against damage. Although the user desires IP protection and confidentiality, the user
IP is never authenticated after the AFI file (i.e., verified user bitstream) is produced.
The user has no proof that their IP in the AFI file is unmodified. Recent work has
shown that bitstream manipulation can introduce hardware Trojans
into designs [8]. The user also has no proof that their IP is indeed kept secret.
In the current implementation, the user never authenticates the FPGA. The user
accesses a virtual machine which can access an FPGA. The lack of authentication
in this scenario is a threat, as man-in-the-middle (MitM) attacks and FPGA
impersonation are possible. In a MitM attack, the attacker is placed between two
communicating entities, as shown in Fig. 1.5. During this attack, the virtual machine
sees the attacker as the FPGA and the FPGA sees the attacker as the virtual machine.
In this situation the attacker can intercept all communicated data, including the
user's IP.
In current solutions, the user IP confidentiality is lacking. To use a custom FPGA
accelerator, the client must send a design file to the CP [31, 35]. The CP must verify
custom designs to protect the FPGA from damage. To address confidentiality and
isolation issues, a trusted authority (TA) can be involved.
The trusted authority is an entity trusted by the CP and the cloud user. In an FPGA-
based cloud, the TA will be responsible for security and other mechanisms involving
user confidentiality. Figure 1.6 shows a high-level FPGA-based cloud architecture
involving a TA.
The CP must work with the TA to allocate resources. With this architecture,
FPGA security is established by the TA. Mechanisms like encryption, keys,
and certificates can be managed by the TA independently from the CP. These
mechanisms are described in more detail in Sect. 1.3.1.2. The user is isolated from
the CP thanks to the TA. Moreover, the TA can verify custom user designs instead
of the CP. Thus, the client can protect their IP from the CP. The link between the TA
and the user is described in Sect. 1.5. Under this architecture, the CP still manages
devices. For example, the CP can still allocate and revoke access to hardware and
manage resources; however, the CP does not have full knowledge of the
client's IP. Prior approaches [13–15] have used a TA in their solutions. Currently, no
cloud provider uses a TA with a cloud-based FPGA.
Each architecture presented thus far represents a specific use case. Hence, their
privacy measures and security assumptions differ. Under the FaaS model, users are
provided full FPGA access and reconfigure the FPGA with their own designs. In
other models, users benefit from FPGA acceleration through a web browser without
any knowledge of FPGA technology. In the FaaS model, in which users have access
to a cloud provider's FPGA, security requirements are high because users receive
access to expensive hardware. Users can implement custom accelerators, but IP
verification is mandatory. This causes confidentiality concerns for the client. The
involvement of a TA in FPGA-based clouds can be a solution to these security and
confidentiality issues. The following section focuses on authentication in FPGA-
based cloud
computing. Authentication is one of the most important security features for FPGA-
based cloud computing. It allows for the identification of acting entities (i.e., users,
cloud providers), targeted devices, and custom user designs. In an FPGA-based
cloud context, authenticating users and devices allows the cloud provider to set up
authorization and access control. For the user, authentication is proof that the work
environment (e.g., the allocated resources) is genuine and identified.
Authentication mechanisms often involve digital signatures and hash functions. The
latter is a one-way function h : X → Y where |X| = n, |Y| = m, and n > m. It is
relatively easy to compute the output for a given input, but computing an input for
a given output is extremely difficult. SHA (Secure Hash Algorithm)
is a popular hash function. A hash algorithm guarantees the integrity of data, but it
does not authenticate the sender/receiver [36]. There are no shared secrets in hash
functions to achieve authentication. This means that an attacker can still construct a
message with a correct hash and send it. The receiver will see this message as valid.
An attacker only needs to know which hash function is used. Since |X| > |Y|, hash
collisions necessarily exist. Collision attacks [38] allow an attacker
to find a different input producing the same hash output using a brute-force search.
There are different ways to authenticate an entity. Technical details and features of
different authentication techniques are detailed below.
A Message Authentication Code (MAC) algorithm is one possibility for sender
authentication. It uses a shared secret to create message digests, so only parties
holding the secret can build authentic and valid messages. To authenticate the
sender and verify message integrity, HMAC (keyed-Hash Message Authentication
Code) functions are well suited. HMAC combines a MAC construction with a hash
function. HMAC and MAC algorithms serve the same purpose, but an HMAC has
stronger security properties than an ad hoc MAC.
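The difference can be illustrated with Python's standard hashlib and hmac modules (a minimal sketch; the key and messages are illustrative): a plain hash can be recomputed by anyone, while an HMAC tag cannot be forged without the shared secret.

```python
import hashlib
import hmac

message = b"load bitstream accel.bit"

# Plain hash: anyone can recompute it, so it proves integrity only.
digest = hashlib.sha256(message).hexdigest()
forged = b"load bitstream trojan.bit"
forged_digest = hashlib.sha256(forged).hexdigest()  # attacker needs no secret

# HMAC: producing a valid tag requires the shared secret key.
key = b"shared-secret-key"  # illustrative; use a securely exchanged key
tag = hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    expected = hmac.new(key, message, hashlib.sha256).digest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, tag)

assert verify(key, message, tag)
# Without the key, an attacker cannot produce a valid tag for a forged message.
assert not verify(key, forged, tag)
```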
Authenticated encryption (AE) is a way to authenticate both the data and the sender.
AE uses a block cipher mode that provides encryption, integrity, and data
authentication. Common AE algorithms are AES-CCM and AES-GCM [1]. AE can
Certificates are another way to authenticate a user. Information like public keys,
name, and organization is present in the certificate. Often, certificates work under a
Public Key Infrastructure (PKI) scheme. Under this scheme, a Certificate Authority
(CA) signs user certificates, binding each user's identity to their public key, as
shown in Fig. 1.7a. CAs are trust anchors, and they sign certificates to indicate
that the user's identity has been verified. Generally, there are other entities like
the Registration Authority (RA) in
PKI. There are various architectures and entities for PKI systems [9, 37]. Figure 1.7b
shows a single-rooted hierarchical PKI structure. A root CA (i.e., trust anchor)
certifies multiple CAs. A certificate signed by a CA can be verified if it can be
traced back to a trust anchor (i.e., root CA). This scheme is widely used by web
browsers. Holding and updating a list of root CAs is enough to verify a certificate.
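The chain-walking check can be sketched abstractly (a toy model: certificates are plain dicts, the names are invented, and signature validity is assumed rather than checked; real verification uses asymmetric signatures):

```python
# Toy model of hierarchical PKI chain validation: a certificate is
# acceptable if following issuer links reaches a trusted root CA.
# Signature checks are elided; only the trust-chain logic is shown.
certs = {
    "root-ca": {"issuer": "root-ca"},   # self-signed trust anchor
    "sub-ca":  {"issuer": "root-ca"},
    "fpga-17": {"issuer": "sub-ca"},
    "rogue":   {"issuer": "unknown-ca"},
}
trusted_roots = {"root-ca"}

def chains_to_anchor(name: str, max_depth: int = 8) -> bool:
    """Walk issuer links until a trust anchor is reached (or give up)."""
    for _ in range(max_depth):
        if name in trusted_roots:
            return True
        issuer = certs.get(name, {}).get("issuer")
        if issuer is None or issuer == name:
            return name in trusted_roots
        name = issuer
    return False

assert chains_to_anchor("fpga-17")    # fpga-17 -> sub-ca -> root-ca
assert not chains_to_anchor("rogue")  # issuer is not in the store
```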
Nowadays, PKI is everywhere. The most notable system using PKI is TLS. It
allows authentication and secure communication to be achieved over HTTP [39].
During the handshake, client and server can exchange their respective certificates
for verification; this is mutual TLS authentication, an optional mechanism.
The certificate is signed with the CA's private key and verified with the CA's
public key. To check a certificate's revocation status, a user can contact the CA:
if the certificate is valid (i.e., not revoked), the CA returns a positive response.
Each entity is identified
and associated with a certificate. X.509 certificates are popular in TLS [9]. They
include information like issuer name, subject name, public key information, validity
period, etc. For each future communication, the certificates can be verified to achieve
authentication. Upon successful verification under TLS 1.3, both entities proceed to
communicate by creating a shared secret with DHE (Diffie–Hellman Ephemeral) or
ECDHE (Elliptic Curve Diffie–Hellman Ephemeral). In TLS 1.2, the most common
method remains RSA (Rivest–Shamir–Adleman).
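In Python's standard ssl module, mutual TLS is enabled by requiring client certificates on the server-side context (a configuration sketch only; the certificate file paths are hypothetical placeholders, so the loading calls are shown commented out and the fragment stands alone):

```python
import ssl

# Server-side context for mutual TLS: the server presents its own
# certificate and additionally demands a certificate from the client.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.minimum_version = ssl.TLSVersion.TLSv1_2

# Hypothetical file paths; in a real deployment these must exist:
# context.load_cert_chain("server-cert.pem", "server-key.pem")
# context.load_verify_locations("trusted-ca.pem")  # CAs that signed client certs

# CERT_REQUIRED turns one-way TLS into mutual TLS authentication:
# the handshake fails unless the client presents a verifiable certificate.
context.verify_mode = ssl.CERT_REQUIRED

assert context.verify_mode == ssl.CERT_REQUIRED
```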
For cloud FPGAs, the expression “Trusted Authority” (TA) is more common
than CA. The role of the TA is similar to a CA. Previous approaches [13, 15] take
advantage of a TA to generate FPGA certificates for authentication and encryption.
Moreover, they reinforce the isolation between the user and the CP to offer better
confidentiality. Security-critical functions like authentication and certification are
Fig. 1.7 Public key infrastructure mechanisms. (a) Basic PKI. (b) Hierarchy PKI
HTTP resource requests [19]. As shown in Fig. 1.8, OAuth 2.0 is a protocol usually
involving four entities: a user, a resource owner (RO), an authorization server (AS),
and a resource server (RS). The user negotiates an agreement with the RO to
access their resources. The RO specifies the scope (i.e., rules and limitations) of
the resource sharing. If the user receives authorization from the RO, the user can
ask the AS to generate an access token. The user can use the access token with the
RS to gain access to the resources shared with them. This scheme is lightweight and
practical when a user needs to access tools quickly without successive log-in. OAuth
2.0 is much more focused on authorization, whereas SAML is more focused on
authentication. Authentication does exist in OAuth 2.0, although user authentication
is out of the standard’s scope and is left to the developer. In Sect. 1.5, an adaptation
of OAuth 2.0 for confidential cloud FPGA sharing is detailed [21].
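The code-for-token exchange at the heart of standard OAuth 2.0 can be sketched as the construction of the token request (field names follow RFC 6749; the code, URI, and client identifier values are illustrative and no request is actually sent):

```python
from urllib.parse import urlencode

# Parameters of an OAuth 2.0 authorization-code token request
# (RFC 6749, Sect. 4.1.3). The authorization server checks that
# redirect_uri matches the one used when the code was issued.
token_request = {
    "grant_type": "authorization_code",
    "code": "SplxlOBeZQQYbYS6WxSbIA",            # illustrative code value
    "redirect_uri": "https://client.example/cb",  # must match earlier request
    "client_id": "fpga-user-42",                  # illustrative client id
}
body = urlencode(token_request)  # form-encoded request body
assert "grant_type=authorization_code" in body
```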
Current commercial cloud providers like Amazon, Alibaba, and Huawei offer FPGA
access through virtual machines. Figure 1.9 shows the most common way to access
an FPGA deployed in the cloud. Users are authenticated twice when accessing
an FPGA. During step 1, the customer authenticates themselves using their AWS
account to request an FPGA instance. During step 2, the CP provides VM access
that meets the user's hardware requirements. During step 3, the customer logs into
the VM and obtains FPGA access in step 4.
be resilient to brute-force attacks. The PUF response must be long enough to make
this type of attack difficult. Depending on the design, a PUF may produce only
one or two stable random bits per challenge. The method used to expand the
PUF response, and the authentication protocol itself, must not create new possible
attacks. For example, an attacker should not be able to send challenges to a PUF
design. There should be a mutual authentication between the PUF design and the
PUF user; otherwise the PUF can be used by an attacker and all challenge–response
pairs can be documented. Mutual authentication should not be solely based on a
shared secret. The lack of a stored secret is the biggest advantage of a PUF. If mutual
authentication is based on a shared secret, then PUF usage becomes irrelevant. To
protect the PUF design, the FPGA should authenticate the user. Upon successful
user authentication, the FPGA should allow PUF usage. However, several attacks
have been realized on PUF architectures [10, 11, 40]. PUFs must be used within a
robust framework to mitigate vulnerabilities documented in the literature.
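The challenge–response idea can be sketched with a simulated PUF (an assumption-laden toy: a real PUF response comes from silicon variation, so the keyed function below merely stands in for device-unique behavior, and the enrollment database holds hashed responses so raw responses never leave the device):

```python
import hashlib
import hmac
import secrets

DEVICE_SECRET = secrets.token_bytes(16)  # stands in for silicon variation

def simulated_puf(challenge: bytes) -> bytes:
    # A real PUF derives this response from physical device properties;
    # HMAC is used here only to get a deterministic device-unique function.
    return hmac.new(DEVICE_SECRET, challenge, hashlib.sha256).digest()

# Enrollment: the TA characterizes the PUF and stores hashed
# challenge-response pairs (CRPs).
crp_db = {}
for _ in range(4):
    c = secrets.token_bytes(8)
    crp_db[c] = hashlib.sha256(simulated_puf(c)).digest()

# Authentication: the verifier replays a stored challenge exactly once,
# so an eavesdropper cannot reuse an observed response.
challenge, expected_hash = crp_db.popitem()  # each CRP is used only once
response = simulated_puf(challenge)          # computed on the FPGA
assert hashlib.sha256(response).digest() == expected_hash
```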
Other FPGA authentication techniques are also possible. Previous works [13,
15] use certificates and PKI. Public key cryptography and certificates are used to
communicate outside the FPGA and take advantage of the trusted authority that
is present in the scheme. The following subsection gives further details on these
mechanisms.
A TA is often used to create isolation between the user and the cloud provider to
reinforce confidentiality and privacy. Reference [15] describes an FPGA enclave
for cloud computing with the use of a TA. The aim is to reinforce user IP
security, authentication, and privacy in public FPGA clouds while supporting multi-
tenancy. FPGA multi-tenancy is the spatial sharing of one device among multiple
users. Each user has a well-defined set of FPGA resources called a partially
reconfigurable region (PRR). Using this approach, FPGA resource use is maximized
and computing resources are not left unused. FPGAs have been added to a PKI
architecture by using certificates. An on-board key hierarchy is used with a device
unique key (DUK) as a root. In order to establish security, the DUK is derived
multiple times for different purposes (e.g., bitstream encryption, enclave specific
keys). The solution proposed in [15] for FPGA authentication is an SGX-inspired
attestation mechanism endorsed by the TA, which offers services like bitstream
certification and boot code authentication. To provide a secure FPGA environment,
security-critical components like the device unique key and bitstream loader are
controlled by the TA. The TA controls the root of the key hierarchy and can
update/revoke keys depending on security threats. For enclave communication,
the authors have implemented public and private keys and a certificate. Keys are
used to create a secure channel between an enclave and a user, and certificates are
implemented to authenticate the enclave and the running design.
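Such an on-board key hierarchy can be sketched with an HMAC-based derivation (a sketch only: the labels and the use of HMAC-SHA-256 as the derivation function are assumptions for illustration, not the cited design):

```python
import hashlib
import hmac

# Device unique key (DUK): root of the on-board key hierarchy. A fixed
# illustrative value here; on a real device it never leaves the hardware.
duk = bytes.fromhex("000102030405060708090a0b0c0d0e0f")

def derive(parent: bytes, label: bytes) -> bytes:
    # One-way derivation: a child key reveals nothing about its parent,
    # and rotating a purpose only requires deriving under a new label.
    return hmac.new(parent, label, hashlib.sha256).digest()

bitstream_key = derive(duk, b"bitstream-encryption")
enclave_root  = derive(duk, b"enclave-keys")
enclave_key_1 = derive(enclave_root, b"enclave-1")  # per-enclave key

# Distinct purposes yield independent keys from the same root.
assert len({bitstream_key, enclave_root, enclave_key_1}) == 3
```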
In references [15] and [13], a solution to protect data and bitstreams from the
CP is proposed. A TA installs encryption keys inside the FPGA device. In order
to secure their bitstreams, users can send their designs to the TA. The TA encrypts
the user designs with an encryption key associated with the FPGA. As a result, the
user can securely send and receive data in an encrypted communication channel with
the FPGA. The user is never authenticated, and the FPGA is indirectly authenticated
with the encryption keys installed by the TA. If an attacker successfully retrieves the
keys stored inside the FPGA, the content of the device is compromised. This creates
other attack vectors on key renewal and distribution mechanisms. Multi-tenancy is
not supported with this method.
In reference [14], a secure FPGA enclave is proposed. In this solution, a user
requests access to an FPGA from the CP and obtains the FPGA serial number
in return. The user sends it to the TA and proceeds to FPGA authentication. A
scheme to protect the privacy and the integrity of user data and IP in public FPGA
clouds is proposed. This solution involves a TA (i.e., FPGA vendor) who has a non-
reconfigurable design in the FPGA. The design includes AES and SHA accelerators.
The TA’s design also includes a PUF, characterized by the TA. It is used for FPGA
authentication purposes. A secure channel is created for the user based on modular
exponentiation and a Diffie–Hellman Ephemeral algorithm. To authenticate the
FPGA, the user compares the PUF output hash with the expected hash response
known by the TA. If the response is valid, the user now shares a secret with
the FPGA and the session key is established. This key is used as a symmetric
encryption key between the FPGA and the user. Thus, user bitstreams and data will
be protected against the CP and bitstream integrity will be achieved. Despite that,
user authentication is missing in this scheme. A malicious user can impersonate a
user and get access to the provided FPGA before the user. The user obtains access
to the FPGA after the session key is set up, but the user is not authenticated. Thus,
the session key is security critical and must be kept secret. The security of remote
access and computing lies with the session key.
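The session-key establishment by modular exponentiation can be sketched as a classic finite-field Diffie–Hellman exchange (toy parameters: the prime 2^127 − 1 is far too small for real security and is chosen only to keep the sketch readable):

```python
import hashlib
import secrets

# Toy public parameters (never use such a small prime in practice).
p = 2**127 - 1  # a Mersenne prime, for illustration only
g = 5

# Each side picks an ephemeral private exponent (the "E" in DHE).
a = secrets.randbelow(p - 2) + 1  # user's ephemeral secret
b = secrets.randbelow(p - 2) + 1  # FPGA's ephemeral secret

A = pow(g, a, p)  # public value sent by the user
B = pow(g, b, p)  # public value sent by the FPGA

# Both sides arrive at the same shared secret via modular exponentiation.
user_secret = pow(B, a, p)
fpga_secret = pow(A, b, p)
assert user_secret == fpga_secret

# The session key is typically derived by hashing the shared secret.
session_key = hashlib.sha256(user_secret.to_bytes(16, "big")).digest()
```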
Using an alternative approach [48], users are able to leverage CPUs and a TEE
to achieve security for an FPGA-based commercial cloud. IP confidentiality is
provided for the user. The CP secures FPGA devices and the cloud infrastructure
with the use of a TA (i.e., FPGA vendor). The CP verifies the user IP inside a
TEE. The TA is responsible for the FPGA shell and various attestations (e.g., FPGA
shell, application). The user must prove their identity to the FPGA to prevent user
impersonations. Then the user can authenticate the active FPGA shell. In this way
the user knows that the security elements of the device are not compromised. Lastly,
the user’s bitstream is authenticated by the FPGA and its integrity is verified. This
level of security is not only optimal but also mandatory when it comes to FPGA
cloud computing. Both the user and the cloud provider have expensive technology
at stake. Taking advantage of a TEE inside the processing system of the FPGA for
design verification has great advantages for confidentiality and security, although
this approach can create significant latency. In a multi-tenant context, the user’s
design verification using the TEE can be a serious bottleneck.
Reference [51] proposes an FPGA resource pooling system. FPGA acceleration
resources in a cloud environment are presented at a high level of abstraction.
Users can deploy their own bitstreams or use available designs. Their approach is
heavily focused on FPGA cloud efficiency and performance. This system is based
on multiple compute nodes managed by control nodes. The user can accelerate an
application by making API calls from a virtual machine. Performance improvements
are efficient, but there are a few potential security threats with the proposed scheme.
If the user deploys their own design, it is fully disclosed to the CP. The user has
no confidentiality, an unacceptable situation. A TA can be a solution to this
issue. Also, no design integrity mechanism is provided to ensure that an unmodified
user IP is delivered. Lastly, no mechanisms are provided to protect the FPGA device
from the user design. Authentication is also not addressed since the user never
authenticates the FPGA device. As a consequence, the end user cannot be sure that
the associated FPGA is the legitimate target or whether it has been replaced by a
malicious device. This issue can lead to FPGA impersonation and cause data breaches
and malicious behaviors. The lack of user authentication by the FPGA reinforces
these security flaws. The FPGA, or a reconfigurable region inside the device, lacks
isolation. A user may be able to access a device allocated to another user if no user-
FPGA authentication is performed.
Authentication can happen at many levels in FPGA-based cloud computing.
Users and the CP can be authenticated to reinforce security. FPGAs and bitstreams
can also be authenticated to reinforce IP and device security. To achieve higher
levels of security, multiple authentication techniques can be used. HMAC and
AE algorithms allow two communicating users to authenticate each other. Other
mechanisms like certificates and PKI allow users to authenticate each other using a
TA.
Our authentication and access delegation framework for FPGA-enabled cloud
computing [21] (described in detail in Sect. 1.5) addresses these issues. The
approach is based on OAuth 2.0 and is adapted for cloud usage with FPGA
devices. It provides an authentication solution for four entities simultaneously
(FPGA, user, TA, and CP) and achieves isolation between the client and the CP. In
this framework the TA is involved in the access delegation process and the user-
FPGA authentication process. The framework offers a single sign-on feature to
prevent repeated authentication processes. Access control is enforced by the TA
and managed by the CP. This subject is detailed in Sects. 1.5.5 and 1.5.6. Only our
work and [15] support multi-tenancy.
As described earlier in this section, previous work [13–15] has proposed FPGA
authentication and mechanisms for isolation from the CP. The latter two
approaches use modular exponentiation [14] and RSA [13] algorithms to authenti-
cate a user. Despite introducing the FPGA into PKI, reference [15] does not mention
user authentication. None of the three works proposes an access delegation protocol
involving a TA. Reference [14] only uses a TA for FPGA authentication in the
proposed framework. Access control mechanisms are only described in reference
[15]. Public keys are used to authenticate a secure enclave inside the FPGA. These
keys can be used for access control although no mechanism is explicitly mentioned.
Table 1.1 compares our solution with previous approaches.
Table 1.1 Comparison with prior work. ×∗: the Trusted Authority (TA) only gives information to
the client for FPGA authentication

Achievements          [15]  [14]  [13]  Our proposal
FPGA authentication   ✓     ✓     ✓     ✓
User authentication   ×     ✓     ✓     ✓
Multi-tenancy         ✓     ×     ×     ✓
User-CP isolation     ✓     ✓     ✓     ✓
Access delegation     ×     ×∗    ×     ✓
Table 1.2 Resource utilization of three FPGA-based neural network accelerators on Virtex
UltraScale+ VU9P FPGAs
Related work LUT FF DSP BRAM
Xiao et al. [45] 131,042 (11%) 113,581 (5.0%) 242 (3.5%) 4 (5.0%)
Tsai et al. [43] 38,899 (3.0%) 40,534 (1.7%) 9 (0.1%) 3 (4.0%)
Zhou et al. [50] 80,175 (7%) 46,140 (2.0%) 83 (1.2%) 0 (0%)
1.4.1 Multi-tenancy
FPGA cloud services aim to be efficient in resource usage and energy consumption.
Currently each cloud FPGA is allocated to one customer at a time. Often, high-
end FPGA devices are deployed for acceleration. For example, AWS offers Xilinx
Virtex UltraScale+ VU9P FPGAs. This device has approximately 1.2 million look-
up tables (LUTs) and 2.6 million flip-flops (FFs). As shown in Table 1.2, several
neural network accelerators [43, 45, 50] have been implemented using this platform.
One implementation [45] used 131,042 LUTs and 113,581 FFs, which correspond
to 11% of the LUTs and 5% of the FFs in the AWS FPGA. To the best of our
knowledge, no commercial cloud service provider supports FPGA multi-tenancy
[24, 27, 30].
Multi-tenancy introduces new authentication and access control challenges.
Each tenant is assigned to a dedicated PRR. Consequently, each PRR must be
authenticated. Each user’s rights must be enforced by strong access control schemes.
With multi-tenancy, sensitive components like accelerators, memory, and CPUs can
be shared. These hardware components must not leak information and must be
shared securely among users. Strong access control policies must be set individually
for each user and each shared hardware resource. Policies must be set by the
resource owner or by a trusted authority if the resource sharing scheme includes
one. To satisfy these policies, users must be authenticated inside the FPGA. The
FPGA must be aware of users' design locations and the requests coming from them.
A user PRR also needs authentication to mitigate impersonation and unauthorized
requests. The FPGA must check the source of the request and whether access control
policies allow the user to make such a request.
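The per-user, per-resource policy check can be sketched as a small reference monitor (the tenants, resources, and policy entries are hypothetical; a real deployment would enforce this inside the FPGA shell):

```python
# Toy reference monitor for a multi-tenant FPGA: each (tenant, resource)
# pair maps to the set of operations the policy allows. Entries are
# illustrative only.
policy = {
    ("tenant-a", "prr0"):  {"configure", "read", "write"},
    ("tenant-a", "dram0"): {"read", "write"},
    ("tenant-b", "prr1"):  {"configure", "read", "write"},
    ("tenant-b", "dram0"): {"read"},  # shared memory, read-only for tenant-b
}

def authorize(tenant: str, resource: str, op: str) -> bool:
    # Default-deny: anything not explicitly granted is refused.
    return op in policy.get((tenant, resource), set())

assert authorize("tenant-a", "prr0", "configure")
assert not authorize("tenant-b", "prr0", "read")    # another tenant's PRR
assert not authorize("tenant-b", "dram0", "write")  # shared region, read-only
```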
Remote attacks on multi-tenant FPGAs are also a significant concern. The power
distribution network (PDN) of the FPGA is a possible attack vector. Because the
PDN is shared among all tenants in an FPGA, a malicious user can manipulate it to
set up a side channel. PDN attacks exploit the
dependency between switching activity and power consumption [17, 25, 42, 49].
With a voltage sensor based on delay lines, it is possible to track voltage drops at
a nanosecond scale. In fact, the delay of a signal depends on the supply voltage.
The delay line itself is composed of buffers and latches. By collecting power traces
and using correlation power analysis (CPA), it is possible to recover data from the
PDN. For example, CPA can correlate power measurements against a power model
to guess an encryption key byte. Previous work [17] (which uses AWS FPGA cloud)
and [42] have described attacks on an AES core and retrieved encryption keys. The
PDN can also be used as a covert channel. Hence, a user can bypass access control
mechanisms and communicate with other FPGA designs or system hardware (e.g.,
CPU). Since each program has its own power consumption pattern, it is possible to
identify the programs of other tenants by monitoring PDN voltage drops.
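The correlation step of CPA can be illustrated with a toy simulation (assumptions: a Hamming-weight leakage model on pt XOR key with Gaussian noise and no S-box; a real CPA would target, e.g., the AES S-box output, but the correlation logic is the same):

```python
import random

random.seed(1)
TRUE_KEY = 0x5A  # the key byte the attacker wants to recover

def hw(x: int) -> int:
    """Hamming weight: number of set bits, a common power model."""
    return bin(x).count("1")

# Simulated power traces: leakage = HW(pt XOR key) + measurement noise.
plaintexts = [random.randrange(256) for _ in range(500)]
traces = [hw(pt ^ TRUE_KEY) + random.gauss(0, 0.5) for pt in plaintexts]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def recover_key(plaintexts, traces):
    # CPA: the key guess whose predicted leakage best correlates with the
    # measured traces is taken as the recovered key byte. Signed (not
    # absolute) correlation is used here so the complement key, which
    # correlates strongly negatively under this toy model, is not chosen.
    best_guess, best_corr = None, -1.0
    for guess in range(256):
        model = [hw(pt ^ guess) for pt in plaintexts]
        corr = pearson(model, traces)
        if corr > best_corr:
            best_guess, best_corr = guess, corr
    return best_guess

assert recover_key(plaintexts, traces) == TRUE_KEY
```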
PDN attack mitigation and protection mechanisms exist. Unfortunately, pro-
tecting the FPGA and user designs against attacks requires significant overhead.
For example, a CPA-resistant AES core based on a power equalizer [33] was
implemented and tested. The equalizer's power overhead is between 8% and 23%
depending on the mode, and the solution adds area equivalent to roughly 40% of a
128-bit AES core. This approach only protects the AES core; other side channels
can still be exploited and the PDN remains vulnerable.
The protocol seen in Fig. 1.8 must be adapted for the cloud use case for several
reasons. In the original OAuth setting, the resource owner (website user) uses the
client's website (e.g., a blog) and is willing to share information. The client then
asks for authorization for resource access (user data) and starts the access delegation
procedure. This scheme is not valid in an FPGA-accelerated cloud use case, as the
client uses the resource owner's (RO's) website to gain access to hardware owned
by the RO.
In this situation, the cloud provider is considered to be the resource owner, the
trusted authority is the authorization server, and the FPGA is a part of the resource
server. The client requests a hardware resource with their account information. This
is the first step of authentication. Then, resources are allocated automatically with
the CP’s virtualization tools (e.g., orchestrator/scheduler).
This framework is based on a TA to create user-CP isolation by using an access
delegation protocol. Moreover, the TA executes security-sensitive operations like
Fig. 1.10 The CP introduces the client to the TA, which generates and manages the authorization
code. The client authenticates themselves with the TA and obtains their authorization code
To request a resource from the CP, the client first sends a message to the CP’s user-
agent, as shown in step 1 of Fig. 1.10. In this request message, the client includes
their identifiers, certificate (or produces it online as stated below), and a redirection
unique resource identifier (URI). This information is sent to the TA in the next step.
The URI is used by the TA to send back redirected messages to the client via the
CP’s user-agent.
There are two different scenarios for client certification: the client can generate
their certificate online (within the protocol) or offline (outside it). Online certificate
generation proceeds as follows. Upon receiving the client's resource request, the CP
authenticates itself to the TA with its certificate, requests a certificate for the client,
and shares it with the client, as shown in step 2 in Fig. 1.10. The client certificate
is created from the CP's website. The client must interact with the web browser
to create the certificate and add randomness to the generated keys. A similar
mechanism is seen in Microsoft Azure's key-pair generation for SSH channel
protection.
It is also possible to create the client certificate offline, outside this protocol. The
client then has the responsibility of generating a key-pair themselves. In this case,
the client makes the resource request to the CP with their existing certificate, the
certificate generation step is skipped, and as a result the protocol is faster.
During the second step in Fig. 1.10, after the CP’s authentication, the client’s request
is accepted or declined. If the request is accepted, the TA generates the authorization
code. By using the previously provided URI, the TA redirects the CP’s user-agent
back to the client in order to authenticate them directly. By performing this action,
the CP has authenticated and introduced the client to the TA.
At this time, the TA knows the client’s certificate and the authorization code
associated with the client’s identifiers. To obtain their authorization code, the
client needs to authenticate with the TA for the first time. A certificate-based
TLS authentication is performed [39]. If the client’s credentials are valid, the
authentication is successful and the TA sends an HTTP redirection code to the
CP’s user-agent (HTTP code 302) alongside the client redirection URI. The client
receives the authorization code from the CP’s user-agent in the last step of Fig. 1.10.
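The code-issuance step above can be sketched as a small TA-side bookkeeping routine. This is a minimal illustration, not the chapter's implementation; the function and variable names are assumptions.

```python
import secrets

# Hypothetical TA-side record of issued authorization codes (step 2 of Fig. 1.10).
issued_codes = {}  # authorization code -> binding record

def issue_authorization_code(client_id: str, redirect_uri: str) -> str:
    """Bind a fresh one-time code to the client's identifiers and the
    redirection URI supplied with the resource request."""
    code = secrets.token_urlsafe(32)
    issued_codes[code] = {"client_id": client_id, "redirect_uri": redirect_uri}
    return code

code = issue_authorization_code("client-42", "https://cp.example/callback")
assert issued_codes[code]["client_id"] == "client-42"
```

The binding stored here is what later allows the TA to check that the code is redeemed by the same client, via the same redirection URI.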
In this protocol, the authorization code cannot be used as an attack vector. In fact,
the authorization code is associated with the client’s credentials and the URI. It is
not a secret code because the CP’s user-agent shares the authorization code through
HTTP redirection. The authorization code can be found in the user-agent history. In
case of an authorization code redirection attack, which aims to get backdoor access
to the client’s resource, a simple redirection URI check from the TA is sufficient. The
URI used when requesting the authorization code must match the URI used for the
access token generation, as explained in Sect. 1.5.3. Hence, a malicious client cannot
gain access to resources attributed to another client by intercepting the authorization
code.
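The redirection-URI check described above can be sketched as follows; the names are illustrative assumptions, not part of the protocol specification.

```python
# Hypothetical TA-side check at token exchange: the redirection URI presented
# now must match the one bound when the authorization code was issued.
def redeem_code(issued: dict, code: str, client_id: str, redirect_uri: str) -> bool:
    record = issued.pop(code, None)  # authorization codes are single-use
    return (record is not None
            and record["client_id"] == client_id
            and record["redirect_uri"] == redirect_uri)

issued = {"c0de": {"client_id": "c1", "redirect_uri": "https://cp.example/cb"}}
# An intercepted code presented with an attacker-controlled URI is rejected:
assert not redeem_code(dict(issued), "c0de", "c1", "https://evil.example/cb")
assert redeem_code(dict(issued), "c0de", "c1", "https://cp.example/cb")
```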
The role of this code is to ensure that the CP is authenticated and cannot be
impersonated. By using the CP’s user-agent to redirect the code, it can be confirmed
that the CP which authenticated themself and gave authorizations for resource
allocation to the TA is the same entity that communicates with the client in the
protocol.
After receiving the redirected authorization code, the client needs to authenticate
themself again with the TA using their certificate, their authorization code, and
redirection URI to request an access token from the TA. A certificate-based TLS
authentication is performed one last time; this second authentication confirms the
client's identity.
The client needs to submit their authorization code to request the access token.
If the client’s credentials are registered and associated with the used authorization
code, the TA will be able to authenticate the FPGA device and generate the access
token using a shared secret with the FPGA, as shown in Fig. 1.11. Access tokens
have scopes and durations of access. They are managed by the RO and endorsed
by the TA [19]. These options are requested by the client during the first step shown
in Fig. 1.10. Then, the RO accepts or declines the requested scopes and duration and
notifies the TA (i.e., the authorization server) in step 2. The client can decline an
issued token if the granted scope does not match their request.
Fig. 1.11 The TA generates the access token for the client
Access tokens must be kept confidential. They are only shared using TLS and
stored in an encrypted form using the FPGA public key. If an access token leaks,
the secure access to the allocated FPGA will be compromised.
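One way a TA could bind scope and validity to a token using the TA–FPGA shared secret is sketched below. All function names are assumptions; the chapter's actual scheme additionally encrypts the stored token with the FPGA public key, which this integrity-only sketch omits.

```python
import base64, hashlib, hmac, json

def mint_token(shared_secret: bytes, scope: str, duration_s: int, now: int) -> str:
    """Token body carrying scope and expiry, MACed with the TA-FPGA shared
    secret so the FPGA can verify it without contacting the TA."""
    body = json.dumps({"scope": scope, "exp": now + duration_s}, sort_keys=True).encode()
    tag = hmac.new(shared_secret, body, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(body).decode() + "." + tag

def verify_token(shared_secret: bytes, token: str, now: int):
    body_b64, tag = token.rsplit(".", 1)
    body = base64.urlsafe_b64decode(body_b64)
    expected = hmac.new(shared_secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        return None                                   # forged or corrupted token
    claims = json.loads(body)
    return claims if claims["exp"] > now else None    # expired tokens rejected

secret = b"ta-fpga-shared-secret"
tok = mint_token(secret, "prr-1:rw", 3600, now=1_700_000_000)
assert verify_token(secret, tok, now=1_700_000_100)["scope"] == "prr-1:rw"
assert verify_token(b"wrong-secret", tok, now=1_700_000_100) is None
```

The constant-time comparison (`hmac.compare_digest`) matters here: a naive string comparison of the MAC tag could leak timing information.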
The access token’s content may also be extended in order to contain information
such as FPGA serial number, partially reconfigurable region identifier, and so on.
This feature gives flexibility for the implementation phase as additional mechanisms
can be developed.
Using the described protocol, the client is strongly authenticated through their
certificate and their credentials with the CP and the TA. The authentication between
the latter two entities is less critical than the client authentication because they are
trust anchors in this protocol. Due to the shared secret and the TLS session between the
client and the FPGA, a secure and tokenized confidential remote access can be set up
for the client. The CP is isolated from the client’s computation but can still manage
access scopes and duration.
After the token issuance, the client contacts the FPGA to obtain access to resources.
A TLS session is set up for secure communication with perfect forward secrecy
between the FPGA and the client [39]. The client and the FPGA create their
shared secret with algorithms like DHE and ECDHE [39] and then use symmetric
encryption algorithms like AES-256-GCM. Once the TLS connection is established,
the client sends their token to be authenticated. The FPGA parses the token and
evaluates if the resources can be granted. Further communications between the client
and the FPGA will be encrypted. The user privacy is greatly enhanced and isolation
from other entities is achieved.
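A client-side TLS configuration matching this description can be sketched with the Python standard library. TLS 1.3 only offers (EC)DHE key exchange, which provides the forward secrecy mentioned above, and AES-256-GCM is among its standard cipher suites; the commented certificate paths are hypothetical.

```python
import ssl

# TLS 1.3 client context for the client-FPGA channel.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.minimum_version = ssl.TLSVersion.TLSv1_3
# PROTOCOL_TLS_CLIENT already enables certificate verification:
assert ctx.verify_mode == ssl.CERT_REQUIRED
# For certificate-based client authentication, the client would also load
# its own certificate (file names hypothetical):
# ctx.load_cert_chain("client.crt", "client.key")
```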
When all the entities are authenticated and the authorizations are granted, the client
can access the FPGA with the access token. The token needs to be decrypted and
parsed. Actions are then taken by the FPGA to program and allocate resources inside
the device. Through its resource server, the CP explicitly specifies the allocated
resource information to the TA during the second step shown in Fig. 1.10.
According to the OAuth 2.0 protocol, token content can be extended according to
the user’s preference [19]. In order to take advantage of this feature in a cloud FPGA
context, critical information needs to be selected. This information reinforces the
access control of the client and ensures device/infrastructure security. It is up to the
CP to decide the content of the token. In our solution, several pieces of information
must be included. This information includes the FPGA serial number (or a specific
challenge–response pair for a PUF) and a partial reconfiguration region (PRR)
identifier in the case of multi-tenant FPGA usage. A client identifier (e.g., certificate)
is useful for secure communication outside the FPGA. This information identifies
the device, the client, and the allocated PRR. Additionally, bitstream identification
and signatures are stored in the token. This aspect is further detailed in Sect. 1.5.6.
The token must have validity timestamps for the FPGA to take action upon token
expiration. This information is necessary to ensure that the client’s activity is located
inside the cloud infrastructure and that the access delegation scope is respected. The
TA should be able to create an access token with this information.
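The token payload argued for above could be modeled as follows. The field names are assumptions for illustration, not a normative format.

```python
from dataclasses import dataclass, asdict

@dataclass
class AccessTokenClaims:
    """Illustrative token content for a cloud-FPGA access token."""
    fpga_serial: str       # or a PUF challenge-response pair
    prr_id: str            # allocated partial reconfiguration region
    client_id: str         # e.g., a certificate fingerprint
    bitstream_id: str
    bitstream_sha256: str  # signature checked by the FPGA before loading
    not_before: int        # validity timestamps (epoch seconds)
    not_after: int         # the time-aware FPGA ends access at expiry

claims = AccessTokenClaims("SN-0042", "PRR-3", "cert:ab12cd", "bs-7",
                           "a" * 64, 1_700_000_000, 1_700_003_600)
assert asdict(claims)["prr_id"] == "PRR-3"
```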
The CP’s virtualization tools track which FPGA and PRR are in use. Cloud
resource utilization is tracked by the CP and already allocated resources cannot
be overwritten and reallocated by the client. Moreover, as the token has a limited
validity, the FPGA should be able to end the connection with the client based on
a timestamp without needing the CP to take action, i.e., the FPGA must be time-
aware. The access validity timestamp is available inside the token.
In order to load a bitstream into an allocated resource, the client must fulfill several
requirements. Bitstreams must be verified by the TA for malicious design patterns.
Then the bitstream must be certified and cleared for safe use.
To accomplish this goal, each bitstream submitted by the client receives a unique
identifier. Then the TA proceeds to verify it. If the bitstream is declared safe for use
by the TA, the signature of the bitstream is calculated and stored in the TA database.
Then, the bitstream identifier and the bitstream signature are also added to the access
token. SHA-256 or SHA-384 should be used for security.
When the client tries to load a bitstream into their allocated resource, the
bitstream identifier and the bitstream signature will also be shared in step 2 of
Fig. 1.11. At the reception of the bitstream, the FPGA computes the signature and
verifies its results with the signature stored inside the token in step 3. If the results
are the same, the FPGA reconfigures the user’s allocated resource in the last step.
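The FPGA-side comparison in step 3 amounts to hashing the received bitstream and matching the digest against the one carried in the (already verified) token; a minimal sketch:

```python
import hashlib

def bitstream_accepted(bitstream: bytes, token_sha256: str) -> bool:
    """Recompute the bitstream digest and compare it with the signature
    stored inside the access token."""
    return hashlib.sha256(bitstream).hexdigest() == token_sha256

bitstream = b"\x00\x01partial-bitstream-bytes"
token_digest = hashlib.sha256(bitstream).hexdigest()  # stored by the TA
assert bitstream_accepted(bitstream, token_digest)
assert not bitstream_accepted(bitstream + b"\xff", token_digest)  # tampering detected
```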
Analytically, the time required to generate the authorization code shown in
Fig. 1.10 can be expressed as

tcode = tcert + 3 × tA−B (auth.) + tCP (internal) + tTA (internal)    (1.1)
Internal tasks are not computationally expensive; for example, tCP (internal)
refers to the time needed by the CP's virtualization tools to check for available
resources and allocate them to a client. The TA only reads/writes values and
generates the authorization code during tTA (internal). The certificate creation time
tcert can be skipped if the client already has a certificate. tA−B (auth.) represents the
time required for a certificate-based TLS authentication between two entities A and B.
ttoken should be very small because there is only a certificate-based TLS authenti-
cation of the client and an FPGA authentication by reading a value stored in a secure
FPGA memory location. Then the TA proceeds with the token generation procedure,
which involves a few read and write operations. The information necessary for the
token generation is stored in the TA's database.
taccess is the time required to confirm that a generated token grants valid FPGA
access. To confirm this, we need one TLS authentication between the client and the
FPGA and a token verification, which is represented by tFPGA (internal). This value
is significantly smaller than the time required by our protocol to generate an autho-
rization code and an access token. taccess requires less communication, whereas
tcode and ttoken include internal computing times and accumulated communication
times.
Let us set tA−B (net.) to 30 ms and use the delay of 67.5 ms without network
latency [30] as a baseline for a standard TLS handshake. We include three client-
authenticated TLS handshakes as noted in Eq. 1.1. There are eight messages in a
client-authenticated TLS handshake, which gives us 8 × 3 × 30 = 720 ms of network
latency and 3 × 67.5 = 202.5 ms of TLS handshake computation. If the client generates their
certificate offline and then requests an authorization code as stated in Sect. 1.5.1,
tcode = 720 + 202.5 = 922.5 ms. Tasks internal to the CP and TA are inexpensive,
so their time costs are much smaller than the values taken into account for the previous
cost estimation.
Additionally, ttoken equals 2 × 8 × 30 + 2 × 67.5 = 682.5 ms for two TLS
handshakes and network latency, without taking internal tasks into account. Most of
this time is spent on network latency.
As a final total, tcode + ttoken = 922.5 + 682.5 = 1605 ms is needed to
obtain authorization and generate an access token for the allocated FPGA. Even in
a worst-case scenario, this procedure should not last more than two seconds.
Finally, taccess can be estimated by the following equation: taccess = 30 + 67.5 +
tFPGA (internal) = 97.5 ms + tFPGA (internal) < 1605 ms. Accessing
an FPGA with a valid token takes approximately 100 ms because tFPGA (internal)
consists of simple memory operations. For this use case, Eq. 1.3 holds: taccess =
97.5 + tFPGA (internal) < 922.5 + 682.5
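The cost model can be reproduced numerically; the sketch below uses the parameter values assumed in the text (30 ms per message, 67.5 ms per handshake computation from [30], eight messages per client-authenticated handshake) and ignores the small internal CP/TA times.

```python
T_NET = 30.0   # assumed per-message network latency (ms)
T_HS = 67.5    # TLS handshake computation without network latency (ms), from [30]
MSGS = 8       # messages in a client-authenticated TLS handshake

# Three client-authenticated handshakes to obtain the authorization code:
t_code = 3 * MSGS * T_NET + 3 * T_HS
assert t_code == 922.5   # 720 ms network + 202.5 ms handshake computation

# Accessing the FPGA with a valid token: one message exchange plus one
# handshake computation, ignoring tFPGA(internal) (simple memory operations).
t_access = T_NET + T_HS
assert t_access == 97.5
```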
1.7 Conclusion
This chapter described the extensive state of the art in authentication techniques
and their application to FPGAs in cloud computing. Each service model used
for the FPGA cloud has different security requirements that need to be met to
achieve secure computation. Under the infrastructure as a service model, users have
indirect access to resources and do not need FPGA technical knowledge. Under
the FPGA as a service model, users can implement their own FPGA design. This
approach provides more flexibility to the user at the risk of cloud security and the
user’s privacy. Security and privacy concerns must be addressed for secure FPGA
computation in the cloud.
Authentication is an important security step. It allows for the identification
and certification of users and devices to establish security in communications and
computations. Authentication is especially necessary for FPGA multi-tenancy so
that access control policies for hardware resource access and data protection can be
enforced. Authentication can either be directly defined between two communicating
users or involve a trusted third party (i.e., TA). Direct authentication mechanisms
involve MAC algorithms and AE. Authentication with a TA takes advantage of a
PKI and certificates. Authentication can happen at many levels. Firstly, users, the
CP, and the TA can be authenticated. Secondly, FPGAs and user bitstreams can be
authenticated to verify bitstream and device authenticity. Authentication with a TA
is more common in FPGA-based cloud computing because the cloud user needs to
protect their IP to keep it private from the CP and attackers. The CP also needs to
protect their devices and infrastructure.
Among open challenges, PUF-based authentication techniques are still being
investigated. They often lack security against machine learning attacks and a generic
protocol that can be widely used as an authentication solution is needed. Moreover,
multi-tenancy mechanisms in FPGAs are being actively investigated to support
improved FPGA resource usage. By spatially sharing the FPGA device among
multiple users, it is possible to reach higher efficiency. But multi-tenancy creates
new threats. Firstly, new authentication challenges like user identification inside the
device and resource sharing need to be investigated. Secondly, mitigating hardware-
level attacks and side-channel exploits is important. Users are vulnerable to each
other inside the FPGA if no security mechanisms are deployed to cover these
potential attack vectors.
In this chapter, a novel OAuth 2.0-based authentication and access delegation
framework for FPGA-enabled cloud computing is proposed. A client’s authenti-
cation protects their sensitive information and the CP’s cloud infrastructure from
malicious behaviors caused by identity theft. Furthermore, by introducing a trusted
authority, the client’s FPGA access and sensitive operations are isolated from the
CP. The client benefits from a low latency single sign-on authentication for their
FPGA thanks to a tokenized access. Security and privacy are enhanced for both the
cloud provider and the client.
References
1. Abdellatif, K. M., Chotin-Avot, R., & Mehrez, H. (2013). Protecting FPGA bitstreams
using authenticated encryption. In 2013 IEEE 11th International New Circuits and Systems
Conference (NEWCAS) (pp. 1–4). https://doi.org/10.1109/NEWCAS.2013.6573635
21. Ince, S., Espes, D., Lallet, J., Gogniat, G., & Santoro, R. (2021). OAuth 2.0-based authentica-
tion solution for FPGA-enabled cloud computing. In 3rd International Workshop on Cloud, IoT
and Fog Systems (and Security) - CIFS 2021 co-located with the 14th IEEE/ACM International
Conference on Utility and Cloud Computing - UCC 2021. University of Leicester, UK.
22. Intel Quartus development software official product website. https://www.intel.com/content/
www/us/en/products/details/fpga/development-tools/quartus-prime.html
23. Intel software guard extension documentation. https://www.intel.com/content/www/us/en/
architecture-and-technology/software-guard-extensions.html
24. Knodel, O., Lehmann, P., & Spallek, R. G. (2016). RC3E: Reconfigurable accelerators in data
centres and their provision by adapted service models. In IEEE International Conference on
Cloud Computing (CLOUD) (pp. 19–26). https://doi.org/10.1109/CLOUD.2016.0013
25. Krautter, J., Gnad, D., & Tahoori, M. (2020). CPAmap: On the complexity of secure FPGA
virtualization, multi-tenancy, and physical design. In TCHES, (Vol. 2020, pp. 121–146).
26. La, T., Matas, K., Grunchevski, N., Pham, K., & Koch, D. (2020). FPGADefender: Malicious
self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on Reconfigurable
Technology and Systems, 13(3), 34:1–34:20. https://doi.org/10.1145/3402937
27. Lallet, J., Enrici, A., & Saffar, A. (2018). FPGA-based system for the acceleration of cloud
microservices. In 2018 IEEE International Symposium on Broadband Multimedia Systems
and Broadcasting (BMSB) (pp. 1–5). https://doi.org/10.1109/BMSB.2018.8436912
28. Ma, Y., Zhang, Q., Zhao, S., Wang, G., Li, X., & Shi, Z. (2020). Formal verification of
memory isolation for the trustzone-based TEE. In 2020 27th Asia-Pacific Software Engineering
Conference (APSEC) (pp. 149–158). https://doi.org/10.1109/APSEC51365.2020.00023
29. Maimut, D., & Reyhanitabar, R. (2014). Authenticated encryption: Toward next-generation
algorithms. IEEE Security & Privacy, 12(2), 70–72. https://doi.org/10.1109/MSP.2014.19
30. Mbongue, J. M., Shuping, A., Bhowmik, P., & Bobda, C. (2020). Architecture support for
FPGA multi-tenancy in the cloud. In 2020 IEEE 31st International Conference on Application-
specific Systems, Architectures and Processors (ASAP) (pp. 125–132). https://doi.org/10.1109/
ASAP49362.2020.00030
31. Microsoft Azure documentation for cloud FPGA attestation mechanism. https://learn.
microsoft.com/en-us/azure/virtual-machines/field-programmable-gate-arrays-attestation
32. Microsoft Azure documentation for FPGA optimized virtual machine sizes. https://docs.
microsoft.com/en-us/azure/virtual-machines/sizes-field-programmable-gate-arrays
33. Miura, N., Fujimoto, D., Korenaga, R., Matsuda, K., & Nagata, M. (2014). An intermittent-
driven supply-current equalizer for 11x and 4x power-overhead savings in a CPA-resistant
128-bit AES cryptographic processor. In 2014 IEEE Asian Solid-State Circuits Conference
(A-SSCC) (pp. 225–228). https://doi.org/10.1109/ASSCC.2014.7008901
34. Nilsson, A., Bideh, P. K., & Brorsson, J. (2020). A Survey of Published Attacks on Intel SGX.
Technical Report, CoRR, abs/2006.13598.
35. Official repository of the AWS EC2 FPGA Hardware and Software Development Kit. https://
github.com/aws/aws-fpga
36. Parelkar, M. (2004). FPGA security-bitstream authentication. Technical Report, George
Mason University (2004). http://mason.gmu.edu/~mparelka/reports/bitstream-auth.pdf
37. Perlman, R. (1999). An overview of PKI trust models. IEEE Network, 13(6), 38–43. https://
doi.org/10.1109/65.806987
38. Ramanna, S. C., & Sarkar, P. (2011). On quantifying the resistance of concrete hash functions
to generic multicollision attacks. IEEE Transactions on Information Theory, 57(7), 4798–4816.
https://doi.org/10.1109/TIT.2011.2146570
39. Rescorla, E. (2018). The Transport Layer Security (TLS) Protocol Version 1.3. Technical
Report, RFC 8446. https://doi.org/10.17487/RFC8446
40. Rührmair, U., et al. (2013). PUF modeling attacks on simulated and silicon data. IEEE
Transactions on Information Forensics and Security, 8(11), 1876–1891. https://doi.org/10.
1109/TIFS.2013.2279798
41. Ringlein, B., Abel, F., Ditter, A., Weiss, B., Hagleitner, C., & Fey, D. (2019). System
architecture for network-attached FPGAs in the cloud using partial reconfiguration. In 2019
29th International Conference on Field Programmable Logic and Applications (pp. 293–300).
https://doi.org/10.1109/FPL.2019.00054
42. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). An inside job: Remote
power analysis attacks on FPGAs. In 2018 Design, Automation & Test in Europe Conference
& Exhibition (DATE) (pp. 1111–1116). https://doi.org/10.23919/DATE.2018.8342177
43. Tsai, T. H., Ho, Y. C., & Sheu, M. H. (2019). Implementation of FPGA-based accelerator for
deep neural networks. In 2019 IEEE 22nd International Symposium on Design and Diagnostics
of Electronic Circuits & Systems (DDECS) (pp. 1–4). https://doi.org/10.1109/DDECS.2019.
8724665
44. Turan, F., Roy, S. S., & Verbauwhede, I. (2020). HEAWS: An accelerator for homomorphic
encryption on the Amazon AWS FPGA. IEEE Transactions on Computers, 69(8), 1185–1196.
https://doi.org/10.1109/TC.2020.2970824
45. Xiao, H., Li, K., & Zhu, M. (2021). FPGA-based scalable and highly concurrent convolutional
neural network acceleration. In 2021 IEEE International Conference on Power Electronics,
Computer Applications (ICPECA) (pp. 367–370). https://doi.org/10.1109/ICPECA51329.
2021.9362549
46. Xilinx application store for FPGA acceleration boards. https://www.xilinx.com/products/app-
store.html
47. Xilinx Vivado development software official product website. https://www.xilinx.com/products/
design-tools/vivado.html
48. Zeitouni, S., Vliegen, J., Frassetto, T., Koch, D., Sadeghi, A. R., & Mentens, N. (2021). Trusted
configuration in cloud FPGAs. In 2021 IEEE 29th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM) (pp. 233–241). https://doi.org/10.1109/
FCCM51124.2021.00036
49. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In 2018
IEEE Symposium on Security and Privacy (SP) (pp. 229–244). https://doi.org/10.1109/SP.
2018.00049
50. Zhou, Y., & Jiang, J. (2015). An FPGA-based accelerator implementation for deep convo-
lutional neural networks. In 2015 4th International Conference on Computer Science and
Network Technology (ICCSNT) (pp. 829–832). https://doi.org/10.1109/ICCSNT.2015.7490869
51. Zhu, Z., Liu, A. X., Zhang, F., & Chen, F. (2021). FPGA resource pooling in cloud computing.
IEEE Transactions on Cloud Computing, 9(2), 610–626. https://doi.org/10.1109/TCC.2018.
2874011
Chapter 2
Domain Isolation and Access Control
in Multi-tenant Cloud FPGAs
2.1 Introduction
Field-programmable gate arrays (FPGAs) are generally provided in the cloud in two
main paradigms: The first model is “hardware acceleration as a service” (HAaaS),
where acceleration is provided to the user through pre-implemented hardware
accelerators by the cloud operator. Users call functions with input values and collect
the results. This model is used, for instance, by Microsoft with the Catapult platform
for accelerating the Bing search engine [51]. The second model is “infrastructure
as a service” (IaaS), used by Amazon [1], where the FPGA logic is provided to
the users as part of a virtual infrastructure package (processor, memory, storage,
peripherals, etc.) that must be scheduled onto physical resources by the cloud
provider. In HAaaS, the cloud provider controls everything, which reduces the user’s
options to only the kernels implemented by the cloud provider. IaaS is more flexible
and allows users to develop and optimize their hardware implementation.
For security reasons, existing FPGA clouds do not implement FPGA multi-
tenancy and continue to allocate entire FPGAs to single users. In the age of dark
silicon, with a considerable amount of resources on an FPGA and designs that
rarely occupy the whole FPGA, assigning the entire fabric to single users reduces
device utilization and increases power consumption. Sharing FPGA resources
among different tenants can help improve FPGA utilization in the cloud. However,
multi-tenant deployment of FPGAs with hardware accelerators from arbitrary
sources is challenging. The allocation and de-allocation of parts of the
FPGA resources require a model that enables efficient reallocation of input/output
(IO) resources to service multiple virtual machines from different tenants using
Fig. 2.1 Illustration of the provisioning models. (a) Model in which entire FPGAs are provisioned
in the cloud. (b) Proposed model: VMM provisions FPGA regions in the cloud
Fig. 2.2 Hardware elasticity on FPGA. (a) The FPGA is partitioned into slots. (b) ACC1 uses
two slots, and ACC2 and ACC3 use one slot each. (c) The FPGA only provisions one slot that is
assigned to ACC1
Service elasticity generally allows for the provisioning and releasing of resources
and capacity scaling with need [47]. Provisioning of elastic FPGA resources in the
cloud enables developers to program designated areas on the fabric with designs of
various sizes. This action is possible in clouds that allocate entire FPGAs to users,
like AWS F1 [1]. This model of deployment consists of time-shared FPGAs. CSPs
generally program FPGAs with a shell that implements static design modules such
as IO controllers. Therefore, partial reconfiguration is leveraged to update hardware
functions in areas of the fabric that are exposed to cloud users [9].
The provisioning of space-shared FPGAs, in which multiple user designs are
co-hosted on a device, introduces a different perspective on hardware elasticity.
Hardware space sharing between multiple tenants requires constraining hardware
accelerators to geographic locations on the FPGA. As a result, user designs are
synthesized, placed, and routed considering the physical constraints of the allocated
FPGA areas (registers, block RAMs, DSP blocks, etc.). Therefore, if a cloud
application needs additional FPGA resources, the generated partial bitstream may
not be suitable for programming a different FPGA region.
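The location constraint described above can be sketched as a simple placement check: a partial bitstream is compiled for specific slots, so an allocator can load it only when exactly those slots are free. The slot names and function are illustrative assumptions.

```python
# A partial bitstream is placed and routed for specific slots, so it can only
# be loaded if every slot it targets is currently free; it cannot be relocated.
def can_program(bitstream_slots: set, free_slots: set) -> bool:
    return bitstream_slots <= free_slots   # every targeted slot must be free

free = {"SLOT2", "SLOT3", "SLOT4"}
assert can_program({"SLOT3", "SLOT4"}, free)       # fits its compiled region
assert not can_program({"SLOT1", "SLOT2"}, free)   # SLOT1 busy: no relocation
```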
Figure 2.2a illustrates a device divided into four slots. User accelerators (ACCi,
i ∈ {1 . . . 4} in Fig. 2.2b and c) can then utilize one or multiple slots based on their
need. However, it may not always be possible to allocate adjacent slots to user
workloads in multi-tenant FPGA clouds. For instance, an accelerator in SLOT 1
may need additional resources that are available only in SLOT 4 (Fig. 2.2a). There
32 C. Bobda et al.
Within the stack of an IaaS, FPGAs generally act as co-processors to enable cus-
tomized acceleration for specific software functions. They are most often integrated
with VM software through high-speed interfaces such as PCIe. In virtualization
systems such as cloud infrastructure, the hardware interface is generally exposed
as IO ports to VMs using one of the four major approaches shown in Fig. 2.3.
IO virtualization is achieved by either software (emulation and paravirtual-
ization) or hardware (directIO, single/multiple root IO virtualization) support. In
emulation (Fig. 2.3a), each attempt to execute an IO instruction raises a system
call that is trapped and executed by the virtual machine manager (VMM) in a
privileged mode. While this approach has the benefit of not requiring guest operating
system (GOS) modification, it incurs high overhead because of the recurrent
context switches between privileged and non-privileged modes. Paravirtualization
Fig. 2.3 Summary of the state-of-the-art IO virtualization approaches. (a) Emulation, (b) Paravir-
tualization, (c) DirectIO, and (d) SRIOV/MRIOV
VRs to a single job. However, providing low-level details regarding the topology of
the NoC is beyond the scope of this chapter. An efficient NoC infrastructure that
supports multi-tenancy with VM integration has been designed, implemented, and
described in Mandebi et al. [43, 45].
This section describes the security framework’s mechanism to ensure the controlled
sharing of hardware modules. Sharing FPGA resources among tenants without a
guarantee of isolated execution can lead to scenarios where shared accelerators act
as potential covert channels among software guests which reside in different security
contexts. Remote attacks have recently been demonstrated in FPGAs [22, 52, 64].
Giechaskiel et al. showed that information transported on long wires in an FPGA can
be exfiltrated using neighboring long wires on the same chip [22]. This vulnerability
was then leveraged by Ramesh et al. to launch an attack on a remote accelerator
sharing the same FPGA as the malicious accelerator [52].
Prior research on domain isolation in hardware has focused on the prob-
lem of isolating accelerator access in standalone systems-on-chip applications
[6, 13, 14, 16, 17, 19, 23, 33–36, 50, 53, 56, 59]. In the Secure Enclave approach
[2, 4, 8, 24, 37, 54, 58, 60, 61], system developers rely on an operating system and
hardware-level enforcement mechanisms to provide an isolated mode of execution
and other system security services, such as confidentiality and integrity protection
of the external memory. Elnaggar et al. described a defense mechanism against
attacks on multi-tenant FPGAs using secured authentication [18]. While these
proposals achieve isolation in a single application systems-on-chip, they have yet to
be demonstrated to work efficiently in multi-user cloud environments managed by
the host and guest operating systems. Current implementations of these proposals
do not define an interface that would allow accelerators to inherit dynamically, at
run-time, security policies of processes calling them from the operating system or
the hypervisor. Furthermore, secure enclave technologies such as TrustZone [2] do
not provide fine-grained security enforcement directly at the IP level, which limits
their security performance coverage.
In the literature, FPGA virtualization architectures have been proposed to increase the
programmability and security of FPGAs without losing performance [41, 42, 48].
Address protection strategies for systems-on-chip using NoC-based communication
architectures were demonstrated by Saeed et al. in [55], while hardware isolation
strategies for IP protection in system-on-chip (SoC) and computer networks were
presented in [28, 29] and further extended in [10, 11, 30] to shield hardware IPs
using hardware sandboxes. Mbongue et al. investigated the use of access control
mechanisms in the cloud with FPGA extensions to support cloud security [41]. That
work does not consider sharing among FPGAs and allocates an entire FPGA to a
single VM. Internet of Things (IoT) security [46] and domain isolation strategies in
the cloud were also proposed by Festus et al. [25, 26]. A provable architecture for
isolation in networked designs was proposed, implemented, and evaluated. Security
rule checking for access control in SoC-based embedded systems was the goal of
[30, 31]. The security architecture for domain isolation and access control
is described in the following subsections, along with the threat model and system
assumptions in the cloud domain.
such as SELinux using a Linux security module (LSM) built into the kernel code
with default security policy “Type Enforcement (TE).”
The solution mentioned in this chapter borrows from the insights gained from
the domain separation that is present in the FLASK security architecture. The
information is applied to the isolation of guest VM execution in mandatory
access control (MAC)-based hypervisors and extended to the isolated execution
of hardware accelerators on FPGAs. The resulting framework guarantees that in
such systems, hardware modules execute and reside in the same security context as
the “caller” guest VM by propagating to the “callee” modules guest VM privilege
boundaries defined at the software level. The proposed solution is based on the
FLASK architecture, the foundation of security kernels of the most widely deployed
hypervisors, such as KVM and Xen (Fig. 2.5).
S := {VM, A, F, VR, D, M}    (2.1)
where:
• VM = {VM1, VM2, VM3, . . ., VMn} is the set of virtual machines in the cloud
platform.
• A = {A1, A2, A3, . . ., An} is the set of application sets, in which each virtual
machine i has its corresponding application set Ai = {ai1, ai2, ai3, . . ., aim}.
• F = {f1, f2, f3, . . ., fp} is the set of FPGA devices.
• VR = {VR1, VR2, VR3, . . ., VRp} is the set of virtual regions allocated in the
FPGA devices. Here, each VR set j, associated with the corresponding FPGA
device, is itself a set of VRs: VRj = {VRj1, VRj2, VRj3, . . ., VRjq}.
Rule 3 (Confidentiality). For a legal access request τ (obeying Rule 2), if the
corresponding read decision d_1 is made by the lookup function Γ(M, τ) and
d_1 ∈ D, confidentiality is preserved in the system domain.
Rule 4 (Integrity). For a legal access request τ (obeying Rule 2), if the
corresponding write decision d_2 is made by the lookup function Γ(M, τ) and
d_2 ∈ D, integrity is preserved in the system domain.
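Expressed in software, Rules 3 and 4 amount to a table lookup. The sketch below is illustrative only: the matrix layout, the request tuple (vm, vr), and the decision names are assumed for the example, not taken from the chapter's implementation.

```python
# Illustrative sketch of the access-decision lookup Γ(M, τ): the matrix M maps a
# (VM, virtual-region) pair to the set of decisions in D granted to that pair.
READ, WRITE = "read", "write"          # decision set D = {read, write}

# Access matrix M: (vm, vr) -> set of granted decisions
M = {
    ("VM1", "VR11"): {READ, WRITE},    # VM1 owns the region: full access
    ("VM2", "VR12"): {READ},           # VM2 may only read back results
}

def lookup(M, tau):
    """Γ(M, τ): return the decisions granted for access request τ = (vm, vr)."""
    return M.get(tau, set())

def is_confidential(M, tau):
    """Rule 3: a read proceeds only if 'read' is among the granted decisions."""
    return READ in lookup(M, tau)

def is_integral(M, tau):
    """Rule 4: a write proceeds only if 'write' is among the granted decisions."""
    return WRITE in lookup(M, tau)

print(is_confidential(M, ("VM2", "VR12")))  # True
print(is_integral(M, ("VM2", "VR12")))      # False
```

An unknown pair maps to the empty decision set, so a request outside the matrix is implicitly denied.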
[Figure: guest VMs and an admin domain run above the hypervisor; the security policy server (SMM) propagates the kernel security policy and domain-specific security policies by inheritance down to the accelerators on the FPGA]
In the first iteration of its design, when the access enforcement function allows
the request to proceed, it sends a message to the hardware accelerator over the
network-on-chip. When the execution is complete, the accelerator sends the result
to the SMM.
During hardware execution, the access enforcement function maintains the list of
busy hardware modules and their corresponding caller guest VM security contexts
in a specialized data structure. It allows the access enforcement function to discard
results of ongoing execution in case there have been changes in security policies
mid-execution, which the ongoing execution violates.
To minimize the overhead associated with access decision requests and computations,
the Flask security server returns more decisions than requested during the initial
access decision request for a given pair of security contexts. These decisions,
together with the security context pair they map to, are then cached in an "Access
Vector Cache (AVC)." To let this scheme scale without degrading overall system
performance under repeated access requests, AVC capability is added to the
Hardware Module Manager, and the AVC interface is responsible for managing
access decision misses. Security policy management remains in software: to resolve
a miss, the AVC queries its software component, which requests the access decision
from the Flask security server. When the latter returns the decision, the AVC
informs the access enforcement function and updates its access decision table.
When there are changes in security policies, the Flask security server alerts the
AVC software component of policy changes. The latter notifies the AVC hardware
component and updates its permissions state. Then, the AVC alerts the access
enforcement function of changes in policies. The access enforcement function
reevaluates the security context of ongoing hardware execution. It discards the
execution results if, per the updated policies, the execution was not authorized.
The AVC software component informs the security server when policy update
propagation is completed.
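The AVC flow described above — cache hit, miss handling through the software component, and policy-change invalidation of ongoing executions — can be sketched as a behavioral model. All class and method names below are hypothetical, not the hardware design:

```python
# Illustrative sketch of the Access Vector Cache (AVC) behavior: decisions are
# cached per security-context pair; a miss queries the security server; a policy
# change flushes the cache and re-evaluates ongoing (busy) executions.
class SecurityServer:
    def __init__(self, policy):
        self.policy = policy                      # (caller_ctx, callee_ctx) -> bool

    def decide(self, pair):
        return self.policy.get(pair, False)       # deny by default

class AccessVectorCache:
    def __init__(self, server):
        self.server = server
        self.table = {}                           # cached access decisions

    def check(self, pair):
        if pair not in self.table:                # miss: ask the security server
            self.table[pair] = self.server.decide(pair)
        return self.table[pair]

    def policy_changed(self, busy_pairs):
        """On a policy update, flush the cache and return the busy executions
        that the new policy no longer authorizes (to be discarded)."""
        self.table.clear()
        return [pair for pair in busy_pairs if not self.check(pair)]

srv = SecurityServer({("vm1_ctx", "acc1_ctx"): True})
avc = AccessVectorCache(srv)
print(avc.check(("vm1_ctx", "acc1_ctx")))             # True (miss, then cached)
srv.policy = {}                                       # policies change mid-execution
print(avc.policy_changed([("vm1_ctx", "acc1_ctx")]))  # [('vm1_ctx', 'acc1_ctx')]
```

The returned list models the access enforcement function discarding results whose authorization was revoked mid-execution.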
The pair SMM and HMM ensures that a hardware accelerator running on the
FPGA performs actions in lockstep only with the software thread that controls
it, thus extending the domain separation enforced in the hypervisor to the FPGA.
The resource access control uses the same tools as SMM and HMM to grant
access to those resources from software threads and accelerators as defined in the
security policies.
The security server uniquely identifies the communication sessions between VMs
and FPGAs, implements security policies defining the access rules, and monitors the
enforcement of the security policies. The communication sessions implement Rule
2 of the security model as they enable requests between applications in the VM and
the VR on the FPGA.
42 C. Bobda et al.
The HMM works with the Security Server to enforce access control over FPGA
accelerators. Figure 2.8 shows the internal architecture of the HMM. The major
components of the HMM are the maintenance controller, the true-random number
generator (TRNG), the cryptography module, and the decision module. It also has a
session ID register (SessionID_reg), Session Key register (SessionKey_reg), header
extraction, and insertion modules. The HMM has three interfaces: a Management
Interface for secure communication with the Security Server, an Accelerator
Interface to stream data in and out of the hardware accelerators, and a Cloud
Interface enabling exchanges between the hardware and software services running
within VMs. For security purposes, the Management Interface is not connected to
physical ports managed by the VMM.
Maintenance Controller This unit implements the logic that the Security Server
uses to access configuration resources such as session ID and Session Key registers.
It also provides interfaces to request the generation of random numbers.
2 Domain Isolation and Access Control in Multi-tenant Cloud FPGAs 43
[Fig. 2.8: internal architecture of the HMM — the Cloud Interface and Application Controller feed an AES core with encrypt/decrypt paths keyed from SessionKey_reg; header extraction and insertion modules; an RO-based TRNG with an XOR-tree entropy source; a SessionID_reg whose comparison drives the decision module and an interrupt generator; the Maintenance Controller, reachable by the Security Server over the Mnt Interface; and the Accelerator Interface]

[Fig. 2.9: communication packet structure — Header: SessionID (16 bits); Payload: Operation (1 bit) and Data (128 bits)]
True-Random Number Generator The TRNG generates random keys that are
stored in the Session Key registers. The crypto module uses the keys to encrypt and
decrypt the messages. Encrypting communications ensures confidentiality (Rule 3
of the security formalism). Ring Oscillators (ROs) are used as a source of entropy.
The output of the XOR tree is sampled in a synchronous D flip-flop driven by the
system clock to convert the RO jitter into a random digital sequence [39]. Jitter, in
this case, represents the deviation caused by random process variation and temporal
variations such as random physical noise, environmental variations, and the aging
of the chip.
Crypto Module The crypto module decrypts incoming traffic and encrypts outgoing
packets. The Advanced Encryption Standard (AES) with 128-bit keys and ten rounds
is used in this architecture.
Decision Module This module decides whether an incoming packet is forwarded
to the accelerator or discarded. Figure 2.9 shows the structure of communication
packets.
The header stores the session ID, and the payload defines the operation (read or
write) and data to transfer. The session ID of incoming data is first checked against
the content of the session ID register. Access to the accelerator is granted when both
values match (Rule 4 of the security formalism). An interrupt is generated to the
Security Server if the two values do not match (to support Rule 5 of the security
formalism).

[Fig. 2.10: sequence diagram of the session setup protocol among the application App_i, the VMM, the Security Server, the HMM, and the accelerator — steps 1–4 cover the user's FPGA access request, FPGA/VR allocation, and accelerator programming; steps 5–8 are detailed below]
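Based on the packet layout of Fig. 2.9, the decision module's match rule can be modeled in software. This is an illustrative sketch; the field ordering and the operation encoding are assumptions, not the RTL:

```python
# Software model of the decision module: parse a 145-bit packet
# (16-bit session ID | 1-bit operation | 128-bit data) and forward it to the
# accelerator only when its session ID matches the SessionID register.
def parse_packet(packet: int):
    data = packet & ((1 << 128) - 1)          # low 128 bits: payload data
    op = (packet >> 128) & 0x1                # 0 = read, 1 = write (assumed coding)
    session_id = (packet >> 129) & 0xFFFF     # top 16 bits: session ID header
    return session_id, op, data

def decide(packet: int, session_id_reg: int):
    """Return (forward?, interrupt?) per the decision module's match rule."""
    sid, _op, _data = parse_packet(packet)
    match = sid == session_id_reg
    return match, not match                   # a mismatch raises the interrupt

pkt = (0xBEEF << 129) | (1 << 128) | 42       # session 0xBEEF, write, data = 42
print(decide(pkt, 0xBEEF))                    # (True, False): forwarded
print(decide(pkt, 0x1234))                    # (False, True): discarded + interrupt
```

The interrupt path models the notification sent to the Security Server on a session ID mismatch.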
5. The Security Server records the VM-FPGA region binding and generates a
random session ID number shared with the VM through a trusted communication
channel.
6. The Security Server requests a Session Key to the HMM in the FPGA region
hosting the user design. The Session Key is shared with the VM through a trusted
communication channel.
7. The Security Server shares the session ID with the HMM to enable hardware-
level authentication.
8. The Security Server directs the VMM to assign the address space of the FPGA
accelerator to the VM to enable read and write operations to the VR allocated in
step (3).
Once these initial configuration steps are completed, the communication between
the VM and hardware can start. The VM encrypts data using the Session Key.
Each packet sent between the VM and hardware accelerators (and vice versa) contains
a message and a session ID number. When the HMM receives a packet, it decrypts it
using the Session Key and checks whether the session ID number matches its record.
If it does, the message is forwarded to the hardware accelerator; otherwise, a
security notification is sent to the Security Server.
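The exchange just described can be sketched end to end. A keyed XOR stream is used below as a dependency-free stand-in for the design's AES-128 core; the control flow, not the cipher, is the point of the sketch:

```python
# Sketch of the VM -> HMM exchange: data is encrypted with the session key, and
# the HMM checks the session ID before forwarding to the accelerator. The toy
# XOR keystream below stands in for AES-128 to keep the sketch self-contained.
import hashlib

def toy_encrypt(key: bytes, msg: bytes) -> bytes:
    stream = hashlib.sha256(key).digest()[: len(msg)]
    return bytes(m ^ s for m, s in zip(msg, stream))

toy_decrypt = toy_encrypt                  # an XOR stream cipher is its own inverse

def hmm_receive(ciphertext, sid, session_key, sid_reg):
    if sid != sid_reg:                     # mismatch: notify the Security Server
        return None, "security_notification"
    return toy_decrypt(session_key, ciphertext), "forwarded_to_accelerator"

key = b"\x00" * 16                         # 128-bit session key (placeholder value)
ct = toy_encrypt(key, b"pixel block")
print(hmm_receive(ct, 0xBEEF, key, 0xBEEF))     # (b'pixel block', 'forwarded_to_accelerator')
print(hmm_receive(ct, 0xBEEF, key, 0x1234)[1])  # security_notification
```

The reverse direction (accelerator to VM) mirrors this flow, with the VM verifying the session ID in the decrypted response.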
The VM also decrypts the response from the hardware to ensure that the received
session ID matches the value previously sent by the Security Server. Overall,
the secured communication protocol relies on a two-step authentication between
VM and FPGA accelerator that uses a 128-bit session key and a 16-bit session
ID, i.e., 2^144 possible combinations at a time. The session keys and session IDs
are generated at the hardware level on the FPGA and Security Server to ensure
randomness. Though 128-bit session keys and 16-bit session IDs are used
in the implementation, the architecture can accommodate wider data widths. After
a time window defined by the cloud provider, the Security Server initiates the
generation of new session keys and session IDs to reduce the risk of a security
breach. The system configuration is only performed by an authorized administrator
(Rule 6 of the security formalism).
To find the area overhead of the HMM and the feasibility of the domain isolation,
a prototype of the described architecture was implemented in [39, 44] for a cloud
configuration with a node that runs VMs. The cloud is set up on a Dell EMC R7415
server with a 2.09 GHz AMD Epyc 7251 CPU and 64 GB of memory. The
node runs CentOS-7 with kernel version 3.10.0. An Intel Stratix V FPGA
(5SGXMB5R1F40C1) is used as a testing device, and Intel Quartus Prime 18.1.0
Standard Edition is used to synthesize, place, and route hardware designs. The
FPGA is connected to the server through a PCIe Gen3 ×8 interface. QEMU 2.11.50
emulates the VMs, with each VM running on Ubuntu 16.04.01 with 4 GB of RAM.
Fig. 2.11 Hamming distance between the random numbers generated by the HMM for 128-bit
keys
To assess the security resulting from implementing domain isolation with the
secured communication protocol, the randomness of the generated session IDs and
session keys was evaluated using the Hamming distance [12], which quantifies the
extent to which two bitstrings differ. A total of 50 session IDs and 50 session
keys were generated, and the Hamming distance between the 25 pairs of generated
numbers was measured.
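The same measurement is easy to reproduce in software against an ideal entropy source (Python's `secrets` module stands in for the TRNG here); for truly random 128-bit values, the expected Hamming distance is 64 bits, half the width:

```python
# Reproduction sketch of the randomness evaluation: generate 25 pairs of 128-bit
# values and measure their Hamming distance (number of differing bits). The
# chapter reports 63.45 on average for the HMM-generated keys, close to the
# ideal expectation of 64.
import secrets

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

pairs = [(secrets.randbits(128), secrets.randbits(128)) for _ in range(25)]
distances = [hamming(a, b) for a, b in pairs]
print(sum(distances) / len(distances))    # close to 64 for a good entropy source
```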
Figure 2.11 summarizes the results from implementing 5-stage through 17-
stage RO-based TRNGs with 30 ROs. Observe that the number of stages does not
significantly influence the difference between consecutive session keys. On average,
the Hamming distance between the generated 128-bit session keys is 63.45, which
means that between successive keys there may be up to ~2^63 possible values,
making them hard to predict.
Fig. 2.12 Hamming distance between the random numbers generated by the Security Server for
16-bit session IDs
Table 2.2 Resource overhead of the JPEG accelerator with secured communication protocol

              ALM              ALUT             Registers       M20K       DSP
Available     185,000          185,000          740,000         2100       399
No HMM        18,202 (9.8%)    17,777 (9.6%)    32,765 (4.4%)   8 (0.4%)   399 (100%)
With HMM      20,550 (11.1%)   21,032 (11.4%)   33,864 (4.5%)   12 (0.6%)  399 (100%)
Overhead      12.9%            18.31%           3.35%           50%        0%
Similarly, Fig. 2.12 shows that successive 16-bit session IDs generated by the
Security Server vary by 8.64 bits on average, corresponding to more than 256
possibilities. Further, the NIST Statistical Test Suite [7] is used to study each of
the 50 strings of 144 bits (16-bit session ID + 128-bit session key). While only 11
of the possible 15 test scenarios could be run (the other four tests required wider
data widths), the calculated P-values, ranging from 0.110904 to 0.990904, indicate
that the generated numbers are sufficiently random. Distributing the generation of
the session IDs and session keys across two different entities (FPGA and Security
Server), which rely on unpredictable hardware variation, forms the root of trust in
the proposed domain isolation architecture. In addition, the communication protocol
relies on the recurrent update of session IDs and session keys at run-time.
A test case experiment was conducted to simulate breaches in the cloud system.
The experiment shows how the proposed architecture preserves a VM domain’s
confidentiality, integrity, and availability. A VR is programmed with a JPEG
encoder. It takes as input 24-bit values (8 bits for red, 8 bits for green, and 8 bits
for blue signals) and returns 32-bit JPEG streams. Table 2.2 compares the resource
utilization of the VR with and without the HMM.
The insertion of HMMs in the virtualization stack does not incur significant
resource overhead. The LUT utilization went from 9.6 to 11.4% when using an
HMM. Two VMs (VM1 and VM2) are considered. The VR running the JPEG
encoder is assigned to VM1. The scenario simulated uses a malicious VM and a
compromised VMM to attempt breaching VM1's domains. Figure 2.13 illustrates
the test scenario.
While VM2 and a malicious application in the VMM can access VM1's FPGA
address space, they do not have the correct session ID and session key. As a result,
the HMM accepts only the read and write requests from VM1. The requests from
VM2 and the VMM are discarded, and the Security Server is notified. The isolation
mechanism focuses on notifying the Security Server in case of an unauthorized
attempt to access the hardware accelerator.

[Fig. 2.13: test scenario — the application in VM1 and KVM hold the session ID and Session Key; the HMM on the FPGA checks them before moving data between memory and the JPEG encoder accelerator in the VR]

The cloud provider or cloud administrator is then responsible for deciding the actions that follow any breach attempt. In
summary, the proposed security architecture preserves confidentiality by encrypting
data before any transfer; integrity as VR data is protected by the HMM from
unauthorized changes, and availability as the hardware notification mechanism in
the HMM allows the user to engage in actions pre-defined by the cloud provider.
As a baseline, the host roundtrip takes about 34.6 μs, and single-tenant access to
the FPGA takes close to 94 μs.
To evaluate the configuration overhead introduced by the proposed architecture,
the time needed to generate session IDs and session keys is assessed. The time
consumed in programming hardware accelerators with partial bitstreams is not
considered as the proposed architecture does not modify FPGA vendor tools. The
generation of a session ID in the Security Server takes about 10 ms. At the hardware
level, the HMM generates a new session key in 1.84 ns. Finally, a roundtrip between
the Security Server and the HMM (requesting a session key and collecting the result)
takes, on average, ~34 μs. Overall, the different configuration steps illustrated in
Fig. 2.10 only introduce configuration overhead on the order of milliseconds (~10 ms). After
prototyping the FPGA architecture on the Stratix V device, the HMM achieves a
maximum frequency of 542 MHz. The encryption and decryption steps consume 12
clock cycles (10 clock cycles for the ten rounds, one cycle for loading the key, and
one cycle for returning the result). At the level of the HMM, incoming packets take
14 cycles to reach the accelerator or be discarded (12 cycles to decrypt, one cycle
to extract the header, and one cycle to decide whether the packet will be accepted
or not). Each outgoing traffic value requires 13 cycles (1 cycle to insert the header,
12 cycles to encrypt) to be forwarded to VMs through the Cloud Interface.
Hategekimana et al. [27] proposed an isolation approach that incurred three clock
cycles of latency
on an FPGA. However, their architecture does not ensure confidentiality as data are
not encrypted.
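The cycle counts above can be cross-checked with quick arithmetic at the reported 542 MHz maximum frequency:

```python
# Quick check of the latency budget reported above: at the HMM's maximum
# frequency of 542 MHz, one clock cycle lasts roughly 1.85 ns (the chapter
# reports 1.84 ns).
F_MAX_HZ = 542e6
cycle_ns = 1e9 / F_MAX_HZ
print(round(cycle_ns, 2))                                     # ~1.85 ns per cycle

incoming_cycles = 12 + 1 + 1                   # decrypt + extract header + decide
outgoing_cycles = 1 + 12                       # insert header + encrypt
print(incoming_cycles, round(incoming_cycles * cycle_ns, 1))  # 14 cycles, ~25.8 ns
print(outgoing_cycles, round(outgoing_cycles * cycle_ns, 1))  # 13 cycles, ~24.0 ns
```

At tens of nanoseconds, the HMM traversal is negligible next to the tens-of-microseconds PCIe roundtrips reported above.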
In addition, roundtrip latencies were recorded when 25, 50, and 75 VMs
attempted to access a space-shared FPGA simultaneously. Though it may not be
practical to have 25, 50, or 75 VRs on a single device, this experiment evaluates
concurrent FPGA access time. For this purpose, each VR simply implements a
register that is written and read back by one of the VMs. Figure 2.15a, c, and e
present the recorded latencies. In each of the three test cases (25, 50, and 75
VMs), the roundtrip latency mostly lies between 0 and 1.2 ms, with some instances
around 200 and 400 ms. 77% of the IO operations were completed in less than
1 ms, 21% took about 200 ms, and 2% reached 400 ms. Since no packet loss was
experienced, the IO latencies observed are a combination of the virtualization layer
Fig. 2.15 IO and wait time evaluation on multi-tenant cloud FPGAs. (a) IO time with 25 VMs.
(b) Wait time with 25 VMs. (c) IO time with 50 VMs. (d) Wait time with 50 VMs. (e) IO time with
75 VMs. (f) Wait time with 75 VMs
running the VMs, the host operating system process scheduling, the FPGA driver
implementation, the FPGA interface response time, and the FPGA management
service processing time (read packet from user buffer, hardware call, read the result
from FPGA memory, write the result to user buffer, etc.). In general, FPGA multi-
tenancy resulted in about 10× slower IO speed than that of the single-tenant baseline
and a 29× drop compared to the average roundtrip time of the host. The average
wait time in the FPGA management service queue increases with the number of
VMs (Fig. 2.15b, d, and f). For example, the VM with the identifier 30 waits up to
8.8 ms (Fig. 2.15d), while the scheduler is busy with other requests.
Overall, these results are impacted by the First Come First Serve (FCFS)
scheduling policy coupled with the host process management policy. In short,
implementing FPGA multi-tenancy in the cloud may generally result in 10× slower
IO operations compared to single-tenant deployments, which nevertheless mostly
remain below a millisecond. Common FPGA cloud applications submit jobs
and collect the results after execution. Therefore, it may not be typical to have
constant IO operations between the VM and FPGA accelerators, which mitigates
the overall performance degradation observed in these experiments.
2.4 Discussion
This chapter addressed the secure sharing of hardware accelerators from different
tenants in FPGA-based clouds that operate in the IaaS paradigm. We focused
on architectural and system-level integration, ensuring that domain separation
is enforced, even in the presence of vulnerabilities, while sharing FPGAs. The
segmentation of an FPGA to ensure that hardware tasks can be reached regardless of
their position is of utmost importance. We believe that a NoC-based communication
strategy offers the flexibility and performance needed to run all accelerators in
parallel. Sharing also allows hardware accelerator resources to be adjusted on
demand, thus extending cloud elasticity to FPGAs.
To ensure domain separation, access control is one path to extending the well-
isolated user software domain to the hardware world. The FLASK architecture
is presented here as the basic infrastructure, using an (SMM, HMM) pair to enforce
security rules at the hardware level. This approach ensures confidentiality,
integrity, and availability by design. Implementation of this architecture
has demonstrated that the security infrastructure's overhead in area and latency
is negligible.
The adoption of this domain isolation in the cloud will be possible if automation
is introduced in system design at the level of cloud operators and cloud users.
Approaches that can automatically integrate the isolation infrastructure are highly
desirable. Our work at the University of Florida has led to the setup of an
FPGA-based cloud infrastructure for research on multi-tenancy. The cloud and
services provided are reachable through the following link: https://smartsystems.
ece.ufl.edu/research/projects/gatorrecc/. Design automation is currently the topic
of investigation, and results will be made available to the community in a timely
fashion.
Acknowledgment This work is partially funded by the National Science Foundation (NSF) under
Grant CNS 2007320.
References
6. Basak, A., Bhunia, S., & Ray, S. (2015). A flexible architecture for systematic implementation
of SoC security policies. In Proceedings of the IEEE/ACM International Conference on
Computer-Aided Design, ICCAD ’15 (pp. 536–543), Piscataway, NJ, USA: IEEE Press. http://
dl.acm.org/citation.cfm?id=2840819.2840894
7. Bassham III, L. E., Rukhin, A. L., Soto, J., Nechvatal, J. R., Smid, M. E., Barker, E. B., Leigh,
S. D., Levenson, M., Vangel, M., Banks, D. L., et al. (2010) SP 800-22 Rev. 1a. A statistical
test suite for random and pseudorandom number generators for cryptographic applications.
National Institute of Standards & Technology.
8. Baumann, A., Peinado, M., & Hunt, G. C. (2014). Shielding applications from an untrusted cloud
with Haven. ACM Transactions on Computer Systems, 33(3), 8:1–8:26.
9. Bobda, C., Mbongue, J. M., Chow, P., Ewais, M., Tarafdar, N., Vega, J. C., Eguro, K., Koch,
D., Handagala, S., Leeser, M., et al. (2022) The future of FPGA acceleration in datacenters
and the cloud. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 15(3),
1–42.
10. Bobda, C., Mead, J., Whitaker, T. J. L., Kamhoua, C. A., & Kwiat, K. A. (2017). Hardware
sandboxing: A novel defense paradigm against hardware trojans in systems on chip. In
Applied Reconfigurable Computing - 13th International Symposium, ARC 2017, Delft, The
Netherlands, April 3–7, 2017, Proceedings (pp. 47–59). https://doi.org/10.1007/978-3-319-
56258-2_5
11. Bobda, C., Whitaker, T. J. L., Kamhoua, C. A., Kwiat, K. A., & Njilla, L. (2017). Synthesis
of hardware sandboxes for trojan mitigation in systems on chip. In 2017 IEEE International
Symposium on Hardware Oriented Security and Trust, HOST 2017, McLean, VA, USA, May
1–5, 2017 (p. 172). https://doi.org/10.1109/HST.2017.7951836
12. Bookstein, A., Kulyukin, V. A., & Raita, T. (2002) Generalized hamming distance. Informa-
tion Retrieval, 5(4), 353–375.
13. Boule, M., & Zilic, Z. (2007). Efficient automata-based assertion-checker synthesis of SEREs
for hardware emulation. In 2007 Asia and South Pacific Design Automation Conference (pp.
324–329). https://doi.org/10.1109/ASPDAC.2007.358006
14. De Alfaro, L., & Henzinger, T. A. (2001) Interface automata. SIGSOFT Software Engineering
Notes, 26(5), 109–120. https://doi.org/10.1145/503271.503226. http://doi.acm.org/10.1145/
503271.503226
15. Dong, Y., Yang, X., Li, J., Liao, G., Tian, K., & Guan, H. (2012). High performance network
virtualization with SR-IOV. Journal of Parallel and Distributed Computing, 72(11), 1471–
1480.
16. Drzevitzky, S. (2010). Proof-carrying hardware: Runtime formal verification for secure
dynamic reconfiguration. In 2010 International Conference on Field Programmable Logic
and Applications pp. 255–258. https://doi.org/10.1109/FPL.2010.59
17. Drzevitzky, S., Kastens, U., & Platzner, M. (2009). Proof-carrying hardware: Towards runtime
verification of reconfigurable modules. In 2009 International Conference on Reconfigurable
Computing and FPGAs (pp. 189–194). https://doi.org/10.1109/ReConFig.2009.31
18. Elnaggar, R., Karri, R., & Chakrabarty, K. (2019). Multi-tenant FPGA-based reconfigurable
systems: Attacks and defenses. In 2019 Design, Automation & Test in Europe Conference &
Exhibition (DATE) (pp. 7–12). IEEE.
19. Emmi, M., Giannakopoulou, D., & Păsăreanu, C. S. (2008). Assume-guarantee verification for
interface automata. In FM 2008: Formal Methods: 15th International Symposium on Formal
Methods, Turku, Finland, May 26–30, 2008, Proceedings. Berlin: Springer.
20. Fahmy, S. A., Vipin, K., & Shreejith, S. (2015). Virtualized FPGA accelerators for efficient
cloud computing. In 2015 IEEE 7th International Conference on Cloud Computing Technology
and Science (CloudCom) (pp. 430–435). IEEE.
21. Fraser, K., Hand, S., Neugebauer, R., Pratt, I., Warfield, A., & Williamson, M. (2004).
Reconstructing I/O. Technical Report, University of Cambridge, Computer Laboratory.
22. Giechaskiel, I., Rasmussen, K. B., & Eguro, K. (2018). Leaky wires: Information leakage
and covert communication between FPGA long wires. In Proceedings of the 2018 on Asia
Conference on Computer and Communications Security, ASIACCS ’18 (pp. 15–27). New
37. Lind, J., et al. (2017). Glamdring: Automatic application partitioning for Intel SGX. In 2017
USENIX Annual Technical Conference (USENIX ATC 17) (pp. 285–298). Santa Clara, CA,
USA: USENIX Association. https://www.usenix.org/conference/atc17/technical-sessions/presentation/lind
38. Loscocco, P., & Smalley, S. (2001). Integrating flexible support for security policies into
the linux operating system. In Proceedings of the FREENIX Track: 2001 USENIX Annual
Technical Conference (pp. 29–42). Berkeley, CA, USA: USENIX Association. http://dl.acm.
org/citation.cfm?id=647054.715771
39. Mandebi Mbongue, J., Saha, S. K., & Bobda, C. (2021). Domain Isolation in FPGA-accelerated
cloud and data center applications. In Proceedings of the 2021 on Great Lakes Symposium on
VLSI (pp. 283–288).
40. Mavrogiannopoulos, N. (2021). Understanding the Red Hat Enterprise Linux random number
generator interface. February 11, 2021. https://www.redhat.com/en/blog/understanding-red-
hat-enterprise-linux-random-number-generator-interface
41. Mbongue, J. M., Hategekimana, F., Kwadjo, D. T., Andrews, D., & Bobda, C. (2018).
FPGAVirt: A novel virtualization framework for FPGAs in the cloud. In 11th IEEE
International Conference on Cloud Computing, CLOUD 2018, San Francisco, CA, USA, July
2–7, 2018 (pp. 862–865). https://doi.org/10.1109/CLOUD.2018.00122
42. Mbongue, J. M., Kwadjo, D. T., & Bobda, C. (2018). FLexiTASK: A flexible FPGA overlay
for efficient multitasking. In Proceedings of the 2018 on Great Lakes Symposium on VLSI,
GLSVLSI 2018, Chicago, IL, USA, May 23–25, 2018 (pp. 483–486). https://doi.org/10.1145/
3194554.3194644. http://doi.acm.org/10.1145/3194554.3194644
43. Mbongue, J. M., Kwadjo, D. T., Shuping, A., & Bobda, C. (2021). Deploying multi-
tenant FPGAs within linux-based cloud infrastructure. ACM Transactions on Reconfigurable
Technology and Systems (TRETS), 15(2), 1–31.
44. Mbongue, J. M., Saha, S. K., & Bobda, C. (2021). A security architecture for domain isolation
in multi-tenant cloud FPGAs. In 2021 IEEE Computer Society Annual Symposium on VLSI
(ISVLSI) (pp. 290–295). IEEE.
45. Mbongue, J. M., Shuping, A., Bhowmik, P., & Bobda, C. (2020). Architecture support for
FPGA multi-tenancy in the cloud. In 2020 IEEE 31st International Conference on Application-
Specific Systems, Architectures and Processors (ASAP) (pp. 125–132). IEEE.
46. Mead, J., Bobda, C., & Whitaker, T. J. L. (2016). Defeating drone jamming with hardware
sandboxing. In 2016 IEEE Asian Hardware-Oriented Security and Trust, AsianHOST 2016,
Yilan, Taiwan, December 19–20, 2016 (pp. 1–6). https://doi.org/10.1109/AsianHOST.2016.
7835557
47. Mell, P., Grance, T., et al. (2011). The NIST Definition of Cloud Computing, Special Publication
(NIST SP). National Institute of Standards and Technology, Gaithersburg, MD, USA.
48. Metzner, M., Lizarraga, J., & Bobda, C. (2015). Architecture virtualization for run-time
hardware multithreading on field programmable gate arrays. In Applied Reconfigurable
Computing - 11th International Symposium, ARC 2015, Bochum, Germany, April 13–17, 2015,
Proceedings (pp. 167–178). https://doi.org/10.1007/978-3-319-16214-0_14. http://dx.doi.org/
10.1007/978-3-319-16214-0_14
49. Nelson, M., Lim, B. H., Hutchins, G., et al. (2005). Fast transparent migration for virtual
machines. In USENIX Annual Technical Conference, General Track (pp. 391–394).
50. Peeters, E. (2015). SoC security architecture: Current practices and emerging needs. In
Proceedings of the 52Nd Annual Design Automation Conference, DAC ’15 (pp. 144:1–144:6).
New York, NY, USA: ACM. https://doi.org/10.1145/2744769.2747943. http://doi.acm.org/
10.1145/2744769.2747943
51. Putnam, A., Caulfield, A., Chung, E., Chiou, D., Constantinides, K., Demme, J., Esmaeilzadeh,
H., Fowers, J., Gopal, G., Gray, J., Haselman, M., Hauck, S., Heil, S., Hormati, A., Kim, J. Y.,
Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J., Xiao, P., & Burger, D. (2014).
A reconfigurable fabric for accelerating large-scale datacenter services. In 2014 ACM/IEEE
41st International Symposium on Computer Architecture (ISCA) (pp. 13–24). https://doi.org/
10.1109/ISCA.2014.6853195
52. Ramesh, C., Patil, S. B., Dhanuskodi, S. N., Provelengios, G., Paul, S., Holcomb, D., &
Tessier, R. (2018). FPGA side channel attacks without physical access. In FCCM 2018: 26th
IEEE International Symposium on Field-Programmable Custom Computing Machines.
53. Ray, S., & Jin, Y. (2015). Security policy enforcement in modern SoC designs. In Proceedings
of the IEEE/ACM International Conference on Computer-Aided Design, ICCAD ’15 (pp. 345–
350). Piscataway, NJ, USA: IEEE Press. http://dl.acm.org/citation.cfm?id=2840819.2840868
54. Sabt, M., Achemlal, M., & Bouabdallah, A. (2015). Trusted execution environment: What it is,
and what it is not. In Trustcom/BigDataSE/ISPA, 2015 IEEE (Vol. 1, pp. 57–64). https://doi.
org/10.1109/Trustcom.2015.357
55. Saeed, A., Ahmadinia, A., Just, M., & Bobda, C. (2014). An ID and address protection unit for
NoC based communication architectures. In Proceedings of the 7th International Conference
on Security of Information and Networks, SIN ’14 (pp. 288:288–288:294). New York, NY,
USA: ACM. https://doi.org/10.1145/2659651.2659719. http://doi.acm.org/10.1145/2659651.
2659719
56. Saha, S. K., & Bobda, C. (2020). FPGA accelerated embedded system security through
hardware isolation. In 2020 Asian Hardware Oriented Security and Trust Symposium
(AsianHOST) (pp. 1–6). IEEE.
57. Salot, P. (2013). A survey of various scheduling algorithm in cloud computing environment.
International Journal of Research in Engineering and Technology, 2(2), 131–135.
58. Costan, V., & Devadas, S. (2016). Intel SGX explained. Cryptology ePrint Archive, Report 2016/086.
59. Wiersema, T., Drzevitzky, S., & Platzner, M. (2014). Memory security in reconfigurable
computers: Combining formal verification with monitoring. In 2014 International Conference
on Field-Programmable Technology (FPT) (pp. 167–174). https://doi.org/10.1109/FPT.2014.
7082771
60. Xilinx (2014). TrustZone Technology Support in Zynq-7000 All Programmable SoCs.
61. Yee, B., Sehr, D., Dardyk, G., Chen, J. B., Muth, R., Ormandy, T., Okasaka, S., Narula, N., &
Fullagar, N. (2009). Native client: A sandbox for portable, untrusted x86 native code. In 2009
30th IEEE Symposium on Security and Privacy (pp. 79–93). https://doi.org/10.1109/SP.2009.
25
62. Zhang, B., Wang, X., Lai, R., Yang, L., Luo, Y., Li, X., & Wang, Z. (2010). A survey on I/O
virtualization and optimization. In 2010 Fifth Annual ChinaGrid Conference (ChinaGrid) (pp.
117–123). IEEE.
63. Zhang, F., Liu, G., Fu, X., & Yahyapour, R. (2018). A survey on virtual machine migration:
Challenges, techniques, and open issues. IEEE Communications Surveys & Tutorials, 20(2),
1206–1243.
64. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In 2018 IEEE
Symposium on Security and Privacy (SP) (Vol. 00, pp. 839–854). https://doi.org/10.1109/SP.
2018.00049. http://doi.org/doi.ieeecomputersociety.org/10.1109/SP.2018.00049
Chapter 3
Efficient and Secure Encryption for
FPGAs in the Cloud
3.1 Introduction
In the past few years, lightweight cryptography has become a popular research
discipline with a number of block and stream ciphers and hash functions being
proposed [5, 15, 17, 25]. Block ciphers and stream ciphers provide the cryptographic
operation of encryption. While encryption ensures message confidentiality,
it does not address message integrity, i.e., the situation in which an adversary
tampers with the ciphertext, potentially resulting in the receiver obtaining
an incorrect plaintext. A message authentication code (MAC) [42], sometimes
known as a tag, is a short piece of information used to authenticate a message,
to confirm that the message came from the stated sender (its authenticity) and
that it has not been changed. The MAC value protects both a message’s data
integrity and its authenticity, by allowing verifiers (who also possess the secret
key) to detect any changes to the message content. Authenticated Encryption
(AE) [42] or Authenticated Encryption with Associated Data (AEAD) [42] is a
form of encryption that simultaneously provides data confidentiality, integrity, and
authenticity assurances.
S. Banik
Università della Svizzera italiana, Lugano, Switzerland
e-mail: subhadeep.banik@usi.ch
F. Regazzoni
University of Amsterdam, Amsterdam, The Netherlands
In contrast with a symmetric key cryptosystem in which the same secret key is
used to encrypt and decrypt plaintexts and ciphertexts, asymmetric cryptosystems,
also called public-key cryptosystems (PKC), use encryption and decryption keys
that are not the same [42]. The encryption key (also called the public key) allows
anyone to encrypt a message. The encrypted message can only be decrypted by
a party in possession of the corresponding decryption key (also called the secret
key). A real-life analog is a locked mailbox in which anyone can deliver letters, but
the letter can only be read by the person who has the key to the mailbox. Public-key
cryptosystems such as RSA [51] derive their security from the difficulty of factoring
sufficiently large integers. There have been many attempts to factor large integers;
for an early and contemporary history of factoring, see [45, 58]. Although factorization
is a known hard problem on classical computers, it can be solved efficiently, i.e., in
polynomial time, on quantum computers using Shor's algorithm [55]. This has led
to the establishment of a NIST standardization process for post-quantum cryptography
[47], with the goal of designing algorithms that are secure against attacks that can be
performed on quantum computers.
Hardware platforms can be broadly classified into Application-Specific Inte-
grated Circuits (ASICs) and Field-Programmable Gate Arrays (FPGAs). An ASIC
is an integrated circuit (IC) chip customized for a particular use: as such the
designer can maximize the efficiency of silicon resources to construct circuits that
perform a specific task at very high speeds and consume little silicon area. These
chips are typically fabricated using complementary metal–oxide–semiconductor
(CMOS) technology, and the design cycle from conception to fabrication is a
lengthy and expensive one. FPGAs are more generic in the sense that they consist of
programmable logic blocks and interconnects that allow the same FPGA to be used
in many different applications. Typically, one encodes the circuit as an FPGA image
using software tools and uploads the image onto the device to obtain a circuit with
the required functionality. Furthermore, there are publicly available FPGA cloud
servers, such as Amazon AWS EC2 [2], which can be accessed remotely from
any corner of the world. Deploying circuits on FPGAs is much cheaper than an ASIC
design flow and requires a much shorter conception-to-deployment time window.
Efficient cryptographic primitives are needed to help protect communication to
and from FPGAs, and, when needed, data generated or processed by an FPGA. The
large body of research addressing efficient implementation of cryptographic algo-
rithms on standalone FPGAs serves as a base for the development and deployment
of efficient cloud FPGA cryptography. To this end, we examine implementation
strategies for cryptographic algorithms implemented on FPGA devices. As a typical
FPGA device can accommodate a large number of logic gates, the implementation
metric we target is throughput. By comparing various cryptographic algorithms
on the same FPGA device, we can draw conclusions about the algorithms'
respective merits when implemented on FPGAs.
3.2.1 Architectures
Both block and stream ciphers consist of similar transformations that are applied
repeatedly to public and private inputs to produce the output stream; see Fig. 3.1.
In the case of block ciphers, the public input is the plaintext, the private input is the
secret key, and the output is the encrypted plaintext, also called the ciphertext.
Block cipher circuits are implemented on hardware platforms in several flavors:
• Round-based circuits: These are circuits in which each round function is
executed in one clock cycle. The circuit architecture is equipped with logic
gates that execute the function, followed by a register on which the intermediate
outputs of the round function computation are written. If the block or stream
Fig. 3.1 Commonly, a block or stream cipher consists of the repeated application (r iterations) of a publicly known round function to the state, keyed by the secret key
cipher specification calls for R executions of the round function, then the circuit
requires exactly R clock cycles to execute the encryption operation.
• Multiple round–unrolled circuits: This is a simple extension of the above
philosophy: instead of one unit, r < R round function units are connected serially
in the circuit architecture. These circuits execute r round function
operations sequentially and thus require only ⌈R/r⌉ clock cycles. Because of the
higher hardware footprint, such circuits consume more power but take fewer
clock cycles to execute the encryption operation. Some circuits, e.g., for r = 2,
are known to be energy-optimal for specific block ciphers on ASIC platforms [6].
• Fully round–unrolled circuits: This takes unrolling to the extreme, i.e.,
r = R, so that only a single clock cycle is required to execute encryption.
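The cycle-count arithmetic behind these architecture choices can be sketched in a few lines. This is an illustrative model only: the achievable clock frequency is an assumed input here, whereas in practice f_max itself drops as r grows, because unrolling lengthens the critical path.

```python
from math import ceil

def cycles_per_block(R, r):
    """Clock cycles for one encryption with r serial round units and R total rounds."""
    return ceil(R / r)

def throughput_bps(block_bits, R, r, fmax_hz):
    """One block completes every ceil(R/r) cycles at clock frequency fmax_hz."""
    return block_bits * fmax_hz / cycles_per_block(R, r)
```

For example, a round-based AES-128 design (R = 10, r = 1) needs 10 cycles per block, while a 2x-unrolled one needs 5, doubling throughput if the clock frequency were unchanged.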
• AES: AES [22] is a block cipher with a substitution–permutation network
(SPN) type round function that supports 128-bit plaintexts and 128-bit, 192-bit,
or 256-bit keys. This includes a substitution layer that is generally non-linear.
In AES, 16 applications of an S-box function over {0, 1}^8 → {0, 1}^8 are used. It is
well known that the S-box used in AES is affine equivalent to the inverse function
in GF(2^8). The permutation network is generally a linear function that serves
the purpose of mixing the state bits among one another. The AES permutation
layer consists of ShiftRows and MixColumn operations. Since the block cipher
state can be interpreted as a 4 × 4 array of bytes, the ShiftRows operation
cyclically rotates the i-th row of the state by i bytes. The MixColumn
operation multiplies each column of the state by an MDS matrix over GF(2^8).
This is followed by an AddRoundKey operation in which a 128-bit RoundKey
is XORed into the state. The RoundKey for each round is produced from a
KeySchedule operation performed on the AES secret key.
• Present: Present [15] is a 64-bit block cipher that has an SPN type round
function. It has been adopted as a standard in ISO/IEC 29192-2. The
cipher specifications allow for both 80-bit and 128-bit keys, although we only
focus on the 80-bit version in this work. The only non-linear component in the
round function is a four-bit S-box (i.e., over {0, 1}^4 → {0, 1}^4), which is applied
in parallel to each of the sixteen nibbles of the 64-bit state after the RoundKey
addition. Thereafter, the state bits are rearranged by a permutation layer (in
hardware this is achieved at zero cost in energy and gate area by the simple crossing
of wires).
• Prince: Prince [17] is a 64-bit block cipher with an SPN type round function.
It allows for a 128-bit key but does not use any key-scheduling logic. Prince is
based on the FX construction: the 128-bit key is divided into the most and least
significant 8-byte blocks k_0, k_1, and a key k' is computed from them by a simple
rotate and add operation. k_0 and k' are used as whitening keys, and k_1 is used as
the RoundKey in every round. The cipher uses three types of round functions:
Forward, Middle, and Inverse. The Forward round consists of SubBytes and
MixColumn operations and the addition of a round constant and the RoundKey.
The Middle round consists of SubBytes, MixColumn, and Inverse SubBytes
layers. The Inverse rounds are structurally and functionally the opposite of the
Forward rounds. As a result, the Prince encryption operation is an involution.
The cipher minimizes encryption latency and finds use in applications such as
memory encryption.
• Midori: Midori [5] is an SPN-based block cipher designed for energy efficiency.
The specifications support both 64-bit and 128-bit plaintexts and a 128-bit key.
The 64-bit version has been the subject of invariant subspace attacks [31];
however, the 128-bit version remains secure, and among comparable block
ciphers it consumes the least energy. In this chapter, we focus on
Midori-128.
• Gift-128: Gift [9] was proposed as a redesign of the popular Present block cipher.
The idea was to design a cipher that would be efficient on both hardware and
software platforms and yet offer a high degree of security. The cipher is of SPN
type and, like Present, uses a bit permutation as the linear layer so as to minimize
the hardware footprint. The design supports both 64- and 128-bit plaintexts, and
we focus on the 128-bit version here.
Table 3.1 Synthesis results for block ciphers targeted to an Artix 7 xc7a200t device (latency, maximum frequency f_max, and maximum throughput TP_max)
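To make the linear layers described above concrete, the sketch below models AES's ShiftRows and MixColumn arithmetic over GF(2^8), and Present's pLayer as a pure index remapping (the "crossing of wires"). The pLayer formula P(i) = 16·i mod 63, with P(63) = 63, follows the published Present specification; this is an illustrative software model, not the hardware mapping itself.

```python
def xtime(b):
    """Multiply b by x in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def shift_rows(state):
    """AES ShiftRows: cyclically rotate row i of the 4x4 byte array left by i bytes."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]

def mix_column(col):
    """Multiply one 4-byte column by the AES MDS matrix circ(2, 3, 1, 1)."""
    a, b, c, d = col
    return [
        xtime(a) ^ xtime(b) ^ b ^ c ^ d,   # 2a + 3b + c + d
        a ^ xtime(b) ^ xtime(c) ^ c ^ d,   # a + 2b + 3c + d
        a ^ b ^ xtime(c) ^ xtime(d) ^ d,   # a + b + 2c + 3d
        xtime(a) ^ a ^ b ^ c ^ xtime(d),   # 3a + b + c + 2d
    ]

def present_p(i):
    """Present pLayer: input bit i is wired to output position 16*i mod 63 (63 -> 63)."""
    return 63 if i == 63 else (16 * i) % 63
```

The MixColumn sketch reproduces the worked example in the AES standard (FIPS-197), where the column [db, 13, 53, 45] maps to [8e, 4d, a1, bc].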
For the experiments presented in this section, our target platform was the xc7a200t
Xilinx device from the Artix7 family. The following design flow was used: the
design was first implemented in register-transfer-level (RTL) code; a functional
simulation was then performed using Mentor Graphics ModelSim SE; and the design
was then synthesized, mapped, placed, and routed using Xilinx Vivado version
2021.2 (Table 3.1).
A stream cipher takes a secret key K, which is usually a small binary string
of around 80–256 bits, as input and applies a set of rules to produce a long
sequence of pseudorandom bits, bytes, or words (the keystream). The sequence is
pseudorandom in the sense that it cannot be distinguished from a truly random
sequence in practical time.
This sequence of bits, bytes, or words is usually XORed with each bit, byte, or
word of the plaintext to produce the encrypted ciphertext. So, if P = p_0, p_1, p_2, ...
represents the bits, bytes, or words of the plaintext, and κ = k_0, k_1, k_2, ... represents
the keystream bits, bytes, or words produced by the stream cipher using the secret
key K, then the encryption rule is given by

c_i = p_i ⊕ k_i, ∀i,

where C = c_0, c_1, ... represents the ciphertext bits, bytes, or words. Since the secret
key is already known to the receiver, the receiver can compute the keystream bits
k_0, k_1, ... and use them to decrypt the ciphertext as follows:
p_i = c_i ⊕ k_i, ∀i.
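As the two rules above show, encryption and decryption are the same XOR operation, which a short sketch makes explicit:

```python
def xor_stream(data: bytes, keystream: bytes) -> bytes:
    """c_i = p_i XOR k_i; applying the same function again recovers the plaintext."""
    return bytes(d ^ k for d, k in zip(data, keystream))
```

Applying `xor_stream` to a ciphertext with the same keystream yields the original plaintext, which is why sender and receiver only need to agree on the key (and hence the keystream).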
The eSTREAM portfolio ciphers fall into two profiles. Profile 1 stream ciphers
are particularly suitable for hardware applications with restricted resources such
as limited storage, gate count, and power consumption. Profile 2 contains stream
ciphers more suitable for software applications with high throughput requirements.
The portfolio [21] currently contains the following ciphers (Table 3.2):
Fig. 3.2 Structure of Trivium. The AND gates s_91·s_92, s_175·s_176, s_286·s_287 are added to the
leftmost XOR gates before the 3rd, 1st, and 2nd registers, respectively. The registers have been
omitted for ease of depiction. The keystream bit produced every clock cycle is given as z = t_1 + t_2 + t_3
• Trivium: Trivium [23] is a stream cipher designed for the eSTREAM project by
De Cannière and Preneel and is currently an ISO standard under ISO/IEC 29192-
3:2012. Trivium has an internal state of 288 bits that is divided into 3 registers of
sizes 93, 84, and 111 bits, respectively; see Fig. 3.2. The stream cipher uses an
80-bit key and an 80-bit initialization vector (IV) to initialize the state.
The state is then updated for 4 × 288 = 1152 iterations using the very simple
update function shown partially in Fig. 3.2 before keystream is produced.
• Grain 128: The Grain family of stream ciphers is one of the candidates in
the eSTREAM hardware portfolio [34]. Its simplicity and elegance in design
have attracted considerable attention from cryptologists worldwide. The family
consists of three ciphers: Grain v1, Grain 128, Grain 128a. In this chapter, we
focus on Grain 128 [32] that offers 128-bit security.
Like the other members of the Grain family, Grain 128 has a connected
register structure as shown in Fig. 3.3. Grain-128 consists of a 128-bit linear-feedback
shift register (LFSR) and a 128-bit non-linear-feedback shift register
(NFSR) and uses a 128-bit key K. Given that L_t = [l_t, l_{t+1}, ..., l_{t+127}] is the
LFSR state at the t-th clock interval, Grain-128's LFSR is defined by a linear update
function f. The NFSR state is updated as n_{t+128} = l_t + g(·) for a non-linear
NFSR update function g. Every clock cycle, the cipher outputs the keystream bit

z_t = Σ_{j∈A} n_{t+j} + l_{t+93} + h(X_t, Y_t),

where A = {2, 15, 36, 45, 64, 73, 89}, h(s_0, ..., s_8) = s_0·s_1 + s_2·s_3 + s_4·s_5 + s_6·s_7 +
s_0·s_4·s_8, and (s_0, ..., s_8) = (n_{t+12}, l_{t+8}, l_{t+13}, l_{t+20}, n_{t+95}, l_{t+42}, l_{t+60}, l_{t+79},
l_{t+95}). The cipher is initialized with a 128-bit key and a 96-bit IV, and the state is
clocked 256 times before keystream generation begins.
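To illustrate how little logic these hardware-oriented keystream generators need, the following Python model implements Trivium's state update exactly as sketched in Fig. 3.2. It is a bit-level sketch for clarity, not performance; the key/IV loading order used here is one common convention, so its output may be bit-reversed relative to official test vectors.

```python
def trivium_keystream(key, iv, nbits):
    """Trivium model: key and iv are 80-bit lists of 0/1; returns nbits of keystream."""
    # Load the 288-bit state: 93-, 84-, and 111-bit registers.
    s = key[:80] + [0] * 13 + iv[:80] + [0] * 4 + [0] * 108 + [1, 1, 1]
    out = []
    for clk in range(4 * 288 + nbits):          # 4 * 288 = 1152 blank init rounds
        t1 = s[65] ^ s[92]
        t2 = s[161] ^ s[176]
        t3 = s[242] ^ s[287]
        z = t1 ^ t2 ^ t3                        # keystream bit z = t1 + t2 + t3
        t1 ^= (s[90] & s[91]) ^ s[170]          # AND gates of Fig. 3.2
        t2 ^= (s[174] & s[175]) ^ s[263]
        t3 ^= (s[285] & s[286]) ^ s[68]
        s = [t3] + s[:92] + [t1] + s[93:176] + [t2] + s[177:287]
        if clk >= 4 * 288:
            out.append(z)
    return out
```

Since the same key and IV regenerate the same keystream, XORing the ciphertext with a freshly generated keystream recovers the plaintext.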
Our target platform was again the xc7a200t Xilinx device from the Artix7 family,
and results are reported in Table 3.3. Since stream ciphers, once initialized,
continuously produce keystream bits every clock cycle, we can experiment with
different numbers of unrolled rounds r for each stream cipher. The higher the value
of r, the higher the device utilization, but also the higher the throughput.
3.4.1 AES-GCM
GCM has been proven secure in a concrete security model: it is secure when it is
used with a block cipher that is indistinguishable from a random permutation.
However, security depends on choosing a unique initialization vector for every
encryption performed with the same key (see stream cipher attack). For any given
key and initialization vector combination, GCM is limited to encrypting 2^39 − 256
bits of plaintext (64 GiB).
GCM combines the well-known counter mode of encryption with the newer Galois
mode of authentication [26]. The key feature is the ease of parallel computation of
the Galois field multiplication used for authentication, which permits higher
throughput than chaining modes of operation such as CBC. The
GF(2^128) field used is defined by the polynomial x^128 + x^7 + x^2 + x + 1.
The authentication tag is constructed by feeding blocks of data into the GHASH
function and encrypting the result. The GHASH function is defined by
GHASH(H, A, C) = X_{m+n+1},
where H = E_K(0^128) is the hash key (a string of 128 zero bits encrypted using
the block cipher), A are data that are only authenticated (not encrypted), C is the
ciphertext, m is the number of 128-bit blocks in A (rounded up), n is the number of
128-bit blocks in C (rounded up), and the variable X_i for i = 0, ..., m + n + 1 is
defined below.
First, the authenticated text and the ciphertext are separately zero-padded to
multiples of 128 bits and combined into a single message S_i:

S_i = A_i                for i = 1, ..., m − 1
      A*_m ‖ 0^{128−v}   for i = m
      C_{i−m}            for i = m + 1, ..., m + n − 1
      C*_n ‖ 0^{128−u}   for i = m + n
      len(A) ‖ len(C)    for i = m + n + 1
where len(A) and len(C) are the 64-bit representations of the bit lengths of A and
C, respectively, v = len(A) mod 128 is the bit length of the final block of A, u =
len(C) mod 128 is the bit length of the final block of C, and ‖ denotes concatenation
of bit strings. Then X_i is defined as

X_i = 0 for i = 0, and X_i = (X_{i−1} ⊕ S_i) · H otherwise, which equals Σ_{j=1}^{i} S_j · H^{i−j+1}.

The second form is an efficient iterative algorithm (each X_i depends on X_{i−1})
produced by applying Horner's method to the first. Only the final X_{m+n+1} is
retained as output.
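The Horner iteration above can be sketched directly. The `gf_mul` below multiplies polynomials over GF(2)[x] and reduces modulo x^128 + x^7 + x^2 + x + 1; for clarity this sketch uses plain most-significant-bit-first integers, whereas the GCM specification uses a reflected bit ordering, so it illustrates the algebra rather than matching GCM test vectors.

```python
P = (1 << 128) | (1 << 7) | (1 << 2) | (1 << 1) | 1   # x^128 + x^7 + x^2 + x + 1

def gf_mul(a, b):
    """Carry-less multiply of two 128-bit polynomials, reduced modulo P."""
    r = 0
    while b:                                   # schoolbook carry-less multiply
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for i in range(r.bit_length() - 1, 127, -1):   # reduce the product mod P
        if (r >> i) & 1:
            r ^= P << (i - 128)
    return r

def ghash(h, blocks):
    """Horner form: X_i = (X_{i-1} XOR S_i) * H over the 128-bit integer blocks."""
    x = 0
    for s in blocks:
        x = gf_mul(x ^ s, h)
    return x
```

The test below checks that the Horner form agrees with the direct sum S_1·H^2 ⊕ S_2·H for two blocks, which is exactly the equivalence stated above.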
The most critical operation in GCM is multiplication in the finite field GF(2^128).
The multiplier uses the irreducible polynomial p(x) = x^128 + x^7 + x^2 + x + 1
to compute C = A·B mod p(x). In [44], several implementation options for such a
multiplier are proposed, including bit-parallel, digit-serial, and hybrid multipliers.
Bit-parallel multipliers use multiplication by x as the fundamental circuit of
computation and replicate it 128 times for the complete operation. Digit-serial
multipliers take this idea forward by making multiplication by x^m the basic unit.
Hybrid multipliers redefine the original finite field GF(2^k) as GF((2^m)^n), where
k = mn. Arithmetic calculations can then be performed using circuits in the subfield
GF(2^m) and by combining them in the extension field GF((2^m)^n).
The above architectures take more than one clock cycle to compute the multiplica-
tion result. Since our core encryption algorithm will operate in a single clock cycle,
we propose an architecture that will compute the multiplication in a single cycle.
Let A(x), B(x) be two polynomials of degree 2k − 1. The Karatsuba method of
multiplying them requires that we first split both polynomials into two degree-(k − 1)
polynomials, A(x) = x^k·a_H(x) ⊕ a_L(x) and B(x) = x^k·b_H(x) ⊕ b_L(x).
The multiplication then requires the following logic operations over k-bit
polynomials:
1. Compute S = (a_L ⊕ a_H) · (b_L ⊕ b_H).
2. Compute L = a_L · b_L and H = a_H · b_H.
3. Compute M = S ⊕ L ⊕ H.
It can be seen that A(x)·B(x) = x^{2k}·H ⊕ x^k·M ⊕ L. Thus the original 2k-bit multiplier
requires three k-bit multipliers plus some gates performing linear operations, and one
can recursively define multiplication over 128-bit polynomials in terms of multiplication
over 64-bit polynomials, which in turn can be defined over 32-bit polynomials, and so on.
The base case is multiplication over 2-bit (i.e., degree-1) polynomials, which can be
constructed as a look-up table {0, 1}^4 → {0, 1}^4 that takes the four coefficient bits
of the two 2-bit polynomials
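The recursive decomposition can be sketched and checked against schoolbook carry-less multiplication. The base-case width is a parameter here (the chapter's hardware design bottoms out at 2-bit operands realized as a small look-up table); this is a software model of the arithmetic, not the single-cycle circuit itself.

```python
def clmul(a, b):
    """Schoolbook carry-less multiplication of integer-encoded GF(2) polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_clmul(a, b, k):
    """Multiply two 2k-bit polynomials using three k-bit multiplications."""
    if k <= 2:                        # tiny base case (a look-up table in hardware)
        return clmul(a, b)
    mask = (1 << k) - 1
    aL, aH = a & mask, a >> k
    bL, bH = b & mask, b >> k
    L = karatsuba_clmul(aL, bL, k // 2)
    H = karatsuba_clmul(aH, bH, k // 2)
    S = karatsuba_clmul(aL ^ aH, bL ^ bH, k // 2)
    M = S ^ L ^ H                     # middle term M = S xor L xor H
    return (H << (2 * k)) ^ (M << k) ^ L
```

Calling `karatsuba_clmul(a, b, 64)` on two 128-bit operands recurses through 64-, 32-, 16-, 8-, and 4-bit multipliers down to the base case, mirroring the recursive hardware construction.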
Here S(x) denotes the S-box mapping. If x_0, x_1, x_2, x_3 are the four bytes in a
column after the ShiftRows operation, it can be seen that ⊕_i T_i(x_i) is the AES round
function output for that column. Since the circuit combines the S-box and part of
the MixColumn operation, it is natural to try to implement this circuit to minimize
circuit depth and, hence, the critical path. However, the downside of this approach is the
large silicon area required to construct four 8-bit to 32-bit tables.
The AES-GCM circuit is shown in Fig. 3.4. In the 1st clock cycle, the hash key
H = E_K(0) is computed and stored in the H register. Thereafter, every 128-bit
block of plaintext and associated data is processed in one clock cycle to produce
ciphertext. Simultaneously, the MAC is computed with a Horner-like iteration
over the H register and an auxiliary register using the single-cycle finite field
multiplier. Thus processing n blocks of data takes n + 1 clock cycles.
Fig. 3.4 The AES-GCM circuit, consisting of the AES core, the H register, and the single-cycle multiplier
The results in Table 3.4 were obtained after the designs were synthesized, mapped,
placed, and routed for an xc7a200t device.
The results show that for most architectures the total critical path is around 5.7
ns, which provides a throughput of about 22 Gbps. The GCM algorithm may be
further parallelized by a factor of k by replicating circuit resources.
Replication allows for throughput well above 100 Gbps depending on the type of
application.
3.4.3 GIFT-COFB
Fig. 3.5 Schematic of the GIFT-COFB mode, processing associated data blocks AD_1, ..., AD_a and message blocks M_1, ..., M_m with the block cipher E_K, the function G, and an LFSR to produce ciphertext blocks CT_1, ..., CT_m and the tag
The GIFT-128 block cipher [9] was designed in 2017. It has 40 rounds, each
consisting of a substitution layer composed of 4-bit S-boxes, and it uses a bit
permutation over 128 bits as the linear layer. It is efficient on both software and
hardware platforms.
The GIFT-COFB circuit is shown in Fig. 3.6. In the 1st clock cycle, the LFSR
L is updated with the top half of E_K(Nonce). Thereafter, every 128-bit block of
plaintext/associated data is processed in one clock cycle to produce ciphertext.
Simultaneously, the LFSR is updated using finite field computations. After the plaintext and
Fig. 3.6 The GIFT-COFB circuit, showing the GIFT-128 core, the state register, and the L register
associated data (AD) have been processed, the mode uses one additional encryption
call to produce the MAC. Thus, the processing of n blocks of data takes only n + 2
clock cycles.
3.4.4 ROMULUS
ROMULUS is an AEAD scheme designed by Iwata et al. [37] that uses the SKINNY
family of block ciphers. In this chapter, we provide Romulus-N1 implementations.
Romulus-N1 makes 1/2 a block cipher call per associated data block and 1
block cipher call per message block. It uses a 128-bit key, a 128-bit nonce, and a
variable-length message chopped into 128-bit blocks, and it produces a 128-bit tag.
Each output of the block cipher and the incoming data block (associated data or
message) are passed through a light combinatorial function denoted by ρ. The function
ρ(S, M) = (S', C) is defined as S' ← S ⊕ M and C ← G(S) ⊕ M. For
each byte, G performs the following operation: G(x_7‖x_6‖x_5‖x_4‖x_3‖x_2‖x_1‖x_0) :=
(x_0 ⊕ x_7)‖x_7‖x_6‖x_5‖x_4‖x_3‖x_2‖x_1. The output of this function is immediately input
to the next block cipher call. Hence, a register keeps this running state, and at the
last step, it is encrypted to produce the tag.
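The ρ function and the byte-wise G map defined above are simple enough to sketch and check for invertibility; this is a sketch of just these two components, not of the full Romulus-N1 mode.

```python
def G_byte(b):
    """Per-byte map G(x7..x0) = (x0 xor x7) || x7 || x6 || ... || x1."""
    return ((((b >> 7) ^ b) & 1) << 7) | (b >> 1)

def rho(S, M):
    """rho(S, M) = (S', C) with S' = S xor M and C = G(S) xor M."""
    Sp = bytes(s ^ m for s, m in zip(S, M))
    C = bytes(G_byte(s) ^ m for s, m in zip(S, M))
    return Sp, C

def rho_inverse(S, C):
    """Receiver side: M = G(S) xor C, then the next state S' = S xor M."""
    M = bytes(G_byte(s) ^ c for s, c in zip(S, C))
    Sp = bytes(s ^ m for s, m in zip(S, M))
    return Sp, M
```

Because the receiver also knows the running state S, it can recover M from C and advance its own copy of the state in lockstep with the sender.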
Romulus handles odd and even authenticated data blocks differently: the odd
blocks are input to ρ, and the even blocks are fed to the nonce port of the block cipher, as
the underlying cipher SKINNY-128-384 has a 384-bit-long TWEAKEY. The actual
AEAD nonce is not used until all authenticated data blocks have been processed; it is
then used as the block cipher nonce while message blocks are encrypted. A 56-bit
LFSR is also part of the TWEAKEY for SKINNY calls and keeps count of the
authenticated data and message blocks fed into the AEAD circuit from the beginning
of the AE operation. Figure 3.7 depicts the two phases the full AEAD operation
passes through, namely the processing of (1) associated data and (2) message blocks.
Fig. 3.7 The high-level view of Romulus-N1, which depicts the processing of 2a associated data
and m message blocks. L denotes the 56-bit LFSR that counts the number of processed blocks,
and d denotes a single-byte domain separator followed by 0^64
Fig. 3.8 The Romulus-N1 circuit, consisting of the SKINNY-128-384 core, the state register, and the L register
The circuit is shown in Fig. 3.8. In addition to the tweakable block cipher, the circuit
contains an LFSR L that supplies a part of the tweak. After all the plaintext/AD blocks
have been processed, the mode uses one additional encryption call to produce the
MAC. Thus processing n blocks of data takes only n + 1 clock cycles.
ASCON 128 has been declared the winner of the NIST lightweight cryptography
competition. It is a permutation-based AEAD: the core cryptographic primitive
used in the design is a permutation function over 320 bits rather than a block cipher.
The ASCON permutation uses a state size of 320 bits (consisting of five 64-bit
words x_0, x_1, x_2, x_3, x_4) that is updated in four phases: initialization, processing
of associated data, processing of plaintext/ciphertext, and finalization.
Fig. 3.9 The ASCON 128 circuit, built around the permutation p^6 applied to the 320-bit state register; the state is initialized with IV‖K‖N
All phases use the same permutation function p, which is applied 12 times in the
initialization and finalization phases and six times in the data processing phases. The
data, i.e., both the plaintext and AD, are handled in 64-bit blocks. Processing of the
optional associated data takes place after the initialization phase.
In the encryption phase, each plaintext block P_i is XORed with the secret
state to produce one ciphertext block C_i. After the generation of the ciphertext, the
finalization phase starts. The output of finalization is a 128-bit tag.
The circuit is shown in Fig. 3.9. The core circuit is the ASCON permutation p^6, i.e.,
the round function p iterated 6 times. Hence, initialization and finalization take two
cycles each, and processing each 64-bit block of plaintext or AD takes one cycle. Thus
processing n blocks of 128-bit data takes only 2n + 2 + 2 = 2n + 4 clock cycles.
Table 3.5 compares synthesis results for the two lightweight schemes with AES-GCM.
It is seen that ASCON 128 and GIFT-COFB have an advantage in terms of
throughput over AES-GCM. In particular, for ASCON 128 the most latency-intensive
part of the circuit is the permutation circuit, which requires only six rounds to be
implemented consecutively; this reduces latency and elevates the maximum
operable frequency and, hence, the throughput.
The last family of cryptographic primitives that we describe are post-quantum
cryptographic (PQC) algorithms. These algorithms are meant to run on classical
computers but are designed to withstand the computational power of quantum
computers. Progress in the study of quantum computers will eventually make it
possible to run algorithms such as Shor's algorithm [56] and solve the problems on
which the security of current asymmetric cryptography is based (e.g., integer
factorization). Symmetric cryptography must also be resistant to the increased
computational power that quantum computers will bring: for instance, because of
Grover's algorithm [29], there is a need to use bigger key sizes (e.g., a change from
128-bit to 256-bit keys for the AES algorithm).
Because of this threat, governmental bodies have started initiatives to standardize
post-quantum algorithms. The most relevant initiative has been carried out by
NIST, starting in 2017 [47]. Researchers from all over the world could submit
standardization proposals. Several families of mathematical problems have been
explored, including lattice-based cryptography, isogeny-based cryptography, and
code-based cryptography. In addition to mathematical security, these algorithms
must be suitable for deployment in systems and applications, including applications
based on cloud FPGAs. This section summarizes the lessons learned in deploying
PQC algorithms on FPGAs in the last few years.
A large number of NIST submissions were based on lattice problems, such as
learning with errors (LWE) [48] and ring learning with errors (R-LWE) [40]. The
advantage of R-LWE over standard lattices is that the lattice matrix is generated
from a single row; because of this, complex matrix multiplications are replaced
by polynomial multiplications. When implementing lattice-based schemes,
the sampler and the polynomial multiplication functions form the bottleneck. The
sampling step can use binomial or uniform distributions, but the majority of the
designs use discrete Gaussian distributions. Several structures have been proposed
for the implementation of discrete Gaussian samplers, and the most notable ones are
rejection sampling, Bernoulli sampling, cumulative distribution (CDT) sampling,
discrete Ziggurat sampling, and Knuth–Yao sampling [36].
Often, multiplication performance is improved by using the number theoretic
transform (NTT), which, after a conversion into the spectral domain, reduces complex
polynomial multiplication to a point-wise multiplication. Several optimized hardware
implementations of the NTT are available, including ones specifically designed
for FPGAs [3, 46, 52]. A common feature of NTT hardware architectures is a
butterfly structure, which is often implemented as a dedicated unit [49]. In FPGA
designs, the processing logic that implements the core of the butterfly computations
is interleaved with Block RAMs (BRAMs) 1 that are used to store NTT polynomial
coefficients. The architecture is implemented sequentially when area is a
constraint [46]. The amount of pre-computed factors versus on-the-fly computation is
an important tradeoff. Examples of both designs have been reported [3, 46, 53].
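To illustrate why the NTT is attractive, the toy model below multiplies polynomials in Z_q[x]/(x^n + 1) by transforming, multiplying point-wise, and transforming back. The parameters (q = 17, n = 8, ψ = 3) are toy values chosen so that ψ is a 2n-th root of unity with ψ^n ≡ −1 mod q, and the O(n²) transform stands in for the butterfly network used in hardware; real lattice schemes use much larger parameters.

```python
q, n = 17, 8
psi = 3                       # 2n-th root of unity mod q (psi^n = -1 mod q)
omega = psi * psi % q         # n-th root of unity used by the transform

def transform(a, w):
    """Naive O(n^2) number theoretic transform with root w (butterflies in HW)."""
    return [sum(a[j] * pow(w, i * j, q) for j in range(n)) % q for i in range(n)]

def negacyclic_mul(a, b):
    """Point-wise multiplication in the NTT domain gives a*b mod (x^n + 1, q)."""
    abar = [x * pow(psi, j, q) % q for j, x in enumerate(a)]   # psi-weighting
    bbar = [x * pow(psi, j, q) % q for j, x in enumerate(b)]
    ch = [x * y % q for x, y in zip(transform(abar, omega), transform(bbar, omega))]
    cbar = transform(ch, pow(omega, q - 2, q))                 # inverse transform core
    n_inv, psi_inv = pow(n, q - 2, q), pow(psi, q - 2, q)
    return [c * n_inv % q * pow(psi_inv, k, q) % q for k, c in enumerate(cbar)]

def schoolbook_mul(a, b):
    """Reference: multiply polynomials and reduce modulo x^n + 1 (x^n = -1)."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            s = a[i] * b[j] * (-1 if i + j >= n else 1)
            c[(i + j) % n] = (c[(i + j) % n] + s) % q
    return c
```

The point-wise step is where the quadratic convolution collapses to n multiplications, which is the saving that fast butterfly-based NTT hardware exploits.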
Quantum-resistant cryptography can also be achieved by leveraging approaches
other than lattices. FPGA implementations of other families of post-quantum
algorithms have also been explored, including designs and implementations of
Classic McEliece [20], BIKE [28], and HQC [24].
Physical attacks are a threat to post-quantum algorithm implementations. In
fact, physical attack resistance was explicitly mentioned by NIST among the
selection criteria for the new standard. Common attacks against post-quantum
algorithms include timing attacks, power analysis attacks, and fault attacks [16].
Countermeasures typically aim to protect the secret key using approaches previously
explored for block ciphers; they include masking [50] to protect against power
analysis attacks, constant-time implementations [36] to mitigate timing attacks, and
dedicated countermeasures to prevent fault attacks [35].
Acknowledgments This work is partially supported by the EU Horizon 2020 Programme under
grant agreement No. 957269 (EVEREST).
References
1. AlFardan, N. J., & Paterson, K. G. (2013). Lucky thirteen: Breaking the TLS and DTLS record
protocols. In 2013 IEEE Symposium on Security and Privacy, SP 2013, Berkeley, CA, USA,
May 19–22, 2013 (pp. 526–540). IEEE Computer Society. https://doi.org/10.1109/SP.2013.42.
1 BRAMs are RAMs usually used to store a large amount of data within FPGAs
17. Borghoff, J., Canteaut, A., Güneysu, T., Kavun, E. B., Knezevic, M., Knudsen, L. R., Leander,
G., Nikov, V., Paar, C., Rechberger, C., Rombouts, P., Thomsen, S. S., Yalçin, T. (2012).
PRINCE - A low-latency block cipher for pervasive computing applications - extended
abstract. In X. Wang, K. Sako (Eds.), Advances in Cryptology - ASIACRYPT 2012 - 18th
International Conference on the Theory and Application of Cryptology and Information
Security, Beijing, China, December 2–6, 2012. Proceedings, Lecture Notes in Computer
Science (vol. 7658, pp. 208–225). Springer. https://doi.org/10.1007/978-3-642-34961-4_14.
18. Cannière, C. D., & Preneel, B. (2005). TRIVIUM -Specifications. eSTREAM, ECRYPT
Stream Cipher Project Report. http://www.ecrypt.eu.org/stream/p3ciphers/trivium/trivium_p3.
pdf.
19. Chakraborti, A., Iwata, T., Minematsu, K., & Nandi, M. (2017). Blockcipher-based authen-
ticated encryption: How small can we go? In Cryptographic Hardware and Embedded
Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan, September 25–28,
2017, Proceedings (pp. 277–298). https://doi.org/10.1007/978-3-319-66787-4_14.
20. Chen, P. J., Chou, T., Deshpande, S., Lahr, N., Niederhagen, R., Szefer, J., & Wang, W. (2022).
Complete and improved FPGA implementation of Classic McEliece. Cryptology ePrint
Archive.
21. Cid, C., & Robshaw, M. (Eds.). (2012). The eSTREAM Portfolio in 2012, 16 January 2012,
Version 1.0. eSTREAM, ECRYPT Stream Cipher Project Report. http://www.ecrypt.eu.org/
documents/D.SYM.10-v1.pdf.
22. Daemen, J., & Rijmen, V. (2005). Rijndael/AES. In H. C. A. van Tilborg (Ed.), Encyclopedia of
cryptography and security. Springer. https://doi.org/10.1007/0-387-23483-7_358.
23. De Canniere, C., & Preneel, B. (2008). Trivium. In New Stream Cipher Designs: The eSTREAM
Finalists (pp. 244–266). Springer.
24. Deshpande, S., Xu, C., Nawan, M., Nawaz, K., & Szefer, J. (2022). Fast and efficient hardware
implementation of HQC. Cryptology ePrint Archive.
25. Dobraunig, C., Eichlseder, M., Mendel, F., Schläffer, M. (2021). Ascon v1.2: Lightweight
Authenticated Encryption and Hashing. Journal of Cryptology, 34(3), 33. https://doi.org/10.
1007/s00145-021-09398-9.
26. Dworkin, M. (2007). Recommendation for block cipher modes of operation: Galois/counter
mode (GCM) and GMAC. Tech. rep., National Institute of Standards and Technology.
27. eSTREAM, The ECRYPT Stream Cipher Project. (2012). eSTREAM, The ECRYPT Stream
Cipher Project. https://www.ecrypt.eu.org/stream/.
28. Galimberti, A., Galli, D., Montanaro, G., Fornaciari, W., & Zoni, D. (2022). FPGA implemen-
tation of BIKE for quantum-resistant TLS. In 2022 25th Euromicro Conference on Digital
System Design (DSD) (pp. 539–547). IEEE.
29. Grover, L. K. (1997). Quantum mechanics helps in searching for a needle in a haystack.
Physical Review Letters, 79(2), 325.
30. Güneysu, T., & Moradi, A. (2011). Generic side-channel countermeasures for reconfigurable
devices. In International Workshop on Cryptographic Hardware and Embedded Systems (pp.
33–48). Springer.
31. Guo, J., Jean, J., Nikolić, I., Qiao, K., Sasaki, Y., & Sim, S. M. (2015). Invariant subspace
attack against full Midori64. Cryptology ePrint Archive.
32. Hell, M., Johansson, T., Maximov, A., & Meier, W. (2008). A Stream Cipher Proposal: Grain-
128. eSTREAM, ECRYPT Stream Cipher Project Report. http://www.ecrypt.eu.org/stream/
p3ciphers/grain/Grain128_p3.pdf.
33. Hell, M., Johansson, T., & Meier, W. (2005). Grain - A Stream Cipher for Constrained
Environments. eSTREAM, ECRYPT Stream Cipher Project Report. http://www.ecrypt.eu.
org/stream/p3ciphers/grain/Grain_p3.pdf.
34. Hell, M., Johansson, T., & Meier, W. (2007). Grain: a stream cipher for constrained environ-
ments. International Journal of Wireless and Mobile Computing, 2(1), 86–93.
35. Howe, J., Khalid, A., Martinoli, M., Regazzoni, F., & Oswald, E. (2019). Fault attack
countermeasures for error samplers in lattice-based cryptography. In 2019 IEEE International
Symposium on Circuits and Systems (ISCAS) (pp. 1–5). IEEE.
3 Efficient and Secure Encryption for FPGAs in the Cloud 79
36. Howe, J., Khalid, A., Rafferty, C., Regazzoni, F., & O’Neill, M. (2016). On practical discrete
Gaussian samplers for lattice-based cryptography. IEEE Transactions on Computers, 67(3),
322–334.
37. Iwata, T., Khairallah, M., Minematsu, K., & Peyrin, T. (2019). Romulus v1.2. NIST
Lightweight Cryptography Project. https://csrc.nist.gov/Projects/lightweight-cryptography/
round-2-candidates.
38. Jean, J., Moradi, A., Peyrin, T., & Sasdrich, P. (2017). Bit-sliding: a generic technique
for bit-serial implementations of SPN-based primitives - applications to AES, PRESENT
and SKINNY. In Cryptographic Hardware and Embedded Systems - CHES 2017 - 19th
International Conference, Taipei, Taiwan, September 25–28, 2017, Proceedings (pp. 687–707).
https://doi.org/10.1007/978-3-319-66787-4_33.
39. Lin, D., Xiang, Z., Zeng, X., & Zhang, S. (2021). A framework to optimize implementations
of matrices. In K. G. Paterson (Ed.), Topics in Cryptology - CT-RSA 2021 - Cryptographers’
Track at the RSA Conference 2021, Virtual Event, May 17–20, 2021, Proceedings, Lecture
Notes in Computer Science (vol. 12704, pp. 609–632). Springer. https://doi.org/10.1007/978-
3-030-75539-3_25.
40. Lyubashevsky, V., Peikert, C., & Regev, O. (2010). On ideal lattices and learning with
errors over rings. In Advances in Cryptology–EUROCRYPT 2010: 29th Annual International
Conference on the Theory and Applications of Cryptographic Techniques, French Riviera, May
30–June 3, 2010. Proceedings 29 (pp. 1–23). Springer.
41. Maximov, A., & Ekdahl, P. (2019). New circuit minimization techniques for smaller and
faster AES SBoxes. IACR Transactions on Cryptographic Hardware and Embedded Systems,
2019(4), 91–125. https://doi.org/10.13154/tches.v2019.i4.91-125.
42. Menezes, A. J., Van Oorschot, P. C., & Vanstone, S. A. (2018). Handbook of applied
cryptography. CRC Press.
43. NIST Lightweight Cryptography Project. (2019). Available at https://csrc.nist.gov/projects/
lightweight-cryptography.
44. Paar, C. (1999). Implementation options for finite field arithmetic for elliptic curve cryptosys-
tems. In The 3rd Workshop on Elliptic Curve Cryptography (October 1999).
45. Pomerance, C. (1996). A tale of 2 Sieves. Notices of the American Mathematical Society,
December 1996 (pp. 1473–1485).
46. Pöppelmann, T., & Güneysu, T. (2012). Towards efficient arithmetic for lattice-based cryp-
tography on reconfigurable hardware. In Progress in Cryptology–LATINCRYPT 2012: 2nd
International Conference on Cryptology and Information Security in Latin America, Santiago,
Chile, October 7–10, 2012. Proceedings 2 (pp. 139–158). Springer.
47. Post-Quantum Cryptography. (2023). https://csrc.nist.gov/Projects/post-quantum-
cryptography. Accessed 5 June 2023.
48. Regev, O. (2009). On lattices, learning with errors, random linear codes, and cryptography.
Journal of the ACM (JACM), 56(6), 1–40.
49. Rentería-Mejía, C., & Velasco-Medina, J. (2014). Hardware design of an NTT-based polyno-
mial multiplier. In 2014 IX Southern Conference on Programmable Logic (SPL) (pp. 1–5).
IEEE.
50. Reparaz, O., Sinha Roy, S., Vercauteren, F., & Verbauwhede, I. (2015). A masked ring-
LWE implementation. In International Workshop on Cryptographic Hardware and Embedded
Systems (pp. 683–702). Springer.
51. Rivest, R. L., Shamir, A., & Adleman, L. M. (1978). A method for obtaining digital signatures
and public-key cryptosystems. Communications of the ACM, 21(2), 120–126. http://doi.acm.
org/10.1145/359340.359342.
52. Roy, S. S., Turan, F., Jarvinen, K., Vercauteren, F., & Verbauwhede, I. (2019). FPGA-based
high-performance parallel architecture for homomorphic computing on encrypted data. In
2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)
(pp. 387–398). IEEE.
80 S. Banik and F. Regazzoni
53. Roy, S. S., Vercauteren, F., Mentens, N., Chen, D. D., & Verbauwhede, I. (2014). Compact
ring-LWE cryptoprocessor. In Cryptographic Hardware and Embedded Systems–CHES 2014:
16th International Workshop, Busan, South Korea, September 23–26, 2014. Proceedings 16
(pp. 371–391). Springer.
54. Shannon, C. E. (1949). Communication theory of secrecy systems. The Bell System Technical
Journal, 28(4), 656–715.
55. Shor, P. W. (1994). Algorithms for quantum computation: discrete logarithms and factoring
(pp. 124–134). IEEE Comput. Soc. Press. https://doi.org/10.1109/sfcs.1994.365700. http://
doi.acm.org/10.1145/359340.359342.
56. Shor, P. W. (1999). Polynomial-time algorithms for prime factorization and discrete logarithms
on a quantum computer. SIAM Review, 41(2), 303–332.
57. Vaudenay, S. (2002). Security flaws induced by CBC padding - applications to SSL, IPSEC,
WTLS. In L. R. Knudsen (Ed.) Advances in Cryptology - EUROCRYPT 2002, International
Conference on the Theory and Applications of Cryptographic Techniques, Amsterdam, The
Netherlands, April 28–May 2, 2002, Proceedings, Lecture Notes in Computer Science (vol.
2332, pp. 534–546). Springer. https://doi.org/10.1007/3-540-46035-7_35.
58. Williams, H. C., & Shallit, J. O. (1994). Factoring integers before computers. Mathematics of
Computation 1943–1993, Fifty Years of Computational Mathematics (W. Gautschi, ed.), Proc.
Sympos. Appl. Math. (vol. 48, pp. 481–531). Providence, RI: Amer. Math. Soc.
59. Wu, H. (2006). HC-128. eSTREAM, ECRYPT Stream Cipher Project Report. http://www.
ecrypt.eu.org/stream/p3ciphers/hc/hc128_p3.pdf.
Chapter 4
Remote Physical Attacks on FPGAs
at the Electrical Level
4.1 Introduction
4.2 Background
The fundamental driver of semiconductor chips is electricity, and thus any electrical
disruption can become a weakness for the entire system. In this section, we first
describe power distribution in modern semiconductor devices and then explain how
this knowledge is used for attacking those devices.
Integrated circuits (ICs) are typically supplied through a power distribution network
(PDN) to ensure that the same level of supply voltage is delivered to individual
chip transistors. The system-level PDN involves a hierarchy of active and passive
electronic components. A simplified PDN of a single printed circuit board (PCB)
with one IC is shown in Figs. 4.1 and 4.2. In modern low-power electronic
applications, the current flow is typically controlled by a switched-mode voltage
regulator and passes through traces on the PCB and chip package to individual IC
transistors. Some of the components shown in Fig. 4.1 are discrete components,
while others parasitically behave as such, as visualized in Fig. 4.2. For instance,
the bond wires of a chip package behave as unwanted inductors, which are partially
addressed by adding decoupling capacitors on the board and inside the silicon die.
A switched-mode voltage converter requires an inductor for its operation.
Fig. 4.1 Overview of the PDN hierarchy from the board-level voltage regulator to the individual
transistors on the chip
Fig. 4.2 Electrical model of the PDN as a mesh of capacitive, inductive, and resistive elements
The supply voltage inside the PDN must be kept as stable as possible at all times
to keep transistor delays within predictable boundaries and maximize frequency
and performance under a given power budget. Typically, some noise in the on-chip
supply voltage is inevitable and can be characterized as generic voltage fluctuations.
However, these fluctuations depend on circuit activity, which in turn depends on the
workload and data being processed in the system.
A drop in on-chip supply voltage can be expressed as the sum of two components. One
part is a resistive voltage drop, which depends on the resistance (R) and the required current
(I): V_drop,R = I·R. The other part depends on the change in current over time
and the resulting voltage drop across a series inductance (for instance, bond wires):
V_drop,L = L·dI/dt. This dynamic component of the voltage drop depends on
the data processed inside the chip and can be leveraged by attackers. ICs manufactured in
technologies below 45 nm are more affected by the L·dI/dt component.
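To make the two components concrete, the following sketch evaluates V_drop = I·R + L·dI/dt for hypothetical PDN values; the resistance, inductance, and current numbers are illustrative, not measured:

```python
def voltage_drop(current_a, resistance_ohm, inductance_h, di_a, dt_s):
    """Total supply droop: V_drop = I*R + L * dI/dt."""
    v_resistive = current_a * resistance_ohm    # static I*R component
    v_inductive = inductance_h * (di_a / dt_s)  # dynamic L*dI/dt component
    return v_resistive + v_inductive

# Example: 2 A through 10 mOhm of PDN resistance, plus a 1 A current
# step within 1 ns across 1 nH of bond-wire inductance. The transient
# L*dI/dt term dwarfs the static I*R term.
drop_v = voltage_drop(2.0, 0.010, 1e-9, 1.0, 1e-9)
print(drop_v)  # 0.02 V resistive + 1.0 V inductive
```

The example illustrates why rapid current changes, not absolute current, are the lever an attacker exploits.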
Two categories of attacks are possible through on-chip PDN manipulations [14, 47]: fault
attacks and side-channel attacks. These attacks have traditionally been carried out using
test and measurement equipment [4, 22].
84 D. R. E. Gnad et al.
Fig. 4.3 Summary of the two threats in a system including FPGA logic
Most fault attacks intentionally operate the device under test beyond its electrical
specification so that computational errors might occur. Here, we focus on fault-
inducing voltage drops. An excessive voltage drop can lead to a brown-out condition
in which the circuit resets and memory undergoes a retention failure, leading to a
denial-of-service (DoS) condition. However, if the voltage drop strength is adjusted
precisely, the propagation delay of the circuit increases and timing violations can
be induced. A subsequent mathematical analysis of the observed faults can reveal
sensitive information.
For instance, when targeting an encryption function, classical Differential Fault
Analysis (DFA) [4] compares the result of a genuine encryption to that of a
faulty encryption under the same input. The fault must affect a specific part of
the algorithm that depends only on a small part of the internally used secret key.
Then, an attacker can test which key hypothesis leads to the correct difference in
the internal state caused by the induced fault. This process can be repeated for the
remaining parts of the secret key, revealing the entire key.
where Sbox⁻¹ is the inverted AES S-box, leading to 256 key hypotheses that need
to be evaluated.
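The hypothesis test at the core of DFA can be sketched with a toy cipher; here the 4-bit PRESENT S-box stands in for the AES S-box, so 16 key hypotheses replace the 256 mentioned above, and the last-round model, fault differences, and key value are all illustrative:

```python
# Toy DFA sketch: recover a 4-bit key nibble from correct/faulty outputs.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
INV_SBOX = [SBOX.index(x) for x in range(16)]

def last_round(state, key):
    """Model of a final cipher round: c = Sbox(state) XOR key."""
    return SBOX[state] ^ key

def dfa_candidates(c_good, c_faulty, delta):
    """Keys consistent with an induced state difference `delta`."""
    return {k for k in range(16)
            if INV_SBOX[c_good ^ k] ^ INV_SBOX[c_faulty ^ k] == delta}

secret_key = 0x9
candidates = set(range(16))
# Each (state, fault difference) pair narrows the key hypotheses further.
for state, delta in [(0x3, 0x1), (0xA, 0x4), (0x6, 0x2)]:
    c_good = last_round(state, secret_key)
    c_faulty = last_round(state ^ delta, secret_key)
    candidates &= dfa_candidates(c_good, c_faulty, delta)

assert secret_key in candidates  # the true key always survives the test
print(sorted(candidates))
```

Intersecting the surviving hypotheses over several induced faults quickly isolates the true key, which is how the process "can be repeated for the remaining parts of the secret key".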
FPGAs are typically designed to offer great flexibility in implementing digital logic
circuits. The main building blocks used to achieve that flexibility in most com-
mercial FPGAs are look-up tables (LUTs) and registers, which can be connected
arbitrarily using programmable switch matrices and multiplexers. Typically, vendor-
specific building blocks, such as accelerators for digital signal processing tasks,
carry-chains for integer arithmetic, and basic building blocks such as PLLs for clock
generation, are added to FPGAs. When standard FPGA primitives are configured
appropriately, they can be utilized to influence or measure the chip-internal supply
voltage, i.e., to act as improvised voltage sensors or fault injection logic. Thus, FPGA
primitives can become a powerful attack tool, integrated on the same PCB or SoC as
the victim logic.
With deliberate glitching and access to the power supply or clock of a chip, it
is possible to cause timing faults. However, with the knowledge of how power is
distributed in integrated circuits, it is also possible to cause excessive voltage drop
in an FPGA using the device logic [14]. As explained in Sect. 4.2.1, causing a rapid
change in current in a small time interval can lead to a dI/dt-based voltage drop.
This knowledge can be exploited on FPGAs if some basic requirements are
fulfilled. An attacker needs control over power-wasting elements that can be toggled
on and off in specific sequences. FPGAs and the power supplies of integrated
circuits are typically designed to withstand a high constant power draw as well as
switching activity at very high frequencies. However, if the toggling frequency drops
below a certain level, either the regulation bandwidth of the board-level voltage
regulator or chip-internal resonances can be targeted. Thus, to cause a strong voltage
drop, these high-power elements must be enabled at a lower frequency that matches
the weak spots of the PDN. The elements switch at a very high frequency internally,
generating a high current flux, while an additional, external frequency is used to
toggle them on and off.
In practice, ring oscillators (ROs) mapped to FPGA LUTs cause high current
consumption due to their high internal frequency, but they must be enabled and
disabled at a certain frequency, as shown in Fig. 4.4. By adjusting the toggle
frequency f_toggle, an ideal frequency can be found that causes timing violations
in the FPGA or a crash or reset of the chip [14, 26, 31]. With further refinements, this
style of fault injection can be tuned to inject faults within short windows in time, with
the precision required for DFA [26]. Note that the placement and the number
of activated power wasters also influence the attack quality, as shown later in
Sect. 4.4.2.
Follow-up work has shown that it is possible to cause timing faults not only with
ROs but also with sequential oscillators [49], memory access collisions [2], or
even by toggling a large number of AES modules or benchmark circuits [23, 41].
This large variety of available power wasters for fault injection further amplifies the
threat, as the circuits can evade design rule checks or other offline countermeasures,
as described in Sect. 4.5.2.
To estimate the voltage level using digital logic elements, circuits that display voltage-
dependent behavior must be implemented. Because the switching speed of transistors is
voltage-dependent, these sensor circuits must violate typical synchronous
design constraints. Two basic sensor types can be identified.
The simplest approach is to use ROs [33, 54]. To approximate voltage, it is
sufficient to count the number of oscillations of a single RO in a given timeframe,
as shown in Fig. 4.5. By observing the differences from one timeframe to another,
a relative estimate of the voltage level can be obtained. However, because the RO
oscillations must be counted, only a limited sampling rate can be reached.
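A minimal model of such an RO-based readout might look as follows; the linear voltage-to-frequency relationship and all constants are hypothetical simplifications:

```python
def ro_count(voltage_v, timeframe_s, f0_hz=400e6, sens_hz_per_v=200e6):
    """Oscillation count in one timeframe (hypothetical linear RO model)."""
    freq_hz = f0_hz + sens_hz_per_v * (voltage_v - 1.0)
    return round(freq_hz * timeframe_s)

# A supply dip around frames 3-4 shows up as a drop in the counts;
# frame-to-frame differences give the relative voltage estimate.
trace_v = [1.00, 1.00, 1.00, 0.95, 0.96, 1.00, 1.00]
counts = [ro_count(v, timeframe_s=1e-6) for v in trace_v]
deltas = [b - a for a, b in zip(counts, counts[1:])]
print(counts)
print(deltas)
```

The sampling-rate limitation is visible in the model: one estimate requires a whole counting timeframe.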
As an alternative, ROs can be unrolled. This design is also called a delay line or
time-to-digital converter (TDC), which in the same FPGA technology reaches about
10× the sampling rate of an RO-based sensor [38, 56]. The delay line sensor has a
path which is too long to meet the timing constraints of a timing analysis model.
In the implementation, the path must be timing critical, i.e., on the verge of failing.
Because of the voltage dependency of transistor speed, the path fails for reduced
voltage levels. By adding multiple endpoints at different depths of the path, a sensor
for a relative voltage level can be implemented, as visualized in Fig. 4.6. In most
FPGAs, it is reasonable to use LUT/latch elements for an initial delay and faster carry-chain
primitives to reach a fine delay granularity from endpoint to endpoint.
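The TDC principle can be sketched the same way; the stage delays and voltage sensitivity below are invented for illustration, not taken from any real device:

```python
def tdc_reading(voltage_v, n_stages=64, clk_period_s=5e-9,
                stage_delay_s=60e-12, slowdown_s_per_v=200e-12):
    """Number of delay-line endpoints reached within one clock period."""
    # Lower supply voltage -> slower stages -> shallower propagation.
    delay_s = stage_delay_s + slowdown_s_per_v * (1.0 - voltage_v)
    return min(n_stages, int(clk_period_s / delay_s))

print(tdc_reading(1.00))  # nominal supply: signal reaches the chain end
print(tdc_reading(0.90))  # voltage droop: fewer endpoints reached
```

Because the reading is captured in a single clock cycle rather than accumulated over a counting window, this style of sensor samples much faster than the RO counter.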
Please note that both sensors are sensitive to other sources of timing variation
including inherent manufacturing process variation, changes in temperature, and
circuit aging [13, 55]. However, voltage fluctuations change more rapidly than the
Fig. 4.6 LUT-/latch-based delay line with carry-chain based Time-to-Digital Converter to mea-
sure voltage fluctuations for side-channel attacks (cf. [47, 56])
other variables [13]. Thus, in a trace of sensor data, fast voltage fluctuations are
modulated on top of slower variations, which therefore do not diminish the success
of power analysis side-channel attacks [47].
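This separation can be sketched by subtracting a moving average from a simulated sensor trace; the trace values and window size are illustrative:

```python
def highpass(trace, window=5):
    """Subtract a centered moving average (the slow component) per sample."""
    half = window // 2
    out = []
    for i in range(len(trace)):
        lo, hi = max(0, i - half), min(len(trace), i + half + 1)
        out.append(trace[i] - sum(trace[lo:hi]) / (hi - lo))
    return out

# Slow drift (a ramp, e.g., temperature) plus one fast dip at sample 6.
trace = [1.0 + 0.001 * i for i in range(12)]
trace[6] -= 0.05
filtered = highpass(trace)
print(max(filtered, key=abs))  # the fast dip dominates after filtering
```

After filtering, the slow drift contributes almost nothing, while the data-dependent dip stands out, which is exactly why drift and aging do not rescue a victim from power analysis.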
In this chapter, we explain two types of attacks implemented on boards with FPGAs
from two vendors. Depending on the specific FPGA and platform, either fault
injection or voltage measurement works better. Typically, at least one of the attacks
is successful at the electrical level.
Fig. 4.7 CPA attack through on-chip sensors on the Lattice ECP5 FPGA; correlation progress
over 10,000 samples for all 256 secret AES key byte candidates with the correct key byte marked
red
power model than the other 255 key candidates, shown as gray plots. The correct
key byte can be identified once it is clearly distinguishable from the others. With
this simple attack, approximately 50% of the key bytes can be recovered when 100,000
traces are used. This result proves the fundamental vulnerability of FPGAs to such
attacks. More sophisticated analysis or more traces can be used to recover the full
key.
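The correlation analysis behind Fig. 4.7 can be sketched on simulated traces; here the 4-bit PRESENT S-box replaces the AES S-box, so 16 candidates stand in for the 256 key byte candidates, and the Hamming-weight leakage model with Gaussian noise is a common simplification rather than real sensor data:

```python
import random

SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
HW = [bin(x).count("1") for x in range(16)]  # Hamming-weight power model

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

random.seed(0)
secret_key = 0xB
inputs = [random.randrange(16) for _ in range(4000)]
# Simulated sensor traces: leakage of the S-box output plus noise.
traces = [HW[SBOX[p ^ secret_key]] + random.gauss(0, 0.5) for p in inputs]

# Correlate each key hypothesis' predicted leakage with the measurements;
# the correct hypothesis yields the highest correlation.
scores = {k: abs(pearson([HW[SBOX[p ^ k]] for p in inputs], traces))
          for k in range(16)}
best = max(scores, key=scores.get)
print(hex(best), round(scores[best], 3))
```

As in the figure, the correct candidate separates from the others once enough traces are accumulated; more noise simply raises the number of traces required.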
Fig. 4.8 Range of frequencies to toggle ROs on/off that cause timing faults or crashes in the Xilinx
VCU108 Virtex Ultrascale Board
show the results when the RO usage is increased from 30% to 35% LUT usage
(Fig. 4.8). A clearer boundary between faults and crashes can be seen. Furthermore,
there is a wide frequency band in which timing faults occur without the crashes that
would require bitstream reprogramming. Such fault attacks compromise system
integrity without causing a crash that would be more obvious to a victim.
Table 4.1 Overview of experimental results on side-channel and fault attacks in FPGAs, systems
containing FPGAs, or FPGA-based SoCs, as of March 2022
Board | Voltage drop-based denial of service | Voltage drop-based timing fault injection | Key/data recovery by side-channel
Intel Terasic DE0-Nano-SoC | – | Yes, [26] | –
Intel Terasic DE1-SoC | Yes, [26] | Yes, [26] | –
Intel Terasic DE4 | – | Yes, [26] | –
Intel Terasic DE5a-Net | Yes, [41] | Yes, [41] | –
Intel Terasic DE10-Pro Stratix 10 | Yes, [23] | Yes, [23] | –
Lattice ECP5 5G Evaluation Board | – | – | Yes, [16]
Lattice iCE40-HX8K Breakout Board | Yes, [15] | Yes, [26] | Yes, [15]
Xilinx Artix-7 Basys-3 | – | – | Yes, [47]
Xilinx Artix-7 Nexys 4 | Yes, [2] | Yes, [2] | –
Xilinx Kintex-7 KC705 | Yes, [14] | –⁴ | –
Xilinx Pynq Zynq-ZC7020 | Yes, [26]¹ | –⁴ [26] | –
Xilinx Spartan-6 SAKURA-G | – | – | Yes, [47], [48]³
Xilinx Ultrascale VCU108 | Yes, [16] | Yes, [16] | –
Xilinx Ultrascale+ VU9P | – | – | Yes, [12]
Xilinx Virtex-6 ML605 | Yes, [14] | –⁴ | –
Xilinx Virtex-7 VC707 | – | Yes, [31] | –
Xilinx Virtex-7 ADM-PCIE-7V3 | – | – | Yes, [24]
Xilinx Zynq-ZC7020 Zedboard | Yes, [14]¹ | –⁴ | Yes, [54]²
Xilinx Zynq Ultrascale+ Ultra96 | Yes, [34] | – | –
¹ The attack affects the whole SoC including the integrated ARM Cortex-A9 Dual-Core
² Sufficient leakage for key recovery was also shown from CPU to FPGA in the same SoC
³ In [48], the attack was also shown to work at board level, from one FPGA in the system (connected to the same power supply) to another
⁴ A simple experiment was conducted, but the devices crashed before timing violations occurred; it might still be possible with more effort
4.5 Countermeasures
side-channel leakage.
Attack evaluation on datacenter-scale Virtex-7 FPGAs also showed the impact
of design space parameters on attack success. The layout and organization of the
larger fabric further increase the design space and thus the variation in side-channel
vulnerability. Modern high-end FPGAs are often composed of multiple dies, across
which side-channel attacks have been shown to still be possible [10]. We consider
two causative factors: the PDN design is non-uniform across the chip, which leads
to differences in the impact of current draw on the supply voltage, and the mapped
user logic components are subject to intra-chip process variation (PV). Both factors
impact the sensitivity of sensors in different locations and lead to differences in the
voltage traces caused by modules in different locations.
Hypervisors might be able to leverage the impact of placement and routing to
optimize security. A possible computer-aided design approach could be built using
three steps:
• First, a hypervisor generates multiple local placements for a specific cryptomodule.
• Then, for each module mapping, a global analysis determines which regions
are less vulnerable to side-channel attacks. Attack evaluations for all possible
combinations are hardly feasible. Future work might be able to identify an
adequate model, which can assess side-channel vulnerability in less time than
a full attack.
• In the third and final step, side-channel security on a specific FPGA can be
improved at zero overhead. On the hypervisor side, a global map of secure
locations and precompiled cryptocores can be provisioned. These items can be
deployed by the user as building blocks in security-critical applications.
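The three steps could be sketched as the following selection loop; the leakage model and region data are invented stand-ins, since a real flow would rely on vendor-specific placement and analysis tools:

```python
def leakage_model(region):
    """Stand-in for a fast side-channel vulnerability estimate
    (a full CPA evaluation per region would be far too slow)."""
    # Assumption: higher PDN impedance and stronger process-variation
    # skew make a region leak more.
    return region["pdn_impedance"] * region["pv_skew"]

# Hypothetical candidate regions of the fabric (step 1: placements).
regions = [
    {"name": "X0Y0", "pdn_impedance": 1.0, "pv_skew": 1.2},
    {"name": "X0Y1", "pdn_impedance": 0.7, "pv_skew": 0.9},
    {"name": "X1Y0", "pdn_impedance": 1.3, "pv_skew": 1.0},
]

# Step 2: score every candidate placement with the fast model.
scored = sorted(regions, key=leakage_model)
# Step 3: provision a map of the least vulnerable locations to tenants.
secure_map = [r["name"] for r in scored[:2]]
print(secure_map)
```

The crucial design choice is step 2: replacing per-region attack evaluations with a cheap analytical score is what makes the flow feasible at all.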
This approach could improve security without requiring additional FPGA area
or resources and would provide a valuable asset in securing multi-tenant FPGA
access against internal side channels. Neglecting such device and physical design
dependencies can compromise the security of virtualized FPGAs in the cloud.
Online countermeasures have generally been developed with the detection and
mitigation of fault attacks in mind [30, 37, 40, 42]. In both Mirzargar et al. [37] and
Provelengios et al. [42], methods for locating an attacker’s design through a sensor
network are proposed. For attack mitigation, Provelengios et al. [42] and Nassar et
al. [40] suggest clock gating and the quick disabling of interconnects, respectively,
to stop excessive power consumption from impeding other tenants on the same
chip. The latter has the potential to stop attacks that rely on internal clocking, for
instance, through self-oscillating clocks in the attacker FPGA region. Alternatively,
the operating frequency of the victim design can be automatically lowered when a
critical voltage undershoot is detected, as proposed in Luo et al. [30]. This dynamic
frequency scaling, however, is intrusive to non-malicious tenants on the FPGA.
Other works have considered leveraging the advantages of reconfigurable hard-
ware for side-channel countermeasures [3, 19, 25, 52]. One approach [25] leverages
the availability of on-chip voltage sensors for adaptive noise generation to cancel out
fluctuations caused by an encryption module through ROs. By worsening the signal-
to-noise ratio (SNR) for an internal attacker, the amount of measurements required
for key recovery can be significantly increased. In Bete et al. [3], a countermeasure
based on implementation variety is proposed, in which different implementations
of the AES S-Box are randomly interchanged at runtime, increasing the difficulty
for a side-channel attacker due to the randomized power profile. A generalization of
programmable ROs as a countermeasure against both side-channel and fault attacks
is presented in Yao et al. [52]. On the one hand, ROs can be employed as sensors for
detecting fault attacks. On the other hand, they serve as random noise generators to
hide data-dependent leakage against side-channel attacks. Lastly, randomization of
the clock driving the victim circuit using a delay line is proposed in Jayasinghe et
al. [19], which can induce noise at very low overhead.
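The hiding idea behind such an active fence can be sketched as follows; in this simplification the controller sees the victim's activity directly, whereas the real countermeasure [25] reacts to on-chip sensor readings, and all numbers are illustrative:

```python
FENCE_SIZE = 8  # number of noise ROs available to the fence

def fence_ros_to_enable(crypto_activity, max_activity=8):
    """Enable enough noise ROs to flatten the combined activity."""
    return min(FENCE_SIZE, max(0, max_activity - crypto_activity))

crypto_trace = [2, 5, 8, 3, 6, 1]          # data-dependent activity
fence_trace = [fence_ros_to_enable(a) for a in crypto_trace]
combined = [c + f for c, f in zip(crypto_trace, fence_trace)]
print(combined)  # flattened: the data dependence is hidden
```

When the combined activity is constant, an internal sensor sees (ideally) no data-dependent fluctuation, which is precisely the SNR degradation the fence aims for.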
The presented mitigation strategies can prevent many of the known attacks.
However, as the underlying problem for multi-tenant FPGAs is the shared PDN
across security boundaries, future development will most likely lead to more
advanced attacks, requiring constant adaptation to the continuously changing threat.
In Table 4.2, we provide a comprehensive overview of the recently developed attacks
and countermeasures in FPGA-based systems.
4.6 Summary
FPGA use in datacenters and multi-tenant systems has gained popularity. Typically,
such systems have multiple privilege levels and yet are connected to the same
shared power supply. In these systems, the FPGA fabric can become an aggressor
and launch fault and side-channel attacks on other parts of the system, including
other areas of the FPGA itself. In the past, threats at the physical level were only
Countermeasure columns: SPREAD, Bitstream checking, UCloD [19], Active fences [25], NNcrypt [28], FPGADefender [29], Frequency scaling [30], LoopBreaker [40], Monitoring [42], PROs [52]
Attacks (rows):
Inside job [47]: ✓ ✓ ✓ – ✓ – – – ✓ ✓
Inter-chip side channels [48]: ✓ ✓ ✓ – ✓ – – – ✓ ✓
Cross-SLR covert channels [10]: ✓ ✓ ✓* – ✓* – – – ✓* ✓
C3APSULe [11]: ✓ – – – – – – – – –
SCA on neural networks [39]: ✓ ✓ ✓ – ✗ – – – ✓ ✓
FPGAhammer [26]: ✓ – – ✓ – ✓ ✓ ✓ – ✓
Sequential oscillator faults [49]: ✓ – – ✓ – ✓ ✓ ✓ – ✓
BRAM collision faults [2]: ✗ – – ✗ – ✓ ✓ ✓ – ✓
Glitch amplification faults [34]: ✗ – – ✓ – ✓ ✓ ✓ – ✓
Benign logic faults [23, 41]: ✗ – – ✗ – ✓ ✓ ✓ – ✓
considered in the context of an adversary with direct physical access to the device
under attack. With the introduction of remote physical and electrical-level attacks
on FPGAs, this common security assumption is no longer true. Multi-tenant FPGA
attacks are an important eye-opener for the bigger picture of system security across
all remotely accessible systems.
Countermeasures have improved rapidly in the last few years. However, an easy
solution for the underlying problem of information leakage between circuits through
a shared power supply has not been found, necessitating future research.
Acknowledgments The work described in this chapter has been supported in part by the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation) through the project 456967092
(SecFShare).
References
15. Gnad, D. R. E., Rapp, S., Krautter, J., & Tahoori, M. B. (2018). Checking for electrical level
security threats in bitstreams for multi-tenant FPGAs. In International Conference on Field-
Programmable Technology (ICFPT). Naha, Japan: IEEE.
16. Gnad, D. R. E., Schellenberg, F., Krautter, J., Moradi, A., & Tahoori, M. B. (2020). Remote
electrical-level security threats to multi-tenant FPGAs. IEEE Design & Test. https://doi.org/10.
1109/MDAT.2020.2968248.
17. Huffmire, T., Brotherton, B., Wang, G., Sherwood, T., Kastner, R., Levin, T., Nguyen, T., &
Irvine, C. (2007). Moats and drawbridges: an isolation primitive for reconfigurable hardware
based systems. In Symposium on Security and Privacy (S&P). IEEE.
18. Intel Corporation. (2017). Intel FPGAs Power Acceleration-as-a-Service for Alibaba Cloud
| Intel Newsroom. https://newsroom.intel.com/news/intel-fpgas-power-acceleration-as-a-
service-alibaba-cloud.
19. Jayasinghe, D., Ignjatovic, A., & Parameswaran, S. (2021). UCloD: small clock delays to
mitigate remote power analysis attacks. IEEE Access, 9, 108,411–108,425. https://doi.org/
10.1109/ACCESS.2021.3100618.
20. Kamoun, N., Bossuet, L., & Ghazel, A. (2009). Correlated power noise generator as a low
cost DPA countermeasures to secure hardware AES cipher. In International Conference on
Signals, Circuits and Systems (SCS). IEEE.
21. Khawaja, A., Landgraf, J., Prakash, R., Wei, M., Schkufza, E., & Rossbach, C. J. (2018).
Sharing, protection, and compatibility for reconfigurable fabric with AmorphOS. In USENIX
Symposium on Operating Systems Design and Implementation (OSDI) (pp. 107–127).
22. Kocher, P., Jaffe, J., & Jun, B. (1999). Differential power analysis. In Advances in Cryptology
— CRYPTO’ 99 (pp. 388–397). Berlin, Heidelberg: Springer. https://doi.org/10.1007/3-540-
48405-1_25.
23. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2021). Remote and stealthy fault attacks on
virtualized FPGAs. In Proceedings of Design, Automation & Test in Europe (DATE) (pp.
1632–1637). https://doi.org/10.23919/DATE51398.2021.9474140.
24. Krautter, J., Gnad, D., & Tahoori, M. (2020). CPAmap: On the complexity of secure FPGA
virtualization, multi-tenancy, and physical design. IACR Transactions on Cryptographic
Hardware and Embedded Systems, 2020(3), 121–146. https://doi.org/10.13154/tches.v2020.
i3.121-146.
25. Krautter, J., Gnad, D. R. E., Schellenberg, F., Moradi, A., & Tahoori, M. B. (2019). Active
fences against voltage-based side channels in multi-tenant FPGAs. In International Conference
on Computer-Aided Design (ICCAD). ACM.
26. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2018). FPGAhammer: remote voltage fault
attacks on shared FPGAs, suitable for DFA on AES. IACR Transactions on Cryptographic
Hardware and Embedded Systems (TCHES), 2018(3), 44–68.
27. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2019). Mitigating electrical-level attacks
towards secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable
Technology and Systems (TRETS), 12(3). https://doi.org/10.1145/3328222.
28. Krautter, J., & Tahoori, M. B. (2021). Neural networks as a side-channel countermeasure:
challenges and opportunities. In Symposium on VLSI (ISVLSI) (pp. 272–277). IEEE Computer
Society. https://doi.org/10.1109/ISVLSI51109.2021.00057.
29. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems, 13(3). https://doi.org/10.1145/3402937.
30. Luo, Y., & Xu, X. (2020). A quantitative defense framework against power attacks on multi-
tenant FPGA. In International Conference On Computer Aided Design (ICCAD) (pp. 1–4).
IEEE/ACM.
31. Mahmoud, D., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant FPGAs.
In Proceedings of Design, Automation & Test in Europe (DATE) (pp. 1745–1750). IEEE.
32. Malkin, T. G., Standaert, F. X., & Yung, M. (2006). A comparative cost/security analysis of
fault attack countermeasures. In Fault Diagnosis and Tolerance in Cryptography (FDTC) (pp.
159–172). Berlin, Heidelberg: Springer.
33. Masle, A. L., & Luk, W. (2012). Detecting power attacks on reconfigurable hardware. In Field
Programmable Logic and Applications (FPL) (pp. 14–19). IEEE. https://doi.org/10.1109/FPL.
2012.6339235.
34. Matas, K., La, T. M., Pham, K. D., & Koch, D. (2020). Power-hammering through Glitch
amplification – attacks and mitigation. In International Symposium on Field-Programmable
Custom Computing Machines (FCCM) (pp. 65–69). https://doi.org/10.1109/FCCM48280.
2020.00018.
35. McEvoy, R. P., Murphy, C. C., Marnane, W. P., & Tunstall, M. (2009). Isolated WDDL:
A hiding countermeasure for differential power analysis on FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 2(1).
36. Messerges, T. S., Dabbish, E. A., & Sloan, R. H. (2002). Examining smart-card security under
the threat of power analysis attacks. Transactions on Computers, 51(5), 541–552.
37. Mirzargar, S. S., Renault, G., Guerrieri, A., & Stojilović, M. (2020). Nonintrusive and adaptive
monitoring for locating voltage attacks in virtualized FPGAs. In International Conference
on Field-Programmable Technology (ICFPT) (pp. 288–289). IEEE. https://doi.org/10.1109/
ICFPT51103.2020.00050.
38. Moini, S., Li, X., Stanwicks, P., Provelengios, G., Burleson, W., Tessier, R., & Holcomb,
D. (2020). Understanding and comparing the capabilities of on-chip voltage sensors against
remote power attacks on FPGAs. In Midwest Symposium on Circuits and Systems (MWSCAS)
(pp. 941–944). IEEE. https://doi.org/10.1109/MWSCAS48704.2020.9184683.
39. Moini, S., Tian, S., Szefer, J., Holcomb, D., & Tessier, R. (2021). Remote power side-channel
attacks on BNN accelerators in FPGAs. In Proceedings of Design, Automation & Test in
Europe (DATE). IEEE.
40. Nassar, H., AlZughbi, H., Gnad, D., Bauer, L., Tahoori, M., & Henkel, J. (2021). LoopBreaker:
disabling interconnects to mitigate voltage-based attacks in multi-tenant FPGAs. In Interna-
tional Conference on Computer-Aided Design (ICCAD). IEEE/ACM.
41. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power wasting circuits for cloud FPGA
attacks. In Field Programmable Logic and Applications (FPL) (pp. 231–235). https://doi.org/
10.1109/FPL50879.2020.00046.
42. Provelengios, G., Holcomb, D., & Tessier, R. (2021). Mitigating voltage attacks in multi-tenant
FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 14(2), 1–24.
43. Putnam, A., Caulfield, A. M., Chung, E. S., Chiou, D., Constantinides, K., Demme, J.,
Esmaeilzadeh, H., Fowers, J., Gopal, G. P., Gray, J., Haselman, M., Hauck, S., Heil, S.,
Hormati, A., Kim, J. Y., Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J.,
Xiao, P. Y., & Burger, D. (2014). A reconfigurable fabric for accelerating large-scale datacenter
services. In International Symposium on Computer Architecture (ISCA), ISCA ’14 (pp. 13–24).
Piscataway, NJ, USA: IEEE Press. http://dl.acm.org/citation.cfm?id=2665671.2665678.
44. Ramesh, C., Patil, S. B., Dhanuskodi, S. N., Provelengios, G., Pillement, S., Holcomb, D.,
& Tessier, R. (2018). FPGA side channel attacks without physical access. In International
Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE.
45. Rockett, L., Patel, D., Danziger, S., Cronquist, B., & Wang, J. (2007). Radiation hardened
FPGA technology for space applications. In Aerospace Conference (pp. 1–7). IEEE.
46. Sanaullah, A., Yang, C., Alexeev, Y., Yoshii, K., & Herbordt, M. C. (2018). Real-time data
analysis for medical diagnosis using FPGA-accelerated neural networks. BMC Bioinformatics,
19, 19–31.
47. Schellenberg, F., Gnad, D. R., Moradi, A., & Tahoori, M. B. (2018). An inside job: remote
power analysis attacks on FPGAs. In Proceedings of Design, Automation & Test in Europe
(DATE).
48. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). Remote inter-chip
power analysis side-channel attacks at board-level. In International Conference on Computer-
Aided Design (ICCAD) (pp. 1–7). IEEE/ACM. https://doi.org/10.1145/3240765.3240841.
4 Remote Physical Attacks on FPGAs at the Electrical Level 99
49. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data center. Electronics Letters, 55(11),
640–642. https://doi.org/10.1049/el.2019.0163.
50. Tian, S., Moini, S., Wolnikowski, A., Holcomb, D., Tessier, R., & Szefer, J. (2021). Remote
power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In Proceedings of the
International Symposium on Field-Programmable Custom Computing Machines, FCCM.
51. Trimberger, S., & McNeil, S. (2017). Security of FPGAs in data centers. In International
Verification and Security Workshop (IVSW). IEEE Computer Society.
52. Yao, Y., Kiaei, P., Singh, R., Tajik, S., & Schaumont, P. (2021). Programmable RO (PRO):
a multipurpose countermeasure against side-channel and fault injection attack. Preprint.
arXiv:2106.13784.
53. Zeng, S., Dai, G., Sun, H., Zhong, K., Ge, G., Guo, K., Wang, Y., & Yang, H. (2020). Enabling
efficient and flexible FPGA virtualization for deep learning in the cloud. In International
Symposium on Field-Programmable Custom Computing Machines (FCCM) (pp. 102–110).
IEEE.
54. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In Symposium
on Security and Privacy (S&P) (pp. 805–820). IEEE. https://doi.org/10.1109/SP.2018.00049.
55. Zick, K. M., & Hayes, J. P. (2012). Low-cost sensing with ring oscillator arrays for healthier
reconfigurable systems. ACM Transactions on Reconfigurable Technology and Systems
(TRETS), 5(1), 1:1–1:26. https://doi.org/10.1145/2133352.2133353.
56. Zick, K. M., Srivastav, M., Zhang, W., & French, M. (2013). Sensing nanosecond-scale voltage
attacks and natural transients in FPGAs. In International Symposium on Field-Programmable
Gate Arrays (FPGA) (pp. 101–104). New York, NY, USA: ACM. https://doi.org/10.1145/
2435264.2435283.
Chapter 5
Practical Implementations of Remote
Power Side-Channel and Fault-Injection
Attacks on Multitenant FPGAs
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 101
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_5
102 D. G. Mahmoud et al.
via the power- and ground-delivery network is retrieved, breaching the system’s
confidentiality [16, 52, 66, 71].
Electrical-level attacks against FPGAs are not new. Given physical access to a
device, an adversary can manipulate the external voltage source (e.g., lower the
voltage to a new level or introduce transient changes) or disrupt the clock (e.g.,
inject glitches) to cause a fault in the operation of the target circuit [45, 74].
Measuring the voltage and current variations on the power supply rails or sensing
the electromagnetic (EM) emanations is also feasible, given an oscilloscope and
appropriate probes. The obtained power or EM side-channel traces can then be
used to perform side-channel analysis (SCA) and recover a secret [45, 54] (e.g.,
a secret key of a cryptographic circuit). In a cloud FPGA scenario, however, an
adversary has no physical access to the device, and yet, we know that a remotely
executed electrical-level attack is possible. How is that possible? The answer is twofold:
first, at least in theory, one can design almost arbitrary circuits using the programmable
FPGA resources; second, the power-delivery network is shared and, consequently, the
voltage and current variations created by one FPGA circuit spread beyond that circuit's
boundaries.
With fine-grained FPGA logic and routing, one can build reconfigurable circuits
that measure the variations of on-chip delays [73]. Knowing that the delays correlate
with the supply voltage [37], these circuits effectively act as on-chip voltage sensors.
Assuming the adversary and the victim are coupled via the shared PDN, these
voltage sensors enable recording power side-channel traces and, ultimately, a remote
power side-channel attack [16, 71]. Alternatively, one can build power wasters,
i.e., circuits that toggle at high frequency and draw considerable current [36].
If power wasters are deployed in large numbers and carefully controlled, their
activity causes voltage transients that, because of the shared PDN, may disturb
the operation of other circuits supplied by the same voltage source. The result is
a remote undervolting attack, with the potential to become a denial-of-service [19]
or fault-injection (FI) attack [33, 41, 42].
Signal coupling via the power-delivery network is illustrated in Fig. 5.1. On the
FPGA die, circuits are coupled via the power- and ground-distribution mesh and the
unavoidable parasitic components (e.g., capacitance or inductance). The coupling
extends to the package and the printed circuit board (PCB), where the voltage
sources supplying the FPGA and other onboard components are typically deployed.
Fig. 5.1 Illustration of the power-delivery network sharing across several levels: the FPGA die,
the chip package, and the printed circuit board
5 Remote Power SCA and FI Attacks on Multitenant FPGAs 103
The coupling reaches the external voltage sources as well (e.g., the common power
supply of an entire server rack). How far the signal created by on-chip activity (a
voltage drop due to power wasters’ activity or side-channel information) propagates
depends on the quality of the coupling or the lack thereof. A common technique
to prevent voltage disturbances from propagating far is to provision decoupling
capacitors on all levels (board, package, chip, etc.). These capacitors provide a
response to transient current demands, reducing voltage variations in the expected
operating frequency ranges. However, recent research demonstrates that decoupling
capacitors are insufficient: even data-center-scale FPGAs with high-quality PDNs are
vulnerable to remote attacks [16]. One particular FPGA use case
stands out as the most affected: FPGA multitenancy, where the victim and the
adversary simultaneously share the same FPGA and, being close to one another,
are therefore strongly coupled.
In the remainder of this chapter, we first describe the threat models of the remote
power side-channel and fault-injection attacks, commonly considered in the relevant
literature. Then, we elaborate in detail on the two attack variants. For each of them,
we cover the FPGA circuits that are the key enablers of the attack and show the
results of the attacks on a range of FPGA boards, including data center acceleration
cards. We make the hardware description language implementation of the FPGA
circuits (a selection of sensors and power wasters) available as open source [18].
Fig. 5.2 Threat model: the attacker region and the victim applications reside on the same FPGA
and, despite the logical isolation provided by the shell, remain coupled through the common PDN
(voltage regulator and decoupling capacitors included); across this shared network, the attacker
can mount fault-injection and side-channel attacks
In the side-channel attack scenario, the adversary has the capability of offloading the sensor traces for off-chip analysis.
The victim, on the other side, is performing secret computation. For example, the
victim could be encrypting data using a secret key and sending the ciphertexts over
a public channel that can be observed by the adversary. The goal of the adversary
would be to retrieve the secret information, such as the secret key. In an alternative
scenario, an adversary could aim to infer the architecture or the type of the victim
circuit (e.g., steal the proprietary architecture of a neural network model, discover
what type of operation the victim is executing, etc.).
In the context of a fault-injection attack, the adversary aims to leverage the
shared PDN to propagate electrical-level effects to the regions used by other
tenants. Specifically, the attacker attempts to lower the voltage of the chip with
the help of the power-wasting circuits. The lowered voltage affects the delays
within the victim circuit, potentially compromising its operation. Depending on the
circuit characteristics and the delays of the hardware employed by other tenants,
computation results can become faulty. The adversary’s goal may be to render
the victim’s calculations incorrect or consume enough power to reset the FPGA
(resulting in a DoS attack, affecting both the victim and the cloud service provider).
In a more interesting scenario, the adversary may want to extract information about
the victim, for example, secret information it may be processing (e.g., an encryption
key). To gain information from an injected fault, the adversary needs the ability
to observe some of the output generated by the victim, i.e., to estimate
the effect of the fault on the computation. Hence, the fault-injection threat model
often assumes that the adversary can access the victim’s output. For instance, the
target circuit may be offering its function as a service to other parties, including
the attacker. In this case, the malicious party supplies an input to the victim and
receives the corresponding output. Alternatively, the victim may send the output
over a communication channel the adversary can access directly or indirectly; for
example, if the victim encrypts the data before sending it out, it may use a shared or
publicly accessible communication channel.
On-chip sensors suitable for remote power analysis attacks typically fall into two
categories: time-to-digital converters (TDCs) and frequency counters. The primary
component in TDCs is a tapped delay line. In frequency counters, the primary
component is a ring oscillator (RO). The underlying working principle is common:
instead of measuring the supply voltage variations directly, these sensors measure
delay variations, which correlate with voltage fluctuations [21, 22]. The power
supply voltage is the medium that carries the side-channel information.
RO-based sensors work on the principle of measuring the frequency (by means
of a counter) of a fast oscillator. Such oscillators are commonly implemented with
a single look-up table (LUT) acting as an inverter, closed in a loop. They have a
small footprint, are easy to deploy, and are portable. Yet, frequency counters require
a long measurement time, reducing sensor sensitivity and making them suitable for
capturing slow-changing signals only. Some CSPs (e.g., Amazon AWS) check user
designs for combinational loops and prevent such circuits from being deployed on
their cloud [4, 36, 57].
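The frequency-counting principle can be sketched in a toy Python model in which the RO period grows as the supply voltage drops; the parameter values and the linear delay model are illustrative assumptions, not measurements of any device:

```python
def ro_count(vdd, window_ns, t0_ns=1.0, k_ns_per_v=0.8, vnom=0.85):
    """Counter value after one measurement window: the RO period is
    modeled as t0 + k * (vnom - vdd), i.e., the delay rises (and the
    oscillation frequency falls) as the supply voltage droops."""
    period_ns = t0_ns + k_ns_per_v * (vnom - vdd)
    return int(window_ns / period_ns)

nominal = ro_count(vdd=0.85, window_ns=10_000)   # no droop
drooped = ro_count(vdd=0.80, window_ns=10_000)   # 50 mV droop
print(nominal, drooped)  # the count drops under the voltage droop
```

Note that the long measurement window averages away fast transients, which is precisely why frequency counters are limited to slow-changing signals.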
Compared to RO-based sensors, TDCs can sense nanosecond-scale voltage
variations. The suitability of FPGA TDCs for measuring on-chip voltage variations
and natural transients in FPGAs was first investigated by Zick et al. a decade
ago [73]. However, it was only a few years later, after the first work on remote
undervolting attacks on multitenant FPGAs, that TDC sensors regained the attention
of researchers [16, 21, 23, 50, 51, 63]. A typical TDC implementation is illustrated
in Fig. 5.3. The sensor consists of three parts:
• The first part is a tapped delay line, shown as a simple chain of multiplexers.
It is commonly referred to as an observable delay line and implemented using
fast carry propagation logic and dedicated routing. For optimal sensor sensitivity,
strict placement constraints are necessary: the delay line must be properly formed
by chaining the carry output of one FPGA slice (where a slice contains four or
Fig. 5.3 Time-to-digital converter (TDC) implemented using FPGA logic and routing, suitable for
measuring fast on-chip supply voltage variations
Fig. 5.4 Routing delay sensor (RDS), where the observable tapped delay line is replaced by
routing resources (RRs) [65]
eight registers, depending on the FPGA family) to the carry input of the next one.
The occupied slices should be constrained to one vertical column of the FPGA to
harness the dedicated wiring and minimize inter-slice delays. Every carry output
must drive its attached flip-flop (FF) residing in the same slice.
• The second part of the sensor is an output register used to periodically save the
state of the delay line (i.e., one sensor reading or a sample). The value in the
output register can be converted into the numerical value of one sensor sample
using a thermometer code [16, 63, 68] or, as is more common in recent work, by
taking the Hamming weight of the bits in the output register [17, 50, 51, 73].
• The third and the last part is also a chain of delay elements, but with coarser
granularity (e.g., look-up tables, phase-locked loops, or IDELAY adjustable input
delay elements), which adjusts the phase shift between the sampling clock of the
output register (CLKIN in Fig. 5.3) and the clock driving the delay line (PIN
in Fig. 5.3). In most practical implementations, these two clocks have the same
frequency, but the phase shift and the length of the tapped delay line must be
adjusted. The adjustment process is called calibration, as described shortly.
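The two encodings of the output register mentioned above can be sketched on a toy 16-bit register; the bit patterns below are made up for illustration:

```python
def thermometer_decode(bits):
    """Sample = the clock's propagation depth, i.e., the run of
    leading ones; a bubble prematurely truncates the count."""
    depth = 0
    for b in bits:
        if b != 1:
            break
        depth += 1
    return depth

def hamming_weight_decode(bits):
    """Sample = the number of ones anywhere in the register; robust
    to bubbles and to unordered bit delays (as in an RDS)."""
    return sum(bits)

ideal  = [1] * 5 + [0] * 11            # clean transition at depth 5
bubbly = [1] * 4 + [0, 1] + [0] * 10   # a "bubble" at the transition
print(thermometer_decode(ideal), hamming_weight_decode(ideal))    # 5 5
print(thermometer_decode(bubbly), hamming_weight_decode(bubbly))  # 4 5
```

Both encodings agree on a clean register, but only the Hamming weight counts the bit displaced by the bubble.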
If carry-chain logic is not exposed to the user or a higher sensor sensitivity is
required, a routing delay sensor (RDS) could be implemented instead [65]. The
sensor architecture is illustrated in Fig. 5.4. The main difference between an RDS
and a TDC is the absence of the tapped delay line. The clock that drives the tapped
delay line in a TDC is instead routed without constraints to the output registers in
the RDS. Additionally, the locations of the FFs in the output registers are decided
by the FPGA placer, without tight constraints, in the interest of helping the FPGA
router find good-quality routes to the FFs. The absence of constraints allows the
delay difference between two sensor bits to become even lower than in the case
of a TDC, further improving sensor sensitivity. Finally, once the sensor sample is
computed as the Hamming weight of the output register, the issue of delays to every
output bit being different and in no predictable order becomes irrelevant.
In the absence of on-chip activity, the TDC output typically is constant (modulo
background noise) and determined by three parameters: the clock frequency, the
initial delay (i.e., the phase shift), and the length of the observable delay line
(parameter N in Figs. 5.3 and 5.4). For a given clock frequency, the initial delay
and the parameter N are chosen so that the output register captures a single clock
transition in every clock cycle. In other words, the output register should always
capture a sequence of ones followed by a sequence of zeros (with a possibly
imperfect transition); the location of the FF at which the transition occurs
corresponds to the depth of propagation of the clock signal through the delay line.
Once the on-chip voltage starts fluctuating due to the activity of the victim circuit, the delays
of the elements in the sensor change, and consequently, so does the sensor output.
The sensor is well-dimensioned and calibrated (i.e., the initial delay and the length
of the delay line are well-chosen) if, for the entire duration of the measurement,
the Hamming weight of the output register lies in the range 0 < HW(O) =
HW(O_0, O_1, ..., O_{N−1}) < N and only one clock edge is captured. As calibration
is a lengthy process of trial and error, it is convenient to automate it.
One practical approach for automating sensor calibration is as follows. First,
let us assume the initial delay is implemented as a sequence of two delay lines,
as illustrated in Fig. 5.5: one composed of fine calibration slices (e.g., carry-chain
logic) and one composed of coarse calibration slices (e.g., LUTs) [20]. Second,
the clock can enter the initial delay line in as many locations as possible (e.g.,
with the help of multiplexers). For a given maximum number of elements in the
fine and coarse calibration slices, the calibration can be performed, as described in
Fig. 5.5 Initial delay line implementation, convenient for automated calibration
Algorithm 1 Automated sensor calibration
1: procedure CALIBRATE(L_IDC, L_IDF, n, N, δ)
2:     for IDC_cnt from 1 to L_IDC do
3:         s_min ← N
4:         IDC ← IDC_cnt; IDF ← 1
5:         SEND_CALIBRATION(IDC, IDF)
6:         for trace from 1 to N_traces do
7:             (s_1, s_2, ..., s_n) ← RECORD_TRACE()
8:             s_min ← MIN(s_min, MIN(s_1, s_2, ..., s_n))
9:         end for
10:        if s_min = N then
11:            break
12:        end if
13:    end for
14:    for IDC_cnt from IDC to L_IDC do
15:        for IDF_cnt from 1 to L_IDF do
16:            s_max ← 0
17:            IDC ← IDC_cnt; IDF ← IDF_cnt
18:            SEND_CALIBRATION(IDC, IDF)
19:            for trace from 1 to N_traces do
20:                (s_1, s_2, ..., s_n) ← RECORD_TRACE()
21:                s_max ← MAX(s_max, MAX(s_1, s_2, ..., s_n))
22:            end for
23:            if s_max = δ then
24:                return IDC, IDF
25:            end if
26:        end for
27:    end for
28:    return failure
29: end procedure
is considered completed, and the desired number of coarse and fine elements is
found. It should be noted, though, that from the side-channel attack perspective, it
is crucial to correctly record (without any saturation in sensor output) the events
corresponding to victim activity (i.e., voltage drops). For a given length of the
observable delay line and the sensor clock frequency, the threshold δ should be
set with the above constraint in mind.
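Algorithm 1 can be mirrored in software as follows. Here send_calibration and record_trace stand in for platform-specific communication with the sensor and are assumptions, not any real API; the stub hardware model in the usage example is purely illustrative and exists only to exercise both calibration phases.

```python
def calibrate(send_calibration, record_trace, L_idc, L_idf,
              n_traces, N, delta):
    """Software mirror of Algorithm 1. Phase 1 sweeps the coarse
    delay until the delay line saturates (the minimum sample equals
    N); phase 2 then sweeps coarse and fine delays until the maximum
    sample equals the target threshold delta, leaving headroom so
    that victim-induced voltage drops never saturate the sensor."""
    idc_found = None
    for idc in range(1, L_idc + 1):
        send_calibration(idc, 1)
        s_min = min(min(record_trace()) for _ in range(n_traces))
        if s_min == N:
            idc_found = idc
            break
    if idc_found is None:
        return None  # calibration failed
    for idc in range(idc_found, L_idc + 1):
        for idf in range(1, L_idf + 1):
            send_calibration(idc, idf)
            s_max = max(max(record_trace()) for _ in range(n_traces))
            if s_max == delta:
                return idc, idf
    return None

# Toy hardware model, purely to exercise both phases: the coarse
# delay deepens the captured propagation depth up to saturation,
# and the fine delay then trims it back (real sensor behavior is
# hardware-specific).
state = {}
def send_calibration(idc, idf):
    state["depth"] = min(64, idc * 10 - idf + 1)
def record_trace():
    return [state["depth"]] * 4

print(calibrate(send_calibration, record_trace,
                L_idc=8, L_idf=50, n_traces=3, N=64, delta=32))
# → (7, 39) for this toy model
```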
and-conquer approach to extract the secret key: instead of attacking the whole
128-bit key (a search space of 2^128 possible solutions), it exploits the property that
AES operates on key bytes, which allows an attack on each key byte independently
and, most importantly, reduces the search space to a manageable size of 16 × 2^8
guesses (i.e., 2^8 = 256 per byte). During the attack, the adversary estimates the correlation between the
two datasets: the first is created by recording N power traces and corresponding
ciphertexts. The second dataset is the modeled leakage of the device for the N
observed ciphertexts. For each key byte, all possible 256 values (key guesses)
are considered. Each guess is then used to model power consumption for the
observed ciphertexts with the help of a leakage modeling function. Finally, for each
trace sample of interest, the correlation between the measured and the modeled
power consumption is computed using the expression for the Pearson correlation
coefficient [59]:
ρ(k) = cov(H_k, T) / (σ_H_k · σ_T),   (5.1)

where T is the vector of measured trace samples and H_k the power consumption
modeled under key guess k.

Fig. 5.6 Overview of the CPA attack: the ciphertexts and each key guess feed the power
consumption model, and the modeled power is correlated with the measured traces using the
Pearson correlation
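The CPA flow just described can be sketched end to end on synthetic data. Note that the leakage model below is a plain Hamming weight of data ⊕ guess rather than an S-box output, purely to keep the sketch short, and all names are ours, not from the chapter's artifacts:

```python
import numpy as np

def cpa_scores(traces, data):
    """Correlation power analysis on one key byte: for each of the
    256 guesses, model the leakage as HW(data ^ guess) and compute
    the Pearson correlation with the measured samples."""
    hw = np.array([bin(x).count("1") for x in range(256)])
    scores = np.empty(256)
    for guess in range(256):
        model = hw[data ^ guess].astype(float)
        scores[guess] = np.corrcoef(model, traces)[0, 1]
    return scores

rng = np.random.default_rng(0)
key = 0xA7
data = rng.integers(0, 256, size=2000)
hw = np.array([bin(x).count("1") for x in range(256)])
traces = hw[data ^ key] + rng.normal(0, 1.0, size=2000)  # noisy leakage
scores = cpa_scores(traces, data)
print(hex(int(np.argmax(scores))))  # → 0xa7: the guess with maximal
                                    # correlation is the key byte
```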
With CPA, the key is partitioned into smaller parts that are easy to attack independently. However, a
security evaluation of sub-keys does not necessarily illuminate the security of the
entire key. In particular, when an attacker fails to break all the key bytes with
correlation power analysis, the remaining effort for a full key recovery through
trial and error is not quantifiable. To overcome this issue, one can use a key
rank estimation metric. In the unknown-key setting, key rank estimation uses
heuristics to approximate the rank of an unknown secret key without performing key
enumeration [6]. In the known-key case, it uses the correlation computed with the
CPA (or another scoring metric) to sort the key guesses and estimate the remaining
effort for an attacker. For example, if an attacker has no side-channel information,
then the key rank equals the entire key space, i.e., 2^128 for AES-128. Alternatively,
when the entire key is broken, the key rank drops to zero. While there are several
ways of computing the key rank estimation metric, in this chapter, we use the
histogram-convolution-based algorithm of Glowacz et al. [10] where the key rank is
upper and lower bounded.
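A simplified sketch of the histogram-convolution idea follows; this is our own minimal reimplementation to convey the principle, not the exact algorithm of [10]. Per-byte scores become log scores, each byte's scores are histogrammed on a common grid, the histograms are convolved (convolution adds bin indices, i.e., summed log scores), and the rank is estimated as the number of full keys scoring above the correct key:

```python
import numpy as np

def key_rank_log2(scores, correct_bytes, nbins=512):
    """Estimate log2 of the key rank. scores: 16 arrays of 256
    positive per-guess scores (e.g., CPA correlations);
    correct_bytes: the known key bytes."""
    logs = [np.log(np.maximum(np.asarray(s, float), 1e-12))
            for s in scores]
    lo = min(l.min() for l in logs)
    hi = max(l.max() for l in logs) + 1e-9
    width = (hi - lo) / nbins
    edges = np.linspace(lo, hi, nbins + 1)
    hist = np.array([1.0])
    correct_sum = 0.0
    for l, kb in zip(logs, correct_bytes):
        h, _ = np.histogram(l, bins=edges)
        hist = np.convolve(hist, h.astype(float))  # adds log scores
        correct_sum += l[kb]
    # Bin j of the convolved histogram holds keys whose summed log
    # score is roughly len(logs)*lo + j*width; count the keys above
    # the correct key's bin.
    idx = int((correct_sum - len(logs) * lo) / width)
    rank = hist[idx + 1:].sum()
    return float(np.log2(rank + 1.0))

rng = np.random.default_rng(1)
scores = [rng.random(256) for _ in range(16)]
best = [int(np.argmax(s)) for s in scores]    # attack fully succeeded
worst = [int(np.argmin(s)) for s in scores]   # attack learned nothing
print(key_rank_log2(scores, best), key_rank_log2(scores, worst))
# low rank when the correct bytes score best; close to 128 otherwise
```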
Fig. 5.7 System architecture for running a remote power side-channel attack
Fig. 5.8 Power side-channel trace recorded with RDS and TDC sensors [65]
trigger for sensor trace storage. Instead of a trigger, one could use trace alignment
techniques on the recorded sensor traces [62]. The clock frequencies of the sensor
and the AES are set to 200 MHz and 20 MHz, respectively. We test the attack with
both TDC and RDS sensors [65], both with a 128-bit output register. The value in
the output register is converted into the numerical value of one sensor sample using
the Hamming weight. For sensor calibration, the coarse (LUTs, latches) and fine
(carry) elements are used, as described in Algorithm 1.
Figure 5.8 shows the power trace recorded by the two sensors during the
encryption of one plaintext on the Sakura-X board. The AES rounds are clearly
visible. The traces have 128 samples, covering the entire duration of one encryption.
Because the sensor calibrations are independently performed, the vertical offsets of
the traces differ. More importantly, note that the peak-to-peak amplitude of the RDS
trace is considerably higher than the peak-to-peak amplitude of the TDC trace. Next,
we run 500,000 encryptions and record the corresponding traces. In Fig. 5.9, we plot
the results of the statistical analysis of the sensor samples, notably the variance of
one bit in the sensor output. In the left half of the figure, the total number of bits
with nonzero variance is shown. On the right, the variance per bit for all output
bits is visualized. For the TDC, we find 11 bits with a nonzero variance; these bits
cover the entire range of clock propagation depths captured by the sensor across
all encryptions. Under the same conditions, the RDS exhibits 47 bits with nonzero
variance, and given the absence of a tapped delay line, the bits with nonzero variance
are unequally distributed. The higher variation in trace sample values and the higher
Fig. 5.9 Number of bits toggling during trace acquisition (left), and the variance of the bits in the
sensor output register (right) [65]
Fig. 5.10 Signal-to-noise ratio for the RDS and TDC, computed on the least-significant byte of
the output of the ninth AES round [65]
number of bits with nonzero variance suggest that an attacker with RDS can likely
break the secret key faster. Later experiments will confirm this hypothesis.
In addition to computing the variance of the output bits, we compare the signal-
to-noise (SNR) ratio of the traces recorded by the TDC and the RDS on the Sakura-X
board. The SNR as a side-channel evaluation metric is defined as the ratio between
the useful signal, i.e., the variance of the data-dependent power consumption and
noise [55]. It can be obtained from power side-channel traces without performing an
attack and is most commonly used to identify trace samples with significant leakage
(i.e., samples having a strong correlation with the secret). Figure 5.10 shows the
results. For both sensors, we can observe two peaks: in sample 102 (the beginning
of the last AES round) and in sample 112 (the end of the last round, i.e., when
the ciphertext is saved in the state register). Compared to the TDC, the SNR of the
signal picked up by the RDS sensor approximately doubles in these two points of
interest. Furthermore, across all our experiments and every byte of the intermediate
value, the SNR in sample 112 for RDS is consistently higher than for the TDC, by
a factor of 1.6× on average, with the maximum reaching 2.9× [65].
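The SNR metric defined above can be sketched as a generic per-sample computation on synthetic traces; the function and variable names are ours:

```python
import numpy as np

def snr(traces, labels):
    """Per-sample SNR: variance of the per-class mean traces (the
    data-dependent signal) divided by the mean within-class variance
    (the noise). labels partition the traces by an intermediate
    value, e.g., a byte of an AES round output."""
    classes = np.unique(labels)
    means = np.array([traces[labels == c].mean(axis=0) for c in classes])
    noise = np.array([traces[labels == c].var(axis=0) for c in classes])
    return means.var(axis=0) / noise.mean(axis=0)

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=4000)
traces = rng.normal(size=(4000, 5))    # 5 samples per trace
traces[:, 2] += labels                 # only sample 2 leaks
print(np.argmax(snr(traces, labels)))  # → 2, the leaky sample
```

This matches how the metric is used in practice: samples with a high SNR are the ones worth attacking.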
To compare the sensors in the power side-channel attack scenario, we attack
power traces recorded on the Alveo U200 data center accelerator card using CPA
and the key rank estimation metric and repeat the attack a number of times [65].
Fig. 5.11 Key rank estimation for the TDC and RDS sensors on the Alveo U200 data center
card [65]
Fig. 5.12 CPA attack on the last AES round targeting the eighth byte of the key [16]. To the left,
the correlation for all key guesses across the trace samples corresponding to the last AES round,
with the correlation of the correct key byte highlighted in orange. To the right, the rank evolution
of the correct key byte
Figure 5.11 shows the attack results. With almost two million traces, more than
half of the key bits are broken with RDS. The attack with TDC is, on average,
notably less successful, which is not surprising given the results of the trace analyses
discussed previously.
Since signal coupling varies with the target FPGA and PDN quality, we repeat
the analyses, this time targeting AMD UltraScale+ FPGAs available in the Amazon
EC2 F1 platform [13]. The power traces are recorded with a TDC, in which the
output register value is converted into a numerical value of one sensor sample
using a thermometer code. Thirty experimental runs are performed. After each run,
the FPGA instance is shut down and restarted, potentially allowing the resource
allocator to assign a different FPGA for use during each experiment. With CPA,
the full key was successfully broken in 42% of attempts. Of all attempts to
attack individual key bytes across one million traces, 48% were successful. These
variations in the attack success are expected: first, because of low SNR and, second,
because of device timing characteristics that are known to be affected by process
and temperature variations [21]. Figure 5.12 shows the CPA attack results on the last
AES round for the eighth byte of the key [16]: the left half shows the correlation for
all key guesses for the trace samples corresponding to the last AES round, obtained
with one million traces. The right half illustrates the rank evolution of the correct
key byte.
In the final set of experiments, we examine how the choice of the numerical
transformation (i.e., encoding) applied to the sensor output register—to obtain the
numerical value of the sensor sample—impacts the attack success. Typically for
TDCs, the output register value (i.e., one sensor sample) is a sequence of zeros
followed by a sequence of ones (or vice versa), where the transition corresponds
to the propagation depth of the clock driving the tapped delay line. In practice, the
transition is not always perfect, and one or more pulses (commonly called bubbles)
may be observed. The bubbles occur because the delays in the observable delay line
are not always strictly monotonic [16]. Therefore, it is conceivable that the choice
of encoding may impact the attack’s success. To test this hypothesis, we take the
traces collected previously (on Amazon EC2 F1 instances), apply three different
encodings, and repeat the CPA attack. The baseline encoding is the thermometer
code, i.e., the sensor sample is the propagation depth of the clock (ignoring the
bubbles). In the second test case, we reorder the output register bits so that the
delays from the input of the delay line to each of the bits in the output register are
monotonically increasing, at least according to static timing analysis. Reordering is
performed in software (off-chip). In the third and the last test case, a sensor sample
is computed as the Hamming weight of the output register. The result of the CPA
attack on the obtained three datasets (original, permuted, and Hamming weight)
is shown in Fig. 5.13. As expected, the baseline is the least-performing option.
Timing-aware reordering of the bits and Hamming-weight encoding both capture the
signal of interest better and hence permit a more successful attack. In
conclusion, Hamming weight is a good approach for encoding TDC bits, while it is
also naturally suited for routing delay sensors [65].
Fig. 5.13 Key rank estimation when targeting the full 128-bit key of the AES encryption on AMD
UltraScale+ FPGAs in the Amazon EC2 F1 instances, for the original (i.e., baseline), permuted,
and Hamming weight traces
5.3.4 Countermeasures
Fig. 5.14 Digital circuit timing parameters, which together form constraints that must be
respected for correct operation
injection exploits and the parameters affecting their success. The considered exploit
uses one FPGA circuit to carry out an attack against another, co-located on the same
FPGA die. Two logic signals must have faults, such that the registers capturing
their values record a binary combination that should not normally occur. We then
examine how this type of fault-injection attack can compromise the security of a
more complex system. The section continues with the discussion on how the effects
of FPGA-based PDN manipulation can affect computing components other than the
programmable logic. A discussion of countermeasures closes the section.
For the typical sequential circuit illustrated in Fig. 5.14, the choice of the clock
frequency depends on the circuit’s timing constraints. Given that registers supply
the inputs of the combinational logic and the outputs are saved in registers, the
circuit designer must ensure that the design meets the setup and hold constraints
expressed as follows [60]:
Tclk ≥ Tclk2q^max + Tsetup + Tcomb^max − Tskew ,   (5.2)

Thold ≤ Tclk2q^min + Tcomb^min .   (5.3)
Here, Tclk is the clock period. Tclk2q is the time between the arrival of the clock
edge and the corresponding update of a flip-flop’s output, or in other words, the
time the input register needs to supply the new value to the combinational circuit.
The superscripts refer to the maximum and minimum values this delay can take.
Tsetup is the time for which the value at the input of the output register must remain
stable before the clock edge arrives, as shown in Fig. 5.15. Tcomb is the delay the
combinational logic part takes to compute the result. Depending on the input values,
this delay can vary as more levels of logic gates may need to switch to produce the
Fig. 5.15 Timing of a D flip-flop highlighting the setup and hold times
final output. Accordingly, the timing constraints also consider this delay’s maximum
and minimum values. Different clock arrival times to the input and output registers
are reflected in Tskew. Finally, Thold is the time for which the input to a register needs
to remain stable after the clock edge to guarantee a correct update of the register’s
output, as shown in Fig. 5.15.
If the inequalities (5.2) or (5.3) are not met, the circuit may not operate correctly.
Therefore, the first step toward correct circuit functionality is to select a suitable
clock frequency. Then, if there are any hold time constraint issues (e.g., due to
clock skew), additional delays can be added. Delay tuning options include changing
the routing to use longer paths, using slower cells (if available), adding buffers, or
using negative-edge registers. FPGA synthesis tools typically support commands
for ensuring that synthesis meets the hold time requirements [5].
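A minimal sketch of checking inequalities (5.2) and (5.3) follows; the timing values (in ns) are illustrative assumptions, not figures from this chapter:

```python
# Plausibility check of the setup constraint (5.2) and hold constraint (5.3).
def setup_ok(t_clk, t_clk2q_max, t_setup, t_comb_max, t_skew):
    # (5.2): the clock period must cover the worst-case data path
    return t_clk >= t_clk2q_max + t_setup + t_comb_max - t_skew

def hold_ok(t_hold, t_clk2q_min, t_comb_min):
    # (5.3): the fastest data path must not outrun the hold window
    return t_hold <= t_clk2q_min + t_comb_min

# Nominal operation: a 5 ns clock (200 MHz) meets setup.
print(setup_ok(5.0, 0.5, 0.3, 3.8, 0.1))   # True
# Undervolting increases the combinational delay; setup is now violated.
print(setup_ok(5.0, 0.5, 0.3, 4.5, 0.1))   # False
print(hold_ok(0.2, 0.3, 0.4))              # True
```

This is exactly the mechanism the attacker exploits: Tcomb grows with decreasing voltage until inequality (5.2) no longer holds.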
An adversary aiming to generate circuit faults can target the circuit’s timing
behavior. If the external clock input is accessible, the adversary can manipulate the
clock frequency to violate inequality (5.2). For example, the adversary can switch to
a faster clock for a short time before switching back to the normal clock to introduce
glitches [31]. While control over the clock can allow the adversary precise control
over the glitch and, accordingly, the injected fault [47], that control is not always
available. In remote multitenant FPGA scenarios, the adversary cannot control the
clock source of another tenant’s circuit. Instead, the attacker can focus on increasing
the combinational delays to eventually violate inequality (5.2). In this case, the
relationship between circuit voltage and delay (as the voltage decreases, the
delays increase) serves as an attack enabler [37]. While the adversary does not
have access to the FPGA power supply, they can construct malicious FPGA power
waster circuits that consume excessive power, overwhelming the power supply and
resulting in transient voltage fluctuations.
As their name suggests, power wasters are circuits whose main aim is to consume
as much dynamic power as possible. The dynamic power consumption is directly
proportional to the square of the supply voltage, the switching frequency of the
signal being considered, and the load capacitance [70]. Given that the adversary
has no direct control over the supplied voltage in a remote attack scenario, the only
5 Remote Power SCA and FI Attacks on Multitenant FPGAs 119
remaining factors that can increase power consumption are the switching frequency
and the load capacitance of the logic gates within the design. Accordingly, power
waster designs aim to generate high-frequency signals.
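The dynamic-power relationship can be made concrete with a small sketch; the activity factors, capacitance, voltage, and frequencies below are illustrative assumptions only:

```python
# Dynamic power P = alpha * C * V^2 * f, with activity factor alpha,
# load capacitance C (F), supply voltage V (V), and frequency f (Hz).
def dynamic_power(alpha, c_load, v_dd, freq):
    return alpha * c_load * v_dd ** 2 * freq

base = dynamic_power(0.1, 10e-15, 0.85, 200e6)   # typical user logic
waster = dynamic_power(1.0, 10e-15, 0.85, 1e9)   # RO toggling near 1 GHz
print(waster / base)   # 50.0: same LUT, far more power drawn
```

With the voltage fixed by the platform, the ratio shows why maximizing the switching frequency (and, later, the load capacitance) is the attacker's only lever.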
Many approaches for power waster designs have been proposed; for instance,
a high-frequency clock signal generated by a phase-locked loop (PLL) within the
FPGA can be used to drive long wires or a large number of sequential elements that
change state with every clock cycle (e.g., a shift register initialized to an alternating
pattern of ones and zeros). Efficient power wasters can be built with circuits that
generate glitches; glitches can be created by overclocking an otherwise benign
design (e.g., overclocking encryption rounds of a block cipher [24, 57]) or by
crafting signal paths of various delays that switch at different times and XORing the
signals [48]. The most common technique for generating a high-frequency signal is
to find the shortest path within the programmable logic, create an oscillator out of
it, and use it to the adversary’s advantage.
Ring oscillators (ROs) are circuits that use an odd number of inverters to
change the value of a logical signal and then use that same signal as input to the
inverters through a feedback loop. For the highest frequency of oscillation, one
inverter suffices. One FPGA look-up table (LUT) can be programmed to act as
an inverter, and its output can be connected back to its input with a short routing
path. The resulting FPGA-based RO can generate signals at a frequency surpassing
the maximum clock frequency that an on-chip PLL can generate. Depending on the
technology node, the oscillation frequency can surpass 1 GHz [22].
When used within an FPGA design, ROs typically employ an enable signal that
allows the user control over when the RO oscillates. Therefore, in practice, an
LUT implements a NAND instead of a NOT functionality, as shown in Fig. 5.16.
Even though an RO generates a high-frequency signal, one LUT cannot consume
enough power to overwhelm the power supply—an adversary needs to instantiate
a large number of ROs. Again, given that one RO cannot significantly change
the consumed power, a group of ROs can share a common enable signal. Such a
collection of ROs, representing the smallest unit the attacker can control, is often
called a block or a bank of ROs. Figure 5.17 shows the voltage drop resulting from
the activation of a varying number of ROs. A Genesys-ZU board, equipped with
an AMD Zynq UltraScale+ multi-processor system-on-chip (XCZU3EG) [14], is
targeted. The voltage readings in Fig. 5.17 are obtained using the TDC described
in Sect. 5.3.1. Not surprisingly, under the same enable signal activation pattern,
the more ROs are active, the more significant the resulting voltage drop. The maximum
number of ROs in Fig. 5.17 corresponds to 60,963 LUTs; such a high number
Fig. 5.17 Voltage drop as a function of the number of active ROs. The enable signal is toggling
with a period of 1.1 µs. RO blocks are activated (and later deactivated) one by one. The baseline
corresponds to no RO blocks active
was achieved by implementing two ROs in one 6-input LUT, since LUTs in the
UltraScale FPGA family support two independent outputs [69].
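A behavioral sketch of the NAND-gated ring oscillator of Fig. 5.16 is shown below, modeling one inverter delay per step; this is a functional illustration, not a timing-accurate model:

```python
# Behavioral model of a NAND-gated RO: the LUT computes NAND(enable, feedback),
# so the output toggles only while the enable is high and is stuck at 1
# otherwise (NAND with a 0 input always yields 1).
def step_ro(enable, out):
    return not (enable and out)   # NAND of enable and the fed-back output

out = False
trace = []
for t in range(8):
    enable = t >= 2               # enable asserted from step 2 onward
    out = step_ro(enable, out)
    trace.append(int(out))
print(trace)   # [1, 1, 0, 1, 0, 1, 0, 1]: stuck high until enabled, then toggles
```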
Another way to boost the power consumption of ROs without using additional
LUTs is to increase the load capacitance at the output of each RO. This increase
can be achieved by connecting an RO output to another LUT, passing the signal
through routing resources, and driving additional loads to consume more power.
For example, wire-based power wasters, shown in Fig. 5.18, implement a group of
sources and sinks for each block. Then, with placement constraints, the signal from
each source is forced to travel a certain distance (using routing resources) to the
corresponding sink [17]. Enhanced ROs (EROs), which use LUTs implementing
ROs and connect the output of each LUT to three other LUTs, can also be used,
ensuring that the additional inputs do not change the RO functionality. Figure 5.19
shows the implementation of an ERO instance using four LUTs [36].
Evaluating EROs and comparing them to ROs on an AMD UltraScale+ FPGA
(the FPGA family used by many commercial cloud service providers) demonstrates
their higher power consumption [75]. Figure 5.20 shows the TDC sensor readings
when using 60,963 LUTs as ROs and 58,981 LUTs as EROs. Despite the lower
number of LUTs used for the EROs, the resulting voltage drop from their activity is
Fig. 5.19 Implementation of an enhanced ring oscillator using four LUTs, with an enable signal
for attack control [36]
Fig. 5.20 Comparison of the voltage drop induced by EROs and ROs, under the same activation
pattern and the period of the enable signal of 1.4 µs. The baseline corresponds to no power wasters
active
more pronounced. This increase in voltage drop highlights the effect of the increased
load capacitance and routing resources used [42].
Activating a large number of power wasters is not sufficient for a successful fault-
injection attack. The malicious party also needs to consider the following factors: the
timing of the fault, the desired voltage drop, and the stealthiness of the exploit.
Fault-injection timing is important as the effect of a fault depends on the computation
affected. Attacks leveraging faulty outputs typically require the fault to be injected
at a specific point of the algorithm under attack. The adversary can leverage side
channels to determine when the circuit executes the target function and activate the
power wasters suitably [39]. The stealthiness of the exploit is strongly related to its
Fig. 5.21 Enable signal and controllable parameters for remote undervolting attacks
timing, as a well-timed undervolting does not need to last long. Hiding the exploit
also requires controlling its strength or the voltage drop. If the voltage decreases
too much, the FPGA may become unavailable, resulting in a denial of service. The
adversary must carefully control the voltage drop to match the desired delay increase
while avoiding a reset of the entire FPGA.
The voltage drop is a function of the FPGA's PDN parameters. Continuous
activation of power wasters is typically not sufficient for successful fault injection.
If the number of power wasters is insufficient to reset the board, the PDN will
suffer an immediate voltage drop but then compensate for the increased activity,
so the final voltage drop is not as significant. If, instead, the power
wasters are activated periodically, as shown in Fig. 5.21, the power supply may not
be able to recover within the given time. In addition, the frequency of the switching
affects the voltage drop. The impedances of the PDN capacitive and inductive
elements are a function of the frequency, and the voltage drop is a function of the
impedance and the drawn current. Consequently, an activation signal matching the
PDN resonance frequency results in the most significant voltage drop [72].
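As a rough illustration, the resonance frequency of a simplified series-LC model of the PDN can be estimated as f_res = 1/(2π√(LC)); the inductance and capacitance values below are assumptions, not measured board parameters:

```python
import math

# Resonance of a simplified series-LC PDN model: an enable frequency near
# f_res = 1 / (2*pi*sqrt(L*C)) maximizes the impedance seen by the current
# transients and thus the voltage drop.
def resonance_hz(l_henry, c_farad):
    return 1.0 / (2 * math.pi * math.sqrt(l_henry * c_farad))

f = resonance_hz(1e-9, 100e-6)   # 1 nH package inductance, 100 uF decoupling
print(round(f / 1e3, 1), "kHz")  # 503.3 kHz
```

For these (assumed) values, the resonance sits in the hundreds of kHz, which is consistent with the microsecond-scale enable periods used in the experiments above.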
Therefore, the adversary needs to be able to control several parameters that shape
the enable signal of the power wasters. Figure 5.21 illustrates these parameters:
• The start of the attack, i.e., the moment at which the power wasters are enabled
for the first time, corresponding to time 0 in Fig. 5.21
• The period of the enable signal, i.e., the time between two subsequent activations
of the enable signal
• The duty cycle of the enable signal, i.e., the time during which the enable signal
remains high over the period of the enable signal
• The duration of the attack
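The four parameters above can be sketched as a simple waveform generator; times are in clock cycles, and the function and its arguments are illustrative rather than the chapter's implementation:

```python
# Generate the enable waveform of Fig. 5.21 from its four controllable
# parameters: attack start, enable period, duty cycle, and attack duration.
def enable_waveform(start, period, duty_cycle, attack_duration, total):
    wave = []
    for t in range(total):
        active = start <= t < start + attack_duration   # within the attack
        in_high_phase = (t - start) % period < duty_cycle
        wave.append(int(active and in_high_phase))
    return wave

# 2 cycles high out of every 4, attack lasting 8 cycles, starting at t = 2
print(enable_waveform(start=2, period=4, duty_cycle=2, attack_duration=8, total=14))
# [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
```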
A well-chosen enable signal period allows for an effective power draw. Control-
ling the duty cycle of the enable signal also allows for controlling the voltage drop
since a longer-lasting activation draws more power than a shorter activation. Figure
5.22 shows the TDC sensor readings when using EROs with varying frequencies
for the enable signal on an AMD UltraScale+ FPGA [14]. Figure 5.22 compares the
voltage drop corresponding to different frequencies to the baseline. Using 26,908
Fig. 5.22 Comparison of the voltage drop induced by different periods of the attacker enable
signal for 26,908 LUTs implementing an ERO-based attacker. The baseline corresponds to no
power wasters active
LUTs as EROs (activated gradually in steps of 3844 LUTs), we can see that the
value and shape of the voltage drop depend on the period of the enable signal. If that
period is too short, the lowest voltage lasts for a short time since the power wasters
are active for a limited number of clock cycles. The longer the period, the longer the
voltage drop lasts. However, the period increase should also take into account the
attack duration. In Fig. 5.22, the longer period only results in one continuous voltage
drop, whereas a period of 1500 ns results in the EROs being active twice during the
attack, with the second voltage drop being more significant than the first and than
that achieved with a longer period. The results in Fig. 5.22 highlight the importance
and impact of carefully controlling the enable signal.
The waveform of the enable signal can be generated from external sources and
sent as an input to the FPGA. However, in a remote scenario, the adversary must
generate the signal on the chip. Access to a reconfigurable phase-locked loop (PLL)
may facilitate the creation of arbitrary signal waveforms. Finally, the adversary
can implement a control circuit to receive parameters and generate the signal
accordingly. Various solutions exist; Algorithm 2 shows an example leveraging a
counter to keep track of the number of clock cycles elapsed for the attack and the
toggling of the enable signal.
The count process is called at every rising clock edge. The state of the input
signals determines the action taken. If the reset signal is asserted, then all counters
are reset to 0. If the user has asserted the start signal, then the durationCnt
is incremented, while periodCnt is incremented if the period of the enable signal
has not passed. If it has, the periodCnt goes back to zero to start counting another
period. If the start signal is deasserted, then the counters keep their old values.
Algorithm 2 Algorithm that generates the signals that count the duration of the
exploit and the duration of each period for the toggling of the enable signal
Input: reset, signal to reset the counters
Input: start, signal to start the attack and the counters
Input: period, number of clock cycles which define the period of the enable signal
Output: durationCnt, counter for the attack duration
Output: periodCnt, counter for the attack period
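Since the body of Algorithm 2 is only summarized in the text, the following Python model of the count process is a sketch under the stated behavior; in particular, wrapping periodCnt at period − 1 is an assumption:

```python
# Behavioral model of the count process of Algorithm 2, invoked once per
# rising clock edge. Returns the updated (durationCnt, periodCnt) pair.
def count(reset, start, period, duration_cnt, period_cnt):
    if reset:
        return 0, 0                       # reset both counters
    if start:
        duration_cnt += 1                 # attack running: count its duration
        if period_cnt < period - 1:
            period_cnt += 1               # period not yet elapsed
        else:
            period_cnt = 0                # period elapsed: count a new one
        return duration_cnt, period_cnt
    return duration_cnt, period_cnt       # start deasserted: counters hold

d, p = 0, 0
for _ in range(6):                        # six clock edges with period = 4
    d, p = count(False, True, 4, d, p)
print(d, p)   # 6 2
```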
The counters produced by the count process can then be used with other
parameters provided to the circuit to generate either a continuous enable signal
or a periodically activated and deactivated signal. Algorithm 3 shows an example
for enable signal generation. The process is called at every rising clock edge. The
obtained frequency is the result of dividing the clock frequency by the specified
number of clock cycles in the period input. The dutyCycle input specifies the
number of clock cycles for which the power wasters are active. If the system has
been reset, all of the enable signals are deactivated, and the end signal is set to
zero. If the start signal is deasserted, which is equivalent to the attack ending or
being stopped, then the end signal is high, while all enable signals are deactivated.
Otherwise, depending on the toggle input, the enable signals of the specified
number of power-waster blocks (NB) are either continuously high for the duration
of the attack, or periodically enabled and disabled, depending on the dutyCycle
input and the values of the counters.
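The enable-generation logic of Algorithm 3 can likewise be sketched in Python; the exact comparisons against the counters are assumptions, as only the algorithm's interface and a summary of its behavior are given:

```python
# Behavioral sketch of Algorithm 3: derive the enable signals for the N
# power-waster blocks from the counters produced by Algorithm 2.
def gen_enable(n, reset, start, n_b, toggle, duty_cycle,
               duration_cnt, period_cnt, duration):
    if reset:
        return [0] * n, 0                       # all disabled, end = 0
    if not start or duration_cnt >= duration:
        return [0] * n, 1                       # attack over: end = 1
    if toggle:
        active = int(period_cnt < duty_cycle)   # high phase of the period
    else:
        active = 1                              # continuous activation
    return [active] * n_b + [0] * (n - n_b), 0  # enable only NB blocks

enable, end = gen_enable(n=4, reset=False, start=True, n_b=2, toggle=True,
                         duty_cycle=2, duration_cnt=5, period_cnt=1, duration=100)
print(enable, end)   # [1, 1, 0, 0] 0
```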
5.4.4 Victim
With precise control over the activity of the power wasters, the adversary can lower
the voltage of the FPGA chip and increase the delays of other circuits within the
programmable logic. The attacker can then target a victim that contains a secret or
is processing sensitive information. Fault injection typically targets a specific part of
Algorithm 3 Algorithm for generating the enable signals that control the activity of
the power waster blocks, according to the waveform in Fig. 5.21
Constant: N, Total number of power waster blocks instantiated
Input: reset, signal to reset the counters and disable the power wasters
Input: start, signal to start the attack and the counters
Input: NB , number of power wasters to enable
Input: toggle, flag specifying whether the enable signal toggles (i.e., follows a period pattern)
Input: dutyCycle, number of clock cycles (in the activation period) to keep the enable active
Input: durationCnt, counter for the attack duration
Input: periodCnt, counter for the period for the toggling
Input: duration, number of clock cycles for which the attack should last
Output: enable[N − 1 : 0], enable signal for the power waster blocks
Output: end, signal indicating that the attack is over
the circuit to affect the output. This action allows the adversary to learn information
that is otherwise inaccessible.
Encryption cores have a secret key and process sensitive information (the
plaintexts are encrypted so that they are not accessible to other parties when
transmitted through communication channels or stored in potentially compromised
media, for example). These circuits can thus be victimized. Let us consider a fault-
injection exploit targeting an AES encryption core. For AES attacks, adversaries
usually leverage differential fault analysis (DFA), which relies on AES input control.
The malicious party sends the same plaintext twice, once in the absence of an attack
and once targeting a specific phase of AES algorithm execution during an attack.
The two encryptions result in one correct and one faulty ciphertext. By collecting
enough pairs of correct and faulty outputs, the adversary can retrieve the encryption
key and break the system’s security. Machine learning inference cores are also
interesting attack targets because of their proprietary architectural parameters. These
circuits often process sensitive data that should not be accessible to other parties and
their outputs can affect decisions made within a larger system. A fault injected into
a model can lead to erroneous inference results.
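The first step of the DFA workflow described above, pairing correct and faulty ciphertexts and locating the affected bytes, can be sketched as follows; the key-recovery step itself is cipher-specific and omitted:

```python
# DFA starts by encrypting the same plaintext twice, once without and once
# with a fault, and examining the XOR difference of the two ciphertexts.
# Nonzero bytes mark where the fault propagated.
def fault_diff(correct, faulty):
    return bytes(a ^ b for a, b in zip(correct, faulty))

# Hypothetical 128-bit ciphertexts differing only in the last byte.
correct = bytes.fromhex("00112233445566778899aabbccddeeff")
faulty  = bytes.fromhex("00112233445566778899aabbccddee0f")
diff = fault_diff(correct, faulty)
print([i for i, d in enumerate(diff) if d])   # [15]: only the last byte differs
```

Collecting many such pairs lets the adversary test key-byte hypotheses against the observed difference patterns and eventually recover the key.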
The remainder of this section considers AES as the target victim, but instead of
performing DFA, it assumes a scenario in which the cryptographic accelerator is
compromised by an embedded hardware Trojan, triggered only in the presence of
the attack. The first step of the attack is finding two signal paths within AES that
together cannot reach a specific stable state. In other words, the adversary must
identify two correlated signals whose downstream capture registers can never
record a specific two-bit combination (e.g., they never reach 11). A simple
circuit satisfying the criteria is shown in Fig. 5.23: two signals produced by two
AND gates that share an input, except that one gate (g0 in Fig. 5.23) uses the input as
is, while the other uses the inverted version of the input (g1 in Fig. 5.23). Therefore,
the two AND gates cannot both output a stable 1, ignoring glitching. Signals exhibiting
such a correlation can exist in circuits at a scale broader than the simple example in
Fig. 5.23. The adversary’s goal, once two such signals are identified, is to increase
the delay of the two correlated paths so that the outputs of downstream clocked
registers reach the impossible output state (i.e., the setup constraint is violated).
Triggering such an impossible state can be used for malicious purposes, including
activating a stealthy hardware Trojan to leak secret information.
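The correlated-signal trigger of Fig. 5.23 and its use for leaking the secret can be modeled in a few lines; the gate names follow the text (g0 uses the shared input directly, g1 its inverse), and the second inputs b and c are placeholders:

```python
from itertools import product

# The trigger pair: two AND gates share input a, one of them inverted, so
# (g0, g1) can never be (1, 1) in fault-free operation. A timing fault can
# nonetheless make the capture registers record that impossible state.
def trigger(a, b, c):
    g0 = a and b          # uses a directly
    g1 = (not a) and c    # uses the inverted a
    return g0, g1

# Exhaustive check: no input combination yields (True, True).
assert all(trigger(a, b, c) != (True, True)
           for a, b, c in product([False, True], repeat=3))

def trojan_output(g0, g1, ciphertext, secret):
    """The two signals select between the normal output and the leaked secret."""
    return secret if (g0 and g1) else ciphertext

print(trojan_output(True, False, "ciphertext", "key"))  # ciphertext (normal)
print(trojan_output(True, True, "ciphertext", "key"))   # key (faulted state)
```

Because ordinary functional testing can never drive (g0, g1) to (1, 1), the Trojan stays dormant until the undervolting attack forces the impossible state.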
We implemented RO-based power wasters to inject faults into an openly available AES
circuit [26] using an Intel Arria 10 GX FPGA deployed within a commercial cloud
instance [29]. Two preselected signals within the AES substitution box (SBox)
form the attack target; the signals are sampled and their values collected to analyze
whether they reach the desired (otherwise impossible to reach) state when under
attack. The number of observed faults as a function of the number of active ROs is
shown in Fig. 5.24. The data are averaged across 35 experiments, each containing ten attack runs.
Fig. 5.24 Percentage of the target faults at the output of an AES circuit [26] within a cloud-scale
FPGA [29], as a function of the number of active power wasters. The data are averaged over a total
of 350 attack runs
Fig. 5.25 Percentage of desired faults at the output of an AES circuit [26] within a cloud-scale
FPGA [29], as a function of the period of the enable signal controlling power wasters. The data
are averaged over 150 attack runs
Fig. 5.26 An example victim circuit infected with a stealthy hardware Trojan utilizing two
correlated signals as a trigger
To validate the possibility of fault injection targeting two correlated SBox signals
in a cloud-scale FPGA, the pair is used as a trigger for a stealthy hardware
Trojan, as shown in Fig. 5.26. A similar approach has previously been used by
Mahmoud et al. [43]. Figure 5.27 shows a block diagram of the system. The shared
FPGA contains two user regions. One is assigned to the attacker and one to the
victim. The shell occupies part of the fabric and provides interfaces for both user
regions, allowing them to communicate with the host CPU. The adversary employs
a collection of RO blocks. They can control the number of activated blocks and
the shape of the enable signal. The victim uses an AES core and leverages FIFO
buffers to store and later send the results for processing outside the FPGA. The target
AES core has a hardware Trojan hidden inside it; we show the Trojan separately
to highlight its presence and operation. Activating the Trojan using the correlated
signal pair makes the Trojan stealthy, because when the user tests the designs with a
variety of inputs, the Trojan will never be activated [27]. In particular, as the Trojan
relies on faults to leak information, functional testing of the core will not reveal
discrepancies from the expected results [27].
Fig. 5.27 System design for the exploit targeting Trojan activation using undervolting-based fault
injection in a cloud FPGA. The parameters signal controls the period and the duty cycle of the
enable signal for the ROs
The functionality of the trigger signal pair is illustrated in Fig. 5.26: they act
as select signals for two multiplexers, such that if both select signals are 1, the
encryption key (i.e., the secret) leaks to the AES output. In all other cases, the AES
output is the ciphertext [43], as would normally be expected. The delays of the two
signals are 2.17 ns and 2.23 ns, under the fast timing model at 100 °C. The AES
core operates at 320 MHz. Based on the results in Figs. 5.24 and 5.25, we allocate
12 RO blocks (49,152 ROs) and set the period and duty cycle of the enable signal to
16 clock cycles and 31.25%, respectively. The experiments are repeated ten times,
each experiment consisting of 30 attack runs of 4096 encryptions per run. One run
corresponds to one attack duration, equivalent to 12.8 µs. Pseudorandom plaintexts
and a fixed set of 16 values are supplied to the AES in an alternating fashion.
The results show that the secret key leaks, on average, 88 times per experiment
when using the fixed inputs and 30 times in the case of pseudorandom inputs.
The different input sequences affect the paths used within the AES, which in
turn affects the delays, and accordingly, the leakage depends on the sequence of
inputs. The key becomes the most frequently occurring value at the AES output after 20,480
encryptions, corresponding to five attacks with pseudorandom inputs. In conclusion,
with enough samples and the correct set of attack control parameters, an adversary
can successfully induce faults in a victim design implemented in a cloud-scale
FPGA.
It is worth noting that, given the existence of various levels of sharing of the
power supply network (illustrated in Fig. 5.1), the effects caused by the activity of
the power wasters are not limited to the FPGA. An FPGA-based undervolting attack
on the Zynq UltraScale+ multi-processor system-on-chip revealed that disturbances
created by ROs can affect the CPU, to the point that exploitable faults can be injected
into bare-metal applications or that Linux-based applications can crash [24, 42, 44].
The propagation of electrical-level effects is not limited to the components on the
same chip: voltage disturbances can extend to other components in the same rack in
a data center [15]. It is, therefore, important to monitor the security of all the devices
within a data center or a cloud system where FPGAs are available to remote users.
5.4.6 Countermeasures
5.5 Conclusions
FPGAs are becoming ubiquitous owing to the flexibility and hardware parallelism
they provide. Today, FPGAs are commonly found in cyber-physical systems and
data centers. However, access to low-level FPGA hardware resources can lead to
security issues, especially considering a multitenant FPGA scenario consistent with
common cloud and data center practices. This chapter focused on two security
threats to remote FPGAs, power side-channel and fault-injection attacks, resulting
from power-delivery network sharing and the possibility of an adversary deploying
carefully crafted malicious FPGA circuits. After discussing the threat models and
the requirements for the attacks, several malicious circuit implementations and their
deployment were given. The success of the attacks was demonstrated on a range
of FPGA families and platforms, including cloud FPGAs, confirming the extent of
the threat FPGA multitenancy poses. Lastly, an overview of a selected subset of
proposed approaches for protecting against power side-channel and fault-injection
attacks was given. This chapter demonstrated the practical risk of multitenancy on
cloud FPGAs. While multitenancy offers many potential benefits, there remains a
need for countermeasure deployment and FPGA-based cloud system redesign to
ensure a high level of security for future multitenant FPGA systems.
Acknowledgments This work is partially supported by the Swiss National Science Foundation
(grant No. 182428), by armasuisse Science and Technology, and by the EU Horizon 2020
Programme under grant agreement No 957269 (EVEREST).
References
16. Glamočanin, O., Coulon, L., Regazzoni, F., & Stojilović, M. (2020). Are cloud FPGAs really
vulnerable to power analysis attacks? In Design, Automation and Test in Europe Conference
and Exhibition (DATE) (pp. 1–4). IEEE, Grenoble.
17. Glamočanin, O., Kostić, A., Kostić, S., & Stojilović, M. (2023). Active wire fences for
multitenant FPGAs. In 26th international symposium on design and diagnostics of electronic
circuits systems (DDECS) (pp. 13–20). IEEE, Tallinn.
18. Glamočanin, O., Mahmoud, D. G., Regazzoni, F., & Stojilović, M. (2023). Cloud FPGA
security—practical implementations of remote power side-channel and fault-injection attacks
on multitenant FPGAs—artifacts. https://github.com/mirjanastojilovic/remote-fpga-attacks-
book-chapter.
19. Gnad, D. R., Oboril, F., & Tahoori, M. B. (2017). Voltage drop-based fault attacks on
FPGAs using valid bitstreams. In Proceedings of the 27th international conference on field-
programmable logic and applications (FPL) (pp. 1–7). IEEE, Ghent.
20. Gnad, D. R. E., Nguyen, C. D. K., Gillani, S. H., & Tahoori, M. B. (2021). Voltage-based covert
channels using FPGAs. ACM Transactions on Design Automation of Electronic Systems, 26(6),
1–25.
21. Gnad, D. R. E., Oboril, F., Kiamehr, S., & Tahoori, M. B. (2016). Analysis of transient voltage
fluctuations in FPGAs. In 2016 international conference on field-programmable technology
(FPT) (pp. 12–19). IEEE, Xi’an.
22. Gravellier, J., Dutertre, J. M., Teglia, Y., & Loubet-Moundi, P. (2019). High-speed ring
oscillator based sensors for remote side-channel attacks on FPGAs. In 2019 international
conference on ReConFigurable computing and FPGAs (ReConFig) (pp. 1–8). IEEE, Cancun.
23. Gravellier, J., Dutertre, J. M., Teglia, Y., Loubet-Moundi, P., & Olivier, F. (2019). Remote side-
channel attacks on heterogeneous SoC. In 18th smart card research and advanced applications
conference (CARDIS 2019) (pp. 109–25). Springer, Prague.
24. Gross, M., Krautter, J., Gnad, D., Gruber, M., Sigl, G., & Tahoori, M. (2023). FPGANeedle:
Precise remote fault attacks from FPGA to CPU. In Proceedings of the 28th Asia and South
Pacific design automation conference (pp. 358–64). ACM, Tokyo.
25. Hoozemans, J., Peltenburg, J., Nonnemacher, F., Hadnagy, A., Al-Ars, Z., & Hofstee, H. P.
(2021). FPGA acceleration for big data analytics: Challenges and opportunities. IEEE Circuits
and Systems Magazine, 21(2), 30–47.
26. Hsing, H. (2019). Tiny AES. https://opencores.org/projects/tiny_aes.
27. Hu, W., Zhang, L., Ardeshiricham, A., Blackston, J., Hou, B., Tai, Y., & Kastner, R. (2017).
Why you should care about don’t cares: Exploiting internal don’t care conditions for hardware
Trojans. In 2017 IEEE/ACM international conference on computer-aided design (ICCAD) (pp.
707–13). Irvine, CA, USA.
28. Huawei. (2023). FPGA accelerated cloud server—Huawei cloud. https://www.huaweicloud.
com/en-us/product/fcs.html.
29. Intel® programmable acceleration card (PAC) with Intel® Arria® 10 GX FPGA data
sheet. (2020). https://www.intel.com/content/www/us/en/docs/programmable/683226/current/
introduction-rush-creek.html.
30. Kocher, P., Jaffe, J., & Jun, B. (1999). Differential power analysis. In Advances in Cryptology—
CRYPTO ’99 (pp. 387–97). Santa Barbara, CA, USA.
31. Korczyc, J., & Krasniewski, A. (2012). Evaluation of susceptibility of FPGA-based circuits to
fault injection attacks based on clock glitching. In 15th international symposium on design and
diagnostics of electronic circuits systems (DDECS) (pp. 171–74). IEEE, Tallinn.
32. Krautter, J., Gnad, D. R. E., Schellenberg, F., Moradi, A., & Tahoori, M. B. (2019). Active
fences against voltage-based side channels in multi-tenant FPGAs. In 2019 IEEE/ACM
international conference on computer-aided design (ICCAD) (pp. 1–8). Westminster, CO,
USA.
33. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault
attacks on shared FPGAs, suitable for DFA on AES. IACR Transactions on Cryptographic
Hardware and Embedded Systems, 2018(3), 44–68.
34. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2019). Mitigating electrical-level attacks
towards secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable
Technology and Systems, 12(3), 1–26.
35. La, T., Pham, K. D., Powell, J., & Koch, D. (2021). Denial-of-service on FPGA-based cloud
infrastructures—attack and defense. IACR Transactions on Cryptographic Hardware and
Embedded Systems, 2021(3), 441–464.
36. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems, 13(3), 15:1–15:31.
37. Lee, W., Wang, Y., Cui, T., Nazarian, S., & Pedram, M. (2014). Dynamic thermal manage-
ment for FinFET-based circuits exploiting the temperature effect inversion phenomenon. In
Proceedings of the 2014 international symposium on low power electronics and design (pp.
105–110). ACM, La Jolla, CA, USA.
38. Li, H., Tang, Y., Que, Z., & Zhang, J. (2022). FPGA accelerated post-quantum cryptography.
IEEE Transactions on Nanotechnology, 21, 685–691.
39. Luo, Y., Gongye, C., Fei, Y., & Xu, X. (2021). DeepStrike: Remotely-guided fault injection
attacks on DNN accelerator in cloud-FPGA. In 58th ACM/IEEE design automation conference
(DAC) (pp. 295–300). San Francisco, CA, USA.
40. Luo, Y., & Xu, X. (2020). A quantitative defense framework against power attacks on multi-
tenant FPGA. In Proceedings of the 39th international conference on computer-aided design
(pp. 1–9). ACM, New York.
41. Mahmoud, D., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant FPGAs.
In Design, automation and test in Europe conference and exhibition (DATE) (pp. 1745–1750).
IEEE, Florence.
42. Mahmoud, D. G., Dervishi, D., Hussein, S., Lenders, V., & Stojilović, M. (2022). DFAulted:
Analyzing and exploiting CPU software faults caused by FPGA-driven undervolting attacks.
IEEE Access, 10, 134199–134216.
43. Mahmoud, D. G., Hu, W., & Stojilović, M. (2020). X-attack: Remote activation of satisfiability
don’t-care hardware Trojans on shared FPGAs. In Proceedings of the 30th international con-
ference on field-programmable logic and applications (FPL) (pp. 185–192). IEEE, Gothenburg.
44. Mahmoud, D. G., Hussein, S., Lenders, V., & Stojilović, M. (2022). FPGA-to-CPU undervolt-
ing attacks. In Design, automation and test in Europe conference and exhibition (DATE) (pp.
999–1004). IEEE, Virtual Event.
45. Mahmoud, D. G., Lenders, V., & Stojilović, M. (2022). Electrical-level attacks on CPUs,
FPGAs, and GPUs: Survey and implications in the heterogeneous era. ACM Computing
Surveys, 55(3), 1–40.
46. Mangard, S., Oswald, E., & Popp, T. (2007). Power analysis attacks—revealing the secrets of
smart cards. Springer, New York.
47. Martín, H., Korak, T., Millán, E. S., & Hutter, M. (2015). Fault attacks on STRNGs: Impact of
glitches, temperature, and underpowering on randomness. IEEE Transactions on Information
Forensics and Security, 10(2), 266–277.
48. Matas, K., La, T. M., Pham, K. D., & Koch, D. (2020). Power-hammering through glitch
amplification—attacks and mitigation. In 28th annual international symposium on field-
programmable custom computing machines (FCCM) (pp. 65–69). IEEE, Fayetteville.
49. Mirzargar, S. S., Renault, G., Guerrieri, A., & Stojilović, M. (2020). Nonintrusive and adaptive
monitoring for locating voltage attacks in virtualized FPGAs. In IEEE international conference
on field programmable technology (FPT) (pp. 1–2). IEEE, Maui.
50. Moini, S., Deric, A., Li, X., Provelengios, G., Burleson, W., Tessier, R., & Holcomb, D. (2022).
Voltage sensor implementations for remote power attacks on FPGAs. ACM Transactions on
Reconfigurable Technology and Systems, 16(1), 1–21.
51. Moini, S., Li, X., Stanwicks, P., Provelengios, G., Burleson, W., Tessier, R., & Holcomb,
D. (2020). Understanding and comparing the capabilities of on-chip voltage sensors against
remote power attacks on FPGAs. In 63rd International midwest symposium on circuits and
systems (MWSCAS) (pp. 941–944). IEEE, Springfield.
134 D. G. Mahmoud et al.
52. Moini, S., Tian, S., Holcomb, D., Szefer, J., & Tessier, R. (2021). Remote power side-channel
attacks on BNN accelerators in FPGAs. In Design, automation and test in Europe conference
and exhibition (DATE) (pp. 1639–1644). IEEE.
53. Nassar, H., AlZughbi, H., Gnad, D. R. E., Bauer, L., Tahoori, M. B., & Henkel, J. (2021). Loop-
Breaker: Disabling interconnects to mitigate voltage-based attacks in multi-tenant FPGAs.
In 2021 IEEE/ACM international conference on computer aided design (ICCAD) (pp. 1–9).
Munich, Germany.
54. Örs, S. B., Oswald, E., & Preneel, B. (2003). Power-analysis attacks on an FPGA—first
experimental results. In Conference on cryptographic hardware and embedded systems (CHES)
(pp. 35–50). Springer, Cologne.
55. Papagiannopoulos, K., Glamočanin, O., Azouaoui, M., Ros, D., Regazzoni, F., & Stojilović,
M. (2023). The side-channel metrics cheat sheet. ACM Computing Surveys, 55(10), 1–38.
56. Provelengios, G., Holcomb, D., & Tessier, R. (2019). Characterizing power distribution attacks
in multi-user FPGA environments. In Proceedings of the 29th international conference on field-
programmable logic and applications (FPL) (pp. 194–201). IEEE, Barcelona.
57. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power wasting circuits for cloud FPGA
attacks. In Proceedings of the 30th international conference on field-programmable logic and
applications (FPL) (pp. 231–235). IEEE, Gothenburg.
58. Regazzoni, F., Yi, W., & Standaert, F. X. (2011). FPGA implementations of the AES masked
against power analysis attacks. In Proceedings of 2nd international workshop on constructive
side-channel analysis and secure design (COSADE) (pp. 1–11). Darmstadt, Germany.
59. Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation
coefficient. The American Statistician, 42(1), 59–66.
60. Salman, E., Dasdan, A., Taraporevala, F., Kucukcakar, K., & Friedman, E. G. (2007).
Exploiting setup-hold-time interdependence in static timing analysis. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 26(6), 1114–1125.
61. SAKURA-X side-channel evaluation board. (2021). https://satoh.cs.uec.ac.jp/SAKURA/
hardware/SAKURA-X.html.
62. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). An inside job: Remote
power analysis attacks on FPGAs. In Design, automation and test in Europe conference and
exhibition (DATE) (pp. 1111–1116). IEEE, Dresden.
63. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). Remote inter-
chip power analysis side-channel attacks at board-level. In 2018 IEEE/ACM international
conference on computer-aided design (ICCAD) (pp. 114:1–114:7). New York.
64. Shawahna, A., Sait, S. M., & El-Maleh, A. (2019). FPGA-based accelerators of deep learning
networks for learning and classification: A review. IEEE Access, 7, 7823–7859.
65. Spielmann, D., Glamočanin, O., & Stojilović, M. (2023). RDS: FPGA routing delay sensors
for effective remote power analysis attacks. IACR Transactions on Cryptographic Hardware
and Embedded Systems, 2023(2), 543–567.
66. Tian, S., Moini, S., Wolnikowski, A., Holcomb, D., Tessier, R., & Szefer, J. (2021). Remote
power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In Proceedings of the
international symposium on field-programmable custom computing machines (FCCM).
67. Tiri, K., & Verbauwhede, I. (2004). A logic level design methodology for a secure DPA
resistant ASIC or FPGA implementation. In Design, automation and test in Europe conference
and exhibition (DATE) (pp. 246–251). Paris, France.
68. Wu, J. (2010). Several key issues on implementing delay line based TDCs using FPGAs. IEEE
Transactions on Nuclear Science, 57(3), 1543–1548.
69. Xilinx. (2017). UltraScale architecture configurable logic block user guide (UG574). https://
docs.xilinx.com/v/u/en-US/ug574-ultrascale-clb.
70. Yeap, G. K. (2012). Practical low power digital VLSI design. Springer Science and Business
Media, Berlin.
71. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In 2018 IEEE
symposium on security and privacy (pp. 805–820). IEEE, San Francisco.
72. Zhu, H., Guo, X., Jin, Y., & Zhang, X. (2020). PowerScout: A security-oriented power delivery
network modeling framework for cross-domain side-channel analysis. In Asian hardware
oriented security and trust symposium (AsianHOST) (pp. 1–6). IEEE.
73. Zick, K. M., Srivastav, M., Zhang, W., & French, M. (2013). Sensing nanosecond-scale voltage
attacks and natural transients in FPGAs. In Proceedings of the 21st ACM/SIGDA international
symposium on field-programmable gate arrays (FPGA) (pp. 101–104). Monterey, CA, USA.
74. Zussa, L., Dutertre, J. M., Clédière, J., & Robisson, B. (2014). Analysis of the fault injection
mechanism related to negative and positive power supply glitches using an on-chip voltmeter.
In International symposium on hardware-oriented security and trust (HOST) (pp. 130–135).
IEEE, Arlington.
75. Zynq UltraScale+ MPSoC. (2022). https://www.xilinx.com/products/silicon-devices/soc/
zynq-ultrascale-mpsoc.html.
Chapter 6
Contention-Based Threats Between
Single-Tenant Cloud FPGA Instances
I. Giechaskiel
Independent Researcher, London, UK
e-mail: ilias@giechaskiel.com
S. Tian · J. Szefer (✉)
Yale University, New Haven, CT, USA
e-mail: shanquan.tian@yale.edu; jakub.szefer@yale.edu
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_6
6.1 Introduction
138 I. Giechaskiel et al.
dedicated FPGAs which are within the same server. Once an FPGA is released
by a user, it can then be assigned to the next user who rents it. This can lead to
temporal thermal covert channels [59], where heat generated by one circuit can be
later observed by other circuits that are loaded onto the same FPGA. Such channels
are slow (less than 1 bps) and are only suitable for covert communication, since
they require the two parties to coordinate and keep being scheduled on the same
physical hardware one after the other. Other means of covert communication in
the single-tenant setting do not require assignment to the same FPGA chip. For
example, multiple FPGA boards in servers share the same power supply, and prior
work has shown the potential for such shared power supplies to leak information
between FPGA boards [30]. However, the resulting covert channel was also slow
(less than 10 bps) and was only demonstrated in a lab setup.
Fingerprinting FPGA instances using physical unclonable functions (PUFs) [58,
60] is another single-tenant security topic that has been previously explored.
Fingerprinting allows users to partially map the infrastructure and gain insight
about the allocation of FPGAs (e.g., how likely a user is to be re-assigned to the
same physical FPGA they used before), but fingerprinting by itself does not lead
to information leaks. A more recent fingerprinting-related work explored mapping
FPGA infrastructures using PCIe contention to find which FPGAs are co-located in
the same non-uniform memory access (NUMA) node within a server [61]. However,
no prior work has successfully launched a cross-VM covert- or side-channel attack
in a real cloud FPGA setting.
By contrast, our work shows that shared resources can be used to leak infor-
mation across separate virtual machines running on the FPGA-accelerated F1
instances in AWS data centers. In particular, we use the contention of the PCIe
bus to not only demonstrate a new, fast covert channel (reaching up to 20 kbps)
but also identify patterns of activity based on the PCIe signatures of different
users’ Amazon FPGA Images (AFIs). This includes detecting when co-located
VMs are initialized, slowing down the programming of other users’ FPGAs,
and more generally degrading the transfer bandwidth between the FPGA and the
host VM. Our attacks do not require special privileges or potentially malicious
circuits such as ring oscillators (ROs) or time-to-digital converters (TDCs) and thus
cannot easily be detected through static analysis or design rule checks (DRCs) that
cloud providers may perform. We further introduce three new methods of finding
co-located instances that are in the same physical server: (a) through reducing the
network bandwidth via PCIe contention, (b) through resource contention of the non-
volatile memory express (NVMe) SSDs that are accessible from each F1 instance
via the PCIe bus, and (c) through the common thermal signatures obtained from
the decay rates of each FPGA’s DRAM modules. Our work therefore shows that
single-tenant attacks in real FPGA-accelerated cloud environments are practical and
demonstrates several ways to infer information about the operations of other cloud
users and their FPGA-accelerated virtual machines or the data center itself.
6 Contention-Based Threats Between Single-Tenant Cloud FPGA Instances 139
6.1.1 Contributions
The remainder of the chapter is organized as follows. Section 6.2 provides the
background on today’s deployments of FPGAs in public cloud data centers and
summarizes related work. Section 6.3 discusses typical FPGA-accelerated cloud
servers and PCIe contention that can occur among the FPGAs, while Sect. 6.4
evaluates our fast, PCIe-based, cross-VM channel. Using the ideas from the covert
channel, Sect. 6.5 investigates how to infer information about other VMs through
their PCIe traffic patterns, including detecting the initialization of co-located VMs,
long-term PCIe monitoring of data center activity, and slowing down PCIe traffic
on adjacent instances. Section 6.6 then presents alternative sources of information
leakage due to network bandwidth contention, shared SSDs, and thermal signatures
of DRAM modules. The chapter concludes in Sect. 6.7.
This section first provides a brief background on the F1 instances from Amazon
Web Services (AWS) [14] that are evaluated in this work, with a focus on their
architecture (Sect. 6.2.1) and programming model (Sect. 6.2.2). It also summarizes
related work in the area of FPGA cloud security (Sect. 6.2.3).
AWS has offered FPGA-accelerated virtual machine instances to users since late
2016 [4]. These so-called F1 instances are available in three sizes: f1.2xlarge,
f1.4xlarge, and f1.16xlarge, where the number in the instance name is twice the
number of FPGAs it contains (so f1.2xlarge has 1 FPGA, while f1.4xlarge
has 2, and f1.16xlarge has 8 FPGAs). Each instance is allocated 8 virtual CPUs
(vCPUs), 131 GB of DRAM, and 470 GB of NVMe SSD storage per FPGA. For
example, f1.4xlarge instances have 16 vCPUs, 262 GB of DRAM, and 940 GB
of SSD space [14], since they contain 2 FPGAs.
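This naming arithmetic can be captured in a few lines (a sketch; the constants are the per-FPGA allocations quoted above, and `f1_resources` is a hypothetical helper, not part of any AWS API):

```python
# Per-FPGA allocations on F1 instances, as quoted in the text [14].
VCPUS_PER_FPGA = 8
DRAM_GB_PER_FPGA = 131
SSD_GB_PER_FPGA = 470

def f1_resources(instance):
    """Derive FPGA count and resources from an F1 instance name.

    The size in the name is twice the FPGA count: f1.2xlarge -> 1 FPGA,
    f1.4xlarge -> 2, f1.16xlarge -> 8.
    """
    size = int(instance.split(".")[1].removesuffix("xlarge"))
    fpgas = size // 2
    return {"fpgas": fpgas,
            "vcpus": VCPUS_PER_FPGA * fpgas,
            "dram_gb": DRAM_GB_PER_FPGA * fpgas,
            "ssd_gb": SSD_GB_PER_FPGA * fpgas}

print(f1_resources("f1.4xlarge"))
# {'fpgas': 2, 'vcpus': 16, 'dram_gb': 262, 'ssd_gb': 940}
```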
Each FPGA board is attached to the server over an x16 PCIe Gen 3 bus. In
addition, each FPGA board contains four DDR4 DRAM chips, totaling 68.7 GB
of memory per FPGA board [14]. These memories are separate from the server’s
DRAM and are directly accessible by each FPGA. The F1 instances use Virtex
UltraScale+ XCVU9P chips [14], which contain over 1.1 million lookup tables
(LUTs), 2.3 million flip-flops (FFs), and 6.8 thousand digital signal processing
(DSP) blocks [67].
Each server contains 8 FPGA boards, which are evenly split between two non-
uniform memory access (NUMA) nodes, as shown in Fig. 6.1 [11, 14, 61]. AWS
servers containing FPGAs have two Intel Xeon E5-2686 v4 (Broadwell) processors,
connected through an Intel QuickPath Interconnect (QPI) link. Each processor
forms its own NUMA node with its associated DRAM and four FPGAs attached
as PCIe devices. Due to this architecture, an f1.2xlarge instance may be on the
same NUMA node as up to three other f1.2xlarge instances or on the same
NUMA node as one other f1.2xlarge instance and one f1.4xlarge instance
(which uses 2 FPGAs). Conversely, an f1.4xlarge instance may share the
same NUMA node with up to two f1.2xlarge instances or one f1.4xlarge
instance. Finally, as f1.16xlarge instances use all 8 FPGAs in the server,
they do not share any resources with other instances, since both NUMA nodes of
the server are fully occupied. In this work, we produce a more fine-grained
model of the servers and their PCIe topologies thanks to additional sources
of contention via NVMe SSDs and network interface controller (NIC) cards.
responses [56]. Our work is similar but presents the first successful cross-VM
attacks using PCIe contention on a public cloud. Moreover, by going beyond just
PCIe, our work is able to deduce cross-NUMA-node co-location using a DRAM
thermal fingerprinting approach.
Recent work has shown that direct control of the DRAM connected to the FPGA
boards can be used to fingerprint them [60]. This can be combined with existing
work [61] to build a map of the cloud data centers where FPGAs are used. Such
fingerprinting does not by itself, however, help with cross-VM covert channels, as
it does not provide co-location information. By contrast, our PCIe, NIC, SSD, and
DRAM approaches are able to co-locate instances in the same server and enable
cross-VM covert channels and information leaks.
This work has focused on the single-tenant setting, where each user gets full
access to the FPGA and thus reflects the current environment offered by cloud
providers. However, there is also a large body of security research in the multi-
tenant context, where a single FPGA is shared by multiple, logically (and potentially
physically) isolated users [26, 36, 46, 48, 71]. For example, several researchers have
shown how to recover information about the structure [62, 72] or inputs [49] of
machine learning models or cause timing faults to reduce their accuracy [24, 52].
Other work in this area has shown that crosstalk due to routing wires [28] and
logic elements [31] inside the FPGA chips can be used to leak static signals,
while voltage drops due to dynamic signals can lead to covert-channel [29], side-
channel [32, 35], and fault [50] attacks. Several works have also tried to address such
issues to enable multi-tenant applications, proposing static checks [38, 40], voltage
monitors [33, 45, 51], or a combination of the two [39]. Our work on PCIe, SSD,
and DRAM threats is orthogonal to such work but is directly applicable to current
cloud FPGA deployments.
The user’s custom logic running on the FPGA instances can use the Shell to
communicate with the server through the PCIe bus. Users cannot directly control the
PCIe transactions but instead perform reads and writes to predefined address ranges
through the shell. These memory accesses get translated into PCIe commands and
PCIe data transfers between the server and the FPGA. Users may also set up direct
memory access (DMA) transfers between the FPGA and the server. By designing
hardware modules with low logic overhead, users can generate transfers fast enough
to saturate the PCIe bandwidth. In fact, because of the shared PCIe bus within each
NUMA node, these transfers can create interference and bus contention that affects
the PCIe bandwidth of other users. The resulting performance degradation can be
used for detecting co-location [61] or, as we show in this work, for fast covert-
and side-channel attacks, breaking the isolation between otherwise logically and
physically separate VM instances.
In our covert-channel analysis (Sect. 6.4), we show that the communication
bandwidth is not identical for all pairs of FPGAs in a NUMA node. In particular,
this suggests that the 4 PCIe devices are not directly connected to each CPU, but
instead likely go through two separate switches, forming the hierarchy shown in
Fig. 6.2, improving the deduced model of prior work [61]. Although not publicly
Fig. 6.2 The newly deduced PCIe configuration for F1 servers based on the experiments in this
work: each CPU has two PCIe links, each of which provides connectivity to two FPGAs, an NVMe
SSD, and a NIC through a PCIe switch
confirmed by AWS, this topology is similar to the one described for P4d instances,
which contain 8 GPUs [7]. As a result, even though all 4 FPGAs in a NUMA node
contend with one another, the covert-channel bandwidth is highest among those
sharing a PCIe switch, due to the bottleneck imposed by the shared link (Sect. 6.4).
We also expand on the model to show that the PCIe switches provide connectivity
to an NVMe SSD drive and a network interface card, thereby expanding the attack
surface by identifying additional sources of PCIe contention. Finally, as we show in
Sect. 6.4.5, how effectively these PCIe links can be saturated is also dependent on
the operating system/kernel configuration instead of just the user-level software and
underlying hardware architecture.
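The deduced hierarchy can be written down as a tiny lookup (a sketch; the FPGA labels A–D and which pair shares each switch are illustrative assumptions, since the text does not map specific FPGAs to switches):

```python
# Sketch of the topology deduced in Fig. 6.2: within one NUMA node, the
# four FPGAs hang off two PCIe switches, two FPGAs per switch.
SWITCH_OF = {"A": 0, "B": 0, "C": 1, "D": 1}  # illustrative assignment

def contention_path(x, y):
    """Where two FPGAs of one NUMA node contend under the deduced model."""
    if x == y:
        raise ValueError("need two distinct FPGAs")
    if SWITCH_OF[x] == SWITCH_OF[y]:
        return "shared PCIe switch"   # highest covert-channel bandwidth
    return "shared CPU uplink only"   # lower bandwidth across switches

print(contention_path("A", "B"), "/", contention_path("A", "C"))
# shared PCIe switch / shared CPU uplink only
```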
In this section, we describe our implementation for the first cross-FPGA covert
channel on public clouds (Sect. 6.4.1) and discuss our experimental setup
(Sect. 6.4.2). We then analyze bandwidth vs. accuracy trade-offs (Sect. 6.4.3),
before investigating the impact of receiver and transmitter transfer sizes on the
covert-channel accuracy for a given covert-channel bandwidth (Sect. 6.4.4). We
finish the section by discussing differences in the covert-channel bandwidth between
VMs using different operating systems (Sect. 6.4.5). Side channels and information
leaks based on PCIe contention from other VMs are discussed in Sect. 6.5.
Our covert channel is based on saturating the PCIe link between the FPGA and the
server, so, at their core, both the transmitter and the receiver consist of (a) an FPGA
image that interfaces with the host over PCIe with minimal latency in accepting
write requests or responding to read requests and (b) software that attaches to the
FPGA and repeatedly writes to (or reads from) the mapped base address register
(BAR). These requests are translated to PCIe transactions, transmitted over the data
and physical layers, and then relayed to the custom logic (CL) hardware through
the shell (SH) logic as AXI reads or writes. The transmitter stresses its PCIe link to
transmit a 1 but remains idle to transmit a bit 0, while the receiver keeps measuring
its own bandwidth during the transmission period (the receiver is thus identical to
a transmitter that sends a 1 during every measurement period). The receiver then
classifies the received bit as a 1 if the bandwidth B has dropped below a threshold
T and as 0 otherwise.
The two communicating parties need to have agreed upon some minimal
information prior to the transmissions: the specific data center to use (region and
availability zone, e.g., us-east-1e), the time t to start communications, and the
initial measurement period, expressed as the time difference between successive
transmissions δ. All other aspects of the communication can be handled within the
channel itself, including detecting that the two parties are on the same NUMA node
or increasing the bandwidth by decreasing δ. To ensure that the PCIe link returns
to idle between successive measurements, transmissions stop before the end of the
measurement interval, i.e., the transmission duration d satisfies d < δ. Note that
synchronization is implicit due to the receiver and transmitter having access to
a shared wall clock time, e.g., via the network time protocol (NTP). Figure 6.3
provides a high-level overview of our covert-channel mechanism.
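The mechanism of Fig. 6.3 can be sketched as a small simulation (the bandwidth constants are made-up illustrative values; the real receiver measures its own PCIe throughput rather than consulting a model):

```python
# Illustrative bandwidth levels in MBps for an uncontended vs. stressed link.
IDLE_BW = 3000.0
BUSY_BW = 900.0

def byte_to_bits(byte):
    """MSB-first bits of one byte, one bit per measurement interval."""
    return [(byte >> (7 - i)) & 1 for i in range(8)]

def decode(bandwidths, threshold):
    """Receiver rule from the text: bandwidth below threshold -> bit 1."""
    value = 0
    for b in bandwidths:
        value = (value << 1) | (1 if b < threshold else 0)
    return value

# Alice sends 'H' (01001000): she stresses the link for each 1 bit, and
# Bob's measured bandwidth drops during those intervals.
sent = byte_to_bits(ord("H"))
observed = [BUSY_BW if bit else IDLE_BW for bit in sent]
T = (BUSY_BW + IDLE_BW) / 2
print(chr(decode(observed, T)))  # H
```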
Before they can communicate, the two parties (Alice and Bob in the example
of Fig. 6.3) first need to ensure that they are co-located on the same NUMA
node within the server. To do so, they can launch multiple instances at or near
an agreed-upon time and attempt to detect whether any of their instances are co-
located by sending handshake messages and expecting a handshake response, using
the same setup information as for the covert channel itself (i.e., the time t′ to start
the communication, the measurement duration δ, and location information such as
the data center region and availability zone). They additionally need to have agreed
Fig. 6.3 Example cross-VM covert communication: The transmitter (Alice) sends the ASCII byte
“H,” represented as 01001000 in binary, to the receiver (Bob) in 8 intervals by stressing her PCIe
bandwidth to transmit a 1 and remaining idle to transmit a 0. If Bob’s FPGA bandwidth B drops
below a threshold T , he detects a 1, otherwise a 0 is detected. To ensure no residual effects after
each transmission, the time difference δ between successive measurements is slightly larger than
the transmission duration d
Fig. 6.4 The process to find a pair of co-located f1.2xlarge instances using PCIe contention
uses the covert-channel mechanism to check for pre-agreed handshake messages: Alice transmits
the handshake message with her first FPGA and waits to see if Bob acknowledges the handshake
message. In parallel, Bob measures the bandwidths of all his FPGAs. In this example, Bob detects
the contention in his seventh FPGA during the fourth handshake attempt. Note that Alice and Bob
can rent any number of FPGAs for finding co-location, with five and seven used as an example
transfer sizes, were informed by early manual experiments and the work in [61] to
ensure we can reliably detect co-location. Note that these parameters can be different
from those used after the co-location has been established. For instance, co-location
detection can use low-bandwidth transfers (e.g., 200 bps) that are reliable across all
NUMA node setups and can be increased as part of the setup process, once co-
location has been established.
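The co-location scan of Fig. 6.4 can be sketched as follows (a simulation under assumed bandwidth numbers; `HANDSHAKE`, `flat`, and `hit` are illustrative values, not measured data):

```python
HANDSHAKE = [1, 0, 1, 1, 0, 0, 1, 0]  # pre-agreed pattern (illustrative)

def decode_trace(trace):
    """Threshold a bandwidth trace at the mid-point of its extremes."""
    t = (min(trace) + max(trace)) / 2
    return [1 if b < t else 0 for b in trace]

def colocated_fpgas(traces):
    """Indices of the receiver's FPGAs whose decoded bits match the
    handshake. A non-co-located FPGA sees a roughly flat trace, so the
    mid-point threshold merely splits measurement noise and the decoded
    bits almost never reproduce the full pattern."""
    return [i for i, tr in enumerate(traces) if decode_trace(tr) == HANDSHAKE]

# FPGA 0 is unaffected (flat trace with noise); FPGA 1 sees contention
# exactly when the transmitter sends a 1 (bandwidths in MBps, made up).
flat = [2980, 3010, 2995, 3006, 2990, 3001, 2985, 3008]
hit = [2900 if bit == 0 else 950 for bit in HANDSHAKE]
print(colocated_fpgas([flat, hit]))  # [1]
```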
During the co-location process, the two communicating parties can also establish
what the threshold T should be and whether the communication bandwidth should
be increased. As shown in [61], the PCIe bandwidth of an instance drops from over
3,000 MBps to under 1,000 MBps when there is an external PCIe stressor. As a
result, this threshold T could be simply hardcoded (at, say, 2,000 MBps), or be
adaptive, as the mid-point between the minimum bm and maximum bM bandwidths
recorded during a successful handshake. The latter is the approach we use in our
work: if the two instances are not co-located, bm ≈ bM, and the decoded bits will
be random and hence will not match the expected handshake message H . If the two
instances are co-located, bM ≫ bm (assuming H contains at least one 0 and at least
one 1), so any bit 1 will correspond to a bandwidth b1 ≈ bm ≪ (bm + bM)/2 = T
and any bit 0 will result in bandwidth b0 ≈ bM ≫ (bm + bM)/2 = T.
For the majority of our experiments, we use VMs with AWS FPGA Developer
Amazon Machine Image (AMI) [17] version 1.8.1, which runs CentOS 7.7.1908,
and develop our code with the hardware and software development kits (HDK/SDK)
version 1.4.15 [8]. We vary the operating systems used for the transmitters and
receivers and significantly improve the covert-channel bandwidth in Sect. 6.4.5.
For our FPGA bitstream, we use the unmodified CL_DRAM_DMA image provided
by AWS (agfi-0d132ece5c8010bf7) [10] for both the transmitter and the
receiver designs. This design maps the 137.4 GB PCIe application physical function
(AppPF) BAR4 (a 64-bit prefetchable base address register (BAR) [9]) to the
68.7 GB of FPGA DRAM (twice). It is not necessary to use the FPGA DRAMs
themselves: merely responding to PCIe requests so that the interfaces do not hang,
as in the CL_HELLO_WORLD example [12], is sufficient. Our custom-written
software maps the FPGA DRAM modules via the BAR4 register and uses the
BURST_CAPABLE flag to support write-combining for higher performance. Data
transfers are implemented using memcpy, getting similar performance to the AWS
benchmarks [6]. Algorithm 1 summarizes our software routine in pseudocode.
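Algorithm 1 itself is not reproduced here; a minimal stand-in for the bandwidth probe at its core might look like the following (a sketch that copies into an anonymous mmap buffer instead of the real BAR4 mapping, which would require the AWS FPGA SDK and an attached FPGA):

```python
import mmap
import time

CHUNK = 2 * 1024   # 2 kB receiver transfers, as used in Sect. 6.4.3
ITERS = 5000

def measure_bandwidth(buf, data):
    """Time repeated memcpy-style writes and return MBps."""
    start = time.perf_counter()
    for _ in range(ITERS):
        buf.seek(0)
        buf.write(data)   # stands in for memcpy() into the BAR4 mapping
    elapsed = time.perf_counter() - start
    return (CHUNK * ITERS) / (elapsed * 1e6)

# Anonymous memory as a stand-in for the mapped FPGA BAR.
bar = mmap.mmap(-1, CHUNK)
print(f"{measure_bandwidth(bar, bytes(CHUNK)):.0f} MBps")
```

In the real attack, a drop in this measured figure is what signals contention from another instance on the same PCIe hierarchy.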
Unless otherwise noted, we perform experiments with “spot” instances in the
us-east-1 (North Virginia) region in availability zone d for cost reasons, though
a prior work has shown that PCIe contention is present with both spot and on-
demand instances, in all regions and different availability zones where F1 instances
are currently supported, namely ap-southeast-2 (Sydney), eu-west-1 (Ire-
land), us-east-1 (North Virginia), and us-west-2 (Oregon) [61]. Although
the results presented are for instances launched by a single user, it should also
be noted that we have successfully created a cross-VM covert channel between
instances launched by two different users.
to (9 ms, 10 ms), corresponding to transmission rates between 5 kbps and 100 bps.¹
For these experiments, the receiver keeps transferring 2 kB chunks of data from
the host, while the transmitter repeatedly sends 64 kB of data in each transmission
period (i.e., until the end of the interval d). These parameters are explored separately
in Sect. 6.4.4 below.
The results of our experiments, shown in Fig. 6.5, indicate that we can create
a fast covert channel between any two FPGAs in either direction: at 200 bps and
below, the accuracy of the covert channel is 100%, with the accuracy at 250 bps
dropping to 99% for just one pair. At 500 bps, three of the six possible pairs
can communicate at 100% accuracy, while one pair can communicate with 97%
accuracy at 2 kbps (and sharply falls to 70% accuracy even at 2.5 kbps—though
in Sect. 6.4.5 we show that bandwidths of 20 kbps at 99% accuracy are possible). It
should be noted that, as expected, the bandwidth within any given pair is symmetric,
i.e., it remains the same when the roles of the transmitter and the receiver are
reversed. As the VMs occupy a full NUMA node, there should not be any impact
from other users’ traffic. The variable bandwidth between different pairs is therefore
likely due to the PCIe topology.
¹ Section 6.4.5 shows that different setups can result in even higher bandwidths exceeding 20 kbps.
Fig. 6.5 Bandwidth and accuracy for covert-channel transmissions between any pair of FPGAs,
among the four FPGAs in the same NUMA node. Each FPGA pair is color-coded, with transmitters
indicated through different markers and receivers through different line styles. For any pair of
FPGAs X and Y , the bandwidth is approximately the same in each direction, i.e., the bandwidth
from X to Y is approximately the same as the bandwidth from Y to X. Communication is possible
between any two FPGAs in the NUMA node, but the bandwidths for different pairs diverge
Fig. 6.6 Covert-channel accuracy for different transmitter transfer sizes. Each chunk transmitted
over PCIe needs to be ≥ 4 kB to ensure an accuracy of 100% at 200 bps between any two FPGAs
in the NUMA node
Fig. 6.7 Covert-channel accuracy for different receiver transfer sizes. Chunks between 64 B and
4 kB are suitable for 100% accuracies, but sizes outside this range result in a drop in accuracy for
at least one pair of FPGAs in the NUMA node
Starting with FPGA AMI version 1.10.0, Amazon has provided AMIs based on
Amazon Linux 2 (AL2) [18] alongside AMIs based on CentOS [17] (both using the
Xilinx-provided XOCL PCIe driver). AL2 uses a Linux kernel that has been tuned
for the AWS infrastructure [15] and may therefore impact the performance of the
covert channel. Since the attacker does not have control over the victim’s VM, it
is necessary to explore the effect of the operating system on our communication
channel and thus experiment with both types of operating systems as receivers and
transmitters. We use the co-location methodology of Sect. 6.4.1 to find different
instances that are in the same NUMA node and report the accuracy of our cross-
VM covert channel from bandwidths as low as 0.1 kbps to as high as 66.6 kbps.
As described in Sect. 6.3 and shown in Fig. 6.2, each NUMA node consists of 4
distinct f1.2xlarge instances, and each one can run either AL2 or CentOS. As
Sect. 6.4.3 identified, the bandwidth between different FPGA pairs will depend on
where they are in the PCIe topology, so to get an accurate estimate of the maximum
cross-instance covert-channel bandwidth for different setups, we run experiments
on three different configurations of full NUMA nodes. The first experiment contains
one instance running CentOS and three instances running AL2 (Fig. 6.8), the second
Fig. 6.8 Bandwidth and accuracy for covert-channel transmissions between any pair of four co-
located instances, where three instances are running Amazon Linux 2 (AL2) and the last one is
running CentOS. Each FPGA pair is color-coded, with transmitters indicated through different
markers and receivers through different line styles
152 I. Giechaskiel et al.
Fig. 6.9 Bandwidth and accuracy for covert-channel transmissions between any pair of four co-
located instances, where two instances are running Amazon Linux 2 (AL2) and the other two are
running CentOS
Fig. 6.10 Bandwidth and accuracy for covert-channel transmissions between any pair of four co-
located instances, where only one instance is running Amazon Linux 2 (AL2) and the remaining
are running CentOS
contains two instances with CentOS and two with AL2 (Fig. 6.9), while the last
setup has three CentOS VMs and one AL2 VM (Fig. 6.10). For each experiment,
we collect the covert-channel bandwidths for all pairs of instances and in both
directions of communication, resulting in 12 different bandwidth vs. accuracy sets
of measurements.
Figure 6.8 shows the covert channel bandwidths for all FPGA pairs, where one
instance is running CentOS and the remaining three are running AL2. For any pair of
AL2 instances, the covert-channel accuracy at 20 kbps is over 90% (in fact, reaching
Table 6.1 Cross-VM covert channel bandwidth for different receiver and transmitter operating
systems. ∗A bandwidth of 5.9 kbps at 95% accuracy could be sustained across repeated individual
experiments outside of a full NUMA node.

Transmitter       Receiver          Bandwidth    Accuracy
CentOS            CentOS            2.0 kbps     97%
CentOS            Amazon Linux 2    0.3 kbps∗    100%
Amazon Linux 2    CentOS            2.5 kbps     94%
Amazon Linux 2    Amazon Linux 2    20.0 kbps    99%
Table 6.2 Cross-FPGA covert channel bandwidth achieved by different works. The PCIe
contention approach of our work achieves bandwidths that are several orders of magnitude faster
than prior research and are performed on a commercial public cloud. ∗Achieved only in a lab setup.

Cloud    Method                      Reference    Bandwidth       Accuracy
TACC     Thermal Attack              [59]         ≪ 0.1 bps       100%
—∗       Voltage Stressing           [30]         6.1 bps         99%
AWS      PCIe Contention (CentOS)    This work    2,000.0 bps     97%
AWS      PCIe Contention (AL2)       This work    20,000.0 bps    99%
99%) and for a subset of those pairs remains above 80% even at 40 kbps. However,
when a CentOS instance is involved, the bandwidth drops to 0.5 kbps for either
direction of communication.
Figures 6.9 and 6.10 show that, depending on where the instances are on the PCIe
topology, the bandwidth can vary. Indeed, Fig. 6.9 shows that the bandwidth for an
AL2 transmitter and a CentOS receiver can reach 2.5 kbps at 98% accuracy, but
CentOS transmitters and AL2 receivers generally have bandwidths below 0.5 kbps,
though in repeated individual experiments (outside of a full NUMA node), we have
been able to get a channel at 5.9 kbps at 95% accuracy. The CentOS-CentOS results
of Fig. 6.10 are consistent with those of Sect. 6.4.3, with bandwidths between
250 bps and 1.4 kbps for all but the fastest pair of instances. Table 6.1 summarizes
these results, while Table 6.2 compares the achieved bandwidths to prior work in
cross-FPGA communications.
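The decision rule behind these accuracy figures can be illustrated with a short sketch. This is not the authors' measurement code; the threshold and bandwidth values are hypothetical, chosen around the roughly 1.8 GBps idle PCIe bandwidth reported later in this chapter.

```python
# Sketch (not the chapter's code): decoding covert-channel bits from a
# sequence of per-interval PCIe bandwidth measurements. The transmitter
# stresses PCIe during an interval to send a 1 and stays idle for a 0,
# so the receiver sees reduced bandwidth whenever the link is contended.

def decode_bits(bandwidths_mbps, threshold_mbps=1500.0):
    """Map each measurement interval to a bit: contention (low bandwidth)
    encodes 1, an idle link (high bandwidth) encodes 0."""
    return [1 if bw < threshold_mbps else 0 for bw in bandwidths_mbps]

def accuracy(sent, received):
    """Fraction of bits recovered correctly, as reported in Figs. 6.8-6.10."""
    matches = sum(s == r for s, r in zip(sent, received))
    return 100.0 * matches / len(sent)

# Hypothetical per-interval measurements (MBps) for the bit string 10110:
sent = [1, 0, 1, 1, 0]
measured = [900.0, 1850.0, 950.0, 880.0, 1790.0]
received = decode_bits(measured)
print(received, accuracy(sent, received))  # all bits recovered in this toy example
```

In practice, the threshold would be calibrated from idle and contended bandwidth measurements taken on the actual instance pair.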
In this section, we explore what kinds of information malicious adversaries can infer
about computations performed by non-cooperating victim users that are co-located in
the same NUMA node in different, logically isolated VMs. We first show that the
PCIe activity of an off-the-shelf video-processing AMI from the AWS Marketplace
leaks information about the resolution and bitrate properties of the video being
processed, allowing adversaries to infer the activity of different users (Sect. 6.5.1).
We then show that it is possible to detect when a VM in the same NUMA node
is being initialized (Sect. 6.5.2) and more generally monitor the PCIe bus over a
long period of time (Sect. 6.5.3). We finally show that PCIe contention can be used
(a) PCIe contention can reveal the initialization process of co-located VM instances and traffic
patterns of applications in the same NUMA node to a passive eavesdropping adversary. (b)
Stressing the PCIe bandwidth using the FPGA can slow down the FPGA, NIC, and SSD bandwidth
of other users. SSD contention for users on the same PCIe switch is also possible.
Fig. 6.11 Summary of the (a) passive monitoring side-channel and (b) active interface contention-
based attacks presented in Sects. 6.5 and 6.6. Bandwidths are not drawn to scale
for interference attacks, including slowing down the programming of the FPGA
itself or other data transfers between the FPGA and the host VM
(Sect. 6.5.4). The attacks of this and the next section are summarized in Fig. 6.11.
Among the different hardware accelerator solutions for cloud FPGAs, in this
section, we target video processing using the DeepField AMI, which leverages
Fig. 6.12 PCIe bandwidth traces collected by the attacker, while the victim runs the DeepField
AMI to perform VSR conversions with input videos of different resolutions and frame rates. Within
each sub-figure, the red lines label the start and the end of the VSR conversion on the FPGA
of the VSR conversion on the FPGA can be clearly seen in Fig. 6.12, where vertical
red lines delineating the start and end of the process have been added for clarity. We
observe that the PCIe bandwidth drops during the conversion and that runtime is
reduced as the input resolution or the input frame rate decreases. For example, the
runtime for a 720p, 30 FPS video (Fig. 6.12f) is approximately twice as long as for
a 15 FPS one (Fig. 6.12c).
In our experiments, we have thus far only focused on covert communication and
side-channel information leakage between VM instances that have already been
initialized. By contrast, in this section, we show for the first time that the instance
initialization process can also be detected by monitoring the bandwidth of the PCIe
bus. Indeed, on AWS, there is a time lag between when a user requests that an
instance with a target AMI be launched and when it is provisioned, initialized, and
ready for the user to connect to it over SSH. This process can take multiple minutes
and, as we show in this work, causes significant PCIe traffic that is measurable by
co-located adversaries.
For our experiments, we first create an f1.2xlarge instance (named INST-
A) and start the PCIe bandwidth monitoring program on it. We then launch five
[Two panels plotting Receiver Bandwidth (MBps) against Time (h): (a) April 25, after 5pm; (b) April 25, after 9pm]
Fig. 6.14 Long-term PCIe-based data center monitoring between the evening of April 25 and the
early morning of April 26, with d = 4 ms and δ = 5 ms on an f1.2xlarge on-demand instance
In this section, we present the results of measuring the PCIe bandwidth for two on-
demand f1.2xlarge instances in the us-east-1 region (availability zone e).
These experiments took place between 5pm on April 25, 2021 and 2am on April 26
(Eastern Time, as us-east-1 is located in Northern Virginia). For both sets of four-
hour measurements, the first f1.2xlarge instance (Fig. 6.14) is measuring with
a transmission duration of d = 4 ms and a measurement duration of δ = 5 ms,
while the second instance (Fig. 6.15) has d = 18 ms and δ = 20 ms. For the
first instance, the PCIe link remains mostly idle during the evening (Fig. 6.14a)
but experiences contention in the first night hour (Fig. 6.14b). The second instance
instead appears to be co-located with other FPGAs that make heavier use of
their PCIe bandwidth. During the evening measurements (Fig. 6.15a), the PCIe
bandwidth drops momentarily below 1,200 MBps during the third hour and below
800 MBps during the fourth hour. These large drops are likely due to co-located
VMs being initialized and not normal user traffic, as described in Sect. 6.5.2. It
also experiences sustained contention in the third hour of the night measurement
(Fig. 6.15b). Although the bandwidth in the two instances is comparable, the 5 ms
measurements are noisier compared to the 20 ms ones. Finally, note that, generally,
our covert-channel code results in bandwidth drops of over 800 MBps, while the
activity of other users tends to cause drops of less than 50 MBps, suggesting that
noise from external traffic has a minimal impact on our channel.
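The separation between these two magnitudes of drops suggests a simple labeling rule for long-term traces. The sketch below is a hypothetical helper, not part of the chapter's tooling; its thresholds mirror the 800 MBps and 50 MBps figures above.

```python
# Sketch (hypothetical helper): labeling events in a long-term PCIe
# bandwidth trace. Our own covert-channel stressor causes drops of over
# 800 MBps from the idle baseline, while ordinary co-tenant traffic
# causes drops under 50 MBps, so the two are separable by drop size.

def label_trace(trace_mbps, baseline_mbps):
    labels = []
    for bw in trace_mbps:
        drop = baseline_mbps - bw
        if drop > 800.0:
            labels.append("stressor")      # covert-channel-scale contention
        elif drop > 50.0:
            labels.append("heavy")         # e.g., a co-located VM initializing
        elif drop > 0.0:
            labels.append("user-traffic")  # small drops from normal activity
        else:
            labels.append("idle")
    return labels

print(label_trace([1875.0, 1100.0, 600.0], baseline_mbps=1880.0))
# -> ['user-traffic', 'heavy', 'stressor']
```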
The PCIe contention mechanism we have uncovered can also be used to degrade the
performance of co-located applications by other users. Indeed, as we have shown in
[Two panels plotting Receiver Bandwidth (MBps) against Time (h): (a) April 25, after 5pm; (b) April 25, after 9pm]
Fig. 6.15 Long-term PCIe-based data center monitoring on a different f1.2xlarge on-demand
instance with d = 18 ms and δ = 20 ms
[Two panels plotting Receiver Bandwidth (MBps) against Time (min)]
Fig. 6.16 PCIe bandwidth traces collected by the monitoring instance while the victim instance
runs the DeepField AMI to perform a VSR conversion of the same video five consecutive times, (a)
without and (b) with the third instance acting as a PCIe stressor. Within each sub-figure, the red
lines label the start and the end of the VSR conversion on the FPGA
a prior work [61], the bandwidth can fall from 3 GBps to under 1 GBps using just
one PCIe stressor (transmitter) and to below 200 MBps when using two stressors.
To exemplify how the reduced PCIe bandwidth can affect user applications,
we again find a full NUMA node with four co-located VMs but only use three
of them. Specifically, the first VM is running the DeepField AMI video super-
resolution (VSR) algorithm [22] and represents the victim user. The second VM
is monitoring the PCIe bandwidth (similar to the experiments of Sect. 6.5.1), while
the third acts as a PCIe stressor. The fourth one is unused and left idle, to avoid
unintended interference. To further minimize any other external effects, the VSR
computation in Fig. 6.16 is repeated five times in sequence. As Fig. 6.16 shows, the
PCIe bandwidth measured by the monitoring instance drops from over 1,950 MBps
to under 650 MBps, and the conversion time in the victim instance increases by
33%. In addition to slowing down the victim application, when using a stressor, the
attacker can extract even more fine-grained information about the victim. Indeed, as
[Two bar charts of FPGA programming time (s), with and without PCIe contention]
Fig. 6.17 The FPGA programming time can be slowed down by heavy PCIe traffic from co-
located instances. In (a), only the user’s custom logic is reconfigured, while in (b), both the FPGA
shell and the custom logic are reloaded onto the FPGA. Three AFIs with different numbers of logic
resources are used
Fig. 6.16b shows, the boundary between the five repetitions becomes clear, aiding
the AMI fingerprinting attacks discussed in Sect. 6.5.1.
One particular, and perhaps unexpected, consequence of the reduced PCIe
bandwidth is a more time-consuming programming process that can, in some cases,
be more than tripled. To investigate this effect, we measure the FPGA programming
time in one of the instances (INST-A) under different conditions including:
1. Whether a PCIe bandwidth-hogging application is running on a second instance,
INST-B.
2. Whether just the custom logic or both the custom logic and FPGA shell are
reloaded with fpga-load-local-image (using the -F flag).
3. The size of the loaded AFI in terms of the logic resources used (see Table 6.3).
Because AWS uses partial reconfiguration [13], “the size of a partial bitstream is
directly proportional to the size of the region it is reconfiguring” [66], with larger
images therefore requiring more data transfers from the host to the FPGA device.
The results of our experiments are summarized in Fig. 6.17, where three AFIs
of different sizes are loaded onto INST-A with/without reloading the shell and
with/without PCIe contention on INST-B. As Fig. 6.17a shows, PCIe contention
slows down the FPGA programming of all AFIs, with the effect being more
prominent for larger AFIs, where programming has slowed down from ≈ 7 s
to ≈ 12 s. When the shell is also reloaded (Fig. 6.17b), the same pattern holds, but
the effects are even more pronounced: even reloading the small AFI slows down
from ≈ 7 s to over 20 s, while the large AFI takes over 30 s compared to ≈ 9 s
without PCIe stressing. The effect is likely not just due to the fact that the AFI needs
to be transferred to the FPGA over PCIe using the fpga-load-local-image
command, but in part also because the AFIs need to be fetched over the network
from the cloud provider’s internal servers. As we show in the next section, the
network bandwidth is also impacted by the FPGA’s PCIe activity.
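The timing measurement itself is straightforward to reproduce. The sketch below assumes the AWS F1 management tools are installed on the instance; the AGFI ID is a placeholder, the -F flag for reloading the shell is the one mentioned above, and the -S (slot) and -I (image ID) flags are assumptions based on the AWS FPGA SDK.

```python
# Sketch: timing fpga-load-local-image (as in Fig. 6.17), with and
# without reloading the FPGA shell. Flags other than -F are assumptions.
import subprocess
import time

def build_load_cmd(agfi_id, slot=0, reload_shell=False):
    cmd = ["sudo", "fpga-load-local-image", "-S", str(slot), "-I", agfi_id]
    if reload_shell:
        cmd.append("-F")  # also reload the shell, not just the custom logic
    return cmd

def time_fpga_load(agfi_id, slot=0, reload_shell=False):
    """Return the wall-clock programming time in seconds."""
    start = time.monotonic()
    subprocess.run(build_load_cmd(agfi_id, slot, reload_shell), check=True)
    return time.monotonic() - start

# Example (placeholder AGFI): compare programming time while a co-located
# instance does or does not stress PCIe.
# print(time_fpga_load("agfi-0123456789abcdef0", reload_shell=True))
```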
In this section, we investigate how other hardware present in F1 servers, namely
Network Interface Cards (Sect. 6.6.1), NVMe SSD storage (Sect. 6.6.2), and DRAM
modules directly attached to the FPGAs (Sect. 6.6.3), leaks information that can
permeate the VM instance boundary. These effects can be used
to, for example, cause interference on other users or determine that different VM
instances belong to the same server. The NIC and SSD contention-based attacks are
summarized in Fig. 6.11b.
Another shared resource that can lead to contention is the SSD storage
that F1 instances can access. The public specification of F1 instances notes
that f1.2xlarge instances have access to 470 GB of NVMe SSD storage,
f1.4xlarge have 940 GB, and f1.16xlarge have 4 × 940 GB [14]. This
suggests that F1 servers have four separate 940 GB SSD drives, each of which
can be shared between two f1.2xlarge instances. In this section, we confirm
our hypothesis that one SSD drive can be shared between multiple instances and
explain how this fact can be exploited to reverse-engineer the PCIe topology and
co-locate VM instances. The SSD contention we uncover can also be used for a
slow but reliable covert channel or to degrade the performance of other users,
akin to the interference attack of Sect. 6.5.4. We also demonstrate the existence
of FPGA-to-SSD contention, which is likely the result of the SSD going through
the same PCIe switch, as shown in Fig. 6.2. This topology remains consistent with
the one publicly described for GPU-based P4d instances [7], which appear to be
architecturally similar to F1 instances.
SSD contention is tested by measuring the bandwidth of the SSD using the
hdparm command with its -t option, which performs disk reads without any
data caching [44]. Measurements are averaged over repeated reads of 2 MB chunks
from the disk in a period of 3 seconds. When the server is otherwise idle, hdparm
reports the SSD read bandwidth to be over 800 MBps. However, when the other
f1.2xlarge instance that shares the same SSD stresses it using the stress
command [65] with the --io 4 --hdd 4 parameters, the bandwidth drops
below 50 MBps. The stress command with the parameters above results in 4
threads calling sync (to stress the read buffers) and another 4 threads calling
write and unlink (to stress write performance). The total number of threads
is kept to 8, to match the number of vCPUs allocated to an f1.2xlarge instance,
while all FPGAs remain idle during these experiments.
This non-uniform SSD behavior can be used for a robust covert channel with a
bandwidth of 0.125 bps with 100% accuracy. Specifically, for a transmission of bit
1, stress is called for 7 seconds, while for a transmission of bit 0, the transmitter
remains idle. The receiver uses hdparm to measure its SSD’s bandwidth and can
distinguish between contention and no-contention of the SSD resources (i.e., bits 1
and 0, respectively) using a simple threshold. The period of 8 seconds per bit also
accounts for 1 second of inactivity in every transmission, allowing the disk usage to
return to normal.
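A minimal sketch of this channel follows, assuming the stress and hdparm utilities are installed and that the shared drive appears as /dev/nvme0n1 (a placeholder device name; the actual device may differ).

```python
# Sketch of the 0.125 bps SSD covert channel described above. Each bit
# occupies an 8-second slot: 7 s of SSD stress (or idle) plus 1 s for
# the disk usage to return to normal.
import re
import subprocess
import time

def send_bit(bit):
    if bit:
        # 4 I/O-sync threads and 4 write/unlink threads, matching the
        # 8 vCPUs of an f1.2xlarge, saturate the shared SSD for 7 s.
        subprocess.run(["stress", "--io", "4", "--hdd", "4", "--timeout", "7"])
        time.sleep(1)
    else:
        time.sleep(8)

def parse_hdparm_mbps(output):
    """Extract the MB/sec figure from 'hdparm -t' output."""
    match = re.search(r"=\s*([\d.]+)\s*MB/sec", output)
    return float(match.group(1)) if match else 0.0

def receive_bit(device="/dev/nvme0n1", threshold_mbps=400.0):
    """Contention drops reads from over 800 MBps to below 50 MBps, so a
    simple threshold separates bit 1 (contention) from bit 0 (idle)."""
    out = subprocess.run(["sudo", "hdparm", "-t", device],
                         capture_output=True, text=True).stdout
    return 1 if parse_hdparm_mbps(out) < threshold_mbps else 0
```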
The same mechanism can be exploited to degrade the performance of other
tenants. It can further be used to co-locate instances at an even finer granularity
than was previously possible. To accomplish this, we rent several f1.2xlarge
instances until we find four which form a full NUMA node through the PCIe-based
co-location approach of Sect. 6.4. We then stress the SSD in one of the four instances
and measure the SSD performance in the remaining three. We discover two pairs of
instances with mutual SSD contention, which supports our hypothesis and is also
consistent with the PCIe topology for other instance types [7].
The fact that SSD contention only exists between two f1.2xlarge instances
can be beneficial for adversaries: when the covert-channel receiver and the transmit-
ter are scheduled on two instances that share an SSD, they can communicate without
interference from other tenants in the same NUMA node.2
To formalize the above observations, we use the methodology described in Sect. 6.4
to find four co-located f1.2xlarge instances in the same NUMA node. Then,
for each pair of instances, we repeatedly run hdparm in the “receiver” instance
for a period of 3 minutes, while in the “transmitter” instance we (a) run stress
for 30 s starting at the one-minute mark and (b) use our FPGA-based covert-channel
code as a stressor, constantly transmitting the bit 1, for another 30 s starting at the
two-minute mark.
The results of these experiments are summarized in Fig. 6.18. During idle
periods, the SSD bandwidth is approximately 800–900 MBps. However, for the two
instances with SSD contention, i.e., pairs (A, D) and (B, C), the bandwidth drops
to as low as 7 MBps while the stress command is running (the bandwidth for
the other instance pairs remains unaffected). When the FPGA-based PCIe stressor
is enabled, the SSD bandwidth reported by hdparm is reduced in a measurable way
to approximately 700 MBps.
We further test for the opposite effect, i.e., whether stressing the SSD can cause
a measurable difference to the FPGA-based PCIe performance. We again stress the
SSD between 60 and 90 s and stress the FPGA between 120 and 150 s. As the results
of Fig. 6.19 show, the PCIe bandwidth drops from almost 1.8 GBps to approximately
500–1,000 MBps when the FPGA stressor is enabled, but there is no significant
difference in performance when the SSD-based stressor is turned on. Similar to the
experiments of Sect. 6.6.1, this is likely because the FPGA-based stressor can more
effectively saturate the PCIe link, while the SSD-based stressor seems to be limited
by the performance of the SSD itself, whose bandwidth when idle (800 MBps)
is much lower than that of the FPGA (1.8 GBps). In summary, using the FPGA as
a PCIe stressor can cause the SSD bandwidth to drop, but the converse is not true,
since there is no observable influence on the FPGA PCIe bandwidth as a result of
SSD activity.
2 Assuming that slots within a server are assigned randomly, the probability of getting instances
with shared SSDs given that they are already co-located in the same NUMA node is 33%: out
of the three remaining slots in the same NUMA node, exactly one slot can be in an instance that
shares the SSD.
Fig. 6.18 NVMe SSD bandwidth for all transmitter and receiver pairs in a NUMA node, as
measured by hdparm. Running stress between seconds 60 and 90 causes a bandwidth drop
in exactly one other instance in the NUMA node, while running the FPGA-based PCIe stressor
(between seconds 120 and 150) reduces the SSD bandwidth in all cases
Fig. 6.19 FPGA PCIe bandwidth for all transmitter and receiver pairs in a NUMA node, as
measured by our covert-channel receiver. Running stress between seconds 60 and 90 does not
cause a bandwidth drop, but running the FPGA-based PCIe stressor (between seconds 120 and
150) reduces the bandwidth in all cases
Fig. 6.20 By alternating between AFIs that instantiate DRAM controllers or leave them uncon-
nected, the decay rate of DRAM cells can be measured as a proxy for environmental temperature
monitors [60]
DRAM decay is known to depend on the temperature of the DRAM chip and its
environment [68, 69]. Since the FPGAs in cloud servers have direct access to the
on-board DRAM, they can be used as sensors for detecting and estimating the tem-
perature around the FPGA boards, supplementing PCIe traffic-based measurements.
Figure 6.20 summarizes how the DRAM decay of on-board chips can be used
to monitor thermal changes in the data center. When a DRAM module is being
initialized with some data, the DRAM cells will become charged to store the values,
with true cells storing logical 1s as charged capacitors and anti-cells storing them
as depleted capacitors. Typically, true and anti-cells are paired, so initializing the
DRAM to all ones will ensure only half of the DRAM cells will be charged, even if
the actual location of true and anti-cells is not known.
After the data has been written to the DRAM and the cells have been charged,
the DRAM refresh is disabled. Disabling DRAM refresh in the server itself is not
possible as the physical hardware on the server is controlled by the hypervisor, not
the users. However, the FPGA boards have their own DRAMs. By programming
the FPGAs with AFIs that do and do not have DRAM controllers, disabling of
the DRAM refresh can be emulated, allowing the DRAM cells to decay [60].
Eventually, some of the cells will lose enough charge to “flip” their value (for
example, data written as 1 becomes 0 for true cells since the charge has dissipated).
DRAM data can then be read after a fixed time $T_{decay}$, which is called the decay
time. The number of flipped cells during this time depends on the temperature of the
DRAM and its environment [69] and can therefore produce coarse-grained DRAM-
based temperature sensors of F1 instances.
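The flip-counting step can be sketched as follows; the buffers are toy stand-ins for DRAM reads, not output from a real F1 board.

```python
# Sketch of the decay measurement described above: write a known pattern,
# let the un-refreshed DRAM decay for T_decay, read it back, and count
# flipped cells. Counting set bits in the XOR of the two buffers gives
# the flip count; more flips indicate a higher temperature.

def count_flips(written, readback):
    """Number of bit positions that differ between two equal-length buffers."""
    assert len(written) == len(readback)
    return sum(bin(w ^ r).count("1") for w, r in zip(written, readback))

# Writing all ones charges only the true cells (anti-cells store a 1 as a
# depleted capacitor), so roughly half the cells are able to decay.
written = bytes([0xFF] * 4)
readback = bytes([0xFF, 0xFD, 0xBF, 0xFF])  # two cells have decayed
print(count_flips(written, readback))  # -> 2
```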
Prior work [61] and this chapter have so far focused on information leaks due to
shared resources within a NUMA node but did not attempt to co-locate instances
that are in the same physical server but belong to different NUMA nodes. In this
section, we propose such a methodology that uses the boards’ thermal signatures,
which are obtained from the decay rates of each FPGA’s DRAM modules. To
collect these signatures, we use the method and code provided by Tian et al. [60]
to alternate between bitstreams that instantiate DRAM controllers and ones that
leave them unconnected, to initialize the memory and then disable its refresh.
When two instances are in the same server, the temperatures of all 8 FPGAs in
an f1.16xlarge instance (and by extension the DRAM thermal signatures) are
highly correlated. However, when the instances come from different servers, the
decay rates are different and thus contain distinguishable patterns that can be used
to classify the two instances separately. This insight can be used to find FPGA
instances that are co-located in the same server, even across different NUMA nodes.
Our method for co-locating instances within a server has two aspects to it: first,
we show that we can successfully identify two FPGA boards as being in the same
server with high probability using their DRAM decay rates, and then we show that
by using PCIe-based co-location we can build the full profile of a server and identify
all eight of its FPGA boards, even if they are in different NUMA nodes. More
specifically, we use the open-source software by Tian et al. [60] to collect DRAM
decay measurements for several FPGAs over a long period of time and then find
which FPGAs’ DRAM decay patterns are the “closest.”
To validate our approach, we rent three f1.16xlarge instances (a total of
24 FPGAs) for a period of 24 hours and measure how “close” each pair of FPGA
traces is by calculating the total distance between their data points over the entire
measurement period for three different metrics. The first metric compares the
raw number of bit flips from the DRAM decay measurement $c_{raw}^i$ directly. The
second approach normalizes the data to fit in the $[-1, 1]$ range, i.e., $c_{norm}^i =
(2c_{raw}^i - m - M)/(M - m)$, where $m = \min_i c_{raw}^i$ and $M = \max_i c_{raw}^i$. In Fig. 6.21,
we show an alternative metric, which takes the difference between successive raw
measurements, i.e., $c_{diff}^i = c_{raw}^i - c_{raw}^{i-1}$. Note that if FPGA A is the closest to FPGA
B using these metrics, then B is not necessarily the closest to A. However, if FPGA
A is closest to B and B is closest to C, then A, B, and C are all in the same server.
The raw data metric has an accuracy of 75%, the normalized metric is 71%
accurate, while the difference metric succeeds in correctly pairing all FPGAs except
for one, for an accuracy of 96%. Shorter measurement periods still result in high
accuracies. For example, using the DRAM data from the first 12 hours results in
only one additional FPGA mis-identification, for an accuracy of 92%. We plot the
classification accuracy for the three metrics as a function of time in Fig. 6.22.
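The metrics and the “closest FPGA” pairing can be sketched as follows, with hypothetical decay-count traces in place of real measurements.

```python
# Sketch of the three comparison metrics described above. Distances
# between (transformed) traces determine which FPGA is "closest" to
# which; boards in the same server have strongly correlated traces.

def c_norm(trace):
    m, M = min(trace), max(trace)
    return [(2 * c - m - M) / (M - m) for c in trace]  # scaled to [-1, 1]

def c_diff(trace):
    return [b - a for a, b in zip(trace, trace[1:])]  # successive differences

def distance(t1, t2):
    return sum(abs(a - b) for a, b in zip(t1, t2))

def closest(traces, metric):
    """For each FPGA, find the FPGA whose transformed trace is nearest."""
    transformed = {k: metric(v) for k, v in traces.items()}
    return {
        k: min((o for o in transformed if o != k),
               key=lambda o: distance(transformed[k], transformed[o]))
        for k in transformed
    }

# Hypothetical raw flip counts for four boards; A/B and C/D share servers.
traces = {"A": [100, 120, 90, 110], "B": [102, 118, 92, 109],
          "C": [200, 150, 180, 160], "D": [198, 152, 178, 163]}
print(closest(traces, c_diff))  # pairs A with B and C with D
```

The raw metric would use the identity transform in place of c_norm or c_diff.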
In the experiments of Fig. 6.21, the $c_{diff}$ metric places slots 0–4 of server A
together (along with, mistakenly, slot 0 of server B), slots 5–7 of server A as a second
[Three panels of DRAM decay traces, one per server, plotted against measurement date]
Fig. 6.21 DRAM decay traces from three f1.16xlarge instances (24 FPGAs in total) for a
period of 24 hours, using the differences between successive measurements $c_{diff}^i$ as the comparison
metric, which results in the highest co-location accuracy of 96%. Within each server, measurements
from slots in the same NUMA node have been drawn in the same style
[Plot of Classification Accuracy (%) against Time (h) for the $c_{diff}$, $c_{norm}$, and $c_{raw}$ metrics]
Fig. 6.22 Accuracy of classifying individual FPGAs as belonging to the right server as a function
of measurement time using the three different proposed metrics
group, slots 1–7 of server B as one server and slots 0–3 and 4–7 of server C as the
two final groups. Consequently, our method successfully identifies the six NUMA
nodes without making use of PCIe contention at all.
However, by using insights about the NUMA nodes that can be extracted through
our PCIe-based experiments, the accuracy and reliability of this method can be
further increased. For example, slot 0 of server B could already be placed in the
same NUMA node as slots 1–3 using PCIe-based co-location. Leveraging the PCIe-
based co-location method, if the “closest” FPGA is known to be in the same NUMA
node due to PCIe contention and the second-closest FPGA (not in the same NUMA
node according to PCIe contention) is farther by at most 1% compared to the
closest FPGA, then this second-closest FPGA can be identified as belonging to the
second NUMA node of the same server. In the experiment of Fig. 6.21, this approach
successfully groups all FPGAs in the three tested servers without errors.
6.7 Conclusion
This chapter introduced a novel, fast covert-channel attack between separate users
in a public, FPGA-accelerated cloud computing setting. It characterized how con-
tention of the PCIe bus can be used to create a robust communication mechanism,
even among users of different operating systems, with bandwidths reaching 20 kbps
with 99% accuracy. In addition to using PCIe contention for covert channels, this
chapter demonstrated that contention can be used to monitor or disrupt the activities
of other users, including inferring information about their applications or slowing
them down. This work further identified alternative co-location mechanisms, which
make use of network cards, SSDs, or even the DRAM modules attached to the FPGA
boards, allowing adversaries to co-locate FPGAs in the same server, even if they are
on separate NUMA nodes. More generally, this work demonstrated that malicious
adversaries can use PCIe monitoring to observe the data center server activity,
breaking the separation of privilege that isolated VM instances are supposed to
provide. With more types of accelerators becoming available on the cloud, including
FPGAs, GPUs, and TPUs, PCIe-based threats are bound to become a key aspect
of cross-user attacks. Overall, our insights showed that low-level, direct hardware
access to PCIe, NIC, SSD, and DRAM hardware creates new attack vectors that
need to be considered by both users and cloud providers alike when deciding how
to trade off performance, cost, and security for their designs: even if the endpoints of
computations (e.g., CPUs and FPGAs) are assumed to be secure, the shared nature
of cloud infrastructures poses new challenges that need to be addressed.
References
1. Agne, A., Hangmann, H., Happe, M., Platzner, M., & Plessl, C. (2014). Seven recipes
for setting your FPGA on fire—A cookbook on heat generators. Microprocessors and
Microsystems, 38(8), 911–919.
2. Alam, M. M., Tajik, S., Ganji, F., Tehranipoor, M., & Forte, D. (2019). RAM-Jam: Remote
temperature and voltage fault attack on FPGAs using memory collisions. In Workshop on
Fault Diagnosis and Tolerance in Cryptography (FDTC).
3. Alibaba Cloud. (2023). Instance Families. https://www.alibabacloud.com/help/doc-detail/
25378.html. Accessed 18 May 2023.
4. Amazon Web Services. (2016). Developer preview—EC2 instances (F1) with pro-
grammable hardware. https://aws.amazon.com/blogs/aws/developer-preview-ec2-instances-
f1-with-programmable-hardware/. Accessed 18 May 2023.
6 Contention-Based Threats Between Single-Tenant Cloud FPGA Instances 169
5. Amazon Web Services. (2018). The agility of F1: Accelerate your applications with custom
compute power. https://d1.awsstatic.com/Amazon_EC2_F1_Infographic.pdf. Accessed 18
May 2023.
6. Amazon Web Services. (2019). F1 FPGA application note: How to use write combining to
improve PCIe bus performance. https://github.com/awslabs/aws-fpga-app-notes/tree/master/
Using-PCIe-Write-Combining. Accessed 18 May 2023.
7. Amazon Web Services. (2020). Amazon EC2 P4d instances deep dive. https://aws.amazon.
com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/. Accessed 18 May 2023.
8. Amazon Web Services. (2020). Official repository of the AWS EC2 FPGA hardware and
software development kit v1.4.15. https://github.com/aws/aws-fpga/tree/v1.4.15. Accessed
18 May 2023.
9. Amazon Web Services. (2021). AWS shell interface specification. https://github.com/aws/aws-
fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md. Accessed 18 May 2023.
10. Amazon Web Services. (2021). CL_DRAM_DMA custom logic example. https://github.com/
aws/aws-fpga/tree/master/hdk/cl/examples/cl_dram_dma. Accessed 18 May 2023.
11. Amazon Web Services. (2021). F1 FPGA application note: How to use the PCIe peer-
2-peer version 1.0. https://github.com/awslabs/aws-fpga-app-notes/tree/master/Using-PCIe-
Peer2Peer. Accessed 18 May 2023.
12. Amazon Web Services. (2021). Hello World CL example. https://github.com/aws/aws-fpga/
tree/master/hdk/cl/examples/cl_hello_world. Accessed 18 May 2023.
13. Amazon Web Services. (2022). AWS FPGA - frequently asked questions. https://github.com/
aws/aws-fpga/blob/master/FAQs.md. Accessed 18 May 2023.
14. Amazon Web Services. (2023). Amazon EC2 instance types. https://aws.amazon.com/ec2/
instance-types/. Accessed 18 May 2023.
15. Amazon Web Services. (2023). Amazon Linux 2 FAQs. https://aws.amazon.com/amazon-
linux-2/faqs/. Accessed 18 May 2023.
16. Amazon Web Services. (2023). AWS marketplace. https://aws.amazon.com/marketplace.
Accessed 18 May 2023.
17. Amazon Web Services. (2023). FPGA developer AMI. https://aws.amazon.com/marketplace/
pp/prodview-gimv3gqbpe57k. Accessed 18 May 2023.
18. Amazon Web Services. (2023). FPGA developer AMI (Amazon Linux 2). https://aws.amazon.
com/marketplace/pp/prodview-iehshpgi7hcjg. Accessed 18 May 2023.
19. Amouri, A., Bruguier, F., Kiamehr, S., Benoit, P., Torres, L., & Tahoori, M. (2014). Aging
effects in FPGAs: An experimental analysis. In International Conference on Field Pro-
grammable Logic and Applications (FPL).
20. Baidu Cloud. (2023). FPGA cloud compute. https://cloud.baidu.com/product/fpga.html.
Accessed 18 May 2023.
21. Baker, G., & Lupo, C. (2017). TARUC: A topology-aware resource usability and contention
benchmark. In ACM/SPEC International Conference on Performance Engineering (ICPE).
22. BLUEDOT Inc. (2020). DeepField-SR video super resolution hardware accelera-
tor. https://www.xilinx.com/content/dam/xilinx/publications/solution-briefs/partner/xilinx-
bluedot-solution-brief.pdf. Accessed 18 May 2023.
23. Boemo, E., & López-Buedo, S. (1997). Thermal monitoring on FPGAs using ring-oscillators.
In International Workshop on Field-Programmable Logic and Applications (FPL).
24. Boutros, A., Hall, M., Papernot, N., & Betz, V. (2020). Neighbors from hell: Voltage attacks
against deep learning accelerators on multi-tenant FPGAs. In International Conference on
Field-Programmable Technology (FPT).
25. Danalis, A., Marin, G., McCurdy, C., Meredith, J. S., Roth, P. C., Spafford, K., Tipparaju, V.,
& Vetter, J. S. (2010). The scalable heterogeneous computing (SHOC) benchmark suite. In
Workshop on General-Purpose Processing on Graphics Processing Units (GPGPU).
26. Duan, S., Wang, W., Luo, Y., & Xu, X. (2021). A survey of recent attacks and mitigation on
FPGA systems. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI).
27. Faraji, I., Mirsadeghi, S. H., & Afsahi, A. (2016). Topology-aware GPU selection on multi-
GPU nodes. In IEEE International Parallel and Distributed Processing Symposium Workshops
(IPDPSW).
28. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Measuring long wire leakage with ring
oscillators in cloud FPGAs. In International Conference on Field Programmable Logic and
Applications (FPL).
29. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Reading between the dies: Cross-SLR
covert channels on multi-tenant cloud FPGAs. In IEEE International Conference on Computer
Design (ICCD).
30. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2020). C3APSULe: Cross-FPGA covert-
channel attacks through power supply unit leakage. In IEEE Symposium on Security and
Privacy (S&P).
31. Giechaskiel, I., & Szefer, J. (2020). Information leakage from FPGA routing and logic
elements. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
32. Glamočanin, O., Coulon, L., Regazzoni, F., & Stojilović, M. (2020). Are cloud FPGAs really
vulnerable to power analysis attacks? In Design, Automation & Test in Europe Conference &
Exhibition (DATE).
33. Glamočanin, O., Mahmoud, D. G., Regazzoni, F., & Stojilović, M. (2021). Shared FPGAs and
the holy grail: Protections against side-channel and fault attacks. In Design, Automation & Test
in Europe Conference & Exhibition (DATE).
34. Gnad, D. R. E., Oboril, F., & Tahoori, M. B. (2017). Voltage drop-based fault attacks on
FPGAs using valid bitstreams. In International Conference on Field Programmable Logic
and Applications (FPL).
35. Gobulukoglu, M., Drewes, C., Hunter, W., Kastner, R., & Richmond, D. (2021). Classifying
computations on multi-tenant FPGAs. In Design Automation Conference (DAC).
36. Jin, C., Gohil, V., Karri, R., & Rajendran, J. (2020). Security of cloud FPGAs: A survey. https://
arxiv.org/abs/2005.04867. Accessed 18 May 2023.
37. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault
attacks on shared FPGAs, suitable for DFA on AES. Transactions on Cryptographic Hardware
and Embedded Systems (TCHES), 2018(3), 44–68.
38. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2019). Mitigating electrical-level attacks
towards secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable
Technology and Systems (TRETS), 12(3).
39. La, T., Pham, K., Powell, J., & Koch, D. (2021). Denial-of-Service on FPGA-based cloud
infrastructures – Attack and defense. Transactions on Cryptographic Hardware and Embedded
Systems (TCHES), 2021(3), 441–464.
40. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 13(3).
41. Li, C., Sun, Y., Jin, L., Xu, L., Cao, Z., Fan, P., Kaeli, D., Ma, S., Guo, Y., & Yang, J.
(2019). Priority-based PCIe scheduling for multi-tenant multi-GPU systems. IEEE Computer
Architecture Letters (LCA), 18(2), 157–160.
42. López-Buedo, S., Garrido, J., & Boemo, E. (2000). Thermal testing on reconfigurable
computers. IEEE Design & Test of Computers (D&T), 17(1), 84–91.
43. López-Buedo, S., Garrido, J., & Boemo, E. (2002). Dynamically inserting, operating, and
eliminating thermal sensors of FPGA-based systems. IEEE Transactions on Components and
Packaging Technologies (TCAPT), 25(4), 561–566.
44. Lord, M. (2022). hdparm. https://sourceforge.net/projects/hdparm/. Accessed 18 May 2023.
45. Luo, Y., & Xu, X. (2020). A quantitative defense framework against power attacks on multi-
tenant FPGA. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
46. Mahmoud, D. G., Lenders, V., & Stojilović, M. (2023). Electrical-level attacks on CPUs,
FPGAs, and GPUs: Survey and implications in the heterogeneous era. ACM Computing
Surveys (CSUR), 55(3).
47. Martz, M. (2021). speedtest-cli. https://github.com/sivel/speedtest-cli. Accessed 18 May 2023.
48. Mirzargar, S. S., & Stojilović, M. (2019). Physical side-channel attacks and covert communi-
cation on FPGAs: A survey. In International Conference on Field Programmable Logic and
Applications (FPL).
49. Moini, S., Tian, S., Holcomb, D., Szefer, J., & Tessier, R. (2021). Remote power side-channel
attacks on BNN accelerators in FPGAs. In Design, Automation & Test in Europe Conference
& Exhibition (DATE).
50. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power distribution attacks in multi-
tenant FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI),
28(1).
51. Provelengios, G., Holcomb, D., & Tessier, R. (2021). Mitigating voltage attacks in multi-tenant
FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 1(1).
52. Rakin, A. S., Luo, Y., Xu, X., & Fan, D. (2021). Deep-Dup: An adversarial weight duplication
attack framework to crush deep neural network in multi-tenant FPGA. In USENIX Security
Symposium.
53. Schaa, D., & Kaeli, D. (2009). Exploring the multiple-GPU design space. In IEEE Interna-
tional Parallel and Distributed Processing Symposium Workshops (IPDPSW).
54. Spafford, K., Meredith, J. S., & Vetter, J. S. (2011). Quantifying NUMA and contention effects
in multi-GPU systems. In Workshop on General-Purpose Processing on Graphics Processing
Units (GPGPU).
55. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters, 55(11),
640–642.
56. Tan, M., Wan, J., Zhou, Z., & Li, Z. (2021). Invisible probe: Timing attacks with PCIe congestion
side-channel. In IEEE Symposium on Security and Privacy (S&P).
57. Tencent Cloud. (2023). FPGA cloud server. https://cloud.tencent.com/product/fpga. Accessed
18 May 2023.
58. Tian, S., Krzywosz, A., Giechaskiel, I., & Szefer, J. (2020). Cloud FPGA security with RO-
based primitives. In International Conference on Field-Programmable Technology (FPT).
59. Tian, S., & Szefer, J. (2019). Temporal thermal covert channels in cloud FPGAs. In
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
60. Tian, S., Xiong, W., Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2020). Finger-
printing cloud FPGA infrastructures. In ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays (FPGA).
61. Tian, S., Xiong, W., Giechaskiel, I., & Szefer, J. (2021). Cloud FPGA cartography using
PCIe contention. In IEEE Symposium on Field-Programmable Custom Computing Machines
(FCCM).
62. Tian, S., Xiong, W., Giechaskiel, I., & Szefer, J. (2021). Remote power attacks on the versatile
tensor accelerator in multi-tenant FPGAs. In IEEE Symposium on Field-Programmable
Custom Computing Machines (FCCM).
63. Valtchanov, B., Aubert, A., Bernard, F., & Fischer, V. (2008). Modeling and observing the jitter
in ring oscillators implemented in FPGAs. In IEEE Workshop on Design and Diagnostics of
Electronic Circuits and Systems (DDECS).
64. Wang, X., Niu, Y., Liu, F., & Xu, Z. (2022). When FPGA meets cloud: A first look at
performance. IEEE Transactions on Cloud Computing (TCC), 10(2), 1344–1357.
65. Waterland, A. P. (2014). stress. https://web.archive.org/web/20190502/https://people.seas.
harvard.edu/~apw/stress/. Accessed 18 May 2023.
66. Xilinx, Inc. (2021). 63419 - Vivado partial reconfiguration - What types of bitstreams are used
in partial reconfiguration (PR) solutions? https://support.xilinx.com/s/article/63419. Accessed
18 May 2023.
67. Xilinx, Inc. (2023). UltraScale+ FPGAs: Product tables and product selection guides. https://
www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-
selection-guide.pdf. Accessed 18 May 2023.
68. Xiong, W., Anagnostopoulos, N. A., Schaller, A., Katzenbeisser, S., & Szefer, J. (2019).
Spying on temperature using DRAM. In Design, Automation & Test in Europe Conference
& Exhibition (DATE).
69. Xiong, W., Schaller, A., Anagnostopoulos, N. A., Saleem, M. U., Gabmeyer, S., Katzenbeisser,
S., & Szefer, J. (2016). Run-time accessible DRAM PUFs in commodity devices. In
Conference on Cryptographic Hardware and Embedded Systems (CHES).
70. Yin, C. E., & Qu, G. (2009). Temperature-aware cooperative ring oscillator PUF. In IEEE
International Workshop on Hardware-Oriented Security and Trust (HOST).
71. Zhang, J., & Qu, G. (2019). Recent attacks and defenses on FPGA-based systems. ACM
Transactions on Reconfigurable Technology and Systems (TRETS), 12(3).
72. Zhang, Y., Yasaei, R., Chen, H., Li, Z., & Al Faruque, M. A. (2021). Stealing neural network
structure through remote FPGA side-channel analysis. IEEE Transactions on Information
Forensics and Security (TIFS), 16, 4377–4388.
Chapter 7
Cross-board Power-Based FPGA, CPU,
and GPU Covert Channels
7.1 Introduction
I. Giechaskiel
Independent Researcher, London, UK
e-mail: ilias@giechaskiel.com
K. Rasmussen
University of Oxford, Oxford, UK
e-mail: kasper.rasmussen@cs.ox.ac.uk
J. Szefer ()
Yale University, New Haven, CT, USA
e-mail: jakub.szefer@yale.edu
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 173
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_7
single-tenant FPGAs on different FPGA boards that are merely powered through
the same PSU. Moreover, we show that if this PSU also powers the host computer,
the same sink FPGA (receiver) can detect high levels of CPU and GPU activity,
creating new CPU-to-FPGA and GPU-to-FPGA channels. These channels allow one
system, which may (GPU, FPGA) or may not (CPU) contain an accelerator, to leak
information such as private encryption keys to an entirely different system (the sink
FPGA), which is fully isolated, except for the shared power supply.
The first crucial observation of our work is that although causing variable power
consumption to transmit information is easy, detecting voltage fluctuations without
external equipment is non-trivial. However, the reconfigurability of FPGAs provides
access to the hardware at a much lower level and can be used to implement circuits
that detect voltage changes that are imperceptible to fixed silicon chips such as
CPUs and GPUs. Indeed, cloud providers are aware of the impact of such low-level
hardware access, so besides allocating FPGAs on a per-user basis, they also keep
several features such as voltage and temperature monitors inaccessible to end users.
The second key observation is that ring oscillators (ROs) are capable of both
causing and sensing voltage fluctuations. This chapter therefore introduces a novel
way of monitoring changes in voltage caused by the source FPGA, CPU, or GPU.
Specifically, both properties of ROs are used in the sink (receiver) FPGA, whereby
stressing the voltage regulator of the sink FPGA allows one to detect transmissions
by the source (transmitter) FPGA.
Using these insights, we demonstrate the first cross-FPGA covert channel
between off-the-shelf, unmodified Xilinx Artix 7, and Kintex 7 boards in either
direction of communication. We also characterize the bandwidth–accuracy tradeoffs
across different measurement periods and sizes of the covert-channel ROs on the
source and sink FPGAs. We further test our covert channel on two PSUs running
under normal operating conditions (i.e., without being overloaded) and introduce
CPU-to-FPGA and GPU-to-FPGA covert channels by modulating their respective
loads. We finally discuss countermeasures to mitigate this source of leakage.
7.1.1 Contributions
The rest of the chapter is organized as follows. Section 7.2 introduces the threat
model, while Sect. 7.3 details the experimental setup, including hardware properties,
the measurement procedure, and the high-level architectural FPGA design. Sec-
tion 7.4 then describes the need for our novel classification metric and explains why
it works where the naive approach of looking at absolute ring oscillator counts fails.
Section 7.5 then evaluates cross-FPGA covert communication over shared PSUs,
varying the number of source and sink ring oscillators used, and performing an
analysis of bandwidth–accuracy tradeoffs. Section 7.6 then covers CPU-to-FPGA
and GPU-to-FPGA information leakage, while Sect. 7.7 discusses potential defense
mechanisms. We place our work in the context of related research in Sect. 7.8, before
we conclude in Sect. 7.9.
Prior work on attacks without physical access to the FPGA hardware has primarily
investigated security in the context of multi-tenant FPGAs. It has shown that when
a single FPGA chip is shared among multiple users concurrently, designs are
vulnerable to temperature and voltage attacks (Sect. 7.8). Although these attacks
highlight potential issues with future architectures, they remain theoretical at the
moment, as FPGAs are currently allocated on a per-user basis. In this chapter, we
are thus concerned with covert-channel attacks against platforms where the entire
logic is allocated to a single user. Design logic therefore cannot access any voltage
or thermal system monitors present on the FPGA fabric, as these are inaccessible in a
cloud environment.1 Compared to multi-tenant attacks on FPGA designs that share
the same power distribution network, adversarial attacks to infer any information
about the activity or data (e.g., encryption keys) of other users necessitate that side-
1 In cloud FPGAs, part of the fabric is reserved by a cloud-provided “shell” that hides implementa-
tion details, including physical pinouts, identification primitives, and system monitors. User logic
is forced to interact with external hardware through the shell’s AXI4 interfaces.
Fig. 7.1 System model for FPGA-to-FPGA, CPU-to-FPGA, and GPU-to-FPGA leakage in co-
located environments. The CPU, GPU, and one or more (potentially malicious) FPGAs are
powered through the same PSU but do not share any logic and do not have access to system
monitors for measuring voltage or temperature changes
It should be noted that some cloud providers such as Amazon Web Services
(AWS) place restrictions on the types of circuits that can be instantiated on their
FPGAs and prohibit combinatorial loops including ring oscillators [9, 35]. Although
in this chapter we primarily use conventional ring oscillators, Sect. 7.5.5 shows that
they can be easily replaced by alternate designs proposed in recent work [9, 10, 22,
35], which bypass such cloud countermeasures, and could therefore be used to attack
the isolation mechanisms that separate physical hardware is supposed to provide.
In this section, we detail our experimental setup, starting with the ring oscillators
employed in the source and sink FPGAs (Sect. 7.3.1) and delving into the archi-
tectural design of the FPGA transmission and reception circuitry (Sect. 7.3.2). We
then describe the hardware properties of the FPGA boards used (Sect. 7.3.3), as well
as the computer PSUs, CPUs, and GPUs, which are effectively turned into covert-
channel transmitters (Sect. 7.3.4). We finally discuss the process followed for data
collection (Sect. 7.3.5).
Ring oscillators consist of an odd number of NOT gates in a ring formation
and therefore form a combinatorial loop, whose value oscillates. The frequency of
oscillation changes based on process variations, as well as voltage and temperature
conditions [16], making ROs good temperature [38] and voltage [46] monitors. ROs
also cause voltage fluctuations, which stress power circuits, and can potentially
crash the FPGA or inject faults [12, 24, 26, 27, 30].
In this chapter, we use ROs as both transmitters and receivers and implement
them using lookup tables (LUT-RO) with one inverter and three buffer stages as
shown in Fig. 7.2. We chose to use this RO design instead of more common ROs
with three inverters or one inverter and two buffer stages because preliminary
experiments showed that they resulted in more stable measurements. Alternative
types of ROs are evaluated in Sect. 7.5.5.
Fig. 7.2 The ring oscillators are implemented using lookup tables (LUT-ROs) and contain one
inverter and three buffer gates
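Although the sensing mechanism itself is analog, its effect on RO counts can be illustrated with a toy numerical model; the frequency and voltage-sensitivity coefficients below are invented for illustration and are not measurements from these boards:

```python
# Toy model of a ring oscillator used as a voltage sensor (coefficients are
# invented for illustration). The RO toggles at f_ro; sampling its counter
# over 2**t cycles of a fixed reference clock yields a count proportional to
# the RO frequency, so small supply-voltage changes shift the count.

F_CLK = 200e6  # 200 MHz board oscillator, as on the KC705/AC701

def ro_frequency(vdd, f0=400e6, k=1.0e9):
    """Hypothetical linear voltage dependence: f = f0 + k * (vdd - 1.0 V)."""
    return f0 + k * (vdd - 1.0)

def ro_count(vdd, t=16):
    """Count of RO oscillations captured during a 2**t-cycle window."""
    return int(ro_frequency(vdd) * (2 ** t) / F_CLK)

nominal = ro_count(1.000)
drooped = ro_count(0.995)  # a 5 mV droop caused by source-side activity
print(nominal, drooped, nominal - drooped)
```

Under this toy model a few-millivolt droop shifts the count by well over a thousand, which is why counting over many cycles makes imperceptible voltage changes measurable in digital logic.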
Fig. 7.3 Experimental setup: the covert source (left) uses T · NT ROs, while the sink (right) has
R · NR measurement ROs and S · NS stressor ROs. The same power supply unit powers both boards
We now give a high-level overview of the covert-channel source and sink FPGA
designs, which are summarized in Fig. 7.3.
To cause detectable changes on the sink, the source FPGA employs ring oscillators
organized as T transmitters, which can be controlled independently. These transmit-
ters are placed on separate clock regions to make power consumption more evenly
spread throughout the FPGA. They contain .NT ROs each, for a total of .T · NT ROs,
as shown in the left part of Fig. 7.3.
Fig. 7.4 Annotated Vivado screenshot of the sink architecture on the Kintex 7 board, with receiver
ROs in red, stressor ROs in blue, and other logic (counters, UART, FIFO) in brown
For our experiments, we use Xilinx Kintex 7 KC705 and Artix 7 AC701 boards. The
28 nm chips these devices contain are similar, but the Kintex 7 is more performant,
while the Artix 7 is optimized for low power [41, 44]. Both FPGAs have a 200 MHz
oscillator and operate at a core VCCINT voltage of 1.0 V, but the boards use
different regulators to convert the 12 V PSU output into 1.0 V [42, 43].
For the source FPGA designs, we place a transmitter on each clock region of the
FPGA. As the Artix 7 board has 10 clock regions, while the Kintex 7 has 14, the
numbers of transmitters on these devices are T = 10 and T = 14, respectively.
The sink FPGAs contain R = 4 receivers in the corners of each chip, each with
NR = 5 ROs. Sink FPGAs also contain S = 5 stressors, one of which is placed
in the center of the device, while the remaining four are next to the receiver clock
regions (Fig. 7.4 shows an example with .NS = 500). Although not shown to be
significant in our experiments, these early architectural choices were made to ensure
that the power draw was approximately equally spread across the FPGA fabric.
These decisions and other FPGA properties are summarized in Table 7.1. More
compile- and run-time parameters, such as the measurement period and the numbers
of source transmitter ROs NT and sink stressor ROs NS, are varied in Sect. 7.5.
To verify that the covert channel is not due to faulty design in a line of specific power
supply units, we test communication on two PSUs made by different manufacturers
(Corsair and Dell), rated for different loads (850 W and 1300 W, respectively), and
both with a Gold 80 Plus Certification (which guarantees 90% efficiency at 50%
load). These PSUs are integrated in two computers, the first of which contains two
Xeon E5645 CPUs for a total of 24 threads, while the second contains a single
Xeon E5-2609 with 4 threads. They also contain Nvidia GeForce GPUs, with 96
and 640 CUDA cores, respectively. The CPU and GPU cores are used as the covert-
channel sources in Sect. 7.6 for CPU-to-FPGA and GPU-to-FPGA communication
over the shared power supply. The properties of the computers used are summarized
in Table 7.2.
7 Cross-board Power-Based FPGA, CPU, and GPU Covert Channels 181
Table 7.2 Hardware properties of the two computers used, with their corresponding PSUs, CPUs,
and GPUs
Property PC-A PC-B
PSU Brand Corsair Dell
Power Rating 850 W 1300 W
80 Plus Certification Gold Gold
Motherboard SuperMicro X8DAL-i Dell Precision T7600
Xeon CPU Model E5645 E5-2609
# of CPU Cores 6 @ 2.4 GHz 4 @ 2.4 GHz
# of Threads 12 4
# of CPUs 2 1
GeForce GPU ZOTAC GT 430 EVGA GTX 750 Ti
GPU Memory 1 GB GDDR3 2 GB GDDR5
# of CUDA Cores 96 @ 0.7 GHz 640 @ 1.0 GHz
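To modulate CPU load as a covert-channel source, nothing beyond ordinary user-level code is needed. The following sketch (hypothetical, not the chapter's actual transmitter) toggles worker processes between busy-spinning and sleeping according to the Manchester-encoded bitstream; the half-bit period and worker count are arbitrary example values:

```python
# Sketch of a CPU-load covert-channel transmitter (illustrative only). Each
# bit b is Manchester-encoded as the pair (b, 1 - b); during a "1" half-period
# all worker processes busy-spin, stressing the PSU, and during a "0"
# half-period they sleep.
import time
from multiprocessing import Process

PERIOD = 0.02  # seconds per half-bit; a real channel would tune this

def manchester(bits):
    """Encode each bit b as the pair (b, 1 - b)."""
    return [half for b in bits for half in (b, 1 - b)]

def burn(stop_at):
    # Busy-spin until the deadline to maximize CPU power draw.
    while time.monotonic() < stop_at:
        pass

def send_half(level, n_workers=4):
    if level:
        stop_at = time.monotonic() + PERIOD
        workers = [Process(target=burn, args=(stop_at,)) for _ in range(n_workers)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
    else:
        time.sleep(PERIOD)

def transmit(bits):
    for half in manchester(bits):
        send_half(half)

if __name__ == "__main__":
    transmit([1, 0, 1, 1])
```

A GPU-side source would follow the same pattern, launching and idling compute kernels instead of spinning CPU workers.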
For our data collection process, we made several choices to make the communica-
tion scenario realistic. For instance, the computers attached to the PSUs were used
normally during experimentation, including running and installing other software.
Moreover, to ensure leakage is not due to temperature, the FPGAs were placed
outside the computer case, and away from computer fans, which may affect
measurements by turning on or off based on the computer temperature. We similarly
placed the FPGAs next to each other horizontally (as opposed to stacking them
vertically), further minimizing cross-FPGA temperature effects. In addition, to
control for other voltage effects, the FPGAs were not connected to the computer
over PCIe, which would likely increase the potential for leakage. However, as
we show in Sect. 7.5.5, our covert channel operates with similar accuracy, even
when the FPGAs are connected to the computer over PCIe and are enclosed in it
without accounting for temperature variations. Finally, to verify that the leakage is
not caused through the UART interface, we often used one computer to take the
measurements, and the other to power the source and sink boards through its PSU.
As there is inherent noise in the measurements, (a) the absolute RO frequency
is not well-suited for comparison, and (b) the RO counts need to be averaged over
repeated measurements to produce meaningful results. To address both concerns, we
use Manchester encoding, where to send a 1, the source transmitters are enabled for
one measurement period and disabled for the next (a 0 is similarly encoded by first
disabling transmitters during the first measurement period and enabling them in the
second period). These measurement periods are M · 2^t clock cycles long, where we
average M RO counts collected by ROs enabled for 2^t clock cycles (see Sect. 7.4).
The bandwidth can thus be calculated as

b = f_c / (2 · 2^t · M),    (7.1)

where f_c is the clock frequency.
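The bandwidth b = f_c / (2 · 2^t · M) can be evaluated directly; the parameter values below are example settings (the 200 MHz clock and M = 500 appear later in the chapter, while t = 16 is an illustrative choice):

```python
# Evaluating the covert-channel bandwidth b = f_c / (2 * 2**t * M).
# Each Manchester-encoded bit occupies two measurement periods, and each
# period averages M counts, each taken over 2**t clock cycles.

def bandwidth_bps(f_clk, t, m):
    return f_clk / (2 * (2 ** t) * m)

# Illustrative parameters: 200 MHz clock, 2**16-cycle count windows, M = 500.
print(bandwidth_bps(200e6, 16, 500))  # bits per second
```

Shrinking t or M raises the bandwidth but averages out less noise, which is the bandwidth-accuracy tradeoff explored in Sect. 7.5.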
This section introduces a novel methodology to detect changes in the power supply
voltage through the sink’s “stressor” ROs. Section 7.4.1 first motivates why the naive
approach of using the absolute ring oscillator counts is insufficient for classification
of transmissions in this scenario. Section 7.4.2 then introduces the metric using
stressors, while Sect. 7.4.3 finally explains why our technique works.
Broadly speaking, when the transmitters are activated on the source FPGA, CPU, or
GPU, there is a voltage drop that is visible not just at the board regulator, but also
at the 12 V rail PSU input to the FPGA board. Indeed, Fig. 7.5 demonstrates this
[Plot: nominal voltage difference (mV, y-axis, 5–40) versus power supply voltage (V, x-axis, 11.5–12.5 V), one line per number of enabled transmitters T ∈ {0, 2, 4, 6, 8, 10, 12, 14}]
Fig. 7.5 Voltage as set by the power supply and measured by the oscilloscope for various numbers
of enabled transmitters T on the KC705-2 source, with 99% confidence intervals
Fig. 7.6 The average ring oscillator counts C_V^i (at 99% confidence) on the AC701-1 sink remain
approximately the same for different power supply voltages V and all eight ring oscillators R_i
for a Kintex 7 source without a sink FPGA present across multiple input voltages
and different numbers of enabled transmitters T . Specifically, we power the board
using a Keithley 2231A power supply and measure the voltage at the power rail of
the board using a Tektronix MDO3104 Mixed Domain Oscilloscope with TPP1000
1 GHz passive probes, taking 10 000 data points. Figure 7.5 indicates that at any
voltage level provided by the power supply (11.5 V to 12.5 V), as the number of
enabled source transmitters T increases, the voltage measured by the oscilloscope
decreases. For example, at 12.5 V, the oscilloscope measures 12.539 V when no
transmitters are enabled, but only 12.521 V when 14 transmitters are enabled, for
a voltage drop of approximately 18 mV. At 11.5 V, the measured voltage similarly
drops from 11.525 V to 11.507 V.
Although one would expect RO frequency to increase with higher voltages [16],
this is not the case. For a ring oscillator i, let its average count be C_V^i when the
voltage provided by the power supply is 11.5 V ≤ V ≤ 12.5 V. We would expect
that C_{V_1}^i > C_{V_2}^i whenever V_1 > V_2, but Fig. 7.6 suggests that the RO counts remain
approximately the same for all eight ring oscillators and voltages V tested on an
Artix 7 sink, likely because the regulator is able to deal with such input voltages. As
a result, the absolute RO frequency cannot be used to decode cross-FPGA covert-
channel transmissions.
To solve the issues identified above, we introduce ROs to “stress” the voltage
regulator and make external changes in the power supply voltage measurable. For
any bit transmission (say the i-th one), we take M measurements as follows:
[Timing diagram: clock; transmitted and Manchester-encoded bits; source RO activity; stressor RO enables; measurement periods 0–3; the R · NR RO counts C^{n0}, . . . , C^{n3}; their contributions to D^n and E^n; the differences Δ^n = D^n − E^n; and the final comparisons Δ^0 > Δ^1 and Δ^2 < Δ^3]
Fig. 7.7 Timing diagram for a Manchester-encoded transmission of the two bits 10, with M = 4
measurement periods. Half of the ring oscillator counts are taken when the stressors are enabled
(E), and the other M/2 = 2 counts when they are disabled (D) to compute Δ = D − E. The
receiver uses the sign (positive or negative) of the difference Δ^{2n} − Δ^{2n+1} between the two parts
of the encoded transmission of the n-th bit to determine if it should be decoded as a 0 or as a 1.
For example, (C^{00} − C^{01} + C^{02} − C^{03})/2 = Δ^0 > Δ^1 = (C^{10} − C^{11} + C^{12} − C^{13})/2, so the first bit is
decoded as a 1. Similarly, Δ^2 < Δ^3, so the second bit is decoded as a 0
1. For the first measurement period, we disable all stressor ROs, and let the receiver
ROs run for 2^t clock cycles, producing counts C^{i0} = (C^{i0}_0, . . . , C^{i0}_{R·NR−1}).
2. In the second period, we enable all (or some, see Sects. 7.4.3 and 7.5.3) stressor
ROs and estimate the RO frequencies through their counts, C^{i1}.
3. In the third measurement period, we disable all stressor ROs, re-enable them in
the fourth period, and so forth.
This procedure produces M/2 measurements C_0^i, C_2^i, . . . corresponding to
disabled stressors, and M/2 measurements C_1^i, C_3^i, . . . corresponding to enabled
stressors, as also shown in the timing diagram of Fig. 7.7. Figure 7.7 represents
Manchester-encoded transmissions of the 2 bits 10, averaging over M = 4
measurements and only repeating transmissions once (actual measurements have
M = 500, with 4 repetitions). We take the average of each set per RO, thereby
calculating the disabled-stressor average D^i = (2/M) · Σ_{k=0}^{M/2−1} C_{2k}^i and the
enabled-stressor average E^i = (2/M) · Σ_{k=0}^{M/2−1} C_{2k+1}^i. We then use
Δ^i = D^i − E^i to recover the transmitted bit.
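The averaging step above can be sketched in a few lines; the counts below are illustrative values chosen for the example, not measurements from the chapter.

```python
# Sketch of the averaging step: given M alternating RO count measurements
# for one receiver RO (even indices: stressors disabled, odd: enabled),
# compute the disabled/enabled averages D and E and their difference.

def delta_metric(counts):
    """counts: list of M RO counts, alternating disabled/enabled periods."""
    disabled = counts[0::2]            # C_0, C_2, ... (stressors off)
    enabled = counts[1::2]             # C_1, C_3, ... (stressors on)
    d = sum(disabled) / len(disabled)  # D = (2/M) * sum of even-index counts
    e = sum(enabled) / len(enabled)    # E = (2/M) * sum of odd-index counts
    return d - e                       # Delta = D - E

counts = [5400, 5180, 5410, 5170]      # M = 4 illustrative counts
print(delta_metric(counts))            # (5405 - 5175) = 230.0
```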
Specifically, assume that we wish to recover the n-th bit, corresponding to
transmissions 2n and 2n + 1, as each bit b is Manchester-encoded as the pair
(b, 1 − b). In each transmission pair, there is always a 1 bit and a 0 bit, so we can
compare the R · NR counts of Δ2n and Δ2n+1. If the majority of the RO differences
in the first set of measurements is bigger than the corresponding differences in the
second set of measurements (i.e., Δ2n > Δ2n+1 for most ROs), we classify the n-th
bit as a 1, while if the majority is smaller (Δ2n < Δ2n+1 for most ROs), we classify
it as a 0.
Figure 7.8 demonstrates the need for this more complicated procedure in
practice for a transmission of a Manchester-encoded 1 bit. Specifically, it compares
our new metric with stressor ROs, Δ2n − Δ2n+1, against the naive bit-recovery
metric D2n − D2n+1 for all 20 receiver ROs. As Fig. 7.8 (blue circles) shows,
Δ2n − Δ2n+1 > 0 for all 20 receiver ROs R0, R1, . . ., so our metric correctly
7 Cross-board Power-Based FPGA, CPU, and GPU Covert Channels 185
Fig. 7.8 All RO count differences with stressors Δ2n − Δ2n+1 (blue circles) are positive, correctly
decoding a transmission of 1. However, the naive metric without stressors D2n − D2n+1 (orange
diamonds) behaves randomly, with only about half being positive
recovers this bit transmission. However, the D2n − D2n+1 values with stressors
disabled (orange diamonds) behave randomly, and indeed, in the experiment in
which these measurements originated, our metric successfully recovered over 98%
of transmissions, compared to 53% using the naive method without the stressors.
Section 7.4.3 further expands on why the new technique makes for a good approach
to detecting transmissions.
In this section, we test the receiving circuit (sink FPGA) on its own to characterize
its behavior. We first plot in Fig. 7.9 the average metric Δ_V^i for the eight ring
oscillators of Fig. 7.6 across the same power supply voltages 11.5 V ≤ V ≤ 12.5 V.
As expected, for all ROs, Δ_{V1}^i < Δ_{V2}^i whenever V1 > V2: when there is an external
voltage drop (e.g., when the source FPGA enables the transmitter ROs), the Δ metric
increases compared to when there are no external transmissions.
We additionally test the behavior of the receiver FPGA across different measurement
times of 2^t clock cycles and the number of enabled stressors S. Specifically,
we conduct measurements on an Artix 7 sink and calculate the average value of
our Δ metric over all 20 receiver ROs at two voltage levels: 11.5 V and 12.5 V.
Figure 7.10 plots our results, which lead to several observations.
First of all, the average difference Δ = Δ11.5 − Δ12.5 is close to zero for time
periods up to 41 µs, indicating that prolonged measurement times are necessary
to distinguish between transmissions of zero and one, which in practice result in
Fig. 7.9 The average metric Δ_V^i on the AC701-1 sink decreases with higher power supply voltages
V for all eight ring oscillators Ri
Fig. 7.10 Difference between the average Δ metric as measured at 11.5 V and 12.5 V for different
measurement times and numbers of stressors enabled on the AC701-1 sink
much smaller voltage drops of ≈20 mV. Moreover, until 2.6 ms, Δ > 0 for all
choices of how many stressors S to enable simultaneously, with fewer stressors
resulting in a larger effect. However, for even larger time periods, Δ < 0, with
more stressors resulting in a bigger effect in magnitude. Consequently, the choice of
the number of stressors and the measurement time is intricately linked with the accuracy of
the covert channel and, in fact, helps explain why in some experimental setups (e.g.,
the KC705-1 receiver on PSU-B of Table 7.4), the recovered pattern is flipped, i.e.,
a 0 bit is identified as a 1 bit and vice versa.
Table 7.3 Default values for accuracy- and bandwidth-related parameters, and the chapter
sections in which they are varied. Bandwidth is calculated using Eq. (7.1)

Property                      Artix 7   Kintex 7   Section
# of Transmitter ROs, NT      1000      1000       7.5.2
# of Enabled Transmitters     10        14         7.5.2
Transmitted Pattern           0xf3ed1   0xf3ed1    7.5.4
Transmitter Types             LUT-RO    LUT-RO     7.5.4
# of Stressor ROs, NS         500       500        7.5.2
# of Enabled Stressors        1         5          7.5.3
Stressor and Receiver Types   LUT-RO    LUT-RO     7.5.5
# of Repetitions per Bit, M   500       500        7.5.3
Measurement Cycles, 2^t       2^15      2^21       7.5.3
Channel Bandwidth b (b s^−1)  6.1       0.1        7.5.3
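The bandwidths in Table 7.3 are consistent with each bit requiring two Manchester halves of M measurements of 2^t cycles each, clocked at 200 MHz. Both this reconstruction of Eq. (7.1) and the 200 MHz clock are assumptions made here for illustration; the chapter's actual equation may differ in detail.

```python
# Assumed reconstruction of Eq. (7.1): b = f_clk / (2 * M * 2^t), where the
# factor of 2 accounts for the two Manchester halves per transmitted bit.

def bandwidth(f_clk_hz, m, t):
    return f_clk_hz / (2 * m * 2**t)

print(round(bandwidth(200e6, 500, 15), 1))  # 6.1  (Artix 7 column)
print(round(bandwidth(200e6, 500, 21), 1))  # 0.1  (Kintex 7 column)
```

Both values match the table, which lends some support to the assumed parameters.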
In this section, we give an overview of our cross-FPGA results. The values for the
default experimental parameters used in these experiments and the corresponding
covert-channel bandwidths are summarized in Table 7.3. These values were chosen
based on exploratory testing, as they represent a good tradeoff between accuracy
and bandwidth. However, in some cases, better accuracy can be achieved at the
cost of bandwidth, or the same accuracy can be maintained despite increasing the
bandwidth (see Sect. 7.5.3).
The results of our measurements across all 12 combinations of source and sink
FPGAs on both PSUs are summarized in Table 7.4. As the table shows, covert
communication is possible with high accuracy between any two boards, in either
direction, and on both PSUs. The table also allows us to draw various conclusions.
First of all, the behavior is not the same for identical boards. This is likely due to
both process variations internal to the FPGA chip (which affect RO measurements),
and because of different component tolerances. As an example, the AC701-2 board
Table 7.4 Accuracy for cross-FPGA covert channels on PSUs A and B, using the default
experimental parameters

                      Receiver
PSU  Transmitter  AC701-1  AC701-2  KC705-1  KC705-2
A    AC701-1      –        79%      92%      100%
A    AC701-2      99%      –        93%      100%
A    KC705-1      100%     86%      –        100%
A    KC705-2      100%     98%      99%      –
B    AC701-1      –        100%     †98%     100%
B    AC701-2      100%     –        †99%     100%
B    KC705-1      100%     95%      –        100%
B    KC705-2      100%     100%     †98%     –

† signifies that the recovered bit pattern is flipped
is a worse sink than the AC701-1 board, while the KC705-1 board is a worse source
than the KC705-2 board.
Moreover, the Kintex 7 boards are generally better sources than the Artix 7
boards, due to the higher count of transmitters they contain (T = 14 as opposed to
T = 10). As we show in Sect. 7.5.2, more transmitters tend to improve the quality of
the covert channel. Finally, we notice that although the information leakage remains
strong on both PSUs, the accuracy of the recovered data on the 1300 W PSU-B is
generally higher than the accuracy on the 850 W PSU-A. This is perhaps somewhat
surprising, given that we would have expected the higher-rated PSU to produce more
stable output under sudden changes in the load, but this does not appear to be the case.
In this section, we evaluate the effect of changing the size of the transmitting and
receiving circuits in the source and sink FPGAs, respectively, on the accuracy of
the covert channel. Since each of the T transmitters (with NT ROs each) can be
controlled independently (Fig. 7.3), we first vary the number of simultaneously
enabled transmitters on the KC705-1 board and plot the results across all receiver
boards in Fig. 7.11a. We also change the number of transmitter ROs NT on KC705-
1 with all T transmitters enabled at the same time and plot the results in Fig. 7.11b.
Both experiments show that increasing the number of effective transmitter ROs
T · NT increases the accuracy of the covert channel. This is because the ensuing
voltage drops are more pronounced and can thus be more easily detected by the
receiving boards. However, for the KC705-2 sink board, too much activity on the
transmitter can decrease the accuracy of the channel. This is because although the
magnitude of the voltage drop increases in isolation (Fig. 7.5), the stressor ROs are
also causing a voltage drop that can overshadow that of the source FPGA.
Fig. 7.11 Increasing the number of (a) simultaneously enabled transmitters and (b) transmitter
ROs NT on the KC705-1 source board generally increases the accuracy of the cross-board covert
channel, except for the KC705-2 sink past a certain threshold
Fig. 7.12 Increasing the number of stressor ROs NS on the AC701-2 sink board can decrease
accuracy, as the additional activity can hide external transmissions under the noise floor
We additionally evaluate the effect of changing the number of stressor ROs NS
on the sink AC701-2 board and plot the accuracy of the covert channel in Fig. 7.12.
Consistent with Fig. 7.10, although stressor ROs are necessary to detect covert
transmissions, further increasing NS can have the opposite effect: the voltage drop
caused by the stressors overpowers any effect caused by the source transmissions
and starts pushing the average difference from positive to negative.
Fig. 7.13 Increasing the number of measurements M improves accuracy to any AC701 sink R,
from any FPGA source T
Fig. 7.14 Accuracy for different measurement times and the number of enabled stressors on the
(a) KC705-1 and (b) AC701-1 sinks
[Figure: Accuracy (%) between the transmitter (T) and the receiver (R) Kintex 7 boards
depends on how they are connected to the Power Supply Unit; x-axis: source and sink power
cable location (BB, BL, BR, LB, LR, RB, RL); legend: Bottom-Left, Bottom-Right, Left-Bottom,
Left-Right, Right-Bottom, Right-Left]
[Figure: Accuracy (%) between the KC705 boards is consistently high for all types of source
ROs tested (RO Type: FF, LD, LUT); x-axis: FPGA boards, KC705-1→KC705-2 and
KC705-2→KC705-1]
least accurate between the single location on the bottom of the PSU and either of
the dual outputs. Finally, it should be noted that the recovered pattern is flipped in
all setups, except when sharing the cable on the bottom output.
[Figure: Accuracy (%) on the KC705-2 sink using different receiver and stressor ROs also
remains high (RO Type: FF, LD, LUT); x-axis: receiver RO type, FF, LD, LUT]
Fig. 7.19 CPU-to-FPGA accuracy for the four FPGA sink boards on both PSUs for different
numbers of CPU threads used as transmitters. As PSU-A powers a CPU with only 4 threads, no
more than 4 threads can be dispatched for testing
Table 7.5 Maximum accuracy of transmissions from a CPU source to the four FPGA sinks on the
two PSU and PC setups, along with the parameters for which the accuracy is achieved

PSU  Parameter               AC701-1     AC701-2     KC705-1     KC705-2
A    Accuracy                95%         97%         95%         86%
A    Bandwidth               6.1 b s^−1  6.1 b s^−1  0.8 b s^−1  0.8 b s^−1
A    # of Threads            10          14          11          23
A    # of Enabled Stressors  1           1           4           4
A    # of Measurements       500         500         500         500
A    Measurement Cycles      2^15        2^15        2^18        2^18
for PSU-A. This parallels our cross-FPGA results of Sect. 7.5 and indicates that
PSU-B is generally more prone to covert communication. The maximum accuracy
achieved, the number of CPU threads used, and other experimental parameters are
summarized in Table 7.5.
Table 7.7 Maximum accuracy of transmissions from a GPU source to the four FPGA sinks on the
two PSU and PC setups, along with the parameters for which the accuracy is achieved

PSU  Parameter               AC701-1     AC701-2     KC705-1      KC705-2
A    Accuracy                76%         70%         94%          89%
B    Accuracy                97%         87%         96%          †100%
A&B  Bandwidth               2.0 b s^−1  2.0 b s^−1  0.03 b s^−1  0.03 b s^−1
A&B  # of Enabled Stressors  1           1           5            5
A&B  # of Measurements       1500        1500        1500         1500
A&B  Measurement Cycles      2^15        2^15        2^21         2^21

† signifies that the recovered bit pattern is flipped
for all boards to 1500, reducing bandwidth by a factor of 3×. These parameters
and the corresponding results are summarized in Table 7.7. As in the CPU case, 3
seconds of delay are added before and after the program to allow usage to
return to normal.
Figure 7.20 plots the results of our experiments for the four boards on both GPUs.
We find that it is possible to create a communication channel to all four boards, on
both PSUs. As expected, since there are fewer GPU cores attached to PSU-A, the
covert channel is weaker, but the accuracy is over 95% for three of the four boards
when using the larger GPU attached to PSU-B. Moreover, we notice that
the AC701 boards are worse sinks than the KC705 boards. Although this pattern
is not entirely identical across the three communication channels (FPGA-to-FPGA,
CPU-to-FPGA, and GPU-to-FPGA), it broadly remains consistent, potentially due
to differences in the voltage regulators themselves or other aspects of board
design and component tolerances.
[Fig. 7.20: Accuracy (%) for the four sink boards on both PSU computers and PSUs (A and B);
x-axis: receiver board (AC701-1, AC701-2, KC705-1, KC705-2)]
7.7 Discussion
In this section, we discuss how practical the covert channels we introduced are
(Sect. 7.7.1) and propose some software- and hardware-level countermeasures to
mitigate the impact of the information leakage (Sect. 7.7.2).
We evaluate two aspects of the practicality of our communication scheme in this
section. The first is how costly transmissions are in terms of
resources used on the FPGA boards. The amount of logic instantiated is moderate,
but not negligible. On the transmitting end, G · T · NT lookup tables (LUTs) are
used, where G = 4 is the number of ring oscillator stages. In particular, the source
design (including the UART and other logic) utilizes 16.6% of LUT resources on
the Artix 7 FPGA chip. Similarly, the sink design uses G · (R · NR + S · NS) LUTs for
the receiver and stressor ROs, and L · R · NR registers for counting, where L = 32
is the length of the counters. Only 7.8% of the Artix 7 resources are used in this
case, a number that can be reduced to 3.4%, as the AC701 boards only enable one
stressor for higher accuracy.
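As a sanity check, the sketch below evaluates these formulas with the stated Artix 7 parameters. R·NR = 20 (the 20 receiver ROs) and S·NS = 500 (from NS = 500 in Table 7.3) are assumptions made here; the counts cover only the ROs and counters, not the UART or control logic included in the quoted utilization percentages.

```python
# RO-related resource usage from the formulas in the text (Artix 7 source).
G, L = 4, 32          # RO stages per oscillator, counter bit width
T, N_T = 10, 1000     # transmitters and ROs per transmitter
R_NR, S_NS = 20, 500  # receiver ROs (R*NR) and stressor ROs (S*NS), assumed

transmitter_luts = G * T * N_T  # LUTs for the transmitter ROs
sink_luts = G * (R_NR + S_NS)   # LUTs for receiver + stressor ROs
counter_regs = L * R_NR         # registers for the RO counters
print(transmitter_luts, sink_luts, counter_regs)  # 40000 2080 640
```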
The second aspect is the channel capacity, which lies between that of thermal
attacks, which can transmit under 15 bits in an hour [14, 38], and power attacks
within CPUs, which can transfer between 20 and 120 bits per second [1, 19].
Although the Kintex 7 boards were shown to be better sinks (often with 0% error
rate), the Artix 7 boards were faster by a factor of 7.6× (6.1 b s^−1 vs. 0.8 b s^−1).
This difference is significant in practice: Table 7.8 shows how long it would take
to transmit keys for different popular cryptographic algorithms. Even assuming that
the channel is not noisy, it would take almost 45 minutes to transfer a 256-bit AES
key to a KC705 board, and 3 hours to transfer a 1024-bit RSA key. However, the
AC701 board would need less than 3 minutes to transfer the same RSA key, despite
the potential drop in accuracy.
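These transfer times follow directly from the Table 7.3 bandwidths, assuming a noiseless channel; the calculation below is illustrative.

```python
# Time to transmit a key over the covert channel at a given bandwidth,
# assuming a noiseless channel.
def transfer_time_s(bits, bandwidth_bps):
    return bits / bandwidth_bps

print(round(transfer_time_s(256, 0.1) / 60))       # ~43 min: AES-256, Kintex 7
print(round(transfer_time_s(1024, 0.1) / 3600, 1)) # ~2.8 h: RSA-1024, Kintex 7
print(round(transfer_time_s(1024, 6.1) / 60, 1))   # ~2.8 min: RSA-1024, Artix 7
```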
To increase accuracy, one can either tweak the parameters of the source and sink
FPGA designs (including the number of measurements M over which RO counts
are averaged) or instead change the communication scheme itself. For example, a
3-repetition code decreases bandwidth by a factor of 3, but also lowers the error rate
e to 3e^2 − 2e^3: a 10% error rate is reduced to under 3%. The channel capacity is
1 − H(e) = 1 + e log2 e + (1 − e) log2(1 − e), and for smaller bitflip probabilities, other
error-correcting codes such as Hamming and Golay codes can be used to improve
accuracy.
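Both formulas are easy to check numerically; the sketch below evaluates the 3-repetition error rate and the binary symmetric channel capacity at e = 10%.

```python
import math

# Error rate of a 3-repetition code: a decoded bit flips iff at least 2 of
# the 3 copies flip, i.e. 3e^2(1-e) + e^3 = 3e^2 - 2e^3.
def rep3_error(e):
    return 3 * e**2 - 2 * e**3

# Binary symmetric channel capacity 1 - H(e) in bits per channel use.
def capacity(e):
    if e in (0.0, 1.0):
        return 1.0
    return 1 + e * math.log2(e) + (1 - e) * math.log2(1 - e)

print(round(rep3_error(0.10), 3))  # 0.028: a 10% error rate drops under 3%
print(round(capacity(0.10), 3))    # 0.531 bits per channel use
```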
This section summarizes prior work in remote FPGA attacks without physical
access to the boards (Sect. 7.8.1), as well as voltage- and temperature-based covert
channels (Sect. 7.8.2).
So far, there have only been a few works that consider remote attacks in the
single-tenant setting. One such attack by Tian and Szefer introduced a temporal
thermal channel, where different users receive time-shared access to the same FPGA
fabric [38]. A different attack by Schellenberg et al. considered cross-chip side-
channel attacks to recover RSA keys [33]. However, the chips were located on the
same FPGA board, which is explicitly "designed for external side-channel analysis
research" [33], and hence shared the same voltage regulator, making them easier to
influence directly, due to the lack of additional intermediate components between
their power distribution networks.
7.9 Conclusion
References
1. Alagappan, M., Rajendran, J., Doroslovački, M., & Venkataramani, G. (2017). DFS covert
channels on multi-core platforms. In IFIP/IEEE international conference on very large scale
integration (VLSI-SoC).
2. Amazon Web Services (2021). AWS EC2 FPGA HDK+SDK errata. https://github.com/aws/
aws-fpga/blob/master/ERRATA.md. Accessed: 2023-05-21.
3. Bartolini, D. B., Miedl, P., & Thiele, L. (2016). On the capacity of thermal covert channels in
multicores. In European conference on computer systems (EuroSys).
4. Boutros, A., Hall, M., Papernot, N., & Betz, V. (2020). Neighbors from hell: Voltage attacks
against deep learning accelerators on multi-tenant FPGAs. In International conference on field-
programmable technology (FPT).
5. Corsair (2010). Professional series Gold AX850–80 PLUS Gold certified fully-modular power
supply. https://www.corsair.com/p/CMPSU-850AX. Accessed: 2023-05-21.
6. De Cnudde, T., Ender, M., & Moradi, A. (2018). Hardware masking, revisited. IACR
Transactions on Cryptographic Hardware and Embedded Systems (TCHES), 2018(2), 123–148.
7. Gao, X., Xu, Z., Wang, H., Li, L., & Wang, X. (2018). Reduced cooling redundancy: A
new security vulnerability in a hot data center. In Network and distributed system security
symposium (NDSS).
8. Giechaskiel, I., Rasmussen, K. B., & Eguro, K. (2022). Long-wire leakage: The threat of
crosstalk. IEEE Design and Test (D&T), 39(4), 41–48.
9. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Measuring long wire leakage with
ring oscillators in cloud FPGAs. In International conference on field programmable logic and
applications (FPL).
10. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Reading between the dies: Cross-SLR
covert channels on multi-tenant cloud FPGAs. In IEEE international conference on computer
design (ICCD).
11. Giechaskiel, I. & Szefer, J. (2020). Information leakage from FPGA routing and logic elements.
In International conference on computer-aided design (ICCAD).
12. Glamočanin, O., Mahmoud, D., Regazzoni, F., & Stojilović, M. (2021). Shared FPGAs and the
holy grail: Protections against side-channel and fault attacks. In Design, automation & test in
Europe (DATE).
13. Gobulukoglu, M., Drewes, C., Hunter, W., Kastner, R., & Richmond, D. (2021). Classifying
computations on multi-tenant FPGAs. In Design automation conference (DAC).
14. Guri, M., Monitz, M., Mirski, Y., & Elovici, Y. (2015). BitWhisper: Covert signaling channel
between air-gapped computers using thermal manipulations. In IEEE computer security
foundations symposium (CSF).
15. Guri, M., Zadov, B., Bykhovsky, D., & Elovici, Y. (2020). PowerHammer: Exfiltrating data
from air-gapped computers through power lines. IEEE Transactions on Information Forensics
and Security (TIFS), 15, 1879–1890.
16. Hajimiri, A., Limotyrakis, S., & Lee, T. H. (1999). Jitter and phase noise in ring oscillators.
IEEE Journal of Solid-State Circuits (JSSC), 34(6), 790–804.
17. Islam, M. A., & Ren, S. (2018). Ohm’s law in data centers: A voltage side channel for timing
power attacks. In ACM conference on computer and communications security (CCS).
18. Islam, M. A., Ren, S., & Wierman, A. (2017). Exploiting a thermal side channel for power
attacks in multi-tenant data centers. In ACM conference on computer and communications
security (CCS).
19. Khatamifard, S. K., Wang, L., Das, A., Köse, S., & Karpuzcu, U. R. (2019). POWERT chan-
nels: A novel class of covert communication exploiting power management vulnerabilities. In
IEEE international symposium on high-performance computer architecture (HPCA).
20. Kocher, P., Jaffe, J., Jun, B., & Rohatgi, P. (2011). Introduction to differential power analysis.
Journal of Cryptographic Engineering, 1(1), 5–27.
21. La, T., Pham, K., Powell, J., & Koch, D. (2021). Denial-of-service on FPGA-based cloud
infrastructures: Attack and defense. IACR Transactions on Cryptographic Hardware and
Embedded Systems (TCHES), 2021(3), 441–464.
22. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 13(3), 1–31.
23. Le Masle, A., & Luk, W. (2012). Detecting power attacks on reconfigurable hardware. In
International conference on field programmable logic and applications (FPL).
24. Luo, Y., Gongye, C., Fei, Y., & Xu, X. (2021). DeepStrike: Remotely-guided fault injection
attacks on DNN accelerator in cloud-FPGA. In Design automation conference (DAC).
25. Luo, Y. & Xu, X. (2020). A quantitative defense framework against power attacks on multi-
tenant FPGA. In International conference on computer-aided design (ICCAD).
26. Mahmoud, D., Hussein, S., Lenders, V., & Stojilović, M. (2022). FPGA-to-CPU undervolting
attacks. In Design, automation and test in Europe (DATE).
27. Mahmoud, D., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant FPGAs.
In Design, automation and test in Europe (DATE).
28. Masti, R. J., Rai, D., Ranganathan, A., Müller, C., Thiele, L., & Čapkun, S. (2015). Thermal
covert channels on multi-core platforms. In USENIX security symposium.
29. Moini, S., Tian, S., Holcomb, D., Szefer, J., & Tessier, R. (2021). Remote power side-channel
attacks on BNN accelerators in FPGAs. In Design, automation and test in Europe (DATE).
30. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power distribution attacks in multi-tenant
FPGAs. IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 28(12), 2685–
2698.
31. Provelengios, G., Holcomb, D., & Tessier, R. (2021). Mitigating voltage attacks in multi-tenant
FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 14(2), 1–24.
32. Rakin, A. S., Luo, Y., Xu, X., & Fan, D. (2021). Deep-Dup: An adversarial weight duplication
attack framework to crush deep neural network in multi-tenant FPGA. In USENIX security
symposium.
33. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). Remote Inter-chip
power analysis side-channel attacks at board-level. In International conference on computer-
aided design (ICCAD).
34. Spolaor, R., Abudahi, L., Moonsamy, V., Conti, M., & Poovendran, R. (2017). No free charge
theorem: A covert channel via USB charging cable on mobile devices. In Applied cryptography
and network security (ACNS).
35. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters, 15(11),
640–642.
36. Tang, A., Sethumadhavan, S., & Stolfo, S. (2017). CLKSCREW: Exposing the perils of
security-oblivious energy management. In USENIX security symposium.
37. Tian, S., Moini, S., Wolnikowski, A., Holcomb, D., Tessier, R., & Szefer, J. (2021). Remote
power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In IEEE international
symposium on field-programmable custom computing machines (FCCM).
38. Tian, S., & Szefer, J. (2019). Temporal thermal covert channels in cloud FPGAs. In
ACM/SIGDA international symposium on field-programmable gate arrays (FPGA).
39. Timonen, V. (2020). Multi-GPU CUDA stress test. http://wili.cc/blog/gpu-burn.html.
Accessed: 2023-05-21.
40. Waterland, A. P. (2014). Stress. https://web.archive.org/web/20190502184531/https://people.
seas.harvard.edu/~apw/stress/. Accessed: 2023-05-21.
41. Xilinx, Inc. (2012). 7 Series Product Brief. https://www.xilinx.com/publications/prod_mktg/7-
Series-Product-Brief.pdf. Accessed: 2023-05-21.
42. Xilinx, Inc. (2019). AC701 evaluation board for the Artix-7 FPGA (UG952). https://www.
xilinx.com/support/documentation/boards_and_kits/ac701/ug952-ac701-a7-eval-bd.pdf.
Accessed: 2023-05-21.
43. Xilinx, Inc. (2019). KC705 evaluation board for the Kintex-7 FPGA (UG810). https://
www.xilinx.com/support/documentation/boards_and_kits/kc705/ug810_KC705_Eval_Bd.
pdf. Accessed: 2023-05-21.
44. Xilinx, Inc. (2020). 7 series FPGAs data sheet: Overview (DS180). https://www.xilinx.com/
support/documentation/data_sheets/ds180_7Series_Overview.pdf. Accessed: 2023-05-21.
45. Zhang, Y., Yasaei, R., Chen, H., Li, Z., & Al Faruque, M. A. (2021). Stealing neural network
structure through remote FPGA side-channel analysis. IEEE Transactions on Information
Forensics and Security (TIFS), 16, 4377–4388.
46. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In IEEE
symposium on security and privacy (S&P).
47. Zick, K. M., Srivastav, M., Zhang, W., & French, M. (2013). Sensing nanosecond-scale voltage
attacks and natural transients in FPGAs. In ACM/SIGDA international symposium on field-
programmable gate arrays (FPGA).
Chapter 8
Microarchitectural Vulnerabilities
Introduced, Exploited, and Accelerated
by Heterogeneous FPGA-CPU Platforms
8.1 Introduction
T. Tiemann · T. Eisenbarth
Universität zu Lübeck, Lübeck, Germany
e-mail: t.tiemann@uni-luebeck.de; thomas.eisenbarth@uni-luebeck.de
Z. Weissman () · B. Sunar
Worcester Polytechnic Institute, Worcester, MA, USA
e-mail: zweissman@wpi.edu; sunar@wpi.edu
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 203
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_8
204 T. Tiemann et al.
8.2 Background
This section provides background knowledge about cache attack techniques as well
as the Rowhammer effect. It also explains the math behind the Chinese Remainder
Theorem (CRT) commonly used to implement the Rivest-Shamir-Adleman (RSA)
cryptosystem [4].
Cache attacks targeting a variety of applications have been proposed [5, 10, 21, 23, 45, 54].
In general, cache attacks use the timing of cache behaviors to leak information.
Modern cache systems use a hierarchical architecture that includes smaller, faster
caches and bigger, slower caches. Measuring the latency of a memory access
often reveals which levels of the cache contain a certain memory
address (or whether the memory is cached at all). Many modern cache subsystems also
support coherency, which ensures that whenever memory is overwritten in one
cache, copies of that memory in other caches are either updated or invalidated.
Cache coherency may allow an attacker to learn about a cache line that is not
even directly accessible [34]. Cache attacks have become a major focus of security
research in cloud computing platforms where users are allocated CPUs, cores, or
virtual machines which, in theory, should offer perfect isolation, but in practice may
leak information to each other via shared caches [26]. An introduction to various
cache attack techniques is given below.
A Flush+Reload (F+R) attack [61] has three steps: (1) the attacker uses the
clflush instruction to flush the cache line that is to be monitored. After flushing
8 Microarchitectural Vulnerabilities of FPGA-CPU Platforms 205
this cache line, (2) they wait for the victim to execute. Later, (3) they reload the
flushed line and measure the reload latency. If the latency is low, the cache line was
served from the cache hierarchy, so the cache line was accessed by the victim. If
the access latency is high, the cache line is loaded from main memory, meaning that
the victim did not access it. F+R can work across cores and even across sockets, as
long as the Last-Level Cache (LLC) is coherent, as is the case with many modern
multi-CPU systems. Flush+Flush (F+F) [20] is similar to F+R, but the third step is
different: the attacker flushes the cache line again and measures the execution time
of the flush instruction instead of the memory access.
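The classification in step (3) reduces to comparing the reload latency against a calibrated hit/miss threshold. The sketch below models only that decision logic; the threshold and latencies are hypothetical values, and a real attack would time actual reloads with a cycle counter such as rdtsc.

```python
# Illustrative model of Flush+Reload's step (3): classify a reload latency
# as a cache hit (victim accessed the line) or a miss (served from DRAM).
HIT_THRESHOLD_CYCLES = 100  # hypothetical calibration value

def victim_accessed(reload_latency_cycles):
    # Low latency: the line was served from the cache hierarchy, so the
    # victim must have brought it back in after our clflush.
    return reload_latency_cycles < HIT_THRESHOLD_CYCLES

print(victim_accessed(60))   # True: fast reload, likely a cache hit
print(victim_accessed(300))  # False: slow reload, victim did not access it
```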
Orthogonal to F+R, if the attacker does not have access to an instruction to flush
a cache line, he/she can instead evict the desired cache line by accessing cache lines
that form an eviction set in an Evict+Reload (E+R) [39] attack. Eviction sets are
described shortly. E+R can be used if the attacker shares the same CPU socket (but
not necessarily the same core) as the victim and if the LLC is inclusive.1 F+R, F+F,
and E+R are limited to shared memory scenarios, where the victim and attacker
share data or instructions, e.g., when memory de-duplication is enabled.
Prime+Probe (P+P) gives the attacker less temporal resolution than the aforemen-
tioned methods since the attacker checks the status of the cache by probing a whole
cache set rather than flushing or reloading a single line. However, this resolution
is sufficient in many cases [3, 39, 43, 45, 46, 49, 64]. P+P has three steps: (1) the
attacker primes the cache set under surveillance with dummy data by accessing
a proper eviction set, (2) he/she waits for the victim to execute, and (3) he/she
accesses the eviction set again and measures the access latency (probing). If the
latency is above a certain threshold, some parts of the eviction set were evicted by
the victim process, meaning that the victim accessed cache lines belonging to the
cache set under surveillance [41]. Unlike F+R, E+R, and F+F, P+P does not rely
on shared memory. However, it is noisier, only works if the victim is located on
the same socket as the attacker, and relies on inclusive caches. An alternative attack
against non-inclusive caches is to target the cache directory structure [60].
In scenarios where the attacker cannot probe the target cache set or line but can
still influence the target cache line, an Evict+Time (E+T) attack is still possible,
depending on the target application. In an E+T attack, the attacker only evicts the
victim's cache line and measures the aggregate execution time of the victim's
operation, hoping to observe a correlation between the execution time of an operation,
such as a cryptographic routine, and the cache access pattern.
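The probe step of Prime+Probe likewise reduces to a threshold decision, but over the whole eviction set rather than a single line. The sketch below models that aggregate comparison; the threshold and latencies are hypothetical, and a real attack times each access with a cycle counter.

```python
# Illustrative model of Prime+Probe's probe step: if the victim evicted any
# of our primed lines, some probe accesses miss and the aggregate latency
# rises above a calibrated threshold.
PROBE_THRESHOLD = 800  # hypothetical aggregate-latency threshold (cycles)

def victim_touched_set(probe_latencies):
    return sum(probe_latencies) > PROBE_THRESHOLD

print(victim_touched_set([60] * 12))               # False: all lines cached
print(victim_touched_set([60] * 10 + [300, 300]))  # True: two lines evicted
```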
Caches store data in units of cache lines that can hold 2^b bytes each (64 bytes on
many Intel CPUs, including most Core, Xeon, and Atom architectures). Caches
1 A lower level cache is called inclusive of a higher level cache if all cache lines present in the
higher level cache are always present in the lower level cache.
are divided into .2s sets, each capable of holding w cache lines. w is called the
way-ness or associativity of the cache. An eviction set is a set of congruent cache
line addresses capable of filling a whole cache set. Two cache lines are considered
congruent if they belong to the same cache set. Memory addresses are mapped to
cache sets depending on the s bits of the physical memory address directly following
the b cache line offset bits, which are the least significant bits. Additionally, some
caches are divided into n slices, where n is the number of CPU cores. In the
presence of slices, each slice has 2^s sets with w ways each. Previous work
has reverse-engineered the mapping of physical address bits to cache slices on some
Intel processors [33]. A minimal eviction set contains w addresses and therefore fills
an entire cache set when accessed.
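As a concrete sketch, the address-to-set mapping and a minimal eviction set can be written out in a few lines; the parameters b, s, and w below are illustrative, and cache slicing is ignored for simplicity.

```python
B, S, W = 6, 10, 8   # illustrative: 64-byte lines, 1024 sets, 8 ways

def cache_set(paddr):
    """The set index is the s bits directly above the b line-offset bits."""
    return (paddr >> B) & ((1 << S) - 1)

def congruent(a, b):
    """Congruent addresses map to the same cache set."""
    return cache_set(a) == cache_set(b)

def minimal_eviction_set(target):
    """w congruent addresses suffice to fill (and thus evict) the target's set."""
    stride = 1 << (B + S)             # adding the stride preserves the set index
    base = target & ~((1 << B) - 1)   # align to the start of the cache line
    return [base + i * stride for i in range(W)]

evset = minimal_eviction_set(0xDEADBEEF)
assert len(evset) == W and all(congruent(a, 0xDEADBEEF) for a in evset)
```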
8.2.2 Rowhammer
DRAM cells discharge over time, and the memory controller has to refresh the
cells to avoid accidental data corruption. Generally, DRAM cells are laid out in
banks and rows, and each row within a bank has two adjacent rows, one on either
side. In a Rowhammer attack, memory addresses in the same bank as the target
memory address are accessed in quick succession. When memory adjacent to the
target is accessed repeatedly, the electrostatic interference generated by the physical
process of accessing the memory can accelerate the discharge of bits stored in the
target memory. A “single-sided” Rowhammer attack performs accesses to just one of
these rows to generate bit flips in the target row; a “double-sided” Rowhammer attack
performs accesses to both adjacent rows and is generally more effective in producing
bit flips.
A Rowhammer attack relies on the ability to find blocks of memory accessible to the
malicious program (or, in this work, hardware) that are in the same memory bank
as a given target address. The standard way of finding these memory addresses is by
exploiting row buffer conflicts as a timing side channel [14]. Pessl et al. [47] reverse-
engineered the bank mapping algorithms of several CPU and DRAM configurations,
which allows an attacker to deterministically calculate all of the physical addresses
that share the same bank if the chipset and memory configuration are known.
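A DRAMA-style bank mapping [47] can be sketched as parity (XOR) functions over selected physical address bits. The masks below are invented for illustration; real masks must be reverse-engineered for each CPU and DRAM configuration.

```python
BANK_MASKS = [0x2040, 0x24000, 0x48000, 0x90000]  # hypothetical XOR masks

def bank(paddr):
    """Each bank-index bit is the parity of the address bits under one mask."""
    idx = 0
    for i, mask in enumerate(BANK_MASKS):
        idx |= (bin(paddr & mask).count("1") & 1) << i
    return idx

def same_bank(a, b):
    """With a known mapping, same-bank addresses can be computed directly
    instead of being found via row-buffer-conflict timing [14]."""
    return bank(a) == bank(b)

assert same_bank(0x0, 0x2040)      # both bits of mask 0x2040 set: parity 0
assert not same_bank(0x0, 0x2000)  # one bit of mask 0x2040 set: parity 1
```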
must all be large for RSA to be secure, which makes the exponentiation rather slow.
However, there is an algebraic shortcut for modular exponentiation: the CRT, used
in many RSA implementations, including OpenSSL [11] and WolfSSL; the latter is
the implementation attacked in Sect. 8.6. The basic form of the RSA-CRT
signature algorithm is shown in Algorithm 1. The execution of the CRT algorithm
8 Microarchitectural Vulnerabilities of FPGA-CPU Platforms 207
is much faster than the computation of m^d mod N because d_p and d_q are of order
p and q, respectively, while d is of order N, which, being the product of p and q,
is significantly greater than p or q. It is around four times faster to compute the two
exponentiations m^{d_p} and m^{d_q} individually than it is to compute m^d outright [4].
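The basic RSA-CRT computation can be sketched with toy parameters; the primes below are far too small for real use, and production code must additionally be constant time.

```python
p, q = 1000003, 999983                 # toy primes
N, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))      # private exponent, of order N
dp, dq = d % (p - 1), d % (q - 1)      # CRT exponents, of order p and q
qinv = pow(q, -1, p)

def sign_crt(m):
    sp = pow(m, dp, p)                 # two half-size exponentiations ...
    sq = pow(m, dq, q)
    h = (qinv * (sp - sq)) % p         # ... recombined via Garner's formula
    return sq + h * q

m = 0xCAFE
assert sign_crt(m) == pow(m, d, N)     # matches the direct m^d mod N
```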
We experiment with two distinct FPGA-CPU platforms with Intel Arria 10 FPGAs:
(1) integrated into the CPU package and (2) Programmable Acceleration Card.
The integrated Intel Arria 10 is based on a prototype E5-2600v4 CPU with 12
physical cores [31]. The CPU has a Broadwell architecture in which the LLC is
inclusive of the L1/L2 caches. The CPU package has an integrated Arria 10 GX
1150 FPGA running at 400 MHz [31]. All measurements on this platform are
performed strictly from user space, as access is provided by Intel through their Intel
Lab (IL) Academic Compute Environment (ACE) [32]. The IL environment also
gives us access to platforms with Programmable Acceleration Cards (PACs) with
Arria 10 GX 1150 FPGA installed and running at 200 MHz. These systems have
Intel Xeon Platinum 8180 CPUs that come with non-inclusive LLCs. We carried out
Rowhammer [19] experiments on a local Dell Optiplex 7010 system with an Intel
i7-3770 CPU and a single DIMM of Samsung M378B5773DH0-CH9 1333 MHz
2 GB DDR3 DRAM equipped with the same Intel PAC running with a primary
clock speed of 200 MHz.2
The Operating System (OS) running in the IL ACE is a 64-bit Red Hat Enterprise
Linux 7 with kernel version 3.10. The Open Programmable Acceleration Engine
(OPAE) was compiled and installed on July 15, 2019 for both the FPGA PAC
and the integrated FPGA platform. We used Quartus 17.1.1 and Quartus 16.0.0
to synthesize hardware designs for the PACs and integrated FPGAs, respectively.
The bitstream version of the non-user-configurable Board Management Controller
(BMC) firmware is 1.1.3 on the FPGA PAC and 5.0.3 on the integrated FPGA. The
OS on the local Optiplex 7010 workstation is Ubuntu 16.04.4 LTS with Linux kernel
4.13.0-36. On this system, we installed the latest stable release of OPAE at the time
of the experiments, 1.3.0, and on its FPGA PAC, we installed the compatible 1.1.3
BMC firmware bitstream.
This section explains the hardware and software interfaces that the Intel Arria 10
GX FPGA platforms use to communicate with their host CPUs and the firmware,
drivers, and architectures that underlie them. Figure 8.1 gives an overview of this
type of architecture.
Intel refers to a single logical unit implemented in FPGA logic and having a
single interface to the CPU as an Accelerator Functional Unit (AFU). The AFU is
an abstraction of the user-configurable logic that can be thought of as analogous to
a program for a processor. Available FPGA platforms only support one AFU per
Partial Reconfiguration Unit (PRU). The FPGA Interface Manager (FIM) is part of
the non-user-configurable portion of the FPGA and contains external interfaces like
memory and network controllers as well as the FPGA Interface Unit (FIU), which
bridges those external interfaces with internal interfaces to the AFU.
2 The PAC is intended to support 400 MHz clock speed, but the current version of the Intel
Acceleration Stack (IAS) has a bug that halves the clock speed.
Fig. 8.1 Overview of the architecture of CPU-FPGA systems based on Intel FPGAs. The software
part of the IAS called the OPAE is highlighted in orange. Applications (yellow) use OPAE’s API
to communicate with the AFU. The green region marks the part of the FPGA that is reconfigurable
from user space at runtime. The blue region shows the static soft core of the FPGA. It exposes the
CCI-P interface to the AFU
(what Intel calls I/O Virtual Addresses (IOVAs)) to physical addresses in the host.
Alongside the FPGA, the PAC contains 8 GB of DDR4, 128 MB of flash memory,
and USB 2.0 for debugging.
An alternative accelerator platform is the Xeon server processor with an
integrated Arria 10 FPGA in the same package [13]. The FPGA and CPU are
closely connected through two PCIe Gen3x8 links and an Ultra Path Interconnect
(UPI) link. UPI is Intel’s high-speed CPU interconnect (replacing its predecessor
QPI) in Skylake and later Intel CPU architectures [44]. The FPGA has a 128 KiB
direct mapped cache that is coherent with the CPU caches over the UPI bus. Like
the PCIe link on the PAC, both the PCIe links and the UPI link use I/O virtual
addressing, appearing as physical addresses to virtualized environments. As the UPI
link bypasses the PCIe controller’s IOMMU, the FIU implements its own IOMMU
and device TLB to translate physical addresses for reads and writes using UPI [28].
Intel’s latest generations of FPGA products are designed for use with the Open
Programmable Acceleration Engine (OPAE) [27] which is part of the Intel Accel-
eration Stack (IAS). OPAE is an open-source, hardware-flexible software stack
for accessing FPGAs. Intel’s compatible FPGAs use the Core Cache Interface
(CCI-P), a hardware host interface for AFUs that specifies transaction requests,
header formats, timing, and memory models [28]. OPAE provides a software
interface for software developers to interact with a hosted FPGA, while CCI-P
provides a hardware interface for hardware developers to interact with a host CPU.
Excluding a few platform-specific hardware features, any CCI-P compatible AFU
should be synthesizable (and the result should be logically identical) for any CCI-P
compatible FPGA platform. OPAE is built on top of hardware- and OS-specific
drivers and as such is compatible with any system with the appropriate drivers. As
described below, the OPAE/CCI-P system provides two main methods for passing
data between the host CPU and the FPGA.
OPAE can send 32- or 64-bit MMIO requests to the AFU directly or it can map an
AFU’s MMIO space to OS virtual memory [27]. CCI-P provides an interface for
incoming MMIO requests and outgoing MMIO read responses. The AFU responds
to read and write requests, although an MMIO read request will time out after 65,536
cycles of the FPGA clock used to access the interface. In software, MMIO offsets
are indicated in bytes and addresses are expected to be multiples of 4 (or 8, for 64-
bit reads and writes). In CCI-P, the last two bits of the address are truncated, since
at least four bytes are used for each read or write transaction. There are 16 available
address bits in CCI-P, so the total available MMIO space is 2^16 32-bit words, or
256 KiB [28].
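The offset convention can be sketched as follows; the function and its name are ours, not part of the OPAE or CCI-P APIs.

```python
MMIO_ADDR_BITS = 16                        # CCI-P provides 2^16 word addresses

def ccip_word_address(byte_offset, width_bits=32):
    """Map a software byte offset to a CCI-P 32-bit-word address."""
    align = 4 if width_bits == 32 else 8
    assert byte_offset % align == 0        # offsets must match the access width
    word = byte_offset >> 2                # the last two address bits are truncated
    assert word < (1 << MMIO_ADDR_BITS)    # 2^16 words = 256 KiB of MMIO space
    return word

assert ccip_word_address(0x8) == 0x2
assert ccip_word_address(0x10, width_bits=64) == 0x4
```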
OPAE can request that the OS allocate a block of memory that can be read by
the FPGA. The memory is allocated in a contiguous physical address space. The
FPGA uses physical addresses to index the shared memory, so physical and virtual
offsets within the shared memory must match. On systems using Intel Virtualization
Technology for Directed I/O (VT-d) [30], which employs the IOMMU to provide
an IOVA to PCIe devices, the OS can allocate memory in contiguous IOVA space
if needed by the peripheral. This approach ensures that the FPGA will see an
accessible and contiguous buffer of the requested size. For buffer sizes up to and
including one 4 KiB memory page, a normal memory page will be allocated to the
calling process by the OS and configured to be accessible by the FPGA with its
IOVA or physical address. For buffer sizes greater than 4 KiB, the OPAE will call
the OS to allocate a 2 MB or 1 GB huge page. Isolating the buffer in a single page
ensures that it will be contiguously allocated in physical memory.
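The allocation policy above can be summarized in a short sketch. The exact cutoff between the 2 MB and 1 GB huge page choices is our assumption, as the text only states that one of the two is used for buffers larger than 4 KiB.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

def backing_page_size(buffer_size):
    """Page size backing a shared buffer (hypothetical policy sketch)."""
    if buffer_size <= 4 * KIB:
        return 4 * KIB   # one normal page, made accessible to the FPGA
    if buffer_size <= 2 * MIB:
        return 2 * MIB   # single huge page: physically contiguous buffer
    return GIB           # assumed cutoff for the 1 GB huge page

assert backing_page_size(100) == 4 * KIB
assert backing_page_size(64 * KIB) == 2 * MIB
assert backing_page_size(16 * MIB) == GIB
```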
architecture for the Arria 10 PAC and the integrated Arria 10 FPGA and highlights
their common features and differences. We set a special focus on the caching
hints that can be used by Intel FPGAs to influence the cache coherency state of
cache lines.
The Arria 10 PAC has access to the CPU’s memory system as well as its own local
DRAM with a separate address space from that of the CPU and its memory. The
PAC’s local DRAM is always directly accessed, without a separate caching system.
When the PAC reads from the CPU’s memory, the CPU’s memory system will serve
the request from its LLC if possible. If the memory that is read or written is not
present in the LLC, the request will be served by the CPU’s main DRAM. The PAC
is unable to place cache lines into the LLC with reads, but writes from the PAC
update the LLC.
The integrated Arria 10 FPGA has access to the host memory. Additionally, it has its
own 128 KiB cache that is kept coherent with the CPU’s caches over UPI. Memory
requests over PCIe take the same path as requests issued by an FPGA PAC. If the
request is routed over UPI, the local coherent FPGA cache is checked first. On a
local cache miss, the request is forwarded to the CPU’s LLC or main memory.
An AFU on the Arria 10 GX can exercise some control over caching behavior by
adding caching hints to memory requests. The available hints are summarized in
Table 8.1. For memory reads, RdLine_I is used to prevent local caching and
RdLine_S is used to cache data locally in a shared state. For memory writes,
WrLine_I is used to prevent local caching on the FPGA, and WrLine_M leaves
written data in the local cache in the modified state. WrPush_I does not cache data
locally but provides hints to the cache controller to cache data in the CPU’s LLC.
The CCI-P documentation lists all caching hints as available for memory requests
over UPI [28]. When sending requests over PCIe, only RdLine_I, WrLine_I,
and WrPush_I can be used while other hints are ignored. However, based on our
experiments, not all cache hints are implemented exactly to specification.
To confirm the behavior of caching hints available for DMA writes, we designed
an AFU that writes a constant string to a configurable memory address via either
UPI or PCIe and using a configurable caching hint. We used the AFU to write a
cache line and afterward timed a read access to the same cache line on the CPU. As
displayed in Fig. 8.2, these experiments confirm that the majority of the cache lines
Table 8.1 Overview of the caching hints configurable over CCI-P on an integrated FPGA. *_I
hints invalidate a cache line in the local cache. Reading with RdLine_S stores the cache line in
the shared state. Writing with WrLine_M caches the line in the modified state

Cache hint   RdLine_I          RdLine_S          WrLine_I          WrLine_M          WrPush_I
Desc.        No FPGA           Leave FPGA        No FPGA           Leave FPGA        Intent to cache
             caching           cache in S state  caching           cache in M state  in LLC
Available    UPI, PCIe         UPI               UPI, PCIe         UPI               UPI, PCIe
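Table 8.1's availability column can be encoded for quick lookup; the dictionary and function here are our encoding, not an Intel API.

```python
AVAILABLE = {
    "RdLine_I": {"UPI", "PCIe"},
    "RdLine_S": {"UPI"},
    "WrLine_I": {"UPI", "PCIe"},
    "WrLine_M": {"UPI"},
    "WrPush_I": {"UPI", "PCIe"},
}

def hint_available(hint, bus):
    """Per the CCI-P documentation [28], only *_I hints are accepted over PCIe."""
    return bus in AVAILABLE[hint]

assert hint_available("RdLine_S", "UPI")
assert not hint_available("WrLine_M", "PCIe")
```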
[Histogram panels for WrLine_I, WrLine_M, and WrPush_I writes over UPI, PCIe0, and
PCIe1; x-axis: access time (CPU clock cycles), 0-400; y-axes: frequency and density.]
Fig. 8.2 Memory access latency histograms and their density for cache lines accessed by the CPU
after being written to by an AFU running on an integrated Arria 10. On an Arria 10 PAC, the
behavior is nearly identical to the PCIe0 behavior in the integrated platform
written by the AFU are placed in the LLC, as access times stay below 100 CPU clock
cycles while main memory accesses take 175 cycles on average.3 This behavior is
independent of the caching hint, the bus, or the platform (PAC, integrated Arria 10).
In fact, we only see data being written to main memory in 21.21% of all runs if the
write request uses WrLine_I or WrPush_I and is sent over UPI. All other writes
are cached in at least 99.5% of all runs. This result is surprising, as the hint meant to
place the data in the cache of the integrated Arria 10 and the hint meant for writing
directly to main memory are either ignored by the FIM and the CPU or not
implemented. Intel later verified4 that the FIM ignores all caching hints that apply
to DMA writes. Instead, the CPU is configured to handle all DMA writes as if the
WrPush_I caching hint is set. The observed LLC caching behavior is likely caused
by Intel’s Data Direct I/O (DDIO), which is enabled by default in Intel Xeon E5 v2
and E7 v2 CPUs. DDIO is meant to give peripherals direct access to the LLC and
thus causes the CPU to cache all memory lines written by the AFU. DDIO restricts
cache access to a subset of ways per cache set, which reduces the attack surface for
Prime+Probe attacks. Nonetheless, attacks against other DDIO-enabled peripherals
are possible [38, 52].
In this section, we present and evaluate a simple AFU for the Arria 10 GX FPGA
that performs Rowhammer against its host CPU’s DRAM as much as two times
faster and four times more effectively than its host CPU. In a Rowhammer attack,
a significant factor in the speed and efficacy is the rate at which memory can be
repeatedly accessed. On many systems, the CPU is sufficiently fast to cause some
bit flips, but an FPGA can repeatedly access its host machine’s memory system
substantially faster than the host machine’s CPU can. Both the CPU and FPGA share
access to the same memory controller, but the CPU must flush the memory after
each access to ensure that the next access reaches DRAM; memory reads from the
FPGA do not affect the CPU cache system, so no time is wasted flushing memory.
We measure the performance of CPU and FPGA Rowhammer implementations
with caching both enabled and disabled and find that disabling caching brings CPU
Rowhammer speed near that of our FPGA Rowhammer implementation. Crucially,
the architectural difference makes it much more difficult for a CPU program to
detect the presence of an FPGA Rowhammer attack than that of a CPU Rowhammer
attack—the FPGA’s memory accesses leave far fewer traces on the CPU.
We now present our design for JackHammer, a Rowhammer AFU for the Arria
10 FPGA. JackHammer supports configuration through the MMIO interface. When
the JackHammer AFU is loaded, the CPU first sets the target physical addresses that
the AFU will repeatedly access. It is recommended that two addresses are set for a
double-sided attack, but if the second address is set to 0, JackHammer will perform
a single-sided attack using just the first address. The CPU must also set the number
of times the targeted addresses are accessed.
When the configuration is set, the CPU signals the AFU to begin issuing the
memory accesses as fast as it can, alternating between the two addresses in a
double-sided attack. Unlike a software implementation of Rowhammer, the accessed
addresses do not need to be flushed from cache—DMA read requests from the
FPGA do not cache the line in the CPU cache, although if the requested value is in
the LLC, the value will be provided to the FPGA by the cache instead of by memory
(see Sect. 8.4.3 for more details on caching behavior). In this attack, the attacker
needs to ensure that the cache lines used for inducing bit flips are not accessed
by the CPU during the attack. The number of remaining accesses can be reread
by the CPU. This is the simplest way for software to check if AFU has finished
sending these accesses. When the last read request has been sent by the AFU, the
total amount of time taken to send all of the requests is recorded.5
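From the software side, the configure-then-poll flow looks roughly like the sketch below. The register offsets and the mmio_read/mmio_write callables are hypothetical stand-ins for OPAE's MMIO API, not actual JackHammer or OPAE names.

```python
# Hypothetical AFU register offsets (illustrative only).
REG_ADDR0, REG_ADDR1, REG_COUNT, REG_REMAINING, REG_GO = 0x20, 0x28, 0x30, 0x38, 0x40

def hammer(mmio_write, mmio_read, target0, target1, count):
    mmio_write(REG_ADDR0, target0)
    mmio_write(REG_ADDR1, target1)       # 0 selects a single-sided attack
    mmio_write(REG_COUNT, count)
    mmio_write(REG_GO, 1)                # AFU starts issuing accesses
    while mmio_read(REG_REMAINING) > 0:  # poll the remaining-access counter
        pass

# Minimal dict-backed fake so the flow can be exercised without hardware;
# this fake "completes" all accesses the moment GO is written.
regs = {}
def fake_write(off, val):
    regs[off] = val
    if off == REG_GO:
        regs[REG_REMAINING] = 0
def fake_read(off):
    return regs.get(off, 0)

hammer(fake_write, fake_read, 0x1000, 0x2000, 10**6)
assert regs[REG_COUNT] == 10**6 and regs[REG_REMAINING] == 0
```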
Figure 8.3 shows a box plot of the 0th, 25th, 50th, 75th, and 100th percentile
of measured “hammering rates” on the Arria 10 FPGA PAC and its host i7-
3770 CPU. Each measurement in these distributions is the average hammering
[Box plot panels: Core i7-3770 (3392 MHz) and Arria 10 PAC (200 MHz, 1 PCIe lane).]
Fig. 8.3 Box plots showing distributions of hammering rates (memory requests or “hammers” per
second) on FPGA PAC and i7-3770. The hammering rate of the FPGA is so consistent as to appear
as a single line. Red crosses indicate outliers
5 The time to send all the requests is not precisely the time to complete all the requests, but it is
very close for sufficiently high numbers of requests. The FPGA has a transaction buffer that holds
up to 64 transactions after they have been sent by the AFU. The buffer does take some time to clear,
but the additional time is negligible for our performance measurements of millions of requests.
[Box plot panels: Core i7-3770 (3392 MHz) and Arria 10 PAC (200 MHz, 1 PCIe lane).]
Fig. 8.4 Box plots of distributions of flip rates on FPGA PAC and i7-3770
rate over a run of 2 billion memory requests. The median hammering rate of our
JackHammer implementation is 81% faster than the median rate of the standard
CPU Rowhammer, and its speed is far more consistent than the CPU’s. The FPGA
can manage an average throughput of one memory request, or “hammer,” every ten
200 MHz FPGA clock cycles (finishing 2 billion hammers in an average of 103.25
seconds); the CPU averages one hammer every 311 3.4 GHz CPU clock cycles
(finishing 2 billion hammers in an average of 183.41 seconds). If our Arria 10 FPGA
were running at its intended frequency of 400 MHz, the hammering rate would still
be bottlenecked by memory speed, and JackHammer would not run significantly
faster.
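The per-request cycle counts quoted above follow directly from the clock frequencies and the average run times, as a quick cross-check:

```python
hammers = 2_000_000_000
fpga_hz, fpga_secs = 200e6, 103.25    # Arria 10 PAC, average run time
cpu_hz, cpu_secs = 3.392e9, 183.41    # i7-3770, average run time

fpga_cycles_per_hammer = fpga_hz * fpga_secs / hammers  # ~10 FPGA cycles
cpu_cycles_per_hammer = cpu_hz * cpu_secs / hammers     # ~311 CPU cycles

assert round(fpga_cycles_per_hammer) == 10
assert round(cpu_cycles_per_hammer) == 311
```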
Figure 8.4 shows measured bit flip rates in the victim row for the same
experiment. Runs where zero flips occurred during hardware or software hammering
were excluded from the flip rate distributions, as they are assumed to correspond
with sets of rows that are in the same logical bank, but not directly adjacent to
each other. The increased hammering speed of JackHammer produces a more than
proportional increase in flip rate, which is unsurprising due to the physical rather
than logical nature of Rowhammer faults. As the Rowhammer attack is underway,
electrical charge is drained from capacitors in the victim row. However, the memory
controller also periodically refreshes the charge in the capacitors. When there are
more memory accesses to adjacent rows within each refresh window, it is more
likely that a bit flip occurs before the next refresh. This is why the FPGA’s increased
memory throughput is so much more effective in conducting Rowhammer against
the same DRAM chip.
Another way to look at hammering performance is by counting the total number
of flips produced by a given number of hammers. Figures 8.5 and 8.6 show
distributions of flip counts after various numbers of hammers on the FPGA PAC
and i7 CPU, respectively. The graphs in these figures demonstrate how much more
effectively the FPGA PAC can generate bit flips in the DRAM after the same number
of memory accesses. For hammering attempts that resulted in a nonzero number of
bit flips, the AFU exhibits a wide distribution of flip counts in the range of 200
million to 800 million hammers which then rapidly narrows in the range of 800
million to 1.2 billion and finally levels out by 1.8 billion hammers. This set of
Fig. 8.5 Distributions of total flips after 200 million to 2 billion hammers with JackHammer on
the Arria 10 FPGA PAC. In the upper graph, box plots show quartiles and outliers of flip counts
in flippy rows, that is, rows with nonzero flip counts. The bar graph in the lower axes shows the
portion of rows in the sample that incurred any flips. This is the same portion of rows represented
in the box plots
Fig. 8.6 Distributions of total flips after 200 million to 2 billion hammers with software
Rowhammer on the i7-3770, following the same layout as Fig. 8.5
distributions indicates that “flippable” rows will ultimately reach about 80–120 total
flips after enough hammering, but it can take anywhere from 200 million hammers
(about 10 seconds) to 2 billion hammers (about 100 seconds) to reach that limit.
There are also several rows that only incur a few flips. These samples appear in
a consistent pattern as demonstrated in Fig. 8.7, which plots a portion of the data
used to create Fig. 8.5. Each impulse in this plot represents the number of flips after
a single run of 2 billion hammers on a particular target row. In Fig. 8.7, at indices
23 and 36, two of these outliers are visible, each appearing two indices after several
samples in the standard 80–120 flip range. These outliers could indicate rows that
are affected by hammering nearby rows that are not adjacent.
Fig. 8.7 Time series plotting number of flips on a row-by-row basis, showing examples of the
consistent placement of small-valued but (unlike their immediate neighbors) nonzero outliers:
samples 23 and 36 on this graph. These rows only ever incur a few flips at most and always are
located two rows away from a block of “flippy” rows which incur dozens of flips. By contrast, rows
22 and 24–29, for example, incur no flips at all
The JackHammer AFU designed for the integrated platform is the same as the AFU
for the PAC, except that the integrated platform has access to more physical channels
for memory reads. The PAC has a single PCIe channel; the integrated platform has
one UPI channel and two PCIe channels, as well as an “automatic” setting that lets
the interface manager select a physical channel automatically. Therefore we present
the hammering rates on this platform with two different settings—alternating PCIe
lanes on each access and using the automatic setting.
However, the integrated platform is only available on Intel’s servers, so we have
only been able to test on one DRAM setup and have been unable to get bit flips in
the DRAM.6 The integrated Arria 10 shares the package with a modified Xeon v4-
style CPU. The available servers are equipped with an X99 series motherboard with
64 GB of DDR4 memory. Figure 8.8 shows distributions of measured hammering
rates on the integrated Arria 10 platform. Compared to the Arria 10 PAC, the
integrated Arria 10’s hammering rate is more varied, but with a similar mean rate.
6 There are several reasons why this could be the case. Some DRAM is simply more resistant to
Rowhammer by its physical nature. Error correcting code (ECC) memory is capable of reversing
some memory faults in real time. DDR4 memory, which can be found in this system, sometimes
has hardware features to block Rowhammer style attacks [35]. It is impossible to say whether
the DRAM in this system has any particular defenses in place without access to the hardware
or BIOS. Some methods have been developed to circumvent these protections [15, 18], but for
this work we focus on DDR3, where flips are more reliable and the advantage of the FPGA is
easier to demonstrate.
[Box plot panels: integrated Xeon E5-2600 v4 (3400 MHz) and integrated Arria 10
(400 MHz, 2 PCIe lanes).]
Fig. 8.8 Distributions of hammering rates (memory requests or “hammers” per second) on
integrated Arria 10 and Xeon E5-2600 v4. While the Xeon’s hammering speed varies greatly and
can be faster than the Arria 10’s, the Arria 10 is more consistent and generally hammers faster
[Box plot panels: Arria 10 PAC (200 MHz, 1 PCIe lane) and Xeon E5-2670 v2 (2500 MHz),
each with cacheable and uncacheable memory; y-axis: hammers per second (x10^7).]
Fig. 8.9 Distributions of hammering rates (memory requests or “hammers” per second) with
cacheable and uncacheable memory. Disabling caching significantly speeds up both the Arria 10
and Xeon Rowhammer implementations and brings the speed of the Xeon much closer to the speed
of the Arria 10
hammering rate of the PAC was only 22% faster than that of the CPU. Of course,
memory accesses on modern systems are extremely complex (even with caching
disabled), so there are likely additional factors affecting the difference in hammering
rate. However, our experimental evidence supports our hypothesis that time spent
flushing the cache slows down CPU Rowhammer implementations compared to
FPGA implementations.
Rowhammer has been used for fault injections on cryptographic schemes [6, 7]
and for privilege escalation [18, 51, 55]. Using JackHammer, we demonstrate a
practical fault injection attack from the Arria 10 FPGA to the WolfSSL RSA
implementation running on its host CPU. In the RSA fault injection attack proposed
by Boneh et al. [8], an intermediate value in the Chinese remainder theorem modular
exponentiation algorithm is faulted, causing an invalid signature to be produced.
Similarly, we attack the WolfSSL RSA implementation using JackHammer from
the FPGA PAC and Rowhammer from the host CPU and compare the efficiency of
the two attacks. The increased hammering speed and flip rate of the Arria 10 FPGA
make the attack more practical in the time frame of about 9 RSA signatures.
Figure 8.10 shows the high-level overview of our attack: the WolfSSL RSA
application runs on one core, while a malicious application runs adjacent to it,
assisting the JackHammer AFU on the FPGA in setting up the attack. JackHammer
causes a hardware fault in the main memory, and when the WolfSSL application
reads the faulty memory, it produces a faulty signature and leaks the private factors
used in the RSA scheme.
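Why a single faulty CRT signature leaks a private factor [8] can be sketched with toy numbers: a bit flip in the precomputed d mod (q − 1) leaves the signature correct mod p but wrong mod q, so gcd(s'^e − m, N) reveals p. The parameters below are illustrative only.

```python
from math import gcd

p, q = 1000003, 999983                   # toy primes
N, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))
dp, dq, qinv = d % (p - 1), d % (q - 1), pow(q, -1, p)

def sign_crt(m, dq_used):
    sp = pow(m, dp, p)
    sq = pow(m, dq_used, q)              # the half the fault corrupts
    return sq + ((qinv * (sp - sq)) % p) * q

m = 0xBEEF
assert sign_crt(m, dq) == pow(m, d, N)   # fault-free signature is valid
s_faulty = sign_crt(m, dq ^ (1 << 7))    # one flipped bit in d mod (q - 1)
assert gcd((pow(s_faulty, e, N) - m) % N, N) == p  # factor recovered
```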
In summary, our simplified attack model works as follows: the attacker first
allocates a large block of memory and checks it for conflicting row addresses,
then quickly tests which of those rows can be faulted by hammering with
JackHammer. A list of rows that incur flips is saved so that it can be iterated over.
The program then begins the “attack,” iterating through each row that incurred flips
during the test, and through the sixty-four 1024-bit offsets that make up the row.
During the attack, the JackHammer AFU is instructed to repeatedly access the rows
adjacent to the target row. Meanwhile, in the “victim” program, the targeted data
(the precomputed intermediate value d mod (q − 1)) is copied to the target address,
which is computed as an offset of the targeted row. The victim then enters a loop
where it reads back the data from the target row and uses it as part of an RSA key to
create a signature from a sample message. Additionally, the “attacker” opens a new
thread on the CPU which repeatedly flushes the target row on a given interval. It is
necessary for the attacker to flush the target row because the victim is repeatedly
reading the targeted data and placing it in cache, but the fault will only ever occur
in main memory. For the victim program to read the faulty data from DRAM, there
cannot be an unaffected copy of the same data in cache or the CPU will simply read
that copy. As we show below, the performance of the attack depends significantly
on the time interval between flushes.
One of the typical complications of a Rowhammer fault injection attack is
ensuring that the victim’s data is located in a row that can be hammered. In our
simplified model, we choose the location of the victim data manually within a row
that we have already determined to be one that incurs flips under a Rowhammer
attack so that we may easily test the effectiveness of the attack at various rows
and various offsets within the rows. In a real attack, the location of the victim
program’s memory can be controlled by the attacker with a technique known as
page spraying [51], which is simply allocating a large number of pages and then
deallocating a select few, filling the memory in an attempt to cause the victim
program to allocate the right pages. Improvements in this process can be made;
for example, [6] demonstrated how cache attacks can be used to gather information
about the physical addresses of data being used by the victim process.
The other simplification in our model is that we force the CPU to read from
DRAM using the clflush instruction to flush the targeted memory from cache.
In an end-to-end attack, the attacker would use an eviction set to evict the targeted
memory since it is not directly accessible in the attack process’s address space.
However, the effect is ultimately the same—the targeted data is forcibly removed
from the cache by the attacker.
Table 8.2 Performance of our JackHammer exploit compared to a standard software CPU
Rowhammer with various eviction intervals. JackHammer achieves better performance in many
cases because it bypasses the caching architecture, sending more memory requests during the
eviction interval and causing bit flips at a higher rate

                         Mean signatures to fault         Successful fault rate
Eviction interval (ms)   CPU    JackHammer  % Inc. speed  CPU    JackHammer  % Inc. rate
16                       280    186         51%           0.4%   0.2%        -46%
increasing the DRAM row refresh rate provides significant but not complete defense
against both implementations.
The performance of this fault injection attack is highly dependent on the time
interval between evictions, and as such we present all of our results in this section
as functions of the eviction interval. Each eviction triggers a subsequent reload from
memory when the key is read for the next signature, which refreshes the capacitors
in the DRAM. Whenever DRAM capacitors are refreshed, any accumulated voltage
error in each capacitor (due to Rowhammer or any other physical effect) is either
solidified as a new faulty bit value or reset to a safe and correct value. Too short
an interval between evictions causes the DRAM capacitors to be refreshed too
quickly for bits to flip with high probability. On the other hand, longer intervals
can mean the attack keeps waiting to evict the memory long after a bit flip has
already occurred. It is crucial to note, also, that DRAM capacitors
are automatically refreshed by the memory controller on a 64 ms interval7 [19].
On some systems, this interval is configurable: faster refresh rates reduce the rate
of memory errors, including those induced by Rowhammer, but they can impede
maximum performance because the memory spends more time doing maintenance
refreshes rather than serving read and write requests. For more discussion on
modifying row refresh rates as a defense against Rowhammer, see Sect. 8.8.
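The interplay between refresh rate and attack budget can be made concrete with a small back-of-envelope helper. This is an illustrative sketch only; the per-access latency used below is a placeholder, not a value measured in this chapter.

```c
/* Rough upper bound on how many hammering accesses fit into one eviction
 * interval, given a per-access round-trip latency. Illustrative only. */
long accesses_per_interval(long interval_ns, long access_ns) {
    return interval_ns / access_ns;
}
```

With a hypothetical 750 ns round trip, a full 64 ms refresh window admits roughly 85,000 accesses; halving the refresh interval halves that budget, which is why faster refresh rates blunt Rowhammer.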
In Table 8.2 we present two metrics with which we compare JackHammer and
a standard CPU Rowhammer implementation. This table shows the mean number
of signatures until a faulty signature is produced and the ultimate probability of
success of an attack within 1000 signatures against a random key in a randomly
selected chunk of memory within a row known to be vulnerable to Rowhammer.
With an eviction interval of 96 ms, the JackHammer attack achieves the lowest
average number of signatures before a fault, at only 58, 25% faster than the best
performance of the CPU Rowhammer. The CPU attack is impeded significantly by
shorter eviction latency, while the JackHammer implementation is not, indicating
that on systems where the DRAM row refresh rate has been increased to protect
against memory faults and Rowhammer, JackHammer likely offers substantially
improved attack performance. Figure 8.11 highlights the mean number of signatures
until a faulty signature for the 16 ms to 96 ms range of eviction latency.
7 More specifically, DDR3 and DDR4 specifications indicate 64 ms as the maximum allowable
refresh interval.
Table 8.3 Summary of our cache attacks analysis: OPAE accelerates eviction set construction by
making 2 MB huge pages and physical addresses available to user space
Attacker Target Channel Attack
FPGA PAC AFU CPU LLC PCIe E+T, E+R, P+P
Integrated FPGA AFU CPU LLC UPI E+T, E+R, P+P
Integrated FPGA AFU CPU LLC PCIe E+T, E+R, P+P
Integrated FPGA AFU FPGA Cache CCI-P E+T, E+R, P+P
CPU FPGA Cache UPI F+R, F+F
[Fig. 8.12 plot area: frequency (×10^5, 0–2) over FPGA clock cycles (200 MHz), 130–230; series: LLC, main memory]
Fig. 8.12 Latency histogram for one million PCIe read requests on an FPGA PAC served by the
CPU’s LLC or main memory. The two distinct peaks enable the FPGA to distinguish the two
memory locations
The Intel PAC has access to one PCIe lane that connects it to the main memory
of the system through the CPU’s LLC. The CCI-P documentation [28] mentions a
timing difference for memory requests served by the CPU’s LLC and those served
by the main memory. Using our timer we verified the differences as shown in
Fig. 8.12. Accesses to the LLC take between 139 and 145 cycles, and accesses to
main memory take between 148 and 158 cycles. These access latency distributions
form the basis of cache attacks, as they enable an attacker to tell which part of the
memory subsystem served a particular memory request. Our results indicate that
FPGA-based attackers can precisely distinguish memory responses served by the
LLC from those served by main memory.
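The measured ranges suggest a simple threshold classifier that an FPGA-side attacker could apply to each timed read. The cut-off values below are derived from the ranges quoted above and would need per-platform calibration; this is a sketch, not the chapter's implementation.

```c
/* Classify a timed PCIe read on the FPGA PAC by its latency in FPGA clock
 * cycles (200 MHz). Ranges follow the measurements quoted in the text:
 * LLC 139-145 cycles, main memory 148-158 cycles; the exact boundaries
 * are assumed calibration values. */
typedef enum { SERVED_LLC, SERVED_DRAM, SERVED_UNKNOWN } mem_source;

mem_source classify_latency(int cycles) {
    if (cycles >= 130 && cycles <= 146) return SERVED_LLC;
    if (cycles >= 147 && cycles <= 165) return SERVED_DRAM;
    return SERVED_UNKNOWN;  /* outlier, e.g. contention or interrupt */
}
```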
In addition to probing, the cache state must be influenced to perform cache
attacks. We investigated cache interactions offered by the CCI-P interface on an
FPGA PAC and found that cache lines read by the AFU from the main memory
are not cached. While this behavior is not usable for cache attacks, it boosts
Rowhammer performance as we saw in Sect. 8.5. On the other hand, cache lines
written by an AFU on the PAC end up in the LLC with nearly 100% probability.
The reason for this behavior was discussed in Sect. 8.4.3 along with an analysis of
caching hints. This behavior can be used to evict other lines from the cache and
perform eviction-based attacks like Evict+Time, Evict+Reload, and Prime+Probe.
For E+T, DMA writes can be used to evict a cache line, while our hardware timer
measures the victim’s execution time. Even though an AFU cannot load data into
the LLC, E+R can be performed as the purpose of reloading a cache line is to learn
the reload latency. So the primitives for E+R on the FPGA are DMA writes and
timed DMA reads using a hardware timer. P+P can be performed using DMA writes
and timed reads. In the case where DDIO limits the number of accessible ways per
cache set, other DDIO-enabled peripherals are attackable. Flush-based attacks like
Flush+Reload or Flush+Flush cannot be performed by an AFU as CCI-P does not
offer a flush instruction.
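As a sketch of these eviction-based primitives, the probe step of Prime+Probe reduces to checking whether any line of the primed eviction set now misses, i.e., whether any timed DMA read comes back with main-memory latency. The threshold parameter is a hypothetical calibration value, and the timing array stands in for the hardware-timer measurements.

```c
/* Probe step of Prime+Probe: after priming a cache set with DMA writes,
 * time a DMA read to each eviction-set address. A latency at or above the
 * main-memory threshold means one of our lines was evicted, i.e., the
 * victim touched the monitored set. read_cycles holds the measured
 * latencies for the n eviction-set addresses. */
int probe_detects_access(const int *read_cycles, int n, int dram_threshold) {
    for (int i = 0; i < n; i++)
        if (read_cycles[i] >= dram_threshold)
            return 1;  /* at least one miss: victim activity detected */
    return 0;          /* all hits: set undisturbed */
}
```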
The integrated Arria 10 has two PCIe lanes and one UPI lane connecting it to
the CPU’s memory subsystem. It also has its own additional cache on the FPGA
accessible over UPI (recall Sect. 8.4.3 for further details).
By timing memory requests from the AFU using our hardware timer, we show
that distinct delays for the different levels of the memory subsystem exist. Both
PCIe lanes have delays similar to those measured on a PAC (cf. Fig. 8.12). Our
memory access latency measurements for the UPI lane, depicted in Fig. 8.13, show
an additional peak for requests answered by the FPGA’s local cache. The two peaks
for LLC and main memory accesses are likely narrower and further apart than in
the PCIe case because UPI, Intel’s proprietary high-speed processor interconnect, is
[Fig. 8.13 plot area: frequency (0–500) over FPGA clock cycles (400 MHz), 0–200]
Fig. 8.13 Latency histogram for one thousand UPI read requests on an integrated Arria 10 served
by the FPGA’s local cache, CPU’s LLC, or main memory. All peaks are distinct which allows the
FPGA to identify the memory location that served the corresponding read request
8 Microarchitectural Vulnerabilities of FPGA-CPU Platforms 227
an on-chip and inter-CPU bus only connecting CPUs and FPGAs. On all interfaces,
read requests, again, are not usable for evicting cache lines from the LLC. DMA
writes, however, can be used to alter the LLC on the CPU. Because the UPI and
PCIe lanes behave much like the PCIe lane on a PAC, we conclude that the same
attack scenarios (E+T, E+R, P+P) are viable on the integrated Arria 10.
Since an AFU can place data in at least one way per LLC slice, it is possible to
construct a covert channel from the AFU to a co-operating process on the CPU using
side effects of the LLC. To do so, we designed an AFU that writes a fixed string to a
pre-configured cache line whenever a “1” is transmitted and does nothing whenever
a “0” is sent. Using this technique, the AFU sends messages which can be read by
the CPU. For the rest of this section, we will refer to the address to which the AFU
writes as the target address.
The receiver process8 first constructs an eviction set for the set/slice-pair of
the target address. To find an eviction set, we run a slightly modified version of
Algorithm 1 using Test 1 in [56]. Using the OPAE API to allocate 2 MB huge pages
and obtain physical addresses (cf. Sect. 8.4.2.2) allows us to construct the eviction
set from a rather small set of candidate addresses all belonging to the same set.
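The benefit of huge pages and exposed physical addresses can be sketched as follows: with a known physical address, the LLC set index is directly computable, so candidate addresses can be pre-filtered by set before any conflict testing. The geometry below (64-byte lines, 1024 sets per slice) is a typical Intel configuration assumed for illustration; the slice still has to be resolved by conflict testing.

```c
#include <stdint.h>

/* With OPAE exposing physical addresses, the LLC set index of an address
 * is directly computable, shrinking the eviction-set candidate pool.
 * 64-byte lines (6 offset bits) and 1024 sets per slice are typical
 * values, assumed here for illustration. */
#define LINE_BITS      6
#define SETS_PER_SLICE 1024

unsigned llc_set_index(uint64_t paddr) {
    return (unsigned)((paddr >> LINE_BITS) & (SETS_PER_SLICE - 1));
}

/* Two candidates can only be congruent if their set indices match; the
 * slice must still be determined by testing for actual conflicts. */
int same_set(uint64_t a, uint64_t b) {
    return llc_set_index(a) == llc_set_index(b);
}
```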
We construct the covert channel on the integrated platform as the LLC of the
CPU is inclusive. Additionally, the receiver has access to the target address via
shared memory and can test its eviction set against the target address directly.
This way, we do not need to explicitly identify the target address's
LLC slice. In a real-world scenario, either the slice selection function has to be
known [24, 25, 33] or eviction sets for all slices have to be constructed by seeking
conflicting addresses [41, 45]. The time penalty introduced by monitoring all cache
sets can be prevented by multi-threading.
Next, the receiver primes the LLC with the identified eviction set and probes the
set in an endless loop. Whenever the execution time of a probe is above a certain
threshold, the receiver assumes that the eviction of one of its eviction set addresses
was the result of the AFU writing to the target address and therefore interprets
this as receiving a “1.” If the probe execution time stays below the threshold, a
“0” is detected as no eviction of the eviction set addresses occurred. An example
measurement of the receiver and its decoding steps are depicted in Fig. 8.14.
To ease the decoding and visualization of the results, the AFU sends every
bit thrice and the CPU uses six probes to detect all three repetitions. This high
level of redundancy comes at the expense of speed, as we achieve a bandwidth of
about 94.98 kBit/s, which is low when compared to other work [41, 42, 58]. The
throughput can be increased by reducing the three redundant writes per bit from
the AFU as well as by increasing the transmission frequency further to reduce
8 This process is not the software process directly communicating with the AFU over OPAE/CCI-P.
[Fig. 8.14 plot area: probe latency trace (≈1,800–2,000 cycles) with classification, decoding to the bit string 01001011010010110100101101001011010]
Fig. 8.14 Covert channel measurements and decoding. The AFU sends each bit three times, which
results in three peaks at the receiver if a “1” is transmitted
the redundant CPU probes per AFU write. Also, multiple cache sets can be used
in parallel to encode several bits at once. The synchronization problem can be
solved by using one cache set as the clock, where the AFU writes an alternating
bit pattern [52]. An average probe on the CPU takes 1855 clock cycles. The CPU
operating in the range of 2.8–3.4 GHz results in a throughput of 1.5–1.8 MBit/s.
The AFU can on average send one write request every 10 clock cycles without
filling the CCI-P PCIe buffer and thereby losing the write pattern. In theory, this
makes the AFU capable of sending 40 MBit/s over the covert channel when clocked
at 400 MHz.9
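The receiver's decoding step can be sketched as a vote over the six probes that cover one transmitted bit. The eviction threshold and the fixed framing of six probes per bit are assumptions for illustration; the real receiver also has to handle synchronization as discussed above.

```c
/* Decode one covert-channel bit: the AFU writes each bit three times and
 * the CPU samples it with six probes. If at least half of the six probe
 * times exceed the eviction threshold, decode a "1"; otherwise a "0".
 * The threshold is a hypothetical calibration value. */
int decode_bit(const int *probe_cycles, int threshold) {
    int hits = 0;
    for (int i = 0; i < 6; i++)
        if (probe_cycles[i] > threshold)
            hits++;  /* slow probe: one of our lines was evicted */
    return hits >= 3;
}
```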
Even though caching hints for memory writes are ignored by the FIU, an
AFU can place data in the LLC because the CPU is configured to handle write
requests as if WrPush_I is set, allowing for evictions in the LLC. We corroborated
our findings by establishing a covert channel between the AFU and the CPU with
a bandwidth of 94.98 kBit/s. By exposing physical addresses to the user and by
enabling 2 MB huge pages, OPAE further eases eviction set determination from user
space.
We also investigated the CPU’s capabilities to run cache attacks against the coherent
cache on the integrated Arria 10 FPGA. First, we measured the memory access
latency depending on the location of the address accessed using the rdtsc
9 This is a worst-case scenario where every transmitted bit is a “1”-bit. For a random message, this
estimate increases as “0”-bits do not fill the buffer, allowing for faster transmission.
[Fig. 8.15 plot area: frequency (0–800) over CPU clock cycles (0–400); series: LLC, FPGA cache, main memory]
Fig. 8.15 Memory access latency histogram on the CPU with data being present in FPGA local
cache, CPU LLC, or main memory. All memory locations show unique access latency. To our
surprise, responses from the FPGA cache are slower than those originating from the main memory
instruction. The results in Fig. 8.15 show that the CPU can clearly distinguish where
an accessed address is located. Therefore, the CPU is capable of probing a memory
address that may or may not be present in the local FPGA cache. It is interesting
to note that requests to main memory return faster than those going to the FPGA
cache. This can be explained by the much slower clock speed of the FPGA running
at 400 MHz while the CPU operates at 1.2–3.4 GHz. As nearly all known cache
attack techniques rely on some form of probing, the capability to distinguish data
location is a good step in the direction of having a fully working cache attack that
originates from the CPU and targets the FPGA cache.
Besides probing the FPGA cache, we also need a way of flushing, priming, or
evicting cache lines to put the FPGA cache into a known state. While the AFU can
control which data is cached locally by using caching hints, there is no such option
documented for the CPU. Therefore, priming the FPGA cache to evict cache lines
is not possible. This disables all eviction-based cache attacks. However, as the CPU
has a clflush instruction, we can use it to flush cache lines from the FPGA cache,
because it is coherent with the LLC. Hence, we can flush and probe cache lines
located in the FPGA cache. This enables us to run a Flush+Reload attack against the
victim AFU where the addresses used by the AFU get flushed before the execution
of the AFU. After the execution, the attacker then probes all previously flushed
addresses to learn which addresses were used during the AFU execution. Another
possible cache attack is the more efficient Flush+Flush attack. We expect this attack
to be more precise as flushing a cache line that is present in the FPGA cache takes
about 500 CPU clock cycles longer than flushing a cache line that is not present (cf.
Fig. 8.16), while the latency difference between memory and FPGA cache accesses
adds up to only about 50–70 CPU clock cycles.
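Because flushing a resident line takes about 500 cycles longer, decoding a Flush+Flush trace reduces to thresholding the clflush execution times slot by slot. The threshold below is an illustrative calibration value, not a number from the chapter.

```c
/* Flush+Flush trace decoding: for each time slot, a clflush that runs
 * longer than the threshold indicates the line was resident in the FPGA
 * cache, i.e., the victim AFU accessed it in that slot. accessed is an
 * output array of n entries; the return value counts the hit slots. */
int decode_flush_trace(const int *flush_cycles, int n, int threshold,
                       unsigned char *accessed) {
    int hits = 0;
    for (int i = 0; i < n; i++) {
        accessed[i] = (unsigned char)(flush_cycles[i] > threshold);
        hits += accessed[i];
    }
    return hits;
}
```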
In general, the applicability of F+R and F+F is limited to shared memory
scenarios. For example, two users on the same CPU might share an instantiation
of a library that uses an AFU for acceleration of a process that should remain
private, like training a machine learning model with confidential data or performing
cryptographic operations.
[Fig. 8.16 plot area: frequency (0–100) over CPU clock cycles (200–800); series: absent, present]
Fig. 8.16 The flush execution time on the CPU with the flushed address being absent or present
in the FPGA cache. The two peaks are clearly distinct
If FPGAs support simultaneous multi-tenancy, that is, the capability to place two
AFUs from different users on the same FPGA at the same time, the possibility of
intra-FPGA cache attacks arises. As the cache on the integrated Arria 10 is
direct-mapped and only 128 KiB in size, finding eviction sets becomes trivial when the
attacker AFU is given access to 2 MB huge pages. As this is the default behavior of the
OPAE driver when allocating more than one memory page at once, we assume that
it is straightforward to run eviction-based attacks like Evict+Time or Prime+Probe
against a neighboring AFU to, e.g., extract information about a machine learning
model. Flush-based attacks would still be impossible due to the lack of a flush
instruction in CCI-P.
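Why eviction sets are trivial here follows from the mapping arithmetic: a direct-mapped 128 KiB cache with 64-byte lines has 2048 sets, so two physical addresses collide exactly when they lie a multiple of 128 KiB apart, and a single conflicting address suffices to evict a victim line. The line size and the straightforward modulo mapping are assumptions for illustration.

```c
#include <stdint.h>

/* Direct-mapped cache conflict check for the 128 KiB Arria 10 cache.
 * 64-byte lines are assumed; the simple modulo mapping is illustrative. */
#define CACHE_SIZE (128 * 1024)
#define LINE_SIZE  64

unsigned dm_set_index(uint64_t paddr) {
    return (unsigned)((paddr / LINE_SIZE) % (CACHE_SIZE / LINE_SIZE));
}

/* In a direct-mapped cache, one colliding address is a complete
 * "eviction set": accessing b evicts a whenever they share a set. */
int dm_conflicts(uint64_t a, uint64_t b) {
    return dm_set_index(a) == dm_set_index(b);
}
```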
8.8 Countermeasures
Microarchitectural attacks against CPUs leave traces in HPCs such as cache hit
and miss counters. Previous works have paired these HPCs with machine learning
techniques to build real-time detectors for these attacks [9, 12, 22, 63]. In some
cases, CPU HPCs may be able to trace incoming attacks from FPGAs. While
HPCs do not exist in the same form on the Arria 10 GX platforms, they could be
implemented by the FIM. A system combining FPGA and CPU HPCs could provide
monitoring of the FPGA-CPU interface.
Several cache partitioning mechanisms have been proposed to protect CPUs against
cache attacks. While some are implementable in software [36, 37, 62, 65], others
require hardware support [16, 17, 40]. When trying to protect FPGA caches against
cache attacks, hardware-based approaches should receive special consideration. For
example, the FIM could partition the FPGA’s cache into several security domains,
such that each AFU can only use a subset of the cache lines in the local cache.
Another approach would introduce an additional flag to the CCI-P interface telling
the local caching agent which cache lines to pin to the cache.
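The way-based variant of such partitioning can be sketched with a mask per security domain, in the spirit of way-partitioning schemes such as Intel CAT. The 8-way geometry and the mask layout here are hypothetical, not a documented FIM feature.

```c
/* Hypothetical FIM-level cache partitioning: each AFU security domain
 * receives a disjoint way mask, and an insertion is legal only into the
 * domain's own ways. An 8-way cache is assumed for illustration. */
#define NUM_WAYS 8

unsigned domain_way_mask(int domain, int num_domains) {
    unsigned ways_per_domain = NUM_WAYS / num_domains;
    return ((1u << ways_per_domain) - 1) << (domain * ways_per_domain);
}

int insertion_allowed(int domain, int num_domains, int way) {
    return (domain_way_mask(domain, num_domains) >> way) & 1u;
}
```

With two domains, domain 0 owns ways 0–3 and domain 1 owns ways 4–7, so neither domain can evict the other's lines.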
Intel is aware that making physical addresses available to user space through OPAE
has negative security consequences [27]. In addition to exposing physical addresses,
OPAE makes heavy use of huge pages to ensure physical address continuity of
buffers shared with the AFU. However, it is well known that disabling huge pages
raises the bar for finding eviction sets [3, 41], which in turn makes cache
attacks and Rowhammer more difficult. We suggest disabling OPAE's usage of huge
pages. To do so, the AFU address space has to be virtualized independently of the
presence of virtual environments.
Defenses against fault injection attacks proposed in the original Bellcore whitepaper [8]
include verifying the signature before releasing it, and random padding of
the message before signing, which ensures that no unique message is ever signed
twice and that the exact plaintext cannot be easily determined. OpenSSL protects
against the Bellcore attack by verifying the signature against its plaintext with the
public key and, if the verification does not match, recomputing the exponentiation
with a slower but safer single exponentiation instead of the CRT [11].
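OpenSSL's check can be illustrated with toy RSA parameters: verify the (possibly faulted) CRT result with the public exponent, and release it only if it round-trips; otherwise recompute the signature with a plain single exponentiation. The key below (n = 61 · 53 = 3233, e = 17, d = 2753) is a textbook toy example, not a realistic key, and the fault is simulated rather than induced by Rowhammer.

```c
#include <stdint.h>

/* Square-and-multiply modular exponentiation; fits toy moduli in 64 bits. */
uint64_t modpow(uint64_t b, uint64_t e, uint64_t n) {
    uint64_t r = 1;
    b %= n;
    while (e) {
        if (e & 1) r = r * b % n;
        b = b * b % n;
        e >>= 1;
    }
    return r;
}

/* Bellcore countermeasure sketch: sig is the candidate CRT signature of
 * msg. Verify it with the public exponent e; on mismatch, discard the
 * faulty value and recompute with a plain (non-CRT) exponentiation so
 * that no exploitable signature is ever released. */
uint64_t release_signature(uint64_t sig, uint64_t msg,
                           uint64_t e, uint64_t d, uint64_t n) {
    if (modpow(sig, e, n) == msg)
        return sig;               /* CRT result verified, safe to release */
    return modpow(msg, d, n);     /* recompute without CRT */
}
```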
8.9 Conclusion
In this work, we show that modern FPGA-CPU hybrid systems can be more
vulnerable to well-known hardware attacks that are traditionally seen on CPU-only
systems. We show that the shared cache systems of the Arria 10 GX and its host
CPU present possible CPU to FPGA, FPGA to CPU, and FPGA to FPGA attack
vectors. For Rowhammer, we show that the Arria 10 GX is capable of causing
more DRAM faults in less time than modern CPUs. Our research indicates that
hardware side-channel defenses are just as essential for modern FPGA systems
as they are for modern CPUs. Of course, the security of any device physically
installed in a system, like a network card or graphics card, is important, but FPGAs
present additional security challenges due to their inherently flexible nature. With
FPGAs, new hardware caches and buffers become accessible to users, which opens
a new area of possible side channels. Tiemann et al. [53] showed that FPGAs can
be used to observe the IOTLB and therefore derive information about the state of
neighboring peripherals. From a security perspective, a user-configurable FPGA on
a cloud system needs to be treated with at least as much care and caution as a user-
controlled CPU thread, as it can exploit many of the same vulnerabilities.
Acknowledgments We would like to extend special thanks to Alpa Trivedi and Evan Custodio,
without whom this research would not have been possible. We also thank Daniel Moghimi, Sayak
Ray, and Thomas Unterluggauer for their indispensable advice and insights. Research included
in this chapter was funded in part by Intel, National Science Foundation (NSF) grants CNS
1814406 and CNS 2026913, German Research Foundation (DFG) grant 456967092, German
Federal Ministry of Education and Research (BMBF) grant VE-Jupiter, and the Qatar National
Research Fund.
References
13. Faict, T., D’Hollander, E. H., & Goossens, B. (2019). Mapping a guided image filter on the
HARP reconfigurable architecture using OpenCL. Algorithms, 12(8), 149. https://doi.org/10.
3390/a12080149.
14. Frigo, P., Giuffrida, C., Bos, H., & Razavi, K. (2018). Grand Pwning Unit: Accelerating
microarchitectural attacks with the GPU. In Proceedings of the 2018 IEEE symposium on
security and privacy, SP 2018, 21–23 May 2018, San Francisco, California, USA (pp. 195–
210). IEEE Computer Society. https://doi.org/10.1109/SP.2018.00022.
15. Frigo, P., Vannacci, E., Hassan, H., van der Veen, V., Mutlu, O., Giuffrida, C., Bos, H., &
Razavi, K. (2020). TRRespass: Exploiting the many sides of target row refresh. In 2020 IEEE
symposium on security and privacy, SP 2020, San Francisco, CA, USA, May 18–21, 2020 (pp.
747–762). IEEE. https://doi.org/10.1109/SP40000.2020.00090.
16. Green, M., Lima, L. R., Zankl, A., Irazoqui, G., Heyszl, J., & Eisenbarth, T. (2017). AutoLock:
Why cache attacks on ARM are harder than you think. In E. Kirda, & T. Ristenpart (Eds.),
26th USENIX security symposium, USENIX security 2017, Vancouver, BC, Canada, August
16–18, 2017 (pp. 1075–1091). USENIX Association. https://www.usenix.org/conference/
usenixsecurity17/technical-sessions/presentation/green.
17. Gruss, D., Lettner, J., Schuster, F., Ohrimenko, O., Haller, I., & Costa, M. (2017). Strong and
efficient cache side-channel protection using hardware transactional memory. In E. Kirda, &
T. Ristenpart (Eds.), 26th USENIX security symposium, USENIX security 2017, Vancouver,
BC, Canada, August 16–18, 2017 (pp. 217–233). USENIX Association. https://www.usenix.
org/conference/usenixsecurity17/technical-sessions/presentation/gruss.
18. Gruss, D., Lipp, M., Schwarz, M., Genkin, D., Juffinger, J., O’Connell, S., Schoechl, W., &
Yarom, Y. (2018). Another flip in the wall of Rowhammer defenses. In Proceedings of the
2018 IEEE symposium on security and privacy, SP 2018, 21–23 May 2018, San Francisco,
California, USA (pp. 245–261). IEEE Computer Society. https://doi.org/10.1109/SP.2018.
00031.
19. Gruss, D., Maurice, C., & Mangard, S. (2016). Rowhammer.js: A remote software-induced
fault attack in JavaScript. In J. Caballero, U. Zurutuza, & R. J. Rodríguez (Eds.), Proceedings
of the detection of intrusions and malware, and vulnerability assessment—13th international
conference, DIMVA 2016, San Sebastián, Spain, July 7–8, 2016. Lecture notes in computer
science (vol. 9721, pp. 300–321). Springer. https://doi.org/10.1007/978-3-319-40667-1_15.
20. Gruss, D., Maurice, C., Wagner, K., & Mangard, S. (2016). Flush+Flush: A fast and
stealthy cache attack. In J. Caballero, U. Zurutuza, & R. J. Rodríguez (Eds.), Proceedings
of the detection of intrusions and malware, and vulnerability assessment—13th international
conference, DIMVA 2016, San Sebastián, Spain, July 7–8, 2016. Lecture notes in computer
science (Vol. 9721, pp. 279–299). Springer. https://doi.org/10.1007/978-3-319-40667-1_14.
21. Gülmezoglu, B., Eisenbarth, T., & Sunar, B. (2017). Cache-based application detection in
the cloud using machine learning. In R. Karri, O. Sinanoglu, A. Sadeghi, & X. Yi (Eds.),
Proceedings of the 2017 ACM Asia conference on computer and communications security,
AsiaCCS 2017, Abu Dhabi, United Arab Emirates, April 02–06, 2017 (pp. 288–300. ACM).
https://doi.org/10.1145/3052973.3053036.
22. Gülmezoglu, B., Moghimi, A., Eisenbarth, T., & Sunar, B. (2019). FortuneTeller: Predicting
microarchitectural attacks via unsupervised deep learning. CoRR abs/1907.03651. http://arxiv.
org/abs/1907.03651.
23. Gülmezoglu, B., Zankl, A., Eisenbarth, T., & Sunar, B. (2017). PerfWeb: How to violate web
privacy with hardware performance events. In S. N. Foley, D. Gollmann, & E. Snekkenes
(Eds.), Computer security—ESORICS 2017—22nd European symposium on research in
computer security, Oslo, Norway, September 11–15, 2017, Proceedings, Part II. Lecture notes
in computer science (Vol. 10493, pp. 80–97). Springer. https://doi.org/10.1007/978-3-319-
66399-9_5.
24. Hund, R., Willems, C., & Holz, T. (2013). Practical timing side channel attacks against kernel
space ASLR. In 20th Annual network and distributed system security symposium, NDSS 2013,
San Diego, California, USA, February 24–27, 2013. The Internet Society. https://www.ndss-
symposium.org/ndss2013/practical-timing-side-channel-attacks-against-kernel-space-aslr.
25. Inci, M. S., Gülmezoglu, B., Apecechea, G. I., Eisenbarth, T., & Sunar, B. (2015). Seriously,
get off my cloud! cross-VM RSA key recovery in a public cloud. IACR Cryptol. ePrint Arch.
(p. 898). http://eprint.iacr.org/2015/898.
26. Inci, M. S., Gülmezoglu, B., Irazoqui, G., Eisenbarth, T., & Sunar, B. (2016). Cache attacks
enable bulk key recovery on the cloud. In B. Gierlichs, & A. Y. Poschmann (Eds.), Proceedings
of the Cryptographic hardware and embedded systems—CHES 2016—18th international
conference, Santa Barbara, CA, USA, August 17–19, 2016. Lecture notes in computer science
(Vol. 9813, pp. 368–388). Springer. https://doi.org/10.1007/978-3-662-53140-2_18.
27. Intel (2017). Open programmable acceleration engine (1.1.2 ed.). Accessed 2023-05-23.
28. Intel (2018). Acceleration Stack for Intel Xeon CPU with FPGAs Core Cache Interface (CCI-
P) Reference Manual (1.2 ed.).
29. Intel (2020). Intel programmable acceleration card (PAC) with Intel Arria 10 GX
FPGA data sheet. https://www.intel.com/content/www/us/en/docs/programmable/683226/
current/introduction-rush-creek.html. Accessed 2023-05-22.
30. Intel (2022). Intel Virtualization Technology for Directed I/O. Rev. 4.0.
31. Intel Labs (2021). FPGA accelerators. https://wiki.intel-research.net/FPGA.html#fpga-
system-classes. Accessed 2023-05-22.
32. Intel Labs (2021). IL academic compute environment documentation. https://wiki.intel-
research.net/. Accessed 2023-05-22.
33. Irazoqui, G., Eisenbarth, T., & Sunar, B. (2015). Systematic reverse engineering of cache slice
selection in intel processors. In 2015 Euromicro conference on digital system design, DSD
2015, Madeira, Portugal, August 26–28, 2015 (pp. 629–636). IEEE Computer Society. https://
doi.org/10.1109/DSD.2015.56.
34. Irazoqui, G., Eisenbarth, T., & Sunar, B. (2016). Cross processor cache attacks. In X. Chen,
X. Wang, & X. Huang (Eds.), Proceedings of the 11th ACM Asia conference on computer and
communications security, AsiaCCS 2016, Xi’an, China, May 30–June 3, 2016 (pp. 353–364).
ACM. https://doi.org/10.1145/2897845.2897867.
35. JC-42.6 Low Power Memories Committee (2017). Low Power Double Data Rate 4 (LPDDR4).
In Standard JESD209-4B, JEDEC solid state technology association.
36. Kim, T., Peinado, M., Mainar-Ruiz, G. (2012). STEALTHMEM: system-level protection
against cache-based side channel attacks in the cloud. In T. Kohno (Ed.), Proceedings of
the 21st USENIX security symposium, Bellevue, WA, USA, August 8–10, 2012 (pp. 189–
204). USENIX Association. https://www.usenix.org/conference/usenixsecurity12/technical-
sessions/presentation/kim.
37. Kiriansky, V., Lebedev, I. A., Amarasinghe, S. P., Devadas, S., & Emer, J. S. (2018). DAWG:
A defense against cache timing attacks in speculative execution processors. In 51st Annual
IEEE/ACM international symposium on microarchitecture, MICRO 2018, Fukuoka, Japan,
October 20–24, 2018 (pp. 974–987). IEEE Computer Society. https://doi.org/10.1109/MICRO.
2018.00083.
38. Kurth, M., Gras, B., Andriesse, D., Giuffrida, C., Bos, H., & Razavi, K. (2020). NetCAT:
Practical cache attacks from the network. In 2020 IEEE symposium on security and privacy,
SP 2020, San Francisco, CA, USA, May 18–21, 2020, pp. 20–38. IEEE. https://doi.org/10.
1109/SP40000.2020.00082.
39. Lipp, M., Gruss, D., Spreitzer, R., Maurice, C., & Mangard, S. (2016). ARMageddon: Cache
attacks on mobile devices. In T. Holz, & S. Savage (Eds.), 25th USENIX security symposium,
USENIX security 16, Austin, TX, USA, August 10–12, 2016 (pp. 549–564). USENIX Association.
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/lipp.
40. Liu, F., Ge, Q., Yarom, Y., McKeen, F., Rozas, C. V., Heiser, G., Lee, R. B. (2016). CATalyst:
Defeating last-level cache side channel attacks in cloud computing. In 2016 IEEE international
symposium on high performance computer architecture, HPCA 2016, Barcelona, Spain, March
12–16, 2016 (pp. 406–418). IEEE Computer Society. https://doi.org/10.1109/HPCA.2016.
7446082.
41. Liu, F., Yarom, Y., Ge, Q., Heiser, G., & Lee, R. B. (2015). Last-level cache side-channel
attacks are practical. In 2015 IEEE symposium on security and privacy, SP 2015, San Jose,
CA, USA, May 17–21, 2015 (pp. 605–622). IEEE Computer Society. https://doi.org/10.1109/
SP.2015.43.
42. Maurice, C., Weber, M., Schwarz, M., Giner, L., Gruss, D., Boano, C. A., Mangard, S.,
& Römer, K. (2017). Hello from the other side: SSH over robust cache covert chan-
nels in the cloud. In 24th annual network and distributed system security symposium,
NDSS 2017, San Diego, California, USA, February 26–March 1, 2017. The Internet Soci-
ety. https://www.ndss-symposium.org/ndss2017/ndss-2017-programme/hello-other-side-ssh-
over-robust-cache-covert-channels-cloud/.
43. Moghimi, A., Irazoqui, G., & Eisenbarth, T. (2017). CacheZoom: How SGX amplifies the
power of cache attacks. In W. Fischer, & N. Homma (Eds.), Proceedings of the cryptographic
hardware and embedded systems—CHES 2017—19th international conference, Taipei, Tai-
wan, September 25–28, 2017. Lecture Notes in Computer Science (Vol. 10529, pp. 69–90).
Springer. https://doi.org/10.1007/978-3-319-66787-4_4.
44. Mulnix, D. (2017). Intel Xeon processor scalable family technical overview. https://www.
intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-
technical-overview.html. Accessed 2023-05-22.
45. Oren, Y., Kemerlis, V. P., Sethumadhavan, S., & Keromytis, A. D. (2015). The spy in the
sandbox: Practical cache attacks in JavaScript and their implications. In I. Ray, N. Li, &
C. Kruegel (Eds.), Proceedings of the 22nd ACM SIGSAC conference on computer and
communications security, Denver, CO, USA, October 12–16, 2015 (pp. 1406–1418). ACM.
https://doi.org/10.1145/2810103.2813708.
46. Osvik, D. A., Shamir, A., & Tromer, E. (2006). Cache attacks and countermeasures: The case
of AES. In D. Pointcheval (Ed.), Proceedings of the Topics in Cryptology—CT-RSA 2006,
The Cryptographers’ track at the RSA conference 2006, San Jose, CA, USA, February 13–17,
2006. Lecture notes in computer science (Vol. 3860, pp. 1–20). Springer. https://doi.org/10.
1007/11605805_1.
47. Pessl, P., Gruss, D., Maurice, C., Schwarz, M., & Mangard, S. (2016). DRAMA: exploiting
DRAM addressing for cross-CPU attacks. In T. Holz, & S. Savage (Eds.), 25th USENIX
security symposium, USENIX security 16, Austin, TX, USA, August 10–12, 2016 (pp. 565–
581). USENIX Association. https://www.usenix.org/conference/usenixsecurity16/technical-
sessions/presentation/pessl.
48. Purnal, A., Turan, F., & Verbauwhede, I. (2022). Double trouble: Combined heterogeneous
attacks on non-inclusive cache hierarchies. In K. R. B. Butler, & K. Thomas (Eds.), 31st USENIX
security symposium, USENIX security 2022, Boston, MA, USA, August 10–12, 2022 (pp. 3647–
3664). USENIX Association. https://www.usenix.org/conference/usenixsecurity22/
presentation/purnal.
49. Ristenpart, T., Tromer, E., Shacham, H., & Savage, S. (2009). Hey, you, get off of my cloud:
Exploring information leakage in third-party compute clouds. In E. Al-Shaer, S. Jha, & A. D.
Keromytis (Eds.), Proceedings of the 2009 ACM conference on computer and communications
security, CCS 2009, Chicago, Illinois, USA, November 9–13, 2009 (pp. 199–212). ACM.
https://doi.org/10.1145/1653662.1653687.
50. Schwarz, M. (2019). PTEditor: A small library to modify all page-table levels of all processes
from user space for x86_64 and ARMv8. https://github.com/misc0110/PTEditor. Version
738f42e, accessed 2023-05-22.
51. Seaborn, M., & Dullien, T. (2015). Exploiting the DRAM Rowhammer bug to gain kernel
privileges. Black Hat USA, 18, 71. https://www.blackhat.com/docs/us-15/materials/us-15-
Seaborn-Exploiting-The-DRAM-Rowhammer-Bug-To-Gain-Kernel-Privileges.pdf.
52. Taram, M., Venkat, A., & Tullsen, D. M. (2020). Packet chasing: Spying on network packets
over a cache side-channel. In 47th ACM/IEEE annual international symposium on computer
architecture, ISCA 2020, Valencia, Spain, May 30–June 3, 2020 (pp. 721–734). IEEE. https://
doi.org/10.1109/ISCA45697.2020.00065.
53. Tiemann, T., Weissman, Z., Eisenbarth, T., & Sunar, B. (2023). IOTLB-SC: An accelerator-
independent leakage source in modern cloud systems. In Proceedings of the 2023 ACM Asia
conference on computer and communications security, AsiaCCS 2023, Melbourne, Australia,
July 10–14, 2023. ACM. https://doi.org/10.1145/3579856.3582838.
54. Tsunoo, Y., Saito, T., Suzaki, T., Shigeri, M., & Miyauchi, H. (2003). Cryptanalysis of
DES implemented on computers with cache. In C. D. Walter, Ç. K. Koç, & C. Paar
(Eds.) Proceedings of the Cryptographic hardware and embedded systems—CHES 2003, 5th
international workshop, Cologne, Germany, September 8–10, 2003. Lecture notes in computer
science (Vol. 2779, pp. 62–76). Springer. https://doi.org/10.1007/978-3-540-45238-6_6.
55. van der Veen, V., Fratantonio, Y., Lindorfer, M., Gruss, D., Maurice, C., Vigna, G., Bos, H.,
Razavi, K., & Giuffrida, C. (2016). Drammer: Deterministic Rowhammer attacks on mobile
platforms. In E. R. Weippl, S. Katzenbeisser, C. Kruegel, A. C. Myers, & S. Halevi (Eds.),
Proceedings of the 2016 ACM SIGSAC conference on computer and communications security,
Vienna, Austria, October 24–28, 2016 (pp. 1675–1689). ACM. https://doi.org/10.1145/2976749.
2978406.
56. Vila, P., Köpf, B., & Morales, J. F. (2019). Theory and practice of finding eviction sets. In 2019
IEEE symposium on security and privacy, SP 2019, San Francisco, CA, USA, May 19–23, 2019
(pp. 39–54). IEEE. https://doi.org/10.1109/SP.2019.00042.
57. Witteman, M. F., van Woudenberg, J. G. J., & Menarini, F. (2011). Defeating RSA multiply-always
and message blinding countermeasures. In A. Kiayias (Ed.), Proceedings of the topics in
cryptology—CT-RSA 2011—the Cryptographers’ track at the RSA conference 2011, San
Francisco, CA, USA, February 14–18, 2011. Lecture notes in computer science (Vol. 6558,
pp. 77–88). Springer. https://doi.org/10.1007/978-3-642-19074-2_6.
58. Wu, Z., Xu, Z., & Wang, H. (2012). Whispers in the hyper-space: High-speed covert channel
attacks in the cloud. In T. Kohno (Ed.), Proceedings of the 21st USENIX security symposium,
Bellevue, WA, USA, August 8–10, 2012 (pp. 159–173). USENIX Association. https://www.
usenix.org/conference/usenixsecurity12/technical-sessions/presentation/wu.
59. Xilinx (2019). Accelerator Cards. https://www.xilinx.com/products/boards-and-kits/
accelerator-cards.html. Accessed 2023-05-22.
60. Yan, M., Sprabery, R., Gopireddy, B., Fletcher, C. W., Campbell, R. H., & Torrellas, J. (2019).
Attack directories, not caches: Side channel attacks in a non-inclusive world. In 2019 IEEE
symposium on security and privacy, SP 2019, San Francisco, CA, USA, May 19–23, 2019 (pp.
888–904). IEEE. https://doi.org/10.1109/SP.2019.00004.
61. Yarom, Y., & Falkner, K. (2014). FLUSH+RELOAD: A high resolution, low noise, L3 cache
side-channel attack. In K. Fu, & J. Jung (Eds.), Proceedings of the 23rd USENIX security
symposium, San Diego, CA, USA, August 20–22, 2014 (pp. 719–732). USENIX Association.
https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/yarom.
62. Ye, Y., West, R., Cheng, Z., & Li, Y. (2014). COLORIS: A dynamic cache partitioning system
using page coloring. In J. N. Amaral, & J. Torrellas (Eds.), International conference on parallel
architectures and compilation, PACT ’14, Edmonton, AB, Canada, August 24–27, 2014 (pp.
381–392). ACM. https://doi.org/10.1145/2628071.2628104.
63. Zhang, T., Zhang, Y., & Lee, R. B. (2016). CloudRadar: A real-time side-channel attack detection
system in clouds. In F. Monrose, M. Dacier, G. Blanc, & J. García-Alfaro (Eds.), Proceedings
of the research in attacks, intrusions, and defenses—19th international symposium, RAID 2016,
Paris, France, September 19–21, 2016. Lecture notes in computer science (Vol. 9854, pp. 118–
140). Springer. https://doi.org/10.1007/978-3-319-45719-2_6.
64. Zhang, Y., Juels, A., Reiter, M. K., & Ristenpart, T. (2012). Cross-VM side channels and their
use to extract private keys. In T. Yu, G. Danezis, & V. D. Gligor (Eds.), The ACM conference
on computer and communications security, CCS’12, Raleigh, NC, USA, October 16–18, 2012
(pp. 305–316). ACM. https://doi.org/10.1145/2382196.2382230.
65. Zhou, Z., Reiter, M. K., & Zhang, Y. (2016). A software approach to defeating side channels
in last-level caches. In E. R. Weippl, S. Katzenbeisser, C. Kruegel, A. C. Myers, & S. Halevi
(Eds.), Proceedings of the 2016 ACM SIGSAC conference on computer and communications
security, Vienna, Austria, October 24–28, 2016 (pp. 871–882). ACM. https://doi.org/10.1145/
2976749.2978324.
Chapter 9
Fingerprinting and Mapping Cloud
FPGA Infrastructures
9.1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 239
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_9
240 S. Tian et al.
used and the geographic location of the data centers. Furthermore, there are a number
of design rule checks (DRCs) on the design checkpoint (DCP) files generated by
Xilinx’s Vivado tools before the generated bitstream (called an Amazon FPGA
image, or AFI) can be loaded onto one of the AWS FPGAs. The checks, which
include prohibiting combinatorial loops [4], are combined with a restrictive “shell”
interface that prevents access to Xilinx eFUSE and Device DNA primitives [66],
which could be used to identify the specific FPGA hardware that a user has rented.
In spite of the efforts to hide information about the cloud FPGA architecture, this
chapter shows that it is possible to gain insights into the infrastructure through the
resources that are available to unprivileged FPGA users. Specifically, this chapter
introduces two algorithms for fingerprinting cloud FPGAs through unique features
in their boards. The first approach uses physical unclonable functions (PUFs) based
on the decay of dynamic random access memory (DRAM) [69] to identify the
DRAM modules attached to the cloud FPGA boards, and, by extension, the FPGAs
themselves. The second PUF uses ring oscillators (ROs) that fingerprint the FPGA
chips themselves. Both designs bypass AWS countermeasures: the DRAM-based
PUF disables DRAM refresh by loading AFIs with and without DRAM controllers
instantiated, while the RO-based PUF uses novel ring oscillators with latches and
flip-flops [21, 59] that are not detected by the deployed DRCs.
Our work then shifts focus from identifying single FPGA boards to mapping the
whole cloud FPGA infrastructure itself. The main insight behind our research is
that memory accesses between the host computer and an FPGA board become a
bottleneck when two or more FPGAs from the same non-uniform memory access
(NUMA) node within a server are accessing memory simultaneously. We show that
it is possible to influence the peripheral component interconnect express (PCIe)
bandwidth by running an FPGA memory stressor and to observe this change in
bandwidth by running a separate FPGA memory tester on a different FPGA board
in the server.1 Using this approach, we can determine which FPGA slots within a
server belong to the same NUMA node.
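Once pairwise contention results are available, grouping slots into NUMA nodes is a small clustering step. The sketch below uses a union-find over slot pairs; the `interferes` callable is a hypothetical stand-in for one stressor/tester measurement (it is not part of the actual stressor design):

```python
def numa_groups(n_slots, interferes):
    """Group FPGA slots into NUMA locality nodes.

    interferes(a, b) reports whether stressing slot a reduces slot b's
    PCIe bandwidth; mutually interfering slots are merged into one node
    (a simple union-find sketch of the grouping logic)."""
    parent = list(range(n_slots))

    def find(x):
        # Path-halving find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(n_slots):
        for b in range(a + 1, n_slots):
            if interferes(a, b) and interferes(b, a):
                parent[find(a)] = find(b)

    groups = {}
    for s in range(n_slots):
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())
```

Applied to an f1.16xlarge instance, a contention map where slots 0–3 and 4–7 only interfere within their own group yields the two four-slot NUMA nodes described above.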
In particular, we show that in f1.16xlarge AWS instances, FPGAs in slots
0–3 or 4–7 interfere with each other, and thus conclude that they form separate
NUMA nodes. We then use data from dozens of f1.2xlarge instances that have
been rented one after the other and determine that successive instances often (but
not always) belong to the same NUMA locality. The findings are confirmed for
f1.4xlarge instances and across different data center regions for both instance
types. We perform additional experiments with f1.2xlarge instances to calculate
the probability of renting the same FPGA or FPGAs within the same NUMA locality
across time, therefore fingerprinting and mapping the cloud infrastructure on a very
fine-grained level.
Overall, our work exposes a fundamental infrastructure issue and highlights
that simply focusing on the security of the FPGA chip itself, but ignoring other
infrastructure components, such as the DRAM modules or the PCIe bus, leaves
cloud FPGAs open to new vulnerabilities.
1 In later work, we show how to exploit this effect to create covert- and side-channel attacks.
9.2 Background
This section describes current public cloud FPGA deployments and their typi-
cal hardware setup (Sect. 9.2.1). It then summarizes decay-based DRAM PUFs
(Sect. 9.2.2), ring oscillators (Sect. 9.2.3), and PCIe-related concepts (Sect. 9.2.4).
Several options are available for renting FPGAs in the cloud. Since 2015, academic
researchers have been able to access a cluster with Intel Stratix V FPGAs in the Texas
Advanced Computing Center (TACC) [62]. Intel FPGAs are also available on Alibaba
Cloud [1] and on Microsoft Azure for machine learning applications [43]. Xilinx-
based cloud offerings have been available since 2016, when AWS announced F1
instances with Xilinx Virtex UltraScale+ FPGAs [2]. The same chips also power
Huawei [65] and Alibaba [1] cloud services. Meanwhile, Kintex UltraScale boards
are available on Baidu [12] and Tencent [61].
In this chapter, we focus on FPGA instances provided by AWS. These instances,
or virtual machines (VMs), are offered in several geographical regions and come
in three flavors, with 1, 2, or 8 dedicated FPGAs per VM instance, called
f1.2xlarge, f1.4xlarge, and f1.16xlarge (the instance name is twice
the number of FPGAs, so f1.2xlarge has 1 FPGA, while f1.4xlarge has 2,
etc.). The total amount of resources allocated per VM increases proportionally with
the number of attached FPGAs, providing 8 virtual CPUs (vCPUs) from Intel Xeon
E5-2686 v4 (Broadwell) processors, 130 GB of RAM, and 470 GB of NVMe SSD
per FPGA [9]. Thus, e.g., f1.16xlarge instances have 64 vCPUs, over 1 TB of
RAM and 3.7 TB of disk space.
Each FPGA board can communicate with the server over x16 PCIe Gen 3. In
addition, each FPGA can access (via the programmable logic) four DDR4 DRAM
chips on the FPGA board itself, which are separate from the server’s DRAM.
The FPGA DRAM comes with error correction code (ECC) and a total of 16 GB
of memory [9] for each FPGA. AWS F1 instances use 16 nm Virtex UltraScale+
XCVU9P chips [9], which internally contain over 1.1 million lookup tables (LUTs),
2.3 million flip-flops (FFs), and 6,800 digital signal processing (DSP) blocks [67]. It
should be noted that f1.16xlarge instances use a dedicated PCIe fabric, which
“lets the FPGAs share the same memory space and communicate with each other
across the fabric at up to 12 Gbps in each direction” [9], suggesting that a server can
consist of at most two CPUs and eight attached PCIe cards, a.k.a. FPGAs. As we
show in this chapter, servers indeed seem to have two NUMA locality nodes, each
encompassing one CPU and four FPGAs. This is consistent with known server and
PCIe designs but is not publicly specified by Amazon.2
2 We further reverse-engineer the server architecture by incorporating information about SSDs and
DRAM is widely used in personal computers and servers due to its high storage
density. Usually, multiple DRAM chips (ranks) are combined in a DRAM module
to provide enough memory. Each DRAM chip consists of DRAM banks, which are
arrays of DRAM cells. A single DRAM cell consists of a capacitor and a transistor,
with bits of information stored as charges on the capacitors. The gate of the access
transistor in the DRAM cell connects to the wordline (WL) in that row, while the
capacitor in the DRAM cell connects to the bitline (BL) through the transistor. To
access a certain memory address, the bitlines are first reset by the equalizers. Then,
the corresponding wordline is enabled, and the charge on the capacitors is read
through the sense amplifiers.
DRAM is a type of volatile memory because the capacitor charge leaks over time
through different leakage paths. The time that a DRAM cell can retain the charge on
the capacitor and store the data value is called the retention time. After the retention
time elapses, the charge on the cell will leak, and the bit stored in the DRAM cell
may flip its value. To maintain the data integrity of information stored, the DRAM
is refreshed periodically to recharge the capacitors to their original voltage levels.
Moreover, an error correction code (ECC) can also be applied.
The variation in the retention time of different DRAM cells can be used in
physical unclonable functions (PUFs) [69]. Specifically, in a decay-based DRAM
PUF, the DRAM PUF region is first set to a known initial value (e.g., all ones) and
the DRAM refresh is disabled. After a certain decay period elapses, the DRAM
PUF region is then read. Due to DRAM charge leakage, bit flips (errors) in the
initial values will occur. The location of the bit flips depends on variations in the
fabrication process and is considered to be unique for each DRAM chip. Thus,
the bit flips due to DRAM decay can be used as a PUF response. DRAM PUFs
have been used to identify and authenticate DRAM chips [48, 52, 55, 56, 60, 69]
or generate keys [53, 56, 60, 69]. In this chapter, we use DRAM data retention
properties to create a unique fingerprint of DRAM chips, and, by extension, the
FPGAs to which they are attached.
Ring oscillators (ROs) are a type of circuit with an odd number of NOT gates that
are chained together in a loop (i.e., the output of the last gate is the input of the
first gate). The value at any given stage of an RO oscillates between 1 and 0, at a
frequency that depends on the number of stages in the RO, the delay between the
stages, as well as process, voltage, and temperature (PVT) variations [26].
ROs on FPGAs are traditionally implemented using lookup tables (LUTs), which
are configured as either inverters or buffers (e.g., 3 inverters, or 1 inverter and 2
buffers). These combinatorial loops can be detected by the synthesis tools, and some
cloud providers, such as Amazon Web Services [9], in fact prohibit ROs in FPGA
bitstreams deployed on their cloud FPGAs. However, alternative types of ROs that
behave similarly have been designed to bypass these countermeasures. These ROs,
used in this chapter, replace one of the LUT stages with a latch or a flip-flop [21, 59].
Covert- and side-channel attacks are possible in cloud FPGAs (Sect. 9.6), but
they often require that adversaries be able to uniquely identify FPGA instances
to carry out the attacks. This chapter provides a way to uniquely fingerprint
individual FPGAs and more generally map the cloud infrastructure, while obeying
the DRCs imposed by cloud FPGA providers. The adversarial user is therefore
free to place and route their potentially malicious logic within the confines of
their dedicated region on the FPGA chip but cannot use any prohibited circuits,
such as combinatorial loops [4]. In addition, users only interact with the physical
interfaces through the cloud-provided IP modules, such as the DRAM controller,
and do not have direct access to I/O pins, identifiers such as eFUSE and Device
DNA primitives [66], or voltage and temperature monitors. Adversaries can instead
try to infer or influence such information indirectly (e.g., by using PUFs or other
alternative RO constructions), but they do not have physical access to the underlying
FPGA boards or server racks themselves. We also do not assume vulnerabilities
in the virtualization mechanisms that would allow users to (directly) snoop on the
memory or PCIe transactions of other VM instances. Attacks on the cloud-provided
logic (including decrypting the protected IP), FPGA software tools, or the bitstream
are similarly out of scope.
The DRAM-based PUF is described in Sect. 9.4.1 and evaluated in Sect. 9.4.2,
while the RO-based PUF is introduced in Sect. 9.4.3 and evaluated in Sect. 9.4.4.
Section 9.4.5 presents some potential countermeasures against FPGA fingerprinting.
This section presents a novel way to create decay-based DRAM PUFs by loading
and unloading two different types of AFIs: one with and one without a memory
controller. This approach disables refresh, while still providing power to the
DRAM modules. Section 9.4.1.1 expands on the memory-related aspects of our
experimental setup, while Sect. 9.4.1.2 explains in detail how DRAM PUFs are
instantiated and used for data collection on AWS F1 instances.
The cl_dram_dma example in the AWS development kit [6] explains how to
access the DRAM from the FPGA. The physical pinout and timing parameters
of the DDR4 DRAM chips are hidden in the sh_ddr module, which provides a
512-bit AXI4 interface to user logic. It also implements memory initialization, error
correction, and self-refresh of DRAM cells. As shown in Fig. 9.1, although there are
four DRAM modules, one (DRAM C) is reserved by the FPGA “shell.” It is always
initialized (and refreshed) regardless of whether custom user logic has instantiated
a DRAM controller to use the memory. The remaining DRAMs (A, B, and D) are
instantiated within the custom logic [5]. Each instantiation also uses the sh_ddr
module, which is encrypted and prevents users from modifying its functionality:
self-refresh of DRAM cells is always enabled whenever the DRAM controller is
instantiated. Nevertheless, Sect. 9.4.1.2 presents a novel approach through a method
that disables self-refresh, thus allowing DRAM cells to decay.
It should be noted that two key modifications are made to the cl_dram_dma
logic. First of all, the memory scrubber module mem_scrb (which erases DRAMs
when the AFI is loaded) is disabled through the macro NO_CL_TST_SCRUBBER.
This is necessary to ensure that the decay-based DRAM PUF fingerprints are not
zeroed out before they are read. And, second, the error correction logic is also turned
off by setting the ECC parameter of the ddr4_core_ddr4 module to OFF. This
ensures that the PUF response remains usable by keeping decay-based errors intact,
instead of being corrected by ECC. The ability to turn off ECC is discussed further
in the context of defense mechanisms in Sect. 9.4.5.
Figure 9.2 depicts the data collection process of the decay-based DRAM PUFs.
There are three steps in the measurement process, which use two separate AFIs:
1. The first step is to write all 1s to a fixed area within a DRAM chip. This step
uses AFI-0, which is based on the AWS example design cl_dram_dma, with
modifications to the memory scrubber and ECC, as explained above.
9 Fingerprinting and Mapping Cloud FPGA Infrastructures 247
Fig. 9.2 Steps to measure DRAM PUFs: AFI-0 is first loaded to write all 1s to a certain area of a
DRAM module. Then AFI-1 is loaded to stop memory self-refresh. Finally, after a fixed amount
of time, AFI-0 is re-loaded to measure bit flips in the written addresses
2. The second step is to wait for DRAM cells to decay by using AFI-1. This step
loads an FPGA image that stops self-refresh of DRAMs A, B, and D for the
chosen decay period, idling the FPGA. The cl_hello_world design is used
for this purpose, as it does not instantiate memory controllers. The self-refresh
logic is only disabled in this step.
3. The final step of reading returns to AFI-0 and simply reads back the DRAM
data to generate the PUF fingerprints from DRAMs A, B, and D. The memory
scrubber and ECC are disabled, so the image retrieves the previous data, with
some of its bits decayed.
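The three steps map onto a short control loop on the host. The sketch below treats AFI loading and DRAM access as injected callables; `load_afi`, `write_dram`, and `read_dram` are hypothetical hooks, not AWS API names (the real flow would drive the AWS FPGA management tools, e.g., `fpga-load-local-image`, and the `cl_dram_dma` DMA interface):

```python
import time

DECAY_SECONDS = 120          # decay period used in the chapter
PUF_BYTES = 512 * 1024       # 512 kB PUF region

def measure_dram_puf(load_afi, write_dram, read_dram,
                     decay_s=DECAY_SECONDS, puf_bytes=PUF_BYTES,
                     sleep=time.sleep):
    """Three-step decay-based DRAM PUF measurement.

    load_afi(name) loads 'afi0' (cl_dram_dma with scrubber and ECC
    disabled) or 'afi1' (cl_hello_world, no DRAM controller, so
    self-refresh stops).  write_dram/read_dram access the PUF region
    while AFI-0 is loaded.  All four callables are illustrative hooks.
    """
    load_afi("afi0")                      # step 1: write all 1s
    write_dram(b"\xff" * puf_bytes)
    load_afi("afi1")                      # step 2: refresh disabled, cells decay
    sleep(decay_s)
    load_afi("afi0")                      # step 3: read back decayed data
    readback = read_dram(puf_bytes)
    # The PUF response is the set of bit positions that flipped 1 -> 0
    return {8 * i + b for i, byte in enumerate(readback)
            for b in range(8) if not (byte >> (7 - b)) & 1}
```

The returned set of flip positions is exactly the fingerprint compared via the Jaccard index in Sect. 9.4.2.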
This section expands on the experimental setup (Sect. 9.4.2.1) and provides an
example of the DRAM PUF response (Sect. 9.4.2.2). It then details the metric used
for fingerprinting FPGA instances (Sect. 9.4.2.3) and calculates the probability of
re-renting the same FPGA (Sect. 9.4.2.4). Finally, it finishes with an investigation
of the background data center conditions (Sect. 9.4.2.5).
Experiments are performed on Amazon EC2 F1 spot instances [11], in the North
Virginia us-east-1 region. Spot instances are similar to on-demand ones but can
be terminated at a moment’s notice. As a result, they are cheaper: an on-demand
f1.16xlarge instance costs $13.20 per hour, while the same spot instance only
costs $3.96 [10], i.e., less than a third of the price.
The VM images used on the cloud servers, called Amazon Machine Images
(AMIs) [8], run CentOS 7.6.1810 and access the Xilinx Virtex UltraScale+ FPGAs
in the f1 instances. A series of spot instances, launched with the same AMIs, are
requested in order and are terminated after collecting DRAM PUF responses on
all FPGA slots of each instance. The interval between terminating one instance and
requesting the next one is five minutes. However, due to variations in how long
initialization of the FPGAs takes, there are some small differences in the collection
time of the DRAM PUFs in practice. On multi-FPGA (4x and 16x) instances, the
measurements on different FPGA slots are done in sequence, minimizing contention
errors or delays due to the shared PCIe bus.
As discussed in Sects. 9.2.2 and 9.4.1, the location of bit flips that occur after
disabling the memory scrubber, error correction, and self-refresh is related to the
manufacturing process and can fingerprint the DRAM modules attached to the
FPGAs. It can thus serve as a proxy for fingerprinting the cloud FPGA instances,
under the reasonable assumption that the same DRAM chips are always permanently
and physically connected to the same FPGA board. Figure 9.3 shows the number of
bit flips (error counts) for the four DRAMs on an FPGA board after waiting for
different decay periods. Because memory accesses themselves influence the DRAM
PUFs, all data points in Fig. 9.3 come from independent measurements. The waiting time between measurements
is two minutes, and the size of the PUF is 512 kB. Decay on DRAM C cannot be
measured, as it is reserved by the shell, but the other three DRAMs follow a similar
pattern: the longer the wait, the more pronounced the decay. However, the absolute
magnitude varies due to manufacturing variations. The decay period is chosen as
120 seconds in the following experiments.
Figure 9.4 shows an example DRAM PUF response, with each pixel in the 1024 ×
1024 grid representing four bits in the 512 kB PUF response. There is sufficient
randomness in the response to distinguish between otherwise-identical DRAMs.
To quantify how similar or different DRAM PUF responses are, we use the Jaccard
index [29]. Let F1 and F2 denote the set of bit flips in two DRAM PUF responses.
Then, the Jaccard index for the two DRAM PUF responses is defined as
J(F1, F2) = |F1 ∩ F2| / |F1 ∪ F2|.     (9.1)
As shown by Xiong et al. [69], the intra-device Jaccard index J of PUF responses
from the same DRAM chip is close to one, whereas the inter-device Jaccard index
J from different DRAMs is close to zero. This remains true for the data collected
in our work, where sixty f1.2xlarge instances are launched in series. Due to
the AWS allocation process, these instances may or may not use different FPGA
boards. As shown in Fig. 9.5, the distribution of the Jaccard indices for each pair of
PUF responses has a peak close to 0, and the rest are between 0.5 and 1 as expected.
Therefore, DRAM PUF responses that have a Jaccard index of less (resp., more)
than 0.5 are assumed to come from different (resp., the same) FPGA boards.
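Equation (9.1) and the 0.5 decision threshold translate directly into code. The sketch below takes each PUF response as a Python set of bit-flip positions:

```python
def jaccard(f1: set, f2: set) -> float:
    """Jaccard index of two DRAM PUF responses (Eq. 9.1), where each
    response is the set of bit-flip positions after the decay period."""
    if not f1 and not f2:
        return 1.0  # two empty responses are indistinguishable
    return len(f1 & f2) / len(f1 | f2)

def same_board(f1: set, f2: set, threshold: float = 0.5) -> bool:
    """Responses with J above the threshold are attributed to the same
    FPGA board; below it, to different boards."""
    return jaccard(f1, f2) > threshold
```

With the chapter's data, intra-board pairs cluster between 0.5 and 1 and inter-board pairs near 0, so the fixed 0.5 threshold cleanly separates the two cases.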
This section identifies the number of unique FPGAs when renting f1.2xlarge,
f1.4xlarge, and f1.16xlarge instances sixty times each. As these instance
types contain 1, 2, and 8 FPGA boards, respectively, DRAM PUF fingerprints are
measured on a total of 60 + 120 + 480 = 660 FPGAs (which contain repeated
FPGA boards due to re-allocation). Table 9.1 summarizes the number of unique
FPGAs seen on AWS, as indicated by the Jaccard indices of their DRAM PUFs.
The results indicate that only 10, 6, and 8 unique FPGA sets have been allocated for
each type.
Given that we observed the same FPGA multiple times, Fig. 9.6 plots the
probability of getting the same FPGA board in the North Virginia region, as a
Table 9.1 The number and type of FPGA instances rented, along with the number of unique sets
of FPGAs found and the approximate experimental cost using spot instances
Instance type   # of FPGAs   Unique FPGAs   Cost ($)
f1.2xlarge      60 × 1       10 × 1         3.47
f1.4xlarge      60 × 2        6 × 2         8.91
f1.16xlarge     60 × 8        8 × 8         83.16
function of the amount of time between requests for two instances. As DRAM
PUFs are collected in sequence, the intervals between two adjacent measurements
are nearly identical. For a given time period t, all n pairs of measurements that are
(approximately) t minutes apart are used to calculate the probability p/n of renting
a re-allocated FPGA, where p denotes the number of pairs (out of n) for which
Jaccard indices are bigger than 0.5. Please note that the number of pairs n varies for
different instance types and numbers of measurements remaining.
Although the probability appears random and hard to predict, it is non-zero for all
instance types most of the time and often close to 25–30% for 2x and 4x instances.
As a result, temporal covert channels [63] indeed seem possible: the attacker and
the victim end up on the same FPGA in consecutive time slots after about four tries
on average. For 16x instances, the probability is around 10%, requiring about ten
tries to get the same instance.
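The p/n statistic can be computed directly from a sequence of already-classified measurements. The sketch below assumes each rented instance has been reduced (via the Jaccard test) to a board identifier, and that instances are rented at near-constant intervals, so the gap in sequence position is a proxy for elapsed time:

```python
from collections import defaultdict

def reallocation_probability(boards):
    """boards[i] is the board identifier of the i-th rented instance.

    Returns {gap: p/n}: for each gap t (in rental slots), the fraction
    of instance pairs t slots apart that received the same board."""
    n, p = defaultdict(int), defaultdict(int)
    for i in range(len(boards)):
        for j in range(i + 1, len(boards)):
            gap = j - i
            n[gap] += 1
            if boards[i] == boards[j]:
                p[gap] += 1  # same FPGA re-allocated after this gap
    return {gap: p[gap] / n[gap] for gap in n}
```

Note that n shrinks as the gap grows (fewer pairs remain), which is why the tail of Fig. 9.6 is noisier than its start.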
Figure 9.7, in particular, shows the results of renting 16x instances eleven
consecutive times. As can be seen in the figure, one set of FPGAs is repeated four
times, two are repeated two times, while three are allocated once. Moreover, the
same eight FPGAs are re-allocated at once: in other words, by identifying that, e.g.,
DRAM D on slot 0 has stayed the same in INST-0 and INST-1, an adversary is able
to deduce that all eight FPGAs have stayed the same.
Fig. 9.7 Fingerprinting FPGAs on f1.16xlarge instances with 8 FPGA slots: out of 11 spot
instances, only 6 different sets of FPGAs are allocated. In the remaining instances, only 2 additional
sets were identified (Table 9.1)
We finally investigate whether one can infer patterns about the environmental
conditions of the data center in which we performed measurements. To that end,
we measure how the DRAM decay varies in a span of approximately three days. As
Fig. 9.8 reveals, the PUF behaves differently throughout the measurement period,
where the decay time of each measurement is 120 seconds. As DRAM decay varies
with temperature [68], these variations can give insights into the workloads and
operating conditions of the servers. For example, there may be a decrease in activity
at certain times in the day, allowing the data center to cool, and the DRAM PUF
to result in fewer errors. An attacker might use these insights to reason about data
center capacity and launch attacks on server availability [19, 27, 28].
In this section, we introduce an RO-based PUF design that fingerprints the FPGA
chips themselves instead of the DRAM modules attached to the FPGA boards. Our
PUF design introduces the idea of redundant ROs, which allows us to evaluate the
quality of each RO pair by pre-testing the PUF design on a smaller number of cloud
FPGA instances, and discarding “bad” RO pairs, which decrease the effectiveness
(uniqueness and reliability) of the PUF response.
Fig. 9.9 RO PUF module: the ROs RO[1]…RO[n−1], each built with a latch or flip-flop stage, are
selected through MUX-0 and MUX-1 and drive Counter-0 and Counter-1; a software-set timer
samples the counters, and each pair comparison contributes one bit of the PUF response
Our PUFs consist of 512 ROs, with each RO only used in one comparison
pair. The 256 resulting pairs then generate 256 bits for pre-testing with 40 FPGA
instances. After the pre-testing phase, 128 “good” bits (i.e., RO pairs) that generate
high entropy are chosen to be included in the final PUF responses, with the other 128
bits ignored entirely. As we show in Sect. 9.4.4 when testing with 160 instances, the
selection of high-entropy bits decreases the intra-device Hamming distance (HD)
of the PUF response, while increasing the inter-device HD. In other words, our
approach significantly improves the reliability and uniqueness of the RO PUFs by
finding the RO pairs that are stable within an FPGA but differ among FPGAs.
Figure 9.9 shows our RO PUF module, which consists of n = 512 ROs with
two multiplexers (MUXes) and two RO counters for comparing the RO pairs. The
m = n/2 = 256 RO pairs use adjacent ROs to eliminate systematic variations [39,
42], and each RO is only used once. The RO outputs drive the counters, which are
sampled on a system clock timer set by the software. To minimize noise, when an
RO pair is sampled, the other ROs are disabled. In pre-processing, m = 256 bits
are sent back each time and are used to identify the 128 good RO pairs. In later
experiments with more (and different) FPGAs, the same 128 good RO pairs that
were identified in the pre-processing stage are used to generate the PUF fingerprints.
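The pre-testing phase can be sketched as follows. The exact selection criterion is not spelled out here, so the code uses one plausible rule (keep the RO pairs whose comparison bit is closest to a 50/50 split across the pre-test instances); both functions and their signatures are illustrative:

```python
def select_good_pairs(responses, keep=128):
    """Pick high-entropy RO pair positions from pre-testing.

    responses: one bit vector per pre-test instance (e.g., 256 bits
    from each of 40 instances).  Keeps the `keep` positions whose bit
    is closest to balanced (50/50) across instances, i.e., the
    positions that differ most between FPGAs."""
    n_inst = len(responses)
    n_bits = len(responses[0])
    ones = [sum(r[i] for r in responses) for i in range(n_bits)]
    # Distance from a perfectly balanced bit; smaller means higher entropy
    ranked = sorted(range(n_bits), key=lambda i: abs(ones[i] - n_inst / 2))
    return sorted(ranked[:keep])

def fingerprint(response, good_bits):
    """Final PUF response: the raw response restricted to the good
    positions identified during pre-testing (the rest are ignored,
    not removed from the design)."""
    return [response[i] for i in good_bits]
```

Keeping the discarded ROs physically in place while ignoring their bits matches the PUF-B setup evaluated below; PUF-C shows why removing them instead changes the behavior of the remaining pairs.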
To evaluate the effect of temperature on the quality of the proposed RO PUFs, we
further add an RO Sensor and RO Heaters module to our design. This module allows
us to increase and observe the FPGA temperature, collecting PUF fingerprints at
different thermal states of the fabric, proving the stability of our design across
environmental conditions that might arise in a data center.
Fig. 9.10 Ignoring low-entropy bits (PUF-B) improves Uniqueness and maintains Reliability
compared to the baseline PUF-A. Temperature increases due to the heaters do not affect the
responses
Our PUF generates 128-bit fingerprints out of 256 ROs pairs. We validate our design
by calculating the Uniqueness (the average inter-device HD over responses from
different FPGAs) and the Reliability (the average intra-device HD over repeated
measurements from the same FPGA). Ideal PUFs have Uniqueness values of 0.5, as
their responses behave randomly, and Reliability values close to 0, as few bits differ
when re-querying the PUF.
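Both metrics are averages of pairwise Hamming distances; a minimal sketch, normalized by response length so an ideal PUF scores 0.5 Uniqueness and 0.0 Reliability:

```python
from itertools import combinations

def hamming(a, b):
    """Hamming distance between two equal-length bit vectors."""
    return sum(x != y for x, y in zip(a, b))

def uniqueness(responses):
    """Average normalized inter-device HD over one response per FPGA."""
    pairs = list(combinations(responses, 2))
    length = len(responses[0])
    return sum(hamming(a, b) for a, b in pairs) / (len(pairs) * length)

def reliability(repeats):
    """Average normalized intra-device HD over repeated measurements of
    the same FPGA (0 for a perfectly stable PUF)."""
    pairs = list(combinations(repeats, 2))
    length = len(repeats[0])
    return sum(hamming(a, b) for a, b in pairs) / (len(pairs) * length)
```

These are the quantities plotted in Figs. 9.10 and 9.11 for the four PUF setups.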
Improving Uniqueness While Maintaining Reliability Previous RO PUF designs
usually compare m = n/2 RO pairs out of n ROs and output n/2 bits, but our design
eliminates low-entropy bits and instead produces n/4 = 128 bits. To evaluate the
quality of our PUF design, we compare the Uniqueness and the Reliability of the
baseline implementation (PUF-A), which uses all n/2 = 256 RO pair comparisons,
to our improved 128-bit PUF, PUF-B. Figure 9.10 shows that the Uniqueness of
PUF-B increases to ≈0.25 from ≈0.13 in PUF-A, while maintaining the Reliability
at almost the same value. In addition, Fig. 9.10 shows that the RO heaters do not
influence the Uniqueness and Reliability much, indicating that the RO PUFs are
stable at different temperatures.
Choosing Low-Entropy Bits To show the benefits of the novel idea to ignore but
not remove low-entropy RO pairs, we implement two additional types of PUFs:
PUF-C, which physically removes the “bad” ROs from the floorplan, and PUF-D,
which instead ignores randomly selected RO pairs. Table 9.2 contains a summary of
the four PUF designs. PUFs A, B, D utilize 669 slices, while PUF-C uses 359 slices.
Figure 9.11a shows that although the Reliability remains almost the same for
all four PUF designs, the Uniqueness of PUF-B is much higher than that of the
remaining three PUFs. In particular, PUF-C suggests that re-routed logic will still
influence the entropy of the remaining RO pairs, i.e., the “good” bits from PUF-B
no longer remain good in PUF-C.
Fig. 9.11 Uniqueness and reliability for the four setups of Table 9.2. (a) The best uniqueness
is achieved when ignoring but not removing low-entropy bits (PUF-B). (b) Uniqueness remains
constant when testing on additional FPGAs that were not used in pre-testing
Fig. 9.12 PUF-B Hamming distances (a) intra- and inter-device, and (b) with or without the RO
heaters enabled
Moreover, the locations of “good” bits remain stable across many FPGAs.
Although 40 FPGAs were used in pre-testing for PUF-B, Fig. 9.11b shows that the
Uniqueness of PUF-B stays stable when applying the PUF with the same “good”
bits to 160 different instances.
PUF Responses for Different FPGAs By storing PUF responses in a database,
users are able to infer (based on the HD) whether a given FPGA has been used
before or not: as Fig. 9.12 shows, the intra-device and inter-device HDs of N = 40
new F1 FPGA instances (measured 20 times) are clearly separated. Even if the RO
heaters are turned on, the intra-device HD stays almost the same, and always under
10. By contrast, inter-device HD ranges from about 20–50, so a threshold of 15 can
separate PUF responses of the same device from those of different devices.
Fig. 9.13 The Uniqueness and Reliability of PUF-B using two types of ROs at three FPGA
locations
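The database lookup described above reduces to a nearest-neighbor search with the HD threshold of 15. A sketch, where the database layout and device identifiers are illustrative:

```python
def match_fingerprint(response, database, threshold=15):
    """Look up a 128-bit RO PUF response in a database of known FPGAs.

    database maps a device id to its stored response.  Returns the id
    of the device whose stored response is within the HD threshold
    (15 in the chapter's data), or None for a previously unseen FPGA."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    best = min(database.items(),
               key=lambda kv: hamming(response, kv[1]),
               default=None)
    if best is not None and hamming(response, best[1]) <= threshold:
        return best[0]
    return None
```

Because intra-device HD stays under 10 and inter-device HD stays above roughly 20, any threshold in between works; 15 leaves a margin on both sides.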
at a cost of energy usage for designs that do not need it. Moreover, ECC is not
guaranteed to entirely prevent our fingerprinting: for example, researchers have shown
that DRAM-based attacks are possible even with error correction enabled [15].
Furthermore, introducing randomness at different layers of abstraction can raise
the bar for adversaries. Currently, our work can identify all eight FPGAs in an
f1.16xlarge instance by measuring the PUF behavior on a single DRAM
module (e.g., DRAM D) on one FPGA. However, software can randomize the order
of FPGAs within an instance as they appear to the user, or the way the DRAM
modules are presented to the FPGA. Moreover, DRAM address scrambling in the
memory controller can prevent the DRAM PUF from operating.
For the RO-based PUFs, fingerprinting relies on the existence of combinatorial
loops that can bypass AWS’s DRCs. Thus, a stricter and more sophisticated set of
DRCs could prevent such a fingerprinting method. For example, researchers have
proposed an antivirus scan of FPGA bitstreams [32].
Although these approaches make power-based attacks harder, they cannot elimi-
nate temporal thermal channels (e.g., [63]). As such attacks only exploit temperature
effects, a mandatory cool-down period before re-assigning FPGAs can prevent
covert channels, even if adversaries successfully fingerprint the devices.
In this section, we present the setup (Sect. 9.5.1) and evaluation (Sect. 9.5.2) of cloud
FPGA cartography using PCIe contention.
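Conceptually, the cartography procedure pairs one tester slot with one stressor slot at a time and marks the pair as co-located when the tester’s PCIe bandwidth drops below a threshold. The Python sketch below simulates that loop; the `measure_bandwidth` stub and all bandwidth numbers are illustrative stand-ins for real DMA transfer measurements on F1 instances:

```python
# Ground truth used only by the simulated measurement: slot -> NUMA node.
NUMA_NODE = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B", 7: "B"}

BASELINE_MBPS = 2500.0   # unloaded PCIe bandwidth (illustrative)
CONTENDED_MBPS = 900.0   # bandwidth under contention (illustrative)
THRESHOLD_MBPS = 1500.0  # co-location decision threshold (illustrative)

def measure_bandwidth(tester: int, stressor: int) -> float:
    """Stand-in for a real PCIe bandwidth measurement: the tester's
    bandwidth collapses only when a distinct stressor shares its NUMA node."""
    if tester != stressor and NUMA_NODE[tester] == NUMA_NODE[stressor]:
        return CONTENDED_MBPS
    return BASELINE_MBPS

def build_contention_map(slots):
    """Test every (tester, stressor) pair and group slots that contend."""
    groups: list[set[int]] = []
    for t in slots:
        group = {t} | {s for s in slots
                       if measure_bandwidth(t, s) < THRESHOLD_MBPS}
        if group not in groups:
            groups.append(group)
    return groups

print(build_contention_map(range(8)))  # two groups of four slots each
```

The same grouping logic applies unchanged when the measurement function is replaced by actual transfers between stressor and tester designs.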
Fig. 9.14 Diagram of the deduced AWS server configuration, with 8 FPGAs sharing the same
server across two NUMA nodes
9.5.2 Evaluation
Fig. 9.15 PCIe contention between f1.16xlarge instances. Co-location is determined by
observing that the PCIe bandwidth is reduced below a target threshold. As can be seen,
contention exists among groups of exactly 4 consecutive slots within a single VM instance.
In this figure, each f1.16xlarge instance consists of slots 0–7
To evaluate whether our observations also hold for f1.4xlarge instances, we rent
20 f1.4xlarge spot instances in us-east-1, for a total of 40 × 40 = 1,600
Fig. 9.16 Cross-VM contention between f1.2xlarge instances in the us-east-1 region: (a)
presents the instances in the order in which they are launched, while (b) re-orders them to more
clearly show pairs with PCIe contention. In this figure, each f1.2xlarge instance has one slot
stressor and tester pairs, repeating measurements three times. The results, which are
shown in Fig. 9.17, indicate that contention is still possible both within and between
different f1.4xlarge instances: we find 7 pairs of distinct instances that form
complete NUMA nodes with 4 FPGAs. We again notice that co-located instances
tend to not be fully consecutive but are instead interspersed with other instances.
However, unlike the results of Sect. 9.5.2.2, where the lone FPGA was rented near
the end of the 20 instances, all six lone f1.4xlarge instances are among the
first 10 instances launched, with the first five corresponding to instances 1–5. In
other words, in our experiments, it was almost always possible to find contention
in f1.2xlarge instances provided enough instances were launched after them.
However, the first few f1.4xlarge instances were not co-located with any other
instances (possibly because other users had already rented their counterparts), while
later VMs were more likely to be co-located in the same server.
As the previous experiments were conducted with spot instances in the us-east-1
region, we perform measurements with spot instances in the us-west-2
(Fig. 9.18) and eu-west-1 (Fig. 9.19) regions, as well as with on-demand
instances in the ap-southeast-2 (Fig. 9.20) and us-east-1 (Fig. 9.21)
regions, with experiments repeated once with 20 and 10 FPGAs, respectively.
It should be noted that we could not find availability for spot instances in the
ap-southeast-2 region. In addition, the pre-synthesized CL_DRAM_DMA AFI
Fig. 9.17 Cross-VM contention between f1.4xlarge instances with 2 FPGAs each in the
us-east-1 region. In this test, 14 of the 20 launched instances are co-located in 7 NUMA nodes,
while the remaining 6 are not co-located with any of the other instances. Recall that f1.4xlarge
VMs contain two FPGAs, so the two slots within an instance always interfere with each other,
explaining why there are always at least two red squares per row and column in the figure. In this
figure, each f1.4xlarge instance consists of slots 0–1
Fig. 9.18 Example test results for cross-VM contention between f1.2xlarge spot instances in
the us-west-2 region
was not available in this region, so we synthesized it using its publicly available
source code.
The results are broadly similar to the experiments of Sect. 9.5.2.2, with only 10%
(6/60) of instances not resulting in cross-VM contention. Indeed, seven complete
NUMA localities are identified, along with six groups of three FPGAs and four
pairs of two VMs. It is further interesting to note that, in our experiments, groups in
Fig. 9.19 Example test results for cross-VM contention between f1.2xlarge spot instances in
the eu-west-1 region
Having uncovered that up to four f1.2xlarge instances can interfere with each
other, we further analyze the probability of f1.2xlarge instances being
co-located on the same server. Specifically, we use the data gathered in previous
sections to calculate the probability P_K that, given a tester FPGA, launching K
stressor instances will result in at least one of the K new FPGAs being placed in the
same NUMA node as the tester FPGA.
Let N_K be the number of VMs for which there is PCIe contention with any
of the next K instances launched. Moreover, let M_K be the number of VMs that
have fewer than K instances launched after them and that are not the last in a
full NUMA node detected within the experiment. The second constraint is needed
because, although for groups of up to 3 FPGAs the remaining FPGAs might be
found by renting more instances, once a NUMA node has been fully detected, no
additional VM can correspond to the same locality, no matter how many instances
are launched. Denoting by T the total number of VMs launched, the desired
probability is

P_K = N_K / (T − M_K) (9.2)
As an example using the on-demand instances of Fig. 9.21, T = 10, and for
K = 3, M_K = 3 (instances 8–10 have ≤ 2 VMs launched after them), while N_K =
3 (instances 1, 4, and 7 are in the same NUMA nodes as instances 3, 5, and 10,
respectively), so P_3 = 3/(10 − 3) ≈ 43%.
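The worked example maps directly onto Eq. (9.2); as a quick numerical check in Python:

```python
def colocation_probability(n_k: int, t: int, m_k: int) -> float:
    """P_K = N_K / (T - M_K), as in Eq. (9.2)."""
    return n_k / (t - m_k)

# On-demand us-east-1 example: T = 10 instances, K = 3,
# M_K = 3 (instances 8-10 have <= 2 VMs launched after them),
# N_K = 3 (instances 1, 4, 7 share NUMA nodes with 3, 5, 10).
p3 = colocation_probability(n_k=3, t=10, m_k=3)
print(f"{p3:.0%}")  # → 43%
```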
Fig. 9.22 Probability P_K of finding another f1.2xlarge instance in the same NUMA node
(overall and for individual data centers), as a function of the number of FPGA instances launched, K
Figure 9.22 calculates P_K for 1 ≤ K ≤ 10, both for individual regions and over
all regions tested. Depending on the region, the probability that two consecutive
VMs are co-located (i.e., K = 1) ranges between 38% and 58% (with the exception
of ap-southeast-2), while renting just one more instance can increase this
probability by approximately 10 percentage points (or 55 points for the Sydney
region). Renting even more instances increases this probability further, to about 80%
for K = 10, in part due to the smaller number of instances that have at least K
FPGAs launched after them (i.e., a larger M_K).
Note that for sufficiently large T and K, we expect P_K = 75%, as there is a 1 in
4 chance that the tester instance is the last FPGA in its NUMA node. However, for
smaller T and K, P_K can be larger (as in Fig. 9.22), since Amazon does not always
fully pack consecutive instances within a single server: as the previous sections
showed, co-located instances are often launched further apart in time.
In this section, we use the DRAM PUFs of Sect. 9.4 to fingerprint individual FPGAs
and detect overlaps between experiments repeated on different days. Specifically, we
make additional measurements from the us-east-1 region and compare the PUF
fingerprints between spot and on-demand f1.2xlarge instances in availability
zone c and spot f1.2xlarge, f1.4xlarge, and f1.16xlarge instances in
zone e, all collected on different days.
We reach two main conclusions. First, there is an overlap between spot and on-
demand VMs, and, second, there is overlap between all three instance types. For
Fig. 9.23 DRAM PUF fingerprint extracts for the same FPGA between (a) f1.2xlarge and
(b) f1.16xlarge instances. The two PUFs and their (c) bitwise AND are almost identical.
(a) PUF A (2x) (b) PUF B (16x) (c) PUF A AND B
Fig. 9.24 DRAM PUF fingerprint extracts for different FPGAs between (a) f1.2xlarge and
(b) f1.16xlarge instances. The two PUFs are distinct, and their (c) bitwise AND is empty.
(a) PUF A (2x) (b) PUF B (16x) (c) PUF A AND B
example, instances 7 and 17 of Fig. 9.16 overlap with slots 0 and 1 of instance 3 in
Fig. 9.15. Figure 9.23 presents extracts from the PUF fingerprints for one of the two
pairs of overlapping FPGAs. The FPGAs in the f1.2xlarge and f1.16xlarge
instances had 419 and 461 DRAM bit flips, respectively, of which 419 bit flip
locations (DRAM addresses) were identical. This allows us to conclude that the two
instances correspond to the same underlying FPGA hardware. By contrast, Fig. 9.24
shows the same f1.2xlarge instance along with a different FPGA slot of the
f1.16xlarge instance. This FPGA has a fingerprint with 483 bit flips, of which
none are in the same locations as those in the f1.2xlarge instance. As a result,
these two FPGAs are distinct, as expected.
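Deciding whether two instances map to the same physical FPGA thus amounts to intersecting their sets of bit-flip locations (the bitwise AND of the fingerprint images). A small Python sketch, with invented DRAM addresses standing in for measured flip locations:

```python
def overlap(flips_a: set[int], flips_b: set[int]) -> int:
    """Number of bit-flip locations (DRAM addresses) shared by two decay
    fingerprints -- the set-based analogue of ANDing the PUF images."""
    return len(flips_a & flips_b)

# Toy fingerprints: the first two share almost all flip locations,
# the third shares none (all addresses are illustrative).
fp_2x = {0x1000, 0x1044, 0x20F8, 0x3B10}
fp_16x = {0x1000, 0x1044, 0x20F8, 0x3B10, 0x4C20}
fp_other = {0x0500, 0x0ABC, 0x7FFF}

print(overlap(fp_2x, fp_16x))    # many shared flips -> same FPGA
print(overlap(fp_2x, fp_other))  # no shared flips -> different FPGAs
```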
In addition, we found an overlap between instances 1, 2, and 9 of Fig. 9.16
and another set of 10 f1.4xlarge instances rented. Consequently, not only
is there the potential for cross-VM contention between identical spot instance
types, but also between f1.2xlarge and f1.4xlarge spot and on-demand
instances (but not f1.16xlarge ones, as they reserve all FPGAs within the
server). It should be noted that there was no overlap between different instance
types in the experiments of Sect. 9.4.2, likely because AWS often re-used the same
instances, resulting in only 10 unique f1.2xlarge, 6 unique f1.4xlarge,
and 8 unique f1.16xlarge instances, for a total of 86 FPGAs. By contrast, the
experiments of this section found an overlap of just two FPGAs in a pool of 20
f1.2xlarge and 5 f1.16xlarge instances, three FPGAs between the same set
of 20 f1.2xlarge and an additional 10 f1.4xlarge instances, and no overlap
between the ten f1.4xlarge and five f1.16xlarge instances, for a total of
20 · 1 + 10 · 2 + 5 · 8 − 2 − 3 = 75 unique FPGAs. This suggests that it is rare
(but not impossible) for FPGAs to be repurposed across instance types, likely primarily
in cases of unmet demand for smaller F1 types.
Fig. 9.25 Median tester bandwidth for different numbers of enabled stressors and transfer sizes
(bandwidth averaged over 100 transfers)
In this section, we summarize prior work in FPGA security (Sect. 9.6.1) and cloud-
related attacks (Sect. 9.6.2).
In recent years, besides attacks on the FPGA bitstream itself, e.g., [17], there has
been extensive research on FPGA security without physical access to the underlying
hardware, with covert-channel, side-channel, and fault attacks predominantly using
voltage or temperature to affect the FPGA chips [30, 63]. Many such works,
e.g., [20, 31, 46, 70], focus on attacks between different users of the FPGA and
are therefore not directly applicable to single-tenant clouds.
Although most attacks have been performed in lab environments, a covert-
channel attack between separate dies (“Super Logic Regions”) was shown to be
possible on AWS and Huawei cloud [22], and a side-channel attack on AWS by
Glamocanin et al. soon followed [25]. The former depends on alternative ring
oscillator designs that bypass AWS restrictions [21, 59], while the latter uses
time-to-digital converters (TDCs), both of which could be detected by additional
DRCs [32]. By contrast, our research does not focus on the FPGA chip, but on
the cloud FPGA infrastructure, and in particular the unique aspects of the DRAM
modules as well as the shared PCIe bus used by different FPGA boards within each
server.
Richter et al. showed that when virtualizing PCIe network interface cards (NICs)
using single root I/O virtualization (SR-IOV), it is feasible for one VM to cause
congestion on the NIC ingress buffers [49]. As a possible defense, Richter et al.
recommend quality-of-service (QoS) extensions and different scheduling algorithms
to ensure that flooding a virtual function (VF) in one physical function (PF) cannot
cause performance degradation in a different PF [50].
Our work further advances research in similar areas, as we have shown how
to determine co-location and aspects of the scheduling algorithm using PCIe
contention in FPGA-accelerated clouds. Building on this work, natural extensions
include attacks that intentionally degrade PCIe bandwidth, as well as further study of
how such interference affects our mapping technique or disrupts other users.
9.7 Conclusion
This chapter focused on how to fingerprint cloud FPGAs and map the cloud FPGA
infrastructure itself, without damaging it.
We first introduced a novel algorithm for fingerprinting cloud FPGAs through
decay-based DRAM PUFs. Because it is not possible for users to directly control the
memory self-refresh parameters, we made use of a feature that enables data sharing
between different AFIs: by using two AFIs, one that disables memory scrubbing
and ECC, and one that does not instantiate memory controllers at all, we were able
to observe how DRAM decays, without violating any restrictions placed by AWS.
In addition, we described the design and evaluation of RO PUFs that can bypass
the AWS design rule checks on combinatorial loops. The resulting PUFs produced
unique and stable fingerprints of the FPGAs in AWS F1 instances.
This chapter further identified PCIe contention as a means of mapping cloud
FPGA infrastructures. We showed that it is possible to reverse-engineer the NUMA
locality of different FPGAs within an AWS server and find which f1.2xlarge
and f1.4xlarge instances are co-located within the same server. We also found
that f1.2xlarge and f1.4xlarge instance types can be scheduled on the same
AWS server, and we deduced that the probability of successive users renting FPGAs
within the same server is high.
Overall, this chapter highlighted the dangers of direct accesses to hardware
resources and a need for a more holistic approach to FPGA security that not
only considers the FPGA chips themselves, but also the security of other system
components accessible to the logic running on the FPGA.
References
1. Alibaba Cloud (2023). Elastic Compute Service: Instance Type Families. https://www.
alibabacloud.com/help/en/elastic-compute-service/latest/instance-family#f3. Accessed May
1, 2023.
2. Amazon Web Services (2016). Developer preview – EC2 instances (F1) with pro-
grammable hardware. https://aws.amazon.com/blogs/aws/developer-preview-ec2-instances-
f1-with-programmable-hardware/. Accessed May 1, 2023.
3. Amazon Web Services (2018). The agility of F1: Accelerate your applications with custom
compute power. https://d1.awsstatic.com/Amazon_EC2_F1_Infographic.pdf. Accessed May
1, 2023.
4. Amazon Web Services (2021). AWS EC2 FPGA HDK+SDK errata. https://github.com/aws/
aws-fpga/blob/master/ERRATA.md. Accessed May 1, 2023.
5. Amazon Web Services (2021). AWS shell interface specification. https://github.com/aws/aws-
fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md. Accessed May 1, 2023.
6. Amazon Web Services (2021). CL_DRAM_DMA custom logic example. https://github.com/
aws/aws-fpga/tree/master/hdk/cl/examples/cl_dram_dma. Accessed May 1, 2023.
7. Amazon Web Services (2021). F1 FPGA application note: How to use the PCIe peer-
2-peer version 1.0. https://github.com/awslabs/aws-fpga-app-notes/tree/master/Using-PCIe-
Peer2Peer. Accessed May 1, 2023.
8. Amazon Web Services (2022). Amazon machine images (AMI). https://github.com/awsdocs/
amazon-ec2-user-guide/blob/master/doc_source/AMIs.md. Accessed May 1, 2023.
9. Amazon Web Services (2023). Amazon EC2 instance types. https://aws.amazon.com/ec2/
instance-types/. Accessed May 1, 2023.
10. Amazon Web Services (2023). Amazon EC2 spot instances pricing. https://aws.amazon.com/
ec2/spot/pricing/. Accessed May 1, 2023.
11. Amazon Web Services (2023). AWS EC2 spot instances. https://aws.amazon.com/ec2/spot/.
Accessed May 1, 2023.
12. Baidu Cloud (2023). FPGA cloud compute. https://cloud.baidu.com/product/fpga.html.
Accessed May 1, 2023.
13. Baker, G., & Lupo, C. (2017). TARUC: A topology-aware resource usability and contention
benchmark. In ACM/SPEC International Conference on Performance Engineering (ICPE).
14. Chiang, R. C., Rajasekaran, S., Zhang, N., & Huang, H. H. (2015). Swiper: Exploiting
virtual machine vulnerability in third-party clouds with competition for I/O resources. IEEE
Transactions on Parallel and Distributed Systems (TPDS), 26(6), 1732–1742.
15. Cojocar, L., Razavi, K., Giuffrida, C., & Bos, H. (2019). Exploiting correcting codes: On the
effectiveness of ECC memory against Rowhammer attacks. In IEEE Symposium on Security
and Privacy (S&P).
16. Danalis, A., Marin, G., McCurdy, C., Meredith, J. S., Roth, P. C., Spafford, K., Tipparaju, V.,
& Vetter, J. S. (2010). The scalable heterogeneous computing (SHOC) benchmark suite. In
Workshop on General-Purpose Processing on Graphics Processing Units (GPGPU).
17. Ender, M., Moradi, A., & Paar, C. (2020). The unpatchable silicon: A full break of the bitstream
encryption of Xilinx 7-Series FPGAs. In USENIX Security Symposium.
18. Faraji, I., Mirsadeghi, S. H., & Afsahi, A. (2016). Topology-aware GPU selection on multi-
GPU nodes. In IEEE International Parallel and Distributed Processing Symposium Workshops
(IPDPSW).
19. Gao, X., Xu, Z., Wang, H., Li, L., & Wang, X. (2018). Reduced cooling redundancy: A
new security vulnerability in a hot data center. In Network and Distributed Systems Security
Symposium (NDSS).
20. Giechaskiel, I., Rasmussen, K. B., & Eguro, K. (2018). Leaky wires: Information leakage and
covert communication between FPGA long wires. In ACM Asia Conference on Computer and
Communications Security (ASIACCS).
21. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Measuring long wire leakage with ring
oscillators in cloud FPGAs. In International Conference on Field Programmable Logic and
Applications (FPL).
22. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Reading between the dies: Cross-SLR
covert channels on multi-tenant cloud FPGAs. In IEEE International Conference on Computer
Design (ICCD).
23. Giechaskiel, I., Tian, S., & Szefer, J. (2021). Cross-VM information leaks in FPGA-accelerated
cloud environments. In IEEE International Symposium on Hardware Oriented Security and
Trust (HOST).
24. Giechaskiel, I., Tian, S., & Szefer, J. (2022). Cross-VM covert- and side-channel attacks in
cloud FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 16(1),
1–29.
25. Glamočanin, O., Coulon, L., Regazzoni, F., & Stojilović, M. (2020). Are cloud FPGAs really
vulnerable to power analysis attacks? In Design, Automation & Test in Europe Conference &
Exhibition (DATE).
26. Hajimiri, A., Limotyrakis, S., & Lee, T. H. (1999). Jitter and phase noise in ring oscillators.
IEEE Journal of Solid-State Circuits (JSSC), 34(6), 790–804.
27. Islam, M. A., & Ren, S. (2018). Ohm’s law in data centers: A voltage side channel for timing
power attacks. In ACM Conference on Computer and Communications Security (CCS).
28. Islam, M. A., Ren, S., & Wierman, A. (2017). Exploiting a thermal side channel for power
attacks in multi-tenant data centers. In ACM Conference on Computer and Communications
Security (CCS).
29. Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et
du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.
30. Jin, C., Gohil, V., Karri, R., & Rajendran, J. (2020). Security of cloud FPGAs: A survey. https://
arxiv.org/abs/2005.04867. Accessed May 1, 2023.
31. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault
attacks on shared FPGAs, suitable for DFA on AES. Transactions on Cryptographic Hardware
and Embedded Systems (TCHES), 2018(3), 44–68.
32. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 13(3), 1–31.
33. Lameter, C. (2013). NUMA (non-uniform memory access): An overview. ACM Queue, 11(7),
40–51.
34. Li, C., Sun, Y., Jin, L., Xu, L., Cao, Z., Fan, P., et al. (2019). Priority-based PCIe scheduling for
multi-tenant multi-GPU systems. IEEE Computer Architecture Letters (LCA), 18(2), 157–160.
35. Linux man page: hwloc(7) (2023). https://linux.die.net/man/7/hwloc. Accessed May 1,
2023.
36. Linux man page: lscpu(1) (2023). https://linux.die.net/man/1/lscpu. Accessed May 1, 2023.
37. Linux man page: lstopo(1) (2023). https://linux.die.net/man/1/lstopo. Accessed May 1,
2023.
38. Lutz, T., Fensch, C., & Cole, M. (2013). PARTANS: An autotuning framework for stencil com-
putation on multi-GPU systems. ACM Transactions on Architecture and Code Optimization
(TACO), 9(4), 1–24.
39. Maiti, A., & Schaumont, P. (2011). Improved ring oscillator PUF: An FPGA-friendly secure
primitive. Journal of Cryptology, 24(2), 375–397.
40. Martinasso, M., Kwasniewski, G., Alam, S. R., Schulthess, T. C., & Torsten, H. (2016).
A PCIe congestion-aware performance model for densely populated accelerator servers. In
International Conference for High Performance Computing, Networking, Storage and Analysis
(SC).
41. McCurdy, C., & Vetter, J. (2010). Memphis: Finding and fixing NUMA-related performance
problems on multi-core platforms. In IEEE International Symposium on Performance Analysis
of Systems & Software (ISPASS).
42. Merli, D., Stumpf, F., & Eckert, C. (2010). Improving the quality of ring oscillator PUFs on
FPGAs. In Workshop on Embedded Systems Security (WESS).
43. Microsoft Research (2017). Microsoft unveils Project Brainwave for real-time AI. https://www.
microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/. Accessed May 1,
2023.
44. Molka, D., Hackenberg, D., Schöne, R., & Müller, M. S. (2009). Memory performance
and cache coherency effects on an Intel Nehalem multiprocessor system. In International
Conference on Parallel Architectures and Compilation Techniques (PACT).
45. Neugebauer, R., Antichi, G., Zazo, J. F., Audzevich, Y., López-Buedo, S., & Moore, A. W.
(2018). Understanding PCIe performance for end host networking. In ACM Special Interest
Group on Data Communication (SIGCOMM).
46. Provelengios, G., Ramesh, C., Patil, S. B., Eguro, K., Tessier, R., & Holcomb, D. (2019).
Characterization of long wire data leakage in deep submicron FPGAs. In ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays (FPGA).
47. Pu, X., Liu, L., Mei, Y., Sivathanu, S., Koh, Y., & Pu, C. (2010). Understanding performance
interference of I/O workload in virtualized cloud environments. In IEEE International
Conference on Cloud Computing (CLOUD).
48. Rahmati, A., Hicks, M., Holcomb, D. E., & Fu, K. (2015). Probable cause: The deanonymizing
effects of approximate DRAM. In Annual International Symposium on Computer Architecture
(ISCA).
49. Richter, A., Herber, C., Wild, T., & Herkersdorf, A. (2015). Denial-of-service attacks on PCI
passthrough devices: Demonstrating the impact on network- and storage-I/O performance.
Journal of Systems Architecture, 61(10), 592–599.
50. Richter, A., Herber, C., Wild, T., & Herkersdorf, A. (2016). Resolving performance interfer-
ence in SR-IOV setups with PCIe Quality-of-Service extensions. In Euromicro Conference on
Digital System Design (DSD).
51. Ristenpart, T., Tromer, E., Shacham, H., & Savage, S. (2009). Hey, you, get off of my
cloud: Exploring information leakage in third-party compute clouds. In ACM Conference on
Computer and Communications Security (CCS).
52. Rosenblatt, S., Chellappa, S., Cestero, A., Robson, N., Kirihata, T., & Iyer, S. S. (2013). A
self-authenticating chip architecture using an intrinsic fingerprint of embedded DRAM. IEEE
Journal of Solid-State Circuits (JSSC), 48(11), 2934–2943.
53. Rosenblatt, S., Fainstein, D., Cestero, A., Safran, J., Robson, N., Kirihata, T., & Iyer, S. S.
(2013). Field tolerant dynamic intrinsic chip ID using 32 nm high-K/metal gate SOI embedded
DRAM. IEEE Journal of Solid-State Circuits (JSSC), 48(4), 940–947.
54. Schaa, D., & Kaeli, D. (2009). Exploring the multiple-GPU design space. In IEEE Interna-
tional Parallel and Distributed Processing Symposium Workshops (IPDPSW).
55. Schaller, A., Xiong, W., Anagnostopoulos, N. A., Saleem, M. U., Gabmeyer, S., Katzenbeisser,
S., & Szefer, J. (2017). Intrinsic Rowhammer PUFs: Leveraging the Rowhammer effect for
improved security. In IEEE International Symposium on Hardware Oriented Security and Trust
(HOST).
56. Schaller, A., Xiong, W., Anagnostopoulos, N. A., Saleem, M. U., Gabmeyer, S., Skoric,
B., et al. (2018). Decay-based DRAM PUFs in commodity devices. IEEE Transactions on
Dependable and Secure Computing (TDSC), 16(3), 462–475.
57. Solomon, R. (2014). PCI express basics & background. https://pcisig.com/sites/default/files/
files/PCI_Express_Basics_Background.pdf. Accessed May 1, 2023.
58. Spafford, K., Meredith, J. S., & Vetter, J. S. (2011). Quantifying NUMA and contention effects
in multi-GPU systems. In Workshop on General-Purpose Processing on Graphics Processing
Units (GPGPU).
59. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters, 55(11),
640–642.
60. Sutar, S., Raha, A., & Raghunathan, V. (2016). D-PUF: An intrinsically reconfigurable
DRAM PUF for device authentication in embedded systems. In International Conference on
Compliers, Architectures, and Synthesis of Embedded Systems (CASES).
61. Tencent Cloud (2023). FPGA cloud server. https://cloud.tencent.com/product/fpga. Accessed
May 1, 2023.
62. Texas Advanced Computing Center (2015) TACC to launch new Catapult system to researchers
worldwide. https://web.archive.org/web/20211215201222/https://www.tacc.utexas.edu/-/tacc-
to-launch-new-catapult-system-to-researchers-worldwide. Accessed May 1, 2023.
63. Tian, S., & Szefer, J. (2019). Temporal thermal covert channels in cloud FPGAs. In
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
64. Wang, X., Niu, Y., Liu, F., & Xu, Z. (2022). When FPGA meets cloud: A first look at
performance. IEEE Transactions on Cloud Computing (TCC), 10(2), 1344–1357.
65. Xilinx, Inc. (2017). Xilinx powers Huawei FPGA accelerated cloud server. https://web.archive.
org/web/20220616002445/https://www.xilinx.com/news/press/2017/xilinx-powers-huawei-
fpga-accelerated-cloud-server.html. Accessed May 1, 2023.
66. Xilinx, Inc. (2022). UltraScale architecture configuration: User guide (UG570). https://www.
xilinx.com/support/documentation/user_guides/ug570-ultrascale-configuration.pdf. Accessed
May 1, 2023.
67. Xilinx, Inc. (2023). UltraScale+ FPGAs: Product tables and product selection guides. https://
www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-
selection-guide.pdf. Accessed May 1, 2023.
68. Xiong, W., Anagnostopoulos, N. A., Schaller, A., Katzenbeisser, S., & Szefer, J. (2019). Spying
on temperature using DRAM. In Design, Automation, and Test in Europe (DATE).
69. Xiong, W., Schaller, A., Anagnostopoulos, N. A., Saleem, M. U., Gabmeyer, S., Katzenbeisser,
S., & Szefer, J. (2016). Run-time accessible DRAM PUFs in commodity devices. In Interna-
tional Conference on Cryptographic Hardware and Embedded Systems (CHES).
70. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In IEEE
Symposium on Security and Privacy (S&P).
Chapter 10
Countermeasures Against Voltage
Attacks in Multi-tenant FPGAs
10.1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 273
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_10
274 S. Moini et al.
Section 10.2 provides an overview of the different threats that arise in a multi-tenant
FPGA environment, the types of remote attacks that an adversary can mount, and
potential countermeasures to mitigate these attacks.
10.2.1 Threats
In a multi-tenant cloud FPGA, the resources in the device are divided into multiple
physically isolated sections, each with its own share of the logic resources and
I/O [4]. At runtime, each tenant independently accesses their assigned section to
implement a hardware design and perform computation. The adjacent placement of
potential adversaries on the FPGA introduces security threats. Commercial FPGAs
contain a single PDN that is shared by all tenants’ logic. Voltage fluctuations caused
by hardware activity in one device region influence the supply voltage in other
sections of the FPGA via the PDN. The shared power distribution network provides
an opportunity for a malicious adversary tenant to attack one or more victim tenants.
The adversary tenant can eavesdrop on the hardware activity of the victim tenant
in a side-channel attack [50]. The adversary tenant can also maliciously affect the
voltage on the shared PDN to cause faults in the victim tenant’s design [19, 30, 35].
Other types of attacks involve signal coupling and thermal monitoring. Previous
work [7, 34, 36] has shown that a measurable electrical coupling exists between
10 Countermeasures Against Voltage Attacks in Multi-tenant FPGAs 275
neighboring long wires in Xilinx-AMD and Intel FPGAs. An adversary tenant can
use a wire that is adjacent to a victim tenant wire to extract information. Thermal
coupling is also a potential security threat that can be used by adversary tenants for
data transmission in a cloud FPGA [47].
There are two general categories of remote attacks in a multi-tenant FPGA scenario:
passive and active attacks. In a passive attack, the goal of the adversary is to
steal secret information from the victim, while in an active attack, the adversary
actively causes changes in the victim’s computation or its environment. Figure 10.1
illustrates each type of attack mentioned in this section.
There are two main types of passive attacks in a multi-tenant FPGA scenario. In a
long-wire cross-talk attack (Fig. 10.1a), the adversary uses cross-coupling between
adjacent FPGA long wires in a routing channel to extract secret information from
a victim. The delay of a long wire is affected by the logic level in an adjacent long
wire. An adversary tenant can measure the delay in its long wire (e.g., by measuring
ring oscillator frequency) and use the value to determine the logic value of an
adjacent wire belonging to a victim tenant [36]. This attack has been used to extract
cryptographic keys [7, 36] from a victim tenant. It is also possible to implement
a hardware Trojan in the victim tenant to act as a transmitter and transfer secret
information to an adversary tenant (the receiver) using long-wire cross-coupling as
the communication channel [7].
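The receiver side of this attack amounts to a threshold classifier on delay measurements. The sketch below is a simplified behavioral model, not code from [7, 36]; the oscillation counts and calibration values are illustrative inventions. A higher RO count indicates a lower long-wire delay, which occurs when the adjacent victim wire carries a logic '1':

```python
# Illustrative model of a long-wire cross-talk receiver. The adversary's RO
# count is slightly higher when the adjacent victim wire carries a logic '1'
# (the neighboring '1' lowers the long wire's effective delay). All counts
# here are hypothetical.

def calibrate_threshold(counts_when_0, counts_when_1):
    """Midpoint between mean counts observed for known victim 0s and 1s."""
    mean0 = sum(counts_when_0) / len(counts_when_0)
    mean1 = sum(counts_when_1) / len(counts_when_1)
    return (mean0 + mean1) / 2

def decode_bits(ro_counts, threshold):
    """Classify each RO count as the victim wire's inferred logic value."""
    return [1 if count > threshold else 0 for count in ro_counts]

# Calibrate against known victim values, then decode unknown traffic.
thr = calibrate_threshold([1000, 1002, 999], [1010, 1012, 1011])
bits = decode_bits([1001, 1011, 998, 1013], thr)   # -> [0, 1, 0, 1]
```

In practice the count difference per bit is small, so attacks typically average many measurements for each transmitted bit.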
A remote power side-channel attack uses the shared FPGA PDN to extract
information from the victim (Fig. 10.1b). In this scenario, the adversary tenant
implements on-chip voltage sensors to measure the voltage across the PDN [8].
FPGA voltage sensors are discussed in depth in Sect. 10.4. The adversary tenant
uses the voltage sensors to measure voltage fluctuations caused by the victim
tenant’s dynamic power consumption, including resistive (IR) and inductive (L dI/dt)
voltage drops [15]. This type of attack has been used to extract secret information
about cryptographic algorithms [39, 50] and to extract secret information from
computational accelerators for machine learning algorithms [27].
During active attacks in multi-tenant FPGAs, the adversary tenant actively
attempts to sabotage the victim tenant’s hardware design without physically access-
ing it. In active voltage attacks, the adversary tenant uses power wasters to consume
large amounts of dynamic power and drop the voltage across the FPGA PDN. The
lower voltage causes increased delay in FPGA logic and routing elements leading
to timing violations in paths with small slack, and consequently inducing timing
faults in the victim tenant hardware design (Fig. 10.1c). A targeted fault injection
attack can expose secret information from hardware accelerators for cryptographic
algorithms (RSA [30] and AES [14]), generate biased outputs in a true random
number generator [22], or cause a hardware accelerator for machine learning
algorithms to generate misclassified results [19].
The adversary tenant can also crash the FPGA board by activating thousands
of power waster circuits and overloading the FPGA board power regulator [9, 20].
This action will cause the regulator to fail, leading to a denial-of-service (DoS) attack
(Fig. 10.1d) on the multi-tenant FPGA. Voltage attacks are covered in more detail in
Sect. 10.3.3.
10.2.2 Countermeasures
Offline countermeasures are applied before FPGA execution, for example by scanning
the tenant bitstream and detecting the activity of potential power waster circuits.
Other examples of offline countermeasures include design modification to hide
the dynamic power signature of the victim tenant hardware to make power side-
channel attacks more difficult [13, 37], and isolating long wires containing sensitive
information to counteract attacks based on long-wire cross-coupling [41].
Online countermeasures are deployed during FPGA execution. They either try
to make performing an attack harder for the adversary tenant (preventative)
or focus on disabling the source of the attack and remediating any damage
once it has started (reactive). Attack detection can be performed using on-chip voltage sensors
to monitor the voltage across the FPGA PDN and detect the activity of power
wasters used for voltage attacks when the sensor measurement drops below a
predetermined threshold [33, 51]. Fault detection circuits [43] can be used to identify
fault injection. These circuits typically use shadow registers [6]. The output of a
register is compared to the slightly delayed version stored in a shadow register. A
fault is detected if the two values are not equal.
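The shadow-register comparison can be captured in a small behavioral model. The sketch below is illustrative, not RTL; the clock period, settling times, and sampling offset are hypothetical numbers chosen only to show the mechanism:

```python
# Behavioral model of shadow-register timing-fault detection. The main
# register samples a combinational output at the clock edge; the shadow
# register samples the same signal slightly later. Unequal values mean the
# path had not settled by the clock edge. All timing numbers (ns) are
# hypothetical.

def detect_timing_fault(output_at, t_clk, shadow_delay):
    main = output_at(t_clk)                    # main register capture
    shadow = output_at(t_clk + shadow_delay)   # delayed shadow capture
    return main != shadow                      # mismatch flags a fault

# A path slowed (e.g., by a voltage drop) settles to 1 only after 5.2 ns,
# violating a 5.0 ns clock period; a healthy path settles after 4.0 ns.
slow_path = lambda t: 1 if t >= 5.2 else 0
fast_path = lambda t: 1 if t >= 4.0 else 0

fault = detect_timing_fault(slow_path, t_clk=5.0, shadow_delay=0.5)  # True
clean = detect_timing_fault(fast_path, t_clk=5.0, shadow_delay=0.5)  # False
```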
Methods that mask the dynamic power signature of a victim tenant’s design and
make side-channel attacks difficult are another example of a preventative online
countermeasure. Active fences [13] surround the victim tenant’s design with ring-
oscillator-based power wasters. The activity of the power wasters is adjusted to
complement the dynamic power consumption of the protected circuit resulting in
reduced observable voltage fluctuations. As a result, power side-channel attacks
become more difficult. Once the activity of power wasters used for an attack is
detected, reactive online countermeasures can be deployed to disable the source of
the attack. These countermeasures must respond fast enough to avoid the damage
intended by the attacker. For example, the clock source to an adversary tenant can
be cut off to disable synchronous power wasters [33] once an attack is detected.
Detailed examples of reactive online countermeasures targeting several types of
power wasters are reviewed in Sect. 10.5. In the remainder of this chapter, we will
focus on reactive online countermeasures for multi-tenant FPGAs that attempt to
disable the attack source.
In this section, we consider the voltage attacks introduced in Sect. 10.2 in detail.
First, we provide an overview of the different types of power wasters used in voltage
attacks in multi-tenant FPGAs and their effect on the FPGA PDN. Then, we discuss
the threats that these attacks pose to potential victims.
278 S. Moini et al.
The effect of power waster operation on the FPGA PDN has previously been
characterized [33]. A total of 30,000 LUT-based power wasters (Fig. 10.2a) were
instantiated on an Intel Stratix 10 FPGA and used to generate voltage drops across
the FPGA PDN. Their effect was captured using an on-chip network of 218 ring-
oscillator-based voltage sensors uniformly placed across the FPGA floorplan to
monitor the transient response of the FPGA PDN to the power wasters’ activity.
The voltage sensor’s architecture and operation are described in detail in Sect. 10.4.
[Fig. 10.2: power waster circuit variants, including the LUT-based ring oscillator (a), MUX-, latch-, and flip-flop-chain-based wasters, the AES-based waster (f), and the dual-port RAM-based waster (g)]
Fig. 10.3 Normalized RO voltage sensor measurements and supply voltage measurements during
power waster activity. The power wasters are activated at time t = 0. The plot shows the response of
a voltage sensor at specific distances (measured as the number of FPGA logic array blocks) from
the center of the power wasters [33]
Figure 10.3 shows the transient response of multiple RO-based voltage sensors
when the power wasters are activated at time 0. The number of oscillations of
the RO-based voltage sensors located across the FPGA die is recorded during the
measurement clock period (equal to 10 μs). Each oscillation count is equated with
a corresponding voltage value [33]. This derived voltage measurement is shown on
the Y-axis of the graph. The distance between the power waster and each sensor is
measured as the number of FPGA logic array blocks (LABs) between them. Based
on the figure, the RO sensor measurements show a large voltage drop after time
0 followed by a steady-state voltage value that is smaller than the initial voltage of
0.92 V before the wasters were turned on. As expected, the voltage drop becomes
smaller as the distance between the center of the power wasters and the RO-based
voltage sensor increases. This experiment shows that there is a significant voltage drop present
even at a considerable distance from the source of the attack.
An adversary tenant may wish to inject faults in the victim tenant’s circuit with-
out causing a board crash to cause incorrect behavior or retrieve secret information.
For example, LUT-based RO power wasters were used to extract the key from a
hardware accelerator for the 512-bit RSA encryption algorithm [31] by injecting
faults during the modular multiplication phase of the RSA hardware accelerator.
This action resulted in incorrect encrypted ciphertext for known plaintext and public
keys. The faulty ciphertext was then used to extract the prime numbers needed to
generate the RSA key pair and expose the private key. FPGAhammer [14] extracted
the key from a 128-bit AES encryption module by injecting timing faults using LUT-
based power wasters. This action caused a byte fault in an AES encryption round
and consequently generated a ciphertext with faults injected in all bytes. The faulty
ciphertext was used alongside the correct ciphertext to perform a differential fault
attack (DFA) to extract the secret key. It is also possible to extract the AES key from
an FPGA by injecting faults without having access to the fault-free ciphertext [18].
Hardware accelerators for machine learning algorithms are also vulnerable to
fault injection attacks. In DeepStrike [19], timing faults were injected into a
hardware accelerator for a deep neural network (DNN) used to classify images,
resulting in misclassifications. Latch-based asynchronous power wasters were used
to inject timing faults in the digital signal processing (DSP) primitives used
by the hardware accelerators. In a similar project [35], a fault injection attack
was performed against a DNN hardware accelerator using LUT- and latch-based
asynchronous power wasters. The data communication process for loading the
DNN weights from off-chip memory was targeted, causing timing faults when data
were loaded. Input images to the DNN accelerator were misclassified using the
attack. On-chip RAM-based power wasters (Fig. 10.2g) were used to perform a fault
injection attack against a machine learning target [1]. Using RAM-based power
wasters, it was possible to change the output class of a neural network hardware
accelerator.
Since voltage attacks disrupt the FPGA PDN, an on-chip voltage sensor with the
ability to sense the voltage across the FPGA PDN is needed. On-chip voltage sensors
are digital circuits implemented using FPGA resources that can provide a proxy for
the voltage of the FPGA PDN. Figure 10.4 shows an overview of the on-chip voltage
sensors discussed in this section. Ring oscillators (ROs) form a common type of on-
chip voltage sensor. RO-based sensors (Fig. 10.4a) include an RO followed by a
counter that counts the number of RO oscillations during a specific time period. The
length of the combinational loop in the ring oscillator must be sufficiently long to
avoid timing faults in the counter circuit. As an example, 19 inverter stages have
previously been used [33] for voltage sensor implementation.
The relationship between the voltage measured from the FPGA supply voltage
pin and the on-chip voltage sensor oscillation frequency can be derived during a
calibration phase by measuring the voltage sensor oscillation frequency for various
supply voltage values. The FPGA supply voltage can be modified by connecting the
FPGA to a power supply and directly adjusting the voltage [30]. Another way to
adjust the supply voltage of the FPGA is to activate power wasters. The number of
active power wasters can be dynamically adjusted to change the voltage measured at
the supply voltage pin of the FPGA. In this scenario, an independent measurement
of the voltage across the FPGA PDN is needed for calibration. This independent
measurement is either achieved by connecting an oscilloscope to the supply voltage
pin of the FPGA [25], or by using the dedicated analog-to-digital converter (ADC)
available on the FPGA that monitors the supply voltage [30]. Figure 10.5, derived
during the calibration phase, shows the relationship between the FPGA PDN voltage
and the normalized frequency of the RO voltage sensor for two Intel-based FPGAs.
A Terasic DE10-Pro evaluation kit [46] containing an Intel Stratix 10 FPGA, and
a Terasic DE5a-Net evaluation kit [45] containing an Intel Arria 10 FPGA were
used. In both experiments [30, 33], variable numbers of RO-based power wasters
are activated to adjust the voltage across the FPGA PDN, while dedicated ADC-
based on-chip voltage sensors are used to record the voltage values. This figure
clearly shows a highly linear relationship between the FPGA PDN voltage and the
RO sensor frequency, and it can be used to directly convert on-chip voltage sensor
measurements into voltage values.
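Given calibration pairs of normalized RO frequency and measured voltage, the conversion is a linear model that can be fitted once and inverted at runtime. The sketch below uses made-up calibration points (the actual curves in Fig. 10.5 are device-specific) and a plain least-squares fit:

```python
# Least-squares line mapping normalized RO frequency to PDN voltage.
# The calibration points below are invented; real points come from sweeping
# the supply voltage (or the power-waster count) while logging the on-chip
# ADC alongside the RO counter.

def fit_line(freqs, volts):
    """Return (slope, intercept) of the least-squares fit v = a*f + b."""
    n = len(freqs)
    mf, mv = sum(freqs) / n, sum(volts) / n
    a = (sum((f - mf) * (v - mv) for f, v in zip(freqs, volts))
         / sum((f - mf) ** 2 for f in freqs))
    return a, mv - a * mf

def freq_to_voltage(freq, slope, intercept):
    """Invert the calibration at runtime."""
    return slope * freq + intercept

# Hypothetical calibration sweep (normalized frequency, voltage in volts).
a, b = fit_line([0.80, 0.85, 0.90, 0.95, 1.00],
                [0.84, 0.86, 0.88, 0.90, 0.92])
v = freq_to_voltage(0.875, a, b)   # ~0.87 V
```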
A drawback of RO-based on-chip voltage sensors is their long measurement
sampling time: a sampling period of, e.g., 10 μs [33] is needed before a single
measurement is recovered from the sensor. If this period is too short, the RO only
oscillates a few times and will be less sensitive to PDN voltage fluctuations [25].
Additionally, the RO sensor may not provide consistent measurements for repeated
sensing iterations and may require multiple runs and averaging to converge to a
consistent value [25]. On-chip voltage sensors based on time-to-digital converters
(TDCs) can alleviate these issues. A TDC sensor can digitally measure the time
elapsed between two separate events. In FPGAs, the TDC sensor is implemented
using the FPGA carry chain primitives as delay elements (Fig. 10.4b) and is able
to sense small FPGA PDN voltage fluctuations [51]. Multiple carry chain elements
are connected in a daisy chain. The output of each carry element is connected to
a flip-flop to record its value. A rising edge signal is propagated through the carry
chain, and the number of elements it passes is captured by the flip-flops. The TDC is
able to measure the number of carry elements passed between its start (rising edge
entering the carry chain) and stop (flip-flop clock) signals. The Hamming weight of
the TDC output can be used to approximate voltage across the FPGA PDN. With
higher voltage, each carry element has a lower propagation delay resulting in the
input signal propagating through a higher number of carry elements. Thus, the TDC
output will have a higher Hamming weight. Conversely, a lower voltage across the
FPGA PDN results in the TDC output having a lower Hamming weight.
A TDC sensor typically has a much shorter sampling delay (2 ns [51]) compared
to the RO sensor (10 μs). Additionally, TDC measurements converge to a stable
averaged value with fewer repeated measurements [25].
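The TDC readout described above reduces to a Hamming-weight computation followed by a calibrated linear mapping. In the sketch below, the chain length, volts-per-stage slope, and offset are hypothetical values, not figures from [51]:

```python
# TDC readout: the flip-flops capture a thermometer-coded snapshot of how
# far the rising edge propagated through the carry chain; its Hamming
# weight tracks the PDN voltage. The linear mapping constants below are
# hypothetical calibration values.

def hamming_weight(tdc_word):
    """Number of carry stages the rising edge traversed."""
    return bin(tdc_word).count("1")

def tdc_to_voltage(tdc_word, volts_per_stage, v_offset):
    """Map the traversed-stage count to a voltage estimate."""
    return v_offset + volts_per_stage * hamming_weight(tdc_word)

# A 16-stage chain in which the edge passed the first 12 stages:
word = 0b0000111111111111
v = tdc_to_voltage(word, volts_per_stage=0.005, v_offset=0.85)  # 0.91 V
```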
On-chip voltage sensors have been widely used in FPGAs to detect the activity
of power wasters used for fault injection attacks or denial-of-service voltage
attacks [51]. Generally, one or more voltage sensors constantly monitor the voltage
of the FPGA PDN and detect the activity of power wasters if the voltage drops
significantly below a predetermined threshold. For example [33], a sensor network
consisting of 218 RO-based voltage sensors was used to detect the activity of
power wasters deployed in a denial-of-service attack. The Stratix 10 FPGA was
divided into four separate regions (one per tenant). Sensors are used to detect
the activity of power wasters in each region independently. An integrated ARM
processor constantly calculates the average RO sensor count in each FPGA region
and compares it to a predetermined threshold. If a potential power waster is detected,
it can be remediated, as described in Sect. 10.5.
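The monitoring loop run by the integrated processor can be sketched as a per-region average compared against a threshold. The region names, counts, and threshold below are illustrative, not values from [33]:

```python
# Per-region power-waster detection: average the RO sensor counts in each
# tenant region and flag any region whose average falls below a calibrated
# threshold. Region names, counts, and the threshold are illustrative.

def flag_regions(region_counts, threshold):
    """Return regions whose average sensor count indicates a voltage drop."""
    return [region for region, counts in region_counts.items()
            if sum(counts) / len(counts) < threshold]

counts = {
    "tenant0": [125, 126, 124],   # normal oscillation counts
    "tenant1": [98, 97, 101],     # large drop: suspected power wasters
    "tenant2": [124, 125, 125],
    "tenant3": [126, 124, 125],
}
suspects = flag_regions(counts, threshold=110)   # -> ["tenant1"]
```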
Similarly [24], a network of RO-based on-chip voltage sensors is implemented
on an FPGA to detect the activity of power wasters. The FPGA is divided into four
regions, with each assigned to a tenant. An algorithm is applied to collected sensor
data to locate the region potentially containing power wasters. For each sensor, the
measured RO-based sensor frequency is compared with a running frequency average
from the same sensor. When an attack occurs, the difference between these values
increases significantly as the RO sensor’s oscillation frequency drops. The location
of the wasters can be determined by identifying the region with the sensors that
have the largest drops. Unlike the approach described above, sensor measurements
are processed offline and cannot be used for immediate attack remediation.
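The per-sensor comparison against a running average can be sketched with an exponential moving average and a relative-drop test; the smoothing factor and threshold below are illustrative choices, not parameters from [24]:

```python
# Per-sensor anomaly detection: compare each new RO frequency sample with a
# running (exponential moving) average of the same sensor; a sample far
# below the average suggests power-waster activity. The smoothing factor
# and relative-drop threshold are illustrative.

def detect_attacks(samples, alpha=0.1, rel_drop=0.05):
    """Return indices of samples more than rel_drop below the running avg."""
    avg = samples[0]
    alarms = []
    for i, s in enumerate(samples[1:], start=1):
        if s < avg * (1.0 - rel_drop):
            alarms.append(i)
        avg = (1.0 - alpha) * avg + alpha * s   # update running average
    return alarms

trace = [125, 125, 124, 126, 110, 109, 125]   # dip while wasters are active
alarms = detect_attacks(trace)                # -> [4, 5]
```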
A final attack detection approach [42] uses a long combinational adder circuit
connected to a register to detect voltage drops. Intermediate carry values stored
in the register are used for detection. Similar to a TDC, the PDN voltage can be
approximated by calculating how far a carry signal propagates across the carry
chain during a measurement clock cycle. If the value is smaller than a threshold
determined during calibration, the activity of power wasters used for voltage attacks
is detected. Subsequently, a countermeasure, which involves suppressing the user
clock, is deployed to alleviate the attack.
Once detected, power wasters must be disabled before they deliver their intended
harm. Different countermeasures are deployed for synchronous and asynchronous
power wasters.
Fig. 10.6 Overview of the multi-tenant attack remediation system running on a Stratix 10
FPGA [33]
When power waster activity is detected in a tenant region, the clock buffer for that
region is disabled using a memory write to the clock-enable
registers. Experimental results for 1,000 trials show that this monitoring system is
able to successfully locate a tenant region containing the activated power wasters
and disable its clock in 11.05 μs on average, avoiding a power regulator crash.
For the second scenario, an ARM processor that runs faster than the NIOS II
processor is used to run the software. However, the data communication that sends
the RO sensor counts to the ARM processor and receives commands to disable
clock buffers is slower than in the NIOS II setup due to the delay associated
with the AVMM bridge. Similar to the previous scenario, the experiment with the
power wasters was run 1,000 times, and the monitoring system was able to detect
the activation of the power wasters, locate them, and disable them in 9.95 μs on
average [33].
These experiments show the success of an RO-based on-chip voltage sensor
network in detecting and disabling synchronous power wasters used for a DoS
attack that requires 20 μs. A drawback of this countermeasure is that it is unable
to deactivate asynchronous power wasters that do not rely on a clock to function.
[Fig. 10.9: partial reconfiguration setup, in which a PR controller, fed a new persona bitstream over JTAG, replaces the old persona in the PR region]
The PR region stretches vertically across the FPGA from top to bottom (Fig. 10.9).
The size of the PR bitstream for this PR region is 500 KB.
At the start of the experiment, all 10 RO voltage sensors are activated. The
number of oscillations of each RO voltage sensor during the active time period
is measured using a counter and stored in on-chip memory. When the sensors are
activated, the partial reconfiguration operation is started and the PR bitstream of
500 KB is sent to the FPGA using a JTAG cable. Once the partial reconfiguration
operation is complete, the RO voltage sensor measurements are recovered from the
on-chip memory. These measurements are used to calculate the time it takes from
the start of the PR operation until all 10 RO sensors are cut off (the cut-off time).
Note that the RO cut-off time does not include the time needed to detect the presence
of an attack using an on-chip voltage sensor network.
Figure 10.11 shows the measurements recovered from the 10 RO sensors after the
PR operation starts at time t = 0. Each count of RO sensor oscillations is collected
with a sampling period of 1 μs. As shown in the figure, all RO sensors have a
similar count of 125 during a 1 μs sampling period until they all simultaneously stop
oscillating after 40 ms. Further investigation of the zoomed-in image shows that the
last non-zero sensor measurement was recovered at 39.5 ms (the cut-off time of the
ring oscillators).
JTAG is a slow communication protocol with low bandwidth and high operating
system-level handshake delay, resulting in reconfiguration delay in the measured
RO cut-off times.
Fig. 10.10 RO-based voltage sensors before and after the partial reconfiguration operation. The
red rectangle shows the partial reconfiguration region. The ring oscillator is disabled after the PR
operation is completed
To provide a better estimate of the RO cut-off time if a faster data
communication medium was available, we investigated the way data are transferred
through JTAG during the PR operation using the Intel Signal Tap logic analyzer [11].
By the time the RO sensors are cut off, 10% of the PR bitstream (50 KB) has been
loaded onto the FPGA. Based on this insight, the cut-off time of the RO sensors can
be estimated when loading the PR bitstream from on-chip memory instead of off-
chip communication with JTAG. Using on-chip memory, the PR bitstream could be
read using a 200-MHz clock with 32 bits per clock. In this scenario, it would take
at least 63 μs to load the 10% of the PR bitstream needed to disable the ROs in a
single FPGA column.
Disabling thousands of RO wasters would require performing the partial recon-
figuration of numerous FPGA LAB columns. For example, previously [33], 64 LAB
columns in a Stratix 10 device were used to implement RO wasters. With each LAB
column requiring at least 63 μs to disable, the PR operation to serially disable all RO
wasters in all LAB columns will take at least four milliseconds if the wasters were
implemented in a Stratix V. Since an adversary tenant can inject faults into a victim
design in 10–20 μs [31], partial reconfiguration in current Intel Stratix V FPGAs is
not fast enough to avoid these attacks.
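The timing estimates above can be reproduced directly. Assuming 50 KB means 50,000 bytes (10% of the 500-KB bitstream), a 200-MHz clock, and 32 bits loaded per cycle, one column takes about 62.5 μs (the "at least 63 μs" in the text), and 64 serially disabled columns take about 4 ms:

```python
# Worked timing estimate for the PR-based countermeasure, assuming
# 50 KB = 50,000 bytes (10% of the 500-KB bitstream), 32 bits loaded per
# cycle of a 200-MHz clock, and 64 LAB columns disabled serially.

bits_to_load = 50_000 * 8            # 10% of the PR bitstream, in bits
cycles = bits_to_load / 32           # cycles at 32 bits per cycle
clock_ns = 1e9 / 200e6               # 5 ns period at 200 MHz

per_column_us = cycles * clock_ns / 1000   # 62.5 us per LAB column
total_ms = 64 * per_column_us / 1000       # 4.0 ms for 64 columns

# A fault can be injected in 10-20 us, so even a single column is too slow.
```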
Fig. 10.11 RO counts for 10 separate sensors during partial reconfiguration. The PR operation
starts at time 0. As shown in the zoomed-in version of the figure, the sensors are cut off at 39.5 ms
Recent work has modified the partial reconfiguration bitstreams in Xilinx 7-series
and UltraScale+ FPGAs to reduce the cut-off time for asynchronous power wasters.
The LoopBreaker approach [28] uses a partial reconfiguration (PR) bitstream that
disables asynchronous power wasters. The approach disables all interconnects in
the partial reconfiguration region by setting them to a high impedance “Z” state
prior to the full loading of a new bitstream for the region. The approach takes
advantage of the use of configuration commands that are part of the bitstream.
The commands, which are executed by the PR controller, select the PR region,
cut off its input and output wires, and disable its interconnects. LoopBreaker sends
The security measures that have been introduced for multi-tenant cloud FPGAs
are insufficient to address all threats posed by voltage attacks using power
wasters. Although experimentally tested techniques have successfully suppressed
synchronous [33] and asynchronous power wasters [28] in some scenarios, in other
cases an adversary tenant with power wasters can still inject timing faults and cause
an FPGA board crash even in the presence of the fastest reactive countermeasures.
Therefore, we propose several additional directions for voltage attack
countermeasures, targeting different stages of the multi-tenant FPGA design and
development process.
Cloud users could potentially insert hardware in a design to roll back the design
state to known non-compromised values in the presence of a fault injection attack
[26]. This mechanism would allow the user’s hardware to halt operations once
the presence of a fault injection attack is detected, reject potentially compromised
computed results, and recompute affected values once the attack source is disabled.
Cloud FPGA users could use checkpointing methods to periodically generate a
checkpoint snapshot of the working design (e.g., state machines, registers, and on-
chip block memory instances [2, 40]). In the case of a denial-of-service attack,
the user could potentially return to their most recent checkpoint and minimize the
damage caused by the FPGA shutdown.
Cloud FPGA providers could also invest in countermeasures to improve the
security of the user environment. More sophisticated bitstream scanning methods
that discover complex power wasters that are undetected by the current bitstream
scanning methods [17] could be used. For example, glitch-based power wasters [23,
32] do not use ring oscillators, and RAM-based power wasters [1] appear similar to
standard circuitry. These power wasters could be discovered by scanning the FPGA
bitstream and design for suspicious high-fanout circuits with the ability to generate
high-power glitches.
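In spirit, such a scan could flag nets whose fanout is large enough to drive a damaging glitch. The sketch below is purely illustrative: the netlist representation, net names, fanout counts, and threshold are hypothetical, and a real scanner such as FPGADefender [17] operates on actual bitstream structures:

```python
# Illustrative scan for suspicious high-fanout nets. The design is modeled
# as a mapping from net name to sink count; names, counts, and the
# threshold are hypothetical.

def find_suspicious_nets(net_fanout, fanout_threshold=1000):
    """Return nets with fanout large enough to drive a high-power glitch."""
    return sorted(net for net, fanout in net_fanout.items()
                  if fanout >= fanout_threshold)

netlist = {
    "clk_main": 5000,      # clock trees legitimately have high fanout
    "glitch_en": 4200,     # one enable driving thousands of toggling LUTs
    "data_bus0": 64,
    "state_reg_q": 12,
}
suspects = find_suspicious_nets(netlist)   # -> ["clk_main", "glitch_en"]
```

A production scanner would still need to separate legitimate high-fanout nets (clocks, resets) from glitch amplifiers, which is why "clk_main" is also flagged by this naive filter.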
FPGA hardware vendors also have an important role in increasing the security
of multi-tenant FPGAs. They could introduce improved partial reconfiguration
capabilities to more quickly cut off the interconnects in a PR region or clear
configuration hardware. Improvements could include removing the communication
and setup overheads and redundancies during partial reconfiguration and improving
the supported maximum throughput for PR bitstream loading. These changes may
help improve the success rate of partial reconfiguration as a countermeasure against
fault injection and DoS attacks by deactivating power wasters faster. FPGA vendors
could also improve the design of the FPGA power distribution network by dividing
the PDN into multiple isolated voltage planes [14]. Doing so could potentially avoid
side-channel attacks and reduce the viability of active fault injection and FPGA-
wide denial-of-service attacks.
10.7 Conclusion
figuration is not sufficiently fast to prevent fault injection and suppress
all denial-of-service attacks. Possible research directions that could address this
limitation were subsequently presented.
Acknowledgment This work was supported in part by National Science Foundation grant
1902532.
References
1. Alam, M. M., Tajik, S., Ganji, F., Tehranipoor, M., & Forte, D. (2019). RAM-Jam: Remote
temperature and voltage fault attack on FPGAs using memory collisions. In Workshop on
Fault Diagnosis and Tolerance in Cryptography (FDTC) (pp. 48–55).
2. Attia, S., & Betz, V. (2020). Feel free to interrupt: Safe task stopping to enable FPGA
checkpointing and context switching. ACM Transactions on Reconfigurable Technology and
Systems (TRETS), 13(1), 1–27.
3. Benz, F., Seffrin, A., & Huss, S. A. (2012). Bil: A tool-chain for bitstream reverse-engineering.
In 22nd International Conference on Field Programmable Logic and Applications (FPL) (pp.
735–738). IEEE.
4. Bobda, C., Mbongue, J. M., Chow, P., Ewais, M., Tarafdar, N., Vega, J. C., et al. (2022).
The future of FPGA acceleration in datacenters and the cloud. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 15(3), 1–42.
5. Damschen, M., Bauer, L., & Henkel, J. (2017). CoRQ: Enabling runtime reconfiguration under
WCET guarantees for real-time systems. IEEE Embedded Systems Letters, 9(3), 77–80.
6. Ernst, D., Kim, N. S., Das, S., Pant, S., Rao, R., Pham, T., Ziesler, C., Blaauw, D.,
Austin, T., Flautner, K., et al. (2003). Razor: A low-power pipeline based on circuit-level
timing speculation. In Proceedings. 36th Annual IEEE/ACM International Symposium on
Microarchitecture, 2003. MICRO-36 (pp. 7–7). Citeseer.
7. Giechaskiel, I., Rasmussen, K. B., & Eguro, K. (2018). Leaky wires: Information leakage and
covert communication between FPGA long wires. In ACM Asia Conference on Computer and
Communications Security (ASIACCS) (pp. 15–27).
8. Gnad, D. R., Oboril, F., Kiamehr, S., & Tahoori, M. B. (2016). Analysis of transient voltage
fluctuations in FPGAs. In International Conference on Field-Programmable Technology (FPT)
(pp. 12–19).
9. Gnad, D. R., Oboril, F., & Tahoori, M. B. (2017). Voltage drop-based fault attacks on FPGAs
using valid bitstreams. In International Conference on Field Programmable Logic and
Applications (FPL) (pp. 1–7).
10. Intel (2015). Stratix V GX FPGA Development Kit.
11. Intel (2017). Quartus Prime Standard Edition Handbook Volume 3: Verification, Chap. 14:
Design Debugging with the Signal Tap Logic Analyzer.
12. Intel (2018). UG-20179: Intel Quartus Prime Standard Edition User Guide.
13. Krautter, J., Gnad, D. R., Schellenberg, F., Moradi, A., & Tahoori, M. B. (2019). Active
fences against voltage-based side channels in multi-tenant FPGAs. In IEEE/ACM International
Conference on Computer-Aided Design (ICCAD) (pp. 1–8).
14. Krautter, J., Gnad, D. R., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault attacks
on shared FPGAs, suitable for DFA on AES. IACR Transactions on Cryptographic Hardware
and Embedded Systems (TCHES), 2018(3), 44–68.
15. Krautter, J., Gnad, D. R., & Tahoori, M. B. (2019). Mitigating electrical-level attacks towards
secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable Technology and
Systems (TRETS), 12(3), 12:1–12:26.
16. La, T., Pham, K., Powell, J., & Koch, D. (2021). Denial-of-service on FPGA-based cloud
infrastructures - attack and defense. IACR Transactions on Cryptographic Hardware and
Embedded Systems, 2021(3), 441–464.
17. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 13(3), 1–31.
18. Li, X., Tessier, R., & Holcomb, D. (2022). Precise fault injection to enable DFIA for attacking
AES in remote FPGAs. In 2022 IEEE 30th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM). IEEE.
19. Luo, Y., Gongye, C., Fei, Y., & Xu, X. (2021). DeepStrike: Remotely-guided fault injection
attacks on DNN accelerator in cloud-FPGA. In 2021 58th ACM/IEEE Design Automation
Conference (DAC) (pp. 295–300). IEEE.
20. Luo, Y., Gongye, C., Ren, S., Fei, Y., & Xu, X. (2020). Stealthy-shutdown: Practical remote
power attacks in multi-tenant FPGAs. In 2020 IEEE 38th International Conference on
Computer Design (ICCD) (pp. 545–552). IEEE.
21. Mahmoud, D. G., Lenders, V., & Stojilović, M. (2022). Electrical-level attacks on CPUs,
FPGAs, and GPUs: Survey and implications in the heterogeneous era. ACM Computing
Surveys (CSUR), 55(3), 1–40.
22. Mahmoud, D. G., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant
FPGAs. In Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1745–
1750).
23. Matas, K., La, T. M., Pham, K. D., & Koch, D. (2020). Power-hammering through
glitch amplification–attacks and mitigation. In IEEE International Symposium on Field-
Programmable Custom Computing Machines (FCCM) (pp. 65–69).
24. Mirzargar, S. S., Renault, G., Guerrieri, A., & Stojilović, M. (2020). Nonintrusive and adaptive
monitoring for locating voltage attacks in virtualized FPGAs. In Cryptology ePrint Archive.
25. Moini, S., Li, X., Stanwicks, P., Provelengios, G., Burleson, W., Tessier, R., & Holcomb,
D. (2020). Understanding and comparing the capabilities of on-chip voltage sensors against
remote power attacks on FPGAs. In 2020 IEEE 63rd International Midwest Symposium on
Circuits and Systems (MWSCAS) (pp. 941–944). IEEE.
26. Moini, S., Provelengios, G., Holcomb, D., & Tessier, R. (2023). Fault recovery from multi-
tenant FPGA voltage attacks. In ACM SIGDA Great Lakes Symposium on VLSI (GLSVLSI)
(pp. 1–6).
27. Moini, S., Tian, S., Holcomb, D., Szefer, J., & Tessier, R. (2021). Power side-channel attacks
on BNN accelerators in remote FPGAs. IEEE Journal on Emerging and Selected Topics in
Circuits and Systems, 11(2), 357–370.
28. Nassar, H., AlZughbi, H., Gnad, D. R., Bauer, L., Tahoori, M. B., & Henkel, J. (2021). Loop-
Breaker: Disabling interconnects to mitigate voltage-based attacks in multi-tenant FPGAs. In
2021 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (pp. 1–9).
IEEE.
29. Pezzarossa, L., Schoeberl, M., & Sparsø, J. (2017). A controller for dynamic partial reconfig-
uration in FPGA-based real-time systems. In 2017 IEEE 20th International Symposium on
Real-Time Distributed Computing (ISORC) (pp. 92–100). IEEE.
30. Provelengios, G., Holcomb, D., & Tessier, R. (2019). Characterizing power distribution attacks
in multi-user FPGA environments. In International Conference on Field Programmable Logic
and Applications (FPL) (pp. 194–201).
31. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power distribution attacks in multitenant
FPGAs. IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 28(12), 2685–
2698.
32. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power wasting circuits for cloud FPGA
attacks. In International Conference on Field Programmable Logic and Applications (FPL)
(pp. 231–235).
33. Provelengios, G., Holcomb, D., & Tessier, R. (2021). Mitigating voltage attacks in multi-tenant
FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 14(2), 1–24.
10 Countermeasures Against Voltage Attacks in Multi-tenant FPGAs 295
34. Provelengios, G., Ramesh, C., Patil, S. B., Eguro, K., Tessier, R., & Holcomb, D. (2019).
Characterization of long wire data leakage in deep submicron FPGAs. In ACM/SIGDA
International Symposium on Field Programmable Gate Arrays (FPGA) (pp. 292–297).
35. Rakin, A. S., Luo, Y., Xu, X., & Fan, D. (2021). Deep-Dup: An adversarial weight duplication
attack framework to crush deep neural network in multi-tenant FPGA. In 30th USENIX
Security Symposium (USENIX Security 21) (pp. 1919–1936).
36. Ramesh, C., Patil, S. B., Dhanuskodi, S. N., Provelengios, G., Pillement, S., Holcomb, D., &
Tessier, R. (2018). FPGA side channel attacks without physical access. In IEEE International
Symposium on Field-Programmable Custom Computing Machines (FCCM) (pp. 45–52).
37. Regazzoni, F., Wang, Y., & Standaert, F. X. (2011). FPGA implementations of the AES masked
against power analysis attacks. Proceedings of COSADE, 2011, 56–66.
38. Rossi, E., Damschen, M., Bauer, L., Buttazzo, G., & Henkel, J. (2018). Preemption of the
partial reconfiguration process to enable real-time computing with FPGAs. ACM Transactions
on Reconfigurable Technology and Systems (TRETS), 11(2), 1–24.
39. Schellenberg, F., Gnad, D. R., Moradi, A., & Tahoori, M. B. (2018). An inside job: Remote
power analysis attacks on FPGAs. In Design, Automation & Test in Europe Conference &
Exhibition (DATE) (pp. 1111–1116).
40. Schmidt, A. G., Huang, B., Sass, R., & French, M. (2011). Checkpoint/restart and beyond:
Resilient high performance computing with FPGAs. In 2011 IEEE 19th Annual International
Symposium on Field-Programmable Custom Computing Machines (pp. 162–169). IEEE.
41. Seifoori, Z., Mirzargar, S. S., & Stojilović, M. (2020). Closing leaks: Routing against crosstalk
side-channel attacks. In Proceedings of the 2020 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (pp. 197–203).
42. Shen, L. L., Ahmed, I., & Betz, V. (2019). Fast voltage transients on FPGAs: Impact and
mitigation strategies. In IEEE International Symposium on Field-Programmable Custom
Computing Machines (FCCM) (pp. 271–279).
43. Stott, E., Levine, J. M., Cheung, P. Y., & Kapre, N. (2014). Timing fault detection in FPGA-
based circuits. In 2014 IEEE 22nd Annual International Symposium on Field-Programmable
Custom Computing Machines (pp. 96–99). IEEE.
44. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters, 55(11),
640–642.
45. Terasic Technologies: DE5a-Net User Manual (2018).
46. Terasic Technologies: DE10-Pro User Manual (2019).
47. Tian, S., & Szefer, J. (2019). Temporal thermal covert channels in cloud FPGAs. In
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (pp. 298–303).
48. Xilinx: UG1182: ZCU102 Evaluation Board User Guide (2019).
49. Xilinx: UG885: VC707 Evaluation Board for the Virtex-7 FPGA (2019).
50. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In IEEE
Symposium on Security and Privacy (S&P) (pp. 229–244).
51. Zick, K. M., Srivastav, M., Zhang, W., & French, M. (2013). Sensing nanosecond-scale voltage
attacks and natural transients in FPGAs. In ACM/SIGDA International Symposium on Field
Programmable Gate Arrays (FPGA) (pp. 101–104).
Chapter 11
Programmable RO (PRO): A
Multipurpose Countermeasure Against
Side-Channel and Fault Injection Attack
Yuan Yao, Pantea Kiaei, Richa Singh, Shahin Tajik, and Patrick Schaumont
Y. Yao ()
Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and
State University, Blacksburg, VA, USA
e-mail: yuan9@vt.edu
P. Kiaei · R. Singh · S. Tajik · P. Schaumont
Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester,
MA, USA
e-mail: pkiaei@wpi.edu; rsingh7@wpi.edu; stajik@wpi.edu; pschaumont@wpi.edu
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_11
11.1 Introduction
298 Y. Yao et al.
In recent years, researchers have further demonstrated that the placement of the
attacker and the victim circuitry on the same chip while sharing a common power
distribution network (PDN) brings new side-channel and fault attack opportunities
[13, 15]. Having a common PDN intrinsically relates the perturbations from the
victim’s logic to the attacker’s logic and vice versa. Therefore, a neighboring
adversary logic can interpret information about the victim operations by monitoring
the changes on the shared PDN [13, 15]. On the other hand, the same physical effect
works in the other direction as well; the victim logic can infer malicious operations of its
neighboring circuitry by monitoring the shared PDN [25, 32, 40]. Therefore, in order to
guarantee the security of the PDN, a monitoring sensor network on the PDN should
be built to detect ongoing attacks. The monitoring sensor network should fulfill the
requirements including large spatial coverage, i.e., covering the full PDN area, and
large temporal coverage, i.e., continuously monitoring the PDN [38].
Previously, a ring oscillator (RO) was often used by silicon design houses as a
test structure or on-chip sensor to monitor the technology and circuit performance
[19]. In a cloud context, where an FPGA circuit can host multiple applications, such
test structures are prohibited because of their potential impact on the shared PDN.
Cloud providers can use tools to scan FPGA bitstreams and automatically flag such
malicious structures [9, 24]. However, the myriad ways to create an
oscillating structure in digital logic mean that no detection technique is foolproof. For
example, latches or flip flops can be inserted in the feedback structure of the ring
oscillator to prevent detection [39].
A multipurpose RO-based on-chip sensor design that adds resistance against both
side-channel and fault attacks to the circuit has not previously been investigated.
chapter, we introduce a new multipurpose RO design—a programmable RO (PRO).
With a low overhead, the proposed PRO can provide the following solutions within
the same structure:
• Active side-channel hiding countermeasure
• On-chip power monitoring
• Fault injection monitoring
The proposed PRO design has multiple configurations of oscillation frequency,
which are under the control of the user (i.e., the defender). Each PRO has its own
counter that can be read to calculate the PRO’s frequency by comparing it with
a reference counter. We first demonstrate that with low overhead, an individual
PRO can provide sufficient disturbance to the power to hide the side-channel
leakage of the secret information in the system. Moreover, we further demonstrate
that by combining multiple PROs into an array and by placing them within the
module under protection, a secure on-chip monitoring network can be constructed to
monitor the power fluctuations on the PDN to detect abnormalities and fault attacks.
Figure 11.1 shows the overall structure of the PRO-based on-chip secure system.
The PROs are evenly placed on the chip to form a secure on-chip network. The PRO
secure network can be controlled by the external user configuration. The user can
turn on the side-channel analysis (SCA) countermeasure by configuring the PRO to
oscillate at randomized oscillation frequencies. In addition, the user can monitor
the oscillation frequency of each PRO in the array by reading out its corresponding
counter value. We demonstrate that by monitoring the frequency change of the
PROs, on-chip local power attacks and EM fault injections can be detected.
Fig. 11.1 A sensitive hardware module can be protected by a grid of programmable ring
oscillators (PROs) that monitor power integrity, detect EM fault injection, and detect side-channel
leakage
The proposed design can be used on any secure module, from small hardware
accelerators to complex system-on-chips (SoCs). To the best of our knowledge, our
work is the first to comprehensively study the potential of RO-based designs in SCA
countermeasure, power sensing, and fault detection.
PRO addresses adversaries with the side-channel and fault attack capabilities
described in the following.
We consider two attacker models. The first attacker model has physical access to
the device, which enables the attacker to control the input data and monitor the
power dissipation by shunting the device’s power supply. The second attacker model
works remotely; the attacker circuit shares a PDN with the victim circuit, and the
attacker can control only the attacker circuit remotely. Therefore, the attacker is
able to implement malicious logic to monitor the changes on the shared PDN and
measure the power consumption of the device [37, 46]. This enables the attacker to
perform side-channel attacks, such as simple power analysis (SPA) [29], differential
power analysis (DPA) [20], and correlation power analysis (CPA) [5], to retrieve the
secret information used in the victim circuit.
We also assume the adversary can induce faults into the victim circuit by stressing
the electrical environment, such as injecting clock glitches, power glitches, and
EM glitches. These glitches can induce targeted transient faults that can flip bits,
change the control flow of the secure algorithm, set or reset the circuit, etc.
Fault injection can be done either by exerting disturbance to the circuit directly,
which requires the adversary to have physical access to the device, or by having
remote access to the shared cloud computing environment with the victim circuit
[1, 23, 28, 36]. The exact fault effects on the circuit depend highly on the fault
injection parameters, victim circuit’s architecture and algorithm, and fault injection
technique. By monitoring the fault response of the circuit after injecting targeted
faults, the adversary can retrieve the secret information by performing differential
fault analysis (DFA) [4], statistical fault analysis (SFA) [10], or instruction skip
attacks [43]. PRO as a secure on-chip add-on can be integrated into the circuit to
protect against the aforementioned attackers. Adversaries may try to tamper with the
PRO sensor itself to bypass the PRO’s security mechanisms, but we do not consider
this adversary model within this work.
The structure of the chapter is as follows. The next section reviews related work on
ring oscillators and highlights our contribution. Section 11.3 describes our proposed
PRO design. In Sect. 11.4, we explain and demonstrate the effectiveness of PRO
as a side-channel countermeasure. Next, we present the PRO's power sensing
functionality in Sect. 11.5. We further show that PRO can detect power faults and
EM faults in Sect. 11.6. Finally, we conclude the chapter in Sect. 11.7.
When sharing the same PDN, seemingly innocuous parts of the implemented
logic can perform adversarial operations on the other parts. In this chapter, our focus
is on two categories of adversarial operations: fault injection and power side-channel
analysis. In the following, we categorize the related work into three parts: using
on-chip logic as a countermeasure against power SCA, using on-chip sensors as
power sensors to detect power perturbation, and using on-chip sensors to detect
fault injection attacks.
Liu et al. [27] use an array of ROs, randomly switched on and off, to dynamically
hide the power consumption of AES SBoxes and hinder the first-order DPA. Simi-
larly, Krautter et al. [22] use ROs as a power-based SCA mitigation methodology. In
their work, the part of the implementation that needs to be protected is surrounded
by a network of ROs. By switching an arbitrary number of the ROs on and off, the
signal-to-noise ratio (SNR) in power traces decreases, and therefore, the number
of traces required for a successful attack increases. This approach is called hiding
side-channel leakage. However, the ROs in both designs are running at a fixed
oscillation frequency, and thus, only a single-frequency noise is injected. In this
case, it is straightforward for an attacker to apply post-processing techniques to
remove the noise effect. To avoid this weakness, PRO uses user-controlled but
random frequency changes (Sect. 11.4). Moreover, to further reduce the overhead,
we show how a simple modification can enhance the countermeasure efficacy.
Zick et al. [47] use ROs to measure on-chip voltage variations. Indeed, the
oscillation frequency is proportional to the supplied voltage on the PDN. To measure
the frequency of an RO accurately, counters are required that are clocked with the
output of the RO. This limits the maximum sample rate attainable by the RO counter
structure, and hence, the bandwidth of the side-channel signal. This limitation
has motivated research on other voltage-sensitive time-to-digital converter (TDC)
methods. For instance, Gnad et al. [14] use carry-chain primitives available on
Xilinx FPGAs as TDCs. However, the use of carry-chain primitives makes their
approach specific to certain FPGA families. Similar TDC structures have been
explored in the context of CMOS design simulation to measure the operating voltage
of a chip [2].
Moreover, ROs have been used in offensive scenarios affecting the PDN for
both passive (power-based) and active (fault-injection-based) physical attacks. As
an example of power-based SCA, Zhao et al. [46] presented on-chip power monitors
with ROs. They demonstrated that ROs can be used as a power monitor to observe
the power consumption of other modules on the FPGA or SoC. Using their power
monitor, they captured power traces of the device running the RSA algorithm and
were able to successfully find the private key by applying SPA. Gravellier et al. [16]
perform CPA on power traces acquired with RO-based power sensors.
To inject timing faults, Mahmoud et al. [28] employ ROs to increase the voltage
drop on the power network and lower the voltage level. Effectively, they make the
victim chip slower, causing timing faults. Similar attacks have been shown in other
works [1, 23, 36].
Next, we consider on-chip sensors for fault detection. Miura et al. [30] present
a sensor consisting of phase-locked loop (PLL) and ROs. In their work, ROs are
routed in a specific way to ensure their path travels through most parts of the chip.
Once an EM fault is injected, the path delay of the ROs will be affected, resulting
in changes in the RO phase. The PLL logic can capture this phase disturbance and
detect the ongoing fault injection. Similarly, He et al. used a PLL block to detect the
laser disturbance on RO oscillation frequency [17].
Provelengios et al. [36] show that on-chip ROs can not only detect fault injection,
but also locate the origin of the fault injection. With a similar structure, RON [45]
builds a ring oscillator network, distributed across the entire chip, to detect hardware
Trojans. Their work confirmed that RO-based power sensors can have a sufficiently
high sample rate to detect fluctuations on the PDN.
However, the scope of their work is limited to power fault detection, whereas,
in our work, we further investigate EM fault detection (Sect. 11.6). Additionally, the
unique programmable design of our proposed RO structure also enables its usage
as a power SCA countermeasure (Sect. 11.4).
In general, each previous work addresses one single aspect at a time: a side-channel
countermeasure, a power monitor, or a fault detector. In practice, an adversary
is capable of performing a combination of attacks. Hence, it is crucial to find a
security mechanism that encapsulates protection against these attacks. Our goal in
this chapter is therefore to design a programmable RO structure that can provide the
following functionalities within the same structure:
1. Hiding protection against power-based SCA
2. On-chip power monitoring of the fluctuations on the PDN
3. Detecting fault injection
To the best of our knowledge, this is the first work to comprehensively investigate
the RO’s potential in addressing all these three aspects. In the following sections, we
introduce our proposed design and demonstrate through experiments the capability
of the proposed system. Even though we demonstrate our experiments as an FPGA
prototype, our design is not limited to FPGAs and can be extended to other
electronic chips.
11.3.1 Background
A ring oscillator whose loop has a total propagation delay Tprop oscillates at the frequency

fRO = 1 / (2 · Tprop)    (11.1)

For a loop of n inverters, each with propagation delay t, closed by an AND gate with delay tAND, this becomes

fRO = 1 / (2 · (n · t + tAND))    (11.2)
In this chapter, we aim to have a programmable design of the RO that gives the
designer the flexibility to choose the RO oscillation frequency.
Figure 11.3 shows the basic structure of our proposed design of the pro-
grammable sensor. The PRO consists of multiple delay cells. Each delay cell
includes two delay paths: one consisting of inverters and the other a shorting path
that bypasses the inverters. The multiplexer in the delay cell can control the delay
cell’s propagation delay by selecting between the delay path and the shorting path
with the control input signal SEL. Each delay cell has its independent control signal.
Fig. 11.3 PRO Design. D0 denotes the delay cell type-0, D1 denotes the delay cell type-1, D2
denotes the delay cell type-2
Suppose there are N inverters configured in the delay cell. When SEL = 1, the delay
path is selected, and when SEL = 0, the shorting path is selected. The propagation
delay of each delay cell TC is therefore

TC = SEL · Td + (1 − SEL) · Ts,    (11.3)

where Td denotes the propagation delay of the delay path and Ts denotes the
propagation delay of the shorting path. The propagation delay of the shorting path
Ts is a very small value compared to Td but not 0; this is because of the delay of routing
and the delay of the multiplexer. Other user control inputs include EN that controls
whether PRO is enabled (oscillating) or not, and a control signal to reset/read the
PRO counter. The structure of the PRO design gives flexibility to the designer in
manifold. As shown in Table 11.1, there are multiple initial structural configurations
to be decided by the hardware designer at design time, including the number of
inverters per delay cell, as well as the number and type of different delay cells. These
parameters determine the range of the programmable RO’s oscillation frequency and
the number of frequency configurations the programmable RO can have.
Several constraints can be used as guidance when choosing the initial
design configuration of the PRO:
1. Oscillation frequency range
2. The number of configurations
3. Size of frequency changing step
4. Area
As a starting point of PRO parameter configuration, the designer should estimate
the propagation delay Tprop of a single inverter. This knowledge can be obtained
through the design library, timing simulation, or measuring the RO's oscillation
frequency with a single inverter (when working in an FPGA environment). Then,
based on the designated oscillation frequency range of the PRO, the designer can
calculate the minimum and maximum number of inverters that are needed using
Eq. (11.2). After deciding the number of inverters needed, the designer can group
the inverters into different types of delay cells based on the designated frequency
changing step and the number of configurations that are needed. Theoretically, more
inverters result in a larger oscillation frequency range at the cost of a larger PRO area.
Therefore, based on the area of the design targeted for protection, the designer should
decide the area constraint for the PRO, so that each PRO has good spatial coverage of
the design while not being so close to other PROs that it influences their
local power distribution.
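As a concrete illustration of this sizing step, the following sketch applies Eq. (11.2) to pick inverter counts for a target frequency range; the delay values t_inv and t_and are hypothetical placeholders for illustration, not measurements from our setup, and real FPGA routing delays would change the numbers:

```python
# Sketch: derive PRO inverter-count bounds from a target frequency range,
# using f_RO = 1 / (2 * (n * t_inv + t_and)) from Eq. (11.2).
# t_inv and t_and are illustrative placeholder delays, not measured values.

def inverters_for_frequency(f_target_hz, t_inv_s, t_and_s):
    """Number of inverters n such that the RO oscillates near f_target_hz."""
    # Solve f = 1 / (2 * (n * t_inv + t_and)) for n.
    n = (1.0 / (2.0 * f_target_hz) - t_and_s) / t_inv_s
    return max(1, round(n))

t_inv = 150e-12   # assumed per-inverter propagation delay (placeholder)
t_and = 200e-12   # assumed AND-gate delay (placeholder)

# Highest frequency needs the fewest inverters; lowest needs the most.
n_min = inverters_for_frequency(123e6, t_inv, t_and)
n_max = inverters_for_frequency(22e6, t_inv, t_and)
print(n_min, n_max)
```

The resulting [n_min, n_max] interval then gets partitioned into delay cells according to the desired frequency step and number of configurations.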
Next, to better explain our proposed structure, we pick one configuration as an
example. Figure 11.3 shows the structure of the PRO with three types of delay cells.
The type-0 delay cell (D0) has 4 inverters, the type-1 delay cell (D1) has 8 inverters,
and type-2 delay cell (D2) has 16 inverters. We instantiated 2 of each type of delay
cells in the inverter chain. All the delay cells have an even number of inverters, and
1 inverter is instantiated at the start of the inverter chain to make sure that there is
always an odd number of inverters in the inverter chain. When all the inverters are
configured to be used in the delay path, the propagation delay Tprop is maximal,
and therefore, the overall programmable RO will oscillate at its lowest frequency.
When all the delay cells are configured to use the shorting path, the propagation
delay Tprop is minimal, and therefore, the overall programmable RO will oscillate
at its highest frequency.
In our experiment setup, we implement PRO on a Xilinx Spartan-6 FPGA,
which is fabricated with 45 nm CMOS technology. Under the aforementioned
configuration, we measured a low oscillation frequency of 22 MHz and a high
oscillation frequency of 123.44 MHz. Since each delay cell’s SEL is independent,
there are in total 15 frequency configurations consisting of {1, 5, 9, ..., 57} inverters.
Since there are six SEL signals, there are 64 configurations in total that redundantly
map into the 15 achievable configurations. Through this redundancy, we are able
to estimate the local manufacturing process variations, which is helpful in deciding
when a deviation should be cause for alarm (i.e., fault detection) or not.
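The counting argument above can be checked with a short enumeration; this is an illustrative sketch of the example configuration (two delay cells each of 4, 8, and 16 inverters, plus one fixed inverter), not code from the actual design:

```python
from itertools import product

# Delay cells in the example PRO: two of each type.
# D0 = 4 inverters, D1 = 8 inverters, D2 = 16 inverters.
cell_sizes = [4, 4, 8, 8, 16, 16]

# Each of the six SEL bits independently selects the delay path (1) or
# the shorting path (0); one extra inverter keeps the loop inversion odd.
totals = set()
configs = 0
for sel in product([0, 1], repeat=len(cell_sizes)):
    configs += 1
    totals.add(1 + sum(s * size for s, size in zip(sel, cell_sizes)))

print(configs)         # 64 raw SEL configurations
print(len(totals))     # 15 distinct inverter counts
print(sorted(totals))  # {1, 5, 9, ..., 57} inverters
```

The gap between 64 SEL patterns and 15 distinct inverter counts is exactly the redundancy exploited to estimate local process variation.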
The designers can control the RO’s frequency by setting the input value of SEL.
We are using the same configurations for all subsequent experiments in this chapter.
Under this PRO configuration, each PRO can be implemented with 128 lookup
tables (LUTs) and 32 registers, in total 160 slices. In our experimental setup, a PRO
array with 36 PROs can cover the entire Spartan-6 FPGA (46,648 LUTs and 93,296
registers, in total 139,944 slices) to provide the whole chip power monitoring and
fault detection. Therefore, only an overhead of 4.1% is introduced.
As a security add-on, PRO can be integrated into the design to protect
simple designs such as hardware encryption engines as well as complex systems
such as an SoC. Control signals are needed for communicating with the PRO.
The control signals set up the user control configurations in Table 11.1. Generally,
different control mechanisms can be adopted by the designer. In an SoC, the
designer can add PROs as a co-processor that can be controlled by the processor
through memory-mapped registers. Under this environment, the software running
on the processor can configure the PROs on the chip. Therefore, PRO-based coun-
termeasures can be dynamically enabled/disabled while the software is running.
Alternatively, a hardware-based finite-state machine (FSM) can be used to control
the PROs. In our experiments, we use the UART protocol to communicate with the
PRO, and the control signals sent through the UART are generated by a Python
script.
Figure 11.4 shows the high-level basic principle of the fault detection mechanism
using the PROs.
Fig. 11.4 The basic principle for PRO fault detection is to look for inconsistency in the
accumulated period count over a given time interval
The counter value will be evaluated at the end of each
monitoring interval and compared with the reference counter value to get the
actual oscillation frequency of the PRO. Under normal circumstances, each PRO
oscillates at a certain constant frequency, and thus, its counter value will increase
linearly during the monitoring interval. There will be some small variances caused
by the environmental changes, jitters, process variance of the manufacturer, etc.
A characterization procedure, therefore, is needed to define the range of normal
operation [38]. However, in the occurrence of instant fault injection (e.g., power
glitch, EM pulse, time glitch, laser pulse), the counter will be disturbed. The counter
value read out at the evaluation time will deviate from the normal range, and thus,
a pulse fault injection will be detected by the PROs. Additionally, an adversary can
inject timing faults by stressing the PDN continuously (e.g., power starving). As a
result, the victim circuit will operate slower and cause timing violations to create
faults. In this case, the PRO counter value will also deviate from the normal value
and capture the fault injection event. In this chapter, we use power and EM faults
as illustrations, but PRO’s fault detection coverage is not only limited to these two
fault types.
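The interval check described above can be sketched as follows; the counter values and the simple characterization rule are illustrative assumptions standing in for the cited calibration procedure:

```python
# Sketch: PRO fault detection by checking the accumulated period count
# against a characterized normal range. All values below are illustrative.

def characterize(baseline_counts):
    """Derive the normal operating range from fault-free calibration runs."""
    lo, hi = min(baseline_counts), max(baseline_counts)
    margin = (hi - lo) or 1  # widen by the observed spread (jitter, process)
    return lo - margin, hi + margin

def check_interval(count, normal_range):
    """Return True if the interval's count deviates from the normal range."""
    lo, hi = normal_range
    return not (lo <= count <= hi)

# Calibration: counter values read at the end of fault-free intervals.
normal = characterize([10010, 10005, 9998, 10002])

print(check_interval(10003, normal))  # nominal interval -> no alarm
print(check_interval(8200, normal))   # power-starved interval -> alarm
```

A pulse fault perturbs a single interval's count, while continuous power starving depresses the count in every interval; both fall outside the characterized range.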
Masking and hiding are two popular techniques for side-channel countermeasures.
In masking, each secret variable is split into two or more shares that are concealed by
random numbers [6]. The side-channel leakage of each share alone does not reveal
the secret variable because of the randomization introduced by random numbers.
A random source that provides fresh random variables is significantly important in
masking implementations. Hiding countermeasures reduce the SNR for secret data-
dependent operations. Hiding can be achieved by several techniques, such as by
reshuffling cryptographic operations [41], inserting random delays [7], and running
multiple tasks in parallel [37]. In this chapter, we utilize the proposed PRO design
as a hiding countermeasure by injecting noise with random frequency. Previous
work has proposed injecting noise to reduce the SNR [8, 26]. However, since only
single-frequency noise is injected, it is not difficult for an attacker to decrease the
effect of noise either by using a band-pass filter while collecting traces or by post-
processing the collected power traces, such as applying averaging, filtering, and
frequency-domain analysis. Thus previously proposed noise-injection-based hiding
mechanisms still have security flaws. In our proposed design, we inject random-
frequency noise with the PRO so that it will be much harder for an adversary
to eliminate the noise.
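A minimal sketch of such a random-frequency hopping policy follows; the SEL width and the hop count match the example configuration in this chapter, while the scheduling code itself is illustrative rather than our actual control script:

```python
import random

# Sketch: random frequency hopping for the PRO hiding countermeasure.
# Each hop draws a fresh 6-bit SEL pattern so the injected noise never
# dwells at a single frequency.

NUM_SEL_BITS = 6  # one SEL bit per delay cell in the example PRO

def next_sel(rng):
    """Draw a random SEL pattern, i.e. a random frequency configuration."""
    return [rng.randint(0, 1) for _ in range(NUM_SEL_BITS)]

def hop_schedule(num_hops, seed=0):
    """Generate the sequence of SEL patterns for one protected run."""
    rng = random.Random(seed)
    return [next_sel(rng) for _ in range(num_hops)]

# A 41 ms AES run with a 2 ms hop interval sees at least 20 hops.
schedule = hop_schedule(num_hops=21)
print(len(schedule))
```

In a real deployment the patterns would be written to the PRO's SEL inputs over UART or memory-mapped registers, or drawn from an on-chip PRNG.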
The countermeasure circuit consists of a single PRO whose frequency can be
controlled by the SEL input signals. The PRO drives one of the I/O pins on the
board. As demonstrated in previous work [21, 26], the power consumed by a single
RO is not large enough to have a significant influence on the power profile of a
complete chip or a complete cipher. Instead, hundreds of ROs need to be instantiated
on the chip to have a profound hiding influence. This approach will cause significant
design overhead and has the potential risk of inducing power faults into the circuit
[18].
Fig. 11.5 Experimental setup for evaluating RO's performance in side-channel leakage hiding
In our proposed mechanism, by driving an I/O pin with a PRO, the effect
of a single (randomly switched) PRO to influence the off-chip power network is
amplified. Since the load capacitance of an I/O pin is much larger than the load
capacitance of an internal FPGA net, the IO pin requires more energy to charge
and discharge. In this manner, even with a single programmable ring oscillator
(PRO), sufficient additional power is consumed to affect the power consumption
characteristic (see Fig. 11.6). In practice, an adversary senses the on-chip power
consumption using a probe, either by connecting an external probe to the system
via a power supply pin [42] or else using an EM probe. Both of these probing
approaches depend on the off-chip power network, and therefore, perturbing
the off-chip power network is an important factor in defeating an attacker who
maliciously monitors the power profile.
The performance of our proposed hiding countermeasure design is evaluated with
AES-128. Figure 11.5 shows our experimental setup. We implement a hardware
AES core as well as the programmable sensor in the FPGA. The output signal
of the PRO is mapped to drive the I/O pin to amplify the noise effect. For each
encryption scenario, plaintext and ciphertext are provided through the UART for
AES. The communication procedure is controlled by the AES control script. At
the same time, we use the sensor control script to send in control signals through
the PRO UART (Fig. 11.5). The control signals can enable/disable the RO and
configure the oscillation frequency of the RO. While AES is running, the sensor’s
control script generates random numbers for the frequency configuration so that the
frequency of the PRO can change randomly. Equivalently, an on-chip pseudo-random
number generator (PRNG) can be used for this purpose. Figure 11.6a shows the
collected AES power trace when the programmable sensor is off. We can clearly
see the pattern of ten rounds of the AES algorithm. By comparison, the power trace
changes to a repeated oscillation pattern when we turned the PRO on, as shown in
Fig. 11.6b, which indicates the strong influence of PRO on the power profile.
Fig. 11.6 AES power traces when PRO is (a) off; (b) on
Under
our setup, the complete AES takes 41 ms, and we configure the PRO control script
such that the frequency of the PRO changes every 2 ms, which means that the PRO’s
frequency will change at least 20 times while AES is running. Figure 11.7a shows
the frequency spectrum of the power traces when the PRO is off. We can observe
small peaks at the clock frequency (24 MHz). We do not observe a significant
influence on the power spectrum if we only put a single PRO without driving the IO
pin. Figure 11.7c and d shows the power spectrum when the PRO is on and driving
the output pin. By comparing to Fig. 11.7a, a significant influence on the frequency
spectrum of the power profile can be observed, while the PRO is on. Figure 11.7c
shows a sharp peak when the PRO's oscillation frequency is fixed at 120 MHz.
A single-frequency RO is a poor noise-injection countermeasure. Indeed, it is
easy for the attacker to implement frequency spectrum analysis, find the injected
noise frequency, and apply the corresponding filter to eliminate the influence of
the injected protection noise. As a sharp comparison, Fig. 11.7d shows that when
random-frequency noise is injected by the PRO, the frequency spectrum is spread
across the PRO's oscillation range from 22 MHz to 123 MHz. This makes it much
harder for the adversary to filter out the noise by post-processing. To further evaluate
the effectiveness of the proposed design on increasing side-channel resistance, we
applied TVLA [12] on 50 k collected traces; as shown in Fig. 11.8, a dramatic
decrease in t-value can be observed when the PRO is turned on compared to when
the PRO is off. This indicates that the PRO design can significantly reduce the side-
channel leakage of the circuit.
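The TVLA check itself reduces to Welch's t-test between two trace sets at each sample point, with |t| above roughly 4.5 indicating detectable leakage; the sketch below uses tiny synthetic traces purely for illustration:

```python
import math

# Sketch: Welch's t-statistic as used in TVLA leakage assessment.
# The two trace sets below are synthetic stand-ins for measured power
# samples at one point in time (fixed-input vs. random-input traces).

def welch_t(set_a, set_b):
    """Welch's t-statistic between two sets of power samples."""
    na, nb = len(set_a), len(set_b)
    ma = sum(set_a) / na
    mb = sum(set_b) / nb
    va = sum((x - ma) ** 2 for x in set_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in set_b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

fixed_input  = [1.00, 1.02, 0.99, 1.01, 1.00, 1.02]  # shifted mean: leaky
random_input = [0.90, 0.92, 0.91, 0.89, 0.90, 0.91]

t = welch_t(fixed_input, random_input)
print(abs(t) > 4.5)  # leakage detected for these synthetic traces
```

With the PRO on, the random-frequency noise lowers |t| across the trace, which is exactly the drop observed in Fig. 11.8.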
Fig. 11.7 Power spectrum for power traces when (a) PRO off; (b) PRO on without driving I/O
pin; (c) PRO with fixed oscillation frequency and driving I/O pin; (d) PRO with random oscillation
frequency and driving I/O pin
Note Generally, even though the adversary is aware of the noise signal, since
the noise is injected by the PRO at a random frequency that also changes at a
fast pace, it is exceedingly hard to remove its effect with normal post-processing
techniques. The adversary would need to monitor both the power consumption and
the output of the PRO simultaneously with sufficient precision, and would have
to remove the part of the power consumption related to the output pad's oscillation
using noise-cancellation techniques, which requires high-end equipment. Additionally,
to obtain sufficient information for a successful side-channel analysis from
the collected side-channel traces, the sampling rate has to be at least 2× the clock
frequency (according to the Nyquist theorem). We suggest that when choosing the
initial design configurations in Table 11.1, designers adjust the configurations so
that the oscillation frequency range of the PRO covers at least 3× the clock
frequency. As a result, the adversary will need a higher-end device with a much
higher sampling frequency (at least 10× the clock frequency) to successfully apply
the same side-channel attack. Hence, the PRO as a hiding countermeasure makes it
much harder to attack the circuit by substantially raising the technical bar for
adversaries.
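The arithmetic behind these multipliers can be checked directly from the chapter's own numbers (24 MHz clock, PRO range up to 123 MHz); this sketch simply evaluates the Nyquist bounds:

```python
f_clk = 24e6        # system clock used in this chapter's setup
f_pro_max = 123e6   # upper end of the PRO's random oscillation range

baseline = 2 * f_clk       # Nyquist rate to attack the unprotected design
with_pro = 2 * f_pro_max   # rate needed to also resolve the PRO's noise

print(baseline / 1e6)    # 48.0 (MS/s)
print(with_pro / 1e6)    # 246.0 (MS/s)
print(with_pro / f_clk)  # 10.25 -- roughly the 10x factor in the text
```

With the PRO's ceiling above 5× the clock, capturing (and cancelling) its output pushes the required sampling rate past 10× the clock frequency, matching the argument above.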
We first investigate the PRO’s frequency with respect to external power variations.
Figure 11.9 shows the setup for this experiment scenario. We put a single PRO
sensor on the FPGA. For the PRO’s frequency measurement, we start the PRO
sensor’s counter and system clock counter CTR at the same time. After running
for an arbitrary amount of time T_arb, we read out the RO sensor's counter value
C_RO and the reference system clock counter value C_clk through the UART. Then,

    f_PRO = (C_RO / C_clk) · f_clk,    (11.5)

where f_PRO is the PRO sensor oscillation frequency and f_clk is the reference clock
frequency. We measure the value of C_RO 1000 times and take the average for
better precision. The measurement procedure is automated through a control script
running on a PC.
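Eq. (11.5) is simple enough to mirror in a few lines of Python; the counter values below are hypothetical readouts, not measurements from this chapter:

```python
def pro_frequency(c_ro, c_clk, f_clk):
    """Eq. (11.5): f_PRO = (C_RO / C_clk) * f_clk, with both counters
    sampled over the same arbitrary interval T_arb."""
    return (c_ro / c_clk) * f_clk

# Hypothetical single readout: the 24 MHz reference counted 24_000 ticks
# while the PRO counter reached 120_000 over the same interval.
single = pro_frequency(120_000, 24_000, 24e6)
print(single / 1e6)   # 120.0 (MHz)

# The chapter averages 1000 readouts for precision; three hypothetical
# ones illustrate the averaging step here.
readouts = [(119_990, 24_000), (120_000, 24_000), (120_010, 24_000)]
avg = sum(pro_frequency(c_ro, c_clk, 24e6)
          for c_ro, c_clk in readouts) / len(readouts)
```

Because only the ratio of the two counters matters, the measurement interval T_arb itself never needs to be known precisely.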
To investigate the PRO's power sensing sensitivity when operating under different
frequencies, we set the PRO sensor to several oscillation frequencies at the
starting (highest) power supply voltage for the main FPGA core (1.33 V). The
frequency configurations we pick are 153 MHz, 100 MHz, 66 MHz, 40 MHz, and
27 MHz. We gradually decrease the FPGA’s supply voltage and monitor the PRO
sensor’s oscillation frequency.
Figure 11.10 shows the PRO oscillation frequency as a function of the
external supply voltage. As shown in the figure, when the external supply voltage
drops, the PRO’s frequency drops steadily. The PRO’s oscillation frequency reflects
the power supply voltage, and therefore, it can sense the power supply changes and
can be used for power monitoring. With respect to the sensitivity of power sensing, it
can be observed that the higher the oscillation frequency is, the sharper the slope of
the frequency vs. the external supply voltage line will be. This indicates that a higher
oscillation frequency can achieve higher sensitivity in detecting power variations.
Fig. 11.10 PRO’s oscillation frequency with regard to external power supply voltage
After investigating the correspondence between the PRO sensor's oscillation
frequency and the variations of external power, we evaluate the power sensing
performance with regard to the on-die local power variations. Previous work
has shown that RO-based power wasters can cause a local power supply drop
[31, 35, 46]. This will cause the local circuit’s logic to operate at a lower voltage;
therefore, the local power sensor should show a decrease in the oscillation frequency
when the power wasters are turned on. In this chapter, we adopt the RO-based power
waster shown in Fig. 11.11. Each power waster has five inverters in the delay chain
with an AND gate and oscillates at 245 MHz. A global enable signal is used to turn
on/off all the power wasters in the circuit.
Figure 11.12 shows the experimental setup for the local power sensing evaluation.
In this setup, UART communication is used to read out the PRO's counter
value. We constrain the power waster to locations near the PRO sensor to induce
a local power drop. By configuring the number of power wasters, we can control
the amount of local power drop. An on-board dip switch is used to enable/disable
the power wasters. In a measurement scenario, we gradually increase the number
of power wasters. For each power waster configuration, we measure the PRO’s
oscillation frequency 1000 times and take the average with power waster on/off,
respectively. Next, we calculate the frequency drop ratio as follows:
    Frequency Drop Ratio = (f_off − f_on) / f_off.    (11.6)

In Eq. (11.6), f_off denotes the PRO sensor's frequency when the power wasters are
disabled (turned off) and f_on denotes its frequency when the power wasters are
enabled (turned on). The results from the experiment are shown in Fig. 11.13 when
different numbers of power wasters are enabled. As more power wasters are enabled,
the frequency drop ratio increases correspondingly. We can observe a nearly linear
relationship between the number of power wasters and sensor oscillation slowdown.
A linear regression closely models the correlation between the number of power
wasters and the frequency drop ratio: f(x) = 0.00031x + 0.247, with an R-squared
value of 0.991. Therefore, we conclude that
the PRO can effectively sense local power fluctuations.
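As a sanity check of Eq. (11.6) and of the reported linear trend, the following sketch fits synthetic drop-ratio data generated from the stated model f(x) = 0.00031x + 0.247; the frequencies themselves are hypothetical:

```python
import numpy as np

def drop_ratio(f_off, f_on):
    """Eq. (11.6): relative slowdown of the PRO when wasters are enabled."""
    return (f_off - f_on) / f_off

wasters = np.array([0, 100, 200, 300, 400, 500])   # waster counts
f_off = 150e6                                      # hypothetical quiet frequency
# Enabled-waster frequencies that follow the chapter's fitted model exactly.
f_on = f_off * (1 - (0.00031 * wasters + 0.247))

# A degree-1 least-squares fit recovers the model's slope and intercept.
slope, intercept = np.polyfit(wasters, drop_ratio(f_off, f_on), 1)
print(round(slope, 5), round(intercept, 3))
```

The near-linear relationship is what makes the drop ratio usable as a quantitative proxy for local power draw rather than a mere on/off indicator.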
In this experimental scenario, we evaluate the PRO sensor’s frequency change with
respect to the spatial proximity to the switching logic that consumes the power. In
Fig. 11.13 PRO frequency drop ratio as a function of the number of active power wasting circuits
Fig. 11.14 PRO’s average frequency drop ratio for each row versus the spatial proximity of the
power wasters
this experiment, we instantiate 36 PRO sensors to get full spatial coverage of the
FPGA. For this experimental scenario, 36 sensors reside in nine rows, and each row
has four sensors.
To remove the process variations among the PRO instances, we calculate the
frequency drop ratio for each PRO instance following Eq. (11.6). We first measure the
frequency drop ratio for all the sensors. Then, we take the average of the frequency
drop of the four PROs in each row. The results are shown in Fig. 11.14. We observe
that as the PRO sensors are placed closer to the power wasters (from Row 0 to Row
8), the frequency drop ratio increases. Therefore, the spatial distance of
the PRO sensor to the switching logic (power wasters) is indeed reflected in
the frequency drop ratio. We can further use this feature to detect the location of
injected faults on the chip, as demonstrated later in this chapter. Note that there is
one outlier among our sensors, which might be attributed to the structure of the
power distribution network, in which the power in the center of the
chip is designed to be more stable [33].
Sharing the same PDN between a potential adversary and a victim opens the door to
a new array of attacks. Adversarial logic can impose changes in the voltage level
to cause timing faults in the victim circuit [1, 23, 28, 36]. Since all these attacks
affect the PDN, we aim to build sensors that are sufficiently sensitive to the voltage
level and can therefore detect such attacks. Detecting ongoing fault injection attacks
prevents the resulting timing faults from going unnoticed.
Figure 11.15 shows our experimental setup for evaluating the power fault
detection performance of our sensor. We instantiate AES as well as the PRO sensor
array on the FPGA. Power wasters are placed locally on the chip to simulate
the situation when local power faults are induced by an adversary. An on-board
dip switch controls the activation of the power wasters. A control script starts
the AES co-processor, sends the plaintext, and reads back the ciphertext; it also
monitors the correctness of the resulting ciphertext. We adjust the number of
instantiated power wasters while AES is running. When faulty ciphertexts are
observed, we know that a power fault has been successfully injected, which ensures
that the power faults detected by the PRO are actually effective faults. Next, we
read out the PRO's counter value through the sensor's control script both when the
fault is injected and when it is not, and compare the values. Note that as a
chip-level sensor, our goal is to detect the
Fig. 11.15 The test setup uses a user-controlled power waster to test PRO fault detection. The
power waster is visible as an orange field of dots on the left side of the floorplan between row 1
and row 2, and between row 2 and row 3
location of the attacker instead of identifying that a fault has occurred within the
victim algorithm or circuit.
Figure 11.16 shows the floorplan of the aforementioned setup. We place 36
sensors on the chip and instantiate 524 power wasters to generate power
faults. We first put the power wasters at Row 1 and Row 2 on the left, shown
as the orange blocks in Fig. 11.16. This power waster location is further identified
as location-1. While AES is running, we read out the sensors' counter values when
the faults are injected and not injected by power wasters, respectively. Then we
calculate the frequency drop ratio based on Eq. (11.6). With the PRO sensor data, we
are able to find the location of the power fault. First, to locate which row has the
power fault, we take the average of the four PRO sensors’ frequency drop ratio in
each row. Figure 11.17 shows the result of each row’s average frequency drop ratio.
The maximum frequency drop ratio points to a location adjacent to Row 2. This
demonstrates that our sensor array can point to the correct row in which the fault
has occurred. Then, we divide the chip into two regions, left and right. To locate the
fault region, we average the frequency drop ratios of the 18 sensors in the two left
columns and the 18 sensors in the two right columns separately. The average frequency drop ratio in the left
region is 0.2184, and the average frequency drop in the right region is 0.213. The
drop in the left region is higher than that in the right region, which indicates that the
source of the fault is in the left region. This demonstrates that our sensor array can
locate the correct fault column. Now, after analyzing the data of the sensor array, we
can locate the power fault’s location in Row 2, left region.
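The two-step localization just described (pick the row with the largest average drop, then compare left versus right region averages) can be sketched on a hypothetical 9×4 grid of drop ratios; the numbers below are synthetic, not the chapter's measurements:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 9x4 grid of frequency-drop ratios: rows 0-8, four sensors
# per row; columns 0-1 form the left region, columns 2-3 the right.
grid = 0.20 + 0.002 * rng.random((9, 4))
grid[2, :2] += 0.02            # sensors next to the injected fault slow most

row_avg = grid.mean(axis=1)            # step 1: average the 4 PROs per row
fault_row = int(np.argmax(row_avg))
left = grid[:, :2].mean()              # step 2: 18 left sensors vs 18 right
right = grid[:, 2:].mean()
fault_side = "left" if left > right else "right"
print(fault_row, fault_side)   # 2 left
```

Averaging within each row first suppresses per-sensor process variation, which is why the chapter normalizes with the drop ratio before comparing sensors.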
To further demonstrate the capability of the proposed PRO sensor in detecting
the fault locations, we placed the power wasters in different locations, while AES
is running. We repeat the same experimental scenario to locate the faulty row and
Fig. 11.17 PRO average frequency drop ratio for each row when a power fault happens at location-
1 (the power wasters are placed as shown in Fig. 11.16)
faulty column. We first put the power wasters at Rows 4 and 5 in the left region
and obtain the faulty-row results shown in Fig. 11.18. The highest frequency drop
ratio points to Row 4, which indicates that the fault occurs adjacent to Row 4. The
left region's average frequency drop is 0.2159 and the right region's is 0.2091,
which indicates that the left region contains the fault. Therefore, the sensor array
locates the place where
Fig. 11.18 Floorplan and corresponding PRO average frequency drop ratio for each row when
power fault happens at location-2. Black blocks denote PROs in the floorplan, and red blocks
denote power waster positions in the floorplan
Fig. 11.19 Floorplan and the corresponding PRO average frequency drop ratio for each row when
power fault happens at location 3. Black blocks denote PROs in the floorplan, and red blocks denote
power waster positions in the floorplan
the fault is injected as Row 4 in the left region. Next, we move the power wasters
to Rows 1 and 2 on the right, as shown in Fig. 11.19. We observe that the left
region's average frequency drop is 0.2083 and the right region's is 0.2204, which
indicates that the right region has the fault. In Fig. 11.19, the highest frequency
drop ratio correctly points to Row 2. Therefore, we demonstrate that our proposed
sensor can detect the location of an on-chip power fault.
EMFI is a well-known active attack. It uses an active probe to apply an intense and
transient magnetic field to integrated circuits (ICs). An EM pulse causes a sudden
current flow in the circuit of the targeted IC, and therefore the local supply voltage
drops. This voltage drop produces timing faults such as bit flips, bit sets, and bit
resets due to timing-constraint violations, as well as sampling faults caused by
disrupting the switching of D flip-flops when the EM perturbations are synchronous
with rising clock edges. This enables the adversary to exploit such faults to extract sensitive
content from the device. Previous research has shown that EM perturbations can
cause faulty computations, alter the program flow, and cause bit flips in the contents
of the memory. Other authors have demonstrated that EM can induce faults into
devices [11, 34]. In the past few years, EM fault injection attacks have gained
increasing attention. In this section, we investigate the performance of our proposed
PRO sensor with regard to EM fault injection.
Figure 11.20 shows the experimental setup for evaluating EM fault injection
detection performance. In this setup, we instantiate AES and a PRO array with 36
sensors on the FPGA. A script sends the plaintext, starts the AES, and reads the
ciphertext. While AES is running, an EM probe is placed in a fixed
position on top of the FPGA chip surface at a vertical distance of approximately
1.5 mm. It generates an EM pulse to induce faults. The EM probe’s tip is 4 mm
in diameter and produces a magnetic field that is perpendicular to the surface
of the chip. A glitch controller controls the time and intensity of the EM pulse.
While AES is running, we adjust the intensity of the EM pulse. When a faulty
ciphertext is observed, we know that an effective EM fault is injected. Next, in
Fig. 11.20 This experimental setup uses the PRO to detect EM fault injection
Fig. 11.21 Influence on the frequency distribution; X-axis is probability and Y-axis is frequency
each measurement, we read out the PRO sensor’s counter value through the sensor’s
control script when the fault is injected and not injected, respectively, and compare
their values.
We collected 1000 frequency measurements for all 36 PROs. For each PRO
sensor, we investigate the distribution of the 1000 frequency measurements when
the EM fault is injected and when it is not. We observe that the EM fault can cause
variations in the PRO's frequency distribution. Figures 11.21 and 11.22 show
comparisons of the frequency distributions with and without EM fault injection
for RO-0 to RO-15. We notice that the PRO sensor's frequency shifts to a larger
value when faults are injected. Besides frequency shifting, we also observe another
reaction the PRO sensors can have to fault injection: EM faults can also cause
faulty counter values for RO-23 to RO-27 and RO-31 to RO-36. When the faults
are injected, the counter values read out through the UART jump to a large (and
faulty) value of 4.08 × 10^7 MHz. Therefore, by monitoring the values of the PRO
counters, we can detect ongoing electromagnetic fault injection (EMFI) at runtime.
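A runtime monitor built on these two observations (distribution shift plus occasional impossible counter values) could look roughly like the following sketch; the thresholds, baseline numbers, and function name are hypothetical, not part of the chapter's design:

```python
import statistics

def emfi_alarm(freqs_mhz, baseline_mean, baseline_std,
               sanity_max=1e6, shift_sigma=6.0):
    """Flag a PRO readout stream as EMFI-suspicious if either
    (a) a frequency is physically impossible (like the faulty ~4.08e7 MHz
        counter values observed for some ROs under EM pulses), or
    (b) the mean frequency has shifted far from the calibrated baseline."""
    if max(freqs_mhz) > sanity_max:
        return True                           # corrupted counter value
    shift = abs(statistics.fmean(freqs_mhz) - baseline_mean)
    return shift > shift_sigma * baseline_std

base_mean, base_std = 120.0, 0.05   # hypothetical quiet-chip calibration

print(emfi_alarm([119.95, 120.0, 120.05], base_mean, base_std))  # quiet chip
print(emfi_alarm([120.9, 121.0, 121.1], base_mean, base_std))    # shifted mean
print(emfi_alarm([120.0, 4.08e7, 120.1], base_mean, base_std))   # bad counter
```

Treating the impossible counter value as a detection signal in its own right, rather than discarding it as noise, is what lets the sensor array catch both fault-injection symptoms described above.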
11.7 Conclusion
Acknowledgments This research was supported in part through NSF Award 1931639 and NSF
Award 2219810.
References
1. Alam, M. M., Tajik, S., Ganji, F., Tehranipoor, M., & Forte, D. (2019). RAM-Jam: Remote
temperature and voltage fault attack on FPGAs using memory collisions. In 2019 Workshop
on fault diagnosis and tolerance in cryptography (FDTC), pp. 48–55. https://doi.org/10.1109/
FDTC.2019.00015.
2. Anik, M. T. H., Ebrahimabadi, M., Pirsiavash, H., Danger, J. L., Guilley, S., & Karimi, N.
(2020). On-Chip voltage and temperature digital sensor for security, reliability, and portability.
In 2020 IEEE 38th international conference on computer design (ICCD) (pp. 506–509). IEEE.
3. Bar-El, H., Choukri, H., Naccache, D., Tunstall, M., & Whelan, C. (2006). The sorcerer’s
apprentice guide to fault attacks. Proceedings of the IEEE, 94(2), 370–382.
4. Biham, E., & Shamir, A. (1997). Differential fault analysis of secret key cryptosystems. In
Annual international cryptology conference (pp. 513–525). Springer.
5. Brier, E., Clavier, C., & Olivier, F. (2004). Correlation power analysis with a leakage model.
In International workshop on cryptographic hardware and embedded systems (pp. 16–29).
Springer.
6. Chari, S., Jutla, C.S., Rao, J. R., & Rohatgi, P. (1999). Towards sound approaches to counteract
power-analysis attacks. In M. Wiener (Ed.), Advances in Cryptology—CRYPTO’ 99 (pp. 398–
412). Springer.
7. Coron, J. S., & Kizhvatov, I. (2010). Analysis and improvement of the random delay
countermeasure of CHES 2009. In International workshop on cryptographic hardware and
embedded systems (pp. 95–109). Springer.
8. Das, D., Maity, S., Nasir, S. B., Ghosh, S., Raychowdhury, A., & Sen, S. (2017). High efficiency
power side-channel attack immunity using noise injection in attenuated signature domain. In
2017 IEEE international symposium on Hardware Oriented Security and Trust (HOST) (pp.
62–67). IEEE.
9. Elnaggar, R., Chaudhuri, J., Karri, R., & Chakrabarty, K. (2022). Learning malicious circuits
in FPGA bitstreams. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 42(3), 726–739. https://doi.org/10.1109/TCAD.2022.3190771.
10. Fuhr, T., Jaulmes, E., Lomné, V., & Thillard, A. (2013). Fault attacks on AES with faulty
ciphertexts only. In 2013 Workshop on fault diagnosis and tolerance in cryptography (pp.
108–118). IEEE.
11. Ghodrati, M., Yuce, B., Gujar, S., Deshpande, C., Nazhandali, L., & Schaumont, P. (2018).
Inducing local timing fault through EM injection. In 2018 55th ACM/ESDA/IEEE design
automation conference (DAC) (pp. 1–6). https://doi.org/10.1109/DAC.2018.8465836.
12. Gilbert Goodwill, B. J., Jaffe, J., Rohatgi, P., et al. (2011). A testing methodology for side-
channel resistance validation. In NIST non-invasive attack testing workshop (vol. 7, pp. 115–
136).
13. Glamocanin, O., Coulon, L., Regazzoni, F., & Stojilovic, M. (2020). Are cloud FPGAs
really vulnerable to power analysis attacks? In 2020 Design, automation and test in Europe
conference and exhibition, DATE 2020, Grenoble, France, March 9–13, 2020 (pp. 1007–1010).
IEEE. https://doi.org/10.23919/DATE48585.2020.9116481.
14. Gnad, D. R., Oboril, F., Kiamehr, S., & Tahoori, M. B. (2016). Analysis of transient voltage
fluctuations in FPGAs. In 2016 International conference on field-programmable technology
(FPT) (pp. 12–19). https://doi.org/10.1109/FPT.2016.7929182.
15. Gnad, D. R. E., Krautter, J., & Tahoori, M. B. (2019). Leaky noise: New side-channel attack
vectors in mixed-signal IoT devices. IACR Transactions on Cryptographic Hardware and
Embedded Systems, 2019(3), 305–339. https://doi.org/10.13154/tches.v2019.i3.305-339.
16. Gravellier, J., Dutertre, J. M., Teglia, Y., Loubet-Moundi, P. (2019). High-speed ring oscillator
based sensors for remote side-channel attacks on FPGAs. In 2019 International conference
on ReConFigurable computing and FPGAs (ReConFig) (pp. 1–8). https://doi.org/10.1109/
ReConFig48160.2019.8994789.
17. He, W., Breier, J., Bhasin, S., Miura, N., & Nagata, M. (2016). Ring oscillator under laser:
Potential of PLL-based countermeasure against laser fault injection. In 2016 Workshop on fault
diagnosis and tolerance in cryptography (FDTC) (pp. 102–113). IEEE.
18. Kim, C. H., & Quisquater, J. J. (2007). Faults, injection methods, and fault attacks. IEEE
Design & Test of Computers, 24(6), 544–545.
19. Kim, C. K., Kong, B. S., Lee, C. G., & Jun, Y. H. (2008). CMOS temperature sensor with ring
oscillator for mobile DRAM self-refresh control. In 2008 IEEE international symposium on
circuits and systems (pp. 3094–3097). IEEE.
20. Kocher, P. (1996). Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other
systems. In N. Koblitz (Ed.), CRYPTO ’96, LNCS (vol. 1109, pp. 104–113). Springer.
21. Krautter, J., Gnad, D., & Tahoori, M. (2020). CPAmap: On the complexity of secure FPGA
virtualization, multi-tenancy, and physical design. In IACR transactions on cryptographic
hardware and embedded systems (pp. 121–146).
22. Krautter, J., Gnad, D. R., Schellenberg, F., Moradi, A., & Tahoori, M. B. (2019). Active
fences against voltage-based side channels in multi-tenant FPGAs. In 2019 IEEE/ACM
international conference on computer-aided design (ICCAD) (pp. 1–8). https://doi.org/10.
1109/ICCAD45719.2019.8942094.
23. Krautter, J., Gnad, D. R., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault attacks
on shared FPGAs, suitable for DFA on AES. In IACR transactions on cryptographic hardware
and embedded systems (pp. 44–68)
24. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems, 13(3), 15:1–15:31. https://doi.org/10.1145/3402937.
25. Li, X., Tessier, R., & Holcomb, D. E. (2022). Precise Fault Injection to Enable DFIA for
Attacking AES in Remote FPGAs. In 30th IEEE annual international symposium on field-
programmable custom computing machines, FCCM 2022, New York City, NY, USA, May
15–18, 2022 (pp. 1–5). IEEE. https://doi.org/10.1109/FCCM53951.2022.9786154.
26. Liu, P. C., Chang, H. C., & Lee, C. Y. (2010). A low overhead DPA countermeasure circuit
based on ring oscillators. IEEE Transactions on Circuits and Systems II: Express Briefs, 57(7),
546–550.
27. Liu, P. C., Chang, H. C., & Lee, C. Y. (2010). A low overhead DPA countermeasure circuit
based on ring oscillators. IEEE Transactions on Circuits and Systems II: Express Briefs, 57(7),
546–550. https://doi.org/10.1109/TCSII.2010.2048400.
28. Mahmoud, D., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant FPGAs.
In 2019 Design, automation test in europe conference exhibition (DATE), pp. 1745–1750.
https://doi.org/10.23919/DATE.2019.8715263.
29. Mangard, S., Oswald, E., & Popp, T. (2008). Power analysis attacks: Revealing the secrets of
smart cards (Vol. 31). Springer Science & Business Media.
30. Miura, N., Najm, Z., He, W., Bhasin, S., Ngo, X. T., Nagata, M., & Danger, J. L. (2016).
PLL to the rescue: A novel EM fault countermeasure. In 2016 53rd ACM/EDAC/IEEE design
automation conference (DAC) (pp. 1–6). https://doi.org/10.1145/2897937.2898065.
31. Moini, S., Li, X., Stanwicks, P., Provelengios, G., Burleson, W., Tessier, R., & Holcomb,
D. (2020). Understanding and comparing the capabilities of on-chip voltage sensors against
remote power attacks on FPGAs. In 2020 IEEE 63rd international midwest symposium on
circuits and systems (MWSCAS) (pp. 941–944). IEEE.
32. Moini, S., Tian, S., Szefer, J., Holcomb, D., & Tessier, R. (2021). Remote Power side-channel
attacks on BNN accelerators in FPGAs. In Design, automation and test in Europe conference
(DATE).
33. Popovich, M., Mezhiba, A., & Friedman, E. G. (2007). Power distribution networks with on-
chip decoupling capacitors. Springer Science & Business Media.
34. Poucheret, F., Tobich, K., Lisarty, M., Chusseauz, L., Robissonx, B., & Maurine, P. (2011).
Local and direct EM injection of power into CMOS integrated circuits. In 2011 Workshop on
fault diagnosis and tolerance in cryptography (pp. 100–104). https://doi.org/10.1109/FDTC.
2011.18.
35. Provelengios, G., Holcomb, D., & Tessier, R. (2019). Characterizing power distribution
attacks in multi-user FPGA environments. In 2019 29th International conference on field
programmable logic and applications (FPL) (pp. 194–201). IEEE.
36. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power distribution attacks in multitenant
FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 28(12), 2685–
2698. https://doi.org/10.1109/TVLSI.2020.3027711.
37. Standaert, O. X., Peeters, E., Rouvroy, G., & Quisquater, J. J. (2006). An overview of power
analysis attacks against field programmable gate arrays. Proceedings of the IEEE, 94(2), 383–
394.
38. Tajik, S., Fietkau, J., Lohrke, H., Seifert, J. P., & Boit, C. (2017). PUFMon: Security monitoring
of FPGAs using physically unclonable functions. In 2017 IEEE 23rd International symposium
on on-line testing and robust system design (IOLTS) (pp. 186–191). IEEE.
39. Tian, S., Krzywosz, A., Giechaskiel, I., & Szefer, J. (2020). Cloud FPGA security with RO-
Based primitives. In 2020 International conference on field-programmable technology (ICFPT)
(pp. 154–158). https://doi.org/10.1109/ICFPT51103.2020.00029.
40. Tian, S., Moini, S., Wolnikowski, A., Holcomb, D. E., Tessier, R., & Szefer, J. (2021).
Remote power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In 29th
IEEE annual international symposium on field-programmable custom computing machines,
FCCM 2021, Orlando, FL, USA, May 9–12, 2021 (pp. 242–246). IEEE. https://doi.org/10.
1109/FCCM51124.2021.00037.
41. Tillich, S., & Herbst, C. (2008). Attacking state-of-the-art software countermeasures—a case
study for AES. In International workshop on cryptographic hardware and embedded systems
(pp. 228–243). Springer.
42. Utyamishev, D., & Partin-Vaisband, I. (2018). Real-time detection of power analysis attacks by
machine learning of power supply variations on-chip. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 39(1), 45–55.
43. Yao, Y., & Schaumont, P. (2018). A low-cost function call protection mechanism against
instruction skip fault attacks. In Proceedings of the 2018 workshop on attacks and solutions in
hardware security (pp. 55–64).
44. Yao, Y., Yang, M., Patrick, C., Yuce, B., & Schaumont, P. (2018). Fault-assisted side-channel
analysis of masked implementations. In 2018 IEEE international symposium on hardware
oriented security and trust (HOST) (pp. 57–64). IEEE.
45. Zhang, X., & Tehranipoor, M. (2011). RON: An on-chip ring oscillator network for hardware
Trojan detection. In 2011 Design, automation & test in Europe (pp. 1–6). IEEE.
46. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In 2018 IEEE
symposium on security and privacy (SP) (pp. 229–244). IEEE.
47. Zick, K. M., & Hayes, J. P. (2012). Low-cost sensing with ring oscillator arrays for health-
ier reconfigurable systems. ACM Transactions on Reconfigurable Technology and Systems
(TRETS), 5(1), 1–26.