
Jakub Szefer

Russell Tessier Editors

Security
of FPGA-Accelerated
Cloud Computing
Environments
Editors
Jakub Szefer, Yale University, New Haven, CT, USA
Russell Tessier, University of Massachusetts Amherst, Amherst, MA, USA

ISBN 978-3-031-45394-6 ISBN 978-3-031-45395-3 (eBook)


https://doi.org/10.1007/978-3-031-45395-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.


Foreword

This edited volume presents 11 chapters of research on the security of cloud-based
FPGAs (Field Programmable Gate Arrays). Since the mid-2010s, FPGAs
have become available in public cloud computing data centers. In today’s single-
tenant setting, FPGAs are shared by different users who use a pay-as-you-go model
of cloud computing to rent whole FPGAs. After finishing computation, the FPGAs
are then made available to other users. Further, researchers have also proposed
multi-tenant FPGAs, where multiple users can share the same FPGAs at the same
time. In both the single-tenant and multi-tenant settings, cloud-based access to
FPGAs means that potentially untrusted or malicious users are able to execute their
hardware modules on the FPGAs. The hardware modules could include different
types of sensors used to steal information or circuits that consume large amounts of
power to create faults and other attacks.
Chapters 1–3 focus on authentication, protection of data communicated between
users and remote FPGAs, isolation of different cloud FPGA tenants, as well as
efficient cryptographic primitives that can be realized on FPGAs. Chapters 4–
9 focus on remote physical attacks in FPGAs, attacks between different remote
FPGAs and attacks that use FPGAs and components such as CPU (Central
Processing Unit) caches or GPUs (Graphics Processing Units). Chapters 10 and 11
investigate countermeasures and defenses for cloud-based FPGAs.

Acknowledgements

The preparation of this book was supported in part by NSF grants 1901901 and
1902532.

Contents

1 Authentication and Confidentiality in FPGA-Based Clouds
Semih Ince, David Espes, Julien Lallet, Guy Gogniat, and Renaud Santoro

2 Domain Isolation and Access Control in Multi-tenant Cloud FPGAs
Christophe Bobda, Joel Mandebi Mbongue, Sujan Kumar Saha, and Muhammed Kawser Ahmed

3 Efficient and Secure Encryption for FPGAs in the Cloud
Subhadeep Banik and Francesco Regazzoni

4 Remote Physical Attacks on FPGAs at the Electrical Level
Dennis R. E. Gnad, Jonas Krautter, and Mehdi B. Tahoori

5 Practical Implementations of Remote Power Side-Channel and Fault-Injection Attacks on Multitenant FPGAs
Dina G. Mahmoud, Ognjen Glamočanin, Francesco Regazzoni, and Mirjana Stojilović

6 Contention-Based Threats Between Single-Tenant Cloud FPGA Instances
Ilias Giechaskiel, Shanquan Tian, and Jakub Szefer

7 Cross-board Power-Based FPGA, CPU, and GPU Covert Channels
Ilias Giechaskiel, Kasper Rasmussen, and Jakub Szefer

8 Microarchitectural Vulnerabilities Introduced, Exploited, and Accelerated by Heterogeneous FPGA-CPU Platforms
Thore Tiemann, Zane Weissman, Thomas Eisenbarth, and Berk Sunar

9 Fingerprinting and Mapping Cloud FPGA Infrastructures
Shanquan Tian, Ilias Giechaskiel, Wenjie Xiong, and Jakub Szefer

10 Countermeasures Against Voltage Attacks in Multi-tenant FPGAs
Shayan Moini, George Provelengios, Daniel Holcomb, and Russell Tessier

11 Programmable RO (PRO): A Multipurpose Countermeasure Against Side-Channel and Fault Injection Attack
Yuan Yao, Pantea Kiaei, Richa Singh, Shahin Tajik, and Patrick Schaumont

Index
Chapter 1
Authentication and Confidentiality
in FPGA-Based Clouds

Semih Ince, David Espes, Julien Lallet, Guy Gogniat, and Renaud Santoro

1.1 Introduction

FPGAs have become popular as accelerators in cloud computing. They support high
computational loads and provide high performance compared to general-purpose
processors [44] and graphics processing units (GPUs). FPGAs have been deployed for
applications requiring intensive computations (e.g., fully homomorphic encryption
algorithms). Cloud providers (CPs), like AWS [35] and Alibaba [2], provide FPGA-
based cloud computing services to fulfill users’ acceleration needs.
Cloud security is critical for cloud users. They expect secure remote computation
and access to FPGA accelerators with minimal impact on design performance.
Security mechanisms must be adapted for appropriate cloud usage. The user must
ensure that their data are kept private. They do not want to disclose sensitive
intellectual property (IP) and data to the cloud provider. To ensure privacy, the
user needs an encrypted channel with the FPGA isolated from the CP. Furthermore,
authentication between end-user and FPGA accelerator must be guaranteed. The
user must be sure to use a specific FPGA and know that another user cannot access
their device. Authentication is thus necessary to manage FPGAs and cloud service
accesses.

S. Ince (✉) · J. Lallet · R. Santoro
Nokia Bell Labs, Lannion, France
e-mail: semih.ince@nokia.com; julien.lallet@nokia.com; renaud.santoro@nokia.com
D. Espes
University of Occidental Brittany, Brest, France
e-mail: david.espes@univ-brest.fr
G. Gogniat
University of South Brittany, Lorient, France
e-mail: guy.gogniat@univ-ubs.fr

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_1

To address this issue, an FPGA-based cloud authentication and access delegation
framework is described in this chapter. The approach uses OAuth 2.0 to safely share
cloud-enabled FPGA resources. Even though this protocol is normally used for
HTTP resources, our solution adapts it for FPGA-accelerated clouds. By securely
authenticating every entity involved in the provisioning of remote FPGAs, we
reinforce the security of remote FPGA access and its flexibility using a tokenized
access scheme.
We introduce FPGA-based cloud architectures and associated access mech-
anisms in Sect. 1.2. Then, we propose an extensive state-of-the-art review of
authentication in Sect. 1.3. Direct authentication techniques and authentication
based on a Trusted Authority (TA) are detailed and bitstream authentication and
FPGA authentication are described. User-FPGA authentication and FPGA authen-
tication solutions involving a TA are also explained. Open challenges are described
in Sect. 1.4. Section 1.5 describes our proposed framework adapted from OAuth 2.0
for FPGA-based cloud authentication and access delegation. Then, Sect. 1.6 shows
the theoretical performance analysis of our framework. The chapter concludes in
Sect. 1.7.

1.2 FPGA-Based Cloud Architectures

1.2.1 FPGA as a Cloud Service Accelerator

The development of demanding applications like artificial intelligence (AI)/machine
learning (ML), video encoding, and other domain-specific algorithms has led to
the implementation of accelerators. These applications execute slowly on general
purpose CPUs. Thus, algorithms and hardware need to be more efficient to meet
timing and latency constraints. Specific hardware like a GPU or an FPGA can be
used to fulfill acceleration needs. FPGAs offer significant advantages in accelerating
various applications. FPGAs have more computational power than GPUs and CPUs
[44]. They are also reconfigurable to meet various acceleration needs with high
flexibility. With the development of cloud services, FPGAs are already deployed
by cloud providers such as AWS [35], Microsoft [32], and Alibaba [2]. Depending
on the need, FPGAs are deployed in the cloud using different architectures.
Cloud providers like AWS and Alibaba offer FaaS (FPGA as a Service) platforms.
For example, Alibaba can provide up to four FPGAs and 88 virtual CPUs in
one instance [2]. Users are provided a virtual machine to access this hardware. Cloud
users can develop and reconfigure FPGAs with their custom design. Figure 1.1
shows the most common way of deploying an FPGA. The FPGA is tightly attached
to a host through a PCIe bus. The host communicates with the FPGA by using
APIs and commands set up by the cloud provider. The latter has a communication
and management stack inside the FPGA to execute commands sent by the host
through the PCIe bus. This type of connection is deployed by AWS F1 [35] and
Alibaba F3 instances [2]. Other types of architectures and use cases also exist. IBM
and Microsoft have opted for a different type of FPGA acceleration approach, as
explained in the following subsection.

Fig. 1.1 Most common FPGA-based cloud architecture

Fig. 1.2 Microsoft Catapult architecture [7]

1.2.2 Infrastructure as a Service FPGA Acceleration

Another example of a cloud platform with FPGA acceleration is Microsoft Catapult
[7]. This platform was used to accelerate web search rankings using FPGA
acceleration. In this case, FPGAs are not offered to users as a service. In fact, users
can only take advantage of this solution if they use Microsoft’s Bing search engine
or another cloud FPGA accelerated application. The FPGA is totally transparent
to the user. For this purpose, Microsoft opted for a top-of-rack (TOR) topology. A
two-socket server blade including two CPUs is interconnected with a QuickPath
Interconnect (QPI) interface. Additionally, one Network Interface Controller (NIC)
and one FPGA are connected to the datacenter network with a quad small form-
factor pluggable optical transceiver (QSFP) for acceleration purposes, as shown
in Fig. 1.2. This architecture is deployed massively in parallel to accelerate web
searches. The FPGA is mainly used for acceleration in this scenario. But if an FPGA
is momentarily unused, it becomes an NIC and serves as a network accelerator for
the datacenter. Another example use of Catapult is Microsoft Brainwave [16]. The
project deployed an FPGA-based neural network accelerator. Microsoft uses the
TOR topology for this application.
IBM experimented with an FPGA-based cloud architecture in which FPGAs are
independent of a CPU [41]. FPGAs are interconnected and attached to the datacenter
network with a switch. The purpose of this architecture is to accelerate specific
workloads like AI/ML and network encryption under an Infrastructure as a Service
(IaaS) model. Users cannot use their custom accelerator but rather must use an
accelerator offered by the cloud provider.
Users can benefit from FPGA acceleration in three different ways. In an FaaS
model like AWS F1 and Alibaba F3, users get FPGA access to implement their
own accelerator. Users have access to a virtual work environment with tools and
hardware. With an IaaS model, users have a list of specific accelerators. They can
choose the application they want to speed up. This is the model deployed by IBM
and the Microsoft Brainwave project (neural network accelerator). Finally, users
can benefit from FPGA accelerators in systems in which FPGA acceleration is
transparent to the user.

1.2.3 FPGA-Based Cloud Without a Trusted Authority

In all solutions presented above, FPGAs are managed by the CP. The latter allocates
resources and establishes security. Figure 1.3 shows the high-level architecture and
associated mechanisms in current cloud solutions.
Upon receiving a user request, the CP creates a work environment and allocates
resources. The user has access to resources through a virtualized environment. The
CP can track FPGA usage and check for security issues. To achieve this, AWS and
Alibaba include management functions inside the FPGA [2, 35]. Due to a lack
of transparency, the CP’s privileges within the FPGA are unknown. The CP can
communicate with the FPGA. Thus, the CP can breach the user privacy and access
resources allocated to a user.
Figure 1.4 shows the mechanisms used in AWS F1 instances. The virtual machine
is set up with an Amazon Machine Image provided by Amazon according to a

Fig. 1.3 Current FPGA-based cloud mechanisms

Fig. 1.4 AWS F1 instance usage

Fig. 1.5 Man-in-the-middle attack configuration

specification selected by the user [35]. In order to program the FPGA, the user must
load an Amazon FPGA Image (AFI) generated by AWS. The user must disclose
their IP to program the FPGA. Before generating the AFI file, AWS checks the user
IP for malicious design patterns such as power wasters and side-channel analyzers
based on ring oscillators, short circuits, and long wires [26]. Under this scheme,
the user and the cloud provider (CP) distrust each other. The CP protects devices
against damage. Although the user desires IP protection and confidentiality, the user
IP is never authenticated after the AFI file (i.e., verified user bitstream) is produced.
The user has no proof that their IP in the AFI file is unmodified. A recent work has
proven that bitstream manipulation is a possible way to introduce hardware Trojans
into designs [8]. The user also has no proof that their IP is indeed kept secret.
In the current implementation, the user never authenticates the FPGA. The user
accesses a virtual machine which can access an FPGA. The lack of authentication
in this scenario can be a threat as man-in-the-middle (MitM) attacks and FPGA
impersonation is possible. In MitM attacks, the attacker is placed between two
communicating entities, as shown in Fig. 1.5. During this attack, the virtual machine
sees the attacker as the FPGA and the FPGA sees the attacker as the virtual machine.
In this situation, the attacker intercepts the communication, including the user IP
and data.
In current solutions, the user IP confidentiality is lacking. To use a custom FPGA
accelerator, the client must send a design file to the CP [31, 35]. The CP must verify
custom designs to protect the FPGA from damage. To address confidentiality and
isolation issues, a trusted authority (TA) can be involved.
Fig. 1.6 FPGA-based cloud with a trusted authority

1.2.4 FPGA-Based Cloud with a Trusted Authority

The trusted authority is an entity trusted by the CP and the cloud user. In an FPGA-
based cloud, the TA will be responsible for security and other mechanisms involving
user confidentiality. Figure 1.6 shows a high-level FPGA-based cloud architecture
involving a TA.
The CP must work with the TA to allocate resources. With this architecture,
FPGA security is established by the TA. Mechanisms like encryption, keys,
and certificates can be managed by the TA independently from the CP. These
mechanisms are described in more detail in Sect. 1.3.1.2. The user is isolated from
the CP thanks to the TA. Moreover, the TA can verify custom user designs instead
of the CP. Thus, the client can protect their IP from the CP. The link between the TA
and the user is described in Sect. 1.5. Under this architecture, the CP still manages
devices. For example, the CP can still allocate and revoke access to hardware and
manage resources; however, the CP does not have full knowledge of the
client's IP. Approaches [13–15] have used a TA in their solutions. Currently, no
cloud provider uses a TA with a cloud-based FPGA.
Each architecture presented thus far represents a specific use case. Hence, their
privacy measures and security assumptions are not the same. Users are provided full
FPGA access under the FaaS model and they reconfigure the FPGA with their own
designs. Users can benefit from FPGA acceleration through a web browser without
knowing FPGA technology. In the FaaS model, in which users have access to a
cloud provider’s FPGA, security requirements are high because users receive access
to expensive hardware. Users can implement custom accelerators, but IP verification
is mandatory. This causes confidentiality concerns for the client. The involvement
of a TA in FPGA-based clouds can be a solution to solve security and confidentiality
issues. The following section focuses on authentication in FPGA-based cloud
computing. Authentication is one of the most important security features for FPGA-
based cloud computing. It allows for the identification of acting entities (i.e., users,
cloud providers), targeted devices, and custom user designs. In an FPGA-based
cloud context, authenticating users and devices allows the cloud provider to set up
authorization and access control. For the user, authentication is proof that the work
environment (e.g., the allocated resources) is genuine and identified.

1.3 FPGA-Based Cloud Authentication Solutions

Authentication is a means to verify the identity and the authenticity of objects
or subjects. Let us consider Alice and Bob communicating over an insecure line.
In order to mutually authenticate themselves, they can use a shared secret like
a passphrase (i.e., password) and verify that they both know it. They could also
share their own identifier provided by an entity they both trust. Alice would verify
Bob’s identity provided by the trusted entity. If the trusted entity confirms that the
identity provided is valid, Alice can trust the person presented to be Bob. They
would continue to communicate securely using public key cryptography.

1.3.1 Authentication Principles

In this section, multiple authentication techniques are detailed. Some techniques
allow communicating users to directly authenticate each other. Other authentication
mechanisms use a TA to authenticate users.

1.3.1.1 Direct Authentication

Authentication mechanisms often involve digital signatures and hash functions. The
latter is a one-way function h : X → Y where |X| = n, |Y| = m, and n > m. It is
relatively easy to compute the result for a given input, but computing an input for a
given output is extremely difficult. SHA (Secure Hash Algorithm) is a popular hash
function. A hash algorithm guarantees the integrity of data, but it does not
authenticate the sender or receiver [36]. Hash functions involve no shared secret
with which to achieve authentication. This means that an attacker who knows which
hash function is used can construct a message with a correct hash and send it; the
receiver will see this message as valid. As |X| > |Y|, it is also possible to brute
force the hash algorithm: collision attacks [38] allow an attacker to find a different
input that produces the same hash output.
There are different ways to authenticate an entity. Technical details and features of
different authentication techniques are detailed below.
A Message Authentication Code (MAC) algorithm is one possibility for sender
authentication. It uses a shared secret to create message digests; without the shared
secret, an attacker cannot build authentic, valid messages. To authenticate and verify
message integrity, HMAC (keyed-Hash Message Authentication Code) functions are
well suited. HMAC combines a MAC construction with a hash function. HMAC and
plain MAC algorithms serve the same purpose, but HMAC has stronger
security properties than a standard MAC.
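The difference between a plain hash and an HMAC can be sketched with Python's standard hashlib and hmac modules; the key and message values below are purely illustrative:

```python
import hashlib
import hmac

# A plain hash protects integrity only: anyone who knows the algorithm
# can recompute a valid digest for a forged message.
msg = b"transfer 100 credits to account 42"
digest = hashlib.sha256(msg).hexdigest()

forged = b"transfer 9999 credits to account 666"
forged_digest = hashlib.sha256(forged).hexdigest()  # an attacker can do this too

# An HMAC mixes a shared secret into the digest, so only parties holding
# the key can produce a tag that verifies.
key = b"shared-secret-between-alice-and-bob"
tag = hmac.new(key, msg, hashlib.sha256).digest()

def verify(key: bytes, msg: bytes, tag: bytes) -> bool:
    expected = hmac.new(key, msg, hashlib.sha256).digest()
    # compare_digest avoids timing side channels during comparison
    return hmac.compare_digest(expected, tag)

assert verify(key, msg, tag)               # genuine message accepted
assert not verify(key, forged, tag)        # forged message rejected
assert not verify(b"wrong-key", msg, tag)  # wrong key rejected
```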
Authenticated encryption (AE) is a way to authenticate both the data and the sender.
AE uses a block cipher mode that provides encryption, integrity, and data
authentication. Common AE algorithms are AES-CCM and AES-GCM [1]. AE can
be seen as an encrypted HMAC because it involves a shared secret, a hash function,
and an encryption method. The most common schemes of AE are Encrypt-then-
MAC (EtM), MAC-then-Encrypt (MtE), and Encrypt-and-MAC (E&M) [29]. Each
method differs slightly, but they all achieve the same objective. EtM is considered to
have the highest level of security of the three choices. E&M and MtE are also secure,
but they require a few modifications to be strongly unforgeable [4]. In cryptography,
unforgeability means that an attacker cannot create a valid signature or tag from
public information alone. AE is tightly linked to Transport Layer Security (TLS) and
is used for confidentiality and integrity. AE algorithms are mandatory in TLS, since
encryption and hash functions are used in every communication. The choice of
algorithm depends on a negotiation between the server and the client [39].
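The Encrypt-then-MAC composition can be sketched as below. This is a minimal, dependency-free illustration, not a production cipher: a SHA-256 keystream stands in for AES, and the keys and plaintext are made up for the example:

```python
import hashlib
import hmac
import os

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Stand-in stream cipher: SHA-256 in counter mode. A real system would
    # use a vetted block cipher such as AES; this keeps the sketch stdlib-only.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(x ^ y for x, y in zip(data, out))

def encrypt_then_mac(enc_key, mac_key, plaintext):
    nonce = os.urandom(16)
    ciphertext = keystream_xor(enc_key, nonce, plaintext)
    # EtM: the MAC covers nonce + ciphertext, so tampering with either is detected
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce, ciphertext, tag

def decrypt(enc_key, mac_key, nonce, ciphertext, tag):
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        raise ValueError("authentication failed")  # reject before decrypting
    return keystream_xor(enc_key, nonce, ciphertext)

enc_key, mac_key = os.urandom(32), os.urandom(32)
nonce, ct, tag = encrypt_then_mac(enc_key, mac_key, b"user bitstream chunk")
assert decrypt(enc_key, mac_key, nonce, ct, tag) == b"user bitstream chunk"
```

Note the EtM property shown here: the receiver verifies the tag first and only decrypts authenticated ciphertext.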

1.3.1.2 Authentication Using a Trusted Authority

Certificates are another way to authenticate a user. Information like public keys,
name, and organization is present in the certificate. Often, certificates work under a
Public Key Infrastructure (PKI) scheme. Under this scheme, a Certificate Authority
(CA) signs user certificates, binding each public key to an identity, as shown in Fig. 1.7a. CAs
are trusted anchors, and they sign certificates to indicate that the user’s identity is
verified. Generally, there are other entities like the Registration Authority (RA) in
PKI. There are various architectures and entities for PKI systems [9, 37]. Figure 1.7b
shows a single-rooted hierarchical PKI structure. A root CA (i.e., trust anchor)
certifies multiple CAs. A certificate signed by a CA can be verified if it can be
traced back to a trust anchor (i.e., root CA). This scheme is widely used by web
browsers. Holding and updating a list of root CAs is enough to verify a certificate.
Nowadays, PKI is everywhere. The most notable system using PKI is TLS. It
allows authentication and secure communication to be achieved over HTTP [39].
Through client–server communication, entities exchange their respective certificates
for verification. This is mutual TLS authentication and it is an optional mechanism.
The certificate is signed with the CA's private key and verified with the corresponding
public key. To check revocation status, a user can contact the CA; if the certificate is
valid (i.e., not revoked), the CA gives a positive response to the user. Each entity is identified
and associated with a certificate. X.509 certificates are popular in TLS [9]. They
include information like issuer name, subject name, public key information, validity
period, etc. For each future communication, the certificates can be verified to achieve
authentication. Upon successful verification under TLS 1.3, both entities proceed to
communicate by creating a shared secret with DHE (Diffie–Hellman Ephemeral) or
ECDHE (Elliptic Curve Diffie–Hellman Ephemeral). In TLS 1.2, the most common
method remains RSA (Rivest–Shamir–Adleman).
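The ephemeral Diffie–Hellman agreement behind DHE can be sketched with modular exponentiation. The prime and generator below are toy values chosen for illustration only; real TLS deployments use standardized 2048-bit-plus groups or elliptic curves (ECDHE):

```python
import secrets

# Toy finite-field Diffie-Hellman Ephemeral (DHE) key agreement.
p = 2**127 - 1   # a prime modulus (far too small for real-world use)
g = 3            # generator, illustrative

a = secrets.randbelow(p - 2) + 1   # client's ephemeral private value
b = secrets.randbelow(p - 2) + 1   # server's ephemeral private value

A = pow(g, a, p)   # client sends A to server
B = pow(g, b, p)   # server sends B to client

# Each side combines its own private value with the peer's public value;
# both arrive at g^(a*b) mod p without the exponents ever crossing the wire.
client_secret = pow(B, a, p)
server_secret = pow(A, b, p)
assert client_secret == server_secret
```

Because the private values are freshly generated per session, a compromise of long-term keys does not expose past session secrets (forward secrecy).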
For cloud FPGAs, the expression “Trusted Authority” (TA) is more common
than CA. The role of the TA is similar to a CA. Previous approaches [13, 15] take
advantage of a TA to generate FPGA certificates for authentication and encryption.
Moreover, they reinforce the isolation between the user and the CP to offer better
confidentiality. Security-critical functions like authentication and certification are
performed by the TA rather than the CP. If a third-party TA is not present in
the FPGA allocation scheme, the CP can control every security mechanism and
every step of FPGA sharing. Hence, the CP can have access to user IP, data, and
allocated resources. To the best of our knowledge, no commercial cloud provider
takes advantage of a TA for cloud FPGA services at the time of writing.

Fig. 1.7 Public key infrastructure mechanisms. (a) Basic PKI. (b) Hierarchy PKI
Other specific tools like SAML (Security Assertion Markup Language) and
OAuth 2.0 also exist. These are authentication and authorization tools commonly
used with HTTP environments. SAML is a framework based on XML. It allows
for the sharing of identity and security information to support access to other
domains [5]. On the one hand, SAML allows access to multiple corporate tools
and domains under one identity. On the other hand, OAuth 2.0 allows authenticated
HTTP resource requests [19].

Fig. 1.8 High-level view of the OAuth 2.0 protocol

As shown in Fig. 1.8, OAuth 2.0 is a protocol usually
involving four entities: a user, a resource owner (RO), an authorization server (AS),
and a resource server (RS). The user negotiates an agreement with the RO to
access their resources. The RO specifies the scope (i.e., rules and limitations) of
the resource sharing. If the user receives authorization from the RO, the user can
ask the AS to generate an access token. The user can use the access token with the
RS to gain access to the resources shared with them. This scheme is lightweight and
practical when a user needs to access tools quickly without repeated log-ins. OAuth
2.0 is much more focused on authorization, whereas SAML is more focused on
authentication. Authentication does exist in OAuth 2.0, although user authentication
is outside the standard's scope and is left to the developer. In Sect. 1.5, an adaptation
of OAuth 2.0 for confidential cloud FPGA sharing is detailed [21].
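The token flow above can be sketched as below. This is a simplified illustration of the OAuth 2.0 roles, not an implementation of the standard: the token format, the scope names, and the signing key shared between AS and RS are all assumptions made for the example:

```python
import base64
import hashlib
import hmac
import json
import time

AS_KEY = b"authorization-server-signing-key"  # shared by AS and RS (illustrative)

def issue_token(user: str, scope: list, ttl: int = 3600) -> str:
    # Authorization server: mint a signed bearer token after the resource
    # owner has approved the requested scope.
    claims = {"sub": user, "scope": scope, "exp": int(time.time()) + ttl}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(AS_KEY, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def check_token(token: str, required_scope: str) -> bool:
    # Resource server: verify signature, expiry, and scope before serving.
    body, _, sig = token.rpartition(".")
    expected = hmac.new(AS_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims["exp"] > time.time() and required_scope in claims["scope"]

token = issue_token("alice", ["fpga:configure", "fpga:read"])
assert check_token(token, "fpga:configure")
assert not check_token(token, "fpga:delete")      # scope not granted
assert not check_token(token + "x", "fpga:read")  # tampered token rejected
```

The resource server never contacts the user database: the signed token alone conveys who was authorized, for what scope, and until when.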

1.3.2 Bitstream Authentication

A bitstream is a configuration file produced by FPGA development software such
as AMD/Xilinx Vivado [47] or Intel Quartus [22]. A bitstream is loaded inside the
FPGA to run an algorithm. The bitstream targets a defined configurable area and can
run until device reset or reconfiguration. Bitstream authentication is an important
aspect of IP protection in FPGAs. Users want to program the device with their
custom design but keep their IP private. By achieving bitstream authentication, the
FPGA device is able to verify if the bitstream comes from an authenticated source.
Bitstream integrity is also checked and any modification of the bitstream can be
spotted. Therefore, it is not possible for an attacker to program the FPGA with a
malicious IP. It is also not possible to implement hardware Trojans in valid custom
user designs at the bitstream level without detection.
A solution [1] has been proposed to achieve bitstream authentication, integrity,
and encryption by implementing a hardware block cipher AE scheme. From
an FPGA resource point of view, block cipher algorithms like AES-CCM or
AES-GCM are much more efficient than a standard AES+HMAC algorithm. A
compact AES-CCM implementation uses 4× fewer logical resources than AES+HMAC
at the cost of 3× lower throughput. AES-GCM is compared against
AES+SHA [20]. The design is 25% faster than AES+SHA with slightly fewer
logical resources used.
A secure protocol for remote bitstream update [12] has been proposed. The
approach prevents spoofing and replay attacks. Spoofing is the replacement of
genuine transferred data with other data. Replay attacks are performed by observing
a data transfer and replaying it. In the scheme, the bitstream version is controlled
and only the system designer can update the active bitstream. For authentication, the
system designer, the bitstream, and the FPGA are authenticated using a block cipher
(i.e., AE).
FPGA vendors often support FPGA bitstream marketplaces [46]. An FPGA
cloud service user can buy a third-party IP and implement it in the allocated
resources [6]. The user shares an FPGA identifier when buying an IP from a software
vendor. The vendor sends the requested IP and the FPGA identifier to the FPGA
vendor. The latter is able to identify the target device using the FPGA identifier.
The FPGA vendor then encrypts the IP with the shared secret associated with the
FPGA identifier. The FPGA vendor has a database of shared secrets associated with
sold FPGAs. Bitstream integrity and authenticity are achieved with a keyed-hash
message authentication code (HMAC). The encrypted IP is sent back to the software
vendor who verifies integrity and forwards the IP to the user. The bitstream can be
bound to a single device using device-targeted bitstream encryption.
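The device-targeting step of this marketplace flow can be sketched as below, with the encryption step omitted for brevity; the device identifiers and secrets are hypothetical, and a real flow would also encrypt the IP under the device secret:

```python
import hashlib
import hmac

# Hypothetical FPGA-vendor database mapping device identifiers to
# per-device secrets provisioned at manufacture.
DEVICE_SECRETS = {"fpga-serial-0042": b"per-device-secret-keyed-at-fab"}

def vendor_package_ip(device_id: str, bitstream: bytes):
    # Vendor side: bind the IP to one target device by tagging it with
    # that device's secret, so only that device will accept it.
    key = DEVICE_SECRETS[device_id]
    tag = hmac.new(key, bitstream, hashlib.sha256).digest()
    return bitstream, tag

def device_accept(device_secret: bytes, bitstream: bytes, tag: bytes) -> bool:
    # FPGA side: reconfigure only if the tag verifies under the device's
    # own secret, i.e., the bitstream is authentic, unmodified, and
    # targeted at this device.
    expected = hmac.new(device_secret, bitstream, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

bs, tag = vendor_package_ip("fpga-serial-0042", b"...bitstream bytes...")
assert device_accept(b"per-device-secret-keyed-at-fab", bs, tag)
assert not device_accept(b"per-device-secret-keyed-at-fab", bs + b"trojan", tag)
```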
Bitstream authentication mechanisms often involve a bitstream targeted to a
specific FPGA and a shared secret. To guarantee IP security, FPGA authentication
ensures the authenticity of the hardware endpoint receiving the IP.

1.3.3 FPGA Authentication

Current commercial cloud providers like Amazon, Alibaba, and Huawei offer FPGA
access through virtual machines. Figure 1.9 shows the most common way to access

Fig. 1.9 Common FPGA access scheme

an FPGA deployed in the cloud. Users are authenticated twice when accessing
an FPGA. In step 1, the customer authenticates themselves using their AWS
account to request an FPGA instance. In step 2, the CP provides VM access
that meets the user's hardware requirements. In step 3, the customer logs into the
VM and obtains FPGA access in step 4.

1.3.3.1 Direct FPGA and User Authentication

FPGA device authentication is critical in remote FPGA computing. As users do not
have physical access to the device, they need to ensure that they are communicating
with the appropriate FPGA. As stated in Sect. 1.3.1.1, AE is one way to authenticate
a device. With a shared secret and a block cipher, it is possible to mutually
authenticate the user and the FPGA to achieve message integrity. Previous works
[1] and [12], respectively, implement a compact 32-bit AES-CCM (i.e., CBC-MAC
and counter mode encryption) and a 3DES block cipher for FPGA authentication.
Shared secrets or encryption keys can be implemented in secure memory and
one-time programmable memory. For system-on-chip (SoC) platforms, a Trusted
Execution Environment (TEE) like ARM TrustZone [3] and Intel SGX [23]
can be used to offer secure memory and security sensitive computation. Recent
research [18] shows that TEE and memory isolation present vulnerabilities. Cache
attacks have been demonstrated to retrieve sensitive information inside TEEs [34].
Some vulnerabilities stem from the developer's code, while others stem from the
platform architecture. Microcode patches remove some vulnerabilities at the
architectural level. TEEs are still generally considered safe, although their security
remains an active field of research [28].
It is also possible to share a secret without storing information inside a secure
memory. Physical unclonable functions (PUFs) [40] take advantage of the physical
randomness of the FPGA silicon that is unintentionally created during the manufac-
turing process. This randomness mostly affects the rise and fall times of signals. As
semiconductor physical characteristics cannot be reproduced, each FPGA device
PUF behaves differently. A specific input to the PUF (i.e., a challenge) produces
an associated output (i.e., a response). The output for a given input is unique to the
device. The same PUF implemented on multiple devices will give different results. If
a PUF design is characterized and challenge–response pairs are saved, it is possible
to identify deployed FPGAs remotely. One sends a challenge and reads the response
to validate the FPGA authentication.
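The enrollment and remote-authentication flow can be sketched as follows. Since real PUF behavior comes from silicon randomness, the PUF here is simulated by a hidden per-device seed; that seed, and the class and table names, are assumptions for illustration only.

```python
import hashlib
import secrets

class SimulatedPUF:
    """Stand-in for a silicon PUF: the device's manufacturing randomness is
    modeled by a hidden per-device seed (an assumption for illustration)."""
    def __init__(self):
        self._silicon = secrets.token_bytes(32)  # unclonable in real hardware
    def response(self, challenge: bytes) -> bytes:
        return hashlib.sha256(self._silicon + challenge).digest()

# Enrollment: the verifier characterizes the device and saves CRPs.
device = SimulatedPUF()
crp_table = {}
for _ in range(8):
    c = secrets.token_bytes(16)
    crp_table[c] = device.response(c)

# Remote authentication: send one stored challenge, check the response.
challenge, expected = crp_table.popitem()   # use each pair only once (no replay)
assert device.response(challenge) == expected          # genuine device passes
assert SimulatedPUF().response(challenge) != expected  # another device fails
```

Consuming each challenge–response pair exactly once (`popitem`) reflects the replay concern discussed next: reused pairs let an eavesdropper impersonate the device.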
PUF designs are vulnerable to machine learning and modelling attacks [11].
When enough challenge–response pairs are collected, it is possible to predict the
response for a given challenge. PUFs are also sensitive to noisy environments.
Temperature and device voltage have a significant impact on PUF responses [10].
If these conditions are not stable, the voltage can fluctuate enough to create a
denial-of-service (DoS) attack on the PUF: authentication requests fail because the
response for a given challenge is not reproduced as expected.
As a requirement, a secure PUF-based authentication protocol should be resilient
against machine learning attacks and noisy environments. Additionally, it should
1 Authentication and Confidentiality in FPGA-Based Clouds 13

be resilient to brute-force attacks. The PUF response must be long enough to make
this type of attack difficult. Depending on the design, a PUF may produce only
one or two stable random bits per challenge. The method used to expand the
PUF response or the authentication protocol itself must not create new possible
attacks. For example, an attacker should not be able to send challenges to a PUF
design. There should be a mutual authentication between the PUF design and the
PUF user; otherwise the PUF can be used by an attacker and all challenge–response
pairs can be documented. Mutual authentication should not be solely based on a
shared secret. The lack of a stored secret is the biggest advantage of a PUF. If mutual
authentication is based on a shared secret, then PUF usage becomes irrelevant. To
protect the PUF design, the FPGA should authenticate the user. Upon successful
user authentication, the FPGA should allow PUF usage. However, several attacks
have been realized on PUF architectures [10, 11, 40]. PUFs must be used within a
robust framework to mitigate vulnerabilities documented in the literature.
Other FPGA authentication techniques are also possible. Previous works [13,
15] use certificates and PKI. Public key cryptography and certificates are used to
communicate outside the FPGA and take advantage of the trusted authority that
is present in the scheme. The following subsection gives further details on these
mechanisms.

1.3.3.2 FPGA and User Authentication Using a Trusted Authority

A TA is often used to create isolation between the user and the cloud provider to
reinforce confidentiality and privacy. Reference [15] describes an FPGA enclave
for cloud computing with the use of a TA. The aim is to reinforce user IP
security, authentication, and privacy in public FPGA clouds while supporting multi-
tenancy. FPGA multi-tenancy is the spatial sharing of one device among multiple
users. Each user has a well-defined set of FPGA resources called a partially
reconfigurable region (PRR). Using this approach, FPGA resource use is maximized
and computing resources are not left unused. FPGAs have been added to a PKI
architecture by using certificates. An on-board key hierarchy is used with a device
unique key (DUK) as a root. In order to establish security, the DUK is derived
multiple times for different purposes (e.g., bitstream encryption, enclave specific
keys). The solution for FPGA authentication proposed here is an SGX-inspired
attestation mechanism endorsed by the TA. The latter offers services like bitstream
certification and boot code authentication. To provide a secure FPGA environment,
security critical components like the device unique key and bitstream loader are
controlled by the TA. The TA controls the root of the key hierarchy and can
update/revoke keys depending on security threats. For enclave communication,
the authors have implemented public and private keys and a certificate. Keys are
used to create a secure channel between an enclave and a user, and certificates are
implemented to authenticate the enclave and the running design.
In references [15] and [13], a solution to protect data and bitstreams from the
CP is proposed. A TA installs encryption keys inside the FPGA device. In order

to secure their bitstreams, users can send their designs to the TA. The TA encrypts
the user designs with an encryption key associated with the FPGA. As a result, the
user can securely send and receive data in an encrypted communication channel with
the FPGA. The user is never authenticated, and the FPGA is indirectly authenticated
with the encryption keys installed by the TA. If an attacker successfully retrieves the
keys stored inside the FPGA, the content of the device is compromised. This creates
other attack vectors on key renewal and distribution mechanisms. Multi-tenancy is
not supported with this method.
In reference [14], a secure FPGA enclave is proposed. In this solution, a user
requests access to an FPGA from the CP and obtains the FPGA serial number
in return. The user sends it to the TA and proceeds to FPGA authentication. A
scheme to protect the privacy and the integrity of user data and IP in public FPGA
clouds is proposed. This solution involves a TA (i.e., FPGA vendor) who has a non-
reconfigurable design in the FPGA. The design includes AES and SHA accelerators.
The TA’s design also includes a PUF, characterized by the TA. It is used for FPGA
authentication purposes. A secure channel is created for the user based on modular
exponentiation and an ephemeral Diffie–Hellman (DHE) key exchange. To
authenticate the FPGA, the user compares the hash of the PUF output with the
expected hash response known to the TA. If the response is valid, the user now has a shared secret with
the FPGA and the session key is established. This key is used as a symmetrical
encryption key between the FPGA and the user. Thus, user bitstreams and data will
be protected against the CP, and bitstream integrity will be achieved. However,
user authentication is missing in this scheme: a malicious party can impersonate
the legitimate user and gain access to the provisioned FPGA first. The user obtains
access to the FPGA only after the session key is set up, but is never authenticated. Thus,
the session key is security critical and must be kept secret. The security of remote
access and computing lies with the session key.
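The session-key establishment can be sketched with plain modular exponentiation. This is a toy: the prime 2^255 − 19 is used only as a convenient large prime, and the SHA-256 key derivation is an assumption; real deployments use vetted DH groups or ECDHE.

```python
import hashlib
import secrets

# Toy ephemeral Diffie–Hellman over a prime field.
p = 2**255 - 19   # a known prime, chosen here purely for illustration
g = 2

# Each side draws a fresh (ephemeral) private exponent per session.
a = secrets.randbelow(p - 2) + 1   # user's ephemeral secret
b = secrets.randbelow(p - 2) + 1   # FPGA's ephemeral secret
A = pow(g, a, p)                   # public value sent user -> FPGA
B = pow(g, b, p)                   # public value sent FPGA -> user

# Both sides compute the same shared secret and hash it into a session key.
k_user = hashlib.sha256(pow(B, a, p).to_bytes(32, "big")).digest()
k_fpga = hashlib.sha256(pow(A, b, p).to_bytes(32, "big")).digest()
assert k_user == k_fpga   # symmetric session key established
```

Because the exponents are ephemeral, compromising one session key does not expose past sessions, which is the forward-secrecy property relied on later in Sect. 1.5.4.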
Using an alternative approach [48], users are able to leverage CPUs and a TEE
to achieve security for an FPGA-based commercial cloud. IP confidentiality is
provided for the user. The CP secures FPGA devices and the cloud infrastructure
with the use of a TA (i.e., FPGA vendor). The CP verifies the user IP inside a
TEE. The TA is responsible for the FPGA shell and various attestations (e.g., FPGA
shell, application). The user must prove their identity to the FPGA to prevent user
impersonations. Then the user can authenticate the active FPGA shell. In this way
the user knows that the security elements of the device are not compromised. Lastly,
the user’s bitstream is authenticated by the FPGA and its integrity is verified. This
level of security is not merely desirable but mandatory for FPGA cloud computing.
Both the user and the cloud provider have expensive technology
at stake. Taking advantage of a TEE inside the processing system of the FPGA for
design verification has great advantages for confidentiality and security, although
this approach can create significant latency. In a multi-tenant context, the user’s
design verification using the TEE can be a serious bottleneck.
Reference [51] proposes an FPGA resource pooling system. FPGA acceleration
resources in a cloud environment are presented at a high level of abstraction.

Users can deploy their own bitstreams or use available designs. Their approach is
heavily focused on FPGA cloud efficiency and performance. This system is based
on multiple compute nodes managed by control nodes. The user can accelerate an
application by making API calls from a virtual machine. The performance gains are
substantial, but the proposed scheme has several potential security weaknesses.
If users deploy their own designs, the designs are fully disclosed to the CP; users
have no confidentiality, which is unacceptable for them. A TA could be a solution to this
issue. Also, no design integrity mechanism is provided to ensure that an unmodified
user IP is delivered. Lastly, no mechanisms are provided to protect the FPGA device
from the user design. Authentication is also not addressed since the user never
authenticates the FPGA device. As a consequence, the end user cannot be sure that
the associated FPGA is the legitimate target or it has been replaced by a malicious
FPGA target. This issue can lead to FPGA impersonation and cause data breaches
and malicious behaviors. The lack of user authentication by the FPGA reinforces
these security flaws. The FPGA, or a reconfigurable region inside the device, lacks
isolation. A user may be able to access a device allocated to another user if no user-
FPGA authentication is performed.
Authentication can happen at many levels in FPGA-based cloud computing.
Users and the CP can be authenticated to reinforce security. FPGAs and bitstreams
can also be authenticated to reinforce IP and device security. To achieve higher
levels of security, multiple authentication techniques can be used. HMAC and
AE algorithms allow two communicating users to authenticate each other. Other
mechanisms like certificates and PKI allow users to authenticate each other using a
TA.
Our authentication and access delegation framework for FPGA-enabled cloud
computing [21] (described in detail in Sect. 1.5) addresses these issues. The
approach is based on OAuth 2.0 and is adapted for cloud usage with FPGA
devices. It provides an authentication solution for four entities simultaneously
(FPGA, user, TA, and CP) and achieves isolation between the client and the CP. In
this framework the TA is involved in the access delegation process and the user-
FPGA authentication process. The framework offers a single sign-on feature to
prevent repeated authentication processes. Access control is enforced by the TA
and managed by the CP. This subject is detailed in Sects. 1.5.5 and 1.5.6. Only our
work and [15] support multi-tenancy.
As described earlier in this section, previous work [13–15] has proposed FPGA
authentication and isolation mechanisms available from the CP. The latter two
approaches use modular exponentiation [14] and RSA [13] algorithms to authenti-
cate a user. Despite introducing the FPGA into PKI, reference [15] does not mention
user authentication. None of the three works proposes an access delegation protocol
involving a TA. Reference [14] only uses a TA for FPGA authentication in the
proposed framework. Access control mechanisms are only described in reference
[15]. Public keys are used to authenticate a secure enclave inside the FPGA. These
keys can be used for access control although no mechanism is explicitly mentioned.
Table 1.1 compares our solution with previous approaches.

Table 1.1 Comparison with prior work. ×∗: the Trusted Authority (TA) only gives
information to the client for FPGA authentication

Achievements          [15]   [14]   [13]   Our proposal
FPGA Authentication    ✓      ✓      ✓       ✓
User Authentication    ×      ✓      ✓       ✓
Multi-tenancy          ✓      ×      ×       ✓
User-CP isolation      ✓      ✓      ✓       ✓
Access delegation      ×      ×∗     ×       ✓

Table 1.2 Resource utilization of three FPGA-based neural network accelerators on Virtex
UltraScale+ VU9P FPGAs
Related work LUT FF DSP BRAM
Xiao et al. [45] 131,042 (11%) 113,581 (5.0%) 242 (3.5%) 4 (5.0%)
Tsai et al. [43] 38,899 (3.0%) 40,534 (1.7%) 9 (0.1%) 3 (4.0%)
Zhou et al. [50] 80,175 (7%) 46,140 (2.0%) 83 (1.2%) 0 (0%)

1.4 Open Challenges

1.4.1 Multi-tenancy

FPGA cloud services aim to be efficient in resource usage and energy consumption.
Currently each cloud FPGA is allocated to one customer at a time. Often, high-
end FPGA devices are deployed for acceleration. For example, AWS offers Xilinx
Virtex UltraScale+ VU9P FPGAs. This device has approximately 1.2 million look-
up tables (LUTs) and 2.6 million flip-flops (FFs). As shown in Table 1.2, several
neural network accelerators [43, 45, 50] have been implemented using this platform.
One implementation [45] used 131,042 LUTs and 113,581 FFs, which correspond
to 11% of the LUTs and 5% of the FFs in the AWS FPGA. To the best of our
knowledge, no commercial cloud service provider supports FPGA multi-tenancy
[24, 27, 30].
Multi-tenancy introduces new authentication and access control challenges.
Each tenant is assigned to a dedicated PRR. Consequently, each PRR must be
authenticated. Each user’s rights must be enforced by strong access control schemes.
With multi-tenancy, sensitive components like accelerators, memory, and CPUs can
be shared. These hardware components must not leak information and must be
shared securely among users. Strong access control policies must be set individually
for each user and each shared hardware resource. Policies must be set by the
resource owner or by a trusted authority if the resource sharing scheme includes
one. To satisfy these policies, users must be authenticated inside the FPGA. The
FPGA must be aware of user’s design locations and the requests coming from them.
A user PRR also needs authentication to mitigate impersonation and unauthorized
requests. The FPGA must check the source of the request and whether access control
policies allow the user to make such a request.
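A per-tenant, per-resource policy check of this kind can be sketched as a deny-by-default table. The tenant, resource, and operation names below are hypothetical, not from any cloud provider's API.

```python
# Hypothetical per-PRR access-control table: each partially reconfigurable
# region (PRR) or shared resource is bound to an authenticated tenant, and
# every request is checked against the policy for that tenant/resource pair.
policies = {
    # (tenant_id, resource): set of allowed operations — illustrative only
    ("tenant-a", "prr0"):       {"configure", "read"},
    ("tenant-a", "dram-bank0"): {"read"},
    ("tenant-b", "prr1"):       {"configure", "read", "write"},
}

def authorize(tenant: str, resource: str, op: str) -> bool:
    """Deny by default: only explicitly granted operations pass."""
    return op in policies.get((tenant, resource), set())

assert authorize("tenant-a", "prr0", "configure")
assert not authorize("tenant-a", "prr1", "configure")   # another tenant's PRR
assert not authorize("tenant-b", "dram-bank0", "read")  # never granted
```

The deny-by-default lookup mirrors the requirement above: a request from an unauthenticated or unlisted (tenant, resource) pair is rejected without any special-case code.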

1.4.2 Remote FPGA Attacks

Remote attacks on multi-tenant FPGAs are also a significant concern. The power
distribution network (PDN) of the FPGA can be exploited by possible attack vectors.
By manipulating the PDN, a malicious user can set up a side channel since the
PDN must be shared between all tenants in an FPGA. PDN attacks exploit the
dependency between switching activity and power consumption [17, 25, 42, 49].
With a voltage sensor based on delay lines, it is possible to track voltage drops at
a nanosecond scale. In fact, the delay of a signal depends on the supply voltage.
The delay line itself is composed of buffers and latches. By collecting power traces
and using correlation power analysis (CPA), it is possible to recover data from the
PDN. For example, CPA can correlate power measurements against a power model
to guess an encryption key byte. Previous work [17] (which uses AWS FPGA cloud)
and [42] have described attacks on an AES core and retrieved encryption keys. The
PDN can also be used as a covert channel. Hence, a user can bypass access control
mechanisms and communicate with other FPGA designs or system hardware (e.g.,
CPU). Since each program has its own power consumption pattern, it is possible to
identify the programs of other tenants by monitoring PDN voltage drops.
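The CPA step can be illustrated with a toy model in which the "measured" voltage drops leak the Hamming weight of the data XORed with a secret key byte. The noise level, trace count, and key value are arbitrary choices for the sketch; a real attack targets an intermediate such as an AES S-box output.

```python
import random

def hw(x: int) -> int:
    """Hamming weight: the classic CPA power model."""
    return bin(x).count("1")

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

rng = random.Random(1)
true_key = 0x3C
data = [rng.randrange(256) for _ in range(200)]
# "Measured" voltage drops: leakage of data XOR key, plus sensor noise.
traces = [hw(d ^ true_key) + rng.gauss(0, 0.5) for d in data]

# CPA: correlate each key guess's hypothetical leakage with the traces;
# the correct guess yields the strongest correlation.
scores = {g: pearson([hw(d ^ g) for d in data], traces) for g in range(256)}
recovered = max(scores, key=scores.get)
assert recovered == true_key
```

Even with noise, the correct guess stands out because wrong guesses share fewer bits with the true intermediate value, which lowers their correlation with the measured traces.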
PDN attack mitigation and protection mechanisms exist. Unfortunately, pro-
tecting the FPGA and user designs against attacks requires significant overhead.
For example, a CPA resistant AES core based on a power equalizer [33] was
implemented and tested. The equalizer's power overhead is between 8 and 23%
depending on the mode, and its area is roughly 40% of that of a 128-bit AES
core. This approach only protects the AES core. Side channels can still be
exploited and the PDN is still vulnerable.

1.5 Authorization and Access Delegation Framework for FPGA-Enabled Cloud Computing

The protocol seen in Fig. 1.8 must be adapted for the cloud use case for several
reasons. First, the resource owner (website user) uses the client’s website (e.g., a
blog) and is willing to share information. The client then asks for authorization for
resource access (user data) and starts the access delegation procedure. This scheme
is not valid in an FPGA-accelerated cloud use case, as the client uses the resource
owner's (RO's) website to gain access to hardware owned by the RO.
In this situation, the cloud provider is considered to be the resource owner, the
trusted authority is the authorization server, and the FPGA is a part of the resource
server. The client requests a hardware resource with their account information. This
is the first step of authentication. Then, resources are allocated automatically with
the CP’s virtualization tools (e.g., orchestrator/scheduler).
This framework is based on a TA to create user-CP isolation by using an access
delegation protocol. Moreover, the TA executes security-sensitive operations like

Fig. 1.10 CP introduces the client to the TA and generates and manages authorization code. Client
authenticates themself with the TA and obtains their authorization code

FPGA authentication, bitstream certification, and verification. Figure 1.10 shows
the interactions between the entities present in our protocol.

1.5.1 Client Request and Certificate Creation

To request a resource from the CP, the client first sends a message to the CP’s user-
agent, as shown in step 1 of Fig. 1.10. In this request message, the client includes
their identifiers, certificate (or produces it online as stated below), and a redirection
unique resource identifier (URI). This information is sent to the TA in the next step.
The URI is used by the TA to send back redirected messages to the client via the
CP’s user-agent.
There are two different scenarios for client certification. The client can generate
their certificate online or offline from the protocol. Online certificate generation
proceeds as follows. Upon receiving the client’s resource request, the CP authenti-
cates themself to the TA with their certificate, requests a certificate for the client,
and shares it with the client as shown in step 2 in Fig. 1.10. The client certificate
is created from the CP’s website. The client must interact with the web browser
to create the certificate and add randomness to the generated keys. A similar
mechanism is seen in Microsoft Azure’s key-pair generation for SSH channel
protection.
It is also possible to create the client certificate offline, outside this protocol. In
that case, the client is responsible for generating their own key-pair and makes the
resource request to the CP with their certificate already in hand. The certificate
generation step is then skipped, and as a result the protocol is faster.

1.5.2 HTTP Redirection and Authorization Grants

During the second step in Fig. 1.10, after the CP’s authentication, the client’s request
is accepted or declined. If the request is accepted, the TA generates the authorization
code. By using the previously provided URI, the TA redirects the CP’s user-agent
back to the client in order to authenticate them directly. By performing this action,
the CP has authenticated and introduced the client to the TA.
At this time, the TA knows the client’s certificate and the authorization code
associated with the client’s identifiers. To obtain their authorization code, the
client needs to authenticate with the TA for the first time. A certificate-based
TLS authentication is performed [39]. If the client’s credentials are valid, the
authentication is successful and the TA sends an HTTP redirection code to the
CP’s user-agent (HTTP code 302) alongside the client redirection URI. The client
receives the authorization code from the CP’s user-agent in the last step of Fig. 1.10.
In this protocol, the authorization code cannot be used as an attack vector. In fact,
the authorization code is associated with the client’s credentials and the URI. It is
not a secret code because the CP’s user-agent shares the authorization code through
HTTP redirection. The authorization code can be found in the user-agent history. In
case of an authorization code redirection attack, which aims to get backdoor access
to the client’s resource, a simple redirection URI check from the TA is sufficient. The
URI used when requesting the authorization code must match the URI used for the
access token generation, as explained in Sect. 1.5.3. Hence, a malicious client cannot
gain access to resources attributed to another client by intercepting the authorization
code.
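The TA-side binding of the authorization code to the client's identity and redirect URI can be sketched as follows. The function and field names are illustrative, not taken from any particular OAuth implementation.

```python
import secrets

# TA bookkeeping: each authorization code is stored together with the
# client identity and redirect URI it was issued for.
issued = {}

def issue_code(client_id: str, redirect_uri: str) -> str:
    code = secrets.token_urlsafe(16)
    issued[code] = (client_id, redirect_uri)
    return code

def exchange_code(code: str, client_id: str, redirect_uri: str) -> bool:
    """Token request succeeds only if code, client, and URI all match."""
    return issued.get(code) == (client_id, redirect_uri)

code = issue_code("client-42", "https://client.example/cb")
assert exchange_code(code, "client-42", "https://client.example/cb")
# An intercepted code is useless with a different redirect URI or identity.
assert not exchange_code(code, "client-42", "https://evil.example/cb")
assert not exchange_code(code, "mallory", "https://client.example/cb")
```

This is exactly the "simple redirection URI check" described above: because the tuple stored at issuance must match the tuple presented at token exchange, a stolen code alone grants nothing.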
The role of this code is to ensure that the CP is authenticated and cannot be
impersonated. By using the CP’s user-agent to redirect the code, it can be confirmed
that the CP which authenticated themself and gave authorizations for resource
allocation to the TA is the same entity that communicates with the client in the
protocol.

1.5.3 Access Management and Token Generation Phase

After receiving the redirected authorization code, the client needs to authenticate
themself again with the TA using their certificate, their authorization code, and
redirection URI to request an access token from the TA. A certificate-based TLS
authentication is performed one last time to confirm the client’s identity. This second
authentication is necessary to verify the client’s identity.
The client needs to submit their authorization code to request the access token.
If the client’s credentials are registered and associated with the used authorization
code, the TA will be able to authenticate the FPGA device and generate the access
token using a shared secret with the FPGA, as shown in Fig. 1.11. Access tokens
have scopes and the duration of access. They are managed by the RO and endorsed

Fig. 1.11 The TA generates the access token for the client

by the TA [19]. These options are requested by the client during the first step shown
in Fig. 1.10. Then, the RO accepts or declines the requested scopes and duration and
notifies the TA (i.e., the authorization server) in step 2. The client can decline an
issued token if the scope requested does not match their requests.
Access tokens must be kept confidential. They are only shared using TLS and
stored in an encrypted form using the FPGA public key. If an access token leaks,
the secure access to the allocated FPGA will be compromised.
The access token’s content may also be extended in order to contain information
such as FPGA serial number, partially reconfigurable region identifier, and so on.
This feature gives flexibility for the implementation phase as additional mechanisms
can be developed.
Using the described protocol, the client is strongly authenticated through their
certificate and their credentials with the CP and the TA. The authentication between
the latter two entities is less critical than client authentication because they are
trust anchors in this protocol. Due to the shared secret and the TLS session between the
client and the FPGA, a secure and tokenized confidential remote access can be set up
for the client. The CP is isolated from the client’s computation but can still manage
access scopes and duration.

1.5.4 Client and FPGA Secure Channel

After the token issuance, the client contacts the FPGA to obtain access to resources.
A TLS session is set up for secure communication with perfect forward secrecy
between the FPGA and the client [39]. The client and the FPGA create their
shared secret with algorithms like DHE and ECDHE [39] and then use symmetric
encryption algorithms like AES-256-GCM. Once the TLS connection is established,
the client sends their token to be authenticated. The FPGA parses the token and
evaluates if the resources can be granted. Further communications between the client
and the FPGA will be encrypted. The user privacy is greatly enhanced and isolation
from other entities is achieved.
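On the client side, such a channel could be configured with a standard TLS library. The sketch below uses Python's `ssl` module with placeholder file names for the TA-anchored certificates; TLS 1.3 is pinned because all of its cipher suites provide forward secrecy.

```python
import ssl

# Client-side context for the client–FPGA TLS session.  PROTOCOL_TLS_CLIENT
# enables certificate verification and hostname checking by default.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # every TLS 1.3 suite gives PFS

# Placeholder file names: the TA-anchored trust store and the client's
# certificate/key for certificate-based TLS client authentication.
# ctx.load_verify_locations("ta_root_ca.pem")
# ctx.load_cert_chain("client_cert.pem", "client_key.pem")

# Typical use (sketch; fpga_host is a placeholder):
# with socket.create_connection((fpga_host, 443)) as raw:
#     with ctx.wrap_socket(raw, server_hostname=fpga_host) as tls:
#         tls.sendall(access_token)

assert ctx.verify_mode == ssl.CERT_REQUIRED and ctx.check_hostname
```

Key agreement (ECDHE) and record encryption (e.g., AES-256-GCM) are negotiated inside the handshake, so the application code above never handles raw keys.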

1.5.5 Access Control with Tokens

When all the entities are authenticated and the authorizations are granted, the client
can access the FPGA with the access token. The token needs to be decrypted and
parsed. Actions are then taken by the FPGA to program and allocate resources inside
the device. Thanks to their resource server, the CP explicitly specifies the attributed
resource information to the TA during the second step shown in Fig. 1.10.
According to the OAuth 2.0 protocol, token content can be extended according to
the user’s preference [19]. In order to take advantage of this feature in a cloud FPGA
context, critical information needs to be selected. This information reinforces the
access control of the client and ensures device/infrastructure security. It is up to the
CP to decide the content of the token. In our solution, several pieces of information
must be included. This information includes the FPGA serial number (or a specific
challenge–response pair for a PUF) and a partial reconfiguration region (PRR)
identifier in the case of multi-tenant FPGA usage. A client identifier (e.g., certificate)
is useful for secure communication outside the FPGA. This information identifies
the device, the client, and the allocated PRR. Additionally, bitstream identification
and signatures are stored in the token. This aspect is further detailed in Sect. 1.5.6.
The token must have validity timestamps for the FPGA to take action upon token
expiration. This information is necessary to ensure that the client’s activity is located
inside the cloud infrastructure and that the access delegation scope is respected. The
TA should be able to create an access token with this information.
The CP’s virtualization tools track which FPGA and PRR are in use. Cloud
resource utilization is tracked by the CP and already allocated resources cannot
be overwritten and reallocated by the client. Moreover, as the token has a limited
validity, the FPGA should be able to end the connection with the client based on
a timestamp without needing the CP to take action, i.e., the FPGA must be time-
aware. The access validity timestamp is available inside the token.
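A minimal sketch of token creation and verification follows, assuming the TA–FPGA shared secret is used as an HMAC key and the token carries the fields listed above. The JSON layout and field names are assumptions for illustration, not a standard format.

```python
import hashlib
import hmac
import json
import secrets

# Shared secret between the TA and the FPGA (illustrative).
ta_fpga_secret = secrets.token_bytes(32)

def make_token(secret: bytes, claims: dict) -> bytes:
    """TA side: serialize the claims and append an HMAC-SHA256 tag."""
    body = json.dumps(claims, sort_keys=True).encode()
    tag = hmac.new(secret, body, hashlib.sha256).hexdigest().encode()
    return body + b"." + tag

def check_token(secret: bytes, token: bytes, now: int):
    """FPGA side: verify the tag first, then the validity window."""
    body, _, tag = token.rpartition(b".")
    good = hmac.new(secret, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(good, tag):
        return None                       # forged or tampered token
    claims = json.loads(body)
    if not claims["not_before"] <= now < claims["expires"]:
        return None                       # outside the validity timestamps
    return claims

claims = {
    "fpga_serial": "XFL-0001",            # or a PUF challenge–response pair
    "prr_id": 2,                          # allocated PRR for multi-tenancy
    "client": "client-42",                # client identifier
    "bitstream_id": "bs-7",
    "bitstream_sha256": hashlib.sha256(b"demo bitstream").hexdigest(),
    "not_before": 1_700_000_000, "expires": 1_700_003_600,
}
tok = make_token(ta_fpga_secret, claims)
assert check_token(ta_fpga_secret, tok, now=1_700_000_100) == claims
assert check_token(ta_fpga_secret, tok, now=1_700_009_999) is None  # expired
assert check_token(ta_fpga_secret, tok + b"x", now=1_700_000_100) is None
```

Because expiry is enforced inside `check_token`, the FPGA can end the session from the timestamp alone, without waiting for the CP, which is the time-awareness requirement stated above.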

1.5.6 Access Control of Bitstreams

In order to load a bitstream into an allocated resource, the client must fulfill several
requirements. Bitstreams must be verified by the TA for malicious design patterns.
Then the bitstream must be certified and cleared for safe use.
To accomplish this goal, each bitstream submitted by the client receives a unique
identifier. Then the TA proceeds to verify it. If the bitstream is declared safe for use
by the TA, the signature of the bitstream is calculated and stored in the TA database.
Then, the bitstream identifier and the bitstream signature are also added to the access
token. SHA-256 or SHA-384 should be used for security.
When the client tries to load a bitstream into their allocated resource, the
bitstream identifier and the bitstream signature will also be shared in step 2 of
Fig. 1.11. At the reception of the bitstream, the FPGA computes the signature and

verifies its results with the signature stored inside the token in step 3. If the results
are the same, the FPGA reconfigures the user’s allocated resource in the last step.
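The check itself reduces to a digest comparison. This sketch assumes the "signature" stored in the token is a SHA-256 digest protected by the token's own integrity mechanism, and the bitstream bytes are placeholders.

```python
import hashlib

def verify_bitstream(bitstream: bytes, token_digest: str) -> bool:
    """FPGA side: recompute the digest of the received bitstream and
    compare it with the value carried inside the access token."""
    return hashlib.sha256(bitstream).hexdigest() == token_digest

bitstream = b"\x00\x09\x0f\xf0" * 64                  # placeholder bitstream
token_digest = hashlib.sha256(bitstream).hexdigest()  # stored by the TA
assert verify_bitstream(bitstream, token_digest)                 # unmodified
assert not verify_bitstream(bitstream + b"\xff", token_digest)   # tampered
```

Only if the recomputed digest matches the token's value does the FPGA proceed to reconfigure the client's allocated region, so a bitstream swapped in transit is rejected before configuration.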

1.6 Performance Analysis

1.6.1 Theoretical Performance

Analytically, we can show that the time required to generate the authorization code
as shown in Fig. 1.10 is as follows:

t_code = t_CP(internal) + t_TA(internal)
         + t_client-CP(net. + auth.) + t_client-TA(net. + auth.)
         + t_TA-CP(net. + auth.) + (t_cert)                              (1.1)

Internal tasks are not computationally expensive; for example, t_CP(internal)
refers to the time needed by the CP's virtualization tools to check for available
resources and allocate them to a client. The TA only reads/writes values and
generates the authorization code during t_TA(internal). The certificate creation
time t_cert can be skipped if the client already has a certificate. t_A-B(auth.)
represents the duration of authentication between entities A and B, as explained at
the beginning of Sect. 1.5.2. t_A-B(net.) represents the cumulated network latencies
and transport time. Assuming that every entity has low latency to every other
(i.e., small t_A-B(net.), where A and B are the communicating entities), the client
can obtain the authorization code with low latency. The time required for the token
generation phase shown in Fig. 1.11 can be described as follows:

t_token = t_TA(internal) + t_client-TA(net. + auth.)                     (1.2)

t_token should be very small because there is only a certificate-based TLS
authentication of the client and an FPGA authentication performed by reading a
value stored in a secure FPGA memory location. The TA then proceeds with the
token generation procedure, which involves a few read and write operations. The
information necessary for token generation is stored in the TA's database.

t_access = t_client-FPGA(net.) + t_FPGA(internal) ≪ t_code + t_token     (1.3)

t_access is the time required to confirm that a generated token grants valid FPGA
access. To confirm this, we need one TLS authentication between the client and the
FPGA plus a token verification, represented by t_FPGA(internal). This value is
significantly smaller than the time required by our protocol to generate an
authorization code and an access token: t_access requires less communication,
whereas t_code and t_token include internal computing times and accumulated
communication latencies between the entities present in this protocol (Table 1.3).

Table 1.3 Description of variables used in Eqs. 1.1–1.3

Variable          Description
t_code            Time spent to generate an authorization code
t_token           Time spent to generate an access token by using an authorization code
t_access          Time spent to connect to the FPGA using the access token
t_X(internal)     Time spent for internal computations by entity X
t_X-Y(net.)       Network latencies between entities X and Y
t_X-Y(auth.)      Time spent for mutual authentication between X and Y

1.6.2 Time Estimation for Token Generation

Let us set t_A−B(net.) to 30 ms and use the delay time of 67.5 ms without network
latency [30] as a baseline for a standard TLS handshake. We include three client-
authenticated TLS handshakes, as noted in Eq. 1.1. A client-authenticated TLS
handshake consists of eight messages, which gives 8 × 3 × 30 = 720 ms of network
latency and 3 × 67.5 = 202.5 ms for the TLS handshakes themselves. If the client
generates their certificate offline and then requests an authorization code as stated
in Sect. 1.5.1, t_code = 720 + 202.5 = 922.5 ms. Tasks internal to the CP and TA
are not computationally expensive, so their time costs are much smaller than the
values taken into account in this estimation.
Additionally, t_token = 2 × 8 × 30 + 2 × 67.5 = 682.5 ms for two TLS handshakes
and their network latency, again without taking internal tasks into account. Most of
the time is spent on network latency.
In total, t_code + t_token = 922.5 + 682.5 = 1605 ms is needed to obtain
authorization and generate an access token for the allocated FPGA. Even in a
worst-case scenario, this procedure should not last more than two seconds. Finally,
t_access can be estimated as t_access = 30 + 67.5 + t_FPGA(internal) = 97.5 ms +
t_FPGA(internal). Accessing an FPGA with a valid token takes approximately
100 ms because t_FPGA(internal) consists of simple memory operations. For this
use case, Eq. 1.3 holds: t_access = 97.5 ms + t_FPGA(internal) < t_code + t_token
= 1605 ms.
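The estimate above can be reproduced in a few lines of Python. All constants are the assumptions stated in the text (30 ms per message, a 67.5 ms handshake baseline from [30], eight messages per client-authenticated handshake), and t_token is taken directly from the chapter's figure of 682.5 ms:

```python
# Latency model of Sect. 1.6.2; all values are in milliseconds and are
# the chapter's assumptions, not measurements.
NET_MS = 30.0     # t_A-B(net.): per-message network latency
TLS_MS = 67.5     # TLS handshake time excluding network latency [30]
MSGS = 8          # messages in a client-authenticated TLS handshake

# t_code: three client-authenticated handshakes (Eq. 1.1).
t_code = 3 * MSGS * NET_MS + 3 * TLS_MS   # 720 + 202.5 = 922.5

# t_token: two further handshakes, 682.5 ms per the chapter's estimate.
t_token = 682.5

# Total time before the client holds an access token for the FPGA.
t_total = t_code + t_token                # 1605.0

# t_access: one network hop plus one handshake; t_FPGA(internal) is
# dominated by simple memory operations and treated as negligible.
t_access = NET_MS + TLS_MS                # 97.5

print(t_code, t_total, t_access)          # 922.5 1605.0 97.5
```

Eq. 1.3 then follows immediately: t_access is far smaller than t_code + t_token, which is the payoff of single sign-on, token-based access.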

1.7 Conclusion

This chapter surveyed the state of the art in authentication techniques
and their application to FPGAs in cloud computing. Each service model used
for the FPGA cloud has different security requirements that must be met to
24 S. Ince et al.

achieve secure computation. Under the infrastructure as a service model, users have
indirect access to resources and do not need FPGA technical knowledge. Under
the FPGA as a service model, users can implement their own FPGA designs. This
approach gives the user more flexibility, but at some cost to cloud security and the
user's privacy. These security and privacy concerns must be addressed to achieve
secure FPGA computation in the cloud.
Authentication is an important security step. It allows for the identification
and certification of users and devices to establish security in communications and
computations. Authentication is especially necessary for FPGA multi-tenancy so
that access control policies for hardware resource access and data protection can be
enforced. Authentication can either be directly defined between two communicating
users or involve a trusted third party (i.e., TA). Direct authentication mechanisms
involve MAC algorithms and AE. Authentication with a TA takes advantage of a
PKI and certificates. Authentication can happen at many levels. Firstly, users, the
CP, and the TA can be authenticated. Secondly, FPGAs and user bitstreams can be
authenticated to verify bitstream and device authenticity. Authentication with a TA
is more common in FPGA-based cloud computing because the cloud user needs to
protect their IP to keep it private from the CP and attackers. The CP also needs to
protect their devices and infrastructure.
Among the open challenges, PUF-based authentication techniques are still being
investigated. They often lack security against machine learning attacks, and a
generic protocol that can be widely used as an authentication solution is still
needed. Moreover, multi-tenancy mechanisms in FPGAs are being actively
investigated to improve FPGA resource usage. By spatially sharing an FPGA
device among multiple users, it is possible to reach higher efficiency, but
multi-tenancy also creates new threats. Firstly, new authentication challenges,
such as user identification inside the device and resource sharing, need to be
investigated. Secondly, hardware-level attacks and side-channel exploits must be
mitigated. Users are vulnerable to each other inside the FPGA if no security
mechanisms are deployed to cover these potential attack vectors.
In this chapter, a novel OAuth 2.0-based authentication and access delegation
framework for FPGA-enabled cloud computing was proposed. Client authentication
protects both the client's sensitive information and the CP's cloud infrastructure
from malicious behavior enabled by identity theft. Furthermore, by introducing a
trusted authority, the client's FPGA access and sensitive operations are isolated
from the CP. The client benefits from low-latency single sign-on authentication for
their FPGA thanks to tokenized access. Security and privacy are thus enhanced for
both the cloud provider and the client.

References

1. Abdellatif, K. M., Chotin-Avot, R., & Mehrez, H. (2013). Protecting FPGA bitstreams
using authenticated encryption. In 2013 IEEE 11th International New Circuits and Systems
Conference (NEWCAS) (pp. 1–4). https://doi.org/10.1109/NEWCAS.2013.6573635
2. Alibaba FPGA cloud documentation. https://www.alibabacloud.com/help/en/fpga-based-ecs-instance
3. ARM TrustZone Documentation. https://developer.arm.com/documentation/102418/0101/
What-is-TrustZone-
4. Bellare, M., & Namprempre, C. (2008). Authenticated encryption: Relations among notions
and analysis of the generic composition paradigm. Journal of Cryptology, 21, 469–491. https://
doi.org/10.1007/s00145-008-9026-x
5. Cantor, S., Kemp, J., Philpott, R., & Maler, E. (2005). Assertions and Protocols for the OASIS
Security Assertion Markup Language (SAML) V2.0. OASIS Standard saml-core-2.0-os.
6. Carelli, A., Cristofanini, C. A., Vallero, A., Basile, C., Prinetto, P., & Di Carlo, S. (2018).
Securing bitstream integrity, confidentiality and authenticity in reconfigurable mobile hetero-
geneous systems. In 2018 IEEE International Conference on Automation, Quality and Testing,
Robotics (AQTR) (pp. 1–6). https://doi.org/10.1109/AQTR.2018.8402795
7. Caulfield, A. M., et al. (2016). A cloud-scale acceleration architecture. In 2016 49th Annual
IEEE/ACM International Symposium on Microarchitecture (pp. 1–13). https://doi.org/10.1109/
MICRO.2016.7783710
8. Chakraborty, R. S., Saha, I., Palchaudhuri, A., & Naik, G. K. (2013). Hardware Trojan insertion
by direct modification of FPGA configuration bitstream. IEEE Design & Test, 30(2), 45–
54. https://doi.org/10.1109/MDT.2013.2247460
9. Cooper, M., Dzambasow, Y., Hesse, P., Joseph, S., & Nicholas, R. (2005). Internet X.509 public
key infrastructure: Certification path building. Technical Report, RFC 4158.
10. Delvaux, J. (2017). Security Analysis of PUF-Based Key Generation and Entity Authentication.
Ph.D. Thesis.
11. Delvaux, J. (2019). Machine-learning attacks on PolyPUFs, OB-PUFs, RPUFs, LHS-PUFs,
and PUF-FSMs. IEEE Transactions on Information Forensics and Security, 14(8), 2043–
2058. https://doi.org/10.1109/TIFS.2019.2891223
12. Devic, F., Torres, L., & Badrignans, B. (2010). Secure protocol implementation for remote
bitstream update preventing replay attacks on FPGA. In 2010 International Conference on
Field Programmable Logic and Applications (pp. 179–182). https://doi.org/10.1109/FPL.2010.
44
13. Eguro, K., & Venkatesan, R. (2012). FPGAs for trusted cloud computing. In Proceedings of
the 22nd International Conference on Field Programmable Logic and Applications (pp. 63–
70). https://doi.org/10.1109/FPL.2012.6339242
14. Elrabaa, M. E. S., Al-Asli, M., & Abu-Amara, M. (2021). Secure computing enclaves using
FPGAs. IEEE Transactions on Dependable and Secure Computing, 18(2), 593–604. https://
doi.org/10.1109/TDSC.2019.2933214
15. Englund, H., & Lindskog, N. (2020). Secure acceleration on cloud-based FPGAs FPGA
enclaves. In 2020 IEEE International Parallel and Distributed Processing Symposium
Workshops (IPDPSW) (pp. 119–122). https://doi.org/10.1109/IPDPSW50202.2020.00026
16. Fowers, J., et al. (2018). A configurable cloud-scale DNN processor for real-time AI. In
2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (pp. 1–
14). https://doi.org/10.1109/ISCA.2018.00012
17. Glamočanin, O., Coulon, L., Regazzoni, F., & Stojilović, M. (2020). Are cloud FPGAs really
vulnerable to power analysis attacks? In 2020 Design, Automation & Test in Europe Conference
& Exhibition (DATE) (pp. 1007–1010). https://doi.org/10.23919/DATE48585.2020.9116481
18. Gross, M., Jacob, N., Zankl, A., & Sigl, G. (2021). Breaking TrustZone memory isolation and
secure boot through malicious hardware on a modern FPGA-SoC. Journal of Cryptographic
Engineering, 12, 181–196.
19. Hardt, D. (2012). The OAuth 2.0 Authorization Framework. Technical Report, RFC 6749.
https://doi.org/10.17487/RFC6749
20. Hori, Y., Satoh, A., Sakane, H., & Toda, K. (2008). Bitstream encryption and authentication
with AES-GCM in dynamically reconfigurable systems. In 2008 International Conference on
Field Programmable Logic and Applications (pp. 23–28). https://doi.org/10.1109/FPL.2008.
4629902
21. Ince, S., Espes, D., Lallet, J., Gogniat, G., & Santoro, R. (2021). OAuth 2.0-based authentica-
tion solution for FPGA-enabled cloud computing. In 3rd International Workshop on Cloud, IoT
and Fog Systems (and Security) - CIFS 2021 co-located with the 14th IEEE/ACM International
Conference on Utility and Cloud Computing - UCC 2021. University of Leicester, UK.
22. Intel Quartus development software official product website. https://www.intel.com/content/
www/us/en/products/details/fpga/development-tools/quartus-prime.html
23. Intel software guard extension documentation. https://www.intel.com/content/www/us/en/
architecture-and-technology/software-guard-extensions.html
24. Knodel, O., Lehmann, P., & Spallek, R. G. (2016). RC3E: Reconfigurable accelerators in data
centres and their provision by adapted service models. In IEEE International Conference on
Cloud Computing (CLOUD) (pp. 19–26). https://doi.org/10.1109/CLOUD.2016.0013
25. Krautter, J., Gnad, D., & Tahoori, M. (2020). CPAmap: On the complexity of secure FPGA
virtualization, multi-tenancy, and physical design. In TCHES, (Vol. 2020, pp. 121–146).
26. La, T., Matas, K., Grunchevski, N., Pham, K., & Koch, D. (2020). FPGADefender: Malicious
self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on Reconfigurable
Technology and Systems, 13(3), 34:1–34:20. https://doi.org/10.1145/3402937
27. Lallet, J., Enrici, A., & Saffar, A. (2018). FPGA-based system for the acceleration of cloud
microservices. In 2018 IEEE International Symposium on Broadband Multimedia Systems
and Broadcasting (BMSB) (pp. 1–5). https://doi.org/10.1109/BMSB.2018.8436912
28. Ma, Y., Zhang, Q., Zhao, S., Wang, G., Li, X., & Shi, Z. (2020). Formal verification of
memory isolation for the trustzone-based TEE. In 2020 27th Asia-Pacific Software Engineering
Conference (APSEC) (pp. 149–158). https://doi.org/10.1109/APSEC51365.2020.00023
29. Maimut, D., & Reyhanitabar, R. (2014). Authenticated encryption: Toward next-generation
algorithms. IEEE Security & Privacy, 12(2), 70–72. https://doi.org/10.1109/MSP.2014.19
30. Mbongue, J. M., Shuping, A., Bhowmik, P., & Bobda, C. (2020). Architecture support for
FPGA multi-tenancy in the cloud. In 2020 IEEE 31st International Conference on Application-
specific Systems, Architectures and Processors (ASAP) (pp. 125–132). https://doi.org/10.1109/
ASAP49362.2020.00030
31. Microsoft Azure documentation for cloud FPGA attestation mechanism. https://learn.
microsoft.com/en-us/azure/virtual-machines/field-programmable-gate-arrays-attestation
32. Microsoft Azure documentation for FPGA optimized virtual machine sizes. https://docs.
microsoft.com/en-us/azure/virtual-machines/sizes-field-programmable-gate-arrays
33. Miura, N., Fujimoto, D., Korenaga, R., Matsuda, K., & Nagata, M. (2014). An
intermittent-driven supply-current equalizer for 11x and 4x power-overhead savings in
CPA-resistant 128bit AES cryptographic processor. In 2014 IEEE Asian Solid-State Circuits
Conference (A-SSCC) (pp. 225–228). https://doi.org/10.1109/ASSCC.2014.7008901
34. Nilsson, A., Bideh, P. K., & Brorsson, J. (2020). A Survey of Published Attacks on Intel SGX.
Technical Report, CoRR, abs/2006.13598.
35. Official repository of the AWS EC2 FPGA Hardware and Software Development Kit. https://
github.com/aws/aws-fpga
36. Parelkar, M. (2004). FPGA security-bitstream authentication. Technical Report, George
Mason University. http://mason.gmu.edu/~mparelka/reports/bitstream-auth.pdf
37. Perlman, R. (1999). An overview of PKI trust models. IEEE Network, 13(6), 38–43. https://
doi.org/10.1109/65.806987
38. Ramanna, S. C., & Sarkar, P. (2011). On quantifying the resistance of concrete hash functions
to generic multicollision attacks. IEEE Transactions on Information Theory, 57(7), 4798–4816.
https://doi.org/10.1109/TIT.2011.2146570
39. Rescorla, E. (2018). The Transport Layer Security (TLS) Protocol Version 1.3. Technical
Report, RFC 8446. https://doi.org/10.17487/RFC8446
40. Rührmair, U., et al. (2013). PUF modeling attacks on simulated and silicon data. IEEE
Transactions on Information Forensics and Security, 8(11), 1876–1891. https://doi.org/10.
1109/TIFS.2013.2279798
41. Ringlein, B., Abel, F., Ditter, A., Weiss, B., Hagleitner, C., & Fey, D. (2019). System
architecture for network-attached FPGAs in the cloud using partial reconfiguration. In 2019
29th International Conference on Field Programmable Logic and Applications (pp. 293–300).
https://doi.org/10.1109/FPL.2019.00054
42. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). An inside job: Remote
power analysis attacks on FPGAs. In 2018 Design, Automation & Test in Europe Conference
& Exhibition (DATE) (pp. 1111–1116). https://doi.org/10.23919/DATE.2018.8342177
43. Tsai, T. H., Ho, Y. C., & Sheu, M. H. (2019). Implementation of FPGA-based accelerator for
deep neural networks. In 2019 IEEE 22nd International Symposium on Design and Diagnostics
of Electronic Circuits & Systems (DDECS) (pp. 1–4). https://doi.org/10.1109/DDECS.2019.
8724665
44. Turan, F., Roy, S. S., & Verbauwhede, I. (2020). HEAWS: An accelerator for homomorphic
encryption on the Amazon AWS FPGA. IEEE Transactions on Computers, 69(8), 1185–1196.
https://doi.org/10.1109/TC.2020.2970824
45. Xiao, H., Li, K., & Zhu, M. (2021). FPGA-based scalable and highly concurrent convolutional
neural network acceleration. In 2021 IEEE International Conference on Power Electronics,
Computer Applications (ICPECA) (pp. 367–370). https://doi.org/10.1109/ICPECA51329.
2021.9362549
46. Xilinx application store for FPGA acceleration boards. https://www.xilinx.com/products/app-
store.html
47. Xilinx Vivado development software official product website. https://www.xilinx.com/products/
design-tools/vivado.html
48. Zeitouni, S., Vliegen, J., Frassetto, T., Koch, D., Sadeghi, A. R., & Mentens, N. (2021). Trusted
configuration in cloud FPGAs. In 2021 IEEE 29th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM) (pp. 233–241). https://doi.org/10.1109/
FCCM51124.2021.00036
49. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In 2018
IEEE Symposium on Security and Privacy (SP) (pp. 229–244). https://doi.org/10.1109/SP.
2018.00049
50. Zhou, Y., & Jiang, J. (2015). An FPGA-based accelerator implementation for deep convo-
lutional neural networks. In 2015 4th International Conference on Computer Science and
Network Technology (ICCSNT) (pp. 829–832). https://doi.org/10.1109/ICCSNT.2015.7490869
51. Zhu, Z., Liu, A. X., Zhang, F., & Chen, F. (2021). FPGA resource pooling in cloud computing.
IEEE Transactions on Cloud Computing, 9(2), 610–626. https://doi.org/10.1109/TCC.2018.
2874011
Chapter 2
Domain Isolation and Access Control
in Multi-tenant Cloud FPGAs

Christophe Bobda, Joel Mandebi Mbongue, Sujan Kumar Saha,


and Muhammed Kawser Ahmed

2.1 Introduction

Field-programmable gate arrays (FPGAs) are generally provided in the cloud in two
main paradigms: The first model is “hardware acceleration as a service” (HAaaS),
where acceleration is provided to the user through pre-implemented hardware
accelerators by the cloud operator. Users call functions with input values and collect
the results. This model is used, for instance, by Microsoft with the Catapult platform
for accelerating the Bing search engine [51]. The second model is “infrastructure
as a service” (IaaS), used by Amazon [1], where the FPGA logic is provided to
the users as part of a virtual infrastructure package (processor, memory, storage,
peripherals, etc.) that must be scheduled onto physical resources by the cloud
provider. In HAaaS, the cloud provider controls everything, which reduces the user’s
options to only the kernels implemented by the cloud provider. IaaS is more flexible
and allows users to develop and optimize their hardware implementation.
For security reasons, existing FPGA clouds do not implement FPGA multi-
tenancy and continue to allocate entire FPGAs to single users. In the age of dark
silicon, with a considerable amount of resources on an FPGA and designs that
rarely occupy the whole FPGA, assigning the entire fabric to single users reduces
device utilization and increases power consumption. Sharing FPGA resources
among different tenants can help improve FPGA utilization in the cloud. However,
multi-tenant deployment of FPGAs with hardware accelerators from arbitrary
sources requires substantial effort. The allocation and de-allocation of parts of the
FPGA resources require a model that enables efficient reallocation of input/output
(IO) resources to service multiple virtual machines from different tenants using

C. Bobda () · J. M. Mbongue · S. K. Saha · M. K. Ahmed


Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA
e-mail: cbobda@ece.ufl.edu; jmandebimbongue@ufl.edu; sujansaha@ufl.edu;
muhammed.kawsera@ufl.edu

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 29


J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_2
the FPGA simultaneously. Moreover, strategies must be designed to enable the
integration of FPGA resources in cloud resource management middleware without
completely redesigning the existing implementation. Finally, with multiple tenants
sharing the same physical FPGA in an IaaS system, an isolation and security
framework must be provided to prevent domain crossing.
This chapter discusses approaches for domain isolation for shared FPGA
resources in the cloud. The first part of the chapter covers system organization and
discusses the FPGA architecture to support sharing and integration in the virtual
machine manager efficiently. The second part of the chapter discusses isolation
and protection mechanisms and the isolation of mutually distrusting accelerators
sharing the same FPGA.

2.2 System Organization

The architectural model for a cloud infrastructure that supports multi-tenancy in an
IaaS paradigm consists of server nodes, each of which hosts CPUs, IO, storage, and
FPGAs directly or indirectly connected through a node-level interconnect. CPUs
run tenants’ virtual machines (VMs). The VMs may request hardware resources
to run one or more hardware accelerators or intellectual property cores (IP). This
section discusses the FPGA organization and mechanisms for integrating hardware
resources into a virtual resource manager.

2.2.1 FPGA Provisioning Model

We consider the resource provisioning model of traditional clouds in which users
request cloud service providers (CSPs) to set up a VM workspace. The user selects
the type and amount of resources to attach to the VM (CPU, RAM, storage, etc.),
boots the virtual environment, and starts running applications. Tasks can run as long
as they do not violate the service-level agreement (SLA). A similar flow is used
to expose FPGA components to cloud users. In a multi-tenancy environment, the
FPGA resource is usually provided to the VM in “FPGA units of virtualization”
representing the smallest allocatable area on the FPGA. They are also known as
virtual regions (VRs). FPGA resources are therefore allocated in multi-tenant clouds
by provisioning of VRs as opposed to the entire FPGA allocation currently used in
single-tenant models. Figure 2.1 illustrates the two approaches. The virtual machine
monitor (VMM) abstracts low-level hardware details into the VM workspace.
The CSP defines the resources available in each VR and the amount of memory,
storage, and processing offered for the user’s VM. The division of the FPGA can be
done using various strategies and experience with past workloads [43, 45]. It will
not be further addressed in this chapter.

Fig. 2.1 Illustration of the provisioning models. (a) Model in which entire FPGAs are provisioned
in the cloud. (b) Proposed model: VMM provisions FPGA regions in the cloud

Fig. 2.2 Hardware elasticity on FPGA. (a) The FPGA is partitioned into slots. (b) ACC1 uses
two slots, and ACC2 and ACC3 use one slot each. (c) The FPGA only provisions one slot that is
assigned to ACC1

2.2.1.1 FPGA Hardware Elasticity

Service elasticity generally allows for the provisioning and releasing of resources
and capacity scaling with need [47]. Provisioning of elastic FPGA resources in the
cloud enables developers to program designated areas on the fabric with designs of
various sizes. This action is possible in clouds that allocate entire FPGAs to users,
like AWS F1 [1]. This model of deployment consists of time-shared FPGAs. CSPs
generally program FPGAs with a shell that implements static design modules such
as IO controllers. Therefore, partial reconfiguration is leveraged to update hardware
functions in areas of the fabric that are exposed to cloud users [9].
The provisioning of space-shared FPGAs, in which multiple user designs are
co-hosted on a device, introduces a different perspective on hardware elasticity.
Hardware space sharing between multiple tenants requires constraining hardware
accelerators to geographic locations on the FPGA. As a result, user designs are
synthesized, placed, and routed considering the physical constraints of the allocated
FPGA areas (registers, block RAMs, DSP blocks, etc.). Therefore, if a cloud
application needs additional FPGA resources, the generated partial bitstream may
not be suitable for programming a different FPGA region.
Figure 2.2a illustrates a device divided into four slots. User accelerators (ACC_i,
i ∈ {1, …, 4} in Fig. 2.2b and c) can then utilize one or multiple slots based on their
need. However, it may not always be possible to allocate adjacent slots to user
workloads in multi-tenant FPGA clouds. For instance, an accelerator in SLOT 1
may need additional resources that are available only in SLOT 4 (Fig. 2.2a). There
are multiple approaches to allocate additional resources to a user's design that
cannot fit in the available FPGA regions: (1) find free space on other FPGAs,
(2) migrate the running workload to other FPGAs with enough space, and (3)
divide the design into smaller modules that can fit into the available slots on an
FPGA. Solution (1) represents the easiest step, but it may not be sufficient since
additional resource requirements can lead to the same issue. The migration of
workloads in cloud and virtualization systems transfers the execution environment
of an application to a new host while attempting to minimize the suspension time
[63]. A hardware design synthesized for a given geographical region on the FPGA
may not be placed at a different location on other FPGAs. Moreover, the migration
task itself faces significant challenges. For instance, let us assume that ACC 1
runs on SLOT 1. If it needs more resources, we could migrate the accelerators
currently running in SLOT 2 and SLOT 3 to SLOT 3 and SLOT 4, respectively.
It will free SLOT 2 as an additional resource for accelerator ACC 1. The resulting
layout is illustrated in Fig. 2.2b. Migrating CPU applications generally consist of
suspending the program’s execution, copying register, memory, and disk contents to
another CPU’s stack, and resuming the program’s execution on the new CPU [49].
This approach does not apply to the migration of FPGA designs. Execution of an
application on a CPU is sequential, with instructions processed in stages (fetching
and decoding, loading of operands, execution, and result write back [32]). However,
migrating hardware functions on an FPGA requires additional hardware circuits to
save and restore design context. Migration can be avoided by allocating large
slots to hardware accelerators, but this approach increases fragmentation and
reduces FPGA utilization.
Instead of migration, additional resources can be provided to accelerators while
they run, whether those resources are on the same device or on distant devices.
For this approach to work, a communication infrastructure must be provided for
unrestricted data exchange among hardware regions. Such an infrastructure can
rely on a bus architecture or on a low-latency network-on-chip.
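The contiguity constraint behind these options can be made concrete with a toy first-fit allocator. The four-slot layout mirrors Fig. 2.2a; the slot indices and the adjacency requirement are illustrative assumptions, and real VR placement must additionally respect the physical resources inside each slot:

```python
# Toy first-fit allocator for contiguous FPGA slots (cf. Fig. 2.2).
def find_contiguous(free_slots, n):
    """Return the first run of n contiguous free slot indices, or None.
    `free_slots` is a sorted list of free slot indices."""
    run = []
    for s in free_slots:
        if run and s != run[-1] + 1:
            run = []            # gap found: restart the candidate run
        run.append(s)
        if len(run) == n:
            return run
    return None

# ACC2 and ACC3 occupy SLOT 2 and SLOT 3, so only SLOT 1 (index 0)
# and SLOT 4 (index 3) are free.
free = [0, 3]
print(find_contiguous(free, 2))   # None: no adjacent pair is free
print(find_contiguous(free, 1))   # [0]
```

A two-slot request fails even though two slots are free; at that point the CSP must find space on another FPGA, migrate a tenant to defragment, or split the design into single-slot modules, which are exactly options (1) to (3) above.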

2.2.1.2 FPGA Interfacing

Within the stack of an IaaS, FPGAs generally act as co-processors to enable cus-
tomized acceleration for specific software functions. They are most often integrated
with VM software through high-speed interfaces such as PCIe. In virtualization
systems such as cloud infrastructure, the hardware interface is generally exposed
as IO ports to VMs using one of the four major approaches shown in Fig. 2.3.
IO virtualization is achieved by either software (emulation and paravirtual-
ization) or hardware (directIO, single/multiple root IO virtualization) support. In
emulation (Fig. 2.3a), each attempt to execute an IO instruction raises a system
call that is trapped and executed by the virtual machine manager (VMM) in a
privileged mode. While this approach has the benefit of not requiring guest operating
system (GOS) modification, it incurs high overhead because of the recurrent
context switches between privileged and non-privileged modes. Paravirtualization

Fig. 2.3 Summary of the state-of-the-art IO virtualization approaches. (a) Emulation, (b) Paravirtualization, (c) DirectIO, and (d) SRIOV/MRIOV

prevents context switches by implementing communication between front-end
drivers in the VM and back-end drivers in the VMM (Fig. 2.3b) [5]. Even though
paravirtualization modifies the GOS, it improves security and platform stability as
the hardware is accessed through a unified driver in the VMM [21].
Even with optimizations, software-based approaches incur performance penalties
from VMM interference. As a solution, directIO removes VMM overhead by
directly assigning the IO device to a VM (Fig. 2.3c). While device IO speed is
increased, it does not enable multi-tenancy, which is one of the core reasons for
hardware virtualization. Single/multiple root IO virtualization (SRIOV/MRIOV)
extends the directIO concept with device-sharing capabilities. A set of virtual
functions (VF) is provisioned to VMs that can access the same device through
the PCIe interface (Fig. 2.3d). SRIOV/MRIOV achieves near directIO performance
[15, 62]. However, it does not permit run-time remapping since the allocation of
VFs persists for the entire lifecycle of the VMs. This is not ideal in an environment
that aims to implement elasticity: it should be possible to allocate, de-allocate, and
re-assign FPGA regions to cloud tenants at runtime.
This chapter considers FPGAs as IO devices and proposes a hardware/software
architecture that extends state-of-the-art IO virtualization approaches. As in
SRIOV and MRIOV, hosting several VFs (also called virtual regions, or VRs) on a
shared FPGA is desirable. Unlike SRIOV and MRIOV, however, implementing a
paravirtualized architecture enables run-time remapping of FPGA resources and
supports rapid elasticity.
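The paravirtualized path can be sketched with a shared request queue: the front-end driver posts complete requests and the back-end driver in the VMM drains them in one pass, so no per-instruction trap is taken. The queue and request format below are illustrative stand-ins for a VirtIO virtqueue, not its actual layout:

```python
from collections import deque

shared_ring = deque()      # illustrative stand-in for a VirtIO virtqueue

def frontend_write(vr_id, payload):
    """Guest-side front-end driver: post a request without trapping."""
    shared_ring.append({"op": "write", "vr": vr_id, "data": payload})

def backend_drain(fpga):
    """Host-side back-end driver in the VMM: service pending requests."""
    handled = 0
    while shared_ring:
        req = shared_ring.popleft()
        if req["op"] == "write":
            fpga.setdefault(req["vr"], []).append(req["data"])
        handled += 1
    return handled

fpga_regions = {}
frontend_write(0, b"job-a")    # two IO operations, zero traps so far
frontend_write(1, b"job-b")
print(backend_drain(fpga_regions))   # 2: both served in one back-end pass
```

Contrast this with emulation (Fig. 2.3a), where each of the two writes would have raised its own trap into the VMM.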

2.2.2 Virtual Machine Integration

Figure 2.4 summarizes a global architecture supporting dynamic resource allocation.
Each FPGA is divided into a fixed shell (PCIe block, a set of buffers, and
multiplexing/demultiplexing channels) and a set of reconfigurable virtual regions
(VRs) that can be assigned to VMs to accelerate dedicated functions.
The VR components are accessed through an access monitor. In this architec-
ture, a VM is uniquely identified by a 32-bit context identifier (CID). Each VM is
Fig. 2.4 Hardware and software components of the global architecture

presented with an FPGA abstraction in which each hardware call to an accelerator
is replaced by a data transfer to the hardware. In the host, the request handler is
responsible for checking the type of the incoming requests, each of which can be:
(1) a read operation, (2) a write operation, (3) a request for VR allocation, and (4)
a request for releasing a VR. The request handler forwards requests to the FPGA
management software that interfaces with the hardware. The resource allocation
module manages the lifecycle of VRs in VM domains. It comprises two data tables:
the VR/FPGA table and the CID/VR table. The VR/FPGA table establishes the
correspondence between the VRs and the FPGA on which they are provisioned.
The CID/VR table keeps the correspondence between each VM (identified by a
32-bit CID) and their active VRs. The table controller allows updating the status
(available, reserved, etc.) of the VRs in the data tables. The graph builder updates
the data structure that keeps the list of available VRs in memory. The placement
tool is the component in charge of allocating VRs to VMs to minimize latency
and communication overhead among hardware accelerators. The VR controller
is responsible for programming FPGA devices using partial reconfiguration and
initializing VR internal registers.
Since each VR occupies a well-defined geographic FPGA location containing
a fixed amount of resources [look-up tables (LUTs), digital signal processing
blocks (DSP), embedded memory (BRAM), etc.], the communication infrastructure
[network-on-chip (NoC) or bus] will allow on-demand expansion of the hardware
domain allocated to a VM. The NoC makes it possible to assign multiple connected
VRs to a single job. However, low-level details of the NoC topology are beyond
the scope of this chapter. An efficient NoC infrastructure that supports
multi-tenancy with VM integration has been designed, implemented, and described
in Mandebi et al. [43, 45].
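The request handler and the two data tables can be modeled with plain dictionaries. The request encoding, the status strings, and the ownership check are illustrative assumptions; the real module additionally drives partial reconfiguration through the VR controller:

```python
class ResourceAllocator:
    """Minimal model of the resource allocation module of Fig. 2.4."""

    def __init__(self, vr_fpga_table):
        self.vr_fpga = dict(vr_fpga_table)   # VR/FPGA table: VR -> FPGA id
        self.cid_vr = {}                     # CID/VR table: VM CID -> set of VRs
        self.status = {vr: "available" for vr in self.vr_fpga}

    def handle(self, cid, request):
        """Dispatch the four request types checked by the request handler."""
        kind, vr = request
        if kind == "alloc":
            if self.status.get(vr) != "available":
                raise RuntimeError(f"VR {vr} not available")
            self.status[vr] = "reserved"
            self.cid_vr.setdefault(cid, set()).add(vr)
            return self.vr_fpga[vr]   # FPGA to program via partial reconfiguration
        if kind == "free":
            self.cid_vr.get(cid, set()).discard(vr)
            self.status[vr] = "available"
            return None
        if kind in ("read", "write"):
            # Access control: a CID may only touch VRs it owns.
            if vr not in self.cid_vr.get(cid, set()):
                raise PermissionError(f"CID {cid} does not own VR {vr}")
            return self.vr_fpga[vr]
        raise ValueError(f"unknown request type: {kind}")

alloc = ResourceAllocator({"VR0": "fpga-0", "VR1": "fpga-0", "VR2": "fpga-1"})
alloc.handle(0xBEEF, ("alloc", "VR1"))
print(alloc.handle(0xBEEF, ("write", "VR1")))   # fpga-0
```

The per-CID ownership check is the software half of the domain-crossing prevention discussed in Sect. 2.3; the hardware half is enforced by the access monitor in front of the VRs.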

2.2.3 Simultaneous FPGA Access

The targeted usage model requires a low-overhead guest VM-FPGA accelerator
communication scheme and the ability to dynamically change guest VM-accelerator
communication scheme and the ability to dynamically change guest VM-accelerator
mapping at run-time in a fashion that is transparent to the user. Virtual Unix sockets
(vsockets) over VirtIO transport are leveraged to provide fast IO access. As Fig. 2.4
shows, guest virtual machines are capable of accessing each accelerator on the
FPGA through a virtual socket channel opened from the guest OS to the host during
guest user application execution. Calls to open the socket (i.e., virtual socket clients)
are appended to calls for hardware functions during compilation. A virtual socket
client relies on VirtIO to transport data directly to the host software responsible for
managing the connections to the FPGAs.
On the host side, the software responsible for scheduling access to the pool
of FPGA accelerators abstracts and exposes virtual sockets to receive socket
connections from guest VM user application processes. The host software uses
the AF_VSOCK socket address family for socket creation, which utilizes the
standard host kernel network stack for high-priority processing of socket
connections. The host software implements a non-blocking multithreading scheme
that provides time-multiplexed, concurrent access to the accelerators.
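On Linux, the guest-side client in this scheme maps directly onto Python's AF_VSOCK support (available since Python 3.7). The host CID value 2 (VMADDR_CID_HOST) is standard; the port number and the length-prefixed framing are illustrative assumptions rather than the actual protocol:

```python
import socket
import struct

HOST_CID = 2          # VMADDR_CID_HOST: address of the hypervisor host
SERVER_PORT = 9999    # assumed port of the multi-threaded host server

def frame_request(vr_id: int, payload: bytes) -> bytes:
    """Length-prefixed request: 4-byte VR id, 4-byte length, then payload."""
    return struct.pack("!II", vr_id, len(payload)) + payload

def call_accelerator(vr_id: int, payload: bytes) -> bytes:
    """Guest-side stub appended to a hardware function call at compile time:
    opens a vsock channel to the host and ships the data to the FPGA."""
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as s:
        s.connect((HOST_CID, SERVER_PORT))
        s.sendall(frame_request(vr_id, payload))
        return s.recv(4096)   # result written back by the accelerator
```

On the host, the matching server would bind with `s.bind((socket.VMADDR_CID_ANY, SERVER_PORT))` and hand each accepted connection to a worker thread, yielding the time-multiplexed access described above.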

2.3 Security Architecture and Domain Isolation

This section describes the security framework’s mechanism to ensure the controlled
sharing of hardware modules. Sharing FPGA resources among tenants without a
guarantee of isolated execution can lead to scenarios where shared accelerators act
as potential covert channels among software guests which reside in different security
contexts. Remote attacks have recently been demonstrated on FPGAs [22, 52, 64].
Giechaskiel et al. showed that information transported on long wires in an FPGA can
be exfiltrated using neighboring long wires on the same chip [22]. This vulnerability
was then leveraged by Ramesh et al. to launch an attack on a remote accelerator
sharing the same FPGA as a malicious accelerator [52].
Prior research on domain isolation in hardware has focused on the prob-
lem of isolating accelerator access in standalone systems-on-chip applications
[6, 13, 14, 16, 17, 19, 23, 33–36, 50, 53, 56, 59]. In the Secure Enclave approach
[2, 4, 8, 24, 37, 54, 58, 60, 61], system developers rely on an operating system and
hardware-level enforcement mechanisms to provide an isolated mode of execution
36 C. Bobda et al.

and other system security services, such as confidentiality and integrity protection
of external memory. Elnaggar et al. described a defense mechanism against
attacks on multi-tenant FPGAs using secured authentication [18]. While these
proposals achieve isolation in single-application systems-on-chip, they have yet to
be demonstrated to work efficiently in multi-user cloud environments managed by
host and guest operating systems. Current implementations of these proposals
do not define an interface that would allow accelerators to dynamically inherit, at
run-time, the security policies of the processes calling them from the operating system or
the hypervisor. Furthermore, secure enclave technologies such as TrustZone [2] do
not provide fine-grained security enforcement directly at the IP level, which limits
their security coverage.
In the literature, FPGA virtualization architectures have been proposed to increase
the programmability and security of FPGAs without sacrificing performance [41, 42, 48].
Address protection strategies for systems-on-chip using NoC-based communication
architectures were demonstrated by Saeed et al. [55], while hardware isolation
strategies for IP protection in systems-on-chip (SoCs) and computer networks were
presented in [28, 29] and further extended in [10, 11, 30] to shield hardware IPs
using hardware sandboxes. Mbongue et al. investigated the use of access control
mechanisms in the cloud with FPGA extensions to support cloud security [41]. That
work does not consider sharing among FPGAs and allocates an entire FPGA to a
single VM. Internet of Things (IoT) security [46] and domain isolation strategies in
the cloud were also proposed by Festus et al. [25, 26], where a provable architecture
for isolation in networked designs was proposed, implemented, and evaluated. Security
rule checking for access control in SoC-based embedded systems was the goal of
[30, 31]. The security architecture for domain isolation and access control
is described in the following subsections, along with the threat model and system
assumptions in the cloud domain.

2.3.1 FLASK Security Architecture

To address the problem of domain separation enforcement in software environments,
the National Security Agency (NSA), in conjunction with the Secure
Computing Corporation and the University of Utah, developed the open-source Flux
Advanced Security Kernel (FLASK), whose main objective is to provide flexible
support for mandatory access control (MAC) policies in operating systems [38]. In
a system with MACs, a security label that describes the security context to which
system components belong, is assigned to each system subject (e.g., processes) and
each system object (files, sockets, memory segments, etc.). All accesses between
subjects and objects must be governed by the MAC policy based on the labels
[26, 38]. FLASK cleanly separates the definition of the security policy logic from
the enforcement mechanism to enable different models of security to be enforced by
the same base system [35, 38]. Flask is implemented by modern operating systems

[Fig. 2.5 FLASK security architecture: a subject (e.g., a process) in Context(a) requests access to an object (e.g., a file) in Context(b); the Policy Enforcement Server consults the Access Vector Cache and, on a miss, the Security Server, and access is granted when the security contexts match the policy, denied otherwise]

such as SELinux using a Linux security module (LSM) built into the kernel, with
"Type Enforcement (TE)" as the default security policy.
The solution described in this chapter borrows from the domain separation
insights of the FLASK security architecture. These insights are applied to the
isolation of guest VM execution in mandatory access control (MAC)-based
hypervisors and extended to the isolated execution of hardware accelerators on
FPGAs. The resulting framework guarantees that hardware modules execute and
reside in the same security context as the "caller" guest VM by propagating the
caller's privilege boundaries, defined at the software level, to the "callee" modules.
The proposed solution is based on the FLASK architecture, the foundation of the
security kernels of the most widely deployed hypervisors, such as KVM and
Xen (Fig. 2.5).

2.3.2 Threat Model and System Assumption

Figure 2.6 represents an FPGA-accelerated cloud infrastructure model in which
applications running on VMs access FPGA accelerators. Each VR combines static
components pre-defined by the cloud provider with a hardware accelerator designed
by the cloud user. The static components are programmed as part of the shell,
and each user-defined accelerator is part of a partially reconfigurable region or
"PR Region." Therefore, the hardware domain of a VM corresponds to the set of
VRs allocated to it on the FPGAs at run-time. As shown in Fig. 2.6, VR1 on FPGA1
represents the hardware domain of VM1. In other words, FPGAs are
space-shared between VM applications in this system model. As an illustration,
in Fig. 2.6, Appi running in VMi accesses the accelerator Acci hosted in FPGA1
[Fig. 2.6 Overview of the cloud infrastructure and the threat model: applications App1-App4 in VM1-VM3 access accelerators Acc1-Acc4 hosted in virtual regions VR1-VR4 of FPGA1 through the VMM, the shell, and per-VR HMMs; the Security Server performs (1) VM session identification, (2) enforcement monitoring, and (3) access rule definition, backed by the Security Policy Database]

(with i ∈ {1, . . . , 4}). Since each VR is memory-mapped in the FPGA address
space, read and write requests with an adequate base address and offset can result
in a domain breach. Therefore, malicious software that can submit access requests
(read or write operations) to the FPGA cloud interface poses a potential threat.
This work considers the following attack scenarios:
• Scenario 1: An application running in a VM uncovers the address space of an
FPGA accelerator in the hardware domain of a co-tenant. This poses a serious
threat to the confidentiality, integrity, and availability of user services, as an
attacker can steal a secret, tamper with data, or compromise the execution of
applications in another VM domain. An illustration is shown in Fig. 2.6: App2
running in VM2 directly interacts with Acc1, which is part of VM1's hardware
domain.
• Scenario 2: In this attack scenario, a breached VMM is considered a potential
threat. An attacker can run malicious software with superuser privileges and
access the entire address space of the hosted FPGAs. Therefore, private user data
could be exposed or tampered with using corrupted data. Moreover, denial-of-
service and task hiding are concerns: the attacker could block user traffic and run
unauthorized workloads. Figure 2.6 shows malicious software launched in the
VMM host that performs read/write operations and blocks traffic from VM3 to
execute unauthorized jobs in Acc3. These attacks directly put the confidentiality,
integrity, and availability of user data and services at risk.

2.3.3 Domain Isolation Model

In an FPGA-accelerated cloud infrastructure, both software and hardware components
contribute to the execution of an application. The software provides the data

to process, and the hardware accelerates computation. To ensure domain separation
between the applications sharing FPGA devices, the following security model S is
formulated to ensure isolated execution of FPGA hardware kernels:

    S := {VM, A, F, VR, D, M}    (2.1)

where:
• VM = {VM1, VM2, VM3, . . . , VMn} is the set of virtual machines in the cloud
platform.
• A = {A1, A2, A3, . . . , An} is the set of application sets, in which each virtual
machine i has its corresponding application set Ai = {ai1, ai2, ai3, . . . , aim}.
• F = {f1, f2, f3, . . . , fp} is the set of FPGA devices.
• VR = {VR1, VR2, VR3, . . . , VRp} is the set of virtual regions allocated in the
FPGA devices. Each VRj, associated with the corresponding FPGA device, is
itself a set of VRs: VRj = {VRj1, VRj2, VRj3, . . . , VRjq}.
• D = {1, 0} is the set of decisions: "1" indicates that access is permitted, and "0"
indicates that access is not permitted.
• M = {M1, M2, M3, . . . , Mn×p} is the set of access matrices. This set has n × p
elements, where each element is a matrix of dimension m × q. Each entry of Mi
is a 2-bit value d1d2, which represents the corresponding read and write
permissions of an application to a hardware accelerator.

To ensure domain separation, the following rules aim to preserve the confidentiality,
integrity, and availability of the VM domains:
Rule 1. For each VMi ∈ VM, there is a function UVM : VM → A, which must
be one-to-one. Likewise, for each f ∈ F, UF : F → VR is a one-to-one
function.
Rule 2. An access request is a 4-tuple τ := (VMi, aij, f, VRkl), where VMi ∈ VM,
aij ∈ Ai, f ∈ F, and VRkl ∈ VRk.
Rule 3 (Confidentiality). For a legal access request τ (obeying Rule 2), if the
corresponding read decision d1 is made by the lookup function Γ(M, τ), with
d1 ∈ D, confidentiality is preserved in the system domain.
Rule 4 (Integrity). For a legal access request τ (obeying Rule 2), if the
corresponding write decision d2 is made by the lookup function Γ(M, τ), with
d2 ∈ D, integrity is preserved in the system domain.
Rule 5 (Availability). The availability of the system resources [virtual regions
(VR) in FPGA devices (F)] is ensured by the secured communication protocol
(described in Sect. 2.3.4.3).
Rule 6. Only the trusted cloud service provider has the ability to modify the
elements of the access matrices M.
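The model and rules above can be rendered as a small executable sketch. The matrix is keyed by (application, VR) pairs for brevity, and all VM, application, and VR names are invented for illustration:

```python
READ, WRITE = 0b10, 0b01   # the 2-bit entry d1 d2: d1 = read, d2 = write

class AccessMatrix:
    """M: maps (application, virtual region) pairs to a 2-bit value d1 d2."""
    def __init__(self):
        self._m = {}

    def grant(self, app, vr, bits):
        # Rule 6: in the real system only the trusted provider may do this.
        self._m[(app, vr)] = bits

    def lookup(self, tau):
        """Gamma(M, tau) for an access request tau = (VMi, aij, f, VRkl)."""
        _vm, app, _fpga, vr = tau
        return self._m.get((app, vr), 0b00)    # default: no access

def can_read(m, tau):    # Rule 3 (confidentiality): read decision d1
    return bool(m.lookup(tau) & READ)

def can_write(m, tau):   # Rule 4 (integrity): write decision d2
    return bool(m.lookup(tau) & WRITE)
```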

[Fig. 2.7 Proposed isolation framework: VM tasks and admin-defined security policies feed the security policy server in the hypervisor, which propagates kernel and domain-specific security inheritance policies through the SMM; on the FPGA, each accelerator is guarded by an HMM containing an access enforcement function and an AVC]

2.3.4 Hardware/Software Isolation Architecture

In FPGA-accelerated cloud computing, applications consist of software threads
running on the VM and computations performed in hardware accelerators on FPGAs.
The isolation strategy extends the FLASK software-only multi-level security to the
hardware–software co-design environment. The isolation mechanism is enforced by
splitting the software-only access control into two modules: the Hardware Modules
Manager (HMM), an object manager for hardware accelerators on the FPGA, and
the Software Module Manager (SMM) for threads running in software (Fig. 2.7).
The SMM maintains a map of security context labels for the hardware accelerators
it abstracts. At the same time, the HMM implements a custom circuit, the Access
Enforcement Function, which guards access to the hardware accelerators according
to the host kernel MAC policy, as the figure shows.
Isolation is enforced by adhering to the following procedure: When the access
control component receives a request to use a hardware accelerator for the first
time, access permissions associated with the caller guest virtual machine and the
callee hardware accelerator label pair are checked. The SMM running on the host
CPU managing the inheritance of security contexts queries the host kernel security
server for associated permissions. The host kernel security server consults its MAC
policy, a policy developed by the system security policy administrator, and returns
related permissions, which are then sent to the access enforcement function. The
access enforcement function then stores this relationship in the HMM Access
Vector Cache component (AVC), the HMM component in charge of remembering
access decisions. The enforcement function will consult the latter for subsequent
deny/grant decisions.

In the first iteration of its design, when the access enforcement function allows
the request to proceed, it sends a message to the hardware accelerator over the
network-on-chip. When the execution is complete, the accelerator sends the result
to the SMM.
During hardware execution, the access enforcement function maintains the list of
busy hardware modules and their corresponding caller guest VM security contexts
in a specialized data structure. It allows the access enforcement function to discard
results of ongoing execution in case there have been changes in security policies
mid-execution, which the ongoing execution violates.
To minimize the overhead associated with access decision requests and computations,
during the initial access decision request for a given pair of security contexts,
the Flask security server provides more decisions than requested. These decisions,
and the security context pairs they map to, are then cached by an "Access
Vector Cache (AVC)." For this approach to scale without degrading overall system
performance due to repeated access requests, AVC capability is added to the
Hardware Module Manager, and the AVC interface is responsible for handling
access decision misses. The security policies themselves remain in software. To
handle a miss, the AVC queries its software component, which requests an access
decision from the Flask security server. When the latter returns the decision, the AVC
informs the access enforcement function and updates its access decision table.
When there are changes in security policies, the Flask security server alerts the
AVC software component of policy changes. The latter notifies the AVC hardware
component and updates its permissions state. Then, the AVC alerts the access
enforcement function of changes in policies. The access enforcement function
reevaluates the security context of ongoing hardware execution. It discards the
execution results if, per the updated policies, the execution was not authorized.
The AVC software component informs the security server when policy update
propagation is completed.
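The decision path just described (enforcement point, AVC, and security server, with policy-change propagation) can be sketched as a software model; the class and method names and the policy format are assumptions, not the chapter's code:

```python
class SecurityServer:
    def __init__(self, policy):
        # policy: {(subject_label, object_label): {"read", "write", ...}}
        self.policy = policy
        self.avcs = []

    def query(self, subj, obj):
        # Return the full permission set for the label pair, i.e. more
        # decisions than requested, so the AVC can answer later queries.
        return frozenset(self.policy.get((subj, obj), ()))

    def update_policy(self, policy):
        self.policy = policy
        for avc in self.avcs:       # alert every AVC of the policy change
            avc.invalidate()

class AccessVectorCache:
    def __init__(self, server):
        self.server = server
        self.cache = {}
        server.avcs.append(self)

    def permitted(self, subj, obj, perm):
        if (subj, obj) not in self.cache:     # miss: ask the security server
            self.cache[(subj, obj)] = self.server.query(subj, obj)
        return perm in self.cache[(subj, obj)]

    def invalidate(self):
        # Drop cached decisions so ongoing executions are re-evaluated
        # against the updated policy.
        self.cache.clear()
```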
The pair SMM and HMM ensures that a hardware accelerator running on the
FPGA performs actions in lockstep only with the software thread that controls
it, thus extending the domain separation enforced in the hypervisor to the FPGA.
The resource access control uses the same tools as SMM and HMM to grant
access to those resources from software threads and accelerators as defined in the
security policies.

2.3.4.1 Security Server

The security server uniquely identifies the communication sessions between VMs
and FPGAs, implements security policies defining the access rules, and monitors the
enforcement of the security policies. The communication sessions implement Rule
2 of the security model as they enable requests between applications in the VM and
the VR on the FPGA.

Communication Session Generation The communication sessions between VMs
and hardware accelerators on FPGAs are identified by generated random numbers.
A generated random number serves as the session ID and is shared with the
VM and the hardware accelerator through secured communication channels. To
create session IDs with satisfactory entropy, the Security Server implements a
cryptographically secure pseudo-random number generator using the Linux kernel
random number interfaces [40]. It relies on device drivers and other environmental
noise, such as timing variance of processor operations in the Security Server, to seed
the random number generation. Once a certain entropy threshold based on the
recorded random events is reached, random numbers can be requested through system calls.
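In software, drawing session identifiers from the kernel CSPRNG can be sketched as below; os.urandom reads the same kernel entropy pool as the interfaces cited above, and the 16-bit ID and 128-bit key widths follow Sect. 2.3.4.3 (the function names are illustrative):

```python
import os

def new_session_id() -> int:
    """16-bit session ID drawn from the kernel CSPRNG."""
    return int.from_bytes(os.urandom(2), "big")

def new_session_key() -> bytes:
    """128-bit session key (generated by the HMM's TRNG in hardware;
    the kernel pool stands in here as a software substitute)."""
    return os.urandom(16)
```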
Security Policy Implementation The Security Server implements and holds the
security policies developed based on the guidelines defined in the security formalism
(see Sect. 2.3.3). It uniquely identifies each system component (VMs, applications,
FPGAs, and VRs) with specific identifiers, implementing Rule 1 of the security
formalism. It also stores access matrices that bind hardware resources to VM
domains at run-time. The access matrices trace the VM applications running within
the VRs. Each VR is distinguished based on the FPGA from which it is provisioned.
The security policies expressed as access rules are stored in the “Security Policy
Database.”
Enforcement Monitoring The monitoring of the security policy enforcement is
achieved through hardware interrupts. Any unauthorized attempt to access the
hardware accelerators within a VM domain is reported to the Security Server by
the HMM. The cloud provider is responsible for defining the actions that follow a
breach attempt.

2.3.4.2 Hardware Modules Manager

The HMM works with the Security Server to enforce access control over FPGA
accelerators. Figure 2.8 shows the internal architecture of the HMM. The major
components of the HMM are the maintenance controller, the true-random number
generator (TRNG), the cryptography module, and the decision module. It also has a
session ID register (SessionID_reg), Session Key register (SessionKey_reg), header
extraction, and insertion modules. The HMM has three interfaces: a Management
Interface for secure communication with the Security Server, an Accelerator
Interface to stream data in and out of the hardware accelerators, and a Cloud
Interface enabling exchanges between the hardware and software services running
within VMs. For security purposes, the Management Interface is not connected to
physical ports managed by the VMM.
Maintenance Controller This unit implements the logic that the Security Server
uses to access configuration resources such as session ID and Session Key registers.
It also provides interfaces to request the generation of random numbers.

[Fig. 2.8 Architecture of the Hardware Modules Manager: the Cloud Interface links cloud applications to an application controller and an AES core (encrypt/decrypt, keyed by SessionKey_reg); the Maintenance (Mnt) Interface links the Security Server to the maintenance controller; an RO-based TRNG with an XOR-tree entropy source feeds key generation; the decision module compares incoming session IDs against SessionID_reg and drives an interrupt generator; the Accelerator Interface streams data to and from the hardware accelerator]

[Fig. 2.9 Structure of communication packets: a 16-bit session ID header, followed by a payload consisting of a 1-bit operation field and 128 bits of data]

True-Random Number Generator The TRNG generates random keys that are
stored in the Session Key registers. The crypto module uses the keys to encrypt and
decrypt the messages. Encrypting communications ensures confidentiality (Rule 3
of the security formalism). Ring Oscillators (ROs) are used as a source of entropy.
The output of the XOR tree is sampled in a synchronous D flip-flop driven by the
system clock to convert the RO jitter into a random digital sequence [39]. Jitter, in
this case, represents the deviation caused by random process variation and temporal
variations such as random physical noise, environmental variations, and the aging
of the chip.
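A rough software model of this sampling scheme is sketched below; the Gaussian period-jitter model and all parameter values are assumptions for illustration, and a software PRNG stands in for the physical noise, so this is a simulation rather than a true entropy source:

```python
import random

def sample_trng_bit(num_ros=30, nominal_period=1.0, jitter=0.05, t=1000.0):
    """XOR the sampled outputs of num_ros jittery ring oscillators at time t."""
    bit = 0
    for _ in range(num_ros):
        period = nominal_period + random.gauss(0.0, jitter)  # per-RO jitter
        bit ^= int(t / period) % 2    # RO output level at the sampling clock
    return bit

def random_word(nbits=16):
    """Assemble a random word bit by bit, as the sampling flip-flop does."""
    return sum(sample_trng_bit() << i for i in range(nbits))
```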
Crypto Module The crypto module decrypts incoming traffic and encrypts outgoing
packets. The Advanced Encryption Standard (AES) with 128-bit keys and ten
rounds is used in this architecture.
Decision Module This module decides whether an incoming packet is forwarded
to the accelerator or discarded. Figure 2.9 shows the structure of communication
packets.
The header stores the session ID, and the payload defines the operation (read or
write) and the data to transfer. The session ID of incoming data is first checked
against the content of the session ID register. Access to the accelerator is granted
when both values match (Rule 4 of the security formalism). If the two values do not
match, an interrupt is generated to the Security Server (in support of Rule 5 of the
security formalism).

[Fig. 2.10 Secured communication protocol: sequence diagram of the exchanges among the VM application, Security Server, VMM, HMM, and FPGA accelerator, covering session setup (steps 1-8), encrypted data transfer, session ID checking, and security attack notification on a mismatch]
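The packet layout of Fig. 2.9 and the decision rule can be sketched as follows (byte-aligned here for convenience; on chip the fields form a single bit vector):

```python
READ_OP, WRITE_OP = 0, 1

def pack_packet(session_id: int, op: int, data: bytes) -> bytes:
    """16-bit session ID | 1-bit operation (one byte here) | 128-bit data."""
    assert 0 <= session_id < 2**16 and op in (READ_OP, WRITE_OP) and len(data) == 16
    return session_id.to_bytes(2, "big") + bytes([op]) + data

def decide(packet: bytes, session_id_reg: int):
    """Decision module: forward (op, data) if the header matches the session
    ID register; return None (discard + interrupt in hardware) otherwise."""
    sid = int.from_bytes(packet[:2], "big")
    if sid != session_id_reg:
        return None
    return packet[2], packet[3:]
```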

2.3.4.3 Secure Communication Protocol

The secure communication protocol between VM software and hardware accelerators
is illustrated in Fig. 2.10.
The secure communication protocol involves five components: a VM application,
the Security Server, the VMM, the HMM, and the accelerator on the FPGA. The pro-
tocol provides hardware-level access controls to ensure that only a designated VM
can use an accelerator in a VR. The secure communication starts with configuring a
communication session, followed by hardware acceleration of software functions.
Configuring a secured communication session includes the following steps (see
Fig. 2.10):
1. A VM application submits an FPGA access request to the Security Server with
the netlist of the hardware function to program as a parameter.
2. The Security Server forwards the request to the VMM.
3. The VMM allocates FPGA resources considering the pool of available VRs.
4. The VMM programs the user design on the VR selected in step (3).

5. The Security Server records the VM-FPGA region binding and generates a
random session ID number shared with the VM through a trusted communication
channel.
6. The Security Server requests a Session Key from the HMM in the FPGA region
hosting the user design. The Session Key is shared with the VM through a trusted
communication channel.
7. The Security Server shares the session ID with the HMM to enable hardware-
level authentication.
8. The Security Server directs the VMM to assign the address space of the FPGA
accelerator to the VM to enable read and write operations to the VR allocated in
step (3).
Once these initial configuration steps are completed, the communication between
the VM and hardware can start. The VM encrypts data using the Session Key.
Each message sent between the VM and hardware accelerators (and vice versa) contains a
payload and a session ID number. When the HMM receives data, it decrypts it using
the Session Key and checks whether the session ID number matches its record. If
it does, the message is forwarded to the hardware accelerator; otherwise, a security
notification is sent to the Security Server.
The VM also decrypts the response from the hardware to ensure that the received
session ID matches the value previously sent by the Security Server. Overall,
the secured communication protocol relies on a two-step authentication between
VM and FPGA accelerator that uses a 128-bit session key and a 16-bit session
ID, or 2.144 possible combinations at a time. The session keys and session IDs
are generated at the hardware level on the FPGA and Security Server to ensure
randomness. Though the 128-bit session keys and 16-bit session IDs are re-used
in the implementation, the architecture can accommodate wider data widths. After
a time window defined by the cloud provider, the Security Server initiates the
generation of new session keys and session IDs to reduce the risk of a security
breach. The system configuration is only performed by an authorized administrator
(Rule 6 of the security formalism).
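The session setup and authenticated exchange can be traced end to end in software. Python's standard library has no AES, so a keyed-hash XOR stream stands in for the AES-128 core; the 16-bit ID, 128-bit key, and the session ID check follow the protocol above, while the message content is invented:

```python
import hashlib
import os

def _keystream(key: bytes, n: int) -> bytes:
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    # XOR stream cipher as a stand-in for the AES-128 core.
    return bytes(a ^ b for a, b in zip(plaintext, _keystream(key, len(plaintext))))

decrypt = encrypt  # XOR with the same keystream inverts itself

# Steps 5-7 of the protocol: the Security Server distributes credentials.
session_id = int.from_bytes(os.urandom(2), "big")   # 16-bit session ID
session_key = os.urandom(16)                        # 128-bit session key

# VM side: append the session ID to the message, then encrypt.
msg = b"pixels->jpeg"
packet = encrypt(session_key, msg + session_id.to_bytes(2, "big"))

# HMM side: decrypt, then authenticate before forwarding to the accelerator.
plain = decrypt(session_key, packet)
recovered_msg = plain[:-2]
recovered_id = int.from_bytes(plain[-2:], "big")
forwarded = recovered_msg if recovered_id == session_id else None
```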

2.3.4.4 FPGA Area Overhead

To determine the area overhead of the HMM and the feasibility of domain isolation,
a prototype of the described architecture was implemented in [39, 44] for a cloud
configuration with a node that runs VMs. The cloud is set up on a Dell EMC R7415
server with a 2.09 GHz AMD Epyc 7251 CPU and 64 GB of memory. The
node runs CentOS-7 with kernel version 3.10.0. An Intel Stratix V FPGA
(5SGXMB5R1F40C1) is used as the testing device, and Intel Quartus Prime 18.1.0
Standard Edition is used to synthesize, place, and route hardware designs. The
FPGA is connected to the server through a PCIe Gen3 ×8 interface. QEMU 2.11.50
emulates the VMs, with each VM running Ubuntu 16.04.01 with 4 GB of RAM.

Table 2.1 Area overhead of the VR on FPGA

                             ALMs      ALUTs     Registers   M20Ks
  HMM                        2563.7    3101      1485        4
    — TRNG                   261       300       275         0
    — Decrypt                1213.2    1496      632         4
    — Encrypt                1089.5    1305      578         0
  Other controls             266.4     44        667         0
  Used                       2830      3145      2152        4
  Total resources on FPGA    185,000   185,000   740,000     2100
  Utilization (%)            1.53%     1.70%     0.29%       0.19%

[Fig. 2.11 Hamming distance between the random numbers generated by the HMM for 128-bit keys, comparing 5-, 9-, 13-, and 17-stage RO-based TRNGs over 25 key pairs]

The evaluation identifies the resource overhead of implementing the security
components on an FPGA. Table 2.1 shows the resource utilization of VRs without
considering hosted hardware accelerators. The VR and HMM use 2830 ALMs,
3145 ALUTs, 2152 registers, and 4 M20K memory units (∼1% of the FPGA
area). Therefore, integrating the HMM within VRs does not result in significant
resource overhead.

2.3.4.5 Security Assessment

To assess the security resulting from implementing domain isolation with the
secured communication protocol, an evaluation of the randomness of the generated
session IDs and session keys using the Hamming distance was performed in [12].
The Hamming distance quantifies the extent to which two bitstrings differ. A total of
50 session IDs and 50 session keys were generated, and the Hamming distance
between the 25 pairs of generated numbers was measured.
Figure 2.11 summarizes the results from implementing 5-stage through 17-stage
RO-based TRNGs with 30 ROs. Observe that the number of stages does not
significantly influence the difference between consecutive session keys. On average,
the Hamming distance between the generated 128-bit session keys is 63.45, which
means that between successive keys there may be up to ∼2^63 possible values,
making them hard to predict.
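The Hamming distance used in this evaluation is simply the number of differing bit positions; a sketch, with randomly generated sample keys rather than the measured ones:

```python
import os

def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bit positions between two equal-length strings."""
    assert len(a) == len(b)
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

k1, k2 = os.urandom(16), os.urandom(16)  # two sample 128-bit keys
dist = hamming(k1, k2)                   # ~64 expected for independent keys
```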

[Fig. 2.12 Hamming distance between the random numbers generated by the Security Server for 16-bit session IDs, over 25 pairs]

Table 2.2 Resource overhead of the JPEG accelerator with secured communication protocol

              ALM             ALUT            Registers       M20K       DSP
  Available   185,000         185,000         740,000         2100       399
  No HMM      18,202 (9.8%)   17,777 (9.6%)   32,765 (4.4%)   8 (0.4%)   399 (100%)
  With HMM    20,550 (11.1%)  21,032 (11.4%)  33,864 (4.5%)   12 (0.6%)  399 (100%)
  Overhead    12.9%           18.31%          3.35%           50%        0%

Similarly, Fig. 2.12 shows that successive 16-bit session IDs generated by the
Security Server vary by 8.64 bits on average, corresponding to more than 256
possibilities. Further, the NIST Statistical Test Suite [7] is used to study each of
the 50 strings of 144 bits (16-bit session ID + 128-bit session key). While only 11
of the 15 possible test scenarios could be run (the four other tests require wider data
widths), the calculated P-values, ranging from 0.110904 to 0.990904, indicate
that the generated numbers are sufficiently random. Distributing the generation of
the session IDs and session keys across two different entities (FPGA and Security
Server), both of which rely on unpredictable hardware variation, forms the root of
trust in the proposed domain isolation architecture. In addition, the communication
protocol relies on recurrent updates of session IDs and session keys at run-time.
A test case experiment was conducted to simulate breaches in the cloud system.
The experiment shows how the proposed architecture preserves a VM domain’s
confidentiality, integrity, and availability. A VR is programmed with a JPEG
encoder. It takes as input 24-bit values (8 bits for red, 8 bits for green, and 8 bits
for blue signals) and returns 32-bit JPEG streams. Table 2.2 compares the resource
utilization of the VR with and without the HMM.
The insertion of HMMs in the virtualization stack does not incur significant
resource overhead: LUT utilization went from 9.6 to 11.4% when using an
HMM. Two VMs (VM1 and VM2) are considered, and the VR running the JPEG
encoder is assigned to VM1. The simulated scenario uses a malicious VM and a
compromised VMM to attempt breaching VM1's domain. Figure 2.13 illustrates
the test scenario.
While VM2 and a malicious application in the VMM can access VM1's FPGA
address space, they do not have the correct session ID and session key. As a result,
the HMM accepts only the read and write requests from VM1. The requests from
VM2 and the VMM are discarded, and the Security Server is notified.

[Fig. 2.13 Illustration of the domain isolation: applications in VM1 and VM2, and a malicious app in the KVM host, issue requests carrying a session ID and session key; the HMM in front of the JPEG encoder in the VR forwards only the requests whose credentials match]

The isolation mechanism focuses on notifying the Security Server in case of an
unauthorized attempt to access the hardware accelerator. The cloud provider or
cloud administrator is then responsible for deciding the actions that follow any
breach attempt. In summary, the proposed security architecture preserves
confidentiality by encrypting data before any transfer; integrity, as VR data is
protected by the HMM from unauthorized changes; and availability, as the hardware
notification mechanism in the HMM allows the user to engage in actions pre-defined
by the cloud provider.

2.3.4.6 FPGA Configuration and Communication Overhead

In this section, we evaluate the quality of service (QoS) degradation resulting
from provisioning multi-tenant FPGAs in the cloud. As a general observation, IO
performance results presented in recent research differ from one another because
they are largely platform-specific. For instance, Fahmy et al. [20] reported an IO
latency of at most 16 ms, while Asiatici et al. [3] recorded 8-40 ms for similar
operations. To establish a fair comparison baseline, we start by assessing average IO
performance without the discussed security architecture. The baseline assessment
captures the performance properties of our prototyping environment. We
implement an FPGA management service in the hypervisor that exposes read and
write APIs to hardware accelerators on the FPGA. The FPGA management service
queues incoming IO requests following a first-come, first-served (FCFS) model.
Although there are multiple possible IO scheduling algorithms, such as round
robin and priority-based scheduling (shortest-job-first, fixed task priority, etc.) [57], we
implement the FCFS strategy for simplicity.
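The FCFS queueing in the management service can be sketched as follows (the request format and the single worker thread are illustrative assumptions):

```python
import queue
import threading

def fcfs_worker(requests, served):
    """Serve queued IO requests strictly in arrival order (FCFS)."""
    while True:
        item = requests.get()
        if item is None:          # sentinel: stop serving
            break
        served.append(item)       # stand-in for the actual FPGA read/write
        requests.task_done()

requests, served = queue.Queue(), []
worker = threading.Thread(target=fcfs_worker, args=(requests, served))
worker.start()
for req in [("VM1", "write"), ("VM2", "read"), ("VM1", "read")]:
    requests.put(req)
requests.put(None)
worker.join()
```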
The time it takes to write values into FPGA registers and read them back is a
“roundtrip.” We record the roundtrip time from the host (hardware access without a
virtualization layer) and run the same experiment from a VM accessing a cloud
FPGA that is not space-shared (single-tenant access). Figure 2.14 illustrates the roundtrip latencies recorded on 10 random roundtrip operations.

2 Domain Isolation and Access Control in Multi-tenant Cloud FPGAs 49

Fig. 2.14 Baseline IO trip study (r/w time in μs over 10 roundtrip operations; series: Host access, Single-tenant access)

On average, the host roundtrip takes about 34.6 μs, and the single-tenant access to the FPGA takes close to 94 μs.
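A roundtrip of this kind can be timed with a small harness. The sketch below mocks the register file with a Python dict, so the numbers it prints reflect only the harness itself, not FPGA hardware:

```python
import time

def roundtrip_us(write_fn, read_fn, addr, value):
    """Time one write-then-read-back 'roundtrip' in microseconds."""
    start = time.perf_counter()
    write_fn(addr, value)
    result = read_fn(addr)
    elapsed = (time.perf_counter() - start) * 1e6
    assert result == value, "read-back mismatch"
    return elapsed

regs = {}  # mock register file standing in for the FPGA interface
samples = [roundtrip_us(regs.__setitem__, regs.__getitem__, 0x0, i)
           for i in range(10)]
print(f"average roundtrip: {sum(samples) / len(samples):.2f} us")
```

In a real measurement, `write_fn` and `read_fn` would wrap the FPGA driver's register-access calls.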
To evaluate the configuration overhead introduced by the proposed architecture, the time needed to generate session IDs and session keys is assessed. The time consumed in programming hardware accelerators with partial bitstreams is not considered, as the proposed architecture does not modify FPGA vendor tools. The generation of a session ID in the Security Server takes about 10 ms. At the hardware level, the HMM generates a new session key in 1.84 ns. Finally, a roundtrip between the Security Server and the HMM (requesting a session key and collecting the result) takes, on average, ∼34 μs. Overall, the different configuration steps illustrated in Fig. 2.10 only introduce configuration overhead on the order of milliseconds (∼10 ms). After prototyping the FPGA architecture on the Stratix V device, the HMM achieves a maximum frequency of 542 MHz. The encryption and decryption steps consume 12 clock cycles (10 clock cycles for the ten rounds, one cycle for loading the key, and one cycle for returning the result). At the level of the HMM, incoming packets take 14 cycles to reach the accelerator or be discarded (12 cycles to decrypt, one cycle to extract the header, and one cycle to decide whether the packet will be accepted or not). Each outgoing traffic value requires 13 cycles (one cycle to insert the header, 12 cycles to encrypt) to be forwarded to VMs through the Cloud Interface. Hategekimana et al. [27] proposed an isolation approach that incurred three clock cycles of latency on an FPGA. However, their architecture does not ensure confidentiality, as data are not encrypted.
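As a quick sanity check, the cycle counts quoted above can be tallied and converted to latency at the reported 542 MHz; the arithmetic only restates the text's numbers:

```python
# Cycle counts for the HMM datapath, as reported in the text.
AES_ROUNDS = 10
crypt = AES_ROUNDS + 1 + 1   # ten rounds + load key + return result = 12
incoming = crypt + 1 + 1     # decrypt + extract header + accept/discard = 14
outgoing = 1 + crypt         # insert header + encrypt = 13

f_max_hz = 542e6             # reported maximum HMM frequency
print(incoming, outgoing)    # 14 13
print(f"incoming packet latency: {incoming / f_max_hz * 1e9:.1f} ns")  # ~25.8 ns
```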
In addition, roundtrip latencies were recorded when 25, 50, and 75 VMs
attempted to access a space-shared FPGA simultaneously. Though it may not be
practical to have 25, 50, or 75 VRs on a single device, this experiment evaluates
concurrent FPGA access time. For this purpose, each VR simply implements a
register that is written and read back by one of the VMs. Figure 2.15a, c, and e
present the recorded latencies. In each of the three test cases (25, 50, and 75
VMs), the roundtrip latency mostly lies between 0 and 1.2 ms, with some instances
around 200 and 400 ms. 77% of the IO operations were completed in less than
1 ms, 21% took about 200 ms, and 2% reached 400 ms. Since no packet loss was
experienced, the IO latencies observed are a combination of the virtualization layer running the VMs, the host operating system process scheduling, the FPGA driver implementation, the FPGA interface response time, and the FPGA management service processing time (read packet from user buffer, hardware call, read the result from FPGA memory, write the result to user buffer, etc.).

Fig. 2.15 IO and wait time evaluation on multi-tenant cloud FPGAs. (a) IO time with 25 VMs. (b) Wait time with 25 VMs. (c) IO time with 50 VMs. (d) Wait time with 50 VMs. (e) IO time with 75 VMs. (f) Wait time with 75 VMs

In general, FPGA multi-tenancy resulted in about 10× slower IO speed than that of the single-tenant baseline and a 29× drop compared to the average roundtrip time of the host. The average
wait time in the FPGA management service queue increases with the number of
VMs (Fig. 2.15b, d, and f). For example, the VM with the identifier 30 waits up to
8.8 ms (Fig. 2.15d), while the scheduler is busy with other requests.
Overall, these results are impacted by the first-come, first-served (FCFS) scheduling policy coupled with the host process management policy. In short, implementing FPGA multi-tenancy in the cloud may generally result in 10× slower IO operations compared to single-tenant deployments, which nevertheless remains on the order of a few microseconds. Common FPGA cloud applications submit jobs
and collect the results after execution. Therefore, it may not be typical to have
constant IO operations between the VM and FPGA accelerators, which mitigates
the overall performance degradation observed in these experiments.

2.4 Discussion

This chapter addressed the secure sharing of hardware accelerators from different
tenants in FPGA-based clouds that operate in the IaaS paradigm. We focused
on architectural and system-level integration, ensuring that domain separation
is enforced, even in the presence of vulnerabilities, while sharing FPGAs. The
segmentation of an FPGA to ensure that hardware tasks can be reached regardless of
their position is of utmost importance. We believe that a NoC-based communication
strategy offers the flexibility and performance needed to run all accelerators in
parallel. Resource adjustment can also be achieved at the granularity of hardware accelerators, thus extending cloud elasticity to shared FPGAs.
To ensure domain separation, access control is one path to extending the well-isolated user software domain into the hardware world. The FLASK architecture is presented here as a basic infrastructure that uses an (SMM, HMM) pair to enforce security rules at the hardware level. By design, this approach ensures confidentiality, integrity, and availability. Implementation of this architecture has demonstrated that the security infrastructure's overhead in area and latency is negligible.
The adoption of this domain isolation in the cloud will be possible if automation
is introduced in system design at the level of cloud operators and cloud users.
Approaches that can automatically integrate the isolation infrastructure are highly
desirable. Our work at the University of Florida has led to the setup of an FPGA-based cloud infrastructure for research on multi-tenancy. The cloud and services provided are reachable through the following link: https://smartsystems.ece.ufl.edu/research/projects/gatorrecc/. Design automation is currently the topic of investigation, and results will be made available to the community in a timely fashion.

Acknowledgment This work is partially funded by the National Science Foundation (NSF) under
Grant CNS 2007320.

References

1. Amazon (2017). Amazon EC2 F1 Instances.


2. ARM: TrustZone: SoC and CPU System-Wide Approach to Security. https://www.arm.com/
en/technologies/trustzone-for-cortex-a
3. Asiatici, M., George, N., Vipin, K., Fahmy, S. A., & Ienne, P. (2017). Virtualized execution
runtime for FPGA accelerators in the cloud. IEEE Access, 5, 1900–1910.
4. Azab, A. M., Ning, P., Shah, J., Chen, Q., Bhutkar, R., Ganesh, G., Ma, J., & Shen, W.
(2014). Hypervision across worlds: Real-time Kernel protection from the ARM TrustZone
secure world. In ACM Conference on Computer and Communications Security.
5. Babu, A., Hareesh, M., Martin, J. P., Cherian, S., & Sastri, Y. (2014). System performance
evaluation of para virtualization, container virtualization, and full virtualization using xen,
openvz, and xenserver. In 2014 Fourth International Conference on Advances in Computing
and Communications (pp. 247–250). IEEE.

6. Basak, A., Bhunia, S., & Ray, S. (2015). A flexible architecture for systematic implementation
of SoC security policies. In Proceedings of the IEEE/ACM International Conference on
Computer-Aided Design, ICCAD ’15 (pp. 536–543), Piscataway, NJ, USA: IEEE Press. http://
dl.acm.org/citation.cfm?id=2840819.2840894
7. Bassham III, L. E., Rukhin, A. L., Soto, J., Nechvatal, J. R., Smid, M. E., Barker, E. B., Leigh,
S. D., Levenson, M., Vangel, M., Banks, D. L., et al. (2010) SP 800-22 Rev. 1a. A statistical
test suite for random and pseudorandom number generators for cryptographic applications.
National Institute of Standards & Technology.
8. Baumann, A., Peinado, M., & Hunt, G. C. (2014). Shielding applications from an untrusted cloud with Haven. ACM Transactions on Computer Systems, 33(3), 8:1–8:26.
9. Bobda, C., Mbongue, J. M., Chow, P., Ewais, M., Tarafdar, N., Vega, J. C., Eguro, K., Koch,
D., Handagala, S., Leeser, M., et al. (2022) The future of FPGA acceleration in datacenters
and the cloud. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 15(3),
1–42.
10. Bobda, C., Mead, J., Whitaker, T. J. L., Kamhoua, C. A., & Kwiat, K. A. (2017). Hardware sandboxing: A novel defense paradigm against hardware trojans in systems on chip. In
Applied Reconfigurable Computing - 13th International Symposium, ARC 2017, Delft, The
Netherlands, April 3–7, 2017, Proceedings (pp. 47–59). https://doi.org/10.1007/978-3-319-
56258-2_5
11. Bobda, C., Whitaker, T. J. L., Kamhoua, C. A., Kwiat, K. A., & Njilla, L. (2017). Synthesis
of hardware sandboxes for trojan mitigation in systems on chip. In 2017 IEEE International
Symposium on Hardware Oriented Security and Trust, HOST 2017, McLean, VA, USA, May
1–5, 2017 (p. 172). https://doi.org/10.1109/HST.2017.7951836
12. Bookstein, A., Kulyukin, V. A., & Raita, T. (2002) Generalized hamming distance. Informa-
tion Retrieval, 5(4), 353–375.
13. Boule, M., & Zilic, Z. (2007). Efficient automata-based assertion-checker synthesis of SEREs
for hardware emulation. In 2007 Asia and South Pacific Design Automation Conference (pp.
324–329). https://doi.org/10.1109/ASPDAC.2007.358006
14. De Alfaro, L., & Henzinger, T. A. (2001) Interface automata. SIGSOFT Software Engineering
Notes, 26(5), 109–120. https://doi.org/10.1145/503271.503226. http://doi.acm.org/10.1145/
503271.503226
15. Dong, Y., Yang, X., Li, J., Liao, G., Tian, K., & Guan, H. (2012). High performance network
virtualization with SR-IOV. Journal of Parallel and Distributed Computing, 72(11), 1471–
1480.
16. Drzevitzky, S. (2010). Proof-carrying hardware: Runtime formal verification for secure
dynamic reconfiguration. In 2010 International Conference on Field Programmable Logic
and Applications pp. 255–258. https://doi.org/10.1109/FPL.2010.59
17. Drzevitzky, S., Kastens, U., & Platzner, M. (2009). Proof-carrying hardware: Towards runtime
verification of reconfigurable modules. In 2009 International Conference on Reconfigurable
Computing and FPGAs (pp. 189–194). https://doi.org/10.1109/ReConFig.2009.31
18. Elnaggar, R., Karri, R., & Chakrabarty, K. (2019). Multi-tenant FPGA-based reconfigurable
systems: Attacks and defenses. In 2019 Design, Automation & Test in Europe Conference &
Exhibition (DATE) (pp. 7–12). IEEE.
19. Emmi, M., & Giannakopoulou Dimitra, C. S. (2008). Assume-guarantee verification for
interface automata. In FM 2008: Formal Methods: 15th International Symposium on Formal
Methods, Turku, Finland, May 26–30, 2008 Proceedings 15. Berlin: Springer.
20. Fahmy, S. A., Vipin, K., & Shreejith, S. (2015). Virtualized FPGA accelerators for efficient
cloud computing. In 2015 IEEE 7th International Conference on Cloud Computing Technology
and Science (CloudCom) (pp. 430–435). IEEE.
21. Fraser, K., Hand, S., Neugebauer, R., Pratt, I., Warfield, A., & Williamson, M. (2004).
Reconstructing I/O. Technical Report, University of Cambridge, Computer Laboratory.
22. Giechaskiel, I., Rasmussen, K. B., & Eguro, K. (2018). Leaky wires: Information leakage
and covert communication between FPGA long wires. In Proceedings of the 2018 on Asia
Conference on Computer and Communications Security, ASIACCS ’18 (pp. 15–27). New York, NY, USA: ACM. https://doi.org/10.1145/3196494.3196518
23. Glazberg, Z., Moulin, M., Orni, A., Ruah, S., & Zarpas, E. (2007). PSL: Beyond hardware veri-
fication. In Next Generation Design and Verification Methodologies for Distributed Embedded
Control Systems: Proceedings of the GM R&D Workshop, Bangalore, India, January 2007.
Springer Netherlands.
24. Goguen, J. A., & Meseguer, J. (1982). Security policies and security models. In 1982 IEEE
Symposium on Security and Privacy (pp. 11–11). https://doi.org/10.1109/SP.1982.10014
25. Hategekimana, F., & Bobda, C. (2017). Towards the application of flask security architecture to
SOC design: Work-in-progress. In Proceedings of the Twelfth IEEE/ACM/IFIP International
Conference on Hardware/Software Codesign and System Synthesis Companion, CODES+ISSS
2017, Seoul, Republic of Korea, October 15–20, 2017 (pp. 12:1–12:2). https://doi.org/10.1145/
3125502.3125558
26. Hategekimana, F., Mbongue, J. M., Pantho, M. J. H., & Bobda, C. (2018). Inheriting software
security policies within hardware IP components. In 26th IEEE Annual International
Symposium on Field-Programmable Custom Computing Machines, FCCM 2018, Boulder, CO,
USA, April 29–May 1, 2018 (pp. 53–56). https://doi.org/10.1109/FCCM.2018.00017
27. Hategekimana, F., Mbongue, J. M., Pantho, M. J. H., & Bobda, C. (2018). Secure hardware
kernels execution in CPU+ FPGA heterogeneous cloud. In 2018 International Conference on
Field-Programmable Technology (FPT) (pp. 182–189). IEEE.
28. Hategekimana, F., Nardin, P., & Bobda, C. (2016). Hardware/software isolation and protection
architecture for transparent security enforcement in networked devices. In IEEE Computer
Society Annual Symposium on VLSI, ISVLSI 2016, Pittsburgh, PA, USA, July 11–13, 2016 (pp.
140–145). https://doi.org/10.1109/ISVLSI.2016.32
29. Hategekimana, F., Tbatou, A., Bobda, C., Kamhoua, C., & Kwiat, K. (2015). Hardware isolation technique for IRC-based botnets detection. In International Conference on ReConfigurable Computing and FPGAs (ReConFig). Cancun, Mexico: IEEE Computer Society.
30. Hategekimana, F., Whitaker, T., Pantho, M. J. H., & Bobda, C. (2017). Shielding non-trusted
IPs in SoCs. In 2017 27th International Conference on Field Programmable Logic and
Applications (FPL) (pp. 1–4). https://doi.org/10.23919/FPL.2017.8056848
31. Hategekimana, F., Whitaker, T. J. L., Pantho, M. J. H., & Bobda, C. (2017). Secure integration
of non-trusted IPs in SoCs. In 2017 Asian Hardware Oriented Security and Trust Symposium
(AsianHOST) (pp. 103–108). https://doi.org/10.1109/AsianHOST.2017.8354003
32. Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach.
Elsevier.
33. Huffmire, T., Brotherton, B., Sherwood, T., Kastner, R., Levin, T., Nguyen, T. D., & Irvine,
C. (2008). Managing security in FPGA-based embedded systems. IEEE Design Test of
Computers, 25(6), 590–598. https://doi.org/10.1109/MDT.2008.166
34. Huffmire, T., Brotherton, B., Wang, G., Sherwood, T., Kastner, R., Levin, T., Nguyen, T., &
Irvine, C. (2007). Moats and drawbridges: An isolation primitive for reconfigurable hardware
based systems. In 2007 IEEE Symposium on Security and Privacy (SP ’07) (pp. 281–295).
https://doi.org/10.1109/SP.2007.28
35. Huffmire, T., Sherwood, T., Kastner, R., & Levin, T. (2008). Enforcing memory policy
specifications in reconfigurable hardware. Computers & Security, 27(5–6), 197–215. https://
doi.org/10.1016/j.cose.2008.05.002. http://dx.doi.org/10.1016/j.cose.2008.05.002
36. Li, X., Kashyap, V., Oberg, J. K., Tiwari, M., Rajarathinam, V. R., Kastner, R., Sherwood, T.,
Hardekopf, B., & Chong, F. T. (2014). Sapper: A language for hardware-level security policy
enforcement. In Proceedings of the 19th International Conference on Architectural Support for
Programming Languages and Operating Systems, ASPLOS ’14 (pp. 97–112). New York, NY,
USA: ACM. https://doi.org/10.1145/2541940.2541947. http://doi.acm.org/10.1145/2541940.
2541947
37. Lind, J., Priebe, C., Muthukumaran, D., O’Keeffe, D., Aublin, P. L., Kelbert, F., Reiher, T.,
Goltzsche, D., Eyers, D., Kapitza, R., Fetzer, C., & Pietzuch, P. (2017). Glamdring: Automatic application partitioning for Intel SGX. In 2017 USENIX Annual Technical Conference
(USENIX ATC 17) (pp. 285–298). Santa Clara, CA, USA: USENIX Association. https://www.
usenix.org/conference/atc17/technical-sessions/presentation/lind
38. Loscocco, P., & Smalley, S. (2001). Integrating flexible support for security policies into
the linux operating system. In Proceedings of the FREENIX Track: 2001 USENIX Annual
Technical Conference (pp. 29–42). Berkeley, CA, USA: USENIX Association. http://dl.acm.
org/citation.cfm?id=647054.715771
39. Mandebi Mbongue, J., Saha, S. K., & Bobda, C. (2021). Domain Isolation in FPGA-accelerated
cloud and data center applications. In Proceedings of the 2021 on Great Lakes Symposium on
VLSI (pp. 283–288).
40. Mavrogiannopoulos, N. (2021). Understanding the Red Hat Enterprise Linux random number generator interface. February 11, 2021. https://www.redhat.com/en/blog/understanding-red-hat-enterprise-linux-random-number-generator-interface
41. Mbongue, J. M., Hategekimana, F., Kwadjo, D. T., Andrews, D., & Bobda, C. (2018).
FPGAVirt: A novel virtualization framework for FPGAs in the cloud. In 11th IEEE
International Conference on Cloud Computing, CLOUD 2018, San Francisco, CA, USA, July
2–7, 2018 (pp. 862–865). https://doi.org/10.1109/CLOUD.2018.00122
42. Mbongue, J. M., Kwadjo, D. T., & Bobda, C. (2018). FLexiTASK: A flexible FPGA overlay
for efficient multitasking. In Proceedings of the 2018 on Great Lakes Symposium on VLSI,
GLSVLSI 2018, Chicago, IL, USA, May 23–25, 2018 (pp. 483–486). https://doi.org/10.1145/
3194554.3194644. http://doi.acm.org/10.1145/3194554.3194644
43. Mbongue, J. M., Kwadjo, D. T., Shuping, A., & Bobda, C. (2021). Deploying multi-
tenant FPGAs within linux-based cloud infrastructure. ACM Transactions on Reconfigurable
Technology and Systems (TRETS), 15(2), 1–31.
44. Mbongue, J. M., Saha, S. K., & Bobda, C. (2021). A security architecture for domain isolation
in multi-tenant cloud FPGAs. In 2021 IEEE Computer Society Annual Symposium on VLSI
(ISVLSI) (pp. 290–295). IEEE.
45. Mbongue, J. M., Shuping, A., Bhowmik, P., & Bobda, C. (2020). Architecture support for
FPGA multi-tenancy in the cloud. In 2020 IEEE 31st International Conference on Application-
Specific Systems, Architectures and Processors (ASAP) (pp. 125–132). IEEE.
46. Mead, J., Bobda, C., & Whitaker, T. J. L. (2016). Defeating drone jamming with hardware
sandboxing. In 2016 IEEE Asian Hardware-Oriented Security and Trust, AsianHOST 2016,
Yilan, Taiwan, December 19–20, 2016 (pp. 1–6). https://doi.org/10.1109/AsianHOST.2016.
7835557
47. Mell, P., Grance, T., et al. (2011). The NIST Definition of Cloud Computing, Special Publication
(NIST SP). National Institute of Standards and Technology, Gaithersburg, MD, USA.
48. Metzner, M., Lizarraga, J., & Bobda, C. (2015). Architecture virtualization for run-time
hardware multithreading on field programmable gate arrays. In Applied Reconfigurable
Computing - 11th International Symposium, ARC 2015, Bochum, Germany, April 13–17, 2015,
Proceedings (pp. 167–178). https://doi.org/10.1007/978-3-319-16214-0_14. http://dx.doi.org/
10.1007/978-3-319-16214-0_14
49. Nelson, M., Lim, B. H., Hutchins, G., et al. (2005). Fast transparent migration for virtual
machines. In USENIX Annual Technical Conference, General Track (pp. 391–394).
50. Peeters, E. (2015). SoC security architecture: Current practices and emerging needs. In
Proceedings of the 52Nd Annual Design Automation Conference, DAC ’15 (pp. 144:1–144:6).
New York, NY, USA: ACM. https://doi.org/10.1145/2744769.2747943. http://doi.acm.org/
10.1145/2744769.2747943
51. Putnam, A., Caulfield, A., Chung, E., Chiou, D., Constantinides, K., Demme, J., Esmaeilzadeh,
H., Fowers, J., Gopal, G., Gray, J., Haselman, M., Hauck, S., Heil, S., Hormati, A., Kim, J. Y.,
Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J., Xiao, P., & Burger, D. (2014).
A reconfigurable fabric for accelerating large-scale datacenter services. In 2014 ACM/IEEE
41st International Symposium on Computer Architecture (ISCA) (pp. 13–24). https://doi.org/
10.1109/ISCA.2014.6853195

52. Ramesh, C., Patil, S. B., Dhanuskodi, S. N., Provelengios, G., Pillement, S., Holcomb, D., & Tessier, R. (2018). FPGA side channel attacks without physical access. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).
53. Ray, S., & Jin, Y. (2015). Security policy enforcement in modern SoC designs. In Proceedings
of the IEEE/ACM International Conference on Computer-Aided Design, ICCAD ’15 (pp. 345–
350). Piscataway, NJ, USA: IEEE Press. http://dl.acm.org/citation.cfm?id=2840819.2840868
54. Sabt, M., Achemlal, M., & Bouabdallah, A. (2015). Trusted execution environment: What it is,
and what it is not. In Trustcom/BigDataSE/ISPA, 2015 IEEE (Vol. 1, pp. 57–64). https://doi.
org/10.1109/Trustcom.2015.357
55. Saeed, A., Ahmadinia, A., Just, M., & Bobda, C. (2014). An ID and address protection unit for
NoC based communication architectures. In Proceedings of the 7th International Conference
on Security of Information and Networks, SIN ’14 (pp. 288:288–288:294). New York, NY,
USA: ACM. https://doi.org/10.1145/2659651.2659719. http://doi.acm.org/10.1145/2659651.
2659719
56. Saha, S. K., & Bobda, C. (2020). FPGA accelerated embedded system security through
hardware isolation. In 2020 Asian Hardware Oriented Security and Trust Symposium
(AsianHOST) (pp. 1–6). IEEE.
57. Salot, P. (2013). A survey of various scheduling algorithm in cloud computing environment.
International Journal of Research in Engineering and Technology, 2(2), 131–135.
58. Costan, V., & Devadas, S. (2016). Intel SGX Explained. Cryptology ePrint Archive, Report 2016/086.
59. Wiersema, T., Drzevitzky, S., & Platzner, M. (2014). Memory security in reconfigurable
computers: Combining formal verification with monitoring. In 2014 International Conference
on Field-Programmable Technology (FPT) (pp. 167–174). https://doi.org/10.1109/FPT.2014.
7082771
60. Xilinx (2014). TrustZone Technology Support in Zynq-7000 All Programmable SoCs.
61. Yee, B., Sehr, D., Dardyk, G., Chen, J. B., Muth, R., Ormandy, T., Okasaka, S., Narula, N., &
Fullagar, N. (2009). Native client: A sandbox for portable, untrusted x86 native code. In 2009
30th IEEE Symposium on Security and Privacy (pp. 79–93). https://doi.org/10.1109/SP.2009.
25
62. Zhang, B., Wang, X., Lai, R., Yang, L., Luo, Y., Li, X., & Wang, Z. (2010). A survey on I/O
virtualization and optimization. In 2010 Fifth Annual ChinaGrid Conference (ChinaGrid) (pp.
117–123). IEEE.
63. Zhang, F., Liu, G., Fu, X., & Yahyapour, R. (2018). A survey on virtual machine migration:
Challenges, techniques, and open issues. IEEE Communications Surveys & Tutorials, 20(2),
1206–1243.
64. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In 2018 IEEE
Symposium on Security and Privacy (SP) (Vol. 00, pp. 839–854). https://doi.org/10.1109/SP.
2018.00049. http://doi.org/doi.ieeecomputersociety.org/10.1109/SP.2018.00049
Chapter 3
Efficient and Secure Encryption for FPGAs in the Cloud

Subhadeep Banik and Francesco Regazzoni

3.1 Introduction

In the past few years, lightweight cryptography has become a popular research
discipline with a number of block and stream ciphers and hash functions being
proposed [5, 15, 17, 25]. Block ciphers and stream ciphers cater to the cryptographic
operation of encryption. While encryption provides message confidentiality, it does not address message integrity, i.e., the situation in which an adversary tampers with the ciphertext, potentially causing the receiver to recover an incorrect plaintext. A message authentication code (MAC) [42], sometimes
known as a tag, is a short piece of information used to authenticate a message,
to confirm that the message came from the stated sender (its authenticity) and
that it has not been changed. The MAC value protects both a message’s data
integrity and its authenticity, by allowing verifiers (who also possess the secret
key) to detect any changes to the message content. Authenticated Encryption
(AE) [42] or Authenticated Encryption with Associated Data (AEAD) [42] is a
form of encryption that simultaneously provides data confidentiality, integrity, and
authenticity assurances.
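As a simplified illustration of a MAC, the sketch below computes and verifies a tag with HMAC-SHA-256 from Python's standard library; this particular construction is an example choice, not a scheme discussed in this chapter:

```python
import hashlib
import hmac

key = b"shared-secret-key"   # known only to the sender and the verifier
message = b"transfer 100 credits to Bob"

# The sender computes the tag and attaches it to the message.
tag = hmac.new(key, message, hashlib.sha256).digest()

def verify(key, message, tag):
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)  # constant-time comparison

assert verify(key, message, tag)             # genuine message accepted
assert not verify(key, message + b"!", tag)  # any tampering is detected
```

An AEAD scheme combines this integrity check with encryption of the message in a single primitive.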

S. Banik
Università della Svizzera italiana, Lugano, Switzerland
e-mail: subhadeep.banik@usi.ch

F. Regazzoni ()
University of Amsterdam, Amsterdam, The Netherlands
Università della Svizzera italiana, Lugano, Switzerland
e-mail: f.regazzoni@uva.nl; regazzoni@alari.ch

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 57
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing Environments, https://doi.org/10.1007/978-3-031-45395-3_3
58 S. Banik and F. Regazzoni

In contrast with a symmetric key cryptosystem in which the same secret key is
used to encrypt and decrypt plaintexts and ciphertexts, asymmetric cryptosystems,
also called public-key cryptosystems (PKC), use encryption and decryption keys
that are not the same [42]. The encryption key (also called the public key) allows
anyone to encrypt a message. The encrypted message can only be decrypted by
a party in possession of the corresponding decryption key (also called the secret
key). A real-life analog is a locked mailbox in which anyone can deliver letters, but
the letter can only be read by the person who has the key to the mailbox. Public-
key cryptosystems such as RSA [51] derive their security from the factorization of
sufficiently large integers. There have been many attempts to factor integers; for an
early and contemporary history of factoring, see [45, 58]. Although factorization
is a known difficult problem on classical computers, it can be solved easily, i.e., in
polynomial time, using quantum computers running Shor's algorithm [55]. This has led to the establishment of the NIST standardization process for post-quantum cryptography [47], with the goal of designing algorithms that remain secure against attacks performed on quantum computers.
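A toy, textbook-RSA example makes the public/secret key asymmetry concrete; the primes here are absurdly small and the scheme is unpadded, so this is illustrative only, never usable in practice:

```python
# Textbook RSA with toy primes -- insecure, for illustration only.
p, q = 61, 53
n = p * q                 # 3233; security rests on the hardness of factoring n
phi = (p - 1) * (q - 1)   # 3120
e = 17                    # public exponent, coprime with phi
d = pow(e, -1, phi)       # secret exponent: the inverse of e modulo phi

m = 65                    # message encoded as an integer smaller than n
c = pow(m, e, n)          # anyone holding the public key (n, e) can encrypt
assert pow(c, d, n) == m  # only the holder of d recovers the plaintext
```

Shor's algorithm breaks this scheme precisely because recovering p and q from n yields phi, and hence d.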
Hardware platforms can be broadly classified into Application-Specific Inte-
grated Circuits (ASICs) and Field-Programmable Gate Arrays (FPGAs). An ASIC
is an integrated circuit (IC) chip customized for a particular use: as such the
designer can maximize the efficiency of silicon resources to construct circuits that
perform a specific task at very high speeds and consume little silicon area. These
chips are typically fabricated using complementary metal–oxide–semiconductor
(CMOS) technology, and the design cycle from conception to fabrication is a
lengthy and expensive one. FPGAs are more generic in the sense that they consist of
programmable logic blocks and interconnects that allow the same FPGA to be used
in many different applications. Typically, one encodes the circuit as an FPGA image
using software tools and uploads the image on to the device to obtain a circuit with
the required functionality. Furthermore, there are publicly available FPGA cloud
servers, such as the Amazon AWS EC2 [2], which can be accessed remotely from
any corner of the world. Deploying circuits on FPGAs is much cheaper and requires a much shorter conception-to-deployment window.
Efficient cryptographic primitives are needed to help protect communication to
and from FPGAs, and, when needed, data generated or processed by an FPGA. The
large body of research addressing efficient implementation of cryptographic algo-
rithms on standalone FPGAs serves as a base for the development and deployment
of efficient cloud FPGA cryptography. To this end, we examine implementation
strategies for cryptographic algorithms implemented on FPGA devices. As a typical
FPGA device can accommodate a large number of logic gates, the implementation
metric we target is throughput. By comparing various cryptographic algorithms
on the same FPGA device, we can come to a conclusion about the algorithms’
respective merits when implemented on FPGAs.
3 Efficient and Secure Encryption for FPGAs in the Cloud 59

3.2 Cryptographic Primitives: Block Ciphers

Encryption is the process of encoding a message or information so that only authorized parties can access it. In an encryption scheme, the intended information
or message, referred to as plaintext, is encrypted using an algorithm—a cipher—
generating ciphertext that can only be read if decrypted. Block ciphers and stream
ciphers are two of the primitives that achieve this objective. A block cipher is
a keyed permutation function. It is a deterministic algorithm operating on fixed-
length groups of bits, called a block, with an unvarying transformation that is
specified by a symmetric key (which is another bit string of short length). The
basic component of a block cipher is an easy-to-compute permutation called a
round function (sometimes simply referred to as a round). The encryption operation
involves repeated application of the round function for a specified number of
iterations.
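The iteration structure can be sketched generically; the round function below is a toy (XOR with the round key, then an 8-bit rotation), invented for illustration and not a real cipher:

```python
def encrypt(plaintext, round_keys, round_fn):
    """Generic block-cipher skeleton: apply the round function once per round key."""
    state = plaintext
    for rk in round_keys:
        state = round_fn(state, rk)
    return state

def toy_round(state, rk):
    # Toy 8-bit round (NOT secure): XOR the round key, then rotate left by 3.
    state ^= rk
    return ((state << 3) | (state >> 5)) & 0xFF

ct = encrypt(0xB2, round_keys=[0x1A, 0xC3, 0x5F], round_fn=toy_round)
print(hex(ct))  # 0x5b
```

Real ciphers differ in the round function and key schedule, but share this repeat-a-permutation skeleton.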
The AES-128 block cipher [22] is the de facto encryption standard worldwide,
having been recommended by the United States National Institute of Standards and
Technology in 2001. Apart from this, it is the most preferred encryption standard for
a number of private sectors such as banking and e-commerce. It is a part of several
Internet protocols (HTTPS, FTPS, SFTP, WebDAVS, OFTP, and AS2). Since the
design of AES-128 was finalized, many block ciphers with lightweight properties
have been proposed. Among them, Present [15] is well-studied with respect to its
security and implementation. The cipher was standardized in the ISO/IEC 29192
“Lightweight Cryptography” standardization process.
While the above ciphers have mostly targeted optimization of hardware area, other block ciphers aim to optimize other lightweight design metrics. The block cipher Prince [17] was designed for low-latency applications such as memory encryption, latency being the total delay incurred in computing an operation. The only block cipher targeting energy optimization thus far has been Midori [5].

3.2.1 Architectures

Both block and stream ciphers consist of similar transformations that are applied
repetitively on public and private inputs to produce the output stream, see Fig. 3.1.
In the case of block ciphers, the public input is the plaintext, the private input is the
secret key, and the output is the encrypted plaintext, also called the ciphertext. As
a result, there are many flavors of block cipher circuits implemented on hardware
platforms:
• Round-based circuits: These are circuits in which each round function is
executed in one clock cycle. The circuit architecture is equipped with logic
gates that execute the function, followed by a register on which the intermediate
outputs of the round function computation are written. If the block or stream cipher specification calls for R executions of the round function, then the circuit requires exactly R clock cycles to execute the encryption operation.

Fig. 3.1 Commonly, a block or stream cipher consists of the repeated application of a publicly known round function (labels: Public Input, Secret Key, RF1 RF2 RF3 … RFr, r iterations, Output)
• Multiple round–unrolled circuits: This is a simple extension of the above philosophy: instead of one unit, r < R round function units connected serially are used in the circuit architecture. These circuits execute r round function operations sequentially in one clock cycle and thus require only ⌈R/r⌉ clock cycles. Because of the higher hardware footprint, such circuits consume more power but take fewer clock cycles to execute the encryption operation. Some circuits, e.g., for r = 2, are known to be energy-optimal for specific block ciphers on ASIC platforms [6].
• Fully round–unrolled circuits: This takes unrolling to the extreme, i.e., r = R, so that only a single clock cycle is required to execute encryption.

• Serialized circuits: These circuits reduce the hardware footprint by employing
a smaller number of logic gates, i.e., only a fraction of the entire round
function circuit. As a result, the circuit takes multiple clock cycles to compute
one round function. For example, the AES-128 circuit in [38] has only one S-box
circuit, whereas the AES round function requires 16 S-boxes; the substitution
layer alone therefore takes 16 clock cycles per round, roughly 160 cycles over
the ten rounds of a complete encryption.
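The cycle counts implied by these architecture styles can be illustrated with a short back-of-the-envelope Python sketch (the serialized count only accounts for S-box evaluations and ignores key-schedule and linear-layer cycles):

```python
import math

def cycles_per_encryption(R, style, r=1, sboxes_per_round=16):
    """Clock cycles needed to encrypt one block under each architecture style."""
    if style == "round":        # one round function per clock cycle
        return R
    if style == "unrolled":     # r round functions evaluated per clock cycle
        return math.ceil(R / r)
    if style == "serialized":   # a single S-box circuit, reused every cycle
        return R * sboxes_per_round
    raise ValueError(f"unknown style: {style}")

# AES-128 has 10 rounds and 16 S-box applications per round:
print(cycles_per_encryption(10, "round"))           # 10 cycles
print(cycles_per_encryption(10, "unrolled", r=2))   # 5 cycles
print(cycles_per_encryption(10, "unrolled", r=10))  # 1 cycle (fully unrolled)
print(cycles_per_encryption(10, "serialized"))      # 160 cycles
```

The model makes the area/latency tradeoff explicit: unrolling divides the cycle count, while serialization multiplies it.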

3.2.1.1 High-Throughput Implementations

On FPGAs, where hardware area is not as critical as in extremely constrained
devices, throughput is a primary optimization target. The architectures best
suited for such applications are fully unrolled. In this section, we compare the
performance of five block ciphers on FPGA platforms. To make a fair throughput
evaluation, we consider block ciphers of different flavors, with both algebraically
simple and complicated round functions. The remainder of this section highlights
their characteristics:
• AES 128: The Advanced Encryption Standard [22] is arguably the most popular
and widely studied block cipher. It has a simple substitution permutation network
3 Efficient and Secure Encryption for FPGAs in the Cloud 61

(SPN) type round function that supports 128-bit plaintexts and 128-bit, 192-bit,
or 256-bit keys. The round function includes a substitution layer, which is
generally non-linear; in AES, it consists of 16 applications of an S-box function
S : {0,1}^8 → {0,1}^8. It is well known that the AES S-box is affine equivalent
to the inverse function in GF(2^8). The permutation network is generally a linear
function that mixes the state bits among one another. The AES permutation
layer consists of the ShiftRows and MixColumn operations. Since the block cipher
state can be interpreted as a 4 × 4 array of bytes, the ShiftRows operation
rotates the i-th row of the state by i bytes. The MixColumn operation multiplies
each column of the state by an MDS matrix over GF(2^8). This is followed by an
AddRoundKey operation in which a 128-bit RoundKey is XORed to the state. The
RoundKey for each round is produced from a KeySchedule operation performed over
the AES secret key.
• Present: Present [15] is a 64-bit block cipher with an SPN type round
function. It has been adopted as a standard in ISO/IEC 29192-2. The cipher
specification allows for both 80-bit and 128-bit keys, although we only focus
on the 80-bit version in this work. The only non-linear component in the round
function is a 4-bit S-box (i.e., a map {0,1}^4 → {0,1}^4), which is applied in
parallel to each of the sixteen nibbles of the 64-bit state after the RoundKey
addition. Thereafter, the state bits are rearranged by a permutation layer (in
hardware this is achieved at zero cost in energy and gate area by the simple
crossing of wires).
• Prince: Prince [17] is a 64-bit block cipher with an SPN type round function.
It allows for a 128-bit key but does not use any key-scheduling logic. Prince is
based on the FX construction: the 128-bit key is divided into the most and least
significant 8-byte blocks k_0, k_1, and a key k' is computed from k_0 by a simple
rotate-and-XOR operation. k_0 and k' are used as whitening keys, and k_1 is used as
the RoundKey in every round. The cipher uses three types of round functions:
Forward, Middle, and Inverse. The Forward round consists of SubBytes and
MixColumn operations and the addition of a round constant and RoundKey.
The Middle round consists of SubBytes, MixColumn, and Inverse SubBytes
layers. The Inverse rounds are structurally and functionally the opposite of the
Forward round. As a result, the Prince encryption operation is an involution.
The cipher minimizes encryption latency and finds use in applications such as
memory encryption.
• Midori: Midori [5] is an SPN-based block cipher designed for energy efficiency.
The specification supports both 64-bit and 128-bit plaintexts and a 128-bit key.
The 64-bit version has been the subject of invariant subspace attacks [31];
however, the 128-bit version remains secure and consumes less energy than
competing block ciphers. In this chapter, we focus on Midori-128.
• Gift-128: Gift [9] was proposed as a redesign of the popular Present block cipher.
The idea was to design a cipher that would be efficient on both hardware and
software platforms and yet offer a high degree of security. The cipher is of SPN
type and, like Present, uses a bit permutation as the linear layer, so as to minimize

Table 3.1 Synthesis results for block ciphers targeted to an Artix 7 xc7a200t device

Design       # LUTs   # Slices   Latency (ns)   f_max (MHz)   TP_max (Gbps)
AES-128      11,977   3387       129.341        7.7           0.92
Present      2222     801        66.222         15.1          1.80
Prince       1263     505        45.631         21.9          2.61
Midori-128   3840     1454       62.939         15.9          1.89
Gift-128     3836     1303       66.193         15.1          1.80

Note: f_max is generated from the post-place-and-route simulation

the hardware footprint. The design supports both 64- and 128-bit plaintexts, and
we focus on the 128-bit version here.
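The substitution–permutation structure shared by Present and Gift can be made concrete with a short Python sketch of one Present round (the S-box table and the bit-permutation rule 16·i mod 63 follow the Present specification; the key schedule is omitted, so this is a structural illustration rather than a complete cipher):

```python
# PRESENT 4-bit S-box, from the cipher specification
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def s_layer(state: int) -> int:
    # apply the S-box in parallel to each of the sixteen nibbles of the 64-bit state
    out = 0
    for i in range(16):
        out |= SBOX[(state >> (4 * i)) & 0xF] << (4 * i)
    return out

def p_layer(state: int) -> int:
    # bit i moves to position (16 * i) mod 63; bit 63 stays in place.
    # In hardware this is pure wiring, at zero gate and energy cost.
    out = 0
    for i in range(64):
        dst = 63 if i == 63 else (16 * i) % 63
        out |= ((state >> i) & 1) << dst
    return out

def present_round(state: int, round_key: int) -> int:
    # one round: RoundKey addition, substitution layer, permutation layer
    return p_layer(s_layer(state ^ round_key))
```

Because the permutation is a fixed rewiring of bits, the map 16·i mod 63 is a bijection on the bit positions, which the sketch makes easy to check.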

3.2.2 Implementation Results of Different Block Ciphers

For the experiments presented in this section, our target platform was the xc7a200t
Xilinx device from the Artix 7 family. The following design flow was used: the
design was first implemented in register-transfer-level (RTL) code. A functional
simulation was then performed using Mentor Graphics ModelSim SE software. The
design was then synthesized, mapped, placed, and routed using Xilinx Vivado
version 2021.2. Results are reported in Table 3.1.

3.3 Cryptographic Primitives: Stream Ciphers

A stream cipher takes a secret key K, which is usually a small binary string
of around 80–256 bits, as input and applies a set of rules to produce a long
sequence of pseudorandom bits, bytes, or words (the keystream). The sequence is
pseudorandom in the sense that it cannot be distinguished from a truly random
sequence in practical time.
This sequence of bits, bytes, or words is usually XORed with each bit, byte, or
word of the plaintext to produce the encrypted ciphertext. So, if P = p_0, p_1, p_2, ...
represents the bits, bytes, or words of the plaintext, and κ = k_0, k_1, k_2, ... represents
the keystream bits, bytes, or words produced by the stream cipher using the secret
key K, then the encryption rule is given by

c_i = p_i ⊕ k_i, ∀i,

where C = c_0, c_1, ... represents the ciphertext bits, bytes, or words. Since the secret
key is already known to the receiver, the receiver can compute the keystream bits
k_0, k_1, ..., which are then used to decrypt the ciphertext as follows:

p_i = c_i ⊕ k_i, ∀i.
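The symmetry of the XOR rule, i.e., that encryption and decryption are the same operation, can be demonstrated with a short Python sketch. The keystream generator here is a placeholder linear congruential stream, not a secure cipher:

```python
def xor_stream(data: bytes, keystream) -> bytes:
    # c_i = p_i XOR k_i; the same call decrypts, since XOR is an involution
    return bytes(d ^ next(keystream) for d in data)

def toy_keystream(key: int):
    # Stand-in for a real stream cipher such as Trivium or Grain;
    # an LCG is NOT cryptographically secure and is used only for illustration.
    state = key
    while True:
        state = (1103515245 * state + 12345) % 2**31
        yield (state >> 16) & 0xFF

plaintext = b"attack at dawn"
ct = xor_stream(plaintext, toy_keystream(0x5EC2E7))
pt = xor_stream(ct, toy_keystream(0x5EC2E7))  # same key -> same keystream
assert pt == plaintext
```

Both parties regenerate the identical keystream from the shared key, so a single routine serves for both directions.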

Stream ciphers can be viewed as approximating the action of a provably unbreakable
cipher, the one-time pad (OTP). A one-time pad uses a keystream of completely
random digits, i.e., the keystream must be chosen uniformly at random from the
set of keystream digits: each element of the set must have equal probability
of being chosen. The keystream is combined with the plaintext digits one at a
time to form the ciphertext. The digits may be elements of any arbitrary set,
e.g., the English alphabet (for which keystream digit and plaintext digit are
added modulo 26 to produce the ciphertext), but for most practical purposes, as
in the case of stream ciphers, the digit is a bit or byte and the encryption
operation is simply a bitwise XOR of the plaintext and keystream.
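The modulo-26 variant over the English alphabet can be sketched as follows (a toy illustration; the `secrets` module supplies the uniform randomness the OTP requires, and the pad must never be reused):

```python
import secrets
import string

ALPHABET = string.ascii_uppercase  # the digits are letters A-Z

def otp_encrypt(plaintext: str, pad: str) -> str:
    # cipher digit = (plaintext digit + key digit) mod 26
    return "".join(
        ALPHABET[(ALPHABET.index(p) + ALPHABET.index(k)) % 26]
        for p, k in zip(plaintext, pad)
    )

def otp_decrypt(ciphertext: str, pad: str) -> str:
    # plaintext digit = (cipher digit - key digit) mod 26
    return "".join(
        ALPHABET[(ALPHABET.index(c) - ALPHABET.index(k)) % 26]
        for c, k in zip(ciphertext, pad)
    )

msg = "HELLOWORLD"
pad = "".join(secrets.choice(ALPHABET) for _ in msg)  # uniform, used once
assert otp_decrypt(otp_encrypt(msg, pad), pad) == msg
```

With a truly uniform pad, every ciphertext of a given length is equally likely, which is exactly the perfect-secrecy property discussed next.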
One-time pads are "information-theoretically secure" in that the encrypted
message, i.e., the ciphertext, provides no information about the original message to
a cryptanalyst (except the maximum possible length of the message). This very
strong notion of security was developed during World War II by Claude Shannon,
who proved mathematically, using information theory considerations, that the
one-time pad has perfect secrecy [54]; that is, the ciphertext gives absolutely
no additional information about the plaintext.
The eSTREAM project, organized by the EU ECRYPT network [27], involved
the design of new stream ciphers suitable for widespread adoption. The call for
primitives was first issued in November 2004, and the project was completed in
April 2008. It was divided into separate phases, with the goal of finding
algorithms suitable for different application profiles.

3.3.1 eSTREAM Project

The eSTREAM portfolio ciphers fall into two profiles. Profile 1 stream ciphers
are particularly suitable for hardware applications with restricted resources such
as limited storage, gate count, and power consumption. Profile 2 contains stream
ciphers more suitable for software applications with high throughput requirements.
The portfolio [21] currently contains the following ciphers (Table 3.2):

Table 3.2 The eSTREAM portfolio

Profile 1 (HW)    Profile 2 (SW)
Grain v1 [33]     Salsa20 [13]
MICKEY 2.0 [4]    SOSEMANUK [11]
Trivium [18]      HC128 [59]
                  Rabbit [14]

Fig. 3.2 Structure of Trivium, with register taps at positions 66, 69, 93 (first register), 162, 171,
177 (second register), and 243, 264, 288 (third register). The AND gates s91 · s92, s175 · s176,
s286 · s287 are added to the leftmost XOR gates before the 3rd, 1st, and 2nd registers, respectively.
The registers have been omitted for ease of depiction. The keystream bit produced every clock
cycle is given as z = t1 + t2 + t3

• Trivium: Trivium [23] is a stream cipher designed for the eSTREAM project by
De Cannière and Preneel and is currently an ISO standard under ISO/IEC 29192-
3:2012. Trivium has an internal state of 288 bits that is divided into 3 registers of
sizes 93, 84, and 111 bits, respectively, see Fig. 3.2. The stream cipher uses an
80-bit key and an 80-bit initialization vector (IV) to initialize the state. The state
is then updated for 4 × 288 = 1152 iterations, without producing output, using
the very simple-to-implement update function shown partially in Fig. 3.2.
• Grain 128: The Grain family of stream ciphers is one of the candidates in
the eSTREAM hardware portfolio [34]. Its simplicity and elegance in design
have attracted considerable attention from cryptologists worldwide. The family
consists of three ciphers: Grain v1, Grain 128, and Grain 128a. In this chapter,
we focus on Grain 128 [32], which offers 128-bit security.
Like the other members of the Grain family, Grain 128 has a connected
register structure as shown in Fig. 3.3. Grain-128 consists of a 128-bit linear-
feedback shift register (LFSR) and a 128-bit non-linear-feedback shift register
(NFSR) and uses a 128-bit key K. Given that Y_t = [l_t, l_{t+1}, ..., l_{t+127}] is the
LFSR state at the t-th clock interval, Grain-128's LFSR is defined by the update
function f given by

f(Y_t) = l_{t+96} + l_{t+81} + l_{t+70} + l_{t+38} + l_{t+7} + l_t.

The NFSR state is updated as n_{t+128} = l_t + g(·) for the NFSR update function g,
which is given by

Fig. 3.3 Structure of stream ciphers in the Grain family: an NFSR with update function g(X_t) and
an LFSR with update function f(Y_t) feed the output function h(X_t, Y_t), which produces the
keystream bit z_t

g(X_t) = n_{t+96} + n_{t+91} + n_{t+56} + n_{t+26} + n_t + n_{t+3}n_{t+67} + n_{t+11}n_{t+13} +
         n_{t+17}n_{t+18} + n_{t+27}n_{t+59} + n_{t+40}n_{t+48} + n_{t+61}n_{t+65} + n_{t+68}n_{t+84}.

The output function is of the form

z_t = h'(X_t, Y_t) = Σ_{a∈A} n_{t+a} + h(s_0, ..., s_8) + l_{t+93},

where A = {2, 15, 36, 45, 64, 73, 89}, h(s_0, ..., s_8) = s_0s_1 + s_2s_3 + s_4s_5 + s_6s_7 +
s_0s_4s_8, and (s_0, ..., s_8) = (n_{t+12}, l_{t+8}, l_{t+13}, l_{t+20}, n_{t+95}, l_{t+42}, l_{t+60}, l_{t+79},
l_{t+95}). The cipher is initialized with a 128-bit key and a 96-bit IV; 256 clocks
of initialization are executed before entering the keystream phase.
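As a concrete example of these register-based designs, a bit-serial software model of Trivium, written directly from the register sizes and tap positions above, is sketched below. The key/IV byte-to-bit ordering is an assumption (conventions differ between implementations), so this sketch illustrates the structure rather than reproducing official test vectors:

```python
def trivium(key: bytes, iv: bytes, nbits: int):
    """Bit-serial Trivium model: one state update per produced keystream bit.

    key, iv: 80-bit values (10 bytes each). MSB-first bit loading is assumed.
    """
    s = [0] * 289  # s[1..288]; s[0] unused to match 1-based tap positions
    kbits = [(key[i // 8] >> (7 - i % 8)) & 1 for i in range(80)]
    ivbits = [(iv[i // 8] >> (7 - i % 8)) & 1 for i in range(80)]
    s[1:81] = kbits          # register 1: key, rest zero
    s[94:174] = ivbits       # register 2: IV, rest zero
    s[286] = s[287] = s[288] = 1  # last three bits of register 3 set to 1

    def clock():
        t1 = s[66] ^ s[93]
        t2 = s[162] ^ s[177]
        t3 = s[243] ^ s[288]
        z = t1 ^ t2 ^ t3                       # keystream bit
        t1 ^= (s[91] & s[92]) ^ s[171]
        t2 ^= (s[175] & s[176]) ^ s[264]
        t3 ^= (s[286] & s[287]) ^ s[69]
        s[2:94] = s[1:93]; s[1] = t3           # register 1 (93 bits)
        s[95:178] = s[94:177]; s[94] = t1      # register 2 (84 bits)
        s[179:289] = s[178:288]; s[178] = t2   # register 3 (111 bits)
        return z

    for _ in range(4 * 288):  # 1152 initialization clocks, output discarded
        clock()
    return [clock() for _ in range(nbits)]

first_bits = trivium(bytes(10), bytes(10), 8)  # first 8 bits for all-zero key/IV
```

An unrolled hardware implementation simply evaluates r copies of the `clock` body combinationally per cycle, which is exactly the knob r explored in Table 3.3.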

3.3.2 Implementation Results of Different Stream Ciphers

Our target platform was again the xc7a200t Xilinx device from the Artix 7 family;
results are reported in Table 3.3. Since stream ciphers, once initialized,
continuously produce keystream every clock cycle, we can experiment with
different numbers of unrolled rounds r for each stream cipher. Higher values of r
increase device utilization but also deliver higher throughput.

3.4 Cryptographic Primitives: Authenticated Encryption

Authenticated Encryption (AE) or Authenticated Encryption with Associated Data
(AEAD) is a form of encryption that simultaneously provides data confidentiality,
integrity, and authenticity assurances [42]. The need for AE emerged from the
observation that securely combining a confidentiality mode with an authentication
mode can be error prone and difficult. This was confirmed by a number of practical

Table 3.3 Synthesis results targeted to an Artix 7 xc7a200t device

Design      r     # LUTs   # Slices   # FFs   Latency (ns)   f_max (MHz)   TP_max (Gbps)
Trivium     36    364      110        297     2.656          376.5         12.62
            72    462      168        296     2.740          365.0         24.47
            144   786      344        295     3.687          271.2         36.37
            288   1481     597        291     6.398          156.3         41.92
Grain-128   32    551      245        260     4.109          243.4         7.25
            64    986      418        260     5.817          171.9         10.25
            128   1815     782        310     8.820          113.4         13.52
            256   4216     1550       401     15.731         63.6          15.52

Note: f_max is generated from the post-place-and-route simulation

attacks introduced into protocols and applications by incorrect implementation, or
by a lack of authentication (including in SSL/TLS) [1, 12, 57]. A typical
programming interface for an AE mode implementation provides the following
functions: (a) encryption, which takes as input a plaintext, a key, and optionally a
header that will not be encrypted but will be covered by integrity protection, and
produces as output a ciphertext and an authentication tag (MAC); and (b)
decryption, which takes as input a ciphertext, a key, an authentication tag, and
optionally a header, and outputs a plaintext, or an error if the authentication tag
does not match the supplied ciphertext or header.
We now present the current state of the art in symmetric cryptographic schemes
available in the literature.
The proliferation of low-resource devices and their security requirements spurred
the NIST Lightweight Cryptography competition [43], which commenced in 2018 and
entered a final round with ten competing candidate designs in 2021. NIST recently
announced ASCON 128 [25] as the winner, ending almost five years of competition.
We report a comparison between ASCON 128, the selected algorithm, and two
schemes that reached the final round of the NIST lightweight competition [43].
We selected GIFT-COFB and ROMULUS for comparison since they directly
use lightweight block ciphers or variants of them; more precisely, they are
instantiated with the Gift [9] and Skinny [10] block ciphers, respectively. ASCON
128, GIFT-COFB, and ROMULUS are further compared with AES-GCM.

3.4.1 AES-GCM

Galois/Counter Mode (GCM) [26] is an authenticated encryption algorithm
designed to provide both data authenticity (integrity) and confidentiality. GCM
is defined for block ciphers with a block size of 128 bits. Galois Message
Authentication Code (GMAC) is an authentication-only variant of GCM that
can form an incremental message authentication code. GCM is proven secure

in a concrete security model. It is secure when it is used with a block cipher that
is indistinguishable from a random permutation; however, security depends on
choosing a unique initialization vector for every encryption performed with the
same key (see stream cipher attack). For any given key and initialization vector
combination, GCM is limited to encrypting 2^39 − 256 bits of plaintext (64 GiB).
GCM combines the well-known counter mode of encryption with the new Galois
mode of authentication [26]. The key feature is the ease of parallel computation of
the Galois field multiplication used for authentication. This feature permits higher
throughput than encryption algorithms, like CBC, that use chaining modes. The
field GF(2^128) used is defined by the polynomial x^128 + x^7 + x^2 + x + 1.
The authentication tag is constructed by feeding blocks of data into the GHASH
function and encrypting the result. The GHASH function is defined by

GHASH(H, A, C) = X_{m+n+1},

where H = E_K(0^128) is the hash key (a string of 128 zero bits encrypted using
the block cipher), A is the data that are only authenticated (not encrypted), C is the
ciphertext, m is the number of 128-bit blocks in A (rounded up), n is the number of
128-bit blocks in C (rounded up), and the variables X_i for i = 0, ..., m + n + 1 are
defined below.
First, the authenticated text and the ciphertext are separately zero-padded to
multiples of 128 bits and combined into a single message S_i:

        ⎧ A_i                for i = 1, ..., m − 1
        ⎪ A*_m ‖ 0^{128−v}    for i = m
S_i  =  ⎨ C_{i−m}            for i = m + 1, ..., m + n − 1
        ⎪ C*_n ‖ 0^{128−u}    for i = m + n
        ⎩ len(A) ‖ len(C)    for i = m + n + 1

where len(A) and len(C) are the 64-bit representations of the bit lengths of A and
C, respectively, v = len(A) mod 128 is the bit length of the final block of A, u =
len(C) mod 128 is the bit length of the final block of C, and ‖ denotes concatenation
of bit strings. Then X_i is defined as

X_i = Σ_{j=1}^{i} S_j · H^{i−j+1} = { 0                   for i = 0
                                    { (X_{i−1} ⊕ S_i) · H  otherwise

The second form is an efficient iterative algorithm (each X_i depends on X_{i−1})
produced by applying Horner's method to the first. Only the final X_{m+n+1} is
retained as the output.
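The Horner iteration, together with a straightforward (unoptimized) GF(2^128) multiplier, can be sketched in Python. The helper names `gf128_mul` and `ghash` are illustrative, and polynomials are stored with bit i as the coefficient of x^i, whereas the GCM standard uses the reflected bit order; the sketch therefore shows the algebra rather than byte-exact GCM behavior:

```python
# R encodes p(x) = x^128 + x^7 + x^2 + x + 1
R = (1 << 128) | 0x87

def gf128_mul(a: int, b: int) -> int:
    # schoolbook shift-and-add product, then reduction modulo p(x)
    acc = 0
    for i in range(128):
        if (b >> i) & 1:
            acc ^= a << i
    for i in range(254, 127, -1):   # clear coefficients above degree 127
        if (acc >> i) & 1:
            acc ^= R << (i - 128)
    return acc

def ghash(h: int, blocks) -> int:
    # Horner iteration: X_i = (X_{i-1} xor S_i) * H
    x = 0
    for s in blocks:
        x = gf128_mul(x ^ s, h)
    return x
```

A hardware implementation parallelizes exactly this multiply-accumulate loop; the Karatsuba circuit discussed next computes the product term in a single cycle.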

3.4.2 Finite Field Multiplication

The most critical operation in GCM is multiplication in the finite field GF(2^128).
The multiplier uses the irreducible polynomial p(x) = x^128 + x^7 + x^2 + x + 1
to compute C = AB mod p(x). In [44], several implementation options for such a
multiplier are proposed, including bit-parallel, digit-serial, and hybrid multipliers.
Bit-parallel multipliers use multiplication by x as the fundamental circuit of
computation and replicate it 128 times for the complete operation. Digit-serial
multipliers take this idea forward by making multiplication by x^m the basic unit.
Hybrid multipliers redefine the original finite field GF(2^k) as GF((2^m)^n), where
k = mn. Arithmetic calculations can then be performed using circuits in the subfield
GF(2^m) and combined in the extension field GF((2^m)^n).

3.4.2.1 Karatsuba Multiplier

The above architectures take more than one clock cycle to compute the multiplication
result. Since our core encryption algorithm operates in a single clock cycle,
we propose an architecture that computes the multiplication in a single cycle.
Let A(x), B(x) be two polynomials of degree 2k − 1. The Karatsuba method of
multiplying them requires that we first split both polynomials into two polynomials
of degree k − 1 as follows:

A(x) = a_L(x) + x^k a_H(x),    B(x) = b_L(x) + x^k b_H(x).

The multiplication operation requires the following logic operations over k-bit
polynomials:
1. Compute S = (a_L ⊕ a_H) · (b_L ⊕ b_H).
2. Compute L = a_L · b_L and H = a_H · b_H.
3. Compute M = S ⊕ L ⊕ H.
It can be seen that A(x) · B(x) = x^{2k} H ⊕ x^k M ⊕ L. Thus the original 2k-bit
multiplier requires three k-bit multipliers plus some gates performing linear
operations. One can therefore recursively define multiplication over 128-bit
polynomials as multiplication over 64-bit polynomials, which in turn can be
defined as multiplication over 32-bit polynomials, and so on. The base case is
multiplication over 2-bit (i.e., degree 1) polynomials, which can be constructed
as a look-up table {0,1}^4 → {0,1}^4 that takes the 4-bit coefficients of the two
2-bit polynomials and produces the 4-bit coefficients of the product.


The result of the above logic circuit is a polynomial of degree 254, i.e., with 255
coefficient bits. We now need to perform the modulo-p(x) operation to reduce it to
128 coefficient bits. However, this is a purely linear operation and needs only a few
XOR gates, depending on the structure of p(x).
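A recursive software model of the Karatsuba decomposition (polynomials held as Python ints, bit i being the coefficient of x^i) can be sketched and checked against schoolbook carry-less multiplication:

```python
def clmul(a: int, b: int) -> int:
    # carry-less (GF(2)[x]) schoolbook multiplication, used as a reference
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc

def karatsuba(a: int, b: int, k: int) -> int:
    """Multiply two 2k-bit GF(2)[x] polynomials (k a power of two)."""
    if k <= 1:
        return clmul(a, b)  # base case: the 2-bit x 2-bit look-up table
    mask = (1 << k) - 1
    a_lo, a_hi = a & mask, a >> k
    b_lo, b_hi = b & mask, b >> k
    lo = karatsuba(a_lo, b_lo, k // 2)                  # L = aL * bL
    hi = karatsuba(a_hi, b_hi, k // 2)                  # H = aH * bH
    mid = karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, k // 2)   # S = (aL^aH)(bL^bH)
    m = mid ^ lo ^ hi                                   # M = S ^ L ^ H
    return (hi << (2 * k)) ^ (m << k) ^ lo              # x^{2k}H ^ x^k M ^ L
```

For the 128-bit GCM operands, `karatsuba(a, b, 64)` produces the unreduced degree-254 product, to which the linear modulo-p(x) reduction is then applied.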

3.4.2.2 AES Architectures

We evaluate a number of different architectures for the components of the AES
block cipher. First, we experiment with three different architectures of the S-box:
small, tradeoff, and LUT. These three were proposed in [41]: "small" refers
to the smallest S-box circuit described in the literature, whereas "tradeoff" is the
circuit that provides a balance between latency and circuit area. "LUT" refers to
a simple look-up table style of implementation, which the synthesizer optimizes.
Second, we further explore two different styles, fast and tiny, for the MixColumns
circuit. "Tiny" refers to the smallest implementation of the circuit (92 gates)
proposed in [39]. "Fast" refers to the 103-gate implementation in [8] which,
despite being larger, has the shortest gate depth. We further compare these
implementations with T-Table-based implementations, which are known to be
extremely fast on FPGA platforms [30].
T-tables, which are a set of look-up tables that combine the S-box and MixCol-
umn operations, are often used for fast software implementations. The AES T-tables
are a set of four look-up tables T_0, T_1, T_2, T_3 : {0,1}^8 → {0,1}^32, with

T_0(x) = [2 · S(x), 1 · S(x), 1 · S(x), 3 · S(x)]
T_1(x) = [3 · S(x), 2 · S(x), 1 · S(x), 1 · S(x)]
T_2(x) = [1 · S(x), 3 · S(x), 2 · S(x), 1 · S(x)]
T_3(x) = [1 · S(x), 1 · S(x), 3 · S(x), 2 · S(x)]

Here S(x) denotes the S-box mapping. If x_0, x_1, x_2, x_3 are the four bytes in a
column after the ShiftRows operation, it can be seen that ⊕_{i=0}^{3} T_i(x_i) is the
AES round function output for that column (before the round key addition). Since
the circuit combines the S-box and part of the MixColumn operation, it is natural
to implement it so as to minimize circuit depth and, hence, the critical path.
However, the downside of this approach is the large silicon area required to
construct four 8-to-32-bit tables.
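The T-tables can be generated directly from finite field arithmetic; the sketch below derives the S-box from inversion in GF(2^8) plus the AES affine map rather than hard-coding the table, and builds T_0 and T_1 from it (helper names are illustrative):

```python
def gf_mul(a: int, b: int) -> int:
    # multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1
    acc = 0
    for _ in range(8):
        if b & 1:
            acc ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return acc

def sbox(x: int) -> int:
    # inversion in GF(2^8) (0 maps to 0), i.e., x^254, followed by the affine map
    inv = x
    for _ in range(253):
        inv = gf_mul(inv, x)
    rotl = lambda v, r: ((v << r) | (v >> (8 - r))) & 0xFF
    return inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63

S = [sbox(x) for x in range(256)]
# T0(x) = [2*S(x), 1*S(x), 1*S(x), 3*S(x)], packed big-endian
T0 = [(gf_mul(2, s) << 24) | (s << 16) | (s << 8) | gf_mul(3, s) for s in S]
# T1..T3 are byte-wise rotations of T0; T1 shown here
T1 = [((t >> 8) | ((t & 0xFF) << 24)) & 0xFFFFFFFF for t in T0]
```

The byte-rotation relation between the tables is what lets software implementations store a single table and rotate, trading a little time for a quarter of the memory; the FPGA variant instead pays the full area for all four tables to flatten the critical path.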

3.4.2.3 AES-GCM Circuit Architecture

The AES-GCM circuit is shown in Fig. 3.4. In the first clock cycle, the hash key
H = E_K(0^128) is computed and stored in the H register. Thereafter, every 128-
bit block of plaintext and associated data is processed in one clock cycle to produce
ciphertext. Simultaneously, the MAC is computed with a Horner-like computation
using H and an auxiliary register, via the single-cycle finite field multiplier.
Thus processing n blocks of data takes n + 1 clock cycles.

Fig. 3.4 AES-GCM circuit: an AES core in counter mode produces the ciphertext, while the
GHASH datapath (H register, single-cycle multiplier, and X_{i−1} feedback) accumulates the tag
over the blocks S_i

Table 3.4 Synthesis results targeted to an Artix 7 xc7a200t device

S-box      MixColumns   # LUTs   # FFs   # Slices   Latency (ns)   f_max (MHz)   TP_max (Gbps)
Small      Tiny         22,955   308     8775       168.092        175.8         21.98
Tradeoff   Tiny         29,687   301     10,942     178.925        175.2         21.90
LUT        Tiny         14,626   304     5034       73.893         175.9         21.99
Small      Fast         23,794   300     9831       163.204        175.8         21.98
Tradeoff   Fast         23,945   302     9694       166.131        175.5         21.94
LUT        Fast         14,624   302     4788       74.203         175.9         21.98
T-Table                 20,204   300     6614       87.922         175.9         21.98

Note: f_max is generated from the post-place-and-route simulation

3.4.2.4 AES-GCM Implementation Results

The results in Table 3.4 were obtained after the designs were synthesized, mapped,
placed, and routed for an xc7a200t device.
The results show that for most architectures the total critical path is around 5.7
ns, which provides a throughput of about 22 Gbps. The GCM algorithm may be
further parallelized by a factor of k by replicating circuit resources. Such
replication allows throughput well above 100 Gbps, depending on the type of
application.

3.4.3 GIFT-COFB

GIFT-COFB [7] is a lightweight AEAD candidate and was a submission to
the recently closed NIST lightweight cryptography standardization process, where
it reached the final round of the competition. The design processes 128-bit
blocks with a key and nonce of the same size and has a small register footprint,
requiring only a single additional 64-bit register. A bit permutation and a finite
field multiplication with different constants are also performed. Note that unlike
GCM, multiplication in GIFT-COFB involves only a few constant field elements.
As such, this circuit is completely linear and can be efficiently implemented in
hardware using simple XOR gates.

Fig. 3.5 GIFT-COFB mode of operation: the nonce is first encrypted with E_K; each associated
data block AD_i and message block M_i is then folded into the state via the function G and a
further E_K call, while the LFSR is multiplied by 2x/3x at every step, producing the ciphertext
blocks CT_i and finally the tag

Mathematically, GIFT-COFB is a block-cipher-based authenticated encryption
mode that uses GIFT-128 as the underlying block cipher with a 128-bit key
and state. The construction adheres to the COmbined FeedBack (COFB) mode of
operation [19], which provides a processing rate of 1, i.e., a single block cipher
invocation per input data block. The mode only adds an additional 64-bit LFSR
state L [initialized as the 64 most significant bits (MSBs) of E_K(Nonce)] to
the existing block cipher registers and thus ranks among the most lightweight
AEAD algorithms.
In this mode, encryption is interspersed with three operations: executing the
operation G on the state, updating the LFSR L, and adding plaintext/associated
data to the state, as shown in Fig. 3.5. The G operation is given by
G(X_0, X_1) = (X_1, X_0 ≪ 1), where X_0, X_1 are the upper and lower 64-bit blocks
of a 128-bit word. The register L is initialized with the 64 MSBs of E_K(Nonce)
and updated by finite field multiplication over GF(2^64) by the constant 2^x 3^y,
where

x = 1 if |A| mod n = 0 and A ≠ ε, and x = 2 otherwise;
y = 1 if |M| mod n = 0 and M ≠ ε, and y = 2 otherwise.
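A small Python model of the mode's linear components is sketched below. The interpretation of ≪ as a 1-bit left rotation and the GF(2^64) reduction polynomial x^64 + x^4 + x^3 + x + 1 are assumptions drawn from common COFB descriptions; check them against the GIFT-COFB specification before relying on this sketch:

```python
MASK64 = (1 << 64) - 1

def G(x0: int, x1: int):
    # G(X0, X1) = (X1, X0 <<< 1): swap halves, rotate the upper half left by 1
    return x1, ((x0 << 1) | (x0 >> 63)) & MASK64

def double_L(l: int) -> int:
    # multiply L by the constant 2 in GF(2^64); reduction polynomial
    # x^64 + x^4 + x^3 + x + 1 (low bits 0x1B) is an assumption here
    l <<= 1
    if l >> 64:
        l = (l ^ 0x1B) & MASK64
    return l

def triple_L(l: int) -> int:
    return double_L(l) ^ l  # 3 = 2 xor 1 in a binary field

def update_L(l: int, x: int, y: int) -> int:
    # L <- L * 2^x * 3^y, with x, y in {1, 2} as defined above
    for _ in range(x):
        l = double_L(l)
    for _ in range(y):
        l = triple_L(l)
    return l
```

Because doubling is just a shift plus a conditional constant XOR, the whole L update costs only a handful of XOR gates in hardware, as the text notes.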

The GIFT-128 block cipher [9] was designed in 2017. It has 40 rounds, each
consisting of a substitution layer composed of 4-bit S-boxes, and uses a bit
permutation over 128 bits as the linear layer. It is efficient on both software and
hardware platforms.

3.4.3.1 GIFT-COFB Circuit Architecture

The GIFT-COFB circuit is shown in Fig. 3.6. In the first clock cycle, the LFSR
L is updated with the top half of E_K(Nonce). Thereafter, every 128-bit block of
plaintext/associated data is processed in one clock cycle to produce ciphertext.
Simultaneously, the LFSR is updated using finite field computations. After the
plaintext and

Fig. 3.6 GIFT-COFB circuit: a GIFT-128 core with its state register, the L register updated by the
2^x 3^y multiplier, and the G function on the feedback path combining the plaintext/AD into the
ciphertext

associated data (AD) have been processed, the mode uses one additional encryption
call to produce the MAC. Thus, the processing of n blocks of data takes only n + 2
clock cycles.

3.4.4 ROMULUS

ROMULUS is an AEAD scheme designed by Iwata et al. [37] and uses the SKINNY
family of block ciphers. In this chapter, we provide Romulus-N1 implementations.
Romulus-N1 makes 1/2 block cipher call per associated data block and 1
block cipher call per message block. It uses a 128-bit key, a 128-bit nonce, and a
variable-length message chopped into 128-bit blocks, and produces a 128-bit tag.
Each output of the block cipher and the incoming data block (associated data or
message) are passed through a light combinatorial function denoted by ρ. The
function ρ(S, M) = (S', C) is defined as S' ← S ⊕ M and C ← G(S) ⊕ M. For
each byte, G performs the following operation: G(x_7‖x_6‖x_5‖x_4‖x_3‖x_2‖x_1‖x_0) :=
(x_0 ⊕ x_7)‖x_7‖x_6‖x_5‖x_4‖x_3‖x_2‖x_1. The output of this function is immediately
input to the next block cipher call. Hence, a register keeps this running state, and at
the last step, it is encrypted to produce the tag.
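The byte-wise G and the feedback function ρ translate directly into Python; the inverse direction used during decryption is included to show that M is recoverable from (S, C):

```python
def g_byte(x: int) -> int:
    # G(x7..x0) = (x0 xor x7) || x7 || x6 || ... || x1 on a single byte
    return ((x >> 1) | (((x ^ (x >> 7)) & 1) << 7)) & 0xFF

def G(s: bytes) -> bytes:
    return bytes(g_byte(b) for b in s)

def rho(s: bytes, m: bytes):
    # rho(S, M) = (S', C) with S' = S xor M and C = G(S) xor M
    s_next = bytes(a ^ b for a, b in zip(s, m))
    c = bytes(a ^ b for a, b in zip(G(s), m))
    return s_next, c

def rho_inv(s: bytes, c: bytes):
    # decryption direction: M = G(S) xor C, then S' = S xor M
    m = bytes(a ^ b for a, b in zip(G(s), c))
    s_next = bytes(a ^ b for a, b in zip(s, m))
    return s_next, m
```

Since ρ is built entirely from XORs and a one-bit byte rotation, its hardware cost is negligible compared to the SKINNY calls around it.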
Romulus handles odd and even authenticated data blocks differently: the odd
blocks are input to ρ, and the even blocks are fed to the nonce port of the block
cipher, as the underlying cipher SKINNY-128-384 has a 384-bit-long TWEAKEY. The
actual AEAD nonce is not used until all authenticated data blocks are processed,
and is later used as a block cipher nonce while message blocks are encrypted. A
56-bit LFSR is also part of the TWEAKEY for the SKINNY calls and keeps count
of the authenticated data and message blocks fed into the AEAD circuit from the
beginning of the AE operation. Figure 3.7 depicts the two phases of the full AEAD
operation, namely the processing of (1) associated data and (2) message blocks.

Fig. 3.7 The high-level view of Romulus-N1, which depicts the processing of 2a associated data
blocks and m message blocks in two phases. L denotes the 56-bit LFSR that counts the number of
processed blocks, and d denotes a single-byte domain separator followed by 0^64

Fig. 3.8 ROMULUS-N1 circuit: a SKINNY-128-384 core with its state register and the ρ function
combining the plaintext/AD into the ciphertext, with the L register supplying part of the tweak

3.4.4.1 ROMULUS-N1 Circuit Architecture

The circuit is shown in Fig. 3.8. In addition to the tweakable block cipher, there is
an LFSR L that supplies a part of the tweak. After all the plaintext/AD blocks have
been processed, the mode uses one additional encryption call to produce the MAC.
Thus processing n blocks of data takes only n + 1 clock cycles.

3.4.5 ASCON 128

ASCON 128 has been declared the winner of the NIST lightweight cryptography
competition. It is a permutation-based AEAD: the core cryptographic primitive
used in the design is a permutation function over 320 bits rather than a block cipher.
The ASCON permutation uses a state of 320 bits (consisting of five 64-bit
words x_0, x_1, x_2, x_3, x_4) that is updated in four phases: initialization, processing
of associated data, processing of plaintext/ciphertext, and finalization.

Fig. 3.9 ASCON 128 circuit: the state register, initialized with IV‖K‖N, is iterated through the
ASCON permutation p^6; AD/plaintext blocks are absorbed and ciphertext is squeezed out, with
the constants 0^*‖1 and K‖0^* injected between the phases, finally producing the tag

Table 3.5 Synthesis results targeted to an Artix 7 xc7a200t device

Design       # LUTs   # FFs   # Slices   Latency (ns)   f_max (MHz)   TP_max (Gbps)
GIFT-COFB    5931     205     1791       44.422         22.5          2.68
Romulus-N1   45,953   454     15,342     180.206        5.5           0.66
AES-GCM      20,204   300     6614       87.922         11.4          1.36
ASCON 128    2644     327     708        22.133         45.2          2.69

Note: f_max is generated from the post-place-and-route simulation

All phases use the same permutation function p, which is applied 12 times in the
initialization and finalization phases and six times in the data processing phases.
The data, i.e., both the plaintext and AD, are handled in 64-bit blocks. The
processing of the optional associated data takes place after the initialization phase.
In the encryption phase, each plaintext block P_i is XORed with the secret state to
produce one ciphertext block C_i. After the generation of the ciphertext, the
finalization phase starts. The output of finalization is a 128-bit tag.

3.4.5.1 ASCON 128 Circuit Architecture

The circuit is shown in Fig. 3.9. The core circuit is the ASCON permutation p^6,
i.e., the round function p iterated six times. Hence, initialization and finalization
take two cycles each, and processing each 64-bit block of plaintext or AD takes
one cycle. Thus processing n blocks of 128-bit data takes only 2n + 2 + 2 = 2n + 4
clock cycles.
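A bit-sliced Python model of one permutation round (constant addition, the 5-bit S-box layer applied across all 64 columns, and the linear diffusion layer) is sketched below. The rotation amounts, S-box sequence, and round constants are transcribed from the publicly documented ASCON round and should be verified against the official specification before reuse:

```python
M64 = (1 << 64) - 1

def ror(v, r):
    # rotate a 64-bit word right by r positions
    return ((v >> r) | (v << (64 - r))) & M64

def sbox_layer(s):
    # 64 parallel applications of the 5-bit ASCON S-box, bit-sliced across words
    x0, x1, x2, x3, x4 = s
    x0 ^= x4; x4 ^= x3; x2 ^= x1
    t0 = ~x0 & x1; t1 = ~x1 & x2; t2 = ~x2 & x3; t3 = ~x3 & x4; t4 = ~x4 & x0
    x0 ^= t1; x1 ^= t2; x2 ^= t3; x3 ^= t4; x4 ^= t0
    x1 ^= x0; x0 ^= x4; x3 ^= x2; x2 = ~x2 & M64
    return [x0, x1, x2, x3, x4]

def linear_layer(s):
    x0, x1, x2, x3, x4 = s
    return [x0 ^ ror(x0, 19) ^ ror(x0, 28),
            x1 ^ ror(x1, 61) ^ ror(x1, 39),
            x2 ^ ror(x2, 1) ^ ror(x2, 6),
            x3 ^ ror(x3, 10) ^ ror(x3, 17),
            x4 ^ ror(x4, 7) ^ ror(x4, 41)]

ROUND_CONSTANTS = [0xF0, 0xE1, 0xD2, 0xC3, 0xB4, 0xA5,
                   0x96, 0x87, 0x78, 0x69, 0x5A, 0x4B]

def p(state, rounds=6):
    # p^6 as used in the data-processing phase; p^12 uses all twelve constants
    s = list(state)
    for rc in ROUND_CONSTANTS[12 - rounds:]:
        s[2] ^= rc
        s = linear_layer(sbox_layer(s))
    return s
```

The hardware circuit in Fig. 3.9 simply instantiates six copies of this round back to back, which is why one permutation call fits in a single clock cycle.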

3.4.5.2 Implementation Results of Different Authenticated Encryption Algorithms

Table 3.5 compares synthesis results for the lightweight schemes with AES-GCM.
It is seen that ASCON 128 and GIFT-COFB have a throughput advantage over the
other schemes. For ASCON 128 in particular, the most latency-intensive part of the
circuit is the permutation, which requires only six rounds to be implemented
consecutively; this keeps latency low and elevates the maximum operable
frequency, and hence the throughput.

3.5 Post-Quantum Cryptography

The last family of cryptographic primitives that we describe is the post-quantum
cryptographic (PQC) algorithms. These algorithms use cryptographic primitives
meant to run on classical computers but designed to withstand the computational
power of quantum computers. Progress in the study of quantum computers will
eventually make it possible to run algorithms such as Shor's algorithm [56]
and solve the problems on which the security of current asymmetric cryptography
is based (e.g., integer factorization and the discrete logarithm). Symmetric
cryptography must also be resistant to the increased computational power that
quantum computers will bring: for instance, because of Grover's algorithm [29],
larger key sizes are needed (e.g., a change from 128-bit to 256-bit keys for AES).
Because of this threat, governmental bodies have started initiatives to standardize
post-quantum algorithms. The most relevant initiative has been carried out by
NIST starting in 2017 [47]. Researchers from all over the world could submit
standardization proposals. Several families of mathematical problems have been
explored, including lattice-based cryptography, isogeny-based cryptography, and
code-based cryptography. In addition to mathematical security, these algorithms
must be suitable for deployment in systems and applications, including applications
based on cloud FPGAs. This section summarizes the lessons learned in deploying
PQC algorithms on FPGAs in the last few years.
A large number of NIST submissions were based on lattice problems, such as
learning with errors (LWE) [48] and ring learning with errors (R-LWE) [40]. The
advantage of R-LWE over standard lattices is that the lattice matrix is generated
from a single row. Because of this, complex matrix multiplications are replaced
by polynomial multiplications. When implementing schemes based on lattices,
the sampler and the polynomial multiplication functions form a bottleneck. The
sampling step can use binomial or uniform distributions, but the majority of the
designs use discrete Gaussian distributions. Several structures have been proposed
for the implementation of discrete Gaussian samplers, and the most notable ones are
rejection sampling, Bernoulli sampling, cumulative distribution (CDT) sampling,
discrete Ziggurat sampling, and Knuth–Yao sampling [36].
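To make the sampler discussion concrete, the following sketch implements the CDT approach for a small discrete Gaussian. All parameters are toy values chosen for illustration, not taken from any specific scheme; real designs use fixed-point probability tables and constant-time table lookups.

```python
import math
import random

# CDT (cumulative distribution table) sampling of a discrete Gaussian:
# precompute cumulative probabilities for |x| = 0..TAIL, then sample by
# drawing a uniform value and scanning (in hardware: comparing) the table.
SIGMA, TAIL = 3.2, 20
_weights = [math.exp(-(x * x) / (2 * SIGMA * SIGMA)) for x in range(TAIL + 1)]
_total = _weights[0] + 2 * sum(_weights[1:])     # symmetric support [-TAIL, TAIL]

CDT = []
_acc = 0.0
for _x in range(TAIL + 1):
    _acc += _weights[_x] if _x == 0 else 2 * _weights[_x]
    CDT.append(_acc / _total)

def sample_gaussian() -> int:
    u = random.random()
    mag = next((i for i, c in enumerate(CDT) if u <= c), TAIL)  # table lookup
    if mag == 0:
        return 0
    return mag if random.random() < 0.5 else -mag               # random sign
```

The table scan shown here is data-dependent; a constant-time hardware realization compares the drawn value against every table entry in parallel.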
Often, multiplication performance is improved by using the number theoretic
transform (NTT), which, after a conversion into the spectral domain, reduces complex
polynomial multiplication to a point-wise multiplication. Several optimized hard-
ware implementations of NTT are available, including ones specifically designed
for FPGAs [3, 46, 52]. A common feature of NTT hardware architectures is a
butterfly structure, which is often implemented as a dedicated unit [49].

S. Banik and F. Regazzoni

In FPGA designs, the processing logic that implements the core of the butterfly
computations is interleaved with Block RAMs (BRAMs)¹ that are used to store NTT
polynomial coefficients. The architecture is implemented sequentially when area is a
constraint [46]. The balance between pre-computed twiddle factors and on-the-fly
computation is an important tradeoff; examples of both designs have been
reported [3, 46, 53].
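The point-wise multiplication property can be illustrated with a toy NTT over Z_q. The parameters q = 17, n = 8 and the direct O(n²) transform are illustrative choices; real lattice designs use negacyclic NTTs with much larger n and an O(n log n) butterfly network.

```python
# Toy NTT: with q prime, n | q - 1, and w a primitive n-th root of unity
# mod q, the NTT maps cyclic convolution (polynomial multiplication
# mod x^n - 1) to point-wise multiplication in the spectral domain.
Q, N, W = 17, 8, 2          # 2 has multiplicative order 8 modulo 17

def ntt(a, root):
    # direct evaluation at root^0 .. root^(n-1); hardware uses butterflies
    return [sum(a[j] * pow(root, i * j, Q) for j in range(N)) % Q
            for i in range(N)]

def intt(spec):
    w_inv = pow(W, Q - 2, Q)            # modular inverse of W
    n_inv = pow(N, Q - 2, Q)            # modular inverse of N
    return [(x * n_inv) % Q for x in ntt(spec, w_inv)]

def polymul_ntt(a, b):
    sa, sb = ntt(a, W), ntt(b, W)
    return intt([(x * y) % Q for x, y in zip(sa, sb)])   # point-wise product
```

For these parameters the result matches schoolbook cyclic convolution modulo q, which is exactly the reduction from "complex polynomial multiplication" to point-wise products mentioned above.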
Quantum-resistant cryptography can also be achieved by leveraging approaches
other than lattices. FPGA implementations of other families of post-quantum
algorithms have also been explored, including the design and implementation of
Classic McEliece [20], BIKE [28], and HQC [24].
Physical attacks are a threat to post-quantum algorithm implementations. In
fact, physical attack resistance was explicitly mentioned by NIST among the
selection criteria for the new standard. Common attacks against post-quantum
algorithms include timing attacks, power analysis attacks, and fault attacks [16].
Countermeasures typically aim to protect the secret key using approaches previously
explored in block ciphers and include masking [50] to protect against power
analysis attacks, constant time implementations [36] to mitigate timing attacks, and
dedicated countermeasures to prevent fault attacks [35].

3.6 Conclusion and Open Problems

The use of FPGAs as a target platform to implement cryptography has significantly
increased recently. Implementations can be deployed in standalone FPGAs, in
embedded or cyber-physical systems, or in large shared FPGAs in the cloud. It is
necessary to take full advantage of the target platform in realizing an architecture
that meets the requirements of the application in terms of algorithm performance,
area, and functionality. This chapter has discussed the implementation of the most
common cryptographic primitives in reconfigurable hardware. The chapter also
summarized the main challenges related to the deployment of quantum-resistant
cryptography and the approaches that the scientific community is following to
implement it in a physically secure way.

Acknowledgments This work is partially supported by the EU Horizon 2020 Programme under
grant agreement No. 957269 (EVEREST).

References

1. AlFardan, N. J., & Paterson, K. G. (2013). Lucky thirteen: Breaking the TLS and DTLS record
protocols. In 2013 IEEE Symposium on Security and Privacy, SP 2013, Berkeley, CA, USA,
May 19–22, 2013 (pp. 526–540). IEEE Computer Society. https://doi.org/10.1109/SP.2013.42.

¹ BRAMs are on-chip RAM blocks usually used to store large amounts of data within FPGAs.

2. Amazon EC2 F1 Instances. (2023). Available at https://aws.amazon.com/ec2/instance-types/f1/.
3. Aysu, A., Patterson, C., & Schaumont, P. (2013). Low-cost and area-efficient FPGA implemen-
tations of lattice-based cryptography. In 2013 IEEE International Symposium on Hardware-
Oriented Security and Trust (HOST) (pp. 81–86). IEEE.
4. Babbage, S., & Dodd, M. (2005). The stream cipher MICKEY 2.0. eSTREAM, ECRYPT
Stream Cipher Project Report. http://www.ecrypt.eu.org/stream/p3ciphers/mickey/mickey_p3.
pdf.
5. Banik, S., Bogdanov, A., Isobe, T., Shibutani, K., Hiwatari, H., Akishita, T., & Regazzoni, F.
(2015). Midori: A block cipher for low energy. In T. Iwata, J. H. Cheon (Eds.), Advances in
Cryptology - ASIACRYPT 2015 - 21st International Conference on the Theory and Application
of Cryptology and Information Security, Auckland, New Zealand, November 29–December
3, 2015, Proceedings, Part II, Lecture Notes in Computer Science (vol. 9453, pp. 411–436).
Springer. https://doi.org/10.1007/978-3-662-48800-3_17.
6. Banik, S., Bogdanov, A., & Regazzoni, F. (2015). Exploring energy efficiency of lightweight
block ciphers. In: Selected Areas in Cryptography - SAC 2015 - 22nd International Conference,
Sackville, NB, Canada, August 12–14, 2015, Revised Selected Papers (pp. 178–194). https://
doi.org/10.1007/978-3-319-31301-6_10.
7. Banik, S., Chakraborti, A., Inoue, A., Iwata, T., Minematsu, K., Nandi, M., Peyrin, T., Sasaki,
Y., Sim, S. M., & Todo, Y. (2020). GIFT-COFB. Cryptology ePrint Archive.
8. Banik, S., Funabiki, Y., & Isobe, T. (2021). Further results on efficient implementations
of block cipher linear layers. IEICE Transactions on Fundamentals of Electronics, Com-
munications and Computer Sciences, 104-A(1), 213–225. https://doi.org/10.1587/transfun.
2020CIP0013.
9. Banik, S., Pandey, S. K., Peyrin, T., Sasaki, Y., Sim, S. M., & Todo, Y. (2017). GIFT: A small
present - towards reaching the limit of lightweight encryption. In Cryptographic Hardware and
Embedded Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan, September
25–28, 2017, Proceedings (pp. 321–345). https://doi.org/10.1007/978-3-319-66787-4_16.
10. Beierle, C., Jean, J., Kölbl, S., Leander, G., Moradi, A., Peyrin, T., Sasaki, Y., Sasdrich, P., Sim,
S. M. (2016). The SKINNY family of block ciphers and its low-latency variant MANTIS. In
Advances in Cryptology - CRYPTO 2016 - 36th Annual International Cryptology Conference,
Santa Barbara, CA, USA, August 14–18, 2016, Proceedings, Part II (pp. 123–153). https://doi.
org/10.1007/978-3-662-53008-5_5.
11. Berbain, C., Billet, O., Canteaut, A., Courtois, N., Gilbert, H., Goubin, L., Gouget, A.,
Granboulan, L., Lauradoux, C., Minier, M., Pornin, T., & Sibert, H. (2006). SOSEMANUK,
a fast software-oriented stream cipher. eSTREAM, ECRYPT Stream Cipher Project Report.
http://www.ecrypt.eu.org/stream/p3ciphers/sosemanuk/sosemanuk_p3.pdf.
12. Bernstein, D. (2013). Failures of secret-key cryptography. Invited talk to FSE 2013. Available
at https://cr.yp.to/talks/2013.03.12/slides.pdf.
13. Bernstein, D. J. (2006). Salsa20/8 and Salsa20/12. eSTREAM, ECRYPT Stream Cipher Project
Report. http://www.ecrypt.eu.org/stream/papersdir/2006/007.pdf.
14. Boesgaard, M., Vesterager, M., Christensen, T., & Zenner, E. (2006). The Stream Cipher
Rabbit. eSTREAM, ECRYPT Stream Cipher Project Report. http://www.ecrypt.eu.org/stream/
p3ciphers/rabbit/rabbit_p3.pdf.
15. Bogdanov, A., Knudsen, L. R., Leander, G., Paar, C., Poschmann, A., Robshaw, M. J. B.,
Seurin, Y., & Vikkelsoe, C. (2007). PRESENT: an ultra-lightweight block cipher. In P. Paillier,
& I. Verbauwhede (Eds.), Cryptographic Hardware and Embedded Systems - CHES 2007, 9th
International Workshop, Vienna, Austria, September 10–13, 2007, Proceedings, Lecture Notes
in Computer Science (vol. 4727, pp. 450–466). Springer. https://doi.org/10.1007/978-3-540-
74735-2_31.
16. Boneh, D., DeMillo, R. A., & Lipton, R. J. (1997). On the importance of checking crypto-
graphic protocols for faults. In International Conference on the Theory and Applications of
Cryptographic Techniques (pp. 37–51). Springer.

17. Borghoff, J., Canteaut, A., Güneysu, T., Kavun, E. B., Knezevic, M., Knudsen, L. R., Leander,
G., Nikov, V., Paar, C., Rechberger, C., Rombouts, P., Thomsen, S. S., Yalçin, T. (2012).
PRINCE - A low-latency block cipher for pervasive computing applications - extended
abstract. In X. Wang, K. Sako (Eds.), Advances in Cryptology - ASIACRYPT 2012 - 18th
International Conference on the Theory and Application of Cryptology and Information
Security, Beijing, China, December 2–6, 2012. Proceedings, Lecture Notes in Computer
Science (vol. 7658, pp. 208–225). Springer. https://doi.org/10.1007/978-3-642-34961-4_14.
18. Cannière, C. D., & Preneel, B. (2005). TRIVIUM -Specifications. eSTREAM, ECRYPT
Stream Cipher Project Report. http://www.ecrypt.eu.org/stream/p3ciphers/trivium/trivium_p3.
pdf.
19. Chakraborti, A., Iwata, T., Minematsu, K., & Nandi, M. (2017). Blockcipher-based authen-
ticated encryption: How small can we go? In Cryptographic Hardware and Embedded
Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan, September 25–28,
2017, Proceedings (pp. 277–298). https://doi.org/10.1007/978-3-319-66787-4_14.
20. Chen, P. J., Chou, T., Deshpande, S., Lahr, N., Niederhagen, R., Szefer, J., & Wang, W. (2022).
Complete and improved FPGA implementation of Classic McEliece. Cryptology ePrint
Archive.
21. Cid, C., & Robshaw, M. (Eds.). (2012). The eSTREAM Portfolio in 2012, 16 January 2012,
Version 1.0. eSTREAM, ECRYPT Stream Cipher Project Report. http://www.ecrypt.eu.org/
documents/D.SYM.10-v1.pdf.
22. Daemen, J., & Rijmen, V. (2005). Rijndael/AES. In H. C. A. van Tilborg (Ed.), Encyclopedia of
cryptography and security. Springer. https://doi.org/10.1007/0-387-23483-7_358.
23. De Canniere, C., & Preneel, B. (2008). Trivium. In New Stream Cipher Designs: The eSTREAM
Finalists (pp. 244–266). Springer.
24. Deshpande, S., Xu, C., Nawan, M., Nawaz, K., & Szefer, J. (2022). Fast and efficient hardware
implementation of HQC. Cryptology ePrint Archive.
25. Dobraunig, C., Eichlseder, M., Mendel, F., Schläffer, M. (2021). Ascon v1.2: Lightweight
Authenticated Encryption and Hashing. Journal of Cryptology, 34(3), 33. https://doi.org/10.
1007/s00145-021-09398-9.
26. Dworkin, M. (2007). Recommendation for block cipher modes of operation: Galois/counter
mode (GCM) and GMAC. Tech. rep., National Institute of Standards and Technology.
27. eSTREAM, The ECRYPT Stream Cipher Project. (2012). eSTREAM, The ECRYPT Stream
Cipher Project. https://www.ecrypt.eu.org/stream/.
28. Galimberti, A., Galli, D., Montanaro, G., Fornaciari, W., & Zoni, D. (2022). FPGA implemen-
tation of BIKE for quantum-resistant TLS. In 2022 25th Euromicro Conference on Digital
System Design (DSD) (pp. 539–547). IEEE.
29. Grover, L. K. (1997). Quantum mechanics helps in searching for a needle in a haystack.
Physical Review Letters, 79(2), 325.
30. Güneysu, T., & Moradi, A. (2011). Generic side-channel countermeasures for reconfigurable
devices. In International Workshop on Cryptographic Hardware and Embedded Systems (pp.
33–48). Springer.
31. Guo, J., Jean, J., Nikolić, I., Qiao, K., Sasaki, Y., & Sim, S. M. (2015). Invariant subspace
attack against full Midori64. Cryptology ePrint Archive.
32. Hell, M., Johansson, T., Maximov, A., & Meier, W. (2008). A Stream Cipher Proposal: Grain-
128. eSTREAM, ECRYPT Stream Cipher Project Report. http://www.ecrypt.eu.org/stream/
p3ciphers/grain/Grain128_p3.pdf.
33. Hell, M., Johansson, T., & Meier, W. (2005). Grain - A Stream Cipher for Constrained
Environments. eSTREAM, ECRYPT Stream Cipher Project Report. http://www.ecrypt.eu.
org/stream/p3ciphers/grain/Grain_p3.pdf.
34. Hell, M., Johansson, T., & Meier, W. (2007). Grain: a stream cipher for constrained environ-
ments. International Journal of Wireless and Mobile Computing, 2(1), 86–93.
35. Howe, J., Khalid, A., Martinoli, M., Regazzoni, F., & Oswald, E. (2019). Fault attack
countermeasures for error samplers in lattice-based cryptography. In 2019 IEEE International
Symposium on Circuits and Systems (ISCAS) (pp. 1–5). IEEE.

36. Howe, J., Khalid, A., Rafferty, C., Regazzoni, F., & O’Neill, M. (2016). On practical discrete
Gaussian samplers for lattice-based cryptography. IEEE Transactions on Computers, 67(3),
322–334.
37. Iwata, T., Khairallah, M., Minematsu, K., & Peyrin, T. (2019). Romulus v1.2. NIST
Lightweight Cryptography Project. https://csrc.nist.gov/Projects/lightweight-cryptography/
round-2-candidates.
38. Jean, J., Moradi, A., Peyrin, T., & Sasdrich, P. (2017). Bit-sliding: a generic technique
for bit-serial implementations of SPN-based primitives - applications to AES, PRESENT
and SKINNY. In Cryptographic Hardware and Embedded Systems - CHES 2017 - 19th
International Conference, Taipei, Taiwan, September 25–28, 2017, Proceedings (pp. 687–707).
https://doi.org/10.1007/978-3-319-66787-4_33.
39. Lin, D., Xiang, Z., Zeng, X., & Zhang, S. (2021). A framework to optimize implementations
of matrices. In K. G. Paterson (Ed.), Topics in Cryptology - CT-RSA 2021 - Cryptographers’
Track at the RSA Conference 2021, Virtual Event, May 17–20, 2021, Proceedings, Lecture
Notes in Computer Science (vol. 12704, pp. 609–632). Springer. https://doi.org/10.1007/978-
3-030-75539-3_25.
40. Lyubashevsky, V., Peikert, C., & Regev, O. (2010). On ideal lattices and learning with
errors over rings. In Advances in Cryptology–EUROCRYPT 2010: 29th Annual International
Conference on the Theory and Applications of Cryptographic Techniques, French Riviera, May
30–June 3, 2010. Proceedings 29 (pp. 1–23). Springer.
41. Maximov, A., & Ekdahl, P. (2019). New circuit minimization techniques for smaller and
faster AES SBoxes. IACR Transactions on Cryptographic Hardware and Embedded Systems,
2019(4), 91–125. https://doi.org/10.13154/tches.v2019.i4.91-125.
42. Menezes, A. J., Van Oorschot, P. C., & Vanstone, S. A. (2018). Handbook of applied
cryptography. CRC Press.
43. NIST Lightweight Cryptography Project. (2019). Available at https://csrc.nist.gov/projects/
lightweight-cryptography.
44. Paar, C. (1999). Implementation options for finite field arithmetic for elliptic curve cryptosys-
tems. In The 3rd Workshop on Elliptic Curve Cryptography (October 1999).
45. Pomerance, C. (1996). A tale of 2 Sieves. Notices of the American Mathematical Society,
December 1996 (pp. 1473–1485).
46. Pöppelmann, T., & Güneysu, T. (2012). Towards efficient arithmetic for lattice-based cryp-
tography on reconfigurable hardware. In Progress in Cryptology–LATINCRYPT 2012: 2nd
International Conference on Cryptology and Information Security in Latin America, Santiago,
Chile, October 7–10, 2012. Proceedings 2 (pp. 139–158). Springer.
47. Post-Quantum Cryptography. (2023). https://csrc.nist.gov/Projects/post-quantum-
cryptography. Accessed 5 June 2023.
48. Regev, O. (2009). On lattices, learning with errors, random linear codes, and cryptography.
Journal of the ACM (JACM), 56(6), 1–40.
49. Rentería-Mejía, C., & Velasco-Medina, J. (2014). Hardware design of an NTT-based polyno-
mial multiplier. In 2014 IX Southern Conference on Programmable Logic (SPL) (pp. 1–5).
IEEE.
50. Reparaz, O., Sinha Roy, S., Vercauteren, F., & Verbauwhede, I. (2015). A masked ring-
LWE implementation. In International Workshop on Cryptographic Hardware and Embedded
Systems (pp. 683–702). Springer.
51. Rivest, R. L., Shamir, A., & Adleman, L. M. (1978). A method for obtaining digital signatures
and public-key cryptosystems. Communications of the ACM, 21(2), 120–126. http://doi.acm.
org/10.1145/359340.359342.
52. Roy, S. S., Turan, F., Jarvinen, K., Vercauteren, F., & Verbauwhede, I. (2019). FPGA-based
high-performance parallel architecture for homomorphic computing on encrypted data. In
2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)
(pp. 387–398). IEEE.

53. Roy, S. S., Vercauteren, F., Mentens, N., Chen, D. D., & Verbauwhede, I. (2014). Compact
ring-LWE cryptoprocessor. In Cryptographic Hardware and Embedded Systems–CHES 2014:
16th International Workshop, Busan, South Korea, September 23–26, 2014. Proceedings 16
(pp. 371–391). Springer.
54. Shannon, C. E. (1949). Communication theory of secrecy systems. The Bell System Technical
Journal, 28(4), 656–715.
55. Shor, P. W. (1994). Algorithms for quantum computation: discrete logarithms and factoring
(pp. 124–134). IEEE Comput. Soc. Press. https://doi.org/10.1109/sfcs.1994.365700.
56. Shor, P. W. (1999). Polynomial-time algorithms for prime factorization and discrete logarithms
on a quantum computer. SIAM Review, 41(2), 303–332.
57. Vaudenay, S. (2002). Security flaws induced by CBC padding - applications to SSL, IPSEC,
WTLS. In L. R. Knudsen (Ed.) Advances in Cryptology - EUROCRYPT 2002, International
Conference on the Theory and Applications of Cryptographic Techniques, Amsterdam, The
Netherlands, April 28–May 2, 2002, Proceedings, Lecture Notes in Computer Science (vol.
2332, pp. 534–546). Springer. https://doi.org/10.1007/3-540-46035-7_35.
58. Williams, H. C., & Shallit, J. O. (1994). Factoring integers before computers. Mathematics of
Computation 1943–1993, Fifty Years of Computational Mathematics (W. Gautschi, ed.), Proc.
Sympos. Appl. Math. (vol. 48, pp. 481–531). Providence, RI: Amer. Math. Soc.
59. Wu, H. (2006). HC-128. eSTREAM, ECRYPT Stream Cipher Project Report. http://www.
ecrypt.eu.org/stream/p3ciphers/hc/hc128_p3.pdf.
Chapter 4
Remote Physical Attacks on FPGAs at the Electrical Level

Dennis R. E. Gnad, Jonas Krautter, and Mehdi B. Tahoori

4.1 Introduction

The paradigm of heterogeneous computing using hardware accelerators has already
spread from small embedded platforms up to large-scale cloud applications [1, 18,
43]. Field-Programmable Gate Arrays (FPGAs) are popular accelerator platforms
which allow for the implementation of arbitrary digital circuits defined through a
software-based configuration interface. FPGAs allow frequent updates throughout
their lifetime, thanks to runtime reconfiguration, which provides an advantage
over dedicated ASIC implementations especially when the devices are utilized
in agile environments. FPGAs can be partially reconfigured at runtime, allowing
multiple individual accelerators to be hosted on the same chip. These advantages
have brought FPGAs into smart devices and many critical application domains
such as networking infrastructure [5], medicine [46], finance [7], aerospace, and
military [45], often powered by machine learning [53].
Recently, major cloud computing providers, such as Amazon Web Services [1],
Microsoft Azure [43], and Alibaba Cloud [18], have begun to adopt FPGAs. There
is increasing interest in virtualizing FPGAs and providing multi-tenant access to
increase their utilization, similar to other cloud computing devices [9, 21]. Like
other cloud and embedded applications, security requirements must be met, such
as the proper isolation of individual users on the same FPGA fabric, a fundamental
property of multi-tenant operation [51]. However, in this domain, the flexibility of
FPGAs can also become a security threat, specifically due to the many possible
circuits that can be implemented in FPGA-based design.

D. R. E. Gnad · J. Krautter · M. B. Tahoori
Karlsruhe Institute of Technology (KIT), Chair of Dependable Nano Computing (CDNC),
Karlsruhe, Germany
e-mail: dennis.gnad@kit.edu; jonas.krautter@kit.edu; mehdi.tahoori@kit.edu

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_4

These security requirements have been explored in situations where individual
compute components on the FPGA are spatially isolated through restricted com-
munication interfaces [9, 17]. However, fault [14, 26, 31] and power analysis-based
side-channel attacks [44, 47, 54] can be performed in these scenarios. Until recently,
it was believed that physical FPGA attacks required local access to the device under
attack. In this chapter, we explain how physical attacks can also be performed
from within the FPGA itself, despite secure isolation at the digital level [17, 51].
Moreover, we discuss countermeasures that have been proposed to mitigate these
security attacks.
Physical circuit properties of FPGAs can be exploited by implementing circuitry
that is either sensitive to on-chip voltage levels or can influence them. These
possibilities break previous assumptions on how secure FPGA virtualization can
be implemented. It furthermore shifts physical fault and power analysis attacks
from a local to a potentially remote attacker, with negative implications for orders
of magnitude more users, particularly in cloud platforms. Since these attacks are
implemented on cloud FPGAs, cryptographic cores, which traditionally have been
vulnerable to fault and side-channel attacks, and other designs, such as neural
networks, become potential targets for attack [39, 50].

4.2 Background

The fundamental driver of semiconductor chips is electricity, and thus any electrical
disruption can become a weakness for the entire system. In this section, we first
describe power distribution in modern semiconductor devices and then explain how
this knowledge is used for attacking those devices.

4.2.1 Power Distribution in Integrated Circuits

Integrated circuits (ICs) are typically supplied through a power distribution network
(PDN) to ensure that the same level of supply voltage is delivered to individual
chip transistors. The system-level PDN involves a hierarchy of active and passive
electronic components. A simplified PDN of a single printed circuit board (PCB)
with one IC is shown in Figs. 4.1 and 4.2. In modern low-power electronic
applications, the current flow is typically controlled by a switched-mode voltage
regulator and passes through traces on the PCB and chip package to individual IC
transistors. Some of the components shown in Fig. 4.1 are discrete components,
while others parasitically behave as such, as visualized in Fig. 4.2. For instance,
the bond wires of a chip package behave as unwanted inductors, which are partially
addressed by adding decoupling capacitors on the board and inside the silicon die.
A switched-mode voltage converter requires an inductor for its operation.

Fig. 4.1 Overview of the PDN hierarchy from the board-level voltage regulator to the individual
transistors on the chip

Fig. 4.2 Electrical model of the PDN as a mesh of capacitive, inductive, and resistive elements

The supply voltage inside the PDN must be kept as stable as possible at all times
to keep transistor delays within predictable boundaries and maximize frequency
and performance under a given power budget. Typically, some noise in the on-chip
supply voltage is inevitable and can be characterized as generic voltage fluctuations.
However, these fluctuations depend on circuit activity, which in turn depends on
the workload and data being processed in the system.
A drop in on-chip supply voltage can be expressed using two components. One
part is a voltage drop that depends on the resistance R and the required current I:
V_drop,R = I · R. The other part depends on the change in current over time and the
respective voltage drop over a series inductance (for instance, bond wires):
V_drop,L = L · dI/dt. This dynamic component of the voltage drop depends on
the data processed inside the chip and can be leveraged by attackers. ICs using
technology below 45 nm are more affected by L · dI/dt.
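A back-of-the-envelope calculation shows why the dynamic component dominates for fast current transients. All component values below are assumed, illustrative numbers, not measurements from a real device:

```python
# Illustrative PDN numbers (all assumed) for the two voltage-drop components
# described above: V_drop = I*R (static) + L*dI/dt (dynamic).
R = 0.05      # effective PDN resistance (ohm)
L = 1e-9      # package/bond-wire inductance (henry)
I = 2.0       # steady-state current (ampere)
dI = 1.5      # size of a sudden current step (ampere)
dt = 1e-9     # duration of the step: 1 ns

v_drop_r = I * R            # static IR drop: 0.1 V
v_drop_l = L * dI / dt      # dynamic L*dI/dt drop: 1.5 V
```

Even with a modest inductance, a 1.5 A step within a nanosecond produces a transient droop an order of magnitude larger than the static IR drop, which is exactly what a fault-injecting attacker exploits.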

4.2.2 Security Threats Based on PDN Access

Two categories of attacks are possible through on-chip PDN manipulations [14, 47].
These attacks are traditionally carried out using test and measurement equipment [4,
22]:

Fig. 4.3 Summary of the two threats in a multi-tenant FPGA/SoC: (1) a fault attack
based on voltage drop injection, in which attacker-controlled FPGA logic actively
influences a logically isolated victim FPGA/CPU or other module, and (2) a
side-channel attack based on passive voltage measurement

• Fault attacks actively manipulate voltage to cause timing violations leading to
computational errors. Secret data can be extracted using these errors [4].
• Power analysis side-channel attacks passively measure data-dependent power or
voltage. This information is analyzed to extract secret information [22].
Both types of attacks have been shown in numerous practical scenarios using
smartcards for banking [36] and wireless car keys [8]. We will later show that these
types of attacks can be performed using elements inside the FPGA without the need
for external test equipment. As shown in Fig. 4.3, an attacker can use FPGA logic
to attack other parts of the system through the FPGA PDN even if logical isolation
is in place.

4.2.2.1 Fault Attacks Based on Power

Most fault attacks intentionally operate the device under test beyond its electrical
specification so that computational errors might occur. Here, we focus on fault-
inducing voltage drops. An excessive voltage drop can lead to a brown-out condition
in which the circuit resets and memory undergoes a retention failure leading to a
denial-of-service (DoS) condition. However, if the voltage drop strength is adjusted
precisely, the propagation delay of the circuit increases and timing violations can
be induced. A subsequent mathematical analysis of the observed faults can reveal
sensitive information.
For instance, when targeting an encryption function, classical Differential Fault
Analysis (DFA) [4] compares the result of a genuine encryption to that of a
faulty encryption under the same input. The fault must affect a specific part of
the algorithm that depends only on a small part of the internally used secret key.
Then, an attacker can test which key hypothesis leads to the correct difference in
the internal state caused by the induced fault. This process can be repeated for the
remaining parts of the secret key, revealing the entire key.
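The divide-and-conquer idea behind DFA can be shown on a one-byte toy "last round" c = S[x] ⊕ k. This is an illustrative simplification, not the full AES attack of [4]: the S-box, the secret key, the fault difference, and the number of faulty pairs are all made up.

```python
import random

random.seed(42)
S = list(range(256))
random.shuffle(S)                        # toy bijective S-box
S_INV = [0] * 256
for i, s in enumerate(S):
    S_INV[s] = i

K_SECRET = 0x5C                          # unknown to the attacker
FAULT = 0x80                             # known injected input difference

def last_round(x, fault=0):
    # x is an internal state byte the attacker never observes directly
    return S[x ^ fault] ^ K_SECRET

# For each (correct, faulty) ciphertext pair, keep only key guesses k for
# which the implied S-box input difference matches the injected fault.
candidates = set(range(256))
for _ in range(6):
    x = random.randrange(256)
    c, c_faulty = last_round(x), last_round(x, FAULT)
    candidates &= {k for k in range(256)
                   if S_INV[c ^ k] ^ S_INV[c_faulty ^ k] == FAULT}
```

The correct key byte always survives the filtering, and after a handful of pairs wrong guesses are typically eliminated; repeating the procedure per byte recovers the whole key, which is the divide-and-conquer structure described above.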

4.2.2.2 Power Side-Channel Attacks

For power-based side-channel attacks, the power consumption or electromagnetic
emanation of the device under test is measured while it performs a sensitive
operation. If the captured power consumption can be linked to the internal data
currently being processed, it may reveal secret information about the computation.
For Simple Power Analysis (SPA), a single or a small number of captured traces
are inspected for potential sources of leakage such as conditional branches that
depend on secret data [22]. Differential Power Analysis (DPA) statistically evaluates
multiple traces to reveal secret information [22]. DPA is based on the fact that power
consumption is a function of the data currently being processed, i.e., it might be
proportional to the number of bits that are set in an intermediate value. Because
this dependency is usually rather small, a few hundred up to millions of traces must
be evaluated, depending on the target and noise in the system. Like fault attacks,
the mathematical evaluation consists of a divide-and-conquer approach, i.e., only a
small part of the secret key is tested to see if the hypothetical power consumption
matches the actual measurement.
One variant of DPA is Correlation Power Analysis (CPA): Here, an intermediate
value of the encryption is chosen that depends on a small part of the key. Then,
the power consumption of this value is estimated for all potential subkey candidates
and the input to the encryption function. This hypothetical power consumption is
correlated with the measured power of the device for the same input. If the power
model is chosen correctly and the intermediate value indeed occurred, the correct
key hypothesis will have a high correlation, while the false hypotheses appear as
uncorrelated noise.
A CPA attack on an Advanced Encryption Standard (AES) module typically uses
the Hamming weight (HW) as a power model and the internal 8-bit state obtained
just before the final Sbox computation as an intermediate value. Thus, our prediction
based on one byte of the ciphertext c_i and the 8-bit key hypothesis k_hyp will be

    model = HW(Sbox^(-1)(k_hyp ⊕ c_i))          (4.1)

where Sbox^(-1) is the inverse AES Sbox, leading to 256 key hypotheses that need
to be evaluated.
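A small simulation makes the CPA recipe concrete. Everything here is synthetic and illustrative: a random toy S-box stands in for the AES Sbox, and simulated Hamming-weight leakage with Gaussian noise stands in for measured traces.

```python
import random

random.seed(7)
S = list(range(256))
random.shuffle(S)                        # toy S-box standing in for AES's
S_INV = [0] * 256
for i, s in enumerate(S):
    S_INV[s] = i
HW = [bin(v).count("1") for v in range(256)]

K_SECRET = 0xB7

# Simulate traces: leakage = HW(Sbox input) + noise, per the HW power model.
ciphertexts, traces = [], []
for _ in range(1500):
    x = random.randrange(256)            # state byte before the final Sbox
    ciphertexts.append(S[x] ^ K_SECRET)
    traces.append(HW[x] + random.gauss(0.0, 1.0))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

# Equation (4.1): model = HW(Sbox^-1(k_hyp XOR c_i)); correlate per key guess
# and pick the guess with the highest absolute correlation.
def cpa_best_key():
    return max(range(256),
               key=lambda k: abs(pearson([HW[S_INV[c ^ k]] for c in ciphertexts],
                                         traces)))
```

For the correct guess the model equals the actual leakage (up to noise), so its correlation stands far above the uncorrelated wrong hypotheses, exactly as described for CPA above.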

4.3 Interaction Between Voltage and FPGA Logic

FPGAs are typically designed to offer great flexibility in implementing digital logic
circuits. The main building blocks used to achieve that flexibility in most com-
mercial FPGAs are look-up tables (LUTs) and registers, which can be connected
arbitrarily using programmable switch matrices and multiplexers. Typically, vendor-
specific building blocks, such as accelerators for digital signal processing tasks,

Fig. 4.4 Ring oscillator array to cause voltage drop-based faults (cf. [14]): a
frequency generator toggles an array of ring oscillators, each oscillating at its
internal frequency f_RO-internal, on and off at f_toggle

carry-chains for integer arithmetic, and basic building blocks such as PLLs for clock
generation, are added to FPGAs. When standard FPGA primitives are configured
appropriately, they can be utilized to influence or measure the chip-internal supply
voltage, acting as improvised voltage sensors or fault injection logic. Thus, FPGA
primitives can become a powerful attack tool, integrated on the same PCB or SoC
as the victim logic.

4.3.1 Voltage Drop Injection with Digital Logic

With deliberate glitching and access to the power supply or clock of a chip, it
is possible to cause timing faults. However, with the knowledge of how power is
distributed in integrated circuits, it is also possible to cause excessive voltage drop
in an FPGA using the device logic [14]. As explained in Sect. 4.2.1, causing a rapid
change in current in a small time interval can lead to a dI/dt-based voltage drop.
This knowledge can be exploited on FPGAs if some basic requirements are
fulfilled. An attacker needs control over power-wasting elements that can be toggled
on and off in specific sequences. Typically, FPGAs and the power supplies of
integrated circuits are designed to withstand a high constant power draw at high
frequencies. However, if the toggling frequency drops below a certain level, either
the frequency range of the voltage regulator used on the board or chip-internal
resonances can be targeted. Thus, to cause a strong voltage drop, these high-power
elements must be enabled at a lower frequency that matches the weak spots of the
PDN. The elements themselves switch at a very high frequency internally, which
generates a large current flux, while an external, lower frequency is used to toggle
them on and off.
In practice, ring oscillators (ROs) mapped to FPGA LUTs cause high current
consumption due to their high internal frequency, but they must be enabled and
disabled at a certain frequency, as shown in Fig. 4.4. By adjusting the toggling
frequency f_toggle, an ideal frequency can be found that causes timing violations
in the FPGA or a crash or reset of the chip [14, 26, 31]. With further refinements, this
style of fault injection can be tuned to inject faults within short moments in time, with
the precision required for DFA [26]. Please note that the placement and the number
of activated power wasters also influence the attack quality, as shown in Sect. 4.4.2.
4 Remote Physical Attacks on FPGAs at the Electrical Level 87

Fig. 4.5 Ring oscillator with enable to sense voltage fluctuations for side-channel
attacks (cf. [54]): a LUT realizing a <= a & en closes the oscillation loop on signal a,
and a counter on a provides the voltage-level estimate

Follow-up work has shown that it is possible to cause timing faults not only with
ROs but also with sequential oscillators [49] and memory access collisions [2], or
even by toggling a large number of AES modules or benchmark circuits [23, 41].
This large variety of available power wasters for fault injection further amplifies the
threat, as the circuits can evade design rule checks or other offline countermeasures,
as described in Sect. 4.5.2.

4.3.2 Digital Logic for Voltage Estimation

To estimate the voltage level using digital logic elements, circuits that exhibit
voltage-dependent behavior must be implemented. Since the switching speed of
transistors is voltage-dependent, such sensor circuits must violate typical synchronous
design constraints to expose this timing behavior. Two basic sensor types can be identified.
The simplest approach is to use ROs [33, 54]. To approximate voltage, it is
sufficient to count the number of oscillations of a single RO in a given timeframe,
as shown in Fig. 4.5. By observing the differences from one timeframe to another,
a relative estimate of the voltage level can be obtained. However, because the RO
oscillations must be counted, only a limited sampling rate can be reached.
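A toy numerical model makes this resolution/sampling-rate trade-off concrete; the nominal frequency and voltage sensitivity below are illustrative assumptions, not device data.

```python
def ro_count(vdd, t_frame, f_nom=300e6, v_nom=1.0, sens=0.5):
    """Counter value of a single RO sampled over one timeframe.
    First-order model: RO frequency rises roughly linearly with the
    supply voltage (f_nom and sens are illustrative constants)."""
    f_ro = f_nom * (1.0 + sens * (vdd - v_nom) / v_nom)
    return int(f_ro * t_frame)  # value captured by the counter each frame

# A 10 mV droop barely moves a 1 us count but clearly moves a 10 us count;
# the longer frame, however, cuts the sampling rate from 1 MHz to 100 kHz.
delta_short = ro_count(1.00, 1e-6) - ro_count(0.99, 1e-6)
delta_long = ro_count(1.00, 1e-5) - ro_count(0.99, 1e-5)
```

The counter quantizes the oscillation count, so small voltage changes only become visible over long timeframes, which is exactly why the RO sensor's sampling rate is limited.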
As an alternative, the RO can be unrolled into a delay line, a design also called
a time-to-digital converter (TDC), which in the same FPGA technology reaches about
10× the sampling rate of an RO-based sensor [38, 56]. The delay line sensor contains a
path that is deliberately too long to meet the timing constraints assumed in timing analysis.
In the implementation, the path must be timing-critical, i.e., on the verge of failing.
Because of the voltage dependency of transistor speed, the path fails for reduced
voltage levels. By adding multiple endpoints at different depths of the path, a sensor
for a relative voltage level can be implemented, as visualized in Fig. 4.6. In most
FPGAs, it is reasonable to use a LUT or latch for an initial delay, followed by faster
carry-chain primitives to reach a fine delay granularity from endpoint to endpoint.
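The following first-order model captures the TDC principle; the clock period and stage delays are illustrative assumptions. A lower supply slows every stage, so the launched edge reaches fewer taps before the sampling clock edge, and the thermometer-code weight drops.

```python
def tdc_sample(vdd, clk_period=4.0e-9, init_delay=3.2e-9,
               n_taps=64, tap_delay=20e-12, v_nom=1.0):
    """Thermometer-code weight captured by the TDC's output registers.
    Propagation delays scale inversely with supply voltage (first-order)."""
    scale = v_nom / vdd            # lower vdd -> slower transistors
    t = init_delay * scale         # initial LUT/latch delay
    weight = 0
    for _ in range(n_taps):
        if t <= clk_period:        # edge passed this tap before the clock
            weight += 1
        t += tap_delay * scale     # one carry-chain element per tap
    return weight

# A 50 mV droop removes several taps' worth of propagation depth:
w_nominal, w_droop = tdc_sample(1.00), tdc_sample(0.95)
```

Because one register sample yields a full code word, a new estimate is produced every clock cycle, which is the source of the higher sampling rate compared to counting RO oscillations.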
Fig. 4.6 LUT-/latch-based delay line with carry-chain-based time-to-digital converter to
measure voltage fluctuations for side-channel attacks (cf. [47, 56]): the clock edge
propagates through the LUT/latch delay elements into a chain of registers tapped at
increasing depths, whose captured values form the voltage-level estimate

Please note that both sensors are sensitive to other sources of timing variation,
including inherent manufacturing process variation, changes in temperature, and
circuit aging [13, 55]. However, voltage fluctuations change more rapidly than these
other variables [13]. Thus, in a trace of sensor data, fast voltage fluctuations are
modulated on top of slower variations, which does not diminish the success
of power analysis side-channel attacks [47].
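Because the slow components ride beneath the fast fluctuations of interest, a simple moving-average high-pass filter is one common way to separate them before analysis; the window size below is an arbitrary illustrative choice.

```python
def detrend(trace, window=64):
    """Subtract a centered moving average from each sample, removing slow
    drift (temperature, aging) while keeping fast voltage fluctuations."""
    out = []
    for i in range(len(trace)):
        lo = max(0, i - window // 2)
        hi = min(len(trace), i + window // 2 + 1)
        out.append(trace[i] - sum(trace[lo:hi]) / (hi - lo))
    return out

# Slow thermal ramp plus a fast alternating fluctuation: after detrending,
# the ramp is gone but the alternating component survives.
trace = [0.01 * i + (1 if i % 2 else -1) for i in range(200)]
flat = detrend(trace)
```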

4.4 Fault and Side-Channel Attacks on FPGA Platforms

In this section, we explain two types of attacks implemented on boards with FPGAs
from two vendors. Depending on the specific FPGA and platform, either fault
injection or voltage measurement works better. Typically, at least one of the attacks
is successful at the electrical level.

4.4.1 Power Analysis Attack on a Lattice ECP5

In this section we describe a CPA attack implemented on a Lattice Semiconductor
ECP5 FPGA [16] situated on a Lattice ECP5 evaluation board (LFE5UM5G-85F-
EVN). The attack uses TDC-based on-chip sensors. To implement a sensor delay
line, as described in Sect. 4.3.2, we use ECP5 carry chain primitives (CCU2C).
One FPGA slice is required for two bits of the TDC sensor. The connections
between the primitives are described using a hardware description language.
The primitives are directly used as components or modules. On-chip fluctuations
are measured using the TDC sensor when an AES module implemented in the
FPGA is performing encryption. Voltage fluctuations for 10,000 random plaintext
messages are measured. To perform the key recovery attack, the encrypted plaintexts
(ciphertexts) and voltage fluctuation data from the TDC are used. The secret key in
the last round of the AES encryption is targeted by correlating the sensor traces
with a standard Hamming distance for each byte-based power model, as explained
in Sect. 4.2.2.2.
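To make the attack procedure concrete, the following self-contained sketch reproduces a last-round CPA on synthetic data: the leakage of one state byte is modeled as the Hamming distance of the output register transition plus Gaussian noise, and the key guess maximizing the absolute Pearson correlation wins. The trace generation, noise level, and trace count are illustrative assumptions; only the targeting of inv-S-box(ciphertext XOR key guess) follows the attack described here (ShiftRows is omitted since a single byte is tracked).

```python
import math
import random

def rotl8(x, s):
    return ((x << s) | (x >> (8 - s))) & 0xFF

def build_sboxes():
    """AES S-box from GF(2^8) inversion plus affine map; inverse by flipping."""
    sbox = [0] * 256
    p = q = 1
    while True:
        p = (p ^ (p << 1) ^ (0x1B if p & 0x80 else 0)) & 0xFF  # p *= 3
        q = (q ^ (q << 1)) & 0xFF                              # q /= 3
        q = (q ^ (q << 2)) & 0xFF
        q = (q ^ (q << 4)) & 0xFF
        if q & 0x80:
            q ^= 0x09
        sbox[p] = q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63
        if p == 1:
            break
    sbox[0] = 0x63
    inv = [0] * 256
    for i, v in enumerate(sbox):
        inv[v] = i
    return sbox, inv

HW = [bin(i).count("1") for i in range(256)]  # Hamming weight lookup

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def cpa_last_round(n_traces=2000, key_byte=0x2A, noise=1.0, seed=1):
    """Recover one last-round key byte from noisy Hamming-distance leakage."""
    random.seed(seed)
    sbox, inv = build_sboxes()
    cts, traces = [], []
    for _ in range(n_traces):
        s9 = random.randrange(256)            # last-round input byte
        ct = sbox[s9] ^ key_byte              # SubBytes, then AddRoundKey
        traces.append(HW[s9 ^ ct] + random.gauss(0.0, noise))
        cts.append(ct)
    # Correlate each guess's predicted register transition with the traces.
    scores = []
    for g in range(256):
        hyp = [HW[inv[c ^ g] ^ c] for c in cts]
        scores.append(abs(pearson(hyp, traces)))
    return max(range(256), key=scores.__getitem__)
```

Here, 2,000 synthetic traces suffice because the modeled leakage is far less noisy than real on-chip sensor traces.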
Fig. 4.7 CPA attack through on-chip sensors on the Lattice ECP5 FPGA; correlation progress
over 10,000 samples for all 256 secret AES key byte candidates with the correct key byte marked
red

Figure 4.7 depicts the results of a successful CPA for a single byte. The correct
key byte, which is marked red, shows a significantly higher correlation in the CPA
power model than the other 255 key candidates, shown as gray plots. The correct
key byte can be identified once it is clearly distinguishable from the others. With
this simple attack, approximately 50% of the key bytes can be recovered when 100,000
traces are used. This result demonstrates the fundamental vulnerability of FPGAs to such
attacks. More sophisticated analysis or more traces can be used to recover the full
key.

4.4.2 Fault Attack on a Xilinx Ultrascale VCU108

We performed detailed experiments [16] using a VCU108 Virtex UltraScale board
to characterize the precise probabilities per frequency that cause a crash or induce
design timing faults. Timing faults were evaluated with a 288-bit adder/subtractor
with a slack of 0.037 ns. During operation, the circuit alternates between performing
an addition operation and a subtraction operation. Each operation causes overflow
or underflow and stimulates the critical path. A total of 1% (5376) of the LUTs in
the XCVU095 FPGA are used. Timing faults can be reliably caused in the narrow range
of f_RO-t = 9.6..13.7 kHz using as few as 25% of the FPGA LUTs as ROs. At 20%
LUT usage, some faults are generated with f_RO-t = 25 MHz at 87% probability. A
total of 15% LUT usage (80k LUTs) did not generate faults in the FPGA.
When the LUT usage is increased to 30%, both crashes and timing faults are
caused, depending on the RO-toggling frequency f_toggle, as explained in Sect. 4.3.1
(cf. Fig. 4.4). Figure 4.8 shows the respective probabilities for a timing fault and a
crash. For each frequency, we launched a burst of 512 clock periods of f_toggle. To
find the fault probability, we checked the result of our 288-bit adder/subtractor for
correctness and took a normalized average. For the crash probability, we launched
512 of these bursts and plotted the percentage that caused a crash.
The FPGA reliably crashes with an f_toggle of 200 Hz or lower, while
frequencies between 200 Hz and 10 kHz reliably induce timing faults. We also
show the results when the RO usage is increased from 30% to 35% of the LUTs
(Fig. 4.8). A clearer boundary between faults and crashes can be seen, as well as
a wide frequency band in which timing faults occur without crashes that would
require bitstream reprogramming. Such fault attacks compromise system integrity
without causing a crash that would be more obvious to a victim.

Fig. 4.8 Range of frequencies to toggle ROs on/off that cause timing faults or crashes on the Xilinx
VCU108 Virtex Ultrascale board

4.4.3 Results on Additional FPGA Platforms

Voltage attacks were performed on several different FPGA platforms, as summa-
rized in Table 4.1.

Table 4.1 Overview of experimental results on side-channel and fault attacks in FPGAs, systems
containing FPGAs, or FPGA-based SoCs, as of March 2022

Board | Voltage drop-based denial of service | Voltage drop-based timing fault injection | Key/data recovery by side-channel
Intel Terasic DE0-Nano-SoC | – | Yes, [26] | –
Intel Terasic DE1-SoC | Yes, [26] | Yes, [26] | –
Intel Terasic DE4 | – | Yes, [26] | –
Intel Terasic DE5a-Net | Yes, [41] | Yes, [41] | –
Intel Terasic DE10-Pro Stratix 10 | Yes, [23] | Yes, [23] | –
Lattice ECP5 5G Evaluation Board | – | – | Yes, [16]
Lattice iCE40-HX8K Breakout Board | Yes, [15] | Yes, [26] | Yes, [15]
Xilinx Artix-7 Basys-3 | – | – | Yes, [47]
Xilinx Artix-7 Nexys 4 | Yes, [2] | Yes, [2] | –
Xilinx Kintex-7 KC705 | Yes, [14] | –⁴ | –
Xilinx Pynq Zynq-ZC7020 | Yes, [26]¹ | –⁴ [26] | –
Xilinx Spartan-6 SAKURA-G | – | – | Yes, [47], [48]³
Xilinx Ultrascale VCU108 | Yes, [16] | Yes, [16] | –
Xilinx Ultrascale+ VU9P | – | – | Yes, [12]
Xilinx Virtex-6 ML605 | Yes, [14] | –⁴ | –
Xilinx Virtex-7 VC707 | – | Yes, [31] | –
Xilinx Virtex-7 ADM-PCIE-7V3 | – | – | Yes, [24]
Xilinx Zynq-ZC7020 Zedboard | Yes, [14]¹ | –⁴ | Yes, [54]²
Xilinx Zynq Ultrascale+ Ultra96 | Yes, [34] | – | –

¹ The attack affects the whole SoC including the integrated ARM Cortex-A9 Dual-Core
² Sufficient leakage for key recovery was also shown from CPU to FPGA in the same SoC
³ In [48], the attack was also shown to work from one FPGA in the system (connected to the
same power supply) to another, at board level
⁴ A simple experiment was conducted, but the devices crashed before timing violations
occurred; it might still be possible with more effort

4.5 Countermeasures

Many countermeasure approaches have been developed to address side-channel
and fault attacks, which have been known for more than 20 years [4, 22]. For
instance, faults can often be addressed by hardware redundancy [32]. Likewise,
side-channel leakage can be reduced by hiding [20, 35] or masking [6]
countermeasures. In the following, we present countermeasures that explicitly target
the threats faced by cloud FPGAs.

4.5.1 The Importance of Physical Design Parameters

When evaluating remote, internal side-channel attacks on FPGAs from different
vendors, we often found large differences in the number of measurements an
attacker must collect to recover a secret encryption key. This peculiarity
was further explored on a Xilinx Zynq-7000 platform [24], investigating the impact
of attacker and victim physical design parameters on overall vulnerability.
An AES encryption module was chosen as the victim design. Four
global module placement locations, the local placement of primitives using four
different heuristic place-and-route strategies, and four different FPGA chips were
used during experimentation.
We collected up to 100k traces for each of the 256 experimental settings.
Depending on the experiment, the key byte can be recovered from as few as
200 traces or not recovered at all from 100k traces. Experimentation with an
exemplary noise-based hiding scheme shows the criticality of the mapping parameters.
The minimum number of traces required for key recovery is raised by more than
260× in some cases, but in other cases, these countermeasures do not help prevent
side-channel leakage.
Attack evaluation on datacenter-scale Virtex-7 FPGAs also showed the impact
of design space parameters on attack success. The layout and organization of the
larger fabric further increases the design space and thus the variation in side-channel
vulnerability. Modern high-end FPGAs are often composed of multiple dies, in
which side-channel attacks have been shown to still be possible [10]. We consider
two causative factors: The PDN design is non-uniform across the chip, which leads
to differences in the impact of current draw on the supply voltage, and the mapped
user logic components are subject to intra-chip process variation (PV). Both factors
impact the sensitivity of sensors in different locations and lead to differences in the
voltage traces caused by modules in different locations.
Hypervisors might be able to leverage the impact of placement and routing to
optimize security. A possible computer-aided design approach could be built using
three steps:
• First, a hypervisor generates multiple local placements for a specific cryptomodule.

• Then, for each module mapping, a global analysis determines which regions
are less vulnerable to side-channel attacks. Attack evaluations for all possible
combinations are hardly feasible. Future work might be able to identify an
adequate model, which can assess side-channel vulnerability in less time than
a full attack.
• In the third and final step, side-channel security on a specific FPGA can be
improved at zero overhead. On the hypervisor side, a global map of secure
locations and precompiled cryptocores can be provisioned. These items can be
deployed by the user as building blocks in security-critical applications.
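The three steps can be sketched as a simple selection loop; everything here is hypothetical (the placement candidates, region names, and leakage-scoring function are stand-ins for the vulnerability model that the text leaves to future work):

```python
def select_secure_mapping(placements, regions, leakage_score):
    """Pick the (placement, region) pair with the lowest estimated
    side-channel leakage and return it as a provisionable building block."""
    scored = [((p, r), leakage_score(p, r))
              for p in placements for r in regions]
    (placement, region), score = min(scored, key=lambda item: item[1])
    return {"placement": placement, "region": region, "score": score}

# Toy example: made-up leakage scores for two placements x three regions.
toy_scores = {("P0", "R0"): 0.9, ("P0", "R1"): 0.4, ("P0", "R2"): 0.7,
              ("P1", "R0"): 0.8, ("P1", "R1"): 0.6, ("P1", "R2"): 0.2}
best = select_secure_mapping(["P0", "P1"], ["R0", "R1", "R2"],
                             lambda p, r: toy_scores[(p, r)])
```

In a real hypervisor, `leakage_score` would be the fast vulnerability model the second step calls for, precomputed per chip rather than evaluated by mounting a full attack.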
This approach could improve security without requiring additional FPGA area
or resources and would provide a valuable asset in securing multi-tenant FPGA
access against internal side channels. Neglecting such device and physical design
dependencies can compromise the security of virtualized FPGAs in the cloud.

4.5.2 Offline Bitstream Checking Countermeasures

Possible countermeasures can be classified as offline approaches, which attempt to
detect malicious logic before it is loaded into the FPGA, and online approaches,
which detect and neutralize attacks in the FPGA at run time using sensors and
other logic.
Offline bitstream checking countermeasures, in which user bitstreams are
checked for malicious signatures, have been proposed [27, 29]. These approaches
use methodologies from anti-virus software to prevent attacks. To detect designs that
could potentially be used for fault injection or remote side-channel measurements,
the hypervisor first converts the bitstream back into a flattened netlist of low-level
primitives. Then, the netlist can be evaluated through various perspectives, for
instance, by treating it as a netlist graph, scanning for specific connections that
are unusual in benign designs, or performing a simple timing analysis. Whereas
timing violations can indicate a sensor in the design, fault injections usually require
a large amount of logic that is synchronously toggled. This logic can be identified
by searching the netlist graph for nodes with high fanout. In Krautter et al. [27],
the detection mechanisms were verified on more than 40 benign benchmark designs
from different collections, and none was flagged as a potential attacker design. An
obvious limitation is the constant need for updated signatures to counteract new
methods employed by attackers to hide sensors or fault injection logic.
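A minimal sketch of such a check, operating on a netlist already flattened into a fanout graph (the graph encoding and the fanout threshold are illustrative choices, not the actual tooling of [27, 29]):

```python
def scan_netlist(fanout, fanout_limit=1000):
    """Flag nodes driving very many sinks (candidate synchronously toggled
    power wasters) and combinational loops (candidate self-oscillators or
    timing-violating sensors)."""
    flags = []
    for node, sinks in fanout.items():
        if len(sinks) >= fanout_limit:
            flags.append(("high_fanout", node))
    color = {n: 0 for n in fanout}   # 0 = unvisited, 1 = on stack, 2 = done
    for start in fanout:
        if color[start]:
            continue
        stack = [(start, iter(fanout[start]))]
        color[start] = 1
        while stack:
            node, sinks_iter = stack[-1]
            advanced = False
            for nxt in sinks_iter:
                if color.get(nxt, 2) == 1:           # back edge: a loop
                    flags.append(("comb_loop", nxt))
                elif color.get(nxt, 2) == 0:
                    color[nxt] = 1
                    stack.append((nxt, iter(fanout[nxt])))
                    advanced = True
                    break
            if not advanced:
                color[node] = 2
                stack.pop()
    return flags

# Toy netlist: a 3-LUT ring oscillator plus one enable net driving 1500 LUTs.
netlist = {"en": ["lut%d" % i for i in range(1500)],
           "a": ["b"], "b": ["c"], "c": ["a"]}
report = scan_netlist(netlist)
```

Real checkers additionally run timing analysis and compare against known-malicious signatures, which is exactly why the signature database must be kept up to date.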

4.5.3 Online On-Chip Countermeasures

Online countermeasures have generally been developed with the detection and
mitigation of fault attacks in mind [30, 37, 40, 42]. In both Mirzargar et al. [37] and

Provelengios et al. [42], methods for locating an attacker’s design through a sensor
network are proposed. For attack mitigation, Provelengios et al. [42] and Nassar et
al. [40] suggest clock gating and the quick disabling of interconnects, respectively,
to stop excessive power consumption from impeding other tenants on the same
chip. The latter has the potential to stop attacks that rely on internal clocking, for
instance, through self-oscillating clocks in the attacker FPGA region. Alternatively,
the operating frequency of the victim design can be automatically lowered when a
critical voltage undershoot is detected, as proposed in Luo et al. [30]. This dynamic
frequency scaling, however, is intrusive to non-malicious tenants on the FPGA.
Other works have considered leveraging the advantages of reconfigurable hard-
ware for side-channel countermeasures [3, 19, 25, 52]. One approach [25] leverages
the availability of on-chip voltage sensors for adaptive noise generation to cancel out
fluctuations caused by an encryption module through ROs. By worsening the signal-
to-noise ratio (SNR) for an internal attacker, the number of measurements required
for key recovery can be significantly increased. In Bete et al. [3], a countermeasure
based on implementation variety is proposed, in which different implementations
of the AES S-Box are randomly interchanged at runtime, increasing the difficulty
for a side-channel attacker due to the randomized power profile. A generalization of
programmable ROs as a countermeasure against both side-channel and fault attacks
is presented in Yao et al. [52]. On the one hand, ROs can be employed as sensors for
detecting fault attacks. On the other hand, they serve as random noise generators to
hide data-dependent leakage against side-channel attacks. Lastly, randomization of
the clock driving the victim circuit using a delay line is proposed in Jayasinghe et
al. [19], which can induce noise at very low overhead.
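A toy closed-loop model of the active-fence idea [25] illustrates the mechanism; the constants, the same-cycle sensor response, and the Gaussian sensor noise are all illustrative simplifications. The fence enables ROs in proportion to the complement of the sensed victim activity, flattening the combined draw and thus the attacker-visible variance.

```python
import random

def fence_experiment(n=2000, gain=0.9, sensor_noise=0.3, seed=7):
    """Return the variance of the victim's current draw without and with
    an idealized RO fence that complements the sensed draw each cycle."""
    random.seed(seed)
    raw, fenced = [], []
    for _ in range(n):
        victim = float(random.choice([0, 1, 2, 3]))    # data-dependent draw
        sensed = victim + random.gauss(0.0, sensor_noise)
        fence = gain * (3.0 - sensed)                  # complementary RO count
        raw.append(victim)
        fenced.append(victim + fence)
    var = lambda xs: sum((x - sum(xs) / len(xs)) ** 2 for x in xs) / len(xs)
    return var(raw), var(fenced)

raw_var, fenced_var = fence_experiment()  # the fence shrinks the variance
```

Lower variance of the data-dependent component directly worsens the attacker's SNR, which is what raises the number of traces needed for key recovery.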

4.5.4 Overview of Countermeasures

The presented mitigation strategies can prevent many of the known attacks.
However, as the underlying problem for multi-tenant FPGAs is the shared PDN
across security boundaries, future development will most likely lead to more
advanced attacks, requiring constant adaptation to the continuously changing threat.
In Table 4.2, we provide a comprehensive overview of the recently developed attacks
and countermeasures in FPGA-based systems.

4.6 Summary

FPGA use in datacenters and multi-tenant systems has gained popularity. Typically,
such systems have multiple privilege levels and yet are connected to the same
shared power supply. In these systems, the FPGA fabric can become an aggressor
and launch fault and side-channel attacks on other parts of the system, including
other areas of the FPGA itself. In the past, threats at the physical level were only

Table 4.2 A comprehensive overview of novel electrical-level attacks on FPGA-based systems
and associated countermeasures. If an attack can be mitigated by a specific countermeasure, it is
marked with “✓” in the respective column. In [10], a covert channel has been demonstrated,
indicated with a “✓*” if the countermeasure would be able to mitigate a side-channel attack. If
an attack evades mitigation, it is marked with “✗”. A “–” signifies that the countermeasure does not
target the respective attack category

Attack | Bitstream checking [27] | Active fences [25] | NNcrypt [28] | FPGADefender [29] | SPREAD [3] | Frequency scaling [30] | LoopBreaker [40] | Monitoring [37, 42] | UCloD [19] | PRO [52]
Inside job [47] | ✓ | ✓ | ✓ | – | ✓ | – | – | – | ✓ | ✓
Inter-chip side channels [48] | ✓ | ✓ | ✓ | – | ✓ | – | – | – | ✓ | ✓
Cross-SLR covert channels [10] | ✓ | ✓ | ✓* | – | ✓* | – | – | – | ✓* | ✓
C3APSULe [11] | ✓ | – | – | – | – | – | – | – | – | –
SCA on neural networks [39] | ✓ | ✓ | ✓ | – | ✗ | – | – | – | ✓ | ✓
FPGAhammer [26] | ✓ | – | – | ✓ | – | ✓ | ✓ | ✓ | – | ✓
Sequential oscillator faults [49] | ✓ | – | – | ✓ | – | ✓ | ✓ | ✓ | – | ✓
BRAM collision faults [2] | ✗ | – | – | ✗ | – | ✓ | ✓ | ✓ | – | ✓
Glitch amplification faults [34] | ✗ | – | – | ✓ | – | ✓ | ✓ | ✓ | – | ✓
Benign logic faults [23, 41] | ✗ | – | – | ✗ | – | ✓ | ✓ | ✓ | – | ✓

considered in the context of an adversary with direct physical access to the device
under attack. With the introduction of remote physical and electrical-level attacks
on FPGAs, this common security assumption no longer holds. Multi-tenant FPGA
attacks thus provide an important eye-opener for the broader picture of system
security in all remotely accessible systems.

Countermeasures have improved rapidly in the last few years. However, an easy
solution for the underlying problem of information leakage between circuits through
a shared power supply has not been found, necessitating future research.

Acknowledgments The work described in this chapter has been supported in part by the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation) through the project 456967092
(SecFShare).

References

1. Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/instance-types/f1/.
2. Alam, M. M., Tajik, S., Ganji, F., Tehranipoor, M., & Forte, D. (2019). RAM-Jam: remote
temperature and voltage fault attack on FPGAs using memory collisions. In Workshop on
Fault Diagnosis and Tolerance in Cryptography (FDTC) (pp. 48–55). https://doi.org/10.1109/
FDTC.2019.00015.
3. Bete, N., Saqib, F., Patel, C., Robucci, R., & Plusquellic, J. (2019). Side-channel power
resistance for encryption algorithms using dynamic partial reconfiguration (SPREAD). In
International Symposium on Hardware Oriented Security and Trust (HOST).
4. Boneh, D., DeMillo, R. A., & Lipton, R. J. (1997). On the importance of checking crypto-
graphic protocols for faults. In Advances in Cryptology — EUROCRYPT ’97 (pp. 37–51).
Berlin, Heidelberg: Springer. https://doi.org/10.1007/3-540-69053-0_4.
5. Chen, H., Chen, Y., & Summerville, D. H. (2010). A survey on the application of FPGAs for
network infrastructure security. IEEE Communications Surveys & Tutorials, 13(4), 541–561.
6. Cnudde, T. D., Ender, M., & Moradi, A. (2018). Hardware masking, revisited. IACR
Transactions on Cryptographic Hardware and Embedded Systems (TCHES), 2018(2), 123–
148.
7. De Schryver, C. (2015). FPGA based accelerators for financial applications (vol. 10).
Springer.
8. Eisenbarth, T., Kasper, T., Moradi, A., Paar, C., Salmasizadeh, M., & Shalmani, M. T. M.
(2008). On the power of power analysis in the real world: a complete break of the KeeLoq
code hopping scheme. In D. Wagner (Ed.), Advances in cryptology – CRYPTO 2008 (pp.
203–220). Berlin, Heidelberg: Springer.
9. Fahmy, S. A., Vipin, K., & Shreejith, S. (2015). Virtualized FPGA accelerators for efficient
cloud computing. In CloudCom (pp. 430–435). IEEE Computer Society.
10. Giechaskiel, I., Rasmussen, K., & Szefer, J. (2019). Reading between the dies: cross-SLR
covert channels on multi-tenant cloud FPGAs. In IEEE International Conference on Computer
Design (ICCD).
11. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2020). C3 APSULe: cross-FPGA covert-
channel attacks through power supply unit leakage. In Symposium on Security and Privacy
(S&P) (pp. 1728–1741). IEEE. https://doi.org/10.1109/SP40000.2020.00070.
12. Glamočanin, O., Coulon, L., Regazzoni, F., & Stojilović, M. (2020). Are cloud FPGAs really
vulnerable to power analysis attacks? In Proceedings of Design, Automation & Test in Europe
(DATE) (pp. 1007–1010). IEEE.
13. Gnad, D. R. E., Oboril, F., Kiamehr, S., & Tahoori, M. B. (2018). An experimental evaluation
and analysis of transient voltage fluctuations in FPGAs. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 26(10), 1817–1830. https://doi.org/10.1109/TVLSI.2018.
2848460.
14. Gnad, D. R. E., Oboril, F., & Tahoori, M. B. (2017). Voltage drop-based fault attacks on FPGAs
using valid bitstreams. In Field Programmable Logic and Applications (FPL) (pp. 1–7). IEEE.
https://doi.org/10.23919/fpl.2017.8056840.

15. Gnad, D. R. E., Rapp, S., Krautter, J., & Tahoori, M. B. (2018). Checking for electrical level
security threats in bitstreams for multi-tenant FPGAs. In International Conference on Field-
Programmable Technology (ICFPT). Naha, Japan: IEEE.
16. Gnad, D. R. E., Schellenberg, F., Krautter, J., Moradi, A., & Tahoori, M. B. (2020). Remote
electrical-level security threats to multi-tenant FPGAs. IEEE Design Test. https://doi.org/10.
1109/MDAT.2020.2968248.
17. Huffmire, T., Brotherton, B., Wang, G., Sherwood, T., Kastner, R., Levin, T., Nguyen, T., &
Irvine, C. (2007). Moats and drawbridges: an isolation primitive for reconfigurable hardware
based systems. In Symposium on Security and Privacy (S&P). IEEE.
18. Intel Corporation. (2017). Intel FPGAs Power Acceleration-as-a-Service for Alibaba Cloud
| Intel Newsroom. https://newsroom.intel.com/news/intel-fpgas-power-acceleration-as-a-
service-alibaba-cloud.
19. Jayasinghe, D., Ignjatovic, A., & Parameswaran, S. (2021). UCloD: small clock delays to
mitigate remote power analysis attacks. IEEE Access, 9, 108,411–108,425. https://doi.org/
10.1109/ACCESS.2021.3100618.
20. Kamoun, N., Bossuet, L., & Ghazel, A. (2009). Correlated power noise generator as a low
cost DPA countermeasures to secure hardware AES cipher. In: International Conference on
Signals, Circuits and Systems (SCS). IEEE.
21. Khawaja, A., Landgraf, J., Prakash, R., Wei, M., Schkufza, E., & Rossbach, C. J. (2018).
Sharing, protection, and compatibility for reconfigurable fabric with AmorphOS. In USENIX
Symposium on Operating Systems Design and Implementation (OSDI) (pp. 107–127).
22. Kocher, P., Jaffe, J., & Jun, B. (1999). Differential power analysis. In Advances in Cryptology
— CRYPTO’ 99 (pp. 388–397). Berlin, Heidelberg: Springer. https://doi.org/10.1007/3-540-
48405-1_25.
23. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2021). Remote and stealthy fault attacks on
virtualized FPGAs. In Proceedings of Design, Automation & Test in Europe (DATE) (pp.
1632–1637). https://doi.org/10.23919/DATE51398.2021.9474140.
24. Krautter, J., Gnad, D., & Tahoori, M. (2020). CPAmap: On the complexity of secure FPGA
virtualization, multi-tenancy, and physical design. IACR Transactions on Cryptographic
Hardware and Embedded Systems, 2020(3), 121–146. https://doi.org/10.13154/tches.v2020.
i3.121-146.
25. Krautter, J., Gnad, D. R. E., Schellenberg, F., Moradi, A., & Tahoori, M. B. (2019). Active
fences against voltage-based side channels in multi-tenant FPGAs. In International Conference
on Computer-Aided Design (ICCAD). ACM.
26. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2018). FPGAhammer: remote voltage fault
attacks on shared FPGAs, suitable for DFA on AES. IACR Transactions on Cryptographic
Hardware and Embedded Systems (TCHES), 2018(3), 44–68.
27. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2019). Mitigating electrical-level attacks
towards secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable
Technology and Systems (TRETS), 12(3). https://doi.org/10.1145/3328222.
28. Krautter, J., & Tahoori, M. B. (2021). Neural networks as a side-channel countermeasure:
challenges and opportunities. In Symposium on VLSI (ISVLSI) (pp. 272–277). IEEE Computer
Society. https://doi.org/10.1109/ISVLSI51109.2021.00057.
29. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems, 13(3). https://doi.org/10.1145/3402937.
30. Luo, Y., & Xu, X. (2020). A quantitative defense framework against power attacks on multi-
tenant FPGA. In International Conference On Computer Aided Design (ICCAD) (pp. 1–4).
IEEE/ACM.
31. Mahmoud, D., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant FPGAs.
In Proceedings of Design, Automation & Test in Europe (DATE) (pp. 1745–1750). IEEE.
32. Malkin, T. G., Standaert, F. X., & Yung, M. (2006). A comparative cost/security analysis of
fault attack countermeasures. In Fault Diagnosis and Tolerance in Cryptography (FDTC) (pp.
159–172). Berlin, Heidelberg: Springer.

33. Masle, A. L., & Luk, W. (2012). Detecting power attacks on reconfigurable hardware. In Field
Programmable Logic and Applications (FPL) (pp. 14–19). IEEE. https://doi.org/10.1109/FPL.
2012.6339235.
34. Matas, K., La, T. M., Pham, K. D., & Koch, D. (2020). Power-hammering through Glitch
amplification – attacks and mitigation. In International Symposium on Field-Programmable
Custom Computing Machines (FCCM) (pp. 65–69). https://doi.org/10.1109/FCCM48280.
2020.00018.
35. McEvoy, R. P., Murphy, C. C., Marnane, W. P., & Tunstall, M. (2009). Isolated WDDL:
A hiding countermeasure for differential power analysis on FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 2(1).
36. Messerges, T. S., Dabbish, E. A., & Sloan, R. H. (2002). Examining smart-card security under
the threat of power analysis attacks. Transactions on Computers, 51(5), 541–552.
37. Mirzargar, S. S., Renault, G., Guerrieri, A., & Stojilović, M. (2020). Nonintrusive and adaptive
monitoring for locating voltage attacks in virtualized FPGAs. In International Conference
on Field-Programmable Technology (ICFPT) (pp. 288–289). IEEE. https://doi.org/10.1109/
ICFPT51103.2020.00050.
38. Moini, S., Li, X., Stanwicks, P., Provelengios, G., Burleson, W., Tessier, R., & Holcomb,
D. (2020). Understanding and comparing the capabilities of on-chip voltage sensors against
remote power attacks on FPGAs. In Midwest Symposium on Circuits and Systems (MWSCAS)
(pp. 941–944). IEEE. https://doi.org/10.1109/MWSCAS48704.2020.9184683.
39. Moini, S., Tian, S., Szefer, J., Holcomb, D., & Tessier, R. (2021). Remote power side-channel
attacks on BNN accelerators in FPGAs. In Proceedings of Design, Automation & Test in
Europe (DATE). IEEE.
40. Nassar, H., AlZughbi, H., Gnad, D., Bauer, L., Tahoori, M., & Henkel, J. (2021). LoopBreaker:
disabling interconnects to mitigate voltage-based attacks in multi-tenant FPGAs. In Interna-
tional Conference on Computer-Aided Design (ICCAD). IEEE/ACM.
41. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power wasting circuits for cloud FPGA
attacks. In Field Programmable Logic and Applications (FPL) (pp. 231–235). https://doi.org/
10.1109/FPL50879.2020.00046.
42. Provelengios, G., Holcomb, D., & Tessier, R. (2021). Mitigating voltage attacks in multi-tenant
FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 14(2), 1–24.
43. Putnam, A., Caulfield, A. M., Chung, E. S., Chiou, D., Constantinides, K., Demme, J.,
Esmaeilzadeh, H., Fowers, J., Gopal, G. P., Gray, J., Haselman, M., Hauck, S., Heil, S.,
Hormati, A., Kim, J. Y., Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J.,
Xiao, P. Y., & Burger, D. (2014). A reconfigurable fabric for accelerating large-scale datacenter
services. In International Symposium on Computer Architecture (ISCA), ISCA ’14 (pp. 13–24).
Piscataway, NJ, USA: IEEE Press. http://dl.acm.org/citation.cfm?id=2665671.2665678.
44. Ramesh, C., Patil, S. B., Dhanuskodi, S. N., Provelengios, G., Pillement, S., Holcomb, D.,
& Tessier, R. (2018). FPGA side channel attacks without physical access. In International
Symposium on Field-Programmable Custom Computing Machines (FCCM) (pp. paper–116).
IEEE.
45. Rockett, L., Patel, D., Danziger, S., Cronquist, B., & Wang, J. (2007). Radiation hardened
FPGA technology for space applications. In Aerospace Conference (pp. 1–7). IEEE.
46. Sanaullah, A., Yang, C., Alexeev, Y., Yoshii, K., & Herbordt, M. C. (2018). Real-time data
analysis for medical diagnosis using FPGA-accelerated neural networks. BMC Bioinformatics,
19, 19–31.
47. Schellenberg, F., Gnad, D. R., Moradi, A., & Tahoori, M. B. (2018). An inside job: remote
power analysis attacks on FPGAs. In Proceedings of Design, Automation & Test in Europe
(DATE).
48. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). Remote inter-chip
power analysis side-channel attacks at board-level. In International Conference on Computer-
Aided Design (ICCAD) (pp. 1–7). IEEE/ACM. https://doi.org/10.1145/3240765.3240841.

49. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data center. Electronics Letters, 55(11),
640–642. https://doi.org/10.1049/el.2019.0163.
50. Tian, S., Moini, S., Wolnikowski, A., Holcomb, D., Tessier, R., & Szefer, J. (2021). Remote
power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In Proceedings of the
International Symposium on Field-Programmable Custom Computing Machines, FCCM.
51. Trimberger, S., & McNeil, S. (2017). Security of FPGAs in data centers. In International
Verification and Security Workshop (IVSW). IEEE Computer Society.
52. Yao, Y., Kiaei, P., Singh, R., Tajik, S., & Schaumont, P. (2021). Programmable RO (PRO):
a multipurpose countermeasure against side-channel and fault injection attack. Preprint.
arXiv:2106.13784.
53. Zeng, S., Dai, G., Sun, H., Zhong, K., Ge, G., Guo, K., Wang, Y., & Yang, H. (2020). Enabling
efficient and flexible FPGA virtualization for deep learning in the cloud. In International
Symposium on Field-Programmable Custom Computing Machines (FCCM) (pp. 102–110).
IEEE.
54. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In Symposium
on Security and Privacy (S&P) (pp. 805–820). IEEE. https://doi.org/10.1109/SP.2018.00049.
55. Zick, K. M., & Hayes, J. P. (2012). Low-cost sensing with ring oscillator arrays for healthier
reconfigurable systems. ACM Transactions on Reconfigurable Technology and Systems
(TRETS), 5(1), 1:1–1:26. https://doi.org/10.1145/2133352.2133353.
56. Zick, K. M., Srivastav, M., Zhang, W., & French, M. (2013). Sensing nanosecond-scale voltage
attacks and natural transients in FPGAs. In International Symposium on Field-Programmable
Gate Arrays (FPGA) (pp. 101–104). New York, NY, USA: ACM. https://doi.org/10.1145/2435264.2435283.
Chapter 5
Practical Implementations of Remote
Power Side-Channel and Fault-Injection
Attacks on Multitenant FPGAs

Dina G. Mahmoud, Ognjen Glamočanin, Francesco Regazzoni,
and Mirjana Stojilović

5.1 Electrical-Level Vulnerabilities of Remote FPGAs

Given their highly parallel and flexible architecture, field-programmable gate
arrays (FPGAs) are particularly well suited for accelerating emerging compute-
intensive applications, such as artificial intelligence, networking, genomics, big data
analytics, and image and video processing [8, 25, 38, 64]. As such, FPGAs are being
deployed in modern heterogeneous computing systems, including cyber-physical
devices and, more recently, in data centers and the commercial cloud [7, 11, 13, 28].
With FPGAs in the cloud, new virtualization methods and hypervisor solutions for
efficient provisioning of FPGA resources are being developed [8], while at the same
time, the security risks associated with exposing fine-granularity FPGA hardware to
remote users are being carefully investigated [3, 45]. The key result of the related
security research in the past few years is the discovery that remote access to cloud
FPGAs presents an entirely new attack surface: that of remotely executed electrical-
level attacks that leverage shared power-delivery networks (PDNs). Several such
attacks have been investigated in the literature: denial-of-service (DoS) attacks,
with the goal of shutting down the remote FPGA, thus compromising the system’s
availability [35]; fault-injection attacks, which attempt to disrupt the normal opera-
tion of the FPGA circuit, hence impacting the system’s integrity [33, 42, 45, 56];
and power side-channel attacks, where secret information inadvertently leaked

D. G. Mahmoud · O. Glamočanin · M. Stojilović
EPFL, Lausanne, Switzerland
e-mail: dina.mahmoud@epfl.ch; ognjen.glamocanin@epfl.ch; mirjana.stojilovic@epfl.ch
F. Regazzoni
University of Amsterdam, Amsterdam, The Netherlands
Università della Svizzera italiana, Lugano, Switzerland
e-mail: f.regazzoni@uva.nl; regazzoni@alari.ch

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 101
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_5

via the power- and ground-delivery network is retrieved, breaching the system’s
confidentiality [16, 52, 66, 71].
Electrical-level attacks against FPGAs are not new. Given physical access to a
device, an adversary can manipulate the external voltage source (e.g., lower the
voltage to a new level or introduce transient changes) or disrupt the clock (e.g.,
inject glitches) to cause a fault in the operation of the target circuit [45, 74].
Measuring the voltage and current variations on the power supply rails or sensing
the electromagnetic (EM) emanations is also feasible, given an oscilloscope and
appropriate probes. The obtained power or EM side-channel traces can then be
used to perform side-channel analysis (SCA) and recover a secret [45, 54] (e.g.,
a secret key of a cryptographic circuit). In a cloud FPGA scenario, however, an
adversary has no physical access to the device, and yet, we know that a remotely
executed electrical-level attack is possible. How is that? The answer is two-fold:
first, because, at least in theory, one can design almost arbitrary circuits using the
programmable FPGA resources; second, because the power-delivery network is
shared and, consequently, voltage and current variations created by an FPGA circuit
spread beyond the circuit’s boundaries.
With fine-grained FPGA logic and routing, one can build reconfigurable circuits
that measure the variations of on-chip delays [73]. Knowing that the delays correlate
with the supply voltage [37], these circuits effectively act as on-chip voltage sensors.
Assuming the adversary and the victim are coupled via the shared PDN, these
voltage sensors enable recording power side-channel traces and, ultimately, a remote
power side-channel attack [16, 71]. Alternatively, one can build power wasters,
i.e., circuits that toggle at high frequency and draw considerable current [36].
If power wasters are deployed in large numbers and carefully controlled, their
activity causes voltage transients that, because of the shared PDN, may disturb
the operation of other circuits supplied by the same voltage source. The result is
a remote undervolting attack, with the potential to become a denial-of-service [19]
or fault-injection (FI) attack [33, 41, 42].
Signal coupling via the power-delivery network is illustrated in Fig. 5.1. On the
FPGA die, circuits are coupled via the power- and ground-distribution mesh and the
unavoidable parasitic components (e.g., capacitance or inductance). The coupling
extends to the package and the printed circuit board (PCB), where the voltage
sources supplying the FPGA and other onboard components are typically deployed.

Fig. 5.1 Illustration of the power-delivery network sharing across several levels: the FPGA die,
the chip package, and the printed circuit board

The coupling reaches the external voltage sources as well (e.g., the common power
supply of an entire server rack). How far the signal created by on-chip activity (a
voltage drop due to power wasters’ activity or side-channel information) propagates
depends on the quality of the coupling or the lack thereof. A common technique
to prevent voltage disturbances from propagating far is to provision decoupling
capacitors on all levels (board, package, chip, etc.). These capacitors provide a
response to transient current demands, reducing voltage variations in the expected
operating frequency ranges. However, recent research demonstrates that decoupling
capacitors are insufficient, given that even the data center-scale FPGAs with high-
quality PDNs are vulnerable to remote attacks [16]. One particular FPGA use case
stands out as the most affected: FPGA multitenancy, where the victim and the
adversary simultaneously share the same FPGA and, being close to one another,
are therefore strongly coupled.
In the remainder of this chapter, we first describe the threat models of the remote
power side-channel and fault-injection attacks, commonly considered in the relevant
literature. Then, we elaborate in detail on the two attack variants. For each of them,
we cover the FPGA circuits that are the key enablers of the attack and show the
results of the attacks on a range of FPGA boards, including data center acceleration
cards. We make the hardware description language implementation of the FPGA
circuits (a selection of sensors and power wasters) available as open source [18].

5.2 Attack Scenarios

The common threat model of remote electrical-level attacks on FPGAs assumes
multitenancy [16, 33, 41, 45], i.e., a data center or cloud FPGA spatially shared
between at least two users. This type of sharing involves partitioning the FPGA
resources into several physically and logically isolated regions, where each region
belongs, for the time required to run the workload, to the corresponding tenant,
and all tenant workloads run simultaneously within their regions. Spatial sharing
differs from temporal sharing, which provides multiple tenants access to the same
resources but at different times. For a commercial cloud setting and potentially
other shared FPGA settings, some resources are reserved for a static region called
a shell, deployed by the cloud service provider (CSP) to control access to the
external interfaces, such as PCI Express or the off-chip memory, as well as the
tenants. The tenants are free to implement almost any FPGA circuit (in practice,
some CSPs do not allow circuits containing combinational loops [13, 57]) and set
placement and routing constraints for their designs. Most importantly, the FPGA
tenant applications share the on-chip PDN and, to some extent, the PDN in the
package, board (server), and rack. The high-level view of the threat model is shown
in Fig. 5.2.
Considering a power side-channel attack scenario, the adversary in the assigned
FPGA region implements one or more on-chip voltage-fluctuation sensors, the
control logic, and the on-chip buffers for saving the side-channel traces. The

Fig. 5.2 Threat model

adversary has the capability of offloading the sensor traces for off-chip analysis.
The victim, on the other side, is performing secret computation. For example, the
victim could be encrypting data using a secret key and sending the ciphertexts over
a public channel that can be observed by the adversary. The goal of the adversary
would be to retrieve the secret information, such as the secret key. In an alternative
scenario, an adversary could aim to infer the architecture or the type of the victim
circuit (e.g., steal the proprietary architecture of a neural network model, discover
what type of operation the victim is executing, etc.).
In the context of a fault-injection attack, the adversary aims to leverage the
shared PDN to propagate electrical-level effects to the regions used by other
tenants. Specifically, the attacker attempts to lower the voltage of the chip with
the help of the power-wasting circuits. The lowered voltage affects the delays
within the victim circuit, potentially compromising its operation. Depending on the
circuit characteristics and the delays of the hardware employed by other tenants,
computation results can become faulty. The adversary’s goal may be to render
the victim’s calculations incorrect or consume enough power to reset the FPGA
(resulting in a DoS attack, affecting both the victim and the cloud service provider).
In a more interesting scenario, the adversary may want to gain information about
the victim, for example, secret information it may be processing (e.g., encryption
key). To extract information from an injected fault, the adversary requires the
ability to observe some of the output generated by the victim, i.e., to estimate
the effect of the fault on the computation. Hence, the fault-injection threat model
often assumes that the adversary can access the victim’s output. For instance, the
target circuit may be offering its function as a service to other parties, including
the attacker. In this case, the malicious party supplies an input to the victim and
receives the corresponding output. Alternatively, the victim may send the output
over a communication channel the adversary can access directly or indirectly; for
example, if the victim encrypts the data before sending it out, it may use a shared or
publicly accessible communication channel.

5.3 Remote Power Side-Channel Attacks

This section focuses on power side-channel attacks on remote multitenant FPGAs,
whose goals are to monitor the on-chip voltage fluctuations caused by the victim's
activity and then process the collected data to extract secret information and
compromise the system’s confidentiality. First, we introduce two implementations
of FPGA on-chip voltage-drop sensors suitable for sensing nanosecond-scale
supply voltage variations. Then, the basics of power side-channel analysis and
the metrics commonly used for evaluating the success of an attack are given.
The section demonstrates successful attacks on several target FPGA devices,
including a commercial cloud instance, and analyzes the results. Finally, some of
the countermeasures proposed in the literature are discussed.

5.3.1 FPGA Voltage Sensors

On-chip sensors suitable for remote power analysis attacks typically fall into two
categories: time-to-digital converters (TDCs) and frequency counters. The primary
component in TDCs is a tapped delay line. In frequency counters, the primary
component is a ring oscillator (RO). The underlying working principle is common:
instead of measuring the supply voltage variations directly, these sensors measure
delay variations, which correlate with voltage fluctuations [21, 22]. The power
supply voltage is the medium that carries the side-channel information.
RO-based sensors work on the principle of measuring the frequency (by means
of a counter) of a fast oscillator. Such oscillators are commonly implemented with
a single look-up table (LUT) acting as an inverter, closed in a loop. They have a
small footprint, are easy to deploy, and are portable. Yet, frequency counters require
a long measurement time, reducing sensor sensitivity and making them suitable for
capturing slow-changing signals only. Some CSPs (e.g., Amazon AWS) check user
designs for combinational loops and prevent such circuits from being deployed on
their cloud [4, 36, 57].
Compared to RO-based sensors, TDCs can sense nanosecond-scale voltage
variations. The suitability of FPGA TDCs for measuring on-chip voltage variations
and natural transients in FPGAs was first investigated by Zick et al. a decade
ago [73]. However, it was only a few years later, after the first work on remote
undervolting attacks on multitenant FPGAs, that TDC sensors regained the attention
of researchers [16, 21, 23, 50, 51, 63]. A typical TDC implementation is illustrated
in Fig. 5.3. The sensor consists of three parts:
• The first part is a tapped delay line, shown as a simple chain of multiplexers.
It is commonly referred to as an observable delay line and implemented using
fast carry propagation logic and dedicated routing. For optimal sensor sensitivity,
strict placement constraints are necessary: the delay line must be properly formed
by chaining the carry output of one FPGA slice (where a slice contains four or

Fig. 5.3 Time-to-digital converter (TDC) implemented using FPGA logic and routing, suitable for
measuring fast on-chip supply voltage variations

Fig. 5.4 Routing delay sensor (RDS), where the observable tapped delay line is replaced by
routing resources (RRs) [65]

eight registers, depending on the FPGA family) to the carry input of the next one.
The occupied slices should be constrained to one vertical column of the FPGA to
harness the dedicated wiring and minimize inter-slice delays. Every carry output
must drive its attached flip-flop (FF) residing in the same slice.
• The second part of the sensor is an output register used to periodically save the
state of the delay line (i.e., one sensor reading or a sample). The value in the
output register can be converted into the numerical value of one sensor sample
using a thermometer code [16, 63, 68] or, as is more common in recent work, by
taking the Hamming weight of the bits in the output register [17, 50, 51, 73].
• The third and the last part is also a chain of delay elements, but with coarser
granularity (e.g., look-up tables, phase-locked loops, or IDELAY adjustable input
delay elements), which adjusts the phase shift between the sampling clock of the
output register (CLKIN in Fig. 5.3) and the clock driving the delay line (PIN
in Fig. 5.3). In most practical implementations, these two clocks have the same
frequency, but the phase shift and the length of the tapped delay line must be
adjusted. The adjustment process is called calibration, as described shortly.
If carry-chain logic is not exposed to the user or a higher sensor sensitivity is
required, a routing delay sensor (RDS) could be implemented instead [65]. The
sensor architecture is illustrated in Fig. 5.4. The main difference between an RDS
and a TDC is the absence of the tapped delay line. The clock that drives the tapped

delay line in a TDC is instead routed without constraints to the output registers in
the RDS. Additionally, the locations of the FFs in the output registers are decided
by the FPGA placer, without tight constraints, in the interest of helping the FPGA
router find good-quality routes to the FFs. The absence of constraints allows the
delay difference between two sensor bits to become even lower than in the case
of a TDC, further improving sensor sensitivity. Finally, once the sensor sample is
computed as the Hamming weight of the output register, the issue of delays to every
output bit being different and in no predictable order becomes irrelevant.
In the absence of on-chip activity, the TDC output typically is constant (modulo
background noise) and determined by three parameters: the clock frequency, the
initial delay (i.e., the phase shift), and the length of the observable delay line
(parameter N in Figs. 5.3 and 5.4). For a given clock frequency, the initial delay
and the parameter N are chosen so that the output register captures a single clock
transition in every clock cycle. In other words, the output register should always be
filled with a sequence of ones followed by a sequence of zeros, with a possibly
imperfect transition; the location of the FF holding the transition corresponds
to the depth of propagation of the clock signal through the delay line. Once the on-
chip voltage starts fluctuating due to the activity of the victim circuit, the delays
of the elements in the sensor change, and consequently, so does the sensor output.
The sensor is well-dimensioned and calibrated (i.e., the initial delay and the length
of the delay line are well-chosen) if, for the entire duration of the measurement,
the Hamming weight of the output register lies in the range 0 < HW(O) =
HW(O_0, O_1, ..., O_{N-1}) < N and only one clock edge is captured. As calibration
is a lengthy process of trial and error, it is convenient to automate it.
One practical approach for automating sensor calibration is as follows. First,
let us assume the initial delay is implemented as a sequence of two delay lines,
as illustrated in Fig. 5.5: one composed of fine calibration slices (e.g., carry-chain
logic) and one composed of coarse calibration slices (e.g., LUTs) [20]. Second,
the clock can enter the initial delay line in as many locations as possible (e.g.,
with the help of multiplexers). For a given maximum number of elements in the
fine and coarse calibration slices, the calibration can be performed, as described in

Fig. 5.5 Initial delay line implementation, convenient for automated calibration

Algorithm 1 Calibration algorithm

Input: L_IDC, maximum number of coarse elements
Input: L_IDF, maximum number of fine elements
Input: n, number of samples per trace
Input: N, observable delay line length (maximum sensor value)
Input: δ, calibration parameter
Input: N_traces, number of traces recorded at each calibration step
Output: IDC, IDF, number of coarse and fine elements, respectively

1:  procedure CALIBRATE(L_IDC, L_IDF, n, N, δ)
2:    for IDC_cnt from 1 to L_IDC do
3:      s_min ← N
4:      IDC ← IDC_cnt; IDF ← 1
5:      SEND_CALIBRATION(IDC, IDF)
6:      for trace from 1 to N_traces do
7:        (s_1, s_2, ..., s_n) ← RECORD_TRACE()
8:        s_min ← MIN(s_min, MIN(s_1, s_2, ..., s_n))
9:      end for
10:     if s_min ≠ N then
11:       break
12:     end if
13:   end for
14:   for IDC_cnt from IDC to L_IDC do
15:     for IDF_cnt from 1 to L_IDF do
16:       s_max ← 0
17:       IDC ← IDC_cnt; IDF ← IDF_cnt
18:       SEND_CALIBRATION(IDC, IDF)
19:       for trace from 1 to N_traces do
20:         (s_1, s_2, ..., s_n) ← RECORD_TRACE()
21:         s_max ← MAX(s_max, MAX(s_1, s_2, ..., s_n))
22:       end for
23:       if s_max ≤ δ then
24:         return IDC, IDF
25:       end if
26:     end for
27:   end for
28:   return failure
29: end procedure

Algorithm 1. The process, fully applicable to calibrating an RDS sensor as well,
involves setting the number of coarse and fine calibration slices through which the
clock should propagate, recording power side-channel traces, and then analyzing
the Hamming weight of all the samples s in the recorded trace. Once the maximum
Hamming weight of a sensor sample reaches the predefined threshold (the parameter
δ, often set to N/2, half of the observable delay line length), the calibration
is considered completed, and the desired number of coarse and fine elements is
found. It should be noted, though, that from the side-channel attack perspective, it
is crucial to correctly record (without any saturation in sensor output) the events
corresponding to victim activity (i.e., voltage drops). For a given length of the

observable delay line and the sensor clock frequency, the threshold δ should be
set with the above constraint in mind.
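Host-side control of the sweep in Algorithm 1 might look like the sketch below. Here, `send_calibration` and `record_trace` are hypothetical stand-ins for the host-to-sensor communication primitives, and the toy delay model at the bottom exists only to make the example self-contained and runnable.

```python
def calibrate(send_calibration, record_trace, l_idc, l_idf, n_traces, n, delta):
    """Two-phase sweep in the spirit of Algorithm 1: first find the coarse
    element count at which the clock edge enters the observable delay line
    (sample drops below N), then add fine elements until the maximum sample
    is at most delta (e.g., N // 2), leaving headroom for voltage drops."""
    idc_start = None
    for idc in range(1, l_idc + 1):            # phase 1: coarse sweep
        send_calibration(idc, 1)
        s_min = min(min(record_trace()) for _ in range(n_traces))
        if s_min < n:                          # edge captured inside the line
            idc_start = idc
            break
    if idc_start is None:
        raise RuntimeError("calibration failed: line always saturated")
    for idc in range(idc_start, l_idc + 1):    # phase 2: fine sweep
        for idf in range(1, l_idf + 1):
            send_calibration(idc, idf)
            s_max = max(max(record_trace()) for _ in range(n_traces))
            if s_max <= delta:
                return idc, idf
    raise RuntimeError("calibration failed: threshold never reached")

# Toy sensor model: the sample shrinks as the initial delay grows,
# clipped to the range [0, N]. Purely illustrative.
N, DELTA = 32, 16
cfg = {}
def send_calibration(idc, idf):
    cfg["delay"] = 6 * idc + idf
def record_trace():
    return [max(0, min(N, 40 - cfg["delay"]))] * 8

print(calibrate(send_calibration, record_trace, 10, 20, 3, N, DELTA))  # (2, 12)
```

With the toy model, one coarse element saturates the line (sample = N), two capture the edge, and twelve fine elements then bring the maximum sample down to the δ = N/2 threshold.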

5.3.2 Power Side-Channel Analysis

Power side-channel attacks exploit data-dependent switching activity of logic cells
that directly impacts the power consumption of an integrated circuit [30]. Changes
in a circuit’s inputs lead to signal propagation through the logic, causing glitches
and transitions in the logic gate outputs that consume power. Consequently, the
power consumption correlates with the changes in the circuit’s inputs. In the
case of a cryptographic device, an attacker can infer the secret key by exploiting
this correlation. The attacker can use a simple, visual analysis of the power
measurements (simple power analysis) or more complex statistical methods such
as differential power analysis (DPA) [30] or correlation power analysis (CPA) [9].
Power side-channel attacks have long been considered a danger only for embedded
systems or smart cards because the adversary was required to have physical access
to the device to collect the power measurements needed to perform the attack.
Knowing that FPGA voltage sensors can be deployed remotely, the requirement
of physical access to the device under attack no longer holds.
The starting point of an attack is the acquisition of power side-channel traces
(often measured indirectly as voltage over a small resistor in series with the power
or the ground rail) of a device during execution of the victim’s computation.
For example, in the case of a cryptographic circuit, one trace corresponds to
measurements collected during one plaintext encryption. A trace consists of a finite
set of samples, measured at constant time intervals. The number of samples per
trace is determined by the sampling frequency of the measurement device (an
oscilloscope or a voltage sensor) and the operating frequency of the victim circuit.
Side-channel analysis is interested in a subset of these samples, specifically those
that leak information about the intermediate values of a cipher. Several techniques
exist for identifying the samples of interest, notably statistical hypothesis tests such
as t-test in the Test Vector Leakage Assessment (TVLA) framework [55].
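As a concrete instance of such a hypothesis test, a fixed-vs-random Welch's t-test in the TVLA style can be sketched as follows; the ±4.5 threshold is the conventional TVLA bound, and the traces below are synthetic.

```python
import statistics

def welch_t(xs, ys):
    """Welch's t-statistic between two sample sets at one time index."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    return (mx - my) / (vx / len(xs) + vy / len(ys)) ** 0.5

def leaky_samples(fixed_traces, random_traces, threshold=4.5):
    """Indices of trace samples whose |t| exceeds the TVLA threshold."""
    return [j for j in range(len(fixed_traces[0]))
            if abs(welch_t([t[j] for t in fixed_traces],
                           [t[j] for t in random_traces])) > threshold]

# Synthetic traces with two samples: sample 0 depends on the input class
# (fixed vs. random plaintexts), sample 1 does not.
fixed_traces  = [[10, 5], [10, 5], [11, 6], [11, 6]]
random_traces = [[5, 5], [5, 6], [6, 5], [6, 6]]
print(leaky_samples(fixed_traces, random_traces))  # [0]
```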
In addition to exploitable leakage, measured traces often contain noise. In fact,
attack success largely depends on the signal-to-noise ratio (SNR) of the collected
traces, i.e., the ratio of the variance of the exploitable signal and the noise.
Depending on the situation, adversaries can resort to various noninvasive methods
to reduce the noise in the measured traces [46]. One of the standard methods is
averaging, in which, for each plaintext, the attacker collects a number of traces and
performs a sample-wise average. Another method uses the filtering of traces, e.g.,
with low-pass filters that eliminate electronic noise at high frequencies. Once quality
traces are obtained, one can proceed to the attack.
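The two noise-handling steps just described, sample-wise averaging and an SNR estimate, can be sketched as follows. The SNR formulation used here (variance of the per-class means over the mean within-class variance) is one common choice, and the data are synthetic.

```python
import statistics

def average_traces(groups):
    """Sample-wise average of repeated traces; groups holds one list of
    traces per plaintext (all traces of equal length)."""
    return [[statistics.fmean(col) for col in zip(*g)] for g in groups]

def snr_per_sample(traces, labels):
    """SNR at each sample index: variance of the per-class means (signal)
    over the mean of the within-class variances (noise); labels[i] is the
    leakage class (e.g., an intermediate value) of trace i."""
    classes = sorted(set(labels))
    out = []
    for j in range(len(traces[0])):
        means, noises = [], []
        for c in classes:
            col = [t[j] for t, lab in zip(traces, labels) if lab == c]
            means.append(statistics.fmean(col))
            noises.append(statistics.pvariance(col))
        out.append(statistics.pvariance(means) / statistics.fmean(noises))
    return out

# Averaging two noisy repetitions of the same plaintext...
print(average_traces([[[1, 3], [3, 5]]]))                    # [[2.0, 4.0]]
# ...and SNR of a single-sample trace set with two leakage classes.
print(snr_per_sample([[0], [2], [10], [12]], [0, 0, 1, 1]))  # [25.0]
```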
In the remainder of this section, it is assumed that the victim runs the Advanced
Encryption Standard (AES) encryption with the 128-bit encryption key as the secret.
The adversary is performing a correlation power analysis attack. CPA uses a divide-

and-conquer approach to extract the secret key: instead of attacking the whole
128-bit key (search space of 2^128 possible solutions), it exploits the property that
AES operates on key bytes, which allows an attack on each key byte independently
and, most importantly, reduces the search space to a manageable size of 16 × 2^8
(i.e., 256 guesses for each of the 16 key bytes). During the attack, the adversary estimates the correlation between the
two datasets: the first is created by recording N power traces and corresponding
ciphertexts. The second dataset is the modeled leakage of the device for the N
observed ciphertexts. For each key byte, all possible 256 values (key guesses)
are considered. Each guess is then used to model power consumption for the
observed ciphertexts with the help of a leakage modeling function. Finally, for each
trace sample of interest, the correlation between the measured and the modeled
power consumption is computed using the expression for the Pearson correlation
coefficient [59]:
$$
r[k][j] = \frac{\operatorname{cov}\bigl(P[:][j],\, H[:][k]\bigr)}{\sigma_{P[:][j]}\; \sigma_{H[:][k]}}
        = \frac{\sum_{i=0}^{N-1} \bigl(H[i][k] - \overline{H[:][k]}\bigr)\bigl(P[i][j] - \overline{P[:][j]}\bigr)}
               {\sqrt{\sum_{i=0}^{N-1} \bigl(H[i][k] - \overline{H[:][k]}\bigr)^2}\,
                \sqrt{\sum_{i=0}^{N-1} \bigl(P[i][j] - \overline{P[:][j]}\bigr)^2}}. \qquad (5.1)
$$

Here, N is the number of power traces (corresponding to N ciphertexts), each
containing M samples. The matrix notation P[i][j] refers to the power sample j in
trace i, where 0 ≤ i < N, 0 ≤ j < M. The overline represents the average value. If
there are K values a key byte can take, H[i][k] is the power estimate corresponding
to the trace i and key-byte guess k, where 0 ≤ i < N, 0 ≤ k < K. Once the Pearson
correlation coefficient (i.e., the CPA metric) is computed, it shows how well the
modeled power of sample j correlates with the measured power. The CPA metric
can then be computed for all M trace samples and K key-byte guesses, resulting
in a K × M matrix of correlation coefficients. The key-byte guess that exhibits the
highest absolute value of the correlation r[k][j] for 0 ≤ k < K, 0 ≤ j < M
is then deemed to be the correct key byte. This is equivalent to ranking the key-
byte guesses based on the computed correlation (i.e., using the correlation as the
scoring metric) and considering the best-ranked key as the correct key byte. The
computation described here is illustrated in Fig. 5.6. To reconstruct the entire secret
key, the computation of the modeled power and the matrix of correlation coefficients
need to be repeated.
Power models (i.e., leakage functions) for estimating a device’s power con-
sumption exist [46]. In most cases, the power consumption of synchronous digital
devices depends on the toggling of register bits; the most commonly used power
model is the Hamming distance of register outputs. The Hamming distance model is
implementation-agnostic. More advanced power analysis attacks require more com-
plex models, often leveraging implementation details of the attacked devices [46].
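Putting Eq. (5.1) and the Hamming-distance model together, a CPA on a single key word can be sketched end to end. To keep the example self-contained and fast, a 4-bit S-box (PRESENT's) and simulated traces stand in for AES and the sensor; in a real attack, H would be built from the AES (inverse) S-box on last-round ciphertexts and P would come from recorded sensor samples.

```python
import random, statistics

# PRESENT cipher's 4-bit S-box, used here as a stand-in for the AES S-box.
SBOX = [0xC, 5, 6, 0xB, 9, 0, 0xA, 0xD, 3, 0xE, 0xF, 8, 4, 7, 1, 2]

def hd(a, b):
    """Hamming distance between two values."""
    return bin(a ^ b).count("1")

def pearson(xs, ys):
    """Pearson correlation coefficient, Eq. (5.1), for one (k, j) pair."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def cpa_key_nibble(plaintexts, traces, n_guesses=16):
    """Score every key guess by max |r[k][j]| over all trace samples and
    return (best guess, scores). HD model: a register toggles from the
    S-box input to the S-box output."""
    m = len(traces[0])
    scores = []
    for k in range(n_guesses):
        h = [hd(p ^ k, SBOX[p ^ k]) for p in plaintexts]
        scores.append(max(abs(pearson(h, [t[j] for t in traces]))
                          for j in range(m)))
    return max(range(n_guesses), key=scores.__getitem__), scores

# Simulated acquisition: sample 3 leaks the S-box register's Hamming
# distance; samples 0-2 are noise. Real traces would come from the sensor.
rng = random.Random(0)
SECRET = 11
plaintexts = list(range(16)) * 8
traces = [[rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1),
           hd(p ^ SECRET, SBOX[p ^ SECRET])] for p in plaintexts]
best, scores = cpa_key_nibble(plaintexts, traces)
print(best == SECRET, round(scores[SECRET], 3))  # True 1.0
```

Because the leaky sample is noise-free in this simulation, the correct guess reaches a correlation of exactly 1.0 while every wrong guess scores strictly lower.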
As mentioned earlier in this section, power side-channel attacks are often
performed in a divide-and-conquer fashion [9, 30, 46] in which the full secret key

Fig. 5.6 Correlation power analysis attack

is partitioned into smaller parts that are easy to attack independently. However, a
security evaluation of sub-keys does not necessarily illuminate the security of the
entire key. In particular, when an attacker fails to break all the key bytes with
correlation power analysis, the remaining effort for a full key recovery through
trial and error is not quantifiable. To overcome this issue, one can use a key
rank estimation metric. In the unknown-key setting, key rank estimation uses
heuristics to approximate the rank of an unknown secret key without performing key
enumeration [6]. In the known-key case, it uses the correlation computed with the
CPA (or another scoring metric) to sort the key guesses and estimate the remaining
effort for an attacker. For example, if an attacker has no side-channel information,
then the key rank equals the entire key space, i.e., 2^128 for AES-128. Alternatively,
when the entire key is broken, the key rank drops to zero. While there are several
ways of computing the key rank estimation metric, in this chapter, we use the
histogram-convolution-based algorithm of Glowacz et al. [10] where the key rank is
upper and lower bounded.
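A simplified rank estimate in the spirit of the histogram-convolution approach can be sketched as follows. This toy version returns only a lower bound on the rank (the number of full keys whose binned score strictly exceeds the correct key's) and omits the refinements, and the upper bound, of the actual algorithm of Glowacz et al.

```python
import math

def convolve(h1, h2):
    """Discrete convolution of two histograms (lists of counts)."""
    out = [0] * (len(h1) + len(h2) - 1)
    for i, a in enumerate(h1):
        for j, b in enumerate(h2):
            out[i + j] += a * b
    return out

def key_rank_lower_bound(scores, true_key, n_bins=64):
    """scores[b][k]: positive score (e.g., |correlation|) of guess k for
    key byte b; true_key[b] is the known correct byte. Bins the per-byte
    log-scores, convolves the per-byte histograms, and counts full keys
    landing in strictly higher bins than the correct key."""
    logs = [[math.log(max(s, 1e-12)) for s in byte] for byte in scores]
    lo = min(min(b) for b in logs)
    hi = max(max(b) for b in logs)
    width = (hi - lo) / n_bins or 1.0
    bin_of = lambda x: min(int((x - lo) / width), n_bins - 1)
    hist, key_bin = None, 0
    for byte_logs, kb in zip(logs, true_key):
        h = [0] * n_bins
        for x in byte_logs:
            h[bin_of(x)] += 1
        hist = h if hist is None else convolve(hist, h)
        key_bin += bin_of(byte_logs[kb])
    return sum(hist[key_bin + 1:])

# Two key bytes, four guesses each; the correct bytes score highest,
# so no full key outranks the correct one: estimated rank 0.
print(key_rank_lower_bound([[0.1, 0.9, 0.2, 0.3],
                            [0.8, 0.1, 0.2, 0.3]], (1, 0)))  # 0
```

When the correct bytes do not score highest, the estimate grows accordingly; for instance, with scores [[0.5, 0.4], [0.5, 0.4]] and true key (1, 1), three full keys outrank the correct one.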

5.3.3 Running the Attack

For practical attack implementations, we choose a Sakura-X side-channel evaluation
board [61], equipped with an AMD Kintex-7 FPGA (xc7k160tfbg676-1), and an
AMD Alveo U200 data center accelerator card, which contains an AMD Virtex
UltraScale+ FPGA (xcu200-fsgd2104-2-e). For compilation, we use the default
Vivado synthesis and implementation strategies. Figure 5.7 shows a block diagram
of the system architecture. Despite FPGA-specific implementation differences, the
system architecture of the two setups has the same main components: the shell is
responsible for communication between the FPGA tenants (the adversary or the
victim) and the host machine. The victim circuit is an openly available AES-128
hardware module [1]; it has an associated controller for transferring plaintexts and
ciphertexts and initiating encryption. The adversary, physically isolated from the
victim, contains a voltage-fluctuation sensor calibrated by a controller module and
a FIFO buffer where sensor samples are stored before being offloaded to the host
machine. To facilitate trace collection, we use the encryption initiation signal as a
trigger for sensor trace storage. Instead of a trigger, one could use trace alignment
techniques on the recorded sensor traces [62]. The clock frequencies of the sensor
and the AES are set to 200 MHz and 20 MHz, respectively. We test the attack with
both TDC and RDS sensors [65], both with a 128-bit output register. The value in
the output register is converted into the numerical value of one sensor sample using
the Hamming weight. For sensor calibration, the coarse (LUTs, latches) and fine
(carry) elements are used, as described in Algorithm 1.

Fig. 5.7 System architecture for running a remote power side-channel attack

Fig. 5.8 Power side-channel trace recorded with RDS and TDC sensors [65]
Figure 5.8 shows the power trace recorded by the two sensors during the
encryption of one plaintext on the Sakura-X board. The AES rounds are clearly
visible. The traces have 128 samples, covering the entire duration of one encryption.
Because the sensor calibrations are independently performed, the vertical offsets of
the traces differ. More importantly, note that the peak-to-peak amplitude of the RDS
trace is considerably higher than the peak-to-peak amplitude of the TDC trace. Next,
we run 500,000 encryptions and record the corresponding traces. In Fig. 5.9, we plot
the results of the statistical analysis of the sensor samples, notably the variance of
one bit in the sensor output. In the left half of the figure, the total number of bits
with nonzero variance is shown. On the right, the variance per bit for all output
bits is visualized. For the TDC, we find 11 bits with a nonzero variance; these bits
cover the entire range of clock propagation depths captured by the sensor across
all encryptions. Under the same conditions, the RDS exhibits 47 bits with nonzero
variance, and given the absence of a tapped delay line, the bits with nonzero variance
are unequally distributed. The higher variation in trace sample values and the higher
number of bits with nonzero variance suggest that an attacker with RDS can likely
break the secret key faster. Later experiments will confirm this hypothesis.

5 Remote Power SCA and FI Attacks on Multitenant FPGAs 113

Fig. 5.9 Number of bits toggling during trace acquisition (left), and the variance of the bits in the
sensor output register (right) [65]

Fig. 5.10 Signal-to-noise ratio for the RDS and TDC, computed on the least-significant byte of
the output of the ninth AES round [65]
In addition to computing the variance of the output bits, we compare the signal-
to-noise (SNR) ratio of the traces recorded by the TDC and the RDS on the Sakura-X
board. The SNR as a side-channel evaluation metric is defined as the ratio between
the useful signal, i.e., the variance of the data-dependent power consumption, and
the noise [55]. It can be obtained from power side-channel traces without performing an
attack and is most commonly used to identify trace samples with significant leakage
(i.e., samples having a strong correlation with the secret). Figure 5.10 shows the
results. For both sensors, we can observe two peaks: in sample 102 (the beginning
of the last AES round) and in sample 112 (the end of the last round, i.e., when
the ciphertext is saved in the state register). Compared to the TDC, the SNR of the
signal picked up by the RDS sensor approximately doubles in these two points of
interest. Furthermore, across all our experiments and every byte of the intermediate
value, the SNR in sample 112 for RDS is consistently higher than for the TDC, by
a factor of 1.6× on average, with the maximum reaching 2.9× [65].
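The SNR computation described above can be sketched generically in Python (this is not the authors' evaluation code; the function name is illustrative). Traces are partitioned by an intermediate value, such as one byte of the ninth-round output:

```python
import numpy as np

def snr_per_sample(traces, labels):
    """Compute the SNR at every trace sample.

    traces: (n_traces, n_samples) array of sensor readings.
    labels: (n_traces,) intermediate values used to partition the traces.
    SNR = variance over classes of the class means (useful signal)
          divided by the mean over classes of the class variances (noise).
    """
    classes = np.unique(labels)
    means = np.array([traces[labels == c].mean(axis=0) for c in classes])
    variances = np.array([traces[labels == c].var(axis=0) for c in classes])
    return means.var(axis=0) / variances.mean(axis=0)
```

Samples where the SNR peaks are the points of interest for the attack, such as samples 102 and 112 in Fig. 5.10.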
To compare the sensors in the power side-channel attack scenario, we attack
power traces recorded on the Alveo U200 data center accelerator card using CPA
and the key rank estimation metric and repeat the attack a number of times [65].
114 D. G. Mahmoud et al.

Fig. 5.11 Key rank estimation for the TDC and RDS sensors on the Alveo U200 data center
card [65]

Fig. 5.12 CPA attack on the last AES round targeting the eighth byte of the key [16]. To the left,
the correlation for all key guesses across the trace samples corresponding to the last AES round,
with the correlation of the correct key byte highlighted in orange. To the right, the rank evolution
of the correct key byte

Figure 5.11 shows the attack results. With almost two million traces, more than
half of the key bits are broken with RDS. The attack with TDC is, on average,
notably less successful, which is not surprising given the results of the trace analyses
discussed previously.
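For reference, the scoring step of CPA reduces to a Pearson correlation between a hypothetical leakage model and the recorded samples; the following is a minimal, generic sketch (not the implementation used in these experiments):

```python
import numpy as np

def cpa_scores(traces, hypotheses):
    """Correlation power analysis scoring.

    traces: (n_traces, n_samples) recorded sensor samples.
    hypotheses: (n_guesses, n_traces) predicted leakage per key guess,
                e.g., the Hamming weight of a last-round intermediate.
    Returns an (n_guesses, n_samples) array of Pearson correlations.
    """
    t = traces - traces.mean(axis=0)
    h = hypotheses - hypotheses.mean(axis=1, keepdims=True)
    num = h @ t
    den = np.sqrt((h ** 2).sum(axis=1, keepdims=True) * (t ** 2).sum(axis=0))
    return num / den
```

The guess whose maximum absolute correlation is largest is ranked first; feeding such scores into the key rank estimation produces curves like those in Fig. 5.11.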
Since signal coupling varies with the target FPGA and PDN quality, we repeat
the analyses, this time targeting AMD UltraScale+ FPGAs available in the Amazon
EC2 F1 platform [13]. The power traces are recorded with a TDC, in which the
output register value is converted into a numerical value of one sensor sample
using a thermometer code. Thirty experimental runs are performed. After each run,
the FPGA instance is shut down and restarted, potentially allowing the resource
allocator to assign a different FPGA for use during each experiment. With CPA,
the full key was successfully broken in 42% of attempts. Of all attempts to
attack individual key bytes across one million traces, 48% were successful. These
variations in the attack success are expected: first, because of low SNR and, second,
because of device timing characteristics that are known to be affected by process
and temperature variations [21]. Figure 5.12 shows the CPA attack results on the last
AES round for the eighth byte of the key [16]: the left half shows the correlation for
all key guesses for the trace samples corresponding to the last AES round, obtained
with one million traces. The right half illustrates the rank evolution of the correct
key byte.
In the final set of experiments, we examine how the choice of the numerical
transformation (i.e., encoding) applied to the sensor output register—to obtain the
numerical value of the sensor sample—impacts the attack success. Typically for
TDCs, the output register value (i.e., one sensor sample) is a sequence of zeros
followed by a sequence of ones (or vice versa), where the transition corresponds
to the propagation depth of the clock driving the tapped delay line. In practice, the
transition is not always perfect, and one or more pulses (commonly called bubbles)
may be observed. The bubbles occur because the delays in the observable delay line
are not always strictly monotonic [16]. Therefore, it is conceivable that the choice
of encoding may impact the attack’s success. To test this hypothesis, we take the
traces collected previously (on Amazon EC2 F1 instances), apply three different
encodings, and repeat the CPA attack. The baseline encoding is the thermometer
code, i.e., the sensor sample is the propagation depth of the clock (ignoring the
bubbles). In the second test case, we reorder the output register bits so that the
delays from the input of the delay line to each of the bits in the output register are
monotonically increasing, at least according to static timing analysis. Reordering is
performed in software (off-chip). In the third and the last test case, a sensor sample
is computed as the Hamming weight of the output register. The result of the CPA
attack on the obtained three datasets (original, permuted, and Hamming weight)
is shown in Fig. 5.13. As expected, the baseline is the least-performing option.
Timing-aware reordering of the bits and Hamming weight encoding both capture
the signal of interest well. Hence, they permit a more successful attack. In
conclusion, Hamming weight is a good approach for encoding TDC bits, while it is
also naturally suited for routing delay sensors [65].
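The three encodings can be sketched as follows; this is our own simplified reading of the thermometer code (the exact bubble-handling convention may differ from [16]):

```python
import numpy as np

def depth(bits):
    # Baseline thermometer code: index of the first zero, i.e., the clock
    # propagation depth; bubbles beyond the transition are ignored.
    zeros = np.flatnonzero(bits == 0)
    return int(zeros[0]) if zeros.size else len(bits)

def permuted(bits, order):
    # Reorder the register bits by (statically analyzed) increasing delay,
    # then take the depth of the reordered word.
    return depth(bits[order])

def hamming_weight(bits):
    # Popcount of the register: robust to bubbles and to bit ordering.
    return int(bits.sum())
```

For a register value with a bubble, such as 11101000, the baseline stops at the first zero (depth 3), whereas reordering the bubble bit into place or taking the Hamming weight both recover the full depth of 4.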

Fig. 5.13 Key rank estimation when targeting the full 128-bit key of the AES encryption on AMD
UltraScale+ FPGAs in the Amazon EC2 F1 instances, for the original (i.e., baseline), permuted,
and Hamming weight traces

5.3.4 Countermeasures

Countermeasures against power side-channel attacks have been extensively studied,
primarily in the context in which attackers have physical access to the device.
Physical access allows the possibility of connecting equipment to measure and
observe current, voltage, and electromagnetic field variations during FPGA oper-
ation. These traditional countermeasures fall into two main categories: hiding and
masking [46]. In hiding, the focus is on reducing the signal-to-noise ratio captured
by the measurements, e.g., by equalizing the data-related power consumption [67]
or by increasing the noise component of the signal in the side channel. Masking
requires the processing of algorithmically randomized data, while maintaining
correct circuit operation [58]. Both hiding and masking suffer from considerable
area overhead and vulnerability to higher-order attacks [46]. Furthermore, masking
is application-specific and requires effort to be successfully implemented.
Shortly after the discovery of remote power SCA attacks in multitenant FPGAs,
a mitigation technique called active fences was proposed [32]. This hiding approach
employs fast-switching FPGA circuits to generate noise and inject it into the PDN
and, consequently, reduce the SNR measured by the attacker; the designs of some of
the fast-switching circuits suitable for active fences will be presented in Sect. 5.4.2.
The fence is placed between the adversary and victim circuits, in space that would
likely be unused (recall that some FPGA logic and routing cannot be allocated
to tenants because it is required for ensuring the physical and logical separation
between them). A pseudorandom number generator and a TDC sensor were shown
to be suitable activators of the fence. The result was a significant increase in the
number of traces needed for a successful attack. More recently, an enhanced solution
called active wire fence was proposed [17]. In addition to the fast-switching circuits,
a wire fence harnesses FPGA routing resources (wires of variable lengths and
routing multiplexers) to stress the PDN further.
An alternative protection strategy is bitstream scanning [12, 34, 36]. Bitstream
scanners look for signatures of circuits that are at high risk of becoming security
threats. For instance, signatures can include data-to-clock paths in TDCs (which an
adversary could avoid, if needed, by adding a T flip-flop on the clock path and using
its output instead of the clock to drive the delay line) or high-fanout nets. The typical
input to the scanner is a partial bitstream file, from which a netlist is extracted and
analyzed for malicious signatures. Currently available bitstream scanners support
two FPGA families, AMD UltraScale+ [36] and Lattice iCE40 [34], because there
are very few commercial devices that have bitstream reverse-engineering tool chains
that are openly available.

5.4 Remote Fault-Injection Attacks

This section focuses on the possibility of injecting faults by leveraging FPGA
designs to affect the power-delivery network and, by extension, the FPGA logic
powered by it. We construct a case study to highlight the requirements of fault-
injection exploits and the parameters affecting their success. The considered exploit
uses one FPGA circuit to carry out an attack against another, co-located on the same
FPGA die. Faults must be injected into two logic signals, such that the registers
capturing their values record a binary combination that should not normally occur.
We then examine how this type of fault-injection attack can compromise the security
of a more complex system. The section continues with a discussion of how the effects
of FPGA-based PDN manipulation can affect computing components other than the
programmable logic. A discussion of countermeasures closes the section.

Fig. 5.14 Digital circuit timing parameters, which together form constraints that must be
respected for correct operation

5.4.1 Timing Constraints

For the typical sequential circuit illustrated in Fig. 5.14, the choice of the clock
frequency depends on the circuit’s timing constraints. Given that registers supply
the inputs of the combinational logic and the outputs are saved in registers, the
circuit designer must ensure that the design meets the setup and hold constraints
expressed as follows [60]:

Tclk ≥ Tclk2q
.
max
+ Tsetup + Tcomb
max
− Tskew , (5.2)

Thold ≤ Tclk2q
.
min
+ Tcomb
min
. (5.3)

Here, Tclk is the clock period. Tclk2q is the time between the arrival of the clock
edge and the corresponding update of a flip-flop’s output, or in other words, the
time the input register needs to supply the new value to the combinational circuit.
The superscripts refer to the maximum and minimum values this delay can take.
Tsetup is the time for which the value at the input of the output register must remain
stable before the clock edge arrives, as shown in Fig. 5.15. Tcomb is the delay the
combinational logic takes to compute the result. Depending on the input values,
this delay can vary as more levels of logic gates may need to switch to produce the
final output. Accordingly, the timing constraints also consider this delay’s maximum
and minimum values. Different clock arrival times to the input and output registers
are reflected in Tskew . Finally, Thold is the time for which the input to a register needs
to remain stable after the clock edge to guarantee a correct update of the register’s
output, as shown in Fig. 5.15.

Fig. 5.15 Timing of a D flip-flop highlighting the setup and hold times
If inequality (5.2) or (5.3) is not met, the circuit may not operate correctly.
Therefore, the first step toward correct circuit functionality is to select a suitable
clock frequency. Then, if there are any hold time constraint issues (e.g., due to
clock skew), additional delays can be added. Delay tuning options include changing
the routing to use longer paths, using slower cells (if available), adding buffers, or
using negative-edge registers. FPGA synthesis tools typically support commands
for ensuring that synthesis meets the hold time requirements [5].
An adversary aiming to generate circuit faults can target the circuit’s timing
behavior. If the external clock input is accessible, the adversary can manipulate the
clock frequency to violate inequality (5.2). For example, the adversary can switch to
a faster clock for a short time before switching back to the normal clock to introduce
glitches [31]. While control over the clock can allow the adversary precise control
over the glitch and, accordingly, the injected fault [47], that control is not always
available. In remote multitenant FPGA scenarios, the adversary cannot control the
clock source of another tenant’s circuit. Instead, the attacker can focus on increasing
the combinational delays to eventually violate inequality (5.2). In this case, the
relationship between the circuit voltage and delay as the voltage decreases (e.g.,
the delays increase) serves as an attack enabler [37]. While the adversary does not
have access to the FPGA power supply, they can construct malicious FPGA power
waster circuits that consume excessive power, overwhelming the power supply and
resulting in transient voltage fluctuations.
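The resulting setup violation can be illustrated numerically; the delay values below are purely illustrative and do not correspond to a measured device:

```python
def setup_ok(t_clk, t_clk2q_max, t_setup, t_comb_max, t_skew):
    # Inequality (5.2): data must settle before the next clock edge.
    return t_clk >= t_clk2q_max + t_setup + t_comb_max - t_skew

# Nominal operation at 200 MHz (t_clk = 5 ns): timing is met.
assert setup_ok(5.0, 0.3, 0.2, 4.0, 0.0)

# A voltage drop stretches the combinational delay; (5.2) is violated,
# and the capturing register may latch a faulty value.
assert not setup_ok(5.0, 0.3, 0.2, 5.0, 0.0)
```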

5.4.2 FPGA Power Wasters

As their name suggests, power wasters are circuits whose main aim is to consume
as much dynamic power as possible. The dynamic power consumption is directly
proportional to the square of the supply voltage, the switching frequency of the
signal being considered, and the load capacitance [70]. Given that the adversary
has no direct control over the supplied voltage in a remote attack scenario, the only
remaining factors that can increase power consumption are the switching frequency
and the load capacitance of the logic gates within the design. Accordingly, power
waster designs aim to generate high-frequency signals.

Fig. 5.16 The implementation of a NAND-based ring oscillator with a single LUT. The enable
signal controls the attack activation
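As a first-order illustration of this relation (P_dyn = α · C · V_dd² · f), consider the following sketch; the activity factor, capacitance, and voltage values are illustrative assumptions, not measurements:

```python
def dynamic_power(alpha, c_load, v_dd, f):
    """First-order CMOS dynamic power: P = alpha * C * Vdd^2 * f,
    where alpha is the switching activity (toggles per cycle)."""
    return alpha * c_load * v_dd ** 2 * f

# With the supply voltage fixed by the platform, a net toggling at
# 1 GHz burns five times the power of the same net on a 200 MHz clock.
p_fast = dynamic_power(1.0, 10e-15, 0.85, 1e9)
p_slow = dynamic_power(1.0, 10e-15, 0.85, 200e6)
assert abs(p_fast / p_slow - 5.0) < 1e-9
```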
Many approaches for power waster designs have been proposed; for instance,
a high-frequency clock signal generated by a phase-locked loop (PLL) within the
FPGA can be used to drive long wires or a large number of sequential elements that
change state with every clock cycle (e.g., a shift register initialized to an alternating
pattern of ones and zeros). Efficient power wasters can be built with circuits that
generate glitches; glitches can be created by overclocking an otherwise benign
design (e.g., overclocking encryption rounds of a block cipher [24, 57]) or by
crafting signal paths of various delays that switch at different times and XORing the
signals [48]. The most common technique for generating a high-frequency signal is
to find the shortest path within the programmable logic, create an oscillator out of
it, and use it to the adversary’s advantage.
Ring oscillators (ROs) are circuits that use an odd number of inverters to
change the value of a logical signal and then use that same signal as input to the
inverters through a feedback loop. For the highest frequency of oscillation, one
inverter suffices. One FPGA look-up table (LUT) can be programmed to act as
an inverter, and its output can be connected back to its input with a short routing
path. The resulting FPGA-based RO can generate signals at a frequency surpassing
the maximum clock frequency that an on-chip PLL can generate. Depending on the
technology node, the oscillation frequency can surpass 1 GHz [22].
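To a first order, the oscillation frequency follows from the loop delay: one full period traverses the loop twice (a rising and a falling transition). The delay value below is an illustrative assumption, not a measurement:

```python
def ro_frequency(stage_delays_ns):
    """Ideal ring-oscillator frequency: f = 1 / (2 * total loop delay)."""
    period_s = 2.0 * sum(stage_delays_ns) * 1e-9
    return 1.0 / period_s

# A single LUT plus a short feedback route, assumed to total ~0.4 ns,
# would oscillate at 1.25 GHz, consistent with frequencies above 1 GHz.
assert abs(ro_frequency([0.4]) - 1.25e9) < 1.0
```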
When used within an FPGA design, ROs typically employ an enable signal that
allows the user control over when the RO oscillates. Therefore, in practice, a
LUT implements a NAND instead of a NOT function, as shown in Fig. 5.16.
Even though an RO generates a high-frequency signal, one LUT cannot consume
enough power to overwhelm the power supply—an adversary needs to instantiate
a large number of ROs. Since one RO alone cannot significantly change the
consumed power, a group of ROs can share a common enable signal. Such a
collection of ROs, representing the smallest unit the attacker can control, is often
called a block or a bank of ROs. Figure 5.17 shows the voltage drop resulting from
the activation of a varying number of ROs. A Genesys-ZU board, equipped with
an AMD Zynq UltraScale+ multi-processor system-on-chip (XCZU3EG) [14], is
targeted. The voltage readings in Fig. 5.17 are obtained using the TDC described
in Sect. 5.3.1. Not surprisingly, under the same enable signal activation pattern,
the more the ROs, the more significant the resulting voltage drop. The maximum
number of ROs in Fig. 5.17 corresponds to 60,963 LUTs; such a high number
was achieved by implementing two ROs in one 6-input LUT, since LUTs in the
UltraScale FPGA family support two independent outputs [69].

Fig. 5.17 Voltage drop as a function of the number of active ROs. The enable signal is toggling
with a period of 1.1 µs. RO blocks are activated (and later deactivated) one by one. The baseline
corresponds to no RO blocks active

Fig. 5.18 The implementation of a wire-based power waster using two LUTs, with an enable
signal for attack control [17]
Another way to boost the power consumption of ROs without using additional
LUTs is to increase the load capacitance at the output of each RO. This increase
can be achieved by connecting an RO output to another LUT, passing the signal
through routing resources, and driving additional loads to consume more power.
For example, wire-based power wasters, shown in Fig. 5.18, implement a group of
sources and sinks for each block. Then, with placement constraints, the signal from
each source is forced to travel a certain distance (using routing resources) to the
corresponding sink [17]. Enhanced ROs (EROs), which use LUTs implementing
ROs and connect the output of each LUT to three other LUTs, can also be used,
ensuring that the additional inputs do not change the RO functionality. Figure 5.19
shows the implementation of an ERO instance using four LUTs [36].
Evaluating EROs and comparing them to ROs on an AMD UltraScale+ FPGA
(the FPGA family used by many commercial cloud service providers) demonstrates
their higher power consumption [75]. Figure 5.20 shows the TDC sensor readings
when using 60,963 LUTs as ROs and 58,981 LUTs as EROs. Despite the lower
number of LUTs used for the EROs, the resulting voltage drop from their activity is
more pronounced. This increase in voltage drop highlights the effect of the increased
load capacitance and routing resources used [42].

Fig. 5.19 Implementation of an enhanced ring oscillator using four LUTs, with an enable signal
for attack control [36]

Fig. 5.20 Comparison of the voltage drop induced by EROs and ROs, under the same activation
pattern and the period of the enable signal of 1.4 µs. The baseline corresponds to no power wasters
active

5.4.3 Fault Injection vs. Denial of Service

Activating a large number of power wasters is not sufficient for a successful fault-
injection attack. The malicious party also needs to consider the following factors: the
timing of the fault, the desired voltage drop, and the stealthiness of the exploit.
Fault-injection timing is important as the effect of a fault depends on the computation
affected. Attacks leveraging faulty outputs typically require the fault to be injected
at a specific point of the algorithm under attack. The adversary can leverage side
channels to determine when the circuit executes the target function and activate the
power wasters suitably [39]. The stealthiness of the exploit is strongly related to its
timing, as a well-timed undervolting does not need to last long. Hiding the exploit
also requires controlling its strength, i.e., the voltage drop. If the voltage decreases
too much, the FPGA may become unavailable, resulting in a denial of service. The
adversary must carefully control the voltage drop to match the desired delay increase
while avoiding a reset of the entire FPGA.

Fig. 5.21 Enable signal and controllable parameters for remote undervolting attacks
The voltage drop is a function of the FPGA’s PDN parameters. Continuous
activation of power wasters is typically not sufficient for successful fault injection.
If the number of power wasters is insufficient to reset the board, then the PDN will
likely suffer an immediate voltage drop and compensate for the increased activity,
resulting in the final voltage drop not being as significant. If, instead, the power
wasters are activated periodically, as shown in Fig. 5.21, the power supply may not
be able to recover within the given time. In addition, the frequency of the switching
affects the voltage drop. The impedances of the PDN capacitive and inductive
elements are a function of the frequency, and the voltage drop is a function of the
impedance and the drawn current. Consequently, an activation signal matching the
PDN resonance frequency results in the most significant voltage drop [72].
Therefore, the adversary needs to be able to control several parameters that shape
the enable signal of the power wasters. Figure 5.21 illustrates these parameters:
• The start of the attack, i.e., the moment at which the power wasters are enabled
for the first time, corresponding to time 0 in Fig. 5.21
• The period of the enable signal, i.e., the time between two subsequent activations
of the enable signal
• The duty cycle of the enable signal, i.e., the time during which the enable signal
remains high over the period of the enable signal
• The duration of the attack
A well-chosen enable signal period allows for an effective power draw. Control-
ling the duty cycle of the enable signal also allows for controlling the voltage drop
since a longer-lasting activation draws more power than a shorter activation. Figure
5.22 shows the TDC sensor readings when using EROs with varying frequencies
for the enable signal on an AMD UltraScale+ FPGA [14]. Figure 5.22 compares the
voltage drop corresponding to different frequencies to the baseline. Using 26,908
LUTs as EROs (activated gradually in steps of 3844 LUTs), we can see that the
value and shape of the voltage drop depend on the period of the enable signal. If that
period is too short, the lowest voltage lasts for a short time since the power wasters
are active for a limited number of clock cycles. The longer the period, the longer the
voltage drop lasts. However, the period increase should also take into account the
attack duration. In Fig. 5.22, the longer period only results in one continuous voltage
drop, whereas a period of 1500 ns results in the EROs being active twice during the
attack, with the second voltage drop being more significant than the first and than
that achieved with a longer period. The results in Fig. 5.22 highlight the importance
and impact of carefully controlling the enable signal.

Fig. 5.22 Comparison of the voltage drop induced by different periods of the attacker enable
signal for 26,908 LUTs implementing an ERO-based attacker. The baseline corresponds to no
power wasters active
The waveform of the enable signal can be generated from external sources and
sent as an input to the FPGA. However, in a remote scenario, the adversary must
generate the signal on the chip. Access to a reconfigurable phase-locked loop (PLL)
may facilitate the creation of arbitrary signal waveforms. Finally, the adversary
can implement a control circuit to receive parameters and generate the signal
accordingly. Various solutions exist; Algorithm 2 shows an example leveraging a
counter to keep track of the number of clock cycles elapsed for the attack and the
toggling of the enable signal.
The count process is called at every rising clock edge. The state of the input
signals determines the action taken. If the reset signal is asserted, then all counters
are reset to 0. If the user has asserted the start signal, then the durationCnt
is incremented, while periodCnt is incremented if the period of the enable signal
has not passed. If it has, the periodCnt goes back to zero to start counting another
period. If the start signal is deasserted, then the counters keep their old values.

Algorithm 2 Algorithm that generates the signals that count the duration of the
exploit and the duration of each period for the toggling of the enable signal
Input: reset, signal to reset the counters
Input: start, signal to start the attack and the counters
Input: period, number of clock cycles which define the period of the enable signal
Output: durationCnt, counter for the attack duration
Output: periodCnt, counter for the attack period

1: procedure COUNT(reset, start, period)
2: if reset = 1 then
2: if reset = 1 then
3: durationCnt ← 0
4: periodCnt ← 0
5: else if start = 1 then
6: durationCnt ← durationCnt + 1
7: if periodCnt < period then
8: periodCnt ← periodCnt + 1
9: else
10: periodCnt ← 0
11: end if
12: end if
13: end procedure

The counters produced by the count process can then be used with other
parameters provided to the circuit to generate either a continuous enable signal
or a periodically activated and deactivated signal. Algorithm 3 shows an example
for enable signal generation. The process is called at every rising clock edge. The
obtained frequency is the result of dividing the clock frequency by the specified
number of clock cycles in the period input. The dutyCycle input specifies the
number of clock cycles for which the power wasters are active. If the system has
been reset, all of the enable signals are deactivated, and the end signal is set to
zero. If the start signal is deasserted, which is equivalent to the attack ending or
being stopped, then the end signal is high, while all enable signals are deactivated.
Otherwise, depending on the toggle input, the enable signals of the specified
number of power waster blocks (NB ) are either continuously high for the duration
of the attack, or periodically enabled and disabled, depending on the dutyCycle
input and the values of the counters.
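A cycle-accurate software model of the two processes (the counters of Algorithm 2 and the enable generation of Algorithm 3) can be sketched in Python; the function name and interface are ours:

```python
def enable_waveform(duration, period, duty_cycle, toggle=True):
    """Software model of the counter and enable-generation processes.

    duration, period, and duty_cycle are counted in clock cycles,
    matching the hardware counters. Returns one boolean per clock
    cycle, telling whether the power-waster enable is asserted.
    """
    enables = []
    period_cnt = 0
    for _ in range(duration):          # while durationCnt < duration
        if not toggle:
            enables.append(True)       # constantly active
        else:
            enables.append(period_cnt < duty_cycle)
            # periodCnt counts up to 'period', then wraps to zero.
            period_cnt = period_cnt + 1 if period_cnt < period else 0
    return enables
```

With period = 3 and dutyCycle = 2, for example, the enable signal is asserted for two cycles out of every four, producing the periodic activation pattern of Fig. 5.21.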

Algorithm 3 Algorithm for generating the enable signals that control the activity of
the power waster blocks, according to the waveform in Fig. 5.21
Constant: N, Total number of power waster blocks instantiated
Input: reset, signal to reset the counters and disable the power wasters
Input: start, signal to start the attack and the counters
Input: NB , number of power wasters to enable
Input: toggle, flag specifying whether the enable signal toggles (i.e., follows a period pattern)
Input: dutyCycle, number of clock cycles (in the activation period) to keep the enable active
Input: durationCnt, counter for the attack duration
Input: periodCnt, counter for the period for the toggling
Input: duration, number of clock cycles for which the attack should last
Output: enable[N − 1 : 0], enable signal for the power waster blocks
Output: end, signal indicating that the attack is over

1: procedure ENABLE_PWS(reset, start, NB , toggle, dutyCycle, durationCnt, periodCnt, duration)
2: if reset = 1 or start = 0 then
3: enable[N − 1 : 0] ← (0, 0, ..., 0) ▷ Deactivate all power wasters
4: end ← 0
5: else
6: if toggle = 0 then ▷ Keep the enable signal constantly active
7: if durationCnt < duration then
8: enable[NB − 1 : 0] ← (1, 1, ..., 1) ▷ Activate power wasters
9: end ← 0
10: else
11: enable[N − 1 : 0] ← (0, 0, ..., 0) ▷ Deactivate all power wasters
12: end ← 1
13: end if
14: else ▷ Toggle the activation of the enable signal
15: if durationCnt < duration then
16: if periodCnt < dutyCycle then
17: enable[NB − 1 : 0] ← (1, 1, ..., 1) ▷ Activate power wasters
18: else
19: enable[N − 1 : 0] ← (0, 0, ..., 0) ▷ Deactivate all power wasters
20: end if
21: end ← 0
22: else
23: enable[N − 1 : 0] ← (0, 0, ..., 0) ▷ Deactivate all power wasters
24: end ← 1
25: end if
26: end if
27: end if
28: end procedure

5.4.4 Victim

With precise control over the activity of the power wasters, the adversary can lower
the voltage of the FPGA chip and increase the delays of other circuits within the
programmable logic. The attacker can then target a victim that contains a secret or
is processing sensitive information. Fault injection typically targets a specific part of
the circuit to affect the output. This action allows the adversary to learn information
that is otherwise inaccessible.
Encryption cores have a secret key and process sensitive information (the
plaintexts are encrypted so that they are not accessible to other parties when
transmitted through communication channels or stored in potentially compromised
media, for example). These circuits can thus be victimized.

Fig. 5.23 Example illustrating two signals that cannot reach the state 11

Let us consider a fault-injection exploit targeting an AES encryption core. For AES
attacks, adversaries
usually leverage differential fault analysis (DFA), which relies on AES input control.
The malicious party sends the same plaintext twice, once in the absence of an attack
and once targeting a specific phase of AES algorithm execution during an attack.
The two encryptions result in one correct and one faulty ciphertext. By collecting
enough pairs of correct and faulty outputs, the adversary can retrieve the encryption
key and break the system’s security. Machine learning inference cores are also
interesting attack targets because of their proprietary architectural parameters. These
circuits often process sensitive data that should not be accessible to other parties and
their outputs can affect decisions made within a larger system. A fault injected into
a model can lead to erroneous inference results.
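To make the DFA principle concrete, consider a deliberately simplified "last round" consisting of a single S-box lookup followed by a key XOR. The sketch below is our own toy model (an arbitrary random permutation stands in for the AES S-box, and all names are ours): each correct/faulty ciphertext pair keeps only the key guesses that explain the pair with a single-bit fault, and intersecting a few pairs rapidly isolates the key.

```python
import random

rng = random.Random(0)
SBOX = list(range(256))
rng.shuffle(SBOX)                      # toy bijective, nonlinear S-box
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x

def last_round(x, key):
    """Toy 'last round': ciphertext byte c = SBOX[x] xor key."""
    return SBOX[x] ^ key

secret_key = 0x3A
candidates = set(range(256))
for _ in range(8):                     # eight correct/faulty pairs
    x = rng.randrange(256)
    fault = 1 << rng.randrange(8)      # single-bit fault before the S-box
    c_good = last_round(x, secret_key)
    c_bad = last_round(x ^ fault, secret_key)
    # keep only key guesses under which the S-box input difference is one bit
    candidates &= {
        k for k in range(256)
        if bin(INV_SBOX[c_good ^ k] ^ INV_SBOX[c_bad ^ k]).count("1") == 1
    }
# The true key always survives; wrong keys are quickly eliminated.
```

The full attack on AES works on the same principle but must account for the MixColumns and ShiftRows structure of the real cipher.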
The remainder of this section considers AES as the target victim, but instead of
performing DFA, it assumes a scenario in which the cryptographic accelerator is
compromised by an embedded hardware Trojan, triggered only in the presence of
the attack. The first step of the attack is finding two signal paths within AES that
together cannot reach a specific stable state. In other words, the adversary must identify two correlated signals whose downstream registers are guaranteed never to produce a specific two-bit output state (e.g., 11). A simple
circuit satisfying the criteria is shown in Fig. 5.23: two signals produced by two
AND gates that share an input, except that one gate (g0 in Fig. 5.23) uses the input as is, while the other uses the inverted version of the input (g1 in Fig. 5.23). Therefore, the two AND gates cannot both output a stable 1, glitches aside. Signals exhibiting
such a correlation can exist in circuits at a scale broader than the simple example in
Fig. 5.23. The adversary's goal, once two such signals are identified, is to increase the delay of the two correlated paths so that the outputs of the downstream clocked registers reach the impossible output state (i.e., the paths no longer meet timing).
Triggering such an impossible state can be used for malicious purposes, including
activating a stealthy hardware Trojan to leak secret information.
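The impossibility can be checked exhaustively. The exact wiring below is our assumption about Fig. 5.23 (g0 = IN0 AND IN1, g1 = NOT(IN1) AND IN2, the two gates sharing IN1 with one inverted); the sketch enumerates all input combinations and confirms that the registered pair never reaches 11 under correct timing:

```python
from itertools import product

def correlated_pair(in0, in1, in2):
    """Our reading of Fig. 5.23: two AND gates sharing input IN1,
    one of them using its inverted version."""
    g0 = in0 & in1          # uses IN1 as is
    g1 = (1 - in1) & in2    # uses the inverted IN1
    return g0, g1

# Under correct timing, the registered pair (OUT0, OUT1) can only take
# values the combinational logic produces; state 11 never occurs.
reachable = {correlated_pair(*bits) for bits in product((0, 1), repeat=3)}
```

Only a timing fault, which lets the registers sample not-yet-settled values, can push the pair into the missing fourth state.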
We implemented RO-based power wasters to inject into an openly available AES
circuit [26] using an Intel Arria 10 GX FPGA deployed within a commercial cloud
instance [29]. Two preselected signals within the AES substitution box (SBox)
form the attack target; the signals are sampled and their values collected to analyze
whether they reach the desired (otherwise impossible to reach) state when under
attack. The number of observed faults as a function of the number of active ROs is
shown in Fig. 5.24. The data are averaged across 35 experiments, each containing 30 runs, in which a run consists of 4096 encryptions of plaintexts with varying distributions of values.
5 Remote Power SCA and FI Attacks on Multitenant FPGAs 127

Fig. 5.24 Percentage of the target faults at the output of an AES circuit [26] within a cloud-scale FPGA [29], as a function of the number of active power wasters. The data are averaged over a total of 350 attack runs

Two mechanisms for generating plaintexts are used: the first leverages a linear-feedback shift register to create a pseudorandom sequence of plaintexts. The second considers a fixed set of 16 plaintexts, cyclically repeating.
The enable signal period is 25 ns, and the duty cycle is 62.5%. As shown in Fig. 5.24,
if the attack size (the number of ROs) is too small, the target paths do not have a
sufficient delay increase to inject faults. As the number of ROs increases, faults
appear more frequently. At 53,248 ROs (equivalent to 13 blocks of 4096 ROs), a
maximum occurrence of faults is observed. Adding more ROs makes the attack
powerful enough to create faults on two paths simultaneously, resulting in both
useful and undesired faults (e.g., inverted output instead of the specific faulty state).
As a result, the number of target faults observed with 14 blocks of ROs in Fig. 5.24
drops. The exact trend in the most general case is somewhat more complex to
predict: it depends on the plaintexts since the circuit’s inputs affect the paths signals
take and, consequently, impact the circuit’s sensitivity to the fault-injection exploit.
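The pseudorandom plaintext stream mentioned above is produced by a linear-feedback shift register. Since the text does not specify the generator's width or tap positions, the sketch below uses an illustrative 16-bit maximal-length Fibonacci LFSR:

```python
def lfsr16_step(state):
    """One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11
    (a maximal-length polynomial); width and taps are illustrative,
    not taken from the experiments."""
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def pseudorandom_plaintexts(seed, count):
    """Yield `count` pseudorandom 16-bit words to use as toy plaintexts."""
    state = seed
    for _ in range(count):
        state = lfsr16_step(state)
        yield state
```

A maximal-length LFSR visits all 2^16 − 1 nonzero states before repeating, which gives a long, cheap-to-generate plaintext sequence in hardware.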
Another factor impacting attack success is the period of the enable signal. Figure 5.25 shows the percentage of observed faults for enable signal periods lower than or equal to 160 ns.
Given the differences between the cloud FPGA board [29] and the Genesys-ZU
board [14], in particular the quality and robustness of the power delivery network, it
is not surprising that the results in Fig. 5.25 differ from those in Fig. 5.22. Despite
the differences, we clearly see that a specific frequency range results in successful
triggering of the target faults. The results in Fig. 5.25 are obtained using 28,672 ROs (seven blocks) with an enable signal duty cycle of approximately 30%. Notably, this RO count did not produce observable faults in the experiments of Fig. 5.24. If an enable signal with a frequency within the range in
Fig. 5.25 is used, the attack can succeed with a small number of power wasters. In
fact, these enable signal frequencies in combination with increased attack size may
lead to DoS, since the strength of the attack is affected by both the enable signal
frequency and the size of the power wasters.
Fig. 5.25 Percentage of desired faults at the output of an AES circuit [26] within a cloud-scale FPGA [29], as a function of the period of the enable signal controlling power wasters. The data are averaged over 150 attack runs

Fig. 5.26 An example victim circuit infected with a stealthy hardware Trojan utilizing two correlated signals as a trigger

5.4.5 Exploits and Possible Extensions

To validate the possibility of fault injection targeting two correlated SBox signals in a cloud-scale FPGA, the pair is used as a trigger for a stealthy hardware
Trojan, as shown in Fig. 5.26. A similar approach has previously been used by
Mahmoud et al. [43]. Figure 5.27 shows a block diagram of the system. The shared
FPGA contains two user regions. One is assigned to the attacker and one to the
victim. The shell occupies part of the fabric and provides interfaces for both user
regions, allowing them to communicate with the host CPU. The adversary employs
a collection of RO blocks. They can control the number of activated blocks and
the shape of the enable signal. The victim uses an AES core and leverages FIFO
buffers to store and later send the results for processing outside the FPGA. The target
AES core has a hardware Trojan hidden inside it; we show the Trojan separately
to highlight its presence and operation. Activating the Trojan using the correlated
signal pair makes the Trojan stealthy, because when the user tests the designs with a
variety of inputs, the Trojan will never be activated [27]. In particular, because the Trojan relies on faults to leak information, functional testing of the core will not reveal any discrepancies in the expected results [27].
Fig. 5.27 System design for the exploit targeting Trojan activation using undervolting-based fault injection in a cloud FPGA. The parameters signal controls the period and the duty cycle of the enable signal for the ROs

The functionality of the trigger signal pair is illustrated in Fig. 5.26: they act
as select signals for two multiplexers, such that if both select signals are 1, the
encryption key (i.e., the secret) leaks to the AES output. In all other cases, the AES
output is the ciphertext [43], as would normally be expected. The delays of the two
signals are 2.17 ns and 2.23 ns, under the fast timing model at 100 °C. The AES
core operates at 320 MHz. Based on the results in Figs. 5.24 and 5.25, we allocate
12 RO blocks (49,152 ROs) and set the period and duty cycle of the enable signal to
16 clock cycles and 31.25%, respectively. The experiments are repeated ten times,
each experiment consisting of 30 attack runs of 4096 encryptions per run. One run
corresponds to one attack duration, equivalent to 12.8 µs. Pseudorandom plaintexts
and a fixed set of 16 values are supplied to the AES in an alternating fashion.
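Behaviorally, the Trojan of Fig. 5.26 reduces to a conditional replacement of the ciphertext, and the stated run duration follows from the clock rate: 12.8 µs at 320 MHz corresponds to 4096 cycles, consistent with one encryption per clock cycle. A sketch of both, using our own names:

```python
def trojan_output(ciphertext, secret, sig0, sig1):
    """Fig. 5.26 behavior: the secret replaces the ciphertext at the AES
    output only when both correlated trigger signals are 1 -- a state
    that correct timing never produces."""
    return secret if (sig0 == 1 and sig1 == 1) else ciphertext

# Attack-run timing, derived from the parameters in the text:
clock_hz = 320e6                        # AES core clock
run_s = 4096 / clock_hz                 # 4096 encryptions, one per cycle
assert abs(run_s - 12.8e-6) < 1e-12     # matches the stated 12.8 us

enable_period_ns = 16 / clock_hz * 1e9  # 16 clock cycles -> 50 ns
active_cycles = 16 * 0.3125             # 31.25% duty cycle -> 5 cycles high
```

The arithmetic also fixes the enable signal shape: a 16-cycle period at 320 MHz is 50 ns, of which 5 cycles are high.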
The results show that the secret key leaks, on average, 88 times per experiment
when using the fixed inputs and 30 times in the case of pseudorandom inputs.
The different input sequences affect the paths used within the AES, which in
turn affects the delays, and accordingly, the leakage depends on the sequence of
inputs. The key becomes the most frequently occurring value at the AES output after 20,480
encryptions, corresponding to five attacks with pseudorandom inputs. In conclusion,
with enough samples and the correct set of attack control parameters, an adversary
can successfully induce faults in a victim design implemented in a cloud-scale
FPGA.
It is worth noting that, given the existence of various levels of sharing of the
power supply network (illustrated in Fig. 5.1), the effects caused by the activity of
the power wasters are not limited to the FPGA. An FPGA-based undervolting attack
on the Zynq UltraScale+ multi-processor system-on-chip revealed that disturbances
created by ROs can affect the CPU, to the point that exploitable faults can be injected
into bare-metal applications or that Linux-based applications can crash [24, 42, 44].
The propagation of electrical-level effects is not limited to the components on the
same chip: voltage disturbances can extend to other components in the same rack in
a data center [15]. It is, therefore, important to monitor the security of all the devices
within a data center or a cloud system where FPGAs are available to remote users.

5.4.6 Countermeasures

Since the primary mechanism of injecting faults in a multitenant FPGA is based


on increasing critical path delays, operating a circuit at a lower frequency adds
protection at the cost of decreased performance. However, there is no precise safety
margin that guarantees that remote undervolting exploits will not inject faults, which
is why it is essential to create solutions for protecting current systems against fault-
injection exploits and to design new systems for better reliability when subjected to
such attacks.
An important category of countermeasures focuses on the detection of voltage
drops to alert the system operators or directly take action to stop the attack. Voltage-
drop detection can leverage TDC sensors, such as those presented in Sect. 5.3.1.
Some solutions focus on cloud FPGAs and try to localize the source of the voltage
drop, while embedding sensors in a way that does not interfere with user opera-
tion [49, 56]. Disturbance detection can be combined with mechanisms to prevent
faulty computation results from propagating to the output, where they could be read by the adversary [43]. If the attack is detected, a mechanism such as LoopBreaker
can be triggered, to remove the assaulting circuit from the shared FPGA [53].
Alternatively, frequency scaling could be triggered, to temporarily lower the clock
frequency and allow increased resilience to voltage disturbances [40]. New defense
strategies could protect future devices and systems. For example, FPGAs could be
built with LUTs designed to ensure that input-to-output delays are less sensitive to
supply voltage changes [2], if associated costs allow it.
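At their core, the detection-based countermeasures threshold a stream of delay-sensor samples: a TDC reading tracks the supply voltage, so a sudden drop relative to a calibrated baseline signals a potential attack. The following sketch illustrates the idea only; the window size and threshold are our own choices, not taken from any cited design:

```python
from collections import deque

def voltage_drop_monitor(samples, window=16, threshold=0.9):
    """Flag sample indices where a delay-sensor reading falls below
    `threshold` times the moving baseline of the previous `window`
    samples, indicating a candidate undervolting event."""
    history = deque(maxlen=window)
    alarms = []
    for i, s in enumerate(samples):
        if len(history) == window:
            baseline = sum(history) / window
            if s < threshold * baseline:
                alarms.append(i)   # sustained drop relative to baseline
        history.append(s)
    return alarms

# Nominal readings around 100, with an injected voltage-drop event.
trace = [100] * 32 + [70] * 4 + [100] * 16
assert voltage_drop_monitor(trace) != []
```

In a deployed system, the alarm would feed a response mechanism such as disabling the offending region or scaling the clock frequency, as discussed above.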

5.5 Conclusions

FPGAs are becoming ubiquitous based on the flexibility and hardware parallelism
they provide. Today, FPGAs are commonly found in cyber-physical systems and
data centers. However, access to low-level FPGA hardware resources can lead to
security issues, especially considering a multitenant FPGA scenario consistent with
common cloud and data center practices. This chapter focused on two security
threats to remote FPGAs, power side-channel and fault-injection attacks, resulting
from power-delivery network sharing and the possibility of an adversary deploying
carefully crafted malicious FPGA circuits. After discussing the threat models and
the requirements for the attacks, several malicious circuit implementations and their
deployment were given. The success of the attacks was demonstrated on a range
of FPGA families and platforms, including cloud FPGAs, confirming the extent of
the threat FPGA multitenancy poses. Lastly, an overview of a selected subset of
proposed approaches for protecting against power side-channel and fault-injection
attacks was given. This chapter demonstrated the practical risk of multitenancy on
cloud FPGAs. While multitenancy offers many potential benefits, there remains a
need for countermeasure deployment and FPGA-based cloud system redesign to
ensure a high level of security for future multitenant FPGA systems.

Acknowledgments This work is partially supported by the Swiss National Science Foundation
(grant No. 182428), by armasuisse Science and Technology, and by the EU Horizon 2020
Programme under grant agreement No 957269 (EVEREST).

References

1. AES Encryption Core. (2019). http://www.aoki.ecei.tohoku.ac.jp/crypto/.


2. Ahmed, I., Shen, L. L., & Betz, V. (2020). Optimizing FPGA logic circuitry for variable voltage
supplies. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 28(4), 890–903.
3. Ahmed, M. K., Mandebi, J., Saha, S. K., & Bobda, C. (2022). Multi-tenant cloud FPGA: A
survey on security. arXiv.
4. Amazon. (2019). AWS EC2 FPGA GitHub. https://github.com/aws/aws-fpga/tree/master.
5. AMD. (2023). Using directives Vivado design suite user guide: Implementation (UG904).
https://docs.xilinx.com/r/en-US/ug904-vivado-implementation.
6. Azouaoui, M., Poussier, R., Standaert, F., & Verneuil, V. (2019). Key enumeration from the
adversarial viewpoint. In 18th smart card research and advanced applications conference
(CARDIS 2019) (pp. 252–67). Springer, Prague.
7. Azure, M. (2023). Machine Learning. https://azure.microsoft.com/en-us/pricing/details/
machine-learning/.
8. Bobda, C., Mbongue, J. M., Chow, P., Ewais, M., Tarafdar, N., Vega, J. C., Eguro, K., Koch,
D., Handagala, S., Leeser, M., et al. (2022). The future of FPGA acceleration in datacenters
and the cloud. ACM Transactions on Reconfigurable Technology and Systems, 15(3), 1–42.
9. Brier, E., Clavier, C., & Olivier, F. (2004). Correlation power analysis with a leakage
model. In Cryptographic hardware and embedded systems—CHES ’04 (pp. 16–29). Springer,
Cambridge.
10. Glowacz, C., Grosso, V., Poussier, R., Schüth, J., & Standaert, F.-X. (2015). Simpler and
more efficient rank estimation for side-channel security assessment. In International workshop
on fast software encryption (pp. 117–29). Istanbul, Turkey.
11. Compute optimized type family with FPGA. (2022). https://www.alibabacloud.com/help/en/
elastic-compute-service/latest/compute-optimized-type-family-with-fpga.
12. Elnaggar, R., Chaudhuri, J., Karri, R., & Chakrabarty, K. (2022). Learning malicious circuits
in FPGA bitstreams. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 42(3), 726–39.
13. FPGA-based Amazon EC2 F1 computing instances. (2023). https://aws.amazon.com/ec2/
instance-types/f1/.
14. Genesys ZU. (2022) Zynq UltraScale+ MPSoC development board. https://digilent.com/
reference/programmable-logic/genesys-zu/reference-manual.
15. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2020). C3APSULe: Cross-FPGA covert-
channel attacks through power supply unit leakage. In 2020 IEEE symposium on security and
privacy (pp. 1728–41). IEEE, San Francisco.
16. Glamočanin, O., Coulon, L., Regazzoni, F., & Stojilović, M. (2020). Are cloud FPGAs really
vulnerable to power analysis attacks? In Design, Automation and Test in Europe Conference
and Exhibition (DATE) (pp. 1–4). IEEE, Grenoble.
17. Glamočanin, O., Kostić, A., Kostić, S., & Stojilović, M. (2023). Active wire fences for
multitenant FPGAs. In 26th international symposium on design and diagnostics of electronic
circuits systems (DDECS) (pp. 13–20). IEEE, Tallinn.
18. Glamočanin, O., Mahmoud, D. G., Regazzoni, F., & Stojilović, M. (2023). Cloud FPGA
security—practical implementations of remote power side-channel and fault-injection attacks
on multitenant FPGAs—artifacts. https://github.com/mirjanastojilovic/remote-fpga-attacks-
book-chapter.
19. Gnad, D. R., Oboril, F., & Tahoori, M. B. (2017). Voltage drop-based fault attacks on
FPGAs using valid bitstreams. In Proceedings of the 27th international conference on field-
programmable logic and applications (FPL) (pp. 1–7). IEEE, Ghent.
20. Gnad, D. R. E., Nguyen, C. D. K., Gillani, S. H., & Tahoori, M. B. (2021). Voltage-based covert
channels using FPGAs. ACM Transactions on Design Automation of Electronic Systems, 26(6),
1–25.
21. Gnad, D. R. E., Oboril, F., Kiamehr, S., & Tahoori, M. B. (2016). Analysis of transient voltage
fluctuations in FPGAs. In 2016 international conference on field-programmable technology
(FPT) (pp. 12–19). IEEE, Xi’an.
22. Gravellier, J., Dutertre, J. M., Teglia, Y., & Loubet-Moundi, P. (2019). High-speed ring
oscillator based sensors for remote side-channel attacks on FPGAs. In 2019 international
conference on ReConFigurable computing and FPGAs (ReConFig) (pp. 1–8). IEEE, Cancun.
23. Gravellier, J., Dutertre, J. M., Teglia, Y., Loubet-Moundi, P., & Olivier, F. (2019). Remote side-
channel attacks on heterogeneous SoC. In 18th smart card research and advanced applications
conference (CARDIS 2019) (pp. 109–25). Springer, Prague.
24. Gross, M., Krautter, J., Gnad, D., Gruber, M., Sigl, G., & Tahoori, M. (2023). FPGANeedle:
Precise remote fault attacks from FPGA to CPU. In Proceedings of the 28th Asia and South
Pacific design automation conference (pp. 358–64). ACM, Tokyo.
25. Hoozemans, J., Peltenburg, J., Nonnemacher, F., Hadnagy, A., Al-Ars, Z., & Hofstee, H. P.
(2021). FPGA acceleration for big data analytics: Challenges and opportunities. IEEE Circuits
and Systems Magazine, 21(2), 30–47.
26. Hsing, H. (2019). Tiny AES. https://opencores.org/projects/tiny_aes.
27. Hu, W., Zhang, L., Ardeshiricham, A., Blackston, J., Hou, B., Tai, Y., & Kastner, R. (2017).
Why you should care about don’t cares: Exploiting internal don’t care conditions for hardware
Trojans. In 2017 IEEE/ACM international conference on computer-aided design (ICCAD) (pp.
707–13). Irvine, CA, USA.
28. Huawei. (2023). FPGA accelerated cloud server—Huawei cloud. https://www.huaweicloud.
com/en-us/product/fcs.html.
29. Intel® programmable acceleration card (PAC) with Intel® Arria® 10 GX FPGA data
sheet. (2020). https://www.intel.com/content/www/us/en/docs/programmable/683226/current/
introduction-rush-creek.html.
30. Kocher, P., Jaffe, J., & Jun, B. (1999). Differential power analysis. In Advances in Cryptology—
CRYPTO ’99 (pp. 387–97). Santa Barbara, CA, USA.
31. Korczyc, J., & Krasniewski, A. (2012). Evaluation of susceptibility of FPGA-based circuits to
fault injection attacks based on clock glitching. In 15th international symposium on design and
diagnostics of electronic circuits systems (DDECS) (pp. 171–74). IEEE, Talinn.
32. Krautter, J., Gnad, D. R. E., Schellenberg, F., Moradi, A., & Tahoori, M. B. (2019). Active
fences against voltage-based side channels in multi-tenant FPGAs. In 2019 IEEE/ACM
international conference on computer-aided design (ICCAD) (pp. 1–8). Westminster, CO,
USA.
33. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault
attacks on shared FPGAs, suitable for DFA on AES. IACR Transactions on Cryptographic
Hardware and Embedded Systems, 2018(3), 44–68.
34. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2019). Mitigating electrical-level attacks
towards secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable
Technology and Systems, 12(3), 1–26.
35. La, T., Pham, K. D., Powell, J., & Koch, D. (2021). Denial-of-service on FPGA-based cloud
infrastructures—attack and defense. IACR Transactions on Cryptographic Hardware and
Embedded Systems, 2021(3), 441–464.
36. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale + FPGAs. ACM Transactions on
Reconfigurable Technology and Systems, 13(3), 15:1–15:31.
37. Lee, W., Wang, Y., Cui, T., Nazarian, S., & Pedram, M. (2014). Dynamic thermal manage-
ment for FinFET-based circuits exploiting the temperature effect inversion phenomenon. In
Proceedings of the 2014 international symposium on low power electronics and design (pp.
105–10). ACM, La Jolla California.
38. Li, H., Tang, Y., Que, Z., & Zhang, J. (2022). FPGA accelerated post-quantum cryptography.
IEEE Transactions on Nanotechnology, 21, 685–691.
39. Luo, Y., Gongye, C., Fei, Y., & Xu, X. (2021). DeepStrike: Remotely-guided fault injection
attacks on DNN accelerator in cloud-FPGA. In 58th ACM/IEEE design automation conference
(DAC) (pp. 295–300). San Francisco, CA, USA.
40. Luo, Y., & Xu, X. (2020). A quantitative defense framework against power attacks on multi-
tenant FPGA. In Proceedings of the 39th international conference on computer-aided design
(pp. 1–9). ACM, New York.
41. Mahmoud, D., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant FPGAs.
In Design, automation and test in europe conference and exhibition (DATE) (pp. 1745–50).
IEEE, Florence.
42. Mahmoud, D. G., Dervishi, D., Hussein, S., Lenders, V., & Stojilović, M. (2022). DFAulted:
Analyzing and exploiting CPU software faults caused by FPGA-driven undervolting attacks.
IEEE Access, 10, 134199–134216.
43. Mahmoud, D. G., Hu, W., & Stojilović, M. (2020). X-attack: Remote activation of satisfiability
don’t-care hardware Trojans on shared FPGAs. In Proceedings of the 30th international con-
ference on field-programmable logic and applications (FPL) (pp. 185–92). IEEE, Gothenburg.
44. Mahmoud, D. G., Hussein, S., Lenders, V., & Stojilović, M. (2022). FPGA-to-CPU undervolt-
ing attacks. In Design, automation and test in Europe conference and exhibition (DATE) (pp.
999–1004). IEEE, Virtual Event.
45. Mahmoud, D. G., Lenders, V., & Stojilović, M. (2022). Electrical-level attacks on CPUs,
FPGAs, and GPUs: Survey and implications in the heterogeneous era. ACM Computing
Surveys, 55(3), 1–40.
46. Mangard, S., Oswald, E., & Popp, T. (2007). Power analysis attacks—revealing the secrets of
smart cards. Springer, New York.
47. Martín, H., Korak, T., Millán, E. S., & Hutter, M. (2015). Fault attacks on STRNGs: Impact of
glitches, temperature, and underpowering on randomness. IEEE Transactions on Information
Forensics and Security, 10(2), 266–277.
48. Matas, K., La, T. M., Pham, K. D., & Koch, D. (2020). Power-hammering through glitch
amplification—attacks and mitigation. In 28th annual international symposium on field-
programmable custom computing machines (FCCM) (pp. 65–69). IEEE, Fayetteville.
49. Mirzargar, S. S., Renault, G., Guerrieri, A., & Stojilović, M. (2020). Nonintrusive and adaptive
monitoring for locating voltage attacks in virtualized FPGAs. In IEEE international conference
on field programmable technology (FPT) (pp. 1–2). IEEE, Maui.
50. Moini, S., Deric, A., Li, X., Provelengios, G., Burleson, W., Tessier, R., & Holcomb, D. (2022).
Voltage sensor implementations for remote power attacks on FPGAs. ACM Transactions on
Reconfigurable Technology and Systems, 16(1), 1–21.
51. Moini, S., Li, X., Stanwicks, P., Provelengios, G., Burleson, W., Tessier, R., & Holcomb,
D. (2020). Understanding and comparing the capabilities of on-chip voltage sensors against
remote power attacks on FPGAs. In 63rd International midwest symposium on circuits and
systems (MWSCAS) (pp. 941–44). IEEE, Springfield.
52. Moini, S., Tian, S., Holcomb, D., Szefer, J., & Tessier, R. (2021). Remote power side-channel
attacks on BNN accelerators in FPGAs. In Design, automation and test in Europe conference
and exhibition (DATE) (pp. 1639–44). IEEE.
53. Nassar, H., AlZughbi, H., Gnad, D. R. E., Bauer, L., Tahoori, M. B., & Henkel, J. (2021). Loop-
Breaker: Disabling interconnects to mitigate voltage-based attacks in multi-tenant FPGAs.
In 2021 IEEE/ACM international conference on computer aided design (ICCAD) (pp. 1–9).
Munich, Germany.
54. Örs, S. B., Oswald, E., & Preneel, B. (2003). Power-analysis attacks on an FPGA—first
experimental results. In Conference on cryptographic hardware and embedded systems (CHES)
(pp. 35–50). Springer, Cologne.
55. Papagiannopoulos, K., Glamočanin, O., Azouaoui, M., Ros, D., Regazzoni, F., & Stojilović,
M. (2023). The side-channel metrics cheat sheet. ACM Computing Surveys, 55(10), 1–38.
56. Provelengios, G., Holcomb, D., & Tessier, R. (2019). Characterizing power distribution attacks
in multi-user FPGA environments. In Proceedings of the 29th international conference on field-
programmable logic and applications (FPL) (pp. 194–201). IEEE, Barcelona.
57. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power wasting circuits for cloud FPGA
attacks. In Proceedings of the 30th international conference on field-programmable logic and
applications (FPL) (pp. 231–35). IEEE, Gothenburg.
58. Regazzoni, F., Yi, W., & Standaert, F. X. (2011). FPGA implementations of the AES masked
against power analysis attacks. In Proceedings of 2nd international workshop on constructive
side-channel analysis and secure design (COSADE) (pp. 1–11). Darmstadt, Germany.
59. Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation
coefficient. The American Statistician, 42(1), 59–66.
60. Salman, E., Dasdan, A., Taraporevala, F., Kucukcakar, K., & Friedman, E. G. (2007).
Exploiting setup-hold-time interdependence in static timing analysis. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 26(6), 1114–1125.
61. SAKURA-X side-channel evaluation board. (2021). https://satoh.cs.uec.ac.jp/SAKURA/
hardware/SAKURA-X.html.
62. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). An inside job: Remote
power analysis attacks on FPGAs. In Design, automation and test in Europe conference and
exhibition (DATE) (pp. 1111–1116). IEEE, Dresden.
63. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). Remote inter-
chip power analysis side-channel attacks at board-level. In 2018 IEEE/ACM international
conference on computer-aided design (ICCAD) (pp. 114:1–114:7). New York.
64. Shawahna, A., Sait, S. M., & El-Maleh, A. (2019). FPGA-based accelerators of deep learning
networks for learning and classification: A review. IEEE Access, 7, 7823–7859.
65. Spielmann, D., Glamočanin, O., & Stojilović, M. (2023). RDS: FPGA routing delay sensors
for effective remote power analysis attacks. IACR Transactions on Cryptographic Hardware
and Embedded Systems, 2023(2), 543–67.
66. Tian, S., Moini, S., Wolnikowski, A., Holcomb, D., Tessier, R., & Szefer, J. (2021). Remote
power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In Proceedings of the
international symposium on field-programmable custom computing machines (FCCM).
67. Tiri, K., & Verbauwhede, I. (2004). A logic level design methodology for a secure DPA
resistant ASIC or FPGA implementation. In Design, automation and test in Europe conference
and exhibition (DATE) (pp. 246–51). Paris, France.
68. Wu, J. (2010). Several key issues on implementing delay line based TDCs using FPGAs. IEEE
Transactions on Nuclear Science, 57(3), 1543–1548.
69. Xilinx. (2017). UltraScale architecture configurable logic block user guide (UG574). https://
docs.xilinx.com/v/u/en-US/ug574-ultrascale-clb.
70. Yeap, G. K. (2012). Practical low power digital VLSI design. Springer Science and Business
Media, Berlin.
71. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In 2018 IEEE
symposium on security and privacy (pp. 805–820). IEEE, San Francisco.
72. Zhu, H., Guo, X., Jin, Y., & Zhang, X. (2020). PowerScout: A security-oriented power delivery
network modeling framework for cross-domain side-channel analysis. In Asian hardware
oriented security and trust symposium (AsianHOST) (1–6). IEEE.
73. Zick, K. M., Srivastav, M., Zhang, W., & French, M. (2013). Sensing nanosecond-scale voltage
attacks and natural transients in FPGAs. In Proceedings of the 21st ACM/SIGDA international
symposium on field-programmable gate arrays (FPGA) (pp. 101–104). Monterey, CA, USA.
74. Zussa, L., Dutertre, J. M., Clédière, J., & Robisson, B. (2014). Analysis of the fault injection
mechanism related to negative and positive power supply glitches using an on-chip voltmeter.
In International symposium on hardware-oriented security and trust (HOST) (pp. 130–35).
IEEE, Arlington.
75. Zynq UltraScale+ MPSoC. (2022). https://www.xilinx.com/products/silicon-devices/soc/
zynq-ultrascale-mpsoc.html.
Chapter 6
Contention-Based Threats Between
Single-Tenant Cloud FPGA Instances

Ilias Giechaskiel, Shanquan Tian, and Jakub Szefer

6.1 Introduction

Public cloud infrastructures with FPGA-accelerated virtual machine (VM) instances


allow for easy, on-demand access to reconfigurable hardware that users can program
with their own designs. These VM instances can be used to accelerate machine
learning, image and video manipulation, or genomic applications, for example [5].
The potential benefits of FPGA-equipped instances have led numerous cloud providers, including Amazon Web Services (AWS) [14], Alibaba [3], Baidu [20], and Tencent [57], to give public users direct access to FPGAs. However, providing
users low-level access to upload their own hardware designs has resulted in serious
implications for the security of cloud users and the cloud infrastructure itself.
Several recent works have considered the security implications of shared FPGAs
in the cloud and have demonstrated covert-channel [29] and side-channel [32]
attacks in this multi-tenant setting. However, today’s cloud providers, such as AWS
with their F1 instances, only offer “single-tenant” access to FPGAs. In the single-
tenant setting, each FPGA is fully dedicated to the one user who rents it, while
many other users may be performing computations in parallel using their separate,

The first two authors contributed equally to this work.

I. Giechaskiel
Independent Researcher, London, UK
e-mail: ilias@giechaskiel.com
S. Tian · J. Szefer
Yale University, New Haven, CT, USA
e-mail: shanquan.tian@yale.edu; jakub.szefer@yale.edu

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 137
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_6
138 I. Giechaskiel et al.

dedicated FPGAs within the same server. Once an FPGA is released
by a user, it can then be assigned to the next user who rents it. This can lead to
temporal thermal covert channels [59], where heat generated by one circuit can be
later observed by other circuits that are loaded onto the same FPGA. Such channels
are slow (less than 1 bps) and are only suitable for covert communication, since
they require the two parties to coordinate and keep being scheduled on the same
physical hardware one after the other. Other means of covert communication in
the single-tenant setting do not require assignment to the same FPGA chip. For
example, multiple FPGA boards in servers share the same power supply, and prior
work has shown the potential for such shared power supplies to leak information
between FPGA boards [30]. However, the resulting covert channel was also slow
(less than 10 bps) and was only demonstrated in a lab setup.
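Despite the different media (heat, shared power supplies, or, as shown later in this chapter, a shared bus), these covert channels share one structure: the sender modulates contention on the shared resource during fixed time slots, and the receiver decodes bits by thresholding its own measured latency or bandwidth in each slot. A medium-agnostic sketch, with illustrative names and values of our own:

```python
def encode(bits, busy=5.0, idle=1.0):
    """Sender side: one contention level per bit slot; heavy use of the
    shared resource transmits a 1, staying idle transmits a 0."""
    return [busy if b else idle for b in bits]

def decode(latencies, threshold=3.0):
    """Receiver side: a slot latency above the calibrated threshold
    means the sender was contending for the resource, i.e., a 1."""
    return [1 if lat > threshold else 0 for lat in latencies]

message = [1, 0, 1, 1, 0, 0, 1]
# In a real channel the receiver measures these latencies itself; here
# the sender's contention level stands in for them, plus slight noise.
measured = [lat + 0.2 for lat in encode(message)]
assert decode(measured) == message
```

The achievable bit rate is set by how short the slots can be made before noise on the shared resource makes the threshold test unreliable, which is why the slow thermal channel reaches under 1 bps while bus-level contention can be orders of magnitude faster.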
Fingerprinting FPGA instances using physical unclonable functions (PUFs) [58,
60] is another single-tenant security topic that has been previously explored.
Fingerprinting allows users to partially map the infrastructure and gain insight
about the allocation of FPGAs (e.g., how likely a user is to be re-assigned to the
same physical FPGA they used before), but fingerprinting by itself does not lead
to information leaks. A more recent fingerprinting-related work explored mapping
FPGA infrastructures using PCIe contention to find which FPGAs are co-located in
the same non-uniform memory access (NUMA) node within a server [61]. However,
no prior work has successfully launched a cross-VM covert- or side-channel attack
in a real cloud FPGA setting.
By contrast, our work shows that shared resources can be used to leak infor-
mation across separate virtual machines running on the FPGA-accelerated F1
instances in AWS data centers. In particular, we use the contention of the PCIe
bus to not only demonstrate a new, fast covert channel (reaching up to 20 kbps)
but also identify patterns of activity based on the PCIe signatures of different
users’ Amazon FPGA Images (AFIs). This includes detecting when co-located
VMs are initialized, or slowing down the programming of other users’ FPGAs,
and more generally degrading the transfer bandwidth between the FPGA and the
host VM. Our attacks do not require special privileges or potentially malicious
circuits such as ring oscillators (ROs) or time-to-digital converters (TDCs) and thus
cannot easily be detected through static analysis or design rule checks (DRCs) that
cloud providers may perform. We further introduce three new methods of finding
co-located instances that are in the same physical server: (a) through reducing the
network bandwidth via PCIe contention, (b) through resource contention of the non-
volatile memory express (NVMe) SSDs that are accessible from each F1 instance
via the PCIe bus, and (c) through the common thermal signatures obtained from
the decay rates of each FPGA’s DRAM modules. Our work therefore shows that
single-tenant attacks in real FPGA-accelerated cloud environments are practical and
demonstrates several ways to infer information about the operations of other cloud
users and their FPGA-accelerated virtual machines or the data center itself.
6 Contention-Based Threats Between Single-Tenant Cloud FPGA Instances

6.1.1 Contributions

In summary, the contributions of our work are:


1. Demonstrating the first FPGA-based covert channel between separate virtual
machines, reaching 20 kbps with 99% accuracy
2. Characterizing the cross-VM covert-channel accuracy and bandwidth trade-offs
across different operating systems
3. Inferring information about the behavior of other users through the PCIe
signatures of their Amazon FPGA Images (AFIs)
4. Detecting when co-located VM instances with FPGAs are initialized
5. Performing long-term monitoring of data center activity
6. Slowing down communication between the host and the FPGA, thereby degrad-
ing the FPGA programming times and other application data transfers
7. Identifying network- and SSD-based interference mechanisms and covert chan-
nels between separate F1 users
8. Exploiting the thermal signatures of DRAM modules to identify F1 instances
which are on separate NUMA nodes but share the same server

6.1.2 Chapter Organization

The remainder of the chapter is organized as follows. Section 6.2 provides the
background on today’s deployments of FPGAs in public cloud data centers and
summarizes related work. Section 6.3 discusses typical FPGA-accelerated cloud
servers and PCIe contention that can occur among the FPGAs, while Sect. 6.4
evaluates our fast, PCIe-based, cross-VM channel. Using the ideas from the covert
channel, Sect. 6.5 investigates how to infer information about other VMs through
their PCIe traffic patterns, including detecting the initialization of co-located VMs,
long-term PCIe monitoring of data center activity, and slowing down PCIe traffic
on adjacent instances. Section 6.6 then presents alternative sources of information
leakage due to network bandwidth contention, shared SSDs, and thermal signatures
of DRAM modules. The chapter concludes in Sect. 6.7.

6.2 Background and Related Work

This section first provides a brief background on the F1 instances from Amazon
Web Services (AWS) [14] that are evaluated in this work, with a focus on their
architecture (Sect. 6.2.1) and programming model (Sect. 6.2.2). It also summarizes
related work in the area of FPGA cloud security (Sect. 6.2.3).

6.2.1 AWS F1 Instance Architecture

AWS has offered FPGA-accelerated virtual machine instances to users since late
2016 [4]. These so-called F1 instances are available in three sizes: f1.2xlarge,
f1.4xlarge, and f1.16xlarge, where the instance name represents twice the
number of FPGAs it contains (so f1.2xlarge has 1 FPGA, while f1.4xlarge
has 2, and f1.16xlarge has 8 FPGAs). Each instance is allocated 8 virtual CPUs
(vCPUs), 131 GB of DRAM, and 470 GB of NVMe SSD storage per FPGA. For
example, f1.4xlarge instances have 16 vCPUS, 262 GB of DRAM, and 940 GB
of SSD space [14], since they contain 2 FPGAs.
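The per-FPGA allocations follow a simple proportional rule, which can be expressed as a short C illustration of the figures quoted above (the struct and function names are ours, not an AWS API):

```c
/* Per-instance resources on F1, derived from the per-FPGA allocations
 * quoted above: 8 vCPUs, 131 GB DRAM, and 470 GB NVMe SSD per FPGA. */
struct f1_resources { int fpgas, vcpus; double dram_gb, ssd_gb; };

/* The instance size (2, 4, or 16 in f1.Nxlarge) is twice the FPGA count. */
static struct f1_resources f1_from_size(int size)
{
    struct f1_resources r;
    r.fpgas   = size / 2;
    r.vcpus   = 8 * r.fpgas;
    r.dram_gb = 131.0 * r.fpgas;
    r.ssd_gb  = 470.0 * r.fpgas;
    return r;
}
```

For example, `f1_from_size(4)` yields 2 FPGAs, 16 vCPUs, 262 GB of DRAM, and 940 GB of SSD, matching the f1.4xlarge figures above.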
Each FPGA board is attached to the server over an x16 PCIe Gen 3 bus. In
addition, each FPGA board contains four DDR4 DRAM chips, totaling 68.7 GB
of memory per FPGA board [14]. These memories are separate from the server’s
DRAM and are directly accessible by each FPGA. The F1 instances use Virtex
UltraScale+ XCVU9P chips [14], which contain over 1.1 million lookup tables
(LUTs), 2.3 million flip-flops (FFs), and 6.8 thousand digital signal processing
(DSP) blocks [67].
Each server contains 8 FPGA boards, which are evenly split between two non-
uniform memory access (NUMA) nodes, as shown in Fig. 6.1 [11, 14, 61]. AWS
servers containing FPGAs have two Intel Xeon E5-2686 v4 (Broadwell) processors,
connected through an Intel QuickPath Interconnect (QPI) link. Each processor
forms its own NUMA node with its associated DRAM and four FPGAs attached
as PCIe devices. Due to this architecture, an f1.2xlarge instance may be on the
same NUMA node as up to three other f1.2xlarge instances or on the same
NUMA node as one other f1.2xlarge instance and one f1.4xlarge instance
(which uses 2 FPGAs). Conversely, an f1.4xlarge instance may share the
same NUMA node with up to two f1.2xlarge instances or one f1.4xlarge
instance. Finally, as f1.16xlarge instances use up to all 8 FPGAs in the server,
they do not share any resources with other instances since both NUMA nodes of
the server are fully occupied.

Fig. 6.1 AWS servers contain 8 FPGAs divided between two NUMA nodes [61]

In this work, we are able to produce a more fine-grained view of the server and PCIe
topologies due to additional sources of contention via NVMe SSDs and network
interface controller (NIC) cards.

6.2.2 Programming AWS F1 Instances

Users of F1 instances do not retain entirely unrestricted control over the underlying
hardware but instead need to adapt their hardware designs to fit within a predefined
architecture. In particular, user designs are defined as “custom logic (CL)” modules
that interact with external interfaces through the cloud-provided “shell,” which
hides physical aspects such as clocking logic and I/O pinouts (including for PCIe
and DRAM) [29, 60]. This restrictive shell interface further prevents users from
accessing identifier resources, such as eFUSE and Device DNA primitives, which
could be used to distinguish between different FPGA boards [29, 60]. Finally,
users cannot directly upload bitstreams to the FPGAs. Instead, they generate a
design checkpoint (DCP) file using Xilinx’s tools and then provide it to Amazon to
create the final bitstream (AFI), after it has passed a number of design rule checks
(DRCs). The checks, for example, include prohibiting combinatorial loops such as
ring oscillators as a way of protecting the underlying hardware [28, 29], although
alternative designs bypassing these restrictions have been proposed [29, 55].

6.2.3 Related Work

Since the introduction of FPGA-accelerated cloud computing, a number of
researchers have been exploring the security aspects of FPGAs in the cloud. A
key feature differentiating such research from prior work on FPGA security outside
of cloud environments is the threat model, which assumes remote attackers without
physical access to or the ability to modify the FPGA boards.

6.2.3.1 PCIe-Based Threats

The Peripheral Component Interconnect Express (PCIe) standard provides a high-
bandwidth, point-to-point, full-duplex interface for connecting peripherals within
servers. Existing work has shown that PCIe switches can cause bottlenecks in
systems with multiple graphics processing units (GPUs) [21, 25, 27, 53, 54], leading
to severe stalls due to their scheduling policy [41]. In terms of PCIe contention
in FPGA-accelerated cloud environments, prior work has shown that different
driver implementations result in different overheads [64] and that changes in PCIe
bandwidth can be used to co-locate different instances on the same server [61].
In parallel to this work, PCIe contention was used for side-channel attacks which
can recover the workload of GPUs and NICs via changes in the delay of PCIe
responses [56]. Our work is similar but presents the first successful cross-VM
attacks using PCIe contention on a public cloud. Moreover, by going beyond just
PCIe, our work is able to deduce cross-NUMA-node co-location using a DRAM
thermal fingerprinting approach.

6.2.3.2 Power-Based Threats

Computations that cause data-dependent power consumption can result in
information leaks that can be detected even by adversaries without physical access to
the device under attack. For example, it is known that a shared power supply in
a server can be used to leak information between different FPGAs, where one
FPGA modulates power consumption and the other measures the resulting voltage
fluctuations [30]. However, such work results in low transmission rates (below
10 bps) and has only been demonstrated in a lab environment.
Other work has shown that it is possible to develop stressor circuits which
modulate the overall power consumption of an FPGA and generate substantial heat,
for instance, by using ring oscillators or transient short circuits [1, 2, 34]. These
large power draws can be used for fault attacks [37] or as Denial-of-Service (DoS)
attacks [39] which make the hardware unavailable for an extended period of time.
Such attacks could also prematurely age FPGAs due to sustained excessive
heat [19]. Our work has instead focused on information leaks and non-destructive
reverse-engineering of the cloud infrastructure.

6.2.3.3 Thermal-Based Threats

It is now well-known that it is possible to implement temperature sensors suitable
for thermal monitoring on FPGAs using ring oscillators [23], whose frequency
drifts in response to temperature variations [42, 43, 63, 70]. A receiver FPGA could
thus use a ring oscillator to observe the ambient temperature of a data center. For
example, existing work [59] has explored a new type of temporal thermal attack:
heat generated by one circuit can be later observed by other circuits that are loaded
onto the same FPGA. This type of attack is able to leak information between
different users of an FPGA who are assigned to the same FPGA over time. However,
the bandwidth of temporal attacks is low (less than 1 bps), while our covert channels
can reach a bandwidth of up to 20 kbps.

6.2.3.4 DRAM-Based Threats

Recent work has shown that direct control of the DRAM connected to the FPGA
boards can be used to fingerprint them [60]. This can be combined with existing
work [61] to build a map of the cloud data centers where FPGAs are used. Such
fingerprinting does not by itself, however, help with cross-VM covert channels, as
it does not provide co-location information. By contrast, our PCIe, NIC, SSD, and
DRAM approaches are able to co-locate instances in the same server and enable
cross-VM covert channels and information leaks.

6.2.3.5 Multi-tenant Security

This work has focused on the single-tenant setting, where each user gets full
access to the FPGA; this setting reflects the current environment offered by cloud
providers. However, there is also a large body of security research in the multi-
tenant context, where a single FPGA is shared by multiple, logically (and potentially
physically) isolated users [26, 36, 46, 48, 71]. For example, several researchers have
shown how to recover information about the structure [62, 72] or inputs [49] of
machine learning models or cause timing faults to reduce their accuracy [24, 52].
Other work in this area has shown that crosstalk due to routing wires [28] and
logic elements [31] inside the FPGA chips can be used to leak static signals,
while voltage drops due to dynamic signals can lead to covert-channel [29], side-
channel [32, 35], and fault [50] attacks. Several works have also tried to address such
issues to enable multi-tenant applications, proposing static checks [38, 40], voltage
monitors [33, 45, 51], or a combination of the two [39]. Our work on PCIe, SSD,
and DRAM threats is orthogonal to such work but is directly applicable to current
cloud FPGA deployments.

6.3 PCIe Contention in Cloud FPGAs

The user’s custom logic running on the FPGA instances can use the shell to
communicate with the server through the PCIe bus. Users cannot directly control the
PCIe transactions but instead perform reads and writes to predefined address ranges
through the shell. These memory accesses get translated into PCIe commands and
PCIe data transfers between the server and the FPGA. Users may also set up direct
memory access (DMA) transfers between the FPGA and the server. By designing
hardware modules with low logic overhead, users can generate transfers fast enough
to saturate the PCIe bandwidth. In fact, because of the shared PCIe bus within each
NUMA node, these transfers can create interference and bus contention that affects
the PCIe bandwidth of other users. The resulting performance degradation can be
used for detecting co-location [61] or, as we show in this work, for fast covert-
and side-channel attacks, breaking the isolation between otherwise logically and
physically separate VM instances.
In our covert-channel analysis (Sect. 6.4), we show that the communication
bandwidth is not identical for all pairs of FPGAs in a NUMA node. In particular,
this suggests that the 4 PCIe devices are not directly connected to each CPU, but
instead likely go through two separate switches, forming the hierarchy shown in
Fig. 6.2, improving the deduced model of prior work [61].

Fig. 6.2 The newly deduced PCIe configuration for F1 servers based on the experiments in this
work: each CPU has two PCIe links, each of which provides connectivity to two FPGAs, an NVMe
SSD, and a NIC through a PCIe switch

Although not publicly confirmed by AWS, this topology is similar to the one described for P4d instances,
which contain 8 GPUs [7]. As a result, even though all 4 FPGAs in a NUMA node
contend with one another, the covert-channel bandwidth is highest among those
sharing a PCIe switch, due to the bottleneck imposed by the shared link (Sect. 6.4).
We also expand on the model to show that the PCIe switches provide connectivity
to an NVMe SSD drive and a network interface card, thereby expanding the attack
surface by identifying additional sources of PCIe contention. Finally, as we show in
Sect. 6.4.5, how effectively these PCIe links can be saturated also depends on
the operating system and kernel configuration, not just on the user-level software
and underlying hardware architecture.

6.4 Cross-VM Covert Channels

In this section, we describe our implementation for the first cross-FPGA covert
channel on public clouds (Sect. 6.4.1) and discuss our experimental setup
(Sect. 6.4.2). We then analyze bandwidth vs. accuracy trade-offs (Sect. 6.4.3),
before investigating the impact of receiver and transmitter transfer sizes on the
covert-channel accuracy for a given covert-channel bandwidth (Sect. 6.4.4). We
finish the section by discussing differences in the covert-channel bandwidth between
VMs using different operating systems (Sect. 6.4.5). Side channels and information
leaks based on PCIe contention from other VMs are discussed in Sect. 6.5.

6.4.1 Covert-Channel Implementation

Our covert channel is based on saturating the PCIe link between the FPGA and the
server, so, at their core, both the transmitter and the receiver consist of (a) an FPGA
image that interfaces with the host over PCIe with minimal latency in accepting
write requests or responding to read requests and (b) software that attaches to the
FPGA and repeatedly writes to (or reads from) the mapped base address register
(BAR). These requests are translated to PCIe transactions, transmitted over the data
and physical layers, and then relayed to the custom logic (CL) hardware through
the shell (SH) logic as AXI reads or writes. The transmitter stresses its PCIe link to
transmit a 1 but remains idle to transmit a 0, while the receiver keeps measuring
its own bandwidth during the transmission period (the receiver is thus identical to
a transmitter that sends a 1 during every measurement period). The receiver then
classifies the received bit as a 1 if the bandwidth B has dropped below a threshold
T and as 0 otherwise.
The two communicating parties need to have agreed upon some minimal
information prior to the transmissions: the specific data center to use (region and
availability zone, e.g., us-east-1e), the time t to start communications, and the
initial measurement period, expressed as the time difference between successive
transmissions δ. All other aspects of the communication can be handled within the
channel itself, including detecting that the two parties are on the same NUMA node
or increasing the bandwidth by decreasing δ. To ensure that the PCIe link returns
to idle between successive measurements, transmissions stop before the end of the
measurement interval, i.e., the transmission duration d satisfies d < δ. Note that
synchronization is implicit due to the receiver and transmitter having access to
a shared wall clock time, e.g., via the network time protocol (NTP). Figure 6.3
provides a high-level overview of our covert-channel mechanism.
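The receiver's decision rule reduces to a per-interval threshold comparison, which can be sketched in C as follows (the function and variable names are ours, for illustration only):

```c
#include <stddef.h>

/* Decode nbits covert-channel bits from per-interval bandwidth samples:
 * a bandwidth below threshold T means the transmitter was stressing the
 * shared PCIe link during that interval (bit 1); otherwise the link was
 * idle (bit 0). */
static void decode_bits(const double *bandwidth, size_t nbits,
                        double threshold, int *bits)
{
    for (size_t i = 0; i < nbits; i++)
        bits[i] = (bandwidth[i] < threshold) ? 1 : 0;
}
```

With a threshold of, say, 2,000 MBps, an interval measured at 900 MBps decodes as a 1, while one at 3,000 MBps decodes as a 0.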
Before they can communicate, the two parties (Alice and Bob in the example
of Fig. 6.3) first need to ensure that they are co-located on the same NUMA
node within the server. To do so, they can launch multiple instances at or near
an agreed-upon time and attempt to detect whether any of their instances are co-
located by sending handshake messages and expecting a handshake response, using
the same setup information as for the covert channel itself (i.e., the time t′ to start
the communication, the measurement duration δ, and location information such as
the data center region and availability zone). They additionally need to have agreed

Fig. 6.3 Example cross-VM covert communication: The transmitter (Alice) sends the ASCII byte
“H,” represented as 01001000 in binary, to the receiver (Bob) in 8 intervals by stressing her PCIe
bandwidth to transmit a 1 and remaining idle to transmit a 0. If Bob’s FPGA bandwidth B drops
below a threshold T , he detects a 1, otherwise a 0 is detected. To ensure no residual effects after
each transmission, the time difference δ between successive measurements is slightly larger than
the transmission duration d

Fig. 6.4 The process to find a pair of co-located f1.2xlarge instances using PCIe contention
uses the covert-channel mechanism to check for pre-agreed handshake messages: Alice transmits
the handshake message with her first FPGA and waits to see if Bob acknowledges the handshake
message. In parallel, Bob measures the bandwidths of all his FPGAs. In this example, Bob detects
the contention in his seventh FPGA during the fourth handshake attempt. Note that Alice and Bob
can rent any number of FPGAs for finding co-location, with five and seven used as an example

on the handshake message H, which determines the per-handshake measurement
duration Δ. This co-location process is summarized in Fig. 6.4. Note that as prior
work has shown [61], by launching multiple instances, the probability of co-
location is high, but the two parties would have to agree on a “timeout” approach.
For instance, they could have a maximum number of handshake attempts M, after
which they re-launch instances at time t′ + M · Δ or launch additional instances
for every unsuccessful handshake attempt (e.g., after attempt 1, Alice and Bob both
launch a new instance, while Alice terminates instance 1).
In our experiments, we typically launch 5 instances per user at the same time
in the same region and availability zone, have a 128-bit handshake message H,
and consider the co-location attempt successful if the message was recovered with
≥ 80% accuracy. Other fixed parameters, such as the measurement duration or
transfer sizes, were informed by early manual experiments and the work in [61] to
ensure we can reliably detect co-location. Note that these parameters can differ
from those used after co-location has been established. For instance, co-location
detection can use low-bandwidth transfers (e.g., 200 bps) that are reliable across
all NUMA node setups; the bandwidth can then be increased as part of the setup
process once the parties know they are co-located.
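The ≥ 80% matching criterion used to declare a handshake successful can be sketched as follows (a minimal illustration with our own names, not the authors' actual code):

```c
#include <stddef.h>

/* Fraction of decoded handshake bits matching the expected message H. */
static double handshake_accuracy(const int *decoded, const int *expected,
                                 size_t nbits)
{
    size_t match = 0;
    for (size_t i = 0; i < nbits; i++)
        if (decoded[i] == expected[i])
            match++;
    return (double)match / (double)nbits;
}

/* Co-location is declared when at least 80% of the bits are recovered. */
static int co_located(const int *decoded, const int *expected, size_t nbits)
{
    return handshake_accuracy(decoded, expected, nbits) >= 0.8;
}
```

If the instances are not co-located, the decoded bits are effectively random, so a 128-bit message is very unlikely to clear the 80% bar by chance.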
During the co-location process, the two communicating parties can also establish
what the threshold T should be and whether the communication bandwidth should
be increased. As shown in [61], the PCIe bandwidth of an instance drops from over
3,000 MBps to under 1,000 MBps when there is an external PCIe stressor. As a
result, this threshold T could be simply hardcoded (at, say, 2,000 MBps), or be
adaptive, as the mid-point between the minimum .bm and maximum .bM bandwidths
recorded during a successful handshake. The latter is the approach we use in our
work: if the two instances are not co-located, .bm ≈ bM , and the decoded bits will
6 Contention-Based Threats Between Single-Tenant Cloud FPGA Instances 147

be random and hence will not match the expected handshake message H . If the two
instances are co-located, .bM ⪢ bm (assuming H contains at least one 0 and at least
one 1), so any bit 1 will correspond to a bandwidth .b1 ≈ bm ⪡ (bm + bM )/2 = T
and any bit 0 will result in bandwidth .b0 ≈ bM ⪢ (bm + bM )/2 = T .
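A minimal sketch of this adaptive mid-point threshold (names are ours, for illustration):

```c
#include <stddef.h>

/* Adaptive decoding threshold: the mid-point between the minimum (bm)
 * and maximum (bM) bandwidths recorded during a successful handshake. */
static double adaptive_threshold(const double *bw, size_t n)
{
    double bm = bw[0], bM = bw[0];
    for (size_t i = 1; i < n; i++) {
        if (bw[i] < bm) bm = bw[i];
        if (bw[i] > bM) bM = bw[i];
    }
    return (bm + bM) / 2.0;
}
```

For handshake samples ranging from roughly 900 to 3,100 MBps, this yields a threshold of 2,000 MBps, consistent with the hardcoded value suggested above.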

6.4.2 Experimental Setup

For the majority of our experiments, we use VMs with AWS FPGA Developer
Amazon Machine Image (AMI) [17] version 1.8.1, which runs CentOS 7.7.1908,
and develop our code with the hardware and software development kits (HDK/SDK)
version 1.4.15 [8]. We vary the operating systems used for the transmitters and
receivers and significantly improve the covert-channel bandwidth in Sect. 6.4.5.
For our FPGA bitstream, we use the unmodified CL_DRAM_DMA image provided
by AWS (agfi-0d132ece5c8010bf7) [10] for both the transmitter and the
receiver designs. This design maps the 137.4 GB PCIe application physical function
(AppPF) BAR4 (a 64-bit prefetchable base address register (BAR) [9]) to the
68.7 GB of FPGA DRAM (twice). It is not necessary to use the FPGA DRAM modules
themselves: merely responding to PCIe requests so that the interfaces do not hang,
as in the CL_HELLO_WORLD example [12], is sufficient. Our custom-written
software maps the FPGA DRAM modules via the BAR4 register and uses the
BURST_CAPABLE flag to support write-combining for higher performance. Data
transfers are implemented using memcpy, getting similar performance to the AWS
benchmarks [6]. Algorithm 1 summarizes our software routine in pseudocode.
Unless otherwise noted, we perform experiments with “spot” instances in the
us-east-1 (North Virginia) region in availability zone d for cost reasons, though
prior work has shown that PCIe contention is present with both spot and on-
demand instances, in all regions and different availability zones where F1 instances
are currently supported, namely ap-southeast-2 (Sydney), eu-west-1 (Ire-
land), us-east-1 (North Virginia), and us-west-2 (Oregon) [61]. Although
the results presented are for instances launched by a single user, it should also
be noted that we have successfully created a cross-VM covert channel between
instances launched by two different users.

6.4.3 Bandwidth vs. Accuracy Trade-Offs

Using our co-location mechanism, we are able to easily find 4 distinct
f1.2xlarge instances that are all in the same NUMA node and then measure
the covert-channel accuracy for different bandwidths, i.e., different measurement
parameters d and δ. Specifically, we test (d, δ) from (0.1 ms, 0.2 ms) to
(9 ms, 10 ms), corresponding to transmission rates between 5 kbps and 100 bps.¹
For these experiments, the receiver keeps transferring 2 kB chunks of data from
the host, while the transmitter repeatedly sends 64 kB of data in each transmission
period (i.e., until the end of the interval d). These parameters are explored separately
in Sect. 6.4.4 below.

Algorithm 1 Cross-VM transmissions on AWS F1 instances
1: procedure INIT(nb)
2:   local ← malloc(nb)                    ⊳ Allocate a local buffer with nb bytes
3:   remote ← attach(BAR4, BURST_CAPABLE)  ⊳ Attach to the FPGA and get the pointer
4:   return local, remote
5: procedure STRESS(local, remote, d, nb)
6:   count ← 0                             ⊳ Count of repetitions
7:   start ← cur_time()                    ⊳ Get the current time
8:   while cur_time() < start + d do
9:     memcpy(remote, local, nb)           ⊳ Copy nb bytes from the local buffer to the FPGA
10:    count++                             ⊳ Increment the number of successful repetitions
11:  return count
12: procedure MEASUREBANDWIDTH(start, msg, d, δ, nb)
13:  local, remote ← init(nb)              ⊳ Get the local and remote buffers
14:  counts[len(msg)] = {0}                ⊳ Initialize the repetition counts
15:  for i = 0; i < len(msg); i = i + 1 do
16:    while cur_time() < start + i ∗ δ do ⊳ Wait until the agreed start time
17:      sleep()
18:    if msg[i] then                      ⊳ Only stress for 1 bits and record counts
19:      counts[i] = stress(local, remote, d, nb)
20:  return counts
The results of our experiments, shown in Fig. 6.5, indicate that we can create
a fast covert channel between any two FPGAs in either direction: at 200 bps and
below, the accuracy of the covert channel is 100%, with the accuracy at 250 bps
dropping to 99% for just one pair. At 500 bps, three of the six possible pairs
can communicate at 100% accuracy, while one pair can communicate with 97%
accuracy at 2 kbps (and sharply falls to 70% accuracy even at 2.5 kbps—though
in Sect. 6.4.5 we show that bandwidths of 20 kbps at 99% accuracy are possible). It
should be noted that, as expected, the bandwidth within any given pair is symmetric,
i.e., it remains the same when the roles of the transmitter and the receiver are
reversed. As the VMs occupy a full NUMA node, there should not be any impact
from other users’ traffic. The variable bandwidth between different pairs is therefore
likely due to the PCIe topology.

¹ Section 6.4.5 shows that different setups can result in even higher bandwidths exceeding 20 kbps.


Fig. 6.5 Bandwidth and accuracy for covert-channel transmissions between any pair of FPGAs,
among the four FPGAs in the same NUMA node. Each FPGA pair is color-coded, with transmitters
indicated through different markers and receivers through different line styles. For any pair of
FPGAs X and Y, the bandwidth is approximately the same in each direction, i.e., the bandwidth
from X to Y is approximately the same as the bandwidth from Y to X. Communication is possible
between any two FPGAs in the NUMA node, but the bandwidths for different pairs diverge

6.4.4 Transfer Sizes

In this set of experiments, we fix d = 4 ms and δ = 5 ms (i.e., a covert-channel
bandwidth of 200 bps) and vary the transmitter and receiver transfer sizes. Figure 6.6
first shows the per-pair channel accuracy for different transmitter sizes. The results
show that at 4 kB and above, the covert-channel accuracy is 100%, while it becomes
much lower at smaller transfer sizes. This is because sending smaller chunks of
data over PCIe results in lower bandwidth due to the associated PCIe overhead of
each transaction. For example, in one 4 ms transmission, the transmitter completes
140,301 transfers of 1 B each, corresponding to a PCIe bandwidth of only 1 B ×
140,301/4 ms = 33.5 MBps. However, in the same time, a transmitter can complete
1,890 transfers of 4 kB, for a PCIe bandwidth of 4 kB × 1,890/4 ms = 1.8 GBps.
The maximum transfer size of 1 MB was chosen to ensure that multiple transfers
were possible within each transfer interval without ever interfering with the next
measurement interval.
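This arithmetic can be captured in a small helper of our own (treating 1 kB as 1,024 bytes and 1 MB as 1,024² bytes), which reproduces the two figures above up to rounding:

```c
/* Effective PCIe bandwidth in MBps (2^20 bytes per second), given the
 * number of completed transfers of chunk_bytes each within a window of
 * ms milliseconds. */
static double effective_mbps(double chunk_bytes, double transfers, double ms)
{
    double bytes_per_sec = chunk_bytes * transfers / (ms / 1000.0);
    return bytes_per_sec / (1024.0 * 1024.0);
}
```

For example, 140,301 transfers of 1 B in 4 ms evaluate to about 33.5 MBps, while 1,890 transfers of 4 kB (4,096 B) in the same window evaluate to about 1.8 GBps.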
The results of the corresponding experiments for receiver transfer sizes are shown
in Fig. 6.7. Similar to the transmitter experiments, very small transfer sizes are
unsuitable for the covert channel due to the low resulting bandwidth. However,
unlike in the transmitter case, large receiver transfer sizes are also problematic, as
the number of transfers completed within each measurement interval is too small to
be able to distinguish between external transmissions and the inherent measurement
noise.


Fig. 6.6 Covert-channel accuracy for different transmitter transfer sizes. Each chunk transmitted
over PCIe needs to be ≥ 4 kB to ensure an accuracy of 100% at 200 bps between any two FPGAs
in the NUMA node


Fig. 6.7 Covert-channel accuracy for different receiver transfer sizes. Chunks between 64 B and
4 kB are suitable for 100% accuracies, but sizes outside this range result in a drop in accuracy for
at least one pair of FPGAs in the NUMA node

6.4.5 Operating Systems

Starting with FPGA AMI version 1.10.0, Amazon has provided AMIs based on
Amazon Linux 2 (AL2) [18] alongside AMIs based on CentOS [17] (both using the
Xilinx-provided XOCL PCIe driver). AL2 uses a Linux kernel that has been tuned
for the AWS infrastructure [15] and may therefore impact the performance of the
covert channel. Since the attacker does not have control over the victim’s VM, it
is necessary to explore the effect of the operating system on our communication
channel and thus experiment with both types of operating systems as receivers and
transmitters. We use the co-location methodology of Sect. 6.4.1 to find different
instances that are in the same NUMA node and report the accuracy of our cross-
VM covert channel from bandwidths as low as 0.1 kbps to as high as 66.6 kbps.
As described in Sect. 6.3 and shown in Fig. 6.2, each NUMA node consists of 4
distinct f1.2xlarge instances, and each one can run either AL2 or CentOS. As
Sect. 6.4.3 identified, the bandwidth between different FPGA pairs will depend on
where they are in the PCIe topology, so to get an accurate estimate of the maximum
cross-instance covert-channel bandwidth for different setups, we run experiments
on three different configurations of full NUMA nodes. The first experiment contains
one instance running CentOS and three instances running AL2 (Fig. 6.8), the second

[Plot: accuracy (%) vs. covert-channel bandwidth (0.10 kbps to 66.67 kbps) for each transmitter, receiver, and FPGA pair; instance D runs CentOS, the rest AL2]

Fig. 6.8 Bandwidth and accuracy for covert-channel transmissions between any pair of four co-
located instances, where three instances are running Amazon Linux 2 (AL2) and the last one is
running CentOS. Each FPGA pair is color-coded, with transmitters indicated through different
markers and receivers through different line styles

[Plot: accuracy (%) vs. covert-channel bandwidth (0.05 kbps to 66.67 kbps) for each transmitter, receiver, and FPGA pair; instances A and B run AL2, C and D run CentOS]

Fig. 6.9 Bandwidth and accuracy for covert-channel transmissions between any pair of four co-
located instances, where two instances are running Amazon Linux 2 (AL2) and the other two are
running CentOS

[Plot: accuracy (%) vs. covert-channel bandwidth (0.05 kbps to 66.67 kbps) for each transmitter, receiver, and FPGA pair; instance A runs AL2, the rest CentOS]

Fig. 6.10 Bandwidth and accuracy for covert-channel transmissions between any pair of four co-
located instances, where only one instance is running Amazon Linux 2 (AL2) and the remaining
are running CentOS

contains two instances with CentOS and two with AL2 (Fig. 6.9), while the last
setup has three CentOS VMs and one AL2 VM (Fig. 6.10). For each experiment,
we collect the covert-channel bandwidths for all pairs of instances and in both
directions of communication, resulting in 12 different bandwidth vs. accuracy sets
of measurements.
Figure 6.8 shows the covert channel bandwidths for all FPGA pairs, where one
instance is running CentOS and the remaining three are running AL2. For any pair of
AL2 instances, the covert-channel accuracy at 20 kbps is over 90% (in fact, reaching

Table 6.1 Cross-VM covert channel bandwidth for different receiver and transmitter operating
systems. ∗A bandwidth of 5.9 kbps at 95% accuracy could be sustained across repeated individual
experiments outside of a full NUMA node.

Transmitter       Receiver          Bandwidth    Accuracy
CentOS            CentOS            2.0 kbps     97%
CentOS            Amazon Linux 2    0.3 kbps∗    100%
Amazon Linux 2    CentOS            2.5 kbps     94%
Amazon Linux 2    Amazon Linux 2    20.0 kbps    99%

Table 6.2 Cross-FPGA covert channel bandwidth achieved by different works. The PCIe con-
tention approach of our work achieves bandwidths that are several orders of magnitude faster than
prior research and are performed on a commercial public cloud. ∗Achieved only in a lab setup.

Cloud    Method                      Reference    Bandwidth       Accuracy
TACC     Thermal Attack              [59]         ≪ 0.1 bps       100%
—∗       Voltage Stressing           [30]         6.1 bps         99%
AWS      PCIe Contention (CentOS)    This work    2,000.0 bps     97%
AWS      PCIe Contention (AL2)       This work    20,000.0 bps    99%

99%) and for a subset of those pairs remains above 80% at even 40 kbps. However,
when a CentOS instance is involved, the bandwidth drops to 0.5 kbps, for either
direction of communication.
Figures 6.9 and 6.10 show that, depending on where the instances are on the PCIe
topology, the bandwidth can vary. Indeed, Fig. 6.9 shows that the bandwidth for an
AL2 transmitter and a CentOS receiver can reach 2.5 kbps at 98% accuracy, but
CentOS transmitters and AL2 receivers generally have bandwidths below 0.5 kbps,
though in repeated individual experiments (outside of a full NUMA node), we have
been able to get a channel at 5.9 kbps at 95% accuracy. The CentOS-CentOS results
of Fig. 6.10 are consistent with those of Sect. 6.4.3, with bandwidths between
250 bps and 1.4 kbps for all but the fastest pair of instances. Table 6.1 summarizes
these results, while Table 6.2 compares the achieved bandwidths to prior work in
cross-FPGA communications.
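The accuracy figures above stem from a simple per-bit threshold decoder: the receiver averages its achieved PCIe bandwidth over each bit period and outputs a 1 whenever contention pushes that average below a cutoff. The following sketch is our own minimal reconstruction of that decoding step; the function names and the 1,200 MBps cutoff are illustrative, not taken from the actual measurement code.

```python
def decode_bits(bandwidths_mbps, threshold_mbps=1200.0):
    """One bandwidth sample per bit period: PCIe contention (low
    bandwidth) encodes a 1, an idle bus (high bandwidth) encodes a 0."""
    return [1 if bw < threshold_mbps else 0 for bw in bandwidths_mbps]

def accuracy(sent_bits, received_bits):
    """Percentage of bit positions that match between the two streams."""
    matches = sum(s == r for s, r in zip(sent_bits, received_bits))
    return 100.0 * matches / len(sent_bits)

# Example: four bit periods as measured by the receiver, in MBps.
samples = [1850.0, 640.0, 1790.0, 710.0]  # idle, busy, idle, busy
print(decode_bits(samples))               # -> [0, 1, 0, 1]
print(accuracy([0, 1, 0, 1], decode_bits(samples)))  # -> 100.0
```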

6.5 Cross-VM Side-Channel Leaks

In this section, we explore what kinds of information malicious adversaries can infer
about computations performed by non-cooperating victim users that are co-located in
the same NUMA node in different, logically isolated VMs. We first show that the
PCIe activity of an off-the-shelf video-processing AMI from the AWS Marketplace
leaks information about the resolution and bitrate properties of the video being
processed, allowing adversaries to infer the activity of different users (Sect. 6.5.1).
We then show that it is possible to detect when a VM in the same NUMA node
is being initialized (Sect. 6.5.2) and more generally monitor the PCIe bus over a
long period of time (Sect. 6.5.3). We finally show that PCIe contention can be used

(a) PCIe contention can reveal the initialization process of co-located VM instances and traffic
patterns of applications in the same NUMA node to a passive eavesdropping adversary. (b) Stress-
ing the PCIe bandwidth using the FPGA can slow down the FPGA, NIC, and SSD bandwidth of
other users. SSD contention for users on the same PCIe switch is also possible.

Fig. 6.11 Summary of the (a) passive monitoring side-channel and (b) active interface contention-
based attacks presented in Sects. 6.5 and 6.6. Bandwidths are not drawn to scale

for interference attacks, including slowing down the programming of the FPGA
itself, or of other data transfer communications between the FPGA and the host VM
(Sect. 6.5.4). The attacks of this and the next section are summarized in Fig. 6.11.

6.5.1 Inferring User Activity

To help users accelerate various types of computations on F1 FPGA instances,
the AWS Marketplace lists numerous virtual machine images created and sold by
independent software vendors [16]. Users can purchase instances with pre-loaded
software and hardware FPGA designs for data analytics, machine learning, and
other applications and deploy them directly on the AWS Elastic Cloud Compute
(EC2) platform. AWS Marketplace products are usually delivered as AMIs, each of
which provides the virtual machine setup, system environment settings, and all the
required programs for the application that is being sold. AWS Marketplace instances
which use FPGAs naturally use PCIe to communicate between the software and the
hardware of the purchased instance. In this section, we first introduce an AMI we
purchased to test as the victim software and hardware design (Sect. 6.5.1.1) and
then discuss the recovery of potentially private information from the victim AMI’s
activity by running a co-located receiver VM that monitors the victim’s PCIe activity
(Sect. 6.5.1.2).

6.5.1.1 Experimental Setup

Among the different hardware accelerator solutions for cloud FPGAs, in this
section, we target video processing using the DeepField AMI, which leverages

FPGAs to accelerate the video super-resolution (VSR) algorithm to convert low-
resolution videos to high-resolution ones [22]. The DeepField AMI is based on
Amazon Linux 2 and sets up the system environment to make use of the proprietary,
pre-trained neural network models [22]. To use the AMI, the virtual machine
software first loads the AFI onto the associated FPGA using the load_afi
command to set up the FPGA board on the F1 instance. The ffmpeg program,
which is customized for the FPGA platform, is called to convert an input video of
no more than 1280 × 720 in resolution to a high-resolution video with a maximum
output resolution of 3840 × 2160. As discussed above, the DeepField AMI handles
all of the software and provides the FPGA image for the acceleration of the VSR
algorithm. Users do not know how the FPGA logic operates, since it is provided as a
pre-compiled AFI. However, PCIe contention allows us to reveal potentially private
information from such example AMIs by running an attacker VM to measure the
PCIe activity of the victim. In particular, this type of high-performance computing
for image and video processing inevitably requires massive data transfers between
the FPGA and the host processor through PCIe. These AMI behaviors are reflected
in the PCIe bandwidth trace.
For our experiments, we first launch a group of f1.2xlarge instances running
the DeepField AMI to find a co-located F1 instance pair using our PCIe contention
approach of Sect. 6.4. After verifying that the attacker and the victim are co-located,
we set up the attacker VM in monitoring mode, which continuously measures the
PCIe bandwidth, similar to the receiver in the covert-channel setup. The monitoring
program has been configured to measure bandwidth with a measurement duration
of δ = 20 ms and a data transfer duration of d = 18 ms.
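Conceptually, the monitoring loop times a fixed amount of PCIe traffic inside every measurement interval: it transfers data for up to d, computes the achieved rate, and then idles until the interval of length δ elapses. The sketch below captures only this timing skeleton; the transfer callable is our placeholder for the actual FPGA DMA transfers, which are not shown.

```python
import time

def monitor_bandwidth(transfer, delta_s=0.020, d_s=0.018, n_samples=5):
    """Record the achieved transfer rate (MBps) once per measurement
    interval of length delta_s, transferring data for roughly d_s each
    time. `transfer(d_s)` is a placeholder for a PCIe DMA transfer and
    must return the number of bytes it moved."""
    trace = []
    for _ in range(n_samples):
        start = time.monotonic()
        nbytes = transfer(d_s)
        elapsed = time.monotonic() - start
        trace.append(nbytes / elapsed / 1e6)  # bytes/s -> MBps
        leftover = delta_s - (time.monotonic() - start)
        if leftover > 0:  # idle out the rest of the interval
            time.sleep(leftover)
    return trace
```

Under contention from a co-located instance, the recorded rates drop, which is precisely the signal exploited throughout this section.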
The victim VM then runs the unmodified DeepField AMI to convert different
lower resolution videos to higher resolution ones using the ffmpeg program.
In our experiments, each run of the DeepField AMI takes approximately 5 min,
and each bandwidth trace in the attacker VM lasts for 10 min, thus covering both
the conversion process and periods of inactivity. As discussed in Sect. 6.5.1.2, by
comparing the bandwidth traces among the different experiments, we observe that
we can (a) infer information about whether the victim is actively in the process of
converting a video and (b) deduce certain parameters of the videos.

6.5.1.2 Leaking Private Information from Marketplace AMIs

We now show that private information regarding the activities of co-located
instances can be revealed through the PCIe bandwidth traces. Figure 6.12 shows the
PCIe bandwidth measured by the attacker, while the victim is running the DeepField
AMI on an f1.2xlarge instance. We test different input video files, with three
different resolutions (360p, 480p, and 720p) and two frame rates of 15 and 30 frames
per second (FPS). All videos have a 16:9 aspect ratio, and, except for the resolutions
and frame rates, the contents of the input video files are otherwise identical. The
output video produced for each conversion always has a resolution of 3840 × 2160
but maintains the same frame rate as the original input. The beginning and ending

Fig. 6.12 PCIe bandwidth traces collected by the attacker, while the victim runs the DeepField
AMI to perform VSR conversions with input videos of different resolutions and frame rates. Within
each sub-figure, the red lines label the start and the end of the VSR conversion on the FPGA

of the VSR conversion on the FPGA can be clearly seen in Fig. 6.12, where vertical
red lines delineating the start and end of the process have been added for clarity. We
observe that the PCIe bandwidth drops during the conversion and that runtime is
reduced as the input resolution or the input frame rate decreases. For example, the
runtime for a 720p, 30 FPS video (Fig. 6.12f) is approximately twice as long as for
a 15 FPS one (Fig. 6.12c).
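Recovering the start, end, and duration of a conversion from such a trace reduces to locating the longest sustained run of samples below the victim's idle bandwidth; the run's duration then discriminates between resolutions and frame rates. The sketch below is our own illustration, with an assumed idle level and margin.

```python
def busy_window(trace_mbps, idle_mbps=1850.0, margin_mbps=30.0):
    """Return (start, end) sample indices of the longest contiguous run
    of samples more than margin_mbps below the idle level, or None."""
    busy = [bw < idle_mbps - margin_mbps for bw in trace_mbps]
    best, run_start = None, None
    for i, b in enumerate(busy + [False]):  # sentinel flushes the last run
        if b and run_start is None:
            run_start = i
        elif not b and run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best

# Eight samples: the conversion occupies samples 2-5.
trace = [1850, 1848, 1700, 1690, 1705, 1698, 1849, 1851]
print(busy_window(trace))  # -> (2, 6)
```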

6.5.2 Detecting Instance Initialization

In our experiments, we have thus far only focused on covert communication and
side-channel information leakage between VM instances that have already been
initialized. By contrast, in this section, we show for the first time that the instance
initialization process can also be detected by monitoring the bandwidth of the PCIe
bus. Indeed, on AWS, there is a time lag between when a user requests that an
instance with a target AMI be launched and when it is provisioned, initialized, and
ready for the user to connect to it over SSH. This process can take multiple minutes
and, as we show in this work, causes significant PCIe traffic that is measurable by
co-located adversaries.
For our experiments, we first create an f1.2xlarge instance (named INST-
A) and start the PCIe bandwidth monitoring program on it. We then launch five

Fig. 6.13 Detecting the VM initialization process for co-located f1.2xlarge instances by
monitoring the PCIe traffic. In this experiment, five new instances are created in sequence, of
which the last three happen to be co-located with the monitoring instance

f1.2xlarge instances in sequence, named INST-B-i, for i ∈ {1, 2, 3, 4, 5}.
For each INST-B-i, we attempt to complete a handshake with INST-A at a pre-
determined time and then terminate the instance before launching the next one. As
the monitoring program on INST-A is running throughout the experiments (including
when no INST-B is running), it is able to capture the initialization, handshake, and
termination of any potentially co-located instances.
Figure 6.13 plots the PCIe bandwidth of the monitoring instance INST-A, along
with three reference lines for each of the five instance initializations:
• “Create VM” denotes the request for initializing a new VM.
• “Finish Init” means that the VM has been initialized, which we define as being
able to SSH into the VM instance.
• “Terminate VM” indicates the request for shutting down the VM.
For each VM, we load the PCIe transmitter AFI and software and attempt a
handshake between the “Finish Init” and “Terminate VM” steps. The handshake
results suggest that the last three instances are co-located with INST-A but the first
two are not. Incidentally, the last three instances also cause large PCIe bandwidth
drops (from 1,600 MBps to 600 MBps) during their initialization process, as shown
in Fig. 6.13. The PCIe bandwidth stays stable for the first two instances, as they
are not co-located with INST-A. Note that this bandwidth drop occurs before we
can SSH into the instances and therefore reflects the initialization process itself.
Moreover, it is worth noting that the termination step is not reflected in the PCIe
trace, indicating a potentially lazy termination process that does not require heavy
data transfers. The ability to detect when other users are being allocated to the
same NUMA node not only helps with the covert-channel handshaking process of
Sect. 6.4.1 but can also alert non-adversarial users to potential interference from
other users so that they can tweak their applications to expect slower transfers.
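Deciding which launches are co-located then reduces to checking, for each instance's window between its "Create VM" and "Finish Init" events, whether the monitored trace contains a large, sustained bandwidth drop. The following sketch is our illustration; the sampling period, drop threshold, and minimum duration are assumptions.

```python
def colocated(trace_mbps, create_s, finish_s, sample_period_s=1.0,
              drop_mbps=1000.0, min_samples=5):
    """True if at least min_samples bandwidth samples inside the
    [create_s, finish_s) window fall below drop_mbps (co-located
    initializations were observed to drop the bandwidth from about
    1,600 MBps to 600 MBps)."""
    lo = int(create_s / sample_period_s)
    hi = int(finish_s / sample_period_s)
    return sum(bw < drop_mbps for bw in trace_mbps[lo:hi]) >= min_samples
```

For example, a trace sampled once per second that sits at 1,600 MBps except for a 30 s dip to 600 MBps during a launch window would classify that launch as co-located.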

[Two panels: receiver bandwidth (MBps, 1,840 to 1,880) vs. time (0 to 4 h)]

(a) April 25, after 5pm (b) April 25, after 9pm

Fig. 6.14 Long-term PCIe-based data center monitoring between the evening of April 25 and the
early morning of April 26, with d = 4 ms and δ = 5 ms on an f1.2xlarge on-demand instance

6.5.3 Long-Term PCIe Monitoring

In this section, we present the results of measuring the PCIe bandwidth for two on-
demand f1.2xlarge instances in the us-east-1 region (availability zone e).
These experiments took place between 5pm on April 25, 2021 and 2am on April 26
(Eastern Time, as us-east-1 is located in North Virginia). For both sets of four-
hour measurements, the first f1.2xlarge instance (Fig. 6.14) is measuring with
a transmission duration of d = 4 ms and a measurement duration of δ = 5 ms,
while the second instance (Fig. 6.15) has d = 18 ms and δ = 20 ms. For the
first instance, the PCIe link remains mostly idle during the evening (Fig. 6.14a)
but experiences contention in the first night hour (Fig. 6.14b). The second instance
instead appears to be co-located with other FPGAs that make heavier use of
their PCIe bandwidth. During the evening measurements (Fig. 6.15a), the PCIe
bandwidth drops momentarily below 1,200 MBps during the third hour and below
800 MBps during the fourth hour. These large drops are likely due to co-located
VMs being initialized and not normal user traffic, as described in Sect. 6.5.2. It
also experiences sustained contention in the third hour of the night measurement
(Fig. 6.15b). Although the bandwidth in the two instances is comparable, the 5 ms
measurements are noisier compared to the 20 ms ones. Finally, note that, generally,
our covert-channel code results in bandwidth drops of over 800 MBps, while the
activity of other users tends to cause drops of less than 50 MBps, suggesting that
noise from external traffic has a minimal impact on our channel.

6.5.4 Interference Attacks

The PCIe contention mechanism we have uncovered can also be used to degrade the
performance of co-located applications by other users. Indeed, as we have shown in

[Two panels: receiver bandwidth (MBps, 1,840 to 1,880) vs. time (0 to 4 h)]

(a) April 25, after 5pm (b) April 25, after 9pm

Fig. 6.15 Long-term PCIe-based data center monitoring on a different f1.2xlarge on-demand
instance with d = 18 ms and δ = 20 ms

[Two panels: receiver bandwidth (MBps) vs. time (0 to 60 min); left axis 1,950 to 1,960 MBps, right axis 550 to 650 MBps]

(a) Without a PCIe stressor (b) With a PCIe stressor

Fig. 6.16 PCIe bandwidth traces collected by the monitoring instance while the victim instance
runs the DeepField AMI to perform a VSR conversion of the same video five consecutive times, (a)
without and (b) with the third instance acting as a PCIe stressor. Within each sub-figure, the red
lines label the start and the end of the VSR conversion on the FPGA

a prior work [61], the bandwidth can fall from 3 GBps to under 1 GBps using just
one PCIe stressor (transmitter) and to below 200 MBps when using two stressors.
To exemplify how the reduced PCIe bandwidth can affect user applications,
we again find a full NUMA node with four co-located VMs but only use three
of them. Specifically, the first VM is running the DeepField AMI video super-
resolution (VSR) algorithm [22] and represents the victim user. The second VM
is monitoring the PCIe bandwidth (similar to the experiments of Sect. 6.5.1), while
the third acts as a PCIe stressor. The fourth one is unused and left idle, to avoid
unintended interference. To further minimize any other external effects, the VSR
computation in Fig. 6.16 is repeated five times in sequence. As Fig. 6.16 shows, the
PCIe bandwidth measured by the monitoring instance drops from over 1,950 MBps
to under 650 MBps, and the conversion time in the victim instance increases by
33%. In addition to slowing down the victim application, when using a stressor, the
attacker can extract even more fine-grained information about the victim. Indeed, as

Table 6.3 Resources used by the three AFIs tested.

AFI       Lookup Tables (LUTs)    Registers    CARRY8 Chains    Multiplexers
Small     6,728                   8,369        75               72
Medium    139,020                 220,061      2,529            4,741
Large     310,462                 321,713      7,316            28,597

[Two bar charts: FPGA programming time (s) for the Small, Medium, and Large AFIs, with and without contention; left axis 0 to 12 s, right axis 0 to 30 s]

(a) Partial reconfiguration of CL only (b) Reconfiguration of SH and CL

Fig. 6.17 The FPGA programming time can be slowed down by heavy PCIe traffic from co-
located instances. In (a), only the user’s custom logic is reconfigured, while in (b), both the FPGA
shell and the custom logic are reloaded onto the FPGA. Three AFIs with different numbers of logic
resources are used

Fig. 6.16b shows, the boundary between the five repetitions becomes clear, aiding
the AMI fingerprinting attacks discussed in Sect. 6.5.1.
One particular, and perhaps unexpected, consequence of the reduced PCIe
bandwidth is a more time-consuming programming process that can, in some cases,
be more than tripled. To investigate this effect, we measure the FPGA programming
time in one of the instances (INST-A) under different conditions including:
1. Whether a PCIe bandwidth-hogging application is running on a second instance,
INST-B.
2. Whether just the custom logic or both the custom logic and FPGA shell are
reloaded with fpga-load-local-image (using the -F flag).
3. The size of the loaded AFI in terms of the logic resources used (see Table 6.3).
Because AWS uses partial reconfiguration [13], “the size of a partial bitstream is
directly proportional to the size of the region it is reconfiguring” [66], with larger
images therefore requiring more data transfers from the host to the FPGA device.
The results of our experiments are summarized in Fig. 6.17, where three AFIs
of different sizes are loaded onto INST-A with/without reloading the shell and
with/without PCIe contention on INST-B. As Fig. 6.17a shows, PCIe contention
slows down the FPGA programming of all AFIs, with the effect being more
prominent for larger AFIs, where programming has slowed down from ≈ 7 s
to ≈ 12 s. When the shell is also reloaded (Fig. 6.17b), the same pattern holds, but
the effects are even more pronounced: even reloading the small AFI slows down

from ≈ 7 s to over 20 s, while the large AFI takes over 30 s compared to ≈ 9 s
without PCIe stressing. The effect is likely not just due to the fact that the AFI needs
to be transferred to the FPGA over PCIe using the fpga-load-local-image
command, but in part also because the AFIs need to be fetched over the network
from the cloud provider’s internal servers. As we show in the next section, the
network bandwidth is also impacted by the FPGA’s PCIe activity.

6.6 Other Cross-Instance Effects

In this section, we investigate how other aspects of the hardware that is present
in F1 servers, namely Network Interface Cards (Sect. 6.6.1), NVMe SSD storage
(Sect. 6.6.2), and DRAM modules directly attached to the FPGAs (Sect. 6.6.3) leak
information that can permeate the VM instance boundary. These effects can be used
to, for example, cause interference on other users or determine that different VM
instances belong to the same server. The NIC and SSD contention-based attacks are
summarized in Fig. 6.11b.

6.6.1 Network-Based Contention

Network interface controller cards provide connectivity between a virtual machine
and the Internet through external devices such as switches and routers. NIC cards
are typically also connected to the host over PCIe and therefore share the bandwidth
with the FPGAs. To test whether the FPGA PCIe traffic has any effect on the
network bandwidth, we rent three co-located f1.2xlarge instances and test
each instance as the PCIe bandwidth-hogging stressor and use the remaining two
instances in turn to measure the network bandwidth using the speedtest-cli
program [47] (a total of six combinations).
The results for all six pairs of instances are identical: when the PCIe stressor
is not running, speedtest-cli --bytes reports a download bandwidth of
approximately 233 MBps and an upload bandwidth of 157 MBps. However, when
the stressor is running on a co-located instance, the download bandwidth drops to
100 MBps, while the upload bandwidth is reduced to 75 MBps. This means that
the PCIe stressor can demonstrably halve the network bandwidth of co-located
instances as a result of the NIC sharing the PCIe bus with the FPGAs, as shown
in Fig. 6.2. It is worth noting that our experiments did not reveal any influences in
the other direction, i.e., the PCIe and network bandwidth of co-located instances
remained the same when running a network bandwidth stressor, likely because such
a network stressor does not saturate the PCIe bus.

6.6.2 SSD Contention

Another shared resource that can lead to contention is the SSD storage
that F1 instances can access. The public specification of F1 instances notes
that f1.2xlarge instances have access to 470 GB of NVMe SSD storage,
f1.4xlarge have 940 GB, and f1.16xlarge have 4 × 940 GB [14]. This
suggests that F1 servers have four separate 940 GB SSD drives, each of which
can be shared between two f1.2xlarge instances. In this section, we confirm
our hypothesis that one SSD drive can be shared between multiple instances and
explain how this fact can be exploited to reverse-engineer the PCIe topology and
co-locate VM instances. The SSD contention we uncover can also be used for a
slow but reliable covert channel or to degrade the performance of other users,
akin to the interference attack of Sect. 6.5.4. We also demonstrate the existence
of FPGA-to-SSD contention, which is likely the result of the SSD going through
the same PCIe switch, as shown in Fig. 6.2. This topology remains consistent with
the one publicly described for GPU-based P4d instances [7], which appear to be
architecturally similar to F1 instances.

6.6.2.1 SSD-to-SSD Contention

SSD contention is tested by measuring the SSD's bandwidth using the
hdparm command with its -t option, which performs disk reads without any
data caching [44]. Measurements are averaged over repeated reads of 2 MB chunks
from the disk in a period of 3 seconds. When the server is otherwise idle, hdparm
reports the SSD read bandwidth to be over 800 MBps. However, when the other
f1.2xlarge instance that shares the same SSD stresses it using the stress
command [65] with the --io 4 --hdd 4 parameters, the bandwidth drops
below 50 MBps. The stress command with the parameters above results in 4
threads calling sync (to stress the read buffers) and another 4 threads calling
write and unlink (to stress write performance). The total number of threads
is kept to 8, to match the number of vCPUs allocated to an f1.2xlarge instance,
while all FPGAs remain idle during these experiments.
This non-uniform SSD behavior can be used for a robust covert channel with a
bandwidth of 0.125 bps with 100% accuracy. Specifically, for a transmission of bit
1, stress is called for 7 seconds, while for a transmission of bit 0, the transmitter
remains idle. The receiver uses hdparm to measure its SSD’s bandwidth and can
distinguish between contention and no-contention of the SSD resources (i.e., bits 1
and 0, respectively) using a simple threshold. The period of 8 seconds per bit also
accounts for 1 second of inactivity in every transmission, allowing the disk usage to
return to normal.
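A minimal sketch of this SSD-based channel, built from the commands described above, could look as follows; the NVMe device path and the 400 MBps decision threshold are our assumptions.

```python
import re
import subprocess
import time

BIT_PERIOD_S = 8.0      # 7 s of (in)activity plus 1 s for the disk to settle
THRESHOLD_MBPS = 400.0  # between ~800 MBps idle and <50 MBps under contention

def send_bit(bit):
    """Transmit one bit: stress the shared SSD for 7 s for a 1, stay
    idle for a 0."""
    if bit:
        subprocess.run(["stress", "--io", "4", "--hdd", "4",
                        "--timeout", "7"], check=True)
    else:
        time.sleep(7)
    time.sleep(BIT_PERIOD_S - 7)  # quiet gap before the next bit

def parse_hdparm(output):
    """Extract the MB/sec figure from the output of `hdparm -t`."""
    return float(re.search(r"=\s*([\d.]+)\s*MB/sec", output).group(1))

def receive_bit(device="/dev/nvme1n1"):  # device path is an assumption
    out = subprocess.run(["hdparm", "-t", device],
                         capture_output=True, text=True, check=True).stdout
    return 1 if parse_hdparm(out) < THRESHOLD_MBPS else 0
```

Because each bit occupies 8 s, this yields the 0.125 bps channel described above.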
The same mechanism can be exploited to degrade the performance of other
tenants. It can further be used to co-locate instances at an even finer granularity
than was previously possible. To accomplish this, we rent several f1.2xlarge
instances until we find four which form a full NUMA node through the PCIe-based

co-location approach of Sect. 6.4. We then stress the SSD in one of the four instances
and measure the SSD performance in the remaining three. We discover two pairs of
instances with mutual SSD contention, which supports our hypothesis and is also
consistent with the PCIe topology for other instance types [7].
The fact that SSD contention only exists between two f1.2xlarge instances
can be beneficial for adversaries: when the covert-channel receiver and the transmit-
ter are scheduled on two instances that share an SSD, they can communicate without
interference from other tenants in the same NUMA node.2

6.6.2.2 FPGA-to-SSD Contention

To formalize the above observations, we use the methodology described in Sect. 6.4
to find four co-located f1.2xlarge instances in the same NUMA node. Then,
for each pair of instances, we repeatedly run hdparm in the “receiver” instance
for a period of 3 minutes; in the transmitter instance, we (a) run stress for
30 s starting at the one-minute mark and (b) use our FPGA-based covert-channel
code as a stressor, which constantly transmits the bit 1 during each measurement
period, for another 30 s starting at the two-minute mark.
The results of these experiments are summarized in Fig. 6.18. During idle
periods, the SSD bandwidth is approximately 800–900 MBps. However, for the two
instances with SSD contention, i.e., pairs (A, D) and (B, C), the bandwidth drops
to as low as 7 MBps, while the stress command is running (the bandwidth for
the other instance pairs remains unaffected). When the FPGA-based PCIe stressor
is enabled, the SSD bandwidth reported by hdparm is reduced in a measurable way
to approximately 700 MBps.
We further test for the opposite effect, i.e., whether stressing the SSD can cause
a measurable difference to the FPGA-based PCIe performance. We again stress the
SSD between 60 and 90 s and stress the FPGA between 120 and 150 s. As the results
of Fig. 6.19 show, the PCIe bandwidth drops from almost 1.8 GBps to approximately
500–1,000 MBps when the FPGA stressor is enabled, but there is no significant
difference in performance when the SSD-based stressor is turned on. Similar to the
experiments of Sect. 6.6.1, this is likely because the FPGA-based stressor can more
effectively saturate the PCIe link, while the SSD-based stressor seems to be limited
by the performance of the hard drive itself, whose bandwidth when idle (800 MBps)
is much lower than that of the FPGA (1.8 GBps). In summary, using the FPGA as
a PCIe stressor can cause the SSD bandwidth to drop, but the converse is not true,
since there is no observable influence on the FPGA PCIe bandwidth as a result of
SSD activity.

2 Assuming that slots within a server are assigned randomly, the probability of getting instances
with shared SSDs given that they are already co-located in the same NUMA node is 33%: out
of the three remaining slots in the same NUMA node, exactly one slot can be in an instance that
shares the SSD.

[Plot: SSD bandwidth (MBps, 0 to 800) vs. time (0 to 180 s) for each transmitter, receiver, and FPGA pair]

Fig. 6.18 NVMe SSD bandwidth for all transmitter and receiver pairs in a NUMA node, as
measured by hdparm. Running stress between seconds 60 and 90 causes a bandwidth drop
in exactly one other instance in the NUMA node, while running the FPGA-based PCIe stressor
(between seconds 120 and 150) reduces the SSD bandwidth in all cases

[Plot: FPGA PCIe bandwidth (MBps, 600 to 1,800) vs. time (0 to 180 s) for each transmitter, receiver, and FPGA pair]

Fig. 6.19 FPGA PCIe bandwidth for all transmitter and receiver pairs in a NUMA node, as
measured by our covert-channel receiver. Running stress between seconds 60 and 90 does not
cause a bandwidth drop, but running the FPGA-based PCIe stressor (between seconds 120 and
150) reduces the bandwidth in all cases

Fig. 6.20 By alternating between AFIs that instantiate DRAM controllers or leave them uncon-
nected, the decay rate of DRAM cells can be measured as a proxy for environmental temperature
monitors [60]

6.6.3 DRAM-Based Thermal Monitoring

DRAM decay is known to depend on the temperature of the DRAM chip and its
environment [68, 69]. Since the FPGAs in cloud servers have direct access to the
on-board DRAM, they can be used as sensors for detecting and estimating the tem-
perature around the FPGA boards, supplementing PCIe traffic-based measurements.
Figure 6.20 summarizes how the DRAM decay of on-board chips can be used
to monitor thermal changes in the data center. When a DRAM module is being
initialized with some data, the DRAM cells will become charged to store the values,
with true cells storing logical 1s as charged capacitors and anti-cells storing them
as depleted capacitors. Typically, true and anti-cells are paired, so initializing the
DRAM to all ones will ensure only half of the DRAM cells will be charged, even if
the actual location of true and anti-cells is not known.
After the data has been written to the DRAM and the cells have been charged,
the DRAM refresh is disabled. Disabling DRAM refresh in the server itself is not
possible as the physical hardware on the server is controlled by the hypervisor, not
the users. However, the FPGA boards have their own DRAMs. By programming
the FPGAs with AFIs that do and do not have DRAM controllers, disabling of
the DRAM refresh can be emulated, allowing the DRAM cells to decay [60].
Eventually, some of the cells will lose enough charge to “flip” their value (for
example, data written as 1 becomes 0 for true cells since the charge has dissipated).
The DRAM data can then be read back after a fixed time T_decay, called the decay time. The number of cells that flip during this time depends on the temperature of the DRAM and its environment [69] and can therefore serve as a coarse-grained, DRAM-based temperature sensor for F1 instances.
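The measurement itself reduces to writing a known pattern, letting the cells decay for T_decay with refresh disabled, and counting the cells that flipped. A minimal sketch of the counting step (a hypothetical helper, not the code from [60]):

```python
def count_bit_flips(initial: bytes, decayed: bytes) -> int:
    """Count DRAM cells whose stored value changed during T_decay.

    For an all-ones initialization, each 1 -> 0 transition corresponds to a
    charged true cell that lost enough charge to flip.
    """
    return sum(bin(a ^ b).count("1") for a, b in zip(initial, decayed))

# A warmer environment yields more flips for the same T_decay, so the flip
# count acts as a coarse, relative temperature reading.
```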

Prior work [61] and this chapter have so far focused on information leaks due to
shared resources within a NUMA node but did not attempt to co-locate instances
that are in the same physical server but belong to different NUMA nodes. In this
section, we propose such a methodology that uses the boards’ thermal signatures,
which are obtained from the decay rates of each FPGA’s DRAM modules. To
collect these signatures, we use the method and code provided by Tian et al. [60]
to alternate between bitstreams that instantiate DRAM controllers and ones that
leave them unconnected to initialize the memory and then disable its refresh rate.
When two instances are in the same server, the temperatures of all 8 FPGAs in
an f1.16xlarge instance (and by extension the DRAM thermal signatures) are
highly correlated. However, when the instances come from different servers, the
decay rates are different and thus contain distinguishable patterns that can be used
to classify the two instances separately. This insight can be used to find FPGA
instances that are co-located in the same server, even across different NUMA nodes.

6.6.3.1 Setup and Evaluation

Our method for co-locating instances within a server has two aspects to it: first,
we show that we can successfully identify two FPGA boards as being in the same
server with high probability using their DRAM decay rates, and then we show that
by using PCIe-based co-location we can build the full profile of a server and identify
all eight of its FPGA boards, even if they are in different NUMA nodes. More
specifically, we use the open-source software by Tian et al. [60] to collect DRAM
decay measurements for several FPGAs over a long period of time and then find
which FPGAs’ DRAM decay patterns are the “closest.”
To validate our approach, we rent three f1.16xlarge instances (a total of
24 FPGAs) for a period of 24 hours and measure how “close” each pair of FPGA
traces is by calculating the total distance between their data points over the entire
measurement period for three different metrics. The first metric compares the
raw number of bit flips from the DRAM decay measurements, c_raw^i, directly. The second approach normalizes the data to fit in the [−1, 1] range, i.e., c_norm^i = (2 c_raw^i − m − M)/(M − m), where m = min_i c_raw^i and M = max_i c_raw^i. In Fig. 6.21, we show an alternative metric, which takes the difference between successive raw measurements, i.e., c_diff^i = c_raw^i − c_raw^(i−1). Note that if FPGA A is the closest to FPGA B using these metrics, then B is not necessarily the closest to A. However, if FPGA A is closest to B and B is closest to C, then A, B, and C are all in the same server.
The raw data metric has an accuracy of 75%, the normalized metric is 71%
accurate, while the difference metric succeeds in correctly pairing all FPGAs except
for one, for an accuracy of 96%. Shorter measurement periods still result in high
accuracies. For example, using the DRAM data from the first 12 hours results in
only one additional FPGA mis-identification, for an accuracy of 92%. We plot the
classification accuracy for the three metrics as a function of time in Fig. 6.22.
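The three metrics and the "closest trace" search can be sketched as follows. This is a simplified stand-in for the analysis (the actual decay traces come from the tooling of [60]), using an L1 distance between transformed traces:

```python
def transform(trace, metric):
    """Apply one of the three comparison metrics to a c_raw time series."""
    if metric == "raw":   # raw bit-flip counts c_raw^i
        return list(trace)
    if metric == "norm":  # normalize into [-1, 1]
        m, M = min(trace), max(trace)
        return [(2 * c - m - M) / (M - m) for c in trace]
    if metric == "diff":  # successive differences c_diff^i
        return [b - a for a, b in zip(trace, trace[1:])]
    raise ValueError(metric)

def closest_fpga(traces, metric="diff"):
    """Map each FPGA id to the FPGA whose decay trace is closest."""
    t = {k: transform(v, metric) for k, v in traces.items()}
    dist = lambda u, v: sum(abs(x - y) for x, y in zip(u, v))
    return {a: min((b for b in t if b != a), key=lambda b: dist(t[a], t[b]))
            for a in t}
```

FPGAs in the same server then cluster via the transitive closure of the closest relation: if A is closest to B and B is closest to C, all three are grouped together.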
In the experiments of Fig. 6.21, the c_diff metric places slots 0–4 of server A
together (along with, mistakenly, slot 0 of server B), slots 5–7 of server A as a second

[Figure 6.21 plots: DRAM bit flip count difference c_diff^i vs. date over 24 hours for panels (a) Server A, (b) Server B, and (c) Server C, with NUMA node 0 and NUMA node 1 traces drawn in different styles]

Fig. 6.21 DRAM decay traces from three f1.16xlarge instances (24 FPGAs in total) for a period of 24 hours, using the differences between successive measurements c_diff^i as the comparison metric, which results in the highest co-location accuracy of 96%. Within each server, measurements from slots in the same NUMA node have been drawn in the same style

[Figure 6.22 plot: classification accuracy (%) vs. measurement time (0–25 h) for the c_diff, c_norm, and c_raw metrics]

Fig. 6.22 Accuracy of classifying individual FPGAs as belonging to the right server as a function
of measurement time using the three different proposed metrics

group, slots 1–7 of server B as one group, and slots 0–3 and 4–7 of server C as the
two final groups. Consequently, our method successfully identifies the six NUMA
nodes without making use of PCIe contention at all.
However, by using insights about the NUMA nodes that can be extracted through
our PCIe-based experiments, the accuracy and reliability of this method can be
further increased. For example, slot 0 of server B could already be placed in the
same NUMA node as slots 1–3 using PCIe-based co-location. Leveraging the PCIe-
based co-location method, if the “closest” FPGA is known to be in the same NUMA
node due to PCIe contention and the second-closest FPGA (not in the same NUMA
node according to PCIe contention) is only farther by at most 1% compared to the
closest FPGA, then this second-closest FPGA can be identified as belonging to the

second NUMA node of the same server. In the experiment of Fig. 6.21, this approach
successfully groups all FPGAs in the three tested servers without errors.
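The refinement rule above can be stated compactly. The helper below is hypothetical; the 1% tolerance is the threshold used in the text:

```python
def second_numa_node_match(d_closest, d_second, closest_same_numa,
                           second_same_numa, tolerance=0.01):
    """Decide whether the second-closest FPGA belongs to the other NUMA node
    of the same server.

    d_closest / d_second: trace distances to the closest and second-closest
    FPGAs; *_same_numa: whether PCIe contention places that FPGA in the
    receiver's own NUMA node.
    """
    return (closest_same_numa            # closest FPGA confirmed via PCIe
            and not second_same_numa     # second-closest is in another node
            and d_second <= d_closest * (1 + tolerance))
```

When the predicate holds, the second-closest FPGA is assigned to the second NUMA node of the same server, combining the DRAM-based and PCIe-based co-location signals.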

6.7 Conclusion

This chapter introduced a novel, fast covert-channel attack between separate users
in a public, FPGA-accelerated cloud computing setting. It characterized how con-
tention of the PCIe bus can be used to create a robust communication mechanism,
even among users of different operating systems, with bandwidths reaching 20 kbps
with 99% accuracy. In addition to using PCIe contention for covert channels, this
chapter demonstrated that contention can be used to monitor or disrupt the activities
of other users, including inferring information about their applications or slowing
them down. This work further identified alternative co-location mechanisms, which
make use of network cards, SSDs, or even the DRAM modules attached to the FPGA
boards, allowing adversaries to co-locate FPGAs in the same server, even if they are
on separate NUMA nodes. More generally, this work demonstrated that malicious
adversaries can use PCIe monitoring to observe the data center server activity,
breaking the separation of privilege that isolated VM instances are supposed to
provide. With more types of accelerators becoming available on the cloud, including
FPGAs, GPUs, and TPUs, PCIe-based threats are bound to become a key aspect
of cross-user attacks. Overall, our insights showed that low-level, direct hardware
access to PCIe, NIC, SSD, and DRAM hardware creates new attack vectors that
need to be considered by both users and cloud providers alike when deciding how
to trade off performance, cost, and security for their designs: even if the endpoints of
computations (e.g., CPUs and FPGAs) are assumed to be secure, the shared nature
of cloud infrastructures poses new challenges that need to be addressed.

Acknowledgments This work was supported in part by NSF grant 1901901.

References

1. Agne, A., Hangmann, H., Happe, M., Platzner, M., & Plessl, C. (2014). Seven recipes
for setting your FPGA on fire—A cookbook on heat generators. Microprocessors and
Microsystems, 38(8), 911–919.
2. Alam, M. M., Tajik, S., Ganji, F., Tehranipoor, M., & Forte, D. (2019). RAM-Jam: Remote
temperature and voltage fault attack on FPGAs using memory collisions. In Workshop on
Fault Diagnosis and Tolerance in Cryptography (FDTC).
3. Alibaba Cloud. (2023). Instance Families. https://www.alibabacloud.com/help/doc-detail/
25378.html. Accessed 18 May 2023.
4. Amazon Web Services. (2016). Developer preview—EC2 instances (F1) with pro-
grammable hardware. https://aws.amazon.com/blogs/aws/developer-preview-ec2-instances-
f1-with-programmable-hardware/. Accessed 18 May 2023.

5. Amazon Web Services. (2018). The agility of F1: Accelerate your applications with custom
compute power. https://d1.awsstatic.com/Amazon_EC2_F1_Infographic.pdf. Accessed 18
May 2023.
6. Amazon Web Services. (2019). F1 FPGA application note: How to use write combining to
improve PCIe bus performance. https://github.com/awslabs/aws-fpga-app-notes/tree/master/
Using-PCIe-Write-Combining. Accessed 18 May 2023.
7. Amazon Web Services. (2020). Amazon EC2 P4d instances deep dive. https://aws.amazon.
com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/. Accessed 18 May 2023.
8. Amazon Web Services. (2020). Official repository of the AWS EC2 FPGA hardware and
software development kit v1.4.15. https://github.com/aws/aws-fpga/tree/v1.4.15. Accessed
18 May 2023.
9. Amazon Web Services. (2021). AWS shell interface specification. https://github.com/aws/aws-
fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md. Accessed 18 May 2023.
10. Amazon Web Services. (2021). CL_DRAM_DMA custom logic example. https://github.com/
aws/aws-fpga/tree/master/hdk/cl/examples/cl_dram_dma. Accessed 18 May 2023.
11. Amazon Web Services. (2021). F1 FPGA application note: How to use the PCIe peer-
2-peer version 1.0. https://github.com/awslabs/aws-fpga-app-notes/tree/master/Using-PCIe-
Peer2Peer. Accessed 18 May 2023.
12. Amazon Web Services. (2021). Hello World CL example. https://github.com/aws/aws-fpga/
tree/master/hdk/cl/examples/cl_hello_world. Accessed 18 May 2023.
13. Amazon Web Services. (2022). AWS FPGA - frequently asked questions. https://github.com/
aws/aws-fpga/blob/master/FAQs.md. Accessed 18 May 2023.
14. Amazon Web Services. (2023). Amazon EC2 instance types. https://aws.amazon.com/ec2/
instance-types/. Accessed 18 May 2023.
15. Amazon Web Services. (2023). Amazon Linux 2 FAQs. https://aws.amazon.com/amazon-
linux-2/faqs/. Accessed 18 May 2023.
16. Amazon Web Services. (2023). AWS marketplace. https://aws.amazon.com/marketplace.
Accessed 18 May 2023.
17. Amazon Web Services. (2023). FPGA developer AMI. https://aws.amazon.com/marketplace/
pp/prodview-gimv3gqbpe57k. Accessed 18 May 2023.
18. Amazon Web Services. (2023). FPGA developer AMI (Amazon Linux 2). https://aws.amazon.
com/marketplace/pp/prodview-iehshpgi7hcjg. Accessed 18 May 2023.
19. Amouri, A., Bruguier, F., Kiamehr, S., Benoit, P., Torres, L., & Tahoori, M. (2014). Aging
effects in FPGAs: An experimental analysis. In International Conference on Field Pro-
grammable Logic and Applications (FPL).
20. Baidu Cloud. (2023). FPGA cloud compute. https://cloud.baidu.com/product/fpga.html.
Accessed 18 May 2023.
21. Baker, G., & Lupo, C. (2017). TARUC: A topology-aware resource usability and contention
benchmark. In ACM/SPEC International Conference on Performance Engineering (ICPE).
22. BLUEDOT Inc. (2020). DeepField-SR video super resolution hardware accelera-
tor. https://www.xilinx.com/content/dam/xilinx/publications/solution-briefs/partner/xilinx-
bluedot-solution-brief.pdf. Accessed 18 May 2023.
23. Boemo, E., & López-Buedo, S. (1997). Thermal monitoring on FPGAs using ring-oscillators.
In International Workshop on Field-Programmable Logic and Applications (FPL).
24. Boutros, A., Hall, M., Papernot, N., & Betz, V. (2020). Neighbors from hell: Voltage attacks
against deep learning accelerators on multi-tenant FPGAs. In International Conference on
Field-Programmable Technology (FPT).
25. Danalis, A., Marin, G., McCurdy, C., Meredith, J. S., Roth, P. C., Spafford, K., Tipparaju, V.,
& Vetter, J. S. (2010). The scalable heterogeneous computing (SHOC) benchmark suite. In
Workshop on General-Purpose Processing on Graphics Processing Units (GPGPU).
26. Duan, S., Wang, W., Luo, Y., & Xu, X. (2021). A survey of recent attacks and mitigation on
FPGA systems. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI).
27. Faraji, I., Mirsadeghi, S. H., & Afsahi, A. (2016). Topology-aware GPU selection on multi-
GPU nodes. In IEEE International Parallel and Distributed Processing Symposium Workshops
(IPDPSW).

28. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Measuring long wire leakage with ring
oscillators in cloud FPGAs. In International Conference on Field Programmable Logic and
Applications (FPL).
29. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Reading between the dies: Cross-SLR
covert channels on multi-tenant cloud FPGAs. In IEEE International Conference on Computer
Design (ICCD).
30. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2020). C3APSULe: Cross-FPGA covert-
channel attacks through power supply unit leakage. In IEEE Symposium on Security and
Privacy (S&P).
31. Giechaskiel, I., & Szefer, J. (2020). Information leakage from FPGA routing and logic
elements. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
32. Glamočanin, O., Coulon, L., Regazzoni, F., & Stojilović, M. (2020). Are cloud FPGAs really
vulnerable to power analysis attacks? In Design, Automation & Test in Europe Conference &
Exhibition (DATE).
33. Glamočanin, O., Mahmoud, D. G., Regazzoni, F., & Stojilović, M. (2021). Shared FPGAs and
the holy grail: Protections against side-channel and fault attacks. In Design, Automation & Test
in Europe Conference & Exhibition (DATE).
34. Gnad, D. R. E., Oboril, F., & Tahoori, M. B. (2017). Voltage drop-based fault attacks on
FPGAs using valid bitstreams. In International Conference on Field Programmable Logic
and Applications (FPL).
35. Gobulukoglu, M., Drewes, C., Hunter, W., Kastner, R., & Richmond, D. (2021). Classifying
computations on multi-tenant FPGAs. In Design Automation Conference (DAC).
36. Jin, C., Gohil, V., Karri, R., & Rajendran, J. (2020). Security of cloud FPGAs: A survey. https://
arxiv.org/abs/2005.04867. Accessed 18 May 2023.
37. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault
attacks on shared FPGAs, suitable for DFA on AES. Transactions on Cryptographic Hardware
and Embedded Systems (TCHES), 2018(3), 44–68.
38. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2019). Mitigating electrical-level attacks
towards secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable
Technology and Systems (TRETS) 12(3).
39. La, T., Pham, K., Powell, J., & Koch, D. (2021). Denial-of-Service on FPGA-based cloud
infrastructures – Attack and defense. Transactions on Cryptographic Hardware and Embedded
Systems (TCHES) 2021(3), 441–464.
40. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 13(3).
41. Li, C., Sun, Y., Jin, L., Xu, L., Cao, Z., Fan, P., Kaeli, D., Ma, S., Guo, Y., & Yang, J.
(2019). Priority-based PCIe scheduling for multi-tenant multi-GPU systems. IEEE Computer
Architecture Letters (LCA), 18(2), 157–160.
42. López-Buedo, S., Garrido, J., & Boemo, E. (2000). Thermal testing on reconfigurable
computers. IEEE Design & Test of Computers (D&T), 17(1), 84–91.
43. López-Buedo, S., Garrido, J., & Boemo, E. (2002). Dynamically inserting, operating, and
eliminating thermal sensors of FPGA-based systems. IEEE Transactions on Components and
Packaging Technologies (TCAPT), 25(4), 561–566.
44. Lord, M. (2022). hdparm. https://sourceforge.net/projects/hdparm/. Accessed 18 May 2023.
45. Luo, Y., & Xu, X. (2020). A quantitative defense framework against power attacks on multi-
tenant FPGA. In International Conference on Computer-Aided Design (ICCAD).
46. Mahmoud, D. G., Lenders, V., & Stojilović, M. (2023). Electrical-level attacks on CPUs,
FPGAs, and GPUs: Survey and implications in the heterogeneous era. ACM Computing
Surveys (CSUR), 55(3).
47. Martz, M. (2021). speedtest-cli. https://github.com/sivel/speedtest-cli. Accessed 18 May 2023.
48. Mirzargar, S. S., & Stojilović, M. (2019). Physical side-channel attacks and covert communi-
cation on FPGAs: A survey. In International Conference on Field Programmable Logic and
Applications (FPL).

49. Moini, S., Tian, S., Holcomb, D., Szefer, J., & Tessier, R. (2021). Remote power side-channel
attacks on BNN accelerators in FPGAs. In Design, Automation & Test in Europe Conference
& Exhibition (DATE).
50. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power Distribution Attacks in Multi-
Tenant FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI),
28(1).
51. Provelengios, G., Holcomb, D., & Tessier, R. (2021). Mitigating voltage attacks in multi-tenant
FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 1(1).
52. Rakin, A. S., Luo, Y., Xu, X., & Fan, D. (2021). Deep-Dup: An adversarial weight duplication
attack framework to crush deep neural network in multi-tenant FPGA. In USENIX Security
Symposium.
53. Schaa, D., & Kaeli, D. (2009). Exploring the multiple-GPU design space. In IEEE Interna-
tional Parallel and Distributed Processing Symposium Workshops (IPDPSW).
54. Spafford, K., Meredith, J. S., & Vetter, J. S. (2011). Quantifying NUMA and contention effects
in multi-GPU systems. In Workshop on General-Purpose Processing on Graphics Processing
Units (GPGPU).
55. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters, 15(11),
640–642.
56. Tan, M., Wan, J., Zho, Z., & Li, Z. (2021). Invisible probe: Timing attacks with PCIe congestion
side-channel. In IEEE Symposium on Security and Privacy (S&P).
57. Tencent Cloud. (2023). FPGA cloud server. https://cloud.tencent.com/product/fpga. Accessed
18 May 2023.
58. Tian, S., Krzywosz, A., Giechaskiel, I., & Szefer, J. (2020). Cloud FPGA security with RO-
based primitives. In International Conference on Field-Programmable Technology (FPT).
59. Tian, S., & Szefer, J. (2019). Temporal thermal covert channels in cloud FPGAs. In
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
60. Tian, S., Xiong, W., Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2020). Finger-
printing cloud FPGA infrastructures. In ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays (FPGA).
61. Tian, S., Xiong, W., Giechaskiel, I., & Szefer, J. (2021). Cloud FPGA cartography using
PCIe contention. In IEEE Symposium on Field-Programmable Custom Computing Machines
(FCCM).
62. Tian, S., Xiong, W., Giechaskiel, I., & Szefer, J. (2021). Remote power attacks on the versatile
tensor accelerator in multi-tenant FPGAs. In IEEE Symposium on Field-Programmable
Custom Computing Machines (FCCM).
63. Valtchanov, B., Aubert, A., Bernard, F., & Fischer, V. (2008). Modeling and observing the jitter
in ring oscillators implemented in FPGAs. In IEEE Workshop on Design and Diagnostics of
Electronic Circuits and Systems (DDECS).
64. Wang, X., Niu, Y., Liu, F., & Xu, Z. (2022). When FPGA meets cloud: A first look at
performance. IEEE Transactions on Cloud Computing (TCC), 10(2), 1344–1357.
65. Waterland, A. P. (2014). stress. https://web.archive.org/web/20190502/https://people.seas.
harvard.edu/~apw/stress/. Accessed 18 May 2023.
66. Xilinx, Inc. (2021). 63419 - Vivado partial reconfiguration - What types of bitstreams are used
in partial reconfiguration (PR) solutions? https://support.xilinx.com/s/article/63419. Accessed
18 May 2023.
67. Xilinx, Inc. (2023). UltraScale+ FPGAs: Product tables and product selection guides. https://
www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-
selection-guide.pdf. Accessed 18 May 2023.
68. Xiong, W., Anagnostopoulos, N. A., Schaller, A., Katzenbeisser, S., & Szefer, J. (2019).
Spying on temperature using DRAM. In Design, Automation & Test in Europe Conference
& Exhibition (DATE).
69. Xiong, W., Schaller, A., Anagnostopoulos, N. A., Saleem, M. U., Gabmeyer, S., Katzenbeisser,
S., & Szefer, J. (2016). Run-time accessible DRAM PUFs in commodity devices. In
Conference on Cryptographic Hardware and Embedded Systems (CHES).

70. Yin, C. E., & Qu, G. (2009). Temperature-aware cooperative ring oscillator PUF. In IEEE
International Workshop on Hardware-Oriented Security and Trust (HOST).
71. Zhang, J., & Qu, G. (2019). Recent attacks and defenses on FPGA-based systems. ACM
Transactions on Reconfigurable Technology and Systems (TRETS), 12(3).
72. Zhang, Y., Yasaei, R., Chen, H., Li, Z., & Al Faruque, M. A. (2021). Stealing neural network
structure through remote FPGA side-channel analysis. IEEE Transactions on Information
Forensics and Security (TIFS), 16, 4377–4388.
Chapter 7
Cross-board Power-Based FPGA, CPU,
and GPU Covert Channels

Ilias Giechaskiel, Kasper Rasmussen, and Jakub Szefer

7.1 Introduction

Field-programmable gate arrays (FPGAs) have become increasingly popular in
cloud deployments, and this transition has also resulted in a threat model shift
from one of physical attacks that require physical proximity to the FPGA board
and external equipment (e.g., high-end oscilloscopes) to one of remote attacks
using only on-chip logic. The majority of recent work has so far shown that
remote fault, covert-channel, and side-channel attacks are indeed possible between
designs belonging to different users co-located within the same FPGA chip [4, 8–
10, 13, 24, 26, 27, 29, 30, 32, 37, 45, 46]. However, as boards are currently allocated
on a per-user basis in commercial clouds, this multi-tenant threat model remains
theoretical, with little practical impact.
In this chapter, we instead tackle a more pressing scenario that is applicable to
existing cloud FPGA deployments, where boards are co-located within the same
server rack unit. Users renting FPGAs from such FPGA cloud providers assume that
their designs are safely isolated from potentially malicious designs by other users
running in the same data center. However, as we show, the assumption of isolation
can be broken due to leakage through the shared use of power supply units (PSUs).
Specifically, we introduce a new class of remote covert-channel attacks between

I. Giechaskiel
Independent Researcher, London, UK
e-mail: ilias@giechaskiel.com
K. Rasmussen
University of Oxford, Oxford, UK
e-mail: kasper.rasmussen@cs.ox.ac.uk
J. Szefer ()
Yale University, New Haven, CT, USA
e-mail: jakub.szefer@yale.edu

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 173
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_7

single-tenant FPGAs on different FPGA boards that are merely powered through
the same PSU. Moreover, we show that if this PSU also powers the host computer,
the same sink FPGA (receiver) can detect high levels of CPU and GPU activity,
creating new CPU-to-FPGA and GPU-to-FPGA channels. These channels allow one
system, which may (GPU, FPGA) or may not (CPU) contain an accelerator, to leak
information such as private encryption keys to an entirely different system (the sink
FPGA), which is fully isolated, except for the shared power supply.
The first crucial observation of our work is that although causing variable power
consumption to transmit information is easy, detecting voltage fluctuations without
external equipment is non-trivial. However, the reconfigurability of FPGAs provides
access to the hardware at a much lower level and can be used to implement circuits
that detect voltage changes that are imperceptible to fixed silicon chips such as
CPUs and GPUs. Indeed, cloud providers are aware of the impact of such low-level
hardware access, so besides allocating FPGAs on a per-user basis, they also keep
several features such as voltage and temperature monitors inaccessible to end users.
The second key observation is that ring oscillators (ROs) are capable of both
causing and sensing voltage fluctuations. This chapter therefore introduces a novel
way of monitoring changes in voltage caused by the source FPGA, CPU, or GPU.
Specifically, both properties of ROs are used in the sink (receiver) FPGA, whereby
stressing the voltage regulator of the sink FPGA allows one to detect transmissions
by the source (transmitter) FPGA.
Using these insights, we demonstrate the first cross-FPGA covert channel
between off-the-shelf, unmodified Xilinx Artix 7, and Kintex 7 boards in either
direction of communication. We also characterize the bandwidth–accuracy tradeoffs
across different measurement periods and sizes of the covert-channel ROs on the
source and sink FPGAs. We further test our covert channel on two PSUs running
under normal operating conditions (i.e., without being overloaded) and introduce
CPU-to-FPGA and GPU-to-FPGA covert channels by modulating their respective
loads. We finally discuss countermeasures to mitigate this source of leakage.

7.1.1 Contributions

Our contributions can be summarized as follows:


1. We identify sharing of PSUs as a new source of vulnerability, even for unprivi-
leged FPGA designs without access to voltage or temperature system monitors.
2. We introduce a novel measurement setup and classification metric that uses ring
oscillators (ROs) on the sink FPGA to stress its voltage regulator and therefore
reliably detect external voltage fluctuations.
3. We exploit this setup to create the first remote covert-channel attack between
FPGAs on distinct physical boards that are dedicated on a per-user basis,
reaching accuracies of up to 100%.

4. We evaluate the strength of the information leakage across different architectural
   choices and perform a bandwidth–accuracy tradeoff analysis.
5. We introduce the first CPU-to-FPGA and GPU-to-FPGA covert channels using
high loads of activity on their respective processors, opening up new avenues for
remote FPGA attacks.
6. We propose hardware- and software-level countermeasures to reduce the impact
of the leakage.

7.1.2 Chapter Organization

The rest of the chapter is organized as follows. Section 7.2 introduces the threat
model, while Sect. 7.3 details the experimental setup, including hardware properties,
the measurement procedure, and the high-level architectural FPGA design. Sec-
tion 7.4 then describes the need for our novel classification metric and explains why
it works where the naive approach of looking at absolute ring oscillator counts fails.
Section 7.5 then evaluates cross-FPGA covert communication over shared PSUs,
varying the number of source and sink ring oscillators used, and performing an
analysis of bandwidth–accuracy tradeoffs. Section 7.6 then covers CPU-to-FPGA
and GPU-to-FPGA information leakage, while Sect. 7.7 discusses potential defense
mechanisms. We place our work in the context of related research in Sect. 7.8, before
we conclude in Sect. 7.9.

7.2 Threat Model

Prior work on attacks without physical access to the FPGA hardware has primarily
investigated security in the context of multi-tenant FPGAs. It has shown that when
a single FPGA chip is shared among multiple users concurrently, designs are
vulnerable to temperature and voltage attacks (Sect. 7.8). Although these attacks
highlight potential issues with future architectures, they remain theoretical at the
moment, as FPGAs are currently allocated on a per-user basis. In this chapter, we
are thus concerned with covert-channel attacks against platforms where the entire
logic is allocated to a single user. Design logic therefore cannot access any voltage
or thermal system monitors present on the FPGA fabric, as these are inaccessible in a
cloud environment.1 Compared to multi-tenant attacks on FPGA designs that share
the same power distribution network, adversarial attacks to infer any information
about the activity or data (e.g., encryption keys) of other users necessitate that side-

1 In cloud FPGAs, part of the fabric is reserved by a cloud-provided “shell” that hides implementation details, including physical pinouts, identification primitives, and system monitors. User logic is forced to interact with external hardware through the shell’s AXI4 interfaces.

Fig. 7.1 System model for FPGA-to-FPGA, CPU-to-FPGA, and GPU-to-FPGA leakage in co-
located environments. The CPU, GPU, and one or more (potentially malicious) FPGAs are
powered through the same PSU but do not share any logic and do not have access to system
monitors for measuring voltage or temperature changes

channel leakage be measurable across extensive physical separation (as opposed to
logic on the same FPGA chip), and with multiple intermediate components (passive
capacitors, inductors, voltage regulators, etc.) on the path between the source and
sink FPGA boards.
In this chapter, we specifically investigate remote voltage-based attacks, where
a shared PSU provides an indirect connection between FPGA boards. We do not
consider reverse-engineering attacks on the bitstream itself or the contained logic,
but instead focus on how to initiate a communication channel through modulating
the load on the PSU itself. We mainly consider FPGA-to-FPGA attacks between
otherwise unconnected devices, but also investigate CPU-to-FPGA and GPU-to-
FPGA attacks. This is because the same PSU might also power the host computer,
and, by extension, its internal components including CPUs and GPUs, as shown in
the high-level system model of Fig. 7.1. We make no assumptions regarding how
the FPGAs are connected to the computer. In other words, we do not assume that
FPGAs are attached to the motherboard over PCIe, to a USB controller over a serial
chip, or, in fact, if they are even (logically) connected to the computer at all. Our
only assumption is that of a shared PSU between the two communicating parties.
Within an FPGA, and in accordance with prior work [9, 10, 27, 46], (potentially
adversarial) users can place and route any designs of their choice, such as different
types of ring oscillators. This is allowed by current FPGA cloud deployments, as
long as the logic is placed outside of the cloud-provided shell. In this chapter, we
show that by relying only on on-chip FPGA logic (i.e., ring oscillators), we are
able to demonstrate FPGA-to-FPGA, CPU-to-FPGA, and GPU-to-FPGA covert
communication, without physical access to the FPGA boards. One of the key
contributions of our work is therefore the ability to communicate across unmodified
devices, without external equipment or access to internal voltage monitors, which
are off-limits to unprivileged FPGA designs.
7 Cross-board Power-Based FPGA, CPU, and GPU Covert Channels 177

It should be noted that some cloud providers such as Amazon Web Services
(AWS) place restrictions on the types of circuits that can be instantiated on their
FPGAs and prohibit combinatorial loops including ring oscillators [9, 35]. Although
in this chapter we primarily use conventional ring oscillators, Sect. 7.5.5 shows that
they can be easily replaced by alternate designs proposed in recent work [9, 10, 22,
35], which bypass such cloud countermeasures, and could therefore be used to attack
the isolation that separate physical hardware is supposed to provide.

7.3 Experimental Setup

In this section, we detail our experimental setup, starting with the ring oscillators
employed in the source and sink FPGAs (Sect. 7.3.1) and delving into the archi-
tectural design of the FPGA transmission and reception circuitry (Sect. 7.3.2). We
then describe the hardware properties of the FPGA boards used (Sect. 7.3.3), as well
as the computer PSUs, CPUs, and GPUs, which are effectively turned into covert-
channel transmitters (Sect. 7.3.4). We finally discuss the process followed for data
collection (Sect. 7.3.5).

7.3.1 Ring Oscillators

Ring oscillators consist of an odd number of NOT gates in a ring formation
and therefore form a combinatorial loop, whose value oscillates. The frequency of
oscillation changes based on process variations, as well as voltage and temperature
conditions [16], making ROs good temperature [38] and voltage [46] monitors. ROs
also cause voltage fluctuations, which stress power circuits, and can potentially
crash the FPGA or inject faults [12, 24, 26, 27, 30].
In this chapter, we use ROs as both transmitters and receivers and implement
them using lookup tables (LUT-RO) with one inverter and three buffer stages as
shown in Fig. 7.2. We chose to use this RO design instead of more common ROs
with three inverters or one inverter and two buffer stages because preliminary
experiments showed that it resulted in more stable measurements. Alternative
types of ROs are evaluated in Sect. 7.5.5.
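The voltage sensitivity that makes ROs useful as monitors can be illustrated with a first-order model: an RO whose stages have total propagation delay N · t_pd oscillates at f ≈ 1/(2 · N · t_pd), and t_pd shrinks as the core voltage rises. The sketch below is purely illustrative — the constants t_pd0 and k and the linear delay model are our assumptions, not measured values:

```python
def ro_count(v_core, n_stages=4, t_pd0=0.5e-9, k=0.3e-9,
             f_clk=200e6, t_exp=15):
    """Toy model of an RO counter value over a 2**t_exp clock-cycle window.

    The per-stage delay is modeled as t_pd0 - k * (v_core - 1.0): a higher
    core voltage shortens the delay and raises the RO frequency. All
    constants are illustrative, not measured.
    """
    t_pd = t_pd0 - k * (v_core - 1.0)   # per-stage delay (s), toy model
    f_ro = 1.0 / (2 * n_stages * t_pd)  # oscillation frequency (Hz)
    window = 2 ** t_exp / f_clk         # measurement window (s)
    return int(f_ro * window)           # expected counter value

# A small droop in the core voltage lowers the count:
assert ro_count(0.98) < ro_count(1.00) < ro_count(1.02)
```

In practice (Sect. 7.4.1), the board regulator masks small changes at the 12 V input, which is why absolute counts alone turn out to be insufficient for decoding.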

Fig. 7.2 The ring oscillators are implemented using lookup tables (LUT-ROs) and contain one
inverter and three buffer gates

Fig. 7.3 Experimental setup: the covert source (left) uses T · NT ROs, while the sink (right) has
R · NR measurement ROs and S · NS stressor ROs. The same power supply unit powers both boards

7.3.2 Architectural FPGA Design

We now give a high-level overview of the covert-channel source and sink FPGA
designs, which are summarized in Fig. 7.3.

7.3.2.1 Covert-Channel Source

To cause detectable changes on the sink, the source FPGA employs ring oscillators
organized as T transmitters, which can be controlled independently. These transmit-
ters are placed on separate clock regions to make power consumption more evenly
spread throughout the FPGA. They contain NT ROs each, for a total of T · NT ROs,
as shown in the left part of Fig. 7.3.

7.3.2.2 Covert-Channel Sink

To receive transmissions, we employ R receivers, placed on separate clock regions
of the sink FPGA, and each containing NR ROs. We estimate the RO frequency by
counting the number of RO signal transitions in a fixed measurement interval of 2^t
clock cycles through counters placed outside of the RO clock regions.
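Concretely, a counter value C accumulated over a 2^t-cycle window maps back to an estimated RO frequency. The helper below is a sketch of that conversion; it assumes the counter increments once per RO period (the exact scaling factor depends on how transitions are counted in the actual design):

```python
def ro_frequency_hz(count, t_exp, f_clk=200e6):
    """Estimate an RO's frequency from a count accumulated over
    2**t_exp cycles of an f_clk reference clock, assuming one counter
    increment per RO period (an assumption, not the chapter's design)."""
    window_s = 2 ** t_exp / f_clk  # measurement window in seconds
    return count / window_s

# e.g. a count of 40960 over a 2**15-cycle window at 200 MHz:
assert round(ro_frequency_hz(40960, 15)) == 250_000_000
```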

Fig. 7.4 Annotated Vivado screenshot of the sink architecture on the Kintex 7 board, with receiver
ROs in red, stressor ROs in blue, and other logic (counters, UART, FIFO) in brown

However, this setup is not sufficient to decode covert transmissions, due to
inherent noise in the power supply and environmental fluctuations. Instead, it is
necessary to introduce additional circuitry on the sink FPGA that stresses the board’s
voltage regulator, making maintaining a constant voltage harder. This fact allows
us to sense voltage changes induced by the source FPGA, or even by CPU and
GPU activity, as presented later in Sect. 7.6. Specifically, we include S stressors,
each with NS ROs. As with the source transmitters, these S stressors are placed on
separate clock regions and can also be controlled independently. The block diagram
for the sink design is shown in the right part of Fig. 7.3, while Fig. 7.4 shows a
concrete instantiation of the sink architecture on the Kintex 7 board. Section 7.4
further demonstrates the need for stressor ROs.

7.3.3 FPGA Boards

For our experiments, we use Xilinx Kintex 7 KC705 and Artix 7 AC701 boards. The
28 nm chips these devices contain are similar, but the Kintex 7 is more performant,
while the Artix 7 is optimized for low power [41, 44]. Both FPGAs have a 200 MHz
oscillator and operate at a core VCCINT voltage of 1.0 V, but the boards use
different regulators to convert the 12 V PSU output into 1.0 V [42, 43].

Table 7.1 Properties of the FPGA boards used, along with fixed compile-time choices for the
source and sink circuit configurations

Property                     Artix 7      Kintex 7
Board                        AC701        KC705
Part Number                  XC7A200T     XC7K325T
Slices                       33 650       50 950
Clock Regions                2 × 5        2 × 7
Core Voltage, VCCINT         1.0 V        1.0 V
Voltage Regulator            LMZ31710     PTD08A020W
Clock Frequency              200 MHz      200 MHz
# of Boards Tested           2            2
# of Transmitters, T         10           14
# of Stressors, S            5            5
# of Receivers, R            4            4
# of ROs per Receiver, NR    5            5

For the source FPGA designs, we place a transmitter on each clock region of the
FPGA. As the Artix 7 board has 10 clock regions, while the Kintex 7 has 14, the
numbers of transmitters on these devices are T = 10 and T = 14, respectively.
The sink FPGAs contain R = 4 receivers in the corners of each chip, each with
NR = 5 ROs. Sink FPGAs also contain S = 5 stressors, one of which is placed
in the center of the device, while the remaining four are next to the receiver clock
regions (Fig. 7.4 shows an example with NS = 500). Although not shown to be
significant in our experiments, these early architectural choices were made to ensure
that the power draw was approximately equally spread across the FPGA fabric.
These decisions and other FPGA properties are summarized in Table 7.1. More
compile- and run-time parameters, such as the measurement period and the number
of source transmitter ROs NT and sink stressor ROs NS, are varied in Sect. 7.5.

7.3.4 Power Supply Units and Computer Transmitters

To verify that the covert channel is not due to a design fault in a specific line of power
supply units, we test communication on two PSUs made by different manufacturers
(Corsair and Dell), rated for different loads (850 W and 1300 W, respectively), and
both with a Gold 80 Plus Certification (which guarantees 90% efficiency at 50%
load). These PSUs are integrated in two computers, the first of which contains two
Xeon E5645 CPUs for a total of 24 threads, while the second contains a single
Xeon E5-2609 with 4 threads. They also contain Nvidia GeForce GPUs, with 96
and 640 CUDA cores, respectively. The CPU and GPU cores are used as the covert-
channel sources in Sect. 7.6 for CPU-to-FPGA and GPU-to-FPGA communication
over the shared power supply. The properties of the computers used are summarized
in Table 7.2.

Table 7.2 Hardware properties of the two computers used, with their corresponding PSUs, CPUs,
and GPUs
Property PC-A PC-B
PSU Brand Corsair Dell
Power Rating 850 W 1300 W
80 Plus Certification Gold Gold
Motherboard SuperMicro X8DAL-i Dell Precision T7600
Xeon CPU Model E5645 E5-2609
# of CPU Cores 6 @ 2.4 GHz 4 @ 2.4 GHz
# of Threads 12 4
# of CPUs 2 1
GeForce GPU ZOTAC GT 430 EVGA GTX 750 Ti
GPU Memory 1 GB GDDR3 2 GB GDDR5
# of CUDA Cores 96 @ 0.7 GHz 640 @ 1.0 GHz

7.3.5 Data Collection and Encoding

For our data collection process, we made several choices to make the communica-
tion scenario realistic. For instance, the computers attached to the PSUs were used
normally during experimentation, including running and installing other software.
Moreover, to ensure leakage is not due to temperature, the FPGAs were placed
outside the computer case, and away from computer fans, which may affect
measurements by turning on or off based on the computer temperature. We similarly
placed the FPGAs next to each other horizontally (as opposed to stacking them
vertically), further minimizing cross-FPGA temperature effects. In addition, to
control for other voltage effects, the FPGAs were not connected to the computer
over PCIe, which would likely increase the potential for leakage. However, as
we show in Sect. 7.5.5, our covert channel operates with similar accuracy, even
when the FPGAs are connected to the computer over PCIe and are enclosed in it
without accounting for temperature variations. Finally, to verify that the leakage is
not caused through the UART interface, we often used one computer to take the
measurements, and the other to power the source and sink boards through its PSU.
As there is inherent noise in the measurements, (a) the absolute RO frequency
is not well-suited for comparison, and (b) the RO counts need to be averaged over
repeated measurements to produce meaningful results. To address both concerns, we
use Manchester encoding, where to send a 1, the source transmitters are enabled for
one measurement period and disabled for the next (a 0 is similarly encoded by first
disabling transmitters during the first measurement period and enabling them in the
second period). These measurement periods are M · 2^t clock cycles long, where we
average M RO counts collected by ROs enabled for 2^t clock cycles (see Sect. 7.4).
The bandwidth can thus be calculated as

    b = fc / (2 · 2^t · M),                                        (7.1)

where fc = 200 MHz is the FPGA clock frequency.


In most experiments, we transmit the 20-bit number 0xf3ed1 across the covert
channel, Manchester-encoding it in 40 bits. Additional patterns are evaluated in
Sect. 7.5.4. To ensure that perfect synchronization is not needed between the source
and the sink, for each of the 40 periods, we take four sets of M measurements, where
M is in the order of a few hundred counts (see Table 7.3 and Sect. 7.5.3). The four
sets of repetitions create 4^2 = 16 Manchester-encoded pairs per bit to be transferred,
for a total of 16 × 20 = 320 pairs to estimate the covert-channel accuracy.
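The encoding scheme above can be sketched in a few lines of Python; manchester_encode is a hypothetical helper of ours, not code from the measurement setup:

```python
def manchester_encode(value, n_bits):
    """Manchester-encode an n_bits-wide value, MSB first.

    A 1 bit becomes the period pair (enabled, disabled) and a 0 bit the
    pair (disabled, enabled), matching the transmitter schedule above.
    """
    bits = [(value >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]
    return [half for b in bits for half in ((1, 0) if b else (0, 1))]

encoded = manchester_encode(0xF3ED1, 20)
assert len(encoded) == 40  # 20 data bits -> 40 encoded periods

# Four sets of M measurements per period yield 4 * 4 = 16 Manchester
# pairs per bit, i.e. 16 * 20 = 320 pairs for the accuracy estimate.
assert 4 * 4 * 20 == 320
```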

7.4 Classification Metric

This section introduces a novel methodology to detect changes in the power supply
voltage through the sink’s “stressor” ROs. Section 7.4.1 first motivates why the naive
approach of using the absolute ring oscillator counts is insufficient for classification
of transmissions in this scenario. Section 7.4.2 then introduces the metric using
stressors, while Sect. 7.4.3 finally explains why our technique works.

7.4.1 Why Absolute Counts Are Not Enough

Broadly speaking, when the transmitters are activated on the source FPGA, CPU, or
GPU, there is a voltage drop that is visible not just at the board regulator, but also
at the 12 V rail PSU input to the FPGA board. Indeed, Fig. 7.5 demonstrates this

40
Nominal Voltage Difference (mV)

35 # of Enabled
Transmitters T
30 0
2
25 4
6
20 8
10
15
12
10 14

5
11.5 11.7 11.9 12.1 12.3 12.5
Power Supply Voltage (V)

Fig. 7.5 Voltage as set by the power supply and measured by the oscilloscope for various numbers
of enabled transmitters T on the KC705-2 source, with 99% confidence intervals

775,000 Ring Osc.


Index i
Average RO Count CVi 750,000 0
1
725,000 2
3
700,000 4
5
675,000 6
7
650,000

11.5 11.7 11.9 12.1 12.3 12.5


Power Supply Voltage (V)

Fig. 7.6 The average ring oscillator counts .CVi (at 99% confidence) on the AC701-1 sink remain
approximately the same for different power supply voltages V and all eight ring oscillators .Ri

for a Kintex 7 source without a sink FPGA present across multiple input voltages
and different numbers of enabled transmitters T . Specifically, we power the board
using a Keithley 2231A power supply and measure the voltage at the power rail of
the board using a Tektronix MDO3104 Mixed Domain Oscilloscope with TPP1000
1 GHz passive probes, taking 10 000 data points. Figure 7.5 indicates that at any
voltage level provided by the power supply (11.5 V to 12.5 V), as the number of
enabled source transmitters T increases, the voltage measured by the oscilloscope
decreases. For example, at 12.5 V, the oscilloscope measures 12.539 V when no
transmitters are enabled, but only 12.521 V when 14 transmitters are enabled, for
a voltage drop of approximately 18 mV. At 11.5 V, the measured voltage similarly
drops from 11.525 V to 11.507 V.
Although one would expect RO frequency to increase with higher voltages [16],
this is not the case. For a ring oscillator i, let its average count be C^i_V when the
voltage provided by the power supply is 11.5 V ≤ V ≤ 12.5 V. We would expect
that C^i_{V1} > C^i_{V2} whenever V1 > V2, but Fig. 7.6 suggests that the RO counts remain
approximately the same for all eight ring oscillators and voltages V tested on an
Artix 7 sink, likely because the regulator is able to deal with such input voltages. As
a result, the absolute RO frequency cannot be used to decode cross-FPGA covert-
channel transmissions.

7.4.2 A New Metric Based on Count Differences

To solve the issues identified above, we introduce ROs to “stress” the voltage
regulator and make external changes in the power supply voltage measurable. For
any bit transmission (say the i-th one), we take M measurements as follows:

Fig. 7.7 Timing diagram for a Manchester-encoded transmission of the two bits 10, with M = 4
measurement periods. Half of the ring oscillator counts are taken when the stressors are enabled
(E), and the other M/2 = 2 counts when they are disabled (D) to compute Δ = D − E. The
receiver uses the sign (positive or negative) of the difference Δ2n − Δ2n+1 between the two parts
of the encoded transmission of the n-th bit to determine if it should be decoded as a 0 or as a 1.
For example, (C^{00} − C^{01} + C^{02} − C^{03})/2 = Δ0 > Δ1 = (C^{10} − C^{11} + C^{12} − C^{13})/2, so the first bit is
decoded as a 1. Similarly, Δ2 < Δ3, so the second bit is decoded as a 0

1. For the first measurement period, we disable all stressor ROs, and let the receiver
   ROs run for 2^t clock cycles, producing counts C^{i,0} = (C^{i,0}_0, . . . , C^{i,0}_{R·NR−1}).
2. In the second period, we enable all (or some, see Sects. 7.4.3 and 7.5.3) stressor
   ROs and estimate the RO frequencies through their counts, C^{i,1}.
3. In the third measurement period, we disable all stressor ROs, re-enable them in
   the fourth period, and so forth.

This procedure produces M/2 measurements C^{i,0}, C^{i,2}, . . . corresponding to
disabled stressors, and M/2 measurements C^{i,1}, C^{i,3}, . . . corresponding to enabled
stressors, as also shown in the timing diagram of Fig. 7.7. Figure 7.7 represents
Manchester-encoded transmissions of the 2 bits 10, averaging over M = 4
measurements and only repeating transmissions once (actual measurements have
M = 500, with 4 repetitions). We take the average of each set per RO, thereby
calculating the disabled-stressor average Di = (2/M) · Σ_{k=0}^{M/2−1} C^{i,2k} and
the enabled-stressor average Ei = (2/M) · Σ_{k=0}^{M/2−1} C^{i,2k+1}. We then use
Δi = Di − Ei to recover the transmitted bit.
Specifically, assume that we wish to recover the n-th bit, corresponding to
transmissions 2n and 2n + 1, as each bit b is Manchester-encoded as the pair
(b, 1 − b). In each transmission pair, there is always a 1 bit and a 0 bit, so we can
compare the R · NR counts of Δ2n and Δ2n+1. If the majority of the RO differences
in the first set of measurements is bigger than the corresponding differences in the
second set of measurements (i.e., Δ2n > Δ2n+1 for most ROs), we classify the n-th
bit as a 1, while if the majority is smaller (Δ2n < Δ2n+1 for most ROs), we classify
it as a 0.
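This decoding rule can be expressed compactly in Python. The helper names and the synthetic counts below are our own illustrations, not the authors' implementation:

```python
def delta(counts):
    """Per-RO metric Delta = D - E for one encoded period.

    counts: M rows (one per measurement period), each a list of R*NR
    receiver-RO counts; even-indexed rows were taken with the stressors
    disabled (D), odd-indexed rows with them enabled (E).
    """
    d_rows, e_rows = counts[0::2], counts[1::2]
    n_ros = len(counts[0])
    d = [sum(r[j] for r in d_rows) / len(d_rows) for j in range(n_ros)]
    e = [sum(r[j] for r in e_rows) / len(e_rows) for j in range(n_ros)]
    return [di - ei for di, ei in zip(d, e)]

def decode_bit(counts_2n, counts_2n1):
    """Majority vote over receiver ROs: Delta_2n > Delta_2n+1 -> 1."""
    diffs = [a - b for a, b in zip(delta(counts_2n), delta(counts_2n1))]
    ones = sum(1 for x in diffs if x > 0)
    return 1 if ones > len(diffs) / 2 else 0

# Synthetic example with 20 ROs and M = 4: an external voltage drop
# during the first encoded period enlarges its Delta, so the Manchester
# pair decodes as a 1 (and as a 0 with the periods swapped).
first = [[1000] * 20, [990] * 20, [1000] * 20, [990] * 20]   # Delta = 10
second = [[1000] * 20, [995] * 20, [1000] * 20, [995] * 20]  # Delta = 5
assert decode_bit(first, second) == 1
assert decode_bit(second, first) == 0
```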
Figure 7.8 demonstrates the need for this more complicated procedure in
practice for a transmission of a Manchester-encoded 1 bit. Specifically, it compares
our new metric with stressor ROs, Δ2n − Δ2n+1, against the naive bit-recovery
metric D2n − D2n+1 for all 20 receiver ROs. As Fig. 7.8 (blue circles) shows,
Δ2n − Δ2n+1 > 0 for all 20 receiver ROs R0, R1, . . ., so our metric correctly

Fig. 7.8 All RO count differences with stressors Δ2n − Δ2n+1 (blue circles) are positive, correctly
decoding a transmission of 1. However, the naive metric without stressors D2n − D2n+1 (orange
diamonds) behaves randomly, with only about half being positive

recovers this bit transmission. However, the D2n − D2n+1 values with stressors
disabled (orange diamonds) behave randomly, and indeed, in the experiment in
which these measurements originated, our metric successfully recovered over 98%
of transmissions, compared to 53% using the naive method without the stressors.
Section 7.4.3 further expands on why the new technique makes for a good approach
in detecting transmissions.

7.4.3 Characterization of the Proposed Metric

In this section, we test the receiving circuit (sink FPGA) on its own to characterize
its behavior. We first plot in Fig. 7.9 the average metric Δ^i_V for the eight ring
oscillators of Fig. 7.6 across the same power supply voltages 11.5 V ≤ V ≤ 12.5 V.
As expected, for all ROs, Δ^i_{V1} < Δ^i_{V2} whenever V1 > V2: When there is an external
voltage drop (e.g., when the source FPGA enables the transmitter ROs), the Δ metric
increases compared to when there are no external transmissions.
We additionally test the behavior of the receiver FPGA across different
measurement times of 2^t clock cycles and the numbers of enabled stressors S. Specifically,
we conduct measurements on an Artix 7 sink and calculate the average value of
our Δ metric over all 20 receiver ROs at two voltage levels: 11.5 V and 12.5 V.
Figure 7.10 plots our results, which lead to several observations.
First of all, the average difference Δ = Δ11.5 − Δ12.5 is close to zero for time
periods up to 41 µs, indicating that prolonged measurement times are necessary
to distinguish between transmissions of zero and one, which in practice result in

Fig. 7.9 The average metric Δ^i_V on the AC701-1 sink decreases with higher power supply voltages
V for all eight ring oscillators Ri

Fig. 7.10 Difference between the average Δ metric as measured at 11.5 V and 12.5 V for different
measurement times and numbers of stressors enabled on the AC701-1 sink

much smaller voltage drops of ≈20 mV. Moreover, until 2.6 ms, Δ > 0 for all
choices of how many stressors S to enable simultaneously, with fewer stressors
resulting in a larger effect. However, for even larger time periods, Δ < 0, with
more stressors resulting in a bigger effect in magnitude. Consequently, the choice of
the number of stressors and measurement time is intricately linked with the accuracy of
the covert channel and, in fact, helps explain why in some experimental setups (e.g.,
the KC705-1 receiver on PSU-B of Table 7.4), the recovered pattern is flipped, i.e.,
a 0 bit is identified as a 1 bit and vice versa.

Table 7.3 Default values for accuracy- and bandwidth-related parameters, and the chapter
sections in which they are varied. Bandwidth is calculated using Eq. (7.1)

Property                          Artix 7    Kintex 7   Section
# of Transmitter ROs, NT          1000       1000       7.5.2
# of Enabled Transmitters         10         14         7.5.2
Transmitted Pattern               0xf3ed1    0xf3ed1    7.5.4
Transmitter Types                 LUT-RO     LUT-RO     7.5.4
# of Stressor ROs, NS             500        500        7.5.2
# of Enabled Stressors            1          5          7.5.3
Stressor and Receiver Types       LUT-RO     LUT-RO     7.5.5
# of Repetitions per Bit, M       500        500        7.5.3
Measurement Cycles, 2^t           2^15       2^21       7.5.3
Channel Bandwidth b (b s^−1)      6.1        0.1        7.5.3
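The bandwidth figures in the last row of Table 7.3 follow directly from Eq. (7.1); a quick check in Python (the function name is ours):

```python
def bandwidth(f_clk, t_exp, m):
    """Covert-channel bandwidth in bits/s from Eq. (7.1): fc / (2 * 2**t * M)."""
    return f_clk / (2 * 2 ** t_exp * m)

# Defaults from Table 7.3 (fc = 200 MHz, M = 500):
assert round(bandwidth(200e6, 15, 500), 1) == 6.1  # Artix 7, 2**15 cycles
assert round(bandwidth(200e6, 21, 500), 1) == 0.1  # Kintex 7, 2**21 cycles
```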

7.5 Cross-FPGA Communication

In this section, we explore FPGA-to-FPGA covert communication, presenting a
summary of our results with the default experimental parameters in Sect. 7.5.1. We
then vary the number of source transmitter and sink stressor ROs in Sect. 7.5.2.
We further evaluate bandwidth–accuracy tradeoffs in Sect. 7.5.3 and test the perfor-
mance of the covert channel across different transmitter patterns and cabling setups
in Sect. 7.5.4. We finally test the covert channel using different types of ROs and
under different experimental conditions in Sect. 7.5.5.

7.5.1 Overview of Results

In this section, we give an overview of our cross-FPGA results. The values for the
default experimental parameters used in these experiments and the corresponding
covert-channel bandwidths are summarized in Table 7.3. These values were chosen
based on exploratory testing, as they represent a good tradeoff between accuracy
and bandwidth. However, in some cases, better accuracy can be achieved at the
cost of bandwidth, or the same accuracy can be maintained despite increasing the
bandwidth (see Sect. 7.5.3).
The results of our measurements across all 12 combinations of source and sink
FPGAs on both PSUs are summarized in Table 7.4. As the table shows, covert
communication is possible with high accuracy between any two boards, in either
direction, and on both PSUs. The table also allows us to draw various conclusions.
First of all, the behavior is not the same for identical boards. This is likely due to
both process variations internal to the FPGA chip (which affect RO measurements),
and because of different component tolerances. As an example, the AC701-2 board

Table 7.4 Accuracy for cross-FPGA covert channels on PSUs A and B, using the default
experimental parameters
Receiver
PSU Transmitter AC701-1 AC701-2 KC705-1 KC705-2
A AC701-1 – 79% 92% 100%
A AC701-2 99% – 93% 100%
A KC705-1 100% 86% – 100%
A KC705-2 100% 98% 99% –
B AC701-1 – 100% †98% 100%
B AC701-2 100% – †99% 100%
B KC705-1 100% 95% – 100%
B KC705-2 100% 100% †98% –
† signifies that the recovered bit pattern is flipped

is a worse sink than the AC701-1 board, while the KC705-1 board is a worse source
than the KC705-2 board.
Moreover, the Kintex 7 boards are generally better sources than the Artix 7
boards, due to the higher count of transmitters they contain (T = 14 as opposed to
T = 10). As we show in Sect. 7.5.2, more transmitters tend to improve the quality of
the covert channel. Finally, we notice that although the information leakage remains
strong in both PSUs, the accuracy of the recovered data on the 1300 W PSU-B is
generally higher than the accuracy on the 850 W PSU-A. This is perhaps somewhat
surprising, given that we would have expected the higher-rated PSU to produce more
stable output under sudden changes in the load, but this appears to not be the case.

7.5.2 Transmitter and Stressor ROs

In this section, we evaluate the effect of changing the size of the transmitting and
receiving circuits in the source and sink FPGAs, respectively, on the accuracy of
the covert channel. Since each of the T transmitters (with NT ROs each) can be
controlled independently (Fig. 7.3), we first vary the number of simultaneously
enabled transmitters on the KC705-1 board and plot the results across all receiver
boards in Fig. 7.11a. We also change the number of transmitter ROs NT on KC705-1
with all T transmitters enabled at the same time and plot the results in Fig. 7.11b.
Both experiments show that increasing the number of effective transmitter ROs
T · NT increases the accuracy of the covert channel. This is because the ensuing
voltage drops are more pronounced and can thus be more easily detected by the
receiving boards. However, for the KC705-2 sink board, too much activity on the
transmitter can decrease the accuracy of the channel. This is because although the
magnitude of the voltage drop increases in isolation (Fig. 7.5), the stressor ROs are
also causing a voltage drop that can overshadow that of the source FPGA.

Fig. 7.11 Increasing the number of (a) simultaneously enabled transmitters and (b) transmitter
ROs NT on the KC705-1 source board generally increases the accuracy of the cross-board covert
channel, except for the KC705-2 sink past a certain threshold

Fig. 7.12 Increasing the number of stressor ROs NS on the AC701-2 sink board can decrease
accuracy, as the additional activity can hide external transmissions under the noise floor

We additionally evaluate the effect of changing the number of stressor ROs NS
on the sink AC701-2 board and plot the accuracy of the covert channel in Fig. 7.12.
Consistent with Fig. 7.10, although stressor ROs are necessary to detect covert
transmissions, further increasing NS can have the opposite effect: the voltage drop
caused by the stressors overpowers any effect caused by the source transmissions
and starts pushing the average difference from positive to negative.

Fig. 7.13 Increasing the number of measurements M improves accuracy to any AC701 sink R,
from any FPGA source T

7.5.3 Bandwidth–Accuracy Tradeoffs

In this section, we investigate accuracy–bandwidth tradeoffs by varying both the
measurement period of 2^t clock cycles and the number of measurements M over
which the RO counts are averaged. We first experiment with both the AC701-1 and
the AC701-2 boards as sinks and plot the results from all other possible FPGA
sources in Fig. 7.13. In general, increasing the number of measurements increases
the accuracy of the covert channel, but at a cost of lower bandwidth. M = 500
represents a good tradeoff between accuracy and bandwidth (over 90% accuracy at
6.1 b s^−1 for the Artix 7 boards), but M ≥ 1000 results in higher accuracy at half
the bandwidth.
The second aspect we investigate is varying the number of clock cycles 2^t for
which each RO is counting. At the same time, we also change the number of enabled
stressors on the sink FPGA and test the accuracy of the covert channel with the
AC701-2 FPGA source. The results for the KC705-1 and AC701-1 sinks are shown
in Figs. 7.14a and b, respectively. These results indicate that the parameters for the
receivers need to be carefully tuned for different types of boards. For example, the
Artix 7 board necessitates that fewer stressors be driven, which is consistent with
the results of Sects. 7.4.3 and 7.5.2. On the other hand, the KC705-1 sink remains
accurate across a wider range of enabled stressors but requires longer measurement
periods for acceptable accuracies.
Fig. 7.14 Accuracy for different measurement times and the number of enabled stressors on the
(a) KC705-1 and (b) AC701-1 sinks

Fig. 7.15 The accuracy of the covert channel with the AC701-2 source remains similar across five
different 32-bit patterns (0x00000000, 0x8badf00d, 0xdeadbeef, 0xf0e1d2c3, and 0xffffffff)

7.5.4 Transmitted Patterns and Cabling Layouts

We test the transmission of longer patterns by communicating five 32-bit patterns
(64 encoded bits). The patterns were chosen to have different Hamming Weights
and runs of zeros and ones to show that the channel does not fundamentally depend
on the values transmitted. The results, plotted in Fig. 7.15 for the AC701-2 source,
indicate that the covert channel remains similarly accurate for all three sink boards
and five transmitted patterns.
In the majority of the previous experiments, the source and sink FPGA boards
were connected to the same PSU output through a Corsair peripheral cable with four
Molex connectors. This cable was attached to one of the “bottom” 6-pin outputs of
the PSU. However, to verify that the information leakage persists across different
cable setups, we also use a 12-pin output of the PSU splitting into two 6-pin
PCIe cables, denoted by “left” and “right.” We then test communication from the
KC705-1 board to the KC705-2 board across different cable setups, using the default
measurement time of 2^21 clock cycles, enabling all 5 stressors, but also increasing
the number of measurements to M = 1000. The results of our experiments are
summarized in Fig. 7.16, which demonstrates that a covert channel is possible in all
setups tested. This is perhaps to be expected, since the PSU uses a “dedicated single
+12 V rail” [5], but the results further indicate that there are differences among the
ports tested. Specifically, the covert channel is most accurate between FPGA boards
on the same cable (as they are at exactly the same electric potential difference) and
192 I. Giechaskiel et al.

Fig. 7.16 The accuracy of communication between the transmitter (T) and the receiver (R) Kintex 7 boards depends on how they are connected to the Power Supply Unit (cable locations: Bottom-Bottom, Bottom-Left, Bottom-Right, Left-Bottom, Left-Right, Right-Bottom, and Right-Left)

Fig. 7.17 The accuracy between the two Kintex 7 boards (KC705-1→KC705-2 and KC705-2→KC705-1) is consistently high for all types of source ROs tested (FF, LD, and LUT)

least accurate between the single location on the bottom of the PSU and either of
the dual outputs. Finally, it should be noted that the recovered pattern is flipped in
all setups, except when sharing the cable on the bottom output.

7.5.5 Ring Oscillator Types and Alternative Experimental Setup

We finally test communication using alternative types of ROs on the Kintex 7
boards, which we measure in a more realistic setup. Specifically, both boards
are connected to PC-A over PCIe and are enclosed in the computer tower to
avoid isolating thermal effects. The ROs used were proposed by Giechaskiel
et al. [9, 10] to bypass currently deployed cloud countermeasures that prohibit
combinatorial loops such as the LUT-RO used so far. One of them replaces a
buffer gate with a latch (LD-RO), while the other replaces it with an inverter and a
flip-flop (FF-RO). The setup otherwise uses the default experimental parameters of
Table 7.3. Figure 7.17 first shows that for all three types of transmitter ROs, the
accuracy of the cross-KC705 channel remains above 95%, despite potential noise
introduced by thermal conditions and the shared PCIe buses. Similarly, Fig. 7.18
shows that accuracy remains above 95% when using these alternative ROs for
stressors and receivers on a KC705 sink. Although in many cases bits are again
flipped, blocking combinatorial loops and introducing environmental noise cannot
prevent our channel from operating.
7 Cross-board Power-Based FPGA, CPU, and GPU Covert Channels 193

Fig. 7.18 The accuracy from the KC705-1 source to the KC705-2 sink using different receiver and stressor ROs (FF, LD, and LUT) also remains high

7.6 Additional Covert Channels

In this section, we explore CPU-to-FPGA (Sect. 7.6.1) and GPU-to-FPGA covert
channels (Sect. 7.6.2).

7.6.1 CPU Transmissions

In order to test the CPU-to-FPGA communication channel, we replace the power
draw of the FPGA source with heavy CPU loads. To that end, we use the open-source
stress program, which is available on Debian-based Linux distribution
package managers [40]. We vary the number of threads that stress uses from 0
(i.e., no transmissions, corresponding to random measurements), up to the number
of threads available on each computer, i.e., 24 on the CPU attached to PSU-A, and
4 on the CPU attached to PSU-B.
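Concretely, loading the CPU for a transmission can be done with an invocation such as the one built below; the `--cpu` and `--timeout` flags are standard stress options, while the exact duration is a placeholder of ours:

```python
# Sketch: build the stress(1) command line that loads n CPU worker threads
# for a fixed duration. The flag names are standard stress options; the
# duration value is a hypothetical placeholder.
def stress_cmd(threads, seconds):
    if threads == 0:
        return None  # 0 threads: no transmission (random measurements)
    return ["stress", "--cpu", str(threads), "--timeout", f"{seconds}s"]

# e.g., full load on the 24-thread CPU; run with subprocess.run(cmd)
print(stress_cmd(24, 10))
```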
The measurement process and classification metric remain the same as for the
cross-FPGA channels, but we introduce an additional delay of 3 seconds after
the stress program has started to ensure full utilization of the cores, and an
additional 3 seconds after killing the process, to ensure that the usage has returned
to normal. Moreover, when testing with PSU-A, and to increase accuracy, we
reduce the measurement period for the KC705 receivers to 2^t = 2^18 clock cycles
(1.3 ms) from 2^21 (10 ms), and the number of stressors to 4 instead of 5 (we use the
default parameters on PSU-B but increase measurements for the AC701 boards to
M = 1200). This increases the bandwidth of the covert channel by a factor of 8×
to 0.8 b s−1 compared to the cross-FPGA channel.
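The reported bandwidths can be reproduced from a simple timing model: each data bit is sent as two encoded bits, and each encoded bit is averaged over M measurement windows of 2^t clock cycles. Assuming a 200 MHz measurement clock (an assumption on our part; the actual board parameters are listed in Table 7.3), the figures of Tables 7.5 and 7.7 follow:

```python
# Back-of-the-envelope bandwidth model (our reconstruction, not from the
# text): each data bit takes 2 encoded bits, each averaged over M
# measurements of 2**t clock cycles. A 200 MHz clock is assumed.
F_CLK = 200e6

def bandwidth(M, t):
    return F_CLK / (2 * M * 2**t)  # data bits per second

print(round(bandwidth(500, 15), 1))   # AC701 default:      ~6.1 b/s
print(round(bandwidth(500, 18), 1))   # KC705 on PSU-A:     ~0.8 b/s
print(round(bandwidth(500, 21), 1))   # KC705 default:      ~0.1 b/s
print(round(bandwidth(1500, 15), 1))  # GPU channel, AC701: ~2.0 b/s
```

Under these assumptions the model matches all of the bandwidth values reported in this chapter, including the 8× speedup when dropping from 2^21 to 2^18 cycles.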


We plot the results for the two PSUs in Fig. 7.19, which allows us to draw
three main conclusions. First of all, there is a critical CPU activity threshold that
is necessary to make the covert channel possible. On PSU-A, this requires about
4 threads for the AC701 boards and 7 threads for the KC705 boards. Moreover,
increasing the number of threads does not always make the covert channel more
accurate. For example, increasing the number of CPU threads from 0 to 10 increases
accuracy, but the accuracy generally plateaus between 10 and 17 CPU threads, and
then decreases, perhaps due to hyper-threading. Finally, we notice that for a similar
number of threads used, the accuracy on PSU-B is often higher compared to that

Fig. 7.19 CPU-to-FPGA accuracy for the four FPGA sink boards (AC701-1, AC701-2, KC705-1, and KC705-2) on both PSUs for different numbers of CPU threads used as transmitters. As PSU-B powers a CPU with only 4 threads, no more than 4 threads can be dispatched for testing

Table 7.5 Maximum accuracy of transmissions from a CPU source to the four FPGA sinks on the
two PSU and PC setups, along with the parameters for which the accuracy is achieved
PSU  Parameter               AC701-1    AC701-2    KC705-1    KC705-2
A    Accuracy                95%        97%        95%        86%
A    Bandwidth               6.1 b s−1  6.1 b s−1  0.8 b s−1  0.8 b s−1
A    # of Threads            10         14         11         23
A    # of Enabled Stressors  1          1          4          4
A    # of Measurements       500        500        500        500
A    Measurement Cycles      2^15       2^15       2^18       2^18
B    Accuracy                81%        70%        †84%       88%
B    Bandwidth               2.5 b s−1  2.5 b s−1  0.1 b s−1  0.1 b s−1
B    # of Threads            4          4          4          4
B    # of Stressors          1          1          5          5
B    # of Measurements       1200       1200       500        500
B    Measurement Cycles      2^15       2^15       2^21       2^21

† signifies that the recovered bit pattern is flipped

for PSU-A. This parallels our cross-FPGA results of Sect. 7.5 and indicates that
PSU-B is generally more prone to covert communication. The maximum accuracy
achieved, the number of CPU threads used, and other experimental parameters are
summarized in Table 7.5.

Table 7.6 Parameters for GPU testing with gpu_burn

Property        GPU-A       GPU-B
Architecture    Fermi       Kepler
Technology      40 nm       28 nm
Driver Version  390.87      418.67
CUDA Version    8.0         10.1
Compiler Flag   compute_20  compute_50

Table 7.7 Maximum accuracy of transmissions from a GPU source to the four FPGA sinks on the
two PSU and PC setups, along with the parameters for which the accuracy is achieved
PSU  Parameter               AC701-1    AC701-2    KC705-1     KC705-2
A    Accuracy                76%        70%        94%         89%
B    Accuracy                97%        87%        96%         †100%
A&B  Bandwidth               2.0 b s−1  2.0 b s−1  0.03 b s−1  0.03 b s−1
A&B  # of Enabled Stressors  1          1          5           5
A&B  # of Measurements       1500       1500       1500        1500
A&B  Measurement Cycles      2^15       2^15       2^21        2^21
† signifies that the recovered bit pattern is flipped

7.6.2 GPU Transmissions

The process for testing GPU-to-FPGA transmissions is similar to that of
CPU-to-FPGA transmissions. We stress the GPUs with the open-source gpu_burn [39]
program, which uses Nvidia’s CUDA platform to fully utilize the GPU cores. As
the two GPUs use different architectures, we compile and run the gpu_burn
program against different Nvidia drivers and CUDA versions. These differences are
summarized in Table 7.6. Moreover, we return to the default measurement period of
2^t = 2^21 cycles for the Kintex 7 boards and increase the number of measurements
for all boards to 1500, reducing bandwidth by a factor of 3×. These parameters
and the corresponding results are summarized in Table 7.7. As in the CPU case, a
delay of 3 seconds is added before and after the program runs, to allow usage to
return to normal.
Figure 7.20 plots the results of our experiments for the four boards on both GPUs.
We find that it is possible to create a communication channel to all four boards, on
both PSUs. As expected, since there are fewer GPU cores attached to PSU-A, the
covert channel is weaker, but the accuracy is over 95% for three of the four boards
when using the GPU attached to PSU-B, which is larger. Moreover, we notice that
the AC701 boards are worse sinks than the KC705 boards. Although this pattern
is not entirely identical across the three communication channels (FPGA-to-FPGA,
CPU-to-FPGA, and GPU-to-FPGA), it broadly remains consistent, potentially due
to the differences in the voltage regulators themselves or other aspects of board
design and component tolerances.

Fig. 7.20 GPU-to-FPGA accuracy for the four FPGA sink boards (AC701-1, AC701-2, KC705-1, and KC705-2) on both PSU computers (PSUs A and B)

7.7 Discussion

In this section, we discuss how practical the covert channels we introduced are
(Sect. 7.7.1) and propose some software- and hardware-level countermeasures to
mitigate the impact of the information leakage (Sect. 7.7.2).

7.7.1 Practicality of Attacks

There are two aspects of how practical our communication scheme is, which we
evaluate in this section. The first is how costly transmissions are in terms of
resources used on the FPGA boards. The amount of logic instantiated is moderate,
but not negligible. On the transmitting end, G · T · N_T lookup tables (LUTs) are
used, where G = 4 is the number of ring oscillator stages. In particular, the source
design (including the UART and other logic) utilizes 16.6% of LUT resources on
the Artix 7 FPGA chip. Similarly, the sink design uses G · (R · N_R + S · N_S) LUTs for
the receiver and stressor ROs, and L · R · N_R registers for counting, where L = 32
is the length of the counters. Only 7.8% of the Artix 7 resources are used in this
case—a number that can be reduced to 3.4%, as the AC701 boards only enable one
stressor for higher accuracy.
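To make the formulas concrete, the sketch below evaluates them numerically. G, L, and N_S = 500 are taken from the text, and S = 5 matches the number of stressors enabled on the KC705 sinks; the remaining values (T, N_T, R, N_R) are hypothetical placeholders, as the actual ones appear in Table 7.3:

```python
# LUT/register cost of the designs, per the formulas above.
G, L = 4, 32        # RO stages and counter width (from the text)
S, N_S = 5, 500     # stressor blocks and ROs per block (from the text)
T, N_T = 2, 1000    # hypothetical: transmitter blocks x ROs per block
R, N_R = 1, 100     # hypothetical: receiver blocks x ROs per block

source_luts = G * T * N_T                # transmitter RO LUTs
sink_luts = G * (R * N_R + S * N_S)      # receiver + stressor RO LUTs
sink_regs = L * R * N_R                  # counter registers
print(source_luts, sink_luts, sink_regs)
```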
The second aspect is the channel capacity, which lies between that of thermal
attacks, which can transmit under 15 bits in an hour [14, 38], and power attacks
within CPUs that can transfer between 20 and 120 bits per second [1, 19].
Although the Kintex 7 boards were shown to be better sinks (often with 0% error
rate), the Artix 7 boards were faster by a factor of 7.6× (6.1 b s−1 vs. 0.8 b s−1).
This difference is significant in practice: Table 7.8 shows how long it would take
to transmit keys for different popular cryptographic algorithms. Even assuming that
the channel is not noisy, it would take almost 45 minutes to transfer a 256-bit AES
key to a KC705 board, and 3 hours to transfer a 1024-bit RSA key. However, the
AC701 board would need less than 3 minutes to transfer the same RSA key, despite
the potential drop in accuracy.
To increase accuracy, one can either tweak the parameters of the source and sink
FPGA designs (including the number of measurements M over which RO counts
are averaged) or instead change the communication scheme itself. For example, a
3-repetition code decreases bandwidth by a factor of 3, but also lowers the error rate

Table 7.8 Time to leak cryptographic keys of different sizes to the Artix 7 and Kintex 7 boards

Algorithm  Key size  AC701    KC705
AES        256       0.7 min  44.7 min
ECDSA      521       1.4 min  91.1 min
RSA        1024      2.8 min  179.0 min

e to 3e^2 − 2e^3: a 10% error rate is reduced to under 3%. The channel capacity is
1 − H(e) = 1 + e log_2 e + (1 − e) log_2(1 − e), and for smaller bitflip probabilities, other
error correcting codes such as Hamming and Golay codes can be used to improve
accuracy.
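These expressions are easy to check numerically; the following sketch evaluates the 3-repetition error rate and the binary symmetric channel capacity from the formulas above:

```python
import math

# Error rate after 3-repetition majority decoding, and the binary symmetric
# channel capacity 1 - H(e), matching the expressions in the text.
def rep3_error(e):
    return 3 * e**2 - 2 * e**3

def capacity(e):
    if e in (0.0, 1.0):
        return 1.0
    return 1 + e * math.log2(e) + (1 - e) * math.log2(1 - e)

print(round(rep3_error(0.10), 3))  # 0.028: a 10% error rate drops below 3%
print(round(capacity(0.10), 3))    # ~0.531 bits per channel use
```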

7.7.2 Defense Mechanisms

In this section, we discuss potential software and hardware defense countermeasures
against voltage-based covert- and side-channel attacks. To start with, some
countermeasures might revolve around preventing intentional transmissions from the
covert-channel source. However, doing so would be particularly hard without huge
sacrifices in terms of power and performance. Although we used ring oscillators to
cause fluctuations in the voltage of FPGAs sharing the same PSU, other switching
activity can also result in voltage over- and under-shoots. For example, prior work
has shown that switching large sets of programmable interconnect points [47] or
self-oscillating circuits consisting of flip-flops or carry chains [21] can cause voltage
fluctuations outside of the allowed operating voltage range for an FPGA device.
Moreover, we demonstrated CPU-to-FPGA and GPU-to-FPGA channels, which
show that the problem is not FPGA-specific, but can be found in other types of
activities that result in large power draws. Consequently, unless power is equalized
among all possible algorithm implementations, some leakage that can differentiate
between levels of activity will persist.
To prevent side-channel attacks from being possible, designers may remove the
power-draw dependence on the data being processed and increase the noise level.
Although several masking and hiding techniques have been proposed, leakage on
FPGAs persists due to variations in placement and routing [6]. Consequently, a
better approach is to prevent the leakage from being measurable on the FPGA sinks.
Current FPGA cloud providers prevent voltage and temperature monitors from
being accessible by user logic and prohibit traditional LUT-ROs from being
instantiated on their infrastructure [2]. However, alternative ring oscillator designs
can bypass cloud restrictions [9, 10, 21, 22, 35] and can also replace LUT-
ROs (Sect. 7.5.5). Moreover, time-to-digital converters (TDCs) can also be used
instead of ring oscillators to monitor voltage fluctuations and conduct side-channel
attacks [33]. Although compiler tools that check for combinatorial loops and
latches [21, 22] would prevent some of the above monitoring logic, it would not
necessarily catch all forms of self-oscillating logic.

Given that designing effective countermeasures against side- and covert-channel
receivers is an arms race, defense-in-depth would dictate run-time solutions in
addition to any preventive approach. One feature of the covert channel is the high
switching activity on the receiver. Built-in voltage monitors (such as those proposed
for shared FPGAs [12, 25, 31]) could thus be used by cloud providers to detect
abnormal fluctuations—with the caveat that legitimate circuits may also cause
similar patterns, and that, at least on the AC701 boards, the number of enabled
stressor ROs was small (N_S = 500). In fact, proposals to “detect the insertion of
power measurement circuits onto a device’s power rail” [23] are similar, though the
challenge is to reduce false positives.
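A run-time voltage monitor of this kind could, in its simplest form, flag measurement windows whose readings stray too far from a calibrated baseline. The sketch below is our own minimal illustration of such thresholding (the sample values and the 3σ threshold are invented for the example):

```python
import statistics

# Minimal sketch of threshold-based run-time detection: flag measurement
# windows whose voltage-sensor reading deviates from a calibrated baseline
# by more than k standard deviations. All values here are illustrative.
def find_anomalies(baseline, samples, k=3.0):
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return [i for i, s in enumerate(samples) if abs(s - mu) > k * sigma]

baseline = [1000, 1002, 998, 1001, 999, 1000, 1003, 997]
samples = [1001, 999, 950, 1002, 1060]  # two windows with heavy fluctuation
print(find_anomalies(baseline, samples))  # flags indices 2 and 4
```

As the text notes, the hard part is tuning k so that legitimate high-activity circuits do not trigger false positives.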
Finally, better hardware (at a higher cost) can also help hide the useful signal
under the noise floor. For example, independent, fully separate power supplies
for different boards would require that the leakage be detectable even over the
AC power line, and through two different AC-to-DC rectifiers. Moreover, better
isolation of power circuits within the same PSU, as well as voltage regulators with
better transient responses on both the source and the sink FPGAs, or differently
designed powering circuits with more filters and smoothing capacitors can also
reduce the signal available to an attacker.

7.8 Related Work

This section summarizes prior work in remote FPGA attacks without physical
access to the boards (Sect. 7.8.1), as well as voltage- and temperature-based covert
channels (Sect. 7.8.2).

7.8.1 Remote FPGA Attacks

Although attacks on FPGA systems have traditionally required physical access to
the FPGA board, a recent class of remote attacks has emerged. These attacks have
used ring oscillators and TDCs as covert- and side-channel receivers, and ROs and
other circuits as covert-channel transmitters and fault attack inductors.
Most of the proposed attacks operate in the multi-tenant setting, from a weak
threat model where logic resources of different tenants are adjacent [8, 9, 11] to
progressively stronger ones where the attacker and victim are physically separated
on the same FPGA die [30, 45, 46] or even across separate dies on 2.5D-
integrated FPGA chips [10]. The target applications are equally diverse, from
covert channels [10] and fingerprinting different applications [13] to recovering
cryptographic keys [33, 46] and inferring machine learning parameters [29, 37, 45].
In parallel, fault attacks have been used for similar purposes, from biasing True
Random Number Generators (TRNGs) [27] and causing errors in CPU-FPGA SoC
hybrids [26] to attacking neural networks [4, 24, 32].

So far, there have only been few works that consider remote attacks in the
single-tenant setting. One such attack by Tian and Szefer introduced a temporal
thermal channel, where different users receive time-shared access to the same FPGA
fabric [38]. A different attack by Schellenberg et al. considered cross-chip side-
channel attacks to recover RSA keys [33]. However, the chips were located on the
same FPGA board that is explicitly “designed for external side-channel analysis
research” [33], and hence shared the same voltage regulator, making them easier to
influence directly, due to the lack of additional intermediate components between
their power distribution networks.

7.8.2 Power and Temperature Covert Channels

It is well-known that data-dependent power consumption can be used to recover
cryptographic keys through differential power analysis and other techniques by
acquiring and analyzing power traces [20]. The same principles can be applied to
create covert communication, for example, from a malware app on a phone to a
malicious USB charger [34], or from a program that modulates CPU utilization
to an attacker measuring the current consumption of the computer [15]. Similarly,
measuring voltage ripple on the power lines can be used to track the power usage
pattern of other data center tenants [17]. Although these works exploit the same
source of information leakage, they require external equipment to detect these data-
dependent power variations and are thus not applicable to cloud environments in
practice. However, it is possible to use the reconfigurable fabric of FPGAs as a
covert-channel sink, allowing for accurate transmission of data remotely, without
physical access.
Another category of power attacks that has recently been discovered is related
to dynamic voltage and frequency scaling (DVFS) on modern processors, which
regulates the voltage and frequency of CPUs in accordance with usage demands.
Malicious software can exploit DVFS to cause faults in computations [36], or create
covert channels between CPU cores, where the source core modulates frequency,
and the sink core measures a reduction in its own performance [19].
Thermal attacks can also be used to create covert channels between CPU
cores [28], but they require access to CPU thermal sensors and are slower than their
power counterparts, having a capacity of up to 300 b s−1 [3]. Temperature-based
covert channels need not be limited to communication within a single computer.
Assuming computers are sufficiently close, a covert channel between nearby yet
air-gapped devices is also possible with access to temperature sensors on the sink
computer [14]. Finally, thermal information can also be used as a proxy estimate for
power consumption in data centers. This information can alert potential adversaries
to opportune moments to attack the availability of servers, either by exceeding the
power capacity [18], or by more generally degrading performance [7]. Although
these attacks require privileged thermal sensors, FPGA ROs could also be used for
similar purposes, complementing our work.

7.9 Conclusion

In this chapter, we presented the first FPGA-to-FPGA, CPU-to-FPGA, and
GPU-to-FPGA voltage-based covert channels, achieving transmission accuracies of up to
100%. Unlike prior work, which unrealistically assumes that different users share
the same FPGA fabric, our work considered a stronger threat model, where the
FPGA chip and board are allocated on a per-user basis. Our covert channel exploited
properties of the response of power supply units (PSUs) and voltage regulators to
changes in their load. To detect these changes, we introduced a novel architectural
design and classification metric that depends on stressor ring oscillators on the
covert-channel sink FPGA. We showed that ring oscillators also performed well
in the source FPGA and further showed that heavy CPU and GPU activity could
also be used as an effective transmitter. We demonstrated our covert channel on
four Artix 7 and Kintex 7 boards, creating a channel of communication between
any two of them in either direction, with high accuracy. We also performed an
analysis of bandwidth–accuracy tradeoffs and further explored the accuracy of the
covert channel across different sizes and types of the sink and source FPGA circuits,
different measurement patterns and setup layouts, and PSUs with different power
ratings from two manufacturers. We finally proposed potential countermeasures to
prevent the information leakage we discovered from being exploitable. Overall, our
remote covert-channel attacks highlight the dangers of shared power supply units,
and therefore a need to re-think FPGA security, even for single-user monolithic
designs.

Acknowledgments This work was supported in part by NSF grant 1901901.

References

1. Alagappan, M., Rajendran, J., Doroslovački, M., & Venkataramani, G. (2017). DFS covert
channels on multi-core platforms. In IFIP/IEEE international conference on very large scale
integration (VLSI-SoC).
2. Amazon Web Services (2021). AWS EC2 FPGA HDK+SDK errata. https://github.com/aws/
aws-fpga/blob/master/ERRATA.md. Accessed: 2023-05-21.
3. Bartolini, D. B., Miedl, P., & Thiele, L. (2016). On the capacity of thermal covert channels in
multicores. In European conference on computer systems (EuroSys).
4. Boutros, A., Hall, M., Papernot, N., & Betz, V. (2020). Neighbors from hell: Voltage attacks
against deep learning accelerators on multi-tenant FPGAs. In International conference on field-
programmable technology (FPT).
5. Corsair (2010). Professional series Gold AX850–80 PLUS Gold certified fully-modular power
supply. https://www.corsair.com/p/CMPSU-850AX. Accessed: 2023-05-21.
6. De Cnudde, T., Ender, M., & Moradi, A. (2018). Hardware masking, revisited. IACR
Transactions on Cryptographic Hardware and Embedded Systems (TCHES), 2018(2), 123–148.
7. Gao, X., Xu, Z., Wang, H., Li, L., & Wang, X. (2018). Reduced cooling redundancy: A
new security vulnerability in a hot data center. In Network and distributed system security
symposium (NDSS).

8. Giechaskiel, I., Rasmussen, K. B., & Eguro, K. (2022). Long-wire leakage: The threat of
crosstalk. IEEE Design and Test (D&T), 39(4), 41–48.
9. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Measuring long wire leakage with
ring oscillators in cloud FPGAs. In International conference on field programmable logic and
applications (FPL).
10. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Reading between the dies: Cross-SLR
covert channels on multi-tenant cloud FPGAs. In IEEE international conference on computer
design (ICCD).
11. Giechaskiel, I. & Szefer, J. (2020). Information leakage from FPGA routing and logic elements.
In International conference on computer-aided design (ICCAD).
12. Glamočanin, O., Mahmoud, D., Regazzoni, F., & Stojilović, M. (2021). Shared FPGAs and the
holy grail: Protections against side-channel and fault attacks. In Design, automation & test in
Europe (DATE).
13. Gobulukoglu, M., Drewes, C., Hunter, W., Kastner, R., & Richmond, D. (2021). Classifying
computations on multi-tenant FPGAs. In Design automation conference (DAC).
14. Guri, M., Monitz, M., Mirski, Y., & Elovici, Y. (2015). BitWhisper: Covert signaling channel
between air-gapped computers using thermal manipulations. In IEEE computer security
foundations symposium (CSF).
15. Guri, M., Zadov, B., Bykhovsky, D., & Elovici, Y. (2020). PowerHammer: Exfiltrating data
from air-gapped computers through power lines. IEEE Transactions on Information Forensics
and Security (TIFS), 15, 1879–1890.
16. Hajimiri, A., Limotyrakis, S., & Lee, T. H. (1999). Jitter and phase noise in ring oscillators.
IEEE Journal of Solid-State Circuits (JSSC), 34(6), 790–804.
17. Islam, M. A., & Ren, S. (2018). Ohm’s law in data centers: A voltage side channel for timing
power attacks. In ACM conference on computer and communications security (CCS).
18. Islam, M. A., Ren, S., & Wierman, A. (2017). Exploiting a thermal side channel for power
attacks in multi-tenant data centers. In ACM conference on computer and communications
security (CCS).
19. Khatamifard, S. K., Wang, L., Das, A., Köse, S., & Karpuzcu, U. R. (2019). POWERT chan-
nels: A novel class of covert communication exploiting power management vulnerabilities. In
IEEE international symposium on high-performance computer architecture (HPCA).
20. Kocher, P., Jaffe, J., Jun, B., & Rohatgi, P. (2011). Introduction to differential power analysis.
Journal of Cryptographic Engineering, 1(1), 5–27.
21. La, T., Pham, K., Powell J., & Koch, D. (2021). Denial-of-Service on FPGA-based cloud
infrastructures: Attack and defense. IACR Transactions on Cryptographic Hardware and
Embedded Systems (TCHES), 2021(3), 441–464.
22. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 13(3), 1–31.
23. Le Masle, A., & Luk, W. (2012). Detecting power attacks on reconfigurable hardware. In
International conference on field programmable logic and applications (FPL).
24. Luo, Y., Gongye, C., Fei, Y., & Xu, X. (2021). DeepStrike: Remotely-guided fault injection
attacks on DNN accelerator in cloud-FPGA. In Design automation conference (DAC).
25. Luo, Y. & Xu, X. (2020). A quantitative defense framework against power attacks on multi-
tenant FPGA. In International conference on computer-aided design (ICCAD).
26. Mahmoud, D., Hussein, S., Lenders, V., & Stojilović, M. (2022). FPGA-to-CPU undervolting
attacks. In Design, automation and test in Europe (DATE).
27. Mahmoud, D., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant FPGAs.
In Design, automation and test in Europe (DATE).
28. Masti, R. J., Rai, D., Ranganathan, A., Müller, C., Thiele, L., & Čapkun, S. (2015). Thermal
covert channels on multi-core platforms. In USENIX security symposium.
29. Moini, S., Tian, S., Holcomb, D., Szefer, J., & Tessier, R. (2021). Remote power side-channel
attacks on BNN accelerators in FPGAs. In Design, automation and test in Europe (DATE).

30. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power distribution attacks in multi-tenant
FPGAs. IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 28(12), 2685–
2698.
31. Provelengios, G., Holcomb, D., & Tessier, R. (2021). Mitigating voltage attacks in multi-tenant
FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 14(2), 1–24.
32. Rakin, A. S., Luo, Y., Xu, X., & Fan, D. (2021). Deep-Dup: An adversarial weight duplication
attack framework to crush deep neural network in multi-tenant FPGA. In USENIX security
symposium.
33. Schellenberg, F., Gnad, D. R. E., Moradi, A., & Tahoori, M. B. (2018). Remote Inter-chip
power analysis side-channel attacks at board-level. In International conference on computer-
aided design (ICCAD).
34. Spolaor, R., Abudahi, L., Moonsamy, V., Conti, M., & Poovendran, R. (2017). No free charge
theorem: A covert channel via USB charging cable on mobile devices. In Applied cryptography
and network security (ACNS).
35. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters, 55(11),
640–642.
36. Tang, A., Sethumadhavan, S., & Stolfo, S. (2017). CLKSCREW: Exposing the perils of
security-oblivious energy management. In USENIX security symposium.
37. Tian, S., Moini, S., Wolnikowski, A., Holcomb, D., Tessier, R., & Szefer, J. (2021). Remote
power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In IEEE international
symposium on field-programmable custom computing machines (FCCM).
38. Tian, S., & Szefer, J. (2019). Temporal thermal covert channels in cloud FPGAs. In
ACM/SIGDA international symposium on field-programmable gate arrays (FPGA).
39. Timonen, V. (2020). Multi-GPU CUDA stress test. http://wili.cc/blog/gpu-burn.html.
Accessed: 2023-05-21.
40. Waterland, A. P. (2014). Stress. https://web.archive.org/web/20190502184531/https://people.
seas.harvard.edu/~apw/stress/. Accessed: 2023-05-21.
41. Xilinx, Inc. (2012). 7 Series Product Brief. https://www.xilinx.com/publications/prod_mktg/7-
Series-Product-Brief.pdf. Accessed: 2023-05-21.
42. Xilinx, Inc. (2019). AC701 evaluation board for the Artix-7 FPGA (UG952). https://www.
xilinx.com/support/documentation/boards_and_kits/ac701/ug952-ac701-a7-eval-bd.pdf.
Accessed: 2023-05-21.
43. Xilinx, Inc. (2019). KC705 evaluation board for the Kintex-7 FPGA (UG810). https://
www.xilinx.com/support/documentation/boards_and_kits/kc705/ug810_KC705_Eval_Bd.
pdf. Accessed: 2023-05-21.
44. Xilinx, Inc. (2020). 7 series FPGAs data sheet: Overview (DS180). https://www.xilinx.com/
support/documentation/data_sheets/ds180_7Series_Overview.pdf. Accessed: 2023-05-21.
45. Zhang, Y., Yasaei, R., Chen, H., Li, Z., & Al Faruque, M. A. (2021). Stealing neural network
structure through remote FPGA side-channel analysis. IEEE Transactions on Information
Forensics and Security (TIFS), 16, 4377–4388.
46. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In IEEE
symposium on security and privacy (S&P).
47. Zick, K. M., Srivastav, M., Zhang, W., & French, M. (2013). Sensing nanosecond-scale voltage
attacks and natural transients in FPGAs. In ACM/SIGDA international symposium on field-
programmable gate arrays (FPGA).
Chapter 8
Microarchitectural Vulnerabilities
Introduced, Exploited, and Accelerated
by Heterogeneous FPGA-CPU Platforms

Thore Tiemann, Zane Weissman, Thomas Eisenbarth, and Berk Sunar

8.1 Introduction

In the late 2010s, a number of high-performance, scalable, reconfigurable, and
affordable Field-Programmable Gate Array (FPGA)-based acceleration platforms
were first publicly deployed on major cloud service providers [1, 2]. These platforms
enabled what cloud providers call “FPGA-as-a-Service:” the ability to rent FPGAs
to clients on a real-time as-needed basis, just as cloud providers now rent Central
Processing Unit (CPU) resources. The FPGA systems that power FPGA-as-a-
Service, including Xilinx’s Ultrascale+ series and Intel’s Arria 10, Stratix, and
Agilex series, are designed for high I/O bandwidth and high compute capacity,
making them ideal for demanding server workloads.
New Intel FPGAs offer cache-coherent memory systems for even better per-
formance when data is being passed back and forth between CPU and FPGA.
Integrated FPGA platforms connect the FPGA into the processor bus interconnect
giving the FPGA direct access into cache and memory [13]. Similarly, high-end
FPGAs can be integrated into a server as an accelerator, e.g., connected via a PCIe
interface [29, 59]. Such combinations provide unprecedented performance over a
high-throughput and low-latency connection with the versatility and scalability of a
reprogrammable FPGA infrastructure shared among cloud users. However, the tight
integration that enables such performance may also expose users to new adversarial
threats from untrusted cloud tenants as hardware components like caches are shared
between not only CPU tenants but also FPGA tenants alike.

T. Tiemann · T. Eisenbarth
Universität zu Lübeck, Lübeck, Germany
e-mail: t.tiemann@uni-luebeck.de; thomas.eisenbarth@uni-luebeck.de
Z. Weissman () · B. Sunar
Worcester Polytechnic Institute, Worcester, MA, USA
e-mail: zweissman@wpi.edu; sunar@wpi.edu

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 203
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_8

This work exposes hardware vulnerabilities in hybrid FPGA-CPU systems with
a particular focus on cloud platforms where the FPGA and the CPU are in
distinct security domains: one potentially a victim and the other an attacker. We
examine Intel’s Arria 10 GX FPGA as an example of a current generation of
FPGA accelerator platform designed in particular for heavy and/or cloud-based
computation loads. We thoroughly analyze the memory interfaces between such
platforms and their host CPUs. These interfaces, which allow the CPU and FPGA
to interact in various direct and indirect ways, include hardware on both the FPGA
and CPU, application libraries and software drivers executed by the CPU, and
logical interfaces implemented on the FPGA outside of but accessible to the user-
configurable region. We propose attacks that exploit practical use cases of these
interfaces to target adjacent systems such as the CPU memory and cache. Recent
results improve on our cache attacks [48] and show that FPGA-based accelerators
can spy on neighboring peripherals through caches like the IOTLB [53] which are
inaccessible by the CPU.

8.2 Background

This section provides background knowledge about cache attack techniques as well
as the Rowhammer effect. It also explains the math behind the Chinese Remainder
Theorem (CRT) commonly used to implement the Rivest-Shamir-Adleman (RSA)
cryptosystem [4].

8.2.1 Cache Attacks

Cache attacks against many kinds of applications have been proposed [5, 10, 21, 23, 45, 54].
In general, cache attacks use the timing of cache behaviors to leak information.
Modern cache systems use a hierarchical architecture that includes smaller, faster
caches and bigger, slower caches. Measuring the latency of a memory access
can often confidently determine which levels of cache contain a certain memory
address (or if the memory is cached at all). Many modern cache subsystems also
support coherency, which ensures that whenever memory is overwritten in one
cache, copies of that memory in other caches are either updated or invalidated.
Cache coherency may allow an attacker to learn about a cache line that is not
even directly accessible [34]. Cache attacks have become a major focus of security
research in cloud computing platforms where users are allocated CPUs, cores, or
virtual machines which, in theory, should offer perfect isolation, but in practice may
leak information to each other via shared caches [26]. An introduction to various
cache attack techniques is given below.
A Flush+Reload (F+R) attack [61] has three steps: (1) the attacker uses the
clflush instruction to flush the cache line that is to be monitored. After flushing
this cache line, (2) they wait for the victim to execute. Later, (3) they reload the
flushed line and measure the reload latency. If the latency is low, the cache line was
served from the cache hierarchy, so the cache line was accessed by the victim. If
the access latency is high, the cache line is loaded from main memory, meaning that
the victim did not access it. F+R can work across cores and even across sockets, as
long as the Last-Level Cache (LLC) is coherent, as is the case with many modern
multi-CPU systems. Flush+Flush (F+F) [20] is similar to F+R, but the third step is
different: the attacker flushes the cache line again and measures the execution time
of the flush instruction instead of the memory access.
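The final step of both F+R and F+F reduces to a threshold test on a measured latency. The following minimal sketch models only that decision logic; the threshold and latencies are hypothetical and must in practice be calibrated per machine, e.g., by timing known-cached and known-flushed accesses.

```python
# Decision step of Flush+Reload / Flush+Flush, modeled abstractly.
# The threshold is a HYPOTHETICAL value, not taken from any particular
# CPU; real attacks calibrate it on the target machine.

CACHE_HIT_THRESHOLD = 100  # CPU cycles (machine-specific assumption)

def victim_accessed(reload_latency: int,
                    threshold: int = CACHE_HIT_THRESHOLD) -> bool:
    """Flush+Reload step (3): a fast reload means the line was served
    from the cache hierarchy, i.e., the victim touched it between the
    flush and the reload."""
    return reload_latency < threshold
```

For example, a hypothetical 70-cycle reload would be classified as a victim access, while a 200-cycle reload served from DRAM would not.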
Orthogonal to F+R, if the attacker does not have access to an instruction to flush
a cache line, he/she can instead evict the desired cache line by accessing cache lines
that form an eviction set in an Evict+Reload (E+R) [39] attack. Eviction sets are
described shortly. E+R can be used if the attacker shares the same CPU socket (but
not necessarily the same core) as the victim and if the LLC is inclusive.1 F+R, F+F,
and E+R are limited to shared memory scenarios, where the victim and attacker
share data or instructions, e.g., when memory de-duplication is enabled.
Prime+Probe (P+P) gives the attacker less temporal resolution than the
aforementioned methods since the attacker checks the status of the cache by probing a whole
cache set rather than flushing or reloading a single line. However, this resolution
is sufficient in many cases [3, 39, 43, 45, 46, 49, 64]. P+P has three steps: (1) the
attacker primes the cache set under surveillance with dummy data by accessing
a proper eviction set, (2) he/she waits for the victim to execute, and (3) he/she
accesses the eviction set again and measures the access latency (probing). If the
latency is above a certain threshold, some parts of the eviction set were evicted by
the victim process, meaning that the victim accessed cache lines belonging to the
cache set under surveillance [41]. Unlike F+R, E+R, and F+F, P+P does not rely
on shared memory. However, it is noisier, only works if the victim is located on
the same socket as the attacker, and relies on inclusive caches. An alternative attack
against non-inclusive caches is to target the cache directory structure [60].
In scenarios where the attacker cannot probe the target cache set or line but can
still influence the target cache line, an Evict+Time (E+T) is still possible depending
on the target application. In an E+T attack, the attacker only evicts the victim’s
cache line and measures the aggregate execution time of the victim’s operation,
hoping to observe a correlation between the execution time of an operation such as
a cryptographic routine and the cache access pattern.

8.2.1.1 Eviction Sets

Caches store data in units of cache lines that can hold 2^b bytes each (64 bytes on
many Intel CPUs, including most Core, Xeon, and Atom architectures). Caches

1 A lower level cache is called inclusive of a higher level cache if all cache lines present in the
higher level cache are always present in the lower level cache.
are divided into 2^s sets, each capable of holding w cache lines. w is called the
way-ness or associativity of the cache. An eviction set is a set of congruent cache
line addresses capable of filling a whole cache set. Two cache lines are considered
congruent if they belong to the same cache set. Memory addresses are mapped to
cache sets depending on the s bits of the physical memory address directly following
the b cache line offset bits, which are the least significant bits. Additionally, some
caches are divided into n slices, where n is the number of CPU cores. In the
presence of slices, each slice has 2^s sets with w ways each. Previous work
has reverse-engineered the mapping of physical address bits to cache slices on some
Intel processors [33]. A minimal eviction set contains w addresses and therefore fills
an entire cache set when accessed.
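The mapping just described can be sketched in a few lines of Python. The parameters are illustrative (b = 6 matches the 64-byte lines common on Intel CPUs; s and w vary by cache level and model), and cache slicing is ignored, so on a sliced LLC the generated addresses would additionally need to hash to the same slice.

```python
# Set-index computation and congruence test for an unsliced cache.
# B, S, and W are ILLUSTRATIVE parameters, not those of a specific CPU.

B = 6    # log2(line size): 64-byte cache lines
S = 10   # log2(number of sets)
W = 12   # associativity (ways per set)

def cache_set(paddr: int, b: int = B, s: int = S) -> int:
    """Set index = the s physical-address bits just above the b offset bits."""
    return (paddr >> b) & ((1 << s) - 1)

def congruent(a1: int, a2: int) -> bool:
    """Two lines are congruent if they map to the same cache set."""
    return cache_set(a1) == cache_set(a2)

def minimal_eviction_set(base: int, b: int = B, s: int = S, w: int = W):
    """w congruent addresses: stepping by 2^(b+s) leaves the set-index
    bits unchanged, so all w addresses land in the same set."""
    stride = 1 << (b + s)
    return [base + i * stride for i in range(w)]
```

Accessing all w addresses of such a set fills the target cache set and evicts whatever the victim had cached there.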

8.2.2 Rowhammer

DRAM cells discharge over time, and the memory controller has to refresh the
cells to avoid accidental data corruption. Generally, DRAM cells are laid out in
banks and rows, and each row within a bank has two adjacent rows, one on either
side. In a Rowhammer attack, memory addresses in the same bank as the target
memory address are accessed in quick succession. When memory adjacent to the
target is accessed repeatedly, the electrostatic interference generated by the physical
process of accessing the memory can elevate the discharge for bits stored in the
target memory. A “single-sided” Rowhammer performs accesses to just one of these
rows to generate bit flips in the target row; a “double-sided” Rowhammer performs
accesses to both adjacent rows and is generally more effective in producing bit flips.
A Rowhammer relies on the ability to find blocks of memory accessible to the
malicious program (or in this work, hardware) that are in the same memory bank
as a given target address. The standard way of finding these memory addresses is by
exploiting row buffer conflicts as a timing side channel [14]. Pessl et al. [47] reverse-
engineered the bank mapping algorithms of several CPU and DRAM configurations,
which allows an attacker to deterministically calculate all of the physical addresses
that share the same bank if the chipset and memory configuration are known.
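The bank-mapping functions recovered by Pessl et al. [47] typically take the form of XOR (parity) functions over subsets of physical address bits. The sketch below mirrors that structure with made-up masks; the real masks depend on the chipset and DIMM configuration and must be reverse-engineered, e.g., via the row-buffer-conflict timing channel mentioned above.

```python
# DRAMA-style address-to-bank mapping sketch. The XOR masks and the
# row-bit position are HYPOTHETICAL placeholders, not those of any
# real memory controller.

BANK_MASKS = [0x02040, 0x24000, 0x48000, 0x90000]  # assumed XOR masks
ROW_SHIFT = 18                                     # assumed row-bit offset

def bank_of(paddr: int) -> int:
    """Each bank bit is the parity of the address bits selected by one mask."""
    bank = 0
    for i, mask in enumerate(BANK_MASKS):
        parity = bin(paddr & mask).count("1") & 1
        bank |= parity << i
    return bank

def same_bank_different_row(target: int, candidate: int) -> bool:
    """Aggressor addresses for hammering a target must share the bank
    but lie in a different (ideally adjacent) row."""
    return (bank_of(target) == bank_of(candidate)
            and (target >> ROW_SHIFT) != (candidate >> ROW_SHIFT))
```

With the real masks in hand, an attacker can deterministically enumerate all candidate aggressor addresses for a victim row instead of probing for row-buffer conflicts.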

8.2.3 RSA-CRT Signing

RSA signatures are computed by raising a plaintext m to a secret power d modulo
N = pq, where p and q are prime and secret, and N is public [8]. These numbers
must all be large for RSA to be secure, which makes the exponentiation rather slow.
However, there is an algebraic shortcut for modular exponentiation: the CRT, used
in many RSA implementations, including OpenSSL [11] and WolfSSL, which is
brought under attack as described in Sect. 8.6. The basic form of the RSA-CRT
signature algorithm is shown in Algorithm 1. The execution of the CRT algorithm
Algorithm 1: Chinese remainder theorem RSA signature

Input: m: message; d: private exponent; p: private factor; q: private factor
Result: S: signature
S_p ← m^(d_p) mod p        // equivalent to m^d mod p
S_q ← m^(d_q) mod q        // equivalent to m^d mod q
I_q ← q^(-1) mod p         // inverse of q mod p
S ← S_q + q * ((S_p − S_q) * I_q mod p)

is much faster than the computation of m^d mod N because d_p and d_q are of order
p and q, respectively, while d is of order N, which, being the product of p and q,
is significantly greater than p or q. It is around four times faster to compute the two
exponentiations m^(d_p) and m^(d_q) individually than it is to compute m^d outright [4].
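Algorithm 1 can be checked with a toy Python implementation using textbook-sized numbers (the classic p = 61, q = 53 example); real keys are thousands of bits long and require constant-time big-integer arithmetic plus blinding and fault checks.

```python
# Toy RSA-CRT signing per Algorithm 1, with insecure demonstration-
# sized primes. The result can be verified against the direct
# computation m^d mod N.

p, q = 61, 53                       # toy private factors
N = p * q                           # public modulus (3233)
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent

def sign_crt(m: int) -> int:
    dp, dq = d % (p - 1), d % (q - 1)
    Sp = pow(m, dp, p)              # equivalent to m^d mod p
    Sq = pow(m, dq, q)              # equivalent to m^d mod q
    Iq = pow(q, -1, p)              # inverse of q mod p
    return Sq + q * (((Sp - Sq) * Iq) % p)
```

Because S ≡ S_q (mod q) and S ≡ S_p (mod p), the recombined value equals m^d mod N for every message m in [0, N).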

8.2.4 IOTLB Side Channel

Tiemann et al.’s paper “IOTLB-SC” [53] demonstrated a timing side-channel
vulnerability in the I/O translation lookaside buffer (IOTLB), a specialized cache
that stores the results of recent Direct Memory Access (DMA) address translation
lookups. FPGAs can exploit low-noise Flush+Reload and Prime+Probe type side
channels in the IOTLB, making PCIe peripherals like smart NICs, GPUs, or other
FPGAs that share a DMA remapping engine with a malicious FPGA potentially
vulnerable to a snooping attack.

8.3 Experimental Setup

We experiment with two distinct FPGA-CPU platforms with Intel Arria 10 FPGAs:
(1) an FPGA integrated into the CPU package and (2) a Programmable Acceleration Card.
The integrated Intel Arria 10 is based on a prototype E5-2600v4 CPU with 12
physical cores [31]. The CPU has a Broadwell architecture in which the LLC is
inclusive of the L1/L2 caches. The CPU package has an integrated Arria 10 GX
1150 FPGA running at 400 MHz [31]. All measurements done on this platform are
strictly done from user space only, as access is provided by Intel through their Intel
Lab (IL) Academic Compute Environment (ACE) [32]. The IL environment also
gives us access to platforms with Programmable Acceleration Cards (PACs) with an
Arria 10 GX 1150 FPGA installed and running at 200 MHz. These systems have
Intel Xeon Platinum 8180 CPUs that come with non-inclusive LLCs. We carried out
Rowhammer [19] experiments on a local Dell Optiplex 7010 system with an Intel
i7-3770 CPU and a single DIMM of Samsung M378B5773DH0-CH9 1333 MHz
2 GB DDR3 DRAM equipped with the same Intel PAC running with a primary
clock speed of 200 MHz.2
The Operating System (OS) running in the IL ACE is a 64-bit Red Hat Enterprise
Linux 7 with kernel version 3.10. The Open Programmable Acceleration Engine
(OPAE) was compiled and installed on July 15, 2019 for both the FPGA PAC
and the integrated FPGA platform. We used Quartus 17.1.1 and Quartus 16.0.0
to synthesize hardware designs for the PACs and integrated FPGAs, respectively.
The bitstream version of the non-user-configurable Board Management Controller
(BMC) firmware is 1.1.3 on the FPGA PAC and 5.0.3 on the integrated FPGA. The
OS on the local Optiplex 7010 workstation is Ubuntu 16.04.4 LTS with Linux kernel
4.13.0-36. On this system, we installed the latest stable release of OPAE at the time
of the experiments, 1.3.0, and on its FPGA PAC, we installed the compatible 1.1.3
BMC firmware bitstream.

8.4 Analysis of Intel FPGA-CPU Systems

This section explains the hardware and software interfaces that the Intel Arria 10
GX FPGA platforms use to communicate with their host CPUs and the firmware,
drivers, and architectures that underlay them. Figure 8.1 gives an overview of this
type of architecture.
Intel refers to a single logical unit implemented in FPGA logic and having a
single interface to the CPU as an Accelerator Functional Unit (AFU). The AFU is
an abstraction of the user-configurable logic that can be thought of as analogous to
a program for a processor. Available FPGA platforms only support one AFU per
Partial Reconfiguration Unit (PRU). The FPGA Interface Manager (FIM) is part of
the non-user-configurable portion of the FPGA and contains external interfaces like
memory and network controllers as well as the FPGA Interface Unit (FIU), which
bridges those external interfaces with internal interfaces to the AFU.

8.4.1 Intel FPGA Platforms

Intel’s Arria 10 GX Programmable Acceleration Card (PAC) is a PCIe expansion
card for FPGA acceleration [29]. The Arria 10 GX FPGA on the card communicates
with its host processor over a single PCIe Gen3x8 connection. Memory reads and
writes from the FPGA to the CPU’s main memory use physical addresses; in virtual
environments, the PCIe controller on the CPU side implements an I/O Memory
Management Unit (IOMMU) to translate physical addresses in the virtual machine

2 The PAC is intended to support 400 MHz clock speed, but the current version of the Intel
Acceleration Stack (IAS) has a bug that halves the clock speed.
[Figure 8.1: architecture diagram showing the application, OPAE API, drivers, and OS on the CPU side, with cores, L1/L2 caches, LLC, IOMMU, and RAM, connected via UPI and PCIe to the Arria 10’s FIU and AFU]

Fig. 8.1 Overview of the architecture of CPU-FPGA systems based on Intel FPGAs. The software
part of the IAS called the OPAE is highlighted in orange. Applications (yellow) use OPAE’s API
to communicate with the AFU. The green region marks the part of the FPGA that is reconfigurable
from user space at runtime. The blue region shows the static soft core of the FPGA. It exposes the
CCI-P interface to the AFU

(what Intel calls I/O Virtual Addresses (IOVAs)) to physical addresses in the host.
Alongside the FPGA, the PAC contains 8 GB of DDR4, 128 MB of flash memory,
and USB 2.0 for debugging.
An alternative accelerator platform is the Xeon server processor with an
integrated Arria 10 FPGA in the same package [13]. The FPGA and CPU are
closely connected through two PCIe Gen3x8 links and an Ultra Path Interconnect
(UPI) link. UPI is Intel’s high-speed CPU interconnect (replacing its predecessor
QPI) in Skylake and later Intel CPU architectures [44]. The FPGA has a 128 KiB
direct mapped cache that is coherent with the CPU caches over the UPI bus. Like
the PCIe link on the PAC, both the PCIe links and the UPI link use I/O virtual
addressing, appearing as physical addresses to virtualized environments. As the UPI
link bypasses the PCIe controller’s IOMMU, the FIU implements its own IOMMU
and device TLB to translate physical addresses for reads and writes using UPI [28].

8.4.2 Intel’s FPGA-CPU Compatibility Layers

Intel’s latest generations of FPGA products are designed for use with the Open
Programmable Acceleration Engine (OPAE) [27] which is part of the Intel Accel-
eration Stack (IAS). OPAE is an open-source, hardware-flexible software stack
for accessing FPGAs. Intel’s compatible FPGAs use the Core Cache Interface
(CCI-P), a hardware host interface for AFUs that specifies transaction requests,
header formats, timing, and memory models [28]. OPAE provides a software
interface for software developers to interact with a hosted FPGA, while CCI-P
provides a hardware interface for hardware developers to interact with a host CPU.
Excluding a few platform-specific hardware features, any CCI-P compatible AFU
should be synthesizable (and the result should be logically identical) for any CCI-P
compatible FPGA platform. OPAE is built on top of hardware- and OS-specific
drivers and as such is compatible with any system with the appropriate drivers. As
described below, the OPAE/CCI-P system provides two main methods for passing
data between the host CPU and the FPGA.

8.4.2.1 Memory-Mapped I/O (MMIO)

OPAE can send 32- or 64-bit MMIO requests to the AFU directly or it can map an
AFU’s MMIO space to OS virtual memory [27]. CCI-P provides an interface for
incoming MMIO requests and outgoing MMIO read responses. The AFU responds
to read and write requests, although an MMIO read request will time out after 65,536
cycles of the FPGA clock used to access the interface. In software, MMIO offsets
are indicated in bytes and addresses are expected to be multiples of 4 (or 8, for 64-
bit reads and writes). In CCI-P, the last two bits of the address are truncated, since
at least four bytes are used for each read or write transaction. There are 16 available
address bits in CCI-P, so the total available MMIO space is 2^16 32-bit words or
256 KiB [28].
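The address arithmetic above can be stated precisely as code. The helper name below is ours for illustration, not an OPAE or CCI-P API function.

```python
# MMIO addressing arithmetic from the text. ccip_address is an
# illustrative helper, not part of OPAE or CCI-P.

WORD_BYTES = 4       # one 32-bit MMIO word
CCIP_ADDR_BITS = 16  # CCI-P MMIO address width

def ccip_address(byte_offset: int) -> int:
    """CCI-P truncates the last two offset bits: at least four bytes
    move per MMIO transaction, so word addresses suffice."""
    if byte_offset % WORD_BYTES != 0:
        raise ValueError("MMIO byte offsets must be multiples of 4")
    return byte_offset >> 2

# 2^16 word addresses x 4 bytes/word = 256 KiB of MMIO space
mmio_space_bytes = (1 << CCIP_ADDR_BITS) * WORD_BYTES
```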

8.4.2.2 Direct Memory Access (DMA)

OPAE can request that the OS allocate a block of memory that can be read by
the FPGA. The memory is allocated in a contiguous physical address space. The
FPGA uses physical addresses to index the shared memory, so physical and virtual
offsets within the shared memory must match. On systems using Intel Virtualization
Technology for Directed I/O (VT-d) [30], which employs the IOMMU to provide
an IOVA to PCIe devices, the OS can allocate memory in continuous IOVA space
if needed by the peripheral. This approach ensures that the FPGA will see an
accessible and continuous buffer of the requested size. For buffer sizes up to and
including one 4 KiB memory page, a normal memory page will be allocated to the
calling process by the OS and configured to be accessible by the FPGA with its
IOVA or physical address. For buffer sizes greater than 4 KiB, the OPAE will call
the OS to allocate a 2 MB or 1 GB huge page. Isolating the buffer in a single page
ensures that it will be contiguously allocated in physical memory.
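The allocation policy described above amounts to a simple size-based decision. The sketch below only mirrors the text; the actual OPAE logic may differ in detail.

```python
# Size-based backing-page choice as described in the text. This is a
# model of the documented behavior, not OPAE code.

KIB, MIB, GIB = 1024, 1024**2, 1024**3

def backing_page(buffer_size: int) -> str:
    """Buffers up to one 4 KiB page use a normal page; larger buffers
    are backed by a single huge page so the whole buffer stays
    physically contiguous."""
    if buffer_size <= 4 * KIB:
        return "4KiB page"
    if buffer_size <= 2 * MIB:
        return "2MiB huge page"
    if buffer_size <= 1 * GIB:
        return "1GiB huge page"
    raise ValueError("buffer larger than the largest huge page")
```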

8.4.3 Cache and Memory Architecture on the Intel FPGAs

Before we can talk about microarchitectural attacks, the memory architecture of
Intel FPGAs has to be understood. This section describes the cache and memory
architecture for the Arria 10 PAC and the integrated Arria 10 FPGA and highlights
their common features and differences. We set a special focus on the caching
hints that can be used by Intel FPGAs to influence the cache coherency state of
cache lines.

8.4.3.1 Arria 10 PAC

The Arria 10 PAC has access to the CPU’s memory system as well as its own local
DRAM with a separate address space from that of the CPU and its memory. The
PAC’s local DRAM is always directly accessed, without a separate caching system.
When the PAC reads from the CPU’s memory, the CPU’s memory system will serve
the request from its LLC if possible. If the memory that is read or written is not
present in the LLC, the request will be served by the CPU’s main DRAM. The PAC
is unable to place cache lines into the LLC with reads but writes from the PAC
update the LLC.

8.4.3.2 Integrated Arria 10

The integrated Arria 10 FPGA has access to the host memory. Additionally, it has its
own 128 KiB cache that is kept coherent with the CPU’s caches over UPI. Memory
requests over PCIe take the same path as requests issued by an FPGA PAC. If the
request is routed over UPI, the local coherent FPGA cache is checked first. On a
local cache miss, the request is forwarded to the CPU’s LLC or main memory.

8.4.3.3 Reverse-Engineering Caching Hint Behavior

An AFU on the Arria 10 GX can exercise some control over caching behavior by
adding caching hints to memory requests. The available hints are summarized in
Table 8.1. For memory reads, RdLine_I is used to prevent local caching and
RdLine_S is used to cache data locally in a shared state. For memory writes,
WrLine_I is used to prevent local caching on the FPGA, and WrLine_M leaves
written data in the local cache in the modified state. WrPush_I does not cache data
locally but provides hints to the cache controller to cache data in the CPU’s LLC.
The CCI-P documentation lists all caching hints as available for memory requests
over UPI [28]. When sending requests over PCIe, only RdLine_I, WrLine_I,
and WrPush_I can be used while other hints are ignored. However, based on our
experiments, not all cache hints are implemented exactly to specification.
To confirm the behavior of caching hints available for DMA writes, we designed
an AFU that writes a constant string to a configurable memory address via either
UPI or PCIe and using a configurable caching hint. We used the AFU to write a
cache line and afterward timed a read access to the same cache line on the CPU. As
displayed in Fig. 8.2, these experiments confirm that the majority of the cache lines
Table 8.1 Overview of the caching hints configurable over CCI-P on an integrated FPGA. *_I
hints invalidate a cache line in the local cache. Reading with RdLine_S stores the cache line in
the shared state. Writing with WrLine_M caches the line in the modified state

Cache hint   Description                    Available on
RdLine_I     No FPGA caching                UPI, PCIe
RdLine_S     Leave FPGA cache in S state    UPI
WrLine_I     No FPGA caching                UPI, PCIe
WrLine_M     Leave FPGA cache in M state    UPI
WrPush_I     Intent to cache in LLC         UPI, PCIe

[Figure 8.2: three latency histograms (WrLine_I, WrLine_M, WrPush_I), each showing density and frequency for writes over UPI, PCIe0, and PCIe1 versus access time in CPU clock cycles]

Fig. 8.2 Memory access latency histograms and their density for cache lines accessed by the CPU
after being written to by an AFU running on an integrated Arria 10. On an Arria 10 PAC, the
behavior is nearly identical to the PCIe0 behavior in the integrated platform

written by the AFU are placed in the LLC, as access times stay below 100 CPU clock
cycles while main memory accesses take 175 cycles on average.3 This behavior is
independent of the caching hint, the bus, or the platform (PAC, integrated Arria 10).

3 Each average is computed over 200 measurements.


In fact, we only see data being written to main memory in 21.21% of all runs if the
write request uses WrLine_I or WrPush_I and is sent over UPI. All other writes
are cached in at least 99.5% of all runs. The result is surprising as the hint meant to
place the data in the cache of the integrated Arria 10 and the hint meant for writing
directly to the main memory are either ignored by the FIM and the CPU or not
implemented. Intel later verified4 that the FIM ignores all caching hints that apply
to DMA writes. Instead, the CPU is configured to handle all DMA writes as if the
WrPush_I caching hint is set. The observed LLC caching behavior is likely caused
by Intel’s Data Direct I/O (DDIO), which is enabled by default in Intel Xeon E5 v2
and E7 v2 CPUs. DDIO is meant to give peripherals direct access to the LLC and
thus causes the CPU to cache all memory lines written by the AFU. DDIO restricts
cache access to a subset of ways per cache set, which reduces the attack surface for
Prime+Probe attacks. Nonetheless, attacks against other DDIO-enabled peripherals
are possible [38, 52].

8.5 The JackHammer Attack

In this section, we present and evaluate a simple AFU for the Arria 10 GX FPGA
that performs Rowhammer against its host CPU’s DRAM as much as two times
faster and four times more effectively than its host CPU. In a Rowhammer attack,
a significant factor in the speed and efficacy is the rate at which memory can be
repeatedly accessed. On many systems, the CPU is sufficiently fast to cause some
bit flips, but an FPGA can repeatedly access its host machine’s memory system
substantially faster than the host machine’s CPU can. Both the CPU and FPGA share
access to the same memory controller, but the CPU must flush the memory after
each access to ensure that the next access reaches DRAM; memory reads from the
FPGA do not affect the CPU cache system, so no time is wasted flushing memory.
We measure the performance of CPU and FPGA Rowhammer implementations
with caching both enabled and disabled and find that disabling caching brings CPU
Rowhammer speed near that of our FPGA Rowhammer implementation. Crucially,
the architectural difference makes it much more difficult for a CPU program to
detect the presence of an FPGA Rowhammer attack than that of a CPU Rowhammer
attack—the FPGA’s memory accesses leave far fewer traces on the CPU.

8.5.1 JackHammer: An FPGA Implementation of Rowhammer

We now present our design for JackHammer, a Rowhammer AFU for the Arria
10 FPGA. JackHammer supports configuration through the MMIO interface. When

4 This was verified by an Intel employee in personal communication.


the JackHammer AFU is loaded, the CPU first sets the target physical addresses that
the AFU will repeatedly access. It is recommended that two addresses are set for a
double-sided attack, but if the second address is set to 0, JackHammer will perform
a single-sided attack using just the first address. The CPU must also set the number
of times the targeted addresses are accessed.
When the configuration is set, the CPU signals the AFU to repeat memory
accesses and issue them as fast as it can, alternating between addresses in a
double-sided attack. Unlike a software implementation of Rowhammer, the accessed
addresses do not need to be flushed from cache—DMA read requests from the
FPGA do not cache the line in the CPU cache, although if the requested value is in
the LLC, the value will be provided to the FPGA by the cache instead of by memory
(see Sect. 8.4.3 for more details on caching behavior). In this attack, the attacker
needs to ensure that the cache lines used for inducing bit flips are not accessed
by the CPU during the attack. The number of remaining accesses can be reread
by the CPU. This is the simplest way for software to check if the AFU has finished
sending these accesses. When the last read request has been sent by the AFU, the
total amount of time taken to send all of the requests is recorded.5

8.5.2 JackHammer on the FPGA PAC vs. CPU Rowhammer

Figure 8.3 shows a box plot of the 0th, 25th, 50th, 75th, and 100th percentile
of measured “hammering rates” on the Arria 10 FPGA PAC and its host i7-
3770 CPU. Each measurement in these distributions is the average hammering

[Figure 8.3: box plots of hammers per second (x10^7) for the Core i7-3770 (3392 MHz) and the Arria 10 PAC (200 MHz, 1 PCIe lane)]

Fig. 8.3 Box plots showing distributions of hammering rates (memory requests or “hammers” per
second) on FPGA PAC and i7-3770. The hammering rate of the FPGA is so consistent as to appear
as a single line. Red crosses indicate outliers

5 The time to send all the requests is not precisely the time to complete all the requests, but it is
very close for sufficiently high numbers of requests. The FPGA has a transaction buffer that holds
up to 64 transactions after they have been sent by the AFU. The buffer does take some time to clear,
but the additional time is negligible for our performance measurements of millions of requests.
[Figure 8.4: box plots of flips per second for the Core i7-3770 (3392 MHz) and the Arria 10 PAC (200 MHz, 1 PCIe lane)]

Fig. 8.4 Box plots of distributions of flip rates on FPGA PAC and i7-3770

rate over a run of 2 billion memory requests. The median hammering rate of our
JackHammer implementation is 81% faster than the median rate of the standard
CPU Rowhammer, and its speed is far more consistent than the CPU’s. The FPGA
can manage an average throughput of one memory request, or “hammer,” every ten
200 MHz FPGA clock cycles (finishing 2 billion hammers in an average of 103.25
seconds); the CPU averages one hammer every 311 3.4 GHz CPU clock cycles
(finishing 2 billion hammers in an average of 183.41 seconds). If our Arria 10 FPGA
were running at its intended frequency of 400 MHz, the hammering rate would still
be bottlenecked by memory speed, and JackHammer would not run significantly
faster.
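These figures are mutually consistent, which can be checked with back-of-the-envelope arithmetic; the measured times quoted in the text are slightly longer than the ideal times computed here because of setup and buffering overhead.

```python
# Consistency check of the quoted hammering rates. All inputs are the
# numbers given in the text.

FPGA_HZ = 200e6              # PAC clock
CPU_HZ = 3.4e9               # i7-3770 clock
FPGA_CYCLES_PER_HAMMER = 10  # one request every ten FPGA cycles
CPU_CYCLES_PER_HAMMER = 311  # one request every 311 CPU cycles
HAMMERS = 2e9                # requests per run

fpga_rate = FPGA_HZ / FPGA_CYCLES_PER_HAMMER  # hammers per second
cpu_rate = CPU_HZ / CPU_CYCLES_PER_HAMMER

fpga_seconds = HAMMERS / fpga_rate  # ideal: 100 s (measured: 103.25 s)
cpu_seconds = HAMMERS / cpu_rate    # ideal: ~183 s (measured: 183.41 s)
speedup = fpga_rate / cpu_rate      # ~1.83, i.e., roughly 81% faster
```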
Figure 8.4 shows measured bit flip rates in the victim row for the same
experiment. Runs where zero flips occurred during hardware or software hammering
were excluded from the flip rate distributions, as they are assumed to correspond
with sets of rows that are in the same logical bank, but not directly adjacent to
each other. The increased hammering speed of JackHammer produces a more than
proportional increase in flip rate, which is unsurprising due to the physical rather
than logical nature of Rowhammer faults. As the Rowhammer attack is underway,
electrical charge is drained from capacitors in the victim row. However, the memory
controller also periodically refreshes the charge in the capacitors. When there are
more memory accesses to adjacent rows within each refresh window, it is more
likely that a bit flip occurs before the next refresh. This is why the FPGA’s increased
memory throughput is so much more effective in conducting Rowhammer against
the same DRAM chip.
Another way to look at hammering performance is by counting the total number
of flips produced by a given number of hammers. Figures 8.5 and 8.6 show
distributions of flip counts after various numbers of hammers on the FPGA PAC
and i7 CPU, respectively. The graphs in these figures demonstrate how much more
effectively the FPGA PAC can generate bit flips in the DRAM after the same number
of memory accesses. For hammering attempts that resulted in a nonzero number of
bit flips, the AFU exhibits a wide distribution of flip counts in the range of 200
million to 800 million hammers which then rapidly narrows in the range of 800
million to 1.2 billion and finally levels out by 1.8 billion hammers. This set of
[Figure 8.5: box plots of total flips in flippy rows (top) and percentage of rows with flips (bottom), versus billions of hammers from 0.2 to 2]

Fig. 8.5 Distributions of total flips after 200 million to 2 billion hammers with JackHammer on
the Arria 10 FPGA PAC. In the upper graph, box plots show quartiles and outliers of flip counts
in flippy rows, that is, rows with nonzero flip counts. The bar graph in the lower axes shows the
portion of rows in the sample that incurred any flips. This is the same portion of rows represented
in the box plots

[Figure 8.6: box plots of total flips in flippy rows (top) and percentage of rows with flips (bottom), versus billions of hammers from 0.2 to 2]

Fig. 8.6 Distributions of total flips after 200 million to 2 billion hammers with software
Rowhammer on the i7-3770, following the same layout as Fig. 8.5

distributions indicates that “flippable” rows will ultimately reach about 80–120 total
flips after enough hammering, but it can take anywhere from 200 million hammers
(about 10 seconds) to 2 billion hammers (about 100 seconds) to reach that limit.
There are also several rows that only incur a few flips. These samples appear in
a consistent pattern as demonstrated in Fig. 8.7, which plots a portion of the data
used to create Fig. 8.5. Each impulse in this plot represents the number of flips after
a single run of 2 billion hammers on a particular target row. In Fig. 8.7, at indices
23 and 36, two of these outliers are visible, each appearing two indices after several
samples in the standard 80–120 flip range. These outliers could indicate rows that
are affected by hammering nearby rows that are not adjacent.
8 Microarchitectural Vulnerabilities of FPGA-CPU Platforms 217


Fig. 8.7 Time series plotting number of flips on a row-by-row basis, showing examples of the
consistent placement of small-valued but (unlike their immediate neighbors) nonzero outliers:
samples 23 and 36 on this graph. These rows only ever incur a few flips at most and always are
located two rows away from a block of “flippy” rows which incur dozens of flips. By contrast, rows
22 and 24–29, for example, incur no flips at all

8.5.3 JackHammer on the Integrated Arria 10 vs. CPU Rowhammer

The JackHammer AFU designed for the integrated platform is the same as the AFU
for the PAC, except that the integrated platform has access to more physical channels
for memory reads. The PAC has a single PCIe channel; the integrated platform has
one UPI channel and two PCIe channels, as well as an “automatic” setting that lets
the interface manager select a physical channel automatically. Therefore we present
the hammering rates on this platform with two different settings—alternating PCIe
lanes on each access and using the automatic setting.
However, the integrated platform is only available on Intel’s servers, so we have
only been able to test on one DRAM setup and have been unable to get bit flips in
the DRAM.6 The integrated Arria 10 shares the package with a modified Xeon v4-
style CPU. The available servers are equipped with an X99 series motherboard with
64 GB of DDR4 memory. Figure 8.8 shows distributions of measured hammering
rates on the integrated Arria 10 platform. Compared to the Arria 10 PAC, the
integrated Arria 10’s hammering rate is more varied, but with a similar mean rate.

6 There are several reasons why this could be the case. Some DRAM is simply more resistant to
Rowhammer by its physical nature. Error correcting code (ECC) memory is capable of reversing
some memory faults in real time. DDR4 memory, which can be found in this system, sometimes
has hardware features to block Rowhammer-style attacks [35]. It is impossible to say whether
the DRAM in this system has any particular defenses in place without access to the hardware
or BIOS. Some methods have been developed to circumvent these protections [15, 18], but for
this work we focus on DDR3, where flips are more reliable and the advantage of the FPGA is
easier to demonstrate.


Fig. 8.8 Distributions of hammering rates (memory requests or “hammers” per second) on
integrated Arria 10 and Xeon E5-2600 v4. While the Xeon’s hammering speed varies greatly and
can be faster than the Arria 10’s, the Arria 10 is more consistent and generally hammers faster

8.5.4 The Effect of Caching on Rowhammer Performance

We hypothesized that a primary reason for the difference in Rowhammer performance
between JackHammer on the FPGAs and a typical Rowhammer implementation
on the CPUs is that when one of the FPGAs reads a line of memory from
DRAM, it is not cached, so the next read will register as a cache miss and be
directed to the DRAM. When the CPUs access a line of memory, it is cached, and
the memory line must be flushed from the cache before the next read is issued, or
the next read will hit in the cache instead of in the DRAM.
To evaluate our hypothesis that caching is an important factor in the disparity we
observed between FPGA- and CPU-based Rowhammer performance, we used the
PTEditor [50] kernel module to set allocated pages as uncacheable before testing
hammering rates. We edited the setup of the Rowhammer performance tests to
allocate many 4 KiB pages and set them as uncacheable instead of one 2 MB huge
page, as the kernel module could not set 2 MB pages as uncacheable. The software
allocates thousands of 4 KiB pages, sorts them, and then finds the largest continuous
range within them and attempts to find colliding row addresses within that range.
The JackHammer AFU required no modifications from the initial performance tests;
the assembly code used by the CPU to hammer was edited to not flush memory after
reading it, since the memory was not cached in the first place.
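The page-sorting and range-finding step described above can be sketched as follows. This is an illustrative reconstruction, not the chapter's actual code; the function name and constants are our own:

```python
# Hypothetical sketch: given the physical addresses of many allocated 4 KiB
# pages, sort them and return the largest physically contiguous run, in which
# colliding row addresses can then be sought.
PAGE_SIZE = 4096

def largest_contiguous_range(phys_addrs):
    """Return (start_addr, length_in_pages) of the longest run of contiguous pages."""
    addrs = sorted(phys_addrs)
    best_start, best_len = addrs[0], 1
    run_start, run_len = addrs[0], 1
    for prev, cur in zip(addrs, addrs[1:]):
        if cur == prev + PAGE_SIZE:      # physically adjacent page: extend the run
            run_len += 1
        else:                            # gap: start a new run
            run_start, run_len = cur, 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return best_start, best_len
```

In the real setup the physical addresses would come from the kernel (e.g., via the page-table interface the chapter mentions); here they are just integers.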
We performed the experiment by placing the FPGA PAC in a Dell PowerEdge
R720 system with a Xeon E5-2670 v2 CPU fixed to a clock speed of 2500 MHz
and two 4 GB DIMMs of DDR3 DRAM clocked at 1600 MHz. Figure 8.9 shows
the performance of the FPGA PAC and this system’s CPU with caching enabled
and disabled. Disabling caching produces a significant speedup in hammering
for both the PAC and the CPU, but especially for the CPU, which saw a 188%
performance increase. With caching enabled, the median hammering rate of the
PAC was more than twice that of the CPU, but with caching disabled, the median
8 Microarchitectural Vulnerabilities of FPGA-CPU Platforms 219


Fig. 8.9 Distributions of hammering rates (memory requests or “hammers” per second) with
cacheable and uncacheable memory. Disabling caching significantly speeds up both the Arria 10
and Xeon Rowhammer implementations and brings the speed of the Xeon much closer to the speed
of the Arria 10

hammering rate of the PAC was only 22% faster than that of the CPU. Of course,
memory accesses on modern systems are extremely complex (even with caching
disabled), so there are likely additional factors affecting the difference in hammering
rate. However, our experimental evidence supports our hypothesis that time spent
flushing the cache slows down CPU Rowhammer implementations compared to
FPGA implementations.

8.6 Fault Attack on RSA Using JackHammer

Rowhammer has been used for fault injections on cryptographic schemes [6, 7]
and for privilege escalation [18, 51, 55]. Using JackHammer, we demonstrate a
practical fault injection attack from the Arria 10 FPGA to the WolfSSL RSA
implementation running on its host CPU. In the RSA fault injection attack proposed
by Boneh et al. [8], an intermediate value in the Chinese remainder theorem modular
exponentiation algorithm is faulted, causing an invalid signature to be produced.
Similarly, we attack the WolfSSL RSA implementation using JackHammer from
the FPGA PAC and Rowhammer from the host CPU and compare the efficiency of

Fig. 8.10 WolfSSL RSA fault injection attack. (1) The victim application initializes the RSA
key. (2) The malicious app prepares the memory shared with the AFU for hammering. (3) The
Rowhammer AFU hammers the shared memory while the victim is performing RSA. (4) The RSA
signature becomes faulty due to the induced bit flips in the key

the two attacks. The increased hammering speed and flip rate of the Arria 10 FPGA
make the attack more practical in the time frame of about 9 RSA signatures.
Figure 8.10 shows the high-level overview of our attack: the WolfSSL RSA
application runs on one core, while a malicious application runs adjacent to it,
assisting the JackHammer AFU on the FPGA in setting up the attack. JackHammer
causes a hardware fault in the main memory, and when the WolfSSL application
reads the faulty memory, it produces a faulty signature and leaks the private factors
used in the RSA scheme.

8.6.1 RSA Fault Injection Attacks

We implement a fault injection attack against the Chinese remainder theorem
(CRT) implementation of the RSA algorithm, commonly known as the Bellcore attack [8].
Algorithm 1 shows the CRT RSA signing scheme where the signature S is computed
by raising a message m to the private exponent dth power, modulo N. The exponents
d_p and d_q are precomputed as d mod (p − 1) and d mod (q − 1), where p and q are the prime
factors of N [4]. When one of the intermediates S_q or S_p is computed incorrectly,
an interesting case arises. Consider the difference between a correctly computed
signature S of a message m and an incorrectly computed signature S′ of the same
message, computed with an invalid intermediate S′_p. The difference S − S′ retains
a factor of q times the difference S_p − S′_p, so the greatest common divisor (GCD)
of S − S′ and N is the other factor, q [4]. This reduces the problem of factoring N
to a simple subtraction and a GCD operation, so the
private factors (p, q) are revealed if the attacker has just one valid signature and
one faulty signature, each signed on the same message m. These factors can also
be recovered with just one faulty signature if the message m and public key e are
known: the factor q is then also equal to the GCD of S′^e − m and N.
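The arithmetic behind the Bellcore attack can be checked at toy scale. The sketch below uses tiny illustrative parameters (not from the chapter) and a hypothetical sign_crt helper that injects an additive fault into the intermediate S_p:

```python
# Toy-scale sketch of the Bellcore fault attack: a single faulted CRT
# intermediate lets a GCD recover a prime factor of N. All parameters
# are illustrative, far too small for real RSA.
from math import gcd

p, q = 61, 53                 # private prime factors
N = p * q                     # public modulus (3233)
e = 17                        # public exponent
d = pow(e, -1, (p - 1) * (q - 1))
dp, dq = d % (p - 1), d % (q - 1)
q_inv = pow(q, -1, p)
m = 42                        # message (already hashed/padded in practice)

def sign_crt(m, fault_sp=0):
    """CRT signing; fault_sp != 0 corrupts the intermediate S_p."""
    sp = pow(m, dp, p) + fault_sp
    sq = pow(m, dq, q)
    return (sq + q * (((sp - sq) * q_inv) % p)) % N

S = sign_crt(m)                      # correct signature
S_faulty = sign_crt(m, fault_sp=1)   # one faulted intermediate
# The difference S - S' keeps a factor of q, so a single GCD recovers it:
assert gcd(S - S_faulty, N) == q
# With only the faulty signature and the public key: q = gcd(S'^e - m, N).
assert gcd(pow(S_faulty, e, N) - m, N) == q
```

Once q is known, the other factor is simply p = N // q.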

8.6.1.1 Fault Injection Attack with RSA Base Blinding

A common modification to any RSA scheme is the addition of base blinding, which is
effective against simple and differential power analysis side-channel attacks but
vulnerable to the correlational power analysis attack demonstrated in [57]. Base
blinding is used by default in our target WolfSSL RSA-CRT signature scheme. In
this blinding process, the message m is blinded by a randomly generated number r
by computing m_b = m · r^e mod n. The resulting signature S_b = (m · r^e)^d mod n =
m^d · r mod n must then be multiplied by the inverse of the random number r to
generate a valid signature S = S_b · r^{-1} mod n.
This blinding scheme does not prevent the Bellcore fault injection attack.
Consider a valid signature blinded with random factor r_1 and an invalid signature
blinded with r_2. When the faulty signature is subtracted from the valid one, the
blinded intermediates S_{q_b} are each unblinded and cancel as before, as
shown in the equation below:

\begin{aligned}
S - S' ={}& \Big(\big(S_{q_b} + q \cdot \big((S_{p_b} - S_{q_b}) \cdot q^{-1} \bmod p\big)\big) \cdot r_1^{-1} \bmod N\Big) \\
          &- \Big(\big(S_{q_b} + q \cdot \big((S'_{p_b} - S_{q_b}) \cdot q^{-1} \bmod p\big)\big) \cdot r_2^{-1} \bmod N\Big) \\
       ={}& q \cdot \Big(\big((S_{p_b} - S_{q_b}) \cdot q^{-1} \bmod p\big) \cdot r_1^{-1} - \big((S'_{p_b} - S_{q_b}) \cdot q^{-1} \bmod p\big) \cdot r_2^{-1}\Big) \bmod N
\end{aligned}
\tag{8.1}

Ultimately, there is still a factor of q in the difference S − S', which can be
extracted with a GCD as before.
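Equation (8.1) can likewise be checked at toy scale. The sketch below (again with illustrative tiny parameters and our own helper names, not the chapter's code) blinds the two signatures with different random factors and shows that the GCD still recovers a prime factor:

```python
# Toy-scale sketch: base blinding with fresh factors r1, r2 does not stop
# the Bellcore GCD attack. Parameters are illustrative only.
from math import gcd

p, q = 61, 53
N, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))
dp, dq = d % (p - 1), d % (q - 1)
q_inv = pow(q, -1, p)
m = 42

def sign_blinded(m, r, fault_sp=0):
    """Blinded CRT signing; fault_sp != 0 corrupts the intermediate S_{p_b}."""
    mb = (m * pow(r, e, N)) % N          # blind the message: m_b = m * r^e
    sp = pow(mb, dp, p) + fault_sp
    sq = pow(mb, dq, q)
    Sb = (sq + q * (((sp - sq) * q_inv) % p)) % N
    return (Sb * pow(r, -1, N)) % N      # unblind: S = S_b * r^-1

S = sign_blinded(m, r=7)                       # valid, blinded with r1 = 7
S_faulty = sign_blinded(m, r=11, fault_sp=1)   # faulty, blinded with r2 = 11
assert gcd(S - S_faulty, N) == q               # the factor q still leaks
```

The blinding factors are unblinded away in the difference, exactly as Eq. (8.1) predicts.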

8.6.2 Our Attack

We developed a simplified attack model to test the effectiveness of the Arria 10
Rowhammer in a fault injection scenario. Our model simplifies the setup of the
attack so that we can efficiently measure the performance of both CPU Rowhammer
and JackHammer. We sign the same message with the same key repeatedly while the
Rowhammer exploit runs and count the number of correct signatures until a faulty
signature is generated, which is used to leak the private RSA key.

8.6.2.1 Attack Setup

In summary, our simplified attack model works as follows: The attacker first
allocates a large block of memory and checks it for conflicting row addresses.
It then quickly tests which of those rows can be faulted with hammering using
JackHammer. A list of rows that incur flips is saved so that it can be iterated over.

The program then begins the “attack,” iterating through each row that incurred flips
during the test, and through the sixty-four 1024-bit offsets that make up the row.
During the attack, the JackHammer AFU is instructed to repeatedly access the rows
adjacent to the target row. Meanwhile, in the “victim” program, the targeted data
(the precomputed intermediate value d mod (q − 1)) is copied to the target address,
which is computed as an offset of the targeted row. The victim then enters a loop
where it reads back the data from the target row and uses it as part of an RSA key to
create a signature from a sample message. Additionally, the “attacker” opens a new
thread on the CPU which repeatedly flushes the target row on a given interval. It is
necessary for the attacker to flush the target row because the victim is repeatedly
reading the targeted data and placing it in cache, but the fault will only ever occur
in main memory. For the victim program to read the faulty data from DRAM, there
cannot be an unaffected copy of the same data in cache or the CPU will simply read
that copy. As we show below, the performance of the attack depends significantly
on the time interval between flushes.
One of the typical complications of a Rowhammer fault injection attack is
ensuring that the victim’s data is located in a row that can be hammered. In our
simplified model, we choose the location of the victim data manually within a row
that we have already determined to be one that incurs flips under a Rowhammer
attack so that we may easily test the effectiveness of the attack at various rows
and various offsets within the rows. In a real attack, the location of the victim
program’s memory can be controlled by the attacker with a technique known as
page spraying [51], which is simply allocating a large number of pages and then
deallocating a select few, filling the memory in an attempt to cause the victim
program to allocate the right pages. Improvements in this process can be made;
for example, [6] demonstrated how cache attacks can be used to gather information
about the physical addresses of data being used by the victim process.
The other simplification in our model is that we force the CPU to read from
DRAM using the clflush instruction to flush the targeted memory from cache.
In an end-to-end attack, the attacker would use an eviction set to evict the targeted
memory since it is not directly accessible in the attack process’s address space.
However, the effect is ultimately the same—the targeted data is forcibly removed
from the cache by the attacker.

8.6.3 Performance of the Attack

In this section, we show that our JackHammer implementation with optimal
settings can cause a faulty signature an average of 17% faster than a typical
CPU-based, software-driven Rowhammer implementation with optimal settings.
In some scenarios, the performance is as much as 4.8 times that of the software
implementation. However, under some conditions, the software implementation can
be more likely to cause a fault over a longer period of time. Our results indicate that

Table 8.2 Performance of our JackHammer exploit compared to a standard software CPU
Rowhammer with various eviction intervals. JackHammer is able to achieve better performance
in many cases because it bypasses the caching architecture, sending more memory requests
during the eviction interval and causing bit flips at a higher rate

                        Mean signatures to fault          Successful fault rate
Eviction interval (ms)  CPU   JackHammer  % Inc. speed    CPU   JackHammer  % Inc. rate
16                      280   186         51%             0.4%  0.2%        −46%
32                      627   219         185%            0.2%  0.8%        264%
48                      273   124         120%            14%   19%         39%
64                      81    76          7%              17%   26%         56%
96                      74    58          27%             46%   49%         8%
128                     73    70          4%              52%   50%         −1.2%
256                     106   115         −7%             57%   55%         −3%
Best performance        73    58          25%             57%   55%         −3%

increasing the DRAM row refresh rate provides significant but not complete defense
against both implementations.
The performance of this fault injection attack is highly dependent on the time
interval between evictions, and as such we present all of our results in this section
as functions of the eviction interval. Each eviction triggers a subsequent reload from
memory when the key is read for the next signature, which refreshes the capacitors
in the DRAM. Whenever DRAM capacitors are refreshed, any accumulated voltage
error in each capacitor (due to Rowhammer or any other physical effect) is either
solidified as a new faulty bit value or reset to a safe and correct value. Too short of
an interval between evictions will cause the DRAM capacitors to be refreshed too
quickly to be flipped with a high probability. On the other hand, however, longer
intervals can mean the attack is waiting to evict the memory for a longer time, while
a bit flip has already occurred. It is crucial to note, also, that DRAM capacitors
are automatically refreshed by the memory controller on a 64 ms interval7 [19].
On some systems, this interval is configurable: faster refresh rates reduce the rate
of memory errors, including those induced by Rowhammer, but they can impede
maximum performance because the memory spends more time doing maintenance
refreshes rather than serving read and write requests. For more discussion on
modifying row refresh rates as a defense against Rowhammer, see Sect. 8.8.
In Table 8.2 we present two metrics with which we compare JackHammer and
a standard CPU Rowhammer implementation. This table shows the mean number
of signatures until a faulty signature is produced and the ultimate probability of
success of an attack within 1000 signatures against a random key in a randomly
selected chunk of memory within a row known to be vulnerable to Rowhammer.
With an eviction interval of 96 ms, the JackHammer attack achieves the lowest

7 More specifically, DDR3 and DDR4 specifications indicate 64 ms as the maximum allowable
time between DRAM row refreshes.



Fig. 8.11 Mean number of signatures to fault at various eviction intervals

average number of signatures before a fault, at only 58, 25% faster than the best
performance of the CPU Rowhammer. The CPU attack is impeded significantly by
shorter eviction latency, while the JackHammer implementation is not, indicating
that on systems where the DRAM row refresh rate has been increased to protect
against memory faults and Rowhammer, JackHammer likely offers substantially
improved attack performance. Figure 8.11 highlights the mean number of signatures
until a faulty signature for the 16 ms to 96 ms range of eviction latency.

8.7 Cache Attacks on Intel FPGA-CPU Platforms

In Sect. 8.4.3.3, we reverse-engineered the behavior of the memory subsystem on
current Arria 10-based FPGA-CPU platforms. We learned that memory writes that
are performed by an AFU can evict cache lines from the LLC in the CPU. In this
section, we systematically analyze the viability of cache attacks if the attacker
controls an AFU and targets a victim on the CPU through the LLC. We show
that eviction-based attack vectors like E+T, E+R, and P+P are indeed possible
attacks that can be performed by an AFU-based attacker against a CPU-based target.
Additionally, we analyze the inverse case where an attacker can execute code on
the CPU and tries to target an AFU of the victim that is instantiated on an FPGA
in the system. Our results show that the coherence protocol can be exploited to
perform flush-based cache attacks against the AFU if the FPGA features an internal
coherent cache. We also discuss the viability of intra-FPGA cache attacks. Table 8.3
summarizes our findings.
To measure memory access latency on the FPGA, we designed a timer clocked
at 200 MHz on the FPGA PAC and 400 MHz in the integrated FPGA platform. The
hardware timer runs uninterruptible in parallel to other CPU or FPGA operations.
Therefore, the timer precisely counts FPGA clock cycles, while timers on the CPU,
such as rdtsc, may yield noisier measurements due to interruptions by the OS and
the CPU’s out-of-order pipeline.

Table 8.3 Summary of our cache attack analysis: OPAE accelerates eviction set construction by
making 2 MB huge pages and physical addresses available to user space

Attacker             Target      Channel  Attack
FPGA PAC AFU         CPU LLC     PCIe     E+T, E+R, P+P
Integrated FPGA AFU  CPU LLC     UPI      E+T, E+R, P+P
Integrated FPGA AFU  CPU LLC     PCIe     E+T, E+R, P+P
Integrated FPGA AFU  FPGA cache  CCI-P    E+T, E+R, P+P
CPU                  FPGA cache  UPI      F+R, F+F


Fig. 8.12 Latency histogram for one million PCIe read requests on an FPGA PAC served by the
CPU’s LLC or main memory. The two distinct peaks enable the FPGA to distinguish the two
memory locations

8.7.1 Cache Attacks from FPGA PAC to CPU

The Intel PAC has access to one PCIe lane that connects it to the main memory
of the system through the CPU’s LLC. The CCI-P documentation [28] mentions a
timing difference for memory requests served by the CPU’s LLC and those served
by the main memory. Using our timer we verified the differences as shown in
Fig. 8.12. Accesses to the LLC take between 139 and 145 cycles, and accesses to
main memory take between 148 and 158 cycles. These access latency distributions
form the basis of cache attacks, as they enable an attacker to tell which part of the
memory subsystem served a particular memory request. Our results indicate that
FPGA-based attackers can precisely distinguish memory responses served by the
LLC from those served by main memory.
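A minimal sketch of the resulting classification rule, using the latency windows just quoted (the helper name is ours, not the chapter's):

```python
# Illustrative helper: decide which level of the memory subsystem served a
# PAC read request from its latency in FPGA clock cycles (200 MHz timer).
# Thresholds are the windows measured in Fig. 8.12.
def classify_pac_read(cycles):
    if 139 <= cycles <= 145:
        return "LLC"
    if 148 <= cycles <= 158:
        return "DRAM"
    return "unknown"
```

In a real attack the AFU applies such a rule to each timed DMA read to decide whether a probed line was cached.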
In addition to probing, the cache state must be influenced to perform cache
attacks. We investigated cache interactions offered by the CCI-P interface on an
FPGA PAC and found that cache lines read by the AFU from the main memory
are not cached. While this behavior is not usable for cache attacks, it boosts
Rowhammer performance as we saw in Sect. 8.5. On the other hand, cache lines

written by an AFU on the PAC end up in the LLC with nearly 100% probability.
The reason for this behavior was discussed in Sect. 8.4.3 along with an analysis of
caching hints. This behavior can be used to evict other lines from the cache and
perform eviction-based attacks like Evict+Time, Evict+Reload, and Prime+Probe.
For E+T, DMA writes can be used to evict a cache line, while our hardware timer
measures the victim’s execution time. Even though an AFU cannot load data into
the LLC, E+R can be performed as the purpose of reloading a cache line is to learn
the reload latency. So the primitives for E+R on the FPGA are DMA writes and
timed DMA reads using a hardware timer. P+P can be performed using DMA writes
and timed reads. In the case where DDIO limits the number of accessible ways per
cache set, other DDIO-enabled peripherals are attackable. Flush-based attacks like
Flush+Reload or Flush+Flush cannot be performed by an AFU as CCI-P does not
offer a flush instruction.

8.7.2 Cache Attacks from Integrated Arria 10 FPGA to CPU

The integrated Arria 10 has two PCIe lanes and one UPI lane connecting it to
the CPU’s memory subsystem. It also has its own additional cache on the FPGA
accessible over UPI (recall Sect. 8.4.3 for further details).
By timing memory requests from the AFU using our hardware timer, we show
that distinct delays for the different levels of the memory subsystem exist. Both
PCIe lanes have delays similar to those measured on a PAC (cf. Fig. 8.12). Our
memory access latency measurements for the UPI lane, depicted in Fig. 8.13, show
an additional peak for requests answered by the FPGA’s local cache. The two peaks
for LLC and main memory accesses are likely narrower and further apart than in
the PCIe case because UPI, Intel’s proprietary high-speed processor interconnect, is


Fig. 8.13 Latency histogram for one thousand UPI read requests on an integrated Arria 10 served
by the FPGA’s local cache, CPU’s LLC, or main memory. All peaks are distinct which allows the
FPGA to identify the memory location that served the corresponding read request

an on-chip and inter-CPU bus only connecting CPUs and FPGAs. On all interfaces,
read requests, again, are not usable for evicting cache lines from the LLC. DMA
writes, however, can be used to alter the LLC on the CPU. Because the UPI and
PCIe lanes behave much like the PCIe lane on a PAC, we state the same attack
scenarios (E+T, E+R, P+P) are viable on the integrated Arria 10.

8.7.2.1 Constructing a Covert Channel from AFU to CPU

Since an AFU can place data in at least one way per LLC slice, it is possible to
construct a covert channel from the AFU to a co-operating process on the CPU using
side effects of the LLC. To do so, we designed an AFU that writes a fixed string to a
pre-configured cache line whenever a “1” is transmitted and does nothing whenever
a “0” is sent. Using this technique, the AFU sends messages which can be read by
the CPU. For the rest of this section, we will refer to the address to which the AFU
writes as the target address.
The receiver process8 first constructs an eviction set for the set/slice-pair of
the target address. To find an eviction set, we run a slightly modified version of
Algorithm 1 using Test 1 in [56]. Using the OPAE API to allocate 2 MB huge pages
and obtain physical addresses (cf. Sect. 8.4.2.2) allows us to construct the eviction
set from a rather small set of candidate addresses all belonging to the same set.
We construct the covert channel on the integrated platform as the LLC of the
CPU is inclusive. Additionally, the receiver has access to the target address via
shared memory and can test its eviction set against the target address
directly. This way, we do not need to explicitly identify the target address’s
LLC slice. In a real-world scenario, either the slice selection function has to be
known [24, 25, 33] or eviction sets for all slices have to be constructed by seeking
conflicting addresses [41, 45]. The time penalty introduced by monitoring all cache
sets can be prevented by multi-threading.
Next, the receiver primes the LLC with the identified eviction set and probes the
set in an endless loop. Whenever the execution time of a probe is above a certain
threshold, the receiver assumes that the eviction of one of its eviction set addresses
was the result of the AFU writing to the target address and therefore interprets
this as receiving a “1.” If the probe execution time stays below the threshold, a
“0” is detected as no eviction of the eviction set addresses occurred. An example
measurement of the receiver and its decoding steps are depicted in Fig. 8.14.
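The receiver's thresholding and decoding logic can be sketched as follows; the threshold value and the majority rule are illustrative assumptions, not the exact parameters of our implementation:

```python
# Sketch of the covert-channel receiver: one probe-time measurement per
# Prime+Probe round; slow probes indicate the AFU evicted part of the
# eviction set (a transmitted "1"). The AFU sends each bit three times and
# the receiver takes six probes per bit, so we decode one bit per window.
THRESHOLD = 1900   # CPU cycles; illustrative cut-off for a "slow" probe

def decode(probe_times, probes_per_bit=6):
    bits = []
    for i in range(0, len(probe_times), probes_per_bit):
        window = probe_times[i:i + probes_per_bit]
        # a "1" was sent if enough probes in the window ran slow
        bits.append(1 if sum(t > THRESHOLD for t in window) >= 2 else 0)
    return bits
```

Reducing the redundancy (fewer probes per transmitted bit) is the most direct way to raise the channel's bandwidth, as discussed below.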
To ease the decoding and visualization of the results, the AFU sends every
bit thrice and the CPU uses six probes to detect all three repetitions. This high
level of redundancy comes at the expense of speed, as we achieve a bandwidth of
about 94.98 kBit/s, which is low when compared to other work [41, 42, 58]. The
throughput can be increased by reducing the three redundant writes per bit from
the AFU as well as by increasing the transmission frequency further to reduce

8 This process is not the software process directly communicating with the AFU over OPAE/CCI-P.

Fig. 8.14 Covert channel measurements and decoding. The AFU sends each bit three times, which
results in three peaks at the receiver if a “1” is transmitted

the redundant CPU probes per AFU write. Also, multiple cache sets can be used
in parallel to encode several bits at once. The synchronization problem can be
solved by using one cache set as the clock, where the AFU writes an alternating
bit pattern [52]. An average probe on the CPU takes 1855 clock cycles. The CPU
operating in the range of 2.8–3.4 GHz results in a throughput of 1.5–1.8 MBit/s.
The AFU can on average send one write request every 10 clock cycles without
filling the CCI-P PCIe buffer and thereby losing the write pattern. In theory, this
makes the AFU capable of sending 40 MBit/s over the covert channel when clocked
at 400 MHz.9
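The throughput figures above follow from simple arithmetic, which can be checked directly (a sketch of the calculation, not measured code):

```python
# Consistency check of the quoted throughput bounds. All input figures are
# from the text; the arithmetic only verifies they agree.
probe_cycles = 1855                      # average P+P probe duration on the CPU
cpu_hz_low, cpu_hz_high = 2.8e9, 3.4e9   # CPU operating range
cpu_bps_low = cpu_hz_low / probe_cycles      # one bit decision per probe
cpu_bps_high = cpu_hz_high / probe_cycles
assert 1.5e6 <= cpu_bps_low < cpu_bps_high < 1.9e6   # ~1.5-1.8 MBit/s

afu_hz = 400e6          # AFU clock on the integrated platform
cycles_per_write = 10   # one write request per 10 cycles without filling the buffer
afu_bps = afu_hz / cycles_per_write
assert afu_bps == 40e6  # 40 MBit/s theoretical AFU-side limit
```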
Even though caching hints for memory writes are being ignored by the FIU, an
AFU can place data in the LLC because the CPU is configured to handle write
requests as if WrPush_I is set, allowing for evictions in the LLC. We corroborated
our findings by establishing a covert channel between the AFU and the CPU with
a bandwidth of 94.98 kBit/s. By exposing physical addresses to the user and by
enabling 2 MB huge pages, OPAE further eases eviction set determination from user
space.

8.7.3 Cache Attacks from CPU to Integrated Arria 10 FPGA

We also investigated the CPU’s capabilities to run cache attacks against the coherent
cache on the integrated Arria 10 FPGA. First, we measured the memory access
latency depending on the location of the address accessed using the rdtsc

9 This is a worst-case scenario where every transmitted bit is a '1'-bit. For a random message, this
estimation increases as '0'-bits do not fill the buffer, allowing for faster transmission.


Fig. 8.15 Memory access latency histogram on the CPU with data being present in FPGA local
cache, CPU LLC, or main memory. All memory locations show unique access latency. To our
surprise, responses from the FPGA cache are slower than those originating from the main memory

instruction. The results in Fig. 8.15 show that the CPU can clearly distinguish where
an accessed address is located. Therefore, the CPU is capable of probing a memory
address that may or may not be present in the local FPGA cache. It is interesting
to note that requests to main memory return faster than those going to the FPGA
cache. This can be explained by the much slower clock speed of the FPGA running
at 400 MHz while the CPU operates at 1.2–3.4 GHz. As nearly all known cache
attack techniques rely on some form of probing, the capability to distinguish data
location is a good step in the direction of having a fully working cache attack that
originates from the CPU and targets the FPGA cache.
Besides probing the FPGA cache, we also need a way of flushing, priming, or
evicting cache lines to put the FPGA cache into a known state. While the AFU can
control which data is cached locally by using caching hints, there is no such option
documented for the CPU. Therefore, priming the FPGA cache to evict cache lines
is not possible. This disables all eviction-based cache attacks. However, as the CPU
has a clflush instruction, we can use it to flush cache lines from the FPGA cache,
because it is coherent with the LLC. Hence, we can flush and probe cache lines
located in the FPGA cache. This enables us to run a Flush+Reload attack against the
victim AFU where the addresses used by the AFU get flushed before the execution
of the AFU. After the execution, the attacker then probes all previously flushed
addresses to learn which addresses were used during the AFU execution. Another
possible cache attack is the more efficient Flush+Flush attack. We expect this attack
to be more precise as flushing a cache line that is present in the FPGA cache takes
about 500 CPU clock cycles longer than flushing a cache line that is not present (cf.
Fig. 8.16), while the latency difference between memory and FPGA cache accesses
adds up to only about 50–70 CPU clock cycles.
In general, the applicability of F+R and F+F is limited to shared memory
scenarios. For example, two users on the same CPU might share an instantiation
of a library that uses an AFU for acceleration of a process that should remain
private, like training a machine learning model with confidential data or performing
cryptographic operations.


Fig. 8.16 The flush execution time on the CPU with the flushed address being absent or present
in the FPGA cache. The two peaks are clearly distinct

8.7.4 Intra-FPGA Cache Side Channels

If FPGAs support simultaneous multi-tenancy, that is, the capability to place two
AFUs from different users on the same FPGA at the same time, the possibility of
intra-FPGA cache attacks arises. As the cache on the integrated Arria 10 is
direct-mapped and only 128 KiB in size, finding eviction sets becomes trivial when
the attacker AFU is given access to 2 MB huge pages. As this is the default behavior of the
OPAE driver when allocating more than one memory page at once, we assume that
it is straightforward to run eviction-based attacks like Evict+Time or Prime+Probe
against a neighboring AFU to, e.g., extract information about a machine learning
model. Flush-based attacks would still be impossible due to the lack of a flush
instruction in CCI-P.
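To illustrate why a direct-mapped cache plus huge pages makes eviction-set construction trivial, consider the following sketch (assuming the 128 KiB cache size from above and physically contiguous 2 MiB huge pages; the concrete offsets are illustrative): every address in the huge page that is congruent to the target modulo the cache size is an alias, yielding 16 guaranteed eviction-set members.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_BYTES (128 * 1024)     /* direct-mapped FPGA cache size */
#define HUGE_PAGE   (2 * 1024 * 1024)

/* Collect the offsets inside one 2 MiB huge page that map to the same
 * line of a direct-mapped 128 KiB cache as `target_offset`: every
 * address congruent to the target modulo the cache size is an alias.
 * Returns the number of eviction-set members written to out[]. */
static size_t eviction_set_offsets(size_t target_offset,
                                   size_t out[], size_t max)
{
    size_t n = 0;
    size_t base = target_offset % CACHE_BYTES;
    for (size_t off = base; off < HUGE_PAGE && n < max; off += CACHE_BYTES)
        out[n++] = off;
    return n;
}
```

An attacker AFU that reads these 16 offsets is guaranteed to evict the target line, which is all that Prime+Probe or Evict+Time requires.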

8.8 Countermeasures

Cloud service providers can choose from a variety of countermeasures to limit
the attack surface of their systems. These countermeasures range from monitoring
systems via Hardware Performance Counters (HPCs) to changing physical system
properties like the DRAM row refresh rate to detect or mitigate Rowhammer attack
attempts. Hypervisors can make use of cache partitioning and pinning, disable 2 MB
huge pages, and virtualize the AFU address space to limit the cache attack surface.
We discuss these countermeasures in detail in the following.

8.8.1 Hardware Monitors

Microarchitectural attacks against CPUs leave traces in HPCs such as cache hit
and miss counters. Previous works have paired these HPCs with machine learning
8 Microarchitectural Vulnerabilities of FPGA-CPU Platforms 231

techniques to build real-time detectors for these attacks [9, 12, 22, 63]. In some
cases, CPU HPCs may be able to trace incoming attacks from FPGAs. While
HPCs do not exist in the same form on the Arria 10 GX platforms, they could be
implemented by the FIM. A system combining FPGA and CPU HPCs could provide
monitoring of the FPGA-CPU interface.

8.8.2 Increasing DRAM Row Refresh Rate

An approach to reduce the impact of Rowhammer is increasing the DRAM refresh
rate. DDR3 and DDR4 specifications require that each row is refreshed at least every
64 ms, but many systems can be configured to refresh each row every 32 or 16 ms for
better memory stability. When we evaluated our fault injection attack in Sect. 8.6,
we measured its performance with varying intervals between evictions of the
targeted data, simulating equivalent row refresh intervals, since each eviction
causes a subsequent row refresh when the memory is read by the victim program.
Table 8.2 shows that under 1% of Rowhammer attempts from both the CPU and
the FPGA were successful with an eviction interval of 32 ms,
compared to 14% of CPU attacks and 26% of FPGA attacks with an interval of
64 ms, suggesting that increasing the row refresh rate would significantly impede
even the more powerful FPGA Rowhammer.

8.8.3 Cache Partitioning and Pinning

Several cache partitioning mechanisms have been proposed to protect CPUs against
cache attacks. While some are implementable in software [36, 37, 62, 65], others
require hardware support [16, 17, 40]. When trying to protect FPGA caches against
cache attacks, hardware-based approaches should receive special consideration. For
example, the FIM could partition the FPGA’s cache into several security domains,
such that each AFU can only use a subset of the cache lines in the local cache.
Another approach would be to introduce an additional flag in the CCI-P interface
that tells the local caching agent which cache lines to pin in the cache.

8.8.4 Disabling Huge Pages and Virtualizing AFU Address Space

Intel is aware that making physical addresses available to user space through OPAE
has negative security consequences [27]. In addition to exposing physical addresses,
OPAE makes heavy use of huge pages to ensure physical address continuity of

buffers shared with the AFU. However, it is well known that disabling huge pages
raises the bar for finding eviction sets [3, 41], which in turn makes cache
attacks and Rowhammer more difficult. We suggest disabling OPAE’s use of huge
pages. To do so, the AFU address space has to be virtualized independently of the
presence of virtualized environments.

8.8.5 Protection Against Bellcore Attack

Defenses against fault injection attacks proposed in the original Bellcore whitepa-
per [8] include verifying the signature before releasing it, and random padding of
the message before signing, which ensures that no unique message is ever signed
twice and that the exact plaintext cannot be easily determined. OpenSSL protects
against the Bellcore attack by verifying the signature against its plaintext with
the public key and, if verification fails, recomputing the result with a slower but
safer single exponentiation instead of the CRT [11].
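The verify-before-release logic can be sketched with textbook-sized RSA parameters (a toy illustration only; p = 61, q = 53, e = 17, d = 2753 are the classic textbook values, and real implementations use big-number, constant-time arithmetic):

```c
#include <stdint.h>

/* Square-and-multiply modular exponentiation on toy-sized integers. */
static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    b %= m;
    while (e) {
        if (e & 1) r = (r * b) % m;
        b = (b * b) % m;
        e >>= 1;
    }
    return r;
}

/* Textbook RSA: p = 61, q = 53, n = 3233, e = 17, d = 2753. */
#define P 61u
#define Q 53u
#define N 3233u
#define E 17u
#define D 2753u
#define DP 53u   /* d mod (p-1) */
#define DQ 49u   /* d mod (q-1) */
#define QINV 38u /* q^-1 mod p */

/* CRT signing with the Bellcore countermeasure: verify s^e == m (mod n)
 * before releasing s; on mismatch (e.g., a fault in one CRT half), fall
 * back to a single full exponentiation instead of leaking the faulty
 * value. `fault` flips a bit in the mod-p half to simulate a fault. */
static uint64_t sign_crt_checked(uint64_t m, int fault)
{
    uint64_t s1 = modpow(m, DP, P);          /* m^d mod p */
    uint64_t s2 = modpow(m, DQ, Q);          /* m^d mod q */
    if (fault) s1 ^= 1;                      /* injected fault */
    uint64_t h = (QINV * ((s1 + P - s2 % P) % P)) % P;
    uint64_t s = s2 + h * Q;                 /* CRT recombination */
    if (modpow(s, E, N) != m % N)            /* Bellcore check */
        s = modpow(m, D, N);                 /* slow but safe path */
    return s;
}
```

With the check in place, a fault in one CRT half no longer yields an exploitable faulty signature: the faulty path returns the same correct value as the fault-free path.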

8.9 Conclusion

In this work, we show that modern FPGA-CPU hybrid systems can be more
vulnerable to well-known hardware attacks that are traditionally seen on CPU-only
systems. We show that the shared cache systems of the Arria 10 GX and its host
CPU present possible CPU to FPGA, FPGA to CPU, and FPGA to FPGA attack
vectors. For Rowhammer, we show that the Arria 10 GX is capable of causing
more DRAM faults in less time than modern CPUs. Our research indicates that
hardware side-channel defenses are just as essential for modern FPGA systems
as they are for modern CPUs. Of course, the security of any device physically
installed in a system, like a network card or graphics card, is important, but FPGAs
present additional security challenges due to their inherently flexible nature. With
FPGAs, new hardware caches and buffers become accessible to users, which opens
a new area of possible side channels. Tiemann et al. [53] showed that FPGAs can
be used to observe the IOTLB and therefore derive information about the state of
neighboring peripherals. From a security perspective, a user-configurable FPGA on
a cloud system needs to be treated with at least as much care and caution as a user-
controlled CPU thread, as it can exploit many of the same vulnerabilities.

Acknowledgments We would like to extend special thanks to Alpa Trivedi and Evan Custodio,
without whom this research would not have been possible. We also thank Daniel Moghimi, Sayak
Ray, and Thomas Unterluggauer for their indispensable advice and insights. Research included
in this chapter was funded in part by Intel, National Science Foundation (NSF) grants CNS
1814406 and CNS 2026913, German Research Foundation (DFG) grant 456967092, German
Federal Ministry of Education and Research (BMBF) grant VE-Jupiter, and the Qatar National
Research Fund.

References

1. Alibaba Cloud. (2019). FPGA-based compute-optimized instance families. https://www.alibabacloud.com/help/doc-detail/108504.html. Accessed 2023-05-23.
2. Amazon Web Services. (2017). Amazon EC2 F1 instances. https://aws.amazon.com/ec2/
instance-types/f1/. Accessed 2023-05-23.
3. Apecechea, G. I., Eisenbarth, T., & Sunar, B. (2015). S$A: A shared cache attack that works
across cores and defies VM sandboxing—and its application to AES. In 2015 IEEE symposium
on security and privacy, SP 2015, San Jose, CA, USA, May 17–21, 2015 (pp. 591–604). IEEE
Computer Society. https://doi.org/10.1109/SP.2015.42.
4. Aumüller, C., Bier, P., Fischer, W., Hofreiter, P., & Seifert, J. (2002). Fault attacks on
RSA with CRT: Concrete results and practical countermeasures. In: B. S. K. Jr., Ç. K.
Koç, & C. Paar (Eds.), Cryptographic hardware and embedded systems—CHES 2002, 4th
international workshop, Redwood Shores, CA, USA, August 13–15, 2002. Revised Papers,
Lecture Notes in Computer Science (Vol. 2523, pp. 260–275). Springer. https://doi.org/10.
1007/3-540-36400-5_20.
5. Benger, N., van de Pol, J., Smart, N. P., & Yarom, Y. (2014). “Ooh Aah... Just a Little
Bit” : A small amount of side channel can go a long way. In: L. Batina, & M. Robshaw
(Eds.), Proceedings of the Cryptographic hardware and embedded systems—CHES 2014—
16th International Workshop, Busan, South Korea, September 23–26, 2014. Lecture Notes in
Computer Science (Vol. 8731, pp. 75–92). Springer. https://doi.org/10.1007/978-3-662-44709-
3_5.
6. Bhattacharya, S., & Mukhopadhyay, D. (2016). Curious case of Rowhammer: Flipping secret
exponent bits using timing analysis. In B. Gierlichs, & A. Y. Poschmann (Eds.), Proceedings
of the Cryptographic hardware and embedded systems—CHES 2016—18th international
conference, Santa Barbara, CA, USA, August 17–19, 2016. Lecture notes in computer science
(Vol. 9813, pp. 602–624). Springer. https://doi.org/10.1007/978-3-662-53140-2_29.
7. Bhattacharya, S., & Mukhopadhyay, D. (2018). Advanced fault attacks in software: Exploiting
the Rowhammer bug. In S. Patranabis, & D. Mukhopadhyay (Eds.), Fault tolerant architectures
for cryptography and hardware security, computer architecture and design methodologies (pp.
111–135). Springer Singapore. https://doi.org/10.1007/978-981-10-1387-4_6.
8. Boneh, D., DeMillo, R. A., & Lipton, R. J. (1997). On the importance of checking crypto-
graphic protocols for faults (extended abstract). In W. Fumy (Ed.), Proceeding of the advances
in Cryptology—EUROCRYPT ’97, international conference on the theory and application of
cryptographic techniques, Konstanz, Germany, May 11–15, 1997. Lecture notes in computer
science (Vol. 1233, pp. 37–51). Springer. https://doi.org/10.1007/3-540-69053-0_4.
9. Briongos, S., Irazoqui, G., Malagón, P., & Eisenbarth, T. (2018). CacheShield: Detecting cache
attacks through self-observation. In Z. Zhao, G. Ahn, R. Krishnan, & G. Ghinita (Eds.),
Proceedings of the eighth ACM conference on data and application security and privacy,
CODASPY 2018, Tempe, AZ, USA, March 19–21, 2018 (pp. 224–235). ACM. https://doi.org/
10.1145/3176258.3176320.
10. Brumley, B. B., & Hakala, R. M. (2009). Cache-timing template attacks. In: M. Matsui (Ed.),
Proceedings of the advances in cryptology—ASIACRYPT 2009, 15th international conference
on the theory and application of cryptology and information security, Tokyo, Japan, December
6–10, 2009. Lecture notes in computer science (Vol. 5912, pp. 667–684). Springer. https://doi.
org/10.1007/978-3-642-10366-7_39.
11. Carré, S., Desjardins, M., Facon, A., & Guilley, S. (2018). OpenSSL Bellcore’s protection helps
fault attack. In M. Novotný, N. Konofaos, & A. Skavhaug (Eds.), 21st Euromicro conference on
digital system design, DSD 2018, Prague, Czech Republic, August 29–31, 2018 (pp. 500–507).
IEEE Computer Society. https://doi.org/10.1109/DSD.2018.00089.
12. Chiappetta, M., Savas, E., & Yilmaz, C. (2016). Real time detection of cache-based side-
channel attacks using hardware performance counters. Applied Soft Computing, 49, 1162–
1174. https://doi.org/10.1016/j.asoc.2016.09.014.

13. Faict, T., D’Hollander, E. H., & Goossens, B. (2019). Mapping a guided image filter on the
HARP reconfigurable architecture using OpenCL. Algorithms, 12(8), 149. https://doi.org/10.
3390/a12080149.
14. Frigo, P., Giuffrida, C., Bos, H., & Razavi, K. (2018). Grand Pwning Unit: Accelerating
microarchitectural attacks with the GPU. In Proceedings of the 2018 IEEE symposium on
security and privacy, SP 2018, 21–23 May 2018, San Francisco, California, USA (pp. 195–
210). IEEE Computer Society. https://doi.org/10.1109/SP.2018.00022.
15. Frigo, P., Vannacci, E., Hassan, H., van der Veen, V., Mutlu, O., Giuffrida, C., Bos, H., &
Razavi, K. (2020). TRRespass: Exploiting the many sides of target row refresh. In 2020 IEEE
symposium on security and privacy, SP 2020, San Francisco, CA, USA, May 18–21, 2020 (pp.
747–762). IEEE. https://doi.org/10.1109/SP40000.2020.00090.
16. Green, M., Lima, L. R., Zankl, A., Irazoqui, G., Heyszl, J., & Eisenbarth, T. (2017). AutoLock:
Why cache attacks on ARM are harder than you think. In E. Kirda, & T. Ristenpart (Eds.),
26th USENIX security symposium, USENIX security 2017, Vancouver, BC, Canada, August
16–18, 2017 (pp. 1075–1091). USENIX Association. https://www.usenix.org/conference/
usenixsecurity17/technical-sessions/presentation/green.
17. Gruss, D., Lettner, J., Schuster, F., Ohrimenko, O., Haller, I., & Costa, M. (2017). Strong and
efficient cache side-channel protection using hardware transactional memory. In E. Kirda, &
T. Ristenpart (Eds.), 26th USENIX security symposium, USENIX security 2017, Vancouver,
BC, Canada, August 16–18, 2017 (pp. 217–233). USENIX Association. https://www.usenix.
org/conference/usenixsecurity17/technical-sessions/presentation/gruss.
18. Gruss, D., Lipp, M., Schwarz, M., Genkin, D., Juffinger, J., O’Connell, S., Schoechl, W., &
Yarom, Y. (2018). Another flip in the wall of Rowhammer defenses. In Proceedings of the
2018 IEEE symposium on security and privacy, SP 2018, 21–23 May 2018, San Francisco,
California, USA (pp. 245–261). IEEE Computer Society. https://doi.org/10.1109/SP.2018.
00031.
19. Gruss, D., Maurice, C., & Mangard, S. (2016). Rowhammer.js: A remote software-induced
fault attack in JavaScript. In J. Caballero, U. Zurutuza, & R. J. Rodríguez (Eds.), Proceedings
of the detection of intrusions and malware, and vulnerability assessment—13th international
conference, DIMVA 2016, San Sebastián, Spain, July 7–8, 2016. Lecture notes in computer
science (vol. 9721, pp. 300–321). Springer. https://doi.org/10.1007/978-3-319-40667-1_15.
20. Gruss, D., Maurice, C., Wagner, K., & Mangard, S. (2016). Flush+Flush: A fast and
stealthy cache attack. In J. Caballero, U. Zurutuza, & R. J. Rodríguez (Eds.), Proceedings
of the detection of intrusions and malware, and vulnerability assessment—13th international
conference, DIMVA 2016, San Sebastián, Spain, July 7–8, 2016. Lecture notes in computer
science (Vol. 9721, pp. 279–299). Springer. https://doi.org/10.1007/978-3-319-40667-1_14.
21. Gülmezoglu, B., Eisenbarth, T., & Sunar, B. (2017). Cache-based application detection in
the cloud using machine learning. In R. Karri, O. Sinanoglu, A. Sadeghi, & X. Yi (Eds.),
Proceedings of the 2017 ACM Asia conference on computer and communications security,
AsiaCCS 2017, Abu Dhabi, United Arab Emirates, April 02–06, 2017 (pp. 288–300. ACM).
https://doi.org/10.1145/3052973.3053036.
22. Gülmezoglu, B., Moghimi, A., Eisenbarth, T., & Sunar, B. (2019). FortuneTeller: Predicting
microarchitectural attacks via unsupervised deep learning. CoRR abs/1907.03651. http://arxiv.
org/abs/1907.03651.
23. Gülmezoglu, B., Zankl, A., Eisenbarth, T., & Sunar, B. (2017). PerfWeb: How to violate web
privacy with hardware performance events. In S. N. Foley, D. Gollmann, & E. Snekkenes
(Eds.), Computer security—ESORICS 2017—22nd European symposium on research in
computer security, Oslo, Norway, September 11–15, 2017, Proceedings, Part II. Lecture notes
in computer science (Vol. 10493, pp. 80–97). Springer. https://doi.org/10.1007/978-3-319-
66399-9_5.
24. Hund, R., Willems, C., & Holz, T. (2013). Practical timing side channel attacks against kernel
space ASLR. In 20th Annual network and distributed system security symposium, NDSS 2013,
San Diego, California, USA, February 24–27, 2013. The Internet Society. https://www.ndss-
symposium.org/ndss2013/practical-timing-side-channel-attacks-against-kernel-space-aslr.

25. Inci, M. S., Gülmezoglu, B., Apecechea, G. I., Eisenbarth, T., & Sunar, B. (2015). Seriously,
get off my cloud! cross-VM RSA key recovery in a public cloud. IACR Cryptol. ePrint Arch.
(p. 898). http://eprint.iacr.org/2015/898.
26. Inci, M. S., Gülmezoglu, B., Irazoqui, G., Eisenbarth, T., & Sunar, B. (2016). Cache attacks
enable bulk key recovery on the cloud. In: B. Gierlichs, & A. Y. Poschmann (Eds.). Proceedings
of the Cryptographic hardware and embedded systems—CHES 2016—18th international
conference, Santa Barbara, CA, USA, August 17–19, 2016. Lecture notes in computer science
(Vol. 9813, pp. 368–388). Springer. https://doi.org/10.1007/978-3-662-53140-2_18.
27. Intel (2017). Open programmable acceleration engine (1.1.2 ed.). Accessed 2023-05-23.
28. Intel (2018). Acceleration Stack for Intel Xeon CPU with FPGAs Core Cache Interface (CCI-
P) Reference Manual (1.2 ed.).
29. Intel (2020). Intel programmable acceleration card (PAC) with Intel Arria 10 GX
FPGA data sheet. https://www.intel.com/content/www/us/en/docs/programmable/683226/
current/introduction-rush-creek.html. Accessed 2023-05-22.
30. Intel (2022). Intel Virtualization Technology for Directed I/O. Rev. 4.0.
31. Intel Labs (2021). FPGA accelerators. https://wiki.intel-research.net/FPGA.html#fpga-
system-classes. Accessed 2023-05-22.
32. Intel Labs (2021). IL academic compute environment documentation. https://wiki.intel-
research.net/. Accessed 2023-05-22.
33. Irazoqui, G., Eisenbarth, T., & Sunar, B. (2015). Systematic reverse engineering of cache slice
selection in intel processors. In 2015 Euromicro conference on digital system design, DSD
2015, Madeira, Portugal, August 26–28, 2015 (pp. 629–636). IEEE Computer Society. https://
doi.org/10.1109/DSD.2015.56.
34. Irazoqui, G., Eisenbarth, T., & Sunar, B. (2016). Cross processor cache attacks. In X. Chen,
X. Wang, & X. Huang (Eds.), Proceedings of the 11th ACM Asia conference on computer and
communications security, AsiaCCS 2016, Xi’an, China, May 30–June 3, 2016 (pp. 353–364).
ACM. https://doi.org/10.1145/2897845.2897867.
35. JC-42.6 Low Power Memories Committee (2017). Low Power Double Data Rate 4 (LPDDR4).
In Standard JESD209-4B, JEDEC solid state technology association.
36. Kim, T., Peinado, M., Mainar-Ruiz, G. (2012). STEALTHMEM: system-level protection
against cache-based side channel attacks in the cloud. In T. Kohno (Ed.), Proceedings of
the 21st USENIX security symposium, Bellevue, WA, USA, August 8–10, 2012 (pp. 189–
204). USENIX Association. https://www.usenix.org/conference/usenixsecurity12/technical-
sessions/presentation/kim.
37. Kiriansky, V., Lebedev, I. A., Amarasinghe, S. P., Devadas, S., & Emer, J. S. (2018). DAWG:
A defense against cache timing attacks in speculative execution processors. In 51st Annual
IEEE/ACM international symposium on microarchitecture, MICRO 2018, Fukuoka, Japan,
October 20–24, 2018 (pp. 974–987). IEEE Computer Society. https://doi.org/10.1109/MICRO.
2018.00083.
38. Kurth, M., Gras, B., Andriesse, D., Giuffrida, C., Bos, H., & Razavi, K. (2020). NetCAT:
Practical cache attacks from the network. In 2020 IEEE symposium on security and privacy,
SP 2020, San Francisco, CA, USA, May 18–21, 2020, pp. 20–38. IEEE. https://doi.org/10.
1109/SP40000.2020.00082.
39. Lipp, M., Gruss, D., Spreitzer, R., Maurice, C., Mangard, S.: ARMageddon: Cache attacks on
mobile devices. In T. Holz, & S. Savage (Eds.), 25th USENIX security symposium, USENIX
security 16, Austin, TX, USA, August 10–12, 2016, pp. 549–564. USENIX Association (2016).
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/lipp.
40. Liu, F., Ge, Q., Yarom, Y., McKeen, F., Rozas, C. V., Heiser, G., Lee, R. B. (2016). CATalyst:
Defeating last-level cache side channel attacks in cloud computing. In 2016 IEEE international
symposium on high performance computer architecture, HPCA 2016, Barcelona, Spain, March
12–16, 2016 (pp. 406–418). IEEE Computer Society. https://doi.org/10.1109/HPCA.2016.
7446082.

41. Liu, F., Yarom, Y., Ge, Q., Heiser, G., & Lee, R. B. (2015). Last-level cache side-channel
attacks are practical. In 2015 IEEE symposium on security and privacy, SP 2015, San Jose,
CA, USA, May 17–21, 2015 (pp. 605–622). IEEE Computer Society. https://doi.org/10.1109/
SP.2015.43.
42. Maurice, C., Weber, M., Schwarz, M., Giner, L., Gruss, D., Boano, C. A., Mangard, S.,
& Römer, K. (2017). Hello from the other side: SSH over robust cache covert chan-
nels in the cloud. In 24th annual network and distributed system security symposium,
NDSS 2017, San Diego, California, USA, February 26–March 1, 2017. The Internet Soci-
ety. https://www.ndss-symposium.org/ndss2017/ndss-2017-programme/hello-other-side-ssh-
over-robust-cache-covert-channels-cloud/.
43. Moghimi, A., Irazoqui, G., & Eisenbarth, T. (2017). CacheZoom: How SGX amplifies the
power of cache attacks. In W. Fischer, & N. Homma (Eds.), Proceedings of the cryptographic
hardware and embedded systems—CHES 2017—19th international conference, Taipei, Tai-
wan, September 25–28, 2017. Lecture Notes in Computer Science (Vol. 10529, pp. 69–90).
Springer. https://doi.org/10.1007/978-3-319-66787-4_4.
44. Mulnix, D. (2017). Intel Xeon processor scalable family technical overview. https://www.
intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-
technical-overview.html. Accessed 2023-05-22.
45. Oren, Y., Kemerlis, V. P., Sethumadhavan, S., & Keromytis, A. D. (2015). The spy in the
sandbox: Practical cache attacks in JavaScript and their implications. In I. Ray, N. Li, &
C. Kruegel (Eds.), Proceedings of the 22nd ACM SIGSAC conference on computer and
communications security, Denver, CO, USA, October 12–16, 2015 (pp. 1406–1418). ACM.
https://doi.org/10.1145/2810103.2813708.
46. Osvik, D. A., Shamir, A., & Tromer, E. (2006). Cache attacks and countermeasures: The case
of AES. In D. Pointcheval (Ed.), Proceedings of the Topics in Cryptology—CT-RSA 2006,
The Cryptographers’ track at the RSA conference 2006, San Jose, CA, USA, February 13–17,
2006. Lecture notes in computer science (Vol. 3860, pp. 1–20). Springer. https://doi.org/10.
1007/11605805_1.
47. Pessl, P., Gruss, D., Maurice, C., Schwarz, M., & Mangard, S. (2016). DRAMA: exploiting
DRAM addressing for cross-CPU attacks. In T. Holz, & S. Savage (Eds.), 25th USENIX
security symposium, USENIX security 16, Austin, TX, USA, August 10–12, 2016 (pp. 565–
581). USENIX Association. https://www.usenix.org/conference/usenixsecurity16/technical-
sessions/presentation/pessl.
48. Purnal, A., Turan, F., & Verbauwhede, I.: Double trouble: Combined heterogeneous attacks
on non-inclusive cache hierarchies. In K. R. B. Butler, & K. Thomas (Eds.), 31st USENIX
security symposium, USENIX security 2022, Boston, MA, USA, August 10–12, 2022 (pp. 3647–
3664). USENIX Association (2022). https://www.usenix.org/conference/usenixsecurity22/
presentation/purnal.
49. Ristenpart, T., Tromer, E., Shacham, H., & Savage, S. (2009). Hey, you, get off of my cloud:
Exploring information leakage in third-party compute clouds. In E. Al-Shaer, S. Jha, & A. D.
Keromytis (Eds.), Proceedings of the 2009 ACM conference on computer and communications
security, CCS 2009, Chicago, Illinois, USA, November 9–13, 2009 (pp. 199–212). ACM.
https://doi.org/10.1145/1653662.1653687.
50. Schwarz, M. (2019). PTEditor: A small library to modify all page-table levels of all processes
from user space for x86_64 and ARMv8. https://github.com/misc0110/PTEditor. Version
738f42e, accessed 2023-05-22.
51. Seaborn, M., & Dullien, T. (2015). Exploiting the DRAM Rowhammer bug to gain kernel
privileges. Black Hat USA, 18, 71. https://www.blackhat.com/docs/us-15/materials/us-15-
Seaborn-Exploiting-The-DRAM-Rowhammer-Bug-To-Gain-Kernel-Privileges.pdf.
52. Taram, M., Venkat, A., & Tullsen, D. M. (2020). Packet chasing: Spying on network packets
over a cache side-channel. In 47th ACM/IEEE annual international symposium on computer
architecture, ISCA 2020, Valencia, Spain, May 30–June 3, 2020 (pp. 721–734). IEEE. https://
doi.org/10.1109/ISCA45697.2020.00065.

53. Tiemann, T., Weissman, Z., Eisenbarth, T., & Sunar, B.: IOTLB-SC: An accelerator-
independent leakage source in modern cloud systems. In: Proceedings of the 2023 ACM Asia
conference on computer and communications security, AsiaCCS 2023, Melbourne, Australia,
July 10–14, 2023. ACM (2023). https://doi.org/10.1145/3579856.3582838.
54. Tsunoo, Y., Saito, T., Suzaki, T., Shigeri, M., & Miyauchi, H. (2003). Cryptanalysis of
DES implemented on computers with cache. In C. D. Walter, Ç. K. Koç, & C. Paar
(Eds.) Proceedings of the Cryptographic hardware and embedded systems—CHES 2003, 5th
international workshop, Cologne, Germany, September 8–10, 2003. Lecture notes in computer
science (Vol. 2779, pp. 62–76). Springer. https://doi.org/10.1007/978-3-540-45238-6_6.
55. van der Veen, V., Fratantonio, Y., Lindorfer, M., Gruss, D., Maurice, C., Vigna, G., Bos, H.,
Razavi, K., & Giuffrida, C.: Drammer: Deterministic Rowhammer attacks on mobile platforms.
In E. R. Weippl, S. Katzenbeisser, C. Kruegel, A. C. Myers, S. Halevi (Eds.), Proceedings
of the 2016 ACM SIGSAC conference on computer and communications security, Vienna,
Austria, October 24–28, 2016, pp. 1675–1689. ACM (2016). https://doi.org/10.1145/2976749.
2978406.
56. Vila, P., Köpf, B., & Morales, J. F. (2019). Theory and practice of finding eviction sets. In 2019
IEEE symposium on security and privacy, SP 2019, San Francisco, CA, USA, May 19–23, 2019
(pp. 39–54). IEEE. https://doi.org/10.1109/SP.2019.00042.
57. Witteman, M. F., van Woudenberg, J. G. J., & Menarini, F.: Defeating RSA multiply-always
and message blinding countermeasures. In A. Kiayias (Ed.), Proceedings of the topics in
cryptology—CT-RSA 2011—the Cryptographers’ track at the RSA conference 2011, San
Francisco, CA, USA, February 14–18, 2011. Lecture notes in computer science (Vol. 6558,
pp. 77–88). Springer (2011). https://doi.org/10.1007/978-3-642-19074-2_6.
58. Wu, Z., Xu, Z., & Wang, H. (2012). Whispers in the hyper-space: High-speed covert channel
attacks in the cloud. In T. Kohno (Ed.), Proceedings of the 21st USENIX security symposium,
Bellevue, WA, USA, August 8–10, 2012 (pp. 159–173). USENIX Association. https://www.
usenix.org/conference/usenixsecurity12/technical-sessions/presentation/wu.
59. Xilinx (2019). Accelerator Cards. https://www.xilinx.com/products/boards-and-kits/
accelerator-cards.html. Accessed 2023-05-22.
60. Yan, M., Sprabery, R., Gopireddy, B., Fletcher, C. W., Campbell, R. H., & Torrellas, J. (2019).
Attack directories, not caches: Side channel attacks in a non-inclusive world. In 2019 IEEE
symposium on security and privacy, SP 2019, San Francisco, CA, USA, May 19–23, 2019 (pp.
888–904). IEEE. https://doi.org/10.1109/SP.2019.00004.
61. Yarom, Y., & Falkner, K. (2014). FLUSH+RELOAD: A high resolution, low noise, L3 cache
side-channel attack. In K. Fu, & J. Jung (Eds.), Proceedings of the 23rd USENIX security
symposium, San Diego, CA, USA, August 20–22, 2014 (pp. 719–732). USENIX Association.
https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/yarom.
62. Ye, Y., West, R., Cheng, Z., & Li, Y. (2014). COLORIS: a dynamic cache partitioning system
using page coloring. In J. N. Amaral, & J. Torrellas (Eds.), International conference on parallel
architectures and compilation, PACT ’14, Edmonton, AB, Canada, August 24–27, 2014 (pp.
381–392). ACM. https://doi.org/10.1145/2628071.2628104.
63. Zhang, T., Zhang, Y., & Lee, R. B.: CloudRadar: A real-time side-channel attack detection
system in clouds. In F. Monrose, M. Dacier, G. Blanc, & J. García-Alfaro (Eds.), Proceedings
of the research in attacks, intrusions, and defenses—19th international symposium, RAID 2016,
Paris, France, September 19–21, 2016. Lecture notes in computer science (Vol. 9854, pp. 118–
140). Springer (2016). https://doi.org/10.1007/978-3-319-45719-2_6.
64. Zhang, Y., Juels, A., Reiter, M. K., & Ristenpart, T. (2012). Cross-VM side channels and their
use to extract private keys. In T. Yu, G. Danezis, & V. D. Gligor (Eds.), The ACM conference
on computer and communications security, CCS’12, Raleigh, NC, USA, October 16–18, 2012
(pp. 305–316). ACM. https://doi.org/10.1145/2382196.2382230.
65. Zhou, Z., Reiter, M. K., & Zhang, Y. (2016). A software approach to defeating side channels
in last-level caches. In E. R. Weippl, S. Katzenbeisser, C. Kruegel, A. C. Myers, & S. Halevi
(Eds.), Proceedings of the 2016 ACM SIGSAC conference on computer and communications
security, Vienna, Austria, October 24–28, 2016 (pp. 871–882). ACM. https://doi.org/10.1145/
2976749.2978324.
Chapter 9
Fingerprinting and Mapping Cloud FPGA Infrastructures

Shanquan Tian, Ilias Giechaskiel, Wenjie Xiong, and Jakub Szefer

9.1 Introduction

The proliferation of cloud FPGA infrastructures has made on-demand access to
FPGA acceleration available for several types of applications, including financial
modeling, cryptography, and genome data analysis, among others [3]. The wide
availability of FPGAs has many benefits, but the potentially highly sensitive nature
of the information processed has attracted recent research on FPGA covert-channel
attacks. Specifically, multi-tenant [21, 22] and temporal [63] covert communication
was shown to be possible in cloud FPGAs.
Such attacks, however, make a crucial assumption in their threat model, namely,
that the adversary is able to identify the FPGAs to be used for attacks or has some
knowledge of the cloud FPGA infrastructure itself. In other words, it is assumed that
attackers know that their designs are co-located with the victim logic on the same
FPGA chip (for multi-tenant attacks) or that the victim had rented the same physical
FPGA board as the attacker in the previous time slot (for temporal attacks).
Existing cloud FPGA providers, such as Amazon Web Services (AWS) [9], take
a number of measures to protect their hardware against such attacks, or, more
generally, against malicious logic that targets other users or the infrastructure itself.
The architectures are not disclosed publicly, except for the types of FPGA chips

S. Tian · J. Szefer
Yale University, New Haven, CT, USA
e-mail: shanquan.tian@yale.edu; jakub.szefer@yale.edu
I. Giechaskiel
Independent Researcher, London, UK
e-mail: ilias@giechaskiel.com
W. Xiong
Virginia Tech, Blacksburg, VA, USA
e-mail: wenjiex@vt.edu

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 239
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_9
240 S. Tian et al.

used and the geographic location of the data centers. Furthermore, a number of
design rule checks (DRCs) are run on the design checkpoint (DCP) files generated by
Xilinx’s Vivado tools before the generated bitstream (called an Amazon FPGA
image, or AFI) can be loaded onto one of the AWS FPGAs. The checks, which
include prohibiting combinatorial loops [4], are combined with a restrictive “shell”
interface that prevents access to Xilinx eFUSE and Device DNA primitives [66],
which could be used to identify the specific FPGA hardware that a user has rented.
In spite of the efforts to hide information about the cloud FPGA architecture, this
chapter shows that it is possible to gain insights into the infrastructure through the
resources that are available to unprivileged FPGA users. Specifically, this chapter
introduces two algorithms for fingerprinting cloud FPGAs through unique features
in their boards. The first approach uses physical unclonable functions (PUFs) based
on the decay of dynamic random access memory (DRAM) [69] to identify the
DRAM modules attached to the cloud FPGA boards, and, by extension, the FPGAs
themselves. The second approach uses ring-oscillator (RO) PUFs that fingerprint
the FPGA chips directly. Both designs bypass AWS countermeasures: the DRAM-based
PUF disables DRAM refresh by loading AFIs with and without DRAM controllers
instantiated, while the RO-based PUF uses novel ring oscillators with latches and
flip-flops [21, 59] that are not detected by the deployed DRCs.
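A DRAM-decay fingerprint is essentially a bit vector marking which cells flipped during a refresh-free interval, and whether two measurements come from the same board can be decided by their fractional Hamming distance. The sketch below illustrates only this matching step (the 0.1 threshold is a hypothetical calibration value, not one from this chapter):

```c
#include <stddef.h>
#include <stdint.h>

/* Fractional Hamming distance between two decay fingerprints,
 * each stored as a packed bit vector of `words` 64-bit words. */
static double fractional_hamming(const uint64_t *a, const uint64_t *b,
                                 size_t words)
{
    uint64_t diff = 0;
    for (size_t i = 0; i < words; i++)
        diff += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);
    return (double)diff / (double)(words * 64);
}

/* Declare two measurements to come from the same physical DRAM module
 * (and hence board) if their fingerprints are closer than a
 * calibration threshold; 0.1 here is an assumed placeholder. */
static int same_board(const uint64_t *a, const uint64_t *b, size_t words)
{
    return fractional_hamming(a, b, words) < 0.1;
}
```

Repeated measurements of the same module differ only by measurement noise (small distance), while different modules decay nearly independently (distance near 0.5), which is what makes the threshold test reliable.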
Our work then shifts focus from identifying single FPGA boards to mapping the
whole cloud FPGA infrastructure itself. The main insight behind our research is
that memory accesses between the host computer and an FPGA board become a
bottleneck when two or more FPGAs from the same non-uniform memory access
(NUMA) node within a server are accessing memory simultaneously. We show that
it is possible to influence the peripheral component interconnect express (PCIe)
bandwidth by running an FPGA memory stressor and to observe this change in
bandwidth by running a separate FPGA memory tester on a different FPGA board
in the server.1 Using this approach, we can determine which FPGA slots within a
server belong to the same NUMA node.
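The slot-grouping logic implied by these measurements can be sketched in Python. This is an illustrative reconstruction, not the chapter's measurement code: the `interferes` predicate and the 20% bandwidth-drop threshold in `same_numa_node` are assumptions standing in for the actual stressor/tester bandwidth comparison.

```python
def same_numa_node(baseline_bw, stressed_bw, drop=0.2):
    """Attribute two slots to one NUMA node when stressing one slot
    reduces the other's measured PCIe bandwidth by more than `drop`
    (a fractional threshold; 0.2 is an assumed value)."""
    return (baseline_bw - stressed_bw) / baseline_bw > drop

def numa_groups(n_slots, interferes):
    """Partition FPGA slots into NUMA localities via union-find over
    pairs of slots whose memory traffic measurably contends."""
    parent = list(range(n_slots))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(n_slots):
        for b in range(a + 1, n_slots):
            if interferes(a, b):
                parent[find(a)] = find(b)

    groups = {}
    for s in range(n_slots):
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())

# With the observed f1.16xlarge behavior (slots 0-3 and 4-7 only
# interfere among themselves), two localities emerge:
assert numa_groups(8, lambda a, b: (a < 4) == (b < 4)) == [[0, 1, 2, 3], [4, 5, 6, 7]]
```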
In particular, we show that in f1.16xlarge AWS instances, FPGAs in slots
0–3 or 4–7 interfere with each other, and thus conclude that they form separate
NUMA nodes. We then use data from dozens of f1.2xlarge instances that have
been rented one after the other and determine that successive instances often (but
not always) belong to the same NUMA locality. The findings are confirmed for
f1.4xlarge instances and across different data center regions for both instance
types. We perform additional experiments with f1.2xlarge instances to calculate
the probability of renting the same FPGA or FPGAs within the same NUMA locality
across time, therefore fingerprinting and mapping the cloud infrastructure on a very
fine-grained level.
Overall, our work exposes a fundamental infrastructure issue and highlights
that simply focusing on the security of the FPGA chip itself, but ignoring other
infrastructure components, such as the DRAM modules or the PCIe bus, leaves
cloud FPGAs open to new vulnerabilities.

1 In later work, we show how to exploit this effect to create covert- and side-channel attacks between isolated VMs [23, 24].

9 Fingerprinting and Mapping Cloud FPGA Infrastructures 241

9.1.1 Contributions and Chapter Organization

The contributions of our work are as follows:


1. After describing the relevant background (Sect. 9.2), we first identify attacks on
fingerprinting and mapping the infrastructure as part of the attack surface for
cloud FPGAs (Sect. 9.3).
2. We then introduce and evaluate novel ways to fingerprint dozens of cloud FPGAs
using DRAM-based and RO-based PUFs as well as propose potential countermeasures.
Our PUF fingerprints not only allow us to calculate the probability
of renting the same FPGA as a function of time, but also to demonstrate that
adversaries can monitor temperature changes in the data center (Sect. 9.4).
3. By identifying that there is PCIe contention between FPGA slots in cloud servers
and by using our PUFs, we map the cloud FPGA infrastructure on a more
fine-grained basis and further reveal insights about the FPGA instance allocation
algorithm used by the AWS cloud (Sect. 9.5).
4. We summarize related work in Sect. 9.6 before concluding in Sect. 9.7.
5. We make our code available under an open-source license at https://caslab.
csl.yale.edu/code/cloud-fpga-fingerprinting and https://caslab.csl.yale.edu/code/
cloud-ro-primitives/.

9.2 Background

This section describes current public cloud FPGA deployments and their typical
hardware setup (Sect. 9.2.1). It then summarizes decay-based DRAM PUFs
(Sect. 9.2.2), ring oscillators (Sect. 9.2.3), and PCIe-related concepts (Sect. 9.2.4).

9.2.1 Cloud FPGAs

Several options are available for renting FPGAs in the cloud. Since 2015, academic
researchers can access a cluster with Intel Stratix V FPGAs in the Texas Advanced
Computing Center (TACC) [62]. Intel FPGAs are also available on Alibaba
Cloud [1] and on Microsoft Azure for machine learning applications [43]. Xilinx-
based cloud offerings have been available since 2016, when AWS announced F1
instances with Xilinx Virtex UltraScale+ FPGAs [2]. The same chips also power
Huawei [65] and Alibaba [1] cloud services. Meanwhile, Kintex UltraScale boards
are available on Baidu [12] and Tencent [61].

242 S. Tian et al.
In this chapter, we focus on FPGA instances provided by AWS. These instances,
or virtual machines (VMs), are offered in several geographical regions and come
in three flavors, with 1, 2, or 8 dedicated FPGAs per VM instance, called
f1.2xlarge, f1.4xlarge, and f1.16xlarge (the instance name is twice
the number of FPGAs, so f1.2xlarge has 1 FPGA, while f1.4xlarge has 2,
etc.). The total amount of resources allocated per VM increases proportionally with
the number of attached FPGAs, providing 8 virtual CPUs (vCPUs) from Intel Xeon
E5-2686 v4 (Broadwell) processors, 130 GB of RAM, and 470 GB of NVMe SSD
per FPGA [9]. Thus, e.g., f1.16xlarge instances have 64 vCPUs, over 1 TB of
RAM and 3.7 TB of disk space.
Each FPGA board can communicate with the server over x16 PCIe Gen 3. In
addition, each FPGA can access (via the programmable logic) four DDR4 DRAM
chips on the FPGA board itself, which are separate from the server’s DRAM.
The FPGA DRAM comes with error correction code (ECC) and a total of 16 GB
of memory [9] for each FPGA. AWS F1 instances use 16 nm Virtex UltraScale+
XCVU9P chips [9], which internally contain over 1.1 million lookup tables (LUTs),
2.3 million flip-flops (FFs), and 6,800 digital signal processing (DSP) blocks [67]. It
should be noted that f1.16xlarge instances use a dedicated PCIe fabric, which
“lets the FPGAs share the same memory space and communicate with each other
across the fabric at up to 12 Gbps in each direction” [9], suggesting that a server can
consist of at most two CPUs and eight attached PCIe cards, a.k.a. FPGAs. As we
show in this chapter, servers indeed seem to have two NUMA locality nodes, each
encompassing one CPU and four FPGAs. This is consistent with known server and
PCIe designs but is not publicly specified by Amazon.2

9.2.2 Decay-Based DRAM PUFs

DRAM is widely used in personal computers and servers due to its high storage
density. Usually, multiple DRAM chips (ranks) are combined in a DRAM module
to provide enough memory. Each DRAM chip consists of DRAM banks, which are
arrays of DRAM cells. A single DRAM cell consists of a capacitor and a transistor,
with bits of information stored as charges on the capacitors. The gate of the access
transistor in the DRAM cell connects to the wordline (WL) in that row, while the
capacitor in the DRAM cell connects to the bitline (BL) through the transistor. To
access a certain memory address, the bitlines are first reset by the equalizers. Then,
the corresponding wordline is enabled, and the charge on the capacitors is read
through the sense amplifiers.

2 We further reverse-engineer the server architecture by incorporating information about SSDs and network interface cards in [24].



DRAM is a type of volatile memory because the capacitor charge leaks over time
through different leakage paths. The time that a DRAM cell can retain the charge on
the capacitor and store the data value is called the retention time. After the retention
time elapses, the charge on the cell will leak, and the bit stored in the DRAM cell
may flip its value. To maintain the data integrity of information stored, the DRAM
is refreshed periodically to recharge the capacitors to their original voltage levels.
Moreover, an error correction code (ECC) can also be applied.
The variation in the retention time of different DRAM cells can be used in
physical unclonable functions (PUFs) [69]. Specifically, in a decay-based DRAM
PUF, the DRAM PUF region is first set to a known initial value (e.g., all ones) and
the DRAM refresh is disabled. After a certain decay period elapses, the DRAM
PUF region is then read. Due to DRAM charge leakage, bit flips (errors) in the
initial values will occur. The location of the bit flips depends on variations in the
fabrication process and is considered to be unique for each DRAM chip. Thus,
the bit flips due to DRAM decay can be used as a PUF response. DRAM PUFs
have been used to identify and authenticate DRAM chips [48, 52, 55, 56, 60, 69]
or generate keys [53, 56, 60, 69]. In this chapter, we use DRAM data retention
properties to create a unique fingerprint of DRAM chips, and, by extension, the
FPGAs to which they are attached.
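Extracting such a fingerprint amounts to XORing the read-back data against the known initial pattern and recording which bit positions flipped. A minimal sketch, independent of any particular DRAM interface:

```python
def puf_response(initial, decayed):
    """Return the set of bit positions that flipped during decay.

    Which cells lose their charge first is determined by
    fabrication-process variation, so this set of positions serves
    as a (near-)unique fingerprint of the DRAM chip.
    """
    flips = set()
    for byte_idx, (a, b) in enumerate(zip(initial, decayed)):
        diff = a ^ b  # XOR marks the bits that changed
        for bit in range(8):
            if diff & (1 << bit):
                flips.add(byte_idx * 8 + bit)
    return flips

# Example: write all 1s, then observe two cells decayed to 0
# (bit 1 of byte 0 and bit 7 of byte 1).
assert puf_response(bytes([0xFF, 0xFF]), bytes([0xFD, 0x7F])) == {1, 15}
```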

9.2.3 Ring Oscillators

Ring oscillators (ROs) are a type of circuit with an odd number of NOT gates that
are chained together in a loop (i.e., the output of the last gate is the input of the
first gate). The value at any given stage of an RO oscillates between 1 and 0, at a
frequency that depends on the number of stages in the RO, the delay between the
stages, as well as process, voltage, and temperature (PVT) variations [26].
ROs on FPGAs are traditionally implemented using lookup tables (LUTs), which
are configured as either inverters or buffers (e.g., 3 inverters, or 1 inverter and 2
buffers). These combinatorial loops can be detected by the synthesis tools, and some
cloud providers, such as Amazon Web Services [9], in fact prohibit ROs in FPGA
bitstreams deployed on their cloud FPGAs. However, alternative types of ROs that
behave similarly have been designed to bypass these countermeasures. These ROs,
used in this chapter, replace one of the LUT stages with a latch or a flip-flop [21, 59].

9.2.4 PCIe Contention and NUMA Localities

The PCIe standard provides a high-bandwidth, serial, full-duplex interface. Unlike
its parallel, bus-based PCI predecessor, each PCIe slot provides a point-to-point
communication mechanism (link) that connects devices in a tree topology to the
host CPU via the root complex and possibly through intermediate switches. A link
is composed of 1, 2, 4, 8, 12, 16, or 32 pairs of RX and TX differential signals
(lanes) that allow for higher throughput by interleaving transmissions
(approximately 1 GBps per lane for PCIe 3.0) [57].
PCIe implements a credit-based flow-control protocol between link partners
(e.g., the card and a switch, or a switch and the root complex) by sequencing the
outgoing transaction layer packets (TLPs) [57]. The credit tokens are therefore used
to arbitrate and distribute bandwidth among competing incoming or outgoing link
connections. As credits are point-to-point-based, it is possible for end-to-end traffic
to be passing through a congested link (due to an unrelated flow), even when the
individual endpoints can deal with higher bandwidth [40]. The slowdown that can
be experienced by applications therefore not only depends on which devices are
communicating, but also the PCIe topology of the given system [18, 40].
Although different PCIe slots on the motherboard might advertise the same
number of lanes and identical performance, where they lie on the PCIe topology
graph might be different, leading to non-uniform latency and bandwidth. For
example, even desktop computers can have multiple PCIe root complexes (one in the
chipset and one in the CPU integrated I/O hub) that are linked over lower-bandwidth
interfaces [38]. These interfaces, such as Intel’s quick path interconnect (QPI), can
also be used for cache coherency protocols between different CPU sockets [44],
which represent different root complexes and only have direct access to a subset of
PCIe devices or DRAM chips.
This leads to NUMA latencies from a CPU to different PCIe or memory slots (or
vice versa) and potentially increases contention of the interconnect resources [33,
41]. Memory and PCIe devices that have “the same access characteristics for a
particular processor” are called NUMA nodes or localities [33]. Although for a
given program it is possible to reduce the effects of NUMA problems, for instance
by pinning threads to cores [41], PCIe contention can remain pronounced even
when performing only “local” accesses within a NUMA node. In some systems,
translation lookaside buffer (TLB) misses in the input–output memory management
unit (IOMMU) may also increase latency and decrease throughput when issuing
direct memory access (DMA) requests [45].
PCIe congestion has in the past primarily been studied for systems that consist
of multiple graphics processing units (GPUs), with PCIe switches becoming a
bottleneck when handling traffic from multiple GPUs [13, 16, 18, 54, 58], in part
due to a round-robin scheduling policy that can lead to severe stalls [34].
Despite the existing characterization of PCIe contention in setups with more than
one GPU, no previous work has used PCIe contention to reverse-engineer cloud
FPGA infrastructures. The closest research is a recent work by Wang et al. [64],
who investigated the PCIe overhead of accessing FPGAs in cloud environments
for different driver implementations. In this chapter, we instead more extensively
study PCIe contention effects due to simultaneous memory accesses over multiple
FPGAs at once and use our results for architectural insights into AWS’s server
infrastructure.

9.3 Threat Model

Covert- and side-channel attacks are possible in cloud FPGAs (Sect. 9.6), but
they often require that adversaries be able to uniquely identify FPGA instances
to carry out the attacks. This chapter provides a way to uniquely fingerprint
individual FPGAs and more generally map the cloud infrastructure, while obeying
the DRCs imposed by cloud FPGA providers. The adversarial user is therefore
free to place and route their potentially malicious logic within the confines of
their dedicated region on the FPGA chip but cannot use any prohibited circuits,
such as combinatorial loops [4]. In addition, users only interact with the physical
interfaces through the cloud-provided IP modules, such as the DRAM controller,
and do not have direct access to I/O pins, identifiers such as eFUSE and Device
DNA primitives [66], or voltage and temperature monitors. Adversaries can instead
try to infer or influence such information indirectly (e.g., by using PUFs or other
alternative RO constructions), but they do not have physical access to the underlying
FPGA boards or server racks themselves. We also do not assume vulnerabilities
in the virtualization mechanisms that would allow users to (directly) snoop on the
memory or PCIe transactions of other VM instances. Attacks on the cloud-provided
logic (including decrypting the protected IP), FPGA software tools, or the bitstream
are similarly out of scope.

9.4 Fingerprinting Cloud FPGAs

The DRAM-based PUF is described in Sect. 9.4.1 and evaluated in Sect. 9.4.2,
while the RO-based PUF is introduced in Sect. 9.4.3 and evaluated in Sect. 9.4.4.
Section 9.4.5 presents some potential countermeasures against FPGA fingerprinting.

9.4.1 DRAM PUF Design

This section presents a novel way to create decay-based DRAM PUFs by loading
and unloading two different types of AFIs: one with and one without a memory
controller. This approach disables refresh, while still providing power to the
DRAM modules. Section 9.4.1.1 expands on the memory-related aspects of our
experimental setup, while Sect. 9.4.1.2 explains in detail how DRAM PUFs are
instantiated and used for data collection on AWS F1 instances.

9.4.1.1 Accessing DRAM from the FPGA

The cl_dram_dma example in the AWS development kit [6] explains how to
access the DRAM from the FPGA. The physical pinout and timing parameters

Fig. 9.1 System diagram: a virtual machine communicates with one or more FPGAs over PCIe.
Of the four DRAM modules on each FPGA board, one (DRAM C) is reserved by the shell. PUFs
that exploit DRAM charge leakage on the other three DRAMs (A, B, and D) can uniquely identify
the underlying hardware

of the DDR4 DRAM chips are hidden in the sh_ddr module, which provides a
512-bit AXI4 interface to user logic. It also implements memory initialization, error
correction, and self-refresh of DRAM cells. As shown in Fig. 9.1, although there are
four DRAM modules, one (DRAM C) is reserved by the FPGA “shell.” It is always
initialized (and refreshed) regardless of whether custom user logic has instantiated
a DRAM controller to use the memory. The remaining DRAMs (A, B, and D) are
instantiated within the custom logic [5]. Each instantiation also uses the sh_ddr
module, which is encrypted and prevents users from modifying its functionality:
self-refresh of DRAM cells is always enabled whenever the DRAM controller is
instantiated. Nevertheless, Sect. 9.4.1.2 presents a novel method that disables
self-refresh, thus allowing DRAM cells to decay.
It should be noted that two key modifications are made to the cl_dram_dma
logic. First of all, the memory scrubber module mem_scrb (which erases DRAMs
when the AFI is loaded) is disabled through the macro NO_CL_TST_SCRUBBER.
This is necessary to ensure that the decay-based DRAM PUF fingerprints are not
zeroed out before they are read. And, second, the error correction logic is also turned
off by setting the ECC parameter of the ddr4_core_ddr4 module to OFF. This
ensures that the PUF response remains usable by keeping decay-based errors intact,
instead of being corrected by ECC. The ability to turn off ECC is discussed further
in the context of defense mechanisms in Sect. 9.4.5.

9.4.1.2 Collecting DRAM PUF Fingerprints

Figure 9.2 depicts the data collection process of the decay-based DRAM PUFs.
There are three steps in the measurement process, which use two separate AFIs:
1. The first step is to write all 1s to a fixed area within a DRAM chip. This step
uses AFI-0, which is based on the AWS example design cl_dram_dma, with
modifications to the memory scrubber and ECC, as explained above.

Fig. 9.2 Steps to measure DRAM PUFs: AFI-0 is first loaded to write all 1s to a certain area of a
DRAM module. Then AFI-1 is loaded to stop memory self-refresh. Finally, after a fixed amount
of time, AFI-0 is re-loaded to measure bit flips in the written addresses

2. The second step is to wait for DRAM cells to decay by using AFI-1. This step
loads an FPGA image that stops self-refresh of DRAMs A, B, and D for the
chosen decay period, idling the FPGA. The cl_hello_world design is used
for this purpose, as it does not instantiate memory controllers. The self-refresh
logic is only disabled in this step.
3. The final step of reading returns to AFI-0 and simply reads back the DRAM
data to generate the PUF fingerprints from DRAMs A, B, and D. The memory
scrubber and ECC are disabled, so the image retrieves the previous data, with
some of its bits decayed.
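The sequencing of these three steps can be sketched as follows. The hardware operations are passed in as callables because the real versions (AFI loading through the AWS FPGA management tools, DMA transfers through the AWS SDK) are environment-specific; the all-ones pattern and the 120-second decay period follow the description above, but the code is an illustrative reconstruction, not the measurement tool used in the chapter.

```python
import time

def collect_fingerprint(load_afi, write_pattern, read_back, decay_s=120):
    """Run the three measurement steps of Fig. 9.2 and return the
    fingerprint as the set of 1 -> 0 bit-flip positions."""
    load_afi("AFI-0")       # step 1: DRAM controller instantiated
    write_pattern(0xFF)     # write all 1s to the PUF region
    load_afi("AFI-1")       # step 2: no controller -> self-refresh stops
    time.sleep(decay_s)     # let the DRAM cells decay
    load_afi("AFI-0")       # step 3: re-load and read back the region
    decayed = read_back()
    return {i * 8 + b
            for i, byte in enumerate(decayed)
            for b in range(8)
            if not (byte >> b) & 1}
```

On a real F1 instance, `load_afi` would wrap the fpga-load-local-image management command, and `write_pattern`/`read_back` would use the DMA helpers from the AWS development kit.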

9.4.2 DRAM PUF Evaluation

This section expands on the experimental setup (Sect. 9.4.2.1) and provides an
example of the DRAM PUF response (Sect. 9.4.2.2). It then details the metric used
for fingerprinting FPGA instances (Sect. 9.4.2.3) and calculates the probability of
re-renting the same FPGA (Sect. 9.4.2.4). Finally, it finishes with an investigation
of the background data center conditions (Sect. 9.4.2.5).

9.4.2.1 Data Collection on AWS

Experiments are performed on Amazon EC2 F1 spot instances [11], in the North
Virginia us-east-1 region. Spot instances are similar to on-demand ones but can
be terminated at a moment’s notice. As a result, they are cheaper: an on-demand
f1.16xlarge instance costs $13.20 per hour, while the same spot instance only
costs $3.96 [10], i.e., less than a third of the price.
The VMs used on the cloud servers, also called Amazon Machine Images
(AMIs) [8], run CentOS 7.6.1810 and access the Xilinx Virtex UltraScale+ FPGAs
in the f1 instances. A series of spot instances, launched with the same AMIs, are
requested in order and are terminated after collecting DRAM PUFs responses on
all FPGA slots of each instance. The interval between terminating one instance and
requesting the next one is five minutes. However, due to variations in how long

Fig. 9.3 The number of bit flips in the four DRAMs of an FPGA board for different decay periods.
DRAM C is reserved by the FPGA shell and cannot be used for PUFs. The other three DRAM error
counts follow a similar pattern, but the absolute magnitudes vary

initialization of the FPGAs takes, there are some small differences in the collection
time of the DRAM PUFs in practice. On multi-FPGA (4x and 16x) instances, the
measurements on different FPGA slots are done in sequence, minimizing contention
errors or delays due to the shared PCIe bus.

9.4.2.2 DRAM PUF Example on Cloud FPGAs

As discussed in Sects. 9.2.2 and 9.4.1, the location of bit flips that occur after
disabling the memory scrubber, error correction, and self-refresh is related to the
manufacturing process and can fingerprint the DRAM modules attached to the
FPGAs. It can thus serve as a proxy for fingerprinting the cloud FPGA instances,
under the reasonable assumption that the same DRAM chips are always permanently
and physically connected to the same FPGA board. Figure 9.3 shows the number of
bit flips (error counts) for the four DRAMs on an FPGA board after waiting for
different decay periods. Because memory accesses influence DRAM PUFs, all data
points in Fig. 9.3 come from independent measurements. The waiting time between measurements
is two minutes, and the size of the PUF is 512 kB. Decay on DRAM C cannot be
measured, as it is reserved by the shell, but the other three DRAMs follow a similar
pattern: the longer the wait, the more pronounced the decay. However, the absolute
magnitude varies due to manufacturing variations. The decay period is chosen as
120 seconds in the following experiments.
Figure 9.4 shows an example DRAM PUF response, with each pixel in the 1024 ×
1024 grid representing four bits in the 512 kB PUF response. There is sufficient
randomness in the response to distinguish between otherwise-identical DRAMs.

9.4.2.3 Fingerprinting Metric

To quantify how similar or different DRAM PUF responses are, we use the Jaccard
index [29]. Let F1 and F2 denote the sets of bit flips in two DRAM PUF responses.
Then, the Jaccard index for the two DRAM PUF responses is defined as

Fig. 9.4 Bitmap of an example DRAM PUF response on an AWS FPGA, where each pixel denotes
the number of bit flips per four bits in the DRAM PUF response

Fig. 9.5 Distribution of the Jaccard index metric, shown over each pair of DRAM PUF responses
on f1.2xlarge instances

J(F1, F2) = |F1 ∩ F2| / |F1 ∪ F2|                                          (9.1)

As shown by Xiong et al. [69], the intra-device Jaccard index J of PUF responses
from the same DRAM chip is close to one, whereas the inter-device Jaccard index
J from different DRAMs is close to zero. This remains true for the data collected
in our work, where sixty f1.2xlarge instances are launched in series. Due to
the AWS allocation process, these instances may or may not use different FPGA
boards. As shown in Fig. 9.5, the distribution of the Jaccard indices for each pair of
PUF responses has a peak close to 0, and the rest are between 0.5 and 1 as expected.
Therefore, DRAM PUF responses that have a Jaccard index of less (resp., more)
than 0.5 are assumed to come from different (resp., the same) FPGA boards.
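Equation (9.1) and the 0.5 decision threshold translate directly into code:

```python
def jaccard(f1, f2):
    """Jaccard index of two sets of bit-flip positions (Eq. 9.1)."""
    if not f1 and not f2:
        return 1.0  # convention: two empty responses are identical
    return len(f1 & f2) / len(f1 | f2)

def same_board(f1, f2, threshold=0.5):
    """Responses above the threshold are attributed to the same board."""
    return jaccard(f1, f2) > threshold

# Two reads of the same DRAM share most flip locations (J = 4/6) ...
assert same_board({1, 5, 9, 42, 100}, {1, 5, 9, 42, 101})
# ... while different DRAMs share (almost) none (here J = 0).
assert not same_board({1, 5, 9, 42, 100}, {2, 6, 10, 43})
```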

9.4.2.4 Identifying Repeated Instances

This section identifies the number of unique FPGAs when renting f1.2xlarge,
f1.4xlarge, and f1.16xlarge instances sixty times each. As these instance
types contain 1, 2, and 8 FPGA boards, respectively, DRAM PUF fingerprints are
measured on a total of 60 + 120 + 480 = 660 FPGAs (which contain repeated
FPGA boards due to re-allocation). Table 9.1 summarizes the number of unique
FPGAs seen on AWS, as indicated by the Jaccard indices of their DRAM PUFs.
The results indicate that only 10, 6, and 8 unique FPGA sets have been allocated for
each type.
Given that we observed the same FPGA multiple times, Fig. 9.6 plots the
probability of getting the same FPGA board in the North Virginia region, as a

Table 9.1 The number and type of FPGA instances rented, along with the number of unique sets
of FPGAs found and the approximate experimental cost using spot instances

Instance type    # of FPGAs    Unique FPGAs    Cost ($)
f1.2xlarge       60 × 1        10 × 1          3.47
f1.4xlarge       60 × 2         6 × 2          8.91
f1.16xlarge      60 × 8         8 × 8          83.16

Fig. 9.6 Probability of renting re-allocated FPGA boards for all three instance types and different
waiting periods. Although the figure only shows slot 0, the probability for all slots is identical, as
FPGA ordering does not change within instances

function of the amount of time between requests for two instances. As DRAM
PUFs are collected in sequence, the intervals between two adjacent measurements
are nearly identical. For a given time period t, all n pairs of measurements that are
(approximately) t minutes apart are used to calculate the probability p/n of renting
a re-allocated FPGA, where p denotes the number of pairs (out of n) for which the
Jaccard indices are bigger than 0.5. Please note that the number of pairs n varies
for different instance types and numbers of measurements remaining.
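The p/n estimate can be sketched as follows; `same_fpga` would be the Jaccard-threshold test from the previous section, and the tolerance parameter is an assumption that accounts for the small timing variations mentioned above:

```python
def reallocation_probability(measurements, gap, tol, same_fpga):
    """Estimate p/n for a given time gap.

    measurements: chronological list of (minutes, fingerprint) tuples.
    Pairs whose spacing is within `tol` of `gap` form the n candidate
    pairs; p counts those whose fingerprints match.
    """
    pairs = [(fa, fb)
             for i, (ta, fa) in enumerate(measurements)
             for tb, fb in measurements[i + 1:]
             if abs((tb - ta) - gap) <= tol]
    if not pairs:
        return None  # no measurement pairs at this gap
    p = sum(1 for fa, fb in pairs if same_fpga(fa, fb))
    return p / len(pairs)

# Three rentals, ~10 minutes apart; the first two hit the same board:
history = [(0, "fpA"), (10, "fpA"), (20, "fpB")]
assert reallocation_probability(history, gap=10, tol=1,
                                same_fpga=lambda a, b: a == b) == 0.5
```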
Although the probability appears random and hard to predict, it is non-zero for all
instance types most of the time and often close to 25–30% for 2x and 4x instances.
As a result, temporal covert channels [63] indeed seem possible: the attacker and
the victim end up on the same FPGA in consecutive time slots after about four tries
on average. For 16x instances, the probability is around 10%, requiring about ten
tries to get the same instance.
Figure 9.7, in particular, shows the results of renting 16x instances eleven
consecutive times. As can be seen in the figure, one set of FPGAs appears four
times, two sets appear twice, and three sets are allocated only once. Moreover, the
same eight FPGAs are re-allocated at once: in other words, by identifying that, e.g.,
DRAM D on slot 0 has stayed the same in INST-0 and INST-1, an adversary is able
to deduce that all eight FPGAs have stayed the same.

9.4.2.5 Monitoring Temperature Changes

We finally investigate whether one can infer patterns about the environmental
conditions of the data center in which we performed measurements. To that end,
we measure how the DRAM decay varies in a span of approximately three days. As
Fig. 9.8 reveals, the PUF behaves differently throughout the measurement period,

Fig. 9.7 Fingerprinting FPGAs on f1.16xlarge instances with 8 FPGA slots: out of 11 spot
instances, only 6 different sets of FPGAs are allocated. In the remaining instances, only 2 additional
sets were identified (Table 9.1)

Fig. 9.8 DRAM decay measured in the course of three days can reveal information about the data
center environmental conditions. The decay time of each measurement is 120 seconds. The
experiment was done on a spot f1.2xlarge instance

where the decay time of each measurement is 120 seconds. As DRAM decay varies
with temperature [68], these variations can give insights into the workloads and
operating conditions of the servers. For example, there may be a decrease in activity
at certain times in the day, allowing the data center to cool, and the DRAM PUF
to result in fewer errors. An attacker might use these insights to reason about data
center capacity and launch attacks on server availability [19, 27, 28].

9.4.3 RO PUF Design

In this section, we introduce an RO-based PUF design that fingerprints the FPGA
chips themselves instead of the DRAM modules attached to the FPGA boards. Our
PUF design introduces the idea of redundant ROs, which allows us to evaluate the
quality of each RO pair by pre-testing the PUF design on a smaller number of cloud
FPGA instances, and discarding “bad” RO pairs, which decrease the effectiveness
(uniqueness and reliability) of the PUF response.

Fig. 9.9 Diagram of the RO PUF on AWS F1 instances. The controller communicates with the
user’s VM running on the server over PCIe via the shell. The PUF consists of latch- or
flip-flop-based ROs, with each RO pair’s response processed through multiplexers, counters, and
a comparator

Our PUFs consist of 512 ROs, with each RO only used in one comparison
pair. The 256 resulting pairs then generate 256 bits for pre-testing with 40 FPGA
instances. After the pre-testing phase, 128 “good” bits (i.e., RO pairs) that generate
high entropy are chosen to be included in the final PUF responses, with the other 128
bits ignored entirely. As we show in Sect. 9.4.4 when testing with 160 instances, the
selection of high-entropy bits decreases the intra-device Hamming distance (HD)
of the PUF response, while increasing the inter-device HD. In other words, our
approach significantly improves the reliability and uniqueness of the RO PUFs by
finding the RO pairs that are stable within an FPGA but differ among FPGAs.
Figure 9.9 shows our RO PUF module, which consists of n = 512 ROs with
two multiplexers (MUXes) and two RO counters for comparing the RO pairs. The
m = n/2 = 256 RO pairs use adjacent ROs to eliminate systematic variations [39,
42], and each RO is only used once. The RO outputs drive the counters, which are
sampled on a system clock timer set by the software. To minimize noise, when an
RO pair is sampled, the other ROs are disabled. In pre-processing, m = 256 bits
are sent back each time and are used to identify the 128 good RO pairs. In later
experiments with more (and different) FPGAs, the same 128 good RO pairs that
were identified in the pre-processing stage are used to generate the PUF fingerprints.
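The pre-testing selection can be sketched as below. The scoring heuristic (inter-device diversity of the per-device majority bit, penalized by intra-device flips) is our simplified illustration of what makes an RO pair "good", not the chapter's exact criterion:

```python
def select_good_pairs(pretest, k=128):
    """Pick the k RO-pair bit positions that are stable within a
    device but diverse across devices.

    pretest: dict mapping device id -> list of repeated responses,
    each response a sequence of 0/1 bits (one bit per RO pair).
    """
    n_bits = len(next(iter(pretest.values()))[0])
    scores = []
    for i in range(n_bits):
        majors, flips = [], 0
        for reads in pretest.values():
            vals = [r[i] for r in reads]
            major = max(set(vals), key=vals.count)  # per-device majority bit
            majors.append(major)
            flips += sum(v != major for v in vals)  # intra-device noise
        # A bit that is identical across all devices carries no
        # fingerprinting information, so reward a balanced split.
        diversity = min(majors.count(0), majors.count(1))
        scores.append((diversity - flips, i))
    return sorted(i for _, i in sorted(scores, reverse=True)[:k])

# Toy pre-test with two devices, two reads each, four RO pairs:
pretest = {"dev0": [(0, 0, 1, 0), (0, 0, 1, 0)],
           "dev1": [(1, 0, 0, 0), (1, 0, 0, 1)]}
# Pairs 0 and 2 are stable and differ across devices; pair 1 is
# constant everywhere and pair 3 is noisy, so both are ignored.
assert select_good_pairs(pretest, k=2) == [0, 2]
```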
To evaluate the effect of temperature on the quality of the proposed RO PUFs, we
further add an RO Sensor and RO Heaters module to our design. This module allows
us to increase and observe the FPGA temperature, collecting PUF fingerprints at
different thermal states of the fabric, proving the stability of our design across
environmental conditions that might arise in a data center.

Fig. 9.10 Ignoring low-entropy bits (PUF-B) improves Uniqueness and maintains Reliability
compared to the baseline PUF-A. Temperature increases due to the heaters do not affect the
responses

9.4.4 RO PUF Evaluation

Our PUF generates 128-bit fingerprints out of 256 RO pairs. We validate our design
by calculating the Uniqueness (the average inter-device HD over responses from
different FPGAs) and the Reliability (the average intra-device HD over repeated
measurements from the same FPGA). Ideal PUFs have Uniqueness values of 0.5,
as their responses behave randomly, and Reliability values close to 0, as few bits
differ when re-querying the PUF.
Improving Uniqueness While Maintaining Reliability Previous RO PUF designs
usually compare m = n/2 RO pairs out of n ROs and output n/2 bits, but our design
eliminates low-entropy bits and instead produces n/4 = 128 bits. To evaluate the
quality of our PUF design, we compare the Uniqueness and the Reliability of the
baseline implementation (PUF-A), which uses all n/2 = 256 RO pair comparisons,
to our improved 128-bit PUF, PUF-B. Figure 9.10 shows that the Uniqueness of
PUF-B increases to ≈0.25 from ≈0.13 in PUF-A, while maintaining the Reliability
at almost the same value. In addition, Fig. 9.10 shows that the RO heaters do not
influence the Uniqueness and Reliability much, indicating that the RO PUFs are
stable at different temperatures.
Choosing Low-Entropy Bits To show the benefits of the novel idea to ignore but
not remove low-entropy RO pairs, we implement two additional types of PUFs:
PUF-C, which physically removes the “bad” ROs from the floorplan, and PUF-D,
which instead ignores randomly selected RO pairs. Table 9.2 contains a summary of
the four PUF designs. PUFs A, B, and D utilize 669 slices, while PUF-C uses 359 slices.
Figure 9.11a shows that although the Reliability remains almost the same for
all four PUF designs, the Uniqueness of PUF-B is much higher than that of the
remaining three PUFs. In particular, PUF-C suggests that re-routed logic will still
influence the entropy of the remaining RO pairs, i.e., the “good” bits from PUF-B
no longer remain good in PUF-C.

Table 9.2 Different RO PUF implementations

Design   Explanation                              # ROs   # Bits
PUF-A    Baseline PUF using 256 RO pairs          512     256
PUF-B    As A, but ignores low-entropy ROs        512     128
PUF-C    As B, but physically removes ROs         256     128
PUF-D    As B, but ignores pairs [1, 3, 5, ...]   512     128

Fig. 9.11 Uniqueness and reliability for the four setups of Table 9.2. (a) The best uniqueness
is achieved when ignoring but not removing low-entropy bits (PUF-B). (b) Uniqueness remains
constant when testing on additional FPGAs that were not used in pre-testing

Fig. 9.12 PUF-B Hamming distances (a) intra- and inter-device, and (b) with or without the RO
heaters enabled

Moreover, the locations of “good” bits remain stable across many FPGAs.
Although 40 FPGAs were used in pre-testing for PUF-B, Fig. 9.11b shows that the
Uniqueness of PUF-B stays stable when applying the PUF with the same “good”
bits to 160 different instances.
PUF Responses for Different FPGAs By storing PUF responses in a database,
users are able to infer (based on the HD) whether a given FPGA has been used
before or not: as Fig. 9.12 shows, the intra-device and inter-device HDs of N = 40
9 Fingerprinting and Mapping Cloud FPGA Infrastructures 255

Fig. 9.13 The Uniqueness and Reliability of PUF-B using two types of ROs at three FPGA
locations

new F1 FPGA instances (measured 20 times) are clearly separated. Even if the RO
heaters are turned on, the intra-device HD stays almost the same, and always under
10. By contrast, the inter-device HD ranges from about 20 to 50, so a threshold of 15
can separate PUF responses of the same device from those of different devices.
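The database lookup described above reduces to a threshold test on the Hamming distance; the following sketch (our own formulation, with the threshold of 15 taken from the measurements above) illustrates it:

```python
def seen_before(new_response: int, database: list[int], threshold: int = 15) -> bool:
    """Return True if the new PUF response is within `threshold` Hamming
    distance of any stored response.  In the chapter's measurements, the
    intra-device HD stays under 10 while the inter-device HD ranges from
    about 20 to 50, so a threshold of 15 separates the two cases."""
    def hd(a: int, b: int) -> int:
        return bin(a ^ b).count("1")
    return any(hd(new_response, stored) < threshold for stored in database)
```

For example, a re-queried response that differs in 3 bit positions matches its stored fingerprint, while a response from a different device, differing in 16 or more positions, does not.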

Different RO Types and Locations We implement both flip-flop-based RO PUFs
and latch-based RO PUFs, which can bypass AWS design rule checks against
combinatorial loops. Figure 9.13 compares the two RO types on three different
locations of the FPGA device. Both RO types work well on all three locations,
but the Uniqueness of latch-based PUFs is generally higher than that of flip-flop-based
PUFs. Differences in performance between the two ring oscillator types have been
identified in previous works [21, 22] and may be due to differences in the absolute
frequencies of the ROs [22].

9.4.5 Defense Strategies

In this section, we propose several countermeasures to prevent adversaries from
being able to fingerprint cloud FPGAs.
First, DRAM PUFs are possible because AWS currently retains DRAM data even
if the FPGA has been cleared, or an image without a memory controller is loaded.
In other words, although “DRAM Data retention is not supported for CL designs
with less than 4 DDRs enabled” [4], the DRAM data are not erased. Consequently,
clearing or refreshing the DRAM in either of these two cases would prevent our
fingerprinting approach. At the same time, it would still allow the intended use case
of the data retention feature, namely, sharing data between consecutively loaded
Amazon FPGA images.
Second, we disabled ECC to reliably identify the locations of bit flips and
measure the response of the DRAM PUF. Disabling ECC could be banned, but
at a cost of energy usage for designs that do not need it. Moreover, ECC is not
guaranteed to entirely prevent our fingerprints. For example, researchers have shown
that attacks using DRAM are possible even with error correction enabled [15].
Furthermore, introducing randomness at different layers of abstraction can raise
the bar for adversaries. Currently, our work can identify all eight FPGAs in an
f1.16xlarge instance by measuring the PUF behavior on a single DRAM
module (e.g., DRAM D) on one FPGA. However, software can randomize the order
of FPGAs within an instance as they appear to the user, or the way the DRAM
modules are presented to the FPGA. Moreover, DRAM address scrambling in the
memory controller can prevent the DRAM PUF from operating.
For the RO-based PUFs, fingerprinting is based on the existence of combinatorial
loops that can bypass AWS’s DRCs. Thus a more sophisticated and stricter set of
DRCs is able to prevent such a fingerprinting method. For example, researchers have
proposed an antivirus scan of FPGA bitstreams [32].
Although these approaches make power-based attacks harder, they cannot elimi-
nate temporal thermal channels (e.g., [63]). As such attacks only exploit temperature
effects, a mandatory cool-down period before re-assigning FPGAs can prevent
covert channels, even if adversaries successfully fingerprint the devices.

9.5 Cloud FPGA Cartography

In this section, we present the setup (Sect. 9.5.1) and evaluation (Sect. 9.5.2) of cloud
FPGA cartography using PCIe contention.

9.5.1 Experimental Setup

The experiments in this chapter are performed on FPGA F1 instances publicly
available through the AWS Elastic Compute Cloud (EC2) platform. Figure 9.14
shows the likely AWS server configuration based on public information and the
results of our experiments. In particular, a server can contain up to 8 FPGAs, split
over two NUMA nodes across two CPU sockets. Each server can accommodate
eight separate f1.2xlarge users (recall that each f1.2xlarge instance only
uses 1 FPGA), four f1.4xlarge instances with 2 FPGAs each, a combination of
f1.2xlarge and f1.4xlarge instances using at most 8 FPGAs in total, or a
single f1.16xlarge customer using all 8 FPGAs. Based on our assumption that
only 4 FPGAs share the same NUMA node, we expect that only interference within
a NUMA node will be measurable. This translates into up to four f1.2xlarge,
two f1.4xlarge, or one f1.4xlarge and two f1.2xlarge instances inter-
fering with each other.
To confirm the above server setup configuration, i.e., to show that exactly four
FPGAs can lead to mutual PCIe contention, we experiment with all three types

Fig. 9.14 Diagram of the deduced AWS server configuration, with 8 FPGAs sharing the same
server across two NUMA nodes

of F1 instances. Unless otherwise specified, experiments are primarily conducted
in the us-east-1 region, although Sect. 9.5.2.4 reproduces the experimental
results in all four AWS data center regions that offer FPGA instances, namely,
ap-southeast-2 (Sydney), eu-west-1 (Ireland), us-east-1 (North Vir-
ginia), and us-west-2 (Oregon). As explained in Sect. 9.5.2.4, our approach
works with both on-demand and spot instances.
Our setup involves testing with two FPGAs at a time, a memory stressor and a
memory tester. The tester repeatedly measures its PCIe bandwidth by writing from
the host VM to the FPGA DRAM, while the stressor similarly attempts to interfere
with the tester’s bandwidth by stressing its own PCIe connection. More precisely,
we use the (unmodified) CL_DRAM_DMA example FPGA image provided by the
AWS FPGA development kit [6] as the basis for both the PCIe tester and the PCIe
stressor designs. For the stressor, approximately 8.4 MB of data are moved between
the CPU and FPGA for each transfer, and the transfers are repeated 100 times, giving
a total transfer size of 840 MB. This transfer takes less than a second to complete.
The tester runs the same bandwidth measurement program but transfers 3.9 kB per
test, or 394 kB total.
The stressor and tester are run in parallel on separate instances in order to observe
if the tester bandwidth is affected (reduced) due to the stressor’s activity. If it is, we
conclude that the stressor and tester instances are on the same server and share the
same NUMA node. The tests are repeated multiple times with the same instances
to prevent false positives or negatives, e.g., due to memory activity from other,
unrelated users’ instances being on the same server. Finally, we additionally collect
the DRAM-based PUF fingerprints of the FPGA instances to more concretely map
the cloud infrastructure.
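The co-location test just described can be sketched as a small host-side harness. The sketch below is our own abstraction: `measure_tester_mbps` and `run_stressor` stand in for the host programs built on the CL_DRAM_DMA example, and in the real experiment the stressor and tester run in parallel on separate instances:

```python
import statistics

def detect_colocation(measure_tester_mbps, run_stressor, trials: int = 5,
                      threshold_mbps: float = 1000.0) -> bool:
    """Decide whether a tester and a stressor instance share a NUMA node.
    The test is repeated several times to guard against false positives
    or negatives caused by unrelated tenants' memory activity."""
    # Baseline: tester bandwidth with the stressor idle.
    baseline = [measure_tester_mbps() for _ in range(trials)]
    # Contended: tester bandwidth while the stressor saturates its PCIe link.
    contended = []
    for _ in range(trials):
        run_stressor()  # assumed to start the stressor's PCIe transfer loop
        contended.append(measure_tester_mbps())
    # Co-located if the stressor reliably pushes the tester's bandwidth
    # below the threshold while the idle baseline stays above it.
    return (statistics.median(baseline) > threshold_mbps
            and statistics.median(contended) < threshold_mbps)
```

The 1 GBps threshold is motivated by the measurements reported later (Sect. 9.5.2.7), where an idle tester sustains over 3 GBps and a single active stressor pushes it below 1 GBps.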

9.5.2 Evaluation

In this section, we perform a thorough evaluation of the cross-FPGA PCIe
contention in several geographical regions and VM instance types. Specifically, we first
analyze the impact of simultaneous memory accesses over PCIe in f1.16xlarge
instances, showing that FPGAs in slots 0–3 and 4–7 interfere with each other, and
conclude that they form separate NUMA nodes (Sect. 9.5.2.1). We then experiment
with dozens of f1.2xlarge (Sect. 9.5.2.2) and f1.4xlarge (Sect. 9.5.2.3)
instances across different data center regions (Sect. 9.5.2.4) to determine that
successive instances often (but not always) belong to the same NUMA locality,
for both spot and on-demand instance types. We use our results to calculate the
probability of renting FPGAs in the same server (Sect. 9.5.2.5) and fingerprint
the FPGAs (Sect. 9.5.2.6), showing overlap among instance types, in contrast to
prior work. We finally discuss practical aspects of our reverse-engineering approach
(Sect. 9.5.2.7).

9.5.2.1 Determining NUMA Localities

Based on the background presented in Sect. 9.2.1, we expect each F1 server to
contain 2 CPUs, each of which has 4 FPGAs in its locality, for a maximum
of 8 FPGAs per server. We verify this in several ways. First, by running the
numactl --hardware command [33], we determine that f1.16xlarge
instances produce 2 nodes compared to smaller instances, which show the
output 1 nodes (sic). The results of the Linux command lscpu [36] are
similar, and commands such as hwloc-info [35] or lstopo -p [37]
also only reference two NUMANodes in the f1.16xlarge case. Although
information about physical hardware in VMs is not always reliable (e.g.,
/sys/bus/pci/devices/<id>/numa_node returns -1 for the various
PCIe slot identifiers), an application note by AWS also confirms that FPGAs in
slots 0–3 and 4–7 are separate “groups” (i.e., NUMA nodes) [7], with FPGAs
within a group being able to “directly access other FPGAs within the same group,”
while an access “between groups is not direct and not optimal (higher latency, lower
bandwidth)” [7].
With the above information supporting our initial assumptions, we rent five
f1.16xlarge spot instances, containing a total of 40 FPGAs. We use each of
those FPGAs individually as a stressor and consecutively (one by one) test the effect
on all FPGAs (including the stressor) as testers, for a total of 40 × 40 = 1,600 data
points. We repeat measurements twice, with the results of both repetitions being
identical: as shown in Fig. 9.15, we only find contention between (instance, slot)
FPGA pairs (i1, s1) and (i2, s2) if and only if i1 = i2 and either 0 ≤ s1, s2 ≤ 3 or
4 ≤ s1, s2 ≤ 7. In other words, there is no overlap between separate f1.16xlarge
instances or cross-NUMA effects, and the graph is symmetric, i.e., the roles of
stressors and testers are interchangeable.
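The contention pattern just described reduces to a simple predicate over (instance, slot) pairs, sketched here in our own formulation:

```python
def same_numa_node(i1: int, s1: int, i2: int, s2: int) -> bool:
    """PCIe contention between FPGA (instance, slot) pairs (i1, s1) and
    (i2, s2) occurs if and only if both FPGAs are in the same
    f1.16xlarge instance and in the same group of four slots
    (0-3 or 4-7), i.e., the same NUMA node."""
    return i1 == i2 and (s1 // 4) == (s2 // 4)
```

Of the 40 × 40 = 1,600 stressor/tester pairs in the experiment, exactly 5 · 8 · 4 = 160 satisfy this predicate, matching the block-diagonal pattern of Fig. 9.15.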

Fig. 9.15 PCIe contention analysis for all eight FPGA slots in five f1.16xlarge instances
rented concurrently in the us-east-1 region. A red square denotes that the stressor (y-axis)
interferes with the tester (x-axis), as determined by observing that the PCIe bandwidth is reduced
below a target threshold. As can be seen, contention exists among groups of exactly 4 consecutive
slots within a single VM instance. In this figure, each f1.16xlarge instance consists of slots 0–7

9.5.2.2 Cross-VM PCIe Contention

In this section, we rent 20 f1.2xlarge spot instances in the us-east-1 region,
collecting 20 × 20 = 400 bandwidth measurements for each pair of possible stressor
and tester combinations. We repeat experiments five times, with identical results for
each repetition, again proving that our bandwidth-based metric is a robust way of
detecting cross-instance contention. The results are plotted in Fig. 9.16 in two ways.
First, Fig. 9.16a shows the results of our measurements, with instances numbered in
the chronological order in which they are launched. Second, Fig. 9.16b re-orders the
instances so that FPGA pairs with contention are plotted adjacent to each other.
As in the f1.16xlarge case, the resulting matrix is symmetric, since the
contention is bi-directional. More importantly, otherwise-independent VM instances
affect each other in a measurable way. Specifically, within the 20 instances launched,
we find two groups of full NUMA nodes (i.e., four FPGAs), three groups of three
FPGAs, one of two FPGAs, and only one FPGA without any contention, likely
because it only has two instances launched after it. FPGAs within the same NUMA
node are occasionally returned one after the other (e.g., instances 11–15) but are
also sometimes interspersed with other NUMA nodes (e.g., instances {4, 6, 8, 10}).
We calculate the probability of renting another instance in the same NUMA node in
Sect. 9.5.2.5.

9.5.2.3 Contention Between f1.4xlarge Instances

To evaluate whether our observations also hold for f1.4xlarge instances, we rent
20 f1.4xlarge spot instances in us-east-1, for a total of 40 × 40 = 1,600


Fig. 9.16 Cross-VM contention between f1.2xlarge instances in the us-east-1 region: (a)
presents the instances in the order in which they are launched, while (b) re-orders them to more
clearly show pairs with PCIe contention. In this figure, each f1.2xlarge instance has one slot

stressor and tester pairs, repeating measurements three times. The results, which are
shown in Fig. 9.17, indicate that contention is still possible both within and between
different f1.4xlarge instances: we find 7 pairs of distinct instances that form
complete NUMA nodes with 4 FPGAs. We again notice that co-located instances
tend not to be fully consecutive but are instead interspersed with other instances.
However, unlike the results of Sect. 9.5.2.2, where the lone FPGA was rented near
the end of the 20 instances, all six lone f1.4xlarge instances are among the
first 10 instances launched, with the first five corresponding to instances 1–5. In
other words, in our experiments, it was almost always possible to find contention
in f1.2xlarge instances provided enough instances were launched after them.
However, the first few f1.4xlarge instances were not co-located with any other
instances (possibly because other users had already rented their counterparts), while
later VMs were more likely to be co-located in the same server.

9.5.2.4 Data Center Regions and On-Demand Instances

As the previous experiments were conducted with spot instances in the us-east-1
region, we perform measurements with spot instances in the us-west-2
(Fig. 9.18) and eu-west-1 (Fig. 9.19) regions, as well as with on-demand
instances in the ap-southeast-2 (Fig. 9.20) and us-east-1 (Fig. 9.21)
regions, with experiments repeated once with 20 and 10 FPGAs, respectively.
It should be noted that we could not find availability for spot instances in the
ap-southeast-2 region. In addition, the pre-synthesized CL_DRAM_DMA AFI


Fig. 9.17 Cross-VM contention between f1.4xlarge instances with 2 FPGAs each in the
us-east-1 region. In this test, 14 of the 20 launched instances are co-located in 7 NUMA nodes,
while the remaining 6 are not co-located with any of the other instances. Recall that f1.4xlarge
VMs contain two FPGAs, so the two slots within an instance always interfere with each other,
explaining why there are always at least two red squares per row and column in the figure. In this
figure, each f1.4xlarge instance consists of slots 0–1

Fig. 9.18 Example test results for cross-VM contention between f1.2xlarge spot instances in
the us-west-2 region. (a) Original. (b) Re-ordered

was not available in this region, so we synthesized it using its publicly available
source code.
The results are broadly similar to the experiments of Sect. 9.5.2.2, with only 10%
(6/60) of instances not resulting in cross-VM contention. Indeed, seven complete
NUMA localities are identified, along with six groups of three FPGAs and four
pairs of two VMs. It is further interesting to note that, in our experiments, groups in
262 S. Tian et al.

Fig. 9.19 Example test results for cross-VM contention between f1.2xlarge spot instances in
the eu-west-1 region. (a) Original. (b) Re-ordered

Fig. 9.20 Cross-VM contention between f1.2xlarge on-demand instances in the
ap-southeast-2 region. (a) Original. (b) Re-ordered

us-west-2 almost always appeared in succession, while groups in other regions
were more disjointed. More experiments could determine whether this pattern was
due to user demand and usage characteristics (which may differ among regions)
or whether different instance allocation strategies apply to the various data centers,
instance types, or times of the day or week.

9.5.2.5 Probability of Co-location

Having uncovered that up to four f1.2xlarge instances can interfere with each
other, we further analyze the probability of the f1.2xlarge instances being
co-located on the same server. Specifically, we use the data gathered in previous
sections to calculate the probability P_K that, given a tester FPGA, launching K
stressor instances will result in at least one of the K new FPGAs being placed in the
same NUMA node as the tester FPGA.
Let N_K be the number of VMs for which there is PCIe contention with any
of the next K instances launched. Moreover, let M_K be the number of VMs that
have fewer than K instances launched after them and that are not the last in a
full NUMA node detected within the experiment. The second constraint is needed
because, although for groups of up to 3 FPGAs the remaining FPGAs might be
found by renting more instances, if the NUMA node has been fully detected, no
additional VM will correspond to the same locality, no matter how many instances
are launched. Denoting by T the total number of VMs launched, the desired
probability is
probability is

P_K = N_K / (T − M_K)    (9.2)

As an example using the on-demand instances of Fig. 9.21, T = 10, and for
K = 3, M_K = 3 (instances 8–10 have ≤2 VMs launched after them), while N_K =
3 (instances 1, 4, and 7 are in the same NUMA nodes as instances 3, 5, and 10,
respectively), so P_3 = 3/(10 − 3) ≈ 43%.
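Equation (9.2) and this worked example can be checked in a few lines (the function name is our own):

```python
def co_location_probability(n_k: int, t: int, m_k: int) -> float:
    """P_K = N_K / (T - M_K), Eq. (9.2): the probability that at least
    one of the next K stressor instances lands in the same NUMA node
    as a given tester FPGA."""
    return n_k / (t - m_k)

# Worked example from the text: T = 10, K = 3, M_K = 3, N_K = 3.
p3 = co_location_probability(3, 10, 3)  # 3/7, i.e., about 43%
```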

Fig. 9.21 Cross-VM contention between f1.2xlarge on-demand instances in us-east-1.
(a) Original. (b) Re-ordered



Fig. 9.22 Probability P_K of finding another f1.2xlarge instance in the same NUMA node
(overall and for individual data centers), as a function of the number of FPGA instances launched K

Figure 9.22 calculates P_K for 1 ≤ K ≤ 10, both for individual regions and over
all regions tested. Depending on the region, the probability that two consecutive
VMs are co-located (i.e., K = 1) ranges between 38 and 58% (with the exception
of ap-southeast-2), while renting just one more instance can increase this
probability by approximately 10 percentage points (or 55 points for the Sydney
region). Renting even more instances increases this probability further to about 80%
for K = 10, in part due to the smaller number of instances that have at least K
FPGAs launched after them (i.e., a larger M_K).
Note that for sufficiently large T and K, we expect P_K = 75%, as there is a 1 in
4 chance that the sensor instance is the last FPGA in its NUMA node. However, for
smaller T and K, P_K can be larger (as in Fig. 9.22), since Amazon does not always
fully pack consecutive instances within a single server: as the previous sections
showed, co-located instances are often launched further apart in time.

9.5.2.6 Overlap Between Instance Types

In this section, we use the DRAM PUFs of Sect. 9.4 to fingerprint individual FPGAs
and detect overlaps between experiments repeated on different days. Specifically, we
make additional measurements from the us-east-1 region and compare the PUF
fingerprints between spot and on-demand f1.2xlarge instances in availability
zone c and spot f1.2xlarge, f1.4xlarge, and f1.16xlarge instances in
zone e, all collected on different days.
We reach two main conclusions. First, there is an overlap between spot and on-
demand VMs, and, second, there is overlap between all three instance types. For

Fig. 9.23 Sample PUF fingerprints from a pair of overlapping FPGAs between (a) f1.2xlarge
and (b) f1.16xlarge instances. The two PUFs and their (c) bitwise AND are almost identical.
(a) PUF A (2x); (b) PUF B (16x); (c) PUF A AND B

Fig. 9.24 Sample PUF fingerprints from non-overlapping FPGAs between (a) f1.2xlarge
and (b) f1.16xlarge instances. The two PUFs are distinct, and their (c) bitwise AND is empty.
(a) PUF A (2x); (b) PUF B (16x); (c) PUF A AND B

example, instances 7 and 17 of Fig. 9.16 overlap with slots 0 and 1 of instance 3 in
Fig. 9.15. Figure 9.23 presents extracts from the PUF fingerprints for one of the two
pairs of overlapping FPGAs. The FPGAs in the f1.2xlarge and f1.16xlarge
instances had 419 and 461 DRAM bit flips, respectively, of which 419 bit flip
locations (DRAM addresses) were identical. This allows us to conclude that the two
instances correspond to the same underlying FPGA hardware. By contrast, Fig. 9.24
shows the same f1.2xlarge instance along with a different FPGA slot of the
f1.16xlarge instance. This FPGA has a fingerprint with 483 bit flips, of which
none are in the same locations as those in the f1.2xlarge instance. As a result,
these two FPGAs are distinct, as expected.
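The matching logic behind Figs. 9.23 and 9.24 can be expressed over sets of bit-flip addresses. In the sketch below (our own formulation), the 0.5 overlap threshold is illustrative and not taken from the chapter; the measured cases are far from the boundary (419 of 419/461 shared flips for the same device, zero for distinct devices):

```python
def same_fpga(flips_a: set[int], flips_b: set[int],
              min_overlap: float = 0.5) -> bool:
    """Compare two decay-based DRAM PUF fingerprints, each given as the
    set of DRAM addresses at which bit flips were observed.  Overlapping
    FPGAs share almost all flip locations, while distinct FPGAs share
    none, so any sizeable intersection identifies the same device."""
    common = len(flips_a & flips_b)
    return common / min(len(flips_a), len(flips_b)) >= min_overlap
```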
In addition, we found an overlap between instances 1, 2, and 9 of Fig. 9.16
and another set of 10 f1.4xlarge instances rented. Consequently, not only
is there the potential for cross-VM contention between identical spot instance
types, but also between f1.2xlarge and f1.4xlarge spot and on-demand
instances (but not f1.16xlarge ones, as they reserve all FPGAs within the
server). It should be noted that there was no overlap between different instance
types in the experiments of Sect. 9.4.2, likely because AWS often re-used the same
instances, resulting in only 10 unique f1.2xlarge, 6 unique f1.4xlarge,
and 8 unique f1.16xlarge instances, for a total of 86 FPGAs. By contrast, the
experiments of this section found an overlap of just two FPGAs in a pool of 20
f1.2xlarge and 5 f1.16xlarge instances, three FPGAs between the same set
of 20 f1.2xlarge and an additional 10 f1.4xlarge instances, and no overlap
between the ten f1.4xlarge and five f1.16xlarge instances, for a total of
20 · 1 + 10 · 2 + 5 · 8 − 2 − 3 = 75 unique FPGAs. This suggests that it is rare
(but not impossible) for FPGAs to be repurposed across instance types, likely
primarily in cases of unmet demand for smaller F1 types.

Fig. 9.25 Median tester bandwidth (MB/s) for different numbers of enabled stressors (0–3) and
transfer sizes (16.4 kB, 262.1 kB, and 4194.3 kB); bandwidth averaged over 100 transfers

9.5.2.7 Practical Considerations

The presented infrastructure mapping attack is inexpensive to deploy, and anyone
with access to the public cloud FPGA infrastructure can perform it, using just the
utilities provided by AWS. In fact, it remains both easy and cheap to find instances
in the same NUMA node: less than a minute is needed to determine whether the
given stressor is co-located with any of the testers.
Another important aspect to consider is whether detecting contention is possible
in the presence of interference from other tenants. To address this issue, we rent
an f1.16xlarge instance and investigate simultaneous transmissions from 0–3
stressor FPGAs to the last one in the NUMA node. We experiment with different
transfer sizes, i.e., bytes moved from the DRAM to the stressor, and average the
calculated bandwidth over 100 transfers. The results, summarized in Fig. 9.25, show
that the tester bandwidth is highest when no stressors are enabled, at over 3 GBps.
When a stressor is enabled (at any transfer size), the bandwidth quickly drops below
1 GBps, which is why we can effectively detect co-located FPGAs with a simple
threshold. When a second stressor FPGA is enabled, the bandwidth drops even
lower, to 128–190 MBps (depending on the transfer size), and remains approximately
the same when the final FPGA in the NUMA node acts as a third stressor.
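These bandwidth levels suggest that the number of active stressors can itself be estimated from the tester's measurement; the cut-offs in the sketch below are read off the reported ranges and are approximate:

```python
def estimate_stressors(bandwidth_mbps: float) -> str:
    """Rough stressor-count estimate from the tester's median bandwidth,
    using the levels reported for Fig. 9.25: over 3 GBps with no stressor,
    under 1 GBps with one, and 128-190 MBps with two or three."""
    if bandwidth_mbps > 3000:
        return "0 stressors"
    if bandwidth_mbps > 190:
        return "1 stressor"
    return "2-3 stressors"
```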
As a result, it is still possible to reverse-engineer the infrastructure and detect
co-location in the presence of traffic generated by a single external user, provided
that the bandwidth sustained by that user remains the same during the measurement
period. In other words, the tester threshold needs to be adjusted to account for
the additional external traffic, e.g., from 2.0 to 0.5 GBps. Moreover, the adversary
can completely bypass any effects from third parties by renting f1.4xlarge
instances, thereby using all resources in a NUMA node. In fact, doing so allows
them to more quickly map the infrastructure, but at a higher cost.

9.6 Related Work

In this section, we summarize prior work in FPGA security (Sect. 9.6.1) and cloud-
related attacks (Sect. 9.6.2).

9.6.1 Remote FPGA Attacks

In recent years, besides attacks on the FPGA bitstream itself, e.g., [17], there has
been extensive research on FPGA security without physical access to the underlying
hardware, with covert-channel, side-channel, and fault attacks predominantly using
voltage or temperature to affect the FPGA chips [30, 63]. Many such works,
e.g., [20, 31, 46, 70], focus on attacks between different users of the FPGA and
are therefore not directly applicable to single-tenant clouds.
Although most attacks have been performed in lab environments, a covert-
channel attack between separate dies (“Super Logic Regions”) was shown to be
possible on AWS and Huawei cloud [22], and a side-channel attack on AWS by
Glamocanin et al. soon followed [25]. The former depends on alternative ring
oscillator designs that bypass AWS restrictions [21, 59], while the latter uses
time-to-digital converters (TDCs), both of which could be detected by additional
DRCs [32]. By contrast, our research does not focus on the FPGA chip, but on
the cloud FPGA infrastructure, and in particular the unique aspects of the DRAM
modules as well as the shared PCIe bus used by different FPGA boards within each
server.

9.6.2 Cloud Security

Since the initial public deployments of cloud computing infrastructures, researchers
have looked at ways to attack (and improve) their security. In early Amazon
EC2 cloud architectures involving only CPUs, researchers quickly showed how to
reverse-engineer the infrastructure and place an attacker VM on the same server as
the victim VM [51]. The main method for doing so was through internal IP addresses
and network latencies, which exposed sufficient information about the underlying
setup for effectively 0% false positives, and a 40% chance of VM co-location on
the same server and CPU in the AWS EC2 cloud [51]. The false positive rate and
the co-location probability are both comparable to (but lower than) those in our work
focusing on PCIe and FPGAs.
There is also a body of research that has demonstrated that contention of I/O
resources (hard-drive throughput and network bandwidth, for example) is also a
potential security vulnerability in cloud infrastructures such as EC2. For example, it
is possible to cause performance degradation of co-located instances [14, 47]. Also,

Richter et al. showed that when virtualizing PCIe network interface cards (NICs)
using single root I/O virtualization (SR-IOV), it is feasible for one VM to cause
congestion on the NIC ingress buffers [49]. As a possible defense, Richter et al.
recommend quality-of-service (QoS) extensions and different scheduling algorithms
to ensure that flooding a virtual function (VF) in one physical function (PF) cannot
cause performance degradation in a different PF [50].
Our work further advances research in similar areas, as we have shown how
to determine co-location and aspects of the scheduling algorithm using PCIe
contention in FPGA-accelerated clouds. Building on our work, attacks that
intentionally degrade PCIe bandwidth, or further studies of how interference can
affect our new attack or disrupt other users, are natural extensions.

9.7 Conclusion

This chapter focused on how to fingerprint cloud FPGAs and map the cloud FPGA
infrastructure itself, without damaging it.
We first introduced a novel algorithm for fingerprinting cloud FPGAs through
decay-based DRAM PUFs. Because it is not possible for users to directly control the
memory self-refresh parameters, we made use of a feature that enables data sharing
between different AFIs: by using two AFIs, one that disables memory scrubbing
and ECC, and one that does not instantiate memory controllers at all, we were able
to observe how DRAM decays, without violating any restrictions placed by AWS.
In addition, we described the design and evaluation of RO PUFs that can bypass
the AWS design rule checks on combinatorial loops. The PUFs created resulted in
unique and stable fingerprints of FPGAs in AWS F1 FPGAs.
This chapter further identified PCIe contention as a means of mapping cloud
FPGA infrastructures. We showed that it is possible to reverse-engineer the NUMA
locality of different FPGAs within an AWS server and find which f1.2xlarge
and f1.4xlarge instances are co-located within the same server. We also found
that f1.2xlarge and f1.4xlarge instance types can be scheduled on the same
AWS server, and we deduced that the probability of successive users renting FPGAs
within the same server is high.
Overall, this chapter highlighted the dangers of direct accesses to hardware
resources and a need for a more holistic approach to FPGA security that not
only considers the FPGA chips themselves, but also the security of other system
components accessible to the logic running on the FPGA.

Acknowledgment This work was supported in part by NSF grant 1901901.


9 Fingerprinting and Mapping Cloud FPGA Infrastructures 269

References

1. Alibaba Cloud (2023). Elastic Compute Service: Instance Type Families. https://www.
alibabacloud.com/help/en/elastic-compute-service/latest/instance-family#f3. Accessed May
1, 2023.
2. Amazon Web Services (2016). Developer preview – EC2 instances (F1) with pro-
grammable hardware. https://aws.amazon.com/blogs/aws/developer-preview-ec2-instances-
f1-with-programmable-hardware/. Accessed May 1, 2023.
3. Amazon Web Services (2018). The agility of F1: Accelerate your applications with custom
compute power. https://d1.awsstatic.com/Amazon_EC2_F1_Infographic.pdf. Accessed May
1, 2023.
4. Amazon Web Services (2021). AWS EC2 FPGA HDK+SDK errata. https://github.com/aws/
aws-fpga/blob/master/ERRATA.md. Accessed May 1, 2023.
5. Amazon Web Services (2021). AWS shell interface specification. https://github.com/aws/aws-
fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md. Accessed May 1, 2023.
6. Amazon Web Services (2021). CL_DRAM_DMA custom logic example. https://github.com/
aws/aws-fpga/tree/master/hdk/cl/examples/cl_dram_dma. Accessed May 1, 2023.
7. Amazon Web Services (2021). F1 FPGA application note: How to use the PCIe peer-
2-peer version 1.0. https://github.com/awslabs/aws-fpga-app-notes/tree/master/Using-PCIe-
Peer2Peer. Accessed May 1, 2023.
8. Amazon Web Services (2022). Amazon machine images (AMI). https://github.com/awsdocs/
amazon-ec2-user-guide/blob/master/doc_source/AMIs.md. Accessed May 1, 2023.
9. Amazon Web Services (2023). Amazon EC2 instance types. https://aws.amazon.com/ec2/
instance-types/. Accessed May 1, 2023.
10. Amazon Web Services (2023). Amazon EC2 spot instances pricing. https://aws.amazon.com/
ec2/spot/pricing/. Accessed May 1, 2023.
11. Amazon Web Services (2023). AWS EC2 spot instances. https://aws.amazon.com/ec2/spot/.
Accessed May 1, 2023.
12. Baidu Cloud (2023). FPGA cloud compute. https://cloud.baidu.com/product/fpga.html.
Accessed May 1, 2023.
13. Baker, G., & Lupo, C. (2017). TARUC: A topology-aware resource usability and contention
benchmark. In ACM/SPEC International Conference on Performance Engineering (ICPE).
14. Chiang, R. C., Rajasekaran, S., Zhang, N., & Huang, H. H. (2015). Swiper: Exploiting
virtual machine vulnerability in third-party clouds with competition for I/O resources. IEEE
Transactions on Parallel and Distributed Systems (TPDS), 26(6), 1732–1742.
15. Cojocar, L., Razavi, K., Giuffrida, C., & Bos, H. (2019). Exploiting correcting codes: On the
effectiveness of ECC memory against Rowhammer attacks. In IEEE Symposium on Security
and Privacy (S&P).
16. Danalis, A., Marin, G., McCurdy, C., Meredith, J. S., Roth, P. C., Spafford, K., Tipparaju, V.,
& Vetter, J. S. (2010). The scalable heterogeneous computing (SHOC) benchmark suite. In
Workshop on General-Purpose Processing on Graphics Processing Units (GPGPU).
17. Ender, M., Moradi, A., & Paar, C. (2020). The unpatchable silicon: A full break of the bitstream
encryption of Xilinx 7-Series FPGAs. In USENIX Security Symposium.
18. Faraji, I., Mirsadeghi, S. H., & Afsahi, A. (2016). Topology-aware GPU selection on multi-
GPU nodes. In IEEE International Parallel and Distributed Processing Symposium Workshops
(IPDPSW).
19. Gao, X., Xu, Z., Wang, H., Li, L., & Wang, X. (2018). Reduced cooling redundancy: A
new security vulnerability in a hot data center. In Network and Distributed Systems Security
Symposium (NDSS).
20. Giechaskiel, I., Rasmussen, K. B., & Eguro, K. (2018). Leaky wires: Information leakage and
covert communication between FPGA long wires. In ACM Asia Conference on Computer and
Communications Security (ASIACCS).
270 S. Tian et al.

21. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Measuring long wire leakage with ring
oscillators in cloud FPGAs. In International Conference on Field Programmable Logic and
Applications (FPL).
22. Giechaskiel, I., Rasmussen, K. B., & Szefer, J. (2019). Reading between the dies: Cross-SLR
covert channels on multi-tenant cloud FPGAs. In IEEE International Conference on Computer
Design (ICCD).
23. Giechaskiel, I., Tian, S., & Szefer, J. (2021). Cross-VM information leaks in FPGA-accelerated
cloud environments. In IEEE International Symposium on Hardware Oriented Security and
Trust (HOST).
24. Giechaskiel, I., Tian, S., & Szefer, J. (2022). Cross-VM covert- and side-channel attacks in
cloud FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 16(1),
1–29.
25. Glamočanin, O., Coulon, L., Regazzoni, F., & Stojilović, M. (2020). Are cloud FPGAs really
vulnerable to power analysis attacks? In Design, Automation & Test in Europe Conference &
Exhibition (DATE).
26. Hajimiri, A., Limotyrakis, S., & Lee, T. H. (1999). Jitter and phase noise in ring oscillators.
IEEE Journal of Solid-State Circuits (JSSC), 34(6), 790–804.
27. Islam, M. A., & Ren, S. (2018). Ohm’s law in data centers: A voltage side channel for timing
power attacks. In ACM Conference on Computer and Communications Security (CCS).
28. Islam, M. A., Ren, S., & Wierman, A. (2017). Exploiting a thermal side channel for power
attacks in multi-tenant data centers. In ACM Conference on Computer and Communications
Security (CCS).
29. Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et
du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.
30. Jin, C., Gohil, V., Karri, R., & Rajendran, J. (2020). Security of cloud FPGAs: A survey. https://
arxiv.org/abs/2005.04867. Accessed May 1, 2023.
31. Krautter, J., Gnad, D. R. E., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault
attacks on shared FPGAs, suitable for DFA on AES. Transactions on Cryptographic Hardware
and Embedded Systems (TCHES), 2018(3), 44–68.
32. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 13(3), 1–31.
33. Lameter, C. (2013). NUMA (non-uniform memory access): An overview. ACM Queue, 11(7),
40–51.
34. Li, C., Sun, Y., Jin, L., Xu, L., Cao, Z., Fan, P., et al. (2019). Priority-based PCIe scheduling for
multi-tenant multi-GPU systems. IEEE Computer Architecture Letters (LCA), 18(2), 157–160.
35. Linux man page: hwloc(7) (2023). https://linux.die.net/man/7/hwloc. Accessed May 1,
2023.
36. Linux man page: lscpu(1) (2023). https://linux.die.net/man/1/lscpu. Accessed May 1, 2023.
37. Linux man page: lstopo(1) (2023). https://linux.die.net/man/1/lstopo. Accessed May 1,
2023.
38. Lutz, T., Fensch, C., & Cole, M. (2013). PARTANS: An autotuning framework for stencil com-
putation on multi-GPU systems. ACM Transactions on Architecture and Code Optimization
(TACO), 9(4), 1–24.
39. Maiti, A., & Schaumont, P. (2011). Improved ring oscillator PUF: An FPGA-friendly secure
primitive. Journal of Cryptology, 24(2), 375–397.
40. Martinasso, M., Kwasniewski, G., Alam, S. R., Schulthess, T. C., & Hoefler, T. (2016).
A PCIe congestion-aware performance model for densely populated accelerator servers. In
International Conference for High Performance Computing, Networking, Storage and Analysis
(SC).
41. McCurdy, C., & Vetter, J. (2010). Memphis: Finding and fixing NUMA-related performance
problems on multi-core platforms. In IEEE International Symposium on Performance Analysis
of Systems & Software (ISPASS).
42. Merli, D., Stumpf, F., & Eckert, C. (2010). Improving the quality of ring oscillator PUFs on
FPGAs. In Workshop on Embedded Systems Security (WESS).
43. Microsoft Research (2017). Microsoft unveils Project Brainwave for real-time AI. https://www.
microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/. Accessed May 1,
2023.
44. Molka, D., Hackenberg, D., Schöne, R., & Müller, M. S. (2009). Memory performance
and cache coherency effects on an Intel Nehalem multiprocessor system. In International
Conference on Parallel Architectures and Compilation Techniques (PACT).
45. Neugebauer, R., Antichi, G., Zazo, J. F., Audzevich, Y., López-Buedo, S., & Moore, A. W.
(2018). Understanding PCIe performance for end host networking. In ACM Special Interest
Group on Data Communication (SIGCOMM).
46. Provelengios, G., Ramesh, C., Patil, S. B., Eguro, K., Tessier, R., & Holcomb, D. (2019).
Characterization of long wire data leakage in deep submicron FPGAs. In ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays (FPGA).
47. Pu, X., Liu, L., Mei, Y., Sivathanu, S., Koh, Y., & Pu, C. (2010). Understanding performance
interference of I/O workload in virtualized cloud environments. In IEEE International
Conference on Cloud Computing (CLOUD).
48. Rahmati, A., Hicks, M., Holcomb, D. E., & Fu, K. (2015). Probable cause: The deanonymizing
effects of approximate DRAM. In Annual International Symposium on Computer Architecture
(ISCA).
49. Richter, A., Herber, C., Wild, T., & Herkersdorf, A. (2015). Denial-of-service attacks on PCI
passthrough devices: Demonstrating the impact on network- and storage-I/O performance.
Journal of Systems Architecture, 61(10), 592–599.
50. Richter, A., Herber, C., Wild, T., & Herkersdorf, A. (2016). Resolving performance interfer-
ence in SR-IOV setups with PCIe Quality-of-Service extensions. In Euromicro Conference on
Digital System Design (DSD).
51. Ristenpart, T., Tromer, E., Shacham, H., & Savage, S. (2009). Hey, you, get off of my
cloud: Exploring information leakage in third-party compute clouds. In ACM Conference on
Computer and Communications Security (CCS).
52. Rosenblatt, S., Chellappa, S., Cestero, A., Robson, N., Kirihata, T., & Iyer, S. S. (2013). A
self-authenticating chip architecture using an intrinsic fingerprint of embedded DRAM. IEEE
Journal of Solid-State Circuits (JSSC), 48(11), 2934–2943.
53. Rosenblatt, S., Fainstein, D., Cestero, A., Safran, J., Robson, N., Kirihata, T., & Iyer, S. S.
(2013). Field tolerant dynamic intrinsic chip ID using 32 nm high-K/metal gate SOI embedded
DRAM. IEEE Journal of Solid-State Circuits (JSSC), 48(4), 940–947.
54. Schaa, D., & Kaeli, D. (2009). Exploring the multiple-GPU design space. In IEEE Interna-
tional Parallel and Distributed Processing Symposium Workshops (IPDPSW).
55. Schaller, A., Xiong, W., Anagnostopoulos, N. A., Saleem, M. U., Gabmeyer, S., Katzenbeisser,
S., & Szefer, J. (2017). Intrinsic Rowhammer PUFs: Leveraging the Rowhammer effect for
improved security. In IEEE International Symposium on Hardware Oriented Security and Trust
(HOST).
56. Schaller, A., Xiong, W., Anagnostopoulos, N. A., Saleem, M. U., Gabmeyer, S., Skoric,
B., et al. (2018). Decay-based DRAM PUFs in commodity devices. IEEE Transactions on
Dependable and Secure Computing (TDSC), 16(3), 462–475.
57. Solomon, R. (2014). PCI express basics & background. https://pcisig.com/sites/default/files/
files/PCI_Express_Basics_Background.pdf. Accessed May 1, 2023.
58. Spafford, K., Meredith, J. S., & Vetter, J. S. (2011). Quantifying NUMA and contention effects
in multi-GPU systems. In Workshop on General-Purpose Processing on Graphics Processing
Units (GPGPU).
59. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters, 55(11),
640–642.
60. Sutar, S., Raha, A., & Raghunathan, V. (2016). D-PUF: An intrinsically reconfigurable
DRAM PUF for device authentication in embedded systems. In International Conference on
Compilers, Architecture, and Synthesis for Embedded Systems (CASES).
61. Tencent Cloud (2023). FPGA cloud server. https://cloud.tencent.com/product/fpga. Accessed
May 1, 2023.
62. Texas Advanced Computing Center (2015). TACC to launch new Catapult system to researchers
worldwide. https://web.archive.org/web/20211215201222/https://www.tacc.utexas.edu/-/tacc-
to-launch-new-catapult-system-to-researchers-worldwide. Accessed May 1, 2023.
63. Tian, S., & Szefer, J. (2019). Temporal thermal covert channels in cloud FPGAs. In
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
64. Wang, X., Niu, Y., Liu, F., & Xu, Z. (2022). When FPGA meets cloud: A first look at
performance. IEEE Transactions on Cloud Computing (TCC), 10(2), 1344–1357.
65. Xilinx, Inc. (2017). Xilinx powers Huawei FPGA accelerated cloud server. https://web.archive.
org/web/20220616002445/https://www.xilinx.com/news/press/2017/xilinx-powers-huawei-
fpga-accelerated-cloud-server.html. Accessed May 1, 2023.
66. Xilinx, Inc. (2022). UltraScale architecture configuration: User guide (UG570). https://www.
xilinx.com/support/documentation/user_guides/ug570-ultrascale-configuration.pdf. Accessed
May 1, 2023.
67. Xilinx, Inc. (2023). UltraScale+ FPGAs: Product tables and product selection guides. https://
www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-
selection-guide.pdf. Accessed May 1, 2023.
68. Xiong, W., Anagnostopoulos, N. A., Schaller, A., Katzenbeisser, S., & Szefer, J. (2019). Spying
on temperature using DRAM. In Design, Automation, and Test in Europe (DATE).
69. Xiong, W., Schaller, A., Anagnostopoulos, N. A., Saleem, M. U., Gabmeyer, S., Katzenbeisser,
S., & Szefer, J. (2016). Run-time accessible DRAM PUFs in commodity devices. In Interna-
tional Conference on Cryptographic Hardware and Embedded Systems (CHES).
70. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In IEEE
Symposium on Security and Privacy (S&P).
Chapter 10
Countermeasures Against Voltage
Attacks in Multi-tenant FPGAs

Shayan Moini, George Provelengios, Daniel Holcomb, and Russell Tessier

10.1 Introduction

Multi-tenancy has been hailed as a mechanism to enhance the widespread adoption
of cloud FPGAs. It allows multiple users to share the cost of a single FPGA
cloud service and increase FPGA resource utilization. Security concerns currently
are holding back the commercial deployment of multi-tenant cloud FPGAs. It is
well-known that FPGAs containing multiple independent tenants are susceptible to
voltage manipulation attacks [9, 30]. The physical proximity of an adversary and
their potential victim on the same FPGA, and the shared FPGA power distribution
network (PDN), provide multiple opportunities to perform voltage attacks. For
example, through the use of asynchronous or synchronous power waster circuits
[33], a malicious tenant can deploy a voltage attack to inject timing faults into
a neighboring tenant’s circuit or force the board containing the FPGA into reset
as a denial-of-service attack. Understanding the threats and addressing them with
effective countermeasures are crucial steps that are needed before the widespread
adoption of multi-tenancy by commercial cloud FPGA providers can commence.
In this chapter, we provide an overview of the types of attacks targeting multi-
tenant FPGAs, including voltage attacks and the available countermeasures. Voltage
attack countermeasures detect the presence of attacks and locate and suppress
them. A low-latency voltage monitoring system is described that can be easily
fashioned from FPGA logic. This system allows for the collection and analysis of
run-time voltage information to detect potential voltage attacks. We show that this
information can be used for real-time suppression of synchronous power waster
circuits that otherwise would force a board reset. The synchronous wasters are

S. Moini (✉) · G. Provelengios · D. Holcomb · R. Tessier
University of Massachusetts, Department of Electrical and Computer Engineering, Amherst, MA,
USA
e-mail: smoini@umass.edu; dholcomb@umass.edu; tessier@umass.edu

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 273
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_10
274 S. Moini et al.

effectively disabled by suppressing their clock signals. We also evaluate whether
asynchronous power wasters can be suppressed via partial reconfiguration of the
FPGA fabric in a timely manner. Our experimental evaluation for Intel Stratix V
FPGAs shows that the latency of partial reconfiguration is not sufficiently short
to prevent fault injection or board reset. LoopBreaker [28], an alternative attack-
suppression approach based on partial reconfiguration, partially addresses this issue.
The organization of this chapter is as follows. In Sect. 10.2, we discuss the
threat model for multi-tenant FPGA attacks, different types of attacks studied in
the literature, and the countermeasures introduced to detect and mitigate them.
Section 10.3 focuses on voltage attacks specifically. We introduce different types
of power wasters used in these attacks, a detailed analysis of the effect of these
wasters on the FPGA power distribution network, and the threats that they pose to
potential victims. Section 10.4 covers methods for detecting at runtime the presence
of active voltage attacks including using an on-chip sensor network. In Sect. 10.5,
we review reactive countermeasures against voltage attacks that work by disabling
the power wasters used for these attacks. In Sect. 10.6, we present ideas for future
research in reactive countermeasures for voltage attacks. We conclude this chapter
in Sect. 10.7.

10.2 Overview of Voltage Attacks and Countermeasures

Section 10.2 provides an overview of the different threats that arise in a multi-tenant
FPGA environment, the types of remote attacks that an adversary can exploit, and
potential countermeasures to mitigate these attacks.

10.2.1 Threats

In a multi-tenant cloud FPGA, the resources in the device are divided into multiple
physically isolated sections, each with its own share of the logic resources and
I/O [4]. At runtime, each tenant independently accesses their assigned section to
implement a hardware design and perform computation. The adjacent placement of
potential adversaries on the FPGA introduces security threats. Commercial FPGAs
contain a single PDN that is shared by all tenants’ logic. Voltage fluctuations caused
by hardware activity in one device region influence the supply voltage in other
sections of the FPGA via the PDN. The shared power distribution network provides
an opportunity for a malicious adversary tenant to attack one or more victim tenants.
The adversary tenant can eavesdrop on the hardware activity of the victim tenant
in a side-channel attack [50]. The adversary tenant can also maliciously affect the
voltage on the shared PDN to cause faults in the victim tenant’s design [19, 30, 35].
Other types of attacks involve signal coupling and thermal monitoring. Previous
work [7, 34, 36] has shown that a measurable electrical coupling exists between
10 Countermeasures Against Voltage Attacks in Multi-tenant FPGAs 275

neighboring long wires in Xilinx-AMD and Intel FPGAs. An adversary tenant can
use a wire that is adjacent to a victim tenant wire to extract information. Thermal
coupling is also a potential security threat that can be used by adversary tenants for
data transmission in a cloud FPGA [47].
There are two general groups of remote attacks in a multi-tenant FPGA scenario,
passive and active attacks. In a passive attack, the goal of the adversary is to
steal secret information from their victim, while in the active attack, the adversary
actively causes changes in the victim’s computation or its environment. Figure 10.1
illustrates each type of attack mentioned in this section.
There are two main types of passive attacks in a multi-tenant FPGA scenario. In a
long-wire cross-talk attack (Fig. 10.1a), the adversary uses cross-coupling between
adjacent FPGA long wires in a routing channel to extract secret information from
a victim. The delay of a long wire is affected by the logic level in an adjacent long
wire. An adversary tenant can measure the delay in its long wire (e.g., by measuring
ring oscillator frequency) and use the value to determine the logic value of an
adjacent wire belonging to a victim tenant [36]. This attack has been used to extract
cryptographic keys [7, 36] from a victim tenant. It is also possible to implement
a hardware Trojan in the victim tenant to act as a transmitter and transfer secret
information to an adversary tenant (the receiver) using long-wire cross-coupling as
the communication channel [7].

Fig. 10.1 Illustrations of multi-tenant FPGA attacks
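The receiver side of a long-wire attack reduces to a threshold test on an oscillation count: the adversary first calibrates the mean count while the neighboring wire carries a known "0" and a known "1", then classifies each unknown sample against the midpoint. A sketch of this idea follows; the direction and magnitude of the coupling-induced frequency shift are device-dependent, so the calibration means decide the mapping.

```python
def classify_bit(ro_count, mean_when_0, mean_when_1):
    """Recover the logic level of an adjacent long wire from the
    count of a receiver ring oscillator over a fixed window [36]."""
    midpoint = (mean_when_0 + mean_when_1) / 2
    above = ro_count > midpoint
    # Whether a '1' on the neighbor raises or lowers the count depends
    # on the device, so the calibration means decide the mapping.
    return 1 if above == (mean_when_1 > mean_when_0) else 0
```

Averaging many counting windows per bit would be needed in practice, since the coupling-induced shift is small relative to measurement noise.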
A remote power side-channel attack uses the shared FPGA PDN to extract
information from the victim (Fig. 10.1b). In this scenario, the adversary tenant
implements on-chip voltage sensors to measure the voltage across the PDN [8].
FPGA voltage sensors are discussed in depth in Sect. 10.4. The adversary tenant
uses the voltage sensors to measure voltage fluctuations caused by the victim
tenant’s dynamic power consumption, including resistive (IR) and inductive (L dI/dt)
voltage drops [15]. This type of attack has been used to extract secret information
about cryptographic algorithms [39, 50] and to extract secret information from
computational accelerators for machine learning algorithms [27].
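One common offline analysis of the recorded sensor traces is correlation power analysis: each key hypothesis predicts a leakage value per input (for example, the Hamming weight of an S-box output), and the hypothesis whose predictions best correlate with the measurements wins. A minimal sketch follows; the leakage model itself is supplied by the attacker and is not shown.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def best_key_guess(traces, model):
    """traces[i]: sensor reading for the i-th input; model[k][i]:
    predicted leakage for key guess k on that input. The key guess
    whose predictions best track the measurements wins."""
    return max(range(len(model)),
               key=lambda k: abs(pearson(model[k], traces)))
```

Real attacks use many traces per input and correlate over time samples as well, but the ranking principle is the same.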
During active attacks in multi-tenant FPGAs, the adversary tenant actively
attempts to sabotage the victim tenant’s hardware design without physically access-
ing it. In active voltage attacks, the adversary tenant uses power wasters to consume
large amounts of dynamic power and drop the voltage across the FPGA PDN. The
lower voltage causes increased delay in FPGA logic and routing elements leading
to timing violations in paths with small slack, and consequently inducing timing
faults in the victim tenant hardware design (Fig. 10.1c). A targeted fault injection
attack can expose secret information from hardware accelerators for cryptographic
algorithms (RSA [30] and AES [14]), generate biased outputs in a true random
number generator [22], or cause a hardware accelerator for machine learning
algorithms to generate misclassified results [19].
The adversary tenant can also crash the FPGA board by activating thousands
of power waster circuits and overloading the FPGA board power regulator [9, 20].
This action will cause the regulator to fail, leading to a denial-of-service (DoS) attack
(Fig. 10.1d) on the multi-tenant FPGA. Voltage attacks are covered in more detail in
Sect. 10.3.3.

10.2.2 Countermeasures

Countermeasures have been introduced to address multi-tenant FPGA security
challenges [21]. These countermeasures can be divided into two types: offline and
online countermeasures.
Offline countermeasures increase the security of a hardware design in a multi-
tenant FPGA scenario during the design or deployment phase, before it executes
on the FPGA. An example of offline countermeasures is bitstream scanning that
is deployed in Amazon AWS EC2 F1 instances [16]. Here, bitstream scanning
checks for combinational loops present in the FPGA design. These combinational
loops may be used as power wasters for active fault injection attacks. Bitstream
scanning requires the cloud FPGA vendor to gain access to the hardware design
that may not be desirable for security-conscious users. More sophisticated bitstream
scanning techniques have been introduced [17] that rely on reverse-engineering
the tenant bitstream and detecting the activity of potential power waster circuits.
Other examples of offline countermeasures include design modification to hide
the dynamic power signature of the victim tenant hardware to make power side-
channel attacks more difficult [13, 37], and isolating long wires containing sensitive
information to counteract attacks based on long-wire cross-coupling [41].
Online countermeasures are deployed during FPGA execution. They either try
to make performing an attack harder for the adversary tenant and remediate any
damage (preventative), or focus on disabling the source of the attack once it has
started (reactive). Attack detection can be performed using on-chip voltage sensors
to monitor the voltage across the FPGA PDN and detect the activity of power
wasters used for voltage attacks when the sensor measurement drops below a
predetermined threshold [33, 51]. Fault detection circuits [43] can be used to identify
fault injection. These circuits typically use shadow registers [6]. The output of a
register is compared to the slightly delayed version stored in a shadow register. A
fault is detected if the two values are not equal.
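A minimal form of such a detector is a debounced threshold test on the sensor stream: an attack is flagged only after several consecutive low samples, so single-sample noise does not trigger it. The sketch below uses illustrative values; real thresholds must be calibrated per device and workload.

```python
def voltage_alarm(samples, v_threshold=0.85, consecutive=3):
    """Flag a suspected voltage attack when the on-chip sensor reading
    stays below v_threshold for `consecutive` samples in a row."""
    run = 0
    for v in samples:
        run = run + 1 if v < v_threshold else 0
        if run >= consecutive:
            return True   # sustained droop: likely power-waster activity
    return False
```

The `consecutive` parameter trades detection latency against false positives from benign transient load.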
Methods that mask the dynamic power signature of a victim tenant’s design and
make side-channel attacks difficult are another example of a preventative online
countermeasure. Active fences [13] surround the victim tenant’s design with ring-
oscillator-based power wasters. The activity of the power wasters is adjusted to
complement the dynamic power consumption of the protected circuit resulting in
reduced observable voltage fluctuations. As a result, power side-channel attacks
become more difficult. Once the activity of power wasters used for an attack is
detected, reactive online countermeasures can be deployed to disable the source of
the attack. These countermeasures must respond fast enough to avoid the damage
intended by the attacker. For example, the clock source to an adversary tenant can
be cut off to disable synchronous power wasters [33] once an attack is detected.
Detailed examples of reactive online countermeasures targeting several types of
power wasters are reviewed in Sect. 10.5. In the remainder of this chapter, we will
focus on reactive online countermeasures for multi-tenant FPGAs that attempt to
disable the attack source.
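The active-fence idea described above amounts to complementary activation: the fence enables just enough wasters that fence plus victim activity sums to a roughly constant power budget. A toy sketch follows, with activity expressed as a fraction of that budget; this normalization is an assumption for illustration.

```python
def fence_activation(victim_activity, budget=1.0):
    """Fraction of fence power wasters to enable so that
    victim + fence dynamic power stays near `budget` [13]."""
    # Clamp so the fence never drives a negative (or >100%) share.
    return max(0.0, min(budget, budget - victim_activity))
```

A real controller would estimate victim activity from on-chip sensors at a rate fast enough to track the protected circuit's power transients.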

10.3 Voltage Attacks

In this section, we consider the voltage attacks introduced in Sect. 10.2 in detail.
First, we provide an overview of the different types of power wasters used in voltage
attacks in multi-tenant FPGAs and their effect on the FPGA PDN. Then, we discuss
the threats that these attacks pose to potential victims.

10.3.1 Types of Power Wasters Used in Voltage Attacks

A power waster is generally composed of multiple replicas of a hardware design
that, when activated, can aggressively consume dynamic power and cause high
levels of voltage drop across the FPGA PDN. Effective power wasters consume
considerable levels of dynamic power while remaining undetected by offline
countermeasures.
Both synchronous and asynchronous power wasters can be implemented in
FPGA logic. Figure 10.2 provides schematics of the power wasters introduced in
this section. Asynchronous power wasters generally implement a combinational
loop with negative feedback that causes the circuit to toggle with a very high
frequency. The most basic asynchronous power waster uses an FPGA look-up table
(LUT) primitive and directly connects output to input to create a combinational
loop, as shown in Fig. 10.2a [32]. LUT-based power wasters are detectable [17]
during the design rule check (DRC) stage of design synthesis and, therefore, can
generally be suppressed prior to design deployment. More complex asynchronous
power wasters [32] can more easily escape DRC checks, although sophisticated
bitstream checking methods are able to detect them [15]. Multiplexer-based power
wasters [17] are implemented using an FPGA MUX primitive (Fig. 10.2b). Latch-
based power wasters [44] (Fig. 10.2c) use latch and inverter primitives to pass DRC
checks. The possibility of detecting these power wasters in DRC checking or in
the bitstream does not make other countermeasures against them unnecessary, since
checking may be too burdensome for some tenants [28] and the risk of false positives
may limit activities of benign tenants [15].
Synchronous power wasters use a clock source, and therefore, their maximum
toggling frequency is limited by the clock. Flip-flop-based power wasters [32]
(Fig. 10.2d) use FPGA flip-flop primitives in a loop. Shift-register-based power
wasters (Fig. 10.2e) [32] include a long shift register passing an alternating “0”
and “1” pattern. A synchronous power waster can also be implemented using a
chain of multiple 128-bit AES modules [32], as illustrated in Fig. 10.2f. Multiple
bypass paths with XOR gates and feedback routes are used between the AES
modules to add glitches and cause more dynamic power consumption. RAM-Jam [1]
(Fig. 10.2g) uses constant intentional write collisions to cause transient short circuits
in on-chip dual-port block RAMs.

10.3.2 Effects of Power Wasters on FPGA PDN

The effect of power waster operation on the FPGA PDN has previously been
characterized [33]. A total of 30,000 LUT-based power wasters (Fig. 10.2a) were
instantiated on an Intel Stratix 10 FPGA and used to generate voltage drops across
the FPGA PDN. Their effect was captured using an on-chip network of 218 ring-
oscillator-based voltage sensors uniformly placed across the FPGA floorplan to
monitor the transient response of the FPGA PDN to the power wasters’ activity.
The voltage sensor’s architecture and operation are described in detail in Sect. 10.4.

Fig. 10.2 Examples of synchronous and asynchronous power wasters: (a) LUT-based waster; (b) MUX-based waster; (c) latch-based waster; (d) flip-flop waster; (e) shift-register waster; (f) AES-based waster; (g) RAM-based waster

Fig. 10.3 Normalized RO voltage sensor measurements and supply voltage measurements during power waster activity. The power wasters are activated at time = 0. The plot shows the response of a voltage sensor at specific distances (measured as the number of FPGA logic array blocks) from the center of the power wasters [33]
Figure 10.3 shows the transient response of multiple RO-based voltage sensors
when the power wasters are activated at time 0. The number of oscillations of
the RO-based voltage sensors located across the FPGA die is recorded during the
measurement clock period (equal to 10 μs). Each oscillation count is equated with
a corresponding voltage value [33]. This derived voltage measurement is shown on
the Y-axis of the graph. The distance between the power waster and each sensor is
measured as the number of FPGA logic array blocks (LABs) between them. Based
on the figure, the RO sensor measurements show a large voltage drop after time
0 followed by a steady-state voltage value that is smaller than the initial voltage of
0.92 V before the wasters were turned on. As expected, the voltage drop lessens
as the distance between the center of the power wasters and the RO-based voltage
sensor increases. This experiment shows that a significant voltage drop is present
even at a considerable distance from the source of the attack.
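The count-to-voltage conversion mentioned above can be approximated by a two-point linear calibration, since RO frequency varies roughly linearly with supply voltage over small ranges. A sketch follows; the calibration pairs in the example are made up for illustration.

```python
def count_to_voltage(count, cal_lo, cal_hi):
    """cal_lo, cal_hi: (oscillation_count, known_voltage) pairs
    captured at two controlled supply voltages."""
    (c0, v0), (c1, v1) = cal_lo, cal_hi
    # Linear interpolation (or extrapolation) between the two points.
    return v0 + (count - c0) * (v1 - v0) / (c1 - c0)
```

Temperature also shifts RO frequency, so a deployed sensor would need either temperature compensation or frequent recalibration.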

10.3.3 Potential Threats of Voltage Attacks

In a denial-of-service (DoS) attack based on power wasters, the adversary tenant
overwhelms the voltage regulator supplying the FPGA by activating a large number
of power wasters (e.g., tens of thousands) causing FPGA power failure. For
example, it was shown [33] that 30,000 or more LUT-based RO power wasters will
crash a Terasic DE10-Pro development board [46] containing a 14-nm Intel Stratix
10 FPGA. Experimental results [9, 28] show that periodically activating RO-based
power wasters at a specific frequency (toggling frequency) can enhance this attack.
10 Countermeasures Against Voltage Attacks in Multi-tenant FPGAs 281

An adversary tenant may instead wish to inject faults into the victim tenant’s
circuit, without crashing the board, to cause incorrect behavior or retrieve secret
information.
For example, LUT-based RO power wasters were used to extract the key from a
hardware accelerator for the 512-bit RSA encryption algorithm [31] by injecting
faults during the modular multiplication phase of the RSA hardware accelerator.
This action resulted in incorrect encrypted ciphertext for known plaintext and public
keys. The faulty ciphertext was then used to extract the prime numbers needed to
generate the RSA key pair and expose the private key. FPGAhammer [14] extracted
the key from a 128-bit AES encryption module by injecting timing faults using LUT-
based power wasters. This action caused a byte fault in an AES encryption round
and consequently generated a ciphertext with faults injected in all bytes. The faulty
ciphertext was used alongside the correct ciphertext to perform a differential fault
attack (DFA) to extract the secret key. It is also possible to extract the AES key from
an FPGA by injecting faults without having access to the fault-free ciphertext [18].
Hardware accelerators for machine learning algorithms are also vulnerable to
fault injection attacks. In DeepStrike [19], timing faults were injected into a
hardware accelerator for a deep neural network (DNN) used to classify images,
resulting in misclassifications. Latch-based asynchronous power wasters were used
to inject timing faults in the digital signal processing (DSP) primitives used
by the hardware accelerators. In a similar project [35], a fault injection attack
was performed against a DNN hardware accelerator using LUT- and latch-based
asynchronous power wasters. The data communication process for loading the
DNN weights from off-chip memory was targeted, causing timing faults when data
were loaded. Input images to the DNN accelerator were misclassified using the
attack. On-chip RAM-based power wasters (Fig. 10.2g) were used to perform a fault
injection attack against a machine learning target [1]. Using RAM-based power
wasters, it was possible to change the output class of a neural network hardware
accelerator.

10.4 Voltage Sensing and Fault Detection

In this section, we begin our analysis of countermeasures against voltage attacks.
We start with a detailed discussion of the various types of on-chip voltage sensors
used to detect the activity of power wasters, and then elaborate on how these
sensors, among other methods, can be used to detect voltage attacks in progress.

10.4.1 On-Chip Voltage Sensors

Since voltage attacks disrupt the FPGA PDN, an on-chip voltage sensor with the
ability to sense the voltage across the FPGA PDN is needed. On-chip voltage sensors
282 S. Moini et al.

Fig. 10.4 RO- and TDC-based voltage sensors. (a) RO-based sensor. (b) TDC-based
sensor

are digital circuits implemented using FPGA resources that can provide a proxy for
the voltage of the FPGA PDN. Figure 10.4 shows an overview of the on-chip voltage
sensors discussed in this section. Ring oscillators (ROs) form a common type of on-
chip voltage sensor. RO-based sensors (Fig. 10.4a) include an RO followed by a
counter that counts the number of RO oscillations during a specific time period. The
length of the combinational loop in the ring oscillator must be sufficiently long to
avoid timing faults in the counter circuit. As an example, 19 inverter stages have
previously been used [33] for voltage sensor implementation.
The relationship between the voltage measured from the FPGA supply voltage
pin and the on-chip voltage sensor oscillation frequency can be derived during a
calibration phase by measuring the voltage sensor oscillation frequency for various
supply voltage values. The FPGA supply voltage can be modified by connecting the
FPGA to a power supply and directly adjusting the voltage [30]. Another way to
adjust the supply voltage of the FPGA is to activate power wasters. The number of
active power wasters can be dynamically adjusted to change the voltage measured at
the supply voltage pin of the FPGA. In this scenario, an independent measurement
of the voltage across the FPGA PDN is needed for calibration. This independent
measurement is either achieved by connecting an oscilloscope to the supply voltage
pin of the FPGA [25], or by using the dedicated analog-to-digital converter (ADC)
available on the FPGA that monitors the supply voltage [30]. Figure 10.5, derived
during the calibration phase, shows the relationship between the FPGA PDN voltage
and the normalized frequency of the RO voltage sensor for two Intel-based FPGAs.
A Terasic DE10-Pro evaluation kit [46] containing an Intel Stratix 10 FPGA, and
a Terasic DE5a-Net evaluation kit [45] containing an Intel Arria 10 FPGA were
used. In both experiments [30, 33], variable numbers of RO-based power wasters
are activated to adjust the voltage across the FPGA PDN, while dedicated ADC-

Fig. 10.5 Calibration curves for FPGA supply voltage based on the oscillation
frequency of an RO-based sensor. Calibration was performed separately for an
Arria 10 and a Stratix 10 FPGA

based on-chip voltage sensors are used to record the voltage values. This figure
clearly shows a highly linear relationship between the FPGA PDN voltage and the
RO sensor frequency, and it can be used to directly convert on-chip voltage sensor
measurements into voltage values.
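The calibration described above amounts to fitting a line to pairs of (normalized RO frequency, measured supply voltage) and then applying that fit to later sensor readings. The sketch below illustrates the idea; all numbers, including the 125 MHz nominal RO frequency used for normalization, are hypothetical and not taken from the published experiments.

```python
# Hypothetical sketch of RO-sensor calibration and count-to-voltage
# conversion: fit volts = a * (normalized frequency) + b, then apply
# the fit to oscillation counts recorded over a measurement window.

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = a*xs + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def counts_to_voltage(count, window_us, f_nominal_mhz, a, b):
    """Convert an oscillation count over window_us microseconds to volts."""
    freq_mhz = count / window_us               # oscillations per us = MHz
    return a * (freq_mhz / f_nominal_mhz) + b  # normalize, apply the fit

# Hypothetical calibration points: (normalized frequency, measured volts).
cal_f = [0.80, 0.85, 0.90, 0.95, 1.00]
cal_v = [0.76, 0.80, 0.84, 0.88, 0.92]
a, b = fit_line(cal_f, cal_v)
```

With these illustrative points, a count of 1250 over a 10 μs window (125 MHz, normalized frequency 1.0) maps back to 0.92 V, mirroring how the linear calibration curve converts sensor counts directly into voltage estimates.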
A drawback of RO-based on-chip voltage sensors is their high measurement
sampling time. A relatively long sampling period (e.g., 10 μs [33]) is needed before
a single measurement is recovered from the sensor. If this time is shortened, the RO
only oscillates a few times and is less sensitive to PDN voltage fluctuations [25].
Additionally, the RO sensor may not provide consistent measurements for repeated
sensing iterations and may require multiple runs and averaging to converge to a
consistent value [25]. On-chip voltage sensors based on time-to-digital converters
(TDCs) can alleviate these issues. A TDC sensor can digitally measure the time
elapsed between two separate events. In FPGAs, the TDC sensor is implemented
using the FPGA carry chain primitives as delay elements (Fig. 10.4b) and is able
to sense small FPGA PDN voltage fluctuations [51]. Multiple carry chain elements
are connected in a daisy chain. The output of each carry element is connected to
a flip-flop to record its value. A rising edge signal is propagated through the carry
chain, and the number of elements it passes is caught by the flip-flops. The TDC is
able to measure the number of carry elements passed between its start (rising edge
entering the carry chain) and stop (flip-flop clock) signals. The Hamming weight of
the TDC output can be used to approximate voltage across the FPGA PDN. With
higher voltage, each carry element has a lower propagation delay resulting in the
input signal propagating through a higher number of carry elements. Thus, the TDC
output will have a higher Hamming weight. Conversely, a lower voltage across the
FPGA PDN results in the TDC output having a lower Hamming weight.
A TDC sensor typically has a much shorter sampling delay (2 ns [51]) compared
to the RO sensor (10 μs). Additionally, TDC measurements converge to a stable
averaged value with fewer repeated measurements [25].
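Reading a TDC sensor reduces to counting ones in the captured thermometer code. The sketch below illustrates the mapping from Hamming weight to voltage; the linear per-stage calibration constants (volts per stage and offset) are hypothetical values chosen only for illustration.

```python
# Sketch of reading a TDC-based sensor as described above. The flip-flop
# outputs form a thermometer code; its Hamming weight records how far the
# rising edge travelled before the sample clock, which rises and falls
# with the PDN voltage. Calibration constants below are hypothetical.

def hamming_weight(tdc_bits):
    """Number of carry-chain stages the rising edge passed."""
    return sum(tdc_bits)

def tdc_to_voltage(tdc_bits, volts_per_stage, volts_offset):
    """Map the Hamming weight to a voltage estimate via a linear fit."""
    return volts_offset + hamming_weight(tdc_bits) * volts_per_stage

# 16-stage chain: the edge reached 10 stages before the flip-flops sampled.
sample = [1] * 10 + [0] * 6
```

A higher PDN voltage shortens each stage's propagation delay, so the edge reaches more stages and the weight (and estimated voltage) rises, matching the behavior described in the text.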

10.4.2 Voltage Attack Detection

On-chip voltage sensors have been widely used in FPGAs to detect the activity
of power wasters used for fault injection attacks or denial-of-service voltage
attacks [51]. Generally, one or more voltage sensors constantly monitor the voltage
of the FPGA PDN and detect the activity of power wasters if the voltage drops
significantly below a predetermined threshold. For example [33], a sensor network
consisting of 218 RO-based voltage sensors was used to detect the activity of
power wasters deployed in a denial-of-service attack. The Stratix 10 FPGA was
divided into four separate regions (one per tenant). Sensors are used to detect
the activity of power wasters in each region independently. An integrated ARM
processor constantly calculates the average RO sensor count in each FPGA region
and compares it to a predetermined threshold. If a potential power waster is detected,
it can be remediated, as described in Sect. 10.5.
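A minimal sketch of this region-average threshold test follows. The counts, region layout, and threshold are illustrative; in the published system the comparison runs continuously on the integrated ARM processor.

```python
# Region-based detection sketch: average the RO sensor counts in each
# tenant region and flag any region whose average falls below a
# predetermined threshold (all values here are illustrative).

def detect_attacked_regions(region_counts, threshold):
    """region_counts: dict mapping region id -> list of RO sensor counts.
    Returns ids of regions whose average count fell below the threshold."""
    flagged = []
    for region, counts in region_counts.items():
        if sum(counts) / len(counts) < threshold:
            flagged.append(region)
    return flagged

# Four tenant regions; region 2's sensors slowed under a voltage drop.
counts = {0: [125, 124, 126], 1: [125, 125, 124],
          2: [98, 97, 99], 3: [126, 125, 125]}
```

Because a voltage drop lowers RO oscillation frequency, only the region hosting the wasters (region 2 above) falls below the threshold and is flagged for remediation.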
Similarly [24], a network of RO-based on-chip voltage sensors is implemented
on an FPGA to detect the activity of power wasters. The FPGA is divided into four
regions, with each assigned to a tenant. An algorithm is applied to collected sensor
data to locate the region potentially containing power wasters. For each sensor, the
measured RO-based sensor frequency is compared with a running frequency average
from the same sensor. When an attack occurs, the difference between these values
increases significantly as the RO sensor’s oscillation frequency drops. The location
of the wasters can be determined by identifying the region with the sensors that
have the largest drops. Unlike the approach described above, sensor measurements
are processed offline and cannot be used for immediate attack remediation.
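The per-sensor running-average comparison can be sketched as below. The window length and the 15% drop ratio are hypothetical parameters; the published scheme simply compares each frequency sample against a running average from the same sensor.

```python
# Sketch of per-sensor anomaly detection: each new frequency sample is
# compared against that sensor's own running average; a large relative
# drop marks the sensor as likely near the power wasters.

from collections import deque

class SensorMonitor:
    def __init__(self, window=8, drop_ratio=0.15):
        self.history = deque(maxlen=window)   # recent frequency samples
        self.drop_ratio = drop_ratio          # flag a >15% drop vs. average

    def update(self, freq):
        """Record a sample; return True if it dropped suspiciously far
        below this sensor's running average."""
        suspicious = False
        if self.history:
            avg = sum(self.history) / len(self.history)
            suspicious = (avg - freq) / avg > self.drop_ratio
        self.history.append(freq)
        return suspicious

mon = SensorMonitor()
flags = [mon.update(f) for f in [125, 125, 124, 126, 125, 100]]
```

Since the sensors nearest the wasters exhibit the largest frequency drops, comparing these flags across sensors localizes the offending region.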
A final attack detection approach [42] uses a long combinational adder circuit
connected to a register to detect voltage drops. Intermediate carry values stored
in the register are used for detection. Similar to a TDC, the PDN voltage can be
approximated by calculating how far a carry signal propagates across the carry
chain during a measurement clock cycle. If the value is smaller than a threshold
determined during calibration, the activity of power wasters used for voltage attacks
is detected. Subsequently, a countermeasure, which involves suppressing the user
clock, is deployed to alleviate the attack.

10.5 Reactive Countermeasures for Voltage Attacks

In this section, we discuss examples of reactive online countermeasures against
fault injection and denial-of-service attacks that use power wasters. These coun-
termeasures operate by disabling the power wasters as the source of the attack and
mitigating the attack. As discussed in Sect. 10.4, these reactive countermeasures are
generally deployed alongside a fault detection method that identifies the activity
of the power wasters and in some cases locates them. The success of a reactive
countermeasure depends on how fast it is able to disable the power wasters

before they deliver their intended harm. Different countermeasures are deployed
for synchronous and asynchronous power wasters.

10.5.1 Disabling Synchronous Power Wasters

The clock in the FPGA design is typically provided by an external source,
distributed using a dedicated clock routing network, and delivered to the design
using global clock buffers. In the following multi-tenant scenario, the FPGA is
separated into multiple clock regions each with its own clocking resources. A
privileged user can control individual clock buffers in each clock region and cut off
the clock to a specific region without affecting the others. This capability can be used
to suppress synchronous power wasters employed by an adversary tenant in a multi-
tenant FPGA setting. Once the presence of an adversary tenant is detected (e.g.,
using methods discussed in Sect. 10.4), the clock signal feeding the synchronous
power wasters can be cut off resulting in their deactivation. Successful attack
detection and remediation include the detection of the adversary tenant location
and the disabling of the power wasters before the FPGA board crashes or a fault is
injected.
An example of synchronous power waster suppression [33] considers four tenant
regions, each within an isolated clock region. The clock buffers of each region can
be controlled by a microprocessor under administrative control. An adversary tenant
may instantiate one or more synchronous power wasters powerful enough to cause
a denial-of-service attack. A network of voltage sensors is used to collect real-
time voltage information from the regions. If an attack is detected, the clock to
the offending region is suppressed, stopping the attack.
An overview of the multi-tenant system running on a Stratix 10 FPGA is depicted
in Fig. 10.6. A network of evenly spaced, continuously running RO-based voltage
sensors is placed throughout the four regions. Software, constantly running in
a loop, obtains the measurements from each voltage sensor and uses them to
periodically check for the activity of power wasters used for voltage attacks in each
tenant region. Once the activity of the power wasters is detected in a tenant region,
the clock buffer for that region is deactivated.
The performance of this monitoring system was evaluated in two scenarios: (1)
the software was run on a NIOS II soft-core processor implemented in the FPGA
fabric, and (2) the software code was run on a hard-core ARM processor. For the first
scenario, the on-chip monitoring system is shown in Fig. 10.7. Data communication
with the NIOS II is realized using an Avalon Memory-Mapped (AVMM) bridge.
RO sensor counts in each region are recovered, added in accumulators (ACC in
Fig. 10.7), grouped into 32-bit words, and sent to the NIOS II for processing. The
software, running on the NIOS-II processor, unpacks the received words to recover
sensor counts. These sensor counts are used to calculate a running average of the
sensor measurements in each tenant region. If this measurement drops below a
threshold for a region, the activity of the power wasters is detected in that region, and

Fig. 10.6 Overview of the multi-tenant attack remediation system running on a Stratix 10
FPGA [33]

Fig. 10.7 Overview of the monitoring system implemented on the Stratix 10
FPGA. The software that locates the adversary tenant runs on a NIOS-II soft-core
processor

the clock buffer for that region is disabled using a memory write to the clock-enable
registers. Experimental results for 1,000 trials show that this monitoring system is
able to successfully locate a tenant region containing the activated power wasters
and disable its clock in 11.05 μs on average, avoiding a power regulator crash.
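As a rough illustration of the software side of this loop, the sketch below unpacks 32-bit words into sensor counts and applies a threshold test. The packing layout (two 16-bit accumulated counts per word) and the threshold value are hypothetical stand-ins, not the published design.

```python
# Hypothetical sketch of the NIOS II software: unpack 32-bit words
# received over the AVMM bridge into 16-bit accumulated sensor counts,
# then decide whether a region's clock-enable register should be cleared.

def unpack_counts(words):
    """Split each 32-bit word into two 16-bit sensor counts."""
    counts = []
    for w in words:
        counts.append(w & 0xFFFF)          # low half
        counts.append((w >> 16) & 0xFFFF)  # high half
    return counts

def region_under_attack(words, threshold):
    """True if the region's average sensor count fell below the threshold,
    meaning its clock buffer should be disabled."""
    counts = unpack_counts(words)
    return sum(counts) / len(counts) < threshold

# Two words carry four sensor counts; a voltage drop lowered them to ~100.
attacked = [(100 << 16) | 101, (99 << 16) | 100]
normal = [(125 << 16) | 125, (126 << 16) | 124]
```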
For the second scenario, an ARM processor that runs faster than the NIOS II
processor is used to run the software. However, the data communication that sends
the RO sensor counts to the ARM processor and receives commands to disable
clock buffers is slower than in the NIOS II setup due to the delay associated
with the AVMM bridge. Similar to the previous scenario, the experiment with the
power wasters was run 1,000 times, and the monitoring system was able to detect
the activation of the power wasters, locate them, and disable them in 9.95 μs on
average [33].
These experiments show the success of an RO-based on-chip voltage sensor
network in detecting and disabling synchronous power wasters used for a DoS
attack that requires 20 μs to cause a crash. A drawback of this countermeasure is
that it is unable to deactivate asynchronous power wasters, which do not rely on a
clock to function.

10.5.2 Disabling Asynchronous Power Wasters Using Partial Reconfiguration

Asynchronous power wasters cannot be deactivated with clock suppression. They
must be disabled through FPGA reconfiguration or shutdown. Partial FPGA
reconfiguration allows for changes to portions of the FPGA fabric at runtime
while logic in the rest of the FPGA continues to execute normally. Figure 10.8
shows an overview of the partial reconfiguration operation [12]. Partial reconfigura-
tion (PR) is managed on the FPGA by the PR controller module. It receives the PR
bitstream and programs it during FPGA operation. The part of the FPGA fabric that
is being reconfigured is called the PR region. A new partial design, or persona, is
loaded into the PR region during the partial reconfiguration operation.
Once the activity of power wasters is detected and the adversary tenant region
is located, partial reconfiguration can be used to remove the logic in the adversary
tenant’s region, disabling the power wasters. The success of partial reconfiguration
as a reactive countermeasure against fault injection and DoS attacks depends on
how fast it can disable the power wasters in a region. We devised an experiment
to measure the minimum amount of time it takes partial reconfiguration to dis-
able multiple ring oscillators using an Intel Stratix V GX Edition development
board [10] containing an Intel Stratix V FPGA. Ten ring-oscillator-based voltage
sensors (Sect. 10.4) were instantiated with the goal of disabling them using partial
reconfiguration. The sensors were placed vertically across the FPGA floorplan (the
10 small rectangles in Fig. 10.9). Each RO-based voltage sensor includes 19 inverter
stages. A 32-bit counter keeps a record of the number of RO oscillations in a
specific time period. These counts allow for the determination of the exact moment
during reconfiguration when an RO is deactivated. As shown in Fig. 10.10a, each
RO voltage sensor contains a wire that goes through an adjacent PR region (the
single long and narrow rectangle in Fig. 10.9). After partial reconfiguration, the
hardware inside the PR region is cleared, which results in the deactivation of the ring
oscillators, as demonstrated in Fig. 10.10b. To minimize the partial reconfiguration
time, we chose the smallest PR region size possible, a single FPGA logic array
block (LAB) column with a width equal to 1 LAB and a height of 120 LABs

Fig. 10.8 Overview of the partial reconfiguration operation



Fig. 10.9 Floorplan of the Stratix V FPGA containing the partial reconfiguration
region and the 10 RO-based voltage sensors

stretching vertically across the FPGA from top to bottom (Fig. 10.9). The size of
the PR bitstream for this PR region is 500 KB.
At the start of the experiment, all 10 RO voltage sensors are activated. The
number of oscillations of each RO voltage sensor during the active time period
is measured using a counter and stored in on-chip memory. When the sensors are
activated, the partial reconfiguration operation is started and the PR bitstream of
500 KB is sent to the FPGA using a JTAG cable. Once the partial reconfiguration
operation is complete, the RO voltage sensor measurements are recovered from the
on-chip memory. These measurements are used to calculate the time it takes from
the start of the PR operation until all 10 RO sensors are cut off (the cut-off time).
Note that the RO cut-off time does not include the time needed to detect the presence
of an attack using an on-chip voltage sensor network.
Figure 10.11 shows the measurements recovered from the 10 RO sensors after the
PR operation starts at time = 0. Each count of RO sensor oscillations is collected
with a sampling period of 1 μs. As shown in the figure, all RO sensors have a
similar count of 125 during a 1 μs sampling period until they all simultaneously stop
oscillating after 40 ms. Further investigation of the zoomed-in image shows that the
last non-zero sensor measurement was recovered at 39.5 ms (the cut-off time of the
ring oscillators).
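Recovering the cut-off time from the logged counts can be sketched as follows. The trace values are illustrative; the real traces show counts of about 125 per 1 μs window until the sensors stop at 39.5 ms.

```python
# Sketch of cut-off time extraction: with a fixed sampling period, the
# cut-off is the time of the last non-zero count across all sensors.

def cutoff_time_us(traces, sample_period_us=1.0):
    """traces: per-sensor lists of counts sampled every sample_period_us.
    Returns the time at which the last non-zero sample was seen."""
    last = 0
    for trace in traces:
        for i, count in enumerate(trace):
            if count != 0:
                last = max(last, i + 1)  # sample i covers window (i, i+1]
    return last * sample_period_us

# Three sensors oscillating normally, all cut off after five samples.
traces = [[125, 125, 125, 125, 125, 0, 0],
          [124, 125, 126, 125, 124, 0, 0],
          [125, 126, 125, 124, 125, 0, 0]]
```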
JTAG is a slow communication protocol with low bandwidth and high operating
system-level handshake delay, resulting in reconfiguration delay in the measured

Fig. 10.10 RO-based voltage sensors (a) before and (b) after the partial
reconfiguration operation. The red rectangle shows the partial reconfiguration
region. The ring oscillator is disabled after the PR operation is completed

RO cut-off times. To provide a better estimate of the RO cut-off time if a faster data
communication medium were available, we investigated the way data are transferred
through JTAG during the PR operation using the Intel Signal Tap logic analyzer [11].
By the time the RO sensors are cut off, 10% of the PR bitstream (50 KB) has been
loaded onto the FPGA. Based on this insight, the cut-off time of the RO sensors can
be estimated when loading the PR bitstream from on-chip memory instead of off-
chip communication with JTAG. Using on-chip memory, the PR bitstream could be
read using a 200-MHz clock with 32 bits per clock. In this scenario, it would take
at least 63 μs to load the 10% of the PR bitstream needed to disable the ROs in a
single FPGA column.
Disabling thousands of RO wasters would require performing the partial recon-
figuration of numerous FPGA LAB columns. For example, previously [33], 64 LAB
columns in a Stratix 10 device were used to implement RO wasters. With each LAB
column requiring at least 63 μs to disable, the PR operation to serially disable all RO
wasters in all LAB columns will take at least four milliseconds if the wasters were
implemented in a Stratix V. Since an adversary tenant can inject faults into a victim
design in 10–20 μs [31], partial reconfiguration in current Intel Stratix V FPGAs is
not fast enough to avoid these attacks.
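The 63 μs and four-millisecond figures above can be checked with a few lines of arithmetic (treating 1 KB as 1000 bytes, which reproduces the rounded numbers in the text):

```python
# Checking the timing estimates: 50 KB (10% of the 500 KB PR bitstream)
# must be loaded before the ROs in one LAB column are disabled, read
# 32 bits per cycle at 200 MHz; 64 LAB columns are disabled serially.

loaded_bits = 50 * 1000 * 8                 # 50 KB in bits
cycles = loaded_bits // 32                  # words read at 32 bits per clock
t_column_us = cycles / 200                  # 200 MHz -> 200 cycles per us
t_64_columns_ms = 64 * t_column_us / 1000   # serial PR over 64 LAB columns
```

This yields 62.5 μs per column (rounded to 63 μs in the text) and 4.0 ms for all 64 columns, well above the 10–20 μs an adversary needs to inject a fault.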

Fig. 10.11 RO counts for 10 separate sensors during partial reconfiguration. The PR operation
starts at time 0. As shown in the zoomed-in version of the figure, the sensors are cut off at 39.5 ms

10.5.3 Modified Partial Reconfiguration

Recent work has modified the partial reconfiguration bitstreams in Xilinx 7-series
and UltraScale+ FPGAs to reduce the cut-off time for asynchronous power wasters.
The LoopBreaker approach [28] uses a partial reconfiguration (PR) bitstream that
disables asynchronous power wasters. The approach disables all interconnects in
the partial reconfiguration region by setting them to a high impedance “Z” state
prior to the full loading of a new bitstream for the region. The approach takes
advantage of the use of configuration commands that are part of the bitstream.
The commands, which are executed by the PR controller, select the PR region,
cut off its input and output wires, and disable its interconnects. LoopBreaker sends

modified PR commands, which were determined via reverse engineering, to the PR
controller. The modified PR commands, which are included in the bitstream, disable
all interconnects in a selected PR region. The LoopBreaker experiments, targeting
a ZCU102 evaluation board [48] containing a Xilinx Zynq UltraScale+ system-on-
chip (SoC) and a VC707 evaluation board [49] containing a Xilinx Virtex 7 FPGA,
show the ability to disable the asynchronous power wasters within 1.5 μs of the start
of the PR process.
The LoopBreaker experiments [28] evaluated whether the new PR approach is
fast enough to suppress a voltage attack and avoid fault injection or board failure.
Several types and sizes of synchronous and asynchronous power wasters were
activated, and the results of FPGA execution were examined. For each experiment,
the power wasters were enabled, a TDC-based on-chip voltage sensor was used to
detect the activation of the power wasters, and their partial reconfiguration process
was initiated to suppress the attack. Each experiment using a distinct power waster
was repeated 20 times.
LoopBreaker experimental results show that modified PR on Virtex-7 and Zynq
UltraScale+ FPGAs substantially decreases the probability of successful fault
injection and FPGA board failure compared to unmodified partial reconfiguration
using a blank bitstream. The success of an attack depends on the area the power
wasters consume on the FPGA. For RO- and latch-based power wasters that
consume less than 22.5% of the FPGA area, modified PR successfully suppresses
all attacks that crash the FPGA board. For larger power wasters and wasters toggled
on-and-off, less success is achieved. Fault injection was not prevented in either case.
These experiments show a benefit from fast PR and should provide encouragement
to FPGA vendors to enhance this functionality to improve security against voltage
attacks. In the LoopBreaker case, the availability of open-source configuration
manager tools developed for Xilinx FPGAs [5, 29] and open-source bitstream
information [3, 38] allows for fine-level control of the partial reconfiguration process
in Xilinx Virtex-7 and UltraScale+ FPGAs. This functionality and information are
not available for Intel Stratix V devices.

10.6 Ideas to Address Security Threats

The security measures that have been introduced for multi-tenant cloud FPGAs
are insufficient to address all threats posed by voltage attacks using power
wasters. Although experimentally tested techniques have successfully suppressed
synchronous [33] and asynchronous power wasters [28] in some scenarios, an
adversary tenant with power wasters can inject timing faults and cause an FPGA
board crash in the presence of the fastest reactive countermeasures in other
cases. Therefore, we propose several additional directions for voltage attack
countermeasures, targeting different stages of the multi-tenant FPGA design and
development process.

Cloud users could potentially insert hardware in a design to roll back the design
state to known non-compromised values in the presence of a fault injection attack
[26]. This mechanism would allow the user’s hardware to halt operations once
the presence of a fault injection attack is detected, reject potentially compromised
computed results, and recompute affected values once the attack source is disabled.
Cloud FPGA users could use checkpointing methods to periodically generate a
checkpoint snapshot of the working design (e.g., state machines, registers, and on-
chip block memory instances [2, 40]). In the case of a denial-of-service attack,
the user could potentially return to their most recent checkpoint and minimize the
damage caused by the FPGA shutdown.
Cloud FPGA providers could also invest in countermeasures to improve the
security of the user environment. More sophisticated bitstream scanning methods
that discover complex power wasters that are undetected by the current bitstream
scanning methods [17] could be used. For example, glitch-based power wasters [23,
32] do not use ring oscillators, and RAM-based power wasters [1] appear similar to
standard circuitry. These power wasters could be discovered by scanning the FPGA
bitstream and design for suspicious high-fanout circuits with the ability to generate
high-power glitches.
FPGA hardware vendors also have an important role in increasing the security
of multi-tenant FPGAs. They could introduce improved partial reconfiguration
capabilities to more quickly cut off the interconnects in a PR region or clear
configuration hardware. Improvements could include removing the communication
and setup overheads and redundancies during partial reconfiguration and improving
the supported maximum throughput for PR bitstream loading. These changes may
help improve the success rate of partial reconfiguration as a countermeasure against
fault injection and DoS attacks by deactivating power wasters faster. FPGA vendors
could also improve the design of the FPGA power distribution network by dividing
the PDN into multiple isolated voltage planes [14]. Doing so could potentially avoid
side-channel attacks and reduce the viability of active fault injection and FPGA-
wide denial-of-service attacks.

10.7 Conclusion

In this chapter, we provided an overview of multi-tenant cloud FPGA attacks and
countermeasures. Active voltage attacks were a particular focus. After evaluating
different types of attacks, attack circuits, and sensors, an evaluation of current
reactive countermeasures to disable power wasters in a fault injection attack was
provided. These approaches include cutting off the clock for synchronous power
wasters and the partial reconfiguration of asynchronous power wasters. Results
from on-FPGA experiments evaluated the effectiveness of partial reconfiguration
in disabling wasters in Intel Stratix V devices. These results were contrasted with
previously published results for Xilinx Virtex-7 and Zynq UltraScale+ FPGAs.
Although denial-of-service attacks can be suppressed in some cases, partial recon-
figuration is not sufficiently fast to prevent fault injection and suppress
all denial-of-service attacks. Possible research directions that could address this
limitation were subsequently presented.

Acknowledgment This work was supported in part by National Science Foundation grant
1902532.

References

1. Alam, M. M., Tajik, S., Ganji, F., Tehranipoor, M., & Forte, D. (2019). RAM-Jam: Remote
temperature and voltage fault attack on FPGAs using memory collisions. In Workshop on
Fault Diagnosis and Tolerance in Cryptography (FDTC) (pp. 48–55).
2. Attia, S., & Betz, V. (2020). Feel free to interrupt: Safe task stopping to enable FPGA
checkpointing and context switching. ACM Transactions on Reconfigurable Technology and
Systems (TRETS), 13(1), 1–27.
3. Benz, F., Seffrin, A., & Huss, S. A. (2012). Bil: A tool-chain for bitstream reverse-engineering.
In 22nd International Conference on Field Programmable Logic and Applications (FPL) (pp.
735–738). IEEE.
4. Bobda, C., Mbongue, J. M., Chow, P., Ewais, M., Tarafdar, N., Vega, J. C., et al. (2022).
The future of FPGA acceleration in datacenters and the cloud. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 15(3), 1–42.
5. Damschen, M., Bauer, L., & Henkel, J. (2017). CoRQ: Enabling runtime reconfiguration under
WCET guarantees for real-time systems. IEEE Embedded Systems Letters, 9(3), 77–80.
6. Ernst, D., Kim, N. S., Das, S., Pant, S., Rao, R., Pham, T., Ziesler, C., Blaauw, D.,
Austin, T., Flautner, K., et al. (2003). Razor: A low-power pipeline based on circuit-level
timing speculation. In Proceedings. 36th Annual IEEE/ACM International Symposium on
Microarchitecture, 2003. MICRO-36 (pp. 7–7). Citeseer.
7. Giechaskiel, I., Rasmussen, K. B., & Eguro, K. (2018). Leaky wires: Information leakage and
covert communication between FPGA long wires. In ACM Asia Conference on Computer and
Communications Security (ASIACCS) (pp. 15–27).
8. Gnad, D. R., Oboril, F., Kiamehr, S., & Tahoori, M. B. (2016). Analysis of transient voltage
fluctuations in FPGAs. In International Conference on Field-Programmable Technology (FPT)
(pp. 12–19).
9. Gnad, D. R., Oboril, F., & Tahoori, M. B. (2017). Voltage drop-based fault attacks on FPGAs
using valid bitstreams. In International Conference on Field Programmable Logic and
Applications (FPL) (pp. 1–7).
10. Intel: Stratix V GX FPGA development kit (2015).
11. Intel (2017). Quartus Prime Standard Edition Handbook Volume 3: Verification, Chap. 14:
Design Debugging with the Signal Tap Logic Analyzer.
12. Intel: UG-20179: Intel Quartus prime standard edition user guide (2018).
13. Krautter, J., Gnad, D. R., Schellenberg, F., Moradi, A., & Tahoori, M. B. (2019). Active
fences against voltage-based side channels in multi-tenant FPGAs. In IEEE/ACM International
Conference on Computer-Aided Design (ICCAD) (pp. 1–8).
14. Krautter, J., Gnad, D. R., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault attacks
on shared FPGAs, suitable for DFA on AES. IACR Transactions on Cryptographic Hardware
and Embedded Systems (TCHES), 2018(3), 44–68.
15. Krautter, J., Gnad, D. R., & Tahoori, M. B. (2019). Mitigating electrical-level attacks towards
secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable Technology and
Systems (TRETS), 12(3), 12:1–12:26.

16. La, T., Pham, K., Powell, J., & Koch, D. (2021). Denial-of-service on FPGA-based cloud
infrastructures - attack and defense. IACR Transactions on Cryptographic Hardware and
Embedded Systems, 2021(3), 441–464.
17. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems (TRETS), 13(3), 1–31.
18. Li, X., Tessier, R., & Holcomb, D. (2022). Precise fault injection to enable DFIA for attacking
AES in remote FPGAs. In 2022 IEEE 30th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM). IEEE.
19. Luo, Y., Gongye, C., Fei, Y., & Xu, X. (2021). DeepStrike: Remotely-guided fault injection
attacks on DNN accelerator in cloud-FPGA. In 2021 58th ACM/IEEE Design Automation
Conference (DAC) (pp. 295–300). IEEE.
20. Luo, Y., Gongye, C., Ren, S., Fei, Y., & Xu, X. (2020). Stealthy-shutdown: Practical remote
power attacks in multi-tenant FPGAs. In 2020 IEEE 38th International Conference on
Computer Design (ICCD) (pp. 545–552). IEEE.
21. Mahmoud, D. G., Lenders, V., & Stojilović, M. (2022). Electrical-level attacks on CPUs,
FPGAs, and GPUs: Survey and implications in the heterogeneous era. ACM Computing
Surveys (CSUR), 55(3), 1–40.
22. Mahmoud, D. G., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant
FPGAs. In Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1745–
1750).
23. Matas, K., La, T. M., Pham, K. D., & Koch, D. (2020). Power-hammering through
glitch amplification–attacks and mitigation. In IEEE International Symposium on Field-
Programmable Custom Computing Machines (FCCM) (pp. 65–69).
24. Mirzargar, S. S., Renault, G., Guerrieri, A., & Stojilović, M. (2020). Nonintrusive and adaptive
monitoring for locating voltage attacks in virtualized FPGAs. In Cryptology ePrint Archive.
25. Moini, S., Li, X., Stanwicks, P., Provelengios, G., Burleson, W., Tessier, R., & Holcomb,
D. (2020). Understanding and comparing the capabilities of on-chip voltage sensors against
remote power attacks on FPGAs. In 2020 IEEE 63rd International Midwest Symposium on
Circuits and Systems (MWSCAS) (pp. 941–944). IEEE.
26. Moini, S., Provelengios, G., Holcomb, D., & Tessier, R. (2023). Fault recovery from multi-
tenant FPGA voltage attacks. In ACM SIGDA Great Lakes Symposium on VLSI (GLSVLSI)
(pp. 1–6).
27. Moini, S., Tian, S., Holcomb, D., Szefer, J., & Tessier, R. (2021). Power side-channel attacks
on BNN accelerators in remote FPGAs. IEEE Journal on Emerging and Selected Topics in
Circuits and Systems, 11(2), 357–370.
28. Nassar, H., AlZughbi, H., Gnad, D. R., Bauer, L., Tahoori, M. B., & Henkel, J. (2021). Loop-
Breaker: Disabling interconnects to mitigate voltage-based attacks in multi-tenant FPGAs. In
2021 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (pp. 1–9).
IEEE.
29. Pezzarossa, L., Schoeberl, M., & Sparsø, J. (2017). A controller for dynamic partial reconfig-
uration in FPGA-based real-time systems. In 2017 IEEE 20th International Symposium on
Real-Time Distributed Computing (ISORC) (pp. 92–100). IEEE.
30. Provelengios, G., Holcomb, D., & Tessier, R. (2019). Characterizing power distribution attacks
in multi-user FPGA environments. In International Conference on Field Programmable Logic
and Applications (FPL) (pp. 194–201).
31. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power distribution attacks in multitenant
FPGAs. IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 28(12), 2685–
2698.
32. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power wasting circuits for cloud FPGA
attacks. In International Conference on Field Programmable Logic and Applications (FPL)
(pp. 231–235).
33. Provelengios, G., Holcomb, D., & Tessier, R. (2021). Mitigating voltage attacks in multi-tenant
FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 14(2), 1–24.
10 Countermeasures Against Voltage Attacks in Multi-tenant FPGAs 295

34. Provelengios, G., Ramesh, C., Patil, S. B., Eguro, K., Tessier, R., & Holcomb, D. (2019).
Characterization of long wire data leakage in deep submicron FPGAs. In ACM/SIGDA
International Symposium on Field Programmable Gate Arrays (FPGA) (pp. 292–297).
35. Rakin, A. S., Luo, Y., Xu, X., & Fan, D. (2021). Deep-Dup: An adversarial weight duplication
attack framework to crush deep neural network in multi-tenant FPGA. In 30th USENIX
Security Symposium (USENIX Security 21) (pp. 1919–1936).
36. Ramesh, C., Patil, S. B., Dhanuskodi, S. N., Provelengios, G., Pillement, S., Holcomb, D., &
Tessier, R. (2018). FPGA side channel attacks without physical access. In IEEE International
Symposium on Field-Programmable Custom Computing Machines (FCCM) (pp. 45–52).
37. Regazzoni, F., Wang, Y., & Standaert, F. X. (2011). FPGA implementations of the AES masked
against power analysis attacks. Proceedings of COSADE, 2011, 56–66.
38. Rossi, E., Damschen, M., Bauer, L., Buttazzo, G., & Henkel, J. (2018). Preemption of the
partial reconfiguration process to enable real-time computing with FPGAs. ACM Transactions
on Reconfigurable Technology and Systems (TRETS), 11(2), 1–24.
39. Schellenberg, F., Gnad, D. R., Moradi, A., & Tahoori, M. B. (2018). An inside job: Remote
power analysis attacks on FPGAs. In Design, Automation & Test in Europe Conference &
Exhibition (DATE) (pp. 1111–1116).
40. Schmidt, A. G., Huang, B., Sass, R., & French, M. (2011). Checkpoint/restart and beyond:
Resilient high performance computing with FPGAs. In 2011 IEEE 19th Annual International
Symposium on Field-Programmable Custom Computing Machines (pp. 162–169). IEEE.
41. Seifoori, Z., Mirzargar, S. S., & Stojilović, M. (2020). Closing leaks: Routing against crosstalk
side-channel attacks. In Proceedings of the 2020 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (pp. 197–203).
42. Shen, L. L., Ahmed, I., & Betz, V. (2019). Fast voltage transients on FPGAs: Impact and
mitigation strategies. In IEEE International Symposium on Field-Programmable Custom
Computing Machines (FCCM) (pp. 271–279).
43. Stott, E., Levine, J. M., Cheung, P. Y., & Kapre, N. (2014). Timing fault detection in FPGA-
based circuits. In 2014 IEEE 22nd Annual International Symposium on Field-Programmable
Custom Computing Machines (pp. 96–99). IEEE.
44. Sugawara, T., Sakiyama, K., Nashimoto, S., Suzuki, D., & Nagatsuka, T. (2019). Oscillator
without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters, 55(11),
640–642.
45. Terasic Technologies (2018). DE5a-Net user manual.
46. Terasic Technologies (2019). DE10-Pro user manual.
47. Tian, S., & Szefer, J. (2019). Temporal thermal covert channels in cloud FPGAs. In
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (pp. 298–303).
48. Xilinx (2019). UG1182: ZCU102 evaluation board user guide.
49. Xilinx (2019). UG885: VC707 evaluation board for the Virtex-7 FPGA.
50. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In IEEE
Symposium on Security and Privacy (S&P) (pp. 229–244).
51. Zick, K. M., Srivastav, M., Zhang, W., & French, M. (2013). Sensing nanosecond-scale voltage
attacks and natural transients in FPGAs. In ACM/SIGDA International Symposium on Field
Programmable Gate Arrays (FPGA) (pp. 101–104).
Chapter 11
Programmable RO (PRO): A
Multipurpose Countermeasure Against
Side-Channel and Fault Injection Attack

Yuan Yao, Pantea Kiaei, Richa Singh, Shahin Tajik, and Patrick Schaumont

11.1 Introduction

In a physical side-channel attack, an adversary learns secret information by either passively monitoring or actively influencing the implementation of a secure electronic system. While power consumption is a popular target in side-channel attacks,
many other physical quantities have been identified and exploited as side-channel leakage. Besides passive monitoring of circuit behavior, an additional cause
of information leakage stems from targeted faults. By analyzing the corresponding
fault response, an attacker can retrieve the secret information from a target [3].
The most common methods to inject faults include power glitches, clock glitches,
electromagnetic pulses, and laser pulses. Finally, fault injection and side-channel
monitoring can also be used in a combined attack, for example, to break a masking
side-channel countermeasure [44].
Even though many existing works have demonstrated side-channel and fault attack countermeasures, there are no simple circuit-level solutions that address both side-channel and fault attack vulnerabilities in a generic manner. Generally, even individual side-channel or fault countermeasures introduce significant overhead into the design. Moreover, many of the existing countermeasure mechanisms have to be specifically adjusted for the implemented algorithm.

Y. Yao ()
Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and
State University, Blacksburg, VA, USA
e-mail: yuan9@vt.edu
P. Kiaei · R. Singh · S. Tajik · P. Schaumont
Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester,
MA, USA
e-mail: pkiaei@wpi.edu; rsingh7@wpi.edu; stajik@wpi.edu; pschaumont@wpi.edu

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 297
J. Szefer, R. Tessier (eds.), Security of FPGA-Accelerated Cloud Computing
Environments, https://doi.org/10.1007/978-3-031-45395-3_11
298 Y. Yao et al.

In recent years, researchers have further demonstrated that the placement of the
attacker and the victim circuitry on the same chip while sharing a common power
distribution network (PDN) brings new side-channel and fault attack opportunities
[13, 15]. A common PDN intrinsically couples perturbations from the victim's logic to the attacker's logic and vice versa. Therefore, a neighboring
adversary logic can interpret information about the victim operations by monitoring
the changes on the shared PDN [13, 15]. On the other hand, the same physical effect works in the opposite direction: the victim logic can infer malicious operations of its neighboring circuitry by monitoring the shared PDN [25, 32, 40]. Therefore, in order to
guarantee the security of the PDN, a monitoring sensor network on the PDN should
be built to detect ongoing attacks. The monitoring sensor network should fulfill two requirements: large spatial coverage, i.e., covering the full PDN area, and
large temporal coverage, i.e., continuously monitoring the PDN [38].
Previously, a ring oscillator (RO) was often used by silicon design houses as a
test structure or on-chip sensor to monitor the technology and circuit performance
[19]. In a cloud context, where an FPGA circuit can host multiple applications, such
test structures are prohibited because of their potential impact on the shared PDN.
Cloud providers can use tools to scan FPGA bitstreams and automatically flag such
malicious structures [9, 24]. On the other hand, the myriad of ways to create an
oscillating structure in digital logic makes no detection technique foolproof. For
example, latches or flip flops can be inserted in the feedback structure of the ring
oscillator to prevent detection [39].
A multipurpose RO-based on-chip sensor design that adds resistance against both side-channel and fault attacks has not yet been investigated. In this
chapter, we introduce a new multipurpose RO design—a programmable RO (PRO).
With a low overhead, the proposed PRO can provide the following solutions within
the same structure:
• Active side-channel hiding countermeasure
• On-chip power monitoring
• Fault injection monitoring
The proposed PRO design has multiple configurations of oscillation frequency,
which are under the control of the user (i.e., the defender). Each PRO has its own
counter that can be read to calculate the PRO’s frequency by comparing it with
a reference counter. We first demonstrate that with low overhead, an individual
PRO can provide sufficient disturbance to the power consumption to hide the side-channel
leakage of the secret information in the system. Moreover, we further demonstrate
that by combining multiple PROs into an array and by placing them within the
module under protection, a secure on-chip monitoring network can be constructed to
monitor the power fluctuations on the PDN to detect abnormalities and fault attacks.
Figure 11.1 shows the overall structure of the PRO-based on-chip secure system.
The PROs are evenly placed on the chip to form a secure on-chip network. The PRO
secure network can be controlled by the external user configuration. The user can
turn on the side-channel analysis (SCA) countermeasure by configuring the PRO to
oscillate at randomized frequencies. In addition, the user can monitor the
11 Programmable Ring Oscillator (PRO) 299

Fig. 11.1 A sensitive hardware module can be protected by a grid of programmable ring
oscillators (PROs) that monitor power integrity, detect EM fault injection, and detect side-channel
leakage

oscillation frequency of each PRO in the array by reading out its corresponding
counter value. We demonstrate that by monitoring the frequency change of the
PROs, on-chip local power attacks and EM fault injections can be detected.
The proposed design can be used on any secure module, from small hardware
accelerators to complex system-on-chips (SoCs). To the best of our knowledge, our
work is the first to comprehensively study the potential of RO-based designs in SCA
countermeasure, power sensing, and fault detection.

11.1.1 Adversary Model

PRO addresses adversaries with the side-channel and fault attack capabilities listed in the following.

11.1.1.1 Side-Channel Attacker Model

We consider two attacker models. The first attacker model has physical access to
the device, which enables the attacker to control the input data and monitor the
power dissipation by shunting the device’s power supply. The second attacker model
works remotely; the attacker circuit shares a PDN with the victim circuit, and the
attacker can control only the attacker circuit remotely. Therefore, the attacker is
able to implement malicious logic to monitor the changes on the shared PDN and
measure the power consumption of the device [37, 46]. This enables the attacker to
perform side-channel attacks, such as simple power analysis (SPA) [29], differential

power analysis (DPA) [20], and correlation power analysis (CPA) [5], to retrieve the
secret information used in the victim circuit.

11.1.1.2 Fault Attacker Model

We also assume the adversary can induce faults into the victim circuit by stressing
the electrical environment, such as injecting clock glitches, power glitches, and
EM glitches. These glitches can induce targeted transient faults that can flip bits,
change the control flow of the secure algorithm, set or reset the circuit, etc.
Fault injection can be done either by exerting disturbance to the circuit directly,
which requires the adversary to have physical access to the device, or by having
remote access to the shared cloud computing environment with the victim circuit
[1, 23, 28, 36]. The exact fault effects on the circuit depend highly on the fault injection parameters, the victim circuit's architecture and algorithm, and the fault injection technique. By monitoring the fault response of the circuit after injecting targeted faults, the adversary can retrieve the secret information by performing differential fault analysis (DFA) [4], statistical fault analysis (SFA) [10], or instruction skip attacks [43]. PRO, as a secure on-chip add-on, can be integrated into the circuit to protect against the aforementioned attackers. Adversaries may try to tamper with the
PRO sensor itself to bypass the PRO’s security mechanisms, but we do not consider
this adversary model within this work.

11.1.1.3 Chapter Organization

The structure of the chapter is as follows. The next section reviews related work on ring oscillators and highlights our contribution. Section 11.3 describes our proposed PRO design. In Sect. 11.4, we explain and demonstrate the effectiveness of PRO as a side-channel countermeasure. Next, we present the PRO's power sensing functionality in Sect. 11.5. We further show that PRO can detect power faults and EM faults in Sect. 11.6. Finally, we conclude the chapter in Sect. 11.7.

11.2 Related Work

When sharing the same PDN, seemingly unsuspecting parts of the implemented
logic can perform adversarial operations on the other parts. In this chapter, our focus
is on two categories of adversarial operations: fault injection and power side-channel
analysis. In the following, we categorize the related work into three parts: using
on-chip logic as a countermeasure against power SCA, using on-chip sensors as
power sensors to detect power perturbation, and using on-chip sensors to detect
fault injection attacks.

11.2.1 On-Chip Sensors as a Countermeasure Against Power SCA

Liu et al. [27] use an array of ROs, randomly switched on and off, to dynamically
hide the power consumption of AES SBoxes and hinder the first-order DPA. Simi-
larly, Krautter et al. [22] use ROs as a power-based SCA mitigation methodology. In
their work, the part of the implementation that needs to be protected is surrounded
by a network of ROs. By switching an arbitrary number of the ROs on and off, the
signal-to-noise ratio (SNR) in power traces decreases, and therefore, the number of traces required for a successful attack increases. This approach is called hiding side-channel leakage. However, the ROs in both designs run at a fixed oscillation frequency, and thus, only a single-frequency noise is injected. In this case, it is straightforward for an attacker to apply post-processing techniques to remove the noise effect. To avoid this weakness, PRO uses user-controlled but random frequency changes (Sect. 11.4). Moreover, to further reduce the overhead,
we show how a simple modification can enhance the countermeasure efficacy.

11.2.2 On-Chip Sensors to Detect or Cause Power Perturbation

Zick et al. [47] use ROs to measure on-chip voltage variations. Indeed, the
oscillation frequency is proportional to the supplied voltage on the PDN. To measure
the frequency of an RO accurately, counters are required that are clocked with the
output of the RO. This limits the maximum sample rate attainable by the RO counter
structure, and hence, the bandwidth of the side-channel signal. This limitation
has motivated research on other voltage-sensitive time-to-digital converter (TDC)
methods. For instance, Gnad et al. [14] use carry-chain primitives available on
Xilinx FPGAs as TDCs. However, the use of carry-chain primitives makes their
approach specific to certain FPGA families. Similar TDC structures have been
explored in the context of CMOS design simulation to measure the operating voltage
of a chip [2].
Moreover, ROs have been used in offensive scenarios affecting the PDN for
both passive (power-based) and active (fault-injection-based) physical attacks. As
an example of power-based SCA, Zhao et al. [46] presented on-chip power monitors
with ROs. They demonstrated that ROs can be used as a power monitor to observe
the power consumption of other modules on the FPGA or SoC. Using their power
monitor, they captured power traces of the device running the RSA algorithm and
were able to successfully find the private key by applying SPA. Gravellier et al. [16]
perform CPA on power traces acquired with RO-based power sensors.
To inject timing faults, Mahmoud et al. [28] employ ROs to increase the voltage
drop on the power network and lower the voltage level. Effectively, they make the
victim chip slower, causing timing faults. Similar attacks have been shown in other
works [1, 23, 36].

11.2.3 On-Chip Sensors to Detect Fault Injection

Next, we consider on-chip sensors for fault detection. Miura et al. [30] present
a sensor consisting of a phase-locked loop (PLL) and ROs. In their work, ROs are
routed in a specific way to ensure their path travels through most parts of the chip.
Once an EM fault is injected, the path delay of the ROs will be affected, resulting
in changes in the RO phase. The PLL logic can capture this phase disturbance and
detect the ongoing fault injection. Similarly, He et al. used a PLL block to detect laser disturbance of the RO oscillation frequency [17].
Provelengios et al. [36] show that on-chip ROs can not only detect fault injection but also locate the origin of the fault injection. With a similar structure, RON [45]
builds a ring oscillator network, distributed across the entire chip, to detect hardware
Trojans. Their work confirmed that RO-based power sensors can have a sufficiently
high sample rate to detect fluctuations on the PDN.
However, the scope of their work is limited to power fault detection, whereas, in our work, we further investigate EM fault detection (Sect. 11.6). Additionally, the unique programmable design of our proposed RO structure also enables its usage as a power SCA countermeasure (Sect. 11.4).

11.2.4 Our Contribution

In general, each previous work addresses one single aspect at a time: a side-channel
countermeasure, a power monitor, or a fault detector. In practice, an adversary
is capable of performing a combination of attacks. Hence, it is crucial to find a
security mechanism that encapsulates protection against these attacks. Our goal in
this chapter is therefore to design a programmable RO structure that can provide the
following functionalities within the same structure:
1. Hiding protection against power-based SCA
2. On-chip power monitoring of the fluctuations on the PDN
3. Detecting fault injection
To the best of our knowledge, this is the first work to comprehensively investigate
the RO’s potential in addressing all these three aspects. In the following sections, we
introduce our proposed design and demonstrate through experiments the capability
of the proposed system. Even though we demonstrate our experiments on an FPGA prototype, our design is not limited to FPGAs and can be extended to other electronic chips.

11.3 Programmable RO Design

11.3.1 Background

In this section, we introduce our programmable RO (PRO) sensor design. As shown in Fig. 11.2, an RO's output oscillation frequency depends on the propagation delay of its internal signals. In each oscillation period of an RO, the signal has to propagate twice through the propagation path. Therefore, the oscillation period T_RO of an RO is T_RO = 2 · T_prop, and its frequency follows the equation:

    f_RO = 1 / (2 · T_prop)    (11.1)

More specifically, the propagation delay path is composed of an odd number of inverters, and each inverter contributes to the delay of the path. If t represents the average delay of an individual inverter plus the fanout interconnect delay to the next inverter, t_AND represents the delay through the AND gate and its fanout interconnect delay, and n denotes the number of inverters in the chain, then the frequency of the RO can be approximated by the following equation:

    f_RO = 1 / (2 · (n · t + t_AND))    (11.2)

Hence, the frequency of an RO can be controlled by adjusting the number of stages in the inverter chain.
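As a quick numeric illustration of Eq. (11.2), the sketch below evaluates the approximation for a short and a long inverter chain. The per-stage delays are assumed round numbers (FPGA routing makes the fixed AND/mux contribution relatively large), chosen only to land near the 22–123 MHz range reported later in this chapter; they are not measured library values.

```python
def ro_frequency(n, t_inv, t_and):
    """Approximate RO frequency per Eq. (11.2): f_RO = 1 / (2 * (n*t_inv + t_and)).

    n     -- number of inverters in the chain (odd, so the loop oscillates)
    t_inv -- average inverter delay plus fanout interconnect delay, in seconds
    t_and -- AND-gate delay plus its fanout interconnect delay, in seconds
    """
    assert n % 2 == 1, "an RO needs an odd number of inversions"
    return 1.0 / (2.0 * (n * t_inv + t_and))

# Assumed delays for illustration only:
T_INV, T_AND = 0.4e-9, 3.6e-9
print(f"{ro_frequency(57, T_INV, T_AND) / 1e6:.1f} MHz")  # longest chain -> lowest frequency
print(f"{ro_frequency(1, T_INV, T_AND) / 1e6:.1f} MHz")   # shortest chain -> highest frequency
```

With these assumed delays, the 57-stage chain oscillates near 19 MHz and the single-stage chain near 125 MHz, matching the qualitative behavior of the measured design.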

11.3.2 PRO Design and Configuration

In this chapter, we aim to have a programmable design of the RO that gives the
designer the flexibility to choose the RO oscillation frequency.
Figure 11.3 shows the basic structure of our proposed design of the pro-
grammable sensor. The PRO consists of multiple delay cells. Each delay cell
includes two delay paths: one consisting of inverters and the other a shorting path
that bypasses the inverters. The multiplexer in the delay cell can control the delay
cell’s propagation delay by selecting between the delay path and the shorting path
with the control input signal SEL. Each delay cell has its independent control signal.

Fig. 11.2 Propagation delay of a ring oscillator

Fig. 11.3 PRO design. D0 denotes delay cell type-0, D1 denotes delay cell type-1, and D2 denotes delay cell type-2

Suppose N inverters are configured in the delay cell. When SEL = 1, the delay path is selected, and when SEL = 0, the shorting path is selected. The propagation delay of each delay cell T_C is therefore

    T_C = SEL · T_d + (1 − SEL) · T_s,    (11.3)

where T_d denotes the propagation delay of the delay path and T_s denotes the propagation delay of the shorting path. The propagation delay of the shorting path T_s is small compared to T_d but not 0, because of the routing delay and the delay of the multiplexer. Other user control inputs include EN, which controls
whether PRO is enabled (oscillating) or not, and a control signal to reset/read the
PRO counter. The structure of the PRO design gives the designer flexibility in multiple ways. As shown in Table 11.1, there are multiple initial structural configurations
to be decided by the hardware designer at design time, including the number of
inverters per delay cell, as well as the number and type of different delay cells. These
parameters determine the range of the programmable RO’s oscillation frequency and
the number of frequency configurations the programmable RO can have.
Several constraints can be used as the guidance while configuring the initial
design configurations of PRO:
1. Oscillation frequency range
2. The number of configurations
3. Size of frequency changing step
4. Area
As a starting point of PRO parameter configuration, the designer should estimate
the propagation delay T_prop for a single inverter. This knowledge can be obtained

Table 11.1 Configurations for PRO

Configuration type             Configurations
Initial design configurations  The number of delay cell types, the number of delay cells, the number of stages in delay cells
User configurations            EN, SELs, counter start/stop

through the design library, timing simulation, or measuring the RO's oscillation frequency with a single inverter (when working in an FPGA environment). Then, based on the designated oscillation frequency range of the PRO, the designer can calculate the minimum and maximum number of inverters that are needed using Eq. (11.2). After deciding the number of inverters needed, the designer can group the inverters into different types of delay cells based on the designated frequency step size and the number of configurations that are needed. Theoretically, more inverters result in a larger oscillation frequency range at the cost of a larger PRO area. Therefore, based on the area of the design to be protected, the designer should decide the area constraint for the PRO, so that each PRO provides good spatial coverage of the design while not being so close to other PROs that it influences their local power distribution.
Next, to better explain our proposed structure, we pick one configuration as an
example. Figure 11.3 shows the structure of the PRO with three types of delay cells.
The type-0 delay cell (D0) has 4 inverters, the type-1 delay cell (D1) has 8 inverters,
and type-2 delay cell (D2) has 16 inverters. We instantiated 2 of each type of delay
cells in the inverter chain. All the delay cells have an even number of inverters, and
1 inverter is instantiated at the start of the inverter chain to make sure that there is
always an odd number of inverters in the inverter chain. When all the inverters are
configured to be used in the delay path, the propagation delay T_prop is maximal, and therefore, the overall programmable RO will oscillate at its lowest frequency. When all the delay cells are configured to use the shorting path, the propagation delay T_prop is minimal, and therefore, the overall programmable RO will oscillate
at its highest frequency.
In our experiment setup, we implement PRO on a Xilinx Spartan-6 FPGA,
which is fabricated with 45 nm CMOS technology. Under the aforementioned
configuration, we measured a low oscillation frequency of 22 MHz and a high
oscillation frequency of 123.44 MHz. Since each delay cell’s SEL is independent,
there are in total 15 frequency configurations consisting of {1, 5, 9, ..., 57} inverters.
Since there are six SEL signals, there are 64 configurations in total that redundantly
map into the 15 achievable configurations. Through this redundancy, we are able
to estimate the local manufacturing process variations, which is helpful in deciding
when a deviation should be cause for alarm (i.e., fault detection) or not.
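The redundant mapping from the 64 SEL settings to the 15 achievable chain lengths can be checked with a short enumeration. The sketch below models the configuration described above, two delay cells each of 4, 8, and 16 inverters plus the single fixed inverter, as a software model (not the hardware itself):

```python
from itertools import product

CELL_SIZES = [4, 4, 8, 8, 16, 16]  # two delay cells each of types D0, D1, D2

def inverter_count(sel):
    """Inverters in the loop for one 6-bit SEL setting.

    sel[i] == 1 selects the delay path of cell i; 0 selects its shorting path.
    The fixed leading inverter keeps the total odd.
    """
    return 1 + sum(size for size, bit in zip(CELL_SIZES, sel) if bit)

counts = {inverter_count(sel) for sel in product((0, 1), repeat=6)}
print(sorted(counts))  # 15 distinct lengths {1, 5, 9, ..., 57} from 64 settings
```

Every chain length of the form 4k + 1 between 1 and 57 is reachable, which confirms the 64-to-15 redundancy stated above.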

The designers can control the RO’s frequency by setting the input value of SEL.
We are using the same configurations for all subsequent experiments in this chapter.
Under this PRO configuration, each PRO can be implemented with 128 lookup tables (LUTs) and 32 registers, 160 slices in total. In our experimental setup, a PRO array with 36 PROs can cover the entire Spartan-6 FPGA (46,648 LUTs and 93,296 registers, 139,944 slices in total) to provide whole-chip power monitoring and fault detection. Therefore, an overhead of only 4.1% is introduced.
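As a quick sanity check, the quoted overhead follows directly from these resource figures:

```python
# Resource figures quoted above for the Spartan-6 setup.
slices_per_pro = 160        # 128 LUTs + 32 registers per PRO
num_pros = 36               # PROs needed to cover the whole chip
total_slices = 139_944      # 46,648 LUTs + 93,296 registers

overhead = num_pros * slices_per_pro / total_slices
print(f"{overhead:.1%}")    # -> 4.1%
```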

11.3.3 PRO Integration and Basic Principles

As a security resistance add-on, PRO can be integrated into the design to protect
simple designs such as hardware encryption engines as well as complex systems
such as an SoC. Control signals are needed for communicating with the PRO.
The control signals set up the user control configurations in Table 11.1. Generally,
different control mechanisms can be adopted by the designer. In an SoC, the
designer can add PROs as a co-processor that can be controlled by the processor
through memory-mapped registers. Under this environment, the software running
on the processor can configure the PROs on the chip. Therefore, PRO-based coun-
termeasures can be dynamically enabled/disabled while the software is running. Alternatively, a hardware-based finite-state machine (FSM) can be used to control the PROs. In the experiments in this chapter, we use the UART protocol to communicate with the PRO, and the control signals sent through the UART are generated by a Python script.
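To make the control path concrete, the sketch below packs the user configurations of Table 11.1 into a single UART command byte. The bit layout and byte framing are hypothetical, invented for this illustration; the chapter does not specify the actual protocol.

```python
def pack_pro_command(enable, sel_bits, counter_run):
    """Pack hypothetical PRO control fields into one command byte.

    Bit 7: EN, bit 6: counter start/stop, bits 5..0: the six SEL signals.
    (This layout is an assumption for illustration only.)
    """
    assert len(sel_bits) == 6 and all(b in (0, 1) for b in sel_bits)
    value = (enable & 1) << 7 | (counter_run & 1) << 6
    for i, bit in enumerate(sel_bits):
        value |= bit << (5 - i)
    return bytes([value])

# Enable the PRO with all delay paths selected and the counter running:
cmd = pack_pro_command(enable=1, sel_bits=[1] * 6, counter_run=1)
print(cmd.hex())  # -> 'ff'
```

The resulting byte would then be written to the serial port that drives the PRO's control registers.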
Figure 11.4 shows the high-level basic principles for the fault detection mech-
anism using the PROs. The counter value will be evaluated at the end of each

Fig. 11.4 The basic principle for PRO fault detection is to look for inconsistency in the
accumulated period count over a given time interval

monitoring interval and compared with the reference counter value to get the
actual oscillation frequency of the PRO. Under normal circumstances, each PRO
oscillates at a certain constant frequency, and thus, its counter value will increase
linearly during the monitoring interval. There will be some small variations caused by environmental changes, jitter, manufacturing process variation, etc.
A characterization procedure, therefore, is needed to define the range of normal
operation [38]. However, in the occurrence of instant fault injection (e.g., power
glitch, EM pulse, time glitch, laser pulse), the counter will be disturbed. The counter
value read out at the evaluation time will deviate from the normal range, and thus,
a pulse fault injection will be detected by the PROs. Additionally, an adversary can
inject timing faults by stressing the PDN continuously (e.g., power starving). As a
result, the victim circuit will operate slower and cause timing violations to create
faults. In this case, the PRO counter value will also deviate from the normal value
and capture the fault injection event. In this chapter, we use power and EM faults as illustrations, but PRO's fault detection coverage is not limited to these two fault types.
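The detection principle can be sketched as a simple range check: characterize the normal counter value over fault-free monitoring intervals, then flag any reading outside the characterized band. The counter values and the six-sigma threshold below are illustrative assumptions, not data from the chapter's characterization.

```python
from statistics import mean, stdev

def characterize(baseline_counts, k=6.0):
    """Derive a normal-operation band (mean +/- k*sigma) from fault-free readings."""
    mu, sigma = mean(baseline_counts), stdev(baseline_counts)
    return mu - k * sigma, mu + k * sigma

def is_fault(count, band):
    """Flag a counter reading that falls outside the characterized band."""
    lo, hi = band
    return not (lo <= count <= hi)

# Synthetic fault-free counts for one monitoring interval, with small jitter:
baseline = [10_000 + d for d in (-3, -1, 0, 1, 2, -2, 3, 0)]
band = characterize(baseline)

print(is_fault(10_002, band))  # normal variation          -> False
print(is_fault(9_100, band))   # glitch slows the RO       -> True
```

A sharp drop in the count models a pulse that momentarily stalls the oscillator, while a sustained shift would model power starving; both fall outside the band.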

11.4 Side-Channel Countermeasure

Masking and hiding are two popular techniques for side-channel countermeasures.
In masking, each secret variable is split into two or more shares that are concealed by
random numbers [6]. The side-channel leakage of each share alone does not reveal
the secret variable because of the randomization introduced by random numbers.
A random source that provides fresh random values is critically important in
masking implementations. Hiding countermeasures reduce the SNR for secret data-
dependent operations. Hiding can be achieved by several techniques, such as by
reshuffling cryptographic operations [41], inserting random delays [7], and running
multiple tasks in parallel [37]. In this chapter, we utilize the proposed PRO design
as a hiding countermeasure by injecting noise with random frequency. Previous
work has proposed injecting noise to reduce the SNR [8, 26]. However, since only
single-frequency noise is injected, it is not difficult for an attacker to reduce the
effect of the noise, either by using a band-pass filter while collecting traces or by
post-processing the collected power traces (e.g., by averaging, filtering, and
frequency-domain analysis). Thus, previously proposed noise-injection-based hiding
mechanisms still have security flaws. In our proposed design, we inject random-
frequency noises with the PRO design so that it will be much harder for an adversary
to eliminate the noise.
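As a rough sketch of such a frequency-hopping control loop, the following assumes a hypothetical UART command byte format (0x80 | SEL) and hop interval; neither is the chapter's actual protocol:

```python
import random
import time

# Sketch of the random frequency-hopping control loop: every hop interval,
# a new SEL configuration is drawn at random and sent to the PRO over UART.
# The command byte format and the uart/send_cmd interface are hypothetical.

HOP_INTERVAL_S = 0.002      # e.g., 2 ms between frequency changes
NUM_SEL_CONFIGS = 16        # number of selectable PRO frequency settings

def send_cmd(uart, sel):
    """Send a hypothetical 'set frequency' command byte (0x80 | SEL)."""
    uart.write(bytes([0x80 | sel]))

def hop_frequencies(uart, duration_s):
    """Randomly re-configure the PRO frequency until duration_s elapses."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        send_cmd(uart, random.randrange(NUM_SEL_CONFIGS))
        time.sleep(HOP_INTERVAL_S)
```

With pyserial, `uart` could be a `serial.Serial` instance; as noted below, an on-chip PRNG could replace the host-side random source entirely.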
The countermeasure circuit consists of a single PRO whose frequency can be
controlled by the SEL input signals. The PRO drives one of the I/O pins on the
board. As demonstrated in previous work [21, 26], the power consumed by a single
RO is not large enough to have a significant influence on the power profile of a
complete chip or a complete cipher. Instead, hundreds of ROs need to be instantiated
on the chip to have a profound hiding influence. This approach will cause significant
design overhead and has the potential risk of inducing power faults into the circuit
[18].

308 Y. Yao et al.

Fig. 11.5 Experimental setup for evaluating RO's performance in side-channel leakage hiding

In our proposed mechanism, by driving an I/O pin with a PRO, the effect
of a single (randomly switched) PRO to influence the off-chip power network is
amplified. Since the load capacitance of an I/O pin is much larger than the load
capacitance of an internal FPGA net, the I/O pin requires more energy to charge
and discharge. In this manner, even with a single programmable ring oscillator
(PRO), sufficient additional power is consumed to affect the power consumption
characteristic (see Fig. 11.6). In practice, an adversary senses the on-chip power
consumption using a probe, either by connecting an external probe to the system
via a power supply pin [42] or else using an EM probe. Both of these probing
approaches are dependent on the off-chip power network, and therefore, affecting
the off-chip power network is an important factor to defeat an attacker maliciously
monitoring the power profile.
The performance of our proposed hiding countermeasure design is evaluated with
AES-128. Figure 11.5 shows our experimental setup. We implement a hardware
AES core as well as the programmable sensor in the FPGA. The output signal
of the PRO is mapped to drive the I/O pin to amplify the noise effect. For each
encryption scenario, plaintext and ciphertext are provided through the UART for
AES. The communication procedure is controlled by the AES control script. At
the same time, we use the sensor control script to send in control signals through
the PRO UART (Fig. 11.5). The control signals can enable/disable the RO and
configure the oscillation frequency of the RO. While AES is running, the sensor’s
control script generates random numbers for the frequency configuration so that the
frequency of the PRO can change randomly. Equally, an on-chip pseudo-random
number generator (PRNG) can be used for this purpose. Figure 11.6a shows the
collected AES power trace when the programmable sensor is off. We can clearly
see the pattern of ten rounds of the AES algorithm. By comparison, the power trace
changes to a repeated oscillation pattern when the PRO is turned on, as shown in
Fig. 11.6b, which indicates the strong influence of the PRO on the power profile.

Fig. 11.6 AES power traces when PRO is (a) off; (b) on

Under
our setup, the complete AES takes 41 ms, and we configure the PRO control script
such that the frequency of the PRO changes every 2 ms, which means that the PRO’s
frequency will change at least 20 times while AES is running. Figure 11.7a shows
the frequency spectrum of the power traces when the PRO is off. We can observe
small peaks at the clock frequency (24 MHz). We do not observe a significant
influence on the power spectrum if we only place a single PRO without driving the I/O
pin. Figures 11.7c and d show the power spectrum when the PRO is on and driving
the output pin. By comparing to Fig. 11.7a, a significant influence on the frequency
spectrum of the power profile can be observed, while the PRO is on. Figure 11.7c
shows a sharp peak when the PRO's oscillation frequency is fixed to 120 MHz.
A single-frequency RO is a poor noise-injection countermeasure. Indeed, it is
easy for the attacker to implement frequency spectrum analysis, find the injected
noise frequency, and apply the corresponding filter to eliminate the influence of
the injected protection noise. In sharp contrast, Fig. 11.7d shows that when
random-frequency noise is injected by the PRO, the frequency spectrum is spread
across the PRO's oscillation range from 22 MHz to 123 MHz. This makes it much
harder for the adversary to filter out the noise by post-processing. To further evaluate
the effectiveness of the proposed design in increasing side-channel resistance, we
applied TVLA [12] to 50,000 collected traces; as shown in Fig. 11.8, a dramatic
decrease in t-value can be observed when the PRO is turned on compared to when
the PRO is off. This indicates that the PRO design can significantly reduce the side-
channel leakage of the circuit.
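TVLA reduces to Welch's t-test between two trace populations (e.g., fixed vs. random plaintexts), with |t| > 4.5 at any sample point as the conventional leakage threshold; a minimal sketch on synthetic traces, where the data and the leaking sample are invented purely for illustration:

```python
import numpy as np

# Minimal Welch's t-statistic, as used in TVLA leakage assessment.
# |t| > 4.5 at any sample index is the conventional leakage threshold.

def welch_t(traces_a, traces_b):
    ma, mb = traces_a.mean(axis=0), traces_b.mean(axis=0)
    va, vb = traces_a.var(axis=0, ddof=1), traces_b.var(axis=0, ddof=1)
    return (ma - mb) / np.sqrt(va / len(traces_a) + vb / len(traces_b))

rng = np.random.default_rng(0)
# Synthetic example: population B "leaks" a small offset at sample 10.
a = rng.normal(0.0, 1.0, size=(5000, 100))
b = rng.normal(0.0, 1.0, size=(5000, 100))
b[:, 10] += 0.5

t = welch_t(a, b)
assert abs(t[10]) > 4.5    # the leaky sample exceeds the TVLA threshold
```

Effective hiding pushes the t-statistic back toward the noise floor, which is exactly the drop seen when the PRO is enabled.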

Fig. 11.7 Power spectrum for power traces when (a) PRO off; (b) PRO on without driving I/O
pin; (c) PRO with fixed oscillation frequency and driving IO pin; (d) PRO with random oscillation
frequency and driving I/O pin

Fig. 11.8 T-value comparison when PRO is on/off

Note Generally, even though the adversary is aware of the noise signal, since
the noise is injected by the PRO at a random frequency that also changes at a
fast pace, it is exceedingly hard to remove its effect with normal post-processing
techniques. The adversary would need to monitor both the power consumption and the
output of the PRO simultaneously with sufficient precision and be able to
remove the part of the power consumption related to the output pad's oscillation
using noise-cancelation techniques, which requires high-end equipment. Additionally,
to obtain enough information to perform a successful side-channel analysis from
the collected side-channel traces, the sampling rate for side-channel attacks has to
be at least 2× the clock frequency (according to the Nyquist theorem). We suggest
that, when choosing the initial design configurations in Table 11.1, designers
adjust them such that the oscillation frequency range of the
PRO covers at least 3× the clock frequency. As a result, the adversary will need
a higher-end device with a much higher sampling frequency (at least 10× the
clock frequency) to successfully apply the same side-channel attack. Hence, the PRO
as a hiding countermeasure makes the circuit much harder to attack by substantially
raising the technical bar for the adversary.

11.5 Power Sensing

In this section, we demonstrate the on-die power monitoring functionality of the
proposed PRO design. Power integrity is essential to guarantee the nominal function
of the circuit. Therefore, monitoring of the fluctuations on the PDN is critical to
detect abnormalities. We first explore the PRO’s oscillation frequency with regard
to the external power deviation. Then, we look into the PRO’s performance in terms
of local power sensing on the PDN of the chip.
Electric circuits use PDNs to deliver power to the transistors in the circuit via
external voltage regulators. PDNs are still affected by sudden current consumption
changes despite these voltage regulators. Thus, a sudden change in the switching
activity induces transient voltage drops in the PDN. PDNs can be modeled using
RLC networks. The transient voltage drop seen by the PDN can be defined as
follows:
V_drop = IR + L di/dt    (11.4)
Here, the IR term is the voltage drop due to the resistive components of the PDN and
depends on the steady-state current I. The other term, L di/dt, is the voltage drop
due to the inductive components of the PDN and depends on the changes in the current
over time. Hence, as soon as there is a high variation in current consumption due to
switching activity of the logic circuit, the voltage drop will increase.
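A toy calculation with Eq. (11.4) shows why the inductive term dominates during fast transients; the R, L, and current-step values below are illustrative, not measurements from any real setup:

```python
# Toy illustration of Eq. (11.4): V_drop = I*R + L*(di/dt).
# The R, L, and current-step values are illustrative, not measurements.

def v_drop(i_steady, r, l, di, dt):
    """Transient PDN drop: resistive IR term plus inductive L*di/dt term."""
    return i_steady * r + l * (di / dt)

# Steady state: 1 A through 10 mOhm of PDN resistance -> a 10 mV IR drop.
ir_only = v_drop(1.0, 0.010, 1e-9, 0.0, 1e-9)

# Transient: a 0.5 A step in 1 ns through 1 nH adds a 0.5 V inductive
# spike, which is why sudden switching activity threatens power integrity.
with_spike = v_drop(1.0, 0.010, 1e-9, 0.5, 1e-9)
assert with_spike > 50 * ir_only
```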
The propagation delay of signals is affected by the on-chip voltage level; higher
voltage levels increase the switching speeds of transistors, whereas lower voltage
levels decrease them. Since the voltage level affects the propagation delay of signals,
the immediate frequency of a ring oscillator can indicate the level of the voltage on
a chip. We take advantage of this property in our proposed PRO sensor to monitor
the integrity of the on-chip PDN.

Fig. 11.9 Experimental setup for PRO frequency changing as a function of external power supply

11.5.1 PRO Power Sensing with Regard to External Power Variations

We first investigate the PRO’s frequency with respect to external power variations.
Figure 11.9 shows the setup for this experiment scenario. We put a single PRO
sensor on the FPGA. For the PRO’s frequency measurement, we start the PRO
sensor’s counter and system clock counter CTR at the same time. After running
for an arbitrary amount of time T_arb, we read out the RO sensor's counter value
C_RO and the reference system clock counter value C_clk through the UART. Then,

we calculate the PRO’s frequency by

f_PRO = (C_RO / C_clk) · f_clk ,    (11.5)

where f_PRO is the PRO sensor's oscillation frequency, and f_clk is the reference clock
frequency. We measure the value of C_RO 1000 times and take the average for
better precision. The measurement procedure is automated through a control script
running on a PC.
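The computation of Eq. (11.5), including the repeated-reading average, can be sketched as follows; the `read_counters` callable standing in for the UART readout and the example counter values are hypothetical:

```python
# Sketch of the frequency measurement of Eq. (11.5):
# f_PRO = (C_RO / C_clk) * f_clk, averaged over repeated readings.
# read_counters(), standing in for the UART readout, is hypothetical.

F_CLK_HZ = 24e6   # reference system clock (the 24 MHz board clock)

def pro_frequency(c_ro, c_clk, f_clk=F_CLK_HZ):
    return (c_ro / c_clk) * f_clk

def average_frequency(read_counters, n=1000):
    """Average n readings for better precision, as done in the chapter."""
    total = 0.0
    for _ in range(n):
        c_ro, c_clk = read_counters()
        total += pro_frequency(c_ro, c_clk)
    return total / n

# Example: C_RO = 5000 counts while C_clk = 1200 counts at a 24 MHz
# reference implies the PRO oscillates at roughly 100 MHz.
assert abs(pro_frequency(5000, 1200) - 100e6) < 1.0
```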
To investigate the PRO’s power sensing sensitivity when operating under dif-
ferent frequencies, we set the PRO sensor to several oscillation frequencies at the
starting (highest) power supply voltage for the main FPGA core (1.33 V). The
frequency configurations we pick are 153 MHz, 100 MHz, 66 MHz, 40 MHz, and
27 MHz. We gradually decrease the FPGA’s supply voltage and monitor the PRO
sensor’s oscillation frequency.
Figure 11.10 shows the result of the PRO oscillation frequency with regard to the
external supply voltage. As shown in the figure, when the external supply voltage
drops, the PRO’s frequency drops steadily. The PRO’s oscillation frequency reflects
the power supply voltage, and therefore, it can sense the power supply changes and
can be used for power monitoring. With respect to the sensitivity of power sensing, it
can be observed that the higher the oscillation frequency is, the sharper the slope of
the frequency vs. the external supply voltage line will be. This indicates that a higher
oscillation frequency can achieve higher sensitivity in detecting power variations.
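The sensitivity comparison can be quantified by fitting a line to each configuration's frequency-versus-voltage points; the voltage and frequency samples below are illustrative numbers, not the measured data of Fig. 11.10:

```python
import numpy as np

# Sketch: fit frequency-vs-supply-voltage lines and compare their slopes.
# The voltage/frequency samples are illustrative, not measured data.

volts = np.array([1.33, 1.25, 1.15, 1.05])
f_fast = np.array([153.0, 146.0, 138.0, 129.0])   # MHz, high-frequency config
f_slow = np.array([27.0, 26.2, 25.1, 24.0])       # MHz, low-frequency config

slope_fast = np.polyfit(volts, f_fast, 1)[0]      # sensitivity in MHz per volt
slope_slow = np.polyfit(volts, f_slow, 1)[0]

# A higher oscillation frequency gives a steeper slope, i.e., higher
# sensitivity to supply-voltage variations.
assert slope_fast > slope_slow > 0
```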

Fig. 11.10 PRO’s oscillation frequency with regard to external power supply voltage

11.5.2 PRO Power Sensing with Regard to On-die Local Power Variations

After investigating the correspondence between the PRO sensor’s oscillation fre-
quency and the variations of external power, we evaluate the power sensing
performance with regard to the on-die local power variations. Previous work
has shown that RO-based power wasters can cause a local power supply drop
[31, 35, 46]. This will cause the local circuit’s logic to operate at a lower voltage;
therefore, the local power sensor should show a decrease in the oscillation frequency
when the power wasters are turned on. In this chapter, we adopt the RO-based power
waster shown in Fig. 11.11. Each power waster has five inverters in the delay chain
with an AND gate and oscillates at 245 MHz. A global enable signal is used to turn
on/off all the power wasters in the circuit.
Figure 11.12 shows the experimental setup for the local power sensing evalu-
ation. In this setup, UART communication is used to read out the PRO’s counter
value. We constrain the power waster to locations near the PRO sensor to induce
a local power drop. By configuring the number of power wasters, we can control
the amount of local power drop. An on-board dip switch is used to enable/disable
the power wasters. In a measurement scenario, we gradually increase the number
of power wasters. For each power waster configuration, we measure the PRO's
oscillation frequency 1000 times and take the average with the power wasters on
and off, respectively. Next, we calculate the frequency drop ratio as follows:
Frequency Drop Ratio = (f_off − f_on) / f_off    (11.6)

Fig. 11.11 The structure of the employed RO-based power wasters

Fig. 11.12 Experimental setup for PRO power sensing with regard to on-die local power variations

In Eq. (11.6), f_off denotes the PRO sensor's frequency when the power wasters are
disabled (turned off), and f_on denotes its frequency when the power wasters are
enabled (turned on). The results from the experiment are shown in Fig. 11.13 when
different numbers of power wasters are enabled. As more power wasters are enabled,
the frequency drop ratio increases correspondingly. We can observe a nearly linear
relationship between the number of power wasters and sensor oscillation slowdown.
The linear regression that closely models the correlation between the number
of power wasters and the frequency drop ratio is f(x) =
0.00031x + 0.247, with an R-squared value of 0.991. Therefore, we conclude that
the PRO can effectively sense local power fluctuations.
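The analysis above, Eq. (11.6) per power-waster count followed by a linear fit, can be sketched as follows; the frequency readings are illustrative, not the chapter's measurements:

```python
import numpy as np

# Sketch of the local power-sensing analysis: the frequency drop ratio of
# Eq. (11.6) per power-waster count, followed by a linear fit.
# The frequency readings below are illustrative, not measured data.

def drop_ratio(f_off, f_on):
    return (f_off - f_on) / f_off

n_wasters = np.array([100, 200, 300, 400, 500])
f_off = 150.0                                       # MHz, wasters disabled
f_on = np.array([112.4, 107.7, 103.1, 98.4, 93.8])  # MHz, wasters enabled

ratios = drop_ratio(f_off, f_on)
slope, intercept = np.polyfit(n_wasters, ratios, 1)
r2 = np.corrcoef(n_wasters, ratios)[0, 1] ** 2

# Near-linear growth of the drop ratio with the number of power wasters.
assert slope > 0 and r2 > 0.99
```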

11.5.3 PRO Power Sensing with Regard to Sensor Locality

In this experimental scenario, we evaluate the PRO sensor’s frequency change with
respect to the spatial proximity to the switching logic that consumes the power.

Fig. 11.13 PRO frequency drop ratio as a function of the number of active power wasting circuits

Fig. 11.14 PRO's average frequency drop ratio for each row versus the spatial proximity of the
power wasters

In
this experiment, we instantiate 36 PRO sensors to get full spatial coverage of the
FPGA. For this experimental scenario, 36 sensors reside in nine rows, and each row
has four sensors.
To remove the process variations among the PRO instances, we calculate the
frequency drop ratio for each PRO instance following Eq. (11.6). We first measure the
frequency drop ratio for all the sensors. Then, we take the average of the frequency
drop of the four PROs in each row. The results are shown in Fig. 11.14. We observe
that as the PRO sensors are placed closer to the power wasters (from Row 0 to Row
8), the frequency drop ratio increases. Therefore, we can see the spatial distance of
the PRO sensor to the switching logic (power wasters) indeed can be reflected in

the frequency drop ratio. We can further use this feature to detect the location of
injected faults on the chip (as will be demonstrated in Sect. 11.6). Note that there is an
outlier among our sensors, which might be attributed to the power distribution
network structure of electronic circuits, in which the power delivery at the center of the
chip is built to be more stable [33].

11.6 Fault Detection

In this section, we evaluate our proposed on-die PRO sensor's performance in
sensing the occurrence of fault injection attacks. We show that the PRO can be
used to protect the circuit from adversaries who have physical access or remote
control of the device [28] that enables them to inject power or EM faults. However,
we assume that the PRO itself is protected against the manipulation of the attacker.
We demonstrate that the PRO can not only detect the occurrence of a power-based
fault, but the sensor array can also detect the location of the power fault. This
enables the designer (or the system administrator) to identify the source of the
fault injection or the malicious circuits and build highly targeted fault response
mechanisms accordingly. Moreover, we further demonstrate that the PRO can be
used for EM fault detection.

11.6.1 Power Fault Detection

Sharing the same PDN between a potential adversary and a victim opens the door to
a new array of attacks. An adversarial logic can impose changes in the voltage level
to cause timing faults in the victim circuit [1, 23, 28, 36]. Since all these attacks
affect the PDN, we aim to build sensors that are sufficiently sensitive to the voltage
level and therefore can detect such attacks. Detecting ongoing fault injection attacks
will prevent the resulting timing faults from going unnoticed.
Figure 11.15 shows our experimental setup for evaluating the power fault
detection performance of our sensor. We instantiate AES as well as the PRO sensor
array on the FPGA. Power wasters are placed locally on the chip to simulate
the situation when local power faults are induced by an adversary. An on-board
dip switch can control the activation of the power wasters. A control script is
used to control starting the AES co-processor, sending plaintext, and reading the
ciphertext. The AES control script is also used to monitor the correctness of the
resulting ciphertext. We adjust the number of power wasters instantiated, while
AES is running. When faulty ciphertexts are observed, we know that a power fault
is successfully injected. This ensures that the power faults detected by PRO are
actually effective faults. Next, we read out the PRO’s counter value through the
sensor’s control script both when the fault is injected and not injected, respectively,
and compare their values. Note that as a chip-level sensor, our goal is to detect the
location of the attacker rather than identifying that a fault has occurred within the
victim algorithm or circuit.

Fig. 11.15 The test setup uses a user-controlled power waster to test PRO fault detection. The
power waster is visible as an orange field of dots on the left side of the floorplan between row 1
and row 2, and between row 2 and row 3
Figure 11.16 shows the floorplan of the aforementioned setup. We placed 36
sensors on the chip, and 524 power wasters are instantiated to generate power
faults. We first put the power wasters at Row 1 and Row 2 on the left as shown
in the orange blocks in Fig. 11.16. This power waster location is further identified
as location-1. While AES is running, we read out the sensors’ counter values when
the faults are injected and not injected by power wasters, respectively. Then we
calculate the frequency drop ratio based on Eq. (11.6). With the PRO sensor data, we
are able to find the location of the power fault. First, to locate which row has the
power fault, we take the average of the four PRO sensors’ frequency drop ratio in
each row. Figure 11.17 shows the result of each row’s average frequency drop ratio.
The maximum frequency drop ratio points to a location adjacent to Row 2. This
demonstrates that our sensor array can point to the correct row in which the fault
has occurred. Then, we divide the chip into two regions, left and right. To locate the
fault region, we take the average of the frequency drop ratios of the 18 sensors in the
left and right two columns separately. The average frequency drop ratio in the left
region is 0.2184, and the average frequency drop in the right region is 0.213. The
drop in the left region is higher than that in the right region, which indicates that the
source of the fault is in the left region. This demonstrates that our sensor array can
locate the correct fault column. Now, after analyzing the data of the sensor array, we
can locate the power fault’s location in Row 2, left region.
To further demonstrate the capability of the proposed PRO sensor in detecting
the fault locations, we placed the power wasters in different locations, while AES
is running. We repeat the same experimental scenario to locate the faulty row and
faulty column.

Fig. 11.16 FPGA floorplan for evaluating PRO performance in power fault detection

Fig. 11.17 PRO average frequency drop ratio for each row when a power fault happens at location-
1 (the power wasters are placed as shown in Fig. 11.16)

We first put the power wasters in locations Row 4 and Row 5 on
the left region and get the results of locating the faulty row as demonstrated in
Fig. 11.18. The highest frequency drop ratio points to Row 4, which indicates that
the fault happens adjacent to Row 4. The left region’s average frequency drop is
0.2159 and the right region’s average frequency drop is 0.2091, which indicates
that the left region has the fault. Therefore, the sensor array locates the place where

Fig. 11.18 Floorplan and corresponding PRO average frequency drop ratio for each row when
power fault happens at location-2. Black blocks denote PROs in the floorplan, and red blocks
denote power waster positions in the floorplan

Fig. 11.19 Floorplan and the corresponding PRO average frequency drop ratio for each row when
power fault happens at location 3. Black blocks denote PROs in the floorplan, and red blocks denote
power waster positions in the floorplan

the fault is injected as the left region, Row 4. Next, we move the power wasters to
Row 1 and Row 2 on the right, as shown in Fig. 11.19. We observe the left region's
average frequency drop is 0.2083 and the right region’s average frequency drop is
0.2204, which indicates that the right region has the fault. In Fig. 11.19, the highest
frequency drop ratio correctly points to Row 2. Therefore, we demonstrate that our
proposed sensor can detect the location of the on-chip power fault.

11.6.2 Electromagnetic Fault Injection (EMFI) Detection

EMFI is a well-known active attack. It uses an active probe to apply an intense,
transient magnetic field to integrated circuits (ICs). An EM pulse causes a sudden
current flow in the circuit of the targeted IC, and therefore, the local supply voltage
drops. This voltage drop produces timing faults such as bit flips, bit sets, and
bit resets due to timing-constraint violations, as well as sampling faults caused by
disrupting the switching of D flip-flops when the EM perturbations are synchronous
with clock rising edges. This enables the adversary to exploit such faults to extract sensitive
content from the device. Previous research has shown that EM perturbations can
cause faulty computations, alter the program flow, and cause bit flips in the contents
of the memory. Other authors have demonstrated that EM can induce faults into
the devices [11, 34]. In the past few years, EM fault injection attacks have gained
increasing attention. In this section, we investigate the performance of our proposed
PRO sensor with regard to EM fault injection.
Figure 11.20 shows the experimental setup for evaluating EM fault injection
detection performance. In this setup, we instantiate AES and a PRO array with 36
sensors on the FPGA. A script is used to send the plaintext, start the AES,
and read the ciphertext. While AES is running, an EM probe is placed in a fixed
position on top of the FPGA chip surface at a vertical distance of approximately
1.5 mm. It generates an EM pulse to induce faults. The EM probe’s tip is 4 mm
in diameter and produces a magnetic field that is perpendicular to the surface
of the chip. A glitch controller controls the time and intensity of the EM pulse.
While AES is running, we adjust the intensity of the EM pulse. When a faulty
ciphertext is observed, we know that an effective EM fault is injected.

Fig. 11.20 The experimental setup that uses the PRO for EM fault detection

Fig. 11.21 Influence on the frequency distribution; the X-axis is probability and the Y-axis is frequency

Next, in
each measurement, we read out the PRO sensor’s counter value through the sensor’s
control script when the fault is injected and not injected, respectively, and compare
their values.
We collected 1000 frequency measurements for all 36 PROs. For each PRO
sensor, we investigate the distribution of the 1000 frequency measurements when
the EM fault is injected and not injected. We observe that the EM fault can cause
variations of the PRO’s frequency distribution. Figure 11.21 and Fig. 11.22 shows
comparisons of the frequency distribution when the EM fault is injected and not
injected for RO-0 to RO-15. We notice that the PRO sensor’s frequency shifts to
a larger value when faults are injected. Besides this frequency shift, we observe a
second reaction that the PRO sensors can exhibit: EM faults can also corrupt the
counter values of RO-23 to RO-27 and RO-31 to RO-36. When the faults are injected,
the counter values read out through the UART jump to a large (and faulty) value of
4.08 × 10^7 MHz. Therefore, by
monitoring the value of the PRO counters, we can detect ongoing electromagnetic
fault injection (EMFI) at runtime.
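A minimal runtime check combining the two observed reactions, a distribution shift and a corrupted counter readout, might look as follows; the thresholds and the mean-shift heuristic are our illustrative assumptions, not the chapter's detection rule:

```python
# Sketch of a runtime EMFI check combining the two observed reactions:
# (1) a shift of the PRO frequency distribution, and (2) an absurd,
# corrupted counter readout. Thresholds here are illustrative choices.

FAULTY_COUNTER_MIN = 1e7      # MHz; far beyond any plausible PRO reading

def emfi_suspected(freqs_mhz, baseline_mean, baseline_std, guard=4.0):
    # Reaction 2: a corrupted counter readout produces an absurd value.
    if any(f > FAULTY_COUNTER_MIN for f in freqs_mhz):
        return True
    # Reaction 1: the mean of recent readings shifted away from baseline.
    mean = sum(freqs_mhz) / len(freqs_mhz)
    return abs(mean - baseline_mean) > guard * baseline_std

assert not emfi_suspected([100.1, 99.9, 100.0], 100.0, 0.1)
assert emfi_suspected([101.2, 101.3, 101.1], 100.0, 0.1)   # shifted up
assert emfi_suspected([100.0, 4.08e7, 100.1], 100.0, 0.1)  # corrupted value
```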

Fig. 11.22 Influence on the frequency distribution for PRO-32

11.7 Conclusion

In this chapter, we proposed a multipurpose RO design. We demonstrated that it is
possible, at low cost, to have a side-channel countermeasure and fault detection
mechanism within the same design. We showed that PRO can provide an effective
hiding countermeasure to a circuit with low overhead by injecting random frequency
noise. We further demonstrated that the PRO array can form a comprehensive on-
chip secure monitoring network. The network can potentially provide both temporal
and spatial coverage of on-chip power monitoring and fault detection. PRO allows a
user to communicate and control its configurations, such as its oscillation frequency,
in real time. This feature highlights its potential to be integrated into large designs,
such as SoCs, as a secure extension to build more comprehensive side-channel
and fault-resistant systems. In future work, we will further investigate integrating
PRO into an SoC and build up a real-time side-channel countermeasure and fault
detection/response system that can protect both software and hardware applications.

Acknowledgments This research was supported in part through NSF Award 1931639 and NSF
Award 2219810.

References

1. Alam, M. M., Tajik, S., Ganji, F., Tehranipoor, M., & Forte, D. (2019). RAM-Jam: Remote
temperature and voltage fault attack on FPGAs using memory collisions. In 2019 Workshop
on fault diagnosis and tolerance in cryptography (FDTC), pp. 48–55. https://doi.org/10.1109/
FDTC.2019.00015.
2. Anik, M. T. H., Ebrahimabadi, M., Pirsiavash, H., Danger, J. L., Guilley, S., & Karimi, N.
(2020). On-Chip voltage and temperature digital sensor for security, reliability, and portability.
In 2020 IEEE 38th international conference on computer design (ICCD) (pp. 506–509). IEEE.
3. Bar-El, H., Choukri, H., Naccache, D., Tunstall, M., & Whelan, C. (2006). The sorcerer’s
apprentice guide to fault attacks. Proceedings of the IEEE, 94(2), 370–382.
4. Biham, E., & Shamir, A. (1997). Differential fault analysis of secret key cryptosystems. In
Annual international cryptology conference (pp. 513–525). Springer.
5. Brier, E., Clavier, C., & Olivier, F. (2004). Correlation power analysis with a leakage model.
In International workshop on cryptographic hardware and embedded systems (pp. 16–29).
Springer.
6. Chari, S., Jutla, C.S., Rao, J. R., & Rohatgi, P. (1999). Towards sound approaches to counteract
power-analysis attacks. In M. Wiener (Ed.), Advances in Cryptology—CRYPTO’ 99 (pp. 398–
412). Springer.
7. Coron, J. S., & Kizhvatov, I. (2010). Analysis and improvement of the random delay
countermeasure of CHES 2009. In International workshop on cryptographic hardware and
embedded systems (pp. 95–109). Springer.
8. Das, D., Maity, S., Nasir, S. B., Ghosh, S., Raychowdhury, A., & Sen, S. (2017). High efficiency
power side-channel attack immunity using noise injection in attenuated signature domain. In
2017 IEEE international symposium on Hardware Oriented Security and Trust (HOST) (pp.
62–67). IEEE.
9. Elnaggar, R., Chaudhuri, J., Karri, R., & Chakrabarty, K. (2022). Learning malicious circuits
in FPGA bitstreams. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 42(3), 726–739. https://doi.org/10.1109/TCAD.2022.3190771.
10. Fuhr, T., Jaulmes, E., Lomné, V., & Thillard, A. (2013). Fault attacks on AES with faulty
ciphertexts only. In 2013 Workshop on fault diagnosis and tolerance in cryptography (pp.
108–118). IEEE.
11. Ghodrati, M., Yuce, B., Gujar, S., Deshpande, C., Nazhandali, L., & Schaumont, P. (2018).
Inducing local timing fault through EM injection. In 2018 55th ACM/ESDA/IEEE design
automation conference (DAC) (pp. 1–6). https://doi.org/10.1109/DAC.2018.8465836.
12. Gilbert Goodwill, B. J., Jaffe, J., Rohatgi, P., et al. (2011). A testing methodology for side-
channel resistance validation. In NIST non-invasive attack testing workshop (vol. 7, pp. 115–
136).
13. Glamocanin, O., Coulon, L., Regazzoni, F., & Stojilovic, M. (2020). Are cloud FPGAs
really vulnerable to power analysis attacks? In 2020 Design, automation and test in Europe
conference and exhibition, DATE 2020, Grenoble, France, March 9–13, 2020 (pp. 1007–1010).
IEEE. https://doi.org/10.23919/DATE48585.2020.9116481.
14. Gnad, D. R., Oboril, F., Kiamehr, S., & Tahoori, M. B. (2016). Analysis of transient voltage
fluctuations in FPGAs. In 2016 International conference on field-programmable technology
(FPT) (pp. 12–19). https://doi.org/10.1109/FPT.2016.7929182.
15. Gnad, D. R. E., Krautter, J., & Tahoori, M. B. (2019). Leaky noise: New side-channel attack
vectors in mixed-signal IoT devices. IACR Transactions on Cryptographic Hardware and
Embedded Systems, 2019(3), 305–339. https://doi.org/10.13154/tches.v2019.i3.305-339.
16. Gravellier, J., Dutertre, J. M., Teglia, Y., Loubet-Moundi, P. (2019). High-speed ring oscillator
based sensors for remote side-channel attacks on FPGAs. In 2019 International conference
on ReConFigurable computing and FPGAs (ReConFig) (pp. 1–8). https://doi.org/10.1109/
ReConFig48160.2019.8994789.

17. He, W., Breier, J., Bhasin, S., Miura, N., & Nagata, M. (2016). Ring oscillator under laser:
Potential of PLL-based countermeasure against laser fault injection. In 2016 Workshop on fault
diagnosis and tolerance in cryptography (FDTC) (pp. 102–113). IEEE.
18. Kim, C. H., & Quisquater, J. J. (2007). Faults, injection methods, and fault attacks. IEEE
Design & Test of Computers, 24(6), 544–545.
19. Kim, C. K., Kong, B. S., Lee, C. G., & Jun, Y. H. (2008). CMOS temperature sensor with ring
oscillator for mobile DRAM self-refresh control. In 2008 IEEE international symposium on
circuits and systems (pp. 3094–3097). IEEE.
20. Kocher, P. (1996). Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other
systems. In N. Koblitz (Ed.), CRYPTO ’96, LNCS (vol. 1109, pp. 104–113). Springer.
21. Krautter, J., Gnad, D., & Tahoori, M. (2020). CPAmap: On the complexity of secure FPGA
virtualization, multi-tenancy, and physical design. In IACR transactions on cryptographic
hardware and embedded systems (pp. 121–146).
22. Krautter, J., Gnad, D. R., Schellenberg, F., Moradi, A., & Tahoori, M. B. (2019). Active
fences against voltage-based side channels in multi-tenant FPGAs. In 2019 IEEE/ACM
international conference on computer-aided design (ICCAD) (pp. 1–8). https://doi.org/10.
1109/ICCAD45719.2019.8942094.
23. Krautter, J., Gnad, D. R., & Tahoori, M. B. (2018). FPGAhammer: Remote voltage fault attacks
on shared FPGAs, suitable for DFA on AES. IACR Transactions on Cryptographic Hardware
and Embedded Systems, 2018(3), 44–68.
24. La, T. M., Matas, K., Grunchevski, N., Pham, K. D., & Koch, D. (2020). FPGADefender:
Malicious self-oscillator scanning for Xilinx UltraScale+ FPGAs. ACM Transactions on
Reconfigurable Technology and Systems, 13(3), 15:1–15:31. https://doi.org/10.1145/3402937.
25. Li, X., Tessier, R., & Holcomb, D. E. (2022). Precise fault injection to enable DFIA for
attacking AES in remote FPGAs. In 30th IEEE annual international symposium on field-
programmable custom computing machines, FCCM 2022, New York City, NY, USA, May
15–18, 2022 (pp. 1–5). IEEE. https://doi.org/10.1109/FCCM53951.2022.9786154.
26. Liu, P. C., Chang, H. C., & Lee, C. Y. (2010). A low overhead DPA countermeasure circuit
based on ring oscillators. IEEE Transactions on Circuits and Systems II: Express Briefs, 57(7),
546–550. https://doi.org/10.1109/TCSII.2010.2048400.
28. Mahmoud, D., & Stojilović, M. (2019). Timing violation induced faults in multi-tenant FPGAs.
In 2019 Design, automation & test in Europe conference & exhibition (DATE) (pp. 1745–1750).
https://doi.org/10.23919/DATE.2019.8715263.
29. Mangard, S., Oswald, E., & Popp, T. (2008). Power analysis attacks: Revealing the secrets of
smart cards (Vol. 31). Springer Science & Business Media.
30. Miura, N., Najm, Z., He, W., Bhasin, S., Ngo, X. T., Nagata, M., & Danger, J. L. (2016).
PLL to the rescue: A novel EM fault countermeasure. In 2016 53rd ACM/EDAC/IEEE design
automation conference (DAC) (pp. 1–6). https://doi.org/10.1145/2897937.2898065.
31. Moini, S., Li, X., Stanwicks, P., Provelengios, G., Burleson, W., Tessier, R., & Holcomb,
D. (2020). Understanding and comparing the capabilities of on-chip voltage sensors against
remote power attacks on FPGAs. In 2020 IEEE 63rd international midwest symposium on
circuits and systems (MWSCAS) (pp. 941–944). IEEE.
32. Moini, S., Tian, S., Szefer, J., Holcomb, D., & Tessier, R. (2021). Remote power side-channel
attacks on BNN accelerators in FPGAs. In Design, automation and test in Europe conference
(DATE).
33. Popovich, M., Mezhiba, A., & Friedman, E. G. (2007). Power distribution networks with on-
chip decoupling capacitors. Springer Science & Business Media.
34. Poucheret, F., Tobich, K., Lisart, M., Chusseau, L., Robisson, B., & Maurine, P. (2011).
Local and direct EM injection of power into CMOS integrated circuits. In 2011 Workshop on
fault diagnosis and tolerance in cryptography (pp. 100–104). https://doi.org/10.1109/FDTC.
2011.18.
35. Provelengios, G., Holcomb, D., & Tessier, R. (2019). Characterizing power distribution
attacks in multi-user FPGA environments. In 2019 29th International conference on field
programmable logic and applications (FPL) (pp. 194–201). IEEE.
36. Provelengios, G., Holcomb, D., & Tessier, R. (2020). Power distribution attacks in multitenant
FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 28(12), 2685–
2698. https://doi.org/10.1109/TVLSI.2020.3027711.
37. Standaert, O. X., Peeters, E., Rouvroy, G., & Quisquater, J. J. (2006). An overview of power
analysis attacks against field programmable gate arrays. Proceedings of the IEEE, 94(2), 383–
394.
38. Tajik, S., Fietkau, J., Lohrke, H., Seifert, J. P., & Boit, C. (2017). PUFMon: Security monitoring
of FPGAs using physically unclonable functions. In 2017 IEEE 23rd International symposium
on on-line testing and robust system design (IOLTS) (pp. 186–191). IEEE.
39. Tian, S., Krzywosz, A., Giechaskiel, I., & Szefer, J. (2020). Cloud FPGA security with RO-
based primitives. In 2020 International conference on field-programmable technology (ICFPT)
(pp. 154–158). https://doi.org/10.1109/ICFPT51103.2020.00029.
40. Tian, S., Moini, S., Wolnikowski, A., Holcomb, D. E., Tessier, R., & Szefer, J. (2021).
Remote power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In 29th
IEEE annual international symposium on field-programmable custom computing machines,
FCCM 2021, Orlando, FL, USA, May 9–12, 2021 (pp. 242–246). IEEE. https://doi.org/10.
1109/FCCM51124.2021.00037.
41. Tillich, S., & Herbst, C. (2008). Attacking state-of-the-art software countermeasures—a case
study for AES. In International workshop on cryptographic hardware and embedded systems
(pp. 228–243). Springer.
42. Utyamishev, D., & Partin-Vaisband, I. (2018). Real-time detection of power analysis attacks by
machine learning of power supply variations on-chip. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 39(1), 45–55.
43. Yao, Y., & Schaumont, P. (2018). A low-cost function call protection mechanism against
instruction skip fault attacks. In Proceedings of the 2018 workshop on attacks and solutions in
hardware security (pp. 55–64).
44. Yao, Y., Yang, M., Patrick, C., Yuce, B., & Schaumont, P. (2018). Fault-assisted side-channel
analysis of masked implementations. In 2018 IEEE international symposium on hardware
oriented security and trust (HOST) (pp. 57–64). IEEE.
45. Zhang, X., & Tehranipoor, M. (2011). RON: An on-chip ring oscillator network for hardware
Trojan detection. In 2011 Design, automation & test in Europe (pp. 1–6). IEEE.
46. Zhao, M., & Suh, G. E. (2018). FPGA-based remote power side-channel attacks. In 2018 IEEE
symposium on security and privacy (SP) (pp. 229–244). IEEE.
47. Zick, K. M., & Hayes, J. P. (2012). Low-cost sensing with ring oscillator arrays for healthier
reconfigurable systems. ACM Transactions on Reconfigurable Technology and Systems
(TRETS), 5(1), 1–26.
Index

A
Access delegation, 2, 15–22, 24
Active voltage attacks, 274, 276, 292
Advanced Encryption Standard (AES), 14, 17, 43, 60, 61, 69, 75, 85, 87–89, 92, 94, 109, 110, 112–115, 126–129, 196, 197, 276, 278, 281, 301, 308, 309, 316, 317, 320
ASCON, 66, 73–75
Attack, 5, 35, 58, 82, 101, 137, 173, 204, 239, 273, 297
Authenticated encryption (AE), 7, 8, 10–12, 15, 24, 57, 65–75
Authentication, 1–24, 36, 45, 57, 65–67

B
Block ciphers, 7, 11, 12, 57, 59–62, 66, 67, 69, 71–73, 76, 119

C
Cache attack, 204–206, 222, 224–232
Cloud, 1, 29, 58, 81, 101, 137, 173, 203, 239, 273, 298
Cloud cartography, 256–266
Cloud FPGAs, 3, 8–10, 16, 21, 29–51, 58, 75, 82, 92, 102, 103, 127, 129–131, 137–168, 173, 175, 239–268, 273–276, 291, 292
Cloud security, 1, 24, 36, 139, 267–268
Confidentiality, 1–24, 36, 38, 39, 43, 47–49, 51, 57, 65, 66, 102, 105
Correlation power analysis (CPA), 17, 85, 88, 89, 109–111, 113–115, 300, 301
Countermeasure, 76, 82, 87, 92–96, 105, 116, 117, 130, 131, 174, 175, 177, 192, 196–198, 200, 230–232, 240, 243, 245, 255, 273–293, 297–322
Covert channels, 17, 35, 95, 137–139, 142–153, 155, 157, 158, 162–164, 168, 173–200, 227–228, 239, 250, 256, 267
Cryptographic primitives, 58–76

D
Data protection, 24
Defense, 36, 130, 175, 197–198, 217, 223, 232, 246, 255–256, 268
DRAM decay, 165–167, 243, 250, 251, 268

F
Fault attack, 76, 84, 85, 89–94, 142, 198, 219–224, 267, 297–300
Fault countermeasure, 297
Fault detection, 277, 281–284, 299, 302, 305–307, 316–322
Fault injection, 86, 87, 91, 93, 104, 121–124, 127–129, 219, 221, 274, 277, 284, 287, 291–293, 297–302, 307, 316, 320–322
Fault injection attack, 101–131, 219–223, 232, 276, 281, 284, 292, 297–322
Field-programmable gate array (FPGA), 1, 29, 58, 81, 101, 137, 173, 203, 239, 273, 298
Fingerprinting, 138, 142, 160, 198, 239–268
FPGA acceleration, 3–4, 6, 14, 103, 155, 203, 209, 239
FPGA security, 6, 141, 200, 267, 268, 276

H
Hardware trojans, 5, 10, 126, 128, 276, 302
Heterogeneous compute, 81, 101

I
Information leakage, 96, 139, 175, 188, 191, 196, 199, 200, 297
Interference attack, 154, 158–162

M
Multi-tenancy, 13–16, 24, 29, 30, 33, 35, 50, 51, 103, 131, 230, 276
Multi-tenant FPGA, 17, 21, 31, 36, 48, 93–95, 101–131, 175, 273–293

O
OAuth 2.0, 2, 9, 10, 15, 21, 24
On-chip sensors, 88, 89, 105, 274, 298, 300–302
On-chip voltage sensor, 94, 102, 276, 277, 281–284, 286, 288, 291

P
Partial reconfiguration, 21, 31, 34, 160, 274, 287–292
PCIe contention, 138, 139, 141–144, 146, 147, 153–155, 158, 160, 167, 168, 241, 243–244, 256, 259, 260, 263, 268
Physical unclonable functions (PUFs), 12–14, 21, 138, 240, 243, 245–257, 264, 265, 268
Power analysis, 76, 82, 84, 88–89, 105, 109–111, 199, 221
Power attacks, 196, 199, 299
Power distribution network (PDN), 17, 82–84, 86, 92, 94, 102–104, 114, 116, 117, 122, 175, 199, 273, 274, 276–284, 292, 298–302, 307, 311, 316
Power integrity monitoring, 299
Power supply units (PSUs), 173–182, 186–188, 191–198, 200

R
Remote attacks, 17, 35, 103, 118, 173, 198, 199, 274, 275
Ring oscillators (ROs), 5, 43, 86, 87, 105, 119, 121, 138, 141, 142, 174–178, 182–184, 186, 192–193, 196–198, 200, 240, 241, 243, 255, 275, 282, 287–289, 292, 298, 300, 302, 303, 311
Rowhammer, 204, 206, 207, 213–219, 221–225, 230–232

S
Secure remote computing, 1
Security, 1, 29, 58, 81, 101, 137, 175, 204, 240, 273, 298
Side-channel attack, 82, 84, 85, 87–95, 101–103, 105–116, 138, 141, 143, 173, 197, 221, 240, 245, 267, 274, 276, 277, 292, 297, 299, 311
Side-channel countermeasure, 94, 297, 300, 307–311, 322
Side channels, 5, 91, 102, 137, 198, 206, 267, 297
Stream ciphers, 57, 59, 60, 62–65, 67
System-on-chip (SoC), 12, 36, 86, 91, 119, 129, 198, 291, 299, 301, 306, 322

T
Threats, 5, 13, 15, 24, 36–38, 75, 76, 81, 83–85, 87, 92, 94, 103, 104, 116, 130, 131, 137–168, 173, 175–177, 198, 200, 203, 239, 245, 273–277, 280–281, 291–292

V
Virtualization, 17, 21, 22, 30, 32, 33, 36, 47–49, 82, 101, 245
Voltage fluctuations, 83, 87, 88, 103, 105, 111, 118, 174, 177, 197, 274, 276, 277, 283
Voltage regulators, 83, 174, 176, 179, 180, 195, 198–200, 280, 311