
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 69, NO. 7, JULY 2022

NTT Architecture for a Linux-Ready RISC-V Fully-Homomorphic Encryption Accelerator

Rogério Paludo, Student Member, IEEE, and Leonel Sousa, Senior Member, IEEE

Abstract— This paper proposes two architectures for the acceleration of Number Theoretic Transforms (NTTs) using a novel Montgomery-based butterfly. We first design a custom NTT hardware accelerator for Field-Programmable Gate Arrays (FPGAs). The butterfly architecture is then expanded into a Modular Arithmetic Logic Unit (MALU), and, for greater reuse and easier programmability, a six-stage-pipeline Linux-ready RISC-V core is extended with custom instructions. The performance of the proposed architectures is assessed on a Xilinx Ultrascale+ FPGA and with an Application-Specific Integrated Circuit (ASIC) on 28 nm CMOS technology. On FPGA, the results for custom acceleration show reductions of 30%, 90%, and 42% in the number of Lookup Tables (LUTs) and registers, Block RAMs (BRAMs), and Digital Signal Processors (DSPs), respectively, while providing a speedup of 1.9 times in comparison with the state of the art. The ASIC results show that at 1 GHz the proposed architecture is on average 45% less area-hungry and 52% less power-hungry than the state of the art. Furthermore, the proposed MALU, operating as an additional execution unit, increases the overall area of the extended RISC-V core by only 10%, without significant changes in the frequency of operation.

Index Terms— Fully-homomorphic encryption, number theoretic transform, Montgomery algorithm, RISC-V, FPGA, ASIC.

Manuscript received September 29, 2021; revised January 4, 2022; accepted April 1, 2022. Date of publication April 27, 2022; date of current version June 29, 2022. This work was supported in part by the Fundação para a Ciência e a Tecnologia (FCT), Portugal, under Project UIDB/50021/2020; and in part by the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement 800928 and Specific Grant Agreement 101036168 EPI SGA2. This article was recommended by Associate Editor S. Liu. (Corresponding author: Rogério Paludo.)
The authors are with INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, 1649-004 Lisbon, Portugal (e-mail: rogerio.pld@gmail.com).
Digital Object Identifier 10.1109/TCSI.2022.3166550

I. INTRODUCTION

CLOUD computing offers on-demand processing and storage without an investment in high-performance hardware. However, several cloud-computing service providers have been subject to data breaches and attacks, raising concerns about privacy and data protection. One of the issues with the current cloud-computing model is rooted in the limitations of cryptosystems: computing on data encrypted with these systems requires full decryption, with the danger of exposing private information to possible attackers. Fully-Homomorphic Encryption (FHE) allows arbitrary computations directly on ciphertexts, without the need for full decryption, which is a powerful tool and achieves long-desired paradigms to tackle the privacy and security problem in the outsourced computational model.

FHE was conceptualized in 1978 by Rivest, Adleman, and Dertouzos [1], yet only in 2009 did Gentry [2] propose the first feasible FHE scheme. Major breakthroughs have been made in the field of FHE [3] since 2009. At present, FHE schemes [4]–[7] are orders of magnitude faster than earlier generations, but they still remain prohibitively expensive in terms of computing.

The mathematical framework used in the first FHE scheme by Gentry [2] was based on lattices, which are still used in most FHE schemes [4]–[9]. Lattice problems are believed to be hard and provide resistance to quantum attacks. Because of this, Lattice-Based Cryptography (LBC) is at the core of the standardization process for next-generation cryptosystems [10]. In practical terms, lattice-based FHE and LBC require arithmetic on polynomials, with multiplication being the most complex operation.

Naive algorithms for polynomial multiplication, such as the schoolbook method, are quadratic in complexity. In this context, NTTs represent a practical tool for reducing the complexity to quasilinear (i.e., O(n log n)), bringing the prohibitively high cost of FHE down to more practical levels. Hence, the development of efficient NTT hardware accelerators, aiming at increasing the performance of polynomial operations for lattice problems, has been an active research field [11]–[29]. From the vast literature, two common design objectives stand out. In most Post-Quantum Cryptography (PQC) schemes the polynomial operations are defined in a single modular ring with a low number of bits, typically around 16 bits [11]–[17]. Contrarily, in FHE, due to the requirement of thousands of bits [27] for security and accuracy of operations, it is common to apply a decomposition for data parallelism taking advantage of the Chinese Remainder Theorem (CRT): a single ring with a high number of bits is broken down into smaller, more manageable words. In the context of PQC, an isolated modular arithmetic path might be more easily improved. For FHE, however, there might be dozens of prime moduli [21], [24], [27], increasing the difficulty. Besides, generic accelerators in the literature are not versatile enough to scale from PQC to FHE [16], [17], [23]. Most of the literature is focused on accelerators for FPGA, given its flexibility, but as we show in this paper these solutions are not easily adaptable to an ASIC-based design flow.

A recurrent approach in the related literature for NTT accelerators is to use a HW/SW co-design methodology, allowing greater reuse of components and easier programmability, in contrast to fully-custom accelerators. As such,
1549-8328 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: The Ohio State University. Downloaded on November 08,2022 at 02:41:22 UTC from IEEE Xplore. Restrictions apply.

the extensibility of the RISC-V Instruction Set Architecture (ISA) and the availability of open-source tools are recurrently explored in the development of NTT accelerators [12], [16], [17]. Nonetheless, less attention has been paid to the research of RISC-V ISA extensions for FHE. Previous work in the literature focused either on custom accelerators [29], or on co-design with ARM [21], [24], MicroBlaze [28], or general-purpose processors [19], [20], [27], [30]. Moreover, the extensions of the RISC-V [12], [16], [17] for PQC concentrated mainly on the development of acceleration for light, microcontroller-like cores. However, an important result from the literature [31] shows that memory-bound algorithms, which is the case of the NTT, profit more from stronger cores when the computationally intensive parts are hardware accelerated.

Therefore, this paper starts by proposing design methods and architectures for high-performance and low-cost acceleration of the NTT in custom hardware, either on FPGA or ASIC. Later on, the arithmetic units are adapted into a MALU integrated into the Linux-ready, six-stage-pipeline CVA6 [32] RISC-V core. This paper presents the following contributions.
• A search algorithm to find prime moduli that minimize the hardware resources of the NTT in the context of FHE and the Residue Number System (RNS) (through the CRT).
• A generic butterfly architecture for the NTT that leverages the bounded Hamming weight of the prime moduli for both ASIC and FPGA.
• A parametric custom accelerator for a Xilinx Ultrascale+ FPGA that uses the butterfly.
• A highly-efficient MALU supporting five different modular operations, including two different types of butterfly.
• The RISC-V ISA and compiler toolchain extended with a new register set and specific instructions for the acceleration of NTTs.
• Extensive experimental results and comparisons with the related state of the art, conducted on an FPGA and for a 28 nm ASIC technology.

The remainder of this paper is organized as follows. Section II reviews in detail the background and motivation for hardware acceleration of the NTTs in the context of FHE. Section III presents the proposed procedures to find complexity-bounded prime moduli sets. In Section IV, hardware architectures and RISC-V extensions are proposed. Section V evaluates and compares the architectures to the state of the art. Finally, the last section concludes the paper and lays down future research possibilities.

II. BACKGROUND AND MOTIVATION

Number Theoretic Transforms (NTTs) have the same algorithmic structure as Fast Fourier Transforms (FFTs), where the complex roots of unity are replaced by their equivalents in the integers modulo some prime p [33]. Similar to the FFT, Decimation in Time (DIT) [34] or Decimation in Frequency (DIF) [35] algorithms can be applied to compute the NTT. However, this requires implementing the arithmetic structures in a ring, where modular addition and multiplication are more involved.

Algorithm 1 Iterative DIF NTT Algorithm With Gentleman-Sande Butterfly [38].
Input: Polynomial x = (x_0, ..., x_{n−1}), primitive n-th root of unity ω_n.
Output: x̂ = NTT(x).
1: for s = log_2 n to 1 by −1 do
2:   m = 2^s
3:   W_m = ω_n^{n/m} (mod p)
4:   for k = 0 to n − 1 by m do
5:     W = 1
6:     for j = 0 to m/2 − 1 do      ▷ Butterfly loop.
7:       X = x[k + j]
8:       Y = x[k + j + m/2]
9:       X′ = X + Y (mod p)
10:      Y′ = W(X − Y) (mod p)
11:      x[k + j] = X′
12:      x[k + j + m/2] = Y′
13:      W = W · W_m (mod p)
14:    end for
15:  end for
16: end for
17: return x      ▷ x holds x̂ in bit-reversed order.

Polynomial multiplication is one of the most time-demanding arithmetic operations in FHE [36]. Nevertheless, polynomial multiplication can be improved over the naive quadratic complexity by using a divide-and-conquer approach formulated through the NTT. This is very important in lattice-based implementations since it reduces the time complexity from O(n^2) to O(n log n), where n is the polynomial order [33].

The DIF formulation of the NTT is presented in Algorithm 1, where the arithmetic is carried out on the residue classes of the prime p. Due to the DIF and the order of execution of the algorithm, the inputs, x = (x_0, ..., x_{n−1}), are fed in order, while the outputs, x̂ = NTT(x), are obtained in bit-reversed order [37].

Lines 9 and 10 of Algorithm 1 contain the modular arithmetic of the Gentleman-Sande butterfly [35], involving modular addition, subtraction, and multiplication. The input ω_n is an integer primitive n-th root of unity, such that ω_n^n ≡ 1 (mod p) and ω_n^i ≢ 1 (mod p) for i ∈ [1, n − 1]. Additionally, to support NTTs up to n, the prime modulus p has to satisfy p ≡ 1 (mod n). For polynomial multiplication it is possible to simplify the arithmetic operations using the negatively-wrapped convolution by enforcing p ≡ 1 (mod 2n), which eliminates the requirement for some modular and polynomial reductions. When the latter condition is not met, one can still use the NTT for polynomial multiplication by zero-padding and further reduction modulo p.

Algorithm 1 also allows the computation of the Inverse Number Theoretic Transform (INTT) by replacing the ω_n factors by ω_n^{−1}, such that ω_n^{−1} × ω_n ≡ 1 (mod p). Another possible algorithmic arrangement for polynomial multiplication is to use both DIF and DIT NTTs, to avoid the reordering operation. Since the DIT accepts its inputs in bit-reversed order, it is possible to feed it the outputs from the DIF algorithm without altering their order.


One has to first multiply pointwise the transformed coefficients in the NTT domain to actually implement polynomial multiplication. The choice of p is crucial to obtain efficient NTT implementations, since it must allow simple modular operations, including modular multiplication and reduction, while satisfying the NTT restrictions.

The high complexity of FHE schemes has led to the development of several hardware accelerators for the NTT [18]–[29]. A recurring approach is to target FPGAs, where constants (i.e., ω_n and ω_n^{−1}) can be stored in Block RAMs (BRAMs), while one or several parallel butterflies are used to implement Algorithm 1. When using more than one execution unit, one must avoid memory hazards, which can happen due to the in-place nature of Algorithm 1. This is a challenge in such designs, requiring a network of permutations to connect the different Processing Elements (PEs) (butterflies) to the BRAMs. To address this challenge, some accelerators have been proposed based on HW/SW architectures [16], [21], [24], providing frameworks with a higher degree of flexibility and hardware reusability. In a flexible HW/SW design environment, control of the custom hardware can be more easily handled by software, while still offering significant performance gains in comparison with pure software implementations. Nonetheless, one might argue that in the context of higher performance, ASICs should be considered. Previously, HW/SW ASIC-based accelerators for PQC have been designed [12], [16], [17]. But for FHE there is a lack of approaches, namely extensions for RISC-V, which we present as one of the contributions of this work.

Besides the difficulties in the control component, the arithmetic operations are quite cumbersome in Algorithm 1. Lines 9 and 10 implement the butterfly arithmetic, involving modular multiplication, addition, and subtraction. There have been several efforts to improve modular multiplication in the literature [12], [19], [20], [23], [39]. Commonly adopted techniques are based on Barrett or Montgomery multiplication and reduction. Barrett reduction lifts the operations to a simpler modulus, typically a power of two, but it might require some successive subtractions to obtain the correct result reduced for the prime modulus p [18], [22], [27], [28]. Montgomery multiplication and reduction also transfers the operations to a friendlier power-of-two modulus, providing better architectures for hardware realization [19], [20], [23], [39].

Additional improvements can be introduced into the butterfly computation by deferring some modular reductions to the end of the algorithm execution. A well-known approach was presented by Harvey [40] and is replicated in Algorithm 2 for completeness. In Harvey's butterfly the arithmetic is transferred to a power-of-two Montgomery domain based on the modulus β, and the intermediate values are allowed to grow up to at most 2p − 1, to reduce the total number of costly modular arithmetic reductions.

Algorithm 2 Harvey's Montgomery Butterfly [40].
Input:
  p odd and β = 2^{⌈log_2 p⌉+2}
  0 < W < p, W′ = βW (mod p), J = p^{−1} (mod β)
  0 ≤ X < 2p and 0 ≤ Y < 2p
Output:
  X′ = X + Y (mod p), 0 ≤ X′ < 2p
  Y′ = W(X − Y) (mod p), 0 ≤ Y′ < 2p
1: X′ = X + Y
2: if X′ ≥ 2p then
3:   X′ = X′ − 2p
4: end if
5: T = X − Y + 2p
6: R_1 β + R_0 = W′ T
7: Q = R_0 J (mod β)      ▷ Multiplication by constant.
8: H = ⌊Q p / β⌋          ▷ Multiplication by constant.
9: Y′ = R_1 − H + p
10: return X′, Y′

On top of the deferred reductions, restricting the prime choice to the form p = k·2^m + 1, with odd k < 2^m and m ≥ 1 [41], introduces further simplifications in Harvey's butterfly. Some of the current approaches apply an iterative reduction [19], [20], [23] or require extra storage [21], [39] for constants. The PQC algorithms Kyber [42] and Dilithium [43] are good practical examples of restrictions imposed on the prime choice. Moreover, when the value of k is kept low, hardware architectures can be simplified [39]. However, for FHE, the more practical and efficient realizations are based on RNSs, which require a set of prime moduli [44], [45]. In RNSs, large integers are decomposed into a set of smaller ones, such that arithmetic can be parallelized, decreasing the calculation time. Therefore, one might eventually need to select primes with higher hardware requirements, due to the usage of several moduli in RNSs and the unrestricted upper-bound approaches in the literature [19]–[21], [23], [39]. Since in RNS the channels are computed in parallel, the critical path will be given by the slowest one.

The arithmetic operations related to the constants J and p directly influence the time complexity of Algorithm 2. The complexity of multiplication by constants is widely known to be heavily dependent on the Hamming weight of the constant [46]. Therefore, selecting primes with a given upper bound on the Hamming weight could potentially minimize the hardware cost of Algorithm 2, while keeping the number of bits in the RNS channels for FHE relatively close to each other. In the next sections, a procedure to find such primes is presented for improving the overall performance of NTT accelerators. Later, an adaptation of the proposed butterfly is used to extend the datapath of a RISC-V processor, offering more flexibility for computing the NTT algorithm.

III. PRIME SELECTION

The search for primes with similar characteristics has been extensively studied in the literature. Some earlier theoretical works studied their statistical distributions regarding the number of binary digits [47], without focusing on the applications of such results. The recent construction of lattice-based FHE schemes, and their implementations relying on the RNS, has increased the interest in finding primes that offer lower-cost and more efficient arithmetic structures [25], [26], [48].
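The arithmetic of Algorithm 2 maps to a handful of integer operations and can be modeled in a few lines of Python. The sketch below is an illustrative software rendering, not a hardware description; the toy prime p = 97 and twiddle factor W = 33 are not parameters from the paper. It checks that the outputs stay partially reduced in [0, 2p), as the algorithm guarantees.

```python
def harvey_butterfly(X, Y, Wp, p, beta, J):
    """Harvey's Montgomery butterfly (Algorithm 2).
    Inputs satisfy 0 <= X, Y < 2p; Wp = beta*W (mod p); J = p^-1 (mod beta).
    Outputs stay partially reduced in [0, 2p)."""
    X2 = X + Y
    if X2 >= 2 * p:
        X2 -= 2 * p
    T = X - Y + 2 * p               # 0 < T < 4p keeps the difference non-negative
    R1, R0 = divmod(Wp * T, beta)   # R1*beta + R0 = Wp*T
    Q = (R0 * J) % beta             # multiplication by the constant J
    H = (Q * p) // beta             # multiplication by the constant p
    return X2, R1 - H + p           # X', Y'

# Toy parameters: beta = 2^(ceil(log2 p) + 2) > 4p guarantees the bounds above.
p = 97
beta = 1 << (p.bit_length() + 2)
J = pow(p, -1, beta)                # modular inverse; requires Python 3.8+
W = 33
Wp = (beta * W) % p
```

Since Q·p ≡ R_0 (mod β), the value R_1 − H equals (W′T − Qp)/β exactly, which is congruent to W(X − Y) modulo p; the final addition of p keeps it non-negative.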


From a software perspective, all libraries supporting current state-of-the-art FHE schemes, such as PALISADE [8] and SEAL [9], implement some form of prime-searching algorithm. Notably, the NFLLib [49] library has presented a parametric search algorithm that accepts as inputs a word size b and a bit margin b_0, and outputs primes supporting NTTs up to a maximum degree n_max. For implementing the NTT on custom hardware, it is important to reduce the overall area cost. To this end, one can add an additional constraint on the Hamming weight of the primes to reduce the complexity of multiplications and consequently the area cost.

Algorithm 3 Search for Primes With at Most 4 Bits in the Canonical Sign Digit (CSD) Representation, Adapted From [18], [48], [49].
Input:
  b, b_0, and n_max
Output:
  p_list
1: λ = 2^{b+b_0}, k = 1, p_list = [ ]
2: do
3:   c = (λ/2^{b_0} − k) · 2^{n_max} + 1
4:   if isPrime(c) and c > (1 + 1/2^{3b_0}) · λ/(2^{b_0} + 1) then
5:     if sum(abs(csd(c))) == 4 then
6:       p_list = p_list ∪ {c}      ▷ List append.
7:     end if
8:   end if
9:   k = k + 1
10: while c > (1 + 1/2^{3b_0}) · λ/(2^{b_0} + 1)
11: return p_list

TABLE I
Relative Frequency of the Total Number of Primes for Various Input Values in Algorithm 3

TABLE II
FHE Parameter Specifications According to the SEAL [9] Library, the FHE Standard Considering Classical Security [50], and the RNS Version of the Brakerski/Fan-Vercauteren (BFV) Scheme [44]

In Algorithm 3, the cost of the multiplications by the constants p and J is estimated by CSD recoding [46]. Then, to upper-bound the complexity, the number of bits in the primes' CSD representation is limited to 4. Due to the negative values in the CSD representation, the sum of the absolute values returned by the function csd(·) in Line 5 of Algorithm 3 gives the Hamming weight of the prime under consideration. Table I presents the results from executing Algorithm 3 for several values of b, considering as fixed inputs b_0 = 2 and n_max = 16. Based on these results, it is clear that for larger log_2 p more primes are found, though the number of primes satisfying the NTT criterion p ≡ 1 (mod n) decreases for polynomials of higher degree (n). Therefore, if a given set of security parameters is not met with the provided primes, one might consider increasing log_2 p to obtain more variability and increase the chances of finding suitable candidates, although at a higher area cost. Note that the search is not constrained to primes that meet the negatively-wrapped convolution condition, i.e., p ≡ 1 (mod 2n). Although there are many proposals using the negatively-wrapped convolution in the context of PQC, for FHE there are efficient solutions that do not enforce this condition [21], [24], [27], introducing zero-padding and modular reduction instead. The trade-off among these solutions comes from selecting simpler underlying arithmetic, at the cost of possibly increasing the overall runtime, due to the longer padded convolution. To the best of our knowledge, it is not clear from the FHE literature which approach is best when considering RNS-based hardware implementations. Therefore, we present a general search for primes in Algorithm 3, but one can further classify the primes to meet other design criteria.

Nevertheless, the relative frequency of the number of primes with a given number of bits provides a good hint on the choice of the 4-bit Hamming weight. The results of Algorithm 3 show that primes with 4 bits appear frequently enough to be used in common FHE security standards, according to the practical parameters reported in Table II. When constrained to 3 bits, the required word size log_2 p is significantly larger to satisfy the security settings used in practice, which can impose an undesirably higher computing time in the RNS channels. Additionally, constraining the number of bits in J (Algorithm 2) is not necessary, since our experimental evaluation shows that the primes found with 14 to 30 bits tend to produce values of J with at most 5 bits.

Previous works have already proposed some methods for prime selection and hardware implementation. Tan et al. [25] proposed a modular multiplier for lattice-based cryptosystems, and a more recent improvement was introduced by Hu et al. [26]. In comparison with the presented approach, both solutions lack versatility. As will become clear from the experimental results, the proposed solutions outperform previous work with more flexible designs for both ASIC and FPGA implementations, offering better performance with lower circuit-area requirements.

In summary, designing hardware architectures for the NTT targeted at accelerating FHE schemes requires selecting a set of prime moduli, which directly impacts the cost of the butterflies. Algorithm 3 proposes a constrained method to select such prime sets, imposing an upper bound on the costs associated with Algorithm 2.
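The CSD-weight filter at the core of Algorithm 3 can be sketched as follows. This is a simplified model, not the paper's exact enumeration: candidates are scanned as p = k · 2^{n_max} + 1 below 2^b rather than through the λ-based bound above, and the trial-division primality test is only adequate for toy ranges.

```python
def csd(n):
    """Canonical signed-digit (non-adjacent form) digits of n, LSB first."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n ≡ 1 (mod 4), -1 if n ≡ 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def is_prime(n):
    """Trial division; adequate only for the small toy ranges used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def find_primes(b, n_max, weight=4):
    """Primes p = k*2^n_max + 1 below 2^b whose CSD weight is at most
    `weight`, so p supports NTTs of length 2^n_max with cheap constants."""
    step = 1 << n_max
    return [p for p in range(step + 1, 1 << b, step)
            if is_prime(p) and sum(d != 0 for d in csd(p)) <= weight]
```

For instance, the Kyber prime 3329 = 2^12 − 2^10 + 2^8 + 1 has CSD weight 4, so it passes the filter.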


Fig. 1. Proposed Montgomery-based butterfly architecture.

The following sections describe hardware architectures for the NTT benefiting from Algorithm 3 and this prime choice.

IV. HARDWARE ARCHITECTURES

This section describes the proposed hardware architectures. It starts with a concise description of the Montgomery butterfly, which is later applied to design the NTT/INTT in FPGA, and further integrates a modified version into a MALU that is used to extend the CVA6 RISC-V [32] datapath.

Fig. 2. Internal shifts and adder tree for computing: (a) R_0 × J (mod β); (b) p × Q.
used to extend the CVA6 RISC-V [32] datapath.
A. Efficient Butterfly With Improved Montgomery Reduction

Fig. 1 presents a possible realization of Algorithm 2. The focus on versatility and high performance leads to a pipelined architecture with seven stages, with X′ and Y′ being output in every clock cycle. Additionally, the internal paths are fully pipelined to synchronously output values in a fixed number of clock cycles.

In terms of arithmetic, the hardware implementation of Algorithm 2 requires four main modular operations: addition, subtraction, multiplication by a constant, and regular multiplication. Addition and subtraction are internally organized with two adders in parallel to attain high performance, with the correct result within the ring of p selected a posteriori using a multiplexer. The regular multiplication in Algorithm 2 (Line 6) is internally pipelined with two levels for achieving higher performance. When FPGAs are the target, the descriptions are automatically mapped into DSP blocks. On ASICs, the pipelined multiplier breaks the critical path of the butterfly, allowing higher operating frequencies.

The remaining operations in the description of the butterfly are related to modular reductions for the multiplications by constants (Lines 7 to 9) in Algorithm 2. The primes have been chosen to minimize the hardware requirements, based on Algorithm 3. Therefore, R_0 × J (mod β) and p × Q require five and four shifts, respectively. Again, aiming for versatility and performance, the architecture makes use of Wallace trees that automatically map to ternary adders in FPGAs and Carry-Save Adders (CSAs) in ASICs.

Based on the selection in Algorithm 3, one can denote the prime modulus as p = 2^{f_2} ± 2^{f_1} ± 2^{f_0} ± 2^0 and J as J = 2^{s_3} ± 2^{s_2} ± 2^{s_1} ± 2^{s_0} ± 2^0, where f_l and s_m are shift values such that f_0 < f_1 < f_2 and s_0 < s_1 < s_2 < s_3.¹ By implementing the multiplications through shifts and additions, one significantly reduces the hardware cost while providing fixed overall hardware structures. On FPGAs, specialized DSP blocks for multiplication are readily available; even so, replacing a full multiplication by a few shifts and additions spares many DSPs without compromising performance, as will be shown later. On ASIC design flows, in contrast, the multiplication is much more costly: it takes significant area to achieve high performance, needing carry-lookahead adders, which are highly costly circuits compared to a few shift-and-add blocks.

¹Negative values of R_0 or Q require the two's complement of the inputs in the multiplications.

A diagram of the bit ranges used in the implementation of the two adder trees is presented in Fig. 2. There are levels in the adder tree that are simplified based on the restrictions imposed on the prime choice. For instance, from the Least-Significant Bit (LSB) up to s_0 and f_0, no operation is required due to the left shift. Additionally, some CSAs might be simplified to CSA* with Half-Adders (HAs), as a consequence of only having two inputs. It is also worthwhile to mention that p × Q requires twice the number of bits, but only the Most-Significant Bits (MSBs) are added, because of the right shift by β in Line 8 of Algorithm 2. Carry propagation happens only at the very end of the addition tree, using a regular Ripple-Carry Adder (RCA).

It is interesting to highlight the main characteristics of the related architectures in the state of the art. We start by classifying such architectures into two groups: i) those that target PQC, and ii) those that are flexible enough to support a set of different primes and can be used in the context of FHE.
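The shift-and-add decomposition underlying the adder trees can be illustrated in a few lines of Python; the prime 7681 = 2^13 − 2^9 + 2^0 is a well-known NTT-friendly modulus used only as an illustration here, not necessarily one of the primes selected by Algorithm 3.

```python
def mul_by_csd(x, digits):
    """Multiply x by the constant whose CSD digits (LSB first) are given,
    using only shifts and additions/subtractions, as the adder trees of
    Fig. 2 do in hardware."""
    acc = 0
    for i, d in enumerate(digits):
        if d == 1:
            acc += x << i
        elif d == -1:
            acc -= x << i   # in hardware: feed the two's complement of x
    return acc

# 7681 = 2^13 - 2^9 + 2^0: three nonzero CSD digits, so a constant
# multiplier needs only two additions/subtractions of shifted inputs.
P_DIGITS = [1] + [0] * 8 + [-1] + [0] * 3 + [1]
```

A constant with w nonzero CSD digits therefore costs w − 1 adder inputs, which is exactly why Algorithm 3 bounds the CSD weight of p and J.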


In group i) there are several different approaches [11]–[17]. These cryptographic schemes are generally implemented using a single prime with a low number of bits. For instance, from the previously cited works, only [12] provides arithmetic units for moduli with more than 14 bits, which is still far from the number of bits required in FHE. The second group ii), more related to this work, [18]–[29], has a plethora of different approaches. One can further divide the techniques by the type of modular reduction applied: Barrett [18], [22], [27], [28], Montgomery [19], [20], [23], [29], sliding window [21], [24], or a custom form of reduction [25], [26]. The sliding-window method requires extra storage of constants, since it adopts an iterative table-based approach [21], [24]. The remaining works on Barrett [18], [22], [27], [28] and Montgomery [19], [20], [23], [29] are not upper-bounded in time complexity, because the implementation is not constrained by the choice of primes. Custom reductions for some types of moduli represent a better overall choice for FHE [25], [26]. However, in [25] the authors applied low-performance units with several RCAs on the critical path. In the architectures presented in [26], the constraints for prime selection might impose higher computational time and area cost, due to the choice of primes.

In comparison with the related state-of-the-art architectures, the proposed butterfly provides significant savings in hardware resources while still offering high performance, as quantitatively assessed later. To implement the architectures for any prime with four binary digits and any polynomial order, only five CSAs (where two are simplified) and two carry-propagate adders are needed. A similar approach focused on software implementation was presented in [39]. However, it proposes no upper bound on the number of additions, which grows with the number of bits in the primes, and it is only efficient for small primes. On the other hand, our proposed architecture is versatile and adaptable to different restrictions on the prime numbers by considering the required modifications to the adder tree in Fig. 2. The next section describes NTT/INTT architectures using the proposed butterfly.

B. NTT Architecture

Fig. 3. Proposed NTT/INTT architecture with parallel butterfly cores.

The arithmetic operations of the butterfly in Fig. 1 require fetching and storing values at each clock cycle. Fig. 3 makes use of two simple dual-port BRAMs per PE to feed (X, Y) and a single-port ROM for W′ or W′^{−1}. Such memory devices are available in most FPGAs. The simple dual-port BRAMs operate synchronously, with single independent read and write ports that can be used concurrently. When writing the results to memory, the delay stages in Fig. 1 have to be taken into account, making the reads and writes seven clock cycles apart, which helps avoid collisions. Since the constants W′ or W′^{−1} are known beforehand, their values are used to initialize the ROMs.

For versatility, a variable number of PEs is supported. It is easy to infer that three-dimensional BRAMs [51] fit the requirements exactly, since three parameters vary: depth, width in bits, and the number of PEs. Therefore, each PE has its own BRAMs and ROM. By increasing the number of PEs, the depth of the memories is reduced, but more memories have to be instantiated. Hence, this produces a higher throughput to feed the PEs, but it also increases the cost.

To avoid execution halts and memory hazards, full multiplexing/demultiplexing of the butterfly inputs and outputs is provided. Every clock cycle, 3 coefficients and the constants W′ or W′^{−1} per PE are fetched from memory and processed in parallel. After the pipeline of the butterflies, 2 results per PE are stored back at the same BRAM addresses.

For high performance, the BRAMs are fully connected to the PEs in Fig. 3. Therefore, extra clock cycles are only required to fill the pipeline of the PEs. Other approaches [12], [19]–[21], [23] have proposed a mixed solution using a form of time multiplexing in the PEs, with results arriving at the BRAMs at different clock cycles. This simplifies the routing but increases the processing time.

Two individual control blocks are designed. One is responsible for translating and correctly multiplexing the addresses, while the other implements the general control and interface. Fig. 4 presents an example of the addressing scheme. It implements two PEs with separate BRAMs. The memories are simple dual-port, where B0, B1, B2, and B3 denote the four read and write ports. The ROMs are not represented, and the internal pipeline stages are also not considered, for the sake of simplicity. An array representing a polynomial with eight coefficients is stored in the BRAMs during initialization. The order of execution is from left to right, and each level divides the number of coefficients in half due to the divide-and-conquer algorithm. In parallel, (0, 4) and (1, 5) are read from memory and processed, being stored later at the same addresses. Execution follows an identical pattern for (2, 6) and (3, 7). However, the store operation has to be multiplexed between the PE outputs and BRAM inputs to avoid hazards in later cycles. Level 1 requires reading (0, 2) and (1, 3) in parallel, which would not be possible if the values were located in BRAMs accessible from the same reading port B. Demultiplexing the inputs and multiplexing the outputs, as represented in Fig. 3, enables continuous execution and avoids memory hazards.

Another possible organization for the PEs has been recently adopted in some implementations [16], [52]. Those proposals' approach offers hardware reusability and a HW/SW co-design environment based on the RISC-V.

merge the layers of the NTT by using a mix of serial and parallel organization of PEs, to reduce the memory access overhead of the in-place Algorithm 1. It is important to note that a versatile implementation scheme that performs well on both ASICs and FPGAs requires an internal pipeline in the butterfly. Therefore, a serial organization of pipelined PEs adds extra cycles to the processing time (i.e., filling the pipelines takes longer). Besides, additional storage in the form of buffers is required [52], which can impose an area overhead. Such an organization can be more adequate in a HW/SW co-design scenario, improving the performance without imposing an extra buffer and area cost.

Fig. 4. Example of address scheduling and memory hazard avoidance through multiplexing for two parallel butterfly cores and n = 8.

Executing the NTT in-place imposes several additional requirements on the control flow of the algorithm to avoid memory hazards and to ensure a continuous execution flow. There are different solutions in the related literature to tackle this problem [12], [19]–[21], [23]. In fact, this is a well-known and studied shortcoming of the FFT [53], [54]. In such approaches, it is common to use arithmetic operations to schedule the execution sequence and to generate the addresses. Since the NTT follows the same algorithmic structure, similar approaches [12], [19]–[21], [23] can be adopted. Nonetheless, one might be interested in general solutions that are valid for any parameter set. Thus, we implement an Address Generation Unit (AGU) based on lookup tables. To ensure minimization throughout the process, we use the ABC system for sequential logic synthesis and formal verification [55]. The order of execution, de/multiplexing, and addresses are generated by looping through the rows of the lookup table with a single counter. The control and address scheduler has to be converted from truth table to hardware descriptions only once. Thus, we have pre-generated the implementations up to n = 2^15 and PE = 32.

FPGAs offer a flexible platform for prototyping and achieving acceleration with custom hardware implementations. However, for higher performance one might be interested in implementing an ASIC-based accelerator for the NTT, and more generally for FHE. In the next sections we present a HW/SW accelerator that is suitable for an ASIC-based flow. By restructuring the butterflies into a MALU, the proposed approach offers hardware reusability and a HW/SW co-design environment based on the RISC-V.

C. RISC-V Fully-Homomorphic Encryption Accelerator

There are two major approaches commonly applied to the design of accelerators in the context of HW/SW co-design. As presented in [13], we differentiate them by the level of integration within a processor datapath. The accelerators that extend the processor datapath with additional execution units are denoted as tightly coupled, while accelerators that are outside of the datapath are called loosely coupled. Previous papers have already proposed tightly coupled accelerators for PQC [17], [56], [57], but FHE-specific approaches have not yet been proposed. Additionally, when closer to the processor datapath, the accelerator demands less communication overhead but requires extending the ISA [17]. Considering tightly coupled extensions, the RISC-V open-source instruction set is of particular interest due to its extensibility. Therefore, the open-source CVA6 [32] processor is herein considered.

The CVA6 offers a six-stage pipeline with out-of-order execution and in-order commit stages, and has been taped-out in a 22nm technology operating at 1.7GHz [32]. These characteristics offer a more performant platform for acceleration, in contrast to previously extended microcontroller-like cores [17], [56], [57].

Thus, the following sections describe an FHE tightly coupled accelerator and an ISA extension for the 64-bit Linux-ready CVA6 RISC-V core (previously known as Ariane) [32].

1) Extending the Datapath: The first step for the CVA6 datapath extension is to adapt the butterfly architecture in Fig. 1. For the sake of flexibility and greater reuse, the MALU organization presented in Fig. 5 is adopted. The proposed MALU provides five different operations, which are accessible by routing the inputs and outputs according to the instruction being executed. These instructions are based on the modular operations required to implement the arithmetic units in Fig. 1. To avoid additional cycles doing bit-reverse operations, the proposed MALU supports two different butterfly architectures. This allows hardware-accelerated computation of different NTT algorithms, such as DIF (Gentleman-Sande (GS)) [35] and DIT (Cooley-Tukey (CT)) [34].

To make better use of the 64-bit datapath available in the CVA6 processor, a restriction on the number of bits of the MALU is applied. When constraining the number of bits of the MALU, r = log2 p + 2, to 32 (following the primes frequency in Table I), one can merge two of the operands (X and Y) of the butterfly into a single register, simplifying the extended ISA. The values for the constants W or W^-1 can be selected from the 32 MSBs or LSBs, according to the instruction being executed.

The internal organization of the MALU uses two pipeline stages in the operations that involve multiplication (i.e., modular multiplication, GS, and CT), with a throughput of one result per clock cycle after the pipeline is full. Modular addition and subtraction, on the other hand, are single cycle. These design choices follow the same principles used in the Arithmetic


Logic Unit (ALU) and multiplier unit of the CVA6 processor, for consistency and ease of integration.

Fig. 5. Overview of the proposed MALU used to extend the datapath of the CVA6 RISC-V core [32].

It is important to highlight the different design choices in the MALU architecture based on the butterfly in Fig. 1, which take into consideration experimental results in the related art. Previous reports of operating frequencies up to 1.7GHz for the CVA6 processor [32] have not considered the Floating-Point Unit (FPU) in the datapath. However, when included, the FPU drops the maximum operating frequency significantly. Since some FHE schemes in the literature are making use of floating-point operations [45], we have chosen to keep the FPU in the datapath. In summary, the deeper pipeline of Fig. 1 is not required in this particular configuration.

Specific hardware acceleration for Algorithm 1 through the MALU mitigates the problem of the complex modular operations, though it transfers the bottleneck to the memory interface. It is well-known that merging the layers in the NTT algorithm reduces the number of memory accesses, improving the overall performance of the algorithm [58]. The main limitation in merging multiple layers of the NTT is the availability of registers to feed the MALU. Thus, we have extended the CVA6 register file with a set of FHE-specific registers. The new register set consists of 32 × 64-bit elements, denoted hpr. This extension requires changing the implementation of several units of the processor, including the decoder, issue (scoreboard and register renaming), execution, load-and-store unit, and commit stages. Additionally, an ISA extension is required for the new register set and instructions, as discussed in the next section.

2) Extended Instruction Set Architecture: Table III presents the fourteen FHE-specific instructions added to the ISA. There is a total of six modular arithmetic instructions, five of which directly represent the operations in the MALU in Fig. 5. The additional instruction fhe.op.preg simply performs permutations of the source registers' (hs1 and hs2) MSB or LSB 32 bits, according to the specified field hm[26:25]. Two immediate instructions are supported: fhe.op.addi and fhe.op.slli, where the destination is the FHE register set (i.e., hd). Loading and storing are performed by the instructions fhe.ld and fhe.st, respectively, where each operation might occur on a word (64 bits) or half-word (32 bits). To facilitate the programming, one can move data back and forth from the general-purpose register set using the instructions with prefix fhe.mv.

Most of the operations in Table III are straightforward. However, the specific instructions for the NTT (fhe.op.gsb.mod.p and fhe.op.ctb.mod.p for the butterfly and fhe.op.preg for permutation) have the extra field hm[26:25], not specified in the original RISC-V ISA. Based on the previous description of the address translation scheme in Fig. 4, it is clear that some form of permutation is required to implement the NTT. The instruction fhe.op.preg adds exactly this functionality to the ISA, contrary to previous work that proposed a purely fixed hardware structure for such operation [17]. In the context of the butterfly, instructions fhe.op.gsb.mod.p and fhe.op.ctb.mod.p use the field hm[26:25] to specify the origin of the constants W, as in the MSB or LSB 32 bits of hs2. All these extensions have been added to the RISC-V GNU Toolchain [59].

V. RESULTS AND DISCUSSION

This section presents a comparison with the related state of the art considering FPGA and ASIC design flows. The assessment of the proposed architectures is divided into three parts. To begin, the results presented in Tables IV and VII compare the butterfly unit in Fig. 1 with the related state of the art. Subsequently, complete NTT/INTT designs are considered and the results are depicted in Table V and Fig. 6. Finally, the extended CVA6 core results are discussed in detail in Section V-C. The results for arithmetic units and RISC-V cores consider as a target the Xilinx Ultrascale+ FPGA [60] (VCU128 board with device XCVU37P-L2) and UMC's 28nm high-performance ASIC technology libraries [61].

A. Results for the NTT Implementation in FPGA

The butterfly and NTT/INTT architectures were described in VHDL and implemented on the VCU128 board using Xilinx's Vivado 2020.1.2 Hereafter, the reduction of requirements is expressed as (referenced − proposed)/referenced × 100%, where positive values signify that the proposed architecture improves on the referenced designs. To ensure fair evaluations, we avoid comparing to results obtained for different devices, synthesis flows, or software versions. Hence, the evaluations are quantitatively conducted using the same resources and tools. The implementations published in [23], [62] are herein completely resynthesized targeting the same technology and FPGA. It is worthwhile to mention that in [23] an extensive evaluation of the architectures was conducted, with significant improvements over the related state of the art. Therefore, improving on the results presented in [23] means better NTT/INTT architectures in comparison to the state of the art.

As can be noted in Table IV, the proposed butterfly cores outperform those in [23] in all assessed scenarios. Considering the implementations for 28 bits, one observes in Table IV that

2 The original VHDL code for the butterfly [48] in Fig. 1 as well as the improvements proposed in this paper are publicly available at https://github.com/rogpld/fhentt.
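As a behavioral sanity model of the two butterfly flavors exposed by fhe.op.gsb.mod.p and fhe.op.ctb.mod.p, the sketch below (our own illustration, not the paper's code) pairs a GS (DIF) forward NTT, which consumes natural order and produces bit-reversed order, with a CT (DIT) inverse NTT, which consumes bit-reversed order and produces natural order — so no explicit bit-reversal pass is needed, the property the MALU exploits. The toy parameters (n = 8, p = 17, w = 9) are assumptions for illustration only:

```python
def ntt_gs(a, p, w):
    """Gentleman-Sande (DIF) NTT: natural-order input, bit-reversed output."""
    a, n = a[:], len(a)
    half = n // 2
    while half:
        wstep = pow(w, n // (2 * half), p)  # root of unity for this level
        for block in range(0, n, 2 * half):
            tw = 1
            for j in range(half):
                u, v = a[block + j], a[block + j + half]
                a[block + j] = (u + v) % p
                a[block + j + half] = (u - v) * tw % p  # twiddle after subtract
                tw = tw * wstep % p
        half //= 2
    return a

def intt_ct(a, p, w_inv, n_inv):
    """Cooley-Tukey (DIT) inverse NTT: bit-reversed input, natural output."""
    a, n = a[:], len(a)
    half = 1
    while half < n:
        wstep = pow(w_inv, n // (2 * half), p)
        for block in range(0, n, 2 * half):
            tw = 1
            for j in range(half):
                u = a[block + j]
                v = a[block + j + half] * tw % p  # twiddle before add
                a[block + j] = (u + v) % p
                a[block + j + half] = (u - v) % p
                tw = tw * wstep % p
        half *= 2
    return [x * n_inv % p for x in a]

# Toy parameters: p = 17 is prime and 9 has multiplicative order 8 mod 17.
P, W, N = 17, 9, 8
W_INV, N_INV = pow(W, -1, P), pow(N, -1, P)
coeffs = list(range(N))
assert intt_ct(ntt_gs(coeffs, P, W), P, W_INV, N_INV) == coeffs
```

Each iteration of the inner loops corresponds, roughly, to one GS or CT butterfly instruction on a packed operand pair, with the twiddle tw playing the role of the ROM/hs2 constant.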


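The operand-merging convention described above (X and Y sharing one 64-bit hpr register, with hm[26:25] selecting the MSB or LSB half for the constant W) can be modeled as follows; the helper names and the X-in-MSBs convention are illustrative assumptions, not part of the ISA:

```python
MASK32 = (1 << 32) - 1

def pack_xy(x, y):
    """Place operand X in the upper 32 bits and Y in the lower 32 bits
    of a 64-bit register value (illustrative convention)."""
    assert 0 <= x <= MASK32 and 0 <= y <= MASK32
    return (x << 32) | y

def unpack_xy(reg):
    """Recover the two 32-bit operands from a packed 64-bit value."""
    return (reg >> 32) & MASK32, reg & MASK32

def select_w(hs2, from_msbs):
    """Model the hm-field choice of taking W from the MSB or LSB half."""
    return (hs2 >> 32) & MASK32 if from_msbs else hs2 & MASK32

reg = pack_xy(0x0BADF00D, 0x00C0FFEE)
assert unpack_xy(reg) == (0x0BADF00D, 0x00C0FFEE)
assert select_w(reg, from_msbs=True) == 0x0BADF00D
```

Packing both operands this way is what lets a single two-source instruction feed a full butterfly on a 64-bit datapath.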
TABLE III
D ESCRIPTION OF THE C USTOM I NSTRUCTIONS E NCODINGS

TABLE IV descriptions from Xilinx has substantially improved previous


X ILINX VCU128 FPGA I MPLEMENTATION R ESULTS FOR results [48]. When instantiated, such descriptions infer better
THE B UTTERFLY C ORE IN F IG . 1
tiling optimizations by the synthesis tools, leading to the
obtained enhancements. Usage of DSP blocks, one of the most
scarce resources on the FPGAs, reduces by an average of 43%
the numbers of DSPs in [23]. This is due to the Montgomery
iterative procedure for modular reduction that [23] imple-
ments, which requires instantiation of a sequence of multiply
and accumulate units leading to the higher usage of DSPs
blocks observed in the results in Table V. Furthermore, higher
operating frequencies are achieved in most cases, with up to
56% increase for n = 2048 and PE = 2. For a number of PEs
the proposed butterfly reduces the number of LUTs and regis- higher than four the proposed approach requires an average of
ters by at most 14% and 21%, respectively, while speeding up 5% more LUTs, due to the fully-connected routing layer from
the arithmetic by 1.21 times. While the Montgomery butterfly the PEs to the BRAMs.
in [23] has an iterative multiply-and-accumulate architecture, By observing the behavior of the number of LUTs when
the proposed one (Fig. 1) has an upper-bounded structure increasing the PEs in Table V it can be seen that the proposed
requiring nearly half as many DSP blocks. Parameters from designs consume more LUTs in comparison with [23] for
hardware realizations of FHE schemes in the related art using PE = 8. The reason is that to minimize the number of
RNS might require up to 13 moduli of 30 bits each [21]. The BRAMs and to improve on the execution time the proposed
hardware requirements, with many parallel execution units, solutions need fully-connected interfaces to the PEs (see
stack up quickly taking a significant chunk of the FPGA Fig. 3). This increases the number of LUTs, but keeps the
resources just for butterfly units. Therefore, cost reductions number of registers relatively low. There are different solutions
are specially important for those FHE schemes. to overcome this small drawback in the state of the art.
Table V presents the results for complete NTT/INTT FPGA In general, additional time multiplexing has to be inserted into
designs with several PEs instantiated in a parallel topology, the connections, similar to the control in [23]. Another possible
according to the block diagram in Fig. 3. The implementations solution is to merge some of the levels of the NTT/INTT,
consider a range of parameters, varying from N = 1024 up as in [17], for reducing the number of memory accesses and
to 4096 with 28 bit primes, to highlight the versatility and simplifying the connections. Pragmatically, if the number of
flexibility of the proposed NTT/INTT. Based on the results LUTs is the bottleneck in a given application, one might
in Table V, one concludes that for most of the metrics, the choose the control of [23] with the proposed butterfly in
proposed designs improve the state of the art with signifi- Fig. 1, at the cost of increasing the number of BRAMs,
cant savings in resource allocation while also improving the but with no detriment in functionality. However, since FHE
maximum frequency of operation. In terms of FPGA resource implementations might demand large amounts of memory [63]
usage, we observe up to 23.5% and 39.5% reduction in LUTs and several different parallel paths, due to the CRT, a trade-off
and registers for PE = 2, N = 2048 and PE = 2, N = in favor of higher number LUTs instead of BRAMs might not
4096, respectively, compared to [23]. Besides, inspection of be necessarily undesired.
the results in Table V for the number of BRAMs required The number of clock cycles is a characteristic of the
shows average reductions of 56% in comparison to [23]. design and relatively independent of the platform and tool
The new addressing scheme using three dimensional BRAMs parametrization. Hence, one can extend the evaluation of


the architectures by considering the timing behavior of the NTT/INTT operation, allowing more quantitative comparisons with the related state of the art. Table VI shows clock cycle results for a number of different designs implementing the same NTT/INTT. Ideally, an NTT/INTT computation would take n/(2PE) × log2 n clock cycles to finish. Additional cycles might be required to fill/empty the pipeline or if the control algorithm halts to avoid hazards. Because of the design choices highlighted in Fig. 3, the proposed architecture needs additional cycles just for pipeline management, which corresponds in total to n/(2PE) × log2 n + 8. Even though some architectures [14], [15] require a similar number of cycles, they are not generic on the parametrization and, generally, they only work for small prime moduli. The herein proposed architectures are flexible and achieve the least number of clock cycles among the FHE-based architectures compared in Table VI.

TABLE V
NTT EXPERIMENTAL RESULTS FROM SYNTHESIS AND PLACE AND ROUTE TARGETING DIFFERENT PARAMETERS ON A XILINX XCVU37P-L2 DEVICE

TABLE VI
NUMBER OF CLOCK CYCLES (CC) TO PERFORM THE NTT/INTT WITH TWO PARALLEL PROCESSING ELEMENTS

Fig. 6. Experimental results for execution time of the architecture in Fig. 3 implemented on the FPGA.

The execution time is computed and presented in Fig. 6 based on the number of clock cycles and the maximum frequency of operation. From Fig. 6 it can be concluded that the proposed designs achieve significant speedups in comparison with [23]. Mert et al. [23] take additional cycles due to the control and time multiplexing adopted, especially for the INTT operation. Therefore, a speedup of up to 1.9 is attained when computing INTTs for n = 2048 and PE = 2.

B. Results for the Butterfly Implementation in ASIC

The proposed butterfly was also implemented on UMC's 28nm technology, making use of the ultra-high-density core cells library from Faraday Tech with 7 metal stack layers. Table VII presents the post place-and-route results considering a target frequency of 1 GHz and typical operating conditions of 1.05 V and 25°C for all designs. The feasible operating frequency, under these conditions, is propagated from logical synthesis to the physical implementation. Logical synthesis was carried out using Cadence Genus version 19.11-s087, while the complete layout and power estimation were done with Innovus version v19.11-s128. The synthesis tools are configured to avoid retiming, due to the proposed manually optimized pipeline. This maintains the structure of the pipeline and provides a fair comparison with the related state of the art. For the sake of accuracy, power estimation considers the post-layout parasitic extraction and gate-level switching activity using 10000 uniformly distributed random inputs.

Table VII presents the results for the butterfly with 28- and 30-bit primes. In terms of frequency, the results are very close, with the exception of 28 bits, where the butterfly presented in [23] cannot achieve the 1 GHz target frequency. Still, the proposed butterfly has exceptionally reduced the required area, by 38% and 52% for 28 and 30 bits in comparison with [23], respectively. Additionally, the smaller area directly impacts the power of the circuit, with substantial reductions of almost 53% for 28 bits, compared to [23].

While DSPs are readily available in FPGAs, in ASICs one has to consider the higher area utilization when compared to simple shifts and additions. Unfortunately, most of the


FHE hardware accelerators in the state of the art [18]–[23], [26]–[29] that target FPGAs are not well-suited for ASIC realization. As previously observed, the number of DSP blocks required in [23] is almost twice that of the proposed architecture, which leads to the poor results observed in Table VII. The proposed butterfly outperforms even partial implementations of the NTT/INTT operation, such as the single modular multiplication operation proposed in [25]. A resynthesis of the multiplier proposed in [25], using the same technology, tools, and operating conditions, shows that it takes approximately 40% more area while being 4.6 times slower. These results are not surprising, since [25] uses several RCA hardware units in the datapath, in contrast with our simplified adder tree in Fig. 2.

TABLE VII
ASIC PLACE-AND-ROUTE RESULTS ON UMC'S 28nm HP TECHNOLOGY FOR THE BUTTERFLY CORE IN FIG. 1 TARGETING 1GHZ

TABLE VIII
ASIC RESULTS ON UMC'S 28nm TECHNOLOGY FOR THE ORIGINAL [32] AND EXTENDED CVA6 CORE AND THE INTERNAL ARITHMETIC UNITS MALU, MULTIPLIER, AND FPU

In summary, the ASIC results emphasize the versatility, performance, and low complexity of the proposed Montgomery reduction and butterfly architecture. Besides, these characteristics are especially important when designing HW/SW accelerators supported on ASIC implementations. High-performance processors have deeper pipelines operating at high frequencies. The next sections address these issues considering the CVA6 core.

C. Results for the RISC-V Accelerator

Table VIII presents the results from synthesis of the original CVA6 [32] and the extended RISC-V core on the same 28nm technology and operating conditions previously considered. The results are in the Gate Equivalent (GE) metric, but the normalization factor is provided for future comparison. For the sake of experimentation, only the processor pipeline with all execution units is considered for synthesis, similar to [56], since the memory cells would substantially offset the results. Inspection of Table VIII suggests that the MALU in Fig. 5 increases the total area of the core by only around 10%. For reference and comparison, the area introduced with the proposed MALU is slightly more than an order of magnitude of the total area of the FPU and less than half of the area of the multiplier unit.3 Furthermore, due to careful design and reduced complexity, the MALU in Fig. 5 has no impact on the critical path of the processor. The analysis of Table VIII shows that the maximum operating frequency drops by around 1% in the extended core, which is due to the changes in the pipeline to support the hpr register set, which introduce some variability. In both the original and extended cores the critical path goes through the FPU.

As expected, the ISA extension provides flexibility and increased performance to develop efficient NTT and INTT algorithms using both types of butterflies supported (i.e., GS and CT). To assess the impact of the MALU on the pipeline, we implemented the CVA6 on the VCU128 board and executed a series of benchmarks to assess performance and also for verification. The FPGA implementation was configured to use 16 kB and 32 kB for instruction and data caches, associative with 4 and 8 ways, respectively. In such a configuration we measured the number of clock cycles and instructions executed to assess the performance of the processor with the additional MALU. A total of 50 executions for cache warmup is considered, while the results are averaged over 50 runs as well. Based on the reported values, we consider the ASIC frequency of operation to determine similar metrics as in [32]. When considering an NTT of size 4096, the extended core achieves a maximum Instructions per Cycle (IPC) of 0.83 with an average of 1.71 MIPS/MHz, similar to the previously reported metrics for the CVA6 in the literature [32]. These results are expected, since the MALU follows the same design principles as the ALU and is outside of the critical path of the processor.

The performance results using HW/SW co-design can also be assessed in comparison to the FPGA accelerator. Hence, by considering the operating frequencies in the ASIC design flow, we observe that, in general, the RISC-V accelerator is 17 times slower than the FPGA accelerator for the parameters evaluated. This result aligns with previous tightly-coupled accelerators [17]. It is worthwhile to mention that we have focused mostly on the hardware design of the RISC-V accelerator and there is still significant room for improvements in the software implementation. A software design based on a high-performance core such as the CVA6 should make better usage of cache-aware or cache-oblivious algorithms for the NTT, reducing the memory accesses and cache misses and improving the overall execution time. However, such development is outside of the scope of this work, where we focus more on the hardware and MALU aspects of the design.

There are RISC-V extensions in the related art for acceleration of the traditional PQC algorithms (e.g., Kyber, NewHope, Frodo, qTESLA, LAC, etc.) [12], [16], [17]. However, as mentioned earlier, these approaches are not well suited to be

3 The CVA6 has a separate integer multiplier unit.
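The comparisons above revolve around the proposed Montgomery-based reduction. As a behavioral reference only — the actual datapath is built from CSA trees rather than full-width multiplies — a textbook word-level Montgomery REDC with R = 2^32 can be sketched as below; the 30-bit NTT-friendly prime 998244353 = 119·2^23 + 1 is our illustrative choice, not a parameter from the paper:

```python
R_BITS = 32
R = 1 << R_BITS

def mont_setup(p):
    """Precompute p' = -p^{-1} mod R for an odd modulus p < R."""
    return pow(-p, -1, R)

def redc(t, p, p_prime):
    """Montgomery reduction: return t * R^{-1} mod p, for 0 <= t < p * R."""
    m = (t * p_prime) & (R - 1)  # m = t * p' mod R
    u = (t + m * p) >> R_BITS    # t + m*p is an exact multiple of R
    return u - p if u >= p else u

def mont_mul(a_m, b_m, p, p_prime):
    """Product of two values already in Montgomery form (x_m = x * R mod p)."""
    return redc(a_m * b_m, p, p_prime)

p = 998244353            # 30-bit NTT-friendly prime, illustration only
pp = mont_setup(p)
x, y = 123456789 % p, 987654321 % p
x_m, y_m = x * R % p, y * R % p
assert redc(mont_mul(x_m, y_m, p, pp), p, pp) == x * y % p
```

Keeping twiddle factors permanently in Montgomery form, as is common in NTT implementations, makes the correction factor R^{-1} cancel out of the butterfly, which is one reason this style of reduction maps well to a fixed hardware datapath.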


used in the context of FHE, due the low number of bits number of clock cycles in approximately, 57%, 58%, and 66%
in the butterfly arithmetic. Additionally, the target class of for 512, 1024, and 2048, respectively, in comparison to [64]
RISC-V core is generally towards smaller 32-bit low-power for the RISC-V 32 bits. Additionally, since the results in [64]
implementations. Thus, we avoid direct comparisons in terms depend on the prime number, for a given polynomial degree
of performance due to the different design objectives. Nev- there are significant variations in the number of clock cycles.
ertheless, one can quantitatively compare the butterfly and Such shortcoming would not be a problem in the proposed
MALU, being aware of the difference in the datapath and extended CVA6 core, since it implements the modular arith-
number of bits. For comparison, the tightly-coupled accelera- metic in hardware. However, the MALU presented in Fig. 5
tors for PQC presented in [17], also provide similar operations only supports a single modulus. In the context of FHE, the
when the MALU is compared separately. A resynthesis of the extended CVA6 core provides a better platform for speeding
accelerators presented in [17] show that their proposed MALU up the computation, since enables more arithmetic intensive
takes approximately the same area (60 kGE). Although in [17] operations as the CRT by reusing the arithmetic units and
the MALU supports two butterflies in parallel, it operates making easy the development of the algorithms assisted by
on only 14 bits, which equates the area. Based on area, the software.
approach in [17] is quite similar to ours, however, taking a Previous work is mainly focused on the microcontroler-
closer look at the critical path reveals that [17] slows the like RISC-V cores with limited or no support for fully-
processor down quite significantly. Since no internal pipeline featured Operating Systems (OSs). Here, we present an FHE
is adopted, the maximum operating frequency is bounded by accelerator that supports a full-blown Linux OS. As originally
the MALU and around 300 MHz (in 28nm), which is half reported in [32] the support for application-class implies some
of our proposed extended RISC-V. Moreover, extending the extra cost in energy efficiency, but comes with many benefits,
number of bits in [17] would make the case even worst, such as, easier programmability, memory management, pro-
emphasizing the advantages of the proposed solution when gram isolation, additional libraries and drivers, and user space
FHE is considered. programming standardization. Besides, it has been shown that
In [16], the authors proposed a vector coprocessor for the memory-bound applications, such as the NTT, benefit more
NTT/INTT that takes 942 kGE in a similar 28nm technology. from acceleration on a powerful core than on several weaker
For comparison, the whole 64 bit Linux-ready CVA6 core ones [31].
(with an FPU) takes only 14% more area (1075 kGE vs. Aside from the complexity and performance discussion it
942 kGE). Because the main objective in [16] is acceleration is important to consider the implications of the extended
of PQC algorithms, their butterfly unit operates on fixed 16-bit words, but it is parametric in the number of bits, and the authors have made their code available. Thus, we have resynthesized the butterfly architecture with a word length of 28 bits for comparison. The results show that it requires more than three times the area (195 kGE), while the maximum operating frequency is around 225 MHz, less than half that of our proposed extended RISC-V core. Clearly, with no internal pipeline and with three RCAs and two multipliers in the critical path, increasing the number of bits will greatly degrade the performance of [16].
A different acceleration technique in the context of RISC-V and PQC was presented in [64]. Karabulut et al. proposed a dynamic instruction scheduling implementation that interfaces with the fetch, execute, and write-back pipeline stages of the RISC-V core and is able to detect and improve the execution schedule of NTT/INTT computations. No additional arithmetic units are added, and the extensions are hardware centered, requiring no changes to the compiler infrastructure. In [64], the NTT/INTT extensions increase the area (LUTs) of the core by around 40%, but significantly improve the performance. The internal loop of the NTT, comprising the butterfly arithmetic, takes 6 clock cycles to complete [64], the same as a single execution of a GS or CT butterfly in the proposed MALU. Therefore, one can conclude that, in terms of clock cycles, our approach provides similar results, provided that the execution of the translated software using the instructions in Table III is guaranteed. Unfortunately, the cache-size parameters of the 64-bit implementation in [64] were not disclosed.
Our proposed architecture improves the RISC-V core and ISA from a security perspective. The availability of a full-fledged OS already provides some additional level of isolation and security in comparison with previous work [16], [17]. Nevertheless, one should consider the scenario where the accelerator would be applied. On the server side, if always operating on encrypted data, following the principles of FHE, some security features might not be required. On the other hand, when used in the context of the client, all security features might be of utmost importance. Although we have not addressed these issues in this work, security has to be taken into consideration at both the hardware and software levels, which is left as future work.

VI. CONCLUSION

Novel hardware architectures for the acceleration of NTTs on FPGA and ASIC have been presented. The problem of finding efficient primes for constructing moduli sets for RNS-based FHE schemes was tackled. A constrained algorithm for finding primes that generate upper-bounded butterfly architectures for the NTT was proposed. Furthermore, versatile specialized hardware taking advantage of such primes was developed. When compared to the state of the art in FPGAs, the architectures speed up the computations by a factor of 1.9, while showing reductions of 30%, 90%, and 42% in the number of LUTs and registers, BRAMs, and DSPs, respectively. For ASIC, the results at 1 GHz reveal an average decrease of 45% and 52% in circuit area and power consumption, respectively, compared to the state of the art. The RISC-V extension with the MALU imposed almost no penalty on the operating frequency,


while adding only 10% to the overall circuit area. Moreover, the proposed extended ISA offers versatility, performance, and easier programmability, based on the extension of the capabilities of the CVA6 Linux-ready RISC-V core. Future work will focus on the addition of more execution units, ISA extensions, security analysis, and the development of the remaining polynomial arithmetic typically used in FHE schemes.

REFERENCES

[1] R. L. Rivest, L. Adleman, and M. L. Dertouzos, "On data banks and privacy homomorphisms," Found. Secure Comput., vol. 4, no. 11, pp. 169–180, 1978.
[2] C. Gentry, "Fully homomorphic encryption using ideal lattices," in Proc. 41st Annu. ACM Symp. Theory Comput. (STOC), 2009, pp. 169–178.
[3] A. Viand, P. Jattke, and A. Hithnawi, "SoK: Fully homomorphic encryption compilers," in Proc. IEEE Symp. Secur. Privacy (SP), May 2021, pp. 1092–1108.
[4] L. Ducas and D. Micciancio, "FHEW: Bootstrapping homomorphic encryption in less than a second," in Advances in Cryptology—EUROCRYPT. Berlin, Germany: Springer, 2015, pp. 617–640, doi: 10.1007/978-3-662-46800-5_24.
[5] J. H. Cheon, A. Kim, M. Kim, and Y. Song, "Homomorphic encryption for arithmetic of approximate numbers," in Advances in Cryptology—ASIACRYPT. Cham, Switzerland: Springer, 2017, pp. 409–437, doi: 10.1007/978-3-319-70694-8_15.
[6] P. Martins, L. Sousa, and A. Mariano, "A survey on fully homomorphic encryption: An engineering perspective," ACM Comput. Surv., vol. 50, no. 6, pp. 1–33, Nov. 2018, doi: 10.1145/3124441.
[7] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, "TFHE: Fast fully homomorphic encryption over the torus," J. Cryptol., vol. 33, no. 1, pp. 34–91, 2018, doi: 10.1007/s00145-019-09319-x.
[8] (Aug. 2021). PALISADE Lattice Cryptography Library. [Online]. Available: https://palisade-crypto.org/
[9] (Nov. 2020). Microsoft SEAL. Microsoft Research, Redmond, WA, USA. [Online]. Available: https://github.com/Microsoft/SEAL
[10] National Institute of Standards and Technology (NIST). (2020). PQC Standardization Process: Third Round Candidate Announcement. [Online]. Available: https://csrc.nist.gov/News/2020/pqc-third-round-candidate-announcement
[11] P.-C. Kuo et al., "High performance post-quantum key exchange on FPGAs," J. Inf. Sci. Eng., vol. 37, no. 5, pp. 1211–1229, Sep. 2021, doi: 10.6688/JISE.202109_37(5).0015.
[12] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, "Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2019, no. 4, pp. 17–61, Aug. 2019, doi: 10.13154/tches.v2019.i4.17-61.
[13] T. Fritzmann and J. Sepúlveda, "Efficient and flexible low-power NTT for lattice-based cryptography," in Proc. IEEE Int. Symp. Hardw. Oriented Secur. Trust (HOST), May 2019, pp. 141–150, doi: 10.1109/HST.2019.8741027.
[14] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, "Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2020, no. 2, pp. 49–72, Mar. 2020, doi: 10.46586/tches.v2020.i2.49-72.
[15] Y. Xing and S. Li, "An efficient implementation of the NewHope key exchange on FPGAs," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 3, pp. 866–878, Mar. 2020, doi: 10.1109/TCSI.2019.2956651.
[16] G. Xin et al., "VPQC: A domain-specific vector processor for post-quantum cryptography based on RISC-V architecture," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 8, pp. 2672–2684, Aug. 2020, doi: 10.1109/TCSI.2020.2983185.
[17] T. Fritzmann, G. Sigl, and J. Sepúlveda, "RISQ-V: Tightly coupled RISC-V accelerators for post-quantum cryptography," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2020, no. 4, pp. 239–280, Aug. 2020, doi: 10.46586/tches.v2020.i4.239-280.
[18] J. Cathebras, "Hardware acceleration for homomorphic encryption," Ph.D. dissertation, Dept. Sci. Technol. Inf. Commun., Paris-Sud Univ., Bures-sur-Yvette, France, 2018.
[19] A. C. Mert, E. Ozturk, and E. Savas, "Design and implementation of encryption/decryption architectures for BFV homomorphic encryption scheme," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 2, pp. 353–362, Feb. 2020, doi: 10.1109/TVLSI.2019.2943127.
[20] A. C. Mert, E. Öztürk, and E. Savaş, "Design and implementation of a fast and scalable NTT-based polynomial multiplier architectures," in Proc. 22nd Euromicro Conf. Digit. Syst. Des. (DSD), 2019, pp. 253–260, doi: 10.1109/DSD.2019.00045.
[21] S. S. Roy, F. Turan, K. Jarvinen, F. Vercauteren, and I. Verbauwhede, "FPGA-based high-performance parallel architecture for homomorphic computing on encrypted data," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2019, pp. 387–398, doi: 10.1109/HPCA.2019.00052.
[22] J. Cathébras, A. Carbon, P. Milder, R. Sirdey, and N. Ventroux, "Data flow oriented hardware design of RNS-based polynomial multiplication for SHE acceleration," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2018, no. 3, pp. 69–88, Aug. 2018. [Online]. Available: https://tches.iacr.org/index.php/TCHES/article/view/7293
[23] A. C. Mert, E. Karabulut, E. Ozturk, E. Savas, and A. Aysu, "An extensive study of flexible design methods for the number theoretic transform," IEEE Trans. Comput., early access, Aug. 19, 2020, doi: 10.1109/TC.2020.3017930.
[24] F. Turan, S. S. Roy, and I. Verbauwhede, "HEAWS: An accelerator for homomorphic encryption on the Amazon AWS FPGA," IEEE Trans. Comput., vol. 69, no. 8, pp. 1185–1196, Aug. 2020, doi: 10.1109/TC.2020.2988765.
[25] W. Tan, B. M. Case, A. Wang, S. Gao, and Y. Lao, "High-speed modular multiplier for lattice-based cryptosystems," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 8, pp. 2927–2931, Aug. 2021.
[26] X. Hu, M. Li, J. Tian, and Z. Wang, "DARM: A low-complexity and fast modular multiplier for lattice-based cryptography," in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2021, pp. 175–178.
[27] S. S. Roy, K. Jarvinen, J. Vliegen, F. Vercauteren, and I. Verbauwhede, "HEPCloud: An FPGA-based multicore processor for FV somewhat homomorphic function evaluation," IEEE Trans. Comput., vol. 67, no. 11, pp. 1637–1650, Nov. 2018, doi: 10.1109/TC.2018.2816640.
[28] D. B. Cousins, K. Rohloff, and D. Sumorok, "Designing an FPGA-accelerated homomorphic encryption co-processor," IEEE Trans. Emerg. Topics Comput., vol. 5, no. 2, pp. 193–206, Oct. 2017, doi: 10.1109/TETC.2016.2619669.
[29] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, "HEAX: An architecture for computing on encrypted data," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst. New York, NY, USA: ACM, 2020, pp. 1295–1309, doi: 10.1145/3373376.3378523.
[30] V. Migliore et al., "A high-speed accelerator for homomorphic encryption using the Karatsuba algorithm," ACM Trans. Embedded Comput. Syst., vol. 16, no. 5, pp. 1–17, Oct. 2017, doi: 10.1145/3126558.
[31] X. Liang, M. Nguyen, and H. Che, "Wimpy or brawny cores: A throughput perspective," J. Parallel Distrib. Comput., vol. 73, no. 10, pp. 1351–1361, 2013, doi: 10.1016/j.jpdc.2013.06.001.
[32] F. Zaruba and L. Benini, "The cost of application-class processing: Energy and performance analysis of a Linux-ready 1.7-GHz 64-bit RISC-V core in 22-nm FDSOI technology," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 11, pp. 2629–2640, Nov. 2019.
[33] E. O. Brigham, The Fast Fourier Transform and its Applications. Upper Saddle River, NJ, USA: Prentice-Hall, 1988.
[34] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, no. 90, pp. 297–301, 1965, doi: 10.1090/S0025-5718-1965-0178586-1.
[35] W. M. Gentleman and G. Sande, "Fast Fourier transforms: For fun and profit," in Proc. Fall Joint Comput. Conf. New York, NY, USA: ACM, 1966, pp. 563–578, doi: 10.1145/1464291.1464352.
[36] The Alan Turing Institute. (2021). SHEEP Homomorphic Encryption Evaluation Platform. [Online]. Available: https://github.com/alan-turing-institute/SHEEP
[37] E. Chu and A. George, Inside the FFT Black Box. Boca Raton, FL, USA: CRC Press, Nov. 1999, doi: 10.1201/9781420049961.
[38] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA, USA: MIT Press, 2009.
[39] P. Longa and M. Naehrig, "Speeding up the number theoretic transform for faster ideal lattice-based cryptography," in Proc. Int. Conf. Cryptol. Netw. Secur. Cham, Switzerland: Springer, 2016, pp. 124–139.
[40] D. Harvey, "Faster arithmetic for number-theoretic transforms," J. Symbolic Comput., vol. 60, pp. 113–119, Jan. 2014, doi: 10.1016/j.jsc.2013.09.002.
[41] (2003). The Online Encyclopedia of Integer Sequences—Proth Primes. [Online]. Available: https://oeis.org/A080076
[42] R. Avanzi et al., "CRYSTALS-Kyber algorithm specifications and supporting documentation," NIST PQC Round, vol. 2, no. 4, pp. 1–43, 2021.


[43] S. Bai et al., "CRYSTALS-Dilithium algorithm specifications and supporting documentation," NIST Post-Quantum Cryptogr. Standardization Round, vol. 3, pp. 1–38, Feb. 2021.
[44] J.-C. Bajard, J. Eynard, M. A. Hasan, and V. Zucca, "A full RNS variant of FV like somewhat homomorphic encryption schemes," in Selected Areas in Cryptography—SAC (Lecture Notes in Computer Science). Cham, Switzerland: Springer, 2017, pp. 423–442, doi: 10.1007/978-3-319-69453-5_23.
[45] A. Al Badawi, Y. Polyakov, K. M. M. Aung, B. Veeravalli, and K. Rohloff, "Implementation and performance evaluation of RNS variants of the BFV homomorphic encryption scheme," IEEE Trans. Emerg. Topics Comput., vol. 9, no. 2, pp. 941–956, Apr. 2021, doi: 10.1109/TETC.2019.2902799.
[46] V. Dimitrov, L. Imbert, and A. Zakaluzny, "Multiplication by a constant is sublinear," in Proc. 18th IEEE Symp. Comput. Arithmetic (ARITH), Jun. 2007, pp. 261–268, doi: 10.1109/ARITH.2007.24.
[47] S. S. Wagstaff, "Prime numbers with a fixed number of one bits or zero bits in their binary representation," Experim. Math., vol. 10, no. 2, pp. 267–273, Jan. 2001.
[48] R. Paludo and L. Sousa, "Number theoretic transform architecture suitable to lattice-based fully-homomorphic encryption," in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2021, pp. 163–170.
[49] C. Aguilar-Melchor, J. Barrier, S. Guelton, A. Guinet, M.-O. Killijian, and T. Lepoint, "NFLLIB: NTT-based fast lattice library," in Proc. Cryptographers' Track RSA Conf. Cham, Switzerland: Springer, 2016, pp. 341–356.
[50] M. Albrecht et al., "Homomorphic encryption security standard," HomomorphicEncryption.org, Toronto, ON, Canada, Tech. Rep. v1.1, Nov. 2018.
[51] Vivado Design Suite User Guide (UG901), Xilinx, San Jose, CA, USA, 2016.
[52] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, "High-speed NTT-based polynomial multiplication accelerator for CRYSTALS-Kyber post-quantum cryptography," Cryptol. ePrint Arch., Tech. Rep. 2021/563, 2021. [Online]. Available: https://ia.cr/2021/563
[53] X. Xiao, E. Oruklu, and J. Saniie, "An efficient FFT engine with reduced addressing logic," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 11, pp. 1149–1153, Nov. 2008, doi: 10.1109/TCSII.2008.2004540.
[54] S. Agarwal, S. R. Ahamed, A. K. Gogoi, and G. Trivedi, "A low-complexity shifting-based conflict-free memory-addressing architecture for higher-radix FFT," IEEE Access, vol. 9, pp. 140349–140357, 2021, doi: 10.1109/ACCESS.2021.3119598.
[55] Berkeley Logic Synthesis and Verification Group. (2021). ABC: A System for Sequential Synthesis and Verification. [Online]. Available: https://github.com/berkeley-abc/abc
[56] T. Fritzmann, G. Sigl, and J. Sepulveda, "Extending the RISC-V instruction set for hardware acceleration of the post-quantum scheme LAC," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2020, pp. 1420–1425, doi: 10.23919/DATE48585.2020.9116567.
[57] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, "ISA extensions for finite field arithmetic: Accelerating Kyber and NewHope on RISC-V," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2020, pp. 219–242, Jun. 2020, doi: 10.46586/tches.v2020.i3.219-242.
[58] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proc. IEEE, vol. 93, no. 2, pp. 216–231, Feb. 2005.
[59] (Sep. 2021). RISC-V GNU Toolchain. [Online]. Available: https://github.com/riscv-collab/riscv-gnu-toolchain
[60] Xilinx. (2021). Virtex UltraScale+ HBM VCU128 FPGA. [Online]. Available: https://www.xilinx.com/products/boards-and-kits/vcu128.html
[61] UMC. (2021). UMC's 28 nm High-k/Metal Gate Stack HPCU. [Online]. Available: https://www.umc.com/upload/media/05_Press_Center/3_Literatures/Process_Technology/28nm_Brochure.pdf
[62] A. C. Mert. (2021). Parametric NTT/INTT Hardware Generator. [Online]. Available: https://github.com/acmert/parametric-ntt
[63] L. de Castro et al., "Does fully homomorphic encryption need compute acceleration?" 2021, arXiv:2112.06396.
[64] E. Karabulut and A. Aysu, "RANTT: A RISC-V architecture extension for the number theoretic transform," in Proc. 30th Int. Conf. Field-Program. Log. Appl. (FPL), Aug./Sep. 2020, pp. 26–32, doi: 10.1109/FPL50879.2020.00016.

Rogério Paludo (Student Member, IEEE) received the Ph.D. degree in electrical engineering from the Federal University of Santa Catarina, Florianópolis, Brazil, in 2020. He is currently a Post-Doctoral Researcher at the Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Lisbon, Portugal. His research interests include computer arithmetic, VLSI architectures, hardware design, signal processing, and fully-homomorphic encryption.

Leonel Sousa (Senior Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the Instituto Superior Técnico (IST), Universidade de Lisboa (UL), Lisbon, Portugal, in 1996. He is currently a Full Professor with the IST, UL. He is also a Senior Researcher with the Research and Development Instituto de Engenharia de Sistemas e Computadores (INESC-ID). His research interests include computer arithmetic, VLSI architectures, parallel computing, and signal processing. He is a fellow of the IET. He has contributed to more than 200 papers in journals and international conferences, for which he has received several awards, including the DASIP13 Best Paper Award, the SAMOS11 Stamatis Vassiliadis Best Paper Award, the DASIP10 Best Poster Award, and the Honorable Mention Award UL/Santander Totta for the quality of the publications in 2021 and 2016. He has contributed to the organization of several international conferences, namely as the program chair and as the general and topic chair, and has given keynotes in some of them. He has edited five special issues of international journals, he is also an Associate Editor of the IEEE TRANSACTIONS ON COMPUTERS, IEEE ACCESS, and JRTIP (Springer), and a Senior Editor of the Journal on Emerging and Selected Topics in Circuits and Systems. He is a Distinguished Scientist of ACM.
