Abstract— This paper proposes two architectures for the acceleration of Number Theoretic Transforms (NTTs) using a novel Montgomery-based butterfly. We first design a custom NTT hardware accelerator for Field-Programmable Gate Arrays (FPGAs). The butterfly architecture is then expanded into a Modular Arithmetic Logic Unit (MALU) and, for greater reuse and easier programmability, a six-stage pipeline Linux-ready RISC-V core is extended with custom instructions. The performance of the proposed architectures is assessed on a Xilinx Ultrascale+ FPGA and with an Application-Specific Integrated Circuit (ASIC) on 28nm CMOS technology. In FPGA, the results for custom acceleration show reductions of 30%, 90% and 42% in the number of Lookup tables (LUTs) and registers, Block RAMs (BRAMs) and Digital Signal Processors (DSPs), while providing a speedup of 1.9 times, in comparison with the state of the art. The ASIC results show that at 1 GHz the proposed architecture requires on average 45% less area and 52% less power, compared to the state of the art. Furthermore, the proposed MALU, operating as an additional execution unit, increases the overall area of the extended RISC-V core by only 10%, without significant changes in the frequency of operation.

Index Terms— Fully-homomorphic encryption, number theoretic transform, Montgomery algorithm, RISC-V, FPGA, ASIC.

I. INTRODUCTION

FHE was conceptualized in 1978 by Rivest, Adleman and Dertouzos [1], yet only in 2009 did Gentry [2] propose the first feasible FHE scheme. Major breakthroughs have been made in the field of FHE [3] since 2009. At present, FHE schemes [4]–[7] are orders of magnitude faster than earlier generations, but they still remain prohibitively expensive in terms of computing.

The mathematical framework used in the first FHE scheme by Gentry [2] was based on lattices, which are still used in most FHE schemes [4]–[9]. Lattice problems are believed to be hard and to provide resistance to quantum attacks. Because of this, Lattice-Based Cryptography (LBC) is at the core of the standardization process for next generation cryptosystems [10]. In practical terms, lattice-based FHE and LBC require arithmetic on polynomials, with multiplication being the most complex operation.

Naive algorithms for polynomial multiplication, such as schoolbook, are known to be quadratic in complexity. In this context, NTTs represent a practical tool for reducing the complexity to quasilinear (i.e., O(n log n)), bringing the prohibitively high cost of FHE down to more practical levels. Hence, the development of efficient NTT hardware accelerators, aiming at increasing the performance of polynomial operations for lattice
Authorized licensed use limited to: The Ohio State University. Downloaded on November 08,2022 at 02:41:22 UTC from IEEE Xplore. Restrictions apply.
2670 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 69, NO. 7, JULY 2022
the extensibility of the RISC-V Instruction Set Architecture (ISA) and the availability of open-source tools are recurrently explored in the development of NTT accelerators [12], [16], [17]. Nonetheless, less attention has been paid to the research of RISC-V ISA extensions for FHE. Previous work in the literature focused either on custom accelerators [29], or on co-design with ARM [21], [24], MicroBlaze [28], or general purpose processors [19], [20], [27], [30]. Moreover, the extensions of the RISC-V [12], [16], [17] for PQC concentrated mainly on the development of acceleration for light microcontroller-like cores. However, an important result from the literature [31] shows that memory-bound algorithms profit more from stronger cores, which is the case of the NTT when the computationally intensive parts are hardware accelerated.

Therefore, this paper starts by proposing design methods and architectures for high-performance and low-cost acceleration of the NTT in custom hardware, either on FPGA or ASIC. Later on, the arithmetic units are adapted into a MALU, which is integrated into the Linux-ready, six-stage pipeline CVA6 [32] RISC-V core. This paper presents the following contributions.
• A search algorithm to find prime moduli that minimize the hardware resources of the NTT in the context of FHE and Residue Number System (RNS) (through the CRT).
• A generic butterfly architecture for the NTT that leverages the bounded Hamming weight of the prime moduli for both ASIC and FPGA.
• A parametric custom accelerator for a Xilinx Ultrascale+ FPGA that uses the butterfly.
• A highly-efficient MALU supporting five different modular operations, including two different types of butterfly.
• The RISC-V ISA and compiler toolchain extended with a new register set and specific instructions for acceleration of NTTs.
• Extensive experimental results and comparisons with the related state of the art conducted on an FPGA and for a 28nm ASIC technology.

The remainder of this paper is organized as follows. Section II reviews in detail the background and motivation for hardware acceleration of the NTTs in the context of FHE. Section III presents the proposed procedures to find complexity-bounded prime moduli sets. In Section IV, hardware architectures and RISC-V extensions are proposed. Section V evaluates and compares the architectures to the state of the art. Finally, the last section concludes the paper and lays down future research possibilities.

II. BACKGROUND AND MOTIVATION

Number Theoretic Transforms (NTTs) have the same algorithmic structure as Fast Fourier Transforms (FFTs), where complex roots are represented by their equivalents in the integers modulo some prime $p$ [33]. Similar to the FFT, Decimation in Time (DIT) [34] or Decimation in Frequency (DIF) [35] algorithms can be applied to compute the NTT. However, this requires implementing the arithmetic structures in a ring, where modular addition and multiplication are more involved.

Algorithm 1 Iterative DIF NTT Algorithm With Gentleman-Sande Butterfly [38].
Input: Polynomial $x = (x_0, \ldots, x_{n-1})$, primitive $n$-th root of unity $\omega_n$.
Output: $\hat{x} = \mathrm{NTT}(x)$.
1: for $s = \log_2 n$ to $1$ by $-1$ do
2:   $m = 2^s$
3:   $W_m = \omega_n^{n/m} \pmod{p}$
4:   for $k = 0$ to $n - 1$ by $m$ do
5:     $W = 1$
6:     for $j = 0$ to $m/2 - 1$ do    ▷ Butterfly loop.
7:       $X = x[k + j]$
8:       $Y = x[k + j + m/2]$
9:       $X' = X + Y \pmod{p}$
10:      $Y' = W (X - Y) \pmod{p}$
11:      $x[k + j] = X'$
12:      $x[k + j + m/2] = Y'$
13:      $W = W \cdot W_m \pmod{p}$
14:    end for
15:  end for
16: end for
17: return bit-reversed($x$)    ▷ $x$ is in bit-reversed order.

Polynomial multiplication is one of the most time-demanding arithmetic operations in FHE [36]. Nevertheless, polynomial multiplication can be improved over the naive quadratic complexity by using a divide-and-conquer approach formulated through the NTT. This is very important in lattice-based implementations since it reduces the time complexity from $O(n^2)$ to $O(n \log n)$, where $n$ is the polynomial order [33].

The DIF formulation of the NTT is presented in Algorithm 1, where the arithmetic is carried out on the residue classes modulo the prime $p$. Due to the DIF and the order of execution of the algorithm, the inputs, $x = (x_0, \ldots, x_{n-1})$, are fed in order while the outputs, $\hat{x} = \mathrm{NTT}(x)$, are obtained in bit-reversed order [37].

Lines 9 and 10 of Algorithm 1 contain the modular arithmetic of the Gentleman-Sande butterfly [35], involving modular addition, subtraction, and multiplication. The input $\omega_n$ is an integer primitive $n$-th root of unity, such that $\omega_n^n \equiv 1 \pmod{p}$ and $\omega_n^i \not\equiv 1 \pmod{p}$ for $i \in [1, n - 1]$. Additionally, to support NTTs up to $n$, the prime modulus $p$ has to satisfy $p \equiv 1 \pmod{n}$. For polynomial multiplication it is possible to simplify the arithmetic operations using the negatively-wrapped convolution by enforcing $p \equiv 1 \pmod{2n}$, which eliminates the requirements for some modular and polynomial reductions. When the latter condition is not met, one can still use the NTT for polynomial multiplication by zero-padding and further reducing modulus $p$.

Algorithm 1 also allows the computation of the Inverse Number Theoretic Transform (INTT) by replacing the $\omega_n$ factors by $\omega_n^{-1}$, such that $\omega_n^{-1} \times \omega_n \equiv 1 \pmod{p}$. Another possible algorithmic arrangement for polynomial multiplication is to use both DIF and DIT NTTs, to avoid the reordering operation. Since the DIT accepts the inputs in bit-reversed order it is
PALUDO AND SOUSA: NTT ARCHITECTURE FOR LINUX-READY RISC-V FULLY-HOMOMORPHIC ENCRYPTION ACCELERATOR 2671
possible to feed the outputs from the DIF algorithm without altering the order. One has to first multiply pointwise the transformed coefficients in the NTT domain to actually implement polynomial multiplication. The choice of $p$ is crucial to obtain efficient NTT implementations, since it must provide simple modular operations, including modular multiplication and reduction, while satisfying the NTT restrictions.

The high complexity of FHE schemes has led to the development of several hardware accelerators for the NTT [18]–[29]. A recurring approach is to target FPGAs, where constants (i.e., $\omega_n$ and $\omega_n^{-1}$) can be stored in Block RAMs (BRAMs), while a single butterfly or several parallel butterflies are used to implement Algorithm 1. When using more than one execution unit one must avoid memory hazards, which can happen due to the in-place nature of Algorithm 1. This is a challenge in such designs, requiring a network of permutations to connect the different Processing Elements (PEs) (butterflies) to the BRAMs. To address this challenge, some accelerators have been proposed based on HW/SW architectures [16], [21], [24], providing frameworks with a higher degree of flexibility and hardware reusability. In a flexible HW/SW design environment, control of the custom hardware can be more easily approached by software, while still offering significant performance gains in comparison with pure software implementations. Nonetheless, one might argue that in the context of higher performance, ASICs should be considered. Previously, HW/SW ASIC-based accelerators for PQC have been designed [12], [16], [17]. But for FHE there is a lack of approaches, namely extensions for RISC-V, which we present as one of the contributions of this work.

Besides the difficulties in the control component, the arithmetic operations are quite cumbersome in Algorithm 1. Lines 9 and 10 implement the butterfly arithmetic involving modular multiplication, addition and subtraction. There have been several efforts to improve modular multiplication in the literature [12], [19], [20], [23], [39]. Commonly adopted techniques are based on Barrett or Montgomery multiplication and reduction. Barrett reduction lifts the operations to a simpler modulus, typically a power of two, but it might require some successive subtractions to obtain the correct result reduced modulo the prime $p$ [18], [22], [27], [28]. Montgomery multiplication and reduction also transfer the operations to a friendlier power-of-two modulus, providing better architectures for hardware realization [19], [20], [23], [39].

Additional improvements can be introduced into the butterfly computation by deferring some modular reductions to the end of the algorithm execution. A well-known approach was presented by Harvey [40] and is replicated in Algorithm 2 for completeness. In Harvey's butterfly the arithmetic is transferred to a power-of-two Montgomery domain based on modulus $\beta$, and the intermediate values are allowed to grow up to at most $2p - 1$, to reduce the total number of costly modular arithmetic reductions.

Algorithm 2 Harvey's Montgomery Butterfly [40].
Input:
  $p$ odd and $\beta = 2^{\lceil \log_2 p \rceil + 2}$
  $0 < W < p$, $W' = \beta W \pmod{p}$, $J = p^{-1} \pmod{\beta}$
  $0 \le X < 2p$ and $0 \le Y < 2p$
Output:
  $X' = X + Y \pmod{p}$, $0 \le X' < 2p$
  $Y' = W (X - Y) \pmod{p}$, $0 \le Y' < 2p$
1: $X' = X + Y$
2: if $X' \ge 2p$ then
3:   $X' = X' - 2p$
4: end if
5: $T = X - Y + 2p$
6: $R_1 \beta + R_0 = W' T$
7: $Q = R_0 J \pmod{\beta}$    ▷ Multiplication by constant.
8: $H = \lfloor Q p / \beta \rfloor$    ▷ Multiplication by constant.
9: $Y' = R_1 - H + p$
10: return $X'$, $Y'$

On top of the deferred reductions, restricting the prime choice to the form $p = k \cdot 2^m + 1$ with odd $k < 2^m$ and $m \ge 1$ [41] introduces further simplifications in Harvey's butterfly. Some of the current approaches apply an iterative reduction [19], [20], [23] or require extra storage [21], [39] for constants. The PQC algorithms Kyber [42] and Dilithium [43] are good practical examples of restrictions imposed on the prime choice. Moreover, when the value of $k$ is kept low, hardware architectures can be simplified [39]. However, for FHE, the more practical and efficient realizations are based on RNSs, which require a set of prime moduli [44], [45]. In RNSs, large integer numbers are decomposed into a set of smaller ones, such that arithmetic can be parallelized, decreasing the calculation time. Therefore, one might eventually need to select primes with higher hardware requirements, due to the usage of several moduli in RNSs and the unrestricted upper-bounded approaches in the literature [19]–[21], [23], [39]. Since in RNS the channels are computed in parallel, the critical path will be given by the slowest one.

The arithmetic operations related to the constants $J$ and $p$ directly influence the time complexity of Algorithm 2. The complexity of multiplication by a constant is widely known to be heavily dependent on the Hamming weight of the constant [46]. Therefore, selecting primes with a given upper bound on the Hamming weight could potentially minimize the hardware cost of Algorithm 2, while keeping the number of bits in the RNS channels for FHE relatively close to each other. In the next sections, a procedure to find such primes is presented for improving the overall performance of NTT accelerators. Later, an adaptation of the proposed butterfly is used to extend the datapath of a RISC-V processor, offering more flexibility for computing the NTT algorithm.

III. PRIME SELECTION

The search for primes with similar characteristics has been extensively studied in the literature. Some earlier theoretical works studied their statistical distributions regarding the number of binary digits [47], without focusing on the applications of such results. The recent construction of lattice-based FHE schemes, and their implementations relying on the RNS, has increased the interest in finding primes that offer lower cost
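To make the interplay between Algorithm 1 and Algorithm 2 of Section II concrete, the following is a minimal Python sketch, not the authors' hardware design. It runs the DIF loop structure of Algorithm 1 with the lazy Montgomery butterfly of Algorithm 2 and provides a direct $O(n^2)$ evaluation for checking. Several points are assumptions made for illustration: the function names are hypothetical, $\beta$ is read as $2^{\lceil \log_2 p \rceil + 2}$, the twiddle update on Line 13 is taken as $W = W \cdot W_m$, and the toy parameters ($p = 257$, $n = 8$, generator 3) are chosen only because they are easy to check.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def harvey_butterfly(X, Y, W_mont, p, J, beta):
    """Lazy Gentleman-Sande butterfly following the steps of Algorithm 2.

    X, Y lie in [0, 2p); W_mont = beta * W mod p; J = p^-1 mod beta.
    Returns (X', Y') in [0, 2p) with X' = X + Y and Y' = W (X - Y) mod p.
    """
    X_out = X + Y
    if X_out >= 2 * p:
        X_out -= 2 * p
    T = X - Y + 2 * p                    # 0 < T < 4p
    R1, R0 = divmod(W_mont * T, beta)    # R1 * beta + R0 = W' * T
    Q = (R0 * J) % beta                  # multiplication by the constant J
    H = (Q * p) // beta                  # multiplication by the constant p
    Y_out = R1 - H + p
    return X_out, Y_out


def ntt_dif_harvey(x, omega, p):
    """DIF NTT with the loop structure of Algorithm 1.

    Output is in bit-reversed order, fully reduced to [0, p) at the end.
    """
    n = len(x)
    log_n = n.bit_length() - 1
    beta = 1 << ((p - 1).bit_length() + 2)   # assumed 2^(ceil(log2 p) + 2)
    J = pow(p, -1, beta)                     # needs Python 3.8+
    x = list(x)
    for s in range(log_n, 0, -1):
        m = 1 << s
        W_m = pow(omega, n // m, p)
        for k in range(0, n, m):
            W = 1
            for j in range(m // 2):
                W_mont = (beta * W) % p      # twiddle in Montgomery form
                a, b = k + j, k + j + m // 2
                x[a], x[b] = harvey_butterfly(x[a], x[b], W_mont, p, J, beta)
                W = (W * W_m) % p
    return [v % p for v in x]                # single deferred final reduction


def ntt_naive(x, omega, p):
    """Direct O(n^2) evaluation, used only as a reference for checking."""
    n = len(x)
    return [sum(x[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]


if __name__ == "__main__":
    p, n = 257, 8
    omega = pow(3, (p - 1) // n, p)          # 3 generates the units mod 257
    out = ntt_dif_harvey(list(range(1, n + 1)), omega, p)
    ref = ntt_naive(list(range(1, n + 1)), omega, p)
    print(all(out[bit_reverse(i, 3)] == ref[i] for i in range(n)))
```

In a real accelerator the Montgomery-form twiddles would presumably be precomputed and stored rather than derived inside the loop; they are computed on the fly here only to keep the sketch short.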
TABLE I
RELATIVE FREQUENCY OF THE TOTAL NUMBER OF PRIMES FOR VARIOUS INPUT VALUES IN ALGORITHM 3

TABLE II
FHE PARAMETERS SPECIFICATIONS ACCORDING TO THE SEAL [9] LIBRARY, THE FHE STANDARD CONSIDERING CLASSICAL SECURITY [50], AND THE RNS VERSION OF THE BRAKERSKI/FAN-VERCAUTEREN (BFV) SCHEME [44]

Algorithm 3 Search for Primes With at Most 4 Bits in the Canonical Signed Digit (CSD) Representation Adapted From [18], [48], [49].
Input:
  $b$, $b_0$ and $n_{\max}$
Output:
  plist
1: $\lambda = 2^{b + b_0}$, $k = 1$, plist = [ ]
2: do
3:   $c = \lambda / 2^{b_0} - k \cdot 2 n_{\max} + 1$
4:   if isPrime($c$) and $c > (1 + 1/2^{3 b_0}) \cdot \lambda / (2^{b_0} + 1)$ then
5:     if sum(abs(csd($c$))) == 4 then
6:       plist = plist ∪ {$c$}    ▷ List Append
7:     end if
8:   end if
9:   $k = k + 1$
10: while $c > (1 + 1/2^{3 b_0}) \cdot \lambda / (2^{b_0} + 1)$
11: return plist
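The search in Algorithm 3 can be sketched in a few lines of Python. This is an illustrative reading of the pseudocode rather than the authors' implementation: the names `is_prime`, `csd_weight` and `search_primes` are hypothetical, the step in Line 3 is assumed to be $2 n_{\max}$, the CSD weight is computed through the non-adjacent form, and the bound in Lines 4 and 10 is compared with integer cross-multiplication to avoid floating-point rounding.

```python
def is_prime(n: int) -> bool:
    """Miller-Rabin with the first 12 prime bases (deterministic below 3.3e24)."""
    small = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    if n < 2:
        return False
    for q in small:
        if n % q == 0:
            return n == q
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in small:
        y = pow(a, d, n)
        if y in (1, n - 1):
            continue
        for _ in range(r - 1):
            y = y * y % n
            if y == n - 1:
                break
        else:
            return False
    return True


def csd_weight(c: int) -> int:
    """Number of non-zero digits in the CSD (non-adjacent form) of c > 0."""
    weight = 0
    while c:
        if c & 1:
            c -= 2 - (c & 3)   # digit +1 when c % 4 == 1, digit -1 when c % 4 == 3
            weight += 1
        c >>= 1
    return weight


def search_primes(b: int, b0: int, n_max: int) -> list:
    """Collect primes c = 2^b - k * 2 * n_max + 1 with CSD weight exactly 4."""
    lam = 1 << (b + b0)
    # c > (1 + 1/2^(3 b0)) * lam / (2^b0 + 1)  <=>  c * den > num
    num = ((1 << (3 * b0)) + 1) * lam
    den = (1 << (3 * b0)) * ((1 << b0) + 1)
    plist, k = [], 1
    while True:                              # emulates the do ... while loop
        c = (lam >> b0) - k * 2 * n_max + 1
        above_bound = c * den > num
        if above_bound and is_prime(c) and csd_weight(c) == 4:
            plist.append(c)
        k += 1
        if not above_bound:
            break
    return plist


if __name__ == "__main__":
    # Toy parameters: 8-bit primes, b0 = 2, NTT sizes up to n_max = 4.
    print(search_primes(8, 2, 4))
```

For these toy parameters the candidates are 249, 241, 233, 225, 217 and 209; only 241 and 233 are prime, and only 233 = 256 - 32 + 8 + 1 has four non-zero CSD digits, so it is the single value returned.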
TABLE III
DESCRIPTION OF THE CUSTOM INSTRUCTIONS ENCODINGS
TABLE V
NTT EXPERIMENTAL RESULTS FROM SYNTHESIS AND PLACE AND ROUTE TARGETING A XILINX XCVU37P-L2 DEVICE

TABLE VI
NUMBER OF CLOCK CYCLES (CC) TO PERFORM THE NTT/INTT WITH DIFFERENT PARAMETERS ON TWO PARALLEL PROCESSING ELEMENTS
used in the context of FHE, due to the low number of bits in the butterfly arithmetic. Additionally, the target class of RISC-V core is generally towards smaller 32-bit low-power implementations. Thus, we avoid direct comparisons in terms of performance due to the different design objectives. Nevertheless, one can quantitatively compare the butterfly and MALU, being aware of the difference in the datapath and number of bits. For comparison, the tightly-coupled accelerators for PQC presented in [17] also provide similar operations when the MALU is compared separately. A resynthesis of the accelerators presented in [17] shows that their proposed MALU takes approximately the same area (60 kGE). Although in [17] the MALU supports two butterflies in parallel, it operates on only 14 bits, which evens out the area. Based on area, the approach in [17] is quite similar to ours; however, a closer look at the critical path reveals that [17] slows the processor down quite significantly. Since no internal pipeline is adopted, the maximum operating frequency is bounded by the MALU at around 300 MHz (in 28nm), which is half of that of our proposed extended RISC-V. Moreover, extending the number of bits in [17] would make the case even worse, emphasizing the advantages of the proposed solution when FHE is considered.

In [16], the authors proposed a vector coprocessor for the NTT/INTT that takes 942 kGE in a similar 28nm technology. For comparison, the whole 64-bit Linux-ready CVA6 core (with an FPU) takes only 14% more area (1075 kGE vs. 942 kGE). Because the main objective in [16] is acceleration of PQC algorithms, their butterfly unit operates on fixed 16-bit words, but it is parametric on the number of bits and the authors have made their code available. Thus, we have resynthesized the butterfly architecture with a word length of 28 bits for comparison. The results show that it requires more than three times the area (195 kGE), while the maximum operating frequency is around 225 MHz, less than half of that of our proposed extended RISC-V core. Clearly, with no internal pipeline and three RCAs and two multipliers in the critical path, increasing the number of bits will greatly degrade the performance of [16].

A different acceleration technique in the context of RISC-V and PQC was presented in [64]. Karabulut et al. proposed a dynamic instruction scheduling implementation that interfaces with the fetch, execute and write-back pipeline stages of the RISC-V, being able to detect and improve the execution schedule of NTT/INTT computations. No additional arithmetic units are added and the extensions are hardware centered, demanding no change in the compiler infrastructure. In [64], the NTT/INTT extensions increase the area (LUTs) of the core by around 40%, but significantly impact the performance. The internal loop of the NTT related to the arithmetic of the butterfly takes 6 clock cycles to complete [64], which is the same as a single execution of the GS or CT butterfly in the proposed MALU. Therefore, one can conclude that in terms of clock cycles our approach provides similar results, as long as the execution of the translated software running the instructions in Table III is guaranteed. Unfortunately, the parameters in terms of cache sizes for the 64-bit implementation in [64] were not disclosed. Our proposed architecture improves the number of clock cycles by approximately 57%, 58%, and 66% for 512, 1024, and 2048, respectively, in comparison to [64] for the 32-bit RISC-V. Additionally, since the results in [64] depend on the prime number, for a given polynomial degree there are significant variations in the number of clock cycles. Such a shortcoming would not be a problem in the proposed extended CVA6 core, since it implements the modular arithmetic in hardware. However, the MALU presented in Fig. 5 only supports a single modulus. In the context of FHE, the extended CVA6 core provides a better platform for speeding up the computation, since it enables more arithmetic-intensive operations, such as the CRT, by reusing the arithmetic units and making it easier to develop algorithms assisted by software.

Previous work is mainly focused on microcontroller-like RISC-V cores with limited or no support for fully-featured Operating Systems (OSs). Here, we present an FHE accelerator that supports a full-blown Linux OS. As originally reported in [32], the support for application-class processing implies some extra cost in energy efficiency, but comes with many benefits, such as easier programmability, memory management, program isolation, additional libraries and drivers, and user space programming standardization. Besides, it has been shown that memory-bound applications, such as the NTT, benefit more from acceleration on a powerful core than on several weaker ones [31].

Aside from the complexity and performance discussion, it is important to consider the implications of the extended RISC-V core and ISA from a security perspective. The availability of a full-fledged OS already provides some additional level of isolation and security, in comparison with previous work [16], [17]. Nevertheless, one might consider the scenario where the accelerator would be applied. On the server side, if always operating on encrypted data, based on the principles of FHE, some security features might not be required. On the other hand, when used in the context of the client, all security features might be of utmost importance. Although in this work we have not addressed these issues, security has to be taken into consideration at the hardware and software level, which is left as future work.

VI. CONCLUSION

Novel hardware architectures for acceleration of the NTTs on FPGA and ASIC have been presented. The problem of finding efficient primes for constructing moduli sets for RNS-based FHE schemes was tackled. A constrained algorithm for finding primes that generate upper-bounded butterfly architectures for the NTT was proposed. Furthermore, versatile specialized hardware taking advantage of such primes was developed. When compared to the state of the art in FPGAs, the architectures speed up the computations by a factor of 1.9, while showing reductions of 30%, 90% and 42% in the number of LUTs and registers, BRAMs and DSPs. For ASIC, the results at 1 GHz reveal an average decrease of 45% and 52% in circuit area and power consumption, respectively, compared to the state of the art. The RISC-V extension with the MALU imposed almost no penalty in the operating frequency,
while adding only 10% on the overall circuit area. Moreover, the proposed extended ISA offers versatility, performance, and easier programmability based on the extension of the capabilities of the CVA6 Linux-ready RISC-V core. Future work will focus on the addition of more execution units, ISA extensions, security analysis, and development of the remaining polynomial arithmetic typically used in the FHE schemes.

REFERENCES

[1] R. L. Rivest, L. Adleman, and M. L. Dertouzos, “On data banks and privacy homomorphisms,” Found. Secure Comput., vol. 4, no. 11, pp. 169–180, 1978.
[2] C. Gentry, “Fully homomorphic encryption using ideal lattices,” in Proc. 41st Annu. ACM Symp. Theory Comput. (STOC), 2009, pp. 169–178.
[3] A. Viand, P. Jattke, and A. Hithnawi, “SoK: Fully homomorphic encryption compilers,” in Proc. IEEE Symp. Secur. Privacy (SP), May 2021, pp. 1092–1108.
[4] L. Ducas and D. Micciancio, “FHEW: Bootstrapping homomorphic encryption in less than a second,” in Advances in Cryptology—EUROCRYPT. Berlin, Germany: Springer, 2015, pp. 617–640, doi: 10.1007/978-3-662-46800-5_24.
[5] J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption for arithmetic of approximate numbers,” in Advances in Cryptology—ASIACRYPT. Cham, Switzerland: Springer, 2017, pp. 409–437, doi: 10.1007/978-3-319-70694-8_15.
[6] P. Martins, L. Sousa, and A. Mariano, “A survey on fully homomorphic encryption: An engineering perspective,” ACM Comput. Surv., vol. 50, no. 6, pp. 1–33, Nov. 2018, doi: 10.1145/3124441.
[7] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, “TFHE: Fast fully homomorphic encryption over the torus,” J. Cryptol., vol. 33, no. 1, pp. 34–91, 2018, doi: 10.1007/s00145-019-09319-x.
[8] (Aug. 2021). PALISADE Lattice Cryptography Library. [Online]. Available: https://palisade-crypto.org/
[9] (Nov. 2020). Microsoft SEAL. Microsoft Research, Redmond, WA, USA. [Online]. Available: https://github.com/Microsoft/SEAL
[10] National Institute of Standards and Technology (NIST). (2020). PQC Standardization Process: Third Round Candidate Announcement. [Online]. Available: https://csrc.nist.gov/News/2020/pqc-third-round-candidate-announcement
[11] P.-C. Kuo et al., “High performance post-quantum key exchange on FPGAs,” J. Inf. Sci. Eng., vol. 37, no. 5, pp. 1211–1229, Sep. 2021, doi: 10.6688/JISE.202109_37(5).0015.
[12] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, “Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols,” IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2019, no. 4, pp. 17–61, Aug. 2019, doi: 10.13154/tches.v2019.i4.17-6.
[13] T. Fritzmann and J. Sepúlveda, “Efficient and flexible low-power NTT for lattice-based cryptography,” in Proc. IEEE Int. Symp. Hardw. Oriented Secur. Trust (HOST), May 2019, pp. 141–150, doi: 10.1109/HST.2019.8741027.
[14] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, “Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT,” IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2020, no. 2, pp. 49–72, Mar. 2020, doi: 10.46586/tches.v2020.i2.49-72.
[15] Y. Xing and S. Li, “An efficient implementation of the NewHope key exchange on FPGAs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 3, pp. 866–878, Mar. 2020, doi: 10.1109/TCSI.2019.2956651.
[16] G. Xin et al., “VPQC: A domain-specific vector processor for post-quantum cryptography based on RISC-V architecture,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 8, pp. 2672–2684, Aug. 2020, doi: 10.1109/TCSI.2020.2983185.
[17] T. Fritzmann, G. Sigl, and J. Sepúlveda, “RISQ-V: Tightly coupled RISC-V accelerators for post-quantum cryptography,” IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2020, no. 4, pp. 239–280, Aug. 2020, doi: 10.46586/tches.v2020.i4.239-280.
[18] J. Cathebras, “Hardware acceleration for homomorphic encryption,” Ph.D. dissertation, Dept. Sci. Technol. Inf. Commun., Paris-Sud Univ., Bures-sur-Yvette, France, 2018.
[19] A. C. Mert, E. Ozturk, and E. Savas, “Design and implementation of encryption/decryption architectures for BFV homomorphic encryption scheme,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 2, pp. 353–362, Feb. 2020, doi: 10.1109/TVLSI.2019.2943127.
[20] A. C. Mert, E. Öztürk, and E. Savaş, “Design and implementation of a fast and scalable NTT-based polynomial multiplier architectures,” in Proc. 22nd Euromicro Conf. Digit. Syst. Des. (DSD), 2019, pp. 253–260, doi: 10.1109/DSD.2019.00045.
[21] S. S. Roy, F. Turan, K. Jarvinen, F. Vercauteren, and I. Verbauwhede, “FPGA-based high-performance parallel architecture for homomorphic computing on encrypted data,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2019, pp. 387–398, doi: 10.1109/HPCA.2019.00052.
[22] J. Cathébras, A. Carbon, P. Milder, R. Sirdey, and N. Ventroux, “Data flow oriented hardware design of RNS-based polynomial multiplication for SHE acceleration,” IACR Trans. CHES, vol. 2018, no. 3, pp. 69–88, Aug. 2018. [Online]. Available: https://tches.iacr.org/index.php/TCHES/article/view/7293
[23] A. C. Mert, E. Karabulut, E. Ozturk, E. Savas, and A. Aysu, “An extensive study of flexible design methods for the number theoretic transform,” IEEE Trans. Comput., early access, Aug. 19, 2020, doi: 10.1109/TC.2020.3017930.
[24] F. Turan, S. S. Roy, and I. Verbauwhede, “HEAWS: An accelerator for homomorphic encryption on the Amazon AWS FPGA,” IEEE Trans. Comput., vol. 69, no. 8, pp. 1185–1196, Aug. 2020, doi: 10.1109/TC.2020.2988765.
[25] W. Tan, B. M. Case, A. Wang, S. Gao, and Y. Lao, “High-speed modular multiplier for lattice-based cryptosystems,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 8, pp. 2927–2931, Aug. 2021.
[26] X. Hu, M. Li, J. Tian, and Z. Wang, “DARM: A low-complexity and fast modular multiplier for lattice-based cryptography,” in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2021, pp. 175–178.
[27] S. S. Roy, K. Jarvinen, J. Vliegen, F. Vercauteren, and I. Verbauwhede, “HEPCloud: An FPGA-based multicore processor for FV somewhat homomorphic function evaluation,” IEEE Trans. Comput., vol. 67, no. 11, pp. 1637–1650, Nov. 2018, doi: 10.1109/TC.2018.2816640.
[28] D. B. Cousins, K. Rohloff, and D. Sumorok, “Designing an FPGA-accelerated homomorphic encryption co-processor,” IEEE Trans. Emerg. Topics Comput., vol. 5, no. 2, pp. 193–206, Oct. 2017, doi: 10.1109/TETC.2016.2619669.
[29] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, “HEAX: An architecture for computing on encrypted data,” in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst. New York, NY, USA: ACM, 2020, pp. 1295–1309, doi: 10.1145/3373376.3378523.
[30] V. Migliore et al., “A high-speed accelerator for homomorphic encryption using the Karatsuba algorithm,” ACM Trans. Embedded Comput. Syst., vol. 16, no. 5, pp. 1–17, Oct. 2017, doi: 10.1145/3126558.
[31] X. Liang, M. Nguyen, and H. Che, “Wimpy or brawny cores: A throughput perspective,” J. Parallel Distrib. Comput., vol. 73, no. 10, pp. 1351–1361, 2013, doi: 10.1016/j.jpdc.2013.06.001.
[32] F. Zaruba and L. Benini, “The cost of application-class processing: Energy and performance analysis of a Linux-ready 1.7-GHz 64-bit RISC-V core in 22-nm FDSOI technology,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 11, pp. 2629–2640, Nov. 2019.
[33] E. O. Brigham, The Fast Fourier Transform and its Applications. Upper Saddle River, NJ, USA: Prentice-Hall, 1988.
[34] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Math. Comput., vol. 19, no. 90, pp. 297–301, 1965, doi: 10.1090/S0025-5718-1965-0178586-1.
[35] W. M. Gentleman and G. Sande, “Fast Fourier transforms: For fun and profit,” in Proc. Fall Joint Comput. Conf. New York, NY, USA: ACM, 1966, pp. 563–578, doi: 10.1145/1464291.1464352.
[36] The Alan Turing Institute. (2021). SHEEP Homomorphic Encryption Evaluation Platform. [Online]. Available: https://github.com/alan-turing-institute/SHEEP
[37] E. Chu and A. George, Inside the FFT Black Box. Boca Raton, FL, USA: CRC Press, Nov. 1999, doi: 10.1201/9781420049961.
[38] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA, USA: MIT Press, 2009.
[39] P. Longa and M. Naehrig, “Speeding up the number theoretic transform for faster ideal lattice-based cryptography,” in Proc. Int. Conf. Cryptol. Netw. Secur. Cham, Switzerland: Springer, 2016, pp. 124–139.
[40] D. Harvey, “Faster arithmetic for number-theoretic transforms,” J. Symbolic Comput., vol. 60, pp. 113–119, Jan. 2014, doi: 10.1016/j.jsc.2013.09.002.
[41] Proth Primes. (2003). The Online Encyclopedia of Integer Sequences—Proth Primes. [Online]. Available: https://oeis.org/A080076
[42] R. Avanzi et al., “CRYSTALS-Kyber algorithm specifications and supporting documentation,” NIST PQC Round, vol. 2, no. 4, pp. 1–43, 2021.
2682 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 69, NO. 7, JULY 2022
[43] S. Bai et al., “CRYSTALS-Dilithium algorithm specifications and supporting documentation,” NIST Post-Quantum Cryptogr. Standardization Round, vol. 3, pp. 1–38, Feb. 2021.
[44] J.-C. Bajard, J. Eynard, M. A. Hasan, and V. Zucca, “A full RNS variant of FV like somewhat homomorphic encryption schemes,” in Selected Areas in Cryptography—SAC (Lecture Notes in Computer Science). Cham, Switzerland: Springer, 2017, pp. 423–442, doi: 10.1007/978-3-319-69453-5_23.
[45] A. Al Badawi, Y. Polyakov, K. M. M. Aung, B. Veeravalli, and K. Rohloff, “Implementation and performance evaluation of RNS variants of the BFV homomorphic encryption scheme,” IEEE Trans. Emerg. Topics Comput., vol. 9, no. 2, pp. 941–956, Apr. 2021, doi: 10.1109/TETC.2019.2902799.
[46] V. Dimitrov, L. Imbert, and A. Zakaluzny, “Multiplication by a constant is sublinear,” in Proc. 18th IEEE Symp. Comput. Arithmetic (ARITH), Jun. 2007, pp. 261–268, doi: 10.1109/ARITH.2007.24.
[47] S. S. Wagstaff, “Prime numbers with a fixed number of one bits or zero bits in their binary representation,” Experim. Math., vol. 10, no. 2, pp. 267–273, Jan. 2001.
[48] R. Paludo and L. Sousa, “Number theoretic transform architecture suitable to lattice-based fully-homomorphic encryption,” in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2021, pp. 163–170.
[49] C. Aguilar-Melchor, J. Barrier, S. Guelton, A. Guinet, M.-O. Killijian, and T. Lepoint, “NFLLIB: NTT-based fast lattice library,” in Proc. Cryptographers’ Track RSA Conf. Cham, Switzerland: Springer, 2016, pp. 341–356.
[50] M. Albrecht et al., “Homomorphic encryption security standard,” HomomorphicEncryption.org, Toronto, ON, Canada, Tech. Rep. v1.1, Nov. 2018.
[51] Vivado Design Suite User Guide (UG901), Xilinx, San Jose, CA, USA, 2016.
[52] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, “High-speed NTT-based polynomial multiplication accelerator for CRYSTALS-Kyber post-quantum cryptography,” Cryptol. ePrint Arch., Tech. Rep., 2021/563, 2021. [Online]. Available: https://ia.cr/2021/563
[53] X. Xiao, E. Oruklu, and J. Saniie, “An efficient FFT engine with reduced addressing logic,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 11, pp. 1149–1153, Nov. 2008, doi: 10.1109/TCSII.2008.2004540.
[54] S. Agarwal, S. R. Ahamed, A. K. Gogoi, and G. Trivedi, “A low-complexity shifting-based conflict-free memory-addressing architecture for higher-radix FFT,” IEEE Access, vol. 9, pp. 140349–140357, 2021, doi: 10.1109/ACCESS.2021.3119598.
[55] Berkeley Logic Synthesis and Verification Group. (2021). ABC: A System for Sequential Synthesis and Verification. [Online]. Available: https://github.com/berkeley-abc/abc
[56] T. Fritzmann, G. Sigl, and J. Sepúlveda, “Extending the RISC-V instruction set for hardware acceleration of the post-quantum scheme LAC,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2020, pp. 1420–1425, doi: 10.23919/DATE48585.2020.9116567.
[57] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, “ISA extensions for finite field arithmetic: Accelerating Kyber and NewHope on RISC-V,” IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2020, pp. 219–242, Jun. 2020, doi: 10.46586/tches.v2020.i3.219-242.
[58] M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” Proc. IEEE, vol. 93, no. 2, pp. 216–231, Feb. 2005.
[59] (Sep. 2021). RISC-V GNU Toolchain. [Online]. Available: https://github.com/riscv-collab/riscv-gnu-toolchain
[60] Xilinx. (2021). Virtex UltraScale+ HBM VCU128 FPGA. [Online]. Available: https://www.xilinx.com/products/boards-and-kits/vcu128.html
[61] UMC. (2021). UMC’s 28 nm High-k/Metal Gate Stack HPCU. [Online]. Available: https://www.umc.com/upload/media/05_Press_Center/3_Literatures/Process_Technology/28nm_Brochure.pdf
[62] A. C. Mert. (2021). Parametric NTT/INTT Hardware Generator. [Online]. Available: https://github.com/acmert/parametric-ntt
[63] L. de Castro et al., “Does fully homomorphic encryption need compute acceleration?” 2021, arXiv:2112.06396.
[64] E. Karabulut and A. Aysu, “RANTT: A RISC-V architecture extension for the number theoretic transform,” in Proc. 30th Int. Conf. Field-Program. Log. Appl. (FPL), Aug./Sep. 2020, pp. 26–32, doi: 10.1109/FPL50879.2020.00016.

Rogério Paludo (Student Member, IEEE) received the Ph.D. degree in electrical engineering from the Federal University of Santa Catarina, Florianópolis, Brazil, in 2020. He is currently a Post-Doctoral Researcher at the Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Lisbon, Portugal. His research interests include computer arithmetic, VLSI architectures, hardware design, signal processing, and fully-homomorphic encryption.

Leonel Sousa (Senior Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the Instituto Superior Técnico (IST), Universidade de Lisboa (UL), Lisbon, Portugal, in 1996. He is currently a Full Professor with the IST, UL. He is also a Senior Researcher with the Research and Development Instituto de Engenharia de Sistemas e Computadores (INESC-ID). His research interests include computer arithmetic, VLSI architectures, parallel computing, and signal processing. He is a fellow of the IET. He has contributed to more than 200 papers in journals and international conferences, for which he received several awards, including the DASIP13 Best Paper Award, the SAMOS11 Stamatis Vassiliadis Best Paper Award, the DASIP10 Best Poster Award, and the Honorable Mention Award UL/Santander Totta for the quality of the publications in 2021 and 2016. He has contributed to the organization of several international conferences, namely as the program chair and as the general and topic chair, and has given keynotes in some of them. He has edited five special issues of international journals. He is also an Associate Editor of the IEEE Transactions on Computers, IEEE Access, and JRTIP (Springer), and a Senior Editor of the Journal on Emerging and Selected Topics in Circuits and Systems. He is a Distinguished Scientist of ACM.