vector problems (SVP/CVP) in lattices [8], learning with errors (LWE) [9] and its variants [10]–[12], code-based cryptography [13], and isogenies over elliptic curves [14], [15]. Multiple cryptosystems have been proposed based on the aforementioned hard problems, among which lattice-based cryptography is more practical for resource-constrained nodes in IoT due to its relatively fast operations with low complexity. In the recent post-quantum cryptography standardization by the National Institute of Standards and Technology (NIST) [16], most of the proposed schemes in the first round of general submissions are based on lattice hard problems [17]–[19].

In 2009, Regev [9] introduced a new lattice-based cryptosystem that relies on the hardness of the LWE problem. The search version of the problem asks for the secret s given a linear combination a·s + e, where a is known and e is an error drawn from a certain distribution. LWE and its variants have shown resistance to various types of classic and quantum attacks [10]–[12], [20]–[22]. In 2010, a new variant of LWE based on ring theory, called Ring-LWE, was introduced in [10]. It utilizes ideal lattices and has relatively smaller key sizes compared to the original LWE scheme. In 2016, Buchmann et al. [11] proposed a new variant of Ring-LWE, namely Ring-BinLWE, by choosing errors from a binary distribution instead of a Gaussian one.

Many hardware implementations for LWE-based schemes have been proposed [17], [23]–[25]. Recently, a hardware implementation based on Ring-BinLWE [11] has been proposed in [25] that is relatively faster than previous implementations for Ring-LWE schemes.

In this paper, we propose a hardware-optimized variant of Ring-BinLWE, hereafter referred to as InvRBLWE. Our optimization utilizes the inverted ring of Ring-BinLWE and therefore matches the 2’s-complement notation range, which is highly optimized for hardware implementation. In Section III, we justify that operations over the ring require no reduction when ring elements are represented in 2’s-complement notation, and thus all reduction operations are omitted from the InvRBLWE scheme. We propose two architectures for InvRBLWE targeting different IoT devices.
1) High-Speed: A fast and low-complexity architecture, which is a good match for edge and high-performance devices in IoT.
2) Ultralightweight: An ultra-low-power and low-area architecture targeting resource-constrained end-nodes that are powered by batteries or energy harvesting units in IoT.

Regarding implementation results, the high-speed architecture has been implemented on FPGA, which is considered a high-performance platform practical for edge and powerful IoT devices. Our FPGA implementations outperform previous hardware implementations of any LWE-based scheme, improving Area × Time (AT) complexity by at least 52% in encryption/decryption operations. The proposed architectures for InvRBLWE are platform-independent and can also be implemented on ASIC platforms, which are ideal for crypto-processors in IoT devices. The ASIC high-speed implementation results appear to be at least two times faster than previous work. Additionally, the ultralightweight ASIC implementation consumes 62% and 66% less area and power, respectively, compared to the most lightweight implementation of ECC. Furthermore, the proposed architectures are shown to be resistant against simple power analysis (SPA) [26] and timing attacks [27]. The main contributions of this paper are summarized as follows.
1) A variant of the Ring-BinLWE scheme, namely InvRBLWE, has been proposed, which is fully optimized for hardware implementation.
2) The operation cost in the proposed scheme is reduced by omitting all of the required reduction operations.
3) Our high-speed implementation outperforms previous work in terms of speed on both FPGA and ASIC platforms.
4) The high-speed ASIC implementations of LWE-based cryptosystems require significantly lower energy compared to the best ASIC implementations of classic and post-quantum cryptosystems. This architecture best suits battery-based IoT devices.
5) The ultralightweight implementation of InvRBLWE on the ASIC platform consumes less power and area than ECC requires. To the best of our knowledge, this is the first public key implementation to have such low power consumption that it can even be supplied by Vibration Piezo or electromagnetic (EM) [3] energy harvesting units.

The rest of the paper is organized as follows. In Section II, the required background is described. Section IV provides detailed information on the proposed scheme and architectures. We compare our implementation results against previous work on both FPGA and ASIC platforms in Section V. Finally, the paper is concluded in Section VI.

II. PRELIMINARIES

In this section, we describe the required background and notations to follow the rest of this paper. First, we make a few clarifications regarding the mathematical notations used in this paper. Later, this paper presents a brief history of developments in lattice-based cryptosystems. Finally, we describe the Ring-BinLWE scheme [11] in detail.

A. Ring Theory

For an integer q ∈ Z, we define the finite ring Z_q = Z/qZ = {0, 1, . . . , q − 1}. In other words, Z_q is a finite ring, where q is its modulus. For an integer n ∈ Z, the set of all n-dimensional vectors whose components each belong to Z_q is denoted as Z_q^n = {(w_0, w_1, . . . , w_{n−1}) | w_i ∈ Z_q}.

The set of all polynomials whose coefficients belong to Z is denoted as Z[x]. Similarly, we refer to the set of all polynomials that have their coefficients chosen from Z_q as Z_q[x]. Thus, each vector in Z_q^n can be mapped to a unique polynomial of degree at most n − 1 in Z[x]. In addition to simple rings over integers, we can extend ring definitions to polynomials. Therefore, the ring of polynomials R is defined as R = Z[x]/f(x), where f(x) is the modulus of the ring. If all of the coefficients of the polynomials in the ring are chosen from Z_q, we refer to the polynomial ring as R_q = Z_q[x]/f(x).
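As an illustration of arithmetic in the polynomial ring defined above, the following minimal Python sketch multiplies two ring elements. It assumes the modulus f(x) = x^n + 1 (the choice attributed to [11] later in the paper) and represents a polynomial as a coefficient list with the lowest degree first. The function name poly_mul_mod is hypothetical and does not come from the paper.

```python
def poly_mul_mod(a, b, q):
    """Multiply two elements of Z_q[x]/(x^n + 1).

    a and b are coefficient lists of equal length n, lowest degree first.
    The modulus x^n + 1 is an assumption of this sketch.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length")
    result = [0] * n
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            term = a_i * b_j
            if i + j < n:
                result[i + j] = (result[i + j] + term) % q
            else:
                # x^n is congruent to -1, so wrapped terms are subtracted.
                result[i + j - n] = (result[i + j - n] - term) % q
    return result


# x^3 * x = x^4 = -1 in Z_256[x]/(x^4 + 1)
print(poly_mul_mod([0, 0, 0, 1], [0, 1, 0, 0], 256))  # [255, 0, 0, 0]
```

This schoolbook version applies an explicit "% q" after every update, which is the kind of reduction step the proposed scheme aims to avoid in hardware.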
5502 IEEE INTERNET OF THINGS JOURNAL, VOL. 6, NO. 3, JUNE 2019
Detailed information on the Ring-BinLWE scheme operations is provided in Fig. 2. The scheme consists of three main phases: 1) key generation; 2) encryption; and 3) decryption. We describe the operations in each phase in the following.
1) Key Generation: This phase starts with choosing two random binary vectors r1, r2 ∈ {0, 1}^n. These vectors can be mapped to unique polynomials in R_q. In addition, a publicly known polynomial a ∈ R_q is used to calculate the public key p = r1 − a·r2 ∈ R_q. The binary vector r1 is a one-time error and will be discarded afterwards, while r2 is Alice’s private key. In this scheme, the private and public keys consist of n and n × log2(q) bits, respectively.

this scheme is represented in the range of (−q/2, q/2 − 1); notice the difference with the original range of Ring-LWE. In InvRBLWE, the range of coefficients exactly matches the 2’s-complement notation range for a log2(q)-bit integer. The advantage of InvRBLWE is that the reductions over the inverted ring can be handled easily in hardware implementation. We show that all necessary reductions for any modular operation are performed automatically by normal overflow and underflow in 2’s-complement notation according to Lemma 1.

Lemma 1: Assume q = 2^k and k ∈ Z. Moreover, the ring Z_q is represented with 2’s-complement notation. Then, modular addition/subtraction of two members a, b ∈ Z_q does not require any reduction.
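The claim of Lemma 1 can be checked numerically. The sketch below assumes k = 8 (q = 256, the parameter used for the implementations in this paper) and compares the wraparound of a k-bit register with an explicit reduction into the signed range from −q/2 to q/2 − 1. The helper names wrap_twos_complement and centered_mod are hypothetical; this is an illustrative check, not a proof.

```python
def wrap_twos_complement(value, k):
    """Keep the low k bits of value and read them as a signed k-bit integer.

    This models what a k-bit register does on overflow or underflow.
    """
    q = 1 << k
    low = value & (q - 1)
    return low - q if low >= q // 2 else low


def centered_mod(value, q):
    """Explicit reduction of value into the range -q/2 .. q/2 - 1."""
    return (value + q // 2) % q - q // 2


k = 8
q = 1 << k
# Exhaustively compare both views for every pair of signed k-bit operands.
for a in range(-q // 2, q // 2):
    for b in range(-q // 2, q // 2):
        assert wrap_twos_complement(a + b, k) == centered_mod(a + b, q)
        assert wrap_twos_complement(a - b, k) == centered_mod(a - b, q)

print(wrap_twos_complement(127 + 1, k))  # -128
```

For q = 2^k, discarding the carry out of a k-bit adder therefore has the same effect as a separate reduction step.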
EBRAHIMI et al.: POST-QUANTUM CRYPTOPROCESSORS OPTIMIZED FOR EDGE AND RESOURCE-CONSTRAINED DEVICES IN IoT 5503
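The key generation step p = r1 − a·r2 can be sketched in software as follows. This is a minimal illustration under stated assumptions, not a model of the paper's circuits: it assumes q = 256 (8-bit coefficients), the modulus f(x) = x^n + 1, and coefficients stored as 8-bit patterns so that masking to 8 bits stands in for register overflow. It shifts the multiplicand at each step. All function names are hypothetical.

```python
import secrets

K = 8                    # bits per coefficient, i.e. log2(q) for q = 256
MASK = (1 << K) - 1      # keeping K bits plays the role of register overflow


def anti_circular_shift(coeffs):
    """Multiply by x in Z_q[x]/(x^n + 1): rotate once, negate the wrapped term."""
    return [(-coeffs[-1]) & MASK] + coeffs[:-1]


def public_key(a, r1, r2):
    """Compute p = r1 - a*r2 with shift-and-add over the bits of r2."""
    acc = [0] * len(a)
    shifted = list(a)                      # holds a * x^i in iteration i
    for bit in r2:
        # bit is 0 or 1, so multiplying selects or skips the addend
        acc = [(s + bit * t) & MASK for s, t in zip(acc, shifted)]
        shifted = anti_circular_shift(shifted)
    return [(e - m) & MASK for e, m in zip(r1, acc)]


def keygen(a):
    """Draw binary r1 (one-time error) and r2 (private key), return (p, r2)."""
    n = len(a)
    r1 = [secrets.randbelow(2) for _ in range(n)]
    r2 = [secrets.randbelow(2) for _ in range(n)]
    return public_key(a, r1, r2), r2


# Toy example with n = 4: r2 = x, so a*r2 = [-4, 1, 2, 3]
print(public_key([1, 2, 3, 4], [1, 0, 1, 0], [0, 1, 0, 0]))  # [5, 255, 255, 253]
```

The printed values are 8-bit patterns, so 255 and 253 stand for −1 and −3 in 2’s-complement notation. The toy length n = 4 is only for readability.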
Fig. 4. High-speed architecture for decryption phase.

the first one, are performing add function. The multiplication inputs are a ∈ R_q and r2 ∈ {0, 1}^n. Hence, the multiplication can be calculated in n clock cycles using the shift-and-add method, which requires n parallel adders of 8-bit (log2(q)-bit) length as shown in Fig. 4. It is worth mentioning that r2 is loaded into a shift register, which outputs each bit of r2 in a certain clock cycle during the multiplication operation.

Due to the characteristics of the ring R_q, the shift operation is performed using an anti-circular rotation because of the modulus f(x) = x^n + 1 [11]. To implement such an anti-circular rotation in hardware, we simply feed each 8-bit (log2(q)-bit) register to the input of the next adder and finally feed the negative of Res[n-1] (the last 8-bit register) to the first adder. Setting carry_in of the first adder to 1 changes its function to subtraction, which completes the anti-circular rotation. During multiplication, control signal S1 is set to zero and the first adder performs subtraction. The rest of the adders always perform addition and the corresponding carry_in signal is set to zero.

After completion of the multiplication, an additional clock cycle is required to calculate the final addition operation. In this state, the control signal S1 is set to one and all of the adders perform addition. The public key p is available after n + 1 clock cycles. During each clock cycle, the results are stored in n registers of 8-bit length. It is worth mentioning that r2 is a binary vector and each bit of r2 needs to be extended to 8 sequential bits (log2(q) bits). This bit adjustment is performed in hardware with no significant overhead by simple rewiring.

At the end of the decryption phase, the result must be decoded from a polynomial in R_q to a binary vector based on (6). We implement the decode function by an array of n 2-input XOR gates that compare the two most significant bits of each register block, namely Res[7] and Res[6].

B. UltraLightweight Architecture

In order to achieve a lightweight architecture for the InvRBLWE scheme, instead of the straightforward and parallel use of components (i.e., 8-bit adders and registers), we have utilized only one set of such components in a serial manner as shown in Fig. 5. Moreover, in contrast to the high-speed method of performing multiplication, the multiplicand is shifted during each cycle instead of the product. This choice gives us the capability of using only one 8-bit register for the product instead of n × 8-bit registers in the high-speed architecture. Clearly, this architectural modification results in exploiting only one 8-bit adder as well. By decreasing the number of adder and register pairs from n to only one, the required area and corresponding power consumption are decreased. As mentioned earlier, the multiplication requires n rounds to be completed. This architectural modification causes each round of the multiplication to require n clock cycles in the lightweight architecture. Hence, by taking the very final addition into account, the total number of cycles for this architecture to perform decryption is (n + 1) × n.

All of the control signals are very similar to the high-speed architecture, with the difference that a counter to keep track of each state is added. To perform anti-circular rotation in the shift-and-add algorithm over the ring R_q, two additional control signals Crtl_1 and Crtl_2 are added, which decide whether the current operation is an addition or a subtraction.

C. Side-Channel Analysis (SPA and Timing)

The main target for the proposed architectures is IoT devices that mostly have constrained resources. Therefore, no additional countermeasures are used in the proposed architectures. We aim to provide SPA and timing attack resistant architectures according to the principles proposed by [26]. We note that regarding stronger attacks, such as differential power analysis (DPA), complex and resource-consuming countermeasures that are proposed in previous work, such as [25] and [31], can also be applied to our architectures and can be considered for future work.

In the proposed architecture for the InvRBLWE scheme, there are no conditional branches that are taken or not taken according to a secret value. Therefore, the architecture operates independently from input values during each clock cycle, which results in a constant number of clock cycles too. More precisely, regarding the high-speed architecture, the key generation, encryption, and decryption phases require exactly n + 1, 2n + 3, and n + 1 clock cycles, respectively. Moreover, using the ultralightweight architecture, the key generation, encryption, and decryption phases require exactly (n + 1) × 256, (2n + 3) × 256, and (n + 1) × 256 clock cycles, respectively. In addition, the CPD of the proposed architectures is always the same during each phase of execution. Thus, the proposed architecture is secure against timing attacks [27].

To quantitatively evaluate the SPA resistance, we implemented InvRBLWE on a Sakura-X board in order to capture
TABLE I
COMPARISON OF FPGA IMPLEMENTATION RESULTS

TABLE II
COMPARISON OF ASIC IMPLEMENTATIONS
architectures, we compare our implementation results against the best implementations of ECC as a classic cryptosystem. We exploit Area × Time as the measurement for complexity, which is a common and widely used measurement [6] in previous work.

In [25], three hardware implementations for the decryption phase of standard Ring-BinLWE on a Xilinx Spartan-6 device for scheme parameters n = 256 and q = 256 have been provided. Our implementation using the optimized InvRBLWE scheme benefits from avoiding reduction and therefore has a shorter CPD. This allows our implementations with the same set of parameters on the same device to have up to three times higher frequency and perform about 3.1 times faster than the high-performance implementation in [25]. Thus, InvRBLWE improves decryption time and AT complexity by at least 68% and 87%, respectively, compared to the fastest implementation in [25].

Compared to standard Ring-LWE [10] implementations [17], [24], InvRBLWE improves time and AT complexity by at least 83% and 52% on a Xilinx Virtex-6 device. Moreover, compared to the best of the available hardware implementations of other post-quantum cryptosystems, our maximum security implementation (n = 512 and q = 256) achieves higher frequency while consuming fewer FPGA slices and improves AT complexity significantly.

Finally, compared to the most recent and the best of the ECC implementations [6], we achieve higher speed and more than 92% AT improvement. Regarding time and AT comparison, we have compared our most secure implementation (190 bits of security) against the ECC scheme over GF(2^283) in order to offer a fair result.

cell libraries. To the best of our knowledge, this is the first ASIC implementation of an LWE-based cryptosystem.

According to the recent NIST lightweight crypto standardization reports [3], a practical crypto-processor for resource-constrained IoT end-nodes should meet certain power and energy thresholds. Battery-powered devices have limited energy, in contrast to energy-harvesting devices that can provide unlimited energy with bounded power.

Table II compares our ASIC implementation results against those for SIKE [29] as a post-quantum key generation scheme based on isogenies over elliptic curves. Moreover, the best ASIC implementation results available for ECC have been compared against ours. Due to the low-complexity operations of the optimized InvRBLWE scheme, our ASIC implementation is at least two times faster than ECC implementations. Additionally, the ultralightweight ASIC implementation consumes 62% and 66% less area and power, respectively, compared to the most lightweight implementation of ECC.

As seen in the table, our high-speed architecture provides fast scheme operations compared to all previous work, while consuming lower energy. This makes the high-speed implementation of InvRBLWE a proper choice for IoT devices powered by batteries. On the other hand, the ultralightweight architecture can be implemented with only 7.5k gates and consumes only 0.18 mW. Such low power consumption makes this architecture the first public key implementation that can be supplied by ultra-low-power energy harvesters such as Vibration Piezo or EM [3].

We note that the ASIC implementation results for both architectures indicate that InvRBLWE has the potential to be considered as an alternative for classic cryptosystems.
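The FPGA comparison above quotes both a speedup factor (about 3.1 times) and a percentage improvement in decryption time (at least 68%). The short sketch below shows how the two figures relate, assuming the percentage is the relative reduction in run time, 1 − 1/speedup. That reading is an assumption of this sketch, since the paper does not state the formula, and the function name time_reduction is hypothetical.

```python
def time_reduction(speedup):
    """Fractional reduction in run time implied by a speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


print(f"{time_reduction(3.1):.1%}")  # 67.7%
print(f"{time_reduction(2.0):.1%}")  # 50.0%
```

Under this reading, a 3.1 times speedup corresponds to roughly a 68% shorter decryption time after rounding, and "at least two times faster" corresponds to a reduction of at least 50%.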
and IoT due to the vulnerability of the classic cryptosystems such as ECC and RSA against quantum attacks [7]. Thus, many organizations are already looking for reliable and practical alternatives for classic cryptosystems [16], [34]. Among current post-quantum crypto-schemes, lattice-based cryptography has gained high attention from researchers due to its relatively faster operations compared to other post-quantum cryptosystems, such as code-based [13] or isogeny-based [14] schemes.

In this paper, we propose an optimized version of Ring-BinLWE [11], namely InvRBLWE, which is highly efficient for hardware implementation. Moreover, we propose two architectures for the InvRBLWE scheme targeting different IoT devices with capabilities varying from edge and powerful devices to resource-constrained and tiny end-nodes. Our FPGA implementations improve time and AT compared to the best previous Ring-LWE implementations [17], [24], [25] by at least 68% and 52%, respectively. These implementations also improve AT by at least 92% compared to the best of the ECC implementations. We are the first to propose a practical ASIC implementation of LWE-based cryptosystems for IoT devices. The proposed high-speed ASIC implementation is at least two times faster than the best of the ECC implementations while consuming less energy. This makes the high-speed architecture suitable for battery-based solutions. Moreover, our lightweight ASIC implementation has power consumption as low as 0.18 mW, which can be supplied by low-power energy harvesting devices such as Vibration Piezo or EM [3]. In other words, the lightweight architecture is the first public key ASIC implementation that satisfies the NIST report criteria [3] and can be practically exploited in IoT end-nodes.

As future work, the proposed hardware implementations can be extensively analyzed against side-channel analysis (SCA). Our current implementations are resistant to SPA and timing attacks. However, there exist other powerful attacks, such as DPA [26] and simple and differential fault attacks (SFA and DFA) [31].

REFERENCES

[1] Z. Ling et al., “Security vulnerabilities of Internet of Things: A case study of the smart plug system,” IEEE Internet Things J., vol. 4, no. 6, pp. 1899–1909, Dec. 2017.
[2] R. Chaudhary, G. S. Aujla, N. Kumar, and S. Zeadally, “Lattice based public key cryptosystem for Internet of Things environment: Challenges and solutions,” IEEE Internet Things J., to be published. doi: 10.1109/JIOT.2018.2878707.
[3] C. Patrick and P. Schaumont, “The role of energy in the lightweight cryptographic profile,” in Proc. NIST Lightweight Cryptography Workshop, 2016, pp. 1–16.
[4] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, 1978.
[5] N. Koblitz, “Elliptic curve cryptosystems,” Math. Comput., vol. 48, no. 177, pp. 203–209, 1987.
[6] R. Salarifard, S. Bayat-Sarmadi, and H. Mosanaei-Boorani, “A low-latency and low-complexity point-multiplication in ECC,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 9, pp. 2869–2877, Sep. 2018.
[7] P. W. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Proc. IEEE 35th Annu. Symp. Found. Comput. Sci., Santa Fe, NM, USA, 1994, pp. 124–134.
[8] C. Peikert, “Public-key cryptosystems from the worst-case shortest vector problem,” in Proc. ACM 41st Annu. ACM Symp. Theory Comput., 2009, pp. 333–342.
[9] O. Regev, “On lattices, learning with errors, random linear codes, and cryptography,” J. ACM, vol. 56, no. 6, p. 34, 2009.
[10] V. Lyubashevsky, C. Peikert, and O. Regev, “On ideal lattices and learning with errors over rings,” in Proc. Annu. Int. Conf. Theory Appl. Cryptograph. Techn., 2010, pp. 1–23.
[11] J. Buchmann, F. Göpfert, T. Güneysu, T. Oder, and T. Pöppelmann, “High-performance and lightweight lattice-based public-key encryption,” in Proc. ACM 2nd Int. Workshop IoT Privacy Trust Security, 2016, pp. 2–9.
[12] J. Buchmann, F. Göpfert, R. Player, and T. Wunderer, “On the hardness of LWE with binary error: Revisiting the hybrid lattice-reduction and meet-in-the-middle attack,” in Proc. Int. Conf. Cryptol. Africa, 2016, pp. 24–43.
[13] Classic McEliece. Accessed: Jan. 2019. [Online]. Available: https://classic.mceliece.org/
[14] D. Jao and L. De Feo, “Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies,” in Proc. Int. Workshop Post Quant. Cryptography, 2011, pp. 19–34.
[15] B. Koziel, R. Azarderakhsh, and M. M. Kermani, “A high-performance and scalable hardware architecture for isogeny-based cryptography,” IEEE Trans. Comput., vol. 67, no. 11, pp. 1594–1609, Nov. 2018.
[16] National Institute of Standards and Technology. Accessed: Mar. 2018. [Online]. Available: https://www.nist.gov/
[17] S. S. Roy, F. Vercauteren, N. Mentens, D. D. Chen, and I. Verbauwhede, “Compact ring-LWE cryptoprocessor,” in Proc. Workshop Cryptograph. Hardw. Embedded Syst., 2014, pp. 371–391.
[18] The Cryptographic Suite for Algebraic Lattices (CRYSTALS). Accessed: Jan. 2019. [Online]. Available: https://pq-crystals.org/
[19] NTRU Prime Family: Streamlined NTRU Prime and NTRU LPRime. Accessed: Jan. 2019. [Online]. Available: https://ntruprime.cr.yp.to/index.html
[20] D. Micciancio and C. Peikert, “Hardness of SIS and LWE with small parameters,” in Proc. Adv. Cryptol. (CRYPTO), 2013, pp. 21–39.
[21] R. De Clercq, S. S. Roy, F. Vercauteren, and I. Verbauwhede, “Efficient software implementation of ring-LWE encryption,” in Proc. IEEE Design Autom. Test Europe Conf. (DATE), 2015, pp. 339–344.
[22] F. Göpfert, C. van Vredendaal, and T. Wunderer, “A hybrid lattice basis reduction and quantum search attack on LWE,” in Proc. Int. Workshop Post Quant. Cryptography, 2017, pp. 184–202.
[23] N. Göttert, T. Feller, M. Schneider, J. Buchmann, and S. Huss, “On the design of hardware building blocks for modern lattice-based encryption schemes,” in Proc. Int. Workshop Cryptograph. Hardw. Embedded Syst., 2012, pp. 512–529.
[24] T. Pöppelmann and T. Güneysu, “Towards practical lattice-based public-key encryption on reconfigurable hardware,” in Proc. Int. Conf. Sel. Areas Cryptography, 2013, pp. 68–85.
[25] A. Aysu, M. Orshansky, and M. Tiwari, “Binary ring-LWE hardware with power side-channel countermeasures,” in Proc. IEEE Design Autom. Test Europe Conf. Exhibit. (DATE), 2018, pp. 1253–1258.
[26] P. Kocher, J. Jaffe, B. Jun, and P. Rohatgi, “Introduction to differential power analysis,” J. Cryptograph. Eng., vol. 1, no. 1, pp. 5–27, 2011.
[27] P. C. Kocher, “Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems,” in Proc. Annu. Int. Cryptol. Conf., 1996, pp. 104–113.
[28] A. A. Kamal and A. M. Youssef, “An FPGA implementation of the NTRUEncrypt cryptosystem,” in Proc. IEEE Int. Conf. Microelectron. (ICM), 2009, pp. 209–212.
[29] D. Jao et al. Supersingular Isogeny Key Encapsulation (SIKE). Accessed: Jan. 2019. [Online]. Available: https://sike.org
[30] W. Wang, J. Szefer, and R. Niederhagen, “FPGA-based key generator for the Niederreiter cryptosystem using binary Goppa codes,” in Proc. Int. Conf. Cryptograph. Hardw. Embedded Syst., 2017, pp. 253–274.
[31] T. Schneider, A. Moradi, and T. Güneysu, “ParTI – Towards combined hardware countermeasures against side-channel and fault-injection attacks,” in Proc. Annu. Cryptol. Conf., 2016, pp. 302–332.
[32] Semiconductor Manufacturing Company (TSMC). Accessed: Jan. 2018. [Online]. Available: http://www.tsmc.com/
[33] NanGate Standard Cell Library. Accessed: Jan. 2018. [Online]. Available: http://www.si2.org/
[34] L. Chen et al., Post-Quantum Cryptography, U.S. Dept. Commerce, Nat. Inst. Stand. Technol., Gaithersburg, MD, USA, 2016.

Authors’ photographs and biographies not available at the time of publication.