
A Novel Modular Multiplier for Isogeny-Based Post-Quantum Cryptography


ABSTRACT

The supersingular isogeny key encapsulation (SIKE) protocol is a promising
candidate for standardization in post-quantum cryptography (PQC), but it suffers
from high computational complexity. Since modular multiplication takes up
a large proportion of the computations in the SIKE protocol, accelerating this
operation efficiently speeds up the entire protocol. In this paper, we propose
a new modular multiplication algorithm that achieves lower complexity
than prior art. SIKE-friendly primes of the form
$p = 2^{nx}\ell_B^{ny} + 1 = R^n + 1$ are considered. The modulo-p operation is
mainly replaced by modulo-R operations, for which a general Barrett reduction
(GBR) algorithm is presented and applied. Moreover, an efficient architecture is
designed for the proposed algorithm, in which pipelining and interleaving
techniques are applied. For the multiply-accumulate (MAC) part, various
optimization techniques are introduced to reduce the data path and the
complexity. The FPGA implementation results show that, for a level-5
quantum-security parameter, our design achieves the fastest clock speed with a
moderate number of clock cycles and low resource consumption among
state-of-the-art works.
Index Terms: Modular multiplication, supersingular isogeny key encapsulation
(SIKE), post-quantum cryptography (PQC), hardware implementation, FPGA.
INTRODUCTION

The supersingular isogeny key encapsulation (SIKE) protocol has entered the
second round of the post-quantum cryptography (PQC) competition launched by
the National Institute of Standards and Technology (NIST). As an improved
version of the supersingular isogeny Diffie-Hellman (SIDH) key exchange
protocol, SIKE inherits SIDH's advantages: the smallest key sizes and the
ability to resist attacks from powerful quantum computers. Meanwhile, it can
also defend against side-channel attacks. These features make it a promising
candidate in the competition. However, its considerable computational cost
makes the protocol difficult to deploy in practical applications.
SIDH was first proposed by Jao and De Feo in 2011, and since then
researchers have proposed many optimizations to accelerate it. Modular
multiplication, as one of the most complicated operations, is usually
the focus. As the major computations in SIDH are the same as those in
SIKE, optimizations of the modular multiplication for SIDH also
apply to the SIKE protocol. We propose a lower-complexity finite field
multiplication algorithm for the n-fold prime
$p = 2^{nx}\ell_B^{ny} + 1 = R^n + 1$, named IFFMn, for isogeny-based elliptic
curve cryptography (ECC). In this new algorithm, the modulo-p operation is
replaced by n modulo-R operations. We introduce a general Barrett reduction
algorithm that allows negative inputs for those small modulo operations. This
new Barrett reduction algorithm also achieves lower complexity than previous
ones. Moreover, we devise an efficient hardware architecture for IFFMn and
implement it on FPGA. According to the FPGA implementation results, our design
achieves the fastest clock speed with low resource consumption compared with
state-of-the-art works.
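
To make the idea of replacing one modulo-p operation with n modulo-R operations concrete, here is a minimal Python sketch. It is not the authors' IFFMn algorithm, only an illustration of the underlying identity $R^n \equiv -1 \pmod{p}$. It assumes the parameters $R = 2^{51} 3^{28}$ and n = 8 from the implementation results; the function names `to_digits` and `mul_mod_p` are hypothetical. Python's floor-style `divmod` stands in for a reduction that accepts negative inputs.

```python
# Assumed parameters: R = 2^51 * 3^28 and n = 8, so p = R^8 + 1.
R = 2**51 * 3**28
N = 8
P = R**N + 1


def to_digits(a):
    """Split 0 <= a < P into N radix-R digits (the top digit may equal R)."""
    digits = []
    for _ in range(N - 1):
        a, d = divmod(a, R)
        digits.append(d)
    digits.append(a)  # remainder of the value; equals R only when a == R**N
    return digits


def mul_mod_p(a, b):
    """Compute a*b mod P using N small reductions modulo R."""
    x, y = to_digits(a), to_digits(b)
    # Raw coefficients: products with i + j >= N wrap around with a minus
    # sign because R**N = -1 (mod P), so coefficients can become negative.
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] += x[i] * y[j]
            else:
                c[i + j - N] -= x[i] * y[j]
    # N reductions modulo R with carry propagation (floor semantics).
    carry = 0
    r = []
    for k in range(N):
        carry, rem = divmod(c[k] + carry, R)
        r.append(rem)
    # The last carry has weight R**N = -1 (mod P); fold it back in.
    v = sum(d * R**k for k, d in enumerate(r)) - carry
    while v < 0:      # small final correction, standing in for post-processing
        v += P
    while v >= P:
        v -= P
    return v
```

The negative raw coefficients that appear in the wrap-around step are the reason a reduction that tolerates negative inputs is useful in this setting.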
WORKING BLOCK DIAGRAM

Figure: The proposed top-level module.

In this section, a corresponding hardware architecture is proposed for the
IFFMn algorithm. The top-level architecture is shown in the figure. Besides the
explicit adders and multiplexers (Muxes), the proposed modular multiplier is
composed of four modules: 1) top mul; 2) accumulator; 3) two GBR; and 4)
Post Process (PP).
To balance hardware resources and latency, the design is partially
parallel. 2m pairs of input coefficients are sent into the m top mul modules. It
requires about $n(n+1)/(2m)$ iterations to compute these coefficient
multiplications. In each iteration, the m outputs from the top mul modules are
accumulated and sent back to the corresponding registers in the accumulator.
After the raw coefficients $c_0, \ldots, c_{n-1}$ are obtained, they are fed
into the two GBR modules in pairs. Thereafter, the quotient q of $c_{n-1}$ and
the remainders $r_0, \ldots, r_{n-1}$ are carried into the PP module to produce
the final outputs, all of which fall into a bounded range. Let $\tau_0$,
$\tau_1$, $\tau_2$, and $\tau_3$ denote the latencies of the four modules,
respectively. Since these major modules can be processed separately, the average
number of clock cycles (CCs) per multiplication can be computed as
$\max\{\tau_0, \tau_1, \tau_2, \tau_3\}$. More details of these modules
are presented below.

1) Top mul: This module occupies the most hardware resources. By using
Karatsuba decomposition, the $n^2$ multiplications are reduced to
$n(n+1)/2$ multiplications, comprising n products $a_i b_i$ and $n(n-1)/2$
products $(a_i + a_j)(b_i + b_j)$ with $i \ne j$. This module applies m
multipliers to calculate the $n(n+1)/2$ multiplications, consuming
$n(n+1)/(2m)$ iterations. The architecture of the multiplier is depicted in
Fig. 2. The Kr mul module calculates the product of two 32-bit numbers. With
Karatsuba decomposition, a 32 × 32 multiplication is divided into two 16 × 16
multiplications and one 17 × 17 multiplication, which achieves the highest
hardware utilization with only 3 DSPs. If the operand width N is large, each
N-bit number is divided into several words whose bit width is a multiple of 32,
and the Kr mul module then calculates the multiplication with more DSPs.
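
The two uses of Karatsuba decomposition described above can be sketched in Python as follows. This is only an arithmetic illustration under stated assumptions, not the hardware design: the function names `coefficient_products` and `kr_mul32` are hypothetical, and nothing here models DSP mapping or pipelining. The first function recovers every cross term $a_i b_j + a_j b_i$ from $n(n+1)/2$ multiplications; the second splits a 32 × 32 product into two 16 × 16 products and one 17 × 17 product.

```python
def coefficient_products(a, b):
    """Return diagonal products, cross sums, and the multiplication count.

    cross[(i, j)] equals a[i]*b[j] + a[j]*b[i] for i < j, obtained from one
    extra multiplication per pair instead of two.
    """
    n = len(a)
    mults = 0
    diag = []
    for i in range(n):
        diag.append(a[i] * b[i])  # n products a_i * b_i
        mults += 1
    cross = {}
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] + a[j]) * (b[i] + b[j])  # n(n-1)/2 products
            mults += 1
            cross[(i, j)] = prod - diag[i] - diag[j]
    return diag, cross, mults


def kr_mul32(x, y):
    """Multiply two 32-bit unsigned integers with three smaller products."""
    mask = 0xFFFF
    x0, x1 = x & mask, x >> 16
    y0, y1 = y & mask, y >> 16
    low = x0 * y0                              # 16 x 16
    high = x1 * y1                             # 16 x 16
    mid = (x0 + x1) * (y0 + y1) - low - high   # 17 x 17, then two subtractions
    return (high << 32) + (mid << 16) + low
```

For n = 8 the count returned by `coefficient_products` is 36, compared with 64 for the schoolbook method.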

2) Accumulator: The accumulator post-processes the outputs of the
top mul module. n registers are used to store the intermediate data. In every
CC, one of the values $c_i$ ($0 \le i < n$) is read out and added to the output
of the top mul module, and the sum is written back to the $c_i$ register. The
architecture contains t adders so that t additions finish in one CC, where t
depends on the number of clock cycles allotted to the accumulation. Increasing
the number of adders accelerates this part. Once all products have been
accumulated into $c_0, \ldots, c_{n-1}$, the register contents are sent to the
GBR module.
3) Two GBR: The GBR module implements the reduction. Its architecture
resembles the IBR module from earlier work, with slight adjustments. As
shown in the figure, the two GBR functions are processed in parallel, with an
extra compensation (Cmp) module to reduce the number of CCs. If the input
c ≥ 0, c is sent into the GBR module directly, yielding the quotient and
remainder of c divided by R. If the input c < 0, the absolute value of c is
sent into the GBR module instead. With the GBR outputs q and r, two additions
then give the quotient −q − 1 and the remainder R − r. Muxes select on whether
the input is negative, and adders handle the negative case.
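
The handling of negative inputs can be sketched as below. A textbook Barrett step stands in for the GBR core, which is not reproduced here, so this is an illustration under assumptions rather than the authors' algorithm. The names `SHIFT`, `MU`, `barrett_divmod` and `signed_divmod` are hypothetical, and the modulus $R = 2^{51} 3^{28}$ is taken from the implementation results. The special case r = 0 for negative inputs is an assumption added here; the text only states the mapping to −q − 1 and R − r.

```python
R = 2**51 * 3**28                   # assumed modulus
SHIFT = 2 * R.bit_length() + 8      # assumed precision of the Barrett constant
MU = (1 << SHIFT) // R              # precomputed floor(2^SHIFT / R)


def barrett_divmod(c):
    """Textbook Barrett step for c >= 0: return (q, r) with c == q*R + r."""
    q = (c * MU) >> SHIFT           # quotient estimate, never too large
    r = c - q * R
    while r >= R:                   # correct a slightly low estimate
        q += 1
        r -= R
    return q, r


def signed_divmod(c):
    """Quotient and remainder of c by R with 0 <= r < R, for any sign of c."""
    if c >= 0:
        return barrett_divmod(c)
    q, r = barrett_divmod(-c)       # reduce the absolute value
    if r == 0:                      # assumption: exact multiples need no fix-up
        return -q, 0
    return -q - 1, R - r            # the two extra additions from the text
```

With these definitions, `signed_divmod(c)` agrees with Python's built-in `divmod(c, R)`, which uses the same floor convention.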
IMPLEMENTATION RESULTS

Using a previously proposed search method, we find an SIDH-friendly prime
$p = 2^{408} 3^{224} + 1 = R^8 + 1$, where $R = 2^{51} 3^{28}$, targeting the
level-5 post-quantum security level, and implement the proposed architecture on
FPGA for this prime. The Xilinx Vivado 2018.2 EDA platform is used and the
Virtex-7 xc7vx690tffg1157-3 board is selected. Since the maximum latency lies in
the top mul module or the two GBR modules, either of which consumes 36 CCs, the
design can accept one pair of inputs every 36 CCs, with a latency of
89 CCs. Table II compares FPGA implementations with previous
algorithms that also adopt the unconventional radix. It
can be seen that our design achieves the highest frequency (193 MHz) among
state-of-the-art designs. As for hardware resources, our work uses far fewer
resources, especially DSPs, than the previous works except
EFFM. However, with a much faster clock speed, our design offers a more
reasonable choice for SIKE implementation.
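
As a quick sanity check on the quoted parameters, the following Python sketch verifies the relation between p and R and converts the reported cycle counts into time at the reported clock frequency. It assumes the reading $p = 2^{408} 3^{224} + 1$ and $R = 2^{51} 3^{28}$; the helper name `cycles_to_ns` is hypothetical.

```python
# Assumed parameters from the text.
R = 2**51 * 3**28
P = 2**408 * 3**224 + 1

FREQ_HZ = 193e6        # reported clock frequency
INTERVAL_CC = 36       # cycles between successive multiplications
LATENCY_CC = 89        # cycles from input to output


def cycles_to_ns(cycles, freq_hz=FREQ_HZ):
    """Convert a clock-cycle count to nanoseconds at the given frequency."""
    return cycles / freq_hz * 1e9


if __name__ == "__main__":
    print("p == R^8 + 1:", P == R**8 + 1)
    print("bit length of R:", R.bit_length())
    print("bit length of p:", P.bit_length())
    print("time per multiplication (ns):", cycles_to_ns(INTERVAL_CC))
    print("latency (ns):", cycles_to_ns(LATENCY_CC))
```

Under these assumptions R occupies 96 bits, which matches the 32-bit word granularity mentioned for the Kr mul module.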
CONCLUSION

We have proposed a low-complexity modular multiplication algorithm named
IFFMn for primes of the form $p = 2^{nx}\ell_B^{ny} + 1 = R^n + 1$ and devised
the corresponding architecture with various optimizations. As the
FPGA implementation results show, our design achieves a much higher clock
speed with reasonable resource consumption compared with prior art. In
brief, our work provides a promising hardware architecture for the modular
multiplication in isogeny-based ECC cryptosystems.
