We propose two hardware architectures for the modular multiplication over a finite field F𝑝 using the Adapted Modular Number System (AMNS). We present their optimized implementations, with low latency.

Keywords: Modular multiplication; Hardware implementation; FPGA; Adapted modular number system
1. Introduction

Nowadays, cryptographic primitives are implemented for miscellaneous uses, and the efficiency of their implementation on small embedded devices is important. Most public key cryptographic protocols involve modular arithmetic: this family of cryptosystems, which includes RSA [1] and elliptic curve cryptography (ECC) [2], relies on it [3]. Hence, the more efficient the modular arithmetic implementation, the shorter the execution time. The most critical operation is the modular multiplication, and many algorithms for the binary representation have been proposed in the literature [4–8].

Beside the algorithmic solutions, it is possible to improve their efficiency by choosing unusual number representations. In [9], a new number system called the Adapted Modular Number System (AMNS) was proposed in order to speed up the modular operations. Its main characteristic is that its elements are polynomials with small coefficients.

In [10], the theoretical background of the AMNS is presented along with a software implementation of the modular multiplication in this system. The authors show how to build an efficient AMNS that allows modular multiplications that can be more efficient than the classical Montgomery modular multiplication.

In this paper, we present an FPGA architecture for modular multiplication using the AMNS and analyze its behavior. This paper gives guidelines for the hardware realization of such a structure: two methods are presented and a suitable hardware model is identified. In order to evaluate the performance and the cost (in number of slice registers and DSP blocks) of our architectures, we compare our results with the state of the art of modular multiplications in binary, signed-digit and RNS representations. Various architectures have been built for the finite field F𝑝, where 𝑝 is a prime integer, to provide a scalable implementation on FPGA. Our results are very promising, as our implementations provide the most efficient modular multiplication compared to binary implementations with the same size of the modulus 𝑝 and on the same FPGA target (see Table 8). The work presented is done in continuation of Chaouch et al. [11].

The paper is organized as follows. In Section 2, we present the background of the modular multiplication and the AMNS. We describe our hardware architectures in Section 3 and in Section 4. Our results and a comparison with the state of the art are presented in Section 5 and in Section 6. We conclude in Section 7.

2. The adapted modular number system

The Adapted Modular Number System (AMNS) is a number system that was introduced in [9] in order to speed up modular arithmetic. The main characteristic of the AMNS is that its elements are polynomials. This characteristic gives the AMNS many advantages for both efficiency and safety. In [10], the authors give a method to generate many AMNS for any prime integer in order to perform modular operations efficiently. They present software implementation results which highlight that the AMNS allows to perform modular multiplication
https://doi.org/10.1016/j.jisa.2021.102770
more efficiently than well known libraries like GNU-MP and OpenSSL. In this section, we give an overview of this number system and of modular operations in it.

In this paper, we do not deal with the generation process of the AMNS, since it requires a substantial mathematical background that is out of the scope of this article; see [9,10]. The authors of [10] provide an implementation of the AMNS generation process¹ that uses the SageMath library [12]. In the same location, there is also a C code generator: given all the parameters of an AMNS, this generator outputs a software implementation in C language that allows to efficiently perform arithmetic operations in that AMNS.

The AMNS is a subclass of the Modular Number System (MNS), so we start by first presenting the MNS. A MNS ℬ = (𝑝, 𝑛, 𝛾, 𝜌) is a number system in which every element 𝑥 ∈ Z/𝑝Z has a representation 𝑉 of degree smaller than 𝑛, with coefficients smaller than 𝜌 in absolute value, such that 𝑉(𝛾) ≡ 𝑥 (mod 𝑝). In this case, we say that the polynomial 𝑉(𝑋) = 𝑣0 + 𝑣1𝑋 + ⋯ + 𝑣𝑛−1𝑋^(𝑛−1) is a representation of 𝑥 in ℬ, and we denote 𝑉 ≡ 𝑥.

For a MNS to be an AMNS, the parameter 𝛾 should meet a requirement which is essential for arithmetic operations: 𝛾 must be a root modulo 𝑝 of a polynomial 𝐸(𝑋) = 𝑋^𝑛 − 𝜆, with 𝜆 a very small nonzero integer.

Example 1. Let 𝑝 = 19. The tuple ℬ = (19, 3, 7, 2, 𝐸), with 𝐸(𝑋) = 𝑋³ − 1, is an AMNS. Indeed, Table 1 gives a representation of each element of Z/19Z; this shows that the tuple (19, 3, 7, 2) is a MNS. Moreover, we have 𝛾³ = 7³ = 343 ≡ 1 (mod 19), so 𝐸(𝛾) ≡ 0 (mod 19) with 𝜆 = 1, which is very small.

It can be checked in Table 1 that any representation 𝐴 of an element 𝑎 ∈ Z/19Z is such that: deg(𝐴) < 3, ‖𝐴‖∞ < 2 and 𝐴(𝛾) ≡ 𝑎 (mod 𝑝). For instance, 𝛾² − 𝛾 + 1 = 49 − 7 + 1 = 43 ≡ 5 (mod 19) shows that 𝑋² − 𝑋 + 1 is a representation of 5 in ℬ.

Like in usual number systems, the main arithmetic operations in the AMNS are the addition and the multiplication. Since elements of the AMNS are polynomials, the addition (resp. the multiplication) is an addition (resp. a multiplication) of polynomials. However, additional operations have to be done in order to obtain the result in the AMNS. This is explained by the following. Let 𝑥, 𝑦 ∈ Z/𝑝Z be two integers, and let 𝑉 ≡ 𝑥 and 𝑊 ≡ 𝑦 be their representations in ℬ. The polynomial 𝑇 = 𝑉𝑊 satisfies 𝑇(𝛾) ≡ 𝑥𝑦 (mod 𝑝). However, 𝑇 might not be a valid representation of 𝑥𝑦 in ℬ, because its degree could be greater than or equal to 𝑛. To keep the degree lower than 𝑛, the product 𝑉𝑊 has to be computed modulo the polynomial 𝐸. This operation is called the external reduction. Notice that since 𝐸(𝛾) ≡ 0 (mod 𝑝) and 𝑇 = 𝑉𝑊 mod 𝐸, we still have 𝑇(𝛾) ≡ 𝑥𝑦 (mod 𝑝) and deg(𝑇) < 𝑛.

2.1. Some notations and conventions

Before presenting the arithmetic operations with the reduction methods, we need to establish some notations and conventions, for simplicity and consistency. We assume that 𝑝 ⩾ 3 and 𝑛, 𝛾, 𝜌 ⩾ 1.

Let Z𝑛[𝑋] be the set of polynomials in Z[𝑋] of degree smaller than 𝑛:

Z𝑛[𝑋] = {𝐶 ∈ Z[𝑋], such that deg(𝐶) < 𝑛}.

If 𝑉 ∈ Z𝑛[𝑋] is a polynomial, we assume that 𝑉(𝑋) = 𝑣0 + 𝑣1𝑋 + ⋯ + 𝑣𝑛−1𝑋^(𝑛−1).

The external reduction is a polynomial modular reduction. The goal of this operation is to keep the degree of the AMNS representations lower than 𝑛. Let 𝐶 ∈ Z[𝑋] be a polynomial. The external reduction consists in computing a polynomial 𝐶′ such that:

𝐶 = 𝑄𝐸 + 𝐶′, with deg(𝐶′) < 𝑛 and 𝑄 ∈ Z[𝑋].

Since 𝐸(𝛾) ≡ 0 (mod 𝑝), one has 𝐶′(𝛾) ≡ 𝐶(𝛾) (mod 𝑝). So, the external reduction is done as: 𝐶′ = 𝐶 mod 𝐸. The polynomial 𝐸 is called the external reduction polynomial.

Let 𝐴 ∈ Z𝑛[𝑋] and 𝐵 ∈ Z𝑛[𝑋], and let 𝐶 = 𝐴𝐵. Then, deg(𝐶) < 2𝑛 − 1. Since 𝐸(𝑋) = 𝑋^𝑛 − 𝜆, with 𝜆 very small, the external reduction can be done very efficiently. The function RedExt (Algorithm 1) proposed by Plantard in [13] can be used to perform this operation.

Algorithm 1 RedExt - External reduction [13]
Require: 𝐶 ∈ Z[𝑋] with deg(𝐶) < 2𝑛 − 1, and 𝐸(𝑋) = 𝑋^𝑛 − 𝜆
Ensure: 𝐶′ ∈ Z𝑛[𝑋], such that 𝐶′ = 𝐶 mod 𝐸
1: for 𝑖 = 0 … 𝑛 − 2 do
2:   𝑐′𝑖 ← 𝑐𝑖 + 𝜆𝑐𝑛+𝑖
3: end for
4: 𝑐′𝑛−1 ← 𝑐𝑛−1
5: return 𝐶′   # 𝐶′ = (𝑐′0, … , 𝑐′𝑛−1)

Remark 1. In [9], the authors show that the output 𝐶′ of Algorithm 1 is such that:

‖𝐶′‖∞ < 𝑛|𝜆|‖𝐴‖∞‖𝐵‖∞,

with 𝐶 = 𝐴𝐵 and 𝐴, 𝐵 ∈ Z𝑛[𝑋]; cf. Section 3 of that paper.

¹ https://github.com/arithPMNS/generalisation_amns
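As a concrete illustration, the coefficient recurrence of Algorithm 1 can be sketched in a few lines of Python (a functional toy model with plain integer lists, not the paper's hardware data path), using the parameters of Example 1:

```python
def red_ext(c, n, lam):
    """External reduction: return C' = C mod E with E(X) = X^n - lam.

    c is the coefficient list (c_0, ..., c_{2n-2}) of a product of two
    degree < n polynomials; the result has n coefficients, as in Algorithm 1.
    """
    c = c + [0] * (2 * n - 1 - len(c))                   # pad to degree 2n - 2
    return [c[i] + lam * c[n + i] for i in range(n - 1)] + [c[n - 1]]

# Example 1: p = 19, n = 3, gamma = 7, E(X) = X^3 - 1 (so lam = 1).
p, n, gamma, lam = 19, 3, 7, 1

# A = X^2 - X + 1 represents 5, B = X represents 7 (since gamma = 7).
A, B = [1, -1, 1], [0, 1, 0]

# Schoolbook product C = A*B, with deg(C) < 2n - 1.
C = [0] * (2 * n - 1)
for i, a in enumerate(A):
    for j, b in enumerate(B):
        C[i + j] += a * b

Cp = red_ext(C, n, lam)                                  # C' = C mod E
value = sum(c * gamma**i for i, c in enumerate(Cp)) % p
print(Cp, value)   # → [1, 1, -1] 16, and indeed 5 * 7 ≡ 16 (mod 19)
```

The check at the end confirms the property stated above: the reduction lowers the degree below 𝑛 while preserving the value at 𝛾 modulo 𝑝.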
A. Chaouch et al. Journal of Information Security and Applications 58 (2021) 102770
𝐶 ← 𝐴 × 𝐵   (1)
𝐶′ ← 𝐶 mod 𝐸   (2)
𝑄 ← 𝐶′ × 𝑀′ mod (𝐸, 2^𝑟)   (3)
𝑅 ← 𝑄 × 𝑀 mod 𝐸   (4)
𝑆 ← (𝐶′ + 𝑅)/2^𝑟   (5)
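The five steps can be checked end-to-end with a small Python model. The parameters below (𝑝 = 7, 𝑛 = 2, 𝛾 = 3, 𝐸(𝑋) = 𝑋² − 2, 𝑟 = 5, 𝑀 = 𝑋 − 3 and 𝑀′ = −𝑀⁻¹ mod (𝐸, 2^𝑟)) are a hand-built toy set, not an AMNS generated as in [10]; the point is only that step (5) is an exact division and that 𝑆(𝛾) ≡ 𝐴(𝛾)𝐵(𝛾) · 2^(−𝑟) (mod 𝑝), as in Montgomery arithmetic:

```python
# Toy Montgomery-like AMNS multiplication: steps (1)-(5).
p, n, gamma, lam, r = 7, 2, 3, 2, 5           # E(X) = X^2 - 2, phi = 2^r = 32
M  = [-3, 1]                                  # M(X) = X - 3, M(gamma) ≡ 0 (mod p)
Mp = [5, 23]                                  # M' = -M^{-1} mod (E, 2^r), hand-computed

def mul_mod_E(a, b):
    """Polynomial product reduced mod E(X) = X^n - lam (external reduction)."""
    c = [0] * (2 * n - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y
    return [c[i] + lam * c[n + i] for i in range(n - 1)] + [c[n - 1]]

def amns_mult(A, B):
    Cp = mul_mod_E(A, B)                              # steps (1)-(2): C' = A*B mod E
    Q  = [q % 2**r for q in mul_mod_E(Cp, Mp)]        # step (3): Q = C'*M' mod (E, 2^r)
    R  = mul_mod_E(Q, M)                              # step (4): R = Q*M mod E
    # Step (5) is exact: every coefficient of C' + R is divisible by 2^r.
    assert all((c + rr) % 2**r == 0 for c, rr in zip(Cp, R))
    return [(c + rr) >> r for c, rr in zip(Cp, R)]    # step (5): division by 2^r

A, B = [1, 1], [-1, 1]                                # two small representations
S = amns_mult(A, B)
lhs = sum(s * gamma**i for i, s in enumerate(S)) * 2**r % p
rhs = sum(a * gamma**i for i, a in enumerate(A)) * sum(b * gamma**i for i, b in enumerate(B)) % p
print(S, lhs == rhs)   # S(gamma)*2^r ≡ A(gamma)*B(gamma) (mod p)
```

Since 𝑀(𝛾) ≡ 0 (mod 𝑝), step (4) adds a multiple of 𝑝 at 𝛾, so the division by 2^𝑟 leaves a representation of 𝐴(𝛾)𝐵(𝛾) · 2^(−𝑟) mod 𝑝, i.e. the result stays in the Montgomery domain.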
In this section, we present two architectures that can be used for the modular multiplication in the AMNS. The first one is a sequential architecture, which requires more clock cycles to complete the modular multiplication in the AMNS. The second architecture is a semi-parallel architecture, which aims to execute most of the arithmetic operations in the same cycle in order to obtain a fast design.

In Section 4.1, we present our sequential architecture, while our semi-parallel architecture is presented in Section 4.2.

Table 3 (fragment)
Steps      Transition                    Condition
Step (4)   8→9, 9→10, 10→11, 11→12      𝐷𝑂𝑁𝐸_𝑅 = 1
Step (5)   12→13                        𝐷𝑂𝑁𝐸_𝑆 = 1
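The controller transitions above can be modeled as a simple transition table; the sketch below encodes only the fragment of Table 3 shown here (states 8 to 13, driven by the 𝐷𝑂𝑁𝐸_𝑅 and 𝐷𝑂𝑁𝐸_𝑆 flags), as an illustration of how such an FSM advances, not as a model of the full 14-state controller:

```python
# Transition-table sketch of the sequential controller (Table 3 fragment only).
TRANSITIONS = {
    8: ("DONE_R", 9), 9: ("DONE_R", 10), 10: ("DONE_R", 11),
    11: ("DONE_R", 12), 12: ("DONE_S", 13),
}

def next_state(state, signals):
    """Advance when the required DONE signal is set, otherwise stay put."""
    cond, nxt = TRANSITIONS[state]
    return nxt if signals.get(cond) else state

print(next_state(8, {"DONE_R": 1}))   # → 9
print(next_state(12, {"DONE_R": 1}))  # → 12 (state 12 waits for DONE_S)
```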
4.1. Sequential architecture

In this design, the steps described in Section 3.1 are scheduled as shown in Fig. 3. This figure is an example of the scheduling of a modular multiplication in AMNS whose inputs 𝐴 and 𝐵 have four 36-bit coefficients and whose modulus 𝑝 is a 128-bit integer.

In Fig. 1, the AMNS Multiple-Coefficient block executes the foregoing algorithm as follows. It inputs the loop variables 𝑖 and 𝑗 from the control state machine and supplies them to the memory as its address input. The input values 𝑎𝑖, 𝑏𝑗, 𝑚′𝑗 and 𝑚𝑗 are arrays of 𝑤-bit coefficients. Here 𝑤 = 36 bits for 𝑎𝑖 and 𝑏𝑗, 𝑤 = 39 bits for 𝑚′𝑗 and 𝑤 = 40 bits for 𝑚𝑗.

The AMNS Multiple-Coefficient block computes the 5 steps of Algorithm 4. Step (1) calculates an intermediate variable 𝐶 step by step, from the least significant coefficient to the most significant one. At each cycle a new value of 𝑐0, 𝑐1, 𝑐2 and 𝑐3 comes out, as illustrated in Fig. 2.

Step (2) computes another intermediate variable 𝐶′ similarly to step (1). The input 𝐶 is given serially, least-significant coefficient first, in order to compute the least significant coefficient of 𝐶′. The computations of steps (3) and (4), which produce the intermediate variables 𝑄 and 𝑅, are performed the same way. Finally, the division in step (5) is performed by a simple 𝑟-bit right shift. The final result is computed at this step.

Steps (1), (3) and (4) require 𝑛 states in order to be computed, while steps (2) and (5) are performed in one state.

Figs. 4–7 show the parts of the AMNS Multiple-Coefficient block that compute steps (1) and (2), (3), (4) and (5). While these figures do not show it, the AMNS multiplication architecture includes a 36-bit wide single port memory, which allows either read or write access to a single address in one clock cycle. This memory stores the input values 𝐴, 𝐵, 𝑀 and 𝑀′ for the computation, and receives the output value 𝑆. It also serves as temporary storage for the intermediate variables 𝐶′, 𝑅 and 𝑄.

The computation of steps (1) and (2) is illustrated in Fig. 4. Each 72-bit coefficient of 𝐶′ is computed with a 36-bit input MAC operator. For instance, for a 128-bit precision architecture, four MAC operators are required at this step. Such a data path requires 3 DSP blocks for each MAC operator. Each MAC performs a multiplication of coefficients of 𝐴 and 𝐵 taken from memory and accumulates the result. The final value is stored in the 𝐶′𝑖 registers, which can output the result serially. It has to be remarked that step (3), which uses 𝐶′, has to be computed modulo (𝐸, 𝜙). Thus, in this example, only 40 bits of the coefficients 𝑐′𝑖 are needed at the next step.

The architecture of the module that computes step (3) of the modular multiplication is similar to the previous module, except for the size of its inputs and outputs (Fig. 5). In this module, each MAC operator inputs one 39-bit coefficient of 𝑀′ and one 40-bit coefficient of 𝐶′. Each MAC operator outputs one 79-bit coefficient of 𝑄, but only 40 bits are kept because step (3) is computed modulo (𝐸, 𝜙).

The computation of step (4) is illustrated by Fig. 6. This module is similar to the module displayed in Fig. 4, except for the size of the inputs and the output. Each MAC operator inputs a 36-bit coefficient of 𝑀 and a coefficient of 𝑄 that is serially given. Each module outputs a 76-bit coefficient of 𝑅, which is stored in a 76-bit register (not displayed in Fig. 6).

The final step (5) consists of a 76-bit addition and a shift. The module that computes this step is illustrated in Fig. 7. Each adder inputs a 72-bit coefficient of 𝐶′ and a 76-bit coefficient of 𝑅. The following shift operator corresponds to the division in step (5). In this example the divisor is 2^40. The 36-bit coefficients of the result 𝑆 are stored back to the memory.

Main controller (MC). The controller interface signals are set according to the synchronous finite state machine (FSM). This FSM is illustrated in Fig. 8 for a 128-bit modulus 𝑝, where state 0 is the initial state. Each state of our FSM requires two cycles to be completed, except the final state, which is executed in only one cycle. The controller has 14 states. Our basic sequential architecture requires a total of 28 cycles for computing a modular multiplication for a 128-bit modulus 𝑝 and 36-bit coefficient AMNS numbers.

The data path consists of eight internal registers, a counter and a comparator. The controller remains in state 0 until the START instruction is set. When the START signal is set, the 𝑎𝑖 and 𝑏𝑖 registers are loaded with the input values, and the 𝐶00, 𝐶10, 𝐶20, 𝐶30 registers and the counter are reset. The different states are summarized in Table 3. The remaining states work as follows:

• In states 1, 2, 3 and 4, the FSM executes steps (1) and (2) of the algorithm. These states wait to validate the 𝐷𝑂𝑁𝐸_𝐶′ signal. Once the coefficients of the polynomial 𝐶′ are loaded in the BRAM register, the FSM changes to the next state.
• States 5, 6, 7 and 8 wait to validate the 𝐷𝑂𝑁𝐸_𝑄 signal. They compute step (3) of the algorithm. Once the coefficients of the polynomial 𝑄 are loaded in the BRAM register, the FSM changes to the next state.
• States 9, 10, 11 and 12 control the computation of step (4) of the algorithm. They wait to validate the 𝐷𝑂𝑁𝐸_𝑅 signal. Once the FSM generates the coefficients of the polynomial 𝑅 and loads them in the BRAM register, the FSM changes to the next state.
• State 13 waits to validate the 𝐷𝑂𝑁𝐸_𝑆 signal. It computes the last step of the algorithm.

4.2. Our semi-parallel architecture

We present in this section our semi-parallel architecture, whose aim is to execute the maximum of intermediate operations in parallel in order to minimize the number of cycles. As an example, we describe in Fig. 9 a scheduling of the steps of Algorithm 3, with 𝑛 = 4 and 𝑤 = 36.

It appears that it is not necessary to wait for the end of the computation of the last coefficient of 𝐶 in step (1) before starting
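Since 𝐸(𝑋) = 𝑋^𝑛 − 𝜆, each MAC chain of Fig. 4 effectively accumulates one coefficient of 𝐶′ = 𝐴𝐵 mod 𝐸 as 𝑐′𝑘 = Σ_{𝑖+𝑗=𝑘} 𝑎𝑖𝑏𝑗 + 𝜆 Σ_{𝑖+𝑗=𝑛+𝑘} 𝑎𝑖𝑏𝑗. The short Python model below is a functional sketch of that accumulation (one accumulator per MAC chain), not the VHDL data path itself; the coefficient values are arbitrary toy inputs:

```python
def mac_step12(A, B, lam):
    """Functional model of steps (1)-(2): one accumulator per coefficient of C'.

    Coefficient k of C' = A*B mod (X^n - lam) is the MAC accumulation
    sum_{i+j=k} a_i*b_j + lam * sum_{i+j=n+k} a_i*b_j.
    """
    n = len(A)
    acc = [0] * n                       # one accumulator register per MAC chain
    for i in range(n):                  # in hardware, one operand pair per cycle
        for j in range(n):
            k = i + j
            if k < n:
                acc[k] += A[i] * B[j]
            else:
                acc[k - n] += lam * A[i] * B[j]
    return acc

# n = 4 coefficients, as in the 128-bit example (toy values, lam = 2).
print(mac_step12([1, 2, 3, 4], [5, 6, 7, 8], 2))   # → [127, 120, 98, 60]
```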
the computation of step (2) (Algorithm 4). Indeed, at step 1, we compute in parallel the terms that have to be added in order to get 𝑐′0. Thus, the computation of the coefficient 𝑐′0 starts at step 2 (Fig. 9). It is possible to observe the same thing with step (2) and step (3), and with step (4) and step (5).

As a consequence, our semi-parallel architecture is composed of 4 modules, as detailed in Figs. 10–13. Each of them takes in charge one or two steps of Algorithm 3. These modules are:

• Fig. 10: The multiplication 𝑎𝑖 × 𝑏𝑗 is done through the cell 𝑀𝑈𝐿𝑇𝐼𝑃𝐿𝐼𝐸𝑅. The cell 𝐴𝐷𝐷𝐸𝑅 collects the products and computes 𝑐′𝑖.
• Fig. 11: The multiplication 𝑐′𝑖 × 𝑚′𝑖 mod (𝜙) generates the coefficients of the polynomial 𝑄.
• Fig. 12: The multiplication of the coefficients of 𝑄 by 𝑀 mod 𝐸 generates 𝑅.
• Fig. 13: The addition 𝐶′ + 𝑅, followed by the exact division by 2^𝑟, generates the coefficients of the output 𝑆.
Fig. 8. Finite state machine for 𝑤 = 36 and a 128-bit modulus 𝑝 - Sequential architecture.
The first module in Fig. 10 is in charge of steps (1) and (2). The first part of this module is composed of multipliers that compute the product of the 36-bit coefficients of 𝐴 and 𝐵 in one cycle. The resulting 72-bit coefficient is stored in the registers.

The second part of this module is composed of an Adder operator, which collects the intermediate products and sums them in order to compute one coefficient of 𝐶. In this example: 𝐶𝑖 = 𝐶𝑖0 + 𝐶𝑖1 + 𝐶𝑖2 + 𝐶𝑖3. The output of the Adder is connected to a demultiplexer that outputs the coefficient as soon as its computation is completed. In this example, this module requires 5 cycles for completing the computation. It has to be noted that each state is executed in two clock cycles.

The module in Fig. 11 is designed to compute step (3) of Algorithm 3. It is composed of MAC operators that input a 39-bit coefficient of 𝑀′ and a coefficient of 𝐶′. This coefficient is input serially. Each operator outputs a coefficient of 𝑄 that is written back serially to the memory.

The module in Fig. 12 computes step (4). It is composed of multipliers that input the coefficients of 𝑀 and 𝑄. The multipliers are designed to perform the calculation of 40-bit × 36-bit integers, and they output 76-bit integers. The result is stored in the 𝑅𝑖-Reg registers, which are input to the adder that computes one 76-bit coefficient of 𝑅. The output is connected to a demultiplexer which outputs all the computed coefficients of 𝑅.

The last module (Fig. 13) computes the last step of the modular multiplication in AMNS. It is composed of 76-bit Addershift operators that input the 72-bit coefficients of 𝐶′ and the 76-bit coefficients of 𝑅. In this step a division by 2^𝑟 is performed. Thus, only the 36 most significant bits of the sum are output.

Main controller (MC). This design is controlled by a synchronous finite state machine (FSM). For 𝑤 = 36 and a modulus 𝑝 of 128 bits, the FSM controlling our semi-parallel architecture is illustrated in Fig. 14.
Fig. 13. Semi-parallel architecture, computation of step (5) with 𝑤 = 36 and a 128-bit modulus 𝑝.
Fig. 10. Semi-parallel architecture, computation of steps (1) and (2) with 𝑤 = 36 and a 128-bit modulus 𝑝.
Once the START signal is set, the inputs 𝑎𝑖 and 𝑏𝑖 are loaded, and the 𝐶00, 𝐶10, 𝐶20, 𝐶30 registers and the counter are reset. The FSM loops while the DONE signal is not set. The machine advances to the next state once it is triggered by a valid transition.

• The first state is the initial state 0. During this state, the coefficients 𝑎𝑖 and 𝑏𝑖 are loaded into the registers.
• Once states 1, 2, 3 and 4 are completed, the 𝐷𝑂𝑁𝐸_𝐶′ signal is set. These states perform the computation of the coefficients of the polynomial 𝐶′ and load them in the BRAM register. This corresponds to steps (1) and (2).
• In state 1, the FSM uses a MAC operator to execute equations (1) and (2) of Algorithm 4. In this state the machine computes the coefficient 𝑐0 of the polynomial 𝐶′.
• In state 2, the machine computes the second coefficient 𝑐′1 of 𝐶′. In parallel, it starts the computation of the coefficient 𝑞0 of the polynomial 𝑄. Once the values of 𝑞0 and 𝑐′1 are loaded in the BRAM register, the FSM changes to the next state.
• States 3, 4 and 5 aim to validate the 𝐷𝑂𝑁𝐸_𝑞1, 𝐷𝑂𝑁𝐸_𝑞2, 𝐷𝑂𝑁𝐸_𝑞3, 𝐷𝑂𝑁𝐸_𝑐′2 and 𝐷𝑂𝑁𝐸_𝑐′3 signals. The FSM uses eight MAC operators in each state to execute step (3). Once all the coefficients of the polynomial 𝑄 are generated, the FSM moves to the next state.
• State 6 validates the 𝐷𝑂𝑁𝐸_𝑟0 signal. It computes step (4).
• States 7, 8 and 9 aim to validate the 𝐷𝑂𝑁𝐸_𝑟1, 𝐷𝑂𝑁𝐸_𝑟2, 𝐷𝑂𝑁𝐸_𝑟3, 𝐷𝑂𝑁𝐸_𝑠0, 𝐷𝑂𝑁𝐸_𝑠1 and 𝐷𝑂𝑁𝐸_𝑠2 signals. During these states, the coefficients of 𝑄 × 𝑀 are computed in parallel. This corresponds to steps (4) and (5).
• State 10 sets the 𝐷𝑂𝑁𝐸_𝑠3 signal. It computes the fourth coefficient of the polynomial 𝑆 by using an adder operator and a shift operator.

Fig. 11. Semi-parallel architecture, computation of step (3) with 𝑤 = 36 and a 128-bit modulus 𝑝.
5. Experimental results
Fig. 14. Finite state machine for the optimal architecture with 𝑤 = 36 and a modulus 𝑝 of 128 bits.
Below, we report the results for a modular multiplication with different coefficient sizes 𝑤 and for different sizes of 𝑝 (128, 256 and 512 bits), on FPGA that are large enough. We report the maximum operating frequency of the architectures, the number of cycles required to complete the modular multiplication, the total computation time, and the number of DSP blocks and LUTs needed in the implementation.

In Tables 4 and 5 respectively, we report the results of our semi-parallel architecture for Xilinx and Intel-Altera FPGA after placement and routing. The results of the implementations of our sequential architectures are reported in Tables 6 and 7.

As expected, the semi-parallel architecture records the most efficient speed of execution, while the sequential design achieves the smallest area. We observe that the Xilinx FPGA provide faster computations for the modular multiplication in AMNS representation than the Altera family.

Table 4
Semi-parallel architecture results on Xilinx FPGA.
Family        Xilinx
FPGA          Zynq            Virtex 6
𝑝 size bit    128     256     128     256     512
𝑤 bit         36      36      36      36      72
Fq (MHz)      256.5   242.3   238     225.8   191
Cycles        19      33      19      33      33
Time (μs)     0.074   0.136   0.079   0.146   0.172
DSP           60      56      60      120     236
LUT           1025    12 105  1035    2738    18 964

Table 5
Semi-parallel architecture results on Intel-Altera FPGA.
Family        Intel-Altera
FPGA          Aria II         Cyclone V
𝑝 size bit    128     256     128     256     512
𝑤 bit         36      36      36      36      72
Fq (MHz)      218.2   194.4   157.1   106.4   101.6
Cycles        19      33      19      33      33
Time (μs)     0.087   0.169   0.12    0.309   0.324
DSP           70      128     40      80      112
LUT           1722    4173    1201    2747    14 670

Table 6
Sequential architecture results on Xilinx FPGA.
Family        Xilinx
FPGA          Zynq            Virtex 6
𝑝 size bit    128     256     128     256     512
𝑤 bit         36      36      36      36      72
Fq (MHz)      283.2   255.7   309.5   189.7   224.1
Cycles        23      47      23      47      47
Time (μs)     0.081   0.183   0.074   0.247   0.209
DSP           51      53      51      91      216
LUT           791     2827    791     1740    24 763

Table 7
Sequential architecture results on Intel-Altera FPGA.
Family        Altera
FPGA          Aria II         Cyclone V
𝑝 size bit    128     256     128     256     512
𝑤 bit         36      36      36      36      72
Fq (MHz)      237.3   213.3   157.2   123.4   82.7
Cycles        23      47      23      47      47
Time (μs)     0.096   0.22    0.146   0.38    0.568
DSP           42      80      28      52      112
LUT           1267    2423    679     1304    14 283

6. Comparison with the state of the art

To our knowledge, our implementation is the first VHDL implementation of an FPGA architecture of a modular multiplication using the AMNS as a number system. We compared our design to some previously published works:

• The first design we compare to is detailed in [18]. It is a systolic architecture for Montgomery modular multiplication and uses binary numbers.
• The second design is also a systolic architecture, based on the CIOS Montgomery modular multiplication, by Mrabet et al. [19]. It also makes use of binary numbers.
• The third design is detailed in [20]. Its originality is that it uses Residue Number Systems (RNS) as an integer number system. This architecture is based on the cox-rower components introduced in [21].
• The design published in [22] uses a carry-save representation for the intermediate results.
• The systolic architecture described in [23] uses the binary signed-digit representation in order to minimize the number of non-zero partial products in the modular multiplication.
• Similarly, the architecture presented in [24] makes use of binary signed digits.

Comparing FPGA designs is a challenging task because the technologies used are heterogeneous. In this comparison we detail the number of slices, the number of DSP blocks, the number of cycles needed to complete the modular multiplication and the total delay. Some architectures use DSP blocks, while some others use only LUT blocks. For the sake of comparison, we unified the measure of the area by calculating an estimated value 𝐿𝑈𝑇𝑒𝑞 = 𝑛𝐿𝑈𝑇 + (𝑛𝐷𝑆𝑃 × 𝑟), where 𝑛𝐿𝑈𝑇 is the number of LUT slices, 𝑛𝐷𝑆𝑃 is the number of DSP blocks, and 𝑟 is the ratio of the number of LUTs to the number of DSP blocks available on the target device. Thus the Area-Delay Product (ADP) is computed as follows: 𝐴𝐷𝑃 = 𝐿𝑈𝑇𝑒𝑞 × 𝑇𝑖𝑚𝑒 (μs).

We present in Table 8 the implementation results of our architectures compared to the state of the art for modulus sizes between 128 and 512 bits. Such a modulus size range targets implementations for ECC [2]. We highlight the area, time and throughput results for two
sizes of the modulus 𝑝. Table 9 shows the results for 256-bit moduli; this size targets elliptic curve cryptography. In Table 10 we show the results for large cryptographic sizes that target the RSA algorithm [25]; in this table the target moduli have 1024 bits.

Our semi-parallel implementation achieves a higher throughput than the sequential one. Although the ADP of the semi-parallel architecture improves significantly, it does not reach the minimum ADP in all technologies. The performance of the semi-parallel architecture is generally better than that of the sequential architecture (see Fig. 15 and Fig. 16).

Table 8
Comparison of our work with state of the art implementations for modulus 𝑝 from 128 bits up to 512 bits.
Architecture               FPGA   𝑝 size   DSP   LUT     Cycle   Time (μs)   ADP
Semi-parallel              A7     128      60    1563    19      0.077       901.1
                                  256      120   2728    33      0.165       3796.3
                                  512      188   29 985  33      0.204       12598.4
Semi-parallel              V6     128      60    1035    19      0.079       844.9
                                  256      120   2738    33      0.146       3220.4
                                  512      236   18 964  33      0.172       9797.1
Sequential                 A7     128      51    790     23      0.12        1129
                                  256      91    1718    47      0.242       4137.4
                                  512      176   37 138  47      0.258       17255.5
Sequential                 V6     128      51    791     23      0.074       666.1
                                  256      91    1740    47      0.24        3933.8
                                  512      216   24 763  47      0.209       12443.6
Mrabet et al. [19]         A7     128      19    355     33      0.166       591.9
                                  256      33    809     33      0.311       1986
                                  512      87    2650    33      0.507       8797.9
Bigou and Tisserand [20]   K7     192      18    999     58      0.213       864.5
                                  384      41    2111    58      0.324       2942.2
                                  512      56    8757    66      0.374       6835.5
Bigou and Tisserand [20]   V5     192      15    1447    58      0.295       741
                                  384      42    2256    58      0.467       2446.1
                                  512      57    10 877  66      0.536       7999.2
Ors et al. [18]            V-E    128      –     806     388     1.807       1456.4
                                  256      –     1548    772     7.686       11897.9
                                  512      –     2972    1450    16.17       48057.2
Rezai and Keshavarzi [22]  V5     512      –     3048    –       0.449       1400
Rezai and Keshavarzi [24]  V5     512      –     6091    –       0.851       5200

Table 9
Comparison of ADP and throughput results for a 256-bit modulus 𝑝.
Architecture               FPGA   LUTeq    Time (μs)   ADP        Throughput
Semi-parallel              A7     23 008   0.165       3796.32    1544.3
Semi-parallel              V6     22 058   0.146       3220.468   1751.8
Sequential                 A7     17 097   0.242       4137.47    1055.5
Sequential                 V6     16 391   0.24        3933.84    1033.5
Mrabet et al. [19]         A7     6386     0.311       1986.04    822.3
Bigou and Tisserand [20]   K7     9081     0.324       2942.24    1185.1
Bigou and Tisserand [20]   V5     5238     0.467       2446.146   820.9
Ors et al. [18]            V-E    1548     7.686       11897.928  33.3

It appears that our designs are faster than the other designs for 256-bit moduli as well as for 1024-bit moduli. As a consequence, the throughput is the highest for all moduli sizes. This is mainly due to the number of DSP blocks, which greatly increases the speed of the architecture, at the price of the area. Compared to the design described in [18], our architectures are up to 23 times faster, with an ADP up to 1.6 times smaller.

Compared to the architecture of Mrabet et al. [19], which is a systolic hardware architecture of Montgomery modular multiplication, our design is a bit faster, with a better throughput, and a bit larger.

The design of Bigou and Tisserand [20] is an original design that makes use of Residue Number Systems as the number system. Compared to this architecture, our designs remain faster but larger. The throughput of our architecture is higher.

Considering the comparison with the modular multiplication architectures described in [22,23] and [24], our sequential design is faster at the cost of a larger number of slices. We can see that the throughput achieved by these implementations is significantly smaller than what we are able to achieve with our proposed semi-parallel and sequential implementations.

Regardless of the size of the modulus 𝑝, both our sequential and semi-parallel designs have an interesting property: the number of cycles is independent from the size of the modulus.
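The area normalization used in this comparison can be reproduced directly. The sketch below recomputes 𝐿𝑈𝑇𝑒𝑞 and the ADP for the semi-parallel Virtex 6 entry at 256 bits; the ratio 𝑟 = 161 LUTs per DSP block is inferred from the Virtex 6 numbers in Tables 8 and 9 ((22 058 − 2738)/120), not stated in the text:

```python
def lut_eq(n_lut, n_dsp, r):
    """Unified area measure: LUT_eq = n_LUT + n_DSP * r."""
    return n_lut + n_dsp * r

def adp(n_lut, n_dsp, r, time_us):
    """Area-Delay Product: LUT_eq * time (time in microseconds)."""
    return lut_eq(n_lut, n_dsp, r) * time_us

# Semi-parallel architecture on Virtex 6, 256-bit modulus:
# 2738 LUT, 120 DSP blocks, 0.146 us (Table 8), with r = 161 (inferred).
print(lut_eq(2738, 120, 161))                  # → 22058
print(round(adp(2738, 120, 161, 0.146), 3))    # → 3220.468
```

These two values match the Virtex 6 semi-parallel row of Table 9, which supports the inferred ratio.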
7. Conclusion and future works

In this paper, we present fast hardware architectures that realize a modular multiplication over large prime characteristic finite fields F𝑝, using AMNS representations in order to speed up the modular multiplication. We propose two hardware architectures: a semi-parallel architecture and a sequential one. Our semi-parallel architecture provides the fastest implementations. As we propose the first hardware implementation of the modular multiplication in AMNS, we compare our results with existing modular multiplications: systolic CIOS, systolic architectures using carry-save or signed-digit representations for the intermediate results, and an architecture designed for RNS modular multiplication. The comparison of our results with the state of the art highlights that we propose the fastest implementation of modular multiplication for large field applications, with the highest throughput.

For future work, we plan to optimize the semi-parallel structure to reduce its area, and we would like to implement a full ECC scalar multiplication using the modular arithmetic in AMNS.

CRediT authorship contribution statement

Asma Chaouch: Conceptualization, Hardware, Writing - original draft. Laurent-Stéphane Didier: Visualization, Supervision, Conceptualization, Reviewing and editing. Fangan Yssouf Dosso: Methodology, Software, Writing, Data curation. Nadia El Mrabet: Validation, Project administration, Reviewing and editing. Belgacem Bouallegue: Investigation. Bouraoui Ouni: Supervision.

References

[1] Rivest RL, Shamir A, Adleman L. A method for obtaining digital signatures and public-key cryptosystems. Commun ACM 1978;21(2):120–6.
[2] Hankerson D, Menezes A. Elliptic curve cryptography. Springer; 2011.
[3] Menezes A, Van Oorschot PC, Vanstone SA. Handbook of applied cryptography. CRC Press; 1997.
[4] Barrett P. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In: Odlyzko AM, editor. Advances in cryptology. Berlin, Heidelberg: Springer; 1987, p. 311–23.
[5] Montgomery PL. Modular multiplication without trial division. Math Comp 1985;44(170):519–21.
[6] Taylor F. Large moduli multipliers for signal processing. IEEE Trans Circuits Syst 1981;28(7):731–6.
[7] Blakely GR. A computer algorithm for calculating the product AB modulo M. IEEE Trans Comput 1983;C-32(5):497–500. http://dx.doi.org/10.1109/TC.1983.1676262.
[8] Takagi N. A radix-4 modular multiplication hardware algorithm for modular exponentiation. IEEE Trans Comput 1992;41(8):949–56. http://dx.doi.org/10.1109/12.156537.
[9] Bajard J-C, Imbert L, Plantard T. Modular number systems: Beyond the Mersenne family. In: Selected areas in cryptography, 11th international workshop. 2004, p. 159–69.
[10] Didier L-S, Dosso FY, Véron P. Efficient modular operations using the Adapted Modular Number System. J Cryptogr Eng 2020;1–23. http://dx.doi.org/10.1007/s13389-019-00221-7.
[11] Chaouch A, Dosso YF, Didier L-S, El Mrabet N, Ouni B, Bouallegue B. Hardware optimization on FPGA for the modular multiplication in the AMNS representation. In: International conference on risks and security of internet and systems. Springer; 2019, p. 113–27.
[12] Stein W, et al. SageMath. 2018, http://www.sagemath.org/index.html.
[13] Plantard T. Arithmétique modulaire pour la cryptographie [Ph.D. thesis], France: Montpellier 2 University; 2005.
[14] Bajard J-C, Imbert L, Plantard T. Arithmetic operations in the polynomial modular number system. In: 17th IEEE symposium on computer arithmetic. 2005, p. 206–13. Extended (complete) version available at: https://hal-lirmm.ccsd.cnrs.fr/lirmm-00109201/document.
[15] Negre C, Plantard T. Efficient modular arithmetic in adapted modular number system using Lagrange representation. In: Information security and privacy, 13th Australasian conference. 2008, p. 463–77.
[16] El Mrabet N, Gama N. Efficient multiplication over extension fields. In: WAIFI. Lecture notes in computer science, vol. 7369, Springer; 2012, p. 136–51.
[17] Alrimeih H, Rakhmatov D. Fast and flexible hardware support for ECC over multiple standard prime fields. Trans Very Large Scale Integr (VLSI) Syst 2014;12(22):2661–74.
[18] Ors SB, Batina L, Preneel B, Vandewalle J. Hardware implementation of a Montgomery modular multiplier in a systolic array. In: Proceedings international parallel and distributed processing symposium. 2003, p. 8. http://dx.doi.org/10.1109/IPDPS.2003.1213341.
[19] Mrabet A, Mrabet NE, Lashermes R, Rigaud J-B, Bouallegue B, Mesnager S, Machhout M. A systolic hardware architectures of Montgomery modular multiplication for public key cryptosystems. IACR Cryptol ePrint Arch 2016;2016:487.
[20] Bigou K, Tisserand A. Single base modular multiplication for efficient hardware RNS implementations of ECC. In: 17th International workshop on cryptographic hardware and embedded systems, vol. 9293. Springer; 2015, p. 123–40.
[21] Kawamura S, Koike M, Sano F, Shimbo A. Cox-rower architecture for fast parallel Montgomery multiplication. In: International conference on the theory and applications of cryptographic techniques. Springer; 2000, p. 523–38.
[22] Rezai A, Keshavarzi P. High-throughput modular multiplication and exponentiation algorithms using multibit-scan–multibit-shift technique. IEEE Trans Very Large Scale Integr (VLSI) Syst 2014;23(9):1710–9.
[23] Rezai A, Keshavarzi P. High-performance scalable architecture for modular multiplication using a new digit-serial computation. Microelectron J 2016;55:169–78.
[24] Rezai A, Keshavarzi P. Compact SD: A new encoding algorithm and its application in multiplication. Int J Comput Math 2015;94(3):554–69.
[25] Rivest R, Shamir A, Adleman L. A method for obtaining digital signatures and public-key cryptosystems. Commun ACM 1978;21(2):120–6.