
This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSII.2017.2756680, IEEE
Transactions on Circuits and Systems II: Express Briefs

Low-Complexity Elliptic Curve Cryptography Processor based on Configurable Partial Modular Reduction over NIST Prime Fields

Piljoo Choi, Student Member, Mun-Kyu Lee, Member, IEEE, Ji-Hoon Kim, Member, IEEE, and Dong Kyue Kim, Member, IEEE

This work was supported in part by KETEP and MOTIE, Korea (No. 20161210200610) and in part by MSIT, Korea under the ITRC support program (IITP-2017-2012-0-00646) supervised by the IITP.
P. Choi and D. K. Kim are with the Department of Electronic Engineering, Hanyang University, Seoul 04763, Korea (e-mail: pjchoi@hanyang.ac.kr; dqkim@hanyang.ac.kr).
M.-K. Lee is with the Department of Computer Engineering, Inha University, Incheon 22212, Korea (e-mail: mklee@inha.ac.kr).
J.-H. Kim is with the Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul 01811, Korea (e-mail: jihoonkim@seoultech.ac.kr).

1549-7747 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract—We propose a high-performance elliptic curve cryptography (ECC) processor over NIST prime fields. Instead of applying a full modular reduction to a 2k-bit product, the proposed partial modular reduction method iteratively performs reductions on partial products whose bit length is slightly greater than k, where k is the bit length of field elements. As a result, the computational complexity of modular multiplication is significantly reduced. Moreover, the amount of computation is configurable by parameterizing the size of the partial products. This is a very desirable characteristic of the proposed ECC processor, because the hardware complexity and processing time of the entire ECC processor can be adjusted according to the requirements of various IoT environments. Including the proposed modular multiplication module, finite field operation modules are integrated into a single module to further reduce the required resources. The proposed ECC processor synthesized using 180-nm CMOS process technology can perform a 256-bit elliptic curve point multiplication in 0.20-0.74 ms with 144.8k-65.4k gate counts. These results and the experimental results in various FPGA devices show that the proposed ECC processor has significantly better throughput per area than the previously reported ones.

Index Terms—Elliptic curve cryptography (ECC), finite field, hardware implementation, partial modular reduction

I. INTRODUCTION

It is well known that elliptic curve cryptosystems (ECCs) [1], [2] provide relatively higher security with shorter keys [3]; for example, the security of a 256-bit ECC is equivalent to that of a 3072-bit RSA. Owing to this desirable property, ECCs are adopted for the security services in diverse internet of things (IoT) platforms such as oneM2M [4], NFC [5], WAVE [6], and the EVITA project [7]. These platforms use only elliptic curves over NIST prime fields [8], although two kinds of finite fields are frequently used for ECCs: binary field [9]-[11] and prime field [12]-[22]. Since the operational environment of the above applications includes extremely resource-constrained devices, a low-complexity implementation of an ECC satisfying the throughput requirements becomes a major challenge.

Accordingly, the hardware implementation is preferred over the software-based implementation. The performance of the hardware ECC is considerably affected by the method adopted to process finite field operations such as modular multiplication (MM), modular division (MD), and modular addition (MA). The efficient implementation of MM is the key part, as the complexity of MA is negligible and the most complicated operation, MD, can be replaced by multiple MMs using projective coordinates.

As a method for fast MM, the Montgomery multiplication is widely used [12]-[18]. The typical acceleration technique for the Montgomery system is to use a full-word multiplier, systolic array, residue number system (RNS), etc.; however, a full-word multiplier and a systolic array require considerable resources and/or very high clock frequency, and the conversions from/to the RNS cause additional overhead. Therefore, these acceleration techniques are not appropriate for resource-constrained environments. Fast reduction [23, pp. 44-46] enables efficient MM over NIST prime fields without the Montgomery multiplication [19]-[21]. However, fast reduction still requires considerable resources such as multipliers for k-bit × k-bit multiplication, storage for the 2k-bit product, and adders for modular reduction. Hence, a more compact and faster design of an MM method for NIST prime fields is necessary for low-complexity ECC processors on resource-constrained IoT platforms.

In this paper, we propose a low-complexity ECC processor with efficient MM over NIST prime fields. Our main contribution is the proposal of a novel modular reduction method called partial modular reduction. In the proposed reduction, a smaller (k + m)-bit partial product is considered instead of the 2k-bit product as in conventional modular reduction methods [19]-[21], where k is the bit length of the field elements and m ≪ k. This significantly lowers the computational complexity of MM. Moreover, the amount of computation becomes configurable by choosing m appropriately, so that the hardware complexity and the processing time of the entire ECC processor can be adjusted
according to the requirements of various IoT devices. We then propose a resource-sharing scheme among all finite field operations. The proposed processing module ModOp can perform MM, MD, and MA, thereby avoiding the huge hardware complexity.

II. PROPOSED MODOP MODULE OVER NIST PRIME FIELDS

In this section, the details of MM are explained, and then the overall structure of the ModOp module that performs MM, MD, and MA is shown.

A. Modular Multiplication Method in ModOp Module

1) Proposed MM method
It is well known that NIST primes guarantee fast modular reduction owing to their special structure [8]. Using a NIST prime, MM is usually done in two separate stages, k-bit × k-bit integer multiplication and modular reduction, which are shown in Fig. 1(a). As shown in Fig. 1(a), a 512-bit register is required to store ab, the intermediate result of the integer multiplication, and many modular adders are required to reduce the 512-bit result to a 256-bit value. To address this problem while retaining the desirable property of a NIST prime, our MM module adopts k-bit × m-bit multiplication instead, where m is selected from 4, 8, 16, and 30 according to the required performance. For each k-bit × m-bit partial product, partial reduction is performed. As a result, the final modular reduction stage can be eliminated.

Fig. 1. (a) Conventional integer multiplication and modular reduction and (b) the proposed partial product computation with partial modular reduction over $p_{256}$.

Let X be the intermediate result of MM. The ModOp module performs the accumulation of the partial product PP and the partial reduction at once during a single clock cycle as follows:

$$X = (X_{k-1:0} \ll m) + (R \ll m) + PP, \tag{1}$$

where $X_{a:b}$ denotes the a-th to b-th bits in X, and R is the number required to remove the most significant bits (MSBs) in X that go beyond the k-bit boundary. For example, for the NIST prime $p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$, the value $2^{256}$ can be replaced by $Y = 2^{224} - 2^{192} - 2^{96} + 1$. By using this fact, $x \cdot 2^{256}$ can be replaced by $R = x \cdot Y$. The details for this conversion will be explained in the next subsection. The process of calculating X using (1) is shown in Fig. 1(b). Although two more adders are required than those used in the conventional method, the entire complexity decreases due to the 237-bit shorter register as well as the elimination of modular adders. We remark that it would be more intuitive to use the iterative equation $X = ((X \ll m) + PP)_{k-1:0} + R$ instead of (1), where R is the appropriate number to remove the MSBs in $((X \ll m) + PP)$. However, this approach lengthens the critical path because R can be determined and added only after $((X \ll m) + PP)$ is available. On the other hand, all additions in (1) can be performed with a single carry save adder (CSA) tree.

Algorithm 1 shows the procedure of the proposed MM method based on (1) to reduce the critical path delay in hardware implementation. The intermediate value X in (1) is updated in a for loop and decomposed into two registers S and C, i.e., X = S + C. The MSBs of X are determined by the truncated S and C without loss of accuracy, and are stored in temporary variables T, C', and c. The Reduction function in line 6 computes R in (1) using T and c. The value of R is separated into two return values, $R_0$ and $R_1$, satisfying $R = R_0 + R_1$, to deal with negative numbers effectively. The details of the Reduction function will be provided in Section II-A-2. In line 7, $R_0$, $R_1$, and the partial products generated by the PP function are used to update S and C.

Algorithm 1 Proposed modular multiplication algorithm
Input: a, b ∈ [1, p − 1]
Output: a × b mod p
1: S ← 0; C ← 0;  // (k+m+3)-bit signed numbers
2: b' ← b ≪ (m + 1);  // (k+m+1)-bit number
3: for i ← w+1 to 0 do  // w is 64, 32, 16, and 8 for m = 4, 8, 16, and 30
4:     T ← S_{k+m+2:k} + C_{k+m+2:k};  // T: (m+3)-bit number
5:     {c, C'} ← S_{k−1:k−32} + C_{k−1:k−32};  // c: 1-bit carry, C': 32-bit number; {c, C'}: concatenation of c and C'
6:     (R_0, R_1) ← Reduction(T, c);
7:     (S, C) ← PP(a, b'_{(i+1)×m:i×m}) + (S_{k−33:0} ≪ m) + ({C', C_{k−33:0}} ≪ m) + (R_0 ≪ m) + (R_1 ≪ m);
8: end for
9: return (S_{k+m+1:m} + C_{k+m+1:m}) mod p

Partial product computation and partial modular reduction in lines 4-7 can be done during one clock cycle; hence, the for loop terminates in w + 2 cycles, where w = ⌊k/m⌋. Including one clock cycle for $(S_{k+m+1:m} + C_{k+m+1:m}) \bmod p$ in line 9, a total of w + 3 clock cycles are required for Algorithm 1.

The PP function and the summation in line 7 are designed by using a Booth encoder (BE) and a CSA tree, respectively, and the CSA tree consists of CSAs in parallel. Other more efficient methods can be applied if possible.
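The recurrence in (1) can be illustrated with a minimal software sketch, assuming Python integers in place of the carry-save registers S and C. The function name `partial_reduction_mul` and its digit loop are illustrative assumptions and not the authors' hardware: there is no Booth encoding, R is not split into $R_0$ and $R_1$, and the last step uses a plain `%` operator where Algorithm 1 performs its final modular addition. The sketch only shows that folding the bits above the k-bit boundary back with Y at every step keeps the running value short while still yielding a × b mod $p_{256}$.

```python
import random

K = 256
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
Y = 2**224 - 2**192 - 2**96 + 1   # 2^256 is congruent to Y modulo p256


def partial_reduction_mul(a, b, m):
    """Return a*b mod p256 using the recurrence of (1), one m-bit digit of b per step.

    Illustrative only; intended for m in {4, 8, 16, 30} as in the text.
    """
    low_mask = (1 << K) - 1
    digit_mask = (1 << m) - 1
    num_digits = -(-K // m)        # ceil(K / m) digits cover every bit of b
    x = 0
    for i in reversed(range(num_digits)):          # most significant digit first
        t = x >> K                                  # MSBs beyond the k-bit boundary
        r = t * Y                                   # R in (1)
        pp = a * ((b >> (i * m)) & digit_mask)      # k-bit x m-bit partial product
        x = ((x & low_mask) << m) + (r << m) + pp
        # The running value stays short; no 2k-bit product is ever formed.
        assert 0 <= x < (1 << (K + m + 2))
    return x % P256                # stand-in for the final modular step


# Small self-check against Python's built-in modular arithmetic.
_rng = random.Random(0)
for _m in (4, 8, 16, 30):
    _a = _rng.randrange(1, P256)
    _b = _rng.randrange(1, P256)
    assert partial_reduction_mul(_a, _b, _m) == (_a * _b) % P256
```

For m = 30 the sketch runs nine iterations for a 256-bit operand. The assertion inside the loop checks that the running value stays below $2^{k+m+2}$, which is in line with the (k+m+3)-bit signed registers stated in Algorithm 1.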


The reason why the largest value of m was set to 30 instead of 32 is that a smaller m requires fewer adders while the numbers of the required clock cycles for m = 30 and m = 32 are the same for k = 192, 224, 256, and 384. Another reason is that when m = 32, the length of T in line 6 of Algorithm 1 becomes larger than 32 bits even excluding the sign bit, complicating the calculation of the Reduction function.

2) Reduction function
The calculation method of $R_0$ and $R_1$ for k = 256 is presented. We begin with the case c = 0. First, it is easy to see that

$$T \cdot 2^{256} \equiv T \cdot 2^{224} - T \cdot 2^{192} - T \cdot 2^{96} + T \pmod{p_{256}}. \tag{2}$$

To deal with T with various lengths, a new 32-bit binary representation is defined as x. When m < 30, i.e., m = 4, 8, or 16, x is set as T expanded to 32 bits. For sign extension of a signed number T, its expanded bits in x are filled with its sign bit, s. On the other hand, when m = 30, x is filled with the lower 32 bits of the 33-bit-long T, and the sign bit s is dealt with separately. When T should be expanded to a binary representation with more than 32 bits, say, an l-bit representation T', it can be expressed as follows by using s and x,

$$T' = s(2^{l-32} - 1) \cdot 2^{32} + x, \tag{3}$$

by filling the upper (l − 32) bits with s. The 256-bit representation of the right-hand side of (2) by using (3) is shown in Fig. 2. The negative signs of $-T \cdot 2^{192}$ and $-T \cdot 2^{96}$ in (2) are expressed using two's complements in Fig. 2(b), where $\bar{x}$ is the one's complement of x. It is easy to see that the light gray part in Fig. 2(b) can be canceled to 0 by ignoring an overflow. By applying the fact that $s(2^{96} - 1) + 1 = s \cdot 2^{96} + (1 - s)$ to the dark gray part in Fig. 2(b) and reorganizing the four terms into two terms,

$$R_0 = x \cdot 2^{224} + \bar{x} \cdot 2^{192} + \bar{x} \cdot 2^{96} + x \tag{4}$$

$$R_1 = \bar{s}(2^{32} - 1) \cdot 2^{224} + s \cdot 2^{192} + \bar{s}(2^{64} - 1) \cdot 2^{128} + \bar{s} \cdot 2^{96} + s(2^{64} - 1) \cdot 2^{32}, \tag{5}$$

are obtained as shown in Fig. 2(c). Equations (4) and (5) can be directly used for c = 0. When c = 1, $2^{256} - p_{256}$ is added to $R_1$.

Fig. 2. 256-bit representation in the right-hand side of (2): (a) without negative signs, (b) with negative signs, and (c) its reorganized expression.

For the other NIST primes, it is sufficient to rewrite (2) properly. $R_0$ and $R_1$ for the other NIST primes can be calculated in the same way. $R_0$ and $R_1$ for three NIST primes, $p_{256}$, $p_{224}$, and $p_{192}$, are shown in Table I. Note that only bitwise inversions to calculate the one's complement are required to obtain $R_0$ and $R_1$, and the ECC processors for different NIST primes have the same structure with only different $R_0$ and $R_1$.

TABLE I
EACH 32-BIT VALUE OF $R_0$ AND $R_1$ FOR NIST PRIMES $p_{256}$, $p_{224}$, AND $p_{192}$

| Prime | Term | Word 7 | Word 6 | Word 5 | Word 4 | Word 3 | Word 2 | Word 1 | Word 0 |
|---|---|---|---|---|---|---|---|---|---|
| $p_{256}$ | $R_0$ | $x$ | $\bar{x}$ | | | $\bar{x}$ | | | $x$ |
| $p_{256}$ | $R_1$ (c = 0) | $\bar{s}\bar{s}$ | $s$ | $\bar{s}\bar{s}$ | $\bar{s}\bar{s}$ | $\bar{s}$ | $ss$ | $ss$ | |
| $p_{256}$ | $R_1$ (c = 1) | $\bar{s}\bar{s}$ | 1s | 1s | 1s | $ss$ | $ss$ | $ss$ | 1 |
| $p_{224}$ | $R_0$ | - | | | | $x$ | | | $\bar{x}$ |
| $p_{224}$ | $R_1$ (c = 0) | - | 1s | 1s | 1s | $\bar{s}\bar{s}$ | $\bar{s}\bar{s}$ | $\bar{s}\bar{s}$ | 1 |
| $p_{224}$ | $R_1$ (c = 1) | - | $ss$ | $ss$ | $ss$ | $s$ | $\bar{s}\bar{s}$ | $\bar{s}\bar{s}$ | |
| $p_{192}$ | $R_0$ | - | - | | | | $x$ | | $x$ |
| $p_{192}$ | $R_1$ (c = 0) | - | - | $ss$ | $ss$ | $s0$ | $ss$ | $ss$ | |
| $p_{192}$ | $R_1$ (c = 1) | - | - | $ss$ | $ss$ | $ss$ | $\bar{s}$ | | 1 |

1s: 32-bit data filled with one.
$ss$: 32-bit data filled with s.
$s0$: 32-bit data, their LSB is zero and the remaining bits are filled with s.

B. Structure of ModOp Module

Fig. 3 shows the overall structure of our hardware ModOp module. The ModOp module is composed of various arithmetic components and six registers as well as control logic, which is not explicitly shown in Fig. 3. Register p stores a NIST prime $p_k$ or its related values. S, C, U, and V are the shared registers used for all operations. We remark that the lower m bits of the registers S and C are used only in MM. M is an (m+1)-bit register used for multiplication. Among arithmetic components, $A_0$ and $A_1$ are general purpose adders used for all operations. A BE, a CSA tree, bit organizers for modular reduction (BO1 and BO2), and two small adders are used for MM.

Fig. 3. Datapaths of ModOp that performs $a \pm b \bmod p_k$, $ab^{-1} \bmod p_k$, and $ab \bmod p_k$.

TABLE II
INITIAL VALUES OF REGISTER U, V, S, AND C, AND OPERATIONS PERFORMED ON $A_0$ AND $A_1$

| Operation | U | V | S | C | $A_0$ | $A_1$ |
|---|---|---|---|---|---|---|
| MA | | | a | ±b | S + C ∓ $p_k$ | S + C |
| MD | −$p_k$ | b | a | 0 | S + C + $p^a$ | U + V |
| MM | a | b | 0 | 0 | S + C ± $p_k$ | S + C |

a: one of 0, $p_k$, −$p_k$, and 2$p_k$, which is stored in register p in Fig. 3.

TABLE III
NUMBER OF OPERATIONS REQUIRED FOR ECPD, ECPA, AND ECPM

| | Jacobian | Affine |
|---|---|---|
| ECPD | 8MM + 10MA | 3MM + 8MA + MD |
| ECPA | 11MM + 8MA | 2MM + 7MA + MD |
| ECPM | (35k/3 + 4)MM + (38k/3)MA + MD | (11k/3)MM + (31k/3)MA + (4k/3)MD |

As shown in Fig. 3, the modules for MM, MD, and MA operations were not separately designed, but these operations share resources, significantly reducing the area of our ECC processor without sacrificing performance. This sharing was possible by cleverly utilizing the shared registers, S, C, U, and V as well as adders $A_0$ and $A_1$.
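As an aside on the Reduction function of Section II-A-2, the c = 0 expressions (4) and (5) can be checked numerically. The following is a rough sketch under stated assumptions: the helper name `reduction_c0` is hypothetical, T is modeled as a Python integer in the range $[-2^{32}, 2^{32})$, and the carry case c = 1 is not covered. In this model, $R_0 + R_1$ equals $T \cdot Y$ plus a single carry-out of $2^{256}$, which appears consistent with the remark that an overflow is ignored when the terms are reorganized.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
Y = 2**224 - 2**192 - 2**96 + 1   # 2^256 is congruent to Y modulo p256
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def reduction_c0(t):
    """Return (R0, R1) following (4) and (5) for a signed t, case c = 0.

    Assumes -2**32 <= t < 2**32; the name and interface are illustrative.
    """
    s = 1 if t < 0 else 0          # sign bit of T
    ns = 1 - s                     # complement of the sign bit
    x = t & MASK32                 # lower 32 bits of T in two's complement
    nx = x ^ MASK32                # one's complement of x
    r0 = (x << 224) + (nx << 192) + (nx << 96) + x
    r1 = (
        ((ns * MASK32) << 224)
        + (s << 192)
        + ((ns * MASK64) << 128)
        + (ns << 96)
        + ((s * MASK64) << 32)
    )
    return r0, r1


for t in (-(1 << 32), -98765, -1, 0, 1, 12345, (1 << 32) - 1):
    r0, r1 = reduction_c0(t)
    # R0 + R1 matches T*Y once the carry out of bit 255 is dropped ...
    assert r0 + r1 == t * Y + (1 << 256)
    # ... and T*Y is congruent to T * 2^256 modulo p256, as in (2).
    assert (r0 + r1 - (1 << 256)) % P256 == (t << 256) % P256
```

Only bitwise inversions and fixed shifts appear in `reduction_c0`, which mirrors the observation in the text that no multiplier is needed to form $R_0$ and $R_1$.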
Table II shows the initial values and operations of these registers and adders for MM, MD, and MA. When MA is performed, the inputs a and b are stored in the registers S and C during the first clock cycle, and then $S \pm C \bmod p_k$ is performed using the adders $A_0$ and $A_1$ during the next clock cycle. Because of pipelining, each MA consumes only one clock cycle on average. For MD, $ab^{-1} \bmod p_k$, the inputs a and b are stored in the registers S and V during the first clock cycle. Our MD is based on the variant of the right-shift binary inverse (RS) algorithm [24]. When both the values in the registers U and V are odd, they are added using the adder $A_1$, and the values in the registers S and C are added using the adder $A_0$. When MM is performed, the inputs a and b are stored in the registers U and V, and then ab mod p is calculated using the BE and the CSA tree shown in Fig. 3. The temporary results are stored in the registers S and C, and the two adders $A_0$ and $A_1$ perform MA during the last clock cycle of MM.

III. IMPLEMENTATION RESULTS

Using the ModOp module, ECC processors with m = 4, 8, 16, and 30 that perform elliptic curve point multiplication (ECPM) were designed. Two typical choices for coordinates, i.e., Jacobian projective and affine coordinates, are considered. Our ECC processors over Jacobian and affine coordinates are called ECC-Jacob and ECC-Affine, respectively. In this section, their performance for the NIST prime $p_{256}$ is analyzed.

1) Required clock cycles for ECPM
ECPM is performed by repeating elliptic curve point doublings (ECPDs) and elliptic curve point additions (ECPAs). A common practice in the literature is to represent a scalar as a non-adjacent form, which requires k ECPDs and k/3 ECPAs on average [23, p. 92]. The expected numbers of modular operations for ECPD, ECPA, and ECPM are shown in Table III, including additional operations for returning back to affine coordinates when Jacobian coordinates are used. The estimated and measured numbers of clock cycles required for one ECPM are shown in Table IV, and their difference was only about 1%.

2) Processing time and area
ECC processors were synthesized using 180-nm CMOS process technology. The results synthesized at the same frequency, 100 MHz, for fair comparison are shown in Table IV. As shown in Table IV, our fastest design, ECC-Jacob with m = 30, consumes only 0.37 ms for one ECPM. Our smallest design, ECC-Affine with m = 4, occupies only 53.5 k gates and consumes 1.48 ms.

TABLE IV
REQUIRED CLOCK CYCLES, PROCESSING TIME (MS), AND AREA (GE)

| Design | m | Expected T(ECPM)^a | Measured @ 100 MHz: T(ECPM)^b | Measured @ 100 MHz: Proc. time | Measured @ 100 MHz: Area |
|---|---|---|---|---|---|
| ECPM-Jacob | 30 | 36.4 k | 36.8 k | 0.37 | 110.1 k |
| ECPM-Jacob | 16 | 60.3 k | 60.5 k | 0.61 | 85.8 k |
| ECPM-Jacob | 8 | 108.1 k | 108.1 k | 1.08 | 73.8 k |
| ECPM-Jacob | 4 | 203.8 k | 203.2 k | 2.03 | 66.7 k |
| ECPM-Affine | 30 | 95.1 k | 95.5 k | 0.96 | 95.8 k |
| ECPM-Affine | 16 | 102.6 k | 102.9 k | 1.03 | 72.7 k |
| ECPM-Affine | 8 | 117.6 k | 117.9 k | 1.18 | 60.4 k |
| ECPM-Affine | 4 | 147.7 k | 147.7 k | 1.48 | 53.5 k |

a: calculated using T(MM) = ⌊k/m⌋ + 3, T(MD) = 0.94k [20], and T(MA) = 1, where T(Y) is the number of clock cycles required for operation Y.
b: obtained from 10000 samples.

3) Performance Comparison
For comparison with previous research, our fastest and smallest designs were synthesized in various Xilinx FPGA devices: XC6VLX760 with speed –2 (Virtex 6), XC4VLX60 with speed –10 (Virtex 4), and XC2VP100 with speed –6 (Virtex 2 pro). ISE 9.2 for Virtex 2 pro and ISE 14.2 for Virtex 4 and Virtex 6 were used for implementation since Virtex 2 pro is not supported by ISE 14.2. When implemented in an FPGA, k-bit × m-bit multiplication and some additions are designed using 16-bit × m-bit LUT-based multipliers generated by Xilinx CORE generator and using CSAs and carry lookahead adders, respectively. As for ASIC designs, the results were measured at the maximum operating frequency as in previous research.

TABLE V
PERFORMANCE COMPARISON WITH PREVIOUS DESIGNS

| Device | Design | Proc. Time @ Max. Freq. (ms @ MHz) | Area | N(ρ)^a |
|---|---|---|---|---|
| Virtex 2 pro (LUTs) | This work^{I, S} | 0.61 @ 59.2 | 22.6 k | 1.00 |
| Virtex 2 pro (LUTs) | This work^{II, S} | 2.78 @ 53.1 | 12.1 k | 0.41 |
| Virtex 2 pro (LUTs) | TVLSI'08 [13]^{b, S} | 2.66 @ 94.7 | 83.2 k | 0.06 |
| Virtex 2 pro (LUTs) | TVLSI'13 [14]^{I, X} | 0.59 @ 50.2 | 28.7 k | 0.83 |
| Virtex 2 pro (LUTs) | TVLSI'13 [14]^{II, X} | 2.62 @ 50.2 | 18.9 k | 0.28 |
| Virtex 2 pro (LUTs) | TVLSI'16 [19]^{X} | 3.81 @ 95 | 48.8 k | 0.08 |
| Virtex 4 (Slices) | This work^{I, S} | 0.51 @ 71.6 | 12.8 k | 1.00 |
| Virtex 4 (Slices) | This work^{II, S} | 2.94 @ 50.2 | 6.1 k | 0.37 |
| Virtex 4 (Slices) | MM'16 [22]^{PnR} | 3.91 @ 49 | 20.6 k | 0.12 |
| Virtex 6 (LUTs) | This work^{I, S} | 0.30 @ 121.6 | 18.6 k | 1.00 |
| Virtex 6 (LUTs) | This work^{II, S} | 1.18 @ 125.1 | 8.1 k | 0.59 |
| ASIC (Gates) | This work^{I, 180, S} | 0.20 @ 185 | 144.8 k | 1.00 |
| ASIC (Gates) | This work^{II, 180, S} | 0.74 @ 200 | 65.4 k | 0.59 |
| ASIC (Gates) | TCAS-II'07 [15]^{130, S} | 1.01 @ 556 | 122 k | 0.23 |
| ASIC (Gates) | ISCAS'12 [16]^{90, S} | 0.12 @ 185 | 540 k | 0.44 |

I: Case I: our fastest design or fast design of [14].
II: Case II: our smallest design or lower area design of [14].
a: ρ: throughput per area, i.e., 1/(proc. time × area), and N(ρ): normalized ρ, i.e., (ρ of each design)/(ρ of our fastest design in each device).
b: Support ECC over dual-field.
180, 130, and 90: 180, 130, and 90 nm CMOS technology, respectively.
S, PnR, and X: synthesis, place & route, and not specified, respectively.

Table V shows the performance of our designs and the designs reported in previous research. Among all implementations, only the fast design in [14] and the design in [16] were 1.03 and 1.67 times faster than our fastest design, but their areas were 1.27 times and 3.73 times larger, respectively. Moreover, the throughput per area, ρ, of our design was
significantly better than all previous results. Although the design reported in [13] supports ECC over dual-field, its ρ was much lower. As for our smallest design, its area was the smallest among all designs in Table V. Moreover, there is no design with higher ρ except our fastest design and the fast design of [14].

On the other hand, there are also previous results that use additional FPGA components. Their performance is shown in Table VI, separately, for reference, although it is not possible to directly compare it with ours. We note that the additional components were not counted when N(ρ) in Table VI were calculated. Thus, while the design in [20] has N(ρ) > 1, this should not be interpreted to mean that its performance is better than those of our designs. Moreover, the storage space of its BRAM blocks, which is 198 k bits in total, is very large, and its 6.8-times higher operating frequency than that of our fastest design in V4 may cause large dynamic power consumption. Except for the design in [20], all designs in Table VI have N(ρ) < 1 even when additional components are not counted. Although the design reported in [21] supports ECC over all NIST prime fields, its 289 DSP slices and 128 RAMB36 blocks may occupy a vast area if implemented in ASIC.

TABLE VI
PERFORMANCE OF PREVIOUS DESIGNS WITH ADDITIONAL COMPONENTS

| Design | Device | Proc. Time | Max. Freq. | Area | N(ρ)^a | Additional information |
|---|---|---|---|---|---|---|
| TCAS-I'06 [17]^{PnR} | Virtex 2 pro | 3.86 ms | 39.5 MHz | 31.6 k LUTs + 256 DSP | 0.11 | |
| CHES'08 [20]^{PnR} | Virtex 4 | 0.50 ms | 490 MHz | 1.7 k slices + 32 DSP + 11 BRAM | 7.71 | One BRAM block can store 18 k bits |
| CHES'10 [18]^{X} | Stratix II | 0.68 ms | 157.2 MHz | 9177 ALMs + 96 DSP | 0.80^a | |
| TVLSI'14 [21]^{X} | Virtex 6 | 0.4 ms | 100 MHz | 32.9 k LUTs + 289 DSP + 128 RAMB | 0.42 | ECC over all NIST prime fields |

a: Normalized using ρ of our fastest design in V4 (9177 ALMs in Stratix II are considered as 9177 × 1.3 = 11.9 k slices in V4 according to [25]).

IV. CONCLUSION

We proposed a novel ECC processor over NIST prime fields. Using partial modular reduction, our processor can perform modular multiplication considerably fast, and the required area is minimized by resource sharing even without using additional components, such as DSP slices and BRAM blocks. When compared with previous designs without any additional components, our designs have the best throughput per area. A direct performance comparison with the designs with additional components was not possible. However, considering the large amount of those components, it can be conjectured that our design may have better performance. We remark that although the performance can become less efficient after place & route, it mainly depends on the effort for physical design, and the main contribution of this work is still effective. In addition, the proposed ECC processor has a desirable feature that the amount of consumed resource and processing speed can be adjusted by selecting an appropriate m for the target application.

REFERENCES

[1] V. S. Miller, “Use of elliptic curves in cryptography,” in Proc. Adv. Cryptology (Crypto), pp. 417-426, 1985.
[2] N. Koblitz, “Elliptic curve cryptosystems,” Math. of Comput., vol. 48, pp. 203-209, 1987.
[3] E. Barker, W. Barker, W. Burr, W. Polk, and M. Smid, “Recommendation for Key Management,” Special Publication 800-57 Part 1, Revision 3, NIST, 2011.
[4] oneM2M, “Security Solutions,” oneM2M Tech. Specification TS-0003-v2.4.1, Nov. 2016.
[5] ISO/IEC 13157-3, “Information technology – Telecommunications and information exchange between systems – NFC Security – Part 3: NFCSEC cryptography standard using ECDH-256 and AES,” ISO/IEC, Apr. 2016.
[6] IEEE Standard for Wireless Access in Vehicular Environments—Security Services for Applications and Management Messages, IEEE Standard 1609.2, 2013.
[7] B. Weyl, M. Wolf, F. Zweers, T. Gendrullis, M. S. Idrees, Y. Roudier, et al., “Secure on-board architecture specification,” EVITA Project, Tech. Rep. Deliverable D3.2, 2010.
[8] National Institute of Standards and Technology, FIPS 186-2: Digital signature standard (DSS). Gaithersburg, MD, USA: NIST, 2000.
[9] S. Bayat-Sarmadi and M. Farmani, “High-throughput low-complexity systolic Montgomery multiplication over GF(2^m) based on trinomials,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 62, pp. 377-381, 2015.
[10] M. Benaissa, “Throughput/area-efficient ECC processor using Montgomery point multiplication on FPGA,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 62, pp. 1078-1082, 2015.
[11] R. Azarderakhsh and A. Reyhani-Masoleh, “High-performance implementation of point multiplication on Koblitz curves,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 60, pp. 41-45, 2013.
[12] S.-R. Kuang, C.-Y. Liang, and C.-C. Chen, “An efficient radix-4 scalable architecture for Montgomery modular multiplication,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 63, pp. 568-572, 2016.
[13] J.-Y. Lai and C.-T. Huang, “Elixir: High-throughput cost-effective dual-field processors and the design framework for elliptic curve cryptography,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 11, pp. 1567-1580, Nov. 2008.
[14] M. Esmaeildoust, D. Schinianakis, H. Javashi, T. Stouraitis, and K. Navi, “Efficient RNS implementation of elliptic curve point multiplication over GF(p),” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 8, pp. 1545-1549, Aug. 2013.
[15] G. Chen, G. Bai, and H. Chen, “A high-performance elliptic curve cryptographic processor for general curves over GF(p) based on a systolic arithmetic unit,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 54, no. 5, pp. 412-416, May 2007.
[16] S.-C. Chung, J.-W. Lee, H.-C. Chang, and C.-Y. Lee, “A high-performance elliptic curve cryptographic processor over GF(p) with SPA resistance,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1456-1459, May 2012.
[17] C. J. McIvor, M. McLoone, and J. V. McCanny, “Hardware elliptic curve cryptographic processor over GF(p),” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 9, pp. 1946-1957, Sep. 2006.
[18] N. Guillermin, “A high speed coprocessor for elliptic curve scalar multiplications over Fp,” in Proc. 12th Int. Workshop Cryptogr. Hardw. Embedded Syst. (CHES), pp. 48-64, 2010.
[19] H. Marzouqi, M. Al-Qutayri, K. Salah, D. Schinianakis, and T. Stouraitis, “A high-speed FPGA implementation of an RSD-based ECC processor,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 1, pp. 151-164, Jan. 2016.
[20] T. Güneysu and C. Paar, “Ultra high performance ECC over NIST primes on commercial FPGAs,” in Proc. 10th Int. Workshop Cryptogr. Hardw. Embedded Syst. (CHES), pp. 62-78, 2008.
[21] H. Alrimeih and D. Rakhmatov, “Fast and flexible hardware support for ECC over multiple standard prime fields,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 12, pp. 2661-2674, Dec. 2014.
[22] K. Javeed, X. Wang, and M. Scott, “High performance hardware support for elliptic curve cryptography over general prime field,” Microprocessors and Microsystems, Dec. 2016.
[23] D. Hankerson, A. J. Menezes, and S. Vanstone, Guide to elliptic curve cryptography. New York, NY, USA: Springer-Verlag, 2004.
[24] P. Choi, M.-K. Lee, J.-T. Kong, and D. K. Kim, “Efficient design and performance analysis of a hardware right-shift binary modular inversion algorithm in GF(p),” J. Semicond. Technol. Sci., vol. 17, no. 3, pp. 425-437, Jun. 2017.
[25] FPGA Logic Cell Comparison, 1-CORE Technologies. [Online]. Available: http://ee.sharif.edu/~asic/Docs/fpga-logic-cells_V4_V5.pdf.
