Abstract: This paper proposes an efficient pipelined architecture of elliptic curve scalar multiplication (ECSM) over GF($2^m$). The architecture uses a bit-parallel finite-field (FF) multiplier accumulator (MAC) based on the Karatsuba–Ofman algorithm. The Montgomery ladder algorithm is modified for better sharing of execution paths. The data path in the architecture is well designed, so that the critical path contains few extra logic primitives apart from the FF MAC. In order to find the optimal number of pipeline stages, scheduling schemes with different pipeline stages are proposed and the ideal placement of pipeline registers is thoroughly analyzed. We implement ECSM over the five binary fields recommended by the National Institute of Standards and Technology on Xilinx Virtex-4 and Virtex-5 field-programmable gate arrays. The three-stage pipelined architecture is shown to have the best performance, which achieves a scalar multiplication over GF($2^{163}$) in 6.1 µs using 7354 Slices on Virtex-4. Using Virtex-5, the scalar multiplication for m = 163, 233, 283, 409, and 571 can be achieved in 4.6, 7.9, 10.9, 19.4, and 36.5 µs, respectively, which are faster than previous results.

Index Terms: Elliptic curve cryptography (ECC), field-programmable gate array (FPGA), Karatsuba–Ofman multiplier (KOM), pipelining, scalar multiplication.

Manuscript received April 5, 2015; revised June 3, 2015; accepted July 3, 2015. Date of publication August 13, 2015; date of current version March 18, 2016. This work was supported by the Tsinghua National Laboratory for Information Science and Technology.
The authors are with the Tsinghua National Laboratory for Information Science and Technology, Institute of Microelectronics, Tsinghua University, Beijing 100084, China (e-mail: li-lj11@mails.tsinghua.edu.cn; lisg@tsinghua.edu.cn).
Digital Object Identifier 10.1109/TVLSI.2015.2453360

I. INTRODUCTION

Elliptic curve cryptography (ECC) is a public-key cryptographic algorithm proposed by Neal Koblitz and Victor Miller in 1985. Compared with Rivest–Shamir–Adleman (RSA), ECC can offer the equivalent level of security with a much shorter key length [1], which results in higher speed and smaller overhead in area, power, storage space, and so on. ECC over binary field GF($2^m$) is especially attractive for its carry-free property [2]. However, for resource-constrained platforms such as field-programmable gate arrays (FPGAs), it remains a challenge to achieve high-performance as well as area-efficient designs of ECC.

Elliptic curve scalar multiplication (ECSM) is the key operation, which dominates the performance of an ECC cryptosystem. Various architectures [3]–[24] have been proposed to speed up ECSM. Most of them explore pipelining and parallelism to improve the working frequency and to reduce the required number of clock cycles in ECSM. Leong and Leung [3] developed a microcoded elliptic curve processor, supporting ECSM over GF($2^m$) for arbitrary m. Sakiyama et al. [7] proposed a superscalar coprocessor and accelerated ECSM by exploiting instruction-level parallelism (ILP) dynamically. In [9], a pipelined application-specific instruction set processor for ECC was proposed, which performed ECSM over GF($2^{163}$) in 19.55 µs on Xilinx XC4VLX200. Designs [10]–[13] implemented high-speed scalar multiplication over a special class of curves, such as Koblitz curves, binary Edwards curves, and Hessian curves. In this paper, we focus on optimizing ECSM over generic curves in GF($2^m$).

Some designs [14]–[16] duplicate arithmetic blocks to maximize the parallelism in ECSM. For GF($2^{163}$), Kim et al. [14] used three Gaussian normal basis multipliers to achieve ECSM in 10 µs on Xilinx XC4VLX80. Zhang et al. [15] developed three finite-field (FF) cores and a main controller to achieve ECSM in 7.7 µs on Xilinx XC4VLX80. The best design in [16] performed ECSM in 5.5 µs on Xilinx Virtex-5 using three digit-serial FF multipliers and one FF divider. Despite high speed, these designs require massive logic resources, and thus, they are not practical for FPGA implementation.

Considering the tradeoff between area and speed, many designs [16]–[20] use word-serial or digit-serial FF multipliers to implement ECSM. These designs usually require a large number of clock cycles for a scalar multiplication. Ansari and Hasan [19] proposed an efficient scheme, which kept the pseudopipelined word-serial FF multiplier working without idle cycles. A scalar multiplication over GF($2^{163}$) costs 4050 clock cycles and 21 µs on Xilinx XC4VLX200. In [20], FF multipliers with different word sizes (w) were developed, and the best design with w = 55 performed ECSM over GF($2^{163}$) in 2751 clock cycles and 9.6 µs on Xilinx XC4VLX200.

Designs [21]–[24] exploit a bit-parallel Karatsuba–Ofman multiplier (KOM) [25] to achieve high-performance ECSM. The architectures in [22] and [23] consisted of a pipelined KOM and a Powerblock. The Powerblock was a cascade of several quads or $2^4$ circuits to accelerate inversion. A theoretical model was presented to estimate the delay of different primitives. However, such architectures introduced many redundant primitives into the critical path apart from the FF multiplier, which resulted in longer critical path delay and lower clock frequency. A scalar multiplication over GF($2^{163}$) was achieved in 8.6 µs on Xilinx XC5VLX85t. Liu et al. [24] used a KOM running with no idle cycle and performed ECSM in 9 µs on XC4VLX200 for GF($2^{163}$).

In this paper, we use a pipelined bit-parallel FF multiplier accumulator (MAC) based on KOM, along with one
1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Central Electronics Engineering Research Institute. Downloaded on May 08,2023 at 07:11:22 UTC from IEEE Xplore. Restrictions apply.
1224 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016
FF squarer, to implement ECSM. Different from [22]–[24], there are only two extra primitives [an addition and a 4:1 multiplexer (MUX)] added to the critical path apart from the FF MAC. Various pipeline stages are applied to the data path to find the best tradeoff between the working frequency and the required number of clock cycles in ECSM.

The rest of this paper is organized as follows. Section II introduces some basic knowledge about the binary FF and ECSM. In Section III, scheduling schemes based on the improved Montgomery ladder algorithm are presented. The proposed ECSM architecture along with a detailed data path is given in Section IV. Hardware implementation, delay estimation, and pipeline analysis are given in Section V. Section VI shows the performance analysis and comparison. Finally, the conclusion is drawn in Section VII.

Algorithm 1 Montgomery Ladder Scalar Multiplication Over GF($2^m$)
II. PRELIMINARIES

A. Finite Field

Elliptic curves are defined over FFs. The most commonly used FFs are prime fields GF($p$) and binary fields GF($2^m$). Both can provide the same level of security. Because arithmetic in GF($2^m$) is carry-free, ECC over GF($2^m$) is more efficient for hardware implementation. Elements in GF($2^m$) have many basis representations, among which polynomial basis and normal basis are the most commonly used ones. Normal basis has cheaper FF square but much more complex FF multiplication than polynomial basis. Here, we consider ECC over GF($2^m$) with polynomial basis.

In polynomial basis, elements in GF($2^m$) are represented as binary polynomials with degrees less than m, i.e., $a(x) = \sum_{i=0}^{m-1} a_i x^i$, $a_i \in \{0, 1\}$. Arithmetic operations in GF($2^m$) [26], [27] are computed over an irreducible polynomial $f(x)$ with degree m. Addition $a(x) + b(x)$ in GF($2^m$) is bitwise EXCLUSIVE OR (XOR) between coefficients of two polynomials. Multiplication $a(x) \cdot b(x)$ in GF($2^m$) is more involved and is performed in two steps: 1) polynomial multiplication and 2) reduction modulo $f(x)$. Square $a^2(x)$ is cheaper than multiplication, because the polynomial multiplication in square simply inserts zeros at the odd bit positions. Inversion $a^{-1}(x)$, which is the most complex operation in GF($2^m$), is to find a polynomial $b(x)$ that satisfies $a(x) \cdot b(x) = 1$ for a given $a(x)$. It can be computed using either the extended Euclidean algorithm or Fermat's little theorem.

The selection of irreducible polynomials has a great impact on the space and time complexities of modular reduction. Sparse irreducible polynomials can give considerable computational advantages. Hence, polynomials with three (trinomials) or five (pentanomials) nonzero terms are preferred for high performance. This paper considers generic elliptic curves with irreducible polynomials recommended by the National Institute of Standards and Technology (NIST) [28].

B. Elliptic Curve Scalar Multiplication

A nonsupersingular elliptic curve over GF($2^m$) is represented as follows [29]:

$$E: y^2 + xy = x^3 + ax^2 + b.$$

Points on elliptic curve E, along with the point at infinity, form a commutative finite group under point addition and doubling. Given a base point P on the curve, and a positive integer k, computing

$$Q = kP = \underbrace{P + P + \cdots + P}_{k\ \text{times}}$$

is called scalar multiplication. The result Q is another point on the elliptic curve.

Scalar multiplication is performed by repeated point addition and doubling, which essentially rely on a series of FF arithmetic operations, such as multiplication, square, addition, and inversion. Point addition and doubling in affine coordinates involve inversion each time. As inversion is the most complex and time-consuming operation in FF, it is more practical to represent curve points using projective coordinates. In this way, only one inversion is needed for the entire scalar multiplication, at the expense of more of the other FF operations.

At present, the most efficient algorithm of scalar multiplication over GF($2^m$) is the Montgomery ladder algorithm using the López–Dahab (LD) projective coordinate [30]. As shown in Algorithm 1, the Montgomery ladder algorithm computes both point doubling and point addition for every bit of k (denoted as $k_i$). The algorithm is efficient, because only the x-coordinate (X and Z) is computed and the y-coordinate is recovered in the end. Moreover, because each iteration of the main loop performs the same operations regardless of whether $k_i$ is 0 or 1, the Montgomery method has an inherent resistance to side-channel attacks.

III. RESCHEDULING AND ENHANCEMENT OF ELLIPTIC CURVE SCALAR MULTIPLICATION

A. Modified Montgomery Ladder Scalar Multiplication

The Montgomery ladder algorithm consists of three stages: 1) initialization: converting from affine coordinate to
LI AND LI: HIGH-PERFORMANCE PIPELINED ARCHITECTURE OF ECSM OVER GF(2m ) 1225
Fig. 1. Data dependence graph of (a) point addition and (b) point doubling
in the Montgomery ladder algorithm.
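The point addition and point doubling of the Montgomery ladder described in Section II-B can be illustrated in software. The following is a minimal Python sketch, not the scheduled hardware data path of this paper: it uses the standard x-only López–Dahab formulas, represents field elements as Python integers (bit i is the coefficient of $x^i$), computes the final inversion with plain Fermat exponentiation, and omits the y-coordinate recovery. The names `gf_mul`, `gf_inv`, `ladder_x`, and `F163` are hypothetical helpers introduced only for this illustration; `F163` encodes the NIST reduction pentanomial $x^{163} + x^7 + x^6 + x^3 + 1$.

```python
# Illustrative sketch only: x-only Montgomery ladder over GF(2^m) in LD
# projective coordinates. Field elements are ints; bit i <-> coefficient of x^i.

F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # x^163 + x^7 + x^6 + x^3 + 1


def gf_mul(a, b, m, f):
    """Multiply in GF(2^m): carry-less product, then reduction modulo f(x)."""
    p = 0
    while b:
        if b & 1:
            p ^= a                      # addition in GF(2^m) is XOR
        a <<= 1
        b >>= 1
    for i in range(p.bit_length() - 1, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)           # cancel the term of degree i
    return p


def gf_inv(a, m, f):
    """Inverse via Fermat's little theorem, a^(2^m - 2), by square-and-multiply."""
    r, t = 1, a
    for _ in range(m - 1):
        t = gf_mul(t, t, m, f)          # t = a^(2^i)
        r = gf_mul(r, t, m, f)          # r = product of a^(2^i), i = 1..m-1
    return r


def ladder_x(k, x, b, m, f):
    """Return the affine x-coordinate of kP (k >= 1) given x = x(P), x != 0.

    Returns None when kP is the point at infinity. Only the curve
    parameter b is needed by the x-only formulas.
    """
    mul = lambda u, v: gf_mul(u, v, m, f)
    sqr = lambda u: gf_mul(u, u, m, f)

    def madd(X1, Z1, X2, Z2):           # point addition, difference of inputs is P
        t1, t2 = mul(X1, Z2), mul(X2, Z1)
        Z3 = sqr(t1 ^ t2)
        return mul(x, Z3) ^ mul(t1, t2), Z3

    def mdouble(X, Z):                  # point doubling
        X2, Z2 = sqr(X), sqr(Z)
        return sqr(X2) ^ mul(b, sqr(Z2)), mul(X2, Z2)

    X1, Z1 = x, 1                       # P
    X2, Z2 = sqr(sqr(x)) ^ b, sqr(x)    # 2P
    for bit in bin(k)[3:]:              # bits of k below the leading 1
        if bit == "1":
            X1, Z1 = madd(X1, Z1, X2, Z2)
            X2, Z2 = mdouble(X2, Z2)
        else:
            X2, Z2 = madd(X1, Z1, X2, Z2)
            X1, Z1 = mdouble(X1, Z1)
    if Z1 == 0:
        return None
    return mul(X1, gf_inv(Z1, m, f))    # the single inversion of the whole ladder
```

Both branches of the loop perform one addition and one doubling, which is the regularity the text credits for side-channel resistance. For the 163-bit field, a call would take the form `ladder_x(k, x, b, 163, F163)` with `x` and `b` taken from the chosen curve.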
TABLE I
PROPOSED SCHEDULING SCHEME USING A TWO-STAGE PIPELINED FF MAC
TABLE II
PROPOSED SCHEDULING SCHEME USING A THREE-STAGE PIPELINED FF MAC
Fig. 7. Area-delay efficient KOM for m = 163 on (a) four-input LUT FPGAs and (b) six-input LUT FPGAs.

The core of the FF MAC lies in the bit-parallel polynomial multiplication, which can be directly implemented using the classical multiplication (CM) as follows:

$$d(x) = \sum_{k=0}^{2m-2} d_k \cdot x^k$$

where

$$d_k = \sum_{i+j=k} a_i b_j, \quad 0 \le i, j \le m-1.$$

It is expensive to directly implement an m-bit CM when m is large, especially for resource-constrained devices such as FPGAs. A good solution is to use the KOM [25], which has subquadratic complexity. A one-step KOM splits the operand size by half and performs the polynomial multiplication as follows:

$$\begin{aligned} A \cdot B &= (a_h x^n + a_l)(b_h x^n + b_l) \\ &= a_h b_h x^{2n} + (a_l b_h + a_h b_l) x^n + a_l b_l \\ &= a_h b_h x^{2n} + [(a_h + a_l)(b_h + b_l) + a_h b_h + a_l b_l] x^n + a_l b_l \end{aligned}$$

where $n = \lceil m/2 \rceil$, $A = a_h x^n + a_l$, and $B = b_h x^n + b_l$. An m-bit KOM consists of three $(m/2)$-bit multipliers when m is even, and two $\lceil m/2 \rceil$-bit and one $\lfloor m/2 \rfloor$-bit multipliers when m is odd. Similarly, the three submultipliers $a_h b_h$, $a_l b_l$, and $(a_h + a_l)(b_h + b_l)$ can apply the Karatsuba–Ofman algorithm recursively. The delay of a KOM increases monotonically with the iteration steps, while the area consumption reaches the optimum when the split operand size arrives at a threshold $\tau$, and then the $\tau$-bit CM is invoked.

In [31], the area and lookup table (LUT) delay estimations of KOMs on FPGA have been discussed in detail. We have experimentally found that, in case of GF($2^{163}$), the four-step KOM for four-input LUT FPGAs and the three-step KOM for six-input LUT FPGAs achieve the best tradeoff between area and delay. The optimum constructions of the 163-bit KOMs are shown in Fig. 7. An m-bit KOM (CM) is denoted as KOM$_m$ (CM$_m$). The weight on each edge denotes the required number of submultipliers. For example, a KOM$_{81}$ requires one KOM$_{40}$ and two KOM$_{41}$, while a KOM$_{82}$ requires three KOM$_{41}$. The number of CMs can be calculated by tracing the weights in a reverse order. For example, the

B. FF Inversion

Compared with other FF operations, FF inversion is the most complex and time-consuming one. The Itoh–Tsujii algorithm (ITA) [32], which is based on Fermat's little theorem, shows high efficiency for inversion in GF($2^m$) [1]. The ITA splits inversion into many squares and very few multiplications; thus, inversion can be accomplished by exploiting existing multipliers instead of implementing another set of hardware. Square in GF($2^m$) with polynomial basis is an efficient operation, which only involves reduction; hence, it is heavily used to speed up the ITA inversion.

Given an element a in GF($2^m$), the inversion $a^{-1}$ can be computed as follows:

$$a^{-1} = a^{2^m - 2} = a^{2(2^{m-1} - 1)}.$$

The ITA performs inversion using the following observation:

$$2^s - 1 = \begin{cases} (2^{s/2} - 1)(2^{s/2} + 1), & s \text{ is even} \\ 2(2^{\frac{s-1}{2}} - 1)(2^{\frac{s-1}{2}} + 1) + 1, & s \text{ is odd.} \end{cases}$$

For example, the inversion $a^{-1}$ in GF($2^{163}$) is calculated in the following order:

$$a^{-1} = a^{2^{163} - 2} = a^{2(2^{162} - 1)} = a^{2(2^{81}+1)\{2(2^{40}+1)(2^{20}+1)(2^{10}+1)(2^{5}+1)[2(2^{2}+1)(2+1)+1]+1\}}.$$

In GF($2^m$), the ITA requires $(m - 1)$ squares and $\lfloor \log_2 (m - 1) \rfloor + H(m - 1) - 1$ multiplications, where H denotes the Hamming weight of the binary representation of the given integer. In case of m = 163, $H(m - 1) = H(162) = H((10100010)_2) = 3$. Thus, inversion in GF($2^{163}$) requires 162 squares and 9 multiplications.

In [23] and [33], a Powerblock, which is a cascade of several quad circuits, is proposed to speed up inversion. Since scalar multiplication requires only one inversion in the postprocess stage, this paper simply uses square and multiplication to implement inversion. In this way, the FF inversion is realized by controlling the existing data path, and no resource overhead is introduced. The FF squarer and addition are much cheaper than the FF MAC and the FF inversion, which have been discussed in detail in [26].

C. Delay Estimation and Pipeline Analysis

The FF MAC based on KOM consists of four stages: 1) splitting the input operands to produce threshold operands; 2) performing threshold CMs; 3) recursively aligning the outputs to combine higher-level multiplications; and 4) doing the modular reduction and addition. Apart from the FF MAC, the critical path in the proposed architecture contains an addition (XOR) and a 4:1 MUX, as the dashed line shows in Fig. 6.

The LUT delay formulation of KOM on FPGA is given in [23] and [31]. The overlapped operands $(a_h + a_l)$ and $(b_h + b_l)$ require additions, and the r-step splitting generates
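The Karatsuba–Ofman split and the Itoh–Tsujii inversion described earlier in this section can be checked in software. The following Python sketch is a behavioral model under stated assumptions, not the bit-parallel hardware: polynomials are Python integers (bit i is the coefficient of $x^i$), the threshold `tau = 11` is an illustrative choice that happens to give four splitting steps for m = 163 (the thresholds actually used are those of Fig. 7, which is not reproduced here), and squaring simply reuses the classical multiplier. The function names `cm`, `kom`, `reduce_mod`, and `ita_inverse` are hypothetical. The inversion routes every multiplication through `kom`, mirroring the idea of reusing the existing multiplier, and counts operations so that the stated totals of 162 squares and 9 multiplications can be observed.

```python
# Illustrative sketch only: recursive Karatsuba-Ofman multiplication over GF(2)
# and Itoh-Tsujii inversion in GF(2^163). Ints encode polynomials bitwise.

M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # x^163 + x^7 + x^6 + x^3 + 1


def cm(a, b):
    """Classical multiplication: d_k is the XOR of a_i & b_j over i + j = k."""
    d = 0
    while b:
        if b & 1:
            d ^= a
        a <<= 1
        b >>= 1
    return d


def kom(a, b, bits, tau=11):
    """Karatsuba-Ofman product of two polynomials of at most `bits` bits (tau >= 1)."""
    if bits <= tau:
        return cm(a, b)                     # threshold reached: invoke the CM
    n = (bits + 1) // 2                     # n = ceil(bits / 2)
    mask = (1 << n) - 1
    al, ah = a & mask, a >> n               # A = ah * x^n + al
    bl, bh = b & mask, b >> n               # B = bh * x^n + bl
    lo = kom(al, bl, n, tau)                # ceil(bits/2)-bit submultiplier
    hi = kom(ah, bh, bits - n, tau)         # floor(bits/2)-bit submultiplier
    mid = kom(al ^ ah, bl ^ bh, n, tau)     # (ah + al)(bh + bl)
    return (hi << (2 * n)) ^ ((mid ^ hi ^ lo) << n) ^ lo


def reduce_mod(p, m=M, f=F):
    """Reduce a polynomial modulo the degree-m polynomial f(x)."""
    for i in range(p.bit_length() - 1, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)
    return p


def ita_inverse(a, m=M, f=F):
    """Return (a^-1, operation counts) using the Itoh-Tsujii addition chain."""
    counts = {"mul": 0, "sqr": 0}

    def mul(u, v):
        counts["mul"] += 1
        return reduce_mod(kom(u, v, m), m, f)

    def sqr_n(u, n):                        # n successive squarings
        for _ in range(n):
            counts["sqr"] += 1
            u = reduce_mod(cm(u, u), m, f)
        return u

    beta, k = a, 1                          # invariant: beta = a^(2^k - 1)
    for bit in bin(m - 1)[3:]:              # bits of m - 1 below the leading 1
        beta = mul(sqr_n(beta, k), beta)    # k -> 2k
        k *= 2
        if bit == "1":
            beta = mul(sqr_n(beta, 1), a)   # k -> k + 1
            k += 1
    return sqr_n(beta, 1), counts           # a^(2(2^(m-1) - 1)) = a^-1
```

For m = 163 the chain visits k = 1, 2, 4, 5, 10, 20, 40, 80, 81, 162, which corresponds to the factorization of $2^{162} - 1$ given in the text.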
TABLE III
IMPLEMENTATION RESULTS OF THE PROPOSED ECSM ARCHITECTURES ON VIRTEX-5
TABLE IV
PERFORMANCE COMPARISON OF ECSM IMPLEMENTATIONS ON FPGA
The KOM in [24] was two-stage pipelined, and a quad circuit was used to speed up inversion. That is why [24] costs fewer clock cycles (1091) than ours (1180 for our two-stage pipelined ECSM). Rebeiro et al. [22] and Roy et al. [23] used similar architectures, and the best one was four-stage pipelined. The required clock cycles were reduced to 1429 (1404) due to data forwarding and the Powerblock. All of [22]–[24] introduced many redundant primitives into the critical path. For example, Rebeiro et al. [22] introduced two FF squares, two additions (XOR), three 4:1 MUXs, and one 2:1 MUX into the critical path apart from the FF multiplier, which greatly increased the critical path delay (in total, 23 LUTs for k = 4 and 17 LUTs for k = 6).
In our architecture, the FF MAC is used instead of an FF multiplier. Most additions are absorbed into the FF MAC
with no extra logic and clock cycle. The data path of our architecture is well designed, so that only one addition (XOR) and one 4:1 MUX are added to the critical path of the FF MAC. The total critical path delay of our architecture is 14 LUTs for k = 4 and 12 LUTs for k = 6. Through careful scheduling and balanced pipelining, our designs cost relatively fewer clock cycles and achieve higher clock frequency. The control signals are stored in Block RAMs on FPGA; thus, a bulky state machine is saved, which also contributes to higher frequency and smaller area. Area reduction owes much to the use of the KOM. With even fewer resources, our ECSM is 47.5% faster than [24] and 59% faster than [22] on Virtex-4, and 87.0% faster than [22] and double the speed of [23] on Virtex-5.

We also present results for GF($2^{233}$) and GF($2^{283}$) in Table IV. Except for [19], the proposed architecture outperforms other designs in terms of both speed and area. The speed of our designs is 3.13 times that of [19] in GF($2^{233}$) and 3.92 times that of [19] in GF($2^{283}$) with a moderate increase in area. The comparisons for designs in GF($2^{409}$) and GF($2^{571}$) are not presented, because they are rarely implemented in previous works.

VII. CONCLUSION

This paper focuses on speeding up ECSM over GF($2^m$) on FPGA with the premise of high area utilization efficiency. The proposed architecture mainly consists of a pipelined bit-parallel FF MAC and a nonpipelined FF squarer. Area reduction is achieved using the Karatsuba–Ofman algorithm. Compact scheduling schemes are proposed to reduce the required number of clock cycles for ECSM. The data path in the architecture is carefully designed to achieve a short critical path. Pipeline techniques are applied to improve the working frequency. Thorough analyses, supported with detailed experimental results, are provided to find the architecture with the optimal number of pipeline stages. Compared with other existing designs, the proposed architecture outperforms their results in terms of both speed and area.

ACKNOWLEDGMENT

The authors would like to thank the editors and reviewers for their contributive comments and suggestions.

REFERENCES

[1] F. Rodríguez-Henríquez, N. A. Saqib, A. D. Pérez, and Ç. K. Koç, Cryptographic Algorithms on Reconfigurable Hardware. New York, NY, USA: Springer-Verlag, 2006.
[2] E. Wenger and M. Hutter, “Exploring the design space of prime field vs. binary field ECC-hardware implementations,” in Proc. 16th Nordic Conf. Secure IT Syst. Inf. Security Technol. Appl. (NordSec), Tallinn, Estonia, Oct. 2011, pp. 256–271.
[3] P. H. W. Leong and I. K. H. Leung, “A microcoded elliptic curve processor using FPGA technology,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 5, pp. 550–559, Oct. 2002.
[4] N. Gura et al., “An end-to-end systems approach to elliptic curve cryptography,” in Proc. 4th Int. Workshop CHES, London, U.K., 2003, pp. 349–365.
[5] A. Satoh and K. Takano, “A scalable dual-field elliptic curve cryptographic processor,” IEEE Trans. Comput., vol. 52, no. 4, pp. 449–460, Apr. 2003.
[6] R. C. C. Cheung, N. J. Telle, W. Luk, and P. Y. K. Cheung, “Customizable elliptic curve cryptosystems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 9, pp. 1048–1059, Sep. 2005.
[7] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede, “Superscalar coprocessor for high-speed curve-based cryptography,” in Proc. 8th Int. Workshop Cryptograph. Hardw. Embedded Syst. (CHES), 2006, pp. 415–429.
[8] D. Schinianakis, A. Kakarountas, T. Stouraitis, and A. Skavantzos, “Elliptic curve point multiplication in GF($2^n$) using polynomial residue arithmetic,” in Proc. 16th IEEE Int. Conf. Electron., Circuits, Syst. (ICECS), Dec. 2009, pp. 980–983.
[9] W. N. Chelton and M. Benaissa, “Fast elliptic curve cryptography on FPGA,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 2, pp. 198–205, Feb. 2008.
[10] K. Järvinen and J. Skyttä, “On parallelization of high-speed processors for elliptic curve cryptography,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1162–1175, Sep. 2008.
[11] R. Azarderakhsh and A. Reyhani-Masoleh, “High-performance implementation of point multiplication on Koblitz curves,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 1, pp. 41–45, Jan. 2013.
[12] K. Järvinen, “Optimized FPGA-based elliptic curve cryptography processor for high-speed applications,” Integr., VLSI J., vol. 44, no. 4, pp. 270–279, Sep. 2011.
[13] R. Azarderakhsh and A. Reyhani-Masoleh, “Parallel and high-speed computations of elliptic curve cryptography using hybrid-double multipliers,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 6, pp. 1668–1677, Jun. 2015.
[14] C. H. Kim, S. Kwon, and C. P. Hong, “FPGA implementation of high performance elliptic curve cryptographic processor over GF($2^{163}$),” J. Syst. Archit., vol. 54, no. 10, pp. 893–900, 2008.
[15] Y. Zhang, D. Chen, Y. Choi, L. Chen, and S. Ko, “A high performance ECC hardware implementation with instruction-level parallelism over GF($2^{163}$),” Microprocess. Microsyst., vol. 34, no. 6, pp. 228–236, Oct. 2010.
[16] G. D. Sutter, J. Deschamps, and J. L. Imaña, “Efficient elliptic curve point multiplication using digit-serial binary field operations,” IEEE Trans. Ind. Electron., vol. 60, no. 1, pp. 217–225, Jan. 2013.
[17] S. Kumar, T. Wollinger, and C. Paar, “Optimum digit serial GF($2^m$) multipliers for curve-based cryptography,” IEEE Trans. Comput., vol. 55, no. 10, pp. 1306–1311, Oct. 2006.
[18] C. Shu, K. Gaj, and T. El-Ghazawi, “Low latency elliptic curve cryptography accelerators for NIST curves over binary fields,” in Proc. IEEE Int. Conf. Field-Program. Technol., Dec. 2005, pp. 309–310.
[19] B. Ansari and M. A. Hasan, “High-performance architecture of elliptic curve scalar multiplication,” IEEE Trans. Comput., vol. 57, no. 11, pp. 1443–1453, Nov. 2008.
[20] H. Mahdizadeh and M. Masoumi, “Novel architecture for efficient FPGA implementation of elliptic curve cryptographic processor over GF($2^{163}$),” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 12, pp. 2330–2333, Dec. 2013.
[21] F. Rodríguez-Henríquez, N. A. Saqib, and A. Díaz-Pérez, “A fast parallel implementation of elliptic curve point multiplication over GF($2^m$),” Microprocess. Microsyst., vol. 28, nos. 5–6, pp. 329–339, Aug. 2004.
[22] C. Rebeiro, S. S. Roy, and D. Mukhopadhyay, “Pushing the limits of high-speed GF($2^m$) elliptic curve scalar multiplication on FPGAs,” in Proc. 14th Int. Workshop Cryptograph. Hardw. Embedded Syst. (CHES), 2012, pp. 494–511.
[23] S. S. Roy, C. Rebeiro, and D. Mukhopadhyay, “Theoretical modeling of elliptic curve scalar multiplier on LUT-based FPGAs for area and speed,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 5, pp. 901–909, May 2013.
[24] S. Liu, L. Ju, X. Cai, Z. Jia, and Z. Zhang, “High performance FPGA implementation of elliptic curve cryptography over binary fields,” in Proc. IEEE Int. Conf. Trust, Security Privacy Comput. Commun. (TrustCom), Sep. 2014, pp. 148–155.
[25] A. Karatsuba and Y. Ofman, “Multiplication of multidigit numbers on automata,” Sov. Phys. Doklady, vol. 7, no. 7, pp. 595–596, 1963.
[26] H. Wu, “Bit-parallel finite field multiplier and squarer using polynomial basis,” IEEE Trans. Comput., vol. 51, no. 7, pp. 750–758, Jul. 2002.
[27] A. Reyhani-Masoleh and M. A. Hasan, “Low complexity bit parallel architecture for polynomial basis multiplication over GF($2^m$),” IEEE Trans. Comput., vol. 53, no. 8, pp. 945–995, Aug. 2004.
[28] Recommended Elliptic Curves for Federal Government Use, NIST, Gaithersburg, MD, USA, May 1999.
[29] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. New York, NY, USA: Springer-Verlag, 2004.
[30] J. López and R. Dahab, “Fast multiplication on elliptic curves over GF($2^m$) without precomputation,” in Proc. 1st Int. Workshop Cryptograph. Hardw. Embedded Syst., 1999, pp. 316–327.
[31] G. Zhou, H. Michalik, and L. Hinsenkamp, “Complexity analysis and efficient implementations of bit parallel finite field multipliers based on Karatsuba–Ofman algorithm on FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 7, pp. 1057–1066, Jul. 2010.
[32] T. Itoh and S. Tsujii, “A fast algorithm for computing multiplicative inverses in GF($2^m$) using normal bases,” Inf. Comput., vol. 78, no. 3, pp. 171–177, 1988.
[33] C. Rebeiro, S. S. Roy, D. S. Reddy, and D. Mukhopadhyay, “Revisiting the Itoh–Tsujii inversion algorithm for FPGA platforms,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 8, pp. 1508–1512, Aug. 2011.

Shuguo Li (M’09) received the bachelor’s degree in computer science from Xidian University, Xi’an, China, in 1986, the M.S. degree from Shandong University, Shandong, China, in 1993, and the Ph.D. degree from Northwestern Polytechnical University, Xi’an, in 1999.
He is currently a Professor with the Institute of Microelectronics, Tsinghua University, Beijing, China. His current research interests include the algorithm, design, and VLSI implementation for cryptography.