
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016

High-Performance Pipelined Architecture of Elliptic Curve Scalar Multiplication Over GF(2^m)

Lijuan Li and Shuguo Li, Member, IEEE

Manuscript received April 5, 2015; revised June 3, 2015; accepted July 3, 2015. Date of publication August 13, 2015; date of current version March 18, 2016. This work was supported by the Tsinghua National Laboratory for Information Science and Technology. The authors are with the Tsinghua National Laboratory for Information Science and Technology, Institute of Microelectronics, Tsinghua University, Beijing 100084, China (e-mail: li-lj11@mails.tsinghua.edu.cn; lisg@tsinghua.edu.cn). Digital Object Identifier 10.1109/TVLSI.2015.2453360

Abstract— This paper proposes an efficient pipelined architecture of elliptic curve scalar multiplication (ECSM) over GF(2^m). The architecture uses a bit-parallel finite-field (FF) multiplier accumulator (MAC) based on the Karatsuba–Ofman algorithm. The Montgomery ladder algorithm is modified for better sharing of execution paths. The data path in the architecture is well designed, so that the critical path contains few extra logic primitives apart from the FF MAC. In order to find the optimal number of pipeline stages, scheduling schemes with different pipeline stages are proposed and the ideal placement of pipeline registers is thoroughly analyzed. We implement ECSM over the five binary fields recommended by the National Institute of Standards and Technology on Xilinx Virtex-4 and Virtex-5 field-programmable gate arrays. The three-stage pipelined architecture is shown to have the best performance, achieving a scalar multiplication over GF(2^163) in 6.1 µs using 7354 slices on Virtex-4. Using Virtex-5, the scalar multiplication for m = 163, 233, 283, 409, and 571 can be achieved in 4.6, 7.9, 10.9, 19.4, and 36.5 µs, respectively, which are faster than previous results.

Index Terms— Elliptic curve cryptography (ECC), field-programmable gate array (FPGA), Karatsuba–Ofman multiplier (KOM), pipelining, scalar multiplication.

I. INTRODUCTION

ELLIPTIC curve cryptography (ECC) is a public-key cryptographic algorithm proposed by Neal Koblitz and Victor Miller in 1985. Compared with the Rivest–Shamir–Adleman (RSA) algorithm, ECC can offer an equivalent level of security with a much shorter key length [1], which results in higher speed and smaller overhead in area, power, storage space, and so on. ECC over the binary field GF(2^m) is especially attractive for its carry-free property [2]. However, for resource-constrained platforms such as field-programmable gate arrays (FPGAs), it remains a challenge to achieve high-performance as well as area-efficient designs of ECC.

Elliptic curve scalar multiplication (ECSM) is the key operation that dominates the performance of an ECC cryptosystem. Various architectures [3]–[24] have been proposed to speed up ECSM. Most of them explore pipelining and parallelism to improve the working frequency and to reduce the required number of clock cycles in ECSM. Leong and Leung [3] developed a microcoded elliptic curve processor supporting ECSM over GF(2^m) for arbitrary m. Sakiyama et al. [7] proposed a superscalar coprocessor and accelerated ECSM by exploiting instruction-level parallelism (ILP) dynamically. In [9], a pipelined application-specific instruction set processor for ECC was proposed, which performed ECSM over GF(2^163) in 19.55 µs on a Xilinx XC4VLX200. Designs [10]–[13] implemented high-speed scalar multiplication over special classes of curves, such as Koblitz curves, binary Edwards curves, and Hessian curves. In this paper, we focus on optimizing ECSM over generic curves in GF(2^m).

Some designs [14]–[16] duplicate arithmetic blocks to maximize the parallelism in ECSM. For GF(2^163), Kim et al. [14] used three Gaussian normal basis multipliers to achieve ECSM in 10 µs on a Xilinx XC4VLX80. Zhang et al. [15] developed three finite-field (FF) cores and a main controller to achieve ECSM in 7.7 µs on a Xilinx XC4VLX80. The best design in [16] performed ECSM in 5.5 µs on a Xilinx Virtex-5 using three digit-serial FF multipliers and one FF divider. Despite their high speed, these designs require massive logic resources, and thus they are not practical for FPGA implementation.

Considering the tradeoff between area and speed, many designs [16]–[20] use word-serial or digit-serial FF multipliers to implement ECSM. These designs usually require a large number of clock cycles for a scalar multiplication. Ansari and Hasan [19] proposed an efficient scheme, which kept the pseudopipelined word-serial FF multiplier working without idle cycles. A scalar multiplication over GF(2^163) costs 4050 clock cycles and 21 µs on a Xilinx XC4VLX200. In [20], FF multipliers with different word sizes (w) were developed, and the best design with w = 55 performed ECSM over GF(2^163) in 2751 clock cycles and 9.6 µs on a Xilinx XC4VLX200.

Designs [21]–[24] exploit a bit-parallel Karatsuba–Ofman multiplier (KOM) [25] to achieve high-performance ECSM. The architectures in [22] and [23] consisted of a pipelined KOM and a Powerblock. The Powerblock was a cascade of several quad or 2^4 circuits to accelerate inversion. A theoretical model was presented to estimate the delay of different primitives. However, such architectures introduced many redundant primitives into the critical path apart from the FF multiplier, which resulted in a longer critical path delay and a lower clock frequency. A scalar multiplication over GF(2^163) was achieved in 8.6 µs on a Xilinx XC5VLX85t. Liu et al. [24] used a KOM running with no idle cycles and performed ECSM in 9 µs on an XC4VLX200 for GF(2^163).


In this paper, we use a pipelined bit-parallel FF multiplier accumulator (MAC) based on the KOM, along with one FF squarer, to implement ECSM. Different from [22]–[24], only two extra primitives [an addition and a 4:1 multiplexer (MUX)] are added to the critical path apart from the FF MAC. Various numbers of pipeline stages are applied to the data path to find the best tradeoff between the working frequency and the required number of clock cycles in ECSM.

The rest of this paper is organized as follows. Section II introduces some basic knowledge about the binary FF and ECSM. In Section III, scheduling schemes based on the improved Montgomery ladder algorithm are presented. The proposed ECSM architecture, along with a detailed data path, is given in Section IV. Hardware implementation, delay estimation, and pipeline analysis are given in Section V. Section VI presents the performance analysis and comparison. Finally, the conclusion is drawn in Section VII.

II. PRELIMINARIES

A. Finite Field

Elliptic curves are defined over finite fields (FFs). The most commonly used FFs are prime fields GF(p) and binary fields GF(2^m). Both can provide the same level of security. Because arithmetic in GF(2^m) is carry-free, ECC over GF(2^m) is more efficient for hardware implementation. Elements in GF(2^m) have many basis representations, among which polynomial basis and normal basis are the most commonly used. Normal basis has a cheaper FF squaring but a much more complex FF multiplication than polynomial basis. Here, we consider ECC over GF(2^m) with polynomial basis.

In polynomial basis, elements in GF(2^m) are represented as binary polynomials with degree less than m, i.e., a(x) = Σ_{i=0}^{m−1} a_i x^i, a_i ∈ {0, 1}. Arithmetic operations in GF(2^m) [26], [27] are computed modulo an irreducible polynomial f(x) of degree m. Addition a(x) + b(x) in GF(2^m) is a bitwise EXCLUSIVE OR (XOR) between the coefficients of the two polynomials. Multiplication a(x) · b(x) in GF(2^m) is more involved and is performed in two steps: 1) polynomial multiplication and 2) reduction modulo f(x). Squaring a^2(x) is cheaper than multiplication, because the polynomial multiplication in squaring simply pads zeros into the odd bit positions. Inversion a^{-1}(x), which is the most complex operation in GF(2^m), is to find a polynomial b(x) that satisfies a(x) · b(x) = 1 for a given a(x). It can be computed using either the extended Euclidean algorithm or Fermat's little theorem.

The selection of the irreducible polynomial has a great impact on the space and time complexities of the modular reduction. Sparse irreducible polynomials can give considerable computational advantages. Hence, polynomials with three (trinomials) or five (pentanomials) nonzero terms are preferred for high performance. This paper considers generic elliptic curves with the irreducible polynomials recommended by the National Institute of Standards and Technology (NIST) [28].
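To make the polynomial-basis operations above concrete, the following is a minimal software sketch (not the hardware data path) of GF(2^m) arithmetic, using the degree-163 NIST pentanomial quoted later in Section V-A; field elements are held in plain Python integers whose bit i stores the coefficient a_i.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # f(x) = x^163 + x^7 + x^6 + x^3 + 1

def gf_add(a, b):
    # Addition is a bitwise XOR of the coefficients (carry-free).
    return a ^ b

def gf_mul(a, b):
    # Step 1: carry-less (polynomial) multiplication by shift-and-XOR.
    d = 0
    while b:
        if b & 1:
            d ^= a
        a <<= 1
        b >>= 1
    # Step 2: reduction modulo f(x), cancelling the bits of degree >= M.
    for i in range(2 * M - 2, M - 1, -1):
        if (d >> i) & 1:
            d ^= F << (i - M)
    return d

def gf_sqr(a):
    # Squaring is the special case a*a; in hardware it only interleaves zero
    # bits before the reduction, which is why it is much cheaper.
    return gf_mul(a, a)
```

In this model, the FF MAC discussed in the later sections corresponds to gf_mul(a, b) ^ c, i.e., the addition with C is folded into the reduction step.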
B. Elliptic Curve Scalar Multiplication

A nonsupersingular elliptic curve over GF(2^m) is represented as follows [29]:

E : y^2 + xy = x^3 + ax^2 + b.

Points on the elliptic curve E, along with the point at infinity, form a commutative finite group under point addition and doubling. Given a base point P on the curve and a positive integer k, computing

Q = kP = P + P + ··· + P (k times)

is called scalar multiplication. The result Q is another point on the elliptic curve.

Scalar multiplication is performed by repeated point addition and doubling, which essentially rely on a series of FF arithmetic operations, such as multiplication, squaring, addition, and inversion. Point addition and doubling in affine coordinates involve an inversion each time. As inversion is the most complex and time-consuming operation in an FF, it is more practical to represent curve points in projective coordinates. In this way, only one inversion is needed for the entire scalar multiplication, at the expense of additional other FF operations.

At present, the most efficient algorithm for scalar multiplication over GF(2^m) is the Montgomery ladder algorithm using López–Dahab (LD) projective coordinates [30]. As shown in Algorithm 1, the Montgomery ladder algorithm computes both a point doubling and a point addition for every bit of k (denoted as k_i). The algorithm is efficient, because only the x-coordinate (X and Z) is computed and the y-coordinate is recovered at the end. What is more, because each iteration of the main loop performs the same operations no matter whether k_i is 0 or 1, the Montgomery method natively has the ability to defend against side-channel attacks.

Algorithm 1. Montgomery Ladder Scalar Multiplication Over GF(2^m).
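Since the listing of Algorithm 1 is not reproduced above, the following Python sketch is one possible reading of the standard LD-coordinate ladder it refers to [30], not the authors' exact pseudocode; it reuses the gf_add/gf_mul/gf_sqr helpers sketched in Section II-A, and the y-coordinate recovery of the postprocess stage is omitted.

```python
def ladder_step(xP, b, Xa, Za, Xd, Zd):
    # x-only point addition of (Xa, Za) and (Xd, Zd) plus doubling of (Xd, Zd):
    # 6 multiplications, 5 squarings, and 3 additions per ladder iteration.
    T1, T2 = gf_mul(Xa, Zd), gf_mul(Xd, Za)
    Z_add = gf_sqr(gf_add(T1, T2))                      # (Xa*Zd + Xd*Za)^2
    X_add = gf_add(gf_mul(xP, Z_add), gf_mul(T1, T2))   # xP*Z_add + (Xa*Zd)(Xd*Za)
    X2, Z2 = gf_sqr(Xd), gf_sqr(Zd)
    X_dbl = gf_add(gf_sqr(X2), gf_mul(b, gf_sqr(Z2)))   # Xd^4 + b*Zd^4
    Z_dbl = gf_mul(X2, Z2)                              # Xd^2 * Zd^2
    return X_add, Z_add, X_dbl, Z_dbl

def ladder(k, xP, b):
    # Initialization: P1 <- P = (xP, 1), P2 <- 2P = (xP^4 + b, xP^2);
    # each key bit then selects which point is added to and which is doubled.
    X1, Z1 = xP, 1
    X2, Z2 = gf_add(gf_sqr(gf_sqr(xP)), b), gf_sqr(xP)
    for ki in bin(k)[3:]:                               # bits k_{t-2} ... k_0
        if ki == '1':
            X1, Z1, X2, Z2 = ladder_step(xP, b, X1, Z1, X2, Z2)
        else:
            X2, Z2, X1, Z1 = ladder_step(xP, b, X2, Z2, X1, Z1)
    return X1, Z1, X2, Z2        # x(kP) = X1/Z1 after the postprocess stage
```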


III. RESCHEDULING AND ENHANCEMENT OF ELLIPTIC CURVE SCALAR MULTIPLICATION

A. Modified Montgomery Ladder Scalar Multiplication

The Montgomery ladder algorithm consists of three stages: 1) initialization: converting from affine coordinates to LD projective coordinates; 2) the main loop: performing point addition and doubling in LD projective coordinates; and 3) postprocess: recovering the y-coordinate and converting from LD projective coordinates back to affine coordinates. This is shown in Algorithm 1.

From Algorithm 1, we can see that each iteration of the main loop has the same operations except for the input and output variables, which depend on whether k_i is 0 or 1. Based on this, Ansari and Hasan [19] proposed a modified version of scalar multiplication with uniform addressing. By swapping X_1 with X_2 and Z_1 with Z_2, the two different execution paths can be merged into one. Ansari and Hasan [19] selected the path of k_i = 0 and generated the swap logic using the states of k_{i−1} and k_i.

The initialization stage can also be merged with the main loop [20]. We set

X_1 ← 1, Z_1 ← 0, X_2 ← x_P, Z_2 ← 1

and input them to the path of k_i = 1, i.e., step 4 in Algorithm 1. After the calculation, the results are obtained as

X_1 ← x_P, Z_1 ← 1, X_2 ← x_P^4 + b, Z_2 ← x_P^2

which are exactly the required values for the initialization.

Though Mahdizadeh and Masoumi [20] have made the above improvement, a detailed algorithm was not given, and the generation of the swap logic was not considered. In this paper, a complete algorithm for uniform-addressed Montgomery ladder scalar multiplication with a merged initialization stage is presented. As shown in Algorithm 2, the main loop starts from i = t − 1 and the path of k_i = 1 is selected to cover the initial coordinate conversion. The swap logic is modified accordingly. We also modify the postprocess stage to share the data path with the main loop as much as possible.

Algorithm 2. Uniform-Addressed Montgomery Ladder Scalar Multiplication With Merged Initialization Stage.
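Since Algorithm 2 itself is given only as a listing, the following Python sketch is one plausible software rendering of its behavior, assuming the conventional swap rule (swap whenever the current key bit differs from the previous one); the exact swap-control encoding used in Algorithm 2 may differ. It reuses ladder_step from the Section II-B sketch, and the postprocess stage is again omitted.

```python
def ecsm_uniform(k, xP, b):
    # Merged initialization: start from (1, 0, xP, 1) and let the first pass of
    # the k_i = 1 path produce (xP, 1, xP^4 + b, xP^2), as derived above.
    X1, Z1, X2, Z2 = 1, 0, xP, 1
    prev = '1'                              # the first iteration takes the k_i = 1 path
    for ki in bin(k)[2:]:                   # main loop from i = t-1 down to 0
        if ki != prev:                      # on-the-fly operand swap (step 5)
            X1, Z1, X2, Z2 = X2, Z2, X1, Z1
        # Every iteration executes the same k_i = 1 path (uniform addressing).
        X1, Z1, X2, Z2 = ladder_step(xP, b, X1, Z1, X2, Z2)
        prev = ki
    if prev == '0':                         # undo a pending swap after the last bit
        X1, Z1, X2, Z2 = X2, Z2, X1, Z1
    return X1, Z1, X2, Z2
```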
B. Data Dependence Analysis of ECSM

The modified Montgomery ladder scalar multiplication takes m(6M + 5S + 3A) + (11M + 5A + I) operations in total, where M, S, A, and I denote multiplication, squaring, addition, and inversion in GF(2^m), respectively, and m is the dimension of the binary field GF(2^m). The original Montgomery ladder scalar multiplication requires (m − 1)(6M + 5S + 3A) + (10M + 7A + 3S + I) operations. The increased operations are due to the merged initialization and the modified postprocess, which allow better sharing of the data path with the main loop. As squaring and addition are much cheaper than multiplication, and inversion occurs only once, we can see that optimizing the operations in the main loop, especially the FF multiplications, is the key to realizing high-performance ECSM.

Each iteration of the main loop performs a point addition and a point doubling, which take 6M + 5S + 3A together. The data dependence of point addition and doubling in the Montgomery ladder algorithm is shown in Fig. 1. The critical path lies in calculating the X-coordinate of the point addition, which takes 2M + 1S + 2A, as shaded in Fig. 1. Thus, at most three FF multipliers can be used to achieve maximum parallelism in scalar multiplication.

Fig. 1. Data dependence graph of (a) point addition and (b) point doubling in the Montgomery ladder algorithm.

Considering the area overhead, this paper uses only one FF MAC and one FF squarer to achieve high-performance ECSM. The FF MAC merges the addition with the reduction in the FF multiplication. This brings the advantage that no extra clock cycle is needed for the addition, and it does not deteriorate the critical-path delay. The multiplier can be implemented in a serial, parallel, or digit-serial way. The serial and digit-serial multipliers consume less area, but require more clock cycles. For high-speed consideration, we use the parallel multiplier, which takes only one clock cycle to finish one multiplication. Later, we will discuss area reduction using the Karatsuba–Ofman algorithm.

C. Scheduling Schemes With Different Pipeline Stages

As each iteration has six multiplications, it needs at least six clock cycles using one bit-parallel FF MAC. The pipelining technique can improve the working frequency of the FF MAC by inserting registers into the combinational logic. The FF MAC is to compute MR = A · B + C.

Because the addition is performed together with the final reduction, C can be provided in the last clock cycle. In the case of an n-stage pipelined FF MAC, if we input the data A and B at the first clock cycle and C at the nth cycle, the result MR can be obtained at the (n+1)th clock cycle. The timing diagram of a three-stage pipelined FF MAC is shown in Fig. 2 as an example. The FF squarer is to compute SR = S^2, which has a small delay of only two to three XOR gates for a trinomial or pentanomial. The FF squarer is not pipelined. The result SR can be obtained one clock cycle after inputting the data S.

Fig. 2. Timing diagram of a three-stage pipelined FF MAC.

Using a two-stage pipelined FF MAC, each iteration of the main loop in ECSM takes six clock cycles. The proposed scheduling scheme is shown in Table I. The detailed timing diagram of the main loop is shown in Fig. 3. The swapping (step 5) in Algorithm 2 is performed by on-the-fly selection of the input variables to the arithmetic units, and it does not cost extra clock cycles. We notice that the results of the point addition are independent of the value of k_i. Swapping X_1 with X_2 and Z_1 with Z_2 gives the same results, i.e., the terms X_1 Z_2 + X_2 Z_1 and X_1 Z_2 · X_2 Z_1 remain unchanged. Thus, calculating the point addition (the new X_1 and new Z_1) does not involve the swapping selection. But this is not the case for the point doubling. When calculating the terms X_2^2 and Z_2^2, either X_2 or X_1 and either Z_2 or Z_1 must be chosen to enter the FF squarer. Once selected, the other terms, such as X_2^4, Z_2^4, and X_2^2 · Z_2^2, can use the calculated results, and no swapping selection is involved.

TABLE I. Proposed Scheduling Scheme Using a Two-Stage Pipelined FF MAC.

Fig. 3. Timing diagram of the scheduling scheme using a two-stage pipelined FF MAC.

Since calculating the new X_1 (the X-coordinate of the point addition) is the critical path, it should be given the highest priority when scheduling. The multiplication X_1 · Z_2 is placed behind the multiplication X_2 · Z_1, because the value of the new X_1 becomes available last. Similarly, the square X_2^2 is placed behind Z_2^2. The values of the new Z_2, Z_1, X_2, and X_1 are available at the fifth, sixth, seventh, and eighth clock cycles, respectively. The seventh and eighth clock cycles are the first and second clock cycles of the next iteration, respectively. Therefore, the proposed scheduling scheme in Table I keeps the FF MAC working with no idle cycle during the main loop of ECSM, which indicates the high efficiency of the two-stage pipelined architecture.

Similarly, using a three-stage pipelined FF MAC, each iteration of the main loop takes seven clock cycles, as shown in Table II and Fig. 4. One idle cycle is inserted for the FF MAC, because the value of the new X_1 is available at the ninth clock cycle. It has to wait for one cycle, so that X_1 can be used at the second cycle of the next iteration. The values of the new Z_2, Z_1, X_2, and X_1 are available at the sixth, sixth, seventh, and ninth clock cycles, respectively. The ninth clock cycle is the second clock cycle of the next iteration. In fact, in the case of an n-stage (n > 2) pipelined FF MAC, the main loop of ECSM needs (2n + 1) clock cycles for each iteration.

TABLE II. Proposed Scheduling Scheme Using a Three-Stage Pipelined FF MAC.

Fig. 4. Timing diagram of the scheduling scheme using a three-stage pipelined FF MAC.


Fig. 5. Proposed architecture of ECSM.

Along with the increase in pipeline stages, the required number of clock cycles for a scalar multiplication grows linearly, while the frequency improvement has a limit, because the inserted registers introduce extra setup and hold time into the critical path delay and prevent timing optimization in the combinational logic. A good balance should be found between the number of clock cycles and the working frequency. We have implemented the two-/three-/four-stage pipelined ECSM architectures on Xilinx Virtex-4 and Virtex-5 FPGAs, and found that the three-stage pipelined one achieves the best performance.

IV. PROPOSED ARCHITECTURE OF ELLIPTIC CURVE SCALAR MULTIPLICATION

In Section III-C, we have discussed scheduling schemes for the main loop in ECSM. The postprocess stage of ECSM also requires careful consideration. While this stage is not the crucial part of ECSM, its optimization goal is to share the data path with the main loop as much as possible, rather than to reduce the required number of clock cycles. After proper scheduling of ECSM, we propose the high-performance architecture based on the improved Montgomery ladder scalar multiplication algorithm, as shown in Fig. 5.

The proposed ECSM architecture consists of one bit-parallel FF MAC, one FF squarer, a register bank, a finite-state machine, and a 6 × 18 control ROM. The FF MAC is implemented using the Karatsuba–Ofman algorithm and is well pipelined. The n-stage pipelined FF MAC takes n clock cycles to finish one multiplication. The FF squarer is not pipelined, and one clock cycle is required to finish one square. The inputs to the FF MAC, A, B, and C, and the input to the FF squarer, S, are all registered. Another four registers, T1, T2, T3, and T4, are used in the data path for data caching. Each register has a MUX before it. The control signals of the MUXs are given at each clock cycle to switch between different operations in ECSM. The inputs fed to the MUXs of the registers are carefully allocated with the guideline that each MUX contains at most four branches. In this way, the input delay for the registers is only the delay of a 4:1 MUX.

The control signals are different at every clock cycle for each iteration of the main loop and the postprocess stage. A heavy state machine would be required to provide all the control signals in sequence. Here, we use a 6 × 18 control ROM to store the control signals for the MUXs (16 bits) and the swapping selection (2 bits). A small state machine is used for conditional branching and jumping, and provides the 6-bit address to the control ROM. For the FPGA implementation, the control ROM can be realized using Block RAMs in Xilinx devices or embedded memory blocks in Altera devices. Thus, it does not consume logic resources in the FPGA.

The data path of ECSM using a three-stage pipelined FF MAC is given as an example in Fig. 6. The terms X_1, X_2, Z_1, and Z_2 are not presented, because they are the intermediate results of the FF MAC or FF squarer. The bold dashed line in Fig. 6 shows the critical path of the three-stage pipelined architecture, which consists of a pipelined FF MAC, an addition (XOR), and a 4:1 MUX. Data paths with other pipeline stages are similar to Fig. 6 except for different data connections. The control signals stored in the control ROM are also different, but the critical path delay remains unchanged.

Fig. 6. Data path of ECSM using a three-stage pipelined FF MAC.

V. HARDWARE IMPLEMENTATION

A. FF Multiplier Accumulator

The FF MAC is to compute MR = A · B + C, where A = Σ_{i=0}^{m−1} a_i x^i, B = Σ_{i=0}^{m−1} b_i x^i, and C = Σ_{i=0}^{m−1} c_i x^i. The FF MAC performs the polynomial multiplication A · B = Σ_{i=0}^{2m−2} d_i x^i first, and then does the reduction modulo f(x) and the addition with C. In the case of degree m = 163, the NIST-recommended irreducible polynomial is

f(x) = x^163 + x^7 + x^6 + x^3 + 1.

Let ⊕ represent the XOR operation, W = d_i ⊕ d_{i+157} ⊕ d_{i+160} ⊕ c_i, M = d_i ⊕ d_{i+163} ⊕ d_{i+319} ⊕ c_i, and H = d_{i+163} ⊕ d_{i+314}. The result of the reduction and addition is given as

MR_i = W ⊕ d_{i+156},                             i = 162
       W ⊕ d_{i+156} ⊕ d_{i+163},                 13 ≤ i ≤ 161
       W ⊕ d_{i+156} ⊕ d_{i+163} ⊕ d_{i+312},     11 ≤ i ≤ 12
       W ⊕ H ⊕ d_{i+156} ⊕ d_{i+312},             7 ≤ i ≤ 10
       W ⊕ H ⊕ d_{i+313} ⊕ d_{i+316},             i = 6
       M ⊕ d_{i+160} ⊕ d_{i+316} ⊕ d_{i+317},     3 ≤ i ≤ 5
       M ⊕ d_{i+320},                             i = 2
       M ⊕ d_{i+320} ⊕ d_{i+323},                 0 ≤ i ≤ 1.
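As a cross-check of the case equation above, the following is a direct bit-level software transcription for m = 163 (a model of the MAC output, not the XOR netlist itself); d is the list of the 325 product coefficients d_0, ..., d_324 (zero-padded if the product is shorter) and c the list of the 163 coefficients of C.

```python
def mac_reduce_163(d, c):
    # Merged reduction modulo f(x) = x^163 + x^7 + x^6 + x^3 + 1 and addition
    # with C, transcribed case by case from the equation above.
    mr = [0] * 163
    for i in range(163):
        W = d[i] ^ d[i + 157] ^ d[i + 160] ^ c[i]
        if i == 162:
            mr[i] = W ^ d[i + 156]
        elif 13 <= i <= 161:
            mr[i] = W ^ d[i + 156] ^ d[i + 163]
        elif 11 <= i <= 12:
            mr[i] = W ^ d[i + 156] ^ d[i + 163] ^ d[i + 312]
        elif 6 <= i <= 10:
            H = d[i + 163] ^ d[i + 314]
            if i >= 7:
                mr[i] = W ^ H ^ d[i + 156] ^ d[i + 312]
            else:                                   # i == 6
                mr[i] = W ^ H ^ d[i + 313] ^ d[i + 316]
        else:                                       # 0 <= i <= 5
            M = d[i] ^ d[i + 163] ^ d[i + 319] ^ c[i]
            if 3 <= i <= 5:
                mr[i] = M ^ d[i + 160] ^ d[i + 316] ^ d[i + 317]
            elif i == 2:
                mr[i] = M ^ d[i + 320]
            else:                                   # 0 <= i <= 1
                mr[i] = M ^ d[i + 320] ^ d[i + 323]
    return mr
```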


The core of the FF MAC lies in the bit-parallel polynomial multiplication, which can be directly implemented using the classical multiplication (CM) as follows:

d(x) = Σ_{k=0}^{2m−2} d_k x^k, where d_k = Σ_{i+j=k} a_i b_j, 0 ≤ i, j ≤ m − 1.

It is expensive to directly implement an m-bit CM when m is large, especially for resource-constrained devices such as FPGAs. A good solution is to use the KOM [25], which has subquadratic complexity. A one-step KOM splits the operand size in half and performs the polynomial multiplication as follows:

A · B = (a_h x^n + a_l)(b_h x^n + b_l)
      = a_h b_h x^{2n} + (a_l b_h + a_h b_l) x^n + a_l b_l
      = a_h b_h x^{2n} + [(a_h + a_l)(b_h + b_l) + a_h b_h + a_l b_l] x^n + a_l b_l

where n = ⌈m/2⌉, A = a_h x^n + a_l, and B = b_h x^n + b_l. An m-bit KOM consists of three (m/2)-bit multipliers when m is even, and two ⌈m/2⌉-bit and one ⌊m/2⌋-bit multipliers when m is odd. Similarly, the three submultipliers a_h b_h, a_l b_l, and (a_h + a_l)(b_h + b_l) can apply the Karatsuba–Ofman algorithm recursively. The delay of a KOM increases monotonically with the number of iteration steps, while the area consumption reaches its optimum when the split operand size arrives at a threshold τ, at which point the τ-bit CM is invoked.

In [31], the area and lookup table (LUT) delay estimations of KOMs on FPGAs have been discussed in detail. We have experimentally found that, in the case of GF(2^163), the four-step KOM for four-input LUT FPGAs and the three-step KOM for six-input LUT FPGAs achieve the best tradeoff between area and delay. The optimum constructions of the 163-bit KOMs are shown in Fig. 7. An m-bit KOM (CM) is denoted as KOM_m (CM_m). The weight on each edge denotes the required number of submultipliers. For example, a KOM_81 requires one KOM_40 and two KOM_41, while a KOM_82 requires three KOM_41. The number of CMs can be calculated by tracing the weights in reverse order. For example, the four-step KOM_163 includes 49 CM_10 and 32 CM_11, while the three-step KOM_163 includes 11 CM_20 and 16 CM_21.

Fig. 7. Area-delay efficient KOM for m = 163 on (a) four-input LUT FPGAs and (b) six-input LUT FPGAs.
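As a software illustration of the recursive split described above, here is a small Karatsuba–Ofman sketch over GF(2) polynomials held in integers; the threshold TAU below which the classical multiplication is used is only illustrative (the hardware thresholds are the 10/11-bit and 20/21-bit CMs named above).

```python
TAU = 16                                    # illustrative threshold size

def cm(a, b):
    # Classical (schoolbook) carry-less multiplication.
    d = 0
    while b:
        if b & 1:
            d ^= a
        a <<= 1
        b >>= 1
    return d

def kom(a, b, m):
    # Karatsuba-Ofman multiplication of two binary polynomials of degree < m.
    if m <= TAU:
        return cm(a, b)
    n = (m + 1) // 2                        # split size, i.e. ceil(m/2)
    al, ah = a & ((1 << n) - 1), a >> n     # A = ah*x^n + al
    bl, bh = b & ((1 << n) - 1), b >> n
    lo  = kom(al, bl, n)                    # al*bl
    hi  = kom(ah, bh, m - n)                # ah*bh
    mid = kom(al ^ ah, bl ^ bh, n)          # (ah+al)*(bh+bl)
    return (hi << (2 * n)) ^ ((mid ^ lo ^ hi) << n) ^ lo
```

For m = 163, kom(a, b, 163) returns the 325-coefficient product that the MAC then reduces as shown in Section V-A.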
B. FF Inversion

Compared with other FF operations, FF inversion is the most complex and time-consuming one. The Itoh–Tsujii algorithm (ITA) [32], which is based on Fermat's little theorem, shows high efficiency for inversion in GF(2^m) [1]. The ITA splits the inversion into a large number of squarings and very few multiplications; thus, inversion can be accomplished by exploiting existing multipliers instead of implementing another set of hardware. Squaring in GF(2^m) with polynomial basis is an efficient operation that only involves reduction; hence, it is widely used to speed up the ITA inversion.

Given an element a in GF(2^m), the inversion a^{-1} can be computed as follows:

a^{-1} = a^{2^m − 2} = a^{2(2^{m−1} − 1)}.

The ITA performs the inversion using the following observation:

2^s − 1 = (2^{s/2} − 1)(2^{s/2} + 1),                  s is even
2^s − 1 = 2(2^{(s−1)/2} − 1)(2^{(s−1)/2} + 1) + 1,     s is odd.

For example, the inversion a^{-1} in GF(2^163) is calculated in the following order:

a^{-1} = a^{2^163 − 2} = a^{2(2^162 − 1)}
       = a^{2(2^81 + 1){2(2^40 + 1)(2^20 + 1)(2^10 + 1)(2^5 + 1)[2(2^2 + 1)(2 + 1) + 1] + 1}}.

In GF(2^m), the ITA requires (m − 1) squarings and ⌊log2(m − 1)⌋ + H(m − 1) − 1 multiplications, where H denotes the Hamming weight of the binary representation of the given integer. In the case of m = 163, H(m − 1) = H(162) = H((10100010)_2) = 3. Thus, inversion in GF(2^163) requires 162 squarings and 9 multiplications.

In [23] and [33], a Powerblock, which is a cascade of several quad circuits, is proposed to speed up inversion. Since scalar multiplication requires only one inversion in the postprocess stage, this paper simply uses squaring and multiplication to implement inversion. In this way, the FF inversion is realized by controlling the existing data path, and no resource overhead is introduced. The FF squarer and the addition are much cheaper than the FF MAC and the FF inversion, and have been discussed in detail in [26].
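The addition-chain view of the ITA can be sketched in software as follows, using the gf_mul/gf_sqr helpers from the Section II-A sketch; as in the hardware, only squarings and multiplications are used, and for m = 163 the sketch performs the 162 squarings and 9 multiplications counted above.

```python
def ita_inverse(a, m=163):
    # Computes a^(2^m - 2) = (a^(2^(m-1) - 1))^2; beta always holds a^(2^e - 1).
    def sqr_times(x, times):               # x^(2^times) by repeated squaring
        for _ in range(times):
            x = gf_sqr(x)
        return x
    beta, e = a, 1                         # beta = a^(2^1 - 1)
    for bit in bin(m - 1)[3:]:             # bits of m-1 below its leading 1
        beta = gf_mul(sqr_times(beta, e), beta)   # -> a^(2^(2e) - 1)
        e *= 2
        if bit == '1':
            beta = gf_mul(gf_sqr(beta), a)        # -> a^(2^(e+1) - 1)
            e += 1
    return gf_sqr(beta)                    # a^(2*(2^(m-1) - 1)) = a^(-1)
```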
C. Delay Estimation and Pipeline Analysis

The FF MAC based on the KOM consists of four stages: 1) splitting the input operands to produce threshold operands; 2) performing the threshold CMs; 3) recursively aligning the outputs to combine the higher-level multiplications; and 4) doing the modular reduction and addition. Apart from the FF MAC, the critical path in the proposed architecture contains an addition (XOR) and a 4:1 MUX, as the dashed line shows in Fig. 6.

The LUT delay formulation of the KOM on FPGAs is given in [23] and [31].


The overlapped operands (a_h + a_l) and (b_h + b_l) require additions, and the r-step splitting generates a longest chain of 2^r input bits. Thus, the splitting stage has an LUT delay of D_sp = ⌈log_k 2^r⌉ for k-input LUT FPGAs. Classical multipliers of threshold size τ have an LUT delay of D_CM = ⌈log_k 2τ⌉. The aligning stage has r steps, and each step has one LUT delay for k ≥ 4. Thus, the delay of this stage is D_al = r. With regard to the merged reduction and addition, if there are t variables in the longest chain, then the delay is D_mod = ⌈log_k t⌉. For trinomials and pentanomials, this stage has a delay of one or two LUTs given 4 ≤ k ≤ 6. Therefore, the entire delay of the FF MAC based on the KOM is given by

D_MAC = D_sp + D_CM + D_al + D_mod.

In GF(2^163), the FF MAC using the three-step KOM for k = 6 has a delay of ten LUTs, and the one using the four-step KOM for k = 4 has a delay of 11 LUTs.

For a 2^s:1 MUX, there are s selection lines, so that the output of a 2^s:1 MUX is a function of 2^s + s variables [33]. The delay of a 2^s-input MUX is D_MUX = ⌈log_k (2^s + s)⌉. Thus, a 4:1 MUX has a delay of one LUT for k = 6 and two LUTs for k = 4. Fig. 8 shows the LUT delay formation of the critical path of ECSM over GF(2^m), which contains a total delay of

D_tol = D_MAC + D_add + D_MUX

where D_add (one LUT) is the delay due to the extra addition.

The pipeline is achieved by inserting registers into the combinational logic. If the critical path is divided equally, the resulting circuit will have the minimum delay. In the case of k = 6, there are 12 LUTs in the critical path of ECSM over GF(2^163). For a two-stage pipeline, it should be divided in the ratio 6:6, which indicates that the registers should be put after the combined KOM_40/KOM_41. For a three-stage pipeline, the best division lies in the ratio 4:4:4. Registers are put inside the threshold CM_20/CM_21 for the first stage and after the combined KOM_163 for the second stage. The best division for a four-stage pipeline is the ratio 3:3:3:3. However, for more than three pipeline stages, the frequency improvement is not obvious compared with the linearly increasing clock cycles. We have experimentally found that the three-stage pipelined architecture has the best performance. The critical path of the three-stage pipelined ECSM in GF(2^163) for k = 6 is shown in Fig. 8. Similarly, the delay estimation and the pipeline analysis can be applied to ECSM in other binary fields or on other FPGAs (e.g., k = 4).

Fig. 8. Critical path of a three-stage pipelined ECSM in GF(2^163) for k = 6.
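The bookkeeping above can be reproduced with a few lines of Python; k is the LUT input count, r the number of KOM splitting steps, tau the threshold CM size (21 and 11 for the CM_20/CM_21 and CM_10/CM_11 constructions named in Section V-A), and t the number of variables in the longest reduction/addition chain, which is not stated explicitly in the text and is set to 9 here only so that D_mod comes out as the quoted one or two LUTs.

```python
def luts(x, k):
    # Smallest number of k-input LUT levels needed to combine x variables.
    levels = 1
    while k ** levels < x:
        levels += 1
    return levels

def mac_delay(k, r, tau, t):
    d_sp  = luts(2 ** r, k)        # operand-splitting stage
    d_cm  = luts(2 * tau, k)       # threshold classical multipliers
    d_al  = r                      # aligning stage, one LUT level per step
    d_mod = luts(t, k)             # merged modular reduction and addition
    return d_sp + d_cm + d_al + d_mod

def total_delay(k, r, tau, t):
    # D_tol = D_MAC + D_add + D_MUX, with one extra XOR and one 4:1 MUX (s = 2).
    return mac_delay(k, r, tau, t) + 1 + luts(2 ** 2 + 2, k)

print(mac_delay(6, 3, 21, 9), total_delay(6, 3, 21, 9))   # 10 12 (k = 6)
print(mac_delay(4, 4, 11, 9), total_delay(4, 4, 11, 9))   # 11 14 (k = 4)
```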
VI. PERFORMANCE ANALYSIS AND COMPARISON

A. Time Estimation and Performance Analysis

The estimated time for a scalar multiplication is given by

T_SM = number of clock cycles × clock period.

In Algorithm 2, the main loop requires m × C_loop clock cycles, where C_loop denotes the number of clock cycles taken by each iteration of the main loop. The postprocess stage requires 11 FF multiplications, 5 FF additions, and 1 FF inversion. The simple ITA inversion requires (m − 1) FF squarings and M_inv FF multiplications. An FF squaring costs one clock cycle, while an FF multiplication costs n cycles for the n-stage pipelined architecture. FF addition costs no clock cycle, because it is merged with the FF multiplication using the FF MAC. Therefore, the total number of clock cycles for ECSM is given as follows:

C_tol = m × C_loop + n × (M_inv + 11) + (m − 1).

For example, the two-stage pipelined architecture over GF(2^163) costs 163 × 6 + 2 × (9 + 11) + 162 = 1180 clock cycles for one scalar multiplication.

The clock period, or frequency, is determined by the FPGA device and the number of pipeline stages. We implement the two-/three-/four-stage pipelined ECSM architectures over the five binary fields recommended by the NIST, and do the synthesis and implementation using the Xilinx ISE 14.4 tool. The FPGA platforms we use are a Xilinx XC5VLX155-3ff1760 for GF(2^571) and an XC5VLX110-3ff1760 for the other fields. Table III shows the place and route results with a balanced global optimization goal.

As we can see from Table III, with the increase in pipeline stages, the required clock cycles grow linearly, while the frequency improvement is not obvious. The three-stage pipelined architectures achieve the fastest speed for scalar multiplication. Designs with deeper pipelines consume more slice registers. Usually, each slice on an FPGA consists of several LUTs and registers (for example, the Xilinx Virtex-5 slice has two LUTs and two registers). The pipeline registers can be implemented by the slices that are already in use. Thus, the pipeline stages have a minor effect on the area (occupied slices) on FPGAs.
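A quick software check of the cycle-count formula above (a sketch, not the control sequence itself): C_loop is 6 for the two-stage pipeline and 2n + 1 for n > 2 stages (Section III-C), and M_inv = ⌊log2(m − 1)⌋ + H(m − 1) − 1 is the ITA multiplication count from Section V-B.

```python
def ita_mults(m):
    # floor(log2(m-1)) + HammingWeight(m-1) - 1
    return (m - 1).bit_length() - 1 + bin(m - 1).count("1") - 1

def ecsm_cycles(m, n):
    c_loop = 6 if n == 2 else 2 * n + 1
    return m * c_loop + n * (ita_mults(m) + 11) + (m - 1)

print(ecsm_cycles(163, 2))    # 1180, matching the worked example above
```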
B. Results Comparison and Discussion

In order to compare with other ECSM designs on the same device, we also implement the proposed architecture on a Xilinx XC4VLX200. The performance comparisons are given in Table IV. The fastest previously reported results in GF(2^163) are 7.7 µs [15] on Virtex-4 and 5.5 µs [16] on Virtex-5. Zhang et al. [15] used three FF cores to achieve ILP. Sutter et al. [16] used three digit-serial field multipliers and one field divider to obtain the best tradeoff between area and speed. Our proposed three-stage pipelined ECSM achieves 6.1 µs (26.2% faster than [15]) on Virtex-4 and 4.6 µs (19.6% faster than [16]) on Virtex-5, while occupying far fewer slices (only 35.3% of [15] and 49.4% of [16]). Rebeiro et al. [22], Roy et al. [23], and Liu et al. [24] also adopted a bit-parallel KOM to implement ECSM.


TABLE III. Implementation Results of the Proposed ECSM Architectures on Virtex-5.

TABLE IV. Performance Comparison of ECSM Implementations on FPGA.

The KOM in [24] was two-stage pipelined, and a quad circuit was used to speed up inversion. That is why [24] costs fewer clock cycles (1091) than ours (1180 for our two-stage pipelined ECSM). Rebeiro et al. [22] and Roy et al. [23] used similar architectures, and the best one was four-stage pipelined. The required clock cycles were reduced to 1429 (1404) due to data forwarding and the Powerblock. All of [22]–[24] introduced many redundant primitives into the critical path. For example, Rebeiro et al. [22] introduced two FF squarings, two additions (XOR), three 4:1 MUXs, and one 2:1 MUX into the critical path apart from the FF multiplier, which greatly increased the critical path delay (in total, 23 LUTs for k = 4 and 17 LUTs for k = 6).


In our architecture, the FF MAC is used instead of an FF multiplier. Most additions are absorbed into the FF MAC with no extra logic and no extra clock cycles. The data path of our architecture is well designed, so that only one addition (XOR) and one 4:1 MUX are added to the critical path beyond the FF MAC. The total critical path delay of our architecture is 14 LUTs for k = 4 and 12 LUTs for k = 6. Through careful scheduling and balanced pipelining, our designs cost relatively few clock cycles and achieve a higher clock frequency. The control signals are stored in Block RAMs on the FPGA, so a bulky state machine is avoided, which also contributes to higher frequency and smaller area. The area reduction owes much to the use of the KOM. With even fewer resources, our ECSM is 47.5% faster than [24] and 59% faster than [22] on Virtex-4, and is 87.0% faster than [22] and double the speed of [23] on Virtex-5.

We also present results for GF(2^233) and GF(2^283) in Table IV. Except for [19], the proposed architecture outperforms the other designs in terms of both speed and area. The speed of our designs is 3.13 times that of [19] in GF(2^233) and 3.92 times that of [19] in GF(2^283), with a moderate increase in area. Comparisons for designs in GF(2^409) and GF(2^571) are not presented, because these fields are rarely implemented in previous works.

VII. CONCLUSION

This paper focuses on speeding up ECSM over GF(2^m) on FPGAs with the premise of high area utilization efficiency. The proposed architecture mainly consists of a pipelined bit-parallel FF MAC and a nonpipelined FF squarer. Area reduction is achieved using the Karatsuba–Ofman algorithm. Compact scheduling schemes are proposed to reduce the required number of clock cycles for ECSM. The data path in the architecture is carefully designed to achieve a short critical path. Pipelining techniques are applied to improve the working frequency. Thorough analyses, supported by detailed experimental results, are provided to find the architecture with the optimal number of pipeline stages. Compared with other existing designs, the proposed architecture outperforms their results in terms of both speed and area.

ACKNOWLEDGMENT

The authors would like to thank the editors and reviewers for their constructive comments and suggestions.

REFERENCES

[1] F. Rodríguez-Henríquez, N. A. Saqib, A. D. Pérez, and Ç. K. Koç, Cryptographic Algorithms on Reconfigurable Hardware. New York, NY, USA: Springer-Verlag, 2006.
[2] E. Wenger and M. Hutter, "Exploring the design space of prime field vs. binary field ECC-hardware implementations," in Proc. 16th Nordic Conf. Secure IT Syst. Inf. Security Technol. Appl. (NordSec), Tallinn, Estonia, Oct. 2011, pp. 256–271.
[3] P. H. W. Leong and I. K. H. Leung, "A microcoded elliptic curve processor using FPGA technology," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 5, pp. 550–559, Oct. 2002.
[4] N. Gura et al., "An end-to-end systems approach to elliptic curve cryptography," in Proc. 4th Int. Workshop CHES, London, U.K., 2003, pp. 349–365.
[5] A. Satoh and K. Takano, "A scalable dual-field elliptic curve cryptographic processor," IEEE Trans. Comput., vol. 52, no. 4, pp. 449–460, Apr. 2003.
[6] R. C. C. Cheung, N. J. Telle, W. Luk, and P. Y. K. Cheung, "Customizable elliptic curve cryptosystems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 9, pp. 1048–1059, Sep. 2005.
[7] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede, "Superscalar coprocessor for high-speed curve-based cryptography," in Proc. 8th Int. Workshop Cryptograph. Hardw. Embedded Syst. (CHES), 2006, pp. 415–429.
[8] D. Schinianakis, A. Kakarountas, T. Stouraitis, and A. Skavantzos, "Elliptic curve point multiplication in GF(2^n) using polynomial residue arithmetic," in Proc. 16th IEEE Int. Conf. Electron., Circuits, Syst. (ICECS), Dec. 2009, pp. 980–983.
[9] W. N. Chelton and M. Benaissa, "Fast elliptic curve cryptography on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 2, pp. 198–205, Feb. 2008.
[10] K. Järvinen and J. Skyttä, "On parallelization of high-speed processors for elliptic curve cryptography," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1162–1175, Sep. 2008.
[11] R. Azarderakhsh and A. Reyhani-Masoleh, "High-performance implementation of point multiplication on Koblitz curves," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 1, pp. 41–45, Jan. 2013.
[12] K. Järvinen, "Optimized FPGA-based elliptic curve cryptography processor for high-speed applications," Integr., VLSI J., vol. 44, no. 4, pp. 270–279, Sep. 2011.
[13] R. Azarderakhsh and A. Reyhani-Masoleh, "Parallel and high-speed computations of elliptic curve cryptography using hybrid-double multipliers," IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 6, pp. 1668–1677, Jun. 2015.
[14] C. H. Kim, S. Kwon, and C. P. Hong, "FPGA implementation of high performance elliptic curve cryptographic processor over GF(2^163)," J. Syst. Archit., vol. 54, no. 10, pp. 893–900, 2008.
[15] Y. Zhang, D. Chen, Y. Choi, L. Chen, and S. Ko, "A high performance ECC hardware implementation with instruction-level parallelism over GF(2^163)," Microprocess. Microsyst., vol. 34, no. 6, pp. 228–236, Oct. 2010.
[16] G. D. Sutter, J. Deschamps, and J. L. Imaña, "Efficient elliptic curve point multiplication using digit-serial binary field operations," IEEE Trans. Ind. Electron., vol. 60, no. 1, pp. 217–225, Jan. 2013.
[17] S. Kumar, T. Wollinger, and C. Paar, "Optimum digit serial GF(2^m) multipliers for curve-based cryptography," IEEE Trans. Comput., vol. 55, no. 10, pp. 1306–1311, Oct. 2006.
[18] C. Shu, K. Gaj, and T. El-Ghazawi, "Low latency elliptic curve cryptography accelerators for NIST curves over binary fields," in Proc. IEEE Int. Conf. Field-Program. Technol., Dec. 2005, pp. 309–310.
[19] B. Ansari and M. A. Hasan, "High-performance architecture of elliptic curve scalar multiplication," IEEE Trans. Comput., vol. 57, no. 11, pp. 1443–1453, Nov. 2008.
[20] H. Mahdizadeh and M. Masoumi, "Novel architecture for efficient FPGA implementation of elliptic curve cryptographic processor over GF(2^163)," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 12, pp. 2330–2333, Dec. 2013.
[21] F. Rodríguez-Henríquez, N. A. Saqib, and A. Díaz-Pérez, "A fast parallel implementation of elliptic curve point multiplication over GF(2^m)," Microprocess. Microsyst., vol. 28, nos. 5–6, pp. 329–339, Aug. 2004.
[22] C. Rebeiro, S. S. Roy, and D. Mukhopadhyay, "Pushing the limits of high-speed GF(2^m) elliptic curve scalar multiplication on FPGAs," in Proc. 14th Int. Workshop Cryptograph. Hardw. Embedded Syst. (CHES), 2012, pp. 494–511.
[23] S. S. Roy, C. Rebeiro, and D. Mukhopadhyay, "Theoretical modeling of elliptic curve scalar multiplier on LUT-based FPGAs for area and speed," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 5, pp. 901–909, May 2013.
[24] S. Liu, L. Ju, X. Cai, Z. Jia, and Z. Zhang, "High performance FPGA implementation of elliptic curve cryptography over binary fields," in Proc. IEEE Int. Conf. Trust, Security Privacy Comput. Commun. (TrustCom), Sep. 2014, pp. 148–155.
[25] A. Karatsuba and Y. Ofman, "Multiplication of multidigit numbers on automata," Sov. Phys. Doklady, vol. 7, no. 7, pp. 595–596, 1963.
[26] H. Wu, "Bit-parallel finite field multiplier and squarer using polynomial basis," IEEE Trans. Comput., vol. 51, no. 7, pp. 750–758, Jul. 2002.
[27] A. Reyhani-Masoleh and M. A. Hasan, "Low complexity bit parallel architecture for polynomial basis multiplication over GF(2^m)," IEEE Trans. Comput., vol. 53, no. 8, pp. 945–995, Aug. 2004.
[28] Recommended Elliptic Curves for Federal Government Use, NIST, Gaithersburg, MD, USA, May 1999.
[29] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. New York, NY, USA: Springer-Verlag, 2004.
[30] J. López and R. Dahab, "Fast multiplication on elliptic curves over GF(2^m) without precomputation," in Proc. 1st Int. Workshop Cryptograph. Hardw. Embedded Syst., 1999, pp. 316–327.


[31] G. Zhou, H. Michalik, and L. Hinsenkamp, "Complexity analysis and efficient implementations of bit parallel finite field multipliers based on Karatsuba–Ofman algorithm on FPGAs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 7, pp. 1057–1066, Jul. 2010.
[32] T. Itoh and S. Tsujii, "A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases," Inf. Comput., vol. 78, no. 3, pp. 171–177, 1988.
[33] C. Rebeiro, S. S. Roy, D. S. Reddy, and D. Mukhopadhyay, "Revisiting the Itoh–Tsujii inversion algorithm for FPGA platforms," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 8, pp. 1508–1512, Aug. 2011.

Lijuan Li received the bachelor's degree from the School of Astronautics, Harbin Institute of Technology, Harbin, China, in 2011. She is currently pursuing the Ph.D. degree with the Institute of Microelectronics, Tsinghua University, Beijing, China. Her current research interests include VLSI architectures for cryptography algorithms and true random number generators.

Shuguo Li (M'09) received the bachelor's degree in computer science from Xidian University, Xi'an, China, in 1986, the M.S. degree from Shandong University, Shandong, China, in 1993, and the Ph.D. degree from Northwestern Polytechnical University, Xi'an, China, in 1999. He is currently a Professor with the Institute of Microelectronics, Tsinghua University, Beijing, China. His current research interests include the algorithm, design, and VLSI implementation of cryptography.

