
118 Int'l Conf. Reconfigurable Systems and Algorithms | ERSA'11 |

Elliptic Curve Cryptography on FPGAs: How Fast Can We Go with a Single Chip?

Kimmo Järvinen
Aalto University, School of Science
Department of Information and Computer Science
P.O. Box 15400, FI-00076 Aalto, Finland
kimmo.jarvinen@aalto.fi

Abstract— In this paper we present an extremely high-throughput implementation of an elliptic curve cryptosystem. The work builds on the author's previous work, which has resulted in a high-throughput processor architecture for a specific family of elliptic curves called Koblitz curves. The architecture extensively utilizes the fact that FPGA-based designs can be carefully optimized for fixed parameters (e.g., for a single elliptic curve) because parameter flexibility (e.g., support for different curves) can be achieved through reprogrammability. In this paper the work is extended by exploring optimal solutions in a situation where several parallel instances of the processor architecture are placed on a single chip. It is shown that by utilizing the suggested parallel architecture a single modern FPGA chip, Stratix IV GX EP4SGX230KF40C2, can deliver a throughput of about 1,700,000 public-key cryptography operations (scalar multiplications on a secure elliptic curve) per second. This far exceeds any values thus far reported in the literature.

Keywords: Elliptic curve cryptography, field-programmable gate array, parallel implementation, Koblitz curve

1. Introduction

Neal Koblitz and Victor Miller independently proposed the use of elliptic curves for public-key cryptography in 1985 [1], [2]. Since then, elliptic curve cryptography (ECC) has been intensively studied because it offers both shorter keys [3] and faster performance [4], [5] compared to more traditional public-key cryptosystems, such as RSA [6]. Hardware implementation of ECC has also gained considerable interest and, as a consequence, many descriptions of hardware implementations exist in the literature; see [7] for a comprehensive review.

Field-programmable gate arrays (FPGAs) have proven to be highly feasible platforms for implementing cryptographic algorithms because of the combination of programmability and high speed. Several advantages of FPGAs in cryptographic applications were listed in [8]. One of the advantages, "Architecture efficiency," follows from the fact that, in an FPGA-based design, optimizations for specific parameters can be done without major restrictions in the generality of the system because, if other parameters are needed, the FPGA can be reprogrammed to support the new parameters [8].

"Architecture efficiency" has been exploited in numerous papers describing FPGA-based implementations of ECC [7]. However, a vast majority of papers optimize only field arithmetic units for one specific field while the higher levels of ECC still remain unoptimized and use more generic architectures. This approach seems rather pointless because fixing the underlying field already restricts the number of usable elliptic curves to very few. For instance, fixing the field to F_2^163 means that, from the total of fifteen curves recommended by the U.S. National Institute of Standards and Technology (NIST) in [9], only two curves, namely B-163 and K-163, could be used. Hence, if the field is fixed in order to increase performance, one should optimize the architecture also on higher levels for a specific curve. In this paper, we describe an FPGA-based processor that is optimized specifically for Koblitz curves [10].

1.1 Related work

The first FPGA-based implementation using Koblitz curves was presented in [11], where one scalar multiplication was shown to require 45.6 ms on the NIST K-163 curve with an Altera Flex 10K FPGA. They concluded that Koblitz curves are approximately twice as fast as general curves. [12] presented an implementation which computes scalar multiplication in 75 µs on NIST K-163 in a Xilinx Virtex-E FPGA. Neither of the two designs includes circuitry for the conversions that are mandatory for Koblitz curves (see Sec. 2.3). [13], [14] proposed a multiple-base expansion which can be used for increasing the speed of Koblitz curve computations and presented FPGA implementations for both elliptic curve scalar multiplication and conversion. Scalar multiplication was shown to require 35.75 µs on NIST K-163 with a Xilinx Virtex-II whereas the conversion requires 3.81 µs in [13]. [14] presented a parallelized version of the processor of [13] achieving a computation delay of 17.15 µs on Stratix II including the conversion. [15] presented a high-speed processor using parallel formulations of scalar multiplication on Koblitz curves.
Their processor achieves a very fast computation delay of 7.22 µs on NIST K-233 with Virtex-II, but it also neglects the conversions.

Our recent work considering scalar multiplication on Koblitz curves in FPGAs consists of [16], [17], [18], [19], [20]. It was shown in [16] that up to 166,000 signature verifications can be computed using a single Stratix II FPGA with parallel processing. More general parallelization studies were presented in [17] and they resulted in an implementation that computes scalar multiplication in 25.81 µs. We showed that an even shorter computation delay of only 4.91 µs (without the conversion) can be achieved on NIST K-163 with interleaved operations [18]. A complete FPGA-based processor architecture utilizing this method was described in [19]. This architecture was improved by using more efficient algorithms and by redesigning the processor architecture in [20].

1.2 Contributions of the paper

In this article, we explore parallel implementations of the processor architecture presented in [20]. The processor of [20] was optimized to deliver the maximum throughput. This optimization strategy is not necessarily optimal if we have several parallel instances of the processor because a higher overall throughput might be achievable with a larger number of slower but smaller processors. Hence, the target in this work is to maximize the throughput-area ratio of the processor. We show that a setup similar to the one used in [20] is the optimal one also in this case.

It was conjectured in [20] that a parallel implementation on a modern FPGA could achieve throughputs of over 1,000,000 scalar multiplications per second. In this paper, we show that this conjecture was correct. We demonstrate that a single Altera Stratix IV chip is capable of delivering a throughput of 1,700,000 scalar multiplications per second on the standardized elliptic curve NIST K-163. Such an extremely fast accelerator could have applications, for instance, in very heavily loaded network servers or in cryptanalytic hardware.

1.3 Structure of the paper

The remainder of the paper is structured as follows. Sec. 2 presents the preliminaries of finite fields, elliptic curves, and Koblitz curves. Sec. 3 introduces algorithms that are used in the proposed implementation. The processor architecture from [20] that we use as the base architecture is described in Sec. 4. Sec. 5 discusses parallel implementations of the processor architecture described in Sec. 4 and finds the parameters that provide the maximum throughput. The results on an Altera Stratix IV GX FPGA are presented in Sec. 6. Finally, we conclude the paper in Sec. 7.

2. Preliminaries

2.1 Finite fields

Elliptic curves defined over finite fields F_q are used in cryptography, and only curves over binary fields, where q = 2^m, with polynomial basis are considered in this paper. Polynomial bases are commonly used in elliptic curve cryptosystems because they provide fast performance in both software and hardware. Another commonly used basis, the normal basis, provides very efficient squaring, but multiplication is more complicated.

Elements of F_2^m with a polynomial basis are represented as binary polynomials with degrees less than m as a(x) = Σ_{i=0}^{m-1} a_i x^i. Arithmetic operations in F_2^m are computed modulo an irreducible polynomial¹ with degree m. Because sparse polynomials offer considerable computational advantages, trinomials (three nonzero terms) or pentanomials (five nonzero terms) are used in practice. The curve considered in this paper, NIST K-163, is defined over F_2^163 with the pentanomial p(x) = x^163 + x^7 + x^6 + x^3 + 1 [9].

¹ A polynomial, f(x) ∈ F[x], with a positive degree is irreducible over F if it cannot be presented as a product of two polynomials in F[x] with positive degrees.

Addition, a(x) + b(x), in F_2^m is a bitwise exclusive-or (XOR). Multiplication, a(x)b(x), is more involved and consists of two steps: ordinary multiplication of polynomials and reduction modulo p(x). If both multiplicands are the same, the operation is called squaring, a^2(x). Squaring is cheaper than multiplication because the multiplication of polynomials is performed simply by adding zeros to the bit vector. Reduction modulo p(x) can be performed with a small number of XORs if p(x) is sparse and fixed, i.e., the same p(x) is always used, which is the case in this paper. Repeated squaring denotes several successive squarings, i.e., the exponentiation a^(2^e)(x). Inversion, a^(-1)(x), is an operation which finds b(x) such that a(x)b(x) = 1 for a given a(x). Inversion is the most complex operation, and it can be computed either with the Extended Euclidean Algorithm or with Fermat's Little Theorem (e.g., as suggested in [21]), which gives a^(-1)(x) = a^(2^m - 2)(x).

Multiplication has the most crucial effect on the performance of an elliptic curve cryptosystem. A digit-serial multiplier computes D bits of the output in one cycle, resulting in a total latency of ⌈m/D⌉ cycles. We use hardware modifications of the multiplier described in [22]. Instead of using precomputed look-up tables as in [22], our multiplier computes everything on-the-fly similarly as in [12]. Repeated squarings can be computed efficiently with the repeated squarers presented in [23], which are components that compute a^(2^e)(x) directly in one clock cycle.

2.2 Scalar multiplication

Let E be an elliptic curve defined over a finite field F_q. Points on E form an additive Abelian group, E(F_q), together with a point called the point at infinity, O, acting as the zero element. The group operation is called point addition. Let P1 and P2 be two points in E(F_q). Point addition P1 + P2 where P1 = P2 is called point doubling.
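For concreteness, the binary-field arithmetic of Sec. 2.1 can be modelled in a few lines of software. The sketch below is an illustration only (plain shift-and-add, not the digit-serial multiplier of [22]); field elements are Python integers whose bits are the coefficients a_i, and p(x) is the K-163 pentanomial given above:

```python
M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # p(x) of NIST K-163

def f2m_add(a, b):
    # Addition in F_2^m is a bitwise XOR.
    return a ^ b

def f2m_mul(a, b):
    # Shift-and-add polynomial multiplication, then reduction modulo p(x).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    while r.bit_length() > M:
        r ^= POLY << (r.bit_length() - 1 - M)  # cancel the top term with p(x)
    return r

def f2m_sqr(a):
    # Squaring is multiplication by itself; a dedicated squarer only needs to
    # interleave zero bits into the bit vector before reducing.
    return f2m_mul(a, a)

def f2m_repeated_sqr(a, e):
    # a^(2^e): the operation the repeated squarers of [23] compute in one cycle.
    for _ in range(e):
        a = f2m_sqr(a)
    return a

def f2m_inv(a):
    # Fermat: a^-1 = a^(2^m - 2) = product of a^(2^i) for i = 1 .. m-1.
    r = 1
    for _ in range(M - 1):
        a = f2m_sqr(a)
        r = f2m_mul(r, a)
    return r
```

The inversion above uses a naive chain of m - 1 squarings and m - 1 multiplications for clarity; the postprocessor of Sec. 4.4 instead computes the inversion as proposed in [21], pairing a multiplier with a repeated squarer.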
In order to avoid confusion, point addition henceforth refers solely to the case P1 ≠ ±P2.

The principal operation of elliptic curve cryptosystems is scalar multiplication kP, where k is an integer and P ∈ E(F_q) is called the base point. The most straightforward practical algorithm for scalar multiplication is the double-and-add algorithm (binary algorithm), where k is represented as a binary expansion Σ_{i=0}^{ℓ-1} k_i 2^i with k_i ∈ {0, 1}. Each bit in the representation results in a point doubling, and an additional point addition is computed if k_i = 1. Let w denote the Hamming weight of k, i.e., the number of nonzeros in the expansion. Depending on whether the algorithm starts from the most significant (k_{ℓ-1}) or the least significant (k_0) bit of the expansion, the algorithm is called either the left-to-right or the right-to-left double-and-add algorithm. They are shown in Alg. 1 and 2, respectively. They both have the same costs: ℓ - 1 point doublings and w - 1 point additions (the first operations are simply substitutions).

Input: Integer k = Σ_{i=0}^{ℓ-1} k_i 2^i, point P
Output: Point Q = kP
Q ← O
for i = ℓ-1 to 0 do
    Q ← 2Q
    if k_i = 1 then Q ← Q + P
Algorithm 1: Left-to-right scalar multiplication

Input: Integer k = Σ_{i=0}^{ℓ-1} k_i 2^i, point P
Output: Point Q = kP
Q ← O
for i = 0 to ℓ-1 do
    if k_i = 1 then Q ← Q + P
    P ← 2P
Algorithm 2: Right-to-left scalar multiplication

If points on E are represented traditionally with two coordinates as (x, y), referred to as the affine coordinates, or A for short, both point addition and point doubling require one inversion in F_2^m. Inversions are expensive, as discussed in Sec. 2.1. Hence, it is often beneficial to represent points using projective coordinates, (X, Y, Z), where point additions and point doublings can be computed without inversions but at the expense of a larger number of other operations (multiplications, squarings, and additions). In this paper, we use López-Dahab coordinates [24], or LD for short, where the point (X, Y, Z) represents the point (X/Z, Y/Z^2) in A. The LD coordinates offer, in particular, very efficient point additions, P1 + P2, if P1 is in LD and P2 is in A. We extensively utilize these so-called point additions with mixed coordinates [25] in our implementation.

2.3 Koblitz curves

Koblitz curves [10] are a family of elliptic curves defined over F_2^m by the following equation:

    E_K : y^2 + xy = x^3 + ax^2 + 1    (1)

where a ∈ {0, 1}. Koblitz curves are appealing because they offer considerable computational advantages over general curves. These advantages are based on the fact that an algorithm similar to the double-and-add can be devised so that point doublings are replaced by Frobenius endomorphisms. The Frobenius endomorphism, φ, for a point P = (x, y) is a map such that

    φ : (x, y) ↦ (x^2, y^2) and O ↦ O.    (2)

Obviously, the Frobenius endomorphism is very cheap: only two or three squarings depending on the coordinate system. Several successive Frobenius maps, i.e., φ^e(P), can be computed with two (or three) repeated squarings: x^(2^e) and y^(2^e).

Replacing point doublings with Frobenius endomorphisms requires manipulations on k. It holds for all points in E_K(F_2^m) that µφ(P) - φ^2(P) = 2P, where µ = (-1)^(1-a). Thus, φ can be seen as a complex number, τ, satisfying µτ - τ^2 = 2, which gives τ = (µ + √-7)/2. Moving from a bit to another in a representation of k corresponds to an application of φ if k is given in a τ-adic representation:

    k = Σ_{i=0}^{ℓ-1} k_i τ^i.    (3)

Hence, in order to utilize fast Frobenius endomorphisms, k must be given in a τ-adic representation [10].

Efficient conversion algorithms were presented by Solinas in [26]. The basic algorithm returns the so-called τ-adic non-adjacent form (τNAF), where k is represented in the signed-binary format, i.e., k_i ∈ {0, ±1}. Henceforth, we denote 1̄ = -1. The average length of τNAF is the same as the binary length of k, i.e., ℓ. τNAF has w ≈ ℓ/3, and one of two adjacent digits is always zero. Because ℓ ≈ m, scalar multiplication with k in τNAF requires on average m/3 - 1 point additions or subtractions and m - 1 applications of φ.

2.4 Width-ω τNAF

If enough storage space is available, point multiplication can be sped up with window methods, which involve precomputations with P. We consider window methods only on Koblitz curves in order to keep the discussion focused, although analogous algorithms exist also for general curves. A left-to-right scalar multiplication algorithm with precomputations on Koblitz curves is shown in Alg. 3; see Sec. 3.1 for details about the scalar encoding.

Solinas presented an algorithm for producing width-ω τNAF in [26]. Instead of using that algorithm, we use the τNAF algorithm, which is simpler to implement in hardware, and interpret its results as a width-ω τNAF.
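For illustration, the basic τNAF generation can be sketched in software as repeated exact division by τ, using the relation 2 = µτ - τ^2 from Sec. 2.3 (for K-163, a = 1 and hence µ = 1). This is a model of the computation in [26], not of the converter hardware of Sec. 4.1:

```python
MU = 1  # K-163 has a = 1, so mu = (-1)^(1-a) = 1

def tnaf(k):
    """tau-adic NAF digits of the integer k, least significant first."""
    r0, r1 = k, 0                  # the element r0 + r1*tau of Z[tau]
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:
            u = 2 - ((r0 - 2 * r1) % 4)  # u = +/-1, chosen so that the
            r0 -= u                       # next digit comes out zero
        else:
            u = 0
        digits.append(u)
        # Exact division by tau, using 2 = mu*tau - tau^2:
        # (r0 + r1*tau) / tau = (r1 + mu*r0/2) - (r0/2)*tau
        r0, r1 = r1 + MU * (r0 // 2), -(r0 // 2)
    return digits
```

The result can be checked by evaluating the digits at τ (keeping values as integer pairs a + bτ with τ^(i+1) = -2·t1 + (t0 + µ·t1)τ) and by verifying the non-adjacency property.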
This reinterpretation replaces certain strings of 0, 1, and 1̄ digits with window values. The resulting representation has a weight of w ≈ ℓ/(ω + 1).

Input: Integer k = Σ_{i=0}^{ℓ-1} k_i τ^i, point P
Output: Point Q = kP
(k_0, f_0), (k_1, f_1), ..., (k_{w-1}, f_{w-1}) ← ConvertAndEncode(k)
(P_1, P_2, ..., P_N) ← Precompute(P)
Q ← O
for i = w-1 to 0 do
    Q ← Q + sign(k_i) P_{|k_i|}
    Q ← φ^{f_i}(Q)
Algorithm 3: Left-to-right scalar multiplication on Koblitz curves with precomputations

3. Algorithms for the processor

3.1 Encoding for τ-adic expansions

In the implementation presented in this paper, we encode k as follows: (k_0, f_0), (k_1, f_1), ..., (k_{w-1}, f_{w-1}), where k_i ≠ 0 are the nonzero coefficients from (3) and f_i is the number of Frobenius maps between point additions (i.e., the number of zeros plus one). Similar encodings have been used prior to this work at least in [13], [14], [15], [16], [17], [19], [20]. For example, the expansion ⟨100003̄00001̄00000050⟩ results in (1, 5), (3̄, 5), (1̄, 7), (5, 1) for a left-to-right algorithm and (5, 1), (1̄, 7), (3̄, 5), (1, 5) for a right-to-left algorithm; i.e., they are mirror images.

3.2 Right-to-left algorithms

As shown in Alg. 2, when scalar multiplication is computed from right to left, the base point P is doubled (Frobenius mapped) instead of the point Q. Point additions and point doublings (Frobenius maps) can be processed in parallel, making these algorithms feasible for implementations supporting parallel processing of point operations. However, a right-to-left algorithm is, in general, impractical if it involves precomputations with the base point P because all precomputed points need to be doubled in each iteration. We can overcome this disadvantage by utilizing the inexpensiveness of Frobenius maps. Because Frobenius maps are cheap, we can map all precomputed points simultaneously while processing a point addition. We attach a repeated squarer directly to the register bank containing the precomputed points. One repeated squarer can compute φ^{f_i}(x, y) in two clock cycles if f_i ≤ e_max, where e_max is the predefined maximum exponent [23]. Hence, N points require 2N clock cycles. If the latency of one point addition is longer than 2N, the Frobenius maps drop off the critical path. The right-to-left scalar multiplication algorithm on Koblitz curves with precomputations is shown in Alg. 4.

Input: Integer k = Σ_{i=0}^{ℓ-1} k_i τ^i, point P
Output: Point Q = kP
(k_0, f_0), (k_1, f_1), ..., (k_{w-1}, f_{w-1}) ← ConvertAndEncode(k)
(P_1, P_2, ..., P_N) ← Precompute(P)
Q ← O
for i = 0 to w-1 do
    for j = 1 to N do
        P_j ← φ^{f_i}(P_j)
    Q ← Q + sign(k_i) P_{|k_i|}
Algorithm 4: Right-to-left scalar multiplication on Koblitz curves with precomputations

4. Description of the processor

In this section we review the processor architecture originally presented in [20]. The processor implements Alg. 4. It consists of four main components, as shown in the toplevel view of the processor given in Fig. 1, and each of them is discussed in detail in Secs. 4.1–4.4.

The processor operates as follows. The converter (see Sec. 4.1) converts the integer k into width-4 τNAF and encodes it as described in Sec. 3.1. The precomputations are performed in the preprocessor (see Sec. 4.2) simultaneously with the conversion. When both of these computations are ready, the main for-loop of Alg. 4 is executed in the main processor (see Sec. 4.3). Finally, the postprocessor (see Sec. 4.4) maps the result point Q from LD to A and returns Q = (x, y).

The processor comprises a three-stage pipeline and is capable of processing three scalar multiplications simultaneously. The converter and the preprocessor form the first stage, the main processor is the second stage, and, finally, the postprocessor is the last stage, as shown in Fig. 1.

4.1 τNAF converter

As noted in Sec. 2.3, k must be given in a τ-adic representation when using Koblitz curves. We use the τNAF converter presented in [27] for converting the integer k into width-2 τNAF. After this, the width-2 τNAF is converted into width-4 τNAF by using simple string replacement circuitry.

4.2 Preprocessor

The preprocessor computes the precomputed points, P_1, ..., P_N, required by Alg. 4. In this paper, N = 5: we need 4 precomputed points with width-4 τNAF and one extra point in order to ensure that Frobenius maps do not appear on the critical path (i.e., f_i ≤ e_max; see Sec. 3.2 and [20] for more details). Because these precomputations do not depend on k, the τNAF conversion and the precomputations can be performed in parallel. The preprocessor is implemented using the architecture described in [17].
[Figure: toplevel block diagram. The τNAF converter (input k, output (k_i, f_i)) and the preprocessor (input P, output P_i) form stage 1; the main processor, comprising the control logic, the X, Y, and Z units, and the register bank, forms stage 2 and outputs (X, Y, Z); the postprocessor forms stage 3 and outputs Q = (x, y).]

Fig. 1: Toplevel view of the processor

4.3 Main processor

The main processor computes the for-loop of the scalar multiplication by computing point additions and Frobenius maps. The way this task is performed is based on an idea presented in [18] (and used also in [19], [20]). An adaptation of this idea to Alg. 4, the right-to-left algorithm with precomputations, was introduced in [20]. The advantage of this adaptation compared to [18], [19] is that the Frobenius maps are computed in parallel with the point additions, as discussed in Sec. 3.2. As a consequence, performance is bounded only by the latency of multiplication in F_2^m.

First, we recap how point additions are computed in [18]. The main processor computes point addition with mixed coordinates, (X3, Y3, Z3) = (X1, Y1, Z1) + (x2, y2), using the formulae proposed by Al-Daoud et al. [25]:

    A = Y1 + y2 Z1^2;  B = X1 + x2 Z1;
    C = B Z1;  Z3 = C^2;  D = x2 Z3;
    X3 = A^2 + C(A + B^2 + aC);    (4)
    Y3 = (D + X3)(AC + Z3) + (x2 + y2) Z3^2

Clearly, there are eight multiplications in the above formulae (aC does not require a multiplication because a ∈ {0, 1}), and they have a critical path of four multiplications with three or more multipliers. The main observation of [18] was that despite this critical path, it is possible to reduce the effective critical path to only two multiplications per point addition by interleaving the computation of successive point additions and Frobenius maps if one uses four multipliers.

As shown in (4), computing Z3 requires two multiplications (in the B and C computations). Computing X3 requires two additional multiplications and Y3 requires the remaining four. The second multiplication of X3 cannot be started before C is available, i.e., one must wait for the result of the second multiplication of Z3. All multiplications of Y3 require that both multiplications of Z3 are ready. The computation can be interleaved so that one computes the multiplications of Z3 while simultaneously still processing the Y3 coordinate of the previous point addition. The X3 computation is started when the first multiplication of Z3 is ready.

The main processor includes separate processing units for computing the X, Y, and Z coordinates, henceforth referred to as the X, Y, and Z units. The X and Z units both contain one multiplier whereas the Y unit has two multipliers. These units are depicted in Fig. 2. The figure also shows how to set their inputs and outputs in order to compute (4); i.e., all units must be applied only twice to compute a point addition. Contrary to the main processor of [18], [19], the units do not have squarers computing Frobenius maps. The Frobenius maps are computed with a single repeated squarer [23] attached to the register bank containing the precomputed points. Fig. 3 shows the register bank.

The following procedure shows how (almost all) Frobenius maps can be removed from the critical path if one uses a right-to-left scalar multiplication algorithm:

1) All precomputed points, P_1, ..., P_N, are stored in the register bank;
2) One computes and stores φ^{f_0}(P_1), ..., φ^{f_0}(P_N) with the repeated squarer attached to the register bank, which takes 2N clock cycles;
3) Immediately after this, one initializes the scalar multiplication by setting Q = sign(k_0) P_{|k_0|}; that is, X = x_{|k_0|}, Z = 1, and Y = y_{|k_0|} if sign(k_0) = 1 and Y = x_{|k_0|} + y_{|k_0|} if sign(k_0) = -1;
4) One computes and stores φ^{f_1}(P_1), ..., φ^{f_1}(P_N) and, when they are ready, begins computing the point addition Q + sign(k_1) P_{|k_1|};
5) While the multipliers compute the above point addition, the repeated squarer performs φ^{f_2}(P_1), ..., φ^{f_2}(P_N);
6) After this, scalar multiplication proceeds so that, while the multipliers are computing interleaved point additions with k_i and k_{i-1}, the repeated squarer is already updating the register bank by computing φ^{f_{i+1}}(P_1), ..., φ^{f_{i+1}}(P_N).

Obviously, the critical path consists only of the point additions (and the Frobenius maps with f_0 and f_1) if the Frobenius maps are faster than the point additions, which require only two multiplications as shown above.

[Figure: datapaths of the three units, each built from multipliers (mult), squarers (sqr), and XOR adders, annotated with the input/output assignments for the two passes needed to compute (4).]

Fig. 2: (a) Z, (b) X, and (c) Y units

Notice that because the encoding presented in Sec. 3.1 was designed to ensure that f_i ≤ e_max for all i and, consequently, each φ^{f_i}(P_j) has a constant latency of two clock cycles, it is easy to design the main processor so that the Frobenius maps are always faster than two multiplications. In order to ensure that, the main processor architecture must satisfy

    2N + p ≤ 2 M_2    (5)

where M_2 is the latency of multiplication and p is the number of pipeline stages in the repeated squarer. Above we assumed that each repeated squaring requires only one clock cycle, i.e., p = 0, but the repeated squarer can be pipelined.

In our implementation of the processor, the repeated squarer attached to the register bank has e_max = 10 and its computation was pipelined with p = 1. Therefore, it can compute φ^{f_i}(P_j), where f_i ≤ 10, for N points in 2N + 1 clock cycles. Because the width-4 τNAF is used, the number of points is N = 5 (four precomputed points and an extra point). As a consequence, the Frobenius maps require 11 clock cycles and, in order to ensure that the Frobenius maps are not on the critical path, one must select M_2 ≥ 6 for the multipliers in the main processor.

[Figure: register bank with write and read addressing (wr_addr, rd_addr), control logic, and the attached repeated squarer taking f_i as input.]

Fig. 3: Register bank for storing the precomputed points and a repeated squarer for computing φ^{f_i}(P_j) for j = 1, ..., N

4.4 Postprocessor

The postprocessor maps the result point Q = (X, Y, Z) from the main processor to affine coordinates by computing (x = X/Z, y = Y/Z^2). It includes a multiplier and a repeated squarer [23] for computing a^(2^e), where e ∈ {1, 2, 4}. The postprocessor computes the inversion as proposed in [21].

4.5 Latency of scalar multiplication

Let D_1, D_2, and D_3 denote the digit sizes and M_1, M_2, and M_3 the latencies of the multipliers in the preprocessor, the main processor, and the postprocessor, respectively. The latencies of the different operations in clock cycles are given in Table 1. As shown in Table 1, the latencies of all operations except the τNAF conversion can be tuned by varying the latencies of the multipliers, which are given by M_i = ⌈m/D_i⌉ + 1, where i ∈ {1, 2, 3}.

Fig. 4 shows an example scalar multiplication with the processor (shown in white).
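Returning to the addition formulae (4) of Sec. 4.3, they can be sanity-checked in software. The sketch below is an illustration only: the toy field F_2^5 with p(x) = x^5 + x^2 + 1 and the curve constants are our choices for brevity, since the algebra of (4) does not depend on the field size. It compares the mixed-coordinate result against textbook affine addition, recalling that the LD point (X, Y, Z) represents the affine point (X/Z, Y/Z^2):

```python
M, POLY = 5, 0b100101   # toy field F_2^5, p(x) = x^5 + x^2 + 1 (our choice)
A_COEFF = 1             # curve y^2 + xy = x^3 + a*x^2 + b with a = 1

def mul(a, b):
    # Shift-and-add multiplication with reduction modulo p(x).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    while r.bit_length() > M:
        r ^= POLY << (r.bit_length() - 1 - M)
    return r

def inv(a):
    # Fermat inversion a^(2^M - 2), as in Sec. 2.1.
    r, s = 1, a
    for _ in range(M - 1):
        s = mul(s, s)
        r = mul(r, s)
    return r

def affine_add(p1, p2):
    # Textbook affine addition for P1 != +/-P2 on a binary curve.
    (x1, y1), (x2, y2) = p1, p2
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEFF
    y3 = mul(lam, x1 ^ x3) ^ x3 ^ y1
    return x3, y3

def mixed_add(X1, Y1, Z1, x2, y2):
    # Formulae (4): (X1, Y1, Z1) in LD coordinates plus (x2, y2) in affine.
    A = Y1 ^ mul(y2, mul(Z1, Z1))
    B = X1 ^ mul(x2, Z1)
    C = mul(B, Z1)
    Z3 = mul(C, C)
    D = mul(x2, Z3)
    X3 = mul(A, A) ^ mul(C, A ^ mul(B, B) ^ (C if A_COEFF else 0))
    Y3 = mul(D ^ X3, mul(A, C) ^ Z3) ^ mul(x2 ^ y2, mul(Z3, Z3))
    return X3, Y3, Z3
```

Enumerating the points of the toy curve and comparing mixed_add against affine_add for several Z1 values confirms that (4) agrees with the affine group law.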
Table 1: Latencies

Operation | Latency (clock cycles)
Conversion, width-4 τNAF | 335
Precomputation | 18 M_1 + 327
For-loop | 2(w - 2)(M_2 + 1) + 3(2N + p) + 18
Coordinate conversion | 11 M_3 + 59

[Figure: pipeline schedule with rows for the τNAF converter, the preprocessor, the Frobenius maps (f_0 ... f_{w-1}), the X, Y, and Z coordinate computations (iterations k_0 ... k_{w-1}), and the postprocessor, plotted against time.]

Fig. 4: Example computation schedule for computing scalar multiplications with the processor. The previous and the next scalar multiplication processed in the pipeline are shown in dark and light grey, respectively.
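The latency formulas of Table 1 and constraint (5) can be evaluated numerically. The short sketch below uses the fixed parameters N = 5 and p = 1, and the average width-4 τNAF weight w ≈ 163/5 = 32.6 that follows from w ≈ ℓ/(ω + 1) in Sec. 2.4 (the function names are ours):

```python
from math import ceil

m, N, p = 163, 5, 1   # field size and the fixed parameters of Sec. 4
w_avg = m / 5         # average weight of a width-4 tauNAF (Sec. 2.4)

def M(D):
    # Latency of a digit-serial multiplier with digit size D (Sec. 4.5).
    return ceil(m / D) + 1

def precomputation(D1):
    return 18 * M(D1) + 327

def for_loop(D2, w=w_avg):
    return 2 * (w - 2) * (M(D2) + 1) + 3 * (2 * N + p) + 18

def coordinate_conversion(D3):
    return 11 * M(D3) + 59

# Constraint (5): 2N + p <= 2*M2 gives 11 <= 2*M2, i.e. M2 >= 6 (Sec. 4.3).
```

For the setup used in Sec. 6 (D_1 = 7, D_2 = 17, D_3 = 3), this reproduces the 777-cycle precomputation and the 785.4-cycle average for-loop latency quoted there.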

It also shows how the pipeline works by presenting the previous (shown in dark grey) and the next (shown in light grey) scalar multiplications. The computation proceeds so that, first, the τ-adic conversion and the precomputation are performed concurrently. The τ-adic conversion could be performed concurrently also with the main for-loop, but this would not bring any benefit because the precomputation would still be on the critical path. When both are ready, the for-loop computation is started in the main processor. The computation proceeds as discussed in Sec. 4.3, and the Frobenius maps are computed simultaneously with the point additions. It is clearly visible that the X, Y, and Z units are processing two point additions simultaneously; each block in Fig. 4 denotes one iteration of the units. When the for-loop is ready, the result point (X, Y, Z) is converted into A in the postprocessor. This computation can be started immediately when Z is ready, i.e., while X and Y are still being computed.

[Figure: T parallel processors sharing the inputs k and P, with write signals wr1, ..., wrT, a read address rd_addr, and a common output Q.]

Fig. 5: Parallel architecture

5. Parallelization

It was conjectured in [20] that a throughput of over 1,000,000 scalar multiplications per second can be achieved with a single modern FPGA. In this section, we seek to verify this conjecture by studying the maximum throughput achievable with a single-FPGA implementation consisting of several parallel instances of the processor presented above.

The target of the optimizations presented in [20] was to maximize the throughput of a single processor. In order to achieve this goal, the multipliers used in the main processor used the digit size D_2 = 17 (M_2 = 11). This digit size was the largest one that still ensured that the critical path of computations remained in the main processor. In this paper, however, our target is to maximize the throughput of several parallel instances of the processor. Instead of maximizing the throughput of the processor, we focus on maximizing the throughput-area ratio of the processor. In that case, it is possible that the optimum is reached with different digit sizes. We opted to maximize the throughput-area ratio of a single processor and then replicate several instances of that processor, although it is possible that one could achieve slightly higher throughput by aiming to fill the entire FPGA (or some fixed percentage of it). Our approach, however, provides more general results and, of course, a better throughput-area ratio.

The approach of our implementation is simple: we replicate T parallel instances of the processor architecture presented in Sec. 4, all with the parameters that maximize the throughput-area ratio of a single processor, and provide a common interface for them. This architecture is shown in Fig. 5.

As explained above, the dominating component in our processor architecture is the main processor. Hence, we fit the parameters of the other components to the parameters of the main processor.
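This fitting procedure can be sketched as a small search (an illustration using the Table 1 formulas, the average weight w ≈ 163/5, and M_i = ⌈163/D_i⌉ + 1; the helper name balanced_setup is ours):

```python
from math import ceil

m, N, p, w = 163, 5, 1, 163 / 5   # average width-4 tauNAF weight w ~= 32.6

def M(D):
    # Multiplier latency for digit size D (Sec. 4.5).
    return ceil(m / D) + 1

def balanced_setup(D2):
    # For-loop latency from Table 1, then the smallest D1 and D3 whose
    # precomputation and coordinate-conversion latencies do not exceed it,
    # so that the main processor remains the bottleneck.
    loop = 2 * (w - 2) * (M(D2) + 1) + 3 * (2 * N + p) + 18
    D1 = next(D for D in range(1, m + 1) if 18 * M(D) + 327 <= loop)
    D3 = next(D for D in range(1, m + 1) if 11 * M(D) + 59 <= loop)
    return D1, D3
```

For D_2 = 17 this returns D_1 = 7 and D_3 = 3, matching setup 11 of Table 2; smaller values of D_2 reproduce the other rows.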
As shown in Table 1, the latencies are determined solely by the latencies of the multipliers (w = 4, N = 5, and p = 1 are fixed). We choose the digit sizes of the multipliers in the preprocessor and the postprocessor, D1 and D3, so that they are the smallest digit sizes that still ensure that the bottleneck (i.e., the longest latency) is in the main processor. We call a setup (D1, D2, D3) that satisfies this condition a balanced setup.

The (average) latency of the conversion is constant at 335 clock cycles; i.e., the latency cannot be tuned by choosing design parameters (e.g., multiplier digit size). Moreover, the converter uses a different clock than the rest of the architecture. Based on the work in [20], we assume that we can use an approximately 2.2 times faster clock for the other parts of the architecture (in [20], we had 85 MHz and 185 MHz clocks). Hence, we estimate that the latency of the converter corresponds to about 2.2 × 335 ≈ 740 clock cycles of the other clock. The latency of the main processor should not become shorter than this, or else the converter becomes the bottleneck and any area spent on making the main processor faster does not improve the overall throughput of the processor. Based on Table 1, it is clear that if D2 > 17, the latency of the main processor is smaller than 740. In Sec. 4.3, we derived a lower bound of D2 ≥ 6, the same digit size that ensured that the Frobenius maps do not appear on the critical path. Hence, only digit sizes 6 ≤ D2 ≤ 17 are viable. The balanced setups satisfying these constraints are collected in Table 2.

Table 2: Balanced setups

Setup   D1 (M1)   D2 (M2)   D3 (M3)
  1      2 (83)    6 (29)   1 (164)
  2      3 (56)    7 (25)   2 (83)
  3      3 (56)    8 (22)   2 (83)
  4      3 (56)    9 (20)   2 (83)
  5      4 (42)   10 (18)   2 (83)
  6      4 (42)   11 (16)   2 (83)
  7      5 (34)   12 (15)   2 (83)
  8      5 (34)   13 (14)   3 (56)
  9      6 (29)   14 (13)   3 (56)
 10      7 (25)   15 (12)   3 (56)
 11      7 (25)   17 (11)   3 (56)

Because there are no trustworthy methods to get exact (post-place&route) area requirements of an FPGA implementation besides running synthesis and place&route for the design, we determined throughput-area ratios by compiling processors with different balanced setups for a Stratix IV GX 4SGX230NC2 with Quartus II ver. 10.1. This FPGA is used, for instance, on the Stratix IV GX FPGA Development Board [28]. The results are collected in Table 3. We started compilations from setup 11 downwards (see Table 2), and it soon became apparent that the best throughput-area ratio is achieved with setup 11, the largest balanced setup. This means that the maximum throughput-area ratio of the main processor is achieved with D2 ≥ 17. Unfortunately, as noted above, we cannot utilize setups with D2 > 17 because then the throughput will be bounded by the throughput of the converter, which roughly equals the throughput of the main processor with D2 = 17.

Table 3: Throughput-area ratios of selected processors with balanced setups on Stratix IV GX FPGA

Setup   ALUTs    Regs     ALMs     Memory (M9K)   Time (µs)   Throughput (1/s)   Throughput/Area (1/(s·ALM))
 11     16,001   12,377   13,768        21           8.3          339,000                 26.51
 10     14,568   12,360   13,163        21           8.6          314,000                 23.87
  9     14,531   12,371   12,778        21           9.1          293,000                 21.28

As shown in Table 3, setup 11 occupies approximately 15 % of the resources (ALMs) available on the FPGA. This implies that we could fit six such processors on a single chip by using approximately 90 % of the resources. However, area requirements tend to grow more than linearly when the size of the design approaches the limits of the device, because place&route has a more difficult task fulfilling the timing constraints. Hence, we use only five parallel instances of the processor in our implementation. That is, we prepared a prototype implementation using the architecture depicted in Fig. 5 with T = 5. The results for this implementation are given in Sec. 6.

6. Results

The parallel architecture shown in Fig. 5 and described in Sec. 4 and 5 was realized with the parameters obtained in Sec. 5, i.e., T = 5, D1 = 7, D2 = 17, and D3 = 3. The implementation was described in VHDL and compiled for an Altera Stratix IV GX EP4SGX230KF40C2 FPGA with Quartus 10.1. We emphasize that the processor architecture is not restricted to any specific FPGA, but the optimal parameters may vary between different FPGAs.

The area consumptions are collected in Table 4. Timing constraints of 120 MHz and 266 MHz were set for the τ-adic converter and the rest of the processor, respectively, and the compiler was able to meet these constraints.

Table 4: Results on Stratix IV GX EP4SGX230KF40C2

ALUTs               78,695 (43 %)
Regs                61,871 (34 %)
ALMs                74,750 (82 %)
M9K                 105 (9 %)
Clock, converter    120 MHz
Clock, others       266 MHz
Time                8.3 µs
Throughput          1,693,000 1/s

Using the clock frequencies of 120 MHz and 266 MHz and the latencies derived in Sec. 4.5, we get the following average timings: width-4 τNAF conversion 335/120 = 2.79 µs, precomputations 777/266 = 2.92 µs (constant), for-loop 785.4/266 = 2.95 µs, and coordinate conversion 675/266 = 2.53 µs (constant). The throughput of the processor is bounded by the main processor; hence, the theoretical maximum throughput is 338,680 scalar multiplications per second for a single processor and 1,693,000 scalar multiplications per second for the parallel implementation of five processors. The average timing for a single scalar multiplication is 8.3 µs.

7. Conclusions

We described a parallel implementation of elliptic curve cryptography with several parallel instances of the processor introduced in [20]. This implementation is capable of delivering a theoretical maximum throughput of about 1,700,000 scalar multiplications per second on the standardized elliptic curve NIST K-163 [9].

Contrary to many other published works, this study and the earlier versions of this work presented in [19], [20] focused on maximizing the throughput, i.e., scalar multiplications per second, instead of minimizing the computation time of a single scalar multiplication. The difference to [19], [20] is that they focused on maximizing the throughput of a single processor, whereas this paper presented a study which aimed at maximizing the throughput of an implementation comprising several processors. The results showed that both the maximum throughput and the maximum throughput-area ratio are achieved with similar setups. The main component of the processor would be capable of producing even higher throughput-area ratios but, in that case, another component, namely the converter, would become the limiting factor. Hence, there is an obvious need for even faster converter architectures. Very recently, a faster (but slightly larger) converter was presented in [29], and it would allow using setups with even higher throughput-area ratios.

In this paper, we focused on a single FPGA, namely the Altera Stratix IV GX EP4SGX230KF40C2. It is possible (and even likely) that a better performance-price ratio is achieved with budget FPGAs, e.g., from the Altera Cyclone family. Although the balanced setup that delivers the best performance (in this case, setup 11 from Table 2) may vary between different FPGAs, the implementation architecture and the methodology are generic.

The results show that modern FPGAs are able to deliver extremely high throughputs for secure public-key cryptography, reaching as high as 1,700,000 scalar multiplications per second on a secure elliptic curve. The implementation presented in this paper can have applications, e.g., in accelerating cryptographic operations in very highly loaded network servers. Other interesting applications could be found in cryptanalytic hardware. For example, the FPGA-based machine designed for cryptanalytic purposes called COPACOBANA [30] could be programmed to implement the proposed parallel architecture (or a modification of it), which could have some cryptanalytic importance.

References

[1] N. Koblitz, “Elliptic curve cryptosystems,” Math. Comput., vol. 48, pp. 203–209, 1987.
[2] V. Miller, “Use of elliptic curves in cryptography,” in Advances in Cryptology, CRYPTO 1985, ser. Lecture Notes in Comput. Sci., vol. 218. Springer, 1986, pp. 417–426.
[3] A. K. Lenstra and E. R. Verheul, “Selecting cryptographic key sizes,” J. Cryptol., vol. 14, no. 4, pp. 255–293, Dec. 2001.
[4] H. Eberle, N. Gura, S. C. Shantz, V. Gupta, L. Rarick, and S. Sundaram, “A public-key cryptographic processor for RSA and ECC,” in Proc. 15th IEEE Int. Conf. Application-Specific Systems, Architectures and Processors, ASAP 2004, 2004, pp. 98–110.
[5] K. Sakiyama, N. Mentens, L. Batina, B. Preneel, and I. Verbauwhede, “Reconfigurable modular arithmetic logic unit supporting high-performance RSA and ECC over GF(p),” Int. J. Electron., vol. 94, no. 5, pp. 501–514, May 2007.
[6] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.
[7] G. Meurice de Dormale and J.-J. Quisquater, “High-speed hardware implementations of elliptic curve cryptography: A survey,” J. Syst. Architect., vol. 53, no. 2–3, pp. 72–84, Feb.–Mar. 2007.
[8] T. Wollinger, J. Guajardo, and C. Paar, “Security on FPGAs: State-of-the-art implementations and attacks,” ACM Trans. Embed. Comput. Syst., vol. 3, no. 3, pp. 534–574, Aug. 2004.
[9] National Institute of Standards and Technology (NIST), “Digital signature standard (DSS),” Federal Information Processing Standard, FIPS PUB 186-3, June 2009.
[10] N. Koblitz, “CM-curves with good cryptographic properties,” in Advances in Cryptology, CRYPTO ’91, ser. Lecture Notes in Comput. Sci., vol. 576. Springer, 1991, pp. 279–287.
[11] S. Okada, N. Torii, K. Itoh, and M. Takenaka, “Implementation of elliptic curve cryptographic coprocessor over GF(2^m) on an FPGA,” in Cryptographic Hardware and Embedded Systems, CHES 2000, ser. Lecture Notes in Comput. Sci., vol. 1965. Springer, 2000, pp. 25–40.
[12] J. Lutz and A. Hasan, “High performance FPGA based elliptic curve cryptographic co-processor,” in Proc. Int. Conf. Information Technology: Coding and Computing, ITCC 2004, vol. 2, 2004, pp. 486–492.
[13] V. S. Dimitrov, K. U. Järvinen, M. J. Jacobson, W. F. Chan, and Z. Huang, “FPGA implementation of point multiplication on Koblitz curves using Kleinian integers,” in Cryptographic Hardware and Embedded Systems, CHES 2006, ser. Lecture Notes in Comput. Sci., vol. 4249. Springer, 2006, pp. 445–459.
[14] ——, “Provably sublinear point multiplication on Koblitz curves and its hardware implementation,” IEEE Trans. Comput., vol. 57, no. 11, pp. 1469–1481, Nov. 2008.
[15] O. Ahmadi, D. Hankerson, and F. Rodríguez-Henríquez, “Parallel formulations of scalar multiplication on Koblitz curves,” J. Univers. Comput. Sci., vol. 14, no. 3, pp. 481–504, 2008.
[16] K. Järvinen, J. Forsten, and J. Skyttä, “FPGA design of self-certified signature verification on Koblitz curves,” in Cryptographic Hardware and Embedded Systems, CHES 2007, ser. Lecture Notes in Comput. Sci., vol. 4727. Springer, 2007, pp. 256–271.
[17] K. Järvinen and J. Skyttä, “On parallelization of high-speed processors for elliptic curve cryptography,” IEEE Trans. VLSI Syst., vol. 16, no. 9, pp. 1162–1175, Sept. 2008.
[18] ——, “Fast point multiplication on Koblitz curves: Parallelization method and implementations,” Microproc. Microsyst., vol. 33, no. 2, pp. 106–116, Mar. 2009.
[19] K. U. Järvinen and J. O. Skyttä, “High-speed elliptic curve cryptography accelerator for Koblitz curves,” in Proc. 16th IEEE Symp. Field-Programmable Custom Computing Machines, FCCM 2008. IEEE Computer Society, 2008, pp. 109–118.
[20] K. Järvinen, “Optimized FPGA-based elliptic curve cryptography processor for high-speed applications,” Integration—the VLSI Journal, in press.
[21] T. Itoh and S. Tsujii, “A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases,” Inform. Comput., vol. 78, no. 3, pp. 171–177, Sept. 1988.
[22] M. A. Hasan, “Look-up table-based large finite field multiplication in memory constrained cryptosystems,” IEEE Trans. Comput., vol. 49, no. 7, pp. 749–758, July 2000.
[23] K. U. Järvinen, “On repeated squarings in binary fields,” in Selected Areas in Cryptography, SAC 2009, ser. Lecture Notes in Comput. Sci., vol. 5867. Springer, 2009, pp. 331–349.
[24] J. López and R. Dahab, “Improved algorithms for elliptic curve arithmetic in GF(2^n),” in Selected Areas in Cryptography, SAC’98, ser. Lecture Notes in Comput. Sci., vol. 1556. Springer, 1999, pp. 201–212.
[25] E. Al-Daoud, R. Mahmod, M. Rushdan, and A. Kilicman, “A new addition formula for elliptic curves over GF(2^n),” IEEE Trans. Comput., vol. 51, no. 8, pp. 972–975, Aug. 2002.
[26] J. A. Solinas, “Efficient arithmetic on Koblitz curves,” Des. Codes Cryptography, vol. 19, no. 2–3, pp. 195–249, 2000.
[27] B. B. Brumley and K. U. Järvinen, “Conversion algorithms and implementations for Koblitz curve cryptography,” IEEE Trans. Comput., vol. 59, no. 1, pp. 81–92, 2010.
[28] Altera, “Stratix IV GX FPGA development board: Reference manual,” Aug. 2010, http://www.altera.com/literature/manual/rm_sivgx_fpga_dev_board.pdf.
[29] J. Adikari, V. Dimitrov, and K. Järvinen, “A fast hardware architecture for integer to τNAF conversion for Koblitz curves,” IEEE Trans. Comput., in press.
[30] T. Güneysu, T. Kasper, M. Novotný, C. Paar, and A. Rupp, “Cryptanalysis with COPACOBANA,” IEEE Trans. Comput., vol. 57, no. 11, pp. 1498–1513, Nov. 2008.
