Abstract— In this paper we present an extremely high throughput implementation of an elliptic curve cryptosystem. The work builds on the author's previous work which has resulted in a high throughput processor architecture for a specific family of elliptic curves called Koblitz curves. The architecture extensively utilizes the fact that FPGA based designs can be carefully optimized for fixed parameters (e.g., for a single elliptic curve) because parameter flexibility (e.g., support for different curves) can be achieved through reprogrammability. In this paper the work is extended by exploring optimal solutions in a situation where several parallel instances of the processor architecture are placed on a single chip. It is shown that by utilizing the suggested parallel architecture a single modern FPGA chip, Stratix IV GX EP4SGX230KF40C2, can deliver a throughput of about 1,700,000 public-key cryptography operations (scalar multiplications on a secure elliptic curve) per second. This far exceeds any values thus far reported in the literature.

generality of the system because, if other parameters are needed, the FPGA can be reprogrammed to support the new parameters [8].

"Architecture efficiency" has been exploited in numerous papers describing FPGA-based implementations of ECC [7]. However, a vast majority of papers optimize only field arithmetic units for one specific field while the higher levels of ECC still remain unoptimized and use more generic architectures. This approach seems rather pointless because fixing the underlying field already restricts the number of usable elliptic curves to very few. For instance, fixing the field to F_{2^163} means that from the total of fifteen curves recommended by the U.S. National Institute of Standards and Technology (NIST) in [9] only two curves, namely B-163 and K-163, could be used. Hence, if the field is fixed in order to increase performance, one should optimize the architecture also on higher levels for a specific curve. In this paper, we describe an FPGA-based processor that is optimized specifically for Koblitz curves [10].
very fast computation delay of 7.22 µs on NIST K-233 with Virtex-II, but it also neglects the conversions.

Our recent work considering scalar multiplication on Koblitz curves in FPGAs consists of [16], [17], [18], [19], [20]. It was shown in [16] that up to 166,000 signature verifications can be computed using a single Stratix II FPGA with parallel processing. More general parallelization studies were presented in [17] and they resulted in an implementation that computes scalar multiplication in 25.81 µs. We showed that an even shorter computation delay of only 4.91 µs (without the conversion) can be achieved on NIST K-163 with interleaved operations [18]. A complete FPGA-based processor architecture utilizing this method was described in [19]. This architecture was improved by using more efficient algorithms and by redesigning the processor architecture in [20].

1.2 Contributions of the paper

In this article, we explore parallel implementations of the processor architecture presented in [20]. The processor of [20] was optimized to deliver the maximum throughput. This optimization strategy is not necessarily optimal if we have several parallel instances of the processor because a higher overall throughput might be achievable with a larger number of slower but smaller processors. Hence, the target in this work is to maximize the throughput-area ratio of the processor. We show that a setup similar to the one used in [20] is the optimal one also in this case.

It was conjectured in [20] that a parallel implementation on a modern FPGA could achieve throughputs of over 1,000,000 scalar multiplications per second. In this paper, we show that this conjecture was correct. We demonstrate that a single Altera Stratix IV chip is capable of delivering a throughput of 1,700,000 scalar multiplications per second on the standardized elliptic curve NIST K-163. Such an extremely fast accelerator could have applications, for instance, in very heavily loaded network servers or in cryptanalytic hardware.

1.3 Structure of the paper

The remainder of the paper is structured as follows. Sec. 2 presents the preliminaries of finite fields, elliptic curves, and Koblitz curves. Sec. 3 introduces algorithms that are used in the proposed implementation. The processor architecture from [20] that we use as the base architecture is described in Sec. 4. Sec. 5 discusses parallel implementations of the processor architecture described in Sec. 4 and finds the parameters that provide the maximum throughput. The results on an Altera Stratix IV GX FPGA are presented in Sec. 6. Finally, we conclude the paper in Sec. 7.

2. Preliminaries

2.1 Finite fields

Elliptic curves defined over finite fields F_q are used in cryptography, and only curves over binary fields, where q = 2^m, with polynomial basis are considered in this paper. Polynomial bases are commonly used in elliptic curve cryptosystems because they provide fast performance in both software and hardware. Another commonly used basis, normal basis, provides very efficient squaring but multiplication is more complicated.

Elements of F_{2^m} with polynomial basis are represented as binary polynomials with degrees less than m as a(x) = sum_{i=0}^{m-1} a_i x^i. Arithmetic operations in F_{2^m} are computed modulo an irreducible polynomial^1 with a degree m. Because sparse polynomials offer considerable computational advantages, trinomials (three nonzero terms) or pentanomials (five nonzero terms) are used in practice. The curve considered in this paper, NIST K-163, is defined over F_{2^163} with the pentanomial p(x) = x^163 + x^7 + x^6 + x^3 + 1 [9].

Addition, a(x) + b(x), in F_{2^m} is a bitwise exclusive-or (XOR). Multiplication, a(x)b(x), is more involved and consists of two steps: ordinary multiplication of polynomials and reduction modulo p(x). If both multiplicands are the same, the operation is called squaring, a^2(x). Squaring is cheaper than multiplication because the multiplication of polynomials is performed simply by adding zeros to the bit vector. Reduction modulo p(x) can be performed with a small number of XORs if p(x) is sparse and fixed, i.e., the same p(x) is always used, which is the case in this paper. Repeated squaring denotes several successive squarings, i.e., exponentiation a^{2^e}(x). Inversion, a^{-1}(x), is an operation which finds b(x) such that a(x)b(x) = 1 for a given a(x). Inversion is the most complex operation and it can be computed either with the Extended Euclidean Algorithm or with Fermat's Little Theorem (e.g., as suggested in [21]) which gives a^{-1}(x) = a^{2^m - 2}(x).

Multiplication has the most crucial effect on the performance of an elliptic curve cryptosystem. A digit-serial multiplier computes D bits of the output in one cycle resulting in a total latency of ceil(m/D) cycles. We use hardware modifications of the multiplier described in [22]. Instead of using precomputed look-up tables as in [22], our multiplier computes everything on-the-fly similarly as in [12]. Repeated squarings can be computed efficiently with the repeated squarers presented in [23], which are components that compute a^{2^e}(x) directly in one clock cycle.

^1 A polynomial, f(x) in F[x], with a positive degree is irreducible over F if it cannot be presented as a product of two polynomials in F[x] with positive degrees.

2.2 Scalar multiplication

Let E be an elliptic curve defined over a finite field F_q. Points on E form an additive Abelian group, E(F_q), together with a point called the point at infinity, O, acting as the zero element. The group operation is called point addition. Let P_1 and P_2 be two points in E(F_q). Point addition P_1 + P_2 where P_1 = P_2 is called point doubling. In order to avoid
120 Int'l Conf. Reconfigurable Systems and Algorithms | ERSA'11 |
confusion, point addition henceforth refers solely to the case P_1 != ±P_2.

The principal operation of elliptic curve cryptosystems is scalar multiplication kP where k is an integer and P in E(F_q) is called the base point. The most straightforward practical algorithm for scalar multiplication is the double-and-add algorithm (binary algorithm) where k is represented as a binary expansion sum_{i=0}^{ℓ-1} k_i 2^i with k_i in {0, 1}. Each bit in the representation results in a point doubling and an additional point addition is computed if k_i = 1. Let w denote the Hamming weight of k, i.e., the number of nonzeros in the expansion. Depending on whether the algorithm starts from the most significant (k_{ℓ-1}) or the least significant (k_0) bit of the expansion, the algorithm is called either the left-to-right or the right-to-left double-and-add algorithm. They are shown in Alg. 1 and 2, respectively. They both have the same costs: ℓ - 1 point doublings and w - 1 point additions (the first operations are simply substitutions).

Input: Integer k = sum_{i=0}^{ℓ-1} k_i 2^i, point P
Output: Point Q = kP
  Q <- O
  for i = ℓ - 1 to 0 do
    Q <- 2Q
    if k_i = 1 then Q <- Q + P
Algorithm 1: Left-to-right scalar multiplication

2.3 Koblitz curves

Koblitz curves [10] are a family of elliptic curves defined over F_{2^m} by the following equation:

  E_K : y^2 + xy = x^3 + ax^2 + 1    (1)

where a in {0, 1}. Koblitz curves are appealing because they offer considerable computational advantages over general curves. These advantages are based on the fact that an algorithm similar to the double-and-add can be devised so that point doublings are replaced by Frobenius endomorphisms. The Frobenius endomorphism, φ, for a point P = (x, y) is a map such that

  φ : (x, y) -> (x^2, y^2) and O -> O.    (2)

Obviously, the Frobenius endomorphism is very cheap: only two or three squarings depending on the coordinate system. Several successive Frobenius maps, i.e., φ^e(P), can be computed with two (or three) repeated squarings: x^{2^e} and y^{2^e}.

Replacing point doublings with Frobenius endomorphisms requires manipulations on k. It holds for all points in E_K(F_{2^m}) that µφ(P) - φ^2(P) = 2P where µ = (-1)^{1-a}. Thus, φ can be seen as a complex number, τ, satisfying µτ - τ^2 = 2 which gives τ = (µ + sqrt(-7))/2. Moving from a bit to another in a representation of k corresponds to an application of φ if k is given in a τ-adic representation:

  k = sum_{i=0}^{ℓ-1} k_i τ^i.    (3)
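The defining relation of τ above can be checked directly with complex arithmetic; a minimal sketch (illustrative only, using the symbols µ and τ as defined for Eq. (1)-(3)):

```python
# Check the relation mu*tau - tau^2 = 2 that allows the Frobenius map
# to stand in for point doubling, with tau = (mu + sqrt(-7))/2 and both
# curve parameters a in {0, 1} (mu = (-1)^(1-a)).
import cmath

for a in (0, 1):
    mu = (-1) ** (1 - a)
    tau = (mu + cmath.sqrt(-7)) / 2
    assert abs(mu * tau - tau ** 2 - 2) < 1e-12
    # |tau|^2 = 2, so a tau-adic expansion of an m-bit scalar has
    # roughly m digits, mirroring the length of the binary expansion.
    assert abs(abs(tau) ** 2 - 2) < 1e-12
print("tau relations hold for a = 0 and a = 1")
```

The second assertion explains why the τ-adic expansion can replace the binary expansion without lengthening it.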
Input: Integer k = sum_{i=0}^{ℓ-1} k_i τ^i, point P
Output: Point Q = kP
  (k_0, f_0), (k_1, f_1), ..., (k_{w-1}, f_{w-1}) <- ConvertAndEncode(k)
  (P_1, P_2, ..., P_N) <- Precompute(P)
  Q <- O
  for i = w - 1 to 0 do
    Q <- Q + sign(k_i) P_{|k_i|}
    Q <- φ^{f_i}(Q)
Algorithm 3: Left-to-right scalar multiplication on Koblitz curves with precomputations

Input: Integer k = sum_{i=0}^{ℓ-1} k_i τ^i, point P
Output: Point Q = kP
  (k_0, f_0), (k_1, f_1), ..., (k_{w-1}, f_{w-1}) <- ConvertAndEncode(k)
  (P_1, P_2, ..., P_N) <- Precompute(P)
  Q <- O
  for i = 0 to w - 1 do
    for j = 1 to N do
      P_j <- φ^{f_i}(P_j)
    Q <- Q + sign(k_i) P_{|k_i|}
Algorithm 4: Right-to-left scalar multiplication on Koblitz curves with precomputations
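The field operations of Sec. 2.1 can be sketched in software. The following minimal Python model (an illustrative sketch, not constant-time, and unrelated to the paper's actual VHDL implementation) represents elements of F_{2^163} as integers whose bits are polynomial coefficients and reduces with the K-163 pentanomial:

```python
# Sketch of F_{2^163} arithmetic with the NIST K-163 pentanomial
# p(x) = x^163 + x^7 + x^6 + x^3 + 1. Field elements are Python ints
# whose bits are the coefficients of a(x).
M = 163
P = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # irreducible p(x)

def f2m_add(a, b):
    # Addition in characteristic 2 is a bitwise XOR.
    return a ^ b

def f2m_reduce(r):
    # Reduce a polynomial of degree < 2m - 1 modulo p(x); because p(x)
    # is sparse and fixed, each step costs only a few XORs.
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= P << (i - M)
    return r

def f2m_mul(a, b):
    # Schoolbook carry-less multiplication followed by reduction.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return f2m_reduce(r)

def f2m_sqr(a):
    # Squaring just spreads the bits of a(x) apart before reduction,
    # which is why it is much cheaper than a general multiplication.
    return f2m_mul(a, a)

def f2m_inv(a):
    # Inversion via Fermat's little theorem: a^{-1} = a^{2^m - 2}.
    # Plain square-and-multiply over the exponent 0b11...10 (162 ones);
    # Itoh-Tsujii [21] would need far fewer multiplications.
    r = 1
    for _ in range(M - 1):
        r = f2m_mul(f2m_sqr(r), a)
    return f2m_sqr(r)  # trailing zero bit of the exponent
```

In this naive form inversion costs 163 squarings and 162 multiplications, which illustrates why the paper treats the multiplier as the performance-critical component.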
[Figure: processor architecture block diagram showing the τNAF converter (input k), control unit, register bank, X, Y, and Z units operating on (X, Y, Z), preprocessor (inputs P and P_i), and postprocessor (output Q = (x, y)).]
Table 1: Latencies
Operation                 Latency (clock cycles)
Conversion, width-4 τNAF  335
Precomputation            18 M1 + 327
For-loop                  2(w - 2)(M2 + 1) + 3(2N + p) + 18
Coordinate conversion     11 M3 + 59
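The multiplier latencies M1, M2, and M3 above come from the digit-serial construction of Sec. 2.1. A quick sanity check (the "+ 1" over ceil(m/D) is our inference from the values quoted in Table 2, presumably one extra cycle for registering the output):

```python
# Check that the multiplier latencies quoted in Table 2 for m = 163
# equal ceil(m/D) + 1 for every digit size D used in the paper.
from math import ceil

m = 163
quoted = {1: 164, 2: 83, 3: 56, 4: 42, 5: 34, 6: 29, 7: 25, 8: 22,
          9: 20, 10: 18, 11: 16, 12: 15, 13: 14, 14: 13, 15: 12, 17: 11}
for D, latency in quoted.items():
    assert ceil(m / D) + 1 == latency
print("all Table 2 latencies equal ceil(163/D) + 1")
```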
[Figure: pipeline schedule showing, over time, the τNAF converter, the preprocessor, the Frobenius maps (f_0 ... f_{w-1}), the X, Y, and Z coordinate computations (k_0 ... k_{w-1}), and the postprocessor.]
Fig. 4: Example computation schedule for computing scalar multiplications with the processor. The previous and the next scalar multiplication processed in the pipeline are shown in dark and light grey, respectively.
5. Parallelization
It was conjectured in [20] that a throughput of over 1,000,000 scalar multiplications per second can be achieved with a single modern FPGA. In this section, we seek to verify this conjecture by studying what is the maximum throughput achievable with a single FPGA implementation consisting of several parallel instances of the processor presented above.

The target for the optimizations presented in [20] was to maximize the throughput of a single processor. In order to achieve this goal, the multipliers used in the main processor used the digit size D2 = 17 (M2 = 11). This digit size was the largest one that still ensured that the critical path of computations remained in the main processor. In this paper, however, our target is to maximize the throughput of several parallel instances of the processor. Instead of maximizing the throughput of the processor, we focus on maximizing the throughput-area ratio of the processor. In that case, it is possible that the optimum is reached with different digit sizes. We opted to maximize the throughput-area ratio of a single processor and then replicate several instances of that processor, although it is possible that one could achieve slightly higher throughput by aiming to fill the entire FPGA (or some fixed percentage of it). Our approach, however, provides more general results and, of course, a better throughput-area ratio.

The approach of our implementation is simple: we replicate T parallel instances of the processor architecture presented in Sec. 4, all with the parameters that maximize the throughput-area ratio of a single processor, and provide a common interface for them. This architecture is shown in Fig. 5.

As explained above, the dominating component in our processor architecture is the main processor. Hence, we fit the parameters of the other components based on the parameters of the main processor. As shown in Table 1, the latencies are determined solely by the latencies of the multipliers (w = 4, N = 5, and p = 1 are fixed). We choose the digit sizes of the multipliers in the preprocessor and the postprocessor, D1 and D3, so that they are the smallest digit sizes that still ensure that the bottleneck (i.e., the longest latency) is in the main processor. We call a setup (D1, D2, D3) that satisfies this condition a balanced setup.

The (average) latency of the conversion is constant: 335 clock cycles; i.e., the latency cannot be tuned by choosing design parameters (e.g., multiplier digit size). Moreover, the converter uses a different clock than the rest of the architecture. Based on the work in [20], we assume that we can use an about 2.2 times faster clock for the other parts of the architecture (in [20], we had 85 MHz and 185 MHz clocks). Hence, we estimate that the latency of the converter corresponds to about 2.2 × 335 ≈ 740 clock cycles with the other clock. The latency of the main processor should not get shorter than this latency or else the converter will become the bottleneck, and all area that is used for making the main processor faster does not improve the overall throughput of the processor. Based on Table 1, it is clear that if D2 > 17, the latency of the main processor is smaller than 740. In Sec. 4.3, we derived a lower bound of D2 >= 6, the same digit size that ensured that the Frobenius maps do not appear on the critical path. Hence, only digit sizes 6 <= D2 <= 17 are viable. The balanced setups satisfying these constraints are collected in Table 2.

Table 2: Balanced setups
Setup  D1 (M1)  D2 (M2)  D3 (M3)
1      2 (83)    6 (29)   1 (164)
2      3 (56)    7 (25)   2 (83)
3      3 (56)    8 (22)   2 (83)
4      3 (56)    9 (20)   2 (83)
5      4 (42)   10 (18)   2 (83)
6      4 (42)   11 (16)   2 (83)
7      5 (34)   12 (15)   2 (83)
8      5 (34)   13 (14)   3 (56)
9      6 (29)   14 (13)   3 (56)
10     7 (25)   15 (12)   3 (56)
11     7 (25)   17 (11)   3 (56)

Because there is no trustworthy method to get exact (post-place&route) area requirements of an FPGA implementation other than running synthesis and place&route for the design, we determined throughput-area ratios by compiling processors with different balanced setups for a Stratix IV GX 4SGX230NC2 with Quartus II ver. 10.1. This FPGA is used, for instance, in the Stratix IV GX FPGA Development Board [28]. The results are collected in Table 3. We started compilations from setup 11 downwards (see Table 2) and it soon became apparent that the best throughput-area ratio is achieved with setup 11, the largest balanced setup. This means that the maximum throughput-area ratio of the main processor is achieved with D2 >= 17. Unfortunately, as noted above, we cannot utilize setups with D2 > 17 because then the throughput will be bounded by the throughput of the converter, which roughly equals the throughput of the main processor with D2 = 17.

As shown in Table 3, setup 11 occupies approximately 15 % of the resources (ALMs) available on the FPGA. This implies that we could fit six such processors on a single chip by using approximately 90 % of the resources. However, area requirements tend to grow more than linearly when the size of the design approaches the limits of the device because place&route has a more difficult task to fulfill timing constraints. Hence, we use only five parallel instances of the processor in our implementation. That is, we prepared a prototype implementation using the architecture depicted in Fig. 5 with T = 5. The results for this implementation are given in Sec. 6.

6. Results

The parallel architecture shown in Fig. 5 and described in Sec. 4 and 5 was realized with the parameters obtained in Sec. 5, i.e., T = 5, D1 = 7, D2 = 17, and D3 = 3. The implementation was described in VHDL and compiled for an Altera Stratix IV GX EP4SGX230KF40C2 FPGA with Quartus 10.1. We emphasize that the processor architecture is not restricted to any specific FPGA but the optimal parameters may vary between different FPGAs.

The area consumptions are collected in Table 4. Timing constraints of 120 MHz and 266 MHz were set for the τ-adic converter and the rest of the processor, respectively, and the compiler was able to meet these constraints.

Using the clock frequencies of 120 MHz and 266 MHz and the latencies derived in Sec. 4.5, we get the following average timings: width-4 τNAF conversion 335/120 = 2.79 µs, precomputations 777/266 = 2.92 µs (constant), for-loop 785.4/266 = 2.95 µs, and coordinate conversion 675/266 = 2.53 µs (constant). The throughput of the processor is bounded by the main processor; hence, the theoretical maximum throughput is 338,680 scalar multiplications per second for a single processor and 1,693,000 scalar multiplications per second for the parallel implementation of five processors. The average timing for a single scalar multiplication is 8.3 µs.

7. Conclusions

We described a parallel implementation of elliptic curve cryptography with several parallel instances of the processor introduced in [20]. This implementation is capable of delivering a theoretical maximum throughput of about 1,700,000 scalar multiplications per second on the standardized elliptic curve NIST K-163 [9].

Contrary to many other published works, this study and the earlier versions of this work presented in [19], [20] focused on maximizing the throughput, i.e., scalar multiplications per second, instead of minimizing the computation
Table 3: Throughput-area ratios of selected processors with balanced setups on Stratix IV GX FPGA
Setup  ALUTs   Regs    ALMs    Memory M9K  Time (µs)  Throughput (1/s)  Throughput/Area (1/(s·ALM))
11     16,001  12,377  13,768  21          8.3        339,000           26.51
10     14,568  12,360  13,163  21          8.6        314,000           23.87
9      14,531  12,371  12,778  21          9.1        293,000           21.28
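The latencies of Table 1 together with the clock frequencies of Sec. 6 reproduce the reported figures. A short check (M1 = 25, M2 = 11, M3 = 56 are the setup-11 latencies from Table 2; taking w in the for-loop formula as the average τNAF weight ≈ 32.6 is our inference from the quoted 785.4 cycles):

```python
# Reproduce the average timings and throughput quoted in Sec. 6 from
# the latency formulas of Table 1 and the setup-11 parameters.
f_conv, f_proc = 120e6, 266e6       # converter and processor clocks

M1, M2, M3 = 25, 11, 56             # multiplier latencies, setup 11
w, N, p = 32.6, 5, 1                # w = average tauNAF weight (inferred)

conversion = 335                    # width-4 tauNAF conversion (cycles)
precomp = 18 * M1 + 327             # 777 cycles
forloop = 2 * (w - 2) * (M2 + 1) + 3 * (2 * N + p) + 18  # 785.4 cycles
coordconv = 11 * M3 + 59            # 675 cycles

print(precomp, coordconv, round(forloop, 1))  # 777 675 785.4
print(round(conversion / f_conv * 1e6, 2))    # 2.79 us
print(round(precomp / f_proc * 1e6, 2))       # 2.92 us
print(round(forloop / f_proc * 1e6, 2))       # 2.95 us
print(round(coordconv / f_proc * 1e6, 2))     # 2.54 us (2.53 in Sec. 6)
# The pipeline is limited by its slowest stage, the for-loop:
print(int(f_proc / forloop))                  # 338680 scalar mult./s
print(int(5 * f_proc / forloop))              # 1693404, i.e., ~1,693,000/s
```

The single-processor throughput thus equals the processor clock divided by the for-loop latency, confirming that the for-loop stage bounds the pipelined throughput.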
[19] K. U. Järvinen and J. O. Skyttä, "High-speed elliptic curve cryptography accelerator for Koblitz curves," in Proc. 16th IEEE Symp. Field-Programmable Custom Computing Machines, FCCM 2008. IEEE Computer Society, 2008, pp. 109–118.
[20] K. Järvinen, "Optimized FPGA-based elliptic curve cryptography processor for high-speed applications," Integration, the VLSI Journal, in press.
[21] T. Itoh and S. Tsujii, "A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases," Inform. Comput., vol. 78, no. 3, pp. 171–177, Sept. 1988.
[22] M. A. Hasan, "Look-up table-based large finite field multiplication in memory constrained cryptosystems," IEEE Trans. Comput., vol. 49, no. 7, pp. 749–758, July 2000.
[23] K. U. Järvinen, "On repeated squarings in binary fields," in Selected Areas in Cryptography, SAC 2009, ser. Lecture Notes in Comput. Sci., vol. 5867. Springer, 2009, pp. 331–349.
[24] J. López and R. Dahab, "Improved algorithms for elliptic curve arithmetic in GF(2^n)," in Selected Areas in Cryptography, SAC'98, ser. Lecture Notes in Comput. Sci., vol. 1556. Springer, 1999, pp. 201–212.
[25] E. Al-Daoud, R. Mahmod, M. Rushdan, and A. Kilicman, "A new addition formula for elliptic curves over GF(2^n)," IEEE Trans. Comput., vol. 51, no. 8, pp. 972–975, Aug. 2002.
[26] J. A. Solinas, "Efficient arithmetic on Koblitz curves," Des. Codes Cryptography, vol. 19, no. 2–3, pp. 195–249, 2000.
[27] B. B. Brumley and K. U. Järvinen, "Conversion algorithms and implementations for Koblitz curve cryptography," IEEE Trans. Comput., vol. 59, no. 1, pp. 81–92, 2010.
[28] Altera, "Stratix IV GX FPGA development board: Reference manual," Aug. 2010, http://www.altera.com/literature/manual/rm_sivgx_fpga_dev_board.pdf.
[29] J. Adikari, V. Dimitrov, and K. Järvinen, "A fast hardware architecture for integer to τNAF conversion for Koblitz curves," IEEE Trans. Comput., in press.
[30] T. Güneysu, T. Kasper, M. Novotný, C. Paar, and A. Rupp, "Cryptanalysis with COPACOBANA," IEEE Trans. Comput., vol. 57, no. 11, pp. 1498–1513, Nov. 2008.