
High Performance ECDSA over $F(2^n)$
based on Java with Hardware Acceleration


Markus Ernst (Integrated Circuits and Systems Lab)
Birgit Henhapl (Cryptography and Computer Algebra)
Department of Computer Science
Darmstadt University of Technology
Germany

{ernst@vlsi, birgit@cdc}.informatik.tu-darmstadt.de

Keywords: Public Key Cryptography, ECDSA, Java and JCA, VHDL Model Generator, FPGA
based Hardware Acceleration

Abstract
Many E-Commerce applications are characterized, for instance, by their demand for confidential
data exchange via public communication networks (e.g., Internet). These data exchanges must be pro-
tected from fraudulent access by third parties. One way is to use public key crypto systems based on
elliptic curves. They gain more and more acceptance, since they provide high security in spite of their
small key sizes. We introduce an elliptic curve based crypto provider, featuring ECDSA, within the
Java Cryptography Architecture for the sake of flexibility and platform independence. Furthermore, we
present an FPGA-based CryptoProcessor, which raises the performance significantly. The design of the
CryptoProcessor is supported by a custom VHDL model generator.

1 Introduction
Many E-Commerce applications are characterized, for instance, by their demand for confidential data
exchange via public communication networks (e.g., Internet). These data exchanges must be protected
from fraudulent access by third parties. The basic technology, which can warrant this kind of protection,
is known as Public-Key Cryptography.
Besides the widely-used RSA method, Public-Key methods based on elliptic curves (EC) have
gained more importance because they are believed to give higher security per key bit, i.e. one can
work with shorter keys [1] (1024 RSA-bits are equivalent to 160 EC-bits). The smaller key size permits
a more cost-efficient implementation and higher throughput.
This article describes an implementation of the ECDSA digital signature algorithm over $F(2^n)$ based
on Java and the Java Cryptography Architecture (JCA), which is a de-facto standard platform for
public-key software implementations. The basic operation in EC cryptography is the point multiplication
(k·P). This operation is complex and its computation is very time consuming, so the
time required to compute k·P essentially determines the performance of EC algorithms like ECDSA.
For ECDSA within server-based cryptosystems (e.g., online banking servers), the performance
of pure software implementations is not sufficient.
To overcome this problem, an Elliptic-Curve CryptoProcessor, which implements the k·P multiplication
in hardware, was developed at our institute. This hardware implementation is based on a
reconfigurable logic device (FPGA) mounted on a PCI card, so that system integration can be done
easily via the PCI interface. We show that the performance of the ECDSA algorithm can be enhanced
significantly by the use of this processor.
The mathematical background for finite fields and elliptic curves is explained in the following sec-
tion. Section 3 considers the ECDSA algorithm. The implementation of the proposed CryptoProcessor
together with the corresponding design flow is illustrated in Section 4, followed by conclusions.

2 Background
In this section, we introduce some basic notation and definitions. We start by introducing the finite
field $F(2^n)$, its representation and its arithmetic, and then go on to elliptic curves over $F(2^n)$. In this
paper we discuss neither the generation of elliptic curves nor the security aspects. For the latter,
please consult [2].

2.1 Optimal Normal Bases


Let $F(2^n)$ be the extension field with $2^n$ elements and extension degree $n$. A normal base for $F(2^n)$ is
a set $B = \{\Theta^{2^0}, \Theta^{2^1}, \Theta^{2^2}, \ldots, \Theta^{2^{n-1}}\}$, where the elements of $B$ are linearly independent. There is
a normal base for each positive integer $n$. If $n$ is not divisible by 8, then $F(2^n)$ has a Gaussian normal
base, with which the multiplication is simpler and faster than with non-Gaussian normal bases.
The type T of a normal base is an integer, which measures the complexity of the multiplication
operation of that base. The smaller type T is, the smaller is the complexity and the more efficient the
operation. Bases with type T = 1 or T = 2 are called optimal normal bases.
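A field $F(2^n)$ has a Gaussian normal base of type $T$ if and only if all of the following conditions hold:
• $n$ is not divisible by 8
• $p = T \cdot n + 1$ is a prime
• $\gcd(n, T \cdot n / k) = 1$, where $k$ is the multiplicative order of 2 modulo $p$, i.e. the smallest positive integer with $2^k \equiv 1 \pmod{p}$.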
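The three conditions above can be checked mechanically. The following is a minimal sketch, not part of the implementation described in this paper; the class and method names are chosen here for illustration only, and the inputs are assumed to be small enough for `int` arithmetic.

```java
import java.math.BigInteger;

/** Illustrative check of the existence conditions for a Gaussian normal base of type T. */
final class GaussianNormalBaseCheck {

    /** Multiplicative order of 2 modulo an odd prime p: smallest k with 2^k = 1 mod p. */
    static int orderOfTwo(int p) {
        int k = 1;
        long x = 2 % p;
        while (x != 1) {
            x = (2 * x) % p;
            k++;
        }
        return k;
    }

    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    /** Returns true if F(2^n) has a Gaussian normal base of type t. */
    static boolean hasGaussianNormalBase(int n, int t) {
        if (n % 8 == 0) {
            return false;                       // condition 1: n not divisible by 8
        }
        int p = t * n + 1;
        if (p == 2 || !BigInteger.valueOf(p).isProbablePrime(50)) {
            return false;                       // condition 2: p = T*n + 1 is an (odd) prime
        }
        int k = orderOfTwo(p);                  // k divides p - 1 = T*n
        return gcd(n, t * n / k) == 1;          // condition 3
    }

    public static void main(String[] args) {
        // Field sizes mentioned later in this paper, tested for type T = 2.
        for (int n : new int[] {18, 173, 191, 270}) {
            System.out.println("n = " + n + ", type 2: " + hasGaussianNormalBase(n, 2));
        }
    }
}
```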

We will represent an element $\alpha = \alpha_0 \Theta^{2^0} + \alpha_1 \Theta^{2^1} + \alpha_2 \Theta^{2^2} + \cdots + \alpha_{n-1} \Theta^{2^{n-1}}$
by the bitstring $(\alpha_0 \alpha_1 \alpha_2 \ldots \alpha_{n-1})$. In this work we only use optimal normal bases of type $T = 2$ and therefore
keep the explanations at this level.
Let $\alpha = (\alpha_0 \alpha_1 \alpha_2 \ldots \alpha_{n-1})$ and $\beta = (\beta_0 \beta_1 \beta_2 \ldots \beta_{n-1})$ be two elements of $F(2^n)$. Then the
sum $\gamma = \alpha + \beta$ is
$$\gamma = (\alpha_0 + \beta_0 \;\; \alpha_1 + \beta_1 \;\; \alpha_2 + \beta_2 \;\ldots\; \alpha_{n-1} + \beta_{n-1}) . \tag{1}$$
So the addition is simply an addition of the coefficients $\alpha_i + \beta_i \bmod 2$, i.e. an XOR, so this
field operation can be done very fast.
The multiplication is more complicated. First compute the sequence $S(1), S(2), \ldots, S(p-1)$, with $p = 2n + 1$, as follows:
1. Set $m \leftarrow 1$.
2. For $i$ from 0 to $n - 1$ do
(a) Set $S(m) \leftarrow i$.
(b) Set $m \leftarrow 2m \bmod p$.
3. Set $m \leftarrow p - 1$ and repeat step 2.
Let $F(2^n)$ be a field with the Gaussian normal base $B$ of type $T = 2$ and let $\alpha, \beta \in F(2^n)$.
Then the first coefficient $\gamma_0$ of the product $\gamma = \alpha \cdot \beta$ is
$$\gamma_0 = \sum_{k=1}^{p-2} \alpha_{S(k+1)} \, \beta_{S(p-k)} . \tag{2}$$

The other coefficients of this product are obtained from the formula for $\gamma_0$ by cyclically shifting the
subscripts of $\alpha$ and $\beta$ to the left modulo $n$. In practice the sequence $S$ can be stored in an integer matrix of size $2 \times n$.
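The sequence $S$ and equation (2) can be sketched in software as follows. This is an illustrative bit-by-bit version only, assuming that a type-2 optimal normal base exists for the chosen $n$ (for example $n = 3$, $p = 7$); the class name and the use of `boolean[]` coefficient vectors are choices made here, not the data structures of the provider described in this paper.

```java
import java.util.Arrays;

/** Illustrative multiplication in F(2^n) with a type-2 optimal normal base. */
final class OnbMultiplier {

    /** Builds S(1), ..., S(p-1) for p = 2n + 1; array index 0 is unused. */
    static int[] sequence(int n) {
        int p = 2 * n + 1;
        int[] s = new int[p];
        int[] starts = {1, p - 1};          // step 1 (m = 1) and step 3 (m = p - 1)
        for (int start : starts) {
            int m = start;
            for (int i = 0; i < n; i++) {   // step 2
                s[m] = i;
                m = (2 * m) % p;
            }
        }
        return s;
    }

    /** Product of a and b, given as coefficient vectors (a_0 ... a_{n-1}) over F(2). */
    static boolean[] multiply(boolean[] a, boolean[] b) {
        int n = a.length;
        int p = 2 * n + 1;
        int[] s = sequence(n);
        boolean[] c = new boolean[n];
        for (int i = 0; i < n; i++) {
            // Coefficient i uses equation (2) with all subscripts shifted by i modulo n.
            boolean bit = false;
            for (int k = 1; k <= p - 2; k++) {
                bit ^= a[(s[k + 1] + i) % n] & b[(s[p - k] + i) % n];
            }
            c[i] = bit;
        }
        return c;
    }

    public static void main(String[] args) {
        boolean[] theta = {true, false, false};     // the base element Theta in F(2^3)
        System.out.println(Arrays.toString(multiply(theta, theta)));
    }
}
```

Each output coefficient is an XOR over $2n - 1$ AND terms, which is the structure that the hardware multiplier discussed later evaluates in parallel.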
The inversion of an element $\beta$ of $F(2^n)$ is even more expensive: for any nonzero element $\beta \in F(2^n)$,
$\beta^{-1} = \beta^{2^n - 2}$. There are several algorithms that perform the inversion more efficiently than straightforward
squaring and multiplication (e.g., see [3]). Even so, each inversion needs several field multiplications.
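The inversion identity follows from the fact that the nonzero field elements form a multiplicative group of order $2^n - 1$. As a short worked step (covering only the straightforward method, not the faster algorithms of [3]):

```latex
\beta^{2^n - 1} = 1
\;\Longrightarrow\;
\beta^{-1} = \beta^{2^n - 2},
\qquad
2^n - 2 = \sum_{i=1}^{n-1} 2^i
\;\Longrightarrow\;
\beta^{-1} = \prod_{i=1}^{n-1} \beta^{2^i} .
```

Evaluated directly, this product of $n - 1$ factors costs $n - 2$ field multiplications in addition to the squarings.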
In contrast to these operations, computing squares and square roots can be done efficiently, especially
in hardware, since these operations are circular shifts:
Let $\alpha = (\alpha_0 \alpha_1 \ldots \alpha_{n-2} \alpha_{n-1})$ be an element of $F(2^n)$. Then

$$\alpha^2 = (\alpha_{n-1} \alpha_0 \alpha_1 \ldots \alpha_{n-2})$$
and

$$\sqrt{\alpha} = (\alpha_1 \ldots \alpha_{n-2} \alpha_{n-1} \alpha_0).$$

All these algorithms can be found in [3].
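The cheap operations can be sketched as follows: addition as a coefficient-wise XOR, squaring as a cyclic right shift and the square root as a cyclic left shift. The class name and the `boolean[]` representation are illustrative assumptions, not the representation used in the provider.

```java
/** Illustrative addition, squaring and square root in normal base representation. */
final class OnbBasics {

    /** Addition: coefficient-wise XOR, see equation (1). */
    static boolean[] add(boolean[] a, boolean[] b) {
        boolean[] c = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] ^ b[i];
        }
        return c;
    }

    /** Squaring: (a_0 a_1 ... a_{n-1}) becomes (a_{n-1} a_0 ... a_{n-2}). */
    static boolean[] square(boolean[] a) {
        int n = a.length;
        boolean[] c = new boolean[n];
        for (int i = 0; i < n; i++) {
            c[(i + 1) % n] = a[i];
        }
        return c;
    }

    /** Square root: (a_0 a_1 ... a_{n-1}) becomes (a_1 ... a_{n-1} a_0). */
    static boolean[] sqrt(boolean[] a) {
        int n = a.length;
        boolean[] c = new boolean[n];
        for (int i = 0; i < n; i++) {
            c[i] = a[(i + 1) % n];
        }
        return c;
    }

    public static void main(String[] args) {
        boolean[] alpha = {true, false, false, true, false};
        System.out.println(java.util.Arrays.toString(sqrt(square(alpha))));
    }
}
```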

2.2 Elliptic Curves over $F(2^n)$


In the following we will again exclusively consider the field $F(2^n)$. Formulas and algorithms are again
taken from [3]. The cubic equation

$$y^2 + xy = x^3 + ax^2 + b, \tag{3}$$

with $x, y, a, b \in F(2^n)$ is called the Weierstrass equation for the field $F(2^n)$. An elliptic curve $E$ over
$F(2^n)$ is the set of pairs $(x, y) \in F(2^n) \times F(2^n)$ solving equation (3), where $a, b \neq 0$:

$$E : \{(x, y) : y^2 + xy + x^3 + ax^2 + b = 0,\; x, y, a, b \in F(2^n),\; a, b \neq 0\} \tag{4}$$

In the following we will denote a curve $E$ over the field $F(2^n)$ as $E(F(2^n))$.
The points, along with a point at infinity (denoted by $O$) and an inner operation called point addition,
form an additive group, where $O$ is the neutral element [6]. The order $r$ of this group is the number of
points on the curve, including the point $O$. By Hasse's bound the order $r$ of an elliptic curve over the field
$F(q)$ is approximately $q$: $q - 2\sqrt{q} + 1 \le r \le q + 2\sqrt{q} + 1$. For a proof see [6].
The point addition is defined as follows: Let $P = (x_0, y_0)$ and $Q = (x_1, y_1)$, with $x_0, y_0, x_1, y_1 \in
F(2^n)$. Then $R = (x_2, y_2) = P + Q$, with $x_2, y_2 \in F(2^n)$, is given by the special cases

$$R = P + O = P \quad\text{and}\quad R = P + (-P) = O,$$

where $-P = (x_0, y_0 + x_0)$, and otherwise by

$$x_2 = a + \lambda^2 + \lambda + x_0 + x_1 \quad\text{and} \tag{5}$$
$$y_2 = (x_1 + x_2)\lambda + x_2 + y_1, \quad\text{where} \tag{6}$$
$$\lambda = \frac{y_0 - y_1}{x_0 - x_1}, \quad\text{for } P \neq Q \text{ and} \tag{7}$$
$$\lambda = x_1 + \frac{y_1}{x_1}, \quad\text{for } P = Q. \tag{8}$$

These formulas are valid only for curves over the field $F(2^n)$. For general addition rules see, for
example, [7] or [6].
Since the points form an additive group, there is no inner group operation like multiplication.
Even so, repeated point additions like

$$\underbrace{P + P + \cdots + P}_{r \text{ times}} = r \cdot P = R,$$

with $P, R \in E(F(2^n))$, are sometimes considered as one operation called point multiplication. With
this operation we obtain a problem analogous to the discrete logarithm problem (DLP) over finite fields:
Let $P$ and $R$ be points on the curve $E(F(2^n))$ and $r$ an integer with $r \cdot P = R$. Then $r$ is the discrete
logarithm of $R$ to the base $P$. Therefore cryptographic algorithms based on discrete logarithms over
finite fields can be modified to algorithms based on the discrete logarithm problem of a group of points
(ECDLP), which is believed to be very hard to solve. The currently best algorithm attacking
the ECDLP is the Pohlig-Hellman algorithm, which has exponential complexity. Therefore it is assumed
to be safe to use finite fields of size $2^{160}$ for cryptographic algorithms based on the ECDLP, whereas
for the most common algorithm, RSA, which is based on the factorization problem, a size of $2^{1024}$ is
recommended.
As we have seen, each point addition or doubling requires one inversion, which is very expensive
(see 2.1). Therefore we avoid this costly operation in point arithmetic by using projective coordinates
as proposed in [3].
Let $P^2(F(2^n))$ be the projective plane over $F(2^n)$. Then one projective representation of the Weierstrass
equation has the following form:

$$\frac{y^2}{z^2} + \frac{xy}{z^2} = \frac{x^3}{z^3} + a\,\frac{x^2}{z^2} + b,$$

where $x^* = x/z$ and $y^* = y/z$ are the affine coordinates, and where two points $Q = (x_Q, y_Q, z_Q)$ and
$R = (x_R, y_R, z_R)$ on the same elliptic curve $E$ are equal if and only if

$$\frac{x_R}{z_R} = \frac{x_Q}{z_Q} \quad\text{and}\quad \frac{y_R}{z_R} = \frac{y_Q}{z_Q}.$$

Then we can add two points as follows:
Let $P = (x_1, y_1, z_1)$ and $Q = (x_2, y_2, z_2)$, where $P, Q \neq O$ and $P \neq -Q$. Then $R = P + Q = (x_3, y_3, z_3)$
is, for $P \neq Q$:

$$\begin{aligned}
x_3 &= AD \\
y_3 &= CD + A^2(Bx_1 + Ay_1) \\
z_3 &= A^3 z_1 z_2
\end{aligned} \tag{9}$$

where $A = x_2 z_1 + x_1 z_2$, $B = y_2 z_1 + y_1 z_2$, $C = A + B$ and $D = A^2(A + a z_1 z_2) + z_1 z_2 BC$.

If $P = Q$, then

$$\begin{aligned}
x_3 &= AB \\
y_3 &= x_1^4 A + B(x_1^2 + y_1 z_1 + A) \\
z_3 &= A^3
\end{aligned} \tag{10}$$

where $A = x_1 z_1$, $B = b z_1^4 + x_1^4$.


Thus, one point addition P + Q requires 13 field multiplications and one point doubling 2P requires
7 multiplications. A full double and add requires 20 multiplications. In the following we will denote a
point addition and doubling as EC-ADD and EC-Double and a field multiplication, addition and squaring
as FF-Mult, FF-Add and FF-Square, respectively.
The multiplication of an elliptic curve point $P$ by some $k > 1$ is performed as repeated double and
add of the base point $P$ using the above equations (9) and (10) for $x_3, y_3, z_3$:
Let $k = k_r k_{r-1} \ldots k_1 k_0$ with $k_i \in \{0, 1\}$ be the binary representation of $k$. The following algorithm
computes $R = k \cdot P$:
1. Set $R \leftarrow O$.
2. From $i = r$ down to 0 do
(a) Set $R \leftarrow 2R$.
(b) If $k_i = 1$ set $R \leftarrow R + P$.
3. Output $R$.
With this algorithm each k·P multiplication requires about $n$ EC-Double operations, one per bit of $k$,
and $h$ EC-Add operations, one per nonzero bit of $k$ (see Fig. 1). As EC-Double is cheaper in terms of
multiplications than EC-Add, the performance of the
algorithm benefits from a key $k$ with low Hamming weight $h$.
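The double-and-add loop does not depend on how points are represented, so it can be sketched generically. In the following illustration the group is passed in as an addition function and a neutral element; the class name is hypothetical, and the usage example below substitutes the integers modulo 1009 under addition for an elliptic curve group purely to keep the sketch self-contained.

```java
import java.math.BigInteger;
import java.util.function.BinaryOperator;

/** Illustrative double-and-add over an arbitrary additive group. */
final class DoubleAndAdd {

    /**
     * Computes k * p, scanning the bits of k from the most significant one down.
     * The loop performs one doubling per bit of k and one addition per nonzero bit.
     */
    static <T> T multiply(BigInteger k, T p, T neutral, BinaryOperator<T> add) {
        T r = neutral;                          // R <- O
        for (int i = k.bitLength() - 1; i >= 0; i--) {
            r = add.apply(r, r);                // R <- 2R   (EC-Double)
            if (k.testBit(i)) {
                r = add.apply(r, p);            // R <- R + P (EC-Add)
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Stand-in group: integers modulo 1009 under addition.
        BinaryOperator<Integer> addMod = (x, y) -> (x + y) % 1009;
        System.out.println(multiply(BigInteger.valueOf(1000), 7, 0, addMod));
    }
}
```

With a point type and the projective formulas (9) and (10) plugged in as the addition function, the same loop yields the k·P multiplication.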
The EC arithmetic in turn is based on the underlying finite field arithmetic (see section 2.1). As
already mentioned, in $F(2^n)$ the addition is reduced to an XOR of the corresponding bits. This is
a very simple operation in hardware. Also, squaring can be easily performed by a cyclic shift because
of the utilized ONB representation. To summarize, FF-Add and FF-Square can be implemented very
efficiently, so that the performance of the EC arithmetic is mainly determined by the time required for
the computation of FF-Mult. For the implementation of FF-Mult the Massey-Omura architecture is used
[10]. In this architecture all bits of the result vector can be computed independently from each other, so
that very area-efficient bit-serial implementations as well as high-performance parallel implementations
are feasible. For one k·P computation 7n + 13h FF-Mult operations have to be done, which results
in n(7n + 13h) Massey-Omura elementary operations taking about log n time each. A more detailed
description of ONBs and how multiplication is done in ONB representation can be found in [11] and
[12].
Example: For n = 270 and h = 40, with a pure bit-serial implementation of the Massey-Omura
multiplier, one computation of Q = k·P requires 650700 multiplier iterations.
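The number in this example follows directly from the operation count $n(7n + 13h)$ given above; as a worked check:

```latex
n\,(7n + 13h) = 270 \cdot (7 \cdot 270 + 13 \cdot 40)
             = 270 \cdot (1890 + 520)
             = 270 \cdot 2410
             = 650\,700 .
```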

Figure 1: Double-and-Add algorithm

3 ECDSA
ECDSA is a digital signature algorithm derived from DSA and based on the ECDLP (see 2.2). It is
by now widely known and accepted, since it has been included in the IEEE standard P1363 [3] and in
ANSI X9.62 [4]. In the following sections we introduce ECDSA as specified in [3] together with its use, and we
explain why we chose Java as the implementation platform.

3.1 EC Domain Parameters


Every cryptographic algorithm based on the ECDLP has a set of ec domain parameters. This set is
required for all kinds of cryptographic transactions, which are based on the ECDLP, like creating key
pairs or signatures. It specifies the parameters, that are used for the algorithms:
q is the size of the underlying field $F(q)$, which can be a large prime $p$ or a prime power $p^n$.
In our case $q = 2^n$.
a, b ∈ $F(q)$ are the parameters of the elliptic curve E. For our use, a and b are field elements of $F(2^n)$ in
ONB representation.
G ∈ $E(F(2^n))$ is a point of order r. That means G generates a subgroup of $E(F(2^n))$ of order r. G is
called the basepoint.
r is the order of the basepoint G (ord G = r), see above.
h is called the cofactor and is the ratio of the curve order to the order r of G: h = ord E / ord G.

3.2 ECDSA Key Pair


The ECDSA key pair is a pair (s, W ), where s is the private key and W the public key. This key pair is
generated as follows:
1. Obtain a set of ec domain parameters.
2. Generate a random integer s in the range [1, r − 1], so that s is unpredictable.
3. Compute the point $W = s \cdot G$. Since ord G = r and s < r, $W \neq O$.
4. Output key pair (s, W ).

3.3 ECDSA Signature Generation
Let $f$ be a message representative, an integer with $f \ge 0$. The digital signature generated by ECDSA
is a pair $(c, d)$ with $1 \le c, d < r$, which is computed as follows:
1. Generate a one-time key pair $(u, V)$ with the same set of ec domain parameters used for the
generation of $(s, W)$, where $u$ is a random integer in the range $[1, r - 1]$ and $V = u \cdot G =
(x_V, y_V)$. Again, since ord G = r and u < r, $V \neq O$.
2. Convert $x_V$ to an integer $i$.
3. Compute the integer $c = i \bmod r$; if $c = 0$, go to step 1.
4. Compute the integer $d = u^{-1}(f + sc) \bmod r$; if $d = 0$, go to step 1.
$(c, d)$ is the digital signature of the message representative $f$ associated with its ec domain parameters and
the private key $s$.

3.4 ECDSA Signature Verification


To verify a digital signature $(c, d)$ associated with a set of ec domain parameters, the public key $W$ of a
key pair $(s, W)$ and a message representative $f$, one performs the following operations:
1. If $c$ or $d$ is not in the interval $[1, r - 1]$, output invalid and stop.
2. Compute the integers $h = d^{-1} \bmod r$, $h_1 = f h \bmod r$ and $h_2 = c h \bmod r$.
3. Compute the elliptic curve point $P = h_1 \cdot G + h_2 \cdot W$. If $P = O$, output invalid and stop. Else
$P = (x_P, y_P)$.
4. Convert the field element $x_P$ to an integer $i$.
5. Compute the integer $c' = i \bmod r$.
6. If $c' = c$, output valid; else output invalid.
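That a correctly generated signature passes this check can be seen from a short derivation. It uses $W = s \cdot G$ from the key pair generation, $V = u \cdot G$ and $d = u^{-1}(f + sc) \bmod r$ from the signature generation, and the fact that scalars may be reduced modulo $r$ because ord G = r:

```latex
P = h_1 \cdot G + h_2 \cdot W
  = \left(f d^{-1} + c\, d^{-1} s\right) \cdot G
  = d^{-1}(f + sc) \cdot G
  = u \cdot G
  = V .
```

Hence $x_P = x_V$, and the integer $c'$ computed by the verifier equals the $c$ computed by the signer.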
The operations of this algorithm are not bound to a specific field, with the exception of the conversion
mechanism in step (2) of the signature generation and step (4) of the signature verification.
In our case the elliptic curve is defined over the field $F(2^n)$, so the elements $x_V$ and $x_P$ are
field elements of $F(2^n)$ and are therefore represented with respect to an optimal normal base $B$. The
conversion is as follows:
Let $e = (e_0 e_1 \ldots e_{n-1})$ be one of these elements. The bitstring $e$ is padded with enough zeroes on
the left to make its length a multiple of 8. Then it is broken into $l$ octets $o_{l-1}, \ldots, o_1, o_0$,
with $l = \lceil n/8 \rceil$. The corresponding integer is then $i = o_{l-1} \cdot 256^{l-1} + \cdots + o_1 \cdot 256^1 + o_0 \cdot 256^0$.
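The conversion can be sketched as follows. The field element is assumed here to be given as a string of '0' and '1' characters with $e_0$ first; this input format and the class name are illustrative choices, not the interface of the provider.

```java
import java.math.BigInteger;

/** Illustrative conversion of a field element bitstring to an integer via octets. */
final class FieldElementToInteger {

    /** bits holds e_0 ... e_{n-1} as '0'/'1' characters, e_0 first. */
    static BigInteger toInteger(String bits) {
        int l = (bits.length() + 7) / 8;                // l = ceil(n / 8) octets
        StringBuilder padded = new StringBuilder();
        for (int i = bits.length(); i < 8 * l; i++) {
            padded.append('0');                         // pad with zeroes on the left
        }
        padded.append(bits);

        BigInteger result = BigInteger.ZERO;
        BigInteger base = BigInteger.valueOf(256);
        for (int j = 0; j < l; j++) {
            // The leftmost octet is o_{l-1}; Horner's rule evaluates the base-256 sum.
            int octet = Integer.parseInt(padded.substring(8 * j, 8 * j + 8), 2);
            result = result.multiply(base).add(BigInteger.valueOf(octet));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toInteger("100000001"));
    }
}
```

Because the padding only adds leading zeroes, the result coincides with reading the whole bitstring as one binary number.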

3.5 The Implementation Platform Java


As explained in Section 3.1, ECDSA is an ECDLP-based algorithm, which
gives us the freedom to choose among arbitrary finite fields. That gives us a greater set of possible
ec domain parameters and therefore a greater set of key pairs. It is desirable to exploit these
possibilities, for example by implementing a generic ECDSA that uses the arithmetic of arbitrary
finite fields, given through the ec domain parameters. This can easily be done in any object-oriented
language or even with templates. But we can go one step further: Since ec domain parameters and
ec key pairs are always the same (generic) sets, the next obvious step is to reuse implementations of
ec domain parameters and key pairs for different cryptographic algorithms based on the ECDLP. Java
gives us the required facilities with its Java Cryptography Architecture (JCA) and its extension, the Java
Cryptography Extension (JCE):
The JCA refers to a framework for accessing and developing cryptographic functionality for the Java
platform. It was first introduced in JDK 1.1 and has by now been extended to include, along with the JCE,
APIs (Application Programming Interfaces) for digital signatures, message digests, encryption,
key exchange, and Message Authentication Codes (MAC). The JCA includes a provider architecture that
allows for multiple, interoperable cryptography implementations.
Furthermore, this architecture defines plain interfaces between user applications and provider
implementations. A user can rely upon a fixed sequence of operations to use any of the primitives named
above. We demonstrate the user interface with an example of key pair generation, signature
generation and verification for ECDSA. Let us assume that there is already a set of ec domain parameters,
stored in an instance ps of AlgorithmParameterSpec. This example does not claim completeness or syntactical
correctness; it is only for illustration.
Key Pair Generation:
1. KeyPairGenerator kpg = KeyPairGenerator.getInstance("ECDSA");
2. kpg.initialize(ps);
3. KeyPair kp = kpg.generateKeyPair();
Signature Generation:
1. Signature sig = Signature.getInstance("ECDSA");
2. sig.initSign(kp.getPrivate());
3. byte[] signature = sig.sign();
Signature Verification (assuming the public key is already stored in an instance pk of PublicKey):
1. Signature sig = Signature.getInstance("ECDSA");
2. sig.initVerify(pk);
3. sig.verify(signature);
The scheme is more or less always the same: Step (1) - obtain an instance of the class, which
implements the desired primitive, step (2) - initialize this instance and step (3) - carry out your intended
operation.
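For comparison, the same three-step scheme can be written as a complete program against the standard JCA classes. This sketch makes several assumptions that differ from the setup in this paper: it uses the default EC provider of a current JDK instead of the cdcProvider, the algorithm names "EC" and "SHA256withECDSA" instead of "ECDSA", and the prime-field curve secp256r1 instead of a curve over $F(2^n)$. The class name is chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.security.Signature;
import java.security.spec.ECGenParameterSpec;

/** Illustrative ECDSA round trip with the JDK's default provider (not the cdcProvider). */
final class JcaEcdsaDemo {

    static KeyPair generateKeyPair() {
        try {
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("EC");  // step 1: get instance
            kpg.initialize(new ECGenParameterSpec("secp256r1"));        // step 2: initialize
            return kpg.generateKeyPair();                               // step 3: operate
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] sign(KeyPair kp, byte[] message) {
        try {
            Signature sig = Signature.getInstance("SHA256withECDSA");
            sig.initSign(kp.getPrivate());
            sig.update(message);            // feed the data to be signed
            return sig.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean verify(PublicKey pk, byte[] message, byte[] signature) {
        try {
            Signature sig = Signature.getInstance("SHA256withECDSA");
            sig.initVerify(pk);
            sig.update(message);
            return sig.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair kp = generateKeyPair();
        byte[] message = "transfer order".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(kp, message);
        System.out.println("valid: " + verify(kp.getPublic(), message, signature));
    }
}
```

The additional `update` calls supply the message; they are the part that the shortened listing above leaves out.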
The classes KeyPairGenerator and Signature are so-called engine classes, which provide
the interface to the functionality of a specific type of cryptographic service (independent of a particular
cryptographic algorithm). Each defines API methods that give applications access to the specific type
of cryptographic service it provides. The actual implementations are those for specific
algorithms. The application interfaces supplied by an engine class are implemented in terms of a
"Service Provider Interface" (SPI): For each engine class there is an abstract SPI class, which defines the
methods that a provider must implement.
So the interfaces between user applications and the provider are well defined. For detailed information,
please see [9].
Now that we have explained what Java allows us to do, let us introduce our cdcProvider [5]:
The cdcProvider is a powerful toolkit for the Java Cryptography Architecture. It provides cryptographic
modules that can be plugged into every application that is built on top of the JCA. The cdcProvider is
split into three parts: the CDCStandardProvider, the CDCECProvider and the CDCNFProvider.
Part of the CDCECProvider is ECDSA, which is still under construction. Once finished, it will work on
elliptic curves over large prime fields (a first version of this part can already be downloaded), over finite
fields of characteristic 2 in both optimal normal base and polynomial base representation, and in the
future perhaps over Optimal Extension Fields (OEF), see [13].
A first version of the $F(2^n)$ arithmetic with ONBs in Java is already finished and plugged into a test
version of the CDCECProvider, so that ECDSA works within the JCA over $F(p)$, $p$ prime, and over
$F(2^n)$ in pure software. Our
state of the art is an algorithm-optimized Java implementation of the EC arithmetic, exploiting
projective coordinates and a sliding window method with built-in NAFs (for NAFs, see [15]).
Since clients like banks, with server-based cryptosystems such as online banking servers, depend on
high-performance implementations, we replaced the software for the most critical part, the point
multiplication, with hardware.

4 Hardware Acceleration
Due to the immense computational effort for the k·P computation, high performance hardware imple-
mentations, like the CryptoProcessor described below, are necessary in order to support the use of EC
methods in server-based cryptosystems (e.g., online banking servers). The performance of software im-
plementations is not sufficient for this kind of applications because the n-bit finite field operations have
to be mapped to a processor with fixed word length (e.g., Intel Pentium, 32 bit) which introduces an
immense computational overhead. As mentioned before, detailed information about leading software
implementations can be found in [15].

4.1 Hardware Specification


The hardware description language VHDL is the de-facto standard for abstract modeling of digital cir-
cuits. These VHDL descriptions can be processed by synthesis tools to derive a netlist of basic logic
elements, which can be fed into place and route tools. Based on this design flow Register Transfer Level
(RTL) descriptions have proven to be well suited to efficiently design integrated circuits. In addition to
using commercial synthesis tools, there is a lot of potential for application specific model generators,
which build an RTL description from a more abstract rule set.
RTL is a good choice when aiming at synthesizable models which have to be independent from the
utilized synthesis tool. In 1999 an IEEE Standard [14] was drafted which defines a rule set for RTL syn-
thesis. RTL is characterized by functional blocks such as registers, memory units or ALUs and control
logic. The control logic is based on clocked state transitions. This abstraction level is well suited for
synchronous designs with a clear functionality and hard chip-size or performance requirements. Design-
ers have a good chance to add manual optimizations, and complexity can be managed by a hierarchical
design. To obtain sufficient performance whilst keeping a design flexible it turns out that a custom model
generator is a perfect choice for the design and implementation of cryptographic hardware.
The Algorithmic Level is defined with multi-cycle operations in mind. While each control step in
RTL is based on a clock cycle, the algorithmic level uses a causality paradigm. This results in descrip-
tions which are similar to software programming languages. The number of clock cycles needed for
evaluating an algorithm in hardware can be left open. The synthesis tool has to provide for resource
allocation, scheduling and pipelining of datapath and control logic. Additional synthesis constraints
are applied to tune the algorithms for the intended application. The definition of values for latency,
throughput and area allow for the production of application-specific optimizations based on a common
algorithmic description. Code reuse is increased and design validation is simplified, but thoroughly
checked RTL descriptions can be more efficient.

Figure 2: Architecture of the CryptoProcessor

For applications such as the proposed CryptoProcessor the most important optimization goals are
throughput and area; the modeling of multi-cycle operations is not of primary concern. To overcome
the deficiencies of current algorithmic synthesis tools, another approach is
needed.

4.2 Architecture and VHDL Model Generator


The Generator Approach is based on the idea of defining an abstract specification which can be processed
by a generator program in order to produce a synthesizable hardware description. In our case this
specification can be parameterized by the order of the underlying field, i.e. the key size, and the number
of single bit multipliers, i.e. the radix of the design. The generator includes a Meta-Model of the
CryptoProcessor, which is independent from these parameters. This meta-model in turn is composed of
sub-models, which are corresponding to the functional blocks in the CryptoProcessor’s architecture.
The architecture of the CryptoProcessor is introduced here, in order to explain how the generator
works and to give the motivation, why a generator is needed to automate the transformation from the
abstract specification to a specific bit-level implementation.
The top-level architecture is shown in Fig. 2. It consists of 3 main functional blocks. The Register
File comes with 16 n-bit registers to hold finite field elements or n-bit integers. In addition to the n-bit
internal interface it has a 32-bit interface for the external communication. The Controller is realized as a
finite state machine and implements the Double-and-Add algorithm. Here the number of iterations which
is required for one complete FF-Mult operation varies with the key size and the radix of the design. But
specifying generic models for these two components in terms of RT level VHDL is not really a problem.
Now let us have a look at the FF Arithmetic. Here the operations FF-Mult, FF-Add and FF-Square
are implemented. All of these components depend on the previously introduced design parameters. The
operation FF-Mult in turn is composed of at least one Massey-Omura single-bit multiplier [10]. Such
a multiplier takes two n-bit inputs and reduces them in a huge XOR-tree according to equation (2) to
one single bit. Because the structure of this XOR-tree changes with the underlying finite field, it
is impossible to create a generic synthesizable VHDL model for this component (see footnote 1). To overcome this
deficiency, a custom VHDL model generator was developed at our institute.

Footnote 1: The algorithm for the computation of the sequence $S(1), S(2), \ldots, S(p-1)$, which is required to calculate
equation (2), is given in Section 2.1.
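To illustrate why the XOR-tree has to be generated per field, the following sketch emits the AND/XOR terms of equation (2) for a given $n$ as a VHDL-style signal assignment. It is a hypothetical miniature written for this illustration, not the model generator developed for the CryptoProcessor; the signal names `a`, `b` and `gamma0` and the class name are assumptions, and a type-2 optimal normal base is assumed to exist for the chosen $n$.

```java
/** Illustrative generator for the XOR-tree terms of one Massey-Omura output bit. */
final class XorTreeSketch {

    /** Builds S(1), ..., S(p-1) for p = 2n + 1 as in Section 2.1; index 0 is unused. */
    static int[] sequence(int n) {
        int p = 2 * n + 1;
        int[] s = new int[p];
        int[] starts = {1, p - 1};
        for (int start : starts) {
            int m = start;
            for (int i = 0; i < n; i++) {
                s[m] = i;
                m = (2 * m) % p;
            }
        }
        return s;
    }

    /** Emits equation (2) as a VHDL-style expression over the input bits a(i) and b(j). */
    static String gammaZeroExpression(int n) {
        int p = 2 * n + 1;
        int[] s = sequence(n);
        StringBuilder sb = new StringBuilder("gamma0 <= ");
        for (int k = 1; k <= p - 2; k++) {
            if (k > 1) {
                sb.append(" xor ");
            }
            sb.append("(a(").append(s[k + 1]).append(") and b(").append(s[p - k]).append("))");
        }
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        System.out.println(gammaZeroExpression(3));
    }
}
```

The index pairs change completely from one $n$ to the next, so the expression has to be regenerated for every field size rather than parameterized by a generic.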
Having such a generator we are able to build CryptoProcessors for various key sizes, so we are well
prepared to support upcoming increasing key sizes. Especially when targeting FPGA implementations
there are some additional benefits from the generator approach. The available FPGA resources can be
used at an optimum, because the number of Massey-Omura multipliers (radix) is not fixed. This in turn
is the prerequisite to achieve maximal performance from a specific FPGA. Since the generated VHDL
descriptions are not bound to a special FPGA family, the rapid advance in FPGA technology can be
directly transferred into better performance. When targeting recent FPGA architectures (e.g. Xilinx
Virtex Family) the expected k·P performance is comparable to standard cell ASIC implementations.
But ASIC implementations normally support only one fixed key size. The use of reconfigurable logic
enables EC methods with variable key sizes on the same hardware.
With respect to design quality and validation there is another benefit from the generator approach.
An implementation for a small key size (e.g. 18 bit) can be used for exhaustive tests. This is necessary
to be sure that the model generator itself is correct. For real world crypto application key sizes n ≥ 160
are required, but in this order of magnitude really exhaustive tests are not possible.

4.3 FPGA Implementation


The implementation of the CryptoProcessor is based on the microEnable PCI card (see Fig. 3) from
Silicon Software GmbH [17]. This card is equipped with a reconfigurable logic device (FPGA) from
Xilinx Inc. [18], in which the CryptoProcessor’s functionality is implemented. This card is available
with FPGAs of different complexities. In our case a XC4085XLA-FPGA with a complexity of max.
180000 system gates is used. Furthermore the card comes with a programmable clock generator, static
RAM and external interfaces. The integration into a target system is accomplished easily via the PCI
interface.
The design flow for the FPGA based implementation of the CryptoProcessor is a three-stage process.
First the previously described model generator produces a VHDL description of the CryptoProcessor
for a given key size. In the next step this model is fed into a synthesis tool to generate a corresponding
netlist, which is matching to the granularity of the target FPGA device. Finally, the manufacturer-specific
placement and routing tools are used to generate the FPGA core.

Figure 3: microEnable PCI card (block diagram with the FPGA, RAM blocks and the PCI interface)

Target Platform                                           Key Size (bit)   k·P Operations per second
C/C++ Software [15] (Intel PentiumPro, 200 MHz)           191              48
FPGA Hardware [16] (XCV400E, 76.7 MHz)                    167              4762
FPGA CryptoProcessor (XC4085XLA, 36 MHz)                  173              568
FPGA CryptoProcessor (XC4085XLA, 36 MHz)                  191              431
FPGA CryptoProcessor (XC4085XLA, 34 MHz)                  270              146
FPGA CryptoProcessor (XCV400E, 120 MHz) (estimated)       173              6816

Table 1: Performance comparison

Using our generator approach on top of a VHDL based design flow for a XC4085XLA-FPGA as
target device, leads to a 270-bit CryptoProcessor design including 3 Massey-Omura single-bit multipliers
(radix 3). This design has a CLB utilization of 82% and is running at 34 MHz. The resulting performance
is 146 k·P operations per second, as summarized in Tab. 1.
Two other versions of the CryptoProcessor, supporting key sizes of 191 and 173 bits, have been
mapped to the same XC4085XLA-FPGA. This shows how the generator makes it easy to trade
security for performance. The 191-bit version of the processor with radix 5 has a CLB utilization of 69%
and runs at 36 MHz. The resulting performance is 431 k·P operations per second. The 173-bit
version with radix 6 also runs at 36 MHz and has a CLB utilization of 66%. This results in 568
k·P operations per second. Please note that the achievable CLB utilization decreases when the radix,
i.e. the number of Massey-Omura multipliers in the design, increases. This is because of the
relatively few routing resources an XC4085XLA-FPGA provides in comparison to its logic complexity.
It is possible to implement a 191-bit processor with radix 9 in an XC4085XLA-FPGA. This results in a
CLB utilization of 95%, but the achievable operating frequency is only 12 MHz. Therefore the overall
k·P performance is lower than in the case of the implementation with radix 5. This tradeoff, which has to
be made whenever the target FPGA changes, is uniquely supported by the proposed generator approach.
The values given above, which are summarized in Tab. 1, were measured in real-time within our test
environment consisting of a standard PC (Intel Pentium III, 550 MHz) running MS Windows NT 4.0.
The test application is based on the application programming interface provided with the microEnable
PCI card, written in C++ and compiled with MS Visual C++ 6.0. For the hardware synthesis we used
FPGA Compiler II V3.5 from Synopsys Inc. The FPGA mapping was performed using the Foundation
Series Software V2.1i from Xilinx Inc.
Several hardware implementations of the k·P computation are documented in the literature. The
most recent and best performing one, representing the current benchmark with respect to k·P
performance, is described in [16]. However, this implementation supports only a fixed key size of
167 bits and uses a polynomial basis representation for the underlying field. It is highly optimized,
exploiting pipelining and concurrency. For the finite field multiplication, which is the
performance-critical part, a digit-serial multiplier is used. In this respect it is similar to our approach.
However, although the architecture in [16] can be applied to any field F(2^m), a method to utilize
this feature has not been detailed.
Comparing the performance of different hardware implementations is in general not straightforward,
because key sizes mostly differ and because different ASIC or FPGA technologies are used for
the implementations. In order to compare the implementation in [16] with our approach as fairly as
possible, we applied our design flow to the same device as used in [16], a Xilinx XCV400E FPGA
of speed grade -8. For this target device, our design flow leads to a 173-bit CryptoProcessor design
with radix 35. According to the design tools, we can expect an operating frequency of approximately
120 MHz for this design. Compared to the previously described 173-bit implementation with radix 6
running at 36 MHz, the frequency increase alone would yield a speedup of roughly 3. In our
experience, a further speedup of at least 4 can be expected from the radix increase (35 compared
to 6; see footnote 2). In total, this would result in an estimated overall performance of
3 · 4 · 568 = 6816 k·P operations per second for this 173-bit implementation of our CryptoProcessor.
In contrast, the implementation in [16] supports a key size of 167 bits and achieves a performance
of 4762 k·P operations per second.
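The estimate can be reproduced with a few lines of Java. The sketch below simply multiplies the measured 568 k·P operations per second by a clock factor and a radix factor. The class and method names are hypothetical. The second call uses the unrounded clock ratio 120/36 instead of the rounded factor 3, which is an added variation for comparison and not a figure from the text.

```java
// Back-of-the-envelope reproduction of the performance estimate.
// Names are illustrative; only the input figures come from the text.
final class SpeedupEstimate {

    // Scales a measured k·P rate by a clock factor and a radix factor.
    static double scaledOpsPerSecond(double measuredOps, double clockFactor, double radixFactor) {
        return measuredOps * clockFactor * radixFactor;
    }

    public static void main(String[] args) {
        double measured = 568.0;    // 173-bit design, radix 6, 36 MHz
        double radixFactor = 4.0;   // expected lower bound for radix 35 vs. 6

        // Clock ratio rounded down to 3, as in the text: 6816 operations/s.
        double rounded = scaledOpsPerSecond(measured, 3.0, radixFactor);

        // Unrounded clock ratio 120 MHz / 36 MHz (assumption for comparison).
        double unrounded = scaledOpsPerSecond(measured, 120.0 / 36.0, radixFactor);

        System.out.println("estimate with clock factor 3:      " + rounded);
        System.out.println("estimate with clock factor 120/36: " + unrounded);
        System.out.println("ratio to 4762 ops/s from [16]:     " + rounded / 4762.0);
    }
}
```

Both values are projections from expected tool results and not measurements, so they should be read as rough indications only.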

5 Conclusions
We implemented an elliptic curve based crypto provider within the Java Cryptography Architecture
(JCA). It provides cryptographic modules that can be plugged into any application built on top of
the JCA. For high performance requirements, a CryptoProcessor has been developed that implements
the most critical operation, k·P.
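From the point of view of an application, a pluggable provider means that signing and verifying go through the generic JCA classes only. The following minimal sketch uses the standard `java.security` API with the default provider of a current JDK. The algorithm names "EC" and "SHA256withECDSA" and the 256-bit key size are assumptions chosen so that the example runs on a stock JDK; they select a prime field curve, not the F(2^n) curves discussed here, and they are not the identifiers of the provider described in this paper.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// Minimal JCA usage sketch with the JDK's default provider (an assumption;
// the provider described in the paper would be selected by its own name).
final class JcaEcdsaSketch {

    // Generates a key pair, signs the message and verifies the signature.
    static boolean signAndVerify(byte[] message) {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("EC");
            generator.initialize(256); // key size chosen for illustration
            KeyPair keyPair = generator.generateKeyPair();

            Signature signer = Signature.getInstance("SHA256withECDSA");
            signer.initSign(keyPair.getPrivate());
            signer.update(message);
            byte[] signature = signer.sign();

            Signature verifier = Signature.getInstance("SHA256withECDSA");
            verifier.initVerify(keyPair.getPublic());
            verifier.update(message);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("ECDSA not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] message = "transfer order".getBytes(StandardCharsets.UTF_8);
        System.out.println("signature valid: " + signAndVerify(message));
    }
}
```

Because the application code refers to algorithm names only, a different provider, including one backed by hardware acceleration, could be selected by passing a provider name as the second argument of `getInstance` without changing the rest of the code.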
Our crypto provider is flexible and platform independent. Its performance can be increased
significantly by exploiting the acceleration provided by the proposed FPGA based CryptoProcessor.
At the same time, we retain flexibility with respect to the key size, enabled by our custom VHDL
model generator.
The result, namely high performance combined with flexibility, is of particular interest for, e.g.,
online banking server applications. We illustrated these results by means of ECDSA over F(2^n) in
ONB representation.

References
[1] N. Koblitz, “Elliptic Curve Cryptosystems,” Mathematics of Computation, 48 (1987), pp. 203–
209.
[2] A. Lenstra and E. Verheul, “Selecting cryptographic key sizes”, August 1999,
http://www.cryptosavvy.com
[3] IEEE P1363, “Standard Specifications For Public Key Cryptography”
http://grouper.ieee.org/groups/1363/
[4] ANSI X9.62, “Public key cryptography for the financial services industry: The Elliptic Curve
Digital Signature Algorithm (ECDSA)”, 1999 (available from the ANSI X9 catalog)
[5] Institute of Cryptography and Computer Algebra, J. Buchmann, TU Darmstadt, “CDCProvider”,
http://www.informatik.tu-darmstadt.de/TI/Forschung/cdcProvider/overview.html, 2000
[6] Joseph H. Silverman, “The Arithmetic of Elliptic Curves”, Graduate Texts in Mathematics
Vol. 106, Springer, 1986
[7] Neal Koblitz, “Introduction to Elliptic Curves and Modular Forms”, Graduate Texts in
Mathematics, Springer, 1993
[8] SUN, “Java Native Interface”, http://java.sun.com/products/jdk/1.2/docs/guide/jni/index.html
Footnote 2: The k·P performance does not scale linearly when only the finite field multiplication
is sped up.
[9] SUN, “Java Cryptography Architecture API Specification & Reference”, 1997,
http://java.sun.com/products/jdk/1.1/docs/guide/security/CryptoSpec.html
[10] J. Massey and J. Omura, “Computational Method and Apparatus for Finite Field Arithmetic,”
U.S. Patent 4,587,627, 1986.
[11] O. Hauck, A. Katoch and S. A. Huss, “VLSI System Design Using Asynchronous Wave
Pipelines: A 0.35 µm CMOS 1.5 GHz Elliptic Curve Public Key Cryptosystem Chip,” Proc.
IEEE ASYNC 2000, Eilat, April 2000.
[12] M. Rosing, “Implementing Elliptic Curve Cryptography,” Manning Publications Co., Greenwich,
1999. ISBN 1-884777-69-4
[13] D. V. Bailey and C. Paar, “Efficient Arithmetic in Finite Field Extensions with Application in
Elliptic Curve Cryptography,” To appear in Journal of Cryptology.
[14] IEEE Standard 1076.6; Standard for VHDL Register Transfer Level Synthesis, IEEE Standards
Department, New York, 1999.
[15] E. De Win, S. Mister, B. Preneel and M. Wiener, “On the Performance of Signature Schemes
based on Elliptic Curves,” Proc. Algorithmic Number Theory Symposium III, LNCS 1423, J. P.
Buhler, Ed., Springer-Verlag, pp. 252-266, 1998.
[16] G. Orlando and C. Paar, “A High-Performance Reconfigurable Elliptic Curve Processor for
GF(2^m),” Proc. Workshop on Cryptographic Hardware and Embedded Systems (CHES 2000),
Worcester MA, USA, August 2000.
[17] Silicon Software, “microEnable Users Guide”, 1999.
[18] Xilinx, “Programmable Logic Data Book”, 1999.
