
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 8, AUGUST 2009, p. 1099

Flexible Hardware Processor for Elliptic Curve Cryptography Over NIST Prime Fields

Kendall Ananyi, Hamad Alrimeih, and Daler Rakhmatov, Member, IEEE

Abstract—Exchange of private information over a public medium must incorporate a method for data protection against unauthorized access. Elliptic curve cryptography (ECC) has become widely accepted as an efficient mechanism to secure sensitive data. The main ECC computation is a scalar multiplication, translating into an appropriate sequence of point operations, each involving several modular arithmetic operations. We describe a flexible hardware processor for performing computationally expensive modular addition, subtraction, multiplication, and inversion over prime finite fields GF(p). The proposed processor supports all five primes recommended by NIST, whose sizes are 192, 224, 256, 384, and 521 bits. It can also be programmed to automatically execute sequences of modular arithmetic operations. Our field-programmable gate-array implementation runs at 60 MHz and takes between 4 and 40 ms (depending on the used prime) to perform a typical scalar multiplication.

Index Terms—Elliptic curve cryptography (ECC), modular arithmetic, prime finite fields, programmable hardware.

Manuscript received September 10, 2007; revised January 28, 2008. First published May 12, 2009; current version published July 22, 2009. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada, the Canadian Microelectronics Corporation, and the Xilinx University Program.
K. Ananyi is with FinancialCAD Corporation, 13450-102 Ave, Surrey, BC V3T 5X3, Canada.
H. Alrimeih and D. Rakhmatov are with the Department of Electrical and Computer Engineering, University of Victoria, EOW 448, Victoria, BC V6P 5C2, Canada (e-mail: daler@uvic.ca).
Digital Object Identifier 10.1109/TVLSI.2009.2019415

TABLE I
NIST-RECOMMENDED PRIMES FOR GF(p) [3]

I. INTRODUCTION

INFORMATION security can be achieved in part by following carefully designed protocols that use various cryptographic algorithms. One class of such algorithms, elliptic curve cryptography (ECC), has become widely accepted and standardized by IEEE [1], ANSI [2], and NIST [3]. The core functionality of ECC is based on manipulating points of a properly chosen elliptic curve over a finite field. For example, one can add some point P to itself k times to obtain a new point Q = kP on the curve (see Section II for details). The cryptographic strength of elliptic curves relies on the computational hardness of finding k (a private value) given P and Q (public values).

As ECC protocols routinely perform scalar multiplications Q = kP to secure data, an efficient implementation of this operation is critical. A flexible implementation is also desired, as there are many different scalar multiplication algorithms and many different elliptic curves, offering different tradeoffs between computational performance and security. This article presents a programmable ECC processor that attempts to strike a balance between the conflicting requirements of efficiency (e.g., speed, area) and flexibility (e.g., different curves, different scalar multiplication methods).

The proposed architecture performs arithmetic operations over prime finite fields GF(p) and supports all five NIST primes of size 192, 224, 256, 384, and 521 bits, as shown in Table I [3]. The corresponding modular operations—addition, subtraction, multiplication, and inversion—form the instruction set of the processor. These instructions can be combined into simple programs to perform basic ECC operations, such as point addition or subtraction and point doubling. Changing a scalar multiplication algorithm would only require changing the execution sequence of point operation programs. Thus, the flexibility of our ECC processor is characterized by: 1) supporting different fields GF(p) with different prime sizes and 2) supporting programmable scalar multiplication. For the sake of efficiency, we have decided to focus only on the NIST prime fields, as opposed to other alternatives with the same prime size. One of the main reasons behind the NIST recommendations is the possibility of fast modular reductions, and our ECC processor takes full advantage of this feature. The well-known concepts of concurrency and locality have also been utilized to improve the efficiency of the proposed architecture.

Prime fields are not the only choice for ECC. NIST has also recommended several binary fields GF(2^m), whose elements are defined as m-bit binary vectors [3]. There are many hardware-based implementations of ECC over GF(2^m), such as [4] and [5], that can be reprogrammed for different values of m. However, there is a lack of programmable hardware solutions for ECC over GF(p), especially with prime sizes reaching 521 bits. The proposed ECC processor attempts to fill this gap. This article also describes: 1) a hardware implementation of the fast reduction schemes specific to the NIST primes and 2) software algorithms for processor programming. Our general approach is similar to [6], which also uses modular operation instructions as well as an optimized reduction circuit for three special primes. However, [6] targets binary fields GF(2^m).

A brief overview of the proposed processor was presented in [7]. This paper provides a full description and analysis of our design, including the execution environment. We focus primarily on the system-level hardware and software details, rather than on individual modular arithmetic blocks. We also show that the

1063-8210/$26.00 © 2009 IEEE


programmability of the proposed processor allows for a flexible tradeoff between computational performance and security associated with scalar multiplications.

II. ECC BACKGROUND

Let E: y^2 = x^3 + ax + b be a non-supersingular elliptic curve over GF(p), where p is a prime greater than 3. Table I shows the five primes recommended by NIST and used in this work. The corresponding five curve equations have the same a = −3, but different values of b that can be found in [3]. A point with affine coordinates (x, y) lies on E, if the values of x and y satisfy the curve equation. Addition of two points P1 = (x1, y1) and P2 = (x2, y2) yields the third point P3 = P1 + P2 = (x3, y3), whose coordinates are computed as follows [8]:

x3 = λ^2 − x1 − x2,  y3 = λ(x1 − x3) − y1  (1)

where

λ = (y2 − y1)/(x2 − x1) if P1 ≠ P2,  λ = (3x1^2 + a)/(2y1) if P1 = P2.  (2)

Point subtraction P1 − P2 is performed by adding P1 to the negated point −P2 defined as (x2, −y2). There is a special point at infinity O that serves as an additive identity, i.e., P + O = O + P = P, for all points P on the curve.

Scalar multiplication Q = kP is the result of adding point P to itself k times. Fig. 1 shows a simple algorithm for computing Q = kP, where k is represented in the binary form [8]. Fig. 2 shows a different algorithm, where k is represented in the non-adjacent form (NAF) [8].

Fig. 1. Binary scalar multiplication algorithm, adopted from [8].

Fig. 2. NAF scalar multiplication algorithm, adopted from [8].

For both algorithms, we assume that all leading zeros (if they exist) in the representation of k have been removed to avoid unnecessary computations. The number of point doublings is equal to n − 1, where n denotes the number of bits (binary k) or digits (NAF k) in the scalar representation. The number of point additions/subtractions is equal to h − 1, where h denotes the number of nonzero bits/digits in the scalar representation. For example, k = 215 can be represented as binary (1 1 0 1 0 1 1 1) or NAF (1, 0, 0, −1, 0, −1, 0, 0, −1). While computing Q = 215P, the algorithm in Fig. 1 will perform 7 point doublings and 5 point additions. On the other hand, the algorithm in Fig. 2 will perform 8 point doublings and 3 point subtractions. This example demonstrates that the number of point operations performed per scalar multiplication depends on not only the scalar value, but also the scalar representation.

Each point operation involves a certain number of modular additions, subtractions, multiplications, and inversions—e.g., see (1) and (2). The number of modular operations performed per point operation depends on not only the point operation type, but also the point representation. Affine coordinates are not the only choice for point representations. One should also consider the following projective coordinates [8].
• Standard projective points (X, Y, Z) correspond to affine points (x, y), where x = X/Z and y = Y/Z. Affine points (x, y) correspond to standard projective points (x, y, 1). Converting from standard projective to affine coordinates requires one modular inversion Z^−1 and two modular multiplications: x = X·Z^−1 and y = Y·Z^−1.
• Jacobian projective points (X, Y, Z) correspond to affine points (x, y), where x = X/Z^2 and y = Y/Z^3. Affine points (x, y) correspond to Jacobian points (x, y, 1). Converting from Jacobian to affine coordinates requires one modular inversion T = Z^−1 and four modular multiplications: T^2 = T·T, x = X·T^2, T^3 = T^2·T, and y = Y·T^3.
• Chudnovsky projective points (X, Y, Z, Z^2, Z^3) correspond to Jacobian points (X, Y, Z) with two redundant coordinates Z^2 and Z^3. Affine points (x, y) correspond to Chudnovsky points (x, y, 1, 1, 1). Converting from Chudnovsky to affine coordinates likewise requires one modular inversion T = Z^−1 and four modular multiplications: T^2 = T·T, x = X·T^2, T^3 = T^2·T, and y = Y·T^3.

Table II shows the equations for computing point doubling and point addition/subtraction using mixed coordinates: 1) affine P and Jacobian Q and 2) Chudnovsky P and Jacobian Q. As Q appears on both sides of the expressions, we use a prime embellishment to differentiate between the new coordinate values of Q (left-hand side) and the old coordinate values of Q (right-hand side).

In the case of affine P and Jacobian Q, we need to perform 11 modular multiplications and 7 modular additions/subtractions per point addition/subtraction. In the case of Chudnovsky P and Jacobian Q, we need to perform 14 modular multiplications and 7 modular additions/subtractions per point addition/subtraction. In both cases, we need to perform 8 modular multiplications and 12 modular additions/subtractions per point doubling. Note that the most expensive modular operation, inversion, is not required in either case. There are other inversion-free combinations possible (e.g., affine P and standard Q, Jacobian P and Jacobian Q, etc.), but they require extra modular multiplications [8].^1

^1 Inversion-free projective point operations do not eliminate an inversion required for converting Q from projective to affine coordinates at the end of a scalar multiplication.

Using Jacobian Q with affine P involves fewer modular multiplications than with Chudnovsky P. The latter, however, has a security advantage: coordinate Z can be randomized to counter certain side-channel attacks. Ideally, the scalar should also be


randomized for security reasons. For example, we can represent k as the sum of two integers k1 and k2, where k1 is a random number. In other words, we compute Q = k1·P + k2·P rather than computing Q = kP directly. Such double-scalar multiplications can be performed efficiently, if we represent k1 and k2 in the joint sparse form (JSF), where each digit of k1 is paired with the corresponding digit of k2. For instance, the example integers k1 and k2 (with k = k1 + k2) have an 8-pair JSF representation, with each digit taken from {0, 1, −1}. Fig. 3 shows the JSF scalar multiplication algorithm. In our example, n = 8, and the algorithm performs 8 point doublings, 3 point subtractions, and 1 point addition. Note that a point addition/subtraction is performed only if the corresponding JSF digit pair has a nonzero sum. In other words, the number of point additions/subtractions is equal to the number of nonzero-sum digit pairs in the JSF scalar representation, denoted by h, minus 1. In our example, h = 5, which implies 4 point additions/subtractions.

Fig. 3. JSF scalar multiplication algorithm, adopted from [8].

Remark: The useful values of k are within 1 < k < N − 1, where N is the order of a NIST elliptic curve over GF(p) for a specific NIST prime p. Given any point P, all other points kP with 1 < k < N are unique [8]. If k = N, then kP = O, while (N − 1)P = −P and (N + 1)P = P. If k exceeds N, then kP = (k mod N)P, i.e., as many as N point computations are redundant. We assume that k never exceeds N, and we also exclude the values {0, 1, N − 1, N} as trivial cases. Thus, given that 1 < k < N − 1, we never encounter another point equal to P, −P, or O during scalar multiplication Q = kP.

TABLE II
EQUATIONS FOR COMPUTING POINT OPERATIONS USING AFFINE OR CHUDNOVSKY P AND JACOBIAN Q, ADOPTED FROM [8]

TABLE III
HARDWARE SOLUTIONS FOR ECC OVER GF(p)

III. RELATED WORK

Software-based implementations of ECC, such as described in [8], are flexible but inefficient, as a general-purpose instruction set architecture (ISA) of the underlying hardware is not optimized for cryptographic computations. An ISA can be extended to provide partial support for ECC-related arithmetic operations [9]. A more aggressive approach would be to introduce a special arithmetic unit for accelerating modular operations [10]–[14] or even complete scalar multiplications [15]–[18]. Obviously, as an architecture becomes more specialized, its efficiency increases and its flexibility decreases. Table III summarizes the hardware solutions specifically optimized for ECC over GF(p), excluding general-purpose ISA extensions.

Implementations reported in [15], [16], and [18] commit to a specific scalar multiplication method and specific point coordinates. The prime itself can be varied as long as its size is less than a certain maximum. For example, the architecture from [18] can work with any prime whose size does not exceed 256 bits. Consequently, it can handle NIST primes p192, p224, and p256, but not p384 or p521.

Implementations reported in [10]–[14] and [17] are the most flexible architectures that are still relatively efficient. Their purpose is to speed up arithmetic computations over GF(p) for faster point operations. The control flow of point operations (e.g., the scalar multiplication algorithm) can be changed without altering the hardware architecture in question. Our ECC processor falls into this design category.

The hardware from [17] can be programmed to perform a complete scalar multiplication autonomously; the same is likely to be true for [14], although it is not apparent from the published materials. On the other hand, [10]–[13] can execute only one modular operation at a time, i.e., they need an external controller to sequence multiple modular operations (for each point operation). Our hardware is in the middle: it can perform individual point operations autonomously, but needs an external controller to sequence a complete scalar multiplication.

By supporting automatic execution of individual point operations, our processor can run on its own for many clock cycles. This lets the host switch to other tasks, while the hardware is


busy. Since [10]–[13] do not include point operation controls, they demand more involvement from the host in terms of both control and data traffic. Since [17] and [14] include scalar multiplication controls, they are the least demanding as hardware accelerators. We rely on a software-based scalar multiplication controller (running on the host), because it is more flexible and less costly than a hardware implementation. In other words, we have attempted to strike a balance between the complexity of the control hardware and the speed of scalar multiplications.

The advantage of our implementation over [10]–[13] and [17] is the ability to work with all five NIST primes. The implementation from [17] is limited to p192 only, while the others can handle any 256-bit prime or smaller, but not p384 or p521. Although the hardware from [10]–[13] can be redesigned with a 521-bit datapath, no corresponding implementation results have been reported.

The custom 130-nm IP core from [14] can handle primes up to 2048 bits in size. It features a pipelined Montgomery-type Large Number Multiplier and Exponentiator (LNME). A 384-bit point addition and a 384-bit scalar multiplication are reported to take 0.64 and 6.3 ms, respectively (at 230 MHz). Our 384-bit point addition and 384-bit scalar multiplication, respectively, take 0.04 ms (better) and 19.9 ms (worse) on a Xilinx FPGA running at 60 MHz (see Section VII). Interestingly, a scalar multiplication delay in [14] is about 10 point addition delays. A typical assumption for a 384-bit (binary) scalar would be on the order of a few hundred point additions per (sequential) scalar multiplication. It may be possible that [14] supports concurrent execution of point operations, which may partially explain the reported data. Unfortunately, the available information is insufficient to allow for a meaningful architectural and performance comparison. It appears that the authors of [19] used an older version of the similar hardware, coupled with an ARM microprocessor and running at 50 MHz. Their typical 192-bit scalar multiplication delay was 18 ms, which is worse than the delays reported for other implementations (including ours).

A unique feature of our design is the utilization of the fast reduction schemes specific to the NIST primes. We replace conventional Montgomery-based modular multiplications [8] by regular (non-modular) multiplications followed by modular reductions, which simplifies the processor datapath. To the authors' knowledge, our modular multiplier is the first hardware implementation of its kind. Since our architecture currently does not include instructions and hardware for universal Montgomery-based multiplications, any arithmetic operations modulo a non-NIST prime must be performed externally (e.g., by the host itself, in software). Note that even if the five NIST curves are everything the user may need, there are still protocol-level operations modulo a non-NIST prime. For instance, the ECDSA signature verification requires two multiplications and one inversion modulo N (the NIST curve order) [8]. In the case of 384-bit computations, this would take about 0.1 ms in software running on a Pentium III at 800 MHz [8]. Such a delay may be acceptable to the user, as it is negligible in comparison to the overall scalar multiplication delay of several milliseconds.

IV. PROPOSED PROCESSOR ARCHITECTURE

Our approach to accelerating scalar multiplications involves the following three system components:
• Scheduler: software procedure generating a sequence of point operations, each corresponding to a programmed sequence of modular operations;
• Processor: programmable hardware unit executing the programmed sequences of modular operations;
• Supervisor: general-purpose microprocessor running the scheduler and controlling the processor.

Fig. 4. Block diagram of ECC processor.

A. Execution Environment

The supervisor is responsible for performing scalar multiplications as well as protocol-level ECC computations. For faster point operations, it takes advantage of the processor. Each point operation is programmable as a sequence of modular operations (addition, subtraction, multiplication, and inversion). The supervisor can assign multiple such programs to the processor. The program execution is triggered by the scheduler, which is a software procedure executed by the supervisor. The responsibility of the scheduler is to examine the scalar value and generate the sequence of point operations accordingly. While the processor is executing its assigned set of programs, the supervisor can run the scheduler to generate the next set of programs. Thus, the latency overhead of the supervisor running the scheduler can be hidden by useful computations performed by the processor.

B. Processor Organization and Instruction Set

The block diagram of the proposed ECC processor is shown in Fig. 4, and the external input/output (I/O) signals are summarized in Table IV. The processor includes the following five components:
• Main Memory: 512 × 32-bit memory that stores both 32-bit instructions and 32-bit data words;
• Operand Register A: 2 × 265-bit register that stores intermediate operand a and result c;
• Operand Register B: 2 × 265-bit register that stores intermediate operand b;
• Control Unit: global controller that fetches and decodes instructions from the Main Memory and sends appropriate control signals to the Functional Unit;


• Functional Unit: flexible processing element that performs modular addition, subtraction, multiplication, and inversion.

TABLE IV
PROCESSOR I/O SIGNALS

Fig. 5. Block diagram of functional unit.

The block diagram of the Functional Unit includes the following four components, as shown in Fig. 5:
• Modular Adder/Subtractor, implementing modular addition and subtraction (a ± b) mod p;
• Modular Inverter, implementing modular inversion a^−1 mod p;
• Regular Multiplier, implementing non-modular multiplication a × b;
• Modular Reductor, reducing the product a × b modulo p.

The Functional Unit has a 265-bit (256 + 9) datapath, which allows for concurrent processing of multiple 32-bit words. It reads the operands from the Operand Registers A and B. Once a modular operation is completed, the result is written to the Operand Register A. These registers have been introduced to take advantage of the spatial locality of the 32-bit data words forming an operand/result and the temporal locality of operand/result accesses. Control signals LOADL/LOADH enable writing into the lower/higher 265-bit half of each register. The lower half is used to store the least significant portion of the 384-bit or 521-bit data (lower 256 bits), while the higher half is used to store the most significant portion of the data (at most 265 remaining bits). The Main Memory writes the Operand Registers A and B concurrently, using two separate ports DOUTA/DOUTB and transferring 32 bits at a time through each port. However, it can read only the Operand Register A, using port DINA and transferring 32 bits at a time. The Functional Unit reads the Operand Registers A and B concurrently, using two separate inputs A/B and transferring at most 265 bits at a time through each input. However, it can write only the Operand Register A, using output C and transferring at most 265 bits at a time.

Fig. 6. ECC processor instructions.

TABLE V
NUMBER OF CLOCK CYCLES PER INSTRUCTION

Processor instructions are summarized in Fig. 6. Each 32-bit instruction is divided into four fields: Opcode, Operand1, Operand2, and Result. The 5-bit Opcode field encodes the type of a modular operation and a NIST prime. The 9-bit Operand1/Operand2 fields contain an address pointer to the first 32-bit word of the corresponding operand in the Main Memory. The 9-bit Result field contains an address pointer to the Main Memory location where the first 32-bit word of the result is to be stored. Fig. 6 shows an example encoding of an add instruction using one of the NIST primes.

Address pointers 0 and 1 are reserved for the Operand Registers A and B. Consequently, locations 0 and 1 of the Main Memory are to be used only for instructions (data words stored in those locations will be inaccessible). The other locations can be used to store either instructions or data. If Operand1 is 0 (1 is not allowed), then operand a is read from the Operand Register A without accessing the Main Memory. If Operand2 is 1 (0 is not allowed), then operand b is read from the Operand Register B without accessing the Main Memory. Otherwise, the operand registers are written by the Main Memory first, and then read by the Functional Unit. If Result is 0 (1 is not allowed), then result c


is written to the Operand Register A without updating the Main Memory. Otherwise, the Main Memory is updated concurrently with the Operand Register A.

Table V shows the number of clock cycles required to complete each modular operation. Since several primes are supported, the amount of time per instruction depends on the used prime. For example, a multiplication modulo the largest supported prime requires 178 cycles, while a multiplication modulo the smallest requires only 74 cycles. The modular inversion is the most expensive operation, and its latency is data-dependent (Table V shows the upper bound). The table entries include the overhead of instruction fetch-decode, reading the operands from the Main Memory, and writing the result into the Main Memory. Fetch-decode always takes 6 clock cycles. Memory reading or writing latencies are equal to ceil(s/32) cycles per operand, where s is the prime size in bits.

C. Control Flow

As long as control signal SUPERVISOR is asserted, the processor remains in the idle mode and may not access the Main Memory. Once the supervisor deasserts the SUPERVISOR signal and asserts GLOBALSTART, the processor asserts BUSY and starts executing instructions from the Main Memory. When a stop instruction is encountered, the processor deasserts BUSY and enters the idle mode.

Instruction execution is performed in a classic Fetch-Decode-Execute fashion. For that purpose, the Control Unit includes a 9-bit program counter, a 32-bit instruction register, and a microcode for each valid instruction.^2 Each microcode word encodes 43 control signals sent to the Functional Unit—some of those signals are shown in Fig. 4. Instructions are fetched from the Main Memory using port DOUTA. After decoding an instruction, the Control Unit generates a START pulse signaling the beginning of the instruction execution. It then waits for the Functional Unit to generate a DONE pulse signaling the end of the instruction execution. This simple handshaking scheme decouples the control flow from the variable-latency instruction execution. Once the DONE pulse is received, the program counter is incremented, and the next instruction is read. When a jump instruction is encountered, the appropriate target address is loaded into the program counter. This instruction can be used to link different program portions residing in different areas of the Main Memory. The program counter is cleared when the processor receives a GLOBALRESET pulse.

^2 Microcoded execution control can be reprogrammed to accommodate future extensions of the Functional Unit.

D. Processor Programmability

To reprogram the processor for a different point operation, we must download a different sequence of processor instructions. To reprogram the processor for a different prime, we must modify the Opcode field of the instructions to be executed. The prime information, encoded in the instruction's Opcode field, is sent to the Functional Unit as the 3-bit signal PRIME. Changing this signal will immediately "reconfigure" the Functional Unit: the datapath structure remains fixed, while the control signal sequence will change according to the selected prime. For example, if a given instruction uses prime p192, control signal WECU (write-enable from the Control Unit) would be asserted for 6 clock cycles to read six 32-bit words per 192-bit operand from the Main Memory. On the other hand, if p384 is used, the Control Unit would assert WECU for 12 clock cycles. Another example is given in Section V-B, describing our Modular Reductor.

As opposed to conventional FPGA-based reconfigurable architectures, our processor does not require expensive downloads of configuration bitstreams. Any delay or energy penalties associated with changing the processor functionality are due to writing new instructions and accessing data in the processor's Main Memory. We discuss the details of this overhead in Section VII.

V. FLEXIBLE FUNCTIONAL UNIT

The datapath of the Functional Unit is 265-bit wide. The reason behind this design decision is as follows. The NIST prime sizes shown in Table I are divisible by 32, except for p521, whose size is 9 bits longer than 512, the closest multiple of 32. The size of operands a and b is determined by the size of the used prime, i.e., it ranges from six 32-bit words (prime p192) to sixteen 32-bit words plus an extra 9-bit portion (prime p521).

TABLE VI
DATAPATH WIDTH OPTIONS

Table VI compares various datapath width options in terms of the number of execution passes (EP) and the number of unused bits (UB). For example, processing 224-bit operands with a 64-bit datapath requires four passes, i.e., EP = 4. The total number of bits that can be processed with these four passes is 4 × 64 = 256. Since only 224 bits are needed, 32 bits are unused, i.e., UB = 32.

Ideally, both EP and UB should be small, which can be interpreted as a heuristic measure of the datapath efficiency. We have selected the width of 265 bits, which yields small values of EP and UB on average. We assume that the five primes under consideration are equally important to the user,


i.e., the average values of EP and UB are not biased towards a specific prime size. The alternatives with smaller EP (i.e., better execution speed, e.g., the 384-bit width) have much larger UB, while the alternatives with smaller UB (i.e., better area utilization, e.g., the 128-bit width) have much larger EP. In other words, the 265-bit datapath offers a balanced compromise between execution speed and area utilization. As not all bits are used all the time, the unused datapath portions are disabled to reduce dynamic power consumption.

Modular operations using the primes p384 and p521 require two passes through our 265-bit datapath. For that purpose, the Control Unit generates control signal INHALF indicating which part of the operands should be used. If INHALF = 0, then the datapath processes the least significant portion of the data (i.e., the first 256 bits)—this is the first pass. Otherwise, INHALF = 1, signaling the second pass: the datapath processes the most significant portion of the data (i.e., the remaining 128 bits of the 384-bit data or 265 bits of the 521-bit data). Control signal OUTHALF serves a similar purpose for reading out the result.

Each block of the Functional Unit includes a dedicated local controller and input registers. Although not explicitly shown, control signals RESET, START, DONE, INHALF, OUTHALF, and PRIME are generated individually for each block, e.g., START of the Modular Multiplier and START of the Modular Inverter are physically separate signals. Thus, the operation of each block is independent of the others. This allows for a parallel execution of multiple instructions. Currently, instructions are executed sequentially. The Control Unit will be extended in the future to support instruction-level parallelism.

A. Modular Adder/Subtractor and Modular Inverter

Our modular adder/subtractor combines the addition and subtraction algorithms from [8], as shown in Fig. 7. The type of the operation, addition or subtraction, is determined by 1-bit control signal ADDSUB. For primes p192, p224, and p256, the Modular Adder/Subtractor disables the unused portions of its 265-bit datapath. For primes p384 and p521, it requires two extra clock cycles to process the most significant portions of the operands.

Fig. 7. Combined modular addition/subtraction, adopted from [8].

The proposed Modular Inverter is a hardware implementation of a slightly modified binary inversion algorithm from [8], shown in Fig. 8. Our modifications prevent the intermediate results from becoming negative (see steps 2.3a.2 and 2.3b.2). Inversion is the most expensive modular operation, with data-dependent delays. Consequently, we have decided to make the datapath of the Modular Inverter 521-bit wide, so that additions and subtractions are performed in one pass for all five NIST primes. In other words, we have deliberately avoided hardware time-multiplexing, sacrificing potential area savings for the sake of efficiency. When working with prime sizes smaller than 521 bits, the unused portions of the datapath are disabled to reduce dynamic power consumption. Our inverter takes 8 cycles to perform one pass through the outer loop of the algorithm. The number of loop iterations is bounded by 2s [8], where s is the prime size in bits. Hence, the upper bound on the inversion delay is 16s cycles. For example, computing an inversion modulo the 256-bit prime will take no more than 4096 cycles.

Fig. 8. Modular inversion, adopted from [8].

Note that inversion requires only one input operand. If an operand is 192-bit, 224-bit, or 256-bit wide, then the Modular Inverter reads it from the 265-bit bus A (see Fig. 5). However, if an operand is 384- or 521-bit wide, then our inverter simultaneously reads the least significant portion from bus A and the most significant portion from bus B. Thus, it takes advantage of the unused 265-bit bus B instead of performing two data transfers over bus A. To enable such an operand fetch, the Main Memory loads the Operand Register A with the least significant data portion and the Operand Register B with the most significant data portion.

B. Modular Multiplier

To perform modular multiplication, we have designed a regular (non-modular) multiplier coupled with a modular reduction circuit. Our Modular Reductor implements the fast reduction algorithms from [3] specific to the NIST primes. Consequently, the proposed design does not use relatively more complex Montgomery-type multiplication schemes.

Regular multiplications are performed using eight 32-bit multipliers and two addition stages.^3 Our regular multiplier can be directly used when working with the primes p192, p224, and p256. To handle p384 and p521, we have designed a larger divide-and-conquer structure [20] that time-multiplexes the multiplier and uses an extra 512-bit adder with an extra register

^3 The 32-bit multipliers are to be implemented using embedded 18 × 18-bit multiplier blocks provided in Xilinx FPGAs.

Authorized licensed use limited to: Kumaraguru College of Technology. Downloaded on August 11, 2009 at 06:11 from IEEE Xplore. Restrictions apply.
1106 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 8, AUGUST 2009

for storing intermediate results. As either 384-bit or 521-bit


operands generate the product wider than 521 bits, transferring
such a product to the modular reductor will require two passes
over 521-bit connector C between the multiplier and the reduc-
tion circuit.
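The divide-and-conquer splitting can be sketched in software. The fragment below is illustrative only—the split width and the function names are our own assumptions, not the hardware's parameters: it rebuilds a wide product from four narrower multiplications plus shifted accumulations, the role played by the time-multiplexed multiplier and the extra adder/register.

```python
# Illustrative sketch of the divide-and-conquer multiplication idea.
# HALF and all names here are assumptions for the example, not the RTL.
HALF = 256
MASK = (1 << HALF) - 1

def mul_narrow(a, b):
    """Stand-in for the core multiplier built from eight 32-bit multipliers."""
    assert a <= MASK and b <= MASK
    return a * b

def mul_wide(a, b):
    """Multiply operands up to 512 bits wide by time-multiplexing the
    narrow multiplier; the accumulator plays the role of the extra
    512-bit adder and intermediate-result register."""
    a_lo, a_hi = a & MASK, a >> HALF
    b_lo, b_hi = b & MASK, b >> HALF
    acc = mul_narrow(a_lo, b_lo)                 # pass 1
    acc += mul_narrow(a_lo, b_hi) << HALF        # pass 2
    acc += mul_narrow(a_hi, b_lo) << HALF        # pass 3
    acc += mul_narrow(a_hi, b_hi) << (2 * HALF)  # pass 4
    return acc
```

A 521-bit operand would be split the same way at a different boundary; the point is that one narrow multiplier plus an accumulator suffices, at the cost of several passes.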
Fig. 9 shows the implemented reduction algorithms. Our design objective was to handle these five different algorithms using a single circuit. Fig. 10 represents the gist of our idea: we use eight trees, each performing ten 32-bit additions and one 32-bit subtraction (intermediate registers are not shown). The i-th tree performs the summation of the i-th 32-bit words of the s-terms defined in Fig. 9. Another tree of 12-bit adders handles accumulated carries as well as the extra 9 bits during reductions modulo p_521.

For example, during reduction modulo p_256 we have to add s_1 and s_2, where s_1 = (c_7, c_6, c_5, c_4, c_3, c_2, c_1, c_0) and s_2 = (c_15, c_14, c_13, c_12, c_11, 0, 0, 0) are the corresponding s-terms in Fig. 9 defined for reduction modulo p_256. The first 32-bit word of s_1 is c_0, and the first 32-bit word of s_2 is 0. The first tree includes an adder that will sum c_0 and 0.4 Similarly, the eighth tree includes an adder that will sum c_7 and c_15.

Using these trees for a different prime involves controlling their inputs with 7:1 multiplexers. For example, if another prime is used, then the eighth tree must add the eighth 32-bit words of the s-terms defined for that prime, and the corresponding multiplexer will select those words as the tree inputs; if the s-terms for a given prime are fewer than eight words wide, the multiplexer selects 0 instead. Thus, five multiplexer inputs handle five possibilities (corresponding to five NIST primes) for selecting one of the eight 32-bit words of an appropriate s-term. Since more than eight 32-bit words are used to define the s-terms during reduction modulo p_384 or p_521, a second pass through our 8-tree circuit is required. The remaining two multiplexer inputs are used for selecting one of the remaining words during the second pass.

The i-th tree contributes the i-th word to the final result. That result may fall outside [0, p), as the eight 32-bit trees can generate a 3-bit carry/borrow. If the result already lies in [0, p), we output it directly. Otherwise, we reduce it modulo p using a shorter version of the same reduction algorithms: the input size becomes only a few bits larger than the prime size instead of twice the prime size, and the higher-order zero-valued words are optimized away.
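As a software cross-check of this scheme, the sketch below implements the published fast reduction modulo p_256 ([3], [8]) with plain Python integers. It models only the arithmetic—the signed combination of s-terms followed by a final correction—not the adder trees or multiplexers, and the helper names are ours.

```python
# Software model of NIST fast reduction modulo p_256, following the
# published algorithm ([3], [8]); arithmetic only, not the 8-tree circuit.
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

def words(c, n=16):
    """Split an integer into n 32-bit words, index 0 least significant."""
    return [(c >> (32 * i)) & 0xFFFFFFFF for i in range(n)]

def join(ws):
    """Assemble 32-bit words (least significant first) into an integer."""
    return sum(w << (32 * i) for i, w in enumerate(ws))

def reduce_p256(c):
    """Reduce a product c < 2**512 modulo p_256."""
    cw = words(c)
    # s-terms of the p_256 algorithm, least significant word first
    s1 = join([cw[i] for i in range(8)])
    s2 = join([0, 0, 0, cw[11], cw[12], cw[13], cw[14], cw[15]])
    s3 = join([0, 0, 0, cw[12], cw[13], cw[14], cw[15], 0])
    s4 = join([cw[8], cw[9], cw[10], 0, 0, 0, cw[14], cw[15]])
    s5 = join([cw[9], cw[10], cw[11], cw[13], cw[14], cw[15], cw[13], cw[8]])
    s6 = join([cw[11], cw[12], cw[13], 0, 0, 0, cw[8], cw[10]])
    s7 = join([cw[12], cw[13], cw[14], cw[15], 0, 0, cw[9], cw[11]])
    s8 = join([cw[13], cw[14], cw[15], cw[8], cw[9], cw[10], 0, cw[12]])
    s9 = join([cw[14], cw[15], 0, cw[9], cw[10], cw[11], 0, cw[13]])
    z = s1 + 2 * s2 + 2 * s3 + s4 + s5 - s6 - s7 - s8 - s9
    # Python's % stands in for the hardware's shorter second reduction,
    # which brings the small out-of-range remainder into [0, p).
    return z % P256
```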

VI. OPERATION SCHEDULER


To illustrate the flexibility of our ECC processor, this section provides two scheduling examples for computing kP. The first example uses NAF k, affine P, and Jacobian Q, with no countermeasures against side-channel (e.g., power analysis) attacks. The second example uses randomized JSF k = r + s, randomized Chudnovsky P, and Jacobian Q, while using atomic blocks to perform point operations. For further information on side-channel attacks and related countermeasures, the interested reader is referred to [21].

Fig. 9. Reduction modulo NIST primes p_192, p_224, p_256, p_384, p_521 [3].

Table VII shows the modular operation sequences for non-atomic point computations with affine P and Jacobian Q.

4If one of the inputs of an adder or subtractor is zero, then the nonzero input is fed directly to the output, i.e., the internal logic is bypassed, which eliminates redundant computations.
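The NAF-based flow of Fig. 2 follows the standard left-to-right method [8], which can be sketched as below. The tiny curve y² = x³ + 2x + 3 over GF(97), its base point, and all helper names are stand-ins chosen only to keep the sketch self-contained; the processor instead runs the Table VII instruction sequences in mixed affine-Jacobian coordinates.

```python
# Sketch of NAF recoding and left-to-right scalar multiplication [8].
# The toy curve and names are illustrative, not NIST parameters.
P_MOD, A_COEF = 97, 2   # field and 'a' coefficient of y^2 = x^3 + 2x + 3

def naf(k):
    """Non-adjacent form of k, most significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 or -1, forcing the next bit to zero
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits[::-1]

def add(p1, p2):
    """Affine point addition/doubling; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def neg(p):
    return None if p is None else (p[0], -p[1] % P_MOD)

def scalar_mul(k, p):
    """Double per NAF digit; add on +1, subtract on -1 (cf. Fig. 2)."""
    q = None
    for d in naf(k):
        q = add(q, q)          # point doubling
        if d == 1:
            q = add(q, p)      # point addition
        elif d == -1:
            q = add(q, neg(p)) # point subtraction
    return q
```

Because a NAF has about one third of its digits nonzero (versus one half for plain binary), the loop issues proportionally fewer point additions/subtractions, which is exactly the saving reflected in Table IX.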


Fig. 10. Block diagram of Modular Reductor’s ith tree. Carry signals of the 32-bit adders are collectively processed by a separate 12-bit tree (not shown).

TABLE VII
NON-ATOMIC AFFINE-JACOBIAN POINT OPERATIONS

Fig. 11 shows the modular operation graphs for atomic point computations with Chudnovsky P and Jacobian Q. Redundant additions (for maintaining atomicity) are shown in small boxes. Each atomic block corresponds to the following 6-instruction sequence: mul, add/sub, add/sub, mul, add/sub, add/sub.

A. Non-Atomic NAF Scalar Multiplication With Affine-Jacobian Point Operations

Fig. 12(a) shows an example of mapping instructions and data to the processor's Main Memory during non-atomic affine-Jacobian point operations (see Table VII). Four program pointers give the respective addresses of the 20-instruction point doubling program, the 18-instruction point addition program, the 18-instruction point subtraction program, and the 5-instruction point conversion program. The last program converts Jacobian Q to affine Q, involving 1 modular inversion and 4 modular multiplications. Note that one memory location is reserved for the jump instruction, and all four programs terminate with stop—their purpose will be explained shortly. Data pointers give the respective addresses of the coordinates of P and Q. Temporary variables are arranged in the Main Memory so that their pointers can be derived from the data pointers using appropriate address offsets.

Fig. 13 shows the supervisor's scheduling procedure for performing non-atomic affine-Jacobian point operations. This procedure follows the NAF scalar multiplication algorithm in Fig. 2. It starts with asserting the SUPERVISOR signal, which allows the supervisor to access the processor's Main Memory. Then, the supervisor loads the appropriate programs according to Fig. 12(a). It also loads the affine P-coordinates and the initial Jacobian Q-coordinates into their respective memory locations. Next, the supervisor scans the NAF scalar digits and calls function Schedule(t), shown in Table VIII, which forces the processor to execute a program stored at address t. This function writes a jump targeting t into the reserved location, resets, and restarts the processor. The processor executes the jump, then the program stored at t, followed by stop. The execution of stop results in the deassertion of signal BUSY. While waiting for this event (it may take thousands of clock cycles), the supervisor can switch to a different task and treat the deassertion of BUSY as an interrupt request. Once all scalar digits have been processed, the supervisor schedules the point conversion and then reads the final affine Q-coordinates from the Main Memory.

B. Atomic JSF Scalar Multiplication With Chudnovsky-Jacobian Point Operations

Fig. 11. Atomic Chudnovsky-Jacobian point operations. Boxes with [−/+] represent [−] for +P and [+] for −P. Redundant additions are shown in small boxes. (a) Q ← 2Q; (b) Q ← ±P + Q.

Fig. 12. Mapping examples for processor's Main Memory. (a) Mapping for non-atomic affine-Jacobian computations; (b) mapping for atomic Chudnovsky-Jacobian computations.

Fig. 12(b) shows an example of mapping instructions and data to the processor's Main Memory during atomic Chudnovsky-Jacobian point operations (see Fig. 11). Each atomic block is a separate 6-instruction program followed by stop. One group of pointers gives the respective addresses of the 4 atomic blocks needed for Q ← 2Q; two further groups give the respective addresses of the 7 atomic blocks needed for Q ← P + Q and the 7 atomic blocks needed for Q ← −P + Q. We also need 7 atomic blocks each for the corresponding additions and subtractions of the extra point, per the JSF scalar multiplication algorithm in Fig. 3. Since the algorithm uses an extra point equal to 2P, the Main Memory also contains the coordinates of 2P and a Chudnovsky point doubling program. This program requires 21 instructions: 9 modular multiplications and 12 modular additions/subtractions [8]. Finally, there are two programs for point conversions: affine-to-Chudnovsky (4 instructions) and Jacobian-to-affine (5 instructions). Recall that converting from affine to Chudnovsky coordinates involves 4 modular multiplications, while converting from Jacobian to affine coordinates also involves an additional modular inversion.

Fig. 13. Non-atomic scheduling for NAF k, affine P, and Jacobian Q.

Fig. 14 shows the supervisor's scheduling procedure for performing atomic Chudnovsky-Jacobian point operations. This procedure follows the JSF scalar multiplication algorithm in Fig. 3. First, it performs the affine-to-Chudnovsky conversion, computes 2P, and initializes Q based on the most significant digit values of r and s. Next, computations are performed one atomic block at a time in the WHILE-loop by calling Schedule(t). The executed atomic


Fig. 14. Atomic scheduling for JSF k = r + s, Chudnovsky P, and Jacobian Q.

TABLE VIII
SUPERVISOR'S FUNCTION Schedule(t)

TABLE IX
PROCESSOR CYCLE COUNTS FOR TWO CASES UNDER CONSIDERATION. (a) Non-atomic, NAF k, Affine P, Jacobian Q; (b) Atomic, JSF k = r + s, Chudnovsky P, Jacobian Q

block is at location t. Adopting the idea from [21], we compute the variable t without using IF-ELSE statements in order to maintain a uniform execution profile, regardless of the point operation type. According to Fig. 3, a point doubling is followed by a point addition if the current digit-pair sum is positive, and by a point subtraction if it is negative; there are no point additions/subtractions if the digit-pair sum is zero. We represent the 5-valued digit-pair sum using the 3-bit signed-magnitude format. Then, we apply bitwise operations NOT and AND to its bits to compute five corresponding 0–1 variables. Other 0–1 variables indicate that a particular point operation has been finished. Effectively, we check whether t (from the last loop iteration) points at the last atomic block for that point operation. For example, a point addition has 7 atomic blocks; if t equals the address of the last one (25), this point operation has been finished. To perform this check we first XOR t and 25 bitwise, then NOT and AND the bits of the result. This operation is symbolized in Fig. 14. Labels correspond to the first atomic blocks of the respective point operations: double, add, subtract, add, subtract. If no point operation has been finished (none of the completion variables is 1), we increment t. If a point addition/subtraction has been finished (one of its completion variables is 1), we assign to t the address of the first doubling block. If a point doubling has been finished, we assign to t the first block of one of the additions/subtractions, depending on which digit variable is 1. We decrement the digit index only if: a point addition/subtraction has been completed, OR a point doubling has been completed and the digit-pair sum is zero. Once the index becomes negative, Q is converted to affine coordinates, which completes the scalar multiplication. Note that an adversary must be unable to access the processor's memory or the bus between the supervisor and the processor. Otherwise, the scalar value can be obtained by examining the sequence of generated jump targets.

TABLE X
ESTIMATED TIMING OF OUR PROCESSOR RUNNING AT 60 MHz

C. Performance Comparison

Scalar multiplication procedures shown in Figs. 13 and 14 have different security and performance characteristics. The former is faster, while the latter is more secure. Table IX lists the number of cycles taken by the processor in both cases, for each NIST prime. We use the following notation:
• m is the number of digits in the NAF scalar representation;
• m′ is the number of nonzero digits in the NAF scalar representation;
• n is the number of digit pairs in the JSF scalar representation;
• n′ is the number of nonzero-sum digit pairs in the JSF scalar representation.
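For a rough feel of Table IX's entries, the tally below plugs in the p_256 per-instruction delays quoted in the worked example that follows (78-cycle mul, 33-cycle add/sub, 4118-cycle inv, 7-cycle jump and stop). The function name and the digit statistics (256 NAF digits, about a third nonzero) are our own illustrative assumptions, not the paper's exact table entries.

```python
# Illustrative tally of non-atomic NAF cycle counts for p_256, using the
# per-instruction delays quoted in the text.
MUL, ADDSUB, INV, JUMP, STOP = 78, 33, 4118, 7, 7

def nonatomic_cycles(digits, nonzero_digits):
    """Cycles for `digits` doublings, `nonzero_digits` additions or
    subtractions, and one Jacobian-to-affine conversion."""
    doublings = digits * (8 * MUL + 12 * ADDSUB)      # 8 mul + 12 add/sub each
    add_subs = nonzero_digits * (11 * MUL + 7 * ADDSUB)  # 11 mul + 7 add/sub
    conversion = INV + 4 * MUL                        # 1 inv + 4 mul
    control = (digits + nonzero_digits + 1) * (JUMP + STOP)
    return doublings + add_subs + conversion + control

# Typical 256-bit NAF scalar: 256 digits, roughly one third nonzero.
total = nonatomic_cycles(256, 256 // 3)
delay_ms = total / 60e6 * 1e3  # at the 60 MHz clock
```

This lands in the same few-millisecond range as the paper's own p_256 figure (7.2 ms), which additionally reflects the programming overheads and exact digit statistics.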


TABLE XI
PARAMETERS OF OTHER IMPLEMENTATIONS

Reference [14] reported delays only for a 384-bit prime; our delay estimate is for p_384.
Reference [10] did not report the maximum clock frequency; we assumed the same clock as in [18].
Reference [15] used 160-bit primes; our delay estimate is for p_192.

The following example illustrates how we calculate entries in Table IX. We consider the case of a non-atomic NAF scalar multiplication with affine-Jacobian point operations. Let the prime be p_256, so that we know how many clock cycles each instruction takes according to Table V. There are m (the number of NAF digits) Jacobian point doublings, each involving 8 mul instructions (78 × 8 cycles) and 12 add/sub instructions (33 × 12 cycles). Thus, all point doublings take 1020m cycles. There are m′ (the number of nonzero NAF digits) affine-Jacobian point additions/subtractions, each involving 11 mul instructions (78 × 11 cycles) and 7 add/sub instructions (33 × 7 cycles). Thus, all point additions/subtractions take 1089m′ cycles. Jacobian-to-affine point conversion involves 1 inv instruction (4118 cycles) and 4 mul instructions (78 × 4 cycles). Hence, the total over all point operations is 1020m + 1089m′ + 4430 cycles. Each point operation involves the execution of 1 jump instruction (7 cycles) and 1 stop instruction (7 cycles). The total contribution of these control instructions is 14(m + m′ + 1) cycles, which is negligible. The execution of all instructions takes 1034m + 1103m′ + 4444 cycles.

This expression does not include the processor programming overhead, which is calculated next. The point doubling program has 20 instructions, the point addition and point subtraction programs have 18 instructions each, and the point conversion program has 5 instructions. Thus, the supervisor has to write 65 instructions (including 4 stop's) into the processor memory. Also, the supervisor has to write/read 7 coordinates, each 8 words long (for the 256-bit prime p_256). Hence, the total number of 32-bit words to be accessed by the supervisor is 121. Each access (write or read) takes 2 cycles, which yields 242 cycles over all accesses. This is the static programming overhead. We also need to account for the overhead due to the execution of function Schedule(t), called m + m′ + 1 times. On each call, it writes a jump instruction into the processor memory (2 cycles) and performs a 7-step synchronization (7 cycles). Thus, the dynamic programming overhead amounts to 9(m + m′ + 1) cycles. The total overhead is 242 + 9(m + m′ + 1) cycles, which is negligible.

Table IX quantitatively illustrates two tradeoffs: 1) mathematical security versus performance and 2) physical security versus performance. By switching to a larger NIST prime, we increase the mathematical security of scalar multiplications, at the cost of reduced performance. By switching to atomic computations with randomized P and k, we increase the physical security of scalar multiplications, at the cost of reduced performance. Enabling these tradeoffs by supporting dynamically programmable ECC computations is the key feature of our hardware. It can offer either high security or high performance, depending on the user's priorities and application environment.

VII. IMPLEMENTATION RESULTS

We have mapped our ECC processor onto a Xilinx Virtex-4 XCV4FX100 FPGA to obtain quantitative performance figures for comparison purposes. The processor implementation runs at 60 MHz, occupies 20 793 slices (31 946 four-input LUTs), and uses 32 DSP48 blocks (embedded 18 × 18-bit multipliers) as well as 1 RAMB16 block (embedded 16-kbit RAM). Our clock frequency is relatively low, which is a consequence of choosing the large bitwidth for our datapath: routing 265-bit signals requires many stages of generic bit-level FPGA switches. The slowest signal path resides in the Modular Inverter: logic contributes 54% to the total delay, while wiring contributes 46%. The Modular Inverter occupies approximately the same number of slices as the Modular Multiplier. Both of them together consume approximately 90% of the total area.

Table X shows estimated delays of our processor performing non-atomic affine-Jacobian point operations and typical scalar multiplications, using either binary or NAF k. These figures are based on the 60 MHz clock and the cycle counts shown in Table IX(a), including the processor programming overhead. Note that we can use Table IX(a) not only for NAF k, but also for binary k. The only difference would be in the values of m and m′ used. For binary k, we assume m′ = m/2 on average. For NAF k, we assume m′ = m/3 on average. In both cases, we let m equal the prime size in (binary) bits or (NAF) digits. These assumptions are common in the literature.

Table XI summarizes scalar multiplication delays of other reported implementations and shows the corresponding delays in our case. It should be noted that due to substantial differences in the implementation technology, meaningful comparisons of delays and area-delay products are difficult.

Among implementations targeting 256-bit primes, [18] offers the smallest delay, while [10] and [11] offer the smallest area. This is not surprising, as [18] not only uses a superior implementation technology, but also features a custom modular multiplier, whereas [10] and [11] employ time-multiplexed modular adders and shifters instead. Our processor outperforms [10] and [11],


but it requires a significantly larger area. On the other hand, our processor is 2.3 times slower than [18]. This is a direct consequence of having a 2.3 times slower clock in our case, which is partially due to the relative inefficiency of an FPGA in comparison to an application-specific integrated circuit (ASIC). (Similar observations can be made when comparing our processor against [14].)

Implementations from [16] and [17] target 192-bit primes. Our processor is larger, as expected, due to its wider datapath. It is faster than [16], but slower than [17]—at 40 MHz, our delay is 7.2 ms, while [17] reports 3.0 ms. However, [17] ignores the cost of modular additions/subtractions and the cost of data transfers, which amounts to 3.1 ms in our case. Consequently, our comparable delay becomes 4.1 ms, which is still worse than 3.0 ms of [17]. Note that [17] is specialized to prime p_192, while our implementation also supports the other four NIST primes. In other words, the flexibility of our processor costs an extra 1.1 ms (37%) in performance when compared to [17], assuming the same 40 MHz clock for both implementations.

Reference [13] reports the fastest FPGA implementation to date. It runs at 39.5 MHz on a Xilinx Virtex-II Pro device and takes 3.86 ms per scalar multiplication. We have re-implemented our design using the same technology to allow for a detailed comparison against [13]. The results are shown in Table XII.

TABLE XII
DETAILED COMPARISON AGAINST [13]

Our processor and [13] use different modular multipliers and inverters. When implemented on its own, our modular multiplier runs at 58.6 MHz using 10 921 slices and 32 MULT blocks, while that of [13] runs at 45.7 MHz using 11 992 slices and 256 MULT blocks. (MULT blocks are embedded 18 × 18-bit multipliers.) On the other hand, our modular inverter runs at 49.9 MHz using 9774 slices, while that of [13] runs at 40.0 MHz using 14 800 slices. The modular inverter from [13] incorporates the modular multiplier as a sub-block, while our inverter and multiplier are separate blocks. This is the main reason why our processor uses 32% more slices; however, the processor from [13] uses 8 times more MULT blocks.

For [13], we report the sum of two delays in Table XII. The first term is the actual modular operation delay. The second term is the added data write/read delay (8 cycles to write the input, plus 8 cycles to read the output, per modular operation). For our case, we report the difference of two delays when using p_256. The first term is the full instruction delay, which includes the data write/read overhead (16 cycles per instruction). The second term is the subtracted fetch-decode delay (6 cycles per instruction). We also ignore the delays due to the execution of jump's and stop's as well as the processor programming overhead. Since the processor from [13] is not programmable, we believe it is appropriate to ignore these delays during comparisons.

Our scalar multiplication delays are 21% worse than those of [13], which is mostly due to our 20% slower modular multiplier. Another disadvantage of our multiplier is its restriction to the NIST primes p_192, p_224, and p_256 among primes of up to 256 bits, while that of [13] can handle any 256-bit prime. However, our multiplier also supports p_384 and p_521 and requires fewer resources. In other words, either multiplier offers certain features that the other does not. The choice between the two is user-specific.

Another interesting hardware implementation of a modular multiplier has been reported in [22]. It is significantly faster than, yet as flexible as, a software-based implementation. It can handle any prime size, provided that more computational time is allowed for larger primes. A compact 28-Kgate 80-MHz configuration from [22] performs a 256-bit modular multiplication in 7.4 µs. While this multiplier is significantly slower than ours and that of [13], it is much more flexible and can replace either one. Whether such a replacement is worth the performance penalty depends on the user.

Remark: Our FPGA implementation has a relatively large area, nearly half of which is taken by the 521-bit modular inverter. However, the inverter is not a critical component: we need only one inversion per scalar multiplication, which contributes only 1% to the overall delay (see Table XII). It may be worthwhile to move the inverter to software: for example, [8] reports 44.3-µs software-based inversions modulo a 256-bit prime on an 800-MHz Intel microprocessor. Without the inverter, our ECC processor occupies 13 571 slices and runs at 57.7 MHz on a Xilinx Virtex-II Pro FPGA (as opposed to 20 793 slices and 49.5 MHz from Table XII). The resulting hardware will take approximately 6.1 ms per typical NAF scalar multiplication, excluding a modular inversion. If we assume a 57.7-MHz software implementation of the inverter from [8], the inversion delay becomes 614 µs, and we obtain approximately 6.7 ms per typical NAF scalar multiplication (as opposed to 7.2 ms from Table XII). Thus, moving the inverter to software not only saves 35% in area, but also improves performance by 7%. However, the supervisor (running software) becomes burdened with time-consuming inversion-related computations.

VIII. CONCLUSION

We have described a flexible ECC processor for performing computationally expensive additions, subtractions, multiplications, and inversions over prime finite fields GF(p). Our architecture supports all five NIST primes with sizes ranging from 192 to 521 bits. It can also be programmed to execute modular operation sequences for any desired point operation. Depending on the used prime, our Xilinx Virtex-4 FPGA implementation takes between 4 and 40 ms to perform a typical NAF scalar multiplication.

The proposed processor does not support non-NIST primes, which limits its suitability for ECC applications that do not follow NIST recommendations. Another disadvantage of our processor is its relatively large area. However, the hardware area can be reduced dramatically by implementing modular inversions in software. Our future efforts will target the following enhancements: 1) enabling instruction-level parallelism; 2) enabling concurrent execution of point operations; 3) supporting non-NIST Montgomery-type modular multiplications; and 4) improving the efficiency of individual computational blocks and communication resources.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their critical suggestions that greatly improved the quality of this paper.

REFERENCES

[1] Institute of Electrical and Electronic Engineers, NY, "P1363 standard specifications for public key cryptography," 2000.
[2] American National Standards Institute, Washington, DC, "X9.62 public key cryptography for the financial services industry: Elliptic curve digital signature algorithm (ECDSA)," 1999.
[3] National Institute of Standards and Technology, Gaithersburg, MD, "FIPS 186—Digital signature standard," 1994.
[4] J. Goodman and A. Chandrakasan, "An energy-efficient reconfigurable public-key cryptography processor," IEEE J. Solid-State Circuits, vol. 36, no. 11, pp. 1808–1820, Nov. 2001.
[5] R. Chung, N. Telle, W. Luk, and P. Cheung, "Customizable elliptic curve cryptosystems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 9, pp. 1048–1058, Sep. 2005.
[6] N. Gura, S. Shantz, H. Eberle, S. Gupta, V. Gupta, D. Finchelstein, E. Goupy, and D. Stebila, "An end-to-end systems approach to elliptic curve cryptography," in Proc. Cryptographic Hardw. Embed. Syst., 2002, pp. 349–365.
[7] K. Ananyi and D. Rakhmatov, "Design of a reconfigurable processor for NIST prime field ECC," in Proc. IEEE Symp. Field-Program. Custom Comput. Mach., 2006, pp. 333–334.
[8] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. New York: Springer, 2004.
[9] H. Eberle, S. Shantz, V. Gupta, N. Gura, L. Rarick, and L. Spracklen, "Accelerating next-generation public-key cryptosystems on general-purpose CPUs," IEEE Micro, pp. 52–59, Mar. 2005.
[10] J. Wolkerstorfer, "Dual-field arithmetic unit for GF(p) and GF(2^m)," in Proc. Cryptographic Hardw. Embed. Syst., 2002, pp. 500–514.
[11] A. Daly, W. Marnane, T. Kerins, and E. Popovici, "An FPGA implementation of a GF(p) ALU for encryption processors," Elsevier Microprocessors Microsyst., vol. 28, pp. 253–260, 2004.
[12] K. Sakiyama, N. Mentens, L. Batina, B. Preneel, and I. Verbauwhede, "Reconfigurable modular arithmetic logic unit for high-performance public-key cryptosystems," in Proc. Int. Workshop Appl. Reconfigurable Comput., 2006, pp. 347–357.
[13] C. McIvor, M. McLoone, and J. McCanny, "Hardware elliptic curve cryptographic processor over GF(p)," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 9, pp. 1946–1957, Sep. 2006.
[14] SafeNet, Inc., Belcamp, MD, "SafeXcel IP public key accelerators," 2007 [Online]. Available: http://www.safenet-inc.com
[15] S. Ors, L. Batina, and B. Preneel, "Hardware implementation of an elliptic curve processor over GF(p)," in Proc. Int. Conf. Appl.-Specific Syst., Arch. Process., 2001, pp. 433–443.
[16] W. Shuhua and Z. Yuefei, "A timing-and-area tradeoff GF(p) elliptic curve processor architecture for FPGA," in Proc. Int. Conf. Commun., Circuits Syst., 2005, pp. 1308–1312.
[17] G. Orlando and C. Paar, "A scalable GF(p) elliptic curve processor architecture for programmable hardware," in Proc. Cryptographic Hardw. Embed. Syst., 2001, pp. 356–371.
[18] A. Satoh and K. Takano, "A scalable dual-field elliptic curve cryptographic processor," IEEE Trans. Comput., vol. 52, no. 4, pp. 449–460, Apr. 2003.
[19] S. Xu and L. Batina, "Efficient implementation of elliptic curve cryptosystems on an ARM7 with hardware accelerator," in Proc. Inf. Security Conf., 2001, pp. 266–279.
[20] B. Parhami, Computer Arithmetic. New York: Oxford, 2000.
[21] Advances in Elliptic Curve Cryptography, I. Blake, G. Seroussi, and N. Smart, Eds. New York: Cambridge, 2005.
[22] A. Tenca and C. Koc, "A scalable architecture for modular multiplication based on Montgomery's algorithm," IEEE Trans. Comput., vol. 52, no. 9, pp. 1215–1221, Sep. 2003.

Kendall Ananyi received the B.Sc. degree in electrical and electronics engineering from the University of Benin, Nigeria, and the M.A.Sc. degree in electrical and computer engineering from the University of Victoria, BC, Canada.
He is a Software Engineer with FinancialCAD Corporation, Surrey, BC, Canada, where he builds web systems for valuing financial derivatives.

Hamad Alrimeih received the B.Sc. degree in computer engineering from King Saud University, Saudi Arabia, the M.A.Sc. degree in computer engineering from Essex University, U.K., and the M.Sc. degree in electrical engineering from Edinburgh University, U.K. He is currently pursuing the Ph.D. degree in the Department of Electrical and Computer Engineering, University of Victoria, BC, Canada.
His research interests include flexible architectures for elliptic curve cryptography.

Daler Rakhmatov (M'02) received the B.Sc. degree in electrical engineering from the Rochester Institute of Technology, Rochester, NY, in 1996, and the M.Sc. and Ph.D. degrees in electrical engineering from the University of Arizona, Tempe, in 1998 and 2002, respectively.
He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada. His research interests include energy-efficient computing and dynamically reconfigurable systems.
