Authorized licensed use limited to: Kumaraguru College of Technology. Downloaded on August 11, 2009 at 06:11 from IEEE Xplore. Restrictions apply.
1100 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 8, AUGUST 2009
ANANYI et al.: FLEXIBLE HARDWARE PROCESSOR FOR ELLIPTIC CURVE CRYPTOGRAPHY 1101
TABLE II
EQUATIONS FOR COMPUTING POINT OPERATIONS USING AFFINE OR CHUDNOVSKY P AND JACOBIAN Q, ADOPTED FROM [8]
TABLE III
HARDWARE SOLUTIONS FOR ECC OVER GF(p)
TABLE IV
PROCESSOR I/O SIGNALS
TABLE V
NUMBER OF CLOCK CYCLES PER INSTRUCTION
TABLE VI
DATAPATH WIDTH OPTIONS
Fig. 10. Block diagram of Modular Reductor’s ith tree. Carry signals of the 32-bit adders are collectively processed by a separate 12-bit tree (not shown).
TABLE VIII
SUPERVISOR’S FUNCTION Schedule
TABLE IX
PROCESSOR CYCLE COUNTS FOR TWO CASES UNDER CONSIDERATION. t () = +
(a) Non-atomic, NAF k, Affine P, Jacobian Q; (b) Atomic, JSF k r s, Chudnovsky P, Jacobian Q
TABLE XI
PARAMETERS OF OTHER IMPLEMENTATIONS
Reference [14] reported delays only for a 384-bit prime; our delay estimate is for p .
Reference [10] did not report the maximum clock frequency; we assumed the same clock as in [18].
Reference [15] used 160-bit primes; our delay estimate is for p .
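The cycle-count bookkeeping behind the Table IX entries, as worked through in this section for the 256-bit prime, can be cross-checked with a short Python sketch. It is only an illustration under stated assumptions: the per-instruction cycle counts and instruction mixes are the ones quoted in the text, while the function and parameter names (`total_cycles`, `n_doublings`, `n_additions`) are hypothetical. The assumption of one Schedule call per point operation, and the operation counts in the usage line, are ours, since the corresponding symbols are not legible in the source.

```python
# Per-instruction cycle counts quoted in the text for the 256-bit prime (Table V).
MUL, ADDSUB, INV, JUMP, STOP = 78, 33, 4118, 7, 7

# Instruction mix per point operation, as stated in the text.
DOUBLING = 8 * MUL + 12 * ADDSUB    # Jacobian point doubling
ADDITION = 11 * MUL + 7 * ADDSUB    # affine-Jacobian point addition/subtraction
CONVERSION = 1 * INV + 4 * MUL      # Jacobian-to-affine point conversion
CONTROL = JUMP + STOP               # one jump and one stop per point operation


def static_overhead(words_per_coordinate=8):
    """Cycles spent by the supervisor loading programs and moving coordinates."""
    instructions = 20 + 18 + 18 + 5 + 4         # four programs plus 4 stop's
    coordinate_words = 7 * words_per_coordinate  # 7 coordinates of 32-bit words
    return 2 * (instructions + coordinate_words)  # 2 cycles per access


def total_cycles(n_doublings, n_additions):
    """Total cycles for one scalar multiplication, including overheads."""
    point_ops = n_doublings + n_additions + 1    # +1 for the final conversion
    execution = (n_doublings * DOUBLING
                 + n_additions * ADDITION
                 + CONVERSION)
    control = point_ops * CONTROL
    # ASSUMPTION: Schedule is called once per point operation; each call
    # writes a jump (2 cycles) and synchronizes (7 cycles).
    dynamic = point_ops * (2 + 7)
    return execution + control + dynamic + static_overhead()


if __name__ == "__main__":
    # ASSUMPTION: 256 doublings and about 256/3 additions for a NAF scalar.
    cycles = total_cycles(n_doublings=256, n_additions=85)
    print(cycles, "cycles,", round(cycles / 60e6 * 1e3, 2), "ms at 60 MHz")
```

With these assumed counts, the control instructions and the programming overhead together add only a few percent to the point-operation cycles, which is consistent with the text calling them negligible.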
The following example illustrates how we calculate entries in Table IX. We consider the case of a non-atomic NAF scalar multiplication with affine-Jacobian point operations. Let the prime be the 256-bit NIST prime, so that we know how many clock cycles each instruction takes according to Table V. Each Jacobian point doubling involves 8 mul instructions (78 × 8 cycles) and 12 add/sub instructions (33 × 12 cycles). Thus, all point doublings take 78 × 8 + 33 × 12 = 1020 cycles times the number of doublings. Each affine-Jacobian point addition/subtraction involves 11 mul instructions (78 × 11 cycles) and 7 add/sub instructions (33 × 7 cycles). Thus, all point additions/subtractions take 78 × 11 + 33 × 7 = 1089 cycles times the number of additions/subtractions. Jacobian-to-affine point conversion involves 1 inv instruction (4118 cycles) and 4 mul instructions (78 × 4 cycles). Hence, the total over all point operations is the sum of these two products plus 4118 + 78 × 4 = 4430 cycles. Each point operation involves the execution of 1 jump instruction (7 cycles) and 1 stop instruction (7 cycles). The total contribution of these control instructions is 14 cycles per point operation, which is negligible. The execution of all instructions takes the sum of the point-operation and control-instruction cycles.

This total does not include the processor programming overhead, which is calculated next. The point doubling program has 20 instructions, the point addition and point subtraction programs have 18 instructions each, and the point conversion program has 5 instructions. Thus, the supervisor has to write 65 instructions (including 4 stop's) into the processor memory. Also, the supervisor has to write/read 7 coordinates, each 8 words long (for the 256-bit prime). Hence, the total number of 32-bit words to be accessed by the supervisor is 65 + 7 × 8 = 121. Each access (write or read) takes 2 cycles, which yields 242 cycles over all accesses. This is the static programming overhead. We also need to account for the overhead due to the repeated execution of the supervisor's function Schedule (Table VIII). On each call, it writes a jump instruction into the processor memory (2 cycles) and performs a 7-step synchronization (7 cycles). Thus, the dynamic programming overhead amounts to 2 + 7 = 9 cycles per call. The total overhead is the sum of the static and dynamic parts, which is negligible.

Table IX quantitatively illustrates two tradeoffs: 1) mathematical security versus performance and 2) physical security versus performance. By switching to a larger NIST prime, we increase mathematical security of scalar multiplications, at the cost of reduced performance. By switching to atomic computations with randomized and , we increase physical security of scalar multiplications, at the cost of reduced performance. Enabling these tradeoffs by supporting dynamically programmable ECC computations is the key feature of our hardware. It can offer either high security or high performance, depending on the user's priorities and application environment.

VII. IMPLEMENTATION RESULTS

We have mapped our ECC processor onto a Xilinx Virtex-4 XC4VFX100 FPGA to obtain quantitative performance figures for comparison purposes. The processor implementation runs at 60 MHz, occupies 20 793 slices (31 946 four-input LUTs), and uses 32 DSP48 blocks (embedded 18 × 18-bit multipliers) as well as 1 RAMB16 block (embedded 16-kbit RAM). Our clock frequency is relatively low, which is the consequence of choosing a large bitwidth for our datapath: routing 265-bit signals requires many stages of generic bit-level FPGA switches. The slowest signal path resides in the Modular Inverter: logic contributes 54% to the total delay, while wiring contributes 46%. The Modular Inverter occupies approximately the same number of slices as the Modular Multiplier. Together, they consume approximately 90% of the total area.

Table X shows estimated delays of our processor performing non-atomic affine-Jacobian point operations and typical scalar multiplications, using either binary k or NAF k. These figures are based on the 60 MHz clock and cycle counts shown in Table IX(a), including the processor programming overhead. Note that we can use Table IX(a) not only for NAF k, but also for binary k. The only difference would be in the values of and used. For binary k, we assume on average. For NAF k, we assume on average. In both cases, we let equal the prime size in (binary) bits or (NAF) digits. These assumptions are common in the literature.

Table XI summarizes scalar multiplication delays of other reported implementations and shows the corresponding delays in our case. It should be noted that, due to substantial differences in the implementation technology, meaningful comparisons of delays and area-delay products are difficult.

Among implementations targeting 256-bit primes, [18] offers the smallest delay, while [10] and [11] offer the smallest area. This is not surprising, as [18] not only uses a superior implementation technology, but also features a custom modular multiplier, whereas [10] and [11] employ time-multiplexed modular adders and shifters instead. Our processor outperforms [10] and [11],
but it requires a significantly larger area. On the other hand, our processor is 2.3 times slower than [18]. This is a direct consequence of having a 2.3 times slower clock in our case, which is partially due to the relative inefficiency of an FPGA in comparison to an application-specific integrated circuit (ASIC). (Similar observations can be made when comparing our processor against [14].)

Implementations from [16] and [17] target 192-bit primes. Our processor is larger, as expected, due to its wider datapath. It is faster than [16], but slower than [17]: at 40 MHz, our delay is 7.2 ms, while [17] reports 3.0 ms. However, [17] ignores the cost of modular additions/subtractions and the cost of data transfers, which amounts to 3.1 ms in our case. Consequently, our comparable delay becomes 4.1 ms, which is still worse than the 3.0 ms of [17]. Note that [17] is specialized to the 192-bit NIST prime, while our implementation also supports the other four NIST primes. In other words, the flexibility of our processor costs an extra 1.1 ms (37%) in performance when compared to [17], assuming the same 40 MHz clock for both implementations.

Reference [13] reports the fastest FPGA implementation to date. It runs at 39.5 MHz on a Xilinx Virtex-II Pro device and takes 3.86 ms per scalar multiplication. We have re-implemented our design using the same technology to allow for a detailed comparison against [13]. The results are shown in Table XII.

TABLE XII
DETAILED COMPARISON AGAINST [13]

Our processor and [13] use different modular multipliers and inverters. When implemented on its own, our modular multiplier runs at 58.6 MHz using 10 921 slices and 32 MULT blocks, while that of [13] runs at 45.7 MHz using 11 992 slices and 256 MULT blocks. (MULT blocks are embedded 18 × 18-bit multipliers.) On the other hand, our modular inverter runs at 49.9 MHz using 9774 slices, while that of [13] runs at 40.0 MHz using 14 800 slices. The modular inverter from [13] incorporates the modular multiplier as a sub-block, while our inverter and multiplier are separate blocks. This is the main reason why our processor uses 32% more slices; however, the processor from [13] uses 8 times more MULT blocks.

For [13], we report the sum of two delays in Table XII. The first term is the actual modular operation delay. The second term is the added data write/read delay (8 cycles to write the input, plus 8 cycles to read the output, per modular operation). For our case, we report the difference of two delays when using . The first term is the full instruction delay, which includes the data write/read overhead (16 cycles per instruction). The second term is the subtracted fetch-decode delay (6 cycles per instruction). We also ignore the delays due to the execution of jump's and stop's as well as the processor programming overhead. Since the processor from [13] is not programmable, we believe it is appropriate to ignore these delays during comparisons.

Our scalar multiplication delays are 21% worse than those of [13], which is mostly due to our 20% slower modular multiplier. Another disadvantage of our multiplier is its restriction to , , and among 256-bit primes, while that of [13] can handle any 256-bit prime. However, our multiplier also supports and and requires fewer resources. In other words, each multiplier offers certain features that the other does not. The choice between the two is user-specific.

Another interesting hardware implementation of a modular multiplier has been reported in [22]. It is significantly faster than, yet as flexible as, a software-based implementation. It can handle any prime size, provided that more computational time is allowed for larger primes. A compact 28-Kgate 80-MHz configuration from [22] performs a 256-bit modular multiplication in 7.4 μs. While this multiplier is significantly slower than ours and that of [13], it is much more flexible and can replace either one. Whether such a replacement is worth the performance penalty depends on the user.

Remark: Our FPGA implementation has a relatively large area, nearly half of which is taken by the 521-bit modular inverter. However, the inverter is not a critical component: we need only one inversion per scalar multiplication, which contributes only 1% to the overall delay (see Table XII). It may be worthwhile to move the inverter to software: for example, [8] reports 44.3-μs software-based inversions in on an 800-MHz Intel microprocessor. Without the inverter, our ECC processor occupies 13 571 slices and runs at 57.7 MHz on a Xilinx Virtex-II Pro FPGA (as opposed to 20 793 slices and 49.5 MHz from Table XII). The resulting hardware will take approximately 6.1 ms per typical NAF scalar multiplication, excluding a modular inversion. If we assume a 57.7-MHz software implementation of the inverter from [8], the inversion delay becomes 614 μs, and we obtain approximately 6.7 ms per typical NAF scalar multiplication (as opposed to 7.2 ms from Table XII). Thus, moving the inverter to software not only saves 35% in area, but also improves performance by 7%. However, the supervisor (running software) becomes burdened with time-consuming inversion-related computations.

VIII. CONCLUSION

We have described a flexible ECC processor for performing computationally expensive additions, subtractions, multiplications, and inversions over prime finite fields GF(p). Our architecture supports all five NIST primes with sizes ranging from 192 to 521 bits. It can also be programmed to execute modular operation sequences for any desired point operation. Depending on the prime used, our Xilinx Virtex-4 FPGA implementation takes between 4 and 40 ms to perform a typical NAF scalar multiplication.

The proposed processor does not support non-NIST primes, which limits its suitability for ECC applications that do not follow NIST recommendations. Another disadvantage of our processor is its relatively large area. However, the hardware area can be reduced dramatically by implementing modular inversions in software. Our future efforts will target the following
enhancements: 1) enabling instruction-level parallelism; 2) enabling concurrent execution of point operations; 3) supporting non-NIST Montgomery-type modular multiplications; and 4) improving the efficiency of individual computational blocks and communication resources.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their critical suggestions that greatly improved the quality of this paper.

REFERENCES

[1] Institute of Electrical and Electronic Engineers, NY, “P1363 standard specifications for public key cryptography,” 2000.
[2] American National Standards Institute, Washington, DC, “X 9.62 public key cryptography for the financial services industry: Elliptic curve digital signature algorithm (ECDSA),” 1999.
[3] National Institute of Standards and Technology, Gaithersburg, MD, “FIPS 186 - Digital signature standard,” 1994.
[4] J. Goodman and A. Chandrakasan, “An energy-efficient reconfigurable public-key cryptography processor,” IEEE J. Solid-State Circuits, vol. 36, no. 11, pp. 1808–1820, Nov. 2001.
[5] R. Chung, N. Telle, W. Luk, and P. Cheung, “Customizable elliptic curve cryptosystems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 9, pp. 1048–1058, Sep. 2005.
[6] N. Gura, S. Shantz, H. Eberle, S. Gupta, V. Gupta, D. Finchelstein, E. Goupy, and D. Stebila, “An end-to-end systems approach to elliptic curve cryptography,” in Proc. Cryptographic Hardw. Embed. Syst., 2002, pp. 349–365.
[7] K. Ananyi and D. Rakhmatov, “Design of a reconfigurable processor for NIST prime field ECC,” in Proc. IEEE Symp. Field-Program. Custom Comput. Mach., 2006, pp. 333–334.
[8] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. New York: Springer, 2004.
[9] H. Eberle, S. Shantz, V. Gupta, N. Gura, L. Rarick, and L. Spracklen, “Accelerating next-generation public-key cryptosystems on general-purpose CPUs,” IEEE Micro, pp. 52–59, Mar. 2005.
[10] J. Wolkerstorfer, “Dual-field arithmetic unit for GF(p) and GF(2^m),” in Proc. Cryptographic Hardw. Embed. Syst., 2002, pp. 500–514.
[11] A. Daly, W. Marnane, T. Kerins, and E. Popovici, “An FPGA implementation of a GF(p) ALU for encryption processors,” Elsevier Microprocessors Microsyst., vol. 28, pp. 253–260, 2004.
[12] K. Sakiyama, N. Mentens, L. Batina, B. Preneel, and I. Verbauwhede, “Reconfigurable modular arithmetic logic unit for high-performance public-key cryptosystems,” in Proc. Int. Workshop Appl. Reconfigurable Comput., 2006, pp. 347–357.
[13] C. McIvor, M. McLoone, and J. McCanny, “Hardware elliptic curve cryptographic processor over GF(p),” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 9, pp. 1946–1957, Sep. 2006.
[14] SafeNet, Inc., Bellcamp, MD, “SafeXcel IP public key accelerators,” 2007 [Online]. Available: http://www.safenet-inc.com
[15] S. Ors, L. Batina, and B. Preneel, “Hardware implementation of an elliptic curve processor over GF(p),” in Proc. Int. Conf. Appl.-Specific Syst., Arch. Process., 2001, pp. 433–443.
[16] W. Shuhua and Z. Yuefei, “A timing-and-area tradeoff GF(p) elliptic curve processor architecture for FPGA,” in Proc. Int. Conf. Commun., Circuits Syst., 2005, pp. 1308–1312.
[17] G. Orlando and C. Paar, “A scalable GF(p) elliptic curve processor architecture for programmable hardware,” in Proc. Cryptographic Hardw. Embed. Syst., 2001, pp. 356–371.
[18] A. Satoh and K. Takano, “A scalable dual-field elliptic curve cryptographic processor,” IEEE Trans. Comput., vol. 52, no. 4, pp. 449–460, Apr. 2003.
[19] S. Xu and L. Batina, “Efficient implementation of elliptic curve cryptosystems on an ARM7 with hardware accelerator,” in Proc. Inf. Security Conf., 2001, pp. 266–279.
[20] B. Parhami, Computer Arithmetic. New York: Oxford, 2000.
[21] Advances in Elliptic Curve Cryptography, I. Blake, G. Seroussi, and N. Smart, Eds. New York: Cambridge, 2005.
[22] A. Tenca and C. Koc, “A scalable architecture for modular multiplication based on Montgomery’s algorithm,” IEEE Trans. Comput., vol. 52, no. 9, pp. 1215–1221, Sep. 2003.

Kendall Ananyi received the B.Sc. degree in electrical and electronics engineering from the University of Benin, Nigeria, and the M.A.Sc. degree in electrical and computer engineering from the University of Victoria, BC, Canada. He is a Software Engineer with FinancialCAD Corporation, Surrey, BC, Canada, where he builds web systems for valuing financial derivatives.

Hamad Alrimeih received the B.Sc. degree in computer engineering from King Saud University, Saudi Arabia, the M.A.Sc. degree in computer engineering from Essex University, U.K., and the M.Sc. degree in electrical engineering from Edinburgh University, U.K. He is currently pursuing the Ph.D. degree from the Department of Electrical and Computer Engineering, the University of Victoria, BC, Canada. His research interests include flexible architectures for elliptic curve cryptography.

Daler Rakhmatov (M’02) received the B.Sc. degree in electrical engineering from the Rochester Institute of Technology, Rochester, NY, in 1996, and the M.Sc. and Ph.D. degrees in electrical engineering from the University of Arizona, Tempe, in 1998 and 2002, respectively. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada. His research interests include energy-efficient computing and dynamically reconfigurable systems.