Abstract—This paper presents a systematic design approach to provide optimized Rivest–Shamir–Adleman (RSA) processors based on high-radix Montgomery multipliers that satisfy various user requirements, such as circuit area, operating time, and resistance against side-channel attacks. To address the tradeoff between performance and resistance, we apply four types of exponentiation algorithms: two variants of the binary method, each with and without the Chinese Remainder Theorem (CRT). We also introduce three multiplier-based datapath architectures using different intermediate data forms: 1) single form, 2) semi carry-save form, and 3) carry-save form, and combine them with a wide variety of arithmetic components. Their radices are parameterized from $2^{8}$ to $2^{128}$. A total of 242 datapaths for 1024-bit RSA processors were obtained for each radix. The potential of the proposed approach is demonstrated through an experimental synthesis of all possible processors with a 90-nm CMOS standard cell library. The resulting designs range from the smallest, with 861 gates at 118.47 ms/RSA, to the fastest, with 153 862 gates at 0.67 ms/RSA. In addition, the use of the CRT technique reduced the RSA operation time of the fastest design to 0.24 ms. Even when an exponentiation algorithm resistant to typical side-channel attacks is employed, the fastest design can perform the RSA operation in less than 1.0 ms.

Index Terms—Application-specific integrated circuit (ASIC) implementation, high-radix Montgomery multiplication, Rivest–Shamir–Adleman (RSA) cryptosystem.

I. INTRODUCTION

modular multiplication is essential in order to achieve high-performance RSA cryptosystem designs. The Montgomery multiplication algorithm [2], which does not require trial division, is widely used for practical hardware and software implementations because of its high speed.

Many computational techniques and hardware architectures have been proposed for Montgomery multiplication [3]–[11]. Among them, the radix-2 algorithms proposed in [3] and [4] are primarily implemented with long, operand-width adders that scan the operand bit by bit in a straightforward manner. Such hardware architectures have large fan-out signals and large wire delays for long operands. These drawbacks can be reduced by systolic array architectures [6], [7] with multiple operation units. However, these architectures are usually tailored for fixed-precision computations and cannot respond flexibly to changes in operand size. To deal with variable-length data, a radix-2 architecture was proposed [8]–[10] in which an operand is divided into word blocks, and the operand-width addition is performed by repeating word-width additions. These radix-2 architectures are quite simple, but it is difficult to improve their circuit area and efficiency. A high-radix architecture using a 64-bit × 64-bit multiplier was proposed in [11] to achieve higher circuit efficiency. The performance of such a multiplier-based architecture depends heavily on the datapath structure, and varies with the structure of the arithmetic components, but previous papers have focused on designing their own architectures. These architectures are optimized for some design parameters, such as size and speed, while the most suitable design point in prac-
Given two large integers, the Montgomery multiplication algorithm performs the following operation:

(1)

Algorithm 2 shows the high-radix Montgomery multiplication algorithm [11], where the uppercase and lowercase letters indicate the multi-word operands and the individual words, respectively. Each operand is divided into smaller words, which are processed in
1138 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 7, JULY 2011
ALGORITHM 3:
MONTMULT USING SINGLE FORM (TYPE-I)
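The bodies of the MontMult algorithms are not reproduced in this text. As a rough illustration only, the following Python sketch shows a generic word-serial, high-radix Montgomery multiplication. The function name `montmult`, the parameters `r_bits` (word size) and `num_words`, and the precomputed constant `w` are assumptions made for this sketch; it does not model the Type-I datapath, its registers, or its cycle count.

```python
def montmult(x, y, n, r_bits, num_words):
    """Return x * y * 2^(-r_bits * num_words) mod n for odd n and x, y < n.

    Hypothetical sketch: y is scanned one r_bits-wide word per iteration.
    Requires Python 3.8+ for pow() with a negative exponent.
    """
    mask = (1 << r_bits) - 1
    # w = -n^(-1) mod 2^r_bits, precomputed once per modulus.
    w = (-pow(n, -1, 1 << r_bits)) & mask
    z = 0
    for i in range(num_words):
        y_i = (y >> (i * r_bits)) & mask   # i-th word of y
        z += x * y_i                       # multiply-accumulate step
        t = ((z & mask) * w) & mask        # digit that clears the low word
        z = (z + t * n) >> r_bits          # exact shift: low word is zero
    # The subtraction is always computed; the result is selected afterwards.
    z_minus_n = z - n
    return z_minus_n if z_minus_n >= 0 else z
```

With `r_bits = 8` and `num_words = 8`, for instance, the function returns the product scaled by the inverse of `2**64` modulo `n`. Python's conditional expression stands in for the final selection here; it makes no claim about constant-time behavior.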
ALGORITHM 5:
MONTMULT USING CARRY-SAVE FORM (TYPE-III)
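Carry-save form keeps a sum vector and a carry vector separate, so that accumulation needs no carry propagation and a single carry-propagate addition (CPA) resolves the pair at the end. The sketch below illustrates that idea on Python integers. The helper names are hypothetical, and it is not a model of Algorithm 5 or of the Type-III registers.

```python
def carry_save_add(a, b, c):
    """3:2 compression: return (sum_vec, carry_vec) with a + b + c == sum_vec + carry_vec."""
    sum_vec = a ^ b ^ c                              # bitwise sum, no carries
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, shifted left
    return sum_vec, carry_vec


def accumulate_carry_save(terms):
    """Accumulate non-negative integers while keeping the total in carry-save form."""
    s, c = 0, 0
    for t in terms:
        s, c = carry_save_add(s, c, t)   # no carry propagation in the loop
    return s, c


def resolve(s, c):
    """The single carry-propagate addition that converts back to single form."""
    return s + c
```

Each call to `carry_save_add` has a bit-level depth that does not depend on the operand width, which is why postponing the CPA shortens the critical path at the cost of storing two vectors.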
memory outside the core. At the end of the outer loop (Line 10), the two carry signals are added to the result using the internal CPA, and the lower bits and upper 2 bits of the sum are stored for the next iteration as an intermediate sum and a carry, respectively. This architecture requires the fewest registers and intermediate wires among the three architectures. However, the Arithmetic Core has the longest critical path because of the wide CPA operation. The number of cycles for Algorithm 3 on the Type-I architecture is given as

(5)

To prevent timing attacks [21] against Montgomery multiplication, the final subtraction at Line 12 is always performed regardless of the conditional branching, and the values before and after the subtraction are stored in different memory spaces. The subtraction is performed by the Arithmetic Core. After the subtraction, one of the two values is selected as the final result.

The Type-II architecture enhances the hardware efficiency, which is defined as the product of the operation speed (i.e., critical path) and the area. The Arithmetic Core has a CPA following the PPA stage. The PPA produces four outputs: two carry signals and two sum signals. The carry signals are fed back to the core. The sum signals, on the other hand, are fed to the following CPA and are converted to a single-form output and a 1-bit carry. As a result, the Type-II architecture has the registers CS1, CS2, and EC to store the three intermediate carries, respectively. The output at Line 4 is calculated in the same manner as with the Type-I architecture. For each multiply-addition step, the output word is stored into a register or in an external memory. At the end of the outer loop, the stored carry signals are added to the result for the next iteration. The lower bits and upper 2 bits are stored as an intermediate sum and a carry, respectively. The CPA is approximately half the size of the CPA in the Type-I architecture, and thus the entire critical path is shortened by 25%, while the number of registers is increased. The number of cycles for Algorithm 4 on the Type-II architecture is given as

(6)

Compared with the Type-I architecture, the Type-II architecture requires more cycles to calculate the output pair in Line 10.

The Type-III architecture has the fastest datapath since there is no carry propagation in the Arithmetic Core. Both the carry and the sum signals from the PPA are fed back into the core in carry-save form. The registers CS1, CS2, ZS1, and ZS2 and the 1-bit register EC are inserted to store the carry-save signals in this architecture. The CPA is performed outside the core at the end of every iteration cycle, and generates a single-form output and a 1-bit carry. The Type-III architecture calculates an output at Lines 4 and 5 in two steps. The PPA first calculates two outputs in carry-save form, and then the CPA generates the output outside the core. At the end of the outer loop, the carry-save signals are added to the result, and its lower bits and upper 2 bits are stored as an intermediate sum and the carry, respectively. The critical path is approximately half that of the Type-I architecture, while the largest number of registers is required to handle two pairs of carry-save signals. The number of cycles for Algorithm 5 on the Type-III architecture is given as

(7)

Compared with the Type-I architecture, the Type-III architecture requires extra cycles for the additions in Lines 5, 7, 11, and 13.

III. RSA PROCESSOR

A. RSA Cryptosystem

The RSA cryptosystem employs modular exponentiation for encryption and decryption as follows:

(8)

There are two basic types of efficient exponentiation algorithms, known as binary methods and window methods [22], [23]. The binary method performs multiplication and squaring operations sequentially according to the bit pattern of the exponent and is mainly categorized as either a left-to-right or a right-to-left binary method. The former starts at the exponent's MSB and works downward, while the latter starts at the exponent's LSB and works upward. The left-to-right binary method is frequently used for hardware implementations in smartcards and embedded devices because it requires fewer hardware resources than the right-to-left binary method. In contrast, the window method (e.g., the sliding window method) processes more than one bit of the exponent in each iteration cycle and reduces the number of multiplication operations using precomputed values. It requires fewer clock cycles but more memory resources than the binary methods and thus is often used for software implementation on processors with large memory resources. Therefore, we adopt the left-to-right binary method for the sequencer in our RSA processor.

We also adopt a variation of the binary methods with tamper-resistant features, namely, the square-multiply exponentiation method [16]. The side-channel attacks considered here are not DPAs but rather SPAs against modular exponentiation methods. Such SPA-type attacks can be performed more easily than DPA-type attacks in practice due to their simplicity. The square-multiply exponentiation method is regarded as a variation of the square-and-multiply-always method [14] or the Montgomery powering ladder [15]. The square-and-multiply-always method is known as a typical countermeasure against SPAs [13]; it adds dummy multiplication operations to the binary method. A dummy multiplication is inserted even for the zero bits of the exponent so that both a squaring and a multiplication are performed for each bit. However, this countermeasure is vulnerable to safe-error attacks [24], which induce a precisely timed fault during the multiplication process. The Montgomery powering ladder [15] can prevent both SPA-type attacks and safe-error attacks. The algorithm always performs a multiplication and a squaring as a pair, makes meaningful use of both of its variables, and does not involve dummy multiplications. This feature
MIYAMOTO et al.: SYSTEMATIC DESIGN OF RSA PROCESSORS BASED ON HIGH-RADIX MONTGOMERY MULTIPLIERS 1141
ALGORITHM 6:
LEFT-TO-RIGHT BINARY METHOD WITH MONTMULT
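The body of Algorithm 6 is not reproduced in this text. A minimal sketch of a left-to-right binary exponentiation built on a Montgomery product might look as follows. It assumes an odd modulus `n` smaller than `2**k`; the names `make_montmult` and `modexp_left_to_right`, and the use of Python big-integer arithmetic in place of a hardware MontMult, are choices made for this illustration and are not taken from the paper.

```python
def make_montmult(n, k):
    """Return a Montgomery product a*b*R^(-1) mod n with R = 2^k (hypothetical helper)."""
    r = 1 << k
    r_inv = pow(r, -1, n)          # requires odd n; Python 3.8+

    def montmult(a, b):
        return (a * b * r_inv) % n

    return montmult, r


def modexp_left_to_right(base, exponent, n, k):
    """Compute base^exponent mod n by scanning exponent bits from the MSB down."""
    montmult, r = make_montmult(n, k)
    r2 = (r * r) % n                   # constant used to enter the Montgomery domain
    x = montmult(base % n, r2)         # base * R mod n
    z = montmult(1, r2)                # 1 * R mod n
    for bit in bin(exponent)[2:]:      # most significant bit first
        z = montmult(z, z)             # squaring for every bit
        if bit == "1":
            z = montmult(z, x)         # multiplication only for one bits
    return montmult(z, 1)              # leave the Montgomery domain
```

Because the multiplication is skipped for zero bits, the sequence of operations in this sketch depends on the exponent, which is the property that SPA-type attacks exploit.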
TABLE I
OPERATION CYCLES OF RSA OPERATIONS (TYPE-I ARCHITECTURE)
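Table I concerns cycle counts. At a coarser level, exponentiation methods of the kind discussed in Section III can be compared by the number of modular multiplications they issue. The following sketch counts those operations for the plain left-to-right binary method, the square-and-multiply-always method, and the Montgomery powering ladder. It uses ordinary Python modular arithmetic instead of MontMult, the function names are hypothetical, and it neither implements the square-multiply exponentiation method of [16] nor reproduces any value from Table I.

```python
def binary_ltr(x, e, n):
    """Plain left-to-right binary method; returns (result, multiplication count)."""
    z, ops = 1, 0
    for bit in bin(e)[2:]:
        z = (z * z) % n
        ops += 1
        if bit == "1":
            z = (z * x) % n
            ops += 1
    return z, ops


def square_and_multiply_always(x, e, n):
    """A multiplication is issued for every bit; it is a dummy when the bit is zero."""
    z, ops = 1, 0
    for bit in bin(e)[2:]:
        z = (z * z) % n
        t = (z * x) % n                 # computed regardless of the bit value
        ops += 2
        z = t if bit == "1" else z      # dummy result is discarded for zero bits
    return z, ops


def montgomery_ladder(x, e, n):
    """Montgomery powering ladder: both variables are updated for every bit."""
    r0, r1, ops = 1, x % n, 0
    for bit in bin(e)[2:]:
        if bit == "1":
            r0 = (r0 * r1) % n
            r1 = (r1 * r1) % n
        else:
            r1 = (r0 * r1) % n
            r0 = (r0 * r0) % n
        ops += 2
    return r0, ops
```

For an exponent of bit length L with h one bits, the plain method issues L + h multiplications, while the two regular methods issue 2L. In the ladder every product feeds a later step, so there is no dummy operation for a safe-error attack to target.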
Fig. 4. Synthesis results of 242 designs for each type: (a) Type-I; (b) Type-II; (c) Type-III.
TABLE IV
PPA AND CPA ALGORITHMS
Fig. 6. Performance variations across radices. (a) Datapath area. (b) Delay time. (c) RSA1 operation time. (d) Efficiency.
TABLE V
PERFORMANCE OF 1024-BIT RSA PROCESSORS
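Table V includes a CRT-RSA1 time. The CRT technique [17] replaces one full-size exponentiation with two exponentiations on half-size moduli and a recombination step. The sketch below shows the idea with Python's built-in `pow` and Garner's recombination. The function name and its parameters (`p`, `q` for the prime factors, `d` for the private exponent) are assumptions of this illustration, and it does not describe the sequencer used in the processors.

```python
def rsa_crt_decrypt(c, p, q, d):
    """Recover c^d mod (p*q) from two half-size exponentiations (hypothetical helper)."""
    dp = d % (p - 1)                 # reduced exponent for the mod-p branch
    dq = d % (q - 1)                 # reduced exponent for the mod-q branch
    q_inv = pow(q, -1, p)            # q^(-1) mod p; Python 3.8+
    mp = pow(c % p, dp, p)           # half-size exponentiation modulo p
    mq = pow(c % q, dq, q)           # half-size exponentiation modulo q
    h = (q_inv * (mp - mq)) % p      # Garner recombination coefficient
    return mq + h * q
```

Each branch works on operands and exponents of roughly half the length, which is the source of the speedup that CRT offers over a single full-length exponentiation.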
architecture and the arithmetic components to tailor the circuit performance.

Table V summarizes the above results, where the columns entitled Mont. mult. time, RSA1 time, RSA2 time, and CRT-RSA1 time indicate the computation times of Montgomery multiplication, RSA1, RSA2 (i.e., square-multiply exponentiation method), and CRT-RSA1 (i.e., RSA1 with the CRT technique) operations, respectively. The three rows in bold font indicate the best performance among all of the developed designs in terms of operating speed (Speed), hardware efficiency (Balance), and circuit area (Area), respectively. In addition, Table VI shows a performance comparison between our designs and conventional
TABLE VI
PERFORMANCE COMPARISON WITH CONVENTIONAL DESIGNS
designs [6]–[11]. In comparison, our design approach provides a considerably wider variety of designs, ranging from the smallest area of 861 gates with a Type-I processor to the shortest RSA operating time of 0.67 ms at 421.94 MHz with a Type-III processor. The highest hardware efficiency of 83.12 s·gates was achieved with a Type-II processor. Thus, the top performance obtained using the proposed system is higher than that of conventional designs.

In addition to the various datapath architectures, the proposed system provides four exponentiation algorithms that take into account the tradeoff between total computational cost and resistance against SPA-based side-channel attacks. As shown in Table V, the Type-III CRT-RSA1 processor achieved the fastest operating time of 0.24 ms. Even when the countermeasure was applied, the RSA processors achieved an operating time of less than 1.0 ms (the fastest time was 0.89 ms with a Type-III RSA2 processor). These processors can be easily obtained without any changes to the datapath architectures. Although the size of the sequencer modules is slightly increased, this increase is only a few percent of the total size of the RSA processors, as shown in Tables II and III.

As described above, our systematic approach yields a wide variety of performance data sets from the exhaustive synthesis, which can be used as a reference for designing future RSA processors with a 90-nm CMOS standard cell library. From this reference, we can also estimate the performance of tamper-resistant RSA processors based on different exponentiation algorithms. By selecting adequate design parameters, such as architecture and radix, at each design level, our method can provide the best RSA processor design to meet the requirements of a target application. For example, the radix value might be predetermined by the application or the system. Even in that case, our method can still achieve the best possible performance through the combination of the other design parameters.

In this experiment, a cell-based design with an ASIC library was investigated, but our design approach can be applied to any process technologies, libraries, and synthesis parameters. In addition, each design stage of the proposed approach can be optimized independently, and thus the proposed approach also allows easy addition of new architectures or arithmetic components depending on the synthesis conditions.

V. CONCLUSION

This paper proposed a systematic approach to designing high-performance RSA processors. A number of RSA hardware architectures optimized for several design parameters have been proposed, but it is not feasible to design all architectures independently to find the best design that meets the performance requirements for practical use. In contrast, the proposed approach provides optimized datapaths satisfying the requirements by combining three datapath architectures using different intermediate data forms [(Type-I) single form, (Type-II) semi carry-save form, and (Type-III) carry-save form], a wide variety of arithmetic components, and a range of radices. 1024-bit RSA processors ranging from 861 gates at 118.47 ms/RSA to 153 862 gates at 0.67 ms/RSA in a 90-nm CMOS standard cell library were obtained by exhaustive synthesis for all possible combinations. Other than these two designs, a user can freely select the best design to fit their application from among these combinations and can also choose other process technologies.

In addition to the approach from the datapath-architecture level to the arithmetic-component level, the tradeoff between the performance (operating speed) and the resistance against side-channel attacks can be optimized at the algorithm level using the square-multiply exponentiation method and the CRT technique. The fastest processor with the left-to-right binary method and CRT achieved an operating speed of 0.24 ms/RSA, while the square-multiply exponentiation method provided the highest level of resistance against side-channel attacks such as SPA and safe-error attacks, and performed the RSA operation within 1.0 ms. These processors with CRT and/or countermeasures can be implemented at a low area cost without any changes to the datapath architectures.

In future studies, some of the best RSA processors will be implemented in ASICs and evaluated in terms of tamper resistance as well as circuit performance. Also, further research to support other public-key cryptographic algorithms, such as elliptic curve cryptography, will be conducted.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.
[2] P. L. Montgomery, “Modular multiplication without trial division,” Math. Comput., vol. 44, no. 170, pp. 519–521, Apr. 1985.
[3] A. Daly and W. Marnane, “Efficient architectures for implementing Montgomery modular multiplication and RSA modular exponentiation on reconfigurable logic,” in Proc. ACM/SIGDA 10th Int. Symp. on Field-Program. Gate Arrays, Nov. 2002, pp. 40–49.
[4] C. McIvor, M. McLoone, J. McCanny, A. Daly, and W. Marnane, “Fast Montgomery modular multiplication and RSA cryptographic processor architectures,” in Proc. 37th Annu. Asilomar Conf. Signals, Syst. Comput., Nov. 2003, pp. 379–384.
[5] F. Crowe, A. Daly, and W. Marnane, “A scalable dual mode arithmetic unit for public key cryptosystems,” in Proc. IEEE Int. Conf. Inf. Technol.: Coding Comput. (ITCC), Apr. 2005, pp. 568–573.
[6] T. Blum and C. Paar, “Montgomery modular exponentiation on reconfigurable hardware,” in Proc. 14th IEEE Symp. Comput. Arith., 1999, pp. 70–78.
[7] T. Blum and C. Paar, “High-radix Montgomery modular exponentiation on reconfigurable hardware,” IEEE Trans. Comput., vol. 50, no. 7, pp. 759–764, Jul. 2001.
[8] A. F. Tenca and C. K. Koc, “A scalable architecture for modular multiplication based on Montgomery’s algorithm,” IEEE Trans. Comput., vol. 52, no. 9, pp. 1215–1221, Sep. 2003.
[9] D. Harris, R. Krishnamurthy, S. Mathew, and S. Hsu, “An improved unified scalable radix-2 Montgomery multiplier,” in Proc. 17th IEEE Symp. Comput. Arith., 2005, pp. 172–178.
[10] E. Savas, A. F. Tenca, and C. K. Koc, “A scalable and unified multiplier architecture for finite fields GF(p) and GF($2^{m}$),” in CHES 2000, Lecture Notes in Computer Science. New York: Springer, 2000, vol. 1965, pp. 277–292.
[11] A. Satoh and K. Takano, “A scalable dual-field elliptic curve cryptographic processor,” IEEE Trans. Comput., vol. 52, no. 4, pp. 449–460, Apr. 2003.
[12] P. Kocher, J. Jaffe, and B. Jun, “Introduction to differential power analysis and related attacks,” IEEE Trans. Electron Devices, vol. 50, no. 2, pp. 462–470, Feb. 1998.
[13] P. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” CRYPTO 1999, Lecture Notes Comput. Sci., vol. 1666, pp. 388–397, Aug. 1999.
[14] J. S. Coron, “Resistance against differential power analysis for elliptic curve cryptosystems,” CHES 1999, Lecture Notes Comput. Sci., vol. 1717, pp. 292–302, Aug. 1999.
[15] M. Joye and S. M. Yen, “The Montgomery powering ladder,” CHES 2002, Lecture Notes Comput. Sci., vol. 2523, pp. 291–302, Aug. 2002.
[16] M. Joye, “Highly regular right-to-left algorithms for scalar multiplication,” CHES 2007, Lecture Notes Comput. Sci., vol. 4727, pp. 135–147, Sep. 2007.
[17] J. J. Quisquater and C. Couvreur, “Fast decipherment algorithm for RSA public-key cryptosystem,” Electron. Lett., vol. 18, no. 21, pp. 905–907, Oct. 1982.
[18] Circuits Multi-Projets (CMP), Grenoble, France, “CMOS 90 nm (CMOS090) from STMicroelectronics,” 2002. [Online]. Available: http://cmp.imag.fr/products/ic/?p=STCMOS090.
[19] A. Miyamoto, N. Homma, T. Aoki, and A. Satoh, “Systematic design of high-radix Montgomery multipliers for RSA processors,” in Proc. 26th IEEE Int. Conf. Comput. Des., Oct. 2008, pp. 416–422.
[20] C. K. Koc, T. Acar, and B. S. Kaliski, “Analyzing and comparing Montgomery multiplication algorithms,” IEEE Micro, vol. 16, no. 3, pp. 26–33, Jun. 1996.
[21] P. Kocher, “Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems,” in CRYPTO 1996, Lecture Notes Comput. Sci. New York: Springer, 1996, vol. 1109, pp. 104–113.
[22] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. Boca Raton, FL: CRC Press, 1997.
[23] C. K. Koc, “High-speed RSA implementation,” RSA Laboratories, Bedford, MA, Tech. Rep. TR201, 1994.
[24] S. M. Yen and M. Joye, “Checking before output may not be enough against fault-based cryptanalysis,” IEEE Trans. Comput., vol. 49, no. 9, pp. 967–970, Sep. 2000.
[25] P.-A. Fouque and F. Valette, “The doubling attack—why upwards is better than downwards,” CHES 2003, Lecture Notes Comput. Sci., vol. 2779, pp. 269–280, Sep. 2003.
[26] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA: A. K. Peters, 2001.
[27] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. London, U.K.: Oxford University Press, 2000.

Atsushi Miyamoto (S’06) received the B.E. degree in information engineering and the M.S. degree in information sciences from Tohoku University, Sendai, Japan, in 2005 and 2007, respectively, where he is currently pursuing the Ph.D. degree in information sciences.
Since 2009, he has been a Research Fellow with the Japan Society for the Promotion of Science. His research interests include cryptographic hardware, computer arithmetic, and algorithms for high-performance VLSI computing.

Naofumi Homma (M’99) received the B.E. degree in information engineering, and the M.S. and Ph.D. degrees in information sciences from Tohoku University, Sendai, Japan, in 1997, 1999, and 2001, respectively.
He is currently an Associate Professor with the Graduate School of Information Sciences, Tohoku University. From 1999 to 2001, he was a Research Fellow with the Japan Society for the Promotion of Science. From 2002 to 2006, he also joined the Japan Science and Technology Agency (JST) as a Researcher for the PRESTO Project. He is a member of the Cryptographic Implementation Committee in Cryptography Research and Evaluation Committees (CRYPTREC). His research interests include computer arithmetic, EDA methodology, high-performance/secure VLSI computing, and cryptographic hardware.
Dr. Homma was a recipient of the IP Award from the 2005 LSI IP Design Award, and the Best Paper Award from the Workshop on Synthesis and System Integration of Mixed Information Technologies in 2007.

Takafumi Aoki (M’90) received the B.E., M.E., and D.E. degrees in electronic engineering from Tohoku University, Sendai, Japan, in 1988, 1990, and 1992, respectively.
He is currently a Professor with the Graduate School of Information Sciences, Tohoku University. From 1997 to 1999, he also joined the PRESTO Project, Japan Science and Technology Corporation (JST). His research interests include theoretical aspects of computation, digital signal processing, computer vision, image processing, biometric authentication, and security issues in computer systems.
Dr. Aoki was a recipient of the Outstanding Paper Award at the 1990, 2000, 2001, and 2006 IEEE International Symposiums on Multiple-Valued Logic, the Outstanding Transactions Paper Award from the Institute of Electronics, Information and Communication Engineers (IEICE) of Japan in 1989 and 1997, the IEE Ambrose Fleming Premium Award in 1994, the IEICE Inose Award in 1997, the IEE Mountbatten Premium Award in 1999, the Best Paper Award at the 1999 IEEE International Symposium on Intelligent Signal Processing and Communication Systems, the IP Award at the 7th LSI IP Design Award in 2005, and the Best Paper Award at the 14th Workshop on Synthesis and System Integration of Mixed Information Technologies in 2007.

Akashi Satoh received the B.S., M.S., and Ph.D. degrees in electrical engineering from Waseda University, Waseda, Tokyo, in 1987, 1989, and 1999, respectively.
In 1989, he joined IBM Research, Tokyo Research Laboratory, where he was involved in the research and development of digital and analog VLSI circuits. In 2007, he joined the National Institute of Advanced Industrial Science and Technology, Research Center for Information Security. His current research interests include algorithms and architectures for data security and high-performance circuit implementations.