
1136 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 7, JULY 2011

Systematic Design of RSA Processors Based on High-Radix Montgomery Multipliers
Atsushi Miyamoto, Student Member, IEEE, Naofumi Homma, Member, IEEE, Takafumi Aoki, Member, IEEE,
and Akashi Satoh

Abstract—This paper presents a systematic design approach that provides optimized Rivest–Shamir–Adleman (RSA) processors based on high-radix Montgomery multipliers and satisfies various user requirements, such as circuit area, operating time, and resistance against side-channel attacks. To cover the tradeoff between performance and resistance, we apply four types of exponentiation algorithms: two variants of the binary method, each with and without the Chinese Remainder Theorem (CRT). We also introduce three multiplier-based datapath architectures using different intermediate data forms: 1) single form, 2) semi carry-save form, and 3) carry-save form, and combine them with a wide variety of arithmetic components. Their radices are parameterized from $2^{8}$ to $2^{128}$. A total of 242 datapaths for 1024-bit RSA processors were obtained for each radix. The potential of the proposed approach is demonstrated through an experimental synthesis of all possible processors with a 90-nm CMOS standard cell library. As a result, designs ranging from the smallest, at 861 gates with 118.47 ms/RSA, to the fastest, at 0.67 ms/RSA with 153 862 gates, were obtained. In addition, the use of the CRT technique reduced the RSA operation time of the fastest design to 0.24 ms. Even when we employed the exponentiation algorithm resistant to typical side-channel attacks, the fastest design performed the RSA operation in less than 1.0 ms.

Index Terms—Application-specific integrated circuit (ASIC) implementation, high-radix Montgomery multiplication, Rivest–Shamir–Adleman (RSA) cryptosystem.

I. INTRODUCTION

CRYPTOGRAPHIC modules are now mounted on many embedded systems, such as smart cards and digital consumer electronics, and are used to ensure the protection of privacy and confidential information in communication. The encryption/decryption process usually requires a large amount of arithmetic operations with very large operands. In particular, the Rivest–Shamir–Adleman (RSA) cryptosystem [1] usually performs modular exponentiation using operands longer than 1000 bits. Modular exponentiation is performed by repeating modular multiplication and squaring operations, and thus optimization of modular multiplication is essential in order to achieve high-performance RSA cryptosystem designs. The Montgomery multiplication algorithm [2], which does not require trial division, is widely used for practical hardware and software implementations because of its high speed capability.

Many computational techniques and hardware architectures have been proposed for Montgomery multiplication [3]–[11]. Among them, the radix-2 algorithms proposed in [3] and [4] are primarily implemented with long adders spanning the full operand width, scanning the operand bit by bit in a straightforward manner. Such hardware architectures have large fan-out signals and large wire delays for long operands. These drawbacks can be reduced by systolic array architectures [6], [7] with multiple operation units. However, these architectures are usually tailored for fixed-precision computations and cannot respond flexibly to changes in operand size. To deal with variable-length data, a radix-2 architecture was proposed [8]–[10] in which the operand is divided into fixed-size word blocks, and the full-width addition is performed by repeating word-size additions. These radix-2 architectures are quite simple, but have difficulty in improving circuit area and efficiency. A high-radix architecture using a 64-bit × 64-bit multiplier was proposed in [11] to achieve higher circuit efficiency. The performance of such a multiplier-based architecture depends heavily on the datapath structure, and varies with the structure of the arithmetic components, but previous papers have focused on designing their own architectures. These architectures are optimized for some design parameters, such as size and speed, while the most suitable design point in practical use varies depending on the application and the user requirements. Therefore, in order to provide the best design which satisfies these requirements, a systematic study considering the entire process of design from the datapath architecture level to the arithmetic-component level is indispensable from a practical standpoint.

On the other hand, cryptanalysis based on side-channel information is a major concern for hardware designers. When a cryptographic module performs encryption or decryption, secret parameters related to the intermediate data being processed can leak as side-channel information in the form of power dissipation, electromagnetic radiation, or operating time. Among them, two of the best known attacks are simple power analysis (SPA) and differential power analysis (DPA) proposed by Kocher [12], [13]. Several countermeasures have been proposed [14]–[16], and their implementations are required to protect secret information from these power analysis attacks. However, the performance of RSA processors with such countermeasures has not been fully evaluated in previous work.

Manuscript received March 31, 2009; revised September 04, 2009; accepted April 02, 2010. First published June 28, 2010; current version published June 24, 2011.
A. Miyamoto, N. Homma, and T. Aoki are with the Department of Computer and Mathematical Sciences, Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan (e-mail: miyamoto@aoki.ecei.tohoku.ac.jp).
A. Satoh is with the National Institute of Advanced Industrial Science and Technology, Tokyo 101-0021, Japan.
Digital Object Identifier 10.1109/TVLSI.2010.2049037
1063-8210/$26.00 © 2010 IEEE
MIYAMOTO et al.: SYSTEMATIC DESIGN OF RSA PROCESSORS BASED ON HIGH-RADIX MONTGOMERY MULTIPLIERS 1137

This paper proposes a systematic design of RSA processors combining various datapath architectures and exponentiation algorithms (i.e., sequences) for performance and resistance against side-channel attacks, respectively. This systematic approach is divided into four design stages: 1) algorithm design; 2) radix design; 3) architecture design; and 4) arithmetic-component design. We first select a modular exponentiation algorithm considering the tradeoff between the RSA computation time and tamper resistance. We then select the radix to determine the basic characteristics of the processor, such as circuit area and operation frequency (i.e., critical path). Finally, we adopt the datapath architecture and the arithmetic components to optimize the circuit performance.

For this systematic approach, we designed four types of modular exponentiation algorithms combining two variants of binary methods that have different resistance against side-channel attacks, each with and without the Chinese Remainder Theorem (CRT) [17]. In addition to the modular exponentiation algorithms, three datapath architectures employing three intermediate data forms, namely, 1) single form, 2) semi carry-save form, and 3) carry-save form, were designed for architecture-level design. These three types of architecture are combined with a wide variety of arithmetic components with a parameterized radix. To demonstrate the capabilities of the proposed approach, 242 variants of RSA processors were exhaustively generated for each radix, and synthesized by using a 90-nm CMOS standard cell library [18]. The size and speed features of the processors were then displayed graphically and analyzed so that users can easily choose the best combinations of algorithm, datapath radix, architecture, and arithmetic unit to meet the requirements of a target application.

A preliminary version of the proposed method was reported in [19]. In this previous paper, we presented the datapath architectures and synthesis results of high-radix Montgomery multipliers. In contrast, in the present work, we propose a systematic method of designing RSA processors, and evaluate the overhead of resistance to SPA-based side-channel attacks, as well as standard criteria such as area and critical path. Moreover, all synthesis results are analyzed to clarify how each design parameter (i.e., architecture, radix, and arithmetic component) affects total processor performance.

The remainder of this paper is organized as follows. Section II presents three types of datapath architectures for high-radix Montgomery multiplication. Section III describes a variety of RSA algorithms including those with typical countermeasures against power analysis attacks, and then shows the corresponding RSA processor architectures. Section IV evaluates the systematic design of RSA processors with the 90-nm CMOS technology. Section V concludes this paper.

II. HIGH-RADIX MONTGOMERY MULTIPLIER

A. Montgomery Multiplication Algorithm

Given two large integers, the Montgomery multiplication algorithm performs the following operation:

(1)

where and the modulus is an integer in the range such that . For cryptographic applications, the modulus is usually a prime number or a product of primes, and thus satisfies the condition easily. In addition, the -bit integers , , , and satisfy the following condition:

(2)

Algorithm 1 shows the original Montgomery multiplication algorithm [2], which replaces a modular division with a right shift operation. Equation (1) can be calculated by one multiplication and a right shift operation if the lowest bits of the product are equal to 0. For this purpose, a multiple of the modulus is added to the product in this algorithm. Its coefficient is generated in advance using a precomputed number. The final result is not changed by the addition because (1) is in modulo arithmetic.

ALGORITHM 1: MONTGOMERY MULTIPLICATION [2]

Public-key cryptosystems, such as the RSA scheme, use an extremely long key length, typically more than 1000 bits. In the high-radix Montgomery algorithm [11], each operand is divided into blocks in order to use a normal word-size multiplier. The operand can be represented by words as follows:

(3)

For simplification, we use the following notation:

(4)

ALGORITHM 2: HIGH-RADIX MONTGOMERY MULTIPLICATION (MONTMULT) [11]

Algorithm 2 shows the high-radix Montgomery multiplication algorithm [11], where the uppercase and lowercase letters indicate the full-length operands and the words, respectively. Each operand is divided into smaller words, which are processed in
nested loops (Loop1 and Loop2). The upper and lower bits of the temporary variable are stored separately as a word and an intermediate carry. Finally, the stored value is the output of this algorithm.

ALGORITHM 3: MONTMULT USING SINGLE FORM (TYPE-I)

Fig. 1. Output arrival profile of a parallel multiplier.

ALGORITHM 4: MONTMULT USING SEMI CARRY-SAVE FORM (TYPE-II)

In Algorithm 2, the most critical operation which affects the circuit delay and the area is the multiply-accumulation at Line 6, which consists of two multiplications and three additions. In order to improve operation efficiency, the multiply-accumulation is divided into two three-term operations [20], and each operation is usually performed by a multiplier in the datapath. As a result, the design of such a multiplier is of major importance for the hardware implementation of the high-radix Montgomery multiplier.
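To make the word-serial structure described above concrete, the following is a minimal Python sketch of a high-radix Montgomery multiplication with an outer loop over the words of one operand and an inner loop over the words of the other operand and the modulus. It is an illustration only, not the authors' Algorithm 2, whose listing is not reproduced in this text. The names `montmult_words`, `r` (word length in bits), `m` (number of words), and `w` (the precomputed correction constant) are assumptions introduced here, and the conditional final subtraction is a simplification.

```python
def montmult_words(x, y, n, r, m):
    """Return x * y * 2^(-r*m) mod n, processing r-bit words (sketch).

    Assumes n is odd, n < 2^(r*m), and 0 <= x, y < n.
    """
    mask = (1 << r) - 1
    # Precomputed constant -n^(-1) mod 2^r (needs Python 3.8+ for pow(.., -1, ..)).
    w = (-pow(n, -1, 1 << r)) & mask
    xs = [(x >> (r * j)) & mask for j in range(m)]   # words of x
    ns = [(n >> (r * j)) & mask for j in range(m)]   # words of n
    z = [0] * (m + 1)                                # z[m] holds the top carry

    for i in range(m):                               # outer loop: words of y
        y_i = (y >> (r * i)) & mask
        # Choose t so that the lowest word of z + x*y_i + t*n becomes zero.
        t = ((z[0] + xs[0] * y_i) * w) & mask
        c = (z[0] + xs[0] * y_i + t * ns[0]) >> r    # low word is dropped
        for j in range(1, m):                        # inner loop: words of x, n
            s = z[j] + xs[j] * y_i + t * ns[j] + c   # multiply-accumulate
            z[j - 1] = s & mask                      # lower bits: result word
            c = s >> r                               # upper bits: carry
        s = z[m] + c
        z[m - 1] = s & mask
        z[m] = s >> r

    result = sum(word << (r * j) for j, word in enumerate(z))
    # Simplification: a plain conditional subtraction. A timing-attack-aware
    # design would always perform the subtraction and then select a value.
    if result >= n:
        result -= n
    return result
```

With `r = 8` and `m = 2`, for example, `montmult_words(x, y, 65521, 8, 2)` should agree with `(x * y * pow(1 << 16, -1, 65521)) % 65521` for any `x, y < 65521`. The multiply-accumulate in the inner loop contains two multiplications and three additions, which is the operation the text identifies as most critical.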
B. Designed Montgomery Multiplication Algorithms

Based on Algorithm 2, we designed three types of high-radix Montgomery multiplication algorithms with different intermediate forms: 1) single form (Type-I); 2) semi carry-save form (Type-II); 3) carry-save form (Type-III). Note here that the Type-I algorithm is equivalent to the finely integrated operand scanning (FIOS) method in [20]. This straightforward algorithm is fundamental for the proposed approach because of its simplicity. On the other hand, the Type-II and Type-III algorithms are newly introduced as variations of the Type-I algorithm.

In order to explain these algorithms, we first discuss the delay profile of the output signals from a parallel multiplier. The conventional multiplier [26] mainly consists of three components: a partial product generator (PPG), a partial product accumulator (PPA), and a carry propagation adder (CPA). The PPG stage first generates partial products from the multiplicand and multiplier in parallel. The PPA stage then performs a multi-operand addition without any carry propagation for all of the generated partial products, and produces two outputs represented in carry-save form. Finally, the carry-save form is converted to the corresponding binary output at the CPA stage. Fig. 1 shows the delay profile of a 32-bit × 32-bit parallel multiplier, where the horizontal axis denotes the bit position from LSB to MSB and the vertical axis shows the output signal delay time. The triangles and the squares indicate the signal delay times for output bits of the PPA and CPA, respectively. The long carry-propagation chain of the CPA causes longer delays for higher bits. In contrast, the position of the slowest signal for the PPA is around the middle (32nd) bit, where the maximum number of operands exists. As shown in Fig. 1, the computation time of CPA operations has a significant impact on the critical path. The fundamental concept of our datapath designs is to control the time of CPA operations using carry-save techniques in exchange for the total area.

Algorithms 3–5 (Type-I–III) show the three designed types of high-radix Montgomery multiplication algorithms based on the delay profile. The use of carry-save forms can reduce the operand length of CPA operations, that is, the computation time of the carry-propagation chain. The operand lengths of CPAs in Type-I and Type-II are and , respectively. Type-III does not perform a CPA operation for each computation step. The carry-save signals obtained from the PPA are fed back to the next step without any processing. In Algorithms 3–5, is the correction coefficient for Montgomery multiplication, is the result of the preprocessing , is the carry at the outer loop, and the tuple indicates two outputs: the intermediate carry with a weight of and the intermediate sum . The integer value of is given by . In Type-II and Type-III, such intermediate variables can be given by two or three carry-save signals. For example, the intermediate carry in Type-II is given by two carry-save values and a 1-bit value. Note that Algorithms 4 and 5 represent such intermediate values in summation notation for convenience.

Algorithm 3 (Type-I) is based on the straightforward algorithm with a three-term multiply-addition operation. The arithmetic operation in Algorithm
2 is divided into two steps: a multiplication step and a reduction step. In order to avoid increasing the number of variables, this algorithm does not use any carry-save form for the intermediate data, and thus is almost the same as FIOS. The upper and lower bits of the multiply-addition output are used separately as the intermediate output and the intermediate carry.

Algorithm 4 (Type-II) uses the carry-save form only for the intermediate carry, which is represented by two carry-save signals and a 1-bit carry signal. The summation notation indicates this intermediate carry. At Lines 7 and 8, the lower output bits and the 1-bit carry are given by a CPA operation, while the two carry-save carries are given by a PPA operation. The other computational steps can be done in a manner similar to Algorithm 3.

ALGORITHM 5: MONTMULT USING CARRY-SAVE FORM (TYPE-III)

Algorithm 5 (Type-III) uses the carry-save form for both the intermediate sum and the intermediate carry, each represented by a pair of carry-save signals. No CPA operation is performed during each multiply-addition step. An extra CPA operation is inserted at the end of the inner loops, at Lines 7 and 11, to obtain the output. Algorithm 5 requires more steps than the other two algorithms because of these extra addition operations.
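The carry-save idea behind the Type-II and Type-III forms can be illustrated with a small Python sketch. It models a (3,2) counter applied bitwise to whole integers, so that a running total is kept as a (sum, carry) pair and carry propagation is deferred to a single final addition. The function names `csa`, `accumulate_carry_save`, and `resolve` are hypothetical, and the sketch is not a model of the authors' datapaths.

```python
def csa(a, b, c):
    """(3,2) counter on integers: returns (s, cy) with s + cy == a + b + c.

    Each bit position is handled independently, so no carry ripples
    across the word.
    """
    s = a ^ b ^ c                                # bitwise sum without carries
    cy = ((a & b) | (a & c) | (b & c)) << 1      # carries, weighted one bit up
    return s, cy


def accumulate_carry_save(terms):
    """Add many terms while keeping the total in carry-save form."""
    s, cy = 0, 0
    for t in terms:
        s, cy = csa(s, cy, t)                    # no carry propagation here
    return s, cy


def resolve(s, cy):
    """Final carry-propagate addition (the role played by a CPA)."""
    return s + cy
```

For instance, `csa(5, 3, 6)` yields the pair `(0, 14)`, whose sum is 14. Keeping intermediate values as such pairs removes the carry-propagation chain from each step, at the cost of storing two signals instead of one and of a final conversion, which mirrors the area and extra-step tradeoff noted for Algorithm 5.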

C. Designed Datapath Architectures

Fig. 2 shows the designed datapath architectures corresponding to Algorithms 3–5 (i.e., Type-I–III). All of these are bus architectures whose width matches the word size of the multiplier. Each datapath has a multiply-accumulator called the Arithmetic Core. The major components of the Arithmetic Core are a partial product generator (PPG), a partial product accumulator (PPA), and a carry propagation adder (CPA). Unlike conventional parallel multipliers, the PPA processes multiple extra inputs along with the partial products of the PPG.

Fig. 2. Datapath architectures of Montgomery multiplication block. (a) Type-I. (b) Type-II. (c) Type-III.

The Type-I architecture has a simple datapath with the Arithmetic Core, which receives three data inputs and a carry input, and outputs a single result. There are five registers (Ca, Cb, X, Y, and Z) in the datapath. Ca and Cb store intermediate carries. X and Y each store one of several operands. Z stores an intermediate sum. The Arithmetic Core performs a multiply-addition using four operands stored in the registers at each step in Algorithm 3. At Line 4, Ca, Cb, and Z are set to zero, and X and Y are loaded with the corresponding operands. The lower half of the output gives the result of this step. At Lines 3, 5, 7, and 8, the upper part of the multiply-addition output is fed back to the core as the carry for the next cycle, and the lower part is stored into a register or in
memory outside the core. At the end of the outer loop (Line 10), the carry signals are added to the result using the internal CPA, and its lower bits and upper 2 bits are stored for the next iteration as an intermediate sum and a carry, respectively. This architecture requires the fewest registers and intermediate wires among the three architectures. However, the Arithmetic Core has the longest critical path because of the full-length CPA operation. The number of cycles for Algorithm 3 on the Type-I architecture is given as

(5)

In order to prevent timing attacks [21] against Montgomery multiplication, the final subtraction at Line 12 is always performed regardless of the conditional branching, and the values before and after the subtraction are stored in different memory spaces. The subtraction is performed by the Arithmetic Core. After the subtraction, one of the two values is selected as the final result.

The Type-II architecture enhances the hardware efficiency, which is defined as the product of the operation speed (i.e., critical path) and the area. The Arithmetic Core has a CPA following the PPA stage. The PPA produces four outputs: two carry signals and two sum signals. The carry signals are fed back to the core. On the other hand, the sum signals are fed to the following CPA and are converted to a binary output and a 1-bit carry. As a result, the Type-II architecture has the registers CS1, CS2, and EC to store these intermediate carries. The output at Line 4 is calculated in the same manner as with the Type-I architecture. For each multiply-addition step, the output is stored into a register or in an external memory. At the end of the outer loop, the stored carry signals are added to the result for the next iteration. The lower bits and upper 2 bits are stored as an intermediate sum and a carry, respectively. The CPA is approximately half the size of the CPA in the Type-I architecture, and thus the entire critical path is shortened by 25%, while the number of registers is increased. The number of cycles for Algorithm 4 on the Type-II architecture is given as

(6)

Compared with the Type-I architecture, the Type-II architecture requires more cycles for the calculation in Line 10.

The Type-III architecture has the fastest datapath since there is no carry propagation in the Arithmetic Core. Both the carry and the sum signals from the PPA are fed back into the core in carry-save form. The registers CS1, CS2, ZS1, and ZS2, and the 1-bit register EC are inserted to store the carry-save signals in this architecture. The CPA operation is performed outside the core at the end of every iteration cycle, and generates a binary output and a 1-bit carry. The Type-III architecture calculates an output at Lines 4 and 5 in two steps. The PPA first calculates two outputs in carry-save form, and then the CPA generates the output outside the core. At the end of the outer loop, the stored carry signals are added to the result, and its lower bits and upper 2 bits are stored as an intermediate sum and the carry, respectively. The critical path is approximately half that of the Type-I architecture, while the largest number of registers is required to hold two pairs of carry-save signals. The number of cycles for Algorithm 5 on the Type-III architecture is given as

(7)

Compared with the Type-I architecture, the Type-III architecture requires extra cycles for the additions in Lines 5, 7, 11, and 13.

III. RSA PROCESSOR

A. RSA Cryptosystem

The RSA cryptosystem employs modular exponentiation for encryption and decryption as follows:

(8)

Basically, there are two types of efficient exponentiation algorithms, which are known as binary methods and window methods [22], [23]. The binary method performs multiplication and squaring operations sequentially according to the bit pattern of the exponent and is mainly categorized into left-to-right and right-to-left binary methods. The former starts operation at the exponent’s MSB and works downward, while the latter starts at the exponent’s LSB and works upward. The left-to-right binary method is frequently used for hardware implementations in smartcards and embedded devices because it requires fewer hardware resources than the right-to-left binary method. In contrast, the window method (such as the sliding window method) processes more than one bit of the exponent in each iteration cycle and reduces the number of multiplication operations using precomputed values. It requires fewer clock cycles, but more memory resources compared with the binary methods and thus is often used for software implementation on processors with large memory resources. Therefore, we adopt the left-to-right binary method for the sequencer in our RSA processor.

We also adopt a variation of the binary methods with tamper-resistant features, namely, the square-multiply exponentiation method [16]. The side-channel attacks considered here are not DPAs but rather SPAs against modular exponentiation methods. Such SPA-type attacks can be performed more easily than DPA-type attacks in practice due to their simplicity. The square-multiply exponentiation method is regarded as a variation of the square-and-multiply-always method [14] or the Montgomery powering ladder [15]. The square-and-multiply-always method is known as a typical countermeasure against SPAs [13], which performs dummy multiplication operations for the binary method. The dummy multiplication is inserted even for the zero bits of the exponent so as to perform squaring and multiplication for each bit. This countermeasure is vulnerable to safe-error attacks [24], which induce a precisely timed fault during the multiplication process. The Montgomery powering ladder [15] can prevent both SPA-type attacks and safe-error attacks. The algorithm always performs a pair of multiplication and squaring operations using the two variables meaningfully and does not involve dummy multiplications. This feature
allows us to prevent safe-error attacks. However, an advanced chosen-message power analysis attack, known as the Doubling attack [25], can defeat the Montgomery powering ladder since it starts at the exponent’s MSB and works downward. The square-multiply exponentiation method, on the other hand, starts at the exponent’s LSB and works upward. This algorithm can prevent SPA-type attacks, safe-error attacks, and Doubling attacks.

In addition to the above two methods, we use the CRT [17] operation to increase the speed of the exponentiation. The use of CRT reduces the clock cycles by almost 3/4, and requires extra hardware resources to perform pre-processing and post-processing.

B. Processor Architecture

Our RSA processor performs one of four modular exponentiation algorithms, namely, the left-to-right binary method and the square-multiply exponentiation method with and without CRT operation. All of these algorithms are based on Montgomery multiplication, and therefore Montgomery reduction and pre-computation are executed on the input data. However, these operations can be performed in one step to reduce the overhead time for the consecutive Montgomery operations. Algorithm 6 shows the modular exponentiation algorithm combining the left-to-right binary method with the Montgomery multiplication (MontMult). Lines 1 and 2 indicate the pre-processing and the Montgomery reduction, respectively. The inverse operation at Line 1 can be simplified. As the modular division is equal to a shift operation, the inverse value can be calculated by repeating multiplication and shift operations in a manner similar to modular multiplication. On the other hand, the operation at Line 2 is calculated by repeating addition/subtraction operations. Both pre-processing operations can be calculated by combining multiplication, addition, and shift operations in the proposed datapath.

ALGORITHM 6: LEFT-TO-RIGHT BINARY METHOD WITH MONTMULT

Fig. 3 shows a block diagram of our RSA processor, which consists of six components: Multiplication Block, Sequencer Block, Memories, Data Counter, Memory Address Generator, and Key Shift. The exponent is set into the shift register Key Shift. The data and the modulus are divided into blocks and sequentially stored into Memory0. The Multiplication Block, implemented with one of the three multipliers in Fig. 2, performs the multiply-addition operations repeatedly according to the exponent bits. The read and write addresses for the data are generated by the Memory Address Generator. The iteration count and/or the data size are counted by the Data Counter. The secret key is fed to the Sequencer Block bit by bit from the MSB side (or the LSB side) through the Key Shift.

Fig. 3. RSA processor architecture.

The Sequencer Block performs state transition control and generates the control signals for the datapath core. The Sequencer Block has a three-level hierarchical structure. Level-1 supports low-level functions, such as basic modular arithmetic, Montgomery multiplication, as well as pre-processing and post-processing for Montgomery multiplication. Level-2 executes one of the modular exponentiation algorithms. The highest level, Level-3, supports RSA operations including CRT. Input-output processing is also supported in this hierarchy. Our architecture has a clearly separated control structure, and thus it is easy to design and modify the logic due to its high flexibility for functional extensions.

IV. PERFORMANCE EVALUATION

The proposed approach selects the modular exponentiation algorithm depending on the user requirements, and then obtains a wide variety of RSA processors combining three datapath architectures with arithmetic algorithms over a range of radices. These RSA processors are exhaustively generated and synthesized using the ASIC library. Their size and speed features are evaluated so that the user can easily choose the best combination of algorithm, datapath architecture, arithmetic unit, and radix.

A. Operation Cycles

The operation cycles for the Montgomery multiplication are calculated from (5)–(7) depending on the architecture type. For one example parameter setting, the Montgomery multiplication and the pre-processing of the Type-I architecture (Lines 1 and 2 in Algorithm 6) take 2177 and 36 262 cycles, respectively. Those of the Type-II and Type-III architectures are a few percent higher than that of the Type-I architecture because of the extra addition operations. Note that the pre-processing is executed only once during the entire sequence of modular exponentiation. In contrast, the post-processing at
Line 10 in Algorithm 6 requires the same number of clock cycles as one Montgomery multiplication. The computational costs of Montgomery multiplication and of pre-/post-processing are and , respectively.

Table I shows the operating cycles of the four RSA operations combining two algorithms, 1) the left-to-right binary method (RSA1) and 2) the square-multiply exponentiation method (RSA2), with and without CRT operation for the Type-I datapath architecture. The number of cycles for the binary methods varies with the key bit pattern, that is, the ratio between “0” and “1” bits, and thus Table I shows the average cycles. On the other hand, the operating cycles of the square-multiply exponentiation method do not vary with the key bit pattern. The countermeasure increases the number of operating cycles by about 25% in comparison with the binary method without countermeasure. The number of cycles for the exponentiation methods with CRT (i.e., CRT-RSA1 and CRT-RSA2 in Table I) is almost one-quarter of that in the case without CRT (i.e., RSA1 and RSA2 in Table I).

TABLE I
OPERATION CYCLES OF RSA OPERATIONS (TYPE-I ARCHITECTURE)

Table I indicates that the operating cycle counts for RSA operations and Montgomery multiplication can be reduced by increasing the radix (i.e., the multiplier size). Note, however, that a larger radix increases the critical path, and thus decreases the operating frequency.

B. Performance in ASIC

The 1024-bit RSA processors were evaluated using the STMicroelectronics 90-nm CMOS standard cell library (1.2-V version) [18]. The designs were synthesized by Synopsys Design Compiler (Version A-2007.12-SP3). In order to evaluate the proposed approach, we discuss the influences of design parameters (i.e., datapath architecture, arithmetic component, and radix) as well as the total circuit area.

Table II shows the gate counts of the major components in the processor core for several radices, where one gate unit indicates a two-way NAND. The Multiplication Block is evaluated on the Type-I architecture, and the multiply-accumulator (i.e., Arithmetic Core) is retrieved from the Synopsys DesignWare library. The gate count grows with the radix. On the other hand, the Sequencer Block is evaluated in the case of the left-to-right binary method without CRT (RSA1). If the operation mode is changed, the gate count of the sequencer unit increases. As shown in Table III, the square-multiply exponentiation method with CRT (CRT-RSA2) requires 1455 gates at the radix considered there. This gate count is about 2.5 times larger than in RSA1. In Table II, the memory size is 1024 bits for each value of the radix, although the gate count changes with the radix (i.e., the data width) due to the presence of extra hardware components, such as the selector. As a result, the minimum gate count among the evaluated radices is 49 786 gates (including 41 438 gates of memories), while the maximum is 127 782 gates (including 39 827 gates of memories).

TABLE II
CIRCUIT SIZE OF 1024-BIT RSA1 TYPE-I PROCESSOR

TABLE III
CIRCUIT SIZE OF 1024-BIT CRT-RSA2 TYPE-I PROCESSOR

The proposed systematic approach provides a wide variety of Multiplication Blocks combining three datapath architectures (Type-I–III) with arithmetic algorithms over the evaluated range of radices. Table IV shows a set of arithmetic algorithms [26], [27] considered in our experimental system. For efficient PPA algorithms, we use Dadda, (4;2) compressor, and (7,3) counter trees in addition to conventional algorithms based on (3,2) counters. In addition, we use four parallel-prefix adders (Kogge–Stone, Brent–Kung, Han–Carlson, and Ladner–Fischer adders) for high-speed designs, conditional sum and carry select adders for balanced designs, and two carry-skip adders for compact designs, in addition to conventional CPA algorithms such as the ripple carry and the carry look-ahead adders. In total, we synthesized and evaluated 242 datapath cores for each radix. These cores are distributed over the Type-I, Type-II, and Type-III architectures. Type-I has an extra algorithm obtained from the Synopsys DesignWare IP core. On the other hand, Type-II has only one variation when Array is selected for the PPA algorithm since the CPA is not necessary.

Fig. 4 shows the characteristics of the 242 designs at one radix and highlights the Pareto points for each type, where the horizontal and vertical axes represent the delay time and the gate count, respectively. The delay time was calculated under the worst-case conditions, and the gate count (datapath area) includes only the datapath, that is, the Multiplication Block, and does not include the memories and the sequencer modules. Dots

Fig. 4. Synthesis results of 242 designs for each type: (a) Type-I; (b) Type-II; (c) Type-III.

TABLE IV
PPA AND CPA ALGORITHMS

plotted lower and further to the left indicate higher performance.
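The reading of Fig. 4 (lower and further to the left is better) corresponds to a two-objective Pareto filter over delay time and gate count. The following is a minimal Python sketch of such a filter, assuming each design is summarized as a (label, delay in ns, gate count) tuple. The function name `pareto_front` and the sample values are hypothetical illustrations and are not taken from the paper's synthesis data.

```python
from typing import List, Tuple

# (label, delay_ns, gate_count); lower is better for both metrics.
Design = Tuple[str, float, float]


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the designs that no other design beats in both delay and area."""
    # Sort by delay first, then by area, so that each candidate only needs
    # to be compared against the smallest area seen so far.
    ordered = sorted(designs, key=lambda d: (d[1], d[2]))
    front: List[Design] = []
    best_area = float("inf")
    for design in ordered:
        # A design stays only if it is strictly smaller than every
        # design that is at least as fast.
        if design[2] < best_area:
            front.append(design)
            best_area = design[2]
    return front


# Hypothetical sample points, for illustration only.
sample = [
    ("A", 2.0, 30000.0),
    ("B", 3.0, 16000.0),
    ("C", 5.0, 9000.0),
    ("D", 4.0, 20000.0),  # slower and larger than B, so dominated
    ("E", 6.0, 9500.0),   # slower and larger than C, so dominated
]
front_labels = [d[0] for d in pareto_front(sample)]
print(front_labels)
```

Applying the same filter to the union of the three architecture types, rather than to each type separately, would give a combined frontier across all types.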


The results for Type-I show a wider distribution of size and delay as compared with the other two types. If we select compact PPA and CPA algorithms, Type-I will have the smallest size since it has the minimum number of registers. In contrast, the results for Type-II are narrowly distributed within the gate count of 16 kgates and the critical path of 5 ns. Finally, the results for Type-III can be distributed within the critical path of 2 ns since it reduces the delay time using carry-save form. As shown in these results, the three architecture types showed significant advantages in terms of area, balance, and delay, respectively. Considering all the types, we finally obtained a smooth Pareto frontier as shown in Fig. 5.

Fig. 5. Pareto frontier obtained from all types.

The results also showed that the performances of the obtained processors are heavily dependent on the arithmetic components as well as the architectures. Under the experimental condition, the combination of Dadda tree (for PPA) and parallel-prefix adder (for CPA) achieved higher-speed implementations, and the combination of (7,3) counter tree (for PPA) and carry-skip/ripple-carry adder (for CPA) achieved smaller implementations.

We then discuss the impact of radix value on circuit performance. Fig. 6 shows the semi-log plot of the performance variations in various radices from $2^{8}$ to $2^{128}$, where Type-I Area, Type-II Balance, and Type-III Speed indicate the smallest design, the most balanced design, and the highest-speed design shown in Fig. 5, respectively. The vertical axes of (a)–(d) denote the datapath area (i.e., the Multiplication Block), the delay time, the computation time of the RSA1 operation, and the hardware efficiency, respectively. Fig. 6(a) and (b) indicate that the circuit area and the delay time increase exponentially as the radix increases. On the other hand, the increasing tendency of circuit area is different from that of delay time; the datapath area in (a) is heavily affected by radix, while the delay time in (b) is affected by architecture and arithmetic component as well as radix. In (c), the computation time of one RSA operation decreases as the radix increases. The main reason for this is that the number of cycles decreases greatly even if the critical path length increases. The results of (a) and (c) suggest the tradeoffs between circuit area and computation time depending on the radix. In (d), the circuit efficiencies are higher from radix to . When we use Type-II Balance and Type-III Speed, efficiencies in the radices of and are lower than those of other radices due to the huge computation time and circuit area.

Thus, the radix has a critical impact on all the performance criteria (circuit area, critical path, RSA computation time, and circuit efficiency). The architecture and the arithmetic component can also improve each criterion by several tens of percent. The above observation suggests that we first select the radix to determine the basic characteristics, and then adopt the datapath

Fig. 6. Performance variations in radices from $2^{8}$ to $2^{128}$. (a) Datapath area. (b) Delay time. (c) RSA1 operation time. (d) Efficiency.

TABLE V
PERFORMANCE OF 1024-BIT RSA PROCESSORS

architecture and the arithmetic components to tailor the circuit performance.

Table V summarizes the above results, where the columns entitled Mont. mult. time, RSA1 time, RSA2 time, and CRT-RSA1 time indicate the computation times of Montgomery multiplication, RSA1, RSA2 (i.e., square-multiply exponentiation method), and CRT-RSA1 (i.e., RSA1 with the CRT technique) operations, respectively. The three rows in bold font indicate the best performance among all of the developed designs in terms of operating speed (Speed), hardware efficiency (Balance), and circuit area (Area), respectively. In addition, Table VI shows a performance comparison between our designs and conventional

TABLE VI
PERFORMANCE COMPARISON WITH CONVENTIONAL DESIGNS

designs [6]–[11]. In comparison, our design approach provides a considerably wider variety, ranging from the smallest area of 861 gates with the Type-I radix- processor to the shortest RSA operating time of 0.67 ms at 421.94 MHz with the Type-III radix- processor. The highest hardware efficiency of 83.12 s gates was achieved with the Type-II radix- processor. Thus, the top performance obtained using the proposed system is higher than that of conventional designs.

In addition to the various datapath architectures, the proposed system provides four exponentiation algorithms that take into account the tradeoff between total computational cost and resistance against SPA-based side-channel attacks. As shown in Table V, the Type-III CRT-RSA1 processor achieved the fastest operating time of 0.24 ms. Even when the countermeasure was applied, the RSA processors achieved an operating time of less than 1.0 ms (the fastest time was 0.89 ms with the Type-III radix RSA2 processor). These processors can be easily obtained without any changes to the datapath architectures. Although the size of the sequencer modules is slightly increased, this increase is only a few percent of the total size of the RSA processors as shown in Tables II and III.

As described above, our systematic approach can serve a wide variety of performance data sets obtained from the exhaustive synthesis, which can be used as a reference with a 90-nm CMOS standard cell library to design future RSA processors. From the reference, we can also estimate tamper-resistant RSA processors based on different exponentiation algorithms. By selecting adequate design parameters such as architecture and radix at each design level, our method can provide the best RSA processor design to meet the requirements of a target application. For example, the radix value might be predetermined by the application or the system. Even in that case, our method would acquire the best possible performance by the combination of other design parameters.

In this experiment, a cell-based design with an ASIC library was investigated, but our design approach can be applied to any process technologies, libraries, and synthesis parameters. In addition, each design stage of the proposed approach can be optimized independently, and thus the proposed approach also allows easy addition of new architectures or arithmetic components depending on the synthesis conditions.

V. CONCLUSION

This paper proposed a systematic approach to designing high-performance RSA processors. A number of RSA hardware architectures optimized for several design parameters have been proposed, but it is not feasible to design all architectures independently to find the best design which meets the performance requirements for practical use. In contrast, the proposed approach provides the optimized datapaths satisfying the requirements by combining three datapath architectures using different intermediate data forms [(Type-I) single form, (Type-II) semi carry-save form, and (Type-III) carry-save form], a wide variety of arithmetic components, as well as radices of $2^{8}$ to $2^{128}$. 1024-bit RSA processors ranging from 861 gates@118.47 ms/RSA to 153 862 gates@0.67 ms/RSA in a 90-nm CMOS standard cell library were obtained by exhaustive synthesis for all possible combinations. Other than these two designs, a user can freely select the best design to fit their application from among these combinations and can also choose other process technologies.

In addition to the approach from the datapath-architecture level to the arithmetic-component level, the tradeoff between the performance (operating speed) and the resistance against side-channel attacks can be optimized at the algorithm level using the square-multiply exponentiation method and the CRT technique. The fastest processor with the left-to-right binary method and CRT achieved an operating speed of 0.24 ms/RSA, while the square-multiply exponentiation method provided the highest level of resistance against side-channel attacks such as SPA and safe error attacks, and performed the RSA operation within 1.0 ms. These processors with CRT and/or countermeasures can be implemented at low area cost without any changes to the datapath architectures.

In future studies, some of the best RSA processors will be implemented in ASICs and evaluated in terms of tamper resistance as well as circuit performance. Also, further research to support other public-key cryptographic algorithms, such as elliptic curve cryptography, will be conducted.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.
[2] P. L. Montgomery, “Modular multiplication without trial division,” Math. Comput., vol. 44, no. 170, pp. 519–521, Apr. 1985.
[3] A. Daly and W. Marnane, “Efficient architectures for implementing Montgomery modular multiplication and RSA modular exponentiation on reconfigurable logic,” in Proc. ACM/SIGDA 10th Int. Symp. on Field-Program. Gate Arrays, Nov. 2002, pp. 40–49.

[4] C. McIvor, M. McLoone, J. McCanny, A. Daly, and W. Marnane, “Fast Montgomery modular multiplication and RSA cryptographic processor architectures,” in Proc. 37th Annu. Asilomar Conf. Signals, Syst. Comput., Nov. 2003, pp. 379–384.
[5] F. Crowe, A. Daly, and W. Marnane, “A scalable dual mode arithmetic unit for public key cryptosystems,” in Proc. IEEE Int. Conf. Inf. Technol.: Coding Comput. (ITCC), Apr. 2005, pp. 568–573.
[6] T. Blum and C. Paar, “Montgomery modular exponentiation on reconfigurable hardware,” in Proc. 14th IEEE Symp. Comput. Arith., 1999, pp. 70–78.
[7] T. Blum and C. Paar, “High-radix Montgomery modular exponentiation on reconfigurable hardware,” IEEE Trans. Comput., vol. 50, no. 7, pp. 759–764, Jul. 2001.
[8] A. F. Tenca and C. K. Koc, “A scalable architecture for modular multiplication based on Montgomery’s algorithm,” IEEE Trans. Comput., vol. 52, no. 9, pp. 1215–1221, Sep. 2003.
[9] D. Harris, R. Krishnamurthy, S. Mathew, and S. Hsu, “An improved unified scalable radix-2 Montgomery multiplier,” in Proc. 17th IEEE Symp. Comput. Arith., 2005, pp. 172–178.
[10] E. Savas, A. F. Tenca, and C. K. Koc, “A scalable and unified multiplier architecture for finite fields $GF(p)$ and $GF(2^{m})$,” in CHES 2000, Lecture Notes in Computer Science. New York: Springer, 2000, vol. 1965, pp. 277–292.
[11] A. Satoh and K. Takano, “A scalable dual-field elliptic curve cryptographic processor,” IEEE Trans. Comput., vol. 52, no. 4, pp. 449–460, Apr. 2003.
[12] P. Kocher, J. Jaffe, and B. Jun, “Introduction to differential power analysis and related attacks,” IEEE Trans. Electron Devices, vol. 50, no. 2, pp. 462–470, Feb. 1998.
[13] P. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” CRYPTO 1999, Lecture Notes Comput. Sci., vol. 1666, pp. 388–397, Aug. 1999.
[14] J. S. Coron, “Resistance against differential power analysis for elliptic curve cryptosystems,” CHES 1999, Lecture Notes Comput. Sci., vol. 1717, pp. 192–302, Aug. 1999.
[15] M. Joye and S. M. Yen, “The Montgomery powering ladder,” CHES 2002, Lecture Notes Comput. Sci., vol. 2523, pp. 291–302, Aug. 2002.
[16] M. Joye, “Highly regular right-to-left algorithms for scalar multiplication,” CHES 2007, Lecture Notes Comput. Sci., vol. 4727, pp. 135–147, Sep. 2007.
[17] J. J. Quisquater and C. Couvreur, “Fast decipherment algorithm for RSA public-key cryptosystem,” Electron. Lett., vol. 18, no. 21, pp. 905–907, Oct. 1982.
[18] Circuits Multi-Projets (CMP), Grenoble, France, “CMOS 90 nm (CMOS090) from STMicroelectronics,” 2002. [Online]. Available: http://cmp.imag.fr/products/ic/?p=STCMOS090.
[19] A. Miyamoto, N. Homma, T. Aoki, and A. Satoh, “Systematic design of high-radix Montgomery multipliers for RSA processors,” in Proc. 26th IEEE Int. Conf. Comput. Des., Oct. 2008, pp. 416–422.
[20] C. K. Koc, T. Acar, and B. S. Kaliski, “Analyzing and comparing Montgomery multiplication algorithms,” IEEE Micro, vol. 16, no. 3, pp. 26–33, Jun. 1996.
[21] P. Kocher, “Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems,” in CRYPTO 1996, Lecture Notes Comput. Sci. New York: Springer, 1996, vol. 1109, pp. 104–113.
[22] J. A. Menezes, C. P. Oorschot, and A. S. Vanstone, Handbook of Applied Cryptography. Boca Raton, FL: CRC Press, 1997.
[23] C. K. Koc, “High-speed RSA implementation,” RSA Laboratories, Bedford, MA, Tech. Rep. TR201, 1994.
[24] S. M. Yen and M. Joye, “Checking before output may not be enough against fault-based cryptanalysis,” IEEE Trans. Comput., vol. 49, no. 9, pp. 967–970, Sep. 2000.
[25] A. P. Fouque and F. Valette, “The doubling attack – why upwards is better than downwards,” CHES 2003, Lecture Notes Comput. Sci., vol. 2779, pp. 269–280, Sep. 2003.
[26] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA: A. K. Peters, 2001.
[27] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. London, U.K.: Oxford University Press, 2000.

Atsushi Miyamoto (S’06) received the B.E. degree in information engineering and the M.S. degree in information sciences from Tohoku University, Sendai, Japan, in 2005 and 2007, respectively, where he is currently pursuing the Ph.D. degree in information sciences.
Since 2009, he has been a Research Fellow with the Japan Society for the Promotion of Science. His research interests include cryptographic hardware, computer arithmetic, and algorithms for high-performance VLSI computing.

Naofumi Homma (M’99) received the B.E. degree in information engineering, and the M.S. and Ph.D. degrees in information sciences from Tohoku University, Sendai, Japan, in 1997, 1999, and 2001, respectively.
He is currently an Associate Professor with the Graduate School of Information Sciences, Tohoku University. From 1999 to 2001, he was a Research Fellow with the Japan Society for the Promotion of Science. From 2002 to 2006, he also joined the Japan Science and Technology Agency (JST) as a Researcher for the PRESTO Project. He is a member of the Cryptographic Implementation Committee in the Cryptography Research and Evaluation Committees (CRYPTREC). His research interests include computer arithmetic, EDA methodology, high-performance/secure VLSI computing, and cryptographic hardware.
Dr. Homma was a recipient of the IP Award from the 2005 LSI IP Design Award, and the Best Paper Award from the Workshop on Synthesis and System Integration of Mixed Information Technologies in 2007.

Takafumi Aoki (M’90) received the B.E., M.E., and D.E. degrees in electronic engineering from Tohoku University, Sendai, Japan, in 1988, 1990, and 1992, respectively.
He is currently a Professor with the Graduate School of Information Sciences, Tohoku University. From 1997 to 1999, he also joined the PRESTO Project, Japan Science and Technology Corporation (JST). His research interests include theoretical aspects of computation, digital signal processing, computer vision, image processing, biometric authentication, and security issues in computer systems.
Dr. Aoki was a recipient of the Outstanding Paper Award at the 1990, 2000, 2001, and 2006 IEEE International Symposiums on Multiple-Valued Logic, the Outstanding Transactions Paper Award from the Institute of Electronics, Information and Communication Engineers (IEICE) of Japan in 1989 and 1997, the IEE Ambrose Fleming Premium Award in 1994, the IEICE Inose Award in 1997, the IEE Mountbatten Premium Award in 1999, the Best Paper Award at the 1999 IEEE International Symposium on Intelligent Signal Processing and Communication Systems, the IP Award at the 7th LSI IP Design Award in 2005, and the Best Paper Award at the 14th Workshop on Synthesis and System Integration of Mixed Information Technologies in 2007.

Akashi Satoh received the B.S., M.S., and Ph.D. degrees in electrical engineering from Waseda University, Waseda, Tokyo, in 1987, 1989, and 1999, respectively.
In 1989, he joined IBM Research, Tokyo Research Laboratory, where he was involved in the research and development of digital and analog VLSI circuits. In 2007, he joined the National Institute of Advanced Industrial Science and Technology, Research Center for Information Security. His current research interests include algorithms and architectures for data security and high-performance circuit implementations.
