
An RNS-Enhanced Microprocessor Implementation of Public Key Cryptography

Zhining Lim and Braden J. Phillips


Centre for High Performance Integrated Technologies & Systems (CHiPTec)
School of Electrical and Electronic Engineering
The University of Adelaide
Adelaide, SA 5005, Australia
Email: {zhining.lim,braden.phillips}@adelaide.edu.au

This research was supported under the Australian Research Council's Discovery Projects funding scheme (project number DP0557582). It was also supported under the Tensilica University Program.

Abstract: This paper presents a new residue number system implementation of the RSA cryptosystem. The system runs on a low-area, low-power microprocessor that we have extended with hardware support for residue arithmetic. When compared against a baseline implementation that uses non-RNS multi-precision methods, the new RNS implementation executes in 67.7% fewer clock cycles. The hardware support requires 42.7% more gates than the base processor core.

I. INTRODUCTION

We present a new implementation of the RSA public key cryptosystem on a low power microprocessor. Our solution uses the Residue Number System (RNS) on a processor for which we have developed hardware support for modular arithmetic. These extensions provide good decryption rates at low power and with low hardware cost.

The RSA public key cryptosystem [1] is a proven secure communication method for static and mobile applications. It uses modular exponentiation on integers for encryption, decryption, signature generation and authentication. For example, a signature C is generated from message M using private exponent D and public modulus N in the exponentiation $C = M^D \bmod N$, where C, D, M and N are integers. However, for security reasons, C, D, M and N must be long integers [2], which slows the computation. While existing low-power implementations [3], [4] use hardware accelerators, ours uses an RNS-enabled microprocessor.

The RNS is defined by a set of coprime integers called a base. Each integer is associated with a channel and is known as the channel's modulus. To represent a number in an RNS base, each channel holds the residue of that number modulo the associated channel modulus. The product of the channel moduli is the dynamic range of the base. The Chinese Remainder Theorem (CRT) implies that the RNS representation of a number is unique modulo the dynamic range.

The RNS channels are independent, thus addition of two numbers in the same base is performed through modular addition of the corresponding channels. Subtraction and multiplication are performed in the same way. However, as magnitude comparison is not possible, division and modular reduction require a more complicated approach.

The idea of using the RNS to improve the performance of RSA has received attention for some years [5]-[8]. The attraction of RNS is that since the channels are independent, they are amenable to parallel implementations [9]. We have found that there is also a benefit in a sequential software approach as the RNS improves operand scheduling and reduces the cost of carry propagation.

To evaluate this new approach, two versions of RSA have been built on Tensilica's Xtensa processor [10], a 32-bit microprocessor with a RISC instruction set. The Xtensa can be augmented with hardware defined by either Tensilica or the user in the form of operations, states and registers with the Tensilica Instruction Extension (TIE) language [11]. The Xtensa is suited to low-power and low-area applications as its base configuration has a size of 0.26 mm^2 and consumes 76 µW/MHz for a 130 nm process [12].

The first RSA version, a baseline, uses conventional multiprecision arithmetic on an Xtensa core which has been configured to include a 32-bit multiplier. The second uses RNS arithmetic for which we have developed modular arithmetic instructions and hardware in the TIE language.

This paper first describes the modular multiplication algorithm needed for the RSA signature generation in Section II. The specific algorithms for the baseline and the RNS version are described in Section II-A and Section II-B respectively. The hardware support developed for the RNS version appears in Section III. The results of a signature generation performed on the baseline and the RNS version follow in Section IV.

II. ALGORITHMS

The modular exponentiations in the RSA algorithm are performed as a series of modular multiplications. Modular multiplication is composed of a multiplication step and a reduction step. For both our multiprecision and RNS versions, the reduction step uses Montgomery's algorithm [13] which reduces an integer product AB modulo the integer N as shown in (1). R in (1) is the radix and $N^{-1}$ is the modular inverse of N modulo R.

$$ABR^{-1} \bmod N = \frac{AB - (ABN^{-1} \bmod R)\,N}{R} \qquad (1)$$
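Equation (1) can be checked numerically with a minimal Python sketch. The function name `montgomery_reduce` and the toy values (N = 97, R = 2^8, A = 45, B = 76) are assumptions chosen for illustration and do not come from the paper. Because (1) subtracts a multiple of N, the quotient may be negative, so the sketch maps it back into the range 0 to N-1.

```python
def montgomery_reduce(ab: int, n: int, r: int) -> int:
    """Return ab * r^(-1) mod n following equation (1); needs gcd(n, r) == 1."""
    n_inv = pow(n, -1, r)        # N^(-1) mod R (three-argument pow, Python 3.8+)
    q = (ab * n_inv) % r         # A*B*N^(-1) mod R
    t = (ab - q * n) // r        # exact division: ab - q*n is a multiple of r
    return t % n                 # equation (1) may give a negative value


# Toy parameters, assumed purely for illustration.
N = 97
R = 1 << 8                       # radix: a power of 2, coprime to the odd N
A, B = 45, 76

result = montgomery_reduce(A * B, N, R)
expected = (A * B * pow(R, -1, N)) % N   # direct computation of A*B*R^(-1) mod N
```

Here `result` and `expected` agree, which confirms that the numerator in (1) is always divisible by R and that the quotient is congruent to $ABR^{-1}$ modulo N.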

978-1-4244-2110-7/08/$25.00 ©2007 IEEE


For computer arithmetic in the weighted binary number system, such as the baseline in Section II-A, a recommended choice for radix R is a power of 2 as the division and reduction operations become shift and truncation operations respectively. The extra $R^{-1}$ factor in the result is the modular inverse of R modulo N. It is removed with pre-processing and an extra modular multiplication.

In the RNS, reduction and division operations are not as straightforward. The radix R is chosen to be the dynamic range of the base as then the reduction done in $ABN^{-1} \bmod R$ is provided without extra effort and fewer channels are needed. However, the division step is then not possible unless a second base is used. This introduces the need for conversions from the first base to the second and then back at the end. These base extensions are described in more detail in Section II-B.

A. Baseline Implementation

The baseline implementation uses multi-precision Montgomery modular multiplication. Koç et al. explored variations of it in [14] using C on an Intel Pentium and the most promising of these, the CIOS scheme, was chosen as the starting point for the baseline. The CIOS algorithm follows the normal multiplication procedure but performs modular reduction on each partial product before forming the next partial product. To enhance the baseline, a 32-bit multiplier was added and the algorithm was coded in assembly.

The algorithm was improved using the CRT to reduce both the exponent and modulus into two shorter keys. This produces two shorter exponentiations and their results are combined into the final result with the technique in [15].

Results for the baseline and the CRT-enhanced baseline appear in Table III.

B. RNS Implementation

RNS versions of Montgomery's algorithm have been published by Bajard [7] and Kawamura [8]. We have explored both algorithms and developed a new hybrid. They are illustrated in Fig. 1, where Montgomery Part 1 refers to the calculation of $ABN^{-1} \bmod R$ and Montgomery Part 2 finishes off the modular multiplication, as in (1).

As mentioned earlier, the implementation of Montgomery's algorithm in the RNS requires two bases. The base extension operation is a computationally intensive task that contributes significant delay. As shown in Fig. 1, there are two base extensions per multiplication.

Base extensions convert from one base to another via a method that would normally get the number into its non-RNS form. The method used by the three RNS implementations of RSA is provided by the CRT. It involves many independent multiplications, hence is well-suited for the RNS. However, it also requires a final modular reduction by the original base's dynamic range. As this is not possible, each CRT base extension introduces an error in the form of a multiple of the original base's dynamic range.

The RNS implementations calculate the error and subtract it from the result of the base extension. Bajard corrects just one error during the exponentiation using the redundant channel technique from [16], as shown in Fig. 1a. The other error is removed in the corrective modular multiplication after the exponentiation using an error-free base extension method such as the Mixed Radix System [17].

Kawamura estimates each error by multiplying the original base's dynamic range by the shifted sum of the top bits of the channels. As shown in Fig. 1b, both errors are removed per multiplication.

As shown in Fig. 1c, the hybrid version adopts Bajard's strategy of removing just one error per multiplication and does so using Kawamura's technique. In the corrective modular multiplication, both errors are removed with Kawamura's technique.

Table I shows the number of ALU instructions needed per multiplication in terms of n, the number of channels in each base. These instructions include channel addition, channel multiplication and shifts. Multiplications are considered to be as complex as the additions and shifts due to the fast multiplier. However, the number of instructions does not include loads, stores or associated indexing because they attract additional delays.

TABLE I
NUMBER OF ALU INSTRUCTIONS PER MODULAR MULTIPLICATION

Version  | Number of instructions
Bajard   | $4n^2 + 12n + 4$
Kawamura | $4n^2 + 12n + 4$
Hybrid   | $4n^2 + 10n + 2$

Table I indicates that Bajard and Kawamura would require an equal number of instructions. This is verified by the results in Table III, although they are not exactly equal due to different numbers of loads and stores. The hybrid requires fewer instructions as it addresses the disadvantages of Bajard requiring extra hardware and constants and Kawamura using more cycles to remove both errors. This advantage is also apparent in Table III.

The CRT enhancement was also applied to the hybrid RNS algorithm. The two bases have 17 channels whose moduli have been chosen in the same way as the original bases.

Results for the RNS algorithms and the CRT enhancement are shown in Table III.

III. HARDWARE SUPPORT

Our RNS hardware support for the Xtensa takes the form of channel representations and additional operations.

A. Channel Representations

The channels are represented in 64 bits, as shown in Fig. 2, where the channel modulus is held in the upper 32 bits and the channel's value in the lower 32 bits. Channel operations occur in user-defined 64-bit "modbase" registers and the channels are stored in memory when not in use.

Storing the channels in memory means that accessing each channel requires a load and a store. Our implementation's

sequential nature attracts a high load-store overhead if each RNS operation is performed across the channels, as illustrated in Fig. 3a. Instead, as the channels are independent, we can load one channel, perform a series of operations, and then store it before loading the next channel, as illustrated in Fig. 3b.

[Figure 1: three panels, (a) Bajard's algorithm, (b) Kawamura's algorithm, (c) Hybrid algorithm. Each panel shows channels 1 to n of Base 1 and Base 2 passing through the Montgomery Part 1, base extension and Montgomery Part 2 blocks; panel (a) also has a redundant channel.]

Fig. 1. General illustration of the three RNS versions of Montgomery's multiplication algorithm. The execution flows from left to right. The horizontal lines represent the channels in the bases. A solid channel holds a value that is relevant to the current execution whereas a dashed channel does not. The arrows represent a channel value being passed to an execution block in another base.

[Figure 2: the Xtensa registers (a0 to a15), the user-defined "modbase" registers and memory, linked by load and store. Each channel occupies a modulus word and a value word in memory.]

Fig. 2. Representation of the channels on the Xtensa Processor. Each rectangle is 32 bits wide.

[Figure 3: two panels, (a) and (b), each showing a multiply, multiply, add sequence over channels ch1, ch2 and ch3.]

Fig. 3. (a) High load-store overhead at the multiply operation due to sequential implementation of parallel algorithm. (b) Reduced load-store overhead for channel 2 by working along the channel instead of across the channels.

B. RNS Operations

The most important requirement for the new RNS operations is that they must include modular reduction. Modular reduction for addition involves at most two subtractions of the modulus. Similarly, subtraction requires at most two additions of the modulus.

Modular multiplication requires a more complicated modular reduction step as the product is potentially double the length of the inputs. The multiplication step is performed with a 64-bit multiplier. The product A is reduced modulo $2^{32} - N$ using a method employed in Crandall's public key exchange device [18]. The algorithm is shown in Fig. 4. When N is less than 16 bits long, Crandall's method requires two iterations and the corrective subtraction.

Crandall's method has clear advantages over the Montgomery scheme in that it does not require any extra constants, it does not introduce any errors, it does not use 64-bit multipliers, and it uses relatively few steps. These factors make the design require less memory space, area, power, and time.

The number of different bases available, and hence the dynamic range, is restricted by the bound on N. However, it is still more than adequate for our purposes as a range for N of 1 to 429 provides 66 coprime 32-bit integers. These are

divided into two bases made up of 33 integers, each of which provides the necessary dynamic range for 1024-bit values.

    Input:  A, N < 2^32
    Output: A mod (2^32 - N)
    while A > (2^32 - N) or A - (2^32 - N) > (2^32 - N)
        A ← N × ⌊A / 2^32⌋ + A mod 2^32
    end while
    if A < (2^32 - N) then
        return A
    else
        return A - (2^32 - N)
    end if

Fig. 4. Crandall's modular reduction algorithm.

Table II shows the types of new instructions we have defined for the RSA algorithm in the RNS. In addition to the basic modular arithmetic, there are instructions that assist the CRT and MRS base extensions. The other instructions, load and store with offset and load with increment, assist with the access of sequential memory locations. We included them because the channels and constants are stored in memory arrays. The additional area needed appears in Table III.

IV. RESULTS

The six versions of the RSA signature generation were implemented and simulated with 1024-bit inputs. The exponent was composed of 542 zeros and 482 ones and the exponentiation used the right-to-left binary method [19]. To achieve a 1024-bit dynamic range, the RNS versions used bases with 33 channels of 32 bits. The CRT RNS versions used bases with 17 channels of 32 bits.

The simulation results are shown in Table III in terms of clock cycles required to perform the exponentiation, excluding cache misses, and area of additional hardware support. Both the clock cycle timing and the area measurements were provided by Tensilica's Xtensa Xplorer Design Environment. The implementations did not consider side-channel attacks.

TABLE III
TIMING AND AREA REQUIREMENTS FOR RSA IMPLEMENTATIONS

Version      | Clock Cycles | Additional Gates
Baseline     | 50075284     | 0
RNS Bajard   | 16496753     | 30813
RNS Kawamura | 16491058     | 29578
RNS Hybrid   | 16191768     | 29578
CRT Baseline | 13244373     | 0
CRT Hybrid   | 5260670      | 29578

In Table III, all of the RNS versions performed better than the baseline in terms of clock cycles. The hybrid RNS version completed the signature generation in 32.3% of the time needed for the baseline. One of the factors influencing the baseline's performance was the Xtensa's lack of carry overflow support. The baseline did not use TIE enhancements, thus carry overflow had to be detected with costly tests involving branch instructions.

Bajard's and Kawamura's versions executed in a similar number of clock cycles. By removing the redundant channel and the extra corrective step, the hybrid improved the Kawamura version by 1.8%.

The CRT enhancement improved both the baseline and the hybrid RNS versions. The CRT enhancement to the baseline and the hybrid RNS version improved the time required by 73.6% and 67.7%, respectively.

The area measurements included the decode and multiplexing support. The Xplorer reported the core's area to be 69282 gates or 0.7 mm^2 for a 130 nm LV worst-case process. Kawamura's and the hybrid versions increased the number of gates by 42.7%. Bajard's version increased the number of gates by 44.5%.

We intend to investigate the power consumption in the future.

V. CONCLUSION

In this paper, we have presented a new implementation of the RSA cryptosystem in the RNS. It was implemented on a low area, low power Xtensa processor for which we have developed residue arithmetic hardware. The simulation results for a 1024-bit RSA signature generation using Tensilica's Xplorer Development Environment are promising. The fastest RNS version of RSA required 67.7% fewer clock cycles than a non-RNS baseline version. Our hardware residue arithmetic support for the smallest RNS version increased the number of gates in the microprocessor by 42.7%. The extra power consumption used by the hardware support is still unknown and is the subject of future work.

VI. ACKNOWLEDGMENT

We wish to thank the team at Tensilica for their help and support in this project.

REFERENCES

[1] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978.
[2] "RSA Laboratories' Frequently Asked Questions About Today's Cryptography," RSA Laboratories, RSA Security Inc., 2000, version 4.1.
[3] "SecurCore™ Solutions Data Sheet," ARM Limited, Cambridge, United Kingdom.
[4] "MIPS32® 4KSd™ Secure Data Core Product Brief," MIPS Technologies, Mountain View, CA, USA, 2005.
[5] J. Schwemmlein, K. Posch, and R. Posch, "RNS-modulo reduction upon a restricted base value set and its applicability to RSA cryptography," Computers & Security, vol. 17, no. 7, pp. 637-650, 1998.
[6] M. Ciet, M. Neve, E. Peeters, and J.-J. Quisquater, "Parallel FPGA implementation of RSA with residue number systems - Can side-channel threats be avoided?" in 46th IEEE International Midwest Symposium on Circuits and Systems - MWSCAS-2003, vol. 2, Cairo, Egypt, 2003, pp. 806-810.
[7] J.-C. Bajard, L.-S. Didier, and P. Kornerup, "Modular multiplication and base extensions in residue number systems," in Proceedings - 15th IEEE Symposium on Computer Arithmetic, ARITH '01, Vail, Colorado, 2001, pp. 59-65.
[8] S. Kawamura, M. Koike, F. Sano, and A. Shimbo, "Cox-rower architecture for fast parallel Montgomery multiplication," in Advances in Cryptology - EuroCrypt '00, ser. Lecture Notes in Computer Science, B. Preneel, Ed., vol. 1807. Berlin: Springer-Verlag, 2000, pp. 523-538.

[9] H. Nozaki, M. Motoyama, A. Shimbo, and S. Kawamura, "Implementation of RSA algorithm based on RNS Montgomery multiplication," in CHES '01: Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems. London, UK: Springer-Verlag, 2001, pp. 364-376.
[10] R. Gonzalez, "Xtensa: a configurable and extensible processor," IEEE Micro, vol. 20, no. 2, pp. 60-70, 2000.
[11] Tensilica® Instruction Extension (TIE) Language Reference Manual, Tensilica, Inc., 3255-6 Scott Blvd., Santa Clara, CA 95054, February 2006.
[12] Xtensa® LX Microprocessor Data Book, Tensilica, Inc., 3255-6 Scott Blvd., Santa Clara, CA 95054, February 2006.
[13] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519-521, 1985.
[14] Ç. Koç, T. Acar, and B. S. Kaliski, Jr, "Analysing and comparing Montgomery multiplication algorithms," IEEE Micro, vol. 16, no. 2, pp. 26-33, 1996.
[15] J.-J. Quisquater and C. Couvreur, "Fast decipherment algorithm for RSA public-key cryptosystem," Electron. Lett., vol. 18, no. 21, pp. 905-907, 1982.
[16] A. Shenoy and R. Kumaresan, "Fast base extension using a redundant modulus in RNS," IEEE Trans. Comput., vol. 38, no. 2, pp. 292-296, 1989.
[17] N. Szabo and R. Tanaka, Residue Arithmetic and its Applications to Computer Technology. New York, NY: McGraw-Hill Inc., 1967.
[18] R. Crandall, "Method and apparatus for public key exchange in a cryptographic system," U.S. Patent 5 271 061, December 14, 1993.
[19] D. Knuth, The Art of Computer Programming: Seminumerical Algorithms, 2nd ed. Reading, MA: Addison-Wesley, 1981, vol. 2.

TABLE II
TYPES OF NEW INSTRUCTIONS FOR THE RNS IMPLEMENTATION OF RSA

Type | Example(s) (a) | Description | Operation (b)
Channel access | inmb A, b | Put b in A | A_L ← b
Basic modular arithmetic | cmul A, B | Multiply A and B in A (c) | A_L ← (A_L × B_L) mod A_U
 | cmular A, b | Multiply A and b in A | A_L ← (A_L × b) mod A_U
Basic channel load, store, move | CL64 A, b, c | Load 64 bits from memory location b+c into A | A ← MEM64[b+c]
Load, store with offset | incoff | Increment state off by 1 | off ← off + 1
 | clroff | Reset state off to 0 | off ← 0
 | CL64off A, b | Load 64 bits from memory location b+off into A | A ← MEM64[b+off]
 | s32off A, b | Store lower 32 bits of A at memory location b+off | MEM[b+off] ← A_L
Load with increment | wur.aradd a | Store a in user-defined state aradd | aradd ← a
 | CL64inc A | Load 64 bits from memory location in user-defined state mbadd into A, increment mbadd by 8 | A ← MEM64[mbadd], mbadd ← mbadd + 8
 | l32inc a | Load 32 bits from memory location in user-defined state aradd into a, increment aradd by 4 | a ← MEM[aradd], aradd ← aradd + 4
CRT Base extension | modmular A, B, c | Multiply B and c in A | A_L ← (B_L × c) mod A_U
 | cmara A, B, c | Add A to product of B and c in A | A_L ← (B_L × c + A_L) mod A_U
MRS Base extension | csmar A, B, c | Subtract product of B and c from A in A | A_L ← (A_L - B_L × c) mod A_U

(a) Uppercase operands are the user-defined 64-bit registers used for channel manipulations. Lowercase operands are either Xtensa 32-bit address registers or immediate operands.
(b) A_L means the lower 32 bits of A. A_U means the upper 32 bits of A.
(c) "in A" means to modularly reduce by A's modulus and store the 32-bit result in the lower 32 bits of A.
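The arithmetic rows of Table II, together with the reduction of Fig. 4, can be modelled in software. The following Python sketch is an illustration under stated assumptions, not the TIE hardware: the function names only mirror the mnemonics, the folding loop runs while the value exceeds 32 bits (one reading of Fig. 4), N is assumed to be shorter than 16 bits, and the four-channel base is picked greedily rather than taken from the paper. A CRT reconstruction, used only as a check, confirms that multiplying channel by channel agrees with ordinary integer multiplication.

```python
from math import gcd, prod

MASK32 = (1 << 32) - 1


def crandall_reduce(a: int, n: int) -> int:
    """Reduce a modulo m = 2**32 - n by folding, using 2**32 ≡ n (mod m)."""
    assert a >= 0 and 0 < n < (1 << 16)   # assumption: n shorter than 16 bits
    m = (1 << 32) - n
    while a >> 32:                        # fold until a fits in 32 bits
        a = n * (a >> 32) + (a & MASK32)
    return a - m if a >= m else a         # corrective subtraction


def cmul(a_u: int, a_l: int, b_l: int) -> int:
    """Model of cmul A, B: A_L <- (A_L * B_L) mod A_U."""
    return crandall_reduce(a_l * b_l, (1 << 32) - a_u)


def cmara(a_u: int, a_l: int, b_l: int, c: int) -> int:
    """Model of cmara A, B, c: A_L <- (B_L * c + A_L) mod A_U."""
    return crandall_reduce(b_l * c + a_l, (1 << 32) - a_u)


def csmar(a_u: int, a_l: int, b_l: int, c: int) -> int:
    """Model of csmar A, B, c: A_L <- (A_L - B_L * c) mod A_U."""
    diff = a_l - crandall_reduce(b_l * c, (1 << 32) - a_u)
    return diff + a_u if diff < 0 else diff


def pick_base(channels: int) -> list[int]:
    """Greedily pick pairwise-coprime moduli 2**32 - n (illustrative choice)."""
    base, n = [], 1
    while len(base) < channels:
        m = (1 << 32) - n
        if all(gcd(m, other) == 1 for other in base):
            base.append(m)
        n += 1
    return base


def to_rns(x: int, base: list[int]) -> list[int]:
    """Each channel holds the residue of x modulo its channel modulus."""
    return [x % m for m in base]


def from_rns(residues: list[int], base: list[int]) -> int:
    """CRT reconstruction, used here only to check the channel-wise result."""
    dynamic_range = prod(base)
    total = 0
    for r, m in zip(residues, base):
        m_i = dynamic_range // m
        total += r * m_i * pow(m_i, -1, m)
    return total % dynamic_range


base = pick_base(4)
x, y = 0x123456789ABCDEF0, 0x0FEDCBA987654321
product = [cmul(m, a, b) for m, a, b in zip(base, to_rns(x, base), to_rns(y, base))]
```

Reconstructing `product` with `from_rns` gives `x * y` modulo the dynamic range of the base, and no carry ever passes between channels.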
