Abstract: This paper presents a new residue number system implementation of the RSA cryptosystem. The system runs on a low-area, low-power microprocessor that we have extended with hardware support for residue arithmetic. When compared against a baseline implementation that uses non-RNS multi-precision methods, the new RNS implementation executes in 67.7% fewer clock cycles. The hardware support requires 42.7% more gates than the base processor core.

This research was supported under the Australian Research Council's Discovery Projects funding scheme (project number DP0557582). It was also supported under the Tensilica University Program.

I. INTRODUCTION

We present a new implementation of the RSA public key cryptosystem on a low-power microprocessor. Our solution uses the Residue Number System (RNS) on a processor for which we have developed hardware support for modular arithmetic. These extensions provide good decryption rates at low power and with low hardware cost.

The RSA public key cryptosystem [1] is a proven secure communication method for static and mobile applications. It uses modular exponentiation on integers for encryption, decryption, signature generation and authentication. For example, a signature C is generated from message M using private exponent D and public modulus N in the exponentiation $C = M^D \bmod N$, where C, D, M and N are integers. However, for security reasons, C, D, M and N must be long integers [2], which slows the computation. While existing low-power implementations [3], [4] use hardware accelerators, ours uses an RNS-enabled microprocessor.

The RNS is defined by a set of coprime integers called a base. Each integer is associated with a channel and is known as the channel's modulus. To represent a number in an RNS base, each channel holds the residue of that number modulo the associated channel modulus. The product of the channel moduli is the dynamic range of the base. The Chinese Remainder Theorem (CRT) implies that the RNS representation of a number is unique modulo the dynamic range.

The RNS channels are independent, thus addition of two numbers in the same base is performed through modular addition of the corresponding channels. Subtraction and multiplication are performed in the same way. However, as magnitude comparison is not possible, division and modular reduction require a more complicated approach.

The idea of using the RNS to improve the performance of RSA has received attention for some years [5]-[8]. The attraction of RNS is that since the channels are independent, they are amenable to parallel implementations [9]. We have found that there is also a benefit in a sequential software approach, as the RNS improves operand scheduling and reduces the cost of carry propagation.

To evaluate this new approach, two versions of RSA have been built on Tensilica's Xtensa processor [10], a 32-bit microprocessor with a RISC instruction set. The Xtensa can be augmented with hardware defined by either Tensilica or the user in the form of operations, states and registers with the Tensilica Instruction Extension (TIE) language [11]. The Xtensa is suited to low-power and low-area applications as its base configuration has a size of 0.26 mm² and consumes 76 µW/MHz for a 130 nm process [12].

The first RSA version, a baseline, uses conventional multiprecision arithmetic on an Xtensa core which has been configured to include a 32-bit multiplier. The second uses RNS arithmetic for which we have developed modular arithmetic instructions and hardware in the TIE language.

This paper first describes the modular multiplication algorithm needed for the RSA signature generation in Section II. The specific algorithms for the baseline and the RNS version are described in Section II-A and Section II-B respectively. The hardware support developed for the RNS version appears in Section III. The results of a signature generation performed on the baseline and the RNS version follow in Section IV.

II. ALGORITHMS

The modular exponentiations in the RSA algorithm are performed as a series of modular multiplications. Modular multiplication is composed of a multiplication step and a reduction step. For both our multiprecision and RNS versions, the reduction step uses Montgomery's algorithm [13], which reduces an integer product AB modulo the integer N as shown in (1). R in (1) is the radix and $N^{-1}$ is the modular inverse of N modulo R.

$$ABR^{-1} \bmod N = \frac{AB - (ABN^{-1} \bmod R)\,N}{R} \qquad (1)$$
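As an illustration of (1), the following Python sketch performs one Montgomery reduction on ordinary integers. It is only a sketch under stated assumptions: the function name `montgomery_reduce`, the choice of R as a power of two, and the final normalisation into the range 0 to N-1 are made up for this example and are not taken from the implementations described in this paper.

```python
def montgomery_reduce(t: int, n: int, r_bits: int) -> int:
    """Return t * R^-1 mod n for R = 2**r_bits, following the form of (1).

    Assumes n is odd so that N^-1 mod R exists (hypothetical helper).
    """
    r = 1 << r_bits
    n_inv = pow(n, -1, r)            # N^-1 mod R (needs Python 3.8+)
    q = (t * n_inv) % r              # (AB * N^-1) mod R
    u, rem = divmod(t - q * n, r)    # numerator of (1) is a multiple of R
    assert rem == 0
    return u % n                     # assumed normalisation: u can be negative


if __name__ == "__main__":
    n, r_bits = 97, 8                # toy modulus and radix R = 256
    a, b = 45, 76
    print(montgomery_reduce(a * b, n, r_bits))
```

The subtraction in the numerator cancels the low-order part of the product modulo R, which is why the division by R is exact and no trial division by N is needed.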
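The channel-wise arithmetic described in Section I can be sketched in a few lines of Python. The tiny base (7, 11, 13) and the helper names `to_rns`, `rns_add`, `rns_mul` and `from_rns` are hypothetical and chosen only for illustration; they do not reflect the 32-bit channels used in this paper.

```python
from math import prod

BASE = (7, 11, 13)  # pairwise coprime moduli; dynamic range is 7 * 11 * 13 = 1001


def to_rns(x, base=BASE):
    """Each channel holds the residue of x modulo its own modulus."""
    return tuple(x % m for m in base)


def rns_add(a, b, base=BASE):
    """Addition is performed independently in every channel."""
    return tuple((x + y) % m for x, y, m in zip(a, b, base))


def rns_mul(a, b, base=BASE):
    """Multiplication is performed independently in every channel."""
    return tuple((x * y) % m for x, y, m in zip(a, b, base))


def from_rns(residues, base=BASE):
    """CRT reconstruction; the result is unique modulo the dynamic range."""
    dynamic_range = prod(base)
    total = 0
    for r, m in zip(residues, base):
        partial = dynamic_range // m
        total += r * partial * pow(partial, -1, m)
    return total % dynamic_range


if __name__ == "__main__":
    print(from_rns(rns_mul(to_rns(23), to_rns(41))))
```

Results are only meaningful modulo the dynamic range, so a product that exceeds it wraps around; this is why the base must be large enough for the operands in use.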
[Figure: block diagram of Montgomery Part 1, base extension, and Montgomery Part 2 over channels 1 to n of Base 1 and Base 2, with a redundant channel.]
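The RNS Operations subsection reduces each channel product modulo 2^32 - N using Crandall's method (Fig. 4). The Python sketch below illustrates the folding step, which relies on the congruence 2^32 ≡ N (mod 2^32 - N). It is a sketch under assumptions: the function name `crandall_reduce` is hypothetical, the bound N < 2^31 is imposed only so that one corrective subtraction is enough, and the loop runs while the value still occupies more than 32 bits, which is one reading of the loop condition and not necessarily the authors' exact test.

```python
def crandall_reduce(a: int, n: int) -> int:
    """Reduce a modulo m = 2**32 - n by folding the high word into the low word."""
    assert 0 < n < (1 << 31)
    m = (1 << 32) - n
    # Assumed loop condition: keep folding while bits remain above bit 31.
    while a >> 32:
        # 2**32 is congruent to n modulo m, so high * 2**32 becomes high * n.
        a = n * (a >> 32) + (a & 0xFFFFFFFF)
    # Now a < 2**32, hence a - m < n < m and one subtraction suffices.
    return a - m if a >= m else a


if __name__ == "__main__":
    n = 5
    m = (1 << 32) - n
    product = (m - 1) * (m - 2)      # product of two reduced channel values
    print(crandall_reduce(product, n) == product % m)
```

Each fold multiplies only by the small constant N, so no division and no precomputed inverse are involved.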
[Figure: Xtensa registers a0 to a15, user-defined "modbase" registers, and memory holding the value and modulus of each channel, with load and store paths between them.]

Fig. 2. Representation of the channels on the Xtensa Processor. Each rectangle is 32 bits wide.

[Figure: two timelines, (a) and (b), of load, multiply, add and store operations on channels ch1, ch2 and ch3.]

Fig. 3. (a) High load-store overhead at the multiply operation due to sequential implementation of parallel algorithm. (b) Reduced load-store overhead for channel 2 by working along the channel instead of across the channels.

sequential nature attracts a high load-store overhead if each RNS operation is performed across the channels, as illustrated in Fig. 3a. Instead, as the channels are independent, we can load one channel, perform a series of operations, and then store it before loading the next channel, as illustrated in Fig. 3b.

B. RNS Operations

The most important requirement for the new RNS operations is that they must include modular reduction. Modular reduction for addition involves at most two subtractions of the modulus. Similarly, subtraction requires at most two additions of the modulus.

Modular multiplication requires a more complicated modular reduction step as the product is potentially double the length of the inputs. The multiplication step is performed with a 64-bit multiplier. The product A is reduced modulo $2^{32} - N$ using a method employed in Crandall's public key exchange device [18]. The algorithm is shown in Fig. 4. When N is less than 16 bits long, Crandall's method requires two iterations and the corrective subtraction.

Crandall's method has clear advantages over the Montgomery scheme in that it does not require any extra constants, it does not introduce any errors, it does not use 64-bit multipliers, and it uses relatively few steps. These factors reduce the memory space, area, power, and time that the design requires.

The number of different bases available, and hence the dynamic range, is restricted by the bound on N. However, it is still more than adequate for our purposes as a range for N of 1 to 429 provides 66 coprime 32-bit integers. These are
divided into two bases made up of 33 integers, each of which provides the necessary dynamic range for 1024-bit values.

    Input: A, N < 2^32
    Output: A mod (2^32 - N)
    while A > (2^32 - N) or A - (2^32 - N) > (2^32 - N)
        A ← N × ⌊A / 2^32⌋ + A mod 2^32
    end while
    if A < (2^32 - N) then
        return A
    else
        return A - (2^32 - N)
    end if

Fig. 4. Crandall's modular reduction algorithm.

Table II shows the types of new instructions we have defined for the RSA algorithm in the RNS. In addition to the basic modular arithmetic, there are instructions that assist the CRT and MRS base extensions. The other instructions, load and store with offset and load with increment, assist with the access of sequential memory locations. We included them because the channels and constants are stored in memory arrays. The additional area needed appears in Table III.

IV. RESULTS

The six versions of the RSA signature generation were implemented and simulated with 1024-bit inputs. The exponent was composed of 542 zeros and 482 ones and the exponentiation used the right-to-left binary method [19]. To achieve a 1024-bit dynamic range, the RNS versions used bases with 33 channels of 32 bits. The CRT RNS versions used bases with 17 channels of 32 bits.

The simulation results are shown in Table III in terms of clock cycles required to perform the exponentiation, excluding cache misses, and area of additional hardware support. Both the clock cycle timing and the area measurements were provided by Tensilica's Xtensa Xplorer Design Environment. The implementations did not consider side-channel attacks.

TABLE III
TIMING AND AREA REQUIREMENTS FOR RSA IMPLEMENTATIONS.

    Version         Clock Cycles    Additional Gates
    Baseline        50075284        0
    RNS Bajard      16496753        30813
    RNS Kawamura    16491058        29578
    RNS Hybrid      16191768        29578
    CRT Baseline    13244373        0
    CRT Hybrid      5260670         29578

In Table III, all of the RNS versions performed better than the baseline in terms of clock cycles. The hybrid RNS version completed the signature generation in 32.3% of the time needed for the baseline. One of the factors influencing the baseline's performance was the Xtensa's lack of carry overflow support. The baseline did not use TIE enhancements, thus carry overflow had to be detected with costly tests involving branch instructions.

Bajard's and Kawamura's versions executed in a similar number of clock cycles. By removing the redundant channel and the extra corrective step, the hybrid improved on the Kawamura version by 1.8%.

The CRT enhancement improved both the baseline and the hybrid RNS versions. The CRT enhancement to the baseline and the hybrid RNS version reduced the time required by 73.6% and 67.7%, respectively.

The area measurements included the decode and multiplexing support. The Xplorer reported the core's area to be 69282 gates or 0.7 mm² for a 130 nm LV worst-case process. Kawamura's and the hybrid versions increased the number of gates by 42.7%. Bajard's version increased the number of gates by 44.5%.

We intend to investigate the power consumption in the future.

V. CONCLUSION

In this paper, we have presented a new implementation of the RSA cryptosystem in the RNS. It was implemented on a low-area, low-power Xtensa processor for which we have developed residue arithmetic hardware. The simulation results for a 1024-bit RSA signature generation using Tensilica's Xplorer Development Environment are promising. The fastest RNS version of RSA required 67.7% fewer clock cycles than a non-RNS baseline version. Our hardware residue arithmetic support for the smallest RNS version increased the number of gates in the microprocessor by 42.7%. The extra power consumed by the hardware support is still unknown and is the subject of future work.

VI. ACKNOWLEDGMENT

We wish to thank the team at Tensilica for their help and support in this project.

REFERENCES

[1] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978.
[2] "RSA Laboratories' Frequently Asked Questions About Today's Cryptography," RSA Laboratories, RSA Security Inc., 2000, version 4.1.
[3] "SecurCore™ Solutions Data Sheet," ARM Limited, Cambridge, United Kingdom.
[4] "MIPS32® 4KSd™ Secure Data Core Product Brief," MIPS Technologies, Mountain View, CA, USA, 2005.
[5] J. Schwemmlein, K. Posch, and R. Posch, "RNS-modulo reduction upon a restricted base value set and its applicability to RSA cryptography," Computers & Security, vol. 17, no. 7, pp. 637-650, 1998.
[6] M. Ciet, M. Neve, E. Peeters, and J.-J. Quisquater, "Parallel FPGA implementation of RSA with residue number systems - Can side-channel threats be avoided?" in 46th IEEE International Midwest Symposium on Circuits and Systems - MWSCAS-2003, vol. 2, Cairo, Egypt, 2003, pp. 806-810.
[7] J.-C. Bajard, L.-S. Didier, and P. Kornerup, "Modular multiplication and base extensions in residue number systems," in Proceedings - 15th IEEE Symposium on Computer Arithmetic, ARITH '01, Vail, Colorado, 2001, pp. 59-65.
[8] S. Kawamura, M. Koike, F. Sano, and A. Shimbo, "Cox-rower architecture for fast parallel Montgomery multiplication," in Advances in Cryptology - EuroCrypt '00, ser. Lecture Notes in Computer Science, B. Preneel, Ed., vol. 1807. Berlin: Springer-Verlag, 2000, pp. 523-538.
TABLE II
TYPES OF NEW INSTRUCTIONS FOR THE RNS IMPLEMENTATION OF RSA.

a. Uppercase operands are the user-defined 64-bit registers used for channel manipulations. Lowercase operands are either Xtensa 32-bit address registers or immediate operands.
b. A_L means the lower 32 bits of A. A_U means the upper 32 bits of A.
c. 'in A' means to modularly reduce by A's modulus and store the 32-bit result in the lower 32 bits of A.
[9] H. Nozaki, M. Motoyama, A. Shimbo, and S. Kawamura, "Implementation of RSA algorithm based on RNS Montgomery multiplication," in CHES '01: Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems. London, UK: Springer-Verlag, 2001, pp. 364-376.
[10] R. Gonzalez, "Xtensa: a configurable and extensible processor," IEEE Micro, vol. 20, no. 2, pp. 60-70, 2000.
[11] Tensilica® Instruction Extension (TIE) Language Reference Manual, Tensilica, Inc., 3255-6 Scott Blvd., Santa Clara, CA 95054, February 2006.
[12] Xtensa® LX Microprocessor Data Book, Tensilica, Inc., 3255-6 Scott Blvd., Santa Clara, CA 95054, February 2006.
[13] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519-521, 1985.
[14] Ç. Koç, T. Acar, and B. S. Kaliski, Jr, "Analysing and comparing Montgomery multiplication algorithms," IEEE Micro, vol. 16, no. 2, pp. 26-33, 1996.
[15] J.-J. Quisquater and C. Couvreur, "Fast decipherment algorithm for RSA public-key cryptosystem," Electron. Lett., vol. 18, no. 21, pp. 905-907, 1982.
[16] A. Shenoy and R. Kumaresan, "Fast base extension using a redundant modulus in RNS," IEEE Trans. Comput., vol. 38, no. 2, pp. 292-296, 1989.
[17] N. Szabo and R. Tanaka, Residue Arithmetic and its Applications to Computer Technology. New York, NY: McGraw-Hill Inc., 1967.
[18] R. Crandall, "Method and apparatus for public key exchange in a cryptographic system," U.S. Patent 5 271 061, December 14, 1993.
[19] D. Knuth, The Art of Computer Programming: Seminumerical Algorithms, 2nd ed. Reading, MA: Addison-Wesley, 1981, vol. 2.