Residue Number Systems: Theory and Applications

P.V. Ananda Mohan
Preface
transforms such as the discrete cosine transform have been explored. The most interesting development has been the application of RNS in cryptography. Cryptographic algorithms used in authentication, which need big word lengths ranging from 1024 to 4096 bits for the RSA (Rivest-Shamir-Adleman) algorithm and from 160 to 256 bits for elliptic curve cryptography, have been realized using residue number systems. Several applications have been in the implementation of the Montgomery algorithm and of pairing protocols, which need thousands of modulo multiplication, addition, and reduction operations. Recent research has shown that RNS can be one of the preferred solutions for these applications, and thus it is necessary to include this topic in the study of RNS-based designs.
This book brings together various topics in the design and implementation of
RNS-based systems. It should be useful for the cryptographic research community,
researchers, and students in the areas of computer arithmetic and digital signal
processing. It can be used for self-study, and numerical examples have been
provided to assist understanding. It can also be prescribed for a one-semester course
in a graduate program.
The author wishes to thank Electronics Corporation of India Limited, Bangalore,
where a major part of this work was carried out, and the Centre for Development of
Advanced Computing, Bangalore, where some part was carried out, for providing
an outstanding R&D environment. He would like to express his gratitude to
Dr. Nelaturu Sarat Chandra Babu, Executive Director, CDAC Bangalore, for his
encouragement. The author also acknowledges Ramakrishna, Shiva Rama Kumar,
Sridevi, Srinivas, Mahathi, and his grandchildren Baby Manognyaa and Master
Abhinav for the warmth and cheer they have spread. The author wishes to thank
Danielle Walker, Associate Editor, Birkhäuser Science for arranging the reviews,
her patience in waiting for the final manuscript and assistance for launching the
book to production. Special thanks are also due to Agnes Felema. A and the production and graphics team at SPi Global for most efficiently typesetting, editing and readying the book for production.
Contents

1 Introduction ..... 1
References ..... 6
2 Modulo Addition and Subtraction ..... 9
2.1 Adders for General Moduli ..... 9
2.2 Modulo (2^n - 1) Adders ..... 12
2.3 Modulo (2^n + 1) Adders ..... 16
References ..... 24
3 Binary to Residue Conversion ..... 27
3.1 Binary to RNS Converters Using ROMs ..... 27
3.2 Binary to RNS Conversion Using Periodic Property of Residues of Powers of Two ..... 28
3.3 Forward Conversion Using Modular Exponentiation ..... 30
3.4 Forward Conversion for Multiple Moduli Using Shared Hardware ..... 32
3.5 Low and Chang Forward Conversion Technique for Arbitrary Moduli ..... 34
3.6 Forward Converters for Moduli of the Type (2^n - k) ..... 35
3.7 Scaled Residue Computation ..... 36
References ..... 37
4 Modulo Multiplication and Modulo Squaring ..... 39
4.1 Modulo Multipliers for General Moduli ..... 39
4.2 Multipliers mod (2^n - 1) ..... 44
4.3 Multipliers mod (2^n + 1) ..... 51
4.4 Modulo Squarers ..... 69
References ..... 77
5 RNS to Binary Conversion ..... 81
5.1 CRT-Based RNS to Binary Conversion ..... 81
5.2 Mixed Radix Conversion-Based RNS to Binary Conversion ..... 90
Index ..... 349
Chapter 1
Introduction
where 0 <= i < n and n is the number of digits. Note that M_j is the ratio between the weights for the j-th and ( j + 1)-th digit positions, and x mod y is the remainder obtained by dividing x by y. MRNS can represent a dynamic range of

M = prod_{j=0}^{n-1} M_j  (1.1b)
where b is the base of the logarithm, z when asserted indicates that X = 0, and s is the sign of X. In LNS, the input binary numbers are converted into logarithmic form with a mantissa and a characteristic, each of appropriate word length, to achieve the desired accuracy. As is well known, multiplication and division are quite simple in this system, needing only addition or subtraction of the converted inputs, whereas simple operations like addition and subtraction cannot be done easily. Thus, in applications where frequent additions or subtractions are not required, LNS may be of utility. The inverse mapping from LNS to linear numbers is given as
X = (1 - z) (-1)^s b^x  (1.3b)
The second term is obtained using an LUT whose size can be very large for n >= 20 [3, 6, 7]. Multiplication, division, exponentiation and finding the nth root are very simple. After the processing, the results need to be converted into the binary number system.
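Purely as an illustration (not from the book), the sign-logarithm encoding, multiplication by exponent addition, and the inverse mapping (1.3b) can be sketched in Python. All function names here are hypothetical and base b = 2 is assumed:

```python
import math

def lns_encode(x, base=2.0):
    # encode x as (z, s, e): zero flag z, sign s, exponent e = log_b |x|
    if x == 0:
        return (1, 0, 0.0)
    return (0, 0 if x > 0 else 1, math.log(abs(x), base))

def lns_mul(p, q):
    # multiplication in LNS: OR the zero flags, XOR the signs, add the exponents
    return (p[0] | q[0], p[1] ^ q[1], p[2] + q[2])

def lns_decode(v, base=2.0):
    # inverse mapping (1.3b): X = (1 - z) * (-1)^s * b^e
    z, s, e = v
    return (1 - z) * ((-1) ** s) * base ** e
```

For example, multiplying -3 by 4 reduces to one addition of exponents, and decoding recovers -12 (up to floating-point rounding).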
The logarithmic number system can be seen as a special case of a floating-point system in which the significand (mantissa) is always 1. Hence the exponent can be a mixed number rather than an integer. Numbers with the same exponent are equally spaced in floating-point, whereas in the sign-logarithm system smaller numbers are denser [3].
LNS reduces the strength of certain arithmetic operations and the bit activity
[5, 8, 9]. The reduction of strength reduces the switching capacitance. The change
of base from 2 to a lesser value reduces the probability of a transition from low to
high. It has been found that about two times reduction in power dissipation is
possible for operations with word size 8–14 bits.
The other system that has been considered is the Residue Number System (RNS) [10-12], which has received considerable attention in the past few decades. We consider this topic in great detail in the next few chapters. We, however, present here a historical review of this area. The origin is attributed to the third century Chinese author Sun
Tzu (also attributed to Sun Tsu in the first century AD) in the book Suan-Ching. We
reproduce the poem [11]:
We have things of which we do not know the number
If we count them by threes, the remainder is 2
If we count them by fives, the remainder is 3
If we count them by sevens, the remainder is 2
How many things are there?
The answer, 23.
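The puzzle is precisely a CRT reconstruction over the moduli {3, 5, 7}; a minimal Python sketch (hypothetical function name; it relies on the three-argument pow for modular inverses, available since Python 3.8):

```python
def crt(residues, moduli):
    # Chinese Remainder Theorem: X = sum(r_i * M_i * (M_i^{-1} mod m_i)) mod M,
    # where M is the product of the pairwise coprime moduli and M_i = M / m_i.
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # modular inverse of M_i mod m_i
    return x % M
```

Here crt((2, 3, 2), (3, 5, 7)) returns 23, reproducing Sun Tzu's answer.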
Sun Tzu in the first century AD, the Greek mathematician Nichomachus, and Hsin-Tai-Wei of the Ming Dynasty (1368–1643 AD) were the first to explore residue number systems. Sun Tzu presented the formula for computing the answer, which later came to be known as the Chinese Remainder Theorem (CRT). This is described by Gauss in his book Disquisitiones Arithmeticae [12].
Interestingly, Aryabhata, an Indian mathematician of the fifth century AD, described a technique for finding the number corresponding to two given residues for two moduli. This was named the Aryabhata Remainder Theorem [13-16] and is known by the Sanskrit name Saagra-kuttaakaara (residual pulveriser); it is the well-known Mixed Radix conversion for a two-moduli RNS. Extension to moduli sets with common factors has recently been described [17].
In an RNS using mutually prime integers m1, m2, m3, . . ., mj as moduli, the dynamic range M is the product of the moduli, M = m1 m2 m3 . . . mj. The numbers between 0 and M - 1 can be uniquely represented by the residues. Alternatively, numbers between -M/2 and M/2 - 1 when M is even, and between -(M - 1)/2 and (M - 1)/2 when M is odd, can be represented. A large number can thus be represented by several smaller
numbers called residues obtained as the remainders when the given number is
divided by the moduli. Thus, instead of big word length operations, we can perform
several small word length operations on these residues. The modulo addition,
modulo subtraction and modulo multiplication operations can thus be performed
quite efficiently.
As an illustration, using the moduli set {3, 5, 7}, any number between 0 and 104
can be uniquely represented by the residues. The number 52 corresponds to the
residue set (1, 2, 3) in this moduli set. The residue is the remainder obtained by the division operation X/mi. Evidently, the residues ri are such that 0 <= ri <= (mi - 1).
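The forward mapping is simply a remainder per modulus; a one-line Python sketch (hypothetical function name):

```python
def to_rns(x, moduli):
    # residue representation: r_i = X mod m_i, so 0 <= r_i <= m_i - 1
    return tuple(x % m for m in moduli)
```

For example, to_rns(52, (3, 5, 7)) returns (1, 2, 3), matching the residue set above.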
The front-end of an RNS-based processor (see Figure 1.1) is a binary to RNS converter, known as a forward converter, whose k output words corresponding to the k moduli are processed by k parallel processors in the residue processor blocks to yield k output words. The last stage in the RNS-based processor converts
these k words to a conventional binary number. This process known as reverse
conversion is very important and needs to be hardware-efficient and time-efficient,
since it may be often needed also to perform functions such as comparison, sign
detection and scaling. The various RNS processors need smaller word lengths and hence addition and multiplication can be done faster. Of course,
these are all modulo operations. The modulo processors do not have any
inter-dependency, and hence high speed can be achieved for performing operations such as convolution, FIR filtering, and IIR filtering (not needing in-between scaling). Division or scaling by an arbitrary number, sign detection, and comparison are, of course, time-consuming in residue number systems.
Each MRS digit or RNS modulus can be represented in several ways: binary (ceil(log2 Mj) wires with binary logic), index (ceil(log2 Mj) wires with binary logic), one-hot (Mj wires with two-valued logic) [18] and Mj-ary (one wire with multi-valued logic). Binary representation is most compact in storage, but one-hot coding allows
faster logic and lower power consumption. In addition to electronics, optical and
quantum RNS implementations have been suggested [19, 20].
The first two books on residue number systems appeared in 1967 [21, 22]. Several attempts have been made to build digital computers and other hardware using residue number systems. Fundamental work on topics like error correction was performed in the early seventies. However, there was renewed interest in applying RNS to DSP applications in 1977. An IEEE Press collection of papers [23] focused on this area in 1986, documenting key papers. There was a resurgence in 1988 regarding the use of special moduli sets. Since then research interest has increased, and a book appeared in 2002 [24] and another in 2007 [25]. Several topics have been addressed, such as binary to residue conversion, residue to binary conversion, scaling, sign detection, modulo multiplication, overflow detection, and basic operations such as addition. For four decades, designers have been exploring the use of RNS in various applications in communication systems and digital signal processing, with emphasis on low power, low area and programmability. Special RNS such as Quadratic RNS and Polynomial RNS have been studied with a view to reducing computational requirements in filtering.
More recently, the power of RNS has been explored to solve problems in cryptography involving very large integers, with bit lengths varying from 160 to 4096 bits. Attempts have also been made to combine RNS with the logarithmic number system, known as the Logarithmic RNS.
The organization of the book is as follows. In Chapter 2, the topic of modulo addition and subtraction is considered for general moduli as well as powers-of-two-related moduli. Several advances made in designing hardware using diminished-1 arithmetic are discussed. The topic of forward conversion is considered in detail in Chapter 3 for general as well as special moduli. These techniques use several interesting properties of the residues of powers of two of the moduli. New techniques for sharing hardware among multiple moduli are also considered. In Chapter 4, modulo multiplication and modulo squaring, with and without Booth recoding, are described for general moduli as well as moduli of the type 2^n - 1 and especially 2^n + 1. Both the
diminished-1 and normal representations are considered for the design of multipliers mod (2^n + 1). Multi-modulus architectures are also considered to share the hardware amongst various moduli. In Chapter 5, the well-investigated topic of reverse conversion for three, four, five and more moduli is considered. Several recently described techniques using the core function, quotient function, Mixed-Radix CRT, New CRTs, and diagonal function are considered in addition to the well-known Mixed Radix Conversion and CRT. Area and time requirements are highlighted to serve as benchmarks for evaluating future designs. In Chapter 6, the
important topics of scaling, base extension, magnitude comparison and sign detec-
tion are considered. The use of core function for scaling is also described.
In Chapter 7, we consider specialized residue number systems such as the Quadratic Residue Number System (QRNS) and its variations. Polynomial residue number systems and logarithmic residue number systems are also considered.
The topic of error detection, correction and fault tolerance has been discussed in
Chapter 8. In Chapter 9, we deal with applications of RNS to FIR and IIR Filter
design, communication systems, frequency synthesis, DFT and 1-D and 2-D DCT
in detail. This chapter highlights the tremendous attention paid by researchers to
numerous applications including CDMA, Frequency hopping, etc. Fault tolerance
techniques applicable for FIR filters are also described. In Chapter 10, we cover
extensively applications of RNS in cryptography perhaps for the first time in any
book. Modulo multiplication and exponentiation using various techniques, modulo
reduction techniques, multiplication of large operands, application to ECC and
pairing protocols are covered extensively. Extensive bibliography and examples
are provided in each chapter.
References
1. M.G. Arnold, The residue logarithmic number system: Theory and application, in Proceedings
of the 17th IEEE Symposium on Computer Arithmetic (ARITH), Cape Cod, 27–29 June 2005,
pp. 196–205
2. E.C. Ifeachor, B.W. Jervis, Digital Signal Processing: A Practical Approach, 2nd edn.
(Pearson Education, Harlow, 2003)
The modulo addition of two operands A and B can be implemented using the architectures of Figure 2.1a, b [1, 2]. Essentially, first A + B is computed and then m is subtracted from the result to find whether the result is larger than m or not. (Note that TC stands for two's complement.) Then, using a 2:1 multiplexer, either (A + B) or (A + B - m) is selected. Thus, the computation time is that of one n-bit addition, one (n + 1)-bit addition and the delay of a multiplexer. On the other hand, in the architecture of Figure 2.1b, both (A + B) and (A + B - m) are computed in parallel and one of the outputs is selected using a 2:1 multiplexer depending on the sign of (A + B - m). Note that a carry-save adder (CSA) stage is needed for computing (A + B - m), which is followed by a carry propagate adder (CPA). Thus, the area is more than that of Figure 2.1a, but the addition time is less. The area A and computation time Δ for both techniques, for n-bit operands and assuming that a CPA is used, are
[Figure 2.1: (a) cascaded and (b) parallel modulo-m addition architectures; both compute (A + B) and (A + B - m), using the TC of m, and select (A + B) mod m with a 2:1 MUX.]
[Figure 2.2: modulo adder built from sum-and-carry (SAC), carry propagate-generate (CPG), CLA-for-Cout, MUX and CLA summation (CLAS) units, producing the n-bit output R.]
A_cascade = (2n + 1) A_FA + n A_2:1MUX + n A_INV,  Δ_cascade = (2n + 1) Δ_FA + Δ_2:1MUX + Δ_INV
A_parallel = (3n + 2) A_FA + n A_2:1MUX + n A_INV,  Δ_parallel = (n + 2) Δ_FA + Δ_2:1MUX + Δ_INV
(2.1)
where Δ_FA, Δ_2:1MUX, and Δ_INV are the delays and A_FA, A_2:1MUX and A_INV are the areas of a full-adder, a 2:1 multiplexer and an inverter, respectively. On the other hand, by using VLSI adders with a regular layout, e.g. the Brent-Kung adder [3], the area and delay requirements will be as follows:

(2.2)
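The cascaded and parallel schemes described above differ only in when (A + B - m) is computed; the selection itself is the same. A Python sketch of that selection (hypothetical function name; operands assumed already reduced mod m):

```python
def mod_add(a, b, m):
    # Compute A + B and A + B - m (in the parallel architecture these are
    # produced concurrently by a CSA + CPA pair); the sign of A + B - m
    # drives the 2:1 MUX that picks the final (A + B) mod m.
    s = a + b
    t = s - m
    return t if t >= 0 else s
```

For operands in [0, m - 1] the sum is below 2m, so a single conditional subtraction suffices.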
Thus, half-adder-like cells which give both outputs are used. Note that s_i, c_{i+1}, ŝ_i, ĉ_{i+1} serve as inputs to the carry propagate and generate unit, which has outputs P_i, G_i, p_i, g_i corresponding to the two cases. Based on the computation of c_out using a CLA, a multiplexer is used to select one of these pairs to compute all the carries and the final sum. The block diagram of this adder is shown in Figure 2.2, where SAC is the sum and carry unit, CPG is the carry propagate-generate unit, and CLA is the carry look-ahead unit for computing C_out. Then, using a MUX, either P, G or p, g are selected to be added using the CLA summation unit (CLAS). The CLAS unit computes all the carries and performs the summation P_i ⊕ c_i to produce the output R. This design leads to lower area and delay than the designs in Refs. [1, 5].
Adders for moduli (2^n - 1) and (2^n + 1) have received considerable attention in the literature and will be considered next.
2.2 Modulo (2^n - 1) Adders

Efstathiou, Nikolos and Kalamatianos [7] have described a mod (2^n - 1) adder. In this design, the carry that results from addition assuming a zero carry input is taken into account in reformulating the equations to compute the sum. Consider a mod 7 adder with inputs A and B. With the usual definition of generate and propagate signals, it can be easily seen that for a conventional adder we have
c_0 = G_0 + P_0 c_{-1}  (2.3a)
c_1 = G_1 + P_1 c_0  (2.3b)
c_2 = G_2 + P_2 G_1 + P_2 P_1 G_0  (2.3c)

Substituting c_{-1} in (2.3a) with c_2, due to the end-around carry operation of a mod (2^n - 1) adder, we have

c_0 = G_0 + P_0 G_2 + P_0 P_2 G_1 + P_0 P_2 P_1 G_0 = G_0 + P_0 G_2 + P_0 P_2 G_1  (2.4)
c_1 = G_1 + P_1 G_0 + P_1 P_0 G_2  (2.5a)
c_2 = G_2 + P_2 G_1 + P_2 P_1 G_0  (2.5b)
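The end-around-carry idea underlying (2.4)-(2.5b) can be checked in software; a Python sketch of mod (2^n - 1) addition (hypothetical function name; like the hardware, it keeps the double representation of zero, returning the all-ones word when the true result is zero but the operands are nonzero):

```python
def mod_2n_minus_1_add(a, b, n):
    # one's-complement style addition: wrap the carry-out back into bit 0
    mask = (1 << n) - 1           # the modulus 2^n - 1 as a bit mask
    s = a + b
    s = (s & mask) + (s >> n)     # end-around carry, first pass
    return (s & mask) + (s >> n)  # second wrap covers the corner case
```

For n = 3: 5 + 6 = 11 gives 4 (= 11 mod 7), while 1 + 6 gives the all-ones word 7, the alternate representation of zero.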
Figure 2.3 Mod 7 adder (a) with double representation of zero and (b) with single representation of zero (adapted from [7] ©IEEE 1994)
s_i = P_i ⊕ c_{i-1} for 0 <= i <= n - 1.  (2.6)
The architectures of Figure 2.3, although elegant, lack regularity. Instead of using a single-level CLA, multiple levels can also be used when the operands are large.
Another approach is to consider the carry propagation in binary addition as a prefix problem. Various types of parallel-prefix adders, e.g. (a) Ladner-Fischer [8], (b) Kogge-Stone [9], (c) Brent-Kung [3] and (d) Knowles [10], are available in the literature. Among these, type (a) requires less area but has unlimited fan-out compared to type (b), while designs based on (b) are faster.
Zimmermann [11] has suggested using an additional level for adding the end-around carry for realizing a mod (2^n - 1) adder (see Figure 2.4a), which needs extra hardware; moreover, this carry has a large fan-out, making the adder slower.
Kalampoukas et al. [12] have considered modulo (2^n - 1) adders using parallel-prefix adders. The idea of carry recirculation at each prefix level, as shown in Figure 2.4b, has been employed. Here, no extra level of adders is required, thus giving minimum logic depth. In addition, the fan-out requirement of the carry output is also removed. These architectures are very fast but consume large area.
The area and delay requirements of adders can be estimated using the unit-gate model [13]. In this model, each gate counts as one unit, except the exclusive-OR gate, which counts as two elementary gates. The model, however, ignores fan-in and fan-out. Hence, validation needs to be carried out using static simulations. The area and delay requirements of the mod (2^n - 1) adder described in [12] are 3n log n + 4n and 2 log n + 3, respectively, under this model.
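The quoted estimates can be tabulated directly; a Python sketch (hypothetical function name; a base-2 logarithm is assumed, as is usual for unit-gate estimates of prefix adders):

```python
import math

def unit_gate_cost_mod_2n_minus_1(n):
    # Unit-gate area and delay quoted in the text for the parallel-prefix
    # modulo (2^n - 1) adder of Kalampoukas et al. [12]
    area = 3 * n * math.log2(n) + 4 * n
    delay = 2 * math.log2(n) + 3
    return area, delay
```

For n = 8 this gives an area of 104 gate equivalents and a delay of 9 gate delays.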
Efstathiou et al. [14] have also considered design using select-prefix blocks, with the difference that the adder is divided into several small-length adder blocks by proper interconnection of the propagate and generate signals of the blocks. A select-prefix architecture for a mod (2^n - 1) adder is presented in Figure 2.5. Note that d, f and g indicate the word lengths of the three sections. It can be seen that

where BGi and BPi are the block generate and propagate signal outputs of each block.
Tyagi [13] has given an algorithm for selecting the lengths of the various adder blocks with the aim of minimizing adder delay. Note that designs based on parallel-prefix adders are fastest but are more complex. On the other hand, the CLA-based adder architecture is area-effective. Select-prefix architectures achieve delay close to parallel-prefix adders with complexity close to the best adders.
Patel et al. [15] have suggested fast parallel-prefix architectures for modulo (2^n - 1) addition with a single representation of zero. In these, the sum is computed with a carry-in of "1". Later, a conditional decrement operation is
Figure 2.4 Modulo (2^n - 1) adder architectures due to (a) Zimmermann and (b) modulo (2^8 - 1) adder due to Kalampoukas et al. ((a) adapted from [11] ©IEEE 1999 and (b) adapted from [12] ©IEEE 2000)
performed. However, by cyclically feeding back the carry generate and carry
propagate signals at each prefix level in the adder, the authors show that
significant improvement in latency is possible over existing designs.
Figure 2.5 Modulo (2^{d+f+g} - 1) adder design using three blocks (adapted from [14] ©IEEE 2003)
2.3 Modulo (2^n + 1) Adders

Next,

d(2^k A) = d(A + A + . . . + A) = (2^k d(A) + 2^k - 1) mod (2^n + 1)

or

2^k d(A) = (d(2^k A) - 2^k + 1) mod (2^n + 1)  (2.10)
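Writing d(X) = X - 1 for the diminished-1 value of X, identity (2.10) can be verified exhaustively for small n (hypothetical function name):

```python
def check_eq_2_10(n, k):
    # verify 2^k * d(A) == (d(2^k * A mod m) - 2^k + 1) mod m
    # for all nonzero A, where m = 2^n + 1 and d(X) = X - 1
    m = (1 << n) + 1
    for A in range(1, m):
        lhs = ((1 << k) * (A - 1)) % m          # 2^k * d(A)
        rhs = (((A << k) % m) - 1 - (1 << k) + 1) % m
        if lhs != rhs:
            return False
    return True
```

This is the property that lets a diminished-1 operand be multiplied by a power of two (a rotation-like operation) without leaving the diminished-1 domain.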
These designs need less area than designs using parallel-prefix adders but are slower than CLA-based designs.
Efstathiou, Vergos and Nikolos [19] have described fast parallel-prefix modulo (2^n + 1) adders for two (n + 1)-bit numbers which use two stages. The first stage computes |X + Y + 2^n - 1| mod 2^{n+1}, which has (n + 2) bits. If the MSB of the result is zero, then 2^n + 1 is added mod 2^{n+1} and the n LSBs yield the result. For computing M = X + Y + 2^n - 1, a CSA is used followed by an (n + 1)-bit adder. The authors use a parallel-prefix with fast carry increment (PPFCI) architecture and also a totally
Figure 2.7 Diminished-1 modulo (2^{d+f} + 1) adder using two blocks (adapted from [14] ©IEEE 2004)
Note that, in this case, the added bit zi is always 1 in all bit positions.
Vergos and Efstathiou [20] proposed an adder that caters for both weighted and diminished-1 operands. They point out that a diminished-1 adder can be used to realize a weighted adder by adding a front-end inverted EAC CSA stage. Herein, A + B is computed, where A and B are (n + 1)-bit numbers, using a diminished-1 adder. In this design, the computation carried out is
|A + B| mod (2^n + 1) = (|A_n + B_n + D + 1| mod (2^n + 1) + 1) mod (2^n + 1) = |Y + U + 1| mod (2^n + 1)  (2.14)

where Y and U are the sum and carry vector outputs of a CSA stage computing A_n + B_n + D, with D = 2^n - 4 + 2 c̄_{n+1} + s̄_n (the bar denoting complement). Note that A_n, B_n are the words formed by the n LSBs of A and B, respectively, and s_n, c_{n+1} are the sum and carry of the addition of the 1-bit words a_n and b_n. It may be seen that D is the n-bit vector 111. . .1 c̄_{n+1} s̄_n.
An example will be illustrative. Consider n = 4 and the addition of A = 16 and B = 11. Evidently a_n = 1, b_n = 0, A_n = 0 and B_n = 11, and D = 01110, yielding (16 + 11) mod 17 = ((0 + 11 + 14 + 1) mod 17 + 1) mod 17 = 10. Note that the periodic property of residues mod (2^n + 1) is used. The sum of the n-th bits is complemented and added to get D, and a correction term is added to take the mod (2^n + 1) operation into account.
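The example generalizes; a Python sketch of this weighted mod (2^n + 1) addition (hypothetical function name; Python's % operator stands in for the CSA and diminished-1 adder hardware, and complements are realized as 1 - x):

```python
def weighted_mod_add(A, B, n):
    # weighted mod (2^n + 1) addition of (n+1)-bit operands per (2.14)
    an, bn = A >> n, B >> n                           # MSBs a_n, b_n
    An, Bn = A & ((1 << n) - 1), B & ((1 << n) - 1)   # n-bit LSB words
    sn = (an + bn) & 1                                # sum bit of a_n + b_n
    cn1 = (an + bn) >> 1                              # carry bit c_{n+1}
    # D = 2^n - 4 + 2*(NOT c_{n+1}) + (NOT s_n)
    D = (1 << n) - 4 + 2 * (1 - cn1) + (1 - sn)
    m = (1 << n) + 1
    return ((An + Bn + D + 1) % m + 1) % m
```

For n = 4 this reproduces the worked example: weighted_mod_add(16, 11, 4) returns 10.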
The mod (2^n + 1) adder for weighted representation thus needs a diminished-1 adder and an inverted end-around-carry CSA stage. The full adders of this CSA stage perform the (A_n + B_n + D) mod (2^n + 1) addition. Some of the FAs have one input fixed at "1" and can thus be simplified. The outputs Y and U of this stage are fed to a diminished-1 adder to obtain (Y + U + 1) mod 2^n. The architecture is presented in Figure 2.8. It can be seen that any diminished-1 adder can be used to perform weighted binary addition by adding an inverted EAC CSA stage at the front-end.
Figure 2.8 Modulo (2^n + 1) adder for weighted operands built using a diminished-1 adder of any architecture (adapted from [20] ©IEEE 2008)
In another technique, due to Vergos and Bakalis [21], first A* and B* are computed such that A* + B* = A + B - 1, using a translator. Then, a diminished-1 adder can sum A* and B* such that

|A + B| mod (2^n + 1) = |A* + B*| mod 2^n + c̄_out  (2.15)

where c_out is the carry of the n-bit adder computing A* + B*. However, Vergos and Bakalis do not present the details of obtaining A* and B* using the translator. Note that in this method the inputs are <= (2^n - 1).
Lin and Sheu [22] have suggested the use of two parallel adders to find A* + B* and A* + B* + 1, so that the carry of the former adder can be used to select the correct result using a multiplexer. Note that Lin and Sheu [22] have also suggested partitioning the n-bit circular carry selection (CCS) modular adder into m r-bit blocks, similar to the select-prefix type of design considered earlier. These need circular carry selection addition blocks and circular carry generators.
Juang et al. [23] have given a corrected version of this type of mod (2^n + 1) adder, shown in Figure 2.9a, b. Note that this design uses a dual-sum carry look-ahead adder (DS-CLA). These designs are the most efficient among all the mod (2^n + 1) adders regarding area, time and power.
Juang et al. [24] have suggested considering (n + 1) bits for the inputs A and B. The weighted modulo (2^n + 1) sum of A and B can be expressed as

|A + B| mod (2^n + 1) = |A + B - (2^n + 1)| mod 2^n if (A + B) > 2^n
                      = |A + B - (2^n + 1)| mod 2^n + 1 otherwise  (2.16)
Thus, weighted modulo (2^n + 1) addition can be obtained by subtracting (2^n + 1) from the sum of A and B and using a diminished-1 adder to get the final modulo sum by making the inverted EAC the carry-in.
Denoting Y′ and U′ as the carry and sum vectors of the summation A + B - (2^n + 1), where A and B are (n + 1)-bit words, we have

|A + B - (2^n + 1)| mod 2^n = |Σ_{i=0}^{n-2} 2^i (2y′_i + u′_i) + 2^{n-1} (2a_n + 2b_n + a_{n-1} + b_{n-1} + 1)| mod 2^n  (2.17)

where

y′_i = a_i ∨ b_i,  u′_i = NOT(a_i ⊕ b_i)

and for A = 6, B = 7,
Figure 2.9 (a) Block diagram of the CCS diminished-1 modulo (2^n + 1) adder and (b) logic circuit of the CCS diminished-1 modulo (2^4 + 1) adder ((a) adapted from [22] ©IEEE 2008, (b) adapted from [23] ©IEEE 2009)
The multiplier of 2^{n-1} in (2.17) can be at most 5, since 0 <= A, B <= 2^n. Since only bits n and n - 1 are available, the authors consider the (n + 1)-th bit to merge with c_out:

|A + B| mod (2^n + 1) = |A + B - (2^n + 1)| mod 2^n = |Y′ + U′| mod 2^n + (c̄_out ∨ FIX)  (2.18)

where y′_{n-1} = a_n ∨ b_n ∨ a_{n-1} ∨ b_{n-1}, u′_{n-1} = NOT(a_{n-1} ⊕ b_{n-1}) and FIX = a_n b_n ∨ a_{n-1} b_n ∨ a_n b_{n-1}. Note that y′_{n-1} and u′_{n-1} are the values of the carry bit and sum bit produced by the addition 2a_n + 2b_n + a_{n-1} + b_{n-1} + 1. The block diagram is presented in Figure 2.10a together with the translator in Figure 2.10b. Note that the FAF block generates y′_{n-1}, u′_{n-1} and the FA blocks generate y′_i, u′_i for i = 0, 1, . . ., n - 2
Figure 2.10 (a) Architecture of the weighted modulo (2^n + 1) adder with the correction scheme and (b) translator for A + B - (2^n + 1) (adapted from [24] ©IEEE 2010)
where y′_i = a_i ∨ b_i and u′_i = NOT(a_i ⊕ b_i). Note also that FIX is wire-ORed with the carry c_out to yield the inverted EAC as the carry-in. The FIX bit is needed since a value greater than 3 cannot be accommodated in y′_{n-1} and u′_{n-1}.
The authors have used Sklansky [25] and Brent-Kung [3] parallel-prefix adders for the diminished-1 adder.
References
1. M.A. Bayoumi, G.A. Jullien, W.C. Miller, A VLSI implementation of residue adders. IEEE
Trans. Circuits Syst. 34, 284–288 (1987)
2. M. Dugdale, VLSI implementation of residue adders based on binary adders. IEEE Trans.
Circuits Syst. 39, 325–329 (1992)
3. R.P. Brent, H.T. Kung, A regular layout for parallel adders. IEEE Trans. Comput. 31, 260–264
(1982)
4. G. Alia, E. Martinelli, Designing multi-operand modular adders. Electron. Lett. 32, 22–23
(1996)
5. K.M. Elleithy, M.A. Bayoumi, A θ(1) algorithm for modulo addition. IEEE Trans. Circuits
Syst. 37, 628–631 (1990)
6. A.A. Hiasat, High-speed and reduced area modular adder structures for RNS. IEEE Trans.
Comput. 51, 84–89 (2002)
7. C. Efstathiou, D. Nikolos, J. Kalamatianos, Area-time efficient modulo (2^n - 1) adder design. IEEE Trans. Circuits Syst. 41, 463–467 (1994)
8. R.E. Ladner, M.J. Fischer, Parallel-prefix computation. JACM 27, 831–838 (1980)
9. P.M. Kogge, H.S. Stone, A parallel algorithm for efficient solution of a general class of
recurrence equations. IEEE Trans. Comput. 22, 783–791 (1973)
10. S. Knowles, A family of adders, in Proceedings of the 15th IEEE Symposium on Computer
Arithmetic, Vail, 11 June 2001–13 June 2001. pp. 277–281
11. R. Zimmermann, Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication, in Proceedings of the IEEE Symposium on Computer Arithmetic, Adelaide, 14–16 April 1999, pp. 158–167
12. L. Kalampoukas, D. Nikolos, C. Efstathiou, H.T. Vergos, J. Kalamatianos, High speed parallel prefix modulo (2^n - 1) adders. IEEE Trans. Comput. 49, 673–680 (2000)
13. A. Tyagi, A reduced area scheme for carry-select adders. IEEE Trans. Comput. 42, 1163–1170
(1993)
14. C. Efstathiou, H.T. Vergos, D. Nikolos, Modulo (2^n - 1) adder design using select-prefix blocks. IEEE Trans. Comput. 52, 1399–1406 (2003)
15. R.A. Patel, S. Boussakta, Fast parallel-prefix architectures for modulo (2^n - 1) addition with a single representation of zero. IEEE Trans. Comput. 56, 1484–1492 (2007)
16. L.M. Liebowitz, A simplified binary arithmetic for the fermat number transform. IEEE Trans.
ASSP 24, 356–359 (1976)
17. Z. Wang, G.A. Jullien, W.C. Miller, An efficient tree architecture for modulo (2^n + 1) multiplication. J. VLSI Sig. Proc. Syst. 14(3), 241–248 (1996)
18. H.T. Vergos, C. Efstathiou, D. Nikolos, Diminished-1 modulo (2^n + 1) adder design. IEEE Trans. Comput. 51, 1389–1399 (2002)
19. C. Efstathiou, H.T. Vergos, D. Nikolos, Fast parallel-prefix modulo (2^n + 1) adders. IEEE Trans. Comput. 53, 1211–1216 (2004)
20. H.T. Vergos, C. Efstathiou, A unifying approach for weighted and diminished-1 modulo (2^n + 1) addition. IEEE Trans. Circuits Syst. II Exp. Briefs 55, 1041–1045 (2008)
21. H.T. Vergos, D. Bakalis, On the use of diminished-1 adders for weighted modulo (2^n + 1) arithmetic components, in Proceedings of the 11th EuroMicro Conference on Digital System Design Architectures, Methods and Tools, Parma, 3–5 Sept. 2008, pp. 752–759
22. S.H. Lin, M.H. Sheu, VLSI design of diminished-one modulo (2n + 1) adders using circular
carry selection. IEEE Trans. Circuits Syst. 55, 897–901 (2008)
23. T.B. Juang, M.Y. Tsai, C.C. Chin, Corrections to VLSI design of diminished-one modulo
(2n + 1) adders using circular carry selection. IEEE Trans. Circuits Syst. 56, 260–261
(2009)
24. T.-B. Juang, C.-C. Chiu, M.-Y. Tsai, Improved area-efficient weighted modulo 2n + 1 adder
design with simple correction schemes. IEEE Trans. Circuits Syst. II Exp. Briefs 57, 198–202
(2010)
25. J. Sklansky, Conditional sum addition logic. IEEE Trans. Comput. EC-9, 226–231 (1960)
Chapter 3
Binary to Residue Conversion
The given binary number needs to be converted to RNS. In this chapter, various
techniques described in the literature for this purpose are reviewed. A straightforward
method is to use a divider for each modulus, keeping the remainder and ignoring the
quotient. But, as is well known, division is a complicated process [1]. As
such, alternative techniques for obtaining the residues more easily have been investigated.
Jenkins and Leon [2] have suggested reading sequentially the residues mod mi
corresponding to all the input bytes from PROM and performing mod mi addition.
Stouraitis [3] has suggested reading the residues corresponding to the various bytes of the
input word in parallel from ROM and adding them in a tree of mod mi adders.
Alia and Martinelli [4] have suggested forward conversion of a given n-bit input
binary word using n/2 PEs (processing elements), each storing the residues
corresponding to 2^(2j) and 2^(2j+1) (i.e., the (2j)-th and (2j+1)-th bit positions) for j = 0, ..., n/2 - 1,
and adding these residues mod mi selectively wherever the corresponding bit is "1".
Next, the results of the n/2 PEs are added in a tree of modulo mi adders to obtain the
final residue.
Capocelli and Giancarlo [5] have suggested using t PEs, where t = ⌈n/log2 n⌉, each
computing the residue of a log2 n-bit word by adding the residues corresponding to
the various bits of this word, and then adding the residues obtained from the various PEs in
a tree of modulo mi adders containing h steps, where h = log2 t. Note, however, that
only the residue corresponding to the LSB position in each word is stored; the
residue corresponding to each of the next bit positions is computed online by
doubling the previous residue and reducing it mod mi using one subtractor
and one multiplexer. Thus, the ROM requirement is reduced to t locations.
More recent designs avoid the use of ROMs and use combinational logic to a
large extent. These are discussed in the next few sections.
We consider first an example of finding the residue of 892 mod 19. Expressing
892 in binary form, we have 11 0111 1100. (We can start with the fifth bit from the
right, since the four LSBs 1100 = 12 contribute 12 itself mod 19.) We know the residues of consecutive powers of
two mod 19 as 1, 2, 4, 8, 16, 13, 7, 14, 9, and 18. Thus, we can add the residues
wherever the input bit corresponding to a power of 2 is "1". This yields (4 + 8 + 16
+ 13 + 7 + 9 + 18) mod 19 = 18. Note that at each step, when a new residue
corresponding to a new power of 2 is added, modulo 19 reduction can be done to
avoid unnecessary growth of the result: (((4 + 8) mod 19 + 16) mod 19 + 13) mod
19, etc.
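The procedure just described can be sketched in Python (an illustrative sketch only; the function name is mine):

```python
# Accumulate the residues of powers of two wherever the input bit is 1,
# reducing mod m at each step to limit growth.
def bin_to_residue(x, m):
    residue, power = 0, 1 % m            # power holds 2^j mod m
    while x:
        if x & 1:
            residue = (residue + power) % m
        power = (2 * power) % m
        x >>= 1
    return residue

print(bin_to_residue(892, 19))   # 18, as in the worked example
```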
3.2 Binary to RNS Conversion Using Periodic Property of Residues of Powers of Two

Note, however, that certain simplifications can be made by noting the periodic
property of the residues of 2^k mod m [6-9]. Denoting by T the period of the modulus m,
i.e., 2^T ≡ 1 mod m, we have 2^(αT+i) ≡ 2^i mod m. All the α words (α can be
even or odd), each of T bits, in the given n-bit binary number, where α = ⌈n/T⌉, can
first be added in a carry-save adder (CSA) with EAC (end-around carry) to obtain
a T-bit word, for which the residue mod m can then be obtained using the procedure
described above. Note that T is also denoted as the "order" and can be m - 1 or less. As an
illustration, for m = 89, T = 11, and for m = 19, T = 18. Consider finding the residue
of 0001 0100 1110 1101 1011 0001 0100 1110 1101 1011 mod 19 = 89887166171
mod 19. The two 18-bit fields and the leftover 4-bit field (here T = 18) can be added with EAC to obtain
                  0001
01 0011 1011 0110 1100
01 0100 1110 1101 1011
----------------------
10 1000 1010 0100 1000
This corresponds to 166472. The residue of this 18-bit number can next be obtained
using the procedure presented earlier, by adding the residues of the various powers
of 2 mod 19. In short, the periodic property of 2^k mod m has been used to simplify
the computation.
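A sketch of this periodic-property folding (illustrative code; the loop plays the role of the CSA-with-EAC stage):

```python
# Fold the input into T-bit fields and add with end-around carry, where T is
# the period of 2 mod m; the result is congruent to x mod m.
def fold_with_eac(x, T):
    mask = (1 << T) - 1
    while x > mask:
        x = (x & mask) + (x >> T)        # EAC: overflow bits are added back
    return x

y = fold_with_eac(89887166171, 18)       # T = 18 for m = 19
print(y, y % 19)                         # 166472 13
```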
Another simplification is possible for moduli satisfying the property 2^((m-1)/2)
mod m = -1. Considering modulus 19 again, we observe that 2^9 ≡ -1 mod 19,
2^10 ≡ -2 mod 19, ..., 2^17 ≡ -9 mod 19, 2^18 ≡ 1 mod 19 and 2^19 ≡ 2 mod 19, etc. Thus,
the residues in the upper half of a period are opposite in sign to those in the lower
half of the period. This property can be used to reduce the CSA word length to
(m - 1)/2 bits. Denoting the successive half-period-length words W0, W1, W2, ..., Wα, where α
is odd (considered for illustration), we need to estimate
(Σ_{i=0..(α-1)/2} W_{2i} - Σ_{i=0..(α-1)/2} W_{2i+1}) mod m. Considering the same example considered
above, we first divide the given word into 9-bit fields starting from the LSB as follows:
W4 ¼ 0001
W3 ¼ 0 1001 1101
W2 ¼ 1 0110 1100
W1 ¼ 0 1010 0111
W0 ¼ 0 1101 1011
Thus, adding together the alternate fields in separate CSAs, i.e., adding W0, W2 and W4,
we get Se = 10 0100 1000, and adding W1 and W3 we have So = 1 0100 0100.
Subtracting So from Se, we have S = 0001 0000 0100. (Here the subtraction is performed
by adding the two's complement of So to Se.) Note that the word lengths of So and Se can be
more than T/2 bits, depending on the number of T/2-bit fields in the given binary
number. (Note also that So and Se can be retained in carry-save form.) The residue of
the resulting word can be found easily using another stage employing the periodic
property and a final mod m reduction as described earlier, yielding 13 for our example.
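The half-period variant can be sketched likewise (illustrative names):

```python
# With half-period h (2^h = -1 mod m), subtract the sum of the odd h-bit
# fields from the sum of the even ones; here h = 9 for m = 19.
def half_period_fold(x, h, m):
    se = so = 0
    i = 0
    while x:
        field = x & ((1 << h) - 1)
        if i % 2 == 0:
            se += field                  # even fields: weight +1
        else:
            so += field                  # odd fields: weight -1
        x >>= h
        i += 1
    return (se - so) % m

print(half_period_fold(89887166171, 9, 19))   # 13, as in the text
```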
It is observed [6-9] that the moduli should be chosen such that the period or half-period
is small compared to the dynamic range (in bits) of the complete RNS, in
order to take advantage of the periodic property of the residues.
Interestingly, for special moduli of the form 2^k - 1 and 2^k + 1, the second stage of
binary to RNS conversion of a smaller length word of either T or T/2 bits (see
Figure 3.1a and b) can be avoided altogether [6]. For moduli of the form 2^k - 1, the
input word can be divided into k-bit fields, all of which can be added in a CSA with
EAC to yield the final residue. On the other hand, for moduli of the form 2^k + 1, all
even k-bit fields can be added and all odd k-bit fields can be added to obtain Se and
Figure 3.1 Forward converters mod (2^k - 1) (a) and mod (2^k + 1) (b)
So, respectively, and one final adder gives (Se - So) mod (2^k + 1). As an illustration,
892 mod 15 = (0011 0111 1100)2 mod 15 = (3 + 7 + 12) mod 15 = 7, and 892 mod
17 = (3 - 7 + 12) mod 17 = 8.
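For the special moduli the fields can thus be combined directly (illustrative sketch, function names mine):

```python
def mod_2k_minus_1(x, k):                # add all k-bit fields
    m, s = (1 << k) - 1, 0
    while x:
        s += x & ((1 << k) - 1)
        x >>= k
    return s % m

def mod_2k_plus_1(x, k):                 # alternate the signs of the k-bit fields
    m, s, sign = (1 << k) + 1, 0, 1
    while x:
        s += sign * (x & ((1 << k) - 1))
        sign = -sign
        x >>= k
    return s % m

print(mod_2k_minus_1(892, 4), mod_2k_plus_1(892, 4))   # 7 8
```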
Pettenghi, Chaves and Sousa [10] have suggested, for moduli of the form 2^n ± k,
rewriting the weights (the residues of 2^j mod mi) so as to reduce the width of the final
adder in binary to RNS conversion below that needed in designs based on the period or half-period.
In this technique, some weights are taken as negative; the bits
corresponding to negative weights are complemented and a correction factor is
added, so that only positive weights need to be summed. The residues uj are thus allowed to take signed values in a restricted range around zero.
As an illustration, for modulus 37, the residues corresponding to a 20-bit
dynamic range are shown below for the full period, for the original and two modified cases.
Since the period is 18, the last two residues are rewritten as -1 and -2. The
total worst-case weighted sum (corresponding to all input bits being 1) is then 328, as
against 402 in the original case. In order to avoid negative weights, we can consider the
last two weights as 1 and 2, but complement the corresponding input bits and add a correction term of 34 (≡ -3 mod 37).
As an illustration, for the 20-bit input words 000...00, 000...01, 000...10 and
000...11, after complementing the last 2 bits we have 11, 10, 01 and 00;
adding the corresponding "positive" residues and the correction term
COR = 34, we obtain 0, 35, 36, and 34, which may be verified to be correct.
Design       0   1   2   3   4    5    6    7    8    9   10   11   12
Original     1   2   4   8  16   32   27   17   34   31   25   13   26
Modified 1   1   2   4   8  16   32   27   17   34   31   25   13   26
Modified 2   1   2   4   8  16   -5  -10   17   -3   -6  -12   13  -11

Design      13   14   15   16   17   18   19
Original    15   30   23    9   18   36   35
Modified 1  15   30   23    9   18   -1   -2
Modified 2  15   -7  -14    9   18   -1   -2
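A hedged sketch of the signed-weight rewriting behind the "Modified 2" row (variable names are mine):

```python
m = 37
signed = []
for j in range(20):
    r = pow(2, j, m)                              # unsigned weight of bit j
    signed.append(r if r <= m // 2 else r - m)    # e.g. 32 -> -5, 36 -> -1
print(signed[:8])                                 # [1, 2, 4, 8, 16, -5, -10, 17]

x = 892                                           # any 20-bit input
res = sum(w for j, w in enumerate(signed) if (x >> j) & 1) % m
print(res)                                        # 4 = 892 mod 37
```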
3.3 Forward Conversion Using Modular Exponentiation

Premkumar [11] and Premkumar, Ang and Lai [12] have described a technique for
forward conversion without using ROMs. They denote this technique "modular
exponentiation". Basically, in this technique, the various residues of powers of
2 (i.e. 2^x mod mi) are obtained using logic functions. This will be illustrated first
using an example. Consider finding 2^(s3 s2 s1 s0) mod 13, where the exponent is a
4-bit binary word. We can write this expression as

2^(s3 s2 s1 s0) mod 13 = 2^(8 s3 + 4 s2) · 4^s1 · 2^s0 mod 13 = 256^s3 · 16^s2 · 4^s1 · 2^s0 mod 13
= (255 s3 + 1)(15 s2 + 1) · 4^s1 · 2^s0 mod 13 = (3 s3 s2 + 8 s3 + 2 s2 + 1) · 4^s1 · 2^s0 mod 13
Next, for various values of s1 and s0, the bracketed term can be evaluated. As an illustration,
for s1 = 0, s0 = 0, 2^(s3 s2 s1 s0) mod 13 = (3 s3 s2 + 8 s3 + 2 s2 + 1) mod 13. Next, for
the four options of the bits s3 and s2, viz., 11, 10, 01 and 00, the value of 2^(s3 s2 s1 s0) mod 13
can be evaluated as 1, 9, 3 and 1, respectively. Thus, a logic function g0 can be used to
represent 2^(s3 s2 s1 s0) mod 13 for s1 = 0, s0 = 0, obtained from the bit values as

g0 = 8 s3 s2′ + 2 s3′ s2 + 1

where the prime denotes complementation. In a similar manner, the other functions
corresponding to s1 s0 = 01, 10 and 11 can be obtained, e.g.

g1 = 4(s2 ⊕ s3) + 2(s3′ + s2) + s3 s2′    (3.1a)
Note that the logic gates that are used to generate and combine the minterms in the
gi functions can be shared among the moduli. As an illustration, 2^11 mod 13 can be
obtained from g3 (since s1 = s0 = 1) by substituting s3 = 1, s2 = 0, giving 7. The architecture
consists of feeding in the exponent "i" for which 2^i mod 13 is needed. The
two LSBs of i, viz., x1 and x0, are used to select, using four 4:1
multiplexers, the output nibble of the residue corresponding to the function gj evaluated for
the given s3 and s2 bit values. Thus, for each power of 2, the residue is selected using the set of
multiplexers, and all these residues need to be added mod 13 in a tree of modulo
adders to get the final residue. Fully parallel or serial-parallel architectures
can be used to obtain area/time trade-offs. Premkumar, Ang and Lai [12] later
extended this technique to reduce the hardware by taking advantage of the
periodic properties of the moduli, so that a first stage yields a word of length
equal to the period of the modulus, after which the modular exponentiation-based
technique can be applied.
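As an illustrative sketch (names are mine, and the combined bracketed term is used in place of the individual gj gate functions), the generation of 2^i mod 13 from the exponent bits can be checked in Python:

```python
# Sketch: 2^(s3 s2 s1 s0) mod 13 from the exponent bits, using the bracketed
# term (3 s3 s2 + 8 s3 + 2 s2 + 1) derived in the text.
def pow2_mod13(s3, s2, s1, s0):
    base = (3 * s3 * s2 + 8 * s3 + 2 * s2 + 1) % 13   # handles the s3, s2 part
    return (base * 4**s1 * 2**s0) % 13                # scale by 4^s1 * 2^s0

# exhaustive check against direct exponentiation
for i in range(16):
    s3, s2, s1, s0 = (i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1
    assert pow2_mod13(s3, s2, s1, s0) == pow(2, i, 13)
print(pow2_mod13(1, 0, 1, 1))   # 7 = 2^11 mod 13, as in the text
```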
3.4 Forward Conversion for Multiple Moduli Using Shared Hardware

Forward converters for the moduli set {2^n - 1, 2^n, 2^n + 1} have been considered by several
authors. A common architecture for finding the residues mod (2^n - 1) and (2^n + 1) was
first advanced by Bi and Jones [13]. Given a 3n-bit binary word W = A·2^(2n) + B·2^n + C,
where A, B and C are n-bit words, we have already noted that W mod (2^n - 1) =
(A + B + C) mod (2^n - 1) and W mod (2^n + 1) = (A - B + C) mod (2^n + 1). Bi and
Jones suggest finding S = A + C first and then computing (S + B) mod (2^n - 1) or
(S - B) mod (2^n + 1) in a second stage. A third stage performs the modulo m1 or m3
reduction using the carry or borrow from the second stage. Thus, three n-bit adders
will be required for each of the residue generators for the moduli (2^n - 1) and (2^n + 1).
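The Bi and Jones decomposition can be sketched as follows (illustrative code, not the authors' hardware):

```python
# Sketch: residues of a 3n-bit W = A*2^(2n) + B*2^n + C for {2^n - 1, 2^n, 2^n + 1},
# using W mod (2^n - 1) = (A + B + C) mod (2^n - 1) and
#       W mod (2^n + 1) = (A - B + C) mod (2^n + 1).
def moduli_set_residues(W, n):
    mask = (1 << n) - 1
    C, B, A = W & mask, (W >> n) & mask, (W >> (2 * n)) & mask
    return ((A + B + C) % ((1 << n) - 1),
            C,                                 # W mod 2^n
            (A - B + C) % ((1 << n) + 1))

print(moduli_set_residues(29876, 5))   # (23, 20, 11) = 29876 mod (31, 32, 33)
```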
Pourbigharaz and Yassine [14] have suggested a shared architecture for computing
both the residues mod (2^n - 1) and mod (2^n + 1) for a large dynamic range
RNS. They sum the k even n-bit fields and the k odd n-bit fields separately using a
multi-operand CSA to obtain sum and carry vectors Se, So, Ce and Co of (n + β)
bits, where β = log2 k. Next, So and Co can be added to or subtracted from Se + Ce in a
two-level CSA. Next, the (n + β + 1)-bit carry and sum words can be partitioned into
LSB n-bit and MSB (β + 1)-bit words. The two can be added to obtain the result mod (2^n - 1), or the
MSB word can be subtracted from the LSB word to obtain the result mod (2^n + 1), using another
two-level CSA in parallel with a two-level carry-save subtractor. A final CLA/CPA
computes the final result. This method has been applied to the moduli set {2^n - 1,
2^n, 2^n + 1}. The delay is O(2n).
Pourbigharaz and Yassine [15] have suggested another three-level architecture
comprising a CSA stage, a CPA stage and a multiplexer to eliminate the modulo
operation. Since P = (A + B + C) in the case of the moduli set {2^n - 1, 2^n, 2^n + 1}
needs (n + 2) bits, denoting the two MSBs as pn+1 and pn, using these 2 bits, P, P + 1 or
P + 2 computed using three CPAs can be selected using a 3:1 multiplexer for
obtaining X mod (2^n - 1). For evaluating X mod (2^n + 1), a three-operand carry-save
subtractor is used to find P′ = (A - B + C) and, using the two MSBs pn+1 and pn,
P′, P′ - 1 or P′ + 1 is selected using a 3:1 multiplexer. Thus, the delay is reduced to
that of one n-bit adder (of O(n)).
Sheu et al. [16] have simplified the design of Pourbigharaz and Yassine [15]
slightly. In this design, A + C is computed using a carry-save half-adder (CSHA);
one CPA (CPA1) is used to add B to A + C, and one CSHA and one CPA (CPA2)
are used to subtract B from A + C. Using the two MSBs of the results of CPA1 and
CPA2, correction factors of 0, 1 or 2 are applied in the case of mod (2^n - 1)
and 0, 1 or -1 in the case of mod (2^n + 1). The correction logic can be designed
using XOR/OR and AND/NOT gates. The total hardware requirement is three n-bit
CSHAs, one n-bit CSA, one (n + 1)-bit CSA, (2n + 2) XOR, (3n + 1) AND, (n + 3)
OR and (2n + 3) NOT gates. The delay is, however, comparable to that of the Pourbigharaz
and Yassine design [15].
The concept of shared hardware has been extended by Piestrak [17] to several
moduli and by Skavantzos and Abdallah [18] to conjugate moduli (moduli pairs of the
form 2^a - 1, 2^a + 1). Piestrak has suggested that moduli having a common factor
among their periods or half-periods can take advantage of sharing. As an illustration,
consider the two moduli 7 and 73, whose periods are three and nine, respectively.
Consider forward conversion of a 32-bit input word. A first stage can take 9-bit
fields of the given 32-bit word and sum them using a CSA tree with end-around
carry to get a 9-bit word. Then the converter for modulus 7 can add the three 3-bit
fields using EAC and obtain the residue mod 7, whereas the converter for modulus
73 finds the residue of the 9-bit word mod 73. Hardware is thus saved
compared with using two separate converters for modulus 7 and modulus 73.
The technique can be extended to the case where the period of one modulus and the
half-period of another are the same. As an example, for the moduli 5 and 17, P(5) = HP(17) = 4, where
HP stands for half-period and P stands for period. Evidently, the first stage takes
8-bit fields of the input 32-bit word, since P(17) = 8, and using a CSA obtains 8-bit sum
and carry vectors. These are considered as two 4-bit fields each and are fed next to the mod
5 and mod 17 residue generators.
It is possible to combine generators for moduli with different half-periods, with the
LCM of the half-periods being one of them. Consider the moduli 3, 5, and 17, whose half-periods
are 1, 2 and 4, respectively. Considering a 32-bit input binary word, a first
stage computes the mod 255 value from four 8-bit fields by adding them in a CSA, yielding 8-bit
sum and carry vectors. Next, these vectors are fed to a mod 17 residue
generator and a mod 15 residue generator. The mod 15 residue generator in turn feeds
the mod 3 and mod 5 residue generators. Several full-adders can be saved by this
technique. For example, for the moduli 5, 7, 9, and 13, for forward conversion of a
32-bit input binary word using four separate residue generators, we need 114 full-adders,
whereas with shared hardware we need only 66 full-adders. The architecture
is presented in Figure 3.2 for illustration.
Figure 3.2 Shared forward converter: the 8-bit sum (S) and carry (C) vectors are split into 4-bit halves (SH, SL, CH, CL) feeding the mod 17, mod 5 and mod 3 residue generators
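A hedged sketch of this two-level shared conversion for the moduli 3, 5 and 17 (function names are mine):

```python
# First stage folds 8-bit fields mod 255 = 3*5*17; second stage derives the
# three residues from that single mod-255 value.
def fold(x, k):                            # x mod (2^k - 1), result in [0, 2^k - 1]
    mask = (1 << k) - 1
    while x > mask:
        x = (x & mask) + (x >> k)          # add k-bit fields with end-around carry
    return x

x = 0xDEADBEEF                             # any 32-bit input word
s255 = fold(x, 8)                          # first stage: congruent to x mod 255
s15 = fold(s255, 4)                        # second stage: mod 15 = 2^4 - 1
r3, r5 = s15 % 3, s15 % 5                  # small residues from the mod-15 value
r17 = ((s255 & 0xF) - (s255 >> 4)) % 17    # 2^4 = -1 mod 17: subtract alternate fields
print(r3, r5, r17)                         # 2 4 8
```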
In the Skavantzos and Abdallah [18] technique, proposed for residue number systems
using several pairs of conjugate moduli (2^a + 1) and (2^a - 1), the first stage is a
mod (2^(2a) - 1) generator taking 2a-bit fields and summing them using a CSA to get the 2a-bit
sum S and carry C vectors. The second stage uses two residue generators for finding
the residues mod (2^a - 1) and (2^a + 1) from the four a-bit vectors SH, SL, CH and CL, where H and
L stand for the higher and lower a-bit fields. Considering an RNS whose dynamic
range X corresponds to a 2Ka-bit number, in a conventional single-level design, in the case
of modulus (2^a - 1), we need a 2K-operand mod (2^a - 1) CSA tree followed by a
mod (2^a - 1) CPA. Thus, (2K - 2) CSAs, each containing a full-adders, will be needed. In the
case of mod (2^a + 1), we need in addition a 2K-operand mod (2^a + 1) CSA tree and a
mod (2^a + 1) CPA; this CSA tree has (2K - 2) CSAs, each containing (a + 1)
full-adders. The total CSA cost for a conjugate moduli pair is thus (4Ka - 4a + 2K - 2)
full-adders. On the other hand, in the two-level design, we need only (2Ka + 2) full-adders
for the CSA cost, whereas the CPA cost is the same as in the
one-level approach.
Low and Chang [19] have suggested binary to RNS converters for large input word
lengths. This technique uses the observation that the residues mod m of the powers of
two from 2^0 to 2^63 can only assume values between 0 and (m - 1). Thus, the number
of "1" bits in the residues corresponding to the 64 bits to be added is small. It can
be reduced further by rewriting the residues which have large Hamming weight
as the sum of a correction term and a word with smaller Hamming weight. This
results in reducing the number of terms (bits) being added. As an illustration, for
modulus 29, the values of 2^x mod 29 for x = 0 to 27 are as follows:
1, 2, 4, 8, 16, 3, 6, 12, 24, 19, 9, 18, 7, 14, 28, 27, 25, 21, 13, 26, 23, 17, 5, 10, 20, 11, 22, 15.
For a 64-bit input word, these repeat once again from 2^28 till 2^55 and once
again from 2^56 till 2^63. Many of the bits in these residues are zero. Consider the residue
27 (i.e. 2^15 mod 29), which has Hamming weight 4. The term (x15 · 2^15) mod 29 can be written as
(27 + 2 x15′) mod 29, where the prime denotes complementation, so that when x15 is zero, its value is (27 + 2) mod 29 = 0,
and when x15 is one, it is 27. Since 2 has a much smaller Hamming weight than 27, the number of bits to be added is reduced,
the constant 27 being absorbed into a correction term. This property applies to the residues 19, 28, 27, 25, 21, 13, 26, 23, 11 and 15. Thus,
corresponding to a 64-bit input word, in the conventional design, 64 five-bit words
would have to be added in the general case. Many bits are zero in these words;
deleting all the bits which are zero, we would need to add 27, 28, 29, 31
and 30 bits in the various columns. The number of bits to be
added in each column (corresponding to 2^i for i = 4, 3, 2, 1, 0) can be tabulated both
without and with Hamming weight optimization.
Thus, in the case of modulus 29, the full-adders (FA) and half-adders
(HA) needed to add all these bits in each column, together with the carries coming from the
column on the right, can be shown to be 111 FA + 11 HA before and 87 FA + 11 HA after
Hamming weight reduction. The end result will need a CPA whose word
length is more than that of the modulus. Low and Chang suggest that the bits
above the (r - 1)-th bit position can also be simplified in a similar manner using
additional hardware, without disturbing the LSB portion already obtained as one
r-bit word. An LUT can be used to perform the simplification of the MSBs to
finally obtain one r-bit word.
The two r-bit operands A and B (LSB and MSB) next need to be added mod m.
The authors adopt the technique of Hiasat [20], after modification to handle the
possibility that A + B can be greater than 2m. The modulo addition of A and B in this
case can be realized as

|X|_m = |A + B + 2Z|_(2^r)   if A + B + 2Z ≥ 2^(r+1)
|X|_m = |A + B + Z|_(2^r)    if 2^r ≤ A + B + Z < 2^(r+1)
|X|_m = A + B                otherwise

where Z = 2^r - m, since A ≤ 2^r - 1 and B ≤ m - 1. Two CLAs will be needed for
estimating the carries Cout and C*out corresponding to the computations A + B + Z and A + B
+ 2Z, where Z = 2^r - mi. Using a 3:1 multiplexer, the generate and propagate vectors can
be selected for being added in the CLA and summation unit.
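The three cases of this modified modulo addition can be sketched as follows (an illustrative sketch assuming A < 2^r and B < m):

```python
# Sketch of the final modulo addition of the r-bit operands A and B, Z = 2^r - m.
def mod_add(A, B, m, r):
    Z = (1 << r) - m
    if A + B + 2 * Z >= (1 << (r + 1)):    # i.e. A + B >= 2m
        return (A + B + 2 * Z) % (1 << r)
    if A + B + Z >= (1 << r):              # i.e. m <= A + B < 2m
        return (A + B + Z) % (1 << r)
    return A + B                           # A + B < m

print(mod_add(30, 18, 19, 5))   # 10 = (30 + 18) mod 19
```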
Matutino, Pettenghi, Chaves and Sousa [21, 22] have described binary to RNS
conversion for moduli of the type 2^n ± k for the four-moduli set {2^n - 1, 2^n + 1, 2^n - 3,
2^n + 3} with a dynamic range of 4n bits. The given 4n-bit binary word can be considered
as four n-bit fields W3, W2, W1 and W0, yielding in the case of modulus (2^n - k)

|W|_(2^n - k) = |W3 k^3 + W2 k^2 + W1 k + W0|_(2^n - k)    (3.2)

In the case of the modulus (2^n + k), the computation carried out is

|W|_(2^n + k) = |-W3 k^3 + W2 k^2 - W1 k + W0|_(2^n + k)    (3.3a)

since 2^n mod (2^n + k) = -k and 2^(3n) mod (2^n + k) = -k^3. Note that (3.3a) can be
rewritten as

|W|_(2^n + k) = |W3′ k^3 + W2 k^2 + W1′ k + W0 + c|_(2^n + k)    (3.3b)

where the primes denote one's complements and

c = |k^3 (k + 1) + 3k (k + 1)|_(2^n + k)

Note that, due to the intermediate reduction steps for reducing the (2n + 2)-bit
word to (n + p + 1) bits and next (n + p + 1) bits to n bits, the correction factor is
c = k^3 (k + 1) + 3k (k + 1). The converter for modulus (2^n + k) needs three stages,
whereas that for modulus (2^n - k) needs four stages. Matutino et al. [22] also suggest
a multiplier-based realization that adds only shifted versions of the inputs instead of using
hardware multipliers.
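A sketch verifying the decompositions (3.2) and (3.3a) for n = 8, k = 3, i.e. moduli 253 and 259 (names are illustrative):

```python
def residue_2n_pm_k(W, n, k, plus=False):
    mask = (1 << n) - 1
    w = [(W >> (i * n)) & mask for i in range(4)]      # W0, W1, W2, W3
    if plus:   # 2^n = -k mod (2^n + k): odd-position fields enter negatively
        return (-w[3] * k**3 + w[2] * k**2 - w[1] * k + w[0]) % ((1 << n) + k)
    # 2^n = k mod (2^n - k): all fields enter positively
    return (w[3] * k**3 + w[2] * k**2 + w[1] * k + w[0]) % ((1 << n) - k)

W = 0xCAFEBABE
print(residue_2n_pm_k(W, 8, 3), residue_2n_pm_k(W, 8, 3, plus=True))
```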
There is often a requirement, in cryptography as well as in RNS, to obtain a scaled
residue, i.e. |x/2^α|_m [23]. This can be achieved by successive division by 2 mod m. As
an illustration, |13/2^3|_19 can be obtained by first computing |13/2|_19 = (13 + 19)/2
= 16. Next, |13/2^2|_19 = |16/2|_19 = 8, and |13/2^3|_19 = |8/2|_19 = 4. The procedure
for scaling x by 2 is thus: if the LSB of x is 1, add the modulus m and divide by two
(ignoring the LSB, or performing a one-bit right shift); if the LSB of
x is zero, just dividing by two (ignoring the LSB, or right shifting)
will suffice.
Montgomery's algorithm [24] permits evaluation of |x/2^α|_m by considering α bits
at a time (also called a higher-radix implementation). Here, we wish to find the
multiple of m that needs to be added to x to make it exactly divisible by 2^α. First,
we need to compute β = |(-m)^(-1)|_(2^α). Next, knowing the word Z corresponding to the α
LSBs of x, we compute Y = x + |Zβ|_(2^α) · m, which will be exactly divisible
by 2^α. The division is then a right shift by α bits.
Consider x = (101001101)2 = 333; we wish to find |x/16|_23. We find
β = |(-23)^(-1)|_16 = 9. We know Z = 13 (the 4-bit LSBs of x). Thus, we need to compute
Y = 333 + |13 × 9|_16 × 23 = 333 + 5 × 23 = 448, and Y/16 = 28 (with 28 mod 23 = 5 = |x/16|_23).
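The worked example can be sketched in Python (illustrative only; it relies on the modular inverse via pow, available from Python 3.8):

```python
# Montgomery-style reduction: add a multiple of m so that the sum is exactly
# divisible by 2^alpha, then right-shift; m must be odd.
def montgomery_reduce(x, m, alpha):
    beta = pow(-m, -1, 1 << alpha)           # |(-m)^(-1)| mod 2^alpha
    Z = x & ((1 << alpha) - 1)               # alpha LSBs of x
    Y = x + ((Z * beta) % (1 << alpha)) * m  # Y is divisible by 2^alpha
    return Y >> alpha                        # exact division by 2^alpha

q = montgomery_reduce(333, 23, 4)
print(q, q % 23)   # 28 5; note 5 * 16 mod 23 = 11 = 333 mod 23
```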
References
1. K. Hwang, Computer arithmetic: Principles, architecture and design (Wiley, New York,
1979)
2. W.K. Jenkins, B.J. Leon, The use of residue number systems in the design of finite impulse
response digital filters. IEEE Trans. Circuits Syst. CAS-24, 191–201 (1977)
3. T. Stouraitis, Analogue and binary to residue conversion schemes. IEE Proc. Circuits, Devices
and Systems. 141, 135–139 (1994)
4. G. Alia, E. Martinelli, A VLSI algorithm for direct and reverse conversion from weighted
binary system to residue number system. IEEE Trans. Circuits Syst. 31, 1033–1039 (1984)
5. R.M. Capocelli, R. Giancarlo, Efficient VLSI networks for converting an integer from binary
system to residue number system and vice versa. IEEE Trans. Circuits Syst. 35, 1425–1430
(1988)
6. S.J. Piestrak, Design of residue generators and multi-operand modulo adders using carry save
adders, in Proceedings of the. 10th Symposium on Computer Arithmetic, Grenoble, 26–28 June
1991. pp. 100–107
7. S.J. Piestrak, Design of residue generators and multi-operand modulo adders using carry save
adders. IEEE Trans. Comput. 43, 68–77 (1994)
8. P.V. Ananda Mohan, Efficient design of binary to RNS converters. J. Circuit. Syst. Comp 9,
145–154 (1999)
9. P.V. Ananda Mohan, Novel design for binary to RNS converters, in Proceedings of ISCAS,
London, 30 May–2 June 1994. pp. 357–360
10. H. Pettenghi, R. Chaves, L. Sousa, Method for designing modulo (2^n ± k) binary to RNS
converters, in Proceedings of the Conference on Design of Circuits and Integrated Systems,
DCIS, Estoril, 25–27 Nov. 2013
11. A.B. Premkumar, A formal framework for conversion from binary to residue numbers. IEEE
Trans. Circuits Syst. 49, 135–144 (2002)
12. A.B. Premkumar, E.L. Ang, E.M.K. Lai, Improved memory-less RNS forward converter based
on periodicity of residues. IEEE Trans. Circuits Syst. 53, 133–137 (2006)
13. G. Bi, E.V. Jones, Fast conversion between binary and residue numbers. Electron. Lett. 24,
1195–1197 (1988)
14. F. Pourbigharaz, H.M. Yassine, Simple binary to residue transformation with respect to 2^m + 1
moduli. Proc. IEE Circuits Dev. Syst. 141, 522–526 (1994)
15. F. Pourbigharaz, H.M. Yassine, Modulo free architecture for binary to residue transformation
with respect to the {2^n − 1, 2^n, 2^n + 1} moduli set. Proc. IEEE ISCAS 2, 317–320 (1994)
16. M.H. Sheu, S.H. Lin, Y.T. Chen, Y.C. Chang, High-speed and reduced area RNS forward
converter based on the {2^n − 1, 2^n, 2^n + 1} moduli set, in Proceedings of the IEEE 2004 Asia-Pacific
Conference on Circuits and Systems, 6–9 Dec. 2004. pp. 821–824
17. S.J. Piestrak, Design of multi-residue generators using shared logic, in Proceedings of ISCAS, Rio
de Janeiro, 15–19 May 2011. pp. 1435–1438
18. A. Skavantzos, M. Abdallah, Implementation issues of the two-level residue number system
with pairs of conjugate moduli. IEEE Trans. Signal Process. 47, 826–838 (1999)
19. J.Y.S. Low, C.H. Chang, A new approach to the design of efficient residue generators for
arbitrary moduli. IEEE Trans. Circuits Syst. I Reg. Papers 60, 2366–2374 (2013)
20. A.A. Hiasat, High-speed and reduced area modular adder structures for RNS. IEEE Trans.
Comput. 51, 84–89 (2002)
21. P.K. Matutino, H. Pettenghi, R. Chaves, L. Sousa, Multiplier based binary to RNS converters
modulo (2^n ± k), in Proceedings of the 26th Conference on Design of Circuits and Integrated
Systems, Albufeira, Portugal, pp. 125–130, 2011
22. P.K. Matutino, R. Chaves, L. Sousa, Binary to RNS conversion units for moduli (2^n ± 3), in
14th IEEE Euromicro Conference on Digital System Design, Oulu, Aug. 31–Sept. 2, 2011.
pp. 460–467
23. S.J. Meehan, S.D. O’Neil, J.J. Vaccaro, A universal input and output RNS converter. IEEE
Trans. Circuits Syst. 37, 799–803 (1990)
24. P.L. Montgomery, Modular multiplication without trial division. Math. Comput. 44, 519–521
(1985)
25. C.K. Koc, T. Acar, B.S. Kaliski Jr., Analyzing and comparing Montgomery multiplication
algorithms. IEEE Micro 16(3), 26–33 (1996)
4.1 Modulo Multipliers for General Moduli

Residue number multiplication for general moduli (i.e. moduli not of the form
2^k ± a) can be carried out by several methods: using index calculus, using
sub-modular decomposition, using auto-scale multipliers, and based on quarter-square
multiplication.
Soderstrand and Vernia [1] have suggested modulo multipliers based on index
calculus. Here, the residues are expressed as exponents of a chosen base modulo m.
The indices corresponding to the inputs for the chosen base are read from LUTs
(look-up tables) and added mod (m − 1), and then, using another LUT, the actual
product mod m is obtained.
As an illustration, consider m = 11. Choosing base 2, since 2^8 mod 11 = 3 and 2^4
mod 11 = 5, the indices corresponding to the inputs 3 and 5 are 8 and 4, respectively.
We wish to find (3 × 5) mod 11. The index corresponding to
the product is (8 + 4) mod 10 = 2. This corresponds to 2^2 mod 11 = 4, which is the
desired answer.
Note that multiplication modulo m is isomorphic to modulo (m − 1) addition
of the indices. Note further that zero-detection logic is required, since no index
exists for a zero input.
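The index-calculus multiplication can be sketched as follows (illustrative; the dictionary stands in for the LUTs):

```python
m, g = 11, 2                                       # modulus and primitive root
index = {pow(g, i, m): i for i in range(m - 1)}    # discrete-log table, residues 1..10

def mul_index(a, b):
    if a == 0 or b == 0:                           # zero has no index
        return 0
    return pow(g, (index[a] + index[b]) % (m - 1), m)

print(mul_index(3, 5))   # 4: indices 8 + 4 = 12 = 2 mod 10, and 2^2 mod 11 = 4
```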
Jullien [2] first suggested using sub-modular decomposition for index-calculus-based
multipliers mod m. This involves three stages: (a) sub-modular index computation,
(b) modulo index addition, and (c) reconstruction of the desired result.
The choice of sub-moduli m1 and m2 such that m1 m2 > 2m has been suggested.
As an illustration, for m = 19, m1 = 6 and m2 = 7 can be chosen. Considering
multiplication of X = 12 and Y = 17, with base 2 for modulus m = 19, the indices
corresponding to X and Y can be seen to be 15 and 10, which in residue form are
(3, 1) and (4, 3) corresponding to (m1, m2). Adding the indices, we obtain the index of the
product XY as (1, 4). Using the CRT (which will be introduced
later in Chapter 5), the decoded word corresponding to the moduli {6, 7} can be shown
to be 25, which mod 18 is 7. Thus, the final result can be obtained as 14, since 2^7 mod
19 = 14. Note that for a zero input, since the index does not exist, the value 7 is used, since
7 will never appear as a valid sub-modular result, the sub-moduli being
at most 7.
Jullien’s approach needs large ROMs, which can be reduced by using the sub-modular
decomposition due to Radhakrishnan and Yuan [3]. This technique is applicable
only if m − 1 can be factorized. For example, for m = 29, m1 = 7 and m2 = 4 can be
used. The approach is the same as before.
As an illustration, consider the multiplication of X = 5 and Y = 12
with base 2 for modulus m = 29. We have the indices 22 and 7, or in RNS form (1, 2)
and (0, 3). Adding these, we have (1, 1). Using the CRT, we obtain the sum of the indices as
1, which corresponds to the product 2. Note that among the several factorizations
possible for m − 1, the one that needs the smallest ROM can be chosen. In the case of
either input being zero, combinational logic detects the condition and the output
is made zero. Note that the addition of the indices can be performed using combinational logic,
whereas the rest can use ROMs. For the above example, the
total memory requirement is 320 bits for the index ROMs and 160 bits for the inverse
index ROM (considering a memory of 32 locations for the needed 29 locations).
Dugdale [4] has suggested that by using the unused memory locations in the LUTs,
the zero-detection logic can be avoided. Further, he has suggested that in the case of a
composite modulus, direct multiplication can be used for the non-prime factors,
whereas index calculus may be employed for the prime factors. Consider for illustration
the composite modulus 28, which is realized using the RNS (7, 4). Since multiplication
mod 4 is relatively simple, normal multiplication can be employed, whereas for
the modulus 7, index calculus can be employed. Since m − 1 = 6 for m = 7, we can
use the two-moduli set {2, 3} to perform the index computation. A zero input
can be represented by the unused residue set (2, 0), (2, 1) or (2, 2).
As an illustration, consider X = 14 and Y = 9. Corresponding to the three moduli
2, 3, 4 (indices for the first two moduli, actual residue for modulus 4), X and Y are
represented as (2, 0, 2) and (0, 2, 1), which yields the product as
(2, 2, 2). We thus obtain the result as 14.
Dugdale has compared multiplier realizations using the direct method and the
method using index calculus and observed that the memory requirements are
much less for the latter. As an illustration, for the composite modulus
221 = 13 × 17, the index tables in the first level corresponding to the two moduli
need 2 × 221 × (4 + 5) = 3978 bits, the memory in the second level needed to
perform the index addition needs 2121 bits (169 × 4 + 289 × 5), and the third level
needed to find the actual result corresponding to the obtained index needs 221 locations
of 8 bits each, i.e. 1768 bits, for a total of 7867 bits, whereas a
direct multiplier needs 390,728 bits.
Ramnarayan [5] has considered moduli of the type m = 2^n − 2^k + 1, in which
case the modulo (m − 1) adder needed for adding or subtracting indices can be
simplified. Note that m − 1 = 2^n − 2^k has (n − k) "ones" followed by
k zeros. Hence, if 0 ≤ x + y < 2^n − 2^k, then (x + y) mod (m − 1) = x + y. If
2^n − 2^k ≤ x + y < 2^n, then (x + y) mod (m − 1) = x + y − 2^n + 2^k. If
2^n ≤ x + y < 2^{n+1}, then (x + y) mod (m − 1) = {(x + y) mod 2^n} + 2^k. These three
conditions can be checked by combinational logic.
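The three-range reduction can be checked with a small sketch (the parameter choice n = 5, k = 2, i.e. m = 29, is illustrative):

```python
def add_mod_m_minus_1(x, y, n, k):
    """(x + y) mod (2^n - 2^k), i.e. mod (m - 1) for m = 2^n - 2^k + 1,
    via the three-range case analysis (no division needed)."""
    t = (1 << n) - (1 << k)
    s = x + y
    if s < t:                          # 0 <= x + y < 2^n - 2^k
        return s
    if s < (1 << n):                   # 2^n - 2^k <= x + y < 2^n
        return s - (1 << n) + (1 << k)
    return (s % (1 << n)) + (1 << k)   # 2^n <= x + y < 2^{n+1}

# exhaustive check for m = 2^5 - 2^2 + 1 = 29, so m - 1 = 28
for x in range(28):
    for y in range(28):
        assert add_mod_m_minus_1(x, y, 5, 2) == (x + y) % 28
```

In hardware the three comparisons reduce to inspecting a few carry and sum bits, which is why the adder simplifies for this class of moduli.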
In the quarter-square technique [1], XY mod m is found as

(XY) mod m = [ ( ((X + Y)² mod m) − ((X − Y)² mod m) ) / 4 ] mod m    (4.1)

Thus, using look-up tables, both terms in the numerator can be found and combined
mod m. The division by 4 is also carried out using a ROM. These designs are well
suited for small moduli.
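A sketch of the quarter-square technique follows; the division by 4 (a ROM in hardware) is modelled here by the multiplicative inverse of 4 mod m, which exists for odd m (an illustrative substitution, not the table of the text):

```python
def quarter_square_tables(m):
    """ROM contents for the quarter-square method: v^2 mod m for all v."""
    return [v * v % m for v in range(m)]

def qs_mul(x, y, m, sq):
    """XY mod m from the identity 4XY = (X + Y)^2 - (X - Y)^2 (m odd)."""
    num = (sq[(x + y) % m] - sq[(x - y) % m]) % m
    return num * pow(4, -1, m) % m

sq = quarter_square_tables(19)
assert qs_mul(13, 17, 19, sq) == (13 * 17) % 19   # = 12
```

Only one table of m squares is needed, which is the source of the memory saving over a full m × m product table.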
Designs based on combinational logic are needed for larger word-length
moduli. Extensive work on modulo multiplication has been carried out over the past
three decades due to its immense application in cryptography: authentication and
key-exchange algorithms. There, the word lengths of the operands are very large,
ranging from 160 bits to 2048 bits, whereas in RNS for DSP applications the word
lengths are at most a few tens of bits.
The operation (AB) mod m can be carried out by first multiplying A by B and
then dividing the result by m to obtain the remainder. The quotient is obtained in
this process but is not of interest to us. Moreover, the word length of the product is
2n bits for n-bit input operands, and the division process is involved, as is well known.
Hence, modulo multipliers are usually realized in an integrated manner, performing
partial product addition and modulo reduction in one step. This ensures that the
word length is increased by at most 2 bits. An example will illustrate Brickell's
algorithm [6, 7].
Example 4.1 Consider C = AB mod m where A = 13, B = 17 and m = 19. We start
with the MSB of A = 13 = (1101)₂ and in each step compute E_{i+1} = (2E_i + a_i B) mod m,
starting from the value 0. The following illustrates the procedure:
E0 = (2 × 0 + 1 × 17) mod 19 = 17    (a3 = 1)
E1 = (2 × 17 + 1 × 17) mod 19 = 13   (a2 = 1)
E2 = (2 × 13 + 0 × 17) mod 19 = 7    (a1 = 0)
E3 = (2 × 7 + 1 × 17) mod 19 = 12    (a0 = 1)
Thus C = 12. ■
Note, however, that each step involves a left shift (appending an LSB of zero to
realize multiplication by 2), a conditional addition of B depending on the a_i value,
and a modulo m reduction. The operand in the brackets satisfies U = (2E_i + a_i B) < 3m,
meaning that the word length is 2 bits more than that of m, and the modulo reduction
is performed by subtracting m or 2m and selecting U, U − m or U − 2m
based on the signs of U − m and U − 2m. The method can be extended to higher
radices as well, at some expense in hardware but reducing the modulo multiplication
time (see for example [8]).
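The interleaved multiply-reduce iteration can be modelled as follows; the function name and the explicit compare-and-subtract reduction are an illustrative rendering of the scheme, not code from [6, 7]:

```python
def interleaved_modmul(a, b, m, n):
    """MSB-first interleaved modular multiplication: E <- (2E + a_i * b) mod m,
    scanning the n bits of a from the MSB downward."""
    e = 0
    for i in reversed(range(n)):
        u = 2 * e + ((a >> i) & 1) * b   # U < 3m: at most 2 bits wider than m
        if u >= 2 * m:                   # reduce by subtracting m or 2m,
            u -= 2 * m                   # selected by the signs of U - m and U - 2m
        elif u >= m:
            u -= m
        e = u
    return e

assert interleaved_modmul(13, 17, 19, 4) == 12   # Example 4.1
```

No division ever occurs: the reduction is two subtractions and a selection, which is what keeps the intermediate word length bounded.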
Several authors have suggested designing the modulo multiplier in two stages.
The first stage is an n-bit by n-bit multiplier whose 2n-bit output word is reduced
mod m in a second stage. Hiasat [9] has suggested a modulo multiplier which
uses a conventional multiplier to obtain first Z = xy, followed by a modulo reduction
block. In this method, defining a = 2^n − m and considering that k bits are used to
represent a, the 2n-bit product Z is split into its most and least significant parts
Z_H and Z_L. In this technique, we first compute Z_L 2^b mod m and add Z_H; the
result P is formed as in (4.4) of [9] from Z_H, Z_L and the constant α_i = m mod 2^b,
plus a term δ which is 1 if Z mod 2^b ≠ 0 and 0 otherwise.
This architecture needs three n × n multipliers (for computing Z, for multiplying
by (2^b − α_i) and for multiplying by m_i) and a three-input n-bit modulo adder. One of the
multipliers need not compute the most significant word of the product. Thus,
it needs roughly 2.5 (n × n) binary multipliers, two n-bit CPAs and additional
control logic. The delay is (11 log₂ n + 2) units. Note that this technique is a simple
variation of the Montgomery technique [11], to be described in Chapter 10.
Next, the input bit products a_i b_j can be re-organized into doublets and
triplets whose maximum sum does not exceed unity for any input bit values.
This reduces the number of bits to be added in each column, thereby allowing
some 1-bit adders in a column to be replaced with OR gates. Paliouras et al. [13]
have used extensive simulation over all possible input combinations to
identify the bit-product pairs or triplets that cannot be active simultaneously.
The design contains a recursive modulo reduction stage formed by cascaded adders
following Stouraitis et al. [12] which, however, increases delay while reducing
area. A mod 5 multiplier realized using this approach is presented in Figure 4.2a
for illustration.
Dimitrakopoulos et al. [14] have considered the use of a signed-digit representation
of the bit-product weight sequence (2^k) mod m_i in order to reduce the hardware
complexity. A graph-based optimization method is described which can achieve an
area reduction of 50 %. A mod 11 multiplier realized using this approach is
presented in Figure 4.2b for illustration. Note that in the case of the mod 11 multiplier,
the weights corresponding to 2^4, 2^5 and 2^6 are written as 5, −1, −2 so that the
number of bits to be added in the various columns of the 4 × 4 mod 11 multiplier is
reduced. A correction term needs to be added to compensate for the effect of the
inverted bits.
Note that ROMs can be used to realize the multiplication in RNS for small
moduli. On the other hand, recent work has focused on powers-of-two-related
moduli sets, since modulo multiplication can be simpler using the periodic property
of moduli discussed in Chapter 3. The multiplication operation (AB) mod 2^n is
simple: only the n LSBs of the product AB need to be obtained. An array multiplier
can be used, for instance, omitting the full adders beyond the (n − 1)th bit. We next
consider mod (2^n − 1) and mod (2^n + 1) multipliers separately.
4.2 Multipliers mod (2^n − 1)

We use the periodic property of the modulus (2^n − 1) for this purpose. We note that
2^{n+k} ≡ 2^k mod (2^n − 1). Thus, multiplication mod (2^n − 1) involves the addition
of modified partial products obtained by circular rotation of the bits of each partial
product. The resulting n words need to be added in a CSA with EAC followed
by a CPA with EAC in the last stage. Consider the following example.
Figure 4.2 (a) A mod 5 multiplier based on bit-product grouping [13] and (b) a mod 11 multiplier based on signed-digit weight encoding [14] (gate-level arrays of half/full adders not reproduced)
Let A = (1011)₂ = 11 and B = (1101)₂ = 13, with n = 4, so that AB mod 15 = 8:

1011  PP0 = b_0 A
0000  PP1 = b_1 A rotated left by 1 bit
1110  PP2 = b_2 A rotated left by 2 bits
0101  SUM
0101  CARRY (rotated, including the EAC bit)
1101  PP3 = b_3 A rotated left by 3 bits
1101  SUM
1010  CARRY
1000  Final EAC addition ■
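The rotate-and-add scheme of this example can be sketched in software; the end-around-carry loop below stands in for the CSA/CPA hardware:

```python
def mul_mod_2n_minus_1(a, b, n):
    """Multiply mod (2^n - 1) by adding rotated partial products,
    using 2^(n+k) = 2^k mod (2^n - 1): partial product b_k * A is
    the n-bit word A rotated circularly left by k positions."""
    m = (1 << n) - 1
    acc = 0
    for k in range(n):
        if (b >> k) & 1:
            acc += ((a << k) | (a >> (n - k))) & m   # rotate-left by k bits
    while acc >= (1 << n):                           # end-around-carry reduction
        acc = (acc & m) + (acc >> n)
    return acc % m                                   # hardware may keep all-ones for zero

assert mul_mod_2n_minus_1(0b1011, 0b1101, 4) == 8   # the example above
```

The final `% m` collapses the all-ones word to zero; a hardware design may instead retain both representations of zero, as discussed for EAC adders in Chapter 3.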
Wang et al. [15] have described a mod (2^n − 1) multiplier which is based on
adding the n-bit partial products using a Wallace tree followed by a CPA with EAC.
The partial products are obtained by circularly rotating the input word A left by
i bits and are added in case b_i is 1, as mentioned before.
The area and delay of the Wang et al. design are n²A_AND + n(n − 2)A_FA + A_PAn and
D_AND + (n − 2)D_FA + D_PAn when a CSA is used, where A_PAn and D_PAn are the area
and delay of an n-bit carry-propagate adder and A_FA, D_FA, A_AND and D_AND are the areas
and delays of a full adder and an AND gate, respectively. The delay in the case of a
Wallace tree is D_AND + d(n)D_FA + D_PAn. Note that the depth d(n) of the Wallace tree is
equal to 1, 2, 3, 4, 5, 6, 7, 8, 9 when n is 3, 4, 5–6, 7–9, 10–13, 14–19, 20–28, 29–42,
43–63, respectively.
Zimmermann [16] has also suggested the use of a CSA based on a Wallace tree. He
has observed that Booth recoding does not lead to faster and smaller multipliers, since
the overhead of the recoding logic is not compensated by the smaller carry-save
adder of only (n/2) + 1 partial products.
The area and delay of Zimmermann's modulo (2^n − 1) multiplier are (8n² + (3/2)
n log n − 7n) and 4d(n) + 2 log n + 6 using the unit-gate model, where d(n) is the depth
of the Wallace tree.
Efstathiou et al. [17] have pointed out that when using the modified Booth
algorithm in the case of even n for the modulus 2^n − 1, the number of partial
products can be only n/2. The most significant recoded digit of the multiplier B can
be seen to be b_{n−1}, which corresponds to (b_{n−1} 2^n) mod (2^n − 1) = b_{n−1}. Thus, in
place of b_{−1} (which is zero) in the first digit, we can use b_{n−1}. Accordingly, the first
recoded digit will be (b_{n−1} + b_0 − 2b_1). The truth tables and implementations of the
Booth encoder and Booth selector are shown in Figure 4.3a, b and the design of the
modulo 255 multiplier is presented in Figure 4.3c.
The area and delay of this mod (2^n − 1) multiplier using either a CSA array or a
Wallace tree are (n/2)A_BE + n(n/2)A_BS + n((n/2) − 2)A_FA + A_PAn, where A_BE, A_BS and
A_PAn are the areas of the Booth encoder (BE), Booth selector (BS) and a mod (2^n − 1)
adder, whereas for Zimmermann's technique the area needed is
((n/2) + 1)A_BE + n((n/2) + 1)A_BS + n((n/2) − 1)A_FA + A_PAn. The delays for the
Efstathiou et al. multiplier and Zimmermann's multiplier are respectively T_BE + T_BS
+ ((n/2) − 2)T_FA + T_PAn and T_BE + T_BS + ((n/2) − 1)T_FA + T_PAn.
Figure 4.3 (a) Radix-4 Booth encoder, (b) Booth selector and (c) a mod 255 multiplier using Booth's algorithm (adapted from [17] ©IEEE 2004; truth tables and the selector/full-adder array not reproduced)
In the case of a Wallace tree being used, we have for the two cases the delays
T_BE + T_BS + k(n/2)T_FA + T_PAn and T_BE + T_BS + k((n/2) + 1)T_FA + T_PAn,
respectively, where k(·) denotes the tree depth.
Recently, Muralidharan and Chang [18] have suggested a radix-8 Booth-encoded
modulo (2^n − 1) multiplier with adaptive delay. The digit d_i corresponding to the
multiplier Y can in this case be expressed as d_i = y_{3i−1} + y_{3i} + 2y_{3i+1} − 4y_{3i+2}
(with y_{−1} = 0). Thus, d_i can assume one of the values 0, ±1, ±2, ±3 and ±4. The Booth
encoder (BE) and Booth selector (BS) blocks are shown in Figure 4.4a, b. Note that while
the partial products (PP) for multiples of ±1, ±2 and ±4 can be easily obtained, the
PPs for multiples of ±3 need attention. Conventional realization of 3X mod (2^n − 1)
by adding 2X to X needs a CPA followed by another carry-propagation stage
using half adders to add the carry generated by the CPA.
Muralidharan and Chang [18] suggest the use of (n/k) k-bit adders to
add X and 2X so that there is no carry propagation between the adder blocks. While
the carry of the left-most k-bit adder can be shifted to the LSB due to the mod (2^n − 1)
operation, the carry of each k-bit adder results in a second vector, as shown in
Figure 4.4c. However, in the case of obtaining −3X, we need the one's
complements of both words. The second word will have several strings of
(k − 1) ones. These can be avoided by adding a bias word B [19] which has ones
in the ak-th bit positions, where a = 0, . . ., (n/k) − 1. The addition of the bias word to the
Sum and Carry bits of the adder in Figure 4.4c can be realized easily using one
XNOR gate and one OR gate per block, as shown in Figure 4.4d, to obtain |B + 3X|_{2^n−1}.
Note that bs_0^j = s_0^j XNOR c_{k−1}^{j−1} when j ≠ 0 and bs_0^0 = s_0^0 XNOR c_{k−1}^{M−1},
while bc_{k−1}^j = s_0^{j+1} OR c_{k−1}^j when j ≠ M − 1 and bc_{k−1}^{M−1} = s_0^0 OR c_{k−1}^{M−1},
for j = 0, 1, . . ., M − 1, where M = n/k.
The bias word B needs to be added to all multiples of X for uniformity, and a
compensation constant (CC) can be added at the end. The biased simple multiples
|B + 0|_{2^n−1}, |B + X|_{2^n−1}, |B + 2X|_{2^n−1}, |B + 4X|_{2^n−1} for n = 8 are realized by left
circular shift and selective complementing of the multiplicand bits without additional
hardware, as shown in Figure 4.4e. The multiple B − 3X is realized in a similar
way. Note that (−3X) mod (2^n − 1) is the one's complement of 3X. The bias word B can
be added in a similar way as in the case of +3X to get B − 3X. Note that
the bias word needs to be scaled by 2³, 2⁶, etc. Each PP_i consists of an n-bit vector
pp_{i,n−1}, . . ., pp_{i,0} and a vector of n/k = 2 redundant carry bits q_{i1} and q_{i0}. These are
circularly displaced to the left by 3 bits for each PP_i. In the case of radix-8 Booth
encoding, the ith partial product can be seen to be PP_i = |2^{3i} d_i X|_{2^n−1}. This is
modified to include the bias B as PP_i = |2^{3i}(B + d_i X)|_{2^n−1}. The modulo-reduced
partial products and correction terms for a mod 255 multiplier are shown in
Figure 4.4f. Hence, the correction word will be one n-bit word if k is chosen to
be prime to n.
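The claim that (−3X) mod (2^n − 1) is the one's complement of 3X can be checked quickly, with the all-ones word understood as a second representation of zero (n = 8 is an illustrative choice):

```python
n = 8
mask = (1 << n) - 1                  # 2^n - 1 = 255

def neg_mod(v):
    """Additive inverse mod (2^n - 1) of an n-bit v: its one's complement."""
    return (~v) & mask

# exhaustive check: the complement of the reduced 3X equals (-3X) mod 255,
# modulo the double representation of zero
for x in range(mask):
    assert neg_mod((3 * x) % mask) % mask == (-3 * x) % mask
```

This is why negation costs only inverters in mod (2^n − 1) arithmetic, once 3X has been reduced to an n-bit word.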
Figure 4.4 (a) Booth encoder block, (b) Booth selector block, (c) generation of partially redundant (+3X) mod (2^n − 1) using k-bit RCAs, (d) generation of (B + 3X) mod (2^n − 1) using k-bit RCAs, (e) generation of partially redundant simple multiples and (f) modulo-reduced partial products and CC for a mod (2^8 − 1) multiplier (adapted from [18] ©IEEE 2011; circuit detail not reproduced)
Note that the choice of k decides the speed of generation of the hard multiple (i.e. the
delay of the k-bit ripple-carry adder). Here, the partial product accumulation by the
CSA tree has a time complexity of O(log (n + n/k)). The delays of the hard-multiple
generation, CC generation, partial product generation by the BS and BE blocks, and the
two-operand parallel-prefix modulo (2^n − 1) adder are respectively O(k), O(1), O(1) and
O(log n). Thus, the total delay of the multiplier depends logarithmically on
n and linearly on k. Hence, the delay can be adjusted by proper choice of k and
n. The final adder used by the authors was a two-operand mod (2^n − 1) adder using a
Sklansky parallel-prefix structure with an additional level for EAC addition,
following Zimmermann [16]. The authors have shown that by proper choice of k,
the delay of the mod (2^{2n} − 1) multiplier can be changed to match the delay of the
multipliers for the lower-bit-length moduli in four-moduli sets of the type {2^n, 2^n − 1,
2^n + 1, 2^{2n} − 1}.
The areas of the BE, BS and k-bit CPA are respectively 3A_INV + 3A_AND2 +
A_AND3 + 3A_XOR2, 4A_AND2 + A_OR4 + A_XOR2, and (k − 1)A_FA + A_HA + A_OR2 + A_XNOR2.
The total normalized area requirements in terms of gates are
25.5n(⌊n/3⌋ + 1) + 10.5n + 38.5(⌊n/3⌋ + 1) + 1 if k = n, and 25.5n(⌊n/3⌋ + 1) + 21n +
68.5(⌊n/3⌋ + 1) + 3 if k = n/3. Note that M (= n/k) k-bit RCAs, (⌊n/3⌋ + 1) BE blocks,
(⌊n/3⌋ + 1)(n + M) BS blocks and n⌊n/3⌋ + Q full adders, where Q = (⌊n/3⌋ + 1)/k, are
required.
4.3 Multipliers mod (2^n + 1)
Mod (2^n + 1) multipliers of various types have been considered in the literature: (a) both
inputs in standard representation, (b) one input in standard form and the other in
diminished-1 form and (c) both inputs in diminished-1 representation.
Curiger et al. [20] have reviewed multiplication mod (2^n + 1) techniques for
use in the implementation of IDEA (International Data Encryption Algorithm) [21], in
which the input "zero" does not appear as an operand; instead, it is considered as 2^16 for
n = 16.
The quarter-square multiplier needs only 2 × 2^n × n bits of ROM, as against the
requirement of 2^{2n} × n bits in the case of direct table look-up. Note that
φ(x + y) = (x + y)²/2 and φ(x − y) = (x − y)²/2 are stored in the memories (see (4.1)).
The index calculus technique needs 3 × 2^n × n bits of ROM.
Curiger et al. [20] suggest three techniques. The first technique follows the well-known
LowHigh lemma [21], based on the periodic property of the modulus (2^n + 1)
discussed in Chapter 3. Writing the 2n-bit product Z = XY as Z = 2^n Z_H + Z_L, the
LowHigh lemma states that

(XY) mod (2^n + 1) = Z_L − Z_H              if Z_L ≥ Z_H
                   = Z_L − Z_H + 2^n + 1    otherwise.

Figure 4.5 Modulo multiplier based on the LowHigh lemma (adapted from [23] ©IEEE 1991; block diagram comprising the multiplier, a reduction stage with two modulo 2^n adders and zero detection of the operands A and B, not reproduced)
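A sketch of LowHigh-lemma multiplication with IDEA's zero-as-2^16 convention follows (the function name is illustrative):

```python
N = 16
MOD = (1 << N) + 1                       # 2^16 + 1 = 65537

def idea_mul(x, y):
    """Multiplication mod (2^16 + 1) via the LowHigh lemma; the encoded
    operand 0 stands for 2^16, as in IDEA."""
    a = x if x else 1 << N
    b = y if y else 1 << N
    z = a * b
    z_lo, z_hi = z & (MOD - 2), z >> N   # Z_L = Z mod 2^16, Z_H = Z div 2^16
    r = z_lo - z_hi if z_lo >= z_hi else z_lo - z_hi + MOD
    return 0 if r == 1 << N else r       # map 2^16 back to its zero encoding
```

Since 65537 is prime and the operands are nonzero residues, the result is never a true zero; only the encoding of 2^16 has to be mapped back.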
The product can thus be accumulated as a sum of circularly shifted words with
correction terms (see (4.9)); note that 0. . .01. . .1 indicates a number with (n − i) zeros
and i ones. It can be seen that the MSBs of the circularly left-shifted words are one's
complemented and added. The "1" term within the brackets and the "2" term in (4.9) are
correction terms. Note that the property (A + B + 1) mod (2^n + 1) = (A + B + c̄_out) mod 2^n,
where c̄_out is the complemented carry-out, is used in computing PP_{i+1}.
Note that if x_i = 0, we need to add the string 0. . .01. . .1, since the most significant
zeros are inverted and added. Thus, using a multiplexer controlled by x_i and x_i′,
either the shifted word with one's complemented MSBs or the string 0. . .01. . .1 is
selected.
As an illustration, consider finding (1101) × (1011) mod 17, i.e. 13 × 11 mod 17 = 7.
Each multiplier bit x_i selects either the circularly shifted word with one's
complemented MSBs (x_i = 1) or the string 0. . .01. . .1 (x_i = 0), each with its "+1"
correction:

1101 → 1101 + 1   (x_0 = 1)
1101 → 1010 + 1   (x_1 = 1)
0000 → 0011 + 1   (x_2 = 0)
1101 → 1001 + 1   (x_3 = 1)
              + 2
Result:      0111

With 2^n represented by zero, the correction unit needs two zero detectors, which
are outside the critical path. Note that in (4.12), Ȳ + 1 is computed instead of
Ȳ + 2 because the final modulo adder adds an extra "1".
Using the unit-gate model, it can be shown that the area and delay of the mod (2^n + 1)
multiplier due to Zimmermann are given by 9n² + (3/2)n log n + 11n gates and
4d(n + 1) + 2 log n + 9 gate delays, respectively.
Wang et al. [24] have described a Wallace-tree-based architecture for modulo
(2^n + 1) multiplication of diminished-1 numbers. The expression evaluated is

d(BA) = [ Σ⊕_{k=1}^{n−1} d(b_k 2^k A) ⊕ Z ⊕ d(b_0 A) + 1 ] mod (2^n + 1)   (4.13)

where ⊕ stands for addition, Σ⊕ d(A_k) stands for modulo (2^n + 1) summation of
diminished-1 operands, and Z is defined as

Z = Σ_{k=1}^{n−1} b̄_k = n − 1 − Σ_{k=1}^{n−1} b_k   (4.14)
Note that d(2^k A) is obtained by a k-bit left cyclic shift with the shifted bits being
complemented if b_k = 1. On the other hand, if b_k = 0, d(2^k A) is replaced with
n zeros. The case x_n = y_n = 1 can be handled by an OR gate for the LSB at the
end to give an output of 1. Note that the computation of Z involves a counter which
counts the number of zeros in the bits b_1 to b_{n−1}.
An example is presented in Figure 4.7 for finding (58 × 183) mod 257, where
58 and 183 are diminished-1 numbers. Note that the final mod (2^n + 1) adder has a
carry-in of '1'.
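Diminished-1 arithmetic can be modelled directly from its definition d(A) = A − 1; the check below reproduces the operands of Figure 4.7 (the helper names are illustrative):

```python
N = 8
MOD = (1 << N) + 1                     # 257

def to_dim1(a):
    """d(A) = A - 1; the value 0 is excluded from the diminished-1 domain."""
    return a - 1

def from_dim1(d):
    return d + 1

def dim1_mul(da, db):
    """Product of diminished-1 operands, returned in diminished-1 form."""
    return (from_dim1(da) * from_dim1(db) - 1) % MOD

# Figure 4.7's operands: 58 and 183 in diminished-1 code are 59 and 184;
# (59 * 184) mod 257 = 62, whose diminished-1 code is 61 = (000111101)_2
assert dim1_mul(58, 183) == 61
```

The hardware of [24] computes the same result without ever forming the full product, by summing rotated diminished-1 partial products as in (4.13).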
Figure 4.7 Example of multiplication mod 257 (adapted from [24] ©Springer 1996; the bit-level carry-save addition steps are not reproduced; the final result in diminished-1 coding is 000111101)
The area requirement of the Wang, Jullien and Miller mod (2^n + 1) multiplier is
8n² + (9/2)n log n + (9/2)n − 7⌈log₂(n − 1)⌉ − 1 equivalent gates.
Wrzyszcz and Milford [25] have described a multiplier computing (XY) mod (2^n + 1)
which reduces the (n + 1)-bit by (n + 1)-bit array of partial product bits to an n × n
array. They first observe that the bits can be divided into four groups (see Figure 4.8a).
Taking advantage of the fact that when x_n or y_n is 1 the remaining bits of X and Y are
zero, they have suggested combining the bits in the left-most diagonal and the
bottom-most row into a new row as shown in Figure 4.8b, noting that these partial
product bits can be OR-ed instead of being added. The new bits sq_k are defined by
s = x_n ⊕ y_n and q_k = x_k ∨ y_k, where k ∈ {0, 1, . . ., n − 1} and ∨ and ⊕ stand for the
OR and exclusive-OR operations, respectively. Next, using the periodic property of the
modulus (2^n + 1), the bits in positions higher than n − 1 are one's complemented and
mapped into the LSBs (see Figure 4.8c). By summing all the s·2^{n+k} terms for
k ∈ {0, 1, . . ., n − 1}, we get |2s|_{2^n+1}. Note also that x_n y_n 2^{2n} yields x_n y_n, which
can be moved to the LSB. Since in the first and last rows only one bit can be different
from zero, we can combine them as shown in Figure 4.8d. Thus, the size of the partial
product bit matrix has been reduced from (n + 1)² to n². A correction term is added to
take into account the mod (2^n + 1) operation needed. All these n words are added next
using a final modulo (2^n + 1) adder.
The number of partial products is n/2 for n even and (n + 1)/2 for n odd, except
for one correction term. This multiplier receives full inputs and avoids (n + 1)-bit
circuits. It uses inverted end-around-carry CSA adders and one diminished-1 adder.
Figure 4.8 Architecture of the mod (2^n + 1) multiplier (adapted from [25] ©IEEE 1993; the partial product bit matrices of Figure 4.8c, d are not reproduced)
In adding the partial products, some simplification is possible by noting that three bits
of the type a_{n−1}b_0, a_{n−1} and b_{n−1} can be added using a simplified full adder (SFA).
The final diminished-1 adder uses a parallel-prefix architecture with the carry being
inverted and re-circulated at each prefix level [27].
The area and delay requirements of this design using the unit-gate model are 8n² + (9/2)
n log₂ n + (n/2) + 4 gates and 4d(n + 3) + 2 log₂ n + 2 gate delays if n = 4, 5, 7, 8, 11,
12, 17, 18, and 4d(n + 3) + 2 log₂ n + 4 gate delays otherwise.
Note that in another architecture, due to Ma [28], bit-pair recoding is used to
reduce the number of partial products, but it accepts only diminished-1 operands.
The number of partial products is reduced to n/2. The result of the carry-save addition is
two words (SUM and CARRY) R_0 and R_1, each of length n + (log₂ n + 1) bits. These
words are written as R_0 = 2^n M_0 + R_{0L} and R_1 = 2^n M_1 + R_{1L}.
Thus, we have

(R_0 + R_1) mod (2^n + 1) = [ ( (R_{0L} + M̄_0 + M̄_1 + 1) + R_{1L} + 1 ) + 2 ] mod (2^n + 1)   (4.17)

where M̄_0 and M̄_1 are the one's complements of M_0 and M_1. All four words
can be added using two stages of MCSA (mod (2^n + 1) CSA) followed by a final
mod (2^n + 1) CPA with a carry input of 1. The first MCSA computes the value of
the sum in the inner bracket, R_{0L} + M̄_0 + M̄_1 + 1, and the second MCSA computes
the value of the sum in the middle bracket. The CPA only adds the carry-in of 1, since a
diminished-1 result is desired.
Considering that a Dadda tree is used in place of the CSA array in the above
technique suggested by Ma [28], Efstathiou et al. [26] show that the area and
time requirements in terms of unit gates are

6n² + (9/2)n log n + (27/2)n + 7⌈log₂(n/2)⌉ − 14⌊log₂(n/2)⌋ + 1 and
20 + 4d(⌊n/2⌋ + 1) + 2 log₂ n.
Chaves and Sousa [29, 30] have slightly modified the formulation of
Zimmermann [16] by adding the 2^n correction term (see (4.9)) without needing
additional multiplexers. They realize

(XY) mod (2^n + 1) = [ Σ_{i=0}^{n−1} (PP_i + 1) + 2 + y_n X′ + x_n Y′ + 4 ] mod (2^n + 1)   (4.18)

or

P′ = [ |y x|_{2^n+1} + y_n x̄ + x_n ȳ + (x_n ∨ y_n) 2^n ] mod (2^n + 1)   (4.21b)

where z_n is the nth bit and z denotes the remaining n least significant bits of a number
Z, and ∨, ∧ stand for the OR and AND operations.
Modified Booth recoding can be applied to the efficient computation of
|y x|_{2^n+1}. As has been mentioned before, the number of partial products is n/2
for the diminished-1 representation, with one additional
Figure 4.10 Examples of (a) diminished-1 and (b) ordinary modulo (2^8 + 1) multipliers (adapted from [32] ©IEEE 2005; the worked bit-level partial product additions are not reproduced)
digit of −1 for the normal representation. Since this digit has a place value of 4, we need
to find −4 × (105). Left-shifting the byte by two bits, inverting the two MSBs
and appending them as LSBs, and one's complementing all the bits, we obtain
01011001 = 89₍₁₀₎ as shown, whereas the actual value is (−420) mod 257 = 94.
Thus, a correction of (3 + 2) = 5 needs to be added, where the term 3 arises from the
one's complementing of the two shifted bits due to the multiplication by 4. Since
one's complementing to make the PP negative amounts to adding 255, the second
term needs to be added to make the total addition mod 257 zero. In the case of positive
partial products, for example the fourth digit 1 with place value 2^6, we only need
to left-shift by 6 bits, invert the six MSBs and put them in the LSB positions. No one's
complementing is needed. The value of this PP is 101, whereas the original value is
(64 × 105) mod 257 = 38. Accordingly, a correction word needs to be added.
In the case of normal representation, depending on the values of x_n and y_n, viz.,
00, 01, 10 and 11, we need to add (see (4.21c)) 0, x̄, ȳ, and x̄ + ȳ + 1,
respectively. In the case of diminished-1 representation, depending on the values
of x_n and y_n, viz., 00, 01, 10 and 11, we need to add (see (4.21b)) x̄ + ȳ, ȳ, x̄ and 1,
respectively. Note that in both the normal and diminished-1 representations,
ȳ or y can be combined with the least significant Booth recoded digit as bit
y_{−1}, which is unused. Chaves and Sousa [32] have derived a closed-form expression
for the correction term CT.
The various PPs are added, using diminished-1 adders which inherently add 1, in a
Wallace tree configuration, and the final SUM and CARRY vectors are added using
a modulo (2^n + 1) CPA. The case x_n or y_n = 1 is used to determine the nth bit for the
diminished-1 representation, whereas the nth bit of the product is generated by the
modulo (2^n + 1) CPA in the case of ordinary representation.
The area and delay of these designs, considering the unit-gate model, for the
diminished-1 and ordinary representations are as follows:

Diminished-1: {9(n²/2) + 7((n/2) − 1)n} + {21n} + {(3/2)n⌈log₂ n⌉ + 8n} gates;
Ordinary: {9n((n + 1)/2) + 7((n/2) − 1)n} + {28n} + {(3/2)n⌈log₂ n⌉ + 8n} gates

and

Diminished-1: {6} + {4d(⌊n/2⌋ + 1)} + {0} + 2⌈log₂ n⌉ + 6;
Ordinary: {5} + {4d(⌊n/2⌋ + 1)} + {4} + 2⌈log₂ n⌉ + 6 unit-gate delays

where the three terms in braces correspond to the PPG (partial product generator), CSA
(carry-save adder) and COR (correction unit). A Sklansky parallel-prefix structure
with a fast output incrementer has been used for realizing the mod (2^n + 1) CPA,
following Zimmermann [16].
Chen et al. [33] have described an improved multiplier mod (2^n + 1) to be
used in IDEA. This design caters for both the non-zero-input and zero-input cases. In the
non-zero-input case, the partial product matrix has only n PPs, which have their MSBs
inverted and placed in the LSB positions. Next, these are added using a CSA based on a
Wallace tree which inverts the carry bit and places it in the LSB position of the next
adder. The next stage uses a 4-bit CLA in place of the 2-bit CLA suggested by
Zimmermann [16]. Next, for zero handling, a special adder is used to find ā + 2
(since (2^n a) mod (2^n + 1) = (−a) mod (2^n + 1) = ā + 2), where a is the input
when b is zero, and similarly to find b̄ + 2 when a is zero. The actual output or the
output of the special adder is selected using OR gates.
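The zero-handling identity can be verified exhaustively for n = 16 (an illustrative sketch, not the special-adder circuit of [33]):

```python
N = 16
MOD = (1 << N) + 1
MASK = (1 << N) - 1

def special_add(a):
    """Zero-operand handling: (2^n * a) mod (2^n + 1) equals the n-bit
    one's complement of a, plus 2."""
    return ((~a) & MASK) + 2

# check over all nonzero n-bit inputs a
for a in range(1, 1 << N):
    assert special_add(a) == ((1 << N) * a) % MOD
```

Since 2^n ≡ −1 mod (2^n + 1), the product 2^n·a reduces to 2^n + 1 − a, which is exactly ā + 2 for an n-bit a; hence the special adder needs only inverters and an increment-by-2.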
Vergos and Efstathiou [34] have described a mod (2^n + 1) multiplier for
weighted binary representation by extending the architecture of [25]. The architecture
in [25] suffers from the disadvantage that three n-bit parallel adders are
connected in series and a final row of multiplexers is needed. It uses slightly
different partial products from those of Wrzyszcz (see Figure 4.8). The correction
factor, however, is a constant (= 3). It also has n + 1 n-bit partial products.
They suggest the use of an inverted EAC parallel adder so that the correction factor
becomes 2 instead of 3. Further, the MSB of the result is computed as the group
propagate signal out of the n bits of the final inverted EAC adder. The reader is
referred to [34] for more details.
The area of this multiplier is 8n² + (9/2)n⌈log n⌉ − (13/2)n + 9, which is smaller
than that of [26]. The delay is 18 if n = 4, 4d(n + 1) + 2⌈log n⌉ + 6 if
(n + 1) is a number of the Dadda sequence, and 4d(n + 1) + 2⌈log n⌉ + 4 otherwise.
Chen et al. [35] have described mod (2^n + 1) multipliers with one operand B in
weighted form (n + 1 bits) and the other operand A in diminished-1 form. The
product P is in (n + 1)-bit weighted binary form.
This multiplier uses pure radix-4 Booth encoding without needing any changes,
unlike [31] and [32]. The multiplier accepts full inputs and does not need
any conversion between the weighted and diminished-1 forms. The multiplier uses
n-bit circuits only. This needs considering the two cases ā_n b̄_n = 1 and ā_n b̄_n = 0.
In the case ā_n b̄_n = 1, i.e. a_n = 0 and b_n = 0, when n is even, the Booth
encoding of B gives

B = −b_{n−1} + b_0 − 2b_1 + Σ_{i=1}^{n/2−1} (b_{2i−1} + b_{2i} − 2b_{2i+1}) 2^{2i}.

(Note that since the most significant digit is (b_{n−1} + b_n − 2b_{n+1}) = b_{n−1} and since
b_{n−1} 2^n ≡ −b_{n−1} mod (2^n + 1), we have combined this term with the least
significant digit.) The first term evidently needs a hard multiple (the digit can be −3
when b_0 = 0 and b_1 = b_{n−1} = 1). In order to avoid this hard multiple, the authors
suggest considering the cases b_{n−1} = 0 and 1 separately. Note that B can be written as

B = | Σ_{i=0}^{K−1} E_i 2^{2i} |_{2^n+1}   (4.22)
where E_i = b_{2i−1} + b_{2i} − 2b_{2i+1}, b_{−1} = 0, and K = n/2 for n even and (n + 1)/2 for
n odd. Note also that E_i can be 0, ±1 or ±2. Noting that

P = | K + Σ_{i=0}^{K−1} d(A E_i 2^{2i}) |_{2^n+1}   (4.23b)

|A · B|_{2^n+1} = | Σ_{i=0}^{K−1} PP_i + Σ_{i=0}^{K−1} C_i + K |_{2^n+1}   (4.24)

P = |AB|_{2^n+1} = | Σ_{i=0}^{K−1} PP_i + C + K |_{2^n+1}   (4.26)

where C = Σ_{i=0}^{K−1} c_i. Note that the correction bits can be merged into a single
correction term C of the form . . .0x_{i+1}0x_i0. . .x_10x_0. The area and delay of these
multipliers can be estimated as 7n² + (9/2)n⌈log₂ n⌉ − 4n + 11 and 4D(⌈n/2⌉ + 1) + 2
⌈log₂ n⌉ + 9, respectively.
The authors show that the new multipliers are area- and time-efficient compared
to the multipliers due to Zimmermann [16], Sousa and Chaves [32], Efstathiou et al. [26]
and Vergos and Efstathiou [34].
Chen and Yao [36] have considered mod (2^n + 1) multipliers with both operands in
diminished-1 form. The procedure is similar to that in the case considered above.
The result is in diminished-1 form and is given by

d(AB) = | Σ_{i=0}^{K−1} PP_i + C + d[−1] + K + 1 |_{2^n+1}   (4.27)

The implementation additionally needs logic deriving the modulo-dependent partial
product bits, (n + 1)/2 AND gates forming the CT vector, one three-input XOR gate for
the zero-indication bit of the result and one XOR gate at the input of the LS BE block
(in the case of even n).
Jaberipur and Alavi [38] have described an architecture for finding (X·Y) mod (2^n + 1) using double-LSB encoding of residues. The use of two LSBs can accommodate the residue 2^n as two words, one of value 2^n − 1 and an additional 1-bit word having value 1. Thus, the partial products need to be doubled to cater for the extra LSB bit in the multiplier Y (see Figure 4.11a). Next, the periodic properties of residues are used to map the bits above 2^n to the LSBs as negabits of value −1 in the LSB positions, since 2^{n+i} mod (2^n + 1) = (−1)·2^i mod (2^n + 1). Next, special adders which can accept inputs of either polarity can be used to add the positive bits and negabits. A variety of special full/half adders are possible to handle different combinations of positive bits and negabits. The partial products corresponding to the multiplier are shown in Figure 4.11b, where black-colored bits are posibits and white-colored bits are inverted negabits. These partial products can be arranged as n + 3 words which can be added using a CSA with inverted EAC followed by a conventional n-bit adder.
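Both facts used above are easy to confirm numerically; the following check (plain arithmetic, not the negabit adder structure of [38]) verifies the periodicity 2^{n+i} mod (2^n + 1) = −2^i mod (2^n + 1) and the double-LSB splitting of the residue 2^n:

```python
n = 8
m = (1 << n) + 1

# periodic property: a bit at weight 2^(n+i) re-enters at weight 2^i with negative sign
for i in range(n):
    assert pow(2, n + i, m) == (-(1 << i)) % m

# double-LSB encoding: the residue value 2^n is held as an n-bit word (2^n - 1)
# plus an extra 1-bit LSB word of value 1
assert ((1 << n) - 1) + 1 == 1 << n
```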
Muralidharan and Chang [39] have described radix-4 Booth encoded multi-modulus multipliers. A unified architecture to cater for the three moduli 2^n, 2^n − 1 and 2^n + 1 is developed. They show that for computing (X·Y) mod m, we need to compute

|Z|_m = |Σ_{i=0}^{n/2−1} 2^{2i} d_i X|_m for m = 2^n − 1 or 2^n
      = |Σ_{i=0}^{n/2−1} 2^{2i} d_i X + Y|_m if m = 2^n + 1.   (4.28)

y_{−1} = y_{n−1} if m = 2^n − 1,
       = 0 if m = 2^n,   (4.29)
       = x̄ form (complement of y_{n−1}) if m = 2^n + 1.
Figure 4.11 (a) Partial products in double-LSB multiplication and (b) partial products of modulo (2^n + 1) multiplication (adapted from [38] ©IEEE 2010)
Figure 4.12 (a) Multi-modulus radix-2² Booth encoder, (b) 3:1 multiplexer and (c) multi-moduli partial product generation for radix-2² Booth encoding for n = 4 (adapted from [39] ©IEEE 2013)
products can be combined into a single closed-form expression comprising two parts: a static bias K1 and a dynamic bias K2. The static bias is independent of d_i whereas the dynamic bias is dependent on d_i. The reader is urged to refer to [39] for details.
The partial products PP_i for i = 0, ..., (n/2 − 1) are formed by the Booth selector (BS) from pp_ij, 0, and p̄p_ij in the least significant 2i-bit positions for the moduli 2^n − 1, 2^n and 2^n + 1, respectively. When j = 2i, the input to the BS2 block is x_{n−1}, 0, x̄_{n−1} for moduli 2^n − 1, 2^n and 2^n + 1, respectively. Thus, the input to the BS2 block is also selected using a MUX3. The design for the case n = 4 is presented in Figure 4.12c for illustration.
The multi-moduli addition of n/2 + 3 partial products is given as

|Z|_m = |Σ_{i=0}^{n/2−1} PP_i + K1 + K2 + 0|_m in case of m = 2^n − 1 or m = 2^n

= |Σ_{i=0}^{n/2−1} PP_i + K1 + K2 + Y|_m if m = 2^n + 1.
Figure 4.13 (a) Multi-modulus partial product addition for radix-2² Booth encoding and (b) details of components in (a) (adapted from [39] ©IEEE 2013)
Note that in the carry feedback path a MUX will be needed to select c_out, 0 or c̄_out. A CSA tree followed by a parallel-prefix adder needed for adding the ((n/2) + 3) partial products is illustrated in Figure 4.13a, where • and ∘ are the pre-processing and post-processing blocks. In Figure 4.13b, details of the various blocks in (a) are presented.
Muralidharan and Chang [39, 40] have described unified mod (2^n − 1), mod 2^n and mod (2^n + 1) multipliers for radix-8 Booth encoding to reduce the number of partial products to ⌊n/3⌋ + 1. In these cases, considering that the diminished-1 representation is used for the mod (2^n + 1) case, the product can be written as [39]

(Z)_m = |Σ_{i=0}^{⌊n/3⌋} X d_i 2^{3i}|_m for m = 2^n − 1 or m = 2^n   (4.31a)
and

(Z)_m = |Σ_{i=0}^{⌊n/3⌋} X d_i 2^{3i} + X + Y|_m for m = 2^n + 1   (4.31b)

(Z)_m = |Σ_{i=0}^{⌊n/3⌋} PP_i|_m if m = 2^n − 1
      = |Σ_{i=0}^{⌊n/3⌋} PP_i + Σ_{i=0}^{⌊n/3⌋} K_i|_m if m = 2^n   (4.32a)

and

      = |Σ_{i=0}^{⌊n/3⌋} PP_i + Σ_{i=0}^{⌊n/3⌋} K_i + X + Y|_m if m = 2^n + 1   (4.32b)
Note that d_i can be 0, ±1, ±2, ±3 and ±4. The hard multiples ±3X are obtained using customized adders by adding X, 2X and reformulating the carry equations for moduli (2^n − 1) and (2^n + 1) for even and odd cases of i respectively. The modulo (2^n − 1) hard multiple generator (HMG) follows the Kalampoukas et al. adder [41] in the case of the modulus (2^n − 1) and the Vergos et al. adder [27] in the case of modulus (2^n + 1).
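As a plain-arithmetic illustration (not the reformulated carry equations of the HMG), the hard multiple |3X|_{2^n−1} can be produced by adding X to its one-bit circular left shift, which is |2X|_{2^n−1}, with an end-around carry:

```python
def eac_add(a, b, n):
    """Add two n-bit values modulo 2^n - 1 using a single end-around carry fold."""
    mask = (1 << n) - 1
    s = a + b
    return (s & mask) + (s >> n)   # fold the carry-out back into the LSB

def hard_multiple_3x(x, n):
    """|3X|_{2^n-1}: X plus the circular left shift of X by one bit (= |2X|)."""
    two_x = ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)
    return eac_add(x, two_x, n)

n = 8
# all-ones (2^n - 1) is the second representation of zero, hence the outer mod
assert all(hard_multiple_3x(x, n) % ((1 << n) - 1) == (3 * x) % ((1 << n) - 1)
           for x in range(1 << n))
```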
The authors have shown that the multi-modulus multipliers save 60 % of the area over the corresponding single-modulus multipliers. They increase the delay by about 18 % and 13 %, respectively, for the radix-4 and radix-8 cases. The power dissipation is higher by about 5 %.
The area of the mod (2^n − 1) multiplier in terms of unit gates [40] using radix-8 Booth encoding is 25.5n⌊n/3⌋ + 23.5(⌊n/3⌋ + 1) + 14.5n + 6n⌈log₂ n⌉, and that of the mod (2^n + 1) multiplier using radix-8 Booth encoding is 1.5⌊n/3⌋² + 25.5n(⌊n/3⌋ + 1) + 28(⌊n/3⌋ + 1) + 18.5n + 12n log₂ n + 52.5. The unit-gate area and time of the radix-4 Booth encoded triple-moduli multiplier are 6.75n² + 14.5n + 6 and 12 + 7d((n/2) + 3), where d(N) is the Dadda depth function for N inputs. These may be compared with those of the dual-modulus radix-4 Booth multiplier due to [37], which are 6.5n² + 8n + 2 and 12 + 6d((n/2) + 2), respectively.
4.4 Modulo Squarers
Figure 4.15 Squarers modulo 21 and modulo 33 following Piestrak (adapted from [42] ©IEEE 2002)
FMAs use separate hardware units for the various moduli. On the other hand, VMAs achieve hardware savings at the possible expense of delay and at the cost of decreased parallelism; these share the hardware for several moduli. VMAs can be used in serial-by-modulus (SBM) architectures [44], where not all moduli channels are processed in parallel.
As an illustration, for the moduli set {2^n, 2^n − 1, 2^n + 1}, squarers can use VMA and FMA [45]. The bit matrices for modulus (2^n − 1) and modulus (2^n + 1) in these two cases differ in only a few entries in some columns, and hence, using multiplexers, one of them can be selected. However, a correction term needs to be added in the case of modulus (2^n + 1). The correction term can be added in a separate level after the parallel-prefix adder as in Zimmermann's technique [16] (see Figure 4.16 for the VMA architecture for n = 5). This design uses a Sklansky parallel-prefix structure and a MUX in the carry reinsertion path. Note that in the case of modulus 2^n, no carry feedback is required. The outputs are |A²|_{2^n−1}, |A²|_{2^n} and |A²|_{2^n+1}.
The authors have shown that the VMA has 15 % higher delay than single-modulus architectures (SMA) but needs 50 % less area for area-optimized synthesis and 18 % less power dissipation for the moduli set {2^n, 2^n − 1, 2^n + 1} for n = 24. On the other hand, the FMA has area and power savings of 5 % and 10 % respectively over the single-modulus architecture with similar delay.
Figure 4.16 VMA squarer architecture for moduli 33, 32 and 31 (adapted from [45] ©IEEE 2009)

Adamidis and Vergos [46] have described a multiplication (AB)/sum of squares (A² + B²) unit mod (2^n − 1) and mod (2^n + 1) to perform one of these operations partially sharing the same hardware. They point out the commonality and differences between the two cases and show that, using multiplexers controlled by a select signal, one of the desired operations can be performed. In an alternative design, they reduce the multiplexers by defining new variables. Multiplication time is increased by 11 % for n ≥ 16 in the case of the mod (2^n − 1) multiplier.
In the case of the multiplier mod (2^n + 1), they consider diminished-1 operands. In the case of sum of squares, denoting SS = A² + B², we have

(SS*)_{2^n+1} = |(A* + 1)² + (B* + 1)² − 1|_{2^n+1}
             = |(A*)² + (B*)² + 2A* + 2B* + 1|_{2^n+1}   (4.34a)
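The identity (4.34a) follows directly from A = A* + 1 and B = B* + 1, and can be verified exhaustively for a small width (a numerical check of the algebra only, not of the shared-hardware unit of [46]):

```python
n = 4
m = (1 << n) + 1

def dim1(v):
    """Diminished-1 representation of a nonzero value v (1 <= v <= 2^n)."""
    return (v - 1) % m

for A in range(1, m):
    for B in range(1, m):
        As, Bs = dim1(A), dim1(B)
        SS = (A * A + B * B) % m
        # (SS*) = |(A*)^2 + (B*)^2 + 2A* + 2B* + 1| mod (2^n + 1)
        assert (SS - 1) % m == (As * As + Bs * Bs + 2 * As + 2 * Bs + 1) % m
```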
c_i = a_i m ∨ b_i m̄, d_i = a_i m̄ ∨ b_i m, e_i = m(a_i ⊕ b_i)   (4.35)

where ∨ is the logic OR function, m is the operation-select signal and the overbar denotes complementation. The partial product bit of the form a_i b_j with j > i (j < i) is substituted with a_i c_j (d_i b_j) in the multiplication partial product bit matrix. Bits of the form a_i b_i are retained as they are. In the case of sum of squares, the first column becomes the last column, implying rotation of all the bits. Note that in this case a_i a_j is substituted by a_i c_j, b_i b_j becomes d_i b_j and a_i ⊕ b_i is substituted with e_i. In a similar manner, the case for (2^n + 1) can also be implemented. In this case also, the change of variables can reduce the area:

c_i = a_i s ∨ b_i s̄ for 0 ≤ i ≤ n − 1, d_i = a_i s̄ ∨ b_i s for 0 ≤ i ≤ n − 1,
e_i = s(a_i ⊕ b_i) for 0 ≤ i ≤ n/2, e_i = s(a_i ⊙ b_i) for n/2 < i ≤ n − 1   (4.36)
Figure 4.17 (a) Initial partial product matrix, (b) modified partial product matrix and (c) final partial product matrix for the squarer mod (2^8 − 1) (adapted from [47] ©IEEE 2009)
those needed for Piestrak [42] and Efstathiou et al. [17]. The non-encoded squarers
are suitable for small n whereas Booth encoded squarers are suitable for medium
and large n.
Vergos and Efstathiou [49] and Muralidharan et al. [50] have described modulo (2^n + 1) squarers for (n + 1)-bit operands in normal form. This design also maps the product bits a_i a_j in the left columns beyond the (n − 1)th bit position to the right after inversion. Only the 2nth bit has a weight 2^{2n} and hence needs to be added as an LSB, since when a_n = 1 all other a_i are zero. A correction term of 2^n(2^n − 1 − n) will be required in total to take care of the one's complementing of these various bits.
Figure 4.18 (a) Initial partial product matrix, (b) folded partial product matrix for Booth encoded design, (c) Booth-folded partial product matrix and (d) final partial product matrix for the modulo (2^8 − 1) squarer (adapted from [47] ©IEEE 2009)
Next, the duplicated terms in each column, for example a_1 a_0 in the second column, are considered as a single bit in the immediate column to the left. The duplicated bits in the (n − 1)th bit position need to be inverted and put in the LSB position, needing another correction of 2^n(n/2). After these simplifications, the matrix becomes an n × n square. Using CSAs, the various rows need to be added and carry bits need to be mapped into LSBs after inversion, thus needing a further correction. The authors show that for both even and odd n, the correction term is 3. The authors suggest computing the final result as
R = |3 + Σ_{i=0}^{⌈n/2⌉} PP*_i|_{2^n+1}   (4.38a)

or

R = |2 + |Σ_{i=0}^{⌈n/2⌉} PP*_i + 1|_{2^n+1}|_{2^n+1}   (4.38b)

In the latter form, we can treat 2 as another PP and compute the term |2 + Σ_{i=0}^{⌈n/2⌉} PP*_i|_{2^n+1} as |C + S + 1|_{2^n+1}. The result is 2^n if C is the one's complement of S, i.e., if C + S = 2^n − 1.
Thus, the MSB of the result can be calculated distinctly from the rest, whereas the n LSBs can be computed using an n-bit adder which adds 1 when the carry output is zero. The authors have used a diminished-1 adder with a parallel-prefix carry computation unit [27].
The area requirement is n(n − 1)/2 + 7n(n − 1)/2 + 9n(log₂ n)/2 + n/2 + 7 in number of unit gates, and the delays for the n odd and n even cases are D_{Bin,odd} = 4H((n + 1)/2 + 1) + 2log₂ n + 4 and D_{Bin,even} = 4H((n + 1)/2 + 1) + 2log₂ n + 8, where H(l) is the height of a CSA tree of l inputs.
Vergos and Efstathiou [51] have described a diminished-1 modulo (2^n + 1) squarer design. They observe that, defining Q = |A²|_{2^n+1}, we have

Q* + 1 = |(A* + 1)²|_{2^n+1} = |(A*)² + 2A* + 1|_{2^n+1}   (4.39a)

or

Q* = |(A*)² + 2A*|_{2^n+1}   (4.39b)

Thus, one additional term 2A* needs to be added. These designs are superior to multiplier-based squarers regarding both area and delay [41]. The term 2A* can be added by left shifting A*, complementing the MSB and inserting it in the LSB position. The correction term for both the n odd and even cases is 1, which in diminished-1 form is "0". The area and delay requirements of this design are n(n − 1)/2 + 7n(n + 1)/2 + 9n(log₂ n)/2 + n + 6 and T = 4d((n + 1)/2) + 2log₂ n + 4 time units, where d(k) is the depth in FA stages of a Dadda tree of k operands.
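Equation (4.39b) is again pure algebra in A = A* + 1 and can be confirmed for every operand of a small width (a check of the identity, not of the squarer circuit of [51]):

```python
n = 5
m = (1 << n) + 1

for A in range(1, m):                 # A = 1 .. 2^n, so A* = A - 1 is well defined
    Astar = A - 1
    Q = (A * A) % m                   # Q = |A^2| mod (2^n + 1)
    Qstar = (Q - 1) % m               # diminished-1 form of the square
    assert Qstar == (Astar * Astar + 2 * Astar) % m   # (4.39b)
```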
Bakalis, Vergos and Spyrou [52, 53] have described modulo (2^n ± 1) squarers using radix-4 Booth encoding. They consider squarers for both the normal and diminished-1 representations in the case of modulus (2^n + 1). They use the Strollo and Caro [48] Booth folding encoding for both the mod (2^n − 1) and mod (2^n + 1) cases. In the case of mod (2^n − 1), the partial product matrix is the same as in the case of Spyrou et al. [47]. In the case of mod (2^n + 1) using the diminished-1 representation, for even n which are multiples of 4 the correction term t is (888...8)₁₆, and for even n which are not multiples of 4 the correction term t is (222...2)₁₆, where the subscript 16 means that these are in hexadecimal form.
Note that the diminished-1 modulo (2^n + 1) squarer computes |A² + 2A|_{2^n+1} in the case a_n = 0. As a result, we need to add |−2A|_{2^n+1} for the normal representation, which can be written as ā_{n−2} ā_{n−3} ... ā_0 a_{n−1} provided that an additional correction term of 3 is taken into account. Note also that in the case A = 2^n, i.e., a_n = 1, the LSB can be modified as a_{n−1} OR a_n. The correction term in the diminished-1 case is increased by 2.
In the case of odd n, the architectures are applicable as long as the input operands are extended by 1 bit by adding a zero at the MSB position. The partial product matrices for the diminished-1 and normal squarers are presented in Figure 4.19a, b, where C_i = A_i A_i for i = 0, ..., 3 and P_i = Σ_{k=i+1}^{n/2−1} 2^{2(k−1−i)} A_i A_k. The authors observe that the height of the partial product matrix is reduced compared to earlier methods. The authors show that their designs offer up to 38 % less implementation area than previous designs and also have a small improvement in delay as well.
Figure 4.19 Partial product matrix for the mod (2^8 + 1) squarer: (a) diminished-1 case and (b) normal case (adapted from [52] ©Elsevier 2011)
References
1. M.A. Soderstrand, C. Vernia, A high-speed low cost modulo pi multiplier with RNS arithmetic
applications. Proc. IEEE 68, 529–532 (1980)
2. G.A. Jullien, Implementation of multiplication modulo a prime number with application to
number theoretic transforms. IEEE Trans. Comput. 29, 899–905 (1980)
3. D. Radhakrishnan, Y. Yuan, Novel approaches to the design of VLSI RNS multipliers. IEEE
Trans. Circuits Syst. 39, 52–57 (1992)
4. M. Dugdale, Residue multipliers using factored decomposition. IEEE Trans. Circuits Syst. 41,
623–627 (1994)
5. A.S. Ramnarayan, Practical realization of mod p, p prime multiplier. Electron. Lett. 16,
466–467 (1980)
6. E.F. Brickell, A fast modular multiplication algorithm with application to two-key cryptogra-
phy, in Advances in Cryptography, Proceedings Crypto ‘82 (Plenum, New York, 1983),
pp. 51–60
7. E. Lu, L. Harn, J. Lee, W. Hwang, A programmable VLSI architecture for computing multiplication and polynomial evaluation modulo a positive integer. IEEE J. Solid-State Circuits SC-23, 204–207 (1988)
8. B.S. Prasanna, P.V. Ananda Mohan, Fast VLSI architectures using non-redundant multi-bit
recoding for computing AY mod N. Proc. IEE ECS 141, 345–349 (1994)
9. A.A. Hiasat, New efficient structure for a modular multiplier for RNS. IEEE Trans. Comput.
C-49, 170–174 (2000)
10. E.D. Di Claudio, F. Piazza, G. Orlandi, Fast combinatorial RNS processors for DSP applica-
tions. IEEE Trans. Comput. 44, 624–633 (1995)
11. P.L. Montgomery, Modular multiplication without trial division. Math. Comput. 44, 519–521
(1985)
12. T. Stouraitis, S.W. Kim, A. Skavantzos, Full adder based arithmetic units for finite integer
rings. IEEE Trans. Circuits Syst. 40, 740–744 (1993)
13. V. Paliouras, K. Karagianni, T. Stouraitis, A low complexity combinatorial RNS multiplier.
IEEE Trans. Circuits Syst. II 48, 675–683 (2001)
14. G. Dimitrakopulos, V. Paliouras, A novel architecture and a systematic graph based optimi-
zation methodology for modulo multiplication. IEEE Trans. Circuits Syst. I 51, 354–370
(2004)
15. Z. Wang, G.A. Jullien, W.C. Miller, An algorithm for multiplication modulo (2N-1), in Pro-
ceedings of 39th Midwest Symposium on Circuits and Systems, Ames, IA, pp. 1301–1304
(1996)
16. R. Zimmermann, Efficient VLSI implementation of modulo (2n ± 1) addition and multiplication, in Proceedings of IEEE Symposium on Computer Arithmetic, pp. 158–167 (1999)
17. C. Efstathiou, H.T. Vergos, D. Nikolos, Modified Booth modulo 2n-1 multipliers. IEEE Trans.
Comput. 53, 370–374 (2004)
18. R. Muralidharan, C.H. Chang, Radix-8 Booth encoded modulo 2n-1 multipliers with adaptive
delay for high dynamic range residue number system. IEEE Trans. Circuits Syst. I Reg. Pap.
58, 982–993 (2011)
19. G.W. Bewick, Fast multiplication: algorithms and implementation, Ph.D. Dissertation, Stanford University, Stanford, 1994
20. A.V. Curiger, H. Bonnenberg, H. Kaeslin, Regular architectures for multiplication modulo (2n+1). IEEE J. Solid-State Circuits SC-26, 990–994 (1991)
21. X. Lai, On the design and security of block ciphers, Ph.D Dissertation, ETH Zurich, No.9752,
1992
22. A. Hiasat, New memoryless mod (2n ± 1) residue multiplier. Electron. Lett. 28, 314–315 (1992)
23. M. Bahrami, B. Sadeghiyan, Efficient modulo (2n+1) multiplication schemes for IDEA, in
Proceedings of IEEE ISCAS, vol. IV, pp. 653–656 (2000)
24. Z. Wang, G.A. Jullien, W.C. Miller, An efficient tree architecture for modulo (2n+1) multi-
plication. J. VLSI Signal Process. Syst. 14, 241–248 (1996)
25. A. Wrzyszcz, D. Milford, A new modulo 2α+1 multiplier, in IEEE International Conference on
Computer Design: VLSI in Computers and Processors, pp. 614–617 (1993)
26. C. Efstathiou, H.T. Vergos, G. Dimitrakopoulos, D. Nikolos, Efficient diminished-1 modulo 2n
+1 multipliers. IEEE Trans. Comput. 54, 491–496 (2005)
27. H.T. Vergos, C. Efstathiou, D. Nikolos, Diminished-1 modulo 2n+1 adder design. IEEE Trans.
Comput. 51, 1389–1399 (2002)
28. Y. Ma, A simplified architecture for modulo (2n+1) multiplication. IEEE Trans. Comput. 47,
333–337 (1998)
29. R. Chaves, L. Sousa, Faster modulo (2n+1) multipliers without Booth Recoding, in XX
Conference on Design of Circuits and Integrated Systems. ISBN 972-99387-2-5 (Nov 2005)
30. R. Chaves, L. Sousa, Improving residue number system multiplication with more balanced
moduli sets and enhanced modular arithmetic structures. IET Comput. Digit. Tech. 1, 472–480
(2007)
31. L. Sousa, Algorithm for modulo (2n+1) multiplication. Electron. Lett. 39, 752–753 (2003)
32. L. Sousa, R. Chaves, A universal architecture for designing efficient modulo 2n+1 multipliers.
IEEE Trans. Circuits Syst. I 52, 1166–1178 (2005)
33. Y.J. Chen, D.R. Duh, Y.S. Han, Improved modulo (2n+1) multiplier for IDEA. J. Inf. sci. Eng.
23, 907–919 (2007)
34. H.T. Vergos, C. Efstathiou, Design of efficient modulo 2n+1 multipliers. IET Comput. Digit.
Tech 1, 49–57 (2007)
35. J.W. Chen, R.H. Yao, W.J. Wu, Efficient modulo 2n+1 multipliers. IEEE Trans. VLSI Syst. 19,
2149–2157 (2011)
36. J.W. Chen, R.H. Yao, Efficient modulo 2n+1 multipliers for diminished-1 representation. IET
Circuits Devices Syst. 4, 291–300 (2010)
37. E. Vassalos, D. Bakalis, H.T. Vergos, Configurable Booth-encoded modulo 2n ± 1 multipliers. IEEE PRIME 2012, 107–111 (2012)
38. G. Jaberipur, H. Alavi, A modulo 2n+1 multiplier with double LSB encoding of residues, in Proceedings of IEEE ISCAS, pp. 147–150 (2010)
39. R. Muralidharan, C.H. Chang, Radix-4 and Radix-8 Booth encoded multi-modulus multipliers.
IEEE Trans. Circuits Syst. I 60, 2940–2952 (2013)
40. R. Muralidharan, C.H. Chang, Area-Power efficient modulo 2n-1 and modulo 2n+1 multipliers
for {2n-1, 2n, 2n+1} based RNS. IEEE Trans. Circuits Syst. 59, 2263–2274 (2012)
41. L. Kalampoukas, D. Nikolos, C. Efstathiou, H.T. Vergos, J. Kalamatianos, High speed parallel
prefix modulo (2n-1) adders. IEEE Trans. Comput. 49, 673–680 (2000)
42. S.J. Piestrak, Design of squarers modulo A with low-level pipelining. IEEE Trans. Circuits
Syst. II Analog Digit. Signal Process. 49, 31–41 (2002)
43. V. Paliouras, T. Stouraitis, Multifunction architectures for RNS processors. IEEE Trans.
Circuits Syst. II 46, 1041–1054 (1999)
44. W.K. Jenkins, A.J. Mansen, Variable word length DSP using serial by modulus residue
arithmetic, in Proceedings of IEEE International Conference on ASSP, pp. 89–92 (1993)
45. R. Muralidharan, C.H. Chang, Fixed and variable multi-modulus squarer architectures for
triple moduli base of RNS, in Proceedings of IEEE ISCAS, pp. 441–444 (2009)
46. D. Adamidis, H.T. Vergos, RNS multiplication/sum-of-squares units. IET Comput. Digit.
Tech. 1, 38–48 (2007)
47. A. Spyrou, D. Bakalis, H.T. Vergos, Efficient architectures for modulo 2n-1 squarers, in Proceedings of IEEE International Conference on DSP 2009, pp. 1–6 (2009)
48. A. Strollo, D. Caro, Booth Folding encoding for high performance squarer circuits. IEEE
Trans. CAS II 50, 250–254 (2003)
49. H.T. Vergos, C. Efstathiou, Efficient modulo 2n+1 squarers, in Proceedings of XXI Conference
on Design of Circuits and Integrated Systems, DCIS (2006)
50. R. Muralidharan, C.H. Chang, C. Jong, A low complexity modulo 2n+1 squarer design, in
Proceedings of IEEE Asia Pacific Conference on Circuits and Systems, pp. 1296–1299 (2008)
51. H.T. Vergos, C. Efstathiou, Diminished-1 modulo 2n+1 squarer design. Proc. IEE Comput.
Digit. Tech 152, 561–566 (2005)
52. D. Bakalis, H.T. Vergos, A. Spyrou, Efficient modulo 2n ± 1 squarers. Integr. VLSI J. 44, 163–174 (2011)
53. D. Bakalis, H.T. Vergos, Area-efficient multi-moduli squarers for RNS, in Proceedings of 13th
Euromicro Conference on Digital System Design: Architectures, Methods and Tools,
pp. 408–411 (2010)
Further Reading
B. Cao, T. Srikanthan, C.H. Chang, A new design method to modulo 2n-1 squaring, in Proceedings
of ISCAS, pp. 664–667 (2005)
A.E. Cohen, K.K. Parhi, Architecture optimizations for the RSA public key cryptosystem: a
tutorial. IEEE Circuits Syst. Mag. 11, 24–34 (2011)
Chapter 5
RNS to Binary Conversion
This important topic has received extensive attention in the literature. The choice of the moduli set in an RNS is decided by the speed of RNS to binary conversion needed for efficiently performing operations such as comparison, scaling, sign detection and
error correction. Both ROM-based and non-ROM-based designs will be of interest.
The number of moduli to be chosen is decided by the desired dynamic range, word
length of the moduli and ease of RNS to binary conversion. There are two basic
classical approaches to converting a number from RNS to binary form. These are
based on Chinese Remainder Theorem (CRT) and Mixed Radix Conversion (MRC)
[1]. Several new techniques have been introduced recently such as New CRT-I,
New CRT-II, Mixed-Radix CRT, quotient function, core function and diagonal
function. All these will be presented in some detail.

5.1 CRT-Based RNS to Binary Conversion

The binary number X corresponding to given residues (x1, x2, x3, ..., xn) in the RNS
{m1, m2, m3, ..., mn} can be derived using CRT as

X = (x1 |(1/M1)|_{m1} M1 + x2 |(1/M2)|_{m2} M2 + ... + xn |(1/Mn)|_{mn} Mn) mod M   (5.1)
1}, {2^n − 1, 2^n, 2^{n−1} − 1} since n bits of the decoded number X are available directly as the residue corresponding to modulus 2^n, and the modulo reduction needed at the end with respect to the product of the remaining moduli can be efficiently implemented in the case of the first three moduli sets. For some of these moduli sets, the moduli have word lengths ranging from n to 2n bits, which may be a disadvantage since the largest modulus decides the instruction cycle time of the RNS processor. For general moduli sets, CRT may necessitate the use of complex modulo reduction hardware and needs ROM-based implementations for obtaining the x'_i = |x_i (1/M_i)|_{m_i} values and multipliers for calculating x'_i M_i.
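Equation (5.1) translates directly into code. The sketch below implements plain CRT reconstruction for any set of pairwise coprime moduli (Python's three-argument pow with exponent −1 computes the multiplicative inverse |1/M_i|_{m_i}):

```python
from math import prod

def crt_to_binary(residues, moduli):
    """Reconstruct X from its residues via (5.1): X = |sum x_i |1/M_i|_{m_i} M_i|_M."""
    M = prod(moduli)
    X = 0
    for x_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        X += x_i * pow(M_i, -1, m_i) * M_i   # x_i * |(1/M_i)|_{m_i} * M_i
    return X % M

moduli = [7, 15, 16]                 # pairwise coprime
X = 1000
residues = [X % m for m in moduli]
assert crt_to_binary(residues, moduli) == X
```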
We will next consider the RNS to binary conversion for the moduli set {2^n − 1, 2^n, 2^n + 1} based on CRT in detail, in view of the immense attention paid to it in the literature [2–20]. For this moduli set, denoting m1 = 2^n − 1, m2 = 2^n and m3 = 2^n + 1, we have M = 2^n(2^{2n} − 1), M1 = 2^n(2^n + 1), M2 = 2^{2n} − 1, M3 = 2^n(2^n − 1) and (1/M1) mod (2^n − 1) = 2^{n−1}, (1/M2) mod 2^n = −1, (1/M3) mod (2^n + 1) = 2^{n−1} + 1. Thus, using CRT [4], we can obtain from (5.1) the decoded number as
X = |2^n(2^n + 1)2^{n−1} x1 − (2^{2n} − 1)x2 + 2^n(2^n − 1)(2^{n−1} + 1)x3|_{2^n(2^{2n}−1)}
  = Y·2^n + x2   (5.2)

Since we know the n LSBs of X as x2, we can obtain the 2n MSBs of X by computing

Y = ((X − x2)/2^n) mod (2^{2n} − 1)   (5.3)
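The arithmetic skeleton of (5.2) and (5.3) can be stated directly as a behavioural sketch (big-integer Python, not the bit-manipulation adder networks that follow):

```python
def reverse_convert(x1, x2, x3, n):
    """Decode X from residues for {2^n - 1, 2^n, 2^n + 1} using (5.2) and (5.3)."""
    M = (1 << n) * ((1 << (2 * n)) - 1)
    # CRT expansion (5.2) with multiplicative inverses 2^(n-1), -1 and 2^(n-1)+1
    X = ((1 << n) * ((1 << n) + 1) * (1 << (n - 1)) * x1
         - ((1 << (2 * n)) - 1) * x2
         + (1 << n) * ((1 << n) - 1) * ((1 << (n - 1)) + 1) * x3) % M
    Y = (X - x2) >> n                    # (5.3): the 2n MSBs, already < 2^(2n) - 1
    assert X == (Y << n) + x2            # X = Y*2^n + x2
    return X

n = 5
for X in range(0, (1 << n) * ((1 << (2 * n)) - 1), 97):
    assert reverse_convert(X % ((1 << n) - 1), X % (1 << n), X % ((1 << n) + 1), n) == X
```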
A = x1,0 x1,n−1 x1,n−2 ... x1,2 x1,1 x1,0 x1,n−1 x1,n−2 ... x1,2 x1,1   (5.5)

B = x̄2,n−1 x̄2,n−2 ... x̄2,2 x̄2,1 x̄2,0 11...1 (n trailing ones)   (5.6)

and
Piestrak [8] has suggested adding the four words A, B, C and D given in (5.5), (5.6), (5.7a) and (5.7b) using a (4; 2^{2n} − 1) MOMA (multi-operand modulo adder). Two levels of carry-save adders (CSA) followed by a carry-propagate adder (CPA), all with end-around carry (EAC), will be required in the cost-effective (CE) version. Piestrak has suggested a high-speed (HS) version wherein the mod (2^{2n} − 1)
E = x2,n−1 x2,n−2 ... x2,1 x2,0 (x3,n + x3,n−1) x3,n−2 ... x3,0   (5.8a)

F = (x3,n + x3,0) x3,n−1 ... x3,2 x3,1 x3,0 x3,n−1 ... x3,1   (5.8b)
Thus, the three words given by (5.5), (5.8a) and (5.8b) need to be summed in a carry-save adder with end-around carry (to take care of the mod (2^{2n} − 1) operation), and the resulting sum and carry vectors are added using a CPA with end-around carry (see Figure 5.2).
Several improvements have been made in the past two decades by examining the bit structure of the three operands and by using n-bit CPAs in place of a 2n-bit CPA to reduce the addition time [10–19].
The RNS to binary converters for RNS using moduli of the form (2^n − 1) can take advantage of the following basic properties:
(a) (−A) mod (2^n − 1) is the one's complement of A.
(b) (2^x·A) mod (2^n − 1) is obtained by a circular left shift of A by x bits, where A is an n-bit integer.
(c) The logic can be simplified by noting that full adders with a constant "1" as input can be replaced by a pair of two-input XNOR and OR gates. Similarly, full adders with one input "0" can be replaced by pairs of XOR and AND gates. Note also that a full adder with one input "0" and one input "1" can be reduced to just an inverter.
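Properties (a) and (b) are the workhorses of all these converters. A short exhaustive check, with ordinary Python integers standing in for the bit-level wiring:

```python
n = 8
mask = (1 << n) - 1          # 2^n - 1

for A in range(1 << n):
    # (a) negation mod (2^n - 1) is the bitwise one's complement
    assert (A ^ mask) % mask == (-A) % mask
    # (b) multiplication by 2^x mod (2^n - 1) is a circular left shift by x bits
    for x in range(n):
        rot = ((A << x) | (A >> (n - x))) & mask
        assert rot % mask == (A << x) % mask
```

The outer `% mask` on both sides absorbs the double representation of zero (both 0 and 2^n − 1 denote zero).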
Bharadwaj et al. [10] have observed that of the four operands to be added in Piestrak's technique, three operands have identical bits in the lower n-bit and upper (n − 1)-bit fields. Hence, n − 1 FAs can be saved. Strictly speaking, since these have one input as "1", we save (n − 1) XNOR/OR gate pairs.
Next, a modified carry-select adder has been suggested in place of the 2n-bit CPA in order to reduce its propagation delay. This needs four n-bit CPAs and additional multiplexers. The authors also derive the condition for selecting the outputs of the multiplexers such that a double representation of zero can be avoided.
Wang et al. [12] have suggested rewriting |(2^n + 1)2^{n−1}x3 − x3 − 2^n x2| in (5.4) as

|x3,0 x̂3,n−1 ... x̂3,0 x3,n−1 ... x3,1 + x̂2,n−1 x̂2,n−2 ... x̂2,0 x2,n−1 x2,n−1 x2,n−1 x2,n−1|_{2^{2n}−1}

where x̂ denotes the complement of bit x.
Figure 5.2 Architecture of the RNS to binary converter for the moduli set {2^n − 1, 2^n, 2^n + 1} due to Dhurkadas (adapted from [9] ©IEEE 1998)
Conway and Nelson [11] have suggested an RNS to binary converter based on the CRT expansion (5.2) whose dynamic range is less than 2^n(2^{2n} − 1) by 2^n(2^n − 2) − 1. They rewrite the expression for CRT in (5.2) as X = D2·2^{2n} + D1·2^n + D0 so that the upper and middle n bits can be computed using n-bit hardware. However, there is forward and backward dependence between these two n-bit computations.
Gallaher et al. [18] have considered the equations for the residues of the desired 3n-bit output word X = D2·2^{2n} + D1·2^n + D0 corresponding to the three moduli, given as

(D2 + D1 + D0) mod (2^k − 1) = x1, (D2 − D1 + D0) mod (2^k + 1) = x3, and D0 = x2   (5.9a)

Note that m can be (0, 1, 2) and n can be (−1, 0, 1). Thus the authors explore which values of m and n will yield the correct result. This technique has been improved in [19].
CRT can be applied to other RNS systems having three or more moduli. The various operands to be summed can be easily obtained by bit manipulations (rotation of words and bit inversions), but the final summation and modulo reduction can be very involved. The three-moduli system described above is hence believed to be attractive. CRT has been applied to other moduli sets {2^{2n}, 2^{2n} − 1, 2^{2n} + 1} [48, 49], {2^n, 2^n − 1, 2^{n+1} − 1} [50], {2^k, 2^k − 1, 2^{k−1} − 1} [47], and {2^n + 1, 2^{n+k}, 2^n − 1} [21].
Chaves and Sousa [21] have suggested a moduli set {2^n + 1, 2^{n+k}, 2^n − 1} with variable k such that 0 ≤ k ≤ n. CRT can be used to decode the number corresponding to the given residues. In this case also, the multiplicative inverses needed in CRT are very simple and are given as

|1/M1|_{m1} = −2^{n−k−1}, |1/M2|_{m2} = −1, |1/M3|_{m3} = 2^{n−k−1}   (5.10)

where m1 = 2^n + 1, m2 = 2^{n+k}, m3 = 2^n − 1.
Hence, similar to the case of the moduli set {2^n − 1, 2^n, 2^n + 1}, using CRT, the reverse conversion can be carried out by mapping the residues into 2n-bit words and adding them using a mod (2^{2n} − 1) adder. Hiasat and Sweidan [22] have independently considered this case with k = n, i.e., the moduli set {2^{2n}, 2^n − 1, 2^n + 1}.
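The inverses in (5.10) can be confirmed numerically for 0 ≤ k < n (signs as reconstructed above; at k = n the shift form of 2^{n−k−1} no longer applies directly):

```python
# numerical confirmation of the multiplicative inverses in (5.10)
for n in range(3, 9):
    for k in range(0, n):                      # 0 <= k < n
        m1, m2, m3 = (1 << n) + 1, 1 << (n + k), (1 << n) - 1
        M = m1 * m2 * m3
        inv = lambda a, m: pow(a % m, -1, m)   # |1/a|_m via built-in modular inverse
        assert inv(M // m1, m1) == (-(1 << (n - k - 1))) % m1
        assert inv(M // m2, m2) == (-1) % m2
        assert inv(M // m3, m3) == (1 << (n - k - 1)) % m3
```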
Soderstrand et al. [23] have suggested the computation of the weighted sum in (5.1) scaled by 2/M. The fractions can be represented as words with one integer bit and several fractional bits and added. Multiples of 2 in the integer part of the resulting sum can be discarded. The integer part will convey information about the sign of the number. However, Vu [24] has pointed out that the precision of these
Considering the error due to the finite-bit representation of u_i as e_i, such that 0 ≤ e_i ≤ 2^{−t} where t + 1 bits are used to represent each u_i, it can be shown that the total error must satisfy e < 2/M for M even and e < 1/M for M odd, i.e., n·2^{−t} ≤ 2/M for M even and n·2^{−t} ≤ 1/M for M odd.
An example will illustrate the idea of sign detection using Vu's CRT implementation.
Example 5.2 Consider the moduli set {11, 13, 15, 16} and the two cases corresponding to the integers +1 and −1. Use Vu's technique for sign detection.
The residues are (1, 1, 1, 1) and (10, 12, 14, 15), respectively. The expression (5.11b) becomes in the first case

Xs = 16/11 + 2/13 + 4/15 + 2/16

Adding the first two terms and the last two terms and representing them in fractional form, we get

v1 = u1 + u2 = 16/11 + 2/13 = 230/143 and v2 = u3 + u4 = 4/15 + 2/16 = 94/240.

v̂1 = 1.1001101111000000, v̂2 = 0.0110010001000101, X̂s = 0.0000000000000101
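The computation in Example 5.2 can be replayed with exact rational arithmetic. The sketch below uses Python's Fraction for clarity rather than the truncated fixed-point words of [23, 24]: it scales the CRT sum by 2/M and reads the sign from the integer part (a sum below 1 means X < M/2, i.e., positive):

```python
from fractions import Fraction
from math import prod

def scaled_crt_sign(residues, moduli):
    """Sum 2*x'_i/m_i mod 2; a result below 1 means the number is positive."""
    M = prod(moduli)
    s = Fraction(0)
    for x, m in zip(residues, moduli):
        xp = (x * pow(M // m, -1, m)) % m          # x'_i = |x_i (1/M_i)|_{m_i}
        s += Fraction(2 * xp, m)                   # term 2*x'_i/m_i of the scaled sum
    s %= 2                                         # discard multiples of 2
    return 'positive' if s < 1 else 'negative'

moduli = [11, 13, 15, 16]
assert scaled_crt_sign([1, 1, 1, 1], moduli) == 'positive'      # X = +1
assert scaled_crt_sign([10, 12, 14, 15], moduli) == 'negative'  # X = -1 = M - 1
```

With exact fractions the scaled sum equals 2X/M exactly; the truncation error analysis above only matters for the fixed-point hardware realization.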
Cardarilli et al. [25] have presented a systolic architecture for scaled residue to
binary conversion. Note that it is based on scaling the result of the CRT expansion for an
N-moduli RNS by M to obtain

X/M = |Σ_{i=1}^{N} x′_i/m_i|_1   (5.12)

where x′_i = |x_i (1/M_i)|_{m_i}. Note that the x′_i/m_i are all fractions. The bits representing
x′_i/m_i can be obtained using an iterative process. The resulting N words can be
added using a carry-save-adder tree followed by a CPA, with the overflow ignored. The
fraction x′_i/m_i can be written as

x′_i/m_i = H_1/2 + H_2/2^2 + H_3/2^3 + ⋯ + H_L/2^L + ε_L/(2^L m_i)   (5.13)

where ε_L is the error due to truncation of x′_i/m_i, (5.13) is the radix-2 expansion of
x′_i/m_i and L = ⌈log_2(NM)⌉ + 1. By multiplying (5.13) by 2^L m_i and taking mod
2 on both sides, we obtain

|H_L|_2 = H_L = |ε_L m_i^{−1}|_2   (5.14)

where m_i^{−1} is the multiplicative inverse of m_i mod 2. To compute the H_{L−1} term, we
need to multiply (5.13) by 2^{L−1} m_i and take mod 2 on both sides and use the values of ε_L
and H_L already obtained. Note that ε_L = a_1 defined by x′_i 2^L = a_1 + a_2 m_i and
a_2 = H_1 2^{L−1} + H_2 2^{L−2} + H_3 2^{L−3} + ⋯ + H_L. In a similar manner, the other H_i
values can be obtained.
As an illustration, for m_i = 5 and residue x′_i = 1 in the moduli set {3, 5, 7}, N = 3,
M = 105 and L = ⌈log_2(NM)⌉ + 1 = 10. We thus have 2^L = 1024, a_1 = 4,
a_2 = 204 and 1·1024 = 4 + 204·5. Thus ε_L = 4 and we can compute iteratively
a_2 = 204 bit by bit.
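The bit recursion can be sketched as follows: since each m_i is odd, m_i^{−1} mod 2 = 1, and H_j reduces to the parity of ε_j = |x′_i 2^j|_{m_i}. The function below is an illustrative software sketch, not the systolic architecture of [25]:

```python
def frac_bits(x, m, L):
    """Bits H_1..H_L of x/m = sum_j H_j/2^j (m odd, 0 <= x < m),
    obtained from the mod-2 relation H_j = |eps_j * m^{-1}|_2
    with eps_j = |x * 2^j|_m and m^{-1} = 1 mod 2 for odd m."""
    return [((x << j) % m) % 2 for j in range(1, L + 1)]

# Example from the text: m = 5, x = 1, L = 10.
bits = frac_bits(1, 5, 10)
a2 = sum(b << (10 - j) for j, b in enumerate(bits, 1))   # H_1 2^{L-1} + ... + H_L
```

For the text's example this reproduces a_2 = 204 and ε_L = 1·1024 − 204·5 = 4.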
Dimauro et al. [26] have introduced the concept of a "quotient function" for
performing RNS to binary conversion. This technique, denoted as the quotient function
technique (QFT), is akin to the Andraros and Ahmad technique [4]. In this method, one
modulus m_j is of the form 2^k. Consider the moduli set {m_1, m_2, . . ., m_j}. The quotient
function is defined as

QF_j(|X|_M) = ⌊|X|_M/m_j⌋ = |Σ_{i=1}^{N} b_i x_i|_{M_j}   (5.15)

where M_j = M/m_j and

b_i = |−1/m_j|_{M_j} for i = j   (5.16a)
and

b_i = | |1/SQ_j|_{M_j} · |1/m_j|_{M_j} · (M_j/m_i) |_{M_j} for i = 1, . . ., N, i ≠ j   (5.16b)

Note that the sum of quotients SQ_j is defined as SQ_j = Σ_{i=1, i≠j}^{N} (M_j/m_i). Thus, the RNS
to binary conversion procedure using (5.15) is similar to the CRT computation. Note
that the technique of Andraros and Ahmad [4] realizes the quotient through CRT.
Appending QF_j with the residue corresponding to 2^k, viz., r_j, as the least significant
k bits will yield the final decoded number |X|_M. An example will be illustrative.
Example 5.3 Consider the moduli set {5, 7, 9, 11, 13, 16} and the residues (2, 3,
5, 0, 4, 11). Obtain the decoded number using the quotient function.

Note that m_j = 16 and M_j = 45,045. Next we have |1/m_j|_{M_j} = 8446 and SQ_j =
(5·7·9·11 + 5·7·9·13 + 5·9·11·13 + 5·7·11·13 + 7·9·11·13) = 28,009. It follows
that |1/SQ_j|_{M_j} = 16,174. Next, using (5.16), the b_i s can be
estimated as b(1) = 36,036, b(2) = 12,870, b(3) = 20,020, b(4) = 12,285, b(5) =
17,325 and b(6) = 36,599. As an illustration,

b(2) = | |1/SQ_j|_{M_j} · |1/m_j|_{M_j} · (M_j/m_2) |_{M_j} = |(5·9·11·13)·16,174·8446|_{45,045} = 12,870.

Next, the binary number
corresponding to the given residues can be computed as (2·36,036 + 3·12,870
+ 5·20,020 + 0·12,285 + 4·17,325 + 11·36,599) mod 45,045 = 6996 which
in binary form is 1 101 101 010 100. Appending this with the residue 11 in binary
form 1011 corresponding to modulus 16, we obtain the decoded number as
111,947.
■
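The quotient function decode of Example 5.3 can be sketched as follows, using (5.15) and (5.16); the function and variable names are illustrative:

```python
def qft_decode(residues, moduli, j):
    """Quotient function technique: m_j is the power-of-two modulus,
    QF = floor(X/m_j) mod (M/m_j) is formed from the b_i of (5.16),
    and X = QF * m_j + x_j (appending the residue as the k LSBs)."""
    M = 1
    for m in moduli:
        M *= m
    mj = moduli[j]
    Mj = M // mj
    inv_mj = pow(mj, -1, Mj)             # |1/m_j|_{M_j}
    SQ = sum(Mj // m for i, m in enumerate(moduli) if i != j)
    inv_SQ = pow(SQ, -1, Mj)             # |1/SQ_j|_{M_j}
    b = []
    for i, m in enumerate(moduli):
        if i == j:
            b.append((-inv_mj) % Mj)                       # (5.16a)
        else:
            b.append((inv_SQ * inv_mj * (Mj // m)) % Mj)   # (5.16b)
    qf = sum(bi * x for bi, x in zip(b, residues)) % Mj
    return qf * mj + residues[j]
```

For the residues (2, 3, 5, 0, 4, 11) of Example 5.3 this yields 111,947.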
Dimauro et al. [26] observed that since SQ can be large, an alternative technique
can be used in which a first level considers the original moduli set as two subsets
whose residues are reverse converted in parallel. Next, the quotient function
can be evaluated. This reduces the magnitude of SQ, leading to
simpler hardware. As an illustration, for the same moduli set considered above,
we can consider the subsets {5, 9, 16} and {7, 11, 13}, giving SQ = 1721 as against
28,009 in the previous case.
Kim et al. [27] have suggested an RNS to binary conversion technique with
rounded error compensation. In this technique, the CRT result is multiplied by 2^d/M
90 5 RNS to Binary Conversion
where M is the product of the moduli and d is the output word length deciding
parameter:

X_s = (2^d/M)·X = Σ_{i=1}^{N} |(1/M_i) x_i|_{m_i} (2^d/m_i) − α·2^d   (5.17)

X_s = | f(x_1, x_2) + f(x_3, x_4) |_{2^d}   (5.18a)

where

f(x_1, x_2) = | ⌈ |(1/M_1) x_1|_{m_1} (2^d/m_1) + |(1/M_2) x_2|_{m_2} (2^d/m_2) ⌋ |_{2^d}   (5.18b)

and ⌈·⌋ denotes rounding. Since each rounding in f(x_1, x_2) introduces a ½ LSB round-off error, the maximum
error in computing X_s is 1 LSB. The authors suggest computation of the error due to
round off using a set of nearest error estimates, e.g. 1/3, 2/3, 1/5, 2/5, . . .,
4/5 etc., and evaluate δ̂ = Σ_{i=1}^{N} δ_i where δ_i = x_i − [x_i], [x_i] being a rounded real
number. These are read from PROMs for both f(x_1, x_2) and f(x_3, x_4) and added
together with sign, and the result is added to the coarse values obtained before. They
have shown that the PROM contents need to be obtained by computer simulation to
find the maximum scaling error for the chosen d, and thus the lack of ordered outputs
otherwise occurring without rounding error compensation can be avoided.
The MRC technique is sequential and involves modulo subtractions and modulo
multiplications by multiplicative inverses of one modulus with respect to the
remaining moduli. In MRC, the decoded number is expressed as

X = x_1 + d_1 m_1 + d_2 m_1 m_2 + ⋯ + d_{n−1} m_1 m_2 ⋯ m_{n−1}   (5.19)

In each step, one mixed radix digit d_i is determined. At the end, the MRC digits
are weighted following (5.19) to obtain the final decoded number. There is no need
for a final modulo reduction.
Note that in each step, the residue corresponding to one modulus is subtracted so
that the result is exactly divisible by that modulus. The multiplication with the
multiplicative inverse accomplishes this division. The last step needs multiplications of
bigger numbers, e.g. z_1 m_2 m_3 in the three moduli example, and addition of the
resulting products using carry-save-adders followed by a CPA. But no final modulo
reduction is needed in the case of MRC since the result is always less than
M = m_1 m_2 m_3. In the case of an RNS with a large number of moduli, e.g. {m_1, m_2, m_3, . . ., m_n}, the
various multiplicative inverses need to be known a priori and the various products of
moduli m_{k−1}m_k, m_{k−2}m_{k−1}m_k, etc. need to be stored. The RNS to binary conver-
sion time is thus (n − 1)Δ_modsub + (n − 1)Δ_modmul + Δ_mult + Δ_CSA(n−2) + Δ_CPA where
Δ_modsub and Δ_modmul stand for modulo subtraction and multiplication operations,
Δ_CSA(n−2) stands for an (n − 2)-level CSA and Δ_mult is a conventional multiplication.
Note that the MRC algorithm can be pipelined. The following example illustrates
the technique.
Example 5.4 We consider the Mixed Radix Conversion technique for finding the
decimal number corresponding to the residues (1, 2, 3) using the moduli set {3, 5, 7}.
The procedure is illustrated below:

                      m3    m2    m1        3     5     7
                      x3    x2    x1        1     2     3
                     −x1   −x1             −3    −3
  (x3−x1) mod m3   (x2−x1) mod m2           1     4
  ×(1/m1) mod m3   ×(1/m1) mod m2          ×1    ×3
                      y1    d1              1     2
                     −d1                   −2
  (y1−d1) mod m3                            2
  ×(1/m2) mod m3                           ×2
                      d2                    1
Thus, the Mixed Radix digits are [d2, d1, x1] = [1, 2, 3], which corresponds to
X = 1 × 35 + 2 × 7 + 3 = 52.
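The sequential MRC procedure of Example 5.4 can be sketched in software as follows (moduli listed with m_1 first; a generic illustration, not a hardware description):

```python
def mrc_decode(residues, moduli):
    """Mixed Radix Conversion: digits d_i such that
    X = d_0 + d_1*m_0 + d_2*m_0*m_1 + ... (no final modulo reduction)."""
    d = list(residues)
    n = len(moduli)
    for i in range(n):
        for k in range(i + 1, n):
            # subtract the digit found so far, then divide by m_i
            # via multiplication with the multiplicative inverse
            d[k] = ((d[k] - d[i]) * pow(moduli[i], -1, moduli[k])) % moduli[k]
    X, w = 0, 1
    for i in range(n):
        X += d[i] * w
        w *= moduli[i]
    return X
```

With the residue/modulus pairing of Example 5.4 (3 mod 7, 2 mod 5, 1 mod 3) this yields 52, and with that of Example 5.5 (3 mod 9, 2 mod 8, 1 mod 7) it yields 498.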
MRC is simpler in the case of some powers-of-two related moduli sets
since the various multiplicative inverses needed in successive steps are of the
form 2^i, so that the modulo multiplication can be realized using bit-wise rotation
of the operands in the case of mod (2^n − 1), and one's complementing certain bits and
adding a correction term in the case of mod (2^n + 1), as explained in Chapter 4 [16, 17].
As an illustration, for the case of the earlier considered moduli
set {2^n − 1, 2^n, 2^n + 1}, the multiplicative inverses are as follows:

|1/(2^n + 1)|_{2^n − 1} = 2^{n−1}, |1/(2^n + 1)|_{2^n} = 1, |1/2^n|_{2^n − 1} = 1.

Thus, in each MRC step, only modulo subtractions are needed, and multiplication
by 2^{n−1} mod (2^n − 1) can be realized by left circular rotation. An example will
illustrate the procedure.
Example 5.5 Consider the moduli set {2^n − 1, 2^n, 2^n + 1} with n = 3. The RNS is
thus {7, 8, 9}. We wish to find the decimal number corresponding to the residues
(1, 2, 3). The MRC procedure is as follows:

      7     8     9
      1     2     3
     −3    −3
      5     7
     ×4    ×1
      6     7
     −7
      6
     ×1
      6

Thus, the Mixed Radix digits are [6, 7, 3] and the decoded number is
X = 6 × 72 + 7 × 9 + 3 = 498.
and k_j^{(0)} = x_{j+1}, l_j^{(0)} = x_j, j = 1, 2, . . ., n − 1.
Thus, using both the outputs k_j and l_j, a tree can be constructed to perform Mixed
Radix conversion. In this technique, two MRC expansions are carried out simulta-
neously. As an illustration, for a four moduli RNS, a given number can be expressed
by either

X = x_1 + d_1 m_1 + d_2 m_1 m_2 + d_3 m_1 m_2 m_3   (5.21a)

or a second expansion (5.21b), where d_i and d′_i are the Mixed Radix digits. The
advantage is the local interconnections between the various LUTs. However, the
size of the LUTs is larger due to the need for two outputs in some cases. A typical
converter for the four moduli set {3, 5, 7, 11} is presented in Figure 5.3b. The
numbers in the boxes i, j refer to adjacent moduli m_i, m_j ( j = i + 1). As an
illustration, for the box 2, 3 in the first column, which corresponds to the moduli
5, 7 and input residues 2 and 5, we have k_2 = 2 and l_2 = 1:

x_3 − x_2 = m_2 k_2 − m_3 l_2 → 5 − 2 = 5 × 2 − 7 × 1 = 3.
Yassine and Moore [33] have suggested the choice of the moduli set to have certain
multiplicative inverses equal to 1 to facilitate easy reverse conversion using only
subtractions. Considering a moduli set {m_1, m_2, m_3, m_4} for illustration, we choose
the moduli such that the V_i, which are constant predetermined factors, are all 1:

V_1 = 1, V_2 = |1/m_1|_{m_2} = 1, V_3 = |1/(m_1 m_2)|_{m_3} = 1, V_4 = |1/(m_1 m_2 m_3)|_{m_4} = 1
   (5.22a)

Note that the U_i are such that |U_i V_i|_{m_i} = γ_i are the Mixed Radix digits. This can be
proved as follows. We can evaluate the residues x_1, x_2, x_3 and x_4 from (5.22b).
Figure 5.3 An RNS to binary converter due to Miller and McCormick: (a) general case, (b) four
moduli example (adapted from [31] ©IEEE 1998)
γ_1 = U_1 = 78
γ_2 = U_2 = |41 − 78|_{63} = 26
γ_3 = U_3 = |47 − 78 − 127·26|_{50} = 17
γ_4 = U_4 = |9 − 78 − 127·26 − 127·63·17|_{13} = 9
Variations of CRT have appeared in the literature, the most important being New
CRT-I [29]. Using New CRT-I, given the moduli set {m_1, m_2, m_3, . . ., m_n}, the
weighted binary number corresponding to the residues (x_1, x_2, x_3, . . ., x_n) can be
found in terms of two quantities A and B, where x_{1,0} and x_{3,0} are the LSBs of x_1
and x_3, respectively. The value A can be computed using a 2-input adder to yield
the sum and carry vectors A_1 and A_2. Similarly, B can be estimated to yield the
sum and carry vectors B_1 and B_2 and a carry bit using a three-input n-bit adder.
Next, Y can be obtained from A and B using a 2n-bit adder (Converter I) or n-bit
adders to reduce the propagation delay. Two solutions for the n-bit case have been
suggested, denoted as Converter II and Converter III.
Bi and Gross [34] have described a Mixed-Radix Chinese Remainder Theorem
(Mixed-Radix CRT) for RNS to binary conversion. The result of the RNS to binary
conversion can be computed in this approach for an RNS having moduli {m_1, m_2,
. . ., m_n} with residues (x_1, x_2, . . ., x_n) as

X = x_1 + m_1 |γ_1 x_1 + γ_2 x_2|_{m_2} + m_1 m_2 |⌊(γ_1 x_1 + γ_2 x_2 + γ_3 x_3)/m_2⌋|_{m_3}
    + ⋯ + m_1 m_2 . . . m_{n−1} |⌊(γ_1 x_1 + γ_2 x_2 + ⋯ + γ_n x_n)/(m_2 m_3 . . . m_{n−1})⌋|_{m_n}
   (5.26a)

where

γ_1 = (M_1 |1/M_1|_{m_1} − 1)/m_1 and γ_i = (M_i |1/M_i|_{m_i})/m_1.   (5.26b)
Note that the first two terms use MRC and the other terms use a CRT-like expansion.
The advantage of this formulation is the possibility of parallel computation of the
various MRC digits, enabling fast comparison of two numbers, at the expense of
hardware, since the many terms in the numerators of the expressions for several
Mixed Radix digits, and the division by a product of moduli with taking of the
integer value, are cumbersome except for special moduli. The topic of comparison
using this technique is discussed in Chapter 6. An example will be illustrative.
5.4 RNS to Binary Converters for Other Three Moduli Sets 97
For the moduli set {3, 5, 7, 11} with residues (1, 2, 3, 4), we obtain γ_1 = 128,
γ_2 = 77, γ_3 = 110 and γ_4 = 70, and the mixed radix digits evaluate to 2, 3 and 3,
so that X = 1 + 3 × 2 + 15 × 3 + 105 × 3 = 367.
■
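The Mixed-Radix CRT of (5.26) can be sketched as follows (illustrative names; the γ_i are computed per (5.26b) and are integers since m_1 divides M_i for i ≥ 2):

```python
def mixed_radix_crt(residues, moduli):
    """Mixed-Radix CRT of Bi and Gross (5.26): the i-th mixed radix
    digit is |floor((gamma_1 x_1 + ... + gamma_{i+1} x_{i+1}) /
    (m_2...m_i))|_{m_{i+1}}."""
    M = 1
    for m in moduli:
        M *= m
    g = []
    for i, m in enumerate(moduli):
        Mi = M // m
        t = Mi * pow(Mi, -1, m)
        g.append((t - 1) // moduli[0] if i == 0 else t // moduli[0])
    X = residues[0]
    prod_lead = 1    # m_1 ... m_i  (weight of the i-th digit)
    prod_inner = 1   # m_2 ... m_i  (divisor inside the floor)
    s = g[0] * residues[0]
    for i in range(1, len(moduli)):
        s += g[i] * residues[i]
        prod_lead *= moduli[i - 1]
        X += prod_lead * ((s // prod_inner) % moduli[i])
        prod_inner *= moduli[i]
    return X
```

For the moduli set {3, 5, 7, 11} this reproduces X = 367 from (1, 2, 3, 4).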
New CRT III [35, 36] can be used to perform RNS to binary conversion when the
moduli have common factors. Considering two moduli m_1 and m_2 with common
factor d, and considering m_1 > m_2, the decoded number corresponding to residues
x_1 and x_2 can be obtained as

X = x_1 + m_1 | |1/(m_1/d)|_{m_2/d} · (x_2 − x_1)/d |_{m_2/d}   (5.27)

As an illustration, consider the moduli set {15, 12} with d = 3 as a common factor
and given residues (5, 2). The decoded number can be obtained from (5.27) as

X = 5 + 15·| |1/5|_4 · (2 − 5)/3 |_4 = 5 + 15 × 3 = 50.
We will later consider the application of this technique to reverse conversion for an
eight moduli set.
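A minimal sketch of (5.27) for two moduli with a common factor d, assuming the given residues are consistent (x_1 ≡ x_2 mod d):

```python
from math import gcd

def new_crt3(x1, x2, m1, m2):
    """New CRT-III (5.27) for moduli m1 > m2 with gcd d:
    X = x1 + m1 * |(1/(m1/d)) * (x2 - x1)/d|_{m2/d}."""
    d = gcd(m1, m2)
    assert (x2 - x1) % d == 0, "residues inconsistent mod the common factor"
    t = ((x2 - x1) // d) * pow(m1 // d, -1, m2 // d) % (m2 // d)
    return x1 + m1 * t
```

For the text's illustration, `new_crt3(5, 2, 15, 12)` returns 50, which indeed has residues 5 mod 15 and 2 mod 12.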
Premkumar [37], Premkumar et al. [38], Wang et al. [39], and Gbolagade et al. [40]
have investigated the three moduli set {m_1, m_2, m_3} = {2n + 1, 2n, 2n − 1}. The
reverse converter for this moduli set based on CRT described by Premkumar [37]
uses the expressions
X = { M/2 + (m_2 m_3/2) x_1 + (m_1 m_2/2) x_3 − m_1 m_3 x_2 } mod M for (x_1 + x_3) odd
   (5.28a)

and

X = { (m_2 m_3/2) x_1 + (m_1 m_2/2) x_3 − m_1 m_3 x_2 } mod M for (x_1 + x_3) even   (5.28b)

⌊X/m_2⌋ = | n(x_1 + x_3 − 2x_2) + (x_1 − x_3)/2 |_{m_1 m_3} for both x_1, x_3 odd or both even
   (5.29a)

and

⌊X/m_2⌋ = | n(x_1 + x_3 − 2x_2) + (x_1 − x_3 + m_1 m_3)/2 |_{m_1 m_3} for x_1 even, x_3 odd or
vice-versa   (5.29b)
which needs one 2k-bit × k-bit multiplier and one k-bit × k-bit multiplier and a few
adders. Note that in this case, m_1 = 2n − 1, m_2 = 2n and m_3 = 2n + 1. More recently,
Gbolagade et al. [40] have suggested computing X as
5.5 RNS to Binary Converters for Four and More Moduli Sets 99
X = | m_2 (x_2 − x_3) + x_2 + m_3 m_2 | (x_1 + x_3)/2 − x_2 |_{m_1} |_M   (5.31)

where the division by 2 is carried out mod m_1 (i.e. multiplication by |1/2|_{m_1}).
Some reverse converters for four moduli sets [51–54] are extensions of the con-
verters for the three moduli sets. These use the optimum converters for the three
moduli set {2^n − 1, 2^n, 2^n + 1} and use MRC to get the final result to include the
fourth modulus 2^{n+1} − 1, 2^{n−1} + 1, 2^{n−1} − 1, 2^{n+1} + 1, etc.
The reverse converter due to Vinod and Premkumar [51] for the moduli set
{m_1, m_2, m_3, m_4} = {2^n − 1, 2^n + 1, 2^n, 2^{n+1} − 1} uses CRT but computes the highest
Mixed Radix digit ⌊X/M_4⌋ mod (2^{n+1} − 1), where X is the desired decoded number
and M_i = M/m_i. On the other hand, X mod M_4 is computed using the three moduli
RNS to binary converter. Next, X is computed as ⌊X/M_4⌋·M_4 + |X|_{M_4}.
The reverse converter due to Bhardwaj et al. [52] for the moduli set {m_1, m_2, m_3,
m_4} = {2^n − 1, 2^n + 1, 2^n, 2^{n+1} + 1} uses CRT but computes first E = ⌊X/2^n⌋. Note
that E can be obtained by using CRT on the four moduli set and subtracting the
residue r_3 and dividing by m_3. However, the multiplicative inverses needed in CRT
are quite complex and hence E_1 and E_2 are estimated from the expression for E.
Next, from E_1 and E_2 using CRT, E can be obtained:

E_1 = |E|_{2^{2n} − 1} = |2^{n−1}(2^n + 1)r_1 − 2^{n−1}(2^n − 1)r_2 − 2^n r_3|_{2^{2n} − 1}   (5.32a)

E_2 = |E|_{2^{n+1} + 1} = |2r_3 − 2r_4|_{2^{n+1} + 1}   (5.32b)
Ananda Mohan and Premkumar [53] have suggested using MRC for obtaining
E from E_1 and E_2.
Ananda Mohan and Premkumar [53] have given a unified architecture for RNS
to binary conversion for the moduli sets {2^n − 1, 2^n + 1, 2^n, 2^{n+1} − 1} and {2^n − 1,
2^n + 1, 2^n, 2^{n+1} + 1} which uses a front-end RNS to binary converter for the moduli
set {2^n − 1, 2^n + 1, 2^n} and then uses MRC to include the fourth modulus. Both
ROM-based and non-ROM-based solutions have been given.
Hosseinzadeh et al. [55] have suggested an improvement on the converter of
Ananda Mohan and Premkumar [53] for the moduli set {2^n − 1, 2^n + 1, 2^n, 2^{n+1} − 1}
for reducing the conversion delay at the expense of area. They suggest using (n + 1)-
bit adders in place of the (3n + 1)-bit CPA to compute the three parts of the final result.
They do not perform the final addition of the output of the multiplier evaluating
|(x_4 − X_a)·|1/(2^n(2^{2n} − 1))|_{2^{n+1} − 1}|_{2^{n+1} − 1}, where X_a is the decoded output
corresponding to the moduli set {2^n − 1, 2^n + 1, 2^n}, but preserve it as two carry and
sum output vectors and compute the final output.
Sousa et al. [56] have described an RNS to binary converter for the moduli set
{2^n + 1, 2^n − 1, 2^n, 2^{n+1} + 1}. They have used two-level MRC. In the first
level, reverse conversion using MRC for the moduli sets {m_1, m_2} = {2^n + 1, 2^n − 1}
and {m_4, m_3} = {2^{n+1} + 1, 2^n} is performed and the decoded words X_12, X_34 are
obtained. Note that the various multiplicative inverses are |1/m_1|_{m_2} = 2^{n−1},
|1/m_4|_{m_3} = 1 and

|1/(m_3 m_4)|_{m_1 m_2} = Σ_{i=0}^{(n−3)/2} 2^{2i+1} + Σ_{i=(n−1)/2}^{n−1} 2^{2i+2}.

The resulting area is more than that of the Ananda Mohan and Premkumar converter
[53], whereas the conversion time is less.
Cao et al. [54] have described reverse converters for the two four moduli sets
{2^n + 1, 2^n − 1, 2^n, 2^{n+1} − 1} and {2^n + 1, 2^n − 1, 2^n, 2^{n−1} − 1}, both for n even.
They use a front-end RNS to binary converter due to Wang et al. [14] for the three
moduli set to obtain the decoded word X_1 and use MRC later to include the fourth
modulus m_4 (i.e. (2^{n+1} − 1) or (2^{n−1} − 1)). The authors suggest three-stage and
four-stage converters which differ in the way the MRC in the second level is
performed. In the three-stage converter, considering the first moduli set, the
second stage computes

Z = | (x_4 − X_1)·|1/(2^n(2^{2n} − 1))|_{2^{n+1} − 1} |_{2^{n+1} − 1}   (5.33a)

and the third stage computes X = X_1 + 2^n(2^{2n} − 1)Z. Noting that
|1/(2^n(2^{2n} − 1))|_{2^{n+1} − 1} = (2^{n+2} − 10)/3, the authors realize Z as

Z = | (x_4 − X_1)·(2^{n+2} − 10)/3 |_{2^{n+1} − 1} = |S·Q|_{2^{n+1} − 1}   (5.33b)

where S = |1/3|_{2^{n+1} − 1} and Q = |(x_4 − X_1)(2^{n+2} − 10)|_{2^{n+1} − 1}. Note that S can
be realized as

S = |1/3|_{2^{n+1} − 1} = 2^0 + 2^2 + 2^4 + ⋯ + 2^n.
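The constants used in (5.33) can be checked numerically for a few even n (a verification sketch, not part of the converter hardware):

```python
# For even n: |1/(2^n(2^{2n}-1))|_{2^{n+1}-1} = (2^{n+2}-10)/3 and
# |1/3|_{2^{n+1}-1} = 2^0 + 2^2 + ... + 2^n.
for n in (4, 6, 8):
    mod = (1 << (n + 1)) - 1
    inv = ((1 << (n + 2)) - 10) // 3
    assert (inv * ((1 << n) * ((1 << (2 * n)) - 1))) % mod == 1
    S = sum(1 << i for i in range(0, n + 1, 2))
    assert (3 * S) % mod == 1
```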
Division by 4 implies ignoring the two LSBs. In the case of the computation of X_b,
|1/m_3|_{m_4} = |1/4|_{m_4} = 2^{n−2}, where m_3 = 2^n + 3 and m_4 = 2^n − 1. The multipli-
cation by 2^{n−2} mod (2^n − 1) can be carried out in a simple manner by bit rotation
of |(x_3 − x_4)|_{m_4}. In the case of MRC in the second level, note that |1/(m_3 m_4)|_{m_1 m_2}
= |1/2^{n+2}|_{m_1 m_2}, enabling the Montgomery technique to be used easily.
In [58], MRC using ROMs and CRT using ROMs have also been explored. In
the MRC techniques, modulo subtractions are realized using logic, whereas multipli-
cation with the multiplicative inverse is carried out using ROMs. In the CRT-based
method, the various |1/M_i|_{m_i} M_i values are stored in ROM. A carry-save-adder
followed by a CPA and a modulo reduction stage are used to compute the decoded
result.
Jaberipur and Ahmadifar [59] have described a ROM-less adder-only reverse
converter for this moduli set. They consider a two-stage converter. The first stage
performs mixed radix conversion corresponding to the two pairs of moduli {2^n − 1,
2^n + 1} and {2^n − 3, 2^n + 3} to obtain residues corresponding to the pair of compos-
ite moduli {2^{2n} − 1, 2^{2n} − 9}. The multiplicative inverses needed are as follows:

|1/(2^n − 1)|_{2^n + 1} = 2^{n−1},
|1/(2^n + 3)|_{2^n − 3} = −(2^{n−3} + 2^{n−5} + ⋯ + 2^3 + 2) for n even,
|1/(2^n + 3)|_{2^n − 3} = 2^{n−3} + 2^{n−5} + ⋯ + 2^2 + 2^0 for n odd, and
|1/(2^{2n} − 9)|_{2^{2n} − 1} = −2^{2n−3}.

The decoded words in the first and second stages can be easily obtained using
multi-operand addition of circularly shifted words.
Patronik and Piestrak [60] have considered residue to binary conversion for a
new moduli set {m_1, m_2, m_3, m_4} = {2^n + 1, 2^n, 2^n − 1, 2^{n−1} + 1} for n odd. They
have described two converters. The first converter is based on MRC of the two
moduli set {m_1 m_2 m_3, m_4}. This uses the Wang et al. converter [12] for the three
moduli set to obtain the number X_1 in the moduli set {m_1, m_2, m_3}. The multiplicative
inverse needed in MRC is |1/(m_1 m_2 m_3)|_{m_4}, given in (5.34).
Note that since the lengths of the residues corresponding to the moduli m_1 m_2 m_3 and
m_4 are different, the operation (x_4 − X_1) mod (2^{n−1} + 1) needs to be carried out using
periodic properties of residues. The multiplication with the multiplicative inverse in
(5.34) needs circular left shifts, one's complementing of bits arriving in the LSBs due
to the circular shift, and addition of all these modified partial products with a correction
term using several CSA stages. Note that mod (2^{n−1} + 1) addition needs correction.
The multiplication with this multiplicative inverse mod (2^{2n} − 1) can be obtained
by using a multi-operand carry-save-adder mod (2^{2n} − 1) which can yield sum and
carry vectors RC and RS. Two versions of the second converter have been presented,
which differ in the second stage.
Didier and Rivaille [61] have described a two-stage RNS to binary converter for
moduli specially chosen to simplify the converter using ROMs. They suggest
choosing pairs of moduli with a difference of a power of two and with the difference
between the products of the pairs of moduli being a power of two. Specifically, the
set is of the type {m_1, m_2, m_3, m_4} = {m_1, m_1 + 2^{p_1}, m_3, m_3 + 2^{p_2}} such that
m_1 m_2 − m_3 m_4 = ±2^{pp}, where pp is an integer. In the first stage, the decoded
numbers corresponding to the residues of {m_1, m_2} and {m_3, m_4} can be found, and
in the second stage, the decoded number corresponding to the moduli set {m_1 m_2,
m_3 m_4} can be found. The basic converter for the two moduli set {m_1, m_2} can be
realized using one addition without needing any modular reduction. Denoting the
residues as (r_1, r_2), the decoded number B_1 can be written as B_1 = r_2 + ⟨r_1 − r_2, 0⟩,
where the second term corresponds to the binary number corresponding to (r_1 − r_2, 0).
Since r_1 − r_2 can be negative, it can be written as an α-bit two's complement number
with a sign bit S and (α − 1) remaining bits. The authors suggest that the decoded
number be obtained using a look-up table T addressed by the sign bit and p LSBs,
where m_2 − m_1 = 2^p, and using an addition operation as follows:

B_1 = r_2 + m_2·MSB(r_1 − r_2)_{α−1:p} + T[sign(r_1 − r_2), LSB(r_1 − r_2)_{p−1:0}]   (5.36)
Some of the representative moduli sets are {7, 9, 5, 13}, {23, 39, 25, 41}, {127,
129, 113, 145} and {511, 513, 481, 545}. As an illustration, the implementation for
the RNS {511, 513, 481, 545} needs an area of 170 A_FA, 2640 bits of ROM, and a
conversion time of 78Δ_FA + 2Δ_ROM, where Δ_FA is the delay of a full adder and
Δ_ROM is the ROM access time.
We next consider four moduli sets with dynamic range (DR) of the order of 5n
and 6n bits. The four moduli set {2^n, 2^n − 1, 2^n + 1, 2^{2n} + 1} [62] is attractive since
New CRT-I-based reduction can be easily carried out. However, the bit length of
one modulus is double that of the other three moduli. Note that this moduli set can
be considered to be derived from {2^{2n} − 1, 2^{2n}, 2^{2n} + 1} [48, 49].
The reverse converters for the moduli set {2^n − 1, 2^n + 1, 2^{2n+1} − 1, 2^n} with a DR
of about (5n + 1) bits and {2^n − 1, 2^n + 1, 2^{2n}, 2^{2n} + 1} with a DR of about 6n bits,
based on New CRT II and New CRT I respectively, have been described in [63]. In
the first case, MRC is used for the two two-moduli sets {m_1, m_2} = {2^n, 2^{2n+1} − 1}
and {m_3, m_4} = {2^n + 1, 2^n − 1} to compute Z and Y. A second MRC stage computes
X from Y and Z:

Z = x_1 + 2^n |2^{n+1}(x_2 − x_1)|_{2^{2n+1} − 1}   (5.37a)

Y = x_3 + (2^n + 1) |2^{n−1}(x_4 − x_3)|_{2^n − 1}   (5.37b)

X = Z + 2^n (2^{2n+1} − 1) |2^n (Y − Z)|_{2^{2n} − 1}   (5.37c)

Due to the modulo reductions, which are convenient, the hardware can be simpler.
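The two-level MRC of (5.37) can be sketched as below (residue ordering as in the text; for n = 3 the moduli are {8, 127, 9, 7}):

```python
def decode_5n(x1, x2, x3, x4, n):
    """Two-level MRC following (5.37) for {2^n, 2^{2n+1}-1, 2^n+1, 2^n-1}:
    Z decodes the first pair, Y the second pair, and a final MRC step
    combines them (a software sketch of the converter of [63])."""
    m2 = (1 << (2 * n + 1)) - 1
    Z = x1 + (1 << n) * (((1 << (n + 1)) * (x2 - x1)) % m2)       # (5.37a)
    m4 = (1 << n) - 1
    Y = x3 + ((1 << n) + 1) * (((1 << (n - 1)) * (x4 - x3)) % m4) # (5.37b)
    mz = (1 << (2 * n)) - 1
    return Z + (1 << n) * m2 * (((1 << n) * (Y - Z)) % mz)        # (5.37c)
```

For n = 3, the number 12345 has residues (1, 26, 6, 4) with respect to (8, 127, 9, 7), and the sketch recovers it.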
In the case of the moduli set {m_1, m_2, m_3, m_4} = {2^n − 1, 2^n + 1, 2^{2n}, 2^{2n} + 1},
New CRT-I has been used. The decoded number in this case is given by

X = x_1 + 2^{2n} | 2^{2n}(x_2 − x_1) + 2^{2n−1}(2^{2n} + 1)(x_3 − x_2)
    + 2^{n−2}(2^{2n} + 1)(2^n + 1)(x_4 − x_3) |_{2^{4n} − 1}   (5.38)
Zhang and Siy [64] have described an RNS to binary converter for the moduli set
{2^n − 1, 2^n + 1, 2^{2n} − 2, 2^{2n+1} − 3} with a DR of about (6n + 1) bits. They
consider two-level MRC using the two moduli sets {m_1 = 2^n − 1, m_2 = 2^n + 1}
and {m_3 = 2^{2n} − 2, m_4 = 2^{2n+1} − 3}. The multiplicative inverses are very simple:

|1/m_2|_{m_1} = 2^{n−1}, |1/m_4|_{m_3} = 1, |1/(m_3 m_4)|_{m_1 m_2} = 1   (5.39)
Sousa and Antao [65] have described MRC-based RNS to binary converters for
the moduli sets {2^n + 1, 2^n − 1, 2^n, 2^{2n+1} − 1} and {2^n − 1, 2^n + 1, 2^{2n}, 2^{2n+1} − 1}.
They consider in the first level {x_1, x_2} = {2^n − 1, 2^n + 1} and {x_3, x_4} = {2^{(1+α)n},
2^{2n+1} − 1}, where α = 0, 1 correspond to the two moduli sets, to compute X_12 and
X_34, respectively. The multiplicative inverses in the first level are

|1/(2^n + 1)|_{2^n − 1} = 2^{n−1}, |1/2^{(1+α)n}|_{2^{2n+1} − 1} = 2^{(1−α)n+1},

and in the second level

|1/(2^{3n+1} − 2^n)|_{2^{2n} − 1} = 2^n for α = 0 and |1/(2^{4n+1} − 2^{2n})|_{2^{2n} − 1} = 1 for α = 1.
Note that all modulo operations are mod (2^n − 1), mod 2^{(1+α)n} and mod (2^{2n} − 1),
which are convenient to realize. The authors use X_12 and X_34 in carry-save form for
computing |X_12 − X_34|_{2^{2n} − 1}, thus reducing the critical path.
Stamenkovic and Jovanovic [66] have described a reverse converter for the four
moduli set {2^n − 1, 2^n, 2^n + 1, 2^{2n+1} − 1}. They have suggested exploring the
24 possible orderings of the moduli for use in MRC so that the multiplicative
inverses are 1 and 2^{n−1}. The recommended ordering is {2^{2n+1} − 1, 2^n, 2^n + 1,
2^n − 1}. This leads to MRC using only subtractors and not needing modulo
multiplications. They have not, however, presented the details of hardware require-
ments and conversion delay.
The reverse converter for the five moduli set [67] {2^n − 1, 2^n, 2^n + 1, 2^{n+1} − 1,
2^{n−1} − 1} for n even uses in the first level the converter for the four moduli set
{2^n − 1, 2^n, 2^n + 1, 2^{n+1} − 1} due to [54] and then uses MRC to include the fifth
modulus (2^{n−1} − 1).
Hiasat [68] has described reverse converters for two five moduli sets based on
CRT: {2^n, 2^n − 1, 2^n + 1, 2^n − 2^{(n+1)/2} + 1, 2^n + 2^{(n+1)/2} + 1} when n is odd and
n ≥ 5, and {2^{n+1}, 2^n − 1, 2^n + 1, 2^n − 2^{(n+1)/2} + 1, 2^n + 2^{(n+1)/2} + 1} when n is odd
and n ≥ 7. Note that these moduli sets use the factored form of the two moduli
(2^{2n} − 1) and (2^{2n} + 1) in the moduli set {2^n, 2^{2n} − 1, 2^{2n} + 1}. The reverse
conversion procedure is similar to the Andraros and Ahmad technique [4] of
evaluating the 4n MSBs, since the n LSBs of the decoded result are already
available. The architecture needs the addition of eight 4n-bit words using a 4n-bit
CSA with EAC followed by a 4n-bit CPA with EAC or a modulo (2^{4n} − 1) adder
using parallel-prefix architectures.
Skavantzos and Stouraitis [69] and Skavantzos and Abdallah [70] have
suggested general converters for moduli products of the form 2^a(2^b − 1) where
2^b − 1 is made up of several conjugate moduli pairs such as (2^n − 1), (2^n + 1) or
(2^n + 2^{(n+1)/2} + 1), (2^n − 2^{(n+1)/2} + 1). The reverse converter for conjugate moduli
is quite simple, needing rotation of bits, one's complementing and addition
using modulo (2^{4n} − 1) adders or modulo (2^{2n} − 1) adders. The authors suggest
two-level converters which will find the final binary number using MRC
corresponding to the intermediate residues. The first level converter uses CRT,
whereas the second level uses MRC. The four moduli sets {2^{n+1}, 2^n − 1, 2^{n+1} − 1,
2^{n+1} + 1} for n odd, {2^n, 2^n − 1, 2^{n−1} − 1, 2^{n−1} + 1} for n odd, the five moduli
sets {2^{n+1}, 2^n − 1, 2^n + 1, 2^{n+1} − 1, 2^{n+1} + 1}, {2^n, 2^n − 1, 2^n + 1, 2^n + 2^{(n+1)/2} + 1,
2^n − 2^{(n+1)/2} + 1} and the RNS with seven moduli {2^{n+3}, 2^n − 1, 2^n + 1, 2^{n+2} − 1,
2^{n+2} + 1, 2^{n+2} + 2^{(n+3)/2} + 1, 2^{n+2} − 2^{(n+3)/2} + 1} have been suggested. Other RNS
with only pairs of conjugate moduli up to 8 moduli have also been suggested.
Note that care must be taken to see that the moduli are relatively prime. Note
that in case of one common factor existing among the two sets of moduli, this
should be taken into account in the application of CRT in the second level
converter.
Pettenghi et al. [71] have described general RNS to binary converters for the
moduli sets {2^{n+β}, 2^n − 1, 2^n + 1, 2^n + k_1, 2^n − k_1} and {2^{n+β}, 2^n − 1, 2^n − k_1, 2^n − k_2,
. . ., 2^n − k_f} using CRT. In the case of the first moduli set, they compute ⌊X/m_1⌋,
where m_1 = 2^{n+β}, as ⌊X/m_1⌋ = |Σ_{i=1}^{5} V_i x_i|_{M_1}, where V_i = (M_i/m_1)|1/M_i|_{m_i}
for i = 2, . . ., 5, which are integers since m_1 divides M_i exactly. On the other hand,
in the case of V_1, we have

V_1 = |1/M_1|_{m_1} (2^{3n−β} − 2^{n−β} k_1^2 + 1) + ψ   (5.40a)

where ψ is defined as

|1/M_1|_{m_1} k_1^2 = ψ m_1 + 1   (5.40b)

Note that the fractional part in the computation of ⌊X/m_1⌋ can be removed using
this technique. As an illustration, for m_1 = 2^{n+β} with k_1 = 3, n = 4, β = 2, i.e.
m_1 = 64, m_2 = 15, m_3 = 17, m_4 = 13, m_5 = 19, we have ψ = 2, |1/M_1|_{m_1} = 57 and
V_1 = 14,024, V_2 = 58,786, V_3 = 59,280, V_4 = 43,605 and V_5 = 13,260. Note that
the technique can be extended to the case of additional moduli pairs with different
k_1, k_2, etc.
Skavantzos et al. [72] have suggested, in the case of the balanced eight moduli RNS
using the moduli set {m_1, m_2, m_3, m_4, m_5, m_6, m_7, m_8} = {2^{n−5} − 1, 2^{n−3} − 1,
2^{n−3} + 1, 2^{n−2} + 1, 2^{n−1} − 1, 2^{n−1} + 1, 2^n, 2^n + 1}, four first-level converters
comprising the moduli pairs {2^{n−3} − 1, 2^{n−3} + 1}, {2^{n−5} − 1, 2^{n−2} + 1}, {2^{n−1} − 1,
2^{n−1} + 1} and {2^n, 2^n + 1} to obtain the results B, D, C and E respectively. The
computation of D uses

D = x_4 + m_4 X_{01}   (5.41a)

where

X_{01} = | (x_1 − x_4)·|1/(2^{n−2} + 1)|_{2^{n−5} − 1} |_{2^{n−5} − 1}   (5.41b)

and m_8 = 2^n + 1 and m_7 = 2^n.
The second-level converter takes the pairs {B, D} and {C, E} and evaluates the
corresponding numbers F and G respectively, which also uses MRC and which can
also be realized by a multi-operand modulo (2^{2n−6} − 1) CSA tree followed by a CPA.
and

|1/(2^k(2^{2n} − 1))|_{2^{n+1} − 1} = |(1/3)·2^{n+4−k}|_{2^{n+1} − 1} for k ≤ (n + 3)   (5.44b)
X_f = Σ_{i=1}^{L} x_i y_i − 2r   (5.46a)

where

y_i = (2/m_i)|1/M_i|_{m_i}   (5.46b)

so that X_f will be in the range [0, 2). The advantage is that the subtraction of 2r can
be realized by dropping the MSBs. The decoded value can be obtained by scaling X_f
by M/2. Note that y_i in (5.46b) can be approximated as

ŷ_i = ⌈2^b y_i⌉ 2^{−b} = ⌈(2^{b+1}/m_i)|1/M_i|_{m_i}⌉ 2^{−b}   (5.47)
X = 0.110000100010001 = 0.75833129882
The value of α can be determined using the fact that, due to scaling by 2^k, the
k LSBs shall be zero. Hence, by using the k LSBs of the first term in (5.48) to look
into a look-up table, the value of α can be determined and αM can be subtracted.
The result is exactly divisible by 2^k. This is similar to Montgomery's algorithm
used for scaling. The authors observe that k shall be such that 2^k ≥ N − 1, where
N is the number of moduli, and that α < N.
Note that in the case of a two's complement number being desired as the output, an
addition of (M − 1)/2 in the beginning is performed. The result can be obtained by
subtracting (M − 1)/2 at the end if the result is positive, else by adding (M − 1)/2.
In short, we compute

X̃ = (Ĥ − αM)/2^k   (5.49a)

where

Ĥ = Σ_{i=1}^{4} M_i | |1/M_i|_{m_i} x_i 2^k |_{m_i}   (5.49b)
5.6 RNS to Binary Conversion Using Core Function 111
Example 5.9 As an illustration, let us consider the moduli set {17, 15, 16} and the
given residues (2, 9, 15). Obtain the decoded number using the Re et al. technique
[78].

We have M_1 = 240, M_2 = 272, M_3 = 255 and |1/240|_{17} = 9, |1/272|_{15} = 8,
|1/255|_{16} = 15. Thus, scaling by 8 the input residues multiplied by |1/M_i|_{m_i}
yields the weights |9·2·8|_{17} = 8, |8·9·8|_{15} = 6, |15·15·8|_{16} = 8.
Thus, CRT yields the weighted sum as 8·240 + 6·272 + 8·255 = 5592.
Subtracting from this the scaled number corresponding to the residue mod
16, (8·15) = 120, we obtain 5472. Dividing this by 16, we obtain 342.
Next, division by 8 is needed to take into account the pre-scaling. Using the
Montgomery technique, noting that |−1/255|_8 = 1 and 342 mod 8 = 6, we need to
add to 342, 6M′ = 6 × 255, to finally get (342 + 6·255)/8 = 234. Thus the decoded
value is 234 × 16 + 15 = 3759 as desired.
■
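Example 5.9 can be reproduced with the following sketch; the function name and the convention that the last modulus is the power of two are illustrative assumptions:

```python
def scaled_crt_decode(residues, moduli, k):
    """Sketch of the scaling-based conversion illustrated in Example 5.9:
    CRT weights premultiplied by 2^k, the power-of-two residue stripped,
    and the final division by 2^k done Montgomery-style."""
    M = 1
    for m in moduli:
        M *= m
    m_last = moduli[-1]          # power-of-two modulus (here 16)
    Mp = M // m_last             # product of the remaining moduli
    s = 0
    for x, m in zip(residues, moduli):
        Mi = M // m
        s += Mi * ((pow(Mi, -1, m) * x * (1 << k)) % m)   # scaled weights
    s = (s - residues[-1] * (1 << k)) // m_last           # strip residue, /m_last
    t = (s * pow(-Mp, -1, 1 << k)) % (1 << k)             # Montgomery digit
    s = ((s + t * Mp) >> k) % Mp                          # exact division by 2^k
    return s * m_last + residues[-1]
```

For the residues (2, 9, 15) of Example 5.9 this returns 3759.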
Some authors have investigated the use of the core function for RNS to binary
conversion as well as scaling. This will be briefly considered next. Consider the
moduli set {m_1, m_2, m_3, . . ., m_k}. We need to first choose a constant C(M) as the
core [79–86]. It can be the largest modulus in the RNS or the product of two or more
moduli. As in the case of CRT, we define the various B_i as B_i = M_i |1/M_i|_{m_i}, where
M_i = M/m_i and |1/M_i|_{m_i} is the multiplicative inverse of M_i mod m_i. We next need
to compute the weights w_i defined as
w_i = |C(M)·|1/M_i|_{m_i}|_{m_i}   (5.51a)

(taken in least absolute value form) such that

C(M) = M Σ_{i=1}^{k} (w_i/m_i)   (5.51b)

thereby necessitating that some weights be negative. The weights are next used to
compute the core function C(n) of a given number n as

C(n) = | (C(M)/M)·n − Σ_{i=1}^{k} (w_i/m_i) r_i |_{C(M)}   (5.52)

where r_i = |n|_{m_i}. Note that the core values C(B_i) corresponding to the inputs B_i
can be seen from (5.52) to be

C(B_i) = B_i C(M)/M − w_i/m_i   (5.53)
where α is known as the rank function defined by CRT. Note that (5.53) is known
as the CRT for the core function. Next, n can be computed by rewriting (5.52) as

n = | (M/C(M))·(C(n) + Σ_{i=1}^{k} (w_i/m_i) r_i) |_M   (5.54b)

The important property of C(n) is that the term n·C(M)/M in (5.52) monotonically
increases with n, with some fuzziness due to the second term in (5.52), even though
the choice of some weights w_i defined in (5.51a) as negative numbers reduces the
fuzziness (see (5.52)). Hence, accurate comparison of two numbers using the core
function, or sign detection, is difficult.
The main advantage claimed for using the Core function is that the constants C
(Bi) involved in computing the Core function following (5.53) are small since they
are less than C(M ). However, in order to simplify or avoid the cumbersome division
5.6 RNS to Binary Conversion Using Core Function 113
by C(M ) needed in (5.54a), it has been suggested that C(M ) be chosen as a power of
two or one modulus or product of two or more moduli.
The following example illustrates the procedure of computing Core and reverse
conversion using Core function.
Example 5.10 Consider the moduli set {3, 5, 7, 11}. Let us choose C(M) = 11. Then M = 1155, M1 = 385, M2 = 231, M3 = 165 and M4 = 105. The values of (1/Mi) mod mi are 1, 1, 2, 2. Thus, Bi = Mi·((1/Mi) mod mi) are 385, 231, 330, 210. The wi can be found as −1, 1, 1, 0. Next, we find C(B1) = 4, C(B2) = 2, C(B3) = 3 and C(B4) = 2. Consider the residues (1, 2, 3, 8). Then we find C(n) = (1 × 4 + 2 × 2 + 3 × 3 + 8 × 2) mod 11 = 0. Next, n can be found as n = 105 × (0 − 1/3 + 2/5 + 3/7) = 52.
■
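The computations of this example can be reproduced as follows. The sketch assumes the standard defining form C(n) = Σ wi·⌊n/mi⌋ from the core-function literature [79–86]; the exact-rational evaluation of (5.54b) via `fractions.Fraction` is an implementation choice of mine.

```python
# Sketch of the core-function computations of Example 5.10
# (moduli {3,5,7,11}, C(M) = 11, weights w = (-1, 1, 1, 0)).
from math import prod
from fractions import Fraction

m = [3, 5, 7, 11]
w = [-1, 1, 1, 0]
M = prod(m)                       # 1155
CM = sum(wi * (M // mi) for wi, mi in zip(w, m))
assert CM == 11                   # (5.51b)

def core(n):
    # C(n) = sum_i w_i * floor(n/m_i): defining form of the core function
    return sum(wi * (n // mi) for wi, mi in zip(w, m))

Bi = [(M // mi) * pow(M // mi, -1, mi) for mi in m]   # [385, 231, 330, 210]
CB = [core(b) % CM for b in Bi]                       # [4, 2, 3, 2]

r = [1, 2, 3, 8]                  # residues of n = 52
Cn = sum(ci * ri for ci, ri in zip(CB, r)) % CM       # 0, CRT for the core
# Recover n from (5.54b): n = (M/C(M)) * (C(n) + sum_i w_i r_i / m_i)
n = Fraction(M, CM) * (Cn + sum(Fraction(wi * ri, mi)
                                for wi, ri, mi in zip(w, r, m)))
print(n)  # 52
```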
Note that if 0 ≤ C(X) < C(M), then (C(X)) mod C(M) = C(X). If C(X) < 0, then (C(X)) mod C(M) = C(X) + C(M). If C(X) ≥ C(M), then (C(X)) mod C(M) = C(X) − C(M). Thus, any specific value of (C(X)) mod C(M) may correspond to two possible values of C(X). There is an ambiguity in determining which case is the correct one; the ambiguity is due to the non-linear characteristic of the core function.
Miller [80] has suggested the use of a redundant modulus mE larger than C(M) and computing

C(n) = (n·C(M)/M − Σ_{i=1}^{k} (wi/mi)·ri) mod mE    (5.55)
5.7 RNS to Binary Conversion Using Diagonal Function

We briefly review the concept of the diagonal function [87–90]. For a given moduli set {m1, m2, m3, ..., mn} where the moduli mi are mutually prime, we first define a parameter called the Sum of Quotients (SQ), where

SQ = M1 + M2 + ⋯ + Mn    (5.56)

where Mi = M/mi and M = m1·m2·⋯·mn is the dynamic range of the RNS. We also define the constants

ki = (−1/mi) mod SQ  for i = 1, ..., n.    (5.57)

It has been shown in [87] and [88] that the ki values exhibit a useful property. The diagonal function corresponding to a given number X with residues (x1, x2, ..., xn) is defined next as

D(X) = ⌊X/m1⌋ + ⌊X/m2⌋ + ⋯ + ⌊X/mn⌋ = (k1·x1 + k2·x2 + ⋯ + kn·xn) mod SQ
Note that D(X) is a monotonic function. As such, two numbers X and Y can be
compared based on the D(X) and D(Y ) values. However, if they are equal, we need
to compare any one of the coordinates (residues corresponding to any one modulus)
of X with those of Y in order to determine whether X > Y or X ¼ Y or X < Y. Pirlo and
Impedovo [89] have observed that Diagonal function does not support RNS to
Binary conversion. However, it is now recognized [91] that it is possible to perform
RNS to binary conversion using Diagonal function.
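As a numerical sketch of these properties, the following Python fragment computes D(X) from residues using the constants ki of (5.57) and demonstrates the monotone comparison; the moduli set {3, 5, 7} is chosen here only for illustration.

```python
# Sketch: diagonal function D(X) = sum_i floor(X/m_i), computed from
# residues as (sum_i k_i x_i) mod SQ with k_i = (-1/m_i) mod SQ.
from math import prod

m = [3, 5, 7]
M = prod(m)
SQ = sum(M // mi for mi in m)          # 35 + 21 + 15 = 71
k = [(-pow(mi, -1, SQ)) % SQ for mi in m]

def D(residues):
    return sum(ki * xi for ki, xi in zip(k, residues)) % SQ

def to_rns(x):
    return [x % mi for mi in m]

# D is monotonic, so comparing D values compares the numbers themselves.
X, Y = 34, 29
assert D(to_rns(X)) == sum(X // mi for mi in m) == 21
assert D(to_rns(Y)) < D(to_rns(X))     # 29 < 34 detected from residues
```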
According to CRT, the binary number corresponding to the residues (x1, x2, ..., xn) can be obtained as

X = (x1·M1·((1/M1) mod m1) + x2·M2·((1/M2) mod m2) + ⋯ + xn·Mn·((1/Mn) mod mn)) mod M    (5.60)

Note that all the terms in (5.61) are mixed fractions, since SQ and mi (i = 1, 2, ..., n) are mutually prime.
From the definition of ki in (5.57), we have

βi·SQ − ki·mi = 1    (5.62)

Substituting the value of SQ from (5.56) in (5.63) and noting that Mk is a multiple of mi for k = 1, 2, ..., n except for k = i, we have

βi = (1/Mi) mod mi    (5.64)

or

(SQ/mi)·((1/Mi) mod mi) = ki + 1/mi    (5.65b)
X = (105 × 21 + 1 × 35 + 4 × 21 + 6 × 15)/71 = 34.
■
An examination of (5.65b) suggests a new approach for RNS to binary conversion, which is considered next. We add a multiple of M to p = x1M1 + x2M2 + ⋯ + xnMn such that the sum is exactly divisible by SQ. This is what Montgomery's technique [92] does to find (a/b) mod Z. In this technique, we compute the value of s such that (a + sZ) ≡ 0 mod b; note that s is given by s = (−a/Z) mod b. Thus, we observe that by adding D(X)·M to x1M1 + x2M2 + ⋯ + xnMn and dividing by SQ, where D(X) = (−p/M) mod SQ, we can obtain X:

X = (x1M1 + x2M2 + ⋯ + xnMn + M·D(X))/SQ    (5.67)

Note that in this technique, we need not find the various ki values. We consider the previous example to illustrate this method.
Example 5.12 Consider the moduli set {m1 = 3, m2 = 5, m3 = 7} and the given residues (x1 = 1, x2 = 4, x3 = 6) corresponding to X = 34. We have M1 = 35, M2 = 21, M3 = 15 and SQ = 71. We can find p = x1M1 + x2M2 + x3M3 = 209. Thus, we have D(X) = (−p/M) mod SQ = (−209/105) mod 71 = (67 × 48) mod 71 = 21. The decoded number can be found following (5.67) as X = (209 + 105 × 21)/71 = 34.
■
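The same computation can be sketched directly from (5.67); note how no ki values are needed, only one inverse mod SQ. The numbers below reproduce Example 5.12.

```python
# Sketch of reverse conversion via the diagonal function, eq. (5.67):
# X = (sum_i x_i M_i + M*D(X)) / SQ, with D(X) = (-p/M) mod SQ.
from math import prod

m = [3, 5, 7]
x = [1, 4, 6]                      # residues of X = 34
M = prod(m)                        # 105
Mi = [M // mi for mi in m]         # [35, 21, 15]
SQ = sum(Mi)                       # 71

p = sum(xi * Mii for xi, Mii in zip(x, Mi))   # 209
DX = (-p * pow(M, -1, SQ)) % SQ               # 21
X = (p + M * DX) // SQ                        # (209 + 105*21)/71 = 34
print(X)  # 34
```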
5.8 Performance of Reverse Converters

The hardware requirements and conversion delays of the various designs for the several moduli sets considered in this chapter are presented in Table 5.1 for three-moduli sets and in Table 5.2 for four- and more-moduli sets. The various multiplicative inverses needed in MRC for moduli sets which use subsets of moduli are presented in Table 5.3. In Table 5.4, the multiplicative inverses needed in CRT are presented. The various multiplicative inverses needed in New CRT II are presented in Table 5.5, and those needed in New CRT I are given in Table 5.6. In Table 5.7, the performance (area in gates in the case of ASICs, or slices in the case of FPGAs, together with conversion time and power dissipation) of some state-of-the-art practically implemented reverse converters is presented for both FPGA and ASIC designs. Various dynamic ranges have been considered. It can be observed from this table that conversion times of less than a few ns for dynamic ranges above 100 bits have been demonstrated to be feasible.
It can be seen that MRC, CRT and New CRT perform differently for different moduli sets. Various authors have considered various options, and a design can be chosen based on low area, low conversion time, or both. Among the three-moduli sets, in spite of the slight complexity of the (2^n + 1) modulus channel, the moduli set {2^n, 2^n − 1, 2^n + 1} outperforms the rest. It is interesting to note that four-moduli sets with uniform-size moduli appear to be more attractive if two-level MRC is used rather than CRT, contrary to the assumption that CRT-based designs are faster than MRC-based designs. Moduli sets with moduli of bit lengths varying from n to 2n bits appear to perform well if the moduli are properly chosen, enabling realization of a higher dynamic range. These have a linear dependence of area on n, as against four-moduli sets of uniform word length, which have a quadratic dependence on n. The present trend appears to favor moduli of the type 2^x ± 1. Multi-moduli systems investigated more recently appear to be attractive. As will be shown in a later chapter, several multi-moduli systems have been explored for cryptographic applications and need more rigorous investigation.
We present next the detailed design procedure and the development of the implementation architecture of an MRC-based reverse converter for the moduli set {2^n, 2^n − 1, 2^(n+1) − 1}. The MRC technique for this reverse conversion is illustrated in Figure 5.4a. The various multiplicative inverses in this converter, denoted as Converter I, can be computed as follows:
Table 5.1 Area and delay requirements of various converters for three moduli sets

Design | Moduli set | Hardware requirements | Delay
1 [44] | {2^n − 1, 2^n, 2^(n−1) − 1} | (12n − 8)A_HA + (6n − 4)A_AND | (5n − 4)τ_FA
2 [45, 46] | {2^n − 1, 2^n, 2^(n−1) − 1} | (17n − 13)A_HA + (7n − 3)A_AND | (3n + 2)τ_FA
3 [47] CRT based | {2^n − 1, 2^n, 2^(n−1) − 1} | (9n − 10)A_FA + (3n − 1)A_INV + 18(2^n − 2)A_ROM + 2nA_HA + (n + 1)A_EXNOR + (n + 1)A_OR | (2n + 3)τ_FA
4 [47] MRC based | {2^n − 1, 2^n, 2^(n−1) − 1} | (4n − 3)A_FA + (3n − 1)A_INV + (3n − 4)A_EXNOR + (3n − 4)A_OR | (6n − 5)τ_FA
5 [10] | {2^n − 1, 2^n, 2^n + 1} | (6n + 1)A_FA + (n + 3)A_AND/OR + (n + 1)A_XOR/XNOR + 2nA_2:1MUX | (n + 2)τ_FA + τ_MUX
6 [8, 9] | {2^n − 1, 2^n, 2^n + 1} | 4nA_FA + 2A_AND/OR | (4n + 1)τ_FA
7 CI [14] | {2^n − 1, 2^n, 2^n + 1} | 4nA_FA + A_HA + A_XOR/XNOR + 2A_2:1MUX | (4n + 1)τ_FA
8 CII [14] | {2^n − 1, 2^n, 2^n + 1} | 6nA_FA + A_HA + 2A_AND/OR + A_XNOR/XOR + (2n + 2)A_2:1MUX | (n + 1)τ_FA
9 CIII [14] | {2^n − 1, 2^n, 2^n + 1} | 4nA_FA + A_HA + (2n + 2)A_AND/OR + (2n − 1)A_XNOR/XOR + (2n + 2)A_2:1MUX | (n + 1)τ_FA
10 [12] | {2^n − 1, 2^n, 2^n + 1} | 4nA_FA + nA_AND/OR | (4n + 1)τ_FA
11 Converter I [50] | {2^n − 1, 2^n, 2^(n+1) − 1} | (4n + 3)A_FA + nA_AND/OR + nA_XOR/XNOR | (6n + 5)τ_FA
12 Converter II [50] | {2^n − 1, 2^n, 2^(n+1) − 1} | (14n + 21)A_FA + (2n + 3)A_HA + (2n + 1)A_3:1MUX | (2n + 7)τ_FA
13 Converter III [50] | {2^n − 1, 2^n, 2^(n+1) − 1} | (12n + 19)A_FA + (2n + 2)A_HA + 10(2^n + 1)A_ROM + (2n + 1)A_2:1MUX | (2n + 7)τ_FA
14 CE [49] | {2^n, 2^2n − 1, 2^2n + 1} | (3n + 1)A_INV + (5n + 2)A_FA + (2n − 1)A_EXOR + (2n − 1)A_AND + (n − 1)A_OR + (n − 1)A_EXNOR | (8n + 1)τ_FA + τ_INV
15 HS [49] | {2^n, 2^2n − 1, 2^2n + 1} | (3n + 1)A_INV + (9n + 2)A_FA + (2n − 1)A_EXOR + (2n − 1)A_AND + (n − 1)A_OR + (n − 1)A_EXNOR + 4nA_2:1MUX | (2n + 1)τ_FA + τ_INV + τ_MUX + 2τ_NAND
16 [21] | {2^n − 1, 2^(n+k), 2^n + 1}, 0 ≤ k ≤ n | 4nA_FA | (4n + 2)τ_FA
17 [22] | {2^2n, 2^n − 1, 2^n + 1} | (4n + 1)A_FA + (n − 1)A_HA | (4n + 1)τ_FA
Table 5.2 Area and time requirements of RNS to Binary converters used for comparison using four and five moduli sets

Design | Moduli set | Area A | Conversion time T
1 [57] | {2^n − 3, 2^n + 1, 2^n − 1, 2^n + 3} | (26n + 8)A_FA + (2^(n+5) + 32)nA_ROM | (7n + 8)τ_FA + 2τ_ROM
2 ROM-less CE [58] | {2^n − 3, 2^n + 1, 2^n − 1, 2^n + 3} | (25.5n + 12 + 5n^2/2)A_FA + 5nA_HA + 3nA_EXNOR + 3nA_OR | (18n + 23)τ_FA
3 CI HS [58] ROM-less | {2^n − 3, 2^n + 1, 2^n − 1, 2^n + 3} | (37.5n + 28 + 5n^2/2)A_FA + 5nA_HA + 3nA_EXNOR + 3nA_OR | (12n + 15)τ_FA
4 C2 CE [58] MRC with ROM | {2^n − 3, 2^n + 1, 2^n − 1, 2^n + 3} | (20n + 17)A_FA + (3n − 4)A_HA + 2^n(5n + 2)A_ROM | 3τ_ROM + (13n + 22)τ_FA
5 C2 HS [58] MRC with ROM | {2^n − 3, 2^n + 1, 2^n − 1, 2^n + 3} | (42n + 61)A_FA + (3n − 4)A_HA + 2^n(5n + 2)A_ROM | 3τ_ROM + (7n + 10)τ_FA
6 C3 CE [58] CRT with ROM | {2^n − 3, 2^n + 1, 2^n − 1, 2^n + 3} | (23n + 11)A_FA + (2n − 2)A_HA + (6n + 4)2^nA_ROM | (16n + 14)τ_FA + τ_ROM
7 C3 HS [58] CRT with ROM | {2^n − 3, 2^n + 1, 2^n − 1, 2^n + 3} | (35n + 17)A_FA + (2n − 2)A_HA + (6n + 4)2^nA_ROM | (4n + 7)τ_FA + τ_ROM
8 [53] Converter I | {2^n − 1, 2^n, 2^n + 1, 2^(n+1) − 1} | (9n + 5 + (n − 4)(n + 1)/2)A_FA + 2nA_EXNOR + 2nA_OR + (6n + 1)A_INV | ((23n + 12)/2)τ_FA
9 [53] Converter I using ROM | {2^n − 1, 2^n, 2^n + 1, 2^(n+1) − 1} | (6n + 1)A_INV + (8n + 4)A_FA + 2nA_EXNOR + 2nA_OR + (n + 1)2^(n+1)A_ROM | (9n + 6)τ_FA
10 [53] Converter II | {2^n − 1, 2^n, 2^n + 1, 2^(n+1) + 1} | (6n + 7)A_INV + (n^2 + 12n + 12)A_FA + 2nA_EXNOR + 2nA_OR + (4n + 8)A_2:1MUX | (16n + 22)τ_FA
11 [53] Converter II using ROM | {2^n − 1, 2^n, 2^n + 1, 2^(n+1) + 1} | (5n + 6)A_INV + (9n + 10)A_FA + 2nA_EXNOR + 2nA_OR + (2n + 2)A_2:1MUX + (n + 2)2^(n+2)A_ROM | (11n + 14)τ_FA
12 [55] | {2^n − 1, 2^n, 2^n + 1, 2^(n+1) − 1} | (10n + 6 + (n − 4)(n + 1)/2)A_FA + (6n + 2)A_EXNOR + (6n + 2)A_OR + (7n + 2)A_INV + (n + 3)A_2:1MUX + (2n + 1)A_3:1MUX | ((15n + 22)/2)τ_FA
13 [56] | {2^n − 1, 2^n, 2^n + 1, 2^(n+1) + 1} | (2n^2 + 11n + 3)A_FA | (11.5n + 2log2 n + 2.5)τ_FA
14 [51] | {2^n − 1, 2^n, 2^n + 1, 2^(n+1) − 1} | (37n + 14)A_FA | (14n + 8)τ_FA
… 2^n − 2^((n+1)/2) + 1, 2^n + 2^((n+1)/2) + 1}
32 [73] | {2^n, 2^n − 1, 2^n + 1, 2^n − 2^((n+1)/2) + 1, 2^n + 2^((n+1)/2) + 1, 2^(n−1) + 1} | 28n + (n − 1)(11 + n/2) | 10n − 3 + 2(3 + log2(n/2 + 4))
33 [73] | {2^2n, 2^n − 1, 2^n + 1, 2^n − 2^((n+1)/2) + 1, 2^n + 2^((n+1)/2) + 1, 2^(n−1) + 1} | 36n + (n − 1)(12 + n/2) | 10n + 3 + 2(3 + log2(n/2 + 4))
34 [73] | {2^3n, 2^n − 1, 2^n + 1, 2^n − 2^((n+1)/2) + 1, 2^n + 2^((n+1)/2) + 1, 2^(n−1) + 1} | 28n + (n − 1)(13 + n/2) | 10n + 2 + 2(3 + log2(n/2 + 4))
35 [63, 64] | {2^n − 1, 2^n + 1, 2^2n − 2, 2^(2n+1) − 3} | (28n + 9)A_FA + (9n + 4)A_NOT + 3(2n)A_2:1MUX | (14n + 10)τ_FA
36 [74] | {2^k, 2^n − 1, 2^n + 1, 2^(n+1) − 1} | ((n^2 + 27n)/2 + 2)A_FA + (2n + k + 2)A_INV | (11n + l + 10)τ_FA
37 [60] Version 1 | {2^n + 1, 2^n, 2^n − 1, 2^(n−1) + 1}, n odd | ((n^2 − 13)/2 + 13n)A_FA + A_MUX + nA_OR + ((n^2 + 3)/4 + 6n)A_INV | (10n + log1.5((n − 3)/2) + 5)D_FA + D_OR + 2D_INV + D_MUX
38 [60] Version 2 | {2^n + 1, 2^n, 2^n − 1, 2^(n−1) + 1}, n odd | (n^2/2 + 13n)A_FA + 3A_XOR + 2A_AND + (3n + 4)A_INV | (8n + log1.5((n − 5)/2) + 5)D_FA + 2D_XOR + 2D_INV
39 [60] Version 3 | {2^n + 1, 2^n, 2^n − 1, 2^(n−1) + 1}, n odd | (2n^2 + 10n)A_FA + 3A_XOR + 2A_AND + (3n + 4)A_INV | (8n + log1.5((n − 1)/2) + 2)D_FA + 2D_XOR + 2D_INV
Table 5.3 Multiplicative inverses of four and five moduli sets using subsets

Moduli set (A, B) | (1/A) mod B
{P, 2^(n+1) − 1} | (2^(n+2) − 10)/3
{P, 2^(n+1) + 1} | 2^n + 2^(n−2) + ⋯ + 2^(n−2k) + ⋯ + 2^4 + 2 (until n − 2k = 5) for n ≥ 5; 14 for n = 3
{P, 2^(n−1) − 1} | (2^n + 2^(n−2) − 2)/3
{2^5n − 2^n, 2^(n+1) + 1} [73] | Σ_{i=0}^{(n−11)/4} 2^(4i+8) − 2^3 − 2^2 − 2^1 − 2^0
{2^5n − 2^n, 2^(n−1) + 1} [73] | Σ_{i=0}^{(n−9)/4} 2^(4i+9) + 2^4 + 2^3 + 2^2 + 2^1
{2^7n − 2^3n, 2^(n+1) + 1} [73] | Σ_{i=0}^{(n−11)/4} 2^(4i+10) − 2^5 − 2^4 − 2^3 − 2^2
{2^7n − 2^3n, 2^(n−1) + 1} [73] | Σ_{i=0}^{(n−9)/4} …
XA = (1/2^n) mod (2^(n+1) − 1) = 2    (5.68a)

XB = (1/2^n) mod (2^n − 1) = 1    (5.68b)

XC = (1/(2^n − 1)) mod (2^(n+1) − 1) = −2    (5.68c)
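These inverses are easy to confirm numerically; the sketch below uses Python's three-argument `pow`, with n = 4 chosen as an arbitrary word length.

```python
# Quick numeric check of the multiplicative inverses (5.68a-c) for the
# moduli set {2^(n+1) - 1, 2^n - 1, 2^n}, here with n = 4.
n = 4
mA, mB = 2**(n + 1) - 1, 2**n - 1       # 31, 15

XA = pow(2**n, -1, mA)        # (5.68a): (1/2^n) mod (2^(n+1)-1) = 2
XB = pow(2**n, -1, mB)        # (5.68b): (1/2^n) mod (2^n-1)     = 1
XC = pow(2**n - 1, -1, mA)    # (5.68c): (1/(2^n-1)) mod (2^(n+1)-1) = -2

assert XA == 2
assert XB == 1
assert XC == mA - 2           # -2 represented mod 2^(n+1) - 1
```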
The implementation of the MRC algorithm of Figure 5.4a using the various multiplicative inverses given in (5.68a–c) follows the architecture given in Figure 5.4b. The various modulo subtractors can make use of the well-known properties of multiplication by 2^x mod mi. The subtraction (r2 − r3) mod (2^n − 1) can be realized by one's-complementing r3 and adding it to r2 using a mod (2^n − 1) adder in the block MODSUB1. The mixed radix digit UB is thus the already available (r2 − r3)
Table 5.4 Multiplicative inverses (1/Mi) mod mi for use in CRT for various moduli sets

Moduli set | (1/M1) mod m1 | (1/M2) mod m2 | (1/M3) mod m3 | (1/M4) mod m4 | (1/M5) mod m5
{2^n − 1, 2^n, 2^(n−1) − 1} | 2^n − 3 | 2^(n−1) + 1 | 2^(n−2) | – | –
{2^n − 1, 2^n, 2^n + 1} | 2^(n−1) | 2^n − 1 | 2^(n−1) + 1 | – | –
{2^n − 1, 2^n, 2^(n+1) − 1} | 1 | 1 | (−4) mod (2^(n+1) − 1) | – | –
{2^n, 2^2n − 1, 2^2n + 1} | 2^n − 1 | 2^(n−1) | 2^(n−1) | – | –
{2^n − 1, 2^(n+k), 2^n + 1} | 2^(n−k−1) | 2^(n+k) − 1 | (−2^(n−k−1)) mod (2^n + 1) | – | –
{2^n − 1, 2^2n, 2^n + 1} | 2^(n−1) | (−1) mod 2^2n | 2^(n−1) | – | –
{2^n − 1, 2^n, 2^n + 1, 2^(n+1) + 1} [52] | 2^(n−1) + (2/3)(2^(n−1) − 1) | 2^n − 1 | 2^(n−1) | 2^n + 3 + (2^n + 1)/3 | –
{2^n − 1, 2^n, 2^n + 1, 2^(n+1) − 1} [51] | 2^(n−1) | 1 | 2^n − (2^(n−1) − 2)/3 | 2^(n+1) − 4 − (2^(n+1) − 2)/3 | –
{2^n, 2^n − 1, 2^n + 1, 2^n − 2^((n+1)/2) + 1, 2^n + 2^((n+1)/2) + 1} [68] | (−1) mod m1 | 2^(n−2) | (−2^(n−2)) mod m3 | 2^((n−5)/2) | (−2^((n−5)/2)) mod m5
{2^(n+1), 2^n − 1, 2^n + 1, 2^n − 2^((n+1)/2) + 1, 2^n + 2^((n+1)/2) + 1} [68] | (−1) mod m1 | 2^(n−3) | (−2^(n−3)) mod m3 | 2^((n−7)/2) | (−2^((n−7)/2)) mod m5
{2^2n, 2^n − 1, 2^n + 1, 2^n − 2^((n+1)/2) + 1, 2^n + 2^((n+1)/2) + 1} [73] | (−1) mod m1 | 2^(n−2) | 2^(n−2) | (2^(n−2) + 2^((n−5)/2)) mod m4 | (2^(n−2) − 2^((n−5)/2)) mod m5
{2^3n, 2^n − 1, 2^n + 1, 2^n − 2^((n+1)/2) + 1, 2^n + 2^((n+1)/2) + 1, 2^(n−1) + 1} [73] | (−1) mod m1 | 2^(n−2) | (−2^(n−2)) mod m3 | (−2^((n−5)/2)) mod m4 | 2^((n−5)/2)
Table 5.5 Multiplicative inverses for MRC for various moduli sets using New CRT II

Moduli set {m1, m2, m3, m4} | (1/m1) mod m2 | (1/m4) mod m3 | (1/(m3·m4)) mod (m1·m2)
{2^n − 1, 2^n + 1, 2^2n − 2, 2^(2n+1) − 3} [64] | 2^(n−1) | 1 | 1
{2^n + 1, 2^n − 1, 2^n, 2^(n+1) + 1} [56] | 2^(n−1) | 1 | (2^(2n+2) − 2^n − 2)/3
{2^n + 1, 2^n − 1, 2^((1+α)n), 2^(2n+1) − 1} [65] | 2^(n−1) | 2^((1+α)n) − 1 | 2^n for α = 0; 1 for α = 1
{2^n + 1, 2^n − 3, 2^n − 1, 2^n + 3} [57, 58] | 3·2^(n−2) − 2 | 2^(n−2) | (1/2^(n+2)) mod (2^2n − 2^(n+1) − 3)
{2^n + 1, 2^n − 1, 2^n, 2^(2n+1) − 1} [63] | 2^(n−1) | 2^(n+1) | 2^n

Table 5.6 Multiplicative inverses for New CRT I for various moduli sets

Moduli set | (1/m1) mod k1 | (1/(m1·m2)) mod k2 | (1/(m1·m2·m3)) mod m4
{2^n − 1, 2^n, 2^(n−1) − 1} [45] | 2^(2n−2) − 2^n − 2^(n−2) + 2 | 2^(n−2) | –
{2^n, 2^n + 1, 2^2n + 1, 2^n − 1} [62] | 2^3n | 2^(3n−2) + 2^(2n−1) − 2^(n−2) | 2^(n−2)
{2^2n, 2^2n + 1, 2^n + 1, 2^n − 1} [63] | 2^2n | 2^(2n−1) | 2^(n−2)
{2^n, 2^n + 1, 2^n − 1} [14] | 2^n | 2^(n−1) | –

k1 = m2·m3 for three-moduli sets and m2·m3·m4 for four-moduli sets
k2 = m3 for three-moduli sets and m3·m4 for four-moduli sets
mod (2^n − 1), since XB = 1. The subtraction (r1 − r3) mod (2^(n+1) − 1) involves two numbers of different word lengths: r1 of (n + 1) bits and r3 of n bits. By appending a most significant bit (MSB) of zero, r3 can be considered as an (n + 1)-bit word. Thus, the one's complement of this word can be added to r1 using an (n + 1)-bit modulo adder in the block MODSUB2.
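The one's-complement trick used in MODSUB1–MODSUB3 can be sketched in software as follows; the function name and the pure-software end-around-carry formulation are illustrative, not taken from [50].

```python
# Bit-level sketch of one's-complement modular subtraction:
# (a - b) mod (2^w - 1) computed as (a + ~b) with an end-around carry.
def mod_sub(a, b, w):
    """(a - b) mod (2**w - 1), for 0 <= a, b < 2**w - 1."""
    mod = (1 << w) - 1
    s = a + ((~b) & mod)          # add the bitwise complement of b
    s = (s & mod) + (s >> w)      # fold the carry-out back in (end-around)
    return 0 if s == mod else s

n = 4
r2, r3 = 9, 5
assert mod_sub(r2, r3, n) == (r2 - r3) % (2**n - 1)   # 4
```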
Next, UA can be obtained by circularly left shifting the already obtained (r1 − r3) mod (2^(n+1) − 1) by 1 bit. The computation of (UA − UB) mod (2^(n+1) − 1) can be carried out as explained before in the case of (r1 − r3) mod (2^(n+1) − 1), since UA is (n + 1) bits wide and UB is n bits wide, using the block MODSUB3.
Next, the multiplication of (UA − UB) mod (2^(n+1) − 1) by XC = −2 to obtain UC [see (5.68c)] is carried out by first left circularly shifting (UA − UB) mod (2^(n+1) − 1) by 1 bit and then one's complementing the bits of the result. The last stage in the converter shall compute
B = (UC·(2^n − 1) + UB)·2^n + r3    (5.69)

where UC, UB and r3 are the mixed radix digits. Note, however, that since the least significant bits of B are given by r3, we need to compute only B′ (the (2n + 1) MSBs of B):

B′ = (B − r3)/2^n    (5.70)

B′ = UC·(2^n − 1) + UB    (5.71)
Figure 5.4 (a) Mixed radix conversion flow chart, (b) architecture of implementation of (a), and (c) bit matrix for computing B′ (Adapted from [50] © IEEE 2007)
The hardware requirements for this converter are thus nA_FA for MODSUB1, (n + 1)A_FA each for MODSUB2 and MODSUB3, and (n + 1)A_FA + nA_XNOR + nA_OR for CPA1. The total hardware requirement and conversion time are presented in Table 5.1 (entry 11).
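The whole Converter I data flow of (5.68a–c) and (5.69) can be checked numerically; the sketch below uses ordinary Python integer arithmetic in place of the modulo adders, with n = 4 and the test value chosen arbitrarily.

```python
# End-to-end numeric sketch of Converter I's MRC decode for the set
# {2^(n+1)-1, 2^n-1, 2^n}, following (5.68a-c) and (5.69), n = 4.
n = 4
m1, m2, m3 = 2**(n + 1) - 1, 2**n - 1, 2**n     # 31, 15, 16
X_true = 1234
r1, r2, r3 = X_true % m1, X_true % m2, X_true % m3

UB = (r2 - r3) % m2                  # XB = 1, so no multiplication needed
UA = (2 * ((r1 - r3) % m1)) % m1     # XA = 2: one circular left shift
UC = (-2 * ((UA - UB) % m1)) % m1    # XC = -2, per (5.68c)
B = (UC * m2 + UB) * m3 + r3         # eq. (5.69)
assert B == X_true
```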
References
1. N.S. Szabo, R.I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology
(Mc-Graw Hill, New-York, 1967)
2. G. Bi, E.V. Jones, Fast conversion between binary and Residue Numbers. Electron. Lett. 24,
1195–1197 (1988)
3. P. Bernardson, Fast memory-less over 64-bit residue to binary converter. IEEE Trans. Circuits
Syst. 32, 298–300 (1985)
4. S. Andraros, H. Ahmad, A new efficient memory-less residue to binary converter. IEEE Trans.
Circuits Syst. 35, 1441–1444 (1988)
5. K.M. Ibrahim, S.N. Saloum, An efficient residue to binary converter design. IEEE Trans.
Circuits Syst. 35, 1156–1158 (1988)
6. A. Dhurkadas, Comments on “An efficient Residue to Binary converter design”. IEEE Trans.
Circuits Syst. 37, 849–850 (1990)
7. P.V. Ananda Mohan, D.V. Poornaiah, Novel RNS to binary converters, in Proceedings of
IEEE ISCAS, pp. 1541–1544 (1991)
8. S.J. Piestrak, A high-Speed realization of Residue to Binary System conversion. IEEE Trans.
Circuits Syst. II 42, 661–663 (1995)
9. A. Dhurkadas, Comments on “A High-speed realisation of a residue to binary Number system
converter”. IEEE Trans. Circuits Syst. II 45, 446–447 (1998)
10. M. Bhardwaj, A.B. Premkumar, T. Srikanthan, Breaking the 2n-bit carry propagation barrier in
Residue to Binary conversion for the [2n-1, 2n, 2n+1] moduli set. IEEE Trans. Circuits Syst. II
45, 998–1002 (1998)
11. R. Conway, J. Nelson, Fast converter for 3 moduli RNS using new property of CRT. IEEE
Trans. Comput. 48, 852–860 (1999)
12. Z. Wang, G.A. Jullien, W.C. Miller, An improved Residue to Binary Converter. IEEE Trans.
Circuits Syst. I 47, 1437–1440 (2000)
13. P.V. Ananda Mohan, Comments on “Breaking the 2n-bit carry propagation barrier in Residue
to Binary conversion for the [2n-1, 2n, 2n+1] moduli set”. IEEE Trans. Circuits Syst. II 48, 1031
(2001)
14. Y. Wang, X. Song, M. Aboulhamid, H. Shen, Adder based residue to binary number converters
for (2n-1, 2n, 2n+1). IEEE Trans. Signal Process. 50, 1772–1779 (2002)
15. W. Wang, M.N.S. Swamy, M.O. Ahmad, Y. Wang, A study of the residue-to-Binary con-
verters for the three moduli sets. IEEE Trans. Circuits Syst. I 50, 235–243 (2003)
16. B. Vinnakota, V.V.B. Rao, Fast conversion techniques for Binary to RNS. IEEE Trans.
Circuits Syst. I 41, 927–929 (1994)
17. P.V. Ananda Mohan, Evaluation of Fast Conversion techniques for Binary-Residue Number
Systems. IEEE Trans. Circuits Syst. I 45, 1107–1109 (1998)
18. D. Gallaher, F.E. Petry, P. Srinivasan, The digit parallel method for Fast RNS to weighted
number System conversion for specific moduli (2k-1, 2k, 2k+1). IEEE Trans. Circuits Syst. II
44, 53–57 (1997)
19. P.V. Ananda Mohan, On “The Digit Parallel method for fast RNS to weighted number system
conversion for specific moduli (2k-1, 2k, 2k+1)”. IEEE Trans. Circuits Syst. II 47, 972–974
(2000)
20. A.S. Ashur, M.K. Ibrahim, A. Aggoun, Novel RNS structures for the moduli set {2n-1, 2n, 2n
+1} and their application to digital filter implementation. Signal Process. 46, 331–343 (1995)
21. R. Chaves, L. Sousa, {2n+1, 2n+k, 2n-1}: a new RNS moduli set extension, in Proceedings of
Euro Micro Systems on Digital System Design, pp. 210–217 (2004)
22. A. Hiasat, A. Sweidan, Residue-to-binary decoder for an enhanced moduli set. Proc. IEE
Comput. Digit. Tech. 151, 127–130 (2004)
23. M.A. Soderstrand, C. Vernia, J.H. Chang, An improved residue number system digital to
analog converter. IEEE Trans. Circuits Syst. 30, 903–907 (1983)
24. T.V. Vu, Efficient implementations of the Chinese remainder theorem for sign detection and
residue decoding. IEEE Trans. Comput. 34, 646–651 (1985)
25. G.C. Cardarilli, M. Re, R. Lojacano, G. Ferri, A systolic architecture for high-performance
scaled residue to binary conversion. IEEE Trans. Circuits Syst. I 47, 1523–1526 (2000)
26. G. Dimauro, S. Impedevo, R. Modugno, G. Pirlo, R. Stefanelli, Residue to binary conversion
by the “Quotient function”. IEEE Trans. Circuits Syst. II 50, 488–493 (2003)
27. J.Y. Kim, K.H. Park, H.S. Lee, Efficient residue to binary conversion technique with rounding
error compensation. IEEE Trans. Circuits Syst. 38, 315–317 (1991)
28. C.H. Huang, A fully parallel Mixed-Radix conversion algorithm for residue number applica-
tions. IEEE Trans. Comput. 32, 398–402 (1983)
29. Y. Wang, Residue to binary converters based on New Chinese Remainder theorems. IEEE
Trans. Circuits Syst. II 47, 197–205 (2000)
30. P.V. Ananda Mohan, Comments on “Residue-to-Binary Converters based on New Chinese
Remainder Theorems”. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 47, 1541
(2000)
31. D.F. Miller, W.S. McCormick, An arithmetic free Parallel Mixed-Radix conversion algorithm.
IEEE Trans. Circuits Syst. II 45, 158–162 (1998)
32. Antonio Garcia, G.A. Jullien, Comments on “An Arithmetic Free Parallel Mixed-Radix
Conversion Algorithm”, IEEE Trans. Circuits Syst. II Analog Digit. Signal Process, 46,
1259–1260 (1999)
33. H.M. Yassine, W.R. Moore, Improved Mixed radix conversion for residue number system
architectures. Proc. IEE Part G 138, 120–124 (1991)
34. S. Bi, W.J. Gross, The Mixed-Radix Chinese Remainder Theorem and its applications to
Residue comparison. IEEE Trans. Comput. 57, 1624–1632 (2008)
35. A. Skavantzos, Y. Wang, New efficient RNS-to-weighted decoders for conjugate pair moduli
residue number Systems, in Proceedings of 33rd Asilomar Conference on Signals, Systems and
Computers, vol. 2, pp. 1345–1350 (1999)
36. Y. Wang, New Chinese remainder theorems, in Proceedings of 32nd Asilomar Conference on
Signals, Systems and Computers, pp. 165–171 (1998)
37. A.B. Premkumar, An RNS to binary converter in 2n-1, 2n, 2n+1 moduli set. IEEE Trans.
Circuits Syst. II 39, 480–482 (1992)
38. A.B. Premkumar, M. Bhardwaj, T. Srikanthan, High-speed and low-cost reverse converters for
the (2n-1, 2n, 2n+1) moduli set. IEEE Trans. Circuits Syst. II 45, 903–908 (1998)
39. Y. Wang, M.N.S. Swamy, M.O. Ahmad, Residue to binary converters for three moduli sets.
IEEE Trans. Circuits Syst. II 46, 180–183 (1999)
40. K.A. Gbolagade, G.R. Voicu, S.D. Cotofana, An efficient FPGA design of residue-to-binary
converter for the moduli set {2n+1, 2n, 2n-1}. IEEE Trans. Very Large Scale Integr. (VLSI)
Syst. 19, 1500–1503 (2011)
41. A.B. Premkumar, An RNS to binary converter in a three moduli set with common factors.
IEEE Trans. Circuits Syst. II 42, 298–301 (1995)
42. A.B. Premkumar, Corrections to “An RNS to Binary converter in a three moduli set with
common factors”. IEEE Trans. Circuits Syst. II 51, 43 (2004)
43. K.A. Gbolagade, S.D. Cotofana, A residue-to-binary converter for the {2n+2, 2n+1, 2n}
moduli set, in Proceedings of 42nd Asilomar Conference on Signals, Systems Computers,
pp. 1785–1789 (2008)
44. A.A. Hiasat, H.S. Abdel-Aty-Zohdy, Residue to Binary arithmetic converter for the moduli set
(2k, 2k-1, 2k-1-1). IEEE Trans. Circuits Syst. II 45, 204–209 (1998)
45. W. Wang, M.N.S. Swamy, M.O. Ahmad, Y. Wang, A high-speed residue-to-binary converter for three-moduli {2k, 2k-1, 2k-1-1} RNS and a scheme for its VLSI implementation. IEEE Trans. Circuits Syst. II 47, 1576–1581 (2000)
46. W. Wang, M.N.S. Swamy, M.O. Ahmad, Y. Wang, A note on "A high-speed residue-to-binary converter for three-moduli {2k, 2k-1, 2k-1-1} RNS and a scheme for its VLSI implementation". IEEE Trans. Circuits Syst. II 49, 230 (2002)
47. P.V. Ananda Mohan, New residue to Binary converters for the moduli set {2k, 2k-1, 2k-1-1},
IEEE TENCON, doi:10.1109/TENCON.2008.4766524 (2008)
48. P.V. Ananda Mohan, Reverse converters for the moduli sets {22n-1, 2n, 22n+1} and {2n-3, 2n
+1, 2n-1, 2n+3}, in SPCOM, Bangalore, pp. 188–192 (2004)
49. P.V. Ananda Mohan, Reverse converters for a new moduli set {22n-1, 2n, 22n+1}. CSSP 26,
215–228 (2007)
50. P.V. Ananda Mohan, RNS to binary converter for a new three moduli set {2n+1 -1, 2n, 2n-1}.
IEEE Trans. Circuits Syst. II 54, 775–779 (2007)
51. A.P. Vinod, A.B. Premkumar, A residue to Binary converter for the 4-moduli superset {2n-1,
2n, 2n+1, 2n+1-1}. JCSC 10, 85–99 (2000)
52. M. Bhardwaj, T. Srikanthan, C.T. Clarke, A reverse converter for the 4 moduli super set {2n-1,
2n, 2n+1, 2n+1+1}, in IEEE Conference on Computer Arithmetic, pp. 168–175 (1999)
53. P.V. Ananda Mohan, A.B. Premkumar, RNS to Binary converters for two four moduli sets {2n-
1, 2n, 2n+1, 2n+1-1} and {2n-1, 2n, 2n+1, 2n+1+1}. IEEE Trans. Circuits Syst. I 54, 1245–1254
(2007)
54. B. Cao, T. Srikanthan, C.H. Chang, Efficient reverse converters for the four-moduli sets {2n-1,
2n, 2n+1, 2n+1-1} and {2n-1, 2n, 2n+1, 2n-1-1}. IEE Proc. Comput. Digit. Tech. 152, 687–696
(2005)
55. M. Hosseinzadeh, A. Molahosseini, K. Navi, An improved reverse converter for the moduli set
{2n+1, 2n-1, 2n, 2n+1-1}. IEICE Electron. Exp. 5, 672–677 (2008)
56. L. Sousa, S. Antao, R. Chaves, On the design of RNS reverse converters for the four-moduli set
{2n+1, 2n-1, 2n, 2n+1+1}. IEEE Trans. VLSI Syst. 21, 1945–1949 (2013)
57. M.H. Sheu, S.H. Lin, C. Chen, S.W. Yang, An efficient VLSI design for a residue to binary
converter for general balance moduli (2n-3, 2n-1, 2n+1, 2n+3). IEEE Trans. Circuits Syst. Exp.
Briefs 51, 52–55 (2004)
58. P.V. Ananda Mohan, New Reverse converters for the moduli set {2n-3, 2n + 1, 2n-1, 2n + 3}.
AEU 62, 643–658 (2008)
59. G. Jaberipur, H. Ahmadifar, A ROM-less reverse converter for moduli set {2q ± 1, 2q ± 3}. IET Comput. Digit. Tech. 8, 11–22 (2014)
60. P. Patronik, S.J. Piestrak, Design of Reverse Converters for the new RNS moduli set {2n+1, 2n,
2n-1, 2n-1+1} (n odd). IEEE Trans. Circuits Syst. I 61, 3436–3449 (2014)
61. L.S. Didier, P.Y. Rivaille, A generalization of a fast RNS conversion for a new 4-Modulus
Base. IEEE Trans. Circuits Syst. II Exp. Briefs 56, 46–50 (2009)
62. B. Cao, C.H. Chang, T. Srikanthan, An efficient reverse converter for the 4-moduli set {2n-1,
2n, 2n+1, 22n+1} based on the new Chinese Remainder Theorem. IEEE Trans. Circuits Syst. I
50, 1296–1303 (2003)
63. A.S. Molahosseini, K. Navi, C. Dadkhah, O. Kavehei, S. Timarchi, Efficient reverse converter
designs for the new 4-moduli sets {2n-1, 2n, 2n+1, 22n+1-1} and {2n-1, 2n+1, 22n, 22n+1} based
on new CRTs. IEEE Trans. Circuits Syst. I 57, 823–835 (2010)
64. W. Zhang, P. Siy, An efficient design of residue to binary converter for the moduli set {2n-1, 2n
+1, 22n-2, 22n+1-3} based on new CRT II. Elsevier J. Inf. Sci. 178, 264–279 (2008)
65. L. Sousa, S. Antao, MRC based RNS reverse converters for the four-moduli sets {2n+1,2n-1,2n,
22n+1-1} and {2n+1,2n-1,22n, 22n+1-1}. IEEE Trans. Circuits Syst. II 59, 244–248 (2012)
66. N. Stamenkovic, B. Jovanovic, Reverse Converter design for the 4-moduli set {2n-1,2n,2n+1,
22n+1-1} based on the Mixed-Radix conversion. Facta Universitat (NIS) SER: Elec. Energy 24,
91–105 (2011)
67. B. Cao, C.H. Chang, T. Srikanthan, A residue to binary converter for a New Five-moduli set.
IEEE Trans. Circuits Syst. I 54, 1041–1049 (2007)
68. A.A. Hiasat, VLSI implementation of new arithmetic residue to binary decoders. IEEE Trans.
Very Large Scale Integr. (VLSI) Syst. 13, 153–158 (2005)
69. A. Skavantzos, T. Stouraitis, Grouped-moduli residue number systems for Fast signal
processing, in Proceedings of IEEE ISCAS, pp. 478–483 (1999)
70. A. Skavantzos, M. Abdallah, Implementation issues of the two-level residue number system
with pairs of conjugate moduli. IEEE Trans. Signal Process. 47, 826–838 (1999)
71. H. Pettenghi, R. Chaves, L. Sousa, Method to design general RNS converters for extended
moduli sets. IEEE Trans. Circuits Syst. II 60, 877–881 (2013)
72. A. Skavantzos, M. Abdallah, T. Stouraitis, D. Schinianakis, Design of a balanced 8-modulus
RNS, in Proceedings of IEEE ISCAS, pp. 61–64 (2009)
73. H. Pettenghi, R. Chaves, L. Sousa, RNS reverse converters for moduli sets with dynamic
ranges up to (8n+1) bits. IEEE Trans. Circuits Syst. 60, 1487–1500 (2013)
74. G. Chalivendra, V. Hanumaiah, S. Vrudhula, A new balanced 4-moduli set {2k, 2n-1, 2n+1,
2n+1-1} and its reverse converter design for efficient reverse converter implementation, in
Proceedings of ACM GSVLSI, Lausanne, Switzerland, pp. 139–144 (2011)
75. P. Patronik, S.J. Piestrak, Design of Reverse converters for general RNS moduli sets {2k, 2n-1,
2n+1, 2n+1-1} and {2k, 2n-1, 2n+1, 2n-1-1} (n even). IEEE Trans. Circuits Syst. I 61, 1687–1700
(2014)
76. R. Conway, J. Nelson, New CRT based RNS converter for restricted moduli set. IEEE Trans.
Comput. 52, 572–578 (2003)
77. R. Lojacono, G. C. Cardarilli, A. Nannarelli, M. Re, Residue Arithmetic techniques for high
performance DSP, in IEEE 4th World Multi-conference on Circuits, Communications and
Computers, CSCC-2000, pp. 314–318 (2000)
78. M. Re, A. Nannarelli, G.C. Cardarilli, R. Lojacono, FPGA implementation of RNS to binary
signed conversion architecture, Proc. ISCAS, IV, 350–353 (2001)
79. L. Akushskii, V.M. Burcev, I.T. Pak, A New Positional Characteristic of Non-positional Codes
and Its Application, in Coding Theory and Optimization of Complex Systems, ed. by
V.M. Amerbaev (Nauka, Kazakhstan, 1977)
80. D.D. Miller et al., Analysis of a Residue Class Core Function of Akushskii, Burcev and Pak, in
RNS Arithmetic: Modern Applications in DSP, ed. by G.A. Jullien (IEEE Press, Piscataway,
1986)
81. J. Gonnella, The application of core functions to residue number systems. IEEE Trans. Signal
Process. SP-39, 69–75 (1991)
82. N. Burgess, Scaled and unscaled residue number systems to binary conversion techniques
using the core function, in Proceedings of 13th IEEE Symposium on Computer Arithmetic, pp
250–257 (1997)
83. N. Burgess, Scaling a RNS number using the core function, in Proceedings of 16th IEEE
Symposium on Computer Arithmetic, pp. 262–269 (2003)
84. M. Abtahi, P. Siy, Core function of an RNS number with no ambiguity. Comput. Math. Appl.
50, 459–470 (2005)
85. M. Abtahi, P. Siy, The non-linear characteristic of core function of RNS numbers and its effect
on RNS to binary conversion and sign detection algorithms, in Proceedings of NAFIPS 2005-
Annual Meeting of the North American Fuzzy Information Processing Society, pp. 731–736
(2005)
86. R. Krishnan, J. Ehrenberg, G. Ray, A core function based residue to binary decoder for RNS
filter architectures, in Proceedings of 33rd Midwest Symposium on Circuits and Systems,
pp. 837–840 (1990)
87. G. Dimauro, S. Impedevo, G. Pirlo, A new technique for fast number comparison in the
Residue Number system. IEEE Trans. Comput. 42, 608–612 (1993)
88. G. Dimauro, S. Impedevo, G. Pirlo, A. Salzo, RNS architectures for the implementation of the
diagonal function. Inf. Process. Lett. 73, 189–198 (2000)
89. G. Pirlo, D. Impedovo, A new class of monotone functions of the Residue number system. Int.
J. Math. Models Methods Appl. Sci. 7, 802–809 (2013)
90. P.V. Ananda Mohan, RNS to binary conversion using diagonal function and Pirlo and
Impedovo monotonic function, Circuits Syst. Signal Process. 35, 1063–1076 (2016)
91. S.J. Piestrak, A note on RNS architectures for the implementation of the diagonal function. Inf.
Process. Lett. 115, 453–457 (2015)
92. P.L. Montgomery, Modular multiplication without trial division. Math. Comput. 44, 519–521
(1985)
Further Reading
R.E. Altschul, D.D. Miller, Residue to binary conversion using the core function, in 22nd Asilomar
Conference on Signals, Systems and Computers, pp. 735–737 (1988)
M. Esmaeildoust, K. Navi, M. Taheri, A.S. Molahosseini, S. Khodambashi, Efficient RNS to
binary converters for the new 4-moduli set {2^n, 2^{n+1} − 1, 2^n − 1, 2^{n−1} − 1}. IEICE Electron.
Exp. 9(1), 1–7 (2012)
F. Pourbigharaz, H.M. Yassine, A signed digit architecture for residue to binary transformation.
IEEE Trans. Comput. 46, 1146–1150 (1997)
W. Zhang, P. Siy, An efficient FPGA design of RNS core function extractor, in Proceedings of
2005 Annual Meeting of the North American Fuzzy Information Processing Society
(NAFIPS), pp. 722–724 (2005)
Chapter 6
Scaling, Base Extension, Sign Detection
and Comparison in RNS
Example 6.1 This example illustrates the Szabo and Tanaka base extension technique
for the moduli set {3, 5, 7}, given the residues 2 and 3 corresponding to the
moduli 5 and 7, respectively. We need to find the residue corresponding to modulus
3. We use MRC starting from modulus 7.
Modulus:                              3            5      7
x:                                    x            2      3
Subtract 3:                           |x − 3|_3    4      0
Multiply by |7^{-1}| (= 1 and 3):     |x − 3|_3    2
Subtract 2:                           |x − 2|_3    0
Multiply by |5^{-1}|_3 = 2:           |2x − 1|_3

The condition |2x − 1|_3 = 0 yields x = 2. Thus, the RNS number is (2, 2, 3). ■
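The same extension can be checked numerically. The sketch below (an illustrative helper, not taken from the text) reconstructs the value from the known residues by MRC and then reduces it modulo the new modulus; it peels the moduli in the opposite order to the example above, which does not affect the result:

```python
def mrc_base_extend(residues, moduli, new_modulus):
    """Base-extend an RNS number to new_modulus via mixed radix conversion."""
    digits = []                     # mixed radix digits a_0, a_1, ...
    res = list(residues)
    for i, mi in enumerate(moduli):
        a = res[i]
        digits.append(a)
        # subtract the digit and multiply by m_i^{-1} in the remaining channels
        for j in range(i + 1, len(moduli)):
            res[j] = (res[j] - a) * pow(mi, -1, moduli[j]) % moduli[j]
    # evaluate a_0 + a_1*m_0 + a_2*m_0*m_1 + ... modulo new_modulus
    x, weight = 0, 1
    for a, mi in zip(digits, moduli):
        x = (x + a * weight) % new_modulus
        weight = weight * mi % new_modulus
    return x

# Example 6.1: residues (2, 3) for moduli (5, 7); the residue mod 3 is 2
print(mrc_base_extend([2, 3], [5, 7], 3))  # -> 2
```

The reconstructed value here is 17 = 2 + 3·5, whose residue mod 3 is indeed 2, matching (2, 2, 3).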
Alternative techniques based on CRT are available but they need the use of a
redundant modulus and ROMs. Shenoy and Kumaresan [2] have suggested base
extension using CRT. This needs one extra (redundant) modulus. Consider the three
moduli set {m1, m2, m3} and given residues (r1, r2, r3). We wish to extend the base
to modulus m4. We need a redundant modulus mr and we need to have the residue
corresponding to mr. In other words, all computations need to be done on moduli
m1, m2, m3 and mr. Using CRT, we can obtain the binary number X corresponding to
(r1, r2, r3) as
X = Σ_{i=1}^{3} M_i |M_i^{-1} r_i|_{m_i} − kM        (6.1)
The residue X mod m4 can be found from (6.1) if k is known. Using a redundant
modulus mr, if rr = X mod mr is known, k can be found from (6.1) as

k = | ( Σ_{i=1}^{3} M_i |M_i^{-1} r_i|_{m_i} − r_r ) |M^{-1}|_{m_r} |_{m_r}        (6.2)

where M_i = M/m_i and n is the total number of moduli in the RNS. In the approximate
method, the quantity r′_x is ignored and a correction term (S − 1)/2 is added to obtain
instead an estimate of the quotient as

y_e1 = Σ_{i=S+1}^{n} (M_i/Q) |M_i^{-1} x_i|_{m_i} − r_x M′
       + Σ_{i=1}^{S} (M′/m_i) ( |M_i^{-1} x_i|_{m_i} − |Q_i^{-1} x_i|_{m_i} ) + (S − 1)/2        (6.5)

y_e2 = Σ_{i=S+1}^{n} (M_i/Q) |M_i^{-1} x_i|_{m_i} − r′_xe M′
       + Σ_{i=1}^{S} (M′/m_i) ( |M_i^{-1} x_i|_{m_i} − |Q_i^{-1} x_i|_{m_i} )        (6.6)

where

r′_xe = | |Q^{-1}|_{m_t} ( Σ_{i=1}^{S} Q_i |Q_i^{-1} x_i|_{m_i} − |x|_{m_t} ) |_{m_t} − ε        (6.7)

and m_t is a redundant modulus, m_t ≥ S, prime to m_1, m_2, . . ., m_S.
Note that ε is 0 or 1, thus making r′_xe differ by 1 at most. This technique needs log_2 n
cycles, and the scaled integer has an error of at most unity, whereas in
the approximate scaling technique the error e is such that |e| ≤ (S − 1)/2 for S odd and
|e| ≤ S/2 for S even. The redundant residue channel is only log_2 n bits wide, where n is
the number of moduli, and does not depend on the dynamic range of the RNS.
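The Shenoy–Kumaresan base extension of (6.1)–(6.2) can be sketched as follows (a toy illustration with assumed moduli {3, 5, 7} and redundant modulus 13, not a configuration from the text):

```python
def shenoy_kumaresan_extend(r, m, rr, mr, m_new):
    """Base-extend residues r over moduli m to m_new, using the residue rr
    of the redundant modulus mr to recover the rank k of (6.1)-(6.2)."""
    M = 1
    for mi in m:
        M *= mi
    # s = sum M_i |M_i^{-1} r_i|_{m_i} = X + kM
    s = sum((M // mi) * (pow(M // mi, -1, mi) * ri % mi) for mi, ri in zip(m, r))
    # (6.2): k = |(s - r_r) M^{-1}|_{m_r}, valid since 0 <= k < n <= m_r
    k = (s - rr) * pow(M, -1, mr) % mr
    return (s - k * M) % m_new

# extend X = 52 over {3, 5, 7} to the modulus 11
X = 52
print(shenoy_kumaresan_extend([X % 3, X % 5, X % 7], [3, 5, 7], X % 13, 13, 11))  # -> 8
```

Because k is bounded by the number of moduli, reducing (s − r_r)M^{-1} modulo the small redundant channel recovers it exactly.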
Jullien [4] has proposed two techniques using look-up tables for scaling a
number in an N-moduli RNS by the product of S moduli. In the first method, based
on the Szabo and Tanaka approach, a first MRC step (denoted "original") obtains the
residues corresponding to the division by the desired product of S moduli. Next, base
extension is carried out for this result by a further MRC conversion. A third step (final
stage) is used for obtaining the residues corresponding to the S moduli. The flow
chart is illustrated in Figure 6.1a for a six-moduli RNS. The total number of look-up
tables needed for an N-moduli system scaled by the product of S moduli in the
three stages is as follows:

Original: L_2 = SN − S(S + 1)/2, needing n_2 = S cycles.
Mixed Radix: L_3 = (N − S − 1)(N − S)/2, needing n_3 = (N − S − 1) cycles.
Final stage: L_4 = S(N − S − 1), needing n_4 cycles, where 2^{n_4 − 1} < N − S ≤ 2^{n_4}.

The MRC stage and the final stage overlap, and the minimum number of look-up
cycles needed for these two stages is thus (N − S). Thus, N cycles are needed in total.
In the second technique due to Jullien [4], denoted scaling using estimates, the
CRT summation is divided by the product of the S moduli. Evidently, some of the
products M_i |M_i^{-1}|_{m_i} / Π_{j=0}^{S−1} m_j are integers and some are fractions. Next, these are
multiplied by the given residues. All the resulting fractions are rounded by adding
1/2. The residues of all these N numbers corresponding to the N − S moduli are added
mod m_i to get the residues of the estimated quotient. The number of cycles needed
in this first stage is n_1, where 2^{n_1 − 1} < S + 1 ≤ 2^{n_1}, and L_1 = (N − S)S tables are
needed. Next, base extension is carried out for these residues to get the residues
corresponding to the S moduli (see Figure 6.1b). The MRC and base extension steps
need L_3 and L_4 look-up tables as before. The total numbers of LUTs and cycles
needed are thus L_1 + L_3 + L_4 and N + n_1 − S, respectively, as compared to L_2 + L_3 + L_4
and N needed for the original algorithm. Note that all the tables have double inputs.
6.1 Scaling and Base Extension Techniques in RNS 137
Figure 6.1 Jullien's scaling techniques (a) based on Szabo–Tanaka approach and (b) based on
estimates (Adapted from [4] ©IEEE 1978) [diagram not reproduced]
Garcia and Lloris [5] have suggested an alternative scheme which needs only
two look-up cycles but bigger look-up tables. The first look-up uses as
addresses the residues of the S moduli and the residue corresponding to each of
x_{S+1}, . . ., x_N (see Figure 6.2).
Figure 6.2 Garcia and Lloris scaling techniques: (a) two look-up cycle scaling in the RNS, (b)
look-up generation of [y_1, . . ., y_S] and (c) look-up calculation of [y_{S+1}, . . ., y_N] (Adapted from [5]
©IEEE 1999) [diagram not reproduced]
Note that the base extension is carried out without needing any redundant
modulus. We scale the CRT expression by the product of the moduli M_p = m_1 m_2 m_3
first. The result is given by

X/M_p = X_E − ε M/M_p = Σ_{i=1}^{p} a_ip + Σ_{i=p+1}^{n} (M_i/M_p) |M_i^{-1} x_i|_{m_i} − ε M/M_p        (6.8a)

where (M_i/M_p is an integer for i > p)

a_ip = ⌊ (M_i/M_p) |M_i^{-1} x_i|_{m_i} ⌋        (6.8b)

Note that the integer and fractional parts are separated by this operation. It can be
shown that whereas in conventional CRT the multiple of M that needs to be
subtracted, rM (where r is the rank function), can range between 0 and n (the number
of moduli), Barsi and Pinotti observe that ε can be only 0 or 1, thus needing
subtraction of M/M_p at most once. ε can be 1 only under the conditions

Σ_{i=1}^{n} a_ip ≥ M/M_p + 1   and   |X_E| ≥ (p − 1)M_p − Σ_{i=1}^{p} M_p/m_i.
Note that (6.8a) needs to be evaluated for the other moduli (m_4, m_5 and m_6) to obtain
the residues corresponding to the residue set {x_1, x_2, x_3, 0, 0, 0}. The second base
extension after scaling by m_1 m_2 m_3, to the moduli set {m_1, m_2, m_3}, also needs to use a
similar technique. The architecture is sketched in Figure 6.3 for an n-moduli RNS
scaled by a product of s moduli.

Figure 6.3 Scaling scheme due to Barsi and Pinotti (Adapted from [6] ©IEEE 1995) [diagram not reproduced]
An example will be illustrative.

Example 6.3 Consider the moduli set {23, 25, 27, 29, 31, 32} and the given residues
(9, 21, 8, 3, 16, 17). This number corresponds to 578,321. We wish to divide this
number by the product of 23, 25 and 27, viz., 15,525. First we need to perform base
extension of the residue set {9, 21, 8} to the moduli 29, 31, 32. Note that this number
is 3896. The residues of 3896 corresponding to the moduli {29, 31, 32} are (10, 21, 24). It
can be seen that a_{1,p} = 4, a_{2,p} = 1 and a_{3,p} = 1. The conditions needed for application
of the Barsi and Pinotti technique evaluate to 26 and 527, respectively. Note that since
Σ_{i=1}^{3} a_ip = 6 < 26, we can use this technique (ε = 0). The extended digits
corresponding to the moduli m_4, m_5 and m_6 are given by evaluating, for illustration,

x*_4 = | Σ_{i=1}^{3} (M_p/m_i) a_ip |_{29} = |675·4 + 621·1 + 575·1|_{29} = |3896|_{29} = 10
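The numbers in Example 6.3 can be replayed with a short sketch (ε = 0 here, per the conditions checked above):

```python
m_low, m_high = [23, 25, 27], [29, 31, 32]
x_low = [9, 21, 8]
Mp = 23 * 25 * 27                      # 15525
# a_ip = |(Mp/m_i)^{-1} x_i|_{m_i}
a = [pow(Mp // mi, -1, mi) * xi % mi for mi, xi in zip(m_low, x_low)]
XE = sum((Mp // mi) * ai for mi, ai in zip(m_low, a))   # CRT sum, eps = 0
print(a)                                # -> [4, 1, 1]
print(XE)                               # -> 3896
print([XE % mi for mi in m_high])       # -> [10, 21, 24]
```

The extended residues (10, 21, 24) agree with the direct residues of 3896 stated in the example.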
Consider the CRT expansion

x = | Σ_{i=1}^{L} a_i |_M        (6.9)

where a_i = M_i |M_i^{-1} x_i|_{m_i} for an L-moduli RNS. We divide both sides of (6.9) by d. The
quantities a_i/d = (M_i/d) |M_i^{-1} x_i|_{m_i} and M/d are approximated with real numbers α_i and μ,
respectively. Typically, α_i and μ are chosen as integers. Note that the error in using
y to approximate x/d is given by
|x/d − y| < L(ε + δ)

or

| x/d |_{min(M/d, μ)} − L(ε + δ) ≤ y < | x/d |_{max(M/d, μ)}        (6.10)

where |α_i − a_i/d| ≤ ε < M/(dL) and |μ − M/d| ≤ δ < M/(dL) − ε. Note that ε and δ are the errors in
the summands and the error in the modulus, respectively. Note that the smaller of
the two errors in (6.10) is L times the error ε in the summands plus the error δ in the
modulus, and hence the method is named L(ε + δ)-CRT.
We can choose d = M/2^k, in which case M = d·2^k and the computation becomes

y = | Σ_{i=1}^{L} a′_i |_{2^k}   where   a′_i = ⌊ (M_i/d) |M_i^{-1} x_i|_{m_i} ⌋.

This is denoted as L-CRT, where ε = 1 and δ = 0. It may be noted that a large modulo-M
addition in CRT is thus converted into a smaller k-bit two's-complement addition. Thus, the
L-CRT can be implemented using look-up tables to find a′_i followed by a tree of k-bit adders.
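A minimal sketch of the L-CRT idea, with assumed small moduli for illustration: the large mod-M CRT sum collapses to a k-bit sum of precomputable terms a′_i, and the result stays within L of the exact scaled value (up to wrap-around at 2^k):

```python
m = [7, 11, 13]
M = 7 * 11 * 13          # 1001
k = 8                    # scale to 2^k, i.e. d = M / 2^k

def lcrt_scale(x_res):
    """y = |sum a'_i|_{2^k} with a'_i = floor(a_i * 2^k / M); approximates X * 2^k / M."""
    s = 0
    for mi, xi in zip(m, x_res):
        Mi = M // mi
        a = Mi * (pow(Mi, -1, mi) * xi % mi)   # a_i of (6.9)
        s += (a << k) // M                      # a'_i; in hardware, a table look-up
    return s % (1 << k)

X = 500
y = lcrt_scale([X % mi for mi in m])
print(y)   # -> 126 (exact value X*2^k/M = 127.87..., error below L = 3)
```

Each floor loses less than one unit, so the accumulated error is below L; the mod-2^k reduction is exact because M·2^k/M = 2^k (δ = 0).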
Meyer-Base and Stouraitis [8] have proposed a technique for scaling by a
power of 2. This is based on the following two facts: (a) in the case of a
number X which is even, multiplying each residue by |2^{-1}|_{m_i} yields X/2, and (b) in the
case of a number X which is odd, the division by 2 needs computing
|(x_i + 1) 2^{-1}|_{m_i}, i.e. (X + 1)/2. Thus, iterative division by 2 in r steps will result in
scaling by 2^r.
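The iterative power-of-two scaling can be sketched as below; parity is taken from the tracked integer purely for illustration, standing in for the base extension to modulus 2 discussed next:

```python
m = [3, 5, 7]            # all odd, so |2^{-1}|_{m_i} exists
inv2 = [pow(2, -1, mi) for mi in m]

def halve(x_res, is_odd):
    """One scaling level: X/2 if X is even, (X+1)/2 if X is odd, channel-wise."""
    if is_odd:
        return [(xi + 1) * i2 % mi for xi, i2, mi in zip(x_res, inv2, m)]
    return [xi * i2 % mi for xi, i2, mi in zip(x_res, inv2, m)]

X, r = 77, 3
x_res, t = [X % mi for mi in m], X
for _ in range(r):                 # scale by 2^r in r steps
    x_res = halve(x_res, t % 2 == 1)
    t = (t + 1) // 2 if t % 2 else t // 2
print(x_res, t)                    # -> [1, 0, 3] 10
```

The residues track the iteratively halved integer exactly; only the parity decision requires information beyond the residues themselves.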
Note, however, it is required to find whether the intermediate result is odd or
even. This needs a parity detection circuit implying base extension to the modulus
2 using Shenoy and Kumaresan technique which needs a redundant modulus [2]. In
addition, in case of signed numbers, the scaling has to be correctly performed. It is
first needed to determine the sign of the given number to be scaled. Note that the
negative of X is represented by M X in case of odd M. Hence, even negative
numbers are mapped to odd positive numbers. When the input number is positive,
output of the sign detection block is 0 and the output of the parity block is correct,
whereas when the input is negative, the output of the sign detection block is 1 and
the output of the parity block shall be negated. Thus, when the logic (X mod 2 = 0)
XOR (X > 0) is true, the operation (X + 1)/2 needs to be performed. The architecture is
presented in Figure 6.4. Cardarilli et al. [9] have applied these ideas to realize a
QRNS polyphase filter. It may be recalled that in Vu’s sign detection algorithm
[10], X is divided by M to yield Xs, with Sign(Xs) = 0 if 0 ≤ Xs < 1 and 1 if
1 ≤ Xs < 2.

The authors observe that sign detection can be converted to parity detection by
doubling the number X. If X is positive, 0 ≤ X ≤ (M − 1)/2, or 0 ≤ 2X ≤ M − 1; the
integer 2X is then within the dynamic range and hence (2X) mod 2 = 0. If X < 0,
(M + 1)/2 ≤ X ≤ M − 1, or M + 1 ≤ 2X ≤ 2M − 2; the integer is then outside the dynamic
range, which is indicated by (2X) mod 2 = 1.

[Figure 6.4: per-channel power-of-two scaler in which a multiplexer selects x_i or x_i + 1 before multiplication by |2^{-1}|_{m_i}; diagram not reproduced]
Note that for n levels of scaling, each level needs parity detection and 1-bit
scaling hardware shown in Figure 6.4. Cardarilli et al. [9] have considered several
techniques for base extension to modulus 2. These are Barsi and Pinotti method [6],
Shenoy and Kumaresan technique [2], the fractional CRT of Vu [10], and the Szabo and
Tanaka technique with two orderings of the moduli in the MRC (smallest to highest and
highest to smallest). They observe that the Shenoy and Kumaresan technique
[2] together with fractional CRT of Vu [10] for base extension to redundant
modulus (mr ¼ 5) needs minimum resources and exhibits low latency for scaling.
Kong and Phillips [11] have suggested another implementation applicable to any
scaling factor K which is mutually prime to all the moduli. In this technique, we
compute all the residues

y_i = | (x_i − |X|_K) |K^{-1}|_{m_i} |_{m_i}        (6.11)

We need to first know X mod K using base extension, so that X − (X mod K) is exactly
divisible by K. The architecture is shown in Figure 6.5. Note that the LUTs used for
base extension need N inputs, whereas the LUTs needed in the second stage need
two inputs. Kong and Phillips considered the Shenoy and Kumaresan [2] and Barsi and
Pinotti [6] techniques for base extension, the scaling method of Garcia and Lloris [5],
and the core function-based approach due to Burgess discussed in Chapter 5. For large
Figure 6.5 Scaling scheme of Kong and Phillips (Adapted from [11] ©IEEE 2009) [diagram not reproduced]
dynamic range RNS, they show that their technique outperforms the other methods in
latency as well as resources. In this method, note that the base extension computation
in the Barsi and Pinotti technique [6] is replaced by n LUTs and only one look-up
cycle.
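The Kong–Phillips computation (6.11) can be sketched as follows; X mod K is derived directly from X here for brevity, standing in for the base extension stage:

```python
m = [7, 11, 13]
K = 5                                   # scaling factor, coprime to all moduli

def scale_by_K(x_res, X_mod_K):
    """(6.11): y_i = |(x_i - |X|_K) |K^{-1}|_{m_i}|_{m_i} -- the residues of floor(X/K)."""
    return [(xi - X_mod_K) * pow(K, -1, mi) % mi for xi, mi in zip(x_res, m)]

X = 678
y = scale_by_K([X % mi for mi in m], X % K)
print(y == [(X // K) % mi for mi in m])   # -> True
```

Since X − (X mod K) is exactly divisible by K, the per-channel multiplication by |K^{-1}|_{m_i} produces the residues of ⌊X/K⌋ with no approximation.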
Chang and Low [12] have presented a scaler for the popular three-moduli set
{2^n − 1, 2^n, 2^n + 1}. They have derived three expressions for the residues of the
result Y obtained by scaling the given RNS number X by 2^n:

y_1 = | ⌊X/2^n⌋ |_{m_1} = |x_1 − x_2|_{m_1}        (6.12a)

y_2 = | ⌊X/2^n⌋ |_{m_2} = B mod m_2,
where B = | (2^{2n−1} + 2^{n−1}) x_1 − 2^n x_2 + (2^{2n−1} + 2^{n−1} − 1) x_3 |_{m_1 m_3}        (6.12b)

y_3 = | ⌊X/2^n⌋ |_{m_3} = |x_2 + 2^n x_3|_{m_3}        (6.12c)

[Figure 6.6: outputs ⌊X/2^n⌋ mod m_1, ⌊X/2^n⌋ mod m_2 and ⌊X/2^n⌋ mod m_3; diagram not reproduced]

The mixed-radix expressions (6.12a) and (6.12c) are obtained directly, while (6.12b) performs the
base extension step by computing in parallel the 2n-bit word B. Interestingly, the
n LSBs of B yield the residue mod 2^n. Thus, the scaling technique described in [12]
can be considered to have used both MRC and CRT in parallel, as shown in
Figure 6.6.
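The three expressions (6.12a)–(6.12c) can be checked exhaustively for a small n (a sketch with n = 3, i.e. the moduli set {7, 8, 9}):

```python
n = 3
m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1          # {7, 8, 9}
M = m1 * m2 * m3                               # 504

def scale_2n(x1, x2, x3):
    y1 = (x1 - x2) % m1                                        # (6.12a)
    B = ((2**(2*n - 1) + 2**(n - 1)) * x1 - 2**n * x2
         + (2**(2*n - 1) + 2**(n - 1) - 1) * x3) % (m1 * m3)   # 2n-bit word of (6.12b)
    y2 = B % 2**n                                              # n LSBs of B
    y3 = (x2 + 2**n * x3) % m3                                 # (6.12c)
    return y1, y2, y3

for X in range(M):
    Y = X // 2**n
    assert scale_2n(X % m1, X % m2, X % m3) == (Y % m1, Y % m2, Y % m3)
print("all", M, "cases verified")
```

Because ⌊X/2^n⌋ < 2^{2n} − 1 over the whole dynamic range, B is in fact ⌊X/2^n⌋ itself, which is why its n LSBs give the mod-2^n residue directly.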
Tay et al. [16] have described 2^n scaling of signed integers in the RNS {2^n − 1,
2^n, 2^n + 1}. We define the scaled number in RNS as Y = (ỹ_1, ỹ_2, ỹ_3) corresponding
to the given number X = (x_1, x_2, x_3). In the case of a positive number, these are the same
as y_1, y_2 and y_3 given in (6.12).

In the case of negative integers, the scaled ỹ_2 value needs to be modified as
ỹ_2 = y_2 + 1, while ỹ_1 and ỹ_3 remain the same. As an illustration, for the moduli set {7, 8, 9},
consider the negative number −104, represented as 400 = (1, 0, 4). After scaling by 8, considering
it as a positive integer, we have (1, 2, 5) corresponding to 50 = ⌊400/8⌋. The actual
answer, considering that it is a negative integer, is (1, 3, 5) = 491, i.e. −13 = ⌊−104/8⌋.
The implementation of the above idea needs detection of the sign of the number.
The authors show that the sign can be negative if bit (Y)_{2n−1} = 1, or if Y = 2^{2n−1} −
1 and x_{2,n−1} = 1, where x_2 is the residue corresponding to modulus 2^n. However,
detection of the condition Y = 2^{2n−1} − 1 needs a tree of AND gates. An
alternative solution is possible in which the detection of the negative sign is
possible under the three conditions: the (2n − 1)th bit of Y is zero, y_1 = 2^{n−1} − 1, and
y_{2,n−1} = 1. Thus, a control signal generation block detecting the three conditions
needs to be added to the unsigned 2^n scaler architecture. The output of this block
selects 0 or 1 to be added to ỹ_2. The resulting architecture is presented in Figure 6.7a.
Note that the first block implements (6.12a)–(6.12c) to obtain y_1, Y and y_3. The
second block modifies the result to take into account the sign of the given number and
yields the residues corresponding to the scaled number Ỹ.
The authors have also suggested integrating the two blocks so that the computation
time is reduced. Note that in Figure 6.7b, the n LSBs of Y, i.e. y_2, are
computed using the n LSBs of A and B (the sum and carry vectors of the 2n-bit CSA
with EAC) and the carry-in bit c_{2n−1} arriving from the n MSBs in the modified mod 2^n
adder block following Zimmermann [17], which has an additional prefix level. The
carry-in c_{2n−1} is generated by the control signal generation block using the n MSBs of A and
B, y_1, x_2, y_{2,(n−1)} and the G_{n−1} and P_{n−1} signals arriving from the modified mod 2^n
adder. The bit y_{2n−1} is added to y_2 in a simplified mod 2^n adder. Note that the last
AND gate array is needed to realize ỹ_2 = 0 when Y = 2^{2n−1} − 1 and Ỹ is in the
negative range, and ỹ_2 = |y_2 + y_{2n−1}|_{2^n} under other conditions.

Figure 6.7 (a) Scaler for signed numbers for the moduli set {2^n − 1, 2^n, 2^n + 1} and (b) simplification of (a) (Adapted from [16] ©IEEE 2013) [diagram not reproduced]
Ulman and Czyzak [18] have suggested a scaling technique which does not need
redundant moduli. This is based on CRT. They suggest first dividing the CRT
expansion by the desired divisor K to get integer values of the orthogonal
projections ⌊X_j/K⌋, where X_j = M_j |M_j^{-1} x_j|_{m_j} and M_j = M/m_j. The value is
estimated in one channel. The resulting error e_j is smaller than 1/2n, where n is the
number of moduli. The resulting total error is denoted as

ε = Σ_{j=1}^{n} ε_j        (6.13)

The scaled value is

X/K = Σ_{j=1}^{n} X_j/K − rM/K        (6.14)

where the rank r is

r = ⌊ Σ_{j=1}^{n} X_j / M ⌋        (6.15)

The word length needed is 2⌈log_2 m⌉ + ⌈log_2 n⌉ bits. Next, b_j, ε_j and |⌊X_j/K⌋|_{m_k} are summed
mod m_j for j = 1, 2, . . ., n. The rest of the hardware uses multi-operand modulo adders and
binary adders.
Lu and Chiang [19] have described a subtractive division algorithm to find
Z = X/Y which uses parity checking for sign and overflow detection. It uses binary
search for deciding the quotient, in five parts. In part I, the signs of the dividend and
divisor are determined and the operands are converted into positive numbers. In
part II, it finds a k such that 2^k Y ≤ X ≤ 2^{k+1} Y. In part III, the difference between 2^k
and the quotient is found in k steps, since 2^k integers lie in the range between 2^k and 2^{k+1}.
Note that each step in part III needs several RNS additions, subtractions, one RNS
multiplication, a table look-up for finding the parity of S_i (see later), a table look-up for
sign detection, multi-operand binary addition and exclusive-OR operations. Part IV
is used in case 2^k Y ≤ X ≤ (M − 1)/2 ≤ 2^{k+1} Y, where M is the dynamic range of the
RNS. In part V, the quotient is converted to proper RNS form taking into account
the sign extracted in part I. In this technique, 2 log_2 Z steps are needed in total.
The parity of an RNS number can be found by using CRT in the case of all moduli
being odd. Recall that

X = Σ_{j=1}^{n} M_j |M_j^{-1} x_j|_{m_j} − rM        (6.16)

Since M and the M_j are odd, the parity of X depends only on the |M_j^{-1} x_j|_{m_j} and r. Thus, we
define the parity as

P = LSB(|M_1^{-1} x_1|_{m_1}) ⊕ LSB(|M_2^{-1} x_2|_{m_2}) ⊕ · · · ⊕ LSB(|M_n^{-1} x_n|_{m_n}) ⊕ LSB(r)        (6.17)
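The parity relation (6.16)–(6.17) is easy to verify for an all-odd moduli set; the rank r is computed from the CRT sum here, whereas in practice it would come from base extension:

```python
m = [3, 5, 7]
M = 3 * 5 * 7

def parity(x_res):
    """(6.17): P = LSB(|M_1^{-1}x_1|_{m_1}) ^ ... ^ LSB(r), all moduli odd."""
    terms = [pow(M // mi, -1, mi) * xi % mi for mi, xi in zip(m, x_res)]
    s = sum((M // mi) * t for mi, t in zip(m, terms))
    r = (s - s % M) // M          # rank: s = X + rM with X = s mod M
    return (sum(t & 1 for t in terms) + (r & 1)) & 1

for X in range(M):
    assert parity([X % mi for mi in m]) == X & 1
print("parity verified for all", M, "values")
```

Because every M_j and M itself are odd, each product in (6.16) has the parity of its |M_j^{-1} x_j|_{m_j} factor, which is what the XOR of LSBs exploits.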
Aichholzer and Hassler [20] have introduced an idea called relaxed residue
computation (RRC) which facilitates modulo reduction as well as scaling. The
reduction mod L can be performed where L is arbitrary such that gcd(L, M) = 1 and
M is the dynamic range of the RNS. Note that L is large compared to all moduli and
typically log_2 L = (1/2) log_2 M. All large constants in the CRT expression are first
reduced mod L:

x̃ = Σ_i (M_i)_L |M_i^{-1} x_i|_{m_i} + r (−M)_L        (6.18)

Note that the Shenoy and Kumaresan technique [2] described earlier, employing a
redundant modulus, needs to be used to estimate r. Note that we do not obtain the
residue x* < L but some number x̃ ≡ x* mod L, which can be larger than L in
general. This technique is in fact a parallel algorithm to compute an L-residue in
RNS.
Example 6.4 As an illustration, consider the moduli set {11, 13, 15, 17, 19, 23} and
the redundant modulus m_x = 7. The dynamic range of the chosen RNS is
M = 15,935,205. We consider an example corresponding to the input binary number
X = 1,032,192, which in the chosen RNS is (7, 5, 12, 3, 17, 21/0), and wish to divide
it by 2^11 = 2048. (Note that /x indicates that the residue corresponding to the redundant
modulus is x.) We first find the M_i and |M_i^{-1}|_{m_i} as {1448655, 1225785, 1062347,
937365, 838695, 692835} and (10, 7, 8, 9, 6, 4). Next, we can find the |M_i^{-1} x_i|_{m_i} as
(4, 9, 6, 10, 7, 15), and we can compute the rank r_x using the redundant modulus 7 as 3.
Considering L = 2^11 for RRC, we obtain (M_i)_L = (719, 1081, 1483, 1429, 1063, 611)
and (−M)_L = 283. Using r_x = 3, already obtained, the residue corresponding to 2^11
can be obtained from (6.18) as x̃ = (719·4 + 1081·9 + 1483·6 + 1429·10
+ 1063·7 + 611·15) + 3·283 = 53,248. We can compute this in RNS itself.
As an illustration for modulus 11, we have x̃ = (4·4 + 3·9 + 9·6 + 10·10
+ 7·7 + 6·15) + 3·283 = 1185 ≡ 8 mod 11. Thus, the residue mod 2^11 is
(8, 0, 13, 4, 10, 3/6). (Note that the residues of the given number and of this number
mod 2048 are the same, but what we obtained is not the actual residue mod 2048.)
Subtracting this value from the given residues of X gives (10, 5, 14, 16, 7, 18/1), which
corresponds to 978,944. We next multiply by 2^{−11} to remove the effect of the
scaling done in the beginning. This requires multiplying with the multiplicative
inverse of 2048, which is (6, 2, 2, 15, 14, 1/2). The result corresponds to 478, as against the actual
value 504. Thus, there can be error in the scaled result. ■
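The arithmetic of Example 6.4 can be replayed step by step; the rank r_x is recovered here from the CRT sum rather than from the redundant channel:

```python
m = [11, 13, 15, 17, 19, 23]
M = 15935205
L, X = 2048, 1032192

x = [X % mi for mi in m]
a = [pow(M // mi, -1, mi) * xi % mi for mi, xi in zip(m, x)]
print(a)                                    # -> [4, 9, 6, 10, 7, 15]
s = sum((M // mi) * ai for mi, ai in zip(m, a))
rx = (s - X) // M                           # rank; from the redundant channel in practice
print(rx)                                   # -> 3
MiL = [(M // mi) % L for mi in m]
print(MiL)                                  # -> [719, 1081, 1483, 1429, 1063, 611]
xt = sum(c * ai for c, ai in zip(MiL, a)) + rx * ((-M) % L)   # (6.18)
print(xt)                                   # -> 53248
q = (X - xt) // L                           # in RNS: subtract, then multiply by |2^-11|
print(q, X // L)                            # -> 478 504
```

The final line shows the error of the relaxed computation: 478 against the exact quotient 504.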
This technique can be used for RSA encryption as well, where m^e mod L needs to
be obtained.
Hung and Parhami [21] suggested a sign estimation procedure which indicates, in
log_2 n steps, whether a residue number is positive, negative, or too small in
magnitude to tell. This is based on CRT and uses a parameter α > 1 to specify the
input range and output precision. The input number shall be within −(1/2 − 2^{−α})M and
(1/2 − 2^{−α})M. When the output ES(X) (i.e. the sign of X) is indeterminate, X is guaranteed
to be in the range {−2^{−α}M, 2^{−α}M}. We compute EF_α(X) = Σ_{i=1}^{n} EF_α^{(i)}(x_i), where
each term

EF_α^{(i)}(j) = | j |M_i^{-1}|_{m_i} / m_i |_1

is truncated to the βth fractional bit, where β = α + ⌈log_2 n⌉. Note that EF_α(X) is an estimate of
the fractional value F(X) = |X/M|_1 ∈ (0, 1) and contains both the magnitude and sign
information. If 0 ≤ EF_α(X) < 1/2, then ES_α(X) = +; if 1/2 ≤ EF_α(X) ≤ 1 − 2^{−α}, then
ES_α(X) = − and X < 0; otherwise ES_α(X) is indeterminate and −2^{−α}M ≤ X ≤ 2^{−α}M. In
the case of an indeterminate result, MRC can be carried out to determine the sign.
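A fixed-point sketch of this sign estimation, with assumed moduli {7, 11, 13} and α = 4 (so β = 6, since ⌈log_2 3⌉ = 2):

```python
m = [7, 11, 13]
M = 7 * 11 * 13
alpha = 4
beta = alpha + 2                       # alpha + ceil(log2 n), n = 3

inv = [pow(M // mi, -1, mi) for mi in m]

def estimate_sign(X):
    # EF_alpha(X): each term |x_i |M_i^{-1}|_{m_i} / m_i|_1 truncated to beta bits
    s = sum(((X % mi) * iv % mi << beta) // mi for mi, iv in zip(m, inv))
    s %= 1 << beta                     # estimate of the CRT fraction X/M, in units of 2^-beta
    if s < 1 << (beta - 1):
        return '+'
    if s <= (1 << beta) - (1 << (beta - alpha)):
        return '-'
    return '?'

print(estimate_sign(200), estimate_sign(M - 200))   # -> + -
```

Numbers of magnitude below 2^{−α}M can land in the '?' band and would need MRC to resolve, as the text notes.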
Huang and Parhami [22] have suggested algorithms for division by fixed divisors.
Consider the divisor D and dividend X. First compute C = ⌊M/D⌋ and choose
k such that 1 ≤ k ≤ n and M[1, k − 1] ≤ D ≤ M[1, k], where n is the number of moduli and
M[a, b] = Π_{i=a}^{b} m_i. We evaluate X′ = ⌊X/M[k, n]⌋ and next Q = ⌊X′C/M[1, k − 1]⌋.
Next we compute X″ = X − QD. By using general division, we get Q′ and R such
that X″ = Q′D + R. The result is Q″ = Q + Q′ and the remainder is R. One example
will be illustrative.
Example 6.5 Consider the moduli set {3, 5, 7, 11} and the division of X = 503
by D = 13. Since M = 1155 and D = 13, we have C = ⌊1155/13⌋ = 88. Since
M[1, 1] = 3 ≤ 13 ≤ M[1, 2] = 15, we have k = 2, and M[2, 4] = 385, so X′ = ⌊503/385⌋ = 1.
Then we have Q = ⌊(1·88)/3⌋ = 29. It follows that
X″ = 503 − 29·13 = 126. We next write X″ = 126 = 9·13 + 9, i.e. Q′ = 9 and
R = 9. The quotient hence is 29 + 9 = 38 and the remainder 9. ■
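Example 6.5 can be traced with a direct transcription of the steps; plain integer arithmetic stands in for the RNS operations here:

```python
def divide_fixed(X, D, m):
    """Huang-Parhami style division by a fixed divisor D (integer sketch)."""
    M = 1
    for mi in m:
        M *= mi
    C = M // D
    # choose k with M[1, k-1] <= D <= M[1, k]
    k, prefix = 1, m[0]
    while prefix < D:
        k += 1
        prefix *= m[k - 1]
    M1k1 = prefix // m[k - 1]          # M[1, k-1]
    Mkn = M // M1k1                    # M[k, n]
    Xp = X // Mkn                      # X'
    Q = Xp * C // M1k1                 # coarse quotient
    X2 = X - Q * D                     # X'', a small residual
    Qp, R = X2 // D, X2 % D            # "general division" cleanup step
    return Q + Qp, R

print(divide_fixed(503, 13, [3, 5, 7, 11]))   # -> (38, 9)
```

The coarse quotient Q underestimates X/D because of the floors, so the residual X″ stays non-negative and small, and one cleanup division finishes the job.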
In an alternative technique, CRT is used. The CRT expansion is reduced mod
D to first obtain Y:

Y = Σ_{i=1}^{n} |α_i x_i|_{m_i} Z_i + B(X)(D − Z)        (6.19)

Since the scaling result Y′ is positive, it needs to be mapped into the negative range of
the RNS as

y_i = | (x_i − (X)_K + (M)_K) |K^{-1}|_{m_i} |_{m_i}        (6.22b)
Comparing with (6.20), we note that the additional term (M)_K comes into the picture in
the case of negative numbers. If the scaling factor is 2^n, the above result changes to

y_i = | (x_i − (X)_{2^n}) |2^{−n}|_{m_i} |_{m_i}   for X > 0        (6.23a)

and

y_i = | (x_i − (X)_{2^n} + (M)_{2^n}) |2^{−n}|_{m_i} |_{m_i}   for X < 0.        (6.23b)

Thus, either of (6.23a) or (6.23b) can be selected using a MUX. Note that a sign
detector is needed in the earlier technique and base extension is needed to find (X)_K.

It may be noted that scaling is possible using the core function [25, 26]. The
techniques described in Chapter 5 Section 5.6 can be used for scaling by the arbitrary
number C(M)/M. From (5.54b) recall that
(C(M)/M) n = C(n) + Σ_{i=1}^{k} (w_i/m_i) |n|_{m_i}        (6.24)
Burgess [26] has suggested scaling of an RNS number using the core function within
the RNS. It is required to compute (6.24) in the RNS. This can be achieved by
splitting the moduli into two subsets and finding the cores C_{M_J}(n) and
C_{M_K}(n), where M_J and M_K are the products of the moduli in the two sets and
M = M_J M_K. The cores can be calculated efficiently since the terms in (6.24)
corresponding to M_J are zero for computing C_{M_J}(n), and those corresponding to M_K
are zero for computing C_{M_K}(n). Next, we can estimate the difference in the cores,
ΔC(n) mod ΔC(M), as follows:

ΔC(n) = ( Σ_i n_i C_J(B_i) − R(n) C_J(M) ) − ( Σ_i n_i C_K(B_i) − R(n) C_K(M) )
      = Σ_i n_i ΔC(B_i) − R(n) ΔC(M)        (6.25)

We can add this value to the C_{M_K}(n) values to obtain the residues corresponding to the
other moduli. An example will be illustrative.
Example 6.6 Consider the moduli set {7, 11, 13, 17, 19, 23}. We consider the two
groups M_J = 7·17·23 = 2737 and M_K = 11·13·19 = 2717. Note that
M_J M_K = M. Thus ΔC(M) = 2737 − 2717 = 20. The values M_i and |M_i^{-1}|_{m_i} are {1062347,
676039, 572033, 437437, 391391, 323323} and {6, 1, 2, 12, 2, 2}, respectively.
The weights for C_J(M) = 2737 are (0, 2, 1, 0, 2, 0) and for C_K(M) = 2717 are
(1, 0, 0, 2, 0, 6). The two sets of C(B_i) can be derived next as C_J(B_i) = (2346,
249, 421, 1932, 288, 238) and C_K(B_i) = (2329, 247, 418, 1918, 286, 236). Finally,
we have ΔC(B_i) = (17, 2, 3, 14, 2, 2).

The given number n = 1859107 = (5, 8, 3, 4, 14, 17) is to be approximately
scaled by 2717 to yield 684. The complete calculation in RNS is as follows:

ΔC(n) mod 20 = (5·17 + 8·2 + 3·3 + 4·14 + 14·2 + 17·2) mod 20 = 8.

Next, ΔC(n) is added to the C_K(n) values to obtain the remaining scaled residues.
Pirlo and Impedovo [35] have described a monotone function which can facilitate
magnitude comparison and sign detection. In this technique, the function
calculated is

F_I(X) = Σ_{i∈I} ⌊X/m_i⌋        (6.29)

where I ⊆ {1, 2, . . ., N} for an RNS with N moduli. Note that the number of terms
added in (6.29) can be chosen freely. As an illustration, for a four-moduli RNS {m_1, m_2,
m_3, m_4}, we can choose

F_I(X) = ⌊X/m_2⌋ + ⌊X/m_4⌋        (6.30)

Evidently, the values of ⌊X/m_2⌋ and ⌊X/m_4⌋ can be calculated by using the CRT expansion
and dividing by m_2 and m_4, respectively, and approximating the multipliers of
the various residues x_1, x_2, x_3 and x_4 by truncation or rounding. However, these can
also be calculated by defining parameters M_I and S_INV first as follows:

M_I = Σ_{i∈I} M_i   and   S_INV = Σ_{i∈I} 1/m_i = M_I/M        (6.31)

F_I(X) = | Σ_{i=1}^{N} b_i x_i |_{M_I}        (6.32)

where, for every channel i,

b_i = |M_i^{-1}|_{m_i} M_i S_INV        (6.33)
For the moduli set {37, 41, 43, 64} with I = {2, 4} (i.e. m_2 = 41 and m_4 = 64),

b_1 = (2·41·43·64)/41 + (2·41·43·64)/64 = 9030

and similarly,

b_2 = (2·37·43·64)/41 + (2·37·43·64)/64 = 8149.0243...,
b_3 = (7·37·41·64)/41 + (7·37·41·64)/64 = 27195,
b_4 = (47·37·41·43)/41 + (47·37·41·43)/64 = 122681.015625.

Evidently, b_2 and b_4 are not integers and have to be approximated, leading to error in the computed value.
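The monotone function for this moduli set can be evaluated exactly with rational arithmetic (the fractional residues x_j/m_j, j ∈ I, are subtracted so that the floors come out exactly; a truncated-b_i implementation would incur the error just mentioned):

```python
from fractions import Fraction

m = [37, 41, 43, 64]
I = [1, 3]                       # m2 = 41 and m4 = 64 (0-based indices)
M = 37 * 41 * 43 * 64
Mi = [M // mi for mi in m]
MI = sum(Mi[i] for i in I)       # M_I of (6.31)
S_INV = sum(Fraction(1, m[i]) for i in I)
b = [pow(Mi_i, -1, mi) * Mi_i * S_INV for Mi_i, mi in zip(Mi, m)]   # (6.33)

def F(x_res):
    s = sum(bi * xi for bi, xi in zip(b, x_res))
    s -= sum(Fraction(x_res[i], m[i]) for i in I)   # remove the fractional parts
    return int(s % MI)                              # (6.32)

X = 123456
print(F([X % mi for mi in m]), X // 41 + X // 64)   # -> 4940 4940
```

Since F is monotone in X, comparing two RNS numbers reduces to comparing their F values (with ties needing further resolution).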
6.3 Sign Detection
Sign detection is equally complicated, since this involves comparison once again.
A straightforward technique is to perform RNS to binary conversion, compare the result
with M/2 (where M is the dynamic range) and declare the sign. However, simpler
techniques for special cases have been considered in the literature. Recall from
Chapter 5 Section 5.1 that Vu's method [10] of RNS to binary conversion based
on scaled CRT is suitable for sign detection.
Ulman [36] has suggested a technique for sign detection for moduli sets having
one even modulus. This is based on Mixed Radix Conversion. Considering the
moduli set {m_1, m_2, m_3, . . ., m_n} where m_n is even, we can consider another moduli
set having m_n/2 in place of m_n. Denoting the original dynamic range as
M = m_1 m_2 m_3 · · · m_n, the sign function corresponding to a binary number is defined
using a parameter k. With a_i denoting the mixed radix digits, since all moduli are
prime except m_n, we have

|Z|_2 = | |a_0|_2 + |a_1|_2 + |a_2|_2 + · · · + |a_p|_2 |_2        (6.36)

Hence, the LSBs of the MRC digits can be added mod 2 and the result compared with the
LSB of z_n to determine the sign. An architecture is presented for n = 5, m_n even, in
Figure 6.8.
Tomczak [37] has suggested a sign detection algorithm for the moduli set {m_1,
m_2, m_3} = {2^n − 1, 2^n, 2^n + 1}. It is noted that the MSB of the decoded word gives
the sign (negative if the MSB is 1 and positive if the MSB is zero) in all cases except for the
numbers between 2^{3n−1} − 2^{n−1} and 2^{3n−1}. One simple method would be to first
perform RNS to binary conversion and declare the sign as negative when the MSB is
1, except when all 2n MSBs are "1". Tomczak observes that the 2n MSBs can be
obtained using the well-known RNS to binary conversion due to Wang [38] as
Figure 6.8 Architecture for sign detection due to Ulman (Adapted from [36] ©IEEE 1983) [diagram not reproduced]
t = t_0 + |C| 2^n        (6.38a)

where

t_0 = (x_3 − x_2 + Y)        (6.38b)

It can be shown that C is not needed in the sign-determining function, given as

sgn(x_1, x_2, x_3) = ⌊ |t_0|_{2^n} / 2^{n−1} ⌋        (6.39)
As an illustration for the RNS {15, 16, 17}, consider the following numbers and
their detected signs:

2042 = (2, 10, 2); Y = 0, t_0 = −8 (11000), negative
3010 = (10, 2, 1); Y = 12, t_0 = 11 (01011), negative
1111 = (1, 7, 6); Y = 5, t_0 = 4 (00100), positive
Tomczak suggests implementation of (6.39) as

t_0 = −x_2 + |2^{n−1} x_1|_{2^n − 1} + 2^{n−1}(x_{3,0} + x_{3,n}) + ⌊x*_3/2⌋ + x_{3,0} + W        (6.40)

d_1 = x_1,   d_2 = |(x_2 − x_1) 2^{n−1}|_{2^n − 1},
d_3 = | (x_1 − x_3) + d_{2,k−1:0} 2^n + d_2 |_{2^{n+k}}        (6.42)

where X = d_1 + d_2(2^n + 1) + d_3(2^{2n} − 1). The sign is the MSB of d_3. Thus,
(x_1 − x_3) needs to be added to the (n + k)-bit word formed by d_{2,k−1:0}
concatenated with d_2. Using a carry-look-ahead scheme, the sign bit of d_3 can be
obtained.
Xu et al. [40] have considered sign detection for the three-moduli set {2^{n+1} − 1,
2^n − 1, 2^n}. They suggest using the Mixed Radix digit obtained by the Mixed-Radix
CRT of Bi and Gross [29] described in Chapter 5 Section 5.3. It can be shown that
the highest Mixed Radix digit is given by

d_3 = | 2x_1 + x_2 + x_3 + ⌊(x_2 − x_1)/(2^n − 1)⌋ |_{2^n}        (6.43a)

The sign detection algorithm uses the MSB of d_3. Note that (6.43a) can be
rewritten as

d_3 = | x″_1 + x_2 + x_3 + W |_{2^n}        (6.43b)

where x″_1 is the n-bit word equaling 2x_{1,n−2:0} + x_{1,n}. A CSA can be used to find
the sum of the first three terms, and W is estimated using a comparator comparing x_2
and x_1 and glue logic. Note that the sum of the CSA output vectors and W need to be
added in an adder having only carry computation logic.
Bakalis and Vergos [41] have described shifter circuits for the moduli set
{2^n − 1, 2^n, 2^n + 1}. While the shifting operation (multiplication by 2^t mod m_i) for
the moduli 2^n − 1 and 2^n is straightforward, by left circular shift and left shift, respectively,
the shifter for the modulus (2^n + 1) can be realized for the diminished-1 representation.
Denoting A as the residue in this modulus channel, the rotated word is given as
References
1. N.S. Szabo, R.I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology
(McGraw-Hill, New York, 1967)
2. A.P. Shenoy, R. Kumaresan, Fast base extension using a redundant modulus in RNS. IEEE
Trans. Comput. 38, 293–297 (1989)
3. A.P. Shenoy, R. Kumaresan, A fast and accurate scaling technique for high-speed signal
processing. IEEE Trans. Acoust. Speech Signal Process. 37, 929–937 (1989)
4. G.A. Jullien, Residue number scaling and other operations using ROM arrays. IEEE Trans.
Comput. 27(4), 325–337 (1978)
5. A. Garcia, A. Lloris, A look up scheme for scaling in the RNS. IEEE Trans. Comput. 48,
748–751 (1999)
6. F. Barsi, M.C. Pinotti, Fast base extension and precise scaling in RNS for look-up table
implementation. IEEE Trans. Signal Process. 43, 2427–2430 (1995)
7. M. Griffin, M. Sousa, F. Taylor, Efficient scaling in the residue number System, in Pro-
ceedings of IEEE ASSP, pp. 1075–1078 (May 1989)
8. U. Meyer-Base, T. Stouraitis, New power-of-2 RNS scaling scheme for cell-based IC design.
IEEE Trans. Very Large Scale Integr. VLSI Syst. 11, 280–283 (2003)
9. G.C. Cardarilli, A. Del Re, A. Nannarelli, M. Re, Programmable power-of-two RNS scaler and
its application to a QRNS polyphase filter, in Proceedings of 2005. IEEE International
Symposium on Circuits and Systems, vol. 2, pp 1102–1105 (May 2005)
10. T.V. Vu, Efficient implementations of the Chinese remainder theorem for sign detection and
residue decoding. IEEE Trans. Comput. 34, 646–651 (1985)
11. Y. Kong, B. Phillips, Fast scaling in the Residue Number System, IEEE Trans. Very Large
Scale Integr. VLSI Syst. 17, 443–447 (2009)
12. C.H. Chang, J.Y.S. Low, Simple, fast and exact RNS scaler for the three moduli set {2^n − 1, 2^n, 2^n + 1}. IEEE Trans. Circuits Syst. I Reg. Pap. 58, 2686–2697 (2011)
13. S. Andraros, H. Ahmad, A new efficient memory-less residue to binary converter. IEEE Trans.
Circuits Syst. 35, 1441–1444 (1988)
14. S.J. Piestrak, A high-speed realization of residue to binary system conversion. IEEE Trans.
Circuits Syst. II 42, 661–663 (1995)
15. A. Dhurkadas, Comments on “A High-speed realisation of a residue to binary Number system
converter”. IEEE Trans. Circuits Syst. II 45, 446–447 (1998)
16. T.F. Tay, C.H. Chang, J.Y.S. Low, Efficient VLSI implementation of 2^n scaling of signed integers in RNS {2^n − 1, 2^n, 2^n + 1}. IEEE Trans. Very Large Scale Integr. VLSI Syst. 21, 1936–1940 (2012)
17. R. Zimmermann, Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication, in Proceedings of IEEE Symposium on Computer Arithmetic, pp. 158–167 (1999)
18. Z.D. Ulman, M. Czyzak, Highly parallel, fast scaling of numbers in nonredundant residue
arithmetic. IEEE Trans. Signal Process. 46, 487–496 (1998)
19. M. Lu, J.S. Chiang, A novel division algorithm for the Residue Number System. IEEE Trans.
Comput. 41(8), 1026–1032 (1992)
20. O. Aichholzer, H. Hassler, Fast method for modulus reduction and scaling in residue number
system, in Proceedings of EPP, Vienna, Austria, pp. 41–53 (1993)
21. C.Y. Hung, B. Parhami, Fast RNS division algorithms for fixed divisors with application to
RSA encryption. Inf. Process. Lett. 51, 163–169 (1994)
22. C.Y. Hung, B. Parhami, An approximate sign detection method for residue numbers and its
application to RNS division. Comput. Math. Appl. 27, 23–35 (1994)
23. A.A. Hiasat, H.S. Abdel-Aty-Zohdy, A high-speed division algorithm for residue number
system, in Proceedings of IEEE ISCAS, pp. 1996–1999 (1995)
24. M.A. Shang, H.U. JianHao, Y.E. YanLong, Z. Lin, L. Xiang, A 2n scaling technique for signed
RNS integers and its VLSI implementation. Sci. China Inf. Sci. 53, 203–212 (2010)
25. N. Burgess, Scaled and unscaled residue number systems to binary conversion techniques
using the core function, in Proceedings of 13th IEEE Symposium on Computer Arithmetic,
pp. 250–257 (1997)
26. N. Burgess, Scaling a RNS number using the core function, in Proceedings of 16th IEEE
Symposium on Computer Arithmetic, pp. 262–269 (2003)
27. B. Vinnakota, V.V.B. Rao, Fast conversion techniques for Binary to RNS. IEEE Trans.
Circuits Syst. I 41, 927–929 (1994)
28. P.V. Ananda Mohan, Evaluation of fast conversion techniques for Binary-Residue Number
Systems. IEEE Trans. Circuits Syst. I 45, 1107–1109 (1998)
29. S. Bi, W.J. Gross, The Mixed-Radix Chinese Remainder theorem and its applications to
residue comparison. IEEE Trans. Comput. 57, 1624–1632 (2008)
30. G. Dimauro, S. Impedovo, G. Pirlo, A new technique for fast number comparison in the residue
number system. IEEE Trans. Comput. 42, 608–612 (1993)
31. S.T. Elvazi, M. Hosseinzadeh, O. Mirmotahari, Fully parallel comparator for the moduli set {2^n, 2^n − 1, 2^n + 1}. IEICE Electron. Express 8, 897–901 (2011)
32. A. Skavantzos, Y. Wang, New efficient RNS-to-weighted decoders for conjugate pair moduli
residue number systems, in Proceedings of 33rd Asilomar Conference on Signals, Systems and
Computers, vol. 2, pp. 1345–1350 (1999)
33. Y. Wang, New Chinese remainder theorems, in Proceedings of 32nd Asilomar Conference on
Signals, Systems and Computers, pp. 165–171 (1998)
34. L. Sousa, Efficient method for comparison in RNS based on two pairs of conjugate moduli, in
Proceedings of 18th IEEE Symposium on Computer Arithmetic, pp. 240–250 (2007)
35. G. Pirlo, D. Impedovo, A new class of monotone functions of the Residue Number System. Int.
J. Math. Models Methods Appl. Sci. 7, 802–809 (2013)
36. Z.D. Ulman, Sign detection and implicit explicit conversion of numbers in residue arithmetic.
IEEE Trans. Comput. 32, 5890–5894 (1983)
37. T. Tomczak, Fast sign detection for RNS (2^n − 1, 2^n, 2^n + 1). IEEE Trans. Circuits Syst. I Reg. Pap. 55, 1502–1511 (2008)
38. Y. Wang, Residue to binary converters based on New Chinese Remainder theorems. IEEE
Trans. Circuits Syst. II 47, 197–205 (2000)
39. L. Sousa, P. Martins, Efficient sign detection engines for integers represented in RNS extended 3-moduli set {2^n − 1, 2^n + 1, 2^(n+k)}. Electron. Lett. 50, 1138–1139 (2014)
40. M. Xu, Z. Bian, R. Yao, Fast sign detection algorithm for the RNS moduli set {2^(n+1) − 1, 2^n − 1, 2^n}. IEEE Trans. VLSI Syst. 23, 379–383 (2014)
41. D. Bakalis, H.T. Vergos, Shifter circuits for {2^n + 1, 2^n, 2^n − 1} RNS. Electron. Lett. 45, 27–29 (2009)
Further Reading
G. Alia, E. Martinelli, Sign detection in residue Arithmetic units. J. Syst. Archit. 45, 251–258
(1998)
D.K. Banerji, J.A. Brzozowski, Sign detection in residue number systems. IEEE Trans. Comput. C-18, 313–320 (1969)
A. Garcia, A. Lloris, RNS scaling based on pipelined multipliers for prime moduli, in IEEE
Workshop on Signal Processing Systems (SIPS 98), Piscataway, NJ, pp. 459–468 (1998)
E. Gholami, R. Farshidi, M. Hosseinzadeh, K. Navi, High speed residue number comparison for the moduli set {2^n, 2^n − 1, 2^n + 1}. J. Commun. Comput. 6, 40–46 (2009)
Chapter 7
Error Detection, Correction and Fault
Tolerance in RNS-Based Designs
In this chapter, we consider the topic of error detection and error correction in residue number systems using redundant (additional) moduli. RNS has the unique advantage of modularity: once faulty units are identified, whether in the original moduli hardware or in the redundant moduli hardware, they can be isolated. Triple modular redundancy, well known in conventional binary arithmetic hardware, can also be used in RNS and will be briefly considered.
Error detection and correction in RNS has been considered by several authors [1–20]. Single error detection needs one extra modulus, and single error correction needs two extra moduli. Consider the four-moduli set {3, 5, 7, 11}, where 11 is the redundant modulus, and suppose the residues corresponding to the original number 52 = {1, 2, 3, 8} have been modified to {1, 2, 4, 8} due to an error in the residue corresponding to the modulus 7.
Barsi and Maestrini [4, 5] proposed the idea of modulus projections to correct single residue digit errors. Since the given number needs to be less than 105, any projection larger than 105 indicates that an error has occurred. A projection is obtained by ignoring one or more moduli, thus considering a smaller RNS:

{1, 2, 4} = 67 for moduli set {3, 5, 7}
{1, 2, 8} = 52 for moduli set {3, 5, 11}
{1, 4, 8} = 151 for moduli set {3, 7, 11}
{2, 4, 8} = 382 for moduli set {5, 7, 11}

Since the last two projections are larger than 105, it is evident that the error has occurred in the residue corresponding to modulus 7 or 11. If we use an additional modulus, the exact one among these can be found.
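The projection test above can be sketched in a few lines of Python (helper names ours; `pow(x, -1, m)` computes a modular inverse, Python 3.8+):

```python
from functools import reduce

def crt(residues, moduli):
    # Chinese Remainder Theorem reconstruction of X from its residues
    M = reduce(lambda a, b: a * b, moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

moduli = [3, 5, 7, 11]        # 11 is the redundant modulus
M_legit = 3 * 5 * 7           # legitimate range 0..104
received = [1, 2, 4, 8]       # residues of 52 with the mod-7 digit corrupted

# Drop one modulus at a time; the faulty residue lies among the moduli
# whose exclusion yields an in-range projection.
consistent = []
for k in range(len(moduli)):
    mods = moduli[:k] + moduli[k + 1:]
    res = received[:k] + received[k + 1:]
    if crt(res, mods) < M_legit:
        consistent.append(moduli[k])
# consistent == [7, 11]: the error is in the channel of 7 or of 11
```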
Szabo and Tanaka [1] have suggested an exhaustive testing procedure to detect and correct the error, which needs two extra moduli. It needs Σ_{i=1}^{K} (m_i − 1) tests, where K is the number of moduli. This method is based on the observation that an error of "1" in any residue corresponding to modulus m_i causes a multiple of |1/M_i|_{m_i}·M_i to be added, where M_i = (m_r1·m_r2·M)/m_i and m_r1 and m_r2 are the redundant moduli. Hence, we need to find which multiple yields the correct number within the dynamic range. This needs to be carried out for all the moduli.
As an example, consider the moduli set {3, 5, 7, 11, 13}. Let us assume that the residues of 52 = {1, 2, 3, 8, 0} got changed to {1, 2, 6, 8, 0}. We can first find the number corresponding to {1, 2, 6, 8, 0} = 2197, which is outside the dynamic range of 105. Hence, by adding |1/M_3|_{m_3}·M_3 = 5 × (3 × 5 × 11 × 13) = 10725 = {0, 0, 1, 0, 0}, corresponding to an error of 1 in modulus 7, a number of times (in this case, six times), we get different decoded numbers which modulo 15015 (the full dynamic range of the system including redundant moduli) are 2197, 12922, 8632, 4342, 52, 10777, 6487. Evidently, 52 is the correct answer.
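This search can be sketched in Python (helper names ours), reproducing the worked example:

```python
from functools import reduce

moduli = [3, 5, 7, 11, 13]                    # 11 and 13 are redundant
M_total = reduce(lambda a, b: a * b, moduli)  # 15015
M_legit = 3 * 5 * 7                           # 105

def crt(residues):
    return sum(r * (M_total // m) * pow(M_total // m, -1, m)
               for r, m in zip(residues, moduli)) % M_total

def correct_channel(X, i):
    # An error of 1 in channel i shifts X by |1/M_i|_{m_i} * M_i mod M_total;
    # step through the m_i - 1 possible multiples looking for an in-range value.
    m = moduli[i]
    Mi = M_total // m
    step = (pow(Mi, -1, m) * Mi) % M_total
    for _ in range(m - 1):
        X = (X + step) % M_total
        if X < M_legit:
            return X
    return None

X = crt([1, 2, 6, 8, 0])                      # 2197: out of range, error detected
candidates = [correct_channel(X, i) for i in range(5)]
# only channel 2 (modulus 7) yields an in-range value: 52
```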
Mandelbaum [3] has proved that two additional residues can correct single-digit residue errors. He observed that the high-order bits of the decoded word are non-zero in the presence of errors. Let the product of all the moduli (including the redundant ones) be denoted as M_r. Denote the word formed by the j MSBs of the decoded word F as Q_1 = T(F·2^j/M_r), where T(·) stands for truncation, and let Q_2 = Q_1 + 1. Look-up tables can be used to obtain the values of B and m_k, where m_k is the modulus for which the error is to be tested and B is the error multiple to be determined. The criterion for selecting B and m_k is that Q_1·m_k/2^j and Q_2·m_k/2^j agree with the integer B to the maximum number of decimal places, i.e., these two fractions bracket B. The last step is to obtain X from F as X = F − B·M_r/m_k. Mandelbaum's procedure is based on the binary representation of numbers and is hence not convenient in RNS.
Consider the moduli set {7, 9, 11, 13, 16, 17} with a dynamic range of 2,450,448, where 16 and 17 are the redundant moduli. The number 52 in the RNS is (3, 7, 8, 0, 4, 1), which is modified due to an error in the residue of modulus 11 as (3, 7, 0, 0, 4, 1). The decoding gives a 21-bit word corresponding to 668,356. The 8 MSBs reflect the error since the original number is less than the DR of 13 bits. The MSB word Q_1 is 69 and we have Q_2 = Q_1 + 1 = 70. Expressing these as fractions d_1 = 69/256 and d_2 = 70/256, we need to find B and m_i such that d_1·m_i and d_2·m_i are close to an integer. It can be easily checked that for m_i = 11, we have d_1·m_i and d_2·m_i as 2.964 and 3.007, showing that B = 3 and the error is in the residue corresponding to modulus 11.

7.1 Error Detection and Correction Using Redundant Moduli 165

The original decoded word can be obtained by subtracting (3 × M_r)/11 from 668,356 to obtain 52.
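The fraction test can be mechanized as follows (a sketch; variable names ours, with the 8-MSB quotient formed as ⌊256·F/M_r⌋ to match the worked numbers):

```python
import math

Mr = 7 * 9 * 11 * 13 * 16 * 17      # product of all moduli: 2450448
F = 668356                          # erroneously decoded word
Q1 = (256 * F) // Mr                # 8-MSB estimate of F/Mr -> 69
Q2 = Q1 + 1                         # 70

# Find the modulus mk for which Q1*mk/256 and Q2*mk/256 bracket an integer B
matches = []
for mk in (7, 9, 11, 13, 16, 17):
    lo, hi = Q1 * mk / 256, Q2 * mk / 256
    B = math.ceil(lo)
    if B <= hi:
        matches.append((mk, B))

(mk, B), = matches                  # unique match: mk = 11, B = 3
X = F - B * (Mr // mk)              # 668356 - 3*222768 = 52
```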
Jenkins et al. [7–12] have suggested an error correction technique which is also based on projections. In this technique, using two redundant moduli in addition to the original n moduli, the decoded words considering only (n + 1) moduli at a time are computed using MRC. Only one of these will have the redundant MRC digit equal to zero. As an illustration, consider the moduli set {3, 5, 7, 11, 13}, where 11 and 13 are the redundant moduli, and let 52 be changed to {1, 2, 4, 8, 0} due to an error in the residue corresponding to modulus 7. The various moduli sets and corresponding projections are as follows:

{3, 5, 7, 11}: {1, 2, 4, 8} = 382, {3, 5, 7, 13}: {1, 2, 4, 0} = 1222,
{3, 5, 11, 13}: {1, 2, 8, 0} = 52, {3, 7, 11, 13}: {1, 4, 8, 0} = 1768,
{5, 7, 11, 13}: {2, 4, 8, 0} = 767.
Evidently, 52 is the correct answer, and in MRC form it is 0 × 165 + 3 × 15 + 2 × 3 + 1 = 52, with the most significant MRC digit being zero.
Jenkins [8] observed that the MRC structure can be used to obtain the projections by shorting the row and column corresponding to the residue under consideration (see Figure 7.1). This takes advantage of the fact that already computed MRC digits can be reused without repeating the full conversion when obtaining the other projections. Jenkins has also suggested a pipelined implementation so that at any given time, L + r − 1 projections progress through the error checker simultaneously, where L is the number of non-redundant moduli and r the number of redundant moduli. First X5 is produced at the output, and next X4, X3, X2 and X1 are produced.

Note that first a full MRC needs to be performed in (L + r − 1) steps to know whether any error is present by looking at the MRC digits. If the leading MRC digits are non-zero, then five steps to obtain the projections need to be carried out. It is evident that the already computed MRC digits can be used in computing the next projection to avoid re-computation. Note that the shaded latches denote invalid numerical values, because the complete set of residues, rather than the reduced set, is needed to compute the projections. Note also that a monitoring circuit is needed to monitor the mixed radix digits to detect the location of the error.
Jenkins and Altman [9] also point out that the effect of an error in the MRC hardware using redundant moduli is the same as an error in the input residue of that column. The error checker thus also checks errors that occur in the hardware of the MRC converter.
In another technique, known as expanded projection, Jenkins and Altman [9] suggest multiplying the given residues by m_i to generate a projection not involving m_i. In other words, we are computing m_i·X. By observing the most significant MRC digit, one can find whether an error has occurred. However, a fresh MRC then needs to be carried out on the corrected residues. As an illustration, consider the original residue set (1, 2, 3, 8, 0) corresponding to the moduli set {3, 5, 7, 11, 13}, which is modified to {1, 2, 4, 8, 0}, where the error has occurred in the residue mod 7. Multiplying by 3, 5, 7, 11 and 13, we get the various decoded numbers as follows:
166 7 Error Detection, Correction and Fault Tolerance in RNS-Based Designs
Figure 7.1 A pipelined mixed radix converter for the sequential generation of projections (adapted from [8] ©IEEE 1983)
Thus, among these, the correct result is 364, as it is within the scaled dynamic range of 735 (= 7 × 105). Etzel and Jenkins [10] also observe that by arranging the moduli in ascending order, overflow can be detected: after performing MRC, the illegitimate range is identified by the highest MRC digit being non-zero.
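The expanded-projection check can be sketched as follows (names ours):

```python
from functools import reduce

moduli = [3, 5, 7, 11, 13]                    # 11 and 13 redundant
M_total = reduce(lambda a, b: a * b, moduli)
M_legit = 105

def crt(residues):
    return sum(r * (M_total // m) * pow(M_total // m, -1, m)
               for r, m in zip(residues, moduli)) % M_total

received = [1, 2, 4, 8, 0]                    # 52 with the mod-7 residue corrupted

# Multiplying all residues by m_i forces channel i to zero, so an error
# there is annihilated; the decoded word then represents m_i * X.
expanded = {}
for mi in moduli:
    scaled = [(mi * r) % m for r, m in zip(received, moduli)]
    expanded[mi] = crt(scaled)

in_range = [mi for mi in moduli if expanded[mi] < mi * M_legit]
# only mi = 7 gives an in-range word: 364 = 7 * 52
```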
is s > 2nr + 2, where n is the number of actual moduli and r is the number of redundant moduli.
Consider, for example, a three-moduli RNS (n = 3) {3, 5, 7} with two additional redundant moduli (r = 2), 11 and 13. It is sufficient to test for the following combinations: {3, 11, 13}, {5, 11, 13}, {5, 7, 13}, {3, 5, 7}, {5, 7, 11}. Note that we have omitted {5, 7}, {3, 7}, {3, 11}, {11, 13}, and {5, 13}. This method needs MRC for three moduli, whereas the technique of Jenkins et al. needs MRC for a four-moduli RNS.
Orton et al. [15] have suggested error detection and correction using only one redundant modulus, which is a power of two and larger than all other moduli. In this technique, denoted as approximate decoding, we compute

(Σ_i (S/m_i)·|r_i/M_i|_{m_i}) mod 2^k

which is obtained by multiplying the original result of the CRT by S/M. Note that S is the scaling factor 2^k. Orton et al. have suggested that S can be chosen suitably so as to detect errors by looking at the MSBs of the result. If the MSB word is all zeros or all ones, there is no error. This technique, however, does not apply to the full dynamic range, due to the rounding of the (2^k/m_i) values.
As an illustration, consider the moduli set {17, 19, 23, 27, 32} and given residues (12, 0, 22, 6, 18), where 32 is the redundant modulus. We can first obtain the |r_i/M_i|_{m_i} values as 11, 0, 7, 3, 30. Choosing S as 2^9, the S/m_i values can be either rounded or truncated. Considering that rounding is used, the various S/m_i values are 30, 27, 22, 19, 16. We thus obtain (Σ_i (S/m_i)·|r_i/M_i|_{m_i}) mod 2^k = 509 = (111111101)_2. Considering that an error has occurred to make the input residues (13, 0, 22, 6, 18), we compute (Σ_i (S/m_i)·|r_i/M_i|_{m_i}) mod 2^k = 239 = (011101111)_2. The six most significant bits are neither all ones nor all zeros, thus indicating an error. Note that we need to next use projections for finding the error and correcting it. The choice of S has to be proper to yield the correct answer as to whether an error has occurred. This method needs small look-up tables for obtaining (S/m_i)·|r_i/M_i|_{m_i} corresponding to a given r_i.
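Under these assumptions (S = 2^9, rounded S/m_i values), the check can be sketched as:

```python
from functools import reduce

moduli = [17, 19, 23, 27, 32]       # 32 is the redundant modulus
M = reduce(lambda a, b: a * b, moduli)
K = 9
S = 2 ** K

def approx_word(residues):
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        t = (r * pow(Mi, -1, m)) % m          # |r_i / M_i|_{m_i}
        total += round(S / m) * t             # rounded S/m_i (from a small LUT)
    return total % S

def looks_clean(word, msbs=6):
    # no error if the top bits are all zeros or all ones
    top = word >> (K - msbs)
    return top == 0 or top == (1 << msbs) - 1

w_ok  = approx_word([12, 0, 22, 6, 18])       # 509 = (111111101)_2
w_bad = approx_word([13, 0, 22, 6, 18])       # 239 = (011101111)_2
```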
Orton et al. [15] have suggested another technique for error detection using two redundant moduli m_k1 and m_k2. In this method, each redundant modulus is considered separately with the actual RNS, and the value of A(X) (the multiple of M to be subtracted from the CRT summation) is determined using the redundant modulus m_ki following the Shenoy and Kumaresan technique [21]:

|A_1(X)|_{m_k1} = ((Σ_{i=1}^{N} M_i·|x_i/M_i|_{m_i} − x_{m_k1})·|1/M|_{m_k1}) mod m_k1          (7.1)

Similarly, A_2(X) is determined for the second redundant modulus. If these two are equal, there is no error. This technique is called the overflow consistency check.
An example will be illustrative. Consider the moduli set {5, 6, 7, 11}, where 5 and 6 are actual moduli and 7, 11 are the redundant moduli. For a legitimate number, 17 = (2, 5, 3, 6), we have A_1(X), considering the moduli set {5, 6, 7}, as 0, and A_2(X), considering the moduli set {5, 6, 11}, as zero. On the other hand, consider that an error has occurred in the residue corresponding to the modulus 5 to change the residues to (3, 5, 3, 6). It can be verified that A_1(X) = 3 and A_2(X) = 9, showing the inconsistency. The authors suggest adding m_ki·(N − 1)/2 to A_i(X), where N is the number of moduli (not counting the redundant moduli), to take care of possible negative values of A_i(X).
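The consistency check of (7.1) for this example, sketched in Python (names ours):

```python
from functools import reduce

info = [5, 6]                       # actual moduli
M = reduce(lambda a, b: a * b, info)   # 30

def A_mod(residues, xk, mk):
    # |A(X)|_{mk} via Shenoy-Kumaresan: the CRT summation, minus the
    # redundant residue xk, times |1/M|_{mk}, all taken mod mk
    s = sum((M // m) * ((r * pow(M // m, -1, m)) % m)
            for r, m in zip(residues, info))
    return ((s - xk) * pow(M, -1, mk)) % mk

# legitimate 17 = (2, 5, 3, 6) over {5, 6, 7, 11}
ok = (A_mod([2, 5], 3, 7), A_mod([2, 5], 6, 11))       # (0, 0): consistent
# error in the mod-5 residue: (3, 5, 3, 6)
bad = (A_mod([3, 5], 3, 7), A_mod([3, 5], 6, 11))      # (3, 9): inconsistent
```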
The Watson and Hastings [2] error correction procedure uses base extension to the redundant moduli. The difference between the original and reconstructed redundant residues, Δ1 and Δ2, is used to correct the errors. If Δ1 = 0 and Δ2 = 0, then no error has occurred. If one of them is non-zero, the old residue corresponding to this redundant modulus is replaced by the new one. If both are non-zero, then they are used to address a correction table of Σ_{i=1}^{n} (m_i − 1) entries.
Yau and Liu [6] modified this procedure and suggest additional computations instead of look-up tables. Considering an n-moduli set with r additional redundant moduli, they compute the sets {Δ_{m,n+r}, ..., Δ_{m,n+1}} and {Δ_{m,n}, Δ_{m,n−1}, ..., Δ_{m,2}, Δ_{m,1}}. Here the residues are determined by base extension, assuming the RNS contains all moduli except those within the set. If the first set has all-zero entries, there is no error. If exactly one of these is non-zero, the corresponding redundant residue is in error. If more than one element is non-zero, then an iterative procedure checks the remaining sets in a similar manner to identify the incorrect residue; this means that an information residue is in error.
Barsi and Maestrini [5] and Mandelbaum [3] suggest the concept of product codes, where each residue is multiplied by a generator A which is larger than the moduli and mutually prime to all moduli. Thus, of the available dynamic range MA, only M values represent the original RNS.

Given a positive integer A, called the generator of the code, an integer X in the range [0, M] is a legitimate number in the product code of generator A if X ≡ 0 mod A and A is mutually prime to all m_i. Any X in the range [0, M] such that X ≢ 0 mod A is said to be an illegitimate number. The advantage of this technique is that when the addition of X_1 and X_2 is performed, any overflow can be found by checking |X_s|_A, where X_s = (X_1 + X_2) mod M: if |X_s|_A = |−M|_A, an additive overflow has been detected. Barsi and Maestrini [5] suggest the use of the AX code to allow single digit error detection and correction. They also point out that the use of the AX
code can detect overflow in addition. A single error can be detected, since an error Δx_i in one residue will yield a decoded word

H = (AX + |Δx_i/M_i|_{m_i}·M_i) mod M          (7.2)

which is not divisible by A. As an illustration, for the RNS {2, 3, 5, 7, 11, 13, 17, 19} with dynamic range 9,699,690, if A = 23 is the generator, the maximum value that can be represented is 9,699,690/23 ≈ 421,725.
Barsi and Maestrini [5] have suggested a method which can detect an error or an additive overflow. Given a number X to be tested, |X|_A can be found by using base extension. If it is zero, the number is legitimate. If |X|_A = |−M|_A, an additive overflow is detected. As an illustration, consider m_1 = 5, m_2 = 7, m_3 = 9, m_4 = 11 and A = 23. Note that the condition A > 2m_i − 1 for each m_i, 1 ≤ i ≤ n, needs to be satisfied. Let X_1 = (4, 1, 3, 2) = 2829 and X_2 = (3, 3, 3, 8) = 2208. The sum is X_s = |X_1 + X_2|_M = (2, 4, 6, 10) = 1572, and overflow has occurred. Since |X_s|_A = 8 = |−M|_A, the overflow is detected. On the other hand, suppose the result has an error, giving X̂_s = (2, 0, 6, 10) = 2562. Since |X̂_s|_A = 9, the error is detected.
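A sketch of the check, with the overflow signature written as |−M|_A to agree with the worked numbers above:

```python
from functools import reduce

moduli = [5, 7, 9, 11]
M = reduce(lambda a, b: a * b, moduli)   # 3465
A = 23                                   # generator of the product code

def crt(residues):
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

def check(residues):
    x = crt(residues)
    if x % A == 0:
        return "legitimate"
    if x % A == (-M) % A:
        return "additive overflow"
    return "error"

X1, X2 = 2829, 2208                      # both multiples of 23 (legitimate)
s = [(X1 + X2) % m for m in moduli]      # residue-wise sum: (2, 4, 6, 10)
r1 = check(s)                            # the sum wrapped past M: overflow
r2 = check([2, 0, 6, 10])                # corrupted sum: error
```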
Goh and Siddiqui [17] have described a technique for multiple error correction using redundant moduli. Note that we can correct up to t errors, where t ≤ ⌊r/2⌋ and r is the number of redundant moduli. This technique is based on a CRT expansion as a first step to obtain the result. If the result is within the dynamic range allowed by the non-redundant moduli, the answer is correctly decoded. Otherwise, it indicates wrong decoding. For double error correction, as an illustration, for a total number of moduli n, C(n, 2) possibilities exist. For each of these possibilities, it can be seen that a double error in the residues corresponding to moduli m_i and m_j adds a multiple of the product of the other moduli, i.e. M_ij = M/(m_i·m_j), in the CRT expansion. Thus, by taking mod M_ij of the CRT expansion for all cases, excluding two moduli at a time, and picking among the results the smallest one within the dynamic range due to the original moduli, we obtain the correct result.
An example will be illustrative. Consider the six-moduli set {11, 13, 17, 19, 23, 29}, where 11, 13 are non-redundant and 17, 19, 23, 29 are the redundant moduli. The legitimate dynamic range is 0–142. Consider X = 73, which in RNS is (7, 8, 5, 16, 4, 15). Let it be changed to (7, 8, 11, 16, 4, 2) due to a double error. It can be seen that CRT gives the number 25,121,455, which obviously is wrong. Taking mod M_ij for all the 15 cases (excluding two moduli m_i, m_j each time, whose i and j are (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6), since C(6, 2) = 15), we have, respectively, 130299, 79607, 62265, 36629, 11435, 28915, 50926, 83464, 33722, 36252, 65281, 73, 23811, 16518, 40828. Evidently, 73 is the decoded word, since the rest of the values are outside the valid dynamic range.
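The double-error search for this example, sketched in Python (names ours):

```python
from functools import reduce
from itertools import combinations

moduli = [11, 13, 17, 19, 23, 29]        # 17, 19, 23, 29 are redundant
M = reduce(lambda a, b: a * b, moduli)
M_legit = 11 * 13                        # 143

def crt(residues):
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

received = [7, 8, 11, 16, 4, 2]          # 73 with errors in channels 3 and 6
Y = crt(received)                        # 25121455: out of range, wrong

# A double error in channels i, j adds a multiple of M_ij = M/(m_i*m_j),
# so Y mod M_ij strips it; only the correct pair gives an in-range value.
answers = [Y % (M // (moduli[i] * moduli[j]))
           for i, j in combinations(range(6), 2)]
decoded = [a for a in answers if a < M_legit]    # [73]
```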
7.1 Error Detection and Correction Using Redundant Moduli 171
The authors have extended the technique to correct four errors in the residues, even though it is known that with four redundant moduli only double errors can be corrected. They observe that only three cases (error positions) need to be tested: (1, 2, 3, 4), (1, 2, 5, 6) and (3, 4, 5, 6). Note that these combinations are chosen such that, when they are taken t = 2 at a time (where t is the number of correctable errors), most of the combinations among the 15 would have been included. Consider the modified residues (1, 12, 0, 5, 6, 14) due to four errors. The corresponding results mod M_ijkl are 1050, 51, 12. It can be seen that the last two appear to be correct, but only one of them is. The correct one can be chosen by comparing the given residue set with those corresponding to both these cases; the one with most agreements is chosen. As an illustration, 51 = (7, 12, 0, 5, 22, 14) and 12 = (1, 12, 12, 12, 12, 12), whereas the input residues are (1, 12, 0, 5, 6, 14). It can be seen that the numbers of disagreements in the two cases are, respectively, 2 and 4. Hence, 51 is the correct answer.
Haron and Hamdioui [18] have suggested the use of 6M-RRNS (six-moduli redundant residue number system), which uses six moduli for protecting hybrid memories (e.g., non-CMOS types). Two moduli (information moduli) are used for the actual representation of the memory word as residues, whereas four moduli are used for correcting errors in these two information moduli. The moduli set used was {2^p + 1, 2^p, 2^p1 − 1, 2^p2 − 1, 2^p3 − 1, 2^p4 − 1}, where p = 8, 16, 32 for securing memories of data width 16, 32 and 64 bits, respectively. As an illustration, considering p = 8, the moduli set is {257, 256, 127, 63, 31, 17}. Even though the redundant moduli are smaller than the information moduli, the dynamic range of the redundant moduli, 4,216,527, is larger than the DR of the information moduli, 65,792. They also consider conventional RRNS (C-RRNS), which uses three information moduli {2^p − 1, 2^p, 2^p + 1} and six redundant moduli. For a 16-bit dynamic range, they suggest {63, 64, 65} as information moduli and {67, 71, 73, 79, 83, 89} as redundant moduli. Note that the code word length will be 61 bits, whereas for 6M-RRNS the code word length for p = 8 is only 40 bits. Note, however, that in 6M-RRNS, since the word lengths of the redundant moduli are not larger than those of the information moduli, a single read word may be decoded into more than one output data word. This ambiguity can be resolved by using maximum-likelihood decoding similar to the Goh and Siddiqui technique [17]: the Hamming distance between the read code word and each decoded ambiguous candidate is found, and the one with the smallest distance is selected.
Consider the moduli set {257, 256, 127, 63, 31, 17}, where the first two are information residues and the last four are redundant residues. Consider X = 9216, which corresponds to (221, 0, 72, 18, 9, 2). Assume that it is corrupted to (0, 0, 72, 18, 9, 2). Calculating all the projections, we find that two possible results exist, corresponding to m_1 and m_2 discarded and to m_3 and m_6 discarded. These are 9216 and 257, which correspond to the residues (221, 0, 72, 18, 9, 2) and (0, 1, 3, 5, 9, 2). Compared with the given residues, the choice 9216 has more agreements and hence is the correct result.
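For this example, the selection step reduces to counting residue agreements (a sketch; the two candidate values are taken from the text):

```python
moduli = [257, 256, 127, 63, 31, 17]
received = [0, 0, 72, 18, 9, 2]          # 9216 with the mod-257 residue corrupted

def agreements(x):
    # number of channels where candidate x matches the read residues
    return sum((x % m) == r for m, r in zip(moduli, received))

candidates = [9216, 257]                 # the two surviving projections
best = max(candidates, key=agreements)   # 9216: 5 agreements versus 3
```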
Pontarelli et al. [20] have suggested error detection and correction for RNS-based FIR filters using scaled values. This technique uses one additional modulus m which is prime with respect to all other moduli in the RNS. The dynamic range is divided into two parts: the legitimate range, whose elements are exactly divisible by m, and the remaining, illegitimate range. An error in a single modulus is detected if, after conversion to two's complement representation, the result belongs to the illegitimate range. Since the members of the legitimate range are exactly divisible by m, the scaled dynamic range is much less than the actual range. As an illustration, considering the moduli set {3, 5, 7} and m = 11, the only valid integers are 0, 11, 22, 33, 44, 55, 66, 77, 88, 99. In the case of errors in the residues, e.g. {2, 0, 0}, the CRT gives the decoded number 35, whose residue mod 11 is 2. The possible single-digit errors corresponding to the moduli {3, 5, 7} are (1,0,0), (2,0,0), (0,1,0), (0,2,0), (0,3,0), (0,4,0), (0,0,1), (0,0,2), (0,0,3), (0,0,4), (0,0,5), (0,0,6). The values mod 11 corresponding to these are 4, 2, 10, 9, 8, 7, 4, 8, 1, 5, 9, 2. It can be seen that for all possible errors the residue mod 11 is non-zero, thus allowing detection of any single error.
Consider the moduli set {11, 13, 15} and the redundant scaling modulus 256. We have M_1 = 195, M_2 = 165 and M_3 = 143, with |1/M_1|_{m_1} = 7, |1/M_2|_{m_2} = 3 and |1/M_3|_{m_3} = 2. We also have |1/M_1|_{256} = 235, |1/M_2|_{256} = 45 and |1/M_3|_{256} = 111, which we need later. The code words (multiples of 256) are only 9, viz. 0, 256, 512, 768, 1024, 1280, 1536, 1792 and 2048, since the dynamic range is 2145. A single-digit error in any residue yields 10 + 12 + 14 error cases, which give unique values mod 256 after decoding (without aliasing). As an example, corresponding to the error (1, 0, 0), we have 195 × 7 = 1365 and 1365 mod 256 = 85.
In a FIR filter implementation, while the coefficients are in conventional unscaled form, the sample values are scaled by m. As an example, consider the computation of 2 × 3 + 1 = 7, where 2 is a coefficient. This is evaluated as (2, 2, 2) × (9, 1, 3) + (3, 9, 1) = (10, 11, 7). Note that (9, 1, 3) corresponds to 256 × 3 and (3, 9, 1) to 256 × 1 in RNS form. Assuming that the residue 7 is changed to 8, the CRT gives the erroneous value 2078, and 2078 mod 256 = 30, indicating that there is an error. We obtain the error corresponding to each modulus as e_1 = |30 × |1/M_1|_{256}|_{256} = |30 × 235|_{256} = 138, and similarly e_2 = |30 × 45|_{256} = 70 and e_3 = |30 × 111|_{256} = 2. Next, from e_1, we can calculate
7.2 Fault Tolerance Techniques Using TMR 173
the error E_i as E_i = e_i·M_i. This needs to be subtracted from the decoded word Y. Repeating the step for the other moduli channels also, we note that |Y − E_1|_M = 908, |Y − E_2|_M = 1253 and |Y − E_3|_M = 1792, and since 1792 is a multiple of 256, it is the correct result.
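The whole correction step can be sketched as follows (names ours):

```python
from functools import reduce

moduli = [11, 13, 15]
M = reduce(lambda a, b: a * b, moduli)   # 2145
S = 256                                  # redundant scaling modulus

def crt(residues):
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

Y = crt([10, 11, 8])                     # (10, 11, 7) with the mod-15 digit corrupted
syndrome = Y % S                         # 30: non-zero, so an error is present

# Try e_i = |syndrome / M_i|_S for each channel; subtracting E_i = e_i * M_i
# lands on a multiple of S only for the faulty channel.
corrected = None
for i, m in enumerate(moduli):
    Mi = M // m
    e = (syndrome * pow(Mi, -1, S)) % S
    cand = (Y - e * Mi) % M
    if cand % S == 0:
        corrected = cand                 # 1792 = 256 * 7

result = corrected // S                  # unscaled filter output: 7
```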
Preethy et al. [16] have described a fault tolerance scheme for an RNS MAC. This uses two redundant moduli, 2^m and 2^m − 1, which are mutually prime with respect to the moduli in the RNS. The given residues are used to compute the binary word X using a reverse converter; the residues of X mod 2^m and X mod (2^m − 1) are then computed and compared with the input residues. The difference is used to index a LUT that generates the error word, which can be added to the decoded word X to obtain the corrected word. The authors have used the moduli set {7, 11, 13, 17, 19, 23, 25, 27, 29, 31, 32}, where 31 and 32 are the redundant moduli, to realize a 36-bit MAC.
Triple modular redundancy (TMR) uses three modules working in parallel; if the outputs of two modules agree, that output is selected as correct. The price paid is the need for 200 % redundancy. In quadruple modular redundancy (QMR), two units (printed circuit boards, for example) exist, each having two similar modules performing the same computation. The on-board modules are compared for agreement of the result; if they disagree, the output of the other board, whose two on-board modules agree, is selected. QMR needs a factor of 4.0 redundancy. In the case of RNS, double modular redundancy (DMR) can be used, in which each modulus channel is duplicated and checked for agreement. If there is disagreement, that modulus channel is removed from the computation. Evidently, more channels than in the original RNS will be needed. As an illustration, for a five-moduli RNS, one extra channel will be needed, implying 17 % more hardware. An arbitration unit is needed to detect the error and switch the channels. This design needs a factor of 2.34 redundancy in total. Jenkins et al. [22] also suggest an SBM-RNS (serial-by-modulus RNS) in which only one modulus channel exists and is reused. This is L times slower than the conventional implementation using L moduli channels. The results corresponding to the various moduli need to be stored, and LUTs may be used for all arithmetic operations. A compute-until-correct approach can be used, since a fault cannot be identified by looking at an individual channel.

We will discuss more specific techniques for achieving fault tolerance using specialized number systems in Chapter 8, and fault tolerance of FIR filters in Chapter 9.
174 7 Error Detection, Correction and Fault Tolerance in RNS-Based Designs
References
1. N.S. Szabo, R.I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology
(Mc-Graw Hill, New-York, 1967)
2. R.W. Watson, C.W. Hastings, Self-checked computation using residue arithmetic, in Pro-
ceedings of the IEEE, pp. 1920–1931 (1966)
3. D. Mandelbaum, Error correction in residue arithmetic. IEEE Trans. Comput. C-21, 538–545
(1972)
4. F. Barsi, P. Maestrini, Error correcting properties of redundant residue number systems. IEEE
Trans. Comput. C-22, 307–315 (1973)
5. F. Barsi, P. Maestrini, Error detection and correction in product codes in residue number
systems. IEEE Trans. Comput. C-23, 915–924 (1974)
6. S.S. Yau, Y.C. Liu, Error correction in redundant residue number systems. IEEE Trans.
Comput. C-22, 5–11 (1973)
7. W.K. Jenkins, Residue number system error checking using expanded projections. Electron.
Lett. 18, 927–928 (1982)
8. W.K. Jenkins, The design of error checkers for self-checking residue number arithmetic.
IEEE Trans. Comput. 32, 388–396 (1983)
9. W.K. Jenkins, E.J. Altman, Self-checking properties of residue number error checkers based
on Mixed Radix conversion. IEEE Trans. Circuits Syst. 35, 159–167 (1988)
10. M.H. Etzel, W.K. Jenkins, Redundant residue number systems for error detection and correc-
tion in digital filters. IEEE Trans. Acoust. Speech Signal Process. 28, 538–545 (1980)
11. W.K. Jenkins, M.H. Etzel, Special properties of complement codes for redundant residue
number systems, in Proceedings of the IEEE, vol. 69, pp. 132–133 (1981)
12. W.K. Jenkins, A technique for the efficient generation of projections for error correcting
residue codes. IEEE Trans. Circuits Syst. CAS-31, 223–226 (1984)
13. V. Ramachandran, Single residue error correction in residue number systems. IEEE Trans.
Comput. C-32, 504–507 (1983)
14. C.C. Su, H.Y. Lo, An algorithm for scaling and single residue error correction in the Residue
Number System. IEEE Trans. Comput. 39, 1053–1064 (1990)
15. G.A. Orton, L.E. Peppard, S.E. Tavares, New fault tolerant techniques for Residue Number
Systems. IEEE Trans. Comput. 41, 1453–1464 (1992)
16. A.P. Preethy, D. Radhakrishnan, A. Omondi, Fault-tolerance scheme for an RNS MAC:
performance and cost analysis, in Proceedings of IEEE ISCAS, pp. 717–720 (2001)
17. V.T. Goh, M.U. Siddiqui, Multiple error detection and correction based on redundant residue
number systems. IEEE Trans. Commun. 56, 325–330 (2008)
18. N.Z. Haron, S. Hamdioui, Redundant Residue Number System code for fault-tolerant hybrid
memories. ACM J. Emerg. Technol. Comput. Syst. 7(1), 1–19 (2011)
19. S. Pontarelli, G.C. Cardarilli, M. Re, A. Salsano, Totally fault tolerant RNS based FIR filters,
in Proceedings of 14th IEEE International On-Line Testing Symposium, pp. 192–194 (2008)
20. S. Pontarelli, G.C. Cardarilli, M. Re, A. Salsano, A novel error detection and correction
technique for RNS based FIR filters, in Proceedings of IEEE International Symposium on
Defect and Fault Tolerance of VLSI Systems, pp. 436–444 (2008)
21. A.P. Shenoy, R. Kumaresan, Fast base extension using a redundant modulus in RNS. IEEE
Trans. Comput. 38, 293–297 (1989)
22. W.K. Jenkins, B.A. Schnaufer, A.J. Mansen, Combined system-level redundancy and modular
arithmetic for fault tolerant digital signal processing, in Proceedings of the 11th Symposium on
Computer Arithmetic, pp. 28–34 (1993)
Further Reading
R.J. Cosentino, Fault tolerance in a systolic residue arithmetic processor array. IEEE Trans.
Comput. 37, 886–890 (1988)
M.-B. Lin, A.Y. Oruc, A fault tolerant permutation network modulo arithmetic processor. IEEE
Trans. VLSI Syst. 2, 312–319 (1994)
A.B. O’Donnell, C.J. Bleakley, Area efficient fault tolerant convolution using RRNS with NTTs
and WSCA. Electron. Lett. 44, 648–649 (2008)
C. Radhakrishnan, W.K. Jenkins, Hybrid WHT-RNS architectures for fault tolerant adaptive
filtering, in Proceedings of IEEE ISCAS, pp. 2569–2572 (2009)
D. Radhakrishnan, T. Pyon, Fault tolerance in RNS: an efficient approach, in Proceedings of 1990
IEEE International Conference on Computer Design: VLSI in Computers and Processors, ICCD
'90, pp. 41–44 (1990)
D. Radhakrishnan, T. Pyon, Compact real time RNS error corrector. Int. J. Electron. 70, 51–67
(1991)
T. Sasao, Y. Iguchi, On the complexity of error detection functions for redundant residue number
systems, in Proceedings of DSD 2008, pp. 880–887 (2008)
S. Timarchi, M. Fazlali, Generalized fault-tolerant stored-unibit-transfer residue number system
multiplier for the moduli set {2^n − 1, 2^n, 2^n + 1}. IET Comput. Digit. Tech. 6, 269–276 (2012)
Chapter 8
Specialized Residue Number Systems
Sum = R + jR* = (4, 2). The actual sum is, therefore, Q = (4 + 2)/2 = 3 and
Q* = ((4 − 2)/(2 · 8)) mod 13 = 5, which can be verified to be true.
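These QRNS operations can be sketched in software. The following toy model assumes m = 13 with j1 = 8 (since 8² = 64 ≡ −1 mod 13, the value consistent with the sum example), and shows the forward mapping, the component-wise product, and the inverse mapping:

```python
# QRNS toy model for m = 13, j1 = 8 (8*8 = 64 = -1 mod 13).
M, J = 13, 8

def to_qrns(a, b):
    """Map a + jb to the pair (Z, Z*) = (a + j1*b, a - j1*b) mod m."""
    return ((a + J * b) % M, (a - J * b) % M)

def from_qrns(Z, Zs):
    """Inverse map: a = (Z + Z*)/2, b = (Z - Z*)/(2*j1), both mod m."""
    return ((Z + Zs) * pow(2, -1, M) % M,
            (Z - Zs) * pow(2 * J, -1, M) % M)

def qrns_mul(p, q):
    """Complex multiplication becomes two independent modular products."""
    return (p[0] * q[0] % M, p[1] * q[1] % M)

# (2 + j3)(1 + j2) = -4 + j7, i.e. 9 + j7 (mod 13)
print(from_qrns(*qrns_mul(to_qrns(2, 3), to_qrns(1, 2))))  # (9, 7)
```

The point of the mapping is visible in `qrns_mul`: the four real multiplications of a direct complex product collapse to two independent modular ones.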
The multiplication in QRNS is quite simple. Consider the multiplication of
(a + jb) and (c + jd). The following illustrates the procedure:

Q = ac − bd + j1(ad + bc)
Q* = ac − bd − j1(ad + bc)    (8.3)
a + jb ⇒ Z = (a + j1·b·√c) mod m,  Z* = (a − j1·b·√c) mod m

a = (Z + Z*)/2,  b = (Z − Z*)/(2·j1·√c)    (8.4)
where

S = ((j1² + 1)·bd) mod m    (8.5b)
Note that several choices of j1 are possible for a given m, leading to different
n values. For example, for m = 19, the (j1, n) pairs can be (5, 6), (6, 17), (7, 11), (8, 7),
(10, 5), etc. Consider the following example of multiplication in MQRNS.
Consider m = 19, j1 = 10, n = 5. Let us find (2 + j3)·(3 + j5). First note that
S = 14. By the mapping rule, we have
Paliouras and Stouraitis [5, 6] have suggested the use of moduli of the form r^n in
order to increase the dynamic range. The residue for a modulus r^n can be
represented as digits a_i (i = 0, . . ., n − 1), and these digits can be processed
directly. With B similarly represented by digits b_i, where 0 ≤ b_i ≤ r − 1,
computing A·B yields the digit products a_i·b_j. If i + j ≥ n, then due to the
mod r^n operation, p_ij will not contribute to the result. Note also that the
carry digit p1_ij can be at most (r − 2), since a_i and b_j can each be at most
(r − 1), in which case

a_i·b_j = r² − 2r + 1 = r·(r − 2) + 1    (8.6d)
The maximum carry digit is hence (r − 2). The multiplier has two types of
cells: one with both sum and carry outputs and another with only a sum output. The
latter are used in the leftmost column of the multiplier. The architecture of the
multiplier for r > 3 and n = 5 is shown in Figure 8.1a. The preprocessor computes
p_ij = a_i·b_j and yields two outputs p1_ij and p0_ij (see (8.6c)). The preprocessor
cell is shown in Figure 8.1b. Note that it computes the digit product D1·D2 and
gives the output S = Cr + D, where C and D are radix-r digits.
The partial product outputs of the various preprocessor cells next need to be
summed in a second stage using an array of radix-r adder cells. These special full-
adder cells need to add three inputs whose maximum values are (r − 1) or
(r − 2). The carry digit can be at most 2 for r ≥ 3:

(r − 1) + (r − 1) + (r − 2) = 2r + (r − 4)    (8.7)
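The digit-wise scheme above can be sketched as follows. This is a behavioral model, not the cell-level architecture of Figure 8.1: each digit product is split into a sum digit p0 and a carry digit p1 (the preprocessor role), and the columns are then reduced with carries (the adder-array role):

```python
def modrn_mul(a_digits, b_digits, r):
    """(A * B) mod r**n from least-significant-first radix-r digit lists."""
    n = len(a_digits)
    cols = [0] * n
    for i, ai in enumerate(a_digits):
        for j, bj in enumerate(b_digits):
            if i + j >= n:               # discarded by the mod r**n reduction
                continue
            p = ai * bj                  # digit product, at most (r-1)**2
            cols[i + j] += p % r         # sum digit p0
            if i + j + 1 < n:
                cols[i + j + 1] += p // r  # carry digit p1, at most r - 2
    out, carry = [], 0
    for c in cols:                       # resolve the remaining carries
        c += carry
        out.append(c % r)
        carry = c // r                   # carry out of the top digit is dropped
    return out

# 124 * 99 = 12276; 12276 mod 125 = 26 = [1, 0, 1] in radix 5 (LSD first)
print(modrn_mul([4, 4, 4], [4, 4, 3], 5))
```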
Figure 8.1 (a) Architecture of a radix-5 multiplier (r > 3, n = 5), (b) pre-processor cell, (c) full-
adder based three-input radix-5 digit adder, (d) radix-7 D665062 digit-adder converter, (e) opti-
mized version of (d), (f) modulo r² multiplier, (g) modulo r³ multiplier, (h) modulo r⁴
multiplier and (i) 8-bit binary-to-residue converter mod 125 ((a, i) adapted from [5] ©IEEE 2000,
(b–h) adapted from [7] ©IEEE 2009)
The first stage of each cell is a converter which evaluates the legitimate carry
digit C and sum digit D (see Figure 8.1b). The second stage is a recursive mod-r
adder in which the carry generated is fed back, after being multiplied by
k = 2^l mod r, in a chain of adders until no carry is produced. Every time a carry is
produced, (2^l − r) is added and the carry bits are accumulated to form the final
carry digit.
As an illustration, a three-input radix-5 adder is presented in Figure 8.1c. Each
digit is represented as a 3-bit word; however, the values can be greater than 4. Several
radix-5 cells with three 3-bit inputs and a specified maximum sum can be designed.
Some possibilities, for example, are D744052, D654052, etc. Note that the first four
digits correspond to the maximum values of the inputs, the penultimate digit
indicates the sum output, and the last digit indicates the carry output. The output
binary word of this adder needs to be converted to radix-r form using a binary-to-
radix-r converter. In some cases this can be simplified, for example for r = 7. In a
binary-to-RNS converter the quotient value is ignored, whereas here we need to
compute the quotient. A radix-7 binary-to-RNS converter for a five-bit input s4, s3,
s2, s1, s0 with maximum sum value 17 is shown in Figure 8.1d, and a look-ahead
version is shown in Figure 8.1e. Here, b4 and b3 are the carry bits and b2b1b0 are
the sum bits. The authors have suggested optimized implementations for radix r², r³
and r⁴ multiplication. These are presented in Figure 8.1f–h.
Kouretas and Paliouras [8, 9] have described HRM (high-radix redundant
multiplier) architectures for the moduli r^n, r^n − 1 and r^n + 1. The last one uses
diminished-1 arithmetic. Redundancy has been used in the representation of the
radix-r digits to reduce the complexity of the adder cells, as discussed earlier.
The binary to radix-r conversion [7] can follow methods similar to the radix-2
case. The residues of the various powers of 2 can be stored in radix form and added
mod r^n. As an illustration, for radix 125 = 5³, the residues of the various powers
of 2 for an 8-bit binary input are as follows:

Powers of 2: 0 1 2 3 4 5 6 7
Radix-5³ digits: 001 002 004 013 031 112 224 003.
Thus, for any given binary number, the LUTs can yield the radix-5 digits, which
need to be added taking the carries into account to get the final result. An 8-bit
modulo-125 (= 5³) binary-to-residue converter, using an array of special-purpose
cells denoted as simplified digit adders, is shown in Figure 8.1i. Note that the
numbers inside the adders in Figure 8.1i indicate the weights of the inputs. For
example, the cell 441 computes 4y2 + y0 + 4y6 and gives the sum digit in radix-r
form together with a carry bit. The cell rAB adds the inputs assigned to three ports:
the r port can be any radix-r digit, whereas the ports IA and IB are 1-bit inputs
which, when asserted, add the constants A and/or B. This cell also gives a 1-bit
carry output.
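The conversion just described can be sketched behaviorally (the table of residues of powers of two stands in for the LUTs; the cell-level adder array of Figure 8.1i is not modeled):

```python
# 8-bit binary to mod-125 residue, expressed as radix-5 digits.
R, N, M = 5, 3, 125
POW2 = [pow(2, k, M) for k in range(8)]   # LUT contents: 1, 2, 4, 8, 16, 32, 64, 3

def to_radix5(x):
    """Most-significant-digit-first radix-5 digits of x < 125."""
    return [(x // R**i) % R for i in reversed(range(N))]

def bin_to_residue(byte):
    acc = 0
    for k in range(8):
        if (byte >> k) & 1:
            acc = (acc + POW2[k]) % M     # accumulate the stored residues mod 125
    return to_radix5(acc)

print(bin_to_residue(200))  # 200 mod 125 = 75, i.e. digits [3, 0, 0]
```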
Abdallah and Skavantzos [10] have described multi-moduli residue number
systems with moduli of the form r^a, r^b − 1, r^c + 1, where r > 2. Evidently, the
complete processing has to be done in radix r. The rules used in radix 2 for
reduction mod (2^n − 1) or (2^n + 1) using the periodic properties discussed earlier
can be applied in this case as well. The moduli can have common factors, which
need to be taken into account in the RNS-to-binary conversion using CRT. As an
illustration, the moduli set {3¹² + 1, 3¹³ − 1, 3¹⁴ + 1, 3¹⁵ − 1, 3¹⁶ + 1} has all even
moduli, and division by 2 yields mutually prime numbers. The reader is referred to
[10] for an exhaustive treatment of the options available. The authors show that
these can be faster than radix-2-based designs.
8.3 Polynomial Residue Number Systems

Skavantzos and Taylor [11] and Skavantzos and Stouraitis [12] have proposed the
polynomial residue number system (PRNS), which is useful for polynomial multi-
plication. It can be considered a generalization of QRNS and can perform the
polynomial product with a minimal number of multiplications and with a high
degree of parallelism, provided the arithmetic operates in a carefully chosen ring.
This is useful in DSP applications that involve multiplication-intensive algorithms
such as convolutions and one- or two-dimensional correlations.
Consider two (N − 1)th-order polynomials A(x) and B(x); we need to find
(A(x)·B(x)) mod (x^N + 1) in a chosen ring Z_m = {0, 1, 2, . . ., m − 1} which is
closed with respect to the operations of addition and multiplication mod m. Such an
operation is needed in circular convolution. Note that (x^N + 1) can be factored into
N distinct factors in Z_m, viz., (x − r_0)(x − r_1)···(x − r_{N−1}) where r_i ∈ Z_m,
i = 0, 1, 2, . . ., N − 1, if and only if p_i ≡ 1 (mod 2N) for every prime p_i in the
factorization m = ∏(i = 1 to L) p_i^{e_i}, where the e_i are exponents. In the case of
x^N − 1, the necessary and sufficient condition for factorization is p_i ≡ 1 (mod N).
We consider N = 4 for illustration. We first define the roots of (x⁴ + 1) = 0 as r0,
(−r0) mod m, (1/r0) mod m, (−1/r0) mod m. Once these roots are known, using an
isomorphic mapping it is possible to map A(x) into the 4-tuple (a0*, a1*, a2*, a3*),
where a0* = A(r0) mod m, a1* = A(−r0) mod m, a2* = A(1/r0) mod m and
a3* = A(−1/r0) mod m, as follows:

a0* = (a0 + a1·r0 + a2·r0² + a3·r0³) mod m    (8.8a)
a1* = (a0 − a1·r0 + a2·r0² − a3·r0³) mod m    (8.8b)
a2* = (a0 − a1·r0³ − a2·r0² − a3·r0) mod m    (8.8c)
a3* = (a0 + a1·r0³ − a2·r0² + a3·r0) mod m    (8.8d)
Defining the 4-tuple corresponding to B(x) as (b0*, b1*, b2*, b3*), the element-
wise multiplication of (a0*, a1*, a2*, a3*) with (b0*, b1*, b2*, b3*) yields the
product (c0*, c1*, c2*, c3*). This reduces the N² mod-m multiplications of direct
polynomial multiplication to only N mod-m multiplications. The 4-tuple (c0*, c1*,
c2*, c3*) then needs to be converted using the inverse isomorphic transformation in
order to obtain the final result (A(x)·B(x)) mod (x⁴ + 1), using the following
equations:
a0 = |2^{−2}(a0* + a1* + a2* + a3*)|m    (8.9a)
a1 = |2^{−2}[r0³(a1* − a0*) + r0(a2* − a3*)]|m    (8.9b)
a2 = |2^{−2}r0²(a3* + a2* − a1* − a0*)|m    (8.9c)
a3 = |2^{−2}[r0(a1* − a0*) + r0³(a2* − a3*)]|m    (8.9d)

In general, a_i can be obtained as a_i = |N^{−1}(a0*·r0^{−i} + a1*·r1^{−i} + ··· + a_{N−1}*·r_{N−1}^{−i})|m.
Yang and Wu [13] have observed that PRNS can be interpreted in terms of the CRT
for polynomials over a finite ring.
An example will be illustrative. Consider the evaluation of A(x)B(x) mod (x⁴ + 1)
where A(x) = 5 + 6x + 8x² + 13x³ and B(x) = 9 + 14x + 10x² + 12x³, with m = 17. It
can be found that the roots of (x⁴ + 1) mod 17 are 2, 15, 9, 8. We note that
a_i = {5, 6, 8, 13} and b_i = {9, 14, 10, 12}. Evaluating at each of the roots of
(x⁴ + 1), we find a_i* = {0, 6, 1, 13} and b_i* = {3, 10, 3, 3}. Thus, we have
c_i* = {0, 9, 3, 5}. Using the inverse transformation, we have c_i = {0, 0, 16, 9}.
Thus, the answer is 16x² + 9x³.
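This worked example can be replayed in a few lines (reading the second coefficient of B as 14, the value consistent with the quoted evaluations):

```python
m, roots = 17, [2, 15, 9, 8]     # roots of x^4 + 1 mod 17
A = [5, 6, 8, 13]                # A(x), constant term first
B = [9, 14, 10, 12]              # B(x)

def evaluate(p, r):
    return sum(c * pow(r, k, m) for k, c in enumerate(p)) % m

a_star = [evaluate(A, r) for r in roots]               # [0, 6, 1, 13]
b_star = [evaluate(B, r) for r in roots]               # [3, 10, 3, 3]
c_star = [x * y % m for x, y in zip(a_star, b_star)]   # [0, 9, 3, 5]

# inverse isomorphism: c_i = N^{-1} * sum_k c_k* r_k^{-i} (mod m)
ninv = pow(4, -1, m)
c = [ninv * sum(ck * pow(r, -i, m) for ck, r in zip(c_star, roots)) % m
     for i in range(4)]
print(c)  # [0, 0, 16, 9], i.e. 16x^2 + 9x^3
```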
Paliouras et al. [14] have extended PRNS to perform modulo (x^n − 1)
multiplication as well. They observe that in this case the values of the roots r_i are
simple powers of two. As an illustration, the roots of x⁸ − 1 mod (2⁴ + 1) are
{1, 2, 2², 2³, −2³, −2², −2, −1}. As such, the computation of A(r_i) can be
simplified to rotations and bit inversions with low hardware complexity.
The authors have used diminished-1 arithmetic for the case (2^n + 1) and have shown
that PRNS-based cyclic convolution architectures reduce the area as well as the
power consumption. The authors have also considered the three-moduli system
{2⁴ + 1, 2⁸ + 1, 2¹⁶ + 1} so that supply voltage reduction can be applied to the
channels with long critical paths.
Skavantzos and Stouraitis [12] have extended the PRNS to perform complex
linear convolutions. These can be computed using two modulo (x^{2N} + 1) polynomial
products. N-point complex linear convolutions can be computed with 4N real
multiplications using PRNS, instead of the 2N² real multiplications needed when
using QRNS. The reader is referred to their work for more information.
Abdallah and Skavantzos [15] observe that the sizes of the moduli rings used in
PRNS are of the same order as the size N of the polynomials to be multiplied. For
multiplication of large polynomials, large modular rings must be chosen, leading to
performance degradation. As an illustration, for modulus (x²⁰ + 1) PRNS, the only
possible prime factors are of the form 40k + 1, e.g. 41 and 241, and the resulting
dynamic range 41 × 241 is less than 14 bits. In such cases, the multi-polynomial
channel PRNS (MPCPRNS) has been suggested. The reader is referred to [15] for
more information.
Paliouras and Stouraitis [16] have suggested complexity reduction of forward
and inverse PRNS converters by exploiting the symmetry of the transformation
matrices used for representing the conversion procedure as a matrix-by-vector
product.
Shyu et al. [17] have suggested a quadratic polynomial residue number system
based complex multiplier using moduli of the form 2^{2n} + 1. The advantage is that
the mapping and inverse mapping can be realized using simple shifts and additions.
For complex numbers with real and imaginary parts less than R, the dynamic
range of the RNS shall be 4R². As an illustration, for R = 2⁸ we can choose the
RNS {2⁸ + 1, 2⁶ + 1, 2⁴ + 1}.
Two-dimensional PRNS techniques are needed to multiply large polynomials
in a fixed-size arithmetic ring. Skavantzos and Mitash [18, 19] have described
this technique. PRNS can also be extended to compute the products of multivariate
polynomials, e.g. A(x1, x2) = 2 + 5x1 + 4x2 + 7x1x2 + 2x1x2² + x2², B(x1, x2) = 1
+ 2x1 + 4x2 + 11x1x2 + 3x1x2² + 9x2² [20]. This has applications in multi-dimensional
signal processing using correlation and convolution techniques.
Beckman and Musicus [21] have presented fault-tolerant convolution algorithms
based on PRNS. Redundancy is incorporated using extra residue channels, and their
approach to error detection and correction is similar to that used in integer arithmetic.
They also suggest the use of triple modular redundancy (TMR) for the CRT recon-
struction, error detection and correction; note that the individual residue channel
operations do not use TMR. The authors recommend the use of a specific set of
modulo polynomials which are sparse, in order to simplify the modulo multiplica-
tion and accumulation operations.
Parker and Benaissa [22] and Chu and Benaissa [23, 24] have used PRNS for
multiplication in GF(p^m). They suggest choosing irreducible trinomials such that
the degree of the product is 2m. For implementing the ECC curve K-163, with the
polynomial f(x) = x¹⁶³ + x⁷ + x⁶ + x³ + 1, four degree-84 irreducible polynomials
x⁸⁴ + x^k + 1 with k ≤ 42 have been selected. In another design, 37 degree-9
irreducible polynomials have been selected; these need a GF(2⁹) channel
multiplier.
PRNS has also been used for implementing AES (Advanced Encryption Stan-
dard) with error correction capability [25]. The S-Box was mapped using three
irreducible polynomials x⁴ + x + 1, x⁴ + x³ + 1 and x⁴ + x³ + x² + x + 1, giving an
S-Box with three GF(2⁴) modules while two are sufficient; the additional modulus
has been used for error detection. A LUT-based implementation was used for the
S-Box, whereas the MixColumn transformation was also implemented using the
three-moduli PRNS.
8.4 Modulus Replication RNS

In RNS, the dynamic range is directly related to the moduli, since it is the product of
all the mutually prime moduli. An increase in dynamic range thus implies an increase
in the number of moduli or in the word lengths of the moduli, making the hardware
complex. In MRRNS (modulus replication RNS) [26] the numbers are represented as
polynomials in an indeterminate which is a power of two (some fixed radix 2^β). The
coefficients are integers smaller in magnitude than 2^β. The computation of residues
for general moduli m_i (the forward conversion step) therefore does not arise.
As an example, for an indeterminate x = 8 we can write 79 in several ways [27]:

79 = x² + x + 7 = x² + 2x − 1 = 2x² − 6x − 1 = . . .
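These non-unique representations are easy to check at x = 8:

```python
# The three representations of 79 quoted above, evaluated at x = 8.
x = 8
print(x**2 + x + 7, x**2 + 2*x - 1, 2*x**2 - 6*x - 1)  # 79 79 79
```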
The polynomials are represented as elements of copies of finite rings. The dynamic
range is increased by increasing the number of copies of already existing moduli.
Even a small moduli set such as {3, 5, 7} can produce a large dynamic range.
MRRNS is based on a version of CRT that holds for polynomial rings. There is no
restriction that all moduli must be relatively prime. It allows repeated use of moduli
to increase the dynamic range of the computation. A new multivariate version of
MRRNS was described by Wigley et al. [26].
MRRNS uses the fact that every polynomial of degree n can be uniquely
represented by its values at (n + 1) distinct points, and closed arithmetic operations
can be performed over completely independent channels. These points r_i are chosen
such that (r_i − r_j), for all (i, j) with i ≠ j, is invertible in Z_p. This technique allows
algorithms to be decomposed into independent computations over identical chan-
nels. The number of points must be large enough to represent not only the input
polynomials but also the result of the computation.
Consider the computation of 79 × 47 + 121 × 25 = 6738 [28]. We first define poly-
nomials P1(x), P2(x), Q1(x), Q2(x) that correspond to the input values 79, 47,
121 and 25, assuming an indeterminate x = 2³ = 8. Thus, we have, for example,
P1(x) = x² + 2x − 1, P2(x) = 6x − 1, Q1(x) = x² + 7x + 1 and Q2(x) = 3x + 1.
Since the degree of the final polynomial is 3, we need to evaluate each of these
polynomials at n = 4 distinct points. Choosing the set S = {−2, −1, 1, 2}, and
assuming that the coefficients of the final polynomial belong to the set {−128, . . .,
+128}, we can perform all calculations in GF(257). Evaluating P1(x), P2(x), Q1(x),
Q2(x) at each point in S gives u1 = {−1, −2, 2, 7}, v1 = {−13, −7, 5, 11}, u2 = {−9,
−5, 9, 19} and v2 = {−5, −2, 4, 7}. Thus, the component-wise results can be
computed as w1 = {13, 14, 10, 77} and w2 = {45, 10, 36, −124}. (Note that the last
entry in w2, 133, is rewritten mod 257 as −124.) Adding w1 and w2, we obtain the
result w = w1 + w2 = {58, 24, 46, −47}. Next, using an interpolation algorithm
(Newton's divided differences [29]), we obtain the final polynomial as R(x) =
2 + 2x + 33x² + 9x³. Substituting x = 8, the final answer is found to be 6738, which
can be verified by straightforward computation.
can be found to be true by straightforward computation. Note that by adding extra
channels, it is possible to detect and correct errors [28]. Error detection is achieved
simply by computing the polynomials at n + 2 points. Error correction can be
achieved by computing at n + 3 points. Evidently, the condition for a fault to be
detected is that the highest degree term of the result R(x) is non-zero.
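The whole procedure can be sketched as follows; the four polynomial stand-ins are reconstructions consistent with the evaluations quoted in the example (the book's own display is not reproduced here), and Lagrange interpolation is used in place of Newton's divided differences:

```python
M, S = 257, [-2, -1, 1, 2]
# polynomial stand-ins (low-order first) for 79, 47, 121, 25 at x = 8
P1, P2 = [-1, 2, 1], [-1, 6]
Q1, Q2 = [1, 7, 1], [1, 3]

def ev(p, x):
    return sum(c * x**k for k, c in enumerate(p))

# independent channels: one modular multiply-add per evaluation point
w = [(ev(P1, s) * ev(P2, s) + ev(Q1, s) * ev(Q2, s)) % M for s in S]

def polymul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % M
    return out

def lagrange(xs, ys):
    """Coefficients (mod M, low-order first) of the interpolating polynomial."""
    n = len(xs)
    coeffs = [0] * n
    for i in range(n):
        basis, den = [1], 1
        for j in range(n):
            if j != i:
                basis = polymul(basis, [-xs[j], 1])
                den = den * (xs[i] - xs[j]) % M
        scale = ys[i] * pow(den, -1, M) % M
        coeffs = [(c + scale * b) % M for c, b in zip(coeffs, basis)]
    return coeffs

R = [c - M if c > M // 2 else c for c in lagrange(S, w)]  # symmetric digits
print(R, ev(R, 8))  # [2, 2, 33, 9] 6738
```

Mapping the interpolated coefficients back into the symmetric range [−128, 128] before evaluating at x = 8 is what recovers the integer result from GF(257).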
As an illustration, consider the computation of the product of the two polynomials
P(x) = 1 − 2x + 3x² and Q(x) = 2 − x. Since the result R(x) is a polynomial of degree 3,
we need to consider 4 + 2 points to correct a single error. Considering the set
S = {−4, −2, −1, 1, 2, 4} and evaluating P and Q at these points, we get
u = (57, 17, 6, 2, 9, 41), v = (6, 4, 3, 1, 0, −2) and w = (85, 68, 18, 2, 0, −82).
Performing interpolation, we obtain R(x) = 2 − 5x + 8x² − 3x³ + 0·x⁴ + 0·x⁵,
which is the correct result. Let us consider that one error has occurred on channel 2,
so that the computed result is w = (85, 71, 18, 2, 0, −82). We can independently
eliminate each of the channels in turn and compute the corresponding polynomials
R(x). The only polynomial of third degree is 2 − 5x + 8x² − 3x³, obtained by
removing the second channel.
MRRNS can be extended to polynomials in two indeterminates as well [28]. With
two indeterminates, the polynomials need to be evaluated at (m + 1)(n + 1)
points for polynomials of degree m in x and degree n in y. Note that addition,
subtraction and multiplication can be carried out in component-wise fashion.
However, the two-dimensional interpolation is carried out in two steps: in the
first step, (m + 1) 1D interpolations in the y direction are performed and next, (n + 1)
1D interpolations in the x direction are performed.
As an illustration, consider P(x, y) = 2x² − 2xy − y² + 1 and Q(x, y) = x²y + y² − 1.
Considering the set of points {−1, 0, 1, 2} for x and y, and keeping x fixed, P(x, y) can
be evaluated for all values of y; similarly, keeping y fixed, P(x, y) can be
evaluated for the values of x. Performing a similar computation for Q(x, y),
addition, multiplication and subtraction operations can then be performed
component-wise on the resulting 4 × 4 matrices. In the case of errors, the degrees of
the polynomials obtained for a chosen y will be higher than 2; in a similar manner, for
a chosen x, the degrees of the polynomials obtained in y will be higher than 2, thus
exposing the error. For error correction, the offending row and column of the matrix
can be removed and interpolation carried out to find the correct result, in this case
the sum 2x² + x²y − 2xy.
The authors have also described a symmetric MRRNS (SMRRNS) [28, 30], in
which case the input values are mapped into even polynomials. Then S shall be
chosen as powers of two such that x_i ≠ ±x_j. As an illustration, consider
finding 324 × 179. Here P(x) and Q(x) can be written as P(x) = 5x² + 4 and Q(x) =
3x² − 13, choosing x = 8 as the indeterminate. Since R(x) is of degree 4, we need at
least 5 points. Choosing the set S = {−8, −4, −2, −1, 0} and performing the computations
over GF(257), we have u = {67, 84, 24, 9, 4} and v = {−78, 35, −1, −10, −13}, and
thus we have w = {−86, 113, −24, −90, −52}. Interpolating with respect to
S yields the result R(x) = 15x⁴ − 53x² − 52 and, for x = 8, we obtain R(8) =
57,996. Note that error detection and error correction can be carried out using a
different technique. Consider that due to an error w has changed to {−86, 120, −24,
−90, −52}. We need to virtually extend the output vector to {−86, 120, −24, −90,
−52, −90, −24, 120, −86}. Next, taking S as S = {−8, −4, −2, −1, 0, 1, 2, 4,
8} and interpolating, we obtain an eighth-order polynomial 109x⁸ − 68x⁶ + 122x⁴
+ 56x² − 52. Since we do not know where the error has occurred, by removing the two
values corresponding to each location in turn, we can find that the error is in the second
position and that the answer is R(x) = 15x⁴ − 53x² − 52.
MRRNS has been applied to realize fault-tolerant complex filters [31–34]. These
use QRNS together with MRRNS techniques. As an illustration, consider the compu-
tation of the product of the two complex numbers a = 237 − j225 and b = 162 + j211. We
illustrate the technique using the three-moduli set {13, 17, 29}. The dynamic range
is evidently M = 6409. For the modulus 13, for instance, the elements (resi-
dues) can be considered to lie in the interval [−6, 6]. Choosing an indeterminate
x = 8, the given numbers 237, −225, 162 and 211 can be written as the polynomials
3x² + 5x + 5, −3x² − 4x − 1, 2x² + 4x + 2 and 3x² + 2x + 3, respectively. We next
convert the coefficients to QRNS form, noting that j1 = 5 for m1 = 13, j2 = 4 for
m2 = 17 and j3 = 12 for m3 = 29. We choose the inner product polynomial to be of
fourth degree and choose the 5 points x = −2, −1, 0, 1, 2 at which we evaluate the poly-
nomials. Note that the procedure is the same as in the previous case. After the inner
product computation by element-wise multiplication, followed by interpolation, we get
the values in QRNS form corresponding to each modulus. These can be converted back
into normal form and, using CRT for the moduli set {13, 17, 29}, the real and
imaginary parts can be obtained. Note that in the case of an inner product of N terms,
considering that the a_i and b_i values are bounded by −(2^γ − 1) and (2^γ − 1), M shall
satisfy the condition M > 4N(2^γ − 1)², where M is the product of all the moduli. Note
that fault tolerance can be achieved by having two more channels for fault detection
and correction [31, 33, 34].
Radhakrishnan et al. [35] have described the realization of fault-tolerant adaptive
filters based on a hybrid combination of Fermat number transform block processing
and MRRNS. These are immune to transient errors. Note that each input sample and
tap weight is interpreted as a polynomial, and these polynomials are evaluated at the
roots. The transformation matrix of the Fermat number transform (FNT) is applied to
the resulting matrices corresponding to the input samples and weights. Next, the
elements of these matrices are multiplied element-wise and converted back using the
interpolation formula. Fault tolerance can be achieved by evaluating the polynomials
at two additional points, as explained before.
8.5 Logarithmic Residue Number Systems

Preethy and Radhakrishnan [36] have suggested using RNS for realizing logarith-
mic adders. They have suggested using multiple bases; the ring elements can be
considered as products of powers of the different bases. The multiple-base logarithm is
defined for X = b1^{α1} · b2^{α2} · · · bm^{αm} as
where m > 1, the α_i are exponents, and lm stands for the multiple-base logarithm. All
algebraic properties obeyed by normal logarithms apply to the multiple-base
logarithm also. This implies that in the case of a prime field GF(p), index calculus
can be used for the purpose of multiplication. We can exploit properties of RNS
together with those of finite fields and rings so that the LUTs can be reduced to a
small size. The index α of the sum of two non-zero integers X = |g^{αx}|_p and
Y = |g^{αy}|_p, reduced mod (p − 1), is |αx + αf|_{p−1}, where
αf = log_g(|1 + g^{|αy − αx|_{p−1}}|_p).
As an example, for the modulus 31 the base can be 3; thus, indices exist for all
elements of GF(31). Note that X + Y can be found from the indices. As an example,
for X = 15 and Y = 4 we have the indices αx = 21 and αy = 18, meaning
3²¹ mod 31 = 15 and 3¹⁸ mod 31 = 4. Thus, the index of the result (X + Y) is
m = |21 + log₃(1 + 3^{|18 − 21|_30})|_30 = 4. Taking the anti-logarithm, we get the
result as 3⁴ mod 31 = 19.
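The index-based addition just illustrated can be sketched as follows (a toy table-based model of GF(31) with base 3, where the full discrete-log dictionary stands in for the hardware LUTs):

```python
# Addition via indices ("index calculus") in GF(31), base g = 3.
p, g = 31, 3
LOG = {pow(g, k, p): k for k in range(p - 1)}   # discrete-log table for GF(31)*

def add_via_indices(ax, ay):
    """Index of X + Y, given indices ax, ay of non-zero X, Y (X + Y != 0)."""
    af = LOG[(1 + pow(g, (ay - ax) % (p - 1), p)) % p]
    return (ax + af) % (p - 1)

ax, ay = LOG[15], LOG[4]          # 21 and 18
m = add_via_indices(ax, ay)       # 4
print(m, pow(g, m, p))            # 4 19
```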
In the case of a modulus of the form p^m, e.g. 3³, we can express the given integer
as g^α·p^β. Thus, the 27 elements can be represented by the indices (α, β) using the
bases g = 2, p = 3. As an example, the integer 7 corresponds to (16, 0), whereas the
integer 9 corresponds to (0, 2). In this case, the result of the addition of
X = |g^{αx}·p^{βx}|_{p^m} and Y = |g^{αy}·p^{βy}|_{p^m} is also of the same form (α, β):

(α, β) = (|αx + αf|_{φ(p^m)}, βx + βf)  for βy ≥ βx
(α, β) = (|αy + αf|_{φ(p^m)}, βy + βf)  otherwise    (8.11a)

where

(αf, βf) = lm_{(g,p)}(|1 + g^{(−1)^s·|αy − αx|_{φ(p^m)}} · p^{(−1)^s·(βy − βx)}|_{p^m})    (8.11b)

with s = 0 for βy ≥ βx and s = 1 otherwise. Note that φ(z) is the number of integers
less than z and prime to it.
The authors have also considered the case GF(2^m), where the given integer X can
be expressed as |2^α · 5^β · (−1)^γ|_{2^m}. As an example, in GF(2⁵), 7 can be expressed as
(α, β, γ) = (0, 2, 1): check that 5² mod 32 = 25, and the negative sign makes it
−25 ≡ 7 mod 32. In this case also, the computation of (αf, βf, γf) can be easily
carried out. The reader may refer to [36] for more information.
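The representation claimed in the example is quickly verified:

```python
# Check of the multiple-base representation of 7 in GF(2^5):
# 7 = |2^0 * 5^2 * (-1)^1| mod 32
print((2**0 * 5**2 * (-1)) % 32)  # 7
```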
In the case of GF(p), it has been pointed out [37] that memory can be reduced by
storing only certain indices, the rest being obtained by shifting or addition.
For p = 31, we need to store the indices only for 2, 5, 7, 11, 13, 17, 19, 23 and 29. For
example, for the integer 28, we can obtain the logarithm as log₃28 = log₃(2² × 7)
= 2log₃2 + log₃7.
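A quick check of this factorization trick (a full log table stands in here for the compressed prime-only storage):

```python
# log3(28) = 2*log3(2) + log3(7) holds modulo the group order p - 1 = 30.
p, g = 31, 3
LOG = {pow(g, k, p): k for k in range(p - 1)}
print((2 * LOG[2] + LOG[7]) % (p - 1), LOG[28])  # both give the same index
```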
The residue logarithmic number system (RLNS) [38] represents real values as
quantized logarithms which are in turn represented in RNS. This leads to faster
multiplication and division, using table look-ups. There is no overflow detection
possibility in RLNS; hence, Xmax and Xmin shall be well within the dynamic
range of RLNS. RLNS addition is difficult, whereas multiplication is faster than in
any other system. The reader is referred to [38] for more information.
References
1. W.K. Jenkins, J.V. Krogmeier, The design of dual-mode complex signal processors based on
Quadratic modular number codes. IEEE Trans. Circuits Syst. 34, 354–364 (1987)
2. G.A. Jullien, R. Krishnan, W.C. Miller, Complex digital signal processing over finite rings.
IEEE Trans. Circuits Syst. 34, 365–377 (1987)
3. M.A. Soderstrand, G.D. Poe, Application of quadratic-like complex residue number systems to
ultrasonics, in IEEE International Conference on ASSP, vol. 2, pp. 28.A5.1–28.A5.4 (1984)
4. R. Krishnan, G.A. Jullien, W.C. Miller, Complex digital signal processing using quadratic
residue number systems. IEEE Trans. ASSP 34, 166–177 (1986)
5. V. Paliouras, T. Stouraitis, Novel high radix residue number system architectures. IEEE Trans.
Circuits Syst. II 47, 1059–1073 (2000)
6. V. Paliouras, T. Stouraitis, Novel high-radix residue number system multipliers and adders, in
Proceedings of ISCAS, pp. 451–454 (1999)
7. I. Kouretas, V. Paliouras, A low-complexity high-radix RNS multiplier. IEEE Trans. Circuits
Syst. Regul. Pap. 56, 2449–2462 (2009)
8. I. Kouretas, V. Paliouras, High radix redundant circuits for RNS moduli r^n−1, r^n and r^n+1, in Proceedings of IEEE ISCAS, vol. V, pp. 229–232 (2003)
9. I. Kouretas, V. Paliouras, High-radix r^n−1 modulo multipliers and adders, in Proceedings of 9th IEEE International Conference on Electronics, Circuits and Systems, vol. II, pp. 561–564 (2002)
10. M. Abdallah, A. Skavantzos, On multi-moduli residue number systems with moduli of the form r^a, r^b−1 and r^c+1. IEEE Trans. Circuits Syst. 52, 1253–1266 (2005)
11. A. Skavantzos, F.J. Taylor, On the polynomial residue number system. IEEE Trans. Signal
Process. 39, 376–382 (1991)
12. A. Skavantzos, T. Stouraitis, Polynomial residue complex signal processing. IEEE Trans.
Circuits Syst. 40, 342–344 (1993)
13. M.C. Yang, J.L. Wu, A new interpretation of “Polynomial Residue Number System”. IEEE
Trans. Signal Process. 42, 2190–2191 (1994)
14. V. Paliouras, A. Skavantzos, T. Stouraitis, Multi-voltage low power convolvers using the
Polynomial Residue Number System, in Proceedings 12th ACM Great Lakes Symposium on
VLSI, pp. 7–11 (2002)
15. M. Abdallah, A. Skavantzos, The multipolynomial Channel Polynomial Residue Arithmetic
System. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 46, 165–171 (1999)
16. V. Paliouras, A. Skavantzos, Novel forward and inverse PRNS converters of reduced compu-
tational complexity, in 36th Asilomar Conference on Signals, Systems and Computers,
pp. 1603–1607 (2002)
17. H.C. Shyu, T.K. Truong, I.S. Reed, A complex integer multiplier using the quadratic-polynomial residue number system with numbers of form 2^(2n)+1. IEEE Trans. Comput. C-36, 1255–1258 (1987)
Further Reading
J.H. Cozzens, L.A. Finkelstein, Computing the discrete Fourier transform using residue number systems in a ring of algebraic integers. IEEE Trans. Inf. Theory 31, 580–588 (1985)
H.K. Garg, F.V.C. Mendis, On fault-tolerant Polynomial Residue Number systems, in Conference
Record of the 31st Asilomar Conference on Signals, Systems and Computers, pp. 206–209
(1997)
G.A. Jullien, W. Luo, N.M. Wigley, High throughput VLSI DSP using replicated finite rings,
J. VLSI Signal Process. 14(2), 207–220 (1996)
J.B. Martens, M.C. Vanwormhoudt, Convolutions of long integer sequences by means of number
theoretic transforms over residue class polynomial rings. IEEE Trans. Acoust. Speech Signal
Process. 31, 1125–1134 (1983)
J.D. Mellott, J.C. Smith, F.J. Taylor, The Gauss machine: a Galois enhanced Quadratic residue
Number system Systolic array, in Proceedings of 11th Symposium on Computer Arithmetic,
pp. 156–162 (1993)
M. Shahkarami, G.A. Jullien, R. Muscedere, B. Li, W.C. Miller, General purpose FIR filter arrays
using optimized redundancy over direct product polynomial rings, in 32nd Asilomar Confer-
ence on Signals, Systems & Computers, vol. 2, pp. 1209–1213 (1998)
N. Wigley, G.A. Jullien, W.C. Miller, The modular replication RNS (MRRNS): a comparative
study, in Proceedings of 24th Asilomar Conference on Signals, Systems and Computers,
pp. 836–840 (1990)
N. Wigley, G.A. Jullien, D. Reaume, W.C. Miller, Small moduli replications in the MRRNS, in
Proceedings of the 10th IEEE Symposium on Computer Arithmetic, Grenoble, France, June
26–28, pp. 92–99 (1991)
G.S. Zelniker, F.J. Taylor, Prime blocklength discrete Fourier transforms using the Polynomial
Residue Number System, in 24th Asilomar Conference on Signals, Systems and Computers,
pp. 314–318 (1990)
G.S. Zelniker, F.J. Taylor, On the reduction in multiplicative complexity achieved by the Poly-
nomial Residue Number System. IEEE Trans. Signal Process. 40, 2318–2320 (1992)
Chapter 9
Applications of RNS in Signal Processing
Several applications of RNS for realizing FIR filters, digital signal processors and digital communication systems have been described in the literature. In this chapter, these will be reviewed.
FIR (Finite Impulse Response) filters based on ROM-based multipliers using RNS have been described by Jenkins and Leon [1]. The coefficients of the L-tap FIR filter and input samples are in RNS form, and the multiplications and accumulations needed for the FIR filter operation

y(n) = Σ_{k=0}^{L−1} h(k) x(n − k)    (9.1)
are carried out in RNS for all the j moduli. In order to avoid overflow, the dynamic range of the RNS shall be chosen to be greater than the worst-case weighted sum of the products of the given FIR filter coefficients and the maximum amplitudes of the input samples. Note that each modulus channel has a multiplier based on combinational logic or ROM to compute h(k) x(n − k) and an accumulator mod m_i to find (y(n))_{m_i}. After the accumulation in L steps following (9.1), the result, which is in residue form (y(n))_{m_i}, needs to be converted into binary form using an RNS-to-binary converter following one of several methods outlined in Chapter 5.
Jenkins and Leon also suggest that instead of weighting the input samples by the coefficients in RNS form, the coefficients h(k) can be multiplied by |1/M_i|_{m_i}, where M_i = M/m_i, and stored. These modified coefficients h(k) can be multiplied by the
© Springer International Publishing Switzerland 2016 195
P.V. Ananda Mohan, Residue Number Systems, DOI 10.1007/978-3-319-41385-3_9
samples and accumulated so that CRT can be easily performed by multiplying the
final accumulated residues ðyðnÞÞmi corresponding to modulus channel mi with Mi
and summing all the results mod M to obtain the final result.
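This pre-scaling trick can be sketched as follows, using the moduli set {19, 23, 29, 31} and an illustrative two-tap filter (function names are ours):

```python
# Pre-scaled-coefficient FIR in RNS: each channel's coefficients are
# premultiplied by |1/M_i|_{m_i}, so the final CRT needs only a
# multiplication by M_i per channel and one mod-M summation.
from math import prod

moduli = (19, 23, 29, 31)
M = prod(moduli)
Mi = [M // m for m in moduli]
inv = [pow(Mi[i], -1, m) for i, m in enumerate(moduli)]   # |1/M_i|_{m_i}

def fir_rns(h, x):
    res = []
    for i, m in enumerate(moduli):
        hp = [(hk * inv[i]) % m for hk in h]               # modified coefficients
        res.append(sum(hp[k] * x[k] % m for k in range(len(h))) % m)
    return sum(r * Mi[i] for i, r in enumerate(res)) % M   # easy CRT

h = [127, -61]      # h(0), h(1) of a toy two-tap filter
x = [30, 97]        # x(n), x(n-1)
y = fir_rns(h, x)   # equals (127*30 - 61*97) mod M
```

Per channel the accumulated residue is |y/M_i|_{m_i}, so the closing sum Σ r_i M_i mod M is exactly the CRT reconstruction.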
Note that instead of using multipliers, a bit-slice approach [2] can be used. We consider all moduli of length b bits without loss of generality. In this method, the MSBs of all the final accumulated residues r_i = (y(n))_{m_i} for i = 0, 1, . . ., j − 1 address a ROM to get the intermediate word Σ_{i=0}^{j−1} r_{i,b−1} M_i. Next, this is multiplied by 2, and the intermediate word corresponding to the next bits (i.e. bit b − 2, and so on down to bit 0) of all the residues is obtained from ROM and accumulated. This process continues b times, and finally a modulo-M reduction needs to be carried out, where M is the product of the moduli.
As an illustration, consider the realization of a first-order filter y(n) = h₀x(n) + h₁x(n − 1) using the RNS {19, 23, 29, 31}. The coefficients are h₀ = 127 = (13, 12, 11, 3) and h₁ = −61 = (15, 8, 26, 1). Consider the input samples u(n) = 30 = (11, 7, 1, 30) and u(n − 1) = 97 = (2, 5, 10, 4). The multiplicative inverses needed in CRT are |1/M₁|_{m₁} = 4, |1/M₂|_{m₂} = 20, |1/M₃|_{m₃} = 22, |1/M₄|_{m₄} = 5. Note also that M = 392,863, M₁ = 23 × 29 × 31 = 20,677, M₂ = 19 × 29 × 31 = 17,081, M₃ = 19 × 23 × 31 = 13,547 and M₄ = 19 × 23 × 29 = 12,673. Multiplying the coefficients by the multiplicative inverses, we have the modified coefficients h′₀ = (14, 10, 10, 15) and h′₁ = (3, 22, 21, 5). Weighting these with the input samples and summing yields y(n) = (8, 19, 17, 5), which in binary is 01000, 10011, 10001, 00101. The MSBs are 0, 1, 1, 0, which when weighted with M₁, M₂, M₃, M₄, respectively, yield 30,628. Doubling and adding the word corresponding to the next significant bits (1, 0, 0, 0) yields 2 × 30,628 + 20,677 = 81,933. Repeating this for the remaining bit positions gives ((81,933 × 2 + 12,673) × 2 + 17,081) × 2 + 43,301 = 783,619, and 783,619 mod 392,863 = 390,756, which agrees with 127 × 30 − 61 × 97 = −2107 ≡ 390,756 mod 392,863.
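The bit-slice conversion for this example can be sketched as follows (the arithmetic is recomputed here; the shift-and-add accumulation reaches 783,619 and, after the mod-M reduction, 390,756 = (127·30 − 61·97) mod 392,863):

```python
# Bit-slice RNS-to-binary conversion [2] for the worked example: at each bit
# position the "ROM word" is the sum of M_i over channels whose bit is 1.
from math import prod

moduli = (19, 23, 29, 31)
M = prod(moduli)                  # 392,863
Mi = [M // m for m in moduli]     # 20677, 17081, 13547, 12673
res = (8, 19, 17, 5)              # scaled residues of y(n) from the example
b = 5                             # channel word length in bits

acc = 0
for bit in range(b - 1, -1, -1):  # MSB first
    word = sum(Mi[i] for i, r in enumerate(res) if (r >> bit) & 1)
    acc = 2 * acc + word          # shift-and-add; `word` comes from a ROM
y = acc % M                       # final modulo-M reduction
```

Only one modulo-M reduction is needed, at the very end, which is what makes the scheme attractive in hardware.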
Vergos [3] has described a 200 MHz RNS core which can perform FIR filtering using the moduli set {2^32 − 1, 2^32, 2^32 + 1}. The architecture is shown in Figure 9.1. It has a front-end binary-to-RNS converter (residue generator) for the channels 2^32 − 1 and 2^32 + 1. The modulo 2^n and modulo (2^n − 1) channel blocks perform modulo multiplication and modulo addition. The authors have used the Kalampoukas et al. [4] modulo (2^32 − 1) adder, and a Piestrak architecture based on a CSA followed by a modulo adder for the modulo (2^n + 1) converter [5], modified to give the residue in diminished-1 form. The modulo (2^32 − 1) multiplier is six-stage pipelined. For the multiplier mod (2^n + 1), they have used a Booth-algorithm-based architecture with Wallace-tree-based partial-product reduction. The modulo (2^n + 1) multiplier is based on diminished-1 arithmetic due to Zimmermann [6] and is also six-stage pipelined. Considering input conversion speed, the execution frequency is 50 MHz, whereas without considering input conversion it is 200 MHz.
9.1 FIR Filters
Figure 9.1 Architecture of the RNS core: modulo channels with multiplexers, sequencing logic and output buffers (adapted from [3] with permission from H.T. Vergos)
Figure 9.3 RNS FIR filters in (a) direct form and (b) transpose form (adapted from [8] ©IEEE2007)
Figure 9.4 Adaptive filter hybrid architecture (adapted from [11] ©IEEE2006)
A 192-tap, 32-bit dynamic-range filter using the moduli set {5, 7, 11, 13, 17, 19, 23, 128} has also been reported. At a clock frequency of 200 MHz, when compared with TCS, area and power dissipation savings of 50 % have been demonstrated.
Conway and Nelson [13] suggested FIR filter realization using powers-of-two-related moduli sets with moduli of the form 2^n − 1, 2^n, 2^n + 1. These moduli can be selected to minimize area and critical path. The multipliers make use of periodic properties of the moduli, following the Wrzyszcz and Milford multiplier for modulus (2^n + 1) [14] and multipliers for mod 2^n and 2^n − 1. Note, however, that the transpose structure has been used, and the modulo reduction is not performed; instead, product bits and the carry and sum outputs from the previous transpose stage are added using a CSA tree. Using cost details such as the number of full-adders, flip-flops, area and time, appropriate moduli sets have been selected by exhaustive search. The authors have shown that these designs are attractive over designs using the three-moduli set {2^n − 1, 2^n, 2^n + 1}.
As an illustration, for a 24-bit dynamic range, the moduli sets {5, 7, 17, 31, 32, 33} and {255, 256, 257} have areas of 280 and 417 units and delays of 5 and 7 units, respectively. They have shown that a gain in area-delay product of 35–60 % could be achieved for 16-tap filters with dynamic ranges of 20–40 bits.
Wang [15] has described a bit-serial VLSI implementation of an RNS digital N-tap FIR filter using a linear array of N cells. The output sequence of an N-tap FIR filter is given as

y_n = Σ_{i=0}^{N−1} a_i x_{n−i},  n = 0, 1, 2, . . .    (9.2a)

with

S^b_{nj} = |Σ_{i=0}^{N−1} a_{ij} x^b_{n−i,j}|_{m_j}    (9.2c)

where j stands for modulus m_j and the superscript b indicates the bth bit of the binary representation of x_{n−i,j} and a_{ij}. Note that S^b_{nj} can be computed recursively as

T^b_{n,j}(i) = T^b_{n,j}(i − 1) + x^b_{n−i,j} a′_{ij} + x^b_{n−i,j} C̄^b_{n,j}(i) m_j  for i = 0, . . ., N − 1

where C̄^b_{n,j} is the complement of the carry generated in adding the first two terms. Note that a′_i = a_i + m_j. In the processor cell, old samples are multiplied by a_i and added with the latest sample weighted by a_{i−1}. The FIR filter architecture is shown in Figure 9.5a, in which the bits of x enter serially in word-serial bit-serial form to the Nth cell, are delayed by B + 2 clock cycles using the delay blocks D1, and move to the next cell. The extra clock cycle is used for clearing the accumulator
Figure 9.5 (a) Basic cell, (b) a hybrid VLSI architecture using (a) for an RNS FIR sub-filter and (c) alternative FIR filter architecture (adapted from [15] ©IEEE1994)
addressing the ROM, which will perform a modulo shift-and-add operation so that fresh evaluation of y_n can commence. The cell architecture is shown in Figure 9.5b. Note that the multiplication function is realized by AND gates. The cells contain a′_i = a_i mod m_i. The cell computes |α_j + x β_j| mod m_j as α_j + x β_j + r_j + x c̄ m_j, where c̄ is the complement of the carry generated by adding the first two terms and r_j = 2^B − m_j; the B LSBs of the adder output give the result.
In an alternative architecture shown in Figure 9.5c, the input enters at the top. Note that this structure has unidirectional data flow, unlike that in Figure 9.5a. The second set of latches in Figure 9.5b can be removed. The reader is urged to refer to [15] regarding the detailed timing requirements for both structures of Figure 9.5a, c.
Lamacchia and Redinbo [16] have described RNS digital filtering for wafer-scale integration. They have used index-calculus-based multipliers, and modulo addition is performed using conventional binary adders. They observe that RNS is well suited for wafer-scale integration, since parallel architectures can be gainfully employed.
Bajard et al. [17] have considered implementation of digital filters using RNS. They observe that the ρ-direct form transpose filter structure has the advantage of needing small coefficient word lengths. A typical ρ-direct form block is shown in Figure 9.6a, wherein the delay in the transpose-form structure is replaced by a lossy
Figure 9.6 (a) Generalized ρ-direct form II structure, (b) realization of operator ρ_i^−1 (adapted from [17] ©IEEE2011)
integrator (see Figure 9.6b). Note that the parameters Δi, γi can be chosen so as to minimize the transfer-function sensitivity and round-off noise. They observe that 5-bit coefficients suffice for realizing a sixth-order Butterworth filter, whereas for a direct form I realization, 15-bit coefficients are required. The popular three-moduli set {2^n − 1, 2^n, 2^n + 1} has been used. The ρ-direct form needs 10-bit variables, 5-bit coefficients and 15-bit adders. The RNS used was {31, 32, 33}. Multipliers were implemented using LUTs. A conventional direct form I filter using the moduli set {4095, 4096, 4097} has also been designed. They observe that FPGA-based designs for IIR filters appear not to be attractive for RNS applications over fixed-point designs. The considerations for FPGAs are different from those for ASICs, since in FPGAs fast carry chains make the adders efficient, and ripple-carry adders are quite fast and compact. The authors conclude that ρ-direct form designs are superior to direct-form realizations in both the fixed-point and RNS cases.
Patronik et al. [18] have described design techniques for fast and energy-efficient constant-coefficient FIR filters using RNS. They consider several aspects: coefficient representation, techniques for sharing sub-expressions in the multiplier block (MB), and optimized usage of RNS in the hardware design of the MB and the accumulation pipeline. A common sub-expression elimination (CSE) technique has been used for synthesis of RNS-based filters. Two's-complement arithmetic has been used. Four- and five-moduli RNS have been considered.
Multiple constant multiplications (MCM) need to be performed in the transpose FIR filter structure (see Figure 9.7), where the MB block is shown in dotted lines. The constant coefficients can be represented in canonical signed-digit (CSD) representation, wherein a minimum number of non-zero digits exists, since non-zero strings are substituted using the digits 1 and −1 (written 1̄), and remaining pairs of the type 1 1̄ are replaced by 0 1. The resulting words obey all the mod (2^n ± 1) operations. As an illustration, 27 can be written as 011011 = 101̄101̄ (SD) = 1001̄01̄ (CSD). Moreover, using the periodicity property, we can write (27)_31 as −000100 = −4.
The authors next use the level-constrained CSE (LCCSE) algorithm of [19] to compute modular MCMs. The coefficients are decomposed into shifts n1 and n2 of two values d1 and d2, written as c_k = d1·2^n1 ± d2·2^n2 as desired. The authors modify this technique by specifying the bases b_i, choosing k such that the values b_i = |2^k c_i|_{2^n − 1} or |2^k c_i|_{2^n + 1} are minimized. These are next decomposed so as to take into account their modular interpretation
Figure 9.7 Transpose-form FIR filter with multiplier block (coefficients C1 . . . CN, delays z^−1)
c_k = d1·2^n1 ± d2·2^n2 mod (2^n − 1) or c_k = d1·2^n1 ± d2·2^n2 mod (2^n + 1). The coefficients can share one base or use different bases. Given the bases, any coefficient can be obtained by bit rotation and inversion. As an illustration, coefficients 5 and 29 have the same base: 29 = −(5 × 2^5) mod 63. This step may result in a very compact CSD form.
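The rotation-and-inversion property can be sketched as follows (function name is ours): multiplying by 2^k mod (2^n − 1) is a k-bit left rotation of the n-bit word, and negation mod (2^n − 1) is bitwise complement:

```python
# Base sharing mod (2^n - 1): base * 2^k mod (2^n - 1) is a left rotation
# of the n-bit word by k positions; negation is one's complement.
def rot_mul(base, k, n):
    """Compute base * 2^k mod (2^n - 1) by rotation."""
    mask = (1 << n) - 1
    return ((base << k) | (base >> (n - k))) & mask if k % n else base & mask

c = rot_mul(5, 5, 6)        # 5 * 32 mod 63 = 34
c_inverted = 63 - c         # -(5 * 2^5) mod 63 = 29, sharing the base 5
```

So the single stored base 5 yields both 5 and 29 with no extra adders.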
Carry-save-adder stages together with carry-propagate adders need to be used to realize the multiplier block. In the optimization of this multiplier block, the output of any adder is the result of multiplication by some constant. These intermediate results are denoted as “fundamentals”. Fundamentals of the type (2^i ± 2^j) x_k are created by simply assigning (f_c, f_s) = (2^i x_k, 2^j x_k). On the other hand, fundamentals of the form 2^i x_k ± a x_k, where a x_k is a fundamental in CS form, can be added using a CSA layer. Next, fundamentals of the type (a x_k + b x_k), where both are in CS form, need the addition of two levels of CSA. The modulo (2^n + 1) adders are slightly more complicated. The authors have used the mod (2^n + 1) adder due to Patel et al. [20]. Note that in the filter pipeline, each stage adds two values, one from the product pipeline and another from the previous pipeline stage. These can be reduced mod (2^n − 1) or mod (2^n + 1) therein. However, the carry bit is kept separate, so that for the two moduli the sums are
z_j = (z_{j,n−1}, . . ., z_{j,0}) + (u_{j,n−1}, . . ., u_{j,0}) + z_{j,n} mod (2^n − 1)    (9.3a)

and

z_j = (z_{j,n−1}, . . ., z_{j,0}) + (u_{j,n−1}, . . ., u_{j,0}) + z_{j,n} − 2 + u_{j,n} mod (2^n + 1)    (9.3b)
where u_j = x_k c_j. Note that z_j is the accumulated filter value and u_j is the output of the multiplier block. An n-bit adder with carry-in can be used for computing (9.3a), whereas a CSA is used to compute (9.3b) to add two n-bit vectors and two bits, except for the −2 term. The constant term −2 can be added to c_0. The authors have shown that for the benchmark filters [21], using four- or five-moduli sets, RNS designs have better area and power efficiency than TCS designs.
Garcia, Meyer-Baese and Taylor [22] have described implementation of cascade integrator comb (CIC) filters (also known as Hogenauer filters) using RNS. The architecture shown in Figure 9.8a is a three-stage CIC filter consisting of a three-stage integrator (blocks labeled I), a sampling-rate reduction by R and a three-stage comb (blocks labeled C). The realized transfer function is given as

H(z) = ((1 − z^−RD)/(1 − z^−1))^S    (9.4)

where S is the number of stages, R is the decimation ratio and D is the delay of the comb response. As an illustration, for D = 2, R = 32, we have RD = 64. The maximum value of the transfer function occurs at dc and is (RD)^S. For RD = 64 and S = 3, an 8-bit input leads to an internal word length of 8 + 3 log2(64) = 26 bits. The output word length can, however, be small. It has been shown that the
Figure 9.8 (a) CIC filter, (b) detailed design with base removal scaling (BRS) and (c) BRS and ε-CRT conversion steps (adapted from [22] ©IEEE1998)
lower significant bits in the early stages can be dropped without sacrificing system performance. The architecture of Figure 9.8a has been implemented using an RNS based on the moduli set {256, 63, 61, 59}. The output scaling has been implemented using ε-CRT in one technique, which needs 8 tables and 3 two's-complement adders. In another technique, the base removal scaling (BRS) procedure based on two 6-bit moduli (using MRC) and ε-CRT scaling [23] of the remaining two moduli is illustrated in Figure 9.8b. The scaling architecture is as shown in Figure 9.8c and needs 9 ROMs and 5 modulo adders. The authors have shown an increase in speed over designs using ε-CRT only.
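The word-length growth quoted for the CIC example can be checked with a short calculation (function name is ours):

```python
# Internal word growth of an S-stage CIC decimator: the dc gain is (R*D)^S,
# so the internal width is B_in + ceil(S * log2(R*D)) bits.
from math import ceil, log2

def cic_internal_bits(b_in, s, r, d):
    return b_in + ceil(s * log2(r * d))

bits = cic_internal_bits(8, 3, 32, 2)   # values from the text: 8 + 3*6 = 26
gain = (32 * 2) ** 3                    # dc gain (RD)^S
```

This is why the RNS dynamic range (here {256, 63, 61, 59}, about 26 bits) must cover the full integrator growth even though the output word is short.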
Cardarilli et al. [24] have described fast IIR filters using RNS. The recursion equation is first multiplied by N, an integer related to the quantization step. A three-moduli system {128, 15, 17} has been used. In the recursion equation for a given modulus channel, A and B are coefficients, y is a delayed output and u is the input, all scaled up by N, with k the time index; similar expressions can be written for the other two modulo channels. The authors perform the conversion using a double modulo method (DMM). In this technique, two CRTs are used for the moduli sets {m1, m21} and {m1, m22} to yield Y_{m1,m21}, Y_{m1,m22}. Another CRT follows for the moduli set {m1m21, m1m22}. The final result is divided by m1 to get a scaled output.
Jenkins [25] has suggested the use of four different moduli sets which facilitate easy scaling by one of the moduli for designing IIR filters where scaling will be required. These are (a) {m, m − 1}, (b) {2^{n+k} − 1, 2^k}, (c) {m − 1, m, m + 1} and (d) {2^k, 2^k − 1, 2^{k−1} − 1}. The respective scale factors are m − 1, 2^k, (m − 1)(m + 1) and 2^k(2^{k−1} − 1). The results are as follows:

(a) {m, m − 1}: Class I when m = 2^k.
(b) {2^{n+k} − 1, 2^k}: Class II, with

y_s(n) = 2^n (y1(n) − y2(n)) mod (2^{n+k} − 1)    (9.6b)

Note that base extension needs to be done. In the case of four-moduli sets {m1, m2, m3, m4} as well, Jenkins [25] has shown that the scaled answer can be derived by application of CRT as
y_s(n) = y(n)/(m3 m4) = (f1(y1, y2) + f2(y3, y4)) mod m1 m2    (9.7a)

where

f1(y1, y2) = | m2 |1/M1|_{m1} y1 + m1 |1/M2|_{m2} y2 |_{m1 m2}    (9.7b)

f2(y3, y4) = | (m1 m2/m3) |1/M3|_{m3} y3 + (m1 m2/m4) |1/M4|_{m4} y4 |_{m1 m2}    (9.7c)
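The scaled CRT of (9.7) can be sketched with toy moduli {3, 5, 7, 11} (moduli, test value and the floor placement are our illustrative assumptions; the fractional table entries of a hardware realization are modeled here by integer floors, so the estimate carries a small error):

```python
# Scaled CRT per (9.7): y/(m3*m4) is estimated directly from the residues.
from math import prod

m = (3, 5, 7, 11)
M = prod(m)
Mi = [M // mi for mi in m]
inv = [pow(Mi[i], -1, mi) for i, mi in enumerate(m)]    # |1/M_i|_{m_i}

def scaled_crt(res):
    m12 = m[0] * m[1]
    k = [(res[i] * inv[i]) % m[i] for i in range(4)]
    f1 = (m[1] * k[0] + m[0] * k[1]) % m12                      # (9.7b)
    f2 = ((m12 * k[2]) // m[2] + (m12 * k[3]) // m[3]) % m12    # (9.7c), floored
    return (f1 + f2) % m12                                      # (9.7a)

y = 500
approx = scaled_crt(tuple(y % mi for mi in m))   # estimate of y/(7*11) mod 15
exact = y // (m[2] * m[3])
```

For y = 500 the estimate is off by one from ⌊500/77⌋ = 6, illustrating the small scaling error inherent to such table-based schemes.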
Etzel and Jenkins [26] have suggested several other residue classes (moduli sets) amenable for use in recursive filters, facilitating easy scaling and base extension. These are Class V {2^k − 1, 2^{k−1} − 1}, Class VI {2^k − 1, 2^{k−1}, 2^{k−1} − 1}, Class VII {2^{k+1} − 3, 2^k − 1}, Class VIII {2^{k+1} − 3, 2^{k−1} − 1}, Class IX {2^{k+1} − 5, 2^k − 3}, Class X {2^{k+1} − 5, 2^{k−1} − 1}, Class XI {2^{2k+2} − 5, 2^{2k−1} − 1} and Class XII {2^{k+1} − 3, 2^k − 1, 2^{k−1} − 1}.
Nannarelli et al. [27] have suggested techniques for reducing power consumption in modulo adders. They suggest the architecture of Figure 9.9a, wherein the possibility of the sum exceeding the modulus is predicted. Using this information, only A + B or A + B − m_i is computed, thus reducing the power consumption. However, the prediction can say whether definitely (A + B) > m_i or (A + B) < m_i only in some cases; in the other cases, both parallel paths need to work. For m = 11, as an illustration, the prediction function can choose the left or the right path using simple prediction logic.
Figure 9.9 (a) Modulo adder with prediction function, (b) and (c) isomorphism-based modulo multipliers using direct (DIT) and inverse (IIT) isomorphic transformation tables (adapted from [27])
this case (since the entries are doubled), but the multiplexer in the modulo adder is eliminated. Another modification is possible (see Figure 9.9c). In the right DIT (direct isomorphic transformation) table, instead of addressing y, one can address y − m_i. Using an n-bit adder, w = x + y − m_i can be computed. If w is positive, we access the normal IIT table; else, the modified table IIT* is addressed. This modification eliminates the CSA in the critical path present in Figure 9.9b. Note that when one of the operands is zero, there is no isomorphic correspondence, and the modular adder has to be bypassed, as shown using detector blocks in Figure 9.9b, c. The authors show that power is reduced by 15 % and delay is shorter by 20 % for a mod 11 adder.
Cardarilli et al. [28] have described a reconfigurable data path using RNS. The basic reconfigurable cell is shown in Figure 9.10. The block “isodir” converts the input residues using an isomorphic transformation so that the multiplication operation can be converted into an addition problem, as explained before. The powers of the chosen radix r cover the range 0 to m_i − 1, where m_i is prime. For the addition operation, the “isodir” block is bypassed using multiplexers.
Figure 9.10 Basic reconfigurable cell: modular adder (mod m − 1 for multiplication, mod m for addition) with “isoinv” block and multiplexers (adapted from [28])
There is an “isoinv” block after the multiplier to convert the result back to conventional residue form. A 32-bit dynamic range was realized using the moduli set {13, 17, 19, 23, 29, 31, 64}. Sixty-four processing elements are used, which can be configured to realize applications such as FIR filtering, bi-dimensional convolution and pattern matching. The dynamic range can be reduced by shutting off some moduli channels. AMS 0.35 μm technology was used. The RNS processor is 25 % faster than TCS and consumes 30 % less power at the same frequency.
The reduction of power consumption at the arithmetic level can be achieved by using RNS. Cardarilli et al. [29] have compared TCS and RNS implementations of FIR filters in ASICs. They observe that, in general, area and power consumption can be represented as

A_x = k_1^{A_x} + k_2^{A_x} N_TAP    (9.8a)

and

P_x = k_1^{P_x} + k_2^{P_x} N_TAP    (9.8b)
where x refers to the number system used and N_TAP is the number of taps in the FIR filter. Note that k_1 is the offset of the plots representing (9.8) and k_2 is the growth rate. RNS has a large offset because of the presence of the RNS-to-binary and binary-to-RNS converters; on the other hand, the RNS slopes are less steep than the TCS ones. In VLSI designs, RNS reduces the interconnection capacitances and complexity. In FPGA implementations, on the other hand, power consumption due to interconnections plays a more important role than the clocking structure, logic and IOBs (input/output blocks). They observe that since RNS has local interconnections, it is very advantageous in FPGAs, in addition to the complexity reduction. RNS thus allows power reduction in both ASICs and FPGAs.
Nannarelli et al. [30] have shown that transpose-type RNS FIR filters have high latency due to the conversions at the front end and back end. They can, however, be clocked at the same rate as TCS filters and can give the same throughput. RNS filters are smaller and consume less power than the TCS type when the number of taps is larger than 8 for a coefficient size of 10 bits. For direct-form FIR filters as well, RNS is faster for filters with more than 16 taps. Thus, RNS filters can perform at the same speed but with lower area and lower power consumption. Note that transposed-form FIR filters give better performance at the expense of larger area and power dissipation.
Mahesh and Mehendale [31] have suggested low-power FIR filter realization by coefficient encoding and coefficient ordering so as to minimize the switching activity on the coefficient-memory address buses that feed the modulo MAC units. Next, they suggest reordering the coefficient and data pairs so that the total Hamming distance between successive values of coefficient residues is minimized.
Freking and Parhi [32] have considered the case of using hybrid RNS-binary arithmetic suggested by Ibrahim [33] and later investigated by Parhami [34]. They have considered hardware cost, switching-activity reduction and supply-voltage reduction for the complete processing unit. They observed that the conversion overhead remains constant regardless of the number of taps in the FIR filter, so the basic hardware cost shall be less than that of binary implementations. Considering a binary FIR unit needing an area of n² + n full-adders (a binary multiplier with no rounding operation), an RNS with r moduli and a dynamic range of 2n bits needs an area of 4n²/r + 4n full-adders using the architecture of Figure 9.11b. Hence, if this is to be less than n² + n, we obtain the condition r > 4n/(n − 3). Thus, a five-moduli system may be optimal.
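The crossover argument can be checked numerically (the cost models are the full-adder counts stated above, nothing more):

```python
# Full-adder cost comparison: binary multiplier ~ n^2 + n FAs versus an
# r-channel RNS of 2n-bit dynamic range ~ 4n^2/r + 4n FAs.
def binary_cost(n):
    return n * n + n

def rns_cost(n, r):
    return 4 * n * n / r + 4 * n

n = 16
# RNS wins once r > 4n/(n-3); for n = 16 that is r > 64/13 ~ 4.9, i.e. r >= 5.
crossover = 4 * n / (n - 3)
```

For n = 16, four channels are still costlier than binary, while five channels dip below it, consistent with the five-moduli conclusion.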
A direct-form unit cell in an FIR filter can be realized in RNS form as shown in Figure 9.11a. Ibrahim [33] has suggested that the modulo reduction can be deferred to a later stage instead of being performed in each multiplier/accumulator cell. However, the word length then grows with each accumulation operation. One solution suggested for this problem is shown in Figure 9.11b, wherein integer multiples of the modulus can be added or subtracted at any point without affecting the result. Another solution (see Figure 9.11c [34]) uses a different choice of correction factor, which is applied at the MSBs.
Since the MSBs of the residues have a different switching probability than the LSBs, the switching activity in RNS is reduced considerably, by up to 38 % for small-word-length moduli. Finally, since the word length of the MAC unit is less, the supply voltage can be reduced: the critical path is shorter and grows as the logarithm of the word length, whereas in the binary case it is linear. Considering the input and output conversion overhead as well, the power-reduction factor is substantial for a large number of taps, e.g. up to 3.5 times for 128 taps.
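The deferred-reduction idea can be sketched as follows (the accumulator width and the correction policy are our illustrative choices): subtracting any integer multiple of the modulus at any point leaves the final residue unchanged:

```python
# Deferred modulo reduction: accumulate raw products, subtract a multiple of
# the modulus only when the accumulator grows past a chosen width, and
# reduce once at the end; k*m == 0 (mod m) guarantees correctness.
def mac_deferred(coeffs, samples, m, width_bits=16):
    acc = 0
    limit = 1 << width_bits
    for a, x in zip(coeffs, samples):
        acc += a * x                  # no per-step modulo reduction
        if acc >= limit:              # occasional width correction
            acc -= (acc // m) * m     # subtract an integer multiple of m
    return acc % m

m = 251
c = [13, 200, 77, 191]
s = [250, 3, 99, 42]
y = mac_deferred(c, s, m)
```

The result matches a fully reduced MAC, while each cell needs only an ordinary adder.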
Figure 9.11 (a) Direct-form FIR RNS unit cell, (b) deferred reduction concept, (c) modified version of (b) with correction applied to MSBs only (adapted from [32] ©IEEE1997)
Cardarilli et al. [35] have described an FIR filter architecture wherein the coefficients and samples are assumed to be in residue form. They suggest scaling of the coefficients by a factor 2^−h to obtain

⟨2^−h Y(n)⟩_{m_i} = ⟨ Σ_{k=0}^{P} ⟨2^−h A_k⟩_{m_i} ⟨X(n − k)⟩_{m_i} ⟩_{m_i}    (9.9)
Figure 9.12 FIR filter structure for modulus channel m_i
where the number of taps is P. The inner summation has a dynamic range given by (P + 1)m_j. The authors suggest modulo reduction and post-scaling by 2^h by adding or subtracting αm_j. The value of α can be obtained by an LUT lookup of the h LSBs of the summation. The reader may observe that this is nothing but Montgomery's technique [36].
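The post-scaling step can be sketched as a Montgomery reduction (function name and parameters are ours; in hardware, α would come from the h-LSB lookup table):

```python
# Montgomery-style scaling: add alpha*m so the sum becomes divisible by 2^h,
# then shift right by h; the result is s * 2^(-h) mod m (m odd).
def montgomery_scale(s, m, h):
    alpha = (-s * pow(m, -1, 1 << h)) % (1 << h)   # depends only on h LSBs of s
    t = (s + alpha * m) >> h                       # exact division by 2^h
    return t % m

m, h = 29, 6
s = 12345
t = montgomery_scale(s, m, h)
```

The key property is that α is a function of only the h LSBs of the accumulated sum, which is exactly what makes a small LUT sufficient.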
Smitha and Vinod [37] have described a reconfigurable FIR filter for software-radio applications. In this design (see Figure 9.12), the multipliers used for computing the product of an FIR filter coefficient with a sample are realized using a product encoder. The product encoder can be configured to meet different FIR filter requirements for different standards. The encoder takes advantage of the fact that the output values of the multiplier can only lie between 0 and (m_i − 1). The coefficients can be stored in LUTs for various standards. The authors have shown area and time improvements over conventional designs using modulo multipliers.
Parallel fixed-coefficient FIR filters can be realized using an interpolation technique or using number-theoretic transforms (NTT). A 2-parallel FIR filter is shown in Figure 9.13a [38]. This uses moduli of the form (2^n − 1) and (2^n + 1). In an alternative technique, in order to reduce the polynomial multiplication complexity, parallel filtering using NTT is employed (see Figure 9.13b). Conway [38] has investigated both of these for RNS and has shown that RNS-based designs have low complexity.
Lee and Jenkins [39] have described a VLSI adaptive equalizer using RNS which contains a binary to RNS converter, an RNS to binary converter, RNS multipliers and RNS adders, with coefficient update using the LMS algorithm. They use a hybrid design wherein the error calculation is performed in binary. The block diagram is presented in Figure 9.14. The authors use an approximate CRT (ACRT). Note that in the ACRT, we compute
9.1 FIR Filters 213
Figure 9.13 (a) Structure of 2 Parallel FIR filter and (b) parallel filtering using transform
approach for reducing polynomial multiplication complexity (adapted from [38] ©IEEE2008)
Figure 9.14 RNS implementation of modified LMS algorithm (adapted from [39] ©IEEE1998)
(2^d X)/M = | Σ_{j=1}^{N} (2^d/m_j) |r_j M_j^{−1}|_{m_j} |_{2^d} ≈ | Σ_{j=1}^{N} R(k) |_{2^d}    (9.10)

where R(k) = ⌊2^d k/m_j⌋ and k = |r_j M_j^{−1}|_{m_j} is an integer such that 0 ≤ k < m_j. Note that the values R(k) are stored in ROM.
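A hedged numerical sketch of this approximate CRT follows; the moduli and the scaling exponent d are illustrative, and the per-modulus ROM contents are the R(k) values described above:

```python
# Approximate CRT of (9.10): the scaled value 2^d * X / M is estimated as a
# sum of ROM entries R(k) = floor(2^d * k / m_j), reduced mod 2^d, where
# k = |r_j * M_j^{-1}|_{m_j}.  Each floor loses less than one unit, so the
# estimate is within (number of moduli) of the exact scaled value.

moduli = [7, 11, 13]
M = 7 * 11 * 13                          # 1001
d = 16

roms = []
for m in moduli:
    Mj = M // m
    Mj_inv = pow(Mj, -1, m)
    roms.append((Mj_inv, [(k << d) // m for k in range(m)]))

def acrt_scaled(residues):
    s = 0
    for (Mj_inv, rom), m, r in zip(roms, moduli, residues):
        k = (r * Mj_inv) % m             # k = |r_j * M_j^{-1}|_{m_j}
        s += rom[k]
    return s % (1 << d)                  # approximates floor(2^d * X / M)

X = 618
approx = acrt_scaled([X % m for m in moduli])
exact = (X << d) // M
assert abs(approx - exact) <= len(moduli)
```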
The binary to residue converter uses the LSBs and MSBs of the input word to address two ROMs to get the residues, and these two outputs are next added using a modulo adder. The modulo multiplier is based on the quarter-square concept and uses ROMs to obtain (A + B)² and (A − B)² followed by an adder.
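Both building blocks are easy to model. The following hedged sketch uses an illustrative modulus and word split (the actual design's table sizes differ); the multiplier uses the quarter-square identity a·b = Q(a + b) − Q(a − b) with Q(x) = ⌊x²/4⌋:

```python
# Two-ROM binary-to-residue conversion of 2^L*B_M + B_L, plus a
# quarter-square modular multiplier.  m and L are illustrative.

m, L = 13, 8

rom_hi = [(b << L) % m for b in range(1 << L)]   # residues of 2^L * B_M
rom_lo = [b % m for b in range(1 << L)]          # residues of B_L

def bin_to_residue(x):
    bm, bl = x >> L, x & ((1 << L) - 1)
    return (rom_hi[bm] + rom_lo[bl]) % m         # modulo adder

Q = [(x * x) // 4 for x in range(2 * m)]         # quarter-square ROM

def quarter_square_mul(a, b):
    # (a+b) and (a-b) have the same parity, so the floored quarter
    # squares subtract exactly to a*b.
    return (Q[a + b] - Q[abs(a - b)]) % m

x, y = 40000, 51234
assert bin_to_residue(x) == x % m
assert quarter_square_mul(x % m, y % m) == (x * y) % m
```

The quarter-square trick replaces a general multiplier with two small square tables and an adder/subtractor, which is why ROM-based RNS designs favored it.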
Shah et al. [40] have described a 2D recursive digital filter using RNS. They consider a general 3 × 3 2D quarter-plane filter with scaling included. The difference equation computed is

y(k, l) = Σ_{n1=0}^{2} Σ_{n2=0}^{2} (s·a_{n1 n2})_r x(k − n1, l − n2) − Σ_{p1=0}^{2} Σ_{p2=0}^{2} (s·b_{p1 p2})_r y(k − p1, l − p2)    (9.11a)

where a_{n1 n2} and b_{p1 p2} are the sets of coefficients that characterize the filter, k and l define the particular position of the sample in the array to be filtered, p1 and p2 are not simultaneously zero, and the subscript r denotes rounding of the scaled coefficients. Note that the value of y(k, l) shall be less than or equal to (Π_{i=0}^{L} m_i)/2.
Equation (9.11a) can be realized using the architecture of Figure 9.15. The authors
suggest scaling by 13 × 11 × 7 and use the moduli set {16, 15, 13, 11, 7}.
The scaling is based on the estimates technique due to Jullien [41] described in
Chapter VI using look-up tables. Note that Figure 9.15 realizes the following
equations:
|y_S(k, l)|_{m_i} = |F_{2,N}(.) + F_{4,N}(.) + F_{6,N}(.) + F_{2,D}(.) + F_{4,D}(.) + F_{6,D}(.)|_{m_i}    (9.11b)

where

F_{2,N}(.) = ||a_00 x(k, l) + a_01 x(k, l − 1)|_{m_i} + |a_02 x(k, l − 2)|_{m_i}|_{m_i} = |F_{1N}(.) + |a_02 x(k, l − 2)|_{m_i}|_{m_i}    (9.11c)

F_{4,N}(.) = ||a_10 x(k − 1, l) + a_11 x(k − 1, l − 1)|_{m_i} + |a_12 x(k − 1, l − 2)|_{m_i}|_{m_i} = |F_{3N}(.) + |a_12 x(k − 1, l − 2)|_{m_i}|_{m_i}    (9.11d)
Figure 9.15 An ith section of a 3 × 3 2D residue coded recursive digital filter (adapted from [40] ©IEEE1985)
F_{6,N}(.) = ||a_20 x(k − 2, l) + a_21 x(k − 2, l − 1)|_{m_i} + |a_22 x(k − 2, l − 2)|_{m_i}|_{m_i} = |F_{5N}(.) + |a_22 x(k − 2, l − 2)|_{m_i}|_{m_i}    (9.11e)
Shanbag and Siferd [42] have described a 2D FIR filter ASIC with a mask size of 3 × 3 with symmetric coefficients, as shown in Figure 9.16a. The data window is presented in Figure 9.16b. The computation involved is given by (9.12). The data and coefficients need to be represented in RNS form. The authors used the moduli set {13, 11, 9, 7, 5, 4} with a dynamic range of 17.37 bits. The authors have used a PLA-based multiplier. The IC has incorporated a binary to RNS converter realized using PLAs (programmable logic arrays) to find the residues of B_M and B_L
Figure 9.16 (a) Coefficient and (b) data windows and (c) filter details (adapted from [42]
©IEEE1991)
where the input number is expressed as 2^L B_M + B_L. The modulo results of both PLAs are added using a modulo adder. The residue to binary converter uses MRC in a first stage to obtain the numbers corresponding to {13, 4}, {11, 5} and {9, 7}. A second stage finds the number corresponding to {52, 55} using MRC. A third stage finds the number corresponding to {2860, 63}. PLAs were used for storing the values of |(r_j − r_i)|m_i^{−1}|_{m_j}|_{m_j}. The 2D FIR filter architecture, which implements (9.12), is shown in Figure 9.16c.
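This staged conversion can be sketched numerically. The following hedged sketch combines the pairs with a pairwise CRT step (in place of the table-based MRC of the paper) for the same moduli grouping and an illustrative test value:

```python
# Staged residue-to-binary conversion for {13, 11, 9, 7, 5, 4}:
# stage 1 pairs {13,4}, {11,5}, {9,7}; stage 2 combines 52 and 55;
# stage 3 combines 2860 and 63 (total dynamic range 180180).

def crt_pair(r1, m1, r2, m2):
    """Combine x mod m1 and x mod m2 into x mod m1*m2 (m1, m2 coprime)."""
    # mixed-radix step: x = r1 + m1 * |(r2 - r1) * m1^{-1}|_{m2}
    d = ((r2 - r1) * pow(m1, -1, m2)) % m2
    return r1 + m1 * d

moduli = [13, 4, 11, 5, 9, 7]
X = 123456                        # < 13*4*11*5*9*7 = 180180
r = [X % m for m in moduli]

a = crt_pair(r[0], 13, r[1], 4)   # number mod 52
b = crt_pair(r[2], 11, r[3], 5)   # number mod 55
c = crt_pair(r[4], 9, r[5], 7)    # number mod 63

ab = crt_pair(a, 52, b, 55)       # number mod 2860
x = crt_pair(ab, 2860, c, 63)     # number mod 180180
assert x == X
```

Pairing small moduli first keeps every combining step within a narrow word length, which is what makes the PLA/ROM implementation practical.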
Soderstrand [43] has described digital ladder filters using lossless discrete
integrator (LDI) transformation [44] and RNS using the moduli set {4, 7, 15}.
The LDI transformation-based resonator is shown in Figure 9.17a together with the
RNS realization in Figure 9.17b. These designs need coefficients and data samples
of 8–10 bits word length while achieving low sensitivity.
Taylor and Huang [45], Ramnarayan and Taylor [46] have described an auto-
scale multiplier which implicitly scales the result by a scaling factor c. For a three
moduli set {m1, m2, m3} = {2^n − 1, 2^n, 2^n + 1}, the decoded scaled number can be
Figure 9.17 (a) LDI ladder structure and (b) RNS realization of (a) (adapted from [43]
©IEEE1977)
X = X_2 + m_2 J_1 + m_2 m_1 I_1    (9.13a)

X̂ ≅ m_2 J_1 + m_2 m_1 I_1    (9.13b)

This formula is denoted as the auto-scale algorithm. Note that there can be two sources of error in such scaling: (a) error due to estimating X as X̂ and (b) error due to round-off of the final result. An architecture of the auto-scale unit is presented in Figure 9.18. Note that in recursive filters, these need to be used in order to efficiently manage register overflow.
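The estimate of (9.13b) drops only the lowest mixed-radix digit, so the scaled result is exact to within one unit. A hedged sketch for {2^n − 1, 2^n, 2^n + 1} with the illustrative choice n = 3, exhaustively checked over the full dynamic range:

```python
# Mixed-radix digits for {m1, m2, m3} = {7, 8, 9}: X = x2 + m2*J1 + m2*m1*I1
# as in (9.13a); dropping x2 gives the auto-scale estimate of (9.13b).

n = 3
m1, m2, m3 = (1 << n) - 1, 1 << n, (1 << n) + 1      # 7, 8, 9

def mixed_radix_digits(X):
    x1, x2, x3 = X % m1, X % m2, X % m3
    J1 = ((x1 - x2) * pow(m2, -1, m1)) % m1
    I1 = (((x3 - x2) * pow(m2, -1, m3) - J1) * pow(m1, -1, m3)) % m3
    return x2, J1, I1

for X in range(m1 * m2 * m3):                        # full dynamic range 504
    x2, J1, I1 = mixed_radix_digits(X)
    assert X == x2 + m2 * J1 + m2 * m1 * I1          # (9.13a)
    X_hat = m2 * J1 + m2 * m1 * I1                   # (9.13b): drop x2
    assert 0 <= X - X_hat < m2                       # estimation error < m2
```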
The supply voltage of CMOS digital circuits can be scaled below the critical
supply voltage (CSV) for power saving. This mode of operation is called voltage
over scaling (VOS). Chen and Hu [47] have suggested using RNS together with
reduced precision redundancy (RPR). This technique is denoted as JRR (Joint
RNS-RPR). This technique has been applied to the design of a 28-tap FIR filter
using 0.25 μm 2.5 V CMOS technology to recover from the soft errors due to VOS.
In VOS technology, the supply voltage of a DSP system is Vdd = Kv·Vdd-crit (0 < Kv ≤ 1), where Kv is the voltage over-scaling factor. In the VOS case for a DSP (digital signal processing) system, when the critical path delay Tcp is larger than the sample period Ts, soft errors will occur due to timing violations. They observe that RNS has tolerance up to Kv = 0.6 compared to TCS-based implementations. Since RNS has a smaller critical path, it can achieve a lower critical supply voltage (CSV) than a TCS implementation, and lower power consumption can be achieved. The JRR method uses MRC. It uses the fact that the remainder Rmi (decoded word corresponding to the lower moduli) is uncorrelated with the residues of the higher moduli. In a four-moduli system, e.g. {2^n − 1, 2^n, 2^n + 1, 2^{n+1} − 1}, the probability of
[Figure 9.18 Architecture of the auto-scale unit: the residues X1, X2, X3 yield J1 and I1, and Z = X̂·c = |p2·J1·c| + |p2·p1·I1·c|, where I1 is a function of J1–J3]
soft errors is highest for the modulus (2^{n+1} − 1). Hence they apply JRR for this modulus. In the full RNS, the quotient Urpr is more precise, whereas in the reduced RNS, the remainder Rmi is more precise. They consider a moduli set with n = 7; the width of the RPR channel is 7. The structure of the complete FIR filter is presented in Figure 9.19a, where the modulo sub-filters (see Figure 9.19b) perform the needed computation. A binary to RNS converter and an RNS to binary converter precede and succeed the conventional FIR filter. The RPR unit word length can be n bits. It processes only the n MSBs of the input samples, and modulo reduction is not performed. The 2n-bit word is processed next. The n LSBs are left shifted to make it a (3n − 2)-bit word, from which the RNS filter decoded word corresponding to the three moduli is subtracted; a decision is taken in the block DEC (see Figure 9.19c, d) regarding the correction 0, +1 or −1 to the n MSBs to effectively obtain the higher mixed-radix digit. Next, this value is weighted by the product of the moduli in the RNS using the technique shown in Figure 9.19e, which realizes multiplication by (2^{3n} − 2^n) using left shifts and subtraction, and is added to Rmi to obtain the correct word. The hardware increase is about 9 % and the power saving is about 67.5 %.
When FIR filters operate in extreme environments like space, they are exposed
to different radiation sources which cause errors in the circuits. One type of such
error is single event upset (SEU) that changes the value of a flip-flop or memory
cell. One way of mitigation is using Redundant Residue Number system (RRNS)
[48]. The normal computation is performed by an n moduli RNS (for example
Figure 9.19 (a) Block diagram of Joint RNS-RPR method in RNS, (b) mod mi FIR filter structure
(c) optimized structure for JRR (d) Structure of DEC unit and (e) amplification block (adapted
from [47] ©IEEE2013)
consider n = 3). A redundant channel also processes the signal with modulus m4.
The residue of the output of the 3-moduli channel after RNS to binary conversion
with respect to modulus m4 is computed and compared with that of the channel
pertaining to modulus m4 and if there is a difference between these two, the input
samples and coefficients are reloaded and the FIR filter re-computes the output.
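The redundant-channel check can be sketched as follows; this is a hedged model with illustrative moduli, taps and injected fault, not the design of [48]:

```python
# n = 3 working channels plus a redundant channel with modulus m4:
# the decoded FIR output is reduced mod m4 and compared with the
# redundant channel's own output; a mismatch triggers recomputation.

from functools import reduce

def crt(residues, moduli):
    M = reduce(lambda a, b: a * b, moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

moduli = [7, 15, 16]           # working channels
m4 = 13                        # redundant channel
h = [3, 1, 2]                  # filter taps (illustrative)
x = [5, 2, 7, 1, 4]            # input samples (illustrative)

def channel_output(m, t):      # mod-m FIR output at time t
    return sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0) % m

t = 4
z = [channel_output(m, t) for m in moduli]
z4 = channel_output(m4, t)

assert crt(z, moduli) % m4 == z4    # channels agree: no error detected

z[2] ^= 1                           # inject a single-channel soft error
assert crt(z, moduli) % m4 != z4    # mismatch detected -> recompute
```

Like any residue check, this can miss an error whose effect happens to preserve the residue mod m4; the redundant modulus only makes that unlikely, not impossible.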
Gao et al. [49] suggested the use of arithmetic residues of an FIR filter, which have lower cost than the actual FIR filter, for avoiding soft and hard errors. In this scheme, two FIR filters produce the outputs y1 and y2 corresponding to the input samples. We define r1 = y1 mod m and r2 = y2 mod m. We process the input samples x(n) by finding x(n) mod m in another replica FIR filter to obtain the output r. If
y1 = y2, y1 is chosen as the correct output. If y1 ≠ y2, we compare r1, r2 and r to know whether r1 or r2 agrees with r and choose the correct output. However, if r1 = r2 = r, no decision is possible. The procedure corrects as long as the errors affect a single FIR filter and the errors are such that the residue of the affected branch is different from r; otherwise errors go undetected. Moreover, y1 and y2 can be different while r1 equals r2 (for example, y1 = 15, y2 = 22 and m = 7 give r1 = r2 = 1), in which case no decision is possible. Gao et al. [49] suggested, in the case of an SEU causing errors in coefficients, modifying the input samples by adding +1 or −1 whenever the input sample satisfies x(n) mod m = 0 for a low-pass filter. The effect of this addition will be filtered out by the filter since the resulting noise will be outside the passband.
On the other hand, in the case of bandpass filters, +1 can be added to all input
samples which again results in noise being outside the band of interest.
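The voting logic just described fits in a few lines. A hedged sketch with the book's own undecidable example (y1 = 15, y2 = 22, m = 7):

```python
# Duplication-with-residue voting of Gao et al.: two main filters give
# y1, y2; a cheap replica filter working mod m gives r.  When y1 != y2
# the residues decide which branch is correct, if they can.

m = 7

def vote(y1, y2, r):
    if y1 == y2:
        return y1                 # duplicated outputs agree
    r1, r2 = y1 % m, y2 % m
    if r1 == r2:
        return None               # includes r1 == r2 == r: no decision
    if r1 == r:
        return y1                 # y2's branch disagrees with the residue filter
    if r2 == r:
        return y2
    return None                   # neither branch matches the residue filter

assert vote(38, 38, 38 % m) == 38
assert vote(38, 41, 38 % m) == 38    # 41 mod 7 = 6 != 3: second branch faulty
assert vote(15, 22, 1) is None       # r1 == r2 == 1: undecidable, as in the text
```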
Chaves and Sousa [50] have described a RISC DSP based on RNS, named RDSP (see Figure 9.20). It uses the moduli set {2^n − 1, 2^{2n}, 2^n + 1} for n = 8. Even though one modulus has double the word length, the time for multiplication in all three channels is balanced. This processor can also support signed values by adding M when the input value is negative. The RNS to binary converter used is based on CRT; the signed RNS decoder subtracts M from the decoded output. The 0.25 μm ASIC could work at a clock frequency of 200 MHz and has 20–30 % area and power reduction over mixed or binary designs. The FPGA Virtex E2000 design could work up to 29 MHz and has smaller occupancy: 15 % in the case of RNS, 20 % in the case of mixed and 24 % in the case of binary designs. In the mixed design, the 2^{2n} channel is expanded to 32 bits. In CMOS VLSI, it could work at 250 MHz, and the RNS design outperforms mixed and binary designs. Note that RDSP comprises a RISC having five pipeline stages: Instruction Fetch (IF), Instruction Decode (ID), Instruction Execute 1 (EX1), Instruction Execute 2 (EX2) and Write Back (WB). There is an arithmetic address generation unit (AGU) and a logical unit (LU) to coordinate all the operations.
Fu et al. [51] have described FPGA-based RNS optimization techniques using the moduli set {2^n, 2^n − 1, 2^n + 1}. The reverse converter yields n-bit outputs similar to the design of Gallaher et al. [52] by solving the equations
9.2 RNS-Based Processors 221
[Figure 9.20 Architecture of the RDSP: five-stage pipeline (IF, ID, EX1, EX2/WB1, WB2) with program memory, register banks, arithmetic units, data memory, AGU and LU]
A + B + C = x1 + C1·m1,  C1 ∈ {0, 1, 2}
C = x2
A − B + C = x3 + C2·m3,  C2 ∈ {−1, 0, 1}    (9.14)

Y + 2X = SUM1 + 2^n·c0
SUM1 + (2^n − mi) = SUM2 + c2·2^n    (9.15)
SUM2 + (2^n − mi) = SUM3 + c3·2^n
Next, using c0, c1, c2 and c3, a 3:1 MUX selects SUM1, SUM2 or SUM3. Note
that c1, c2 and c3 are the carry out signals of first, second and third adders and c0 is
the MSB of the first (n + 1)-bit adder. This architecture is presented in Figure 9.22a.
Ramirez et al. have used this technique later to realize DWT [57] with the modified
arrangement of Figure 9.22b wherein CSAs have been used to reduce the delay.
Recently, Vun et al. [58] have suggested use of one-hot residue (OHR) coding
for simplifying the doubling and modulo reduction operation. The authors use
Figure 9.22 (a) Distributed Arithmetic scheme for RNS applications and (b) RNS DA-based 1D
DWT architecture ((a) Adapted from [56] ©IEEE 1999, (b) adapted from [57] with permission
from Springer 2003)
Figure 9.23 (a) OHR-based modulo 7 adder and (b) TCR-based DA-RNS with OHR modulo
accumulator (adapted from [58]©IEEE 2013)
thermometer code format for input residues whereas the output data is encoded in
the one-hot format. The modular adders can be implemented using a single shifter-based circuit utilizing the one-hot coded format. The modulo operation is performed automatically during the addition process.
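Why the modulo reduction comes for free is easy to see in a model: adding b to a one-hot residue a is just a circular rotation of the one-hot word by b positions. A hedged sketch for m = 7 (the hardware uses a barrel of shifters, modeled here with integer bit operations):

```python
# One-hot residue (OHR) addition mod m = 7: residue a is the m-bit word
# with bit a set; adding b rotates that word left by b positions, and
# the wrap-around of the rotation performs the modulo reduction.

m = 7

def to_ohr(a):
    return 1 << a                       # one-hot encoding of residue a

def ohr_add(ohr_a, b):
    """Circularly rotate the m-bit one-hot word left by b positions."""
    shifted = ohr_a << (b % m)
    return (shifted | (shifted >> m)) & ((1 << m) - 1)

def from_ohr(v):
    return v.bit_length() - 1

for a in range(m):
    for b in range(m):
        assert from_ohr(ohr_add(to_ohr(a), b)) == (a + b) % m
```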
A OHR modulo 7 adder is shown in Figure 9.23a where one input is in OHR and
the other one is in binary format. A TCR (thermometer code encoded residue) based
DA-RNS with OHR modulo accumulator for one channel is shown in Figure 9.23b.
Note that the DA LUT has 2^K entries for a K-tap filter. Note that the input samples are sent serially, and the number of cycles needed for obtaining one output sample of the FIR filter is (m_i − 1).
The authors also extend the technique for taking two bits at a time (2BAAT) by
cascading two OHR modular adders to sum the two DA LUT outputs each driven
by one group of bit streams allocated from the TCR.
Cardarilli et al. [59] have described a QRNS realization of complex FIR filters using the transpose structure. The architecture is shown in Figure 9.24b. It comprises two RNS filters in parallel. Each RNS filter is decomposed into P filters working in parallel, where P is the number of moduli. The RNS {5, 13, 17, 29, 41} has been used for
a dynamic range of 20 bits. Note that this is simpler than conventional architectures
needing four multiplications per tap (see Figure 9.24a). The multipliers are
implemented using index calculus. The authors have compared with TCS and
Figure 9.24 (a) Structure of tap in TCS FIR filter and (b) QRNS FIR filter architecture (adapted
from [59] ©IEEE2008)
9.3 RNS Applications in DFT, FFT, DCT, DWT 227
other types and have shown that QRNS-based design consumes less power and
needs less area.
Taylor [60] has described a single modulus complex ALU using QRNS as
against multi-moduli set based systems. He has used Gaussian primes (e.g. 5,
17, 257, 65,537) as well as composite Gaussian primes (85 ¼ 5*17,
1285 ¼ 5*257, 4369 ¼ 17*257, 21,845 ¼ 5*17*257) of the type 2k + 1 as the
modulus. A single modulus ALU has the advantage of trivial magnitude scaling,
sign detection and overflow detection as against multi-moduli RNS.
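The QRNS mapping over a single such modulus can be sketched directly. This is a hedged illustration with p = 17, where ĵ = 4 satisfies ĵ² ≡ −1 (mod 17); a complex number a + jb maps to the pair (a + ĵb, a − ĵb) mod p, and complex multiplication becomes two independent modular multiplies with no cross products:

```python
# QRNS over the Gaussian prime p = 17 (of the form 2^k + 1).

p = 17
j_hat = 4
assert (j_hat * j_hat) % p == p - 1     # j_hat^2 = -1 (mod p)

inv2 = pow(2, -1, p)
inv2j = pow(2 * j_hat, -1, p)

def to_qrns(a, b):
    return ((a + j_hat * b) % p, (a - j_hat * b) % p)

def from_qrns(z, zs):
    return (((z + zs) * inv2) % p, ((z - zs) * inv2j) % p)

# (2 + 3j) * (5 + 1j) = 7 + 17j, i.e. (7, 0) mod 17
z1, z1s = to_qrns(2, 3)
z2, z2s = to_qrns(5, 1)
prod = (z1 * z2 % p, z1s * z2s % p)     # component-wise multiplication
assert from_qrns(*prod) == (7, 17 % p)
```

Two real multiplications per complex product, instead of four, is the entire appeal of QRNS.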
Figure 9.25 (a) Polyphase filter bank and (b) filter with truncated dynamic range (adapted from
[61] ©IEEE2004)
Figure 9.25 (continued) (b) Filter with truncated dynamic range: dynamic range 23 bits with mi = {13, 17, 29, 37, 41}; dynamic range 28 bits with mi = {13, 17, 29, 37, 41, 53}
DCT Computation
and K_0 = 1/√2, K_1 = K_2 = ⋯ = K_{N−1} = 1.
Figure 9.26 QRNS polyphase filter architecture (adapted from [62] ©IEEE2010)
Figure 9.27 Modulo mi channel for one transform point of an RNS-based 1D-DCT processor
(adapted from [65] ©IEEE1999)
Ramirez et al. [66] use the fact that the N-point DCT can be computed through the real part of a 2N-point DFT scaled by a complex exponential constant as follows:
X(m) = √(2/N) · K_m · Re{ e^{−jmπ/(2N)} Σ_{n=0}^{2N−1} x(n) W_{2N}^{mn} }    (9.17)
where W_{2N} = e^{−j2π/(2N)} and x(n) = 0 for n = N, N + 1, . . ., 2N − 1.
Initially, the N-point input sequence {x(0), x(1), x(2), . . ., x(N − 1)} is reordered into the sequence {y(0), y(1), y(2), . . ., y(N − 1)} defined by

y(n) = x(2n),  y(N − n − 1) = x(2n + 1),  n = 0, 1, . . ., N/2 − 1    (9.18)
Let {Y(0), Y(1), Y(2), . . ., Y(N − 1)} be the DFT of the sequence {y(0), y(1), y(2), . . ., y(N − 1)}. The DCT sequence {X(0), X(1), X(2), . . ., X(N − 1)} of the original sequence can be obtained through the real part of Z(n) [71], defined as

Z(n) = H_n Y(n) = √(2/N) K_n W_{4N}^n Y(n)    (9.19)

where W_{4N} = e^{−j2π/(4N)}. By using the property Z(N − n) = jZ*(n), whence Re[Z(N − n)] = Im[Z(n)], it is necessary to compute only N/2 + 1 values of Z(n), viz., Z(0), Z(1), . . ., Z(N/2). The N-point DCT sequence is given by {Re[Z(0)], Re[Z(1)], . . ., Re[Z(N/4)], Im[Z(3N/4 − 1)], Im[Z(3N/4 − 2)], . . ., Im[Z(N/2 + 1)], Re[Z(N/2)], Re[Z(N/2 + 1)], . . ., Re[Z(3N/4 − 1)], Im[Z(N/4)], . . ., Im[Z(1)]}.
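The relation of (9.17) can be checked numerically. The following is a hedged floating-point sketch that compares the direct orthonormal DCT-II against the scaled real part of a 2N-point DFT of the zero-padded input; the RNS/QRNS fixed-point mapping of the paper is omitted:

```python
import cmath
import math

def dct_direct(x):
    """Orthonormal DCT-II with K_0 = 1/sqrt(2), K_m = 1 otherwise."""
    N = len(x)
    out = []
    for m in range(N):
        Km = 1 / math.sqrt(2) if m == 0 else 1.0
        s = sum(x[n] * math.cos(math.pi * m * (2 * n + 1) / (2 * N))
                for n in range(N))
        out.append(math.sqrt(2 / N) * Km * s)
    return out

def dct_via_2n_dft(x):
    """DCT via (9.17): 2N-point DFT of zero-padded x, scaled and real part taken."""
    N = len(x)
    xp = x + [0.0] * N                       # x(n) = 0 for n = N..2N-1
    out = []
    for m in range(N):
        Km = 1 / math.sqrt(2) if m == 0 else 1.0
        F = sum(xp[n] * cmath.exp(-2j * math.pi * m * n / (2 * N))
                for n in range(2 * N))
        z = cmath.exp(-1j * math.pi * m / (2 * N)) * F
        out.append(math.sqrt(2 / N) * Km * z.real)
    return out

x = [1.0, 3.0, -2.0, 4.0, 0.5, -1.0, 2.0, 0.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(dct_direct(x), dct_via_2n_dft(x)))
```

Because the exponential factor merges with the DFT twiddle factors, an existing (Q)RNS FFT datapath can produce the DCT with only the extra per-output scaling.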
The fast algorithms known for DFT can be used for fast computation of DCT. A
QRNS butterfly for computation of a DIF radix-2 DFT is shown in Figure 9.28a.
Note that since the input sequence is real, each QRNS adder is one modular adder.
A butterfly needs a QRNS adder (two modular adders), a QRNS subtractor (two
modular subtractors) and a QRNS multiplier (two modulo multipliers). The moduli
set used is {221, 229, 233, 241}. The multiplier uses isomorphic mapping with the roots {47, 107, 89, 177}, respectively. The 8-point QRNS DCT computation is shown in Figure 9.28b. Note that only five outputs, Z(0), Z(1), Z(2), Z(4) and Z(5), are required for the DCT computation.
Fernandez et al. [69] have presented a RNS architecture for computation of
scaled 2D-DCT on field programmable logic (FPL). An eight pixel 1D-DCT is
implemented as shown in Figure 9.29a. The 2D DCT is computed as
X(u, v) = (2e(u)e(v)/N) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} x(i, j) cos[u(2i + 1)π/(2N)] cos[v(2j + 1)π/(2N)],  u, v = 0, 1, . . ., N − 1    (9.20)
Using an algorithm due to Arai et al. [72] (see Figure 9.29a), the 8-pixel 1D-DCT can be realized as shown in Figure 9.29b for one modulus channel, which needs only five multiplications. Note that e1 and e2 are power-of-two scaling factors. The coefficients are k1 = C4, k2 = C6 − C2, k3 = C4, k4 = C6 + C2, k5 = C6, where Cq = cos(qπ/16). The 1D-DCT can be designed to have a single multiplication
per stage. Multiplication by DCT coefficients is by ROM look-ups. In order to
obtain the exact value of DCT, each output needs an additional multiplication
which can be taken care of in the next stage. The hardware consists of adders,
registers and LUTs. The moduli set used was {256, 255, 253, 251}. The output
Figure 9.28 (a) QRNS butterfly for a radix-2 FFT and (b) pipelined QRNS DCT implementation
(adapted from [66] ©IEEE2000)
Figure 9.29 (a) Flow graph for fast computation of DCT and (b) moduli mi channel of 1D DCT
(adapted from [69] IEEE 2000)
Figure 9.30 Five point RNS prime factor DFT implementation (adapted from [73] ©IEEE1990)
Figure 9.31 Radix-4 DIT (decimation in time) basic calculation (adapted from [75] ©IEEE1979)
Taylor et al. [76] have presented radix-4 FFT using complex RNS arithmetic. In
this technique, the complex multipliers needed in conventional implementation are
replaced by QRNS multipliers thus reducing the hardware. A radix-4 complex RNS
(CRNS) butterfly is presented in Figure 9.32a together with the QRNS butterfly in
Figure 9.32b. In the CRNS butterfly, 12 real multiplications are needed at level 1, 6 add/subtract operations at level 2, 8 add/subtract at level 3 and 8 add/subtract at level 4. On the other hand, in the case of QRNS-based designs, we need only 6 real multiplications at level 1, 8 add/subtract operations and 2 multiplications at level 2, and 8 add/subtract at level 3.
Jullien et al. [77] have described a systolic quadratic residue DFT with fault tolerance. In this, each systolic array cell uses a 16 × 6 ROM in place of a 16 × 4 ROM. The additional two bits correspond to the parity of the output content of the ROM and the parity of the input address bits. In normal operation, the address parity of a cell must equal the content parity of the previous cell.
The general form of Number Theoretic Transform (NTT) [78, 79] is described
by the transform pair
Figure 9.32 (a) CRNS radix-4 FFT butterfly and (b) QRNS radix-4 Butterfly (adapted from [76]
©IEEE1985)
Figure 9.32 (continued)
X(m) = | Σ_{n=0}^{N−1} x(n) α^{nm} |_M,  0 ≤ m ≤ N − 1

x(n) = | N^{−1} Σ_{m=0}^{N−1} X(m) α^{−nm} |_M,  0 ≤ n ≤ N − 1    (9.21)

where α generates a multiplicative group of elements with |α^N|_M = 1. NTTs are useful for fast, efficient and error-free computation of cyclic convolutions.
A complex number theoretic transform with dynamic range M = Π_{i=1}^{L} m_i, where each m_i is prime, can be implemented in the RNS if the transform length is a divisor of the gcd of the numbers N_i, i = 1, . . ., L, where N_i = m_i² − 1 for m_i = 4n + 3 and N_i = m_i − 1 for m_i = 4n + 1. FIR filters are characterized by the finite convolution

y_m = Σ_{n=0}^{P−1} h_{m−n} x_n    (9.22)
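A small NTT pair of the form (9.21) can be exercised directly. This hedged sketch uses the illustrative choice M = 257 (prime) and transform length N = 4 with α = 16, since 16² = 256 ≡ −1 (mod 257) gives α⁴ ≡ 1; inputs are kept small so the true convolution values stay below M and the result is error-free:

```python
# NTT-based exact cyclic convolution mod M = 257, length N = 4.

M, N, alpha = 257, 4, 16
assert pow(alpha, N, M) == 1 and pow(alpha, N // 2, M) != 1   # alpha has order 4

def ntt(a, root):
    return [sum(a[n] * pow(root, n * m, M) for n in range(N)) % M
            for m in range(N)]

def intt(A):
    inv_n = pow(N, -1, M)
    a = ntt(A, pow(alpha, -1, M))        # inverse transform uses alpha^{-1}
    return [(v * inv_n) % M for v in a]

x = [1, 2, 3, 4]
h = [1, 1, 0, 0]
Y = [(a * b) % M for a, b in zip(ntt(x, alpha), ntt(h, alpha))]
y = intt(Y)

ref = [sum(h[(m - n) % N] * x[n] for n in range(N)) for m in range(N)]
assert y == ref                           # exact: no rounding error, unlike FFT
```

Since all arithmetic is modular, the same multipliers and adders as the rest of an RNS datapath can be reused, which is the attraction of NTT filtering noted in the text.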
where A, B, A*, B*, C and C* are the element pairs of two input samples and the
twiddle factor, respectively, and Z, Z*,Y and Y* are pairs of elements of the butterfly
output in the QRNS.
Dimitrov et al. [80] have suggested implementation of real orthogonal transformations using RNS. In this technique, real numbers are approximated by considering them in the form a + x√b, i.e., as elements of quadratic number rings. The reader is referred to [80] for more information.
A three-moduli set {2^{2k−2} + 1, 2^{2k} + 1, 2^{2k+2} + 1}, where k is odd, was suggested by Shyu et al. [81]. Abdallah and Skavantzos [82] have extended these to general moduli sets (having more moduli) and presented rules to select them to be mutually prime. A recommended four-moduli set was {2^{n−6} + 1, 2^{n−2} + 1, 2^n + 1, 2^{n+2} + 1}, where n = 8k + 6, k = 1, 2, 3, . . .. Abdallah and Skavantzos [82] have also suggested QRNS using such moduli sets with non-co-prime moduli. This enables availability of several moduli of the form 2^n + 1. The conversion from binary to QRNS is the same as before. Note, however, that CRT needs to be modified to take
RNS has been used for communication systems for protecting the information
processed or transmitted [86–89]. This exploits the self-checking/error correction
properties of redundant residue number systems (RRNS). The block diagram of a
transmitter using RNS-based parallel communication scheme is shown in
Figure 9.34a. The input binary word N is converted into residues r1, r2, . . ., ru (where r1, . . ., rv are the information residues and rv+1, rv+2, . . ., ru are the redundant residues), which are then mapped to orthogonal sequences U_{1r1}, U_{2r2}, . . ., U_{uru} and multiplexed for transmission. The receiver architecture is shown in
Figure 9.34b. If an information symbol is coded and sent as u residues, after the
MLD (maximum likelihood detection) of the u banks, d number of MLD outputs
can be dropped while still recovering the transmitted symbol using the remaining
outputs and they can be corrected as well. The block named bank for receiving
residues is expanded in Figure 9.34c which comprises of correlators, square law
detectors and multi-path diversity reception combining MLD units. Note that
L represents the number of resolvable paths being tracked at the receiver. The
receiver has a diversity reception structure with L multi-paths tracking. Note that
some of the redundant channels can be dropped if the dynamic range is less than Π_{i=1}^{u} m_i. The errors can be corrected by using the theory of RNS-PC (RNS product codes) [90, 91].
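The detection half of an RRNS code can be sketched compactly. This is a hedged model with illustrative moduli (the redundant moduli are chosen larger than the information moduli, as RRNS theory requires): v = 3 information moduli carry the data, and u − v = 2 redundant moduli expand the range so that any small error pushes the decoded value out of the legitimate interval:

```python
# RNS(u, v) redundant-residue error detection, u = 5, v = 3.

from functools import reduce

info = [7, 11, 13]                      # information moduli
redun = [15, 16]                        # redundant moduli
Mv = reduce(lambda a, b: a * b, info)   # legitimate range [0, 1001)

def crt(residues, moduli):
    M = reduce(lambda a, b: a * b, moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

def detect(residues):
    """True if the received residue vector is consistent (no error found)."""
    X = crt(residues, info + redun)     # decode over the full moduli set
    return X < Mv                       # errors push X out of the legitimate range

X = 618
codeword = [X % m for m in info + redun]
assert detect(codeword)

bad = codeword[:]
bad[2] = (bad[2] + 1) % 13              # single residue-digit error
assert not detect(bad)
```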
An RNS(u, v) code has a minimum distance of (u − v + 1) and can detect (u − v) or fewer residue digit errors and correct up to ⌊(u − v)/2⌋ residue digit errors. Further, an RNS(u, v) code is capable of correcting a maximum of tmax residue errors and simultaneously detecting a maximum of β > tmax residue errors if and only if tmax + β ≤ (u − v). Note that the diversity combining technique can be based on equal gain combining (EGC) or selection combining (SC). Note that in EGC, for a receiver with Lth order diversity (L ≤ LP) reception, where LP is the number of resolvable multi-path components, the L paths at the receiver are added after equal
weighting and form the decision variable:
U_ij = Σ_{l=1}^{L} U_ij(l)    (9.24)
Figure 9.34 (a) Transmitter block diagram (b) receiver block diagram with RNS processing
and (c) non-coherent demodulation bank for receiving residue digit ri (adapted from [85]
©IEEE1999)
uses RRNS for good error correction capabilities. RNS product codes are used as
inner codes and Reed Solomon (RS) codes are used as outer codes. The trade-offs
between information moduli and redundant moduli have been considered. In a system with v moduli, in order to transmit k bits of information in one symbol period, transmitting each of the residues in parallel using an RNS with v moduli requires Σ_{i=1}^{v} m_i orthogonal sequences, such that Π_{i=1}^{v} m_i ≥ 2^k. Hence, at the receiver, Σ_{i=1}^{v} m_i correlators are required. A similar architecture as discussed in Figure 9.34b can be employed. Note that each of the users is assigned a random PN sequence set consisting of Σ_{i=1}^{u} m_i orthogonal sequences of length Ns, where a subset of m_i orthogonal sequences is used for transmission of the residue r_i. For example, the moduli set {7, 11, 13, 15, 16} can accommodate a 17-bit symbol. Hence, by using a 64-bit Walsh code, this symbol can be transmitted in parallel in one symbol period. Thus, given a message Xq for the qth user, the residues are (r1, r2, . . ., ru) such that u specific PN sequences V^{(q)}_{1r1}, V^{(q)}_{2r2}, . . ., V^{(q)}_{uru} can be selected from the Σ_{i=1}^{u} m_i sequences.
Ramirez et al. [92] have described an RNS-based communication receiver which needs a direct digital frequency synthesizer (DDS) and a programmable decimation FIR filter in the mixer block, as shown in the block diagram of Figure 9.35a. The RNS-based DDS, based on [90] and [91], is sketched in Figure 9.35b. Note that the phase accumulator is of conventional binary type. The output of the phase accumulator is 10 bits wide. The first two MSBs are used to decide the sign and the quadrant. The eight LSBs address the COS and SIN LUTs in RNS form. The negative values are obtained by modulo subtraction. The correct output (cn, sn, −cn or −sn) is selected by using a multiplexer. The index LUTs convert the input into indices so that the mixing function can be accomplished.
The input TC word is converted into index form using a binary to RNS converter
operating on p number of b bit blocks of the input word and using ( p 1) LUTs to
store the residue. All these are added in an adder tree whose output is fed to an index
LUT to get the input indices to be added with COS and SIN indices obtained as
explained before. The output is routed to the programmable decimation filter. The
programmable decimation filter uses L index-based RNS channels and a final RNS
to binary converter. The number of taps, input and output precisions are program-
mable. The filter has serial data and coefficient inputs. Using a multiplexer, the
coefficients are sequentially loaded and the input is distributed to all index-based
multipliers. The products are computed using a mod (m_i − 1) adder to add the
indices followed by a LUT to get the inverse. These are added using a tree of adders
to yield the output. A final stage using both CRT and ε-CRT has been tried to yield
the desired binary result. The authors have shown that the complexity is comparable
9.4 RNS Application in Communication Systems 245
Figure 9.35 (a) Digital receiver architecture (b) RNS-based DDS (adapted from [92] © with
permission from Springer 2002)
to or lower than the conventional design whereas the throughput has increased
by 65 %.
The use of RNS for digital frequency synthesis (DFS) has been considered
[93]. The block diagram of classical DFS is shown in Figure 9.36a. The phase
accumulator is incremented by a frequency setting word k and using a look-up table,
the sine value is read corresponding to the accumulator output. This output after
D/A conversion yields the desired sine waveform. There is a possibility of scaling
the phase accumulator output from L bits to W bits. Assuming a clock frequency of
fc, a frequency setting word k and an L-bit accumulator, the frequency of the generated
sinusoid is fc·k/2^L. The phase increment is 2πk/2^L. The symmetry of the sine function
can be taken advantage of to reduce the ROM size. The two MSBs of the accumulator
output indicate the sign and quadrant. Chren [93] suggests using an RNS
comprising several moduli, one of them being 2^{p+2}. The architecture shown in
Figure 9.36b uses RNS-based phase accumulator, scaler and modifier for taking
into account sign and quadrant information. The block FSM (finite state machine)
performs phase accumulation mod m_i. The input to each FSM is the binary-encoded ith
residue digit of the frequency setting word. The additive invert (AI) and sign
inversion units compute the additive inverse of the inputs so as to take advantage
of the quarter wave symmetry of the cosine function. Chren [93] has suggested
another reduced area architecture shown in Figure 9.36c in which modulo adders
are used in place of FSMs in Figure 9.36b. Here, sample values are computed rather
246 9 Applications of RNS in Signal Processing
Figure 9.36 (a) Traditional direct digital frequency synthesizer (b) frequency-agile direct synthesizer and (c) reduced-area direct synthesizer (adapted from [94] ©IEEE 2001)
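The classical DDS of Figure 9.36a is easy to model behaviorally. The sketch below (the accumulator width, LUT size and helper names are illustrative choices, not taken from [93]) steps an L-bit phase accumulator by the frequency word k and exploits quarter-wave symmetry, with the two accumulator MSBs selecting the sign and quadrant:

```python
import math

L = 10                      # accumulator width (illustrative)
LUT_BITS = L - 2            # the L-2 LSBs address the quarter-wave LUT
QUARTER = 1 << LUT_BITS     # 256 entries covering the first quadrant
lut = [math.sin(2 * math.pi * j / (1 << L)) for j in range(QUARTER)]

def dds_sample(phase):
    """Map an L-bit phase to a sine sample using quarter-wave symmetry."""
    quadrant = phase >> LUT_BITS          # two MSBs: sign and quadrant
    idx = phase & (QUARTER - 1)           # remaining LSBs: LUT address
    if quadrant & 1:                      # 2nd and 4th quadrants run backwards
        idx = QUARTER - 1 - idx
    s = lut[idx]
    return -s if quadrant & 2 else s      # 3rd and 4th quadrants are negative

def output_frequency(f_clk, k):
    """Generated sinusoid frequency f_clk * k / 2^L."""
    return f_clk * k / (1 << L)

# Stepping the accumulator by k each clock over one accumulator period:
k = 16
samples = [dds_sample((n * k) & ((1 << L) - 1)) for n in range(1 << L)]
```

With fc = 1.024 MHz and k = 16 this gives a 16 kHz sinusoid, and the quarter-wave reconstruction stays within one LUT step of the ideal sine.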
Figure 9.37 (a) Transmitter block diagram and (b) details of sub-transmitter for nth residue
channel (adapted from [95] ©IEEE 2004)
Figure 9.38 (a) Transmitter schematic for kth CRU and (b) receiver block diagram in RRNS MC/
DS-CDMA system (adapted from [96] ©IEEE 2012)
MSBs corresponding to maximum residue values are 9, 10, 10, 10 and 11, respectively,
we need only 55 orthogonal sequences overall (considering that 0–9, etc., are
all the residue values).
The high-level architecture of the receiver is presented in Figure 9.37b. The
authors have used 16 bits per symbol. The last two elements in the moduli set
(44 and 47) are the redundant residues. The authors have shown that the BER versus
Eb/No performance is superior to the earlier technique of Yang and Hanzo [86–89].
Zhang et al. [96] have suggested the use of RRNS for multi-carrier direct sequence
CDMA multiple access for cognitive radio. The transmitter block diagram is
presented in Figure 9.38a, wherein the input binary symbol is converted into
residues using a front-end binary to RNS converter. These Q residues
corresponding to the Q moduli are next mapped into Q orthogonal sequences. Considering
a Q-moduli set, the number of orthogonal sequences is ∑_{q=1}^{Q} m_q. These are
Hadamard–Walsh codes of length Ns. Next, each of these selected orthogonal
Figure 9.39 (a) Block diagram of RNS OFDM transmitter and (b) receiver (adapted from [97]
©IEEE 2011)
Figure 9.40 Adaptive dual-mode JD-CDMA system block diagram (adapted from [98]
©IEEE 2006)
Zhu and Natarajan [99] have described hopping pilot pattern design using RNS
for cellular downlink orthogonal frequency division multiple access (OFDMA).
This enables hopping of pilot in time as well as frequency domains. By using these
RRNS-based pilot patterns, the channel’s Doppler delay response can be fully
reconstructed without aliasing. This technique provides a number of pairs of
hopping patterns that are collision-free.
We describe only briefly the algorithm used. The number of carriers
available is N = M·Mc, where Mc is the number of clusters and M is the number
of contiguous frequencies (sub-carriers) in a cluster. Note that G is the number of
OFDM symbols between two consecutive pilot signals in time (Tp) and M is the
number of OFDM sub-carriers between two consecutive pilots. M is chosen as the
product of two primes a and b. We consider a = 2, b = 3, so M = 6, for illustration.
Thus, if Ts is one OFDM symbol duration, Tp = G·Ts and fp = M·f, where f is the
sub-carrier spacing. Consider an example with N = 12, M = 6, Mc = 2 and G = 4.
Figure 9.41a shows two clusters, each with M = 6 carriers. These are divided
into two sub-clusters, and each sub-cluster has three sub-carriers. We start with an
initial address (IA), for example 4, for time slot 0. Since 4 mod 2 = 0 and
4 mod 3 = 1, we send the pilot in sub-cluster 0 on its first sub-carrier. Next, we
increment the IA to get, for the first time slot, 5 mod 2 = 1 and 5 mod 3 = 2; thus,
the next pilot tone is sent in sub-cluster 1 on its third sub-carrier. In a similar
manner, corresponding to the other two time slots, the locations are obtained as
shown in Figure 9.41b. Similarly, starting with different IA values from 1 to 6, the
generic RNS pilot pattern can be found as in Figure 9.41b.
It can be shown that the patterns are orthogonal, meaning that pilot patterns using
different IAs do not collide with each other. The authors have shown that such a
system generates more unique hopping pilot patterns than other techniques.
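The placement rule can be sketched in a few lines; this toy version (function names are mine) places each slot's pilot by the residues of the incremented initial address and checks, via the Chinese Remainder Theorem, that distinct IAs never collide:

```python
# Pilot placement by RNS digits: with M = a*b (a, b coprime primes), the pilot
# for time slot t and initial address IA sits in sub-cluster (IA+t) mod a,
# sub-carrier (IA+t) mod b. By the CRT this pair is a bijection on 0..M-1,
# so users with distinct IAs (mod M) are collision-free in every slot.
a, b = 2, 3          # M = 6 sub-carriers per cluster, as in the example
M = a * b

def pilot_position(ia, t):
    addr = ia + t
    return (addr % a, addr % b)   # (sub-cluster, sub-carrier within it)

# IA = 4 at slot 0 -> residues (0, 1): sub-cluster 0, its second sub-carrier
print(pilot_position(4, 0))   # (0, 1)

# Distinct IAs are collision-free in every slot:
for t in range(8):
    assert len({pilot_position(ia, t) for ia in range(M)}) == M
```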
Han et al. [100] have presented an architecture for the block interleaving algorithm
in MB-OFDM (multi-band orthogonal frequency division multiplexing) using a
Mixed Radix System, for UWB (ultra-wide-band) communication. The block inter-
leaving algorithm outlined in the WiMedia Alliance MAC-PHY interface spec-
ification 1.0 [101] necessitates three consecutive steps: symbol interleaving, tone
interleaving and cyclic shift. These are described by the following equations:
a_S(i) = a(⌊i/N_CBPS⌋ + (6/N_TDS)·mod(i, N_CBPS))    (9.25a)

a_T(i) = a_S(⌊i/N_Tint⌋ + 10·mod(i, N_Tint))    (9.25b)

b(i) = a_T(m(i)·N_CBPS + mod(i + m(i)·N_cyc, N_CBPS))    (9.25c)
Figure 9.41 (a) Design procedures of pilot patterns using RNS arithmetic and (b) RNS-based
pilot pattern (adapted from [99] ©IEEE 2010)
interleaver and b is the output of the cyclic shift, with m(i) = ⌊i/N_CBPS⌋. Note also that
N_CBPS, N_Tint, N_TDS and N_cyc are constants depending on the data rate of the
MB-OFDM system. We have used the notation mod(m, n) = m mod n. Note that
N_CBPS is the number of coded bits per OFDM symbol, N_Tint is the tone interleaver
block size, N_cyc is the cyclic shift and N_TDS is the TDS factor. The authors suggest realizing the
equations (9.25) using RNS. The interleaving operations can be realized using a
Mixed Radix System, denoted 2-radix MRS (p2|p1). The value of i is written as
(a2·p2 + a1)·p1 + a0. The choice of p1 and p2 enables mapping of a given xth bit to
the (a2, a1, a0) position. Note that a0 < p1, a1 < p2 and a2 < M/(p1·p2), where M is the
block size.
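The mixed-radix index map is straightforward; a small sketch (helper names are mine) of X ↔ (a2, a1, a0) for the quoted parameters p1 = 6, p2 = 10 and a 1200-bit block:

```python
# Mixed-radix index map used by the interleaver: bit position X is written as
# X = (a2*p2 + a1)*p1 + a0 with a0 < p1, a1 < p2, a2 < M/(p1*p2).
p1, p2, M = 6, 10, 1200   # values quoted for the 320/400/480 Mb/s rates

def to_mrs(x):
    a0 = x % p1           # least significant mixed-radix digit
    q = x // p1
    return (q // p2, q % p2, a0)       # (a2, a1, a0)

def from_mrs(a2, a1, a0):
    return (a2 * p2 + a1) * p1 + a0

# Round-trip over the whole 1200-bit block, and digit-bound check
assert all(from_mrs(*to_mrs(x)) == x for x in range(M))
assert all(to_mrs(x)[0] < M // (p1 * p2) for x in range(M))
```

In the Figure 9.42b layout, (a2, a1) selects the cell and a0 the color, so this map is exactly the 2D placement described above.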
The whole block is divided into p1 sub-blocks, differently colored. Each
sub-block is divided into p2 sub-sub-blocks. Each sub-sub-block is
shown as a vertical line in Figure 9.42b. The sub-sub-blocks with different colors
are alternately arranged and wired using the connections shown in dotted lines.
Thus, the Xth bit position in the decimal number system is mapped as a 2D array
X = (a2·p2 + a1)·p1 + a0. As an illustration, for data rates of 320, 400 and 480 Mb/s, the
block size is 1200 bits and p1 = 6, p2 = 10. In Figure 9.42b, (a2|a1|a0) is represented as
cell (a2, a1) with color a0. The authors observe that the latency, power consumption and
complexity benefit over a conventional implementation. A latency of six
OFDM symbols could be achieved, with complexity reductions of 85.5 %, 69.4 % and
40.3 % for 80, 200 and 480 Mb/s, respectively. The power consumption reductions were
87.4 %, 73.6 % and 39.8 %, respectively.
Figure 9.42a shows the general representation of the interleaving process. The first
two consecutive modulo permutations given in (9.25a) and (9.25b) are considered
as a single process. In the interleaver architecture shown in Figure 9.42b, the data to
be encoded enters at the lower right cell. Note that there are M cells arranged as
M/(p1·p2) rows, with each row containing p1·p2 cells. These p1·p2 cells are grouped into
p2 groups, each containing p1 cells. After M cycles, the first bit would have reached
the leftmost corner cell. Next, the data is read out as indicated by the dotted lines. Note
that different columns in each group of p1 columns are colored differently. The end
point of one color is connected to the starting point of another color, as indicated by
the A_i's. The de-interleaving process is similar: the input enters the IN Decode
terminal and the output is taken in a similar way at the OUT Decode terminal (leftmost
corner cell). Note finally that the cyclic shift of 33 required by (9.25c) needs connection
of the last point in one p1-cell cluster to the 33rd pin of the next cluster, and similarly
to the 66th pin of the third cluster, etc., as shown in Figure 9.42c. Note that this needs
additional multiplexers in some cells.
Quasi-chaotic (QC) generators based on RNS have been described in the literature
[102]. A cascade of first-order recursive filters exhibits quasi-chaotic behavior in
the absence of input. The computation performed by the ith section is of the form
x_i(k) = (g_i·x_i(k−1) + x_{i−1}(k)) mod m_i,
where u_k = x_0(k) is the input and x_i(k), i = 1, 2, ..., N, is the output of the ith section. The
zero-input response will exhibit periodic behavior with a period coincident with the
LCM of all (m_i − 1). The length of the period can be very large. As an illustration, for
n = 8 and the moduli set {257, 263, 347, 359, 383, 467, 479, 503}, the period will be
256 × 131 × 173 × 179 × 191 × 233 × 239 × 251 ≈ 2.7 × 10^18. The shape of the generator
response is independent of the initial conditions. It depends instead on the g_i values,
called primitives. For a prime modulus M, the number of primitives is f(f(M)), where
f(M) is the number of integers less than M and relatively prime to M. Each choice of
g_i defines a particular state of the generator. The number of possible states of an RNS
filter for an N-section generator is S = ∏_{i=1}^{N} f(f(M_i)). Thus, for the above example we
have S = 128 × 130 × 172 × 178 × 190 × 232 × 238 × 250 ≈ 1.3 × 10^18. The authors sug-
gest, for an 8-bit input, the choice of the moduli set {3, 7, 13}. Each of the three QC
generators has used eight sections with moduli {3, 139, 157, 173, 191, 199,
Figure 9.42 Interleaver architecture for MB-OFDM (a) general representation for interleaving
processes (b) interleaving processor in MRS(p2|p1) (c) design of the cyclic shift (adapted from
[100] ©IEEE 2010)
227, 239}, {7, 149, 163, 179, 193, 211, 229, 241} and {13, 151, 167, 181, 197, 223, 233,
251}, respectively.
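The text does not reproduce the section recursion explicitly; assuming the natural cascade form x_i(k) = (g_i·x_i(k−1) + x_{i−1}(k)) mod m_i with zero input (a plausible reading of [102]; the small moduli and gains below are my own illustrative choices), a short simulation exhibits the stated periodicity:

```python
def qc_period(moduli, gains, state, max_steps=10**6):
    """Zero-input period of a cascade of first-order modular sections
    x_i(k) = (g_i * x_i(k-1) + x_{i-1}(k)) mod m_i, with x_0(k) = u_k = 0."""
    state = list(state)
    start = tuple(state)
    for step in range(1, max_steps + 1):
        feed = 0                         # zero input to the first section
        for i, (m, g) in enumerate(zip(moduli, gains)):
            state[i] = (g * state[i] + feed) % m
            feed = state[i]              # cascade: section i feeds section i+1
        if tuple(state) == start:
            return step
    return None

# Single section mod 7 with primitive root 3: period is 7 - 1 = 6
print(qc_period([7], [3], [1]))   # 6
```

For a cascade, the overall period is a multiple of each section's zero-input period, which is what makes the LCM of the (m_i − 1) values attainable.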
The authors have later suggested a self-correcting communication system [103]
as well. The input sequence containing the message is converted to residues which
enter the QC generator. The outputs of the QC generator drive the self-correcting
transmission system as well as the redundant residues. By using 2r redundant residues,
r errors can be corrected. The decoded residues are converted back at the receiver
using CRT.
References
1. W.K. Jenkins, B.J. Leon, The use of residue number systems in the design of finite impulse
response digital filters. IEEE Trans. Circuits Syst. CAS-24, 191–201 (1977)
2. A. Peled, B. Liu, A new hardware realization of digital filters. IEEE Trans. Acoust. Speech
Signal Process. ASSP-22, 456–462 (1974)
3. H.T. Vergos, A 200 MHz RNS core, in Proceedings of ECCTD, vol. II (2001), pp. 249–252
4. L. Kalampoukas, D. Nikolos, C. Efstathiou, H.T. Vergos, J. Kalamatianos, High speed
parallel-prefix modulo (2^n − 1) adders. IEEE Trans. Comput. 49, 673–680 (2000)
5. S.J. Piestrak, Design of residue generators and multi-operand modulo adders using carry save
adders, in Proceedings of 10th Symposium on Computer Arithmetic (1991), pp. 100–107
6. R. Zimmermann, Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication, in Proceedings of IEEE Symposium on Computer Arithmetic (1999), pp. 158–167
7. A.D. Re, A. Nannarelli, M. Re, Implementation of digital filters in carry save Residue number
system, in Conference Record of 39th Asilomar Conference on Signals, Systems and Com-
puters (2001), pp. 1309–1313
8. G.C. Cardarilli, A.D. Re, A. Nannarelli, M. Re, Impact of RNS coding overhead on FIR filter
performance, in Proceedings of the 41st Asilomar Conference on Circuits, Systems and
Computers (2007), pp. 1426–1429
9. A.D. Re, A. Nannarelli, M. Re, A tool for automatic generation of RTL-level VHDL
description of RNS FIR filters, in Proceedings of Design, Automation and Test in
Europe Conference and Exhibition (2004), pp. 686–687
10. G.C. Cardarilli, A.D. Re, A. Nannarelli, M. Re, Low power low leakage implementation of
RNS FIR filters, in Conference Record of 39th Asilomar Conference on Signals, Systems and
Computers (2005), pp. 1620–1624
11. G.L. Bernocchi, G.C. Cardarilli, A.D. Re, A. Nannarelli, M.Re, A hybrid RNS adaptive filter
for channel equalization, in Proceedings of 40th Asilomar Conference on Signals, Systems
and Computers (2006), pp. 1706–1710
12. G.L. Bernocchi, G.C. Cardarilli, A.D. Re, A. Nannarelli, M. Re, Low-power adaptive filter
based on RNS components, in Proceedings of 2007 I.E. International Symposium on Circuits
and Systems (ISCAS) (2007), pp. 3211–3214
13. R. Conway, J. Nelson, Improved RNS FIR filter architectures. IEEE Trans. Circuits Syst.
Express Briefs 51, 26–28 (2004)
14. A. Wrzyszcz, D. Milford, A new modulo 2^α + 1 multiplier, in IEEE International Conference
on Computer Design: VLSI in Computers and Processors (1993), pp. 614–617
15. C.-L. Wang, New bit-serial VLSI implementation of RNS FIR digital filters. IEEE Trans.
Circuits Syst II. Analog Digit. Signal Process. 41, 768–772 (1994)
16. B. LaMacchia, G. Redinbo, RNS digital filtering structures for wafer-scale integration. IEEE
J. Sel. Areas Commun. 4, 67–80 (1986)
17. J.C. Bajard, L.S. Didier, T. Hilaire, ρ-Direct form transposed and Residue Number systems
for filter implementations, in IEEE 54th International Midwest Symposium on Circuits and
Systems (MWSCAS) (2011), pp. 1–4
18. P. Patronik, P.K. Berezowski, S.J. Piestrak, J. Biernat, A. Shrivastava, Fast and energy-
efficient constant-coefficient FIR filters using residue number system, in 2011 International
Symposium on Low Power Electronics and Design (ISLPED) (2011), pp. 385–390
19. J.H. Choi, N. Banerjee, K. Roy, Variation-aware low-power synthesis methodology for fixed-
point FIR filters. IEEE Trans. CAD 28, 87–97 (2009)
20. R. Patel, M. Benaissa, N. Powell, S. Boussakta, Novel power-delay-area efficient approach to
generic modular addition. IEEE Trans. Circuits Syst. I 54, 1279–1292 (2007)
21. L. Aksoy, E. Da Costa, P. Flores, J. Monteiro, Exact and approximate algorithms for the
optimization of area and delay in multiple constant multiplications. IEEE Trans. CAD 27,
1013–1026 (2008)
22. A. Garcı́a, U. Meyer-Bäse, F.J. Taylor, Pipelined Hogenauer CIC filters using field-
programmable logic and residue number system, in Proceedings of IEEE International
Conference on Acoustics, Speech, and Signal Processing, Seattle, vol. 5 (1998),
pp. 3085–3088
23. M. Griffin, M. Sousa, F. Taylor, Efficient scaling in the residue number System, in Pro-
ceedings of IEEE ASSP (May 1989), pp. 1075–1078
24. G.C. Cardarilli, R. Lojacono, M. Salerno, F. Sargeni, VLSI RNS implementation of fast IIR
filters, in Proceedings of 35th Midwest Symposium on Circuits and Systems (1992),
pp. 1245–1248
25. W.K. Jenkins, Recent advances in residue number techniques for recursive digital filtering.
IEEE Trans. Acoust. Speech Signal Process. 27, 19–30 (1979)
26. M. Etzel, W. Jenkins, The design of specialized residue classes for efficient recursive digital
filter realization. IEEE Trans. Acoust. Speech Signal Process. 30, 370–380 (1982)
27. A. Nannarelli, G.C. Cardarilli, M. Re, Power-delay trade-off in residue number system. Proc.
IEEE ISCAS 5, 413–416 (2003)
28. G.C. Cardarilli, A. Del Re, A. Nannarelli, M. Re, Residue number system reconfigurable data
path, in Proceedings of IEEE ISCAS, vol. II, (2002), pp. 756–759
29. G.C. Cardarilli, A. Nannarelli, M. Re, Residue number system for low-power DSP applica-
tions, in 41st Asilomar Conference (2007), pp. 1412–1416
30. A. Nannarelli, M. Re, G. C. Cardarilli, Trade-offs between Residue number system and
traditional FIR filters, in Proceedings of IEEE ISCAS (2001), pp. 305–308
31. M.N. Mahesh, M. Mehendale, Low-power realization of residue number system based FIR
filters, in 13th International Conference on VLSI Design, Bangalore (May 2001), pp. 350–353
32. W.L. Freking, K.K. Parhi, Low-power FIR digital filters using residue arithmetic, in Confer-
ence Record of 31st Asilomar Conference on Signals, Systems and Computers (ACSSC 1997),
Pacific Grove, vol. 1 (1997), pp. 739–743
33. M.K. Ibrahim, A note on digital filter implementation using hybrid RNS-binary arithmetic.
Signal Process. 40, 287–294 (1994)
34. B. Parhami, A note on digital filter implementation using hybrid RNS-binary arithmetic.
Signal Process. 41, 65–67 (1996)
35. G.C. Cardarilli, M. Re, R. Lojacano, A new RNS FIR filter architecture, in Proceedings of
13th International Conference on Digital Signal Processing, DSP 97 (1997), pp. 671–674
36. P.L. Montgomery, Modular multiplication without trial division. Math. Comput. 44, 519–521
(1985)
37. K.G. Smitha, A.P. Vinod, A reconfigurable high-speed RNS FIR channel filter for multi-
standard software radio receivers, in Proceedings of IEEE ICCS (2008), pp. 1354–1358
38. R. Conway, Efficient residue arithmetic based parallel fixed coefficient FIR filters, in Pro-
ceedings of IEEE ISCAS (2008), pp. 1484–1487
39. I. Lee, W.K. Jenkins, The design of residue number system arithmetic units for a VLSI
adaptive equalizer, in IEEE Proceedings of 8th Great Lakes Symposium (1998), pp. 179–184
40. A. Shah, M. Sid-Ahmed, G. Jullien, A proposed hardware structure for two-dimensional
recursive digital filters using the residue number system. IEEE Trans. Circuits Syst. 32,
285–288 (1985)
41. G.A. Jullien, Residue number scaling and other operations using ROM arrays, in IEEE
Transactions on Computers, vol. 27, no. 4 (1978), pp. 325–337
42. N.R. Shanbag, R.E. Siferd, A single-chip pipelined 2-D FIR filter using residue arithmetic.
IEEE J. Solid State Circuits 26, 796–805 (1991)
43. M.A. Soderstrand, A high-speed low-cost recursive digital filter using residue number
arithmetic. Proc. IEEE 65, 1065–1067 (1977)
44. L.T. Bruton, Low-sensitivity digital ladder filters. IEEE Trans. Circuits Syst. CAS-22,
168–176 (1975)
45. F.J. Taylor, C.H. Huang, An auto-scale residue multiplier. IEEE Trans. Comput. C-31,
321–325 (1982)
46. R. Ramnarayan, F. Taylor, On large moduli residue number system recursive digital filters.
IEEE Trans. Circuits Syst. 32, 349–359 (1985)
47. J. Chen, J. Hu, Energy-efficient digital signal processing via voltage-overscaling-based
Residue Number System. IEEE Trans. VLSI Syst. 21, 1322–1332 (2013)
48. Z. Luan, X. Chen, N. Ge, Z. Wang, Simplified fault tolerant FIR filter architecture based on
redundant residue number system. Electron. Lett. 50, 1768–1770 (2014)
49. Z. Gao, P. Reviriego, W. Pan, Z. Xu, M. Zhao, J. Wang, J.A. Maestro, Efficient arithmetic-
residue-based SEU-tolerant FIR filter design. IEEE Trans. Circuits Syst. II 60, 497–501
(2013)
50. R. Chaves, L. Sousa, RDSP: a RISC DSP based on residue number system, in Proceedings of
Euromicro Symposium on Digital System Design: Architectures, Methods, and Tools,
Antalya, Turkey (2003), pp. 128–135
71. M.J. Narasimha, A.M. Peterson, On the computation of the discrete cosine transform. IEEE
Trans. Commun. 26, 934–946 (1978)
72. Y. Arai, T. Agui, M. Nakajima, A fast DCT-SQ scheme for images. Trans. IEICE 71,
1095–1097 (1988)
73. F.J. Taylor, An RNS discrete Fourier transform implementation. IEEE Trans. Acoust. Speech
Signal Process. 38, 1386–1394 (1990)
74. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice-Hall,
Englewood Cliffs, 1979)
75. B.D. Tseng, G.A. Jullien, W.C. Miller, Implementation of FFT structures using the residue
number system. IEEE Trans. Comput. C-28, 831–845 (1979)
76. F.J. Taylor, G. Papadorakis, A. Skavantzos, T. Stouraitis, A radix-4 FFT using complex RNS
arithmetic. IEEE Trans. Comput. 34, 573–576 (1985)
77. G.A. Jullien, M. Taheri, J. Carr, G. Thomsen, W.C. Miller, A VLSI systolic quadratic residue
DFT with fault tolerance, in Proceedings of ISCAS (1988), pp. 2271–2274
78. A.Z. Baraniecka, G.A. Jullien, Residue number system implementations of number theoretic
transforms in complex residue rings. IEEE Trans. Acoust. Speech Signal Process. 28,
285–291 (1980)
79. R. Krishnan, G. Jullien, W. Miller, Implementation of complex number theoretic transforms
using quadratic residue number systems. IEEE Trans. Circuits Syst. 33, 759–766 (1986)
80. V.S. Dimitrov, G.A. Jullien, W.C. Miller, A residue number system implementation of real
orthogonal transforms. IEEE Trans. Signal Process. 46, 563–570 (1998)
81. H.C. Shyu, T.K. Truong, I.S. Reed, A complex integer multiplier using the quadratic
polynomial residue number system with numbers of the form 2^{2n} + 1. IEEE Trans. Comput.
C-36, 1255–1258 (1987)
82. M. Abdallah, A. Skavantzos, On the binary quadratic residue number system with
non-coprime moduli. IEEE Trans. Signal Process. 45, 2085–2091 (1997)
83. Y. Liu, E.M.K. Lai, Design and implementation of an RNS-based 2-D DWT processor. IEEE
Trans. Consum. Electron. 50, 376–385 (2004)
84. J. Ramirez, A. Garcia, P.G. Frernandez, L. Parrilla, A. Lloris, RNS-FPL merged architecture
for orthogonal DWT. Electron. Lett. 36, 1198–1199 (2000)
85. T. Toivonen, J. Heikkilä, Video filtering with Fermat number theoretic transforms using
residue number system. IEEE Trans. Circuits Syst. Video Technol. 16, 92–101 (2006)
86. L.L. Yang, L. Hanzo, Residue number system based multiple-code DS-CDMA system, in
IEEE 49th Vehicular Technology Conference, vol. 2 (1999), pp. 1450–1454
87. L.L. Yang, L. Hanzo, Ratio statistic assisted residue number system based parallel communication scheme, in IEEE 49th Vehicular Technology Conference, vol. 2 (1999), pp. 894–898
88. L.L. Yang, L. Hanzo, A residue number system based parallel communication scheme using
orthogonal signaling: part I—system outline. IEEE Trans. Veh. Technol. 51, 1534–1546
(2002)
89. L.L. Yang, L. Hanzo, Residue number system assisted fast frequency-hopped synchronous
ultra-wideband spread-spectrum multiple-access: a design alternative to impulse radio. IEEE
J. Sel. Areas Commun. 20, 1652–1663 (2002)
90. H. Krishna, K.Y. Lin, J.D. Sun, A coding theory approach to error control in redundant
residue number systems. I. Theory and single error correction. IEEE Trans. Circuits Syst. II
Analog Digit. Signal Process. 39, 8–17 (1992)
91. H. Krishna, J.D. Sun, On theory of fast algorithms for error correction in residue number
systems. IEEE Trans. Comput. C-42, 840–852 (1993)
92. J. Ramirez, A. Garcia, U. Meyer-Base, A. Lloris, Fast RNS FPL based communications
receiver design and implementation, in Proceedings of the 12th International Conference,
FPL 2002 Montpellier, France (2–4 September 2002), pp. 472–481
93. A. Chren, RNS-based enhancements for direct digital frequency synthesis. IEEE Trans.
Circuits Syst. II Analog Digit. Signal Process. 42, 516–524 (1995)
94. P.V. Ananda Mohan, On RNS-based enhancements for direct digital frequency synthesis.
IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 48, 988–990 (2001)
95. A.S. Madhukumar, F. Chin, Enhanced architecture for residue number system-based CDMA
for high-rate data transmission. IEEE Trans. Wirel. Commun. 3, 1363–1368 (2004)
96. S. Zhang, L.L. Yang, Y. Zhang, Redundant residue number system assisted multicarrier
direct-sequence code-division dynamic multiple access for cognitive radios. IEEE Trans.
Veh. Technol. 61, 1234–1250 (2012)
97. Y. Yi, H. Jian-Hao, RNS based OFDM transmission scheme with low PARR, in Proceedings
of International Conferences on Computational Problem Solving (2011), pp. 326–329
98. H.T. How, T.H. Liew, Ee-Lin Kuan, L.L. Yang, L. Hanzo, A redundant residue number
system coded burst-by-burst adaptive joint-detection based CDMA speech transceiver, IEEE
Trans. Veh. Technol. 55, 387–396 (2006)
99. D. Zhu, B. Natarajan, Residue number system arithmetic-inspired hopping-pilot pattern
design. IEEE Trans. Veh. Technol. 59, 3679–3683 (2010)
100. Y. Han, P. Harliman, S.W. Kim, J.K. Kim, C. Kim, A novel architecture for block interleav-
ing algorithm in MB-OFDM using mixed radix system, in IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 18 (2010), pp. 1020–1024
101. WiMedia Alliance, MAC-PHY interface specification 1.0. (2005), http://www.Wimedia.org
102. M. Panella, G. Martinelli, RNS quasi-chaotic generators. Electron. Lett. 36, 1325–1326
(2000)
103. M. Panella, G. Martinelli, RNS quasi-chaotic generator for self-correcting secure communi-
cation. Electron. Lett. 37, 325–327 (2001)
Further Reading
A. Bertossi, A. Mei, A residue number system on reconfigurable mesh with application to prefix
sums and approximate string matching. IEEE Trans. Parallel Distrib. Syst. 11, 1186–1199
(2000)
E. Kinoshita, K.J. Lee, A residue arithmetic extension for reliable scientific computation. IEEE
Trans. Comput. 46, 129–138 (1997)
Chapter 10
RNS in Cryptography
The need for securely accessing the information and protecting the information
from unauthorized persons is well recognized. These needs can be met by using
Encryption algorithms and Authentication algorithms [1, 2]. Encryption can be
achieved by using block ciphers or stream ciphers. The information, considered as
fixed blocks of data, e.g. 64-bit, 128-bit, etc., can be mapped into another 64-bit
or 128-bit block under the control of a key. On the other hand, stream ciphers
generate a random sequence of bits which can be used to mask a plain-text bit stream
(by performing a bit-wise exclusive-OR operation). Several techniques exist for block
and stream cipher implementation. These are called symmetric-key systems,
since the receiver uses for decryption the same key used for the block or
stream cipher at the transmitter. This key has to be made available to the
receiver somehow, by prior arrangement or by using key exchange algorithms,
e.g. the Diffie-Hellman key exchange. The other requirement, authentication of a
source, is met using several techniques. The most notable among these is based on
the RSA (Rivest Shamir Adleman) algorithm. This algorithm is the workhorse of
public-key cryptography. As against symmetric-key systems, in this case two
keys are needed, known as the public key and the private key. The strength of these systems
is derived from the difficulty of factoring large numbers which are products of two
big primes.
The RSA algorithm is briefly described next. First, Alice chooses two primes p and q.
The product of the primes n = p·q is made public, since it is extremely difficult to
find p and q given n. Next, Alice defines φ(n) = (p − 1)(q − 1), which gives the
number of integers less than n and prime to n. Next, Alice chooses a value e denoted
as the encryption key (public key). Alice then computes a decryption key (private key)
d such that e·d = 1 mod φ(n). The private key d is not disclosed to anybody.
Alice can now encrypt a message m by obtaining C = m^d mod n, where C stands
for the cipher text corresponding to m. Anybody (say Bob, the intended recipient, or
anybody else) who has knowledge of the public key of Alice can now find m by
computing C^e mod n = m. This method is useful for confirming that only Alice could
have sent this message, since a meaningful message could be obtained by decryption.
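The scheme can be traced end-to-end with deliberately small textbook primes (real deployments use 1024- to 4096-bit moduli; the toy values below are for illustration only):

```python
# Toy RSA signature round-trip: p and q are far too small for real use.
p, q = 61, 53
n = p * q                      # 3233, made public
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public key, coprime to phi
d = pow(e, -1, phi)            # private key: e*d = 1 mod phi (Python >= 3.8)

m = 65                         # message, must satisfy m < n
C = pow(m, d, n)               # Alice "encrypts" with her private key
recovered = pow(C, e, n)       # anyone verifies with the public key
assert recovered == m
print(n, d)                    # 3233 2753
```

Because only Alice knows d, recovering a meaningful m from C with her public e authenticates her as the sender.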
We first consider Barrett's technique [3, 4] for computing r = x mod M, given x and M in base b, where x = x_{2k−1}…x_1x_0 and M = m_{k−1}…m_1m_0 with m_{k−1} ≠ 0. The radix b is typically chosen to be the word length of the processor; we assume b > 3 herein. Barrett's technique requires pre-computation of the parameter μ = ⌊b^{2k}/M⌋. First, we find consecutively q_1 = ⌊x/b^{k−1}⌋, q_2 = q_1·μ and q_3 = ⌊q_2/b^{k+1}⌋. Next, we compute r_1 = x mod b^{k+1}, r_2 = (q_3·M) mod b^{k+1} and r = r_1 − r_2. If r < 0, then r = r + b^{k+1}, and while r ≥ M, r = r − M. Note that the divisions are simple right shifts of the base-b representation. In the multiplication q_2 = q_1·μ, the (k + 1) LSBs are not required to determine q_3, except for determining the carry from position (k + 1) to (k + 2). Hence, the k − 1 least significant digits of q_2 need not be computed. Similarly, r_2 = q_3·M can also be simplified as a partial multiple-precision multiplication which evaluates only the least significant (k + 1) digits of q_3·M. Note that r_2 can be computed using at most k(k + 1)/2 + k single-precision multiplications. Since μ and q_1 have at most (k + 1) digits each, determining q_3 needs at most (k + 1)² − k(k − 1)/2 = (k² + 5k + 2)/2 single-precision multiplications. Note that q_2 is needed only for computing q_3. It can be shown that 0 ≤ r < 3M before the final correction steps.
As an illustration, consider finding 121 mod 13, with b = 2. Evidently k = 4. We obtain μ = ⌊2^8/13⌋ = 19, q_1 = ⌊121/2^3⌋ = 15, q_2 = 15 × 19 = 285 and q_3 = ⌊285/2^5⌋ = 8. Thus, r_1 = 121 mod 2^5 = 25, r_2 = (8 × 13) mod 2^5 = 8 and r = 17. Since r > 13, we have r = r − 13 = 4.
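A short sketch in Python (function and variable names are ours) reproduces these steps and the worked example above:

```python
# Barrett reduction sketch: r = x mod M using only shifts, multiplications
# and a pre-computed mu = floor(b^(2k) / M); here shown for b = 2.
def barrett_reduce(x, M, k, b=2):
    mu = b**(2 * k) // M            # pre-computed once per modulus
    q1 = x // b**(k - 1)
    q2 = q1 * mu
    q3 = q2 // b**(k + 1)           # quotient estimate
    r = x % b**(k + 1) - (q3 * M) % b**(k + 1)
    if r < 0:
        r += b**(k + 1)
    while r >= M:                   # final correction, since r < 3M
        r -= M
    return r
```

For the example, `barrett_reduce(121, 13, 4)` traces through mu = 19, q1 = 15, q2 = 285, q3 = 8, r1 = 25, r2 = 8 and returns 4.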
Barrett's algorithm estimates the quotient q for b = 2 in the general case as

q̂ = ⌊ ⌊X/2^{k+β}⌋ · ⌊2^{k+α}/M⌋ / 2^{α−β} ⌋   (10.1a)

where α and β are two parameters. The value μ = ⌊2^{k+α}/M⌋ can be pre-computed and stored. Several attempts have been made to avoid the last modulo reduction operation. Dhem [5] has suggested α = w + 3, β = −2 for radix 2^w, so that the maximum error in computing q is 1. Barrett has used α = n, β = −1. The classical modular multiplication algorithm to find (X·Y) mod M is presented in Figure 10.1, where multiplication and reduction are integrated. Note that step 4 uses (10.1a).
Quisquater [7] and other authors [8] have suggested writing the quotient as

q̂ = ⌊ X·⌊2^{k+c}/M⌋ / 2^{k+c} ⌋   (10.1b)
Figure 10.1 High-radix classical modulo multiplication algorithm (adapted from [6] ©IEEE2010)
Knezevic et al. [6] have observed that the performance of Barrett reduction can be improved by choosing moduli of the form (2^n − Δ) in set S₁ or (2^{n−1} + Δ) in set S₂, where in each case Δ > 0 is bounded above by a small quantity determined by the parameter α. In such cases, the value of q̂ in (10.1a) can be computed as

q̂ = ⌊Z/2^n⌋ if M ∈ S₁, or q̂ = ⌊Z/2^{n−1}⌋ if M ∈ S₂   (10.1c)
This modification does not need any computation, unlike (10.1b). Since many recommendations such as SEC (Standards for Efficient Cryptography), NIST (National Institute of Standards and Technology) and ANSI (American National Standards Institute) use such primes, the above method is useful.
Brickell [9, 10] has introduced a concept called the carry-delayed adder. This comprises a conventional carry-save adder whose carry and sum outputs are added in another level of CSA comprising half-adders. The result in carry-save form has the interesting property that a sum bit and the next carry bit are never both '1'. As an illustration, consider the following example:

A = 40  101000
B = 25  011001
C = 20  010100
S = 37  100101
C = 48  0110000
T = 21  010101
D = 64  1000000

The output (D, T) is called a carry-delayed number or carry-delayed integer. It may be checked that T_i·D_{i+1} = 0 for all i = 0, …, k − 1.
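A sketch of the two adder levels (function names are ours) reproduces the example and checks the stated property:

```python
# Carry-delayed adder sketch: a 3:2 carry-save stage followed by a
# half-adder stage. The resulting pair (D, T) satisfies T_i * D_(i+1) = 0.
def csa(a, b, c):
    # full-adder (3:2) level: bitwise sum and left-shifted majority carry
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def carry_delayed(a, b, c):
    s, carry = csa(a, b, c)
    t = s ^ carry                    # half-adder level: sum bits
    d = (s & carry) << 1             # half-adder level: delayed carries
    return d, t

d, t = carry_delayed(40, 25, 20)     # the example: 40 + 25 + 20 = 85
assert (d, t) == (64, 21) and d + t == 85
assert (t & (d >> 1)) == 0           # T_i and D_(i+1) are never both 1
```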
10.2 Montgomery Modular Multiplication 267
Brickell [9] has used this concept to perform modular multiplication. Consider computing P = A·B mod M, where A is a carry-delayed integer:

A = Σ_{i=0}^{k−1} (T_i + D_i)·2^i

so that

A·B = 2^0·T_0·B + 2^1·D_1·B + 2^1·T_1·B + 2^2·D_2·B + 2^2·T_2·B + 2^3·D_3·B + … + 2^{k−2}·T_{k−2}·B + 2^{k−1}·D_{k−1}·B + 2^{k−1}·T_{k−1}·B

Since either T_i or D_{i+1} is zero due to the delayed-carry adder, each step requires a shift of B and the addition of at most two carry-delayed integers:

either (P_d, P_t) = (P_d, P_t) + 2^i·T_i·B or (P_d, P_t) = (P_d, P_t) + 2^{i+1}·D_{i+1}·B
Multiplication can take one form and reduction can take another form, even in the integrated approach.
As such, we have five techniques: (a) separated operand scanning (SOS), (b) coarsely integrated operand scanning (CIOS), (c) finely integrated operand scanning (FIOS), (d) finely integrated product scanning (FIPS) and (e) coarsely integrated hybrid scanning (CIHS). The word multiplications needed in all these techniques are (2s² + s), whereas the word additions are (6s² + 4s + 2) for FIPS, (4s² + 4s + 2) for SOS, CIOS and CIHS, and (5s² + 3s + 2) for FIOS.
In the SOS technique, we first obtain the product (A·B) as a 2s-word integer t. Next, we compute u = (t + m·n)/r, where m = (t·n′) mod r and n′ = (−n^{−1}) mod r. We first take u = t and add m·n to it using a standard multiplication routine. We divide the result by 2^{sw}, which we accomplish by ignoring the least significant s words. The reduction actually proceeds word by word using n′₀ = n′ mod 2^w. Each time, the result is shifted right by one word, implying division by 2^w. The number of word multiplications is (2s² + s).
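The SOS reduction loop can be sketched as follows, with word size w = 8 bits chosen for readability (names and operand sizes are ours):

```python
# SOS (separated operand scanning) Montgomery method, sketched in Python.
W = 8
BASE = 1 << W

def sos_montgomery(A, B, n, s):
    """Return A*B*r^(-1) mod n for odd n, with r = 2^(s*W) and A, B < n."""
    n0_prime = (-pow(n, -1, BASE)) % BASE    # n0' = -n^{-1} mod 2^w
    t = A * B                                # step 1: full 2s-word product
    for i in range(s):                       # step 2: zero the s low words
        t_i = (t >> (W * i)) & (BASE - 1)    # current least significant word
        m = (t_i * n0_prime) % BASE          # multiple of n to add
        t += (m * n) << (W * i)              # word i of t becomes zero
    u = t >> (W * s)                         # exact division by r
    if u >= n:                               # result is below 2n
        u -= n
    return u
```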
The CIOS technique [11, 12] improves on the SOS technique by integrating the multiplication and reduction steps. Here, instead of computing the complete (A·B) and then reducing it, we alternate between the iterations of the outer loops for multiplication and reduction. Consider an example with A and B each comprising four words, a3, a2, a1, a0 and b3, b2, b1, b0, respectively. First, a0·b0 is computed; we denote the result as cout0 and tout00, where tout00 is the least significant word and cout0 is the most significant word. In the second cycle, two operations are performed simultaneously: we multiply tout00 with n0′ to get m0, and also compute a1·b0 and add cout0 to obtain cout1, tout01. At this stage, we know the multiple of N to be
added to make the least significant word zero. In the third cycle, a2b0 is computed
and added to cout1 to obtain cout2, tout02 and in parallel m0n0 is computed and added
to tout00 to obtain cout3. In the fourth cycle, a3b0 is computed and added with cout2
to get cout4 and tout03 and simultaneously m0n1 is computed and added with cout3
and tout01 to obtain cout5 and tout10. Note that the multiplication with b0 is
completed at this stage, but reduction is lagging behind by two cycles. In the fifth
cycle, a0b1 is computed and added with tout10 to get cout7 and tout20 and simul-
taneously m0n2 is computed and added with cout5 and tout02 to obtain cout6 and
tout11. In addition, cout4 is added to get tout04 and tout05. In the sixth cycle, a1b1 is
computed and added with cout7, tout11 to get cout9 and tout21 and simultaneously
m0n3 is computed and added with cout6 and tout03 to obtain cout8 and tout12. In
addition, tout2 is multiplied with n0 to get m1. In this way, the computation proceeds
and totally 18 cycles are needed.
The FIOS technique integrates the two inner loops of the CIOS method by computing both the addition and multiplication in the same loop. In each iteration, X₀·Y_i is calculated and the result is added to Z. Using Z₀, we calculate T as T = ((Z₀ + X₀·Y_i)·(−M^{−1})) mod r. Next, we add M·T to Z. The least significant word Z₀ of Z will be zero, and hence division by r is exact and performed by a simple right shift. The number of word multiplications in each step is (2s + 1); hence, in total, (2s² + s) word multiplications are needed and (2s² + s) cycles are needed on a w-bit processor. The addition operations need additional cycles.
Note that in the CIHS technique, the right half of the partial product summation of the conventional n × n multiplier is performed first, and the carries flowing beyond the s words are saved. In the second loop, the least significant word t0 is multiplied by n0′ to obtain the value of m0. Next, the modulus word n0 is multiplied with m0 and added to t0. This will make the LSBs zero. The multiplications of m0 with n1, n2, etc., and additions with t1, t2, t3, etc., are carried out in the next few cycles. Simultaneously, the multiplications needed for forming the partial products beyond s words are carried out, and the results are added to the carries obtained and saved in the first step, as well as to the words obtained by multiplying mi with nj. The mi values are computed as soon as the needed information is available. Thus, the CIHS algorithm integrates the multiplication with the addition of m·n. For a 4 × 4-word multiplication, the first loop takes 7 cycles and the second loop takes 19 cycles. The reader may refer to [13] for a complete description of the operation.
In the FIPS algorithm also, the computations of a·b and m·n are interleaved. There are two loops. The first loop computes one part of the product a·b and then adds m·n to it. Each iteration of the inner loop executes two multiply-accumulate operations of the form a·b + S, i.e. the products a_j·b_{i−j} and m_j·n_{i−j} are added to a cumulative sum. The cumulative sum is stored in three single-precision words t[0], t[1] and t[2], where the triple (t[0], t[1], t[2]) represents t[2]·2^{2w} + t[1]·2^w + t[0]. These registers are thus used as a partial product accumulator for the products a·b and m·n. This loop computes the words of m using n0′ and then adds the least significant word of m·n to t. The second loop completes the computation by forming the final result u word by word in the memory space of m.
Walter [14] has suggested a technique for computing (A·B·r^{−n}) mod M, where A < 2M, B < 2M and 2M < r^{n−1}, r being the radix with r ≥ 2, so that S < 2M for all possible outputs S. (Note that n is the upper bound on the number of digits in A, B and M.) Note also that a_{n−1} = 0. Each step computes S = (S + a_i·B + q_i·M) div r, where q_i = ((s_0 + a_i·b_0)·(−m_0^{−1})) mod r. It can be verified that S < (M + B) till the last but one step. Thus, the final output is bounded: S < 2M. Note that in the last step of an exponentiation, a multiplication by 1 is needed to remove the scaling by 2^n mod M; a Montgomery step can achieve this. Here also, note that S·r^n = Ã + Q·M, with Q at most r^n − 1, where Ã = (A·r^n) mod M is the Montgomery form of A. Therefore, since Ã < 2M, we have S·r^n < (r^n + 1)·M and hence S ≤ M, needing no final subtraction. The advantage here is that the cycle time is independent of the radix.
Orup [15] has suggested a technique for avoiding the modulo multiplication needed to obtain the q value in high-radix Montgomery modulo multiplication. Orup suggests scaling the modulus M to M̃ = M·M′, where M′ = (−M^{−1}) mod 2^k, considering radix 2^k, so that q is obtained as q_i = (S_i + b_i·A) mod 2^k, since (−M̃^{−1}) mod 2^k = 1. Thus, only (b_i·A) mod 2^k needs to be added to the k LSBs of S_i:
270 10 RNS in Cryptography
S_{i+1} = (S_i + q_i·M̃ + b_i·A) div 2^k   (10.2a)

The number of iterations is increased by one to compensate for the extra factor 2^k.
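A sketch of this quotient-free iteration (names ours; radix and operand sizes illustrative):

```python
# Orup-style high-radix Montgomery multiplication, radix 2^k. The modulus
# is scaled to Mt = M' * M with M' = -M^{-1} mod 2^k, so Mt = -1 (mod 2^k)
# and the quotient digit is simply the k LSBs of the running sum.
def orup_montgomery(A, B, M, k, n):
    """Return A*B*2^(-k*(n+1)) mod M; B has at most n radix-2^k digits."""
    R = 1 << k
    Mp = (-pow(M, -1, R)) % R
    Mt = Mp * M                          # scaled modulus, Mt = -1 (mod 2^k)
    S = 0
    for i in range(n + 1):               # one extra pass for the 2^k factor
        b_i = (B >> (k * i)) & (R - 1)   # digit is 0 on the extra pass
        S = S + b_i * A
        q = S % R                        # quotient digit: just the k LSBs
        S = (S + q * Mt) >> k            # exact division by 2^k
    return S % M
```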
McIvor et al. [16] have suggested a modification of Montgomery modular multiplication (A·B·2^{−k} mod M) using 5:2 and 4:2 carry-save adders. Note that A, B and S are considered to be in carry-save form, denoted by the vectors A1, A2, B1, B2, S1 and S2. Specifically, the q_i determination and the estimation of the sum S are based on the following equations:

q_i = (S1[i]_0 + S2[i]_0 + A_i·(B1_0 + B2_0)) mod 2   (10.3a)

and

(S1[i+1], S2[i+1]) = CSR(S1[i] + S2[i] + A_i·(B1 + B2) + q_i·M) div 2   (10.3b)

Note that S1[0] = 0 and S2[0] = 0. In other words, the sum is kept in redundant, carry-save form (CSR). The second step uses a 5:2 CSA. In an alternate algorithm ((10.4a) and (10.4b)), the q_i computation is the same as in (10.3a), but a 4:2 CSA suffices, with separate expressions for the four cases of (A_i, q_i) being (0, 0), (0, 1), (1, 0) and (1, 1).
The advantage of this technique is that the lengthy and costly conventional additions are avoided, thereby reducing the critical path. Only (n + 1) cycles are needed in the case of (10.3a) and (10.3b), and (n + 2) cycles are needed in the case of (10.4a) and (10.4b). The critical path in the case of (10.3a) and (10.3b) is 3Δ_FA + 2Δ_XOR + Δ_AND, whereas in the case of (10.4a) and (10.4b), it is 2Δ_FA + Δ_4:1MUX + 2Δ_XOR + Δ_AND. Note that k steps are needed, where k is the number of bits in M, A and B.
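For reference, the underlying radix-2 Montgomery iteration that these carry-save architectures accelerate can be sketched in plain integer arithmetic (without the redundant representation):

```python
# Bit-serial radix-2 Montgomery multiplication (reference model).
def montgomery_radix2(A, B, M, k):
    """Return A*B*2^(-k) mod M for odd M, with A, B < M < 2^k."""
    S = 0
    for i in range(k):
        a_i = (A >> i) & 1
        S = S + a_i * B
        q_i = S & 1                  # since M is odd, q_i = S mod 2
        S = (S + q_i * M) >> 1       # exact division by 2; S stays < 2M
    if S >= M:
        S -= M
    return S
```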
Nedjah and Mourelle [17] have described three hardware architectures for binary Montgomery multiplication and exponentiation. The sequential architecture uses two Systolic Array Montgomery Modular Multipliers (SAMMMs) to perform multiplication followed by squaring, whereas the parallel architecture uses two systolic modular multipliers in parallel to perform squaring and multiplication (see Figure 10.2a, b).
In the sequential architecture, two SAMMMs are used, each needing five registers. The controller controls the number of iterations needed depending on the exponent; note, however, that one of the multipliers is not necessary. In the parallel architecture, more hardware is needed, since multiplication and squaring use different hardware blocks, and eight registers are required. The systolic linear architecture using m e-PEs (E-cells), shown in Figure 10.2c, where m is the number of bits in M, contains two SAMMMs, one of which performs squaring while the other performs multiplication. These e-PEs together perform left-to-right binary modular exponentiation.
Note that a front-end and a back-end SAMMM are needed to do the pre-computation and post-computation required in the Montgomery algorithm (see Figure 10.2d). The front-end multiplies the operands by 2^{2n} mod M, and the post-Montgomery multiplication multiplies by '1' to get rid of the factor 2^n from the result. The basic PE realizes the Montgomery step of computing R + a_i·B + q_i·M, where q_i = (r_0 + a_i·b_0) mod 2. Note that depending on a_i and q_i, four possibilities exist: (1) a_i = 1, q_i = 1: add M + B; (2) a_i = 1, q_i = 0: add B; (3) a_i = 0, q_i = 1: add M; (4) a_i = 0, q_i = 0: no addition.
The authors suggest pre-computing M + B only once, denoting it MB. Thus, using a 4:1 multiplexer, either MB, B, M or 0 is selected to be added to R. This reduces the cell hardware to a full-adder, a 4:1 MUX and a few gates to control the MUX. Some of the cells on the border can be simplified (see Figure 10.2e, showing the systolic architecture, which uses the general PE of Figure 10.2f). The authors show that sequential exponentiation needs the least area, whereas systolic exponentiation needs the most. Sequential exponentiation takes the longest time, whereas systolic exponentiation takes the least computation time. The AT (area × time) product is lowest for the systolic architecture and highest for the parallel implementation.
Shieh et al. [18] have described a new algorithm for Montgomery modular multiplication. They extend the technique of Yang et al. [19], in which A·B is first computed. This 2k-bit word is written as MH·2^k + ML. Hence (A·B·2^{−k}) mod N = (MH + ML·2^{−k}) mod N. The second part can be computed and the result added to MH to
Figure 10.2 (a) Parallel (b) sequential and (c) systolic linear architectures for Montgomery multiplier (d) architecture of the exponentiator (e) systolic architecture and (f) basic PE architecture (adapted from [17] ©IEEE 2006)
obtain the result. Denoting ML(i) as the ith bit of ML, the reduction process in each step of the Montgomery algorithm finds

q(i) = (S + ML(i)) mod 2,  S = (S + ML(i) + q(i)·N)/2   (10.5a)

or, in carry-save form,

q(i) = (SC + SS + ML(i)) mod 2,  (SC, SS) = (SC + SS + ML(i) + q(i)·N)/2   (10.5b)

and, alternatively,

S = (S + ML(i) + 2q(i)·N′)/2   (10.5c)

where N′ = (N + 1)/2. The advantage is that 2q(i)·N′ and ML(i) can be concatenated as a single operand, thus decreasing the number of terms to be added to 3 instead of 4. The authors also suggest "quotient pipelining", deferring the use of the computed q(i) to the next iteration. Thus, we need to modify (10.5c) as
M = M/2 + A·B(i), ML(i) = M mod 2, S = S/2 + ML(i − 1) + 2q(i − 2)·N″, q(i − 1) = S mod 2   (10.6)
1  S = 0
2  for i = 0 to n − 1
3      (Ca, S(0)) := x_i·Y(0) + S(0)
4      if S_0(0) = 1 then
5          (Cb, S(0)) := S(0) + M(0)
6          for j = 1 to e
7              (Ca, S(j)) := Ca + x_i·Y(j) + S(j)
8              (Cb, S(j)) := Cb + M(j) + S(j)
9              S(j−1) := (S_0(j), S(j−1)_{w−1..1})
10         end for
11     else
12         for j = 1 to e
13             (Ca, S(j)) := Ca + x_i·Y(j) + S(j)
14             S(j−1) := (S_0(j), S(j−1)_{w−1..1})
15         end for
       end if
16     S(e) = 0
   end for
where M = (0, M_{e−1}, …, M₁, M₀), Y = (0, Y_{e−1}, …, Y₁, Y₀) and S = (0, S_{e−1}, …, S₁, S₀). The algorithm is given in the pseudocode of Figure 10.3. The arithmetic is performed in w-bit precision. Based on the value of x_i, x_i·Y(0) + S(0) is computed, and if the LSB is 1, then M is added so that the LSB becomes zero. A shift-right operation must be performed in each of the inner loops. A shifted S(j−1) word is available only when the LSB of the new S(j) is obtained.
Basically, the algorithm has two steps: (a) add one word from each of the vectors S, x_i·Y and M (the addition of M depending on a test) and (b) a one-bit right shift of an S word. An architecture is shown in Figure 10.4a, containing a pipelined kernel of p w-bit PEs (processing elements), for a total of w·p bit cells. In one kernel cycle, p bits of X are processed. Hence, k = n/p kernel cycles are needed to do the entire computation. Each PE contains two w-bit adders, two banks of w AND gates to conditionally add x_i·Y(j) and add "odd" M(j) to S(j), and registers to hold the results (see Figure 10.4b). (Note that "odd" is true if the LSB of S is '1'.) Note that S is renamed here as Z, and Z is stored in carry-save redundant form. A PE must wait two cycles to kick off after its predecessor, until Z₀ is available, because Z₁ must first be computed and shifted. Note that the FIFO needs to store the results of each PE in carry-save redundant form, requiring 2w bits for each entry. These need to be stored until PE1 becomes available again.
A pipeline diagram of the Tenca-Koç architecture [20] is shown in Figure 10.5a for two cases of PEs: (a) case 1, e > 2p − 1, with e = 4 and p = 2, and (b) case 2, e ≤ 2p − 1, with e = 4 and p = 4, indicating which bits are processed in each cycle. There are two
Figure 10.4 (a) Scalable Montgomery multiplier architecture and (b) schematic of PE (adapted
from [21] ©IEEE 2005)
dependencies for PE1 to begin a kernel cycle, indicated by the gray arrows: PE1 must be finished with the previous cycle, and the Z_{w−1:0} result of the previous kernel cycle must be available at PE p. Assuming a two-cycle latency to bypass the result from PE p, to account for the FIFO and routing, the computation time in clock cycles is
T = k(e + 1) + 2(p − 1) for e > 2p − 1 (case I), and T = k(2p + 1) + e − 2 for e ≤ 2p − 1 (case II)   (10.7)
The first case corresponds to a large number of words. Each kernel cycle needs e + 1 clock cycles for the first PE to handle one bit of X. The output of PE p must be queued until the first PE is ready again. There are k kernel cycles. Finally, 2(p − 1) cycles are required for the subsequent PEs to complete on the last kernel cycle.
The second case corresponds to the case where a small number of words is necessary. Each kernel cycle takes 2p clock cycles before the final PE produces its first word, and one more cycle to bypass the result back; k kernel cycles are needed. Finally, e − 2 cycles are needed to obtain the more significant words at the end of the last kernel cycle.
The Harris et al. [21] case is presented for comparison in Figure 10.5b. Harris et al. [21] have suggested that the results be stored in the FIFO in non-redundant form to save FIFO area, requiring only w bits for each entry instead of the 2w bits in [20]. They also note that instead of waiting for the LSBs of the previous word to be shifted to the right, M and Y can be left-shifted, thus saving a latency of one clock cycle. This means that as soon as the LSB of Z is available, we can start the next step for another x_i. The authors have considered the cases e = p and e > p (the number of PEs p equal to, or less than, the number of words e). Note that in this case, (10.7) changes to

T = (k + 1)(e + 1) + p − 2 for e > p (case I), and T = k(p + 1) + 2e − 1 for e ≤ p (case II)   (10.8)
Kelley and Harris [22] have extended the Tenca-Koç algorithm to high radix 2^v using a w × v-bit multiplier. They have also suggested using Orup's technique [15] for
Figure 10.5 Pipeline diagrams corresponding to (a) Tenca and Koc technique and (b) Harris
et al. technique (adapted from [21] ©IEEE2005)
avoiding the quotient-determination modulo multiplications.
Huang et al. [26] have suggested modifications to the Tenca-Koç algorithm to perform Montgomery multiplication in n clock cycles. In order to achieve this, they suggest pre-computing the partial results using two possible assumptions for the MSB of the previous word. PE1 can take the w − 1 MSBs of S(0) (i = 0) from PE0 at the beginning of clock cycle 1, do a right shift, prepend both 1 and 0 based on the two different assumptions about the MSB of this word at the start of the computation, and compute S(1) (i = 1). At the beginning of clock cycle 2, since the correct bit will be available as the LSB of S(1) (i = 0), one of the two pre-computed
versions of S(0) (i = 1) is chosen. Since the w − 1 LSBs are the same, the parallel hardware can share the same LSB adding hardware, and using small additional adders, the other portions can be handled. The same pattern of computations repeats in the subsequent clock cycles. Thus, the resource requirement is only marginally increased. The computation time in clock cycles is T = n + e − 1 if e ≤ p, and T = n + k(e − p) + e − 1 otherwise, where k = ⌊n/p⌋.
In another technique, each PE processes the complete computation of a specific word in S, but all PEs scan different bits of the operand X at the same time. The data dependency graphs of both these cases are presented in Figure 10.6a, b. Note that the second architecture, however, has a fixed size (i.e. e PEs, which cannot be reduced). The first technique has been shown to outperform the Tenca-Koç design by about 23% in terms of the product of latency and area when implemented on FPGAs; the second technique achieves an improvement of 50%.
The authors have also described a high-radix implementation [26] while preserving the speed-up factor of two over the corresponding technique of Tenca and Koç [20]. In this, for example, considering radix 4, two bits are scanned at a time, taking ((n/2) + e − 1) clock cycles to produce an n-bit Montgomery multiplication. The multiplication by 3 that is needed can be done on the fly, or avoided by using Booth's algorithm, which needs to handle negative operands [26].
Shieh and Lin [27] have suggested rewriting the recurrent equations in the Montgomery multiplication algorithm

q_i = (S_i + A·B_i) mod 2,  S_{i+1} = (S_i + A·B_i + q_i·N)/2   (10.9)

as

q_i = (SR_i + SM_{i−1}/2 + A·B_i) mod 2,  SR_{i+1} + SM_{i+1} = (SR_i + SM_{i−1}/2 + A·B_i + q_i·N)/2   (10.10)

with SR_0 = SM_0 = SM_{−1} = 0, for i = 0, …, (k − 1). Note that A, B and N are k-bit words. This helps in deferring the accumulation of the MSB of each word of the intermediate result to the next iteration of the algorithm. Note that the intermediate result S_i in (10.9) is decomposed into two parts, SM and SR. The word SM contains only the MSB followed by zeroes, and the word SR comprises the remaining LSBs. They also observe that in (10.10), the number of terms can be reduced to three, taking advantage of the several zero bits in SR_i and SM_{i−1}/2. Further, by considering A as two words, AP and AR (for example, for W = 4, AP holds the bits a₂, a₆ and a₁₀ in their positions with zeroes elsewhere, and AR holds the remaining bits of A), (10.10) gets changed to
Figure 10.6 Data dependency graphs of (a) optimized architecture and (b) alternative architec-
ture of MWR2MM algorithm (adapted from [26] ©IEEE2011)
q_i = (SR′_i + 2AP·B_{i+1} + SM′_{i−1}/2 + AR·B_i) mod 2 = (OP1_i + OP2_i) mod 2

SR′_{i+1} + SM′_{i+1} = (SR′_i + 2AP·B_{i+1} + SM′_{i−1}/2 + AR·B_i + q_i·N)/2 = (OP1_i + OP2_i + OP3_i)/2   (10.11)
Figure 10.7 (a) Algorithms for Modulo multipliers using Montgomery and Barrett Reduction
(b) original architecture and (c) modified architecture (adapted from [10.8] ©IEEE2010)
Figure 10.7 (continued)
Note that csa and csb are r-bit quantities, whereas ec is a 1-bit carry. The lower r-bit output and the 1-bit ec are given by the CPA operation, whereas the rest are obtained by partial product addition using a CSA.
Figure 10.8 Montgomery multipliers using (a) single form (Type I), (b) using semi carry-save
form (Type II) and (c) using carry-save form (Type III) (adapted from [28] ©IEEE2011)
10.3 RNS Montgomery Multiplication and Exponentiation 287
In the third algorithm, carry-save form is used for the intermediate sum and carry, where cs1 and cs2 are intermediate carry signals and zs1 and zs2 are intermediate sum signals; the two steps are modified accordingly. The CPA operation is performed at the end of the inner loop to obtain z_j. The third approach needs more steps due to the extra additions.
The computation time of the CPA significantly affects the critical path. In Types I and II, the CPA widths are 2r and r, respectively, whereas in Type III the CPA is not used in every cycle. The numbers of cycles needed for the arithmetic core in Types I, II and III are 2m² + 4m + 1, 2m² + 5m + 1 and 2m² + 6m + 2, respectively. The authors have considered a variety of final CPAs based on the Kogge-Stone, Brent-Kung, Han-Carlson and Ladner-Fischer types. The partial product addition also used a variety of algorithms: Dadda tree, 4:2 compressor tree, (7, 3) counter trees and (3, 2) counters. The radix was also varied from 8 to 64 bits. The authors report results ranging from the smallest area of 861 gates, using a Type I radix-2^8 processor, to the shortest operating time of 0.67 ms at 421.94 MHz with a Type III radix-2^128 processor. The highest hardware efficiency (RSA time × area) of 83.12 s·gates was achieved with a Type II radix-2^32 processor.
N + Δ < M < N + N/3, with Δ < N/6
Set 1 (moduli 3, 5, 7; M = 105)   Set 2 (moduli 11, 13, 17; M′ = 2431)

a = 10:                    (1, 0, 3)    (10, 10, 10)
b = 25:                    (1, 0, 4)    (3, 12, 8)
a×b = 250:                 (1, 0, 5)    (8, 3, 12)
t = (−1/37) mod 105 = 17:  (2, 2, 3)
s = t×(a×b) mod 105 = 50:  (2, 0, 1) → base extend to Set 2 → (6, 11, 16)
N = 37:                                 (4, 11, 3)
s×N = 1850:                             (2, 4, 14)
a×b + s×N = 2100:                       (10, 7, 9)
1/M = 1/105:                            (2, 1, 6)
Result = 2100/105 = 20:    (2, 0, 6) ← base extend to Set 1 ← (9, 7, 3)
Figure 10.9 (a) Bajard’s and Posch and Posch algorithm for Montgomery multiplication using
RNS (adapted from [31] ©IEEE2004) (b) an example of Montgomery multiplication using RNS
4N ≤ M₀ ≤ (4 + ε_{M₀/N})·N, with ε_{M₀/N} < 1/12, and

a, b < N + Δ < N + N/6   (10.13)
with δ_i ≤ 1 denoting the error between the actual value and w_i = x_i·M_1/m_i. Base extension from RNS1 to RNS2 results in some t* = t or t + M_1, according as W*_int equals W_int or W_int − 1 (the approximation being off by one). It can be shown that t* < M_1 or M_1 + Δ in these two cases, respectively; the intermediate result y of the reduction then remains within the bounds required by (10.13).
Bajard et al. [30] have described an RNS Montgomery modular multiplication algorithm in which B and N are given in RNS form and A is given in mixed radix (MRS) form. In this method, only one RNS base is used, and the condition 3·N·max_{i∈(1,…,n)}(m_i) < M needs to be satisfied. The algorithm is executed in n steps, where n is the number of moduli. In each step, an MRS digit q′_i of a number Q is computed, and a new value of R, where R = (A·B/M) mod N, is determined using a′_i and q′_i in RNS, where a′_i is the mixed radix digit of A. Next, since R is a multiple of m_i, and the moduli are relatively prime, R is multiplied by the multiplicative inverse of m_i. This process cannot, however, be carried out in the last step, since m_i is not prime to itself. Hence, another RNS base shall be used for expressing the result and reconstructing the residue after it is lost. Since the result is available in the new RNS base, it needs to be extended to the original base in the end. The authors also suggest another technique where the missing residue is recovered using base extension employing a redundant modulus, following the Shenoy and Kumaresan technique [32]. Note, however, that for systems based on MRC, fully parallel computation cannot be realized, and thus they are slower. An example of Montgomery multiplication using RNS to compute (A·B/M) mod N, where A = 10, B = 25, M = 105 and N = 37, is presented in Figure 10.9b.
The Bajard et al. [31] approach using two moduli sets is similar to other techniques [33, 34], but the base extension steps use different algorithms. These, however, have the same complexity as that of Posch and Posch [29]. Note that q shall be extended to base B′ as explained before. This can be obtained by using the CRT, but the multiple of M that needs to be subtracted must be known. The application of the CRT in base B yields

q = Σ_{i=1}^{k} σ_i·M_i − α·M   (10.14a)

where M contains k moduli, σ_i = (q_i·(M_i^{−1} mod m_i)) mod m_i and α < k. We need not compute the exact value of q in M, but instead extend q into M′ as

q̂ = q + α·M = Σ_{i=1}^{k} σ_i·M_i   (10.14b)

and compute

r̂ = (a·b + q̂·N)/M   (10.14c)

Note that the computed value r̂ < M′ and has a valid representation in base B′. From (10.14b) and (10.14c), we get

r̂ mod N = ((a·b + (q + α·M)·N)/M) mod N = a·b·M^{−1} mod N   (10.14d)
x = Σ_{i=1}^{n} (x_i·(M_i^{−1} mod m_i) mod m_i)·M_i − α·M = Σ_{i=1}^{n} σ_i·M_i − α·M   (10.16a)

where

σ_i = (x_i·(M_i^{−1} mod m_i)) mod m_i   (10.16b)

Dividing by M, we have

α + x/M = Σ_{i=1}^{n} σ_i/m_i   (10.16c)

Since x < M, Σ_{i=1}^{n} σ_i/m_i lies between α and α + 1. Thus α = ⌊Σ_{i=1}^{n} σ_i/m_i⌋, and 0 ≤ α < n holds.
holds. The value of α can be recursively estimated in the “cox” unit by approxi-
mating mi in the denominator by 2r, in order to avoid division by mi. Note that it is
assumed that r is common to all moduli in spite of mi being different in general and
computing
$ %
Xn
trunc σ i
^ ¼
α þ α0 ð10:17Þ
i¼1
2r
where trunc σ i ¼ σ i \ ð1:::::10:::::0Þ2 . (Note that the number of ones are q0 and
number of zeroes are r q0 and \ stands for bit-wise AND operation.) Note that σ i is
approximated by its most significant q bits as trunc (σ i). The parameter α0 is an
offset value to take into account the error caused due to the approximation. Note
^ in (10.17) is computed bit by bit with an initial value λo ¼ α as
that α
trunc σ i
λi ¼ λi1 þ , αi ¼ bλi c, λi ¼ λi αi for i ¼ 1, 2, . . . n ð10:18Þ
2r
Note that αi is a bit and if it is 1, the rower unit subtracts M. Note that the error is
transferred to the next step and only in the last step there is residual error.
Kawamura et al. [34] have suggested the use of α0 ¼ 0 and α0 ¼ 0.5 for the first
and second base extensions, respectively. Note that n clock cycles are needed to
obtain the n number of αi values. It can be seen that n2 + 2n modulo multiplications
are needed for each base extension and 5n other modulo multiplications operations
are needed for complete modulo multiplication. The Cox-Rower architecture is
presented in Figure 10.10.
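The exact correction term of (10.16c) can be checked numerically. In the sketch below the moduli are illustrative choices, not taken from the text, and exact rational arithmetic replaces the cox unit's truncated estimate:

```python
# Exact computation of the CRT correction term alpha = floor(sum sigma_i/m_i)
# from (10.16a)-(10.16c), checked against direct reconstruction.
from fractions import Fraction
from math import floor, prod

def base_extend_digits(x, moduli):
    """Return the sigma_i of (10.16b) and alpha of (10.16c)."""
    M = prod(moduli)
    sigmas = []
    for m in moduli:
        Mi = M // m
        sigmas.append((x % m) * pow(Mi, -1, m) % m)
    alpha = floor(sum(Fraction(s, m) for s, m in zip(sigmas, moduli)))
    return sigmas, alpha

moduli = [3, 5, 7, 11]          # illustrative base, M = 1155
M = prod(moduli)
for x in (0, 1, 533, 1154):
    sigmas, alpha = base_extend_digits(x, moduli)
    # (10.16a): x = sum sigma_i * M_i - alpha * M
    assert sum(s * (M // m) for s, m in zip(sigmas, moduli)) - alpha * M == x
    assert 0 <= alpha < len(moduli)
```

The cox unit obtains the same α (up to the bounded truncation error) without any division by the individual mᵢ.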
Gandino et al. [35] have suggested reorganization allowing pre-computation of
certain constants of the algorithms due to Bajard et al. [31] and Kawamura
et al. [34]. In these algorithms, several multiplications of partial results with
pre-computed values exist. By exploiting the commutative property, a sequence
of multiplications of a partial result by a pre-computed value is replaced by a single
multiplication. In addition, the authors use the commutative, distributive and
associative properties to rearrange the operations in a more effective way.
292 10 RNS in Cryptography
Figure 10.10 Cox-Rower architecture for Montgomery algorithm in RNS (adapted from [34]
©Eurocrypt2000)
$$X_{m_i} = \left| \sum_{j=0}^{L-1} x_j 2^{rj} \right|_{m_i} \quad \text{for all } i \qquad (10.19)$$
The constants $\left| 2^{rj} \right|_{m_i}$ are pre-computed and stored. Thus, L parallel units can convert the given binary word X into residues in L steps. The MRC technique is used for RNS-to-binary conversion, where each add/multiply unit weights the mixed-radix digit appropriately and computes the result. The general architecture for all these functions for RNS Montgomery multiplication (RMM) is presented in
Figure 10.12. They denote the two RNS bases as K and Q and compute (ABQ⁻¹) mod N. They choose the scaling factor as the product of the moduli in the first moduli set. The authors observe that the BE algorithm using MRC needs only $\frac{L^2 + L}{2} - 2$ operations, whereas the techniques due to Bajard et al. [31] and Kawamura et al. [34] need 2L² + 3L and 2L² + 4L operations, respectively. Consequently, a 1024-bit exponentiation using 33 32-bit moduli and a clock frequency of 100 MHz could achieve a throughput of 3 Mb/s in 0.35–0.18 μm CMOS technology, as against earlier methods [34] whose throughput is 890 kb/s.
Jie et al. [37] have suggested a reformulation of the Bajard et al. technique [32] which uses pre-computations to reduce the number of steps from 2n² + 8n to 2n² + 5n, whereas the technique of Kawamura et al. [34] needs 2n² + 9n steps. In this approach, a modulus 2^n is used for easy base extension.
Schinianakis and Stouraitis [38] have suggested using MRC following the Yassine and Moore technique [39] discussed in Chapter 5. The moduli in the RNS need to be selected properly in this method so that the computation is simpler. In
Figure 10.12 A RNS Montgomery multiplication architecture due to Schinianakis (adapted from
[36] ©IEEE2011)
Yassine and Moore's technique, (L − 2) multiplications are needed for one base extension as compared to the L(L − 1)/2 needed for other techniques. The authors have also unified the hardware to cater for both conventional RNS and polynomial RNS, and show that in the RNS case the number of multiplications can be reduced compared to the use of conventional MRC as in [36]. They have designed hardware for dual-field addition/subtraction, multiplication, modular reduction and MAC operation to cater for both fields GF(p) and GF(2^m). They have considered various options, viz., the number of moduli, the use of several MACs (β in number) in parallel, and a selectable radix 2^r.
Ciet et al. [40] have suggested an FPGA implementation of 1024-bit RSA with RNS following a similar approach to Bajard et al. [31] and Kawamura et al. [34]. They have suggested that the nine moduli needed for each of the bases (RNS moduli sets) can be selected from a pool of generalized Mersenne primes of the form $2^{k_1} - 2^{k_2} - 1$. Thus $\binom{63}{9}\binom{54}{9}$ possible combinations exist for $58 \le k_1 \le 64$, $0 \le k_2 \le \frac{k_1 + 1}{2}$. Ciet et al. [40] have also suggested solutions for signing using a private key d with the RSA algorithm; in these, CRT is considered to be used [41].
Denoting the hash of the message to be signed using private key d as μ, compute μ_p = μ mod p and μ_q = μ mod q. Choosing two random bases as mentioned above, μ_p and μ_q can be represented in the two RNS bases. In order to avoid differential power analysis (DPA) attacks, the authors suggest adding randomization to both exponent and message. Next, $\mu_p^D \bmod N$ and $\mu_q^D \bmod N$ can be computed using RNS, followed by CRT to perform reverse conversion to obtain $\mu^D \bmod N$. Note that $\mu_p^D \bmod N$ can be computed as $\mu_p^{D_p} \bmod p$ where $D_p = D \bmod (p-1)$, and similarly for q.
The Montgomery representation of the modular inverse is $b^{-1}2^n \bmod a$ [43]. The first phase in the evaluation computes $b^{-1}2^k \bmod a$, where k is the number of iterations. The output of the first phase is denoted the Almost Montgomery Inverse (AMI). The first phase essentially computes gcd(a, b), where gcd stands for greatest common divisor. The second phase halves this value (k − n) times modulo a and negates the result to yield $b^{-1}2^n \bmod a$.

The pseudocode is presented in Figure 10.13a. Note that u, v, r and s are such that us + vr = a, with s ≥ 1, u ≥ 1 and 0 ≤ v ≤ b; r, s, u, and v remain between 0 and 2a − 1. The number of iterations k is such that $a + b \le 2^{k-1} \le 2ab$. The following invariants can be verified: $br \equiv -u2^k \pmod{a}$ and $bs \equiv v2^k \pmod{a}$. An example for a = 17, b = 10 and n = 5 is presented in Figure 10.13b.
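A runnable sketch of the two phases is given below, using the chapter's example a = 17, b = 10, n = 5. This is one common formulation of Kaliski's algorithm and is not necessarily line-for-line identical to Figure 10.13a; the variable roles match the text (a is the modulus):

```python
# Two-phase Montgomery inverse (after Kaliski [43]).

def almost_montgomery_inverse(b, a):
    """Phase 1: return (x, k) with x = b^-1 * 2^k mod a."""
    u, v, r, s, k = a, b, 0, 1, 0
    while v > 0:
        if u % 2 == 0:
            u, s = u // 2, 2 * s
        elif v % 2 == 0:
            v, r = v // 2, 2 * r
        elif u > v:
            u, r, s = (u - v) // 2, r + s, 2 * s
        else:
            v, s, r = (v - u) // 2, s + r, 2 * r
        k += 1
    if r >= a:
        r -= a
    return a - r, k              # negation yields b^-1 * 2^k mod a

def montgomery_inverse(b, a, n):
    """Phase 2: halve (k - n) times mod a to get b^-1 * 2^n mod a."""
    x, k = almost_montgomery_inverse(b, a)
    for _ in range(k - n):
        x = x // 2 if x % 2 == 0 else (x + a) // 2
    return x

a, b, n = 17, 10, 5
x = montgomery_inverse(b, a, n)
print(x)                         # 10, i.e. 10^-1 * 2^5 mod 17
assert x == pow(b, -1, a) * 2**n % a
```

For these inputs phase 1 takes k = 6 iterations, so one halving step produces the final value 10.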
The Montgomery inverse can be effectively used in reducing the number of steps in exponentiation mod a as needed in RSA and other algorithms. For example, if the exponent is 119 = (1110111)₂, it can be recoded as $(1000\bar{1}00\bar{1})_2$ where $\bar{1}$ has weight −1. Thus only three multiplications need to be done instead of 5, whereas the number of squarings is the same in both cases.
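The recoding above is the non-adjacent form (NAF) of the exponent; a digit of −1 corresponds to a multiplication by the (Montgomery) inverse of the base, which is why a cheap modular inverse makes this recoding attractive:

```python
# Signed-digit (NAF) recoding of the exponent 119 from the text.

def naf(n):
    """Return the non-adjacent-form digits of n, most significant first."""
    digits = []
    while n:
        if n % 2:
            d = 2 - (n % 4)      # +1 or -1, forcing the next bit to 0
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits[::-1]

digits = naf(119)
print(digits)                    # [1, 0, 0, 0, -1, 0, 0, -1] = 128 - 8 - 1
assert sum(d * 2**i for i, d in enumerate(reversed(digits))) == 119
assert sum(1 for d in digits if d) == 3   # three nonzero digits vs six in binary
```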
Savas and Koc [44] have suggested defining the new Montgomery inverse as $x = b^{-1}2^{2n} \bmod a$, so that the Montgomery inverse of a number already in the Montgomery domain is computed. Note that the MonInv algorithm in Figure 10.13a cannot compute the new Montgomery inverse directly; this needs two steps:
Figure 10.13 (a) Algorithm for computing the Montgomery inverse $b^{-1}2^m \bmod a$ and (b) example a = 17, b = 10, n = 5 (adapted from [43] ©IEEE1995)

$$c = \mathrm{MonInv}(b2^m) = (b2^m)^{-1}2^m \bmod a = b^{-1} \bmod a$$
$$x = \mathrm{MonPro}(c, 2^{2m}) = b^{-1}2^{2m}2^{-m} \bmod a = b^{-1}2^m \bmod a \qquad (10.21a)$$

$$v = \mathrm{MonPro}(b2^m, 1) = b2^m 2^{-m} \bmod a = b \bmod a$$
$$x = \mathrm{MonInv}(b) = b^{-1}2^m \bmod a \qquad (10.21b)$$
This will be useful in ECC computation when the intermediate results are already in the Montgomery domain and division is needed, e.g., in the computation of point addition or doubling.
Gutub et al. [45] have described a VLSI architecture for GF(p) Montgomery modular inverse computation. They observe that two parallel subtractors for finding (u − v), (v − u) and (r − a) are required so as to speed up the computation (see Figure 10.13a). They have suggested a scalable architecture which takes w bits at a time and performs the computation in ⌈n/w⌉ cycles for scalable operations such as addition/subtraction. The area of the scalable design has been found to be on average 60% smaller than the fixed one.
Savas [46] has used redundant signed digit (RSD) representation so that the carry
propagation in addition and subtraction is avoided. This enables fast computation of
multiplicative inverse in GF( p).
Bucek and Lorencz [47] have suggested a subtraction-free AMI technique. It computes (u + v) instead of (u − v), where one of the operands must always be negative. By keeping u always negative and v positive, we can compute an equivalent of the differences in the original algorithm without subtraction. Note that the values of v, r, and s are the same as in the original algorithm, but u takes the opposite sign. The authors have considered the original AMI design with two subtractors as well as with one subtractor, and show that the AT (area-time product) is lower for subtraction-free AMI, whereas AMI with one subtractor is slower than AMI using two subtractors. Note that the initial value of u shall be −p instead of p in this approach. The algorithm is presented in Figure 10.14.
Schinianakis et al. [48] have realized ECC using RNS. In order to reduce the division operations of the affine representation, they have used Jacobian coordinates. Consider the elliptic curve

$$y^2 = x^3 + ax + b \quad \text{over } F_p \qquad (10.22a)$$

where a, b ∈ F_p and 4a³ + 27b² ≠ 0 mod p, together with a special point O called the point at infinity. Substituting $x = X/Z^2$, $y = Y/Z^3$ using the Jacobian coordinate representation, (10.22a) changes to

$$E(F_p): Y^2 = X^3 + aXZ^4 + bZ^6 \qquad (10.22b)$$

The point at infinity is given by {0, 0, 0}. The addition of two points P₀ = (X₀, Y₀, Z₀) and P₁ = (X₁, Y₁, Z₁) ∈ E(F_p) thus yields the sum P₂ = (X₂, Y₂, Z₂) = P₀ + P₁ ∈ E(F_p) given by

$$P_2 = P_0 + P_1: \quad X_2 = R^2 - TW^2, \quad 2Y_2 = VR - MW^3, \quad Z_2 = Z_0 Z_1 W \qquad (10.22c)$$
where the intermediate quantities W, R, T, M and V are defined in (10.23). The authors used extended RNS with one redundant modulus to perform residue-to-binary conversion using CRT. The conversion from projective coordinates to affine coordinates is done using $x = X/Z^2$, $y = Y/Z^3$.
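The addition of (10.22c) can be sketched as follows, using the intermediate quantities (U₀, U₁, S₀, S₁, W, R, T, M, V) that commonly accompany these formulas; the curve p = 97, a = 2, b = 3 and the two points are illustrative choices, not from the text:

```python
# Jacobian point addition on y^2 = x^3 + ax + b over F_p, following (10.22c).
p, a, b = 97, 2, 3

def jacobian_add(P0, P1, p):
    """Add two distinct points given in Jacobian coordinates (X, Y, Z)."""
    X0, Y0, Z0 = P0
    X1, Y1, Z1 = P1
    U0, U1 = X0 * Z1 * Z1 % p, X1 * Z0 * Z0 % p
    S0, S1 = Y0 * Z1**3 % p, Y1 * Z0**3 % p
    W, R = (U0 - U1) % p, (S0 - S1) % p
    T, M = (U0 + U1) % p, (S0 + S1) % p
    X2 = (R * R - T * W * W) % p                  # X2 = R^2 - T*W^2
    V = (T * W * W - 2 * X2) % p
    Y2 = (V * R - M * W**3) * pow(2, -1, p) % p   # 2*Y2 = V*R - M*W^3
    Z2 = Z0 * Z1 * W % p
    return X2, Y2, Z2

def to_affine(P, p):
    X, Y, Z = P
    return X * pow(Z, -2, p) % p, Y * pow(Z, -3, p) % p

# (3, 6) + (0, 10) on the curve, embedded with Z = 1:
print(to_affine(jacobian_add((3, 6, 1), (0, 10, 1), p), p))   # (85, 71)
```

The affine result agrees with the chord rule, and no inversion is needed until the final projective-to-affine conversion, which is the point of the Jacobian representation.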
The RNS adder, subtractor and multiplier architectures are shown in Figure 10.15a, b for the point adder (ECPA) and point doubler (ECPD). The point multiplier (ECPM) is shown in Figure 10.15c. Note that the RNS adder, multiplier and subtractor are shared for all the computations in (10.22c) and (10.23). Note also that the modulo p reduction is performed after RNS-to-binary conversion using CRT. The authors have shown that all the operations are significantly faster than those using conventional hardware. A 160-bit point multiplication takes approximately 2.416 ms on a Xilinx Virtex-E (V1000E-BG-560-8). The authors also observe that the cost of conversion from residue to binary is negligible.
Schinianakis et al. [49] have further extended their work on ECC using RNS. They observe that for p of 192-bit length, the equivalent RNS dynamic range is 840 bits. As such, 20 moduli of 42 bits each have been suggested. The implementation of
(10.22c) and (10.23) can take advantage of parallelism between the multiplication, addition and subtraction operations, for both point addition and point doubling.
They observe that 13 steps will be required for each (see Figure 10.16). Note that for ECPA, 17 multiplications, 5 subtractions and 2 additions are required, whereas for ECPD, 15 multiplications, one addition and 3 subtractions are required, thus sharing a multiplier/adder/subtractor. They, however, do not use a separate squaring circuit.
circuit. The RNS uses one extra redundant modulus and employs extended RNS for
RNS to binary conversion based on CRT. A special serial implementation was used
for multiplication of an nf-bit word by an f-bit word needed in the CRT computation, considering f bits at a time, where n is the number of moduli and f is the word length of each modulus. The modulo reduction after CRT is carried
out using a serial modulo multiplier with 1 as one of the operands. The projective-to-affine coordinate conversion needs division by Z² and Z³, which requires finding the multiplicative inverse: one modular inversion and four modulo
multiplications:
$$T_1 = \frac{1}{Z}, \quad T_2 = T_1^2, \quad x = XT_2, \quad T_3 = T_1 T_2, \quad y = YT_3 \qquad (10.24)$$
The authors use the technique of [47] for this purpose. They also consider the effect of the choice of the number of moduli and the word length of the moduli on the performance. They observe that the area decreases as the number of moduli increases, which reduces the bit length of the moduli. The moduli used are presented in Table 10.1 for a 192-bit implementation for illustration. The authors have described an FPGA implementation using a Xilinx Virtex-E XCV 1000E, FG 680 FPGA device. Typically the time needed for point multiplication is 4.84, 4.08, 3.54 and 2.35 ms for 256-, 224-, 192- and 160-bit implementations, respectively.
Esmaeildoust et al. [50] have described elliptic curve point multiplication based on the Montgomery technique using two RNS bases. The authors use moduli of the
Figure 10.15 Architectures of ECPA (a), ECPD (b) and ECPM (c) (adapted from [48]
©IEE2006)
10.5 Elliptic Curve Cryptography Using RNS 301
Figure 10.16 (a, b) Data flow graphs (DFGs) for point addition and point doubling algorithms (adapted from [49] ©IEEE2011)
Typically, on a Xilinx Virtex-E, a 192-bit ECPM takes 2.56 ms while needing 20,014 LUTs, and for a 160-bit field length, ECPM needs 1.83 ms and 15,448 LUTs. The reader is referred to [50] for more details. The difference in complexity between addition and doubling leads to simple power analysis (SPA) attacks [73]; hence, unified addition formulae need to be used.
The Montgomery ladder can be used for the scalar multiplication algorithm (addition and doubling performed in each step). Several solutions have been suggested to provide leak resistance for different types of elliptic curves. As an illustration, for the Hessian form [53], the curve equation is given by

$$x^3 + y^3 + z^3 = 3dxyz \qquad (10.25a)$$

where d ∈ F_p is not a third root of unity. For the Jacobi model [54], the curve equation involves constants ε and δ in F_p, and for the short Weierstrass form [55] a corresponding curve equation is used. These forms require 12, 12 and 18 field multiplications for addition/doubling. Note that Montgomery's technique [56] proposes to work only on x coordinates. Both addition and doubling take only three multiplications and two squarings, and both are performed for each bit of the exponent; the cost is thus about $10\lceil \log_2 k \rceil$ multiplications for finding kG.
Bajard et al. [73] have also shown that the formulae for point addition and doubling can be rewritten to minimize the modular reductions needed. As an illustration, for the Hessian form elliptic curve, the original equations for the addition of two points (X₁, Y₁, Z₁) and (X₂, Y₂, Z₂) are

$$X_3 = Y_1^2 X_2 Z_2 - Y_2^2 X_1 Z_1$$
$$Y_3 = X_1^2 Y_2 Z_2 - X_2^2 Y_1 Z_1 \qquad (10.26a)$$
$$Z_3 = Z_1^2 X_2 Y_2 - Z_2^2 X_1 Y_1$$

Setting

$$A = Y_1 X_2, \; B = Y_1 Z_2, \; C = X_1 Y_2, \; D = Y_2 Z_1, \; E = X_1 Z_2, \; F = X_2 Z_1,$$
$$X_3 = AB - CD, \quad Y_3 = EC - FA, \quad Z_3 = FD - EB \qquad (10.26b)$$

Thus, only nine reductions and 12 multiplications are needed. Similar results can be obtained for the Weierstrass and Montgomery ladder cases.
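The reduction-friendly factorization is a pure algebraic identity, so it can be checked with random integers. Note the Z₃ line uses FD − EB, the sign that makes the factored form agree with the expanded formulae:

```python
# Check that the factored Hessian addition (10.26b) reproduces (10.26a).
import random

rng = random.Random(1)
for _ in range(10):
    X1, Y1, Z1, X2, Y2, Z2 = (rng.randrange(1, 10**6) for _ in range(6))
    A, B = Y1 * X2, Y1 * Z2
    C, D = X1 * Y2, Y2 * Z1
    E, F = X1 * Z2, X2 * Z1
    assert A * B - C * D == Y1**2 * X2 * Z2 - Y2**2 * X1 * Z1   # X3
    assert E * C - F * A == X1**2 * Y2 * Z2 - X2**2 * Y1 * Z1   # Y3
    assert F * D - E * B == Z1**2 * X2 * Y2 - Z2**2 * X1 * Y1   # Z3
```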
RNS base extension needed in Montgomery reduction, using MRC first followed by Horner evaluation, has been considered by Bajard et al. [73]. Expressing the reconstruction from residues in base B using MRC as

$$A = a_1 + m_1(a_2 + m_2(a_3 + \cdots + m_{n-1}a_n)\ldots) \qquad (10.27a)$$

the number of multiplications by the constants $\left| m_i^{-1} \right|_{m_j}$ is (n² − n)/2 digit products. The conversion from MRS to RNS corresponds to a few shifts and adds. Assuming moduli of the form $2^k - 2^{t_i} - 1$, this needs computation of $\left| a + bm_i \right|_{m_j} = \left| a + 2^k b - 2^{t_i} b - b \right|_{m_j}$, which can be done in two additions (since a + 2^k b is just a concatenation), while the reduction mod m_j requires three additions. Thus, the evaluation of each a_j in base B′ needs 5n word additions. The MRS-to-RNS conversion needs (n² − n)/5 RNS digit products, since five word additions are equivalent to 1/5 of an RNS digit product. Hence for the two base extensions we need

$$(n^2 - n) + \tfrac{2}{5}(n^2 - n) + 3n = \tfrac{7}{5}n^2 + \tfrac{8}{5}n$$

RNS digit products, which compares favorably with other O(n²) techniques.
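The MRC digits and the Horner evaluation of (10.27a) can be sketched as follows, with an illustrative base:

```python
# Mixed-radix (MRC) reconstruction in the Horner form of (10.27a).
from math import prod

def mrc_digits(residues, moduli):
    """Return mixed-radix digits a_1..a_n for the given residues."""
    digits = []
    ms = list(moduli)
    res = list(residues)
    while ms:
        m = ms.pop(0)
        a = res.pop(0) % m
        digits.append(a)
        # Subtract the digit and divide by m in the remaining channels.
        res = [(r - a) * pow(m, -1, mj) % mj for r, mj in zip(res, ms)]
    return digits

def horner_reconstruct(digits, moduli):
    """Evaluate A = a1 + m1*(a2 + m2*(a3 + ...)) as in (10.27a)."""
    acc = digits[-1]
    for a, m in zip(digits[-2::-1], moduli[-2::-1]):
        acc = a + m * acc
    return acc

moduli = [3, 5, 7, 11]       # illustrative base
for x in (0, 1, 997, prod(moduli) - 1):
    residues = [x % m for m in moduli]
    assert horner_reconstruct(mrc_digits(residues, moduli), moduli) == x
```

In the low-Hamming-weight moduli setting discussed above, the constant multiplications inside `mrc_digits` become a few shifts and adds.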
Considerable attention has been paid to the design of special-purpose processors and algorithms in software for efficient implementation of pairing protocols. The pairing computation can be broken down into multiplications and additions in the underlying fields. Pairing has applications in three-way key exchange [57], identity-based encryption [58], identity-based signatures [59] and non-interactive zero-knowledge proofs [60].
The name bilinear pairing indicates that it takes a pair of vectors as input and returns a number, performing a linear transformation on each of its input variables. These operations are based on elliptic or hyper-elliptic curves. The pairing is a mapping e: G₁ × G₂ → G₃ where G₁ is a curve group defined over a finite field F_q, G₂ is another curve group over the extension field $F_{q^k}$, and G₃ is a subgroup of the multiplicative group $F_{q^k}^*$. If G₁ and G₂ are the same group, then e is called a symmetric pairing; if G₁ ≠ G₂, then e is called an asymmetric pairing. The map is linear in each component and hence useful for constructing cryptographic protocols. Several pairings exist: the Weil pairing [61], Tate pairing [62], ate pairing [63], R-ate pairing [64] and optimal pairing [65].
Let F_p be the prime field with characteristic p, let E(F_p) be an elliptic curve $y^2 = x^3 + a_4 x + a_6$, and let #E(F_p) be the number of points on the elliptic curve. Let ℓ be a large prime dividing #E(F_p). We define E(K)[r] as the K-rational r-torsion group of the curve. Let $G_1 = E(F_p)[r]$, $G_2 = E(F_{p^k})/rE(F_{p^k})$ and $G_3 = \mu_r \subset F_{p^k}^*$ (the r-th roots of unity). Let $P \in G_1$ and $Q \in G_2$; then the reduced Tate pairing is defined as

$$e_T : E(F_p)[\ell] \times E(F_{p^k}) \to F_{p^k}^* / \big(F_{p^k}^*\big)^{\ell} \qquad (10.29a)$$

$$e(P, Q) = f_{(\ell,P)}(Q)^{\frac{p^k - 1}{\ell}} \qquad (10.29b)$$
The first step is to evaluate the function $f_{(\ell,P)}(Q)$ at Q using the Miller loop [61]. A pseudocode for the Miller loop is presented in Figure 10.17. It uses the classical square-and-multiply algorithm. The Miller loop is the core of all pairing protocols.
10.6 Pairing Processors Using RNS 307
Figure 10.17 Algorithm for Miller loop (adapted from [66] ©2011)
In this, $g_{(A,B)}$ is the equation of a line passing through the points A and B (or the tangent at A if A = B), and $\nu_A$ is the equation of the vertical line passing through A, so that $g_{(A,B)}/\nu_{A+B}$ is the function on E involved in the addition of A and B. The values of the line and vertical functions $g_{(A,B)}$ and $\nu_{A+B}$ are the distances calculated between the fixed point Q and the lines that arise when adding B to A on the elliptic curve in the standard way. Considering the affine coordinate representation of A and A + B as (x_j, y_j) and (x_{j+1}, y_{j+1}), and the coordinates of Q as (x_Q, y_Q), we have

$$l_{A,B}(Q) = y_Q - y_j - \lambda_j(x_Q - x_j)$$
$$v_{A+B}(Q) = x_Q - x_{j+1}$$
The R-ate pairing is defined as

$$e_R : \big(E(F_{p^k}) \cap \mathrm{Ker}(\pi_p - p)\big) \times E(F_p)[\ell] \to F_{p^k}^* / \big(F_{p^k}^*\big)^{\ell} \qquad (10.31a)$$

$$R_a(Q, P) = \Big( f_{(b,Q)}(P) \cdot \big( f_{(b,Q)}(P) \cdot g_{(bQ,Q)}(P) \big)^p \cdot g_{(\pi(b+1)Q,\,bQ)}(P) \Big)^{\frac{p^k - 1}{\ell}} \qquad (10.31b)$$

The length of the Miller loop is $\ell^{1/4}$ and hence is reduced by a factor of 4 compared to the Tate pairing.
The MNT curves [68] have an embedding degree k of 6. These are ordinary elliptic curves over F_p such that p = 4l² + 1 and t = 1 ± 2l, where p is a large prime such that #E(F_p) = p + 1 − t is a prime [69].
Parameterized elliptic curves due to Barreto and Naehrig [67] are well suited for asymmetric pairings. These are defined by $E: y^2 = x^3 + a_6$, $a_6 \ne 0$, over F_p where $p = 36u^4 + 36u^3 + 24u^2 + 6u + 1$ and the order n of E is $n = 36u^4 + 36u^3 + 18u^2 + 6u + 1$, for some u such that p and n are primes. Note that only a u that generates primes p and n will suffice. BN curves have an embedding degree k = 12, which means that n divides $p^{12} - 1$ but not $p^k - 1$ for 0 < k < 12. Note that $t = 6u^2 + 1$ is the trace of Frobenius. The value of t is also parameterized and must be chosen large enough to meet a given security level. For efficiency of computation, u and t must have small Hamming weight. As an example, for a₆ = 3, u = 0x6000 0000 0000 1F2D (hex) gives 128-bit security. Since t, n and p are parameterized, the parameter u alone suffices to be stored or transmitted. This yields two primes n and p of 256 bits with Hamming weights 91 and 87, respectively. The size of the field $F_{p^k}$ is 256·k = 3072 bits. This allows a faster exponentiation method.
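The parameterization can be checked directly; this sketch evaluates the BN polynomials for the u quoted above (primality of p and n is not re-verified here):

```python
# BN-curve parameterization: p, n and the Frobenius trace t are polynomials
# in the single parameter u.
u = 0x6000000000001F2D          # the 128-bit-security example from the text

p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1
n = 36*u**4 + 36*u**3 + 18*u**2 + 6*u + 1
t = 6*u**2 + 1

assert p.bit_length() == 256 and n.bit_length() == 256
assert p + 1 - t == n           # #E(F_p) = p + 1 - t is the group order
```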
An advantage of BN curves is their degree-6 twist. Considering two elliptic curves E and Ẽ defined over F_q, the degree of the twist is the degree of the smallest extension $F_{q^d}$ of F_q over which the isomorphism $\psi_d$ between E and Ẽ is defined. This means that E is isomorphic over $F_{p^{12}}$ to a curve Ẽ defined by $y^2 = x^3 + a_6/\nu$, where ν is an element of $F_{p^2}$ which is neither a cube nor a square. Thus, we can define twisted versions of pairings on $\tilde{E}(F_{p^2}) \times E(F_p)[\ell]$. This means that the coordinates of Q can be written as $(x_Q \nu^{1/3}, y_Q \nu^{1/2})$, where $x_Q, y_Q$ are in $F_{p^2}$:

$$\psi_6 : \tilde{E} \to E, \quad (x, y) \mapsto \big(x\nu^{1/3}, y\nu^{1/2}\big) \qquad (10.32)$$
Bajard et al. [72] have considered the choice of 64-bit moduli with low Hamming weight in the moduli sets. The advantage is that the multiplicative inverses appearing in MRC also have low Hamming weight, simplifying those multiplications to a few additions. Such moduli are likewise useful for base extension to another RNS base, as explained before. These moduli are of the type $2^k - 2^{t_i} - 1$ where $t_i < k/2$. As an illustration, two six-moduli sets are $\{2^{64}-2^{10}-1,\ 2^{64}-2^{16}-1,\ 2^{64}-2^{19}-1,\ 2^{64}-2^{28}-1,\ 2^{64}-2^{20}-1,\ 2^{64}-2^{31}-1\}$, whose Hamming weights are all 3, and $\{2^{64}-2^{22}-1,\ 2^{64}-2^{13}-1,\ 2^{64}-2^{29}-1,\ 2^{64}-2^{30}-1,\ 2^{64}-1,\ 2^{64}\}$, with Hamming weights 3, 3, 3, 3, 2, 1. The inverses in this case have Hamming weights ranging between 2 and 20.
Multiplication mod $2^{224} - 2^{96} + 1$, which is the NIST prime P-224, can be easily carried out [42]. The product has 14 32-bit words. Denoting these as r13, r12, r11, ..., r2, r1, r0, the reduction can be carried out by computing (t1 + t2 + t3 − t4 − t5) mod P-224, where each tᵢ is a concatenation of 32-bit words (most significant first):

t1 = (r6, r5, r4, r3, r2, r1, r0)
t2 = (r10, r9, r8, r7, 0, 0, 0)
t3 = (0, r13, r12, r11, 0, 0, 0)
t4 = (0, 0, 0, 0, r13, r12, r11)
t5 = (r13, r12, r11, r10, r9, r8, r7)
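The word-level reduction above can be sketched as follows; the combination t1 + t2 + t3 − t4 − t5 is only congruent to the input mod p, so the check reduces both sides fully:

```python
# Solinas-style reduction for the NIST prime P-224 = 2^224 - 2^96 + 1.
P224 = 2**224 - 2**96 + 1
MASK = 2**32 - 1

def words(x, n):
    """Split x into n 32-bit words, least significant first."""
    return [(x >> (32 * i)) & MASK for i in range(n)]

def join(ws):
    """Concatenate 32-bit words, least significant first."""
    return sum(w << (32 * i) for i, w in enumerate(ws))

def reduce_p224(c):
    r = words(c, 14)
    t1 = join(r[0:7])
    t2 = join([0, 0, 0, r[7], r[8], r[9], r[10]])
    t3 = join([0, 0, 0, r[11], r[12], r[13], 0])
    t4 = join([r[11], r[12], r[13], 0, 0, 0, 0])
    t5 = join([r[7], r[8], r[9], r[10], r[11], r[12], r[13]])
    return (t1 + t2 + t3 - t4 - t5) % P224

c = 0x1234567 ** 16              # an arbitrary value in the 448-bit product range
assert reduce_p224(c) == c % P224
```

In hardware the final `% P224` collapses to at most a couple of conditional additions or subtractions of p, since the signed sum of five seven-word terms is already close to the field size.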
Multiplication of large numbers can be carried out using the Karatsuba formula [74], using fewer multiplications of smaller numbers at the cost of more additions. This can be viewed as multiplication of polynomials. Two linear (two-term) polynomials can be multiplied using only three multiplications, and for quadratic polynomials, with an arbitrary constant C, we have

$$\begin{aligned}
(a_0 + a_1 x + a_2 x^2)(b_0 + b_1 x + b_2 x^2) &= a_0 b_0 (C + 1 - x - x^2) + a_1 b_1 (C - x + x^2 - x^3)\\
&\quad + a_2 b_2 (C - x^2 - x^3 + x^4) + (a_0 + a_1)(b_0 + b_1)(-C + x)\\
&\quad + (a_0 + a_2)(b_0 + b_2)(-C + x^2) + (a_1 + a_2)(b_1 + b_2)(-C + x^3)\\
&\quad + (a_0 + a_1 + a_2)(b_0 + b_1 + b_2)C
\end{aligned} \qquad (10.33b)$$

Other optimizations are also possible by considering repeated sub-expressions in the sums of the "a"s and similarly of the "b"s.
$$\begin{aligned}
&(a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4)(b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4)\\
&= (a_0+a_1+a_2+a_3+a_4)(b_0+b_1+b_2+b_3+b_4)(x^5 - x^4 + x^3)\\
&\quad + (a_0-a_2-a_3-a_4)(b_0-b_2-b_3-b_4)(x^6 - 2x^5 + 2x^4 - x^3)\\
&\quad + (a_0+a_1+a_2-a_4)(b_0+b_1+b_2-b_4)(-x^5 + 2x^4 - 2x^3 + x^2)\\
&\quad + (a_0+a_1-a_3-a_4)(b_0+b_1-b_3-b_4)(x^5 - 2x^4 + x^3)\\
&\quad + (a_0-a_2-a_3)(b_0-b_2-b_3)(-x^6 + 2x^5 - x^4)\\
&\quad + (a_1+a_2-a_4)(b_1+b_2-b_4)(-x^4 + 2x^3 - x^2)\\
&\quad + (a_3+a_4)(b_3+b_4)(x^7 - x^6 + x^4 - x^3) + (a_0+a_1)(b_0+b_1)(-x^5 + x^4 - x^2 + x)\\
&\quad + (a_0-a_4)(b_0-b_4)(-x^6 + 3x^5 - 4x^4 + 3x^3 - x^2)\\
&\quad + a_4 b_4 (x^8 - x^7 + x^6 - 2x^5 + 3x^4 - 3x^3 + x^2)\\
&\quad + a_3 b_3 (-x^7 + 2x^6 - 2x^5 + x^4) + a_1 b_1 (x^4 - 2x^3 + 2x^2 - x)\\
&\quad + a_0 b_0 (x^6 - 3x^5 + 3x^4 - 2x^3 + x^2 - x + 1)
\end{aligned} \qquad (10.34)$$
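The 13-multiplication five-term formula (10.34) can be verified mechanically: each product is multiplied by its small fixed polynomial and accumulated, then compared against schoolbook convolution.

```python
# Verify the five-term Karatsuba-like formula (10.34).
import random

# (sign pattern for a, sign pattern for b, weight polynomial {power: coeff})
TERMS = [
    ((1, 1, 1, 1, 1),    (1, 1, 1, 1, 1),    {5: 1, 4: -1, 3: 1}),
    ((1, 0, -1, -1, -1), (1, 0, -1, -1, -1), {6: 1, 5: -2, 4: 2, 3: -1}),
    ((1, 1, 1, 0, -1),   (1, 1, 1, 0, -1),   {5: -1, 4: 2, 3: -2, 2: 1}),
    ((1, 1, 0, -1, -1),  (1, 1, 0, -1, -1),  {5: 1, 4: -2, 3: 1}),
    ((1, 0, -1, -1, 0),  (1, 0, -1, -1, 0),  {6: -1, 5: 2, 4: -1}),
    ((0, 1, 1, 0, -1),   (0, 1, 1, 0, -1),   {4: -1, 3: 2, 2: -1}),
    ((0, 0, 0, 1, 1),    (0, 0, 0, 1, 1),    {7: 1, 6: -1, 4: 1, 3: -1}),
    ((1, 1, 0, 0, 0),    (1, 1, 0, 0, 0),    {5: -1, 4: 1, 2: -1, 1: 1}),
    ((1, 0, 0, 0, -1),   (1, 0, 0, 0, -1),   {6: -1, 5: 3, 4: -4, 3: 3, 2: -1}),
    ((0, 0, 0, 0, 1),    (0, 0, 0, 0, 1),    {8: 1, 7: -1, 6: 1, 5: -2, 4: 3, 3: -3, 2: 1}),
    ((0, 0, 0, 1, 0),    (0, 0, 0, 1, 0),    {7: -1, 6: 2, 5: -2, 4: 1}),
    ((0, 1, 0, 0, 0),    (0, 1, 0, 0, 0),    {4: 1, 3: -2, 2: 2, 1: -1}),
    ((1, 0, 0, 0, 0),    (1, 0, 0, 0, 0),    {6: 1, 5: -3, 4: 3, 3: -2, 2: 1, 1: -1, 0: 1}),
]

def karatsuba5(a, b):
    out = [0] * 9
    for sa, sb, poly in TERMS:            # 13 small multiplications
        prod = sum(s * x for s, x in zip(sa, a)) * sum(s * x for s, x in zip(sb, b))
        for power, w in poly.items():
            out[power] += w * prod
    return out

rng = random.Random(7)
a = [rng.randrange(-99, 100) for _ in range(5)]
b = [rng.randrange(-99, 100) for _ in range(5)]
school = [sum(a[i] * b[j] for i in range(5) for j in range(5) if i + j == k)
          for k in range(9)]
assert karatsuba5(a, b) == school
```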
In this step, the coefficient reduction is carried out by finding cᵢ mod z and cᵢ div z; cᵢ div z is added to c_{i+1}. In polynomial reduction based on the Montgomery technique, q(z) is first found as q(z) = c(z)g(z) mod zⁿ, where $g(z) = (-f(z))^{-1} \bmod z^n$. Next, we compute $\frac{c(z) + q(z)f(z)}{z^n}$; a last step is coefficient reduction. The computation yields $a(z)b(z)z^{-5} \bmod p$ in the case of BN curves. The expressions for q(z), h(z) and v(z) in the case of BN curves are as follows:

$$\begin{aligned}
q(z) = \sum_{i=0}^{4} q_i z^i &= (c_4 + 6(c_3 - 2c_2 - 6(c_1 - 9c_0)))z^4\\
&\quad + (c_3 + 6(c_2 - 2c_1 - 6c_0))z^3 + (c_2 + 6(c_1 - 2c_0))z^2\\
&\quad + (c_1 + 6c_0)z - c_0
\end{aligned} \qquad (10.35c)$$

and

$$h(z) = \sum_{i=0}^{3} g_i z^i = 36q_4 z^3 + 36(q_4 + q_3)z^2 + 12(2q_4 + 3(q_3 + q_2))z + 6(q_4 + 4q_3 + 6(q_2 + q_1)) \qquad (10.35d)$$
Figure 10.18 (a) Parallel hybrid modular multiplication algorithm for BN curves (b) F_p multiplier using HMMB (adapted from [76] ©IEEE2012)
$$v(z) = \frac{c(z)}{z^5} + h(z) \qquad (10.35e)$$

As an example, $c(z) = a(z)b(z) = z^9 + 74z^8 + 52z^7 + 111z^6 + 70z^5 + 118z^4 + 96z^3 + 36z^2 + z + 104$, which after simplification gives $65z^5 + 6z^4 + 30z^3 + 57z^2 + 82z + 100$. Note that this needs to be reduced mod p to obtain the actual result.
The same example can be worked out using the serial multiplication due to Fan et al. [77], which leads to smaller intermediate coefficient values. Note that instead of computing a(z)b(z) fully, we take the terms of b one at a time and reduce the product mod p. The results after each step of partial-product addition, coefficient reduction and scaling by z are as follows:

$10z^4 + 3z^3 + 4z^2 + 6z + 95$ after adding 5A and 33p
$21z^4 + 133z^3 + 127z^2 + 65z + 101$ after adding 9A and 74p
$34z^4 + 44z^3 + 37z^2 + 72z + 50$ after adding 34A and 96p
$49z^4 + 39z^3 + 14z^2 + 78z + 76$ after adding 136A and 53p
$26z^4 + 72z^3 + 60z^2 + 117z + 129$ after adding 5A and 99p
314 10 RNS in Cryptography
The coefficients in this case can be seen to be smaller than in the previous case. We illustrate the first step as follows. After multiplication of A by 5 we obtain $175z^4 + 180z^3 + 35z^2 + 30z + 515$. Reducing the terms mod 137 and adding each carry to the next higher term, we obtain $z^5 + 39z^4 + 43z^3 + 35z^2 + 33z + 104$. Evidently, we need to add 33p to make the least significant digit zero, yielding $(z^5 + 39z^4 + 43z^3 + 35z^2 + 33z + 104) + 33(36z^4 + 36z^3 + 24z^2 + 6z + 1)$, which after reducing the terms mod 137 as before and dividing by z (since the $z^0$ term becomes zero) gives $10z^4 + 3z^3 + 4z^2 + 6z + 95$.
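The step just described can be sketched in code. A(z) is inferred here from the text's value of 5A, and coefficients are stored least-significant first:

```python
# One step of the serial reduction with the toy radix z = 137 and
# BN-style p(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1.
Z = 137
P = [1, 6, 24, 36, 36]        # p(z), low degree first
A = [103, 6, 7, 36, 35]       # A(z): 5*A = 515 + 30z + 35z^2 + 180z^3 + 175z^4

def carry_reduce(t):
    """Bring coefficients into [0, Z) by propagating carries upward."""
    out, carry = [], 0
    for c in t:
        c += carry
        out.append(c % Z)
        carry = c // Z
    assert carry == 0
    return out

def serial_step(acc, b_digit):
    """acc <- (acc + b_digit*A + m*p) / z, with m chosen to zero the constant."""
    n = 6
    t = [(acc[i] if i < len(acc) else 0) + b_digit * (A[i] if i < len(A) else 0)
         for i in range(n)]
    t = carry_reduce(t)
    m = (-t[0]) % Z           # p(0) = 1, so the multiplier is simply -c0 mod z
    t = carry_reduce([t[i] + m * (P[i] if i < len(P) else 0) for i in range(n)])
    assert t[0] == 0
    return t[1:]              # exact division by z

print(serial_step([0], 5))    # [95, 6, 4, 3, 10], i.e. 10z^4+3z^3+4z^2+6z+95
```

The multiplier m comes out as 33, matching the "33p" of the worked example.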
In the digit-serial hybrid multiplication technique, the multiplication and reduction/scaling are carried out together in each step. The Fan et al. [76] architecture was based on a hybrid Montgomery multiplier (HMM) in which multiplication and reduction are interleaved. The multiplier architecture for z = 2⁶³ + s, where s = 857 = 2⁵(2⁴ + 2³) + 2⁶ + (2⁴ + 2³) + 1, for 128-bit security is shown in Figure 10.18b. Four 65 × 65 multipliers and one 65 × 32 multiplier are used to carry out the polynomial multiplication. Each 65 × 65 multiplier is implemented using the two-level Karatsuba method. Five "Mod-1" blocks are used for the first coefficient reduction step; the Mod-1 block is shown in Figure 10.18b. Partial products are immediately reduced. Multiplication by s is realized using four additions since s = 2⁵(2⁴ + 2³) + 2⁶ + (2⁴ + 2³) + 1. The outputs of the "Mod-1" blocks can be at most 78 bits. These outputs, corresponding to the various bᵢ computed in five successive cycles, are next accumulated and shifted in the accumulator. Once the partial products are ready, in phase III, polynomial reduction is performed with only shifts and additions, e.g., 6α = 2²α + 2α, 9α = 2³α + α, 36α = 2⁵α + 2²α. The values of cᵢ are less than (i + 1)2⁷⁷ for 0 ≤ i ≤ 4. It can be shown that the vᵢ are less than 92 bits. The "Mod-2" block is similar to the "Mod-1" block, but its input is only 93 bits (see Figure 10.18b). The resulting rᵢ are such that |rᵢ| ≤ 2⁶³ + 2⁴¹ for 0 ≤ i ≤ 3 and |r₄| ≤ 2³⁰.
The negative coefficients in r(z) are made positive by adding a suitable polynomial. The hybrid algorithm for modular reduction for BN curves is presented in Figure 10.19a. Note that five steps are needed in the first loop to add a(z)bⱼ to the old result and divide by z mod p. The authors prove that the output is bounded under the conditions 0 ≤ |aᵢ|, |bᵢ| < 2^{m/2} for i = 4 and 0 ≤ |aᵢ|, |bᵢ| < 2^{m+1} for 0 ≤ i ≤ 3, such that 0 ≤ |rᵢ| < 2^{m/2} for i = 4 and 0 ≤ |rᵢ| < 2^{m+1} for 0 ≤ i ≤ 3. Note that for realizing 256-bit BN curves, the digit size is 64 bits; four 64-bit words and one 32-bit word will be needed. It can be seen that in the first loop, in step 3, one 32 × 64 and four 64 × 64 multiplications are needed. In step 4, one ⌈log₂ s⌉ × ⌈log₂ μ⌉ multiplication, where μ < 2^{m+6}, is needed. The last iteration takes four 32 × 64 and one 32 × 32 multiplications. In total, the first loop takes one 32 × 32, eight 32 × 64, sixteen 64 × 64 and five ⌈log₂ s⌉ × ⌈log₂ μ⌉ multiplications, where μ < 2^{m+6}. The coefficient reduction phase requires eight ⌈log₂ s⌉ × ⌈log₂ μ⌉ multiplications. It can be shown that μ < 2^{k+6} in the for loop (steps 8–10) and μ < s·2⁶ in step 12 of the second for loop. On the other hand, the Barrett and Montgomery algorithms need 36 64 × 64 multiplications.
The design was implemented on a 130 nm ASIC and needed 183K gates for ate and R-ate pairing; it worked at a 204 MHz clock frequency, and the times taken are 4.22 and 2.91 ms, respectively. The architecture of the multiplier, together with the accumulator and the γ, μ calculation blocks, is shown in Figure 10.19b. Step 3 is performed by a row of 64 × 16 and 32 × 16 multipliers needing four iterations. The partial product is next reduced by the Mod_t block, which comprises a multiplier and a subtractor. This block generates μ and γ from cᵢ. Note that μ = cᵢ div 2^m and γ = cᵢ mod 2^m − sμ in all mod blocks except the one below rc₀, which instead computes γ = sμ − rc₀ mod 2^m. The second loop re-uses the mod z blocks.
Chung and Hasan [79] have suggested the use of LWPFIs (low-weight polynomial form integers) for performing modular multiplications efficiently. These are similar to GMNs (generalized Mersenne numbers) f(t), where t is not a power of 2 and |fᵢ| ≤ 1. Since f(t) is monic (its leading coefficient is unity), the polynomial reduction phase is efficient. A pseudocode is presented in Figure 10.20. The authors use Barrett's reduction algorithm for performing the divisions required in phase III. When the moduli are large, the Chung and Hasan algorithm is more efficient than the traditional Barrett or Montgomery reduction algorithms. The authors have later extended this technique to the case [80] where |fᵢ| ≤ s with s < z. Note that the polynomial reduction phase is efficient only when f(z) is monic.
Corona et al. [81] have described a 256-bit prime-field multiplier for application in bilinear pairing using BN curves with p = 2⁶¹ + 2¹⁵ + 1, using an asymmetric divide-and-conquer approach based on the five-term Karatsuba technique, which used 12 DSP48 slices on a Virtex-6. It needed fourteen 64 × 64 partial sub-products. This, however, needs a lot of additions. These additions have a certain pattern that can be exploited to reduce the number of clock cycles needed from 23.
Figure 10.19 (a) Hybrid modular multiplication algorithm and (b) Fp multiplier using HMMB
(adapted from [77] ©IEEE2009)
Brinci et al. [83] have suggested a 258-bit multiplier for BN curves. The authors observe that the Karatsuba technique cannot efficiently exploit the full performance of the DSP blocks in FPGAs; hence, alternative techniques need to be explored. They use a quartic polynomial multiplier needing 13 sub-products based on the Montgomery technique [75], realized using 65 × 65-bit multipliers, 7 × 65-bit multipliers and one 7 × 7-bit multiplier, and it needs 22 additions. In non-standard tiling, eleven DSP blocks suffice: eight multipliers are 17 × 24 whereas three are 24 × 17. The value of z used for the BN curves is 2⁶³ + 857. A frequency of 208 MHz was achieved on a Virtex-6 using 11 DSP48 blocks and 4 block RAMs, taking 11 cycles per product.
which costs 3M + 5A + B. On the other hand, for squaring, we have the formulae for the respective cases of school book and Karatsuba as
where v0 = a0^2, v1 = a1^2. Thus, the operations in the two cases are M + 2S + 2A + B and 3S + 4A + B, where S stands for squaring.
In another technique, known as complex multiplication, c = ab is computed as

c0 = v0 + β((a1 + a2)(b1 + b2) − v1 − v2)
c1 = (a0 + a1)(b0 + b1) − v0 − v1 + βv2          (10.40c)
c2 = (a0 + a2)(b0 + b2) − v0 + v1 − v2

which costs 6M + 15A + 2B, where v0 = a0b0, v1 = a1b1 and v2 = a2b2. For squaring, we have

c0 = v0 + β((a1 + a2)^2 − v1 − v2)
c1 = (a0 + a1)^2 − v0 − v1 + βv2          (10.40d)
c2 = (a0 + a2)^2 − v0 + v1 − v2
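The identity behind (10.40c) can be checked numerically; the sketch below compares it against school book multiplication in Fp[x]/(x^3 − β) for an illustrative small prime and β (both are assumptions, not values from the text):

```python
p, beta = 2**31 - 1, 5   # illustrative prime and cubic non-residue

def mul_schoolbook(a, b):
    # 9 multiplications: reduce x^3 -> beta, x^4 -> beta*x
    a0, a1, a2 = a
    b0, b1, b2 = b
    return ((a0*b0 + beta*(a1*b2 + a2*b1)) % p,
            (a0*b1 + a1*b0 + beta*a2*b2) % p,
            (a0*b2 + a1*b1 + a2*b0) % p)

def mul_karatsuba_like(a, b):
    # equation (10.40c): 6 products v0, v1, v2 and three products of sums
    a0, a1, a2 = a
    b0, b1, b2 = b
    v0, v1, v2 = a0*b0, a1*b1, a2*b2
    return ((v0 + beta*((a1 + a2)*(b1 + b2) - v1 - v2)) % p,
            ((a0 + a1)*(b0 + b1) - v0 - v1 + beta*v2) % p,
            ((a0 + a2)*(b0 + b2) - v0 + v1 - v2) % p)

a, b = (3, 141, 59), (26, 535, 89)
assert mul_schoolbook(a, b) == mul_karatsuba_like(a, b)
```

The second form trades three of the nine coefficient products for extra additions, which is the M-versus-A trade-off discussed throughout this section.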
Thus, the total cost is 3M + 2S + 11A + 2B operations. For the second technique, we have s2 = (a0 − a1 + a2)^2, while the other si are the same as in (10.43a) and the final step is the same as in (10.43b). The total cost is 2M + 3S + 10A + 2B operations. For the third method, we have the pre-computation given by
Table 10.3 Summary of squaring costs for quartic extensions as quadratic over quadratic (adapted from [84] ©2006)

p4 method:          School book         | Karatsuba           | Complex
p2 method:          >Linear | Linear    | >Linear | Linear    | >Linear | Linear
School book:        6M + 4S | 10A + 4B  | 3M + 6S | 14A + 5B  | 8M | 12A + 4B
Karatsuba:          3M + 6S | 17A + 6B  | 9S | 20A + 8B       | 6M | 18A + 4B
Karatsuba/Complex:                      | 7M | 17A + 6B       | 6M | 20A + 8B
An element α ∈ F_{p^6} is constructed as α0 + α1X + α2X^2 + α3X^3 + α4X^4 + α5X^5, where αi ∈ Fp. The school book method computes c = ab as [84]
c0 = a0b0 + β(a1b5 + a2b4 + a3b3 + a4b2 + a5b1)
c1 = a0b1 + a1b0 + β(a2b5 + a3b4 + a4b3 + a5b2)
c2 = a0b2 + a1b1 + a2b0 + β(a3b5 + a4b4 + a5b3)
c3 = a0b3 + a1b2 + a2b1 + a3b0 + β(a4b5 + a5b4)          (10.45a)
c4 = a0b4 + a1b3 + a2b2 + a3b1 + a4b0 + βa5b5
c5 = a0b5 + a1b4 + a2b3 + a3b2 + a4b1 + a5b0
The total costs of multiplication for the direct sextic extension in the cases of school book, Montgomery and Toom–Cook-6x techniques can be found to be 36M + 30A + 5B, 17M + 143A + 5B and 11M + 93Mz + 236A + 5B operations, respectively. For squaring, the corresponding costs are 15M + 6S + 22A + 5B, 17S + 123A + 5B and 11S + 79Mz + 163A + 5B, where Mz stands for multiplication by a small word-size integer.
In the case of squaring c = a^2, we have for the school book method of sextic extension

c0 = a0^2 + β(2(a1a5 + a2a4) + a3^2)
c1 = 2(a0a1 + β(a2a5 + a3a4))
c2 = 2a0a2 + a1^2 + β(2a3a5 + a4^2)
c3 = 2(a0a3 + a1a2 + βa4a5)          (10.45b)
c4 = 2(a0a4 + a1a3) + a2^2 + βa5^2
c5 = 2(a0a5 + a1a4 + a2a3)
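The squaring pattern of (10.45b) is just (10.45a) with b = a; the sketch below checks the two against each other in Fp[X]/(X^6 − β) for illustrative parameters (the prime and β are assumptions):

```python
p, beta = 2**31 - 1, 7   # illustrative parameters

def mul_sextic(a, b):
    # school book product in Fp[X]/(X^6 - beta), the pattern of (10.45a)
    c = [0] * 6
    for j in range(6):
        for k in range(6):
            if j + k < 6:
                c[j + k] += a[j] * b[k]
            else:
                c[j + k - 6] += beta * a[j] * b[k]
    return [x % p for x in c]

def sqr_sextic(a):
    # school book squaring, the pattern of (10.45b)
    a0, a1, a2, a3, a4, a5 = a
    return [x % p for x in (
        a0*a0 + beta*(2*(a1*a5 + a2*a4) + a3*a3),
        2*(a0*a1 + beta*(a2*a5 + a3*a4)),
        2*a0*a2 + a1*a1 + beta*(2*a3*a5 + a4*a4),
        2*(a0*a3 + a1*a2 + beta*a4*a5),
        2*(a0*a4 + a1*a3) + a2*a2 + beta*a5*a5,
        2*(a0*a5 + a1*a4 + a2*a3))]

a = (1, 22, 333, 4444, 55555, 666666)
assert sqr_sextic(a) == mul_sextic(a, a)
```

Collecting the symmetric terms is what replaces pairs of multiplications with the cheaper squarings counted in the costs above.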
Using Karatsuba once again to compute each of the three products gives
10.6 Pairing Processors Using RNS 323
AB = [full Karatsuba expansion over the basis {1, α, α^2, β, αβ, α^2β}, assembled from sub-products of the form (ai + aj)(bi + bj) − aibi − ajbj]          (10.47)
It can be seen that 18M + 56A + 8B operations in Fp are required, with only six reductions. Note that each component of AB can lie between 0 and 44p^2. Thus, B^n in the Montgomery representation and M in the RNS representation must be greater than 44p to permit lazy reduction in this degree-6 field.
For the sextic extension, if M < 20A, Devegili et al. [84] have suggested constructing the extension as quadratic over cubic, with Karatsuba over Karatsuba for multiplication and complex over Karatsuba for squaring. For M ≥ 20A, cubic over quadratic, Toom–Cook-3x over Karatsuba for multiplication, and either complex, Chung–Hasan SQR3 or SQR3x over Karatsuba/complex for squaring have been recommended.
The extension field F_{p^12} is defined by the following tower of extensions [88]:

F_{p^2} = F_p[u]/(u^2 + 2)
F_{p^6} = F_{p^2}[v]/(v^3 − ξ) where ξ = u − 1          (10.48)
F_{p^12} = F_{p^6}[w]/(w^2 − v)

Note that the representation F_{p^12} = F_{p^2}[W]/(W^6 − ξ), where W = w, is also possible. The tower has the advantage of efficient multiplication for the canonical polynomial base. Hence, an element α ∈ F_{p^12} can be represented in any of the following three ways:
α = a0 + a1w where a0, a1 ∈ F_{p^6}
α = (a0,0 + a0,1 v + a0,2 v^2) + (a1,0 + a1,1 v + a1,2 v^2)w where ai,j ∈ F_{p^2}
α = a0,0 + a1,0 W + a0,1 W^2 + a1,1 W^3 + a0,2 W^4 + a1,2 W^5          (10.49)
Hankerson et al. [88] have recommended the use of Karatsuba for multiplication and the complex method for squaring in F_{p^12} extensions. A quadratic on top of a cubic on top of a quadratic tower of extensions needs to be used. A multiplication using Karatsuba's method needs 54 multiplications and 12 modular reductions, whereas squaring using the complex method in F_{p^12} and Karatsuba for multiplication in F_{p^6} and F_{p^2} needs 36 multiplications and 12 modular reductions [69].
A complete multiplication in F_{p^k} requires k^λ multiplications in Fp, with 1 < λ ≤ 2, and note that lazy reduction can be used in Fp. A multiplication in F_{p^k} then requires k reductions, since the result has k coefficients. A multiplication in Fp needs n^2 word multiplications and a reduction requires (n^2 + n) word multiplications. If p ≡ 3 mod 4, multiplications by β = −1 can be computed as simple subtractions in F_{p^2}. A multiplication in F_{p^k} thus needs (k^λ + k)n^2 + kn word multiplications in radix representation and 1.1 × ((7k/5)n^2 + ((10k^λ + 8k)/5)n) word multiplications if RNS is used [69].
The school book type of multiplication is preferred for F_{p^2}, since in the Karatsuba method the dynamic range is increased from 2p^2 to 6p^2 [89].
Yao et al. [89] have observed that for F_{p^12} multiplication, the school book method also provides an elegant solution. Using lazy reduction, only 12 reductions will be needed. The evaluation of f·g can be carried out as

f·g = Σ_{j+k<6} fj gk W^{j+k} + Σ_{j+k≥6} fj gk ζ W^{j+k−6}          (10.50)

where f, g ∈ F_{p^2}[W]/(W^6 − ζ) and fj, gk ∈ F_{p^2}, 0 ≤ j, k ≤ 5, are the coefficients of f and g, respectively. The coefficients of the intermediate results fj gk are less than 2p^2 and the coefficients of fj gk ζ are less than 4p^2. Considering fj = f0 + f1 i and gk = g0 + g1 i, we have

fj gk ζ = (f0g0 − f1g1 − f0g1 − f1g0) + (f0g0 − f1g1 + f0g1 + f1g0)i          (10.51)
Since four products are needed to compute both the components, two accu-
mulators can easily handle this requirement.
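Equation (10.51) amounts to multiplying by ζ = 1 + i; a small check in F_{p^2} (illustrative prime, assuming i^2 = −1):

```python
p = 2**31 - 1   # illustrative prime with p ≡ 3 (mod 4), so i^2 = -1 works

def fp2_mul(x, y):
    # (x0 + x1 i)(y0 + y1 i) with i^2 = -1
    x0, x1 = x
    y0, y1 = y
    return ((x0*y0 - x1*y1) % p, (x0*y1 + x1*y0) % p)

def mul_by_zeta(f, g):
    # equation (10.51): f*g*(1 + i) from the four cross products
    f0, f1 = f
    g0, g1 = g
    return ((f0*g0 - f1*g1 - f0*g1 - f1*g0) % p,
            (f0*g0 - f1*g1 + f0*g1 + f1*g0) % p)

f, g = (123456, 654321), (111111, 999999)
assert mul_by_zeta(f, g) == fp2_mul(fp2_mul(f, g), (1, 1))
```

Both components reuse the same four products f0g0, f1g1, f0g1, f1g0, which is why two accumulators suffice.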
B. Inversion
Three other operations are needed in pairing computation: inversion, Frobenius computation and squaring of unit elements. For the quadratic case [90], assuming an irreducible polynomial x^2 + n, we have the formula for inversion

1/(a + bx) = (a − bx)/(a^2 + nb^2)          (10.52a)

and, for the cubic case x^3 + n,

1/(a + bx + cx^2) = (A + Bx + Cx^2)/F          (10.52b)
f = g + hw = g0 + h0W + g1W^2 + h1W^3 + g2W^4 + h2W^5          (10.53a)
Pairing Algorithms
where ℓ is the order; thus, three steps are required. The exponentiation by the first term can be performed by conjugation (since f^{p^6} is the conjugate of f) followed by an inversion (to take care of the −1). The exponentiation corresponding to p^2 + 1 needs a Frobenius (f^{p^2}) computation followed by a multiplication with f. These two are known as the "easy part" of the final exponentiation, whereas the operation corresponding to the third term is known as the "hard part". Devegili et al. [93] have suggested that in the case of BN curves, the exponent (p^4 − p^2 + 1)/ℓ can be written in terms of the parameters of the BN curve, p and x, as

(p^4 − p^2 + 1)/ℓ = p^3 + (6x^2 + 1)p^2 + (36x^3 − 18x^2 + 12x + 1)p + (36x^3 − 30x^2 + 18x − 2)

so that the hard part is

f^{p^3} · (f^{p^2})^{6x^2+1} · (f^p)^{36x^3−18x^2+12x+1} · f^{36x^3−30x^2+18x−2}

This can be computed as a ← f^{6x−5}, b ← a^p, b ← a·b, followed by the Frobenius computations f^p, f^{p^2} and f^{p^3}, giving

f^{p^3} · [b · (f^p)^2 · f^{p^2}]^{6x^2+1} · b · (f^p · f)^9 · a · f^4
Note that a^p, f^p, f^{p^2} and f^{p^3} are computed using Frobenius. Later, Scott et al. [92] have given a better systematic method of computing the hard part, taking the parameters of the BN curves into account, expressing it as a product of the form

y0 · y1^2 · y2^6 · y3^12 · y4^18 · y5^30 · y6^36

where the terms yi are built from m using Frobenius maps, exponentiations by the curve parameter x and conjugations.
The terms in the brackets are next computed using four multiplications (inversion is just a conjugation), leading to a calculation of the above form. The authors next suggest that, using the Olivos algorithm [94], the computation can be done using just two registers as follows:

T0 ← y6^2, T0 ← T0·y4, T0 ← T0·y5, T1 ← y3·y5, T1 ← T1·T0, T0 ← T0·y2, T1 ← T1^2,
T1 ← T1·T0, T1 ← T1^2, T0 ← T1·y1, T1 ← T1·y0, T0 ← T0^2, T0 ← T0·T1
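This two-register chain can be checked in any commutative group; the sketch below runs it in the multiplicative group modulo a Mersenne prime (an illustrative stand-in for F_{p^12}, with arbitrary elements y0, …, y6) and compares the result with the closed form y0·y1^2·y2^6·y3^12·y4^18·y5^30·y6^36:

```python
q = 2**127 - 1                  # a Mersenne prime; illustrative group
y = [3, 5, 7, 11, 13, 17, 19]   # arbitrary group elements y0..y6

# the Olivos-style register chain, step by step
T0 = y[6] * y[6] % q
T0 = T0 * y[4] % q
T0 = T0 * y[5] % q
T1 = y[3] * y[5] % q
T1 = T1 * T0 % q
T0 = T0 * y[2] % q
T1 = T1 * T1 % q
T1 = T1 * T0 % q
T1 = T1 * T1 % q
T0 = T1 * y[1] % q
T1 = T1 * y[0] % q
T0 = T0 * T0 % q
T0 = T0 * T1 % q

# closed form with exponents 1, 2, 6, 12, 18, 30, 36
expected = 1
for yi, e in zip(y, [1, 2, 6, 12, 18, 30, 36]):
    expected = expected * pow(yi, e, q) % q
assert T0 == expected
```

Thirteen group operations thus realize all seven exponents at once, which is the point of the addition-chain optimization.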
Yao et al. [89] suggested a set of base moduli to reduce the complexity of modulus reduction in RNS, and also presented an efficient Fp Montgomery modular multiplier; a high-speed pairing processor using this multiplier has been described. They have suggested selecting the moduli of the two RNS bases used in base extension to be close, so that the (bk − cj) values are small, where bk and cj are the moduli in the two bases. Thus, the bit lengths of the operands needed in base extension are small.
They have suggested the use of eight-moduli sets for a 256-bit dynamic range for 128-bit security:

B = {2^w − 1, 2^w − 9, 2^w + 3, 2^w + 11, 2^w + 5, 2^w + 9, 2^w − 31, 2^w + 15}
C = {2^w, 2^w + 1, 2^w − 3, 2^w + 17, 2^w − 13, 2^w − 21, 2^w − 25, 2^w − 33}

where w = 32, so that the bit lengths of ∏_{k=1,k≠j}^{s} (bk − cj) are as small as possible (<25 bits).
Yao et al. [95] have suggested that the maximal length in bits of (bi − cj) should be minimized to v bits, so that multiplications will be v × w words rather than w × w words. They have also suggested a systematic procedure for RNS parameter selection that results in lower complexity. They observe that for a 16-bit machine, n = 4, where n is the number of moduli, is optimum. They suggest two techniques for choosing moduli: (a) multiple plus prime and (b) first come first selected improved. In the first method, we start with the set of the first (2n − 1) primes. The product of all of these, ∏_{i=1}^{2n−1} pi, is denoted as Θ, and M is chosen as a multiple of Θ. The integers M + pi are then all pairwise coprime, hence the name MPP (multiple plus prime). The second method selects only pseudo-Mersenne primes. As an illustration, two RNS bases with moduli {2^64 − 33, 2^64 − 15, 2^64 − 7, 2^64 − 3} and {2^64 − 17, 2^64 − 11, 2^64 − 9, 2^64 − 5} can be obtained, yielding the various weights Bij to be represented by at most 5 to 14 bits. Thus, v reduces to 14 for one RNS and 8 for the other. Thus, 64 × 64 multiplications can be replaced with 64 × 14 and 64 × 8 multiplications. The first method is attractive when n is very small, and note that multiplications are not performed as additions.
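The MPP construction is easy to demonstrate: any common divisor of M + pi and M + pj would divide pj − pi, whose prime factors all divide Θ and hence M, forcing a contradiction. A sketch for n = 4 (the size of M is an illustrative choice):

```python
from math import gcd
from functools import reduce

n = 4
primes = [2, 3, 5, 7, 11, 13, 17]              # first 2n - 1 primes
theta = reduce(lambda x, y: x * y, primes)     # Θ = 510510
M = ((1 << 32) // theta) * theta               # any multiple of Θ (illustrative size)
moduli = [M + pi for pi in primes[:n]]         # pick n of the M + pi

# pairwise coprimality holds by construction
assert all(gcd(moduli[i], moduli[j]) == 1
           for i in range(n) for j in range(i + 1, n))
```

Scaling M up or down changes the word size of the moduli without disturbing the coprimality argument.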
Montgomery reduction in RNS has higher complexity than ordinary Montgomery reduction. However, this overhead of slow reduction can be partially removed by reducing the number of reductions, a technique known as lazy reduction. In computing ab + cd with a, b, c, d in Fp, where p is an n-word prime, we need 4n^2 + 2n word operations, since each modular multiplication needs 2n^2 + n word products using the digit-serial Montgomery algorithm [11]. In lazy reduction, ab + cd is computed first and then reduced, needing only 3n^2 + n word products [69]. Lazy reduction performs one reduction per group of multiplications. This is possible for expressions like AB ± CD ± EF in Fp. In RNS [89], such a computation takes 2s^2 + 11s word multiplications, while it takes 4s^2 + s using digit-serial Montgomery modular multiplication [11]. The actual operating range should be large for lazy reduction to be economical (e.g. 22p^2 is needed for the computation of (10.51)). Around 10,000 modular multiplications are needed for a pairing processor.
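The correctness of lazy reduction rests on a simple congruence: reducing once after accumulation gives the same residue as reducing after every product. A sketch (illustrative prime):

```python
p = (1 << 61) - 1   # illustrative prime modulus

a, b, c, d = 2**60 - 3, 2**59 + 7, 2**58 - 1, 2**57 + 9
eager = ((a * b) % p + (c * d) % p) % p   # reduce after every product
lazy = (a * b + c * d) % p                # accumulate first, reduce once
assert eager == lazy
```

The cost is that the accumulator must hold the unreduced sum, which is why the operating range (e.g. 22p^2 above) has to be large enough.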
Yao et al. [89] have described a high-speed pairing co-processor using RNS and
lazy reduction. They have used homogeneous coordinates [96] for realizing Ate and
optimal Ate pairing. The algorithm for optimal Ate pairing is presented in
Figure 10.21. The formulas used for point doubling, point addition and line raising
together with the operations needed in the Miller loop are presented in Table 10.4.
Figure 10.21 Algorithm for optimal Ate pairing (adapted from [89] ©2011)
Note that S, M and R stand for squaring, multiplication and reduction in F_{p^2}, and m and r stand for multiplication and reduction in Fp. The cost of squaring in Fp is also indicated as m. Note that M = 4m, S = 3m and R = 2r. The school book method has been employed to avoid more additions. The final addition is carried out in steps 10 and
11, whereas final exponentiation is carried out in step 12. The operation count is
presented in Table 10.5. This includes the Frobenius endomorphism of Q1 and Q2. Note that the computation T + Q2 is skipped because this point is not needed further.
The algorithm for the computation of f^{p^6−1} in F_{p^12} [89] is shown in Figure 10.22. The hard part is computed following Devegili et al. [93] as shown in Table 10.5. The inversion needed in Figure 10.22 is carried out as d^{−1} = d^{p−2} mod p.
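Fermat-based inversion of this kind is directly checkable; a minimal sketch with an illustrative prime:

```python
p = (1 << 61) - 1          # illustrative prime
d = 123456789
d_inv = pow(d, p - 2, p)   # Fermat: d^(p-2) ≡ d^-1 (mod p)
assert d * d_inv % p == 1
```

Trading an inversion for an exponentiation keeps the datapath multiplication-only, which suits the RNS multiplier.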
The architecture of the pairing coprocessor due to Yao et al. [89] is presented in Figure 10.23. It uses eight PEs in the Cox-Rower architecture (see Figure 10.23c). Each PE caters for one channel of B and one channel of C. Four DSP blocks can realize a 35 × 35 multiplier, whereas two DSPs can realize a 35 × 25 multiplier. A dual-mode multiplier (see Figure 10.23a), which can perform two element multiplications at a time, is used. It uses two accumulators to perform the four multiplications in parallel and accumulate the results in parallel. Each PE performs two element multiplications at a time using the dual multiplier. The Cox-Rower algorithm is thus modified accordingly and needs fewer loops. The cox unit is an accumulator that provides result-correction information to all PEs in the base extension operation. The register receives ξ values from every PE and delivers two ξ values to all PEs at a time. The internal design of the PE is shown in Figure 10.23b, comprising the dual-mode multiplier, an adder, 2 RAMs for multiplier inputs, 2 RAMs for adder inputs, 3 accumulators and one channel reduction module. Two accumulators are used for polynomial multiplication and one is used for the matrix multiplication of the base extension step.
Table 10.4 Pipeline design and operation count of Miller loop (adapted from [89] ©2011)
Condition | Step | Operations | Count
ri = 0 | 1 | A = y1^2, B = 3b0z1^2, C = 2x1y1, D = 3x1^2, E = 2y1z1 | 3S + 2M + 5R
 | 2 | x3 = (A − 3B)C, y3 = A^2 + 6AB − 3B^2, z3 = 4AE, l0 = (B − A)ζ, l3 = yP·E, l4 = xP·D | 2S + 3M + 4m + 5R
 | 3 | f = f^2 | 6S + 15M + 6R
 | 4 | f = f·l | 18M + 6R
ri = 1 | 1 | A = y1^2, B = 3b0z1^2, C = 2x1y1, D = 3x1^2, E = 2y1z1, f0 = f0^2 | 5S + 4M + 6R
 | 2 | x3 = (A − 3B)C, y3 = A^2 + 6AB − 3B^2, z3 = 4AE, l0 = (B − A)ζ, l3 = yP·E, l4 = xP·D, f1 = f1^2 | 2S + 6M + 4m + 6R
 | 3 | A = y3 − yQ·z3, B = x3 − xQ·z3, f2,3,4,5 = (f^2)2,3,4,5 | 4S + 12M + 4m + 6R
 | 4 | f = f·l | 18M + 6R
 | 5 | C = A^2, D = B^2, l0 = (xQ·A − yQ·B)ζ, l3 = yP·B, l4 = xP·A | 2S + 2M + 4m + 5R
Table 10.5 Operation count of final steps (adapted from [89] ©2011)

Step | Operation | Operation count
FA1 | Q1 ← πp(Q) | M + 2m + R
 | T ← T + Q1 and lT,Q1(P) | 2S + 11M + 8m + 13R
 | f ← f·lT,Q1(P) | 18M + 6R
FA2 | Q2 ← πp(Q1) | M + 2m + R
 | lT,Q2(P) | 4M + 8m + 5R
 | f ← f·lT,Q2(P) | 18M + 6R
f^{p^6−1} | before d^{−1} | 9S + 12M + 2m + 7R + r
 | d^{−1} = d^{p−2} | 294m + 294r
 | after d^{−1} | 6S + 36M + 16R
f^{p^2+1} | f ← f^{p^2+1} | M~ + 8m + 4R~
f^{(p^4−p^2+1)/n} | a ← f^{6|u|−5} | 64S~ + M~ + 68R~
 | b ← a^{p+1} | M~ + R~ + 3M + 4m + 5R
 | f^p, f^{p^2}, f^{p^3} | 6M + 12m + 12R
 | T ← b·(f^p)^2·f^{p^2} | 2M~ + S~ + 3R~
 | T ← T^{6u^2+1} | 126S~ + 12M~ + 138R~
 | f ← T·b·(f^p·f)^9·a·f^4 | 7M~ + 5S~ + 12R~

(Here M~, S~ and R~ denote multiplication, squaring and reduction in F_{p^12}.)
Figure 10.22 Algorithm for computation of f^{p^6−1} in F_{p^12} (adapted from [89] ©2011)
Figure 10.23 Pairing coprocessor hardware architecture (adapted from [89] ©2011)
Table 10.6 Number of operations and cycles per computation in Miller loop (adapted from [89]
©2011)
Miller's loop
Operation | 2T and lT,T(P) | T + Q and lT,Q(P) | f^2 | f·l | Ate | Optimal
#Multiplications | 39 | 54 | 78 | 72 | – | –
#Reductions | 20 | 26 | 12 | 12 | – | –
#Cycles | 340 | 456 | 313 | 301 | 128,531 | 64,084
The authors have given in detail the number of operations and cycles required for the Miller loop (see Table 10.6) and for the computation of the final steps in Table 10.7. One pairing computation can be completed in about 1 ms. The speed advantage can be seen from the fact that a 256 × 256 multiplication in RNS needs 16 32 × 32-bit multiplications, whereas Karatsuba needs 27 such multiplications.
Duquesne and Guillermin [66] have described an FPGA implementation of optimal Ate pairing using RNS for the 128-bit security level in large characteristic. The authors have used BN curves for 126-bit security using u = −(2^62 + 2^55 + 1), whereas for 128-bit security they have used u = −(2^63 + 2^22 + 2^8 + 2^7 + 1). The base extension using MRC described earlier [52] needs (7n^2 + 8n)/5 RNS digit multiplications and cannot be parallelized, as it uses MRC. On the other hand, the Kawamura et al. technique [34] has an overall complexity of 2n^2 + 3n, which has been enhanced
Table 10.7 Number of operations and cycles per computation in final steps (adapted from [89] ©2011)

Step | Operation | #Cycles | #Idle cycles | Occupation rate (%)
Final addition | FA1 | 799 | 7 | 99.1
 | FA2 | 566 | 50 | 91.2
f^{p^6−1} | d^{p−2} mod p | 25,537 | 21,127 | 17.3
 | Others | 1333 | 240 | 82.0
f^{p^2+1} | f^{p^2+1} | 573 | 9 | 98.4
f^{(p^4−p^2+1)/n} | f^{6|u|−5} | 21,812 | 68 | 99.7
 | T^{6u^2+1} | 44,778 | 138 | 99.7
 | Others | 6905 | 47 | 99.3
Total | Ate | 100,578 | 21,629 | 78.5
 | Optimal | 101,943 | 21,686 | 78.7
Figure 10.24 Algorithm for optimal Ate pairing (adapted from [67] ©2012)
by using n parallel rowers to achieve one RNS digit multiplication and accumulation per cycle. Hence, a full-length multiplication can be done in two cycles, one over base B and one over base B′, an addition in 4 cycles, a subtraction in 6 cycles, and a whole reduction in 2n + 3 cycles. They have used lazy reduction; as such, for F_{p^k} only k reductions are needed and thus (2kn + 3k) cycles overall. The algorithm implemented is shown in Figure 10.24. The authors use projective coordinates [96, 97]. The doubling and addition steps together with the line raising are presented in detail in Figure 10.25a–c, where it may be noted that the classical formulae are rearranged to highlight reduction and the inherent parallelism in local
Figure 10.25 (a–c). Algorithms for doubling step, addition step and hard part of the final
exponentiation (adapted from [66] ©2012)
variables. The F_{p^12} inversion is based on the Hankerson et al. [88] formulae. The final exponentiation used the Scott et al. multi-addition chain [92] described before. The squaring in F_{p^12} uses the technique proposed in [98]. Specifically, for a = Σ_{i=0}^{5} ai γ^i with ai ∈ F_{p^2}, the coefficients of A = a^2 are computed directly from the ai by the formulae of [98].
The authors give details of the operation count for the BN126 and BN128 curves, which saves cycles at each step using RNS as compared to the earlier design of Guillermin [99]. The saving is due to exploitation of the inherent parallelism in the algorithms dbl (see Figure 10.25a) and add, as well as in operations in F_{p^12}. The pipeline depth could be up to 8 and still avoid idle states. In F_{p^2}, the authors perform multiplication and subtraction using additional hardware in parallel to avoid cascaded operations, thus saving 4 cycles at the expense of increased hardware. A similar technique has been used for F_{p^12} as well. The authors have implemented the design on Altera Cyclone II, Stratix II and Stratix III devices. The results are as follows:

BN126, Cyclone II EP2C35: 91 MHz, 14,274 LCs, 1.94 ms
BN126, Stratix II EP2S30: 154 MHz, 4227 ALMs, 1.14 ms
BN126, Stratix III EP3S50: 165 MHz, 4233 ALMs, 1.07 ms
Duquesne [69] has considered the application of RNS for fast pairing computation using BN curves and MNT curves, considering Tate, Ate and R-Ate pairings employing lazy reduction and efficient arithmetic in extension fields. Duquesne has presented a very detailed description of every computation needed for all three pairings.
We consider Tate pairing first. Jacobian coordinates have been considered for P and affine coordinates for Q in this approach. The algorithm for Tate pairing for MNT curves is presented in Figure 10.26, where P = (xP, yP) ∈ E(Fp)[ℓ] and Q = (xQ, yQβ) with xQ, yQ ∈ F_{p^3}. It is assumed that F_{p^6} is built as a quadratic extension of F_{p^3}: F_{p^6} = F_{p^3}[Y]/(Y^2 − υ) = F_{p^3}[β]. Multiplication and squaring are considered to be of the same complexity. Lines 1–4 of the algorithm in Figure 10.26 perform doubling of the point T ∈ E(Fp) in Jacobian coordinates and need 10 multiplications in Fp and 8 modular reductions, considering lazy reduction. Lines 7–10 use mixed addition of T and P and require 11 multiplications and 10 modular reductions in Fp. Since xQ and yQ are in F_{p^3}, line 5 requires 9 multiplications and 8 modular reductions in Fp. Line 11 needs 7 multiplications and 6 reductions in Fp. In line 6, a multiplication and a squaring in F_{p^6} are needed, which need 30 multiplications and 12 modular reductions in Fp. For line 12, we need 18 multiplications and 6 reductions. Line 13 obtains the exponentiation by p^3 for free by conjugation, whereas 36 multiplications, 16 reductions and one inversion are required in F_{p^6}. This step needs in total 54 multiplications, 22 modular reductions and one inversion in Fp. In line 14, a multiplication in F_{p^6} and a Frobenius computation are needed. The
Figure 10.26 Algorithm for Tate pairing for MNT curves (adapted from [69])
latter needs 5 modular multiplications in Fp and overall, the second step of final
exponentiation needs 23 multiplications and 11 reductions in Fp.
The hard part involves one Frobenius (five modular multiplications), one multiplication in F_{p^6} and one exponentiation by 2l. For each step of the exponentiation, 12 multiplications and 6 reductions, or 18 multiplications and 6 reductions, are required, depending on whether a multiplication is needed or not. For 96-bit security, l has a bit length of 192, implying that lines 1–6 are done 191 times and lines 7–12 around 96 times. In total, the Miller loop needs 191 × (10 + 9 + 30) + 96 × (11 + 7 + 18) = 12,815 multiplications and 191 × (8 + 8 + 12) + 96 × (10 + 6 + 6) = 7460 reductions. The easy part of the final exponentiation needs 1 inversion, 77 multiplications and 33 reductions in Fp. Considering that 2l is 96 bits long, the hard part can perform exponentiation using a sliding window of 3 for computing f^{2l}. This needs 96 squarings in F_{p^6} and 24 multiplications in F_{p^6}.
Figure 10.27 Algorithm for Tate pairing for BN curves (adapted from [69])
Line 13 computes f^{p^6}/f, where the computation of f^{p^6} is free by conjugation. Hence, one multiplication and one inversion are needed in F_{p^12}. This inversion needs one inversion, 97 multiplications and 35 reductions in Fp. The first step of the exponentiation thus requires 151 multiplications, 47 modular reductions and one inversion in Fp. Line 14 involves one multiplication in F_{p^12} and one powering to p^2. The Frobenius map and its iterations need 11 modular multiplications in Fp. This step thus needs 65 multiplications and 23 reductions in Fp.
The hard part given in line 15 involves one Frobenius (11 modular multiplications), one multiplication in F_{p^12} (54 multiplications and 12 reductions) and one exponentiation. Since for BN curves ℓ can be chosen to be sparse, a classical square-and-multiply can be used. Since in line 13 f has been raised to the power (p^6 − 1), it
is a unit and can be squared with only 2 squarings and 2 reductions in F_{p^2} per coefficient (i.e. 24 multiplications and 12 reductions in Fp). Thus, the cost is only 24 multiplications and 12 reductions for most steps. For the steps corresponding to the non-zero bits of the exponent, 54 additional multiplications and 12 additional reductions are necessary.
In line 16, four applications of the Frobenius map, 9 multiplications and 6 squarings in F_{p^12} (i.e. 674 multiplications and 224 reductions in Fp) are needed. It also needs an exponentiation which is similar to that of line 15 but twice as large. Considering a Hamming weight of 11 for the curve parameter and of 90 for ℓ, we observe that steps 1–6 are done 255 times and lines 7–12 are done 89 times for a 128-bit security level. Thus, the Miller loop needs 255 × (7 + 8 + 75) + 89 × (11 + 6 + 39) = 27,934 multiplications and 255 × (6 + 8 + 24) + 89 × (10 + 5 + 12) = 12,093 reductions. The easy part of the final exponentiation requires one inversion, 216 multiplications and 70 reductions in Fp. The hard part involves exponentiation by 6l − 5, which has a Hamming weight of 11, and by 6l^2 + 1, which has a Hamming weight of 28. The second exponentiation can be split into two parts, l and 6l [88], both having a Hamming weight of 11; this leads to 21 multiplications. Lines 15 and 16 require 11 + 54 + 65 × 24 + 9 × 54 + 674 + 127 × 24 + 21 × 54 = 6967 multiplications and 11 + 12 + 65 × 12 + 9 × 12 + 224 + 127 × 12 + 21 × 12 = 2911 reductions. Thus, the full Tate pairing needs 35,117 multiplications but only 15,074 reductions. For a radix implementation using 8 32-bit words, we need 35,117 × 8^2 + 15,074 × (8^2 + 8) = 3,332,816 word multiplications, whereas RNS needs 1.1 × (35,117 × (2 × 8) + 15,074 × (7 × 8^2 + 8 × 8)/5) = 2,315,994 word multiplications. This is a gain of 30.5 %.
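The word-multiplication totals can be reproduced from the per-operation cost models used in this chapter (assumed here: n^2 per multiplication and n^2 + n per reduction in radix form; 2n per multiplication and (7n^2 + 8n)/5 per reduction, with a 1.1 overhead factor, in RNS):

```python
n = 8                      # 8 words of 32 bits
n_mul, n_red = 35_117, 15_074   # operation counts for the full Tate pairing

radix = n_mul * n**2 + n_red * (n**2 + n)
rns = int(1.1 * (n_mul * 2 * n + n_red * (7 * n**2 + 8 * n) / 5))

assert radix == 3_332_816
assert rns == 2_315_994
assert round(100 * (1 - rns / radix), 1) == 30.5
```

The gain comes almost entirely from the cheaper RNS multiplications; the RNS reductions remain the expensive step.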
In the case of Ate pairing, lines 1–4 are done in F_{p^2}, requiring 3 multiplications, 4 squarings and 6 reductions in F_{p^2}, i.e. 17 multiplications and 12 reductions in Fp. Similarly, lines 7–10 require 8 multiplications, 3 squarings and 10 reductions in F_{p^2}, or 30 multiplications and 20 reductions in Fp. If the coordinates of T are (XT γ^2, YT γ^3, ZT), lines 5 and 11 must be replaced by

5′: g = Z_{2T} Z_T^2 yP − A Z_T^2 xP γ + (A XT − 2YT^2) γ^3          (10.57)
11′: g = Z_{T+P} yP − F xP γ + (F xQ − Z_{T+P} yQ) γ^3
Overall, 2,208,328 word multiplications are needed in radix representation, whereas in RNS only 1,558,065 word multiplications are needed. The gain is thus 29.5 %.
In the case of R-Ate pairing, while the Miller loop is the same, an additional step is necessary at the end: the computation of f ← (f · g(T,Q)(P))^p · g(π(T+Q),T)(P), where T = (6l + 2)Q is computed in the Miller loop and π is the Frobenius map on the curve. The following operations are needed in this computation. One step of addition as in the Miller loop (computation of T + Q and g(T+Q)(P)) needs 40 multiplications and 26 reductions in Fp. As p ≡ 1 mod 6 for BN curves, one application of the Frobenius map is needed, which requires 2 multiplications in F_{p^2} by pre-computed values. Next, one non-mixed addition step (computation of g(π(T+Q),T)(P)) needs 60 multiplications and 40 reductions in Fp. Two multiplications of the results of the two previous steps require 39 multiplications and 12 reductions in Fp. Next, a Frobenius needs 11 modular multiplications and, finally, one full multiplication in F_{p^12} requires 54 multiplications and 12 reductions in Fp.
Thus, in total this step requires 249 multiplications and 117 reductions in Fp.
Considering that 6l + 2 has 66 bits and a Hamming weight of 9, the cost of the Miller loop is 65 × (17 + 15 + 36 + 39) + 8 × (30 + 10 + 39) = 7587 multiplications and 65 × (12 + 12 + 24) + 8 × (20 + 6 + 12) = 3424 reductions. The final exponentiation is the same as for the Tate pairing. Hence, for the complete R-Ate pairing, we need 15,019 multiplications and 6405 reductions. This means that 1,422,376 word multiplications in radix representation and 985,794 in the case of RNS will be required, thus saving 30.7 %.
Kammler et al. [100] have described an ASIP (application-specific instruction set processor) for BN curves. They consider the NIST-recommended prime group order of 256 bits for E(Fp) and 3072 bits for the finite field F_{p^k} (256 × 12 = 3072, since k = 12). This ASIP is programmable for all pairings. They keep the points in Jacobian coordinates throughout the pairing computation, and thus field inversion can be avoided almost entirely. Inversion is accomplished by exponentiation with (p − 2). All the values are kept in Montgomery form throughout the pairing computation.
The authors have used the scalable Montgomery modular multiplier architecture (see Figure 10.28a) due to Nibouche et al. [101], which can be segmented and pipelined. In this technique, for computing A·B·R^{−1} mod M, the algorithm is split into two multiplication operations that can be performed in parallel. It uses carry-save number representation. The actual multiplication is carried out in the left half (see Figure 10.28a) and the reduction is carried out in the right half simultaneously. The left half is a conventional multiplier built up of gated full-adders and the right half is a multiplier with special cells for the LSBs; these LSB cells are built around half-adders. Due to area constraints, subsets of the regular structure of the multiplier have been used and the computation is performed in multiple cycles. They have used multi-cycle multipliers for W × H (W is the word length and H is the number of words) of three different sizes, 32 × 8, 64 × 8 and 128 × 8 bits, so that, for example, a 256-bit multiplication is spread over several cycles.
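The product A·B·R^{−1} mod M that this hardware computes can be stated in a few lines of textbook Montgomery reduction (a functional sketch only; it says nothing about the carry-save hardware organization, and the modulus is an illustrative choice):

```python
def montgomery_product(a, b, m, r_bits):
    """Return a*b*R^-1 mod m for R = 2^r_bits and odd m (textbook REDC)."""
    R = 1 << r_bits
    m_neg_inv = (-pow(m, -1, R)) % R              # -m^-1 mod R
    t = a * b
    u = (t + ((t * m_neg_inv) & (R - 1)) * m) >> r_bits
    return u - m if u >= m else u

m, r_bits = (1 << 255) - 19, 256                  # illustrative odd modulus
R = 1 << r_bits
a, b = 2**200 + 7, 2**190 + 3
assert montgomery_product(a, b, m, r_bits) == a * b * pow(R, -1, m) % m
```

The two big products visible here (t = a·b and the quotient-times-m product) are exactly the two halves the Nibouche architecture computes in parallel.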
Figure 10.28 (a) Montgomery multiplier based on Nibouche et al. technique and (b) multi-cycle
Montgomery Multiplier (MMM) (adapted from [100] ©2009)
Table 10.8 Number of operations needed in various pairing computations (adapted from [100] ©IACR 2009)
Number of Opt Ate Ate η Tate Comp. η Comp. tate
Multiplications 17,913 25,870 32,155 39,764 75,568 94,693
Additions 84,956 121,168 142,772 174,974 155,234 193,496
Inversions 3 2 2 2 0 0
The Ate pairing needed 15.8 ms at a frequency of 338 MHz. The numbers of operations needed for the different pairings are presented in Table 10.8 in order to illustrate the complexity of a pairing processor.
Barenghi et al. [102] have described an FPGA co-processor for Tate pairing over Fp which uses the BKLS algorithm [62], followed by Lucas laddering [103] for the final exponentiation by (p^k − 1)/r:

fP(DQ)^{(p^2−1)/r} = ((c + id)^{p−1})^m = ((c − id)^2/(c^2 + d^2))^m = (a + ib)^m

where m = (p + 1)/r.
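The first equality rests on the Frobenius action (c + id)^p = c − id in F_{p^2}, which holds whenever p ≡ 3 (mod 4) so that x^2 + 1 is irreducible; a numerical check with an illustrative prime:

```python
p = 2**31 - 1                      # p ≡ 3 (mod 4), so x^2 + 1 is irreducible

def fp2_mul(x, y):
    # (x0 + x1 i)(y0 + y1 i) with i^2 = -1
    x0, x1 = x
    y0, y1 = y
    return ((x0*y0 - x1*y1) % p, (x0*y1 + x1*y0) % p)

def fp2_pow(x, e):
    # square-and-multiply in F_{p^2}
    result = (1, 0)
    while e:
        if e & 1:
            result = fp2_mul(result, x)
        x = fp2_mul(x, x)
        e >>= 1
    return result

c, d = 1234567, 7654321
assert fp2_pow((c, d), p) == (c, (-d) % p)   # (c + id)^p = c - id
```

Raising to the power p − 1 therefore produces a norm-1 element, which is what makes the Lucas laddering applicable.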
References
1. W. Stallings, Cryptography and Network Security, Principles and Practices, 6th edn. (Pear-
son, Upper Saddle River, 2013)
2. B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C (Wiley,
New York, 1996)
3. P. Barrett, Implementing the Rivest-Shamir-Adleman Public Key algorithm on a standard
Digital Signal Processor, in Proceedings of Annual Cryptology Conference on Advances in
Cryptology, (CRYPTO‘86), pp. 311–323 (1986)
4. A. Menezes, P. van Oorschot, S. Vanstone, Handbook of Applied Cryptography (CRC, Boca
Raton, 1996)
5. J.-F. Dhem, Modified version of the Barrett Algorithm, Technical report (1994)
6. M. Knezevic, F. Vercauteren, I. Verbauwhede, Faster interleaved modular multiplication
based on Barrett and Montgomery reduction methods. IEEE Trans. Comput. 59, 1715–1721
(2010)
References 343
7. J.-J. Quisquater, Encoding system according to the so-called RSA method by means of a
microcontroller and arrangement implementing the system, US Patent #5,166,978, 24 Nov
1992
8. C.D. Walter, Fast modular multiplication by operand scanning, Advances in Cryptology,
LNCS, vol. 576 (Springer, 1991), pp. 313–323
9. E.F. Brickell, A fast modular multiplication algorithm with application to two key cryptog-
raphy, Advances in Cryptology Proceedings of Crypto 82 (Plenum Press, New York, 1982),
pp. 51–60
10. C.K. Koc, RSA hardware implementation, TR 801, RSA Laboratories (April 1996)
11. C.K. Koc, T. Acar, B.S. Kaliski Jr., Analyzing and comparing Montgomery Multiplication
Algorithms, in IEEE Micro, pp. 26–33 (1996)
12. M. McLoone, C. McIvor, J.V. McCanny, Coarsely integrated Operand Scanning (CIOS)
architecture for high-speed Montgomery modular multiplication, in IEEE International
Conference on Field Programmable Technology (ICFPT), pp. 185–192 (2004)
13. M. McLoone, C. McIvor, J.V. McCanny, Montgomery modular multiplication architecture
for public key cryptosystems, in IEEE Workshop on Signal Processing Systems (SIPS),
pp. 349–354 (2004)
14. C.D. Walter, Montgomery exponentiation needs no final subtractions. Electron. Lett. 35,
1831–1832 (1999)
15. H. Orup, Simplifying quotient determination in high-radix modular multiplication, in Pro-
ceedings of IEEE Symposium on Computer Arithmetic, pp. 193–199 (1995)
16. C. McIvor, M. McLoone, J.V. McCanny, Modified Montgomery modular multiplication and
RSA exponentiation techniques, in Proceedings of IEE Computers and Digital Techniques,
vol. 151, pp. 402–408 (2004)
17. N. Nedjah, L.M. Mourelle, Three hardware architectures for the binary modular exponenti-
ation: sequential, parallel and systolic. IEEE Trans. Circuits Syst. I 53, 627–633 (2006)
18. M.D. Shieh, J.H. Chen, W.C. Lin, H.H. Wu, A new algorithm for high-speed modular
multiplication design. IEEE Trans. Circuits Syst. I 56, 2009–2019 (2009)
19. C.C. Yang, T.S. Chang, C.W. Jen, A new RSA cryptosystem hardware design based on
Montgomery’s algorithm. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 45,
908–913 (1998)
20. A. Tenca, C. Koc, A scalable architecture for modular multiplication based on Montgomery’s
algorithm. IEEE Trans. Comput. 52, 1215–1221 (2003)
21. D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, S. Hsu, An improved unified scalable
radix-2 Montgomery multiplier, in IEEE Symposium on Computer Arithmetic, pp. 172–175
(2005)
22. K. Kelly, D. Harris, Very high radix scalable Montgomery multipliers, in Proceedings of
International Workshop on System-on-Chip for Real-Time Applications, pp. 400–404 (2005)
23. N. Jiang, D. Harris, Parallelized Radix-2 scalable Montgomery multiplier, in Proceedings of
IFIP International Conference on Very Large-Scale Integration (VLSI-SoC 2007),
pp. 146–150 (2007)
24. N. Pinckney, D. Harris, Parallelized radix-4 scalable Montgomery multipliers. J. Integr.
Circuits Syst. 3, 39–45 (2008)
25. K. Kelly, D. Harris, Parallelized very high radix scalable Montgomery multipliers, in Pro-
ceedings of Asilomar Conference on Signals, Systems and Computers, pp. 1196–1200 (2005)
26. M. Huang, K. Gaj, T. El-Ghazawi, New hardware architectures for Montgomery modular
multiplication algorithm. IEEE Trans. Comput. 60, 923–936 (2011)
27. M.D. Shieh, W.C. Lin, Word-based Montgomery modular multiplication algorithm for
low-latency scalable architectures. IEEE Trans. Comput. 59, 1145–1151 (2010)
28. A. Miyamoto, N. Homma, T. Aoki, A. Satoh, Systematic design of RSA processors based on
high-radix Montgomery multipliers. IEEE Trans. VLSI Syst. 19, 1136–1146 (2011)
29. K.C. Posch, R. Posch, Modulo reduction in residue Number Systems. IEEE Trans. Parallel
Distrib. Syst. 6, 449–454 (1995)
344 10 RNS in Cryptography
30. J.C. Bajard, L.S. Didier, P. Kornerup, An RNS Montgomery modular multiplication algo-
rithm. IEEE Trans. Comput. 47, 766–776 (1998)
31. J.C. Bajard, L. Imbert, A full RNS implementation of RSA. IEEE Trans. Comput. 53,
769–774 (2004)
32. A.P. Shenoy, R. Kumaresan, Fast base extension using a redundant modulus in RNS. IEEE
Trans. Comput. 38, 293–297 (1989)
33. H. Nozaki, M. Motoyama, A. Shimbo, S. Kawamura, Implementation of RSA Algorithm
Based on RNS Montgomery Multiplication, in Cryptographic Hardware and Embedded
Systems—CHES, ed. by C. Paar (Springer, Berlin, 2001), pp. 364–376
34. S. Kawamura, M. Koike, F. Sano, A. Shimbo, Cox-Rower architecture for fast parallel
Montgomery multiplication, in Proceedings of International Conference on Theory and
Application of Cryptographic Techniques: Advances in Cryptology, (EUROCRYPT 2000),
pp. 523–538 (2000)
35. F. Gandino, F. Lamberti, G. Paravati, J.C. Bajard, P. Montuschi, An algorithmic and
architectural study on Montgomery exponentiation in RNS. IEEE Trans. Comput. 61,
1071–1083 (2012)
36. D. Schinianakis, T. Stouraitis, A RNS Montgomery multiplication architecture, in Proceed-
ings of ISCAS, pp. 1167–1170 (2011)
37. Y.T. Jie, D.J. Bin, Y.X. Hui, Z.Q. Jin, An improved RNS Montgomery modular multiplier, in
Proceedings of the International Conference on Computer Application and System Modeling
(ICCASM 2010), pp. V10-144–147 (2010)
38. D. Schinianakis, T. Stouraitis, Multifunction residue architectures for cryptography. IEEE
Trans. Circuits Syst. 61, 1156–1169 (2014)
39. H.M. Yassine, W.R. Moore, Improved mixed radix conversion for residue number system
architectures, in Proceedings of IEE Part G, vol. 138, pp. 120–124 (1991)
40. M. Ciet, M. Neve, E. Peeters, J.J. Quisquater, Parallel FPGA implementation of RSA with
residue number systems—can side-channel threats be avoided?, in 46th IEEE International
MW Symposium on Circuits and Systems, vol. 2, pp. 806–810 (2003)
41. J.-J. Quisquater, C. Couvreur, Fast decipherment algorithm for RSA public key cryptosystem.
Electron. Lett. 18, 905–907 (1982)
42. R. Szerwinski, T. Guneysu, Exploiting the power of GPUs for Asymmetric Cryptography.
Lect. Notes Comput. Sci. 5154, 79–99 (2008)
43. B.S. Kaliski Jr., The Montgomery inverse and its applications. IEEE Trans. Comput. 44,
1064–1065 (1995)
44. E. Savas, C.K. Koc, The Montgomery modular inverse—revisited. IEEE Trans. Comput. 49,
763–766 (2000)
45. A.A.A. Gutub, A.F. Tenca, C.K. Koc, Scalable VLSI architecture for GF(p) Montgomery
modular inverse computation, in IEEE Computer Society Annual Symposium on VLSI,
pp. 53–58 (2002)
46. E. Savas, A carry-free architecture for Montgomery inversion. IEEE Trans. Comput. 54,
1508–1518 (2005)
47. J. Bucek, R. Lorencz, Comparing subtraction free and traditional AMI, in Proceedings of
IEEE Design and Diagnostics of Electronic Circuits and Systems, pp. 95–97 (2006)
48. D.M. Schinianakis, A.P. Kakarountas, T. Stouraitis, A new approach to elliptic curve
cryptography: an RNS architecture, in IEEE MELECON, Benalmádena (Málaga), Spain,
pp. 1241–1245, 16–19 May 2006
49. D.M. Schinianakis, A.P. Fournaris, H.E. Michail, A.P. Kakarountas, T. Stouraitis, An RNS
implementation of an Fp elliptic curve point multiplier. IEEE Trans. Circuits Syst. I Reg. Pap.
56, 1202–1213 (2009)
50. M. Esmaeildoust, D. Schinianakis, H. Javashi, T. Stouraitis, K. Navi, Efficient RNS imple-
mentation of elliptic curve point multiplication over GF(p). IEEE Trans. VLSI Syst. 21,
1545–1549 (2013)
51. P.V. Ananda Mohan, RNS to binary converter for a new three moduli set {2^(n+1) − 1, 2^n, 2^n − 1}.
IEEE Trans. Circuits Syst. II 54, 775–779 (2007)
52. M. Esmaeildoust, K. Navi, M. Taheri, A.S. Molahosseini, S. Khodambashi, Efficient RNS to
binary converters for the new 4-moduli set {2^n, 2^(n+1) − 1, 2^n − 1, 2^(n−1) − 1}. IEICE Electron. Exp.
9(1), 1–7 (2012)
53. J.C. Bajard, S. Duquesne, M. Ercegovac, Combining leak resistant arithmetic for elliptic
curves defined over Fp and RNS representation, Cryptology ePrint Archive, Report 2010/311
54. M. Joye, J.J. Quisquater, Hessian elliptic curves and side channel attacks. CHES, LNCS
2162, 402–410 (2001)
55. P.Y. Liardet, N. Smart, Preventing SPA/DPA in ECC systems using Jacobi form. CHES,
LNCS 2162, 391–401 (2001)
56. E. Brier, M. Joye, Weierstrass elliptic curves and side channel attacks. Public Key Cryptog-
raphy, LNCS 2274, 335–345 (2002)
57. P.L. Montgomery, Speeding the Pollard and elliptic curve methods of factorization. Math.
Comput. 48, 243–264 (1987)
58. A. Joux, A one round protocol for tri-partite Diffie-Hellman, Algorithmic Number Theory,
LNCS, pp. 385–394 (2000)
59. D. Boneh, M.K. Franklin, Identity based encryption from the Weil Pairing, in Crypto 2001,
LNCS, vol. 2139, pp. 213–229 (2001)
60. D. Boneh, B. Lynn, H. Shacham, Short signatures from the Weil pairing. J. Cryptol. 17, 297–319
(2004)
61. J. Groth, A. Sahai, Efficient non-interactive proof systems for bilinear groups, in 27th Annual
International Conference on Advances in Cryptology, Eurocrypt 2008, pp. 415–432 (2008)
62. V.S. Miller, The Weil pairing and its efficient calculation. J. Cryptol. 17, 235–261 (2004)
63. P.S.L.M. Barreto, H.Y. Kim, B. Lynn, M. Scott, Efficient algorithms for pairing-based
cryptosystems, in Crypto 2002, LNCS 2442, pp. 354–369 (Springer, Berlin, 2002)
64. F. Hess, N.P. Smart, F. Vercauteren, The eta pairing revisited. IEEE Trans. Inf. Theory 52,
4595–4602 (2006)
65. E. Lee, H.S. Lee, C.M. Park, Efficient and generalized pairing computation on abelian
varieties, Cryptology ePrint Archive, Report 2008/040 (2008)
66. F. Vercauteren, Optimal pairings. IEEE Trans. Inf. Theory 56, 455–461 (2010)
67. S. Duquesne, N. Guillermin, A FPGA pairing implementation using the residue number
system, Cryptology ePrint Archive, Report 2011/176 (2011), http://eprint.iacr.org/
68. S. Duquesne, RNS arithmetic in Fpk and application to fast pairing computation, Cryptology
ePrint Archive, Report 2010/55 (2010), http://eprint.iacr.org
69. P. Barreto, M. Naehrig, Pairing friendly elliptic curves of prime order, SAC, 2005. LNCS
3897, 319–331 (2005)
70. A. Miyaji, M. Nakabayashi, S. Takano, New explicit conditions of elliptic curve traces for
FR-reduction. IEICE Trans. Fundam. 84, 1234–1243 (2001)
71. B. Lynn, On the implementation of pairing-based cryptography, Ph.D. thesis; PBC Library,
https://crypto.stanford.edu/~blynn/
72. C. Costello, Pairings for Beginners, www.craigcostello.com.au/pairings/PairingsForBeginners.pdf
73. J.C. Bajard, M. Kaihara, T. Plantard, Selected RNS bases for modular multiplication, in 19th
IEEE International Symposium on Computer Arithmetic, pp. 25–32 (2009)
74. A. Karatsuba, The complexity of computations, in Proceedings of the Steklov Institute of
Mathematics, vol. 211, pp. 169–183 (1995)
75. P.L. Montgomery, Five-, six- and seven-term Karatsuba-like formulae. IEEE Trans. Comput.
54, 362–369 (2005)
76. J. Fan, F. Vercauteren, I. Verbauwhede, Efficient hardware implementation of Fp-arithmetic
for pairing-friendly curves. IEEE Trans. Comput. 61, 676–685 (2012)
77. J. Fan, F. Vercauteren, I. Verbauwhede, Faster Fp-Arithmetic for cryptographic pairings on
Barreto Naehrig curves, in CHES, vol. 5747, LNCS, pp. 240–253 (2009)
99. N. Guillermin, A high speed coprocessor for elliptic curve scalar multiplications over Fp,
CHES, LNCS (2010)
100. D. Kammler, D. Zhang, P. Schwabe, H. Scharwaechter, M. Langenberg, D. Auras,
G. Ascheid, R. Leupers, R. Mathar, H. Meyr, Designing an ASIP for cryptographic pairings
over Barreto-Naehrig curves, in CHES 2009, LCNS 5747 (Springer, Berlin, 2009),
pp. 254–271
101. D. Nibouche, A. Bouridane, M. Nibouche, Architectures for Montgomery’s multiplication, in
Proceedings of IEE Computers and Digital Techniques, vol. 150, pp. 361–368 (2003)
102. A. Barenghi, G. Bertoni, L. Breveglieri, G. Pelosi, A FPGA coprocessor for the cryptographic
Tate pairing over Fp, in Proceedings of Fifth International Conference on Information
Technology: New Generations, ITNG 2008, pp. 112–119 (April 2008)
103. M. Scott, P.S.L.M. Barreto, Compressed pairings, in CRYPTO, Lecture Notes in Computer
Science, vol. 3152, pp. 140–156 (2004)
Further Reading
E. Savas, M. Nasser, A.A.A. Gutub, C.K. Koc, Efficient unified Montgomery inversion with
multibit shifting, in Proceedings of IEE Computers and Digital Techniques, vol. 152, pp. 489–498
(2005)
A.F. Tenca, G. Todorov, C.K. Koc, High radix design of a scalable modular multiplier, in
Proceedings of Third International Workshop on Cryptographic Hardware and Embedded
Systems, CHES, pp. 185–201 (2001)
Index
F
Fault tolerance in RNS, 163–165, 167–173
FIR filters using RNS, 6, 172, 195–220, 223, 226, 228, 235
Five moduli sets, 105, 107, 123, 204
Fixed multifunction architectures (FMA), 69, 70
Floating-point arithmetic, 3
Forward conversion
  multiple moduli sharing hardware, 32–34
  using modular exponentiation, 30–31
Four moduli sets, 35, 50, 82, 93, 99, 101, 104, 105, 107, 117, 171, 206, 217, 223, 240, 301
Frequency synthesis using RNS, 6, 245
Frobenius computation, 325, 326, 335

G
GPUs for RNS, 295

H
Hard multiple computation, 50, 61, 68

I
IEEE 754 standard, 2
IIR filters using RNS, 6, 203, 206
Inversion in Fp^k

K
Karatsuba algorithm, 309, 314, 315, 318, 319, 321–324, 332

L
Lazy addition
Logarithmic Residue Number systems, 6, 189–191
Low–high lemma, 51, 52

M
Magnitude comparison
  using MRC technique, 153
  using new CRTs, 154–156
Mixed Radix Conversion, 4, 6, 81, 90–95, 102, 127, 153, 157, 159
Mixed Radix Number system, 1
Moduli of the form r^n, 179–184
Modulo addition
  mod (2^n + 1) addition, 17, 20, 21
  mod (2^n − 1) addition, 14
Modulo multiplication
  for IDEA algorithm, 51, 52
  using Barrett's algorithm, 265–267, 282
  using combinational logic, 41, 45, 195
  using index calculus, 39, 40, 197, 198, 223
  using diminished-1 representation, 39, 51, 56, 58–60, 64, 67, 76
Modulo squaring, 6, 39–44, 46, 48, 50–55, 57, 58, 60–64, 67–70, 72, 75, 76, 264
Modulo subtraction, 4, 43, 90–92, 102, 178, 244, 301
Modulus Replication Residue Number systems (MRRNS), 186–189
Montgomery inverse, 295–297
Montgomery modular multiplication
  CIHS, 268, 269
  CIOS, 268
  FIPS, 268, 269
  scalable, 277
  SOS, 268
  using Kawamura et al. technique, 292, 295
  using RNS Bajard et al. technique, 289
  word based, 275
Montgomery polynomial, 310
MQRNS system, 179, 240
Multi-modulus squarers, 6, 64, 66, 67
Multiple error correction, 170
Multiplication technique
  for quartic polynomials, 317
  for quintic polynomials, 311
  for sextic polynomials, 310

N
New Chinese Remainder Theorems
  New CRT-I, 104
  New CRT-II, 81, 95–97
  New CRT-III, 95–97

O
OFDM system using RNS, 249
One-hot coding, 5
Optimal Ate pairing, 328, 329, 331–333, 341

P
Pairing implementation using RNS, 264, 309, 327–342
Pairing processors using RNS, 306–342
Parity detection, 141, 142, 154
Polynomial Residue Number system, 6, 184–186
Powers of two related moduli sets, 6, 28–30, 39, 44, 92, 99, 103, 199