
A 1700-year-old Chinese arithmetic has been revived.

The residue number system has no overhead and can support very high speed computation.

Residue Arithmetic

A Tutorial with Examples


Fred J. Taylor, University of Florida

The ancient study of the residue numbering system, or RNS, begins with a verse from a third-century book, Suan-ching, by Sun Tzu:*

    We have things of which we do not know the number,
    If we count them by threes, the remainder is 2,
    If we count them by fives, the remainder is 3,
    If we count them by sevens, the remainder is 2,
    How many things are there?
    The answer, 23.

How to get the answer 23 is outlined in Sun Tzu's historic work. He presents a formula for manipulating remainders of an integer after division by 3, 5, and 7. We commemorate this contribution today by referring to one of the common rules of converting remainders, or residues, into integers as the Chinese Remainder Theorem, or CRT. This theorem, as well as the theory of residue numbers, was set forth in the 19th century by Carl Friedrich Gauss in his celebrated Disquisitiones Arithmeticae.†

This 1700-year-old number system has been attracting a great deal of attention recently. Digital systems structured into residue arithmetic units may play an important role in ultra-speed, dedicated, real-time systems that support pure parallel processing of integer-valued data. It is a "carry-free" system that performs addition, subtraction, and multiplication as concurrent (parallel) operations, side-stepping one of the principal arithmetic delays: managing carry information.

The first attempt to use some of the unique RNS properties was made by D. H. Lehmer who, in 1932, built a special-purpose machine he called the "photo-electric sieve." This electro-mechanical device factored Mersenne numbers. Then in the mid-50's, the Czech researchers Svoboda and Valach conducted experiments on a hard-wired, small moduli RNS machine, which they used to study error codes.1 The same idea apparently occurred to Aiken and Garner.2 From the late 50's to mid-60's, the Department of Defense supported the RNS research by Szabo and Tanaka at Lockheed. They worked on a special-purpose digital correlator while a team from RCA looked into designing a general-purpose machine. Experimentally, these early efforts met with little success because winding the custom core memory required specialized residue mappings. The most tangible result of these early efforts was a comprehensive text written by Szabo and Tanaka, which survived only one printing.3 The reason for this underwhelming response was that the technology of the 60's was insufficient to support the unique demands of the RNS.

Since the mid-70's, however, technology and theory have been slowly converging. In the West, over 100 major papers have been published on RNS since Szabo and Tanaka published their book. In addition, scholars in the Soviet Union are actively investigating residue arithmetic. They have produced over 50 papers, patents, and books in this field during the last 15 years. Their work, which is basically unknown to the Western RNS community, was discovered and reported by Miller et al. of Boeing.4

The microelectronic revolution brought low-cost, high-performance RAM and ROM to replace the expensive and slow core that RNS once used. They provide the ideal technology for residue arithmetic as a table-lookup operation. Today, RNSs are being built in MOS, TTL, and ECL, and other researchers, such as Huang, Tai, Polky and a group at Boeing, are exploring novel electro-optical technologies for use in the RNS,5 but optical implementation is, at best, a wave of the future. Today the semiconductor rules.

*L. Dickson in the History of the Theory of Numbers attributes the origin of the RNS to Sun Tsu (not Tzu) in the first century AD, but most scholars accept the Sun Tzu origin.
†Translated by Arthur A. Clarke, Yale University Press, New Haven, 1966.
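Sun Tzu's verse can be checked mechanically. The following is a minimal sketch, not the historical method: it simply searches one full period of the residues (3 * 5 * 7 = 105 values) for a number with the stated remainders. The function name `sun_tzu_count` is a hypothetical helper introduced only for this illustration.

```python
def sun_tzu_count(remainders, moduli):
    """Return the smallest non-negative integer with the given remainders.

    Brute-force search: the residue pattern repeats with period
    equal to the product of the moduli, so one period is enough.
    """
    period = 1
    for p in moduli:
        period *= p
    for x in range(period):
        if all(x % p == r for p, r in zip(moduli, remainders)):
            return x
    return None  # no solution (cannot happen for pairwise coprime moduli)

# Count by threes, fives, and sevens with remainders 2, 3, 2.
print(sun_tzu_count(remainders=(2, 3, 2), moduli=(3, 5, 7)))  # 23
```

The search is only practical for small moduli; the conversion rules discussed in this article replace it with direct computation.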

50    0018-9162/84/0500-0050$01.00 © 1984 IEEE    COMPUTER


The RNS is now finding its way into many application areas involving digital filters and transforms. Huang at Lockheed has built and tested a two-dimensional, RNS matched filter capable of 20M operations per second.6 Smith at Martin-Marietta has developed a high-speed FFT in the RNS.7 A group at Mitre is developing error-correction machines using the RNS.8 Jullien has reported an RNS comb filter,9 while Taylor has published on RNS systems with VLSI hardware.10

This article develops some of the fundamental properties of this branch of mathematics and presents the state of the RNS art and some potential applications.

Background

An integer X, which has a fixed-radix, weighted-number representation, with respect to a radix r, is given by

$$X = \sum_{i=0}^{n-1} a_i r^i; \quad a_i \in Z_r \qquad (1)$$

The number of integer values of X that possess an n-digit, fixed-radix representation is $r^n$, over the range $[0, r^n - 1]$. For the case where r = 2, the familiar binary numbering systems result. Some of the well-known attributes of a fixed-radix system are

(1) algebraic comparison,
(2) dynamic range extension (i.e., add more digits),
(3) multiplication (division) by simple arithmetic shifts, and
(4) simplified overflow and sign-detection.

The disadvantage of the fixed-radix, weighted-number system is that carry information must be passed from digits of lesser significance to those of greater significance. As a result, there is a slowdown of arithmetic related to the carry-management system used (e.g., ripple carry, look-ahead carry). Carrying can be accelerated, but only at the expense of additional hardware. The time required to compute an n-ary function with gates restricted to p inputs is at least $\log_p n$.

The RNS is, however, a carry-free system and is potentially very fast even though the advantages of the fixed-radix system do not carry over. Algebraic comparison is difficult, as is overflow and sign detection; and division is awkward.

The RNS is defined in terms of a set of relatively prime moduli. If P denotes the moduli set, then

$$P = \{p_1, p_2, \ldots, p_L\}, \quad \mathrm{GCD}(p_i, p_j) = 1 \ \text{for} \ i \ne j \qquad (2)$$

Any integer in the residue class $Z_M$, where

$$M = p_1 p_2 \cdots p_L \qquad (3)$$

has a unique L-tuple representation given by

$$X \overset{\mathrm{RNS}}{\leftrightarrow} (X_1, X_2, \ldots, X_L) \qquad (4)$$

where $X_i = X \bmod p_i$ and is called the ith residue of X. For a signed number system, any integer in $[-M/2, M/2)$ has an RNS L-tuple representation where $X_i = X \bmod p_i$ if $X \ge 0$, and $(M - |X|) \bmod p_i$ otherwise. The signed RNS system is often referred to as the symmetric system. In some cases, it has been shown that some operations are slightly more complex, and others are simpler in the symmetric system (see Table 1).

While RNS adds, subtracts, and multiplies efficiently, division is not a closed operation. If X, Y, and Z have RNS representations given by

$$X \overset{\mathrm{RNS}}{\leftrightarrow} (X_1, \ldots, X_L); \quad Y \overset{\mathrm{RNS}}{\leftrightarrow} (Y_1, \ldots, Y_L); \quad Z \overset{\mathrm{RNS}}{\leftrightarrow} (Z_1, \ldots, Z_L) \qquad (5)$$

then denoting $\circ$ to represent +, -, or *, the RNS version of $Z = X \circ Y$ satisfies

$$Z \overset{\mathrm{RNS}}{\leftrightarrow} (Z_1, \ldots, Z_L) = ((X_1 \circ Y_1) \bmod p_1, \ldots, (X_L \circ Y_L) \bmod p_L) \qquad (6)$$

if Z belongs to $Z_M$. The importance of this statement may not be fully appreciated at first glance. The ith RNS digit, namely $Z_i$, is defined in terms of $(X_i \circ Y_i) \bmod p_i$ only. That is, no carry information need be communicated between residue digits. Since there is no requirement to manage carry information from RNS digit to digit, the overhead of manipulating carry information in more traditional, weighted-number systems can be avoided. The net result is very high speed concurrent (parallel) operations, and it is speed alone that makes the RNS attractive.

Table 1. RNS encoding.

For P = {3,4,5}, M = 60:

X(unsigned)  X(signed)   X1  X2  X3
    59          -1        2   3   4
    58          -2        1   2   3
    ...
    30         -30        0   2   0
    29          29        2   1   4
    ...
     5           5        2   1   0
     4           4        1   0   4
     3           3        0   3   3
     2           2        2   2   2
     1           1        1   1   1
     0           0        0   0   0
May 1984
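The encoding of Table 1 and the digit-wise rule of equation 6 can be sketched in a few lines. This is an illustrative model only, assuming the moduli set P = {3,4,5}; the helper names `to_rns` and `rns_op` are hypothetical and do not come from the article.

```python
import operator

P = (3, 4, 5)      # moduli set of Table 1
M = 3 * 4 * 5      # dynamic range, M = 60

def to_rns(x, moduli=P):
    """Residues X_i = X mod p_i.

    Python's % always returns a non-negative result, so a negative X
    is encoded exactly like (M - |X|) mod p_i in the signed system.
    """
    return tuple(x % p for p in moduli)

def rns_op(a, b, op, moduli=P):
    """Apply op digit by digit (equation 6); no digit needs a carry from another."""
    return tuple(op(x, y) % p for x, y, p in zip(a, b, moduli))

print(to_rns(59), to_rns(-1))                                    # both (2, 3, 4)
print(rns_op(to_rns(7), to_rns(3), operator.add) == to_rns(10))  # True
print(rns_op(to_rns(7), to_rns(3), operator.sub) == to_rns(4))   # True
print(rns_op(to_rns(7), to_rns(3), operator.mul) == to_rns(21))  # True
```

Each digit position is computed independently, which is what allows a hardware realization to process all L digits concurrently.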
Example. Again let P = {3,4,5} and M = 60, then note:

3 ↔ (0,3,3), 7 ↔ (1,3,2), 10 ↔ (1,2,0), 21 ↔ (0,1,1)

and

    7 ↔ (1, 3, 2)
   +3 ↔ (0, 3, 3)
   10 ↔ (1 mod 3, 6 mod 4, 5 mod 5) = (1,2,0)

    7 ↔ (1, 3, 2)
   -3 ↔ (0, 3, 3)
    4 ↔ (1 mod 3, 0 mod 4, -1 mod 5) = (1,0,4)

    7 ↔ (1, 3, 2)
   x3 ↔ (0, 3, 3)
   21 ↔ (0 mod 3, 9 mod 4, 6 mod 5) = (0,1,1)

In order to exploit RNS parallelism, arithmetic units must be found that efficiently and rapidly implement the modular statement $(X_i \circ Y_i) \bmod p_i$. For an arbitrary moduli set P, there are no conventional radix-2 arithmetic units for modulo p arithmetic. Only for the case where $p_i = 2^n$ can a commercial arithmetic unit be found. However, something better exists.

Modern semiconductor memory can be programmed to replace traditional arithmetic algorithms with simple table-lookup operations. For example, the high-speed memory found in Figure 1 and Table 2 can be programmed to output the value of $(X_i \circ Y_i) \bmod p_i$ upon receipt of the address $[X_i : Y_i]$, which denotes the concatenation of $X_i$ and $Y_i$. If $p_i$ is bounded by $2^n$ for all i, then the concatenated address $[X_i : Y_i]$ is 2n bits wide. Therefore, RNS arithmetic can be performed as a table-lookup mapping, provided the process does not exceed the address space of available memory.

Figure 1. Conventional arithmetic unit.

Table 2. Commercially available high-speed memory devices.

DEVICE  ORGANIZATION  ADDRESS BITS n (2^n WORDS)  ACCESS (ns)  TYPE  PINS
PROM    1K x 4        10                          30           TTL   18
        1K x 8        10                          20           TTL   24
        2K x 4        11                          45           TTL   18
        2K x 8        11                          20           TTL   24
        4K x 4        12                          35           TTL   20
        4K x 8        12                          40           TTL   24
        4K x 16       12                          75           NMOS  22
RAM     1K x 1        10                           7           ECL   16
        1K x 4        10                          25           ECL   24
        1K x 8        10                          55           NMOS  24
        4K x 1        12                           7           ECL   18
        4K x 4        12                          30           NMOS  20
        16K x 1       14                          25           ECL   20

An apparent asset of the RNS is exactness, but it turns out to be a shortcoming. In a weighted-number system, imprecision is induced during truncation or rounding operations, simple operations in a weighted-number system, which are needed to manage potential register overflows. However, because each digit in the RNS is of equal significance, none of the digits can be deleted as can those of a weighted number. Instead, inefficient RNS division operations must be used for scaling (rounding/truncating), and dynamic-range scaling is required in most RNS applications involving multiplication. The product of two RNS integers belonging to $Z_M$ is defined in $Z_{M^2}$. This geometric increase in dynamic range can rapidly fill any practical dynamic range limit. For example, if P = {3,4,5}, the sum of 25 and 25 would be in range, but their product would be way out of range.

An investigation of specific functional operations will point out RNS assets and liabilities.

Decimal-to-residue conversion

In practice, the RNS is useful only when very high data rates are required. Problems solved successfully with conventional arithmetic do not justify an RNS implementation. This is especially evident when one considers the great wealth of digital hardware currently available to the designer of conventional systems. For example, the TRW series of multiplier/accumulators have enhanced many systems as a peripheral processor. Likewise, the RCA CDP185C CMOS multiply/divide unit, the AMD 9511 arithmetic processor, the Intel 8087 numeric processor, and Intel 2920 signal processing chip can accelerate various phases of data processing. To be competitive, RNS systems must have arithmetic speeds on the order of hundreds of megahertz, and data acquisition and the accompanying decimal-to-residue conversion must be equally fast. Therefore, if analog data is to be processed, A/D conversions must be made in about 10 ns. These conversions require the use of "flash" converters with relatively short wordlengths, typically from six to 12 bits. They are well within the addressing limitations of high-speed semiconductor memory. As a result, the six- to 12-bit outputs of the A/D converter are sent to a memory chip as an address. The table responds with the precomputed value of $X \bmod p_i$ for each $p_i$ in the moduli set P.

Magnitude comparison

Weighted number systems have a most significant digit. As a result, two numbers can be compared on a digit-by-digit basis. When a number is compared to zero, its sign must be detected, a complicated implementation in the RNS. For example, for P = {3,4,5}, one cannot tell how X ↔ (1,0,4) compares with 10 simply by comparing the relative size of digits. As a result, both sign and magnitude comparison are difficult.

One important form of magnitude comparison is sign detection. Banerji11 was able to show in 1974 that, based
on Winograd's lower bound on addition and multiplication, the speed potential of the RNS can be realized completely only if the number of additions greatly outnumbers the calls to the high-overhead, sign-detection subsystem.

The Soviets have produced a variation on this theme. They have concentrated on a positional code for RNS numbers called "core" arithmetic, purported to be useful in supporting magnitude comparisons and data parity (odd-even). The core of a member X is given by $R_X$ where

$$R_X = \sum_{i=1}^{L} w_i \lfloor X / p_i \rfloor \qquad (7)$$

The weights $w_i$ are somewhat magic numbers, chosen rather than derived, so that $R_X$ has a restricted dynamic range. For example, if P = {3,4,5}, then M = 60. The value of $R_X$ for $X \in [0,59]$ is summarized in Figure 2. The range M can be seen to be compressed into [-1,6]. The general shape of the curve suggests a positive correlation between $R_X$ and X, but the relation to the signed values of $w_i$ is not strict.

Figure 2. $R_X$ vs. X for core arithmetic.

Difficulty in sign detection and magnitude comparison prevents the RNS from receiving serious consideration for use in general-purpose computing. The overhead of servicing conditional branch tests would slow the system to the point where it could not compete with traditional architectures. However, when nested adds and products alone are required, the RNS can compete with traditional methods.

Residue-to-decimal conversion

An RNS L-tuple can be converted to an integer in one of several ways. One can produce an inner product by mixed radix conversion:

$$X \overset{\mathrm{RNS}}{\leftrightarrow} \left( \sum_{i=1}^{L} v_i X'_i \right) \bmod M \qquad (8)$$

Here, $v_i = \prod_{j=1}^{i-1} p_j$ for $2 \le i \le L$ and $v_1 = 1$. The mixed-radix coefficients, denoted $X'_i$, are computed with a mixed-radix conversion algorithm, which is fundamentally recursive. It uses nesting subtractions with multiplication by multiplicative inverses.

Suppose X is known to have an RNS representation

$$X \overset{\mathrm{RNS}}{\leftrightarrow} (X_1, \ldots, X_L).$$

The MRC representation is given by

$$X = X'_1 + v_2 X'_2 + \cdots + v_L X'_L \qquad (9)$$

where $v_2 = p_1$, $v_3 = p_1 p_2$, $v_4 = p_1 p_2 p_3$, and so forth. Since $p_1$ is a factor of $v_2$ through $v_L$, it follows that

$$X \bmod p_1 = X_1 = X'_1 \qquad (10)$$

Now form $X - X_1 = v_2 X'_2 + v_3 X'_3 + \cdots + v_L X'_L$, where $v_3$ through $v_L$ have a common factor $p_2$. After multiplying both sides of $(X - X_1)$ by the multiplicative inverse of $p_1$ modulo $p_2$, written $p_{12}^{-1}$, it follows that

$$(p_{12}^{-1} (X - X_1)) \bmod p_2 = (p_{12}^{-1} (X_2 - X_1)) \bmod p_2 = X'_2 \qquad (11)$$

This pattern continues until the last $X'_L$ is defined. An MRC conversion of an L-tuple residue set can be both complex and slow. The complexity results from the many specialized modular mappings. Speed suffers because $X'_{j-1}$ must be computed before $X'_j$ can be solved. The MRC algorithm is illustrated in Figure 3.

Figure 3. Mixed-radix conversion algorithm with digits only, according to Szabo and Tanaka.3

For the special case where the moduli set is given by $P = \{2^n - 1, 2^n, 2^n + 1\}$, producing a dynamic range on the order of 3n bits, Taylor and Ramnarayanan developed a conversion algorithm based on a subcover decomposition of M.12 It offers some economies in speed and complexity if a 3n-bit dynamic range, for practical values of n ranging from six to 12 bits, is satisfactory. In this method,
the interval $[0, M)$ is decomposed in terms of an ideal $I$, where

$$I = \{k p_2 : 0 \le k < p_1 p_3\} \qquad (12)$$

Any $X^*$ belonging to $I$ possesses an RNS representation given by $X^* \leftrightarrow (X^*_1, 0, X^*_3)$ and a decimal representation satisfying:

$$X^* = (p_1 I_1 + J_1) p_2, \quad \text{or} \quad X^* = (p_3 I_3 + J_3) p_2 \qquad (13)$$

where $I_1 \in Z_{p_3}$, $J_1 \in Z_{p_1}$ and $I_3 \in Z_{p_1}$, $J_3 \in Z_{p_3}$. Collectively equation 13 can be written

$$p_3 I_3 - p_1 I_1 = (J_1 - J_3) = c \qquad (14)$$

Figure 4 shows that, upon receipt of the (n + 1)-bit value of c, the precomputed value of $\phi(c) = (p_1 I_1) \bmod p_2$ would be looked up. Also, the offset value of $J_1$ would be added to the lookup value after it is appended with the residue value $X_2$ to generate X.

The second method of converting a residue to a decimal is the Chinese Remainder Theorem, which has the structure illustrated in Figure 5 and satisfies

$$X = \left( \sum_{i=1}^{L} s_i \left( (X_i s_i^{-1}) \bmod p_i \right) \right) \bmod M \qquad (15)$$

where $s_i = M / p_i$ and $s_i^{-1}$ is called the multiplicative inverse of $s_i \bmod p_i$, so that $(s_i^{-1} s_i) \bmod p_i = 1$.

Example. For P = {3,4,5} and M = 60, convert X ↔ (1,0,4) into its integer value.

                 p1  p2  p3
                  3   4   5
    MRC:
                  1   0   4    X ↔ (X1, X2, X3)
    X'1 = 1           3   3    (Xi - X'1) mod pi
                     x3  x2    p12^-1 = 3 s.t. (3*3) mod 4 = 1; p13^-1 = 2 s.t. (2*3) mod 5 = 1
                      9   6
                      1   1    reduced mod pi
    X'2 = 1               0    (Xi - X'2) mod pi
                         x4    p23^-1 = 4 s.t. (4*4) mod 5 = 1
                          0
    X'3 = 0

    and X = X'1 + p1 X'2 + p1 p2 X'3 = 1 + 3*1 + 12*0 = 4

    CRT: s1 = 20, s2 = 15, s3 = 12
         s1^-1 = 2 s.t. (2*20) mod 3 = 1,
         s2^-1 = 3 s.t. (3*15) mod 4 = 1,
         s3^-1 = 3 s.t. (3*12) mod 5 = 1

    X = (s1 (s1^-1 X1) mod p1 + s2 (s2^-1 X2) mod p2 + s3 (s3^-1 X3) mod p3) mod M
      = (40 + 0 + 24) mod 60 = 4

The objection to the CRT has been its dependence on a modulo M accumulator, where M is a large integer that cannot be handled directly in commercially available radix-2 hardware. Taylor and Huang13 have developed a method by which modular accumulators can be practically realized for values of M given by $M = 2^t - 2^s$ (e.g., $P = \{2^n - 1, 2^n, 2^n + 1\}$). They show that this method can be expanded to support a more flexible L-modulo CRT. Consider an L-modulo set whose product is

$$M = 2^t - 2^s = 2^s (2^r - 1); \quad r = t - s \qquad (16)$$

The prime factors of $2^r - 1$ for $12 \le r \le 30$ are listed in Table 3. Some are impractical, since the size of at least one modulo exceeds the address space of commercially available, high-speed (under 50 ns) memory. If the largest modulo of any admissible factor is bounded by $p_i \le 2^u$, then the second term in equation 16, namely $2^s$, would be defined in terms of s = u. Table 4 summarizes the admissible choices for $16 \le t \le 35$.

Using lookahead adders, which are n bits wide (with an externally accessible overflow flag), and a PLA to mechanize some required combinational logic, Taylor and Ramnarayanan12 mechanized the modulo M sum of two numbers, A and B, belonging to $Z_M$:

(1) Let S be the n-bit sum of A and B with O denoting any overflow. Assume that the data format for the sum is

        O : XXX...XXX : YYY...YYY
        bit location:  n-1 ... m   m-1 ... 0

(2) Then for $M = 2^n - 2^m$, n > m; e.g., n = 8, m = 4, M = 240:

    (a) If $0 \le S \le M - 1$, then S mod M = S and
            S ↔ [0 : XXX...XXX : YYYY] with at least one X = 0;
        e.g., S = 100 ↔ [0 : 0110 : 0100]

    (b) If $2^n > S \ge M$, then S mod M = S - M and
            S     ↔ [0 : 111...11 : YYYY]
            S - M ↔ [0 : 000...00 : YYYY]
        e.g., S     = 250 ↔ [0 : 1111 : 1010]
              S - M =  10 ↔ [0 : 0000 : 1010]

    (c) If $S \ge 2^n$, then S mod M = S - M and
            S     ↔ [1 : XXX...XX : YYYY]
            S - M ↔ [0 : ZZZ...ZZ : YYYY]
        e.g., S     = 260 ↔ [1 : 0000 : 0100]
              S - M =  20 ↔ [0 : 0001 : 0100]

where the PLA of Figure 6, after Taylor,14 maps the overflow bits and XXX...XX into the correct values of ZZZ...ZZ. With such a modified binary adder, modulo M addition can be performed in a few nanoseconds. A CRT cycle can be completed by L - 1 successive modulo M adds of the table-lookup values of $(s_i ((X_i s_i^{-1}) \bmod p_i)) \bmod M$ (Equation 15).

Base extension. Base-extension methods increase the dynamic range of an RNS system by adding moduli to a prespecified system, thereby expanding the RNS dynamic range as needed. The extension essentially requires some form of residue-to-decimal conversion, so overhead is high. The decimal form of an RNS number is then recoded into an RNS K-tuple in the new system. The basic mechanics of the extension were explained by Szabo and Tanaka.3 Gregory and Matula developed several algorithms for converting from one system to another. Their conversion works where the moduli of the original RNS are not relatively prime and share no common factors with the moduli of the extended system. Szabo and Tanaka studied the case where the moduli of the new system are relatively prime to the original moduli set. Finally, core arithmetic has been shown to be effective in base extension.
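The mixed-radix conversion of equations 9 through 11 can be sketched as follows. This is a software illustration of the nested subtract-and-multiply pattern, not the hardware structure of Figure 3; the function names are hypothetical, and the built-in `pow(a, -1, m)` (Python 3.8 or later) is assumed for the multiplicative inverses.

```python
def mixed_radix_digits(residues, moduli):
    """Return the mixed-radix digits X'_1 .. X'_L of an RNS L-tuple.

    At step i the current digit is read off, subtracted from every
    remaining residue, and the result is multiplied by the inverse of
    p_i modulo each remaining modulus.
    """
    work = list(residues)
    digits = []
    for i, p_i in enumerate(moduli):
        d = work[i] % p_i
        digits.append(d)
        for k in range(i + 1, len(moduli)):
            p_k = moduli[k]
            inv = pow(p_i, -1, p_k)            # inverse of p_i modulo p_k
            work[k] = ((work[k] - d) * inv) % p_k
    return digits

def mrc_to_int(residues, moduli):
    """Weight the digits by v_1 = 1, v_2 = p_1, v_3 = p_1*p_2, ... (equation 9)."""
    x, weight = 0, 1
    for d, p in zip(mixed_radix_digits(residues, moduli), moduli):
        x += weight * d
        weight *= p
    return x

print(mixed_radix_digits((1, 0, 4), (3, 4, 5)))  # [1, 1, 0]
print(mrc_to_int((1, 0, 4), (3, 4, 5)))          # 4
```

The inner loop makes the serial dependence visible: digit i cannot be produced until digit i-1 has been subtracted out of every remaining residue.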
Figure 4. RNS-to-decimal conversion with data from example 14.

Figure 5. CRT with modulo M adder tree.
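The CRT sum of equation 15 can be modeled directly. This is a minimal sketch assuming pairwise relatively prime moduli; `crt_to_int` is a hypothetical helper name, and `math.prod` and the three-argument `pow` with exponent -1 require Python 3.8 or later.

```python
from math import prod

def crt_to_int(residues, moduli):
    """Residue-to-integer conversion by the Chinese Remainder Theorem.

    Each term s_i * ((X_i * s_i^-1) mod p_i) depends on one residue only,
    so the terms could be table lookups feeding a modulo M adder tree.
    """
    M = prod(moduli)
    total = 0
    for x_i, p_i in zip(residues, moduli):
        s_i = M // p_i
        s_inv = pow(s_i, -1, p_i)              # (s_inv * s_i) mod p_i == 1
        total += s_i * ((x_i * s_inv) % p_i)
    return total % M

print(crt_to_int((1, 0, 4), (3, 4, 5)))  # (40 + 0 + 24) mod 60 = 4
print(crt_to_int((2, 3, 2), (3, 5, 7)))  # 23, Sun Tzu's answer
```

Unlike the mixed-radix method, the L terms are independent of one another; the cost moves into the final modulo M accumulation.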

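The three cases of the modulo $M = 2^n - 2^m$ adder can be modeled with bit operations. This sketch is a software stand-in for the adder-plus-PLA arrangement, assuming both operands already lie in $[0, M)$; the function name and the field variables are hypothetical.

```python
def add_mod_2n_minus_2m(a, b, n=8, m=4):
    """Add a and b modulo M = 2**n - 2**m using only field manipulation.

    The sum is viewed as  O : X-field (bits n-1..m) : Y-field (bits m-1..0).
    """
    s = a + b
    overflow = s >> n                          # overflow bit O
    x_mask = (1 << (n - m)) - 1
    upper = (s >> m) & x_mask                  # X field
    lower = s & ((1 << m) - 1)                 # Y field, never modified
    if overflow:
        upper += 1                             # case (c): S - M = (S - 2**n) + 2**m
    elif upper == x_mask:
        upper = 0                              # case (b): all-ones X field means S >= M
    # case (a): S < M, fields pass through unchanged
    return (upper << m) | lower

print(add_mod_2n_minus_2m(60, 40))    # S = 100 -> 100
print(add_mod_2n_minus_2m(125, 125))  # S = 250 -> 10
print(add_mod_2n_minus_2m(130, 130))  # S = 260 -> 20
```

Only the X field and the overflow bit take part in the correction, which is why a small block of combinational logic suffices in hardware.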
Division. Division is essentially a blend of nested subtractions and magnitude comparisons, so it is most difficult to implement in the RNS. Furthermore, since the RNS is an integer system, it is not closed under division. Banerji et al.16 proposed in 1981 an approximate division routine modeled after the standard integer division formula, but division remains a slow, high-overhead operation to be avoided.

Overflow detection. In a weighted-number system, overflow is generally detected by inspecting the value of the most significant digit. In the RNS, overflow detection often takes the form of magnitude comparison. Recently, more attention has been given to developing systems that scale system variables to avoid the risk of overflow during run-time. Much of this development comes from desired improvements in the RNS scaling.

Scaling. Scaling prevents dynamic-range overflow. It reduces the dynamic range of RNS variables below a potential overflow value. While multiplication is the composition of two variables, scaling is the composition of a variable with a constant of known value. However, since scaling is a special form of division, it presents a major implementation problem. It requires a method by which an integer X, having an RNS representation, can be divided by some prespecified constant V with a minimum of overhead. Jullien9 proposed a memory-intensive, high-overhead method for RNS scaling. This method and its variants had weaknesses other than memory and overhead costs. The scaled value returned from the algorithm resided in a reduced dynamic range. Taylor and Huang in 198213 developed a method for the moduli set $P = \{2^n - 1, 2^n, 2^n + 1\}$, which was very efficient, required little hardware, and did not diminish the RNS dynamic range (see Figure 7).
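Why scaling is needed, and why it is costly, can be seen in a small model with P = {3,4,5}. The sketch below is deliberately naive: its `scale` function leaves the RNS, divides, and re-encodes, which is the overhead the methods cited above try to avoid. It is not any of those methods, and all names are hypothetical.

```python
P = (3, 4, 5)
M = 60

def to_rns(x):
    return tuple(x % p for p in P)

def from_rns(r):
    """Brute-force decode; adequate for a 60-value dynamic range."""
    return next(x for x in range(M) if to_rns(x) == r)

def rns_add(a, b):
    return tuple((x + y) % p for x, y, p in zip(a, b, P))

def rns_mul(a, b):
    return tuple((x * y) % p for x, y, p in zip(a, b, P))

def scale(r, v):
    """Naive scaling by a known constant v: decode, divide, re-encode."""
    return to_rns(from_rns(r) // v)

x = to_rns(25)
print(from_rns(rns_add(x, x)))            # 50: the sum stays in range
print(from_rns(rns_mul(x, x)))            # 25: the product 625 wrapped modulo 60
print(from_rns(scale(to_rns(50), 4)))     # 12: 50 scaled by the constant 4
```

The wrapped product gives no indication that anything went wrong, which is why overflow has to be prevented by scaling rather than detected afterwards.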

Table 3. Prime factors of $2^q - 1$, $12 \le q \le 30$.

     q    2^q - 1
    12    3*3*5*7*13
    13    8191 (TL)
    14    3*43*127
    15    7*31*151
    16    3*5*17*257
    17    131071 (TL)
    18    3*3*3*7*19*73
    19    524287 (TL)
    20    3*5*5*11*31*41
    21    7*7*127*337
    22    3*23*89*683
    23    47*178481 (TL)
    24    3*3*5*7*13*17*241
    25    31*601*1801 (TL)
    26    3*2731*8191 (TL)
    27    7*73*262657 (TL)
    28    3*5*29*43*113*127
    29    233*1103*2089 (TL)
    30    3*3*7*11*31*151*331

    TL = Moduli too large for use with commercially available, high-speed memory (Figure 1).

Arithmetic units. In order to mechanize a viable RNS, fast and efficient arithmetic units must be developed. Using 4K-bit memory units, RNS arithmetic for a moduli system whose largest moduli are six-bit values can be mechanized as table-lookup operations. As a result, most RNS arithmetic units are simply RAM or ROM. Table-lookup operations also have the advantage of simplifying real-time programming of an RNS. In such cases, all basic arithmetic operations take about the same time.

Yau and Chung17 investigated RNS implementations with cyclic group decomposition. A new class of code, the circulative, supported this arithmetic. In 1975, Bioul et

Table 4. Admissible moduli choices.


MAX IMUM DYNAM IC NUMBER FASTER MEMORY
MODUL M ODULI RAN GE MEMORY OF MEMORY CYCLE TIME T NORMALIZED
and L F,2 M =2' -2b ORGANIZATION BITS (FIGURE 1) B/m BT/M
a =1 5. 7,9, b =4 m =16 2828 x 2.432 7 ns 1.0 in0
I, 3, 1 6 b =4 2 (326 x3

0 =14 3.43. b =7 m= 2 1 r3214 x7 139, 296 25 ns 45.8 34.3


1 ='18 b =7 1 (,24 x2
L =4
a0
= 7,19. 2 b= 7 m =25 1(1214 x<7 125.120 25 ns 32.9 32.9
73 128 b =7 3(138 x4
1 (526 x3

q =2C 35.31 b =6 m =26 4t212 x6 93,304 7 ns 24.8 24.8


33,41 .64 b =6

=241 57.9. b= m =29 1C(21° x5 7,537 7n 1.7 20


13.17,32 b=5 2@28 x4
L=5 a26 x3
1
A=28 15.29, b=7 m=35 2t214 x7 261 120 25 ns 49.1 58.9
43.113, b 7 155212 x6
12 7.1 28 1 t21 x5
5L=6 1 Ca38 x 4

"L 1 !,n modUi set, T =s ' ie ory access tine rn design (Figure 1).

al.18 worked on an efficient modular adder for the case where $p = 2^n - 1$. It used two-input gates and had a hardware complexity value on the order of $O(n^2)$, with a delay on the order of $O(\log_2 n)$.

Agarwal in 197819 reported on the problem of building a modulo $(2^n + 1)$ adder using conventional carry-lookahead units and reformatting the data. Taylor and Huang20 published a treatise on an RNS floating-point system, in which the conventional floating point principle was redefined in an RNS format. The Soviets have also demonstrated an interest in a floating point RNS based, like so much of their work, on the core arithmetic concept.

Arithmetic modulo p is equivalent to operations over a cyclic group when p is prime. A well-known isomorphism between the additive group over $Z_{M-1}$ and the multiplication group over $Z_M - \{0\}$ permits the variables to be multiplied. Indexing is used in a manner similar to logarithmic multiplication, in that exponents are added.

In 1972, Vyshysky and Petushchak21 used the quarter-square identity, which compressed data by satisfying

$$A \cdot B = \frac{(A+B)^2}{4} - \frac{(A-B)^2}{4} \qquad (17)$$

In a standard application, two n-bit residues form the address for a 2n-bit table. Equation 17 shows that the sum

Figure 6. Carry look-ahead modulo M adder for modulo $2^n - 2^m$ sum.

and difference of A and B span only n + 1 bits. Therefore, they can be used to form the (n + 1)-bit addresses for the two separate tables that look up the two terms of the right-hand side of equation 17. More specifically, these terms are $(4^{-1}(A+B)^2) \bmod p$ and $(4^{-1}(A-B)^2) \bmod p$. With this modification, much larger moduli can be compressed into high-speed tables of a fixed size. Taylor10 has used this concept to design a 33-bit RNS VLSI multiplier that can operate at 28.5M multiplications per second. In addition, he was able to prove that the previous opposition to equation 17, the general absence of an inverse of 4 mod p for a general p, was unfounded. He demonstrated that equation 17 is valid for the powerful moduli set $P = \{2^n - 1, 2^n, 2^n + 1\}$ and used a commercial TRW 10M-mps RNS multiplier with a 72-bit range to extend this application to VLSI. A basic multiplier unit is presented in Figure 8.

Huang and Taylor6 proposed a memory compression scheme that reduces to one quarter the memory required to mechanize a modular arithmetic system. This reduction was based on the underlying cyclic symmetry found in RNSs. In 1982, Tomabechi et al.22 proposed using ring counters to build modular arithmetic modules. They offered a module called a subtractor-multiplier as the basic building block of this system, which was an extension of a 1981 idea involving ring counters for error correcting. However, for wide-wordlength moduli, ring-counter operations can be unnecessarily slow.

Over a decade ago, some considered maintaining a magnitude index file for each residue representation of X. This data was updated during addition (subtraction), multiplication, sign detection, and overflow detection operations. It was reported that execution time in sign and overflow detection could be reduced, but at the expense of multiplication.3 However, since the principal use of RNS is as a fast alternative to fixed-point multiplication, such operations have, at best, a very limited application.

Error correction. The RNS is an exact system with neither most nor least significant digits. The loss of any digit or bit within the residue can radically alter the decimal meaning of an RNS L-tuple. It needs, therefore, an error detection or correction scheme operating at speeds that will not compromise its real-time capability. In 1972, Mandelbaum23 developed a method for correcting a single error in an RNS by using two redundant moduli (triple redundancy). His method was later extended to multiple errors. In a simpler framework, a simple two-dimensional parity code can be used. Each of the L residues is coded as a binary word appended with an even parity bit. An additional (L + 1)st word is constructed with a jth bit as the parity bit generated by the common jth bits of the L residue words. The data format is as follows:

    X1   : BBBBBBBB : P
    X2   : BBBBBBBB : P
    ...
    XL   : BBBBBBBB : P
    XL+1 : PPPPPPPP : P   <----- HPW
                      ^
                      +--------- VPW

where B = bit, P = parity, HPW = horizontal parity word, and VPW = vertical parity word.
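The two-dimensional parity format can be sketched as below. This is an illustrative model under stated assumptions: residues are held as equal-length lists of bits, even parity is used throughout, and the function names are hypothetical. A single flipped bit disturbs exactly one row parity and one column parity, and their intersection gives its position.

```python
def parity(bits):
    """Even-parity bit: 1 when the number of ones is odd."""
    return sum(bits) % 2

def encode(words):
    """Append a parity bit to each residue word and build the (L+1)st word.

    Returns (rows, hpw): rows[i] is word i plus its parity bit (the column
    of these bits forms the VPW); hpw holds one parity bit per bit column.
    """
    rows = [list(w) + [parity(w)] for w in words]
    hpw = [parity(col) for col in zip(*rows)]
    return rows, hpw

def locate_single_error(rows, hpw):
    """Return (word index, bit index) of a single flipped bit, or None."""
    bad_rows = [i for i, r in enumerate(rows) if parity(r[:-1]) != r[-1]]
    bad_cols = [j for j, col in enumerate(zip(*rows)) if parity(col) != hpw[j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None

# Residues (1, 0, 4) written as 3-bit words, most significant bit first.
rows, hpw = encode([[0, 0, 1], [0, 0, 0], [1, 0, 0]])
print(locate_single_error(rows, hpw))   # None: no error present
rows[2][0] ^= 1                         # flip one bit of the third residue
print(locate_single_error(rows, hpw))   # (2, 0)
```

Once the position is known, correction is a single bit inversion, so the whole check can run at the speed of the parity generators.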

Figure 7. Autoscale multiplier.

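The quarter-square multiplier of equation 17 can be modeled with two small lookup tables. This sketch assumes an odd modulus p, so that the inverse of 4 modulo p exists; the extension to even moduli reported by Taylor is not attempted here. The table layout, including the offset of p - 1 used to keep the difference index non-negative, is an assumption made for illustration, and the function names are hypothetical.

```python
def build_quarter_square_tables(p):
    """Precompute (4^-1 * s^2) mod p for every possible sum and difference.

    Both tables have 2p - 1 entries, so each needs only about one more
    address bit than a single residue, instead of a 2n-bit product table.
    """
    inv4 = pow(4, -1, p)                        # raises ValueError if p is even
    t_sum = [(inv4 * s * s) % p for s in range(2 * p - 1)]               # index A + B
    t_diff = [(inv4 * (d - (p - 1)) ** 2) % p for d in range(2 * p - 1)]  # index A - B + p - 1
    return t_sum, t_diff

def qs_multiply(a, b, p, tables):
    """(A * B) mod p as the difference of two table lookups."""
    t_sum, t_diff = tables
    return (t_sum[a + b] - t_diff[a - b + p - 1]) % p

tables = build_quarter_square_tables(13)
print(qs_multiply(7, 9, 13, tables))    # 11, since 63 mod 13 = 11
```

The saving grows with the modulus: a direct product table for an n-bit modulus needs on the order of 2^(2n) entries, while the two quarter-square tables need on the order of 2^(n+1) entries each.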
If a single-bit error occurs, say in the jth bit of the ith residue, it is indicated by the ith vertical and jth horizontal parity bit. Single random errors on an RNS transaction can be managed in real time by simple and fast binary parity generators.

Recently, Etzel and Jenkins24 investigated the problem of detecting and correcting errors in RNS systems in order to develop fault-tolerant RNS computers. They define an RNS over a split field derived from a moduli set $P = \{p_1, \ldots, p_N\}$ of relatively prime integers $p_i$. The first L moduli define the nonredundant range M. Adding the $N - L \ge 2$ redundant moduli, Etzel and Jenkins get a total range $M_T$. They accept an RNS N-tuple and convert it into an integer using all possible combinations of (N - 1) residues at once. If one residue is in error, they can detect the error condition, then locate and correct it. Their tests produce one residue-to-integer conversion in $Z_M$ and (N - 1) others in $Z_{M_T} - Z_M = \{M, M+1, \ldots, M_T - 1\}$. However, if there are no errors, then all RNS mappings to an integer will be in $Z_M$. In the first case, the test that maps to an integer in $Z_M$ flags the location and value of the flawed RNS digit. This digit is then reconstructed during the error-correction phase.

Jenkins and Krogmeier25 took another path towards developing fault-tolerant RNS systems. Based on original work by Leung, they developed an error decoder for a recursive RNS chip-matched filter. Their theoretical foundation was the Etzel-Jenkins redundant moduli concept and Leung's novel complex RNS arithmetic unit, the quadratic RNS or QRNS. To appreciate their innovation, a comparison with conventional complex RNS, CRNS, is necessary.

In a conventional CRNS, a complex number Z is defined to be

$$Z = X + jY \overset{\mathrm{RNS}}{\leftrightarrow} (Z_1, \ldots, Z_L); \quad X \in Z_M, \ Y \in Z_M \qquad (18)$$
$$Z_i = (X_i = X \bmod p_i) + j (Y_i = Y \bmod p_i)$$

where $j = \sqrt{-1}$. The combination of two RNS numbers, say $Z_1$ and $Z_2$, is given by

$$Z_3 = Z_1 + Z_2 = (Z_{31}, \ldots, Z_{3L}); \quad Z_{3i} = (X_{1i} + X_{2i}) + j (Y_{1i} + Y_{2i}) \qquad (19)$$
$$Z_3 = Z_1 * Z_2 = (Z_{31}, \ldots, Z_{3L}); \quad Z_{3i} = (X_{1i} X_{2i} - Y_{1i} Y_{2i}) + j (X_{1i} Y_{2i} + X_{2i} Y_{1i})$$

In the CRNS, complex addition requires two real adds, while complex multiplication takes four real multiplies and two real adds to complete.

The QRNS, on the other hand, represents members in terms of the quadratic roots of $j_i^2 = -1 \bmod p_i$. If $p_i$ is prime and has the form $p_i = 4n + 1$, then $j_{i1}$ and $j_{i2}$ are said to be quadratic roots and -1 is a quadratic residue. Here the real integer j will play a role similar to the imaginary $j = \sqrt{-1}$. Like j, the roots $j_{i1}$ and $j_{i2}$ are additive and multiplicative inverses of each other.

Example. Let P = {5, 13} (i.e., n = 1 and 3), then

(a) $j_1^2 = -1 \bmod 5$
    $j_{11}^2 = 4$ or $j_{11} = 2$

(b) $j_2^2 = -1 \bmod 13$
    $j_{21}^2 = 25$ or $j_{21} = 5$
Figure 8. Basic VLSI multiplier unit showing 10M mps pipelined throughput. (If Xn = 1, XHI = XLO = 0; see Taylor.16)
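The quadratic roots in the example can be found by direct search, and the role they play can be checked numerically. The sketch below is illustrative only and works on a single prime modulus: `crns_mul` follows the product rule of equation 19, while `to_qrns` replaces the imaginary unit with an integer root j, forming (X + jY) mod p and (X - jY) mod p. All function names are hypothetical.

```python
def roots_of_minus_one(p):
    """All integers j in [1, p) with j*j congruent to -1 modulo p."""
    return [j for j in range(1, p) if (j * j) % p == p - 1]

def crns_mul(a, b, p):
    """Complex product on one modulus: four real multiplies, two real adds."""
    (x1, y1), (x2, y2) = a, b
    return ((x1 * x2 - y1 * y2) % p, (x1 * y2 + x2 * y1) % p)

def to_qrns(z, p, j):
    """Map (X, Y) to the pair ((X + jY) mod p, (X - jY) mod p)."""
    x, y = z
    return ((x + j * y) % p, (x - j * y) % p)

print(roots_of_minus_one(5))    # [2, 3]
print(roots_of_minus_one(13))   # [5, 8]

# With j = 5 and p = 13, the complex product becomes two independent
# real products on the mapped pair.
p, j = 13, 5
a, b = (3, 4), (2, 7)
za, zb = to_qrns(a, p, j), to_qrns(b, p, j)
pairwise = ((za[0] * zb[0]) % p, (za[1] * zb[1]) % p)
print(pairwise == to_qrns(crns_mul(a, b, p), p, j))   # True
```

The check works because j*j is congruent to -1, so expanding (X1 + jY1)(X2 + jY2) reproduces the real and imaginary parts of the complex product.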
Since $j_{i2}$ is the inverse of $j_{i1}$, it follows that

  $j_{12} = 5 - 2 = 3$
  $j_{12}^2 = 9 \bmod 5 = -1$
  $j_{22} = 13 - 5 = 8$
  $j_{22}^2 = 64 \bmod 13 = -1$

An isomorphic mapping f can be designed between the CRNS and the new QRNS by defining:

  $Z \leftrightarrow (z, z^*); \quad Z \in Z_M \times Z_M$ (CRNS)   (20)
  $Z \overset{\mathrm{CRNS}}{\longleftrightarrow} (Z_1, \ldots, Z_L); \quad Z_i = X_i + jY_i$
  $Z \overset{\mathrm{QRNS}}{\longleftrightarrow} (z_1, \ldots, z_L)$
  $Z^* \overset{\mathrm{QRNS}}{\longleftrightarrow} (z_1^*, \ldots, z_L^*)$
  $z_i = (X_i + jY_i) \bmod p_i \in Z_{p_i}$
  $z_i^* = (X_i - jY_i) \bmod p_i \in Z_{p_i}$
  $X_i = (2^{-1}(z_i + z_i^*)) \bmod p_i \in Z_{p_i}$
  $Y_i = (2^{-1} j^{-1}(z_i - z_i^*)) \bmod p_i \in Z_{p_i}$

Observe that $X$, $Y$, $z$, and $z^*$ are real numbers. Furthermore, since $p_i$ is prime, $2^{-1}$ exists. The remarkable property of the QRNS is found in its arithmetic where, if $Z_1$ and $Z_2$ are in the QRNS

  $Z_3 = Z_1 + Z_2 = (Z_{31}, \ldots, Z_{3L})$   (21)
  $Z_{3i} = ((z_{1i} + z_{2i}) \bmod p_i, \; (z_{1i}^* + z_{2i}^*) \bmod p_i)$
  $Z_3 = Z_1 Z_2 = (Z_{31}, \ldots, Z_{3L})$;
  $Z_{3i} = ((z_{1i} z_{2i}) \bmod p_i, \; (z_{1i}^* z_{2i}^*) \bmod p_i)$

A QRNS add is as complex as a CRNS add; it requires two real adds. However, a QRNS multiplication requires but two real multiplies (as opposed to four for the CRNS) and no real adds (as opposed to two for the CRNS). Therefore, in those cases where the RNS is to be applied to problems requiring complex arithmetic (e.g., the FFT), the QRNS may offer some advantages.

Applications

One of the most promising uses of the RNS is to implement digital filters or convolvers. In many filter applications, speed is of the essence and wordlengths on the order of 16 to 32 bits are common. Nussbaumer proposed the RNS to implement convolution with Mersenne and Fermat transforms.26 In this class of problems, known as number-theoretic transforms, or NTTs, strict relations are established between the transform length and the amplitude dynamic range, here equal to M. Nussbaumer simulated a 64th-order bandpass filter and compared it to alternative architectures. He showed the RNS filter to have a greater potential throughput than traditional number systems.
Many articles have been written on classical digital filtering. A milestone work is the Jenkins and Leon study of finite-impulse response (feed-forward) filters.27 They combined the distributed arithmetic, or bit-serial structure of Peled and Liu, to mechanize a novel CRT architecture. Their tests demonstrated that the RNS filter improves speed and reduces cost in comparison to traditional structures. In 1982, Etzel and Jenkins28 described a family of residue classes that facilitates simplified data scaling in the RNS. Scaling is an implicit operation in digital filtering, in which RNS multiplication increases dynamic range requirements geometrically. Scaling then manages the potential overflow problem. Jenkins, also in 1980,29 studied the complex arithmetic problem using index calculus to replace formal multiplication. His study expands the possibilities of residue arithmetic.
Microprocessors have been proposed as adaptive RNS filters for system identification applications. Overflow scaling was serviced with a "recursive-like" algorithm. Tan and McInnis30 in 1981 studied an adaptive filter in digital control applications. They used a least-mean-square algorithm, which required only adders and shift registers, so the division was side-stepped. The filter was reported to run at 10 MHz and cost less than $500.
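The filter studies surveyed in this section all rest on the same mechanism: every tap product and accumulation is carried out independently modulo each $p_i$, and the result is converted back to an integer with the CRT. The following Python sketch illustrates that idea for a small FIR filter. It is not taken from any of the cited designs: the moduli (7, 11, 13, 15) and the names to_rns, from_rns, and rns_fir are assumptions chosen for illustration, and math.prod and pow(x, -1, m) assume Python 3.8 or later.

```python
from math import prod

# Hypothetical moduli chosen for illustration; they only need to be pairwise
# relatively prime. Dynamic range M = 7 * 11 * 13 * 15 = 15015.
MODULI = (7, 11, 13, 15)
M = prod(MODULI)


def to_rns(x):
    """Encode a (possibly negative) integer as its residues."""
    return tuple(x % m for m in MODULI)


def from_rns(residues):
    """Decode with the Chinese Remainder Theorem into a signed integer."""
    total = 0
    for r, m in zip(residues, MODULI):
        m_hat = M // m
        total += r * m_hat * pow(m_hat, -1, m)
    x = total % M
    # Map the upper half of [0, M) onto the negative numbers.
    return x - M if x >= (M + 1) // 2 else x


def rns_fir(samples, taps):
    """FIR filter y[n] = sum_k taps[k] * samples[n - k], one channel per modulus."""
    tap_res = [to_rns(h) for h in taps]
    smp_res = [to_rns(x) for x in samples]
    out = []
    for n in range(len(samples)):
        acc = [0] * len(MODULI)
        for k in range(len(taps)):
            if n - k < 0:
                break
            for c, m in enumerate(MODULI):
                # Each channel works alone: no carries cross between moduli.
                acc[c] = (acc[c] + tap_res[k][c] * smp_res[n - k][c]) % m
        out.append(from_rns(acc))
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, -5]
    taps = [3, -2, 1]
    print(rns_fir(samples, taps))
```

The output is correct only while every true filter output stays inside the signed dynamic range, here -7507 to 7507. Nothing inside the residue channels signals an overflow, which is why the scaling schemes discussed above matter.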
Soderstrand31 reported on a microprocessor-based
low-pass filter designed to support a mass spectrometer
experiment. It was also used to study RNS adaptive filters in
an application of state-of-the-art residue arithmetic
knowledge. Shubs of the USSR studied the digital-filter
problem in terms of a noniterative statement (nonrecursive
filter, FFT) and iterative operations (recursive filter).32
Focusing on the mathematical properties of RNS and dis-
regarding its applications, he reported that the RNS may
be superior to noniterative algorithms when the volume of
nonmodular operations does not exceed 25 percent of the
total arithmetic count. Jenkins33 reported on a scaling
algorithm that supported digital filtering with an accom-
panying study of quantification error. Ramnarayanan and
Taylor34 published an RNS study on the effect of ar-
chitecture on filter performance. They found that filter
speed and precision were affected by the filter architecture
chosen (viz., parallel, canonical, etc.). In all these cases, the
RNS-based filter demonstrated potential throughput superiority, with speeds limited principally by overflow management schemes adopted by the various investigators.
Another application area is discrete Fourier transforms, or DFTs. Taylor and Huang35 used the RNS to mechanize a Winograd DFT and cover its base numbers, as well as to support input/output data reordering. Their work also studied radix-2 and radix-4 FFTs and the Good-Winograd FFT. Again the key problem was managing the potential dynamic-range overflow that accompanies RNS multiplication, which must be considered as overhead. A detailed study of all these algorithms showed that the Winograd algorithm ran fastest when the RNS was used to compute the DFT. Without data addressing overhead, this algorithm presented the lowest total number of memory lookup calls (or cycles). Tseng et al.36 also described building an FFT in the RNS, presenting procedures for choosing scale factors and their location.
As mentioned before, number-theoretic transforms have potential utility in spread-spectrum communication and data encryption. Baraniecka and Jullien37 developed two different NTT hardware structures (one using ROMs and the other microprocessors) in 1980. They verified an application involving the use of arithmetic over the direct sum of Galois fields $GF(p^2)$, representing complex numbers with an 840-point NTT over $GF(p^2)$. Martin et al.38 in 1979 explained how to choose the moduli set to admit fast NTT computation with Winograd's algorithm. Then, in 1979, Reddy and Reddy39 studied a triangular transform.
Huang et al.40 described a 2D convolver with 5 x 5 ECL technology. This highly pipelined structure operated at a 30M 5 x 5 throughput. Fouse et al.41 investigated the mechanics of another 5 x 5 processor, concluding that it could be implemented in VLSI. The solution of a set of linear equations is a computationally intensive problem, but the RNS holds promise in this problem area, since it functions well where there are no conditional branch restraints. The RNS has been proposed to reduce an integral matrix to the Frobenius form and compute the Moore-Penrose inverse or pseudoinverse. This defense-related application of very high speed technology produced remarkably reliable throughput.

Technology

For the RNS to be an effective decimal converter, it must be compatible with existing digital architectures and contemporary computing machines. Fortunately, we are in the midst of a technological revolution. New devices and architectures are being announced daily. Two of the most exciting are VLSI and VHSIC. Even though VLSI has already made an impact on digital arithmetic, most progress is in memory technology. Since its architecture is memory intensive, RNS will benefit first from the semiconductor revolution. For example, Phillips announced a GaAs 256-bit RAM that works at 1.5 ns. With such units, real-time RNS systems could achieve throughputs approaching 1G operations per second.
Several RNS researchers are exploring VLSI in RNS design, using TRW's VLSI multiplier/accumulator to develop an RNS multiplier or custom designs for a modular adder system for the moduli $(2^n + 1, 2^n, 2^n - 1)$. In addition, NMOS designs have incorporated Mead-Conway and Brent-Kung architectures.42,43 The Mead-Conway architecture has an $O(n)$ delay, the other $O(\log n)$. Analysis based on a time-area product suggests that, if the basic cells are implemented with a PLA, the faster Brent-Kung cell is superior to the more complex Mead-Conway if n > 12 bits.

Although many readers may be familiar with RNS as a mathematical concept, few are aware of progress in RNS applications permitted by semiconductor advancements over the last 10 years. Its table-lookup nature means that RNS can immediately assimilate new memory devices as they enter the market.
This article catalogs recent RNS research and points out some of its shortcomings, particularly in sign detection, division, and magnitude comparison. The RNS appears best suited to problems placing limited demands on these operations.
The fundamental quality of the RNS is ultra-high-speed multiplication and addition, which comes from parallel processing of arithmetic operations. New technologies can be expected to improve speed, and eventually optical systems with very high bandwidths will benefit from the growing body of RNS knowledge.

Acknowledgments

I appreciate all those who have assisted in the preparation of this article, in particular, Raja Ramnarayanan. I also thank the NSF for supporting my RNS work during the last five years. I apologize to authors whose works are not referenced due only to space limitations.

References

1. "Decimal Arithmetic Unit," in Stroje Na Zprocovani, A. Nakl, ed., CSAU, Praha, Czech., 1962 (in Czech).
2. H. L. Garner, "The Residue Number System," IRE Trans. Electronic Computers, Vol. EL-8, No. 6, June 1959, pp. 140-147.
3. N. S. Szabo and R. I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology, McGraw-Hill, New York, 1967.
4. D. D. Miller, J. N. Polky, and J. R. King, "A Survey of Soviet Developments in Residue Number Theory Applied to Digital Filtering," Proc. 26th Midwest Symp. Circuits and Systems, Aug. 1983.
5. A. Huang, "The Implementation of a Residue Arithmetic Unit via Optical and Other Physical Devices," Proc. Int'l Optical Computing Conf., 1975, pp. 14-18.
6. C. H. Huang and F. J. Taylor, "A Memory Compression Scheme for Modular Arithmetic," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-27, Dec. 1979, pp. 608-611.
7. W. Smith, "SWIFT," Symp. Very High Speed Computing Technology, Oct. 1980 (held with IEEE ICASSP Conf.).
8. D. O. Carhoun et al., A Synthesis Algorithm for Recursive Finite Field FIR Digital Filters, Mitre tech. report, Bedford, Mass., Apr. 1983.

9. G. A. Jullien, "Residue Number Scaling and Other Operations Using ROM Arrays," IEEE Trans. Computers, Vol. C-27, No. 4, Apr. 1978, pp. 325-336.
10. F. J. Taylor, "A VLSI Residue Arithmetic Multiplier," IEEE Trans. Computers, Vol. C-31, No. 6, June 1982, pp. 540-546.
11. D. K. Banerji, "On the Use of Residue Arithmetic for Computation," IEEE Trans. Computers, Vol. C-23, Dec. 1974, pp. 1315-1317.
12. F. J. Taylor and A. S. Ramnarayanan, "An Efficient Residue-to-Decimal Converter," IEEE Trans. Circuits and Systems, Vol. CAS-28, No. 12, Dec. 1981, pp. 1164-1169.
13. F. J. Taylor and C. H. Huang, "An Autoscale Residue Multiplier," IEEE Trans. Computers, Vol. C-31, No. 4, Apr. 1982, pp. 321-325.
14. F. J. Taylor, "Large Moduli Multipliers for Signal Processing," IEEE Trans. Circuits and Systems, Vol. CAS-28, No. 7, July 1981.
15. R. T. Gregory and D. W. Matula, "Base Conversion in Residue Number System," Third Symp. Computational Arithmetic, Nov. 1975, pp. 117-125.
16. D. K. Banerji et al., "A High-Speed Division Method in Residue Arithmetic," IEEE Proc. Fifth Symp. Computer Arithmetic, May 1981, pp. 158-164.
17. S. S. Yau and J. Chung, "On the Design of Modulo Arithmetic Units Based on Cyclic Groups," IEEE Trans. Computers, Vol. C-25, No. 11, Nov. 1976, pp. 1057-1067.
18. G. Bioul et al., "A Computation Scheme for an Adder Modulo 2^n + 1," Digital Processes, Vol. 1, No. 4, winter 1975, pp. 309-318.
19. D. P. Agarwal, "Modulo 2**n + 1 Arithmetic Logic," IEEE J. Electronic Circuits and Systems, Vol. 2, No. 6, Nov. 1978, pp. 186-188.
20. F. J. Taylor and C. H. Huang, "A Floating-Point Residue Arithmetic Unit," J. Franklin Institute, Vol. 311, Jan. 1981, pp. 33-53.
21. V. A. Vyshysky and V. D. Petushchak, "Algorithms for Determination of the Reciprocal of a Number in a Residue Class System," Soviet Automatic Control, Vol. 6, No. 3, May 1973, pp. 58-61 (in Russian).
22. N. Tomabechi, M. Kameyama, and T. Higuchi, "Efficient Residue Arithmetic Circuit Using Multiple-Valued Ring Counters and Its Application to Digital Signal Processing," Proc. 12th Int'l Symp. Multiple-Valued Logic, Feb. 1982, pp. 107-112.
23. D. Mandelbaum, "Error Correction in Residue Arithmetic," IEEE Trans. Computers, Vol. C-21, No. 6, June 1972, pp. 538-545.
24. M. Etzel and W. K. Jenkins, "Redundant Residue Number Systems for Error Detection and Correction in Digital Filters," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-29, No. 5, Oct. 1980, pp. xxx-xxx.
25. W. K. Jenkins and J. J. Krogmeier, "Error Detection and Correction in Quadratic Residue Number Systems," Proc. 26th Midwest Symp. Circuits and Systems, Pueblo Mexico, Aug. 1983, pp. xx-xx.
26. H. Nussbaumer, "Digital Filtering Using Cotransforms in Finite Fields," Electronics Letters, Vol. 12, No. 5, Mar. 4, 1976, pp. 113-114.
27. W. K. Jenkins and B. J. Leon, "The Use of Residue Number Systems in the Design of Finite Impulse Response Digital Filters," IEEE Trans. Circuits and Systems, Vol. CAS-24, No. 4, Apr. 1977, pp. 199-201.
28. M. H. Etzel and W. K. Jenkins, "The Design of Specialized Residue Classes for Efficient Recursive Digital Filter Realization," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-30, No. 3, June 1982, pp. 370-380.
29. W. K. Jenkins, "Complex Residue Number Arithmetic for High Speed Signal Processing," Electronics Letters, Vol. 16, No. 17, Aug. 1980, pp. 660-661.
30. C. I. Tan and B. C. McInnis, "Adaptive Digital Control Implemented Using Residue Number Systems," Proc. 20th IEEE Conf. Decision and Control, Vol. 2, Dec. 1981, pp. 808-812.
31. M. Soderstrand, "A Digital Filter for Use with Eight-Bit Microprocessors," IEEE Proc. 20th Midwest Symp. Circuits and Systems, Aug. 1977, pp. 54-58.
32. V. Y. Shubs, "Feasibility of Utilizing a System of Residual Classes in Signal Processing Equipment," Izv Vuz Radioelectron USSR, Vol. 23, No. 1, Sept. 1975, pp. 75-76 (in Russian).
33. W. K. Jenkins, "Recent Advances in Residue Number Techniques for Recursive Digital Filtering," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-27, No. 1, Feb. 1979, pp. 19-30.
34. A. S. Ramnarayanan and F. J. Taylor, "On the Structure of IIR Filters Using Residue Arithmetic," Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, Apr. 1981, pp. 251-254.
35. F. J. Taylor and C. H. Huang, "A Comparison of DFT Algorithms Using a Residue Architecture," Computer and Electrical Engineering (England), Vol. 8, No. 3, Sept. 1981, pp. 161-171.
36. B. D. Tseng, G. A. Jullien, and W. C. Miller, "Implementation of FFT Structures Using the Residue Number System," IEEE Trans. Computers, Vol. C-28, No. 11, Nov. 1979, pp. 831-845.
37. A. Z. Baraniecka and G. A. Jullien, "Residue Number System Implementations of Number Theoretic Transforms in Complex Residue Rings," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-23, No. 3, June 1980, pp. 285-291.
38. S. C. P. Martin et al., "Microprocessor Implementation of Number Theoretic Transforms," IEEE J. Electronic Circuits and Systems, Vol. 3, No. 1, Jan. 1979, pp. 21-26.
39. N. S. Reddy and V. U. Reddy, "Convolution Algorithms for Small-Word-Length Digital Filtering Applications," IEEE J. Electronic Circuits and Systems, Vol. 3, No. 6, Nov. 1979, pp. 253-256.
40. C. Huang et al., "Implementation of a Fast Digital Processor Using Residue Number Arithmetic," IEEE Trans. Circuits and Systems, Vol. 28, No. 1, Jan. 1981, pp. 32-38.
41. S. D. Fouse et al., "Residue-Based Image Processor for Very Large Scale Integration (VLSI) Implementation," Proc. Int'l Society for Optics (England), Vol. 281, July 1981, pp. 39-43.
42. C. Mead, Introduction to VLSI Systems, Addison-Wesley, Reading, Mass., 1980.
43. R. P. Brent, "Regular Layout for Parallel Adders," IEEE Trans. Computers, Vol. C-31, No. 2, Mar. 1982, pp. 260-264.

Fred J. Taylor is a professor of electrical engineering and computer and information science at the University of Florida, having also been on the faculties of the University of Cincinnati, University of Texas-El Paso, and Southern Methodist University. He has industrial experience with Texas Instruments and consults widely with defense industries. He is the author of fifty archived papers and several book chapters and holds three patents. Taylor received his PhD from the University of Colorado-Boulder in 1969. He is a senior member of the IEEE.
His address is Electrical Engineering Department, University of Florida, Gainesville, FL 32611.
62 COMPUTER
