
P.V. Ananda Mohan

Residue Number Systems
Theory and Applications

R&D
CDAC
Bangalore, Karnataka
India

ISBN 978-3-319-41383-9 ISBN 978-3-319-41385-3 (eBook)


DOI 10.1007/978-3-319-41385-3

Library of Congress Control Number: 2016947081

Mathematics Subject Classification (2010): 68U99, 68W35

© Springer International Publishing Switzerland 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained
herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This book is published under the trade name Birkhäuser


The registered company is Springer International Publishing AG Switzerland (www.birkhauser-science.com)
To
The Goddess of learning Saraswati
and
Shri Mahaganapathi
Preface

The design of algorithms and hardware implementation for signal processing systems has received considerable attention over the last few decades. The primary
area of application was in digital computation and digital signal processing. These
systems earlier used microprocessors, and, more recently, field programmable gate
arrays (FPGA), graphical processing units (GPU), and application-specific inte-
grated circuits (ASIC) have been used. The technology is evolving continuously to
meet the demands of low power and/or low area and/or computation time.
Several number systems have been explored in the past such as the conventional
binary number system, logarithmic number system, and residue number system
(RNS), and their relative merits have been well appreciated. The residue number
system was applied for digital computation in the early 1960s, and hardware was
built using the technology available at that time. During the 1970s, active research
in this area commenced with application in digital signal processing. The emphasis
was on exploiting the power of RNS in applications where several multiplications
and additions needed to be carried out efficiently using small word length pro-
cessors. The research carried out was documented in an IEEE press publication in
1975. During the 1980s, there was a resurgence in this area with an emphasis on
hardware that did not need ROMs. Extensive research has been carried out since the 1980s, and several techniques have been developed for overcoming certain bottlenecks in sign detection, scaling, comparison, and forward and reverse conversion.
A compilation of the state of the art was attempted in 2002 in a textbook, and this
was followed by another book in 2007. Since 2002, several new investigations have
been carried out to increase the dynamic range using more moduli, special moduli
which are close to powers of two, and designs that use only combinational logic.
Several new algorithms/theorems for reverse conversion, comparison, scaling, and
error correction/detection have also been investigated. The number of moduli has been increased while at the same time retaining the speed/area advantages.
It is interesting to note that in addition to application in computer arithmetic,
application in digital communication systems has gained a lot of attention. Several
applications in wireless communication, frequency synthesis, and realization of


transforms such as discrete cosine transform have been explored. The most inter-
esting development has been the application of RNS in cryptography. Some of the
cryptography algorithms used in authentication which need big word lengths
ranging from 1024 bits to 4096 bits using RSA (Rivest Shamir Adleman) algorithm
and with word lengths ranging from 160 bits to 256 bits used in elliptic curve
cryptography have been realized using the residue number systems. Several appli-
cations have been in the implementation of Montgomery algorithm and implemen-
tation of pairing protocols which need thousands of modulo multiplication,
addition, and reduction operations. Recent research has shown that RNS can be
one of the preferred solutions for these applications, and thus it is necessary to
include this topic in the study of RNS-based designs.
This book brings together various topics in the design and implementation of
RNS-based systems. It should be useful for the cryptographic research community,
researchers, and students in the areas of computer arithmetic and digital signal
processing. It can be used for self-study, and numerical examples have been
provided to assist understanding. It can also be prescribed for a one-semester course
in a graduate program.
The author wishes to thank Electronics Corporation of India Limited, Bangalore,
where a major part of this work was carried out, and the Centre for Development of
Advanced Computing, Bangalore, where some part was carried out, for providing
an outstanding R&D environment. He would like to express his gratitude to
Dr. Nelaturu Sarat Chandra Babu, Executive Director, CDAC Bangalore, for his
encouragement. The author also acknowledges Ramakrishna, Shiva Rama Kumar,
Sridevi, Srinivas, Mahathi, and his grandchildren Baby Manognyaa and Master
Abhinav for the warmth and cheer they have spread. The author wishes to thank
Danielle Walker, Associate Editor, Birkhäuser Science for arranging the reviews,
her patience in waiting for the final manuscript, and her assistance in launching the book to production. Special thanks are also due to Agnes Felema A. and the production and graphics team at SPi Global for their most efficient typesetting, editing, and readying of the book for production.

Bangalore, India P.V. Ananda Mohan


April 2015
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Modulo Addition and Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Adders for General Moduli . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Modulo (2^n − 1) Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Modulo (2^n + 1) Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Binary to Residue Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Binary to RNS Converters Using ROMs . . . . . . . . . . . . . . . . . 27
3.2 Binary to RNS Conversion Using Periodic Property
of Residues of Powers of Two . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Forward Conversion Using Modular Exponentiation . . . . . . . . . 30
3.4 Forward Conversion for Multiple Moduli Using
Shared Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Low and Chang Forward Conversion Technique
for Arbitrary Moduli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 Forward Converters for Moduli of the Type (2^n − k) . . . . . . . . . 35
3.7 Scaled Residue Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 36
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Modulo Multiplication and Modulo Squaring . . . . . . . . . . . . . . . . . 39
4.1 Modulo Multipliers for General Moduli . . . . . . . . . . . . . . . . . . 39
4.2 Multipliers mod (2^n − 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Multipliers mod (2^n + 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Modulo Squarers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5 RNS to Binary Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1 CRT-Based RNS to Binary Conversion . . . . . . . . . . . . . . . . . . 81
5.2 Mixed Radix Conversion-Based RNS to Binary Conversion . . . 90


5.3 RNS to Binary Conversion Based on New CRT-I,


New CRT-II, Mixed-Radix CRT and New CRT-III . . . . . . . . . 95
5.4 RNS to Binary Converters for Other Three Moduli Sets . . . . . . 97
5.5 RNS to Binary Converters for Four and More Moduli Sets . . . . 99
5.6 RNS to Binary Conversion Using Core Function . . . . . . . . . . . 111
5.7 RNS to Binary Conversion Using Diagonal Function . . . . . . . . 114
5.8 Performance of Reverse Converters . . . . . . . . . . . . . . . . . . . . . 117
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6 Scaling, Base Extension, Sign Detection
and Comparison in RNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.1 Scaling and Base Extension Techniques in RNS . . . . . . . . . . . . 133
6.2 Magnitude Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.3 Sign Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7 Error Detection, Correction and Fault Tolerance
in RNS-Based Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.1 Error Detection and Correction Using
Redundant Moduli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.2 Fault Tolerance Techniques Using TMR . . . . . . . . . . . . . . . . . 173
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8 Specialized Residue Number Systems . . . . . . . . . . . . . . . . . . . . . . . 177
8.1 Quadratic Residue Number Systems . . . . . . . . . . . . . . . . . . . . 177
8.2 RNS Using Moduli of the Form r^n . . . . . . . . . . . . . . . . . . . . . . 179
8.3 Polynomial Residue Number Systems . . . . . . . . . . . . . . . . . . . 184
8.4 Modulus Replication RNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.5 Logarithmic Residue Number Systems . . . . . . . . . . . . . . . . . . . 189
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9 Applications of RNS in Signal Processing . . . . . . . . . . . . . . . . . . . . 195
9.1 FIR Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
9.2 RNS-Based Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
9.3 RNS Applications in DFT, FFT, DCT, DWT . . . . . . . . . . . . . . 226
9.4 RNS Application in Communication Systems . . . . . . . . . . . . . . 242
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
10 RNS in Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
10.1 Modulo Multiplication Using Barrett’s Technique . . . . . . . . . . 265
10.2 Montgomery Modular Multiplication . . . . . . . . . . . . . . . . . . . . 267
10.3 RNS Montgomery Multiplication and Exponentiation . . . . . . . . 287
10.4 Montgomery Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
10.5 Elliptic Curve Cryptography Using RNS . . . . . . . . . . . . . . . . . 298
10.6 Pairing Processors Using RNS . . . . . . . . . . . . . . . . . . . . . . . . . 306
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
Chapter 1
Introduction

Digital computation is conventionally carried out using the binary number system. Processors with word lengths up to 64 bits have been quite common. It is well
known that basic operations such as addition can be carried out using a variety of adders, such as carry propagate adders, carry look-ahead adders, and parallel-prefix
adders with different addition times and area requirements. Several algorithms for
high-speed multiplication and division also are available and are being continu-
ously researched with the design objectives of low power/low area/high speed.
Fixed-point as well as floating-point processors are widely available. Interestingly,
operations such as sign detection, magnitude comparison, and scaling are quite easy
in these systems.
In applications such as cryptography there is a need for processors with word
lengths ranging from 160 bits to 4096 bits. In such requirements, a need is felt for
reducing the computation time by special techniques. Applications in digital signal
processing also continuously look for processors for fast execution of multiply and
accumulate instruction. Several alternative techniques have been investigated for
speeding up multiplication and division. An example is using logarithmic number
systems (LNS) for digital computation. However, using LNS, addition and sub-
traction are difficult.
In binary and decimal number systems, the position of each digit determines the
weight. The leftmost digits have higher weights. The ratio between adjacent digits
can be constant or variable. The latter is called Mixed Radix Number System [1].
For a given integer X, the MRS digit can be found as
x_i = \left\lfloor X \Big/ \prod_{j=0}^{i-1} M_j \right\rfloor \bmod M_i    (1.1a)


where 0 ≤ i < n, n being the number of digits. Note that Mj is the ratio between the weights for the jth and (j+1)th digit positions and x mod y is the remainder obtained by dividing x by y. MRNS can represent

M = \prod_{j=0}^{n-1} M_j    (1.1b)
unique values. An advantage is that it is easy to perform the inverse procedure to convert the tuple of digits to the integer value:
X = \sum_{i=0}^{n-1} x_i \prod_{j=0}^{i-1} M_j    (1.1c)
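As a quick illustration of Eqs. (1.1a) and (1.1c), the conversion to and from mixed-radix digits can be sketched in Python (the function names are ours, chosen for illustration):

```python
from math import prod

def to_mixed_radix(X, M):
    # x_i = floor(X / (M_0 * ... * M_{i-1})) mod M_i, Eq. (1.1a);
    # the empty product for i = 0 is 1
    return [(X // prod(M[:i])) % M[i] for i in range(len(M))]

def from_mixed_radix(digits, M):
    # X = sum_i x_i * (M_0 * ... * M_{i-1}), Eq. (1.1c)
    return sum(x * prod(M[:i]) for i, x in enumerate(digits))

M = [3, 5, 7]                       # radix list: 3 * 5 * 7 = 105 distinct values
assert to_mixed_radix(52, M) == [1, 2, 3]
assert from_mixed_radix([1, 2, 3], M) == 52
```

Every integer in the range given by Eq. (1.1b) round-trips through this digit representation.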

Fixed-point addition is easy since it is equivalent to integer addition. Note that


Q15 format often used in digital signal processing has one sign bit and fifteen
fractional bits. Fixed-point multiplication must use scaling to keep the product in the same format as the inputs. Fixed-point addition of fractional numbers
is more difficult than multiplication since both numbers must be in the same format
and attention must be paid to the possibility of an overflow. The overflow can be
handled by right shifting by one place and setting an exponent flag or by using
double precision to provide headroom allowing growth due to overflow [2].
The floating-point number for example is represented in IEEE 754 standard as [2]

X = (-1)^s (1.F) \times 2^{E-127}    (1.2)

where F is the fractional part of the mantissa, a binary fraction represented by bits 0–22, E is the exponent in excess-127 format, and s = 0 for positive numbers and s = 1 for negative numbers. Note the assumed 1 preceding the mantissa and the biased exponent. As an illustration, consider the floating-point number

0 10000011 11000. . .00
(sign) (exponent) (mantissa)

The fractional mantissa is 0.75 and the exponent is 131. Hence X = (1.75) \times 2^{131-127} = (1.75) \times 2^4. When floating-point numbers are added, the exponents must be made equal (known as alignment): we shift the mantissa of the smaller operand right and increment its exponent until it equals that of the larger operand. The multiplication of properly normalized floating-point numbers M_1 2^{E_1} and M_2 2^{E_2} yields the product (M_1 M_2) 2^{E_1+E_2}. The largest and smallest magnitudes that can be represented are 3.4 \times 10^{38} and 1.2 \times 10^{-38}.
In the case of double precision [3, 4], bits 0–51 are the mantissa, bits 52–62 are the exponent and bit 63 is the sign bit. The offset in this case is 1023, allowing scale factors from 2^{-1023} to 2^{+1024}. The largest and smallest magnitudes that can be represented are 1.8 \times 10^{308} and 2.2 \times 10^{-308}.
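The single-precision example above can be checked mechanically. The following sketch (the helper name `decode_float32` is ours) decodes a bit pattern by Eq. (1.2):

```python
import struct

def decode_float32(bits):
    # X = (-1)^s * (1.F) * 2^(E-127) for a normalized IEEE 754 single
    s = bits >> 31
    E = (bits >> 23) & 0xFF
    F = bits & 0x7FFFFF
    return (-1) ** s * (1 + F / 2 ** 23) * 2.0 ** (E - 127)

# sign 0, exponent 10000011 (= 131), mantissa bits 1100...0 (fraction 0.75)
bits = 0b0_10000011_11000000000000000000000
assert decode_float32(bits) == 1.75 * 2 ** 4            # = 28.0
# cross-check against the machine's own interpretation of the same 4 bytes
assert struct.unpack('>f', bits.to_bytes(4, 'big'))[0] == 28.0
```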

In floating-point representation, errors can occur both in addition and multiplication. However, overflow is very unlikely due to the very wide dynamic
range since more bits are available in the exponent. Floating-point arithmetic is
more expensive and slower.
In logarithmic number system (LNS) [5], we have

X \rightarrow (z, s, x = \log_b |X|)    (1.3a)

where b is the base of the logarithm, z when asserted indicates that X ¼ 0, s is the
sign of X. In LNS, the input binary numbers are converted into logarithmic form
with a mantissa and characteristic each of appropriate word length to achieve the
desired accuracy. As is well known, multiplication and division are quite simple in
this system needing only addition or subtraction of the given converted inputs
whereas simple operations like addition, subtraction cannot be done easily. Thus
in applications where frequent additions or subtractions are not required, these may
be of utility. The inverse mapping from LNS to linear numbers is given as

X = (1 - z)(-1)^s b^x    (1.3b)

Note that the addition operation in the conventional binary system (X + Y) is computed in LNS, noting that X = b^x and Y = b^y, as

z = x + \log_b (1 + b^{y-x})    (1.4a)

The subtraction operation (X − Y) is performed as

z = x + \log_b (1 - b^{y-x})    (1.4b)

The second term is obtained using an LUT whose size can be very large for n ≥ 20
[3, 6, 7]. The multiplication, division, exponentiation and finding nth root are very
simple. After the processing, the results need to be converted into binary number
system.
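A minimal sketch of these LNS operations with base b = 2 follows (in hardware the log2(1 ± 2^(y−x)) term of Eqs. (1.4a) and (1.4b) would come from the LUT; the function names are illustrative):

```python
from math import log2

def lns_mul(x, y):
    # multiplication of X = 2^x and Y = 2^y is a single addition of logs
    return x + y

def lns_add(x, y):
    # z = x + log2(1 + 2^(y-x)), Eq. (1.4a); order operands so y <= x
    x, y = max(x, y), min(x, y)
    return x + log2(1 + 2 ** (y - x))

def lns_sub(x, y):
    # z = x + log2(1 - 2^(y-x)), Eq. (1.4b); valid only for X > Y
    return x + log2(1 - 2 ** (y - x))

x, y = log2(8), log2(4)
assert 2 ** lns_mul(x, y) == 32.0                  # 8 * 4
assert abs(2 ** lns_add(x, y) - 12.0) < 1e-9       # 8 + 4
assert abs(2 ** lns_sub(x, y) - 4.0) < 1e-9        # 8 - 4
```

The sketch makes the asymmetry plain: multiplication is one addition, while addition and subtraction each need a table lookup.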
The logarithmic system can be seen to be a special case of the floating-point system in which the significand (mantissa) is always 1. Hence the exponent can be a mixed number rather than an integer. Numbers with the same exponent are equally spaced in floating-point, whereas in the sign-logarithm system, smaller numbers are denser [3].
LNS reduces the strength of certain arithmetic operations and the bit activity
[5, 8, 9]. The reduction of strength reduces the switching capacitance. The change
of base from 2 to a lesser value reduces the probability of a transition from low to
high. It has been found that about two times reduction in power dissipation is
possible for operations with word size 8–14 bits.
The other system that has been considered is Residue Number system [10–12]
which has received considerable attention in the past few decades. We consider this
topic in great detail in the next few chapters. We, however, present here a historical
review on this area. The origin is attributed to the third century Chinese author Sun
Tzu (also attributed to Sun Tsu in the first century AD) in the book Suan-Ching. We
reproduce the poem [11]:
We have things of which we do not know the number
If we count them by threes, the remainder is 2
If we count them by fives, the remainder is 3
If we count them by sevens, the remainder is 2
How many things are there?
The answer, 23.
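The rule Sun Tzu described, later formalized as the Chinese Remainder Theorem, can be sketched in Python (the `crt` helper name is ours; `pow(Mi, -1, m)` computes a modular inverse and needs Python 3.8+):

```python
from math import prod

def crt(residues, moduli):
    # X = (sum_i r_i * M_i * (M_i^{-1} mod m_i)) mod M, with M_i = M / m_i
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

# "count by threes, remainder 2; by fives, remainder 3; by sevens, remainder 2"
assert crt([2, 3, 2], [3, 5, 7]) == 23
```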
Sun Tzu in the first century AD, the Greek mathematician Nichomachus, and Hsin-Tai-Wei of the Ming Dynasty (1368 AD–1643 AD) were among the first to explore Residue Number Systems. Sun Tzu presented the formula for computing the
answer which came to be known later as Chinese Remainder Theorem (CRT). This
is described by Gauss in his book Disquisitiones Arithmeticae [12].
Interestingly, Aryabhata, an Indian mathematician in fifth century A.D., has
described a technique of finding the number corresponding to two given residues
corresponding to two moduli. This was named as Aryabhata Remainder Theorem
[13–16] and is known by the Sanskrit name Saagra-kuttaakaara (residual
pulveriser) which is the well-known Mixed Radix conversion for two moduli RNS.
Extension to moduli sets with common factors has been recently described [17].
In an RNS using mutually prime integers m1, m2, m3, . . ., mj as moduli, the dynamic range M is the product of the moduli, M = m1 × m2 × m3 × ⋯ × mj. The numbers between 0 and M − 1 can be uniquely represented by the residues. Alternatively, numbers between −M/2 and (M/2 − 1) when M is even, and between −(M − 1)/2 and (M − 1)/2 when M is
numbers called residues obtained as the remainders when the given number is
divided by the moduli. Thus, instead of big word length operations, we can perform
several small word length operations on these residues. The modulo addition,
modulo subtraction and modulo multiplication operations can thus be performed
quite efficiently.
As an illustration, using the moduli set {3, 5, 7}, any number between 0 and 104
can be uniquely represented by the residues. The number 52 corresponds to the
residue set (1, 2, 3) in this moduli set. The residue is the remainder obtained by the
division operation X/mi. Evidently, the residues ri are such that 0 ≤ ri ≤ (mi − 1).
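The {3, 5, 7} illustration is easy to reproduce; the sketch below (illustrative helper names) also shows the key property that modulo addition acts independently on each residue, with no carries between channels:

```python
MODULI = (3, 5, 7)                  # dynamic range M = 3 * 5 * 7 = 105

def to_rns(x):
    # the residues are the remainders of division by each modulus
    return tuple(x % m for m in MODULI)

assert to_rns(52) == (1, 2, 3)

# addition is performed componentwise, modulo each m_i
a, b = to_rns(52), to_rns(17)
total = tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))
assert total == to_rns(52 + 17)     # the residues of 69
```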
The front-end of an RNS-based processor (see Figure 1.1) is a binary to RNS
converter known as forward converter whose k output words corresponding to k
moduli mk will be processed by k parallel processors in the Residue Processor
blocks to yield k output words. The last stage in the RNS-based processor converts
these k words to a conventional binary number. This process known as reverse
conversion is very important and needs to be hardware-efficient and time-efficient,
since it may be often needed also to perform functions such as comparison, sign
detection and scaling. The various RNS processors need smaller word lengths, and hence multiplication and addition can be done faster. Of course,
these are all modulo operations. The modulo processors do not have any

[Figure 1.1 A typical RNS-based processor: the input binary word feeds k binary-to-RNS converters, one for each modulus m1, . . ., mk; their outputs are processed by k residue processors, and an RNS to binary converter produces the binary output.]

inter-dependency and hence speed can be achieved for performing operations such
as convolution, FIR filtering, and IIR filtering (not needing in-between scaling).
The division or scaling by an arbitrary number, sign detection, and comparison are
of course time-consuming in residue number systems.
Each MRS digit or RNS modulus can be represented in several ways: binary
(⌈log2 Mj⌉ wires with binary logic), index (⌈log2 Mj⌉ wires with binary logic), one-hot (Mj wires with two-valued logic) [18] and Mj-ary (one wire with multi-valued
logic). Binary representation is most compact in storage, but one-hot coding allows
faster logic and lower power consumption. In addition to electronics, optical and
quantum RNS implementations have been suggested [19, 20].
The first two books on Residue number systems appeared in 1967 [21, 22].
Several attempts have been made to build digital computers and other hardware
using Residue Number Systems. Fundamental work on topics like error correction was performed in the early seventies. However, there was renewed interest in
applying RNS to DSP applications in 1977. An IEEE press book collection of
papers [23] focused on this area in 1986 documenting key papers in this area. There
was a resurgence in 1988 regarding the use of special moduli sets. Since then, the research
interest has increased and a book appeared in 2002 [24] and another in 2007 [25].
Several topics have been addressed such as Binary to Residue conversion, Residue
to binary conversion, scaling, sign detection, modulo multiplication, overflow
detection, and basic operations such as addition. For four decades, designers have been exploring the use of RNS in various applications in communication systems and digital signal processing, with emphasis on low power, low area, and programmability. Special RNS such as Quadratic RNS and polynomial RNS
have been studied with a view to reduce computational requirements in filtering.

More recently, it is very interesting that the power of RNS has been explored to
solve problems in cryptography involving very large integers of bit lengths varying
from 160 bits to 4096 bits. Attempts have also been made to combine RNS with the logarithmic number system, known as Logarithmic RNS.
The organization of the book is as follows. In Chapter 2, the topic of modulo
addition and subtraction is considered for general moduli as well as powers-of-two-related moduli. Several advances made in designing hardware using diminished-1 arithmetic
are discussed. The topic of forward conversion is considered in Chapter 3 in
detail for general as well as special moduli. These use several interesting properties
of residues of powers of two of the moduli. New techniques for sharing hardware for
multiple moduli are also considered. In Chapter 4, modulo multiplication and
modulo squaring, with and without Booth recoding, are described for general moduli as well as moduli of the type 2^n − 1 and especially 2^n + 1. Both the
diminished-1 and normal representations are considered for design of multipliers
mod (2n + 1). Multi-modulus architectures are also considered to share the hardware
amongst various moduli. In Chapter 5, the well-investigated topic of reverse con-
version for three, four, five and more number of moduli is considered. Several
recently described techniques using Core function, quotient function, Mixed-Radix
CRT, New CRTs, and diagonal function have been considered in addition to the well-
known Mixed Radix Conversion and CRT. Area and time requirements are
highlighted to serve as benchmarks for evaluating future designs. In Chapter 6, the
important topics of scaling, base extension, magnitude comparison and sign detec-
tion are considered. The use of core function for scaling is also described.
The topic of error detection, correction and fault tolerance is discussed in Chapter 7. In Chapter 8, we consider specialized Residue Number Systems such as Quadratic Residue Number Systems (QRNS) and their variations; Polynomial Residue Number Systems and Logarithmic Residue Number Systems are also considered. In Chapter 9, we deal with applications of RNS to FIR and IIR Filter
design, communication systems, frequency synthesis, DFT and 1-D and 2-D DCT
in detail. This chapter highlights the tremendous attention paid by researchers to
numerous applications including CDMA, Frequency hopping, etc. Fault tolerance
techniques applicable for FIR filters are also described. In Chapter 10, we cover
extensively applications of RNS in cryptography perhaps for the first time in any
book. Modulo multiplication and exponentiation using various techniques, modulo
reduction techniques, multiplication of large operands, application to ECC and
pairing protocols are covered extensively. Extensive bibliography and examples
are provided in each chapter.

References

1. M.G. Arnold, The residue logarithmic number system: Theory and application, in Proceedings
of the 17th IEEE Symposium on Computer Arithmetic (ARITH), Cape Cod, 27–29 June 2005,
pp. 196–205
2. E.C. Ifeachor, B.W. Jervis, Digital Signal Processing: A Practical Approach, 2nd edn.
(Pearson Education, Harlow, 2003)

3. I. Koren, Computer Arithmetic Algorithms (Brookside Court, Amherst, 1998)


4. S.W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing (California
Technical, San Diego, 1997). Analog Devices
5. T. Stouraitis, V. Paliouras, Considering the alternatives in low power design. IEEE Circuits
Devic. 17(4), 23–29 (2001)
6. F.J. Taylor, A 20 bit logarithmic number system processor. IEEE Trans. Comput. C-37,
190–199 (1988)
7. L.K. Yu, D.M. Lewis, A 30-bit integrated logarithmic number system processor. IEEE J. Solid
State Circuits 26, 1433–1440 (1991)
8. J.R. Sacha, M.J. Irwin, The logarithmic number system for strength reduction in adaptive
filtering, in Proceedings of the International Symposium on Low-power Electronics and
Design (ISLPED98), Monterey, 10–12 Aug. 1998, pp. 256–261
9. V. Paliouras, T. Stouraitis, Low power properties of the logarithmic number system, in 15th
IEEE Symposium on Computer Arithmetic, Vail, 11–13 June 2001, pp. 229–236
10. H. Garner, The residue number system. IRE Trans. Electron. Comput. 8, 140–147 (1959)
11. F.J. Taylor, Residue arithmetic: A tutorial with examples. IEEE Computer 17, 50–62 (1984)
12. C.F. Gauss, Disquisitiones Arithmeticae (1801, English translation by Arthur A. Clarke).
(Springer, New York, 1986)
13. S. Kak, Computational aspects of the Aryabhata algorithm. Indian J. Hist. Sci. 21(1), 62–71
(1986)
14. W.E. Clark, The Aryahbatiya of Aryabhata (University of Chicago Press, Chicago, 1930)
15. K.S. Shulka, K.V. Sarma, Aryabhateeya of Aryabhata (Indian National Science Academy,
New Delhi, 1980)
16. T.R.N. Rao, C.-H. Yang, Aryabhata remainder theorem: Relevance to public-key Crypto-
algorithms. Circuits Syst. and Signal. Process. 25(1), 1–15 (2006)
17. J.H. Yang, C.C. Chang, Aryabhata remainder theorem for Moduli with common factors and its
application to information protection systems, in Proceedings of the International Conference
on Intelligent Information Hiding and Multimedia Signal Processing, Harbin, 15–17 Aug.
2008, pp. 1379–1382
18. W.A. Chren, One-hot residue coding for low delay-power product CMOS designs. IEEE
Trans. Circuits Syst. 45, 303–313 (1998)
19. Q. Ke, M.J. Feldman, Single flux quantum circuits using the residue number system. IEEE
Trans. Appl. Supercond. 5, 2988–2991 (1995)
20. C.D. Capps et al., Optical arithmetic/logic unit based on residue arithmetic and symbolic
substitution. Appl. Opt. 27, 1682–1686 (1988)
21. N. Szabo, R. Tanaka, Residue Arithmetic and Its Applications in Computer Technology
(McGraw Hill, New York, 1967)
22. R.W. Watson, C.W. Hastings, Residue Arithmetic and Reliable Computer Design (Spartan,
Washington, DC, 1967)
23. M.A. Soderstrand, G.A. Jullien, W.K. Jenkins, F. Taylor (eds.), Residue Number System
Arithmetic: Modern Applications in Digital Signal Processing (IEEE Press, New York, 1986)
24. P.V. Ananda Mohan, Residue Number Systems: Algorithms and Architectures (Kluwer, Bos-
ton, 2002)
25. A.R. Omondi, B. Premkumar, Residue Number Systems: Theory and Implementation (Imperial
College Press, London, 2007)
Chapter 2
Modulo Addition and Subtraction

In this chapter, the basic operations of modulo addition and subtraction are considered. Both the cases of general moduli and specific moduli of the form 2^n − 1 and 2^n + 1 are considered in detail. The case with moduli of the form 2^n + 1 can benefit from the use of diminished-1 arithmetic. Multi-operand modulo addition is also discussed.

2.1 Adders for General Moduli

The modulo addition of two operands A and B can be implemented using the architectures of Figure 2.1a and b [1, 2]. Essentially, first A + B is computed and then m is subtracted from the result to find whether the result is larger than m or not. (Note that TC stands for two's complement.) Then, using a 2:1 multiplexer, either (A + B) or (A + B − m) is selected. Thus, the computation time is that of one n-bit addition, one (n + 1)-bit addition, and the delay of a multiplexer. On the other hand, in the architecture of Figure 2.1b, both (A + B) and (A + B − m) are computed in parallel and one of the outputs is selected using a 2:1 multiplexer depending on the sign of (A + B − m). Note that a carry-save adder (CSA) stage is needed for computing (A + B − m), which is followed by a carry propagate adder (CPA). Thus, the area is more than that of Figure 2.1a, but the addition time is less. The area A and computation time Δ for both techniques can be found for n-bit operands assuming that a CPA is used as


Figure 2.1 Modulo adder architectures: (a) sequential (b) parallel

Figure 2.2 Modular adder due to Hiasat (adapted from [6] ©IEEE2002)
A_cascade = (2n + 1)A_FA + nA_2:1MUX + nA_INV,  Δ_cascade = (2n + 1)Δ_FA + Δ_2:1MUX + Δ_INV
A_parallel = (3n + 2)A_FA + nA_2:1MUX + nA_INV,  Δ_parallel = (n + 2)Δ_FA + Δ_2:1MUX + Δ_INV    (2.1)

where ΔFA, Δ2:1MUX, and ΔINV are the delays and AFA, A2:1MUX and AINV are the
areas of a full-adder, 2:1 Multiplexer and an inverter, respectively. On the other
hand, by using VLSI adders with a regular layout, e.g., the Brent–Kung adder [3], the area and delay requirements will be as follows:

A_cascade = 2n(log₂n + 1)A_FA + nA_2:1MUX + nA_INV,  Δ_cascade = 2(log₂n + 1)Δ_FA + Δ_INV,
A_parallel = (n + 1 + log₂n + log₂(n + 1) + 2)A_FA + nA_2:1MUX + nA_INV,  Δ_parallel = ((log₂n + 1) + 2)Δ_FA + Δ_2:1MUX + Δ_INV    (2.2)

Subtraction is similar to the addition operation, wherein (A − B) and (A − B + m) are computed sequentially or in parallel following architectures similar to Figure 2.1a and b.
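The selection logic of both operations can be captured in a short behavioural model. The following Python sketch (the function names are ours, not from [1, 2]) mirrors the parallel scheme of Figure 2.1b for addition, and the analogous selection for subtraction.

```python
def mod_add(a, b, m):
    """Parallel modulo-m addition (Figure 2.1b, behavioural sketch).

    Both A + B and A + B - m are formed concurrently; the sign of
    A + B - m drives the 2:1 multiplexer.  Assumes 0 <= a, b < m.
    """
    s0 = a + b        # first adder
    s1 = a + b - m    # CSA followed by CPA (m added in two's complement)
    return s0 if s1 < 0 else s1   # 2:1 MUX selects on the sign bit

def mod_sub(a, b, m):
    """Modulo-m subtraction: select between A - B and A - B + m."""
    d0 = a - b
    return d0 + m if d0 < 0 else d0
```

For example, mod_add(5, 6, 7) returns 4 and mod_sub(2, 5, 7) returns 4.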
Multi-operand modulo addition has been considered by several authors. Alia and Martinelli [4] have suggested the mod m addition of several operands using a CSA tree, trying to keep the partial results at the output of each CSA stage within the range (0, 2^n) by adding a proper value. The three-input addition in a CSA yields n-bit sum and carry vectors S and C. S is always in the range (0, 2^n). The computation of (2C + S) mod m uses (2C + S) = L + H + 2T_C + T_S = L + H + T + km, where k ≥ 0 is an integer. Note that L = 2(C − T_C) and H = S − T_S, where T_S = s_{n−1}2^{n−1} and T_C = c_{n−1}2^{n−1} + c_{n−2}2^{n−2}. Thus, using the bits s_{n−1}, c_{n−1} and c_{n−2}, T can be obtained using a 7:1 MUX and added to L and H. Note that L is obtained from C by a one-bit left shift and H is obtained as the (n−1)-bit LSB word of S.
All the operands can be added using a CSA tree, and the final result U_F = 2C_F + S_F is reduced using a modular reduction unit which finds U_F, U_F − m, U_F − 2m and U_F − 3m using two CLAs; based on the sign bits of the last three words, one of the answers is selected.
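Behaviourally, the final reduction step amounts to the following sketch (assuming, as the selection among four candidates implies, that U_F < 4m):

```python
def reduce_mod(u_f, m):
    """Final modular reduction after the CSA tree: form U_F, U_F - m,
    U_F - 2m and U_F - 3m and select the answer from the sign bits
    of the last three.  Assumes 0 <= u_f < 4*m."""
    for k in (3, 2, 1, 0):          # examine U_F - 3m first
        if u_f - k * m >= 0:        # non-negative => sign bit clear
            return u_f - k * m
```

For instance, reduce_mod(23, 7) returns 2.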
Elleithy and Bayoumi [5] have presented a θ(1) algorithm for multi-operand modulo addition which needs a constant time of five steps. In this technique, the two operands A and B are written in redundant form as A1, A2 and B1, B2, respectively. The first three are added in a CSA stage, which yields sum and carry vectors. These two vectors, temp1 and temp2, and B2 are added in another CSA, which yields sum and carry vectors temp3 and temp4. In the third step, a correction term (2^n − m) or 2(2^n − m) is added to the temp3 and temp4 vectors in another CSA stage, depending on whether one or both carry bits of temp1 and temp2 are 1, to give the sum and carry vectors temp5 and temp6. Depending on the carry bit, in the next step (2^n − m) is added to yield the final result in carry-save form as temp7 and temp8. There will be no overflow thereafter.
Hiasat [6] has described a modulo adder architecture based on a CSA and multiplexing of the carry generate and propagate signals before they are driven to the carry computation unit. In this design, the output carry that could result from the computation of A + B + Z, where Z = 2^n − m, is predicted. If the predicted carry is 1, the adder proceeds to compute the sum A + B + Z. Otherwise, it computes the sum A + B. Note that the calculation of the sum and carry bits for bit z_i being 1 or 0 is quite simple, as can be seen for both cases:
s_i = a_i ⊕ b_i, c_{i+1} = a_i b_i  (z_i = 0)  and  ŝ_i = a_i ⊕ b_i ⊕ 1, ĉ_{i+1} = a_i + b_i  (z_i = 1)

Thus, half-adder-like cells which give both sets of outputs are used. Note that s_i, c_{i+1}, ŝ_i, ĉ_{i+1} serve as inputs to the carry propagate and generate unit, which has outputs P_i, G_i, p_i, g_i corresponding to the two cases. Based on the computation of c_out using a CLA, a multiplexer is used to select one of these pairs to compute all the carries and the final sum. The block diagram of this adder is shown in Figure 2.2, where SAC is the sum and carry unit, CPG is the carry propagate generate unit, and CLA is the carry look-ahead unit for computing C_out. Then, using a MUX, either P, G or p, g are selected to be added using the CLA summation unit (CLAS). The CLAS unit computes all the carries and performs the summation P_i ⊕ c_i to produce the output R. This design leads to lower area and delay than the designs in Refs. [1, 5].
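The carry-prediction idea reduces, behaviourally, to the following sketch (the bit-level SAC/CPG/CLAS detail of Figure 2.2 is abstracted into integer arithmetic):

```python
def hiasat_mod_add(a, b, m, n):
    """Carry-prediction modulo adder of [6], behavioural sketch.
    Z = 2**n - m is pre-added; if A + B + Z produces an output carry,
    A + B + Z mod 2**n (i.e. A + B - m) is the result, otherwise
    A + B is.  Assumes 0 <= a, b < m <= 2**n."""
    z = (1 << n) - m
    if a + b + z >= (1 << n):       # predicted output carry is 1
        return (a + b + z) & ((1 << n) - 1)
    return a + b
```

For m = 7 and n = 3, hiasat_mod_add(5, 6, 7, 3) returns 4.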
Adders for the moduli (2^n − 1) and (2^n + 1) have received considerable attention in the literature and are considered next.

2.2 Modulo (2^n − 1) Adders

Efstathiou, Nikolos and Kalamatianos [7] have described a mod (2^n − 1) adder. In this design, the carry that results from addition assuming a zero carry input is taken into account in reformulating the equations to compute the sum. Consider a mod 7 adder with inputs A and B. With the usual definition of generate and propagate signals, it can be easily seen that for a conventional adder we have
c_0 = G_0 + P_0 c_{−1}    (2.3a)
c_1 = G_1 + P_1 c_0    (2.3b)
c_2 = G_2 + P_2 G_1 + P_2 P_1 G_0    (2.3c)

Substituting c_{−1} in (2.3a) with c_2, due to the end-around carry operation of a mod (2^n − 1) adder, we have

c_0 = G_0 + P_0 G_2 + P_0 P_2 G_1 + P_0 P_2 P_1 G_0 = G_0 + P_0 G_2 + P_0 P_2 G_1    (2.4)
c_1 = G_1 + P_1 G_0 + P_1 P_0 G_2    (2.5a)
c_2 = G_2 + P_2 G_1 + P_2 P_1 G_0    (2.5b)

An implementation of a mod 7 adder with double representation of zero (i.e., output = 7 or zero) is shown in Figure 2.3a, where s_i = P_i ⊕ c_{i−1}. A simple modification can be carried out as shown in Figure 2.3b to realize a single zero. Note that the output can be 2^n − 1 only if the two inputs are complements of each other. Hence, this condition can be used by computing P = P_0P_1P_2...P_{n−1} and modifying the equations as
Figure 2.3 (a) Mod 7 adder with double representation of zero (b) with single representation of zero (adapted from [7] ©IEEE1994)

 
s_i = P_i ⊕ (P + c_{i−1}) for 0 ≤ i ≤ n − 1.    (2.6)

The architectures of Figure 2.3, although elegant, lack regularity. Instead of using a single-level CLA, multiple levels can also be used when the operands are large.
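The end-around-carry behaviour underlying Figure 2.3a can be sketched as follows (double representation of zero; the single-zero variant additionally tests P = P_0P_1···P_{n−1}):

```python
def eac_add(a, b, n):
    """Mod (2**n - 1) addition with end-around carry (double zero):
    the carry out of the n-bit sum is fed back as the carry in,
    so both 0 and 2**n - 1 represent zero."""
    mask = (1 << n) - 1
    t = a + b
    return (t & mask) + (t >> n)   # one re-entry suffices for a, b <= mask
```

For n = 3, eac_add(5, 6, 3) returns 4, while eac_add(3, 4, 3) returns 7, the second representation of zero.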
Another approach is to consider the carry propagation in binary addition as a prefix problem. Various types of parallel-prefix adders, e.g., (a) Ladner–Fischer [8], (b) Kogge–Stone [9], (c) Brent–Kung [3] and (d) Knowles [10], are available in the literature. Among these, type (a) requires less area but has unlimited fan-out compared to type (b), while designs based on (b) are faster.
Zimmermann [11] has suggested using an additional level for adding the end-around carry for realizing a mod (2^n − 1) adder (see Figure 2.4a), which needs extra hardware; moreover, this carry has a large fan-out, making it slower. Kalampoukas et al. [12] have considered modulo (2^n − 1) adders using parallel-prefix adders. The idea of carry recirculation at each prefix level, as shown in Figure 2.4b, has been employed. Here, no extra level of adders is required, thus giving minimum logic depth. In addition, the fan-out requirement on the carry output is removed. These architectures are very fast while consuming large area.
The area and delay requirements of adders can be estimated using the unit-gate model [13]. In this model, every gate counts as one unit, except the exclusive-OR gate, which counts as two elementary gates. The model, however, ignores fan-in and fan-out; hence, validation needs to be carried out using static simulations. Under this model, the area and delay requirements of the mod (2^n − 1) adder described in [12] are 3n·log₂n + 4n and 2·log₂n + 3, respectively.
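These unit-gate estimates are easy to tabulate for a few operand widths (a sketch; n a power of two):

```python
from math import log2

def unit_gate_estimate(n):
    """Unit-gate area (gates) and delay (gate levels) of the
    parallel-prefix modulo (2**n - 1) adder of [12]."""
    return 3 * n * log2(n) + 4 * n, 2 * log2(n) + 3

# n = 8 gives area 104 gates and delay 9 gate levels
for n in (8, 16, 32, 64):
    print(n, unit_gate_estimate(n))
```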
Efstathiou et al. [14] have also considered a design using select-prefix blocks, with the difference that the adder is divided into several small-length adder blocks by proper interconnection of the propagate and generate signals of the blocks. A select-prefix architecture for a mod (2^n − 1) adder is presented in Figure 2.5. Note that d, f and g indicate the word lengths of the three sections. It can be seen that
c_in,0 = BG_2 + BP_2 BG_1 + BP_2 BP_1 BG_0
c_in,1 = c_out,0 = BG_0 + BP_0 BG_2 + BP_0 BP_2 BG_1
c_in,2 = c_out,1 = BG_1 + BP_1 BG_0 + BP_1 BP_0 BG_2

where BG_i and BP_i are the block generate and propagate output signals of each block. Tyagi [13] has given an algorithm for suitably selecting the lengths of the various adder blocks with the aim of minimizing the adder delay. Note that designs based on parallel-prefix adders are the fastest but are more complex. On the other hand, the CLA-based adder architecture is area-effective. Select-prefix architectures achieve delay close to that of parallel-prefix adders with complexity close to the best adders.
Patel et al. [15] have suggested fast parallel-prefix architectures for modulo (2^n − 1) addition with a single representation of zero. In these, the sum is computed with a carry-in of "1". Later, a conditional decrement operation is

an-1
bn-1
an-2
bn-2

a1
b1
a0
b0
prefix structure

Cin
Cout
sn-1
sn-2

s1
s0
b
b7 a7 b6 a6 b5 a5 b4 a4 b3 a3 b2 a2 b1 a1 b0 a0

C7
C*-1
C*6 C*5 C*4 C*3 C*2 C*1 C*0

S7 S6 S5 S4 S3 S2 S1 S0

Figure 2.4 Modulo (2n1) adder architectures due to (a) Zimmermann and (b) modulo (281)
adder due to Kalampoukas et al. ((a) adapted from [11] ©IEEE1999 and (b) adapted from [12]
©IEEE2000)

performed. However, by cyclically feeding back the carry generate and carry
propagate signals at each prefix level in the adder, the authors show that
significant improvement in latency is possible over existing designs.
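The carry-in-of-1 idea with a conditional decrement can be captured behaviourally (a sketch; operands in [0, 2^n − 2], zero singly represented):

```python
def mod_add_single_zero(a, b, n):
    """Mod (2**n - 1) addition with a single zero, behavioural sketch
    of the scheme of [15]: add with carry-in 1; if a carry emerges,
    dropping it cancels the provisional +1, otherwise decrement."""
    t = a + b + 1
    if t >= (1 << n):      # carry out: t mod 2**n is the result
        return t - (1 << n)
    return t - 1           # conditional decrement
```

For n = 3, mod_add_single_zero(3, 4, 3) returns 0 rather than the all-ones word.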
Figure 2.5 Modulo (2^{d+f+g} − 1) adder design using three blocks (adapted from [14] ©IEEE2003)

2.3 Modulo (2^n + 1) Adders

Diminished-1 arithmetic is important for handling moduli of the form 2^n + 1, because this modulus channel needs one bit more word length than the channels using the moduli 2^n and 2^n − 1. A solution given by Leibowitz [16] is to represent the numbers by n bits only. The diminished-1 number corresponding to a normal number A in the range 1 to 2^n is represented as d(A) = A − 1. If A = 0, a separate one-bit channel carrying a 1 is used. Another way of representing A in diminished-1 arithmetic is (A_z, A_d), where A_z = 1, A_d = 0 when A = 2^n, and A_z = 0, A_d = A − 1 otherwise. Due to this representation, some rules need to be established to perform operations in this arithmetic; these are summarized below. Following the above notation, we can derive the following properties [17]:
(a) A + B = C corresponds to

d(A + B) = (d(A) + d(B) + 1) mod (2^n + 1)    (2.7)

(b) Similarly, we have

d(A − B) = (d(A) + d̄(B) + 1) mod (2^n + 1)    (2.8)

where d̄(B) denotes the bitwise complement of the n-bit word d(B).

(c) It follows further that

d(A_1 + A_2 + ... + A_k) = (d(A_1) + d(A_2) + d(A_3) + ... + d(A_k) + k − 1) mod (2^n + 1)    (2.9)

Next,

d(2^k A) = d(A + A + ... + A) = (2^k d(A) + 2^k − 1) mod (2^n + 1),

or

2^k d(A) = (d(2^k A) − 2^k + 1) mod (2^n + 1)    (2.10)
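These properties are easy to verify numerically. The snippet below checks (2.7) and the multi-operand rule for n = 4 (operands chosen so that no intermediate result is zero, since zero has no diminished-1 encoding):

```python
n = 4
M = (1 << n) + 1                    # modulus 17

def d(x):                           # diminished-1 form of x in [1, 2**n]
    return x - 1

# (2.7): d(A + B) = (d(A) + d(B) + 1) mod (2**n + 1)
A, B = 9, 12                        # (A + B) mod 17 = 4
assert d((A + B) % M) == (d(A) + d(B) + 1) % M

# multi-operand form (2.9): k operands contribute an extra k - 1
ops = [3, 7, 11, 6]                 # sum mod 17 = 10
assert d(sum(ops) % M) == (sum(d(x) for x in ops) + len(ops) - 1) % M
print("diminished-1 properties verified")
```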
In order to simplify the notation, we denote a diminished-1 number using an asterisk, e.g., d(A) = A* = A − 1.
Several mod (2^n + 1) adders have been proposed in the literature. In the case of diminished-1 numbers, mod (2^n + 1) addition can be formulated as [11]

S − 1 = S* = (A* + B* + 1) mod (2^n + 1) = (A* + B*) mod 2^n if (A* + B*) ≥ 2^n, and (A* + B* + 1) otherwise    (2.11)

where A* and B* are diminished-1 numbers and S = A + B. The addition of 1 can be carried out by inverting the carry bit C_out and adding it in a parallel-prefix adder with C_in = C̄_out (see Figure 2.6):

(A* + B* + 1) mod (2^n + 1) = (A* + B* + C̄_out) mod 2^n    (2.12)

In the case of normal numbers as well [11], we have

S + 1 = (A + B + 1) mod (2^n + 1) = (A + B + C̄_out) mod 2^n    (2.13)

where S = A + B, with the property that (S + 1) is computed. This technique will be useful in the design of multipliers.
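A behavioural sketch of (2.11)–(2.12) for nonzero operands:

```python
def dim1_add(a_star, b_star, n):
    """Diminished-1 mod (2**n + 1) addition per (2.12): the n-bit
    operands are added and the complemented carry is added back in.
    a_star = A - 1, b_star = B - 1; zero operands are excluded, and
    an all-zero result with carry 0 flags a real zero (A + B = 2**n + 1)."""
    mask = (1 << n) - 1
    t = a_star + b_star
    cout = t >> n
    return ((t & mask) + (1 - cout)) & mask

# modulo 9 (n = 3): A = 6, B = 4 gives S = 10 = 1 mod 9, i.e. S* = 0
assert dim1_add(5, 3, 3) == 0
```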
Note that diminished-1 adders have a problem of correctly interpreting a zero output, since it may represent a valid zero (addition with a result of 1) or a real zero (addition with a result of zero) [14].

Figure 2.6 Modulo (2^n + 1) adder architecture for diminished-1 arithmetic (adapted from [18] ©IEEE2002)

Consider two examples of modulo 9 addition, (a) A = 6, B = 4 and (b) C = 5, B = 4, using the diminished-1 representation:

(a) A* = 101, B* = 011: A* + B* = 1 000, so C_out = 1 and C̄_out = 0; the sum 000 is the correct result (d(S) = 0, i.e., S = 1).
(b) C* = 100, B* = 011: C* + B* = 0 111, so C_out = 0 and C̄_out = 1; the sum 111 + 1 = 000 is a result indicating zero, i.e., a real zero (5 + 4 = 9 ≡ 0 mod 9).

Note that a real zero occurs when the inputs are complementary. Hence, this condition can be detected using the logical AND of the exclusive-ORs of a_i and b_i. The EXOR gates will already be present in the front-end CSA stage.
Vergos, Efstathiou and Nikolos have presented two mod (2^n + 1) adder architectures [18] for diminished-1 numbers. The first leads to a CLA implementation and was derived by associating the re-entering carry equation with those producing the carries of the modulo addition, similar to the mod (2^n − 1) case described earlier [12]. In this architecture, both one- and two-level CLAs have been considered. The second architecture uses parallel-prefix adders and was also derived by re-circulation of the carries in each level of the parallel-prefix structure. This architecture, shown in Figure 2.6, avoids the fan-out problem and the additional level needed in Zimmermann's technique.
Efstathiou, Vergos and Nikolos [14] extended the above ideas by using select-prefix blocks, which are faster than the previous ones, for designing mod (2^n + 1) adders for diminished-1 operands. Here, the lengths of the blocks as well as the number of blocks can be selected appropriately. The derivation is similar to that for mod (2^n − 1) adders, with the difference that the equations contain block carry propagate and block generate signals instead of bit-level propagate and generate signals. In these, an additional level is used to add the carry after the prefix computation. A structure using two stages is presented in Figure 2.7. Note that in this case
c_in,0 = (BG_1 + BP_1 BG_0)′
c_in,1 = c_out,0 = BG_0 + BP_0 BG_1′

These designs need less area than designs using parallel-prefix adders, while they are slower than CLA-based designs.
Efstathiou, Vergos and Nikolos [19] have described fast parallel-prefix modulo (2^n + 1) adders for two (n + 1)-bit numbers which use two stages. The first stage computes |X + Y + 2^n − 1|_{2^{n+1}}, which has (n + 2) bits. If the MSB of the result is zero, then (2^n + 1) is added mod 2^{n+1}, and the n LSBs yield the result. For computing M = X + Y + 2^n − 1, a CSA is used, followed by an (n + 1)-bit adder. The authors use a parallel-prefix with fast carry increment (PPFCI) architecture and also a totally
Figure 2.7 Diminished-1 modulo (2^{d+f} + 1) adder using two blocks (adapted from [14] ©IEEE2004)
parallel-prefix architecture. In the former, an additional stage for the re-entering carry is used, whereas in the latter, carry recirculation is done at every prefix level.
The architecture of Hiasat [6] can be extended to the case of modulus (2^n + 1), in which case we have Z = 2^n − 1 and the formulae used are as follows:

R = |X + Y + Z|_{2^n} if X + Y + Z ≥ 2^{n+1}, and R = |X + Y + Z|_{2^n} + 1 otherwise.

Note that, in this case, the added bit z_i is 1 in all bit positions.
Vergos and Efstathiou [20] proposed an adder that caters for both weighted and diminished-1 operands. They point out that a diminished-1 adder can be used to realize a weighted adder by adding a front-end inverted EAC CSA stage. Herein, A + B is computed, where A and B are (n + 1)-bit numbers, using a diminished-1 adder. In this design, the computation carried out is

|A + B|_{2^n + 1} = ||A_n + B_n + D + 1|_{2^n + 1} + 1|_{2^n + 1} = |Y + U + 1|_{2^n + 1}    (2.14)

where Y and U are the sum and carry vector outputs of a CSA stage computing A_n + B_n + D:

carry Y = y_{n−2} y_{n−3} ... y_0 ȳ_{n−1}
sum U = u_{n−1} u_{n−2} ... u_1 u_0

where D = 2^n − 4 + 2c̄_{n+1} + s̄_n. Note that A_n and B_n are the words formed by the n LSBs of A and B, respectively, and s_n, c_{n+1} are the sum and carry of the addition of the one-bit words a_n and b_n. It may be seen that D is the n-bit vector 1111...1 c̄_{n+1} s̄_n.
An example will be illustrative. Consider n = 4 and the addition of A = 16 and B = 11. Evidently a_n = 1, b_n = 0, A_n = 0 and B_n = 11, and D = 1110₂ = 14, yielding |16 + 11|₁₇ = ||0 + 11 + 14 + 1|₁₇ + 1|₁₇ = 10. Note that the periodic property of residues mod (2^n + 1) is used. The sum of the n-th bits is complemented and added to obtain D, and a correction term is added to take the mod (2^n + 1) operation into account.
The mod (2^n + 1) adder for the weighted representation thus needs a diminished-1 adder and an inverted end-around-carry CSA stage. The full adders of this CSA stage perform the (A_n + B_n + D) mod (2^n + 1) addition; some of the FAs have one input equal to "1" and can thus be simplified. The outputs Y and U of this stage are fed to a diminished-1 adder to obtain (Y + U + 1) mod 2^n. The architecture is presented in Figure 2.8. It can be seen that every diminished-1 adder can be used to perform weighted binary addition using an inverted EAC CSA stage in the front-end.
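The arithmetic of (2.14) is easily checked behaviourally; in this sketch the CSA stage and the diminished-1 adder are collapsed into integer operations:

```python
def weighted_mod_add(A, B, n):
    """Weighted mod (2**n + 1) addition per (2.14), behavioural sketch.
    The n-th bits a_n, b_n enter through the correction word
    D = 2**n - 4 + 2*(1 - c_{n+1}) + (1 - s_n).  Assumes 0 <= A, B <= 2**n."""
    M = (1 << n) + 1
    an, bn = A >> n, B >> n
    An, Bn = A & ((1 << n) - 1), B & ((1 << n) - 1)
    s, c = an ^ bn, an & bn               # sum/carry of the 1-bit addition
    D = (1 << n) - 4 + 2 * (1 - c) + (1 - s)
    return ((An + Bn + D + 1) % M + 1) % M

assert weighted_mod_add(16, 11, 4) == 10  # the example above
```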

an bn an b n

an-1 bn-1 an-2 bn-2 an-3 bn-3 an-4 bn-4 a3 b3 a2 b2


a1 b1 a0 b0

FA+ FA+ FA+ FA+ FA+ FA+ FA+ FA+

Diminished-1 adder
(any architecture)

Sn Sn-1 Sn-2 S2 S1 S0

Figure 2.8 Modulo (2n + 1) adder for weighted operands built using a diminished-1 adder
(adapted from [20] ©IEEE2008)
2.3 Modulo (2n + 1) Adders 21

In another technique, due to Vergos and Bakalis [21], first A* and B* are computed such that A* + B* = A + B − 1 using a translator. Then, a diminished-1 adder can sum A* and B* such that

||A + B|_{2^n + 1}|_{2^n} = |A* + B*|_{2^n} + c̄_out    (2.15)

where c_out is the carry of the n-bit adder computing A* + B*. However, Vergos and Bakalis do not present the details of obtaining A* and B* using the translator. Note that in this method, the inputs are ≤ 2^n − 1.
Lin and Sheu [22] have suggested the use of two parallel adders to find A* + B* and A* + B* + 1, so that the carry of the former adder can be used to select the correct result using a multiplexer. Note that Lin and Sheu [22] have also suggested partitioning the n-bit circular carry selection (CCS) modular adder into m blocks of r bits, similar to the select-prefix block type of design considered earlier. These need circular carry selection addition blocks and circular carry generators. Juang et al. [23] have given a corrected version of this type of mod (2^n + 1) adder, shown in Figure 2.9a and b. Note that this design uses a dual-sum carry look-ahead adder (DS-CLA). These designs are the most efficient among all the mod (2^n + 1) adders regarding area, time and power.
Juang et al. [24] have suggested considering (n + 1) bits for the inputs A and B. The weighted modulo (2^n + 1) sum of A and B can be expressed as

||A + B|_{2^n + 1}|_{2^n} = |A + B − (2^n + 1)|_{2^n} if (A + B) > 2^n
= |A + B − (2^n + 1)|_{2^n} + 1 otherwise    (2.16)

Thus, weighted modulo (2^n + 1) addition can be obtained by subtracting (2^n + 1) from the sum of A and B and using a diminished-1 adder to get the final modulo sum, by making the inverted EAC the carry-in.
Denoting Y′ and U′ as the carry and sum vectors of the summation A + B − (2^n + 1), where A and B are (n + 1)-bit words, we have

|A + B − (2^n + 1)|_{2^n} = |Σ_{i=0}^{n−2} 2^i (2y′_i + u′_i) + 2^{n−1}(2a_n + 2b_n + a_{n−1} + b_{n−1} + 1)|_{2^n}    (2.17)

where

y′_i = a_i ∨ b_i, u′_i = (a_i ⊕ b_i)′.
As an illustration, consider A = 16, B = 15 and n = 4. We have

|A + B − (2^n + 1)|_{2^n} = |16 + 15 − 17|₁₆ = 14

and for A = 6, B = 7,
Figure 2.9 (a) Block diagram of the CCS diminished-1 modulo (2^n + 1) adder and (b) logic circuit of the CCS diminished-1 modulo (2^4 + 1) adder ((a) adapted from [22] ©IEEE2008, (b) adapted from [23] ©IEEE2009)
|A + B − (2^n + 1)|_{2^n} + 1 = |6 + 7 − 17|₁₆ + 1 = 13.

The multiplier of 2^{n−1} in (2.17) can be at most 5, since 0 ≤ A, B ≤ 2^n. Since only bits n and n − 1 are available, the authors consider the (n + 1)-th bit to be merged with C_out:

||A + B|_{2^n + 1}|_{2^n} = |A + B − (2^n + 1)|_{2^n} = |Y′ + U′|_{2^n} + (c_out ∨ FIX)′    (2.18)

where u′_{n−1} = (a_{n−1} ⊕ b_{n−1})′ and y′_{n−1} = a_n ⊕ b_n ⊕ (a_{n−1} ∨ b_{n−1}) are the values of the sum bit and carry bit produced by the addition 2a_n + 2b_n + a_{n−1} + b_{n−1} + 1, and the FIX bit, FIX = a_n b_n ∨ (a_n ⊕ b_n)(a_{n−1} ∨ b_{n−1}), captures the overflow of this addition into the next bit position. The block diagram is presented in Figure 2.10a together with the translator in Figure 2.10b. Note that the FAF block generates y′_{n−1}, u′_{n−1} and the FA blocks generate y′_i, u′_i for i = 0, 1, ..., n − 2

a anbn an-1bn-1 an-2bn-2 a0 b0

correction Translator-(2n+1)=Y ʹ+Uʹ

FIX

Diminished-1 adder

Sn Sn-1 Sn-2 S0

b anbn an-1bn-1 an-2bn-2 a0 b0

FAF FA+ FA+

yʹn-1 uʹn-1 yʹn-2 uʹn-2 yʹ0 uʹ0

Figure 2.10 (a) Architecture of weighted modulo (2n + 1) adder with the correction scheme and
(b) translator A + B–(2n + 1) (adapted from [24] ©IEEE2010)
24 2 Modulo Addition and Subtraction

where y′_i = a_i ∨ b_i and u′_i = (a_i ⊕ b_i)′. Note also that FIX is wire-ORed with the carry c_out to yield the inverted EAC as the carry-in. The FIX bit is needed since a value greater than 3 cannot be accommodated in y′_{n−1} and u′_{n−1}.
The authors have used Sklansky [25] and Brent–Kung [3] parallel-prefix adders for the diminished-1 adder.

References

1. M.A. Bayoumi, G.A. Jullien, W.C. Miller, A VLSI implementation of residue adders. IEEE
Trans. Circuits Syst. 34, 284–288 (1987)
2. M. Dugdale, VLSI implementation of residue adders based on binary adders. IEEE Trans.
Circuits Syst. 39, 325–329 (1992)
3. R.P. Brent, H.T. Kung, A regular layout for parallel adders. IEEE Trans. Comput. 31, 260–264
(1982)
4. G. Alia, E. Martinelli, Designing multi-operand modular adders. Electron. Lett. 32, 22–23
(1996)
5. K.M. Elleithy, M.A. Bayoumi, A θ(1) algorithm for modulo addition. IEEE Trans. Circuits
Syst. 37, 628–631 (1990)
6. A.A. Hiasat, High-speed and reduced area modular adder structures for RNS. IEEE Trans.
Comput. 51, 84–89 (2002)
7. C. Efstathiou, D. Nikolos, J. Kalamatianos, Area-time efficient modulo 2^n − 1 adder design. IEEE Trans. Circuits Syst. 41, 463–467 (1994)
8. R.E. Ladner, M.J. Fischer, Parallel-prefix computation. JACM 27, 831–838 (1980)
9. P.M. Kogge, H.S. Stone, A parallel algorithm for efficient solution of a general class of
recurrence equations. IEEE Trans. Comput. 22, 783–791 (1973)
10. S. Knowles, A family of adders, in Proceedings of the 15th IEEE Symposium on Computer
Arithmetic, Vail, 11 June 2001–13 June 2001. pp. 277–281
11. R. Zimmermann, Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication, in Proceedings of the IEEE Symposium on Computer Arithmetic, Adelaide, 14–16 April 1999, pp. 158–167
12. L. Kalampoukas, D. Nikolos, C. Efstathiou, H.T. Vergos, J. Kalamatianos, High speed parallel prefix modulo (2^n − 1) adders. IEEE Trans. Comput. 49, 673–680 (2000)
13. A. Tyagi, A reduced area scheme for carry-select adders. IEEE Trans. Comput. 42, 1163–1170
(1993)
14. C. Efstathiou, H.T. Vergos, D. Nikolos, Modulo 2^n ± 1 adder design using select-prefix blocks. IEEE Trans. Comput. 52, 1399–1406 (2003)
15. R.A. Patel, S. Boussakta, Fast parallel-prefix architectures for modulo 2^n − 1 addition with a single representation of zero. IEEE Trans. Comput. 56, 1484–1492 (2007)
16. L.M. Leibowitz, A simplified binary arithmetic for the Fermat number transform. IEEE Trans. ASSP 24, 356–359 (1976)
17. Z. Wang, G.A. Jullien, W.C. Miller, An efficient tree architecture for modulo (2^n + 1) multiplication. J. VLSI Sig. Proc. Syst. 14(3), 241–248 (1996)
18. H.T. Vergos, C. Efstathiou, D. Nikolos, Diminished-1 modulo 2^n + 1 adder design. IEEE Trans. Comput. 51, 1389–1399 (2002)
19. C. Efstathiou, H.T. Vergos, D. Nikolos, Fast parallel prefix modulo (2^n + 1) adders. IEEE Trans. Comput. 53, 1211–1216 (2004)
20. H.T. Vergos, C. Efstathiou, A unifying approach for weighted and diminished-1 modulo (2^n + 1) addition. IEEE Trans. Circuits Syst. II Exp. Briefs 55, 1041–1045 (2008)
References 25

21. H.T. Vergos, D. Bakalis, On the use of diminished-1 adders for weighted modulo (2^n + 1) arithmetic components, in Proceedings of the 11th Euromicro Conference on Digital System Design Architectures, Methods and Tools, Parma, 3–5 Sept. 2008, pp. 752–759
22. S.H. Lin, M.H. Sheu, VLSI design of diminished-one modulo (2^n + 1) adders using circular carry selection. IEEE Trans. Circuits Syst. 55, 897–901 (2008)
23. T.B. Juang, M.Y. Tsai, C.C. Chin, Corrections to VLSI design of diminished-one modulo (2^n + 1) adders using circular carry selection. IEEE Trans. Circuits Syst. 56, 260–261 (2009)
24. T.-B. Juang, C.-C. Chiu, M.-Y. Tsai, Improved area-efficient weighted modulo 2^n + 1 adder design with simple correction schemes. IEEE Trans. Circuits Syst. II Exp. Briefs 57, 198–202 (2010)
25. J. Sklansky, Conditional sum addition logic. IEEE Trans. Comput. EC-9, 226–231 (1960)
Chapter 3
Binary to Residue Conversion

A given binary number needs to be converted to RNS. In this chapter, various techniques described in the literature for this purpose are reviewed. A straightforward method is to use a divider for each modulus to obtain the residue while ignoring the quotient. But, as is well known, division is a complicated process [1]. As such, alternative techniques for obtaining the residue easily have been investigated.

3.1 Binary to RNS Converters Using ROMs

Jenkins and Leon [2] have suggested sequentially reading the residues mod m_i corresponding to all the input bytes from a PROM and performing mod m_i addition. Stouraitis [3] has suggested reading the residues corresponding to the various bytes of the input word in parallel from a ROM and adding them using a tree of mod m_i adders. Alia and Martinelli [4] have suggested forward conversion of a given n-bit input binary word using n/2 PEs (processing elements), each storing the residues corresponding to 2^{2j} and 2^{2j+1} (i.e., the 2j-th and (2j + 1)-th bit positions) for j = 0, ..., n/2 − 1, and adding these residues mod m_i selectively, depending on whether the corresponding bit is "1". Next, the results of the n/2 PEs are added in a tree of modulo m_i adders to obtain the final residue.
Capocelli and Giancarlo [5] have suggested using t PEs, where t = ⌈n/log₂n⌉, each computing the residue of a log₂n-bit word by adding the residues corresponding to the various bits of this word, and then adding the residues obtained from the various PEs in a tree of modulo m_i adders containing h steps, where h = log₂t. Note, however, that only the residue corresponding to the LSB position of each word is stored; the residue corresponding to each subsequent bit position is computed online by doubling the previous residue and reducing it mod m_i using one subtractor and one multiplexer. Thus, the ROM requirement is reduced to t locations.
More recent designs avoid the use of ROMs and use combinational logic to a
large extent. These are discussed in the next few sections.


3.2 Binary to RNS Conversion Using Periodic Property of Residues of Powers of Two

We first consider an example of finding the residue of 892 mod 19. Expressing 892 in binary form, we have 11 0111 1100. (We can start with the 5th bit from the right, since 12 mod 19 is 12 itself.) The residues of consecutive powers of two mod 19 are 1, 2, 4, 8, 16, 13, 7, 14, 9 and 18. Thus, we can add the residues wherever the input bit corresponding to a power of 2 is "1". This yields (4 + 8 + 16 + 13 + 7 + 9 + 18) mod 19 = 18. Note that at each step, when a new residue corresponding to a new power of 2 is added, modulo 19 reduction can be done to avoid unnecessary growth of the result: (((4 + 8) mod 19 + 16) mod 19 + 13) mod 19, etc.
Note, however, that certain simplifications can be made by exploiting the periodic property of the residues of 2^k mod m [6–9]. Denoting by T the period of the modulus m, i.e., 2^T ≡ 1 mod m, we have 2^{αT+i} ≡ 2^i mod m. All the α words (α can be even or odd), each of T bits, in the given n-bit binary number, where α = n/T, can first be added in a carry-save adder (CSA) with EAC (end-around carry) to obtain a T-bit word, for which the residue mod m can then be obtained using the procedure described above. Note that T is denoted the "order" and can be m − 1 or less. As an illustration, for m = 89, T = 11 and for m = 19, T = 18. Consider finding the residue of 0001 0100 1110 1101 1011 0001 0100 1110 1101 1011 mod 19 = 89887166171 mod 19. The three 18-bit words (here T = 18) can be added with EAC to obtain

                 0001
   01 0011 1011 0110 1100
 + 01 0100 1110 1101 1011
 ————————————————————————
   10 1000 1010 0100 1000

This sum corresponds to 166472. The residue of this 18-bit number can next be obtained using the procedure presented earlier, by adding the residues of the various powers of 2 mod 19. In short, the periodic property of 2^k mod m has been used to simplify the computation.
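The two-stage procedure (fold the input into T-bit words with end-around addition, then reduce the short word) can be sketched as:

```python
def residue_via_period(x, m, T):
    """Binary-to-RNS conversion using the periodic property:
    valid because 2**T = 1 mod m, so T-bit fields may be added
    with end-around carry before the final mod-m reduction."""
    mask = (1 << T) - 1
    s = 0
    while x:                     # CSA tree over the T-bit fields
        s += x & mask
        x >>= T
    while s > mask:              # re-enter the end-around carries
        s = (s & mask) + (s >> T)
    return s % m                 # second-stage reduction of the T-bit word

assert residue_via_period(89887166171, 19, 18) == 13   # the example above
```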
Another simplification is possible for moduli satisfying the property 2^{(m−1)/2} mod m = −1. Considering the modulus 19 again, we observe that 2^9 ≡ −1 mod 19, 2^10 ≡ −2 mod 19, ..., 2^17 ≡ −9 mod 19, 2^18 ≡ 1 mod 19 and 2^19 ≡ 2 mod 19, etc. Thus, the residues in the upper half of a period are opposite in sign to those in the lower half of the period. This property can be used to reduce the CSA word length to (m − 1)/2. Denoting the successive half-period-length words W_0, W_1, W_2, ..., W_α, where α is odd (considered for illustration), we need to estimate

(Σ_{i=0}^{(α−1)/2} W_{2i} − Σ_{i=0}^{(α−1)/2} W_{2i+1}) mod m.

Considering the same example as above, we first divide the given word into 9-bit fields starting from the LSB as follows:
W4 ¼ 0001
W3 ¼ 0 1001 1101
W2 ¼ 1 0110 1100
W1 ¼ 0 1010 0111
W0 ¼ 0 1101 1011

Thus, adding the alternate fields together in separate CSAs, i.e., adding W_0, W_2 and W_4, we get S_e = 10 0100 1000, and adding W_1 and W_3, we have S_o = 1 0100 0100. Subtracting S_o from S_e, we have S = 0001 0000 0100. (Here, subtraction is two's-complement addition of S_o with S_e.) Note that the word lengths of S_o and S_e can be more than T/2 bits, depending on the number of T/2-bit fields in the given binary number. (Note also that S_o and S_e can be retained in carry-save form.) The residue of the resulting word can be found easily in another stage using the periodic property and a final mod m reduction as described earlier; it is 13 for our example.
It is observed [6–9] that the moduli shall be chosen such that the period or half-period is small compared to the dynamic range in bits of the complete RNS, in order to take advantage of the periodic property of the residues.
Interestingly, for the special moduli of the form 2^k − 1 and 2^k + 1, the second stage of binary to RNS conversion of a smaller-length word of either T or T/2 bits (see Figure 3.1a and b) can be avoided altogether [6]. For moduli of the form 2^k − 1, the input word can be divided into k-bit fields, all of which can be added in a CSA with EAC to yield the final residue. On the other hand, for moduli of the form 2^k + 1, all even k-bit fields can be added and all odd k-bit fields can be added to obtain S_e and

[Figure 3.1 Forward converters mod (2^k - 1) (a) and mod (2^k + 1) (b). In (a), the k-bit fields Wα, Wα-1, . . ., W2, W1, W0 feed a CSA with EAC followed by a CPA with EAC to give W mod (2^k - 1). In (b), the even fields Wα-1, Wα-3, . . ., W2, W0 and the odd fields Wα-2, Wα-4, . . ., W3, W1 feed separate CSAs producing (C2, S2) and (C1, S1); a (C2 + S2) - (C1 + S1) calculation and a final modulo (2^k + 1) reduction give W mod (2^k + 1).]

So, respectively, and one final adder gives (Se - So) mod (2^k + 1). As an illustration,
892 mod 15 = (0011 0111 1100)₂ mod 15 = (3 + 7 + 12) mod 15 = 7 and 892 mod
17 = (3 - 7 + 12) mod 17 = 8.
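Both special cases can be sketched as follows (a behavioural model of the field summations, assuming k-bit fields; function names are ours):

```python
def mod_2k_minus_1(x, k):
    """x mod (2**k - 1): add k-bit fields with end-around carry."""
    m = (1 << k) - 1
    while x > m:
        x = (x & m) + (x >> k)
    return 0 if x == m else x

def mod_2k_plus_1(x, k):
    """x mod (2**k + 1): even fields minus odd fields, since 2**k ≡ -1."""
    mask = (1 << k) - 1
    s, sign = 0, 1
    while x:
        s, x, sign = s + sign * (x & mask), x >> k, -sign
    return s % ((1 << k) + 1)
```

For the example in the text, `mod_2k_minus_1(892, 4)` gives 7 and `mod_2k_plus_1(892, 4)` gives 8.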
Pettenghi, Chaves and Sousa [10] have suggested, for moduli of the form 2^n ± k,
rewriting the weights (residues of 2^j mod m_i) so as to make the final adder in the
binary to RNS conversion narrower than that needed in designs using the period or
half-period. In this technique, the negative weights are considered as positive weights,
the bits corresponding to negative weights are complemented, and a correction factor is
added. The residues u_j are allowed to be such that -(2^n ± k) + 3 ≤ u_j ≤ (2^n ± k) - 3.
As an illustration, for modulus 37, the residues corresponding to a 20-bit
dynamic range are shown below for the full period for the original and modified cases.
Since the period is 18, the last two residues are rewritten as -1 and -2. Thus, the
total worst case weighted sum (corresponding to all input bits being 1) is 328 as
against 402 in the first case. In order to avoid negative weights, we can consider the
last two weights as 1 and 2, but complement the corresponding input bits and add a
correction term 34. As an illustration, for 20-bit input words in which the two
negatively weighted bits (b18, b19) take the values (0, 0), (0, 1), (1, 0) and (1, 1),
after complementing these 2 bits we have 11, 10, 01 and 00, and adding the
corresponding "positive" residues together with the correction term COR = 34,
we obtain 0, 35, 36 and 34, which may be verified to be correct.

Design      0  1  2  3  4    5    6   7    8   9   10   11   12
Original    1  2  4  8  16   32   27  17   34  31  25   13   26
Modified 1  1  2  4  8  16   32   27  17   34  31  25   13   26
Modified 2  1  2  4  8  16   -5  -10  17   -3  -6  -12  13  -11

Design      13   14   15   16  17  18  19
Original    15   30   23   9   18  36  35
Modified 1  15   30   23   9   18  -1  -2
Modified 2  15   -7  -14   9   18  -1  -2

In the alternative technique, the residues are such that
-(2^n ± k + 1)/2 ≤ u_j ≤ (2^n ± k - 1)/2, as shown in the fourth row. Thus, the worst case
sum (considering all negative weights as positive) is only 177 including the
correction term COR = 3. As an illustration, for the 20-bit input 0000 0010 0101
1111 0010, inverting the bits with negative weights, we have 0000 0100 1011 0100
0001 = 13 + 3 = 16 as expected.

3.3 Forward Conversion Using Modular Exponentiation

Premkumar [11] and Premkumar, Ang and Lai [12] have described a technique for forward
conversion without using ROMs. They denote this technique as "modular
exponentiation". Basically, in this technique, the various residues of powers of
2 (i.e. 2^x mod m_i) are obtained using logic functions. This will be illustrated first
using an example. Consider finding 2^(s3 s2 s1 s0) mod 13, where the exponent is a
4-bit binary word. We can write this expression as

2^(s3 s2 s1 s0) mod 13 = 2^(8 s3 + 4 s2 + 2 s1 + s0) mod 13 = 256^s3 · 16^s2 · 4^s1 · 2^s0 mod 13
= (255 s3 + 1)(15 s2 + 1) · 4^s1 · 2^s0 mod 13 = (3 s3 s2 + 8 s3 + 2 s2 + 1) · 4^s1 · 2^s0 mod 13

Next, for various values of s1, s0, the bracketed term can be evaluated. As an illustration,
for s1 = 0, s0 = 0, 2^(s3 s2 s1 s0) mod 13 = (3 s3 s2 + 8 s3 + 2 s2 + 1) mod 13. Next, for
the four options for bits s3 and s2, viz., 11, 10, 01, 00, the value of 2^(s3 s2 s1 s0) mod 13
can be estimated as 1, 9, 3, 1, respectively. Thus, the logic function g0 representing
2^(s3 s2 s1 s0) mod 13 for s1 = 0, s0 = 0 can be obtained by looking at the bit values as

g0 = 8 s3 s2′ + 2 s3′ s2 + 1

where ′ denotes complement.

In a similar manner, the other functions corresponding to s1 s0 i.e. 01, 10, 11 can be
obtained as (⊕ denoting exclusive-OR and + inside parentheses denoting OR)

g1 = 4 (s3 ⊕ s2) + 2 (s3′ + s2) + s3 s2′   (3.1a)

g2 = 8 (s3 ⊕ s2) + 4 (s3′ + s2) + 2 s3 s2′   (3.1b)

g3 = 8 (s3′ + s2) + 4 s3 s2′ + 2 (s3 ⊕ s2) + (s3 ⊕ s2)   (3.1c)

Note that the logic gates that are used to generate and combine the minterms in the
g_i functions can be shared among the moduli. As an illustration, 2^11 mod 13 can be
obtained from g3 (since s1 = s0 = 1) by substituting s3 = 1, s2 = 0, giving 7. The
architecture consists of feeding the input power "i" for which 2^i mod 13 is needed. The
two LSBs of i, viz., x1, x0, are used to select, using four 4:1 multiplexers, the output
nibble of the residue corresponding to the function g_j determined by the s3 and s2
bit values. Thus, for each power of 2, the residue is selected using the set of
multiplexers, and all these residues need to be added mod 13 in a tree of modulo
adders to get the final residue. A fully parallel architecture or serial-parallel architectures
can be used to obtain area/time trade-offs. Premkumar, Ang and Lai [12] later
extended this technique to reduce hardware by taking advantage of the
periodic properties of moduli, so that a first stage yields a word of length
equal to the period of the modulus, after which the modular exponentiation-based
technique can be used.

3.4 Forward Conversion for Multiple Moduli Using Shared Hardware

Forward converters for the moduli set {2^n - 1, 2^n, 2^n + 1} have been considered by several
authors. A common architecture for finding residues mod (2^n - 1) and (2^n + 1) was
first advanced by Bi and Jones [13]. Given a 3n-bit binary word W = A·2^(2n) + B·2^n + C,
where A, B and C are n-bit words, we have already noted that W mod (2^n - 1) =
(A + B + C) mod (2^n - 1) and W mod (2^n + 1) = (A - B + C) mod (2^n + 1). Bi and
Jones suggest finding S = A + C first and then computing (S + B) mod (2^n - 1) or
(S - B) mod (2^n + 1) in a second stage. A third stage performs the modulo m1 or m3
reduction using the carry or borrow from the second stage. Thus, three n-bit adders
will be required for each of the residue generators for moduli (2^n - 1) and (2^n + 1).
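The Bi and Jones decomposition can be illustrated as follows (a sketch for n = 5; the three-adder hardware structure is collapsed into plain Python arithmetic):

```python
def bi_jones_residues(w, n):
    """Residues of a 3n-bit word W = A*2^(2n) + B*2^n + C for the
    moduli {2^n - 1, 2^n, 2^n + 1}: (A+B+C), C, and (A-B+C)."""
    mask = (1 << n) - 1
    c, b, a = w & mask, (w >> n) & mask, w >> (2 * n)
    return ((a + b + c) % ((1 << n) - 1),
            c,                              # W mod 2^n is the low field
            (a - b + c) % ((1 << n) + 1))

w = 29876                                   # a 15-bit example word, n = 5
assert bi_jones_residues(w, 5) == (w % 31, w % 32, w % 33)
```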
Pourbigharaz and Yassine [14] have suggested a shared architecture for computing
both the residues mod (2^n - 1) and mod (2^n + 1) for a large dynamic range
RNS. They sum the k even n-bit fields and the k odd n-bit fields separately using a
multi-operand CSA to obtain sum and carry vectors Se, So, Ce and Co of (n + β)
bits, where β = log2 k. Next, So and Co can be added to or subtracted from Se + Ce in a
two-level CSA. The resulting (n + β + 1)-bit carry and sum words can then be partitioned into
LSB n-bit and MSB (β + 1)-bit words. Both can be added to obtain mod (2^n - 1), or the
MSB word can be subtracted from the LSB word to obtain mod (2^n + 1), using another
two-level CSA in parallel with a two-level carry save subtractor. A final CLA/CPA
computes the final result. This method has been applied to the moduli set {2^n - 1,
2^n, 2^n + 1}. The delay is O(2n).
Pourbigharaz and Yassine [15] have suggested another three-level architecture
comprising a CSA stage, a CPA stage and a multiplexer to eliminate the modulo
operation. Since P = (A + B + C) in the case of the moduli set {2^n - 1, 2^n, 2^n + 1}
needs (n + 2) bits, denoting the two MSBs as p_{n+1}, p_n, using these 2 bits, P, P + 1 or
P + 2 computed using three CPAs can be selected using a 3:1 multiplexer for
obtaining X mod (2^n - 1). For evaluating X mod (2^n + 1), a three-operand carry
save subtractor is used to find P′ = (A - B + C), and using the two MSBs p_{n+1} and p_n,
P′, P′ - 1 or P′ + 1 is selected using a 3:1 multiplexer. Thus, the delay is reduced to
that of one n-bit adder (of O(n)).
Sheu et al. [16] have slightly simplified the design of Pourbigharaz and Yassine [15].
In this design, A + C is computed using a carry save half adder (CSHA);
one CPA (CPA1) is used to add B to A + C, and one CSHA and one CPA (CPA2)
are used to subtract B from A + C. Using the two MSBs of the results of CPA1 and
CPA2, two correction factors are applied to add 0, 1 or 2 in the case of mod (2^n - 1)
and 0, 1 or -1 in the case of mod (2^n + 1). The correction logic can be designed
using XOR/OR and AND/NOT gates. The total hardware requirement is three n-bit
CSHAs, one n-bit CSA, one (n + 1)-bit CSA, (2n + 2) XOR, (3n + 1) AND, (n + 3)
OR and (2n + 3) NOT gates. The delay is, however, comparable to that of the Pourbigharaz
and Yassine design [15].
The concept of shared hardware has been extended by Piestrak [17] for several
moduli and by Skavantzos and Abdallah [18] for conjugate moduli (moduli pairs of the
form 2^a - 1, 2^a + 1). Piestrak has suggested that moduli having a common factor
among their periods or half-periods can take advantage of sharing. As an illustration,
consider the two moduli 7 and 73. The periods are three and nine, respectively.
Consider forward conversion of a 32-bit input word. A first stage can take 9-bit
fields of the given 32-bit word and sum them using a CSA tree with end-around
carry to get a 9-bit word. Then the converter for modulus 7 can add the three 3-bit
fields using EAC and obtain the residue mod 7, whereas the converter for modulus
73 can find the residue of the 9-bit word mod 73. Hardware is saved
compared to using two separate converters for modulus 7 and modulus 73.
The technique can be extended to the case where the period of one modulus equals the
half-period of another. As an example, for the moduli 5 and 17, P(5) = HP(17) = 4, where
HP stands for half-period and P stands for period. Evidently, the first stage takes
8-bit fields of the input 32-bit word, since P(17) = 8, and using a CSA gets 8-bit sum
and carry vectors. These are considered as two 4-bit fields and are fed next to mod
5 and mod 17 residue generators.
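This shared two-stage structure can be sketched as follows (our behavioural model: the CSA first stage is collapsed into an ordinary sum of the 8-bit fields, and the 4-bit halves exploit 2^4 ≡ 1 (mod 5) and 2^4 ≡ -1 (mod 17)):

```python
def residues_5_17(x):
    """Shared forward conversion of a 32-bit x for moduli 5 and 17:
    first stage reduces mod 255 (period of 17), second stage splits
    the result into 4-bit halves."""
    r255 = sum((x >> (8 * i)) & 0xFF for i in range(4)) % 255
    lo, hi = r255 & 0xF, r255 >> 4
    return (lo + hi) % 5, (lo - hi) % 17   # 16 ≡ 1 mod 5, 16 ≡ -1 mod 17

x = 0xDEADBEEF
assert residues_5_17(x) == (x % 5, x % 17)
```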
It is possible to combine generators for moduli with different half-periods whose
LCM is one of these half-periods. Consider the moduli 3, 5 and 17, whose half-periods
are 1, 2 and 4, respectively. Considering a 32-bit input binary word, a first
stage computes the mod 255 value from four 8-bit fields by adding them in a CSA,
giving 8-bit sum and carry vectors. Next, these vectors are fed to a mod 17 residue
generator and a mod 15 residue generator. The mod 15 residue generator in turn feeds
mod 3 and mod 5 residue generators. Several full-adders can be saved by this
technique. For example, for moduli 5, 7, 9 and 13, for forward conversion of a
32-bit input binary word, using four separate residue generators we need 114 full-adders,
whereas with shared hardware we need only 66 full-adders. The architecture
is presented in Figure 3.2 for illustration.

[Figure 3.2 32-bit input residue generator for moduli 3, 5 and 17 (adapted from [17] © IEEE 2011). The four 8-bit fields B3, B2, B1, B0 of X feed a 4-operand CSA tree mod 255 producing 8-bit vectors C and S; their 4-bit halves CH, CL, SH, SL feed a residue generator mod 17 and a 4-operand CSA tree mod 15 (outputs C1, S1), which in turn feeds residue generators mod 5 and mod 3, yielding |X|_17, |X|_5 and |X|_3.]

In the Skavantzos and Abdallah [18] technique proposed for residue number systems
using several pairs of conjugate moduli (2^a + 1) and (2^a - 1), the first stage is a
mod (2^{2a} - 1) generator taking 2a-bit fields and summing them using a CSA to get 2a-bit
sum S and carry C vectors. The second stage uses two residue generators for finding the
mod (2^a - 1) and mod (2^a + 1) residues from the four a-bit vectors SH, SL, CH and CL, where H and
L stand for the higher and lower a-bit fields. Considering an RNS with dynamic
range X of a 2Ka-bit number, in a conventional single-level design, in the case
of modulus (2^a - 1), we need a 2K-operand mod (2^a - 1) CSA tree followed by a
mod (2^a - 1) CPA. Thus, (2K - 2) CSAs each containing a full-adders will be needed. In the
case of mod (2^a + 1), we need in addition a 2K-operand mod (2^a + 1) CSA tree and a
mod (2^a + 1) CPA. This CSA tree has (2K - 2) CSAs each containing (a + 1)
full-adders. The total cost for a conjugate moduli pair is thus (4Ka - 4a + 2K - 2)
full-adders. On the other hand, in the two-level design, we need only (2Ka + 2) full-adders
for the CSA cost, whereas the CPA cost is the same as that in the
one-level approach.

3.5 Low and Chang Forward Conversion Technique for Arbitrary Moduli

Low and Chang [19] have suggested binary to RNS converters for large input word
lengths. This technique uses the idea that the residues mod m of the various powers of
2 from 2^0 to 2^63 can assume values only between 0 and (m - 1). Thus, the number
of "1" bits in the residues corresponding to the 64 bits to be added is small. It
can be reduced further by rewriting the residues which have large Hamming weight
as the sum of a correction term and a word with smaller Hamming weight. This
results in reducing the number of terms (bits) being added. As an illustration, for
modulus 29, the values of 2^x mod 29 for x from 0 to 27 are as follows:

1, 2, 4, 8, 16, 3, 6, 12, 24, 19, 9, 18, 7, 14, 28, 27, 25, 21, 13, 26, 23, 17, 5, 10, 20, 11, 22, 15.

For a 64-bit input word, these repeat once again from 2^28 till 2^55 and once
again from 2^56 till 2^63. Many of the bits are zero in these residues. Consider the residue
27 (i.e. 2^15 mod 29) with Hamming weight 4. The term x15·2^15 mod 29 can be rewritten as
(27 + 2·x15′) mod 29 (where ′ denotes complement), so that when x15 is zero its value is (27 + 2) mod 29 = 0,
and when x15 is one it is 27. Since 2 has
Hamming weight much less than 27, the number of bits to be added is reduced.
This property applies to the residues 19, 28, 27, 25, 21, 13, 26, 23, 11 and 15. Thus,
corresponding to a 64-bit input word, in the conventional design, 64 5-bit words
would have to be added in the general case. Many bits are zero in these words.
Deleting all the bits which are zero, we would have needed to add 27, 28, 29, 31
and 30 bits in the various columns. It can be verified that without Hamming weight
optimization and with Hamming weight optimization, the numbers of bits to be
added in each column (corresponding to 2^i for i = 4, 3, 2, 1, 0) are as follows:
3.6 Forward Converters for Moduli of the Type (2n  k) 35

Without optimization: 27, 28, 29, 31, 30
With optimization: 17, 21, 25, 32, 18.
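The Hamming-weight screening for modulus 29 can be checked quickly; the following sketch (variable names are ours) selects exactly the residues worth rewriting and verifies the rewriting identity:

```python
m = 29
period = [pow(2, i, m) for i in range(28)]
# Residues whose complement m - r has strictly fewer "1" bits:
heavy = sorted(r for r in period if bin(r).count("1") > bin(m - r).count("1"))

# x*r ≡ r + (m - r)*(1 - x) (mod m): a constant correction plus the
# lighter value m - r weighted by the complemented bit, e.g. 27 -> 2.
for r in heavy:
    for x in (0, 1):
        assert (x * r) % m == (r + (m - r) * (1 - x)) % m
```

The strict comparison reproduces the ten residues listed in the text.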

Thus, in the case of modulus 29, the full-adders (FA) and half-adders
(HA) needed to add all these bits in each column, together with the carries coming from the
column on the right, can be shown to be 111 FA + 11 HA before and 87 FA + 11 HA after Hamming
weight reduction. The end result needs a CPA whose word
length is more than that of the modulus. Low and Chang suggest that the bits
above the (r - 1)-th bit position can also be simplified in a similar manner using
additional hardware, without disturbing the LSB portion already obtained as one
r-bit word. An LUT can be used to perform the simplification of the MSBs to
finally obtain one r-bit word.
The two r-bit operands A and B (LSB and MSB) next need to be added mod m.
The authors adopt the technique of Hiasat [20] after modification to handle the
possibility that A + B can be greater than 2m. The modulo addition of A and B in this
case can be realized as

|X|_m = |A + B + 2Z|_{2^r} if A + B + 2Z ≥ 2^{r+1}
|X|_m = |A + B + Z|_{2^r} if 2^r ≤ A + B + Z < 2^{r+1}
|X|_m = A + B otherwise

where Z = 2^r - m, since A ≤ (2^r - 1) and B ≤ (m - 1). Two CLAs will be needed for
estimating the carries Cout and C*out corresponding to the computations A + B + Z and A + B
+ 2Z, where Z = 2^r - m_i. Using a 3:1 multiplexer, the generate and propagate vectors can
be selected for being added in the CLA and summation unit.
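A behavioural model of this final modulo addition, testing the three conditions in the order stated (assuming 2^(r-1) < m < 2^r, A ≤ 2^r - 1 and B ≤ m - 1):

```python
def mod_add_final(a, b, r, m):
    """Add r-bit A and B mod m (2**(r-1) < m < 2**r) using only
    additions of Z = 2**r - m; the cases are tried in the stated order."""
    z = (1 << r) - m
    if a + b + 2 * z >= (1 << (r + 1)):
        return (a + b + 2 * z) % (1 << r)
    if (1 << r) <= a + b + z < (1 << (r + 1)):
        return (a + b + z) % (1 << r)
    return a + b

# r = 5, m = 29, Z = 3; A may be as large as 31, so A + B can exceed 2m
assert mod_add_final(31, 28, 5, 29) == (31 + 28) % 29
```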

3.6 Forward Converters for Moduli of the Type (2^n ± k)

Matutino, Pettenghi, Chaves and Sousa [21, 22] have described binary to RNS
conversion for moduli of the type 2^n ± k for the four-moduli set {2^n - 1, 2^n + 1, 2^n - 3,
2^n + 3} with a dynamic range of 4n bits. The given 4n-bit binary word can be considered
as four n-bit fields W3, W2, W1 and W0, yielding in the case of modulus 2^n - k

|W|_{2^n - k} = |W3 k^3 + W2 k^2 + W1 k + W0|_{2^n - k}   (3.2)

since 2^n mod (2^n - k) = k.

Hence, using three multipliers (n × 3p, n × 2p and n × p) to multiply W3, W2 and
W1, respectively, with k^3, k^2 and k, the reduction can be carried out in stages. The
summation in (3.2) yields a (2n + 2)-bit word which again is considered as three
n-bit fields and reduced further to an (n + p + 1)-bit word using two multipliers
(2 × 2p and n × p) followed by an adder, where p = ⌈log2 k⌉. Another two stages reduce
the word length from (n + p + 1) bits to (n + 2) bits and from (n + 2) bits to n bits using a
(p + 1) × p multiplier, a p × p multiplier and a modulo adder.
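The staged reduction for modulus (2^n - k) can be sketched as repeated folding with weight k (in hardware each stage is bounded by the multiplier widths above; here a loop simply iterates until the word fits in n bits):

```python
def mod_2n_minus_k(w, n, k):
    """Fold a wide word mod (2**n - k) using 2**n ≡ k: each pass
    replaces the bits above position n-1 by k times their value."""
    mask = (1 << n) - 1
    while w >> n:
        w = (w & mask) + k * (w >> n)
    return w % ((1 << n) - k)          # final modulo-adder stage

assert mod_2n_minus_k(0xCAFEBABE, 8, 3) == 0xCAFEBABE % 253
```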

In the case of the modulus (2^n + k), the computation carried out is

|W|_{2^n + k} = |-W3 k^3 + W2 k^2 - W1 k + W0|_{2^n + k}   (3.3a)

since 2^n mod (2^n + k) = -k and 2^{3n} mod (2^n + k) = -k^3. Note that (3.3a) can be
rewritten as

|W|_{2^n + k} = |W3′ k^3 + W2 k^2 + W1′ k + W0 + c|_{2^n + k}   (3.3b)

where W3′ and W1′ denote the one's complements of W3 and W1 and

c = |k^3 (k + 1) + 3k (k + 1)| mod (2^n + k).

Note that due to the intermediate reduction steps for reducing the (2n + 2)-bit
word to (n + p + 1) bits and next (n + p + 1) bits to n bits, the correction factor is
c = k^3 (k + 1) + 3k (k + 1). The converter for modulus (2^n + k) needs three stages,
whereas that for modulus (2^n - k) needs four stages. Matutino et al. [22] also suggest
realizing the multipliers by adding only shifted versions of the inputs instead of using
hardware multipliers.

3.7 Scaled Residue Computation

There is often a requirement, in cryptography as well as in RNS, to obtain a scaled
residue, i.e. |x/2^α|_m [23]. This can be achieved by successive division by 2 mod m. As
an illustration, |13/2^3|_19 can be obtained by first computing |13/2|_19 =
|(13 + 19)/2|_19 = 16. Next, |13/2^2|_19 = |16/2|_19 = 8, and |13/2^3|_19 = |8/2|_19 = 4. The procedure
for scaling x by 2 involves adding the modulus m in case the LSB of x is 1 and
dividing by two (ignoring the LSB, i.e. performing a one-bit right shift). In case the LSB of
x is zero, just dividing by two (ignoring the LSB or right shift) will suffice.

Montgomery's algorithm [24] permits evaluation of |x/2^α|_m by considering α bits
at a time (also called higher-radix implementation). Here, we wish to find the
multiple of m that needs to be added to x to make it exactly divisible by 2^α. First,
we need to compute β = |-1/m|_{2^α}. Next, knowing the word Z corresponding to the α
LSBs of x, we need to compute Y = x + |Zβ|_{2^α}·m, which will be exactly divisible
by 2^α. The division is by right shifting by α bits.

Consider x = (101001101)₂ = 333; we wish to find |x/16|_23. We find
β = |-1/23|_16 = 9. We have Z = 13 (the 4-bit LSBs of x). Thus, we need to compute

Y = 333 + |13 × 9|_16 × 23 = 333 + 5 × 23 = 448, which is exactly divisible by 16 to
yield 28. Taking one bit at a time as explained before, we would have needed four
steps: (333 + 23)/2 = 178, 178/2 = 89, (89 + 23)/2 = 56 and 56/2 = 28. The procedure
can be extended to find scaling by an arbitrary power of 2 mod m_i. Montgomery's
technique can be extended to multiplication, with the difference that every time a
partial product is added, the LSBs of the result shall be used in the computation.
The reader is referred to Koc [25] for fast software implementations which
use a high radix of 8 or 16 bits. More on this subject will be considered in
Chapter 8.
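The α-bits-at-a-time scaling can be sketched as follows (our illustration; m must be odd, and the modular inverse via three-argument pow requires Python 3.8+):

```python
def scale_by_2_alpha(x, alpha, m):
    """Montgomery-style |x / 2**alpha| for odd m: add the multiple of m
    that makes x divisible by 2**alpha, then right shift alpha bits."""
    r = 1 << alpha
    beta = (-pow(m, -1, r)) % r        # beta = |-1/m| mod 2**alpha
    z = x & (r - 1)                    # the alpha LSBs of x
    y = x + ((z * beta) % r) * m       # now y ≡ 0 (mod 2**alpha)
    return y >> alpha                  # a final mod-m reduction may follow

# The worked example: x = 333, m = 23, alpha = 4 gives 448/16 = 28
assert scale_by_2_alpha(333, 4, 23) == 28
```

Note that 28 ≡ 5 (mod 23), and 5 × 16 ≡ 333 (mod 23), as required of the scaled residue.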

References

1. K. Hwang, Computer arithmetic: Principles, architecture and design (Wiley, New York,
1979)
2. W.K. Jenkins, B.J. Leon, The use of residue number systems in the design of finite impulse
response digital filters. IEEE Trans. Circuits Syst. CAS-24, 191–201 (1977)
3. T. Stouraitis, Analogue and binary to residue conversion schemes. IEE Proc. Circuits, Devices
and Systems. 141, 135–139 (1994)
4. G. Alia, E. Martinelli, A VLSI algorithm for direct and reverse conversion from weighted
binary system to residue number system. IEEE Trans. Circuits Syst. 31, 1033–1039 (1984)
5. R.M. Capocelli, R. Giancarlo, Efficient VLSI networks for converting an integer from binary
system to residue number system and vice versa. IEEE Trans. Circuits Syst. 35, 1425–1430
(1988)
6. S.J. Piestrak, Design of residue generators and multi-operand modulo adders using carry save
adders, in Proceedings of the. 10th Symposium on Computer Arithmetic, Grenoble, 26–28 June
1991. pp. 100–107
7. S.J. Piestrak, Design of residue generators and multi-operand modulo adders using carry save
adders. IEEE Trans. Comput. 43, 68–77 (1994)
8. P.V. Ananda Mohan, Efficient design of binary to RNS converters. J. Circuit. Syst. Comp 9,
145–154 (1999)
9. P.V. Ananda Mohan, Novel design for binary to RNS converters, in Proceedings of ISCAS,
London, 30 May–2 June 1994. pp. 357–360
10. H. Pettenghi, R. Chaves, L. Sousa, Method for designing modulo {2^n ± k} binary to RNS
converters, in Proceedings of the Conference on Design of Circuits and Integrated Systems,
DCIS, Estoril, 25–27 Nov. 2013
11. A.B. Premkumar, A formal framework for conversion from binary to residue numbers. IEEE
Trans. Circuits Syst. 49, 135–144 (2002)
12. A.B. Premkumar, E.L. Ang, E.M.K. Lai, Improved memory-less RNS forward converter based
on periodicity of residues. IEEE Trans. Circuits Syst. 53, 133–137 (2006)
13. G. Bi, E.V. Jones, Fast conversion between binary and residue numbers. Electron. Lett. 24,
1195–1197 (1988)
14. F. Pourbigharaz, H.M. Yassine, Simple binary to residue transformation with respect to 2^m + 1
moduli. Proc. IEE Circuits Dev. Syst. 141, 522–526 (1994)
15. F. Pourbigharaz, H.M. Yassine, Modulo free architecture for binary to residue transformation
with respect to {2^n - 1, 2^n, 2^n + 1} moduli set. Proc. IEEE ISCAS 2, 317–320 (1994)
16. M.H. Sheu, S.H. Lin, Y.T. Chen, Y.C. Chang, High-speed and reduced area RNS forward
converter based on {2^n - 1, 2^n, 2^n + 1} moduli set, in Proceedings of the IEEE 2004 Asia-Pacific
Conference on Circuits and Systems, 6–9 Dec. 2004. pp. 821–824

17. S.J. Piestrak, Design of multi-residue generators using shared logic, in Proceedings of ISCAS, Rio
de Janeiro, 15–19 May 2011. pp. 1435–1438
18. A. Skavantzos, M. Abdallah, Implementation issues of the two-level residue number system
with pairs of conjugate moduli. IEEE Trans. Signal Process. 47, 826–838 (1999)
19. J.Y.S. Low, C.H. Chang, A new approach to the design of efficient residue generators for
arbitrary moduli. IEEE Trans. Circuits Syst. I Reg. Papers 60, 2366–2374 (2013)
20. A.A. Hiasat, High-speed and reduced area modular adder structures for RNS. IEEE Trans.
Comput. 51, 84–89 (2002)
21. P.K. Matutino, H. Pettenghi, R. Chaves, L. Sousa, Multiplier based binary to RNS converters
modulo (2^n ± k), in Proceedings of 26th Conference on Design of Circuits and Integrated
Systems, Albufeira, Portugal, pp. 125–130, 2011
22. P.K. Matutino, R. Chaves, L. Sousa, Binary to RNS conversion units for moduli (2^n ± 3), in
14th IEEE Euromicro Conference on Digital System Design, Oulu, Aug. 31–Sept. 2,
2011. pp. 460–467
23. S.J. Meehan, S.D. O’Neil, J.J. Vaccaro, An universal input and output RNS converter. IEEE
Trans. Circuits Syst. 37, 799–803 (1990)
24. P.L. Montgomery, Modular multiplication without trial division. Math. Comput. 44, 519–521
(1985)
25. C.K. Koc, T. Acar, B.S. Kaliski Jr., Analyzing and comparing Montgomery multiplication
algorithms. IEEE Micro 16(3), 26–33 (1996)

Further Reading

F. Pourbigharaz, H.M. Yassine, A simple binary to residue converter architecture, in Proceedings


of the IEEE 36th MidWest Symposium on Circuits and Systems, Detroit, 16–18 Aug. 1993
Chapter 4
Modulo Multiplication and Modulo Squaring

In this chapter, algorithms and implementations for modulo multiplication for
general moduli as well as for powers-of-two-related moduli are considered. The design
of squarers is also considered in detail in view of their extensive application in
cryptography and signal processing. Designs using conventional number representation
as well as diminished-1 representation are described. Further, designs which
can be shared among various moduli are also explored with a view to reducing area.

4.1 Modulo Multipliers for General Moduli

The residue number multiplication for general moduli (i.e. moduli not of the form
(2^k ± a)) can be carried out by several methods: using index calculus, using
sub-modular decomposition, using auto-scale multipliers and based on quarter-square
multiplication.
Soderstrand and Vernia [1] have suggested modulo multipliers based on index
calculus. Here, the residues are expressed as exponents of a chosen base modulo m.
The indices corresponding to the inputs for a chosen base are read from LUTs
(look-up tables) and added mod (m - 1), and then, using another LUT, the actual
product mod m can be obtained.
As an illustration, consider m = 11. Choosing base 2, since 2^8 mod 11 = 3 and 2^4
mod 11 = 5, the indices corresponding to the inputs 3 and 5 are 8 and 4, respectively.
We wish to find (3 × 5) mod 11. Thus, we have the index corresponding to
the product as (8 + 4) mod 10 = 2. This corresponds to 2^2 mod 11 = 4, which is the
desired answer.
Note that the multiplication modulo m is isomorphic to modulo (m  1) addition
of the indices. Note further that zero detection logic is required if the input is zero
since no index exists.
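The index-calculus multiplier can be modelled with two LUTs, as in the following sketch (for a prime m with a known primitive root; the table-building code is ours):

```python
def index_calculus_tables(m, g):
    """Index and inverse-index LUTs for prime m and primitive root g."""
    idx = {pow(g, e, m): e for e in range(m - 1)}
    inv = {e: pow(g, e, m) for e in range(m - 1)}
    return idx, inv

def mod_mul(x, y, m, idx, inv):
    if x == 0 or y == 0:            # zero has no index: detect it separately
        return 0
    return inv[(idx[x] + idx[y]) % (m - 1)]

idx, inv = index_calculus_tables(11, 2)
assert idx[3] == 8 and idx[5] == 4          # as in the text's example
assert mod_mul(3, 5, 11, idx, inv) == 4
```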
Jullien [2] has suggested first using sub-modular decomposition for index
calculus-based multipliers mod m. This involves three stages: (a) sub-modular
reconstruction, (b) modulo index addition and (c) reconstruction of the desired result.
The choice of sub-moduli m1 and m2 such that m1·m2 > 2m has been suggested.
As an illustration, for m = 19, m1 = 6 and m2 = 7 can be chosen. Considering
multiplication of X = 12 and Y = 17, and base 2 for modulus m = 19, the indices
corresponding to X and Y can be seen to be 15 and 10, which in residue form are
(3, 1) and (4, 3) corresponding to (m1, m2). Adding the indices, we obtain the indices
corresponding to the product XY as (1, 4). Using CRT (which will be introduced
later in Chapter 5), the decoded word corresponding to moduli {6, 7} can be shown
to be 25, which mod 18 is 7. Thus, the final result can be obtained as 14, since 2^7 mod
19 = 14. Note that for input zero, since an index does not exist, the value 7 is used, since
7 will never appear as a valid sub-modular result, due to the fact that the sub-moduli are
less than 7.
Jullien's approach needs a large ROM, which can be reduced by using the sub-modular
decomposition due to Radhakrishnan and Yuan [3]. This technique is applicable
only if m - 1 can be factorized. For example, for m = 29, m1 = 7 and m2 = 4 can be
used. The approach is the same as before.
As an illustration, consider the multiplication of X = 5 and Y = 12
with base 2 for modulus m = 29. We have the indices as 22 and 7, or in RNS (1, 2)
and (0, 3). Adding these, we have (1, 1). Using CRT, we obtain the sum of indices as
1, which corresponds to the product 2. Note that among the several factorizations
possible for m - 1, the one that needs the smallest ROM can be chosen. In the case of
either input being zero, combinational logic shall detect the condition and the output
shall be made zero. Note that the addition of indices can use combinational logic,
whereas the rest can use ROMs. As an illustration, for the above example, the
memory requirement is 320 bits in total for the index ROMs and 160 bits for the inverse
index ROM (considering a memory of 32 locations for the needed 29 locations).
Dugdale [4] has suggested that by using unused memory locations in the LUTs,
the zero detection logic can be avoided. Further, he has suggested that in the case of a
composite modulus, direct multiplication can be used for non-prime factors,
whereas index calculus may be employed for prime factors. Consider for illustration
the composite modulus 28, which is realized using the RNS (7, 4). Since multiplication
mod 4 is relatively simple, normal multiplication can be employed, whereas for
the modulus 7 index calculus can be employed. Since m - 1 = 6 for m = 7, we can
use the two-moduli set {2, 3} to perform the index computation. The input
zero can be represented by the unused residue set (2, 0) or (2, 1) or (2, 2).
As an illustration, consider X = 14 and Y = 9. Corresponding to the three moduli,
i.e. 2, 3, 4, the indices for the first two moduli and the actual residue for modulus
4 corresponding to X and Y are (2, 0, 2) and (0, 2, 1), which yields the product as
(2, 2, 2). We thus obtain the result as 14.
Dugdale has compared multiplier realizations using the direct method and the
method using index calculus and observed that the memory requirements are
much less for the latter. As an illustration, for the composite modulus
221 = 13 × 17, the index tables in the first level corresponding to the two moduli
need 2 × 221 × (4 + 5) = 3978 bits, the memory in the second level needed to
perform index addition needs 2121 bits (169 × 4 + 289 × 5), and the third level
needed to find the actual result corresponding to the obtained index needs 221 locations
of 8 bits each, i.e. 1768 bits, thus needing 7867 bits in total, whereas a
direct multiplier needs 390,728 bits.
Ramnarayan [5] has considered moduli of the type m = 2^n - 2^k + 1, in which
case the modulo (m - 1) adder needed for adding or subtracting indices can be
simplified. Note that m - 1 = 2^n - 2^k has (n - k) "ones" followed by
k zeroes. Hence, if 0 ≤ x + y < 2^n - 2^k, then (x + y) mod (m - 1) = x + y. If
2^n - 2^k ≤ x + y < 2^n, then (x + y) mod (m - 1) = x + y - 2^n + 2^k. If
2^n ≤ x + y < 2^{n+1}, then (x + y) mod (m - 1) = {(x + y) mod 2^n} + 2^k. These three
conditions can be checked by combinational logic.
In the quarter-square technique [1], XY mod m is found as

|XY|_m = | ( |(X + Y)^2|_m - |(X - Y)^2|_m ) / 4 |_m   (4.1)

Thus, using look-up tables, both the terms in the numerator can be found and subtracted
mod m. The division by 4 is also carried out using a ROM. These designs are well
suited for small moduli.
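A behavioural sketch of (4.1), assuming m is odd so that division by 4 mod m exists (the squaring table plays the role of the LUTs; three-argument pow with exponent -1 needs Python 3.8+):

```python
def quarter_square_mul(x, y, m):
    """|xy|_m via two squaring LUT reads, a subtraction and a
    division by 4 mod m (m odd so that |1/4|_m exists)."""
    sq = [(v * v) % m for v in range(2 * m)]     # squaring look-up table
    inv4 = pow(4, -1, m)                         # |1/4|_m
    return ((sq[x + y] - sq[(x - y) % m]) * inv4) % m

assert quarter_square_mul(12, 17, 19) == (12 * 17) % 19
```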
Designs based on combinational logic will be needed for larger word length
moduli. Extensive work on modulo multiplication has been carried out in the past
three decades due to its immense application in cryptography, in authentication and
key exchange algorithms. There, the word lengths of the operands are very large,
ranging from 160 bits to 2048 bits, whereas in RNS for DSP applications, the word
lengths could be a few tens of bits at most.
The operation (AB) mod m can be carried out by first multiplying A with B and
then dividing the result by m to obtain the remainder. The quotient is obtained in
this process but is not of interest to us. Moreover, the word length of the product is
2n bits for n-bit input operands. The division process is involved, as is well known.
Hence, usually, modulo multipliers are realized in an integrated manner, doing
partial product addition and modulo reduction in one step. This ensures that the
word length is increased by at most 2 bits. An example will illustrate Brickell's
algorithm [6, 7].
algorithm [6, 7].
Example 4.1 Consider C = AB mod m, where A = 13, B = 17 and m = 19. We start
with the MSB of A = (a3 a2 a1 a0)₂ = (1101)₂ and in each step the value
E_i = (2 E_{i-1} + a_{3-i} B) mod m is computed, considering E_{-1} = 0. The following
illustrates the procedure:
E0 = (2 × 0 + 1 × 17) mod 19 = 17   (a3 = 1)
E1 = (2 × 17 + 1 × 17) mod 19 = 13   (a2 = 1)
E2 = (2 × 13 + 0 × 17) mod 19 = 7   (a1 = 0)
E3 = (2 × 7 + 1 × 17) mod 19 = 12   (a0 = 1)
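The MSB-first interleaving of Example 4.1 can be sketched as follows (the selection among U, U - m and U - 2m follows the description in the text):

```python
def msb_first_mod_mul(a, b, m):
    """MSB-first interleaved modular multiplication: each step doubles,
    conditionally adds B, then reduces by subtracting m or 2m (U < 3m)."""
    e = 0
    for i in reversed(range(a.bit_length())):
        u = 2 * e + ((a >> i) & 1) * b
        e = u - 2 * m if u >= 2 * m else u - m if u >= m else u
    return e

assert msb_first_mod_mul(13, 17, 19) == (13 * 17) % 19 == 12
```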



Note, however, that each step involves a left shift (appending an LSB of zero to
realize multiplication by 2), a conditional addition (adding B or not, depending on the
current bit a_i) and a modulo m reduction. The operand in the bracket, U = (2E_i + a_i B),
satisfies U < 3m, meaning that the word length is 2 bits more than that of m, and the
modulo reduction needs to be performed by subtracting m or 2m and selecting U,
U - m or U - 2m based on the signs of U - m and U - 2m. The method can be extended
to higher radix as well, at some expense in hardware but reducing the modulo
multiplication time (see for example [8]).
Several authors have suggested designing the modulo multiplier in two stages.
The first stage is an n-bit by n-bit multiplier whose 2n-bit output word is reduced
mod m using a second stage. Hiasat [9] has suggested a modulo multiplier which
uses a conventional multiplier to obtain first z = xy, followed by a modulo reduction
block. In this method, defining a = 2^n − m and considering that k bits are used to
represent a, the 2n-bit product can be written as

Z = D·2^(2n−k−1) + C·2^n + B·2^(n−1) + A   (4.2)

where A, B, C and D are (n − 1)-, 1-, (n − 1 − k)- and (k + 1)-bit words. It may be
noted that (C·2^n) mod m = C·a. Thus, Ca needs (n − 1) bits. B and D are fed to a
combinational logic circuit to yield another (n − 1)-bit word. The resulting three
words are next added together using an n-bit carry-save adder followed by an adder
mod m to yield the final residue. This needs an (n − 1 − k) × k-bit multiplier for
computing Ca, which roughly needs one-fourth the hardware of an n × n multiplier
(for k = n/2), and an n × n multiplier for computing x·y, thus needing in total
1.25 (n × n) multipliers, two n-bit CPAs, one n-bit CSA and a combinational circuit.
The delay is roughly 8·log2 n + 2 units. This approach can be considered as a
multiplier cascaded with a binary-to-RNS converter.
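The decomposition (4.2) can be exercised numerically. In this sketch (illustrative, not Hiasat's gate-level circuit; the modulus is a sample value) the B and D fields are folded in arithmetically instead of through the combinational block:

```python
def hiasat_reduce(z, n, m):
    """Reduce a 2n-bit product z mod m using the split of (4.2),
    where a = 2^n - m has k bits: z = D*2^(2n-k-1) + C*2^n + B*2^(n-1) + A."""
    a = (1 << n) - m
    k = a.bit_length()
    A = z & ((1 << (n - 1)) - 1)              # n-1 LSBs
    B = (z >> (n - 1)) & 1                    # single bit
    C = (z >> n) & ((1 << (n - 1 - k)) - 1)   # n-1-k bits
    D = z >> (2 * n - k - 1)                  # k+1 MSBs
    # (C*2^n) mod m = C*a; D's weight is likewise reduced mod m
    return (A + (B << (n - 1)) + C * a + D * ((1 << (2 * n - k - 1)) % m)) % m

n, m = 8, 241                                 # a = 15, k = 4 (illustrative)
x, y = 200, 123
print(hiasat_reduce(x * y, n, m) == (x * y) % m)  # -> True
```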
Di Claudio et al. [10] have described a scaled multiplier which computes (XY)·2^(−b)
mod m, where b is the number of bits of X, Y and m. Defining Z = XY = Z_H·2^b + Z_L,
we have

Z·2^(−b) mod m = (Z_H·2^b + Z_L)·2^(−b) mod m = (Z_H + Z_L·2^(−b)) mod m   (4.3)

In this technique, we first compute Z_L·2^(−b) mod m and add Z_H. We actually compute

P = Z_H + [((Z_L·(2^b − α_i)) mod 2^b)·m] div 2^b + δ   (4.4)

where α_i = m^(−1) mod 2^b and δ is 1 if Z mod 2^b ≠ 0, else δ = 0.
This architecture needs three n × n multipliers—for computing Z, for multiplying
by (2^b − α_i) and for multiplying by m—and a three-input n-bit modulo adder. One of
the multipliers need not compute the most significant word of the product. Thus,
roughly it needs 2.5 (n × n) binary multipliers, two n-bit CPAs and additional
control logic. The delay is (11·log2 n + 2) units. Note that this technique is a simple
variation of the Montgomery technique [11] to be described in Chapter 10.
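Equation (4.4) can be checked with a short script (names are illustrative; α exists because m is odd, and Python's three-argument `pow` computes the inverse):

```python
def scaled_modmul(x, y, m, b):
    """P per (4.4): P = Z_H + ((Z_L*(2^b - alpha)) mod 2^b)*m div 2^b + delta,
    congruent to X*Y*2^(-b) mod m, with alpha = m^(-1) mod 2^b (m odd)."""
    z = x * y
    zh, zl = z >> b, z & ((1 << b) - 1)
    alpha = pow(m, -1, 1 << b)
    q = (zl * ((1 << b) - alpha)) & ((1 << b) - 1)
    delta = 1 if zl != 0 else 0
    return zh + ((q * m) >> b) + delta

m, b = 19, 5
x, y = 13, 17
p = scaled_modmul(x, y, m, b)
print((p << b) % m == (x * y) % m)  # -> True: P = X*Y*2^(-b) (mod m)
```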
Stouraitis et al. [12] have proposed a full-adder-based implementation of (AX + B)
mod m. The usual carry-save adder (CSA)-based partial product (PP) addition is
modified by rewriting the MSBs in bit positions above the b − 1 bits, where b is the
number of bits used to represent m, as new words representing (2^x) mod m. These
words are next added to yield a smaller-length word. This word is again reduced by
rewriting MSBs as new words as before. In a few iterations, the result obtained will
be of length b bits. Note that one modulo subtraction may be needed at the end to
obtain the final result.
As an example, consider (12 × 23) mod 29 = 15. The bits of the partial products
in the positions to the left of five bits need to be rewritten: for example, 2^5 mod
29 = 3, 2^6 mod 29 = 6, and 2^7 mod 29 = 12. In this case, in general, 4 bits can be 1 in
the 2^5 position, 3 bits can be 1 in the 2^6 position, 2 bits can be 1 in the 2^7 position
and 1 bit can be 1 in the 2^8 position. The sum of the rewritten bit matrix can be seen
to be 207. In the next step, it will reduce further (see Figure 4.1).
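A behavioral sketch of this iterative rewriting (not the adder-level circuit of [12]); bits at positions ≥ b are replaced by their residues until the value fits in b bits, with at most one final subtraction:

```python
def rewrite_reduce(v, m, b):
    """Reduce v mod m (m a b-bit modulus) by repeatedly rewriting each
    set bit at position j >= b as the word (2^j) mod m and re-adding."""
    while v >= (1 << b):
        new, t, j = v & ((1 << b) - 1), v >> b, b
        while t:
            if t & 1:
                new += (1 << j) % m   # rewrite the high bit as its residue
            t >>= 1
            j += 1
        v = new
    return v if v < m else v - m      # at most one final subtraction

print(rewrite_reduce(12 * 23, 29, 5))  # -> 15
```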
RNS multipliers can be designed using combinational adder-based techniques.
Since a modulus m is always less than the maximum n-bit value possible, many
binary words with magnitude greater than (m − 1), i.e. m, m + 1, . . ., 2^n − 1, do not
occur. This fact can be exploited to reduce the full-adders in normal array
multipliers to OR gates and half-adders.
For a mod 5 multiplier, with A = a2a1a0 and B = b2b1b0, the product can be
obtained by mapping the bits above the 2^2 position mod 5, for example
(a1b2·2^3) mod 5 = 3·(a1b2). The resulting bit matrix is as follows (the re-mapped
bits are a1b2 and a2b1 in the 2^1 and 2^0 columns, and a2b2 in the 2^0 column):

2^2     2^1     2^0
a0b2    a0b1    a0b0
a1b1    a1b0    a1b2
a2b0    a1b2    a2b1
        a2b1    a2b2
Next, we can reorganize the product bits a_ib_j into doublets and triplets whose
maximum sum does not exceed unity for any input bit values. This reduces the number
of bits to be added in each column, thereby allowing the 1-bit adders in a column to
be replaced with OR gates.

Figure 4.1 The Stouraitis et al. [12] technique of modulo multiplication

Paliouras et al. [13]
have used extensive simulation to arrive at all possible input combinations to
identify the input bit product pairs or triplets that cannot be active simultaneously.
The design contains a recursive modulo reduction stage formed by cascaded adders
following Stouraitis et al. [12] which, however, is delay consuming while the area is
reduced. A mod 5 multiplier realized using this approach is presented in Figure 4.2a
for illustration.
Dimitrakopoulos et al. [14] have considered the use of signed-digit representa-
tion of the bit-product weight sequence (2^k) mod m_i in order to reduce the hardware
complexity. A graph-based optimization method is described which could achieve an
area reduction of 50%. A mod 11 multiplier realized using this approach is
presented in Figure 4.2b for illustration. Note that in the case of the mod 11 multiplier,
the bits corresponding to 2^4, 2^5 and 2^6 are given the signed weights 5, −1 and −2,
so that the number of bits to be added will be reduced. A 4 × 4 multiplier mod 11
will need the following bits to be added in the various columns:
2^3     2^2     2^1        2^0
a0b3    a0b2    a0b1       a0b0
a2b1    a1b3    a1b0       a1b3
a1b2    a1b1    (a3b3)‾    a2b2
a3b0    a2b0               a3b1
        a3b1               (a2b3)‾
        a2b2               (a3b2)‾

Here (·)‾ denotes a complemented bit. A correction term needs to be added to
compensate for the effect of the inverted bits.
Note that ROMs can be used to realize the multiplication in RNS for small
moduli. On the other hand, recent work has focused on powers-of-two-related
moduli sets, since modulo multiplication can be simpler using the periodic property
of the moduli discussed in Chapter 3. The multiplication operation (AB) mod 2^n is
simple: only the n LSBs of the product AB need to be obtained. An array multiplier
can be used, for instance, omitting the full-adders beyond the (n − 1)th bit. We next
consider mod (2^n − 1) and mod (2^n + 1) multipliers separately.
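The mod 2^n case amounts to truncating the product to its n LSBs:

```python
def mul_mod_2n(a, b, n):
    """(a*b) mod 2^n: keep only the n LSBs; in an array multiplier the
    adder cells producing bits n and above are simply omitted."""
    return (a * b) & ((1 << n) - 1)

print(mul_mod_2n(13, 11, 4))  # -> 15, i.e. 143 mod 16
```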
4.2 Multipliers mod (2^n − 1)

We use the periodic property of the modulus (2^n − 1) for this purpose. We note that
2^(n+k) = 2^k mod (2^n − 1). Thus, multiplication mod (2^n − 1) involves addition
of modified partial products obtained by rotation of the bits of each of the partial
products. The resulting n words need to be added in a CSA with EAC followed
by a CPA with EAC in the last stage. Consider the following example.
Figure 4.2 (a) Modulo-5 and (b) modulo-11 multipliers based on combinational
logic ((a) adapted from [13] ©IEEE2001, (b) adapted from [14] ©IEEE2004)
Example 4.2 Compute AB mod m = (1011 × 1101) mod 1111.

1011   PP0 = b0·A
0000   PP1 = b1·A rotated left by 1 bit
1110   PP2 = b2·A rotated left by 2 bits
0101   SUM
0101   CARRY (the EAC bit appears as the LSB)
1101   PP3 = b3·A rotated left by 3 bits
1101   SUM
1010   CARRY
1000   Final CPA addition with EAC ■
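The rotate-and-add scheme of Example 4.2 can be modeled as below (illustrative names; the end-around carry is modeled by folding the bits above position n − 1 back into the LSBs):

```python
def rotl(x, k, n):
    """Left-rotate an n-bit word x by k bits."""
    return ((x << k) | (x >> (n - k))) & ((1 << n) - 1)

def mulmod_2n_minus_1(a, b, n):
    """(a*b) mod (2^n - 1): partial product b_i*A is A rotated left by
    i bits (since 2^(n+k) = 2^k mod 2^n - 1); sums are folded with an EAC."""
    mod = (1 << n) - 1
    acc = 0
    for i in range(n):
        if (b >> i) & 1:
            acc += rotl(a, i, n)
        acc = (acc & mod) + (acc >> n)   # end-around carry
    return 0 if acc == mod else acc

print(mulmod_2n_minus_1(0b1011, 0b1101, 4))  # -> 8 (1000, as in Example 4.2)
```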
Wang et al. [15] have described a mod (2^n − 1) multiplier which is based on
adding the n-bit partial products using a Wallace tree followed by a CPA with EAC.
The partial products are obtained by circularly rotating the input word A left by
i bits and adding the result in case b_i is 1, as mentioned before.
The area and delay of the Wang et al. design are n^2·A_AND + n(n − 2)·A_FA + A_PAn and
D_AND + (n − 2)·D_FA + D_PAn when a CSA is used, where A_PAn and D_PAn are the area
and delay of an n-bit carry-propagate adder and A_FA, D_FA, A_AND and D_AND are the areas
and delays of a full-adder and an AND gate, respectively. The delay in the case of a
Wallace tree is D_AND + d(n)·D_FA + D_PAn. Note that the depth d(n) of the Wallace tree is
equal to 1, 2, 3, 4, 5, 6, 7, 8, 9 when n is 3, 4, 5–6, 7–9, 10–13, 14–19, 20–28, 29–42,
43–63, respectively.
Zimmermann [16] also has suggested the use of a CSA based on a Wallace tree. He
has observed that Booth recoding does not lead to faster and smaller multipliers due
to the overhead of the recoding logic, which is not compensated by the smaller carry-
save adder of only (n/2) + 1 partial products.
The area and delay of Zimmermann's modulo (2^n − 1) multiplier are 8n^2 + (3/2)·n·log n − 7n
and 4d(n) + 2·log n + 6 using the unit-gate model, where d(n) is the depth
of the Wallace tree.
Efstathiou et al. [17] have pointed out that while using the modified Booth's
algorithm in the case of even n for a modulus of 2^n − 1, the number of partial
products can be only n/2. The most significant recoded digit of the multiplier B can
be seen to be b_{n−1}, which corresponds to (b_{n−1}·2^n) mod (2^n − 1) = b_{n−1}. Thus, in
place of b_{−1} (which is zero) in the first digit, we can use b_{n−1}. Accordingly, the first
recoded digit will be (b_{n−1} + b_0 − 2b_1). The truth tables and implementations of the
Booth encoder and Booth selector are shown in Figure 4.3a, b and the design of the
modulo 255 multiplier is presented in Figure 4.3c.
The area of this mod (2^n − 1) multiplier, using either a CSA array or a Wallace
tree, is (n/2)·A_BE + n·(n/2)·A_BS + n·((n/2) − 2)·A_FA + A_PAn, where A_BE, A_BS and
A_PAn are the areas of the Booth encoder (BE), Booth selector (BS) and a mod (2^n − 1)
adder, whereas for Zimmermann's technique the area needed is
((n/2) + 1)·A_BE + n·((n/2) + 1)·A_BS + n·((n/2) − 1)·A_FA + A_PAn. The delays for the
Efstathiou et al. multiplier and Zimmermann's multiplier are respectively
T_BE + T_BS + ((n/2) − 2)·T_FA + T_PAn and T_BE + T_BS + ((n/2) − 1)·T_FA + T_PAn.
The radix-4 Booth encoder truth table of Figure 4.3a is:

b2i+1  b2i  b2i−1 |  s  2x  1x
  0     0     0   |  0   0   0
  0     0     1   |  0   0   1
  0     1     0   |  0   0   1
  0     1     1   |  0   1   0
  1     0     0   |  1   1   0
  1     0     1   |  1   0   1
  1     1     0   |  1   0   1
  1     1     1   |  1   0   0

Figure 4.3 (a) Radix-4 Booth encoder, (b) Booth selector and (c) a mod 255 multiplier using
Booth's algorithm (Adapted from [17] ©IEEE 2004)
In the case of a Wallace tree being used, the delays for the two cases are
respectively T_BE + T_BS + d(n/2)·T_FA + T_PAn and T_BE + T_BS + d((n/2) + 1)·T_FA + T_PAn,
where d(·) denotes the depth of the tree.
Recently, Muralidharan and Chang [18] have suggested a radix-8 Booth-
encoded modulo (2^n − 1) multiplier with adaptive delay. The digit corresponding
to the multiplier Y can in this case be expressed as

d_i = y_{3i−1} + y_{3i} + 2y_{3i+1} − 4y_{3i+2}   (4.5)

Thus, d_i can assume one of the values 0, ±1, ±2, ±3 and ±4. The Booth encoder
(BE) and Booth selector (BS) blocks are shown in Figure 4.4a, b. Note that while
the partial products (PPs) for multiples of 1, 2 and 4 can be easily obtained, the
PPs for multiples of 3 need attention. Conventional realization of 3X mod (2^n − 1)
by adding 2X with X needs a CPA followed by another carry-propagation stage
using half-adders to perform the addition of the carry generated by the CPA.
Muralidharan and Chang [18] suggest the use of (n/k) k-bit adders to
add X and 2X so that there is no carry propagation between the adder blocks. While
the carry of the leftmost k-bit adder can be shifted to the LSB due to the mod (2^n − 1)
operation, the carry of each k-bit adder results in a second vector, as shown in
Figure 4.4c. However, in the case of obtaining −3X, we need to take the one's
complement of both words; the second word will then have several strings of
(k − 1) ones. These can be avoided by adding a bias word B [19] which has ones
in the ak-th bit positions, a = 0, . . ., (n/k) − 1. The addition of the bias word to the
Sum and Carry bits of the adder of Figure 4.4c can be realized easily using one
XNOR gate and one OR gate per block, as shown in Figure 4.4d, to obtain |B + 3X|_{2^n−1}.
Specifically, for j = 0, 1, . . ., M − 1, where M = n/k, the biased sum bit bs_0^j is the
XNOR of s_0^j with the carry c_{k−1}^{j−1} of the preceding block (block M − 1 for
j = 0), and the biased carry bc_{k−1}^j is the OR of s_0^{j+1} (s_0^0 for j = M − 1)
with c_{k−1}^j.
The bias word B needs to be added to all multiples of X for uniformity, and a
compensation constant (CC) can be added at the end. The biased simple multiples
|B + 0|_{2^n−1}, |B + X|_{2^n−1}, |B + 2X|_{2^n−1} and |B + 4X|_{2^n−1} for n = 8 are realized by left
circular shift and selective complementing of the multiplicand bits without addi-
tional hardware, as shown in Figure 4.4e. The multiple B − 3X is realized in a similar
way: note that (−3X) mod (2^n − 1) is the one's complement of 3X, and the bias word B
can be added in the same way as in the case of +3X to get B − 3X. Note that
the bias word needs to be scaled by 2^3, 2^6, etc. Each PP_i consists of an n-bit vector
pp_{i,n−1}, . . ., pp_{i,0} and a vector of n/k = 2 redundant carry bits q_{i1} and q_{i0}. These are
circularly displaced to the left by 3 bits for each PP_i. In the case of radix-8 Booth
encoding, the ith partial product can be seen to be PP_i = |2^{3i}·d_i·X|_{2^n−1}. This is
modified to include the bias B as PP_i = |2^{3i}·(B + d_i·X)|_{2^n−1}. The modulo-reduced
partial products and correction terms for a mod 255 multiplier are shown in
Figure 4.4f. Hence, the correction word will be one n-bit word if k is chosen to
be prime to n.
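The recoding (4.5) can be checked in software (an illustrative script; it verifies that the digits lie in {−4, . . ., +4} and that Σ d_i·8^i reconstructs Y for 8-bit operands):

```python
def radix8_booth_digits(y, n):
    """Radix-8 Booth digits per (4.5): d_i = y_(3i-1) + y_(3i)
    + 2*y_(3i+1) - 4*y_(3i+2), with y_(-1) = 0 and bits beyond n-1 zero."""
    bit = lambda j: (y >> j) & 1 if j >= 0 else 0
    return [bit(3*i - 1) + bit(3*i) + 2*bit(3*i + 1) - 4*bit(3*i + 2)
            for i in range(n // 3 + 1)]

digits = radix8_booth_digits(0b10110111, 8)
print(digits)                                       # -> [-1, -1, 3]
print(sum(d * 8**i for i, d in enumerate(digits)))  # -> 183
```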
Figure 4.4 (a) Booth encoder block, (b) Booth selector block, (c) generation of partially redun-
dant (+3X) mod (2^n − 1) using k-bit RCAs, (d) generation of (B + 3X) mod (2^n − 1) using k-bit
RCAs, (e) generation of partially redundant simple multiples and (f) modulo-reduced partial
products and CC for the mod (2^8 − 1) multiplier (Adapted from [18] ©IEEE2011)
Note that the choice of k decides the speed of generation of the hard multiple (i.e. the
delay of the k-bit ripple-carry adder). Here, the partial product accumulation by the
CSA tree has a time complexity of O(log(n + n/k)). The delays of the hard-multiple
generation, CC generation, partial product generation by the BS and BE blocks, and the
two-operand parallel-prefix modulo (2^n − 1) adder are respectively O(k), O(1), O(1)
and O(log n). Thus, the total delay of the multiplier is logarithmically dependent on
n and linearly on k. Hence, the delay can be manipulated by proper choice of k and
n. The final adder used by the authors was a 2-operand mod (2^n − 1) adder using a
Sklansky parallel-prefix structure with an additional level for EAC addition,
following Zimmermann [16]. The authors have shown that by proper choice of k,
the delay of the mod (2^{2n} − 1) multiplier can be changed to match the RNS delay of
the multiplier for the lower bit-length moduli in four-moduli sets of the type {2^n, 2^n − 1,
2^n + 1, 2^{2n} − 1}.
The areas of the BE, BS and k-bit CPA are respectively 3·A_INV + 3·A_AND2 +
A_AND3 + 3·A_XOR2, 4·A_AND2 + A_OR4 + A_XOR2 and (k − 1)·A_FA + A_HA + A_OR2 + A_XNOR2.
The total normalized area requirements in terms of gates are
25.5n·(⌊n/3⌋ + 1) + 10.5n + 38.5·(⌊n/3⌋ + 1) + 1 if k = n, and 25.5n·(⌊n/3⌋ + 1) + 21n +
68.5·(⌊n/3⌋ + 1) + 3 if k = n/3. Note that M (= n/k) k-bit RCAs, (⌊n/3⌋ + 1) BE blocks,
(⌊n/3⌋ + 1)·(n + M) BS blocks and n·(⌊n/3⌋ + Q) full-adders, where Q = (⌊n/3⌋ + 1)/k, are
required.
4.3 Multipliers mod (2^n + 1)

Mod (2^n + 1) multipliers of various types have been considered in the literature:
(a) both inputs in standard representation, (b) one input in standard form and the other in
diminished-1 form and (c) both inputs in diminished-1 representation.
Curiger et al. [20] have reviewed the multiplication mod (2^n + 1) techniques for
use in the implementation of IDEA (International Data Encryption Algorithm) [21], in
which the input "zero" does not appear as an operand. Instead, it is considered as 2^16 for
n = 16.
The quarter-square multiplier needs only 2 × 2^n × n bits of ROM as against the
requirement of 2^{2n} × n bits in the case of direct table look-up. Note that
φ(x + y) = (x + y)^2/2 and φ(x − y) = (x − y)^2/2 are stored in memories (see (4.1)). The index
calculus technique needs 3 × 2^n × n bits of ROM.
Curiger et al. [20] suggest three techniques. The first technique follows the well-
known LowHigh lemma [21], based on the periodic property of the modulus (2^n + 1)
discussed in Chapter 3. The LowHigh lemma states that

(AB) mod (2^n + 1) = (AB) mod 2^n − (AB) div 2^n,             if (AB) mod 2^n ≥ (AB) div 2^n
                   = (AB) mod 2^n − (AB) div 2^n + 2^n + 1,   if (AB) mod 2^n < (AB) div 2^n   (4.6)

Note that when A_n = 0 and B_n = 0, the result of multiplication is (AB) mod (2^n + 1) =
(A_L − A_H), where A_L and A_H denote the lower and upper n-bit halves of the product
AB. Thus, the one's complement of A_H and 1 need to be added to A_L. If the carry is
1, the result is the n LSBs. If the carry is zero, then '1' needs to be added to the result.
When A = 2^n or B = 2^n also, the above procedure is followed. On the other hand, if
A_n = B_n = 1, the LSB is 1, since 2^{2n} mod (2^n + 1) = 1, and the procedure is the same as
before.
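The LowHigh lemma translates directly into code (a behavioral sketch for operands up to 2^n):

```python
def mulmod_2n_plus_1(a, b, n):
    """(a*b) mod (2^n + 1) via (4.6): subtract the high half of the
    product from the low half, adding 2^n + 1 back if that goes negative."""
    z = a * b
    lo, hi = z & ((1 << n) - 1), z >> n
    return lo - hi if lo >= hi else lo - hi + (1 << n) + 1

n = 4
print(mulmod_2n_plus_1(13, 11, n))  # -> 7, i.e. 143 mod 17
```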
This technique uses an (n + 1)-bit × (n + 1)-bit multiplier followed by a dedicated
correction unit: the MSBs are subtracted from the LSBs using a mod 2^n adder, and the
result is then reduced mod (2^n + 1) using another modulo 2^n adder (see Figure 4.5).
This architecture can thus handle the cases A = 2^n and/or B = 2^n correctly, as needed
in the IDEA algorithm. Note that Hiasat [22] and Bahrani and Sadeghiyan [23] have
suggested the same technique independently later.
Curiger et al. [20] have suggested a second technique using mod (2^n + 1) adders
and a diminished-1 representation for one of the operands X, which is a "key" of the
IDEA algorithm that can be pre-processed. They compute the expression

Z = Σ_{i=0}^{n} y_i·((2^i·X*) mod 2^n + ((2^i·X*) div 2^n)‾ + 1) mod (2^n + 1)   (4.7)

where X* = X − 1 and (·)‾ denotes one's complementation. The bit x_n is not used since
it corresponds to the case X = 0 (i.e. an actual input of 2^n). Note that (2^i·X*) mod 2^n
corresponds to an i-bit left shift and (2^i·X*) div 2^n corresponds to an i-bit right shift.
A carry-save structure is used in which a correction term is added in every step
depending on the carry output, and a final modulo adder is also used.

Figure 4.5 Modulo multiplier based on the LowHigh lemma (adapted from [23] ©IEEE1991)
Curiger et al. [20] consider a third method using bit-pair recoding (modified
Booth's algorithm) so that the number of addition stages is reduced by a factor of
2. Note here that the modulo correction is done in the carry-select addition unit,
which needs the final adder to be of (n + log2 n) bits.
For multiplication [15, 16], corresponding to C = AB, while using
diminished-1 numbers A*, B*, we have

C mod (2^n + 1) = AB mod (2^n + 1) = (A* + 1)(B* + 1) mod (2^n + 1)
               = (A*B* + A* + B* + 1) mod (2^n + 1) = (C* + 1) mod (2^n + 1)

or

C* = (A*B* + A* + B*) mod (2^n + 1)   (4.8)

Thus, multiplication A*B* followed by addition of A* and B* will be required to
obtain the diminished-1 number. If either operand is zero or both are zero, the
answer is zero.
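Relation (4.8) can be exercised directly (helper names are illustrative):

```python
def dim1(x, n):
    """Diminished-1 code of x: x - 1, with x = 0 represented as 2^n."""
    return (x - 1) % ((1 << n) + 1)

def mul_dim1(a_star, b_star, n):
    """C* = (A*B* + A* + B*) mod (2^n + 1), eq. (4.8)."""
    return (a_star * b_star + a_star + b_star) % ((1 << n) + 1)

n, a, b = 8, 58, 183
c_star = mul_dim1(dim1(a, n), dim1(b, n), n)
print(c_star == (a * b - 1) % 257)  # -> True: C* is the diminished-1 product
```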
Zimmermann [16] also considered the design of mod (2^n + 1) multipliers for use in
IDEA, where the value 2^n is represented by zero. Zimmermann has shown that the final
product (XY) mod (2^n + 1), where X and Y are n-bit numbers, can be written as

(XY) mod (2^n + 1) = (Σ_{i=0}^{n−1} (PP_i + 1) + 2) mod (2^n + 1)   (4.9)

where the PP_i are the partial products rewritten as

PP_i = x_i·(y_{n−i−1} . . . y_0 ȳ_{n−1} . . . ȳ_{n−i}) + x̄_i·(0. . .01. . .1)   (4.10)

and 0. . .01. . .1 indicates a number with (n − i) zeros and i ones. It can be seen
that the MSBs of the circularly left-shifted words are one's complemented
and appended as LSBs. The '1' term within the brackets and the '2' term in (4.9) are
correction terms. Note that the property (A + B + 1) mod (2^n + 1) = (A + B + c̄_out) mod 2^n
is used in computing PP_i + 1.
Note that if x_i = 0, we need to add the string 0. . .01. . .1, since the most significant
zeros are inverted and added. Thus, using a multiplexer controlled by x_i,
either the shifted word with one's complemented MSBs or the word 0. . .01. . .1 is
selected.
As an illustration, consider finding (1101 × 1011) mod 17, taking Y = 1101 and the
selection bits x_i from X = 1011:

x0·Y = 1101 → PP0 + 1 = 1101 + 1
x1·Y = 1101 → PP1 + 1 = 1010 + 1
x2·Y = 0000 → PP2 + 1 = 0011 + 1
x3·Y = 1101 → PP3 + 1 = 1001 + 1
                         + 2
result:                  0111
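A behavioral sketch of (4.9)/(4.10) (illustrative names), reproducing the example above:

```python
def zimmermann_pps(x, y, n):
    """Partial products per (4.10): for x_i = 1, y rotated left by i with
    the i wrapped MSBs one's-complemented; for x_i = 0, the word 0..01..1
    with i ones."""
    mask = (1 << n) - 1
    pps = []
    for i in range(n):
        if (x >> i) & 1:
            wrapped = y >> (n - i)                       # the i MSBs of y
            pps.append(((y << i) & mask) | (~wrapped & ((1 << i) - 1)))
        else:
            pps.append((1 << i) - 1)
    return pps

n, x, y = 4, 0b1011, 0b1101
pps = zimmermann_pps(x, y, n)
result = (sum(p + 1 for p in pps) + 2) % ((1 << n) + 1)   # eq. (4.9)
print([format(p, "04b") for p in pps], result)
# -> ['1101', '1010', '0011', '1001'] 7
```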
In the case of diminished-1 operands, we need to add two additional terms A*
and B* (see (4.8)).
Note that in Zimmermann's technique, in the cases X = 2^n, Y = 2^n and X = Y = 2^n,
the results respectively shall be

P = (2^n·Y) mod (2^n + 1) = (−Y) mod (2^n + 1) = (Ȳ + 2) mod (2^n + 1),
P = (2^n·X) mod (2^n + 1) = (−X) mod (2^n + 1) = (X̄ + 2) mod (2^n + 1),   (4.11)
P = 1

and P = (XY) mod (2^n + 1) otherwise.
These three cases are handled outside, using multiplexers before the final modulo
carry-propagate adder (see Figure 4.6), by the 2^n-correction unit. This unit
computes

(P′_C, P′_S) = (Ȳ, 1)   if X = 2^n,
             = (X̄, 1)   if Y = 2^n,   (4.12)
             = (0, 0)   if X = Y = 2^n

With 2^n represented by zero, the correction unit needs two zero detectors which
are outside the critical path. Note that in (4.12), Ȳ + 1 is computed instead of
Ȳ + 2 because the final modulo adder adds an extra '1'.
Figure 4.6 Modulo (2^n + 1) multiplier due to Zimmermann (adapted from [16] ©IEEE1999)
Using the unit-gate model, it can be shown that the area and delay of the mod (2^n + 1)
multiplier due to Zimmermann are 9n^2 + (3/2)·n·log n + 11n gates and
4d(n + 1) + 2·log n + 9 gate delays, respectively.
Wang et al. [24] have described a Wallace-tree-based architecture for modulo
(2^n + 1) multiplication of diminished-1 numbers. The expression evaluated is

d(BA) = (Σ_{k=1}^{n−1} d(b_k·2^k·A) ⊞ Z ⊞ d_1(A) + 1) mod (2^n + 1)   (4.13)

where ⊞ stands for addition and Σ d(·) stands for modulo (2^n + 1) summation of
diminished-1 operands, Z, defined as

Z = Σ_{k=1}^{n−1} b̄_k = n − 1 − Σ_{k=1}^{n−1} b_k   (4.14)

gives the number of zeroes in the (n − 1) bits from b_1 to b_{n−1}, and

d_1(A) = b̄_0·d(A) + b_0·d(2A)   (4.15)

Note that d(2^k·A) is obtained by a k-bit left cyclic shift, with the shifted bits being
complemented, if b_k = 1. On the other hand, if b_k = 0, d(2^k·A) will be replaced with
n zeroes. The case x_n = y_n = 1 can be handled by having an OR gate for the LSB at the
end to give an output of 1. Note that the computation of Z involves a counter which
counts the number of zeroes in the bits b_1 to b_{n−1}.
An example is presented in Figure 4.7 for finding (58 × 183) mod 257, where
58 and 183 are diminished-1 numbers. Note that the final mod (2^n + 1) adder has a
carry-in of '1'.

Figure 4.7 Example of multiplication mod 257 (adapted from [24] ©Springer1996)
The area requirement of the Wang, Jullien and Miller mod (2^n + 1) multiplier is
8n^2 + (9/2)·n·log n + (9/2)·n − 7·⌈log2(n − 1)⌉ − 1 equivalent gates.
Wrzyszcz and Milford [25] have described a multiplier (XY) mod (2^n + 1) which
reduces the (n + 1)-bit × (n + 1)-bit array of bits corresponding to the (n + 1) partial
products to be added to an n × n array. They first observe that the bits can be divided
into four groups (see Figure 4.8a). Taking advantage of the fact that when x_n or y_n is
1 the remaining bits of X and Y are zero, they have suggested combining the bits in the
leftmost diagonal and bottom-most line into a new row, as shown in Figure 4.8b, noting
that the partial product bits can be OR-ed instead of being added. The new
bits s·q_k are defined by s = x_n ⊕ y_n and q_k = x_k ∨ y_k, where k ∈ {0, 1, . . ., n − 1} and
∨ and ⊕ stand for the OR and exclusive-OR operations, respectively. Next, using the
periodic property of the modulus (2^n + 1), the bits in positions higher than n − 1
are one's complemented and mapped into the LSBs (see Figure 4.8c):

|p_{i,j}·2^{n+k}|_{2^n+1} = |p̄_{i,j}·2^k + 2^{n+k}|_{2^n+1}

Note that we can next use the identity

|s·q_k·2^{n+k}|_{2^n+1} = |s·q̄_k·2^k + s·2^{n+k}|_{2^n+1}   (4.16)

By summing all the s·2^{n+k} terms for k ∈ (0, 1, . . ., n − 1), we get |2s|_{2^n+1}. Note
also that x_n·y_n·2^{2n} will yield x_n·y_n, which can be moved to the LSB. Since in the first and
last rows only one bit can be different from zero, we can combine them as shown in
Figure 4.8d. Thus, the size of the partial product bit matrix has been reduced from
(n + 1)^2 to n^2. A correction term is added to take into account the mod (2^n + 1) operation
needed. All these n words are added next using a final modulo (2^n + 1) adder.
The number of partial products is n/2 for n even and (n + 1)/2 for n odd, except
for one correction term. This multiplier receives full inputs and avoids (n + 1)-bit
circuits. It uses inverted end-around-carry CSA adders and one diminished-1 adder.
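The two remapping congruences above can be verified exhaustively for a small n (an illustrative check, not part of [25]):

```python
# All congruences are mod 2^n + 1; complementing a bit p gives 1 - p.
n = 4
m = (1 << n) + 1
for k in range(n):
    for p in (0, 1):
        # |p*2^(n+k)| = |p_bar*2^k + 2^(n+k)|
        assert (p << (n + k)) % m == (((1 - p) << k) + (1 << (n + k))) % m
    for s in (0, 1):
        for q in (0, 1):
            # eq. (4.16): |s*q*2^(n+k)| = |s*q_bar*2^k + s*2^(n+k)|
            assert (s * q << (n + k)) % m == (
                (s * (1 - q) << k) + (s << (n + k))) % m
print("identities hold for n =", n)
```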
Figure 4.8 Architecture of the mod (2^n + 1) multiplier: (a) the (n + 1) × (n + 1)
partial-product bit array divided into four groups, (b) the leftmost diagonal and bottom
row merged into a row of sq_k bits, (c) the bits above position n − 1 complemented and
mapped into the LSBs and (d) the reduced n × n bit matrix (adapted from [25] ©IEEE1993)
Efstathiou, Vergos, Dimitrakopoulos and Nikolos [26] have described a mod
(2^n + 1) multiplier with diminished-1 representation for both operands A and B.
They rewrite the partial products by inverting the MSBs beyond the nth bit
position and show that a correction factor of 0 will be required (see Figure 4.9).
They compute AB + A + B. Note that an all-zero carry vector is also added. All
(n + 3) of the n-bit words are added using a CSA with the inverted end-around
carry being added as the LSB. The authors use a Dadda tree for adding all the partial
products, where some simplification is possible by noting that the three bits
a_{n−1}b_0, a_{n−1} and b_{n−1} can be added using a simplified full-adder (SFA). The final
diminished-1 adder uses a parallel-prefix architecture with the carry being inverted
and re-circulated at each prefix level [27].

Figure 4.9 Diminished-1 mod (2^8 + 1) multiplier: partial products PP_0–PP_7 (rotated
words of A with inverted wrapped bits), PP_8 = A, PP_9 = B and an all-zero PP_10
(adapted from [26] ©IEEE2005)

The area and delay requirements of this design using the unit-gate model are
8n^2 + (9/2)·n·log2 n + (n/2) + 4 gates and 4d(n + 3) + 2·log2 n + 2 gate delays if n = 4, 5, 7,
8, 11, 12, 17 or 18, and 4d(n + 3) + 2·log2 n + 4 gate delays otherwise.
Note that in another architecture, due to Ma [28], bit-pair recoding is used to
reduce the number of partial products, but it accepts only diminished-1 operands.
The number of partial products is reduced to n/2. The result of the carry-save addition
is two words (SUM and CARRY), R_0 and R_1, each of length n + (log2 n + 1) bits. These
words are written as R_0 = 2^n·M_0 + R_{0L} and R_1 = 2^n·M_1 + R_{1L}.
Thus, we have

(R_0 + R_1) mod (2^n + 1) = (((R_{0L} + M̄_0 + M̄_1 + 1) + R_{1L} + 1) + 2) mod (2^n + 1)   (4.17)

where M̄_0 and M̄_1 are the one's complements of M_0 and M_1. All four words
can be added using two stages of MCSA (mod (2^n + 1) CSA) followed by a final
mod (2^n + 1) CPA with a carry input of 1. The first MCSA computes the value of
the sum in the inner bracket, R_{0L} + M̄_0 + M̄_1 + 1, and the second MCSA computes
the value of the sum in the middle bracket. The CPA only adds the carry-in of 1, since
a diminished-1 result is desired.
Considering that a Dadda tree is used in place of the CSA array in the above
technique suggested by Ma [28], Efstathiou et al. [26] show that the area and
time requirements in terms of unit gates are

6n^2 + (9/2)·n·log n + (27/2)·n + 7·log2⌈n/2⌉ − 14·log2⌊n/2⌋ + 1  and
20 + 4d((n/2) + 1) + 2·log2 n.
Chaves and Sousa [29, 30] have slightly modified the formulation of
Zimmermann [16] by adding the 2^n correction term (see (4.9)) without needing
additional multiplexers. They realize

(XY) mod (2^n + 1) = (Σ_{i=0}^{n−1} (PP_i + 1) + 2 + (y_n·X′)‾ + (x_n·Y′)‾ + 4) mod (2^n + 1)   (4.18)

where X = 2^n·x_n + X′, Y = 2^n·y_n + Y′ and (·)‾ denotes one's complementation. When
x_n = 1 or y_n = 1, the relationship

(XY) mod (2^n + 1) = (2^n·y_n·X′ + 2^n·x_n·Y′) mod (2^n + 1)
                   = ((y_n·X′)‾ + 2 + (x_n·Y′)‾ + 2) mod (2^n + 1)   (4.19)
                   = ((y_n·X′)‾ + (x_n·Y′)‾ + 4) mod (2^n + 1)

has been used in (4.18). Denoting PP_n = (y_n·X′)‾ and PP_{n+1} = (x_n·Y′)‾, we have

(XY) mod (2^n + 1) = (Σ_{i=0}^{n+1} (PP_i + 1) + 4) mod (2^n + 1)   (4.20)

Note that the case x_n = y_n = 1 is handled by adding an LSB of 1 using an OR gate.
Sousa and Chaves [31, 32] have described a universal architecture for mod (2^n + 1)
multipliers using diminished-1 representation as well as ordinary representation,
together with Booth recoding. Denoting P = P′ + 1, X = X′ + 1 and Y = Y′ + 1, we have

P′ + 1 = (X′ + 1)(Y′ + 1) mod (2^n + 1)   (4.21a)

or

P′ = (|y·x|_{2^n+1} + ȳ_n·x + x̄_n·y + (x_n ∨ y_n)·2^n) mod (2^n + 1)   (4.21b)

In the case of normal representation, we have

P = (|x·y|_{2^n+1} − y_n·x − x_n·y + (x_n ∧ y_n)) mod (2^n + 1)   (4.21c)

where z_n is the nth bit and z denotes the remaining n least significant bits of the
number Z, and ∨, ∧ stand for the OR and AND operations.
Modified Booth recoding can be applied to the efficient computation of
|y·x|_{2^n+1}. As has been mentioned before, the number of partial products is n/2
Figure 4.10 Example of (a) diminished-1 and (b) ordinary modulo (2^8 + 1) multipliers (adapted
from [32] ©IEEE2005)
in case n even due to the periodic property of the modulus. As an illustration, a mod
257 multiplication of the two numbers 105 and 90 is presented in Figure 4.10a, b in
diminished-1 and normal representation. The partial products are obtained by left
shifting in the case of digits +2 or -2 and complementing the bits beyond the 8th bit
position. Consider for example P_{2x1} = (101)_2 (see Figure 4.10b) corresponding to a
digit of -1 for normal representation. Since this digit has a place value of 4, we need
to find -4 x 105. Left shifting the byte by two bits, inverting the two MSBs and
appending them as LSBs, and one's complementing all the bits, we obtain
01011001 = 89 (decimal) as shown, whereas the actual value is (-420) mod 257 = 94.
Thus, a correction of (3 + 2) = 5 needs to be added, where the term 3 arises from
the one's complementing of the two shifted bits due to multiplication by 4, and,
since one's complementing to make the PP negative amounts to adding 255, the
second term is needed to make the total addition zero mod 257. In the case of positive
partial products, for example the fourth digit 1 with place value 2^6, we only need
to left shift by 6 bits and move the inverted six MSBs to the LSB positions; no one's
complementing is needed. The value of this PP is 101, whereas the original value is
(64 x 105) mod 257 = 38. Accordingly, a correction word needs to be added.
In the case of normal representation, depending on the values of x_n and y_n, viz.,
00, 01, 10, and 11, we need to add (see (4.21c)) 0, -x, -y, and -x - y + 1,
respectively. In the case of diminished-1 representation, depending on the values
of x_n and y_n, viz., 00, 01, 10, and 11, we need to add (see (4.21b)) x + y, y, x and 1,
respectively. Note that in both the normal and diminished-1 representations,
y or -y can be combined with the least significant Booth recoded digit as bit
y_{-1}, which is unused. Chaves and Sousa [32] have derived a closed-form expression
for the correction term CT.
The various PPs are added using diminished-1 adders, which inherently add 1, in a
Wallace tree configuration, and the final SUM and CARRY vectors are added using
a modulo (2^n + 1) CPA. The OR of x_n and y_n is used to determine the nth bit for the
diminished-1 representation, whereas the nth bit of the product is generated by the
modulo (2^n + 1) CPA in the case of ordinary representation.
The area and delay of these designs considering the unit-gate model for diminished-1
and ordinary representations are as follows:

Diminished-1: {9n^2/2} + {7(n/2 - 1)n} + {21n} + (3/2) n \lceil \log_2 n \rceil + 8n,
Ordinary:     {9n(n+1)/2} + {7(n/2 - 1)n} + {28n} + (3/2) n \lceil \log_2 n \rceil + 8n gates

and

Diminished-1: {6} + {4d(\lfloor n/2 \rfloor + 1)} + {0} + 2 \lceil \log_2 n \rceil + 6,
Ordinary:     {5} + {4d(\lfloor n/2 \rfloor + 1)} + {4} + 2 \lceil \log_2 n \rceil + 6 unit-gate delays

where the three terms in braces correspond to the PPG (partial product generator), the CSA
(carry-save adder) and the COR (correction unit). A Sklansky parallel-prefix structure
with a fast output incrementer has been used for realizing the mod (2^n + 1) CPA,
following Zimmermann [16].

Chen et al. [33] have described an improved multiplier for mod (2^n + 1) to be
used in IDEA. This design caters for both the non-zero-input and zero-input cases. In the
non-zero-input case, the partial product matrix has only n PPs, whose MSBs are
inverted and put in the LSB positions. Next, these are added using a CSA based on a
Wallace tree, which inverts the carry bit and puts it in the LSB position of the next
adder. The next stage uses a 4-bit CLA in place of the 2-bit CLA suggested by Zimmermann
[16]. Next, for zero handling, a special adder is used to find \overline{a} + 2
(since (2^n a) mod (2^n + 1) = (-a) mod (2^n + 1) = \overline{a} + 2), where a is the input
when b is zero, and similarly to find \overline{b} + 2 when a is zero. The actual output or the
output of the special adder is selected using OR gates.
Vergos and Efstathiou [34] have described a mod (2^n + 1) multiplier for the
weighted binary representation by extending the architecture of [25]. The architecture
in [25] suffers from the disadvantage that three n-bit parallel adders are
connected in series and a final row of multiplexers is needed. It uses slightly
different partial products from those of Wrzyszcz (see Figure 4.8); the correction
factor, however, is a constant (= 3). It also has (n + 1) n-bit partial products.
They suggest the use of an inverted EAC parallel adder so that the correction factor
becomes 2 instead of 3. Further, the MSB of the result is computed as the group
propagate signal out of the n bits of the final inverted EAC adder. The reader is
referred to [34] for more details.

The area of this multiplier is 8n^2 + (9/2) n \lceil \log n \rceil - (13/2) n + 9, which is smaller
than that of [26]. The delay is 18 if n = 4; 4d(n + 1) + 2 \lceil \log n \rceil + 6 if
(n + 1) is a number of the Dadda sequence; and 4d(n + 1) + 2 \lceil \log n \rceil + 4 otherwise.
Chen et al. [35] have described mod (2^n + 1) multipliers with one operand B in
weighted form (n + 1 bits) and another operand A in diminished-1 form. The
product P is in (n + 1)-bit weighted binary form.
This multiplier uses pure radix-4 Booth encoding without needing any changes,
unlike [31] and [32]. The multiplier accepts full input and does not need
any conversion between weighted and diminished-1 forms. The multiplier uses
n-bit circuits only. This needs considering the two cases \overline{a_n + b_n} = 1 and \overline{a_n + b_n} = 0.
In the case \overline{a_n + b_n} = 1, i.e. a_n = 0 and b_n = 0, when n is even, the Booth
encoding of B gives

B = -b_{n-1} + b_0 - 2b_1 + \sum_{i=1}^{n/2-1} (b_{2i-1} + b_{2i} - 2b_{2i+1}) 2^{2i}.

(Note that since the most significant digit is (b_{n-1} + b_n - 2b_{n+1}) = b_{n-1} and since
b_{n-1} 2^n \equiv -b_{n-1}, we have combined this term with the least significant digit.) The
first term evidently needs a hard multiple (the digit can be -3 when b_0 = 0, b_1 = b_{n-1} = 1).
In order to avoid this hard multiple, the authors suggest considering the cases b_{n-1} = 0
and 1 separately. Note that B can be written as

B = | \sum_{i=0}^{K-1} E_i 2^{2i} |_{2^n+1}    (4.22)

where E_i = b_{2i-1} + b_{2i} - 2b_{2i+1}, b_{-1} = 0, K = n/2 for n even and (n + 1)/2 for
n odd. Note also that E_i can be 0, ±1, ±2. Noting that

P = |A \cdot B|_{2^n+1} = |d(A \cdot B) + 1|_{2^n+1} = |d(A) \cdot B + B|_{2^n+1}    (4.23a)

and substituting for B from (4.22), we have

P = | K + \sum_{i=0}^{K-1} d(A E_i 2^{2i}) |_{2^n+1}    (4.23b)

In order to retain the operands as n bits, the partial products of diminished-1
numbers to be accumulated are written as K partial products PP_0, PP_1, ..., PP_{K-1}
and K correction bits c_0, c_1, ..., c_{K-1}. When E_i = 0, PP_i = 2^n - 2^{2i} and
c_i = 2^{2i}, whereas when E_i \neq 0, PP_i = d(A E_i 2^{2i}) and c_i = 0 (for example,
if E_0 = 1, PP_0 = d(A) and c_0 = 0). This leads to the expression for P as

|A \cdot B|_{2^n+1} = | \sum_{i=0}^{K-1} PP_i + \sum_{i=0}^{K-1} c_i + K |_{2^n+1}    (4.24)

where K = n/2 for n even and (n + 1)/2 for n odd.


In the case of \overline{a_n + b_n} = 0, two cases arise. If b_n = 1, |AB|_{2^n+1} =
|-A|_{2^n+1} = | \overline{d(A)} + 1 |_{2^n+1}. In the case of a_n = 1, since d(A) = d(0),
|AB|_{2^n+1} = 0. This can be taken care of by introducing \overline{a_n} into E_i as
E_i = \overline{a_n} (b_{2i-1} + b_{2i} - 2b_{2i+1}). The authors suggest combining both the cases
considered above as follows: for n even,

E_0 = \overline{a_n} { (b_n \vee b_{n-1}) + b_0 - 2((b_n \vee b_{n-1}) \oplus b_1) },  E_1 = \overline{a_n} (b_1 \overline{b_{n-1}} + b_2 - 2b_3),
E_i = \overline{a_n} (b_{2i-1} + b_{2i} - 2b_{2i+1}),  1 < i < K    (4.25a)

and for n odd

E_0 = \overline{a_n} { b_n + b_0 - 2(b_n \vee b_1) },  E_i = \overline{a_n} (b_{2i-1} + b_{2i} - 2b_{2i+1}),  0 < i < K    (4.25b)

This leads to the expression for P as

P = |AB|_{2^n+1} = | \sum_{i=0}^{K-1} PP_i + C + K |_{2^n+1}    (4.26)

where C = \sum_{i=0}^{K-1} c_i. Note that the correction bits can be merged into a single
correction term C of the form ...0x_{i+1}0x_i0...x_10x_0, since each c_i = 2^{2i} occupies
an even bit position. The area and delay of these multipliers can be estimated as
7n^2 + (9/2) n \lceil \log_2 n \rceil - 4n + 11 and 4D(n/2 + 1) + 2 \lceil \log_2 n \rceil + 9, respectively.
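The digit set (4.25a) together with the PP_i/c_i construction of (4.24) can be exercised in software. The following behavioural model (for n even; the function name is illustrative and not from [35]) checks that the reconstruction is exact:

```python
def chen_mul_mod(A, B, n):
    """Behavioural model of (4.24)-(4.26) for n even: Booth digits per
    (4.25a), partial products PP_i and correction bits c_i."""
    M = 2**n + 1
    K = n // 2
    an = 1 if A == 0 else 0                 # diminished-1 zero flag of operand A
    b = [(B >> i) & 1 for i in range(n + 1)]
    E = [0] * K
    E[0] = (1 - an) * ((b[n] | b[n-1]) + b[0] - 2 * ((b[n] | b[n-1]) ^ b[1]))
    if K > 1:
        E[1] = (1 - an) * (b[1] * (1 - b[n-1]) + b[2] - 2 * b[3])
    for i in range(2, K):
        E[i] = (1 - an) * (b[2*i-1] + b[2*i] - 2 * b[2*i+1])
    acc = K                                 # the +K constant of (4.24)
    for i, Ei in enumerate(E):
        if Ei == 0:
            acc += (2**n - 2**(2*i)) + 2**(2*i)   # PP_i + c_i for a zero digit
        else:
            acc += (A * Ei * 2**(2*i) - 1) % M    # PP_i = d(A*E_i*2^{2i}), c_i = 0
    return acc % M

n = 6
for A in range(2**n + 1):
    for B in range(2**n + 1):
        assert chen_mul_mod(A, B, n) == (A * B) % (2**n + 1)
```

The exhaustive check at n = 6 covers the b_{n-1} = 1 and b_n = 1 corner cases that motivate the modified digits E_0 and E_1, as well as the zero operand A (a_n = 1).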
The authors show that the new multipliers are area and time efficient compared
to the multipliers due to Zimmermann [16], Sousa and Chaves [32], Efstathiou et al. [26]
and Vergos and Efstathiou [34].
Chen and Yao [36] have considered mod (2^n + 1) multipliers with both operands in
diminished-1 form. The procedure is similar to that of the case considered above.
The result is in diminished-1 form and is given by

d(AB) = | \sum_{i=0}^{K-1} PP_i + \overline{C} + d[1] + K + 1 |_{2^n+1}    (4.27)

with the definitions of C and K as before. Note that in the case \overline{a_n + b_n} = 0, the a_n
and b_n signals are used in the Booth encoder block to make the output zero irrespective
of the other inputs. Note also that \overline{C} is the one's complement of C. An inverted EAC CSA
can be used to compute (4.27), followed by a diminished-1 modulo (2^n + 1) adder.
The area and delay of these multipliers can be shown to be 7n^2 + (9/2) n \lceil \log n \rceil - 0.5n + 6,
and 11 + 4d(n/2 + 1) + 2 \lceil \log n \rceil for n = 4, 6, 10 and 9 + 4d(n/2 + 2) + 2 \lceil \log n \rceil
otherwise.
Vassalos, Bakalis and Vergos [37] have described Booth encoded modulo (2^n ± 1)
multipliers where the operand B is Booth encoded whereas A is in normal binary
form. They use the formulation of Chen and Yao for realizing the modulo (2^n + 1)
multiplier. They observe the similarities and differences in the partial products
corresponding to the Booth encoded digits B_i of 0, ±1, ±2 for both moduli and suggest
using multiplexers/inverters (using XOR gates) to select appropriately the input bits
to form the partial products. The mod (2^n + 1) multiplier needs correction terms to be
added; the partial product reduction stage can be the same except for the addition of this
correction term. This architecture also handles the zero-input case.
Note that the zero-output case means either A or B is zero, both A and B are zero, or
cases such as (3 x 3) mod 9 = 0. Zero indication of the result can be given
by p_z = a_z \vee b_z \vee D_{n-1:0}, where a_z and b_z are the zero indication signals of A and B and
D_{n-1:0} indicates that the final adder inputs are bitwise complementary. By forcing the
BE output to zero using the a_z and b_z signals, the n-bit adder final output can be forced
to zero.
The area and delay of these multipliers are (13/2) n^2 + (11/8) n \log n + (155/8) n + 4
unit gates and 2 \log n + 15 + 5d(n/2 + 2) unit-gate delays, respectively. These consist
of \lceil (n+1)/2 \rceil Booth encoder blocks, n \lceil (n+1)/2 \rceil Booth selector blocks, \lceil (n+1)/2 \rceil CSAs each
comprising n FAs or HAs, one 2-input XOR gate, and a parallel-prefix mod (2^n ± 1)
adder. Further, \lceil (n+1)/2 \rceil^2 XOR gates and \lceil (n+1)/2 \rceil AND gates are needed for
deriving the modulo-dependent partial product bits, \lceil (n+1)/2 \rceil AND gates for forming the
CT vector, one three-input XOR gate for the zero indication bit of the result, and one
XOR gate at the input of the LS BE block (in the case of even n).
Jaberipur and Alavi [38] have described an architecture for finding (XY) mod
(2^n + 1) using double-LSB encoding of residues. The use of two LSBs can
accommodate the residue 2^n as two words, one of value 2^n - 1 and an additional
1-bit word of value 1. Thus, the partial products need to be doubled to cater
for the extra LSB bit in the multiplier Y (see Figure 4.11a). Next, the periodic
property of residues is used to map the bits above 2^n to the LSBs as negabits of
value -1 in the LSB positions, since 2^{n+i} mod (2^n + 1) = (-1) 2^i mod (2^n + 1). Next,
special adders which can accept inputs of either polarity can be used to add the
positive bits (posibits) and negabits. A variety of special full/half adders are possible to
handle different combinations of posibits and negabits. The partial products
corresponding to the multiplier are shown in Figure 4.11b, where black colored bits
are posibits and white colored bits are inverted negabits. These partial products
can be arranged as n + 3 words which can be added using a CSA with inverted
EAC followed by a conventional n-bit adder.
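The periodic property used here (and throughout this section), 2^{n+i} ≡ -2^i (mod 2^n + 1), means a double-width value p = H·2^n + L reduces to L - H. A minimal sketch (helper name illustrative):

```python
def fold_mod_2n_plus_1(p, n):
    """Reduce a double-width value using 2^{n+i} == -2^i (mod 2^n + 1):
    with p = H*2^n + L, p is congruent to L - H."""
    mask = (1 << n) - 1
    low, high = p & mask, p >> n
    return (low - high) % ((1 << n) + 1)

n = 8
for x in range(257):
    for y in range(257):
        assert fold_mod_2n_plus_1(x * y, n) == (x * y) % 257
```

The two worked products of Figure 4.10 fold the same way: 9645 reduces to 136 and 9450 to 198 modulo 257.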
Muralidharan and Chang [39] have described radix-4 Booth encoded multi-
modulus multipliers. A unified architecture catering for the three moduli 2^n, 2^n - 1
and 2^n + 1 is developed. They show that for computing (X · Y) mod m, we need to
compute

|Z|_m = | \sum_{i=0}^{n/2-1} 2^{2i} d_i X |_m    for m = 2^n - 1 or 2^n
      = | \sum_{i=0}^{n/2-1} 2^{2i} d_i X + Y |_m    if m = 2^n + 1    (4.28)

where d_i = y_{2i-1} + y_{2i} - 2y_{2i+1}. Note that in the case m = 2^n + 1, the diminished-1
representation is used for X and Y. Note that in the radix-2^2 Booth encoder,

y_{-1} = y_{n-1}    if m = 2^n - 1,
      = 0          if m = 2^n,    (4.29)
      = \overline{y_{n-1}}    if m = 2^n + 1.

This is realized using a 3:1 multiplexer. A multi-modulus radix-2^2 Booth encoder
for n = 4 is shown in Figure 4.12a, together with the 3:1 multiplexer in (b). The
multiplexer is controlled by the ModSel0 and ModSel1 inputs; the (ModSel1, ModSel0)
values '00', '01' and '10' correspond to the moduli 2^n - 1, 2^n and 2^n + 1,
respectively. The partial products can be obtained in the case of mod (2^n - 1) easily
by rotation, and complementing in the case of negative d_i. On the other hand, in the case of
mod 2^n, left shifting and complementing are needed. These operations need correction
terms (or bias). The authors show that the correction terms for all the partial

Figure 4.11 (a) Partial products in double LSB multiplication and (b) partial products of modulo
(2n + 1) multiplication (adapted from [38] ©IEEE2010)

Figure 4.12 (a) Multi-modulus radix-2^2 Booth encoder, (b) 3:1 multiplexer and (c) multi-modulus partial product generation for radix-2^2 Booth encoding for n = 4 (adapted from [39] ©IEEE2013)

products can be combined into a single closed-form expression comprising two
parts: a static bias K_1 and a dynamic bias K_2. The static bias is independent of d_i
whereas the dynamic bias depends on d_i. The reader is urged to refer to [39] for
details.
The partial products PP_i for i = 0, ..., (n/2 - 1) are formed by the Booth
selector (BS) from pp_{ij}, 0, or \overline{pp_{ij}} in the least significant 2i-bit positions for the moduli
2^n - 1, 2^n and 2^n + 1, respectively. When j = 2i also, the input to the BS2 block
is x_{n-1}, 0, or \overline{x_{n-1}} for modulus 2^n - 1, 2^n and 2^n + 1, respectively. Thus, the input to
the BS2 block is also selected using a MUX3. The design for the case n = 4 is presented
in Figure 4.12c for illustration.
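The modulus-dependent partial product formation described above, cyclic rotation for 2^n - 1 versus a plain left shift for 2^n, can be sketched as follows (an illustrative model, not the multiplexed gate network):

```python
def pp_shift(x, shift, n, m):
    """Partial product x*2^shift for modulus 2^n - 1 (rotation) or 2^n (shift)."""
    mask = (1 << n) - 1
    if m == (1 << n) - 1:
        # cyclic rotation implements multiplication by 2^shift mod (2^n - 1)
        return ((x << shift) | (x >> (n - shift))) & mask
    # modulo 2^n the carried-out bits are simply dropped
    return (x << shift) & mask

n = 4
for x in range(2**n - 1):          # note: for m = 2^n - 1 the all-ones word aliases zero
    for s in range(1, n):
        assert pp_shift(x, s, n, 2**n - 1) == (x * 2**s) % (2**n - 1)
        assert pp_shift(x, s, n, 2**n) == (x * 2**s) % 2**n
```

For the 2^n + 1 channel the rotated-in bits are additionally complemented, which is what generates the correction bias K_1/K_2 discussed above.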
The multi-modulus addition of the n/2 + 3 partial products is given as

|Z|_m = | \sum_{i=0}^{n/2-1} PP_i + K_1 + K_2 + 0 |_m    in the case of m = 2^n - 1 or m = 2^n

      = | \sum_{i=0}^{n/2-1} PP_i + K_1 + K_2 + Y |_m    in the case of m = 2^n + 1    (4.30)

Figure 4.13 (a) Multi-modulus partial product addition for radix-2^2 Booth encoding and (b) details of the components in (a) (adapted from [39] ©IEEE2013)

Note that in the carry feedback path a MUX will be needed to select c_out, 0 or
\overline{c_out}. A CSA tree followed by a parallel-prefix adder is needed for adding the ((n/2)
+ 3) partial products, as illustrated in Figure 4.13a, where the pre-processing and
post-processing blocks are also marked. In Figure 4.13b, details of the various blocks in (a) are
presented.
Muralidharan and Chang [39, 40] have described unified mod (2^n - 1), mod 2^n
and mod (2^n + 1) multipliers for radix-8 Booth encoding to reduce the number of
partial products to \lfloor n/3 \rfloor + 1. In these cases, considering that the diminished-1
representation is used for the mod (2^n + 1) case, the product can be written as [39]

(Z)_m = | \sum_{i=0}^{\lfloor n/3 \rfloor} X d_i 2^{3i} |_m    for m = 2^n - 1 or m = 2^n    (4.31a)

and

(Z)_m = | \sum_{i=0}^{\lfloor n/3 \rfloor} X d_i 2^{3i} + X + Y |_m    for m = 2^n + 1    (4.31b)

The result of the multiplication can be expressed in this case as

(Z)_m = | \sum_{i=0}^{\lfloor n/3 \rfloor} PP_i |_m    if m = 2^n - 1
      = | \sum_{i=0}^{\lfloor n/3 \rfloor} PP_i + \sum_{i=0}^{\lfloor n/3 \rfloor} K_i |_m    if m = 2^n    (4.32a)

and

(Z)_m = | \sum_{i=0}^{\lfloor n/3 \rfloor} PP_i + \sum_{i=0}^{\lfloor n/3 \rfloor} K_i + X + Y |_m    if m = 2^n + 1    (4.32b)

Note that d_i can be 0, ±1, ±2, ±3 and ±4. The hard multiples ±3X are obtained using
customized adders by adding X and 2X and reformulating the carry equations for the
moduli (2^n - 1) and (2^n + 1) for the even and odd cases of i, respectively. The
modulo (2^n - 1) hard multiple generator (HMG) follows the Kalampoukas
et al. adder [41] in the case of the modulus (2^n - 1) and the Vergos et al. adder [27] in
the case of the modulus (2^n + 1).
The authors have shown that the multi-modulus multipliers save 60 % of the area
over the corresponding single-modulus multipliers. They increase the delay by
about 18 % and 13 %, respectively, for the radix-4 and radix-8 cases, and the power
dissipation is higher by 5 %.
The area of the mod (2^n - 1) multiplier in terms of unit gates [40] using radix-8
Booth encoding is 25.5 n \lfloor n/3 \rfloor + 23.5 (\lfloor n/3 \rfloor + 1) + 14.5n + 6n \lceil \log_2 n \rceil, and that of the
mod (2^n + 1) multiplier using radix-8 Booth encoding is 1.5 \lfloor n/3 \rfloor^2
+ 25.5 n (\lfloor n/3 \rfloor + 1) + 28 (\lfloor n/3 \rfloor + 1) + 18.5n + 12n \log_2 n + 52.5. The unit-gate area
and time of the radix-4 Booth encoded triple-moduli multiplier are 6.75 n^2 + 14.5n
+ 6 and 12 + 7d(n/2 + 3), where d(N) is the Dadda depth function for N inputs. These
may be compared with those of the dual-modulus radix-4 Booth multiplier due to [37], which are
6.5 n^2 + 8n + 2 and 12 + 6d((n/2) + 2), respectively.

4.4 Modulo Squarers

In many applications, such as adaptive filtering and image processing, squaring mod m
will be of interest. The use of a modulo multiplier may not be economical in such
applications. Piestrak [42] has suggested using custom designs for realizing compact
low-area squarers mod m. By writing the partial products and noting that two
identical elements in any column can be written in the next column as a single entry
(e.g. two a_0a_1 terms in column 1 can be written in column 2 as a single entry a_0a_1),
the product matrix can be simplified. As an illustration, for a 5-bit squarer we obtain
the matrix shown in Figure 4.14.
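The column-merging rule rests on the identity x^2 = \sum_i x_i 2^{2i} + \sum_{i<j} x_i x_j 2^{i+j+1}: the duplicated pair x_i x_j in column i + j becomes a single entry one column to the left. A short check (illustrative, not Piestrak's circuit):

```python
def square_via_folded_matrix(x, nbits):
    """Square x by summing the folded partial-product matrix: each
    duplicated off-diagonal pair appears once, one column to the left."""
    bits = [(x >> i) & 1 for i in range(nbits)]
    total = 0
    for i in range(nbits):
        total += bits[i] << (2 * i)                       # diagonal: x_i^2 = x_i
        for j in range(i + 1, nbits):
            total += (bits[i] & bits[j]) << (i + j + 1)   # merged pair entry
    return total

for x in range(32):
    assert square_via_folded_matrix(x, 5) == x * x
```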
Piestrak [42] has next suggested the use of the periodic properties of the modulus to
reduce the number of columns, by rewriting bits in the columns above the period in
the lower columns, without inversion in the case of full period or after inversion in the
case of moduli having half period. A final correction word is needed in the case of
moduli of the form 2^a + 1, e.g. 33. In the case of a modulus of the form 21, with period 6, a
six-column array results; all the 6-bit words are added with EAC and a final
modulo reduction is carried out to get the residue. The architectures
for both cases are presented in Figure 4.15a and b, respectively.
Note that p_{i,j} indicates a bit product a_i a_j. In the case of Figure 4.15a, using a CSA
with EAC, all the bits can be summed and the final modulo 21 reduction can be carried
out by the 7-input generator mod 21 block, which also uses the periodic property.
Similarly, in the case of the modulus 33 as well, the bits can be added using a CSA
with inverted EAC together with a correction term, and the final modulo 33 reduction
can be performed.
Piestrak [42] has also observed that for certain moduli, instead of using a
CSA/CPA architecture, direct logic functions can be used to economically realize
a modulo squarer. As an illustration, for mod 11 the output bits can be expressed as

s_3 = x_3 \overline{x_1} \overline{x_0} + \overline{x_2} x_1 x_0,  s_2 = x_3 x_0 + x_2 (x_1 x_0 + \overline{x_1} \overline{x_0}) + \overline{x_3} \overline{x_2} x_1 \overline{x_0},
s_1 = x_2 (x_1 \oplus x_0),  s_0 = x_2 + (x_3 \oplus x_0)    (4.33)
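The Boolean equations above were reconstructed from a damaged original (the complement bars did not survive extraction), so it is worth checking them against the defining truth table, x^2 mod 11 for residues 0 through 10:

```python
def mod11_square_bits(x):
    """Direct-logic mod-11 squarer for inputs 0..10, per the Boolean
    equations given in (4.33)."""
    x3, x2, x1, x0 = (x >> 3) & 1, (x >> 2) & 1, (x >> 1) & 1, x & 1
    s3 = (x3 & ~x1 & ~x0 | ~x2 & x1 & x0) & 1
    s2 = (x3 & x0 | x2 & (x1 & x0 | ~x1 & ~x0) | ~x3 & ~x2 & x1 & ~x0) & 1
    s1 = x2 & (x1 ^ x0)
    s0 = (x2 | (x3 ^ x0)) & 1
    return (s3 << 3) | (s2 << 2) | (s1 << 1) | s0

for x in range(11):
    assert mod11_square_bits(x) == (x * x) % 11
```

Inputs 11 through 15 are don't-cares, which is what allows such small two-level expressions.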

Paliouras and Stouraitis [43] have described multifunction architectures
(MA) for RNS processors which can perform binary to RNS conversion, multiplication,
base extension, and QRNS (quadratic residue number system) processing.
These are of two types: fixed multifunction architectures (FMA) and variable
multifunction architectures (VMA). Fixed architectures share only the front end,
can perform operations simultaneously for various residue channels and have

Figure 4.14 Bit matrix of a 5-bit squarer



Figure 4.15 Squarers modulo 21 and modulo 33 following Piestrak (adapted from [42] ©IEEE2002)

separate hardware units for the various moduli. On the other hand, VMAs achieve
hardware savings at the possible expense of delay and at the cost of decreased
parallelism, since they share the hardware among several moduli. VMAs can be used in
serial-by-modulus (SBM) architectures [44] where not all moduli channels are
processed in parallel.
As an illustration, for the moduli set {2^n, 2^n - 1, 2^n + 1}, squarers can use the VMA
and FMA approaches [45]. The bit matrices for modulus (2^n - 1) and modulus (2^n + 1) in both
these cases have only a few differing entries in some columns, and
hence one of them can be selected using multiplexers. However, a correction term
needs to be added in the case of modulus (2^n + 1). The correction term can be added
in a separate level after the parallel-prefix adder as in Zimmermann's technique [16]
(see Figure 4.16 for the VMA architecture for n = 5). This design uses a Sklansky
parallel-prefix structure and a MUX in the carry reinsertion path. Note that in the
case of modulus 2^n, no carry feedback is required. The outputs are
|A^2|_{2^n-1}, |A^2|_{2^n} and |A^2|_{2^n+1}.
The authors have shown that the VMA has a 15 % increase in delay over single-
modulus architectures (SMA) but needs 50 % less area for the area-optimized
synthesis and 18 % less power dissipation for the moduli set {2^n, 2^n - 1, 2^n + 1} for
n = 24. On the other hand, the FMA has area and power savings of 5 % and 10 %,
respectively, over the single-modulus architecture with similar delay.
Adamidis and Vergos [46] have described a multiplication (AB)/sum-of-squares
(A^2 + B^2) unit mod (2^n - 1) and mod (2^n + 1) which performs one of these operations

Figure 4.16 VMA squarer architecture for moduli 33, 32 and 31 (adapted from [45] ©IEEE2009)

partially sharing the same hardware. They point out the commonalities and differences
between the two cases and show that, using multiplexers controlled by a select
signal, one of the desired operations can be performed. In an alternative design, they
reduce the multiplexers by defining new variables. Multiplication time is increased
by 11 % for n ≤ 16 in the case of the mod (2^n - 1) multiplier.
In the case of the multiplier mod (2^n + 1), they consider diminished-1 operands. In
the case of the sum of squares, denoting SS = A^2 + B^2, we have

(SS*)_{2^n+1} = | (A* + 1)^2 + (B* + 1)^2 - 1 |_{2^n+1}
             = | (A*)^2 + (B*)^2 + 2A* + 2B* + 1 |_{2^n+1}    (4.34a)

whereas in the case of multiplication we have

(P*)_{2^n+1} = | A*B* + A* + B* |_{2^n+1}    (4.34b)

Thus, the diminished-1 modulo (2^n + 1) multiplier/sum-of-squares unit needs more partial
products to be added than in the case of modulo (2^n - 1). The authors have
also investigated simple as well as reduced-hardware cases as in the case of modulo
(2^n + 1). The authors have compared the unified designs with (a) design A having
only one multiplier and an adder and (b) design B having one multiplier, one squarer
and an adder, and have shown that the delay is reduced considerably over designs
A and B, typically by 53.4 % and 31.8 %, respectively, on the average.
Denoting by m the control signal that selects multiplication when m = 0 and
sum of squares when m = 1, the newly introduced variables are

c_i = a_i m \vee b_i \overline{m},  d_i = a_i \overline{m} \vee b_i m,  e_i = m (a_i \oplus b_i)    (4.35)

where \vee is the logic OR function. The partial product bit of the form a_i b_j with j > i
(j < i) is substituted with a_i c_j (d_i b_j) in the multiplication partial product bit matrix.
Bits of the form a_i b_i are retained as they are. In the case of the sum of squares, the first
column becomes the last column, implying rotation of all the bits. Note that in this
case a_i a_j is substituted by a_i c_j, b_i b_j becomes d_i b_j and a_i \oplus b_i is substituted with e_i.
In a similar manner, the case for (2^n + 1) can also be implemented. In this case also,
the change of variables can reduce the area:

c_i = a_i s \vee b_i \overline{s}, 0 ≤ i ≤ n - 1,  d_i = a_i \overline{s} \vee b_i s, 0 ≤ i ≤ n - 1,
e_i = s (a_i \oplus b_i), 0 ≤ i < n/2,  e_i = s \overline{(a_i \oplus b_i)}, n/2 ≤ i ≤ n - 1    (4.36)

Note that (n + 5) partial products are needed as against (n + 2) in the case of
mod (2^n - 1). The area and delay of the MMSSU- simple and reduced-area architectures
are A = n(11n + 3 \log n + 6) and T = 4d(n + 2) + 2 \log n + 6, and A = n(8n + 3 \log n + 13) and
T = 4d(n + 2) + 2 \log n + 8, respectively. In the case of the MMSSU+ simple and reduced-area
architectures, we have A = 11n^2 + (9/2) n \log n + (43/2) n + 6 and T = 4d(n + 4) + 2 \log n + 7,
and A = 8n^2 + (9/2) n \log n + (63/2) n + 18 and T = 4d(n + 4) + 2 \log n + 14, respectively.
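The variable substitution of (4.35) can be checked bit by bit: for m = 0 the shared matrix must reproduce the multiplication bits, and for m = 1 the sum-of-squares bits. A small exhaustive check (function name illustrative):

```python
from itertools import product

def shared_vars(ai, bi, m):
    """(4.35): m = 0 keeps the multiplication bit matrix, m = 1 turns it
    into the sum-of-squares matrix."""
    ci = (ai & m) | (bi & (1 - m))
    di = (ai & (1 - m)) | (bi & m)
    ei = m & (ai ^ bi)
    return ci, di, ei

for ai, aj, bi, bj, m in product((0, 1), repeat=5):
    cj = shared_vars(aj, bj, m)[0]
    di = shared_vars(ai, bi, m)[1]
    # a_i*c_j reproduces a_i*b_j (m = 0) or a_i*a_j (m = 1)
    assert ai & cj == (ai & (aj if m else bj))
    # d_i*b_j reproduces a_i*b_j (m = 0) or b_i*b_j (m = 1)
    assert di & bj == ((bi if m else ai) & bj)
```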
Spyrou, Bakalis and Vergos [47] have suggested, for non-Booth-encoded
squarers mod (2^n - 1), writing the terms in any column of the type (a_i a_j + a_i) 2^k
as (a_i a_j) 2^{k+1} + (a_i \overline{a_j}) 2^k (see e.g. the term a_0 + a_0 a_7 in the LSB column of Figure 4.17a for
the case n = 8). This simplification can be applied to the (n - 1)th bit position as
well (see Figure 4.17b for the case n = 8). The next simplification considers bits in
the adjacent column and simplifies as follows:

(a_i a_j) 2^{l+1} + (a_i a_k + a_j a_k) 2^l = c_{i,j,k} 2^{l+1} + s_{i,j,k} 2^l    (4.37)

where c_{i,j,k} = a_i (a_j \vee a_k) and s_{i,j,k} = a_k (a_i \oplus a_j), and \vee stands for logic OR. This
modification yields the final bit matrix shown in Figure 4.17c for the case n = 8.
Thus, a two-level CSA can be used to add the four rows, followed by a mod (2^8 - 1)
adder.
The authors have also considered Booth encoding. Denoting the Booth encoded
digits as A_i for i = 0, 1, ..., 4, the partial product matrix for an 8 x 8 squarer is shown
in Figure 4.18a, which can be rewritten as shown in Figure 4.18b. Next, note that A_i^2
has a three-bit representation whose middle bit is zero, and it is hence denoted by
C_{i,0} and C_{i,2}. On the other hand, the terms 2A_1A_0, 2A_2A_0 and 2A_3A_0 in the first row
can be simplified using the Booth folding technique [48] as a two's complement
word P_{i,6}, ..., P_{i,0}, noting that the A_i are signed digits (see Figure 4.18c). Such
words in each row can be computed using a simple ones' complementing circuit.
Next, noting the periodic property, the bits in the left half can be moved to
the right except for the sign bits. The sign bits can be combined into a single
correction term 1 1 \overline{P}_{2,2} 1 \overline{P}_{4,4} 1 \overline{P}_{0,6} 1 as shown in Figure 4.18d. The resulting number
of partial product bits to be added and the height of the bit matrix are smaller than

Figure 4.17 (a) Initial partial product matrix, (b) modified partial product matrix and (c) final partial product matrix for the squarer mod (2^8 - 1) (adapted from [47] ©IEEE 2009)

those needed for Piestrak [42] and Efstathiou et al. [17]. The non-encoded squarers
are suitable for small n whereas the Booth encoded squarers are suitable for medium
and large n.
Vergos and Efstathiou [49] and Muralidharan et al. [50] have described modulo
(2^n + 1) squarers for (n + 1)-bit operands in normal form. This design also maps the
product bits a_i a_j in the left columns beyond the (n - 1)th bit position to the right after
inversion. Only the 2nth bit has a weight 2^{2n} and hence needs to be added as an
LSB, since when a_n = 1 all other a_i are zero. A correction term of 2^n (2^n - 1 - n) will
be required in total to take care of the one's complementing of these various bits.

Figure 4.18 (a) Initial partial product matrix, (b) folded partial product matrix for the Booth
encoded design, (c) Booth-folded partial product matrix and (d) final partial product matrix for the
modulo (2^8 - 1) squarer (adapted from [47] ©IEEE2009)

Next, the duplicated terms in each column, for example a_1a_0 in the second column, are
considered as a single bit in the immediate column to the left. The duplicated bits in
the (n - 1)th bit position need to be inverted and put in the LSB position, needing another
correction of 2^n \lfloor n/2 \rfloor. After these simplifications, the matrix becomes an n x n square.
Using CSAs, the various rows need to be added and the carry bits need to be mapped into
LSBs after inversion, thus needing further correction. The authors show that for both

even and odd n, the correction term is 3. The authors suggest computing the final
result as

R = | 3 + \sum_{i=0}^{\lceil n/2 \rceil} PP*_i |_{2^n+1}    (4.38a)

where the PP*_i are the final partial products. Rewriting (4.38a) as

R = | | 2 + \sum_{i=0}^{\lceil n/2 \rceil} PP*_i |_{2^n+1} + 1 |_{2^n+1}    (4.38b)

we can treat 2 as another PP and compute the term | 2 + \sum_{i=0}^{\lceil n/2 \rceil} PP*_i |_{2^n+1} as
|C + S + 1|_{2^n+1}. The result is 2^n if C is the one's complement of S, i.e. if C + S = 2^n - 1.
Thus, the MSB of the result can be calculated distinctly from the rest, whereas
the n LSBs can be computed using an n-bit adder which adds 1 when the carry output is
zero. The authors have used a diminished-1 adder with a parallel-prefix carry computation
unit [27].
The area requirement is n(n-1)/2 + 7n(n-1)/2 + 9n(\log_2 n)/2 + n^2 + 7 in number of unit
gates, and the delays for the n odd and n even cases are D_{Bin,odd} = 4H((n+1)/2 + 1) + 2 \log_2 n + 4
and D_{Bin,even} = 4H((n+1)/2 + 1) + 2 \log_2 n + 8, where H(l) is the height of the
CSA tree of l inputs.
Vergos and Efstathiou [51] have described a diminished-1 modulo (2^n + 1)
squarer design. They observe that, defining Q = |A^2|_{2^n+1}, we have

Q* + 1 = | (A* + 1)^2 |_{2^n+1} = | (A*)^2 + 2A* + 1 |_{2^n+1}    (4.39a)

or

Q* = | (A*)^2 + 2A* |_{2^n+1}    (4.39b)

Thus, one additional term 2A* needs to be added; this is done by left shifting A*,
complementing the MSB and inserting it in the LSB position. Such dedicated squarers are
superior to multipliers used as squarers regarding both area and delay [41]. The
correction term for both the n odd and even cases is 1, which in diminished-1 form is "0".
The area and delay requirements of this design are n(n-1)/2 + 7n(n+1)/2 + 9n(\log_2 n)/2 + n + 6
and T = 4d((n+1)/2) + 2 \log_2 n + 4 time units, where d(k)
is the depth in FA stages of a Dadda tree of k operands.
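The diminished-1 squarer identity (4.39b) is easy to validate exhaustively (a behavioural sketch, not the hardware):

```python
def dim1_square(a_star, n):
    """Diminished-1 squarer per (4.39b): Q* = ((A*)^2 + 2A*) mod (2^n + 1)."""
    return (a_star * a_star + 2 * a_star) % (2**n + 1)

n = 8
for A in range(1, 2**n + 1):            # nonzero operands, A* = A - 1
    Q = (A * A) % (2**n + 1)            # Q is never zero for these A
    assert dim1_square(A - 1, n) == Q - 1
```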

Bakalis, Vergos and Spyrou [52, 53] have described modulo (2^n ± 1) squarers using radix-4 Booth encoding. They consider squarers for both normal and diminished-1 representations in the case of modulus (2^n + 1). They use the Booth folding encoding of Strollo and De Caro [48] for both the mod (2^n − 1) and mod (2^n + 1) cases. In the case of mod (2^n − 1), the partial product matrix is the same as in Spyrou et al. [47]. In the case of mod (2^n + 1) using diminished-1 representation, for even n which are multiples of 4 the correction term t is (888...8)_16, and for even n which are not multiples of 4 it is (222...2)_16, where the subscript 16 means that these are in hexadecimal form.
 
Note that the diminished-1 modulo (2n + 1) squarer computes A2 þ 2A 2n þ1 in
the case An ¼ 0. As a result, we need to add ð2AÞ2n þ1 for the normal representa-
tion, which can be written as an2 an3 . . . a0 an1 provided that an additional
correction term of 3 is taken into account. Note also that in the case A ¼ 2n,
an ¼ 1, the LSB can be modified as an1 OR an. The correction term in the case
of diminished-1 case is increased by 2.
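The identity (4.39b) is easily confirmed numerically. The sketch below is our own illustration (not from [51]): with A* = A − 1 in diminished-1 form, the diminished-1 square Q − 1 must equal ((A*)^2 + 2A*) mod (2^n + 1).

```python
# Check of (4.39b): with A* = A - 1 (diminished-1 form) and
# Q = A^2 mod (2^n + 1), the diminished-1 square Q - 1 equals
# ((A*)^2 + 2*A*) mod (2^n + 1).
def dim1_square(a_star, n):
    m = (1 << n) + 1
    return (a_star * a_star + 2 * a_star) % m

n = 4
m = (1 << n) + 1
for a in range(1, m):            # nonzero operands have a diminished-1 form
    q = (a * a) % m
    if q == 0:
        continue                 # Q = 0 has no diminished-1 representation
    assert dim1_square(a - 1, n) == q - 1
```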
In the case of odd n, the architectures are applicable provided the input operands are extended by 1 bit by adding a zero at the MSB position. The partial product matrices for the diminished-1 and normal squarers are presented in Figure 4.19a, b, where Ci = Ai × Ai for i = 0, ..., 3 and Pi = Σ_(k=i+1)^(n/2−1) 2^(2(k−1−i)) Ai Ak. The authors observe that the height of the partial product matrix is reduced compared to earlier methods. They show that their designs offer up to 38 % less implementation area than previous designs, with a small improvement in delay as well.

Figure 4.19 Partial product matrices for the mod (2^8 + 1) squarer: (a) diminished-1 case and (b) normal case (adapted from [52] © Elsevier 2011)
References

1. M.A. Soderstrand, C. Vernia, A high-speed low cost modulo pi multiplier with RNS arithmetic
applications. Proc. IEEE 68, 529–532 (1980)
2. G.A. Jullien, Implementation of multiplication modulo a prime number with application to
number theoretic transforms. IEEE Trans. Comput. 29, 899–905 (1980)
3. D. Radhakrishnan, Y. Yuan, Novel approaches to the design of VLSI RNS multipliers. IEEE
Trans. Circuits Syst. 39, 52–57 (1992)
4. M. Dugdale, Residue multipliers using factored decomposition. IEEE Trans. Circuits Syst. 41,
623–627 (1994)
5. A.S. Ramnarayan, Practical realization of mod p, p prime multiplier. Electron. Lett. 16,
466–467 (1980)
6. E.F. Brickell, A fast modular multiplication algorithm with application to two-key cryptogra-
phy, in Advances in Cryptography, Proceedings Crypto ‘82 (Plenum, New York, 1983),
pp. 51–60
7. E. Lu, L. Harn, J. Lee, W. Hwang, A programmable VLSI architecture for computing multiplication and polynomial evaluation modulo a positive integer. IEEE J. Solid-State Circuits SC-23, 204–207 (1988)
8. B.S. Prasanna, P.V. Ananda Mohan, Fast VLSI architectures using non-redundant multi-bit
recoding for computing AY mod N. Proc. IEE ECS 141, 345–349 (1994)
9. A.A. Hiasat, New efficient structure for a modular multiplier for RNS. IEEE Trans. Comput.
C-49, 170–174 (2000)
10. E.D. Di Claudio, F. Piazza, G. Orlandi, Fast combinatorial RNS processors for DSP applica-
tions. IEEE Trans. Comput. 44, 624–633 (1995)
11. P.L. Montgomery, Modular multiplication without trial division. Math. Comput. 44, 519–521
(1985)
12. T. Stouraitis, S.W. Kim, A. Skavantzos, Full adder based arithmetic units for finite integer
rings. IEEE Trans. Circuits Syst. 40, 740–744 (1993)
13. V. Paliouras, K. Karagianni, T. Stouraitis, A low complexity combinatorial RNS multiplier.
IEEE Trans. Circuits Syst. II 48, 675–683 (2001)
14. G. Dimitrakopulos, V. Paliouras, A novel architecture and a systematic graph based optimi-
zation methodology for modulo multiplication. IEEE Trans. Circuits Syst. I 51, 354–370
(2004)
15. Z. Wang, G.A. Jullien, W.C. Miller, An algorithm for multiplication modulo (2N-1), in Pro-
ceedings of 39th Midwest Symposium on Circuits and Systems, Ames, IA, pp. 1301–1304
(1996)
16. R. Zimmermann, Efficient VLSI implementation of modulo (2n±1) addition and multiplication, in Proceedings of IEEE Symposium on Computer Arithmetic, pp. 158–167 (1999)
17. C. Efstathiou, H.T. Vergos, D. Nikolos, Modified Booth modulo 2n-1 multipliers. IEEE Trans.
Comput. 53, 370–374 (2004)
18. R. Muralidharan, C.H. Chang, Radix-8 Booth encoded modulo 2n-1 multipliers with adaptive
delay for high dynamic range residue number system. IEEE Trans. Circuits Syst. I Reg. Pap.
58, 982–993 (2011)
19. G.W. Bevick, Fast multiplication: algorithms and implementation, Ph.D. Dissertation,
Stanford University, Stanford, 1994
20. A.V. Curiger, H. Bonnennberg, H. Keaslin, Regular architectures for multiplication modulo
(2n+1). IEEE J. Solid-State Circuits SC-26, 990–994 (1991)
21. X. Lai, On the design and security of block ciphers, Ph.D Dissertation, ETH Zurich, No.9752,
1992
22. A. Hiasat, New memoryless mod (2n±1) residue multiplier. Electron. Lett. 28, 314–315 (1992)
23. M. Bahrami, B. Sadeghiyan, Efficient modulo (2n+1) multiplication schemes for IDEA, in
Proceedings of IEEE ISCAS, vol. IV, pp. 653–656 (2000)
24. Z. Wang, G.A. Jullien, W.C. Miller, An efficient tree architecture for modulo (2n+1) multi-
plication. J. VLSI Signal Process. Syst. 14, 241–248 (1996)
25. A. Wrzyszcz, D. Milford, A new modulo 2α+1 multiplier, in IEEE International Conference on
Computer Design: VLSI in Computers and Processors, pp. 614–617 (1993)
26. C. Efstathiou, H.T. Vergos, G. Dimitrakopoulos, D. Nikolos, Efficient diminished-1 modulo 2n
+1 multipliers. IEEE Trans. Comput. 54, 491–496 (2005)
27. H.T. Vergos, C. Efstathiou, D. Nikolos, Diminished-1 modulo 2n+1 adder design. IEEE Trans.
Comput. 51, 1389–1399 (2002)
28. Y. Ma, A simplified architecture for modulo (2n+1) multiplication. IEEE Trans. Comput. 47,
333–337 (1998)
29. R. Chaves, L. Sousa, Faster modulo (2n+1) multipliers without Booth Recoding, in XX
Conference on Design of Circuits and Integrated Systems. ISBN 972-99387-2-5 (Nov 2005)
30. R. Chaves, L. Sousa, Improving residue number system multiplication with more balanced
moduli sets and enhanced modular arithmetic structures. IET Comput. Digit. Tech. 1, 472–480
(2007)
31. L. Sousa, Algorithm for modulo (2n+1) multiplication. Electron. Lett. 39, 752–753 (2003)
32. L. Sousa, R. Chaves, A universal architecture for designing efficient modulo 2n+1 multipliers.
IEEE Trans. Circuits Syst. I 52, 1166–1178 (2005)
33. Y.J. Chen, D.R. Duh, Y.S. Han, Improved modulo (2n+1) multiplier for IDEA. J. Inf. Sci. Eng. 23, 907–919 (2007)
34. H.T. Vergos, C. Efstathiou, Design of efficient modulo 2n+1 multipliers. IET Comput. Digit.
Tech 1, 49–57 (2007)
35. J.W. Chen, R.H. Yao, W.J. Wu, Efficient modulo 2n+1 multipliers. IEEE Trans. VLSI Syst. 19,
2149–2157 (2011)
36. J.W. Chen, R.H. Yao, Efficient modulo 2n+1 multipliers for diminished-1 representation. IET
Circuits Devices Syst. 4, 291–300 (2010)
37. E. Vassalos, D. Bakalis, H.T. Vergos, Configurable Booth-encoded modulo 2n±1 multipliers, in Proceedings of IEEE PRIME 2012, pp. 107–111 (2012)
38. G. Jabelipur, H. Alavi, A modulo 2n+1 multiplier with double LSB encoding of residues, in
Proceedings of IEEE ISCAS, pp. 147–150 (2010)
39. R. Muralidharan, C.H. Chang, Radix-4 and Radix-8 Booth encoded multi-modulus multipliers.
IEEE Trans. Circuits Syst. I 60, 2940–2952 (2013)
40. R. Muralidharan, C.H. Chang, Area-Power efficient modulo 2n-1 and modulo 2n+1 multipliers
for {2n-1, 2n, 2n+1} based RNS. IEEE Trans. Circuits Syst. 59, 2263–2274 (2012)
41. L. Kalampoukas, D. Nikolos, C. Efstathiou, H.T. Vergos, J. Kalamatianos, High speed parallel
prefix modulo (2n-1) adders. IEEE Trans. Comput. 49, 673–680 (2000)
42. S.J. Piestrak, Design of squarers modulo A with low-level pipelining. IEEE Trans. Circuits
Syst. II Analog Digit. Signal Process. 49, 31–41 (2002)
43. V. Paliouras, T. Stouraitis, Multifunction architectures for RNS processors. IEEE Trans.
Circuits Syst. II 46, 1041–1054 (1999)
44. W.K. Jenkins, A.J. Mansen, Variable word length DSP using serial by modulus residue
arithmetic, in Proceedings of IEEE International Conference on ASSP, pp. 89–92 (1993)
45. R. Muralidharan, C.H. Chang, Fixed and variable multi-modulus squarer architectures for
triple moduli base of RNS, in Proceedings of IEEE ISCAS, pp. 441–444 (2009)
46. D. Adamidis, H.T. Vergos, RNS multiplication/sum-of-squares units. IET Comput. Digit.
Tech. 1, 38–48 (2007)
47. A. Spyrou, D. Bakalis. H.T. Vergos, Efficient architectures for modulo 2n-1 squarers, in
Proceedings of IEEE International Conference on DSP 2009, pp. 1–6 (2009)
48. A.G.M. Strollo, D. De Caro, Booth folding encoding for high performance squarer circuits. IEEE Trans. CAS II 50, 250–254 (2003)
49. H.T. Vergos, C. Efstathiou, Efficient modulo 2n+1 squarers, in Proceedings of XXI Conference
on Design of Circuits and Integrated Systems, DCIS (2006)
50. R. Muralidharan, C.H. Chang, C. Jong, A low complexity modulo 2n+1 squarer design, in
Proceedings of IEEE Asia Pacific Conference on Circuits and Systems, pp. 1296–1299 (2008)
51. H.T. Vergos, C. Efstathiou, Diminished-1 modulo 2n+1 squarer design. Proc. IEE Comput.
Digit. Tech 152, 561–566 (2005)
52. D. Bakalis, H.T. Vergos, A. Spyrou, Efficient modulo 2n±1 squarers. Integr. VLSI J. 44, 163–174 (2011)
53. D. Bakalis, H.T. Vergos, Area-efficient multi-moduli squarers for RNS, in Proceedings of 13th
Euromicro Conference on Digital System Design: Architectures, Methods and Tools,
pp. 408–411 (2010)

Further Reading

B. Cao, T. Srikanthan, C.H. Chang, A new design method to modulo 2n-1 squaring, in Proceedings
of ISCAS, pp. 664–667 (2005)
A.E. Cohen, K.K. Parhi, Architecture optimizations for the RSA public key cryptosystem: a
tutorial. IEEE Circuits Syst. Mag. 11, 24–34 (2011)
Chapter 5
RNS to Binary Conversion

This important topic has received extensive attention in the literature. The choice of the moduli set in an RNS is influenced by the speed of RNS to binary conversion, since operations such as comparison, scaling, sign detection and error correction rely on efficient conversion. Both ROM-based and non-ROM-based designs will be of interest. The number of moduli to be chosen is decided by the desired dynamic range, the word length of the moduli and the ease of RNS to binary conversion. There are two basic classical approaches for converting a number from RNS to binary form, based on the Chinese Remainder Theorem (CRT) and on Mixed Radix Conversion (MRC) [1]. Several new techniques have been introduced recently, such as New CRT-I, New CRT-II, Mixed-Radix CRT, the quotient function, the core function and the diagonal function. All these will be presented in some detail.

5.1 CRT-Based RNS to Binary Conversion

The binary number X corresponding to given residues (x1, x2, x3, ..., xn) in the RNS {m1, m2, m3, ..., mn} can be derived using CRT as

    X = ( |x1 (1/M1)_(m1)|_(m1) M1 + |x2 (1/M2)_(m2)|_(m2) M2 + ... + |xn (1/Mn)_(mn)|_(mn) Mn ) mod M    (5.1)

where Mi = M/mi for i = 1, 2, ..., n and M = m1 m2 ... mn. Note that we denote hereafter p mod q as (p)_q. The quantities (1/Mj)_(mj) are known as the multiplicative inverses of Mj mod mj, defined such that (Mj (1/Mj)_(mj))_(mj) = 1. The sum in (5.1) can
© Springer International Publishing Switzerland 2016 81


P.V. Ananda Mohan, Residue Number Systems, DOI 10.1007/978-3-319-41385-3_5

Figure 5.1 Architecture for CRT implementation: each residue xi addresses a ROM producing |xi (1/Mi)_(mi)|_(mi) Mi, and the ROM outputs are summed in a multi-operand mod M adder

be much larger than M, and the reduction mod M to obtain X is a cumbersome process. The advantage of the CRT is that the weighting of the residues xi can be done in parallel and the results summed, followed by reduction mod M. An architecture for reverse conversion using CRT is presented in Figure 5.1.
Example 5.1 Using CRT, find the binary number corresponding to the residues (1, 2, 3) in the moduli set {3, 5, 7}.
Note that M = 105, M1 = 35, M2 = 21 and M3 = 15. Thus, we have (1/M1) mod 3 = 2, (1/M2) mod 5 = 1, (1/M3) mod 7 = 1. Thus, the result is X = [35 × ((1 × 2) mod 3) + 21 × ((2 × 1) mod 5) + 15 × ((3 × 1) mod 7)] mod 105 = (70 + 42 + 45) mod 105 = 52. It may be noted that one subtraction of 105 is needed in this example to reduce the result mod 105. ■
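The computation of Example 5.1 can be sketched in software; the function below is our own minimal illustration of (5.1), using Python's three-argument pow (Python 3.8+) for the multiplicative inverses:

```python
# Minimal CRT reverse conversion following (5.1).
def crt_to_binary(residues, moduli):
    M = 1
    for m in moduli:
        M *= m
    total = 0
    for x, m in zip(residues, moduli):
        Mi = M // m
        inv = pow(Mi, -1, m)              # (1/Mi) mod mi
        total += ((x * inv) % m) * Mi     # one weighted term of (5.1)
    return total % M                      # final reduction mod M

# Example 5.1: residues (1, 2, 3) in {3, 5, 7} decode to 52.
assert crt_to_binary([1, 2, 3], [3, 5, 7]) == 52
```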
CRT can be efficiently used in the case of three and four moduli sets, e.g. {2^n − 1, 2^n, 2^n + 1}, {2^n − 1, 2^n, 2^n + 1, 2^(2n) + 1}, {2^(2n) − 1, 2^n, 2^(2n) + 1}, {2^n − 1, 2^n, 2^(n+1) − 1} and {2^n − 1, 2^n, 2^(n−1) − 1}, since n bits of the decoded number X are available directly as the residue corresponding to the modulus 2^n, and the modulo reduction needed at the end with respect to the product of the remaining moduli can be efficiently implemented in the case of the first three moduli sets. For some of these moduli sets, the moduli have word lengths ranging from n to 2n bits, which may be a disadvantage since the largest modulus decides the instruction cycle time of the RNS processor. For general moduli sets, CRT may necessitate the use of complex modulo reduction hardware and needs ROM-based implementations for obtaining the values x′i = |xi (1/Mi)_(mi)|_(mi) and multipliers for calculating x′i Mi.
We will next consider in detail the RNS to binary conversion for the moduli set {2^n − 1, 2^n, 2^n + 1} based on CRT, in view of the immense attention paid to it in the literature [2–20]. For this moduli set, denoting m1 = 2^n − 1, m2 = 2^n and m3 = 2^n + 1, we have M = 2^n(2^(2n) − 1), M1 = 2^n(2^n + 1), M2 = 2^(2n) − 1, M3 = 2^n(2^n − 1) and (1/M1) mod (2^n − 1) = 2^(n−1), (1/M2) mod 2^n = −1, (1/M3) mod (2^n + 1) = 2^(n−1) + 1. Thus, using CRT [4], we can obtain from (5.1) the decoded number as

    X = | 2^n(2^n + 1)2^(n−1) x1 − (2^(2n) − 1) x2 + 2^n(2^n − 1)(2^(n−1) + 1) x3 | mod 2^n(2^(2n) − 1)
      = Y·2^n + x2    (5.2)

Since we know the n LSBs of X to be x2, we can obtain the 2n MSBs of X by computing

    Y = ((X − x2)/2^n) mod (2^(2n) − 1)    (5.3)

From (5.2) and (5.3), we have

    Y = | (2^n + 1)2^(n−1) x1 − 2^n x2 + (2^n + 1)2^(n−1) x3 − x3 | mod (2^(2n) − 1)    (5.4)

Interestingly, the computation of Y involves the summing of four terms which can be easily found by bit manipulations of x1, x2 and x3 (rotations and one's complementing) due to the mod (2^(2n) − 1) operations involved [4, 8, 9]. We consider the first term A = [(2^n + 1)2^(n−1) x1] mod (2^(2n) − 1) = (2^(2n−1) x1 + 2^(n−1) x1) mod (2^(2n) − 1) for instance. Writing x1 as the n-bit number x1,n−1 x1,n−2 x1,n−3 ... x1,2 x1,1 x1,0, we have

    A = x1,0 x1,n−1 x1,n−2 ... x1,2 x1,1 x1,0 x1,n−1 x1,n−2 ... x1,2 x1,1    (5.5)

The second term B = (−2^n x2) mod (2^(2n) − 1) can be written as

    B = x̄2,n−1 x̄2,n−2 ... x̄2,1 x̄2,0 11...1   (n trailing ones)    (5.6)

where bars indicate one's complement of bits (inverted bits).


The third term is slightly involved since x3 is (n + 1) bits wide. We use the fact that when x3,n is 1, invariably x3,0 is 0. Proceeding in a similar manner as before, ((2^n + 1)2^(n−1) x3) mod (2^(2n) − 1) can be obtained as

    C = (x3,n + x3,0) x3,n−1 ... x3,2 x3,1 (x3,n + x3,0) x3,n−1 ... x3,2 x3,1    (5.7a)

and

    D = (−x3) mod (2^(2n) − 1) = 11...1 (n − 1 leading ones) x̄3,n x̄3,n−1 x̄3,n−2 ... x̄3,0    (5.7b)

Piestrak [8] has suggested adding the four words A, B, C and D given in (5.5), (5.6), (5.7a) and (5.7b) using a (4; 2^(2n) − 1) MOMA (multi-operand modulo adder). Two levels of carry-save-adders (CSA) followed by a carry-propagate-adder (CPA), all with end-around-carry (EAC), are required in the cost-effective (CE) version. Piestrak has also suggested a high-speed (HS) version wherein the mod (2^(2n) − 1)
reduction is carried out by using two parallel adders to compute (x + y) and (x + y − mi), where x and y are the sum and carry vector outputs of the CSA, and selecting the correct result using a 2:1 multiplexer. Dhurkadas [9] has suggested rewriting the three words B, C and D to yield the two new words

    E = x̄2,n−1 x̄2,n−2 ... x̄2,1 x̄2,0 x̄3,n−1 x̄3,n−2 ... x̄3,0    (5.8a)

    F = x3,0 (x3,n−1 + x3,n) ... (x3,1 + x3,n) (x3,0 + x3,n) x3,n−1 ... x3,1    (5.8b)
Thus, the three words given by (5.5), (5.8a) and (5.8b) need to be summed in a carry-save-adder with end-around carry (to take care of the mod (2^(2n) − 1) operation), and the resulting sum and carry vectors are added using a CPA with end-around carry (see Figure 5.2).
Several improvements have been made in the past two decades by examining the bit structure of the three operands and by using n-bit CPAs in place of the 2n-bit CPA to reduce the addition time [10–19].
The RNS to binary converters for RNS using moduli of the form (2^n − 1) can take advantage of the following basic properties:
(a) (−A) mod (2^n − 1) is the one's complement of A.
(b) (2^x A) mod (2^n − 1) is obtained by circular left shift of A by x bits, where A is an n-bit integer.
(c) The logic can be simplified by noting that full adders with a constant "1" as input can be replaced by a pair of two-input XNOR and OR gates. Similarly, full adders with one input "0" can be replaced by pairs of XOR and AND gates. Note also that a full adder with one input "0" and one input "1" reduces to just an inverter.
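Properties (a) and (b) are easily confirmed in software; the snippet below is our own illustration for n = 8:

```python
# (a): -A mod (2^n - 1) is the one's complement of A.
# (b): (2^x)A mod (2^n - 1) is a circular left shift of A by x bits.
n = 8
mask = (1 << n) - 1

def rol(a, x):
    """Circular left shift of an n-bit word a by x positions."""
    x %= n
    return ((a << x) | (a >> (n - x))) & mask if x else a

for a in range(1, mask):                      # skip the double zero 0/mask
    assert (-a) % mask == (~a) & mask         # property (a)
    for x in range(n):
        assert (a << x) % mask == rol(a, x)   # property (b)
```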
Bharadwaj et al. [10] have observed that, of the four operands to be added in Piestrak's technique, three operands have identical bits in the lower n-bit and upper (n − 1)-bit fields. Hence, (n − 1) FAs can be saved. Strictly speaking, since these have one input as "1", we save (n − 1) EXOR/OR gate pairs.
Next, a modified carry-select adder has been suggested in place of the 2n-bit CPA in order to reduce its propagation delay. This needs four n-bit CPAs and additional multiplexers. The authors also derive the condition for selecting the outputs of the multiplexers such that double representation of zero can be avoided.
Wang et al. [12] have suggested rewriting ((2^n + 1)2^(n−1) x3 − x3 − 2^n x2) in (5.4) as the two-operand sum

    | x3,0 x̂3,n−1 ... x̂3,0 x̄3,n−1 ... x̄3,1 + x̄2,n−1 x̄2,n−2 ... x̄2,0 x̄3,0 x3,0 ... x3,0 |_(2^(2n)−1)

where x̂3,i = x3,i ∨ x3,n, i = 0, 1, ..., n − 1, ∨ indicates logic OR, and the last (n − 1) bits of the second operand are copies of x3,0. The authors also suggest that after the 2n-bit CPA, denoting the output as a_(2n−1) a_(2n−2) ... a_0, the final output can be computed as d_i = ā ∧ a_i, where a = a_(2n−1) ∧ a_(2n−2) ∧ ... ∧ a_0 and ∧ indicates logic AND, for eliminating the redundant representation of zero.
Figure 5.2 Architecture of the RNS to binary converter for the moduli set {2^n − 1, 2^n, 2^n + 1} of Dhurkadas: the three 2n-bit operands derived from x1, x2 and x3 are summed in a 2n-bit CSA with EAC followed by a 2n-bit one's complement adder (adapted from [9] © IEEE 1998)

Conway and Nelson [11] have suggested an RNS to binary converter based on the CRT expansion (5.2) whose dynamic range is less than 2^n(2^(2n) − 1) by 2^n(2^n − 2) − 1. They rewrite the expression for CRT in (5.2) as X = D2·2^(2n) + D1·2^n + D0 so that the upper and middle n bits can be computed using n-bit hardware. However, there is forward and backward dependence between these two n-bit computations.
Gallaher et al. [18] have considered the equations for the residues of the desired 3n-bit output word X = D2·2^(2n) + D1·2^n + D0 corresponding to the three moduli, given as

    (D2 + D1 + D0) mod (2^k − 1) = x1,  (D2 − D1 + D0) mod (2^k + 1) = x3,  and D0 = x2    (5.9a)

They rewrite the first two expressions as

    x1 = (D2 + D1 + D0) − m(2^k − 1),  x3 = (D2 − D1 + D0) − n(2^k + 1)    (5.9b)

Solving for D1 and D2, we have

    D1 = X1 + L1 = (x1 − x3)/2 + m(2^k − 1)/2 − n(2^k + 1)/2    (5.9c)

    D2 = X2 + L2 = (x1 + x3)/2 − x2 + m(2^k − 1)/2 + n(2^k + 1)/2    (5.9d)

Note that m can be (0, 1, 2) and n can be (−1, 0, 1). Thus the authors explore which values of m and n will yield the correct result. This technique has been improved in [19].
CRT can be applied to other RNS having three or more moduli. The various operands to be summed can be easily obtained by bit manipulations (rotation of words and bit inversions), but the final summation and modulo reduction can be very involved. The three-moduli system described above is hence believed to be attractive. CRT has been applied to other moduli sets {2^(2n), 2^(2n) − 1, 2^(2n) + 1} [48, 49], {2^n, 2^n − 1, 2^(n+1) − 1} [50], {2^k, 2^k − 1, 2^(k−1) − 1} [47], and {2^n + 1, 2^(n+k), 2^n − 1} [21].
Chaves and Sousa [21] have suggested a moduli set {2^n + 1, 2^(n+k), 2^n − 1} with variable k such that 0 ≤ k ≤ n. CRT can be used to decode the number corresponding to the given residues. In this case also, the multiplicative inverses needed in CRT are very simple and are given as

    (1/M1)_(m1) = −2^(n−k−1),  (1/M2)_(m2) = −1,  (1/M3)_(m3) = 2^(n−k−1)    (5.10)

where m1 = 2^n + 1, m2 = 2^(n+k), m3 = 2^n − 1.
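The inverses in (5.10) are readily verified numerically; the check below is our own (we take k ≤ n − 1 so that the exponent n − k − 1 is non-negative):

```python
# Verify (5.10) for the moduli set {2^n + 1, 2^(n+k), 2^n - 1}.
def check_inverses(n, k):
    m1, m2, m3 = (1 << n) + 1, 1 << (n + k), (1 << n) - 1
    M = m1 * m2 * m3
    assert (M // m1) * (-(1 << (n - k - 1))) % m1 == 1   # (1/M1)_m1 = -2^(n-k-1)
    assert (M // m2) * (-1) % m2 == 1                    # (1/M2)_m2 = -1
    assert (M // m3) * (1 << (n - k - 1)) % m3 == 1      # (1/M3)_m3 = 2^(n-k-1)

for n in (4, 5, 8):
    for k in range(n):
        check_inverses(n, k)
```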
Hence, similar to the case of the moduli set {2^n − 1, 2^n, 2^n + 1}, using CRT, the reverse conversion can be carried out by mapping the residues into 2n-bit words and adding them using a mod (2^(2n) − 1) adder. Hiasat and Sweidan [22] have independently considered this case with k = n, i.e. the moduli set {2^(2n), 2^n − 1, 2^n + 1}.
Soderstrand et al. [23] have suggested computing the weighted sum in (5.1) scaled by 2/M. The fractions can be represented as words with one integer bit and several fractional bits and added. Any multiple of 2 in the integer part of the resulting sum can be discarded. The integer part will convey information about the sign of the number. However, Vu [24] has pointed out that the precision of these
fractions shall be proper so as to yield a definite indication of sign. Considering the CRT expression, we obtain

    X = | Σ_(i=1)^(j) |xi (1/Mi)_(mi)|_(mi) Mi |_M    (5.11a)

Multiplying both sides of (5.11a) by 2/M, we obtain the scaled value

    Xs = Σ_(i=1)^(j) (2/mi) |xi (1/Mi)_(mi)|_(mi) = Σ_(i=1)^(j) ui    (5.11b)

Considering the error due to the finite-bit representation of ui as ei, such that 0 ≤ ei ≤ 2^(−t) where (t + 1) bits are used to represent each ui, it can be shown that the total error must satisfy e < 2/M for M even and e < 1/M for M odd, i.e. n·2^(−t) ≤ 2/M for M even and n·2^(−t) ≤ 1/M for M odd.
An example will illustrate the idea of sign detection using Vu's CRT implementation.
Example 5.2 Consider the moduli set {11, 13, 15, 16} and the two cases corresponding to the real numbers +1 and −1. Use Vu's technique for sign detection.
The residues are (1, 1, 1, 1) and (10, 12, 14, 15), respectively. The expression (5.11b) becomes, in the first case,

    Xs = 16/11 + 2/13 + 4/15 + 2/16

Adding the first two terms and the last two terms and representing them in fractional form, we get

    v1 = u1 + u2 = 16/11 + 2/13 = 230/143 and v2 = u3 + u4 = 4/15 + 2/16 = 94/240.

The fractional representations for v1, v2 and Xs are

    v̂1 = 1.1001101111000000, v̂2 = 0.0110010001000101, X̂s = 0.000000000000101

For the case corresponding to −1, we have, on the other hand, X̂s = 1.111111111100101. Note that following the suggestion of Soderstrand et al. [23] of using 11 bits for the fractional part, the sum would be 0.0000000000, which does not give the correct sign.

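Vu's sign-detection idea can be illustrated with exact rational arithmetic (in hardware the ui would be truncated to t + 1 bits subject to the bound above); the sketch below is our own:

```python
from fractions import Fraction

# Sum of u_i = (2/m_i)|x_i (1/M_i)|_{m_i} as in (5.11b), reduced mod 2;
# the integer bit then indicates the sign (1 for the upper half of the range).
def scaled_sum(residues, moduli):
    M = 1
    for m in moduli:
        M *= m
    s = Fraction(0)
    for x, m in zip(residues, moduli):
        inv = pow(M // m, -1, m)          # (1/Mi) mod mi
        s += Fraction(2 * ((x * inv) % m), m)
    return s % 2                          # discard multiples of 2

moduli = [11, 13, 15, 16]
assert int(scaled_sum([1, 1, 1, 1], moduli)) == 0      # +1 detected positive
assert int(scaled_sum([10, 12, 14, 15], moduli)) == 1  # -1 detected negative
```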
Cardarilli et al. [25] have presented a systolic architecture for scaled residue to binary conversion. It is based on scaling the result of the CRT expansion for an N-moduli RNS by M to obtain

    X/M = | Σ_(i=1)^(N) x′i/mi |_1    (5.12)

where x′i = |xi (1/Mi)_(mi)|_(mi) and |·|_1 denotes reduction mod 1. Note that the x′i/mi are all fractions. The bits representing x′i/mi can be obtained using an iterative process. The resulting N words can be added using a carry-save-adder tree followed by a CPA, with any overflow ignored. The fraction x′i/mi can be written as

    x′i/mi = H1/2 + H2/2^2 + H3/2^3 + ... + HL/2^L + εL/(2^L mi)    (5.13)

where εL is the error due to truncation of x′i/mi, (5.13) is the radix-2 expansion of x′i/mi, and L = ⌈log2(NM)⌉ + 1. By multiplying (5.13) by 2^L mi and taking mod 2 on both sides, we obtain

    (HL)_2 = HL = |εL m^(−1)|_2    (5.14)

where m^(−1) is the multiplicative inverse of m mod 2. To compute the H(L−1) term, we need to multiply (5.13) by 2^(L−1), take mod 2 on both sides and use the values of εL and HL already obtained. Note that εL = a1 defined by xi·2^L = a1 + a2·mi and a2 = H1·2^(L−1) + H2·2^(L−2) + H3·2^(L−3) + ... + HL. In a similar manner, the other Hi values can be obtained.
As an illustration, for m = 5 and residue x = 1 in the moduli set {3, 5, 7}, N = 3, M = 105 and L = ⌈log2(NM)⌉ + 1 = 10. We thus have 2^L = 1024, a1 = 4, a2 = 204 and 1 × 1024 = 4 + 204 × 5. Thus εL = 4 and we can compute a2 = 204 iteratively bit by bit.
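The iterative bit computation can be condensed as follows (our own sketch): since x′·2^L = εL + a2·m, the truncated bits H1...HL of x′/m are simply the binary digits of a2.

```python
# Compute H1..HL and the truncation remainder eps_L of x'/m, as in (5.13).
def fraction_bits(x, m, L):
    a2, eps = divmod(x << L, m)          # x' * 2^L = eps + a2 * m
    bits = [(a2 >> (L - i)) & 1 for i in range(1, L + 1)]   # H1 ... HL
    return bits, eps

# The book's illustration: m = 5, x' = 1, L = 10 -> a2 = 204, eps_L = 4.
bits, eps = fraction_bits(1, 5, 10)
assert eps == 4
assert sum(b << (10 - i) for i, b in enumerate(bits, start=1)) == 204
```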
Dimauro et al. [26] have introduced the concept of a "quotient function" for performing RNS to binary conversion. This technique, denoted the quotient function technique (QFT), is akin to the Andraros and Ahmad technique [4]. In this method, one modulus mj is of the form 2^k. Consider the moduli set {m1, m2, ..., mN}. The quotient function is defined as

    QFj(|X|_M) = ⌊|X|_M / mj⌋ = | Σ_(i=1)^(N) bi xi |_(Mj)    (5.15)

Note that M = m1 m2 ... mN, Mj = M/mj, and the bi are defined as

    bi = | −(1/mj) |_(Mj)   if i = j    (5.16a)

and

1 1
bi ¼ M i for i ¼ 1, . . . N, i 6¼ j ð5:16bÞ
SQj mj Mj
Mj Mj
XN
Note that the sum of quotients SQj is defined as SQj ¼ i ¼ j Mi . Thus, the RNS
i 6¼ j
to binary conversion procedure using (5.15) is similar to CRT computation. Note
that the technique of Andraros and Ahmad [4] realizes quotient through CRT.
Appending QFj with the residue corresponding to 2k viz., rj as least significant
k bits will yield the final decoded number |X|M. An example will be illustrative.
Example 5.3 Consider the moduli set {5, 7, 9, 11, 13, 16} and the residues (2, 3, 5, 0, 4, 11). Obtain the decoded number using the quotient function.
Note that mj = 16 and Mj = 45,045. Next, we have (1/mj)_(Mj) = 8446 and SQj = (5 × 7 × 9 × 11 + 5 × 7 × 9 × 13 + 5 × 9 × 11 × 13 + 5 × 7 × 11 × 13 + 7 × 9 × 11 × 13) = 28,009. It follows that (1/SQj)_(Mj) = 16,174. Next, using (5.16), the bi can be estimated as b(1) = 36,036, b(2) = 12,870, b(3) = 20,020, b(4) = 12,285, b(5) = 17,325 and b(6) = 36,599. As an illustration, b(2) = |(5 × 9 × 11 × 13) × 16,174 × 8446|_(45,045) = 12,870. Next, the binary number corresponding to the given residues can be computed as (2 × 36,036 + 3 × 12,870 + 5 × 20,020 + 0 × 12,285 + 4 × 17,325 + 11 × 36,599) mod 45,045 = 6996, which in binary form is 1 101 101 010 100. Appending this with the residue 11 (in binary form 1011) corresponding to modulus 16, we obtain the decoded number as 111,947. ■
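Example 5.3 can be reproduced with a short program; the sketch below is our own rendering of (5.15) and (5.16) (helper names are ours):

```python
# Quotient-function reverse conversion with one modulus mj = 2^k at index j.
def qf_convert(residues, moduli, j):
    mj = moduli[j]
    Mj = 1
    for i, m in enumerate(moduli):
        if i != j:
            Mj *= m
    inv_mj = pow(mj, -1, Mj)                     # (1/mj)_{Mj}
    SQ = sum(Mj // m for i, m in enumerate(moduli) if i != j)
    inv_SQ = pow(SQ, -1, Mj)                     # (1/SQj)_{Mj}
    b = [(-inv_mj) % Mj if i == j else (inv_SQ * inv_mj * (Mj // m)) % Mj
         for i, m in enumerate(moduli)]          # (5.16a) / (5.16b)
    qf = sum(bi * x for bi, x in zip(b, residues)) % Mj    # (5.15)
    return qf * mj + residues[j]                 # append the mod-2^k residue

# Example 5.3: residues (2, 3, 5, 0, 4, 11) decode to 111,947.
assert qf_convert([2, 3, 5, 0, 4, 11], [5, 7, 9, 11, 13, 16], j=5) == 111947
```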

Dimauro et al. [26] observed that, since SQ can be large, an alternative two-level technique can be used: the original moduli set is considered as two subsets whose residues are reverse converted in parallel, and the quotient function is then evaluated for the resulting two-number system. This reduces the magnitude of SQ and hence leads to simpler hardware. As an illustration, for the same moduli set considered above, we can consider the subsets {5, 9, 16} and {7, 11, 13}, giving SQ = 1721 as against 28,009 in the previous case.
Kim et al. [27] have suggested an RNS to binary conversion technique with rounded error compensation. In this technique, the CRT result is multiplied by 2^d/M, where M is the product of the moduli and d is the parameter deciding the output word length:

    Xs = [X · 2^d/M] = Σ_(i=1)^(L) [ |xi (1/Mi)_(mi)|_(mi) Mi · 2^d/M ] − α·2^d    (5.17)

Evidently, the mod M operation is converted into a mod 2^d operation, in which we ignore the carry generated. The expression [·] means rounding to the nearest integer. The authors suggest considering two terms at a time, to be added using a tree of adders. As an illustration, for a four-moduli RNS,

    Xs = | f(x1, x2) + f(x3, x4) |_(2^d)    (5.18a)

where

    f(x1, x2) = | [ |x1 (1/M1)_(m1)|_(m1) M1 · 2^d/M ] + [ |x2 (1/M2)_(m2)|_(m2) M2 · 2^d/M ] |_(2^d)    (5.18b)

Since each rounding in f(x1, x2) introduces a ½ LSB round-off error, the maximum error in computing Xs is 1 LSB. The authors suggest computing the error due to round-off using a set of nearest error estimates, e.g. 1/3, 2/3, 1/5, 2/5, ..., 4/5, etc., and evaluating δ̂ = Σ_(i=1)^(N) δi, where δi = xi − [xi] and [xi] is a rounded real number. These are read from PROMs for both f(x1, x2) and f(x3, x4) and added together with sign, and the result is added to the coarse values obtained before. They have shown that the PROM contents need to be obtained by computer simulation to find the maximum scaling error for the chosen d; thus the lack of ordered outputs otherwise occurring without rounding error compensation can be avoided.

5.2 Mixed Radix Conversion-Based RNS to Binary Conversion

The MRC technique is sequential and involves modulo subtractions and modulo multiplications by the multiplicative inverses of one modulus with respect to the remaining moduli. In MRC, the decoded number is expressed as

    B = x1 + d1·m1 + d2·m1m2 + d3·m1m2m3 + ... + d(j−1)·m1m2m3...m(j−1)    (5.19)

where 0 ≤ di ≤ (m(i+1) − 1) for a j-moduli RNS. The parameters di are known as Mixed Radix Digits.
In each step, one mixed radix digit di is determined. At the end, the MRC digits are weighted following (5.19) to obtain the final decoded number. There is no need for a final modulo reduction.
Note that in each step, the residue corresponding to one modulus is subtracted so that the result is exactly divisible by that modulus. The multiplication with the multiplicative inverse accomplishes this division. The last step needs multiplications of bigger numbers, e.g. z1·m2m3 in the three-moduli example, and addition of the resulting products using carry-save-adders followed by a CPA. But no final modulo reduction is needed in the case of MRC, since the result is always less than M = m1m2m3. In the case of an RNS with a large number of moduli, e.g. {m1, m2, m3, ..., mn}, the various multiplicative inverses need to be known a priori, and various products of moduli m(k−1)mk, m(k−2)m(k−1)mk, etc. need to be stored. The RNS to binary conversion time is thus (n − 1)Δmodsub + (n − 1)Δmodmul + Δmult + ΔCSA(n−2) + ΔCPA, where modsub and modmul stand for the modulo subtraction and multiplication operations, CSA(n − 2) stands for an (n − 2)-level CSA and mult is a conventional multiplication. Note that the MRC algorithm can be pipelined. The following example illustrates the technique.
Example 5.4 We consider the Mixed Radix Conversion technique for finding the decimal number corresponding to the residues (1, 2, 3) in the moduli set {3, 5, 7}. The procedure is illustrated below:

    m3 = 3        m2 = 5        m1 = 7
     1             2             3
    −3            −3             (subtract x1 = 3)
     1             4
    ×1            ×3             (multiply by (1/m1) mod m3, (1/m1) mod m2)
     1 (= y1)      2 (= d1)
    −2                           (subtract d1 = 2)
     2
    ×2                           (multiply by (1/m2) mod m3)
     1 (= d2)

The result is X = d2·(m2m1) + d1·(m1) + x1 = 1 × 35 + 2 × 7 + 3 = 52. ■
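The sequential MRC procedure of Example 5.4 generalizes directly; the sketch below is our own, computing the mixed radix digits and then weighting them as in (5.19):

```python
# Sequential MRC: each digit is obtained after subtracting earlier digits
# and multiplying by the multiplicative inverse of m[i] mod m[k].
def mrc_to_binary(residues, moduli):
    d = list(residues)
    n = len(moduli)
    for i in range(n):
        for k in range(i + 1, n):
            d[k] = ((d[k] - d[i]) * pow(moduli[i], -1, moduli[k])) % moduli[k]
    X, w = 0, 1
    for i in range(n):
        X += d[i] * w          # weights 1, m1, m1*m2, ... as in (5.19)
        w *= moduli[i]
    return X

# Example 5.4 with the moduli ordered (7, 5, 3): residues (3, 2, 1) -> 52.
assert mrc_to_binary([3, 2, 1], [7, 5, 3]) == 52
```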



Huang [28] has suggested an RNS to binary converter in which the MRC digits corresponding to each residue xi are obtained by table look-up from LUTs. These are next added mod mi, using one mod mi adder for each modulus in a ⌈log2 j⌉-level tree, where j is the number of moduli. The carry digit needs to be preserved and added to the next column at the end, since the Mixed Radix representation is a positional representation. At the end, the Mixed Radix digits are weighted using multipliers and summed in a normal final adder to compute (5.19). The CRT-II of Wang [29] is the same as this technique [30].
As an illustration, consider the moduli set {3, 5, 7} and given residues (1, 2, 3). The MRC digits corresponding to the three residues (1, 0, 0), (0, 2, 0) and (0, 0, 3) are [2, 0, 0], [1, 1, 0] and [1, 1, 3], where B = d2·(35) + d1·(7) + d0 is the MRC expansion. (As an illustration, (1, 0, 0) corresponds to 70 = 2 × 35 + 0 × 7 + 0 in MRC, yielding
the MRC digits as [2, 0, 0].) Thus, adding these mod mi and adding the carry into the previous column, we obtain [1, 2, 3], which corresponds to 1 × 35 + 2 × 7 + 3 = 52.
MRC is simpler in the case of some powers-of-two related moduli sets, since the various multiplicative inverses needed in the successive steps are of the form ±2^i, so that the modulo multiplication can be realized by bit-wise rotation of the operands in the case of mod (2^n − 1), and by one's complementing certain bits and adding a correction term in the case of mod (2^n + 1), as explained in Chapter 4 [16, 17]. As an illustration, for the earlier considered moduli set {2^n − 1, 2^n, 2^n + 1}, the multiplicative inverses are as follows:

    (1/(2^n + 1))_(2^n − 1) = 2^(n−1),  (1/(2^n + 1))_(2^n) = 1,  (1/2^n)_(2^n − 1) = 1.

Thus, in each MRC step, only modulo subtractions are needed, and multiplication with 2^(n−1) mod (2^n − 1) can be realized by left circular rotation. An example will illustrate the procedure.
Example 5.5 Consider the moduli set {2^n − 1, 2^n, 2^n + 1} with n = 3. The RNS is thus {7, 8, 9}. We wish to find the decimal number corresponding to the residues (1, 2, 3). The MRC procedure is as follows:

moduli                7    8    9
residues              1    2    3
subtract x1 = 3       −3   −3
                      5    7
multiply inverses     ×4   ×1
                      6    7
subtract d1 = 7       −7
                      6
multiply inverse      ×1
d2                    6

The decoded number can be obtained as 6 × (8 × 9) + 7 × (9) + 3 = 498. (The modulo multiplications, by 1 and by 4 = 2^{n−1}, need no computation time, as hard-wiring (identity or bit rotation) achieves them.)
Note also that the order of moduli for MRC needs to be appropriate to reduce complexity. For example, if we change the order of moduli to 7, 9 and 8, then modulo (2^n + 1) = modulo 9 multiplications will be needed.
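A minimal sketch of this example, showing that the only non-trivial multiplication, by 2^{n−1} mod (2^n − 1), is a circular left shift (function and variable names are illustrative):

```python
n = 3
m = (2**n + 1, 2**n, 2**n - 1)       # (9, 8, 7): MRC order used in the example

def rotl(x, k, width):
    """Circular left shift by k bits == multiplying by 2^k mod (2^width - 1)."""
    mask = (1 << width) - 1
    return ((x << k) | (x >> (width - k))) & mask

def decode(x1, x2, x3):              # x1 mod 9, x2 mod 8, x3 mod 7
    d1 = (x2 - x1) % m[1]            # * (1/9) mod 8 = *1: nothing to do
    t = (x3 - x1) % m[2]
    t = rotl(t, n - 1, n)            # * (1/9) mod 7 = * 2^(n-1): bit rotation
    d2 = (t - d1) % m[2]             # * (1/8) mod 7 = *1: nothing to do
    return x1 + d1 * m[0] + d2 * m[0] * m[1]

print(decode(3, 2, 1))               # residues (1, 2, 3) for (7, 8, 9) -> 498
```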

For large moduli RNS, ROMs will be needed to implement the modulo multiplication with the multiplicative inverse, while the modulo subtraction can be carried out using logic. For small word-length moduli, say n bits, the subtraction (x_{i+1} − x_i) mod m_{i+1} and the modulo multiplication of the result with (1/m_i) mod m_{i+1} can together be realized using a ROM, but with a 2n-bit address space (since both x_i and x_{i+1} need to address the memory).
Miller and McCormick [31] have proposed two parallel MRC techniques, one of which is conventional MRC, as pointed out by Garcia and Jullien [32]. The second technique uses LUTs having two inputs and two outputs. These outputs correspond to solutions of Diophantine equations. Specifically, for the input residues k_{j+1}^{(i−1)}, l_j^{(i−1)} of any block in Figure 5.3a, the two outputs k_j^{(i)}, l_j^{(i)} are given by

k_j^{(i)} m_j − l_j^{(i)} m_{i+j} = k_{j+1}^{(i−1)} − l_j^{(i−1)}, i = 1, 2, ..., n − 1, j = 1, 2, ..., n − 1    (5.20)

and k_j^{(0)} = x_{j+1}, l_j^{(0)} = x_j, j = 1, 2, ..., n − 1.
Thus, using both the outputs k_j and l_j, a tree can be constructed to perform Mixed Radix conversion. In this technique, two MRC expansions are carried out simultaneously. As an illustration, for a four moduli RNS, a given number can be expressed by either

X = x1 + d1 m1 + d2 m1 m2 + d3 m1 m2 m3    (5.21a)

or

X = x4 + d′1 m4 + d′2 m4 m3 + d′3 m4 m3 m2    (5.21b)

where di and d′i are Mixed Radix digits. The advantage is the local interconnection between the various LUTs. However, the size of the LUTs is larger due to the need for two outputs in some cases. A typical converter for the four moduli set {3, 5, 7, 11} is presented in Figure 5.3b. The numbers i, j in the boxes refer to adjacent moduli m_i, m_j ( j = i + 1). As an illustration, for the box 2, 3 in the first column, which corresponds to the moduli 5, 7 and input residues 2 and 5, we have k2 = 2 and l2 = 1:

x3 − x2 = m2 k2 − m3 l2 → 5 − 2 = 5 × 2 − 7 × 1 = 3.

Yassine and Moore [33] have suggested choosing the moduli set such that certain multiplicative inverses are 1, to facilitate easy reverse conversion using only subtractions. Considering a moduli set {m1, m2, m3, m4} for illustration, we choose the moduli such that the constant predetermined factors Vi are all 1:

V1 = 1, V2 = (1/m1) mod m2 = 1, V3 = (1/(m1 m2)) mod m3 = 1, V4 = (1/(m1 m2 m3)) mod m4 = 1    (5.22a)

The decoded number can first be written in Mixed Radix form as

X = (U1 V1) mod m1 + m1 ((U2 V2) mod m2) + m1 m2 ((U3 V3) mod m3) + m1 m2 m3 ((U4 V4) mod m4)    (5.22b)

Note that the Ui are such that (Ui Vi) mod mi = γi are the Mixed Radix digits. This can be proved as follows. We can evaluate the residues x1, x2, x3 and x4 from (5.22b) as
Figure 5.3 An RNS to binary converter due to Miller and McCormick: (a) general case, (b) four moduli example (adapted from [31] ©IEEE 1998)

x1 = (U1 V1) mod m1 = U1, x2 = (U1 + U2) mod m2,
x3 = (U1 + m1 ((U2 V2) mod m2) + U3) mod m3,    (5.22c)
x4 = (U1 + m1 ((U2 V2) mod m2) + m1 m2 ((U3 V3) mod m3) + U4) mod m4

from which we can see that

U1 = x1, U2 = (x2 − x1) mod m2, U3 = (x3 − x1 − m1 U2) mod m3, U4 = (x4 − x1 − m1 U2 − m1 m2 U3) mod m4    (5.22d)

Consider the following example illustrating this technique.

Example 5.6 Consider the moduli set {m1, m2, m3, m4} = {127, 63, 50, 13}, for which V1 = V2 = V3 = V4 = 1. We wish to find the number corresponding to the residues (78, 41, 47, 9).
The Mixed Radix digits γi are computed as follows:

γ1 = U1 = 78
γ2 = U2 = (41 − 78) mod 63 = 26
γ3 = U3 = (47 − 78 − 127 × 26) mod 50 = 17
γ4 = U4 = (9 − 78 − 127 × 26 − 127 × 63 × 17) mod 13 = 11

The decoded number is thus 78 + 26 × (127) + 17 × (127 × 63) + 11 × (127 × 63 × 50) = 4,539,947. Thus, the number of arithmetic operations is reduced compared to conventional Mixed Radix conversion.
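These mixed-radix digits can be recomputed directly from (5.22d) (a plain transcription, not from the book; note that the last digit works out to γ4 = 11, and the residue check below confirms the decoded value):

```python
# Example 5.6: with V1 = ... = V4 = 1, only subtractions and modulo
# reductions are needed to obtain the mixed-radix digits.
m = (127, 63, 50, 13)
x = (78, 41, 47, 9)

g1 = x[0]
g2 = (x[1] - x[0]) % m[1]
g3 = (x[2] - x[0] - m[0] * g2) % m[2]
g4 = (x[3] - x[0] - m[0] * g2 - m[0] * m[1] * g3) % m[3]
X = g1 + g2 * m[0] + g3 * m[0] * m[1] + g4 * m[0] * m[1] * m[2]

print(g1, g2, g3, g4, X)            # -> 78 26 17 11 4539947
assert all(X % mi == xi for mi, xi in zip(m, x))   # X has the given residues
```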

5.3 RNS to Binary Conversion Based on New CRT-I, New CRT-II, Mixed-Radix CRT and New CRT-III

Variations of CRT have appeared in the literature, the most important being New CRT-I [29]. Using New CRT-I, given the moduli set {m1, m2, m3, ..., mn}, the weighted binary number corresponding to the residues (x1, x2, x3, ..., xn) can be found as

X = x1 + m1 ((k1(x2 − x1) + k2 m2 (x3 − x2) + ... + k_{n−1} m2 m3 ... m_{n−1} (xn − x_{n−1})) mod (m2 m3 ... m_{n−1} mn))    (5.23)

where (k1 m1) mod (m2 m3 ... m_{n−1} mn) = 1, (k2 m1 m2) mod (m3 ... m_{n−1} mn) = 1, ..., (k_{n−1} m1 m2 m3 ... m_{n−1}) mod mn = 1.
As an illustration, application to the moduli set {m1, m2, m3} = {2^n − 1, 2^n, 2^n + 1} yields

X = x2 + 2^n Y = x2 + 2^n (((x2 − x3) + 2^{n−1}(2^n + 1)(x1 − 2x2 + x3)) mod (2^{2n} − 1))    (5.24)

Thus, Y can be found first and, by appending x2 as the n LSBs, we obtain X.


Wang et al. [14] have suggested computing Y in (5.24) as (A + 2^n B) mod (2^{2n} − 1), where

A = (x1 + (x1,0 ⊕ x3,0) 2^n + (2^n − 1 − x3) + 2^n − 1)/2    (5.25a)

and

B = (x1 + (x1,0 ⊕ x3,0) 2^n + x3 + 2(2^n − 1 − x2))/2    (5.25b)

where x1,0 and x3,0 are the LSBs of x1 and x3, respectively. The value A can be computed using a two-input adder to yield the sum and carry vectors A1 and A2. Similarly, B can be estimated to yield sum and carry vectors B1 and B2 and a carry bit, using a three-input n-bit adder. Next, Y can be obtained from A and B using a 2n-bit adder (Converter I), or using n-bit adders to reduce the propagation delay. Two solutions for the n-bit case have been suggested, denoted Converter II and Converter III.
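Equation (5.24) itself is easy to check numerically; the sketch below is not Wang et al.'s A/B decomposition, just the underlying New CRT-I formula (illustrative names):

```python
# New CRT-I decode (5.24) for {2^n - 1, 2^n, 2^n + 1}: form Y mod (2^(2n) - 1)
# and append x2 as the n LSBs. x1, x2, x3 are the residues modulo
# 2^n - 1, 2^n and 2^n + 1 respectively.

def new_crt1_decode(x1, x2, x3, n):
    Y = ((x2 - x3) + 2**(n - 1) * (2**n + 1) * (x1 - 2 * x2 + x3)) % (2**(2 * n) - 1)
    return x2 + 2**n * Y            # same as (Y << n) | x2, since x2 < 2^n

n = 3
M = (2**n - 1) * 2**n * (2**n + 1)  # 504
assert all(new_crt1_decode(X % (2**n - 1), X % 2**n, X % (2**n + 1), n) == X
           for X in range(M))
print("(5.24) verified for all", M, "values")
```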
Bi and Gross [34] have described a Mixed-Radix Chinese Remainder Theorem (Mixed-Radix CRT) for RNS to binary conversion. In this approach, the result of RNS to binary conversion for an RNS having moduli {m1, m2, ..., mn} with residues (x1, x2, ..., xn) is computed as

X = x1 + m1 ((γ1 x1 + γ2 x2) mod m2) + m1 m2 (⌊(γ1 x1 + γ2 x2 + γ3 x3)/m2⌋ mod m3) + ... + m1 m2 ... m_{n−1} (⌊(γ1 x1 + γ2 x2 + γ3 x3 + ... + γn xn)/(m2 m3 ... m_{n−1})⌋ mod mn)    (5.26a)

where

γ1 = (M1 ((1/M1) mod m1) − 1)/m1 and γi = Mi ((1/Mi) mod mi)/m1, i = 2, ..., n.    (5.26b)

Note that the first two terms use MRC and the other terms use a CRT-like expansion. The advantage of this formulation is the possibility of parallel computation of the various MRC digits, enabling fast comparison of two numbers, at the expense of hardware, since the many-term numerators in the expressions for the several Mixed Radix digits, and the division by products of moduli with taking of the integer part, are cumbersome except for special moduli. The topic of comparison using this technique is discussed in Chapter 6. An example will be illustrative.
Example 5.7 We wish to find the number corresponding to the residues (1, 2, 3, 4) in the RNS {3, 5, 7, 11}. We can compute, as in CRT, M1 = 385, M2 = 231, M3 = 165, M4 = 105, and the various (1/Mi) mod mi as (1/385) mod 3 = 1, (1/231) mod 5 = 1, (1/165) mod 7 = 2, (1/105) mod 11 = 2. Next, we compute γ1 = 128, γ2 = 77, γ3 = 110, γ4 = 70. Thus, X can be computed as

X = 1 + 3 × ((128 + 154) mod 5) + 15 × (⌊(128 + 154 + 330)/5⌋ mod 7) + 105 × (⌊(128 + 154 + 330 + 280)/35⌋ mod 11)
  = 1 + 3 × 2 + 15 × 3 + 105 × 3 = 367

New CRT-III [35, 36] can be used to perform RNS to binary conversion when the moduli have common factors. Considering two moduli m1 and m2 with common factor d, and considering m1 > m2, the decoded number corresponding to the residues x1 and x2 can be obtained as

X = x1 + m1 (((1/(m1/d)) ((x2 − x1)/d)) mod (m2/d))    (5.27)

As an illustration, consider the moduli set {15, 12} with d = 3 as a common factor and given residues (5, 2). The decoded number can be obtained from (5.27) as

X = 5 + 15 × (((1/5)((2 − 5)/3)) mod 4) = 50.

We will later consider the application of this technique for reverse conversion for an eight moduli set.
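A minimal sketch of (5.27) (illustrative names; note that x2 − x1 is always divisible by d, because both residues agree modulo the common factor d):

```python
def new_crt3(x1, x2, m1, m2, d):
    """New CRT-III decode for two moduli m1 > m2 with common factor d."""
    # exact division by d, then multiply by (1/(m1/d)) mod (m2/d)
    t = ((x2 - x1) // d * pow(m1 // d, -1, m2 // d)) % (m2 // d)
    return x1 + m1 * t

print(new_crt3(5, 2, 15, 12, 3))   # -> 50
```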

5.4 RNS to Binary Converters for Other Three Moduli Sets

Premkumar [37], Premkumar et al. [38], Wang et al. [39], and Gbolagade et al. [40] have investigated the three moduli set {m1, m2, m3} = {2n + 1, 2n, 2n − 1}. The reverse converter for this moduli set based on CRT described by Premkumar [37] uses the expressions

X = (M/2 + (m2 m3/2) x1 + (m1 m2/2) x3 − m1 m3 x2) mod M for (x1 + x3) odd    (5.28a)

and

X = ((m2 m3/2) x1 + (m1 m2/2) x3 − m1 m3 x2) mod M for (x1 + x3) even    (5.28b)

where M = 2n(4n^2 − 1).


Note that the output of the adder computing the value inside the brackets needs to be tested and, based on the sign, M has to be added or subtracted once. The hardware implementation needs three 2k-bit × k-bit multipliers, where k = ⌈log2(2n + 1)⌉, and a four-input 3k-bit adder. Premkumar et al. [38] suggested a simplification which needs one 2k-bit × k-bit multiplier, one k-bit × k-bit multiplier, and 7 or 9 adders in Architectures A and B, respectively. They divide both sides of the CRT expression by m2 and find the integer part as

⌊X/m2⌋ = (n(x1 + x3 − 2x2) + (x1 − x3)/2) mod (m1 m3), both x1, x3 odd or both even    (5.29a)

and

⌊X/m2⌋ = (n(x1 + x3 − 2x2) + (x1 − x3 + m1 m3)/2) mod (m1 m3), x1 even, x3 odd or vice-versa.    (5.29b)

Note that in this case, m1 = 2n − 1, m2 = 2n and m3 = 2n + 1. The final result is given by X = m2 ⌊X/m2⌋ + x2. The authors suggest a high-speed version as well as a cost-effective version.
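The parity-split formulas (5.29a)/(5.29b) can be verified exhaustively for a small n (here n is a plain integer, not an exponent; illustrative function name):

```python
# Residues: x1 mod (2n-1), x2 mod 2n, x3 mod (2n+1). When x1 + x3 is odd,
# adding the odd constant m1*m3 makes the numerator of the halved term even,
# so the division by 2 is exact in both branches.

def premkumar_decode(x1, x2, x3, n):
    m1, m2, m3 = 2*n - 1, 2*n, 2*n + 1
    if (x1 + x3) % 2 == 0:
        q = (n * (x1 + x3 - 2*x2) + (x1 - x3) // 2) % (m1 * m3)
    else:
        q = (n * (x1 + x3 - 2*x2) + (x1 - x3 + m1 * m3) // 2) % (m1 * m3)
    return m2 * q + x2            # q is the integer part X // m2

n = 3
M = (2*n - 1) * 2*n * (2*n + 1)   # 5 * 6 * 7 = 210
assert all(premkumar_decode(X % (2*n - 1), X % (2*n), X % (2*n + 1), n) == X
           for X in range(M))
print("(5.29) verified for all", M, "values")
```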
Wang et al. [39] have given another technique for reverse conversion using the formula based on New CRT II,

X = x2 + 2n(((x2 − x3) + (x1 − 2x2 + x3) n (2n + 1)) mod ((2n + 1)(2n − 1)))    (5.30)

which needs one 2k-bit × k-bit multiplier, one k-bit × k-bit multiplier and a few adders. Note that in this case, m1 = 2n − 1, m2 = 2n and m3 = 2n + 1. More recently, Gbolagade et al. [40] have suggested computing X as

X = (m2 (x2 − x3) + x2 + m3 m2 (((x1 + x3)/2 − x2) mod m1)) mod M    (5.31)

Note that in this case, m1 = 2n − 1, m2 = 2n and m3 = 2n + 1. This needs at most one corrective addition of M. The critical path has been shown to be less than that of the Wang et al. converter [37], with reduced hardware complexity.
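A sketch of (5.31); the halving of (x1 + x3) is done here as multiplication by the inverse of 2 mod m1 (illustrative function name):

```python
# Residues: x1 mod (2n-1), x2 mod 2n, x3 mod (2n+1); n is a plain integer.

def gbolagade_decode(x1, x2, x3, n):
    m1, m2, m3 = 2*n - 1, 2*n, 2*n + 1
    M = m1 * m2 * m3
    # (x1 + x3)/2 realized as multiplication by (1/2) mod m1
    half = ((x1 + x3) * pow(2, -1, m1) - x2) % m1
    return (m2 * (x2 - x3) + x2 + m3 * m2 * half) % M

n = 3
M = (2*n - 1) * 2*n * (2*n + 1)
assert all(gbolagade_decode(X % (2*n - 1), X % (2*n), X % (2*n + 1), n) == X
           for X in range(M))
print("(5.31) verified for all", M, "values")
```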
Premkumar [41, 42], Wang et al. [39] and Gbolagade [43] have considered another moduli set, {2n, 2n + 1, 2n + 2}, which has 2 as a common factor and hence half the dynamic range compared to the moduli set {2n + 1, 2n, 2n − 1}. It may be remarked that the moduli sets {2n + 1, 2n, 2n − 1} and {2n, 2n + 1, 2n + 2} are not attractive compared to powers-of-two related moduli sets, since the hardware needed has quadratic dependence on the bit size of the moduli.
Reverse converters for the moduli set {2^k, 2^k − 1, 2^{k−1} − 1} have also been described [44-47]. The design due to Hiasat and Abdel-Aty-Zohdy [44] was based on CRT. Denoting m1 = 2^k, m2 = 2^k − 1, and m3 = 2^{k−1} − 1, the authors start with CRT and estimate X mod M3 and ⌊X/M3⌋, where M3 = M/m3. Wang et al. [45, 46] have used New CRT II and have shown that the conversion time can be reduced, whereas the area is increased. Ananda Mohan [47] has suggested both CRT- and MRC-based converters. The CRT-based converter has reduced conversion time and uses ROM. On the other hand, the MRC-based converter has reduced area but higher conversion time.
The moduli set {2^{2n} − 1, 2^n, 2^{2n} + 1} has been suggested by Ananda Mohan [48, 49], for which cost-effective as well as high-speed converters using CRT have been described. Note that the moduli have word lengths of n bits, 2n bits and 2n + 1 bits. The dynamic range is 5n + 1 bits. Another moduli set with (3n + 1)-bit dynamic range, {2^n, 2^n − 1, 2^{n+1} − 1}, has also been explored [50] using CRT as well as MRC techniques. The multiplicative inverses needed in the case of the MRC technique are very simple. The CRT-based converter needs a modulo (2^n − 1)(2^{n+1} − 1) reduction after a CPA, which has been suggested to be realized using ROMs, by looking at the MSBs and subtracting the appropriate residue. Thus, one converter using ROM and two converters not using ROM have been suggested. This moduli set has the advantage that, due to the absence of the modulus (2^n + 1), the multiplication and addition operations for all moduli channels can be simpler.

5.5 RNS to Binary Converters for Four and More Moduli Sets

Some reverse converters for four moduli sets [51-54] are extensions of the converters for the three moduli sets. These use the optimum converters for the three moduli set {2^n − 1, 2^n, 2^n + 1} and use MRC to get the final result, so as to include the fourth modulus 2^{n+1} − 1, 2^{n−1} + 1, 2^{n−1} − 1, 2^{n+1} + 1, etc.
The reverse converter due to Vinod and Premkumar [51] for the moduli set {m1, m2, m3, m4} = {2^n − 1, 2^n + 1, 2^n, 2^{n+1} − 1} uses CRT, but computes the highest Mixed Radix digit ⌊X/M4⌋ mod (2^{n+1} − 1), where X is the desired decoded number and Mi = M/mi. On the other hand, X mod M4 is computed using the three moduli RNS to binary converter. Next, X is computed as ⌊X/M4⌋ M4 + (X mod M4).
The reverse converter due to Bhardwaj et al. [52] for the moduli set {m1, m2, m3, m4} = {2^n − 1, 2^n + 1, 2^n, 2^{n+1} + 1} uses CRT but computes first E = ⌊X/2^n⌋. Note that E can be obtained by using CRT on the four moduli set, subtracting the residue r3 and dividing by m3. However, the multiplicative inverses needed in CRT are quite complex and hence E1 and E2 are estimated from the expression for E. Next, from E1 and E2, using CRT, E can be obtained:

E1 = E mod (2^{2n} − 1) = (2^{n−1}(2^n + 1) r1 − 2^n r2 − 2^{n−1}(2^n − 1) r3) mod (2^{2n} − 1)    (5.32a)

E2 = E mod (2^{n+1} + 1) = (2 r2 − 2 r4) mod (2^{n+1} + 1)    (5.32b)

Ananda Mohan and Premkumar [53] have suggested using MRC for obtaining E from E1 and E2.
Ananda Mohan and Premkumar [53] have given a unified architecture for RNS to binary conversion for the moduli sets {2^n − 1, 2^n + 1, 2^n, 2^{n+1} − 1} and {2^n − 1, 2^n + 1, 2^n, 2^{n+1} + 1}, which uses a front-end RNS to binary converter for the moduli set {2^n − 1, 2^n + 1, 2^n} and then uses MRC to include the fourth modulus. Both ROM-based and non-ROM-based solutions have been given.
Hosseinzadeh et al. [55] have suggested an improvement of the converter of Ananda Mohan and Premkumar [53] for the moduli set {2^n − 1, 2^n + 1, 2^n, 2^{n+1} − 1}, reducing the conversion delay at the expense of area. They suggest using (n + 1)-bit adders in place of the (3n + 1)-bit CPA to compute the three parts of the final result. They do not perform the final addition of the output of the multiplier evaluating ((x4 − Xa)(1/(2^n(2^{2n} − 1)))) mod (2^{n+1} − 1), where Xa is the decoded output corresponding to the moduli set {2^n − 1, 2^n + 1, 2^n}, but preserve it as two carry and sum output vectors and compute the final output.
Sousa et al. [56] have described an RNS to binary converter for the moduli set {2^n + 1, 2^n − 1, 2^n, 2^{n+1} + 1}. They have used two-level MRC. In the first level, reverse conversion using MRC for the moduli pairs {m1, m2} = {2^n + 1, 2^n − 1} and {m4, m3} = {2^{n+1} + 1, 2^n} is performed and the decoded words X12, X34 are obtained. Note that the various multiplicative inverses are (1/m1) mod m2 = 2^{n−1}, (1/m4) mod m3 = 1 and

(1/(m3 m4)) mod (m1 m2) = Σ_{i=0}^{(n−3)/2} 2^{2i+1} + Σ_{i=(n−1)/2}^{n−1} 2^{2i+2}.

Since the architecture uses MRC, it can be pipelined. The multiplications with the multiplicative inverses mod (2^n − 1), mod 2^n, and mod (2^{2n} − 1) can be easily performed. The resulting area is more than that of the Ananda Mohan and Premkumar converter [53], whereas the conversion time is less.
Cao et al. [54] have described reverse converters for the two four moduli sets {2^n + 1, 2^n − 1, 2^n, 2^{n+1} − 1} and {2^n + 1, 2^n − 1, 2^n, 2^{n−1} − 1}, both for n even. They use a front-end RNS to binary converter due to Wang et al. [14] for the three moduli set to obtain the decoded word X1, and use MRC later to include the fourth modulus m4 (i.e. (2^{n+1} − 1) or (2^{n−1} − 1)). The authors suggest three-stage and four-stage converters, which differ in the way the MRC in the second level is performed. In the three-stage converter, considering the first moduli set, the second stage computes

Z = ((x4 − X1)(1/(2^n(2^{2n} − 1)))) mod (2^{n+1} − 1)    (5.33a)

and the third stage computes X = X1 + 2^n(2^{2n} − 1) Z. Noting that

(1/(2^n(2^{2n} − 1))) mod (2^{n+1} − 1) = (2^{n+2} − 10)/3,

the authors realize Z as

Z = ((x4 − X1)(2^{n+2} − 10)/3) mod (2^{n+1} − 1) = (S Q) mod (2^{n+1} − 1)    (5.33b)

where S = (1/3) mod (2^{n+1} − 1) and Q = ((x4 − X1)(2^{n+2} − 10)) mod (2^{n+1} − 1). Note that S can be realized as

S = (1/3) mod (2^{n+1} − 1) = 2^0 + 2^2 + 2^4 + ... + 2^n.

Thus, Z can be computed as a sum of shifted and rotated versions of Q, available in carry-save form, using a tree of CSAs with end-around carry. In the four-stage converter, the sum and carry vectors realizing Q are first added in a mod (2^{n+1} − 1) adder and then multiplied with S, realized by summing shifted and rotated terms. The same technique has been used for the other moduli set as well.
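The two constants used in (5.33a)/(5.33b) can be checked for several even n:

```python
# For even n: the inverse of 2^n (2^(2n) - 1) mod (2^(n+1) - 1) is
# (2^(n+2) - 10)/3, and the inverse of 3 is the even-power sum
# 2^0 + 2^2 + ... + 2^n.

for n in (2, 4, 6, 8):
    m4 = 2**(n + 1) - 1
    inv = (2**(n + 2) - 10) // 3              # exact: 2^(n+2) = 1 (mod 3)
    assert (2**n * (2**(2*n) - 1) * inv) % m4 == 1
    S = sum(4**i for i in range(n // 2 + 1))  # 2^0 + 2^2 + ... + 2^n
    assert (3 * S) % m4 == 1
print("constants of (5.33) verified for n = 2, 4, 6, 8")
```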
Reverse converters for the four moduli set {2^n − 1, 2^n + 1, 2^n − 3, 2^n + 3} have also been described, which use ROMs and combinational logic [48, 57-59]. The designs in [48, 57, 58] consider, in the first level, the two two-moduli sets {2^n − 3, 2^n + 1} and {2^n + 3, 2^n − 1} to compute the decoded numbers Xa and Xb respectively using MRC. Sheu et al. [57] use a ROM-based approach. In the design in [58], the Montgomery algorithm is used to perform the multiplication with the multiplicative inverse needed in MRC. This takes advantage of the fact that (1/m2) mod m1 = (1/4) mod m1, where m1 = 2^n − 3 and m2 = 2^n + 1. Thus, (((x1 − x2) mod m1)/4) mod m1 amounts to adding a multiple of m1 to (x1 − x2) mod m1 so as to make the two LSBs zero, so that the division by 4 amounts to ignoring the two LSBs. In the case of the computation of Xb, (1/m3) mod m4 = (1/4) mod m4 = 2^{n−2}, where m3 = 2^n + 3 and m4 = 2^n − 1. The multiplication with 2^{n−2} mod (2^n − 1) can be carried out in a simple manner by bit rotation of (x3 − x4) mod m4. In the case of MRC in the second level, note that (1/(m3 m4)) mod (m1 m2) = (1/2^{n+2}) mod (m1 m2), enabling the Montgomery technique to be used easily.
In [58], MRC using ROMs and CRT using ROMs have also been explored. In the MRC techniques, the modulo subtractions are realized using logic, whereas the multiplication with the multiplicative inverse is carried out using ROMs. In the CRT-based method, the various Mi ((1/Mi) mod mi) values are stored in ROM. A carry-save adder followed by a CPA and a modulo reduction stage are used to compute the decoded result.
Jaberipur and Ahmadifar [59] have described a ROM-less, adder-only reverse converter for this moduli set. They consider a two-stage converter. The first stage performs Mixed Radix conversion corresponding to the two pairs of moduli {2^n − 1, 2^n + 1} and {2^n − 3, 2^n + 3} to obtain the residues corresponding to the pair of composite moduli {2^{2n} − 1, 2^{2n} − 9}. The multiplicative inverses needed are as follows:

(1/(2^n + 1)) mod (2^n − 1) = 2^{n−1},
(1/(2^n + 3)) mod (2^n − 3) = −(2^{n−3} + 2^{n−5} + ... + 2^3 + 2) for n even and
(1/(2^n + 3)) mod (2^n − 3) = 2^{n−3} + 2^{n−5} + ... + 2^2 + 2^0 for n odd,
(1/(2^{2n} − 9)) mod (2^{2n} − 1) = −2^{2n−3}.

The decoded words in the first and second stages can be easily obtained using multi-operand addition of circularly shifted words.
Patronik and Piestrak [60] have considered residue to binary conversion for a new moduli set {m1, m2, m3, m4} = {2^n + 1, 2^n, 2^n − 1, 2^{n−1} + 1} for n odd. They have described two converters. The first converter is based on MRC of the two moduli set {m1 m2 m3, m4}. This uses the Wang et al. converter [12] for the three moduli set to obtain the number X1 in the moduli set {m1, m2, m3}. The multiplicative inverse needed in MRC is

(1/(2^n(2^{2n} − 1))) mod (2^{n−1} + 1) = k1 = Σ_{i=0}^{(n−3)/2} 2^{2i+1} + 1    (5.34)

Note that since the lengths of the residues corresponding to the moduli m1 m2 m3 and m4 are different, the operation (x4 − X1) mod (2^{n−1} + 1) needs to be carried out using periodic properties of residues. The multiplication with the multiplicative inverse in (5.34) needs circular left shifts, one's complementing of the bits arriving in the LSBs due to the circular shift, and addition of all these modified partial products with a correction term using several CSA stages. Note that mod (2^{n−1} + 1) addition needs a correction to cater for inverting the carry and adding it in the LSB position. The number of partial products can be seen to be (n − 3)/2 + 2. The final computation of X1 + m1 m2 m3 ((k1(x4 − X1)) mod m4) can be rearranged to take advantage of the fact that the LSBs of the decoded word are already available as x3.
The second converter uses two-stage conversion comprising the moduli sets {m1 m2, m3 m4} using MRC. The numbers corresponding to the moduli sets m1 m2 and m3 m4 are obtained using CRT and MRC respectively in the first stage. The various multiplicative inverses used in CRT and MRC in this stage are as follows:

(1/(2^n + 1)) mod (2^n − 1) = (1/(2^n − 1)) mod (2^n + 1) = 2^{n−1}, (1/(2^{n−1} + 1)) mod 2^n = 2^{n−1} + 1    (5.35a)

The multiplicative inverse needed in MRC in the second stage is

(1/(2^n(2^{n−1} + 1))) mod (2^{2n} − 1) = ((1/2^n)(Σ_{i=0}^{(n−3)/2} (2^{2i+2} + 2^{2i+n+2}) + 2)) mod (2^{2n} − 1)    (5.35b)

The multiplication with this multiplicative inverse mod (2^{2n} − 1) can be obtained by using a multi-operand carry-save adder mod (2^{2n} − 1), which can yield sum and carry vectors RC and RS. Two versions of the second converter have been presented, which differ in the second stage.
Didier and Rivaille [61] have described a two-stage RNS to binary converter for moduli specially chosen to simplify the converter using ROMs. They suggest choosing pairs of moduli differing by a power of two, with the difference between the products of the pairs of moduli also being a power of two. Specifically, the set is of the type {m1, m2, m3, m4} = {m1, m1 + 2^{p1}, m3, m3 + 2^{p2}} such that m1 m2 − m3 m4 = 2^{pp}, where pp is an integer. In the first stage, the decoded numbers corresponding to the residues of {m1, m2} and {m3, m4} can be found, and in the second stage, the decoded number corresponding to the moduli set {m1 m2, m3 m4} can be found. The basic converter for the two moduli set {m1, m2} can be realized using one addition without needing any modular reduction. Denoting the residues as (r1, r2), the decoded number B1 can be written as B1 = r2 + (r1 − r2, 0), where the second term corresponds to the binary number corresponding to (r1 − r2, 0). Since r1 − r2 can be negative, it can be written as an α-bit two's complement number with a sign bit S and (α − 1) remaining bits. The authors suggest that the decoded number be obtained using a look-up table T addressed by the sign bit and the p LSBs, where m2 − m1 = 2^p, and using an addition operation as follows:

B1 = r2 + (m2 × MSB(r1 − r2)_{α−1...p}) + T(sign(r1 − r2), LSB(r1 − r2)_{p−1...0})    (5.36)

Some of the representative moduli sets are {7, 9, 5, 13}, {23, 39, 25, 41}, {127, 129, 113, 145} and {511, 513, 481, 545}. As an illustration, the implementation for the RNS {511, 513, 481, 545} needs an area of 170 AFA and 2640 bits of ROM, and needs a conversion time of 78 ΔFA + 2 ΔROM, where ΔFA is the delay of a full adder, ΔROM is the ROM access time and AFA is the area of a full adder.
We next consider four moduli sets with dynamic range (DR) of the order of 5n and 6n bits. The four moduli set {2^n, 2^n − 1, 2^n + 1, 2^{2n} + 1} [62] is attractive, since New CRT-I-based reduction can be easily carried out. However, the bit length of one modulus is double that of the other three moduli. Note that this moduli set can be considered to be derived from {2^{2n} − 1, 2^n, 2^{2n} + 1} [48, 49].
The reverse converters for the moduli set {2^n − 1, 2^n + 1, 2^{2n+1} − 1, 2^n} with a DR of about (5n + 1) bits and {2^n − 1, 2^n + 1, 2^{2n}, 2^{2n} + 1} with a DR of about 6n bits, based on New CRT II and New CRT I respectively, have been described in [63]. In the first case, MRC is used for the two two-moduli sets {m1, m2} = {2^n, 2^{2n+1} − 1} and {m3, m4} = {2^n + 1, 2^n − 1} to compute Z and Y. A second MRC stage computes X from Y and Z:

Z = x1 + 2^n ((2^{n+1}(x2 − x1)) mod (2^{2n+1} − 1))    (5.37a)

Y = x3 + (2^n + 1)((2^{n−1}(x4 − x3)) mod (2^n − 1))    (5.37b)

X = Z + 2^n(2^{2n+1} − 1)((2^n(Y − Z)) mod (2^{2n} − 1))    (5.37c)

Due to the modulo reductions, which are convenient, the hardware can be simpler.
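A behavioral sketch of (5.37a)-(5.37c) (illustrative names; all three multiplicative inverses are powers of two):

```python
# Two-level MRC decode for {2^n - 1, 2^n + 1, 2^(2n+1) - 1, 2^n}:
# x1 mod 2^n, x2 mod 2^(2n+1)-1, x3 mod 2^n+1, x4 mod 2^n-1.

def decode_5_37(x4, x3, x2, x1, n):
    Z = x1 + 2**n * ((2**(n + 1) * (x2 - x1)) % (2**(2*n + 1) - 1))
    Y = x3 + (2**n + 1) * ((2**(n - 1) * (x4 - x3)) % (2**n - 1))
    return Z + 2**n * (2**(2*n + 1) - 1) * ((2**n * (Y - Z)) % (2**(2*n) - 1))

n = 2
M = (2**n - 1) * (2**n + 1) * (2**(2*n + 1) - 1) * 2**n   # 1860
assert all(decode_5_37(X % (2**n - 1), X % (2**n + 1),
                       X % (2**(2*n + 1) - 1), X % 2**n, n) == X
           for X in range(M))
print("(5.37) verified for all", M, "values with n =", n)
```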
In the case of the moduli set {2^n − 1, 2^n + 1, 2^{2n}, 2^{2n} + 1}, New CRT-I has been used, with the moduli ordered for conversion as {m1, m2, m3, m4} = {2^{2n}, 2^{2n} + 1, 2^n + 1, 2^n − 1}. The decoded number in this case is given by

X = x1 + 2^{2n} ((2^{2n}(x2 − x1) + 2^{2n−1}(2^{2n} + 1)(x3 − x2) + 2^{n−2}(2^{2n} + 1)(2^n + 1)(x4 − x3)) mod (2^{4n} − 1))    (5.38)
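A numerical check of (5.38); here the residues (x1, x2, x3, x4) are taken modulo (2^{2n}, 2^{2n} + 1, 2^n + 1, 2^n − 1) in that order (illustrative function name):

```python
# New CRT-I decode (5.38): a single reduction mod 2^(4n) - 1, then x1 is
# appended as the 2n LSBs.

def decode_5_38(x1, x2, x3, x4, n):
    s = (2**(2*n) * (x2 - x1)
         + 2**(2*n - 1) * (2**(2*n) + 1) * (x3 - x2)
         + 2**(n - 2) * (2**(2*n) + 1) * (2**n + 1) * (x4 - x3))
    return x1 + 2**(2*n) * (s % (2**(4*n) - 1))

n = 2
M = 2**(2*n) * (2**(2*n) + 1) * (2**n + 1) * (2**n - 1)   # 4080
assert all(decode_5_38(X % 2**(2*n), X % (2**(2*n) + 1),
                       X % (2**n + 1), X % (2**n - 1), n) == X
           for X in range(M))
print("(5.38) verified for all", M, "values with n =", n)
```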

Zhang and Siy [64] have described an RNS to binary converter for the moduli set {2^n − 1, 2^n + 1, 2^{2n} − 2, 2^{2n+1} − 3} with a DR of about (6n + 1) bits. They consider two-level MRC using the two moduli pairs {m1 = 2^n − 1, m2 = 2^n + 1} and {m3 = 2^{2n} − 2, m4 = 2^{2n+1} − 3}. The multiplicative inverses are very simple:

(1/m2) mod m1 = 2^{n−1}, (1/m4) mod m3 = 1, (1/(m3 m4)) mod (m1 m2) = 1    (5.39)

Sousa and Antao [65] have described MRC-based RNS to binary converters for the moduli sets {2^n + 1, 2^n − 1, 2^n, 2^{2n+1} − 1} and {2^n − 1, 2^n + 1, 2^{2n}, 2^{2n+1} − 1}. They consider in the first level the pairs {x1, x2} = {2^n − 1, 2^n + 1} and {x3, x4} = {2^{n(1+α)}, 2^{2n+1} − 1}, where α = 0, 1 corresponds to the two moduli sets, to compute X12 and X34 respectively. The multiplicative inverses in the first level are

(1/(2^n + 1)) mod (2^n − 1) = 2^{n−1}, (1/(2^{2n+1} − 1)) mod 2^{(1+α)n} = 2^{(1+α)n} − 1,

and in the second level they are

(1/(2^{3n+1} − 2^n)) mod (2^{2n} − 1) = 2^n for α = 0 and (1/(2^{4n+1} − 2^{2n})) mod (2^{2n} − 1) = 1 for α = 1.

Note that all modulo operations are mod (2^n − 1), mod 2^{(1+α)n} and mod (2^{2n} − 1), which are convenient to realize. The authors use X12 and X34 in carry-save form for computing (X12 − X34) mod (2^{2n} − 1), thus reducing the critical path.
2 1
Stamenkovic and Jovanovic [66] have described a reverse converter for the four moduli set {2^n − 1, 2^n, 2^n + 1, 2^{2n+1} − 1}. They have suggested exploring the 24 possible orderings of the moduli for use in MRC, so that the multiplicative inverses are 1 and 2^{n−1}. The recommended ordering is {2^{2n+1} − 1, 2^n, 2^n + 1, 2^n − 1}. This leads to MRC using only subtractors and not needing modulo multiplications. They have not, however, presented the details of the hardware requirements and conversion delay.
The reverse converter for the five moduli set [67] {2^n − 1, 2^n, 2^n + 1, 2^{n+1} − 1, 2^{n−1} − 1} for n even uses in the first level the converter for the four moduli set {2^n − 1, 2^n, 2^n + 1, 2^{n+1} − 1} due to [54], and then uses MRC to include the fifth modulus (2^{n−1} − 1).
Hiasat [68] has described CRT-based reverse converters for two five moduli sets: {2^n, 2^n − 1, 2^n + 1, 2^n − 2^{(n+1)/2} + 1, 2^n + 2^{(n+1)/2} + 1} when n is odd and n ≥ 5, and {2^{n+1}, 2^n − 1, 2^n + 1, 2^n − 2^{(n+1)/2} + 1, 2^n + 2^{(n+1)/2} + 1} when n is odd and n ≥ 7. Note that this moduli set uses the factored forms of the two moduli (2^{2n} − 1) and (2^{2n} + 1) in the moduli set {2^n, 2^{2n} − 1, 2^{2n} + 1}. The reverse conversion procedure is similar to the Andraros and Ahmad technique [4] of evaluating the 4n MSBs, since the n LSBs of the decoded result are already available. The architecture needs the addition of eight 4n-bit words using a 4n-bit CSA with EAC followed by a 4n-bit CPA with EAC, or a modulo (2^{4n} − 1) adder using parallel-prefix architectures.
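The two conjugate moduli are indeed the algebraic (Aurifeuillian-type) factors of 2^{2n} + 1 for odd n, which a short check confirms:

```python
# (2^n - 2^((n+1)/2) + 1)(2^n + 2^((n+1)/2) + 1) = 2^(2n) + 1 for odd n.
for n in (5, 7, 9, 11):
    lo = 2**n - 2**((n + 1) // 2) + 1
    hi = 2**n + 2**((n + 1) // 2) + 1
    assert lo * hi == 2**(2*n) + 1
print("factorization of 2^(2n) + 1 verified for n = 5, 7, 9, 11")
```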
Skavantzos and Stouraitis [69] and Skavantzos and Abdallah [70] have suggested general converters for moduli products of the form 2^a(2^b − 1), where 2^b − 1 is made up of several conjugate moduli pairs such as (2^n − 1), (2^n + 1) or (2^n + 2^{(n+1)/2} + 1), (2^n − 2^{(n+1)/2} + 1). The reverse converter for conjugate moduli is quite simple, needing rotation of bits, one's complementing and addition using modulo (2^{4n} − 1) adders or modulo (2^{2n} − 1) adders. The authors suggest two-level converters which find the final binary number corresponding to the intermediate residues using MRC. The first-level converter uses CRT, whereas the second level uses MRC. The four moduli sets {2^{n+1}, 2^n − 1, 2^{n+1} − 1, 2^{n+1} + 1} for n odd and {2^n, 2^n − 1, 2^{n−1} − 1, 2^{n−1} + 1} for n odd, the five moduli sets {2^{n+1}, 2^n − 1, 2^n + 1, 2^{n+1} − 1, 2^{n+1} + 1} and {2^n, 2^n − 1, 2^n + 1, 2^n + 2^{(n+1)/2} + 1, 2^n − 2^{(n+1)/2} + 1}, and the RNS with seven moduli {2^{n+3}, 2^n − 1, 2^n + 1, 2^{n+2} − 1, 2^{n+2} + 1, 2^{n+2} + 2^{(n+3)/2} + 1, 2^{n+2} − 2^{(n+3)/2} + 1} have been suggested. Other RNS with only pairs of conjugate moduli, up to 8 moduli, have also been suggested. Note that care must be taken to see that the moduli are relatively prime. Note that in the case of one common factor existing among the two sets of moduli, this should be taken into account in the application of CRT in the second-level converter.
Pettenghi et al. [71] have described general RNS to binary converters for the moduli sets {2^{n+β}, 2^n − 1, 2^n + 1, 2^n + k1, 2^n − k1} and {2^{n+β}, 2^n − 1, 2^n − k1, 2^n − k2, ..., 2^n − kf} using CRT. In the case of the first moduli set, they compute ⌊X/m1⌋, where m1 = 2^{n+β}, as ⌊X/m1⌋ = (Σ_{i=1}^{5} Vi xi) mod (M/m1), where the Vi for i = 2, ..., 5 are integer constants derived from (1/Mi) mod mi (integers, since m1 divides Mi exactly). On the other hand, in the case of V1, we have

V1 = ((1/M1) mod m1)(2^{3n−β}(2^{n−β}(k1^2 + 1)) + ψ)    (5.40a)

where ψ is defined by

((1/M1) mod m1) k1^2 = ψ m1^2 + 1    (5.40b)

Note that the fractional part in the computation of ⌊X/m1⌋ can be removed using this technique. As an illustration, for m1 = 2^{n+β} with k1 = 3, β = 2 and n = 4, we have m1 = 64, m2 = 15, m3 = 17, m4 = 13, m5 = 19, ψ = 2, (1/M1) mod m1 = 57 and V1 = 14,024, V2 = 58,786, V3 = 59,280, V4 = 43,605 and V5 = 13,260. Note that the technique can be extended to the case of additional moduli pairs with different k1, k2, etc.
Skavantzos et al. [72] have suggested, in the case of the balanced eight moduli RNS using the moduli set {m1, m2, m3, m4, m5, m6, m7, m8} = {2^{n−5} − 1, 2^{n−3} − 1, 2^{n−3} + 1, 2^{n−2} + 1, 2^{n−1} − 1, 2^{n−1} + 1, 2^n, 2^n + 1}, four first-level converters comprising the moduli pairs {2^{n−3} − 1, 2^{n−3} + 1}, {2^{n−5} − 1, 2^{n−2} + 1}, {2^{n−1} − 1, 2^{n−1} + 1} and {2^n, 2^n + 1}, to obtain the results B, D, C and E respectively. The computation of

D = x4 + m4 X01    (5.41a)

where

X01 = ((1/(2^{n−2} + 1))(x1 − x4)) mod (2^{n−5} − 1)    (5.41b)

needs a multi-operand modulo (2^{n−5} − 1) CSA tree followed by a modulo (2^{n−5} − 1) CPA. The computation of E is simpler:

E = x8 + m8 (((1/m8)(x7 − x8)) mod m7)    (5.42)

where m8 = 2^n + 1 and m7 = 2^n.
The second-level converter takes the pairs {B, D} and {C, E} and evaluates the corresponding numbers F and G respectively, which also uses MRC; this can be realized by a multi-operand modulo (2^{2n−6} − 1) CSA tree followed by a modulo (2^{2n−6} − 1) CPA to obtain F, and a multi-operand modulo (2^{2n−2} − 1) CSA tree followed by a modulo (2^{2n−2} − 1) CPA to obtain G.
The last stage estimates the final result using two-channel CRT with M1* = m1 m2 m3 m4 and M2* = m5 m6 m7 m8. We shall take into account the fact that M1* and M2* have 3 as a common factor. The New CRT-III described in Section 5.3 can be used to perform the reverse conversion. We need to compute

X = (F N′1 (M2*/3) + G N′2 (M1*/3)) mod M*

where N′1 = N1 mod (M1*/3), N′2 = N2 mod (M2*/3), N1 = (1/(M2*/3)) mod (M1*/3), N2 = (1/(M1*/3)) mod (M2*/3) and M* = (Π_{i=1}^{8} mi)/3.
The computation of (F N′1) mod (M1*/3) can be carried out by first finding Q = F N′1 and then finding H = Q mod (M1*/3). Note that Q can be computed using CRT. Similarly, J = (G N′2) mod (M2*/3) can be computed.
Pettenghi et al. [73] have suggested RNS to binary converters for moduli sets with dynamic ranges up to (8n + 1) bits. They have extended the five-moduli set due to Hiasat [68], {2^n, 2^n − 1, 2^n + 1, 2^n + 2^((n+1)/2) + 1, 2^n − 2^((n+1)/2) + 1}, in two ways, denoted as vertical and horizontal extensions. The former considers the modulus 2^(n+β) in place of 2^n, where β is variable (n or 2n). In the latter case, they augment the moduli set with another modulus (2^(n+1) + 1) or (2^(n−1) + 1) and optionally employ β as well. Thus, dynamic ranges up to (8n + 1) bits can be obtained. They have used both CRT and MRC for realizing the RNS to binary converters. They observe that the area–delay product can be up to 1.34 times that of the state-of-the-art converters due to Skavantzos et al. [72] and Chaves and Sousa [21]. Note, however, that the horizontal extension has a large conversion delay overhead.
Chalivendra et al. [74] have extended the three-moduli set {2^n − 1, 2^(n+k), 2^n + 1} [21] to a four-moduli set {2^k, 2^n − 1, 2^n + 1, 2^(n+1) − 1} for even n, where n < k < 2n. They use the reverse converter of [21] to first find the decoded number X1 and next use MRC to include the modulus 2^(n+1) − 1. The decoded word is computed as

X = X1 + 2^k·(2^(2n) − 1)·|(x4 − X1)·|1/(2^k·(2^(2n) − 1))|_{2^(n+1)−1}|_{2^(n+1)−1}    (5.43)

The multiplicative inverse can be found as

|1/(2^k·(2^(2n) − 1))|_{2^(n+1)−1} = |−(1/3)·2^(n+3−k)|_{2^(n+1)−1} for k < (n + 3)    (5.44a)

and

|1/(2^k·(2^(2n) − 1))|_{2^(n+1)−1} = |−(1/3)·2^(2n+4−k)|_{2^(n+1)−1} for k ≥ (n + 3)    (5.44b)

Note that X1 can be found as X1 = x1 + 2^k·Y1, since the k LSBs are already available as x1. The authors compute X as

X = x1 + 2^k·Y1 + 2^k·(2^(2n) − 1)·|(1/3)·(X1·2^(k') − x4·2^(k'))|_{2^(n+1)−1}    (5.45)

where k' denotes the power-of-two exponent from (5.44a) or (5.44b). Note that X1·2^(k') and x4·2^(k') can be easily obtained as four words, thus needing a total of five words to be added using a CSA tree followed by a mod (2^(n+1) − 1) adder. The multiplication by |1/3|_{2^(n+1)−1} is realized as Σ_{i=0}^{n/2} 2^(2i), thus needing several rotated versions of the four words to be added in a CSA tree with EAC to find Z.
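The closed-form inverses (5.44a) and (5.44b) can be verified against direct modular inversion. The sketch below is our own check, not part of [74], and assumes the reconstructed negative sign:

```python
def inverse_5_44(n, k):
    """Closed form of |1/(2^k (2^(2n)-1))|_{2^(n+1)-1} per (5.44a)/(5.44b), n even."""
    m = 2 ** (n + 1) - 1
    inv3 = pow(3, -1, m)  # equals sum_{i=0}^{n/2} 2^(2i) for even n
    e = (n + 3 - k) if k < n + 3 else (2 * n + 4 - k)
    return (-inv3 * pow(2, e, m)) % m
```

A loop over several (n, k) pairs confirms that the closed form agrees with `pow(x, -1, m)`.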
Patronik and Piestrak [75] have described RNS to binary converters for the two four-moduli sets {2^k, 2^n − 1, 2^n + 1, 2^(n+1) − 1} and {2^k, 2^n − 1, 2^n + 1, 2^(n−1) − 1}, where k can be any integer, for n even. They first derive a reverse converter for the three-moduli set {2^k, 2^n − 1, 2^n + 1}, using a converter for the two-moduli set {2^n − 1, 2^n + 1} based on CRT, followed by MRC for the composite set {2^k, 2^(2n) − 1}. They derive the 2n MSBs of the result X1, denoted as Xh. A second stage uses MRC of the two moduli sets {2^k·(2^(2n) − 1), 2^(n+1) − 1} for the first four-moduli set and {2^k·(2^(2n) − 1), 2^(n−1) − 1} for the second four-moduli set. They suggest two versions of the reverse converters, which realize |x4 − x1 − 2^k·Xh|_{2^(n+1)−1} or ||x4 − x1|_{2^(n+1)−1} − |2^k·Xh|_{2^(n+1)−1}|_{2^(n+1)−1}. The moduli sets are considered as {m1, m2, m3, m4}.
Interestingly, they suggest that for the multiplication with the multiplicative inverse F(w) = Σ_{i=0}^{w−1} 2^(2i), a 2w-bit word 0101. . .0101, the number of adders can be reduced by using the "constant multiplication" technique using graphs. The various multiples required can be obtained using fewer adders. As an illustration, for n = 6, F(4) = 85 = 5 × (16 + 1). (Note that |1/3|_{2^(n+1)−1} = Σ_{i=0}^{n/2} 2^(2i) = F(n/2 + 1).) Thus, instead of using an 8-operand CSA, we can use a 4-operand CSA to compute 5L and add rotated versions of the carry and save vectors to compute 85L, needing only
four adders.
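The adder saving from the factorization 85 = 5 × 17 is easy to reproduce. The two-stage shift-and-add below is a sketch of the idea in plain integer arithmetic; the hardware version would instead use circular shifts modulo 2^(n+1) − 1:

```python
def times_85(L):
    t = (L << 2) + L      # 5L  = 4L + L        (first stage)
    return (t << 4) + t   # 85L = 16*(5L) + 5L  (second stage)

def times_85_direct(L):
    # one operand per 1-bit of F(4) = 2^0 + 2^2 + 2^4 + 2^6 = 85
    return sum(L << (2 * i) for i in range(4))
```

Both routines produce 85·L; the factored form needs only two additions instead of three.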
Conway and Nelson [76] have described a CRT-based RNS converter using a restricted moduli set of the type 2^i or 2^i ± 1. In their formulation, the CRT expansion corresponding to an L-moduli set is multiplied by 2/M to yield

Xf = Σ_{i=1}^{L} x_i·y_i − 2r    (5.46a)

where

y_i = (2/m_i)·|1/M_i|_{m_i}    (5.46b)

so that Xf will be in the range [0, 2). The advantage is that the subtraction of 2r can be realized by dropping the MSBs. The decoded value can be obtained by scaling Xf by M/2. Note that y_i in (5.46b) can be approximated as

ŷ_i = 2^(−b)·⌈2^b·y_i⌉ = 2^(−b)·⌈(2^(b+1)/m_i)·|1/M_i|_{m_i}⌉    (5.47)

The value of b can be selected so as to reduce the error enough to correctly distinguish different Xf values. The error in (5.47) is such that 0 ≤ e_i < 2^(−b). Thus, the overall error is e = Σ_{i=1}^{L} e_i·x_i, which in the worst case is 2^(−b)·Σ_{i=1}^{L} (m_i − 1); this must be less than 2/M. Conway and Nelson observe that ŷ_i can be estimated in a simple manner by observing certain properties of the moduli of the type 2^n, 2^n + 1, 2^n − 1. The case of 2^n is quite simple: 2^(b+1)/2^n = 2^(b+1−n). In the case of ⌊2^(b+1)/(2^n − 1)⌋·2^(−b), we observe that it has a periodic pattern of 1s separated by (n − 1) zeroes. Hence, multiplying by an n_i-bit number does not need any additional operations. In the case of ⌊2^(b+1)/(2^n + 1)⌋·2^(−b) also, a periodic pattern exists, which has some digits of value 1 and some of value −1 interspersed with zeroes. Due to this property, the negative digits are treated as another word. Both words need to be multiplied by the n_i-bit number and added.
Example 5.8 This example demonstrates reverse conversion using the Conway and Nelson technique. We consider the moduli set {7, 8, 17} and given X = (4, 1, 4) = 361.
Multiplying by the respective |1/M_i|_{m_i}, we have X' = (6, 7, 11); in binary form these residues are (110, 111, 1011), and we take b = 14. We have ⌊2^15/7⌋·2^(−14) = 0.01001001001001. Thus, the first two cases do not require any multiplications. On the other hand, ⌊2^15/17⌋·2^(−14) can be written in signed-digit form as 0.0010001000100010 with alternating signs on the 1s (+1, −1, +1, −1, . . .). The authors suggest writing this last word as two words due to the negative components. Thus, the four words corresponding to moduli 7, 8 and 17, after weighting by the residues, are as follows:

ŷ1'·x1' = 1.10110110110110
ŷ2'·x2' = 1.11000000000000
ŷ3'·x3' = 1.011010010110100
        + 1.111000011110001

Summing and dropping bits of weight 2 or greater yields

Xf = 0.110000100010001 = 0.75833129882. . .

Multiplying by M/2 = 476, we get X = 360.9657 ≈ 361.

In this technique, there is no need for multiplications, and the words that need to be added can be obtained in a simple manner by exploiting the periods of these terms. The authors also suggest an algorithmic approach to obtain
full adder-based architectures to minimize area and delay.
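Example 5.8 can be reproduced in fixed-point arithmetic. The sketch below is our own helper (not the authors' architecture), with the b-bit weights of (5.47) stored as integers; `math.prod` and the modular-inverse form of `pow` need Python 3.8+:

```python
import math

def conway_nelson_decode(residues, moduli, b=14):
    M = math.prod(moduli)
    acc = 0
    for x, m in zip(residues, moduli):
        xp = (x * pow(M // m, -1, m)) % m        # x_i' = |x_i * (1/M_i)|_{m_i}
        acc += xp * math.ceil(2 ** (b + 1) / m)  # weight approximates 2/m_i, (5.47)
    acc %= 2 ** (b + 1)              # keep the fraction Xf in [0, 2): drop MSBs
    return round(acc * M / 2 ** (b + 1))         # rescale by M/2
```

For the residues (4, 1, 4) of the example, the routine recovers 361.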
Lojacono et al. [77] have suggested an RNS to binary conversion architecture for moduli sets having no modulus of the type 2^n, in which the given residues are first scaled by 2^k and CRT is applied. Next, the effect of scaling is removed by division by 2^k at the end, after the addition of a multiple of M. For the given residues (x1, x2, x3, x4) corresponding to a binary number X, the multiplication by 2^k can be carried out by multiplying the weights in the CRT expansion by 2^k mod m_i:

X·2^k = Σ_{i=1}^{4} M_i·|x_i·2^k·(1/M_i)|_{m_i} − αM    (5.48)

The value of α can be determined using the fact that, due to the scaling by 2^k, the k LSBs shall be zero. Hence, by using the k LSBs of the first term in (5.48) to look into a look-up table, the value of α can be determined and αM can be subtracted. The result is exactly divisible by 2^k. This is similar to Montgomery's algorithm used for scaling. The authors observe that k shall be such that 2^k ≥ N − 1, where N is the number of moduli. Note that α < N ≤ 2^k.
Note that in the case of a two's complement number being desired as the output, an addition of (M − 1)/2 is performed in the beginning. The result can be obtained by subtracting (M − 1)/2 at the end if the result is positive, or else by adding (M − 1)/2. In short, we compute

X̃ = (Ĥ − αM)/2^k    (5.49a)

where

Ĥ = Σ_{i=1}^{4} M_i·|x_i·2^k·(1/M_i)|_{m_i}    (5.49b)

and the two's complement result is XC = X̃ + (M − 1)/2 if X̃ is negative and XC = X̃ − (M − 1)/2 if X̃ is positive.
Re et al. [78] have later extended this technique to the case of one power-of-two modulus m_N = 2^h in the moduli set. This needs a subtraction of 2^k·r_N from the weighted sum first and then division by 2^h:

X = ε·2^h + r_N, where ε = ((H − 2^k·r_N)/2^h + α·M̃)/2^k and M̃ = M/2^h    (5.50)

where H is the weighted sum of (5.49b) and α is determined, as in Montgomery's technique, so that the numerator of the last division is exactly divisible by 2^k.

Example 5.9 As an illustration, let us consider the moduli set {17, 15, 16} and the given residues (2, 9, 15). Obtain the decoded number using the Re et al. technique [78].
We have M1 = 240, M2 = 272, M3 = 255 and |1/240|_17 = 9, |1/272|_15 = 8, |1/255|_16 = −1. Scaling by 2^k = 8, the input residues multiplied by |1/M_i|_{m_i} yield the weights |9 × 2 × 8|_17 = 8, |8 × 9 × 8|_15 = 6, |−1 × 15 × 8|_16 = 8. Thus, CRT yields the weighted sum 8 × 240 + 6 × 272 + 8 × 255 = 5592. Subtracting from this the scaled number corresponding to the residue mod 16, 8 × 15 = 120, we obtain 5472. Dividing this by 16, we obtain 342. Next, division by 8 is needed to take into account the pre-scaling. Using the Montgomery technique, we obtain |−1/255|_8 = 1. Noting that 342 mod 8 = 6, we need to add 6M̃ = 6 × 255 to 342 to finally get (342 + 6 × 255)/8 = 234. Thus the decoded value is
234 × 16 + 15 = 3759, as desired.
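The steps of Example 5.9 can be traced programmatically. The following sketch (helper names are ours) scales the CRT weights by 2^k, peels off the power-of-two residue, and removes the pre-scaling Montgomery-style:

```python
import math

def decode_re(residues, moduli, pow2_index, k):
    M = math.prod(moduli)
    mN, rN = moduli[pow2_index], residues[pow2_index]   # mN = 2^h
    # CRT weighted sum with every weight pre-scaled by 2^k, as in (5.48)
    H = sum(((pow(M // m, -1, m) * x * 2 ** k) % m) * (M // m)
            for x, m in zip(residues, moduli))
    eps = (H - 2 ** k * rN) // mN          # exact division by 2^h
    Mt = M // mN                           # M~ = M / 2^h
    alpha = (-eps * pow(Mt, -1, 2 ** k)) % (2 ** k)   # Montgomery digit
    eps = ((eps + alpha * Mt) // 2 ** k) % Mt         # undo the 2^k pre-scaling
    return eps * mN + rN
```

With the data of Example 5.9 (h index 2, k = 3), the routine returns 3759.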

5.6 RNS to Binary Conversion Using Core Function

Some authors have investigated the use of the core function for RNS to binary conversion as well as for scaling. This will be briefly considered next. Consider the moduli set {m1, m2, m3, . . ., mk}. We need first to choose a constant C(M) as the core [79–86]. It can be the largest modulus in the RNS or a product of two or more moduli.
As in the case of CRT, we define the various B_i as B_i = M_i·|1/M_i|_{m_i}, where M_i = M/m_i and |1/M_i|_{m_i} is the multiplicative inverse of M_i mod m_i. We next need to compute weights w_i defined as

w_i = |C(M)·(1/M_i)|_{m_i}    (5.51a)

The weights also need to satisfy the condition

C(M) = M·Σ_{i=1}^{k} (w_i/m_i)    (5.51b)

thereby necessitating that some weights be chosen negative. The weights are next used to compute the core function C(n) of a given number n as

C(n) = n·(C(M)/M) − Σ_{i=1}^{k} (w_i/m_i)·r_i    (5.52)

Note that the core values C(B_i) corresponding to the inputs B_i can be seen from (5.52) to be

C(B_i) = B_i·(C(M)/M) − w_i/m_i    (5.53)

since the residue r_i of B_i corresponding to m_i is 1 and the residues of B_i corresponding to all other moduli are zero. Note that the various C(B_i) values are constants for a chosen moduli set. From (5.51b) and (5.53), it can be seen that C(B_i) < C(M) since B_i < M.
The residue to binary conversion corresponding to the residues (r1, r2, r3, . . ., rk) is carried out by first determining the core function C(n) of the given number n as

C(n) = |Σ_{i=1}^{k} r_i·C(B_i)|_{C(M)} = Σ_{i=1}^{k} r_i·C(B_i) − α·C(M)    (5.54a)

where α is known as the rank function defined by CRT. Note that (5.54a) is known as the CRT for the core function. Next, n can be computed by rewriting (5.52) as

n = (M/C(M))·(C(n) + Σ_{i=1}^{k} (w_i/m_i)·r_i)    (5.54b)

The important property of C(n) is that the term n·C(M)/M in (5.52) monotonically increases with n, with some fuzziness due to the second term in (5.52), even though the choice of some weights w_i in (5.51a) as negative numbers reduces the fuzziness (see (5.52)). Hence, accurate comparison of two numbers using the core function, or sign detection, is difficult.
The main advantage claimed for using the core function is that the constants C(B_i) involved in computing the core function following (5.53) are small, since they are less than C(M). However, in order to simplify or avoid the cumbersome division by C(M) needed in (5.54a), it has been suggested that C(M) be chosen as a power of two, or one modulus, or a product of two or more moduli.
The following example illustrates the procedure of computing Core and reverse
conversion using Core function.
Example 5.10 Consider the moduli set {3, 5, 7, 11}. Let us choose C(M) = 11. Then M = 1155, M1 = 385, M2 = 231, M3 = 165 and M4 = 105. The values of |1/M_i|_{m_i} are 1, 1, 2, 2. Thus, B_i = M_i·|1/M_i|_{m_i} are 385, 231, 330, 210. The w_i can be found as −1, 1, 1, 0. Next, we find C(B1) = 4, C(B2) = 2, C(B3) = 3 and C(B4) = 2. Consider the residues (1, 2, 3, 8). Then we find C(n) = (1 × 4 + 2 × 2 + 3 × 3 + 8 × 2) mod 11 = 0. Next, n can be found as n = 105 × (0 + (−1/3) + (2/5) + (3/7)) = 105 × (52/105) = 52.
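Example 5.10 can be checked end to end. The sketch below uses our own helper names, with `fractions` keeping the w_i/m_i terms exact; it computes the core via (5.54a) and recovers n via (5.54b). Note that the mod-C(M) reduction is ambiguity-free only for well-chosen weights, as discussed next in the text:

```python
import math
from fractions import Fraction

def core(x, moduli, w):
    # C(x) = sum_i w_i * floor(x / m_i)
    return sum(wi * (x // mi) for wi, mi in zip(w, moduli))

def decode_core(residues, moduli, w):
    M = math.prod(moduli)
    CM = core(M, moduli, w)                       # the chosen core C(M)
    B = [(M // m) * pow(M // m, -1, m) for m in moduli]
    Cn = sum(r * core(b, moduli, w) for r, b in zip(residues, B)) % CM  # (5.54a)
    frac = sum(Fraction(wi * ri, mi) for wi, ri, mi in zip(w, residues, moduli))
    return int(Fraction(M, CM) * (Cn + frac))     # (5.54b)
```

With the weights (−1, 1, 1, 0) of the example, `core(1155, ...)` returns C(M) = 11 and the residues (1, 2, 3, 8) decode to 52.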

Note that if 0 ≤ C(X) < C(M), then |C(X)|_{C(M)} = C(X). If C(X) < 0, then |C(X)|_{C(M)} = C(X) + C(M). If C(X) ≥ C(M), then |C(X)|_{C(M)} = C(X) − C(M). Thus, any specific value of |C(X)|_{C(M)} may correspond to two possible values of C(X), and there is an ambiguity in determining which one is correct. The ambiguity is due to the non-linear characteristic of the core function.
Miller [80] has suggested the use of a redundant modulus m_E larger than C(M) and computing

C(n) = |n·(C(M)/M) − Σ_{i=1}^{k} (w_i/m_i)·r_i|_{m_E}    (5.55)

so that no ambiguity can occur, because C(n) is evaluated to a modulus m_E greater than the range of the core. However, the method needs additional hardware. Krishnan et al. [86] have used an extra modulus to find the multiple of C(M) that needs to be subtracted, which also needs extra hardware.
Burgess [82] has suggested three techniques to resolve the ambiguity in core extraction. In the first technique, the ambiguity is detected if the least significant bit of |C(n)|_{C(M)} is not equal to (C(n))_2; otherwise |C(n)|_{C(M)} = C(n). If an ambiguity occurs, we need to add or subtract C(M) after comparison with C_min and C_max. The requirement of Burgess's first technique is C_max − C_min < 2C(M) for 0 ≤ X < M.
In Burgess's second and third techniques of core extraction, the input X is scaled down to X/2 and C(X/2) is calculated. If 0 ≤ |C(X/2)|_{C(M)} ≤ C_max/2, then C(X/2) = |C(X/2)|_{C(M)}; otherwise, if C(M) + C_min/2 ≤ |C(X/2)|_{C(M)} < C(M), it is an ambiguous condition and C(X/2) = |C(X/2)|_{C(M)} − C(M). The second and third techniques require that C_max/2 − C_min/2 < K for 0 ≤ X < M/2. In all three techniques, one or two comparisons followed by addition/subtraction must be used to solve the ambiguity problem.
Abtahi and Siy [84] have proposed a technique, known as Scale and Shift (SAS), in which no ambiguity exists in the computation of the core function. In order to satisfy the requirements of this method, the weight set must be selected properly. The authors have suggested a weight selection algorithm (WSA) which satisfies the SAS requirements, and they have suggested flat as well as hierarchical structures for the SAS technique for realizing RNS to binary conversion. Abtahi and Siy [85] have also suggested using the core function for sign determination. The reader is referred to their work for more information.

5.7 RNS to Binary Conversion Using Diagonal Function

We briefly review the concept of the diagonal function [87–90]. For a given moduli set {m1, m2, m3, . . ., mn}, where the moduli m_i are mutually prime, we first define a parameter, the "Sum of Quotients" (SQ), where

SQ = M1 + M2 + · · · + Mn    (5.56)

where M_i = M/m_i and M = Π_{i=1}^{n} m_i is the dynamic range of the RNS. We also define the constants

k_i = |−1/m_i|_{SQ} for i = 1, . . ., n.    (5.57)

It has been shown in [87] and [88] that the k_i values exhibit the property

|k1 + k2 + · · · + kn|_{SQ} = 0.    (5.58)

The diagonal function corresponding to a given number X with residues (x1, x2, . . ., xn) is defined next as

D(X) = |x1·k1 + x2·k2 + · · · + xn·kn|_{SQ}    (5.59)

Note that D(X) is a monotonic function. As such, two numbers X and Y can be compared based on the D(X) and D(Y) values. However, if these are equal, we need to compare any one of the coordinates (residues corresponding to any one modulus) of X with that of Y in order to determine whether X > Y, X = Y or X < Y. Pirlo and Impedovo [89] have observed that the diagonal function does not support RNS to binary conversion. However, it is now recognized [91] that it is possible to perform RNS to binary conversion using the diagonal function.

According to CRT, the binary number corresponding to the residues (x1, x2, . . ., xn) can be obtained as

X = x1·M1·|1/M1|_{m1} + x2·M2·|1/M2|_{m2} + · · · + xn·Mn·|1/Mn|_{mn} − rM    (5.60)

where r is an integer. Multiplying both sides of (5.60) by SQ/M, we have

X·SQ/M = (x1/m1)·SQ·|1/M1|_{m1} + (x2/m2)·SQ·|1/M2|_{m2} + · · · + (xn/mn)·SQ·|1/Mn|_{mn} − r·SQ
       = |(x1/m1)·SQ·|1/M1|_{m1} + (x2/m2)·SQ·|1/M2|_{m2} + · · · + (xn/mn)·SQ·|1/Mn|_{mn}|_{SQ}    (5.61)

Note that all the terms in (5.61) are mixed fractions, since SQ and m_i (i = 1, 2, . . ., n) are mutually prime.
From the definition of k_i in (5.57), we have

β_i·SQ − k_i·m_i = 1    (5.62)

Evidently, from (5.62), we can obtain β_i as

β_i = |1/SQ|_{m_i}    (5.63)

Substituting the value of SQ from (5.56) in (5.63), and noting that M_k is a multiple of m_j for k = 1, 2, . . ., n except for k = j, we have

β_i = |1/M_i|_{m_i}    (5.64)

Thus, from (5.62) and (5.64), we derive that

k_i = (β_i·SQ − 1)/m_i = (|1/M_i|_{m_i}·SQ − 1)/m_i = (SQ/m_i)·|1/M_i|_{m_i} − 1/m_i    (5.65a)

or

(SQ/m_i)·|1/M_i|_{m_i} = k_i + 1/m_i    (5.65b)
It is thus clear that by using k_i + 1/m_i in place of k_i in (5.59), the exact scaled value D'(X) = X·SQ/M as defined by (5.61) can be obtained. Thus, for RNS to binary conversion (in order to obtain the decoded number X), we need to multiply X·SQ/M by M/SQ. It follows from (5.65b) and (5.61) that the decoded number is

X = (M/SQ)·|x1·(k1 + 1/m1) + x2·(k2 + 1/m2) + · · · + xn·(kn + 1/mn)|_{SQ}
  = (M·D(X) + x1·M1 + x2·M2 + · · · + xn·Mn)/SQ    (5.66)

Note that the addition of x1·M1 + x2·M2 + · · · + xn·Mn to M·D(X) makes the numerator exactly divisible by SQ.
We consider next an example to illustrate the above reverse conversion technique.
technique.
Example 5.11 Consider the moduli set {m1 = 3, m2 = 5, m3 = 7} and the given residues (x1 = 1, x2 = 4, x3 = 6) corresponding to X = 34. We have M = 105, M1 = 35, M2 = 21, M3 = 15, SQ = 71, k1 = 47, k2 = 14, k3 = 10. We can find D(34) = 21. On the other hand, X·SQ/M would have been D'(34) = 34 × 71/105 = 22.9905, corresponding to scaling by 71/105. The decoded number can be found following (5.66) as

X = (105 × 21 + 1 × 35 + 4 × 21 + 6 × 15)/71 = 34.
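The whole diagonal-function computation of Example 5.11 fits in a few lines. The sketch below (function name ours) derives SQ, the k_i, D(X) and finally X via (5.66):

```python
import math

def decode_diagonal(residues, moduli):
    M = math.prod(moduli)
    Ms = [M // m for m in moduli]
    SQ = sum(Ms)                                    # sum of quotients, (5.56)
    k = [(-pow(m, -1, SQ)) % SQ for m in moduli]    # k_i = |-1/m_i|_SQ, (5.57)
    assert sum(k) % SQ == 0                         # property (5.58)
    D = sum(x * ki for x, ki in zip(residues, k)) % SQ   # D(X), (5.59)
    # (5.66): adding sum x_i*M_i makes the numerator divisible by SQ
    return (M * D + sum(x * Mi for x, Mi in zip(residues, Ms))) // SQ
```

For the moduli {3, 5, 7} the routine reproduces k = (47, 14, 10) and decodes (1, 4, 6) to 34.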
71


An examination of (5.65b) suggests a new approach for RNS to binary conversion, which is considered next. We are adding a multiple of M to p = x1·M1 + x2·M2 + · · · + xn·Mn such that the sum is exactly divisible by SQ. This is what Montgomery's technique [92] does to find (a/b) mod Z. In this technique, we compute the value of s such that (a + sZ) ≡ 0 mod b; note that s is defined as s = |−a·(1/Z)|_b. Thus, we observe that by adding D(X)·M to p = x1·M1 + x2·M2 + · · · + xn·Mn and dividing by SQ, where D(X) = |−p·(1/M)|_{SQ}, we can obtain X:

X = (x1·M1 + x2·M2 + · · · + xn·Mn + M·D(X))/SQ    (5.67)

Note that in this technique, we need not find the various k_i values. We consider the previous example to illustrate this method.

Example 5.12 Consider the moduli set {m1 = 3, m2 = 5, m3 = 7} and the given residues (x1 = 1, x2 = 4, x3 = 6) corresponding to X = 34. We have M1 = 35, M2 = 21, M3 = 15 and SQ = 71. We can find p = x1·M1 + x2·M2 + x3·M3 = 209. Thus, we have D(X) = |−p·(1/M)|_{SQ} = |−209·(1/105)|_{71} = |67 × 48|_{71} = 21. The decoded number can be found following (5.67) as X = (209 + 105 × 21)/71 = 34.
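The k_i-free variant of (5.67) is equally compact. In this sketch (helper name ours), D(X) comes directly from the Montgomery-style quotient |−p·(1/M)|_SQ:

```python
import math

def decode_diagonal_montgomery(residues, moduli):
    M = math.prod(moduli)
    Ms = [M // m for m in moduli]
    SQ = sum(Ms)
    p = sum(x * Mi for x, Mi in zip(residues, Ms))
    D = (-p * pow(M % SQ, -1, SQ)) % SQ   # D(X) = |-p/M|_SQ; no k_i needed
    return (p + M * D) // SQ              # (5.67): division is exact
```

The call with the residues of Example 5.12 again yields 34, and the round trip holds over the full range.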

5.8 Performance of Reverse Converters

The hardware requirements and conversion delays of the various designs for the several moduli sets considered in this chapter are presented in Table 5.1 for three-moduli sets and in Table 5.2 for four- and more-moduli sets. The various multiplicative inverses needed in MRC for moduli sets which use subsets of moduli are presented in Table 5.3. In Table 5.4, the multiplicative inverses needed in CRT are presented. The various multiplicative inverses needed in New CRT II are presented in Table 5.5, and those needed in New CRT I are given in Table 5.6. In Table 5.7, the performance (area in gates in the case of ASIC, or slices in the case of FPGA, together with conversion time and power dissipation) of some state-of-the-art practically implemented reverse converters is presented for both FPGA and ASIC designs. Various dynamic ranges have been considered. It can be observed from this table that conversion times of less than a few ns for dynamic ranges above 100 bits have been demonstrated to be feasible.
It can be seen that MRC, CRT and New CRT perform differently for different moduli sets. Various authors have considered various options, and a design can be chosen based on low area, low conversion time or both. Among the three-moduli sets, in spite of the slight complexity in the modulus channel for (2^n + 1), the moduli set {2^n, 2^n − 1, 2^n + 1} outperforms the rest. It is interesting to note that four-moduli sets with uniform-size moduli appear to be more attractive if two-level MRC is used rather than CRT, contrary to the assumption that CRT-based designs are faster than MRC-based designs. Moduli sets with moduli of bit lengths varying from n to 2n bits appear to perform well if the moduli are properly chosen, enabling realization of a higher dynamic range. These have a linear dependence of area on n, as against four-moduli sets of uniform word length, which have a quadratic dependence on n. The present trend appears to favor moduli of the type 2^x ± 1. Multi-moduli systems investigated more recently appear to be attractive. As will be shown in a later chapter, several multi-moduli systems have been explored for cryptographic applications and need more rigorous investigation.
We present next the detailed design procedure and the development of the implementation architecture of an MRC-based reverse converter for the moduli set {2^n, 2^n − 1, 2^(n+1) − 1}. The MRC technique for this reverse conversion is illustrated in Figure 5.4a. The various multiplicative inverses in this converter, denoted as Converter I, can be computed as follows:

Table 5.1 Area and delay requirements of various converters for three moduli sets
Design Moduli set Hardware requirements Delay
1 [44] {2n  1, 2n, 2n1  1} (12n  8)AHA+(6n  4)AAND (5n  4) τFA
2 [45, 46] {2n  1, 2n, 2n1  1} (17n  13)AHA+(7n  3)AAND (3n + 2)τFA
3 [47] CRT based {2n  1, 2n, 2n1  1} (9n  10)AFA+(3n  1)AINV + 18(2n  2)AROM +2nAHA+(n + 1)AEXNOR (2n + 3)τFA
+(n + 1)AOR
4 [47] MRC based {2n  1, 2n, 2n1  1} (4n  3)AFA+(3n  1))AINV+(3n  4)AEXNOR+(3n  4)AOR (6n  5)τFA
5 [10] {2n  1, 2n, 2n + 1} (6n + 1)AFA+(n + 3)AAND/OR+(n + 1)AXOR/XNOR + 2n A2:1MUX (n + 2)τFA + τMUX
6 [8, 9] {2n  1, 2n, 2n + 1} 4nAFA + 2AAND/OR (4n + 1)τFA
7 CI [14] {2n  1, 2n, 2n + 1} 4nAFA + AHA+ AXOR/XNOR + 2A2:1 MUX (4n + 1)τFA
8 CII [14] {2n  1, 2n, 2n + 1} 6nAFA + AHA + 2AAND/OR + AXNOR/XOR +(2n + 2) A2:1MUX (n + 1)τFA
9 CIII [14] {2n  1, 2n, 2n + 1} 4nAFA + AHA+(2n + 2)AAND/OR+(2n  1)AXNOR/XOR+(2n + 2)A 2:1MUX (n + 1)τFA
10 [12] {2n  1, 2n, 2n + 1} 4nAFA + nAAND/OR (4n + 1)τFA
11 Converter I [50] {2n  1, 2n, 2n+1  1} (4n + 3)AFA+ nAAND/OR + nAXOR/XNOR (6n + 5)τFA
12 Converter II [50] {2n  1, 2n, 2n+1  1} (14n + 21)AFA+(2n + 3)AHA+(2n + 1) A3:1MUX (2n + 7)τFA
13 Converter III [50] {2n  1, 2n, 2n+1  1} (12n + 19)AFA+(2n + 2)AHA + 10(2n + 1)AROM + (2n + 1) A2:1MUX (2n + 7)τFA
14 CE [49] {2n, 22n  1,22n + 1} (3n + 1)AINV + (5n + 2)AFA + (2n  1)AEXOR + (2n  1)AAND + (n  1) (8n + 1)τFA + τinv
AOR + (n  1)AEXNOR
15 HS [49] {2n,22n  1,22n + 1} (3n + 1)AINV + (9n + 2)AFA + (2n  1)AEXOR + (2n  1)AAND + (n  1) (2n + 1)τFA + τinv + τMUX
AOR + (n  1)AEXNOR + 4nA2:1MUX + 2τNAND
16 [21] 0  k  n (2n  1,2n+k,2n + 1} 4nAFA (4n + 2)tFA
17 [22] {22n, 2n  1,2n + 1} (4n + 1)AFA + (n  1)AHA (4n + 1)τFA
Table 5.2 Area and time requirements of RNS to Binary converters used for comparison using four and five moduli sets
Design Moduli set Area A Conversion time T
1 [57] {2n  3, 2n + 1, 2n  1, 2n + 3} (26n + 8) AFA +(2n+5 + 32)nAROM (7n + 8)τFA + 2τROM
2 ROM less CE [58] {2n  3, 2n + 1, 2n  1, 2n + 3} (25.5n + 12 + (5n2/2))AFA + 5nAHA + (18n + 23)τFA
3nAEXNOR + 3nAOR
3 CI HS [58] ROM less {2n  3, 2n + 1, 2n  1, 2n + 3} (37.5n + 28 + (5n2/2)) (12n + 15)τFA
AFA + 5nAHA + 3nAEXNOR + 3nAOR
4 C2 CE [58] MRC with {2n  3, 2n + 1, 2n  1, 2n + 3} (20n + 17)AFA+(3n  4)AHA +2n(5n + 2) 3τROM + (13n + 22)τFA
ROM AROM
5 C2 HS [58] MRC with {2n  3, 2n + 1, 2n  1, 2n + 3} (42n + 61)AFA +(3n  4)AHA + 2n(5n + 2) 3τROM + (7n + 10)τFA
ROM AROM
6 C3 CE [58] CRT with ROM {2n  3, 2n + 1, 2n  1, 2n + 3} (23n + 11)AFA+ (2n  2)AHA + (6n + 4)2n (16n + 14)τFA + τROM
AROM
7 C3 HS [58] CRT with ROM {2n  3, 2n + 1, 2n  1, 2n + 3} (35n + 17) AFA + (2n  2)AHA + (6n + 4)2n (4n + 7)τFA + τROM
AROM
8 [53] Converter I {2n  1, 2n, 2n + 1, 2n+1  1} (9n + 5 + ((n  4)(n + 1)/2))AFA [(23n + 12)/2]τFA
+ 2nAEXNOR + 2nAOR +(6n + 1)AINV
9 [53] Converter I using {2n  1, 2n, 2n + 1, 2n+1  1} (6n + 1)AINV +(8n + 4)AFA (9n + 6)τFA
ROM + 2nAex-Nor + 2nAOR +(n + 1)2n+1AROM
10 [53] Converter 2 {2n  1, 2n, 2n + 1, 2n+1 + 1} (6n + 7)AINV +(n2 + 12n + 12)AFA + 2n (16n + 22)τFA
AEXNOR + 2nAOR + (4n + 8)A2:1MUX
11 [53] Converter2 using {2n  1, 2n, 2n + 1, 2n+1 + 1} (5n + 6)AINV +(9n + 10)AFA + 2n (11n + 14)τFA
ROM AEXNOR + 2nAOR + (2n + 2)A2:1MUX
+ (n + 2)2n+2AROM
12 [55] {2n  1, 2n, 2n + 1, 2n+1  1} (10n + 6 + (n  4)(n + 1)/2)AFA+(6n + 2) ((15n + 22)/2) τFA
AEXNOR + (6n + 2)AOR+(7n + 2)AINV+(n
+ 3)AMUX2:1 + (2n + 1)AMUX3:1
13 [56] {2n  1, 2n, 2n + 1, 2n+1 + 1} (2n2 + 11n + 3)AFA (11.5n + 2log2n + 2.5)τFA
14 [51] {2n  1, 2n, 2n + 1, 2n+1  1} (37n + 14)AFA (14n + 8)τFA
15 [52] {2n  1, 2n, 2n + 1, 2n+1 + 1} (58n + 23 + log2(c + 1))AFA + 36nAROM (24n + 17 + log2(c + 1))τFA{{
16 CE [62] {2n  1,2n,2n + 1,22n + 1} (11n + 6)AFA+ (2n  1)AEXOR + (2n  1) (8n + 3)τFA
AAND + 4nAEXNOR + 4nAOR + (6n  1)AINV
17 HS [62] {2n  1,2n,2n + 1,22n + 1} (15n + 6)AFA+(2n  1)AEXOR+(2n  1)AAND (2n + 3)τFA
+4nAEXOR + 4nAOR+(6n  1)AINV
+4nA2 :1MUX
18 HS [63, 67] {2n1  1,2n  1,2n,2n + 1, 2n+1  1} [{(5n2 + 43n + m)/6} + 16n  1]AFA+(6n + 1) (18n + l + 7) τFA{
AINVa
19 [63] {2n  1,2n,2n + 1, 22n+1  1} (8n + 2)AFA+(n  1)AXOR+(n  1)AAND +(4n (12n + 5)tFA + 3tNOT + tMUX
+ 1) AXNOR+(4n + 1)AOR+(7n + 1)
ANOT + nAMUX2:1
20 [54] 4-stage CE {2n  1, 2n, 2n + 1, 2n+1  1} (n2/2 + 7n/2 + 7n + 4)AFA + AHA+(3n + 2) (11n + l + 8)τFAa
AINV + 2A2 :1MUX
21 [54] 3-stage CE {2n  1, 2n, 2n + 1, 2n+1  1} (n2 + 10n + 3)AFA + AHA+(3n + 2)AINV (9n + m + 6)τFAa
+ 2A2 :1MUX
22 [54] 4-stage CE {2n  1, 2n, 2n + 1, 2n1  1} (n2/2 + 3n/2 + 7n  3)AFA + AHA+(5n + 1) (11n + l  1)τFAb
AINV +2A2:1MUX+ (2n  8)
AXNOR + 6AAND + (2n  8)AOR +6AXOR
23 [54] 3-stage CE {2n  1, 2n, 2n + 1, 2n1  1} (n2 + 7n  2)AFA + AHA+(5n + 1)AINV (9n + m + 1)τFAb
+ 2A2 :1MUX + (2n  8)
AXNOR + 6AAND+(2n  8)AOR +6AXOR
24 [54] 4-stage HS {2n  1, 2n, 2n + 1, 2n1  1} (n2/2 + 3n/2 + 11n  5)AFA + AHA+(5n + 1) (5n + l + 3)τFAb
AINV +4n A2:1MUX+(2n  8)
AXNOR + 9AAND+(2n  2)AOR + 6AXOR
25 [54] 3-stage HS {2n  1, 2n, 2n + 1, 2n1  1} (n2 + 10n  3)AFA + AHA+(5n + 1)AINV + (3n (4.5n + m + 3)τFAb
+ 1)A2 :1MUX + (2n  8)
AXNOR + 8AAND + (2n  4)AOR +6AXOR
26 [63] {2n  1,2n + 1, 22n, 22n + 1} (10n + 6)AFA+(4n  3)AXOR+(4n  3) (8n + 3)tFA + tNOT
AAND + (2n  3) AXNOR+(2n  3)AOR+(6n
+ 3)ANOT
27 [65] {2n  1,2n + 1, 2n, 22n+1  1} (13n + 2)AFA (8n + 1) τFA
28 [65] {2n  1,2n + 1, 22n, 22n+1  1} (16n + 1)AFA (8n + 2) τFA
29 [63, 68] {2n  1,2n,2n + 1, 2n  2(n+1)/2 + 1,2n 19nAFA + 7nAXOR + 7nAAND + 2nAXNOR + 2n (8n + 4)tFA + tNOT
+ 2(n+1)/2 + 1} AOR + 4nANOT
30 [72, 73] {2n5  1, 2n3  1, 2n3 + 1, 2n2 (66n2  87n  15)AFA [46n  42 + 2log2(2n  6) +
+ 1, 2n1  1, 2n1 + 1, 2n,2n + 1} 2log2(4n  12) + 4log2n +
2log2(4n  1)]tFA
31 [73] {22n,2n  1, 2n + 1, 2n  36n (8n + 5)tFA
2(n+1)/2 + 1, 2n + 2(n+1)/2 + 1}
32 [73] {2n,2n  1, 2n + 1, 2n  2(n+1)/2 + 1, 2n 28n + (n  1)(11 + (n/2)) 10n  3 + 2(3 + log2((n/2) + 4))
+ 2(n+1)/2 + 1, 2n1 + 1}
33 [73] {22n,2n  1, 2n + 1, 2n  36n + (n  1)(12 + (n/2)) 10n + 3 + 2(3 + log2((n/2) + 4))
2(n+1)/2 + 1, 2n + 2(n+1)/2 + 1,
2n1 + 1}
34 [73] {23n,2n  1, 2n + 1, 2n  28n + (n  1)(13 + (n/2)) 10n + 2 + 2(3 + log2((n/2) + 4))
2(n+1)/2 + 1, 2n + 2(n+1)/2 + 1,
2n1 + 1}
35 [63, 64] {2n  1, 2n + 1, 22n  2, 22n+1  3} (28n + 9)AFA+(9n + 4)ANOT + 3(2n)AMUX2:1 (14n + 10)tFA
36 [74] {2k, 2n  1, 2n + 1, 2n+1  1} ((n2 + 27n)/2 + 2) AFA +(2n + k + 2)AINV (11n + l + 10)tFAc
37 [60] Version1 {2n + 1, 2n, 2n  1, 2n1 + 1} n odd ((n2  13)/2 + 13n)AFA + AMUX + nAOR (10n + log1.5(n  3)/2 + 5)
+ (((n2 + 3)/4) + 6n)AINV DFA + DOR + 2DINV + DMUX
38 [60] Version2 {2n + 1, 2n, 2n  1, 2n1 + 1} n odd ((n2)/2 + 13n)AFA + 3AXOR + 2AAND + (3n + 4) (8n + log1.5(n  5)/2 + 5)
AINV DFA + 2DXOR + 2DINV
39 [60] Version3 {2n + 1, 2n, 2n  1, 2n1 + 1} n odd (2n2 + 10n)AFA + 3AXOR + 2AAND +(3n + 4) 8n + log1.5(n  1)/2 + 2)DFA +
AINV 2DXOR + 2DINV
k      l m
40 [75] Version1 {2k, 2n  1, 2n + 1,2n1  1}, n even 6n  2 þ 2n 2n þ ðn  1Þ 2nþk
n1  1 þ
k
 n  9n þ þ 2nþk
nþ1 þ
ðn  1ÞΩ 2log2 2  1  2n
n 
θ 2logp2ffiffi 2  1
k   k   
41 [75] Version2 {2k, 2n  1, 2n + 1,2n1  1} 7n  3 þ 2n 2n þ ðn  1Þ 3 þ n1 þ k
 n  9n þ 3 þ þ
ðn  1ÞΩ 2log2 2  1  2n
n 
θ 2logpffiffi2 2  1
k l m     
42 [75] Version1 {2k, 2n  1, 2n + 1,2 n+1  1} 6n þ 2 þ 2n 2n þ ðn þ 1Þ 2nþk k 2n þ k
nþ1  1 þ 9n þ 6 þ þ þ
   2n nþ1
ðn þ 1ÞΩ 2log2 n2  n
θ 2logpffiffi2 2
k  l m   
k
43 [75] version2 {2k, 2n  1, 2n + 1,2 n+1  1} k
7n þ 3 þ 2n 2n þ ðn þ 1Þ 2 þ nþ1 þ 9n þ 8 þ 2n þ θ 2logp2ffiffi n2
 n
ðn þ 1ÞΩ 2log2 2

Note: {{ c is Lk and L ¼ (2n21) (2n+11) and K < 26n
{
m ¼ n  4, 9n  12 and 5n  8 for n ¼ 6k  2, 6k and 6k + 2 respectively, l number of levels in CSA tree with ((n/2) + 1) inputs
a
l and m are the number of levels in CSA tree of n/2+1 inputs and n + 2 inputs respectively
b
l and m are the number of levels in CSA tree of n/2 inputs and n inputs respectively
c
number of levels in (n/2) + 1 CSA tree

Table 5.3 Multiplicative inverses of four and five moduli sets using subsets
Moduli set (A,B) (1/A) mod B
 nþ2 
{P, 2n+1  1) 2  10 =3
{P, 2n+1 + 1} 2n + 2n2 + . . . + 2n2k +24  2 till n  2k ¼ 5
for n 5; (14 for n ¼ 3).
 n 
{P, 2n1  1} 2 þ 2n2  2 =3
{25n  2n, 2n+1 + 1} [73] X4
n11

 24iþ8  23 22  21  20
i¼0
{25n  2n, 2n1 + 1} [73] X
4
n9

2n2 þ 24iþ10 þ 26 þ23


i¼0
{26n  22n, 2n1 + 1}[73] X
4
n9

2n3  24iþ9  25 22


i¼0
{26n  22n, 2n+1 + 1} [73] X
4
n11

24iþ9 þ 24 þ 23 þ 22 þ21
i¼0
{27n  23n, 2n+1 + 1} [73] X4
n11

 24iþ10  25 24  23  22
i¼0
{27n  23n, 2n1 + 1} [73] X
4
n9

2n4 þ 24iþ8 þ 24 þ21


i¼0
n1
{P0 , 2  1} [67] 2 
 n ¼ 6k  2, k ¼ 1, 2, 3 . . . , k0 ¼ 2n3 þ 2n4  1 ,
1 9
k0 ¼
P0 2n1 1 1 
n ¼ 6k, k ¼ 1, 2, 3 . . . , k0 ¼ 2n2 þ 2nþ1  5 ,
9
8 
n ¼ 6k þ 2, k ¼ 1, 2, 3 . . . , k0 ¼ 2n2 þ 2n2  1
9
Note: P = {2^n, 2^n − 1, 2^n + 1}, P' = {2^n, 2^n − 1, 2^n + 1, 2^(n+1) − 1}

 
XA = |1/2^n|_{2^(n+1)−1} = 2    (5.68a)

XB = |1/2^n|_{2^n−1} = 1    (5.68b)

XC = |1/(2^n − 1)|_{2^(n+1)−1} = −2    (5.68c)
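These three inverses are independent of n and are easily confirmed by a quick numerical check (ours, not part of the original design flow):

```python
def mrc_inverses(n):
    mA = 2 ** (n + 1) - 1
    mB = 2 ** n - 1
    XA = pow(2 ** n, -1, mA)        # (5.68a): |1/2^n|_{2^(n+1)-1}
    XB = pow(2 ** n % mB, -1, mB)   # (5.68b): |1/2^n|_{2^n-1}
    XC = pow(mB, -1, mA)            # (5.68c): |1/(2^n-1)|_{2^(n+1)-1}
    return XA, XB, XC
```

For every n, this returns XA = 2, XB = 1 and XC = 2^(n+1) − 3, i.e. −2 mod (2^(n+1) − 1).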

The implementation of the MRC algorithm of Figure 5.4a using the various multiplicative inverses given in (5.68a–c) follows the architecture given in Figure 5.4b. The various modulo subtractors can make use of the well-known properties of 2^x mod m_i. The subtraction (r2 − r3) mod (2^n − 1) can be realized by one's-complementing r3 and adding it to r2 using a mod (2^n − 1) adder in the block MODSUB1. The mixed radix digit UB is thus the already available (r2 − r3)

Table 5.4 Multiplicative inverses |1/Mi|_{mi} for use in CRT for various moduli sets
(1/M2) mod
Modulo set (1/M1) mod m1 m2 (1/M3) mod m3 (1/M4) mod m4 (1/M5) mod m5
{2n  1, 2n, 2n1  1} 2n  3 2n1 + 1 2n2 – –
{2n  1, 2n, 2n + 1} 2n1 2n  1 2n1 + 1 – –
{2n  1, 2n, 2n+1  1} 1 1 (4)mod (2n+1  1) – –
{2n,22n  1,22n + 1} 2n1 2n1 2n1 – –
{2n  1,2n+k,2n + 1} 2nk1 2n+k  1 2nk1 – –
{2n  1,22n,2n + 1} 2n1 (1) mod 22n 2n1 – –
{2n  1, 2n, 2n + 1, 2n+1 + 1} [52] 2n1 + 2/3(2n1  1) 2n  1 2n1 2n + 3 + (2n + 1)/3 –
{2n  1,2n,2n + 1,2n+1  1} [51] 2n1 1 2n  (2n1  2)/3 2n+1  4  (2n+1  2)/3 –
P00 ¼ {2n,2n  1, 2n + 1, 2n  2(n+1)/2 + (1) mod m1 2n2 (2n2) mod m3 2(n5)/2 (2(n5)/2) mod m5
1,2n + 2(n+1)/2 + 1} [68]
P00 ¼ {2n+1,2n  1, 2n + 1, 2n  2(n+1)/2 + (1) mod m1 2n3 (2n3) mod m3 2(n7)/2 (2(n7)/2) mod m5
1,2n + 2(n+1)/2 + 1} [68]
{22n,2n  1, 2n + 1, 2n  2(n+1)/2 + 1, 2n + (1) mod m1 2n2 2n2 (2n2 + 2(n5)/2) mod m4 (2n2  2(n5)/2) mod m4
2(n+1)/2 + 1} [73]
{23n,2n  1, 2n + 1, 2n  2(n+1)/2 + 1, 2n + (1) mod m1 2n2 (2n2) mod m3 (2(n5)/2) mod m4 2(n5)/2
2(n+1)/2 + 1, 2n1 + 1} [73]
5 RNS to Binary Conversion
5.8 Performance of Reverse Converters 125
Table 5.5 Multiplicative inverses for MRC for various moduli sets using New CRT II

Moduli set {m1, m2, m3, m4} | (1/m1) mod m2 | (1/m4) mod m3 | (1/(m3m4)) mod (m1m2)
{2^n − 1, 2^n + 1, 2^(2n) − 2, 2^(2n+1) − 3} [64] | 2^(n−1) | 1 | 1
{2^n + 1, 2^n − 1, 2^n, 2^(n+1) + 1} [56] | 2^(n−1) | 1 | ((2^(2n+2) − 2^n − 2)/3) mod (2^(2n) − 1)
{2^n + 1, 2^n − 1, 2^((1+α)n), 2^(2n+1) − 1} [65] | 2^(n−1) | 2^((1+α)n) − 1 | 2^n for α = 0, 1 for α = 1
{2^n + 1, 2^n − 3, 2^n − 1, 2^n + 3} [57, 58] | 3·2^(n−2) − 2 | 2^(n−2) | (1/2^(n+2)) mod (2^(2n) − 2^(n+1) − 3)
{2^n + 1, 2^n − 1, 2^n, 2^(2n+1) − 1} [63] | 2^(n−1) | 2^n − 1 | 2^n
Table 5.6 Multiplicative inverses for New CRT I for various moduli sets

Moduli set | (1/m1) mod k1 | (1/(m1m2)) mod k2 | (1/(m1m2m3)) mod m4
{2^n − 1, 2^n, 2^(n−1) − 1} [45] | 2^(2n−2) − 2^n − 2^(n−2) + 2 | 2^(n−2) | –
{2^n, 2^n + 1, 2^(2n) + 1, 2^n − 1} [62] | 2^(3n) | 2^(3n−2) + 2^(2n−1) − 2^(n−2) | 2^(n−2)
{2^(2n), 2^(2n) + 1, 2^n + 1, 2^n − 1} [63] | 2^(2n) | 2^(2n−1) | 2^(n−2)
{2^n, 2^n + 1, 2^n − 1} [14] | 2^n | 2^(n−1) | –

k1 = m2m3 for the three-moduli sets and m2m3m4 for the four-moduli sets
k2 = m3 for the three-moduli sets and m3m4 for the four-moduli sets
mod (2^n − 1), since X_B = 1. The subtraction (r1 − r3) mod (2^(n+1) − 1) involves
two numbers of different word lengths: r1 of (n + 1) bits and r3 of n bits. By
appending a most significant bit (MSB) of zero, r3 can be treated as an
(n + 1)-bit word. The one's complement of this word can then be added to r1
using an (n + 1)-bit modulo adder in the block MODSUB2.
Next, U_A can be obtained by circularly left shifting the already obtained
(r1 − r3) mod (2^(n+1) − 1) by 1 bit. The computation of (U_A − U_B) mod
(2^(n+1) − 1) is carried out in the block MODSUB3 exactly as explained for
(r1 − r3) mod (2^(n+1) − 1), since U_A is (n + 1) bits wide and U_B is n bits wide.
Next, the multiplication of (U_A − U_B) mod (2^(n+1) − 1) by X_C = −2 to obtain
U_C [see (5.68c)] is carried out by first circularly left shifting
(U_A − U_B) mod (2^(n+1) − 1) by 1 bit and then one's complementing the bits of
the result. The last stage in the converter shall compute
B = U_C (2^n − 1) 2^n + U_B 2^n + r3    (5.69)
Table 5.7 Performance of reverse converters in ASIC/FPGA state-of-the-art implementations

Converter | DR (bits) | Area (gates/slices) | Conversion delay (ns) | Power dissipation (mW)
[67] area optimized, n = 28 | 5n (139) | 21,136 gates | 43.13 | 174.8
[67] timing optimized, n = 28 | 5n (139) | 41,936 gates | 27.86 | 392.1
[67] area optimized, n = 20 | 5n (99) | 11,479 gates | 39.09 | 110.6
[67] timing optimized, n = 20 | 5n (99) | 26,154 gates | 22.35 | 278.2
[56] 65 nm, n = 17 | 4n (69) | 17.9K μm^2 | 0.9 | 71.8 pJ
[56] Virtex 5, n = 17 | 4n (69) | 650 slices | 23.6 | 8.6 nJ
[71] n = 15 | 6n (90) | 223K μm^2 | 6.58 |
[71] n = 7 | 27n (189) | 579K μm^2 | 7.96 |
[71] n = 15 | 6n (90) | 6K μm^2 | 2.06 |
[71] n = 21 | 6n (126) | 9K μm^2 | 2.22 |
[71] n = 11 | 8n − 1 (89) | 16K μm^2 | 3.3 |
[71] n = 23 | 8n − 1 (185) | 36K μm^2 | 4.22 |
[65] 65 nm, 6n DR, n = 16 | 96 | 9.3K μm^2 | 0.63 |
[65] 65 nm, 5n DR, n = 16 | 80 | 8.7K μm^2 | 0.63 |
[65] FPGA, 5n DR, n = 16 | 80 | 502 slices | 14.1 |
[65] FPGA, 6n DR, n = 16 | 96 | 789 slices | 15. |
where U_C, U_B and r3 are the mixed radix digits. Note, however, that since the
least significant bits of B are given by r3, we need to compute only B′ (the
(2n + 1) MSBs of B):

B′ = (B − r3)/2^n    (5.70)

From (5.69) and (5.70), we have

B′ = U_C (2^n − 1) + U_B    (5.71)
Denoting U_C and U_B, of word lengths (n + 1) bits and n bits respectively, as
u_cn u_c(n−1) ... u_c1 u_c0 and u_b(n−1) u_b(n−2) ... u_b1 u_b0, the three
operands to be added to obtain B′ are shown in the bit matrix of Figure 5.4c,
together with a least significant bit (LSB) of "1". Note that the primes
indicate inverted bits. These three words can be simplified into two
(2n + 1)-bit words, since the first and second words have zeroes in
complementary bit positions and can be merged into a single (2n + 1)-bit word.
These two words can be added using a (2n + 1)-bit CPA (CPA1 in Figure 5.4b).
Since the bits in one operand being added are "1", full adders can be replaced
by pairs of exclusive-NOR (EXNOR) and OR gates. The modulo adders can be
realized using one's-complement adders (CPA with end-around carry) or by using
the special designs described in Chapter 2.
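The complete datapath can be checked against a short behavioural Python model of equations (5.68a)–(5.69); the moduli assignment (r1, r2, r3) ↔ (2^(n+1) − 1, 2^n − 1, 2^n) follows Figure 5.4a (this is a sketch of the arithmetic, not of the gate-level structure):

```python
def mrc_reverse(r1, r2, r3, n):
    # moduli: m1 = 2^(n+1) - 1 (residue r1), m2 = 2^n - 1 (r2), m3 = 2^n (r3)
    m1, m2 = (1 << (n + 1)) - 1, (1 << n) - 1
    u_b = (r2 - r3) % m2             # X_B = 1, eq. (5.68b)
    u_a = ((r1 - r3) * 2) % m1       # X_A = 2, eq. (5.68a)
    u_c = ((u_a - u_b) * -2) % m1    # X_C = -2, eq. (5.68c)
    return u_c * m2 * (1 << n) + u_b * (1 << n) + r3   # eq. (5.69)
```

For n = 3 (moduli {15, 7, 8}) and X = 100, the residues are (10, 2, 4) and the model returns 100.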
Figure 5.4 (a) Mixed radix conversion flow chart, (b) architecture of the implementation of (a), and (c) bit matrix for computing B′ (Adapted from [50] ©IEEE 2007). [Figure: (a) residues r1, r2, r3 for the moduli 2^(n+1) − 1, 2^n − 1, 2^n; r3 is subtracted from r1 and r2; the differences are multiplied by X_A = (1/2^n) mod (2^(n+1) − 1) and X_B = (1/2^n) mod (2^n − 1) to give U_A and U_B; then U_C = (U_A − U_B)·(1/(2^n − 1)) mod (2^(n+1) − 1). (b) Blocks MODSUB1 and MODSUB2 (modulo subtractors), a 1-bit rotation, MODSUB3, a rotation followed by one's complement, and a bit-mapping stage feeding CPA1, which outputs the (2n + 1)-bit B′. (c) Bit matrix:
0 0 ... 0 0 u_b(n−1) u_b(n−2) ... u_b1 u_b0
u_cn u_c(n−1) ... u_c1 u_c0 0 0 ... 0 0
1 1 ... 1 u′_cn u′_c(n−1) u′_c(n−2) ... u′_c1 u′_c0]
The hardware requirements for this converter are thus n A_FA for MODSUB1,
(n + 1) A_FA each for MODSUB2 and MODSUB3, and (n + 1) A_FA + n A_XNOR + n A_OR
for CPA1. The total hardware requirement and conversion time are presented in
Table 5.1 (entry 11).

References

1. N.S. Szabo, R.I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology
(McGraw-Hill, New York, 1967)
2. G. Bi, E.V. Jones, Fast conversion between binary and Residue Numbers. Electron. Lett. 24,
1195–1197 (1988)
3. P. Bernardson, Fast memory-less over 64-bit residue to binary converter. IEEE Trans. Circuits
Syst. 32, 298–300 (1985)
4. S. Andraros, H. Ahmad, A new efficient memory-less residue to binary converter. IEEE Trans.
Circuits Syst. 35, 1441–1444 (1988)
5. K.M. Ibrahim, S.N. Saloum, An efficient residue to binary converter design. IEEE Trans.
Circuits Syst. 35, 1156–1158 (1988)
6. A. Dhurkadas, Comments on “An efficient Residue to Binary converter design”. IEEE Trans.
Circuits Syst. 37, 849–850 (1990)
7. P.V. Ananda Mohan, D.V. Poornaiah, Novel RNS to binary converters, in Proceedings of
IEEE ISCAS, pp. 1541–1544 (1991)
8. S.J. Piestrak, A high-Speed realization of Residue to Binary System conversion. IEEE Trans.
Circuits Syst. II 42, 661–663 (1995)
9. A. Dhurkadas, Comments on “A High-speed realisation of a residue to binary Number system
converter”. IEEE Trans. Circuits Syst. II 45, 446–447 (1998)
10. M. Bhardwaj, A.B. Premkumar, T. Srikanthan, Breaking the 2n-bit carry propagation barrier in
Residue to Binary conversion for the [2n-1, 2n, 2n+1] moduli set. IEEE Trans. Circuits Syst. II
45, 998–1002 (1998)
11. R. Conway, J. Nelson, Fast converter for 3 moduli RNS using new property of CRT. IEEE
Trans. Comput. 48, 852–860 (1999)
12. Z. Wang, G.A. Jullien, W.C. Miller, An improved Residue to Binary Converter. IEEE Trans.
Circuits Syst. I 47, 1437–1440 (2000)
13. P.V. Ananda Mohan, Comments on “Breaking the 2n-bit carry propagation barrier in Residue
to Binary conversion for the [2n-1, 2n, 2n+1] moduli set”. IEEE Trans. Circuits Syst. II 48, 1031
(2001)
14. Y. Wang, X. Song, M. Aboulhamid, H. Shen, Adder based residue to binary number converters
for (2n-1, 2n, 2n+1). IEEE Trans. Signal Process. 50, 1772–1779 (2002)
15. W. Wang, M.N.S. Swamy, M.O. Ahmad, Y. Wang, A study of the residue-to-Binary con-
verters for the three moduli sets. IEEE Trans. Circuits Syst. I 50, 235–243 (2003)
16. B. Vinnakota, V.V.B. Rao, Fast conversion techniques for Binary to RNS. IEEE Trans.
Circuits Syst. I 41, 927–929 (1994)
17. P.V. Ananda Mohan, Evaluation of Fast Conversion techniques for Binary-Residue Number
Systems. IEEE Trans. Circuits Syst. I 45, 1107–1109 (1998)
18. D. Gallaher, F.E. Petry, P. Srinivasan, The digit parallel method for Fast RNS to weighted
number System conversion for specific moduli (2k-1, 2k, 2k+1). IEEE Trans. Circuits Syst. II
44, 53–57 (1997)
19. P.V. Ananda Mohan, On “The Digit Parallel method for fast RNS to weighted number system
conversion for specific moduli (2k-1, 2k, 2k+1)”. IEEE Trans. Circuits Syst. II 47, 972–974
(2000)
20. A.S. Ashur, M.K. Ibrahim, A. Aggoun, Novel RNS structures for the moduli set {2n-1, 2n, 2n
+1} and their application to digital filter implementation. Signal Process. 46, 331–343 (1995)
21. R. Chaves, L. Sousa, {2n+1, 2n+k, 2n-1}: a new RNS moduli set extension, in Proceedings of
Euro Micro Systems on Digital System Design, pp. 210–217 (2004)
22. A. Hiasat, A. Sweidan, Residue-to-binary decoder for an enhanced moduli set. Proc. IEE
Comput. Digit. Tech. 151, 127–130 (2004)
23. M.A. Soderstrand, C. Vernia, J.H. Chang, An improved residue number system digital to
analog converter. IEEE Trans. Circuits Syst. 30, 903–907 (1983)
24. T.V. Vu, Efficient implementations of the Chinese remainder theorem for sign detection and
residue decoding. IEEE Trans. Comput. 34, 646–651 (1985)
25. G.C. Cardarilli, M. Re, R. Lojacano, G. Ferri, A systolic architecture for high-performance
scaled residue to binary conversion. IEEE Trans. Circuits Syst. I 47, 1523–1526 (2000)
26. G. Dimauro, S. Impedevo, R. Modugno, G. Pirlo, R. Stefanelli, Residue to binary conversion
by the “Quotient function”. IEEE Trans. Circuits Syst. II 50, 488–493 (2003)
27. J.Y. Kim, K.H. Park, H.S. Lee, Efficient residue to binary conversion technique with rounding
error compensation. IEEE Trans. Circuits Syst. 38, 315–317 (1991)
28. C.H. Huang, A fully parallel Mixed-Radix conversion algorithm for residue number applica-
tions. IEEE Trans. Comput. 32, 398–402 (1983)
29. Y. Wang, Residue to binary converters based on New Chinese Remainder theorems. IEEE
Trans. Circuits Syst. II 47, 197–205 (2000)
30. P.V. Ananda Mohan, Comments on “Residue-to-Binary Converters based on New Chinese
Remainder Theorems”. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 47, 1541
(2000)
31. D.F. Miller, W.S. McCormick, An arithmetic free Parallel Mixed-Radix conversion algorithm.
IEEE Trans. Circuits Syst. II 45, 158–162 (1998)
32. A. Garcia, G.A. Jullien, Comments on "An Arithmetic Free Parallel Mixed-Radix
Conversion Algorithm". IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 46,
1259–1260 (1999)
33. H.M. Yassine, W.R. Moore, Improved Mixed radix conversion for residue number system
architectures. Proc. IEE Part G 138, 120–124 (1991)
34. S. Bi, W.J. Gross, The Mixed-Radix Chinese Remainder Theorem and its applications to
Residue comparison. IEEE Trans. Comput. 57, 1624–1632 (2008)
35. A. Skavantzos, Y. Wang, New efficient RNS-to-weighted decoders for conjugate pair moduli
residue number Systems, in Proceedings of 33rd Asilomar Conference on Signals, Systems and
Computers, vol. 2, pp. 1345–1350 (1999)
36. Y. Wang, New Chinese remainder theorems, in Proceedings of 32nd Asilomar Conference on
Signals, Systems and Computers, pp. 165–171 (1998)
37. A.B. Premkumar, An RNS to binary converter in 2n-1, 2n, 2n+1 moduli set. IEEE Trans.
Circuits Syst. II 39, 480–482 (1992)
38. A.B. Premkumar, M. Bhardwaj, T. Srikanthan, High-speed and low-cost reverse converters for
the (2n-1, 2n, 2n+1) moduli set. IEEE Trans. Circuits Syst. II 45, 903–908 (1998)
39. Y. Wang, M.N.S. Swamy, M.O. Ahmad, Residue to binary converters for three moduli sets.
IEEE Trans. Circuits Syst. II 46, 180–183 (1999)
40. K.A. Gbolagade, G.R. Voicu, S.D. Cotofana, An efficient FPGA design of residue-to-binary
converter for the moduli set {2n+1, 2n, 2n-1}. IEEE Trans. Very Large Scale Integr. (VLSI)
Syst. 19, 1500–1503 (2011)
41. A.B. Premkumar, An RNS to binary converter in a three moduli set with common factors.
IEEE Trans. Circuits Syst. II 42, 298–301 (1995)
42. A.B. Premkumar, Corrections to “An RNS to Binary converter in a three moduli set with
common factors”. IEEE Trans. Circuits Syst. II 51, 43 (2004)
43. K.A. Gbolagade, S.D. Cotofana, A residue-to-binary converter for the {2n+2, 2n+1, 2n}
moduli set, in Proceedings of 42nd Asilomar Conference on Signals, Systems Computers,
pp. 1785–1789 (2008)
44. A.A. Hiasat, H.S. Abdel-Aty-Zohdy, Residue to Binary arithmetic converter for the moduli set
(2k, 2k-1, 2k-1-1). IEEE Trans. Circuits Syst. II 45, 204–209 (1998)
45. W. Wang, M.N.S. Swamy, M.O. Ahmad, Y. Wang, A high-speed residue-to-binary converter
for the three-moduli {2k, 2k-1, 2k-1-1} RNS and a scheme for its VLSI implementation. IEEE Trans.
Circuits Syst. II 47, 1576–1581 (2000)
46. W. Wang, M.N.S. Swamy, M.O. Ahmad, Y. Wang, A note on "A high-speed residue-to-binary
converter for the three-moduli {2k, 2k-1, 2k-1-1} RNS and a scheme for its VLSI
implementation". IEEE Trans. Circuits Syst. II 49, 230 (2002)
47. P.V. Ananda Mohan, New residue to Binary converters for the moduli set {2k, 2k-1, 2k-1-1},
IEEE TENCON, doi:10.1109/TENCON.2008.4766524 (2008)
48. P.V. Ananda Mohan, Reverse converters for the moduli sets {22n-1, 2n, 22n+1} and {2n-3, 2n
+1, 2n-1, 2n+3}, in SPCOM, Bangalore, pp. 188–192 (2004)
49. P.V. Ananda Mohan, Reverse converters for a new moduli set {22n-1, 2n, 22n+1}. CSSP 26,
215–228 (2007)
50. P.V. Ananda Mohan, RNS to binary converter for a new three moduli set {2n+1 -1, 2n, 2n-1}.
IEEE Trans. Circuits Syst. II 54, 775–779 (2007)
51. A.P. Vinod, A.B. Premkumar, A residue to Binary converter for the 4-moduli superset {2n-1,
2n, 2n+1, 2n+1-1}. JCSC 10, 85–99 (2000)
52. M. Bhardwaj, T. Srikanthan, C.T. Clarke, A reverse converter for the 4 moduli super set {2n-1,
2n, 2n+1, 2n+1+1}, in IEEE Conference on Computer Arithmetic, pp. 168–175 (1999)
53. P.V. Ananda Mohan, A.B. Premkumar, RNS to Binary converters for two four moduli sets {2n-
1, 2n, 2n+1, 2n+1-1} and {2n-1, 2n, 2n+1, 2n+1+1}. IEEE Trans. Circuits Syst. I 54, 1245–1254
(2007)
54. B. Cao, T. Srikanthan, C.H. Chang, Efficient reverse converters for the four-moduli sets {2n-1,
2n, 2n+1, 2n+1-1} and {2n-1, 2n, 2n+1, 2n-1-1}. IEE Proc. Comput. Digit. Tech. 152, 687–696
(2005)
55. M. Hosseinzadeh, A. Molahosseini, K. Navi, An improved reverse converter for the moduli set
{2n+1, 2n-1, 2n, 2n+1-1}. IEICE Electron. Exp. 5, 672–677 (2008)
56. L. Sousa, S. Antao, R. Chaves, On the design of RNS reverse converters for the four-moduli set
{2n+1, 2n-1, 2n, 2n+1+1}. IEEE Trans. VLSI Syst. 21, 1945–1949 (2013)
57. M.H. Sheu, S.H. Lin, C. Chen, S.W. Yang, An efficient VLSI design for a residue to binary
converter for general balance moduli (2n-3, 2n-1, 2n+1, 2n+3). IEEE Trans. Circuits Syst. Exp.
Briefs 51, 52–55 (2004)
58. P.V. Ananda Mohan, New Reverse converters for the moduli set {2n-3, 2n + 1, 2n-1, 2n + 3}.
AEU 62, 643–658 (2008)
59. G. Jaberipur, H. Ahmadifar, A ROM-less reverse converter for moduli set {2q  1, 2q  3}.
IET Comput. Digit. Tech. 8, 11–22 (2014)
60. P. Patronik, S.J. Piestrak, Design of Reverse Converters for the new RNS moduli set {2n+1, 2n,
2n-1, 2n-1+1} (n odd). IEEE Trans. Circuits Syst. I 61, 3436–3449 (2014)
61. L.S. Didier, P.Y. Rivaille, A generalization of a fast RNS conversion for a new 4-Modulus
Base. IEEE Trans. Circuits Syst. II Exp. Briefs 56, 46–50 (2009)
62. B. Cao, C.H. Chang, T. Srikanthan, An efficient reverse converter for the 4-moduli set {2n-1,
2n, 2n+1, 22n+1} based on the new Chinese Remainder Theorem. IEEE Trans. Circuits Syst. I
50, 1296–1303 (2003)
63. A.S. Molahosseini, K. Navi, C. Dadkhah, O. Kavehei, S. Timarchi, Efficient reverse converter
designs for the new 4-moduli sets {2n-1, 2n, 2n+1, 22n+1-1} and {2n-1, 2n+1, 22n, 22n+1} based
on new CRTs. IEEE Trans. Circuits Syst. I 57, 823–835 (2010)
64. W. Zhang, P. Siy, An efficient design of residue to binary converter for the moduli set {2n-1, 2n
+1, 22n-2, 22n+1-3} based on new CRT II. Elsevier J. Inf. Sci. 178, 264–279 (2008)
65. L. Sousa, S. Antao, MRC based RNS reverse converters for the four-moduli sets {2n+1,2n-1,2n,
22n+1-1} and {2n+1,2n-1,22n, 22n+1-1}. IEEE Trans. Circuits Syst. II 59, 244–248 (2012)
66. N. Stamenkovic, B. Jovanovic, Reverse Converter design for the 4-moduli set {2n-1,2n,2n+1,
22n+1-1} based on the Mixed-Radix conversion. Facta Universitat (NIS) SER: Elec. Energy 24,
91–105 (2011)
67. B. Cao, C.H. Chang, T. Srikanthan, A residue to binary converter for a New Five-moduli set.
IEEE Trans. Circuits Syst. I 54, 1041–1049 (2007)
68. A.A. Hiasat, VLSI implementation of new arithmetic residue to binary decoders. IEEE Trans.
Very Large Scale Integr. (VLSI) Syst. 13, 153–158 (2005)
69. A. Skavantzos, T. Stouraitis, Grouped-moduli residue number systems for Fast signal
processing, in Proceedings of IEEE ISCAS, pp. 478–483 (1999)
70. A. Skavantzos, M. Abdallah, Implementation issues of the two-level residue number system
with pairs of conjugate moduli. IEEE Trans. Signal Process. 47, 826–838 (1999)
71. H. Pettenghi, R. Chaves, L. Sousa, Method to design general RNS converters for extended
moduli sets. IEEE Trans. Circuits Syst. II 60, 877–881 (2013)
72. A. Skavantzos, M. Abdallah, T. Stouraitis, D. Schinianakis, Design of a balanced 8-modulus
RNS, in Proceedings of IEEE ISCAS, pp. 61–64 (2009)
73. H. Pettenghi, R. Chaves, L. Sousa, RNS reverse converters for moduli sets with dynamic
ranges up to (8n+1) bits. IEEE Trans. Circuits Syst. 60, 1487–1500 (2013)
74. G. Chalivendra, V. Hanumaiah, S. Vrudhula, A new balanced 4-moduli set {2k, 2n-1, 2n+1,
2n+1-1} and its reverse converter design for efficient reverse converter implementation, in
Proceedings of ACM GSVLSI, Lausanne, Switzerland, pp. 139–144 (2011)
75. P. Patronik, S.J. Piestrak, Design of Reverse converters for general RNS moduli sets {2k, 2n-1,
2n+1, 2n+1-1} and {2k, 2n-1, 2n+1, 2n-1-1} (n even). IEEE Trans. Circuits Syst. I 61, 1687–1700
(2014)
76. R. Conway, J. Nelson, New CRT based RNS converter for restricted moduli set. IEEE Trans.
Comput. 52, 572–578 (2003)
77. R. Lojacono, G. C. Cardarilli, A. Nannarelli, M. Re, Residue Arithmetic techniques for high
performance DSP, in IEEE 4th World Multi-conference on Circuits, Communications and
Computers, CSCC-2000, pp. 314–318 (2000)
78. M. Re, A. Nannarelli, G.C. Cardiralli, M. Lojacono, FPGA implementation of RNS to binary
signed conversion architecture, Proc. ISCAS, IV, 350–353 (2001)
79. L. Akushskii, V.M. Burcev, I.T. Pak, A New Positional Characteristic of Non-positional Codes
and Its Application, in Coding Theory and Optimization of Complex Systems, ed. by
V.M. Amerbaev (Nauka, Kazhakstan, 1977)
80. D.D. Miller et al., Analysis of a Residue Class Core Function of Akushskii, Burcev and Pak, in
RNS Arithmetic: Modern Applications in DSP, ed. by G.A. Jullien (IEEE Press, Piscataway,
1986)
81. J. Gonnella, The application of core functions to residue number systems. IEEE Trans. Signal
Process. SP-39, 69–75 (1991)
82. N. Burgess, Scaled and unscaled residue number systems to binary conversion techniques
using the core function, in Proceedings of 13th IEEE Symposium on Computer Arithmetic, pp
250–257 (1997)
83. N. Burgess, Scaling a RNS number using the core function, in Proceedings of 16th IEEE
Symposium on Computer Arithmetic, pp. 262–269 (2003)
84. M. Abtahi, P. Siy, Core function of an RNS number with no ambiguity. Comput. Math. Appl.
50, 459–470 (2005)
85. M. Abtahi, P. Siy, The non-linear characteristic of core function of RNS numbers and its effect
on RNS to binary conversion and sign detection algorithms, in Proceedings of NAFIPS 2005-
Annual Meeting of the North American Fuzzy Information Processing Society, pp. 731–736
(2005)
86. R. Krishnan, J. Ehrenberg, G. Ray, A core function based residue to binary decoder for RNS
filter architectures, in Proceedings of 33rd Midwest Symposium on Circuits and Systems,
pp. 837–840 (1990)
87. G. Dimauro, S. Impedevo, G. Pirlo, A new technique for fast number comparison in the
Residue Number system. IEEE Trans. Comput. 42, 608–612 (1993)
88. G. Dimauro, S. Impedevo, G. Pirlo, A. Salzo, RNS architectures for the implementation of the
diagonal function. Inf. Process. Lett. 73, 189–198 (2000)
89. G. Pirlo, D. Impedovo, A new class of monotone functions of the Residue number system. Int.
J. Math. Models Methods Appl. Sci. 7, 802–809 (2013)
90. P.V. Ananda Mohan, RNS to binary conversion using diagonal function and Pirlo and
Impedovo monotonic function, Circuits Syst. Signal Process. 35, 1063–1076 (2016)
91. S.J. Piestrak, A note on RNS architectures for the implementation of the diagonal function. Inf.
Process. Lett. 115, 453–457 (2015)
92. P.L. Montgomery, Modular multiplication without trial division. Math. Comput. 44, 519–521
(1985)

Further Reading

R.E. Altschul, D.D. Miller, Residue to binary conversion using the core function, in 22nd Asilomar
Conference on Signals, Systems and Computers, pp. 735–737 (1988)
M. Esmaeildoust, K. Navi, M. Taheri, A.S. Molahosseini, S. Khodambashi, Efficient RNS to
Binary Converters for the new 4-moduli set {2n, 2n+1 -1, 2n-1, 2n-1 -1}. IEICE Electron. Exp. 9
(1), 1–7 (2012)
F. Pourbigharaz, H.M. Yassine, A signed digit architecture for residue to binary transformation.
IEEE Trans. Comput. 46, 1146–1150 (1997)
W. Zhang, P. Siy, An efficient FPGA design of RNS core function extractor, in Proceedings of
2005 Annual Meeting of the North American Fuzzy Information Processing Society
(NAFIPS), pp. 722–724 (2005)
Chapter 6
Scaling, Base Extension, Sign Detection
and Comparison in RNS

6.1 Scaling and Base Extension Techniques in RNS

It is often required to scale a number in DSP applications. Scaling by a power
of two, by one modulus, or by a product of a few moduli is usually desired.
Division by an arbitrary integer is exact in RNS if the remainder of the
division is known to be zero. As an example, consider the moduli set {3, 5, 7}.
We wish to divide 39 by 13. This is possible by multiplying the residues of
39 by the multiplicative inverse of 13. We know that 39 = (0, 4, 4). We can see
that (1/13) mod 3 = 1, (1/13) mod 5 = 2 and (1/13) mod 7 = 6. Thus, multiplying
(0, 4, 4) by (1, 2, 6), we obtain (0, 3, 3), which corresponds to 3. The divisor
must be mutually prime to all the moduli for such division to be possible. On
the other hand, if we wish to divide 40 by 13, this is not possible directly;
exact division becomes feasible only if the residue of 40 mod 13 is first found
and subtracted from 40.
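The worked division can be reproduced in a few lines of Python (an illustrative model; pow(d, -1, m), available from Python 3.8, returns the multiplicative inverse of d mod m):

```python
def rns_exact_divide(res, mods, d):
    # Exact division in RNS: multiply each residue by (1/d) mod m_i.
    # Valid only when d divides X exactly and gcd(d, m_i) = 1 for every modulus.
    return [(r * pow(d, -1, m)) % m for r, m in zip(res, mods)]
```

Usage: rns_exact_divide([0, 4, 4], [3, 5, 7], 13) returns [0, 3, 3], the residues of 3.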
The MRC technique described in Chapter 5 in fact performs scaling by first
subtracting the residue corresponding to the modulus and then multiplying by
the multiplicative inverse. However, there will be a need for base extension,
which is explained next. Consider that division of 52, i.e. (1, 2, 3) in the
moduli set {3, 5, 7}, by 3 is desired. By subtracting the residue corresponding
to modulus 3, i.e. 1, we obtain 51, which divided by 3 yields 17. Thus, division
is accomplished. The residues corresponding to 17 in the moduli set {5, 7} are
now available. However, the result will be in the complete RNS only if the
residue of 17 mod 3 is also available. The computation of this residue is known
as base extension.
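The scaling steps just described (subtract the residue, multiply by the multiplicative inverse, then base-extend the vacated channel) can be sketched in Python. For brevity, the base extension is done here by CRT reconstruction over the remaining channels rather than by one of the dedicated techniques described later:

```python
from math import prod

def rns_scale_by_mj(res, mods, j):
    # floor(X / mods[j]) in RNS, for X given by residues `res`
    xj, mj = res[j], mods[j]
    rest = [m for i, m in enumerate(mods) if i != j]
    # subtract x_j and multiply by (1/m_j) on the other channels
    out = [((r - xj) * pow(mj, -1, m)) % m
           for i, (r, m) in enumerate(zip(res, mods)) if i != j]
    # base extension: the quotient Y < prod(rest), so CRT over `rest` recovers it
    P = prod(rest)
    y = sum((P // m) * pow(P // m, -1, m) * r for m, r in zip(rest, out)) % P
    out.insert(j, y % mj)
    return out
```

Usage: for X = 52 = (1, 2, 3) in {3, 5, 7}, rns_scale_by_mj([1, 2, 3], [3, 5, 7], 0) returns [2, 2, 3], the residues of floor(52/3) = 17.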
Szabo and Tanaka describe an interesting technique for base extension [1], but
it needs an additional MRC. In this technique, we denote the desired residue to
be found as x. Consider that we need to find the residue mod 3 of the number
corresponding to the residues (2, 3) in the moduli set {5, 7}. We can start the
conversion from modulus 7 and determine the MRC digit corresponding to modulus
3. This digit must be zero, because the quotient 17 is less than 35. We can use
this condition to find x, as can be seen from the following example.

© Springer International Publishing Switzerland 2016 133


P.V. Ananda Mohan, Residue Number Systems, DOI 10.1007/978-3-319-41385-3_6
Example 6.1 This example illustrates the Szabo and Tanaka base extension
technique considering the moduli set {3, 5, 7} and given residues 2 and 3
corresponding to the moduli 5 and 7, respectively. We need to find the residue
x corresponding to modulus 3. We use MRC starting from modulus 7:

Start with (x, 2, 3) for the moduli (3, 5, 7).
Subtract the residue 3 of modulus 7: (x − 3) mod 3 and (2 − 3) mod 5 = 4.
Multiply by (1/7): since (1/7) mod 3 = 1 and (1/7) mod 5 = 3, this gives (x − 3) mod 3 and 4·3 mod 5 = 2.
Subtract the residue 2 of modulus 5: (x − 3 − 2) mod 3 = (x − 2) mod 3.
Multiply by (1/5) mod 3 = 2: 2(x − 2) mod 3 = (2x − 1) mod 3.

The condition (2x − 1) mod 3 = 0 yields x = 2. Thus, the RNS number is (2, 2, 3). ■
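The unknown-digit device of Example 6.1 can be automated by carrying the target channel through the MRC steps symbolically as a linear form (a·x + b) mod m_t and then solving a·x + b ≡ 0. A Python sketch (assuming X is less than the product of the known moduli):

```python
def st_base_extend(res, mods, m_t):
    # Szabo-Tanaka base extension: given the residues of X for `mods`
    # (with X < prod(mods)), find x = X mod m_t.
    res, mods = list(res), list(mods)
    a, b = 1 % m_t, 0    # target-channel value tracked as (a*x + b) mod m_t
    while mods:
        m, r = mods.pop(), res.pop()   # process moduli from the last one
        inv = pow(m, -1, m_t)
        a, b = (a * inv) % m_t, ((b - r) * inv) % m_t
        res = [((ri - r) * pow(m, -1, mi)) % mi for ri, mi in zip(res, mods)]
    # the mixed-radix digit for the target channel must be zero: a*x + b = 0
    return (-b * pow(a, -1, m_t)) % m_t
```

Usage: st_base_extend([2, 3], [5, 7], 3) returns 2, as in the example.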
Alternative techniques based on CRT are available, but they need the use of a
redundant modulus and ROMs. Shenoy and Kumaresan [2] have suggested base
extension using CRT. This needs one extra (redundant) modulus. Consider the
three-moduli set {m1, m2, m3} and given residues (r1, r2, r3). We wish to
extend the base to the modulus m4. We need a redundant modulus m_r, and the
residue corresponding to m_r must be available. In other words, all
computations need to be done on the moduli m1, m2, m3 and m_r. Using CRT, we
can obtain the binary number X corresponding to (r1, r2, r3) as

X = Σ_{i=1}^{3} M_i |r_i/M_i|_{m_i} − kM    (6.1)
The residue X mod m4 can be found from (6.1) if k is known. Using the redundant
modulus m_r, if r_r = X mod m_r is known, k can be found from (6.1) as

k = | ( Σ_{i=1}^{3} | M_i |r_i/M_i|_{m_i} |_{m_r} − r_r ) |1/M|_{m_r} |_{m_r}    (6.2)

After knowing k, X mod m4 can be found from (6.1). An example will be
illustrative.
Example 6.2 Consider the moduli set {3, 5, 7}. We wish to find the residue mod
m4, where m4 = 11. Let us choose the redundant modulus 13. Let X be 52 =
(1, 2, 3). Note that X mod 13 = 0 shall be available.

Using CRT, we have X = 70·1 + 21·2 + 15·3 − k·105 = 52, so we need to find that
k = 1. From (6.2), we have k = [(70·1 + 21·2 + 15·3 − 0)·1] mod 13 = 1, since
|1/105|_13 = 1. Thus, we can determine X mod 11 from (6.1) as
(70·1 + 21·2 + 15·3 − 1·105) mod 11 = 8. ■
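Equations (6.1) and (6.2) translate directly into Python; this sketch mirrors Example 6.2, with m_r the redundant modulus and r_r = X mod m_r supplied by the extra channel:

```python
from math import prod

def sk_base_extend(res, mods, m_new, m_r, r_r):
    # Shenoy-Kumaresan base extension to modulus m_new using
    # the redundant residue r_r = X mod m_r
    M = prod(mods)
    terms = [(M // m) * ((r * pow(M // m, -1, m)) % m)
             for m, r in zip(mods, res)]
    k = ((sum(terms) - r_r) * pow(M, -1, m_r)) % m_r    # eq. (6.2)
    return (sum(terms) - k * M) % m_new                 # eq. (6.1)
```

Usage: sk_base_extend([1, 2, 3], [3, 5, 7], 11, 13, 0) returns 8 = 52 mod 11.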
Note that for an n-moduli RNS, by using LUTs to store the terms
|M_i |r_i/M_i|_{m_i}|_{m_(n+1)} and |M_i |r_i/M_i|_{m_i}|_{m_r}, where m_(n+1)
is the modulus to which base extension is required and m_r is the redundant
modulus, k can be found following (6.2) using a tree of modulo m_r adders.
Using another tree of modulo m_(n+1) adders, (6.1) can be evaluated mod m_(n+1).
A final modulo m_(n+1) adder subtracts kM from the sum of the second tree of
modulo m_(n+1) adders to obtain the result. The time taken is one look-up for
the multiplications by known constants and log2(n + 1) + 1 modulo addition
cycles. The hardware requirement is (2n + 2) LUTs and Σ_{i=0}^{q} ⌈n/2^i⌉ modulo
adders, where q = ⌈log2 n⌉.
Shenoy and Kumaresan [3] have proposed scaling X by a product Q = m1m2...mS of
S moduli using CRT, based on the base extension technique using a redundant
modulus discussed earlier. In the approximate technique, first k is determined
as before using one redundant modulus m_r. We note that in the small ring Z_Q,
using CRT we have

x′ = |x|_Q = Σ_{i=1}^{S} Q_i |x_i/Q_i|_{m_i} − r′_x Q    (6.3)

where Q_i = Q/m_i for i = 1, 2, ..., S. Since we need the scaled result y, we
subtract x′ from X and divide by Q to get the scaled result:
y = [ Σ_{i=1}^{n} M_i |x_i/M_i|_{m_i} − r_x M − ( Σ_{i=1}^{S} Q_i |x_i/Q_i|_{m_i} − r′_x Q ) ] / Q    (6.4)
where M_i = M/m_i and n is the total number of moduli in the RNS. In the
approximate method, the quantity r′_x is ignored and a correction term
(S − 1)/2 is added to obtain instead an estimate of the quotient as
y_e1 = Σ_{i=S+1}^{n} (M_i/Q) |x_i/M_i|_{m_i} − r_x M′ + Σ_{i=1}^{S} (1/m_i)( M′ |x_i/M_i|_{m_i} − |x_i/Q_i|_{m_i} ) + (S − 1)/2    (6.5)
where M′ = M/Q. In the accurate method, (6.5) gets modified as
y_e2 = r′_xe + Σ_{i=S+1}^{n} (M_i/Q) |x_i/M_i|_{m_i} − r_x M′ + Σ_{i=1}^{S} (1/m_i)( M′ |x_i/M_i|_{m_i} − |x_i/Q_i|_{m_i} )    (6.6)

where

r′_xe = | |1/Q|_{m_t} ( Σ_{i=1}^{S} Q_i |x_i/Q_i|_{m_i} − x′ − ε ) |_{m_t}    (6.7)
and m_t is a redundant modulus; m_t ≥ S is any integer prime to m1, m2, ..., mS.
Note that ε is 0 or 1, thus making r′_xe differ by 1 at most. This technique
needs log n cycles, and the scaled integer has an error of at most unity,
whereas in the approximate scaling technique the error e is such that
|e| ≤ (S − 1)/2 for S odd and |e| ≤ S/2 for S even. The redundant residue
channel is only log2 n bits wide, where n is the number of moduli, and does not
depend on the dynamic range of the RNS.
Jullien [4] has proposed two techniques using look-up tables for scaling a
number in an N-moduli RNS by a product of S moduli. In the first method, based
on the Szabo and Tanaka approach, a first step (denoted original) of MRC
obtains the residues corresponding to the division by the desired product of S
moduli. Next, base extension is carried out for this result by further MRC
conversion. A third step (final stage) is used for obtaining the residues
corresponding to the S moduli. The flow chart is illustrated in Figure 6.1a for
a six-moduli RNS. The total number of look-up tables needed for an N-moduli
system being scaled by a product of S moduli, for the three stages, is as
follows:

Original: L2 = S(N − (S + 1)/2), needing n2 = S cycles.
Mixed radix: L3 = (N − S − 1)(N − S)/2, needing n3 = (N − S − 1) cycles.
Final stage: L4 = S(N − S − 1), needing n4 cycles, where 2^(n4 − 1) < N − S ≤ 2^(n4).

The MRC stage and the final stage overlap, and the minimum number of look-up
cycles needed for these two stages is thus (N − S). Thus, in total N cycles are
needed.
In the second technique due to Jullien [4], denoted as scaling using estimates,
the CRT summation is divided by the product of the S moduli. Evidently, some of
the products M_i |1/M_i|_{m_i} / (Π_{j=0}^{S−1} m_j) are integers and some are
fractions. Next, these are multiplied by the given residues. All the resulting
fractions are rounded by adding 1/2. The residues of all these N numbers
corresponding to the N − S moduli are added mod m_i to get the residues of the
estimated quotient. The number of cycles needed in this first stage is n1,
where 2^(n1 − 1) < S + 1 ≤ 2^(n1), and L1 = (N − S)S tables are needed. Next,
base extension is carried out for these residues to get the residues
corresponding to the S moduli (see Figure 6.1b). The MRC and base extension
steps need L3 and L4 look-up tables as before. The total numbers of LUTs and
cycles needed are thus L1 + L3 + L4 and N + n1 − S respectively, as compared to
L2 + L3 + L4 and N needed for the original algorithm. Note that all the tables
have double inputs.
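The table and cycle counts quoted above are easy to tabulate with a small helper that follows the formulas as given in the text:

```python
from math import ceil, log2

def jullien_costs(N, S):
    # LUT and cycle counts for Jullien's two scaling techniques
    L2 = S * N - S * (S + 1) // 2        # first stage of the original method
    L3 = (N - S - 1) * (N - S) // 2      # mixed-radix stage
    L4 = S * (N - S - 1)                 # final stage
    L1 = (N - S) * S                     # first stage of the estimates method
    n1 = ceil(log2(S + 1))               # 2^(n1-1) < S + 1 <= 2^n1
    return {"original": (L2 + L3 + L4, N),
            "estimates": (L1 + L3 + L4, N + n1 - S)}
```

For example, for N = 6 and S = 3, the original method needs 21 tables and 6 cycles, the estimates method 18 tables and 5 cycles.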
Figure 6.1 Jullien's scaling techniques: (a) based on the Szabo–Tanaka approach and (b) based on estimates (Adapted from [4] ©IEEE 1978). [Figure: networks of two-input look-up tables T1–T4 (and estimate tables e3–e5 in (b)) interconnecting the six residue channels x0–x5 of moduli m0–m5 to produce the scaled outputs y0–y5 and the mixed-radix digits r0–r2.]
Garcia and Lloris [5] have suggested an alternative scheme which needs only two
look-up cycles, but bigger look-up tables. The first look-up uses as addresses
the residues of the S moduli together with the residue corresponding to each of x_{S+1}
Figure 6.2 Garcia and Lloris scaling techniques: (a) two look-up cycle scaling in the RNS, (b) look-up generation of [y1, ..., yS] and (c) look-up calculation of [yS+1, ..., yN] (Adapted from [5] ©IEEE 1999). [Figure: LUT blocks mapping the input residues X1–XN to the scaled residues Y1–YN; variants (b) and (c) replace one of the two look-up stages by iterative calculation or base extension.]
to x_N, to obtain the N − S residues y_{S+1}, ..., y_N of the scaled output
corresponding to the (N − S) moduli using (N − S) LUTs. These LUTs have (S + 1)
inputs. The second step finds the residues y_1 to y_S, corresponding to the S
moduli, of the scaled result using one table of N − S inputs (see Figure 6.2a).
The authors also suggest two alternatives (see Figure 6.2b, c): (a) look-up
generation of [y1, ..., yS], wherein the first stage of the above technique
uses iterative calculation needing n1 look-up cycles and L1 look-up tables, and
the second stage uses S look-up tables of (N − S) inputs as before, and (b)
look-up calculation of [yS+1, ..., yN], with N − S look-up tables of S + 1
inputs and a second base extension stage using iterative calculation in
(N − S) cycles, needing L3 + L4 look-up tables.
Barsi and Pinotti [6] have suggested an exact scaling procedure by the product of
a few moduli using look-up tables which does not need any redundant modulus. This
needs two steps of base extension. Consider for illustration the six moduli set {m1,
m2, m3, m4, m5, m6} and residues (r1, r2, r3, r4, r5, r6). We wish to scale the number
by the product of the moduli m1, m2 and m3. The first step is to subtract the number
corresponding to the three residues pertaining to these three moduli by performing
a base extension to moduli m4, m5 and m6 to obtain the residues corresponding to
these moduli. The result obtained is exactly divisible by m1m2m3. Hence, multiplication
with the multiplicative inverse of (m1m2m3) mod mi (i = 4, 5, 6) will
yield the scaled result in the moduli set {m4, m5, m6}. The next step is to perform
base extension to the moduli set m1, m2 and m3 to get the scaled result. Note that
exact scaling can be achieved.
6.1 Scaling and Base Extension Techniques in RNS 139

Note that the base extension is carried out without needing any redundant
modulus. We scale the CRT expression by the product of the moduli Mp = m1m2m3
first. The result is given by

X = X_E − εM = M_p ∑_{i=1}^{n} a_ip + ∑_{i=1}^{p} |(M/m_i)·|x_i/M_i|_{m_i}|_{M_p} − εM    (6.8a)

where

a_ip = ⌊(M_i/M_p)·|x_i/M_i|_{m_i}⌋    (6.8b)

Note that the integer and fractional parts are separated by this operation. It can be
shown that whereas in conventional CRT the multiple of M that needs to be
subtracted, rM (where r is the rank function), can range between 0 and n where n is
the number of moduli, Barsi and Pinotti observe that ε can be 0 or 1, thus needing
subtraction of M only. ε can be 1 only under the conditions |∑_{i=1}^{n} a_ip|_{M/M_p} ≥ (M/M_p) − p + 1 and |X_E|_{M_p} ≥ M_p − ∑_{i=1}^{p} (M_p/m_i).
Note that (6.8a) needs to be evaluated for the other moduli (m4, m5 and m6) to obtain
the residues corresponding to the number with residues {x1, x2, x3, 0, 0, 0}. The second base
extension after scaling by m1m2m3, to the moduli set {m1, m2, m3}, also needs to use a

Figure 6.3 Scaling scheme due to Barsi and Pinotti (Adapted from [6] ©IEEE 1995)

similar technique. The architecture is sketched in Figure 6.3 for an n moduli RNS
scaled by a product of s moduli.
An example will be illustrative.
Example 6.3 Consider the moduli set {23, 25, 27, 29, 31, 32} and given residues
(9, 21, 8, 3, 16, 17). This number corresponds to 578,321. We wish to divide this
number by the product of 23, 25 and 27, viz., 15,525. First we need to perform base
extension for the residue set {9, 21, 8} to the moduli 29, 31, 32. Note that this number
is 3896.
The residues of 3896 corresponding to the moduli {29, 31, 32} are (10, 21, 24). It
can be seen that a_{1,p} = 4, a_{2,p} = 1 and a_{3,p} = 1. The conditions needed for application
of the Barsi and Pinotti technique become |∑_{i=1}^{3} a_ip|_{m_3} < m_3 − 1 = 26 and
|X_E|_{M_p} ≤ m_1·m_2 − (m_1 + m_2) = 527. Note that since ∑_{i=1}^{3} a_ip = 6 < 26, we can
use this technique. The extended digits corresponding to moduli m4, m5 and m6 are
given by evaluating, for illustration,

x*_4 = |23 × 25 × 6 + ∑_{i=1}^{2} |(M/m_i)·|x_i/M_i|_{m_i}|_{M_p}|_{29} = 10

and x*_5 = 21 and x*_6 = 24.


Subtracting these from the given residues, we have (−, −, −, 22, 26, 25). We
next need to multiply these residues by the multiplicative inverse of 15,525
corresponding to the moduli {29, 31, 32}. These inverses are (−, −, −, 3, 5, 13). This yields
(−, −, −, 8, 6, 5). We next need to base extend the result (= 37) to the moduli
{23, 25, 27}. This yields the final result (14, 12, 10, 8, 6, 5) = 37. ■
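The flow of Example 6.3 can be checked with a short Python sketch. For brevity, both base extensions are done here by full CRT reconstruction (the hardware method of [6] would use (6.8a) and look-up tables instead), so this verifies only the arithmetic of the example.

```python
from functools import reduce

def crt(residues, moduli):
    # Standard CRT reconstruction of the integer from its residues.
    M = reduce(lambda a, b: a * b, moduli)
    return sum(r * pow(M // m, -1, m) * (M // m)
               for r, m in zip(residues, moduli)) % M

moduli = [23, 25, 27, 29, 31, 32]
X = crt([9, 21, 8, 3, 16, 17], moduli)      # the given number, 578321
Mp = 23 * 25 * 27                           # scaling factor 15525

# Step 1: base extend (9, 21, 8) to learn X mod Mp (3896 here).
x_mod_p = crt([9, 21, 8], [23, 25, 27])
# Step 2: subtract it and multiply by (Mp)^-1 in the remaining channels.
y = [((X - x_mod_p) % m) * pow(Mp % m, -1, m) % m for m in [29, 31, 32]]
# Step 3: base extend the scaled result back to the full moduli set.
Y = crt(y, [29, 31, 32])
scaled = [Y % m for m in moduli]
```

Here Y comes out as 37 with residues (14, 12, 10, 8, 6, 5), matching the example.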
Griffin et al. [7] have suggested a technique, L(ε + δ)-CRT, for scaling by an
integer in the RNS. In this, the basic idea is an approximation of the CRT
summation modulo a number μ known as the "reduced system modulus" given by
M/d where 1 ≤ d ≤ M is a scaling factor. We recall the CRT equation

x = ∑_{i=1}^{L} a_i mod M    (6.9)

where a_i = M_i·|x_i/M_i|_{m_i} for an L-moduli RNS. We divide both sides of (6.9) by d. The
quantities a_i/d = (M_i/d)·|x_i/M_i|_{m_i} and M/d are approximated with real numbers α_i and μ,
respectively. Typically, α_i and μ are chosen as integers. Note that the error in using
y to approximate x/d is given by

|x/d − y| < L(ε + δ)

or

min(M/d, μ) − L(ε + δ) ≤ x/d − y < max(M/d, μ)    (6.10)

where |a_i/d − α_i| ≤ ε < M/(dL) and |M/d − μ| ≤ δ < M/(dL) − ε. Note that ε and δ are the errors in
the summands and the error in the modulus, respectively. Note that the smaller of
the two errors in (6.10) is L times the error ε in the summands plus the error δ in the
modulus and hence it is named L(ε + δ)-CRT.
We can choose d = M/2^k, in which case M = d·2^k and the computation becomes
y = ∑_{i=1}^{L} a′_i mod 2^k where a′_i = ⌊(M_i/d)·|x_i/M_i|_{m_i}⌋. This is denoted as L-CRT where ε = 1
and δ = 0. It may be noted that a large modulo M addition in CRT is thus converted
into a smaller k-bit two's-complement addition. Thus, the L-CRT can be
implemented using look-up tables to find a′_i followed by a tree of k-bit adders.
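A minimal L-CRT sketch follows, with hypothetical moduli {7, 9, 11, 13} chosen only for illustration: each CRT summand is truncated after division by d = M/2^k, so the large mod-M addition collapses to a mod-2^k one, at the cost of an approximation error bounded by the number of moduli L.

```python
from functools import reduce
from math import floor

moduli = [7, 9, 11, 13]                  # hypothetical moduli, for illustration
M = reduce(lambda a, b: a * b, moduli)   # 9009
k = 8
d = M / (1 << k)                         # real-valued divisor, M = d * 2^k

def l_crt_scale(x):
    residues = [x % m for m in moduli]
    y = 0
    for m, r in zip(moduli, residues):
        Mi = M // m
        a_i = Mi * (r * pow(Mi, -1, m) % m)   # CRT summand
        y += floor(a_i / d)                   # a'_i: truncated scaled summand
    return y % (1 << k)                       # small mod-2^k reduction
```

Since each of the L truncations loses less than one unit, l_crt_scale(x) differs from (x/d) mod 2^k by fewer than L units.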
Meyer-Base and Stouraitis [8] have proposed a technique for scaling by a
power of 2. This is based on the following two facts: (a) in the case of a
residue x which is even, x/2 will yield the result and (b) in the case of a residue
x which is odd, the division by 2 needs multiplication with (1/2) mod m_i or
computing ((x + 1)/2) mod m_i. Thus, iterative division by 2 in r steps will result in
scaling by 2^r mod m_i.
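The iterative halving can be simulated as below. The parity, which the technique obtains by base extension to modulus 2 via a redundant modulus, is read off here from a full CRT reconstruction purely to drive the model; the moduli are hypothetical and all odd.

```python
from functools import reduce

moduli = [7, 9, 11, 13]                   # hypothetical odd, pairwise-coprime moduli
M = reduce(lambda a, b: a * b, moduli)

def crt(res):
    return sum(r * pow(M // m, -1, m) * (M // m) for r, m in zip(res, moduli)) % M

def halve(res):
    # One level: x/2 when x is even, (x+1)/2 when x is odd, done channel-wise.
    odd = crt(res) & 1                    # parity: stands in for base extension to 2
    return [(r + odd) * pow(2, -1, m) % m for r, m in zip(res, moduli)]

def scale_by_2r(x, r):
    res = [x % m for m in moduli]
    for _ in range(r):
        res = halve(res)
    return crt(res)
```

Note that an odd intermediate result rounds up, so r halvings give a result within one unit of x/2^r.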
Note, however, that it is required to find whether the intermediate result is odd or
even. This needs a parity detection circuit implying base extension to the modulus
2 using the Shenoy and Kumaresan technique which needs a redundant modulus [2]. In
addition, in the case of signed numbers, the scaling has to be correctly performed. It is
first needed to determine the sign of the given number to be scaled. Note that the
negative of X is represented by M − X in the case of odd M. Hence, even negative
numbers are mapped to odd positive numbers. When the input number is positive, the
output of the sign detection block is 0 and the output of the parity block is correct,
whereas when the input is negative, the output of the sign detection block is 1 and
the output of the parity block shall be negated. Thus, using the logic (X mod 2 = 0)
XOR (X > 0), the operation (X + 1)/2 needs to be performed. The architecture is
presented in Figure 6.4. Cardarilli et al. [9] have applied these ideas to realize a
QRNS polyphase filter. It may be recalled that in Vu's sign detection algorithm
[10], X is divided by M to yield the conditions Sign(Xs) = 0 if 0 ≤ Xs < 1 and 1 if
1 ≤ Xs < 2.
The authors observe that sign detection can be converted to parity detection by
doubling a number X. If X is positive, 0 ≤ X ≤ (M − 1)/2 or 0 ≤ 2X ≤ M − 1. The

Figure 6.4 Power of two scaling scheme for scaling of signed numbers (Adapted from [8] ©IEEE 2003)

integer 2X is within the dynamic range and hence |2X|_M mod 2 = 0. If X < 0, (M + 1)/
2 ≤ X ≤ M − 1 or M + 1 ≤ 2X ≤ 2M − 2 and hence the integer is outside the dynamic
range, which is indicated by |2X|_M mod 2 = 1.
Note that for n levels of scaling, each level needs the parity detection and 1-bit
scaling hardware shown in Figure 6.4. Cardarilli et al. [9] have considered several
techniques for base extension to modulus 2. These are the Barsi and Pinotti method [6],
the Shenoy and Kumaresan technique [2], the fractional CRT of Vu [10], and the Szabo and
Tanaka technique with two orderings of the moduli in the MRC, smallest to highest and
highest to smallest. They observe that the Shenoy and Kumaresan technique
[2] together with the fractional CRT of Vu [10] for base extension to a redundant
modulus (mr = 5) needs minimum resources and exhibits low latency for scaling.
Kong and Phillips [11] have suggested another implementation applicable to any
scaling factor K which is mutually prime to all moduli. In this technique, we
compute all the residues

y_i = |(x_i − |X|_K)·|1/K|_{m_i}|_{m_i}    (6.11)

We need to first know X mod K using base extension so that X − (X mod K) is exactly
divisible by K. The architecture is shown in Figure 6.5. Note that the LUTs used for
base extension need N inputs whereas the LUTs needed in the second stage need
two inputs. Kong and Phillips considered the Shenoy and Kumaresan [2] and Barsi and
Pinotti [6] techniques for base extension, the scaling method of Garcia and Lloris [5],
and the core function-based approach due to Burgess discussed in Chapter 5. For large

Figure 6.5 Scaling scheme of Kong and Phillips (Adapted from [11] ©IEEE 2009)

dynamic range RNS, they show that their technique outperforms other methods in
latency as well as resources. In this method, note that the base extension computation
in the Barsi and Pinotti technique [6] is replaced by n LUTs and only one look-up
cycle.
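Equation (6.11) can be exercised numerically as follows; |X|_K is obtained here by direct reconstruction in place of the base-extension LUTs, and the moduli and K are hypothetical.

```python
from functools import reduce

moduli = [23, 25, 27, 29]                 # hypothetical moduli
M = reduce(lambda a, b: a * b, moduli)

def crt(res):
    return sum(r * pow(M // m, -1, m) * (M // m) for r, m in zip(res, moduli)) % M

def scale_by_K(res, K):
    # y_i = |(x_i - |X|_K) * |1/K|_{m_i}|_{m_i}, Eq. (6.11)
    xK = crt(res) % K                     # base extension to the scaling modulus K
    return [(r - xK) * pow(K, -1, m) % m for r, m in zip(res, moduli)]

X, K = 123456, 31                         # K must be coprime to every modulus
y = scale_by_K([X % m for m in moduli], K)
```

The scaled residues y decode to exactly floor(X/K), since X − |X|_K is divisible by K.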
Chang and Low [12] have presented a scaler for the popular three moduli set
{2^n − 1, 2^n, 2^n + 1}. They have derived three expressions for the residues of the
result Y obtained by scaling the given RNS number by 2^n:

y_1 = |X/2^n|_{m_1} = |x_1 − x_2|_{m_1}    (6.12a)

y_2 = |X/2^n|_{m_2} = B mod m_2
    where B = |(2^{2n−1} + 2^{n−1})·x_1 − 2^n·x_2 + (2^{2n−1} + 2^{n−1} − 1)·x_3|_{m_1·m_3}    (6.12b)

y_3 = |X/2^n|_{m_3} = |x_2 + 2^n·x_3|_{m_3}    (6.12c)

Note that the residues x_1, x_2 and x_3 correspond to the moduli m_1 = 2^n − 1, m_2 = 2^n
and m_3 = 2^n + 1, respectively. Note that the computation of B in (6.12b) is the same as that
of Andraros and Ahmad [13], improved later in [14, 15] (see Section 5.1).
We also point out that (6.12a) and (6.12c) can be obtained by Mixed Radix
Conversion (MRC). At the end of the first two steps in MRC, viz., subtraction of the
residue x_2 of modulus 2^n from x_1 mod m_1 and x_3 mod m_3 and dividing by 2^n
(or multiplying with the multiplicative inverses 1 and −1, respectively), we get
B = (X − x_2)/2^n. The residues of the result B are thus available. However, base extension
needs to be done next to find the residue B mod 2^n. Chang and Low [12] avoid the

Figure 6.6 Block diagram of the Chang–Low scaling technique

base extension step by computing the 2n-bit word B in parallel. Interestingly, the
n LSBs of B yield the residue mod 2^n. Thus, the scaling technique described in [12]
can be considered to have used both MRC and CRT in parallel as shown in
Figure 6.6.
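Equations (6.12a)–(6.12c) can be checked directly; n = 4 below is an arbitrary choice.

```python
n = 4
m1, m2, m3 = (1 << n) - 1, 1 << n, (1 << n) + 1     # {15, 16, 17}
M = m1 * m2 * m3

def scale_2n(x1, x2, x3):
    y1 = (x1 - x2) % m1                              # (6.12a)
    B = (((1 << (2 * n - 1)) + (1 << (n - 1))) * x1
         - (1 << n) * x2
         + ((1 << (2 * n - 1)) + (1 << (n - 1)) - 1) * x3) % (m1 * m3)
    y2 = B % m2                                      # (6.12b): the n LSBs of B
    y3 = (x2 + (1 << n) * x3) % m3                   # (6.12c)
    return y1, y2, y3
```

For every X in the dynamic range, the three outputs equal the residues of floor(X/2^n).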
Tay et al. [16] have described 2^n scaling of signed integers in the RNS {2^n − 1,
2^n, 2^n + 1}. We define the scaled number in RNS as Y = (ỹ_1, ỹ_2, ỹ_3) corresponding
to the given number X = (x_1, x_2, x_3). In the case of a positive number, these are the same
as y_1, y_2 and y_3 given in (6.12).
In the case of negative integers, the scaled ỹ_2 value needs to be modified as
ỹ_2 + 1 while ỹ_1 and ỹ_3 remain the same. As an illustration, for the moduli set {7, 8, 9},
consider a negative number −104, represented as 400 = (1, 0, 4). After scaling by 8, considering
it as a positive integer, we have (1, 2, 5) corresponding to 50. The actual
answer, considering that it is a negative integer, is (1, 3, 5) = 491 = −13. The
implementation of the above idea needs detection of the sign of the number.
The authors show that the sign can be negative if bit (Y)_{2n−1} = 1 or if Y = 2^{2n−1} − 1
and x_{2,n−1} = 1 where x_2 is the residue corresponding to modulus 2^n. However,
detection of the condition Y = 2^{2n−1} − 1 needs a tree of AND gates. An
alternative solution is possible in which the detection of the negative sign is
possible under three conditions: the (2n − 1)th bit of Y is zero, y_1 = 2^{n−1} − 1, and
y_{2,n−1} = 1. Thus, a control signal generation block detecting the three conditions
needs to be added to the unsigned 2^n scaler architecture. The output of this block
selects 0 or 1 to be added to ỹ_2. The resulting architecture is presented in Figure 6.7a.
Note that the first block implements (6.12a)–(6.12c) to obtain y_1, Y and y_3. The
second block modifies the result to take into account the sign of the given number and
yields the residues corresponding to the scaled number Ỹ.
The authors have also suggested integrating the two blocks so that the computation
time is reduced. Note that in Figure 6.7b, the n LSBs of Y, i.e. y_2, are
computed using the n LSBs of A and B (the sum and carry vectors of the 2n-bit CSA
with EAC) and the carry-in bit c_{2n−1} arriving from the n MSBs in the modified mod 2^n
adder block following Zimmermann [17] having an additional prefix level. The
carry-in c_{2n−1} is generated by the control signal generation block using the n MSBs of A and
B, y_1, x̃_2, y_{2,(n−1)} and the G_{n−1} and P_{n−1} signals arriving from the modified mod 2^n

Figure 6.7 (a) Scaler for signed numbers for the moduli set {2^n − 1, 2^n, 2^n + 1} and (b) simplification of (a) (Adapted from [16] ©IEEE 2013)

adder. The bit y_{(2n−1)} is added to y_2 in a simplified mod 2^n adder. Note that the last
AND gate array is needed to realize ỹ_2 = 0 when Y = 2^{2n−1} − 1 and Ỹ is in the
negative range, and ỹ_2 = (y_2 + y_{2n−1})_{2^n} under other conditions.
Ulman and Czyzak [18] have suggested a scaling technique which does not need
redundant moduli. This is based on CRT. They suggest first dividing the CRT
expansion by the desired divisor K to get integer values of the orthogonal
projections ⌊X_j/K⌋, where X_j = M_j·|x_j/M_j|_{m_j} and M_j = M/m_j. The value ⌊X_j/K⌋ is

estimated in one channel. The resulting error e_j is smaller than 1/2n where n is the
number of moduli. The resulting total error is denoted as

ε = ∑_{j=1}^{n} ε_j    (6.13)

where ε_j, denoted the projection fractional offset, is a constant binary fraction that
needs to be added to each ⌊X_j/K⌋; these are estimated in a second parallel channel.
We know from CRT that

X/K = ∑_{j=1}^{n} X_j/K − rM/K    (6.14)

or

r = (∑_{j=1}^{n} X_j − X)/M = ⌊∑_{j=1}^{n} X_j/M⌋    (6.15)

since X/M is a fraction. The value of r is estimated following (6.15), from which
b_j = |r·(M/K)|_{m_j} is estimated in a third parallel channel. The results of all these
three channels are added to get the quotient with an error of at most 1 if M mod
K = 0 and 1.5 if M mod K ≠ 0.
The procedure thus comprises four steps:

(a) Look up ⌊X_j/K⌋ mod m_k (of length ⌈log m⌉ bits), the fractions X_j/M (of length
⌈log nM⌉ bits in binary form) and ε_j (of length ⌈log n⌉-bit binary fractions) for j = 1, 2, . . .,
n in parallel.
(b) Compute r = ⌊∑_{j=1}^{n} (X_j/M)⌋ and ε = ∑_{j=1}^{n} ε_j in parallel.
(c) Using r, look up b_j for j = 1, 2, . . ., n (the multiplication of r with M/K is carried
out in binary form).
(d) Sum up mod m_j, j = 1, 2, . . ., n, the results of steps (a), (b) and (c).

Note that ⌊X_j/K⌋ mod m_k, j = 1, 2, . . ., n, is obtained using ROMs in RNS form and
further that X_j/M is obtained using ROMs of size 2^⌈log m⌉ × ⌈log m⌉ in binary
form in parallel. Note also that b_j is computed from r using ROMs of size 2^⌈log m⌉
× ⌈log nM⌉⌈log m⌉ bits to obtain the residues. Note also that ε_j is obtained using ROMs of size

 
2^⌈log m⌉ × ⌈log n⌉ bits. Next, b_j, ε_j and ⌊X_j/K⌋ mod m_k are summed mod m_j for j = 1, 2, . . ., n.
The rest of the hardware uses multi-operand modulo adders and binary adders.
Lu and Chiang [19] have described a subtractive division algorithm to find
Z = ⌊X/Y⌋ which uses parity checking for sign and overflow detection. It uses binary
search for deciding the quotient and proceeds in five parts. In part I, the signs of the dividend and
divisor are determined and the operands are converted into positive numbers. In
part II, it finds a k such that 2^k·Y ≤ X ≤ 2^{k+1}·Y. In part III, the difference between 2^k
and the quotient is found in k steps since 2^k integers lie in the range 2^k to 2^{k+1}.
Note that each step in part III needs several RNS additions, subtractions, one RNS
multiplication, a table look-up for finding the parity of S_i (see later), a table look-up for
sign detection, multi-operand binary addition and exclusive-OR operations. Part IV
is used in case 2^k·Y ≤ X ≤ (M − 1)/2 ≤ 2^{k+1}·Y where M is the dynamic range of the
RNS. In part V, the quotient is converted to proper RNS form taking into account
the sign extracted in part I. In this technique, totally 2·log_2 Z steps are needed.
The parity of an RNS number can be found by using CRT in the case of all moduli
being odd. Recall that

X = ∑_{j=1}^{n} |x_j/M_j|_{m_j}·M_j − rM    (6.16)

Since M and M_j are odd, the parity of X depends only on |x_j/M_j|_{m_j} and r. Thus, we
define the parity as

P = LSB(|x_1/M_1|_{m_1}) ⊕ LSB(|x_2/M_2|_{m_2}) ⊕ ··· ⊕ LSB(|x_n/M_n|_{m_n}) ⊕ LSB(r)    (6.17)

where ⊕ is exclusive OR. If P = 0, it means that X is an even number and if P = 1,
it means that X is an odd number.
Lu and Chiang [19] have presented rules for determining whether overflow has
occurred during addition. These are summarized as follows.
Consider two numbers with residues (a_1, a_2, . . ., a_n) and (b_1, b_2, . . ., b_n). Then, A
+ B causes an overflow if (i) (a_1 + b_1, a_2 + b_2, . . ., a_n + b_n) is odd and A and B have
the same parity or (ii) (a_1 + b_1, a_2 + b_2, . . ., a_n + b_n) is even and A and B have
different parities. If X and Y have the same parity and Z = X − Y, then X ≥ Y iff Z is an
even number and X < Y iff Z is an odd number. Next, if X and Y have different
parities and Z = X − Y, then X ≥ Y iff Z is an odd number and X < Y iff Z is an even
number. Considering all odd moduli, overflow in addition exists when
|X + Y| > (M − 1)/2. If X and Y are of the same sign, the absolute value of the sum
should be no greater than ⌊M/2⌋. If X and Y have different signs, no overflow will
occur. Lu and Chiang have used these rules in their implementation.
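Parity detection per (6.16)–(6.17) can be sketched as follows; the rank r is computed here directly from the CRT sum (a hardware implementation would obtain it differently), and the all-odd moduli are hypothetical.

```python
from functools import reduce

moduli = [3, 5, 7, 11]                    # all odd, so M is odd
M = reduce(lambda a, b: a * b, moduli)

def parity(res):
    # P = LSB(t_1) xor ... xor LSB(t_n) xor LSB(r), with t_i = |x_i / M_i|_{m_i};
    # valid because every M_i and M is odd.
    t = [r * pow(M // m, -1, m) % m for r, m in zip(res, moduli)]
    rank = sum(ti * (M // m) for ti, m in zip(t, moduli)) // M
    return (sum(ti & 1 for ti in t) + rank) & 1
```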

Aichholzer and Hassler [20] have introduced an idea called relaxed residue
computation (RRC) which facilitates modulo reduction as well as scaling. The
reduction mod L can be performed where L is arbitrary such that gcd(L, M) = 1 and
M is the dynamic range of the RNS. Note that L is large compared to all moduli and
typically log_2 L = (1/2)·log_2 M. All large constants in the CRT expression are first
reduced mod L:

x̃ = ∑_{i} (M_i)_L·|x_i·(1/M_i)|_{m_i} + r·(−M)_L    (6.18)

Note that the Shenoy and Kumaresan technique [2] described earlier employing a
redundant modulus needs to be used to estimate r. Note that we do not obtain the
residue x* < L but some number x̃ with x̃ ≡ x* (mod L), which can be larger than L in
general. This technique is in fact a parallel algorithm to compute an L-residue in
RNS.
Example 6.4 As an illustration, consider the moduli set {11, 13, 15, 17, 19, 23} and
the redundant modulus m_x = 7. The dynamic range of the chosen RNS is
15,935,205. We consider an example corresponding to the input binary number
X = 1,032,192 which in the chosen RNS is (7, 5, 12, 3, 17, 21/0) and wish to divide
it by 2^11. (Note that /x indicates that the residue corresponding to the redundant
modulus is x.) We first find M_i and (1/M_i)_{m_i} as {1448655, 1225785, 1062347,
937365, 838695, 692835} and (10, 7, 8, 9, 6, 4). Next, we can find |x_i·(1/M_i)|_{m_i} as
(4, 9, 6, 10, 7, 15) and we can compute the rank r_x using the redundant modulus 7 as 3.
Considering L = 2^11, for RRC we obtain (M_i)_L = (719, 1081, 1483, 1429, 1063, 611)
and (−M)_L = 283. Using r_x = 3, already obtained, the residue corresponding to 2^11
can be obtained from (6.18) as x̃ = (719 × 4 + 1081 × 9 + 1483 × 6 + 1429 × 10
+ 1063 × 7 + 611 × 15) + 3 × 283 = 53,248. We can compute this in RNS itself.
As an illustration, for modulus 11, we have x̃ = (4 × 4 + 3 × 9 + 9 × 6 + 10 × 10
+ 7 × 7 + 6 × 15) + 3 × 283 = 1185 = 8 mod 11. Thus, the residue mod 2^11 is
(8, 0, 13, 4, 10, 3/6). (Note that the residues of the given number and of this number
mod 2048 are the same, but what we obtained is not the actual residue mod 2048.)
Subtracting this value from the given residues of X gives (10, 5, 14, 16, 7, 18/1),
which corresponds to 978,944. We next divide by 2^11 to remove the effect of the
scaling done in the beginning. This will require multiplying with the multiplicative
inverse of 2048, viz., (6, 2, 2, 15, 14, 1/2). The result corresponds to 478 as against the actual
value 504. Thus, there can be an error in the scaled result. ■
This technique can be used for RSA encryption as well, where m^e mod L needs to
be obtained.
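Example 6.4 can be reproduced as below; the rank r is computed directly from the CRT sum instead of via the redundant modulus 7, and (M_i)_L and (−M)_L are the CRT constants reduced mod L = 2^11.

```python
from functools import reduce

moduli = [11, 13, 15, 17, 19, 23]
M = reduce(lambda a, b: a * b, moduli)    # 15935205
L = 1 << 11                               # 2048

X = 1032192
res = [X % m for m in moduli]
t = [r * pow(M // m, -1, m) % m for r, m in zip(res, moduli)]   # (4,9,6,10,7,15)
rank = sum(ti * (M // m) for ti, m in zip(t, moduli)) // M      # 3

# Relaxed residue, Eq. (6.18): every large CRT constant is used mod L.
x_tilde = sum(((M // m) % L) * ti for ti, m in zip(t, moduli)) + rank * (-M % L)
```

x_tilde is congruent to X mod L but, being 53,248, is much larger than L; that is exactly the "relaxed" part.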
Hung and Parhami [21] suggested a sign estimation procedure which indicates in
log_2 n steps whether a residue number is positive or negative or too small in

magnitude to tell. This is based on CRT and uses a parameter α > 1 to specify the
input range and output precision. The input number shall be within 12  2α M and
1 α
22 M. When the output ES(X) (i.e. sign of X) is indeterminate, X is guaranteed
X n 
 
to be in the range {2αM, 2αM}. We compute EFα ðXÞ ¼  i¼1 EFα ðiÞðjÞ where
20 13 1
  1 !
j M
each term EFα ðiÞðjÞ ¼ 4@ A5 is truncated to the (β)th
mi mi
mi 1 2β
 
X
bit where β ¼ αþ dlog2 ne. Note that EFα(X) is an estimate of FðXÞ ¼   εð0; 1Þ
M1
and contains both the magnitude and sign information. If 0  EFα(X) < 1/2, then
ESα(X) ¼ + and if ½  EF(X) 1  2α, then ESα(X) ¼  and X < 0, otherwise
ESα(X) ¼  and 2αM  X  2αM. In case of the result being , MRC can be
carried out to determine the sign.
Hung and Parhami [22] have suggested algorithms for division by fixed divisors.
Consider the divisor D and dividend X. First compute C = ⌊M/D⌋ and choose
k such that 1 ≤ k ≤ n and M[1, k − 1] ≤ D ≤ M[1, k] where n is the number of moduli
and M[a, b] = ∏_{i=a}^{b} m_i. We evaluate X′ = ⌊X/M[k, n]⌋ and next Q = ⌊X′·C/M[1, k − 1]⌋.
Next we compute X″ = X − QD. By using general division, we get Q′ and R such
that X″ = Q′D + R. The result is Q″ = Q + Q′ and the remainder is R. One example
will be illustrative.
Example 6.5 Consider the moduli set {3, 5, 7, 11} and the division of 503
by 13.
Since M = 1155 and D = 13, we have C = ⌊1155/13⌋ = 88. We next have X′ = ⌊503/385⌋ = 1
since k = 2 and M[2, 4] = 385. Then, we have Q = ⌊(1 × 88)/3⌋ = 29. It follows that
X″ = 503 − 29 × 13 = 126. We next write X″ = 126 = 9 × 13 + 9, i.e. Q′ = 9 and
R = 9. The quotient hence is 29 + 9 = 38 and the remainder is 9. ■
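The steps of Example 6.5 can be traced in code; the subproduct indexing follows one plausible reading of the text (k = 2 for D = 13), and the final refinement uses ordinary integer division.

```python
from math import prod

moduli = [3, 5, 7, 11]
M = prod(moduli)                  # 1155
X, D = 503, 13

C = M // D                        # 88
k = 2                             # M[1,1] = 3 <= 13 <= M[1,2] = 15
M_hi = prod(moduli[k - 1:])       # M[k,n] = 5*7*11 = 385
M_lo = prod(moduli[:k - 1])       # M[1,k-1] = 3

X1 = X // M_hi                    # X' = 1
Q = (X1 * C) // M_lo              # 29, the first quotient estimate
X2 = X - Q * D                    # X'' = 126, now small enough to divide easily
Q2, R = divmod(X2, D)             # 9, 9
quotient, remainder = Q + Q2, R   # 38, 9
```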
In an alternative technique, CRT is used. The CRT expansion is reduced mod
D to first obtain Y:

Y = ∑_{i=1}^{n} |α_i·x_i|_{m_i}·Z_i + B(X)·(D − Z)    (6.19)

where Z_i = M_i mod D, Z = M mod D and B(X) indicates that B(X)·M needs to be
subtracted. Next, general division is used to obtain Q and R such that Y = QD + R.
The final results are Q′ = |(X − R)·D^{−1}|_M and R.
Hiasat and Abdel-Aty-Zohdy [23] have described a division algorithm for RNS.
This uses the fractional representation of CRT. Consider finding X/Y in RNS. First, X is
evaluated following Vu [10] as X/M using t bits where t ≥ ⌈log_2 MN⌉ if M is odd and
t ≥ ⌈log_2 MN⌉ − 1 otherwise, where N is the number of moduli. Similarly, Y is also
evaluated. The highest powers of 2 in these are represented as j and k. Then the
quotient is estimated as Q_1 = 2^{j−k−1}. Next, X − Q_1·Y is estimated and, as before,
(X − Q_1·Y)/Y is computed to yield the quotient Q_2. The updated quotient is
Q_1 + Q_2. The procedure needs to be continued as long as j > k. When j = k, the
result X′ = (X − Y) is computed. If the highest power of 2 contained in X′ is −1, then Q is not
incremented. If, on the other hand, j ≠ −1, then Q is incremented. This approach
eliminates the need for sign determination, overflow detection and scaling.
Shang et al. [24] have suggested a scheme for scaling by 2^n for signed integers.
The result of scaling by K can be written in RNS as

y_i = |(x_i − (X)_K)·|1/K|_{m_i}|_{m_i}    (6.20)

in case X is positive. In case X is negative, we have X′ = M − X and hence

Y′ = (X′ − (X′)_K)/K = (M − X − (M − X)_K)/K = (M − X − (M)_K + (X)_K)/K    (6.21)

Since the scaling result Y′ is positive, it needs to be mapped into the negative range of
the RNS as

Y = −Y′ = ((K − 1)M + X + (M)_K − (X)_K)/K    (6.22a)

or

y_i = |(x_i − (X)_K + (M)_K)/K|_{m_i}    (6.22b)

Comparing with (6.20), we note that the additional term (M)_K comes into the picture in
the case of negative numbers. If the scaling factor is 2^n, the above results become

y_i = |(x_i − (X)_{2^n})·|1/2^n|_{m_i}|_{m_i}  for X > 0    (6.23a)

and

y_i = |(x_i − (X)_{2^n} + (M)_{2^n})·|1/2^n|_{m_i}|_{m_i}  for X < 0.    (6.23b)

Thus, either of (6.23a) or (6.23b) can be selected using a MUX. Note that a sign
detector is needed in the earlier technique and base extension is needed to find (X)_K.
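A numeric check of (6.23a)/(6.23b) with hypothetical odd moduli; the sign and (X)_{2^n} are obtained here from the reconstructed representative, standing in for the sign detector and base extension.

```python
from functools import reduce

moduli = [7, 11, 13]                      # hypothetical odd moduli, M = 1001
M = reduce(lambda a, b: a * b, moduli)
n = 3
K = 1 << n                                # scale by 2^n = 8

def crt(res):
    return sum(r * pow(M // m, -1, m) * (M // m) for r, m in zip(res, moduli)) % M

def scale_signed(res):
    X = crt(res)                          # stored representative in [0, M)
    corr = (M % K) if X > M // 2 else 0   # extra (M)_{2^n} term of (6.23b)
    return [(r - X % K + corr) * pow(K, -1, m) % m for r, m in zip(res, moduli)]
```

For +200 the result decodes to 25; for −200 (stored as M − 200) it decodes to M − 25, i.e. −25.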
It may be noted that scaling is possible using the core function [25, 26]. The
techniques described in Chapter 5 Section 5.6 can be used for scaling by the arbitrary
number C(M)/M. From (5.54b) recall that

(C(M)/M)·n = C(n) + ∑_{i=1}^{k} (w_i/m_i)·α_i    (6.24)

Burgess [26] has suggested scaling of an RNS number using the core function within
the RNS. It is required to compute (6.24) in RNS. This can be achieved by
splitting the moduli into two subsets and finding the cores C_{M_J}(n) and
C_{M_K}(n) where M_J and M_K are the products of the moduli in the two sets and
M = M_J·M_K. The cores can be calculated efficiently since the terms in (6.24)
corresponding to M_J are zero for computing C_{M_J}(n) and those corresponding to M_K
are zero for computing C_{M_K}(n). Next, we can estimate the difference in the cores
(ΔC(n))_{ΔC(M)} as follows:

ΔC(n) = (∑_i n_i·C_J(B_i) − R(n)·C_J(M)) − (∑_i n_i·C_K(B_i) − R(n)·C_K(M))
      = ∑_i n_i·ΔC(B_i) − R(n)·ΔC(M)    (6.25)

where ΔC(B_i) = C_J(B_i) − C_K(B_i) and ΔC(M) = C_J(M) − C_K(M).
It follows that

(ΔC(n))_{ΔC(M)} = |∑_{i=1}^{k} n_i·ΔC(B_i)|_{ΔC(M)}    (6.26a)

We can add this value to C_{M_K}(n) to obtain the residues corresponding to the
other moduli. An example will be illustrative.
Example 6.6 Consider the moduli set {7, 11, 13, 17, 19, 23}. We consider the two
groups M_J = 7 × 17 × 23 = 2737 and M_K = 11 × 13 × 19 = 2717. Note that
M_J × M_K = M. Thus ΔC(M) = 20. The values M_i and (1/M_i)_{m_i} are {1062347,
676039, 572033, 437437, 391391, 323323} and {6, 1, 2, 12, 2, 2}, respectively.
The weights for C_J(M) = 2737 are (0, 2, 1, 0, 2, 0) and for C_K(M) = 2717 are
(1, 0, 0, 2, 0, 6). The two sets of C(B_i) can be derived next as C_J(B_i) = {2346,
249, 421, 1932, 288, 238} and C_K(B_i) = {2329, 247, 418, 1918, 286, 236}. Finally,
we have ΔC(B_i) = (17, 2, 3, 14, 2, 2).
The given number n = 1859107 = (5, 8, 3, 4, 14, 17) is to be approximately
scaled by 2717 to yield 684. The complete calculation in RNS is as follows:

C_J(n) mod 7 = 5, C_J(n) mod 17 = 4, C_J(n) mod 23 = 17, C_K(n) mod 11 = 5,
C_K(n) mod 13 = 0, C_K(n) mod 19 = 11

where C_J(n) = 5 × 2346 + 8 × 249 + 3 × 421 + 4 × 1932 + 14 × 288 + 17 × 238
following the CRT for the core function.
We can compute in parallel (ΔC(n))_{ΔC(M)} using (6.25) as

ΔC(n) mod 20 = (5 × 17 + 8 × 2 + 3 × 3 + 4 × 14 + 14 × 2 + 17 × 2) mod 20 = 8.

Next, adding ΔC(n) to the C_K(n) values, we obtain the remaining scaled residues as

C_J(n) mod 11 = (C_K(n) + ΔC(n)) mod 11 = (5 + 8) mod 11 = 2
C_J(n) mod 13 = (C_K(n) + ΔC(n)) mod 13 = (0 + 8) mod 13 = 8
C_J(n) mod 19 = (C_K(n) + ΔC(n)) mod 19 = (11 + 8) mod 19 = 0.

Hence, the scaled result is (5, 2, 8, 4, 0, 17) which corresponds to 684. ■


Note that there can be ambiguity in this case also. Note that C_J(n), C_K(n) or
ΔC(n) may be negative. Further, C_J(n) or C_K(n) may exceed C_J(M) or C_K(M),
respectively, or ΔC(n) may exceed ΔC(M). This ambiguity has been suggested to be
resolved by computing (ΔC(n))_{2ΔC(M)}:

ΔC(n) = |∑_i n_i·ΔC(B_i) − (R(n))_2·ΔC(M)|_{2ΔC(M)}    (6.26b)

Note that parity will be needed in this unambiguous scaling technique.
Consider n = 6432750 to be scaled by 2717 to yield 2368 using the same moduli
set as before. The parity p = 0 for the chosen n. Proceeding as before, we can obtain

C_J(n) mod 7 = 0, C_J(n) mod 17 = 3, C_J(n) mod 23 = 20,
C_K(n) mod 11 = 3, C_K(n) mod 13 = 6, C_K(n) mod 19 = 9.

The parity of the rank function of the CRT of the core function can be found from
CRT as (R(n))_2 = |∑_{i=1}^{k} n_i·(1/M_i)_{m_i} − p|_2 = 1. Next, we compute in parallel, taking
into account the rank function, ΔC(n) mod 40 = 20. Next, as before, the residues of C_J(n)
for the other moduli can be calculated by adding ΔC(n) to the C_K(n) values to obtain the
final scaled value. Note that we have C_J(n) mod 11 = 1, C_J(n) mod 13 = 0, C_J(n)
mod 19 = 10. Hence, the RNS value of the scaled result is given by C_J(n) = (0, 1, 0, 3,
10, 20). Next, using CRT, we get C_J(n) = 2366.

6.2 Magnitude Comparison

Another operation that is often required is magnitude comparison. Unfortunately,
unless both numbers that need to be compared are available in binary form, this is
not straightforward. Solutions do exist but these are time consuming. For example, both
RNS numbers can be converted into mixed radix form and, by sequential comparison of
the mixed radix digits starting from the highest digit, the comparison can be made
[27, 28]. Thus, the computation time involved is in the worst case n comparisons
for an n moduli system preceded by MRC, which can be done in parallel by having
two hardware units.
Example 6.7 As an illustration, consider that comparison of 12 and 37 is needed in
{3, 5, 7}. Mixed radix conversion yields the mixed radix digits [0, 1, 5],
i.e. 12 = 35 × 0 + 7 × 1 + 5, and [1, 0, 2], i.e. 37 = 35 × 1 + 7 × 0 + 2. Starting
from the most significant mixed radix digit, the comparison can be made. Hence, 37 is
greater than 12. ■
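The comparison of Example 6.7 can be carried out on residues alone with classical mixed-radix conversion:

```python
def mrc(res, moduli):
    # Residue-only mixed-radix conversion; the last modulus yields the
    # least significant digit, matching 12 = 35*0 + 7*1 + 5 for {3, 5, 7}.
    rem = dict(zip(moduli, res))
    ms = list(moduli)[::-1]
    digits = []
    for i, m in enumerate(ms):
        d = rem[m]
        digits.append(d)
        for mm in ms[i + 1:]:             # subtract the digit, divide by m
            rem[mm] = (rem[mm] - d) * pow(m, -1, mm) % mm
    return digits[::-1]                   # most significant digit first

def greater(a_res, b_res, moduli):
    # Lexicographic comparison of the digit lists, MSD first.
    return mrc(a_res, moduli) > mrc(b_res, moduli)
```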
Bi and Gross [29] have suggested residue comparison in the RNS {2^n − 1, 2^n, 2^n + 1}
by first using the Mixed-Radix CRT discussed in Chapter 5 Section 5.3 to get the mixed
radix digits and then comparing these sequentially starting from the most significant mixed radix
digit; thus the greater among A and B can be found.
Magnitude comparison of A and B can also be carried out using the sum of quotients
technique (SQT). In this method, the value of the diagonal function D(X) is found [30]
as described in Section 5.7. Since D(X) is a monotonically increasing function, a
comparison of two numbers X and Y can be carried out by comparing D(X) and
D(Y). In case D(X) = D(Y), however, we have to compare any one of the coordinates
of X with that of Y. Note also that (∑_{i=1}^{n} k_i)_{SQ} = 0. Note, however, that like the core
function, the mapping of X to D(X) exhibits noise. An example will be illustrative.
Example 6.8 Using the diagonal function, we wish to compare two numbers X = (3, 5,
11, 8, 7) corresponding to 30,013 and Y = (0, 0, 10, 1, 2) corresponding to 11,000 in
the moduli set {m1, m2, m3, m4, m5} = {5, 11, 14, 17, 9}. We have M = 117,810 and
SQ = 62,707. We can find that k_1 = 37624, k_2 = 45605, k_3 = 4479, k_4 = 51641,
k_5 = 48772. Then D(X) = 15972 and D(Y) = 5854. Hence, since D(X) > D(Y), we
have X > Y. ■
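The diagonal-function comparison of Example 6.8 can be sketched as below, with each constant k_i computed as |−(1/m_i)|_SQ (one way to obtain the values quoted in the example):

```python
from functools import reduce

moduli = [5, 11, 14, 17, 9]
M = reduce(lambda a, b: a * b, moduli)        # 117810
SQ = sum(M // m for m in moduli)              # 62707

ks = [-pow(m, -1, SQ) % SQ for m in moduli]   # (37624, 45605, 4479, 51641, 48772)

def D(res):
    # Monotone diagonal function: D(X) = sum(floor(X/m_i)) mod SQ,
    # computed from the residues alone.
    return sum(k * r for k, r in zip(ks, res)) % SQ

X = [3, 5, 11, 8, 7]      # 30013
Y = [0, 0, 10, 1, 2]      # 11000
```

D(X) = 15972 exceeds D(Y) = 5854, so X > Y.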
Note that the numbers k_i that need to be handled are quite large in the earlier
technique. A solution to this problem is to first group the moduli and perform MRC
in each set to get the decoded numbers and then use SQT on these numbers. As an
illustration, for the moduli set {17, 29, 19, 23}, we have M = 215,441 and
SQ = 40,808. On the other hand, if we consider products of pairs of moduli
(17 × 29) and (19 × 23), we have the virtual moduli {493, 437} and SQ = 930, thus
simplifying the computation.

The authors have suggested an alternative technique in which SQ is also
included in the moduli set. However, then, the SQT needs to be modified. The dynamic
range of the new RNS {m_1, m_2, . . ., m_n, SQ} is M × SQ. First, x_SQ is subtracted from
all other residues of X and the resulting residues are multiplied with (1/SQ)_{m_i} to
obtain (⌊X/SQ⌋)_{m_i}. In a similar manner, for Y also, we find (⌊Y/SQ⌋)_{m_i}. SQT is
performed on (⌊X/SQ⌋)_{m_i} and (⌊Y/SQ⌋)_{m_i} as before for comparing the two numbers.
Magnitude comparison for the RNS {2^n − 1, 2^n, 2^n + 1} has also been investigated. Elvazi et al. [31] have described a comparator which uses the reverse converter of Piestrak [14] excluding the final 2n-bit adder. The sum and carry vectors sum_i and carry_i corresponding to $A' = \left|\left\lfloor A/2^n\right\rfloor\right|_{2^{2n}-1}$ and $B' = \left|\left\lfloor B/2^n\right\rfloor\right|_{2^{2n}-1}$ for the two inputs A and B to be compared (i = 1, 2) are first obtained. Next, using a tree of exclusive-OR gates, the equality of these two is tested. Using two carry-look-ahead units (carry recognizers) and a 4-input CSA computing $carry_1 + sum_1 + \overline{carry_2} + \overline{sum_2} + 2$, the decision regarding A' > B' or A' < B' is obtained. If the decision is ambiguous, then the n LSBs (residues of modulus 2^n) are compared using another n-bit comparator.
New CRT III [32, 33] has been shown in Chapter 5 Section 5.3 to be useful to perform RNS to binary conversion when the moduli have common factors. Sousa [34] has described a magnitude comparison technique for a four-moduli RNS with two conjugate moduli pairs {2^n − 1, 2^n + 1, 2^{n+1} − 1, 2^{n+1} + 1}. The dynamic range is (2^{2n} − 1)(2^{2n+2} − 1)/3, which is odd. The comparator is based on parity detection of both the given numbers A and B. They observe that for d = 3, where d is GCD(2^{2n+2} − 1, 2^{2n} − 1), the expression for the decoded word, defining X1 as the decoded number corresponding to the moduli set {2^n − 1, 2^n + 1} and X2 as the decoded number corresponding to the moduli set {2^{n+1} − 1, 2^{n+1} + 1}, can be derived following the procedure described in Chapter 5 Section 5.3 as

$$X = X_2 + \left(2^{2n+2}-1\right)\left\langle\left|\left(\frac{2^{2(n+1)}-1}{3}\right)^{-1}\right|_{\frac{2^{2n}-1}{3}}\cdot\frac{X_1-X_2}{3}\right\rangle_{\frac{2^{2n}-1}{3}} = X_2 + \left(2^{2n+2}-1\right)\left\langle\frac{X_1-X_2}{3}\right\rangle_{\frac{2^{2n}-1}{3}} \qquad (6.27a)$$
It can be seen next from (6.27a) that

$$\langle X\rangle_2 = \langle X_2\rangle_2 \oplus \left\langle\langle X_1 - X_2\rangle_{2^{2n}-1}\right\rangle_2 \qquad (6.27b)$$
which is based on the observation that (2^{2n+2} − 1) is odd and $\left\langle\frac{X_1-X_2}{3}\right\rangle_{\frac{2^{2n}-1}{3}}$ has the same parity as $\langle X_1 - X_2\rangle_{2^{2n}-1}$. Next, the comparison between A and B can be made using the following two properties: (a) A ≥ B iff A and B have the same parity and C = A − B is an even number, or A and B have different parities and C is an odd number, and (b) A < B iff A and B have the same parity and C is an odd number, or A and B have different parities P_A and P_B but C is an even number. This can be summarized as the computation of P = P_A ⊕ P_B ⊕ P_C, where P_C is the parity of C: if P = 0, A ≥ B, else A < B.
Three first-level converters for the three residue sets are used to yield the decoded outputs (X1, X2) corresponding to the inputs A, B and A − B given as (a1, a1*, a2, a2*), (b1, b1*, b2, b2*), (a1 − b1, a1* − b1*, a2 − b2, a2* − b2*). We denote these outputs, respectively, as (A1, A2), (B1, B2), (C1, C2). Note that c = (a − b) mod (2^n − 1) and c* = (a* − b*) mod (2^n + 1). Next, the parities of the final decoded words are computed following (6.27b) as

$$P_A = \mathrm{LSB}\left(\langle A_1 - A_2\rangle_{2^{2n}-1}\right)\oplus \mathrm{LSB}(A_2)$$
$$P_B = \mathrm{LSB}\left(\langle B_1 - B_2\rangle_{2^{2n}-1}\right)\oplus \mathrm{LSB}(B_2) \qquad (6.28)$$
$$P_C = \mathrm{LSB}\left(\langle C_1 - C_2\rangle_{2^{2n}-1}\right)\oplus \mathrm{LSB}(C_2)$$
Pirlo and Impedovo [35] have described a monotone function which can facilitate magnitude comparison and sign detection. In this technique, the function calculated is

$$F_I(X) = \sum_{i\in I}\left\lfloor\frac{X}{m_i}\right\rfloor \qquad (6.29)$$

where I ⊆ {1, 2, ..., N} for an RNS with N moduli. Note that the number of terms being added in (6.29) can be chosen as desired. As an illustration, for a four-moduli RNS {m1, m2, m3, m4}, we can choose

$$F_I(X) = \left\lfloor\frac{X}{m_2}\right\rfloor + \left\lfloor\frac{X}{m_4}\right\rfloor \qquad (6.30)$$
Evidently, the values of ⌊X/m2⌋ and ⌊X/m4⌋ can be calculated by using the CRT expansion and dividing by m2 and m4, respectively, and approximating the multipliers of the various residues x1, x2, x3 and x4 by truncation or rounding. However, these can also be calculated by defining the parameters M_I and SINV first as follows:

$$M_I = \sum_{i\in I} M_i \quad\text{and}\quad SINV = \left|\sum_{i\in I}\left|\frac{1}{m_i}\right|_{M_I}\right|_{M_I} \qquad (6.31)$$

Next, we can compute F_I(X) as

$$F_I(X) = \left|\sum_{i=1}^{n} b_i x_i\right|_{M_I} \qquad (6.32)$$

where

$$b_i = \left|-\frac{1}{m_i}\right|_{M_I} \text{ for } i \in I, \qquad b_j = \left|M_j \cdot SINV \cdot \left|\frac{1}{M_j}\right|_{m_j}\right|_{M_I} \text{ for } j \in J \qquad (6.33)$$

An example will be considered next.
Example 6.9 Find the monotone function $F_I(X) = \left\lfloor\frac{X}{m_2}\right\rfloor + \left\lfloor\frac{X}{m_4}\right\rfloor$ for the RNS {37, 41, 43, 64} corresponding to X = 17,735 with residues (12, 23, 19, 7).
We can note that M2 = 37 × 43 × 64 = 101,824 and M4 = 37 × 41 × 43 = 65,231. The direct method yields F_I(X) = 709. We have M_I = M2 + M4 = 167,055 and $SINV = \left|\left|\frac{1}{41}\right|_{167055} + \left|\frac{1}{64}\right|_{167055}\right|_{167055} = 36{,}225$. Next, from (6.33), we can compute b1 = 9030, b2 = 8149, b3 = 27,195 and b4 = 122,681. Thus, we can compute F_I(X) = 709 using (6.32). ■
It may be noted that from the CRT expansion, by dividing by m2 and m4 and adding the two expressions, we can obtain (6.33). The fractions can be truncated to obtain the various bi.
The various multiplicative inverses needed in CRT, viz. $\left|\frac{1}{M_j}\right|_{m_j}$, are 2, 2, 7 and 47 for j = 1, 2, 3 and 4, respectively. Thus, we have

$$b_1 = \frac{2\times41\times43\times64}{41} + \frac{2\times41\times43\times64}{64} = 9030$$

and similarly,

$$b_2 = \frac{2\times37\times43\times64}{41} + \frac{2\times37\times43\times64}{64} = 8149.0243,$$
$$b_3 = \frac{7\times37\times41\times64}{41} + \frac{7\times37\times41\times64}{64} = 27195,$$
$$b_4 = \frac{47\times37\times41\times43}{41} + \frac{47\times37\times41\times43}{64} = 122681.015625$$
Evidently, b2 and b4 are approximated, leading to error in the scaled value.
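Example 6.9 can be verified with the sketch below (Python; the coefficient construction by truncating the CRT-derived fractions follows the discussion above, and the helper names are illustrative):

```python
# Sketch checking Example 6.9: F_I(X) = floor(X/m2) + floor(X/m4) computed
# from the residues only, via the truncated coefficients b_i (cf. (6.32)).
from math import prod

m = [37, 41, 43, 64]
M = prod(m)
I = [1, 3]                                   # 0-based indices of m2=41, m4=64
inv = [pow(M // mi, -1, mi) for mi in m]     # CRT inverses 2, 2, 7, 47

# b_i = sum over j in I of (inverse_i * M_i) / m_j, truncated
b = [int(sum(inv[i] * (M // m[i]) / m[j] for j in I)) for i in range(4)]
MI = sum(M // m[j] for j in I)               # 101824 + 65231 = 167055

x = [12, 23, 19, 7]                          # residues of X = 17735
FI = sum(bi * xi for bi, xi in zip(b, x)) % MI
print(b, FI)    # [9030, 8149, 27195, 122681] 709
assert FI == 17735 // 41 + 17735 // 64 == 709
```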
6.3 Sign Detection

Sign detection is equally complicated since it involves comparison once again. A straightforward technique is to perform RNS to binary conversion, compare with M/2 where M is the dynamic range, and declare the sign. However, simpler techniques for special cases have been considered in the literature. Recall from Chapter 5 Section 5.1 that Vu's method [10] of RNS to binary conversion based on scaled CRT is suitable for sign detection.
Ulman [36] has suggested a technique for sign detection for moduli sets having one even modulus. This is based on mixed radix conversion. Considering the moduli set {m1, m2, m3, ..., mn} where mn is even, we can consider another moduli set having mn/2 in place of mn. Denoting the original dynamic range as M = m1m2m3...mn, the sign function corresponding to a binary number is defined using a parameter k as

$$\mathrm{Sgn}(Z) = 0 \text{ if } k = 0, \quad \mathrm{Sgn}(Z) = 1 \text{ if } k > 0 \qquad (6.34)$$

where $|Z|_M = k\frac{M}{2} + |Z|_{M/2}$ and k ≥ 0. It can be shown for mn even that Sgn(Z) = 0 if $|Z|_{m_n} = \left||Z|_{M/2}\right|_{m_n}$. Note that $|Z|_{m_n/2}$ needs to be calculated first from $|Z|_{m_n}$. Then $|Z|_{M/2}$ needs to be computed in the RNS {m1, m2, ..., m_{n−1}, m_p} where m_p = mn/2, and then we find $\left||Z|_{M/2}\right|_{m_n}$. Note that $|Z|_{m_n}$ is directly available.
In the case of mn = 2^w s where w ≥ 1, s ≥ 1 and s is odd, it can be shown that Sgn(Z) = 0 if $\left||Z|_{m_n}\right|_2 = \left||Z|_P\right|_2$ where P = M/2. From MRC, we know that

$$|Z|_P = a_0 + a_1 m_1 + a_2 m_1 m_2 + \cdots + a_{n-1}m_1m_2m_3\ldots m_{n-1} \qquad (6.35)$$

where the a_i are the mixed radix digits. Since all the moduli except mn are odd, we have

$$\left||Z|_P\right|_2 = \left||a_0|_2 + |a_1|_2 + |a_2|_2 + \cdots + |a_{n-1}|_2\right|_2 \qquad (6.36)$$

Hence, the LSBs of the MRC digits can be added mod 2 and the result compared with the LSB of zn to determine the sign. An architecture is presented for n = 5, mn even, in Figure 6.8.
Tomczak [37] has suggested a sign detection algorithm for the moduli set {m1, m2, m3} = {2^n − 1, 2^n, 2^n + 1}. It is noted that the MSB of the decoded word gives the sign (negative if the MSB is 1 and positive if the MSB is zero) in all cases except for the numbers between 2^{3n−1} − 2^{n−1} and 2^{3n−1}. One simple method would be to first perform RNS to binary conversion and declare the sign as negative when the MSB is 1, except when all 2n MSBs are “1”. Tomczak observes that the 2n MSBs can be obtained using the well-known RNS to binary conversion due to Wang [38] as
[Figure 6.8: the residues |Z|_{m1}, ..., |Z|_{m4} drive a mixed radix converter producing digits a0, ..., a_{n−1}, whose LSBs are summed in an n-bit modulo-2 adder; |Z|_{m_n} feeds a |Z|_{m_n} to |Z|_{m_n/2} converter, and a 1-bit comparator of the two paths yields the sign S(z) alongside the magnitude of |Z|_M.]
Figure 6.8 Architecture for sign detection due to Ulman (Adapted from [36]©IEEE1983)

$$X = x_2 + 2^n\left|2^n\left(x_3 - x_2 + Y(2^n+1)\right)\right|_{2^{2n}-1} \qquad (6.37a)$$

where $Y = \left|2^{n-1}(x_1 - x_3)\right|_{2^n-1}$. Due to the multiplication by 2^n modulo (2^{2n} − 1), the MSB of the desired result is actually the (n − 1)th bit of Z where Z = x3 − x2 + Y(2^n + 1) + C with C = 2^{2n} − 1. Thus, only the lower n bits of Z need to be computed:

$$t = t' + |C|_{2^n} \qquad (6.38a)$$

where

$$t' = (x_3 - x_2 + Y) \qquad (6.38b)$$

It can be shown that C is not needed in the sign-determining function, given as

$$\mathrm{sgn}(x_1, x_2, x_3) = \left\lfloor\frac{|t'|_{2^n}}{2^{n-1}}\right\rfloor \qquad (6.39)$$
As an illustration, for the RNS {15, 16, 17}, consider the following numbers and their detected signs (bit n − 1 of |t'|₂ₙ gives the sign):
2042 = (2, 10, 2); Y = 0, t' = −8, |t'|₁₆ = 1000₂: negative
3010 = (10, 2, 1); Y = 12, t' = 11 = 1011₂: negative
1111 = (1, 7, 6); Y = 5, t' = 4 = 0100₂: positive
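The three illustrations above can be checked mechanically. The sketch below (Python) evaluates (6.38b) and (6.39) and, for n = 4, confirms the detected sign for every value in the dynamic range:

```python
# Sketch of Tomczak's sign test (6.39) for the RNS {2^n-1, 2^n, 2^n+1},
# checked here exhaustively for n = 4 (moduli {15, 16, 17}).
n = 4
m1, m2, m3 = (1 << n) - 1, 1 << n, (1 << n) + 1
M = m1 * m2 * m3                       # 4080

def sign(x1, x2, x3):
    Y = ((x1 - x3) << (n - 1)) % m1    # Y = |2^(n-1)(x1 - x3)| mod (2^n - 1)
    t = (x3 - x2 + Y) % m2             # lower n bits of t'
    return t >> (n - 1)                # bit n-1 is the sign

for X in range(M):
    expected = 1 if X >= M // 2 else 0
    assert sign(X % m1, X % m2, X % m3) == expected
```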
Tomczak suggests implementation of (6.39) as

$$t' = \bar{x}_2 + \left|2^{n-1}x_1\right|_{2^n-1} + 2^{n-1}\left(x_{3,0} + x_{3,n}\right) + \left\lfloor x_3^*/2\right\rfloor + x_{3,0} + W \qquad (6.40)$$

where $\bar{x}_2$ denotes the one's complement of x2, x*₃ is the n-bit LSB word formed from x3 and

$$W = 0 \text{ if } \left|2^{n-1}x_1\right|_{2^n-1} + \hat{x}_3 < 2^n - 1, \qquad W = 1 \text{ if } \left|2^{n-1}x_1\right|_{2^n-1} + \hat{x}_3 \ge 2^n - 1 \qquad (6.41)$$

where $\hat{x}_3 = \left|-\left|2^{n-1}x_3\right|_{2^n-1}\right|_{2^n-1}$.
Sousa and Martins [39] have described sign detection for the moduli set {2^n + 1, 2^n − 1, 2^{n+k}} using mixed radix conversion. The MRC digits can be easily derived as

$$d_1 = x_1, \quad d_2 = \left|(x_2 - x_1)2^{n-1}\right|_{2^n-1}, \quad d_3 = \left|(x_1 - x_3) + d_{2,k-1:0}\cdot 2^n + d_2\right|_{2^{n+k}} \qquad (6.42)$$

where X = d1 + d2(2^n + 1) + d3(2^{2n} − 1). The sign is the MSB of d3. Thus, $|x_1 - x_3|_{2^{n+k}}$ needs to be added with the (n + k)-bit word formed by d_{2,k−1:0} concatenated with d2. Using a carry-look-ahead scheme, the sign bit of d3 can be obtained.
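A short sketch (Python; helper names are illustrative) of (6.42) for n = 4, k = 2 confirms that the MSB of d3 is the sign bit over the whole range:

```python
# Sketch of the MRC digits (6.42) for {2^n+1, 2^n-1, 2^(n+k)}; the MSB of
# d3 gives the sign. Checked exhaustively for n = 4, k = 2 ({17, 15, 64}).
n, k = 4, 2
m1, m2, m3 = (1 << n) + 1, (1 << n) - 1, 1 << (n + k)
M = m1 * m2 * m3                                    # 16320

def d3_msb(x1, x2, x3):
    d2 = ((x2 - x1) << (n - 1)) % m2                # |(x2-x1)*2^(n-1)| mod (2^n-1)
    d2low = d2 & ((1 << k) - 1)                     # d2[k-1:0]
    d3 = ((x1 - x3) + (d2low << n) + d2) % m3
    return d3 >> (n + k - 1)                        # MSB of d3

for X in range(M):
    assert d3_msb(X % m1, X % m2, X % m3) == (1 if X >= M // 2 else 0)
```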
Xu et al. [40] have considered sign detection for the three-moduli set {2^{n+1} − 1, 2^n − 1, 2^n}. They suggest using the mixed radix digit obtained by using the Mixed-Radix CRT of Bi and Gross [29] described in Chapter 5 Section 5.3. It can be shown that the highest mixed radix digit is given by

$$d_3 = \left|-2x_1 + x_2 + x_3 + \left\lfloor\frac{x_2 - x_1}{2^n - 1}\right\rfloor\right|_{2^n} \qquad (6.43a)$$

The sign detection algorithm uses the MSB of d3. Note that (6.43a) can be rewritten as

$$d_3 = \left|x_1'' + x_2 + x_3 + W\right|_{2^n} \qquad (6.43b)$$

where

W = 0 if (x_{1,n} = 0 and x2 < x'₁) or (x_{1,n} = 1 and x2 ≤ x'₁),
W = 1 if (x_{1,n} = 0 and x2 ≥ x'₁) or (x_{1,n} = 1 and x2 > x'₁).

Here x'₁ = x_{1,n−1:0} is the word formed by the n LSBs of x1, and x''₁ is the n-bit word equaling $2\bar{x}_{1,n-2:0} + \bar{x}_{1,n}$. A CSA can be used to find the sum of the first three terms, and W is estimated using a comparator comparing x2 and x'₁ together with glue logic. Note that the sum of the CSA output vectors and W needs to be added in an adder having only carry computation logic.
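The digit computation (6.43a) can be exercised as below (Python, n = 4); the sketch also checks that d3 is indeed the highest mixed radix digit ⌊X/(m1m2)⌋:

```python
# Sketch of (6.43a) for {2^(n+1)-1, 2^n-1, 2^n}: the highest mixed radix
# digit d3, whose MSB is the sign. Checked exhaustively for n = 4.
n = 4
m1, m2, m3 = (1 << (n + 1)) - 1, (1 << n) - 1, 1 << n
M = m1 * m2 * m3                                    # 31*15*16 = 7440

def d3(x1, x2, x3):
    # Python's // already rounds towards -infinity, as the floor requires
    return (-2 * x1 + x2 + x3 + (x2 - x1) // m2) % m3

for X in range(M):
    digit = d3(X % m1, X % m2, X % m3)
    assert digit == X // (m1 * m2)                  # highest MRC digit
    assert (digit >> (n - 1)) == (1 if X >= M // 2 else 0)
```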
Bakalis and Vergos [41] have described shifter circuits for the moduli set {2^n − 1, 2^n, 2^n + 1}. While the shifting operation (multiplication by 2^t mod m_i) for the moduli 2^n − 1 and 2^n is straightforward by left circular shift and left shift, respectively, the shifter for the modulus (2^n + 1) can be realized for the diminished-1 representation. Denoting A as the residue in this modulus channel, the rotated word is given as

$$R^*_{2^n+1} = a_{n-t-1}\,a_{n-t-2}\ldots a_0\,\bar{a}_{n-1}\ldots\bar{a}_{n-t} \qquad (6.44)$$

where $A^*_{2^n+1} = a_{n-1}a_{n-2}\ldots a_0$ is the n-bit diminished-1 representation. As an example, for t = 3 in the mod 17 channel, the result of (2^3) × 11 mod 17 can be obtained as the diminished-1 number 0010₂ (by circular left shift of 1010₂ by 3 bits and complementing the 3 LSBs). By cascading r blocks, generic shifters for an arbitrary binary control word of length r bits can be realized using multiplexers.
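The rotate-and-complement rule (6.44) can be sketched as follows (Python; the function name is illustrative), reproducing the mod 17 example and checking every residue and shift amount for n = 4:

```python
# Sketch of the diminished-1 rotate (6.44): multiplying by 2^t modulo
# (2^n + 1) is a t-bit circular left shift with the wrapped bits inverted.
def dim1_mul_pow2(a_star, t, n):
    """a_star = A - 1 (n bits, A != 0); returns (2^t * A mod 2^n+1) - 1."""
    bits = [(a_star >> i) & 1 for i in range(n)]        # LSB first
    out = [bits[(i - t) % n] ^ (1 if i < t else 0)      # invert wrapped bits
           for i in range(n)]
    return sum(b << i for i, b in enumerate(out))

n = 4
assert dim1_mul_pow2(10, 3, n) == 2        # (8*11) mod 17 = 3 -> 0010 as in the text
for A in range(1, 17):
    for t in range(n):
        assert dim1_mul_pow2(A - 1, t, n) == ((A << t) % 17) - 1
```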

References

1. N.S. Szabo, R.I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology (McGraw-Hill, New York, 1967)
2. A.P. Shenoy, R. Kumaresan, Fast base extension using a redundant modulus in RNS. IEEE
Trans. Comput. 38, 293–297 (1989)
3. A.P. Shenoy, R. Kumaresan, A fast and accurate scaling technique for high-speed signal
processing. IEEE Trans. Acoust. Speech Signal Process. 37, 929–937 (1989)
4. G.A. Jullien, Residue number scaling and other operations using ROM arrays. IEEE Trans.
Comput. 27(4), 325–337 (1978)
5. A. Garcia, A. Lloris, A look up scheme for scaling in the RNS. IEEE Trans. Comput. 48,
748–751 (1999)
6. F. Barsi, M.C. Pinotti, Fast base extension and precise scaling in RNS for look-up table
implementation. IEEE Trans. Signal Process. 43, 2427–2430 (1995)
7. M. Griffin, M. Sousa, F. Taylor, Efficient scaling in the residue number system, in Proceedings of IEEE ASSP, pp. 1075–1078 (May 1989)
8. U. Meyer-Base, T. Stouraitis, New power-of-2 RNS scaling scheme for cell-based IC design.
IEEE Trans. Very Large Scale Integr. VLSI Syst. 11, 280–283 (2003)
9. G.C. Cardarilli, A. Del Re, A. Nannarelli, M. Re, Programmable power-of-two RNS scaler and
its application to a QRNS polyphase filter, in Proceedings of 2005. IEEE International
Symposium on Circuits and Systems, vol. 2, pp 1102–1105 (May 2005)
10. T.V. Vu, Efficient implementations of the Chinese remainder theorem for sign detection and
residue decoding. IEEE Trans. Comput. 34, 646–651 (1985)
11. Y. Kong, B. Phillips, Fast scaling in the Residue Number System. IEEE Trans. Very Large Scale Integr. VLSI Syst. 17, 443–447 (2009)
12. C.H. Chang, J.Y.S. Low, Simple, fast and exact RNS scaler for the three moduli set {2n-1,
2n, 2n+1}. IEEE Trans. Circuits Syst. I Reg. Pap. 58, 2686–2697 (2011)
13. S. Andraros, H. Ahmad, A new efficient memory-less residue to binary converter. IEEE Trans.
Circuits Syst. 35, 1441–1444 (1988)
14. S.J. Piestrak, A high-speed realization of residue to binary system conversion. IEEE Trans.
Circuits Syst. II 42, 661–663 (1995)
15. A. Dhurkadas, Comments on “A High-speed realisation of a residue to binary Number system
converter”. IEEE Trans. Circuits Syst. II 45, 446–447 (1998)
16. T.F. Tay, C.H. Chang, J.Y.S. Low, Efficient VLSI implementation of 2n scaling of signed integers in RNS {2n−1, 2n, 2n+1}. IEEE Trans. Very Large Scale Integr. VLSI Syst. 21, 1936–1940 (2012)
17. R. Zimmermann, Efficient VLSI implementation of modulo (2n ± 1) addition and multiplication, in Proceedings of IEEE Symposium on Computer Arithmetic, pp. 158–167 (1999)
18. Z.D. Ulman, M. Czyzak, Highly parallel, fast scaling of numbers in nonredundant residue
arithmetic. IEEE Trans. Signal Process. 46, 487–496 (1998)
19. M. Lu, J.S. Chiang, A novel division algorithm for the Residue Number System. IEEE Trans.
Comput. 41(8), 1026–1032 (1992)
20. O. Aichholzer, H. Hassler, Fast method for modulus reduction and scaling in residue number
system, in Proceedings of EPP, Vienna, Austria, pp. 41–53 (1993)
21. C.Y. Hung, B. Parhami, Fast RNS division algorithms for fixed divisors with application to
RSA encryption. Inf. Process. Lett. 51, 163–169 (1994)
22. C.Y. Hung, B. Parhami, An approximate sign detection method for residue numbers and its
application to RNS division. Comput. Math. Appl. 27, 23–35 (1994)
23. A.A. Hiasat, H.S. Abdel-Aty-Zohdy, A high-speed division algorithm for residue number
system, in Proceedings of IEEE ISCAS, pp. 1996–1999 (1995)
24. M.A. Shang, H.U. JianHao, Y.E. YanLong, Z. Lin, L. Xiang, A 2n scaling technique for signed
RNS integers and its VLSI implementation. Sci. China Inf. Sci. 53, 203–212 (2010)
25. N. Burgess, Scaled and unscaled residue number systems to binary conversion techniques
using the core function, in Proceedings of 13th IEEE Symposium on Computer Arithmetic,
pp. 250–257 (1997)
26. N. Burgess, Scaling a RNS number using the core function, in Proceedings of 16th IEEE
Symposium on Computer Arithmetic, pp. 262–269 (2003)
27. B. Vinnakota, V.V.B. Rao, Fast conversion techniques for Binary to RNS. IEEE Trans.
Circuits Syst. I 41, 927–929 (1994)
28. P.V. Ananda Mohan, Evaluation of fast conversion techniques for Binary-Residue Number
Systems. IEEE Trans. Circuits Syst. I 45, 1107–1109 (1998)
29. S. Bi, W.J. Gross, The Mixed-Radix Chinese Remainder theorem and its applications to
residue comparison. IEEE Trans. Comput. 57, 1624–1632 (2008)
30. G. Dimauro, S. Impedovo, G. Pirlo, A new technique for fast number comparison in the residue
number system. IEEE Trans. Comput. 42, 608–612 (1993)
31. S.T. Elvazi, M. Hosseinzadeh, O. Mirmotahari, Fully parallel comparator for the moduli set
{2n,2n-1,2n+1}. IEICE Electron. Express 8, 897–901 (2011)
32. A. Skavantzos, Y. Wang, New efficient RNS-to-weighted decoders for conjugate pair moduli
residue number systems, in Proceedings of 33rd Asilomar Conference on Signals, Systems and
Computers, vol. 2, pp. 1345–1350 (1999)
33. Y. Wang, New Chinese remainder theorems, in Proceedings of 32nd Asilomar Conference on
Signals, Systems and Computers, pp. 165–171 (1998)
34. L. Sousa, Efficient method for comparison in RNS based on two pairs of conjugate moduli, in
Proceedings of 18th IEEE Symposium on Computer Arithmetic, pp. 240–250 (2007)
35. G. Pirlo, D. Impedovo, A new class of monotone functions of the Residue Number System. Int.
J. Math. Models Methods Appl. Sci. 7, 802–809 (2013)
36. Z.D. Ulman, Sign detection and implicit-explicit conversion of numbers in residue arithmetic. IEEE Trans. Comput. C-32, 590–594 (1983)
37. T. Tomczak, Fast sign detection for RNS (2n-1, 2n, 2n+1). IEEE Trans. Circuits Syst. I Reg.
Pap. 55, 1502–1511 (2008)
38. Y. Wang, Residue to binary converters based on New Chinese Remainder theorems. IEEE
Trans. Circuits Syst. II 47, 197–205 (2000)
39. L. Sousa, P. Martins, Efficient sign detection engines for integers represented in RNS extended
3-moduli set {2n-1, 2n+1, 2n+k}. Electron. Lett. 50, 1138–1139 (2014)
40. M. Xu, Z. Bian, R. Yao, Fast Sign Detection algorithm for the RNS moduli set {2n+1-1, 2n-1,
2n}. IEEE Trans. VLSI Syst. 23, 379–383 (2014)
41. D. Bakalis, H.T. Vergos, Shifter circuits for {2n+1, 2n, 2n-1) RNS. Electron. Lett. 45, 27–29
(2009)

Further Reading

G. Alia, E. Martinelli, Sign detection in residue Arithmetic units. J. Syst. Archit. 45, 251–258
(1998)
D.K. Banerji, J.A. Brzozowski, Sign detection in residue number systems. IEEE Trans. Comput. C-18, 313–320 (1969)
A. Garcia, A. Lloris, RNS scaling based on pipelined multipliers for prime moduli, in IEEE
Workshop on Signal Processing Systems (SIPS 98), Piscataway, NJ, pp. 459–468 (1998)
E. Gholami, R. Farshidi, M. Hosseinzadeh, H. Navi, High speed residue number comparison for
the moduli set {2n, 2n-1, 2n+1}. J. Commun. Comput 6, 40–46 (2009)
Chapter 7
Error Detection, Correction and Fault Tolerance in RNS-Based Designs
In this chapter, we consider the topic of error detection and error correction in residue number systems using redundant (additional) moduli. RNS has the unique advantage of modularity, so that once faulty units are identified, either in the original moduli hardware or in the redundant moduli hardware, these can be isolated. Triple modular redundancy, known in conventional binary arithmetic hardware, can also be used in RNS and will be briefly considered.

7.1 Error Detection and Correction Using Redundant Moduli
Error detection and correction in RNS has been considered by several authors [1–20]. Single error detection needs one extra modulus and single error correction needs two extra moduli. Consider the four-moduli set {3, 5, 7, 11} where 11 is the redundant modulus. Consider that the residues corresponding to the original number 52 = {1, 2, 3, 8} have been modified to {1, 2, 4, 8} due to an error in the residue corresponding to the modulus 7.
Barsi and Maestrini [4, 5] proposed the idea of modulus projections to correct single residue digit errors. Since the given number needs to be less than 105, any projection larger than 105 indicates that an error has occurred. A projection is obtained by ignoring one or more moduli, thus considering a smaller RNS:
{1, 2, 4} = 67 for the moduli set {3, 5, 7}
{1, 2, 8} = 52 for the moduli set {3, 5, 11}
{1, 4, 8} = 151 for the moduli set {3, 7, 11}
{2, 4, 8} = 382 for the moduli set {5, 7, 11}

© Springer International Publishing Switzerland 2016
P.V. Ananda Mohan, Residue Number Systems, DOI 10.1007/978-3-319-41385-3_7
Since the last two projections are larger than 105, it is evident that error has
occurred in the residue corresponding to modulus 7 or 11. If we use an additional
modulus, the exact one among these can be found.
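The projection test can be sketched as follows (Python; the crt helper is illustrative):

```python
# Sketch of Barsi-Maestrini projections for {3, 5, 7, 11} (11 redundant):
# any projection exceeding the legitimate range 105 flags a possible error.
from math import prod

def crt(residues, moduli):
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

moduli, res = [3, 5, 7, 11], [1, 2, 4, 8]      # 52 with a faulty mod-7 residue
suspects = []
for drop in range(4):
    ms = [m for i, m in enumerate(moduli) if i != drop]
    rs = [r for i, r in enumerate(res) if i != drop]
    x = crt(rs, ms)
    print(ms, x)
    if x >= 105:
        suspects.append(ms)
# the projections 382 ({5,7,11}) and 151 ({3,7,11}) exceed 105: the error
# lies in the residue of modulus 7 or 11
```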
Szabo and Tanaka [1] have suggested an exhaustive testing procedure to detect and correct the error, which needs two extra moduli. It needs $\sum_{i=1}^{K}(m_i - 1)$ tests, where K is the number of moduli. This method is based on the observation that an error of “1” in any residue corresponding to modulus m_i causes a multiple of $\left|\frac{1}{M_i}\right|_{m_i}M_i$ to be added, where $M_i = \frac{m_{r1}m_{r2}M}{m_i}$ and m_{r1} and m_{r2} are the redundant moduli. Hence, we need to find which multiple yields the correct number within the dynamic range. This needs to be carried out for all the moduli.
As an example, consider the moduli set {3, 5, 7, 11, 13}. Let us assume that the residues of 52 = {1, 2, 3, 8, 0} got changed to {1, 2, 6, 8, 0}. We can first find the number corresponding to {1, 2, 6, 8, 0} = 2197, which is outside the dynamic range of 105. Hence, by adding $M_3\left|\frac{1}{M_3}\right|_{m_3} = 5\times(3\times5\times11\times13) = 10{,}725 = \{0, 0, 1, 0, 0\}$, corresponding to an error of 1 in modulus 7, a number of times (in this case, six times), we get different decoded numbers which modulo 15,015 (the full dynamic range of the system including redundant moduli) are 2197, 12922, 8632, 4342, 52, 10777, 6487. Evidently, 52 is the correct answer.
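The exhaustive search of this example can be sketched as follows (Python; helper names are illustrative):

```python
# Sketch of the Szabo-Tanaka exhaustive test on {3, 5, 7, 11, 13}
# (11, 13 redundant; legitimate range 105, total range 15015).
from math import prod

moduli = [3, 5, 7, 11, 13]
Mt, M = prod(moduli), 3 * 5 * 7
res = [1, 2, 6, 8, 0]            # residues of 52 with +3 error in mod-7 digit

def crt(r):
    return sum(ri * (Mt // mi) * pow(Mt // mi, -1, mi)
               for ri, mi in zip(r, moduli)) % Mt

X = crt(res)                     # 2197, outside [0, 105) -> error detected
step = (Mt // 7) * pow(Mt // 7, -1, 7) % Mt     # 10725, unit error in mod 7
candidates = [(X + j * step) % Mt for j in range(7)]
print(candidates)                # [2197, 12922, 8632, 4342, 52, 10777, 6487]
correct = [c for c in candidates if c < M]
assert correct == [52]
```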
Mandelbaum [3] has proved that two additional residues can correct single-digit residue errors. He has observed that the high-order bits of the decoded word are non-zero in the presence of errors. Let the product of all the moduli be denoted as M_r. An error of B in the residue of modulus m_k adds a term M_rB/m_k to the decoded word F. Denoting the word formed by the MSBs of F as Q1 and defining Q2 = Q1 + 1, look-up tables can be used to obtain the values of B and m_k, where m_k is the modulus for which the error is to be tested and B is to be determined. The criterion for selecting B and m_k is that the fraction corresponding to Q1 (or Q2), multiplied by m_k, agrees with an integer B to the maximum number of decimal places. The last step is to obtain X from F as X = F − M_rB/m_k. Mandelbaum's procedure is based on the binary representation of numbers and is hence not convenient in RNS.
Consider the moduli set {7, 9, 11, 13, 16, 17} with a dynamic range of 2,450,448, where 16 and 17 are the redundant moduli. The number 52 in the RNS is (3, 7, 8, 0, 4, 1), which is modified due to an error in the residue of modulus 11 as (3, 7, 0, 0, 4, 1). The decoding gives a 21-bit word corresponding to 668,356. The 8 MSBs reflect the error since the original number is less than the DR of 13 bits. The MSB word Q1 is 69 and we have Q2 = Q1 + 1 = 70. Expressing these as fractions d1 = 69/256 and d2 = 70/256, we need to find B and m_i such that d1·m_i and d2·m_i are close to an integer. It can be easily checked that for m_i = 11, we have d1·m_i and d2·m_i as 2.959 and 3.007, showing that B = 3 and the error is in the residue corresponding to modulus 11. The original decoded word can be obtained by subtracting (3 × M_r)/11 from 668,356 to obtain 52.
Jenkins et al. [7–12] have suggested an error correction technique which is also based on projections. In this technique, using two redundant moduli in addition to the original n moduli, the decoded words considering only (n + 1) moduli at a time are computed using MRC. Only one of these will have the redundant (most significant) MRC digit as zero. As an illustration, consider the moduli set {3, 5, 7, 11, 13} where 11 and 13 are the redundant moduli. Consider 52 changed to {1, 2, 4, 8, 0} due to an error in the residue corresponding to modulus 7. The various moduli sets and corresponding projections are as follows:
{3, 5, 7, 11}: {1, 2, 4, 8} = 382; {3, 5, 7, 13}: {1, 2, 4, 0} = 1222;
{3, 5, 11, 13}: {1, 2, 8, 0} = 52; {3, 7, 11, 13}: {1, 4, 8, 0} = 1768;
{5, 7, 11, 13}: {2, 4, 8, 0} = 767.
Evidently, 52 is the correct answer, and in MRC form it is 0 × 165 + 3 × 15 + 2 × 3 + 1 = 52 with the most significant MRC digit being zero.
Jenkins [8] observed that the MRC structure can be used to obtain the projections by shorting the row and column corresponding to the residue under consideration (see Figure 7.1). This may take advantage of the fact that the already computed MRC digits can be used without repeating the full conversion for obtaining the other projections. Jenkins has also suggested a pipelined implementation so that at any given time, L + r − 1 projections progress through the error checker simultaneously, where L is the number of non-redundant moduli and r the number of redundant moduli. First X5 is produced at the output and next X4, X3, X2 and X1 are produced.
Note that first a full MRC needs to be performed in (L + r − 1) steps to know whether any error is present by looking at the MRC digits. If the leading MRC digits are non-zero, then five steps to obtain the projections need to be carried out. It is evident that the already computed MRC digits can be used to compute the next, to avoid re-computation. Note that the shaded latches denote invalid numerical values, due to the reason that the complete set of residues rather than the reduced set is needed to compute the projections. Note that a monitoring circuit is needed to monitor the mixed radix digits to detect the location of the error.
Jenkins and Altman [9] also point out that the effect of an error in the MRC hardware using redundant moduli is the same as an error in the input residue in that column. The error checker thus also checks errors that occur in the hardware of the MRC converter.
In another technique known as expanded projection, Jenkins and Altman [9] suggest multiplying the given residues by m_i to generate a projection not involving m_i. In other words, we are computing m_iX. By observing the most significant MRC digit, one can find whether an error has occurred. However, MRC then needs to be carried out afresh on the correct residues. As an illustration, consider the original residue set (1, 2, 3, 8, 0) corresponding to the moduli set {3, 5, 7, 11, 13}, which is modified as {1, 2, 4, 8, 0} where the error has occurred in the residue mod 7. Multiplying by 3, 5, 7, 11 and 13, we get the various decoded numbers as follows:
[Figure 7.1: a pipelined array of ROM and latch stages operating on the residues x5, x4, x3, x2, x1, producing the intermediate projections X2, X3, X4, X5 and the mixed radix digits a5, a4, a3, a2, a1.]

Figure 7.1 A pipelined Mixed Radix converter for the sequential generation of projections (Adapted from [8] ©IEEE1983)
X3 = 2301 = 1 × 1155 + 10 × 105 + 6 × 15 + 2 × 3 + 0;
X5 = 8840 = 7 × 1155 + 7 × 105 + 1 × 15 + 1 × 3 + 2;
X7 = 364 = 0 × 1155 + 3 × 105 + 3 × 15 + 1 × 3 + 1;
X11 = 13442 = 11 × 1155 + 7 × 105 + 0 × 15 + 0 × 3 + 2;
X13 = 4966 = 4 × 1155 + 3 × 105 + 2 × 15 + 0 × 3 + 1.

Thus, the correct result is 364 as it is the only one within the scaled dynamic range 735 (= 7 × 105). Etzel and Jenkins [10] also observe that by arranging the moduli in ascending order, overflow can be detected. After performing MRC, the illegitimate range is identified by the higher MRC digit being non-zero.
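The expanded-projection computation above can be sketched as follows (Python; the scaled-range test m_i × 105 follows the discussion of the example, and helper names are illustrative):

```python
# Sketch of the expanded-projection check [9]: decode m_i * X for each i
# and keep the value falling within the scaled legitimate range m_i * 105.
from math import prod

moduli = [3, 5, 7, 11, 13]
Mt = prod(moduli)                     # 15015
res = [1, 2, 4, 8, 0]                 # 52 with an error in the mod-7 residue

def crt(r):
    return sum(ri * (Mt // mi) * pow(Mt // mi, -1, mi)
               for ri, mi in zip(r, moduli)) % Mt

found = []
for mi in moduli:
    Xi = crt([ri * mi % mj for ri, mj in zip(res, moduli)])
    if Xi < mi * 105:                 # scaled dynamic range
        found.append((mi, Xi // mi))
print(found)    # [(7, 52)] -> error in the mod-7 residue, X = 52
```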
Jenkins has suggested a CRT-based error checker for generating projections [12]. The computation of CRT is carried out by using a technique known as biased addition so as to simplify the reduction mod M of the CRT summation. The operation is, however, sequential. A constant 2^l − M is added to the sum, where 2^l is the smallest power of 2 greater than M. Overflow is detected if there is a carry bit. Then mod 2^l reduction is performed by discarding the carry bit. Jenkins suggests parallel computation of projections from the result of CRT. This amounts to subtraction of the CRT term corresponding to the omitted modulus. In this method, n steps are required to obtain the n projections from a given X corresponding to residues of all the n moduli for an n-moduli RNS.
Su and Lo [14] have suggested a technique which combines scaling and error correction. This uses two redundant moduli. In a redundant residue number system (RRNS), the actual positive numbers are mapped between 0 and (M − 1)/2 and the negative numbers between M_t − (M − 1)/2 and M_t − 1, where M_t is the product of all the moduli. In the case of even M, positive numbers are mapped between 0 and (M/2) − 1 and the negative numbers from M_t − (M/2) to M_t − 1. Since X lies between −M/2 and M/2 for M even, and −(M − 1)/2 and (M + 1)/2 for M odd, by introducing a polarity shift defined as X′ = X + (M/2) for M even and X′ = X + (M − 1)/2 for M odd, we have X′ between 0 and (M − 1). Thus, for scaling and error correction, they suggest the polarity shift to be performed first as suggested by Etzel and Jenkins [10]. Next, the integer part of X′/M is computed. The error corresponding to the obtained MRC digits is obtained from a LUT. The error is next corrected as X = (X′ − E_j) mod M_t where $E_j = M_j\left|\frac{e_j}{M_j}\right|_{m_j}$.
An example is next considered to illustrate the above technique. Consider the moduli set {2, 5, 7, 9, 11, 13} where 11 and 13 are the redundant moduli. The dynamic range is M = 630 and M_t = 90,090. Consider a given number X = −311 corresponding to (1, 4, 4, 4, 8, 1) and a single-digit error e2 = 4 corresponding to modulus 5, i.e. (1, 3, 4, 4, 8, 1) = 53,743. First a polarity shift is performed to obtain X′ = 53,743 + 315 = 54,058 = (0, 3, 4, 4, 4, 4). Decoding using MRC yields ⌊X′/M⌋ = 85. It can be found, using a LUT relating errors in residues to the higher two mixed radix digits, that e2 = 4. Hence, subtracting E2 = (0, e2, 0, 0, 0, 0) from X′, we obtain the correct X′. As an illustration, (0, 1, 0, 0, 0, 0) corresponds to 2 × (2 × 7 × 9 × 11 × 13) which when divided by M = 630 yields 286/5 = 57.2. Hence the LUT will have the two values 57 and 58. Note that since the total number of error conditions is $\sum_{i=1}^{n+2}(m_i - 1)$, where n is the number of non-redundant moduli and two redundant moduli are used, the product of the redundant moduli shall be greater than this value. Thus, 41 states exist for the chosen moduli set and none of these repeat. The Su and Lo technique requires an MRC. It is applicable for small moduli, to contain the size of the look-up tables.
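A rough sketch of the quotient-to-error lookup for this example is given below (Python; the LUT construction from the two quotient values ⌊E_j/M⌋ and ⌊E_j/M⌋ + 1 follows the discussion above, and the helper names are illustrative):

```python
# Sketch of the Su-Lo scaling/error-correction example for {2, 5, 7, 9, 11, 13}
# (11, 13 redundant): the quotient floor(X'/M) identifies the single-digit error.
from math import prod

moduli = [2, 5, 7, 9, 11, 13]
Mt, M = prod(moduli), 2 * 5 * 7 * 9            # 90090 and 630

def E(j, e):                                   # E_j = M_j * |e/M_j| mod m_j
    Mj = Mt // moduli[j]
    return Mj * (e * pow(Mj, -1, moduli[j]) % moduli[j])

X_err = 53743                                  # (1, 3, 4, 4, 8, 1)
q = (X_err + 315) // M                         # polarity shift, then quotient: 85
hits = [(j, e) for j, mj in enumerate(moduli)
        for e in range(1, mj) if q in (E(j, e) // M, E(j, e) // M + 1)]
assert hits == [(1, 4)]                        # error of 4 in the mod-5 residue
j, e = hits[0]
assert (X_err + 315 - E(j, e)) % Mt - 315 == -311   # corrected X
```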
Ramachandran [13] has simplified the technique of Jenkins and Altman [9] regarding the number of projections needed for error detection and correction. Ramachandran suggests that the faulty modulus must be left out at least twice so that the correct value can be obtained. The number of combinations to be tried is s ≥ (2n/r) + 2, where n is the number of actual moduli and r is the number of redundant moduli.
Consider, for example, a three-moduli RNS (n = 3) {3, 5, 7} with two additional redundant moduli (r = 2), 11 and 13. It is sufficient to test the following combinations: {3, 11, 13}, {5, 11, 13}, {5, 7, 13}, {3, 5, 7}, {5, 7, 11}. Note that we have omitted {5, 7}, {3, 7}, {3, 11}, {11, 13}, and {5, 13}. This method needs MRC for a three-moduli RNS whereas the Jenkins et al. technique needs MRC for a four-moduli RNS.
Orton et al. [15] have suggested error detection and correction using only one redundant modulus which is a power of two and larger than all other moduli. In this technique, denoted as approximate decoding, we compute

$$\left|\sum_{i=1}^{r}\frac{S}{m_i}\left|\frac{r_i}{M_i}\right|_{m_i}\right|_{2^y}$$

which is obtained by multiplying the original result of CRT by S/M. Note that S is the scaling factor 2^y. Orton et al. have suggested that S can be chosen suitably so as to detect errors by looking at the MSBs of the result. If the MSB word is all zeros or all ones, there is no error. This technique, however, does not apply to the full dynamic range, due to the rounding of the (2^y/m_i) values.
As an illustration, consider the moduli set {17, 19, 23, 27, 32} and given
 residues

ri
(12, 0, 22, 6, 18) where 32 is the redundant residue. We can first obtain as
Mi mi
11, 0, 7, 3, 30. Choosing S as 29, S/mi values can be either rounded or truncated.
Considering that rounding
" is used, we#have various S/mi values as 30, 27, 22, 19, 16.
Xr   
S ri
We thus obtain ¼ 509 ¼ ð11111101Þ2 . Considering that
mi Mi mi k
i¼1 2
an error "has occurred to make # the input residues as (13, 0, 22, 6, 18), we can
X r   
S ri
compute ¼ 239 ¼ ð011101111Þ2 . The six most significant
mi Mi mi k
i¼1 2
bits are not one or zero, thus indicating an error. Note that, we need to next use
projections for finding the error and correcting it. The choice of S has to be proper to
yield the correct answer whether  error has occurred. This method needs small
S ri
look-up tables for obtaining corresponding to given ri.
mi Mi mi
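The checksum of this example can be reproduced directly with exact integer arithmetic; the names `weights` and `checksum` below are illustrative:

```python
moduli = [17, 19, 23, 27, 32]      # 32 is the redundant modulus
M = 17 * 19 * 23 * 27 * 32
S = 2 ** 9                         # scaling factor 2^y with y = 9

weights = [round(S / m) for m in moduli]   # rounded S/m_i values

def checksum(residues):
    # |sum (S/m_i) * |r_i (1/M_i)|_{m_i}|_{2^k} with k = 9
    total = sum(w * ((r * pow(M // m, -1, m)) % m)
                for w, r, m in zip(weights, residues, moduli))
    return total % S

assert weights == [30, 27, 22, 19, 16]
assert checksum([12, 0, 22, 6, 18]) == 509   # (111111101)_2: no error
assert checksum([13, 0, 22, 6, 18]) == 239   # (011101111)_2: error detected
```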
Orton et al. [15] have suggested another technique for error detection using two
redundant moduli m_k1 and m_k2. In this method, each redundant modulus is considered
separately with the actual RNS, and the value of A(X) (the multiple of M to be
subtracted from the CRT summation) is determined using the redundant modulus m_ki
and the Shenoy and Kumaresan technique [21]:

|A_1(X)|_{m_k1} = | ( Σ_{j=1}^{N} M_j |x_j (1/M_j)|_{m_j} − x ) (1/M) |_{m_k1}     (7.1)

Similarly, for the second redundant modulus, A_2(X) is determined. If these two are
equal, there is no error. This technique is called an overflow consistency check.
An example will be illustrative. Consider the moduli set {5, 6, 7, 11} where
5 and 6 are the actual moduli and 7, 11 are the redundant moduli. For a legitimate
number, 17 = (2, 5, 3, 6), we have A_1(X) considering the moduli set {5, 6, 7} as
0 and A_2(X) considering the moduli set {5, 6, 11} as zero. On the other hand,
consider that an error has occurred in the residue corresponding to the modulus 5 to
change the residues to (3, 5, 3, 6). It can be verified that A_1(X) = 3 and A_2(X) = 9,
showing the inconsistency. The authors suggest adding m_ki(N − 1)/2 to A_i(X), where
N is the number of moduli (not considering the redundant moduli), to take care of the
possible negative values of A_i(X).
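A sketch of the consistency check for this example follows; the helper `consistency_A` is a name introduced here, and it evaluates Eq. (7.1) by exact integer arithmetic (x_red denotes the residue of X for the redundant modulus):

```python
def consistency_A(residues, moduli, x_red, m_red):
    # A(X) recovered through the redundant modulus m_red
    # (Shenoy-Kumaresan style base extension).
    M = 1
    for m in moduli:
        M *= m
    s = sum((M // m) * ((r * pow(M // m, -1, m)) % m)
            for r, m in zip(residues, moduli))
    return ((s - x_red) * pow(M, -1, m_red)) % m_red

# Legitimate 17 = (2, 5 | 3, 6): both redundant channels agree on A(X) = 0.
assert consistency_A([2, 5], [5, 6], 3, 7) == 0
assert consistency_A([2, 5], [5, 6], 6, 11) == 0
# Corrupted (3, 5 | 3, 6): the two estimates disagree, flagging an error.
assert consistency_A([3, 5], [5, 6], 3, 7) == 3
assert consistency_A([3, 5], [5, 6], 6, 11) == 9
```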
The Watson and Hastings [2] error correction procedure uses base extension to the
redundant moduli. The difference between the original and reconstructed redundant
residues, Δ1 and Δ2, is used to correct the errors. If Δ1 = 0 and Δ2 = 0, then no error has
occurred. If one of them is non-zero, the old residue corresponding to this redundant
modulus is replaced by the new one. If both are non-zero, then they are used to address
a correction table of Σ_{i=1}^{n} (m_i − 1) entries.
Yau and Liu [6] modified this procedure and suggest additional computations
instead of look-up tables. Considering an n moduli set with r additional redundant
moduli, they compute the sets Δ_{m,n+r}, …, Δ_{m,n+1} and Δ_{m,n}, Δ_{m,n−1},
…, Δ_{m,2}, Δ_{m,1}. Here the residues are determined by base extension assuming
that the RNS contains all moduli except those within the set. If the first set has all zero
entries, there is no error. If exactly one of these is non-zero, the corresponding
redundant residue is in error. If more than one element is non-zero, an information
residue is in error, and an iterative procedure checks the remaining sets to identify the
incorrect residue in a similar manner.
Barsi and Maestrini [5] and Mandelbaum [3] suggest the concept of product
codes, where each residue is multiplied by a generator A which is larger than the
moduli and mutually prime to all moduli. Thus, of the available dynamic range MA,
only M values represent the original RNS.
Given a positive integer A, called the generator of the code, an integer X in the
range [0, M] is a legitimate number in the product code of generator A if X = 0
mod A and A is mutually prime to all m_i. Any X in the range [0, M] such that
X ≠ 0 mod A is said to be an illegitimate number. The advantage of this technique is
that when the addition of X_1 and X_2 is performed, an overflow, if it has occurred, can
be found by checking whether |X_s|_A = 0, where X_s = (X_1 + X_2) mod M. Then, we
need to check whether the number is legitimate. If |X_s|_A = |−M|_A, an additive
overflow has been detected. Barsi and Maestrini [5] suggest the use of the AX code to
allow single digit error detection and correction. They also point out that the use of the AX
code can detect overflow in addition. A single error can be detected since an error in
one residue will yield a decoded word

H = | AX + M_i |x_i (1/M_i)|_{m_i} |_M     (7.2)

which is not divisible by A. As an illustration, for the RNS {2, 3, 5, 7, 11, 13, 17, 19}
with dynamic range 9,699,690, if A = 23 is the generator, the maximum value that
can be represented is 9,699,690/23 = 421,725.
Barsi and Maestrini [5] have suggested a method which can detect an error or an
additive overflow. Given a number X to be tested, |X|_A can be found by using base
extension. If it is zero, the number is legitimate. If |X|_A = |−M|_A, an additive
overflow is detected. As an illustration, consider m_1 = 5, m_2 = 7, m_3 = 9, m_4 = 11
and A = 23. Note that the condition A > 2m_i − 1 − |M|_A for each m_i, 1 ≤ i ≤ n,
needs to be satisfied. Let X_1 = (4, 1, 3, 2) = 2829 and X_2 = (3, 3, 3, 8) = 2208. The
sum is X_s = |X_1 + X_2|_M = (2, 4, 6, 10) = 1572 and overflow has occurred. Since
|X_s|_A = 8 = |−M|_A, the overflow is detected. On the other hand, suppose the result
has an error giving X̂_s = (2, 0, 6, 10) = 2562. Since |X̂_s|_A = 9, the error is detected.
Goh and Siddiqui [17] have described a technique for multiple error correction
using redundant moduli. Note that we can correct up to t errors, where t ≤ ⌊r/2⌋
and r is the number of redundant moduli. This technique is based on CRT
expansion as a first step to obtain the result. If the result is within the dynamic
range allowed by the non-redundant moduli, the answer is correctly decoded.
Otherwise, it indicates wrong decoding. For double error correction as an illustration,
for a total number of moduli n, C(n, 2) possibilities exist. For each one of the
possibilities, it can be seen that, because of a double error in the residues corresponding
to moduli m_i and m_j, a multiple of the product of the other moduli, i.e. M_ij = M/(m_i m_j),
is added in the CRT expansion. Thus, by taking mod M_ij of the CRT expansion for all
cases excluding two moduli at a time and picking among the results the smallest
within the dynamic range due to the original moduli, we obtain the correct result.
An example will be illustrative. Consider the six moduli set {11, 13, 17,
19, 23, 29} where 11, 13 are the non-redundant and 17, 19, 23, 29 are the redundant
moduli. The legitimate dynamic range is 0–142. Consider X = 73 which in RNS is
(7, 8, 5, 16, 4, 15). Let it be changed to (7, 8, 11, 16, 4, 2) due to a double error. It
can be seen that CRT gives the number 25,121,455 which obviously is wrong.
Taking mod M_ij for all the 15 cases (excluding two moduli m_i, m_j each time, whose
i and j are (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5),
(3, 6), (4, 5), (4, 6), (5, 6), since C(6, 2) = 15), we have, respectively, 130299, 79607,
62265, 36629, 11435, 28915, 50926, 83464, 33722, 36252, 65281, 73, 23811,
16518, 40828. Evidently, 73 is the decoded word since the rest of the values are outside
the valid dynamic range.
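The 15-case search can be sketched as below (illustrative code exercising the numbers of this example):

```python
from itertools import combinations

moduli = [11, 13, 17, 19, 23, 29]
M = 11 * 13 * 17 * 19 * 23 * 29
M_LEGIT = 11 * 13                 # 143

def crt(res):
    return sum((M // m) * ((r * pow(M // m, -1, m)) % m)
               for r, m in zip(res, moduli)) % M

x = crt([7, 8, 11, 16, 4, 2])     # residues carrying a double error
assert x == 25121455              # the obviously wrong CRT decode

candidates = [x % (M // (moduli[i] * moduli[j]))
              for i, j in combinations(range(6), 2)]
in_range = [c for c in candidates if c < M_LEGIT]
assert in_range == [73]           # the unique value in the legitimate range
```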
The authors have extended the technique to correcting four errors in the
residues as well, even though it is known that with four redundant moduli only
double errors can be corrected. They observe that only three cases (error positions)
need to be tested: (1, 2, 3, 4), (1, 2, 5, 6) and (3, 4, 5, 6). Note that these
combinations are chosen such that, when they are taken t = 2 at a time (where t is
the number of correctable errors), most of the combinations among the 15 would
have been included. Consider the modified residues (1, 12, 0, 5, 6, 14) due to four
errors. The corresponding results mod M_ijkl are 1050, 51, 12. It can be seen that the
last two appear to be correct, whereas only one of them is. The correct one can be
chosen by comparing the given residue set with those corresponding to both these
cases; the one with the most agreements is chosen. As an illustration, 51 = (7, 12, 0, 5,
22, 14) and 12 = (1, 12, 12, 12, 12, 12), whereas the input residues are (1, 12, 0, 5,
6, 14). It can be seen that the disagreements in the two cases are, respectively, 2 and
4. Hence, 51 is the correct answer.
Haron and Hamdioui [18] have suggested the use of 6M-RRNS (six moduli
redundant residue number system) which uses six moduli for protecting hybrid
memories (e.g., non-CMOS types). Two moduli (information moduli) are used for the
actual representation of the memory word as residues, whereas four moduli are
used for correcting errors in these two information moduli. The moduli set used was
{2^p + 1, 2^p, 2^p1 − 1, 2^p2 − 1, 2^p3 − 1, 2^p4 + 1} where p = 8, 16, 32 for securing
memories of data width 16, 32 and 64 bits, respectively. As an illustration,
considering p = 8, the moduli set is {257, 256, 127, 63, 31, 17}. Even though the
redundant moduli are smaller than the information moduli, the dynamic range of the
redundant moduli, 4,216,527, is larger than the DR of the information moduli,
65,792. They also consider conventional RRNS (C-RRNS) which uses three information
moduli {2^p − 1, 2^p, 2^p + 1} and six redundant moduli. For a 16 bit dynamic
range, they suggest {63, 64, 65} as the information moduli and {67, 71, 73, 79, 83, 89}
as the redundant moduli. Note that the code word length will be 61 bits, whereas for
6M-RRNS the code word length for p = 8 is only 40 bits. Note, however, that in
6M-RRNS, since the word lengths of the redundant moduli are not larger than those
of the information moduli, a single read word may be decoded into more than one
output data word. This ambiguity can be resolved by using maximum likelihood
decoding similar to the Goh and Siddiqui technique [17]. The Hamming distance
between the read code word and each of the decoded ambiguous residue vectors is
found, and the one with the smallest distance is selected.
Consider the moduli set {257, 256, 127, 63, 31, 17} where the first two are the
information residues and the last four are the redundant residues. Consider X = 9216
which corresponds to (221, 0, 72, 18, 9, 2). Assume that it is corrupted to (0, 0, 72, 18,
9, 2). Calculating all the projections, we find that two possible results can exist,
corresponding to m_1 and m_2 discarded and to m_3 and m_6 discarded. These are
9216 and 257. We can easily observe that these correspond to the residues (221, 0, 72,
18, 9, 2) and (0, 1, 3, 5, 9, 2). Compared to the given residues, the choice 9216
has more agreements and hence it is the correct result.
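The maximum-likelihood choice between the two ambiguous decodings reduces to an agreement count, as this illustrative sketch shows:

```python
moduli = [257, 256, 127, 63, 31, 17]
received = (0, 0, 72, 18, 9, 2)          # corrupted encoding of X = 9216

def agreements(x):
    # Count how many received residues match the candidate decoding x.
    return sum((x % m) == r for m, r in zip(moduli, received))

assert agreements(9216) == 5             # only the first residue disagrees
assert agreements(257) == 3
assert max([9216, 257], key=agreements) == 9216
```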
Pontarelli et al. [20] have suggested error detection and correction for
RNS-based FIR filters using scaled values. This technique uses one additional
modulus m which is prime with respect to all other moduli in the RNS. The dynamic
range is divided into two parts: the legitimate range, whose elements are exactly
divisible by m, and the remaining illegitimate range. An error in a single modulus
is detected if, after the conversion to two's complement representation, the result
belongs to the illegitimate range. Since the members of the legitimate range are
exactly divisible by m, the scaled dynamic range is much less than the actual
range. As an illustration, considering the moduli set {3, 5, 7} and m = 11, the only
valid integers are 0, 11, 22, 33, 44, 55, 66, 77, 88, 99. In the case of errors in the
residues, e.g. (2, 0, 0), the CRT gives the decoded number as 35, whose residue mod
11 is 2. Thus, all the possible errors corresponding to the moduli {3, 5, 7} are (1,0,0),
(2,0,0), (0,1,0), (0,2,0), (0,3,0), (0,4,0), (0,0,1), (0,0,2), (0,0,3), (0,0,4), (0,0,5),
(0,0,6). The values mod 11 corresponding to these are 4, 2, 10, 9, 8, 7, 4, 8, 1,
5, 9, 2. It can be seen that for all possible errors the residue mod 11 is non-zero, thus
allowing detection of any single error.
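The twelve error signatures listed above can be enumerated mechanically (an illustrative sketch):

```python
moduli = [3, 5, 7]
M = 105
m_chk = 11                               # redundant checking modulus

def crt(res):
    return sum((M // m) * ((r * pow(M // m, -1, m)) % m)
               for r, m in zip(res, moduli)) % M

# One non-zero error digit per channel, all other residues zero.
signatures = [crt([e if k == i else 0 for k in range(3)]) % m_chk
              for i, mi in enumerate(moduli) for e in range(1, mi)]

assert signatures == [4, 2, 10, 9, 8, 7, 4, 8, 1, 5, 9, 2]
assert 0 not in signatures               # every single-channel error is detectable
```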
The authors suggest that for a FIR filter implementation the input sequence is
multiplied by a constant m and, after processing and conversion into binary form, a
division by this constant needs to be performed. The choice m = 2^i needs only left
shifting of the input sequence and hence simplifies the pre-multiplication by a
constant. The division by m is carried out easily by discarding the i LSBs.
Consider the moduli set {11, 13, 15} and the redundant scaling modulus 256. We
can see that M_1 = 195, M_2 = 165 and M_3 = 143, and |(1/M_1)|_{m1} = 7,
|(1/M_2)|_{m2} = 3, |(1/M_3)|_{m3} = 2. We also have |(1/M_1)|_256 = 235,
|(1/M_2)|_256 = 45, |(1/M_3)|_256 = 111, which we need later. The code words
(multiples of 256) are only nine, viz. 0, 256, 512, 768, 1024, 1280, 1536, 1792 and
2048, since the dynamic range is 2145. There can be a one digit error in any residue,
yielding 10 + 12 + 14 error cases which yield unique values mod 256 after decoding
(without aliasing). As an example, corresponding to the error (1, 0, 0), we have
195 × 7 = 1365 and 1365 mod 256 = 85.
In a FIR filter implementation, while the coefficients are in conventional unscaled
form, the sample values shall be scaled by m. As an example, consider the
computation 2 × 3 + 1 = 7, where 2 is a coefficient. This is evaluated as (2, 2, 2) ×
(9, 1, 3) + (3, 9, 1) = (10, 11, 7). Note that (9, 1, 3) corresponds to 256 × 3 in RNS
form and (3, 9, 1) corresponds to 256 × 1. Assuming that the residue 7 is changed to 8,
we compute using CRT the erroneous value as Y = 2078, which mod 256 = 30,
indicating that there is an error. We obtain the error corresponding to each modulus as
e_1 = |30 × |(1/M_1)|_256|_256 = |30 × 235|_256 = 138 and similarly
e_2 = |30 × 45|_256 = 70 and e_3 = |30 × 111|_256 = 2. Next, from e_i we can calculate
the error E_i as E_i = e_i × M_i. This needs to be subtracted from the decoded word Y.
Repeating the step for the other moduli channels also, we note that |Y − E_1|_M = 908,
|Y − E_2|_M = 1253, |Y − E_3|_M = 1792 and, since 1792 is a multiple of 256, it is the
correct result.
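The complete detect-and-correct cycle of this example can be sketched as follows (`detect_and_correct` is a name introduced here; a hardware realization would use small look-up tables instead of computing modular inverses):

```python
moduli = [11, 13, 15]
M = 11 * 13 * 15                  # 2145
S = 256                           # scaling modulus (a power of two)

def crt(res):
    return sum((M // m) * ((r * pow(M // m, -1, m)) % m)
               for r, m in zip(res, moduli)) % M

def detect_and_correct(residues):
    Y = crt(residues)
    syndrome = Y % S
    if syndrome == 0:
        return Y                  # legitimate: a multiple of 256
    for m in moduli:              # try to attribute the error to one channel
        Mi = M // m
        e = (syndrome * pow(Mi, -1, S)) % S
        X = (Y - e * Mi) % M
        if X % S == 0:
            return X
    return None

corrected = detect_and_correct([10, 11, 8])   # (10, 11, 7) with 7 changed to 8
assert corrected == 1792
assert corrected // S == 7                    # unscaled result of 2*3 + 1
```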
Preethy et al. [16] have described a fault tolerance scheme for an RNS MAC. This
uses two redundant moduli 2^m and 2^m − 1 which are mutually prime with respect to
the moduli in the RNS. The given residues are used to compute the binary word
X using a reverse converter, and the residues of X mod 2^m and X mod (2^m − 1) are
computed and compared with the input residues. The error is used to look into a
LUT to generate the error word which can be added to the decoded word X to
obtain the corrected word. The authors have used the moduli set {7, 11, 13, 17, 19,
23, 25, 27, 29, 31, 32}, where 31 and 32 are the redundant moduli, for realizing a
36 bit MAC.

7.2 Fault Tolerance Techniques Using TMR

Triple modular redundancy (TMR) uses three modules working in parallel; if
the outputs of two modules are in agreement, that output is selected as the correct one.
The price paid is the need for 200 % redundancy. In the case of quadruple modular
redundancy (QMR), two units (printed circuit boards, for example) exist, each
having two similar modules doing the same computation. The outputs of the two
modules on a board are compared for agreement and, if they disagree, the output of the
other board (whose two on-board modules agree) is selected. QMR needs a
factor of 4.0 redundancy. In the case of RNS, double modular redundancy (DMR)
can be used, in which each modulus channel is duplicated and checked for
agreement. If there is disagreement, that modulus channel is removed from the
computation. Evidently, more channels than in the original RNS will be
needed. As an illustration, for a five moduli RNS, one extra channel will be needed,
giving 17 % more hardware. An arbitration unit will be needed to detect the error
and switch the channels. This design in total needs a factor of 2.34 redundancy.
Jenkins et al. [22] also suggest an SBM-RNS (serial-by-modulus RNS) in which only
one modulus channel exists and is reused. This is L times slower than the conventional
implementation using L moduli channels. The results corresponding
to the various moduli need to be stored. LUTs may be used for all arithmetic operations.
A compute-until-correct concept can be used, since a fault cannot be identified by
looking at an individual channel.
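The TMR voting step can be sketched in a few lines (illustrative only; real voters are combinational hardware):

```python
from collections import Counter

def tmr_vote(outputs):
    # Majority vote over three redundant module outputs.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None    # None: no majority, uncorrectable

assert tmr_vote([42, 42, 42]) == 42
assert tmr_vote([42, 7, 42]) == 42          # a single faulty module is outvoted
assert tmr_vote([1, 2, 3]) is None
```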
We discuss specific techniques for achieving fault tolerance using specialized
number systems in Chapter 8 and the fault tolerance of FIR filters in Chapter 9.
References

1. N.S. Szabo, R.I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology
(Mc-Graw Hill, New-York, 1967)
2. R.W. Watson, C.W. Hastings, Self-checked computation using residue Arithmetic, in Pro-
ceedings of IEEE, pp. 1920–1931 (1966)
3. D. Mandelbaum, Error correction in residue arithmetic. IEEE Trans. Comput. C-21, 538–545
(1972)
4. F. Barsi, P. Maestrini, Error correcting properties of redundant residue number systems. IEEE
Trans. Comput. C-22, 307–315 (1973)
5. F. Barsi, P. Maestrini, Error detection and correction in product codes in residue Number
system. IEEE Trans. Comput. C-23, 915–924 (1974)
6. S.S.S. Yau, Y.C. Liu, Error correction in redundant residue number Systems. IEEE Trans.
Comput. C-22, 5–11 (1973)
7. W.K. Jenkins, Residue number system error checking using expanded projections. Electron.
Lett. 18, 927–928 (1982)
8. W.K. Jenkins, The design of error checkers for self-checking Residue number arithmetic.
IEEE Trans. Comput. 32, 388–396 (1983)
9. W.K. Jenkins, E.J. Altman, Self-checking properties of residue number error checkers based
on Mixed Radix conversion. IEEE Trans. Circuits Syst. 35, 159–167 (1988)
10. M.H. Etzel, W.K. Jenkins, Redundant residue Number systems for error detection and correc-
tion in digital filters. IEEE Trans. Acoust. Speech Signal Process. 28, 538–545 (1980)
11. W.K. Jenkins, M.H. Etzel, Special properties of complement codes for redundant residue
Number systems, in Proceedings of the IEEE, vol. 69, pp. 132–133 (1981)
12. W.K. Jenkins, A technique for the efficient generation of projections for error correcting
residue codes. IEEE Trans. Circuits Syst. CAS-31, 223–226 (1984)
13. V. Ramachandran, Single residue error correction in residue number systems. IEEE Trans.
Comput. C-32, 504–507 (1983)
14. C.C. Su, H.Y. Lo, An algorithm for scaling and single residue error correction in the Residue
Number System. IEEE Trans. Comput. 39, 1053–1064 (1990)
15. G.A. Orton, L.E. Peppard, S.E. Tavares, New fault tolerant techniques for Residue Number
Systems. IEEE Trans. Comput. 41, 1453–1464 (1992)
16. A.P. Preethy, D. Radhakrishnan, A. Omondi, Fault-tolerance scheme for an RNS MAC:
performance and cost analysis, in Proceedings of IEEE ISCAS, pp. 717–720 (2001)
17. V.T. Goh, M.U. Siddiqui, Multiple error detection and correction based on redundant residue
number systems. IEEE Trans. Commun. 56, 325–330 (2008)
18. N.Z. Haron, S. Hamdioui, Redundant Residue Number System code for fault-tolerant hybrid
memories. ACM J. Emerg. Technol. Comput. Syst. 7(1), 1–19 (2011)
19. S. Pontarelli, G.C. Cardarilli, M. Re, A. Salsano, Totally fault tolerant RNS based FIR filters,
in Proceedings of 14th IEEE International On-Line Testing Symposium, pp. 192–194 (2008)
20. S. Pontarelli, G.C. Cardarilli, M. Re, A. Salsano, A novel error detection and correction
technique for RNS based FIR filters, in Proceedings of IEEE International Symposium on
Defect and Fault Tolerance of VLSI Systems, pp. 436–444 (2008)
21. A.P. Shenoy, R. Kumaresan, Fast base extension using a redundant modulus in RNS. IEEE
Trans. Comput. 38, 293–297 (1989)
22. W.K. Jenkins, B.A. Schnaufer, A.J. Mansen, Combined system-level redundancy and modular
arithmetic for fault tolerant digital signal processing, in Proceedings of the 11th Symposium on
Computer Arithmetic, pp. 28–34 (1993)
Further Reading

R.J. Cosentino, Fault tolerance in a systolic residue arithmetic processor array. IEEE Trans.
Comput. 37, 886–890 (1988)
M.-B. Lin, A.Y. Oruc, A fault tolerant permutation network modulo arithmetic processor. IEEE
Trans. VLSI Syst. 2, 312–319 (1994)
A.B. O’Donnell, C.J. Bleakley, Area efficient fault tolerant convolution using RRNS with NTTs
and WSCA. Electron. Lett. 44, 648–649 (2008)
C. Radhakrishnan, W.K. Jenkins, Hybrid WHT-RNS architectures for fault tolerant adaptive
filtering, in Proceedings of IEEE ISCAS, pp. 2569–2572 (2009)
D. Radhakrishnan, T. Pyon, Fault tolerance in RNS: an efficient approach, in Proceedings of 1990
IEEE International Conference on Computer Design: VLSI in Computers and Processors, ICCD
'90, pp. 41–44 (1990)
D. Radhakrishnan, T. Pyon, Compact real time RNS error corrector. Int. J. Electron. 70, 51–67
(1991)
T. Sasao, Y. Iguchi, On the complexity of error detection functions for redundant residue Number
systems, DSD-2008, pp. 880–887 (2008)
S. Timarchi, M. Fazlali, Generalized fault-tolerant stored-unibit-transfer residue number system
multiplier for the moduli set {2^n − 1, 2^n, 2^n + 1}. IET Comput. Digit. Tech. 6, 269–276 (2012)
Chapter 8
Specialized Residue Number Systems

Several residue number systems which lead to certain advantages in signal
processing applications have been described in the literature. These are based on the
concepts of quadratic residues, polynomial residue number systems, modulus
replication, logarithmic number systems and specialized moduli, and they are
considered here in detail. Applications of these concepts and techniques for achieving
fault tolerance are described in later chapters.

8.1 Quadratic Residue Number Systems

Complex signal processing can be handled by quadratic residue number systems
(QRNS) [1–4]. These use prime moduli of the form (4k + 1). For such moduli, −1
is a quadratic residue; in other words, x^2 + 1 = 0 mod (4k + 1) has two solutions
±j1 for x such that j1^2 = −1 mod (4k + 1). A given complex number
a + jb, where j = √−1, can be represented by an extension element (A, A*) where
A = (a + j1 b) mod m and A* = (a − j1 b) mod m.
The addition of two complex numbers a + jb and c + jd is considered next. We
know that

a + jb ⇒ (A, A*) = ((a + j1 b) mod m, (a − j1 b) mod m)     (8.1a)

c + jd ⇒ (B, B*) = ((c + j1 d) mod m, (c − j1 d) mod m)     (8.1b)

The sum can be written as

a + c + j(b + d) ⇒ (R, R*) = ((a + c + j1(b + d)) mod m, (a + c − j1(b + d)) mod m)     (8.1c)

The actual sum can be obtained as Q = (R + R*)/2 = a + c and Q* = (R − R*)/(2j1) = b + d.
Consider an example using k = 3, giving m = 13. It can be verified that j1 = 8. Let
us add A = (1 + j2) and B = (2 + j3).

1 + j2 ⇒ (A, A*) = ((1 + 8 × 2) mod 13, (1 − 8 × 2) mod 13) = (4, 11)

2 + j3 ⇒ (B, B*) = ((2 + 8 × 3) mod 13, (2 − 8 × 3) mod 13) = (0, 4)

Hence the sum is (R, R*) = (4 + 0, 11 + 4) mod 13 = (4, 2). The actual sum is,
therefore, Q = |(4 + 2)/2|_13 = 3 and Q* = |(4 − 2)/(2 × 8)|_13 = 5, which can be
verified to be true.
The multiplication in QRNS is quite simple. Consider the multiplication of
(a + jb) and (c + jd). The following illustrates the procedure:

a + jb ⇒ (A, A*) = ((a + j1 b) mod m, (a − j1 b) mod m)
c + jd ⇒ (B, B*) = ((c + j1 d) mod m, (c − j1 d) mod m)     (8.2)
Straightforward multiplication of each component on the right-hand side yields

Q = ac − bd + j1(ad + bc)
Q* = ac − bd − j1(ad + bc)     (8.3)

The real and imaginary parts of the result are R = (Q + Q*)/2 = ac − bd and
I = (Q − Q*)/(2j1) = ad + bc, respectively.
Consider next an example of the multiplication of (1 + j2) and (2 + j3) with m = 13
and j1 = 8. We have the following:

1 + j2 ⇒ (4, 11), 2 + j3 ⇒ (0, 4), Product = (4 × 0, 11 × 4) mod 13 = (0, 5)

Result = (|(0 + 5)/2|_13, |(0 − 5)/(2 × 8)|_13) = (9, 7)

Thus, as compared to complex multiplication involving several multiplications
and additions, QRNS-based complex multiplication is simple.
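Both worked examples can be checked with a short sketch (`to_qrns` and `from_qrns` are names introduced here):

```python
m, j1 = 13, 8                      # m = 4k+1 with k = 3; 8*8 = 64 = -1 (mod 13)
assert (j1 * j1) % m == m - 1

def to_qrns(a, b):
    return ((a + j1 * b) % m, (a - j1 * b) % m)

def from_qrns(z):
    q, qs = z
    return ((q + qs) * pow(2, -1, m) % m,
            (q - qs) * pow(2 * j1, -1, m) % m)

A, B = to_qrns(1, 2), to_qrns(2, 3)
assert A == (4, 11) and B == (0, 4)

# Addition and multiplication are both component-wise in QRNS.
add = ((A[0] + B[0]) % m, (A[1] + B[1]) % m)
mul = ((A[0] * B[0]) % m, (A[1] * B[1]) % m)
assert from_qrns(add) == (3, 5)    # (1+j2)+(2+j3) = 3+j5
assert from_qrns(mul) == (9, 7)    # (1+j2)(2+j3) = -4+j7 = 9+j7 (mod 13)
```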
It may be noted that QRNS systems need a front-end binary to RNS converter
followed by an RNS to QRNS converter, which needs one modulo multiplication, one
modulo addition and one modulo subtraction.
Soderstrand and Poe [3] have suggested a quadratic-like RNS (QLRNS).
Herein, an integer j1 can be found such that j1^2 = c mod m, c ≠ −1. Then the
mapping of a + jb can be seen to be as follows:

a + jb ⇒ Z = (a + j1 b√c) mod m, Z* = (a − j1 b√c) mod m
a = (Z + Z*)/2, b = (Z − Z*)/(2 j1 √c)     (8.4)

where the values of Z and Z* are the nearest rounded integers.


Krishnan et al. [4] have suggested an MQRNS system, in which j1 is a solution of
x^2 − n = 0 mod m. In this system, the mapping of A and B is the same as in the QRNS
case. The multiplication is mapped as

(a + jb) × (c + jd) ⇒ (AB − S, A*B* − S)     (8.5a)

where

S = |(j1^2 + 1) bd|_m     (8.5b)

Note that several choices of j1 are possible for a given m, leading to different
n values. For example, for m = 19, the (j1, n) pairs can be (5, 6), (6, 17), (7, 11), (8, 7),
(10, 5), etc. Consider the following example of multiplication in MQRNS with
m = 19, j1 = 10, n = 5. Let us find (2 + j3) × (3 + j5). First note that
S = 14. By the mapping rule, we have

2 + j3 ⇒ A = (2 + 10 × 3) mod 19 = 13, A* = (2 − 10 × 3) mod 19 = 10

3 + j5 ⇒ B = (3 + 10 × 5) mod 19 = 15, B* = (3 − 10 × 5) mod 19 = 10

Hence, the real part is |(AB + A*B*)/2 − S|_19 = 10 and the imaginary part is
|(AB − A*B*)/(2j1)|_19 = 0.
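The MQRNS product above can be verified with the following sketch (names introduced here for illustration):

```python
m, j1 = 19, 10
assert (j1 * j1) % m == 5          # j1 solves x^2 - n = 0 mod m with n = 5

def mqrns_mul(a, b, c, d):
    A, As = (a + j1 * b) % m, (a - j1 * b) % m     # same mapping as QRNS
    B, Bs = (c + j1 * d) % m, (c - j1 * d) % m
    S = ((j1 * j1 + 1) * b * d) % m                # correction term (8.5b)
    return ((A * B - S) % m, (As * Bs - S) % m)

P, Ps = mqrns_mul(2, 3, 3, 5)
real = (P + Ps) * pow(2, -1, m) % m
imag = (P - Ps) * pow(2 * j1, -1, m) % m
assert (real, imag) == (10, 0)     # (2+j3)(3+j5) = -9+j19 = 10+j0 (mod 19)
```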

8.2 RNS Using Moduli of the Form r^n

Paliouras and Stouraitis [5, 6] have suggested the use of moduli of the form r^n in
order to increase the dynamic range. The residue for a modulus r^n can be
represented as digits a_i (i = 0, …, n − 1), and these digits can be processed:

A mod r^n = a_{n−1} r^{n−1} + a_{n−2} r^{n−2} + … + a_1 r + a_0     (8.6a)

where 0 ≤ a_i ≤ r − 1. It is necessary to have efficient hardware for handling these
types of moduli.
Specifically, Paliouras and Stouraitis [6] have considered the design of high-radix
multipliers (HRM) mod r^n. It can be noted that, as in conventional multiplication,
several digit products need to be computed in the pre-processor, and these
next need to be aligned and added to obtain the final result. Defining B as
180 8 Specialized Residue Number Systems

Bmodr n ¼ bn1 r n1 þ bn2 r n2 þ    þ b1 r þ bo ð8:6bÞ

where 0 < bi  r  1 for computing A  B, we have digit products aibj such that

pij ¼ ai bj ¼ rp1 ij þ p0ij ð8:6cÞ

If i + j  n, then due to the mod rn operation, pij will not contribute to the result. Note
also that p1ij can be at most (r  2) since ai and bj can be (r  1) at most in which
case

ai bj ¼ r 2  2r þ 1 ¼ r ðr  2Þ þ 1 ð8:6dÞ

The maximum carry digit is hence (r  2). The multiplier will have two types of
cells one with both sum and carry outputs and another with only sum output. The
latter are used in the leftmost column of the multiplier. The architecture of the
multiplier for r > 3 and n ¼ 5 is shown in Figure 8.1a. The preprocessor computes
pij ¼ aibj and yields two outputs p1ij and p0ij (see 8.6c). The preprocessor cell is shown
in Figure 8.1b. Note that it computes the digit product D1  D2 and gives the output
S ¼ Cr + D where C and D are radix-r digits.
The partial product outputs of the various preprocessor cells next need to be
summed in a second stage using an array of radix r adder cells. These special full
adder cells need to add three inputs which are having maximum value (r  1) or
(r  2). The carry digit can be at most 2 for r  3:

ðr  1Þ þ ðr  1Þ þ ðr  2Þ ¼ 2r þ ðr  4Þ ð8:7Þ

(since the input carry digit can be at most (r  2) as shown before).
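The digit-product decomposition of (8.6c) can be illustrated in software (a behavioral sketch of the arithmetic only, not of the adder-cell architecture):

```python
r, n = 5, 3

def digits(x):
    # little-endian radix-r digits of x (x < r^n assumed)
    return [(x // r ** k) % r for k in range(n)]

def hrm_mul(a, b):
    # Digit-product multiplication mod r^n: p_ij = r*p1 + p0.
    acc = 0
    for i, ai in enumerate(digits(a)):
        for j, bj in enumerate(digits(b)):
            if i + j >= n:
                continue                    # vanishes under mod r^n
            p1, p0 = divmod(ai * bj, r)
            assert p1 <= r - 2              # maximum carry digit is r - 2
            acc += (p0 + r * p1) * r ** (i + j)
    return acc % r ** n

for a, b in [(97, 114), (124, 124), (60, 35)]:
    assert hrm_mul(a, b) == (a * b) % r ** n
```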


Various types of adder cells are needed in Figure 8.1a so as to minimize
hardware complexity regarding the circuits used to generate the carry. The FA
cell adds two digits and a carry generated at the pre-processor. The adder cell H22
adds a radix-r digit and two carries generated by FAs. The H20 adder adds a radix-r
digit to a carry generated by a FA. The H11 adder adds a radix-r digit with two
carries generated by H22, H20, H11 or H10 digit adders. Finally, the H10 adder adds a
radix r-digit with one carry generated by H22, H20, H11 or H10 digit adders.
Kouretas and Paliouras [7] have introduced graph-based techniques for optimizing
the hardware/time. These minimize the multiplier complexity by selecting digit
adders that observe the constraints on the maximum value of the intermediate
digits. They have proposed a family of digit adders which have three or four inputs
and one or two outputs, denoted by the sixtuple {m1, m2, m3, m4, m5, m6}, where
m1–m4 are the word lengths of the inputs and m5 and m6 are the word lengths of the
upper and lower output digits. These compute d5·r + d6 = Σ_{i=1}^{4} d_i, where
0 ≤ d_i ≤ m_i. The adders take into account the redundancy to reduce the complexity
of the radix-r digit adders. The flexibility of restricting the intermediate digit values
is possible by
Figure 8.1 (a) Architecture of a radix-5 multiplier (r > 3, n = 5), (b) pre-processor cell, (c) full-adder based three-input radix-5 digit adder, (d) radix-7 D665062 digit-adder converter, (e) optimized version of (d), (f) modulo r^2 multiplier, (g) modulo r^3 multiplier, (h) modulo r^4 multiplier and (i) 8-bit binary to residue converter mod 125 ((a, i) adapted from [5] ©IEEE2000, (b–h) adapted from [7] ©IEEE2009)
Figure 8.1 (continued)

allowing certain redundancy in the employed representation. Redundancy permits
representation in more than one way: a digit can assume a value greater than the
radix r, but its bit length is restricted to that of r. For example, the sum of three radix-5
digits whose maximum values are 4, 3 and 4 can be 11, which can be represented as
2 × 5 + 1 or 1 × 5 + 6, both of which are valid. The adder consists of two stages. The
first stage is a radix-r digit adder which gives a binary output S. The second stage is a
converter which evaluates the legitimate carry digit C and sum digit D (see
Figure 8.1b). The second stage is a recursive mod-r adder in which the carry
generated is fed back, after being multiplied by k = 2^l mod r, in a chain of adders
till no carry is produced. Every time a carry is produced, (2^l − r) is added and the
carry bits are accumulated to form the final carry digit.
As an illustration, a three-input radix-5 adder is presented in Figure 8.1c. Each
digit is represented as a 3-bit word; however, the values can be greater than 4. Several
radix-5 cells with three 3-bit inputs with a maximum specified sum can be designed;
some possibilities, for example, are D744052, D654052, etc. Note that the first four
digits correspond to the maximum values of the inputs, the last but one indicates the
sum output and the last digit indicates the carry output. The output binary word of
this adder needs to be converted to radix-r form using a binary to radix-r converter.
In some cases, it can be simplified, for example for r = 7. In a binary to RNS
converter the quotient value is ignored, whereas herein we need to compute the
quotient. A radix-7 binary to RNS converter for a five bit input s4, s3, s2, s1, s0 with
maximum sum value 17 is shown in Figure 8.1d and a look-ahead version is shown
in Figure 8.1e. Here, b4b3 are the carry bits and b2b1b0 are the sum bits. The authors
have suggested optimum implementations for radix r^2, r^3 and r^4 multiplication.
These are presented in Figure 8.1f–h.
Kouretas and Paliouras [8, 9] have described HRM (high Radix redundant
multiplier) for moduli set rn, rn  1 and rn + 1. The last one uses diminished-1
arithmetic. Redundancy has been used in the representation of radix-r digits to
reduce complexity of the adder cells as discussed earlier.
The binary to radix-r conversion [7] can follow similar methods as in radix-2
case. The residues of various powers of 2 can be stored in radix form and added mod
r. As an illustration, for radix 125 = 5^3, the radix-5 digit triples of the various powers of 2 corresponding to an 8-bit binary input are as follows:

Powers of 2:     2^0  2^1  2^2  2^3  2^4  2^5  2^6  2^7
Radix-5 digits:  001  002  004  013  031  112  224  003
Thus, for any given binary number, the LUTs can yield the radix-5 digits, which need to be added taking the carries into account to get the final result. An 8-bit modulo-125 (= 5^3) binary to residue converter, using an array of special purpose
cells denoted as simplified digit adders, is shown in Figure 8.1i. Note that the
numbers inside the adders in Figure 8.1i indicate the weights of the inputs. For
example, the cell 441 computes 4y2 + y0 + 4y6 and gives the sum digit in radix-r
form and carry bit. The cell rAB adds inputs assigned to three ports. The r port can
be any radix-r digit where the ports IA and IB are 1 bit which when asserted will add
constants A and/or B. This cell also gives a 1 bit carry output.
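The table-driven conversion can also be mimicked in software. The following is a minimal Python sketch of the LUT idea only (not of the special-purpose cell array of Figure 8.1i):

```python
# Python sketch of the LUT-based idea: binary to mod-125 conversion
# using the radix-5 digit triples of 2^k mod 125.

def to_radix5(v):
    """Three radix-5 digits of v (least significant digit first)."""
    return [v % 5, (v // 5) % 5, (v // 25) % 5]

# LUT: radix-5 digits of 2^k mod 125 for an 8-bit input.
LUT = [to_radix5(pow(2, k, 125)) for k in range(8)]

def bin_to_mod125(x):
    """Residue of x mod 125 as radix-5 digits, least significant first."""
    digits = [0, 0, 0]
    for k in range(8):
        if (x >> k) & 1:
            for d in range(3):
                digits[d] += LUT[k][d]
    # Carry propagation in radix 5; a carry out of the top digit is a
    # multiple of 125 and therefore vanishes mod 125.
    carry = 0
    for d in range(3):
        total = digits[d] + carry
        digits[d] = total % 5
        carry = total // 5
    return digits
```

For example, bin_to_mod125(130) returns [0, 1, 0], i.e. the residue 5 = 130 mod 125.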
Abdallah and Skavantzos [10] have described multi-moduli residue number
systems with moduli of the form r^a, r^b - 1, r^c + 1, where r > 2. Evidently, the
complete processing has to be done in radix r. The rules used for radix 2 for
reduction mod (2^n - 1) or (2^n + 1) using periodic properties discussed earlier can
be applied in this case as well. The moduli can have common factors which need to
be taken into account in the RNS to binary conversion using CRT. As an
illustration, the moduli set {3^12 + 1, 3^13 - 1, 3^14 + 1, 3^15 - 1, 3^16 + 1} has all even
moduli and division by 2 yields mutually prime numbers. The reader is referred to
[10] for an exhaustive treatment on the options available. The authors show that
these can be faster than radix-2-based designs.
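As a quick sanity check (a Python sketch, not part of [10]), the halved moduli can be verified to be pairwise coprime:

```python
# Sanity check: after division by 2, the even moduli 3^12 + 1, 3^13 - 1,
# 3^14 + 1, 3^15 - 1, 3^16 + 1 become pairwise coprime.
from math import gcd

moduli = [3**12 + 1, 3**13 - 1, 3**14 + 1, 3**15 - 1, 3**16 + 1]
halved = [m // 2 for m in moduli]
pairwise_coprime = all(gcd(a, b) == 1
                       for i, a in enumerate(halved)
                       for b in halved[i + 1:])
print(pairwise_coprime)    # True
```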
8.3 Polynomial Residue Number Systems
Skavantzos and Taylor [11] and Skavantzos and Stouraitis [12] have proposed the polynomial residue number system, which is useful for polynomial multiplication. This
can be considered as a generalization of QRNS and can perform the polynomial
product with a minimal number of multiplications and with a high degree of
parallelism provided the arithmetic operates in a carefully chosen ring. This is
useful in DSP applications that involve multiplication-intensive algorithms like
convolutions and one- or two-dimensional correlations.
Consider two (N - 1)th-order polynomials A(x) and B(x), for which we need to find (A(x) × B(x)) mod (x^N + 1) in a chosen ring Z_m = {0, 1, 2, ..., m - 1} which is closed with respect to the operations of addition and multiplication mod m. Such an operation is needed in circular convolution. Note that (x^N + 1) can be factored into N distinct linear factors in Z_m, viz., (x - r_0)(x - r_1)...(x - r_{N-1}) with r_i in Z_m, i = 0, 1, 2, ..., N - 1, if and only if p_i ≡ 1 (mod 2N) for every prime in the factorization m = p_1^{e_1} p_2^{e_2} ... p_L^{e_L}, where the p_i are primes and the e_i are exponents. In the case of x^N - 1, the necessary and sufficient condition for factorization is p_i ≡ 1 (mod N).
We consider N = 4 for illustration. We first define the roots of (x^4 + 1) = 0 as r_0, (-r_0) mod m, (1/r_0) mod m and (-1/r_0) mod m. Once these roots are known, using an isomorphic mapping, it is possible to map A(x) into the 4-tuple (a_0*, a_1*, a_2*, a_3*), where a_0* = (A(r_0)) mod m, a_1* = (A(-r_0)) mod m, a_2* = (A(1/r_0)) mod m and a_3* = (A(-1/r_0)) mod m, as follows:

a_0* = (a_0 + a_1 r_0 + a_2 r_0^2 + a_3 r_0^3) mod m        (8.8a)
a_1* = (a_0 - a_1 r_0 + a_2 r_0^2 - a_3 r_0^3) mod m        (8.8b)
a_2* = (a_0 - a_1 r_0^3 - a_2 r_0^2 - a_3 r_0) mod m        (8.8c)
a_3* = (a_0 + a_1 r_0^3 - a_2 r_0^2 + a_3 r_0) mod m        (8.8d)
Defining the 4-tuple corresponding to B(x) as (b_0*, b_1*, b_2*, b_3*), the item-wise multiplication of (a_0*, a_1*, a_2*, a_3*) with (b_0*, b_1*, b_2*, b_3*) yields the product (c_0*, c_1*, c_2*, c_3*). This task reduces the N^2 mod m multiplications to only N mod m multiplications. The 4-tuple (c_0*, c_1*, c_2*, c_3*) needs to be converted using an inverse isomorphic transformation in order to obtain the final result (A(x) × B(x)) mod (x^4 + 1) using the following equations:
a_0 = (2^{-2} (a_0* + a_1* + a_2* + a_3*)) mod m        (8.9a)
a_1 = (2^{-2} r_0^3 (a_1* - a_0*) + 2^{-2} r_0 (a_2* - a_3*)) mod m        (8.9b)
a_2 = (2^{-2} r_0^2 (a_3* + a_2* - a_1* - a_0*)) mod m        (8.9c)
a_3 = (2^{-2} r_0 (a_1* - a_0*) + 2^{-2} r_0^3 (a_2* - a_3*)) mod m        (8.9d)

In general, a_i can be obtained as a_i = (N^{-1} (a_0* r_0^{-i} + a_1* r_1^{-i} + ... + a_{N-1}* r_{N-1}^{-i})) mod m.
Yang and Lu [13] have observed that PRNS can be interpreted in terms of CRT for
polynomials over a finite ring.
An example will be illustrative. Consider the evaluation of A(x)B(x) mod (x^4 + 1), where A(x) = 5 + 6x + 8x^2 + 13x^3 and B(x) = 9 + 14x + 10x^2 + 12x^3 with m = 17. It can be found that the roots of (x^4 + 1) mod 17 are 2, 15, 9, 8. We note that a_i = {5, 6, 8, 13} and b_i = {9, 14, 10, 12}. For each of the roots of (x^4 + 1), we can find a_i* = {0, 6, 1, 13} and b_i* = {3, 10, 3, 3}. Thus, we have c_i* = {0, 9, 3, 5}. Using the inverse transformation, we have c_i = {0, 0, 16, 9}. Thus, the answer is 16x^2 + 9x^3.
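The forward mapping, channel-wise products and inverse mapping of this example can be checked with a short Python sketch (modular inverses via three-argument pow, Python 3.8+):

```python
# Checking the worked example with m = 17: forward mapping by evaluation
# at the roots, channel-wise products, then the inverse mapping.
m = 17
roots = [2, 15, 9, 8]          # r0, -r0, 1/r0, -1/r0 mod 17
a = [5, 6, 8, 13]              # A(x) = 5 + 6x + 8x^2 + 13x^3
b = [9, 14, 10, 12]            # B(x) = 9 + 14x + 10x^2 + 12x^3

def evaluate(coeffs, r):
    return sum(c * pow(r, i, m) for i, c in enumerate(coeffs)) % m

a_star = [evaluate(a, r) for r in roots]                  # [0, 6, 1, 13]
b_star = [evaluate(b, r) for r in roots]                  # [3, 10, 3, 3]
c_star = [(u * v) % m for u, v in zip(a_star, b_star)]    # [0, 9, 3, 5]

# Inverse mapping: c_i = (N^-1 * sum_j c_j* r_j^-i) mod m
n_inv = pow(4, -1, m)
c = [n_inv * sum(cs * pow(r, -i, m) for cs, r in zip(c_star, roots)) % m
     for i in range(4)]
print(c)    # [0, 0, 16, 9], i.e. 16x^2 + 9x^3
```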
Paliouras et al. [14] have extended PRNS for performing modulo (x^n - 1) multiplication as well. They observe that in this case, the values of the roots r_i are very simple powers of two. As an illustration, the roots of x^8 - 1 mod (2^4 + 1) are {1, 2, 2^2, 2^3, -2^3, -2^2, -2, -1}. As such, the computation of A(r_i) can be simplified to simple rotations and bit inversions with low hardware complexity. The authors have used diminished-1 arithmetic for the case (2^n + 1) and have shown that PRNS-based cyclic convolution architectures reduce the area as well as the power consumption. The authors have also considered the three-moduli system {2^4 + 1, 2^8 + 1, 2^16 + 1} so that supply voltage reduction can be applied to channels with long critical paths.
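This root structure is easy to confirm; a small Python check (not from [14]) that the roots of x^8 - 1 mod 17 are exactly the plus/minus powers of two:

```python
# The eight roots of x^8 - 1 mod (2^4 + 1) are plus/minus powers of two,
# which is why evaluating A(r_i) reduces to rotations and negations.
m = 2**4 + 1
roots = sorted(x for x in range(1, m) if pow(x, 8, m) == 1)
powers = sorted({(s * 2**k) % m for k in range(4) for s in (1, -1)})
print(roots == powers)    # True
```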
Skavantzos and Stouraitis [12] have extended the PRNS to perform complex linear convolutions, which can be computed using two modulo (x^{2N} + 1) polynomial products. N-point complex linear convolutions can be computed with 4N real multiplications while using PRNS, instead of 2N^2 real multiplications when using QRNS. The reader is referred to their work for more information.
Abdallah and Skavantzos [15] observe that the sizes of the moduli rings used in
PRNS are of the same order as size N of the polynomials to be multiplied. For
multiplication of large polynomials, large modular rings must be chosen leading to
performance degradation. As an illustration, for modulus (x^20 + 1) PRNS, the only possible prime q values are of the form 40k + 1, e.g. 41 and 241. However, the dynamic range 41 × 241 = 9881 is less than 14 bits. In such cases, the multi-polynomial Channel PRNS
(MPCPRNS) has been suggested. The reader is referred to [15] for more
information.
Paliouras and Stouraitis [16] have suggested complexity reduction of forward
and inverse PRNS converters exploiting the symmetry of the transformation matri-
ces used for representing the conversion procedure as a matrix-by-vector product.
Shyu et al. [17] have suggested a quadratic polynomial residue number system based complex multiplier using moduli of the form 2^2n + 1. The advantage is that the mapping and inverse mapping can be realized using simple shifts and additions. For complex numbers with real and imaginary parts less than R, the dynamic range of the RNS shall be 4R^2. As an illustration, for R = 2^8, we can choose the RNS {2^8 + 1, 2^6 + 1, 2^4 + 1}.
Two-dimensional PRNS techniques will be needed to multiply large polyno-
mials in a fixed size arithmetic ring. Skavantzos and Mitash [18, 19] have described
this technique. PRNS can be extended to compute the products of multivariate
polynomials, e.g. A(x1, x2) = 2 + 5x1 + 4x2 + 7x1x2 + 2x1x2^2 + x2^2 and B(x1, x2) = 1 + 2x1 + 4x2 + 11x1x2 + 3x1x2^2 + 9x2^2 [20]. This has application in multi-dimensional
signal processing using correlation and convolution techniques.
Beckmann and Musicus [21] have presented fault-tolerant convolution algorithms
based on PRNS. Redundancy is incorporated using extra residue channels. Their
approach for error detection and correction is similar to that in integer arithmetic.
They also suggest the use of Triple modular Redundancy (TMR) for CRT recon-
struction, error detection and correction. Note that individual residue channel
operations do not use TMR. The authors recommend the use of specific set of
modulo polynomials which are sparse in order to simplify the modulo multiplica-
tion and accumulation operations.
Parker and Benaissa [22], Chu and Benaissa [23, 24] have used PRNS for
multiplication in GF(p^m). They suggest the choice of irreducible trinomials such that the degree of the product is 2m. For implementing the ECC curve K-163, with the polynomial f(x) = x^163 + x^7 + x^6 + x^3 + 1, four degree-84 irreducible trinomials x^84 + x^k + 1 with k ≤ 42 have been selected. In another design, 37 degree-9 irreducible polynomials have been selected. These need a GF(2^9) channel multiplier.
PRNS has also been used for implementing AES (Advanced Encryption Stan-
dard) with error correction capability [25]. The S-Box was mapped using three
irreducible polynomials x^4 + x + 1, x^4 + x^3 + 1 and x^4 + x^3 + x^2 + x + 1 for computing the S-Box using three GF(2^4) modules, while two are sufficient. The additional
modulus has been used for error detection. LUT-based implementation was used
for S-Box whereas MixColumn transformation also was implemented using three
moduli PRNS.
8.4 Modulus Replication RNS
In RNS, the dynamic range is directly related to the moduli since it is a product of
all mutually prime moduli. Increase in dynamic range thus implies increase in the
number of moduli or word lengths of the moduli making the hardware complex. In
MRRNS (modulus replication RNS) [26], the numbers are represented as polynomials in an indeterminate which is a power of two (some fixed radix 2^β). The coefficients are integers smaller in magnitude than 2^β. The computation of residues
in the case of general moduli mi (forward conversion step) does not therefore arise.
As an example, for an indeterminate x = 8, we can write 79 in several ways [27]:

79 = x^2 + x + 7 = x^2 + 2x - 1 = 2x^2 - 6x - 1 = ...
The polynomials are represented as elements of copies of finite rings. The dynamic
range is increased by increasing the number of copies of already existing moduli.
Even a small moduli set such as {3, 5, 7} can produce a large dynamic range.
MRRNS is based on a version of CRT that holds for polynomial rings. There is no
restriction that all moduli must be relatively prime. It allows repeated use of moduli
to increase the dynamic range of the computation. A new multivariate version of
MRRNS was described by Wigley et al. [26].
MRRNS uses the fact that every polynomial of degree n can be uniquely represented by its values at (n + 1) distinct points, and closed arithmetic operations can be performed over completely independent channels. These points r_i are chosen such that (r_i - r_j) for all (i, j), i ≠ j, shall be invertible in Z_p. This technique allows algorithms to be decomposed into independent computations over identical channels. The number of points must be large enough not only to represent the input polynomials but also the result of the computation.
Consider the computation of 79 × 47 + 121 × 25 = 6738 [28]. We first define polynomials P1(x), P2(x), Q1(x), Q2(x) that correspond to the input values 79, 121, 47 and 25, assuming an indeterminate x = 2^3 = 8. Thus, we have

P1(x) = -1 + 2x + x^2, P2(x) = 1 + 7x + x^2, Q1(x) = -1 + 6x, Q2(x) = 1 + 3x.

Since the degree of the final polynomial is 3, we need to evaluate each of these polynomials at n = 4 distinct points. Choosing the set S = {-2, -1, 1, 2}, and assuming the coefficients of the final polynomial belong to the set {-128, ..., +128}, we can perform all calculations in GF(257). Evaluating P1(x), P2(x), Q1(x), Q2(x) at each point in S gives u1 = {-1, -2, 2, 7}, u2 = {-9, -5, 9, 19}, v1 = {-13, -7, 5, 11} and v2 = {-5, -2, 4, 7}. Thus, the component-wise results can be computed as w1 = u1 v1 = {13, 14, 10, 77} and w2 = u2 v2 = {45, 10, 36, -124}. (Note that the last entry of w2, 133, is rewritten mod 257 as -124.) Adding w1 and w2, we obtain the result w = w1 + w2 = {58, 24, 46, -47}. Next, using an interpolation algorithm (Newton's divided differences [29]), we obtain the final polynomial as R(x) = 2 + 2x + 33x^2 + 9x^3. Substituting x = 8, the final answer is found as 6738, which can be verified by straightforward computation.
can be found to be true by straightforward computation. Note that by adding extra
channels, it is possible to detect and correct errors [28]. Error detection is achieved
simply by computing the polynomials at n + 2 points. Error correction can be
achieved by computing at n + 3 points. Evidently, the condition for a fault to be
detected is that the highest degree term of the result R(x) is non-zero.
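The whole computation can be mimicked in software. The following Python sketch uses Lagrange interpolation over GF(257) (rather than Newton's divided differences) to recover the same result:

```python
# The 79*47 + 121*25 example: evaluate the input polynomials at S,
# combine channel-wise, interpolate over GF(257), substitute x = 8.
p = 257

def evaluate(coeffs, x):
    return sum(c * x**i for i, c in enumerate(coeffs)) % p

def interpolate(points):
    """Lagrange interpolation over GF(p); coefficients, low degree first."""
    n = len(points)
    result = [0] * n
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            new = [0] * (len(basis) + 1)
            for k, a in enumerate(basis):        # multiply basis by (x - xj)
                new[k] = (new[k] - a * xj) % p
                new[k + 1] = (new[k + 1] + a) % p
            basis, denom = new, denom * (xi - xj) % p
        scale = yi * pow(denom, -1, p) % p
        for k, a in enumerate(basis):
            result[k] = (result[k] + scale * a) % p
    return result

S = [-2, -1, 1, 2]
P1, Q1 = [-1, 2, 1], [-1, 6]        # 79 and 47 at x = 8
P2, Q2 = [1, 7, 1], [1, 3]          # 121 and 25
w = [(evaluate(P1, x) * evaluate(Q1, x) +
      evaluate(P2, x) * evaluate(Q2, x)) % p for x in S]
R = interpolate(list(zip(S, w)))    # [2, 2, 33, 9]
print(sum(c * 8**i for i, c in enumerate(R)))    # 6738
```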
As an illustration, consider the computation of the product of two polynomials P(x) = 1 - 2x + 3x^2 and Q(x) = 2 - x. Since the result R(x) is a polynomial of degree 3, we
need to consider 4 + 2 points to correct a single error. Considering the set S = {-4, -2, -1, 1, 2, 4} and evaluating P and Q at these points, we get u = (57, 17, 6, 2, 9, 41), v = (6, 4, 3, 1, 0, -2) and w = (85, 68, 18, 2, 0, -82). Performing interpolation, we obtain R(x) = 2 - 5x + 8x^2 - 3x^3 + 0·x^4 + 0·x^5, which is the correct result. Let us consider that one error has occurred on channel 2, so that the computed result is w = (85, -71, 18, 2, 0, -82). We can independently eliminate each of the channels and compute the polynomials R(x) to obtain the following polynomials:

46 - 38x + 98x^2 + 30x^3 + 123x^4,
2 - 5x + 8x^2 - 3x^3,
-86 + 127x + 98x^2 + 53x^3 + 67x^4,
9 + 127x - 12x^2 + 53x^3 - 56x^4,
-79 - 5x + 78x^2 - 3x^3 + 11x^4,
-123 - 38x - 12x^2 + 30x^3 - 112x^4.

Note that the only polynomial of third degree is 2 - 5x + 8x^2 - 3x^3, obtained by removing the second channel.
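The channel-elimination check can be sketched as follows (Python, Lagrange interpolation over GF(257); with one faulty channel, only the interpolation that drops it yields a zero x^4 coefficient):

```python
# Channel-elimination fault check for the P(x)Q(x) example: locate the
# faulty channel as the one whose removal restores a degree-3 result.
p = 257

def interpolate(points):
    """Lagrange interpolation over GF(p); coefficients, low degree first."""
    n = len(points)
    result = [0] * n
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            new = [0] * (len(basis) + 1)
            for k, a in enumerate(basis):        # multiply basis by (x - xj)
                new[k] = (new[k] - a * xj) % p
                new[k + 1] = (new[k + 1] + a) % p
            basis, denom = new, denom * (xi - xj) % p
        scale = yi * pow(denom, -1, p) % p
        for k, a in enumerate(basis):
            result[k] = (result[k] + scale * a) % p
    return result

S = [-4, -2, -1, 1, 2, 4]
w_faulty = [85, -71, 18, 2, 0, -82]        # channel 2 corrupted
channels = list(zip(S, w_faulty))
good = [i for i in range(6)
        if interpolate([pt for j, pt in enumerate(channels)
                        if j != i])[-1] == 0]
print(good)    # [1] -> removing the second channel restores degree 3
```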
MRRNS can be extended to polynomials in two indeterminates also [28]. Considering two indeterminates, the polynomials need to be evaluated at (m + 1)(n + 1) points for the case of polynomials of degree m in x and n in y. Note that addition, subtraction and multiplication can be carried out in component-wise fashion. However, the two-dimensional interpolation is carried out in two steps. In the first step, (m + 1) 1-D interpolations in the y direction are performed and next, (n + 1) 1-D interpolations in the x direction are performed.
As an illustration, consider P(x, y) = 2x^2 - 2xy - y^2 + 1 and Q(x, y) = x^2y + y^2 - 1. Considering the set of points {-1, 0, 1, 2} for x and y, and keeping x fixed, P(x, y) can be evaluated for all values of y. Similarly, keeping y fixed, P(x, y) can be computed for the values of x. Performing a similar computation for Q(x, y), addition, multiplication and subtraction operations can then be performed component-wise on the 4 × 4 matrices obtained. In case of errors, the degrees of the polynomials obtained for a chosen y will be higher than 2. In a similar manner, for a chosen x, the degree of the polynomials obtained in y will be higher than 2, thus showing the error. For error correction, the affected row and column of the matrix can be removed and interpolation can be carried out to find the correct result, e.g. 2x^2 + x^2y - 2xy for the sum P(x, y) + Q(x, y).
The authors have also described a symmetric MRRNS (SMRRNS) [28, 30], in which the input values are mapped into even polynomials. The set S is then chosen as powers of two such that x_i ≠ ±x_j. As an illustration, consider finding 324 × 179. P(x) and Q(x) can be written as P(x) = 5x^2 + 4 and Q(x) = 3x^2 - 13, choosing x = 8 as the indeterminate. Since R(x) is of degree 4, we need at least 5 points. Choosing the set S = {-8, -4, -2, -1, 0} and performing the computations over GF(257), we have u = {67, 84, 24, 9, 4} and v = {-78, 35, -1, -10, -13}, and thus w = {-86, 113, -24, -90, -52}. Interpolating with respect to S yields the result R(x) = 15x^4 - 53x^2 - 52 and, for x = 8, we obtain R(8) = 57,996. Note that error detection and error correction can be carried out using a different technique. Consider that due to an error w has changed to {-86, 120, -24, -90, -52}. We need to virtually extend the output vector as {-86, 120, -24, -90, -52, -90, -24, 120, -86}. Next, considering S as S = {-8, -4, -2, -1, 0, 1, 2, 4, 8} and interpolating, we obtain the eighth-order polynomial 109x^8 - 68x^6 + 122x^4 + 56x^2 - 52. Since we do not know where the error has occurred, by removing the two values corresponding to each location, we can find that the error is in the second position and the answer is R(x) = 15x^4 - 53x^2 - 52.
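The final substitution is easy to spot-check:

```python
# Spot-check that R(x) = 15x^4 - 53x^2 - 52 evaluated at the
# indeterminate x = 8 reproduces 324 * 179.
x = 8
print(15 * x**4 - 53 * x**2 - 52 == 324 * 179)    # True
```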
MRRNS has been applied to realize fault-tolerant complex filters [31–34]. These use QRNS together with MRRNS techniques. As an illustration, consider the computation of the product of two complex numbers a = 237 - j225 and b = -162 + j211. We illustrate the technique using the three-moduli set {13, 17, 29}. The dynamic range is evidently M = 6409. For the modulus 13, for example, the elements (residues) can be considered to lie in the interval [-6, 6]. Choosing an indeterminate x = 8, the given numbers 237, -225, -162, 211 can be written as the polynomials 3x^2 + 5x + 5, -3x^2 - 4x - 1, -2x^2 - 4x - 2 and 3x^2 + 2x + 3, respectively. We next convert the coefficients to QRNS form, noting that j1 = 5 for m1 = 13, j2 = 4 for m2 = 17 and j3 = 12 for m3 = 29. We choose the inner product polynomial to be of fourth degree and choose the 5 points x = -2, -1, 0, 1, 2 at which we evaluate the polynomials. Note that the procedure is the same as in the previous case. After the inner product computation by element-wise multiplication and interpolation, we get the values in QRNS form corresponding to each modulus. These can be converted back into normal form and, using CRT for the moduli set {13, 17, 29}, the real and imaginary parts can be obtained. Note that in the case of an inner product of N terms, considering that the a_i and b_i values are bounded by -2^γ + 1 and 2^γ - 1, M shall satisfy the condition M > 4N(2^γ - 1)^2, where M is the product of all the moduli. Note that fault tolerance can be achieved by having two more channels for fault detection and correction [31, 33, 34].
Radhakrishnan et al. [35] have described realization of fault tolerant adaptive
filters based on a hybrid combination of Fermat number transform block processing
and MRRNS. These are immune to transient errors. Note that each input sample and
tap weight is interpreted as polynomials and these polynomials are evaluated at the
roots. The transformation matrix of Fermat Number Transform (FNT) is applied to
the resulting matrices corresponding to input samples and weights. Next, the
elements of these matrices are multiplied element wise and converted back using
interpolation formula. Fault tolerance can be achieved by evaluating the polynomial
at two additional points as explained before.
8.5 Logarithmic Residue Number Systems
Preethy and Radhakrishnan [36] have suggested using RNS for realizing logarithmic adders. They have suggested using multiple bases. The ring elements can be considered as products of factors of different bases. The multiple-base logarithm is defined for X = b_1^{α1} b_2^{α2} ... b_m^{αm} as

(α1, α2, ..., αm) = lm_{b1, b2, ..., bm}(X)        (8.10)

where m > 1, the αi are exponents and lm stands for logarithm. All algebraic properties obeyed by normal logarithms apply to the multiple-base logarithm also. This implies that in the case of a prime field GF(p), index calculus can be used for the purpose of multiplication. We can exploit properties of RNS together with those of finite fields and rings so that LUTs can be reduced to a small size. The index α of the sum, reduced mod (p - 1), corresponding to the addition of two non-zero integers X = |g^{αx}|_p and Y = |g^{αy}|_p is |αx + αf|_{p-1}, where αf = log_g |1 + g^{|αy - αx|_{p-1}}|_p.
As an example, for modulus 31, the base can be 3. Thus, indices exist for all elements of GF(31). Note that X + Y can be found from the indices. As an example, for X = 15, Y = 4, we have the indices αx = 21 and αy = 18, meaning 3^21 mod 31 = 15 and 3^18 mod 31 = 4. Thus, we have the index of the result (X + Y) as α = |21 + log_3 |1 + 3^{|18-21|_30}|_31|_30 = |21 + 13|_30 = 4. Taking the anti-logarithm, we get the result as 3^4 mod 31 = 19.
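The index tables and the addition rule for this example can be sketched in Python (assuming X + Y is non-zero mod p, since zero has no index):

```python
# Index-calculus addition in GF(31) with base g = 3, as in the example.
p, g = 31, 3
log_table = {pow(g, k, p): k for k in range(p - 1)}   # discrete logs base 3

def add_by_indices(ax, ay):
    """Index of |g^ax + g^ay|_p (assumes the sum is non-zero)."""
    af = log_table[(1 + pow(g, (ay - ax) % (p - 1), p)) % p]
    return (ax + af) % (p - 1)

ax, ay = log_table[15], log_table[4]    # 21 and 18
idx = add_by_indices(ax, ay)
print(idx, pow(g, idx, p))              # 4 19
```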
In the case of a modulus of the form p^m, e.g. 3^3, we can express the given integer as g^α p^β. Thus, the 27 elements can be represented by the indices (α, β) using the bases g = 2, p = 3. As an example, the integer 7 corresponds to (16, 0) whereas the integer 9 corresponds to (0, 2). In this case, the result of the addition of X = |g^{αx} p^{βx}|_{p^m} and Y = |g^{αy} p^{βy}|_{p^m} is also of the same form (α, β):

(α, β) = (|αx + αf|_{φ(p^m)}, βx + βf)  for βy ≥ βx
(α, β) = (|αy + αf|_{φ(p^m)}, βy + βf)  otherwise        (8.11a)

where

(αf, βf) = lm_{(g,p)} (|1 + g^{|(-1)^s (αy - αx)|_{φ(p^m)}} p^{(-1)^s (βy - βx)}|_{p^m})        (8.11b)

with s = 0 for βy ≥ βx and s = 1 otherwise. Note that φ(z) is the number of integers less than z and prime to it.
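A Python sketch of this two-component index arithmetic for p^m = 27 follows; the helper lm searches for a representation by brute force, purely for illustration (a hardware realization would use LUTs):

```python
# Two-component index arithmetic modulo 3^3 = 27 with g = 2: every
# non-zero element is written as |2^alpha * 3^beta|_27.
m, g, p, phi = 27, 2, 3, 18            # phi(27) = 18

def lm(x):
    """Some (alpha, beta) with x = |2^alpha * 3^beta|_27, x != 0."""
    for b in range(3):
        for a in range(phi):
            if pow(g, a, m) * p**b % m == x:
                return a, b
    raise ValueError(x)

def add(ix, iy):
    """Index pair of the sum of two elements given as index pairs."""
    (ax, bx), (ay, by) = ix, iy
    if by < bx:                         # reduce to the beta_y >= beta_x case
        (ax, bx), (ay, by) = (ay, by), (ax, bx)
    t = (1 + pow(g, (ay - ax) % phi, m) * p**(by - bx)) % m
    af, bf = lm(t)                      # assumes the sum is non-zero
    return (ax + af) % phi, bx + bf

print(lm(7), lm(9))                     # (16, 0) (0, 2)
print(add(lm(7), lm(9)))                # (4, 0): 2^4 = 16 = 7 + 9 mod 27
```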
The authors have also considered the case GF(2^m), where the given integer X can be expressed as |2^α 5^β (-1)^γ|_{2^m}. As an example, for GF(2^5), 7 can be expressed as (α, β, γ) = (0, 2, 1). Check that 5^2 mod 32 = 25; the negative sign makes it -25 ≡ 7 mod 32. In this case also, the computation can be easily carried out for (αf, βf, γf). The reader may refer to [36] for more information.
In the case of GF(p), it has been pointed out [37] that memory can be reduced by storing only certain indices; the rest can be obtained by shifting or addition. For p = 31, we need to store indices only for 2, 5, 7, 11, 13, 17, 19, 23 and 29. For example, for the integer 28, we can obtain the logarithm as log_3 28 = log_3 (2^2 × 7) = 2 log_3 2 + log_3 7.
The Residue Logarithmic Number System (RLNS) [38] represents real values as quantized logarithms which are in turn represented in RNS. This leads to faster multiplication and division using table look-ups. There is no overflow detection possibility in RLNS; hence X_max and X_min shall be well within the dynamic range of the RLNS. RLNS addition is difficult, whereas multiplication is faster than in any other system. The reader is referred to [38] for more information.
References
1. W.K. Jenkins, J.V. Krogmeier, The design of dual-mode complex signal processors based on
Quadratic modular number codes. IEEE Trans. Circuits Syst. 34, 354–364 (1987)
2. G.A. Jullien, R. Krishnan, W.C. Miller, Complex digital signal processing over finite rings.
IEEE Trans. Circuits Syst. 34, 365–377 (1987)
3. M.A. Soderstrand, G.D. Poe, Application of quadratic-like complex residue number systems to
ultrasonics, in IEEE International Conference on ASSP, vol. 2, pp. 28.A5.1–28.A5.4 (1984)
4. R. Krishnan, G.A. Jullien, W.C. Miller, Complex digital signal processing using quadratic
residue number systems. IEEE Trans. ASSP 34, 166–177 (1986)
5. V. Paliouras, T. Stouraitis, Novel high radix residue number system architectures. IEEE Trans.
Circuits Syst. II 47, 1059–1073 (2000)
6. V. Paliouras, T. Stouraitis, Novel high-radix residue number system multipliers and adders, in
Proceedings of ISCAS, pp. 451–454 (1999)
7. I. Kouretas, V. Paliouras, A low-complexity high-radix RNS multiplier. IEEE Trans. Circuits
Syst. Regul. Pap. 56, 2449–2462 (2009)
8. I. Kouretas, V. Paliouras, High radix redundant circuits for RNS moduli r^n - 1, r^n and r^n + 1, in
Proceedings of IEEE ISCAS, vol. V, pp. 229–232 (2003)
9. I. Kouretas, V. Paliouras, High-radix r^n - 1 modulo multipliers and adders, in Proceedings of 9th
IEEE International Conference on Electronics, Circuits and Systems, vol. II, pp. 561–564
(2002)
10. M. Abdallah, A. Skavantzos, On multi-moduli residue number systems with moduli of the
form r^a, r^b - 1 and r^c + 1. IEEE Trans. Circuits Syst. 52, 1253–1266 (2005)
11. A. Skavantzos, F.J. Taylor, On the polynomial residue number system. IEEE Trans. Signal
Process. 39, 376–382 (1991)
12. A. Skavantzos, T. Stouraitis, Polynomial residue complex signal processing. IEEE Trans.
Circuits Syst. 40, 342–344 (1993)
13. M.C. Yang, J.L. Wu, A new interpretation of “Polynomial Residue Number System”. IEEE
Trans. Signal Process. 42, 2190–2191 (1994)
14. V. Paliouras, A. Skavantzos, T. Stouraitis, Multi-voltage low power convolvers using the
Polynomial Residue Number System, in Proceedings 12th ACM Great Lakes Symposium on
VLSI, pp. 7–11 (2002)
15. M. Abdallah, A. Skavantzos, The multipolynomial Channel Polynomial Residue Arithmetic
System. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 46, 165–171 (1999)
16. V. Paliouras, A. Skavantzos, Novel forward and inverse PRNS converters of reduced compu-
tational complexity, in 36th Asilomar Conference on Signals, Systems and Computers,
pp. 1603–1607 (2002)
17. H.C. Shyu, T.K. Truong, I.S. Reed, A complex integer multiplier using the quadratic-
polynomial residue number system with numbers of form 2^2n + 1. IEEE Trans. Comput.
C-36, 1255–1258 (1987)
18. A. Skavantzos, N. Mitash, Implementation issues of 2-dimensional polynomial multipliers for
signal processing using residue arithmetic, in IEE Proceedings-E, vol. 140, pp. 45–53 (1993)
19. A. Skavantzos, N. Mitash, Computing large polynomial products using modular arithmetic.
IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 39, 252–254 (1992)
20. B. Singh, M.U. Siddiqui, Multivariate polynomial products over modular rings using residue
arithmetic. IEEE Trans. Signal Process. 43, 1310–1312 (1995)
21. P.E. Beckmann, B.R. Musicus, Fast fault-tolerant digital convolution using a polynomial
Residue Number System. IEEE Trans. Signal Process. 41, 2300–2313 (1993)
22. M.G. Parker, M. Benaissa, GF(p^m) multiplication using Polynomial Residue Number Systems.
IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 42, 718–721 (1995)
23. J. Chu, M. Benaissa, Polynomial residue number system GF(2m) multiplier using trinomials, in
17th European Signal Processing Conference (EUSIPCO 2009), Glasgow, Scotland,
pp. 958–962 (2009)
24. J. Chu, M. Benaissa, GF(2m) Multiplier using Polynomial Residue Number System, in IEEE
Asia Pacific Conference on Circuits and Systems, pp. 1514–1517 (2008)
25. J. Chu, M. Benaissa, A novel architecture of implementing error detecting AES using PRNS, in
14th Euromicro Conference on Digital System Design, pp. 667–673 (2011)
26. N.M. Wigley, G.A. Jullien, D. Reaume, Large dynamic range computations over small finite
rings. IEEE Trans. Comput. 43, 78–86 (1994)
27. L. Imbert, G.A. Jullien, Fault tolerant computation of large inner products. Electron. Lett. 37,
551–552 (2001)
28. L. Imbert, V. Dimitrov, G.A. Jullien, Fault-tolerant computations over replicated finite rings.
IEEE Trans. Circuits Syst. I Fundam. Theory Appl. 50, 858–864 (2003)
29. D.E. Knuth, The Art of Computer Programming. Seminumerical algorithms, vol. 2, 3rd edn.
(Addison-Wesley, Boston, 1997)
30. L. Imbert, G. A. Jullien. Efficient fault-tolerant arithmetic using a symmetrical modulus
replication RNS. in 2001 IEEE Workshop on Signal Processing Systems, Design and Imple-
mentation, SIPS’01, pp. 93–100 (2001)
31. L. Imbert, G.A. Jullien, V. Dimitrov, A. Garg, Fault tolerant complex FIR filter architectures
using a redundant MRRNS, in Conference Records of The 35th Asilomar Conference on
Signals, Systems, and Computers, vol. 2, pp. 1222–1226 (2001)
32. N.M. Wigley, G.A. Jullien, On modulus replication for residue arithmetic computations of
complex inner products. IEEE Trans. Comput. 39(8), 1065–1076 (1990)
33. P. Chan, G.A. Jullien, L. Imbert, V. Dimitrov, G.H. McGibney, Fault-tolerant computations
within complex FIR filters, in 2004 IEEE Workshop on Signal Processing Systems, Design and
Implementation, SIPS’04, pp. 316–320 (2004)
34. I. Steiner, P. Chan, L. Imbert, G.A. Jullien, V. Dimitrov, G.H. McGibney, A fault-tolerant
modulus replication complex FIR filter, in Proceedings 16th IEEE International Conference
on Application-Specific Systems, Architecture Processors, ASAP’05, pp. 387–392 (2005)
35. C. Radhakrishnan, W.K. Jenkins, Z. Raza, R.M. Nickel, Fault tolerant Fermat Number
Transform domain adaptive filters based on modulus replication RNS architectures, in Pro-
ceedings of 24th Asilomar Conference on Signals, Systems and Computers, pp. 1365–1369
(2009)
36. A.P. Preethy, D. Radhakrishnan, RNS-based logarithmic adder, in IEE Proceedings—Com-
puters and Digital Techniques, vol. 147, pp. 283–287 (2000)
37. M.L. Gardner, L. Yu, J.W. Muthumbi, O.B. Mbowe, D. Radhakrishnan, A.P. Preethy, ROM
efficient logarithmic addition in RNS, in Proceedings of 7th International Symposium on
Consumer Electronics, ISCE-2003, Sydney (December 2003)
38. M.G. Arnold, The residue Logarithmic Number system: theory and implementation, in 17th
IEEE Symposium on Computer Arithmetic, Cape Code, pp. 196–205 (2005)
Further Reading
J.H. Cozzens, L.A. Fenkelstein, Computing the discrete Fourier transform using residue number
systems in a ring of algebraic integers. IEEE Trans. Inf. Theory 31, 580–588 (1985)
H.K. Garg, F.V.C. Mendis, On fault-tolerant Polynomial Residue Number systems, in Conference
Record of the 31st Asilomar Conference on Signals, Systems and Computers, pp. 206–209
(1997)
G.A. Jullien, W. Luo, N.M. Wigley, High throughput VLSI DSP using replicated finite rings,
J. VLSI Signal Process. 14(2), 207–220 (1996)
J.B. Martens, M.C. Vanwormhoudt, Convolutions of long integer sequences by means of number
theoretic transforms over residue class polynomial rings. IEEE Trans. Acoust. Speech Signal
Process. 31, 1125–1134 (1983)
J.D. Mellott, J.C. Smith, F.J. Taylor, The Gauss machine: a Galois enhanced Quadratic residue
Number system Systolic array, in Proceedings of 11th Symposium on Computer Arithmetic,
pp. 156–162 (1993)
M. Shahkarami, G.A. Jullien, R. Muscedere, B. Li, W.C. Miller, General purpose FIR filter arrays
using optimized redundancy over direct product polynomial rings, in 32nd Asilomar Confer-
ence on Signals, Systems & Computers, vol. 2, pp. 1209–1213 (1998)
N. Wigley, G.A. Jullien, W.C. Miller, The modular replication RNS (MRRNS): a comparative
study, in Proceedings of 24th Asilomar Conference on Signals, Systems and Computers,
pp. 836–840 (1990)
N. Wigley, G.A. Jullien, D. Reaume, W.C. Miller, Small moduli replications in the MRRNS, in
Proceedings of the 10th IEEE Symposium on Computer Arithmetic, Grenoble, France, June
26–28, pp. 92–99 (1991)
G.S. Zelniker, F.J. Taylor, Prime blocklength discrete Fourier transforms using the Polynomial
Residue Number System, in 24th Asilomar Conference on Signals, Systems and Computers,
pp. 314–318 (1990)
G.S. Zelniker, F.J. Taylor, On the reduction in multiplicative complexity achieved by the Poly-
nomial Residue Number System. IEEE Trans. Signal Process. 40, 2318–2320 (1992)
Chapter 9
Applications of RNS in Signal Processing
Several applications of RNS for realizing FIR filters, Digital signal processors and
digital communication systems have been described in the literature. In this chapter, these will be reviewed.
9.1 FIR Filters
FIR (Finite Impulse Response) filters based on ROM-based multipliers using RNS
have been described by Jenkins and Leon [1]. The coefficients of the L-tap FIR filter
and input samples are in RNS form and the multiplications and accumulation
needed for FIR filter operation

y(n) = Σ_{k=0}^{L-1} h(k) x(n - k)        (9.1)
are carried out in RNS for all the j moduli. In order to avoid overflow, the dynamic
range of the RNS shall be chosen to be greater than the worst case weighted sum of
the products of the given coefficients of the FIR filter and maximum amplitudes of
samples of the input signal. Note that each modulus channel has a multiplier based
on combinational logic or ROM to compute h(k) × x(n - k) and an accumulator mod m_i to find |y(n)|_{m_i}. After the accumulation in L steps following (9.1), the result, which is in residue form, needs to be converted into binary form using a
RNS to binary converter following one of several methods outlined in Chapter 5.
Jenkins and Leon also suggest that instead of weighting the inputsamples
 by the
1
coefficients in RNS form, the coefficients h(k) can be multiplied by where
M i mi
M
Mi ¼ and stored. These modified coefficients h(k) can be multiplied by the
mi
samples and accumulated, so that CRT can be easily performed by multiplying the
final accumulated residues (y(n))_mi corresponding to modulus channel m_i with M_i
and summing all the results mod M to obtain the final result.
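This pre-scaled coefficient scheme can be sketched behaviorally in a few lines (an illustrative model, not the ROM-based hardware; the function and variable names are my own):

```python
from math import prod

def rns_fir_output(h, x, moduli):
    """FIR output via per-channel modular MACs with coefficients pre-scaled
    by (1/Mi) mod mi, followed by CRT as sum(ri * Mi) mod M."""
    M = prod(moduli)
    residues = []
    for m in moduli:
        Mi = M // m
        inv = pow(Mi, -1, m)                # multiplicative inverse (1/Mi) mod mi
        hp = [(c * inv) % m for c in h]     # pre-scaled coefficients
        acc = 0
        for c, s in zip(hp, x):             # modular multiply-accumulate
            acc = (acc + c * (s % m)) % m
        residues.append(acc)
    # CRT needs no per-term inverse here: it was folded into the coefficients
    return sum(r * (M // m) for r, m in zip(residues, moduli)) % M

moduli = [19, 23, 29, 31]
h, x = [127, -61], [30, 97]                 # a first-order filter, one output
y = rns_fir_output(h, x, moduli)
assert y == (127 * 30 - 61 * 97) % prod(moduli)
```

Negative taps are handled naturally, since Python's `%` always returns a non-negative residue.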
Note that instead of using multipliers, a bit-slice approach [2] can be used. We
consider all moduli of length b bits without loss of generality. In this method, the
MSBs of all the final accumulated residues r_i = (y(n))_mi for i = 0, 1, . . ., j − 1
address a ROM to get the intermediate word Σ_{i=0}^{j−1} r_{i,b−1}·M_i. Next, this
is multiplied by 2 and the intermediate word corresponding to the next bits
(i.e. b − 2, . . ., 0) of all the residues is obtained from the ROM and accumulated.
This process continues b times and finally, a modulo M reduction needs to be
carried out, where M is the product of the moduli.
As an illustration, consider the realization of a first-order filter y(n) = h_0·x(n)
+ h_1·x(n − 1) using the RNS {19, 23, 29, 31}. The coefficients are h_0 = 127 =
(13, 12, 11, 3) and h_1 = −61 = (15, 8, 26, 1). Consider the input samples u(n) =
30 = (11, 7, 1, 30) and u(n − 1) = 97 = (2, 5, 10, 4). The multiplicative inverses
needed in CRT are (1/M_1)_m1 = 4, (1/M_2)_m2 = 20, (1/M_3)_m3 = 22 and
(1/M_4)_m4 = 5. Note also that M = 392,863, M_1 = 23 × 29 × 31 = 20,677,
M_2 = 19 × 29 × 31 = 17,081, M_3 = 19 × 23 × 31 = 13,547 and M_4 =
19 × 23 × 29 = 12,673. Multiplying the coefficients with the multiplicative
inverses, we have the modified coefficients as h′_0 = (14, 10, 10, 15) and
h′_1 = (3, 22, 21, 5). The result after weighting these with the input samples and
summing yields y(n) = (8, 19, 17, 5). These in binary form are 01000, 10011,
10001, 00101. The MSBs can be seen to be 0, 1, 1, 0, which when weighted with
M_1, M_2, M_3, M_4, respectively, yield 30,628. Doubling and adding the word
corresponding to the next significant bits (1, 0, 0, 0) yields 2 × 30,628 + 20,677 =
81,933. Repeating this for the remaining bit positions (whose ROM words are
12,673, 17,081 and 43,301) gives (((81,933 × 2 + 12,673) × 2 + 17,081) × 2 +
43,301) mod 392,863 = 783,619 mod 392,863 = 390,756, i.e. y(n) = −2,107, as
expected since 127 × 30 − 61 × 97 = −2,107.
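The bit-slice conversion can be modelled as follows (a behavioral sketch; in hardware the per-step `word` comes from a ROM addressed by one bit of each residue):

```python
from math import prod

def bitslice_to_binary(residues, moduli, b):
    """Bit-slice RNS-to-binary conversion [2]: step t takes the t-th bits of
    all residues, looks up the matching sum of the Mi (a ROM in hardware),
    and accumulates with doubling; one final mod-M reduction remains."""
    M = prod(moduli)
    Mi = [M // m for m in moduli]
    acc = 0
    for t in range(b - 1, -1, -1):          # from MSB (bit b-1) down to LSB
        word = sum(Mi[i] for i, r in enumerate(residues) if (r >> t) & 1)
        acc = 2 * acc + word
    return acc % M

moduli = [19, 23, 29, 31]
res = [8, 19, 17, 5]                        # accumulated residues of y(n)
y = bitslice_to_binary(res, moduli, b=5)
assert y == sum(r * (prod(moduli) // m) for r, m in zip(res, moduli)) % prod(moduli)
```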
Vergos [3] has described a 200 MHz RNS core which can perform FIR filtering
using the moduli set {2^32 − 1, 2^32, 2^32 + 1}. The architecture is shown in
Figure 9.1. It has a front-end binary to RNS converter (residue generator) for the
channels 2^32 − 1 and 2^32 + 1. The modulo 2^n and modulo (2^n − 1) channel
blocks perform modulo multiplication and modulo addition. The authors have used
the Kalampoukas et al. [4] modulo (2^32 − 1) adder and the Piestrak architecture
based on a CSA followed by a modulo adder for the modulo (2^n + 1) converter [5],
modified to give the residue in diminished-1 form. The modulo (2^32 − 1) multiplier
is six-stage pipelined. For the multiplier mod (2^n + 1), they have used a Booth
algorithm-based architecture with Wallace tree-based partial product reduction. The
modulo (2^n + 1) multiplier is based on diminished-1 arithmetic due to
Zimmermann [6] and is also six-stage pipelined. Considering input conversion, the
execution frequency is 50 MHz, whereas without considering input conversion, the
execution frequency is 200 MHz.
Figure 9.1 Architecture of the RNS core (adapted from [3] with permission from H.T. Vergos)
Re et al. [7] have considered RNS-based implementation of FIR filters in
transpose structure. A typical tap in such a filter comprises a modulo multiplier and
a modulo adder (see Figure 9.2). The modulo multiplier was realized using index
calculus, whereas the modulo adder was realized as a simple adder with the carry
and save outputs kept separate. The modulo reduction is done in the last stage of
the FIR filter. The number of registers needed increases, however, due to the
separate carry and save outputs. It is evident that the length of the carry and sum
vectors grows gradually, thus making the final modulo reduction slightly complex.
As an alternative, they have suggested reduction in an intermediate stage called a
relay station. The authors have observed that for RNS and CS-RNS (RNS using the
carry-save scheme), the area and power dissipation are less than those of TCS
(traditional two's complement) filters. The authors have used the moduli set
{3, 5, 7, 11, 17, 64} and for the moduli 3 and 64, no index calculus-based multiplier
was used. For the modulus 3, a LUT-based multiplier is used, whereas for the
modulus 64, the six LSBs of a conventional 6 × 6 multiplier are chosen. The authors
have also suggested power supply voltage reduction for moduli processors of low
delay, e.g. for moduli 3, 5, 7 and 64. This has achieved a power reduction of 15 %.

Figure 9.2 Tap structure for the RNS carry-save architecture (adapted from [7] ©IEEE 2001)
Cardarilli et al. [8] have also estimated the effect of coding overhead on the
performance of FIR filters. For a given dynamic range of b bits, the chosen RNS
may have a dynamic range of d bits where d > b. The difference (d − b) is called the
coding overhead. As an example, for a 12-bit desired DR, if the moduli set chosen is
{5, 7, 11, 17}, whose dynamic range is 15 bits (estimated by adding the number of
bits needed to represent each modulus), the coding overhead is 3 bits. Cardarilli
et al. [8] have considered direct form and transpose type FIR filters (see
Figure 9.3a, b) with multipliers based on isomorphism and modulo adders. In one
experiment, the authors first choose moduli sets with M > 2^d. They choose the one
among these which has the largest b. In another experiment, for a target clock
period, the RNS base (moduli set) which needs minimum area or lowest power
dissipation is found. These two experiments do not include RNS to binary and
binary to RNS converters. The third experiment considers complete filters. Filters
up to a dynamic range (DR) of 48 bits, e.g. {23, 29, 31, 37, 43, 47, 53, 59, 61}, have
been studied. All these have demonstrated that RNS filters are generally superior to
TCS filters.
Del Re et al. [9] have described an automated tool for RNS FIR filter design
taking into account the dynamic range, filter order and timing constraint.
Cardarilli et al. [10] have implemented RNS-based FIR filters in 90 nm
technology based on the transpose structure. They have used carry-save
representation of the product and index calculus-based multipliers. They have used
two types of cells, HS (high-speed cells with low VT) and LL (low-leakage cells
with high VT). For moduli having low delay, the available time is used to substitute
HS cells with LL cells. The authors have reported a reduction of 40 % in dynamic
power and 50 % in static power without delay penalty.
Bernocchi et al. [11, 12] have described a hybrid RNS low-pass adaptive filter
for channel equalization applications in fixed wireless systems. The block diagram
is shown in Figure 9.4. Note that the error calculation and tap-weight updating are
done in binary, converted to RNS form and given to the FIR filter. They have used a
Figure 9.3 RNS FIR filters in (a) direct form and (b) in transpose form (adapted from [8] ©IEEE2007)

Figure 9.4 Adaptive filter hybrid architecture (adapted from [11] ©IEEE2006)
192-tap, 32-bit dynamic range filter using the moduli set {5, 7, 11, 13, 17, 19, 23,
128}. At a clock frequency of 200 MHz, when compared with TCS, an area and
power dissipation saving of 50 % has been demonstrated.
Conway and Nelson [13] suggested FIR filter realization using powers-of-two
related moduli sets with moduli of the form 2^n − 1, 2^n, 2^n + 1. These moduli can
be selected to minimize area and critical path. The multipliers make use of the
periodic properties of the moduli, following the Wrzyszcz and Milford multiplier
for modulus (2^n + 1) [14] and multipliers for mod 2^n and 2^n − 1. Note, however,
that the transpose structure has been used and the modulo reduction is not
performed; instead, the product bits and the carry and sum outputs from the
previous transpose stage are added using a CSA tree. Using cost details such as the
number of full-adders, flip-flops, area and time, appropriate moduli sets have been
selected by exhaustive search. The authors have shown that these designs are
attractive over designs using the three moduli set {2^n − 1, 2^n, 2^n + 1}.
As an illustration, for a 24-bit dynamic range, the moduli sets {5, 7, 17, 31, 32,
33} and {255, 256, 257} have areas of 280 and 417 units and delays of 5 and
7 units, respectively. They have shown that a gain in area-delay product of 35–60 %
for 16-tap filters with dynamic ranges of 20–40 bits could be achieved.
Wang [15] has described a bit-serial VLSI implementation of an RNS digital
N-tap FIR filter using a linear array of N cells. The output sequence of an N-tap FIR
filter is given as

y_n = Σ_{i=0}^{N−1} a_i·x_{n−i},  n = 0, 1, 2, . . .    (9.2a)

where a_i are the coefficients and x_i (i = 0, 1, 2, . . .) represent the sequence of
input samples. Assuming a_i and x_i are B-bit words, (9.2a) can be rewritten as

y_{n,j} = | Σ_{b=0}^{B−1} S^b_{n,j}·2^b |_mj    (9.2b)

with

S^b_{n,j} = | Σ_{i=0}^{N−1} a_{i,j}·x^b_{n−i,j} |_mj    (9.2c)

where j stands for the modulus m_j and the superscript b indicates the bth bit of the
binary representation of x_{n−i,j}. Note that S^b_{n,j} can be computed recursively as

T^b_{n,j}(i) = T^b_{n,j}(i−1) + x^b_{n−i,j}·a′_{i,j} + x^b_{n−i,j}·C̄^b_{n,j}(i)·m_j  for i = 0, . . ., N − 1

where C̄^b_{n,j} is the complement of the carry generated in adding the first two
terms. Note that a′_{i,j} = a_{i,j} + 2^B − m_j. In the processor cell, old samples are
multiplied by a_i and
added with the latest sample weighted by a_{i−1}. The FIR filter architecture is
shown in Figure 9.5a, in which the bits of x enter serially in word-serial bit-serial
form to the Nth cell, are delayed by B − 2 clock cycles using the delay blocks D1
and move to the next cell. The extra clock cycle is used for clearing the accumulator
addressing the ROM, which performs the modulo shift-and-add operation so that
fresh evaluation of y_n can commence. The cell architecture is shown in Figure 9.5b.

Figure 9.5 (a) Basic cell, (b) a hybrid VLSI architecture using (a) for an RNS FIR sub-filter and (c) alternative FIR filter architecture (adapted from [15] ©IEEE1994)
Note that the multiplication function is realized by AND gates. The cells contain
a′_{i,j} = a_{i,j} + r_j, where r_j = 2^B − m_j. The cell computes (α_j + x·β_j)
mod m_j as the B LSBs of α_j + x·(β_j + r_j) + x·c̄·m_j, where c̄ is the complement
of the carry generated in adding the first two terms.
In the alternative architecture shown in Figure 9.5c, the input enters at the top.
Note that this structure has unidirectional data flow, unlike that in Figure 9.5a. The
second set of latches in Figure 9.5b can be removed. The reader is urged to refer to
[15] regarding the detailed timing requirements for both the structures of
Figure 9.5a, c.
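The bit-plane decomposition (9.2b)/(9.2c) can be checked behaviorally (this models the arithmetic only, not Wang's cell-level recursion; names and test values are illustrative):

```python
def bitplane_fir_mod(a, x, m, B):
    """y = sum(a[i]*x[i]) mod m evaluated by bit planes of the samples:
    S_b = sum of a[i] over samples whose bit b is set (9.2c),
    then y = sum(S_b * 2^b) mod m (9.2b)."""
    y = 0
    for b in range(B):
        S_b = sum(a[i] for i in range(len(a)) if (x[i] >> b) & 1) % m
        y = (y + S_b * pow(2, b, m)) % m
    return y

a, x, m, B = [13, 7, 21], [5, 30, 12], 29, 5
assert bitplane_fir_mod(a, x, m, B) == sum(ai * xi for ai, xi in zip(a, x)) % m
```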
Lamacchia and Redinbo [16] have described RNS digital filtering for wafer-scale
integration. They have used index calculus-based multipliers, and modulo addition
is performed using conventional binary adders. They observe that RNS is well
suited for wafer-scale integration, since parallel architectures can be gainfully
employed.
Bajard et al. [17] have considered implementation of digital filters using RNS.
They observe that the ρ-direct form transpose filter structure has the advantage of
needing small coefficient word lengths. A typical ρ-direct form block is shown in
Figure 9.6a, wherein the delay in the transpose-form structure is replaced by a lossy
integrator (see Figure 9.6b).

Figure 9.6 (a) Generalized ρ-direct form II structure and (b) realization of the operator ρ⁻¹_i (adapted from [17] ©IEEE2011)

Note that the parameters Δ_i and γ_i can be chosen so as to
minimize the transfer function sensitivity and round-off noise. They observe that
5-bit coefficients suffice for realizing a sixth-order Butterworth filter, whereas for a
direct form I realization, 15-bit coefficients are required. The popular three-moduli
set {2^n − 1, 2^n, 2^n + 1} has been used. The ρ-direct form needs 10-bit variables,
5-bit coefficients and 15-bit adders. The RNS used was {31, 32, 33}. Multipliers
were implemented using LUTs. A conventional direct form I filter using the moduli
set {4095, 4096, 4097} has also been designed. They observe that FPGA-based
designs for IIR filters appear to be unattractive for RNS applications over
fixed-point designs. The considerations for FPGAs are different from those of
ASICs, since in FPGAs fast carry chains make the adders efficient and ripple carry
adders are quite fast and compact. The authors conclude that ρ-direct form designs
are superior to direct form realizations in both the fixed-point and RNS cases.
Patronik et al. [18] have described design techniques for fast and energy-efficient
constant-coefficient FIR filters using RNS. They consider several aspects:
coefficient representation, techniques for sharing sub-expressions in the multiplier
block (MB), and optimized usage of RNS in the hardware design of the MB and the
accumulation pipeline. A common sub-expression elimination (CSE) technique has
been used for synthesis of RNS-based filters. Two's complement arithmetic has
been used. Four- and five-moduli RNS have been considered.
Multiple constant multiplications (MCM) need to be performed in the transpose
FIR filter structure (see Figure 9.7), where the MB block is shown in dotted lines.
The constant coefficients can be represented in canonical signed digit (CSD)
representation, wherein a minimum number of non-zero digits exists, since strings
of ones are substituted using the digits 1 and −1, and remaining pairs of the type
1 −1 are replaced by 0 1. The resulting words can obey all the mod (2^n + 1)
operations. As an illustration, 27 can be written as 011011 = (10−110−1)_SD =
(100−10−1)_CSD. However, using the periodicity property, we can write (27)_31 as
−(000100) = −4.
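The recoding and the periodicity reduction for the 27 illustration can be checked as follows (standard non-adjacent-form recoding is assumed here to produce the canonical form):

```python
def csd(value):
    """Canonical signed-digit (non-adjacent form) recoding, LSB first."""
    digits = []
    n = value
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)          # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

d = csd(27)                           # [-1, 0, -1, 0, 0, 1], i.e. 32 - 4 - 1
assert sum(di * 2**i for i, di in enumerate(d)) == 27
assert all(d[i] == 0 or d[i + 1] == 0 for i in range(len(d) - 1))
# periodicity mod 2^5 - 1 = 31: the weight-32 digit folds onto weight 1,
# giving 1 - 4 - 1 = -4, and -4 mod 31 = 27
folded = d[5] + sum(di * 2**i for i, di in enumerate(d[:5]))
assert folded == -4 and folded % 31 == 27
```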
The authors next use the level-constrained CSE (LCCSE) algorithm of [19] to
compute modular MCMs. The coefficients are decomposed into shifts n1 and n2 of
two values d1 and d2, written as c_k = ±d1·2^n1 ± d2·2^n2, as desired. The authors
modify this technique by specifying the bases b_i, choosing k such that the values
of b_i = |2^k·c_i|_(2^n−1) or b_i = |2^k·c_i|_(2^n+1) are minimized. These are
next decomposed so as to take into account their modular interpretation,
c_k = |±d1·2^n1 ± d2·2^n2|_(2^n−1) or c_k = |±d1·2^n1 ± d2·2^n2|_(2^n+1).

Figure 9.7 Transposed-form FIR filter structure (adapted from [18] ©IEEE2011)

The
coefficients can share one base or different bases. Given bases, any coefficient
can be obtained by bit rotation and inversion. As an illustration, coefficients 5 and
29 share the same base: 29 = −(5 × 2^5) mod 63. This step may result in a very
compact CSD form.
Carry-save-adder stages together with carry-propagate adders need to be used to
realize the multiplier block. In the optimization of this multiplier block, the output
of any adder is a result of multiplication by some constant. These intermediate
results are denoted as "fundamentals". Fundamentals of the type (2^i − 2^j)·x_k are
created by simply assigning (f_c, f_s) = (2^i·x_k, −2^j·x_k). On the other hand,
fundamentals of the form 2^i·x_k ± a·x_k, where a·x_k is a fundamental in the CS
form, can be added using a CSA layer. Next, fundamentals of the type
(a·x_k + b·x_k), where both are in CS form, need the addition of two levels of CSA.
The modulo (2^n + 1)
adders are slightly complicated. The authors have used the mod (2^n − 1) adder due
to Patel et al. [20]. Note that in the filter pipeline, each stage adds two values, one
from the product pipeline and another from the previous pipeline stage. These can
be reduced mod (2^n − 1) or mod (2^n + 1) therein itself. However, the carry bit is
    
zj ¼ zj, n1 , , . . . zj, 0 þ uj, n1 , , . . . uj, 0 þ zj, n modð2n  1Þ ð9:3aÞ

and
    
zj ¼ zj, n1 , , . . . zj, 0 þ uj, n1 , , . . . uj, 0 þ zj, n  2 þ uj, n modð2n þ 1Þ ð9:3bÞ

where uj ¼ xkcj. Note that zj is the accumulated filter value and uj is output of the
multiplier block. A n-bit adder with carry-in can be used for computing (9.3a),
whereas a CSA is used to compute (9.3b) to add two n-bit vectors and two bits
except for 2 term. The constant term 2 can be added to co. The authors have
shown that, for the benchmark filters of [21], using four- or five-moduli sets, RNS
designs have better area and power efficiency than TCS designs.
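The identity underlying (9.3a) and (9.3b) — a value held as an n-bit word w plus a separate carry bit c satisfies w + c·2^n ≡ w + c mod (2^n − 1) and ≡ w − c mod (2^n + 1) — can be verified numerically (a check of the congruences only, not of the gate-level adder):

```python
import random

n = 8
random.seed(1)
for _ in range(1000):
    v = random.randrange(2**(n + 1))     # n-bit word w plus one carry bit c
    w, c = v & (2**n - 1), v >> n
    # 2^n = 1 mod (2^n - 1): the carry simply re-enters at weight 1 (9.3a)
    assert v % (2**n - 1) == (w + c) % (2**n - 1)
    # 2^n = -1 mod (2^n + 1): the carry is subtracted; writing -c as (1-c) - 1
    # yields the complemented-bit-minus-constant form of (9.3b)
    assert v % (2**n + 1) == (w - c) % (2**n + 1)
```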
Garcia, Meyer-Baese and Taylor [22] have described the implementation of
cascade integrator comb (CIC) filters (also known as Hogenauer filters) using RNS.
The architecture shown in Figure 9.8a is a three-stage CIC filter consisting of a
three-stage integrator (blocks labeled I), sampling rate reduction by R and a
three-stage comb (blocks labeled C). The realized transfer function is given as

H(z) = ((1 − z^−RD) / (1 − z^−1))^S    (9.4)

where S is the number of stages, R is the decimation ratio and D is the delay of the
comb response. As an illustration, for D = 2, R = 32, we have RD = 64. The
maximum value of the transfer function occurs at dc and is (RD)^S. For RD = 64
and S = 3, we have for an 8-bit input an internal word length of
8 + 3·log2(64) = 26 bits. The output word length can, however, be small. It has
been shown that the
Figure 9.8 (a) CIC filter, (b) detailed design with base removal scaling (BRS) and (c) BRS and ε-CRT conversion steps (adapted from [22] ©IEEE1998)
lower significant bits in the early stages can be dropped without sacrificing the
system performance. The architecture of Figure 9.8a has been implemented using
an RNS based on the moduli set {256, 63, 61, 59}. The output scaling has been
implemented using ε-CRT in one technique, which needs 8 tables and 3 two's
complement adders. In another technique, the base removal scaling (BRS)
procedure based on two 6-bit moduli (using MRC) and ε-CRT scaling [23] of the
remaining two moduli is illustrated in Figure 9.8b. The scaling architecture is as
shown in Figure 9.8c and needs 9 ROMs and 5 modulo adders. The authors have
shown an increase in speed over designs using ε-CRT only.
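The dc gain and word-growth figures quoted above can be reproduced by convolving the S equivalent boxcar responses (a numerical check, not the decimated hardware structure):

```python
def convolve(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

RD, S = 64, 3
h = [1]
for _ in range(S):                # cascade of S length-RD boxcar stages
    h = convolve(h, [1] * RD)
dc_gain = sum(h)
assert dc_gain == RD**S == 262144          # maximum of H(z), at dc
assert 8 + (RD**S - 1).bit_length() == 26  # internal word length, 8-bit input
```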
Cardarilli et al. [24] have described fast IIR filters using RNS. The recursion
equation is first multiplied by N, an integer related to the quantization step. A
three-moduli system {128, 15, 17} has been used. The recursion equation
corresponding to the channel with modulus m1 is given as

X_m1(k) = U_m1(k) + Y_m1(k−1)·A_m1 + Y_m1(k−2)·B_m1    (9.5)

where A, B are coefficients, y is the delayed output and u is the input, which are
scaled up by N, and k is the time index. Similar expressions for the other two
modulo channels can be written. The authors perform the conversion using a double
modulo method (DMM). In this technique, two CRTs are used for the moduli sets
{m1, m21} and {m1, m22} to yield Y_{m1·m21} and Y_{m1·m22}. Another CRT
follows for the moduli set {m1·m21, m1·m22}. The final result is divided by m1 to
get a scaled output.
Jenkins [25] has suggested the use of four different moduli sets which facilitate
easy scaling by one of the moduli for designing IIR filters where scaling will be
required. These are (a) {m, m − 1}, (b) {2^(n+k) − 1, 2^k}, (c) {m − 1, m, m + 1}
and (d) {2^k, 2^k − 1, 2^(k−1) − 1}. The respective scale factors are m − 1, 2^k,
(m − 1)(m + 1) and 2^k·(2^(k−1) − 1). The results are as follows:
(a) {m, m − 1} Class I

y_s(n) = (y_2(n) − y_1(n)) mod 2^k    (9.6a)

when m = 2^k.
(b) {2^(n+k) − 1, 2^k} Class II

y_s(n) = (2^n·(y_1(n) − y_2(n))) mod (2^(n+k) − 1)    (9.6b)

(c) {m − 1, m, m + 1} Class III

y_s(n) = (y′_1(n) + y′_2(n) + y′_3(n)) mod m    (9.6c)

where y′_i(n) = |(1/M_i)·y_i(n)|_mi.
(d) {2^k, 2^k − 1, 2^(k−1) − 1} Class IV

y_s(n) = (y′_1(n) + y′_2(n) + 2·y′_3(n)) mod (2^k − 1)    (9.6d)
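The Class I rule (9.6a) can be checked exhaustively for {16, 15}: over the legal dynamic range, (y2 − y1) mod 2^k recovers ⌊y/(m − 1)⌋, i.e. the output scaled by m − 1 (this reading of the scaled output is my interpretation of the formula):

```python
k = 4
m = 2**k                          # moduli {m, m - 1} = {16, 15}, scale factor 15
for y in range(m * (m - 1)):      # the full dynamic range
    y1, y2 = y % m, y % (m - 1)
    ys = (y2 - y1) % m            # eq. (9.6a)
    assert ys == y // (m - 1)     # the output scaled (divided) by m - 1
```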
Note that base extension needs to be done. In the case of four-moduli sets
{m1, m2, m3, m4} as well, Jenkins [25] has shown that the scaled answer can be
derived by application of CRT as

y_s(n) = y(n)/(m3·m4) = (f_1(y_1, y_2) + f_2(y_3, y_4)) mod (m1·m2)    (9.7a)

where

f_1(y_1, y_2) = | m2·|(1/M_1)·y_1|_m1 + m1·|(1/M_2)·y_2|_m2 |_(m1·m2)    (9.7b)

and

f_2(y_3, y_4) = | (m1·m2/m3)·|(1/M_3)·y_3|_m3 + (m1·m2/m4)·|(1/M_4)·y_4|_m4 |_(m1·m2)    (9.7c)
Etzel and Jenkins [26] have suggested several other residue classes (moduli sets)
amenable for use in recursive filters, facilitating easy scaling and base extension.
These are Class V {2^k − 1, 2^(k−1) − 1}, Class VI {2^k − 1, 2^(k−1),
2^(k−1) − 1}, Class VII {2^(k+1) − 3, 2^k − 1}, Class VIII {2^(k+1) − 3,
2^(k−1) − 1}, Class IX {2^(k+1) − 5, 2^k − 3}, Class X {2^(k+1) − 5,
2^(k−1) − 1}, Class XI {2^(2k+2) − 5, 2^(2k−1) − 1} and Class XII
{2^(k+1) − 3, 2^k − 1, 2^(k−1) − 1}.
Nannarelli et al. [27] have suggested techniques for reducing power consumption
in modulo adders. They suggest the architecture of Figure 9.9a, wherein the
possibility of the sum exceeding the modulus is predicted. Using this information,
only A + B or A + B − m_i is computed, thus reducing the power consumption.
However, the prediction can only say whether definitely (A + B) > m_i or
(A + B) < m_i in some cases. In other cases, both the parallel paths need to work.
For m = 11, as an illustration, the prediction function can choose the left or right
path using the following logic:

F_R = ā_3·b̄_3·(ā_2 + b̄_2)  for a ≤ 7 and b ≤ 3 or vice versa
F_L = a_3·b_2 + a_2·b_3 + a_3·b_3  for a ≥ 8 and b ≥ 4 or vice versa
F(a, b, 11): enable the right path if F_R = 1, enable the left path if F_L = 1,
and enable both otherwise.
As an example, for mod 11, ambiguity exists in 33 % of the cases, whereas in 40 %
of the cases A + B is the correct answer and in 27 % of the cases A + B − m_i is the
correct answer.
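The prediction logic and the quoted 40/27/33 % split can be verified exhaustively (the complemented bits in F_R follow my reconstruction of the formula, whose overbars were lost in typesetting):

```python
m = 11
bit = lambda v, i: (v >> i) & 1
fr = fl = both = 0
for a in range(m):
    for b in range(m):
        # FR: the sum is guaranteed < 11, so A + B alone suffices
        FR = (1 - bit(a, 3)) and (1 - bit(b, 3)) and ((1 - bit(a, 2)) or (1 - bit(b, 2)))
        # FL: the sum is guaranteed >= 11, so A + B - m alone suffices
        FL = bit(a, 3) & bit(b, 2) or bit(a, 2) & bit(b, 3) or bit(a, 3) & bit(b, 3)
        if FR:
            assert a + b < m
            fr += 1
        elif FL:
            assert a + b >= m
            fl += 1
        else:
            both += 1             # ambiguous: both paths must be enabled
assert (fr, fl, both) == (48, 33, 40)   # ~40 %, ~27 %, ~33 % of the 121 pairs
```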
In the case of multipliers using index calculus, the architecture is shown in
Figure 9.9b. Of the two adders in parallel in the modulo adder, one is eliminated
and the modulo reduction is incorporated in the inverse isomorphic transformation
(IIT) table. If (x + y − m_I) is positive, where m_I = m − 1, we access the normal
IIT table. Otherwise, the modified table is accessed, in which the IIT is addressed
by (x + y − m_I) mod 2^k, where k = ⌈log2 m_I⌉. The IIT table, however, is
complex in
Figure 9.9 (a) Modified modular adder, (b) modified isomorphic multiplier and (c) modification of (b) (adapted from [27] ©IEEE2003)

this case (since the number of entries is doubled), but the multiplexer in the modulo
adder is eliminated. Another modification is possible (see Figure 9.9c). In the right
DIT (direct isomorphic transformation) table, instead of addressing y, one can
address y − m_I. Using an n-bit adder, w = x + y − m_I can be computed. If w is
positive, we access the normal IIT table; else, the modified table IIT* is addressed.
This modification eliminates the CSA in the critical path present in Figure 9.9b.
Note that when one of the operands is zero, there is no isomorphic correspondence
and the modular adder has to be bypassed, as shown using the detector blocks in
Figure 9.9b, c. The authors show that power is reduced by 15 % and delay is shorter
by 20 % for a mod 11 adder.
Cardarilli et al. [28] have described a reconfigurable data path using RNS. The
basic reconfigurable cell is shown in Figure 9.10. The block "isodir" converts the
input residues using an isomorphic transformation, so that the multiplication
operation is converted into an addition problem, as explained before. The powers of
the chosen radix r cover the range 0 to m_i − 1, where m_i is prime. For the addition
operation, the "isodir" block is bypassed using multiplexers.
Figure 9.10 Reconfigurable basic cell (adapted from [28] ©IEEE2002)
There is an “isoinv” block after the multiplier to convert the result back to
conventional residue form. A 32-bit dynamic range was realized using the moduli
set {13, 17, 19, 23, 29, 31, 64}. Sixty four processing elements are used which can
be configured to realize applications such as FIR filtering, bi-dimensional convo-
lution and pattern matching. The dynamic range can be reduced by shutting off
some moduli channels. AMS 0.35 μm technology was used. The RNS processor is
25 % faster than TCS and consumes 30 % less power at the same frequency.
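The isodir/isoinv (index-calculus) mapping can be sketched for a small prime (m = 13 with primitive root 2 is an illustrative choice, not necessarily the one used in the paper):

```python
m, g = 13, 2                      # a prime modulus and a primitive root (radix)
powers = [pow(g, e, m) for e in range(m - 1)]   # "isoinv": index -> residue
index = {v: e for e, v in enumerate(powers)}    # "isodir": residue -> index

def iso_mult(a, b):
    """Index-calculus multiplication: add indices mod (m - 1), map back.
    Zero has no index, so it bypasses the cell, as in Figure 9.10."""
    if a == 0 or b == 0:
        return 0
    return powers[(index[a] + index[b]) % (m - 1)]

assert all(iso_mult(a, b) == (a * b) % m for a in range(m) for b in range(m))
```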
The reduction of power consumption at the arithmetic level can be achieved by
using RNS. Cardarilli et al. [29] have compared TCS and RNS implementations of
FIR filters in ASICs. They observe that, in general, area and power consumption
can be represented as

A_x = k_1^Ax + k_2^Ax·N_TAP    (9.8a)

and

P_x = k_1^Px + k_2^Px·N_TAP    (9.8b)

where x refers to the number system used and N_TAP is the number of taps in the
FIR filter. Note that k_1 is the offset of the plots representing (9.8) and k_2 is the
growth rate. RNS has a large offset because of the presence of the RNS to binary
and binary to RNS converters. On the other hand, the RNS slopes are less steep than
the TCS ones. In VLSI designs, RNS reduces the interconnection capacitances and
complexity. On the other hand, in FPGA implementations, power consumption due
to
interconnections plays an important role, rather than the clocking structure, logic
and IOBs (input/output blocks). They observe that since RNS has local
interconnections, it is very advantageous in FPGAs, in addition to the complexity
reduction. RNS thus allows power reduction in both ASICs and FPGAs.
Nannarelli et al. [30] have shown that transpose-type RNS FIR filters have high
latency due to the conversions at the front end and back end. They can, however, be
clocked at the same rate as TCS filters and can give the same throughput. RNS
filters are smaller and consume less power than the TCS type when the number of
taps is larger than 8 for a coefficient size of 10 bits. In the power dissipation aspect
as well, RNS filters are better. For direct form FIR filters too, with more than
16 taps, RNS is faster. Thus, RNS filters can perform at the same speed but with
lower area and lower power consumption. Note that transposed-form FIR filters
give better performance at the expense of larger area and power dissipation.
Mahesh and Mehendale [31] have suggested low power FIR filter realization by
coefficient encoding and coefficient ordering so as to minimize the switching
activity in the coefficient memory address buses that feed to the modulo MAC
units. Next, they suggest reordering the coefficients and data pairs so that the total
Hamming distance between successive values of coefficient residues is minimized.
Freking and Parhi [32] have considered the case of using the hybrid RNS-binary
arithmetic suggested by Ibrahim [33] and later investigated by Parhami [34]. They
have considered hardware cost, switching activity reduction and supply voltage
reduction for the complete processing unit. They observed that the hardware cost is
not affected much by the number of taps in the FIR filter, since the conversion
overhead remains constant. Hence, the basic hardware cost shall be less than that of
the binary implementations. Considering the use of a FIR unit comprising adders
needing an area of n² + n full-adders, assuming a binary multiplier with no
rounding operation, an RNS with r moduli and a dynamic range of 2n bits needs an
area of r·(4n²/r² + 4n/r) = 4n²/r + 4n full-adders using the architecture of
Figure 9.11b. Hence, if this needs to be less than n² + n, we obtain the condition
r > 4n/(n − 3). Thus, a five-moduli system may be optimal.
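The crossover condition can be checked numerically (the full-adder counts follow the estimate reconstructed above and are approximate):

```python
def rns_fir_area(n, r):
    """Full-adder estimate for r channels of word length 2n/r:
    r * ((2n/r)**2 + 2*(2n/r)) = 4n^2/r + 4n."""
    return 4 * n**2 / r + 4 * n

n = 16                                    # 32-bit dynamic range
binary_area = n**2 + n
assert rns_fir_area(n, 4) > binary_area   # four moduli: still larger
assert rns_fir_area(n, 5) < binary_area   # five moduli: below the binary cost
assert 5 > 4 * n / (n - 3)                # the r > 4n/(n - 3) condition
```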
A direct form unit cell in a FIR filter can be realized in RNS form as shown in
Figure 9.11a. Ibrahim [33] has suggested that the modulo reduction can be deferred
to a later stage instead of being performed in each multiplier/accumulator cell.
However, the word length grows with each accumulation operation. One solution
suggested for this problem is shown in Figure 9.11b, wherein integer multiples of
the modulus can be added or subtracted at any point without affecting the result.
Another solution (see Figure 9.11c [34]) uses a different choice of correction factor,
which is applied at the MSBs.
Since the MSBs of the residues have a different probability distribution than the
LSBs, the switching activity in RNS is reduced considerably, by up to 38 %, for
small word-length moduli. Finally, since the word length of the MAC unit is less,
the supply voltage can be reduced, as the critical path is smaller and increases as the
log of the word length, whereas in the binary case it is linear. Next, considering the
input and output conversion overhead, the power reduction factor is substantial for
a large number of taps, e.g. up to 3.5 times for 128 taps.
Figure 9.11 (a) Direct form FIR RNS unit cell, (b) deferred reduction concept and (c) modified version of (b) with correction applied to MSBs only (adapted from [32] ©IEEE1997)

Cardarilli et al. [35] have described a FIR filter architecture wherein the
coefficients and samples are assumed to be in residue form. They suggest scaling of
the coefficients by a factor 2^−h to obtain

⟨2^−h·Y(n)⟩_mi = ⟨ Σ_{k=0}^{P} ⟨2^−h·A_k⟩_mi · ⟨X(n−k)⟩_mi ⟩_mi    (9.9)
Figure 9.12 Reconfigurable FIR filter (adapted from [37] ©IEEE2008)

where the number of taps is P. The inner summation has a dynamic range given by $(P+1)m_i$. The authors suggest modulo reduction and post-scaling by $2^h$ by adding or subtracting $\alpha m_i$; the value of $\alpha$ can be obtained by a LUT lookup of the h LSBs of the summation. The reader may observe that this is nothing but Montgomery's technique [36].
Smitha and Vinod [37] have described a reconfigurable FIR filter for software
radio applications. In this design (see Figure 9.12), the multipliers used for com-
puting product of the coefficient of the FIR filter with the sample are realized using
a product encoder. The product encoder can be configured to meet different FIR
filter requirements so as to meet different standards. The encoder takes advantage of the fact that the output values of the multiplier can only lie between 0 and $(m_i - 1)$. The
coefficients can be stored in LUTs for various standards. The authors have shown
area and time improvement over conventional designs using modulo multipliers.
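The idea of replacing the modulo multiplier with a per-coefficient lookup can be sketched as follows. This is a behavioral model with an arbitrary modulus and coefficients, not the encoder logic of [37]:

```python
# Per-coefficient product encoder: since |h_k * x|_m can only take m values,
# each tap stores a LUT indexed by the input residue x in [0, m).
def make_encoder(h_k, m):
    return [(h_k * x) % m for x in range(m)]

def fir_channel(coeffs, samples, m):
    luts = [make_encoder(h, m) for h in coeffs]   # reloaded per standard
    acc = 0
    for lut, x in zip(luts, samples):
        acc = (acc + lut[x % m]) % m              # LUT replaces the multiplier
    return acc

m = 31
coeffs, samples = [3, 7, 12, 30], [5, 29, 17, 2]
direct = sum(h * x for h, x in zip(coeffs, samples)) % m
assert fir_channel(coeffs, samples, m) == direct
```

Reconfiguring the filter for a different standard amounts to reloading the LUT contents.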
Parallel fixed-coefficient FIR filters can be realized using an interpolation technique or using Number Theoretic Transforms (NTT). A 2-parallel FIR filter is shown in Figure 9.13a [38]. This uses moduli of the form $(2^n - 1)$ and $(2^n + 1)$. In an
alternative technique, in order to reduce the polynomial multiplication complexity,
parallel filtering using NTT is employed (see Figure 9.13b). Conway [38] has
investigated both these for RNS and has shown that RNS-based designs have low
complexity.
Lee and Jenkins [39] have described a VLSI adaptive equalizer using RNS
which contains a binary to RNS converter, a RNS to binary converter, RNS multipliers and RNS adders, and coefficient update using the LMS algorithm. They use a
hybrid design wherein the error calculation is performed in binary. The block
diagram is presented in Figure 9.14. The authors use an approximate CRT
(ACRT). Note that in ACRT, we compute

Figure 9.13 (a) Structure of 2-parallel FIR filter and (b) parallel filtering using the transform approach for reducing polynomial multiplication complexity (adapted from [38] ©IEEE2008)

Figure 9.14 RNS implementation of modified LMS algorithm (adapted from [39] ©IEEE1998)

$$\left| \frac{X \cdot 2^d}{M} \right| = \left| \sum_{j=1}^{N} \left\lfloor \frac{\left| r_j / M_j \right|_{m_j} \cdot 2^d}{m_j} \right\rfloor \right|_{2^d} = \left| \sum_{j=1}^{N} R(k) \right|_{2^d} \qquad (9.10)$$

where $R(k) = \left\lfloor \frac{k \cdot 2^d}{m_j} \right\rfloor$ and $k = \left| r_j / M_j \right|_{m_j}$ is an integer such that $0 \le k < m_j$. Note that the R(k) are stored in ROM.
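The ACRT of (9.10) replaces each exact fraction $|r_j/M_j|_{m_j}/m_j$ by a d-bit truncation, so the scaled result is obtained from ROM entries and a modulo-$2^d$ sum. A numerical sketch (the moduli and d are arbitrary choices) showing the approximation error staying within N least significant units:

```python
# Approximate CRT: |X*2^d / M| from per-channel ROM entries R(k) = floor(k*2^d/m_j),
# with k = |r_j / M_j|_{m_j}, summed modulo 2^d.
from functools import reduce

moduli = [7, 11, 13, 15]            # pairwise coprime (illustrative)
M = reduce(lambda a, b: a * b, moduli)
d = 16
X = 9999
acrt = 0
for m_j in moduli:
    M_j = M // m_j
    k = (X % m_j) * pow(M_j, -1, m_j) % m_j      # 0 <= k < m_j
    acrt = (acrt + (k << d) // m_j) % (1 << d)   # R(k) comes from ROM
exact = (X << d) // M % (1 << d)
err = min((acrt - exact) % (1 << d), (exact - acrt) % (1 << d))
assert err <= len(moduli)           # each floor loses less than one LSB
```

Each channel contributes at most one truncated LSB of error, which is why d is chosen a few bits larger than the needed output precision.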
The binary to residue converter uses the LSBs and MSBs of the input word to address two ROMs to get the residues, and these two outputs are next added using a modulo adder. The modulo multiplier is based on the quarter-square concept and uses ROMs to obtain $(A+B)^2$ and $(A-B)^2$, followed by an adder.
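The quarter-square identity $AB = \lfloor (A+B)^2/4 \rfloor - \lfloor (A-B)^2/4 \rfloor$ (exact, since $A+B$ and $A-B$ always have the same parity) lets two square ROMs and a modular subtraction replace the multiplier. A sketch with an arbitrary modulus:

```python
# Quarter-square modulo multiplier: two ROM lookups plus one modular subtraction.
m = 19
rom = [((s * s) // 4) % m for s in range(2 * m)]  # stores |floor(s^2/4)|_m

def qs_mult(a, b):
    return (rom[a + b] - rom[abs(a - b)]) % m

for a in range(m):
    for b in range(m):
        assert qs_mult(a, b) == (a * b) % m
```

The ROM needs only $2m - 1$ entries, since $0 \le a + b \le 2m - 2$ for residues a, b.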
Shah et al. [40] have described a 2D recursive digital filter using RNS. They consider a general 3 × 3 2D quarter-plane filter with scaling included. The difference equation computed is
$$y(k,l) = \left| \sum_{n_1=0}^{2} \sum_{n_2=0}^{2} \left| s\,a_{n_1 n_2} \right|_r x(k-n_1,\ l-n_2) - \sum_{p_1=0}^{2} \sum_{p_2=0}^{2} \left| s\,b_{p_1 p_2} \right|_r y(k-p_1,\ l-p_2) \right| \qquad (9.11a)$$

where $a_{n_1 n_2}$ and $b_{p_1 p_2}$ are the sets of coefficients that characterize the filter, k and l define the particular position of the sample in the array to be filtered, and $p_1, p_2 \ne 0$ simultaneously. Note that the value of y(k, l) shall be less than or equal to $\frac{1}{2}\prod_{i=0}^{L} m_i$.
Equation (9.11a) can be realized using the architecture of Figure 9.15. The authors suggest scaling by the product of moduli 13 × 11 × 7 and use the moduli set {16, 15, 13, 11, 7}. The scaling is based on the estimates technique due to Jullien [41] described in
Chapter VI using look-up tables. Note that Figure 9.15 realizes the following
equations:
$$\left| y_S(k,l) \right|_{m_i} = \left| F_{2,N}(\cdot) + F_{4,N}(\cdot) + F_{6,N}(\cdot) + F_{2,D}(\cdot) + F_{4,D}(\cdot) + F_{6,D}(\cdot) \right|_{m_i} \qquad (9.11b)$$

where
$$F_{2,N}(\cdot) = \left| \left| a_{00} x(k,l) + a_{01} x(k,\,l-1) \right|_{m_i} + \left| a_{02} x(k,\,l-2) \right|_{m_i} \right|_{m_i} = \left| F_{1N}(\cdot) + \left| a_{02} x(k,\,l-2) \right|_{m_i} \right|_{m_i} \qquad (9.11c)$$

$$F_{4,N}(\cdot) = \left| \left| a_{10} x(k-1,\,l) + a_{11} x(k-1,\,l-1) \right|_{m_i} + \left| a_{12} x(k-1,\,l-2) \right|_{m_i} \right|_{m_i} = \left| F_{3N}(\cdot) + \left| a_{12} x(k-1,\,l-2) \right|_{m_i} \right|_{m_i} \qquad (9.11d)$$

Figure 9.15 An ith section of a 3 × 3 2D residue coded recursive digital filter (adapted from [40] ©IEEE1985)

$$F_{6,N}(\cdot) = \left| \left| a_{20} x(k-2,\,l) + a_{21} x(k-2,\,l-1) \right|_{m_i} + \left| a_{22} x(k-2,\,l-2) \right|_{m_i} \right|_{m_i} = \left| F_{5N}(\cdot) + \left| a_{22} x(k-2,\,l-2) \right|_{m_i} \right|_{m_i} \qquad (9.11e)$$

Shanbag and Siferd [42] have described a 2D FIR filter ASIC with a mask size of 3 × 3 with symmetric coefficients, as shown in Figure 9.16a. The data window is presented in Figure 9.16b. The computation involved is given by
$$\begin{aligned} y(i,j) ={}& A[x(i-1,\,j) + x(i+1,\,j)] + B[x(i,\,j-1) + x(i,\,j+1)] \\ &+ C[x(i-1,\,j-1) + x(i+1,\,j+1)] \\ &+ D[x(i+1,\,j-1) + x(i-1,\,j+1)] + x(i,j) \end{aligned} \qquad (9.12)$$

The data and coefficients need to be represented in RNS form. The authors used the moduli set {13, 11, 9, 7, 5, 4} with a dynamic range of 17.37 bits and a PLA-based multiplier. The IC incorporated a binary to RNS converter realized using PLAs (programmable logic arrays) to find the residues $B_M$ and $B_L$,

Figure 9.16 (a) Coefficient and (b) data windows and (c) filter details (adapted from [42] ©IEEE1991)

where the input number is expressed as $2^L B_M + B_L$. The modulo results of both PLAs are added using a modulo adder. The residue to binary converter used MRC in the first stage to obtain the numbers corresponding to {13, 4}, {11, 5} and {9, 7}. A second stage finds the number corresponding to {52, 55} using MRC. A third stage finds the number corresponding to {2860, 63}. PLAs were used for storing the values of $\left| (r_j - r_i) \left| m_i^{-1} \right|_{m_j} \right|_{m_j}$. The 2D FIR filter architecture is shown in Figure 9.16c,
which implements (9.12).
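The pairwise MRC step used in each stage combines two residues $(r_i, r_j)$ into a single residue modulo $m_i m_j$ using the stored value $|(r_j - r_i)|m_i^{-1}|_{m_j}|_{m_j}$. A sketch of the three-stage tree for this moduli set:

```python
# Pairwise mixed-radix combination: X mod (mi*mj) from (X mod mi, X mod mj).
def mrc_pair(ri, mi, rj, mj):
    return ri + mi * (((rj - ri) * pow(mi, -1, mj)) % mj)

X = 123456                                  # < 13*11*9*7*5*4 = 180180
s1 = [mrc_pair(X % a, a, X % b, b) for a, b in [(13, 4), (11, 5), (9, 7)]]
s2 = mrc_pair(s1[0], 52, s1[1], 55)         # number mod 2860
s3 = mrc_pair(s2, 2860, s1[2], 63)          # number mod 180180
assert s3 == X
```

Each stage halves the number of channels, which is why the conversion finishes in a logarithmic number of steps.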
Soderstrand [43] has described digital ladder filters using lossless discrete
integrator (LDI) transformation [44] and RNS using the moduli set {4, 7, 15}.
The LDI transformation-based resonator is shown in Figure 9.17a together with the
RNS realization in Figure 9.17b. These designs need coefficients and data samples
of 8–10 bits word length while achieving low sensitivity.
Taylor and Huang [45] and Ramnarayan and Taylor [46] have described an auto-scale multiplier which implicitly scales the result by a scaling factor c. For the three-moduli set $\{m_1, m_2, m_3\} = \{2^n - 1, 2^n, 2^n + 1\}$, the decoded scaled number can be written as

Figure 9.17 (a) LDI ladder structure and (b) RNS realization of (a) (adapted from [43] ©IEEE1977)

$$X = X_2 + m_2 J_1 + m_2 m_1 I_1 \qquad (9.13a)$$

where $J_1 = (X_2 - X_1) \bmod m_1$, $J_3 = (X_3 - X_2) \bmod m_3$ and $I_1$ is a function of $J_1 - J_3$. An estimate of X is given as

$$\hat{X} \cong m_2 J_1 + m_2 m_1 I_1 \qquad (9.13b)$$

Then, given a scale factor c, we have

$$[Xc] = [\hat{X} c] = [m_2 J_1 c] + [m_2 m_1 I_1 c] \qquad (9.13c)$$

This formula is denoted the auto-scale algorithm. Note that there can be two sources of error in such scaling: (a) error due to estimating X as $\hat{X}$ and (b) error due to round-off of the final result. An architecture of the auto-scale unit is presented in Figure 9.18. Note that in recursive filters, these units need to be used in order to efficiently manage register overflow.
The supply voltage of CMOS digital circuits can be scaled below the critical supply voltage (CSV) for power saving. This mode of operation is called voltage over-scaling (VOS). Chen and Hu [47] have suggested using RNS together with reduced precision redundancy (RPR); this technique is denoted JRR (Joint RNS-RPR). It has been applied to the design of a 28-tap FIR filter using 0.25 μm 2.5 V CMOS technology to recover from the soft errors due to VOS. In VOS technology, the supply voltage of a DSP system is $V_{dd} = K_v V_{dd\text{-}crit}$ ($0 < K_v \le 1$), where $K_v$ is the voltage over-scaling factor. In the VOS case for a DSP (digital signal processing) system, when the critical path delay $T_{cp}$ is larger than the sample period $T_s$, soft errors will occur due to timing violations. The authors observe that RNS has tolerance up to $K_v = 0.6$ compared to TCS-based implementations. Since RNS has a smaller critical path, it can achieve a lower critical supply voltage (CSV) than a TCS implementation, and lower power consumption can be achieved. The JRR method uses MRC. It uses the fact that the remainder $R_{mi}$ (the decoded word corresponding to the lower moduli) is uncorrelated with the residues of the higher moduli. In a four-moduli system, e.g. $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} - 1\}$, the probability of

Figure 9.18 Auto-scale unit (adapted from [46] ©IEEE1985)

soft errors is higher for the modulus $2^{n+1} - 1$; hence they apply JRR for this modulus. In the full RNS, the quotient $U_{rpr}$ is more precise, whereas in the reduced RNS, the remainder $R_{mi}$ is more precise. They consider the moduli set with n = 7; the width of the RPR channel is 7. The structure of the complete FIR filter is presented in Figure 9.19a, where the modulo sub-filters (see Figure 9.19b) perform the needed computation. A binary to RNS converter and a RNS to binary converter precede and succeed the conventional FIR filter. The RPR unit word length can be n bits. It processes only the n MSBs of the input samples, and modulo reduction is not performed. The 2n-bit word is processed next. The n LSBs are left-shifted to make a (3n − 2)-bit word, from which the RNS filter decoded word corresponding to the three moduli is subtracted; a decision is taken in the block DEC (see Figure 9.19c, d) regarding the correction 0, +1 or −1 to the n MSBs to effectively obtain the higher mixed-radix digit. Next, this value is weighted by the product of the moduli in the RNS using the technique shown in Figure 9.19e, which realizes $(2^{3n} - 2^n)$ by left shifts and subtraction, and is added to $R_{mi}$ to obtain the correct word. The hardware increase is about 9% and the power saving is about 67.5%.
When FIR filters operate in extreme environments like space, they are exposed
to different radiation sources which cause errors in the circuits. One type of such
error is single event upset (SEU) that changes the value of a flip-flop or memory
cell. One way of mitigation is using Redundant Residue Number system (RRNS)
[48]. The normal computation is performed by an n moduli RNS (for example

Figure 9.19 (a) Block diagram of Joint RNS-RPR method in RNS, (b) mod mi FIR filter structure, (c) optimized structure for JRR, (d) structure of DEC unit and (e) amplification block (adapted from [47] ©IEEE2013)

consider n = 3). A redundant channel also processes the signal with modulus m4.
The residue of the output of the 3-moduli channel after RNS to binary conversion
with respect to modulus m4 is computed and compared with that of the channel
pertaining to modulus m4 and if there is a difference between these two, the input
samples and coefficients are reloaded and the FIR filter re-computes the output.
Gao et al. [49] suggested the use of arithmetic residues of a FIR filter, which have lower cost than the actual FIR filter, for catching soft and hard errors. In this scheme, two FIR filters produce the outputs y1 and y2 corresponding to the input samples. We define r1 = y1 mod m and r2 = y2 mod m, and we process the input samples x(n) by finding x(n) mod m in another replica FIR filter to obtain the output r. If y1 = y2, y1 is chosen as the correct output. If y1 ≠ y2, we compare r1, r2 and r to determine whether r1 or r2 agrees with r and choose the correct output. However, if r1 = r2 = r, no decision is possible. The procedure corrects errors as long as they affect a single FIR filter and are such that the residue of the affected branch differs from r; errors violating this go undetected. Moreover, if y1 and y2 are different but r1 equals r2 (for example, y1 = 15, y2 = 22 and m = 7 give r1 = r2 = 1), no decision is possible. For an SEU causing errors in the coefficients, Gao et al. [49] suggested modifying the input samples of a low-pass filter by adding +1 or −1 whenever the input sample satisfies x(n) mod m = 0. The effect of this addition will be filtered out, since the resulting noise lies outside the passband. On the other hand, in the case of bandpass filters, +1 can be added to all input samples, which again results in noise outside the band of interest.
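The voting rule just described can be modeled in a few lines. This is a behavioral sketch of the decision procedure, with an arbitrary check modulus m:

```python
# Arithmetic-residue voting between two filter replicas and a mod-m check filter.
def vote(y1, y2, r, m):
    r1, r2 = y1 % m, y2 % m
    if y1 == y2:
        return y1                 # replicas agree: accept
    if r1 == r2:
        return None               # residues coincide: no decision possible
    if r1 == r:
        return y1                 # y2's branch is faulty
    if r2 == r:
        return y2                 # y1's branch is faulty
    return None                   # both residues differ from r: undecidable

assert vote(40, 40, 40 % 7, 7) == 40      # fault-free case
assert vote(40, 41, 40 % 7, 7) == 40      # error in the second branch corrected
assert vote(15, 22, 1, 7) is None         # y1 != y2 but r1 == r2 == 1
```

The third case reproduces the undecidable example from the text (y1 = 15, y2 = 22, m = 7).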

9.2 RNS-Based Processors

Chaves and Sousa [50] have described a RISC DSP based on RNS, named RDSP (see Figure 9.20). It uses the moduli set $\{2^n - 1, 2^{2n}, 2^n + 1\}$ for n = 8. Even though
one modulus has double the word length, the time for multiplication in all the three
channels is balanced. This processor can support signed values also by adding
M when the input value is negative. The RNS to binary converter used is based on
CRT. Signed RNS decoder subtracts M from the decoded output. The 0.25 μm
ASIC could work at a clock frequency of 200 MHz and has 20–30% area and power reduction over mixed or binary designs. The FPGA Virtex E2000 design could work up to 29 MHz and has smaller occupancy: 15% in the case of RNS, 20% in the case of mixed and 24% in the case of binary designs. In the mixed design, the $2^{2n}$ channel is expanded to 32 bits. In CMOS VLSI, it could work at 250 MHz and the RNS design outperforms the mixed and binary designs. Note that RDSP comprises a RISC having five pipeline stages: Instruction Fetch (IF), Instruction Decode (ID), Instruction Execute 1 (EX1), Instruction Execute 2 (EX2) and Write Back (WB). There is an arithmetic address generation unit (AGU) and a logical unit (LU) to coordinate all the operations.
Fu et al. [51] have described FPGA-based RNS optimization techniques using the moduli set $\{2^n, 2^n - 1, 2^n + 1\}$. The reverse converter yields n-bit outputs, similar to the design of Gallaher et al. [52], by solving the equations

Figure 9.20 RDSP architecture (adapted from [50] ©IEEE 2003)



Figure 9.21 General structure of reverse converter (adapted from [51] ©IEEE 2008)

$$\begin{aligned} A + B + C &= x_1 + C_1 m_1, \quad C_1 \in \{0, 1, 2\} \\ C &= x_2 \\ A - B + C &= x_3 + C_2 m_3, \quad C_2 \in \{-1, 0, 1\} \end{aligned} \qquad (9.14)$$

where $m_1 = 2^n - 1$, $m_3 = 2^n + 1$ and A, B and C are the n-bit words of the final result to be found. The hardware architecture is shown in Figure 9.21, wherein the condition decoder makes use of the parity of $x_1$, $x_3$, $C_1$ and $C_2$. In addition, the condition that 2A and 2B shall both be in the range $(0, 2^{n+1} - 2)$ is also employed. The authors
observe that RNS adders are more expensive than TCS adders due to modulo
reduction needed. On the other hand, RNS multipliers are more economical compared to binary multipliers. The authors also have developed a reconfigurable RNS
arithmetic library generator.
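The system (9.14) follows from writing the result as $X = A \cdot 2^{2n} + B \cdot 2^n + C$ and reducing modulo each channel. A quick numerical check of the three congruences (n and X are arbitrary choices here):

```python
# Verify the congruences behind (9.14) for X = A*2^(2n) + B*2^n + C.
n = 5
m1, m2, m3 = (1 << n) - 1, 1 << n, (1 << n) + 1
X = 123456 % (m1 * m2 * m3)
x1, x2, x3 = X % m1, X % m2, X % m3
A, B, C = (X >> 2 * n) & (m2 - 1), (X >> n) & (m2 - 1), X & (m2 - 1)
assert C == x2                       # 2^n == 0 (mod m2)
assert (A + B + C - x1) % m1 == 0    # 2^n == 1 and 2^(2n) == 1 (mod m1)
assert (A - B + C - x3) % m3 == 0    # 2^n == -1 and 2^(2n) == 1 (mod m3)
```

The converter works the equations in reverse: given $x_1, x_2, x_3$, it solves for the three n-bit digits A, B and C.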
Chokshi et al. [53] investigated the application of RNS to embedded processors. They observe that multi-tier synergistic co-design of the architecture, microarchitecture and hardware components, as well as compilation techniques, will facilitate the realization of efficient processors. They have adopted the moduli set $\{2^n - 1, 2^k, 2^n + 1\}$ with k > n. The conversion and computation are separated so that parallel operation of both functions is possible. The conversion overhead is moved to the software through instruction set augmentations. The authors use the periodicity property of moduli for binary to RNS conversion. The design uses New CRT II for RNS to binary conversion.
The authors recommend avoiding the modulo addition at every step by retaining the two SUM and CARRY vectors, so that the modulo operation can be performed only at the end. The adders can thus be faster for the $2^n - 1$ and $2^n + 1$ channels as well. The multiplier also exploits properties of the redundant representation utilized by the extension. A 32-bit channel has used the moduli $\{2^9 - 1, 2^{15}, 2^9 + 1\}$. The authors have augmented the scalar ARM architecture with RNS arithmetic components.

They have also considered instruction selection and instruction scheduling to reduce run time.
Ramirez et al. [54] have proposed an RNS-enabled DSP using the four-moduli set {256, 251, 241, 239} and a SIMD architecture. The binary to RNS and RNS to binary converters were outside the chip. The modulo processor comprises an index-calculus-based multiplier; the adder/subtractor is built using a cascade design. An index adder and LUTs are used to realize the multiplier. The processor has a carefully selected instruction set to load registers with a constant, load a word into memory, increment or decrement by a constant, multiply by a constant, etc. Several global registers are present which can be used as address pointers to implement address counters, etc. Several FIR and IIR filters (8/16 taps) have been implemented with throughputs up to a few Msamples/s using Altera FPGAs and a 0.35 μm CMOS ASIC.
Perumal and Siferd [55] have described a 50 MHz ASIC, using 1 μm technology, for binary to RNS and RNS to binary conversion for the moduli set {31, 29, 23, 19, 17, 13, 11, 7} with a 32-bit dynamic range. The binary to RNS converter performs the summation of the residues of the various powers of two mod $m_i$. Taking advantage of the residues of the powers of 2 mod $m_i$, the grouping of bits is arranged so that the worst-case sum of the residues is less than $2m_i$. This helps reduce the number of mod $m_i$ adders from 32 to a much lower value. As an illustration, for the modulus 31 a tree of modulo adders of depth three is sufficient, whereas for modulus 19 a depth of 5 is required. The RNS to binary converter takes pairs of residues and computes the decoded word using MRC. This continues for $\log_2 n$ steps to get the final binary word.
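The bit-grouping trick relies on the periodicity of $|2^k|_{m_i}$: grouping the input into chunks whose weights repeat keeps the pre-adder sum small. A software sketch for the modulus 31, whose period is 5 (so a 32-bit word reduces to a sum of seven 5-bit chunks, with a single final reduction standing in for the adder tree):

```python
# Binary-to-residue conversion mod 31 by summing 5-bit groups (2^5 = 1 mod 31).
def residue31(x):
    s = 0
    while x:
        s += x & 31        # each 5-bit chunk has weight |2^(5k)|_31 = 1
        x >>= 5
    return s % 31          # the small chunk sum needs only one reduction

x = 0xDEADBEEF
assert residue31(x) == x % 31
```

For a general modulus the chunk weights are the precomputed residues of the powers of two, but the structure is the same.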
Distributed Arithmetic (DA)-based FIR filters using RNS have been considered in the literature [56–58]. The DA-based designs of FIR filters are independent of the number of taps, since they store in ROM all the possible outputs corresponding to all input bit combinations over all taps. The result is multiplied by 2 and accumulated with the weighted sum corresponding to the next bit of all the input samples, until all the bits in the input samples are covered. The scaling by 2 in RNS followed by addition of the new value needs modulo reduction. Denoting S = 2X + Y, the modulo reduction must distinguish the cases $2X + Y < m_i$, $m_i \le 2X + Y < 2m_i$ and $2m_i \le 2X + Y < 3m_i$. In this technique, the following computation was suggested [56]:

$$\begin{aligned} Y + 2X &= \mathrm{SUM1} + c_0 2^n \\ \mathrm{SUM1} + 2^n - m_i &= \mathrm{SUM2} + c_2 2^n \\ \mathrm{SUM2} + 2^n - m_i &= \mathrm{SUM3} + c_3 2^n \end{aligned} \qquad (9.15)$$

Next, using $c_0$, $c_1$, $c_2$ and $c_3$, a 3:1 MUX selects SUM1, SUM2 or SUM3. Note that $c_1$, $c_2$ and $c_3$ are the carry-out signals of the first, second and third adders, and $c_0$ is the MSB of the first (n + 1)-bit adder. This architecture is presented in Figure 9.22a. Ramirez et al. later used this technique to realize the DWT [57] with the modified arrangement of Figure 9.22b, wherein CSAs have been used to reduce the delay.
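At the behavioral level, the three adders of (9.15) produce the three candidates S, S − m_i and S − 2m_i, and the carries steer the 3:1 MUX to the one lying in [0, m_i). A sketch with an arbitrary modulus:

```python
# Shift-accumulate step of DA-RNS: |2X + Y|_m chosen from three adder outputs.
def shift_acc(x, y, m):
    s = 2 * x + y                       # 0 <= s <= 3m - 3 for residues x, y
    sum1, sum2, sum3 = s, s - m, s - 2 * m
    for cand in (sum1, sum2, sum3):     # the MUX picks the candidate in range
        if 0 <= cand < m:
            return cand

m = 29
for x in range(m):
    for y in range(m):
        assert shift_acc(x, y, m) == (2 * x + y) % m
```

In hardware the range test is free: the carry-out bits of the three adders encode exactly which candidate is valid.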
Recently, Vun et al. [58] have suggested use of one-hot residue (OHR) coding
for simplifying the doubling and modulo reduction operation. The authors use
Figure 9.22 (a) Distributed Arithmetic scheme for RNS applications and (b) RNS DA-based 1D DWT architecture ((a) adapted from [56] ©IEEE 1999; (b) adapted from [57] with permission from Springer 2003)

Figure 9.23 (a) OHR-based modulo 7 adder and (b) TCR-based DA-RNS with OHR modulo accumulator (adapted from [58] ©IEEE 2013)

a thermometer code format for the input residues, whereas the output data is encoded in the one-hot format. The modular adders can be implemented using a single shifter-based circuit utilizing the one-hot coded format. The modulo operation is performed automatically during the addition process.

An OHR modulo-7 adder is shown in Figure 9.23a, where one input is in OHR and the other is in binary format. A TCR (thermometer code encoded residue) based DA-RNS with OHR modulo accumulator for one channel is shown in Figure 9.23b.
Note that the DA LUT has $2^K$ entries for a K-tap filter, the input samples are sent serially, and the number of cycles needed to obtain one output sample of the FIR filter is $(m_i - 1)$.
The authors also extend the technique to take two bits at a time (2BAAT) by cascading two OHR modular adders to sum the two DA LUT outputs, each driven by one group of bit streams allocated from the TCR.
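In one-hot coding, a residue a mod m is a length-m vector with a single 1 in position a, and adding b reduces to rotating that vector by b places; the modulo operation is absorbed by the wrap-around of the rotation. A sketch for m = 7:

```python
# One-hot residue (OHR) modulo-7 addition as a cyclic shift of the one-hot vector.
m = 7

def to_ohr(a):
    return [1 if i == a else 0 for i in range(m)]

def ohr_add(vec, b):
    return vec[-b % m:] + vec[:-b % m]   # rotate right by b positions

for a in range(m):
    for b in range(m):
        assert ohr_add(to_ohr(a), b) == to_ohr((a + b) % m)
```

This is why the OHR adder in Figure 9.23a is just a column of multiplexers acting as a barrel shifter: no carry chain and no explicit comparison against the modulus are needed.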

9.3 RNS Applications in DFT, FFT, DCT, DWT

Cardarilli et al. [59] have described a QRNS realization of complex FIR filters using the transpose structure. The architecture is shown in Figure 9.24b. It comprises two RNS filters in parallel, and each RNS filter is decomposed into P filters working in parallel, where P is the number of moduli. The RNS {5, 13, 17, 29, 41} has been used for a dynamic range of 20 bits. Note that this is simpler than conventional architectures needing four multiplications per tap (see Figure 9.24a). The multipliers are
implemented using index calculus. The authors have compared the design with TCS and other implementations and have shown that the QRNS-based design consumes less power and needs less area.

Figure 9.24 (a) Structure of a tap in a TCS FIR filter and (b) QRNS FIR filter architecture (adapted from [59] ©IEEE2008)
Taylor [60] has described a single-modulus complex ALU using QRNS, as against multi-moduli-set based systems. He has used Gaussian primes (e.g. 5, 17, 257, 65537) as well as composite Gaussian primes (85 = 5 × 17, 1285 = 5 × 257, 4369 = 17 × 257, 21845 = 5 × 17 × 257) of the type $2^k + 1$ as the modulus. A single-modulus ALU has the advantage of trivial magnitude scaling,
sign detection and overflow detection as against multi-moduli RNS.
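For a prime p ≡ 1 (mod 4) there is a j with j² ≡ −1 (mod p), and the QRNS map (a, b) → (a + jb, a − jb) turns one complex multiplication into two independent modular multiplications. A sketch with p = 13 (where j = 5, since 25 ≡ −1 mod 13):

```python
# QRNS: complex arithmetic mod p via the isomorphism (a + ib) -> (a + jb, a - jb).
p, j = 13, 5
inv2, inv2j = pow(2, -1, p), pow(2 * j, -1, p)

def to_qrns(a, b):
    return ((a + j * b) % p, (a - j * b) % p)

def from_qrns(z, zs):
    return ((z + zs) * inv2 % p, (z - zs) * inv2j % p)

# (2 + 3i)(4 + 5i) = 8 + 10i + 12i + 15i^2 = -7 + 22i
z1, z2 = to_qrns(2, 3), to_qrns(4, 5)
prod = (z1[0] * z2[0] % p, z1[1] * z2[1] % p)   # componentwise: 2 real multiplies
assert from_qrns(*prod) == (-7 % p, 22 % p)
```

The componentwise product is what cuts the four real multiplications per complex tap down to two.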

Cardarilli et al. [61] have suggested an implementation of polyphase filters using QRNS. The moduli set used was {13, 17, 29, 37, 41, 53, 61} to realize a dynamic range of 34 bits. The architecture is shown in Figure 9.25a. The filter is divided into two structures for X and $\hat{X}$ (both QRNS-coded outputs) plus the input and output conversion blocks. Each RNS path had eight FIR filters to cater for the eight channels and an Inverse Discrete Fourier Transform (IDFT) block. The 8-point
channels and an Inverse Discrete Fourier Transform (IDFT) block. The 8-point
IDFT implemented by DIF (Decimation in Frequency) algorithm needed 12 butter-
flies in three stages. Each butterfly needed two LUTs for multiplications. The
multiplications needed for FIR filters were implemented using index calculus.
The use of QRNS reduced the needed complex multiplications operation to just
two real multiplications as described before. The binary to QRNS and QRNS to
binary converters were included in the hardware. The needed application was for
satellite Digital Video broadcasting (DVB) system which required 367 complex-tap
filter for realizing a Kaiser window with 0.02 dB in-band ripple and 43 dB out of
band attenuation. The filters could be with fixed coefficients or programmable
12-bit coefficients. Truncation after multiplication also was used. TCS version
also was implemented in AMS 0.35 Micron technology. The truncation was
achieved by using QRNS to binary conversion and after truncation again binary
to QRNS conversion to perform IDFT subsequently (see Figure 9.25b). The authors
show that the area and power are less than those for TCS implementation.
Cardarilli et al. [62] have described realization of a 128 channel polyphase filter
derived from a 1024-tap prototype filter. This uses QRNS and the moduli set {13,
17, 29, 37, 41}. The block diagram is presented in Figure 9.26. The front end is a
binary to QRNS converter. It is followed by a decimator and 128 QRNS-based
8-tap FIR filters. The FIR filters used a single MACC (complex multiply and
accumulate) QRNS unit. The output dynamic range is 23 bits. Since truncation
cannot be done directly on QRNS channels, the channel outputs have been
converted into conventional form and scaled to yield a 17-bit word. The 17-bit
output is converted into QRNS again with base extension to have a dynamic range
of 35 bits using the additional moduli 53 and 61 in the block denoted CTBE (conversion plus truncation and base extension). A seven-moduli RNS-based IDFT is realized using a serial architecture. One hundred and twenty-eight complex multipliers are used, and two adder trees accumulate the results for the real and imaginary parts. The authors have shown that while both the conventional complex two's complement system (CTCS) and the QRNS-based design occupy the same area, the power dissipation in the case of the QRNS-based design is 50% lower. Note that the QRNS
filters have higher latency due to the I/O conversions.
D’Amora et al. [63] have used similar techniques for realizing complex digital filters using QRNS and have demonstrated that, for the same throughput rate, the power dissipation is one-third and the area is half of that of CRNS (complex RNS) filters, whereas the latency is higher. Stouraitis and Paliouras [64] have suggested the use of QRNS for low-power designs.

Figure 9.25 (a) Polyphase filter bank and (b) filter with truncated dynamic range (adapted from [61] ©IEEE2004)

Figure 9.25 (continued)

DCT Computation

Fernandez et al. [65] have suggested a straightforward implementation of the 8-point RNS-based 1D-DCT using the 5-bit moduli {32, 31, 29, 27}. The data is 8-bit wide and the eight coefficients are each 9-bit wide. A typical modulus channel is illustrated in Figure 9.27. The input data is multiplied by the fixed coefficient, and a modular accumulator accumulates all the products. The modular adder used a two-stage pipeline scheme. The implementation was on an Altera Flex10K device and has higher throughput than a distributed arithmetic processor.
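Each modulus channel of Figure 9.27 is just a modular multiply-accumulate over the eight samples, with the (scaled, integer) coefficients held in a LUT. A behavioral sketch, where the coefficient values are arbitrary integers standing in for the scaled cosine values:

```python
# One modulus channel of an RNS 1D-DCT transform point: MAC of 8 samples.
def dct_channel(samples, coeff_lut, m):
    acc = 0
    for x, c in zip(samples, coeff_lut):
        acc = (acc + x * c) % m         # modular accumulator of Figure 9.27
    return acc

m = 31
samples = [12, 200, 33, 7, 90, 151, 64, 255]        # 8-bit data
coeffs = [181, 251, 213, 142, 60, -60, -142, -213]  # placeholder 9-bit values
expected = sum(x * c for x, c in zip(samples, coeffs)) % m
assert dct_channel(samples, coeffs, m) == expected
```

One such channel per modulus and per transform point runs in parallel; the final CRT or MRC recombination recovers the full-range result.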
Several authors [66–69] have described DCT computation using QRNS. The N-point 1D DCT [67] of a sequence {x(0), x(1), ..., x(N − 1)} is given by

$$X(m) = \sqrt{\frac{2}{N}}\, K_m \sum_{n=0}^{N-1} x(n) \cos\frac{m(2n+1)\pi}{2N}, \qquad m = 0, 1, \ldots, N-1 \qquad (9.16)$$

where $K_0 = \frac{1}{\sqrt{2}}$ and $K_1 = K_2 = \cdots = K_{N-1} = 1$.

Figure 9.26 QRNS polyphase filter architecture (adapted from [62] ©IEEE2010)

Figure 9.27 Modulo mi channel for one transform point of an RNS-based 1D-DCT processor (adapted from [65] ©IEEE1999)

Ramirez et al. [66] use the fact that the N-point DCT can be computed through the real part of the 2N-point DFT scaled by a complex exponential constant as follows:

$$X(m) = \sqrt{\frac{2}{N}}\, K_m\, \mathrm{Re}\!\left\{ e^{-\frac{jm\pi}{2N}} \sum_{n=0}^{2N-1} x(n)\, W_{2N}^{mn} \right\} \qquad (9.17)$$

where $W_{2N} = e^{-j2\pi/2N}$ and $x(n) = 0$ for $n = N, N+1, \ldots, 2N-1$.

Initially, the N-point input sequence {x(0), x(1), x(2), ..., x(N − 1)} is reordered into the sequence {y(0), y(1), y(2), ..., y(N − 1)} defined by

$$y(n) = x(2n), \qquad y(N-n-1) = x(2n+1), \qquad n = 0, 1, \ldots, \frac{N}{2}-1 \qquad (9.18)$$

Let {Y(0), Y(1), Y(2), ..., Y(N − 1)} be the DFT of the sequence {y(0), y(1), y(2), ..., y(N − 1)}. The DCT sequence {X(0), X(1), X(2), ..., X(N − 1)} of the original sequence can be obtained through the real part of Z(n) [71], defined as

$$Z(n) = H_n Y(n) = \sqrt{\frac{2}{N}}\, K_n\, W_{4N}^{\,n}\, Y(n) \qquad (9.19)$$

where $W_{4N} = e^{-j2\pi/4N}$. By using the property $Z(N-n) = -jZ^{*}(n)$, i.e., $\mathrm{Re}[Z(N-n)] = -\mathrm{Im}[Z(n)]$, it is necessary to compute only the values Z(0), Z(1), ..., Z(3N/4 − 1). The N-point DCT sequence is given by {Re[Z(0)], Re[Z(1)], ..., Re[Z(N/4)], −Im[Z(3N/4 − 1)], −Im[Z(3N/4 − 2)], ..., −Im[Z(N/2 + 1)], Re[Z(N/2)], Re[Z(N/2 + 1)], ..., Re[Z(3N/4 − 1)], ..., −Im[Z(1)]}.
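The reordering (9.18) plus the twist of (9.19) is easy to verify numerically: the unnormalized DCT of x is the real part of $e^{-j\pi m/2N}\,Y(m)$. A floating-point sketch (this checks the identity itself, not a QRNS implementation; the input values are arbitrary):

```python
# N-point DCT from the N-point DFT of the reordered sequence (9.18)-(9.19).
import cmath
import math

N = 8
x = [12.0, -5.0, 7.0, 3.0, 0.5, -2.0, 9.0, 1.0]
y = [0.0] * N
for n in range(N // 2):
    y[n] = x[2 * n]                  # even-indexed samples in order
    y[N - n - 1] = x[2 * n + 1]      # odd-indexed samples reversed

for m in range(N):
    Ym = sum(y[n] * cmath.exp(-2j * cmath.pi * n * m / N) for n in range(N))
    via_dft = (cmath.exp(-1j * cmath.pi * m / (2 * N)) * Ym).real
    direct = sum(x[n] * math.cos(m * (2 * n + 1) * math.pi / (2 * N))
                 for n in range(N))
    assert abs(via_dft - direct) < 1e-9
```

This is what makes the fast DFT structures (and hence the QRNS butterflies below) directly usable for the DCT.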
The fast algorithms known for the DFT can be used for fast computation of the DCT. A QRNS butterfly for computation of a DIF radix-2 DFT is shown in Figure 9.28a. Note that since the input sequence is real, each QRNS adder is one modular adder. A butterfly needs a QRNS adder (two modular adders), a QRNS subtractor (two modular subtractors) and a QRNS multiplier (two modulo multipliers). The moduli set used is {221, 229, 233, 241}. The multiplier uses isomorphic mapping with the roots {47, 107, 89, 177}, respectively. The 8-point QRNS DCT computation is shown in Figure 9.28b. Note that only five of the outputs Z(0), Z(1), Z(2), …, Z(5) are required for DCT computation.
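The QRNS mapping behind this two-multiplier butterfly can be sketched in a few lines (Python; the modulus 241 and the root 177 are taken from the set above, where 177² ≡ −1 (mod 241)). A complex product becomes two independent modular products on the (z, z*) components:

```python
m, jhat = 241, 177            # from the text: 177**2 = -1 (mod 241)

def to_qrns(a, b):
    """Map a + jb to the QRNS pair (z, z*) = (a + jhat*b, a - jhat*b) mod m."""
    return ((a + jhat * b) % m, (a - jhat * b) % m)

def qrns_mul(p, q):
    """A complex multiplication is just two independent modular multiplications."""
    return (p[0] * q[0] % m, p[1] * q[1] % m)

def from_qrns(z, zs):
    """Recover (a, b): a = (z + z*)/2, b = (z - z*)/(2*jhat), all mod m."""
    inv2 = pow(2, -1, m)
    a = (z + zs) * inv2 % m
    b = (z - zs) * inv2 * pow(jhat, -1, m) % m
    return a, b
```

For example, (3 + 4j)(5 + 6j) = −9 + 38j is obtained as (−9 mod 241, 38) via two modular multiplies instead of four real multiplies and two additions.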
Fernandez et al. [69] have presented an RNS architecture for computation of the scaled 2D-DCT on field programmable logic (FPL). An eight-pixel 1D-DCT is implemented as shown in Figure 9.29a. The 2D-DCT is computed as

X(u, v) = (2 e(u) e(v) / N) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} x(i, j) cos[u(2i+1)π/(2N)] cos[v(2j+1)π/(2N)],  u, v = 0, 1, …, N−1   (9.20)

where x(i, j) is an N × N matrix of pixels and X(u, v) is the corresponding transformed matrix. Since the 2D-DCT is a separable transform, the row-column decomposition [70] can be used: an N × N 2D-DCT can be performed by first taking N 1D-DCTs on the rows and then N 1D-DCTs on the columns. The use of a transposition structure containing an 8 × 8 matrix of registers and multiplexers allows the transposition of the parallel input data.

Using an algorithm due to Arai et al. [72] (see Figure 9.29a), the 8-pixel 1D-DCT can be realized as shown in Figure 9.29b for one modulus channel, which needs only five multiplications. Note that e1 and e2 are power-of-two scaling factors. The coefficients are k1 = C4, k2 = C6 − C2, k3 = C4, k4 = C6 + C2, k5 = C6, where Cq = cos(qπ/16). The 1D-DCT can be designed to have a single multiplication per stage. Multiplication by the DCT coefficients is by ROM look-ups. In order to obtain the exact value of the DCT, each output needs an additional multiplication, which can be taken care of in the next stage. The hardware consists of adders, registers and LUTs. The moduli set used was {256, 255, 253, 251}. The output

Figure 9.28 (a) QRNS butterfly for a radix-2 FFT and (b) pipelined QRNS DCT implementation
(adapted from [66] ©IEEE2000)
Figure 9.28 (continued)

Figure 9.29 (a) Flow graph for fast computation of DCT and (b) moduli mi channel of 1D DCT
(adapted from [69] ©IEEE2000)
Figure 9.29 (continued)

conversion to binary is performed using the ε-CRT converter of Griffin, Taylor and Sousa [23] described earlier. The cosine coefficients are 7 bits, the signal samples are 8 bits and a dynamic range of 32 bits could be achieved.
Taylor [73] has described a DFT implementation in RNS using a circular shift register to store the data, with multipliers followed by adders (see Figure 9.30). The structure of this five-point prime factor DFT is akin to that of FIR filters. Gaussian
Figure 9.30 Five point RNS prime factor DFT implementation (adapted from [73] ©IEEE1990)

primes of the form 2^n + 1 for n = 2, 4, 8, 16 or 32 have been suggested for single-modulus or multi-moduli RNS architectures. The CRT needs to be used to convert back into normal integers, and scaling can then be performed. This implementation has employed CRNS as well as QRNS and index calculus-based multipliers. In order to overcome the overflow problem of QRNS-based FFT implementations, a prime factor transform (PFT) [74] has been suggested. This reduces the dynamic range from NQ^4 needed for the FFT to NQ^2, where N is the number of points and Q is the word size of the inputs and coefficients. Taylor has suggested CRT computation using distributed arithmetic.
Tseng et al. [75] have described an FFT implementation using RNS and considered the effect of quantization noise. They have shown that radix-4 is the largest radix without internal multiplications in the r-point DFT. The twiddle factors are ±1 and ±j in the case of radix-2 or radix-4. The basic calculation of radix-4 decimation-in-time (DIT) is shown in Figure 9.31. The magnitudes of numbers at subsequent stages increase very rapidly due to the cascaded integer multiplications. Since RNS cannot accommodate this dynamic range, all internal numbers shall be scaled properly by a priori chosen scaling factors to prevent overflow. Tseng et al. have analyzed the techniques for scaling to prevent overflow by performing error analysis due to A/D conversion, the scaling factors and the twiddle factors. The scaling can also be applied after a few stages, and the number of scaling stages can be chosen suitably.
Figure 9.31 Radix-4 DIT (decimation in time) basic calculation (adapted from [75] ©IEEE1979)

Taylor et al. [76] have presented a radix-4 FFT using complex RNS arithmetic. In this technique, the complex multipliers needed in a conventional implementation are replaced by QRNS multipliers, thus reducing the hardware. A radix-4 complex RNS (CRNS) butterfly is presented in Figure 9.32a together with the QRNS butterfly in Figure 9.32b. In the CRNS butterfly, 12 real multiplications at level 1, 6 add/subtract at level 2, 8 add/subtract at level 3 and 8 real add/subtract at level 4 are needed. On the other hand, in the case of QRNS-based designs, we need only 6 real multiplications at level 1, 8 real add/subtract and 2 multiplications at level 2, and 8 real add/subtract at level 3.
Jullien et al. [77] have described a systolic Quadratic Residue DFT with fault tolerance. In this, each systolic array cell uses a 16 × 6 ROM in place of a 16 × 4 ROM. The two additional bits correspond to the parity of the output content of the ROM and the parity of the input address bits. In normal operation, the address parity of a cell must equal the content parity of the previous cell.
The general form of Number Theoretic Transform (NTT) [78, 79] is described
by the transform pair
Figure 9.32 (a) CRNS radix-4 FFT butterfly and (b) QRNS radix-4 Butterfly (adapted from [76]
©IEEE1985)
Figure 9.32 (continued)

X(m) = | Σ_{n=0}^{N−1} x(n) α^{nm} |_M,  0 ≤ m ≤ N−1

x(n) = | N^{−1} Σ_{m=0}^{N−1} X(m) α^{−nm} |_M,  0 ≤ n ≤ N−1   (9.21)

 
where α generates a multiplicative group of elements αN M ¼ 1. NTTs are useful
for fast, efficient and error-free coefficient computation of cyclic convolution.
YL
A complex number theoretic transform with dynamic range M ¼ m where
i¼1 i
mi is prime can be implemented in the RNS if the transform length is a divisor of the
gcd of the numbers Ni, i ¼ 1, . . .. L where N i ¼ m2i  1, mi ¼ 4n + 3 and
N i ¼ mi  1, mi ¼ 4n + 1. FIR filters characterized by the finite convolution

y_m = Σ_{n=0}^{p−1} h_{m−n} x_n   (9.22)

can be implemented with transforms having the circular convolution property if zeroes are appended to the input sequence {x_n} and the impulse response {h_n}. The transform length shall be K + P − 1, where P and K are the lengths of {h_n} and {x_n}, respectively. Circular convolution can be implemented for blocks of length N by computing the transforms of N samples of the input signal and the impulse response, multiplying them and taking the inverse transform. The computation can be done in RNS in parallel for the L moduli by selecting a moduli set to meet the dynamic range of (9.22). Baraniecka and Jullien [78] have used LUTs for realizing index calculus-based multipliers. As an illustration, for the moduli set {31, 47, 97}, the transform length is 32 since the respective maximum lengths N_i are 2^6 × 3 × 5, 2^5 × 3 × 23 and 2^5 × 3, respectively. Since the transform length is a power of 2, fast transform algorithms can be used for efficient implementation.
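A minimal behavioral sketch of (9.21) and its use for error-free cyclic convolution, using the modulus 97 from the example set above (a length of 8 divides N_i = 96; the brute-force search for a suitable α is an illustrative shortcut, not the method of [78]):

```python
M, N = 97, 8     # 97 is of the form 4n+1, so lengths dividing N_i = 96 are available

# any alpha of order exactly N generates the transform kernel
alpha = next(a for a in range(2, M)
             if pow(a, N, M) == 1 and pow(a, N // 2, M) != 1)

def ntt(x, root):
    """Kernel of Eq. (9.21): X(m) = | sum_n x(n) * root^(nm) |_M."""
    return [sum(xn * pow(root, n * m, M) for n, xn in enumerate(x)) % M
            for m in range(N)]

def intt(X):
    inv_n = pow(N, -1, M)
    return [xi * inv_n % M for xi in ntt(X, pow(alpha, -1, M))]

def cyclic_conv(x, h):
    """Pointwise multiplication in the transform domain gives cyclic convolution."""
    return intt([a * b % M for a, b in zip(ntt(x, alpha), ntt(h, alpha))])
```

Because all arithmetic is exact modulo 97, the convolution result is error-free as long as it stays within the dynamic range.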
Krishnan, Jullien and Miller [79] have described CRNS-, QRNS- and MQRNS-based Complex Number Theoretic Transforms (CNTT) for realizing cyclic convolution. They consider the realization of a radix-2 butterfly structure for the computation of cyclic convolution. The modulus used was M = 2^31 − 1 (a Mersenne prime). The implementation was based on ROMs for the addition as well as the multiplication operations. The radix-2 butterfly needs to compute

Z = | |(c + jd)(x + jy)|_m + |(a + jb)|_m |_m   (9.23a)

Y = | |(c + jd)(x + jy)|_m − |(a + jb)|_m |_m   (9.23b)

In QRNS, this becomes

Z = | |A·B|_m + C |_m,  Y = | |A·B|_m − C |_m   (9.23c)

Z* = | |A*·B*|_m + C* |_m,  Y* = | |A*·B*|_m − C* |_m   (9.23d)

where A, B, A*, B*, C and C* are the element pairs of the two input samples and the twiddle factor, respectively, and Z, Z*, Y and Y* are the pairs of elements of the butterfly output in the QRNS.
Dimitrov et al. [80] have suggested the implementation of real orthogonal transformations using RNS. In this technique, real numbers are approximated by considering them in the form a + x√b, i.e., as elements of quadratic number rings. The reader is referred to [80] for more information.
A three-moduli set {2^{2k−2} + 1, 2^{2k} + 1, 2^{2k+2} + 1} was suggested by Shyu et al. [81], where k is odd. Abdallah and Skavantzos [82] have extended these to general moduli sets (having more moduli) and presented rules to select them to be mutually prime. A four-moduli set recommended was {2^{n−6} + 1, 2^{n−2} + 1, 2^n + 1, 2^{n+2} + 1} where n = 8k + 6, k = 1, 2, 3, …. Abdallah and Skavantzos [82] have also suggested QRNS using such moduli sets with non-co-prime moduli. This enables the availability of several moduli of the form 2^n + 1. The conversion from binary to QRNS is the same as before. Note, however, that the CRT needs to be modified to take

Figure 9.33 Filter bank structure of 2D-DWT (adapted from [83] ©IEEE2004)

into account the common factors in the non-co-prime moduli. As an illustration, for the moduli set {65 = 2^6 + 1, 1025 = 2^10 + 1}, the dynamic range is as though the moduli set {13, 205} is used. The CRT needs to be performed on this moduli set.
Liu and Lai [83] have described a 2D-DWT processor using RNS. The filter bank structure is presented in Figure 9.33. First, a 1D-DWT is performed on the raw data, yielding high-pass and low-pass outputs. A second-stage 1D-DWT is executed on the two outputs of this stage to decompose them further into four sub-images: HH, HL, LL, LH. A Daubechies 9/7-tap filter is used for the filtering unit of the DWT processor. A 24-bit dynamic range is needed, and hence the moduli set {255, 256, 257} has been employed in order to reduce the hardware complexity and delay. The FIR filters have been realized in transposed form. The multipliers are realized using LUTs since the coefficients are fixed. Since there are three moduli, three separate filter banks are used, needing 27 LUTs of size 256 × 8 for realizing the multipliers. The forward and reverse converters have also been considered. A four-stage pipeline was used, comprising the forward converter, multiplier, adder and final converter. The reverse converters have been shared between the LP and HP filters since decimation is used. A frequency of 28 MHz could be realized.
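The per-modulus filtering idea can be sketched behaviorally with the same moduli set {255, 256, 257} (the taps below are arbitrary illustrative values, not the Daubechies 9/7 coefficients): each channel filters independently modulo one m_i, and each output sample is rebuilt with the CRT.

```python
from math import prod

moduli = [255, 256, 257]               # ~24-bit dynamic range, as in the text
taps = [3, -1, 4, 2]                   # illustrative coefficients only

def fir_rns(x):
    """Filter independently in each modulus channel, then rebuild with the CRT."""
    M = prod(moduli)
    chans = [[sum(taps[k] % m * x[n - k] for k in range(len(taps)) if n - k >= 0) % m
              for n in range(len(x))] for m in moduli]
    def crt(rs):
        y = sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(rs, moduli)) % M
        return y - M if y >= M // 2 else y     # map back to a signed result
    return [crt([c[n] for c in chans]) for n in range(len(x))]
```

Note that negative coefficients simply become large positive residues in each channel; the sign is recovered only after the final CRT step.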
Ramirez et al. [84] have described orthogonal DWT using 6-, 7- and 8-moduli RNS in field programmable logic. The low-pass analysis and synthesis filters are similar to FIR filters. The authors use LUT-based RNS multipliers and modulo adder trees. The wavelet filters have eight taps each. The hardware has been shared between the LP and HP filters since the coefficients are related.
Toivonen and Heikkila [85] have described video filtering with Generalized Fermat number transforms (GFNT) using RNS. The use of RNS helps to enlarge the effective modulus. The moduli shall be chosen to support the transform length. Further, the product of the moduli must be larger than the necessary dynamic range. They suggest the composite modulus (2^16 + 1)(2^8 + 1) so that transform lengths up to 256 with a dynamic range of 24 bits can be obtained. This allows convolving or correlating a 16 × 16 image block containing 8-bit pixels. The authors suggest the use of diminished-1 arithmetic for the computations. The authors also present an RNS to binary converter for the chosen two-moduli set using the diminished-1 representation of the residues.
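The diminished-1 idea can be modeled behaviorally as follows (this captures only the arithmetic identity, not the end-around-carry hardware; the special zero flag that real diminished-1 designs carry is glossed over here):

```python
n = 8
M = 2**n + 1                 # modulus of the form 2^n + 1

def to_d1(x):
    """Diminished-1 code of x is x - 1 (mod M)."""
    return (x - 1) % M

def from_d1(d):
    return (d + 1) % M

def d1_add(a_d, b_d):
    """(a-1) + (b-1) + 1 = (a+b) - 1: addition stays in diminished-1 form."""
    return (a_d + b_d + 1) % M
```

The attraction in hardware is that operands modulo 2^n + 1 fit in n bits (less the zero case), so adders and ROMs stay the same width as for a 2^n modulus.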

9.4 RNS Application in Communication Systems

RNS has been used in communication systems for protecting the information processed or transmitted [86–89]. This exploits the self-checking/error correction properties of redundant residue number systems (RRNS). The block diagram of a transmitter using an RNS-based parallel communication scheme is shown in Figure 9.34a. The input binary word N is converted into residues r_1, r_2, …, r_u (where r_1, …, r_v are the actual residues and r_{v+1}, r_{v+2}, …, r_u are the redundant residues), which are then mapped to orthogonal sequences U_{1r_1}, U_{2r_2}, …, U_{ur_u} and multiplexed for transmission. The receiver architecture is shown in Figure 9.34b. If an information symbol is coded and sent as u residues, after the MLD (maximum likelihood detection) of the u banks, d of the MLD outputs can be dropped while still recovering the transmitted symbol using the remaining outputs, and they can be corrected as well. The block named "bank for receiving residues" is expanded in Figure 9.34c, which comprises correlators, square law detectors and multi-path diversity reception combining MLD units. Note that L represents the number of resolvable paths being tracked at the receiver; the receiver has a diversity reception structure with L multi-path tracking. Note that some of the redundant channels can be dropped if the dynamic range is less than Π_{i=1}^{u} m_i. The errors can be corrected by using the theory of RNS-PC (RNS product codes) [90, 91].
An RNS (u, v) code has a minimum distance of (u − v + 1) and can detect (u − v) or fewer residue digit errors and correct up to (u − v)/2 residue digit errors. Further, an RNS (u, v) code is capable of correcting a maximum of t_max residue errors and simultaneously detecting a maximum of β > t_max residue errors if and only if t_max + β ≤ (u − v). Note that the diversity combining technique can be based on equal gain combining (EGC) or selection combining (SC). In EGC, for a receiver with Lth-order diversity (L ≤ L_P) reception, where L_P is the number of resolvable multi-path components, the L paths at the receiver are added after equal weighting and form the decision variable:

U_ij = Σ_{l=1}^{L} U_ij(l)   (9.24)

for i = 1, 2, …, u and j = 0, 1, …, m_i − 1. The largest among the set U_{i0}, U_{i1}, …, U_{i(m_i−1)} is selected for each i. On the other hand, in the SC technique, we have U_ij = max{U_ij(1), U_ij(2), …, U_ij(L)} for i = 1, 2, …, u and j = 0, 1, …, m_i − 1. The authors have shown that dramatic improvements in BER (bit error ratio) performance can be obtained. As an example, for L = 3, the moduli {29, 31, 32, 33, 35, 37, 41} could achieve an error rate of 10^−4 at an SNR (signal to noise ratio) per bit of 18 dB.
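The drop-and-check decoding implied by the distance property can be sketched for a hypothetical (u, v) = (4, 2) RRNS (the moduli below are illustrative, not a set from the cited papers): with minimum distance 3, a single corrupted residue digit is identified because only the drop that excludes it yields a decode inside the legitimate range.

```python
from math import prod

moduli = [7, 11, 13, 15]          # hypothetical (4, 2) RRNS; last two are redundant
M_info = 7 * 11                   # legitimate range: 0 <= X < 77

def crt(res, mods):
    """Standard CRT reconstruction over a co-prime subset of the moduli."""
    M = prod(mods)
    return sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(res, mods)) % M

def correct_single_error(res):
    """Try dropping each residue digit in turn; the unique decode landing in
    the legitimate range identifies (and bypasses) the erroneous digit."""
    for drop in range(len(moduli)):
        keep = [i for i in range(len(moduli)) if i != drop]
        x = crt([res[i] for i in keep], [moduli[i] for i in keep])
        if x < M_info:
            return x
    return None                   # more than one residue digit in error
```

For this set, any wrong drop leaves the corrupted digit in play and pushes the decode outside the legitimate range, so the search is unambiguous for single errors.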
Yang and Hanzo [86] have also considered RNS for direct sequence code division multiple access (DS-CDMA) based transmitters and receivers. This also
Figure 9.34 (a) Transmitter block diagram (b) receiver block diagram with RNS processing
and (c) non-coherent demodulation bank for receiving residue digit ri (adapted from [85]
©IEEE1999)
244 9 Applications of RNS in Signal Processing

uses RRNS for good error correction capabilities. RNS product codes are used as inner codes and Reed–Solomon (RS) codes are used as outer codes. The trade-offs between information moduli and redundant moduli have been considered. In a system with v moduli, in order to transmit k bits of information in one symbol period, for transmitting each of the residues in parallel, Σ_{i=1}^{v} m_i orthogonal sequences are required, with Π_{i=1}^{v} m_i ≥ 2^k. Hence, at the receiver, Σ_{i=1}^{v} m_i correlators are required. A similar architecture to the one discussed in
Figure 9.34b can be employed. Note that each of the users is assigned a random PN
sequence set consisting of Σ_{i=1}^{u} m_i orthogonal sequences of length N_s, where a subset of m_i orthogonal sequences is used for transmission of the residue r_i. For example, the moduli set {7, 11, 13, 15, 16} can accommodate a 17-bit symbol. Hence, by using a 64-bit Walsh code, this symbol can be transmitted in parallel in one symbol period. Thus, given a message X_q for the qth user, the residues are (r_1, r_2, …, r_u), such that u specific PN sequences V_{1r_1}^{(q)}, V_{2r_2}^{(q)}, …, V_{ur_u}^{(q)} can be selected from the Σ_{i=1}^{u} m_i sequences.
sequences.
Ramirez et al. [92] have described an RNS-based communication receiver which needs a direct digital frequency synthesizer (DDS) and a programmable decimation FIR filter in the mixer block, as shown in the block diagram of Figure 9.35a. The RNS-based DDS based on [90] and [91] is sketched in Figure 9.35b. Note that the phase accumulator is of the conventional binary type. The output of the phase accumulator is 10 bits wide. The first two MSBs are used to decide the sign and the quadrant. The eight LSBs address the COS and SIN LUTs in RNS form. The negative values are obtained by modulo subtraction. The correct output c_n, s_n, −c_n or −s_n is selected by using a multiplexer. The index LUTs convert the input into indices so that the mixing function can be accomplished.
The input TC (two's complement) word is converted into index form using a binary to RNS converter operating on p blocks of b bits of the input word and using (p − 1) LUTs to store the residues. All these are added in an adder tree whose output is fed to an index LUT to get the input indices to be added with the COS and SIN indices obtained as explained before. The output is routed to the programmable decimation filter. The programmable decimation filter uses L index-based RNS channels and a final RNS to binary converter. The number of taps and the input and output precisions are programmable. The filter has serial data and coefficient inputs. Using a multiplexer, the coefficients are sequentially loaded and the input is distributed to all index-based multipliers. The products are computed using a mod (m_i − 1) adder to add the indices, followed by a LUT to get the inverse. These are added using a tree of adders to yield the output. A final stage using both CRT and ε-CRT has been tried to yield the desired binary result. The authors have shown that the complexity is comparable
Figure 9.35 (a) Digital receiver architecture (b) RNS based DDS (adapted from [92] © with
permission from Springer 2002)

to or lower than that of the conventional design, whereas the throughput has increased by 65 %.
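The index-based (discrete logarithm) LUT multiplier used in these channels can be modeled in a few lines (Python; the modulus 31 and the generator 3 are illustrative choices, and 3 is a primitive root mod 31):

```python
p, g = 31, 3                 # prime modulus and a primitive root (illustrative)

# "index" LUT: n -> log_g(n), and its inverse LUT: e -> g^e
idx = {pow(g, e, p): e for e in range(p - 1)}
inv = {e: pow(g, e, p) for e in range(p - 1)}

def index_mul(a, b):
    """Multiply mod p by adding indices mod (p - 1); zero is special-cased,
    since 0 has no index."""
    if a == 0 or b == 0:
        return 0
    return inv[(idx[a] + idx[b]) % (p - 1)]
```

In hardware, both dictionaries become small ROMs and the multiplication reduces to one mod (p − 1) addition between two table look-ups, which is the attraction of the index-calculus approach.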
The use of RNS for digital frequency synthesis (DFS) has been considered [93]. The block diagram of a classical DFS is shown in Figure 9.36a. The phase accumulator is incremented by a frequency setting word k and, using a look-up table, the sine value corresponding to the accumulator output is read. This output, after D/A conversion, yields the desired sine waveform. There is a possibility of scaling the phase accumulator output from L bits to W bits. Assuming a clock frequency of f_c, a frequency setting word k and an L-bit accumulator, the frequency of the generated sinusoid is f_c·k/2^L. The phase increment is 2πk/2^L. The symmetry of the sine function can be taken advantage of to reduce the ROM size. The two MSBs of the accumulator output indicate the sign and the quadrant. Chren [93] suggests using an RNS comprising several moduli, one of them being 2^{p+2}. The architecture shown in Figure 9.36b uses an RNS-based phase accumulator, scaler and modifier for taking into account the sign and quadrant information. The block FSM (finite state machine) performs phase accumulation mod m_i. The input to the FSM is the binary encoded ith residue digit of the frequency setting word. The additive invert (AI) and sign inversion units compute the additive inverse of the inputs so as to take advantage of the quarter-wave symmetry of the cosine function. Chren [93] has suggested another reduced-area architecture, shown in Figure 9.36c, in which modulo adders are used in place of the FSMs of Figure 9.36b. Here, sample values are computed rather
Figure 9.36 (a) Traditional direct digital frequency synthesizer (b) frequency agile direct syn-
thesizer and (c) Reduced area direct synthesizer (adapted from [94] ©IEEE2001)
Figure 9.37 (a) Transmitter block diagram and (b) details of sub-transmitter for nth residue
channel (adapted from [95] ©IEEE2004)

than stored. The MSB of the residue corresponding to the modulus 2^{p+2} is recommended to be used to perform sign inversion of the output of the RNS to binary converter, and the next bit is recommended for selecting the additive inverse to take care of the quadrant information. Unfortunately, this is not feasible, as shown later [94], because the residue mod 2^{p+2} alone does not yield information about the sign and quadrant as assumed. Thus, the RNS architecture after the scaler shall use the CRT, and only after reverse conversion will the sign or quadrant information be known. Thus, the ROM size cannot be reduced.
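The classical binary accumulator of Figure 9.36a can be modeled directly (a behavioral sketch; the values of L, W and k in the test are illustrative):

```python
import math

def dds_samples(k, L, W, n_samples):
    """L-bit phase accumulator stepped by k; the top W bits address a sine
    table, so the output frequency is fc * k / 2**L for a clock fc."""
    table = [math.sin(2 * math.pi * i / 2**W) for i in range(2**W)]
    acc, out = 0, []
    for _ in range(n_samples):
        out.append(table[acc >> (L - W)])      # truncate L bits down to W
        acc = (acc + k) & (2**L - 1)           # wrap-around accumulation mod 2^L
    return out
```

For instance, with k = 2^{L−3} the accumulator wraps every 8 clocks, giving one sine period per 8 output samples.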
Madhukumar and Chin [95] have extended the concepts of using RNS to an M-ary orthogonal system. In the earlier technique, considering a dynamic range of M comprising moduli {m_1, m_2, m_3, …, m_n}, the number of orthogonal carriers required was Σ_{i=1}^{n} m_i, where redundant moduli may also exist. These residues will be transmitted in parallel. As an illustration, for the moduli set {39, 41, 43, 44, 47}, 214 carriers are required to cater for each residue value. Madhukumar and Chin [95] suggest combining the modulation scheme (PSK/QAM), orthogonal modulation and RNS together. They suggest the architecture of Figure 9.37a, wherein the residue of k bits is considered as two fields of k − m bits and m bits. The m LSBs are MPSK/MQAM modulated and considered as a data symbol (for example, for QPSK, m = 2). (Note that M stands for M-ary, where M = 2^m.) Next, the MSBs are converted back into integer values and mapped into an orthogonal sequence selected from a family of Hadamard–Walsh codes. In other words, the MPSK/MQAM modulated symbol is spread with the corresponding spread code and multiplexed for transmission. In essence, in the five-moduli case, since the four
Figure 9.38 (a) Transmitter schematic for kth CRU and (b) receiver block diagram in RRNS MC/
DS-CCDMA system (adapted from [96] ©IEEE2012)

MSBs corresponding to the maximum residue values are 9, 10, 10, 10 and 11, respectively, we need only 55 orthogonal sequences overall (considering that 0–9, etc., are all the possible residue values).
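The residue splitting described above can be sketched directly; counting the distinct MSB fields reproduces the figure of 55 orthogonal sequences (Python; m = 2 corresponds to the QPSK case, and the function name is ours):

```python
moduli = [39, 41, 43, 44, 47]        # moduli set from the text
m = 2                                # m LSBs -> MPSK/MQAM symbol (QPSK here)

def split_residues(x):
    """Split each residue of x into (Walsh-code index from the MSBs,
    M-ary data symbol from the m LSBs)."""
    return [((x % mi) >> m, (x % mi) & ((1 << m) - 1)) for mi in moduli]

# one orthogonal sequence per possible MSB field, summed over all moduli
codes_needed = sum(((mi - 1) >> m) + 1 for mi in moduli)
```

The per-modulus counts ((m_i − 1) >> 2) + 1 are 10, 11, 11, 11 and 12, which sum to 55, versus the 214 sequences needed when every residue value gets its own carrier.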
The high-level architecture of the receiver is presented in Figure 9.37b. The authors have used 16 bits per symbol. The last two elements in the moduli set (44 and 47) are the redundant residues. The authors have shown that the BER versus Eb/No performance is superior to the earlier technique of Yang and Hanzo [86–89].
Zhang et al. [96] have suggested the use of RRNS for multi-carrier direct sequence CDMA multiple access for cognitive radio. The transmitter block diagram is presented in Figure 9.38a, wherein the input binary symbol is converted into residues using a front-end binary to RNS converter. These Q residues corresponding to Q moduli are next mapped into Q orthogonal sequences. Considering a Q-moduli set, the number of orthogonal sequences is Σ_{q=1}^{Q} m_q. These are Hadamard–Walsh codes of length N_s. Next, each of these selected orthogonal

sequences is spread using a user-specific signature sequence. Each of these Q spreading sequences is then transmitted on L sub-carriers to achieve Lth-order frequency diversity. Thus, the RRNS MC/DS-CDMA (multi-carrier direct sequence code division multiple access) system requires a total of U = LQ sub-carriers. The error and throughput performance is comparable to other techniques. Any erroneous residues can be discarded without affecting the data recovery, provided that sufficient dynamic range remains in the reduced RRNS to unambiguously represent the non-redundant information. The receiver block diagram is presented in Figure 9.38b. Note that the IRNST (inverse RNS transform) block performs the reverse conversion from RNS to binary form. Note also that RMD/MS-MMSE MUD stands for receiver multi-user diversity-aided multistage minimum mean square error multi-user detector.
Yi and Jian-Hao [97] have described an RNS-based OFDM transmission scheme with low peak to average power ratio (PAPR). In OFDM systems, the PAPR, defined as the ratio of the maximum peak power to the average power, can exceed the linear dynamic range of the amplifier, leading to distortion and BER degradation. An RNS-based OFDM system block diagram is shown in Figure 9.39a. The serial data stream to be transmitted by the source is d_0, …, d_{N−1}. After conversion of the serial data into RNS form for v moduli, an IFFT is performed for each residue channel. These orthogonal and parallel outputs of the IFFT block are combined and modulated in the transmitter. The receiver block diagram is presented in Figure 9.39b, which has front-end residue channels followed by an FFT and residue to binary conversion. Since the value of each selected modulus is less than the symbol amplitude d_i, this scheme controls the dynamic range of the transmitted signal. The authors have used the moduli set {128, 127, 63}. N = 2048 sub-carriers were considered. The input data symbols on the sub-carriers are denoted d_i (i = 0, 1, …, N − 1), where N is the number of sub-carriers. The authors show that a 5 dB improvement in PAPR reduction is obtained by the RNS-OFDM transmission scheme at a PAPR level of 1 %.
How et al. [98] have described an RRNS coded burst-by-burst (BbB) adaptive joint detection-based CDMA speech transceiver, since RRNS codes exhibit maximum–minimum distance properties [90, 91]. They have suggested coding the speech frame bits in different classes depending on their error sensitivity. These three classes have 40, 25 and 30 bits, respectively, yielding 95 bits in a 20 ms frame corresponding to a 4.75 kb/s speech coder. Thus, using 5-bit residues with redundant moduli, class I bits are coded using RRNS (8, 4), where 4 corresponds to four 5-bit information residues and 8 corresponds to the total number of 5-bit residues including the redundant residues. Thus, for coding the 40 speech bits of class I, two code words are needed, whereas for classes II and III only one code word each is needed. Hence, overall 80 bits are needed for class I. In a similar manner, class II (25 bits) and class III (30 bits) are coded using (8, 5) and (8, 6) RRNS, respectively. Thus, overall, 160 bits are needed for each speech frame. The block diagram of the JD (Joint Detection)-CDMA system employing such RRNS is shown in Figure 9.40. The authors have shown that a burst-by-burst adaptive speech transceiver can drop its coding rate and speech quality under transceiver control in order to invoke a more resilient modem mode under less favorable channel conditions.
Figure 9.39 (a) Block diagram of RNS OFDM transmitter and (b) receiver (adapted from [97] ©IEEE2011)

Figure 9.40 Adaptive dual-mode JD-CDMA system block diagram (adapted from [98]
©IEEE2006)

Zhu and Natarajan [99] have described hopping pilot pattern design using RNS
for cellular downlink orthogonal frequency division multiple access (OFDMA).
This enables hopping of pilot in time as well as frequency domains. By using these
RRNS-based pilot patterns, the channel’s Doppler delay response can be fully
reconstructed without aliasing. This technique provides a number of pairs of
hopping patterns that are collision-free.
We briefly describe the algorithm used. The number of available sub-carriers
is N = M·Mc, where Mc is the number of clusters and M is the number
of contiguous frequencies (sub-carriers) in a cluster. Note that G is the number of
OFDM symbols between two consecutive pilot signals in time and M is the
number of OFDM sub-carriers between two consecutive pilots. M is chosen as the
product of two primes a and b; we consider a = 2, b = 3, i.e. M = 6, for illustration.
Thus, if Ts is one OFDM symbol duration, the pilot spacings are Tp = G·Ts in time and
fp = M·Δf in frequency, where Δf is the sub-carrier spacing. Consider an example with N = 12, M = 6, Mc = 2 and G = 4.
Figure 9.41a shows two clusters, each with M = 6 sub-carriers. These are divided
into two sub-clusters, each containing three sub-carriers. We start with an initial
address (IA), for example 4 for time slot 0. Since 4 mod 2 = 0 and 4 mod 3 = 1, we
send the pilot in sub-cluster 0 of each cluster, on the first sub-carrier. Next, we
increment the IA, obtaining for the first time slot 5 mod 2 = 1 and 5 mod 3 = 2; thus, the
next pilot tone is sent in the second sub-cluster, on the third sub-carrier. In a similar
manner, corresponding to the other two time slots, the remaining locations are
obtained as shown in Figure 9.41b. Similarly, starting with different IA values from 1 to 6,
the generic RNS pilot pattern can be found as in Figure 9.41b.
It can be shown that the patterns are orthogonal meaning that pilot patterns using
different IAs do not collide with each other. The authors have shown that such a
system generates more unique hopping pilot patterns than other techniques.
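The address-to-residue mapping described above can be sketched in a few lines; the residue pair is folded back into a single sub-carrier index s by the CRT (the physical placement in Figure 9.41 orders sub-carriers in its own way, so this is one way to realize the mapping, not a reproduction of the figure):

```python
a, b = 2, 3                        # coprime factors of M = a*b sub-carriers per cluster
M = a * b                          # M = 6

def pilot_position(ia, t):
    """Residue-based pilot address for initial address ia at time slot t.

    The address is incremented every slot; its residues mod a and mod b
    select the sub-cluster and the sub-carrier. The residue pair is folded
    back into a single index s in [0, M) by the CRT."""
    addr = ia + t
    r_a, r_b = addr % a, addr % b
    return next(s for s in range(M) if s % a == r_a and s % b == r_b)

# Patterns started from distinct initial addresses never collide:
for t in range(8):
    positions = {pilot_position(ia, t) for ia in range(M)}
    assert len(positions) == M
```

The final loop checks the collision-free property: at every time slot, the M patterns generated from distinct initial addresses occupy M distinct sub-carrier positions.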
Han et al. [100] have presented an architecture for the block interleaving algorithm
in MB-OFDM (multi-band orthogonal frequency division multiplexing) using
the Mixed Radix System for UWB (ultra-wide-band) communication. The block interleaving
algorithm outlined in the WiMedia Alliance MAC-PHY interface specification 1.0 [101]
necessitates three consecutive steps: symbol interleaving, tone
interleaving and cyclic shift. These are described by the following equations:
aS(i) = a(⌊i/NCBPS⌋ + (6/NTDS)·mod(i, NCBPS))    (9.25a)

aT(i) = aS(⌊i/NTint⌋ + 10·mod(i, NTint))    (9.25b)

b(i) = aT(m(i)·NCBPS + mod(i + m(i)·Ncyc, NCBPS))    (9.25c)

where i is an index for bit sequences in the range 0 ≤ i < (6/NTDS)·NCBPS in
(9.25a) and (9.25c) and 0 ≤ i < NCBPS in (9.25b). The symbol a is the input to the
interleaver, aS is the output of the symbol interleaver and aT is the output of the tone
Figure 9.41 (a) Design procedures of pilot patterns using RNS arithmetic and (b) RNS-based pilot pattern (adapted from [99] ©IEEE 2010)

interleaver, and b is the output of the cyclic shift, with m(i) = ⌊i/NCBPS⌋. Note also that
NCBPS, NTint, NTDS and Ncyc are constants depending on the data rate of the
MB-OFDM system. We have used the notation mod(m, n) = m mod n. Note that
NCBPS is the number of coded bits per OFDM symbol, NTint is the tone interleaver block size,
Ncyc is the cyclic shift and NTDS is the TDS (time-domain spreading) factor. The authors suggest realizing
equations (9.25) using RNS. The interleaving operations can be realized using a
Mixed Radix System denoted as 2-radix MRS (p2/p1). The value of i is written as
(a2·p2 + a1)·p1 + a0. The choice of p1 and p2 enables mapping of a given bit position into
the digits a2, a1, a0. Note that a0 < p1, a1 < p2, a2 < M/(p1·p2) where M is the
block size.
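The MRS digit decomposition just described can be sketched directly; the values p1 = 6, p2 = 10 and M = 1200 are those quoted below for the 320–480 Mb/s modes:

```python
def mrs_digits(i, p1, p2):
    """Decompose index i as i = (a2*p2 + a1)*p1 + a0 (2-radix MRS p2/p1)."""
    a0 = i % p1
    a1 = (i // p1) % p2
    a2 = i // (p1 * p2)
    return a2, a1, a0

def mrs_index(a2, a1, a0, p1, p2):
    """Inverse mapping: reassemble the index from its MRS digits."""
    return (a2 * p2 + a1) * p1 + a0

p1, p2, M = 6, 10, 1200            # 1200-bit block for the 320-480 Mb/s modes
for i in range(M):
    a2, a1, a0 = mrs_digits(i, p1, p2)
    assert a0 < p1 and a1 < p2 and a2 < M // (p1 * p2)  # digit ranges from the text
    assert mrs_index(a2, a1, a0, p1, p2) == i           # the mapping is a bijection
```

The loop verifies that every bit position of the block maps to a unique (a2, a1, a0) cell and back, which is what makes the wired permutation of Figure 9.42 possible.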
The whole block is divided into p1 differently colored sub-blocks. Each
sub-block is in turn divided into p2 sub-sub-blocks. Each sub-sub-block is
shown as a vertical line in Figure 9.42b. The sub-sub-blocks with different colors
are alternately arranged and wired using the connections shown in dotted lines.
Thus, the Xth bit position in the decimal number system is mapped as a 2D array with X =
(a2·p2 + a1)·p1 + a0. As an illustration, for data rates of 320, 400 and 480 Mb/s, the block
size is 1200 bits and p1 = 6, p2 = 10. In Figure 9.42b, (a2|a1|a0) is represented as cell
(a2, a1) with color a0. The authors observe that the latency, power consumption and
complexity are improved over a conventional implementation. A latency of six
OFDM symbols could be achieved, with complexity reductions of 85.5, 69.4 and
40.3 % for 80, 200 and 480 Mb/s, respectively. The corresponding power consumption
reductions were 87.4 %, 73.6 % and 39.8 %.
Figure 9.42a shows the general representation of the interleaving process. The first
two consecutive modulo permutations given in (9.25a) and (9.25b) are considered
as a single process. In the interleaver architecture shown in Figure 9.42b, the data to
be encoded enters at the lower right cell. Note that there are M cells arranged as
M/(p1·p2) rows, with each row containing p1·p2 cells. These p1·p2 cells are grouped as
p2 groups, each containing p1 cells. After M cycles, the first bit would have reached
the leftmost corner cell. Next, the data is read out as indicated by the dotted lines. Note
that different columns in each group of p1 columns are colored differently. The end
point of one color is connected to the starting point of another color, as indicated by
the Ai's. The de-interleaving process is similar, where the input enters the INdecode
terminal and the output is taken in a similar way at the output decode terminal (leftmost corner
cell). Note finally that the cyclic shift of 33 required by (9.25c) needs connection
of the last point in one p1-cell cluster to the 33rd pin of the next cluster, and similarly
to the 66th pin of the third cluster, etc., as shown in Figure 9.42c. Note that this needs
additional multiplexers in some cells.
Quasi-chaotic (QC) generators based on RNS have been described in the literature
[102]. A cascade of first-order recursive filters exhibits quasi-chaotic behavior in
the absence of input. The computation performed is

x1(k) = (g1·x1(k − 1) + u(k)) mod m1    (9.26a)

x2(k) = (g2·x2(k − 1) + x1(k)) mod m2    (9.26b)

y(k) = xN(k) = (gN·xN(k − 1) + xN−1(k)) mod mN    (9.26c)

where u(k) is the input and xi(k), i = 1, 2, . . ., N, is the output of the ith section. The
zero-input response will exhibit periodic behavior with a period coincident with the
LCM of all (mi − 1). The length of the period can be very large. As an illustration, for
n = 8 and the moduli set {257, 263, 347, 359, 383, 467, 479, 503}, the period will be
256 × 131 × 173 × 179 × 191 × 233 × 239 × 251 ≈ 2.7 × 10^18. The shape of the generator
response is independent of the initial conditions. It depends instead on the gi values,
called primitives. For a prime modulus M, the number of primitives is f(f(M)) where f(M)
is the number of integers less than M and relatively prime to M. Each choice of the
gi defines a particular state of the generator. The number of possible states of an RNS
filter for an N-section generator is

S = ∏(i=1 to N) f(f(Mi)).

Thus, for the above example we have S = 128 × 130 × 172 × 178 × 190 × 232 × 238 × 250
≈ 1.3 × 10^18. The authors suggest, for an 8-bit input, the choice of the moduli set
{3, 7, 13}. Each of the three QC generators uses eight sections, with moduli {3, 139, 157, 173, 191, 199,
Figure 9.42 Interleaver architecture for MB-OFDM: (a) general representation of the interleaving processes, (b) interleaving processor in MRS (p2|p1), (c) design of the cyclic shift (adapted from [100] ©IEEE 2010)





227, 239}, {7, 149, 163, 179, 193, 211, 229, 241} and {13, 151, 167, 181, 197, 223, 233,
251}, respectively.
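The zero-input behavior of the cascade (9.26) is easy to check by simulation. The sketch below uses a toy two-section generator (moduli 3 and 7 with primitive roots 2 and 3, chosen here purely for illustration; the sets above are far larger) and verifies that the measured period equals the LCM of the (mi − 1):

```python
from math import lcm

def qc_step(state, g, m):
    """One zero-input step of the cascade (9.26): each section computes
    x_i(k) = (g_i * x_i(k-1) + x_{i-1}(k)) mod m_i, with u(k) = 0."""
    new, feed = [], 0
    for x, gi, mi in zip(state, g, m):
        x = (gi * x + feed) % mi
        new.append(x)
        feed = x                   # the fresh output drives the next section
    return tuple(new)

def period(state0, g, m):
    """Number of steps until the full state returns to its initial value."""
    state, k = qc_step(state0, g, m), 1
    while state != state0:
        state, k = qc_step(state, g, m), k + 1
    return k

m = (3, 7)                         # toy moduli
g = (2, 3)                         # primitive roots mod 3 and mod 7
assert period((1, 1), g, m) == lcm(*(mi - 1 for mi in m))   # period = lcm(2, 6) = 6
```

With the 8-section, large-prime moduli of [102], the same recursion yields the astronomically long periods quoted above.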
The authors have later also suggested a self-correcting communication system [103].
The input sequence containing the message is converted to residues, which
enter the QC generator. The outputs of the QC generator drive the self-correcting
transmission system, together with redundant residues. By using 2r redundant residues,
r errors can be corrected. The decoded residues are converted back at the receiver
using the CRT.

References

1. W.K. Jenkins, B.J. Leon, The use of residue number systems in the design of finite impulse
response digital filters. IEEE Trans. Circuits Syst. CAS-24, 191–201 (1977)
2. A. Peled, B. Liu, A new hardware realization of digital filters. IEEE Trans. Acoust. Speech
Signal Process. ASSP-22, 456–462 (1974)
3. H.T. Vergos, A 200 MHz RNS core, in Proceedings of ECCTD, vol. II (2001), pp. 249–252
4. L. Kalampoukas, D. Nikolos, C. Efstathiou, H.T. Vergos, J. Kalamatianos, High-speed
parallel-prefix modulo 2^n − 1 adders. IEEE Trans. Comput. 49, 673–680 (2000)
5. S.J. Piestrak, Design of residue generators and multi-operand modulo adders using carry save
adders, in Proceedings of 10th Symposium on Computer Arithmetic (1991), pp. 100–107
6. R. Zimmermann, Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication,
in Proceedings of IEEE Symposium on Computer Arithmetic (1999), pp. 158–167

7. A.D. Re, A. Nannarelli, M. Re, Implementation of digital filters in carry save Residue number
system, in Conference Record of 39th Asilomar Conference on Signals, Systems and Com-
puters (2001), pp. 1309–1313
8. G.C. Cardarilli, A.D. Re, A. Nannarelli, M. Re, Impact of RNS coding overhead on FIR filter
performance, in Proceedings of the 41st Asilomar Conference on Circuits, Systems and
Computers (2007), pp. 1426–1429
9. A.D. Re, A. Nannarelli, M. Re, A tool for automatic generation of RTL-level VHDL
description of RNS FIR filters, in Proceedings of Design, Automation and Test in
Europe Conference and Exhibition (2004), pp. 686–687
10. G.C. Cardarilli, A.D. Re, A. Nannarelli, M. Re, Low power low leakage implementation of
RNS FIR filters, in Conference Record of 39th Asilomar Conference on Signals, Systems and
Computers (2005), pp. 1620–1624
11. G.L. Bernocchi, G.C. Cardarilli, A.D. Re, A. Nannarelli, M. Re, A hybrid RNS adaptive filter
for channel equalization, in Proceedings of 40th Asilomar Conference on Signals, Systems
and Computers (2006), pp. 1706–1710
12. G.L. Bernocchi, G.C. Cardarilli, A.D. Re, A. Nannarelli, M. Re, Low-power adaptive filter
based on RNS components, in Proceedings of 2007 I.E. International Symposium on Circuits
and Systems (ISCAS) (2007), pp. 3211–3214
13. R. Conway, J. Nelson, Improved RNS FIR filter architectures. IEEE Trans. Circuits Syst.
Express Briefs 51, 26–28 (2004)
14. A. Wrzyszcz, D. Milford, A new modulo 2^α + 1 multiplier, in IEEE International Conference
on Computer Design: VLSI in Computers and Processors (1993), pp. 614–617
15. C.-L. Wang, New bit-serial VLSI implementation of RNS FIR digital filters. IEEE Trans.
Circuits Syst II. Analog Digit. Signal Process. 41, 768–772 (1994)
16. B. LaMacchia, G. Redinbo, RNS digital filtering structures for wafer-scale integration. IEEE
J. Sel. Areas Commun. 4, 67–80 (1986)
17. J.C. Bajard, L.S. Didier, T. Hilaire, ρ-Direct form transposed and Residue Number systems
for filter implementations, in IEEE 54th International Midwest Symposium on Circuits and
Systems (MWSCAS) (2011), pp. 1–4
18. P. Patronik, P.K. Berezowski, S.J. Piestrak, J. Biernat, A. Shrivastava, Fast and energy-
efficient constant-coefficient FIR filters using residue number system, in 2011 International
Symposium on Low Power Electronics and Design (ISLPED) (2011), pp. 385–390
19. J.H. Choi, N. Banerjee, K. Roy, Variation-aware low-power synthesis methodology for fixed-
point FIR filters. IEEE Trans. CAD 28, 87–97 (2009)
20. R. Patel, M. Benaissa, N. Powell, S. Boussakta, Novel power-delay-area efficient approach to
generic modular addition. IEEE Trans. Circuits Syst. I 54, 1279–1292 (2007)
21. L. Aksoy, E. Da Costa, P. Flores, J. Monteiro, Exact and approximate algorithms for the
optimization of area and delay in multiple constant multiplications. IEEE Trans. CAD 27,
1013–1026 (2008)
22. A. García, U. Meyer-Bäse, F.J. Taylor, Pipelined Hogenauer CIC filters using field-
programmable logic and residue number system, in Proceedings of IEEE International
Conference on Acoustics, Speech, and Signal Processing, Seattle, vol. 5 (1998),
pp. 3085–3088
23. M. Griffin, M. Sousa, F. Taylor, Efficient scaling in the residue number System, in Pro-
ceedings of IEEE ASSP (May 1989), pp. 1075–1078
24. G.C. Cardarilli, R. Lojacono, M. Salerno, F. Sargeni, VLSI RNS implementation of fast IIR
filters, in Proceedings of 35th Midwest Symposium on Circuits and Systems (1992),
pp. 1245–1248
25. W.K. Jenkins, Recent advances in residue number techniques for recursive digital filtering.
IEEE Trans. Acoust. Speech Signal Process. 27, 19–30 (1979)
26. M. Etzel, W. Jenkins, The design of specialized residue classes for efficient recursive digital
filter realization. IEEE Trans. Acoust. Speech Signal Process. 30, 370–380 (1982)

27. A. Nannarelli, G.C. Cardarilli, M. Re, Power-delay trade-off in residue number system. Proc.
IEEE ISCAS 5, 413–416 (2003)
28. G.C. Cardarilli, A. Del Re, A. Nannarelli, M. Re, Residue number system reconfigurable data
path, in Proceedings of IEEE ISCAS, vol. II, (2002), pp. 756–759
29. G.C. Cardarilli, A. Nannarelli, M. Re, Residue number system for low-power DSP applica-
tions, in 41st Asilomar Conference (2007), pp. 1412–1416
30. A. Nannarelli, M. Re, G. C. Cardarilli, Trade-offs between Residue number system and
traditional FIR filters, in Proceedings of IEEE ISCAS (2001), pp. 305–308
31. M.N. Mahesh, M. Mehendale, Low-power realization of residue number system based FIR
filters, in 13th International Conference on VLSI Design, Bangalore (May 2001), pp. 350–353
32. W.L. Freking, K.K. Parhi, Low-power FIR digital filters using residue arithmetic, in Confer-
ence Record of 31st Asilomar Conference on Signals, Systems and Computers (ACSSC 1997),
Pacific Grove, vol. 1 (1997), pp. 739–743
33. M.K. Ibrahim, A note on digital filter implementation using hybrid RNS-binary arithmetic.
Signal Process. 40, 287–294 (1994)
34. B. Parhami, A note on digital filter implementation using hybrid RNS-binary arithmetic.
Signal Process. 41, 65–67 (1996)
35. G.C. Cardarilli, M. Re, R. Lojacono, A new RNS FIR filter architecture, in Proceedings of
13th International Conference on Digital Signal Processing, DSP 97 (1997), pp. 671–674
36. P.L. Montgomery, Modular multiplication without trial division. Math. Comput. 44, 519–521
(1985)
37. K.G. Smitha, A.P. Vinod, A reconfigurable high-speed RNS FIR channel filter for multi-
standard software radio receivers, in Proceedings of IEEE ICCS (2008), pp. 1354–1358
38. R. Conway, Efficient residue arithmetic based parallel fixed coefficient FIR filters, in Pro-
ceedings of IEEE ISCAS (2008), pp. 1484–1487
39. I. Lee, W.K. Jenkins, The design of residue number system arithmetic units for a VLSI
adaptive equalizer, in IEEE Proceedings of 8th Great Lakes Symposium (1998), pp. 179–184
40. A. Shah, M. Sid-Ahmed, G. Jullien, A proposed hardware structure for two-dimensional
recursive digital filters using the residue number system. IEEE Trans. Circuits Syst. 32,
285–288 (1985)
41. G.A. Jullien, Residue number scaling and other operations using ROM arrays, in IEEE
Transactions on Computers, vol. 27, no. 4 (1978), pp. 325–337
42. N.R. Shanbag, R.E. Siferd, A single-chip pipelined 2-D FIR filter using residue arithmetic.
IEEE J. Solid State Circuits 26, 796–805 (1991)
43. M.A. Soderstrand, A high-speed low-cost recursive digital filter using residue number
arithmetic. Proc. IEEE 65, 1065–1067 (1977)
44. L.T. Bruton, Low-sensitivity digital ladder filters. IEEE Trans. Circuits Syst. CAS-22,
168–176 (1975)
45. F.J. Taylor, C.H. Huang, An auto-scale residue multiplier. IEEE Trans. Comput. C-31,
321–325 (1982)
46. R. Ramnarayan, F. Taylor, On large moduli residue number system recursive digital filters.
IEEE Trans. Circuits Syst. 32, 349–359 (1985)
47. J. Chen, J. Hu, Energy-efficient digital signal processing via voltage-overscaling-based
Residue Number System. IEEE Trans. VLSI Syst. 21, 1322–1332 (2013)
48. Z. Luan, X. Chen, N. Ge, Z. Wang, Simplified fault tolerant FIR filter architecture based on
redundant residue number system. Electron. Lett. 50, 1768–1770 (2014)
49. Z. Gao, P. Reviriego, W. Pan, Z. Xu, M. Zhao, J. Wang, J.A. Maestro, Efficient arithmetic-
residue-based SEU-tolerant FIR filter design. IEEE Trans. Circuits Syst. II 60, 497–501
(2013)
50. R. Chaves, L. Sousa, RDSP: a RISC DSP based on residue number system, in Proceedings of
Euromicro Symposium on Digital System Design: Architectures, Methods, and Tools,
Antalya, Turkey (2003), pp. 128–135

51. H. Fu, O. Mencer, W. Luk, Optimizing residue arithmetic on FPGAs, in Proceedings of


International Conference on ICECE Technology 2008, FPT 08 (2008), pp. 41–48
52. D. Gallaher, F.E. Petry, P. Srinivasan, The digit parallel method for fast RNS to weighted
number system conversion for specific moduli (2^k − 1, 2^k, 2^k + 1). IEEE Trans. Circuits Syst. II
44, 53–57 (1997)
53. R. Chokshi, K.S. Berezowski, A. Shrivastava, S.J. Piestrak, Exploiting residue number
system for power-efficient digital signal processing in embedded processors, in Proceedings
of ACM Embedded Systems Week Conference—CASES, Grenoble, France (2009), pp. 19–27
54. J. Ramirez, A. Garcia, S. Lopez-Buedo, A. Lloris, RNS-enabled digital signal processor
design. Electron. Lett. 38, 266–268 (2002)
55. S. Perumal, R.E. Siferd, Pipelined 50 MHz CMOS ASIC for 32 bit binary to residue
conversion and residue to binary conversion, in Proceedings of Seventh Annual IEEE
International ASIC Conference and Exhibit (1994), pp. 454–457
56. A. Garcia, U. Meyer-Base, A. Lloris, F.J. Taylor, RNS implementation of FIR filter using
distributed arithmetic using field programmable logic. Proc. IEEE ISCAS 1, 486–489 (1999)
57. J. Ramirez, A. Garcia, U. Meyer-Base, F.J. Taylor, A. Lloris, Implementation of RNS based
distributed arithmetic discrete wavelet transform architectures. J. VLSI Signal Process. 3,
171–190 (2003)
58. C.H. Vun, A.B. Premkumar, W. Zhang, A new RNS based DA approach for inner product
computation. IEEE Trans. Circuits Syst. I 60, 2139–2152 (2013)
59. G.C. Cardarilli, A. Nannarelli, M. Re, On the comparison of different number systems in the
implementation of complex FIR filters, in Proceedings of 16th IFIP/IEEE International
Conference on Very Large Scale Integration (VLSI-SoC), Rodos, Greece (2008), pp. 37–41
60. F.J. Taylor, A single modulus complex ALU for signal processing, in IEEE Transactions on
ASSP (1985), pp. 1302–1318
61. G.C. Cardarilli, A. Del Re, A. Nannarelli, M. Re, Low-power implementation of polyphase
filters in Quadratic Residue Number system, in Proceedings of IEEE International Sympo-
sium on Circuits Systems (ISCAS 2004), Vancouver, vol. 2 (2004), pp. 725–728
62. G.C. Cardarilli, A. Nannarelli, Y. Oster, M. Petricca, M. Re, Design of large polyphase filters
in the Quadratic Residue Number system, in Proceedings of Asilomar Conference (2010),
pp. 410–413
63. A. D’Amora, A. Nannarelli, M. Re, G.C. Cardarilli, Reducing power dissipation in complex
digital filters by using the quadratic residue number system, in Conference Record 34th
Asilomar Conference on Signals, Systems and Computers (ACSSC 2000), Pacific Grove, vol.
2 (2000), pp. 879–883
64. T. Stouraitis, V. Paliouras, Considering the alternatives in low-power design. IEEE Circuits
Devices Mag. 17, 22–29 (2001)
65. P.G. Fernandez, A. Garcia, J. Ramirez, L. Parrilla, A. Lloris, A new implementation of the
discrete cosine transform in the residue number system, in Proceedings of 33rd Asilomar
Conference on Signals, Systems and Computers, vol. 2 (1999), pp. 1302–1306
66. J. Ramirez, A. Garcia, P.G. Fernandez, L. Parrilla, A. Lloris, A new architecture to compute
discrete cosine transform using the quadratic residue number system. Proc. IEEE ISCAS 5,
321–324 (2000)
67. J. Ramirez, A. Garcia, P.G. Fernandez, L. Parrilla, A. Lloris, A novel QRNS-based 1-D DCT
processor over field programmable logic devices, in Proceedings of the XV Design of Circuits
and Integrated Systems Conference DCIS 2000, Montpellier, Francia (2000), pp. 610–615
68. J. Ramirez, A. Garcia, P.G. Fernandez, L. Parrilla, A. Lloris, A fast QRNS based algorithm
for DCT and its field programmable logic implementation. JCSC 12, 111–121 (2003)
69. P.G. Fernandez, J. Ramirez, A. Garcia, L. Parrilla, A. Lloris, A new RNS architecture for the
computation of the scaled 2D-DCT on field programmable logic, in Conference Record
of the 34th Asilomar Conference, vol. 1 (2000), pp. 379–383
70. K.R. Rao, P. Yip, Discrete Cosine Transform Algorithms: Advantages and Applications
(Academic, Boston, 1990)

71. M.J. Narasimha, A.M. Peterson, On the computation of the discrete cosine transform. IEEE
Trans. Commun. 26, 934–946 (1978)
72. Y. Arai, T. Agui, M. Nakajima, A fast DCT-SQ scheme for images. Trans. IEICE 71,
1095–1097 (1988)
73. F.J. Taylor, An RNS discrete Fourier transform implementation. IEEE Trans. Acoust. Speech
Signal Process. 38, 1386–1394 (1990)
74. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice-Hall,
Englewood Cliffs, 1979)
75. B.D. Tseng, G.A. Jullien, W.C. Miller, Implementation of FFT structures using the residue
number system. IEEE Trans. Comput. C-28, 831–845 (1979)
76. F.J. Taylor, G. Papadorakis, A. Skavantzos, T. Stouraitis, A radix-4 FFT using complex RNS
arithmetic. IEEE Trans. Comput. 34, 573–576 (1985)
77. G.A. Jullien, M. Taheri, J. Carr, G. Thomsen, W.C. Miller, A VLSI systolic quadratic residue
DFT with fault tolerance, in Proceedings of ISCAS (1988), pp. 2271–2274
78. A.Z. Baraniecka, G.A. Jullien, Residue number system implementations of number theoretic
transforms in complex residue rings. IEEE Trans. Acoust. Speech Signal Process. 28,
285–291 (1980)
79. R. Krishnan, G. Jullien, W. Miller, Implementation of complex number theoretic transforms
using quadratic residue number systems. IEEE Trans. Circuits Syst. 33, 759–766 (1986)
80. V.S. Dimitrov, G.A. Jullien, W.C. Miller, A residue number system implementation of real
orthogonal transforms. IEEE Trans. Signal Process. 46, 563–570 (1998)
81. H.C. Shyu, T.K. Truong, I.S. Reed, A complex integer multiplier using the quadratic
polynomial residue number system with numbers of the form 2^{2n} + 1. IEEE Trans. Comput.
C-36, 1255–1258 (1987)
82. M. Abdallah, A. Skavantzos, On the binary quadratic residue number system with
non-coprime moduli. IEEE Trans. Signal Process. 45, 2085–2091 (1997)
83. Y. Liu, E.M.K. Lai, Design and implementation of an RNS-based 2-D DWT processor. IEEE
Trans. Consum. Electron. 50, 376–385 (2004)
84. J. Ramirez, A. Garcia, P.G. Fernandez, L. Parrilla, A. Lloris, RNS-FPL merged architecture
for orthogonal DWT. Electron. Lett. 36, 1198–1199 (2000)
85. T. Toivonen, J. Heikkilä, Video filtering with Fermat number theoretic transforms using
residue number system. IEEE Trans. Circuits Syst. Video Technol. 16, 92–101 (2006)
86. L.L. Yang, L. Hanzo, Residue number system based multiple-code DS-CDMA system, in
IEEE 49th Vehicular Technology Conference, vol. 2 (1999), pp. 1450–1454
87. L.L. Yang, L. Hanzo, Ratio statistic assisted residue number system based parallel communication
scheme, in IEEE 49th Vehicular Technology Conference, vol. 2 (1999), pp. 894–898
88. L.L. Yang, L. Hanzo, A residue number system based parallel communication scheme using
orthogonal signaling: part I—system outline. IEEE Trans. Veh. Technol. 51, 1534–1546
(2002)
89. L.L. Yang, L. Hanzo, Residue number system assisted fast frequency-hopped synchronous
ultra-wideband spread-spectrum multiple-access: a design alternative to impulse radio. IEEE
J. Sel. Areas Commun. 20, 1652–1663 (2002)
90. H. Krishna, K.Y. Lin, J.D. Sun, A coding theory approach to error control in redundant
residue number systems. I. Theory and single error correction. IEEE Trans. Circuits Syst. II
Analog Digit. Signal Process. 39, 8–17 (1992)
91. H. Krishna, J.D. Sun, On theory of fast algorithms for error correction in residue number
systems. IEEE Trans. Comput. C-42, 840–852 (1993)
92. J. Ramirez, A. Garcia, U. Meyer-Base, A. Lloris, Fast RNS FPL based communications
receiver design and implementation, in Proceedings of the 12th International Conference,
FPL 2002 Montpellier, France (2–4 September 2002), pp. 472–481
93. A. Chren, RNS-based enhancements for direct digital frequency synthesis. IEEE Trans.
Circuits Syst. II Analog Digit. Signal Process. 42, 516–524 (1995)
94. P.V. Ananda Mohan, On RNS-based enhancements for direct digital frequency synthesis.
IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 48, 988–990 (2001)

95. A.S. Madhukumar, F. Chin, Enhanced architecture for residue number system-based CDMA
for high-rate data transmission. IEEE Trans. Wirel. Commun. 3, 1363–1368 (2004)
96. S. Zhang, L.L. Yang, Y. Zhang, Redundant residue number system assisted multicarrier
direct-sequence code-division dynamic multiple access for cognitive radios. IEEE Trans.
Veh. Technol. 61, 1234–1250 (2012)
97. Y. Yi, H. Jian-Hao, RNS based OFDM transmission scheme with low PAPR, in Proceedings
of International Conference on Computational Problem Solving (2011), pp. 326–329
98. H.T. How, T.H. Liew, Ee-Lin Kuan, L.L. Yang, L. Hanzo, A redundant residue number
system coded burst-by-burst adaptive joint-detection based CDMA speech transceiver, IEEE
Trans. Veh. Technol. 55, 387–396 (2006)
99. D. Zhu, B. Natarajan, Residue number system arithmetic-inspired hopping-pilot pattern
design. IEEE Trans. Veh. Technol. 59, 3679–3683 (2010)
100. Y. Han, P. Harliman, S.W. Kim, J.K. Kim, C. Kim, A novel architecture for block interleav-
ing algorithm in MB-OFDM using mixed radix system, in IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 18 (2010), pp. 1020–1024
101. WiMedia Alliance, MAC-PHY interface specification 1.0. (2005), http://www.Wimedia.org
102. M. Panella, G. Martinelli, RNS quasi-chaotic generators. Electron. Lett. 36, 1325–1326
(2000)
103. M. Panella, G. Martinelli, RNS quasi-chaotic generator for self-correcting secure communi-
cation. Electron. Lett. 37, 325–327 (2001)

Further Reading

A. Bertossi, A. Mei, A residue number system on reconfigurable mesh with application to prefix
sums and approximate string matching. IEEE Trans. Parallel Distrib. Syst. 11, 1186–1199
(2000)
E. Kinoshita, K.J. Lee, A residue arithmetic extension for reliable scientific computation. IEEE
Trans. Comput. 46, 129–138 (1997)
Chapter 10
RNS in Cryptography

The need for securely accessing information and protecting it from unauthorized
persons is well recognized. These needs can be met by using encryption algorithms
and authentication algorithms [1, 2]. Encryption can be achieved by using block
ciphers or stream ciphers. A block cipher maps the information, considered as fixed
blocks of data (e.g. 64-bit or 128-bit), into another 64-bit or 128-bit block under the
control of a key. Stream ciphers, on the other hand, generate a random sequence of bits
which can be used to mask the plain-text bit stream (by performing a bit-wise
Exclusive-Or operation). Several techniques exist for block and stream cipher
implementation. These are called symmetric-key systems, since the receiver uses
for decryption the same key used for the block or stream cipher at the transmitter.
This key has to be made available to the receiver by prior arrangement or by using
key exchange algorithms, e.g. Diffie-Hellman key exchange. The other requirement,
authentication of a source, is performed using several techniques, notable among
which is the RSA (Rivest-Shamir-Adleman) algorithm. This algorithm is the workhorse of
public-key cryptography. As against symmetric-key systems, in this case two
keys are needed, known as the public key and the private key. The strength of these systems
derives from the difficulty of factoring large numbers which are products of two
big primes.
The RSA algorithm is briefly described next. First, Alice chooses two primes p and q.
The product of the primes n = p × q is made public, since it is extremely difficult to
find p and q given n. Next, Alice defines φ(n) = (p − 1) × (q − 1), which gives the
number of integers less than n and prime to n. Alice then chooses a value e denoted
as the encryption key (public key) and computes a decryption key (private key)
d such that e × d = 1 mod φ(n). The private key d is not disclosed to anybody.
Alice can now encrypt a message m by obtaining C = m^d mod n, where C stands
for the cipher text corresponding to m. Anybody (say Bob, the intended recipient, or
anybody else) who has knowledge of Alice's public key can now find m by
computing C^e mod n = m. This method is useful for confirming that only Alice could
have sent this message, since a meaningful message could be obtained by decryption.


This is the problem of authentication. The mathematical operations performed by
Alice and the receiver are exponentiation mod n operations. We can employ the key
pair in another way too. Suppose Bob wants to send a message to Alice: he can use
Alice's public key e and send C = m^e mod n. Then Alice can use her private key
and compute C^d mod n to get m. This implies that Alice, the only owner of the
private key, and nobody else, can decrypt the message.
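The key-pair relationship e × d ≡ 1 (mod φ(n)) and the round trips described above can be checked numerically; the primes below are toy values chosen for illustration, and Python's built-in pow supplies both the modular inverse and modular exponentiation:

```python
p, q = 11, 19                      # toy primes for illustration only
n = p * q                          # n = 209, made public
phi = (p - 1) * (q - 1)            # phi = 180, kept secret
e = 7                              # public exponent with gcd(e, phi) = 1
d = pow(e, -1, phi)                # private exponent: e*d = 1 (mod phi)
assert (e * d) % phi == 1          # here d = 103

m = 42                             # a message, 0 <= m < n
c = pow(m, e, n)                   # encryption with the public key
assert pow(c, d, n) == m           # decryption with the private key recovers m
assert pow(pow(m, d, n), e, n) == m  # signing then verifying also recovers m
```

The last assertion is the authentication direction: exponentiating with d first and e afterwards recovers m just as well, which is why the same key pair serves both confidentiality and signing.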
Given the computing power available for the factorization of n, it is recommended that p and q typically be 1024–2048 bits each. The exponentiation operation is realized as successive squaring and multiplication operations. An example will illustrate the complexity of the procedure. Let us choose p = 13 and q = 17 so that n = 221. We note that φ(221) = (13 − 1) × (17 − 1) = 192. Choose e = 7 such that gcd(e, φ(n)) = 1, where gcd is the greatest common divisor. Next, d can be computed as 55 since e·d mod φ(n) = 7 × 55 mod 192 = 1. Thus e = 7 is the encryption key and d = 55 is the decryption key. Since we need n also, we usually denote the encryption key together with n as the public key (e, n) = (7, 221) and similarly the private key (d, n) = (55, 221). Consider a message m = 11. Encryption needs computation of c = m^e mod n = 11^7 mod 221 = 54. Next, decryption using the private key yields m = c^d mod n = 54^55 mod 221 = 11. Alternatively, signing a message m = 7 with the private key yields the signature s = m^d mod n = 7^55 mod 221 = 97, and verification using the public key yields 97^7 mod 221 = 7. We have used a calculator to perform the exponentiation mod n. In practice, 54^55 mod 221 is computed by successive squaring and optional multiplication, all modulo n. Expressing the exponent in binary form, for example 55 = (110111)2, we note that 54^55 = 54^32 × 54^16 × 54^4 × 54^2 × 54. Hence, we need to square 54 to obtain 54^2 mod 221, square 54^2 mod 221 to obtain 54^4 mod 221, and so on. Next, scanning the exponent to find the bits which are one, we multiply the corresponding intermediate results together, skipping a factor wherever the exponent bit is zero. For illustration, we need not multiply 54^7 by 54^8; instead, in the next step we multiply by 54^16 to obtain 54^23. In successive steps, we obtain 54, 54^3, 54^7, 54^23 and 54^55, all mod n. Thus, in general, for a 1024-bit exponent, 1023 modulo squaring operations need to be done compulsorily, whereas in the worst case 1023 optional modulo multiplications need to be carried out. It is therefore essential to speed up the basic operation: modulo multiplication (A × B) mod n. Some techniques for modulo multiplication have already been described in Chapter 4. We explore other techniques in detail in this chapter.
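The square-and-multiply procedure just described can be sketched compactly. The following Python fragment scans the exponent bits from most to least significant, squaring unconditionally and multiplying only on a 1 bit, and reproduces the numbers of the example above:

```python
def mod_exp(base, exponent, modulus):
    """Left-to-right binary exponentiation: square always, multiply only
    when the current exponent bit is 1."""
    result = 1
    for bit in bin(exponent)[2:]:              # scan exponent MSB to LSB
        result = (result * result) % modulus   # compulsory squaring
        if bit == '1':
            result = (result * base) % modulus # optional multiplication
    return result

# RSA example from the text: p = 13, q = 17, n = 221, e = 7, d = 55
assert mod_exp(11, 7, 221) == 54       # encryption: 11^7 mod 221
assert mod_exp(54, 55, 221) == 11      # decryption: 54^55 mod 221
assert mod_exp(7, 55, 221) == 97       # signing: 7^55 mod 221
assert mod_exp(97, 7, 221) == 7        # verification: 97^7 mod 221
```

For a 1024-bit exponent this loop performs 1023 squarings and, in the worst case, 1023 multiplications, all modulo n.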
In cryptographic applications such as RSA encryption, Diffie-Hellman key exchange and elliptic curve cryptography, modulo multiplication and modulo exponentiation of large numbers, with bit lengths typically between 160 and 2048 bits, are required. Two popular techniques are based on Barrett reduction and Montgomery multiplication. However, to perform the operation (XY) mod N for a single modulus, an RNS using several small-word-length moduli can be employed. This topic has recently received considerable attention. We deal with both RNS-based and non-RNS-based (i.e. using only one modulus) implementations in the following sections. In this chapter, we also consider applications of RNS in elliptic curve cryptography processors and for the implementation of pairing protocols.
10.1 Modulo Multiplication Using Barrett’s Technique

We first consider Barrett's technique [3, 4] for computing r = x mod M, given x and M in base b, where x = (x(2k−1) . . . x1 x0)b and M = (m(k−1) . . . m1 m0)b with m(k−1) ≠ 0. The radix b is chosen typically to be the word length of the processor. We assume b > 3 herein. Barrett's technique requires pre-computation of a parameter μ = ⌊b^(2k)/M⌋. First, we find consecutively q1 = ⌊x/b^(k−1)⌋, q2 = q1·μ and q3 = ⌊q2/b^(k+1)⌋. Next, we compute r1 = x mod b^(k+1), r2 = (q3·M) mod b^(k+1) and r = r1 − r2. If r < 0, r = r + b^(k+1), and while r ≥ M, r = r − M; since 0 ≤ r < 3M after the conditional addition of b^(k+1), at most two subtractions of M are needed. Note that the divisions are simple right shifts of the base-b representation. In the multiplication q2 = q1·μ, the least significant digits are not required to determine q3, except for determining the carry into digit position (k + 1). Hence, the k − 1 least significant digits of q2 need not be computed. Similarly, r2 = q3·M can also be simplified as a partial multiple-precision multiplication which evaluates only the least significant (k + 1) digits of q3·M; r2 can thus be computed using at most (k + 1)k/2 + k single-precision multiplications. Since μ and q1 have at most (k + 1) digits each, determining q3 needs at most (k + 1)^2 − k(k − 1)/2 = (k^2 + 5k + 2)/2 single-precision multiplications. Note that q2 is needed only for computing q3.
As an illustration, consider finding 121 mod 13 with b = 2, so that k = 4. We obtain μ = ⌊2^8/13⌋ = 19, q1 = ⌊121/2^3⌋ = 15, q2 = 15 × 19 = 285 and q3 = ⌊285/2^5⌋ = 8. Thus, r1 = 25, r2 = 8 and r = 17. Since r > 13, we have r = r − 13 = 4.
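The steps above can be sketched as follows (a Python rendering of Barrett's reduction; the while loop reflects the fact that 0 ≤ r < 3M, so at most two subtractions of M are needed):

```python
def barrett_reduce(x, M, b, k, mu):
    """Barrett reduction r = x mod M for 0 <= x < b^(2k), using the
    precomputed constant mu = floor(b^(2k) / M)."""
    q1 = x // b**(k - 1)                   # drop the k-1 LS digits of x
    q2 = q1 * mu
    q3 = q2 // b**(k + 1)                  # estimated quotient
    r = x % b**(k + 1) - (q3 * M) % b**(k + 1)
    if r < 0:
        r += b**(k + 1)
    while r >= M:                          # 0 <= r < 3M: at most two passes
        r -= M
    return r

b, k, M = 2, 4, 13                         # the example from the text
mu = b**(2 * k) // M
assert mu == 19
assert barrett_reduce(121, M, b, k, mu) == 4
```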
Barrett's algorithm estimates the quotient q = ⌊X/M⌋ for b = 2 in the general case as

q̂ = ⌊ ⌊X/2^(k+β)⌋ · ⌊2^(k+α)/M⌋ / 2^(α−β) ⌋     (10.1a)

where α and β are two parameters. The value μ = ⌊2^(k+α)/M⌋ can be pre-computed and stored. Several attempts have been made to overcome the last modulo reduction operation. Dhem [5] has suggested α = w + 3, β = −2 for radix 2^w, so that the maximum error in computing q is 1. Barrett used α = n, β = −1. The classical modular multiplication algorithm to find (X × Y) mod M is presented in Figure 10.1, where multiplication and reduction are integrated. Note that step 4 uses (10.1a). Quisquater [7] and other authors [8] have suggested writing the quotient as

q̂ = ⌊ X · ⌊2^(k+c)/M⌋ / 2^(k+c) ⌋     (10.1b)

and the result is T = X − q̂·M.



Figure 10.1 High-radix classical modulo multiplication algorithm (adapted from [6] ©IEEE2010)

Knezevic et al. [6] have observed that the performance of Barrett reduction can be improved by choosing moduli of the form 2^n − Δ (set S1) or 2^(n−1) + Δ (set S2), where Δ is a sufficiently small positive integer (the precise bounds on Δ are given in [6]). In such cases, the value of q̂ in (10.1a) can be computed as

q̂ = ⌊Z/2^n⌋ if M ∈ S1  or  q̂ = ⌊Z/2^(n−1)⌋ if M ∈ S2     (10.1c)

This modification does not need any computation, unlike (10.1b). Since many recommendations such as SEC (Standards for Efficient Cryptography), NIST (National Institute of Standards and Technology) and ANSI (American National Standards Institute) use such primes, the above method is useful.
Brickell [9, 10] has introduced a concept called the carry-delayed adder. This comprises a conventional carry-save adder (CSA) whose carry and sum outputs are added in another level of CSA comprising half-adders. The result in carry-save form has the interesting property that a sum bit and the carry bit of the next higher position are never both '1'. As an illustration, consider the following example:

A = 40  101000
B = 25  011001
C = 20  010100
S = 37  100101   (CSA sum)
C = 48  0110000  (CSA carry)
T = 21  010101   (half-adder sum)
D = 64  1000000  (half-adder carry)

The output (D, T) is called a carry-delayed number or carry-delayed integer. It may be checked that Ti·D(i+1) = 0 for all i = 0, . . ., k − 1.
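A minimal sketch of the carry-delayed adder, with Python integers standing in for bit-vectors, reproduces the example above and the property Ti·D(i+1) = 0:

```python
def csa(a, b, c):
    """3:2 carry-save adder: sum word and left-shifted carry word."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def carry_delayed(a, b, c):
    """Brickell's carry-delayed adder: a CSA stage followed by a level of
    half-adders, giving (T, D) with T_i * D_(i+1) = 0 for every i."""
    s, carry = csa(a, b, c)
    t = s ^ carry                          # half-adder sum bits
    d = (s & carry) << 1                   # half-adder carry bits
    return t, d

t, d = carry_delayed(40, 25, 20)           # the example from the text
assert (t, d) == (21, 64)
assert t + d == 40 + 25 + 20
assert t & (d >> 1) == 0                   # the delayed-carry property
```

The last assertion holds structurally: t = s XOR carry and d = (s AND carry) << 1 can never have a 1 in adjacent (i, i + 1) positions.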

Brickell [9] has used this concept to perform modular multiplication. Consider computing P = AB mod M where A is a carry-delayed integer:

A = Σ (i = 0 to k − 1) (Ti + Di)·2^i

Then P = AB can be computed by summing the terms

(T0·B + D0·B)·2^0 + (T1·B + D1·B)·2^1 + (T2·B + D2·B)·2^2 + · · · + (T(k−1)·B + D(k−1)·B)·2^(k−1)

Rearranging, noting that D0 = 0, we have

2^0·T0·B + 2^1·D1·B + 2^1·T1·B + 2^2·D2·B + 2^2·T2·B + 2^3·D3·B + · · · + 2^(k−2)·T(k−2)·B + 2^(k−1)·D(k−1)·B + 2^(k−1)·T(k−1)·B

Since either Ti or D(i+1) is zero due to the delayed-carry adder, each step requires a shift of B and the addition of at most one term to the carry-delayed accumulator:

either (Pd, Pt) = (Pd, Pt) + 2^i·Ti·B or (Pd, Pt) = (Pd, Pt) + 2^(i+1)·D(i+1)·B

After k steps, P = (Pd, Pt) is obtained.

Brickell suggests adding terms until P exceeds 2^k and only then adding a correction of value (2^k − M). Brickell shows that, 11 steps after the multiplication starts, the algorithm starts subtracting multiples of M, since P is a carry-delayed integer of k + 11 bits which needs to be reduced mod M.

10.2 Montgomery Modular Multiplication

The Montgomery multiplication (MM) algorithm for processor-based implementations [11] uses two techniques: separated or integrated multiplication and reduction. In separated multiplication and reduction, the multiplication of A and B, each of s words, is performed first and then Montgomery reduction is performed. On the other hand, in the integrated MM algorithm, these two operations alternate. The integration can be coarse- or fine-grained (meaning how often we switch between multiplication and reduction: after processing an array of words or after processing just one word). The next option concerns the general form of the multiplication and reduction steps. One form is operand scanning, in which the outer loop moves through the words of one operand. In the other form, known as product scanning, the outer loop moves through the words of the product itself. Note that operand scanning or product scanning is independent of whether multiplication and reduction are integrated or separated. In addition, the multiplication can take one form and the reduction another form in the integrated approach.

As such, we have five techniques: (a) separated operand scanning (SOS), (b) coarsely integrated operand scanning (CIOS), (c) finely integrated operand scanning (FIOS), (d) finely integrated product scanning (FIPS) and (e) coarsely integrated hybrid scanning (CIHS). The word multiplications needed in all these techniques number (2s^2 + s), whereas the word additions for FIPS are (6s^2 + 4s + 2), for SOS, CIOS and CIHS (4s^2 + 4s + 2), and for FIOS (5s^2 + 3s + 2).
In the SOS technique, we first obtain the product (A × B) as a 2s-word integer t. Next, we compute u = (t + mn)/r, where m = (t·n′) mod r and n′ = (−n^(−1)) mod r. We first take u = t and add mn to it using the standard multiplication routine. We then divide the result by 2^(sw), which we accomplish by ignoring the least significant s words. The reduction actually proceeds word by word using n0′ = n′ mod 2^w: each time, the result is shifted right by one word, implying division by 2^w. The number of word multiplications is (2s^2 + s).
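The SOS flow can be sketched as follows (Python; the 16-bit word size and the sample modulus are illustrative assumptions, and Python's three-argument pow is used for the precomputed inverse n0′):

```python
W = 16                                            # word size (illustrative)

def sos_montgomery(a, b, n, s):
    """Separated Operand Scanning: compute the full product first, then do
    word-by-word Montgomery reduction; returns a*b*2^(-s*W) mod n."""
    n0p = (-pow(n, -1, 1 << W)) % (1 << W)        # n0' = (-n^(-1)) mod 2^W
    t = a * b                                     # 2s-word product
    for i in range(s):
        ti = (t >> (i * W)) & ((1 << W) - 1)      # i-th word of t
        m = (ti * n0p) % (1 << W)
        t += (m * n) << (i * W)                   # zeroes out the i-th word
    u = t >> (s * W)                              # divide by r = 2^(s*W)
    return u - n if u >= n else u                 # final conditional subtract

n, s = 0xF123456789ABCDEF1, 5                     # odd modulus (illustrative)
a, b = 0x123456789, 0x987654321
assert sos_montgomery(a, b, n, s) == (a * b * pow(1 << (s * W), -1, n)) % n
```

Each loop iteration adds a multiple of n chosen so that one more low word of t becomes zero, which makes the final division by r = 2^(sW) a simple word shift.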
The CIOS technique [11, 12] improves on the SOS technique by integrating the multiplication and reduction steps. Here, instead of computing the complete product (A × B) and then reducing it, we alternate between the iterations of the outer loops for multiplication and reduction. Consider an example with A and B each comprising four words a3, a2, a1, a0 and b3, b2, b1, b0, respectively. First a0b0 is computed, and we denote the result as cout0 and tout00, where tout00 is the least significant word and cout0 is the most significant word. In the second cycle, two operations are performed simultaneously: we multiply tout00 by n0′ to get m0, and we also compute a1b0 and add cout0 to obtain cout1, tout01. At this stage, we know the multiple of N to be added to make the least significant word zero. In the third cycle, a2b0 is computed
and added to cout1 to obtain cout2, tout02 and in parallel m0n0 is computed and added
to tout00 to obtain cout3. In the fourth cycle, a3b0 is computed and added with cout2
to get cout4 and tout03 and simultaneously m0n1 is computed and added with cout3
and tout01 to obtain cout5 and tout10. Note that the multiplication with b0 is
completed at this stage, but reduction is lagging behind by two cycles. In the fifth
cycle, a0b1 is computed and added with tout10 to get cout7 and tout20 and simul-
taneously m0n2 is computed and added with cout5 and tout02 to obtain cout6 and
tout11. In addition, cout4 is added to get tout04 and tout05. In the sixth cycle, a1b1 is
computed and added with cout7, tout11 to get cout9 and tout21 and simultaneously
m0n3 is computed and added with cout6 and tout03 to obtain cout8 and tout12. In
addition, tout2 is multiplied with n0 to get m1. In this way, the computation proceeds
and totally 18 cycles are needed.
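The cycle-by-cycle schedule above follows the standard word-level CIOS loop structure, which can be sketched as follows (Python; the word size and sample operands are illustrative assumptions, and the two inner loops correspond to the multiplication and reduction steps that the outer loop alternates between):

```python
W = 16
MASK = (1 << W) - 1

def to_words(x, s):
    return [(x >> (j * W)) & MASK for j in range(s)]

def cios_montgomery(a, b, n, s):
    """Coarsely Integrated Operand Scanning: for each word a_i, a
    multiplication step t += a_i*B, then a reduction step t = (t + m*N)/2^W."""
    n0p = (-pow(n, -1, 1 << W)) % (1 << W)
    aw, bw, nw = to_words(a, s), to_words(b, s), to_words(n, s)
    t = [0] * (s + 2)
    for i in range(s):
        carry = 0                                  # multiplication step
        for j in range(s):
            tmp = t[j] + aw[i] * bw[j] + carry
            t[j], carry = tmp & MASK, tmp >> W
        tmp = t[s] + carry
        t[s], t[s + 1] = tmp & MASK, tmp >> W
        m = (t[0] * n0p) & MASK                    # reduction step
        carry = (t[0] + m * nw[0]) >> W            # word 0 becomes zero
        for j in range(1, s):
            tmp = t[j] + m * nw[j] + carry
            t[j - 1], carry = tmp & MASK, tmp >> W # shift down one word
        tmp = t[s] + carry
        t[s - 1], t[s] = tmp & MASK, t[s + 1] + (tmp >> W)
    u = sum(t[j] << (j * W) for j in range(s + 1))
    return u - n if u >= n else u

n, s = 0xF123456789ABCDEF1, 5
a, b = 123456789123, 987654321987
assert cios_montgomery(a, b, n, s) == (a * b * pow(1 << (s * W), -1, n)) % n
```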
The FIOS technique integrates the two inner loops of the CIOS method by computing both the addition and the multiplication in the same loop. In each iteration, X0·Yi is calculated and the result is added to Z. Using Z0 we calculate T as T = ((Z0 + X0·Yi)·((−1/M) mod r)) mod r. Next, we add M·T to Z. The least significant word Z0 of Z will then be zero, and hence division by r is exact and performed by a simple right shift. The number of word multiplications in each step is (2s + 1), and hence in total (2s^2 + s) word multiplications and (2s^2 + s) cycles are needed on a w-bit processor. The addition operations need additional cycles.
Note that in the CIHS technique, the right half of the partial-product summation of the conventional multiplier is performed first, and the carries flowing beyond the s words are saved. In the second loop, the least significant word t0 is multiplied by n0′ to obtain the value of m0. Next, the modulus word n0 is multiplied by m0 and added to t0, which makes the LSBs zero. The multiplications of m0 with n1, n2, etc. and the additions with t1, t2, t3, etc. are carried out in the next few cycles. Simultaneously, the multiplications needed for forming the partial products beyond s words are carried out, and the result is added to the carries obtained and saved in the first step, as well as to the words obtained by multiplying mi with nj. The mi values are computed as soon as the needed information is available. Thus, the CIHS algorithm integrates the multiplication with the addition of mn. For a 4 × 4-word multiplication, the first loop takes 7 cycles and the second loop takes 19 cycles. The reader may refer to [13] for a complete description of the operation.
In the FIPS algorithm also, the computations of a·b and m·n are interleaved. There are two loops. The first loop computes one part of the product a·b and then adds m·n to it. Each iteration of the inner loop executes two multiply-accumulate operations of the form a × b + S, i.e. the products aj·b(i−j) and mj·n(i−j) are added to a cumulative sum. The cumulative sum is stored in three single-precision words t[0], t[1] and t[2], where the triple (t[0], t[1], t[2]) represents t[2]·2^(2w) + t[1]·2^w + t[0]. These registers are thus used as a partial-product accumulator for the products a·b and m·n. This loop computes the words of m using n0′ and then adds the least significant word of m·n to t. The second loop completes the computation by forming the final result u word by word in the memory space of m.
Walter [14] has suggested a technique for computing (A·B·r^(−n)) mod M, where A < 2M, B < 2M and 2M < r^(n−1), r being the radix with r ≥ 2, so that S < 2M for all possible outputs S. (Note that n is an upper bound on the number of digits in A, B and M; note also that a(n−1) = 0.) Each step computes S = (S + ai·B + qi·M) div r, where qi = ((s0 + ai·b0)·((−1/m0) mod r)) mod r. It can be verified that S < (M + B) until the last but one step; thus, the final output is bounded by S < 2M. Note that in the last step of an exponentiation, a multiplication by 1 is needed to remove the scaling by 2^n mod M, and a Montgomery step can achieve this. In that step, S·r^n = Ã + Q·M with Q at most r^n − 1, where Ã = (A·r^n) mod M is the Montgomery form of A. Therefore, since Ã < 2M, we have S·r^n < (r^n + 1)·M and hence S ≤ M, needing no final subtraction. The advantage here is that the cycle time is independent of the radix.
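Walter's recurrence can be checked with a small radix-r sketch (Python; the radix, digit count and modulus are illustrative choices satisfying 2M < r^(n−1)):

```python
def walter_montgomery(A, B, M, r, n):
    """Radix-r Montgomery multiplication: S = (S + a_i*B + q_i*M) div r with
    q_i = ((s_0 + a_i*b_0) * (-m_0^(-1))) mod r; returns A*B*r^(-n) mod M
    up to one multiple of M (S < 2M)."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r      # (-1/m_0) mod r, precomputed
    b0 = B % r
    S = 0
    for i in range(n):
        a_i = (A // r**i) % r                  # i-th base-r digit of A
        q = ((S % r + a_i * b0) * neg_m0_inv) % r
        S = (S + a_i * B + q * M) // r         # division by r is exact
    return S

r, n = 4, 16                                   # radix and digit count
M = 2_000_003                                  # odd modulus, 2M < r^(n-1)
A, B = 1_234_567, 1_999_999
S = walter_montgomery(A, B, M, r, n)
assert S < 2 * M
assert S % M == (A * B * pow(r**n, -1, M)) % M
```

The choice of q makes S + ai·B + qi·M ≡ 0 (mod r), so the division by r is exact in every step.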
Orup [15] has suggested a technique for avoiding the modulo multiplication needed to obtain the q value in high-radix Montgomery modulo multiplication. Orup suggests scaling the modulus M to M̃ = M·M′, where M′ = (−1/M) mod 2^k, considering radix 2^k, so that q is obtained as qi = (Si + bi·A) mod 2^k, since (−1/M̃) mod 2^k = 1. Thus, only (bi·A) mod 2^k needs to be added to the k LSBs of Si:

S(i+1) = (Si + qi·M̃ + bi·A) div 2^k     (10.2a)

This leads to M̃ with a dynamic range greater than that of the original modulus M by at most k bits. The addition operation in the determination of the quotient q can also be avoided by replacing A by 2^k·A. Then the expression qi = (Si + bi·A) mod 2^k becomes qi = Si mod 2^k, and in the update of S(i+1) we have

S(i+1) = (Si + qi·M̃) div 2^k + bi·A     (10.2b)

The number of iterations is increased by one to compensate for the extra factor 2^k.
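A sketch of the first variant (10.2a) with the scaled modulus M̃ (Python; the radix, modulus and iteration count are illustrative assumptions, with enough iterations to consume all digits of B):

```python
def orup_montgomery(A, B, M, k, iters):
    """Montgomery multiplication with Orup's scaled modulus M~ = M*M',
    M' = (-M^(-1)) mod 2^k: since M~ = -1 mod 2^k, the quotient digit
    q_i = (S_i + b_i*A) mod 2^k needs no extra multiplication."""
    Mp = (-pow(M, -1, 1 << k)) % (1 << k)
    Mt = M * Mp                                   # scaled modulus M~
    S = 0
    for i in range(iters):
        b_i = (B >> (i * k)) & ((1 << k) - 1)     # i-th radix-2^k digit of B
        q = (S + b_i * A) & ((1 << k) - 1)        # quotient digit, (10.2a)
        S = (S + q * Mt + b_i * A) >> k           # exact division by 2^k
    return S

k, iters = 8, 4                                   # radix 2^8 (illustrative)
M = 0xE3F5C7                                      # odd modulus (illustrative)
A, B = 123456, 654321                             # B < 2^(k*iters)
S = orup_montgomery(A, B, M, k, iters)
assert S % M == (A * B * pow(1 << (k * iters), -1, M)) % M
```

Because M̃ ≡ −1 (mod 2^k), the k low bits of S + q·M̃ + bi·A vanish by construction and the right shift loses no information.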
McIvor et al. [16] have suggested a Montgomery modular multiplication (A·B·2^(−k) mod M) modification using 5:2 and 4:2 carry-save adders. Note that A, B and S are considered to be in carry-save form, denoted by the vectors A1, A2, B1, B2, S1 and S2. Specifically, the qi determination and the estimation of the sum S are based on the following equations:

qi = (S1[i]0 + S2[i]0 + Ai·(B1,0 + B2,0)) mod 2     (10.3a)

and

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + Ai·(B1 + B2) + qi·M) div 2     (10.3b)

Note that S1,0 = 0 and S2,0 = 0. In other words, the SUM is in redundant or carry-save form (CSR). The second step uses a 5:2 CSA. In an alternate algorithm, the qi computation is the same as in (10.3a) but a 4:2 CSA suffices. For the four cases of (Ai, qi) we have the following expressions:

Ai = 0, qi = 0:

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + 0 + 0) div 2     (10.4a)

Ai = 1, qi = 0:

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + B1 + B2) div 2     (10.4b)

Ai = 0, qi = 1:

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + M + 0) div 2     (10.4c)

Ai = 1, qi = 1:

S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + D1 + D2) div 2     (10.4d)

where D1, D2 = CSR(B1 + B2 + M + 0) is pre-computed.



The advantage of this technique is that lengthy and costly conventional additions are avoided, thereby reducing the critical path. Only (n + 1) cycles are needed in the case of (10.3a) and (10.3b), and (n + 2) cycles in the case of (10.4a)–(10.4d). The critical path in the case of (10.3a) and (10.3b) is 3Δ(FA) + 2Δ(XOR) + Δ(AND), whereas in the case of (10.4a)–(10.4d) it is 2Δ(FA) + Δ(4:1 MUX) + 2Δ(XOR) + Δ(AND). Note that k steps are needed, where k is the number of bits in M, A and B.
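The carry-save recurrences (10.3a)–(10.3b) can be simulated in software as below (Python; B is kept in trivial carry-save form B1 = B, B2 = 0, and the final carry-save-to-binary conversion with a conditional subtraction is an added assumption to return a reduced result):

```python
def csa(x, y, z):
    """3:2 carry-save adder on integers: sum word and shifted carry word."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def mcivor_montgomery(A, B, M, n):
    """Radix-2 Montgomery multiplication A*B*2^(-n) mod M with the running
    sum kept in carry-save form (S1, S2), per (10.3a)-(10.3b)."""
    B1, B2 = B, 0                        # B in (trivial) carry-save form
    S1, S2 = 0, 0
    for i in range(n):
        a_i = (A >> i) & 1
        q = (S1 + S2 + a_i * (B1 + B2)) & 1        # (10.3a)
        t1, t2 = csa(S1, S2, a_i * B1)             # 5:2 compression of the
        t3, t4 = csa(t1, t2, a_i * B2)             # five operands of (10.3b)
        S1, S2 = csa(t3, t4, q * M)
        S1, S2 = S1 >> 1, S2 >> 1                  # exact: both LSBs are 0
    S = S1 + S2                          # convert from carry-save form
    return S - M if S >= M else S

n, M = 24, 0xB12FD3                      # 24-bit odd modulus (illustrative)
A, B = 0x9A5F21, 0x7C33E5
S = mcivor_montgomery(A, B, M, n)
assert S == (A * B * pow(1 << n, -1, M)) % M
```

The choice of q makes the five-operand total even, and since the carry words always have a zero LSB, both halves can be shifted right without losing bits.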
Nedjah and Mourelle [17] have described three hardware architectures for binary Montgomery multiplication and exponentiation. The sequential architecture uses two Systolic Array Montgomery Modular Multipliers (SAMMMs) to perform multiplication followed by squaring, whereas the parallel architecture uses two systolic modular multipliers in parallel to perform squaring and multiplication (see Figure 10.2a, b).
In the sequential architecture, the two SAMMMs each need five registers. The controller controls the number of iterations needed, depending on the exponent. Note, however, that one of the multipliers is not necessary. In the parallel architecture, the hardware cost is higher, since multiplication and squaring use different hardware blocks, and eight registers are needed. The systolic linear architecture using m e-PEs (E-cells) shown in Figure 10.2c, where m is the number of bits in M, contains two SAMMMs, one of which performs squaring while the other performs multiplication. These e-PEs together perform left-to-right binary modular exponentiation.

Note that a front-end and a back-end SAMMM are needed to do the pre-computation and post-computation required by the Montgomery algorithm (see Figure 10.2d). The front-end multiplies the operands by 2^(2n) mod M, and the post-Montgomery multiplication multiplies by '1' to remove the factor 2^n from the result. The basic PE realizes the Montgomery step of computing R + ai·B + qi·M, where qi = (r0 + ai·b0) mod 2. Note that, depending on ai and qi, four possibilities exist:
(1) ai = 1, qi = 1: add M + B; (2) ai = 1, qi = 0: add B; (3) ai = 0, qi = 1: add M; (4) ai = 0, qi = 0: no addition.
The authors suggest pre-computing M + B only once, denoting it MB. Thus, using a 4:1 multiplexer, either MB, B, M or 0 is selected to be added to R. This reduces the cell hardware to a full-adder, a 4:1 MUX and a few gates to control the 4:1 MUX. Some of the cells on the border can be simplified (see Figure 10.2e showing the systolic architecture, which uses the general PE of Figure 10.2f). The authors show that sequential exponentiation needs the least area, whereas systolic exponentiation needs the most; sequential exponentiation takes the longest time, whereas systolic exponentiation takes the least computation time. The AT (area-time) product is lowest for the systolic implementation and highest for the parallel implementation.
Shieh et al. [18] have described a new algorithm for Montgomery modular multiplication. They extend the technique of Yang et al. [19], in which A·B is first computed. This 2k-bit product is split as MH·2^k + ML. Hence (A·B)·2^(−k) mod N = (MH + (ML)·2^(−k)) mod N. The second part can be computed and the result added to MH to
Figure 10.2 (a) Parallel, (b) sequential and (c) systolic linear architectures for the Montgomery multiplier; (d) architecture of the exponentiator; (e) systolic architecture and (f) basic PE architecture (adapted from [17] ©IEEE 2006)

obtain the result. Denoting ML(i) as the i-th bit of ML, the reduction process in each step of the Montgomery algorithm finds

q(i) = (S + ML(i)) mod 2,  S = (S + ML(i) + q(i)·N)/2     (10.5a)

iteratively for i = 0, . . ., k − 1. The algorithm starts from the LSB of ML. Yang et al. [19] represent q(i) and S in carry-save form as

q(i) = (SC + SS + ML(i)) mod 2,  (SC, SS) = (SC + SS + ML(i) + q(i)·N)/2     (10.5b)

Shieh et al. [18] suggest computing S as

S = (S + ML(i) + 2q(i)·N′)/2     (10.5c)

where N′ = (N + 1)/2. The advantage is that 2q(i)·N′ and ML(i) can be concatenated into a single operand, thus decreasing the number of terms to be added to 3 instead of 4. The authors also suggest "quotient pipelining", deferring the use of the computed q(i) to the next iteration. Thus, we need to modify (10.5c) as

em-1 e1 T1M
c e0

p(m-1) p(2) p(1) p(0)

R(m) R(m-1) R(2) R(1) R(0)

E-cellm-1 E-cell1 E-cell0


M M M M

d E

one twon T

TE mod M Two2n
2nR 2 nT
SAMMMK Exponentiator 2 nT SAMMM0
M
M

Figure 10.2 (continued)


274 10 RNS in Cryptography




S = (S + ML(i) + 2q(i − 1)·N″)/2     (10.5d)

where N″ = N′/2 if N′[0] = 0 and N″ = (N′ + N)/2 if N′[0] = 1. Since N′ = (N + 1)/2, we have N″ = (N + 1)/4 or (3N + 1)/4, depending on the LSB of N′. (Note that [i] stands for the i-th bit.) This technique needs extension of A, B and S by two bits (0 ≤ A, B, S < 4N), and (A·B·2^(−(k+4))) mod N is computed. The advantage of using (10.5c) and (10.5d) is that these two computations can be performed in a pipelined fashion.
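As a quick check of the N′ = (N + 1)/2 trick of (10.5c), the following Python sketch reduces ML bit-serially and combines the result with MH (the word length and sample modulus are illustrative assumptions):

```python
def shieh_reduce(ML, N, k):
    """Bit-serial reduction of M_L per (10.5c): the term 2*q(i)*N' with
    N' = (N+1)/2 replaces q(i)*N, so that 2q(i)N' and M_L(i) can be
    concatenated into a single operand in hardware."""
    Np = (N + 1) // 2                      # N' = (N+1)/2, N odd
    S = 0
    for i in range(k):
        ML_i = (ML >> i) & 1
        q = (S + ML_i) & 1                 # q(i) = (S + M_L(i)) mod 2
        S = (S + ML_i + 2 * q * Np) // 2
    return S

def montgomery_via_split(A, B, N, k):
    """(A*B)*2^(-k) mod N via the split A*B = M_H*2^k + M_L."""
    P = A * B
    MH, ML = P >> k, P & ((1 << k) - 1)
    return (MH + shieh_reduce(ML, N, k)) % N

k, N = 20, 999_983                         # odd modulus < 2^k (illustrative)
A, B = 123_456, 654_321
assert montgomery_via_split(A, B, N, k) == (A * B * pow(1 << k, -1, N)) % N
```

The floor division in each step discards exactly the q(i) bit contributed by 2q(i)N′ = q(i)(N + 1), so the recurrence matches the classical step S = (S + ML(i) + q(i)N)/2.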
The authors have also shown that the partial-product addition and the modulo reduction can be merged into one step, adding two more iterations. This method also needs extension of B by four bits. The output is S = (A·B)·2^(−(k+4)) mod N. In this case, the loop computes

M = M/2 + A·B(i),  ML(i) = M mod 2,  S = S/2 + ML(i − 1) + 2q(i − 2)·N″,  q(i − 1) = S mod 2     (10.6)

The authors have described an array architecture comprising (k − 1) PE cells. Each PE has one PPA (partial-product addition) unit and one MRE (modulo reduction) unit. They realize these in carry-save form, denoting M = (Mc, Ms) and S = (Sc, Ss). They show that the critical path is affected by only one full-adder.
Word-based Montgomery modular multiplication has also been considered in the literature [11, 20–28]. In the MWR2MM (multiple-word radix-2 Montgomery multiplication) algorithm for computing X·Y·2^(−n) mod M due to Tenca and Koc [20], Y and M are considered to be split into e w-bit words. M, Y and S are extended to (e + 1) words by a most significant zero word: M = (0, M(e−1), . . ., M(1),

S := 0
for i = 0 to n − 1
    (Ca, S(0)) := xi·Y(0) + S(0)
    if S0(0) = 1 then
        (Cb, S(0)) := S(0) + M(0)
        for j = 1 to e
            (Ca, S(j)) := Ca + xi·Y(j) + S(j)
            (Cb, S(j)) := Cb + M(j) + S(j)
            S(j−1) := (S0(j), S(j−1)_(w−1..1))
        end for
    else
        for j = 1 to e
            (Ca, S(j)) := Ca + xi·Y(j) + S(j)
            S(j−1) := (S0(j), S(j−1)_(w−1..1))
        end for
    end if
    S(e) := 0
end for

Figure 10.3 Pseudocode of the MWR2MM algorithm (adapted from [20] ©IEEE2003)

M(0)), Y = (0, Y(e−1), . . ., Y(1), Y(0)) and S = (0, S(e−1), . . ., S(1), S(0)). The algorithm is given as pseudocode in Figure 10.3. The arithmetic is performed in w-bit precision. Based on the value of xi, xi·Y(0) + S(0) is computed and, if its LSB is 1, M is added so that the LSB becomes zero. A shift-right operation must be performed in each of the inner loops; a shifted S(j−1) word is available only when the LSB of the new S(j) has been obtained. Basically, the algorithm has two steps: (a) add one word from each of the vectors S, xi·Y and M (the addition of M depending on a test) and (b) a one-bit right shift of an S word. An architecture is shown in Figure 10.4a containing a pipelined kernel of p w-bit PEs (processing elements), for a total of wp bit cells. In one kernel cycle, p bits of X are processed. Hence, k = n/p kernel cycles are needed to do the entire computation. Each PE contains two w-bit adders, two banks of w AND gates to conditionally add xi·Y(j) and to add "odd" M(j) to S(j), and registers to hold the results (see Figure 10.4b). (Note that "odd" is true if the LSB of S is '1'.) Note that S is renamed here as Z, and Z is stored in carry-save redundant form. A PE must wait two cycles to kick off after its predecessor, until Z0 is available, because Z1 must first be computed and shifted. Note that the FIFO needs to store the results of each PE in carry-save redundant form, requiring 2w bits per entry. These need to be stored until PE1 becomes available again.
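A software rendering of the MWR2MM loop of Figure 10.3 can be sketched as follows (Python; the parameters w, e, n and the sample modulus are illustrative assumptions, the top word is set to the shifted value here, and the final conditional subtraction that returns a fully reduced result is an added assumption):

```python
def mwr2mm(X, Y, M, n, w, e):
    """Multiple-word radix-2 Montgomery multiplication: bit-serial in X,
    word-serial (w-bit words, e words plus a zero word) in Y, M and S."""
    mask = (1 << w) - 1
    Yw = [(Y >> (j * w)) & mask for j in range(e + 1)]
    Mw = [(M >> (j * w)) & mask for j in range(e + 1)]
    S = [0] * (e + 1)
    for i in range(n):
        x_i = (X >> i) & 1
        ca, s_prev = divmod(x_i * Yw[0] + S[0], 1 << w)
        odd = s_prev & 1
        cb = 0
        if odd:                            # add M so that the LSB becomes 0
            cb, s_prev = divmod(s_prev + Mw[0], 1 << w)
        for j in range(1, e + 1):
            ca, s_j = divmod(ca + x_i * Yw[j] + S[j], 1 << w)
            if odd:
                cb, s_j = divmod(cb + Mw[j] + s_j, 1 << w)
            # one-bit right shift: LSB of word j becomes MSB of word j-1
            S[j - 1] = ((s_j & 1) << (w - 1)) | (s_prev >> 1)
            s_prev = s_j
        S[e] = s_prev >> 1
    val = sum(S[j] << (j * w) for j in range(e + 1))
    return val - M if val >= M else val

w, n = 8, 20
M = 999_983                                # odd 20-bit modulus, e = 3 words
e = -(-M.bit_length() // w)
X, Y = 123_456, 654_321
assert mwr2mm(X, Y, M, n, w, e) == (X * Y * pow(1 << n, -1, M)) % M
```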
A pipeline diagram of the Tenca-Koc architecture [20] is shown in Figure 10.5a for two cases: (a) case 1, e > 2p − 1 with e = 4 and p = 2, and (b) case 2, e ≤ 2p − 1 with e = 4 and p = 4, indicating which bits are processed in each cycle. There are two

Figure 10.4 (a) Scalable Montgomery multiplier architecture and (b) schematic of the PE (adapted from [21] ©IEEE 2005)

dependencies for PE1 to begin a kernel cycle, indicated by the gray arrows: PE1 must be finished with the previous cycle, and the Z(w−1:0) result of the previous kernel cycle must be available at PE p. Assuming a two-cycle latency to bypass the result from PE p, to account for the FIFO and routing, the computation time in clock cycles is

k(e + 1) + 2(p − 1)   if e > 2p − 1 (case I)
k(2p + 1) + e − 2     if e ≤ 2p − 1 (case II)     (10.7)

The first case corresponds to a large number of words. Each kernel cycle needs e + 1 clock cycles for the first PE to handle one bit of X. The output of PE p must be queued until the first PE is ready again. There are k kernel cycles. Finally, 2(p − 1) cycles are required for the subsequent PEs to complete the last kernel cycle.
The second case corresponds to a small number of words. Each kernel cycle takes 2p clock cycles before the final PE produces its first word and one more cycle to bypass the result back; k kernel cycles are needed. Finally, e − 2 cycles are needed to obtain the more significant words at the end of the first kernel cycle.
The Harris et al. [21] case is presented for comparison in Figure 10.5b. Harris et al. [21] have suggested that the results be stored in the FIFO in non-redundant form to save FIFO area, requiring only w bits for each entry instead of the 2w bits in [20]. They also note that, instead of waiting for the LSBs of the previous word to be shifted right, M and Y can be left-shifted, thus saving a latency of one clock cycle. This means that as soon as the LSB of Z is available, we can start the next step for another xi. The authors have considered the cases e = p and e > p (the number of PEs p equal to or less than the number of words e). Note that in this case, (10.7) changes to

(k + 1)(e + 1) + p − 2   if e > p (case I)
k(p + 1) + 2e − 1        if e ≤ p (case II)     (10.8)

Kelley and Harris [22] have extended the Tenca-Koc algorithm to high radix 2^v using a w × v-bit multiplier. They have also suggested using Orup's technique [15] for
Figure 10.5 Pipeline diagrams corresponding to (a) the Tenca and Koc technique and (b) the Harris et al. technique (adapted from [21] ©IEEE2005)

b Case 1: e>p; e=4,p=2 Case 2: e≤p; e=4, p = 4


PE1 PE2 PE1 PE2 PE3 PE4
1 Xo MYw-1:0 Xo MYw-1:0
Zw-2:1 Zw-2:1
Kernel cycle 1

2 Xo MY2w-1:w Xo MY2w-1:w X1 MYw-1:0


Z2w-2:w-1 Z2w-2:w-1 Zw-2:1
3 Xo MY3w-1:2w X1 MYw-1:0 Xo MY3w-1:2w X1 MY2w-1:w X2 MYw-1:0
Z3w-2:2w-1 Zw-2:1 Z3w-2:2w-1 Z2w-2:w-1 Zw-2:1
4 Xo MY4w-1:3w X1 MY2w-1:w
Xo MY4w-1:3w X1 MY3w-1:2w X2 MY2w-1:w X3 MYw-1:0

Kernel cycle 1
Z4w-2:3w-1 Z2w-2:w-1
5 Xo MY5w-1:4w X1 MY3w-1:2w Z4w-2:3w-1 Z3w-2:2w-1 Z2w-2:w-1 Zw-2:1
Z-5w-2:4w-1 Z3w-2:2w-1 Xo MY5w-1:4w X1 MY4w-1:3w X2 MY3w-1:2w X3 MY2w-1:w
Z-5w-2:4w-1 Z4w-2:3w-1 Z3w-2:2w-1 Z2w-2:w-1
6 X2 MYw-1:0 X1 MY4w-1:3w
Zw-2:1 Z4w-2:3w-1 Kernel Stall X1 MY5w-1:4w X2 MY4w-1:3w X3 MY3w-1:2w
Z-5w-2:4w-1 Z4w-2:3w-1 Z3w-2:2w-1
7 X2 MY2w-1:w X1 MY5w-1:4w
Z2w-2:w-1 Z-5w-2:4w-1 X4 MYw-1:0 X2 MY5w-1:4w X3 MY4w-1:3w
8 X2 MY3w-1:2w X3 MYw-1:0 Zw-2:1 Z-5w-2:4w-1 Z4w-2:3w-1
Kernel cycle 2

Z3w-2:2w-1 Zw-2:1 X4 MY2w-1:w X5 MYw-1:0 X3 MY5w-1:4w


9 X2 MY4w-1:3w X3 MY2w-1:w Z2w-2:w-1 Zw-2:1 Z-5w-2:4w-1
Z4w-2:3w-1 Z2w-2:w-1 X4 MY3w-1:2w X5 MY2w-1:w X6 MYw-1:0
10 X2 MY5w-1:4w X3 MY3w-1:2w Z3w-2:2w-1 Z2w-2:w-1 Zw-2:1
Kernel cycle 2

Z-5w-2:4w-1 Z3w-2:2w-1
X4 MY4w-1:3w X5 MY3w-1:2w X6 MY2w-1:w X7 MYw-1:0
11 X3 MY4w-1:3w Z4w-2:3w-1 Z3w-2:2w-1 Z2w-2:w-1 Zw-2:1
Z4w-2:3w-1 X4 MY5w-1:4w X5 MY4w-1:3w X6 MY3w-1:2w X7 MY2w-1:w
12 X3 MY5w-1:4w Z-5w-2:4w-1 Z4w-2:3w-1 Z3w-2:2w-1 Z2w-2:w-1
Z-5w-2:4w-1
X5 MY5w-1:4w X6 X6 X7 MY3w-1:2w
Z-5w-2:4w-1 X6 Z3w-2:2w-1
X6 MY5w-1:4w X7 MY4w-1:3w
Z-5w-2:4w-1 Z4w-2:3w-1
X7 MY5w-1:4w
Z-5w-2:4w-1

Figure 10.5 (continued)

avoiding the multiplication in computing q by scaling the modulus, and also by pre-scaling X by 2^v to allow the multiplications needed for computing q·M + xi·Y to occur in parallel.
Jiang and Harris [23] have extended the Harris et al. [21] radix-2 design by using a parallel modification of the Montgomery algorithm. Here, the computation done is Z = Z div 2 + q·M̂ + Xi·Y, where q = Z mod 2, M′ is such that R·R′ − M·M′ = 1 with R = 2^n, and M̂ = ((M′ mod 2)·M + 1)/2. Note that the parallel multiplication needs just ANDing.
Pinckney and Harris [24] and Kelly and Harris [25] have described a radix-4 parallelized design which left-shifts the operands and parallelizes the multiplications within the PE. Note that pre-computed values of 3M̂ and 3Y are employed, where M̂ = ((M′ mod 2^2)·M + 1)/2^2. Orup's technique [15] has been used to avoid multiplications.
Huang et al. [26] have suggested modifications of the Tenca-Koc algorithm to perform Montgomery multiplication in n clock cycles. In order to achieve this, they suggest pre-computing the partial results using the two possible assumptions for the MSB of the previous word. PE1 can take the w − 1 MSBs of S(0) (i = 0) from PE0 at the beginning of clock 1, do a right shift, prepend both 1 and 0 based on the two different assumptions about the MSB of this word, and compute S(1) (i = 1). At the beginning of clock cycle 2, since the correct bit is available as the LSB of S(1) (i = 0), one of the two pre-computed versions of S(0) (i = 1) is chosen. Since the w − 1 LSBs are the same, the parallel hardware can share the same LSB adding hardware, and with small additional adders the other portions can be handled. The same pattern of computation repeats in the subsequent clock cycles. Thus, the resource requirement is only marginally increased. The computation time in clock cycles is T = n + e − 1 if e ≤ p, and T = n + k(e − p) + e − 1 otherwise, where k = ⌊n/p⌋.
In another technique, each PE performs the complete computation of a specific word of S, while all PEs scan different bits of the operand X at the same time. The data dependency graphs of both these cases are presented in Figure 10.6a, b. Note, however, that the second architecture has a fixed size (i.e. e PEs, which cannot be reduced). The first technique has been shown to outperform the Tenca-Koc design by about 23 % in terms of the product of latency and area when implemented on FPGAs. The second technique achieves an improvement of 50 %.
The authors have also described a high-radix implementation [26] while preserving the speed-up factor of two over the corresponding technique of Tenca and Koc [20]. For example, considering radix-4, two bits are scanned at a time, taking ((n/2) + e − 1) clock cycles to produce an n-bit Montgomery multiplication. The multiplication by 3 that is needed can be done on the fly, or avoided by using Booth's algorithm, which needs to handle negative operands [26].
Shieh and Lin [27] have suggested rewriting the recurrent equations of the MM algorithm

qi = (Si + A·Bi) mod 2
Si+1 = (Si + A·Bi + qi·N)/2                (10.9)

as

qi = (SRi + (SMi−1)/2 + A·Bi) mod 2
SRi+1 + SMi+1 = (SRi + (SMi−1)/2 + A·Bi + qi·N)/2                (10.10)

with SR0 = SM0 = SM−1 = 0, for i = 0, . . ., (k − 1). Note that A, B and N are k-bit words. This helps in deferring the accumulation of the MSB of each word of the intermediate result to the next iteration of the algorithm. Note that the intermediate result Si in (10.9) is decomposed into two parts, SM and SR. The word SM contains only the MSB of each word followed by zeroes, and the word SR comprises the remaining LSBs. They also observe that in (10.10) the number of terms can be reduced to three, taking advantage of the several zero bits in SRi and (SMi−1)/2. Further, by considering A as two words AP and AR (for example, for W = 4, AP = 0a10000a60000a200 and AR = a10a9a8a70a5a4a30a1a0), (10.10) gets changed to
10.2 Montgomery Modular Multiplication 281


Figure 10.6 Data dependency graphs of (a) optimized architecture and (b) alternative architec-
ture of MWR2MM algorithm (adapted from [26] ©IEEE2011)

 
qi = (SR′i + 2AP·Bi+1 + (SM′i−1)/2 + AR·Bi) mod 2 = (OP1i + OP2i) mod 2

SR′i+1 + SM′i+1 = (SR′i + 2AP·Bi+1 + (SM′i−1)/2 + AR·Bi + qi·N)/2
                = (OP1i + OP2i + OP3i)/2                (10.11)

where OP1i = SR′i + 2AP·Bi+1, OP2i = (SM′i−1)/2 + AR·Bi and OP3i = qi·N.

The pair (SR′, SM′) is used in place of (SR, SM) since the value of the intermediate result changes due to the rearrangement of the operands to be added in each iteration. Note that a post-processing operation Sk = SR′k + SM′k + (SM′k−1)/2 is required to obtain the final result. Thus, the data dependency between the MSB of Sj in the (i + 1)th iteration and Sj+1 in the ith iteration can be relaxed using this technique. The reader is referred to [27] for more information on the implementation.
Knezevic et al. [6] have described Barrett and Montgomery modulo reduction techniques for special moduli, namely the generalized Mersenne primes used in elliptic curve cryptography (ECC). They have presented an interleaved modulo multiplier suitable for both Barrett and Montgomery reduction and observe that the Montgomery technique is faster. The algorithm for Montgomery reduction is presented in Figure 10.7a. Two unified architectures for interleaved Barrett and Montgomery reduction are presented in Figure 10.7b, c for the classical technique and the modified technique, respectively. In these, the blocks π1 and π2 are multiple-precision multipliers which perform the multiplications in lines 3 and 5 of the flow charts in Figures 10.1 and 10.7a, whereas π3 is a single-precision multiplier used to perform the multiplication in step 4. An additional adder Σ is also required.

In the case of Barrett reduction, the pre-computed value μ is λ = w + 4 bits long, whereas in the case of the Montgomery algorithm, the pre-computed value M′ is λ = w bits long. In the case of Barrett reduction, π2 uses the most significant λ bits of the product calculated by π3, whereas in the case of the Montgomery algorithm, it uses the least significant λ bits of the same product. The authors show that an improvement in speed of about 50 % can be achieved using a digit size of 32 bits and a modulus of 192–512 bits. In Montgomery reduction, the processing of Y starts from right to left, whereas in the case of Barrett reduction it starts from left to right. The authors also observe that in the case of a special structure for the modulus M (see (10.1c)), the architecture can be simplified as shown in Figure 10.7c, needing only two multipliers. This does not need any multiplication with pre-computed values. The critical path, shown in bold in Figure 10.7c, is reduced to one multiplier and one adder only.
Miyamoto et al. [28] have explored the full design space of RSA processors using high-radix Montgomery multipliers. They have considered four aspects: (a) algorithm design, (b) radix design, (c) architecture design and (d) arithmetic

Figure 10.7 (a) Algorithms for Modulo multipliers using Montgomery and Barrett Reduction
(b) original architecture and (c) modified architecture (adapted from [10.8] ©IEEE2010)

Figure 10.7 (continued)

component design. They have considered four types of exponentiation algorithms, based on two variants of the binary method, (a) the left-to-right binary method and (b) the square-multiply exponentiation method, which have different resistances to simple power analysis (SPA)-based attacks, each with and without the use of CRT.

The square-multiply exponentiation method starts at the LSB and works upwards. The left-to-right binary method requires lower hardware resources. The m-ary window method reduces the number of multiplication operations using 2^(m−1) pre-computed values, but more memory resources are needed. Hence the authors use the left-to-right binary method. The square-and-multiply method has been used since it prevents SPA attacks by performing dummy operations even for zero bits of the exponent. CRT also reduces the clock cycles by almost ¾. This, however, requires extra hardware for pre-processing and post-processing.
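The SPA countermeasure described above — a dummy multiplication on every zero exponent bit, so that each bit costs one squaring and one multiplication regardless of its value — can be sketched as a generic left-to-right loop (this is an illustrative square-and-multiply-always routine, not the authors' exact data path):

```python
def ltr_multiply_always(base, exp, mod):
    """Left-to-right binary exponentiation with a dummy multiply on zero bits,
    giving a uniform square-multiply rhythm (SPA countermeasure)."""
    result, dummy = 1, 1
    for i in range(exp.bit_length() - 1, -1, -1):
        result = (result * result) % mod       # squaring on every bit
        if (exp >> i) & 1:
            result = (result * base) % mod     # real multiplication
        else:
            dummy = (dummy * base) % mod       # dummy multiplication, discarded
    return result
```

The power trace thus shows the same square-multiply pattern for every bit, at the cost of one wasted multiplication per zero bit.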
First, an algorithm for modulo exponentiation is selected considering the trade-off between RSA computation time and tamper resistance. Next, the authors suggest that the radix be chosen. Circuit area and delay time increase exponentially with the radix; however, area and time increase in different ways, and the decrease in the number of cycles may compensate for the increase in the critical path. Data path architectures of three types have been considered, which keep the intermediate results in (1) single form (type I), (2) semi-carry-save form (type II) and (3) carry-save form (type III). The authors have observed that 85, 73 and 84 different variants of the type I, II and III data path architectures, respectively, are possible. The RSA time is largest for type I and least for type III, whereas the area is least for type I and largest for type III.
All three algorithms (types I, II and III) are presented in Figure 10.8a–c for completeness. We wish to compute Z = (X·Y·2^(−rm)) mod N, where X, Y, Z, N are k-bit integers. These k-bit operands are considered as m blocks of r bits each. We define w = −N^(−1) mod 2^r and ti = (z0 + xi·y0)·w mod 2^r. The original high-radix Montgomery algorithm needs the computation q = z + xi·yj + ti·nj + c. The authors suggest storing the temporary variable q of 2r bits as zj−1 and c, where c = q/2^r and zj−1 = q mod 2^r.

The computation of q is realized in two steps in type I. In the first step, zj + xi·yj + ca is computed to yield the sum zj and carry ca. The next step computes zj + ti·nj + cb as sum zj−1 and carry cb. A CPA is used to perform the needed summation. In type II, these two steps are modified to preserve the carry-save form for only the intermediate carry. In the first step we compute

zj + xi·yj + cs1a + cs2a + eca = (cs1a + cs2a + eca, zj)                (10.12a)

In the second step, we evaluate

zj + ti·nj + cs1b + cs2b + ecb = (cs1b + cs2b + ecb, zj−1)                (10.12b)

Note that the cs values are r bits wide, whereas ec is a 1-bit carry. The lower r-bit output and the 1-bit ec are produced by the CPA operation, whereas the rest are obtained by partial product addition using a CSA.

Figure 10.8 Montgomery multipliers using (a) single form (Type I), (b) using semi carry-save
form (Type II) and (c) using carry-save form (Type III) (adapted from [28] ©IEEE2011)

In the third algorithm, the carry-save form is used for the intermediate sum and carry, where cs1 and cs2 are intermediate carry signals and zs1 and zs2 are intermediate sum signals. The two steps in this case are modified as

zj + xi·yj + cs1a + cs2a = (cs1a + cs2a, zs1 + zs2)
zs1 + zs2 + ti·nj + cs1b + cs2b + ec = (cs1b + cs2b, zs1 + zs2)                (10.12c)
zs1 + zs2 = (ec, zj−1)

The CPA operation is performed at the end of the inner loop to obtain zj. The third approach needs more steps due to the extra additions.

The computation time of the CPA significantly affects the critical path. In types I and II, the CPA widths are 2r and r respectively, whereas in type III the CPA is not used in every cycle. The numbers of cycles needed for the arithmetic core in types I, II and III are 2m^2 + 4m + 1, 2m^2 + 5m + 1 and 2m^2 + 6m + 2, respectively. The authors have considered a variety of final CPAs based on the Kogge-Stone, Brent-Kung, Han-Carlson and Ladner-Fischer types. The partial product addition also used a variety of algorithms: Dadda tree, 4:2 compressor tree, (7, 3) counter trees and (3, 2) counters. The radix was also varied from 8 to 64 bits.

The authors report results ranging from the smallest area of 861 gates for a type I radix-2^8 processor to the shortest operating time of 0.67 ms at 421.94 MHz for a type III radix-2^128 processor. The highest hardware efficiency (RSA time × area) of 83.12 s-gates was achieved with a type II radix-2^32 processor.

10.3 RNS Montgomery Multiplication and Exponentiation


 
Posch and Posch [29] first suggested Montgomery reduction (ab/M) mod N in RNS, using two RNS, namely RNS1 (base B) and RNS2 (base B′), which have dynamic ranges M and M′ respectively. The algorithm [30] is presented in Figure 10.9a for computing Z = (ab/M) mod N. First, t = ab is computed in both RNS. Next, q = (−t/N) mod M is computed in RNS1. In RNS2, we compute ab + q̂N (where q̂ is obtained by base extension of q from RNS1 to RNS2) and divide it by M. (Note that the inverse of M exists in RNS2 and not in RNS1.) This result is next base extended to RNS1. Evidently, two base extensions are needed: one from RNS1 to RNS2 and another from RNS2 to RNS1.
Posch and Posch have made certain assumptions on RNS1 and RNS2:

N + Δ < M < N + N/3, with Δ < N/6

b
Set1 M Set2 Mʹ
Moduli 3 5 7 11 13 17
a =10 1 0 3 10 10 10
b = 25 1 0 4 3 12 8
a×b=250 1 0 5 Mod (105) 8 3 12 (a×b)mod (2431)
= 250
t 2 2 3
=(-1/37) mod 105
=17
t×(a×b) mod 105 2 0 1 Base Extend 6 11 16 S = t×(a×b) mod 105 =
=50 to M′ 50

4 11 3 N = 37
2 4 14 s×N = 1850
10 7 9 a×b+s×N = 2100
2 1 6 1/M = 1/105
Result = 20 2 0 6 Base 9 7 3 2100/105 = 20
Extension to
M

Figure 10.9 (a) Bajard’s and Posch and Posch algorithm for Montgomery multiplication using
RNS (adapted from [31] ©IEEE2004) (b) an example of Montgomery multiplication using RNS

4N ≤ M′ ≤ (4 + ε_{M′/N})·N, with ε_{M′/N} < 1/12
a, b < N + Δ < N + N/6                (10.13)

Only the base extension algorithm is different. The CRT expansion is approximated as W*int = ⌊Σ_{i=1}^{n} Xi·w*i⌋, where the Xi are the residues, w*i = wi − δi, and δi ≪ 1 denotes the error between the approximation and the actual value wi = |Mi^(−1)|_{mi}/mi. Base extension from RNS1 to RNS2 results in some t* = t or t + M1, according as W*int = Wint or Wint − 1 (approximation off by 1). It can be shown that t* < M1 or M1 + Δ in these two cases respectively. Further, note that N/2 < y < 3N + N/3, where y = (x − ((x·N^(−1)) mod M)·N)·M^(−1) + 2N.
Bajard et al. [30] have described an RNS Montgomery modular multiplication algorithm in which B and N are given in RNS form and A is given in mixed radix form (MRS). In this method only one RNS base is used, and the condition 0 ≤ N ≤ M/(3·max_{i∈(1,...,n)} mi) needs to be satisfied. The algorithm is executed in n steps, where n is the number of moduli. In each step, an MRS digit q′i of a number Q is computed and a new value of R, where R = (AB/M) mod N, is determined using a′i and q′i in RNS, where a′i is the mixed radix digit of A. Next, since R is a multiple of mi, and the moduli are relatively prime, R is multiplied by the multiplicative inverse of mi. This process cannot, however, be carried out in the last step, since mi is not prime to itself. Hence, another RNS must be used for expressing the result and reconstructing the residue after it is lost. Since the result is available in the new RNS base, it needs to be extended to the original base at the end. The authors also suggest another technique where the missing residue is recovered using base extension employing a redundant modulus, following the Shenoy and Kumaresan technique [32]. Note, however, that for systems based on MRC, fully parallel computation cannot be realized, and thus they are slower. An example of Montgomery multiplication using RNS to compute (AB/M) mod N, where A = 10, B = 25, M = 105 and N = 37, is presented in Figure 10.9b.
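The example of Figure 10.9b can be reproduced in a few lines of Python (variable names are mine; at this toy size, exact CRT reconstruction stands in for the base extension steps):

```python
from math import prod

base1 = (3, 5, 7)      # base B,  dynamic range M  = 105
base2 = (11, 13, 17)   # base B', dynamic range M' = 2431
M, N = prod(base1), 37
a, b = 10, 25

def to_rns(x, base):
    return tuple(x % m for m in base)

def from_rns(res, base):           # exact CRT reconstruction (toy-size "base extension")
    Mt = prod(base)
    return sum(r * (Mt // m) * pow(Mt // m, -1, m) for r, m in zip(res, base)) % Mt

t = (-pow(N, -1, M)) % M           # (-1/37) mod 105 = 17
ab1, ab2 = to_rns(a * b, base1), to_rns(a * b, base2)
s_rns = tuple(r * t % m for r, m in zip(ab1, base1))   # s = t*(a*b) mod M, in base B
s = from_rns(s_rns, base1)         # s = 50; extend it into base B'
res2 = tuple((r + s * N) * pow(M, -1, m) % m for r, m in zip(ab2, base2))
result = from_rns(res2, base2)     # (a*b + s*N)/M = 2100/105 = 20
```

Note that the division by M happens as a multiplication by M^(−1), which exists in base B′ but not in base B.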
The Bajard et al. [31] approach using two moduli sets is similar to other techniques [33, 34], but the base extension steps use different algorithms. These, however, have the same complexity as that of Posch and Posch [29]. Note that q shall be extended to base B′ as explained before. This can be obtained by using CRT, but the multiple of M that needs to be subtracted must then be known. The application of CRT in base B yields

q = Σ_{i=1}^{k} σi·Mi − αM                (10.14a)

where M contains k moduli, σi = |qi·Mi^(−1)|_{mi}, and α < k. We need not compute the exact value of q in M but can instead extend q into M′ as

q̂ = q + αM = Σ_{i=1}^{k} σi·Mi                (10.14b)

computed mod mj for j = k + 1, . . ., 2k. We compute next



r̂ = (ab + q̂·N)/M                (10.14c)

Note that the computed value r̂ < M′, and so r̂ has a valid representation in base B′. From (10.14b) and (10.14c), we get

r̂ = ((ab + (q + αM)·N)/M) mod N = (ab·M^(−1)) mod N                (10.14d)

Thus, there is no need to compute α. Instead, q̂ needs to be computed in B′. Once r̂ is estimated, it needs to be extended to B, which can be done using the Shenoy and Kumaresan [32] technique, for which a redundant residue mr is used. It may be noted that since in CRT α < k, q < M and ab < MN, we have q̂ < (k + 1)M and hence r̂ < (k + 2)N < M′. The condition ab < MN implies that if we want to use r̂ in the next step as needed in exponentiation algorithms, say squaring, we obtain the condition (k + 2)^2·N^2 < MN, or (k + 2)^2·N < M. Thus, if N is a 1024-bit number and we use 32-bit moduli, we need a base B of size k ≥ 33.

Note that lines 1, 2 and 4 in the algorithm in Figure 10.9a need 5k modular multiplications. The first and second base extensions (steps 3 and 5) each need (k^2 + 2k) modular multiplications, thus needing overall (2k^2 + 9k) modular multiplications. If the redundant modulus is chosen as a power of 2, the first base extension needs only k^2 + k multiplications.
Kawamura et al. [33, 34] have described a Cox-Rower architecture. They stipulate that a, b < 2N, so that

w = (ab + tN)/M < ((2N)^2 + MN)/M = N·(4N/M + 1) ≤ 2N if 4N ≤ M                (10.15)

In this method, the base extension algorithm is executed in parallel by plural “rower” units controlled by a “cox” unit. Each rower unit is a single-precision modular multiplier-accumulator, whereas the cox unit is typically a 7-bit adder. The algorithm is the same as that in Figure 10.9a except for the base extension steps. Referring to CRT-based RNS to binary conversion, it is clear that the value of α (i.e. the multiple of M) that needs to be subtracted must be determined so that base extension can be carried out as discussed in Chapter 6. Kawamura et al. [34] have suggested computing α in a different way. Noting that

x = Σ_{i=1}^{n} |xi·Mi^(−1)|_{mi}·Mi − αM = Σ_{i=1}^{n} σi·Mi − αM                (10.16a)

where

σi = |xi·Mi^(−1)|_{mi}                (10.16b)

we have

α + x/M = Σ_{i=1}^{n} σi/mi                (10.16c)

Since x < M, Σ_{i=1}^{n} σi/mi lies between α and α + 1. Thus α = ⌊Σ_{i=1}^{n} σi/mi⌋, and 0 ≤ α < n holds. The value of α can be recursively estimated in the “cox” unit by approximating the mi in the denominators by 2^r, in order to avoid division by mi. Note that r is assumed common to all moduli, in spite of the mi being different in general, and one computes

α̂ = ⌊ Σ_{i=1}^{n} trunc(σi)/2^r + α0 ⌋                (10.17)

where trunc(σi) = σi ∧ (1.....10.....0)₂. (The number of ones is q and the number of zeroes is r − q, and ∧ stands for bit-wise AND.) Thus σi is approximated by its most significant q bits as trunc(σi). The parameter α0 is an offset value to account for the error caused by the approximation. Note that α̂ in (10.17) is computed bit by bit, with initial value λ0 = α0, as

λi = λi−1 + trunc(σi)/2^r,  αi = ⌊λi⌋,  λi = λi − αi,  for i = 1, 2, . . ., n                (10.18)

Note that αi is a bit, and if it is 1, the rower unit subtracts M. The error is transferred to the next step, and only in the last step is there a residual error. Kawamura et al. [34] have suggested the use of α0 = 0 and α0 = 0.5 for the first and second base extensions, respectively. Note that n clock cycles are needed to obtain the n values αi. It can be seen that n^2 + 2n modulo multiplications are needed for each base extension, and 5n other modulo multiplications are needed for a complete modulo multiplication. The Cox-Rower architecture is presented in Figure 10.10.
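The cox recursion (10.17)–(10.18) can be sketched as follows (a toy instance with names of my choosing; in the actual design the moduli are chosen close to 2^r, whereas here small moduli happen to work for this particular x):

```python
from math import prod, floor

def cox_alpha(x, base, r, q, alpha0):
    """Estimate alpha = floor(sum sigma_i/m_i) the way the cox unit does:
    keep only the top q of r bits of each sigma_i and divide by 2^r, not m_i."""
    Mt = prod(base)
    trunc_mask = ((1 << q) - 1) << (r - q)   # q ones followed by r - q zeros
    lam, bits = alpha0, []
    for m in base:
        Mi = Mt // m
        sigma = (x % m) * pow(Mi, -1, m) % m # sigma_i = |x_i * M_i^(-1)|_{m_i}
        lam += (sigma & trunc_mask) / 2 ** r
        bit = floor(lam)                     # alpha_i: 0 or 1, one bit per channel
        lam -= bit
        bits.append(bit)
    return sum(bits)

base = (13, 17, 19, 23)
x = 54321                                    # < prod(base) = 96577
sigmas = [(x % m) * pow(prod(base) // m, -1, m) % m for m in base]
exact_alpha = floor(sum(s / m for s, m in zip(sigmas, base)))
approx = cox_alpha(x, base, r=5, q=4, alpha0=0.5)
```

For this instance both the exact sum and the truncated cox estimate give α = 1, i.e. one subtraction of M is flagged during the accumulation.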
Gandino et al. [35] have suggested reorganization allowing pre-computation of
certain constants of the algorithms due to Bajard et al. [31] and Kawamura
et al. [34]. In these algorithms, several multiplications of partial results with
pre-computed values exist. By exploiting the commutative property, a sequence
of multiplications of a partial result by a pre-computed value is replaced by a single
multiplication. In addition, the authors use the commutative, distributive and
associative properties to rearrange the operations in a more effective way.


Figure 10.10 Cox-Rower architecture for Montgomery algorithm in RNS (adapted from [34]
©Eurocrypt2000)

The original RNS MM algorithm and the reorganized RNS MM algorithm are presented in Figure 10.11a, b to illustrate the optimizations that can be carried out. First, the inputs x and y are multiplied by Aj^(−1) and are denoted x̂ and ŷ. Note that the two consecutive steps for computing q from u, by multiplying sequentially with −N^(−1) and Bi^(−1) in the original algorithm, are replaced with a single multiplication of u by (−N^(−1)·Bi^(−1)). Steps 3, 5, 6 and 7 in the original algorithm are moved into the first base extension step, together with the multiplication by Aj that is required to correct the input. There is thus no need to sequentially compute û, u, t, v, w; instead, ŵ can be computed directly. Similar modifications have been suggested for the reorganized first and second base extension algorithms, for both Bajard's technique and Kawamura's technique. Further, they have shown that the exponentiation algorithms can also be reorganized. The reader is urged to refer to their work for more information.
Schinianakis and Stouraitis [36] have suggested the use of MRC for both base extension operations in place of the approximate CRT-based techniques (refer to Jullien's technique of Figure 6.1a in Chapter 6). They have also considered the binary-to-RNS and RNS-to-binary conversions needed at the front and back ends. They have used a radix-2^r representation of the given integer x, and all the binary-to-residue converters for the various moduli use L steps to compute

X_{mi} = | Σ_{j=0}^{L−1} xj·|2^{rj}|_{mi} |_{mi}  for all i.                (10.19)

Figure 10.11 (a, b) The original and reorganized RNS MM algorithms (adapted from [35] ©IEEE2012)

 
The constants |2^{rj}|_{mi} are pre-computed and stored. Thus, L parallel units can convert the given binary word X into L residues in L steps. The MRC technique is used for RNS-to-binary conversion, where each add/multiply unit weighs the mixed radix digit appropriately and computes the result. The general architecture for all these functions for RNS Montgomery multiplication (RMM) is presented in Figure 10.12. They denote the two RNS bases as K and Q and compute (ABQ^(−1)) mod N. They choose the scaling factor as the product of the moduli in the first moduli set. The authors observe that the BE algorithm using MRC needs only (L^2 + L − 2)/2 operations, whereas the techniques due to Bajard et al. [31] and Kawamura et al. [34] need 2L^2 + 3L and 2L^2 + 4L operations, respectively. Consequently, a 1024-bit exponentiation using 33 moduli of 32 bits each and a clock frequency of 100 MHz could work with a throughput of 3 Mb/s in 0.35–0.18 μ CMOS technology, as against earlier methods [34] whose throughput is 890 kb/s.
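An MRC-based base extension of this kind can be sketched as follows (Szabo-Tanaka digit computation, then Horner evaluation of the mixed-radix form modulo each target modulus; function names and the test values are mine):

```python
def mrc_digits(res, base):
    """Mixed-radix digits of the number given by residues res over base,
    so that x = d0 + d1*m0 + d2*m0*m1 + ..."""
    res, digits = list(res), []
    for i, mi in enumerate(base):
        d = res[i]
        digits.append(d)
        for j in range(i + 1, len(base)):     # fold d out of the remaining channels
            res[j] = (res[j] - d) * pow(mi, -1, base[j]) % base[j]
    return digits

def base_extend_mrc(res, src, dst):
    """Extend residues from base src to base dst via the MRC digits."""
    d = mrc_digits(res, src)
    out = []
    for m in dst:
        acc = d[-1] % m                        # Horner: ((d_{n-1}*m_{n-2} + d_{n-2})*... )
        for i in range(len(src) - 2, -1, -1):
            acc = (acc * src[i] + d[i]) % m
        out.append(acc)
    return tuple(out)
```

For example, x = 52 has residues (1, 2, 3) over (3, 5, 7); its mixed-radix digits give 52 mod 11 = 8 and 52 mod 13 = 0 in the target base without ever reconstructing 52 as a wide integer.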
Jie et al. [37] have suggested a reformulation of the Bajard et al. technique [31] which uses pre-computations to reduce the number of steps from 2n^2 + 8n to 2n^2 + 5n, whereas the technique of Kawamura et al. [34] needs 2n^2 + 9n steps. In this approach, a modulus 2^n is used for easy base extension.

Schinianakis and Stouraitis [38] have suggested using MRC following the Yassine and Moore technique [39] discussed in Chapter 5. The moduli of the RNS need to be selected in a proper manner in this method so that the computation is simpler. In

Figure 10.12 A RNS Montgomery multiplication architecture due to Schinianakis (adapted from
[36] ©IEEE2011)

Yassine and Moore's technique, (L − 2) multiplications are needed for one base extension, as compared with the L(L − 1)/2 needed for other techniques. The authors have also unified the hardware to cater for both conventional RNS and polynomial RNS, and show that in the RNS case the number of multiplications can be reduced compared with the use of conventional MRC as in [36]. They have designed the hardware for dual-field addition/subtraction, multiplication, modular reduction and MAC operation to cater for both the fields GF(p) and GF(2^m). They have considered various options, viz., the number of moduli, the use of several MACs (β in number) in parallel, and a selectable radix 2^r.
Ciet et al. [40] have suggested an FPGA implementation of 1024-bit RSA with RNS, following a similar approach to Bajard et al. [31] and Kawamura et al. [34]. They have suggested that the nine moduli needed for each of the bases (RNS moduli sets) can be selected from a pool of generalized Mersenne primes of the form 2^k1 − 2^k2 − 1. Thus C(63, 9)·C(54, 9) possible combinations exist for 58 ≤ k1 ≤ 64, 0 ≤ k2 ≤ (k1 + 1)/2. Ciet et al. [40] have also suggested solutions for signing using the private key d with the RSA algorithm. In these, CRT is considered to be used [41].
Denoting the hash of the message to be signed using private key d as μ, one computes μp = μ mod p and μq = μ mod q. Choosing two random bases as mentioned above, μp and μq can be represented in two RNS bases. In order to avoid differential power analysis (DPA) attacks, the authors suggest adding randomization to both the exponent and the message. Next, μp^D mod N and μq^D mod N can be computed using RNS, followed by CRT to perform the reverse conversion to obtain μ^D mod N. Note that μp^D mod N can be computed as μp^{Dp} mod N, where Dp = D mod (p − 1), and similarly μq^D mod N = μq^{Dq} mod N, where Dq = D mod (q − 1).


Szerwinski and Guneysu [42] have studied the application of graphics processing units (GPUs) to RNS, owing to their inherently parallel architecture. GPUs can effectively run a group of threads, called a warp, in SIMD (single instruction multiple data) fashion. They suggest that CIOS (coarsely integrated operand scanning) [11] is suitable for implementing Montgomery modular multiplication on GPUs. They have reviewed the various base extension techniques considered before: the MRC-based technique due to Szabo and Tanaka, and the CRT-based techniques of Shenoy and Kumaresan [32], Bajard et al. [31] and Kawamura et al. [34], for the first and second base extensions. They conclude that the throughput is maximum when using Bajard's method [31] for the first base extension and the Shenoy and Kumaresan technique [32] for the second base extension. For exponentiation as well, they have shown that RNS yields lower throughput (number of encryptions per second) but lower latency than the CIOS technique. Considering 1024-bit modular exponentiation, they observe 439.8 operations/s and 813 operations/s, and latencies of 144 ms as against 6930 ms, for the RNS and CIOS-based techniques respectively.

10.4 Montgomery Inverse

The Montgomery representation of the modular inverse is b^(−1)·2^n mod a [43]. The first phase of the evaluation computes b^(−1)·2^k mod a, where k is the number of iterations. The output of the first phase is denoted the Almost Montgomery Inverse (AMI). The first phase computes gcd(a, b), where gcd stands for greatest common divisor. The second phase halves this value (k − n) times modulo a and negates the result to yield b^(−1)·2^n mod a.

The pseudocode is presented in Figure 10.13a. Note that u, v, r and s are such that us + vr = a, s ≥ 1, u ≥ 1, 0 ≤ v ≤ b, and that r, s, u, and v are between 0 and 2a − 1. The number of iterations k is such that (a + b)/2 ≤ 2^{k−1} ≤ ab. The following invariants can be verified: br = −u·2^k mod a and bs = v·2^k mod a. An example for a = 17, b = 10 and n = 5 is presented in Figure 10.13b.
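The two phases can be sketched in Python (a direct transcription of the binary-GCD-style iteration, with the correction x = a − r folded into phase 1; names are mine). For a = 17, b = 10, n = 5 it returns b^(−1)·2^5 mod 17 = 10, matching the example of Figure 10.13b.

```python
def mont_inverse(b, a, n):
    """Two-phase Montgomery inverse b^(-1)*2^n mod a (a odd, gcd(a, b) = 1)."""
    u, v, r, s, k = a, b, 0, 1, 0
    while v > 0:                            # phase 1: AlmostMontInv iteration
        if u % 2 == 0:
            u, s = u // 2, 2 * s
        elif v % 2 == 0:
            v, r = v // 2, 2 * r
        elif u > v:
            u, r, s = (u - v) // 2, r + s, 2 * s
        else:
            v, s, r = (v - u) // 2, s + r, 2 * r
        k += 1                              # invariant: b*r = -u*2^k mod a
    if r >= a:
        r -= a
    x = a - r                               # AMI: b*x = 2^k mod a
    for _ in range(k - n):                  # phase 2: halve (k - n) times mod a
        x = x // 2 if x % 2 == 0 else (x + a) // 2
    return x
```

The phase-1 invariant b·r ≡ −u·2^k (mod a) stated above is exactly what makes the correction a − r deliver b^(−1)·2^k mod a once u has reached gcd(a, b) = 1.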
The Montgomery inverse can be effectively used to reduce the number of steps in exponentiation mod a, as needed in RSA and other algorithms. For example, if the exponent is 119 = (1110111)₂, it can be recoded as (1 0 0 0 1̄ 0 0 1̄)₂, where 1̄ has weight −1 (i.e. 119 = 128 − 8 − 1). Thus only three multiplications need to be done instead of 5, whereas the number of squarings is the same in both cases.
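The recoded exponentiation can be checked quickly: a digit of −1 multiplies by the pre-computed b^(−1) mod m (obtained, for instance, with the Montgomery inverse above). The function name and test modulus are illustrative.

```python
def pow_signed_digits(b, digits, m):
    """Left-to-right square-and-multiply over signed binary digits (MSB first);
    a -1 digit multiplies by the precomputed inverse of b mod m."""
    b_inv = pow(b, -1, m)
    acc = 1
    for d in digits:
        acc = acc * acc % m
        if d == 1:
            acc = acc * b % m
        elif d == -1:
            acc = acc * b_inv % m
    return acc
```

With digits [1, 0, 0, 0, −1, 0, 0, −1] this computes b^(128 − 8 − 1) = b^119 mod m using only the nonzero-digit multiplications.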
Savas and Koc [44] have suggested defining the new Montgomery inverse as x = b^(−1)·2^{2n} mod a, so that the Montgomery inverse of a number already in the Montgomery domain is computed:

x = NewMonInv(b·2^m) = (b·2^m)^(−1)·2^{2m} mod a = b^(−1)·2^m mod a                (10.20)

Note that the MonInv algorithm in Figure 10.13a cannot compute the new Montgomery inverse directly. This needs two steps:

Figure 10.13 (a) Algorithm for computing the Montgomery inverse b^(−1)·2^m mod a and (b) example a = 17, b = 10, n = 5 (adapted from [43] ©IEEE1995)

c = MonInv(b·2^m) = (b·2^m)^(−1)·2^m mod a = b^(−1) mod a
x = MonPro(c, 2^{2m}) = (b^(−1)·2^{2m}·2^{−m}) mod a = b^(−1)·2^m mod a                (10.21a)

where the Montgomery product (MonPro) is defined as MonPro(a, b) = (a·b·2^{−n}) mod p. Alternatively, we can use

v = MonPro(b·2^m, 1) = (b·2^m·2^{−m}) mod a = b mod a
x = MonInv(b) = b^(−1)·2^m mod a                (10.21b)

This will be useful in ECC computation if the intermediate results are already in
Montgomery domain and when division is needed e.g. in computation of point
addition or doubling.
Gutub et al. [45] have described a VLSI architecture for GF(p) Montgomery modular inverse computation. They observe that two parallel subtractors for finding (u − v), (v − u) and (r − a) are required so as to speed up the computation (see Figure 10.13a). They have suggested a scalable architecture which takes w bits at a time and performs the computation in ⌈n/w⌉ cycles for scalable operations such as addition/subtraction. The area of the scalable design has been found to be on average 60 % smaller than that of the fixed design.

Savas [46] has used a redundant signed digit (RSD) representation so that carry propagation in addition and subtraction is avoided. This enables fast computation of the multiplicative inverse in GF(p).
Bucek and Lorencz [47] have suggested a subtraction-free AMI technique. It computes (u + v) instead of (u − v), where one of the operands must always be negative. By keeping u always negative and v positive, we can compute an equivalent of the differences of the original algorithm without subtraction. Note that the values of v, r, and s are the same as in the original algorithm, but u takes the opposite sign. The authors have considered the original AMI design with two subtractors as well as with one subtractor, and show that AT (area × time product) is lower for the subtraction-free AMI, whereas the AMI with one subtractor is slower than the AMI using two subtractors. Note that the initial value of u shall be −p instead of p in this approach. The algorithm is presented in Figure 10.14.

Figure 10.14 Bucek and Lorencz subtraction-free AMI algorithm pseudocode (adapted from [47] ©IEEE2006)

10.5 Elliptic Curve Cryptography Using RNS

Schiniankis et al. [48] have realized ECC using RNS. In order to reduce the division
operations in affine representation, they have used Jacobian coordinates. Consider
the elliptic curve

y2 ¼ x3 þ ax þ b over Fp ð10:22aÞ

where a, b 2 Fp and 4a3 + 27b2 6¼ 0 mod p together with a special point O called the
point at infinity. Substituting x ¼ X2 , y ¼ Y3 ; using the Jacobian coordinate repre-
Z Z
sentation, (10.22a) changes as
 
E Fp : Y 2 ¼ X3 þ aXZ4 þ bZ6 ð10:22bÞ

The point at infinity is given by {0, 0, 0}. The addition of two points Po ¼ (Xo, Yo,
Zo) and P1 ¼ ðX1 ; Y 1 ; Z 1 Þ 2 Fp thus will yield the sum P2 ¼ ðX2 ; Y 2 ; Z2 Þ ¼ P0 þ P1
 
2 E Fp given by
8
> 2
< X2 ¼ R  TW
2

P2 ¼ P0 þ P1 ¼ 2Y 2 ¼ VR  MW 3 ð10:22cÞ
>
:
Z2 ¼ Z0 Z1 W

where

W ¼ X0 Z21  X1 Z20 , R ¼ Y 0 Z 31  Y 1 Z 30 , T ¼ X0 Z21 þ X1 Z20 ,


M ¼ Y 0 Z 31 þ Y 1 Z 30 , V ¼ TW 2  2X2 :

The doubling of the point P1 is given as

P2 = 2P1:  X2 = M^2 − 2S,  Y2 = M(S − X2) − T,  Z2 = 2Z1·Y1   (10.23)

where M = 3X1^2 + aZ1^4, S = 4X1·Y1^2, T = 8Y1^4.
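The formulas (10.22c) and (10.23) can be exercised with a small Python sketch; the toy curve y^2 = x^3 + 2x + 2 over F_17 and the base point (5, 1) are illustrative choices, not taken from the text.

```python
# Sketch of the Jacobian point operations (10.22c)/(10.23); the toy curve
# y^2 = x^3 + 2x + 2 over F_17 and the point (5, 1) are illustrative
# assumptions, not taken from the book.
p, a = 17, 2

def jacobian_add(P0, P1):
    (X0, Y0, Z0), (X1, Y1, Z1) = P0, P1
    W = (X0 * Z1 * Z1 - X1 * Z0 * Z0) % p
    R = (Y0 * Z1**3 - Y1 * Z0**3) % p
    T = (X0 * Z1 * Z1 + X1 * Z0 * Z0) % p
    M = (Y0 * Z1**3 + Y1 * Z0**3) % p
    X2 = (R * R - T * W * W) % p
    V = (T * W * W - 2 * X2) % p
    Y2 = (V * R - M * W**3) * pow(2, -1, p) % p   # the "1/2 mod p" step
    Z2 = Z0 * Z1 * W % p
    return X2, Y2, Z2

def jacobian_double(P1):
    X1, Y1, Z1 = P1
    M = (3 * X1 * X1 + a * Z1**4) % p
    S = 4 * X1 * Y1 * Y1 % p
    T = 8 * Y1**4 % p
    X2 = (M * M - 2 * S) % p
    Y2 = (M * (S - X2) - T) % p
    Z2 = 2 * Y1 * Z1 % p
    return X2, Y2, Z2

def to_affine(P):
    X, Y, Z = P
    zi = pow(Z, -1, p)
    return X * zi * zi % p, Y * zi**3 % p
```

Doubling (5, 1, 1) and converting back to affine coordinates gives (6, 3), and adding (5, 1, 1) to the result gives (10, 6), in agreement with the affine formulas.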


Note that the computation is intensive in multiplications and additions while
division is avoided. The exponentiation, i.e. the operation kP, follows the binary
algorithm, in which successive doublings and additions are performed depending on
the values of the bits (1 or 0) in the multiplier k. All the operations are mod p, thus
necessitating modulo adders and multipliers. If the field characteristic is 160 bits
long, the equivalent RNS range needed to compute (10.22c) and (10.23) is about 660 bits.
Hence, the authors use 20 moduli, each of about 33-bit length. In the case of p being
192 bits, the size of each of the 20 moduli will be 42 bits. The authors

used extended RNS using one redundant modulus to perform residue to binary
conversion using CRT. The conversion from projective coordinates to affine
coordinates is done using x = X/Z^2, y = Y/Z^3.
The RNS adder, subtractor and multiplier architectures are shown in
Figure 10.15a, b for the point adder (ECPA) and point doubler (ECPD). The point
multiplier (ECPM) is shown in Figure 10.15c. Note that the RNS adder, multiplier
and subtractor are shared for all the computations in (10.22c) and (10.23). Note also
that the modulo p reduction is performed after RNS to binary conversion using CRT.
The authors have shown that all the operations are significantly faster than those
using conventional hardware. A 160-bit point multiplication takes approximately
2.416 ms on Xilinx Virtex-E (V1000E-BG-560-8). The authors also observe that the
cost of conversion from residue to binary is negligible.
Schinianakis et al. [49] have further extended their work on ECC using RNS. They
observe that for p of 192-bit length, the equivalent RNS dynamic range is 840 bits.
As such, 20 moduli of 42 bits each have been suggested. The implementation of
(10.22c) and (10.23) can take advantage of parallelism between the multiplication,
addition and subtraction operations for both point addition and point doubling.
They observe that 13 steps will be required for each (see Figure 10.16). Note that
for ECPA, 17 multiplications, 5 subtractions and 2 additions are required, whereas
for ECPD, 15 multiplications, one addition and 3 subtractions are required, thus
sharing a multiplier/adder/subtractor. They, however, do not use a separate squaring
circuit. The RNS uses one extra redundant modulus and employs extended RNS for
RNS-to-binary conversion based on CRT. A special serial implementation was used
for the multiplication of an nf-bit word by an f-bit word needed in the CRT computation,
considering f bits at a time, where n is the number of moduli and f is the word length
of each modulus in an n-moduli RNS. The modulo reduction after CRT is carried
out using a serial modulo multiplier with 1 as one of the operands. The projective-to-affine
coordinate conversion needs division by Z^2 and Z^3, which requires finding the
multiplicative inverse. It needs one modular inversion and four modulo
multiplications:
T1 = 1/Z,  T2 = T1^2,  x = X·T2,  T3 = T1·T2,  y = Y·T3   (10.24)
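A minimal sketch of (10.24), computing the affine coordinates with a single modular inversion (the sample values passed in below are arbitrary illustrations):

```python
# Sketch of (10.24): projective-to-affine conversion with one modular
# inversion followed by four modular multiplications.
def jacobian_to_affine(X, Y, Z, p):
    T1 = pow(Z, -1, p)      # the single inversion
    T2 = T1 * T1 % p
    x = X * T2 % p
    T3 = T1 * T2 % p
    y = Y * T3 % p
    return x, y
```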

The authors use the technique of [47] for this purpose. The authors also consider
the effect of the choice of the number of moduli and the word length of the moduli on
the performance. They observe that increasing the number of moduli decreases the
bit length of the moduli and reduces the area. The moduli used are
presented in Table 10.1 for a 192-bit implementation for illustration. The authors
have described an FPGA implementation using a Xilinx Virtex-E XCV 1000E, FG
680 FPGA device. Typically, the time needed for point multiplication is 4.84,
4.08, 3.54 and 2.35 ms for 256-, 224-, 192- and 160-bit implementations, respectively.
Esmaeildoust et al. [50] have described elliptic curve point multiplication based
on the Montgomery technique using two RNS bases. The authors use moduli of the
[Figure omitted: shared RNS multiplier, RNS adder and RNS subtractor datapaths with input multiplexers and decoders for ECPA and ECPD; the ECPM uses a counter, shift register and multiplexers around the ECPA and ECPD blocks]
Figure 10.15 Architectures of ECPA (a), ECPD (b) and ECPM (c) (adapted from [48] ©IEEE2006)

form 2^k, 2^k − 1 and 2^k − 2^ti − 1 in order to have efficient arithmetic operations
and binary-to-RNS and RNS-to-binary conversions. The first base uses three or four
moduli of the type 2^k − 2^ti − 1, where ti < k/2, depending on the field length:
160 bits (three moduli), 192 bits (three or four moduli), 224 and 256 bits (both four
moduli). The second base uses either the three-moduli set {2^k, 2^k − 1, 2^(k+1) − 1} [51]
for 160- and 192-bit field lengths or the four-moduli set {2^k, 2^k − 1, 2^(k+1) − 1,
2^(k−1) − 1} [52] for field lengths 192, 224 and 256 bits.
[52] for field lengths 192, 224, and 256 bits. The various arithmetic operations like
modulo addition, modulo subtraction and modulo multiplication for these are
simpler. As an illustration, considering a modulus of the form (2^k − 2^ti − 1),
the reduction of a 2k-bit number w = wh·2^k + wl, which is a product of two k-bit
numbers, can be realized using a mod (2^k − 2^ti − 1) addition of four operands as
(w'hh·(2^ti + 1) + w'hl + wh + wl) mod (2^k − 2^ti − 1), where wh·2^ti = w'hh·2^k + w'hl,
since 2^k ≡ 2^ti + 1 mod (2^k − 2^ti − 1). After the MRC digits are
found for base 1, conversion to base 2 needs the computation of
xj = (v1 + m1(v2 + m2(v3 + m3·v4))) mod mj, where the vi are the mixed radix digits [50].
The MRC digit computation needs modulo subtraction and modulo multiplication
by a multiplicative inverse. Due to the particular form of the moduli,
these operations are simple. We will consider this in more detail later. The
advantage of the choice of the moduli set {2^n, 2^n − 1, 2^(n+1) − 1} is that the MRC
digits can be easily found (see Chapter 5). The conversion from the second base to
the first base is performed in a similar way. Thus, using shifters and adders, the
various modulo operations can be performed.
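The four-operand folding can be sketched as follows; since 2^k ≡ 2^ti + 1 mod (2^k − 2^ti − 1), the high half is folded twice and a small final correction completes the reduction (the parameters k and t below are illustrative):

```python
# Sketch of the four-operand reduction mod m = 2^k - 2^t - 1.
# Since 2^k = 2^t + 1 (mod m), a 2k-bit product w = wh*2^k + wl folds to
# wh*2^t + wh + wl, and the overflow of wh*2^t is folded once more.
def reduce_special(w, k, t):
    m = (1 << k) - (1 << t) - 1
    mask = (1 << k) - 1
    wl, wh = w & mask, w >> k
    x = wh << t                            # wh * 2^t, at most k + t bits
    whl, whh = x & mask, x >> k
    r = (whh << t) + whh + whl + wh + wl   # four-operand sum (whh folded)
    return r % m                           # final small correction
```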
The authors have employed a four-stage pipeline comprising one mod
(2^k − 2^ti − 1) multiplier, one reconfigurable modular (RM) multiplier, one
reconfigurable modular (RM) adder and two base-extension units with adder-based
structures. The RM structures cater for operations for the four types of moduli needed:
2^k, 2^k − 1, 2^(k+1) − 1, 2^(k−1) − 1. A six-stage pipeline has also been suggested which
can achieve higher speed. In this, the conversion from RNS in one base to another
is performed in two stages, RNS to MRS and MRS to RNS. The designs were
implemented on Xilinx Virtex-E, Xilinx Virtex-2 Pro and Altera Stratix II.
[Figure omitted: data flow graphs scheduling the multiply, add and subtract operations of the point addition and point doubling algorithms in 13 steps t1-t13]
Figure 10.16 (a, b) Data flow graphs (DFGs) for the point addition and point doubling algorithms (adapted from [49] ©IEEE2011)

Typically, on Xilinx Virtex-E, a 192-bit ECPM takes 2.56 ms and needs 20,014
LUTs, while for a 160-bit field length ECPM takes 1.83 ms and needs 15,448
LUTs. The reader is urged to refer to [50] for more details.
The difference in complexity between addition and doubling leads to simple power
analysis (SPA) attacks [73]; hence, unified addition formulae need to be used.

Table 10.1 RNS base modulus set for the 192-bit implementation (adapted from [49] ©IEEE2006)
2446268224217, 2446268224261, 2446268224273, 2446268224289, 2446268224321,
2446268224381, 2446268224409, 2446268224427, 2446268224441, 2446268224447,
2446268224451, 2446268224453, 2446268224457, 2446268224481, 2446268224493,
2446268224513, 2446268224579, 2446268224601, 2446268224639, 2446268224657

Montgomery ladder can be used for the scalar multiplication algorithm (addition and
doubling performed in each step). Several solutions have been suggested to obtain
leak resistance for different types of elliptic curves. As an illustration, for the Hessian
form [53], the curve equation is given by

x^3 + y^3 + z^3 = 3dxyz   (10.25a)

where d ∈ F_p and is not a third root of unity. For the Jacobi model [54], we have the
curve equation

y^2 = εx^4 − 2δx^2·z^2 + z^4   (10.25b)

where ε and δ are constants in F_p, and for the short Weierstrass form [55], the curve
equation is given by

y^2·z = x^3 + axz^2 + bz^3   (10.25c)

These require 12, 12 and 18 field multiplications for addition/doubling. Note that
Montgomery's technique [56] proposes to work only on x coordinates. The curve
equation is given by

By^2 = x^3 + Ax^2 + x   (10.25d)

Both addition and doubling take only three multiplications and two squarings each,
and both are performed for each bit of the exponent, giving a cost of about 10k
multiplications for finding kG.
Bajard et al. [73] have also shown that the formulae for point addition and
doubling can be rewritten to minimize the number of modular reductions needed. As an
illustration, for the Hessian form elliptic curve, the original equations for the addition
of two points (X1, Y1, Z1) and (X2, Y2, Z2) are

X3 = Y1^2·X2·Z2 − Y2^2·X1·Z1
Y3 = X1^2·Y2·Z2 − X2^2·Y1·Z1   (10.26a)
Z3 = Z1^2·X2·Y2 − Z2^2·X1·Y1

The cost of a multiplication is negligible compared to the cost of a reduction in
RNS. The authors consider RNS bases with moduli of the type mi = 2^k − ci, where ci
is small and sparse and ci ≤ 2^(k/2). Several co-prime moduli can be found: e.g., for
mi < 2^32, ci = 2^ti − 1 with ti = 0, 1, ..., 16 and ci = 2^ti + 1 with ti = 1, ..., 15. If
more co-primes are needed, ci of the form ci = 2^ti ± 2^si ± 1 can be used. The
reduction mod mi in these cases needs a few shift and add operations. Thus, the
reduction part costs only about 10% of the cost of a multiplication; an RNS digit
product is therefore equivalent to 1.1 word products (where a word is k bits), and an
RNS multiplication needs only 2n RNS digit products or 2.2n word products.
The authors have shown that in RNS the number of modular reductions needed can be
reduced. The advantage of RNS becomes apparent if we count multiplications and
modular reductions separately. Hence, (10.26a) can be rewritten as

A = Y1·X2,  B = Y1·Z2,  C = X1·Y2,  D = Y2·Z1,  E = X1·Z2,  F = X2·Z1,
X3 = AB − CD,  Y3 = EC − FA,  Z3 = EB − FD   (10.26b)

Thus, only nine reductions and 12 multiplications are needed. Similar results can be
obtained for the Weierstrass and Montgomery ladder cases.
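A Python sketch of the shared-product form (10.26b); the prime p = 499, the parameter d = 5 and the brute-force point search are illustrative assumptions (not from the text), and Z3 is computed as FD − EB so that its sign agrees with (10.26a):

```python
# Sketch of the shared-product Hessian addition (10.26b): the six products
# A..F are computed once and reused, so only the three differences need a
# modular reduction.  p = 499 and d = 5 are illustrative choices; Z3 uses
# the order FD - EB to match the signs of (10.26a).
p, d = 499, 5
c3 = 3 * d % p          # the constant 3d of x^3 + y^3 + z^3 = 3dxyz

def on_curve(P):
    X, Y, Z = P
    return (X**3 + Y**3 + Z**3 - c3 * X * Y * Z) % p == 0

def hessian_add(P1, P2):
    (X1, Y1, Z1), (X2, Y2, Z2) = P1, P2
    A, B, C = Y1 * X2, Y1 * Z2, X1 * Y2      # six unreduced products
    D, E, F = Y2 * Z1, X1 * Z2, X2 * Z1
    return ((A * B - C * D) % p, (E * C - F * A) % p, (F * D - E * B) % p)

# brute-force a few affine points (z = 1) on the curve
pts = []
for x in range(p):
    for y in range(p):
        if (x**3 + y**3 + 1 - c3 * x * y) % p == 0:
            pts.append((x, y, 1))
            break
    if len(pts) == 6:
        break

# pick the first pair whose sum is not the degenerate (0, 0, 0)
R = (0, 0, 0)
for i in range(len(pts)):
    for j in range(i + 1, len(pts)):
        R = hessian_add(pts[i], pts[j])
        if R != (0, 0, 0):
            break
    if R != (0, 0, 0):
        break
```

The sum of two curve points computed this way again satisfies the curve equation, which exercises the formula without ever reducing the intermediate products A..F.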
RNS base extension needed in Montgomery reduction, using MRC first followed
by Horner evaluation, has been considered by Bajard et al. [73]. Expressing the
reconstruction from the residues in base B using MRC as

A = a1 + m1(a2 + m2(a3 + ... + m_{n-1}·an))   (10.27a)

we need to compute, for base extension to base B' with moduli mj for j = n, n + 1, ..., 2n,

aj = |a1 + m1(a2 + m2(a3 + ... + m_{n-1}·an))| mod mj   (10.27b)

The multiplications by the constants |1/mi| mod mj amount to (n^2 − n)/2 digit products.
The conversion from MRS to RNS corresponds to a few shifts and adds. Assuming a
modulus of the form 2^k − 2^ti − 1, this needs computation of |a + b·mi| mod mj =
|a + 2^k·b − 2^ti·b − b| mod mj, which can be done in two additions (since a + 2^k·b is
just a concatenation), while the reduction mod mj requires three additions. Thus, the
evaluation of each aj in base B' needs 5n word additions. The MRS-to-RNS conversion
needs (n^2 − n)/5 RNS digit products, since the five word additions are equivalent to
1/5 of an RNS digit product. Hence, for the two base extensions, we need
(n^2 − n) + (2/5)(n^2 − n) + 3n = (7/5)n^2 + (8/5)n RNS digit products, which is better than O(n^2).
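The MRC reconstruction (10.27a) and the Horner-style base extension (10.27b) can be sketched as follows, with small illustrative moduli in place of the 2^k − 2^ti − 1 sets:

```python
# Sketch of base extension via MRC and Horner evaluation (10.27a)/(10.27b);
# the moduli are small illustrative values, not the special-form sets.
def mrc_digits(residues, moduli):
    # mixed-radix digits a_i of the number represented by the residues
    digits = []
    for i, (x, mi) in enumerate(zip(residues, moduli)):
        a = x % mi
        for j in range(i):
            a = (a - digits[j]) * pow(moduli[j], -1, mi) % mi
        digits.append(a)
    return digits

def extend(digits, moduli, mj):
    # Horner evaluation of a1 + m1(a2 + m2(a3 + ...)) mod the new modulus mj
    acc = digits[-1] % mj
    for a, m in zip(digits[-2::-1], moduli[-2::-1]):
        acc = (a + m * acc) % mj
    return acc
```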

10.6 Pairing Processors Using RNS

Considerable attention has been paid to the design of special-purpose processors
and algorithms in software for efficient implementation of pairing protocols. The
pairing computation can be broken down into multiplications and additions in the
underlying fields. Pairing has applications in three-way key exchange [57], identity-based
encryption [58], identity-based signatures [59] and non-interactive zero-knowledge
proofs [60].
The name bilinear pairing indicates that it takes a pair of vectors as input and
returns a number, performing a linear transformation on each of its input variables.
These operations are defined on elliptic or hyper-elliptic curves. The pairing is a
mapping e: G1 × G2 → G3, where G1 is a curve group defined over a finite field Fq,
G2 is another curve group on the extension field F_{q^k}, and G3 is a subgroup of the
multiplicative group F*_{q^k}. If G1 and G2 are the same group, then e is called a
symmetric pairing; if G1 ≠ G2, then e is called an asymmetric pairing. The map is
linear in each component and hence useful for constructing cryptographic protocols.
Several pairing protocols exist: Weil pairing [61], Tate pairing [62], ate
pairing [63], R-ate pairing [64] and optimal pairing [65].
Let Fp be the prime field with characteristic p, let E(Fp) be an elliptic curve
y^2 = x^3 + a4·x + a6, and let #E(Fp) be the number of points on the elliptic curve. Let ℓ
be a prime divisor of #E(Fp) = p + 1 − t, where t is the trace of the Frobenius map on
the curve. The embedding degree k of E with respect to ℓ is the smallest integer such
that ℓ divides p^k − 1. This means that the full ℓ-torsion is defined on the field F_{p^k}.
For any integer m and ℓ-torsion point P, f(m,P) is the function defined on the curve
whose divisor is

div f(m,P) = m(P) − ([m]P) − (m − 1)(O)   (10.28)

We define E(k)[r] as the k-rational r-torsion group of the curve. Let G1 = E(Fp)[r],
G2 = E(F_{p^k})/rE(F_{p^k}) and G3 = μr ⊂ F*_{p^k} (the r-th roots of unity). Let P ∈ G1
and Q ∈ G2; then the reduced Tate pairing is defined as

eT : E(Fp)[ℓ] × E(F_{p^k}) → F*_{p^k}/(F*_{p^k})^ℓ   (10.29a)

eT(P, Q) = f(ℓ,P)(Q)^((p^k − 1)/ℓ)   (10.29b)

The first step is to evaluate the function f(ℓ,P) at Q using the Miller loop [61].
Pseudocode for the Miller loop is presented in Figure 10.17; it uses the classical
square and multiply algorithm. The Miller loop is the core of all pairing protocols.

Figure 10.17 Algorithm for Miller loop (adapted from [66] ©2011)
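The square-and-multiply pattern underlying the Miller loop is sketched below for plain modular exponentiation; in the actual pairing, the squaring step additionally multiplies by the line function g(T,T)/ν(2T) and the multiply step by g(T,P)/ν(T+P). This is a structural sketch only, not a pairing implementation.

```python
# Structural sketch of the left-to-right square-and-multiply pattern that
# the Miller loop follows (shown here for ordinary modular exponentiation).
def square_and_multiply(x, e, p):
    f = 1
    for bit in bin(e)[2:]:           # scan exponent bits MSB to LSB
        f = f * f % p                # "doubling" / squaring step
        if bit == '1':
            f = f * x % p            # "addition" / multiply step
    return f
```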

In this, g(A,B) is the equation of the line passing through the points A and B (or tangent
at A if A = B) and νA is the equation of the vertical line passing through A, so that
g(A,B)/ν(A+B) is the function on E involved in the addition of A and B. The values of the
line and vertical functions g(A,B) and ν(A+B) are the distances calculated between the
fixed point Q and the lines that arise when adding B to A on the elliptic curve in
the standard way. Considering the affine coordinate representations of A and A + B
as (xj, yj) and (x_{j+1}, y_{j+1}), and the coordinates of Q as (xQ, yQ), we have

l(A,B)(Q) = yQ − yj − λj(xQ − xj)
ν(A+B)(Q) = xQ − x_{j+1}

Miller [61] proposed an algorithm that constructs f(ℓ,P)(Q) in stages using the
double and add method. The second step is to raise f to the power (p^k − 1)/ℓ.
The length of the Miller loop can be reduced to half of that of the Tate pairing,
because t − 1 ≈ √ℓ, by swapping P and Q in the ate pairing [63]. Here, we
define G1 = E(Fp)[r] and G2 = E(F_{p^k})[r] ∩ Ker(πp − [p]), where πp is the p-th power
Frobenius endomorphism, i.e. πp : E → E : (x, y) ↦ (x^p, y^p). Let P ∈ G1, Q ∈ G2 and
let t = p + 1 − #E(Fp) be the trace of Frobenius. Then, the ate pairing is defined as

eA : E(F_{p^k}) ∩ Ker(π − p) × E(Fp)[ℓ] → F*_{p^k}/(F*_{p^k})^ℓ   (10.30a)

eA(Q, P) = f(t−1,Q)(P)^((p^k − 1)/ℓ)   (10.30b)

Note that for Q ∈ Ker(π − p), π(Q) = (t − 1)Q.


In the case of R-Ate pairing [64], if l is the parameter used to construct the BN
curves [67], b ¼ 6l + 2, it is defined as
308 10 RNS in Cryptography

   ‘
 
eR : E F k \ Kerðπ  pÞ  E Fp ½‘ ! F* k F* k ð10:31aÞ
p p p

  p
pk 1

Ra ðQ; PÞ ¼ f ðb;QÞ ðPÞ: f ðb;QÞ ðPÞ:gðbQ;QÞ ðPÞ gðπðbþ1ÞQ, bQÞ ðPÞ ð10:31bÞ

pffiffi
The length of the Miller loop is 4 ‘ and hence is reduced by 4 compared to Tate
pairing.
The MNT curves [68] have an embedding degree k of 6. These are ordinary
elliptic curves over Fp such that p = 4l^2 + 1 and t = 1 ± 2l, where p is a large prime
such that #E(Fp) = p + 1 − t is a prime [69].
Parameterized elliptic curves due to Barreto and Naehrig [67] are well suited for
asymmetric pairings. These are defined by E : y^2 = x^3 + a6, a6 ≠ 0, over Fp,
where p = 36u^4 + 36u^3 + 24u^2 + 6u + 1 and the order n of E is n = 36u^4 + 36u^3
+ 18u^2 + 6u + 1 for some u such that p and n are prime. Note that only a u that
generates primes p and n will suffice. BN curves have an embedding degree k = 12,
which means that n divides p^12 − 1 but not p^k − 1 for 0 < k < 12. Note that t = 6u^2
+ 1 is the trace of Frobenius. The value of t is also parameterized and must be
chosen large enough to meet a given security level. For efficiency of computation,
u and t must have small Hamming weight. As an example, for a6 = 3,
u = 0x6000 0000 0000 1F2D (hex) gives 128-bit security. Since t, n and p are
parameterized, the parameter u alone suffices to be stored or transmitted. This
yields two primes n and p of 256 bits with Hamming weights 91 and 87, respectively.
The size of the field F_{p^k} is 256 × k = 3072 bits. This allows a faster
exponentiation method.
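The parameterization can be checked directly in Python; the Miller-Rabin test below is probabilistic, and u is the value quoted above:

```python
# Sketch: deriving the BN curve parameters from u and sanity-checking them.
def bn_params(u):
    p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1
    t = 6*u**2 + 1                    # trace of Frobenius
    n = p + 1 - t                     # = 36u^4 + 36u^3 + 18u^2 + 6u + 1
    return p, n, t

def is_probable_prime(m, bases=(2, 3, 5, 7, 11, 13, 17)):
    # standard probabilistic Miller-Rabin with a few fixed bases
    if m in (2, 3):
        return True
    if m < 2 or m % 2 == 0:
        return False
    d, r = m - 1, 0
    while d % 2 == 0:
        d //= 2; r += 1
    for a in bases:
        x = pow(a, d, m)
        if x in (1, m - 1):
            continue
        for _ in range(r - 1):
            x = x * x % m
            if x == m - 1:
                break
        else:
            return False
    return True

u = 0x6000000000001F2D
p, n, t = bn_params(u)
```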
An advantage of BN curves is their degree-6 twist. Considering that E and Ẽ are
two elliptic curves defined over Fq, the degree of the twist is the degree d of the
smallest extension F_{q^d} of Fq on which the isomorphism ψd between E and Ẽ is
defined. This means that E is isomorphic over F_{p^12} to a curve Ẽ defined by
y^2 = x^3 + a/ν, where ν is an element in F_{p^2} which is neither a cube nor a square.
Thus, we can define twisted versions of the pairings on Ẽ(F_{p^2}) × E(Fp)[ℓ]. This
means that the coordinates of Q can be written as (xQ·ν^(1/3), yQ·ν^(1/2)), where
xQ, yQ are in F_{p^2}:

ψ6 : Ẽ → E, (x, y) ↦ (x·ν^(1/3), y·ν^(1/2))   (10.32)

Note that computing g, ν, 2T and T + Q (needed in Algorithm 1, see Figure 10.17)
requires only F_{p^2} arithmetic, but the result remains in F_{p^12}. The denominators
ν2T and νT+Q will get wiped out by the final exponentiation.

For the implementation of pairing protocols, special hardware will be required,
such as large-operand multipliers based on a variety of techniques (Karatsuba,
Toom-Cook, arithmetic in extension fields, etc.). It will be helpful to consider these
first, before discussing pairing processor implementation using RNS. The reader is
urged to consult [70, 71] for a tutorial introduction to pairing.

Large Operand Multipliers

Bajard et al. [72] have considered the choice of 64-bit moduli with low Hamming
weight in moduli sets. The advantage is that the multiplicative inverses appearing in
MRC also have low Hamming weight, thus simplifying those multiplications to a few
additions. As explained before, such moduli are also useful for base extension to
another RNS base. These moduli are of the type 2^k − 2^t − 1 where t < k/2. As an
illustration, two six-moduli sets are {2^64 − 2^10 − 1, 2^64 − 2^16 − 1, 2^64 − 2^19 − 1,
2^64 − 2^28 − 1, 2^64 − 2^20 − 1, 2^64 − 2^31 − 1}, whose Hamming weights are all 3,
and {2^64 − 2^22 − 1, 2^64 − 2^13 − 1, 2^64 − 2^29 − 1, 2^64 − 2^30 − 1, 2^64 − 1, 2^64},
with Hamming weights 3, 3, 3, 3, 2, 1. The inverses in this case have Hamming
weights ranging between 2 and 20.
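That the first six-moduli set is a valid RNS base can be checked by verifying pairwise co-primality:

```python
# Sketch: checking pairwise co-primality of the first six-moduli set
# quoted above (moduli of the form 2^64 - 2^t - 1).
from math import gcd

moduli = [2**64 - 2**t - 1 for t in (10, 16, 19, 28, 20, 31)]
pairwise_coprime = all(gcd(a, b) == 1
                       for i, a in enumerate(moduli)
                       for b in moduli[i + 1:])
```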
Multiplication mod (2^224 − 2^96 + 1), which is the NIST prime P-224, can be easily
carried out [42]. The product has 14 32-bit words. Denoting these
as r13, r12, r11, ..., r2, r1, r0, the reduction can be carried out by computing
(t1 + t2 + t3 − t4 − t5) mod P-224, where each ti is written below as seven 32-bit
words, most significant word first:
t1 = (r6, r5, r4, r3, r2, r1, r0)
t2 = (r10, r9, r8, r7, 0, 0, 0)
t3 = (0, r13, r12, r11, 0, 0, 0)
t4 = (0, 0, 0, 0, r13, r12, r11)
t5 = (r13, r12, r11, r10, r9, r8, r7)
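The word-level decomposition can be sketched in software as follows (an illustration, not the hardware realization of [42]); words are packed least-significant first:

```python
# Sketch of the NIST P-224 fast reduction: split the 448-bit product into
# 32-bit words and combine five 224-bit terms, with a final mod to fix the
# sign and range.
P224 = (1 << 224) - (1 << 96) + 1

def words(c, n=14):
    return [(c >> (32 * i)) & 0xFFFFFFFF for i in range(n)]

def pack(ws):
    # ws holds seven 32-bit words, least significant first
    v = 0
    for i, w in enumerate(ws):
        v += w << (32 * i)
    return v

def reduce_p224(c):
    r = words(c)
    t1 = pack([r[0], r[1], r[2], r[3], r[4], r[5], r[6]])
    t2 = pack([0, 0, 0, r[7], r[8], r[9], r[10]])
    t3 = pack([0, 0, 0, r[11], r[12], r[13], 0])
    t4 = pack([r[11], r[12], r[13], 0, 0, 0, 0])
    t5 = pack([r[7], r[8], r[9], r[10], r[11], r[12], r[13]])
    return (t1 + t2 + t3 - t4 - t5) % P224
```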
Multiplication of large numbers can be carried out using the Karatsuba formula [74],
which uses fewer multiplications of smaller numbers at the cost of more additions. This
can be viewed as multiplication of linear polynomials. Two polynomials of two
terms each can be multiplied using only three multiplications:

(a0 + a1·x)(b0 + b1·x) = a0b0 + (a0b1 + a1b0)x + a1b1·x^2
                       = a0b0 + ((a0 + a1)(b0 + b1) − a0b0 − a1b1)x + a1b1·x^2   (10.33a)

Thus, a0b0, a1b1 and (a0 + a1)(b0 + b1) are the only three multiplications needed.
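A single Karatsuba step for two-digit operands can be sketched as follows (the digit base B is arbitrary):

```python
# Sketch of one Karatsuba step (10.33a): the product of two 2-digit numbers
# in base B from three digit multiplications instead of four.
def karatsuba2(a, b, B):
    a0, a1 = a % B, a // B
    b0, b1 = b % B, b // B
    low = a0 * b0
    high = a1 * b1
    mid = (a0 + a1) * (b0 + b1) - low - high   # = a0*b1 + a1*b0
    return low + mid * B + high * B * B
```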


Extension to three terms [75] is as follows:

    
(a0 + a1x + a2x^2)(b0 + b1x + b2x^2) = a0b0(C + 1 − x − x^2)
  + a1b1(C − x + x^2 − x^3)
  + a2b2(C − x^2 − x^3 + x^4)
  + (a0 + a1)(b0 + b1)(−C + x)
  + (a0 + a2)(b0 + b2)(−C + x^2)
  + (a1 + a2)(b1 + b2)(−C + x^3)
  + (a0 + a1 + a2)(b0 + b1 + b2)C   (10.33b)

for an arbitrary polynomial C with integer coefficients. A proper choice of C can
reduce the number of multiplications. As an example, C = x^2 avoids the need to
compute (a0 + a2)(b0 + b2). Thus, only six multiplications will be needed instead
of the nine needed in the schoolbook algorithm.
Montgomery [75] has extended this technique to products of quartic, quintic and
sextic polynomials, which are presented below for completeness. The quartic case
(see (10.34)) needs 13 multiplications and 22 additions/subtractions by taking
advantage of the common sub-expressions

a0 + a1, a0 − a4, a3 + a4, (a0 + a1) − (a3 + a4), (a0 + a1) + a2, a2 + (a3 + a4),
(a0 + a1) + (a2 + a3 + a4), a0 − (a2 + a3 + a4), (a0 + a1 + a2) − a4,
(a0 − a2 − a3 − a4) + a4, (a0 + a1 + a2 − a4) − a0

and similarly with the b's. Other optimizations are also possible by considering
repeated sub-expressions.
  
(a0 + a1x + a2x^2 + a3x^3 + a4x^4)(b0 + b1x + b2x^2 + b3x^3 + b4x^4)
= (a0 + a1 + a2 + a3 + a4)(b0 + b1 + b2 + b3 + b4)(x^5 − x^4 + x^3)
+ (a0 − a2 − a3 − a4)(b0 − b2 − b3 − b4)(x^6 − 2x^5 + 2x^4 − x^3)
+ (a0 + a1 + a2 − a4)(b0 + b1 + b2 − b4)(−x^5 + 2x^4 − 2x^3 + x^2)
+ (a0 + a1 − a3 − a4)(b0 + b1 − b3 − b4)(x^5 − 2x^4 + x^3)
+ (a0 − a2 − a3)(b0 − b2 − b3)(−x^6 + 2x^5 − x^4)
+ (a1 + a2 − a4)(b1 + b2 − b4)(−x^4 + 2x^3 − x^2)
+ (a3 + a4)(b3 + b4)(x^7 − x^6 + x^4 − x^3)
+ (a0 + a1)(b0 + b1)(−x^5 + x^4 − x^2 + x)
+ (a0 − a4)(b0 − b4)(−x^6 + 3x^5 − 4x^4 + 3x^3 − x^2)
+ a4b4(x^8 − x^7 + x^6 − 2x^5 + 3x^4 − 3x^3 + x^2)
+ a3b3(−x^7 + 2x^6 − 2x^5 + x^4)
+ a1b1(x^4 − 2x^3 + 2x^2 − x)
+ a0b0(x^6 − 3x^5 + 3x^4 − 2x^3 + x^2 − x + 1)   (10.34)

Similarly, quintic polynomial multiplication can be done with 17 multiplications,
and the sextic case needs 22 base-ring multiplications. Bounds on the number of
multiplications needed for products of polynomials of up to 18 terms are given in [75].
The hardware implementation of Fp arithmetic for pairing-friendly curves,
e.g. Barreto-Naehrig (BN) curves, can be intensive in modular multiplications.
These can be realized using the polynomial Montgomery reduction technique
[76, 77], in either a parallel or a digit-serial implementation. In this hybrid
modular multiplication (HMM) technique, polynomial reduction is carried out
using the Montgomery technique while coefficient reduction uses division. In the
parallel version [76], four steps are involved (see the flow chart in Figure 10.18a):
(a) polynomial multiplication, (b) coefficient reduction mod z, (c) polynomial
reduction and (d) coefficient reduction.
We wish to compute r(z) = a(z)b(z)z^(−n) mod p(z), where a(z) = Σ_{i=0}^{n−1} ai·z^i,
b(z) = Σ_{i=0}^{n−1} bi·z^i and p(z) = Σ_{i=1}^{n−1} pi·z^i + 1, with p = f(z) being the
modulus. The polynomial multiplication in the first step leads to

c(z) = a(z)b(z) = Σ_{i=0}^{2n−2} ci·z^i   (10.35a)

In this step, the coefficient reduction is carried out by finding ci mod z and ci div z;
the quotient ci div z is added to c_{i+1}. In the polynomial reduction based on the
Montgomery technique, first q(z) is found as

q(z) = ((c(z) mod z^n) g(z)) mod z^n   (10.35b)

where g(z) = (−f(z))^(−1) mod z^n. Next, we compute (c(z) + q(z)f(z))/z^n. A last step is
coefficient reduction. The computation yields a(z)b(z)z^(−5) mod p in the case of BN
curves. The expressions for q(z), h(z) and v(z) in the case of BN curves are as follows:
q(z) = Σ_{i=0}^{4} qi·z^i = (−c4 + 6(c3 − 2c2 − 6(c1 − 9c0)))z^4
       + (−c3 + 6(c2 − 2c1 − 6c0))z^3 + (−c2 + 6(c1 − 2c0))z^2   (10.35c)
       + (−c1 + 6c0)z − c0

and

h(z) = Σ_{i=0}^{3} hi·z^i = 36q4·z^3 + 36(q4 + q3)z^2
       + 12(2q4 + 3(q3 + q2))z + 6(q4 + 4q3 + 6(q2 + q1))   (10.35d)
[Figure omitted: flow chart of the parallel HMM algorithm and the Fp multiplier datapath with four 65 × 65 multipliers, one 65 × 32 multiplier, Mod-1 and Mod-2 reduction blocks, an accumulator and a polynomial reduction stage]
Figure 10.18 (a) Parallel hybrid modular multiplication algorithm for BN curves (b) Fp multiplier using HMM (adapted from [76] ©IEEE2012)

v(z) = c(z)/z^5 + h(z)   (10.35e)

Next, coefficient reduction is done on v(z) to obtain r(z).


An example [78] will be illustrative. Consider a ¼ 35z4 + 36z3 + 7z2 + 6z + 103,
b ¼ 5z4 + 136z3 + 34z2 + 9z + 5 with z ¼ 137, f(z) ¼ 36z4 + 36z3 + 24z2 + 6z + 1.
Note that g(z) ¼ f(z)1 ¼ 324z4  36z3  12z2 + 6z + 1. We need to find r(z) ¼
a(z)b(z)/z5 mod f(z).
First, we consider the non-polynomial form of integers. We have A ¼ 12,
422, 338, 651 and B ¼ 2, 111, 720, 197 and p ¼ 12, 774, 932, 983. We can find
A  B ¼ 26, 232, 503, 423, 290, 434, 247 and 1375 ¼ 48, 261, 724, 457. We also
compute α ¼ (A  B) mod 1375 ¼ 41, 816, 018, 411. We next have β ¼ f(z)1 ¼ 114,
044, 423, 849. Hence, γ ¼ (αβ) mod 1375 ¼ 33,251, 977, 638. Finally we compute
ABþγp
¼ 451, 024, 289, 300,5955, 068, 401 ¼ 9, 345, 382, 793.
1375 137
The following steps obtain the result when we compute in polynomial form:

c(z) = a(z)b(z) = z^9 + 74z^8 + 52z^7 + 111z^6 + 70z^5 + 118z^4 + 96z^3 + 36z^2 + z + 104

after reducing the coefficients mod 137. Thus,

q(z) = ((c(z) mod z^5) g(z)) mod z^5 = 33686z^4 − 3636z^3 − 1278z^2 + 623z − 104

Next, multiplying this by f(z) yields q(z)f(z) = 1,212,696z^8 + 1,081,800z^7
+ 631,560z^6 + 91,272z^5 − (118z^4 + 96z^3 + 36z^2 + z + 104), so the least
significant five terms of c(z) + q(z)f(z) are zero. Adding the most significant part
of c(z), viz. z^9 + 74z^8 + 52z^7 + 111z^6 + 70z^5, and dividing by z^5 yields

z^4 + 1,212,770z^3 + 1,081,852z^2 + 631,671z + 91,342

which after coefficient reduction mod 137 gives 65z^5 + 6z^4 + 30z^3 + 57z^2 + 82z + 100.
Note that this needs to be reduced mod p to obtain the actual result:

26z^4 + 72z^3 + 57z^2 + 117z + 129 = 9,345,382,793.

The same example can be worked out using the serial multiplication due to Fan
et al. [77], which leads to smaller intermediate coefficient values. Note that instead
of computing a(z)b(z) fully, we take the terms of b one at a time and reduce the
product mod p. The results after each step of partial product addition, coefficient
reduction and scaling by z are as follows:
10z^4 + 3z^3 + 4z^2 + 6z + 95 after adding 5A and 33p
21z^4 + 133z^3 + 127z^2 + 65z + 101 after adding 9A and 74p
34z^4 + 44z^3 + 37z^2 + 72z + 50 after adding 34A and 96p
49z^4 + 39z^3 + 14z^2 + 78z + 76 after adding 136A and 53p
26z^4 + 72z^3 + 57z^2 + 117z + 129 after adding 5A and 99p
The coefficients in this case can be seen to be smaller than in the previous case.
We illustrate the first step as follows: after multiplying A by 5 we obtain
175z^4 + 180z^3 + 35z^2 + 30z + 515. Reducing the terms mod 137 and adding each carry
to the next term, we obtain z^5 + 39z^4 + 43z^3 + 35z^2 + 33z + 104. Evidently, we
need to add 33p to make the least significant digit zero, yielding (z^5 + 39z^4 + 43z^3
+ 35z^2 + 33z + 104) + 33(36z^4 + 36z^3 + 24z^2 + 6z + 1), which after reducing the terms
mod 137 as before and dividing by z (since the z^0 term becomes zero) gives 10z^4 + 3z^3
+ 4z^2 + 6z + 95.
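The worked example can be verified end-to-end in Python: the integer result must satisfy r = A·B·137^(−5) mod p and agree with the polynomial route:

```python
# Sketch verifying the worked example: the Montgomery-style result
# r = A * B * z^(-5) mod p for z = 137, checked against the book's numbers.
z = 137
a_coeffs = [103, 6, 7, 36, 35]        # a(z), least significant first
b_coeffs = [5, 9, 34, 136, 5]
f_coeffs = [1, 6, 24, 36, 36]

def ev(coeffs):
    # evaluate a coefficient list at z
    return sum(c * z**i for i, c in enumerate(coeffs))

A, B, p = ev(a_coeffs), ev(b_coeffs), ev(f_coeffs)
r = A * B * pow(pow(z, 5, p), -1, p) % p
```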
In the digit-serial hybrid multiplication technique, the multiplication and the
reduction/scaling are carried out together in each step.
Fan et al.'s architecture [76] was based on a hybrid Montgomery multiplier (HMM)
in which multiplication and reduction are interleaved. The multiplier architecture for
z = 2^63 + s, where s = 857 = 2^5(2^4 + 2^3) + 2^6 + (2^4 + 2^3) + 1, for 128-bit security is
shown in Figure 10.18b. Four 65 × 65 multipliers and one 65 × 32 multiplier are
used to carry out the polynomial multiplication. Each 65 × 65 multiplier is
implemented using the two-level Karatsuba method. Five "Mod-1" blocks are used
for the first coefficient reduction step; the Mod-1 block is shown in Figure 10.18b.
Partial products are immediately reduced. Multiplication by s is realized using four
additions since s = 2^5(2^4 + 2^3) + 2^6 + (2^4 + 2^3) + 1. The outputs of the "Mod-1" blocks
can be at most 78 bits. These outputs, corresponding to the various bi computed in
five successive cycles, are next accumulated and shifted in the accumulator. Once
the partial products are ready, in phase III, polynomial reduction is performed with
only shifts and additions, e.g. 6α = 2^2·α + 2α, 9α = 2^3·α + α, 36α = 2^5·α + 2^2·α. The
values of ci are less than (i + 1)·2^77 for 0 ≤ i ≤ 4. It can be shown that the vi are less
than 92 bits. The "Mod-2" block is similar to the "Mod-1" block, but its input is only
93 bits (see Figure 10.18b). The resulting ri are such that |ri| ≤ 2^63 + 2^41 for 0 ≤ i ≤ 3
and |r4| ≤ 2^30.
The negative coefficients in r(z) are made positive by adding the following
polynomial, which equals v·f(z) and hence is a multiple of p:

l(z) = (36v − 2)z^4 + (36v + 2z − 2)z^3 + (24v + 2z − 2)z^2
       + (6v + 2z − 2)z + (v + 2z)   (10.36)

where v = 2^25 and z = 2^63 + s.


The authors have used a 16-stage pipeline to achieve a high clock frequency and
one polynomial multiplication takes five iterations. One multiplier has a delay of
20 cycles. The multiplier finishes one multiplication every five cycles. The authors
have used XILINX Virtex-6 FPGAs (XC6VLX240) and could achieve a maximum
frequency of 210 MHz using 4014 slices, 42 DSP48E1s and 5 block RAMs
(RAMB36E1).
The digit-serial implementation [77] will be described next. Note that p(z) ≡ 1
mod z, which means that p(z)^(−1) mod z^n has integer coefficients. The polynomial
reduction uses Montgomery reduction, which needs division by z. Since z = 2^m + s
where s is small, the division is transferred to a multiplication by the small constant s. The

algorithm for modular reduction for BN curves is presented in Figure 10.19a. Note
that five steps are needed in the first loop to add a(z)·bj to the previous result and
divide by z mod p. The authors prove that the output is bounded: under the conditions
0 ≤ |ai|, |bi| < 2^(m/2) for i = 4 and 0 ≤ |ai|, |bi| < 2^(m+1) for 0 ≤ i ≤ 3, one obtains
0 ≤ |ri| < 2^(m/2) for i = 4 and 0 ≤ |ri| < 2^(m+1) for 0 ≤ i ≤ 3. Note that for realizing
256-bit BN curves, the digit size is 64 bits; four 64-bit words and one 32-bit
word will be needed. It can be seen that in the first loop, in step 3, one 32 × 64
and four 64 × 64 multiplications are needed. In step 4, one ⌈log2 s⌉ × ⌈log2 μ⌉
multiplication, where μ < 2^(m+6), is needed. The last iteration takes four 32 × 64
and one 32 × 32 multiplications. In total, the first loop takes one 32 × 32, eight
32 × 64, sixteen 64 × 64 and five ⌈log2 s⌉ × ⌈log2 μ⌉ multiplications, where μ < 2^(m+6).
The coefficient reduction phase requires eight ⌈log2 s⌉ × ⌈log2 μ⌉ multiplications. It
can be shown that μ < 2^(k+6) in the for loop (steps 8–10) and μ < s·2^6 in step 12 of the
second for loop. On the other hand, the Barrett and Montgomery algorithms need
36 64 × 64 multiplications.
The design was implemented in 130 nm ASIC technology; it needed 183K gates for Ate and R-Ate pairing, operated at 204 MHz, and the computation times are 4.22 and 2.91 ms, respectively. The architecture of the multiplier together with the accumulator and the γ, μ calculation blocks is shown in Figure 10.19b. Step 3 is performed by a row of 64 × 16 and 32 × 16 multipliers needing four iterations. The partial product is next reduced by the Mod_t block, which comprises a multiplier and a subtractor. This block generates μ and γ from $c_i$. Note that $\mu = c_i \,\mathrm{div}\, 2^m$ and $\gamma = c_i \bmod 2^m - s\mu$ in all mod blocks except the one below $rc_0$, which instead computes $\gamma = s\mu - rc_0 \bmod 2^m$. The second loop re-uses the mod z blocks.
Chung and Hasan [79] have suggested the use of LWPFI (low-weight polynomial form integers) for performing modular multiplications efficiently. These are similar to generalized Mersenne numbers (GMN) f(t), except that t need not be a power of 2; the coefficients satisfy $|f_i| \le 1$:

$$f(t) = t^n + f_{n-1}t^{n-1} + \cdots + f_1 t + f_0 \qquad (10.37)$$

Since f(t) is monic (the leading coefficient is unity), the polynomial reduction phase is efficient. A pseudocode is presented in Figure 10.20. The authors use Barrett's reduction algorithm for performing the divisions required in phase III. When the moduli are large, the Chung and Hasan algorithm is more efficient than the traditional Barrett or Montgomery reduction algorithms. The authors later extended this technique to the case [80] where $|f_i| \le s$ with $s \ll z$. Note that the polynomial reduction phase is efficient only when f(z) is monic.
Corona et al. [81] have described a 256-bit prime-field multiplier for application in bilinear pairing using BN curves with $p = 2^{61} + 2^{15} + 1$, using an asymmetric divide-and-conquer approach based on the five-term Karatsuba technique, which used 12 DSP48 slices on Virtex-6. It needed fourteen 64 × 64 partial sub-products. This, however, requires many additions. These additions follow a pattern that can be exploited, by proper scheduling, to reduce the number of clock cycles from the 23 needed by the Karatsuba technique to 15. The 512-bit product is reduced to a 256-bit integer using the polynomial variant of Montgomery reduction of Fan et al. [77]. A 65 × 65-bit multiplier has been realized using asymmetric tiling [82]. One operand A was split into three 24-bit words A0, A1 and A2, and B was split into four 17-bit words B0, B1, B2 and B3, so that a 72 × 68 multiplier can be realized using the built-in DSP48 slices. This consumes 12 DSP48 slices and requires 12 products and 5 additions. The design achieved a clock frequency of 223.7 MHz on Virtex-6 with a 40-cycle latency, taking 15 cycles per product.

316 10 RNS in Cryptography

Figure 10.19 (a) Hybrid modular multiplication algorithm and (b) $F_p$ multiplier using HMMB (adapted from [77] ©IEEE2009)

Figure 10.20 Chung-Hasan multiplication algorithm (adapted from [76] ©IEEE2012)

Brinci et al. [83] have suggested a 258-bit multiplier for BN curves. The authors observe that the Karatsuba technique cannot efficiently exploit the full performance of the DSP blocks in FPGAs; hence, alternative techniques need to be explored. They use a quartic polynomial multiplier needing 13 sub-products based on Montgomery's technique [75], realized using 65 × 65-bit multipliers, 7 × 65-bit multipliers and one 7 × 7-bit multiplier; it needs 22 additions. In non-standard tiling, eleven DSP blocks suffice: eight multipliers are 17 × 24 whereas three are 24 × 17. The value of p used in the BN curves is $2^{63} + 857$. A frequency of 208 MHz was achieved on Virtex-6 using 11 DSP48 blocks and 4 block RAMs, taking 11 cycles per product.

Extension Field Arithmetic

When computing pairings, we need to construct a representation for the finite field $F_{p^k}$, where k is the embedding degree. The finite field $F_{p^k}$ is implemented as $F_p[X]/f(X)$, where f(X) is an irreducible polynomial of degree k over $F_p$. The elements of $F_{p^k}$ are represented using the polynomial basis $[1, X, X^2, \ldots, X^{k-1}]$, where X is a root of the irreducible polynomial. In large prime fields, pairing involves arithmetic in extensions of small degree of the base field. Hence, optimization of extension field arithmetic is required. We need algorithms for multiplication, squaring, finding the multiplicative inverse and computing the Frobenius endomorphism. These are considered in detail in this section.
Multiplication is computed as a multiplication of polynomials followed by a reduction modulo the irreducible polynomial f(X), which can be built into the formula for multiplication. For a multiplication in $F_{p^k}$, at least k reductions are required as the result has k coefficients. For $F_{p^{12}}$, twelve reductions are required. Lazy reduction (accumulation and reduction) can be used to decrease the number of reductions in the extension field, as will be explained later.
Several techniques can be used to perform computations in the quadratic, cubic, quartic and sextic extension fields [84].
A. Multiplication and squaring

A quadratic extension can be constructed as $F_{p^2} = F_p[X]/(X^2 - \beta)$ where β is a quadratic non-residue in $F_p$. An element $\alpha \in F_{p^2}$ is represented as $\alpha_0 + \alpha_1 X$ where $\alpha_i \in F_p$.
The school book method of multiplication c = ab yields

$$c = (a_0 + Xa_1)(b_0 + Xb_1) = (a_0b_0 + \beta a_1b_1) + X(a_1b_0 + a_0b_1) = c_0 + Xc_1 \qquad (10.38a)$$

which costs 4M + 2A + B, where M, A and B stand for multiplication, addition and multiplication by a constant, respectively. Using Karatsuba's formula [70], with $v_0 = a_0b_0$ and $v_1 = a_1b_1$, we have

$$c_0 = v_0 + \beta v_1, \quad c_1 = (a_0 + a_1)(b_0 + b_1) - v_0 - v_1 \qquad (10.38b)$$

which costs 3M + 5A + B. For squaring, we have the formulae for the respective cases of school book and Karatsuba as

$$c_0 = a_0^2 + \beta a_1^2, \; c_1 = 2a_0a_1 \quad \text{and} \quad c_0 = v_0 + \beta v_1, \; c_1 = (a_0 + a_1)^2 - v_0 - v_1 \qquad (10.38c)$$

where $v_0 = a_0^2$, $v_1 = a_1^2$. Thus, the operations in these two cases are M + 2S + 2A + B and 3S + 4A + B, respectively, where S stands for squaring.
In another technique known as complex squaring, $c = a^2$ is computed as

$$c_0 = (a_0 + a_1)(a_0 + \beta a_1) - v_0 - \beta v_0, \quad c_1 = 2v_0 \qquad (10.39)$$

where $v_0 = a_0a_1$. This needs 2M + 4A + 2B operations.
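These quadratic-extension formulas are easy to check numerically. The following sketch (illustrative only; the toy prime p = 97 and β = 5, a quadratic non-residue mod 97, are assumed) compares the school book product, the Karatsuba product and complex squaring:

```python
# Toy F_{p^2} = F_p[X]/(X^2 - beta) arithmetic; assumed parameters, not from [84].
p = 97
beta = 5  # quadratic non-residue mod 97

def mul_schoolbook(a, b):
    a0, a1 = a; b0, b1 = b
    # (10.38a): 4M + 2A + B
    return ((a0*b0 + beta*a1*b1) % p, (a1*b0 + a0*b1) % p)

def mul_karatsuba(a, b):
    a0, a1 = a; b0, b1 = b
    v0, v1 = a0*b0, a1*b1
    # (10.38b): 3M + 5A + B
    return ((v0 + beta*v1) % p, ((a0+a1)*(b0+b1) - v0 - v1) % p)

def sqr_complex(a):
    a0, a1 = a
    v0 = a0*a1
    # (10.39): 2M + 4A + 2B
    return (((a0+a1)*(a0+beta*a1) - v0 - beta*v0) % p, (2*v0) % p)

a, b = (13, 57), (42, 88)
assert mul_schoolbook(a, b) == mul_karatsuba(a, b)
assert mul_schoolbook(a, a) == sqr_complex(a)
```

The operation counts quoted in the text show up directly in the bodies of the three functions.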


 
In the case of cubic extensions $F_{p^3} = F_p[X]/(X^3 - \beta)$, an element $\alpha \in F_{p^3}$ is represented as $\alpha_0 + \alpha_1 X + \alpha_2 X^2$, where $\alpha_i \in F_p$ and β is a cubic non-residue in $F_p$.

The school book type of multiplication yields c = ab as

$$\begin{aligned}
c &= (a_0 + a_1X + a_2X^2)(b_0 + b_1X + b_2X^2) \\
  &= a_0b_0 + X(a_1b_0 + a_0b_1) + X^2(a_2b_0 + a_1b_1 + a_0b_2) + X^3(a_2b_1 + a_1b_2) + X^4a_2b_2 \\
  &= a_0b_0 + \beta(a_2b_1 + a_1b_2) + X(a_1b_0 + a_0b_1 + \beta a_2b_2) + X^2(a_2b_0 + a_1b_1 + a_0b_2) \\
  &= c_0 + c_1X + c_2X^2
\end{aligned} \qquad (10.40a)$$

This costs 9M + 6A + 2B. For squaring, we have

$$c_0 = a_0^2 + 2\beta a_1a_2, \quad c_1 = 2a_0a_1 + \beta a_2^2, \quad c_2 = a_1^2 + 2a_0a_2 \qquad (10.40b)$$

which needs 3M + 3S + 6A + 2B operations. The Karatsuba technique for multiplication yields

$$\begin{aligned}
c_0 &= v_0 + \beta((a_1 + a_2)(b_1 + b_2) - v_1 - v_2) \\
c_1 &= (a_0 + a_1)(b_0 + b_1) - v_0 - v_1 + \beta v_2 \\
c_2 &= (a_0 + a_2)(b_0 + b_2) - v_0 + v_1 - v_2
\end{aligned} \qquad (10.40c)$$

which costs 6M + 15A + 2B, where $v_0 = a_0b_0$, $v_1 = a_1b_1$ and $v_2 = a_2b_2$. For squaring, we have

$$\begin{aligned}
c_0 &= v_0 + \beta((a_1 + a_2)^2 - v_1 - v_2) \\
c_1 &= (a_0 + a_1)^2 - v_0 - v_1 + \beta v_2 \\
c_2 &= (a_0 + a_2)^2 - v_0 + v_1 - v_2
\end{aligned} \qquad (10.40d)$$

which requires 6S + 13A + 2B operations, where $v_0 = a_0^2$, $v_1 = a_1^2$, $v_2 = a_2^2$. In the Toom-Cook-3 [85, 86] method, we have to pre-compute

$$\begin{aligned}
v_0 &= a(0)b(0) = a_0b_0, \\
v_1 &= a(1)b(1) = (a_0 + a_1 + a_2)(b_0 + b_1 + b_2), \\
v_2 &= a(-1)b(-1) = (a_0 - a_1 + a_2)(b_0 - b_1 + b_2), \\
v_3 &= a(2)b(2) = (a_0 + 2a_1 + 4a_2)(b_0 + 2b_1 + 4b_2), \\
v_4 &= a(\infty)b(\infty) = a_2b_2
\end{aligned} \qquad (10.41)$$

where the five interpolation values $v_i$ are the evaluations of a(X)b(X) at X = 0, 1, -1, 2 and ∞. This needs 5M + 14A operations. Next, interpolation is performed to compute c = 6ab (the factor 6 eliminates the division by 6):

$$\begin{aligned}
c_0 &= 6v_0 + \beta(3v_0 - 3v_1 - v_2 + v_3 - 12v_4), \\
c_1 &= -3v_0 + 6v_1 - 2v_2 - v_3 + 12v_4 + 6\beta v_4, \\
c_2 &= -6v_0 + 3v_1 + 3v_2 - 6v_4
\end{aligned} \qquad (10.42a)$$

The total computation requirement is 5M + 40A + 2B operations. If β = 2, the cost is reduced to 5M + 35A. For squaring, we have

$$v_0 = (a(0))^2 = a_0^2, \; v_1 = (a(1))^2 = (a_0 + a_1 + a_2)^2, \; v_2 = (a(-1))^2 = (a_0 - a_1 + a_2)^2, \; v_3 = (a(2))^2 = (a_0 + 2a_1 + 4a_2)^2, \; v_4 = (a(\infty))^2 = a_2^2 \qquad (10.42b)$$

which needs 5S + 7A operations. Next, interpolation is performed as before using (10.42a). Thus, the Toom-Cook method needs fewer multiplications but more additions.
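The consistency of the evaluations (10.41) with the interpolation (10.42a) can be checked against six times the school book product (10.40a). This is an illustrative sketch with assumed toy parameters p = 103 and β = 2:

```python
# Toom-Cook-3 in F_p[X]/(X^3 - beta), producing c = 6ab; assumed toy parameters.
p, beta = 103, 2

def mul_direct_times6(a, b):
    a0, a1, a2 = a; b0, b1, b2 = b
    # school book product (10.40a), scaled by 6
    c0 = a0*b0 + beta*(a2*b1 + a1*b2)
    c1 = a1*b0 + a0*b1 + beta*a2*b2
    c2 = a2*b0 + a1*b1 + a0*b2
    return tuple(6*c % p for c in (c0, c1, c2))

def mul_toom3_times6(a, b):
    a0, a1, a2 = a; b0, b1, b2 = b
    # evaluations (10.41) at X = 0, 1, -1, 2, infinity
    v0 = a0*b0
    v1 = (a0 + a1 + a2)*(b0 + b1 + b2)
    v2 = (a0 - a1 + a2)*(b0 - b1 + b2)
    v3 = (a0 + 2*a1 + 4*a2)*(b0 + 2*b1 + 4*b2)
    v4 = a2*b2
    # interpolation (10.42a), yielding 6ab without any division by 6
    c0 = 6*v0 + beta*(3*v0 - 3*v1 - v2 + v3 - 12*v4)
    c1 = -3*v0 + 6*v1 - 2*v2 - v3 + 12*v4 + 6*beta*v4
    c2 = -6*v0 + 3*v1 + 3*v2 - 6*v4
    return (c0 % p, c1 % p, c2 % p)

a, b = (11, 29, 61), (5, 77, 40)
assert mul_direct_times6(a, b) == mul_toom3_times6(a, b)
```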
For squaring, three other techniques due to Chung and Hasan [87] are useful. For degree-2 polynomials, these need two steps. In the pre-computation step in method 1 (CH-SQR1), we have

$$s_0 = a_0^2, \; s_1 = 2a_0a_1, \; s_2 = (a_0 + a_1 - a_2)(a_0 - a_1 - a_2), \; s_3 = 2a_1a_2, \; s_4 = a_2^2 \qquad (10.43a)$$

In the next step, the squaring is computed as

$$c_0 = s_0 + \beta s_3, \quad c_1 = s_1 + \beta s_4, \quad c_2 = s_1 + s_2 + s_3 - s_0 - s_4 \qquad (10.43b)$$

Thus, the total cost is 3M + 2S + 11A + 2B operations. For the second technique, we have $s_2 = (a_0 - a_1 + a_2)^2$ while the other $s_i$ are the same as in (10.43a), and the final step is the same as in (10.43b). The total cost is 2M + 3S + 10A + 2B operations. For the third method, the pre-computation is given by

$$s_0 = a_0^2, \; s_1 = (a_0 + a_1 + a_2)^2, \; s_2 = (a_0 - a_1 + a_2)^2, \; s_3 = 2a_1a_2, \; s_4 = a_2^2, \; t_1 = (s_1 + s_2)/2 \qquad (10.43c)$$

and finally we compute

$$c_0 = s_0 + \beta s_3, \quad c_1 = s_1 - s_3 - t_1 + \beta s_4, \quad c_2 = t_1 - s_0 - s_4 \qquad (10.43d)$$

The total cost is 1M + 4S + 11A + 2B + 1D2 operations, where D2 indicates a division by 2. To avoid this division, $C = 2a^2$ can be computed:

$$c_0 = 2s_0 + 2\beta s_3, \quad c_1 = s_1 - s_2 - 2s_3 + 2\beta s_4, \quad c_2 = -2s_0 + s_1 + s_2 - 2s_4 \qquad (10.43e)$$

The total cost is 1M + 4S + 14A + 2B operations.
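The third Chung-Hasan method, (10.43c)-(10.43d), can be checked against the direct squaring formulas (10.40b). Toy parameters are assumed in this sketch:

```python
# CH-SQR3 for squaring in F_p[X]/(X^3 - beta); assumed toy parameters.
p, beta = 101, 2

def sqr_direct(a):
    a0, a1, a2 = a
    # school book squaring (10.40b)
    return ((a0*a0 + 2*beta*a1*a2) % p,
            (2*a0*a1 + beta*a2*a2) % p,
            (a1*a1 + 2*a0*a2) % p)

def sqr_ch3(a):
    a0, a1, a2 = a
    # pre-computation (10.43c)
    s0 = a0*a0
    s1 = (a0 + a1 + a2)**2
    s2 = (a0 - a1 + a2)**2
    s3 = 2*a1*a2
    s4 = a2*a2
    t1 = (s1 + s2) // 2          # s1 + s2 is always even, so this is exact
    # final step (10.43d)
    return ((s0 + beta*s3) % p,
            (s1 - s3 - t1 + beta*s4) % p,
            (t1 - s0 - s4) % p)

a = (7, 55, 83)
assert sqr_direct(a) == sqr_ch3(a)
```

Note that only one general multiplication (inside s3) and four squarings appear, matching the 1M + 4S count quoted above.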


In the case of direct quartic extensions, an element $\alpha \in F_{p^4}$ is represented as $\alpha_0 + \alpha_1 X + \alpha_2 X^2 + \alpha_3 X^3$ where $\alpha_i \in F_p$. We can construct a quartic extension as $F_{p^4} = F_p[X]/(X^4 - \beta)$ where β is a quadratic non-residue in $F_p$. We can also construct a quartic extension as $F_{p^4} = F_{p^2}[Y]/(Y^2 - \gamma)$ where $\gamma = \sqrt{\beta}$ is a quadratic non-residue in $F_{p^2}$. An element in $F_{p^4}$ can then be represented as $\alpha_0 + \alpha_1 Y$ where $\alpha_i \in F_{p^2}$.
The school book type of multiplication in the direct quartic extension yields c = ab as

$$\begin{aligned}
c_0 &= a_0b_0 + \beta(a_1b_3 + a_3b_1 + a_2b_2) \\
c_1 &= a_0b_1 + a_1b_0 + \beta(a_2b_3 + a_3b_2) \\
c_2 &= a_0b_2 + a_1b_1 + a_2b_0 + \beta a_3b_3 \\
c_3 &= a_0b_3 + a_1b_2 + a_2b_1 + a_3b_0
\end{aligned} \qquad (10.44a)$$

needing 16M + 12A + 3B operations, whereas squaring is given by the equations

$$c_0 = a_0^2 + \beta(2a_1a_3 + a_2^2), \; c_1 = 2(a_0a_1 + \beta a_2a_3), \; c_2 = 2a_0a_2 + a_1^2 + \beta a_3^2, \; c_3 = 2(a_0a_3 + a_1a_2) \qquad (10.44b)$$

needing 6M + 4S + 10A + 3B operations. Note that the Toom-Cook method can also be used, which reduces the number of multiplications at the expense of other operations.
Note, however, that a quadratic extension of a quadratic extension can also be used to obtain quartic extensions. Note also that either the Karatsuba technique or the school book technique can be used in $F_{p^2}$, leading to four options. The cost of the operations depends upon the choice of multiplication method for the bottom quadratic extension field. The operations needed are summarized in Tables 10.2 and 10.3 for multiplication and squaring [84].
Sextic extensions are possible using quadratic over cubic, cubic over quadratic or direct sextic constructions. In the case of the direct sextic extension, $F_{p^6}$ is constructed as $F_{p^6} = F_p[X]/(X^6 - \beta)$ where β is both a quadratic and a cubic non-residue in $F_p$.

Table 10.2 Summary of multiplication costs for quartic extensions as quadratic over quadratic (adapted from [84] ©2006)

  F_{p^4} method:      School book              Karatsuba
  F_{p^2} method       Non-linear   Linear      Non-linear   Linear
  School book          16M          12A + 5B    12M          16A + 4B
  Karatsuba            12M          24A + 5B    9M           25A + 4B

Table 10.3 Summary of squaring costs for quartic extensions as quadratic over quadratic (adapted from [84] ©2006)

  F_{p^4} method:      School book              Karatsuba               Complex
  F_{p^2} method       Non-linear   Linear      Non-linear   Linear     Non-linear   Linear
  School book          6M + 4S      10A + 4B    3M + 6S      14A + 5B   8M           12A + 4B
  Karatsuba            3M + 6S      17A + 6B    9S           20A + 8B   6M           18A + 4B
  Karatsuba/Complex    7M           17A + 6B    6M           20A + 8B

An element $\alpha \in F_{p^6}$ is represented as $\alpha_0 + \alpha_1 X + \alpha_2 X^2 + \alpha_3 X^3 + \alpha_4 X^4 + \alpha_5 X^5$ where $\alpha_i \in F_p$. The school book method computes c = ab as [84]

$$\begin{aligned}
c_0 &= a_0b_0 + \beta(a_1b_5 + a_2b_4 + a_3b_3 + a_4b_2 + a_5b_1) \\
c_1 &= a_0b_1 + a_1b_0 + \beta(a_2b_5 + a_3b_4 + a_4b_3 + a_5b_2) \\
c_2 &= a_0b_2 + a_1b_1 + a_2b_0 + \beta(a_3b_5 + a_4b_4 + a_5b_3) \\
c_3 &= a_0b_3 + a_1b_2 + a_2b_1 + a_3b_0 + \beta(a_4b_5 + a_5b_4) \\
c_4 &= a_0b_4 + a_1b_3 + a_2b_2 + a_3b_1 + a_4b_0 + \beta a_5b_5 \\
c_5 &= a_0b_5 + a_1b_4 + a_2b_3 + a_3b_2 + a_4b_1 + a_5b_0
\end{aligned} \qquad (10.45a)$$

The total costs of multiplication for the direct sextic extension using the school book, Montgomery and Toom-Cook-6X techniques are 36M + 30A + 5B, 17M + 143A + 5B and 11M + 93Mz + 236A + 5B operations, respectively. For squaring, the corresponding costs are 15M + 6S + 22A + 5B, 17S + 123A + 5B and 11S + 79Mz + 163A + 5B, where Mz stands for multiplication by a small word-size integer.
In the case of squaring $c = a^2$, we have for the school book method of the sextic extension

$$\begin{aligned}
c_0 &= a_0^2 + \beta(2(a_1a_5 + a_2a_4) + a_3^2) \\
c_1 &= 2(a_0a_1 + \beta(a_2a_5 + a_3a_4)) \\
c_2 &= 2a_0a_2 + a_1^2 + \beta(2a_3a_5 + a_4^2) \\
c_3 &= 2(a_0a_3 + a_1a_2 + \beta a_4a_5) \\
c_4 &= 2(a_0a_4 + a_1a_3) + a_2^2 + \beta a_5^2 \\
c_5 &= 2(a_0a_5 + a_1a_4 + a_2a_3)
\end{aligned} \qquad (10.45b)$$

An example of a degree-6 extension built as a quadratic extension of a cubic extension is illustrated next [69]. Note that $F_{p^3} = F_p[X]/(X^3 - 2) = F_p(\alpha)$ and $F_{p^6} = F_{p^3}[Y]/(Y^2 - \alpha) = F_{p^3}(\beta)$, where α is a cube root of 2. For using lazy reduction, the complete arithmetic shall be unrolled. Hence, letting $A = a_0 + a_1\alpha + a_2\alpha^2 + \beta(a_3 + a_4\alpha + a_5\alpha^2)$ and $B = b_0 + b_1\alpha + b_2\alpha^2 + \beta(b_3 + b_4\alpha + b_5\alpha^2)$ be two elements of $F_{p^6}$, using Karatsuba on the quadratic extension leads to

$$\begin{aligned}
AB &= (a_0 + a_1\alpha + a_2\alpha^2)(b_0 + b_1\alpha + b_2\alpha^2) + \alpha(a_3 + a_4\alpha + a_5\alpha^2)(b_3 + b_4\alpha + b_5\alpha^2) \\
&\quad + \bigl[(a_0 + a_3 + (a_1 + a_4)\alpha + (a_2 + a_5)\alpha^2)(b_0 + b_3 + (b_1 + b_4)\alpha + (b_2 + b_5)\alpha^2) \\
&\qquad - (a_0 + a_1\alpha + a_2\alpha^2)(b_0 + b_1\alpha + b_2\alpha^2) - (a_3 + a_4\alpha + a_5\alpha^2)(b_3 + b_4\alpha + b_5\alpha^2)\bigr]\beta
\end{aligned} \qquad (10.46)$$

Using Karatsuba once again to compute each of the three products gives

$$\begin{aligned}
AB &= \bigl[a_0b_0 + 2\bigl(a_4b_4 + (a_1+a_2)(b_1+b_2) - a_1b_1 - a_2b_2 + (a_3+a_5)(b_3+b_5) - a_3b_3 - a_5b_5\bigr)\bigr] \\
&+ \bigl[a_3b_3 + (a_0+a_1)(b_0+b_1) - a_0b_0 - a_1b_1 + 2\bigl(a_2b_2 + (a_4+a_5)(b_4+b_5) - a_4b_4 - a_5b_5\bigr)\bigr]\alpha \\
&+ \bigl[a_1b_1 + 2a_5b_5 + (a_0+a_2)(b_0+b_2) - a_0b_0 - a_2b_2 + (a_3+a_4)(b_3+b_4) - a_3b_3 - a_4b_4\bigr]\alpha^2 \\
&+ \bigl[(a_0+a_3)(b_0+b_3) - a_0b_0 - a_3b_3 + 2\bigl((a_1+a_2+a_4+a_5)(b_1+b_2+b_4+b_5) \\
&\qquad - (a_1+a_4)(b_1+b_4) - (a_2+a_5)(b_2+b_5) - (a_1+a_2)(b_1+b_2) + a_1b_1 + a_2b_2 \\
&\qquad - (a_4+a_5)(b_4+b_5) + a_4b_4 + a_5b_5\bigr)\bigr]\beta \\
&+ \bigl[(a_0+a_1+a_3+a_4)(b_0+b_1+b_3+b_4) - (a_0+a_3)(b_0+b_3) - (a_1+a_4)(b_1+b_4) \\
&\qquad - (a_0+a_1)(b_0+b_1) + a_0b_0 + a_1b_1 - (a_3+a_4)(b_3+b_4) + a_3b_3 + a_4b_4 \\
&\qquad + 2\bigl((a_2+a_5)(b_2+b_5) - a_2b_2 - a_5b_5\bigr)\bigr]\alpha\beta \\
&+ \bigl[(a_1+a_4)(b_1+b_4) - a_1b_1 - a_4b_4 + (a_0+a_2+a_3+a_5)(b_0+b_2+b_3+b_5) \\
&\qquad - (a_0+a_3)(b_0+b_3) - (a_2+a_5)(b_2+b_5) - (a_0+a_2)(b_0+b_2) + a_0b_0 + a_2b_2 \\
&\qquad - (a_3+a_5)(b_3+b_5) + a_3b_3 + a_5b_5\bigr]\alpha^2\beta
\end{aligned} \qquad (10.47)$$

It can be seen that 18M + 56A + 8B operations in $F_p$ are required, and only six reductions. Note that each component of AB can lie between 0 and $44p^2$. Thus, $B^n$ in the Montgomery representation and M in the RNS representation must be greater than 44p to perform lazy reduction in this degree-6 field.
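The expanded product (10.47) can be verified against a direct multiplication in the isomorphic representation $F_p[t]/(t^6 - 2)$, taking β = t and α = t² (an illustrative sketch with an assumed toy prime):

```python
# Verifying the 18-multiplication formula (10.47); assumed toy prime, any odd prime works.
p = 10007

def mul_direct(a, b):
    # Map a0 + a1*alpha + a2*alpha^2 + beta*(a3 + a4*alpha + a5*alpha^2)
    # to a polynomial in t (beta = t, alpha = t^2) and multiply mod t^6 - 2.
    ta = [a[0], a[3], a[1], a[4], a[2], a[5]]
    tb = [b[0], b[3], b[1], b[4], b[2], b[5]]
    prod = [0] * 11
    for i in range(6):
        for j in range(6):
            prod[i + j] += ta[i] * tb[j]
    for k in range(10, 5, -1):
        prod[k - 6] += 2 * prod[k]      # t^6 = 2
    tc = [x % p for x in prod[:6]]
    return (tc[0], tc[2], tc[4], tc[1], tc[3], tc[5])

def mul_18M(a, b):
    a0, a1, a2, a3, a4, a5 = a
    b0, b1, b2, b3, b4, b5 = b
    c0 = a0*b0 + 2*(a4*b4 + (a1+a2)*(b1+b2) - a1*b1 - a2*b2
                    + (a3+a5)*(b3+b5) - a3*b3 - a5*b5)
    c1 = (a3*b3 + (a0+a1)*(b0+b1) - a0*b0 - a1*b1
          + 2*(a2*b2 + (a4+a5)*(b4+b5) - a4*b4 - a5*b5))
    c2 = (a1*b1 + 2*a5*b5 + (a0+a2)*(b0+b2) - a0*b0 - a2*b2
          + (a3+a4)*(b3+b4) - a3*b3 - a4*b4)
    c3 = ((a0+a3)*(b0+b3) - a0*b0 - a3*b3
          + 2*((a1+a2+a4+a5)*(b1+b2+b4+b5) - (a1+a4)*(b1+b4)
               - (a2+a5)*(b2+b5) - (a1+a2)*(b1+b2) + a1*b1 + a2*b2
               - (a4+a5)*(b4+b5) + a4*b4 + a5*b5))
    c4 = ((a0+a1+a3+a4)*(b0+b1+b3+b4) - (a0+a3)*(b0+b3)
          - (a1+a4)*(b1+b4) - (a0+a1)*(b0+b1) + a0*b0 + a1*b1
          - (a3+a4)*(b3+b4) + a3*b3 + a4*b4
          + 2*((a2+a5)*(b2+b5) - a2*b2 - a5*b5))
    c5 = ((a1+a4)*(b1+b4) - a1*b1 - a4*b4
          + (a0+a2+a3+a5)*(b0+b2+b3+b5) - (a0+a3)*(b0+b3)
          - (a2+a5)*(b2+b5) - (a0+a2)*(b0+b2) + a0*b0 + a2*b2
          - (a3+a5)*(b3+b5) + a3*b3 + a5*b5)
    return tuple(c % p for c in (c0, c1, c2, c3, c4, c5))

a = (3, 141, 59, 26, 535, 89)
b = (79, 323, 84, 626, 433, 83)
assert mul_direct(a, b) == mul_18M(a, b)
```

Counting the distinct products in `mul_18M` (six $a_ib_i$, six pairwise-sum products and six four-term-sum products) confirms the 18M figure.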
For sextic extensions, if M < 20A, Devegili et al. [84] have suggested constructing the extension as quadratic over cubic, with Karatsuba over Karatsuba for multiplication and complex over Karatsuba for squaring. For M ≥ 20A, cubic over quadratic, with Toom-Cook-3x over Karatsuba for multiplication and either complex, Chung-Hasan SQR3 or SQR3x over Karatsuba/complex for squaring, has been recommended.
The extension field $F_{p^{12}}$ is defined by the following tower of extensions [88]:

$$\begin{aligned}
F_{p^2} &= F_p[u]/(u^2 + 2) \\
F_{p^6} &= F_{p^2}[v]/(v^3 - \xi) \quad \text{where } \xi = u - 1 \\
F_{p^{12}} &= F_{p^6}[w]/(w^2 - v)
\end{aligned} \qquad (10.48)$$

Note that the representation $F_{p^{12}} = F_{p^2}[W]/(W^6 - \xi)$ where W = w is also possible. The tower has the advantage of efficient multiplication for the canonical polynomial basis. Hence, an element $\alpha \in F_{p^{12}}$ can be represented in any of the following three ways:

$$\begin{aligned}
\alpha &= a_0 + a_1w \quad \text{where } a_0, a_1 \in F_{p^6} \\
\alpha &= (a_{0,0} + a_{0,1}\nu + a_{0,2}\nu^2) + (a_{1,0} + a_{1,1}\nu + a_{1,2}\nu^2)w \quad \text{where } a_{i,j} \in F_{p^2} \\
\alpha &= a_{0,0} + a_{1,0}W + a_{0,1}W^2 + a_{1,1}W^3 + a_{0,2}W^4 + a_{1,2}W^5
\end{aligned} \qquad (10.49)$$

Hankerson et al. [88] have recommended the use of Karatsuba for multiplication and the complex method for squaring for $F_{p^{12}}$ extensions. A quadratic on top of a cubic on top of a quadratic tower of extensions is used. A multiplication using Karatsuba's method needs 54 multiplications and 12 modular reductions, whereas squaring using the complex method for squaring in $F_{p^{12}}$ and Karatsuba for multiplication in $F_{p^6}$ and $F_{p^2}$ needs 36 multiplications and 12 modular reductions [69].
A complete multiplication in $F_{p^k}$ requires $k^\lambda$ multiplications in $F_p$ with $1 < \lambda \le 2$, and note that lazy reduction can be used in $F_p$. A multiplication in $F_{p^k}$ then requires k reductions since the result has k coefficients. A multiplication in $F_p$ needs $n^2$ word multiplications and a reduction requires $(n^2 + n)$ word multiplications. If $p \equiv 3 \bmod 4$, multiplications by $\beta = -1$ can be computed as simple subtractions in $F_{p^2}$. A multiplication in $F_{p^k}$ thus needs $(k^\lambda + k)n^2 + kn$ word multiplications in the radix representation and about $1.1\left(\frac{7k^{\lambda}}{5}n^2 + \frac{10k^{\lambda} + 8k}{5}n\right)$ word multiplications if RNS is used [69].
The school book type of multiplication is preferred for $F_{p^2}$ since, with the Karatsuba method, the dynamic range is increased from $2p^2$ to $6p^2$ [89].
Yao et al. [89] have observed that for $F_{p^{12}}$ multiplication, the school book method also provides an elegant solution. Using lazy reduction, only 12 reductions are needed. The evaluation of $f \cdot g$ can be carried out as follows:

$$f \cdot g = \sum_{j+k<6} f_j g_k W^{j+k} + \sum_{j+k\ge 6} f_j g_k \zeta W^{j+k-6} \qquad (10.50)$$

where $f \cdot g \in F_{p^2}[W]/(W^6 - \zeta)$ and $f_j, g_k \in F_{p^2}$, $0 \le j, k \le 5$, are the coefficients of f and g, respectively. The coefficients of the intermediate results $f_jg_k$ are less than $2p^2$ and the coefficients of $f_jg_k\zeta$ are less than $4p^2$. Writing $f_j = f_0 + f_1i$, $g_k = g_0 + g_1i$, we have

$$f_jg_k\zeta = (f_0g_0 - f_1g_1 - f_0g_1 - f_1g_0) + (f_0g_0 - f_1g_1 + f_0g_1 + f_1g_0)i \qquad (10.51)$$

Since four products are needed to compute both components, two accumulators can easily handle this requirement.
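Equation (10.51) corresponds to ζ = 1 + i with i² = −1; this identity can be checked numerically (an illustrative sketch with an assumed toy prime):

```python
# (f0 + f1 i)(g0 + g1 i)(1 + i) in F_p[i]/(i^2 + 1); assumed toy prime.
p = 1009

def direct(f0, f1, g0, g1):
    a, b = f0*g0 - f1*g1, f0*g1 + f1*g0      # f_j * g_k
    return ((a - b) % p, (a + b) % p)        # multiply by zeta = 1 + i

def via_products(f0, f1, g0, g1):
    # the form (10.51): only the four cross products f0g0, f1g1, f0g1, f1g0
    return ((f0*g0 - f1*g1 - f0*g1 - f1*g0) % p,
            (f0*g0 - f1*g1 + f0*g1 + f1*g0) % p)

assert direct(3, 77, 200, 451) == via_products(3, 77, 200, 451)
```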

B. Inversion
Three other operations are needed in pairing computation: inversion, Frobenius computation and squaring of unit elements. For the quadratic case [90], assuming an irreducible polynomial $x^2 + n$, we have the formula for inversion as

$$\frac{1}{a + bx} = \frac{a - bx}{a^2 + nb^2} \qquad (10.52a)$$

This needs 1 inversion, 2 squarings, 2 multiplications and 3 reductions in $F_p$. The inversion of d in $F_p$ is performed as $d^{-1} = d^{p-2} \bmod p$. For the cubic case, assuming an irreducible polynomial $x^3 + n$, we have [90]

$$\frac{1}{a + bx + cx^2} = \frac{A + Bx + Cx^2}{F} \qquad (10.52b)$$

where $A = a^2 + nbc$, $B = -nc^2 - ab$, $C = b^2 - ac$, $F = aA - nbC - ncB$. This needs one inversion, 9 multiplications, 3 squarings and 7 reductions in $F_p$ [69].
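Formula (10.52b) can be checked numerically. The sketch below assumes the toy parameters p = 103 and n = 5 (chosen so that $x^3 + 5$ is irreducible mod 103) and verifies the inverse by multiplication:

```python
# Cubic inversion in F_p[x]/(x^3 + n), per (10.52b); assumed toy parameters.
p, n = 103, 5

def inv_cubic(a, b, c):
    A = (a*a + n*b*c) % p
    B = (-(n*c*c) - a*b) % p
    C = (b*b - a*c) % p
    F = (a*A - n*b*C - n*c*B) % p
    Finv = pow(F, p - 2, p)                  # inversion in F_p by Fermat
    return (A*Finv % p, B*Finv % p, C*Finv % p)

def mul_cubic(u, v):
    u0, u1, u2 = u; v0, v1, v2 = v
    # reduce using x^3 = -n
    return ((u0*v0 - n*(u1*v2 + u2*v1)) % p,
            (u0*v1 + u1*v0 - n*u2*v2) % p,
            (u0*v2 + u1*v1 + u2*v0) % p)

e = (8, 21, 34)
assert mul_cubic(e, inv_cubic(*e)) == (1, 0, 0)
```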
Inversion in $F_{p^6}$ built as a quadratic extension of a cubic requires one inversion, 9 multiplications, 3 squarings and 7 reductions in $F_{p^2}$ [88]. Alternatively, we can see that 1 inversion, 36 multiplications and 16 reductions in $F_p$ are required. Inversion in $F_{p^{12}}$ requires 1 inversion, 97 multiplications and 35 reductions in $F_p$, or 1 inversion, 2 multiplications and 2 squarings in $F_{p^6}$ [69, 88].
p
C. Frobenius computation
The Frobenius action, i.e. raising an extension field element to the power of the modulus, is always very cheap [90]. Consider $(a + ib)^p = a^p + b^pi^p = (a - ib) \bmod p$, since all the terms in the expansion of $(a + ib)^p$ that are multiples of p vanish and $i^{p-1} = (i^2)^{(p-1)/2} = -1$, as $i^2$ is a quadratic non-residue.
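A quick numerical confirmation that the Frobenius map on $F_{p^2} = F_p[i]/(i^2 + 1)$ is conjugation (an illustrative sketch; the toy prime 103 ≡ 3 mod 4 is assumed, so that $i^2 = -1$ is a non-residue):

```python
# Frobenius on F_{p^2} = F_p[i]/(i^2 + 1); assumed toy prime p = 3 mod 4.
p = 103

def fp2_mul(x, y):
    a, b = x; c, d = y
    return ((a*c - b*d) % p, (a*d + b*c) % p)

def fp2_pow(x, e):
    # square-and-multiply exponentiation
    r = (1, 0)
    while e:
        if e & 1:
            r = fp2_mul(r, x)
        x = fp2_mul(x, x)
        e >>= 1
    return r

x = (17, 45)
assert fp2_pow(x, p) == (17, (-45) % p)   # (a + ib)^p = a - ib
```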
 
Raising an element $f \in F_{p^{12}} = F_{p^6}(w)/(w^2 - v)$ to the pth power can be carried out efficiently [91]. This is needed in the final exponentiation in Ate pairing. The field extension $F_{p^{12}}$ can also be represented as a sextic extension of a quadratic field, $F_{p^{12}} = F_{p^2}(W)/(W^6 - u)$ with W = w. Hence, we can write $f = g + hw \in F_{p^{12}}$ with $g, h \in F_{p^6}$ such that $g = g_0 + g_1v + g_2v^2$ and $h = h_0 + h_1v + h_2v^2$, where $g_i, h_i \in F_{p^2}$ for i = 0, 1, 2. Equivalently, we have

$$f = g + hw = g_0 + h_0W + g_1W^2 + h_1W^3 + g_2W^4 + h_2W^5 \qquad (10.53a)$$

Recall that the pth power of an element in $F_{p^2}$ can be calculated free of cost. Denote by $\bar{g}_i, \bar{h}_i$ the conjugates of $g_i$ and $h_i$. Using the identity $W^p = u^{(p-1)/6}W$, we can write $(W^i)^p = \gamma_{1,i}W^i$ with $\gamma_{1,i} = u^{i(p-1)/6}$, $\gamma_{2,i} = \gamma_{1,i}\bar{\gamma}_{1,i}$ and $\gamma_{3,i} = \gamma_{1,i}\gamma_{2,i}$, which need to be pre-computed and stored for i = 1, . . ., 5. Hence, we compute $f^p$ as

$$\begin{aligned}
f^p &= \left(g_0 + h_0W + g_1W^2 + h_1W^3 + g_2W^4 + h_2W^5\right)^p \\
&= \bar{g}_0 + \bar{h}_0W^p + \bar{g}_1W^{2p} + \bar{h}_1W^{3p} + \bar{g}_2W^{4p} + \bar{h}_2W^{5p} \\
&= \bar{g}_0 + \bar{h}_0\gamma_{1,1}W + \bar{g}_1\gamma_{1,2}W^2 + \bar{h}_1\gamma_{1,3}W^3 + \bar{g}_2\gamma_{1,4}W^4 + \bar{h}_2\gamma_{1,5}W^5
\end{aligned} \qquad (10.53b)$$

Thus, five multiplications in $F_p$ and five conjugations in $F_{p^2}$ are needed. A similar procedure can be used for computing $f^{p^2}$ and $f^{p^3}$ as well.

Pairing Algorithms

The final exponentiation in the pairing algorithms, needed in (10.29b), (10.30b) and (10.31b), is considered next [92]. For BN curves, we have

$$\frac{p^{12} - 1}{\ell} = \left(p^6 - 1\right)\left(p^2 + 1\right)\frac{p^4 - p^2 + 1}{\ell} \qquad (10.54)$$

where ℓ is the order. Thus, three steps are required. The exponentiation by the first term can be performed by a conjugation (since raising to the power $p^6$ is a conjugation) followed by an inversion (to take care of the −1). The exponentiation corresponding to $p^2 + 1$ needs a Frobenius computation ($f^{p^2}$) followed by a multiplication with f. These two are known as the "easy part" of the final exponentiation step, whereas the operation corresponding to the third term is known as the "hard part". Devegili et al. [93] have suggested that in the case of BN curves, the expression $\frac{p^4 - p^2 + 1}{\ell}$ can be written in terms of the parameters p and x of the BN curve as

$$p^3 + \left(6x^2 + 1\right)p^2 + \left(36x^3 - 18x^2 + 12x + 1\right)p + \left(36x^3 - 30x^2 + 18x - 2\right)$$

Using the method of multi-exponentiation combined with Frobenius, this needs the computation of

$$f^{p^3} \cdot \left(f^{p^2}\right)^{6x^2+1} \cdot \left(f^p\right)^{36x^3-18x^2+12x+1} \cdot f^{36x^3-30x^2+18x-2}$$

This can be computed as $a \leftarrow f^{6x-5}$, $b \leftarrow a^p$, $b \leftarrow ab$, followed by

$$f^{p^3} \cdot \left[b \cdot \left(f^p\right)^2 \cdot f^{p^2}\right]^{6x^2+1} \cdot b \cdot \left(f^p \cdot f\right)^9 \cdot a \cdot f^4$$

Note that $a^p$, $f^p$, $f^{p^2}$ and $f^{p^3}$ are computed using Frobenius. Later, Scott et al. [92] gave a better, systematic method of computing the hard part, taking into account the parameters of the BN curves, as

$$m^p \cdot m^{p^2} \cdot m^{p^3} \cdot \left[\frac{1}{m}\right]^2 \cdot \left[\left(m^{x^2}\right)^{p^2}\right]^6 \cdot \left[\frac{1}{\left(m^x\right)^p}\right]^{12} \cdot \left[\frac{1}{m^x \cdot \left(m^{x^2}\right)^p}\right]^{18} \cdot \left[\frac{1}{m^{x^2}}\right]^{30} \cdot \left[\frac{1}{m^{x^3} \cdot \left(m^{x^3}\right)^p}\right]^{36}$$

The terms in the brackets are next computed using four multiplications (inversion is just a conjugation), leading to a calculation of the form

$$y_0 \cdot y_1^2 \cdot y_2^6 \cdot y_3^{12} \cdot y_4^{18} \cdot y_5^{30} \cdot y_6^{36}$$

The authors next point out that, using the Olivos algorithm [94], this computation can be done using just two registers as follows:

$$\begin{aligned}
&T_0 \leftarrow (y_6)^2,\; T_0 \leftarrow T_0y_4,\; T_0 \leftarrow T_0y_5,\; T_1 \leftarrow y_3y_5,\; T_1 \leftarrow T_1T_0,\; T_0 \leftarrow T_0y_2,\; T_1 \leftarrow T_1^2,\\
&T_1 \leftarrow T_1T_0,\; T_1 \leftarrow T_1^2,\; T_0 \leftarrow T_1y_1,\; T_1 \leftarrow T_1y_0,\; T_0 \leftarrow T_0^2,\; T_0 \leftarrow T_0T_1
\end{aligned}$$

This can be carried out using 9 multiplications and 4 squarings.
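The two-register chain can be checked against the multi-exponentiation it is meant to compute; ordinary modular integers stand in for $F_{p^{12}}$ elements in this illustrative sketch:

```python
# Olivos-style addition chain for y0*y1^2*y2^6*y3^12*y4^18*y5^30*y6^36;
# toy modulus assumed, elements are plain integers mod p.
p = 10007

def chain(y):
    y0, y1, y2, y3, y4, y5, y6 = y
    T0 = (y6 * y6) % p          # squaring
    T0 = (T0 * y4) % p
    T0 = (T0 * y5) % p
    T1 = (y3 * y5) % p
    T1 = (T1 * T0) % p
    T0 = (T0 * y2) % p
    T1 = (T1 * T1) % p          # squaring
    T1 = (T1 * T0) % p
    T1 = (T1 * T1) % p          # squaring
    T0 = (T1 * y1) % p
    T1 = (T1 * y0) % p
    T0 = (T0 * T0) % p          # squaring
    return (T0 * T1) % p        # 9 multiplications, 4 squarings in total

y = [3, 5, 7, 11, 13, 17, 19]
expected = 1
for yi, e in zip(y, [1, 2, 6, 12, 18, 30, 36]):
    expected = (expected * pow(yi, e, p)) % p
assert chain(y) == expected
```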


For MNT curves, the final exponentiation [92] is realized as $\frac{p^6 - 1}{\ell} = \left(p^3 - 1\right)\left(p + 1\right)\frac{p^2 - p + 1}{\ell}$, where $p = 4l^2 + 1$ and $\ell = 4l^2 - 2l + 1$. The hard part of the final exponentiation is $\frac{p^2 - p + 1}{\ell}$. Since $p = x^2 + 1$ and $\ell = x^2 - x + 1$ (with x = 2l), the hard part is $\frac{x^4 + x^2 + 1}{x^2 - x + 1} = x^2 + x + 1 = p + x$. Hence, the hard part is $m^p \cdot m^x$, which is achieved by a Frobenius computation and an exponentiation to the power x, where m is the value to be exponentiated.

Pairing Processor Implementations Using RNS

Yao et al. [89] suggested a set of base moduli to reduce the complexity of modulus reduction in RNS, presented an efficient $F_p$ Montgomery modular multiplier, and described a high-speed pairing processor using this multiplier. They suggested choosing the moduli of the two RNS bases used in base extension close to each other, so that the $(b_k - c_j)$ values are small, where $b_k$ and $c_j$ are the moduli in the two bases. Thus, the bit lengths of the operands needed in base extension are small.

They have suggested the use of eight-moduli sets for a 256-bit dynamic range for 128-bit security as

$$B = \{2^w - 1, 2^w - 9, 2^w + 3, 2^w + 11, 2^w + 5, 2^w + 9, 2^w - 31, 2^w + 15\}$$
$$C = \{2^w, 2^w + 1, 2^w - 3, 2^w + 17, 2^w - 13, 2^w - 21, 2^w - 25, 2^w - 33\}$$

where w = 32, so that the bit lengths of $\prod_{k=1, k\ne j}^{s}\left(b_k - c_j\right)$ are as small as possible (<25 bits).
Yao et al. [95] have suggested that the maximal bit length of the $b_i - c_j$ values shall be minimized to v bits, so that the multiplications become v × w words rather than w × w words. They have also suggested a systematic procedure for RNS parameter selection that results in a lower complexity. They observe that for a 16-bit machine, n = 4, where n is the number of moduli, is optimum. They suggest two techniques for choosing moduli: (a) multiple plus prime and (b) first-come-first-selected improved. In the first method, we start with the set of the first 2n primes. The product $\prod_{i=1}^{2n} p_i$ of all these primes is denoted Θ, and M is a multiple of Θ. Then the numbers $M + p_i$ are all pairwise coprime, hence the name MPP (multiple plus prime). The second method selects only pseudo-Mersenne primes. As an illustration, two RNS bases with moduli $\{2^{64} - 33, 2^{64} - 15, 2^{64} - 7, 2^{64} - 3\}$ and $\{2^{64} - 17, 2^{64} - 11, 2^{64} - 9, 2^{64} - 5\}$ can be obtained, yielding the various weights $B_{ij}$ representable by at most 14 bits down to 5 bits. Thus, v reduces to 14 for one RNS base and 8 for the other, and 64 × 64 multiplications can be replaced with 64 × 14 and 64 × 8 multiplications. The first method is attractive when n is very small; note that the multiplications are not performed as additions.
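The MPP construction can be illustrated directly (a sketch with assumed small parameters, not the moduli used in [95]):

```python
# Multiple-plus-prime (MPP) moduli: if M is a multiple of the product of the
# first 2n primes p_i, the numbers M + p_i are pairwise coprime.
from math import gcd

primes = [2, 3, 5, 7, 11, 13, 17, 19]      # first 2n primes, n = 4
theta = 1
for q in primes:
    theta *= q
M = 4 * theta                               # any multiple of theta works
moduli = [M + q for q in primes]
assert all(gcd(x, y) == 1
           for i, x in enumerate(moduli) for y in moduli[i + 1:])
```

The coprimality follows because any common prime divisor of $M + p_i$ and $M + p_j$ would have to divide $p_i - p_j$ and hence be one of the small primes dividing M, forcing it to divide both $p_i$ and $p_j$, which is impossible for distinct primes.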
Montgomery reduction in RNS has higher complexity than ordinary Montgomery reduction. However, this overhead of slow reduction can be partially removed by reducing the number of reductions, a technique known as lazy reduction. In computing ab + cd with a, b, c, d in $F_p$, where p is an n-word prime, we need $4n^2 + 2n$ word operations, since each modular multiplication needs $2n^2 + n$ word products using the digit-serial Montgomery algorithm [11]. In lazy reduction, ab + cd is computed first and then reduced, needing only $3n^2 + n$ word products [69]. Lazy reduction performs one reduction per group of multiplications. This is possible for expressions like AB ± CD ± EF in $F_p$. In RNS [89], such an expression takes $2s^2 + 11s$ word multiplications, while it takes $4s^2 + s$ using digit-serial Montgomery modular multiplication [11]. The actual operating range should be large for lazy reduction to be economical (e.g. a range of $22p^2$ is needed for the computation of (10.51)). Around 10,000 modular multiplications are needed for a pairing processor.
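The idea can be illustrated outside RNS with ordinary integers (a minimal sketch with an assumed toy modulus):

```python
# Lazy reduction: accumulate ab + cd unreduced and reduce once, instead of
# reducing each product separately; toy Mersenne prime assumed.
import random

p = (1 << 61) - 1

def eager(a, b, c, d):
    return ((a * b) % p + (c * d) % p) % p   # a reduction per product

def lazy(a, b, c, d):
    return (a * b + c * d) % p               # one reduction over a wider range

random.seed(1)
for _ in range(100):
    a, b, c, d = (random.randrange(p) for _ in range(4))
    assert eager(a, b, c, d) == lazy(a, b, c, d)
```

The price of the single reduction is a wider accumulator: the unreduced value here can reach $2p^2$, which mirrors the enlarged dynamic range (e.g. $22p^2$) required of the RNS base.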
Yao et al. [89] have described a high-speed pairing co-processor using RNS and lazy reduction. They have used homogeneous coordinates [96] for realizing Ate and optimal Ate pairing. The algorithm for optimal Ate pairing is presented in Figure 10.21. The formulas used for point doubling, point addition and line evaluation, together with the operations needed in the Miller loop, are presented in Table 10.4. Note that S, M and R stand for squaring, multiplication and reduction in $F_{p^2}$, and m and r stand for multiplication and reduction in $F_p$. The cost of squaring in $F_p$ is also denoted by m. Note that M = 4m, S = 3m and R = 2r. The school book method has been employed to avoid additional additions. The final addition is carried out in steps 10 and 11, whereas the final exponentiation is carried out in step 12. The operation count is presented in Table 10.5. This includes the Frobenius endomorphism of Q1 and Q2. Note that the computation T − Q2 is skipped because this point is not needed further.

Figure 10.21 Algorithm for optimal Ate pairing (adapted from [89] ©2011)

The algorithm for the computation of $f^{p^6-1}$ in $F_{p^{12}}$ [89] is shown in Figure 10.22. The hard part is computed following Devegili et al. [93] as shown in Table 10.5. The inversion needed in Figure 10.22 is carried out as $d^{-1} = d^{p-2} \bmod p$.
The architecture of the pairing coprocessor due to Yao et al. [89] is presented in Figure 10.23. It uses eight PEs in the Cox-Rower architecture (see Figure 10.23c). Each PE caters for one channel of B and one channel of C. Four DSP blocks can realize a 35 × 35 multiplier, whereas two DSPs can realize a 35 × 25 multiplier. A dual-mode multiplier (see Figure 10.23a), which can perform two element multiplications at a time, is used. It uses two accumulators to perform the four multiplications in parallel and accumulate the results in parallel. Each PE performs two element multiplications at a time using the dual multiplier. The Cox-Rower algorithm is modified accordingly and needs fewer loop iterations. The cox unit is an accumulator that provides result-correction information to all PEs in the base extension operation. The register receives ξ values from every PE and delivers two ξ values to all PEs at a time. The internal design of the PE is shown in Figure 10.23b, comprising the dual-mode multiplier, an adder, 2 RAMs for multiplier inputs, 2 RAMs for adder inputs, 3 accumulators and one channel reduction module. Two accumulators are used for polynomial multiplication and one is used for the matrix multiplication of the base extension step. The authors have implemented the design using a Virtex-6 XC6VLX240T-2 FPGA. The performance figures are as follows: Ate and optimal Ate pairing use 7032 slices and 32 DSP48E1s at a frequency of 250 MHz, taking 229,109 cycles and a delay of 0.916 ms for Ate pairing, and 166,027 cycles and a delay of 0.664 ms for optimal Ate pairing.

Table 10.4 Pipeline design and operation count of Miller loop (adapted from [89] ©2011)

Condition $r_i = 0$:
  Step 1 (3S + 2M + 5R): $A = y_1^2$, $B = 3b'z_1^2$, $C = 2x_1y_1$, $D = 3x_1^2$, $E = 2y_1z_1$
  Step 2 (2S + 3M + 4m + 5R): $x_3 = (A - 3B)C$, $y_3 = A^2 + 6AB - 3B^2$, $z_3 = 4AE$, $l_0 = (B - A)\zeta$, $l_3 = y_PE$, $l_4 = x_PD$
  Step 3 (6S + 15M + 6R): $f = f^2$
  Step 4 (18M + 6R): $f = f \cdot l$
Condition $r_i = \pm 1$:
  Step 1 (5S + 4M + 6R): $A = y_1^2$, $B = 3b'z_1^2$, $C = 2x_1y_1$, $D = 3x_1^2$, $E = 2y_1z_1$, $f_0 = (f^2)_0$
  Step 2 (2S + 6M + 4m + 6R): $x_3 = (A - 3B)C$, $y_3 = A^2 + 6AB - 3B^2$, $z_3 = 4AE$, $l_0 = (B - A)\zeta$, $l_3 = y_PE$, $l_4 = x_PD$, $f_1 = (f^2)_1$
  Step 3 (4S + 12M + 4m + 6R): $A = y_3 - y_Qz_3$, $B = x_3 - x_Qz_3$, $f_{2,3,4,5} = (f^2)_{2,3,4,5}$
  Step 4 (18M + 6R): $f = f \cdot l$
  Step 5 (2S + 2M + 4m + 5R): $C = A^2$, $D = B^2$, $l_0 = (x_QA - y_QB)\zeta$, $l_3 = y_PB$, $l_4 = x_PA$
  Step 6 (12M + 6R): $C = z_3C$, $E = BD$, $D = x_3D$, $f_{0,1,2} = (f \cdot l)_{0,1,2}$
  Step 7 (13M + 6R): $x_3 = B(E + C - 2D)$, $y_3 = A(3D - E - C) - y_3E$, $z_3 = z_3E$, $f_{3,4,5} = (f \cdot l)_{3,4,5}$

Table 10.5 Operation count of final steps (adapted from [89] ©2011); $\tilde{M}$, $\tilde{S}$, $\tilde{R}$ denote the corresponding operations in $F_{p^{12}}$:
  FA1: $Q_1 \leftarrow \pi_p(Q)$: M + 2m + R; $T \leftarrow T + Q_1$ and $l_{T,Q_1}(P)$: 2S + 11M + 8m + 13R; $f \leftarrow f \cdot l_{T,Q_1}(P)$: 18M + 6R
  FA2: $Q_2 \leftarrow \pi_p(Q_1)$: M + 2m + R; $l_{T,Q_2}(P)$: 4M + 8m + 5R; $f \leftarrow f \cdot l_{T,Q_2}(P)$: 18M + 6R
  $f^{p^6-1}$: before $d^{-1}$: 9S + 12M + 2m + 7R + r; $d^{-1} = d^{p-2}$: 294m + 294r; after $d^{-1}$: 6S + 36M + 16R
  $f^{p^2+1}$: $f^{p^2+1}$: $\tilde{M} + \tilde{R}$ + 8m + 4R
  $f^{(p^4-p^2+1)/n}$: $f^{6|u|-5}$: $64\tilde{S} + \tilde{M} + 68\tilde{R}$; a: $\tilde{S} + 4\tilde{M}$; $b \leftarrow a^{p+1}$: $\tilde{M} + \tilde{R}$ + 3M + 4m + 5R; $f^p, f^{p^2}, f^{p^3}$: 6M + 12m + 12R; $T \leftarrow b \cdot (f^p)^2 \cdot f^{p^2}$: $2\tilde{M} + \tilde{S} + 3\tilde{R}$; $T \leftarrow T^{6u^2+1}$: $126\tilde{S} + 12\tilde{M} + 138\tilde{R}$; $f \leftarrow f^{p^3} \cdot T \cdot b \cdot (f^p \cdot f)^9 \cdot a \cdot f^4$: $7\tilde{M} + 5\tilde{S} + 12\tilde{R}$

Figure 10.22 Algorithm for computation of $f^{p^6-1}$ in $F_{p^{12}}$ (adapted from [89] ©2011)

Figure 10.23 Pairing coprocessor hardware architecture (adapted from [89] ©2011)

Table 10.6 Number of operations and cycles per computation in Miller loop (adapted from [89] ©2011)

Miller's loop:
                     2T and l_{T,T}(P)   T + Q and l_{T,Q}(P)   f^2    f·l    Ate       Optimal
  #Multiplications   39                  54                     78     72     -         -
  #Reductions        20                  26                     12     12     -         -
  #Cycles            340                 456                    313    301    128,531   64,084

The authors have given in detail the number of operations and cycles required for the Miller loop (see Table 10.6) and for the computation of the final steps in Table 10.7. One pairing computation can be completed in about 1 ms. The speed advantage can be seen from the fact that a 256 × 256-bit multiplication in RNS needs sixteen 32 × 32-bit multiplications, whereas Karatsuba needs 27 multiplications.
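The 16 versus 27 comparison is just a count of word products: in RNS a full multiplication is one 32 × 32-bit product per channel over the two 8-moduli bases, while recursive Karatsuba triples the product count at each of the three halving levels from 256 down to 32 bits. A minimal sketch of the two counts (channel and word sizes as assumed here):

```python
def rns_word_products(channels_per_base=8, bases=2):
    # one 32x32-bit product per RNS channel, over both bases
    return channels_per_base * bases

def karatsuba_word_products(bits=256, word_bits=32):
    # recursion T(n) = 3*T(n/2) with T(word_bits) = 1
    count = 1
    while bits > word_bits:
        count *= 3
        bits //= 2
    return count

assert rns_word_products() == 16
assert karatsuba_word_products() == 27
```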
Duquesne and Guillermin [67] described an FPGA implementation of optimal Ate pairing using RNS for the 128-bit security level in large characteristic. The authors have used BN curves with u = −(2^62 + 2^55 + 1) for 126-bit security, whereas for 128-bit security they have used u = −(2^63 + 2^22 + 2^8 + 2^7 + 1). The base extension using MRC described earlier [52] needs (7/5)n^2 + (8/5)n RNS digit multiplications and cannot be parallelized, as it uses MRC. On the other hand, the Kawamura et al. technique [34] has an overall complexity of 2n^2 + 3n, which has been enhanced

Table 10.7 Number of operations and cycles per computation in final steps (adapted from [89] ©2011)

Step | Operation | #Cycles | #Idle cycles | Occupation rate (%)
Final addition | FA1 | 799 | 7 | 99.1
    | FA2 | 566 | 50 | 91.2
f^(p^6−1) | d^(p^2) mod p | 25,537 | 21,127 | 17.3
    | Others | 1333 | 240 | 82.0
f^(p^2+1) | f^(p^2+1) | 573 | 9 | 98.4
f^((p^4−p^2+1)/n) | f^(6|u|−5) | 21,812 | 68 | 99.7
    | T^(6u^2+1) | 44,778 | 138 | 99.7
    | Others | 6905 | 47 | 99.3
Total | Ate | 100,578 | 21,629 | 78.5
    | Optimal | 101,943 | 21,686 | 78.7

Figure 10.24 Algorithm for optimal Ate pairing (adapted from [67] ©2012)

by using n parallel rowers to achieve one RNS digit multiplication and accumulation per cycle. Hence, a full-length multiplication can be done in two cycles (one over base B and one over base B′), an addition in 4 cycles, a subtraction in 6 cycles and a whole reduction in 2n + 3 cycles. They have used lazy reduction. As such, for F_{p^k} only k reductions are needed and thus (2kn + 3k) cycles overall are needed. The algorithm implemented is shown in Figure 10.24. The authors use projective coordinates [96, 97]. The doubling and addition steps together with the line evaluations are presented in detail in Figure 10.25a–c, where it may be noted that the classical formulae are rearranged to highlight reduction and the inherent parallelism in local

Figure 10.25 (a–c) Algorithms for doubling step, addition step and hard part of the final exponentiation (adapted from [67] ©2012)

variables. The F_{p^12} inversion is based on the Hankerson et al. [88] formulae. The final exponentiation used the Scott et al. multi-addition chain [92] described before. The squaring in F_{p^12} uses the technique proposed in [98]. Specifically, for a = Σ_{i=0}^{5} a_i γ^i with a_i ∈ F_{p^2}, the coefficients of A = a^2 are given by

A0 = 3a0^2 + 3(1 + i)a3^2 − 2a0    A1 = 6(1 + i)a2a5 + 2a1
A2 = 3a1^2 + 3(1 + i)a4^2 − 2a2    A3 = 6a0a3 + 2a3        (10.55)
A4 = 3a2^2 + 3(1 + i)a5^2 − 2a4    A5 = 6a1a4 + 2a5

The authors give details of the operation count for the BN126 and BN128 curves, which saves cycles at each step using RNS as compared to the earlier design of Guillermin [99]. The saving is due to exploitation of the inherent parallelism in the algorithms dbl (see Figure 10.25a) and add, as well as in operations in F_{p^12}. The pipeline depth could be up to 8 and still avoid idle states. In F_{p^2}, the authors perform multiplication and subtraction using additional hardware in parallel to avoid cascaded operations, thus saving 4 cycles at the expense of increased hardware. A similar technique has been used for F_{p^12} as well. The authors have implemented the design on Altera Cyclone II, Stratix II and Stratix III devices. The results are as follows:

BN126 Cyclone II EP2C35: 91 MHz frequency, size 14,274 LCs, time 1.94 ms
BN126 Stratix II EP2S30: 154 MHz frequency, size 4227 ALMs, time 1.14 ms
BN126 Stratix III EP3S50: 165 MHz frequency, size 4233 ALMs, time 1.07 ms
Duquesne [68] has considered the application of RNS for fast pairing computation using BN curves and MNT curves, considering Tate, Ate and R-Ate pairings and employing lazy reduction and efficient arithmetic in extension fields. Duquesne has presented a very detailed description of every computation needed for all the three pairings.
We consider Tate pairing first. Jacobian coordinates have been considered for P and affine coordinates for Q in this approach. The algorithm for Tate pairing for MNT curves is presented in Figure 10.26, where P = (xP, yP) ∈ E(Fp)[ℓ] and Q = (xQ, yQ β) with xQ, yQ ∈ F_{p^3}. It is assumed that F_{p^6} is built as a quadratic extension of F_{p^3}: F_{p^6} = F_{p^3}[Y]/(Y^2 − υ) = F_{p^3}[β]. The author considers multiplication and squaring to be of the same complexity. Lines 1–4 of the algorithm in Figure 10.26 perform doubling of the point T ∈ E(Fp) in Jacobian coordinates and need 10 multiplications in Fp and 8 modular reductions, considering lazy reduction. Lines 7–10 use mixed addition of T and P and require 11 multiplications and 10 modular reductions in Fp. Since xQ and yQ are in F_{p^3}, line 5 requires 9 multiplications in Fp and 8 modular reductions. Line 11 needs 7 multiplications and 6 reductions in Fp. In line 6, a multiplication and a squaring in F_{p^6} are needed, which need 30 multiplications and 12 modular reductions in Fp. For line 12, we need 18 multiplications and 6 reductions. Line 13 does exponentiation by p^3, free by conjugation, whereas for the inversion in F_{p^6} 36 multiplications, 16 reductions and one inversion are required. This step totally needs 54 multiplications, 22 modular reductions and one inversion in Fp. In line 14, a multiplication in F_{p^6} and a Frobenius computation are needed. The

Figure 10.26 Algorithm for Tate pairing for MNT curves (adapted from [68])

latter needs 5 modular multiplications in Fp and overall, the second step of the final exponentiation needs 23 multiplications and 11 reductions in Fp.
The hard part involves one Frobenius (five modular multiplications), one multiplication in F_{p^6} and one exponentiation by 2ℓ. Each step of the exponentiation requires 12 multiplications and 6 reductions or 18 multiplications and 6 reductions, depending on whether a multiplication is not needed or needed. For 96-bit security, ℓ has a bit length of 192, implying that lines 1–6 are done 191 times and lines 7–12 around 96 times. In total, the Miller loop needs 191 × (10 + 9 + 30) + 96 × (11 + 7 + 18) = 12,815 multiplications and 191 × (8 + 8 + 12) + 96 × (10 + 6 + 6) = 7460 reductions. The easy part of the final exponentiation needs 1 inversion, 77 multiplications and 33 reductions in Fp. Considering that 2ℓ is 96 bits long, the hard part can perform exponentiation using a sliding window of 3 for computing f^{2ℓ}. This needs 96 squarings in F_{p^6}, 24 multiplications in F_{p^6} and three

pre-computations. Thus, the hard part needs 5 + 18 + 97 × 12 + 27 × 18 = 1673 multiplications and 5 + 6 + 97 × 6 + 27 × 6 = 755 reductions. Thus, the full Tate pairing needs 14,565 multiplications and 8248 reductions. A radix implementation on the other hand needs 14,565 × 6^2 + 8248 × (6^2 + 6) = 870,756 word multiplications, whereas RNS needs 1.1 × (14,565 × 2 × 6 + 8248 × ((7/5) × 6^2 + (8/5) × 6)) = 736,626 word multiplications, indicating a gain of 15.4 %.
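The word-multiplication totals quoted here follow a simple cost model with n machine words: n^2 word products per multiplication and n^2 + n per reduction in radix form, versus 2n per multiplication and (7/5)n^2 + (8/5)n per reduction in RNS, with a 10 % overhead factor. A sketch of this bookkeeping for the 192-bit MNT case (n = 6):

```python
def radix_word_products(mults, reds, n):
    # schoolbook: n^2 per multiplication, n^2 + n per reduction
    return mults * n**2 + reds * (n**2 + n)

def rns_word_products(mults, reds, n):
    # 2n per multiplication, (7/5)n^2 + (8/5)n per reduction,
    # with the 10 % overhead factor used in the text
    return round(1.1 * (mults * 2 * n + reds * (7 * n**2 + 8 * n) / 5))

mults = 12_815 + 77 + 1_673   # Miller loop + easy part + hard part
reds = 7_460 + 33 + 755
assert (mults, reds) == (14_565, 8_248)
assert radix_word_products(mults, reds, 6) == 870_756
assert rns_word_products(mults, reds, 6) == 736_626
```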
In the case of Ate pairing, lines 1–4 are done in F_{p^3}, requiring 10 multiplications and 8 reductions in F_{p^3}, i.e. 60 multiplications and 24 reductions in Fp. Similarly, lines 7–10 require 11 multiplications and 10 reductions in F_{p^3}, or 66 multiplications and 30 reductions in Fp. If the coordinates of T are (XT, YT β, ZT), lines 5 and 11 must be replaced by

5′: g = Z_{2T} Z_T^2 yP β + A(XT − Z_T^2 xP) − 2νY_T^2
11′: g = Z_{T+P} yQ β + Z_{T+P} yP − F(xP − xQ)        (10.56)

where Z_{2T}, Z_T^2, A = 3X_T^2 + a4 Z_T^4, Y_T^2, Z_{T+P} and F = YT − yQ Z_T^3 are computed in F_{p^3} in the previous steps. The first requires 18 multiplications and 12 reductions in Fp, whereas the second requires 15 multiplications and 6 reductions.
Finally, since t − 1 has a bit length of 96 and a Hamming weight of about 48 bits, the Miller loop requires 95 × (60 + 18 + 30) + 47 × (66 + 15 + 18) = 14,913 multiplications and 95 × (24 + 12 + 12) + 47 × (30 + 6 + 6) = 6534 reductions. The final exponentiation is the same as in Tate pairing and thus the full Ate pairing needs 16,663 multiplications and 7322 reductions. In radix representation, this means that 907,392 word multiplications are needed, whereas in RNS, only 703,204 word multiplications are needed. The gain is thus 22.65 %.
In the case of BN curves, the flow chart is presented in Figure 10.27. Due to the twist of order 6 for BN curves, some improvements can be made. The author considers that F_{p^12} is built as a quadratic extension of a cubic extension of F_{p^2}, which is compatible with the use of the twist of order 6. Due to the twist defined by v, the second input of the Tate pairing can be written as Q = (xQ γ^2, yQ γ^3) with xQ, yQ ∈ F_{p^2}. As seen earlier in the case of MNT curves, lines 1–4 of the algorithm need 7 multiplications in Fp and 6 modular reductions, whereas lines 7–10 require 11 multiplications and 10 modular reductions in Fp. Since xQ and yQ are in F_{p^2}, line 5 requires 8 modular multiplications in Fp, and lazy reduction cannot be used. In line 11, 6 multiplications and only 5 modular reductions are needed, since lazy reduction can be used on the constant term. Line 6 involves both a squaring and a multiplication in F_{p^12}. The squaring requires 36 multiplications and 12 reductions. Furthermore, the multiplication by g needs 39 multiplications and 12 reductions. Thus, the total complexity for line 6 is 75 multiplications and 24 modular reductions in Fp. The case of line 12 is similar and it needs 39 modular multiplications and

Figure 10.27 Algorithm for Tate pairing for BN curves (adapted from [68])

12 reductions in Fp. Line 13 computes f^(p^6)/f, where the computation of f^(p^6) is free by conjugation. Hence, one multiplication and one inversion are needed in F_{p^12}. This inversion needs one inversion, 97 multiplications and 35 reductions in Fp. The first step of the exponentiation thus requires 151 multiplications, 47 modular reductions and one inversion in Fp. Line 14 involves one multiplication in F_{p^12} and one powering to p^2. The Frobenius map and its iterations need 11 modular multiplications in Fp. This step thus needs 65 multiplications and 23 reductions in Fp.
The hard part given in line 15 involves one Frobenius (11 modular multiplications), one multiplication in F_{p^12} (54 multiplications and 12 reductions) and one exponentiation. Since for BN curves l can be chosen sparse, a classical square and multiply can be used. Since in line 13 f has been raised to the power (p^6 − 1), it

is a unit and can be squared with only 2 squarings and 2 reductions in F_{p^2} per coefficient (i.e. 24 multiplications and 12 reductions in Fp). Thus, the cost is only 24 multiplications and 12 reductions for most steps. For steps corresponding to the non-zero bits of the exponent, 54 additional multiplications and 12 additional reductions are necessary.
In line 16, four applications of the Frobenius map, 9 multiplications and 6 squarings in F_{p^12} (i.e. 674 multiplications and 224 reductions in Fp) are needed. It also needs an exponentiation which is similar to that of line 15 but two times larger. Considering that the Hamming weight of l is 11 and that of ℓ is 90, we observe that lines 1–6 are done 255 times and lines 7–12 are done 89 times for a 128-bit security level. Thus, the Miller loop needs 255 × (7 + 8 + 75) + 89 × (11 + 6 + 39) = 27,934 multiplications and 255 × (6 + 8 + 24) + 89 × (10 + 5 + 12) = 12,093 reductions. The easy part of the final exponentiation requires one inversion, 216 multiplications and 70 reductions in Fp. The hard part involves exponentiation by 6l − 5, which has a Hamming weight of 11, and by 6l^2 + 1, which has a Hamming weight of 28. The second exponentiation can be split into two parts, l and 6l [88], both having a Hamming weight of 11. This leads to 21 multiplications. Lines 15 and 16 require 11 + 54 + 65 × 24 + 9 × 54 + 674 + 127 × 24 + 21 × 54 = 6967 multiplications and 11 + 12 + 65 × 12 + 9 × 12 + 224 + 127 × 12 + 21 × 12 = 2911 reductions. Thus, the full Tate pairing needs 35,117 multiplications but only 15,074 reductions. For a radix implementation using 8 (32-bit) words, we need 35,117 × 8^2 + 15,074 × (8^2 + 8) = 3,332,816 word multiplications, whereas RNS needs 1.1 × (35,117 × 2 × 8 + 15,074 × ((7/5) × 8^2 + (8/5) × 8)) = 2,315,994 word multiplications. This is a gain of 30.5 %.
In the case of Ate pairing, lines 1–4 are done in F_{p^2}, requiring 3 multiplications, 4 squarings and 6 reductions in F_{p^2}, i.e. 17 multiplications and 12 reductions in Fp. Similarly, lines 7–10 require 8 multiplications, 3 squarings and 10 reductions in F_{p^2}, or 30 multiplications and 20 reductions in Fp. If the coordinates of T are (XT γ^2, YT γ^3, ZT), lines 5 and 11 must be replaced by

5′: g = Z_{2T} Z_T^2 yP − A Z_T^2 xP γ + (A XT − 2Y_T^2)γ^3
11′: g = Z_{T+P} yP − F xP γ + (F xQ − Z_{T+P} yQ)γ^3        (10.57)

where Z_{2T}, A = 3X_T^2, Y_T^2, Z_{T+P} and F = YT − yQ Z_T^3 were computed in the previous steps. The first requires 15 multiplications and 12 reductions in Fp, whereas the second requires 10 multiplications and 6 reductions. Note further that the value of g obtained has only terms in γ, γ^3 and a constant term, so that a multiplication by g requires only 39 multiplications instead of 54.
Next, since t − 1 has a bit length of 128 and a Hamming weight of 29, the total cost of the Miller loop is 127 × (17 + 15 + 36 + 39) + 28 × (30 + 10 + 39) = 15,801 multiplications and 127 × (12 + 12 + 24) + 28 × (20 + 6 + 12) = 7160 reductions. The final exponentiation is the same as in Tate pairing and thus the full Ate pairing needs 22,984 multiplications and 10,241 reductions. In radix representation, this means that

2,208,328 word multiplications are needed, whereas in RNS, only 1,558,065 word
multiplications are needed. The gain is thus 29.5 %.
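The conversion from F_{p^2} counts to Fp counts used throughout (e.g. 3 multiplications + 4 squarings → 17 base multiplications) assumes Karatsuba multiplication in F_{p^2} (three base products) and the complex squaring method (two base products), with i^2 = −1. A toy-prime sketch of both:

```python
p = 10_007  # toy prime with p % 4 == 3, so i^2 = -1 defines F_{p^2}

def fp2_mul(a, b):
    # Karatsuba multiplication in F_p[i]/(i^2 + 1): 3 base products
    a0, a1 = a
    b0, b1 = b
    t0, t1, t2 = a0 * b0, a1 * b1, (a0 + a1) * (b0 + b1)
    return ((t0 - t1) % p, (t2 - t0 - t1) % p)

def fp2_sqr(a):
    # complex-method squaring: 2 base products
    a0, a1 = a
    return ((a0 + a1) * (a0 - a1) % p, 2 * a0 * a1 % p)

x = (1234, 5678)
assert fp2_sqr(x) == fp2_mul(x, x)
# lines 1-4 above: 3 mults and 4 squarings in F_{p^2} ...
assert 3 * 3 + 4 * 2 == 17  # ... i.e. 17 multiplications in F_p
```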
In the case of R-Ate pairing, while the Miller loop is the same, an additional step is necessary at the end: the computation of f ← (f · g_{(T,Q)}(P))^p · g_{(π(T+Q),T)}(P), where T = (6l + 2)Q is computed in the Miller loop and π is the Frobenius map on the curve. The following operations are needed in this computation. One step of addition as in the Miller loop (computation of T + Q and g_{(T+Q)}(P)) needs 40 multiplications and 26 reductions in Fp. As p ≡ 1 mod 6 for BN curves, one application of the Frobenius map is needed, which requires 2 multiplications in F_{p^2} by pre-computed values. Next, one non-mixed addition step (computation of g_{(π(T+Q),T)}(P)) needs 60 multiplications and 40 reductions in Fp. The two multiplications of the results of the two previous steps require 39 multiplications and 12 reductions in Fp each. Next, a Frobenius needs 11 modular multiplications and finally, one full multiplication in F_{p^12} requires 54 multiplications and 12 reductions in Fp. Thus, in total this step requires 249 multiplications and 117 reductions in Fp. Considering that 6l + 2 has 66 bits and a Hamming weight of 9, the cost of the Miller loop is 65 × (17 + 15 + 36 + 39) + 8 × (30 + 10 + 39) = 7587 multiplications and 65 × (12 + 12 + 24) + 8 × (20 + 6 + 12) = 3424 reductions. The final exponentiation is the same as for Tate pairing. Hence, for the complete R-Ate pairing, we need 15,019 multiplications and 6405 reductions. This means that 1,422,376 word multiplications in radix representation and 985,794 in the case of RNS will be required, thus saving 30.7 %.
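All the Miller-loop totals above follow the same aggregation: (bit length − 1) doubling steps plus (Hamming weight − 1) addition steps, each weighted by the per-step counts just derived. A sketch for the R-Ate loop over 6l + 2 (66 bits, Hamming weight 9):

```python
def miller_loop_cost(bits, hamming_weight, dbl_cost, add_cost):
    # a doubling at every bit after the leading one,
    # an addition at every other set bit of the exponent
    return (bits - 1) * dbl_cost + (hamming_weight - 1) * add_cost

# per-step Fp multiplication and reduction counts from the text
dbl_mults, add_mults = 17 + 15 + 36 + 39, 30 + 10 + 39
dbl_reds, add_reds = 12 + 12 + 24, 20 + 6 + 12
assert miller_loop_cost(66, 9, dbl_mults, add_mults) == 7_587
assert miller_loop_cost(66, 9, dbl_reds, add_reds) == 3_424
```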
Kammler et al. [100] have described an ASIP (application-specific instruction set processor) for BN curves. They consider the NIST-recommended prime group order of 256 bits for E(Fp) and 3072 bits for the finite field F_{p^k} (256 × 12 = 3072, since k = 12). This ASIP is programmable for all pairings. They keep the points in Jacobian coordinates throughout the pairing computation and thus field inversion can be avoided almost entirely. Inversion is accomplished by exponentiation with (p − 2). All the values are kept in Montgomery form throughout the pairing computation.
The authors have used the scalable Montgomery modular multiplier architecture (see Figure 10.28a) due to Nibouche et al. [101], which can be segmented and pipelined. In this technique, for computing A·B·R^−1 mod M, the algorithm is split into two multiplication operations that can be performed in parallel. It uses carry-save number representation. The actual multiplication is carried out in the left half (see Figure 10.28a) and the reduction is carried out in the right half simultaneously. The left half is a conventional multiplier built up of gated full-adders, and the right half is a multiplier with special cells for the LSBs. These LSB cells are built around half-adders. Due to the area constraint, subsets of the regular structure of the multiplier have been used and the computation is performed in multiple cycles. They have used multi-cycle multipliers for W × H (W is the word length and H is the number of words) of three different sizes: 32 × 8, 64 × 8 and 128 × 8 bits. For example, for a 256-bit multiplier,

Figure 10.28 (a) Montgomery multiplier based on Nibouche et al. technique and (b) multi-cycle
Montgomery Multiplier (MMM) (adapted from [100] ©2009)

H = 8 and W = 32 can be used. Thus, A is taken 8 bits at a time and B 32 bits at a time, needing 256 cycles for multiplication and partial reduction and addition (see Figure 10.28a). This approach makes the design adaptable to the desired computation performance and allows trading off area against the execution time of the multiplication.
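The word-serial behaviour can be modelled with the standard CIOS-style Montgomery loop: one word of one operand is consumed per iteration, and a multiple of M is folded in to cancel the low word. This is an illustrative model of the computation A·B·R^−1 mod M, not the actual Nibouche et al. carry-save circuit:

```python
def montgomery_mul(a, b, m, w=32, h=8):
    """Word-serial Montgomery product a*b*R^-1 mod m, with R = 2^(w*h)."""
    mask = (1 << w) - 1
    neg_m_inv = pow(-m, -1, 1 << w)   # -m^-1 mod 2^w (m must be odd)
    t = 0
    for i in range(h):
        t += ((a >> (w * i)) & mask) * b      # multiply step ("left half")
        q = (t * neg_m_inv) & mask            # quotient digit
        t = (t + q * m) >> w                  # reduction step ("right half")
    return t - m if t >= m else t

# check against the definition for a 256-bit odd modulus
m = (1 << 255) - 19
a, b = 0x123456789ABCDEF0 << 150, 0xFEDCBA9876543210 << 100
r_inv = pow(1 << 256, -1, m)
assert montgomery_mul(a, b, m) == a * b * r_inv % m
```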
The structure of the multi-cycle Montgomery multiplier (MMM) is shown in Figure 10.28b. The two's complementer is included in the multiplication unit. The result is stored in the registers of temporary carry-save values CM, SM, SR and CR. The authors have used a multi-cycle adder unit for modular addition and subtraction. In addition, an enhanced memory architecture has been employed: transparent interleaved memory segmentation. Basically, the number of ports to the memory system is extended to increase the throughput. These memory banks can be accessed in parallel. The authors mention that in 130 nm standard cell technology, an optimal

Table 10.8 Number of operations needed in various pairing computations (adapted from [100] ©IACR 2009)

Number of | Opt. Ate | Ate | η | Tate | Comp. η | Comp. Tate
Multiplications | 17,913 | 25,870 | 32,155 | 39,764 | 75,568 | 94,693
Additions | 84,956 | 121,168 | 142,772 | 174,974 | 155,234 | 193,496
Inversions | 3 | 2 | 2 | 2 | 0 | 0

Ate pairing needed 15.8 ms at a frequency of 338 MHz. The number of operations needed for the different pairing applications is presented in Table 10.8 in order to illustrate the complexity of a pairing processor.
Barenghi et al. [102] described an FPGA co-processor for Tate pairing over Fp which used the BKLS algorithm [63] followed by Lucas laddering [103] for the final exponentiation by (p^k − 1)/r:

f_P(D_Q)^((p^2 − 1)/r) = ((c + id)^(p−1))^m = ((c − id)^2)^m = (a + ib)^m

where m = (p + 1)/r, a = c^2 − d^2 and b = −2cd. Note that (a + ib)^m = V_m(2a)/2 + ib·U_m(2a), where U_m and V_m are the m-th terms of the Lucas sequences. The prime p is a 512-bit number and k = 2 has been used. They have designed a block which can be used for modular addition/subtraction using three 512-bit adders. The adders compute A + B, A + B − M and A − B + M. Modular multiplication uses the Montgomery algorithm based on the CIOS technique. The architecture comprises a microcontroller, a program ROM, an Fp multiplier and adder/subtractor, a register file and an input/output buffer. The microcontroller realizes Miller's loop by calling the corresponding subroutines. The ALU could execute multiplication and addition/subtraction in parallel. A Virtex-2 8000 (XC2V8000-5FF1152) was used, which needed 33,857 slices at a frequency of 135 MHz, with a time of 1.61 ms.
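The Lucas laddering identity can be checked numerically: for a unitary element a + ib (a^2 + b^2 ≡ 1 mod p), (a + ib)^m = V_m(2a)/2 + ib·U_m(2a), where U and V obey the recurrence X_{j+1} = 2a·X_j − X_{j−1}. A toy-prime sketch (the rational parametrization of a and b only serves to make a^2 + b^2 = 1):

```python
p = 10_007  # toy prime with p % 4 == 3

def lucas_uv(P, m, p):
    # U_m, V_m for x^2 - P*x + 1 via the linear recurrence
    # (a production ladder would use the O(log m) doubling rules)
    u_prev, u = 0, 1        # U_0, U_1
    v_prev, v = 2, P % p    # V_0, V_1
    for _ in range(m - 1):
        u_prev, u = u, (P * u - u_prev) % p
        v_prev, v = v, (P * v - v_prev) % p
    return u, v

def fp2_pow(a, b, m, p):
    # (a + i*b)^m in F_p[i]/(i^2 + 1), by repeated multiplication
    x, y = 1, 0
    for _ in range(m):
        x, y = (x * a - y * b) % p, (x * b + y * a) % p
    return x, y

# unitary element: a = (1 - t^2)/(1 + t^2), b = 2t/(1 + t^2) mod p
t = 123
inv = pow(1 + t * t, -1, p)
a, b = (1 - t * t) * inv % p, 2 * t * inv % p
assert (a * a + b * b) % p == 1

m = 77
u, v = lucas_uv(2 * a % p, m, p)
assert fp2_pow(a, b, m, p) == (v * pow(2, -1, p) % p, b * u % p)
```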

References

1. W. Stallings, Cryptography and Network Security, Principles and Practices, 6th edn. (Pearson, Upper Saddle River, 2013)
2. B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C (Wiley,
New York, 1996)
3. P. Barrett, Implementing the Rivest-Shamir-Adleman Public Key algorithm on a standard
Digital Signal Processor, in Proceedings of Annual Cryptology Conference on Advances in
Cryptology, (CRYPTO‘86), pp. 311–323 (1986)
4. A. Menezes, P. van Oorschot, S. Vanstone, Handbook of Applied Cryptography (CRC, Boca
Raton, 1996)
5. J.-F. Dhem, Modified version of the Barrett Algorithm, Technical report (1994)
6. M. Knezevic, F. Vercauteren, I. Verbauwhede, Faster interleaved modular multiplication
based on Barrett and Montgomery reduction methods. IEEE Trans. Comput. 59, 1715–1721
(2010)

7. J.-J. Quisquater, Encoding system according to the so-called RSA method by means of a
microcontroller and arrangement implementing the system, US Patent #5,166,978, 24 Nov
1992
8. C.D. Walter, Fast modular multiplication by operand scanning, Advances in Cryptology,
LNCS, vol. 576 (Springer, 1991), pp. 313–323
9. E.F. Brickell, A fast modular multiplication algorithm with application to two key cryptog-
raphy, Advances in Cryptology Proceedings of Crypto 82 (Plenum Press, New York, 1982),
pp. 51–60
10. C.K. Koc. RSA Hardware Implementation. TR 801, RSA Laboratories, (April 1996)
11. C.K. Koc, T. Acar, B.S. Kaliski Jr., Analyzing and comparing Montgomery Multiplication
Algorithms, in IEEE Micro, pp. 26–33 (1996)
12. M. McLoone, C. McIvor, J.V. McCanny, Coarsely integrated Operand Scanning (CIOS)
architecture for high-speed Montgomery modular multiplication, in IEEE International
Conference on Field Programmable Technology (ICFPT), pp. 185–192 (2004)
13. M. McLoone, C. McIvor, J.V. McCanny, Montgomery modular multiplication architecture
for public key cryptosystems, in IEEE Workshop on Signal Processing Systems (SIPS),
pp. 349–354 (2004)
14. C.D. Walter, Montgomery exponentiation needs no final subtractions. Electron. Lett. 35,
1831–1832 (1999)
15. H. Orup, Simplifying quotient determination in high-radix modular multiplication, in Pro-
ceedings of IEEE Symposium on Computer Arithmetic, pp. 193–199 (1995)
16. C. McIvor, M. McLoone, J.V. McCanny, Modified Montgomery modular multiplication and
RSA exponentiation techniques, in Proceedings of IEE Computers and Digital Techniques,
vol. 151, pp. 402–408 (2004)
17. N. Nedjah, L.M. Mourelle, Three hardware architectures for the binary modular exponenti-
ation: sequential, parallel and systolic. IEEE Trans. Circuits Syst. I 53, 627–633 (2006)
18. M.D. Shieh, J.H. Chen, W.C. Lin, H.H. Wu, A new algorithm for high-speed modular
multiplication design. IEEE Trans. Circuits Syst. I 56, 2009–2019 (2009)
19. C.C. Yang, T.S. Chang, C.W. Jen, A new RSA cryptosystem hardware design based on
Montgomery’s algorithm. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 45,
908–913 (1998)
20. A. Tenca, C. Koc, A scalable architecture for modular multiplication based on Montgomery’s
algorithm. IEEE Trans. Comput. 52, 1215–1221 (2003)
21. D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, S. Hsu, An improved unified scalable
radix-2 Montgomery multiplier, in IEEE Symposium on Computer Arithmetic, pp. 172–175
(2005)
22. K. Kelly, D. Harris, Very high radix scalable Montgomery multipliers, in Proceedings of
International Workshop on System-on-Chip for Real-Time Applications, pp. 400–404 (2005)
23. N. Jiang, D. Harris, Parallelized Radix-2 scalable Montgomery multiplier, in Proceedings of
IFIP International Conference on Very Large-Scale Integration (VLSI-SoC 2007),
pp. 146–150 (2007)
24. N. Pinckney, D. Harris, Parallelized radix-4 scalable Montgomery multipliers. J. Integr.
Circuits Syst. 3, 39–45 (2008)
25. K. Kelly, D. Harris, Parallelized very high radix scalable Montgomery multipliers, in Pro-
ceedings of Asilomar Conference on Signals, Systems and Computers, pp. 1196–1200 (2005)
26. M. Huang, K. Gaj, T. El-Ghazawi, New hardware architectures for Montgomery modular
multiplication algorithm. IEEE Trans. Comput. 60, 923–936 (2011)
27. M.D. Shieh, W.C. Lin, Word-based Montgomery modular multiplication algorithm for
low-latency scalable architectures. IEEE Trans. Comput. 59, 1145–1151 (2010)
28. A. Miyamoto, N. Homma, T. Aoki, A. Satoh, Systematic design of RSA processors based on
high-radix Montgomery multipliers. IEEE Trans. VLSI Syst. 19, 1136–1146 (2011)
29. K.C. Posch, R. Posch, Modulo reduction in residue Number Systems. IEEE Trans. Parallel
Distrib. Syst. 6, 449–454 (1995)

30. C. Bajard, L.S. Didier, P. Kornerup, An RNS Montgomery modular multiplication Algo-
rithm. IEEE Trans. Comput. 47, 766–776 (1998)
31. J.C. Bajard, L. Imbert, A full RNS implementation of RSA. IEEE Trans. Comput. 53,
769–774 (2004)
32. A.P. Shenoy, R. Kumaresan, Fast base extension using a redundant modulus in RNS. IEEE
Trans. Comput. 38, 293–297 (1989)
33. H. Nozaki, M. Motoyama, A. Shimbo, S. Kawamura, Implementation of RSA Algorithm
Based on RNS Montgomery Multiplication, in Cryptographic Hardware and Embedded
Systems—CHES, ed. by C. Paar (Springer, Berlin, 2001), pp. 364–376
34. S. Kawamura, M. Koike, F. Sano, A. Shimbo, Cox-Rower architecture for fast parallel
Montgomery multiplication, in Proceedings of International Conference on Theory and
Application of Cryptographic Techniques: Advances in Cryptology, (EUROCRYPT 2000),
pp. 523–538 (2000)
35. F. Gandino, F. Lamberti, G. Paravati, J.C. Bajard, P. Montuschi, An algorithmic and
architectural study on Montgomery exponentiation in RNS. IEEE Trans. Comput. 61,
1071–1083 (2012)
36. D. Schinianakis, T. Stouraitis, A RNS Montgomery multiplication architecture, in Proceed-
ings of ISCAS, pp. 1167–1170 (2011)
37. Y.T. Jie, D.J. Bin, Y.X. Hui, Z.Q. Jin, An improved RNS Montgomery modular multiplier, in
Proceedings of the International Conference on Computer Application and System Modeling
(ICCASM 2010), pp. V10-144–147 (2010)
38. D. Schinianakis, T. Stouraitis, Multifunction residue architectures for cryptography. IEEE
Trans. Circuits Syst. 61, 1156–1169 (2014)
39. H.M. Yassine, W.R. Moore, Improved mixed radix conversion for residue number system
architectures, in Proceedings of IEE Part G, vol. 138, pp. 120–124 (1991)
40. M. Ciet, M. Neve, E. Peeters, J.J. Quisquater, Parallel FPGA implementation of RSA with
residue number systems—can side-channel threats be avoided?, in 46th IEEE International
MW Symposium on Circuits and Systems, vol. 2, pp. 806–810 (2003)
41. J.-J. Quisquater, C. Couvreur, Fast decipherment algorithm for RSA public key cryptosystem.
Electron. Lett. 18, 905–907 (1982)
42. R. Szerwinski, T. Guneysu, Exploiting the power of GPUs for Asymmetric Cryptography.
Lect. Notes Comput. Sci. 5154, 79–99 (2008)
43. B.S. Kaliski Jr., The Montgomery inverse and its applications. IEEE Trans. Comput. 44,
1064–1065 (1995)
44. E. Savas, C.K. Koc, The Montgomery modular inverse—revisited. IEEE Trans. Comput. 49,
763–766 (2000)
45. A.A.A. Gutub, A.F. Tenca, C.K. Koc, Scalable VLSI architecture for GF(p) Montgomery
modular inverse computation, in IEEE Computer Society Annual Symposium on VLSI,
pp. 53–58 (2002)
46. E. Savas, A carry-free architecture for Montgomery inversion. IEEE Trans. Comput. 54,
1508–1518 (2005)
47. J. Bucek, R. Lorencz, Comparing subtraction free and traditional AMI, in Proceedings of
IEEE Design and Diagnostics of Electronic Circuits and Systems, pp. 95–97 (2006)
48. D.M. Schinianakis, A.P. Kakarountas, T. Stouraitis, A new approach to elliptic curve
cryptography: an RNS architecture, in IEEE MELECON, Benalmádena (Málaga), Spain,
pp. 1241–1245, 16–19 May 2006
49. D.M. Schinianakis, A.P. Fournaris, H.E. Michail, A.P. Kakarountas, T. Stouraitis, An RNS
implementation of an Fp elliptic curve point multiplier. IEEE Trans. Circuits Syst. I Reg. Pap.
56, 1202–1213 (2009)
50. M. Esmaeildoust, D. Schnianakis, H. Javashi, T. Stouraitis, K. Navi, Efficient RNS imple-
mentation of Elliptic curve point multiplication over GF(p). IEEE Trans. Very Large Scale
Integration (VLSI) Syst. 21, 1545–1549 (2013)

51. P.V. Ananda Mohan, RNS to binary converter for a new three-moduli set {2^(n+1) − 1, 2^n, 2^n − 1}. IEEE Trans. Circuits Syst. II 54, 775–779 (2007)
52. M. Esmaeildoust, K. Navi, M. Taheri, A.S. Molahosseini, S. Khodambashi, Efficient RNS to binary converters for the new 4-moduli set {2^n, 2^(n+1) − 1, 2^n − 1, 2^(n−1) − 1}. IEICE Electron. Exp. 9(1), 1–7 (2012)
53. J.C. Bajard, S. Duquesne, M. Ercegovac, Combining leak resistant arithmetic for elliptic
curves defined over Fp and RNS representation, Cryptology Reprint Archive 311 (2010)
54. M. Joye, J.J. Quisquater, Hessian elliptic curves and side channel attacks. CHES, LNCS
2162, 402–410 (2001)
55. P.Y. Liardet, N. Smart, Preventing SPA/DPA in ECC systems using Jacobi form. CHES,
LNCS 2162, 391–401 (2001)
56. E. Brier, M. Joye, Wierstrass elliptic curves and side channel attacks. Public Key Cryptog-
raphy LNCS 2274, 335–345 (2002)
57. P.L. Montgomery, Speeding the Pollard and elliptic curve methods of factorization. Math.
Comput. 48, 243–264 (1987)
58. A. Joux, A one round protocol for tri-partite Diffie-Hellman, Algorithmic Number Theory,
LNCS, pp. 385–394 (2000)
59. D. Boneh, M.K. Franklin, Identity based encryption from the Weil Pairing, in Crypto 2001,
LNCS, vol. 2139, pp. 213–229 (2001)
60. D. Boneh, B. Lynn, H. Shacham, Short signatures from the Weil pairing. J. Cryptol. 17, 297–319
(2004)
61. J. Groth, A. Sahai, Efficient non-interactive proof systems for bilinear groups, in 27th Annual
International Conference on Advances in Cryptology, Eurocrypt 2008, pp. 415–432 (2008)
62. V.S. Miller, The Weil pairing and its efficient calculation. J. Cryptol. 17, 235–261 (2004)
63. P.S.L.M. Barreto, H.Y. Kim, B. Lynn, M. Scott, Efficient algorithms for pairing based
cryptosystems, in Crypto 2002, LCNS 2442, pp. 354–369 (Springer, Berlin, 2002)
64. F. Hess, N.P. Smart, F. Vercauteren, The eta pairing revisited. IEEE Trans. Inf. Theory 52,
4595–4602 (2006)
65. E. Lee, H.S. Lee, C.M. Park, Efficient and generalized pairing computation on abelian
varieties, Cryptology ePrint Archive, Report 2008/040 (2008)
66. F. Vercauteren, Optimal pairings. IEEE Trans. Inf. Theory 56, 455–461 (2010)
67. S. Duquesne, N. Guillermin, A FPGA pairing implementation using the residue number
System, in Cryptology ePrint Archive, Report 2011/176(2011), http://eprint.iacr.org/
68. S. Duquesne, RNS arithmetic in Fpk and application to fast pairing computation, Cryptology
ePrint Archive, Report 2010/55 (2010), http://eprint.iacr.org
69. P. Barreto, M. Naehrig, Pairing friendly elliptic curves of prime order, SAC, 2005. LNCS
3897, 319–331 (2005)
70. A. Miyaji, M. Nakabayashi, S. Takano, New explicit conditions of elliptic curve traces for
FR-reduction. IEICE Trans. Fundam. 84, 1234–1243 (2001)
71. B. Lynn, On the implementation of pairing based cryptography, Ph.D. Thesis PBC Library,
https://crypto.stanford.edu/~blynn/
72. C. Costello, Pairing for Beginners, www.craigcostello.com.au/pairings/PairingsFor
Beginners.pdf
73. J.C. Bajard, M. Kaihara, T. Plantard, Selected RNS bases for modular multiplication, in 19th
IEEE International Symposium on Computer Arithmetic, pp. 25–32 (2009)
74. A. Karatsuba, The complexity of computations, in Proceedings of Steklov Institute of
Mathematics, vol. 211, pp. 169–183 (1995)
75. P.L. Montgomery, Five-, six- and seven term Karatsuba like formulae. IEEE Trans. Comput.
54, 362–369 (2005)
76. J. Fan, F. Vercauteren, I. Verbauwhede, Efficient hardware implementation of Fp-arithmetic
for pairing-friendly curves. IEEE Trans. Comput. 61, 676–685 (2012)
77. J. Fan, F. Vercauteren, I. Verbauwhede, Faster Fp-Arithmetic for cryptographic pairings on
Barreto Naehrig curves, in CHES, vol. 5747, LNCS, pp. 240–253 (2009)

78. J. Fan, http://www.iacr.org/workshops/ches/ches2009/presentations/08_Session_5/CHES2009_fan_1.pdf
79. J. Chung, M.A. Hasan, Low-weight polynomial form integers for efficient modular multipli-
cation. IEEE Trans. Comput. 56, 44–57 (2007)
80. J. Chung, M. Hasan, Montgomery reduction algorithm for modular multiplication using low
weight polynomial form integers, in IEEE 18th Symposium on Computer Arithmetic,
pp. 230–239 (2007)
81. C.C. Corona, E.F. Moreno, F.R. Henriquez, Hardware design of a 256-bit prime field
multiplier for computing bilinear pairings, in 2011 International Conference on
Reconfigurable Computing and FPGAs, pp. 229–234 (2011)
82. S. Srinath, K. Compton, Automatic generation of high-performance multipliers for FPGAs
with asymmetric multiplier blocks, in Proceedings of 18th Annual ACM/Sigda International
Symposium on Field Programmable Gate Arrays, FPGA ‘10, New York, pp. 51–58 (2010)
83. R. Brinci, W. Khmiri, M. Mbarek, A.B. Rabaa, A. Bouallegue, F. Chekir, Efficient multi-
pliers for pairing over Barreto-Naehrig curves on Virtex -6 FPGA, iacr Cryptology Eprint
Archive (2013)
84. A.J. Devegili, C. OhEigertaigh, M. Scott, R. Dahab, Multiplication and squaring on pairing
friendly fields, in Cryptology ePrint Archive, vol. 71 (2006)
85. A.L. Toom, The complexity of a scheme of functional elements realizing the multiplication
of integers. Sov. Math. 4, 714–716 (1963)
86. S.A. Cook, On the minimum computation time of functions, Ph.D. Thesis, Harvard Univer-
sity, Department of Mathematics, 1966
87. J. Chung, M.A. Hasan, Asymmetric squaring formulae, Technical Report, CACR 2006-24,
University of Waterloo (2006), http://www.cacr.uwaterloo.ca/techreports/2006/cacr2006-24.
pdf
88. D. Hankerson, A. Menezes, M. Scott, Software Implementation of Pairings, in Identity Based
Cryptography, Chapter 12, ed. by M. Joye, G. Neven (IOS Press, Amsterdam, 2008),
pp. 188–206
89. G.X. Yao, J. Fan, R.C.C. Cheung, I. Verbauwhede, A high speed pairing Co-processor using
RNS and lazy reduction, eprint.iacr.org/2011/258.pdf
90. M. Scott, Implementing Cryptographic Pairings, ed. by T. Takagi, T. Okamoto, E. Okamoto,
T. Okamoto, Pairing Based Cryptography, Pairing 2007, LNCS, vol. 4575, pp. 117–196
(2007)
91. J.L. Beuchat, J.E. Gonzalez-Diaz, S. Mitsunari, E. Okamoto, F. Rodriguez-Henriquez,
T. Terya, in High Speed Software Implementation of the Optimal Ate Pairing over Barreto-
Naehrig Curves, ed. by M. Joye, A. Miyaji, A. Otsuka, Pairing 2010, LNCS 6487, pp. 21–39
(2010)
92. M. Scott, N. Benger, M. Charlemagne, L.J.D. Perez, E.J. Kachisa, On the final exponentiation
for calculating pairings on ordinary elliptic curves, Cryptology ePrint Archive, Report 2008/
490(2008), http://eprint.iacr.org/2008/490.pdf
93. A.J. Devegili, M. Scott, R. Dahab, Implementing cryptographic pairings over Barreto-
Naehrig curves, Pairing 2007, vol. 4575 LCNS (Springer, Berlin, 2007), pp. 197–207
94. J. Olivos, On vectorial addition chains. J. Algorithm 2, 13–21 (1981)
95. G.X. Yao, J. Fn, R.C.C. Cheung, I. Verbauwhede, Novel RNS parameter selection for fast
modular multiplication. IEEE Trans. Comput. 63, 2099–2105 (2014)
96. C. Costello, T. Lange, M. Naehrig, Faster pairing computations on curves with high degree
twists, ed. by P. Nguyen, D. Pointcheval, PKC 2010, LNCS, vol. 6056, pp. 224–242 (2010)
97. D. Aranha, K. Karabina, P. Longa, C.H. Gebotys, J. Lopez, Faster explicit formulae for
computing pairings over ordinary curves, Cryptology ePrint Archive, Report 2010/311
(2010), http://eprint.iacr.org/
98. R. Granger, M. Scott, Faster squaring in the cyclotomic subgroups of sixth degree extensions,
PKC-2010, 6056, pp. 209–223 (2010)
References 347

99. N. Guillermin, A high speed coprocessor for elliptic curve scalar multiplications over Fp,
CHES, LNCS (2010)
100. D. Kammler, D. Zhang, P. Schwabe, H. Scharwaechter, M. Langenberg, D. Auras,
G. Ascheid, R. Leupers, R. Mathar, H. Meyr, Designing an ASIP for cryptographic pairings
over Barreto-Naehrig curves, in CHES 2009, LCNS 5747 (Springer, Berlin, 2009),
pp. 254–271
101. D. Nibouche, A. Bouridane, M. Nibouche, Architectures for Montgomery’s multiplication, in
Proceedings of IEE Computers and Digital Techniques, vol. 150, pp. 361–368 (2003)
102. A. Barenghi, G. Bertoni, L. Breveglieri, G. Pelosi, A FPGA coprocessor for the cryptographic
Tate pairing over Fp, in Proceedings of Fifth International Conference on Information
Technology: New Generations, ITNG 2008, pp. 112–119 (April 2008)
103. M. Scott, P.S.L.M. Barreto, Compressed pairings, in CRYPTO, Lecture Notes in Computer
Science, vol. 3152, pp. 140–156 (2004)

Further Reading

E. Savas, M. Nasser, A.A.A. Gutub, C.K. Koc, Efficient unified Montgomery inversion with multi-
bit shifting, in Proceedings of IEE Computers and Digital Techniques, vol. 152, pp. 489–498
(2005)
A.F. Tenca, G. Todorov, C.K. Koc, High radix design of a scalable modular multiplier, in
Proceedings of Third International Workshop on Cryptographic Hardware and Embedded
Systems, CHES, pp. 185–201 (2001)
Index

A Diminished-1 representation, 6, 16, 18, 20, 39,


Adaptive filter using RNS, 198 51, 56, 58–60, 64, 67, 76, 160
Almost Montgomery Inverse (AMI), 295 Distributed arithmetic based RNS
Aryabhata Remainder Theorem, 4 FIR filters, 223
Auto-scale multipliers, 39, 216 Division in RNS, 36, 133, 149,
251, 298
Double-LSB encoding, 64
B DPA resistant, 294
Barreto-Naehrig curves, 308, 311 DS-CDMA using RNS, 242
Binary to RNS Conversion, 4, 27–30, 34, 35,
42, 69, 183, 196, 198, 209, 212, 215,
218, 222, 223, 244, 248, 292, 301, 342 E
Elliptic curve cryptography
bilinear pairing, 306
C Miller loop, 306
Chinese Remainder Theorem (CRT), 4, 81, 96 pairing processor, 306
Chung-Hasan technique point doubling, 299
for multiplication, 317 point multiplication, 301
for squaring, 323 projective coordinates, 299
Communication receiver, RNS based, 244 R-Ate pairing, 306
Comparison of residue numbers, 133–136, Tate pairing, 306, 307
139–141, 143–160 using Barreto-Naehrig curves, 308
Conjugate moduli, 32, 34, 105, 154 using MNT curves, 308
Core function using RNS, 264, 298–305
reverse conversion using, 113 Weil pairing, 306
scaling using, 6, 111, 150 Error correction using
sign detection using, 112, 114 projections, 165
Cox-Rower architecture, 290–292, 329 single, 163
using redundant moduli, 163–173
Extension field arithmetic
D cubic extension, 318
DCT implementation using RNS, 226–241 inversion in, 325
DFT implementation using RNS, 6, 226–241 Quartic Extension, 318, 321
Digital Frequency synthesis using RNS, 245 Sextic Extension, 318

© Springer International Publishing Switzerland 2016 349


P.V. Ananda Mohan, Residue Number Systems, DOI 10.1007/978-3-319-41385-3
350 Index

F Modulo multiplication
Fault tolerance in RNS, 163–165, 167–173 for IDEA algorithm, 51, 52
FIR filters using RNS, 6, 172, 195–220, 223, using Barretts algorithm, 265–267, 282
226, 228, 235 using combinational logic, 41, 45, 195
Five moduli sets, 105, 107, 123, 204 using index calculus, 39, 40, 197, 198, 223
Fixed multifunction architectures (FMA), using diminished-1 representation, 39, 51,
69, 70 56, 58–60, 64, 67, 76
Floating-point arithmetic, 3 Modulo squaring, 6, 39–44, 46, 48, 50–55, 57,
Forward conversion 58, 60–64, 67–70, 72, 75, 76, 264
multiple moduli sharing hardware, 32–34 Modulo subtraction, 4, 43, 90–92, 102, 178,
using modular exponentiation, 30–31 244, 301
Four moduli sets, 35, 50, 82, 93, 99, 101, Modulus Replication Residue Number systems
104, 105, 107, 117, 171, 206, 217, (MRRNS), 186–189
223, 240, 301 Montgomery inverse, 295–297
Frequency synthesis using RNS, 6, 245 Montgomery modular multiplication
Frobenius computation, 325, 326, 335 CIHS, 268, 269
CIOS, 268
FIPS, 268, 269
G scalable, 277
GPUs for RNS, 295 SOS, 268
using Kawamura et al technique, 292, 295
using RNS Bajard et al technique, 289
H word based, 275
Hard multiple computation, 50, 61, 68 Montgomery polynomial, 310
MQRNS system, 179, 240
Multi-modulus squarers, 6, 64, 66, 67
I Multiple error correction, 170
IEEE 754 standard, 2 Multiplication technique
IIR filters using RNS inversion in Fpk, 6, 203, for quartic polynomials, 317
206 for quintic polynomials, 311
for sextic polynomials, 310

K
Karatsuba algorithm, 309, 314, 315, 318, 319, N
321–324, 332 New Chinese Remainder Theorems
New CRT-I, 104
New CRT-II, 81, 95–97
L New CRT-III, 95–97
Lazy addition
Logarithmic Residue Number systems, 6,
189–191 O
Low–high lemma, 51, 52 OFDM system using RNS, 249
One-hot coding, 5
Optimal Ate pairing, 328, 329, 331–333, 341
M
Magnitude comparison
using MRC technique, 153 P
using new CRTs, 154–156 Pairing implementation using RNS, 264, 309,
Mixed Radix Conversion, 4, 6, 81, 90–95, 102, 327–342
127, 153, 157, 159 Pairing processors using RNS, 306–342
Mixed Radix Number system, 1 Parity detection, 141, 142, 154
Moduli of the Form rn, 179–184 Polynomial Residue Number system, 6,
Modulo addition 184–186
mod (2n+1) addition, 17, 20, 21 Powers of two related moduli sets, 6, 28–30,
mod (2n-1) addition, 14 39, 44, 92, 99, 103, 199
Index 351

Q using Look up tables, 142, 167


Quadratic Residue Number systems (QRNS), using Shenoy and Kumaresan, 142
6, 69, 177–179, 183, 184 Sign detection, 1, 4–6, 81, 87, 112, 133–136,
Quasi-chaotic generator using RNS, 253 138, 140–142, 144–160, 227
Specialized Residue Number systems, 6,
177–180, 182, 184–191
R
Reduced Precision Redundancy (RPR), 217, 219
Redundant moduli, 145, 163–165, 167–171, T
173, 244, 247, 249 Three moduli sets, 82, 97–99, 101, 102, 108,
Reverse Conversion 117, 118, 134, 143, 159, 189, 200, 203,
using Core function, 6, 81, 111–114, 151–153 216, 240, 301
using CRT, 117 Triple modular redundancy, 163, 173, 186
using Mixed Radix Conversion, 4, 6, 81, Twist of Elliptic curves, 308
90–95, 102, 117, 127, 157 Two-D DCT, 6, 232
using Mixed Radix CRT, 6, 81, 95–97 Two-dimensional filtering using RNS, 184,
using New CRT, 6, 81, 95–97, 104, 107, 186, 188
117, 125, 154
using quotient function, 6, 81, 88, 89
V
Variable multifunction architectures (VMA),
S 69–71
Scaled residue computation, 36–37 Voltage overscaling, 217
Scaling using core function

You might also like