
ALGORITHM AND ARCHITECTURE DESIGN FOR THE IMPLEMENTATION OF HIGH ORDER FIR

FILTERS USING THE RESIDUE NUMBER SYSTEM

A.M. Dennis, C.B. Marshall and I.A. Burgess

Introduction
The computational speed requirements of future signal processing algorithms necessitate the
use of special purpose multiprocessor systems implemented using VLSI technology. It is well
established (1) that when designing such systems a cohesive exploration of applications,
algorithms, architectures and technology is essential. However, an extra consideration is the
actual representation of the numbers themselves, for this influences the relative hardware
complexity and speed of the required operations, which directly affects the choice of algorithm
and therefore ultimately strongly influences the architecture.
Finite field arithmetic and algorithms offer an extra degree of freedom in the design of high
performance integrated circuits. Motivated by the need for high order, high performance FIR
digital filters, this paper investigates some issues in maximising the benefits obtained from
the Residue Number System of arithmetic whilst maintaining the ability to use sophisticated
finite field algorithms in a VLSI implementation.
There are a number of alternative choices of algorithm for the implementation of FIR filters
(linear convolution) (2). The tapped delay line is the most direct method and leads to very
regular systolic structures ideally suited for VLSI implementation. This has therefore been the
preferred approach of most of the FIR chips currently available. For high order filters (> 200,
say) this technique is too hardware intensive, resulting from a computationally inefficient
algorithm (O(N^2) operations). The alternative is to compute a length M = 2N-1 (or greater)
cyclic convolution, for which more computationally efficient algorithms exist. The best known
technique is the DFT, realised using the FFT algorithm. Although computationally efficient,
the required structure for a multiprocessor VLSI implementation has very complicated
on-chip data communication, leading to a high cost in terms of silicon performance, area and
power. Work by Swartzlander (3) (see also (4) for an alternative approach) to combat this
problem has led to a so-called 'delay commutator' circuit which reorders data prior to following
butterfly stages. The cost in silicon to implement this data reordering is particularly high.
Further, because the DFT requires trigonometric coefficients, which cannot be represented
exactly by a finite number of bits, the arithmetic used is necessarily floating point to reduce
round-off errors in the machine. Note also that complex numbers must be manipulated in the
DFT.
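As a rough illustration of why the tapped delay line becomes hardware intensive for high order filters, the following sketch (ours, not from the paper) implements a direct FIR filter: each output sample costs one multiply-accumulate per tap, so an order-N filter performs O(N) operations per output, i.e. O(N^2) per block of N outputs.

```python
# Illustrative sketch (not from the paper): a direct tapped-delay-line FIR
# filter. Each output sample costs one multiply-accumulate per tap, which
# is the computationally inefficient O(N^2)-per-block behaviour noted
# above for high order (N > 200) filters.

def fir_direct(x, h):
    """Filter input stream x with the N taps in h (zero initial state)."""
    n_taps = len(h)
    delay_line = [0] * n_taps          # the tapped delay line
    y = []
    for sample in x:
        # shift the new sample into the delay line
        delay_line = [sample] + delay_line[:-1]
        # one multiply-accumulate per tap
        y.append(sum(c * d for c, d in zip(h, delay_line)))
    return y
```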
Attention has been given to the WFTA (5) and mixed radix (6) algorithms, which offer greater
improvement in computational requirements, particularly with the ability to use real (not
complex) numbers in the transform coefficients. However, the problems of complicated
communication and irregular structure for a multiprocessor VLSI implementation remain.
A number of alternative algorithms become available through the use of number theory (7),
which generally involve breaking up a one-dimensional aperiodic/cyclic convolution into a
many-dimensional aperiodic/cyclic convolution of smaller lengths. The use of Residue Number
System (RNS) (8) arithmetic further expands the available choice via the existence of Number
Theoretic Transforms and Prime Factor Algorithms, the finite field equivalents of the DFT and
FFT respectively. The NTT is of particular interest in that the transform coefficients are
real and integer, thus avoiding the round-off problems and complex arithmetic of the DFT.
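To make the NTT's appeal concrete, here is a small illustrative sketch; the modulus p = 17 and the root g = 4 are our example choices, not parameters from the paper. A length-4 transform over GF(17) has exact integer 'twiddle factors', and the usual convolution theorem yields a cyclic convolution with no round-off and no complex arithmetic.

```python
# Hedged sketch: a length-4 number theoretic transform over GF(17).
# p = 17 and g = 4 (multiplicative order 4 mod 17) are illustrative only;
# the point is that such real, integer twiddle factors replace the DFT's
# inexact complex exponentials.

P, L, G = 17, 4, 4             # modulus, transform length, Lth root of unity
G_INV = pow(G, -1, P)          # modular inverse (Python 3.8+)
L_INV = pow(L, -1, P)

def ntt(a, root):
    """Naive O(L^2) NTT: a DFT with roots of unity taken mod P."""
    return [sum(a[n] * pow(root, k * n, P) for n in range(L)) % P
            for k in range(L)]

def cyclic_convolve(a, b):
    """Length-L cyclic convolution via the NTT convolution theorem."""
    A, B = ntt(a, G), ntt(b, G)
    C = [(x * y) % P for x, y in zip(A, B)]
    # inverse transform uses the inverse root and a 1/L scale factor
    return [(c * L_INV) % P for c in ntt(C, G_INV)]
```

Results are exact as long as the true convolution outputs stay below the modulus; in an RNS system several such channels jointly cover the full dynamic range.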

A.M. Dennis, C.B. Marshall and /.A. Burgess are with Philips Research Laboratories, Redhill,
Surrey.

Much of the original work in this new area was devoted to finding efficient algorithms (in a
computational sense) for implementing short cyclic convolutions, without regard for the
important multiprocessor VLSI issues of on-chip data communication and structure regularity.
In optimising the operation count it is all too easy to lose sight of the real objective: to
squeeze the maximum performance from each square millimetre of silicon.
The algorithm and architecture described next result from a joint consideration of the above
factors. Benefits from using more sophisticated algorithms for cyclic convolutions are obtained
without adversely affecting the VLSI implementation issues. Also considered are the intended
arithmetic attributes (namely of the RNS) and technological factors in the final implementation
(such as the trade-offs between logic and memory on-chip).

Design Considerations
The problem considered is that of implementing a length M = 2N-1 point cyclic convolution. The
required FIR filter is easily realised by a series of such cyclic convolutions coupled with the
overlap-add or overlap-save algorithms. Careful choice of algorithm implementation removes
the necessity to add N-1 zeros to the data stream prior to computation. Since nothing is gained
by a direct implementation, the long cyclic convolution is first split up into small cyclic
convolutions via the Agarwal-Cooley algorithm (9). A number of issues become relevant:
i) Choosing a sufficiently composite M allows a one-to-many dimensional mapping to be used,
but leads to a proliferation of small length cyclic convolutions and difficult 'recombination'
requirements.
ii) How does one implement the smaller cyclic convolutions? Winograd fast convolution
algorithms (10) are computationally efficient but require complicated data flow. Further,
presenting the required data points in the order necessary for such a hardware block is
particularly difficult. If we consider the use of NTTs, how does this restrict our choice of
moduli for the RNS representation? Clearly, if we require a large length (L, say) transform to
exist we require large moduli (>= L+1), reducing the ability of the RNS to break up a
large dynamic range into many 'small' parallel channels. Further, requiring the same root of
unity to exist in each individual modulus greatly restricts our choice of moduli and again leads
to very large moduli.

iii) These algorithms are notorious for requiring a large amount of data reordering and data
storage. What are the trade-offs of on-chip RAM against complicated shift register/multiplexer
structures?
iv) Regarding the final architecture used to implement the chosen algorithm, how regular and
modular is it for ease of design? Is it systolic in nature?
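The Agarwal-Cooley splitting referred to in point i) rests on a Chinese Remainder Theorem index map. The following sketch (illustrative, with example lengths of our choosing) shows the one-to-many dimensional mapping:

```python
# Illustrative sketch: the Chinese-Remainder-Theorem index map used by the
# Agarwal-Cooley algorithm to turn a length M = M1*M2 cyclic convolution
# (with gcd(M1, M2) = 1) into a two-dimensional M1 x M2 one.
from math import gcd

def crt_map(seq, m1, m2):
    """Lay a length m1*m2 sequence out on an m1 x m2 grid.

    Index n goes to row n mod m1, column n mod m2. Because (m1, m2) = 1
    the map is a bijection, and a 1-D cyclic shift of the sequence becomes
    independent cyclic shifts along each axis of the grid.
    """
    assert gcd(m1, m2) == 1 and len(seq) == m1 * m2
    grid = [[None] * m2 for _ in range(m1)]
    for n, value in enumerate(seq):
        grid[n % m1][n % m2] = value
    return grid
```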
To combat these points we choose, for the Agarwal-Cooley mapping, M = M1M2 only (where
(M1,M2) = 1 as required). For reasons yet to be revealed, M1 is chosen highly composite
whereas M2 may be prime. To avoid the necessity of large moduli (see point ii) the M2
dimension convolutions are performed explicitly whereas the M1 dimension convolutions are
computed via the NTT. However, to avoid unnecessary computation it is possible to perform the
direct cyclic convolutions within the M1 transforms (i.e. in 'M1 space', after the forward
transforms but prior to the inverse transforms). This arrangement allows the use of small
moduli containing only the M1th root of unity, whereas the Mth root (and thus larger moduli)
would be necessary if transforms were used along both dimensions. As a result it is possible
to obtain greater benefit from the number system and to take advantage of the relative ease
of implementing smaller moduli. Since M1 is composite and the M1th root exists (by choice)
we may use the Prime Factor Algorithm to reduce the long NTTs into smaller length NTTs.
These small NTTs are implemented with the regular architecture described next. A very similar
architecture is also used to compute the so-called 'inner convolutions'. Figure 1 will help to
clarify the above. Note that in general the reference sequence is already known and the
transforms into M1 space may be pre-computed. Note also that the reference sequence only
interacts with the architecture within the inner convolution stages.
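A small numerical sketch of this hybrid scheme may help alongside figure 1. All parameters here are ours, not the paper's: arithmetic is carried out mod a single small prime p = 11, standing in for one RNS channel, with the M1 = 5 dimension handled by an NTT (3 has order 5 mod 11, so only an M1th root of unity need exist in the modulus) and the M2 = 3 dimension convolved explicitly, the 'inner convolutions'.

```python
# Hedged sketch of the hybrid "transform-convolution-transform" scheme.
# All arithmetic is mod p = 11, standing in for one RNS channel.
P, M1, M2, ROOT = 11, 5, 3, 3
ROOT_INV = pow(ROOT, -1, P)
M1_INV = pow(M1, -1, P)

def ntt(col, root):
    """Naive length-M1 NTT of one column, mod P."""
    return [sum(col[n] * pow(root, k * n, P) for n in range(M1)) % P
            for k in range(M1)]

def conv_m2(a, b):
    """Explicit length-M2 cyclic convolution, mod P."""
    return [sum(a[i] * b[(n - i) % M2] for i in range(M2)) % P
            for n in range(M2)]

def hybrid_conv2d(x, h):
    """2-D (M1 x M2) cyclic convolution: NTT along M1, direct along M2."""
    # forward NTT down the M1 dimension of each operand
    X = list(zip(*[ntt(col, ROOT) for col in zip(*x)]))
    H = list(zip(*[ntt(col, ROOT) for col in zip(*h)]))
    # 'inner convolutions': one explicit M2-point cyclic conv per M1 bin,
    # i.e. the direct convolutions are done in 'M1 space'
    Y = [conv_m2(xr, hr) for xr, hr in zip(X, H)]
    # inverse NTT (inverse root, 1/M1 scale) back down the M1 dimension
    cols = [ntt(col, ROOT_INV) for col in zip(*Y)]
    return [[(v * M1_INV) % P for v in row] for row in zip(*cols)]
```

Combined with the CRT index map above, such a 2-D cyclic convolution realises the length-15 one-dimensional cyclic convolution in this toy case.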

The Snake Architecture

A major limitation of the algorithm previously described is the apparent data reordering
required prior to each transform block of figure 1. This is a consequence of the Prime Factor
Algorithm mapping the NTT index from one to many dimensions via the Chinese Remainder
Theorem (or a permutation). It is possible (11) to circumvent some (but not a great deal) of the
ordering problem by using a different mapping (a lexicographic ordering), at the expense of
requiring additional twiddle factor multiplications.
To avoid data reordering completely, consider the arithmetic unit (AU) of figure 2. It is easy to
see how the inner product of two vectors is achieved with such a unit: the partial products
are summed recursively and the result passed onto a bus (at the right time) via the
multiplexer. Concatenating a number of these units allows NTTs (matrix-vector multiplications)
to be computed. Figure 3 shows such a configuration for a three point NTT with two important
additional features: the addition of the shift register in the accumulator feedback loop, and the
fact that AU II and AU III take their data input from the previous arithmetic unit's feedback loop
and not from the block input.
The addition of a programmable length shift register allows multiplexing of NTT
computations (many matrix-vector computations simultaneously). This in turn allows NTT
blocks of figure 3 to be joined together to compute the algorithms described above without
explicit data reordering. Note also that the extra delays in the summation loop, caused by the
algorithm configuration, allow the AU block to be fully pipelined, thereby increasing the
throughput of the computational unit itself.
The ability of successive AUs to take their data from the previous AU's feedback loop is
unique to the computation of NTTs (or other matrices of the same form). This snake
architecture (follow the data path to see why it is so called) leads to a very regular systolic
design with local connectivity. The output of the block is in serial form, ready to be input to a
following transform block. To reuse this structure it is only necessary to change the coefficients
fed to the multipliers, which are known in advance and stored in ROM.
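As a behavioural sketch of the snake idea (our interpretation of figure 3, not the paper's circuit), the chain of multiply-accumulate units computing a matrix-vector product can be modelled as follows. In hardware, AU II and AU III would take each sample from the previous AU's feedback path rather than a shared variable, but the arithmetic is identical:

```python
# Behavioural sketch of the 'snake' of multiply-accumulate units: sample n
# arrives serially, AU k multiplies it by the ROM coefficient W[k][n] and
# adds it to a running sum, and after the last sample each AU holds one
# output, read out via the multiplexer -- with no data reordering.
def snake_matvec(W, x, p):
    """W: coefficient matrix (the NTT matrix, stored in ROM), x: serial
    input vector, p: the channel modulus."""
    acc = [0] * len(W)                   # one accumulator per AU
    for n, sample in enumerate(x):       # one input sample per cycle
        for k in range(len(W)):          # every AU taps the same sample
            acc[k] = (acc[k] + W[k][n] * sample) % p
    return acc                           # outputs emerge serially
```

For example, with the 3-point NTT matrix mod 7 built from the root 2 (which has order 3 mod 7), feeding in a unit impulse produces the all-ones transform, exactly as a matrix-vector product would.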
Demonstration System

To validate the above ideas, figure 5 shows a photograph of a demonstration convolver system.
Modulo AU blocks (figure 2) and programmable length shift registers were implemented on
gate arrays (figure 4). A number of these AU and latch chips are concatenated on a PCB
to realise the snake architecture (figure 3). PCBs can be used individually or concatenated to
realise NTTs of varying lengths. Although the gate arrays are fairly pedestrian, the system
performs a 210 point convolution at a 5 MHz data rate.
Conclusions
In order to obtain the maximum performance from silicon (in terms of MIPS per square
millimetre), the application, algorithms, architecture, technology and, of equal importance, the
arithmetic must be considered at the same time. The arguments presented here, leading to the
choice of algorithm and architecture, clearly demonstrate the convoluted design process
necessary.
RNS arithmetic and tailored finite field algorithms have been used to design an architecture for
the implementation of high order, high performance FIR digital filters. Specifically, a number
of relatively small moduli are used to implement long FIR filters via an algorithm derived from
the Agarwal-Cooley algorithm. Cyclic convolutions are, where possible, implemented via
Number Theoretic Transforms in the respective moduli, with Prime Factor Algorithms used to
reduce the computational expense. A key improvement of the algorithm implementation is the
elimination of explicit on-chip data reordering. The resulting systolic structure is well suited to
VLSI implementation, and a demonstration system has been constructed to prove the validity
of the ideas. Work is underway to integrate the system using full custom technology.

References
1) Kung, S.Y., VLSI Array Processors, Prentice Hall, NY, 1981

2) Nussbaumer, H.J., Fast Fourier Transform and Convolution Algorithms, Springer-Verlag, NY, 1981

3) Swartzlander, E. et al., A radix 4 delay commutator for fast Fourier transform processor implementation, IEEE Jour. of Solid State Circuits, SC-19, 702-709, 1984
4) Willey, T. et al., Systolic Implementations for Deconvolution, DFT and FFT, IEE Proc. F, 132, 6, October 1985

5) Winograd, S., On computing the discrete Fourier transform, Math. Comput. 32, 175-199 (1978)
6) Sorensen, H.V., On computing the split-radix FFT, IEEE Trans ASSP, 34, #1, Feb 1986
7) McClellan, J.H. and Rader, C.M., Number Theory in Digital Signal Processing, Prentice-Hall, Englewood Cliffs, N.J., 1979
8) Szabo, N.S. and Tanaka, R.I., Residue Arithmetic and Its Applications to Computer Technology, McGraw-Hill, New York, 1967
9) Agarwal, R.C. and Cooley, J.W., New Algorithms for Digital Convolution, IEEE Trans ASSP-25, 392-410 (1977)
10) Nussbaumer, H.J., Fast Fourier Transform and Convolution Algorithms, Springer-Verlag, NY, 1981, pp 66-79

11) Burrus, C.S., Index Mappings for Multidimensional Formulation of the DFT and Convolution, IEEE Intern. Symp. on Circuits and Systems Proc., pp 662-664

Figure 1: "Transform Convolution Transform" Processor.

Figure 2: Modulo Multiply/Accumulate Block schematic (data input, delay latches of variable number, feedback loop, multiplexer, result output).

Figure 3: Three Point 'Snake' Architecture.

Figure 4: Gate Array Implementations of RNS Arithmetic Units.

Figure 5: Demonstration RNS Convolver.
