Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review
Stanley A. White
IEEE ASSP MAGAZINE, JULY 1989. 0740-7467/89/0700-0004$1.00 © 1989 IEEE

DISTRIBUTED ARITHMETIC (DA) is so named because the arithmetic operations that appear in signal processing (e.g., addition, multiplication) are not "lumped" in a comfortably familiar fashion ("Aha, there's the multiplier over there," etc.), but are distributed in an often unrecognizable fashion. The most-often encountered form of computation in digital signal processing is a sum of products (or in vector analysis parlance, dot-product, or inner-product generation). This is also the computation that is executed most efficiently by DA.

Our motivation for using DA is its extreme computational efficiency. The advantages are best exploited in circuit design, but off-the-shelf hardware often can be configured effectively to perform DA. By careful design one may reduce the total gate count in a signal processing arithmetic unit by a number seldom smaller than 50 percent and often as large as 80 percent.

HISTORICAL PERSPECTIVE OF DISTRIBUTED ARITHMETIC

The first widely known description of DA was given at a presentation by Abraham Peled and Bede Liu on IIR digital filter mechanization in 1974 at the Arden House Workshop on Digital Signal Processing. Their work on both FIR and IIR digital filter mechanization was also published in the IEEE ASSP Transactions [1,2]. Earlier work (pre-1971) on DA had been performed in France by Croisier et al. [3]. The earliest documented work in the U.S. was done by Zohar [4,5,6], who had independently invented DA in 1968. Other early work in the United States was reported by Little [7]. From the Arden House description of DA, Bona and Terhan at Rockwell International designed an integrated-circuit DA compensator for control systems applications [8], and White generalized its control system application [9]. The April 1975 special digital-signal processing issue of IEEE Proceedings contained an article by Freeny on DA applications to the telephone system at Bell Laboratories [10] and an article by White on general vector dot-product formation using DA [11]. Later that year White et al. developed an AGM digital autopilot based on DA [12], Classen et al. at Philips in the Netherlands described communications systems applications [13], and Buttner and Schuessler at the University of Erlangen in Germany showed how to reduce the memory requirements [14]. In 1975, Kai-Ping Yiu at Hewlett-Packard showed a convenient rule for handling the sign bit [15]; H. Schroder of Siemens in Munich [16] and C. S. Burrus of Rice University [17] have given some suggestions and insight for speeding up the algorithms, and K. D. Kammeyer gave a survey/summary [18]. Mechanization studies on application of DA to digital filters were discussed by Burrus [19], Jenkins and Leon [20], Zeman and Nagle [21], Tam and Hawkins [22], Arjmand and Roberts [23] and White [24,25]. Kammeyer [26] and Taylor [27] have presented error analyses, Smith and White [28] have considered DA for coordinate transformations, and Burleson and Scharf have applied it to a rotator array [29]. Taylor applied DA to a Gray-Markel filter [30], and Zohar discussed a VLSI implementation of a correlator/filter [31]. DA is also discussed by Taylor in his text [32], in the text by Smith and Denyer [33], and by Mintzer in Elliott's handbook [34].

TECHNICAL OVERVIEW OF DISTRIBUTED ARITHMETIC

DA is basically (but not necessarily) a bit-serial computational operation that forms an inner (dot) product of a pair of vectors in a single direct step. The advantage of DA is its efficiency of mechanization. A frequently argued disadvantage has been its apparent slowness because of its inherent bit-serial nature. This disadvantage is not real if the number of elements in each vector is commensurate with the number of bits in each vector element, e.g., the time required to input eight 8-bit words one at a time in a parallel fashion is exactly the same as the time required to input (simultaneously on eight wires) all eight words serially. Other modifications to increase the speed may be made by employing techniques such as bit pairing or partitioning the input words into the most significant half and least significant half, the least significant half

of the most significant half, etc., thereby introducing a parallelism in the computation. This will be described in Section 4.

As an example of direct DA inner-product generation, consider the calculation of the following sum of products:

$$ y = \sum_{k=1}^{K} A_k x_k \tag{1} $$

The $A_k$ are fixed coefficients, and the $x_k$ are the input data words. If each $x_k$ is a 2's-complement binary number scaled (for convenience, not as necessity) such that $|x_k| < 1$, then we may express each $x_k$ as

$$ x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \tag{2} $$

where the $b_{kn}$ are the bits, 0 or 1, $b_{k0}$ is the sign bit, and $b_{k,N-1}$ is the least significant bit (LSB).

Now let us combine Equations 1 and 2 in order to express $y$ in terms of the bits of $x_k$:

$$ y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \tag{3a} $$

Equation 3a is the conventional form of expressing the inner product. Direct mechanization of this equation defines a "lumped" arithmetic computation. Let us instead interchange the order of the summations, which gives us:

$$ y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0}) \tag{3b} $$

This is the crucial step: Equation 3b defines a distributed arithmetic computation. Consider the bracketed term in Equation 3b:

$$ \sum_{k=1}^{K} A_k b_{kn} \tag{3c} $$

Because each $b_{kn}$ may take on values of 0 and 1 only, expression (3c) may have only $2^K$ possible values. Rather than compute these values on line, we may precompute the values and store them in a ROM. The input data can be used to directly address the memory and the result, i.e., the $\sum_{k=1}^{K} A_k b_{kn}$, can be dropped into an accumulator. After $N$ such cycles, the accumulator contains the result, $y$.

As an example, let $K = 4$, $A_1 = 0.72$, $A_2 = -0.30$, $A_3 = 0.95$, and $A_4 = 0.11$. The memory must contain all possible combinations ($2^4 = 16$ values) and their negatives in order to accommodate the term

$$ \sum_{k=1}^{K} A_k (-b_{k0}) $$

which occurs at the sign-bit time. As a consequence, we need to use a $2 \cdot 2^K$ word ROM. Figure 1a shows the simple structure (with a $2 \times 2^4 = 32$-word ROM) that can be used to mechanize these equations; Table 1 shows the contents of the memory. The $T_s$ signal is the sign-bit timing signal. We assume that the data on the $x_1$, $x_2$, $x_3$, and $x_4$ lines (which with $T_s$ comprise the ROM address words) are serial, 2's-complement numbers. Each is delivered in a one-bit-at-a-time (1BAAT) fashion, with LSBs $\{b_{k,N-1}\}$ first. The sign bits $\{b_{k0}\}$ are the last bits to arrive. The clock period in which the sign bits all simultaneously arrive is called the "sign-bit time." During the sign-bit time the control signal $T_s = 1$, otherwise $T_s = 0$. For the moment we will assume essentially zero time delay between the time of arrival of the address pattern to the ROM and the availability of its output. The delay around the accumulator loop is assumed to be one clock cycle and is concentrated in the summer. Switch SWA remains in Position 1 except during the clock cycle that follows the sign-bit time, when it toggles for one clock cycle to Position 2, and the fully formed result is output.

We may reduce the memory size by half to a $2^K$ word ROM by modifying the adder to an adder/subtractor and using $T_s$ as the add/subtract-control line as shown in Figure 1b. This configuration may now be mechanized with a 16-word ROM. The stored table is simply the upper half of Table 1.

The memory size may be halved again to $\tfrac{1}{2} 2^K$ words. In order to understand how this works, we shall interpret (not convert, but just interpret) the input data as being cast not in a (0,1) straight binary code, but instead as being cast in a (-1,1) offset binary code. Suppose that we think of $x_k$ as

$$ x_k = \tfrac{1}{2} \left[ x_k - (-x_k) \right] \tag{4} $$

and remember that in 2's-complement notation the negative of $x_k$ is written as

$$ -x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)} \tag{5} $$

where the overscore symbol indicates the complement of a bit. From Equations 2 and 5 we may rewrite Equation 4 as:

$$ x_k = \tfrac{1}{2} \left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \tag{6} $$

In order to simplify our notation later, it is convenient to define the new variables

$$ c_{kn} = b_{kn} - \bar{b}_{kn}, \quad n \neq 0 \tag{7} $$

and

$$ c_{k0} = -(b_{k0} - \bar{b}_{k0}) \tag{8} $$

where the possible values of the $c_{kn}$, including $n = 0$, are $\pm 1$. Now (6) may be rewritten as

$$ x_k = \tfrac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right] \tag{9} $$

By substituting (9) into (1) we obtain

$$ y = \tfrac{1}{2} \sum_{k=1}^{K} A_k \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right] \tag{10} $$

$$ y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0) \tag{11} $$

where

$$ Q(b_n) = \sum_{k=1}^{K} \frac{A_k}{2} c_{kn} \quad \text{and} \quad Q(0) = -\sum_{k=1}^{K} \frac{A_k}{2} \tag{12} $$

Notice that $Q(b_n)$ has only $2^{(K-1)}$ possible amplitude values with a sign that is given by the instantaneous combination of bits. This is consistent with our earlier claim. The computation of $y$ is mechanized using a $2^{(K-1)}$ word memory, a one-word initial condition register for $Q(0)$, and a single parallel adder/subtractor with the necessary control-logic gates. This is shown in Figure 1c, using the 8-word ROM, which contains the $Q(b_n)$.

Notice from the memory values of Table 2 that those values in the lower half under "Q" are the mirror image of the values in the upper half, but with the signs reversed. If we look at the bit patterns in the left-hand column, we discover that if we EXOR $b_{1n}$ with the remaining set of $b_{2n}$, $b_{3n}$, and $b_{4n}$, we properly address the 8-word memory to pull out the correct values of Q, except for the sign. By using the $b_{1n}$ as the add/subtract control line for the accumulator input, we also now have the proper sign. During the sign-bit time the add/subtract command must be inverted. We therefore combine the $b_{1n}$ and $T_s$ signals through an EXOR gate in order to derive the proper add/subtract control signal.

The initial-condition memory that contains the value $Q(0)$ is shown on the extreme right side of Figure 1c. When the LSBs of the $x_k$ are addressing the ROM, the value that is read out from the ROM must be corrected by the $Q(0)$ through switch SWB, which operates synchronously with switch SWA. This artifact of the binary-offset system can be seen in Equation 11. Subsequent values from the ROM are summed with the shifted previous sum. As before, we assume zero time delay between the application of the addressing bits and the availability of the contents of the ROM. There is a clock period of delay through the parallel adder, and the switches SWA and SWB are in Position 2 only for the clock cycle following the sign-bit time when $T_s = 1$. During the first clock cycle, the first output from the ROM, $Q(b_{N-1})$, is summed with $Q(0)$; during the second clock cycle it is right shifted and summed with $Q(b_{N-2})$ to produce $Q(b_{N-2}) + [Q(b_{N-1}) + Q(0)]2^{-1}$; during the third clock cycle it is again right shifted and summed with $Q(b_{N-3})$ to produce $Q(b_{N-3}) + Q(b_{N-2})2^{-1} + [Q(b_{N-1}) + Q(0)]2^{-2}$; up to the $N$th clock cycle when we add $Q(b_0)$ to the right shifted previously computed quantity to produce

$$ Q(b_0) + Q(b_1)2^{-1} + Q(b_2)2^{-2} + \cdots + Q(b_{N-2})2^{-(N-2)} + [Q(b_{N-1}) + Q(0)]2^{-(N-1)}. $$

Table 1. Contents of the 32-word memory (A1 = 0.72, A2 = -0.30, A3 = 0.95, A4 = 0.11)

  Input Code              32-Word
  Ts b1n b2n b3n b4n      Memory Contents
  0   0   0   0   0       0
  0   0   0   0   1       A4 = 0.11
  0   0   0   1   0       A3 = 0.95
  0   0   0   1   1       A3+A4 = 1.06
  0   0   1   0   0       A2 = -0.30
  0   0   1   0   1       A2+A4 = -0.19
  0   0   1   1   0       A2+A3 = 0.65
  0   0   1   1   1       A2+A3+A4 = 0.76
  0   1   0   0   0       A1 = 0.72
  0   1   0   0   1       A1+A4 = 0.83
  0   1   0   1   0       A1+A3 = 1.67
  0   1   0   1   1       A1+A3+A4 = 1.78
  0   1   1   0   0       A1+A2 = 0.42
  0   1   1   0   1       A1+A2+A4 = 0.53
  0   1   1   1   0       A1+A2+A3 = 1.37
  0   1   1   1   1       A1+A2+A3+A4 = 1.48
  1   0   0   0   0       0
  1   0   0   0   1       -A4 = -0.11
  1   0   0   1   0       -A3 = -0.95
  1   0   0   1   1       -(A3+A4) = -1.06
  1   0   1   0   0       -A2 = +0.30
  1   0   1   0   1       -(A2+A4) = +0.19
  1   0   1   1   0       -(A2+A3) = -0.65
  1   0   1   1   1       -(A2+A3+A4) = -0.76
  1   1   0   0   0       -A1 = -0.72
  1   1   0   0   1       -(A1+A4) = -0.83
  1   1   0   1   0       -(A1+A3) = -1.67
  1   1   0   1   1       -(A1+A3+A4) = -1.78
  1   1   1   0   0       -(A1+A2) = -0.42
  1   1   1   0   1       -(A1+A2+A4) = -0.53
  1   1   1   1   0       -(A1+A2+A3) = -1.37
  1   1   1   1   1       -(A1+A2+A3+A4) = -1.48

Table 2. Contents of the 8-word memory, Q (the top half of the table is stored)

  Input Code              8-Word
  b1n b2n b3n b4n         Memory Contents, Q
  0   0   0   0           -1/2(A1+A2+A3+A4) = -0.74
  0   0   0   1           -1/2(A1+A2+A3-A4) = -0.63
  0   0   1   0           -1/2(A1+A2-A3+A4) = 0.21
  0   0   1   1           -1/2(A1+A2-A3-A4) = 0.32
  0   1   0   0           -1/2(A1-A2+A3+A4) = -1.04
  0   1   0   1           -1/2(A1-A2+A3-A4) = -0.93
  0   1   1   0           -1/2(A1-A2-A3+A4) = -0.09
  0   1   1   1           -1/2(A1-A2-A3-A4) = 0.02
  1   0   0   0           1/2(A1-A2-A3-A4) = -0.02
  1   0   0   1           1/2(A1-A2-A3+A4) = 0.09
  1   0   1   0           1/2(A1-A2+A3-A4) = 0.93
  1   0   1   1           1/2(A1-A2+A3+A4) = 1.04
  1   1   0   0           1/2(A1+A2-A3-A4) = -0.32
  1   1   0   1           1/2(A1+A2-A3+A4) = -0.21
  1   1   1   0           1/2(A1+A2+A3-A4) = 0.63
  1   1   1   1           1/2(A1+A2+A3+A4) = 0.74

Figure 1a. Adder and Full Memory (32-word ROM).
Figure 1b. Adder/Subtractor and Memory (16-word ROM, top half of Table 1; sign control: 0 = add, 1 = subtract).
Figure 1c. Adder/Subtractor and Reduced Memory (8-word ROM, top half of Table 2; initial condition Q(0) = -0.74).
Figure 1. DA mechanization of y = A1x1 + A2x2 + A3x3 + A4x4 for bit serial (1 BAAT) implementation.

INCREASING THE SPEED OF DA MULTIPLICATION

One can see that ingesting the data serially, 1BAAT, results in a slow computation. If the input words are $N$ bits in length, $N$ clock cycles or periods are required in which to form the dot product. On the other hand, the equivalent of $K$ separate products are being formed. If, therefore, $K > N$, the DA processor is faster than a single parallel multiplier/accumulator.

Additional speed may be bought in two ways; one at the expense of linearly increased memory plus more arithmetic operations, the other at the expense of exponentially increased memory. The speed may be increased by a factor of $L$ by partitioning each input word into $L$ subwords ($L$ must be a factor of $N$). This effectively increases the dimension of the input vector by a factor of $L$. We can use $L$-times as many memories with an expanded-capacity accumulator for combining their results, or we can stay with a single memory, but its word capacity becomes $2^{KL}$ and the lengths of the words grow by $\log_2 L$ bits. The first approach is obvious and is shown in Figure 2. The second is described below. Both are illustrated by example.

Figure 2. 2-Memory, 2 BAAT version of Figure 1c.

We seek a computationally simple means to introduce parallelism in the mechanization of Equation 11. The summation over $n$ is next broken into two sums: the one over $n$ sums from 0 to $(N/L) - 1$, and the second one sums $l$

from 0 to $L - 1$. Now Equation 11 becomes:

[...]

where

[...]

and

[...]

Only $N/L$ rather than $N$ clock times are required to form the inner product, therefore the reader can see that we have succeeded in increasing the speed of the computation by a factor of $L$.

We can parametrically explore the issue further. The total number of bits to be loaded is $NK$, and the number of clock periods required to read in these $NK$ bits is $N/L$ clocks. The number of input lines is

$$ \frac{\text{total bits in}}{\text{number of clock periods}} = \frac{NK}{(N/L)} = KL. \tag{16} $$

There may be a relative cost

$$ w = \frac{\text{importance of minimizing pins}}{\text{importance of minimizing time}} $$

[...]

We can solve this for $L$, the number of bits at a time (per input variable) that we are trying to load:

[...]

The Gauss brackets tell us to round up to the next integer. DA is often most efficient when the number of input lines is commensurate with the number of clocks required to load the data, or equivalently, when $w = 1$. For our example

[...]

therefore, the data will be input 2 BAAT. Of each input bit pair, we identify the most significant bit (MSB) and the least significant bit (LSB). For all values of $L$, the gate at which the most significant bit appears receives special treatment. The $T_s$ control signal is EXOR'd with the MSB for sign-bit time correction (because the sign bit is the MSB of the word). In Figure 3, we can see the configuration and also see that the LSB of the input pairs of $x_1$ controls the add/subtract line.

Figure 3. Single-memory, 2 BAAT version of Figure 1c.

In order to demonstrate the validity of the approach, let us reconstruct by the same rules the structure for $L = 1$, as shown in Figure 4. Because, for $L = 1$, a 1BAAT serial input line for each $x_k$ uses the same line for the sign bit as for all other bits, the $T_s$ correction must be EXOR'd with all $x_k$.

Figure 4. DA mechanization of y = A1x1 + A2x2 + A3x3 + A4x4 for 1 BAAT mechanization similar to Figure 3, showing alternate derivation of A/S (Add/Subtract) control.

The equivalence between Figures 1c and 4 should be easy for the reader to see because the ROM
address line in Figure 4 that is driven by $x_k$ ($k = 2,3,4$) actually sees $x_k \oplus T_s \oplus (T_s \oplus x_1) = x_k \oplus x_1$. This is the same as in Figure 1c. The derivation of the A/S control lines is the same in both figures.

In Figure 5, we see a demonstration of an $L = 4$ design in order to show the structure. The memory cost is extremely great for such a small computation, but it is presented to illustrate the principle. Some more practical cases are shown below.

Figure 5. DA mechanization of y = A1x1 + A2x2 + A3x3 + A4x4 for 4 BAAT implementation using single Adder/Subtractor and minimum memory (32K-word ROM).

APPLICATION OF DA TO A BIQUADRATIC DIGITAL FILTER (AN EXAMPLE OF VECTOR DOT-PRODUCT AND VECTOR-MATRIX-PRODUCT MECHANIZATION)

A typical biquadratic digital filter has a transfer function of the form

$$ \frac{Y(z)}{X(z)} = \frac{A_0 + A_1 z^{-1} + A_2 z^{-2}}{1 - B_1 z^{-1} - B_2 z^{-2}} \tag{20} $$

where the poles are determined by the $B_1$ and $B_2$ and the gain and zeros are determined by the $A_0$, $A_1$, and $A_2$. The time-domain description is

$$ y_n = [A_0 \; A_1 \; A_2 \; B_1 \; B_2] \, [x_n \; x_{n-1} \; x_{n-2} \; y_{n-1} \; y_{n-2}]^T \tag{21} $$

where the coefficient vector is $[A_0 \; A_1 \; A_2 \; B_1 \; B_2]^T$ and the data vector is $[x_n \; x_{n-1} \; x_{n-2} \; y_{n-1} \; y_{n-2}]^T$. A direct DA mechanization of the filter is shown in Figure 6. Notice the extreme economy of this filter. This figure differs slightly from those that we have seen above. We have a 5-dimensional input vector to the addressing logic, but a scalar input, $x_n$, and a scalar output, $y_n$, for the processor.

We have added a pair of delays (the $z^{-1}$ blocks) in the input-signal path so that we could develop the delayed signals, $x_{n-1}$ and $x_{n-2}$, from the input, $x_n$; and we have added a third delay block so that we could develop the delayed output, $y_{n-2}$, from $y_{n-1}$. We obtain $y_{n-1}$ from a parallel-to-serial register (P/S), since the output from the summer in the accumulator loop is a parallel word. The contents of the 16-word ROM are shown in Table 3. We may, of course, use parallelism to increase the speed, as was discussed in the previous section.

Table 3. Contents of the 16-word memory, Q

  Input Code      Contents of 16-Word Memory, Q
  0  0  0  0      -1/2(A0+A1+A2+B1+B2)
  0  0  0  1      -1/2(A0+A1+A2+B1-B2)
  0  0  1  0      -1/2(A0+A1+A2-B1+B2)
  0  0  1  1      -1/2(A0+A1+A2-B1-B2)
  0  1  0  0      -1/2(A0+A1-A2+B1+B2)
  0  1  0  1      -1/2(A0+A1-A2+B1-B2)
  0  1  1  0      -1/2(A0+A1-A2-B1+B2)
  0  1  1  1      -1/2(A0+A1-A2-B1-B2)
  1  0  0  0      -1/2(A0-A1+A2+B1+B2)
  1  0  0  1      -1/2(A0-A1+A2+B1-B2)
  1  0  1  0      -1/2(A0-A1+A2-B1+B2)
  1  0  1  1      -1/2(A0-A1+A2-B1-B2)
  1  1  0  0      -1/2(A0-A1-A2+B1+B2)
  1  1  0  1      -1/2(A0-A1-A2+B1-B2)
  1  1  1  0      -1/2(A0-A1-A2-B1+B2)
  1  1  1  1      -1/2(A0-A1-A2-B1-B2)

There is another important (because of low roundoff noise, low coefficient sensitivity and favorable limit-cycle behavior) form of digital filter (the normal form) that serves to illustrate the vector-matrix form of DA, and which we will now discuss. An excellent tradeoff study was recently presented by Barnes [35] to show how one could design a normal-form second order digital filter to meet prescribed performance criteria. In this section, an extremely efficient set of realizations is shown, one in which the speed and complexity can be effectively traded. (Much of the following is taken from References 36 and 37.)

A block diagram of the normal-form structure is shown in Figure 7. The multipliers $b_1$, $b_2$, $c_1$, and $c_2$ determine the pole locations of the filter. The input multipliers $a_1$ and $a_2$ are used to determine the input scaling and the multipliers $a_0$, $d_1$, and $d_2$ to determine the zero locations of the filter. There are nine multipliers in total, as compared to the five that are required in the so-called direct mechanization. Barnes shows procedures whereby one may reduce this number of multipliers; however, we shall show how to eliminate them.

The vector matrix equation that describes the configuration of Figure 7 is given below:

[...] (22)

The relationships between Equations 20 and 22 are

$$
\begin{aligned}
A_0 &= a_0 \\
A_1 &= a_1 d_1 + a_2 d_2 - a_0 (b_1 + b_2) \\
A_2 &= a_0 (b_1 b_2 - c_1 c_2) + a_1 (c_2 d_2 - b_2 d_1) + a_2 (c_1 d_1 - b_1 d_2) \\
B_1 &= b_1 + b_2 \\
B_2 &= -b_1 b_2 + c_1 c_2 .
\end{aligned}
\tag{23}
$$

Figure 8. DA realization of state-space filter of Figure 7.

There are three common inputs to each of the "clumps" of Figure 7; so, for each output, we need to store $\tfrac{1}{2}(2^3) = 4$
words. If the words that are stored are 16 bits long, then our three outputs together call for 3 x 16 = 48 bits per stored word. The total number of bits stored, however, is a modest 4 x 48 = 192. The somewhat detailed DA realization is illustrated in Figure 8. Notice how the addressing section has reduced to a pair of EXOR gates. The 4-word by 48-bit memory is shown for clarity as comprising three memories. In fact, the ROM may physically consist of three identically addressed 4-word by 16-bit memories or a single 4-word extended length (48-bit) memory. Each 16-bit output segment drives a separate accumulator loop, each with its own initial-condition register, just as we encountered earlier. The add/subtract control line is common, and the outputs of two of the accumulator loops are converted to serial form in their parallel-to-serial registers to be fed back to the memory addressing gates.

In order to simplify our subsequent development, it will be useful to redraw Figure 8 as shown in Figure 9. We can see the essence and utter simplicity of the structure, which is somewhat startling when one realizes that this is a realization of the 9-multiplier configuration of Figure 7.

Figure 10 shows a factor-of-four speedup over the circuit of Figure 9 by using the data 4 bits at a time (4BAAT). The parallel-to-serial registers shown do not output a serial data stream, but rather provide a sequence of four 4-bit wide segments. The time required to perform the filtering function has been reduced to 4 clock periods. This increase in speed demands that the ROM size be increased to $\tfrac{1}{2}(2^3)^4 = 2048$ words. One would like to be able to make the throughput rate equal to the clock rate, rather than just a quarter of that rate. By quadrupling the memory of the configuration of Figure 10 and complicating the 3 adders, we can create a very fast but memory-hungry (8K word by 48 bit) filter structure that can perform the filtering function in a single clock period, as shown in Figure 11.

Figure 10. Single memory 4 BAAT DA structure, four clock periods per filter function (2048-word ROM, 48 bits per word).

An alternate approach, which is shown in Figure 12, may be used in which eight ROMs are addressed by the three data streams, 2 bits at a time (2BAAT). Each memory is now a more modest $\tfrac{1}{2}(2^3)^2 = 32$ words x 48 bits for a total memory requirement of 8 x 32 x 48 = 12,288 bits. The three adders are each of a complexity less than that of a 16-bit modified-Booth multiplier. In this approach, we have reduced the memory size at the expense of complicating the adders.

The outputs from the memories are 16 (or whatever,
say $N$, number of) bits. In the shift-and-add process that occurs in the accumulators, the accuracy of the results degrades because of the least-significant bits that are lost through the trauma of quantization. Figure 13 illustrates the problem; as the data circulates through the accumulator loop, LSBs are lost at the shift stage. (These lost bits are often modeled as an additive error.) From the point in the filter that this error is introduced (the accumulator outputs), we follow them around the recursive loop of the filter and see the reinforcement addition that is the cause of noise gain within recursive filters. We will create a parallel error path, now, with error subtractions that will nearly cancel the error additions.

Figure 14 shows the state-space filter of Figure 7 with noise sources (quantization effects) $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ added at the outputs of the accumulators. We have also added, within the dotted lines, noise-cancellation paths whose gains are shift approximations to the actual gains. This technique is known as error feedback and has long been used in DA filters [39]. We mechanize the noise cancellation paths by modifying Figure 8 as shown in Figure 15. Since at most the tilde-marked gains consist of a shift, those gains shown in Figure 15 are physically trivial. They feed a serial adder in which they are bit-by-bit combined with the initial conditions (ICs), then loaded into a serial-to-parallel register; that register now contains a chimerical initial condition, which includes the error feedback. If the entity within the dotted line of Figure 15 is redefined as the adder and parallel-to-serial register, then Figure 9 is valid for the error-feedback mechanization. If the adders that are fed by the ICs are L BAAT adders, then this structure can be generalized for use with structures that are shown in Figures 10-12.

Figure 12. Eight memory 2 BAAT DA structure, one clock period per filter function.

APPLICATIONS IN TRANSFORMERS

DA has found important use in low-complexity, high-performance FFT structures [39,40,41]. The effective use
of DA in the mechanization of a simple, direct, high-performance complex multiplier has been the reason for its success in FFT applications, and has spurred additional work in the development of efficient complex multipliers [42]. The development path for complex multipliers took an interesting twist with the advent of radix-3 and radix-6 FFTs [43,44]. Their charm is that they give larger-radix multiplier-free building blocks that lead to increased efficiency in transform processing. By the simple expedient of turning to non-orthogonal coordinate systems for complex arithmetic, the resulting FFT structures [45] gave birth to a new complex multiplier [46] to perform the reduced number of "twiddles" on the interstage coupling.

Figure 15. DA mechanization of Figure 14.

Image processing has turned to the discrete-cosine
transform (DCT) for more efficient processing. There, again, DA has found a home [47,48].

NONLINEAR AND/OR NONSTATIONARY PROCESSING WITH DA

We have only considered the use of DA in linear, time-invariant systems. It is not so restricted. For variable coefficients, we may use RAMs rather than ROMs. In fact, one of the trailblazers in this approach was Schroder [16]. In 1981 Cowan and Mavor [49] described an 8-tap adaptive transversal filter that employed DA, and in 1983 an expanded version was published by Cowan, Smith, and Elliott [50]. Andrews [51] compared it favorably in a mechanization study involving traditional and nontraditional arithmetic mechanizations of adaptive filters. The capacity of the DA adaptive structure can be increased by using block-processing concepts [52,53].

The mechanization of nonlinear difference equations by DA was presented by Sicuranza in [54], who represented the filter by a truncated discrete Volterra series. Chiang et al. in [55] report the results of a mechanization trade-off study of various implementations of quadratic filters. They conclude that DA is as efficient as matrix decomposition implemented with systolic arrays, but that DA lacks the modularity. Satisfactory DA adaptive nonlinear filters have been investigated and reported by Sicuranza and Ramponi [56], and by Smith et al. [57].

CONCLUSIONS

DA is a very efficient means to mechanize computations that are dominated by inner products. As we have seen, the coefficients of the equations can be time varying, and the equations themselves can be nonlinear. When a great many computing methods are compared, DA has always fared well, not always (but often) best, and never poorly. As a consequence, whenever the performance/cost ratio is critical (especially in custom designs), DA should be seriously considered as a contender.

ACKNOWLEDGMENTS

I want to express my thanks to the many workers in the field who have quickly and generously shared their results with me; to my coworkers who, over the years, have patiently simulated, analyzed, built, and tested endless DA concepts, and corrected my errors; and to the reviewers who offered very helpful suggestions and led me to references that were new and unfamiliar to me.
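As a concrete illustration of the bit-serial mechanization described under TECHNICAL OVERVIEW OF DISTRIBUTED ARITHMETIC (the adder/subtractor of Figure 1b with the upper half of Table 1), the following Python sketch may be helpful. It is not taken from the article: the function names `build_rom`, `to_bits`, and `da_dot`, the use of exact `Fraction` arithmetic in place of fixed-width registers, and the 8-bit word length are assumptions made purely for illustration.

```python
from fractions import Fraction

def build_rom(coeffs):
    """Hypothetical helper: 2**K-word table of the sums in expression (3c).
    The bit for the first coefficient is the most significant address bit."""
    K = len(coeffs)
    rom = []
    for addr in range(2 ** K):
        bits = [(addr >> (K - 1 - k)) & 1 for k in range(K)]
        rom.append(sum((a for a, b in zip(coeffs, bits) if b), Fraction(0)))
    return rom

def to_bits(x_int, n_bits):
    """Two's-complement bits; index 0 is the sign bit, index N-1 the LSB."""
    word = x_int & ((1 << n_bits) - 1)
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def da_dot(coeffs, x_ints, n_bits):
    """Bit-serial (1BAAT) inner product; each x value is x_int / 2**(n_bits-1)."""
    rom = build_rom(coeffs)
    bits = [to_bits(x, n_bits) for x in x_ints]
    acc = Fraction(0)
    for n in range(n_bits - 1, -1, -1):      # LSBs first, sign bits last
        addr = 0
        for word_bits in bits:
            addr = (addr << 1) | word_bits[n]
        term = rom[addr]
        if n == 0:                           # sign-bit time: subtract
            acc = acc / 2 - term
        else:                                # all other bit times: add
            acc = acc / 2 + term
    return acc

# Coefficients of the article's example; the data words are arbitrary.
A = [Fraction(s) for s in ("0.72", "-0.30", "0.95", "0.11")]
x = [100, -128, 37, -1]                      # 8-bit words, value = integer / 128
direct = sum(a * Fraction(v, 128) for a, v in zip(A, x))
assert da_dot(A, x, 8) == direct
```

The right shift of the accumulator is modeled as an exact division by two, so the sketch ignores the quantization effects that the article discusses in connection with Figure 13.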
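The offset-binary reduction of Equations 4 through 12 (the reduced memory of Figure 1c and the top half of Table 2) can be sketched in the same way. Again this is only an illustrative sketch under stated assumptions: the names `build_reduced_rom` and `da_dot_offset_binary` are hypothetical, exact `Fraction` arithmetic stands in for hardware registers, and the switches SWA and SWB are modeled simply by loading Q(0) on the first clock cycle.

```python
from fractions import Fraction

def build_reduced_rom(coeffs):
    """Hypothetical helper: 2**(K-1) words of Q for b_1 = 0 (top half of Table 2)."""
    K = len(coeffs)
    rom = []
    for addr in range(2 ** (K - 1)):
        bits = [0] + [(addr >> (K - 2 - j)) & 1 for j in range(K - 1)]
        # Offset-binary interpretation: a 1 bit counts as +1, a 0 bit as -1.
        rom.append(sum(a * (2 * b - 1) for a, b in zip(coeffs, bits)) / 2)
    return rom

def da_dot_offset_binary(coeffs, x_ints, n_bits):
    """1BAAT inner product with the halved memory; x value = x_int / 2**(n_bits-1)."""
    K = len(coeffs)
    rom = build_reduced_rom(coeffs)
    q0 = -sum(coeffs) / 2                    # initial-condition register Q(0)
    words = [x & ((1 << n_bits) - 1) for x in x_ints]
    acc = None
    for n in range(n_bits - 1, -1, -1):      # LSBs first, sign bits last
        b = [(w >> (n_bits - 1 - n)) & 1 for w in words]
        addr = 0
        for k in range(1, K):                # EXOR b_1 with the remaining bits
            addr = (addr << 1) | (b[k] ^ b[0])
        ts = 1 if n == 0 else 0              # sign-bit timing signal
        subtract = b[0] ^ ts                 # add/subtract control line
        term = -rom[addr] if subtract else rom[addr]
        if acc is None:
            acc = q0 + term                  # first cycle: correct by Q(0)
        else:
            acc = acc / 2 + term             # later cycles: shift and accumulate
    return acc

A = [Fraction(s) for s in ("0.72", "-0.30", "0.95", "0.11")]
x = [100, -128, 37, -1]                      # 8-bit words, value = integer / 128
direct = sum(a * Fraction(v, 128) for a, v in zip(A, x))
assert da_dot_offset_binary(A, x, 8) == direct
```

With the coefficients of the article's example, the eight stored words agree with the top half of Table 2, and only a single EXOR per address line plus one EXOR for the add/subtract control is needed in front of the memory.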

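The first speed-up approach described under INCREASING THE SPEED OF DA MULTIPLICATION (L times as many memories, here L = 2) can be sketched as follows. This is a simplified illustration rather than the circuit of Figure 2: it assumes straight-binary tables like those of Figure 1b instead of the offset-binary tables, the names `build_rom` and `da_dot_2baat` are hypothetical, and the word length is assumed to be even.

```python
from fractions import Fraction

def build_rom(coeffs):
    """Hypothetical helper: 2**K-word table of the sums in expression (3c)."""
    K = len(coeffs)
    rom = []
    for addr in range(2 ** K):
        bits = [(addr >> (K - 1 - k)) & 1 for k in range(K)]
        rom.append(sum((a for a, b in zip(coeffs, bits) if b), Fraction(0)))
    return rom

def da_dot_2baat(coeffs, x_ints, n_bits):
    """Two bits at a time with two identical memories.
    Returns (inner product, number of clock periods used)."""
    if n_bits % 2:
        raise ValueError("n_bits must be even for 2 BAAT operation")
    rom = build_rom(coeffs)                  # same table in both memories
    words = [x & ((1 << n_bits) - 1) for x in x_ints]

    def address(n):
        """ROM address formed from bit n (0 = sign bit) of every input word."""
        addr = 0
        for w in words:
            addr = (addr << 1) | ((w >> (n_bits - 1 - n)) & 1)
        return addr

    acc = Fraction(0)
    clocks = 0
    for p in range(n_bits // 2 - 1, -1, -1): # least significant pair first
        hi = rom[address(2 * p)]             # more significant member of the pair
        lo = rom[address(2 * p + 1)]         # less significant member of the pair
        if p == 0:
            hi = -hi                         # the sign bit is the MSB of the last pair
        acc = acc / 4 + hi + lo / 2          # shift by two bit positions per clock
        clocks += 1
    return acc, clocks

A = [Fraction(s) for s in ("0.72", "-0.30", "0.95", "0.11")]
x = [100, -128, 37, -1]                      # 8-bit words, value = integer / 128
y, clocks = da_dot_2baat(A, x, 8)
assert y == sum(a * Fraction(v, 128) for a, v in zip(A, x))
assert clocks == 4                           # N/L = 8/2 clock periods
```

The sketch shows the trade described in the text: memory grows linearly (two tables instead of one) and the combining arithmetic gains one extra addition per clock, while the number of clock periods drops from N to N/L.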
REFERENCES

[1] A. Peled and B. Liu, "A New Approach to the Realization of Nonrecursive Digital Filters," IEEE Trans. Audio and Electroacoustics, Vol. AU-21, No. 6, pp. 477-485, December 1973.
[2] A. Peled and B. Liu, "A New Hardware Realization of Digital Filters," IEEE Trans. on A.S.S.P., Vol. ASSP-22, pp. 456-462, December 1974.
[3] A. Croisier, D. J. Esteban, M. E. Levilion, and V. Rizo, "Digital Filter for PCM Encoded Signals," U.S. Patent 3,777,130, December 3, 1973.
[4] S. Zohar, "New Hardware Realization of Nonrecursive Digital Filters," IEEE Trans. on Computers, Vol. C-22, pp. 328-338, April 1973.
[5] S. Zohar, "The Counting Recursive Digital Filter," IEEE Trans. on Computers, Vol. C-22, pp. 338-347, April 1973.
[6] S. Zohar, "A Realization of the RAM Digital Filter," IEEE Trans. on Computers, Vol. C-25, pp. 1048-1052, October 1976.
[7] W. D. Little, "A Fast Algorithm for Digital Filters," IEEE Trans. on Communications, Vol. C-23, pp. 466-469, May 1974.
[8] B. E. Bona and F. C. Terhan, "A Special-Purpose Digital Control Compensator," Proc. 8th Asilomar Conference on Circuits, Systems, and Computers, Pacific Grove, California, pp. 454-457, December 1974.
[9] S. A. White, "Applications of Digital Signal Processing to Control Systems," Proc. 8th Asilomar Conference on Circuits, Systems, and Computers, Pacific Grove, California, pp. 278-284, December 1974.
[10] S. L. Freeny, "Special-Purpose Hardware for Digital Filtering," Proc. IEEE, Vol. 63, pp. 633-648, April 1975.
[11] S. A. White, "On Mechanization of Vector Multiplication," Proc. IEEE, Vol. 63, pp. 730-731, April 1975.
[12] S. A. White, W. P. Engler, J. P. Davis, S. L. Smith, J. V. Henning, "A Programmable Digital Servo Controller," Invention Disclosure PF75E145, October 16, 1975.
[13] T. R. C. M. Classen, W. F. G. Mecklenbrauker, and I. D. H. Peek, "Some Considerations on the Implementation of Digital Systems for Signal Processing," Philips Research Report 30, pp. 73-84, 1975.
[14] M. Buttner and H. W. Schuessler, "On Structures for the Implementation of the Distributed Arithmetic," NTZ Communication Journal, Vol. 6, June 1975.
[15] Kai-Ping Yiu, "On Sign-Bit Assignment for a Vector Multiplier," Proc. IEEE, Vol. 64, pp. 372-373, March 1976.
[16] H. Schroder, "High Word-Rate Digital Filters with Programmable Table Look-Up," IEEE Trans. on Cir-
[...] Theory and Design, Genoa, Italy, September 1976.
[19] C. S. Burrus, "Digital Filter Structures Described by Distributed Arithmetic Filters," IEEE Trans. on Circuits and Systems, Vol. CAS-24, No. 12, pp. 674-680, December 1977.
[20] W. K. Jenkins and B. J. Leon, "The Use of Residue Number System in the Design of Finite Impulse Response Filters," IEEE Trans. on Circuits and Systems, Vol. CAS-24, No. 4, April 1977.
[21] I. Zeman and H. T. Nagle, Jr., "A High-Speed Microprogrammable Digital Signal Processor Employing Distributed Arithmetic," IEEE Trans. on Computers, Vol. C-29, No. 2, pp. 134-144, February 1980.
[22] B. S. Tam and G. J. Hawkins, "Speed-Optimized Microprocessor Implementation of a Digital Filter," IEEE Proc., Vol. 69, No. 3, pp. 85-93, May 1981.
[23] M. Arjmand and R. A. Roberts, "On Comparing Hardware Implementations of Fixed-Point Digital Filters," IEEE Circuits and Systems Magazine, Vol. 3, No. 2, 1981, pp. 2-8.
[24] S. A. White, "Architecture for a Digital Programmable Image Processing Element," Proc. 1981 IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, pp. 658-661, March 30-April 1, 1981.
[25] S. A. White, "An Architecture for a High-Speed Digital Signal Processing Device," Proc. 1981 IEEE International Symposium on Circuits and Systems, Chicago, pp. 893-896, April 28-29, 1981.
[26] K. D. Kammeyer, "Quantization Error of the Distributed Arithmetic," IEEE Trans. on Circuits and Systems, Vol. CAS-24, No. 12, pp. 681-689, December 1977.
[27] F. J. Taylor, "An Analysis of the Distributed-Arithmetic Digital Filter," IEEE Trans. on A.S.S.P., Vol. ASSP-35, No. 5, pp. 1165-1170, Oct. 1986.
[28] S. G. Smith and S. A. White, "Hardware Approaches to Vector-Plane Rotation," Proc. 1988 IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, pp. 212-213, April 11-14, 1988.
[29] W. P. Burleson and L. L. Scharf, "A VLSI Implementation of a Cellular Rotator Array," Proc. 1988 IEEE Custom Integrated Circuits Conference, pp. 8.1.2-8.1.4.
[30] F. J. Taylor, "A Distributed Gray-Markel Filter," IEEE Trans. on A.S.S.P., Vol. ASSP-31, No. 3, pp. 761-763, June 1983.
[31] S. Zohar, "A VLSI Implementation of a Correlator/Digital Filter Based on Distributed Arithmetic," IEEE Trans. on A.S.S.P., Vol. ASSP-37, No. 1, pp. 156-160, Jan. 1989.
[32] F. J. Taylor, Digital Filter Design Handbook, Marcel
cuits and Systems, Vol. CAS-24, No. 5, pp. 277-279, Dekker, Inc., pp. 678-697, June 1983.
May 1977. [33] S. G. Smith and P. B. Deyer, Serial-Data Computation,
[I71 C. S. Burrus, ”Digital Filter Realization by Distributed Kluwer Academic Publishers, 1988.
Arithmetic,” International Symposium o n Circuits [34] D. F. Elliott (ed.), Handbook of Digital Signal Process-
and Systems, Munich, April 1976. ing, Academic Press, pp. 964-972, 1987.
[I81 K. D. Kammeyer, ”Digital Filter Realization in Distrib- [351 C. W. Barnes, “A Parametric Approach t o the Realiza-
uted Arithmetic,” Proc. European Conf. on Circuit tion of Second-Order Digital-Filter Sections,” / E € €
18 ~ E E EASSP MAGAZINE JULY 1989

Trans. on Circuits and Systems, Vol. CAS-32, No. 6, [501 C. F. N. Cowan, S. G. Smith, and J. H. Elliott, “A Digi-
pp. 530-539, June 1985. tal Adaptive Filter Using a Memory-Accumulator Ar-
S. A. White, “High-speed Distributed-Arithmetic Re- chitecture: Theory and Realization,” /E€€ Trans. on
alization of a Second-Order Normal-Form Digital Fil- A.S.S.P., Vol. ASSP-31, No. 3, pp. 541-549, June 1983.
ter,“ l€€€ Trans. on Circuits and Systems, Vol. CAS-33, [511 M. Andrews, “A Systolic SBNR Adaptive Signal Pro-
No. I O , pp. 1036-1038, October 1986. cessor,” /€€€/. Solid State Circuits, Vol. SC-21, No. 1,
S. A. White, “High-speed Distributed-Arithmetic Re- pp. 120-128, Feb. 1986.
alization of a Second-Order State-Space Digital Fil- [521 C. -H. Wei, and J.-J.Lou, “Multimemory Block Struc-
ter,” Proc. 20th Asilomar Conference on Signals, ture for Implementing a Digital Adaptive Filter Using
Systems, a n d Computers, Pacific Grove, California, Distributed Arithmetic,” I€€ Proc., Vol. 133, Pt. G,
pp. 359-362, November IO-12,1986. No. 1, pp. 19-26, Feb. 1986.
T. L. Chang and S.A. White, “An Error Cancellation [531 Y.-Y. Chiu and C.-H. Wei, “On the Realization of
Digital-Filter Structure and Its Distributed Arithmetic Multimemory Block Structure Digital Adaptive Filter
Implementation,” /€E€ Trans. on Circuits a n d Sys- Using Distributed Arithmetic,” 1. Chinese Institute o f
tems, Vol. CAS-28, No. 4, pp. 339-342, April 1981. Engineers, Vol. I O , No. 1, pp. 115-122, 1987.
[391 S. A. White, “A Simple FFT Butterfly Arithmetic 1541 G.C. Sicuranza, “Nonlinear Digital-Filter Realization
Unit,” I€€€ Trans. on Circuits and Systems, Vol. CAS- by Distributed Arithmetic,” /€€E Trans. o n A.S.S.P.,
28, No. 4, pp. 352-366, April 1981. Vol. ASSP-33, NO.4, pp. 939-1321, Aug. 1985.
[401 I. R. Mactaggart and M.A. Jack, “A Single Chip Radix-2 [551 H. -H. Chiang, C. L. Nikias, and A. N. Venetsanopou-
FFT Butterfly Architecture Using Parallel Data Dis- Ios, ” Efficient Im p Ie m entat ions of Quad rat ic Digital
tributed Arithmetic,” /E€€ /. Solid State Circuits, Filters,” I € € € Trans. o n A.S.S.P., Vol. ASSP-34, No. 6,
Vol. SC-19, pp. 368-373, June 1984. pp. 1511-1528, Dec. 1986.
S.A. White, “An Architecture for a 16-Point FFT GaAs [561 G . L. Sicuranza, “Adaptive Nonlinear Digital Filters
Building Block,” Proc. 21st Asilomar Conference o n U s i n g D i s t r i b u t e d A r i t h m e t i c , ” /€E€ Trans. on
Signals, Systems, a n d Computers, Pacific Grove, A.S.S.P., Vol. ASSP-34, No. 3, pp. 518-526, June 1986.
California, pp. 928-932, November 1987. [57] M. J. Smith, C. F. N. Cowan, and P. F. Adams, “Non-
S. G. Smith and P. B. Denyer, “Efficient Bit-Serial linear Echo Cancellers Based on Transpose Distrib-
Complex Multiplication and Sum-of-Products Com- uted Arithmetic,” I € € € Trans. on Circuits and Systems,
putation Using Distributed Arithmetic,” Proc. 1986 CAS-35, No. 1, pp. 6-18, Jan. 1988.
lnternational Conference on Acoustics, Speech, and
Stanley A. White (S’55, M’57, SM’69, F’82) was
Signal Processing, Tokyo, pp. 2203-2206, April 1986.
b o r n i n Providence, Rhode Island i n 1931. He
E. Dubois and A. N. Venetsanopoulos, “A New Algo- i s Senior Scientist at Rockwell International
rithm for the Radix-3 FFT,” l€€€ Trans. o n A.S.S.P., Corporation’s Autonetics Electronics Systems
Vol. ASSP-26, No. 3, pp. 222-225, June 1978. i n Anaheim, California, and Adjunct Profes-
S. Prakash and V.V. Rao, “A New Radix-6 FFT Algo- sor of Electrical Engineering at t h e University
o f California, Irvine. He received his B.S.E.E.,
rithm,” l€€€ Trans. on A.S.S.P., Vol. ASSP-29, No. 4,
M.S.E.E., and Ph.D. ( w i t h letter of commen-
pp. 939-941, August 1981. d a t i o n ) degrees f r o m Purdue University i n
S.A. White, “Results of a Preliminary Study of a 1957, 1959, and 1965, respectively. He i s also a
Novel IC Arithmetic Unit for an FFT Processor,” Proc. Fellow of t h e American Association f o r the Advancement of Science
18th Asilomar Conference on Circuits, Systems, and (AAAS), t h e N e w York Academy o f Sciences (NYAS), and the Insti-
tute for the Advancement o f Engineering (IAE). He i s a member (and
Computers, Pacific Grove, California, pp. 67-71, long-time Autonetics Chapter President) o f Sigma Xi, Tau Beta Pi,
November 5-7,1984. the Audio Engineering Society, and has been International Director
S. A. White, “A Complex Multiplier for Binary Two’s- of Eta Kappa Nu. H e received t h e Purdue University Distinguished
Complement Numbers,” U.S. Patent No. 4,680,727, Engineering Alumnus Award i n 1988, t h e Leonard0 Da Vinci Medal-
July 14, 1987. l i o n i n 1986, and was selected as Rockwell International Engineer o f
the Year i n 1985. I n 1984 he received b o t h the Engineer of the Year
N. Demassieux, G. Concordel, J.-P. Durandeau, and
Award f r o m t h e Orange C o u n t y (California) Engineering Council
F. Jutand, “An Optimized VLSl Architecture for a and the IEEE Centennial Medal. He has held several distinguished
Multiformat Discrete Cosine Transform,” Proc. 1987 lecturerships and received the NEC Distinguished Lecturer Award.
I € € € Interngtional Conference on Acoustics, Speech, Dr. White has over 100 publications i n the open literature, holds
Signal Processing, pp. 547-550. 30 U.S. patents ( w i t h more pending issue), and is a registered Pro-
fessional Engineer in b o t h California and Indiana.
A.M. Gottlieb, M.T. Sun, and T. C. Chen, “A Video He i s General Chairman of ISCAS ’92; was General Chairman o f
Rate 16 x 16 Discrete Cosine Transform IC,” Proc. ICASSP ’84, Vice Chairman o f ISCAS ‘83; was b o t h General Chair-
1988 I € € € Custom Integrated Circuits Conference, man of the Asilomar Conference o n Circuits, Systems, and Computers,
pp. 8.2.1-8.2.4. and Technical Program Chairman of the IEEE Region 6 Conference
C. F. N. Cowan and J. Mavor, “New Digital-Adaptive i n 1982; and was Technical Program Chairman of the Asilomar Con-
ference i n 1981. Dr. W h i t e is listed i n American M e n a n d Women of
Filter Implementation Using Distributed-Arithmetic
Science, Who‘s Who in Biomedical Engineering, Who‘s Who in Engi-
Techniques,” I€€ Proc., V o l . 128, Pt. F, No. 4, neering, Who’s W h o in America, Who‘s W h o in The World, and
pp. 225-230, Aug. 1981. other standard biographical reference works.


IEEE ASSP Magazine, July 1989. 0740-7467/89/0700-0004$1.00 © 1989 IEEE
