S. Sarkar
A.K. Majumdar
"=O
where $w = e^{-j2\pi/N}$.
The DG for an N-point FFT (we assume $N = 2^m$, where m is an integer) is shown in Fig. 1 (the constants involved in the computation have been ignored). The computation represented by the DG can be expressed by a set of recurrence equations.
If k = 1, $1 \le i \le N$ then
domain D, containing all the dependence vectors for all values of the parameter k, is given by $D = \{(k, i) \mid k > 0,\ -m < i < m\}$. Clearly, the cone C does not exist in this case. However, if the DG is reindexed such that an index point (k, i) in the original DG is reindexed to an index point $(k,\ i + 2^{k-1} - 1)$, the cone C exists. Fig. 2 shows the DG after reindexing. Fig. 3 shows the cone C and vectors $B_1$ and $B_2$. The set of recurrence eqns. 1 to 4 is modified accordingly.
If k = 1, $2^{k-1} \le i \le N + (2^{k-1} - 1)$ then
Fig. 2 Reindexed DG
Fig. 3 Cone C and vectors $B_1$ and $B_2$
Fig. 4
Fig. 5
(2N - 1) PE systolic array
(N - 1) PEs. The array is linear and uses tags for routing
receiving it and copies the data if it is meant for it. In the next time-step, the piece of data passes into the next PE.
Fig. 6 Tagged systolic array for N = 8 FFT
A PE of the TSA derived for the FFT is shown in Fig. 7. Because most high-speed A/D converters are lower in precision, the input data are assumed to be 16-bit fixed-point words. Each PE executes the following program in one time-step.
Fig. 7 PE for computation of FFT using tagged systolic array
Legend: multiplier; adder; subtractor; A, B, C, D: register pairs; w (= w1 + j.w2): twiddle factor; T1, ..., T6: tag registers
Line no.
1 begin time-step
2   send output data 1 and output data 2
3   receive input data 1 and input data 2
4   for all tags of input data
5     decrement tag by 1
6     if tag = 0 and count = 0
7       copy data to register B
8       count = 1
9     else if tag = 0 and count = 1
10      copy data to register A
11      count = 2
12    end if
13    end if
14  end for
15  if count = 2
16    compute
17      output data 1 = A + w'B
18      output data 2 = A - w'B
19    count = 0
20  else
21    output data 1 = input data 1
22    output data 2 = input data 2
23  end if
24 end time-step
When the computation is started, the value of count is set to zero. Before starting computation, the values of the tag constants and the constant involved in the computation are set.
4 Performance analysis
The (N - 1) PE TSA takes (N + log N) time-steps to complete the computation of an N-point FFT. The speed-up S is given by (N log N)/(N + log N). The average processor utilisation decreases with the increase in size of the FFT. The block pipelining period is N. Fig. 8 plots S against PE number.
The performance of an (N - 1) PE TSA can be compared with that of an (N log N/2) PE network using butterfly interconnection. A PE of the latter type is shown in Fig. 9. Fig. 7 shows the PE of a TSA. To compare the two designs based on the chip area required for computing an N-point FFT, the approximate number of gates (estimated from the data path synthesis) required for a PE is taken as the basis. The complexity of the controller is measured in terms of the number of basic operations it performs. The additional controller complexity of the TSA for FFT results from lines 4 to 15 and 19 to 23 of the program that each PE executes. The approximate gate count (from the data path synthesis) of the TSA is 5400 and that for a butterfly PE is 4500. Thus an optimistic assumption would be that a PE of TSA takes at most twice the area required for a processor employing butterfly interconnection for computing an FFT. However, because of the nonlocal and nonregular interconnection employed in the latter case, the total chip area required is approximately double that required for processors only. If we assume that a butterfly processor occupies unit chip area, then the area required to compute an N-point FFT is 2(N log N/2) = N log N. If we use a TSA to compute an N-point FFT, the chip area required is 2(N - 1). Table 1 shows the chip area comparison between TSA and butterfly for the computation of an N-point FFT.
Thus the chip area utilisation of a TSA is much better than that of a butterfly interconnection network for the computation of an FFT, especially when N is large. Table 2 shows the comparison (based on some other factors) of TSA and butterfly interconnection networks for the computation of an FFT.
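The figures quoted in this section can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms and N a power of two, and reading the speed-up as the sequential operation count N log N divided by the N + log N time-steps of the TSA. The function names are ours and are not taken from the paper.

```python
def log2_int(n: int) -> int:
    """Exact log2 for n a power of two (assumption: N = 2**m)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    return n.bit_length() - 1


def tsa_time_steps(n: int) -> int:
    """Time-steps taken by the (N - 1) PE TSA: N + log N."""
    return n + log2_int(n)


def tsa_speed_up(n: int) -> float:
    """Speed-up S = (N log N) / (N + log N), our reading of the text."""
    return (n * log2_int(n)) / tsa_time_steps(n)


def butterfly_area(n: int) -> int:
    """N log N / 2 unit-area PEs, doubled for the interconnection."""
    return 2 * (n * log2_int(n) // 2)


def tsa_area(n: int) -> int:
    """N - 1 PEs, each assumed to take twice the unit area."""
    return 2 * (n - 1)


def area_table(sizes):
    """Rows of (N, butterfly area, TSA area)."""
    return [(n, butterfly_area(n), tsa_area(n)) for n in sizes]


if __name__ == "__main__":
    for n, butterfly, tsa in area_table([8, 16, 32, 64, 128, 256]):
        print(f"N={n:4d}  butterfly={butterfly:5d}  TSA={tsa:4d}  "
              f"S={tsa_speed_up(n):.2f}")
```

Under these assumptions the area columns agree with the entries of Table 1, for example 24 and 14 units for N = 8.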
From the above analysis, we observe that both designs have certain strong aspects and certain weak aspects. It is difficult to relate all the factors by a common formula which can be used as a performance metric. The alternative approach is to assign credit points for each of the factors depending on their relative merit; the total points can then be used as a performance index for comparing two designs. If we assign equal credit points for all factors with equal relative merit, we conclude that the TSA implementation of an FFT is superior.
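The credit-point idea can be illustrated with a small sketch. It assumes one credit point per factor of Table 2 and uses our own reading of which design each factor favours (in particular, that a lower I/O bandwidth requirement counts in favour of the TSA). Neither the scoring function nor these judgements come from the paper.

```python
# Which design each Table 2 factor favours. These judgements are
# illustrative assumptions, not values stated by the authors.
FAVOURED = {
    "area efficiency": "TSA",
    "power efficiency": "TSA",
    "design cost": "TSA",
    "fault tolerance": "TSA",
    "modularity": "TSA",
    "I/O bandwidth": "TSA",  # assumption: fewer data per time-step is better
    "processor utilisation": "Butterfly",
    "block pipelining period": "Butterfly",
    "chip area required": "TSA",
    "time": "Butterfly",
}


def credit_totals(favoured, weights=None):
    """Sum credit points per design; every factor is worth 1 by default."""
    totals = {"TSA": 0.0, "Butterfly": 0.0}
    for factor, design in favoured.items():
        points = 1.0 if weights is None else weights.get(factor, 1.0)
        totals[design] += points
    return totals


if __name__ == "__main__":
    print(credit_totals(FAVOURED))
```

Passing a `weights` dictionary (a hypothetical parameter) lets a designer express unequal relative merit, for example by giving "time" more than one point when latency dominates.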
Table 1: Chip area comparison
N      Area (in units)
       Butterfly   TSA
8      24          14
16     64          30
32     160         62
64     384         126
128    896         254
256    2048        510

Fig. 9 PE for computation of FFT using butterfly interconnection between PEs
Legend: 16-bit register; multiplier; adder; subtractor; A, B, C, D: register pairs; w (= w1 + j.w2): twiddle factor
C = A + w'B and D = A - w'B, where w' = w1 + w2
Outputs of all registers are tri-state

Table 2: Comparison of (N - 1) PE TSA and (N log N/2) PE network using butterfly interconnection for computation of N-point FFT
Factor | TSA | Butterfly
Area efficiency (= computational area / total chip area) | 1 | 0.5
Power efficiency (= computational power / total power) | 1 | 0.5
Design cost | Less because of local and regular interconnection among PEs | More because of nonlocal and nonregular interconnection among PEs
Fault tolerance (interconnection failure) | 100% fault tolerant | Less than 100%
Modularity | Modular | Non-modular
I/O bandwidth | 4 data/time-step | 2N data/time-step
Processor utilisation | Less than 100% | 100%
Block pipelining period | N | 1
Chip area required | 2(N - 1) units | (N log N) units
Time | N + log N | log N

5 TSA for other orthogonal transforms
$\frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(n)\,[\cos(2\pi kn/N) + \sin(2\pi kn/N)]$
The fast Hartley transform (FHT) is very similar to the FFT and the DG for an N = 8 FHT is the same as for an N = 8 FFT except for the constants involved in the computation and that the FHT involves only real arithmetic computations. Thus a similar TSA can be derived for the computation of an FHT where the PEs perform only real arithmetic computations.
The Hadamard transform [6] of a data sequence {x(n); n = 0, 1, 2, ...} is given by y = Hx, where H is an
N x N matrix. For N = 8, the matrix H is

$$H = \frac{1}{2\sqrt{2}}
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\
1 & -1 & -1 & 1 & -1 & 1 & 1 & -1
\end{bmatrix}$$

The fast Hadamard transform also involves only real arithmetic computations. The DG for a fast Hadamard transform is the same as for the N = 8 FFT except for the constants involved in the computation. Thus a TSA similar to that for an FFT can be derived in this case.
The discrete cosine transform (DCT) [10] of a data sequence {$x_n$; n = 0, 1, ..., N - 1} is given by the output sequence {$z_k$; k = 0, 1, ..., N - 1} where

$$z_k = \frac{2e(k)}{N} \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi(2n + 1)k}{2N}\right]$$

and

$$e(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases}$$

Ordinarily, a DCT is calculated from an FFT by using
for k = 0, 1, 2, ..., N/2 - 1
where R and I are the real and imaginary parts of the FFT, respectively, and $z_{N/2}(x) = \sqrt{2/N}\, R_{N/2}(x')$ and $x'_n = x_{2n}$; $x'_{N-n-1} = x_{2n+1}$ and $\theta_k = \pi k/2N$. Hence the DCT sequence can be obtained from the FFT sequence by including two additional processors that implement the above computation. Because the outputs from an FFT tagged systolic array are available sequentially, the DCT outputs are also available sequentially.

6 Conclusion
The design methodology and the enhanced systolic architecture discussed in the paper allow us to consider a broader class of algorithms amenable to implementation on a TSA. The starting point of the design is a set of recurrence equations describing the algorithm. When no apparent transformation can be applied to the set of recurrence equations describing the algorithm to convert them into a set of uniform recurrence equations or, alternatively, the DG describing the algorithm cannot be transformed into a regular one, we may try to implement the problem on a TSA. A TSA design for the FFT is derived and shown to be better in certain aspects than the design employing butterfly interconnection among PEs. Finally, it has been shown that similar designs can be derived for some other important orthogonal transforms.

7 References
1 KUNG, H.T.: ‘Why systolic architectures?’, IEEE Computer, Jan. 1982
2 MOLDOVAN, D.I.: ‘On the design of algorithms for VLSI systolic arrays’, Proc. IEEE, 1983, 71
3 LI, G.H., and WAH, B.W.: ‘The design of optimal systolic arrays’, IEEE Trans. Comput., 1985, 34, (1)
4 ULLMAN, J.D.: ‘The computational aspects of VLSI’ (Computer Science Press, 1984)
5 DELOSME, J.M., and IPSEN, I.C.F.: ‘Efficient systolic arrays for the solution of Toeplitz system: an illustration of the methodology for the construction of systolic architectures for VLSI’, in MOORE, W., McCABE, A., and URQUHART, R. (Eds.): ‘Construction of systolic architectures’ (Adam Hilger, 1986)
6 KUNG, S.Y.: ‘VLSI array processor’ (Prentice-Hall, New Jersey, 1988)
7 RAO, S.K.: ‘Regular iterative algorithms and their implementations on processor arrays’. PhD thesis, Stanford University, USA, 1985
8 YAACOBY, Y., and CAPPELLO, P.R.: ‘Scheduling a system of affine recurrence equations onto a systolic array’. Proceedings of the International Conference on Systolic Arrays, San Diego, California, May 1988
9 KARP, R.M., MILLER, R.E., and WINOGRAD, S.: ‘The organisation of computations for uniform recurrence equations’, JACM, 14, (3), July 1967
10 HOU, S.H.: ‘The fast Hartley transform algorithm’, IEEE Trans. Comput., 1987, C-36, (2)
11 QUINTON, P.: ‘The systematic design of systolic arrays’. IRISA Research Report No. 193, April 1983
12 DONGEN, V.V., and QUINTON, P.: ‘Uniformization of linear recurrence equations: a step towards automatic synthesis of systolic arrays’. Proceedings of the International Conference on Systolic Arrays, San Diego, California, May 1988
13 NUSSBAUMER, H.J.: ‘Fast Fourier transform and convolution algorithms’ (Springer-Verlag, 1982), 2nd edn.