
Tagged systolic arrays

S. Sarkar
A.K. Majumdar

Indexing terms: Fast Fourier transform, Systolic array, Tagged systolic array, VLSI

Abstract: The design of systolic arrays from a set of nonlinear and nonuniform recurrence equations is discussed. A systematic method for deriving a systolic design in such cases is presented. A novel architectural idea, termed a tagged systolic array (TSA), is introduced. The design methodology described broadens the class of algorithms amenable to tagged systolic array implementation. The methodology is illustrated by deriving a systolic design for the fast Fourier transform.

1 Introduction

Systolic arrays exploit the advantages offered by VLSI technology to develop special-purpose devices for parallel computation. For this reason, hardware implementation of several specialised parallel algorithms has become feasible. Kung [1] characterises a systolic array as a special-purpose device consisting of a number of interconnected processing elements, each capable of performing some simple operations. In a systolic array, data flows from cell to cell in a very regular and pipelined fashion. Local and regular interconnection between the processing elements is one of the most important characteristics of a systolic array.

Considerable effort has been directed towards the development of a systematic method for the design of systolic arrays [1-8]. Most of these methods depend upon the fact that the dependence graph (DG) [6] of the algorithm is local and regular, or that it can be transformed, intuitively or by some valid transformation, into a local and regular one. The locality property of the DG implies that it can be embedded in a multidimensional index space such that the computation at an index point depends only on data from neighbouring index points, and the regularity property of the DG implies that the dependencies are the same at every index point. More recently, based on the pioneering work of Karp, Miller & Winograd [9], Quinton [11] described a systematic design methodology using a uniform recurrence equation (URE). A transformation technique has also been discussed by Dongen and Quinton [12] which enables a set of nonuniform linear recurrence equations to be converted to a set of URE. Designing a systolic array from a set of nonuniform linear recurrence equations is a three-step process: conversion of the set of recurrence equations into a set of URE, followed by derivation of a permissible timing function and a permissible allocation function [11].

When an algorithm cannot be expressed as a set of URE, much less is known about the method of deriving a systolic design. Such algorithms can, in general, be expressed as a set of nonlinear and nonuniform recurrence equations (NLNURE). Many important algorithms such as FFT, bitonic sorting, etc. belong to this class. Thus the need arises to explore the possibility of a systolic design for such algorithms.

In this paper, a methodology for deriving systolic designs for algorithms expressed as a set of NLNURE is presented. In the process, a new architectural idea called the tagged systolic array (TSA) is introduced. In a TSA, tags are attached to the results of a particular computation for sending them to other processing elements (PEs) where the result of that particular computation is required. A TSA uses only nearest-neighbour local communication links for sending data. We illustrate the proposed design methodology by deriving a systolic design for the fast Fourier transform.

2 Design methodology

The constraint imposed by a systolic array on the physical layout of the array is the local and regular interconnections between the processing elements. Thus algorithms whose DGs are local and regular can easily be mapped onto a systolic array. The two steps that are always followed for a systolic array design are finding a timing function $T$ and an allocation function $a$. The timing function shows the time ordering of computations at different index points and it must be compatible with the ordering of computations represented by the DG, i.e. if the computation at an index point $z_1$ depends on the computation at another index point $z_2$, then $T(z_1) > T(z_2)$. A permissible allocation function $a$ will be such that if, for two index points $z_1$ and $z_2$, $a(z_1) = a(z_2)$, then $T(z_1) \ne T(z_2)$.

According to the uniformisation technique for linear recurrence equations described by Dongen & Quinton [12], we need to find a set of integral vectors $B_1, \ldots, B_n$ such that any dependence vector $\theta_i$ can be expressed as a non-negative integral combination of these vectors, and a linear timing function $T$ which is compatible with all vectors $B_1, \ldots, B_n$, i.e. $\forall B_j: T \cdot B_j > 0$. If $D_\theta$ is defined as the smallest domain that contains all the dependence vectors for all values of size parameters of a particular problem, then $C_\theta$ can be defined as the smallest cone pointed at the origin that contains $D_\theta$. It has been shown [12] that the existence of $C_\theta$ is necessary for the parallelisation of the final uniform recurrences. Such a cone does not always exist. In that case, we need to consider a reindexing transformation.
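The two conditions in the preceding paragraph can be checked mechanically for small cases. The following is a minimal sketch, not part of the original method: the function names `is_compatible` and `nonneg_combination`, the brute-force search, and its coefficient bound `max_coeff` are all assumptions made for illustration.

```python
from itertools import product


def is_compatible(timing, basis):
    """Return True if the linear timing vector T satisfies T . B_j > 0 for every B_j."""
    return all(sum(t * b for t, b in zip(timing, vec)) > 0 for vec in basis)


def nonneg_combination(dep, basis, max_coeff=8):
    """Search for non-negative integer coefficients c_j with sum_j c_j * B_j == dep.

    Brute force over coefficients in 0..max_coeff (an illustrative bound only);
    returns the coefficient tuple, or None if no combination is found.
    """
    dims = range(len(dep))
    for coeffs in product(range(max_coeff + 1), repeat=len(basis)):
        combo = tuple(sum(c * vec[d] for c, vec in zip(coeffs, basis)) for d in dims)
        if combo == tuple(dep):
            return coeffs
    return None


# Hypothetical example values: two unit vectors and the timing vector (1, 1).
basis = [(1, 0), (0, 1)]
print(is_compatible((1, 1), basis))         # both dot products are positive
print(nonneg_combination((1, 4), basis))    # (1, 4) = 1*B_1 + 4*B_2
print(nonneg_combination((1, -2), basis))   # a negative component cannot be reached
```

A dependence vector with a negative component, such as the hypothetical (1, -2) above, has no non-negative combination in this basis, which is the kind of situation the reindexing transformation is meant to remove.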
To design a systolic array from a set of NLNURE, we follow a procedure similar to that outlined above. If we cannot find a cone $C_\theta$ for the problem concerned, we try the reindexing transformation so that $C_\theta$ exists for the reindexed DG. Next we find the set of vectors $B_1, \ldots, B_n$ and $T$ ($T$ need not necessarily be a linear function). Because an integral combination of vectors $B_1, \ldots, B_n$ is used to replace any dependence vector $\theta_i$, we route the data along the directions of these vectors. The vectors $B_1, \ldots, B_n$ are chosen such that when the dependence vectors of the DG are replaced with an integral combination of $B_1, \ldots, B_n$, the resulting DG has only local dependencies.

Although the locality property required for systolic design has been satisfied, the regularity property cannot be satisfied (we have assumed that the DG of the algorithm cannot be transformed into a regular one by any known method). However, the relative position(s) of the index point(s) to which data from a particular index point is to be routed is known. It may also be noted that these relative positions may vary with the index points. To overcome this problem, we enhance the existing systolic array architecture by including tags for data routing. A tag is attached to the result of a computation to identify the relative location of the index point where the result is to be used. This type of systolic array using tags for data routing is called a tagged systolic array (TSA). A TSA employs only local and regular interconnection among the PEs (because data is routed only along a fixed number of directions given by $B_1, \ldots, B_n$, the interconnection among the PEs may be made regular). Thus for problems in which the DG cannot be transformed into a local and regular one by any valid transformation known, we can still explore the possibility of mapping the problem on a TSA.

3 Fast Fourier transform on tagged systolic array

The fast Fourier transform (FFT) [13] is one of the most powerful tools used in many signal and image processing applications. Given a sequence $\{x(0), x(1), \ldots, x(N-1)\}$ of time-dependent inputs, the Fourier transform computes the output sequence $\{X(0), X(1), \ldots, X(N-1)\}$ as follows:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j2\pi nk/N} = \sum_{n=0}^{N-1} x(n)\, w^{nk}$$

where $w = e^{-j2\pi/N}$.

The DG for an N-point FFT (we assume $N = 2^m$ where $m$ is an integer) is shown in Fig. 1 (the constants involved in the computation have been ignored). The computation represented by the DG can be expressed by a set of recurrence equations.

If $k = 1$, $1 \le i \le N$ then

$$A(k, i) = x_i \qquad (1)$$

If $1 < k \le (\log N + 1)$, $1 \le i \le N$, $1 \le (i \bmod 2^{k-1}) \le 2^{k-2}$ then

$$A(k, i) = F[A(k-1, i + 2^{k-2}),\ A(k-1, i)] \qquad (2)$$

If $1 < k \le (\log N + 1)$, $1 \le i \le N$, $(i \bmod 2^{k-1}) = 0$ or $(i \bmod 2^{k-1}) > 2^{k-2}$ then

$$A(k, i) = F[A(k-1, i),\ A(k-1, i - 2^{k-2})] \qquad (3)$$

If $k > (\log N + 1)$, $1 \le i \le N$ then

$$X(N - i) = A(k, i) \qquad (4)$$

In the above recurrence equations, the input data (for N = 8) are

$x_1 = x(7)$, $x_2 = x(3)$, $x_3 = x(5)$, $x_4 = x(1)$, $x_5 = x(6)$, $x_6 = x(2)$, $x_7 = x(4)$, $x_8 = x(0)$

Fig. 1 Dependence graph (DG) for N = 8 FFT

The function $F$ describes the computation involved and is given by

$$F(a, b) = a + w^x \cdot b \quad \text{where } w^x \text{ is a constant}$$

The constants involved in the computation are assumed to be stored at the nodes. Recurrence eqns. 2 and 3 can be written in fully indexed form [9, 12].

If $1 < k \le (\log N + 1)$, $1 \le i \le N$, $1 \le (i \bmod 2^{k-1}) \le 2^{k-2}$ then

$$A(k, i) = F[A\{(k, i) - (1, -2^{k-2})\},\ A\{(k, i) - (1, 0)\}] \qquad (5)$$

If $1 < k \le (\log N + 1)$, $1 \le i \le N$, $(i \bmod 2^{k-1}) = 0$ or $(i \bmod 2^{k-1}) > 2^{k-2}$ then

$$A(k, i) = F[A\{(k, i) - (1, 0)\},\ A\{(k, i) - (1, 2^{k-2})\}] \qquad (6)$$

According to the definition of URE given in Reference 12, the above set of recurrence equations can be identified as NLNURE. The set of dependence vectors for eqns. 5 and 6 (that represent computation at an index point) is $\Theta = \{\theta_1, \theta_2, \theta_3\} = \{(1, 0), (1, -2^{k-2}), (1, 2^{k-2})\}$. The domain $D_\theta$ containing all the dependence vectors for all values of the parameter $k$ is given by $D_\theta = \{(k, i) \mid k > 0,\ -\infty < i < \infty\}$. Clearly, the cone $C_\theta$ does not exist in this case. However, if the DG is reindexed such that an index point $(k, i)$ in the original DG is reindexed to an index point $(k, i + 2^{k-1} - 1)$, the cone $C_\theta$ exists. Fig. 2 shows the DG after reindexing. Fig. 3 shows the cone $C_\theta$ and vectors $B_1$ and $B_2$. The set of recurrence eqns. 1 to 4 is modified accordingly.

If $k = 1$, $2^{k-1} \le i \le N + (2^{k-1} - 1)$ then

$$A(k, i) = x_i \qquad (7)$$

If $1 < k \le (\log N + 1)$, $2^{k-1} \le i \le N + (2^{k-1} - 1)$, $1 \le (i + 1 - 2^{k-1}) \bmod 2^{k-1} \le 2^{k-2}$ then

$$A(k, i) = F[A(k-1, i),\ A(k-1, i - 2^{k-2})] \qquad (8)$$

If $1 < k \le (\log N + 1)$, $2^{k-1} \le i \le N + (2^{k-1} - 1)$, $(i + 1 - 2^{k-1}) \bmod 2^{k-1} = 0$ or $(i + 1 - 2^{k-1}) \bmod 2^{k-1} > 2^{k-2}$ then

$$A(k, i) = F[A(k-1, i - 2^{k-2}),\ A(k-1, i - 2^{k-1})] \qquad (9)$$

If $k > (\log N + 1)$, $2^{k-1} \le i \le N + (2^{k-1} - 1)$ then

$$X(N - i) = A(k, i) \qquad (10)$$
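As a check on the original (not reindexed) recurrences, eqns. 1 to 4 can be evaluated directly in software. The sketch below rests on two assumptions that are not stated in the text: the constant $w^x$ at node $(k, i)$ is taken to be the standard radix-2 decimation-in-time twiddle factor $e^{-j2\pi((N-i) \bmod 2^{k-1})/2^{k-1}}$, and the input ordering given for N = 8 is generalised to $x_i = x(\text{bit-reverse}(N - i))$. The names `twiddle`, `fft_from_recurrences` and `dft` are illustrative.

```python
import cmath


def twiddle(k, i, n):
    """Assumed constant w^x stored at node (k, i); not given explicitly in the text."""
    span = 2 ** (k - 1)
    return cmath.exp(-2j * cmath.pi * ((n - i) % span) / span)


def fft_from_recurrences(x):
    """Evaluate eqns. 1 to 4 level by level for an input of length N = 2^m."""
    n = len(x)
    m = n.bit_length() - 1
    if n < 2 or 2 ** m != n:
        raise ValueError("N must be a power of two, N >= 2")

    def bit_reverse(p):
        return int(format(p, "0{}b".format(m))[::-1], 2)

    # Eqn. 1: A(1, i) = x_i, with the assumed ordering x_i = x(bit_reverse(N - i)).
    a = {i: complex(x[bit_reverse(n - i)]) for i in range(1, n + 1)}
    for k in range(2, m + 2):
        half, span = 2 ** (k - 2), 2 ** (k - 1)
        nxt = {}
        for i in range(1, n + 1):
            w = twiddle(k, i, n)
            if 1 <= i % span <= half:
                nxt[i] = a[i + half] + w * a[i]    # eqn. 2: F[A(k-1, i+2^(k-2)), A(k-1, i)]
            else:
                nxt[i] = a[i] + w * a[i - half]    # eqn. 3: F[A(k-1, i), A(k-1, i-2^(k-2))]
        a = nxt
    # Eqn. 4: X(N - i) = A(., i), read from the last computed level.
    return [a[n - q] for q in range(n)]


def dft(x):
    """Direct O(N^2) evaluation of the defining sum, used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * q / n) for t in range(n))
            for q in range(n)]


sample = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.5, 7.0]
error = max(abs(u - v) for u, v in zip(fft_from_recurrences(sample), dft(sample)))
print(error < 1e-9)
```

Under these assumptions the recurrence output agrees with the direct transform, which supports reading eqns. 2 and 3 as the lower and upper halves of a standard butterfly with offset $2^{k-2}$.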
The new set of dependence vectors is $\Theta' = \{\theta'_1, \theta'_2, \theta'_3\} = \{(1, 0), (1, 2^{k-2}), (1, 2^{k-1})\}$. The set of vectors $B_1, \ldots, B_n$ is now chosen as $\{B_1, B_2\} = \{(1, 0), (0, 1)\}$ and $T' = (1, 1)$. If a linear allocation function $a$ is chosen, then any $a$ that satisfies the relation $T' \cdot a > 0$ is permissible [11]. Thus $a = (1, 0)^T$ and $a = (0, 1)^T$ are both permissible allocation functions. $a = (0, 1)^T$ will result in the design shown in Fig. 4. The array uses $(2N - 1)$ PEs and tags for routing data. However, this design has the drawback that each PE has to dynamically compute the values of tags to send results to other PEs. This increases the processor complexity.

Fig. 2 Reindexed DG

Fig. 3 Cone $C_\theta$ and vectors $B_1$ and $B_2$

Fig. 4 $(2N - 1)$ PE systolic array

Fig. 5 $(\log N)$ PE systolic array

If $a = (1, 0)^T$ the systolic array shown in Fig. 5 using $\log N$ PEs results. This design does not use tags for routing data. However, it has the drawback that it requires exponentially increasing local memory within each PE, from the leftmost PE to the rightmost PE, for storing the partial results of computation and the constants involved.

For a better design, $a$ is taken to be the following permissible nonlinear function:

For $2 \le k \le (\log N + 1)$, an index point $(k, i)$ is allocated to PE {(i mod 2'-') + "2 - 2'-'}.

The resulting design, shown in Fig. 6, uses only $(N - 1)$ PEs. The array is linear and uses tags for routing data. Each PE stores only two pieces of data (intermediate results) and one constant for computation. Two tag values are also stored in each PE and these values are also constants. A tag is attached to the result of a computation when it is sent to a neighbouring PE. A PE immediately checks the tag of a piece of data after receiving it and copies the data if it is meant for it. In the next time-step, the piece of data passes into the next PE.

Fig. 6 Tagged systolic array for N = 8 FFT

A PE of the TSA derived for the FFT is shown in Fig. 7. Because most high-speed A/D converters are lower in precision, the input data are assumed to be 16-bit fixed-point words. Each PE executes the following program in one time-step.

Fig. 7 PE for computation of FFT using tagged systolic array
(A, B, C, D: register pairs; $w\ (= w_1 + j \cdot w_2)$: twiddle factor; $T_1, \ldots, T_6$: tag registers)

Line no.
 1  begin time-step
 2    send output data 1 and output data 2
 3    receive input data 1 and input data 2
 4    for all tags of input data
 5      decrement tag by 1
 6      if tag = 0 and count = 0
 7        copy data to register B
 8        count = 1
 9      else if tag = 0 and count = 1
10        copy data to register A
11        count = 2
12      end if
13      end if
14    end for
15    if count = 2
16      compute
17        output data 1 = A + w^x B
18        output data 2 = A - w^x B
19      count = 0
20    else
21      output data 1 = input data 1
22      output data 2 = input data 2
23    end if
24  end time-step

When the computation is started, the value of count is set to zero. Before starting computation, the values of the tag constants and the constant involved in the computation are set.

4 Performance analysis

The $(N - 1)$ PE TSA takes $(N + \log N)$ time-steps to complete the computation of an N-point FFT. The speed-up $S$ is given by $(N \log N)/(N + \log N)$. The average processor utilisation decreases with the increase in size of the FFT. The block pipelining period is $N$. Fig. 8 plots $S$ against PE number.

The performance of an $(N - 1)$ PE TSA can be compared with that of an $(N \log N/2)$ PE network using butterfly interconnection. A PE of the latter type is shown in Fig. 9. Fig. 7 shows the PE of a TSA. To compare the two designs based on the chip area required for computing an N-point FFT, the approximate number of gates (estimated from the data path synthesis) required for a PE is taken as the basis. The complexity of the controller is measured in terms of the number of basic operations it performs. The additional controller complexity of the TSA for FFT results from lines 4 to 15 and 19 to 23 of the program that each PE executes. The approximate gate count (from the data path synthesis) of the TSA is 5400 and that for a butterfly PE is 4500. Thus an optimistic assumption would be that a PE of TSA takes at most twice the area required for a processor employing butterfly interconnection for computing an FFT. However, because of the nonlocal and nonregular interconnection employed in the latter case, the total chip area required is approximately double that required for processors only. If we assume that a butterfly processor occupies unit chip area, then the area required to compute an N-point FFT is $2N \log N/2 = N \log N$. If we use a TSA to compute an N-point FFT, the chip area required is $2(N - 1)$. Table 1 shows the chip area comparison between TSA and butterfly for the computation of an N-point FFT.

Thus the chip area utilisation of a TSA is much better than that of a butterfly interconnection network for the computation of an FFT, especially when $N$ is large. Table 2 shows the comparison (based on some other factors) of TSA and butterfly interconnection networks for the computation of an FFT.
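The figures quoted in this section follow from three closed-form expressions: $N + \log N$ time-steps, speed-up $N \log N/(N + \log N)$, and chip areas $N \log N$ (butterfly) and $2(N - 1)$ (TSA). The short sketch below simply evaluates them for several sizes; the function names are illustrative and logarithms are taken to base 2, which the text implies through $N = 2^m$.

```python
def log2_exact(n):
    """Return m for N = 2^m, rejecting sizes that are not powers of two."""
    m = n.bit_length() - 1
    if n < 2 or (1 << m) != n:
        raise ValueError("N must be a power of two, N >= 2")
    return m


def tsa_time_steps(n):
    """Time-steps for the (N - 1) PE TSA: N + log N."""
    return n + log2_exact(n)


def speed_up(n):
    """Speed-up S = (N log N) / (N + log N)."""
    return n * log2_exact(n) / tsa_time_steps(n)


def butterfly_area(n):
    """(N log N)/2 unit-area PEs, doubled to account for interconnection."""
    return n * log2_exact(n)


def tsa_area(n):
    """(N - 1) PEs, each assumed to take at most two area units."""
    return 2 * (n - 1)


for n in (8, 16, 32, 64, 128, 256):
    print(n, butterfly_area(n), tsa_area(n), tsa_time_steps(n), round(speed_up(n), 2))
```

The two area columns grow as $N \log N$ and $2N$ respectively, so the ratio of butterfly area to TSA area is roughly $(\log N)/2$, which is why the advantage of the TSA widens as $N$ increases.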
From the above analysis, we observe that both designs have certain strong aspects and certain weak aspects. It is difficult to relate all the factors by a common formula which can be used as a performance metric. The alternative approach is to assign credit points to each of the factors depending on their relative merit; the total points can then be used as a performance index for comparing two designs. If we assign equal credit points for all factors with equal relative merit, we conclude that the TSA implementation of an FFT is superior.

Fig. 8 Speed-up S as a function of number of PEs

5 TSA for other orthogonal transforms

The Hartley transform [10] of a data sequence $\{x(n);\ n = 0, 1, 2, \ldots\}$ is given by

$$H(k) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(n)\,[\cos(2\pi kn/N) + \sin(2\pi kn/N)]$$
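A direct evaluation of the Hartley transform definition can serve as a reference when checking a fast implementation. This is a minimal $O(N^2)$ sketch, not the fast algorithm discussed in the text; the function name `dht` is an assumption.

```python
import math


def dht(x):
    """Direct O(N^2) Hartley transform with the 1/sqrt(N) normalisation of the definition."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    out = []
    for k in range(n):
        acc = 0.0
        for t in range(n):
            angle = 2.0 * math.pi * k * t / n
            acc += x[t] * (math.cos(angle) + math.sin(angle))
        out.append(scale * acc)
    return out


data = [1.0, 2.0, -0.5, 4.0, 0.0, 3.0, 2.0, -1.0]
round_trip = dht(dht(data))
print(max(abs(u - v) for u, v in zip(round_trip, data)) < 1e-9)
```

With this normalisation the transform is its own inverse, so applying it twice recovers the input; only real arithmetic is involved, in line with the remark that the PEs for this transform need not handle complex data.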
Table 1: Chip area comparison

  N      Area (in units)
         Butterfly    TSA
  8          24        14
  16         64        30
  32        160        62
  64        384       126
  128       896       254
  256      2048       510

Fig. 9 PE for computation of FFT using butterfly interconnection between PEs
(A, B, C, D: register pairs; $w\ (= w_1 + j \cdot w_2)$: twiddle factor. $C = A + w^x B$ and $D = A - w^x B$ where $w^x = w_1 + j \cdot w_2$. Outputs of all registers are tri-state.)

The fast Hartley transform (FHT) is very similar to the FFT: the DG for an N = 8 FHT is the same as for an N = 8 FFT except for the constants involved in the computation, and the FHT involves only real arithmetic computations. Thus a similar TSA can be derived for the computation of an FHT where the PEs perform only real arithmetic computations.

The Hadamard transform [6] of a data sequence $\{x(n);\ n = 0, 1, 2, \ldots\}$ is given by $y = Hx$, where $H$ is an
$N \times N$ matrix. For N = 8, the matrix $H$ is

$$H = \frac{1}{2\sqrt{2}} \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\
1 & -1 & -1 & 1 & -1 & 1 & 1 & -1
\end{bmatrix}$$

The fast Hadamard transform also involves only real arithmetic computations. The DG for a fast Hadamard transform is the same as for the N = 8 FFT except for the constants involved in the computation. Thus a TSA similar to that for an FFT can be derived in this case.

The discrete cosine transform (DCT) [10] of a data sequence $\{x_n;\ n = 0, 1, \ldots, N-1\}$ is given by the output sequence $\{z_k;\ k = 0, 1, \ldots, N-1\}$ where

$$z_k = \frac{2e(k)}{N} \sum_{n=0}^{N-1} x_n \cos[\pi(2n + 1)k/2N]$$

and

$$e(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases}$$

Ordinarily, a DCT is calculated from an FFT by using [...] for $k = 0, 1, 2, \ldots, N/2 - 1$, where $R$ and $I$ are the real and imaginary parts of the FFT, respectively, and $z_{N/2}(x) = \sqrt{2/N} \cdot R_{N/2}(x')$ and $x'_n = x_{2n}$; $x'_{N-n-1} = x_{2n+1}$ and $\theta_k = \pi k/2N$. Hence the DCT sequence can be obtained from the FFT sequence by including two additional processors that implement the above computation. Because the outputs from an FFT tagged systolic array are available sequentially, the DCT outputs are also available sequentially.

Table 2: Comparison of (N - 1) PE TSA and (N log N/2) PE network using butterfly interconnection for computation of N-point FFT

  Factor                                               TSA                             Butterfly
  Area efficiency
    (= computational area / total chip area)           1                               0.5
  Power efficiency
    (= computational power / total power)              1                               0.5
  Design cost                                          Less, because of local and      More, because of nonlocal and
                                                       regular interconnection         nonregular interconnection
                                                       among PEs                       among PEs
  Fault tolerance (interconnection failure)            100% fault tolerant             Less than 100%
  Modularity                                           Modular                         Non-modular
  I/O bandwidth                                        4 data/time-step                2N data/time-step
  Processor utilisation                                Less than 100%                  100%
  Block pipelining period                              N                               1
  Chip area required                                   2(N - 1) units                  (N log N) units
  Time                                                 N + log N                       log N

6 Conclusion

The design methodology and the enhanced systolic architecture discussed in the paper allow us to consider a broader class of algorithms amenable to implementation on a TSA. The starting point of the design is a set of recurrence equations describing the algorithm. When no apparent transformation can be applied to the set of recurrence equations describing the algorithm to convert them into a set of uniform recurrence equations or, alternatively, the DG describing the algorithm cannot be transformed into a regular one, we may try to implement the problem on a TSA. A TSA design for the FFT is derived and shown to be better in certain aspects than the design employing butterfly interconnection among PEs. Finally, it has been shown that similar designs can be derived for some other important orthogonal transforms.

7 References

1 KUNG, H.T.: 'Why systolic architectures?', IEEE Computer, Jan. 1982
2 MOLDOVAN, D.I.: 'On the design of algorithms for VLSI systolic arrays', Proc. IEEE, 1983, 71
3 LI, G.H., and WAH, B.W.: 'The design of optimal systolic arrays', IEEE Trans. Comput., 1985, 34, (1)
4 ULLMAN, J.D.: 'The computational aspects of VLSI' (Computer Science Press, 1984)
5 DELOSME, J.M., and IPSEN, I.C.F.: 'Efficient systolic arrays for the solution of Toeplitz system: an illustration of the methodology for the construction of systolic architectures for VLSI', in MOORE, W., McCABE, A., and URQUHART, R. (Eds.): 'Construction of systolic architectures' (Adam Hilger, 1986)
6 KUNG, S.Y.: 'VLSI array processor' (Prentice-Hall, New Jersey, 1988)
7 RAO, S.K.: 'Regular iterative algorithms and their implementations on processor arrays'. PhD thesis, Stanford University, USA, 1985
8 YAACOBY, Y., and CAPPELLO, P.R.: 'Scheduling a system of affine recurrence equations onto a systolic array'. Proceedings of the International Conference on Systolic Arrays, San Diego, California, May 1988
9 KARP, R.M., MILLER, R.E., and WINOGRAD, S.: 'The organisation of computations for uniform recurrence equations', J. ACM, 14, (3), July 1967
10 HOU, S.H.: 'The fast Hartley transform algorithm', IEEE Trans. Comput., 1987, C-36, (2)
11 QUINTON, P.: 'The systematic design of systolic arrays'. IRISA Research Report No. 193, April 1983
12 DONGEN, V.V., and QUINTON, P.: 'Uniformization of linear recurrence equations: a step towards automatic synthesis of systolic arrays'. Proceedings of the International Conference on Systolic Arrays, San Diego, California, May 1988
13 NUSSBAUMER, H.J.: 'Fast Fourier transform and convolution algorithms' (Springer-Verlag, 1982), 2nd edn.
