You are on page 1of 6

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/224649069

Mapping DSP Algorithms into FPGA

Conference Paper · October 2006


DOI: 10.1109/PARELEC.2006.51 · Source: IEEE Xplore

CITATIONS READS

12 1,162

2 authors, including:

Anatolij Sergiyenko
National Technical University of Ukraine Kyiv Polytechnic Institute
52 PUBLICATIONS   119 CITATIONS   

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Design of the IP cores, and development of the image processing systems View project

Rational fraction computations in FPGA View project

All content following this page was uploaded by Anatolij Sergiyenko on 20 May 2014.

The user has requested enhancement of the downloaded file.


Methods of Mapping DSP Algorithms into FPGA

Oleg Maslennikow
Polytechnica Koszalinska, Poland,
E-mail: oleg@moskit.ie.tu.koszalin.pl
Anatolij Sergiyenko
National Technical University of Ukraine
E-mail: aser@comsys.ntu-kpi.kiev.ua

Abstract algorithm period, considering the acyclic part of


SDFG. Here the algorithms of list scheduling, force
A method of mapping DSP algorithms into FPGA directed scheduling, graph colouring, left edge
is considered. Algorithms are represented by scheduling are very popular. But these methods cause
synchronous data flow graphs, and are mapped into the bad load balance of processing units (PUs) at the
pipelined data path. The Method consists in placing beginning and the end of the cycle. To consider the
the algorithm graph in the multidimensional index cyclic nature of SDFG the conflict graph or interval
space and mapping it into structure and event graph are analysed additionally [2],[3].
subspaces. The limitations to the mapping process The resulting computing system effectiveness
minimizes both clock time and hardware volume depends on all synthesis task implementation. But
including multiplexor inputs. they are usually implemented separately. The subtasks
have the local targets, which are often in contradiction
1. Introduction to another subtask targets. Therefore most of
synthesis methods do not provide the effective
structure solutions.
Modern field programmable gate arrays (FPGAs) Recently such programming tools like AccelDSP
have the volume of millions of equivalent gates and or System Generator are wide-spreaded, which help to
hundreds of hardware multipliers. The FPGA clock generate the FPGA configuration of the DSP
frequency achieves hundreds of megahertz. Therefore computing system. Here the initial algorithm is
they are considered as advanced chips for high represented by SDFG using Matlab Simulink package
throughput digital signal processing (DSP). [4]. But these tools are effective ones only for аcyclic
By the behavioral or high level synthesis of SDFG, for example, FIR-filters, or for actors which
computing systems the DSP algorithm is usually represent the complex library components.
represented by the synchronous data flow graph In [5],[6] methods for mapping SDFG into parallel
(SDFG). In SDFG the nodes, which are named as computing structures are proposed. They are based on
actors, represent the algorithm operators. Edges placing the graph in the multidimensional index space
represent variable movings between actors with the and mapping them in subspace of structures and
delay to a given cycle number. During a single cycle subspace of events. Methods of structural design of
each actor generates and accepts a variable (token) digital filters described in [7],[8] use this approach as
set, which volume is stable in each cycle [1],[2]. well. This approach provides simultaneous fulfillment
The Synthesis of computing system usually has of the synthesis tasks and therefore it optimizes the
three general tasks: resurce allocation, operator (or structure of computing system much better. In the
actor) scheduling, and binding resources with opera- representation this approach is described in the
tors. These tasks consist of subtasks like variable application of mapping DSP algorithms into FPGAs.
moving scheduling, variable to register or bus assign-
ment, memory allocation, etc. The different synthesis
methods are distinguished due to the order and way of
2. Mapping SDFG into subspaces of
such subtask implementation. The operator schedu- structures and events
ling is the responsible and complex task.
DSP algorithms have both cyclic and acyclic The initial dates for computing system design are
SDFGs. In general, the subgraphs of the cyclic SDFG DSP algorithm, represented by SDFG and cost
have different computing periods. The multirate DSP functions. SDFG with a single discretization
system is an example of such SDFG. Therefore the frequency is considered. Multirate SDFG is trans-
SDFG sheduling is a complex task. Very popular formed into equivalent single rate SDFG. The
approach consists in the search for the shedule of an computing system structure consists of a set of
processing units (PUs), connected to each other

1
according to the structure graph. PU consists of ALU If SDFG has a cyclic route i then the sum of
of given type (multiplier, adder, etc.) with the result vectors-edges Dj , belonging to this route, has to be
register (or FIFO buffer ) on its output and with the equal to zero
multiplexers on its inputs. PU calculates the operation ∑ bi,j Dj = 0,
not more than for a single clock cycle. If PU has not where bi,j is the not zeroed element of i-th row of the
a register then the operation is considered to be cyclomatic matrix of SDFG. Such cyclic routes are
calculated without delay. Such simple PUs can be inherent to the IIR filters. And the back propagation,
naturally implemented in the FPGA hardware reversed vector-edge DВj = <0,0,–kL> means the delay
including special DSP hardware in modern FPGAs which is equal to k periods (iterations) of the
like DSP48 units in Xilinx Virtex4 devices. algorithm. This vector is equivalent to the delay node
The cost functions are algorithm period interval Z–i in the signal graph of the DSP algorithm.
QT = LtC and hardware cost: SDFG can be represented as the conjunction of the
QS =∑ Сp qpmax, аcyclic subgraph, which implements the calculations
where L – period, clock cycles, tC – clock cycle of a single algorithm period, and a set of reversed
length, Сp – hardware cost of PU of p-th type. edges DВj. In the multidimensional space these edges
The average cost of a single multiplexor input of are directed in the reversed direction relatively to the
Xilinx FPGAs can be estimated as 0,57 of register time axis оt. They represent the interiteration delay to
cost or adder cost with the equal data widths. For this kL clock cycles.
reason the 18-input 16-bit width multiplexor has the Effective configuration is searched in two steps.
cost as the 16-bit multiplier unit has. Therefore the On the first step SDFG nodes and edges are placed in
multiplexor input number minimization is of great the three dimensional space as sets of vectors Ki and
demand in FPGA projects. Besides, this number Dj with respect to all conditions mentioned above.
represents the interconnection complexity. Then the PU number is minimized satisfying the
Due to the method SDFG is represented in the condition |Kp,q|→L. This condition means that the
three dimensional index space as the algorithm number of nodes, that are mapped into a single PU, is
configuration KG =(K,D,A), where K – is the matrix of directed to L.
vectors-nodes Ki, representing algorithm operators, D On the second step the configuration balancing is
is the matrix of vectors-edges, representing variable
implemented. For this purpose the acyclic subgraph of
movings between operators, A is the incident matrix
SDFG (i.e. it is without reversed edges DВj) is
of SDFG. The coordinates of the node Ki = <ki,si,ti>
considered, and in each its edge the delay nodes are
are equal to the operator type (for example, k=1 is
multiplication), PU number si, in which the operator is added. These nodes represent the delay or storing to a
implemented, and clock cycle ti, when the operator single clock cycle, or register. In the resulting
result is stored in the register. balanced configuration all the vectors-edges except
Equivalent structural solutions are represented by reversed vectors are equal to Dj=<aj,bj,1> or
equivalent algorithm configurations, which of them Dj=<aj,bj,0>. The nodes form columns, and the
differ in its matrices К and D. The matrix К codes distance between neighboring columns is equal to one
some correct solution, and the matrix D can be clock cycle.
derived from the equation D = КА. The optimum The balanced configuration is optimized by
structural solution finding consists in the search of the mutual interchandes of nodes belonging to a single
маtrix К, which minimizes the given cost criteria. If column. By this process the register number and the
the genetic optimization approach is used then the multiplexor number are minimized. The other
matrix К serves as the genome of the population methods like resynchronizing, left edge shedule are
representative. The following relations and definitions used sa well.
help to find the effective solutions. The computing system structure and the resulting
The algorithm configuration is correct if there are schedule are derived by splitting the algorithm
none equal vectors in the matrix К, i.e. configuration KG to the structure configuration KS and
 Ki,Kj (Ki Kj, i j ) . event configuration, that have the same incidence
The operator schedule is correct if the operators, matrix А. The vectors of the structure configuration
that are mapped in a single PU, are implemented in are equal to <ki,si>, i.e. to the type and number of PU,
different clock cycles, i.e. and the vectors of event configuration are equal to
 Ki,Kj(ki=kj, si =sj )  ti  tj mod L. <ti>, i.e. to the clock cycle in which the mapped
Here the modulo operation considers the cyclic operator is implemented. In such a way the algorithm
nature of SDFG implementation. The equal operators configuration is mapped into the structure subspace
are mapped in the PUs of the same type:
and into the event subspace.
Ki,KjKp,q(ki=kj=p, si =sj =q), | Kp,q| ≤ L,
where Kp,q is a set of nodes of p-th type, that are 3. Method for multiplexor minimization
mapped into q-th PU.

2
Consider the method of multiplexor minimizаtion subconfigurations exchange their coordinates. Then
in pipelined computing system which is effective for two situations can be held on. In the first one the
FPGA projects. Very often SDFGs of DSP algorithms interchanged node will be mapped into additional PU
have a set of isomorphic subgraphs. For example, node, which increases the hardware volume. Besides,
algorithms of complex IIR filters are the chains of the the vector Di,l , which is outputted from this node,
second order stages, FIR filter graphs have equal exchanges its coordinates. And in the resulting
periodic parts. High speed recursive filter structures structure the additional multiplexor input occurs in
are designed composing identical all-pass subfilters which the data is outputted from the additional PU
[9]. To minimize the multiplexor number in such node.
structures the following theorems are used. In the second situation the node is mapped into
Theorem 1. Consider the balanced algorithm another free PU node, or two nodes are mutually
configuration. Then the multiplexor input amount NМi interchanged. Then the two input multiplexers have to
of i-th PU node not succeeded the number of vectors be set on the inputs of this PU, or these PUs. As a
Dі,j that are not equal to each other, and its arrows are result, the conditions (1),(2) are not satisfied by any
incident to the nodes Ki,k, mapped in this PU. exchanges in subconfigurations, which cause the
The theorem proving is evident. Such vectors Dі,j increase of PU number or multiplexor input number.
represent the variables that are inputted in i-th PU
through different multiplexor inputs. The words "not 4. Method for synthesis of digital filters
succeeded" means the occasion when the multiplexor with multiple delays
has a single input and is not needed at all. In another
situation the multiplexor input number is equal to the Usually digital filters are described by the transfer
different vector-node number. function H0(Z) which depends on the complex
Theorem 2. Consider the balanced algorithm variable Z. This function represents the filtering
configuration KG which has up to L isomorphic algorithm with the period L=1 precisely. And each
subgraphs, or equivalent subconfigurations multiplicand Z-k in the function represents the delay to
KGОі=(KОі,DОі,AO). Then the structure, which is the k cycles, which can be implemented on the FIFO with
result of this configuration mapping with the period L, k registers. If the register number in each delay FIFO
has the minimum multiplexor input number if and is increased in n times then the filter with the function
only if all the subconfigurations KGОі are mapped into Hn(z) = H0(Zn) is derived. Such a filter, named the
a single structure subconfiguration KSО, and filter with multiple delays, has the following property.
Ki,jKОі(Ki,j= <Cj,k,Cj,l, ti,j>) ; (1) Its frequency characteristic is similar to the prototype
Ki,jKОі(arrow of Di,l is incident to Ki,j  filter characteristic but in the range of 0 to fS these
Di,l=<Ck,Cl, 1> , Cl≠0 ). (2) characteristic repeats itself n times, where fS is the
This means that the structure subgraph is quantization frequency.
isomorphic to the proper algorithm subgraphs. Figure When the filter stages are connected in a chain
1 illustrates mapping the algorithm configuration with then the resulting characteristic is the product of the
the period L=3 to the structure configuration. Bold stage characteristics. When these characteristics are
line represent the multiplexor. different ones then they mask each other. Using the
masking effect and the stages of filters with
multiplied delays the high quality narrow band filters
are designed which have small hardware volume [10].
Besides, in Xilinx FPGAs the multiplied delays
s are inplemented in FIFO units of SRL16 - type, which
occupy the same place as separate registers do. For
o t the synthesis of filters in FPGA the following method
KGО3 is proposed.
KGО1 KGО2 KSO On the first step of the method the algorithm is
selected which SDFG has a set of isomorphic
Figure1. Example of algorithm configuration subgraphs. The subgraph edges can be loaded by
mapping. delays which are equal to k. Consider the subgraph
number is L. This figure provides the maximum
The theorem proving. When the conditions (1) and hardware utilization but in general, it can have
(2) are satisfied then the resulting PU with one input another value. The subgraph with k=1 is represented
ALU has a single input, and with the two input ALU by the algorithm subconfiguration named as period
has two inputs respectively. This means that each PU configuration. The border nodes in it are selected
of resulting subconfiguration has none multiplexor at which are incident to the reversed edges. These border
all, not to consider the inputs of PU (see the Fg.1). nodes are mapped into iteration inputs and outputs of
From the opposite side, consider the theorem the structure. The period configuration is optimized as
conditions are satisfied. Let one or two nodes in the it is mapped into the structure with the parameter L=1

3
clock cycle, i.e. the structure with the maximum The time axis is horisontal and the space axis is
parallelism. vertical.
On the second step up to L subconfigurations and On the second step L=4 subconfigurations are
other nodes (if needed) are connected together connected together. The resulting configuration is
according the algorithm graph. By this process the shown on the Fig. 3. And on the third step the filter
conditions of the theorem 2 are satisfied. If some filter structure is derived, and its FSM is synthesized. The
stage uses delays to k cycles then (k-1)L delay nodes derived structure configuration is illustrated by the
are added to reversed edges of its subconfiguration. Fig.4. In the figures the colored polygons represent
On the third step the configuration is mapped into the subconfiguration shown on the Fig.2.
the structure with the period L. The resulting structure
consists of the pipelined substructure which
implements the period configuration, and multiplexers
with delays to (k-1)L clock cycles on its inputs.
The hardware volume of the resulting structure,
not to account registers, is estimated by the formula:
'S = nA + 10nM + 0,57LnD ,
where nA and nM is the adder and multiplier number
сумматоров и блоков умножения, which is equal to
the number of nodes of additions and multiplications
in the subgraph of SDFG; nD is the number of
reversed edges, or delay lines in the filter stage.
The register number in the structure depends on
the topology of the subgraph, on the delay distribution
between reversed edges, and can be estimated as the
number which is proportional to L. As a result, if the
subgraph number is equal to L, then the structure has
the minimum number of registers, adders and
multipliers. This is proven by the fact that by this
condition exactly L nodes of delay, addition, and
multiplication are mapped into respective PU nodes,
and all of them are fully loaded.

5. Filter synthesis example

Consider the example of the low pass filter


structure synthesis using the methods described in [9].
The filter consists of four sequential stages with the
delays multiplied by 1, 2, 4, and 8. First three stages
have the transfer characteristics as the bireciprocal
filter has. The last stage frequency characteristic
forms the sharpness of whole filter. Each stage is Figure.2. Subconfiguration of the filter
implemented on the base of allpassfilter, and therefore
it has stable, and well regulated characteristics. The
transfer function of a single (first) stage is the
following
Z 2  b(1  a ) Z 1  a , (3)
H ( z )  Z 1 
1  b(1  a ) Z 1  aZ  2
where b regulates the pass band cutoff frequency, a
regulates the filter sharpness. This filter is effectively
implemented using the two staged lattice wave digital
filter structure which provides the minimum of
multipliers [11]. Figure.3. Resulting algorithm configuration
On the first step of the synthesis the balanced
algorithm subconfiguration is formed which is shown The filter implementation consists in transfor-
on the Fig.2. On this figure the register nodes are mation its algorithm configuration into VHDL
signed by large circles, reverse edges are signed by description and then in configuring it in FPGA.
the dotted arrow, and border nodes – by small circles.

4
When such a filter is designed by the usual volume, high characteristics, and can operate with the
method then its structure consists of four separate signals with the sampling frequencies up to 24 MHz.
stages connected in chain. Each stage has 7 adders 2 The filter structure and control is so complex that its
multipliers and 3 registers for pipelining. Besides design would not impossible without new method use.
these stages have three registered delays, each of them
has 1, 2, 4 and 8 registers, which can be implemented References
on SRL16 units. The resulting hardware volume is
equal SВ = 132. In the critical path of this structure 4 [1] S. S. Bhattacharyya, R. Leupers, P.Marwedel, “Software
adder are set, i.e. the clock cycle period is equal to Synthesis and Code Generation for Signal Processing Sys-
ТВ = 4ТS. tems” IEEE Trans. on Circuits and Systems—II: Analog
and Digital Signal Processing, 2000, V47, №9, pp.849-
875.
[2] The Systhesis Approach to Digital System Design / Ed.
P. Michel, U.Lauther, P.Duzy, Kluwer Academic Pub.,
1992.
[3] Eles P., Kuchinski K., Zebo P., System Synthesis with
VHDL, Kluwer Academic Pub., 1998.
[4] http://www.xilinx.com
[5] Ju.S.Kanevski, A.M.Sergienko, A. A. Guzinski,
“Method for Mapping Unimodular Loops into
Application Specific Parallel Structures”, Proc. of the 2-
nd Intern. Conf. "Parallel Proc. and Appl. Math. PRAM'97"
, Zakopane, Poland, 1997, pp. 362-371.
[6] A.Sergyienko, J.Kaniewski, O.Maslennikov,
R.Wyrzykowski, “Mapping regular algorithms into
processor arrays using software pipelining” Proceedings of
the 1-st Int. Conf. on Parallel Computing in Electrical
Engineering, Bialystok, Poland, 2-5 Sept., 1998. pp.197-
200.
[7] Ju.S.Kanevski, L.M.Loginova and A.M.Sergienko,
Figure 4. Resulting structure configuration “Structured Design of Recursive Digital Filters”
Engineering Simulation, OPA, Amsterdam B.V., 1996, Vol.
The synthesized structure has 2 multipliers, 7 13, pp. 381-389.
adders, 26 registers and registered delays, and 7 four [8] A.Sergyienko, J.Kaniewski, A. Arefjev, D.Kortshev. A
input multiplexers. As a result its hardware volume is Method for Mapping DSP Algorithms into Pentium MMX 
equal to S=65,5. The critical path delay is equal to Architecture. Proc 3-d Int. Conf. On Parallel Processing
the delay of a single adder or multiplier, i.e. Т = ТS, and Applied Mathematics, PPAM'99, Kazimierz Dolny,
Poland, Sept. 14-17, 1999, pp.348-356
and the cycle period is equal to LТS=4ТS. Comparing [9] H. Johansson , L. Wanhammar, “High Speed Recursive
the derived figures, the proposed structure has Filter Structures Composed of Identical All-Pass Subfilters
twofold smaller hardware volume than the usual for Interpolation, Decimation, and QMF Banks With Perfect
structure has, and has the same throughput. Magnitude Reconstruction”, IEEE Trans. on Circuits and
Systems – II Analog and Digital Signal Processing, 1999,
6. Dynamically tuned Digital filter V46, N1, Jan., pp.16-28.
[10] J. G.Chung, K.K.Parhi, “Pipelined wave digital filter
design for narrow-band sharp-transition digital filters” Proc.
In the representation the digital filter that is IEEE Workshop VLSI Signal Processing, La Jolla, CA,
tuned dynamically (on the fly) is represented. This 1994, pp. 501-510.
filter design is similar to the design example shown [11] A. Fettweis, “Wave digital filters: Theory and
above. The main idea is that the stage coefficients, practice”, Proc. IEEE, 1986, V74, N2, Feb., pp.270-327.
and the chain length can be exchanged during the
filter operation considering the properties of the
transfer function (3). Up to 8 stages are connected
into chain. As a result, the bandpass edge frequency is
tuned in the range of (0,015-0,4)fS. The suppression
level is not less than 75-80 дб, and the filter sharpness
is not less than 100 db per octave. The 12-bit length
frequency code regulates the edge frequency.
The filter was implemented in Xilinx Virtex2-
Pro FPGA device. Its hardware volume occupies 706
CLB slices and three multipliers. Due to the full
pipelining the maximum clock frequency is equal to
190 MHz. As a result, the filter has small hardware

View publication stats

You might also like