
Synthesizing DSP Kernels with a High-Performance Data-Path Architecture

M.D. Galanis (1), G. Theodoridis (2), S. Tragoudas (3), and C.E. Goutis (1)
(1) Electrical and Computer Engineering Department, University of Patras, Rio, Greece
(2) Physics Department, Aristotle University, Thessaloniki, Greece
(3) Electrical & Computer Engineering Department, Southern Illinois University, Carbondale, USA
e-mail: mgalanis@vlsi.ee.upatras.gr

Abstract: In this paper, a high-performance data-path architecture is proposed for synthesizing DSP kernels. The data-path's primitive resources are identical small templates. The steering logic allows the control unit to quickly implement desirable templates, so that the system's performance benefits from the chaining of operations. The small number of these data-path computational resources, coupled with their simple structure, allows for chaining and latency reduction over existing methods with template-based computational resources. Data flow graph scheduling and binding are accommodated by efficient algorithms that achieve minimum latency at the expense of a negligible overhead in the control circuit and the clock period. Compared with data paths implemented by primitive resources, a reduction in latency is achieved when the proposed architecture is adopted. Its efficiency in terms of chaining exploitation over existing template-based architectures is also shown.

I. INTRODUCTION
In the majority of the existing commercial and academic architectural synthesis tools, primitive resources like ALUs or multipliers (called conventional resources hereafter) implement the data-path. The intermediate results are usually stored in a centralized register bank [5]. Methodologies that allow for more complex resources have also been proposed [1]-[4]. These complex units, called templates or clusters, consist of primitive resources connected in sequence without intermediate registers. This sequencing of operations, called chaining, is exploited during synthesis to reduce the number of latency cycles and improve the system's performance [5]. The templates can be contained in an existing library [1, 3], or they can be extracted from the Control Data Flow Graph (CDFG) of the application by a process called template generation [2, 4].
Corazao et al. [1] have shown that templates of depth two, with at most two primitive resources per level, can be used as the computational elements of the data-path to significantly improve performance. However, their approach requires a large number of different template instances. They propose to generate templates for a large subset of these instances, and a scheduling algorithm is then presented to cover the Data Flow Graph (DFG) with template instances so as to maximize the throughput, i.e., 1 / (latency × template period). To achieve this goal, a large number of templates (some occurring more than once in the data-path) may be required. This prevents the design of an efficient inter-cluster network to support the system's performance. Thus, the templates are forced to communicate through the register bank, and chaining may not always be feasible. It is reasonable to assume throughout the paper that the computations are of ALU or multiplication type and that the respective computational units are implemented in combinational logic.
Consider the DFG of Fig. 1. If direct inter-template connections are allowed, the chaining of operations is optimally exploited and the DFG is realized in one clock cycle (Fig. 1a). On the other hand, if direct inter-template connections are not allowed, as in [1], the DFG is realized in two clock cycles (Fig. 1b). In particular, the result of operation b is produced and stored in the register bank in the first cycle and is consumed in the second cycle.

Figure 1. Latency increase when there are no direct interconnections among templates: (a) with a direct interconnection, both templates execute in Time 1; (b) without direct interconnections, template 1 executes in Time 1 and template 2 in Time 2.

In this paper, a high-performance data-path architecture consisting of ASIC coarse-grain components is proposed. Also, a methodology to synthesize behavioral descriptions of DSP applications using this component is illustrated. The architecture is expected to increase the throughput over existing template-driven approaches, particularly when the DFG has a long critical path, which is the case in DSP applications.
The introduced component is a 2×2 array of nodes, where each node contains one ALU and one multiplication unit, both implemented as purely combinational circuits. At any time instant, only one of them is activated. This area redundancy has no impact on the power dissipation and a negligible overhead in the delay of the resource. A flexible interconnection network, accompanied by appropriate steering logic, exists inside the component, allowing the control unit to easily implement any desired template, so that the system's performance benefits from the chaining of operations. Moreover, direct inter-component connections are allowed to fully exploit any chaining possibility of the DFG, something that methods such as that of [1] cannot achieve. Due to the component's regularity, and since one type of resource is used, the synthesis algorithm is simpler than those of [1, 2, 4]. The generated data-path is characterized by minimum latency, due to the optimally achieved chaining, at the expense of a negligible overhead in the clock period caused by the intra-component steering logic.
Compared with an ASAP or ALAP schedule using conventional resources with at most c resources at any scheduled level, the proposed methodology is guaranteed to implement the same schedule with half the latency (and a clock period equal to that of the flexible component) using c/2 templates with all possible inter-component interconnections, so that chaining of operations is feasible at any level. It must be mentioned that in real-life applications the quantity c/2 is small, as the average Instruction Level Parallelism (ILP) does not exceed the value of 6 for the DFGs comprising the Control Data Flow Graph (CDFG) [2]. Compared with data paths implemented by primitive resources, a reduction in latency is achieved when the proposed platform is adopted. Its efficiency in terms of chaining exploitation over the template-based architectures is also demonstrated by the performed experiments.
The paper is organized as follows: the structure of the introduced component and the data-path architecture are presented in Section II, while the analysis of the component's features is presented in Section III. Section IV presents the synthesis algorithm, while the experimental results are reported in Section V. Finally, conclusions and future work are discussed in Section VI.
II. COMPONENT-BASED DATA-PATH ARCHITECTURE

A. Intra-component architecture
The structure of the proposed coarse-grain component is illustrated in Fig. 2. The component consists of 4 nodes, 4 inputs (in1, in2, in3, in4) connected to the centralized register bank, 4 additional inputs (A, B, C, D) connected either to the register bank or to another component, two outputs (out1, out2) connected to the register bank and/or to another component, and two outputs (out3, out4) whose values are stored in the register bank. Since each internal node performs a two-operand computation, multiplexers are used to select the inputs of the nodes of the second level.
Figure 2. Architecture of the component (four nodes in a 2×2 arrangement; inputs in1-in4 come from the register bank, inputs A, B, C, D come from the register bank or from another component, outputs out1 and out2 go to the register bank or to another component, and outputs out3 and out4 are stored in the register bank)

The detailed structure of the node is shown in Fig. 3. Both the ALU and the multiplier are implemented in combinational logic, so as to take advantage of the chaining of operations inside the coarse-grain component. The ALU performs shifting, arithmetic, and logical operations. Each time, either the multiplier or the ALU is activated according to the control signals Sel1 and Sel2, respectively, as illustrated in Fig. 3. When a node is utilized, one of the Sel1 and Sel2 signals is set to 1 so as to enable the desired operation. If the node is not utilized, both Sel1 and Sel2 are set to 0.
Figure 3. Node structure (inputs In A and In B pass through tri-state buffers; control signals Sel1 and Sel2 activate the multiplier or the ALU, whose result drives the output Out)

Each operation is 16 bits wide, because such a word-length is adequate for many DSP and multimedia applications. This choice of bit-width characterizes the component as coarse-grain. Moreover, multiplication and ALU operations are selected because DSP and multimedia kernels mainly consist of these operations. The same operations have also been chosen for the templates of [1, 2, 4].
If exceptional operations such as division or square root are needed, they can be realized by special units, which communicate with the coarse-grain ones. An alternative solution would be to transform the exceptional operations, where possible, into a series of ALU and/or multiply operations to be handled by the introduced component.
B. Control overhead for component instantiation
A data-path consisting of the proposed components introduces extra control overhead compared with data-paths consisting of primitive resources and with the ones consisting of templates [1, 2, 4].
The control overhead comes from the signals required to control the buffers and multiplexers. The ALU control signals are not considered, as they also exist in any data-path that includes an ALU unit. Considering the structure of the component as shown in Fig. 2 and Fig. 3, 8 control signals are needed to configure the buffers and 8 control signals are required for the multiplexers. Hence, for each component in the data-path, 16 additional control signals are required compared with a data-path realized by conventional resources. This results in more complex control and steering logic. Also, control signals are required for properly setting the inter-component network. However, as has been mentioned, in realistic applications 4 to 6 operations are executed in parallel per scheduled step on average. This implies that two or at most three components suffice to realize the data-path. Considering the benefits of chaining and the improvement in latency achieved by the use of this component, the extra control overhead is affordable if a high-performance implementation is required. The inter-component network is explained in more detail in the following section.
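Put simply, for a data-path of p components the extra per-component control signals scale linearly (the count below excludes the signals of the inter-component network, which depend on the chosen interconnect):

  extra control signals = 16 × p   (e.g., 32 for p = 2, 48 for p = 3)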
In Corazao et al. [1], there is also a control overhead compared with primitive resource-based data-paths. There are cases where this overhead is higher than the one the proposed method introduces. Since direct inter-template connection is not allowed, the intermediate results are exchanged among the templates through the centralized register bank and buses. This implies that the number of buses connecting the templates to the register bank, and the number of data values stored in the register bank, are larger than in our case. Thus, the control overhead of [1] stems from the fact that register-enable and bus control signals have to be set by the controller.
For instance, consider a DFG with four levels, where there are 4 operations per level. Assume that the operations are of the same type within each level but differ among the levels. Using the proposed method, the data-path requires two components and 32 extra control signals. Now assume that there are 3 templates of two levels with one operation in each level, and the approach of [1] is adopted. Then four instances of each template are required, and in total 12 units are used. This larger number of data-path units implies more signals to control the buses to which these units are connected and the registers. Thus, the number of control signals of [1] depends on the DFG, and it may be larger than that of our method.
C. Inter-component network
The direct connections among the components of the data-path are implemented through a crossbar interconnect network, as illustrated in Fig. 4. This network is chosen so as to enable all the possible inter-component connections. If there are N inputs and M outputs, there are N × M switches, N buses, and M loads per connection. In the case of a data-path consisting of p components, the total number of second-level inputs is 4p and the number of first-level outputs is 2p. So, there are 8p² switches, 4p buses, and 2p loads. As we have mentioned earlier, even for the case of an ASAP or ALAP schedule in real-life applications, the number of components required at the binding stage is a small constant. Thus, a crossbar interconnect network is feasible, and the routing overhead it introduces is small, not affecting the selection of the clock period of the architecture.
Figure 4. Crossbar interconnect network (N inputs, M outputs, one switch at each crossing)
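The crossbar cost for a data-path of p components follows directly from the expressions above; the small illustrative Python snippet below (the function name is an assumption of ours) evaluates them for the realistic cases of two and three components:

  # Crossbar cost for a data-path of p components, per the expressions in the text:
  # N = 4p second-level inputs, M = 2p first-level outputs.
  def crossbar_cost(p):
      n_inputs, m_outputs = 4 * p, 2 * p
      return {"switches": n_inputs * m_outputs,   # 8 * p**2
              "buses": n_inputs,                  # 4 * p
              "loads": m_outputs}                 # 2 * p

  for p in (2, 3):   # realistic data-paths use two or three components
      print(p, crossbar_cost(p))   # p = 2 -> 32 switches, 8 buses, 4 loads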

III. ANALYSIS OF THE COMPONENT'S CHARACTERISTICS

The introduced component offers two main advantages. First, it fully exploits chaining of operations, resulting in latency and performance improvements. Second, due to its regular and flexible structure, a full DFG covering is easily obtained with a small number of component instances, while the scheduling and binding are also simplified.
Owing to the absence of registers inside the component and to the internal interconnection and steering logic, any chaining possibility inside the component is exploited. Additionally, the direct inter-component connections make it possible to further exploit chaining among nodes of different components, resulting in improved latency and high performance.
Consider the example of Fig. 1: the data-path is optimally realized if two instances of the component are used. The direct interconnection between the instances allows them to exchange data without the use of the register bank and thus optimally exploit chaining, resulting in a one-cycle implementation. If other template-based approaches, as in [1, 2, 4], are adopted, chaining is not fully exploited. This is also verified by the experimental results.
The proposed component is a 2×2 array of nodes with a regular structure that simplifies the scheduling and binding. Furthermore, the existence of a library consisting of only one type of component further simplifies binding. The capability of each node to connect either to the register bank or to nodes of the same or another component offers high flexibility to achieve a full DFG covering with the minimum number of components. It must be mentioned that the internal interconnection and steering logic allow the control unit to realize any required template to cover the DFG.
On the contrary, the approaches of [1, 2, 4] use templates with predefined operations and data-flow structure. Under these restrictions, to cover a DFG portion with such a template, the data-flow structure and the type of operations of this portion must be matched with a template available in the library. This usually results in a difficult template-matching problem. Also, as we use only one type of component, there is no need to develop sophisticated graph algorithms to derive the required templates of the application domain, as in [2, 4].
Moreover, the flexible intra-component connections allow covering isolated operations in a control step (c-step) of the scheduled DFG without using extra resources. We call isolated operations the ones that do not have any data dependency with any other operation in a c-step. This is achieved since the nodes can connect their inputs and outputs to the register bank.
On the contrary, the DFG may not be completely covered by existing template-based methods when partial matching is not supported. Then the uncovered nodes are realized by extra primitive resources implemented either in ASIC [4] or in FPGA technology [2]. In both cases, there is an area overhead, as extra resources are used. If the extra resources are implemented in FPGA technology while the templates are implemented in ASIC technology, as in [2], there is a significant performance degradation due to the low performance of the FPGAs. In the case of our components, there is no need for additional primitive resources, since the DFG is completely covered by the component instances used.
Also, the proposed coarse-grain component can be partially utilized, resulting in reduced energy consumption. This happens because the tri-state buffers can freeze the inputs of a node when it is not used.
Compared with a data-path realized using conventional primitive resources, the critical path of the introduced component increases due to the steering logic (i.e., the multiplexers and buffers). On the other hand, since the component is implemented in combinational logic, there is a reduction in delay due to the absence of the register transfers that occur when a data-path with conventional resources is adopted. Specifically, the extra delay is bounded by the delay of four 16-bit buffers plus the delay of one 16-bit 4-to-1 multiplexer minus the delay of one 16-bit register. This delay overhead is expected to cause a negligible increase of the clock period if the component's physical design is properly optimized.
Compared with an equivalent component realized using templates [1, 2, 4], the critical path of the introduced component increases by the delay of four 16-bit buffers plus the delay of one 16-bit 4-to-1 multiplexer. This small increase in the critical path translates into only a small increase of the clock period, considering the one-cycle execution delay of the components. So, the clock period will be slightly larger for our component-based data-path, but a significant improvement in latency is expected due to the optimal exploitation of chaining. This latency improvement, combined with the small increase of the clock period, results in a larger throughput.
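In symbols (with t_buf, t_mux, and t_reg denoting the delays of a 16-bit tri-state buffer, a 16-bit 4-to-1 multiplexer, and a 16-bit register, respectively), the two bounds discussed above are:

  extra critical-path delay vs. conventional resources   ≤ 4·t_buf + t_mux - t_reg
  extra critical-path delay vs. template-based resources ≤ 4·t_buf + t_mux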
Since each node contains one ALU and one multiplier, there is also an area overhead compared with a data-path realized by templates or primitive resources. However, as shown in the experimental results, this is not significant, as the component is almost fully utilized, which means that each of the above units is required and used to realize the data-path.
Finally, the routing and steering logic overhead due to the allowed direct inter-component connections is not prohibitive, since the DFGs of real-life applications can be accommodated by at most three components, as mentioned in Section II-B.
IV. SCHEDULING AND BINDING METHODOLOGY
In Fig. 5, the scheduling and binding methodology is
illustrated. The scheduling is a resource-constrained
problem since a fixed number of coarse-grain components
are considered to be available to realize the data-path.
Figure 5. Scheduling and binding methodology (the input DFG and the constraint on the number of coarse-grain components feed the scheduling step, which is followed by binding with the components, yielding a scheduled and allocated DFG)

The input of the methodology is an unscheduled DFG. The DFG is used as the Intermediate Representation (IR) of an application described in a high-level language such as behavioral VHDL or C/C++. For scheduling and binding Control Data Flow Graphs (CDFGs), the methodology is iterated over the DFGs comprising the CDFG of an application [5].
The DFG is scheduled by a list scheduler. Although this is a heuristic scheduler, it is faster than an Integer Linear Programming-based one, especially for large DFGs consisting of many nodes and data edges, while it achieves comparable results for such DFGs [5]. In our case the list scheduler reduces to Hu's scheduler [6]. This is for two reasons. The first is that the data-path is implemented by one type of resource (i.e., the coarse-grain component). The second is that the clock period of the data-path is set so that the coarse-grain components have a one-cycle execution delay. Since Hu's algorithm also assumes that each operation can be realized by the same type of resource and that each operation has a unit execution delay, this algorithm is adopted to schedule the DFG.
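Under these assumptions (a single resource type and unit execution delay), the scheduler can be sketched as follows. This Python reconstruction of a Hu-style list scheduler is illustrative only and is not the authors' implementation; the bound max_ops on operations per c-step is left as a parameter and would follow from the number of available components.

  # Illustrative reconstruction of a Hu-style list scheduler (not the authors' code).
  # Assumptions: every operation has unit delay and runs on the single resource type.
  from collections import defaultdict

  def hu_schedule(nodes, edges, max_ops):
      """nodes: list of op ids; edges: list of (src, dst) data dependencies.
      Returns {node: c_step}, with c-steps numbered from 1."""
      succs, preds = defaultdict(list), defaultdict(list)
      for u, v in edges:
          succs[u].append(v)
          preds[v].append(u)

      # Hu priority: longest path (in nodes) from each node to a sink.
      level = {}
      def longest(n):
          if n not in level:
              level[n] = 1 + max((longest(s) for s in succs[n]), default=0)
          return level[n]
      for n in nodes:
          longest(n)

      scheduled, step = {}, 0
      while len(scheduled) < len(nodes):
          step += 1
          ready = [n for n in nodes if n not in scheduled
                   and all(scheduled.get(p, step) < step for p in preds[n])]
          for n in sorted(ready, key=lambda n: -level[n])[:max_ops]:
              scheduled[n] = step
      return scheduled

  # Example: three operations where c depends on a and b.
  print(hu_schedule(["a", "b", "c"], [("a", "c"), ("b", "c")], max_ops=4))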
Due to the aforementioned features of a component-based data-path presented in Section III, a relatively simple, though efficient, algorithm is used for the binding with the coarse-grain components. The pseudo-code of the binding algorithm is illustrated in Fig. 6.
do {
  for each component in the data-path
    for each remaining row of the component
      while (col_idx < 2 && col_idx < # of ops not yet covered)
        map_to_comp(node, row_idx, col_idx)
      end while
    end for
  end for
} while (the graph is not fully covered)

Figure 6. Binding algorithm

For every component in the data-path, a covering of the graph is performed. The component maps the DFG nodes to its nodes in a row-wise manner. It covers the operations of a c-step until the first-row (first-level) operations of the coarse-grain component are all utilized, or until there are no DFG nodes left uncovered in the first level of this c-step. Then it proceeds to the covering of the second-row operations in the same c-step, if there are any DFG nodes left to be covered. This procedure is repeated for every component in the data-path, as long as there are DFG nodes left to be covered. The binding starts from the component assigned the number 1, and it continues, if necessary, with the next components in the data-path. A component is not utilized in a c-step when there are no nodes left in the DFG to be covered. Also, a component is partially utilized when there is not a sufficient number of operations in a c-step and the mapping-to-component procedure (map_to_comp) has already been started for this component. If there are p components in the data-path, the maximum number of operations per c-step is equal to 4p, because each component comprises four nodes (two rows of two operations), as illustrated by the sketch below.
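One reasonable reading of this row-wise covering is sketched below in Python; it is an interpretation of Fig. 6, not the authors' code, and ops_per_row = 2 and rows = 2 reflect the two-by-two structure of the component.

  # Interpretive sketch of the row-wise binding of Fig. 6 (assumptions noted above).
  def bind_cstep(ops_by_level, num_components, ops_per_row=2, rows=2):
      """ops_by_level: list of lists, the operations of one c-step ordered by level.
      Returns (assignments, uncovered) for this c-step; each assignment is
      (component, row, col, op)."""
      assignment = []
      remaining = [list(level) for level in ops_by_level]
      for comp in range(num_components):
          for row in range(rows):                    # cover the component row by row
              if row >= len(remaining):
                  break
              col = 0
              while col < ops_per_row and remaining[row]:
                  assignment.append((comp, row, col, remaining[row].pop(0)))
                  col += 1                           # map_to_comp(node, row, col)
      uncovered = any(remaining)                     # binder iterates until covered
      return assignment, uncovered

  # Example: a c-step with three first-level and one second-level operation.
  print(bind_cstep([["a", "b", "d"], ["c"]], num_components=2))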
The result of the binding of the scheduled DFG with the available components is a scheduled and bound DFG. The overall latency of the DFG is measured in new clock cycles. The new clock cycles have a clock period Tnew, set so that the components have a one-cycle execution delay.
V. EXPERIMENTAL RESULTS
A prototype tool has been developed in C++ to demonstrate the efficiency of the introduced synthesis methodology on well-known benchmarks. Extra functionality to perform graph operations was obtained from the Boost Graph Library (BGL) [7].
The DFGs used in the experiments were obtained from behavioral VHDL descriptions available in the benchmark set of the CDFG tool of [8]. This set was chosen as it contains well-known DSP kernels. All these kernels consist of multiplication and ALU operations. The number of nodes and edges in the benchmark suite's DFGs is shown in Table I.
Table I shows the results in terms of latency when the DFGs are implemented by primitive resources and by the proposed method using two components. The clock period in these cases is different and is set so that each operation is executed in one clock cycle. For the case of a primitive resource-based data-path, there are four resources activated per c-step. This is because two coarse-grain components are used to implement the data-path, which implies that at most four operations can be executed in parallel.
In the last column, the component usage is given. The component usage is the ratio of the number of component instances enabled in the binding to the product of the number of components and the latency.
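Expressed as a formula, and checked against the dct entry of Table I:

  component usage = (component instances enabled in the binding) / (number of components × latency)
  e.g., dct: 11 / (2 × 6) = 11/12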
TABLE I. LATENCY AND COMPONENT USAGE RESULTS

DFG        nodes  edges  Primitive resource  Two 2x2 CGCs
                         latency             latency  comp. usage
dct         44     84        11                 6       11/12
ellip       39     76        12                 6       11/12
fir7        21     47         7                 4        6/8
fir11       33     69        11                 6        9/12
iir         18     31         7                 4        5/8
lattice     24     45         9                 5        8/10
volterra    34     65        12                 6        9/12
wavelet     69    146        17                 9       18/18
wdf7        52    106        13                 7       13/14

As depicted in Table I, there is a decrease of 45% on average in the number of cycles from the usage of the coarse-grain element. The actual throughput of the scheduled DFG is expected to be larger than that of an implementation using primitive resources if the component is manufactured such that its worst-case delay permits the clock period to be set smaller than (Lold × Told) / Lnew, where Lold and Told are the latency and the clock period of the data-path realized by primitive resources, while Lnew is the latency of the data-path realized by the proposed components. If the average decrease in clock cycles is 45% and Told is set to the multiplier's delay, then k = Lold / Lnew ≈ 1.8. As shown in [1], such a bound for the ratio k can be satisfied, thus allowing the throughput improvement over the primitive resource-based data-path.
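As a worked example, for the dct benchmark of Table I (Lold = 11, Lnew = 6, and Told taken as the multiplier's delay), the condition becomes:

  Tnew < (Lold × Told) / Lnew = (11 / 6) × Told ≈ 1.8 × Told

i.e., the component-based data-path yields a higher throughput as long as its clock period stays below roughly 1.8 times the multiplier's delay.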
For DFGs that exhibit a large degree of parallelism (expressed as the number of operations in a c-step of an ASAP schedule), and whose parallelism is distributed over the c-steps due to the component constraints, the component usage approaches the value of 1. It is actually equal to 1 for the wavelet benchmark, which is the one with the largest average parallelism in the c-steps of an ASAP schedule.
To show the benefits of the proposed method in terms of chaining over the template-based synthesis methods, two more experiments were performed. In the first experiment the DFGs were synthesized by the proposed method using two of the introduced components. The DFGs were also synthesized using the templates of [1] without inter-template connections, allowing at most four operations to be executed in parallel. This is done to have a fair comparison, as the use of two of the coarse-grain components implies that at most four parallel operations can be covered. A similar experiment was performed using three coarse-grain components and allowing at most six operations to be executed in parallel. When the templates of [1] were used, the DFGs were scheduled using Hu's algorithm and then covered by the templates. Extra templates were added to achieve full covering of all DFGs without using primitive resources.
In both experiments an optimal chaining was achieved by the proposed method. However, the same did not happen when the template-based synthesis was applied. Table II shows the average number of chaining misses, which equals the ratio (total misses / latency), and the maximum number of misses in a c-step of the scheduled DFGs.
For all the benchmarks except fir7 there are chaining misses, which cause an increase in latency. In the second experiment the number of chaining misses increases for the dct, iir, wavelet, and wdf7 benchmarks. This increase is explained as follows: as the number of available resources increases, the scheduler allows more operations to be executed in each c-step. This increases the likelihood of data dependencies among the operations, such as those of Fig. 1. Since direct inter-template connections are not allowed in the existing template-based methods, the number of chaining misses increases.
TABLE II. AVERAGE AND MAXIMUM CHAINING MISSES

Benchmark   1st experiment        2nd experiment
            average    max        average    max
dct          1/6        1          6/4        3
ellip        5/6        2          5/6        2
fir7         0/4        0          0/4        0
fir11        1/6        1          1/6        1
iir          2/4        1          3/4        2
lattice      2/5        1          2/5        1
volterra     1/6        1          1/6        1
wavelet      2/9        2          2/7        2
wdf7         1/7        1          5/7        2
VI. CONCLUSIONS AND FUTURE WORK
A high-performance data-path architecture for synthesizing computationally intensive DSP kernels, consisting of flexible coarse-grain components, has been presented. A small number of identical components implements the data-path, where the chaining of operations is optimally exploited, resulting in latency and performance improvements compared with existing synthesis methodologies. The structure of the proposed universal component allows for simpler and efficient synthesis algorithms. Ongoing work focuses on the implementation of the coarse-grain component in ASIC technology so as to compare its delay characteristics with those of the templates of [1, 2, 4]. Also, a complete architectural synthesis system is being developed, so as to accurately estimate the area, power consumption, and execution time of kernels executed on a coarse-grain component-based data-path.
VII. REFERENCES
[1] M. R. Corazao et al., "Performance Optimization Using Template Mapping for Datapath-Intensive High-Level Synthesis," IEEE Trans. on CAD, vol. 15, no. 2, pp. 877-888, August 1996.
[2] R. Kastner et al., "Instruction Generation for Hybrid Reconfigurable Systems," in Proc. IEEE/ACM International Conf. on Computer Aided Design (ICCAD 2001), pp. 127-130, 2001.
[3] S. Note et al., "Cathedral III: Architecture-Driven High-Level Synthesis for High Throughput DSP Applications," in Proc. of DAC, pp. 597-602, 1991.
[4] D. Rao and F. Kurdahi, "On Clustering for Maximal Regularity Extraction," IEEE Trans. on CAD, vol. 12, no. 8, pp. 1198-1208, August 1993.
[5] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, International Editions, 1994.
[6] T. C. Hu, "Parallel Sequencing and Assembly Line Problems," Operations Research, pp. 841-848, 1961.
[7] Boost Graph Library, http://www.boost.org
[8] CDFG toolset, http://poppy.snu.ac.kr/CDFG/cdfg.html
