Definition 2: The set of hardware blocks of a DFG vertex $v$, denoted $B(v)$, contains the hardware blocks from the component library which can perform the computation represented by $v$.

The resulting reconfigurable datapath is the merge of all DFGs $G_i$, and is modeled as a merged graph $\bar{G}$, as defined below. The merged graph is the overlapping of all $G_i$, such that only vertices which can be implemented by the same hardware block can be overlapped.

Definition 3: A merged graph, corresponding to DFGs $G_i = (V_i, E_i)$, $i = 1, \ldots, n$, is a directed graph $\bar{G} = (\bar{V}, \bar{E})$, where:
• A vertex $\bar{v} \in \bar{V}$ represents a mapping of vertices $v_i \in V_i$, each one from a different $V_i$, such that $\bigcap_i B(v_i) \neq \emptyset$.
• An arc $(\bar{u}, \bar{v}, \bar{p}) \in \bar{E}$ represents a mapping of arcs $(u_i, v_i, p_i) \in E_i$, each one from a different $E_i$, such that all $u_i$ have been mapped onto $\bar{u}$, all $v_i$ have been mapped onto $\bar{v}$, and all $p_i$ match.1

The reconfigurable datapath will have a hardware block corresponding to each $\bar{v} \in \bar{V}$, capable of performing all the operations mapped onto $\bar{v}$. For each arc $(\bar{u}, \bar{v}, \bar{p}) \in \bar{E}$, there will be a “path” (in the reconfigurable datapath) from the output of $\bar{u}$ to the input port $\bar{p}$ of $\bar{v}$. Moreover, for each input port of each vertex $\bar{v}$ which has more than one incoming arc, the reconfigurable datapath will have a MUX selecting the input operand.

Given a set of input DFGs $G_i$, it is possible to build several solutions for the merged graph $\bar{G}$. The optimal solution for $\bar{G}$ is the one which produces the reconfigurable datapath with minimum area cost, considering both hardware blocks and interconnections.

Since in our architecture model the interconnection network is based on MUXes, the interconnection area cost is proportional to the number of MUX inputs. For each arc $(\bar{u}, \bar{v}, \bar{p}) \in \bar{E}$ which is a mapping of arcs from DFGs $G_i$, the MUX (if it exists) at the input port $\bar{p}$ of hardware block $\bar{v}$ has fewer inputs than it would have if no arcs were overlapped. Regarding each MUX as a tree of 2-input MUXes, there is a linear dependency between the number of 2-input MUXes and the number of wires, so the interconnection area cost can be expressed in terms of the number of wires.

The area cost of the reconfigurable datapath is defined below.

Definition 4: The total area cost of the reconfigurable datapath corresponding to the merged graph $\bar{G}$ is

$$A(\bar{G}) = A_b(\bar{G}) + A_i(\bar{G}) = \sum_{\bar{v} \in \bar{V}} A_b(\bar{v}) + |\bar{E}| \, A_{mux}$$

where $A_b(\bar{G})$ and $A_i(\bar{G})$ are the hardware block and interconnection area cost, respectively, of the reconfigurable datapath. $A_b(\bar{v})$ is the area cost of the hardware block allocated to $\bar{v}$, and $A_{mux}$ represents the area cost equivalent to one MUX input of the suitable width.

We can now define the DFG merge problem, as follows.

Definition 5: Given $n$ input DFGs $G_i$, $i = 1, \ldots, n$, find the merged graph $\bar{G}$ such that $A(\bar{G})$ is minimum.

Finding a mapping of the vertices from the DFGs so as to minimize the hardware block area cost is not a difficult task. It can be modeled as a maximum weight bipartite matching, for which a polynomial-time algorithm is well known (the Hungarian Method described in [14]). On the other hand, mapping the arcs from the DFGs so as to minimize the interconnection area is a hard problem (more precisely, it is NP-complete), because the mapping of arcs depends on the mapping of their adjacent vertices. That is, two arcs from two DFGs can only be mapped if their source vertices are mapped, as well as their destination vertices. So, if we map vertices without considering the interconnection costs, or using only estimates, we may get a solution where the interconnection area cost is not minimized and, consequently, the total area cost is also nonoptimal.

We now illustrate these concepts with an example. Given the DFGs $G_1$ and $G_2$ in Fig. 3, we can build three different merged graphs. In the first, a vertex from $G_1$ and a vertex from $G_2$ are not mapped, so the corresponding reconfigurable datapath would have six hardware blocks (a multiplier, two adders, a subtractor, an and, and a shifter). In the second, those vertices are mapped, resulting in a reconfigurable datapath with five hardware blocks (a multiplier, an adder/subtractor, an adder, an and, and a shifter). So the hardware block area cost of the first merged graph is larger than that of the second and, consequently, its total area cost is also larger.

Still in Fig. 3, the second and third merged graphs represent different vertex mappings: a vertex of $G_1$ is mapped onto one vertex of $G_2$ in the second merged graph, while it is mapped onto a different vertex of $G_2$ in the third. The two vertex mappings may appear equivalent and, as a matter of fact, the hardware block area cost is the same for both. But they allow for different arc mappings. In the second merged graph, no arcs are overlapped, so two MUXes are needed at the two input ports of one vertex. In the third, the arcs

1 The meaning of matching input ports will be further elaborated in Section III-A.
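The total area cost defined above can be illustrated with a small sketch. It assumes that every merged arc is charged as one MUX input and that block areas come from a component library; the class name `MergedGraph`, the function `total_area`, and all area figures below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MergedGraph:
    # Merged vertex name -> area of the hardware block allocated to it.
    block_area: dict
    # Merged arcs as (source vertex, destination vertex, destination input port).
    arcs: set = field(default_factory=set)


def total_area(graph, mux_input_area):
    """Hardware block area plus interconnection area of a merged graph."""
    hardware = sum(graph.block_area.values())
    # Assumption: every merged arc is charged as one MUX input.
    interconnection = len(graph.arcs) * mux_input_area
    return hardware + interconnection


# Hypothetical area figures in arbitrary units, not taken from the paper.
unshared = MergedGraph(
    block_area={"mul": 40.0, "add1": 10.0, "add2": 10.0,
                "sub": 10.0, "and": 2.0, "shift": 6.0},
    arcs={("mul", "add1", 1), ("and", "add1", 2),
          ("add2", "sub", 1), ("shift", "sub", 2)},
)
shared = MergedGraph(
    block_area={"mul": 40.0, "addsub": 12.0, "add": 10.0,
                "and": 2.0, "shift": 6.0},
    arcs={("mul", "addsub", 1), ("and", "addsub", 2),
          ("add", "addsub", 1), ("shift", "addsub", 2)},
)

MUX_INPUT_AREA = 1.5
print(total_area(unshared, MUX_INPUT_AREA))  # six blocks, no block shared
print(total_area(shared, MUX_INPUT_AREA))    # adder and subtractor merged
```

With these made-up numbers the datapath that shares an adder/subtractor is cheaper, because the saving in hardware block area is not offset by extra MUX inputs.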
972 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 24, NO. 7, JULY 2005
A. Input Ports and Commutativity

Several two-input operations are commutative, so properly exchanging the sources of these operations can enable arc mappings that would not exist otherwise, thus eliminating wires and MUXes and reducing the interconnection area cost of the reconfigurable datapath. Each operation has a set of input ports which represent the input operands it expects; for instance, an addition has two input ports. From Definition 3, two arcs $(u_1, v_1, p_1)$ and $(u_2, v_2, p_2)$ can be mapped if $u_1$ and $u_2$ can be mapped, as well as $v_1$ and $v_2$, and $p_1$ and $p_2$ match. The input ports $p_1$ and $p_2$ match if: (a) they are equal; or (b) $v_1$ and/or $v_2$ represent two-input commutative operations.

Consider for example the DFGs $G_1$ and $G_2$ of Fig. 4, where the input ports 1 and 2 of the addition operations are shown. When merging $G_1$ and $G_2$, there are three possible vertex mappings and two possible arc mappings. These arc mappings would not be obtained without exploiting the commutativity of the addition operation, because the input ports of the mapped arcs are different. As a result, the merged graph has two fewer interconnections and MUXes than it would have without commutativity.

IV. RELATED WORK

In this section we describe the work available in the literature related to the data-flow graph merge problem.

The most commonly used method for the merge problem combines two DFGs at a time and is based on the problem of finding a maximum weight matching of a bipartite graph. The two partitions of vertices in the bipartite graph correspond to the vertices of the two DFGs being merged. There is an edge connecting two vertices in the bipartite graph if they can be mapped (overlapped). The weight of this edge represents the gain achieved if the two vertices are mapped, concerning the desired objective function. Since each vertex of one DFG should be mapped to at most one vertex of the other, the maximum weight matching of the bipartite graph gives a solution to the problem. This method is a polynomial-time approach, but since vertices are mapped using only estimates about the interconnection mapping, the solutions produced can be far from optimal.

The bipartite matching method has been used in [17]–[19]. In [17] two solutions are presented to synthesize an application-specific unit capable of executing in different modes, each one corresponding to one cluster of operations from the application nonscheduled flow graph. The presented solutions solve both unit selection and unit binding and try to minimize the area cost of the resulting unit. The weight of an edge represents the area gain achieved if the two vertices are mapped. This gain includes hardware block and estimated interconnection area. The interconnection area gain, corresponding to the mapping of vertices $v_1$ and $v_2$, is an upper bound estimate, because it considers that the predecessors of $v_1$ and $v_2$ in the DFGs being merged have been mapped to each other, as well as the successors (which may not happen). The second solution uses a slightly different approach to handle situations in which the two DFGs being merged exhibit a high degree of similarity and regularity. The relative positions of the vertices in the DFGs are used to break ties when several edges in the bipartite graph have the same weight.

In [18] the authors combine two or more designs into a reconfigurable one, based on the identification and mapping of components common to these designs. The goal is to minimize the reconfiguration time. The bipartite graph edge weights are defined using as criteria the component types, the component positions in the FPGA, and the depth from matched ports. In [19] this technique is used to design a reconfigurable datapath capable of performing the computation corresponding to time-consuming inner loops of an application. The bipartite graph edge weights represent estimates on the interconnection mapping.

Another group of methods [17], [20] relies on search mechanisms to explore the solution space looking for the optimal solution. Since the solution space has an exponential size and is explored randomly, the execution times of these methods may not be polynomially bounded, although in practice they sometimes behave reasonably well.

In [20] a merging technique is proposed to synthesize a multifunctional processing unit, in which each function corresponds to clusters of operations in the specification. The method assigns operations to the processing unit’s operators, given a number of pre-allocated operators. The approach is based on iterative improvement and simulated annealing local search algorithms, and aims at reducing the interconnection cost of the processing unit. Both algorithms start with an initial random assignment, and continuously step through a randomly chosen neighborhood solution set, considering a cost acceptance criterion. The concept of neighborhood is defined using a two-exchange mechanism, where the assignments of two operations from the same DFG are exchanged. The neighborhood of the current solution is the set of solutions that can be reached from it by performing only one two-exchange. The solution provided by the iterative improvement algorithm is the first local minimum found, so it may be far from optimal. The simulated annealing method accepts limited deteriorating transitions, trying to escape from local minima, and as a consequence has longer execution times.
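The two-exchange iterative improvement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of [20]: all operations are assumed to be of the same type (so any exchange is valid), neighbors are scanned in a fixed order instead of randomly, and the cost is simply the number of distinct operator-to-operator wires. The names `interconnect_cost` and `iterative_improvement` are hypothetical.

```python
from itertools import combinations


def interconnect_cost(assign, arcs):
    """Number of distinct (source operator, destination operator, port) wires."""
    return len({(assign[u], assign[v], port) for (u, v, port) in arcs})


def iterative_improvement(assign, arcs, dfg_ops):
    """Apply improving two-exchanges until a local minimum is reached.

    assign  : operation -> operator (initial assignment)
    arcs    : (source op, destination op, input port) over all DFGs
    dfg_ops : one list of operations per DFG; exchanges stay inside a DFG
    """
    assign = dict(assign)
    cost = interconnect_cost(assign, arcs)
    improved = True
    while improved:
        improved = False
        for ops in dfg_ops:
            for a, b in combinations(ops, 2):
                # Two-exchange: swap the operators of two operations.
                assign[a], assign[b] = assign[b], assign[a]
                new_cost = interconnect_cost(assign, arcs)
                if new_cost < cost:
                    cost = new_cost
                    improved = True
                    break
                # Not an improvement: undo the exchange.
                assign[a], assign[b] = assign[b], assign[a]
            if improved:
                break
    return assign, cost


# Two tiny DFGs, each with one arc between two additions (hypothetical data).
arcs = [("a1", "a2", 1), ("b1", "b2", 1)]
dfg_ops = [["a1", "a2"], ["b1", "b2"]]
initial = {"a1": "X", "a2": "Y", "b1": "Y", "b2": "X"}

final, cost = iterative_improvement(initial, arcs, dfg_ops)
print(final, cost)
```

In this example the initial assignment needs two wires (X to Y and Y to X); one exchange inside the first DFG aligns both arcs on the same wire, after which no neighbor is better and the search stops at that local minimum. A simulated annealing variant would differ only in occasionally accepting an exchange with `new_cost >= cost`.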
MOREANO et al.: EFFICIENT DATAPATH MERGING FOR PARTIALLY RECONFIGURABLE ARCHITECTURES 973
Fig. 5. Steps of DFG merging algorithm. (a) DFGs $G_1$ and $G_2$; (b) all possible mappings of vertices and arcs of $G_1$ and $G_2$; (c) maximum weight clique on the compatibility graph; (d) merged graph.
A solution described in [17] also uses an iterative improvement method based on the algorithm presented in [20], but performs unit selection and unit binding, and starts with a different initial solution. Again the solution space is explored by stepping from one solution to one of its neighbors, reached by a two-exchange. The cost acceptance criterion does not permit deteriorating transitions, but zero-cost difference transitions are accepted.

In another solution from [17], which also merges two DFGs at a time, an ILP (Integer Linear Programming) model is developed to represent a global interconnection-cost model. In this case, the interconnection area gain corresponding to the mapping of two vertices is dependent on the mapping of their predecessors in the DFGs. This model provides an exact but exponential-time solution to the problem of merging only two clusters, and can be used as an approximation in order to merge several DFGs iteratively.

None of the solutions referred to above exploits operation commutativity in order to improve the interconnection mapping.

A merging approach is proposed in [21] to design a reconfigurable datapath capable of performing the computation corresponding to time-consuming inner loops of an application. The authors use a graph-based technique to perform hardware block and interconnection assignment together, trying to minimize the interconnection area cost. Our merge solution is an extension of this approach, but it is more complete and precise because we model both hardware block and interconnection area costs and we globally solve unit selection and unit binding. Our algorithm solves both unit selection and binding, while the referred solution solves only unit binding and assumes that the blocks have been previously allocated. Besides, our technique tries to minimize the total area of the datapath (including hardware blocks and interconnections), while the referred solution tries to minimize only the interconnection area. Finally, our algorithm exploits the commutativity of the operations in order to increase the interconnection sharing, while the referred solution does not.

V. HARDWARE BLOCK AND INTERCONNECTION MAPPING APPROACH

In this section, we propose a technique to solve the DFG merge problem. Our approach solves the unit selection and unit binding tasks, and performs functional-unit, storage, and interconnection binding simultaneously. So, we are able to find a global solution and optimize the interconnection network using accurate area costs instead of estimates, even for the case in which the interconnection network is MUX-based. Our method merges two DFGs at a time; in order to merge several DFGs, it is applied iteratively. First, two input DFGs are merged, then the resulting graph is merged with another input DFG, and so on.

A. Compatibility Graph

Initially, a compatibility graph is constructed to represent all possible vertex and arc mappings between two input DFGs $G_1$ and $G_2$ and the consistency among these mappings. Two vertices from $G_1$ and $G_2$, respectively, can be mapped (overlapped) if there is, in the component library, a module capable of performing the operations they represent. Two arcs from $G_1$ and $G_2$ can be mapped if their corresponding source vertices can be mapped, as well as their destination vertices, and their input
Fig. 8. Compatibility graph with mappings of $G_1$ and $G_2$ from Fig. 4 (arc mappings obtained through commutativity).
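The enumeration of candidate vertex and arc mappings, including the port-matching rule that exploits commutativity, can be sketched as below. This is a minimal illustration under stated assumptions: the two DFGs are invented for the example (they are not the ones in Fig. 4), the component library is modeled as a list of sets of operation types, and the names `COMMUTATIVE`, `can_share_block`, `ports_match`, `vertex_mappings`, and `arc_mappings` are hypothetical. The consistency edges between mappings are omitted.

```python
from itertools import product

# Assumed set of two-input commutative operation types.
COMMUTATIVE = {"add", "mul", "and"}


def can_share_block(op1, op2, library):
    """True if some library block can perform both operation types."""
    return any(op1 in block and op2 in block for block in library)


def ports_match(op1, p1, op2, p2):
    """Ports match if equal, or if either destination is commutative."""
    return p1 == p2 or op1 in COMMUTATIVE or op2 in COMMUTATIVE


def vertex_mappings(ops1, ops2, library):
    return [(u, v) for u, v in product(ops1, ops2)
            if can_share_block(ops1[u], ops2[v], library)]


def arc_mappings(arcs1, arcs2, ops1, ops2, library):
    result = []
    for (u1, v1, p1), (u2, v2, p2) in product(arcs1, arcs2):
        if (can_share_block(ops1[u1], ops2[u2], library)
                and can_share_block(ops1[v1], ops2[v2], library)
                and ports_match(ops1[v1], p1, ops2[v2], p2)):
            result.append(((u1, v1, p1), (u2, v2, p2)))
    return result


# Hypothetical DFGs: vertex -> operation type, arcs as (source, dest, port).
ops1 = {"a1": "mul", "a2": "and", "a3": "add"}
arcs1 = [("a1", "a3", 1), ("a2", "a3", 2)]
ops2 = {"b1": "and", "b2": "mul", "b3": "add"}
arcs2 = [("b1", "b3", 1), ("b2", "b3", 2)]
library = [{"mul"}, {"and"}, {"add"}]

print(vertex_mappings(ops1, ops2, library))
print(arc_mappings(arcs1, arcs2, ops1, ops2, library))
```

In this invented example there are three vertex mappings and two arc mappings, and both arc mappings pair input port 1 with input port 2, so they exist only because addition is commutative. Replacing `ports_match` with a plain `p1 == p2` test leaves no arc mapping at all.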
Fig. 9. Area increase of previous methods wrt. our Compatible Clique (the missing bars correspond to area increase of 0%).
Fig. 10. Our Compatible Clique speedup wrt. previous methods (the missing bars correspond to speedup of 0.1).
Chaining was exploited during scheduling and no intra-DFG resource sharing was exploited. The DFGs resulting from HLS were merged afterward. The area and delay information used were obtained from a component library. The control units of the input DFGs were not unified on the merged datapath, so control logic area was not measured.

We implemented three techniques based on previous approaches to DFG merge and compared their solutions to ours, with respect to both the area of the resulting reconfigurable datapath and the execution time of the algorithm. These methods are the bipartite maximum weight matching approach (in this section referred to as Bipartite Matching) [17]–[19], the iterative improvement local search algorithm (referred to as Iterative Improvement) [17], [20], and the ILP solution (referred to as Integer Programming) [17]. Our approach, based on computing the solution to a compatible maximum weight clique, is referred to as Compatible Clique.

In a first experiment we compared, for each application program, the area of the resulting datapath produced by our approach and by the three previous techniques. Fig. 9 shows the percentage of area increase produced by the previous methods when compared to our approach. We also measured the execution time of each technique. Fig. 10 shows the speedup achieved by our approach when compared to the other methods (note the logarithmic scale).

Notice that Bipartite Matching produces datapaths with area up to 30% larger than our algorithm. Iterative Improvement produces area increases from 2.5% to 14.9% when compared to our Compatible Clique, but it also takes much longer than our approach. For example, for the JPEG decoder program, Iterative Improvement produces an area increase of only 4.9%, but it takes 75.3 times longer to execute. For the MPEG2 decoder program, its execution time is only 20% longer than our Compatible Clique, but the area increase is 14.9%. Integer Programming is almost equivalent to our Compatible Clique concerning the area of the resulting datapath, but its execution time can be up to 20 thousand times longer than the Compatible Clique approach. For some applications, our Compatible Clique produced datapaths with smaller area than even the Integer Programming approach. Although this method provides the exact solution when merging two DFGs, it is not guaranteed to find the best solution when merging several DFGs iteratively. Moreover, this method does not exploit operation commutativity, while our approach does.
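The two metrics reported in Figs. 9 and 10 can be written out explicitly. The formulas below are an assumption about how the percentages and speedups are computed (relative to the Compatible Clique result), and the raw area and time values are hypothetical; they are chosen only so that the results reproduce the 4.9% and 75.3 figures quoted for the JPEG decoder.

```python
def area_increase_percent(area_other, area_ours):
    """Area increase of another method relative to our datapath area."""
    return 100.0 * (area_other - area_ours) / area_ours


def speedup(time_other, time_ours):
    """How many times faster our algorithm runs than the other method."""
    return time_other / time_ours


# Hypothetical raw measurements (arbitrary area units, seconds).
print(area_increase_percent(1049.0, 1000.0))
print(speedup(150.6, 2.0))
```

A negative area increase means the other method produced the smaller datapath, and a speedup below 1 means it also ran faster than ours.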
Fig. 11. Area increase wrt. our Compatible Clique and execution time for GSM coder (the missing bars correspond to area increase of 0%).
Fig. 12. Area increase wrt. our Compatible Clique and execution time for JPEG coder (the downward bars correspond to area increase of −0.1%).
TABLE I. COMPARISON OF OUR COMPATIBLE CLIQUE SOLUTION AND PREVIOUS APPROACHES
Fig. 14. VHDL code combining $G_1$ and $G_2$ and combined datapath at Synopsys DC output.
Fig. 15. VHDL code merging $G_1$ and $G_2$ and merged datapath at Synopsys DC output.
to know if the datapath merge approach can provide some area gain, when compared to this pure HLS approach. In order to address this issue we performed another set of experiments, which are described in this subsection.

In these experiments, the input DFGs were merged using our algorithm and we generated a datapath for the merged DFG. We also generated a datapath (referred to as combined datapath) for a complete DFG combining all input DFGs without merging. Then these two datapaths were synthesized under the same design constraints and optimization options, and the synthesis results were compared.

We used the Synopsys Design Compiler (DC) synthesis tool (version 2003.03) [24], which performs three levels of optimization: architectural, logic-level, and gate-level optimization. During architectural optimization, DC exploits common subexpression sharing and resource sharing. We provided design constraints so as to optimize the design to minimize its area (setting the directives set_resource_allocation and set_resource_implementation to area_only and set_max_area to 0.0). We also directed the optimization process to apply the maximum effort in the mapping and area reduction phases (specifying both options -map_effort and -area_effort of the compile optimization command set to high).

We illustrate this experiment with an example. Given the DFGs $G_1$ and $G_2$ in Fig. 13, we produced a combined datapath, as well as a merged datapath. Figs. 14 and 15 show the datapath and the corresponding VHDL code, for the combined and merged datapaths, respectively. When we submit these two datapaths to DC, the merged datapath provides an area reduction of 27% when compared to the combined datapath.

By analyzing the resource allocation and sharing reports generated by DC during synthesis, shown in Fig. 16, one can see that in the combined datapath, two subtractions, one from each DFG (lines 38 and 48 in Fig. 14, respectively), were mapped onto the same subtractor (resource r23 in Fig. 16(a)). However, an addition and a subtraction (lines 39 and 50) were mapped onto an adder/subtractor (resource r22), as were a subtraction and an addition (lines 40 and 49), which were mapped onto resource r79. In the merged datapath, there is also a subtractor corresponding to the mapped operation (line 38 in Fig. 15 and resource r22 in Fig. 16(b)). But differently from the combined datapath, the operations at lines 40 and 42 shared the same adder (resource r70), and the operation at line 44 used a subtractor (resource r73). As a result, the combined datapath uses two adder/subtractors for operations which the merged datapath implements with only one adder and one subtractor. Besides, fewer
Fig. 16. Resource allocation and sharing reports generated by Synopsys DC synthesis tool. (a) Combined datapath. (b) Merged datapath.
[13] C. Lee, M. Potkonjak, and W. Mangione-Smith, “MediaBench: a tool for evaluating and synthesizing multimedia and communication systems,” in Proc. MICRO-30, Dec. 1997, pp. 330–335.
[14] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. New York: Dover, 1998.
[15] N. Moreano, G. Araujo, and C. Souza, “CDFG Merging for Reconfigurable Architectures,” Institute of Computing-UNICAMP, Tech. Rep. IC-03-18, 2003. [Online]. Available: http://www.ic.unicamp.br/ic-tr-ftp/2003/03-18.ps.gz
[16] M. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA: Freeman, 1979.
[17] W. Geurts, F. Catthoor, S. Vernalde, and H. De Man, Accelerator Data-Path Synthesis for High-Throughput Signal Processing Applications. Boston, MA: Kluwer, 1997.
[18] N. Shirazi, W. Luk, and P. Cheung, “Automating production of run-time reconfigurable designs,” in Proc. 6th Symp. FCCM, Apr. 1998, pp. 147–156.
[19] Z. Huang and S. Malik, “Managing dynamic reconfiguration overhead in systems-on-a-chip design using reconfigurable datapaths and optimized interconnection networks,” in Proc. Design Automation Test Eur. Conf., 2001, pp. 735–740.
[20] A. van der Werf et al., “Area optimization of multifunctional processing units,” in Proc. Int. Conf. CAD, Nov. 1992, pp. 292–299.
[21] N. Moreano, G. Araujo, Z. Huang, and S. Malik, “Datapath merging and interconnection sharing for reconfigurable architectures,” in Proc. Int. Symp. Syst. Synthesis, Oct. 2002, pp. 38–43.
[22] S. Niskanen and P. Östergård, “Cliquer User’s Guide, Version 1.0,” Commun. Lab., Helsinki Univ. Technol., Helsinki, Finland, Tech. Rep. T48, 2003.
[23] R. Stallman. (2002, Jan.) GNU Compiler Collection Internals. [Online]. Available: http://gcc.gnu.org/onlinedocs/
[24] Synopsys, Inc., “Design Compiler User Guide,” Mountain View, CA, 2003.

Nahri Moreano received the B.S. degree in computer science and the M.S. degree in computer engineering from the Federal University of Rio de Janeiro, Brazil, in 1990 and 1994, respectively. She is currently pursuing the Ph.D. degree in computer science at the University of Campinas, Brazil.
Her research interests include computer architecture and compiler optimizations for embedded systems, hardware/software codesign, and reconfigurable computing.

Edson Borin received the B.S. degree in computer science from the Federal University of Mato Grosso do Sul, Brazil, in 2001. He is currently pursuing the Ph.D. degree in computer science at the University of Campinas, Brazil.
His research interests include computer architectures, processor specialization, and compiler optimizations.

Cid de Souza received the B.S. and M.S. degrees in electrical engineering from the Catholic University of Rio de Janeiro, Brazil, in 1985 and 1989, respectively, and the Ph.D. degree in applied sciences from the Catholic University of Louvain, Belgium, in 1993.
He is currently an Assistant Professor of the Institute of Computing at the University of Campinas, Brazil. His research interests include combinatorial optimization, linear and integer programming, and the design and analysis of algorithms.

Guido Araujo received the B.S. degree in electrical engineering from UFPE, Brazil, in 1987, and the M.S. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, in 1994 and 1997, respectively.
He worked as a Compiler Consultant for Conexant Semiconductor Systems and Mindspeed Technologies from 1997 to 1999. He is currently an Associate Professor at the University of Campinas, Brazil. His main research interests are compiler and computer architecture optimization for embedded processors and reconfigurable computing.
Dr. Araujo was corecipient of best paper awards at the 1996 ACM/IEEE Design Automation Conference, and at the 2003 International Workshop on Software and Compilers for Embedded Systems (SCOPES’03). He was also awarded the 2002 University of Campinas Zeferino Vaz Award for his contributions to Computer Science research and teaching.