Exploring and optimizing partitioning of large designs for multi-FPGA based prototyping platforms
Article in Computing · November 2020

DOI: 10.1007/s00607-020-00834-5

Exploring and Optimizing Partitioning of Large Designs for Multi-FPGA based Prototyping Platforms
Umer Farooq · Bander A Alzahrani
Received: date / Accepted: date
Abstract Recently, multi-FPGA platforms have become a popular choice for prototyping
complex digital systems, because they offer unique advantages over other pre-silicon
testing techniques, such as high frequency and a real-world testing experience.
However, one of the challenges faced by multi-FPGA prototyping is the need for an
efficient back end flow. Partitioning is a key part of the back end flow of multi-FPGA
systems and it directly affects the quality of the final prototyped design. In this
work, we explore two partitioning approaches: a multilevel approach and a hierarchical
approach. For experimentation, we use a suite of fourteen large benchmarks.
Experimental results reveal that the multilevel approach gives 12.5% better frequency
results for mono-cluster benchmarks, while the hierarchical approach gives 13% better
results for multi-cluster benchmarks. Furthermore, the hierarchical approach requires,
on average, 60% less execution time than the multilevel partitioning approach.
Keywords Partitioning · Multi-FPGA systems · Prototyping
1 Introduction
Modern System on Chip (SoC) designs have huge computation capability and
are enormously complex to design. Moreover, shrinking product life cycles and
time-to-market pressures increase the need for an efficient, fault-free design
process [1] [2], because a faulty or inefficient design can be extremely costly [3] [4].
U. Farooq
Electrical and Computer Engineering Department
Dhofar University, Salalah, Oman
E-mail: ufarooq@du.edu.om
B. Alzahrani
Faculty of Computing & Information Technology
King Abdulaziz University, Jeddah, Saudi Arabia
E-mail: baalzahrani@kau.edu.sa
In this regard, FPGA-based prototyping offers a good option for complete
design-to-silicon system verification. FPGA-based prototyping is a pre-silicon verification
technique that offers better speed than simulation-based verification [5].
Simulation-based solutions are cost-effective but they are very slow and offer only
an abstract-level view of the system. Although emulation-based pre-silicon verification
gives good speed, a unique feature of FPGA-based prototyping is that it gives the user
a real-world testing and troubleshooting experience.
A less complex Application Specific Integrated Circuit (ASIC) can be
prototyped on a single FPGA, as modern FPGAs are quite capable and have huge
logic capacity. However, as the complexity of the system under consideration grows,
the capability of even the most modern FPGAs becomes insufficient to handle the
resource and I/O requirements of the ASIC. For such scenarios, multi-FPGA platforms
are required because the gap between FPGA capability and ASIC requirements is
huge [6] and with every new processing technology, it is becoming increasingly
difficult to bridge this gap. Normally, the number of FPGAs required to prototype a
design depends upon the complexity of the design under consideration and this number
may vary from a few FPGAs to a couple of dozen FPGAs [7] [8]. The prototyping
of complex ASIC designs using multi-FPGA platforms usually follows a complex
back end flow that involves several optimization steps. The core objective of this back
end flow is to optimize the frequency and the execution speed of the design under
consideration. The back end flow starts with the RTL description of the design. The
design is first synthesized and next partitioned using a partitioning algorithm. After
partitioning, the inter-FPGA routing of the design is performed. Finally, the flow
culminates in the intra-FPGA placement and routing of the design.
Partitioning is one of the most critical steps of the multi-FPGA prototyping flow. In
this step, based on the number of FPGAs on the multi-FPGA board, the design under
consideration is divided into multiple parts. Because of the several optimization
constraints, finding an optimal partitioning solution is an NP-hard problem [9]. When
we consider the partitioning problem from the multi-FPGA prototyping perspective, several
constraints are associated with the partitioning of a complex design. Two of the
principal objectives of a partitioning tool are to respect the logic capacity of the target
FPGA architecture and to keep the communication between different partitions
as small as possible. Thanks to improved design processes and better processing
technology, both the logic capacity and the number of I/Os of modern generations of
FPGAs have increased. However, the logic capacity of FPGAs has increased at a much
higher rate than the number of I/Os. This
trend has led to an increased logic to I/O ratio in newer generations of FPGAs and it
has become particularly difficult for a partitioning tool to minimize the inter-partition
communication. Thus, the number of signals (also termed cut-nets) traversing
different partitions is greater than the number of available I/Os between different FPGAs
of a multi-FPGA board. These signals are routed between different FPGAs in the next
step of the prototyping flow, which is called inter-FPGA routing.
Inter-FPGA routing follows the partitioning of the design under consideration and also
plays an important role in its overall optimization. In
this step, the cut-nets of the partitioned design are routed on the tracks of the multi-FPGA
board in a time division multiplexed (TDM) manner. So, a higher number of cut-nets will
lead to a higher multiplexing ratio, which in turn will reduce the execution
speed of the final prototyped design. The results produced by the routing tool are
directly linked with the quality of the preceding partitioning process. Even a highly
efficient routing tool cannot overturn the poor results of a partitioning tool. An
in-depth discussion on the quality of the partitioning tool and its impact on the frequency of
the final prototyped design is presented in the subsequent sections of the paper.
It is evident from the discussion presented above that partitioning plays a very important
role in the multi-FPGA prototyping flow. In this work, we propose and explore two
partitioning approaches, namely the hierarchical and the multilevel partitioning approach. For
this purpose, we use an open source flow. This flow gives complete experience for the
prototyping of multi-FPGA systems. However, our focus in this work remains on the
partitioning aspect of the flow. The flow proposed in this work starts with the generation
of large and complex benchmarks. These benchmarks are generated using a generic
academic tool that can generate both flat and hierarchical benchmarks. The generated
benchmarks are next logically synthesized using an open source tool by VERIFIC [10].
This tool not only performs standard cell synthesis, it also gives complete information
about the interconnect of the design under consideration. After synthesis, we perform
partitioning of the design. In order to produce the best partitioning results, we strive to
exploit the inherent interconnect patterns of the design under consideration. For this
purpose, we explore two different partitioning approaches. One approach exploits the
hierarchical interconnect which is inherent in certain designs. We call this
approach the hierarchical partitioning approach. The second partitioning approach that
we use in this work performs partitioning using multilevel clustering and refinement.
We call this approach the multilevel partitioning approach. Both proposed
approaches are novel in the sense that they have been specifically customized in
the context of prototyping for multi-FPGA systems. An in-depth discussion on both
approaches is given in Section 4 of the paper. After partitioning, the inter-FPGA routing
of the design is performed. After routing, system frequency results are obtained for the
two partitioning approaches and a thorough analysis of those results is also presented.
Here, only a brief overview of different steps involved in the proposed back end flow
is given. Detailed discussion on these steps is given in the subsequent sections of the
paper. The main contributions of this paper are summarized as follows:
– Development of an open source, generic, back end flow for prototyping of multi-
FPGA systems. All the steps of the proposed flow either use open source tools or
the tools that are free for academia.
– Development and implementation of two partitioning approaches for exploration
and optimization of different designs in multi-FPGA prototyping.
– Extensive experimentation and thorough analysis of results obtained through the
proposed back end flow.
In the rest of the paper, Section 2 discusses the background and related work and also
elaborates the contribution of this work. Section 3 then gives a detailed discussion on
the proposed flow where comprehensive details of all the steps of back end flow are
presented. Section 4 presents an in-depth discussion of the two partitioning approaches
that we propose and explore in this work. Section 5 gives a thorough analysis of the
results obtained through experimentation and Section 6 concludes the paper with
a discussion of future work.
2 Background and Related Work
The discussion presented in Section 1 shows that partitioning plays a very important
role in determining the system frequency of the final prototyped design. Partitioning is a
well formulated research problem and researchers have been active in this area since
the 1970s. Many techniques have been proposed in the past to find efficient solutions to
partitioning problems. Mainly, there are three different types of techniques which are
used to find the solution of a partitioning problem.
1. Analytical partitioning technique [11] [12] is commonly utilized where the objective
function is to optimize the quadratic length of the critical path. Although
minimizing the quadratic length of the critical path is only an indirect measure of
the quality of the partitioning solution, its main advantage is that the objective
function can be optimized in very little time. This kind of approach is particularly
suitable for very large problems. A quadratic function, however, does not give the best
possible solution and it is often followed by several local tweaks.
2. Simulated annealing based placement [13] [14] is another technique. It borrows the
concept of annealing, in which molten metal is cooled down gradually, to produce
high quality solutions. The objective function of this approach is to minimize the
overall Manhattan distance between all the connected instances. This approach
is quite effective in finding a reasonably good solution in a small amount of time.
This type of technique is commonly used for island style architectures. However,
simulated annealing is classified more as a placement technique than
as a partitioning technique.
3. Min-Cut based partitioning approach [15] [16] is generally suitable for partitioning
of complex designs. The min-cut partitioner recursively partitions the design under
consideration. The aim of the partitioner is to minimize the cut-nets of the design
by merging the connected instances in a single cluster. Because of its ability to
find a good solution in a short time, in this work, we mainly consider min-cut based
partitioning algorithms. Further discussion on different min-cut based partitioning
algorithms is given next.
In the min-cut based partitioning approach, the design is represented as a hypergraph and
the connections between different instances of the design are represented as hyperedges.
The main objective of the partitioner is to minimize the number of cut hyperedges
(connections that traverse more than one partition) in the graph. In this regard, authors
in [17] present the Kernighan-Lin bi-partitioning algorithm. Authors in [18] present the FM
partitioning algorithm that uses a recursive bi-partitioning approach to find a solution
to a partitioning problem. Similarly, authors in [19] present another bi-partitioning
algorithm that promises to give optimal results for small graphs. However, this
algorithm gives either sub-optimal results or no results for large to very large hypergraphs.
The aforementioned three algorithms are the main partitioning algorithms used for
digital systems and the research work done later is mainly an extension of one of
these algorithms. Among these algorithms, the Fiduccia-Mattheyses (FM) heuristic [18]
is known to produce the best results. It is an iterative partitioning algorithm that
minimizes the cut-net count over multiple iterations. In each iteration, the cut-net cost
is reduced by maximizing the move gain that is associated with each move of an
instance from one cluster to another. In the FM algorithm, a move can have either a
positive or a negative gain. After each move, the gains of all the associated instances
are also updated. This keeps the complexity of each iteration linear and allows a good
solution to be found in a short time.
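The move gain used by FM can be illustrated with a small example. The sketch below is a minimal illustration and not the implementation used in this work: it computes the cut size of a bipartition and the gain of moving one instance to the other side, i.e. the number of nets that become uncut minus the number of nets that become cut. The names `cut_size` and `move_gain` and the toy netlist are assumptions made for illustration only.

```python
def cut_size(nets, side):
    """Number of nets that have instances on both sides of the bipartition."""
    return sum(1 for net in nets if len({side[c] for c in net}) > 1)


def move_gain(cell, nets, side):
    """FM-style gain of moving `cell` to the other side.

    +1 for every net that becomes uncut, -1 for every net that becomes cut.
    """
    gain = 0
    for net in nets:
        if cell not in net:
            continue
        same = sum(1 for c in net if side[c] == side[cell])  # includes cell
        other = len(net) - same
        if same == 1 and other > 0:
            gain += 1   # cell is alone on its side: the net becomes uncut
        elif other == 0 and len(net) > 1:
            gain -= 1   # net lies entirely on cell's side: the net becomes cut
    return gain


# Hypothetical toy netlist: four instances, four nets, two partitions (0 and 1).
nets = [("a", "c"), ("a", "d"), ("a", "b"), ("b", "c", "d")]
side = {"a": 0, "b": 0, "c": 1, "d": 1}

gains = {cell: move_gain(cell, nets, side) for cell in side}
print(cut_size(nets, side), gains)
```

A full FM pass would additionally keep the gains in bucket lists, lock each moved instance, and enforce a balance constraint; those parts are omitted here.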
The min-cut based partitioning approach can be applied either in a flat manner that
finds a quick solution or it can be applied using a multilevel approach. However, the
computation time of the flat partitioning approach increases exponentially with the
complexity of the design. For designs of moderate to high complexity, the multilevel
hypergraph partitioning approach is known to produce the best results [20] [21] [22]. The
multilevel partitioning approach comprises three phases, namely clustering, top level
partitioning and uncoarsening. The main advantage of multilevel partitioning over
flat partitioners is its ability to search the solution space more effectively by spending
comparatively more effort on smaller coarsened hypergraphs. With a good coarsening
algorithm, a good partitioning of the coarsened hypergraph correlates strongly with a
good partitioning of the original hypergraph after refinement. Therefore, a thorough
search at the top of the multilevel hierarchy is worthwhile because it is relatively
inexpensive when compared to flat partitioning of the original hypergraph, but can still
preserve most of the possible improvement. The result is an algorithmic framework with
both improved run time and solution quality over a completely flat approach. The
multilevel partitioning approach was successfully demonstrated by the hMetis program [22].
This tool mainly uses the FM algorithm for partitioning and it also introduced several new
heuristics that reportedly produced high-quality results. However, this tool partitions
designs with homogeneous instances only and cannot partition heterogeneous instances.
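To make the clustering phase concrete, the following sketch performs a single coarsening level on a small hypergraph. It is only an illustration under simple assumptions: each instance is greedily merged with the unmatched neighbour with which it shares the most nets, and nets that end up inside one cluster are dropped. This matching rule and the function name `coarsen` are assumptions for illustration and are not the clustering scheme of hMetis or of the flow proposed here.

```python
from collections import defaultdict


def coarsen(nets, cells):
    """One clustering level: pair each cell with the unmatched neighbour
    that shares the most nets with it, then rebuild the coarse nets."""
    shared = defaultdict(int)
    for net in nets:
        for u in net:
            for v in net:
                if u != v:
                    shared[(u, v)] += 1

    cluster = {}  # cell -> representative of its cluster
    for u in cells:
        if u in cluster:
            continue
        candidates = [v for v in cells
                      if v not in cluster and v != u and shared[(u, v)] > 0]
        cluster[u] = u
        if candidates:
            best = max(candidates, key=lambda v: shared[(u, v)])
            cluster[best] = u

    coarse_nets = set()
    for net in nets:
        merged = frozenset(cluster[c] for c in net)
        if len(merged) > 1:  # nets internal to one cluster disappear
            coarse_nets.add(merged)
    return cluster, coarse_nets


# Hypothetical toy hypergraph.
cells = ["a", "b", "c", "d"]
nets = [("a", "b"), ("a", "b", "c"), ("c", "d"), ("b", "c")]
cluster, coarse_nets = coarsen(nets, cells)
print(cluster, coarse_nets)
```

In a multilevel flow this step would be repeated until the hypergraph is small, the coarsest hypergraph would be partitioned, and the result would be projected back and refined level by level during uncoarsening.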
Above, we have presented a detailed discussion on the partitioning problem from a
generic perspective. When we look at partitioning solutions in the context of multi-
FPGA prototyping, different tools/work exist commercially as well as in academia.
Commercially, different tools exist which provide either partial or complete prototyp-
ing flow for multi-FPGA systems. For example Synopsys’ Protocompiler [23] gives
a complete back end prototyping flow for multi-FPGA systems. However, this tool
is accompanied by HAPS [24] hardware platform of Synopsys and works only for
Synopsys specific platforms. Then, there are AUSPY and WASGA [25] partitioning
tools. These tools are platform independent and are not accompanied by a specific
hardware platform. However, these tools give only partial partitioning solution and do
not provide a complete prototyping flow. Another partitioning tool by Synopsys, called
CERTIFY [26], was available as a partitioning solution for multi-FPGA systems
until recently. It was generic in nature and could be used with any hardware
platform. However, it was recently discontinued and replaced by Protocompiler. Apart
from the aforementioned tools, there are several other solutions [27] [28] [29] that are
provided by commercial vendors. Just like multi-FPGA prototyping, these solutions are
mainly used for pre-silicon verification. However, these solutions are either simulation
or emulation based. Moreover, these solutions are very costly and they do not fall in
the domain of this paper. The discussion on aforementioned commercially available
partitioning solutions indicates that either these solutions are platform dependent
or they offer only partial solution. Moreover, all of them are proprietary tools with
thousands of dollars in annual subscription fees.
On the other hand, if we look at state-of-the-art academic solutions of partitioning
from multi-FPGA prototyping perspective, sufficient work is not available. Authors
in [30] [31] propose a new multilevel hierarchical FPGA architecture and they propose
to use a multilevel partitioning tool for the partitioning of the design. However, their
proposed solution can handle homogeneous blocks and gives partitioning solution for
a single FPGA only. Similarly, authors in [32] explore the partitioning problem for
multi-FPGA systems. They perform comparison between solutions obtained through
commercial WASGA and CERTIFY partitioning tools only and do not give any aca-
demic solution. Also, authors in [33] [34] explore prototyping of multi-FPGA systems.
However, for partitioning, they use commercial tool called CERTIFY [26] by Synop-
sys. The main focus of their flow remains the inter-FPGA routing issue of the back end
flow. Furthermore, authors in [35] [36] also address the back end flow for multi-FPGA
systems, but their focus remains mainly the inter-FPGA routing as well.
In this work, we not only address the routing issue but also focus on the partitioning
problem, because even a highly efficient routing tool cannot improve the frequency of
the final prototyped design if it is preceded by an inefficient partitioning process. In order
to make the partitioning process efficient, we put particular emphasis on knowledge
of the interconnect of the design under consideration. We extract the information on the
interconnect of the design through an open source tool called VERIFIC [10]. This
information matters because different types of designs exhibit different interconnect
patterns. Some of them are hierarchical in nature while others have a rather flat
interconnect. So, partitioning all designs with a single approach is not justified and it
may eventually lead to poor frequency results. For this reason, in our back end flow, we
propose and explore two different partitioning approaches. The first approach is called
the hierarchical partitioning approach and it uses a hierarchical partitioning algorithm.
This approach is more useful for designs exhibiting hierarchical interconnect. The second
approach is based on a multilevel partitioning algorithm and it is more suitable for rather
flat designs. Details about the two proposed approaches are given in Section 4. The
two partitioning approaches coupled with an efficient inter-FPGA routing tool give
the best frequency results for the partitioned design.
To the best of our knowledge, there is not enough academic work in the state of the art on
multi-FPGA prototyping systems from the partitioning perspective. As discussed before,
some work exists that either uses commercial tools or performs a comparison between
the partitioning results of commercial tools. The unique contribution of this work is that
we extract the information on the interconnect of the design through VERIFIC tool
which is free for academia. Next, we apply one of two partitioning approaches that
best exploits the interconnect in terms of minimizing the cut-nets of the partitioned
design. Both the proposed partitioning approaches used in this work are either based
on academic tools or the customized versions of those tools. So, through this work,
we strive to provide a platform for academia in multi-FPGA prototyping and advance
the research in the important domain of pre-silicon verification through multi-FPGA
prototyping.
[Figure 1 flowchart: Benchmark → Logic Synthesis → Hierarchical Partitioner or Multilevel Partitioner → Routing → Netlist 1 … Netlist N → Synthesis, Place & Route 1 … N → Bitstream 1 … Bitstream N]
Fig. 1: Multi-FPGA Prototyping Flow
3 Prototyping Flow
In this paper, we propose a prototyping flow for multi-FPGA based systems. In this
flow, we explore two different partitioning approaches and analyze their effect on the
system frequency of final prototyped design. An overview of the complete flow is
shown in Figure 1. It can be seen from this figure that the flow starts with the logic
synthesis of the benchmark under consideration. After passing through various steps,
the flow terminates at the bitstream generation of the design. Further discussion on the
steps of the flow is given next.
Fig. 2: An Example of Mono-Core MPSoC Architecture
Fig. 3: An Example of Multi-Core MPSoC Architecture
3.1 Benchmark Generation
For any exploration flow, benchmarks are a fundamental requirement. For a multi-FPGA
prototyping flow, this requirement is even more pertinent, as complex benchmarks
that mimic real-life applications are essential to test the capability of
the tools of a prototyping flow. Researchers in the past [37] [38] [39] have used
different sets of benchmarks for different types of exploration environments. But these
benchmarks are either too small to pose a real challenge to the exploration tools or
they are synthetic in nature and lack resemblance to real-life applications. In this
work, we use benchmarks that are generated by the DSX [40] academic tool. Using this
tool, we can generate mono-core and multi-core MPSoC architectures. A mono-core
MPSoC architecture contains components like UART, RAM, multiple FIFOs, and
co-processors. These components are connected with each other through a crossbar
architecture. An example of a mono-core MPSoC architecture is shown in Figure 2.

In a multi-core MPSoC architecture, we have clusters of mono-core MPSoCs that are
connected to each other using a mesh-based NoC interconnect [41]. As compared to
multi-core MPSoCs, mono-core MPSoCs have lower complexity and flat bus-based
interconnect. Multi-core MPSoCs, on the other hand, have higher complexity and
hierarchical interconnect. An example of multi-core MPSoC architecture is shown in
Figure 3.
3.2 Synthesis
It can be seen from Figure 1 that the benchmarks generated through the DSX tool
are first logically synthesized. During synthesis, the design is logically optimized.
For logic synthesis, we use an open source tool by VERIFIC [10], which
is free for non-commercial academic purposes. When a benchmark is given to this
tool, it parses the whole design through a very powerful parser. The parser of this
tool builds a comprehensive database of all the components of the design and gives
complete information about the interconnect of different components of the design.
This information is very useful as it is used by the hierarchical partitioner later in the
flow. The tool also performs transformation of the design into the standard logic gate
format. We use this tool to keep our flow open source and generic in nature.
3.3 Partitioning
After synthesis, the partitioning of the design under consideration is performed. Since
the designs are quite large and complex, a single FPGA cannot satisfy their logic and
I/O resource requirements; thus, they have to be divided into multiple partitions.
As discussed in Section 1, partitioning plays a very important role in the final
execution speed of the design under consideration. Normally, the number of physical
connections between different partitions is quite small while the number of cut-nets
that span these partitions is quite large. So, in the subsequent process, these cut-nets have
to share the physical resources between different FPGAs in a time multiplexed manner.
Eventually, a larger number of cut-nets leads to a larger multiplexer, hence increasing the
delay and reducing the overall speed. Thus, the main goal of any partitioner is to keep
the number of cut-nets as small as possible. Another constraint that a partitioner has
to deal with is the logic capacity of the target FPGA architecture. A partitioner must
satisfy this constraint while performing partitioning. These two combined constraints
make partitioning an NP-hard problem [9] for large and complex designs, so finding an
optimal solution is not practical.
Figure 4 summarizes the partitioning problem. Figure 4a shows two partitions where
the number of cut-nets is 2. Figure 4b shows a partitioning solution where the
number of cut-nets is reduced from 2 to 1. But in order to do that, we have to move
a large block of combinational logic from partition 2 to partition 1, and the enlarged
logic may not fit in the logic capacity of partition 1. So, a partitioner always has to find
a trade-off between the logic capacity and the cut-net constraint. To find an efficient
partitioning solution, the partitioner should know and exploit the interconnect of the
design under consideration. For this purpose, in this work, we explore two different
partitioning approaches. The details of these approaches are given in Section 4 of this
paper.
Fig. 4: (a) Partitioning Solution with 2 cut-nets; (b) Partitioning Solution with 1 cut-net
[Figure 5 flowchart nodes: Routing Constraints, Board Description, Trace Assignment, Generate Routing Graph, Compute Initial MUX Ratio, Group Cut-Nets, Routing, Optimal Routing? (Yes/No), Adjust MUX Ratio, Estimate System Frequency, Optimized Routing Solution]
Fig. 5: An Overview of Inter-FPGA Routing Flow
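The trade-off illustrated by Figure 4 can be reproduced numerically. The sketch below is a minimal illustration with made-up block sizes and a made-up capacity: it counts the cut-nets of a given assignment and checks the logic capacity of every partition. The helper names `cut_nets` and `within_capacity`, the block names and all numbers are hypothetical.

```python
def cut_nets(nets, part):
    """Nets whose blocks are spread over more than one partition."""
    return [net for net in nets if len({part[b] for b in net}) > 1]


def within_capacity(part, size, capacity):
    """True if the summed block size of every partition fits the capacity."""
    used = {}
    for block, p in part.items():
        used[p] = used.get(p, 0) + size[block]
    return all(u <= capacity for u in used.values())


# Hypothetical design: block C is the large block that Figure 4b moves.
size = {"A": 30, "B": 30, "C": 50, "D": 20}
nets = [("A", "C"), ("B", "C"), ("C", "D")]
capacity = 100

before = {"A": 1, "B": 1, "C": 2, "D": 2}   # as in Figure 4a
after = {"A": 1, "B": 1, "C": 1, "D": 2}    # C moved to partition 1

for name, part in (("before", before), ("after", after)):
    print(name, len(cut_nets(nets, part)), within_capacity(part, size, capacity))
```

With these assumed numbers, moving block C halves the cut-net count but overfills partition 1, which is exactly the situation a partitioner has to reject or repair.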
3.4 Routing
Once the partitioning is completed, the routing of the design under consideration is
performed on the multi-FPGA board. The aim of the partitioning approaches discussed in
Section 3.3 is to minimize the number of cut-nets. However, as discussed in Section 1,
the number of cut-nets is generally greater than the available I/O resources of the FPGAs.
This is because of the higher logic capacity and relatively fewer I/Os of newer generations
of FPGAs. Therefore, we have to route the cut-nets in a time division multiplexed manner. A
simplified overview of the flow used to perform inter-FPGA routing is shown in
Figure 5. It can be seen from this figure that the routing flow starts with the routing
constraints, the board description (user generated) and the trace assignment file (generated
by partitioning). Once these files are given to the routing tool, the routing graph is
generated. The FPGA I/Os are represented in this graph as a set of vertices V and
the connections between these vertices are represented as edges E. Together, these
vertices and edges form a directed graph G(V, E). The graph is
later used by the routing algorithm to route cut-nets on the physical resources of
the FPGA board. Once the routing graph is generated, the initial mux ratio is computed
as the ratio of the maximum number of cut-nets to the number of physical wires between
two partitions. Next, the cut-nets are grouped as per the mux ratio value and routing is
performed. For inter-FPGA routing, in this work, we use the Pathfinder [42] routing
algorithm. This is a congestion-driven, negotiation-based routing algorithm. Pathfinder
routes the cut-nets one by one and tries to find a conflict-free solution
through a negotiation-based approach. To reach a conflict-free solution, it uses an iterative
approach in which the cost of congested nodes is gradually increased to avoid
congestion in later iterations. This algorithm routes all the cut-nets of the design in a
conflict-free manner. Next, the mux ratio is optimized using a binary search algorithm. Each
time a successful routing is achieved, the mux ratio is adjusted according to the binary
search algorithm. The binary search continues until the best mux ratio is
found. This process is also depicted in Figure 5. While searching for the minimum
mux ratio, the routing algorithm also tries to keep the number of hops as small as
possible. This is because both the mux ratio and the number of hops affect
the final system frequency, which is computed at the end of the routing process.
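The mux ratio computation and its binary search can be sketched as follows. This is a minimal illustration under stated assumptions: the initial ratio is taken as the ceiling of cut-nets over physical wires for the most loaded FPGA pair, routability is assumed to be monotone in the mux ratio, and the router itself is replaced by a hypothetical predicate `can_route` that stands in for a complete Pathfinder run. The pair names and all numbers are invented.

```python
import math


def initial_mux_ratio(cut_nets_between, wires_between):
    """Largest ceil(cut-nets / physical wires) over all FPGA pairs."""
    return max(math.ceil(cut_nets_between[pair] / wires_between[pair])
               for pair in cut_nets_between)


def minimise_mux_ratio(lo, hi, can_route):
    """Smallest ratio in [lo, hi] for which can_route(ratio) succeeds.

    Assumes that `hi` is routable and that any ratio above a routable
    ratio is routable as well.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if can_route(mid):
            hi = mid        # success: try a smaller ratio
        else:
            lo = mid + 1    # failure: a larger ratio is needed
    return lo


# Hypothetical board: cut-nets and physical wires per FPGA pair.
cut_nets_between = {("F0", "F1"): 250, ("F1", "F2"): 90}
wires_between = {("F0", "F1"): 40, ("F1", "F2"): 30}

start = initial_mux_ratio(cut_nets_between, wires_between)
# Stand-in for the router: pretend every ratio of 5 or more is routable.
best = minimise_mux_ratio(1, start, lambda ratio: ratio >= 5)
print(start, best)
```

In the actual flow each call to `can_route` would regroup the cut-nets for the candidate ratio and rerun the router, so the logarithmic number of trials of the binary search is what keeps this step affordable.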
3.5 Intra-FPGA synthesis, Placement and Routing
Once the routing is complete, the netlists are generated as shown in Figure 1. These
netlists contain all the information related to the partitions of the design and their
routing. The netlists are next passed to the vendor specific tool to perform
intra-FPGA synthesis, placement, and routing of all the partitions. After successful
completion of this step, the bitstreams of the partitions are generated, which can finally
be loaded into the respective FPGAs to complete the prototyping flow. Loading
the bitstreams allows in-circuit verification and debugging
of the partitioned design. Moreover, it also gives real-world, cycle-accurate and
bit-accurate execution information of the partitioned design.
A comprehensive overview of all the steps of prototyping flow is given in this section.
In the next section, a further detailed discussion is provided on the two partitioning
approaches that are proposed and explored in this work.
4 Multi-FPGA Partitioning Approaches
As discussed in Sections 1 and 3.3, partitioning plays a fundamental role in the
quality of the final prototyped design. It is at this step that the number of cut-nets of
a partitioned design is determined. A cut-net can connect either a single source to a
single destination (bi-terminal cut-net) or a single source to multiple
destinations (multi-terminal cut-net). These cut-nets later determine the mux ratio
which eventually decides the execution speed of the design. It is evident that there is
a direct relation between the number of cut-nets obtained after partitioning and the
execution frequency of the final design. The primary objective of any partitioner is to
minimize the number of cut-nets while also satisfying the logic resource constraint
of the target architecture. In order to best satisfy these constraints, in this work, we
explore two different partitioning approaches; one is termed the hierarchical partitioning
approach while the other is called the multilevel partitioning approach. A detailed
discussion of the two partitioning approaches is provided next.
[Figure 6 flowchart nodes: Start, Choose N instances, Unassigned instances?, Instance list for each part, Partitions where instance can go?, Choose where instance is most connected, Instance breakable?, Get lower level and remake instance list, Partitioning impossible, End]
Fig. 6: Hierarchical Partitioning Flow
4.1 Hierarchical Partitioning Approach
It is discussed in Section 3.2 that we use VERIFIC to perform logic synthesis of the
design under consideration. While performing logic synthesis, VERIFIC parses the
whole design and it gives complete information about the interconnect of the design.
In the hierarchical partitioning approach, we extract information about the hierarchy of
the design from the VERIFIC parser. In the next step, based on the required number of
partitions and the specified capacity of each partition, a hierarchical partitioning algorithm
is applied to the design. The flow of this algorithm is given in Figure 6. It can be
seen from this figure that initially all the instances of the synthesized design are
marked as unassigned. Next, on the basis of connectivity, these instances are assigned
into different partitions iteratively. The partitioning algorithm adopts a top-down
partitioning approach. In each iteration, N unassigned instances are chosen. Then,
based on the hierarchical information, these instances are assigned to the partition
to which they are most connected, provided that the logic capacity constraint of the
target architecture is not violated. In case of violation, the algorithm moves one level
down the hierarchy, replaces the instance with its smaller sub-instances and tries to
assign them based on their connectivity. If an instance cannot be broken down any
further, partitioning is reported as impossible. This process continues until all the
instances are assigned to partitions. The algorithm combines a top-down partitioning
approach with the hierarchical interconnect information of the design under
consideration to minimize the cut-net count, and as a result it produces a solution in a
very short time. This kind of approach is particularly useful for designs that are
inherently hierarchical in nature. The pseudo code of this algorithm is given in
Algorithm 1 and the steps performed are summarized as follows:
1. Take the parsed instance list, the number of partitions and partition capacity as
input.
2. Take N unassigned instances and assign them to M partitions based on their
connectivity. While assigning, make sure that the partition capacity is not violated and
that the cut-net count is kept to a minimum.
3. Mark the N instances as assigned and go back to step 2.
4. Terminate when all the instances are assigned.
Get hierarchy;
Get partitions;
Get capacity;
while unassigned instances do
    instances = N;
    find(max connection);
    if capacity > N then
        assign instances(N, M);
        assigned = N;
    else if instance breakable then
        level = level - 1;
    else
        Partitioning impossible;
    end
end
Algorithm 1: Pseudo-code for the Hierarchical Algorithm
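The flow of Figure 6 and Algorithm 1 can be sketched in runnable form as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the `Instance` class, the function `hierarchical_partition`, the representation of nets as sets of leaf names, the choice to handle one instance per iteration (N = 1) and the largest-first ordering are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Instance:
    """Hypothetical hierarchy node: a leaf cell or a module with children."""
    name: str
    size: int = 0                      # logic resources of a leaf
    children: List["Instance"] = field(default_factory=list)

    def total_size(self) -> int:
        return self.size + sum(c.total_size() for c in self.children)

    def leaves(self) -> Set[str]:
        if not self.children:
            return {self.name}
        return set().union(*(c.leaves() for c in self.children))


def hierarchical_partition(top: Instance, nets: List[Set[str]],
                           num_parts: int, capacity: int) -> Dict[str, int]:
    """Greedy top-down assignment; returns {leaf name: partition index}."""
    used = [0] * num_parts
    members: List[Set[str]] = [set() for _ in range(num_parts)]
    # Assumption: largest instances first, so big blocks are placed or broken early.
    pending = sorted(top.children or [top], key=Instance.total_size, reverse=True)
    while pending:
        inst = pending.pop(0)
        leaves, size = inst.leaves(), inst.total_size()
        fits = [p for p in range(num_parts) if used[p] + size <= capacity]
        if fits:
            # Connectivity = nets shared with leaves already in the partition.
            def shared(p: int) -> int:
                return sum(1 for n in nets if n & leaves and n & members[p])
            best = max(fits, key=lambda p: (shared(p), -used[p]))
            members[best] |= leaves
            used[best] += size
        elif inst.children:
            # Instance is breakable: go one level down the hierarchy.
            pending = sorted(pending + inst.children,
                             key=Instance.total_size, reverse=True)
        else:
            raise ValueError(f"partitioning impossible: {inst.name} does not fit")
    return {leaf: p for p, group in enumerate(members) for leaf in group}


top = Instance("top", children=[
    Instance("A", children=[Instance("a1", 2), Instance("a2", 2)]),
    Instance("B", children=[Instance("b1", 2), Instance("b2", 2)]),
])
nets = [{"a1", "a2"}, {"b1", "b2"}, {"a2", "b1"}]
assignment = hierarchical_partition(top, nets, num_parts=2, capacity=4)
print(assignment)
```

In this toy design each module fits into one partition as a whole, so the hierarchy is never broken and only the net between `a2` and `b1` is cut.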
The aforementioned steps are performed iteratively, where connectivity among the
instances is given top priority and the partition size is always respected. As shown
in Section 5, this approach is better suited to designs that have an inherently
hierarchical interconnect architecture. For flat designs, a more sophisticated
approach is required, which is described next.

Fig. 7: An Overview of Multi-level Clustering [diagram: cells C1 to C7 are merged step by step into larger clusters, with nets N5, N3 and N4 absorbed in successive moves]
4.2 Multilevel Partitioning Approach
Contrary to the hierarchical approach, which exploits the hierarchy of the design, the
multilevel approach applies clustering and refinement over multiple levels. In
this approach, the instances of the benchmark are first represented in the form of a
hypergraph. Initially, the hypergraph is quite complex as it contains a large number of
instances, which makes it difficult to partition directly. Therefore, the hypergraph is
reduced by merging smaller instances together. This process is called clustering and it
is repeated over multiple levels until the number of clusters is reduced to a few dozen,
at which point the hypergraph is small enough for partitioning and refinement to
become easy. An example of this multilevel clustering process is given in Figure 7, where
a large hypergraph is reduced to a smaller one after multiple iterations of
clustering.
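One level of such clustering can be sketched as follows. The text does not state which merging criterion the tool uses, so the pairing rule here (merge each vertex with the free neighbour it shares the most nets with) is an assumption, as are the function name `cluster_once` and the net representation.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def cluster_once(vertices: List[str], nets: List[FrozenSet[str]]
                 ) -> Tuple[Dict[str, str], List[FrozenSet[str]]]:
    """One coarsening level: pair each vertex with its most connected free neighbour.

    Returns (vertex -> cluster name, coarse nets). A net whose vertices all
    fall into one cluster is absorbed and dropped; identical coarse nets
    are kept only once.
    """
    cluster_of: Dict[str, str] = {}
    for v in vertices:
        if v in cluster_of:
            continue
        # Count nets shared with each still-unclustered neighbour.
        shared = Counter(u for n in nets if v in n
                         for u in n if u != v and u not in cluster_of)
        if shared:
            partner = max(sorted(shared), key=lambda u: shared[u])
            cluster_of[v] = cluster_of[partner] = v + partner
        else:
            cluster_of[v] = v          # no free neighbour: stays alone
    coarse = {frozenset(cluster_of[v] for v in n) for n in nets}
    return cluster_of, [n for n in coarse if len(n) > 1]


vertices = ["C1", "C2", "C3", "C4"]
nets = [frozenset({"C1", "C2"}), frozenset({"C3", "C4"}), frozenset({"C2", "C3"})]
cluster_of, coarse_nets = cluster_once(vertices, nets)
print(cluster_of)
print(coarse_nets)
```

In this toy example the four cells collapse into the clusters `C1C2` and `C3C4`, two nets are absorbed, and a single net remains between the two clusters. Repeating the step on the coarse hypergraph gives the multilevel hierarchy.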
Once the clustering process is complete, the refinement of the graph is done and the
graph is expanded in a reverse manner. During the refinement process, the instances
are moved between different clusters. The objective of the refinement process is to
minimize the overall cut-net count of the design. Each time a block (i.e. an instance) is
moved from one cluster to another, the change in the total cut-net count is computed.
If the change is negative (i.e. the total number of cut-nets is reduced), the move is
accepted; otherwise it is rejected. This is a greedy approach which may get trapped in
a local minimum. To avoid this situation, moves that increase the cut-net count are
also accepted depending upon the level of refinement: such moves are accepted at
higher levels, but not when the refinement is being performed at lower levels. The
refinement process continues until the bottom level of the graph is reached. Upon
reaching this point, the partitioning process is complete and we have the final
partitioned result. An overview of the refinement process is shown in Figure 8, where
only 2-way refinement is shown. However, the proposed multilevel partitioning tool is
generic in nature and is able to perform N-way partitioning.

Fig. 8: An Overview of Multilevel Refinement [diagram: coarsening phase, initial partitioning phase, uncoarsening and refinement phase]
The multilevel partitioning tool uses the same approach as presented in [43], where
clustering is performed first, followed by the initial partitioning and refinement
phases. However, the work presented in [43] performs partitioning of homogeneous
instances only. In contrast, the proposed tool can handle heterogeneous instances
and also takes the maximum partition size into account.
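The move-acceptance rule of the refinement phase, together with the partition size limit, can be illustrated with the sketch below. It is not the FM implementation used by the tool: the function names, the `uphill_level` threshold, the rule that only moves worsening the count by exactly one are tolerated at coarse levels, and the bookkeeping of the best assignment seen are all assumptions introduced for this example.

```python
from typing import Dict, List, Set


def cut_count(nets: List[Set[str]], part_of: Dict[str, int]) -> int:
    """Number of nets that span more than one partition."""
    return sum(1 for n in nets if len({part_of[v] for v in n}) > 1)


def refine(nets: List[Set[str]], part_of: Dict[str, int], sizes: Dict[str, int],
           capacity: int, num_parts: int, level: int,
           uphill_level: int = 2) -> Dict[str, int]:
    """One greedy refinement pass over all blocks at the given level.

    A move is kept when it lowers the cut-net count. Assumption: at coarse
    levels (level >= uphill_level) a move that raises the count by one is
    also kept, and the best assignment seen is what gets returned.
    """
    part_of = dict(part_of)            # work on a copy of the input
    used = [0] * num_parts
    for v, p in part_of.items():
        used[p] += sizes[v]
    best, best_cut = dict(part_of), cut_count(nets, part_of)
    for v in sorted(part_of):
        home = part_of[v]
        for target in range(num_parts):
            if target == home or used[target] + sizes[v] > capacity:
                continue               # same partition or capacity violated
            before = cut_count(nets, part_of)
            part_of[v] = target
            delta = cut_count(nets, part_of) - before
            if delta < 0 or (delta == 1 and level >= uphill_level):
                used[home] -= sizes[v]
                used[target] += sizes[v]
                if cut_count(nets, part_of) < best_cut:
                    best, best_cut = dict(part_of), cut_count(nets, part_of)
                break
            part_of[v] = home          # reject: undo the tentative move
    return best


nets = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
start = {"a": 0, "b": 1, "c": 0, "d": 1}
sizes = {v: 1 for v in start}
refined = refine(nets, start, sizes, capacity=3, num_parts=2, level=0)
print(cut_count(nets, start), cut_count(nets, refined))  # 3 1
```

Recomputing the full cut count for every tentative move keeps the sketch short; a production refiner would instead maintain per-block gains incrementally, as FM-style algorithms do.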
Multilevel partitioning is a sophisticated technique and, for flat designs, it
offers better results than the hierarchical approach. However, it requires
significantly more time to produce the partitioning result. Furthermore, the hierarchical
approach gives equal or better results for designs which are purely hierarchical in
nature. The pseudo code of the multilevel algorithm used in this work is shown in
Algorithm 2.
5 Experimentation and analysis
In this section, we present the experimental results that are obtained through the
exploration flow described in Section 3. Initially, an overview of the benchmarks
used in this work is presented and next the results obtained for those benchmarks are
discussed.
5.1 Benchmarks
For experimentation, a set of fourteen complex benchmarks is used. These benchmarks
are generated through the DSX tool described in Section 1.

level = 0;
hierarchy[level] = hypergraph;
min_vertices = 200;
while hierarchy[level].vertex_count() > min_vertices do
    next_level = cluster(hierarchy[level]);
    level = level + 1;
    hierarchy[level] = next_level;
end
partitioning[level] = a random initial solution for top-level hypergraph;
FM(hierarchy[level], partitioning[level]);
while level > 0 do
    level = level - 1;
    partitioning[level] = project(partitioning[level+1], hierarchy[level]);
    FM(hierarchy[level], partitioning[level]);
end
Algorithm 2: Pseudo-code for the Multilevel Partitioning Algorithm

Table 1: Benchmark Description
Sr. No   Benchmark Name   Benchmark Type   No. of Components
1        CPU20            mono-cluster     50460
5        AES              multi-cluster    90484
6        CPU2X2X1         multi-cluster    93654

Through the DSX tool, we generate both mono- and multi-cluster benchmarks with varying
degrees of complexity. The mono-cluster benchmarks mainly exhibit a non-hierarchical interconnect
pattern. Multi-cluster benchmarks, on the other hand, are hierarchical in nature where
different clusters are connected to each other in a hierarchy. The connection patterns
of the two types of benchmarks used in this work are further verified through the
VERIFIC parsing tool. The core objective of incorporating two types of benchmarks with
different interconnect patterns is to test the capability of the two partitioning approaches
used in this work. The details of these benchmarks are given in Table 1. It can
be seen from this table that we use four mono- and ten multi-cluster benchmarks. The
internal structure of each mono-cluster benchmark is already discussed in Section 3.
The number of coprocessors in each mono-cluster benchmark is indicated at the end
of the benchmark's name, as can be seen from Table 1.

Fig. 9: Bi-terminal Cut-Net Comparison Between Hierarchical and Multilevel Partitioning Approach [bar chart: bi-terminal cut-nets per benchmark]
Fig. 10: Multi-terminal Cut-Net Comparison Between Hierarchical and Multilevel Partitioning Approach [bar chart: multi-terminal cut-nets per benchmark]

As far as the multi-cluster benchmarks are concerned, they are named CPUXxYxZ, where
XxY indicates the size of the cluster grid (X times Y clusters in total) and Z indicates
the number of processors in each cluster. For example, the name CPU2x2x6 indicates that
this benchmark has four clusters and that there are six processors inside each cluster.
As shown in Table 1, in this work, we use a variety of benchmarks that have varying
requirements in terms of the number of components.
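As a small worked example of this naming convention, the sketch below decodes a multi-cluster benchmark name. The helper `parse_multi_cluster_name` is hypothetical and only encodes the rule stated above; the match is made case-insensitive on the assumption that names such as CPU2X2X1 in Table 1 follow the same pattern.

```python
import re
from typing import Tuple


def parse_multi_cluster_name(name: str) -> Tuple[int, int]:
    """Return (number of clusters, processors per cluster) for a CPUXxYxZ name."""
    match = re.fullmatch(r"CPU(\d+)x(\d+)x(\d+)", name, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a CPUXxYxZ benchmark name: {name!r}")
    x, y, z = (int(g) for g in match.groups())
    return x * y, z


print(parse_multi_cluster_name("CPU2x2x6"))  # (4, 6)
```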
5.2 Experimental Results
Extensive experimentation is performed using the benchmarks described in Table 1.
These benchmarks are passed through the flow of Section 3 and a comparison is
performed between the hierarchical and multilevel partitioning approaches of Section 4.
The objective of this experimentation is to validate the proposed flow and to perform
a comparative analysis of the results obtained through the two partitioning approaches.
When the benchmarks are passed through the flow using the two partitioning approaches,
the first important metric that we obtain is the number of cut-nets. Since we are
considering multi-FPGA partitioning, the cut-nets obtained after partitioning can be of
two types: bi-terminal cut-nets and multi-terminal cut-nets. Bi-terminal cut-nets are
those that have a single source and a single destination and they represent
point-to-point interconnect. Multi-terminal cut-nets, on the contrary, have a single
source and multiple destinations and they represent point-to-multi-point interconnect.
The two partitioning approaches give different bi-terminal and multi-terminal cut-net
counts for each benchmark. The bi-terminal cut-net comparison between the two
partitioning approaches is given in Figure 9. It can be seen from this figure that the
multilevel partitioning approach gives better bi-terminal cut-net results for all the
mono-cluster benchmarks. This is because the mono-cluster benchmarks used in this work
are flat in nature and do not possess hierarchy. This fact, coupled with the better
clustering and refinement technique of the multilevel partitioning approach, leads to,
on average, 8% fewer bi-terminal cut-nets for mono-cluster benchmarks as compared to the
hierarchical partitioning approach. However, the bi-terminal cut-net results for
multi-cluster benchmarks reveal that the multilevel partitioning approach performs
poorly as compared to the purely hierarchical partitioning approach. This trend emerges
from the fact that the hierarchical partitioning approach exploits the inherent
hierarchy of multi-cluster benchmarks better than the multilevel partitioning approach.
As a result, for multi-cluster benchmarks, the hierarchical approach gives, on average,
12% fewer bi-terminal cut-nets.
A similar trend for the two partitioning approaches is also observed for the
multi-terminal cut-net metric, as shown in Figure 10. It can be seen from this figure
that, on average, the multilevel approach gives 7.5% fewer multi-terminal cut-nets for
mono-cluster benchmarks, whereas the hierarchical approach gives 12.5% fewer
multi-terminal cut-nets for multi-cluster benchmarks. The bi-terminal and multi-terminal
cut-nets are added together to give the total cut-net count for each benchmark under
consideration. The results obtained for the total cut-net count are shown in Figure 11
and the trend observed in Figures 9 and 10 is applicable here as well.
The effect of the cut-nets is carried forward when the routing of the benchmarks under
consideration is performed. This effect is evident in the mux ratio, which is
obtained after the routing of each benchmark. The results on mux ratio are given in
Figure 12. It can be seen from this figure that the multilevel partitioning approach
gives better mux ratio results for mono-cluster benchmarks whereas the hierarchical
partitioning approach gives better results for multi-cluster benchmarks. For mono-cluster
benchmarks, the multilevel partitioning approach gives, on average, a 12.5% smaller mux
ratio, while for multi-cluster benchmarks, the hierarchical partitioning approach gives,
on average, a 13% smaller mux ratio.
Fig. 11: Cut-Net Comparison Between Hierarchical and Multilevel Partitioning Approach [bar chart: cut-nets per benchmark]
Fig. 12: Multiplexing Ratio Comparison Between Hierarchical and Multilevel Partitioning Approach [bar chart: MUX ratio per benchmark]
Once the routing of the benchmark is completed, its system frequency is estimated
according to [44] using Equation 1. It can be seen from this equation that a partitioning
approach with smaller mux ratio values will result in a higher system frequency.
$$\mathit{sys\_freq} = \frac{125}{\mathit{mux\_ratio}}\ \text{MHz} \qquad (1)$$
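A short numerical illustration of Equation 1 follows. The function name `system_frequency_mhz` and the sample mux ratios are chosen for this example only and are not measured values from the experiments.

```python
def system_frequency_mhz(mux_ratio: float, base_mhz: float = 125.0) -> float:
    """Equation 1: system frequency in MHz for a given mux ratio."""
    if mux_ratio <= 0:
        raise ValueError("mux ratio must be positive")
    return base_mhz / mux_ratio


for ratio in (10, 20, 25):
    print(ratio, system_frequency_mhz(ratio))
# 10 12.5
# 20 6.25
# 25 5.0
```

Because the frequency is inversely proportional to the mux ratio, halving the mux ratio doubles the estimated system frequency.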
The system frequency results obtained using the two partitioning approaches are
shown in Figure 13. It can be seen from this figure that, for mono-cluster benchmarks,
the multilevel partitioning approach gives better system frequency results whereas for
multi-cluster benchmarks, the hierarchical partitioning approach gives better results.

Fig. 13: System Frequency Comparison Between Hierarchical and Multilevel Partitioning Approach [bar chart: frequency (MHz) per benchmark]
Fig. 14: Execution Time Comparison Between Hierarchical and Multilevel Partitioning Approach [bar chart: execution time (sec) per benchmark]

Finally, to further consolidate our comparison between the two partitioning approaches,
we also measure the execution time taken by each approach to partition individual
benchmarks. The execution time results are given in Figure 14. It can be seen from
this figure that, for all the benchmarks under consideration, the multilevel partitioning
approach is slower than the hierarchical partitioning approach. For mono-cluster
benchmarks, the gap is moderate: the multilevel approach is 24% slower than the
hierarchical approach. However, for multi-cluster benchmarks, the execution time gap
between the hierarchical and multilevel partitioning approaches is significant: the
multilevel approach is 68% slower than the hierarchical approach.
It is evident from the results presented in Figures 11-14 that the multilevel approach
gives better results for mono-cluster benchmarks, which are flat and do not possess
traits of hierarchy. On the other hand, the hierarchical partitioning approach gives
better results for multi-cluster benchmarks that are inherently hierarchical in nature.
However, it is also important to note that the hierarchical partitioning approach
outperforms the multilevel approach in terms of execution time. Whether the benchmark is
mono-cluster or multi-cluster, the hierarchical partitioning approach always gives the
shorter execution time. This gain is noticeable for mono-cluster benchmarks and becomes
even bigger for multi-cluster benchmarks.
6 Conclusion
For multi-FPGA systems, partitioning plays a very important role in determining the
quality of the final prototyped design. This work explores two partitioning approaches
for multi-FPGA prototyping systems. One approach exploits the inherent hierarchy
of benchmarks while the second uses multilevel clustering and refinement
to partition the design under consideration. For exploration purposes, we
use a set of fourteen large, complex and realistic benchmarks. Experimental results
obtained through the exploration environment of this work demonstrate that the multilevel
partitioning approach gives overall better results for mono-cluster benchmarks. On
the other hand, the hierarchical partitioning approach gives better results for
multi-cluster benchmarks. On average, the multilevel approach gives 12.5% better
frequency results for mono-cluster benchmarks whereas the hierarchical approach gives
13% better frequency results for multi-cluster benchmarks. Execution time comparison
between the two approaches further reveals that the hierarchical approach gives better
results irrespective of the nature of the benchmarks under consideration. The
hierarchical partitioning approach requires, on average, 60% less execution time than
the multilevel partitioning approach.
In this work, our emphasis has mainly been on the exploration of partitioning approaches.
In the future, we will make the proposed multi-FPGA prototyping flow more
comprehensive by introducing novel in-circuit verification techniques. These techniques
can be used for the functional verification of a design after its prototyping is
finished.
References
1. M. Santarini, “Asic prototyping: Make versus buy,” EDN, vol. 11, 2005.
2. “Sigenics: Custom asic calculator,” http://www.sigenics.com/page/custom-asic-cost-calculator, 2017.
3. AMD, “http://techreport.com/news/13721/chip-problem-limits-supply-of-quad-core-opterons,” 2007.
4. Pentium, “https://en.wikipedia.org/wiki/pentium fdiv bug,” 1994.
5. M. Graphics, “https://www.mentor.com/products/fv/modelsim/,” 2017.
6. I. Kuon and J. Rose, Quantifying and Exploring the Gap Between FPGAs and ASICs. Springer
US, 2010, vol. 1.
7. H. Krupnova, “Mapping multi-million gate socs on fpgas: industrial methodology and experience,” in
Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, vol. 2, Feb
2004, pp. 1236–1241 Vol.2.
8. S. Asaad, R. Bellofatto, B. Brezzo, C. Haymes, M. Kapur, B. Parker, T. Roewer, P. Saha, T. Takken,
and J. Tierno, “A cycle-accurate, cycle-reproducible multi-fpga system for accelerating multi-core
processor simulation,” in Proceedings of the ACM/SIGDA International Symposium on Field
Programmable Gate Arrays, ser. FPGA ’12. New York, NY, USA: ACM, 2012, pp. 153–162.
[Online]. Available: http://doi.acm.org/10.1145/2145694.2145720
9. M. R. Garey and D. S. Johnson, Computers and Intractability; A Guide to the Theory of NP-
Completeness. New York, NY, USA: W. H. Freeman & Co., 1990.
10. VERIFIC, “https://www.verific.com/,” 2019.
11. G.Sigl, K.Doll, and F.Johannes, “Analytical Placement: A Linear or a Quadratic Objective Function?”
Design Automation Conference, pp. 427–432, 1991.
12. C.J.Alpert, T.Chan, D.Huang, A.Kahng, I.Markov, P.Mulet, and K.Yan, “Faster Minimization of Linear
Wirelength for Global Placement,” ACM Symposium on Physical Design, pp. 4–11, 1997.
13. S.Kirkpatrick, C.D.Gelatt, and M.P.Vecchi, “Optimization by Simulated Annealing,” Science 220, pp.
671–680, 1983.
14. C.Sechen and A.Sangiovanni-Vincentelli, “The Timberwolf Placement and Routing Package,” JSSC,
pp. 510–522, April 1985.
15. A.Dunlop and B.Kernighan, “A Procedure for Placement of Standard-cell VLSI Circuits,” IEEE
Transactions on CAD, pp. 92–98, Jan 1985.
16. D.Huang and A.Kahng, “Partitioning-based Standard-cell Global Placement with an Exact Objective,”
ACM Symposium on Physical Design, pp. 18–25, 1997.
17. B.Kernighan and S.Lin, “An Efficient Heuristic Procedure for Partitioning Graphs,” Bell System Tech.
Journal, vol. 49, pp. 291–307, 1970.
18. C.M.Fiduccia and R.M.Mattheyses, “A Linear-time Heuristic for Improving Network Partitions,”
Design Automation Conference, pp. 175–181, 1982.
19. T.Bui, S.Chaudhuri, T.Leighton, and M.Sipser, “Graph Bisection Algorithms with Good Average
Behavior,” Combinatorica, vol. 7, no. 2, pp. 171–191, Jun. 1987.
20. C.J.Alpert, L.W.Hagen, and A.B.Kahng, “Multilevel Circuit Partitioning,” Design Automation Confer-
ence, pp. 530–533, 1997.
21. G.Karypis, R.Aggarwal, V.Kumar, and S.Shekhar, “Multilevel Hypergraph Partitioning: Application in
VLSI Design,” Design Automation Conference, pp. 526–529, June 1997.
22. G.Karypis and V.Kumar, “Multilevel k-way Hypergraph Partitioning,” Design automation conference,
June 1999.
23. “Haps protocompiler by synopsys,” http://www.synopsys.com/Prototyping/FPGA BasedPrototyp-
ing/Pages /protocompiler.aspx, 2017.
24. “Haps multi-fpga board by synopsys,” http://www.synopsys.com/Prototyping/FPGA BasedPrototyp-
ing/Pages /HAPS.aspx, 2017.
25. Auspy, “https://www.mentor.com/products/fv/aupsy,” 2017.
26. “Certify partitioning tool by synopsys,” http://www.synopsys.com/Prototyping/FPGA BasedPrototyp-
ing/ Pages/Certify.aspx, 2017.
27. C. P. Series, “http://www.cadence.com/products/sd/palladium xp series/ pages / default.aspx,” 2017.
28. M. G. Veloce, “https://www.mentor.com/products/fv/emulation-systems/,” 2017.
29. “Zebu-server asic emulator by synopsys,” http://www.synopsys.com/tools/verification /hardware-
verification/emulation/Pages/default.aspx, 2017.
30. Z.Marrakchi, H.Mrabet, and H.Mehrez, “Hierarchical FPGA Clustering to Improve Routability,”
Conference on Ph.D Research in Microelectronics and Electronics, PRIME, 2005.
31. Z. Marrakchi, H. Mrabet, and H. Mehrez, “A new Multilevel Hierarchical MFPGA and its suitable
configuration tools,” Proc. ISVLSI, Karlsruhe, Germany, March 2006.
32. M. Turki, H. Mehrez, Z. Marrakchi, and M. Abid, “Partitioning constraints and signal routing approach
for multi-fpga prototyping platform,” in 2013 International Symposium on System on Chip (SoC), Oct
2013, pp. 1–4.
33. Q. Tang, H. Mehrez, and M. Tuna, “Routing algorithm for multi-fpga based systems using multi-point
physical tracks,” in Rapid System Prototyping (RSP), 2013 International Symposium on, Oct 2013, pp.
2–8.
34. U. Farooq, I. Baig, and B. A. Alzahrani, “An efficient inter-fpga routing exploration environment for
multi-fpga systems,” IEEE Access, vol. 6, pp. 56 301–56 310, 2018.
35. M. Inagi, Y. Takashima, and Y. Nakamura, “Globally optimal time-multiplexing in inter-fpga connec-
tions for accelerating multi-fpga systems,” in Field Programmable Logic and Applications, 2009. FPL
2009. International Conference on, Aug 2009, pp. 212–217.
36. S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based
Computation. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007.
37. D. Stroobandt, P. Verplaetse, and J. Van Campenhout, “Generating synthetic benchmark circuits for
evaluating cad tools,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions
on, vol. 19, no. 9, pp. 1011–1022, Sep 2000.
38. U. Farooq, H. Parvez, H. Mehrez, and Z. Marrakchi, “A new heterogeneous tree-based
application specific fpga and its comparison with mesh-based application specific fpga,”
Microprocess. Microsyst., vol. 36, no. 8, pp. 588–605, Nov. 2012. [Online]. Available:
http://dx.doi.org/10.1016/j.micpro.2012.06.012
39. S. Yang, “Logic synthesis and optimization benchmarks user guide, version 3.0,” Jan 1991.
40. N. Pouillon and A. Greiner, “Soc lib project,” 2010. [Online]. Available:
https://www.asim.lip6.fr/trac/dsx/
41. I. Miro Panades, A. Greiner, and A. Sheibanyrad, “A low cost network-on-chip with guaranteed
service well suited to the gals approach,” in Nano-Networks and Workshops, 2006. NanoNet ’06. 1st
International Conference on, Sept 2006, pp. 1–5.
42. L.McMurchie and C.Ebeling, “Pathfinder: A negotiation-based performance-driven router for fpgas,”
in ACM International Symposium on Field-Programmable Gate Arrays. New York, NY, USA: ACM
Press, 1995, pp. 111–117.
43. G. Karypis and V. Kumar, “Multilevel k-way hypergraph partitioning,” in Proceedings of the 36th
Annual ACM/IEEE Design Automation Conference, ser. DAC ’99. New York, NY, USA: ACM, 1999,
pp. 343–348. [Online]. Available: http://doi.acm.org/10.1145/309847.309954
44. Synopsys, “http://www.synopsys.com/prototyping/fpgabasedprototyping/,” 2017. [Online]. Available:
http://www.synopsys.com/Prototyping/FPGA BasedPrototyping/FPMM/ Pages/default.aspx