
IEEE Computer Society Annual Symposium on VLSI

GePaRD - a High-Level Generation Flow for Partially Reconfigurable Designs

Maik Boden, Thomas Fiebig, Markus Reiband, Peter Reichel, Steffen Rülke

Fraunhofer IIS / EAS Dresden


Zeunerstr. 38, 01069 Dresden, Germany
{boden|fiebig|ruelke}@eas.iis.fraunhofer.de
978-0-7695-3170-0/08 $25.00 © 2008 IEEE
DOI 10.1109/ISVLSI.2008.21

Abstract

This paper presents GePaRD, a novel approach to High-Level Synthesis of self-adaptive systems based on Partially Reconfigurable (PR) FPGAs. GePaRD combines Temporal Modularization and Temporal Placement in order to reduce the reconfiguration overhead at runtime by extracting Temporal Reusable Modules. We introduce the basics of High-Level PR design as well as the GePaRD design steps (transformations) and GePaRD descriptions (models). Moreover, we describe our approach to Temporal Modularization using Greedy Clique Partitioning and Temporal Placement using Simulated Annealing.

1. Motivation and introduction

Innovative applications in ubiquitous computing (such as mobile embedded and multimedia systems) require high performance at a reasonable power consumption. State-of-the-art programmable devices (such as FPGAs) are capable of meeting these requirements using Run-Time Reconfiguration (RTR) [11]. RTR makes it possible to exchange the functionality of HW partitions while the remaining HW partitions continue processing without interruption.

Several approaches deal with the FPGA-based implementation of RTR systems and the use of high-level design entries [1][7]. On FPGA architectures, however, the overhead of reconfiguring at runtime affects system performance significantly.

To overcome this limitation, an application-specific design has to adapt to the target-specific reconfiguration mechanisms. For this reason, we present GePaRD (a Generation Flow for Partially Reconfigurable Designs) for Partially Reconfigurable (PR) FPGAs using High-Level Synthesis (HLS).

In order to reduce the reconfiguration overhead at runtime, we developed a novel approach to temporal resource sharing called Temporal Modularization. It extracts Temporal Reusable Modules (TRMs). These modules are reused by several tasks and do not have to be reconfigured during a task switch. Raje et al. [6] consider resource sharing in HLS as a clique partitioning problem and propose to use a global cost function. Thus, our approach uses Greedy Clique Partitioning (GCP).

After Temporal Modularization, the TRMs have to be placed on the target architecture using Temporal Placement. We apply a probabilistic optimization heuristic (Simulated Annealing) on a hierarchical layout model that considers temporal sharings.

This paper is structured as follows: Section 2 introduces partially reconfigurable architectures and related design flows. An overview of the GePaRD approach is given in Section 3. In Section 4 we explain the GePaRD approach to Temporal Modularization and Temporal Placement in detail. Section 5 summarizes the presented approach and points out future work.

2. Background of High-Level PR Design

This section first gives an overview of state-of-the-art partially reconfigurable architectures and related design flows. With regard to the GePaRD approach, we also introduce the basics of HLS.

2.1. Partially Reconfigurable (PR) FPGAs

In the last decade, a number of reconfigurable computing architectures have been introduced [3]. However, none of these architectures is available commercially. Nowadays, common FPGA architectures enable the implementation of adaptive systems.

Xilinx [9] introduced Virtex series FPGAs that provide Partial Reconfiguration (PR) capabilities. Modern Virtex-4 FPGAs also support 2D reconfiguration, which increases the flexibility in PR system design. Moreover, PR-FPGAs are hyper-programmable, that is, they can implement coarse-grained architectures too.

The GePaRD approach uses this hyper-programmability in order to create an optimized architecture that implements adaptive systems based on PR-FPGAs efficiently. For this reason, we use HLS to generate an application-specific
virtual architecture that adapts to the reconfiguration mechanisms of a dedicated target device.

2.2. Basic PR-FPGA Design Flows

Xilinx, a vendor of PR-FPGAs, provides three basic design flows in order to use the feature of dynamic partial reconfiguration of modern Virtex FPGAs:

• Small-Bit Manipulations: The first PR design flow [8] is a difference-based design flow. It is appropriate if two distinct configurations differ marginally, and it does not require design restrictions because it is applied after all configurations of the PR system have been completely designed and the bitstreams are available. Then partial bitstreams for the reconfiguration at runtime are generated using BitGen or the FPGA Editor.

• Module-based Partial Reconfiguration: The second PR design flow [8] is based on the Xilinx Modular Design Flow (XMDF). According to Xilinx [10], the XMDF is preferred for designing dynamically reconfigurable embedded systems using Virtex FPGAs. This flow is characterized by dividing the design into a number of distinct portions, called reconfigurable modules, whose height is always the full height of the device. Consequently, a PR system is composed of a set of reconfigurable modules. If a module is not involved in the reconfiguration task, it does not have to stop during the reconfiguration process. Modules are interlinked using dedicated routing resources, called Bus Macros (BMs). BMs are placed on the border of two modules and represent a fixed communication interface in order to ensure a correct interconnection even if the configuration of modules varies at runtime.

• Early Access Partial Reconfiguration (EAPR): The most recent PR design flow proposed by Xilinx [11] introduces a novel approach to modular PR design. In contrast to the XMDF, a reconfigurable module (in the EAPR flow, called PR module) can assume an arbitrary height. However, this feature requires a device that supports 2D reconfiguration such as a Virtex-4 FPGA.

In conclusion, the EAPR flow enables flexible, modular design of PR systems using the advantages of modern PR-FPGAs. Moreover, 2D reconfiguration allows an efficient partitioning of the design into common and distinct PR modules in order to reduce the reconfiguration overhead at runtime.

2.3. High-Level Synthesis (HLS)

In general, High-Level Synthesis (HLS) transforms the behavioral description of an IC design (i.e., an executable specification) to a structural description (i.e., a netlist) [5]. In the context of PR-FPGA design, HLS is used to generate both the top-level architecture and the structural description of the PR modules for the EAPR flow.

As shown in Figure 1, regarding PR-FPGA design we can distinguish three different classes of HLS steps:

• Compilation and Scheduling: Compilation transforms the behavioral description to an intermediate model (i.e., a Control Data Flow Graph, CDFG). Scheduling commits resources for the realization of operations at a dedicated time. Regarding the realization of an adaptive PR system consisting of a set of HW tasks, the schedule is explicitly given by the executable specification.

• Resource Allocation and Binding: Allocation selects design elements for realizing operations using a design library. Binding assigns the selected design elements to dedicated resources on the target device. Regarding PR design, allocation and binding enable an approach to temporal resource sharing between HW tasks.

• Connection Allocation and Architecture Generation: In the EAPR design flow, the top-level architecture depends on the timing-dependent placement of the modules on the target device. Thus, Temporal Placement has to be applied in order to determine the structure and the layout of the PR system including the module interconnections.

[Figure 1 panels: Behavioral Description (z = (x+y) * (e-f)), Compilation, Scheduling, Resource Allocation, Resource Binding, Connection Allocation, Architecture Generation]
Figure 1. High-level synthesis steps

3. The GePaRD Approach

This section gives an overview of the GePaRD approach, which focuses on reducing the overhead of partially reconfiguring a PR-FPGA at runtime. It is based on previous work on HLS of HW tasks targeting run-time reconfigurable FPGAs [2].

Firstly, we introduce a temporal resource sharing approach using temporal reusable modules. Then applying
the Y-diagram [4], the related design steps in HLS are pointed out. Lastly, we itemize the models used to describe a PR-FPGA system in all three design domains.

3.1. Temporal Reusable Modules (TRMs)

The most significant drawback of FPGA-based adaptive PR systems is the reconfiguration overhead at runtime. GePaRD addresses the reduction of the reconfiguration overhead by extracting Temporal Reusable Modules (TRMs) in order to enable temporal resource sharing between sequentially executed HW tasks.

According to the modular EAPR design flow, a TRM has to be placed on a fixed area on the FPGA. Furthermore, it has to provide a common interface that fits all tasks which are composed of this TRM.

Consequently, the GePaRD approach combines Temporal Modularization, i.e. the extraction of Common Temporal Modules (CTMs), with Temporal Placement.

3.2. Design Steps and Transformations

The GePaRD flow enhances the EAPR design flow by a high-level synthesis framework to enable model-based design of PR systems using PR-FPGAs.

The flow uses an executable specification (i.e. a high-level notation of the PR system) as input and generates both a system model for simulation and a physically-aware architecture description as input for the implementation on the target device using the modular EAPR design flow. This is done by the following four design steps:

1. Template Abstraction: Firstly, the design library (LIB) is created by abstracting templates from the target architecture. Templates associate high-level notations (i.e. control statements, called patterns, and operations) with the related implementation (called macro). The compiler uses these templates to refine a given algorithm-level description to an intermediate RT-level description called Binary Macro Tree (BMT, see Section 3.3) as input for HLS.

2. High-Level Synthesis: According to Section 2.3, HLS starts with a pre-scheduled behavioral description (represented by the BMT) and generates the structural PR system description, the Pattern Macro Netlist (PMN, see Section 3.3), which is used as a model for simulating the PR system on RT-level and as an architecture template to implement the PR system using the EAPR flow.

3. Temporal Modularization: The extraction of CTMs is done at RT-level using the PMNs provided by HLS. A CTM encapsulates an application-specific set of common macros in a reusable design element. Each CTM substitutes a common set of macros in the PMNs of sequentially executed tasks.

4. Temporal Placement: To generate a physically-aware top-level architecture for the modular EAPR design flow, Temporal Placement is used in order to pre-place the unitized PMNs on the target device.

Figure 2 depicts the design steps in the GePaRD flow using the Y-diagram.

[Figure 2: Y-diagram of the GePaRD flow. Labels: High-Level Modelling, Executable Specification, High-Level System Model, System (Processor, Memory), Algorithm, RTL (CDFG, Netlist; ALU, MUX, Registers), LIB, Macro, Layout, Physically-aware Module Implementation, Target Architecture. Steps: 1 Template Abstraction, 2 Synthesis, 3 Temporal Modularization, 4 Temporal Placement]
Figure 2. Design Steps and transformations

3.3. Design Descriptions and Models

GePaRD comprises three interdependent intermediate models to describe a PR design in all design domains according to the Y-diagram shown in Figure 2:

• Binary Macro Tree (BMT): The BMT is a hierarchical CDFG that represents the behavior of the PR system. It is organized as a binary tree and consists of leaves and branches. Leaves represent operations on the data flow. Branches represent patterns or hierarchical nodes (e.g., to distinguish between the system and the task behavior).

• Pattern Macro Netlist (PMN): The PMN is a netlist that represents the hierarchical structure of the PR system. It consists of macros and CTMs. A CTM comprises a set of macros. Macros are provided by the design library and implement patterns and operations.

• Binary Layout Tree (BLT): The BLT is a hierarchical representation of a PR-FPGA layout. It segments the CLB array using nested containers by splitting the chip area represented by a container in half. Hence, each container contains two sub-containers. At the bottom level, each container represents a CLB. The BLT comprises multiple configurations per container in order to enable Temporal Placement of multiple PMNs (one for each task or configuration respectively).
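
The BLT described above can be illustrated with a small data-structure sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names `BltNode`, `blt_build`, `blt_count_leaves`, `blt_free`, the bound `BLT_MAX_CONFIGS`, and the rule of splitting along the longer side are all hypothetical choices made for this example. The paper only states that each container is split in half, that bottom-level containers are single CLBs, and that every container carries multiple configurations.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bound on configurations per container (not from the paper). */
#define BLT_MAX_CONFIGS 4
#define BLT_FREE (-1)

/* One container of the layout tree; it covers a rectangle of the CLB array. */
typedef struct BltNode {
    int x, y, w, h;                 /* covered area in CLB units */
    int occupant[BLT_MAX_CONFIGS];  /* module id per configuration, BLT_FREE if empty */
    struct BltNode *child[2];       /* both NULL at the bottom level (one CLB) */
} BltNode;

/* Recursively split a w x h area in half until every leaf is a single CLB.
 * Assumption: the longer side is split; odd sizes give unequal halves. */
BltNode *blt_build(int x, int y, int w, int h)
{
    assert(w >= 1 && h >= 1);
    BltNode *n = malloc(sizeof *n);
    if (n == NULL)
        abort();
    n->x = x; n->y = y; n->w = w; n->h = h;
    for (int i = 0; i < BLT_MAX_CONFIGS; i++)
        n->occupant[i] = BLT_FREE;
    n->child[0] = n->child[1] = NULL;
    if (w == 1 && h == 1)
        return n;                   /* bottom level: a single CLB */
    if (w >= h) {
        int half = w / 2;
        n->child[0] = blt_build(x, y, half, h);
        n->child[1] = blt_build(x + half, y, w - half, h);
    } else {
        int half = h / 2;
        n->child[0] = blt_build(x, y, w, half);
        n->child[1] = blt_build(x, y + half, w, h - half);
    }
    return n;
}

/* Count the bottom-level containers (CLBs) below a node. */
int blt_count_leaves(const BltNode *n)
{
    if (n->child[0] == NULL)
        return 1;
    return blt_count_leaves(n->child[0]) + blt_count_leaves(n->child[1]);
}

/* Release a whole (sub-)tree. */
void blt_free(BltNode *n)
{
    if (n == NULL)
        return;
    blt_free(n->child[0]);
    blt_free(n->child[1]);
    free(n);
}
```

Calling `blt_build(0, 0, 4, 2)` would build the tree for a hypothetical 4 x 2 CLB array. Keeping one `occupant` slot per configuration in every container is one possible way to let several PMNs share the same chip area over time.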

Figure 3 illustrates the descriptions and models (BMT, PMN, BLT) used by GePaRD.

[Figure 3: Design Library, Binary Macro Tree (Reconfiguration Schedule, Operation Sequence), Pattern Macro Netlist (REG, MUX), Binary Layout Tree]
Figure 3. Descriptions and Models

4. Physically-aware HLS for PR-FPGAs

This section explains the GePaRD approach to HLS of physically-aware optimized implementations of adaptive PR systems using Temporal Modularization and Temporal Placement in order to extract TRMs.

4.1. Example: Statistical Calculations

To explain the GePaRD approach, we introduce a simple design example for a PR system: an adaptive accelerator for statistical calculations. This accelerator provides two statistical functions:

• Sample Mean: In statistics, the mean (or average) of a list of numbers (x1, .., xn) is the sum of all the members of the list divided by the number of items in the list.

    $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} \left( x_1 + \dots + x_n \right)$$

• Sample Variance: In statistics, the variance of a list of numbers (x1, .., xn) is one measure of the statistical dispersion. It averages the squared distance of the possible values from the expected value.

    $$\delta^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n} \left( (x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2 \right)$$

Figure 4 shows implementation examples for these two algorithms using a High-Level Language (HLL) such as C.

Sample Mean:

    n = read();
    c = n;
    while ( c > 0 )
    {
        x = read();
        a = a + x;
        c = c - 1;
    }
    a = a / n;
    write( a );

Sample Variance:

    n = read();
    c = n;
    a = read();
    while ( c > 0 )
    {
        x = read();
        s = x - a;
        s2 = s2 + (s * s);
        c = c - 1;
    }
    s2 = s2 / (n - 1);
    write( s2 );

Figure 4. High-level notation (C code) of examples

In the GePaRD flow, optimizations (such as Temporal Modularization and Temporal Placement) are performed during HLS. Thus, we transform the source code to an abstract CDFG for representing the intermediate formats of the GePaRD flow, see Figure 5.

[Figure 5: CDFG panels "Sample Mean" and "Sample Variance"]
Figure 5. Control Data Flow Graph of examples

The CDFGs of both algorithms are evidently quite similar. Consequently, the task is to extract CTMs from the related implementations in order to compose the PR system using TRMs. The next two sections explain our approach to extract CTMs using Temporal Modularization and to form TRMs using Temporal Placement.

4.2. Temporal Modularization

In the first step of physically-aware PR optimization, GePaRD uses Temporal Modularization to extract CTMs from sequentially executed HW tasks.

Using PMN, the structural description D of a PR system is defined as a 5-tuple D = (H, P, t, L, a), where

• H = {h1, h2, .., hh} is the finite set of HW tasks.

• P = {p1, p2, .., pp} is a finite set of PMN nodes.

• t: P → H is a relation that assigns a PMN node a task.

• L = {l1, l2, .., ll} is a finite set of LIB macros.

• a: P → L is the allocation relation which assigns a PMN node a dedicated LIB macro.

Valid combinations Cu of interconnected PMN nodes of a HW task hu, called (PMN) tuples, are defined as a 4-tuple Cu = (T, i, t2, tn), where
• Tuc = {tuc1, tuc2, .., tuct} is a finite set of c-tuples tuci, each of which comprises a set of PMN nodes {pi1, pi2, .., pic} of a task hu, i.e. ∀pi∀hu(pi ∈ tuci ∧ hu ∈ H → t(pi) = hu).

• i: P × P → {true, false} is the input condition of a pair of nodes (pi, pj) that holds if pj gets an input from pi.

• t2: P × P → Tu2 is a relation that forms a 2-tuple by joining two PMN nodes (pi, pj) of HW task hu, if i(pi, pj) = true.

• tn: Tu2 × Tui → Tu(i+1) is a relation that merges i-tuples and 2-tuples to form (i+1)-tuples of HW task hu, if ∃px∃py(px ∈ Tu2 ∧ py ∈ Tui → i(px, py) = true).

All (2..m)-tuples of a task hu can be derived iteratively by merging i-tuples and 2-tuples to form (i+1)-tuples with i,m ∈ N ∧ 2 < i < m, i.e.

• ∀hu∀pki∀pkj(hu ∈ H ∧ pki ∈ P ∧ pkj ∈ P → t2(pki, pkj) ∈ Tu2)

• ∀hu∀n∀tki∀tkj(hu ∈ H ∧ n ∈ N ∧ n > 1 ∧ tki ∈ Tkn ∧ tkj ∈ Tk2 → tn(tun, tuj) ∈ Tu(n+1))

Valid mergings Mv of temporal PMN v-tuple sharings according to a given task schedule are defined as a 4-tuple Mv = (F, e, Sv, s), where

• F = {f1, f2, .., fts} is a finite set of task pairs (hi, hj) where hj is a valid successor of hi in a given schedule.

• e: Tqv × Trv → {true, false} is the equivalence condition of two v-tuples (ti, tj) that holds if ti ≡ tj.

• Sv = {sv1, sv2, .., svs} is a finite set of v-tuple sharings, where svi represents a set of equal tuples (tav, tbv, .., tcv) of disjoint tasks, i.e. ∀tqv∀trv(tqv ∈ svi ∧ trv ∈ svi ∧ hq,hr ∈ H → (hq,hr) ∈ F ∧ e(tqv, trv) = true).

• s: Tqv → Sv is a relation that assigns equal v-tuples tqvi and trvj in succeeding tasks hq and hr a unique sharing svk, i.e. ∀tqvi∀trvj∃1svk(tqvi ∈ Tqv ∧ trvj ∈ Trv ∧ e(tqvi, trvj) = true ∧ svk ∈ Sv → s(tqvi) = svk ∧ s(tqvj) = svk).

A sharing svk is characterized by the required chip area cost_a(svk), the number of I/O ports cost_p(svk), and the number of task transitions cost_s(svk) where the runtime reconfiguration overhead is reduced when applying svk. Thus, a weighted cost function can be deduced:

• cost(svk) = α · cost_s(svk) · cost_a(svk) + β · cost_p(svk) with α,β ∈ R+ ∧ α,β ≤ 1

The CTMs are extracted using GCP, i.e. by selecting the best combination of sharings svk ∀v∀k(v,k ∈ N ∧ v → Sv ≠ ∅ ∧ k < card(Sv)). Figure 6 illustrates the result of Temporal Modularization for the example (see Section 4.1).

[Figure 6: CDFG panels "Sample Mean" and "Sample Variance" after Temporal Modularization]
Figure 6. Temporal Modularization

4.3. Temporal Placement

The GePaRD approach to Temporal Placement uses a hierarchical binary tree representation of the PR-FPGA layout (the BLT, see Section 3.3) as its model, which is defined as an X-tuple, where

• B = {b1, b2, .., bb} is a finite set of BLT nodes that represent a dedicated area on the PR-FPGA.

• v = [v1, v2, .., vv] is a vector that represents the path to a BLT node in the tree relative to the root node, i.e. the layout position on the PR-FPGA.

• p: B → v is a relation that assigns a BLT node a path.

• D = {d1, d2, .., dd} is a finite set of dimensions for valid BLT paths representing multiple configurations for the related chip area on the PR-FPGA.

• d: D → B is a relation that assigns a BLT node a dimension.

• c: Sv → B is a relation that assigns a CTM svk (selected sharing Sv) a BLT node.

• u: P → B is a relation that assigns a PMN node a BLT node bi within the sub-tree of the related CTM svk in the same dimension, i.e. ∀bi∀svk(bi ∈ B ∧ svk ∈ Sv → p(svk) ⊂ p(bi) ∧ d(bi) = d(c(svk))).

In consequence, the BLT fulfills the requirements for applying a probabilistic optimization heuristic (such as Simulated Annealing):

• An initial start layout of the CTMs can be generated using a deterministic placement algorithm by assigning each CTM and each PMN node a BLT node.
• Exchanging two BLT nodes bi and bj makes it possible to move pieces of the layout at different levels of the hierarchy (i.e. entire CTMs or BMT nodes within a CTM) in order to modify the PR-FPGA layout iteratively during optimization.

• To estimate the quality of a layout, or the improvement after a particular move, the implementation costs of a PR system can be calculated using the Manhattan distance between interconnected PMN nodes as a metric for the communication costs.

Figure 7 illustrates the result of Temporal Placement for the example (see Section 4.1).

[Figure 7: layout panels "Sample Mean" and "Sample Variance" after Temporal Placement]
Figure 7. Temporal Placement

5. Summary and future work

We proposed GePaRD, a Generation Flow for Partially Reconfigurable Designs. GePaRD enhances the EAPR flow by a High-Level Synthesis approach to the optimized design of adaptive PR systems that implement a number of HW tasks.

In order to reduce the reconfiguration overhead when task switches occur at runtime, GePaRD combines Temporal Modularization and Temporal Placement. Temporal Modularization is treated as a Greedy Clique Partitioning problem. It extracts Temporal Reusable Modules. These modules are reused by several tasks and do not have to be reconfigured during a task switch. For Temporal Placement, we apply a probabilistic optimization heuristic (Simulated Annealing) on a hierarchical layout model that considers temporal sharings.

This paper introduces the GePaRD approach and gives an example. The next step in future work is to add the described optimization steps to our synthesis framework for adaptive PR systems [2] in order to obtain experimental results.

6. References

[1] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, J. Teich, "The Erlangen slot machine: increasing flexibility in FPGA-based reconfigurable platforms", Int. Conference on Field-Programmable Technology (IEEE ICFPT), Singapore, Dec 2005
[2] M. Boden, T. Fiebig, T. Meissner, S. Rülke, J. Becker, "High-Level Synthesis of HW Tasks Targeting Run-Time Reconfigurable FPGAs", 21st Int'l Parallel and Distributed Processing Symp. (IEEE IPDPS), Reconfigurable Architectures Workshop (RAW), Long Beach, CA, USA, Mar 2007
[3] R. Hartenstein, "A Decade of Reconfigurable Computing: a Visionary Retrospective", Int. Conference on Design Automation and Test in Europe (DATE), Munich, Germany, Mar 2001
[4] P. Michel, U. Lauther, P. Duzy, The Synthesis Approach to Digital System Design, Kluwer Academic Publishers, Boston/Dordrecht/London, 1992
[5] R.A. Walker, R. Camposano, A Survey of High-Level Synthesis Systems, Kluwer Academic Publishers, Boston/Dordrecht/London, 1991
[6] S. Raje, R.A. Bergamaschi, "Generalized Resource Sharing", Int'l Conf. on Computer-aided Design (IEEE/ACM ICCAD), San Jose, CA, USA, Nov 1997
[7] Y. Ying, R. Woods, "FPGA-based system-level design framework based on the IRIS synthesis tool and System Generator", Int. Conference on Field-Programmable Technology (IEEE FPT), Hong Kong, Dec 2002
[8] Xilinx, Two Flows for Partial Reconfiguration: Module Based or Small Bit Manipulations, Application Note XAPP290 (v1.0), May 2002, www.xilinx.com
[9] Xilinx, Virtex Series Configuration User Guide, 2003, www.xilinx.com
[10] Xilinx, Development System Reference Guide, 2005, www.xilinx.com
[11] Xilinx, Early Access Partial Reconfiguration User Guide For ISE 8.1.01i, UG208 (v1.1), Mar 2006, www.xilinx.com

All registered or unregistered trademarks referenced are the property of their respective owners and no trademark rights to the same is claimed.
