Maik Boden, Thomas Fiebig, Markus Reiband, Peter Reichel, Steffen Rülke
the Y-diagram [4], the related design steps in HLS are pointed out. Lastly, we itemize the models used to describe a PR-FPGA system in all three design domains.

3.1. Temporal Reusable Modules (TRMs)

The most significant drawback of FPGA-based adaptive PR systems is the reconfiguration overhead at runtime. GePaRD reduces this overhead by extracting Temporal Reusable Modules (TRMs), which enable temporal resource sharing between sequentially executed HW tasks.

According to the modular EAPR design flow, a TRM has to be placed on a fixed area of the FPGA. Furthermore, it has to provide a common interface that fits all tasks that are composed of this TRM.

Consequently, the GePaRD approach combines the extraction of Common Temporal Modules (CTMs), called Temporal Modularization, with Temporal Placement.

3.2. Design Steps and Transformations

The GePaRD flow extends the EAPR design flow with a high-level synthesis framework to enable model-based design of PR systems using PR-FPGAs.

The flow takes an executable specification (i.e. a high-level notation of the PR system) as input and generates both a system model for simulation and a physically-aware architecture description as input for the implementation on the target device using the modular EAPR design flow. This is done in the following four design steps:

1. Template Abstraction: Firstly, the design library (LIB) is created by abstracting templates from the target architecture. Templates associate high-level notations (i.e. control statements, called patterns, and operations) with the related implementation (called macro). The compiler uses these templates to refine a given algorithm-level description to an intermediate RT-level description called Binary Macro Tree (BMT, see Section 3.3) as input for HLS.

2. High-Level Synthesis: According to Section 2.3, HLS starts with a pre-scheduled behavioral description (represented by the BMT) and generates the structural PR system description, the Pattern Macro Netlist (PMN, see Section 3.3), which is used as a model for simulating the PR system at RT-level and as an architecture template to implement the PR system using the EAPR flow.

3. Temporal Modularization: The extraction of CTMs is done at RT-level using the PMNs provided by HLS. A CTM encapsulates an application-specific set of common macros in a reusable design element. Each CTM substitutes a common set of macros in the PMNs of sequentially executed tasks.

4. Temporal Placement: To generate a physically-aware top-level architecture for the modular EAPR design flow, Temporal Placement is used in order to pre-place the unitized PMNs on the target device.

Figure 2 depicts the design steps in the GePaRD flow using the Y-diagram.

Figure 2. Design Steps and transformations (steps shown: 1 Template Abstraction, 2 Synthesis, 3 Temporal Modularization, 4 Temporal Placement)

3.3. Design Descriptions and Models

GePaRD comprises three interdependent intermediate models to describe a PR design in all design domains according to the Y-diagram shown in Figure 2:

• Binary Macro Tree (BMT): The BMT is a hierarchical CDFG that represents the behavior of the PR system. It is organized as a binary tree and consists of leaves and branches. Leaves represent operations on the data flow. Branches represent patterns or hierarchical nodes (e.g., to distinguish between the system and the task behavior).

• Pattern Macro Netlist (PMN): The PMN is a netlist that represents the hierarchical structure of the PR system. It consists of macros and CTMs. A CTM comprises a set of macros. Macros are provided by the design library and implement patterns and operations.

• Binary Layout Tree (BLT): The BLT is a hierarchical representation of a PR-FPGA layout. It segments the CLB array using nested containers by splitting the chip area represented by a container in half. Hence, each container contains two sub-containers. At the bottom level, each container represents a CLB. The BLT comprises multiple configurations per container in order to enable Temporal Placement of multiple PMNs (one for each task or configuration respectively).
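As a rough illustration of the BMT described above, the following Python sketch models a binary tree whose leaves hold operations and whose branches hold patterns. The class names `Leaf` and `Branch`, the pattern labels `"loop"` and `"seq"`, and the encoding of operations as strings are assumptions made for illustration; they are not the notation used by GePaRD.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    """Leaf of the tree: one operation on the data flow."""
    op: str


@dataclass
class Branch:
    """Branch of the tree: a pattern (or hierarchical node) with two children."""
    pattern: str  # hypothetical labels, e.g. "seq" or "loop"
    left: "Node"
    right: "Node"


Node = Union[Leaf, Branch]


def operations(node: Node) -> List[str]:
    """Collect the leaf operations from left to right."""
    if isinstance(node, Leaf):
        return [node.op]
    return operations(node.left) + operations(node.right)


def count_patterns(node: Node) -> int:
    """Count the branch nodes, i.e. the patterns in the tree."""
    if isinstance(node, Leaf):
        return 0
    return 1 + count_patterns(node.left) + count_patterns(node.right)


# A loop whose body is a sequence of three operations (illustrative only).
body = Branch("seq", Leaf("x = read()"),
              Branch("seq", Leaf("a = a + x"), Leaf("c = c - 1")))
tree = Branch("loop", Leaf("c > 0"), body)

print(operations(tree))
print(count_patterns(tree))
```

Because every branch has exactly two children, a sequence of more than two operations has to be nested, as the `body` subtree shows.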
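The recursive halving behind the BLT can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the names `Container`, `build_blt` and `leaves` are hypothetical, the rule of always cutting the longer side is our own choice (the text only says that a container is split in half), and the `configs` dictionary merely hints at how several configurations per container might be stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Container:
    """A rectangular region of the CLB array (origin x, y; size w x h)."""
    x: int
    y: int
    w: int
    h: int
    children: Tuple["Container", ...] = ()
    # One entry per configuration (task); the values are illustrative labels.
    configs: Dict[str, str] = field(default_factory=dict)


def build_blt(x: int, y: int, w: int, h: int) -> Container:
    """Halve the region recursively until each container covers one CLB."""
    node = Container(x, y, w, h)
    if w * h > 1:
        if w >= h:  # assumption: always cut the longer side
            half = w // 2
            node.children = (build_blt(x, y, half, h),
                             build_blt(x + half, y, w - half, h))
        else:
            half = h // 2
            node.children = (build_blt(x, y, w, half),
                             build_blt(x, y + half, w, h - half))
    return node


def leaves(node: Container) -> List[Container]:
    """Return the bottom-level containers, one per CLB."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


# A 4 x 2 CLB array; the left half hosts a different module per task.
root = build_blt(0, 0, 4, 2)
left_half = root.children[0]
left_half.configs["task_a"] = "module_1"
left_half.configs["task_b"] = "module_2"
print(len(leaves(root)))
```

Storing the configurations on the container rather than on the CLBs reflects the idea that a whole sub-tree can be reassigned between tasks in one step.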
Figure 3 illustrates the descriptions and models (BMT, PMN, BLT) used by GePaRD.

Figure 3. Descriptions and Models (Binary Macro Tree with reconfiguration schedule and operation sequence, Design Library, Pattern Macro Netlist, Binary Layout Tree)

4. Physically-aware HLS for PR-FPGAs

This section explains the GePaRD approach to HLS of physically-aware optimized implementations of adaptive PR systems, which uses Temporal Modularization and Temporal Placement in order to extract TRMs.

4.1. Example: Statistical Calculations

To explain the GePaRD approach, we introduce a simple design example for a PR system: an adaptive accelerator for statistical calculations. This accelerator provides two statistical functions:

• Sample Mean: In statistics, the mean (or average) of a list of numbers $(x_1, \ldots, x_n)$ is the sum of all the members of the list divided by the number of items in the list.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\,(x_1 + \ldots + x_n)$$

• Sample Variance: In statistics, the variance of a list of numbers $(x_1, \ldots, x_n)$ is one measure of the statistical dispersion. It averages the squared distance of the possible values from the expected value.

$$\delta^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n}\,\big((x_1 - \bar{x})^2 + \ldots + (x_n - \bar{x})^2\big)$$

Figure 4 shows implementation examples for these two algorithms using a High-Level Language (HLL) such as C. In the GePaRD flow, optimizations (such as Temporal Modularization and Temporal Placement) are performed during HLS. Thus, we transform the source code to an abstract CDFG for representing the intermediate formats of the GePaRD flow, see Figure 5.

Sample Mean:

    n = read();              /* number of samples */
    c = n;                   /* loop counter */
    while ( c > 0 )
    {
        x = read();          /* next sample */
        a = a + x;           /* accumulate the sum */
        c = c - 1;
    }
    a = a / n;               /* mean */
    write( a );

Sample Variance:

    n = read();              /* number of samples */
    c = n;                   /* loop counter */
    a = read();              /* mean, read as an input */
    while ( c > 0 )
    {
        x = read();          /* next sample */
        s = x - a;           /* distance from the mean */
        s2 = s2 + (s * s);   /* accumulate the squared distances */
        c = c - 1;
    }
    s2 = s2 / (n - 1);
    write( s2 );

Figure 5. Control Data Flow Graph of examples (panels: Sample Mean, Sample Variance)

It is obvious that the CDFGs of both algorithms are quite similar. Consequently, the task is to extract CTMs from the related implementations in order to compose the PR system using TRMs. The next two sections explain our approach to extract CTMs using Temporal Modularization and to form TRMs using Temporal Placement.

4.2. Temporal Modularization

In the first step of physically-aware PR optimization, GePaRD uses Temporal Modularization to extract CTMs from sequentially executed HW tasks.

Using PMN, the structural description D of a PR system is defined as a 5-tuple D = (H, P, t, L, a), whereby

• $H = \{h_1, h_2, \ldots, h_h\}$ is the finite set of HW tasks.

• $P = \{p_1, p_2, \ldots, p_p\}$ is a finite set of PMN nodes.

• $t: P \rightarrow H$ is a relation that assigns a task to each PMN node.

• $L = \{l_1, l_2, \ldots, l_l\}$ is a finite set of LIB macros.

• $a: P \rightarrow L$ is the allocation relation, which assigns a dedicated LIB macro to each PMN node.

Valid combinations $C_u$ of interconnected PMN nodes of a HW task $h_u$, called (PMN) tuples, are defined as a 4-tuple $C_u$ = (T, i, t2, tn), whereby
• $T_u^c = \{t_{u1}^c, t_{u2}^c, \ldots, t_{ut}^c\}$ is a finite set of c-tuples $t_{ui}^c$, each of which comprises a set of PMN nodes $\{p_{i1}, p_{i2}, \ldots, p_{ic}\}$ of a task $h_u$, i.e. $\forall p_i \forall h_u\,(p_i \in t_{ui}^c \wedge h_u \in H \rightarrow t(p_i) = h_u)$.

• $i: P \times P \rightarrow \{true, false\}$ is the input condition of a pair of nodes $(p_i, p_j)$ that holds if $p_j$ gets an input from $p_i$.

• $S^v = \{s_1^v, s_2^v, \ldots, s_s^v\}$ is a finite set of v-tuple sharings. Each sharing $s_i^v$ represents a set of equal tuples $(t_a^v, t_b^v, \ldots, t_c^v)$ of disjoint tasks, i.e. $\forall t_q^v \forall t_r^v\,(t_q^v \in s_i^v \wedge t_r^v \in s_i^v \wedge h_q, h_r \in H \rightarrow (h_q, h_r) \in F \wedge e(t_q^v, t_r^v) = true)$.

• $s: T_q^v \rightarrow S^v$ is a relation that assigns equal v-tuples $t_{qi}^v$ and $t_{rj}^v$ in succeeding tasks $h_q$ and $h_r$ a unique sharing $s_k^v$, i.e. $\forall t_{qi}^v \forall t_{rj}^v \exists_1 s_k^v\,(t_{qi}^v \in T_q^v \wedge t_{rj}^v \in T_r^v \wedge e(t_{qi}^v, t_{rj}^v) = true \wedge s_k^v \in S^v \rightarrow s(t_{qi}^v) = s_k^v \wedge s(t_{rj}^v) = s_k^v)$.

A sharing $s_k^v$ is characterized by the required chip area $cost_a(s_k^v)$, the number of I/O ports $cost_p(s_k^v)$, and the number of task transitions $cost_s(s_k^v)$ for which applying $s_k^v$ reduces the runtime reconfiguration overhead. Thus, a weighted cost function can be deduced:

$$cost(s_k^v) = \alpha \cdot cost_s(s_k^v) \cdot cost_a(s_k^v) + \beta \cdot cost_p(s_k^v) \quad \text{with } \alpha, \beta \in \mathbb{R}^+ \wedge \alpha, \beta \leq 1$$

The CTMs are extracted using GCP, i.e. by selecting the best combination of sharings $s_k^v$ $\forall v \forall k\,(v, k \in \mathbb{N} \wedge v \rightarrow S^v \neq \emptyset \wedge k < card(S^v))$. Figure 6 illustrates the result of Temporal Modularization for the example (see Section 4.1).

Figure 6 (panels: Sample Mean, Sample Variance)

• $D = \{d_1, d_2, \ldots, d_d\}$ is a finite set of dimensions for valid BLT paths representing multiple configurations for the related chip area on the PR-FPGA.

• $d: B \rightarrow D$ is a relation that assigns a dimension to each BLT node.

• $c: S^v \rightarrow B$ is a relation that assigns a BLT node to each CTM $s_k^v$ (a selected sharing from $S^v$).

• $u: P \rightarrow B$ is a relation that assigns each PMN node a BLT node $b_i$ within the sub-tree of the related CTM $s_k^v$ in the same dimension, i.e. $\forall b_i \forall s_k^v\,(b_i \in B \wedge s_k^v \in S^v \rightarrow p(s_k^v) \subset p(b_i) \wedge d(b_i) = d(c(s_k^v)))$.

In consequence, the BLT fulfills the requirements for applying a probabilistic optimization heuristic (such as Simulated Annealing):

• An initial start layout of the CTMs can be generated using a deterministic placement algorithm by assigning each CTM and each PMN node a BLT node.
• Exchanging two BLT nodes $b_i$ and $b_j$ makes it possible to move pieces of the layout on different levels of the hierarchy (i.e. entire CTMs or BMT nodes within a CTM) in order to modify the PR-FPGA layout iteratively during optimization.

• To estimate the quality of a layout, or in particular the improvement after a move, the implementation costs of a PR system can be calculated using the Manhattan distance between interconnected PMN nodes as a metric for the communication costs.

Figure 7 illustrates the result of Temporal Placement for the example (see Section 4.1).

Figure 7. Temporal Placement (panels: Sample Mean, Sample Variance)

5. Summary and future work

6. References

[1] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, J. Teich, "The Erlangen slot machine: increasing flexibility in FPGA-based reconfigurable platforms", Int. Conference on Field-Programmable Technology (IEEE ICFPT), Singapore, Dec 2005
[2] M. Boden, T. Fiebig, T. Meissner, S. Rülke, J. Becker, "High-Level Synthesis of HW Tasks Targeting Run-Time Reconfigurable FPGAs", 21st Int'l Parallel and Distributed Processing Symp. (IEEE IPDPS), Reconfigurable Architectures Workshop (RAW), Long Beach, CA, USA, Mar 2007
[3] R. Hartenstein, "A Decade of Reconfigurable Computing: a Visionary Retrospective", Int. Conference on Design Automation and Test in Europe (DATE), Munich, Germany, Mar 2001
[4] P. Michel, U. Lauther, P. Duzy, The Synthesis Approach to Digital System Design, Kluwer Academic Publishers, Boston/Dordrecht/London, 1992
[5] R. A. Walker, R. Camposano, A Survey of High-Level Synthesis Systems, Kluwer Academic Publishers, Boston/Dordrecht/London, 1991
[6] S. Raje, R. A. Bergamaschi, "Generalized Resource Sharing", Int'l Conf. on Computer-aided Design (IEEE/ACM ICCAD), San Jose, CA, USA, Nov 1997
[7] Y. Ying, R. Woods, "FPGA-based system-level design framework based on the IRIS synthesis tool and System Generator", Int. Conference on Field-Programmable Technology (IEEE FPT), Hong Kong, Dec 2002
[8] Xilinx, Two Flows for Partial Reconfiguration: Module Based or Small Bit Manipulations, Application Note XAPP290 (v1.0), May 2002, www.xilinx.com
[9] Xilinx, Virtex Series Configuration User Guide, 2003, www.xilinx.com
[10] Xilinx, Development System Reference Guide, 2005, www.xilinx.com
[11] Xilinx, Early Access Partial Reconfiguration User Guide For ISE 8.1.01i, UG208 (v1.1), Mar 2006, www.xilinx.com