
Performance Modeling of Reconfigurable Distributed Systems based on the OpenSPARC FPGA Board and the SIRC Communication Framework
Kevin L. Thomas, Michael S. Thompson
Department of Electrical and Computer Engineering, Bucknell University
{kevin.thomas, michael.thompson}@bucknell.edu

Abstract—In this paper we present a model for predicting the performance of a distributed, reconfigurable computing cluster built from commodity parts, specifically the Digilent OpenSPARC development board and the SIRC framework developed by Microsoft Research. The goal of this work is to assist in determining the feasibility of deploying a similar system for a given problem. This work is aimed at low-budget and introductory projects that wish to leverage commodity hardware. This paper also provides experimental results for a reconfigurable, distributed system simulating various loading configurations.

I. INTRODUCTION AND RELATED WORK

The system constructed and studied for this project utilizes


two important concepts: reconfigurable computing and distributed computing. We foresee a growing trend at the intersection of these two areas of computing. The literature in the field
of reconfigurable computing is rife with examples of how
reconfigurable computing can be used to accelerate certain
computations. In addition to the increase in performance of
FPGAs themselves, development boards from makers like
Digilent Inc.,1 lower the barrier to entry for lower-budget
entities like small companies and academics at small schools
to leverage this technology by removing the need to build
custom boards for the FPGAs. Because the FPGAs on these
boards are paired with high-speed peripherals like Serial ATA
and gigabit ethernet, it is fairly reasonable to consider using
them as co-processor systems that are controlled by a PC for
certain types of problems. Additionally, projects like the SIRC
framework from Microsoft Research facilitate communication
between these FPGA boards and a PC, freeing the designer to
focus on only the computation and not the communication.
While this combination alleviates some of the challenges of
working with reconfigurable computing, it does not solve the
issue of whether or not moving the computation to a reconfigurable board or multiple reconfigurable boards is feasible or
will give a desired level of performance. Our work focuses on
answering the following two questions.
1) What level of performance can be attained with X FPGA boards?
2) What level of resources is required to attain a performance level of X bytes per unit of time?
1 http://www.digilentinc.com/


Our interest is to aid in the feasibility study that should occur


before a major development effort takes place to implement
the algorithm of interest on one or more FPGA boards. When
pursuing a distributed, reconfigurable computing cluster the
key concern is the network and its level of utilization. When
the network becomes saturated, the distributed system reaches
its maximum utility and scalability; adding more FPGA boards
will not improve system performance once this point is
reached. Conversely, if the FPGA boards are completely
utilized and the network and control PC still have capacity,
adding more FPGA boards may increase the performance of
the entire system. As such, our interest in this work is to find
a way to approximate system-level performance and find the
point of diminishing returns.
While our study focuses on a specific combination of
hardware and software, it should be possible to use the same
model with delay values specific to other hardware platforms.
A. Use Scenario
The specific use scenario of interest is a small, distributed,
reconfigurable computing cluster composed of multiple Digilent OpenSPARC development boards2 and a single control PC
that manages them and aggregates result data. These entities
are connected via gigabit ethernet with a Cisco gigabit switch.
Figure 1 shows our basic cluster configuration. We chose these
specific items because they were readily available to us.
The basic operating procedure of the system is that the
FPGA boards are programmed and configured with one or
more instantiations of a specific algorithm ahead of time;
ideally, the user would create as many instantiations of the algorithm of interest as possible on the available fabric. At
the start, the control PC sends each FPGA board a collection
of data to process. Once received, each FPGA board processes
the data and sends a result back to the control PC when the
computation is finished. In a scenario with multiple FPGA
boards, the control PC would send data to each FPGA board
such that they would all be processing data at the same time.
In an appropriate application, the throughput of the system is
much greater than the throughput of a software-only version
of the algorithm running on a similarly priced system.
In this particular configuration, the system is applicable to computing problems that have limited system throughput (≤ 500 Mb/s) and high computation time.
2 http://www.digilentinc.com/Products/Detail.cfm?Prod=XUPV5

Fig. 1. A simple distributed reconfigurable computing cluster with one control PC and 3 FPGA boards connected via ethernet.

We acknowledge that the limitations of the system don't allow application to all possible problems, but this configuration should be applicable to some types of problems or computation scenarios.
B. The SIRC Framework
One of the challenges of working with FPGA boards in this
manner is communicating with them. While these development boards support many methods of communication, using them in the desired manner presents a challenge. For large
deployments of 10s to 100s of nodes, ethernet is an obvious
choice for communication. While exchanging data with the
ethernet interface is reasonable, the ethernet MAC/LLC and
TCP/IP protocol stacks that exist on top of the raw bits are
more challenging to implement. As such, a better solution is
needed. To fill this need, we found the SIRC framework by
Microsoft Research [1].
The SIRC project aimed to create an open source,
lightweight, and easy-to-use FPGA-based communication
framework with high throughput [1]. This project is open
source and was designed to run on the Microsoft Windows
platform.
SIRC is easy-to-use because a user simply needs to program
to a C++ API in order to transfer data to and from FPGA
boards using input/output buffers and a parameter register file.
The user hardware design only needs to interface with a set
of hardware components to communicate with the software
program. The user can also use the C++ SIRC API to control
execution, including starting, pausing, and stopping the execution of the user's hardware circuit. The interface is displayed
as a block diagram in Figure 2.
SIRC is also lightweight and handles network communication with a custom data format built on top of ethernet; it
does not use IP or higher-layer protocols. SIRC leverages the
ethernet LLC to reliably deliver data. While this limits the scalability of SIRC to only local area networks (LANs), it reduces
the amount of overhead data and therefore processing time.
Data is processed with dedicated hardware which alleviates
the overhead of using a software network stack. As a result
throughput for SIRC approaches 1 Gb/s; system throughput
depends on the transfer size of packets, though. Transfers of 8 KB through the system achieve a throughput of 500 Mb/s,

Fig. 2. The interfaces between hardware and software in the SIRC Framework [1].

half the theoretical network throughput of gigabit ethernet.


Increasing the transfer size to 128 KB will increase network
throughput to 950 Mb/s or 95% of the theoretical network
throughput of gigabit ethernet.
SIRC's easy-to-use API and high throughput make it
a viable option for a communication framework. The biggest
drawbacks to SIRC are that the framework does not offer
functionality for easily managing multiple FPGA connections,
and the software side of the framework only runs on Microsoft
Windows. Currently, SIRC represents a reasonable solution for
facilitating communication between a Windows machine and
multiple FPGA boards over ethernet. We believe that SIRC,
coupled with prescriptive throughput models, will allow system
designers to create high performance reconfigurable distributed
systems for feasible applications.
C. Application Areas
Certain computational algorithms execute faster on reconfigurable hardware than on general purpose processors, typically
because of the parallel nature of these algorithms, the independence of the data being processed, and/or the complexity
of certain operations required for processing. These same
applications may also run particularly well on a distributed
system if the particular computational problem has the ability
to be separated into independent parts that can be solved in
parallel.
This subset of applications has the most performance to
gain from being executed on a network of reconfigurable
distributed hardware. This section describes the following application areas: Monte Carlo simulations, cryptography, engineering, bioinformatics, and neural networks. Using modeling tools allows designers to make
better decisions in creating a system for performing one of the
many algorithms well-suited for execution on a reconfigurable
distributed computing system.
1) Monte Carlo Simulations: Monte Carlo simulations represent one of the tasks that can take advantage of a distributable
system. In a Monte Carlo simulation, a certain model is tested
many times, but each time with a different random seed input
to represent noise in the model. Therefore, every single run of

a Monte Carlo simulation is independent and can be executed


atomically and distributed on a different processing unit.
Monte Carlo simulations are used often in physics and
finance, where predicting the future output of models depends
on noise in the system. The Black-Scholes formula is a popular
model in finance for determining the future price of stock
options. Using a cluster of FPGAs, researchers in Spain were
able to speed up a simulation 5 times compared to a cluster of
general purpose processors [2]. A 16 node FPGA-based cluster
was used to accelerate the Asian Options pricing model over
690 times compared to a dual-core CPU [3].
2) Cryptography: Cryptography is a crucial component of
computer security. Whenever secure transactions occur over
the internet, the data is encrypted on one end and decrypted
on the other end. Cryptography algorithms can take a long time
to process on general purpose processors because the algorithm
may have to process a large set of data.
Some cryptography algorithms are performed on small
blocks of data at a time, making these smaller tasks independent and distributable. The algorithms can also usually
take advantage of custom hardware to increase performance
even more [4]. For these reasons, an FPGA cluster
implementation of 3DES, a common cryptography algorithm,
has shown an increase in throughput over 1000x compared to
a cluster of general purpose processors [5].
Alternatively, FPGA-based clusters can be used to crack the
security of cryptographic systems. RC4 is a commonly used
stream cipher for the Secure Socket Layer (SSL) protocol.
The CUBE, a cluster containing 512 FPGAs, was able to
exhaustively crack a 40-bit RC4 key within 3 minutes, 359
times faster than a quad-core Intel Xeon CPU [6].
3) Engineering: Engineers can utilize computational modeling tools to save time and money in the design process.
Computational fluid dynamics provides a valuable tool for
engineers, but is computationally intensive. The Lattice Boltzmann algorithm in particular has been accelerated using a tightly
coupled cluster of FPGAs [7].
4) Bioinformatics: Processing biological data, especially sequencing genes and analyzing genetic data, is computationally
challenging [8]. Like cryptography, these tasks are performed
on large data sets and can be divided into smaller independent
tasks. Specialized hardware has been developed for FPGA
boards to sequence DNA up to 400% faster compared to traditional software-based algorithms [8]. With the increasing amount of biological data that needs to be processed, reconfigurable distributed systems may provide the best means for
processing it.
5) Neural Networks: The design and development of neural
networks, a decentralized interconnected processing network,
is a growing field not only for neuroscientists, but also for
data scientists trying to better model data with innovative
systems [9]. Because neural networks are designed to be
distributed among many cells, with no central processor,
modern computer architecture does not model them efficiently.
Therefore, FPGA-based designs run more efficiently because
the hardware architecture of the FPGA is already distributed
and decentralized.
An FPGA-based neural network for pattern recognition

using a Hierarchical Temporal Memory (HTM) learning algorithm showed a 1000x performance increase over a Matlab
implementation [10]. The performance of this system was high
enough that the processing could be done in real-time.
II. A BASIC SYSTEM MODEL
We created a simple model of the system that reflects the
basic operation of the SIRC framework and takes our specific
use scenario into account. The steps of sending data to an
FPGA and collecting the results are as follows.
1) (Control PC) Application calls a SIRC API function to
send a write request to the FPGA board.
2) (Control PC) SIRC framework generates and sends a
message to the FPGA board.
3) (Control PC) Application calls a SIRC API function
to issue a read request for the result from the FPGA
board. The call blocks.
4) (FPGA board) The write message is received.
5) (FPGA board) The board waits for the simulated processing time.
6) (FPGA board) The board sends back the result from its
simulated calculation in response to the read request
message.
7) (Control PC) The read request function call returns the
result to the application.
Starting from the assumption that the system operates in this
manner, we created the timing diagram shown in Figure 3. This
figure shows all of the delays in the system. We did not include
a SIRC delay value on the FPGA board because we know
that it only takes 10 clock cycles for data to be processed by
the framework. This value is negligible compared to the other
delays in the system and thus is ignored. Simply stated, the
sum of all of these delays is the amount of time that it takes
an application to send out data and receive a result. Instead
of attempting to measure the performance by the individual
values, we measure performance at the system level using
Equation 1, which defines system throughput, Tsys, as the data transferred through the system divided by the execution time.

$$T_{sys} = \frac{\text{data transferred}}{\text{execution time}} \qquad (1)$$

In Equation 2, data transferred and execution time are expanded to show the composition of these values. Dsys represents the entire time it takes to execute a single iteration of the system, and it can be broken into three parts: DSIRC, Dnet, and Dproc. Since an equal amount of data is assumed to be transmitted and received, Dnet1 and Dnet2 are equal in this analytical model, Dnet1 = Dnet2. The communication framework overhead may vary based on the framework used for the system; in this model of the SIRC framework it is represented as DSIRC. Even if the outbound and return components DSIRC1 and DSIRC2 are not equal, we do not separate them and instead assume that DSIRC = DSIRC1 + DSIRC2.

$$T_{sys} = \frac{N_{boards} \cdot S_{trans}}{D_{sys}} = \frac{N_{boards} \cdot S_{trans}}{2 D_{net} + D_{SIRC} + D_{proc}} \qquad (2)$$

Fig. 3. A generalized timing diagram for a single write/process/read cycle using SIRC.
TABLE I. DEFINITIONS OF VARIABLES

Tsys     throughput of the entire system in Mb/s
Tnet     speed of the network connecting everything in Mb/s
Nboards  number of FPGA boards being utilized in the system
Strans   size of each data transfer through the system in bytes
DSIRC    delay within the communication framework in ns
Dnet     delay for transmitting a data transfer over the network in ns
Dproc    delay for processing data on the FPGA in ns
Dsys     total delay for system execution in ns

The initial analytical model can be expanded even further


by describing Dnet in detail. Dividing the amount of data being transferred by the network interface speed (times two, because the data is both written and read), Dnet can be modeled
even more precisely. This analytical model is displayed in
Equation 3. The full list of the model variables and descriptions
is included in Table I. Throughput variables represent data
over time, measured in bits per second. The transfer size of
data into and out of the system is measured in bytes. Delay
variables all represent given times during system execution and
are measured in seconds.
$$T_{sys} = \frac{N_{boards} \cdot S_{trans}}{\frac{2 N_{boards} S_{trans}}{T_{net}} + D_{SIRC} + D_{proc}} \qquad (3)$$
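To make the model concrete, the following is a minimal Python sketch of Equation 3. The function name and unit conventions (bytes, nanoseconds, Mb/s) are our own choices for illustration and are not part of SIRC or the original experiments.

    def system_throughput(n_boards, s_trans_bytes, t_net_mbps, d_sirc_ns, d_proc_ns):
        """Evaluate Equation 3: predicted system throughput Tsys in Mb/s."""
        bits = n_boards * s_trans_bytes * 8          # data moved per iteration, in bits
        d_net_ns = 2 * bits / t_net_mbps * 1000      # 2x: the data is both written and read
        d_sys_ns = d_net_ns + d_sirc_ns + d_proc_ns  # total delay Dsys (Equation 2)
        return bits / d_sys_ns * 1000                # bits/ns -> Mb/s

    # Hypothetical example: 3 boards, 8 KB transfers, gigabit ethernet,
    # DSIRC of 1,500 us and Dproc of 1 ms
    print(system_throughput(3, 8192, 1000, 1_500_000, 1_000_000))  # ~68 Mb/s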

Of the values shown in Table I, some are static, including


Tnet and Nboards , while others are algorithm-dependent but
can be determined once an algorithm is chosen, including
Strans , Dproc , and Dnet . The only value left is DSIRC which
is unknown at this point. In order to complete the model,
we need to determine values for DSIRC. Since there isn't a
simple path to analytically determine this value, we will use a
simplified experimental system.

A. Model Limitations
This simplified model for the throughput of an FPGA-based
cluster has some innate limitations. The model assumes the
following system attributes.
• A fixed processing time Dproc given a certain data size Strans. This assumption conflicts with certain hardware circuits, which have different times of completion for input data of the same size.
• A fixed processing overhead of the SIRC framework DSIRC for each board configuration Nboards (1, 2, or 3); this overhead differs for other frameworks and may vary with other factors.
• The amount of input data and output data are equal.
These assumptions make the model valid for computations
that transform a data set into one of equal size and also take
a deterministic number of cycles to process given the input
data size. While this does not apply to certain algorithms, it
can be applied to many cryptographic and image processing
algorithms. Although this model has its limitations, it can still
be used for general trends and approximate values of certain
applications and provides a good basis for further modeling in
the future.
III. FINDING DSIRC
At this point our goal is to use experimental methods to determine a value for the framework delay, DSIRC. Because we do not know which aspects of the system affect the delay within the framework, all of the user-controlled variables must be carefully varied to determine their effect. The specific user-controlled values of interest are Dproc, Strans, and Nboards.
A. Experimental Setup
The experimental setup matched that shown in Figure 1. The
control PC was a laptop with a quad-core Intel i7 processor
and 8 GB of RAM running Windows 7. As previously noted,
the three FPGA boards available were Digilent OpenSPARC
boards, based on the Xilinx Virtex 5 FPGA. All of these items
were connected with gigabit ethernet via a Cisco Catalyst 3750
switch. Microsoft Visual Studio 2010 was used for all software
development and the Xilinx ISE suite version 14.1 was used
for all FPGA development.
B. Experimental Procedure
We chose a set of discrete values for each of the three user-controlled values, Dproc, Strans, and Nboards; these sets are
shown in Table II. A set of 10 test runs was completed for
each combination of three values. Each test run consisted of
sending and receiving 1000 operations.3 A timer was started
for operation one and stopped after operation 1000 finished;
this value is the sum of all of the Dsys values for the test
run. We calculated the system throughput, Tsys, by dividing the total amount of data sent, Strans × 1000, by the total operating time.
3 We also tried running the system for 10,000 operations, but found no difference in the results.
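As an illustration of this bookkeeping, the following Python sketch shows the throughput calculation for one test run. The harness is hypothetical; the timer choice and the do_operation placeholder are ours, not the actual test software.

    import time

    OPS_PER_RUN = 1000  # each test run sends and receives 1000 operations

    def run_throughput_test(do_operation, s_trans_bytes):
        """Time OPS_PER_RUN write/process/read cycles and return Tsys in Mb/s."""
        start = time.perf_counter()
        for _ in range(OPS_PER_RUN):
            do_operation()                       # one complete Dsys cycle
        elapsed_s = time.perf_counter() - start  # sum of all Dsys values for the run
        bits_sent = s_trans_bytes * 8 * OPS_PER_RUN
        return bits_sent / elapsed_s / 1e6       # bits per second -> Mb/s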

TABLE II. VALUES FOR USER-CONTROLLED VARIABLES

Nboards  1, 2, or 3
Strans   1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB
Dproc    0 s, 100 ns, 1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms

Fig. 4. Maximum throughput, Tsys, for 4 configurations of the system (1 board with a crossover cable; 1, 2, and 3 boards with a switch) with no processing delay, Dproc = 0.

Fig. 5. The system throughput, Tsys, for a 1-board system. Each line on the figure represents a different data transfer size, Strans. The simulated hardware processing time, Dproc, is the independent variable.


C. Experimental Results
First, by setting the value of Dproc to zero, the maximum throughput for various configurations was calculated and displayed in Figure 4. By performing this experiment, we were able to verify the quoted network throughput values for the SIRC framework. Additionally, these data provide a notion of the maximum Tsys value given that no actual processing is simulated. This figure contains two single-board scenarios, one where the entities are directly connected with a crossover cable and another where they are connected with the switch. The 2-board and 3-board configurations are also shown and yield higher throughput than the single-board scenarios.
We collected data for all possible combinations of processing time, Dproc, transfer size, Strans, and number of boards, Nboards, shown in Table II. The data from these tests are displayed in Figures 5, 6, and 7. The data are grouped by the number of boards, processing time, and then transfer size. One point of note is the exclusion of the 128 KB data series in Figure 7. This is due to an error in the SIRC framework that appeared during that experiment but not during any of the others. The exact cause of the error is unknown at this point.

Fig. 6. The system throughput, Tsys, for a 2-board system. Each line on the figure represents a different data transfer size, Strans. The simulated hardware processing time, Dproc, is the independent variable.

Fig. 7. The system throughput, Tsys, for a 3-board system. Each line on the figure represents a different data transfer size, Strans. The simulated hardware processing time, Dproc, is the independent variable.

D. Calculating DSIRC
We calculated DSIRC by solving Equation 3 for DSIRC, as shown in Equation 4. We used the experimental value of Tsys and the analytical values of Nboards, Dproc, and Strans. The calculated average DSIRC values are shown in Table III, grouped by transfer size, Strans. These data are arranged in this manner because the raw data (not included) show a correlation between changes in transfer size and changes in DSIRC, but changes in processing delay show no correlation with changes in DSIRC.

TABLE III. AVERAGE CALCULATED DSIRC VALUES IN µS FOR 1-, 2-, AND 3-BOARD SYSTEMS GROUPED BY TRANSFER SIZE, Strans.

Strans   1 board  2 boards  3 boards
1 KB       814      897      2,097
2 KB       827      795      1,835
4 KB       846      732      1,546
8 KB       861      720      1,572
16 KB      947      700      1,437
32 KB      840      570      1,119
64 KB    1,061      514      1,204
128 KB   1,051      464        n/a

$$D_{SIRC} = \frac{N_{boards} \cdot S_{trans}}{T_{sys}} - \frac{2 N_{boards} S_{trans}}{T_{net}} - D_{proc} \qquad (4)$$
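The rearrangement behind Equation 4 is mechanical; a minimal Python sketch, using the same illustrative unit conventions as before, follows. The example numbers are hypothetical, not measurements from our experiments.

    def framework_delay_ns(t_sys_mbps, n_boards, s_trans_bytes, t_net_mbps, d_proc_ns):
        """Solve Equation 4 for DSIRC (in ns) given a measured throughput Tsys."""
        bits = n_boards * s_trans_bytes * 8
        total_ns = bits / t_sys_mbps * 1000    # Nboards*Strans/Tsys, i.e. Dsys
        net_ns = 2 * bits / t_net_mbps * 1000  # 2*Nboards*Strans/Tnet
        return total_ns - net_ns - d_proc_ns

    # Hypothetical example: 1 board, 8 KB transfers, a measured Tsys of 60 Mb/s,
    # gigabit network, 100 us of simulated processing -> DSIRC of roughly 860 us
    print(framework_delay_ns(60, 1, 8192, 1000, 100_000))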
First and foremost, these data show that DSIRC is in the
range of a few hundred microseconds to a few milliseconds;
even if a specific prediction is infeasible, a value in this
range is the correct order of magnitude. Going beyond being
close, the data show slightly different behavior for a single
board scenario versus a multi-board scenario. For the single-board scenario, DSIRC increases as the transfer size, Strans, increases. The opposite effect is observed in the multi-board
scenarios. We hypothesize that this effect is a result of the bidirectional nature of gigabit ethernet. Our model assumes that
send and receive operations happen in serial for each board in
the system, but in parallel overall. In a multi-board scenario,
it is possible that while the control PC is sending to board A
it may be receiving from board B. This inherent parallelism
reduces the system side delays because sending and receiving
delays for different boards may overlap.
Looking at only the multi-board scenarios, it is possible
that there is a trend of increasing framework delay, DSIRC ,
as a result of an increase in the number of boards, Nboards .
Due to the limited amount of data, this trend is difficult to
confirm or further analyze.
In the bigger picture, the framework delay, DSIRC , has less
effect on the system performance at large values of transfer
size and processing delay, Strans and Dproc , respectively.
This is analytically evident from Equation 4 and knowing the
approximate order of magnitude of each of the delay values.
When DSIRC is significantly smaller than the other delays, the
other delays have a greater impact on the result. When these
values are large, the value of DSIRC has little impact on the
result. To the developer, this means that if the system being
designed transfers data in larger chunks, greater than 50 KB,
and/or has a long processing delay, greater than 10 ms, then
DSIRC can be ignored because it is significantly smaller than
the other values. If this is not the case then a reasonable value
of DSIRC should be used to predict system performance.
When a specific value for DSIRC is required, it is important to find an appropriate value that best mimics the system parameters. The goal is to find a value of DSIRC for the model calculation that provides the most accurate results, that is, the lowest percentage of error between the model results and the experimental results. Table IV shows the average percent error of all of the experimental results for all combinations of processing delay, number of boards, and transfer size. For the single-board scenario, a DSIRC value of approximately 900 µs yields the lowest average error. The 2- and 3-board scenarios also have individual values that yield small error percentages, approximately 800 µs and 1,500 µs, respectively. Restated, a small percent error means that the value of DSIRC is a reasonable match for that combination of values.
E. Further Analysis
More generally, these results show there is a distinct point of
diminishing returns in the system, establishing two zones of
operation: the saturation zone and the diminishing zone. The falloff point exists around Dproc = 100,000 ns. This plateau
appears to be consistent for all cases. Problems which can
be solved more quickly than 100,000 ns will still be limited
by this inherent latency of the framework and network. There
is inefficiency in the system because of the communication
latency between the control PC and the FPGA board. A
problem that takes enough time to execute to be in the
diminishing zone of operation utilizes more system execution
time performing useful work. Ideally, a problem will take
approximately 100,000 ns to execute, therefore falling between
the two zones of operation, which means it will achieve high
throughput without having the system inefficiency from the
saturation zone.
Knowing that this point exists for this system setup provides
another point of high-level comparison. If the problem executes
on an FPGA board orders of magnitude faster than 100,000 ns,
then further investigation is not really necessary. Significantly
faster computations will not benefit from being moved to this
type of platform.
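This observation suggests a trivial screening rule. The Python sketch below hard-codes the 100,000 ns falloff point observed for this particular hardware and framework combination; it is a rough filter under our assumptions, not a general rule.

    FALLOFF_NS = 100_000  # approximate boundary between the two zones for this setup

    def operating_zone(d_proc_ns):
        """Classify a problem's per-operation processing time against the falloff point."""
        return "saturation" if d_proc_ns < FALLOFF_NS else "diminishing"

Computations classified deep into the saturation zone are unlikely to justify a port to this type of platform.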
IV. MAKING PERFORMANCE PREDICTIONS

Thus far we have introduced a model of system performance


for a distributed, reconfigurable system composed of one or
more FPGA boards connected via gigabit ethernet. As part
of a feasibility study to determine if moving to this type of
system would be worthwhile, the model provides the ability
to roughly calculate the system throughput given knowledge
of the on-board processing delay, Dproc , size of the messages
sent to the board, Strans , and the number of FPGA boards,
Nboards . It is, however, up to the system designer to carefully
choose the size of the messages and how long the processing
will take. As a general note, our data show that the messages
should be as large as possible and the processing time should
be as long as possible. We roughly assume that these two will
be related; more sent data means more processing time.
The second use of the model is to roughly estimate the
number of boards needed to reach a given level of system
performance and/or determine if the desired level of performance is even attainable. Again, this requires the designer
to know the size of the messages that will be sent to a
given board and the amount of time required for on-board
processing; adjustment may be needed. An efficient system
is defined as a system where the system throughput, Tsys , is
below the theoretical maximum of the network and control PC,
approximately 500 Mb/s in this specific scenario.
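For this second use, Equation 3 can be inverted to estimate Nboards for a target throughput. The algebra and the Python sketch below are our own derivation under the model's assumptions; the target becomes unattainable once it reaches the model's ceiling of Tnet/2, approximately 500 Mb/s in this scenario.

    import math

    def boards_needed(target_mbps, s_trans_bytes, t_net_mbps, d_sirc_ns, d_proc_ns):
        """Invert Equation 3 to estimate the Nboards needed for a target Tsys.

        Returns None when the target meets or exceeds the Tnet/2 ceiling,
        beyond which adding boards no longer helps.
        """
        if target_mbps >= t_net_mbps / 2:
            return None
        t = target_mbps / 1000                 # target in bits/ns
        r = t_net_mbps / 1000                  # network rate in bits/ns
        bits = s_trans_bytes * 8               # bits per board per iteration
        delay_ns = d_sirc_ns + d_proc_ns       # DSIRC + Dproc
        return math.ceil(t * delay_ns / (bits * (1 - 2 * t / r)))

    # Hypothetical example: target 300 Mb/s with 64 KB transfers, 10 ms of
    # processing, and a DSIRC of 1 ms -> about 16 boards
    print(boards_needed(300, 65536, 1000, 1_000_000, 10_000_000))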

TABLE IV. AVERAGE PERCENT ERROR BETWEEN THE EXPERIMENTAL THROUGHPUT VALUES AND THE ESTIMATED THROUGHPUT VALUES FOR VARIOUS VALUES OF DSIRC IN MICROSECONDS (COLUMNS), FOR EACH BOARD COUNT SCENARIO (ROWS).

DSIRC (µs)   700    750    800    850    900    950   1000   1100   1200   1300   1400   1500   1600
1 board    10.25   7.07   4.74   4.06   4.38   5.55   7.38  11.18  14.60  17.71  20.49  23.02  25.30
2 boards    6.47   5.53   5.46   6.41   8.18  10.14  12.01  15.41  18.40  21.07  23.48  25.66  27.65
3 boards   36.89  32.60  28.69  25.20  22.08  19.30  16.89  12.72   9.32   6.88   6.37   7.55   9.62

V. CONCLUSION
FPGA-based custom hardware has proven to accelerate
certain tasks. Furthermore, a subset of these problems may
be further accelerated by a distributed, reconfigurable system.
The problem is that system design is challenging and time
consuming, and FPGA hardware is expensive. Because of this,
it is important to be able to evaluate whether a certain computation will benefit from such resources before the financial
investment is made to implement the computation on such
hardware. This work created an analytical model to assist a
given system designer in making decisions about whether or
not to pursue an FPGA-based cluster solution to a problem.
The analytical model provided in this paper has its limitations, but it closely matches the experimental data from a real FPGA-based cluster of three FPGA nodes. It can be used to help system designers, and it is a good starting point for future research into these types of reconfigurable distributed systems.
VI. FUTURE WORK
The next step in this specific area of research is improving the analytical model. This can be done by performing a deeper analysis into the workings of the communication framework (specifically SIRC) and determining how different aspects of the framework affect performance. The model can be expanded to
address computational problems that do not always require a
fixed amount of time to process given the data size or even to
include pipelined systems. By improving the analytical model,
it becomes a more valuable tool for system designers to gauge
the feasibility of building an FPGA-based computing cluster.
The experimental setup and procedure can be used in the
future as a means to calculate the overhead of different FPGA-based communication frameworks. These data are important in
judging the effectiveness of one framework over another, providing system designers with a clear choice for low-overhead
systems.
Lastly, although this paper addresses only FPGA-based
distributed platforms, its model for throughput may also be
applied to distributed systems that are non-FPGA-based. With
more research into traditional distributed computing clusters,
the analytical model may be refined to give system designers
even more useful information to assist their decision making.
ACKNOWLEDGEMENTS
Thanks to Joe Hass and Matt Watkins for their feedback.
REFERENCES
[1] K. Eguro, "SIRC: An extensible reconfigurable computing communication API," in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on. IEEE, 2010, pp. 135–138.
[2] J. Castillo, J. L. Bosque, E. Castillo, P. Huerta, and J. I. Martínez, "Hardware accelerated Monte Carlo financial simulation over low cost FPGA cluster," in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009, pp. 1–8.
[3] Q. Liu, T. Todman, K. H. Tsoi, and W. Luk, "Convex models for accelerating applications on FPGA-based clusters," in Field-Programmable Technology (FPT), 2010 International Conference on, 2010, pp. 495–498.
[4] C. M. Wee, P. R. Sutton, and N. W. Bergmann, "An FPGA network architecture for accelerating 3DES-CBC," in Field Programmable Logic and Applications, 2005. International Conference on. IEEE, 2005, pp. 654–657.
[5] J. Espenshade, M. Lukowiak, M. Shaaban, and G. von Laszewski, "Flexible framework for commodity FPGA cluster computing," in Field-Programmable Technology, 2009. FPT 2009. International Conference on. IEEE, 2009, pp. 465–471.
[6] O. Mencer, K. H. Tsoi, S. Craimer, T. Todman, W. Luk, M. Y. Wong, and P. Leong, "Cube: A 512-FPGA cluster," in Programmable Logic, 2009. SPL. 5th Southern Conference on, 2009, pp. 51–57.
[7] Y. Kono, K. Sano, and S. Yamamoto, "Scalability analysis of tightly-coupled FPGA-cluster for lattice Boltzmann computation," in Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, 2012, pp. 120–127.
[8] S. Al Junid, M. Haron, Z. Abd Majid, F. Osman, H. Hashim, M. Idros, and M. Dohad, "Optimization of DNA sequences data to accelerate DNA sequence alignment on FPGA," in Mathematical/Analytical Modelling and Computer Simulation (AMS), 2010 Fourth Asia International Conference on. IEEE, 2010, pp. 231–236.
[9] C. Pohl, J. Hagemeyer, J. Romoth, M. Porrmann, and U. Rückert, "Using a reconfigurable compute cluster for the acceleration of neural networks," in Field-Programmable Technology, 2009. FPT 2009. International Conference on. IEEE, 2009, pp. 368–371.
[10] M. Deshpande, "FPGA implementation & acceleration of building blocks for biologically inspired computational models," Ph.D. dissertation, Portland State University, 2011.
