[Fig. 2 diagram labels: Control PC, Gigabit Switch, FPGA Boards, Software API, Hardware API, Input Buffer, Output Buffer, Parameter Register File, Execution Control, commands from the user's software, interface to the user's hardware.]
Fig. 2. The interfaces between hardware and software in the SIRC Framework [1].
using a Hierarchical Temporal Memory (HTM) learning algorithm showed a 1000x performance increase over a Matlab
implementation [10]. The performance of this system was high
enough that the processing could be done in real-time.
II. A BASIC SYSTEM MODEL
We created a simple model of the system that reflects the
basic operation of the SIRC framework and takes our specific
use scenario into account. The steps of sending data to an
FPGA and collecting the results are as follows.
1) (Control PC) Application calls a SIRC API function to
send a write request to the FPGA board.
2) (Control PC) SIRC framework generates and sends a
message to the FPGA board.
3) (Control PC) Application calls a SIRC API function
to issue a read request for the result from the FPGA
board. The call blocks.
4) (FPGA board) The write message is received.
5) (FPGA board) The board waits for the simulated processing time.
6) (FPGA board) The board sends back the result from its
simulated calculation in response to the read request
message.
7) (Control PC) The read request function call returns the
result to the application.
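To make this flow concrete, the following is a minimal sketch of the control-PC side of one iteration. The board object and its method names (send_write, read_result) are hypothetical stand-ins for the SIRC software API, not the framework's actual interface; the blocking behavior of step 3 is modeled by read_result returning only once the simulated processing is done.

```python
# Minimal sketch of the control-PC side of one iteration (steps 1-3 and 7).
# The board object and its method names are hypothetical stand-ins for the
# SIRC software API, not the framework's actual interface.
import time

class FakeBoard:
    """Stand-in board: stores the payload and echoes it after a simulated delay."""
    def __init__(self, d_proc_s=0.001):
        self.d_proc_s = d_proc_s
        self._pending = None

    def send_write(self, payload: bytes) -> None:   # steps 1-2: write request to the board
        self._pending = payload

    def read_result(self) -> bytes:                 # step 3: read request; this call blocks
        time.sleep(self.d_proc_s)                   # steps 4-6 happen on the board
        return self._pending                        # step 7: result returned to the caller

def run_iteration(board, payload: bytes) -> bytes:
    board.send_write(payload)
    return board.read_result()

result = run_iteration(FakeBoard(), bytes(1024))    # one operation with a 1 KB payload
```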
Starting from the assumption that the system operates in this
manner, we created the timing diagram shown in Figure 3. This
figure shows all of the delays in the system. We did not include
a SIRC delay value on the FPGA board because we know
that it only takes 10 clock cycles for data to be processed by
the framework. This value is negligible compared to the other
delays in the system and thus is ignored. Simply stated, the
sum of all of these delays is the amount of time that it takes
an application to send out data and receive a result. Instead
of attempting to measure the performance by the individual
delay values, we measure performance at the system level using
Equation 1, which defines the system throughput, Tsys, as the
data transferred through the system divided by the execution
time.

T_{sys} = \frac{\text{data transferred}}{\text{execution time}}    (1)
In Equation 2, data transferred and execution time are
expanded to show the composition of these values. Dsys
represents the entire time it takes to execute a single iteration
of the system, and it can be broken into three parts: DSIRC,
Dnet, and Dproc. Since an equal amount of data is assumed to
be transmitted and received, the two network delays are equal
in this analytical model, Dnet1 = Dnet2. The communication
framework overhead may vary based on the framework used
for the system; in this model of the SIRC framework it is
represented by DSIRC1 and DSIRC2. Even if these two values
are not equal, we do not separate them and instead use their
sum, DSIRC = DSIRC1 + DSIRC2.
T_{sys} = \frac{N_{boards} S_{trans}}{D_{sys}} = \frac{N_{boards} S_{trans}}{2 D_{net} + D_{SIRC} + D_{proc}}    (2)
[Fig. 3: timing diagram of one system iteration, showing the write, read request, and read response crossing between the Control PC, Network, and FPGA Board timelines, with the delays DSIRC1, Dnet1, Dproc, Dnet2, and DSIRC2 summing to Dsys.]
TABLE I. DEFINITIONS OF VARIABLES.
Tsys      System throughput
Dsys      Total delay for one iteration of the system
DSIRC     SIRC framework overhead (DSIRC1 + DSIRC2)
Dnet      Network transfer delay (Dnet1 = Dnet2)
Dproc     Hardware processing time
Strans    Size of the data transferred per operation
Nboards   Number of FPGA boards
Tnet      Network throughput
Writing the one-direction network delay in terms of the
network throughput, Tnet, as D_{net} = N_{boards} S_{trans} / T_{net}
gives Equation 3.

T_{sys} = \frac{N_{boards} S_{trans}}{\frac{2 N_{boards} S_{trans}}{T_{net}} + D_{SIRC} + D_{proc}}    (3)
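As a concrete reading of Equation 3, the short sketch below evaluates the model for one set of inputs. The link rate, framework delay, and processing delay used here are illustrative placeholders, not values measured in this work.

```python
# Evaluate the throughput model of Equation 3. All inputs below are
# illustrative placeholders, not values measured in this work.

def model_throughput_bps(n_boards, s_trans_bytes, t_net_bps, d_sirc_s, d_proc_s):
    bits = n_boards * s_trans_bytes * 8          # data moved per iteration (numerator)
    d_net_total = 2 * bits / t_net_bps           # send plus receive time on the network
    return bits / (d_net_total + d_sirc_s + d_proc_s)

# Example: 3 boards, 64 KB transfers, a 1 Gb/s link, ~1 ms framework delay, 1 ms processing.
tsys = model_throughput_bps(3, 64 * 1024, 1e9, 1e-3, 1e-3)
print(f"Predicted Tsys = {tsys / 1e6:.0f} Mb/s")  # about 306 Mb/s for these inputs
```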
A. Model Limitations
This simplified model for the throughput of an FPGA-based
cluster has some innate limitations. The model assumes the
following system attributes:
• A fixed processing time, Dproc, for a given data size,
Strans. This assumption conflicts with certain hardware
circuits, which take different amounts of time to complete
for input data of the same size.
• A fixed processing overhead of the SIRC framework,
DSIRC, for each board configuration, Nboards (1, 2, or
3). This overhead is not the same as that of other
frameworks and may vary based on various factors.
• Equal amounts of input data and output data.
These assumptions make the model valid for computations
that transform a data set into one of equal size and also take
a deterministic number of cycles to process given the input
data size. While this does not apply to certain algorithms, it
can be applied to many cryptographic and image processing
algorithms. Although this model has its limitations, it can still
be used for general trends and approximate values of certain
applications and provides a good basis for further modeling in
the future.
III. FINDING DSIRC
At this point our goal is to use experimental methods to
determine a value for the framework delay, DSIRC. Because
we do not know which aspects of the system affect the delay
within the framework, all of the user-controlled variables must
be carefully varied to determine their effect. The specific
user-controlled values of interest are Dproc, Strans, and Nboards.
A. Experimental Setup
The experimental setup matched that shown in Figure 1. The
control PC was a laptop with a quad-core Intel i7 processor
and 8 GB of RAM running Windows 7. As previously noted,
the three FPGA boards available were Digilent OpenSPARC
boards, based on the Xilinx Virtex 5 FPGA. All of these items
were connected with gigabit ethernet via a Cisco Catalyst 3750
switch. Microsoft Visual Studio 2010 was used for all software
development and the Xilinx ISE suite version 14.1 was used
for all FPGA development.
B. Experimental Procedure
We chose a set of discrete values for each of the three
user-controlled values, Dproc, Strans, and Nboards; these sets are
shown in Table II. A set of 10 test runs was completed for
each combination of the three values. Each test run consisted of
sending and receiving 1000 operations.3 A timer was started
for operation one and stopped after operation 1000 finished;
this value is the sum of all of the Dsys values for the test
run. We calculated the system throughput, Tsys, by dividing
the sum of the data that was sent, Strans × 1000, by the total
operating time.
3 We also tried running the system for 10,000 operations, but found no
difference in the results.
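The measurement itself amounts to timing a batch of operations and dividing the data moved by the elapsed time. The sketch below shows the idea; send_write and read_result are hypothetical stand-ins for the SIRC API calls, and the actual experiments were implemented in C++ against the SIRC interface.

```python
import time

# Sketch of one test run: time 1000 send/receive operations and compute Tsys.
# send_write / read_result are hypothetical stand-ins for the SIRC API calls;
# the actual experiments used the C++ SIRC interface.

def run_test(boards, s_trans_bytes, n_ops=1000):
    payload = bytes(s_trans_bytes)
    start = time.perf_counter()                   # timer started for operation one
    for _ in range(n_ops):
        for board in boards:
            board.send_write(payload)
        for board in boards:
            board.read_result()
    elapsed = time.perf_counter() - start         # sum of all Dsys values for the run
    total_bits = len(boards) * s_trans_bytes * 8 * n_ops
    return total_bits / elapsed                   # Tsys in bits per second
```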
TABLE II. VALUES USED FOR THE USER-CONTROLLED VARIABLES.
Nboards:  1, 2, or 3
Strans:   1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB
Dproc:    0 s, 100 ns, 1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms
[Fig. 4: maximum system throughput, Tsys (Mb/s), versus transfer size, Strans (bytes), for the 1-board (crossover), 1-board (switch), 2-board (switch), and 3-board (switch) configurations.]
Fig. 5. The system throughput, Tsys, for a 1-board system. Each line on the
figure represents a different data transfer size, Strans. The simulated hardware
processing time, Dproc, is the independent variable.
Fig. 6. The system throughput, Tsys, for a 2-board system. Each line on the
figure represents a different data transfer size, Strans. The simulated hardware
processing time, Dproc, is the independent variable.

Fig. 7. The system throughput, Tsys, for a 3-board system. Each line on the
figure represents a different data transfer size, Strans. The simulated hardware
processing time, Dproc, is the independent variable.

C. Experimental Results
First, by setting the value of Dproc to zero, we measured the
maximum throughput for the various configurations, shown in
Figure 4. This experiment allowed us to verify the quoted
network throughput values for the SIRC framework. Additionally,
these data give a notion of the maximum Tsys value when no
actual processing is simulated. The figure contains two
single-board scenarios, one in which the control PC and board
are directly connected with a crossover cable and one in which
they are connected through the switch. The 2-board and 3-board
configurations are also shown and yield higher throughput than
the single-board scenarios.
We collected data for all possible combinations of processing
time, Dproc, transfer size, Strans, and number of boards, Nboards,
shown in Table II. The data from these tests are displayed in
Figures 5, 6, and 7. The data are grouped by the number of
boards, processing time, and then transfer size. One point of
note is the exclusion of the 128 KB data series in Figure 7; the
SIRC framework returned an error during that experiment, but
not during any of the other experiments. The exact cause of the
error is unknown at this point.

D. Calculating DSIRC
We calculated DSIRC by solving Equation 3 for DSIRC, as
shown in Equation 4, using the experimental value of Tsys and
the analytical values of Nboards, Dproc, and Strans. The
calculated average DSIRC values are shown in Table III, grouped
by transfer size, Strans. These data are arranged in this manner
because the raw data (not included) show a correlation between
changes in transfer size and changes in DSIRC, while changes in
processing delay show no correlation with changes in DSIRC.
TABLE III. AVERAGE CALCULATED DSIRC VALUES IN µS FOR 1-, 2-,
AND 3-BOARD SYSTEMS, GROUPED BY TRANSFER SIZE, Strans.
Strans    1 board    2 boards    3 boards
1 KB      814        897         2,097
2 KB      827        795         1,835
4 KB      846        732         1,546
8 KB      861        720         1,572
16 KB     947        700         1,437
32 KB     840        570         1,119
64 KB     1,061      514         1,204
128 KB    1,051      464         n/a
D_{SIRC} = \frac{N_{boards} S_{trans}}{T_{sys}} - \frac{2 N_{boards} S_{trans}}{T_{net}} - D_{proc}    (4)
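For reference, Equation 4 is straightforward to apply in code. The sketch below plugs in placeholder numbers (a hypothetical measured throughput and a nominal 1 Gb/s link, not values taken from Table III) to recover a DSIRC estimate.

```python
# Apply Equation 4: recover D_SIRC from a measured throughput. The inputs
# below are placeholders for illustration, not data from the experiments.

def d_sirc_seconds(n_boards, s_trans_bytes, t_sys_bps, t_net_bps, d_proc_s):
    bits = n_boards * s_trans_bytes * 8
    return bits / t_sys_bps - 2 * bits / t_net_bps - d_proc_s

# Example: 1 board, 16 KB transfers, 100 Mb/s measured, a 1 Gb/s link, 100 us of processing.
d = d_sirc_seconds(1, 16 * 1024, 100e6, 1e9, 100e-6)
print(f"D_SIRC is roughly {d * 1e6:.0f} us")      # about 949 us for these inputs
```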
First and foremost, these data show that DSIRC is in the
range of a few hundred microseconds to a few milliseconds;
even if a specific prediction is infeasible, a value in this
range is the correct order of magnitude. Beyond the order of
magnitude, the data show slightly different behavior for the
single-board scenario versus the multi-board scenarios. For the
single-board scenario, DSIRC increases as the transfer size,
Strans, increases. The opposite effect is observed in the multi-board
scenarios. We hypothesize that this effect is a result of the
bidirectional nature of gigabit ethernet. Our model assumes that
send and receive operations happen serially for each board in
the system, but in parallel overall. In a multi-board scenario,
it is possible that while the control PC is sending to board A
it may be receiving from board B. This inherent parallelism
reduces the system-side delays because sending and receiving
delays for different boards may overlap.
Looking at only the multi-board scenarios, it is possible
that there is a trend of increasing framework delay, DSIRC ,
as a result of an increase in the number of boards, Nboards .
Due to the limited amount of data, this trend is difficult to
confirm or further analyze.
In the bigger picture, the framework delay, DSIRC , has less
effect on the system performance at large values of transfer
size and processing delay, Strans and Dproc , respectively.
This is analytically evident from Equation 4 and knowing the
approximate order of magnitude of each of the delay values.
When DSIRC is significantly smaller than the other delays, the
other delays have a greater impact on the result. When these
values are large, the value of DSIRC has little impact on the
result. To the developer, this means that if the system being
designed transfers data in large chunks (greater than 50 KB)
and/or has a long processing delay (greater than 10 ms), then
DSIRC can be ignored because it is significantly smaller than
the other values. If this is not the case, then a reasonable value
of DSIRC should be used to predict system performance.
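A quick numerical check illustrates how the influence of DSIRC shrinks as the processing delay grows. The values used below (1 board, 64 KB transfers, a 1 Gb/s link, and an assumed 1 ms framework delay) are placeholders, not measurements.

```python
# How much does ignoring D_SIRC change the predicted Tsys? Placeholder values:
# 1 board, 64 KB transfers, a 1 Gb/s link, and an assumed 1 ms framework delay.

def tsys_bps(bits, t_net_bps, d_sirc_s, d_proc_s):
    return bits / (2 * bits / t_net_bps + d_sirc_s + d_proc_s)

bits = 64 * 1024 * 8
for d_proc in (1e-3, 10e-3, 100e-3):              # 1 ms, 10 ms, 100 ms of processing
    with_dsirc = tsys_bps(bits, 1e9, 1e-3, d_proc)
    without = tsys_bps(bits, 1e9, 0.0, d_proc)
    overestimate = (without / with_dsirc - 1) * 100
    print(f"Dproc = {d_proc * 1e3:5.0f} ms: ignoring DSIRC overestimates Tsys by {overestimate:.1f}%")
```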
When a specific value for DSIRC is required, it is important
to find an appropriate value that best mimics the system
parameters. The goal is to find a value of DSIRC for the
model calculation that provides the most accurate results, that
is, the lowest percent error between the model results and
the experimental results. Table IV shows the average percent
error over all of the experimental results, for all combinations of
processing delay, number of boards, and transfer size, for a
range of candidate DSIRC values.
TABLE IV. AVERAGE PERCENT ERROR BETWEEN THE EXPERIMENTAL THROUGHPUT VALUES AND THE ESTIMATED
THROUGHPUT VALUES FOR VARIOUS VALUES OF DSIRC, IN MICROSECONDS (COLUMNS), FOR EACH BOARD-COUNT
SCENARIO (ROWS).
DSIRC (µs)   700     750    800     850    900     950     1000    1100    1200   1300    1400    1500    1600
1 board      10.25   7.07   4.74    4.06   4.38    5.55    7.38    11.18   14.6   17.71   20.49   23.02   25.3
2 boards     6.47    5.53   5.46    6.41   8.18    10.14   12.01   15.41   18.4   21.07   23.48   25.66   27.65
3 boards     36.89   32.6   28.69   25.2   22.08   19.3    16.89   12.72   9.32   6.88    6.37    7.55    9.62
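In principle this table can be reproduced by sweeping candidate DSIRC values and averaging the model-versus-measurement error. A minimal sketch follows, assuming the measurements are available as (Nboards, Strans, Dproc, measured Tsys) tuples; that layout is hypothetical, not the format of the paper's data.

```python
# Sweep candidate D_SIRC values and report the average percent error between
# the model of Equation 3 and measured throughputs. The 'measurements' layout
# (n_boards, s_trans_bytes, d_proc_s, measured_tsys_bps) is hypothetical.

def avg_percent_error(measurements, d_sirc_s, t_net_bps=1e9):
    errors = []
    for n_boards, s_trans_bytes, d_proc_s, measured_bps in measurements:
        bits = n_boards * s_trans_bytes * 8
        predicted = bits / (2 * bits / t_net_bps + d_sirc_s + d_proc_s)
        errors.append(abs(predicted - measured_bps) / measured_bps * 100)
    return sum(errors) / len(errors)

def best_d_sirc_us(measurements, candidates_us=range(700, 1601, 50)):
    return min(candidates_us, key=lambda us: avg_percent_error(measurements, us * 1e-6))
```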
V. CONCLUSION
FPGA-based custom hardware has proven to accelerate
certain tasks. Furthermore, a subset of these problems may
be further accelerated by a distributed, reconfigurable system.
The problem is that system design is challenging and time
consuming, and FPGA hardware is expensive. Because of this,
it is important to be able to evaluate whether a certain computation will benefit from such resources before the financial
investment is made to implement the computation on such
hardware. This work created an analytical model to assist a
given system designer in making decisions about whether or
not to pursue an FPGA-based cluster solution to a problem.
The analytical model provided in this paper has its limitations,
but it closely matches the experimental data from a real FPGA-based
cluster of three FPGA nodes. It can be used to help
system designers, and it is a good start for future research in
studying these types of reconfigurable distributed systems.
VI. FUTURE WORK
The next step in this specific area of research is improving
the analytical model. This can be done by performing a deeper
analysis of the workings of the communication framework
(specifically SIRC) and determining how different aspects of the
framework affect performance. The model can be expanded to
address computational problems that do not always require a
fixed amount of time to process given the data size or even to
include pipelined systems. By improving the analytical model,
it becomes a more valuable tool for system designers to gauge
the feasibility of building an FPGA-based computing cluster.
The experimental setup and procedure can be used in the
future as a means to calculate the overhead of different FPGA-based
communication frameworks. These data are important in
judging the effectiveness of one framework over another, providing system designers with a clear choice for low-overhead
systems.
Lastly, although this paper addresses only FPGA-based
distributed platforms, its model for throughput may also be
applied to distributed systems that are not FPGA-based. With
more research into traditional distributed computing clusters,
the analytical model may be refined to give system designers
even more useful information to assist their decision making.
ACKNOWLEDGEMENTS
Thanks to Joe Hass and Matt Watkins for their feedback.
REFERENCES
[1] K. Eguro, "SIRC: An extensible reconfigurable computing communication API," in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on. IEEE, 2010, pp. 135-138.
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]