
A Networked FPGA-Based Hardware Implementation
of a Neural Network Application

Héctor Fabio RESTREPO, Ralph HOFFMANN, Andres PEREZ-URIBE,
Christof TEUSCHER, and Eduardo SANCHEZ

Logic Systems Laboratory, Swiss Federal Institute of Technology
CH-1015 Lausanne, Switzerland
E-mail: {name.surname}@epfl.ch

Abstract

This paper describes a networked FPGA-based implementation of the FAST (Flexible Adaptable-Size Topology) architecture, an Artificial Neural Network (ANN) that dynamically adapts its size. Most ANN models base their ability to adapt to problems on changing the strength of the interconnections between computational elements according to a given learning algorithm. However, constrained interconnection structures may limit such ability. Field-programmable hardware devices are very well adapted for the implementation of ANNs with in-circuit structure adaptation. To realize this implementation we used a network of Labomat 3 boards (a reconfigurable platform developed in our laboratory), which communicate with each other using TCP/IP or a faster, direct hardware connection.

1 Introduction

Artificial neural network models offer an attractive paradigm: learning to solve problems from examples. They achieve fast parallel processing via massively parallel nonlinear computational elements. While software implementations on conventional von Neumann machines are very useful for investigating the capabilities of neural network models, a hardware implementation is essential to fully profit from their inherent parallelism, as well as for real-time processing in real-world problems. Well-known ANNs, such as multi-layer Perceptrons [1], have proved capable of solving various problems, but because of their relatively high complexity, they are not well suited for a hardware implementation. Thus, we chose the FAST (Flexible Adaptable-Size Topology) algorithm [2], which is very well suited for such an implementation. This algorithm has been implemented on a Labomat 3 board, developed in our laboratory as a teaching and research tool. It is a flexible board, providing FPGA technology and a microcontroller with an Ethernet interface.

2 The FAST Algorithm

The FAST neural network [2, 3] is an unsupervised learning network with a Flexible Adaptable-Size Topology. The network is feed-forward, fully connected, and consists of two layers: an input layer and an output layer. Essentially, the aim of the network is to cluster or categorize the input data. Clusters must be determined by the network itself based on correlations of the inputs (unsupervised learning). The network's size increases by adding a new neuron to the output layer when a sufficiently distinct input vector is encountered, and decreases by deleting an operational neuron through the application of probabilistic deactivation.

3 Labomat 3 Description

The Labomat 3 [4] board uses FPGA technology, and is therefore a reconfigurable platform. Developed by our laboratory, the features of this board make it a unique teaching and research tool. Our aim is to offer a powerful, all-round platform that is easy to understand and simple to use.

4 FAST Implementation on Labomat 3

The FAST neural network architecture was implemented using a network of Labomat 3 boards and was completely described in generic VHDL. The system is completely scalable because we can add as many Labomat 3 boards to the network as we want. The user communicates with and configures the system via the telnet protocol.

5 Results

The final implementation consists of a network of up to eight Labomat 3 boards. Each board contains two FAST neurons supporting 8-bit computation and two-dimensional vectors.

The maximal working frequency for this design was approximately 10 MHz.
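The grow/prune behavior described in Section 2 can be illustrated with a minimal software sketch. This is not the authors' FAST algorithm itself (FAST's exact activation, learning, and deactivation rules are given in [2, 3]); it is a hypothetical clusterer that captures the two structural operations the text describes: adding an output neuron when an input is sufficiently distinct, and probabilistically deactivating an operational neuron. The `threshold`, `prune_prob`, and `lr` parameters are illustrative assumptions, not values from the paper.

```python
import random

def manhattan(a, b):
    # L1 distance: adder-only, so it maps naturally onto 8-bit hardware.
    return sum(abs(x - y) for x, y in zip(a, b))

class AdaptiveClusterer:
    """Sketch of an adaptable-size clustering network (illustrative only)."""

    def __init__(self, threshold, prune_prob=0.0, lr=0.25):
        self.threshold = threshold    # max distance to join an existing cluster
        self.prune_prob = prune_prob  # chance of probabilistic deactivation
        self.lr = lr                  # winner update rate
        self.neurons = []             # one prototype vector per output neuron

    def present(self, vec):
        # Find the closest existing prototype, if any.
        if self.neurons:
            best = min(range(len(self.neurons)),
                       key=lambda i: manhattan(vec, self.neurons[i]))
            if manhattan(vec, self.neurons[best]) <= self.threshold:
                # Input belongs to an existing cluster: move winner toward it.
                self.neurons[best] = [p + self.lr * (v - p)
                                      for p, v in zip(self.neurons[best], vec)]
                self._maybe_prune()
                return best
        # Sufficiently distinct input: grow a new output neuron.
        self.neurons.append(list(vec))
        return len(self.neurons) - 1

    def _maybe_prune(self):
        # Probabilistic deactivation of one operational neuron.
        if len(self.neurons) > 1 and random.random() < self.prune_prob:
            self.neurons.pop(random.randrange(len(self.neurons)))
```

For two-dimensional inputs like those used in the tests below, presenting `[10, 10]`, `[12, 9]`, and `[200, 200]` with `threshold=30` yields two output neurons: the first two vectors share a cluster and the third spawns a new one.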

0-7695-0871-5/00 $10.00 © 2000 IEEE
The same implementation using an FPGA speed grade of -1 (XC4013E-1) would allow a working frequency of 16 MHz. The working frequency of this implementation could be increased by implementing only a single neuron into the FPGA circuit: we would then have enough room to implement more adders and multipliers working in parallel rather than sequentially, as is the case in our current implementation. Of course, this approach requires sixteen Labomat 3 boards instead of eight. Thanks to the scalability characteristics of our implementation, however, this increase does not pose any particular problem, and we are currently developing this approach.

The average number of clock cycles needed to process one vector is approximately 64, which translates to a processing time of 6.4 µs per vector for a 10 MHz clock. Unfortunately, the whole process is software-driven (each hardware subprocess is started by the processor, based on the results of the last subprocess), which introduces an important software overhead.

For the application tests, we presented to the neural network a set of 10000 two-dimensional vectors randomly chosen from a database composed of 40000 vectors. Several tests have been made on the following platforms (Table 1 summarizes the obtained results):

1. eight Labomat 3 boards using TCP/IP;
2. eight Labomat 3 boards using direct hardware communication;
3. software simulation on the 68360 processor available on the board and on a Sun SPARC Ultra 1 workstation.

The first observation is the poor results of the TCP/IP-based configuration. This is due to the way we use TCP: there is only one byte sent or received per TCP packet.

Table 1. Hardware-based [table contents not recoverable from the source]

By using the direct hardware connections between the boards, we obtain better results, but the software overhead is very important (290 µs for a communicationless, single-board configuration). To understand the main difference between the hardware processing time (about 6.4 µs) and the final processing time, we measured the time spent in function calls using a C cross-compiler, and obtained a value of 1.5 µs (time to call, then return from, a function). The software stack involves an important number of function calls, which is not the case for the software simulation. The best results were obtained with software simulations on powerful but widely available platforms.

6 Conclusion

We have used a network of Labomat 3 boards to successfully realize an adaptable-size neural network architecture, which was completely implemented in generic VHDL. Our initial experiments with this recently completed system indicate that high performance can be achieved on pattern clustering tasks.

As an example, we have applied our FAST ANN to an image segmentation and recognition problem. It has been found that the results obtained with this methodology closely resemble those obtained with other neural algorithms which are not suitable for hardware implementation.

At the beginning of our implementation, a hardware driver featuring network capabilities was developed; unfortunately, this driver was quite bulky and complex, requiring fully half of the resources of the Xilinx XC4013 device. For this reason we used the software device driver instead, leading to very disappointing results.

The working frequency and the speed of our implementation could be increased by implementing a single neuron into the FPGA circuit: we would then have enough room to implement more adders and multipliers working in parallel rather than sequentially, as well as the aforementioned hardware-driven communication process.

The hardware approach would become competitive with a completely hardware-driven process (which is not the case in our current implementation) and with more neurons per chip, implying a greater amount of parallelism, an approach that may be possible with larger FPGAs like the new Xilinx Virtex family or, even better, ASIC technology.

References

[1] N. K. Bose and P. Liang. Neural Network Fundamentals with Graphs, Algorithms, and Applications. Electrical and Computer Engineering Series. McGraw-Hill, Inc., 1996.

[2] A. Pérez-Uribe. Structure-Adaptable Digital Neural Networks. PhD thesis, Swiss Federal Institute of Technology-Lausanne, Lausanne, EPFL, 1999.

[3] A. Pérez-Uribe and E. Sanchez. FPGA Implementation of an Adaptable-Size Neural Network. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Proceedings of the International Conference on Artificial Neural Networks (ICANN96), volume 1112 of Lecture Notes in Computer Science, pages 383-388. Springer-Verlag, Heidelberg, 1996.

[4] C. Teuscher, J.-O. Haenni, F. J. Gomez, H. F. Restrepo, and E. Sanchez. A Tool for Teaching and Research on Computer Architecture and Reconfigurable Systems. In Proceedings of the 25th Euromicro Conference, volume 1, pages 343-350. IEEE Computer Society, Milan, Italy, September 8-10, 1999.

