
Tech Recon

Advanced Signal Processing

Case Study:
Cryptography Computing Cluster
Blends FPGAs, Quad CPUs
Designed for military cryptography applications, a cluster computing design
marries an array of FPGA computing resources and power control processors tied
together over PCI Express.

Steve McReynolds, Technical Support Manager
Trenton Technology
Mark Hur, Director of Marketing
Pico Computing

Reprinted from COTS Journal November 2007

Choosing the right architecture can make a huge difference in how complex
signal processing computing applications perform. Applications such as weather
prediction, bomb blast analysis and financial modeling, for example, require
floating-point computations. Meanwhile, other similarly complex applications
such as the Smith-Waterman algorithm, military cryptography and image
processing do not typically require floating-point computations. For
non-floating-point applications, large arrays of FPGAs can dramatically speed
up the computation of the overall solution algorithm.

This case study takes a look at how an array of up to 84 FPGAs, managed by a
PICMG 1.3 System Host Board (SHB) that communicates with the FPGA boards over
PCI Express links on a PICMG 1.3 backplane, enables 10,000:1 speed
improvements in typical hardware cryptography applications.

FPGA Acceleration with Intel Processors

General-purpose processor arrays, including graphics processors, are effective
and easy to use. But many such processors are burdened with extremely high
power requirements. In such processors, a very large number of gates must
switch when performing even the simplest decisions. In contrast, FPGA modules,
such as the Pico Computing E-16 shown in Figure 1, use a minimal set of gates
to make basic logic decisions.

Figure 1
Unlike general-purpose processor arrays, in which a very large number of gates
must switch to perform even the simplest decisions, FPGA computing modules,
such as this Pico Computing E-16, use a minimal set of gates to make basic
logic decisions.

The high-performance computing system described here requires approximately
500W of system power. Over half of that power is consumed by the PICMG 1.3
System Host Board (SHB) with its two Quad-Core Intel Xeon Processors E5335.
The Trenton MCXT system host board shown in Figure 2 takes full advantage of
the capabilities of the Quad-Core Intel Xeon Processors to manage the complex
data traffic flow needed by the high-performance computing system.

In cluster computing designs, the flexibility of the interconnect that links
the processing elements together is as important as the processing elements
themselves. Rarely can any one military application justify its own
purpose-built hardware optimized for a particular algorithm. Typically a
machine designed for one military application does not have the optimal
interconnect strategy for another algorithm. With that in mind, the PCI
Express (PCIe)-based interconnect strategy used in this system allows
tremendous flexibility in connecting the processing elements (PEs). The PEs
may be connected to their nearest neighbors, to other processing elements
hosted in the same box, and even across boxes with no change to the FPGA
algorithms and minimal impact on speed. Unless a PE is capable of performing
significant amounts of processing independently, the potential for high levels
of parallelism is limited. The PCIe interconnect strategy used in the Pico
Computing SC3 SuperCluster overcomes this inherent limitation.
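
To make the point about independent processing elements concrete, the following is a minimal Python sketch of how a brute-force key search might be divided so that each PE works on its own slice without talking to its neighbors. It is not taken from the SC3 software. The function name partition_keyspace and the 40-bit key length are illustrative assumptions; only the count of 84 processing elements comes from this article.

```python
def partition_keyspace(total_keys: int, num_pes: int) -> list:
    """Split [0, total_keys) into num_pes contiguous, non-overlapping ranges.

    Hypothetical helper for illustration only. Range sizes differ by at
    most one key, so every PE gets a near-equal share of the work.
    """
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    if total_keys < 0:
        raise ValueError("total_keys must be non-negative")

    base, extra = divmod(total_keys, num_pes)
    ranges = []
    start = 0
    for pe in range(num_pes):
        # The first `extra` PEs each take one additional key.
        size = base + (1 if pe < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


NUM_PES = 84    # maximum FPGA card count quoted in the article
KEY_BITS = 40   # assumed key length, chosen only for illustration

if __name__ == "__main__":
    chunks = partition_keyspace(2 ** KEY_BITS, NUM_PES)
    for pe_index in (0, 1, NUM_PES - 1):
        chunk = chunks[pe_index]
        print(f"PE {pe_index}: keys {chunk.start} to {chunk.stop - 1}")
```

Because each slice is independent, the host only needs to hand out a start and stop value and collect a result, which is the kind of workload where interconnect latency matters least.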

FPGA-Based Cluster System Architecture

The block diagram in Figure 3 illustrates the system architecture of the Pico
Computing SC3 SuperCluster. The SC3 is the third generation of FPGA-based
cluster computers from Pico Computing. A central element of the SC3 is the
PICMG 1.3 system host board from Trenton Technology, featuring two long-life
Quad-Core Intel Xeon Processors. The Cyclone Microsystems model PCIe-412
backplane is used and features fourteen PCI Express option card slots for the
FPGA backplane boards, one SHB slot and two additional PCI-X option card
slots. A Pico Computing EC7BP PCI Express backplane plugs into the PCIe card
slots on the backplane. Figure 4 illustrates how each EC7BP is capable of
supporting up to seven standard FPGA cards such as the Pico Computing E-16
card, and each E-16 card features a Virtex-5 LX50 FPGA with 46,080 Logic
Cells. Using this combination of off-the-shelf components enabled the system
to be developed, manufactured and tested in a very short period of time.

Figure 2
In the SC3 SuperCluster, Trenton’s MCXT system host board shown here takes
full advantage of the capabilities of the Quad-Core Intel Xeon Processors to
manage the complex data traffic flow needed by the high-performance computing
system.

Figure 3
Illustrated here is the system architecture of the Pico Computing SC3
SuperCluster. The SC3 is the third generation of FPGA-based cluster computers
from Pico Computing. The Cyclone Microsystems model PCIe-412 backplane is used
and features fourteen PCI Express option card slots, one SHB slot and two
PCI-X slots. A Pico Computing EC7BP PCI Express backplane plugs into the PCIe
card slots on the backplane.
[Block diagram labels: Pico Computing SC3 SuperCluster; Trenton MCXT System
Host Board (SHB); Cyclone Microsystems PCIe-412 PICMG 1.3 Backplane; EC7BP
boards, each carrying E16 cards; SW blocks.]

The PCI Express serial links between the SHB slot and the individual E-16 FPGA
cards are managed by a PCI Express fan-out switch on each EC7BP backplane card
and four 32-lane PCI Express fan-out switches on the Cyclone PCIe-412 PICMG
1.3 backplane. A PCIe-to-PCI-X bridge chip on the backplane manages the link
between the SHB and the two 64-bit/100 MHz PCI-X option card slots. These
standard PCIe fan-out switches can simultaneously coalesce multiple PCIe
lanes. This provides considerable bandwidth overlap while keeping any inherent
latency penalties to a minimum. Granted, the PCIe lane architecture is not as
fast as some gate-to-gate FPGA technologies that have been built. But
PCIe-based FPGA clusters can be expanded to include very large arrays where
cluster computing is used to satisfy the requirements of complex applications.

The SC3 runs under Linux and naturally exploits the 64-bit architecture of
that operating system. For algorithm development, any particular card may be
run under Windows (XP or Vista), which is likely to be available on any
laptop. Loading the FPGA image is managed entirely through the PCIe bus (in
either architecture). The choice of host operating system is, of course, not
binding on the operating system that runs on the FPGA. On the SC3 there is no
processor on the FPGA card, so an operating system on the card was not an
option. That said, the PE can be replaced with the FX part, which includes a
PPC 440 processor.

Engineering Challenges and Solutions

As shown in Figure 5, the Pico Computing SC3 SuperCluster has five EC7BP PCI
Express backplane cards installed with seven E-16 FPGA cards plugged into each
board. The PCI Express architecture is built on the same logical addressing
model developed years ago with the PCI bus and, as such, PCIe is subject to
similar limitations on the number of supported device segments. Given the
architecture of a typical PICMG 1.3 passive backplane and the BIOS
configuration of a system host board, the number of allowable PCI buses was
exceeded in the initial system design with the maximum number of FPGA cards
installed.

Figure 4
Each EC7BP is capable of supporting up to seven standard FPGA cards such as
the Pico Computing E-16 card, and each E-16 card features a Virtex-5 LX50 FPGA
with 46,080 Logic Cells.

Figure 5
The SC3 SuperCluster has five EC7BP PCI Express backplane cards installed with
seven E-16 FPGA cards plugged into each board.
[Photo callouts: MCXT System Host Board; PCIe-412 Backplane; EC7BP PCIe
Backplane Card; E-16 FPGA Card.]

To correct this condition, engineers at Trenton Technology made several
modifications to the MCXT board’s BIOS to allow up to 256 segments on the PCIe
bus. Each Pico Card requires approximately 3 1/7 segments when allowances are
made for all of the supporting devices contained in the system. With 256
segments managed by the BIOS, a maximum of 84 Pico Cards can be used
effectively in this SC3 SuperCluster.

Particular care was paid to cooling the SC3 SuperCluster. The standard fans
were replaced with higher-volume fans; however, it was not necessary to
deviate from the standard air-cooled chassis design. Each FPGA has a
temperature sensor built in, and the chip will shut down if it overheats. The
same is true of the Quad-Core Intel Xeon Processors used on the Trenton MCXT
system host board. Care must be taken to guard against FPGA overheating, since
the internal algorithms are open to the user design and it is possible to
drive the FPGA so hard that it will overheat. Design steps are taken to find
the point where the increased FPGA clock speed will cause the FPGA to “burn
up,” and then the clock speed is adjusted downward to avoid this condition.
While this additional step in the design process must be taken, it does
illustrate the inherent flexibility of the FPGA: its clock speed can be
fine-tuned to maximize the processing efficiency of a particular algorithm.

Future Development Directions

Having the FPGA card as a separate unit with its own power regulation, memory,
clocks and PCIe interface opens up the possibility of using different
processing elements. Among the possibilities under active development are an
FPGA with a PPC processor, a larger FPGA, and special coprocessors such as the
MathStar chip. Single instances, or clusters, of such components can be
integrated with no change to the overall system cluster architecture. The Pico
E-16 card is a 34 mm wide card. The same footprint will accommodate a card
with a 54 mm width, which will in turn permit a larger FPGA to be mounted on
the card, such as the Virtex-5 LX85 or possibly the LX110.

Both FPGAs and general-purpose processors have an impressive breadth of
available development tools and a long and established track record of
delivering superior system performance. General-purpose processors represent
50 years of maturity, and FPGAs have been used in advanced computing
applications for over 20 years. Tools, algorithms and applications for both of
these technologies have made tremendous strides. Needless to say, there are
high levels of engineering and development activity at work advancing the
capabilities of both of these product technologies.

An FPGA is an intrinsically parallel device and is well suited to the kind of
algorithm that can be divided into relatively watertight sub-processes. FPGA
architectures can be expanded, more or less at will, to incorporate many
diverse processing elements. This capability enables very flexible FPGA
solutions in a wide variety of computing applications. The SC3 SuperCluster is
a great example of how complete off-the-shelf FPGA and quad-core processor
technology can be merged together by virtue of PCI Express and the PICMG 1.3
architecture to provide the robust, high-performance computing platform needed
for military cryptography and applications like it.

Pico Computing
Seattle, WA.
(206) 283-2178.
[www.picocomputing.com].

Trenton Technology
Gainesville, GA.
(770) 287-3100.
[www.trentontechnology.com].
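
As a back-of-the-envelope check on the capacity figures quoted in this article, the short Python sketch below works through the card counts and total logic cells. The constant and function names are illustrative assumptions; the numbers themselves (seven cards per EC7BP, fourteen PCIe option slots, five EC7BP boards installed, the 84-card BIOS limit and 46,080 Logic Cells per E-16) are taken from the text. The sketch does not attempt to model the BIOS segment budget, which the article gives only approximately.

```python
# Figures quoted in the article.
CARDS_PER_EC7BP = 7           # E-16 cards supported by one EC7BP
PCIE_OPTION_SLOTS = 14        # PCIe option card slots on the PCIe-412
INSTALLED_EC7BP = 5           # EC7BP boards in the system shown in Figure 5
BIOS_LIMITED_CARDS = 84       # maximum cards after the BIOS modification
LOGIC_CELLS_PER_E16 = 46_080  # Virtex-5 LX50 Logic Cells per E-16 card


def capacity_summary() -> dict:
    """Return card counts and logic-cell totals derived from the constants."""
    installed_cards = INSTALLED_EC7BP * CARDS_PER_EC7BP
    slot_limited_cards = PCIE_OPTION_SLOTS * CARDS_PER_EC7BP
    # Number of fully populated EC7BP boards needed to reach the BIOS limit.
    ec7bp_at_bios_limit = BIOS_LIMITED_CARDS // CARDS_PER_EC7BP
    return {
        "installed_cards": installed_cards,
        "slot_limited_cards": slot_limited_cards,
        "ec7bp_at_bios_limit": ec7bp_at_bios_limit,
        "unused_slots_at_bios_limit": PCIE_OPTION_SLOTS - ec7bp_at_bios_limit,
        "installed_logic_cells": installed_cards * LOGIC_CELLS_PER_E16,
        "max_logic_cells": BIOS_LIMITED_CARDS * LOGIC_CELLS_PER_E16,
    }


if __name__ == "__main__":
    for name, value in capacity_summary().items():
        print(f"{name}: {value:,}")
```

Read this way, the 84-card ceiling corresponds to twelve fully populated EC7BP boards, so the bus-segment limit rather than the number of physical slots is what bounds the array size.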
