
QPACE: QCD parallel computing on the Cell

H. Baier,^1 M. Drochner,^2 N. Eicker,^3 G. Goldrian,^1 U. Fischer,^1 Z. Fodor,^4 D. Hierl,^5 S. Heybrock,^5 B. Krill,^1
T. Lippert,^3 T. Maurer,^5 N. Meyer,^5 A. Nobile,^{6,7} I. Ouda,^8 H. Penner,^1 D. Pleiter,^9 A. Schäfer,^5 H. Schick,^1
F. Schifano,^10 H. Simma,^{6,9} S. Solbrig,^5 T. Streuer,^9 K.-H. Sulanke,^9 R. Tripiccione,^10 T. Wettig,^5 and F. Winter^9
^1 IBM Development Laboratory, 71032 Böblingen, Germany
^2 ZEL, Research Center Jülich, 52425 Jülich, Germany
^3 Jülich Supercomputing Centre, 52425 Jülich, Germany
^4 Department of Physics, University of Wuppertal, 42119 Wuppertal, Germany
^5 Department of Physics, University of Regensburg, 93040 Regensburg, Germany
^6 Department of Physics, University of Milano-Bicocca, 20126 Milano, Italy
^7 European Centre for Theoretical Studies ECT* and INFN, Sezione di Trento, 13050 Villazzano, Italy
^8 IBM Systems & Technology Group, Rochester, MN 55901, USA
^9 Deutsches Elektronen-Synchrotron DESY, 15738 Zeuthen, Germany
^10 Dipartimento di Fisica, Università di Ferrara and INFN, Sezione di Ferrara, 43100 Ferrara, Italy
(Dated: February 18, 2007)
We give an overview of the QPACE project, which is pursuing the development of a massively
parallel, scalable supercomputer for applications in lattice quantum chromodynamics (QCD). The
machine will be a three-dimensional torus of identical processing nodes, based on the PowerXCell 8i
processor. These nodes will be tightly coupled by an FPGA-based, application-optimized network
processor attached to the PowerXCell 8i. We first present a performance analysis of lattice QCD
codes on QPACE and corresponding hardware benchmarks. We then describe the architecture of
QPACE in some detail. In particular, we discuss the challenges arising from the special multi-core
nature of the PowerXCell 8i and from the use of an FPGA for the network processor.

I. INTRODUCTION

The properties and interactions of quarks and gluons, which are the building blocks of particles
such as the proton and the neutron, can be studied in the framework of quantum chromodynamics
(QCD), which is very well established both experimentally and theoretically. In some physically
interesting dynamical regions, QCD can be studied perturbatively, i.e., event amplitudes and
probabilities can be worked out as a systematic expansion in the so-called strong coupling
constant $\alpha_s$. In other (often more interesting) regions, such an expansion is not
possible, and other approaches have to be found. The most systematic and most widely used
non-perturbative approach is based on lattice gauge theory (LGT), a discretized and
computer-friendly version of the theory proposed more than 30 years ago by Wilson [1]. In the
LGT framework, the problem can be reformulated as a problem in statistical mechanics, which can
be studied with Monte Carlo techniques. LGT Monte Carlo simulations have been performed by a
large community of physicists that have, over the years, developed very efficient simulation
algorithms and sophisticated analysis techniques. LGT has been pioneering the use of massively
parallel computers for large scientific applications since the early 1980s, using virtually all
available computing systems such as traditional mainframes, large PC clusters, and
high-performance systems including several dedicated architectures.

In the context of this paper, we are only interested in parallel architectures on which LGT
codes scale to a large number (thousands) of nodes. If several architectures fulfil this
requirement, a decision between them can be made on the basis of price-performance and
power-performance ratios. The most competitive systems with respect to both of these metrics are
custom-designed LGT architectures (QCDOC [2] and apeNEXT [3], which have been in operation since
2004/2005) and IBM's BlueGene [4] series of machines. These architectures are based on
system-on-chip (SoC) designs, but in modern chip technologies the development costs of a VLSI
chip are so high that an SoC approach is no longer sensible for an academic project. An
alternative approach is the use of a compute node based on a commercial multi-core processor
that is tightly coupled to a custom-designed network processor, whose implementation is
facilitated by the adoption of Field Programmable Gate Arrays (FPGA). This approach will be
presented here. Concretely, we are developing a massively parallel machine based on IBM's
PowerXCell 8i and a Xilinx Virtex-5 FPGA, in which the nodes are connected in a
three-dimensional torus with nearest-neighbor connections. The name of this project is QPACE
(QCD PArallel computing on the CEll). This approach is competitive with the BlueGene approach in
both price-performance and power-performance, at much lower development costs. There are some
unique challenges to our approach that will be discussed below.

QPACE is a collaboration of several academic institutions with the IBM development lab in
Böblingen, Germany.^1

^1 The funding proposal for QPACE has received a positive evaluation, and the final funding
decision will be made in May 2008 by the German Research Foundation (DFG).

The paper is structured as follows. In Sec. II we present an overview of the application (LGT)
for which the machine is to be optimized. In Sec. III we very briefly describe the PowerXCell 8i
processor. A detailed performance analysis of LGT codes on a parallel machine based on this
processor as well as micro-benchmarks of the relevant memory operations obtained on Cell
hardware are presented in Sec. IV. The QPACE architecture is presented in detail in Sec. V, and
the software environment is described in Sec. VI. We close with a summary and outlook in
Sec. VII.

II. LATTICE GAUGE THEORY COMPUTING

LGT simulations are among the grand challenges of scientific computing. Today, a group active in
high-end LGT simulations must have access to computing resources on the order of tens of
sustained TFlops-years. Computing resources used for LGT simulations have grown over the years
at a faster pace than predicted by Moore's law.

An accurate description of the theory and its algorithms is beyond the scope of the present
paper. We only stress that LGT simulations have an algorithmic structure exhibiting a large
degree of parallelism and several additional features that make it simple to extract a large
fraction of this parallelism. Continuous 4-d space-time is replaced by a discrete and finite
lattice (e.g., of linear size $L$, containing $N = L^4$ sites). All compute-intensive tasks
involve repeated execution of just one basic step, the product of the lattice Dirac operator and
a quark field $\psi$. A quark field $\psi_x^{ia}$ is defined at lattice site $x$ and carries
so-called color indices $a = 1, \ldots, 3$ and spinor indices $i = 1, \ldots, 4$. Thus, $\psi$
is a vector with $12N$ complex entries. The so-called hopping-term $D_h$ of the Dirac operator
acts on $\psi$ as follows,^2

  $\psi'_x = D_h \psi = \sum_{\mu=1}^{4} \left\{ U_{x,\mu} (1 + \gamma_\mu) \psi_{x+\hat\mu} + U^\dagger_{x-\hat\mu,\mu} (1 - \gamma_\mu) \psi_{x-\hat\mu} \right\}$ .   (1)

Here, $\mu$ labels the four space-time directions, and the $U_{x,\mu}$ are the SU(3) gauge
matrices associated with the links between nearest-neighbor lattice sites. The gauge matrices
are themselves dynamical degrees of freedom of the theory and carry color indices ($3 \times 3$
complex entries for each of the $4N$ different $U_{x,\mu}$). The $\gamma_\mu$ are the (constant)
Dirac matrices, carrying spinor indices. Today's state-of-the-art simulations use lattices with
a linear size of at least 32 sites.

^2 Equation (1) represents the so-called Wilson Dirac operator. Other discretization schemes
exist, but the implementation details are all very similar.

Roughly speaking, the ideal LGT simulation engine is a system able to keep in local storage the
degrees of freedom defined above and to repeatedly apply the Dirac operator with very high
efficiency. Explicit parallelism is straightforward. Inspection of (1) shows that $D_h$ has
non-zero entries only between nearest-neighbor lattice sites. This suggests an obvious parallel
structure for an LGT computer: a $d$-dimensional grid of processing nodes. Each node contains
local data storage and a compute engine. The physical lattice is mapped onto the processor grid
as regular tiles (if $d < 4$, then $4 - d$ dimensions of the physical lattice are fully mapped
onto each processing node).

In the conceptually simple case of a 4-d square grid of $p^4$ processors, each processing node
would store and process a 4-d subset of $(L/p)^4$ lattice sites. Each processing node handles
its sub-lattice, using local data, and accesses data corresponding to lattice sites just outside
the surface of the sub-lattice, stored on nearest-neighbor processors. Processing proceeds in
parallel, since there are no direct data dependencies for lattice sites that are not
nearest-neighbor, and the same sequence of operations must be applied to all processing nodes.

Parallel performance will be linear in the number of nodes, as long as the lattice can be evenly
partitioned on the processor grid and as long as the interconnection bandwidth is large enough
to sustain the performance on each node. The first constraint is met up to a very large number
of processors while the latter is less trivial as the required inter-node bandwidth increases
linearly with $p$. Accurate figures on bandwidth requirements will be provided below.

The discussion outlined above conceptually favors an implementation in which a desired total
processing power is sustained by the most powerful processor available, since this reduces
processor count and node-to-node bandwidth. Hence the obvious suggestion to consider the Cell
Broadband Engine (BE) as the target processor.

III. THE PowerXCell 8i PROCESSOR

The Cell BE is described in detail in Refs. [5, 6]. The processor contains one PowerPC Processor
Element (PPE) and 8 Synergistic Processor Elements (SPE). Each of the SPEs runs a single thread
and has its own 256 kB on-chip memory (local store, LS) which is accessible by direct memory
access (DMA) or by local load/store operations to/from 128 general-purpose 128-bit registers. An
SPE can execute two instructions per cycle, performing up to 8 single precision (SP) floating
point (FP) operations. Thus, the total SP peak performance of all 8 SPEs on a Cell BE is
204.8 GFlops at 3.2 GHz.

The PowerXCell 8i is an enhanced version of the Cell BE with the same SP performance and a peak
double precision (DP) performance of 102.4 GFlops with IEEE-compliant rounding. It has an
on-chip memory controller supporting a memory bandwidth of 25.6 GB/s and a configurable I/O
interface (Rambus FlexIO) supporting a coherent as well as a non-coherent protocol with a total
bidirectional bandwidth of 25.6 GB/s. Internally, all units of the processor are connected to
the coherent element interconnect bus (EIB) by DMA controllers. The power consumption of the
processor will be discussed below.
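The peak figures quoted in this section follow from simple arithmetic. As a minimal sketch in Python (the function and variable names are illustrative and not taken from the paper), the SP peak can be recomputed from the stated 8 SPEs, 8 SP operations per cycle and 3.2 GHz clock, and the number of DP operations per SPE and cycle can be inferred from the quoted 102.4 GFlops:

```python
# Peak-performance arithmetic for the Cell BE / PowerXCell 8i, using only
# numbers quoted in the text. All names are illustrative, not from the paper.

N_SPE = 8                # Synergistic Processor Elements per processor
CLOCK_GHZ = 3.2          # clock frequency in GHz
SP_FLOPS_PER_CYCLE = 8   # single-precision FP operations per SPE per cycle
DP_PEAK_GFLOPS = 102.4   # quoted double-precision peak of the PowerXCell 8i


def peak_gflops(n_spe, flops_per_cycle, clock_ghz):
    """Peak rate in GFlops: units x operations per cycle x clock in GHz."""
    return n_spe * flops_per_cycle * clock_ghz


sp_peak = peak_gflops(N_SPE, SP_FLOPS_PER_CYCLE, CLOCK_GHZ)

# DP operations per SPE per cycle implied by the quoted DP peak
dp_flops_per_cycle = DP_PEAK_GFLOPS / (N_SPE * CLOCK_GHZ)

print(f"SP peak: {sp_peak:.1f} GFlops")
print(f"DP operations per SPE per cycle: {dp_flops_per_cycle:.1f}")
```

Under these inputs the DP figure corresponds to 4 operations per SPE and cycle, which is the rate that turns the 1320 Flops per lattice site of Eq. (1) into a peak time of 330 cycles per site.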
IV. [...] PowerXCell 8i

A. Performance model

The Cell BE was developed for the PlayStation 3, but it is obviously very attractive for
scientific applications as well [7–9]. This is even more true for the PowerXCell 8i. We
therefore investigate the performance of this processor in a theoretical model along the lines
of Ref. [10]. The results of this section have been reported at the Lattice 2007 conference [9].

We consider two classes of hardware devices: (i) Storage devices (e.g., registers or LS) store
data and/or instructions and are characterized by their storage size. (ii) Processing devices
act on data (e.g., FP units, characterized by their performance) or transfer data/instructions
from one storage device to another (e.g., DMA controllers or buses, characterized by their
bandwidths $\beta_i$ and startup latencies $\lambda_i$).

An algorithm implemented on a specific architecture can be divided into micro-tasks performed by
the processing devices of our model. The execution time $T_i$ of each task $i$ is estimated by a
linear ansatz,

  $T_i \simeq I_i/\beta_i + O(\lambda_i)$ ,   (2)

where $I_i$ is the size of the processed data. In the following, we assume that all tasks can be
run concurrently at maximal throughput and that all dependencies and latencies can be hidden by
suitable scheduling. The total execution time is then

  $T_{\rm exe} \simeq \max_i T_i$ .   (3)

If $T_{\rm peak}$ is the minimal compute time for the FP operations of an application that can
be achieved with an "ideal implementation", the floating point efficiency $\varepsilon_{\rm FP}$
is defined as $\varepsilon_{\rm FP} = T_{\rm peak}/T_{\rm exe}$.

Fig. 1 shows the data-flow paths and associated execution times $T_i$ that enter our analysis,
in particular:

 • floating-point operations, $T_{\rm FP}$
 • load/store operations between register file (RF) and LS, $T_{\rm RF}$
 • off-chip memory access, $T_{\rm mem}$
 • internal communications between SPEs on the same processor, $T_{\rm int}$
 • external communications between different processors, $T_{\rm ext}$
 • transfers via the EIB (memory access, internal and external communications), $T_{\rm EIB}$

FIG. 1: The data-flow paths and their execution times $T_i$ for a single SPE.

We consider a communication network with the topology of a 3-d torus in which each of the 6
links between a given site and its nearest neighbors has a bidirectional bandwidth of 1 GB/s so
that a bidirectional bandwidth of $\beta_{\rm ext} = 6$ GB/s is available between each Cell BE
and the network.^3 All other hardware parameters $\beta_i$ are taken from the Cell BE
manuals [6].

^3 Note that the dimensionality of the torus network as well as the bandwidth figures are
strongly constrained by the technological capabilities of present-day FPGAs, i.e., number of
high-speed serial transceivers and total pin count. See also Sec. V.

Our performance model was applied to various linear algebra kernels and tested successfully by
benchmarks on several hardware systems (see [9] for details). In the following, we report on
results relevant for the main LGT kernels. These results have influenced the design of the QPACE
architecture.

B. Lattice QCD kernel

The computation of Eq. (1) on a single lattice site amounts to 1320 Flops (not counting sign
flips and complex conjugation) and thus yields $T_{\rm peak} = 330$ cycles per site (in DP).
However, the implementation of Eq. (1) requires at least 840 multiply-add operations and
$T_{\rm FP} \ge 420$ cycles per lattice site to execute. Thus, any implementation of Eq. (1) on
an SPE cannot perform better than 78% of peak.

The lattice data layout greatly influences the time spent on load/store operations and on remote
communications for the operands ($9 \times 12 + 8 \times 9$ complex numbers) of the hopping
term (1). We assign to each Cell BE a local lattice with
$V_{\rm Cell} = L_1 \times L_2 \times L_3 \times L_4$ sites and arrange the 8 SPEs logically as
$s_1 \times s_2 \times s_3 \times s_4 = 8$. A single SPE thus holds a subvolume of
$V_{\rm SPE} = (L_1/s_1) \times$
$(L_2/s_2) \times (L_3/s_3) \times (L_4/s_4) = V_{\rm Cell}/8$ sites. Each SPE on average has
$A_{\rm int}$ neighboring sites on other SPEs within and $A_{\rm ext}$ neighboring sites outside
a Cell BE. In the following we investigate two different strategies for the data layout: Either
all data are kept in the on-chip local store of the SPEs, or the data reside in off-chip main
memory.

Data in on-chip memory (LS)

In this case we require that all data for a compute task can be held in the local store of the
SPEs. In addition, the local store must hold a minimal program kernel, the run-time environment,
and intermediate results. Therefore, the storage requirements strongly constrain the local
lattice volumes $V_{\rm SPE}$ and $V_{\rm Cell}$.

A spinor field $\psi_x$ needs 24 real words (192 Bytes in DP) per site, while a gauge field
$U_{x,\mu}$ needs 18 words (144 Bytes) per link. If for a solver we need storage for 8 spinors
and $3 \times 4$ links per site, the subvolume carried by a single SPE is restricted to about
$V_{\rm SPE} = 79$ sites. In a 3-d network, the fourth lattice dimension must be distributed
locally within the same Cell BE across the SPEs (logically arranged as a $1^3 \times 8$ grid).
$L_4$ is then a global lattice extension and may be as large as $L_4 = 64$. This yields a very
asymmetric local lattice with $V_{\rm Cell} = 2^3 \times 64$ and
$V_{\rm SPE} = 2^3 \times 8$.^4

^4 When distributed over 4096 nodes, this gives a global lattice size of $32^3 \times 64$.

Data in off-chip main memory (MM)

When all data are stored in main memory, there are no a-priori restrictions on $V_{\rm Cell}$.
However, we have to avoid redundant loads of the operands of Eq. (1) from main memory into local
store when sweeping through the lattice. To also allow for concurrent computation and data
transfers (to/from main memory or remote SPEs), we consider a multiple buffering scheme.^5 A
possible implementation of such a scheme is to compute the hopping term (1) on a 3-d slice of
the local lattice and then move the slice along the 4-direction. Each SPE stores all sites along
the 4-direction, and the SPEs are logically arranged as a $2^3 \times 1$ grid to minimize
internal communications between SPEs and to balance external ones. To have all operands in
Eq. (1) available in the local store, we must be able to keep the $U$- and $\psi$-fields
associated with all sites of three 3-d slices in the LS at the same time. This optimization
requirement again constrains the local lattice size, now to
$V_{\rm Cell} \approx 800 \times L_4$ sites.

^5 In multiple buffering schemes several buffers are used in an alternating fashion to either
process or load/store data. This allows for concurrent computation and data transfer at the
price of additional storage (here in the LS).

  data in on-chip LS            data in off-chip MM
  V_Cell     2^3 × 64           L_1 × L_2 × L_3    8^3     4^3     2^3
  A_int      16                 A_int/L_4          48      12      3
  A_ext      192                A_ext/L_4          48      12      3
  T_peak     21                 T_peak/L_4         21      2.6     0.33
  T_FP       27                 T_FP/L_4           27      3.4     0.42
  T_RF       12                 T_RF/L_4           12      1.5     0.19
  T_mem      -                  T_mem/L_4          61      7.7     0.96
  T_int      2                  T_int/L_4          5       1.2     0.29
  T_ext      79                 T_ext/L_4          20      4.9     1.23
  T_EIB      20                 T_EIB/L_4          40      6.1     1.06
  ε_FP       27%                ε_FP               34%     34%     27%

TABLE I: Theoretical time estimates $T_i$ (in 1000 SPE cycles) for some micro-tasks arising in
the computation of Eq. (1) for the LS case (left) and the MM case (right). $A_{\rm int}$ and
$A_{\rm ext}$ are the numbers of neighboring sites per SPE. All other symbols are defined in
Secs. IV A and IV B.

In Table I we display the predicted execution times for some of the micro-tasks considered in
our model for both data layouts and reasonable choices of the local lattice size. In the LS
case, the theoretical efficiency of about 27% is limited by the communication bandwidth
($T_{\rm exe} \approx T_{\rm ext}$). This is also the limiting factor for the smallest local
lattice in the MM case, while for larger local lattices the memory bandwidth is the limiting
factor ($T_{\rm exe} \approx T_{\rm mem}$).

We have not yet benchmarked a representative QCD kernel such as Eq. (1) since in all relevant
cases $T_{\rm FP}$ is far from being the limiting factor. Rather, we have performed hardware
benchmarks with the same memory access pattern as (1), using the above-mentioned multiple
buffering scheme for the MM case. We found that the execution times were at most 20% higher than
the theoretical predictions for $T_{\rm mem}$. (The freely available full-system simulator is
not useful in this respect since it does not model memory transfers accurately.)

C. LS-to-LS DMA transfers

Since DMA transfer speeds determine $T_{\rm mem}$, $T_{\rm int}$, and $T_{\rm ext}$, their
optimization is crucial to exploit the Cell BE performance. Our analysis of detailed
micro-benchmarks for LS-to-LS transfers shows that the linear model (2) does not accurately
describe the execution time of DMA operations with arbitrary size $I$ and arbitrary address
alignment. We refined our model to take into account the fragmentation of data transfers, as
well as the source and destination addresses, $A_s$ and $A_d$, of the buffers:

  $T_{\rm DMA}(I, A_s, A_d) = \lambda_0 + \lambda_a \cdot N_a(I, A_s, A_d) + N_b(I, A_s) \cdot \frac{128\ {\rm Bytes}}{\beta}$ .   (4)
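As a rough illustration of how Eq. (4) can be evaluated, the following Python sketch uses the fitted latencies reported in this section for the QS20 benchmarks ($\lambda_0 \approx 200$ cycles, $\lambda_a \approx 16$ cycles). Two ingredients are assumptions rather than statements from the paper: the counting rules used for $N_a$ and $N_b$ (every 128-Byte local store line touched by the source buffer counts as one block, and every block pays the extra latency when the alignments differ), and the peak bandwidth of 8 bytes per cycle. All function names are hypothetical.

```python
# Sketch of the refined DMA timing model, Eq. (4).
# ASSUMPTIONS (not from the paper): the counting rules for N_a and N_b below
# and the peak bandwidth BETA; the latencies are the fitted values quoted
# in the text.
import math

LINE = 128       # bytes per local store line (one EIB transaction)
LAMBDA_0 = 200   # zero-size transfer latency in cycles
LAMBDA_A = 16    # extra latency per block for mismatched alignment, in cycles
BETA = 8.0       # ASSUMED peak bandwidth in bytes per cycle


def n_blocks(size, a_src):
    """N_b: number of 128-byte lines touched by the source buffer."""
    if size == 0:
        return 0
    return math.ceil((a_src % LINE + size) / LINE)


def n_misaligned(size, a_src, a_dst):
    """N_a: blocks paying the extra latency; zero if the alignments match."""
    if (a_src - a_dst) % LINE == 0:
        return 0
    return n_blocks(size, a_src)


def t_dma(size, a_src, a_dst):
    """Eq. (4): estimated LS-to-LS transfer time in cycles."""
    n_b = n_blocks(size, a_src)
    n_a = n_misaligned(size, a_src, a_dst)
    return LAMBDA_0 + LAMBDA_A * n_a + n_b * LINE / BETA


def t_linear(size):
    """Eq. (2) with the same latency and bandwidth, for comparison."""
    return LAMBDA_0 + size / BETA


for size in (128, 1024, 2048):
    print(size, t_linear(size), t_dma(size, 0, 0), t_dma(size, 32, 16))
```

With these assumed values, a 2048-byte transfer takes 456 cycles when source and destination are aligned and 744 cycles for $A_s = 32$, $A_d = 16$, so the effective bandwidth $I/(T_{\rm DMA} - \lambda_0)$ drops from 8 to about 3.8 bytes per cycle.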
Our hardware benchmarks, fitted to Eq. (4), indicate that each LS-to-LS DMA transfer has a
(zero-size transfer) latency of $\lambda_0 \approx 200$ cycles. The DMA controllers fragment all
transfers into $N_b$ 128-Byte blocks aligned at local store lines (and corresponding to single
EIB transactions). When $\delta A = A_s - A_d$ is a multiple of 128, the source local store
lines can be directly mapped onto the destination lines. Then, we have $N_a = 0$, and the
effective bandwidth $\beta_{\rm eff} = I/(T_{\rm DMA} - \lambda_0)$ is approximately the peak
value. Otherwise, if the alignments do not match ($\delta A$ not a multiple of 128), an
additional latency of $\lambda_a \approx 16$ cycles is introduced for each transferred 128-Byte
block, reducing $\beta_{\rm eff}$ by about a factor of two. Fig. 2 illustrates how clearly these
effects are observed in our benchmarks and how accurately they are described by Eq. (4).

[Figure 2: two panels showing T [cycles] versus I [bytes] for I from 0 to 2048, with curves for
the linear model, the refined model, and the QS20 benchmarks. Top panel: A_s = A_d = 0 (mod 128).
Bottom panel: A_s = 32, A_d = 16 (mod 128).]

FIG. 2: Execution time of LS-to-LS DMA transfers as a function of the transfer size with aligned
(top) and misaligned (bottom) source and destination addresses. The data points show the
measured values on an IBM QS20 system. The dashed and solid lines correspond to the theoretical
predictions of Eq. (2) and Eq. (4), respectively.

V. THE QPACE ARCHITECTURE

Our performance model and hardware benchmarks indicate that the PowerXCell 8i processor is a
promising option for lattice QCD (LQCD). We expect that a sustained performance above 20% can be
obtained on large machines. Parallel systems with O(2000) PowerXCell 8i processors add up to
approximately 200 TFlops (DP peak) corresponding to about 50 TFlops sustained for typical LQCD
applications. As discussed above, a simple nearest-neighbor d-dimensional interconnection among
these processors is all we need to support the data exchange patterns associated with our
algorithms. This simple structure helps to make the design and construction of the system fast
and cost-effective, making it a competitive option for a QCD-oriented number cruncher to be
deployed in 2009.

In a collaboration of academic and industrial partners we embarked on the QPACE project to
design, implement, and deploy a next generation of massively parallel and scalable computer
architectures optimized for LQCD. While the primary goal of this project is to make an
additional vast amount of computing power available for LQCD research, the project is also
driven by technical [...]

 • Unlike similar previous projects aiming at massively parallel architectures optimized for
   LQCD applications, QPACE uses commodity processors, interconnected by a custom network.
 • For the implementation of the network we leverage the potential of Field Programmable Gate
   Arrays.
 • The QPACE design aims at an unprecedentedly small ratio of power consumption versus floating
   point performance.

The building blocks of the QPACE architecture are the node cards. These processing nodes, which
run independently of each other, include as main components one PowerXCell 8i processor, which
provides the computing power, and a network processor (NWP), which implements a dedicated
interface to connect the processor to a 3-d high-speed torus network used for communications
between the nodes and to an Ethernet network for I/O. Additional logic needed to boot and
control the machine is kept to the bare minimum. The node card furthermore contains 4 GBytes of
private memory, sufficient for all the data structures (including auxiliary variables) of
present-day local lattice sizes.

The NWP is implemented using an FPGA (Xilinx Virtex-5 LX110T). The advantage of using FPGAs is
that they allow us to develop and test logic within a reasonably short amount of time and that
development costs can be kept low. However, the devices themselves tend to be expensive.^6

^6 For us this issue is less severe since Xilinx is supporting QPACE by providing the FPGAs at a
substantial discount.

The main task of the NWP is to route data between the Cell processor, the torus network links,
and the Ethernet I/O interface. The bandwidth for a torus network link is on the order of
1 GByte/s in each of the two directions. The interface between NWP
and Cell processor has a bandwidth of 6 GBytes/s,^7 in balance with the overall bandwidth of the
6 torus links attached to each NWP.

^7 Existing southbridges do not provide this bandwidth to the Cell, and thus a commodity network
solution is ruled out.

Unlike in other Cell-based parallel machines, in QPACE node-to-node communications will proceed
from the local store (LS) of an SPE on a processor to the LS of a nearest-neighbor processor.
For communication the data do not have to be moved through the main memory (recall that the
bandwidth of the interface to main memory is particularly performance critical). They will
rather be routed from an LS via the EIB directly to the I/O interface of the Cell processor. The
PPE is not needed to control these communications. The latency for LS-to-LS copy operations is
expected to be on the order of 1 µs.

To start a communication the sending device (e.g., an SPE) has to initiate a DMA transfer of the
data from its local store to a buffer attached to any of the link modules. Once the data arrives
in such a buffer the NWP will take care of moving the data across the network without
intervention of the processor. On the receiving side the receiving device has to post a receive
request to trigger the DMA transfer of the data from the receive buffer in the NWP to the final
destination.

The physical layer of the torus network links relies on commercial standards for which
well-tested and cheap communication hardware is available. This allows us to move the most
timing-critical logic out of the FPGA. Specifically, we are using the 10 Gbit/s transceiver PMC
Sierra PM8358 (in XAUI mode), which provides redundant link interfaces that can be used to
select among several topologies of the torus network (see below).

A much bigger challenge is the implementation of the link to the I/O interface of the Cell
processor, a Rambus FlexIO processor bus interface. By making use of special features of the
RocketIO transceivers available in Xilinx Virtex-5 FPGAs it has actually been possible to
connect an NWP prototype and a Cell processor at a speed of 3 GBytes/s per link and direction.
(At the time of this writing only a single 8-bit link has been tested, but in the final design
two links will be used.)

The node cards are attached to a backplane through which all network signals are routed. One
backplane hosts 32 node cards plus 2 root cards. Each root card controls 16 node cards via a
microprocessor which can be accessed via Ethernet.

One QPACE cabinet has room for 8 backplanes, i.e., 256 node cards. Each cabinet therefore has a
peak double precision performance of about 25 TFlops.

On the backplane, subsets of nodes are interconnected in one dimension in a ring topology. By
selecting the primary or redundant XAUI links it will be possible to select a ring size of 2, 4
or 8 nodes. By the same mechanism the number of nodes is configurable in the second dimension,
in which the nodes are connected by a combination of backplane connections and cables. In the
third dimension cables are used. A large system of $N$ QPACE cabinets could be operated as a
single partition with $2N \times 16 \times 8$ nodes.

Input and output operations are implemented by a Gigabit Ethernet tree network. Each node card
is an end-point of this tree and connected to one of six cabinet-level switches, each of which
has one or more 10-Gigabit Ethernet uplinks depending on bandwidth requirements. When QPACE is
deployed we expect that lattices of size $48^3 \times 96$ will be typical. A gauge field
configuration for such a lattice size is about 6 GBytes, so the available I/O bandwidth should
allow us to read or write the database in O(10) seconds.

The power consumption of a single node card is expected to be less than 150 Watts. A single
QPACE cabinet would therefore consume about 35 kWatts. This translates into a power efficiency
of about 1.5 Watts/GFlops. A liquid cooling system is being developed in order to reach the
planned packaging density.

VI. QPACE SOFTWARE

The QPACE nodes will be operated using Linux, which runs on the PPE. As on most other processor
platforms, the operating system cannot directly be started after system start. Instead the
hardware is first initialized using a host firmware. The QPACE firmware will be based on
Slimline Open Firmware [11]. Start-up of the system is controlled by the microprocessor on the
root card.

Efficient implementation of applications on the Cell processor is more difficult than on more
standard processors. For optimizing large application codes on a Cell processor the programmer
has to face a number of challenges. For instance, the data layout has to be chosen carefully to
maximize utilization of the memory interface. Optimal use of the on-chip memory to minimize
external memory accesses is mandatory. The overall performance of the program will furthermore
depend on the fraction of the code that is parallelized on-chip.

To relieve the programmer from the burden of porting efforts we apply two strategies. In typical
LQCD applications almost all cycles are spent in just a few kernel routines such as Eq. (1). For
a number of such kernels we will therefore provide highly optimized assembly implementations,
possibly with the aid of an assembly generator. Secondly, we will leverage the work of the USQCD
collaboration [12]. This collaboration has performed pioneering work in defining and
implementing software layers with the goal of hiding hardware details. In such a framework LQCD
applications can be built in a portable way on top of these software layers. For QPACE our goal
is to implement the QMP (QCD Message Passing) and at least parts of the QDP (QCD Data-Parallel)
application programming interfaces.

The QMP interface comprises all communication operations required for LQCD applications. It
relies on the fact that in LQCD applications communication patterns are typically regular and
repetitive with data being sent between adjacent nodes in a torus grid.^8 QMP would be
implemented on top of a small number of low-level communication primitives which, e.g., trigger
the transmission of data via one particular link, initiate the receive operation and allow
waiting for completion. The QDP interface includes operations on distributed data objects.

^8 An API as general as MPI is therefore not required.

VII. CONCLUSION AND OUTLOOK

We have presented an overview of the architecture of a novel massively parallel computer that is
optimized for applications in lattice QCD. This machine is based on the PowerXCell 8i processor,
an enhanced version of the Cell Broadband Engine which has recently become available and
provides support for efficient double precision calculations.

Based on a detailed analysis of the requirements of lattice QCD applications we developed a set
of performance models. These have allowed us to investigate the expected performance of these
applications on the target processor. A relevant subset of these models has been tested on real
hardware using the standard Cell Broadband Engine processor. Our conclusion is that for the most
relevant application kernels a sustained performance of more than 20% can be achieved.

While the PowerXCell 8i processor turned out to be a suitable device for our target application,
a custom network is needed to interconnect a larger number of processors. In QPACE we will use
FPGAs to implement a dedicated interface to 6 high-speed network links which connect each
processor to its nearest neighbors in a 3-d mesh. While this approach turned out to be very
promising in the case of the QPACE architecture, we would like to add some cautious remarks. For
almost all relevant (multi-core) processors the I/O interface to which an external network
processor can be attached is highly non-trivial. The implementation of such an interface on an
FPGA is likely to be a significant technical challenge. With respect to future developments it
also remains to be seen whether FPGAs will be able to cope with increasing bandwidth
requirements.

Another advantage of using the PowerXCell 8i processor is its low ratio of power consumption and
floating point performance. For QPACE we expect a figure as low as 1.5 Watts/GFlops.

The ambitious goal of the QPACE project is to complete hardware development by the end of 2008
and to start manufacturing and deployment of larger systems beginning in 2009. By the middle of
2009 the machines should be fully available for research in lattice QCD.
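The system-level figures quoted in the text (about 25 TFlops and about 35 kWatts per cabinet, about 1.5 Watts/GFlops) can be cross-checked against the node-level numbers. The Python sketch below is only such a consistency check. It assumes that the 150 Watts upper bound per node card and the 102.4 GFlops DP peak are the only contributions, ignoring root cards, switches and cooling; the variable names are illustrative.

```python
# Consistency check of the cabinet-level figures, using node-level numbers
# quoted in the text. ASSUMPTION: only node cards contribute; root cards,
# switches and cooling are ignored. Names are illustrative.

NODE_DP_PEAK_GFLOPS = 102.4   # PowerXCell 8i double-precision peak
NODE_POWER_W = 150.0          # quoted upper bound per node card
NODES_PER_BACKPLANE = 32
BACKPLANES_PER_CABINET = 8

nodes_per_cabinet = NODES_PER_BACKPLANE * BACKPLANES_PER_CABINET
cabinet_peak_tflops = nodes_per_cabinet * NODE_DP_PEAK_GFLOPS / 1000.0
cabinet_power_kw = nodes_per_cabinet * NODE_POWER_W / 1000.0
watts_per_gflops = NODE_POWER_W / NODE_DP_PEAK_GFLOPS

print(f"nodes per cabinet:  {nodes_per_cabinet}")
print(f"cabinet DP peak:    {cabinet_peak_tflops:.1f} TFlops")
print(f"cabinet power:      {cabinet_power_kw:.1f} kW (upper bound)")
print(f"power efficiency:   {watts_per_gflops:.2f} Watts/GFlops")
```

With these inputs the cabinet peak comes out at about 26 TFlops and the ratio at about 1.46 Watts/GFlops, consistent with the rounded values in the text. The 38.4 kW cabinet power is an upper bound, since 150 Watts is itself quoted as an upper bound per node card.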

[1] K.G. Wilson, Confinement of quarks, Phys. Rev. D 10 (1974) 2445
[2] P.A. Boyle et al., Overview of the QCDSP and QCDOC computers, IBM J. Res. & Dev. (2005) 351
[3] F. Belletti et al., Computing for LQCD: apeNEXT, Computing in Science & Engineering (2006) 18
[4] A. Gara et al., Overview of the Blue Gene/L system architecture, IBM J. Res. & Dev. (2005) 195
[5] H.P. Hofstee et al., Cell Broadband Engine technology and systems, IBM J. Res. & Dev. (2007) 501
[6] [...]com/developerworks/power/cell
[7] S. Williams et al., The Potential of the Cell Processor for Scientific Computing, Proc. of the 3rd conference on Computing frontiers (2006) 9
[8] A. Nakamura, Development of QCD-code on a Cell machine, PoS (LAT2007) 040
[9] F. Belletti et al., QCD on the Cell Broadband Engine, PoS (LAT2007) 039
[10] G. Bilardi et al., The Potential of On-Chip Multiprocessing for QCD Machines, Springer Lecture Notes in Computer Science 3769 (2005) 386
[11]
[12] http://www.usq