PERSPECTIVES IN COMPUTATIONAL SCIENCE

HIGH-PERFORMANCE COMPUTING: CLUSTERS, CONSTELLATIONS, MPPS, AND FUTURE DIRECTIONS
By Jack Dongarra, Thomas Sterling, Horst Simon, and Erich Strohmaier
The responsibilities and services that necessitate building such facilities will continue to be critical, especially to the high-end computing and large data archive communities. Already we see in the US Department of Energy and US National Science Foundation sectors the development of new and larger computing centers to house the next generation of high-end systems, including very large Beowulf clusters. The computer centers of the future will be charged with the administration, management, and training associated with bringing these major resources to bear on mission-critical applications.

This article, while congratulating Bell and Gray on opening up this line of discourse, offers a constructive expansion on their original themes and seeks to correct specific areas of their premise with which we take exception. The long-term future of HPC architectures will involve innovative structures that support new paradigms of execution models, which in turn will greatly enhance efficiency in terms of performance, cost, space, and power while enabling scalability to tens or hundreds of petaflops. The conceptual framework offered here implies the directions of such developments and resonates with recent advances being pursued by the computer architecture research community.

Commodity Clusters

Bell and Gray, in conjunction with their distinguished colleagues, see an important unifying principle emerging in HPC's evolution: the integration of highly replicated components (many of which were designed and fabricated for more general markets) as the driving force for a convergent architecture. They call this architecture a cluster and distinguish it only from the minority set of vector supercomputers (such as the NEC SX-6 and Cray X1) that exploit vectors in custom processor architecture designs. This convergent-architecture model of the evolution of supercomputer design is compelling, readily apparent, and wrong. We respectfully assert an alternate perspective that is rich in detail and has value as an enabling framework for reasoning about computing structures and methods. In particular, we assert that the term "cluster" is best employed not as a synonym for essentially the universal set of parallel computer system organizations, but rather as a specific class of such systems. Therefore we state that NOT everything is a cluster.

We limit the scope of the definition of a cluster to a parallel computer system comprising an integrated collection of independent nodes, each of which is a system in its own right, capable of independent operation and derived from products developed and marketed for other stand-alone purposes. A commodity cluster is a cluster in which both the network and the compute nodes are commercial products available for procurement and independent application by organizations (end users or separate vendors) other than the original equipment manufacturer. Beowulf-class clusters and workstation clusters were once two distinct system types, but with the blurring or outright elimination of any meaningful differences in capability between PCs and workstations, the differentiation between these two types of clusters has also largely lost any meaning. This is particularly true with the wide usage of Linux as the base node operating system, a strategy originally pioneered by Beowulf-class clusters.

The Top500 list represents two broad classes of clusters: cluster-NOW and constellation systems. Both are commodity cluster systems distinguished by the dominant level of parallelism. Although more complex system structures are possible (such as superclusters), commodity clusters usually comprise two levels of parallelism. The first is the number of nodes connected by the global communications network, in which a node contains all the cluster's processor and memory resources. The second is the number of processors in each node, usually configured as a symmetric multiprocessor (SMP). If a commodity cluster has more nodes than microprocessors in any one of its nodes, the dominant mode of parallelism is at the first level (the cluster-NOW category). If a node has more microprocessors than there are nodes in the commodity cluster, the dominant mode of parallelism is at the second level (a constellation). This distinction is not arbitrary: it can have a serious impact on cluster programming. A cluster-NOW system, for example, is programmed almost exclusively with the message-passing interface (MPI), whereas a constellation is likely to be programmed at …
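The node-versus-processor rule distinguishing the two classes is mechanical enough to state as code. The short sketch below is a hypothetical helper of ours, not part of the Top500 methodology; it simply applies the comparison defined above (the equal-count case, which the rule doesn't decide, is labeled explicitly).

def dominant_parallelism(num_nodes: int, procs_per_node: int) -> str:
    """Classify a commodity cluster by its dominant level of parallelism:
    first-level (inter-node) parallelism dominates in a cluster-NOW;
    second-level (intra-node SMP) parallelism dominates in a constellation."""
    if num_nodes > procs_per_node:
        return "cluster-NOW"    # more nodes than processors in any one node
    if procs_per_node > num_nodes:
        return "constellation"  # more processors per node than nodes
    return "borderline"         # equal counts: the rule above doesn't decide

# Hypothetical machine shapes:
print(dominant_parallelism(num_nodes=256, procs_per_node=2))   # cluster-NOW
print(dominant_parallelism(num_nodes=16, procs_per_node=64))   # constellation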
Is There Light at the End of the Tunnel?
By Dan Reed, Renaissance Computing Institute (RENCI)

… fluorocarbon or CO2 emissions) is only possible if the time required for each simulation is small. Hence, the supercomputing challenge has always been to deliver the highest possible performance to applications, … (continued below)
… differ dramatically in their hardware support for system-wide parallel execution. The amount of overhead in the critical time path for controlling parallel actions thus largely depends on the hardware control mechanisms (or lack thereof).

Latency management. The wait for remote access and service strongly factors into a supercomputer's efficiency and scaling. It includes the long distances that messages have to travel, the delays when contending for shared resources such as network bandwidth, and the service times for actions such as assembling and interpreting messages. Latency management includes pipelining vectors, multithreading, avoidance via explicit locality management, caching, and message-driven computing. How a system manages latency is an important distinguishing characteristic.
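Latency hiding is easiest to see in message-passing code. The sketch below is a minimal example of ours, not the article's, assuming the mpi4py MPI binding and NumPy are installed and two ranks are launched under mpiexec: it posts nonblocking sends and receives, computes while the messages are in flight, and blocks only at the final wait.

# A minimal latency-hiding sketch (ours, not the article's).
# Assumes mpi4py and NumPy; run with: mpiexec -n 2 python overlap.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                          # two ranks exchange buffers

send_buf = np.full(1_000_000, rank, dtype=np.float64)
recv_buf = np.empty(1_000_000, dtype=np.float64)

# Post the communication first (split-phase, nonblocking)...
reqs = [comm.Isend(send_buf, dest=peer, tag=0),
        comm.Irecv(recv_buf, source=peer, tag=0)]

# ...then do useful local work while the messages are in flight.
local = np.sin(send_buf).sum()           # stand-in for real computation

# Block only at the point where the remote data is actually needed.
MPI.Request.Waitall(reqs)
print(f"rank {rank}: local={local:.2f}, received mean={recv_buf.mean():.1f}")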
Namespace distribution. From an abstract viewpoint, a system comprises a collection of namespaces and actions that can be performed on named entities. Shared-memory versus distributed-memory systems are one such division. The names of I/O ports or channels (including whether they're local to part of the system or globally accessible) are another. The process ID (again, local or global) is also a discriminator, as is whether processor nodes constitute an explicit namespace or are simply a pool of anonymous resources. Logical namespaces can thus characterize a system, at least in part.

Reliance on commodity. In recent years, the economics of system implementation has dominated development. Some people consider the degree to which the system architecture exploits commodity components, subsystems, or systems as building blocks for very large structures to be an essential attribute in determining a system concept's likely success or failure. Beowulf clusters exclusively comprise commodity components, systems, and networks; the SX-6 employs custom vector processor architectures, motherboards, and networks, but uses commodity memory chips. Commodity components might be cheaper because of their economy of scale, but they could lack many functional attributes essential for efficient scalable supercomputing.

A Framework for Characterizing Parallel Architectures

We suggest a naming schema that delineates parallel computing systems according to key dimensions of system attributes rather than the random terms with which we're all familiar. Our rationale is to demonstrate that alternative naming methods are possible, rather than to impose a specific new method on the community. For example, as Bell and Gray make clear, the term MPP, although used pervasively throughout the history of the Top500 list, is confusing, misleading, and provides little specification of system type. The notion has been abused, confused, and, ironically, derived from a different type of system than that to which it is ordinarily applied. (The original MPP was a SIMD-class computer, which is a separate classification on the Top500 list.)

Every strategy, including ours, reflects the critical sensitivities of its time. Here, we emphasize four dominant dimensions for characterizing parallel computing systems:

• clustering,
• namespace,
• parallelism, and
• latency and locality management.

Any system can be clustered with like systems to yield a larger ensemble system, but doing so doesn't mean that all the attributes of the constituent uniform systems are conveyed unmodified to the aggregate system. The important factor here is a synthesis of existing stand-alone subsystems developed for a different, presumably larger, market and user workload. The alternative, a monolithic system, is not a product of clustering, but a structure of highly replicated basic components. Other appropriate designators, perhaps those that reflect a specific hierarchy, could exist—how would we represent a supercluster (a cluster of clusters), for example?

Namespace indicates how far a single namespace is shared across a system. Although many possible namespaces exist (such as variables, process IDs, I/O ports, and so on), user variables illustrate the concept here. A distributed namespace is one in which one node's variables aren't directly visible to another, whereas a shared namespace is one in which all variables in all nodes are visible to all other nodes. Cache coherence adds to the shared-namespace attribute by providing hardware support for managing copies. This illustrates that we can combine multiple attributes (lexically concatenated) to provide a complex descriptor.
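The variable-visibility distinction shows up even on a single machine. In the toy Python sketch below (our illustration, not the authors'), threads stand in for a shared namespace and a child process for a distributed one: the same update is visible in the first case and invisible in the second.

# Toy illustration (ours): shared vs. distributed namespaces on one machine.
import threading
import multiprocessing as mp

counter = 0                      # one named variable in this program

def bump():
    global counter
    counter += 1

if __name__ == "__main__":
    # Shared namespace: the thread updates the very variable the parent reads.
    t = threading.Thread(target=bump)
    t.start(); t.join()
    print("after thread: ", counter)    # 1: the update is visible

    # Distributed namespace: the child process updates only its own copy.
    p = mp.Process(target=bump)
    p.start(); p.join()
    print("after process:", counter)    # still 1: the update stayed remote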
Is There Light at the End of the Tunnel? (continued)

… complex applications requires a judicious match of computer architecture, system software, and software development tools. Most researchers in high-end computing believe the key reasons for our current difficulties in achieving high performance on complex scientific applications can be traced to inadequate research investment in software and to the use of processor and memory architectures that aren't well matched to scientific applications.

Today, scientific applications are developed with crude software tools (compared to those used in the commercial sector). Low-level programming, based on message-passing libraries, means that application developers must provide deep knowledge of application software behavior and its interaction with the underlying computing hardware. This is a tremendous intellectual burden that, unless rectified, will continue to limit the usability of high-end computing systems, restricting effective access to a small cadre of researchers. We need only look at the development history of Microsoft Windows to recognize the importance of an iterated cycle of development, deployment, and feedback in developing an effective, widely used product. High-quality research software isn't cheap: it is labor intensive, and its successful creation requires the opportunity to incorporate the lessons learned from previous versions.

Hence, we must begin a coordinated research and development effort to create high-end systems that are better matched to the characteristics of scientific applications. This will require a broad program of basic research into computer architectures, system software, programming models, software tools, and algorithms. In addition, we must fund the design and construction of large-scale prototypes of next-generation high-end systems that include balanced exploration of new hardware and software models, driven by scientific application requirements. After experimental assessment and community feedback, the most promising efforts should then transition to even larger-scale testing and vendor product creation, and new prototyping efforts should be launched. This critical cycle of prototyping, assessment, and commercialization must be a long-term, sustaining investment, not a one-time crash program.

References

1. G. Bell and J. Gray, "What's Next in High-Performance Computing," Comm. ACM, vol. 45, no. 2, 2002, pp. 91–95.
2. D.A. Reed, ed., Workshop on the Roadmap for the Revitalization of High End Computing, Computing Research Assoc., Jan. 2004; www.cra.org/reports/supercomputing.pdf.
3. A Science-Based Case for Large Scale Simulation, US Dept. of Energy, June 2003; www.pnl.gov/scales/.
4. Revolutionizing Science and Engineering through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure, US Nat'l Science Foundation, Jan. 2003; www.cise.nsf.gov/evnt/reports/toc.htm.
5. Federal Plan for High-End Computing: Report of the High-End Computing Revitalization Task Force (HECRTF), May 2004; www.nitrd.gov/pubs/2004_hecrtf/20040702_hecrtf.pdf.

Dan Reed is director of the interdisciplinary Renaissance Computing Institute, which spans the University of North Carolina at Chapel Hill, North Carolina State University, and Duke University. He also holds the Chancellor's Eminent Professorship at the University of North Carolina at Chapel Hill and is a member of the President's Information Technology Advisory Committee (PITAC). Reed conducts research in high-performance computing, fault tolerance, performance measurement, and interdisciplinary applications. He is a fellow of the IEEE and the ACM. Until December 2003, he was director of the NSF-funded National Center for Supercomputing Applications (NCSA) and chief architect of the NSF TeraGrid, a nationally distributed high-performance computing Grid built from commodity clusters. Contact him at reed@renci.org.
Parallelism reflects the forms of action concurrency that the hardware architecture can exploit and support. Conventional distributed-memory MPPs (old usage) are limited to communicating sequential processes (such as message passing), whereas vector computers exploit fine-grained vectors and pipelining. This property field exposes the means by which the overhead of managing parallel resources and concurrent tasks is supported and therefore made efficient.

Latency and locality management defines the mechanisms and methods incorporated to tolerate latency effects. Caches, pipelining, prefetching, multithreading, and message-driven computing are among the possible mechanisms that avoid or hide access latencies. The more flexible and less sensitive these mechanisms are to application attributes such as temporal and spatial locality, the greater the overall execution efficiency that is likely to be achieved.

Queueing models use a notation of a few descriptors separated by slashes (for example, M/M/1) to describe a broad range of queueing system types. We consider here a similar syntax, using four fields to represent parallel architecture classes, but we extend the nomenclature to let multiple designators in any given field permit a richer description space. We suggest the fields, and examples for each, as follows:

• Clustering: c for commodity cluster or m for monolithic system.
• Naming: d for distributed, s for shared, or c for cache-coherent.
• Parallelism: t for multithreading, v for vector, c for communicating sequential processes or message passing, s for systolic, w for very long instruction word (VLIW), h for producer or consumer, p for parallel processes, and so on.
• Latency: c for caches, v for vectors, t for multithreaded, m for processor in memory, p for parcel or message-driven split-transaction, f for prefetching, and a for explicit allocation.

Admittedly, we could probably add other designations to this list: for example, the Earth Simulator would be m/s/v/v, the Tera MTA (multithreaded architecture) would be m/s/t/t, the SGI …
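Because each field is a small set of single-letter designators, the schema is easy to decode mechanically. The sketch below is a hypothetical decoder of ours, built only from the field lists above; it expands designators such as the authors' m/s/v/v and m/s/t/t examples and tolerates multiple letters per field.

# Hypothetical decoder for the four-field schema (field lists from the text).
FIELDS = [
    ("clustering",  {"c": "commodity cluster", "m": "monolithic system"}),
    ("naming",      {"d": "distributed", "s": "shared", "c": "cache-coherent"}),
    ("parallelism", {"t": "multithreading", "v": "vector",
                     "c": "communicating sequential processes",
                     "s": "systolic", "w": "VLIW",
                     "h": "producer/consumer", "p": "parallel processes"}),
    ("latency",     {"c": "caches", "v": "vectors", "t": "multithreaded",
                     "m": "processor in memory",
                     "p": "parcel/message-driven split-transaction",
                     "f": "prefetching", "a": "explicit allocation"}),
]

def describe(designator: str) -> dict:
    """Expand a designator such as 'm/s/v/v'; a field may concatenate
    several letters (e.g. 'cv') for a richer description."""
    parts = designator.lower().split("/")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return {name: [codes[letter] for letter in part]   # KeyError = unknown code
            for (name, codes), part in zip(FIELDS, parts)}

print(describe("m/s/v/v"))   # the authors' Earth Simulator example
print(describe("m/s/t/t"))   # the authors' Tera MTA example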
… of bytes per flops, we'll lose the 1:1 ratio because of the roughly 1,000-to-1 difference between the required memory and the ALUs needed to match it. Instead, an entirely different set of metrics and balance requirements will drive future architectures based on bandwidth, overhead time, and latency tolerance. It's possible that the concept of the processor as we know it will disappear as aggregates of finer-grained cell-like constructs that integrate logic, state, and data transfer merge into one highly replicated element. Power and reliability through active reconfiguration (graceful degradation) will become as important as optimized throughput. Computation will have to be abstracted (virtualized) with respect to the underlying physical execution medium to let it adapt to the constantly shifting organization. SOC (system on a chip), SMP on a chip, and PIM (processor in memory) will become important elements that bring memory closer to logic. Simultaneously, Logic Intensive Processor Architectures (LIPAs) such as Stanford's streaming architecture, the University of Texas at Austin's TRIPS (tera-op reliable intelligently adaptive processing system) architecture, and Cray's Cascade architecture will exploit large internal arrays of ALUs. New technologies could permit systems such as the Hybrid Technology Multithreaded (HTMT) architecture to achieve far higher densities of computation, as would 3D packaging (capable of putting 500,000 chips in a cubic meter). For important applications, special-purpose devices could still play a role, and field-programmable gate arrays might prove of value in making such applications more accessible and general. High-bandwidth optical communication, perhaps even directly connected to the chips, will permit bisection bandwidths well beyond many petabits per second across a system. The choices are so rich—and the driving motivation so compelling—that COTS-based systems will go the way of the mainframe uniprocessor at some point. Supercomputers and desktops just aren't the same thing, in spite of the fact that they both perform calculations.

Centers in the 21st Century

Computing centers evoke images of white lab coats and large front panels behind layers of glass—mere mortals had limited or indirect access. These centers housed the largest processing and storage facilities, cost millions of dollars, and some even contained the ultimate high-IQ system: the supercomputer. For batch processing, access was through submitted decks, either physical (punched cards) or virtual (jobs submitted from terminals). Some of us remember keypunch and terminal rooms managed by the central institutional computer centers of the day. As minicomputers, workstations, and ultimately PCs incrementally permeated the computing community, the computer center's role evolved and narrowed, but it retained its critical contribution to large-scale computation.

However, with the cluster's emergence—particularly Beowulf-class systems—some contend that the computer center as an institution is at its end. Without question, commodity clusters and Beowulfs have resulted in local sites obtaining, applying, and maintaining systems with capabilities ordinarily reserved for the pristine conditions of the classic machine room; this is often accomplished with minimal upfront costs, leveraging extant talent for system support services. For certain contexts, such as academic environments and research laboratories, this trade-off works well for systems of a few dozen to a couple of hundred nodes, but beyond that, resources are usually stressed, sometimes severely, especially when such clusters are shared among several users and applications. As few as 50 nodes can demand a full support person. Although this seems a little high, at some level, managing a commodity cluster can become a full-time job.

Mass storage can be an important part of a system's capability, even one assumed to be dedicated to compute-intensive applications. Large archival tertiary storage facilities are becoming increasingly valuable to a full-service scientific computing environment and can require expert administration. Similarly, networking cluster systems to the external user base, either within an administrative domain or over the Internet, adds to the responsibilities of such systems. Software upgrades; hardware diagnosis and replacement; and account, job, and user interface management all entail significant time and talent in maintaining a large cluster.

Many labs, groups, and organizations can now acquire—within their budgets—a cluster system capable of substantial performance, but they're often not prepared to engage the resources or fulfill the responsibilities of maintaining and managing a complex comput…