PERSPECTIVES IN COMPUTATIONAL SCIENCE

HIGH-PERFORMANCE COMPUTING: CLUSTERS, CONSTELLATIONS, MPPS, AND FUTURE DIRECTIONS
By Jack Dongarra, Thomas Sterling, Horst Simon, and Erich Strohmaier
The responsibilities and services that necessitate building such facilities will continue to be critical, especially to the high-end computing and large data archive communities. Already we see in the US Department of Energy and US National Science Foundation sectors the development of new and larger computing centers to house the next generation of high-end systems, including very large Beowulf clusters. The computer centers of the future will be charged with the administration, management, and training associated with bringing these major resources to bear on mission-critical applications.

This article, while congratulating Bell and Gray on opening up this line of discourse, offers a constructive expansion on their original themes and seeks to correct specific areas of their premise with which we take exception. The long-term future of HPC architectures will involve innovative structures that support new paradigms of execution models, which in turn will greatly enhance efficiency in terms of performance, cost, space, and power while enabling scalability to tens or hundreds of petaflops. The conceptual framework offered here implies the directions of such developments and resonates with recent advances being pursued by the computer architecture research community.

Commodity Clusters

Bell and Gray, in conjunction with their distinguished colleagues, see an important unifying principle emerging in HPC's evolution: the integration of highly replicated components (many of which were designed and fabricated for more general markets) as the driving force for a convergent architecture. They call this architecture a cluster and distinguish it only from the minority set of vector supercomputers (such as the NEC SX-6 and Cray X1) that exploit vectors in custom processor architecture designs. This convergent-architecture model of the evolution of supercomputer design is compelling, readily apparent, and wrong. We respectfully assert an alternate perspective that is rich in detail and has value as an enabling framework for reasoning about computing structures and methods. In particular, we assert that the term "cluster" is best employed not as a synonym for essentially the universal set of parallel computer system organizations, but rather as a specific class of such systems. Therefore we state that NOT everything is a cluster.

We limit the scope of the definition of a cluster to a parallel computer system comprising an integrated collection of independent nodes, each of which is a system in its own right, capable of independent operation and derived from products developed and marketed for other stand-alone purposes. A commodity cluster is a cluster in which both the network and the compute nodes are commercial products available for procurement and independent application by organizations (end users or separate vendors) other than the original equipment manufacturer. Beowulf-class clusters and workstation clusters were once two distinct system types, but with the blurring or outright elimination of any meaningful differences in capability between PCs and workstations, the differentiation between these two types of clusters has also largely lost any meaning. This is particularly true with the wide usage of Linux as the base node operating system, a strategy originally pioneered by Beowulf-class clusters.

The Top500 list represents two broad classes of clusters: cluster-NOW and constellation systems. Both are commodity cluster systems distinguished by the dominant level of parallelism. Although more complex system structures are possible (such as superclusters), commodity clusters usually comprise two levels of parallelism. The first is the number of nodes connected by the global communications network, in which a node contains all the cluster's processor and memory resources. The second is the number of processors in each node, usually configured as a symmetric multiprocessor (SMP). If a commodity cluster has more nodes than microprocessors in any one of its nodes, the dominant mode of parallelism is at the first level (the cluster-NOW category). If a node has more microprocessors than there are nodes in the commodity cluster, the dominant mode of parallelism is at the second level (a constellation). This distinction is not arbitrary: it can have a serious impact on cluster programming. A cluster-NOW system, for example, is programmed almost exclusively with the message-passing interface (MPI), whereas a constellation is likely to be programmed at …
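The node-versus-processor rule distinguishing the two classes is mechanical enough to state as code. The short sketch below is a hypothetical helper of ours, not part of the Top500 methodology; it simply applies the comparison defined above (the equal-count case, which the rule doesn't decide, is labeled explicitly).

def dominant_parallelism(num_nodes: int, procs_per_node: int) -> str:
    """Classify a commodity cluster by its dominant level of parallelism:
    first-level (inter-node) parallelism dominates in a cluster-NOW;
    second-level (intra-node SMP) parallelism dominates in a constellation."""
    if num_nodes > procs_per_node:
        return "cluster-NOW"    # more nodes than processors in any one node
    if procs_per_node > num_nodes:
        return "constellation"  # more processors per node than nodes
    return "borderline"         # equal counts: the rule above doesn't decide

# Hypothetical machine shapes:
print(dominant_parallelism(num_nodes=256, procs_per_node=2))   # cluster-NOW
print(dominant_parallelism(num_nodes=16, procs_per_node=64))   # constellation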
Is There Light at the End of the Tunnel?
By Dan Reed, Renaissance Computing Institute (RENCI)

… fluorocarbon or CO2 emissions) is only possible if the time required for each simulation is small. Hence, the supercomputing challenge has always been to deliver the highest possible performance to applications, … (continued below)
… differ dramatically in their hardware support for system-wide parallel execution. The amount of overhead in the critical time path for controlling parallel actions thus largely depends on the hardware control mechanisms (or lack thereof).

Latency management. The wait for remote access and service strongly factors into a supercomputer's efficiency and scaling. It includes the long distances that messages have to travel, the delays when contending for shared resources such as network bandwidth, and the service times for actions such as assembling and interpreting messages. Latency management includes pipelining vectors, multithreading, avoidance via explicit locality management, caching, and message-driven computing. How a system manages latency is an important distinguishing characteristic.
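Latency hiding is easiest to see in message-passing code. The sketch below is a minimal example of ours, not the article's, assuming the mpi4py MPI binding and NumPy are installed and two ranks are launched under mpiexec: it posts nonblocking sends and receives, computes while the messages are in flight, and blocks only at the final wait.

# A minimal latency-hiding sketch (ours, not the article's).
# Assumes mpi4py and NumPy; run with: mpiexec -n 2 python overlap.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                          # two ranks exchange buffers

send_buf = np.full(1_000_000, rank, dtype=np.float64)
recv_buf = np.empty(1_000_000, dtype=np.float64)

# Post the communication first (split-phase, nonblocking)...
reqs = [comm.Isend(send_buf, dest=peer, tag=0),
        comm.Irecv(recv_buf, source=peer, tag=0)]

# ...then do useful local work while the messages are in flight.
local = np.sin(send_buf).sum()           # stand-in for real computation

# Block only at the point where the remote data is actually needed.
MPI.Request.Waitall(reqs)
print(f"rank {rank}: local={local:.2f}, received mean={recv_buf.mean():.1f}")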
Namespace distribution. From an abstract viewpoint, a system comprises a collection of namespaces and actions that can be performed on named entities. Shared-memory versus distributed-memory systems are one such division. The names of I/O ports or channels (including whether they're local to part of the system or globally accessible) are another. The process ID (again, local or global) is also a discriminator, as is whether processor nodes constitute an explicit namespace or are simply a pool of anonymous resources. Logical namespaces can thus characterize a system, at least in part.

Reliance on commodity. In recent years, the economics of system implementation has dominated development. Some people consider the degree to which the system architecture exploits commodity components, subsystems, or systems as building blocks for very large structures to be an essential attribute in determining a system concept's likely success or failure. Beowulf clusters exclusively comprise commodity components, systems, and networks; the SX-6 employs custom vector processor architectures, motherboards, and networks, but uses commodity memory chips. Commodity components might be cheaper because of their economy of scale, but they could lack many functional attributes essential for efficient scalable supercomputing.

A Framework for Characterizing Parallel Architectures

We suggest a naming schema that delineates parallel computing systems according to key dimensions of system attributes rather than the random terms with which we're all familiar. Our rationale is to demonstrate that alternative naming methods are possible, rather than to impose a specific new method on the community. For example, as Bell and Gray make clear, the term MPP, although used pervasively throughout the history of the Top500 list, is confusing, misleading, and provides little specification of system type. The notion has been abused, confused, and, ironically, derived from a different type of system than that to which it is ordinarily applied. (The original MPP was a SIMD-class computer, which is a separate classification on the Top500 list.)

Every strategy, including ours, reflects the critical sensitivities of its time. Here, we emphasize four dominant dimensions for characterizing parallel computing systems:

• clustering,
• namespace,
• parallelism, and
• latency and locality management.

Any system can be clustered with like systems to yield a larger ensemble system, but doing so doesn't mean that all the attributes of the constituent uniform systems are conveyed unmodified to the aggregate system. The important factor here is a synthesis of existing stand-alone subsystems developed for a different, presumably larger, market and user workload. The alternative, a monolithic system, is not a product of clustering, but a structure of highly replicated basic components. Other appropriate designators, perhaps those that reflect a specific hierarchy, could exist—how would we represent a supercluster (a cluster of clusters), for example?

Namespace indicates how far a single namespace is shared across a system. Although many possible namespaces exist (such as variables, process IDs, I/O ports, and so on), user variables illustrate the concept here. A distributed namespace is one in which one node's variables aren't directly visible to another, whereas a shared namespace is one in which all variables in all nodes are visible to all other nodes. Cache coherence adds to the shared-namespace attribute by providing hardware support for managing copies. This illustrates that we can combine multiple attributes (lexically concatenated) to provide a complex descriptor.
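The variable-visibility distinction shows up even on a single machine. In the toy Python sketch below (our illustration, not the authors'), threads stand in for a shared namespace and a child process for a distributed one: the same update is visible in the first case and invisible in the second.

# Toy illustration (ours): shared vs. distributed namespaces on one machine.
import threading
import multiprocessing as mp

counter = 0                      # one named variable in this program

def bump():
    global counter
    counter += 1

if __name__ == "__main__":
    # Shared namespace: the thread updates the very variable the parent reads.
    t = threading.Thread(target=bump)
    t.start(); t.join()
    print("after thread: ", counter)    # 1: the update is visible

    # Distributed namespace: the child process updates only its own copy.
    p = mp.Process(target=bump)
    p.start(); p.join()
    print("after process:", counter)    # still 1: the update stayed remote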
Is There Light at the End of the Tunnel? (continued)

… complex applications requires a judicious match of computer architecture, system software, and software development tools. Most researchers in high-end computing believe the key reasons for our current difficulties in achieving high performance on complex scientific applications can be traced to inadequate research investment in software and to the use of processor and memory architectures that aren't well matched to scientific applications.

Today, scientific applications are developed with crude software tools (compared to those used in the commercial sector). Low-level programming, based on message-passing libraries, means that application developers must provide deep knowledge of application software behavior and its interaction with the underlying computing hardware. This is a tremendous intellectual burden that, unless rectified, will continue to limit the usability of high-end computing systems, restricting effective access to a small cadre of researchers. We need only look at the development history of Microsoft Windows to recognize the importance of an iterated cycle of development, deployment, and feedback in developing an effective, widely used product. High-quality research software isn't cheap: it is labor intensive, and its successful creation requires the opportunity to incorporate the lessons learned from previous versions.

Hence, we must begin a coordinated research and development effort to create high-end systems that are better matched to the characteristics of scientific applications. This will require a broad program of basic research into computer architectures, system software, programming models, software tools, and algorithms. In addition, we must fund the design and construction of large-scale prototypes of next-generation high-end systems that include balanced exploration of new hardware and software models, driven by scientific application requirements. After experimental assessment and community feedback, the most promising efforts should then transition to even larger-scale testing and vendor product creation, and new prototyping efforts should be launched. This critical cycle of prototyping, assessment, and commercialization must be a long-term, sustaining investment, not a one-time crash program.

References

1. G. Bell and J. Gray, "What's Next in High-Performance Computing," Comm. ACM, vol. 45, no. 2, 2002, pp. 91–95.
2. D.A. Reed, ed., Workshop on the Roadmap for the Revitalization of High End Computing, Computing Research Assoc., Jan. 2004; www.cra.org/reports/supercomputing.pdf.
3. A Science-Based Case for Large Scale Simulation, US Dept. of Energy, June 2003; www.pnl.gov/scales/.
4. Revolutionizing Science and Engineering through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure, US Nat'l Science Foundation, Jan. 2003; www.cise.nsf.gov/evnt/reports/toc.htm.
5. Federal Plan for High-End Computing: Report of the High-End Computing Revitalization Task Force (HECRTF), May 2004; www.nitrd.gov/pubs/2004_hecrtf/20040702_hecrtf.pdf.

Dan Reed is director of the interdisciplinary Renaissance Computing Institute, which spans the University of North Carolina at Chapel Hill, North Carolina State University, and Duke University. He also holds the Chancellor's Eminent Professorship at the University of North Carolina at Chapel Hill and is a member of the President's Information Technology Advisory Committee (PITAC). Reed conducts research in high-performance computing, fault tolerance, performance measurement, and interdisciplinary applications. He is a fellow of the IEEE and the ACM. Until December 2003, he was director of the NSF-funded National Center for Supercomputing Applications (NCSA) and chief architect of the NSF TeraGrid, a nationally distributed high-performance computing Grid built from commodity clusters. Contact him at reed@renci.org.
Parallelism reflects the forms of action concurrency that the hardware architecture can exploit and support. Conventional distributed-memory MPPs (old usage) are limited to communicating sequential processes (such as message passing), whereas vector computers exploit fine-grained vectors and pipelining. This property field exposes the means by which the overhead of managing parallel resources and concurrent tasks is supported and therefore made efficient.

Latency and locality management defines the mechanisms and methods incorporated to tolerate latency effects. Caches, pipelining, prefetching, multithreading, and message-driven computing are among the possible mechanisms that avoid or hide access latencies. The more flexible and less sensitive these mechanisms are to application attributes such as temporal and spatial locality, the greater the overall execution efficiency that is likely to be achieved.

Queueing models use a notation of a few descriptors separated by slashes (for example, M/M/1) to describe a broad range of queueing system types. We consider here a similar syntax, using four fields to represent parallel architecture classes, but we extend the nomenclature to let multiple designators in any given field permit a richer description space. We suggest the fields, and examples for each, as follows:

• Clustering: c for commodity cluster or m for monolithic system.
• Naming: d for distributed, s for shared, or c for cache-coherent.
• Parallelism: t for multithreading, v for vector, c for communicating sequential processes or message passing, s for systolic, w for very long instruction word (VLIW), h for producer or consumer, p for parallel processes, and so on.
• Latency: c for caches, v for vectors, t for multithreaded, m for processor in memory, p for parcel or message-driven split-transaction, f for prefetching, and a for explicit allocation.

Admittedly, we could probably add other designations to this list: for example, the Earth Simulator would be m/s/v/v, the Tera MTA (multithreaded architecture) would be m/s/t/t, the SGI …
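Because each field is a small set of single-letter designators, the schema is easy to decode mechanically. The sketch below is a hypothetical decoder of ours, built only from the field lists above; it expands designators such as the authors' m/s/v/v and m/s/t/t examples and tolerates multiple letters per field.

# Hypothetical decoder for the four-field schema (field lists from the text).
FIELDS = [
    ("clustering",  {"c": "commodity cluster", "m": "monolithic system"}),
    ("naming",      {"d": "distributed", "s": "shared", "c": "cache-coherent"}),
    ("parallelism", {"t": "multithreading", "v": "vector",
                     "c": "communicating sequential processes",
                     "s": "systolic", "w": "VLIW",
                     "h": "producer/consumer", "p": "parallel processes"}),
    ("latency",     {"c": "caches", "v": "vectors", "t": "multithreaded",
                     "m": "processor in memory",
                     "p": "parcel/message-driven split-transaction",
                     "f": "prefetching", "a": "explicit allocation"}),
]

def describe(designator: str) -> dict:
    """Expand a designator such as 'm/s/v/v'; a field may concatenate
    several letters (e.g. 'cv') for a richer description."""
    parts = designator.lower().split("/")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return {name: [codes[letter] for letter in part]   # KeyError = unknown code
            for (name, codes), part in zip(FIELDS, parts)}

print(describe("m/s/v/v"))   # the authors' Earth Simulator example
print(describe("m/s/t/t"))   # the authors' Tera MTA example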
… of bytes per flops, we'll lose the 1:1 ratio because of the roughly 1,000-to-1 difference between the required memory and the ALUs needed to match it. Instead, an entirely different set of metrics and balance requirements will drive future architectures based on bandwidth, overhead time, and latency tolerance. It's possible that the concept of the processor as we know it will disappear as aggregates of finer-grained cell-like constructs that integrate logic, state, and data transfer merge into one highly replicated element. Power and reliability through active reconfiguration (graceful degradation) will become as important as optimized throughput. Computation will have to be abstracted (virtualized) with respect to the underlying physical execution medium to let it adapt to the constantly shifting organization. SOC (system on a chip), SMP on a chip, and PIM (processor in memory) will become important elements that bring memory closer to logic. Simultaneously, Logic Intensive Processor Architectures (LIPAs) such as Stanford's streaming architecture, the University of Texas at Austin's TRIPS (tera-op reliable intelligently adaptive processing system) architecture, and Cray's Cascade architecture will exploit large internal arrays of ALUs. New technologies could permit systems such as the Hybrid Technology Multithreaded (HTMT) architecture to achieve far higher densities of computation, as would 3D packaging (capable of putting 500,000 chips in a cubic meter). For important applications, special-purpose devices could still play a role, and field-programmable gate arrays might prove of value in making such applications more accessible and general. High-bandwidth optical communication, perhaps even directly connected to the chips, will permit bisection bandwidths well beyond many petabits per second across a system. The choices are so rich—and the driving motivation so compelling—that COTS-based systems will go the way of the mainframe uniprocessor at some point. Supercomputers and desktops just aren't the same thing, in spite of the fact that they both perform calculations.

Centers in the 21st Century

Computing centers evoke images of white lab coats and large front panels behind layers of glass—mere mortals had limited or indirect access. These centers housed the largest processing and storage facilities, cost millions of dollars, and some even contained the ultimate high-IQ system: the supercomputer. For batch processing, access was through submitted decks, either physical (punched cards) or virtual (jobs submitted from terminals). Some of us remember keypunch and terminal rooms managed by the central institutional computer centers of the day. As minicomputers, workstations, and ultimately PCs incrementally permeated the computing community, the computer center's role evolved and narrowed, but it retained its critical contribution to large-scale computation.

However, with the cluster's emergence—particularly Beowulf-class systems—some contend that the computer center as an institution is at its end. Without question, commodity clusters and Beowulfs have resulted in local sites obtaining, applying, and maintaining systems with capabilities ordinarily reserved for the pristine conditions of the classic machine room; this is often accomplished with minimal upfront costs, leveraging extant talent for system support services. For certain contexts, such as academic environments and research laboratories, this trade-off works well for systems of a few dozen to a couple of hundred nodes, but beyond that, resources are usually stressed, sometimes severely, especially when such clusters are shared among several users and applications. As few as 50 nodes can demand a full support person. Although this seems a little high, at some level, managing a commodity cluster can become a full-time job.

Mass storage can be an important part of a system's capability, even one assumed to be dedicated to compute-intensive applications. Large archival tertiary storage facilities are becoming increasingly valuable to a full-service scientific computing environment and can require expert administration. Similarly, networking cluster systems to the external user base, either within an administrative domain or over the Internet, adds to the responsibilities of such systems. Software upgrades; hardware diagnosis and replacement; and account, job, and user interface management all entail significant time and talent in maintaining a large cluster.

Many labs, groups, and organizations can now acquire—within their budgets—a cluster system capable of substantial performance, but they're often not prepared to engage the resources or fulfill the responsibilities of maintaining and managing a complex comput…