Thomas Sterling
Matthew Anderson
Maciej Brodowicz
School of Informatics, Computing, and Engineering
Indiana University, Bloomington
Acknowledgments
This textbook would not have been possible in either form or quality without the many contributions, both direct and indirect, of a large number of friends and colleagues. It is derived from first-year graduate courses taught at both Louisiana State University (LSU) and Indiana University (IU). A number of people contributed to these courses, including Chirag Dekate, Daniel Kogler, and Timur Gilmanov. Amy Apon, a professor at the University of Arkansas, partnered with LSU, taught this course in real time over the internet, and helped to develop pedagogical material, including many of the exercises used. Now at Clemson University, she has continued this important contribution using her technical and pedagogical expertise. Andrew Lumsdaine, then a professor at IU, co-taught the first version of this course at IU. Amanda Upshaw was instrumental in coordinating the process that resulted in the final draft of the book, and directly developed many of the illustrations, graphics, and tables. She was also responsible for the glossary of terms and acronyms. Her efforts are responsible in part for the quality of this textbook.
A number of friends and colleagues provided guidance as the authors crafted early drafts of the book. These contributions were of tremendous value and helped make the content and form more useful for readers and students. David Keyes of KAUST reviewed and advised on Chapter 9 on parallel algorithms. Jack Dongarra provided important feedback on Chapter 4 on benchmarking.
This textbook reflects decades of effort, research, development, and experience by an uncounted number of contributors to the field of high performance computing. While not directly involved with the creation of this text, many colleagues have contributed to the concepts, components, tools, methods, and common practices associated with the broad context of high performance computing and its value. Among these are Bill Gropp, Bill Kramer, Don Becker, Richard and Sarah Murphy, Jack Dongarra and his many collaborators, Satoshi Matsuoka, Guang Gao, Bill Harrod, Lucy Nowell, Kathy Yelick, John Shalf, John Salmon, and of course Gordon Bell. Thomas Sterling would like to acknowledge his thesis advisor at MIT, Bert Halstead, for the mentorship that helped him become the contributor he is today.
Thomas Sterling also acknowledges Jorge Ucan, Amanda Upshaw, and his co-authors, who made this book possible, and especially Paul Messina, his colleague, role model, mentor, and friend, without whom this book would never have occurred. Matthew Anderson would like to thank Dayana Marvez, Oliver Anderson, and Beltran Anderson. Maciej Brodowicz would like to thank his wife, Yuko Prince Brodowicz. The authors would like to thank Nate McFadden of Morgan Kaufmann, who provided the enormous effort, guidance, and patience that made this textbook possible.
Foreword
High Performance Computing is a needed follow-on to Becker and Sterling’s 1994 creation of the Beowulf cluster recipe for building scalable high performance computers (also known as supercomputers) from commodity hardware. Beowulf enabled groups everywhere to build their own supercomputers. Now, with hundreds of Beowulf clusters operating worldwide, this comprehensive text addresses the critical missing link: an academic course for training domain scientists and engineers, and especially computer scientists. Competence involves knowing exactly how to create and run (e.g., control, debug, monitor, visualize, and evolve) parallel programs on the congeries of computational elements (cores) that constitute today’s supercomputers.
Mastery of these ever-increasing, scalable, parallel computing machines gives entry into a comparatively small but growing elite, and is the authors’ goal for readers of the book. Lest the reader believe the shift in name from “supercomputing” to “high performance computing” is unimportant, consider the field’s conference: the first one, in 1988, was the ACM/IEEE Supercomputing Conference, also known as Supercomputing 88; in 2006 the name evolved to the International Conference on High Performance Computing, Networking, Storage, and Analysis, abbreviated SCXX, where XX is the last two digits of the year. About 11,000 people attended SC16.
It is hard to describe a “supercomputer,” but I know one when I see one. Personally, I never pass up a visit to a supercomputer, having seen my first one in 1961: the UNIVAC LARC (Livermore Advanced Research Computer) at Lawrence Livermore National Laboratory, specified by Edward Teller to run hydrodynamic simulations for nuclear weapons design. LARC consisted of a few dozen cabinets of densely packed circuit boards interconnected with a few thousand miles of wire and a few computational units operating at a 100 kHz rate. In 2016 the Sunway TaihuLight supercomputer in China, then the largest in the world, operated a trillion times faster than LARC. It consists of over 10 million processing cores operating at a 1.5 GHz rate, and consumes 15 MW. The computer is housed in four rows of 40 cabinets, each containing 256 processing nodes. A node has four interconnected processors, each with 8 GB of memory and each consisting of one control core plus 64 processing elements, or cores, arranged as an 8-by-8 array. Thus the 10.6 million processing elements, i.e., $160 \text{ cabinets} \times 256 \text{ nodes} \times 4 \text{ processors} \times (1 \text{ control} + 8 \times 8) \text{ cores} = 10{,}649{,}600$ cores, deliver 125 peak petaflops with 1.31 PB of memory ($160 \times 256 \times 4 \times 8\ \text{GB} = 1{,}310{,}720\ \text{GB}$). Several of the TOP500 supercomputers have O(10,000) computing nodes that connect and control graphics processing units (GPUs) with O(100) cores. Today’s challenge for computational program developers is designing the architecture and implementation of programs to utilize these megaprocessor computers.
From a user perspective, the “ideal high performance computer” has an infinitely fast clock,
executes a single instruction stream program operating on data stored in an infinitely large and fast
single-memory, and comes in any size to fit any budget or problem. In 1957 Backus established the
von Neumann programming model with Fortran. The first or “Cray” era of supercomputing from
the 1960s through the early 1990s saw the evolution of hardware to support this simple, easy-to-use
ideal by increasing processor speed, pipelining an instruction stream, processing vectors with a single
instruction, and finally adding processors for a program held in the single-memory computer. By the
early 1990s evolution of a single computer toward the ideal had stopped: clock speeds reached a few
GHz, and the number of processors accessing a single memory through interconnection was limited to
a few dozen. Still, the limited-scale, shared-memory multiprocessor is likely to be the most straightforward computer to program and use!
Fortunately, in the mid-1980s the “killer microprocessor” arrived, demonstrating cost effectiveness
and unlimited scaling just by interconnecting increasingly powerful computers. Unfortunately, this
multicomputer era has required abandoning both the single memory and the single sequential program
ideal of Fortran. Thus “supercomputing” has evolved from a hardware engineering design challenge of the single (mono-memory) computer of the Seymour Cray era (1960–95) to a software engineering design challenge of creating a program to run effectively using multicomputers. Programs first operated on 64 processing elements (1983), then 1000 elements (1987), and now 10 million processing elements (2016) in thousands of fully distributed (mono-memory) computers in today’s multicomputer era. So in effect, today’s high performance computing (HPC) nodes are like the supercomputers of a decade ago, as processing elements have grown 36% per year from 1000 processing elements in 1987 to 10 million (contained in 100,000 computer nodes).
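The 36% per year figure is a compound growth rate, and it can be checked from the two endpoints the foreword gives. The Python sketch below is illustrative only; the function names are hypothetical, and it simply applies the standard compound-growth formula $r = (N_{2016}/N_{1987})^{1/29} - 1$.

```python
import math


def annual_growth_rate(start: float, end: float, years: float) -> float:
    """Compound annual growth rate r such that start * (1 + r)**years == end."""
    return (end / start) ** (1.0 / years) - 1.0


def years_to_grow(start: float, end: float, rate: float) -> float:
    """Years needed to grow from `start` to `end` at a constant annual `rate`."""
    return math.log(end / start) / math.log(1.0 + rate)


if __name__ == "__main__":
    # Figures from the foreword: about 1,000 processing elements in 1987
    # and about 10 million in 2016, a span of 29 years.
    rate = annual_growth_rate(1_000, 10_000_000, 2016 - 1987)
    print(f"implied growth: {rate:.1%} per year")
    # How long the same 10,000-fold increase takes at exactly 36% per year.
    print(f"years at 36%/yr: {years_to_grow(1_000, 10_000_000, 0.36):.1f}")
```

Run as a script, it reports an implied rate of about 37.4% per year and shows that a steady 36% per year would take about 30 years to cover the same 10,000-fold increase, so the quoted figure is a fair round number for the 29-year span.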
High Performance Computing is the essential guide and reference for mastering supercomputing, as the authors enumerate the complexity and subtleties of structuring for parallelism and of creating and running these large parallel and distributed programs. For example, the largest climate models, which simulate ocean, ice, atmosphere, and land concurrently, are created by teams of a dozen or more domain scientists, computational mathematicians, and computer scientists.
Program creation includes understanding the structure of the collection of processing resources and their interaction for different computers, from multiprocessors to multicomputers (Chapters 2 and 3), and the various overall strategies for parallelization (Chapter 9). Other topics include synchronization and message-passing communication among the parts of parallel programs (Chapters 7 and 8), additional libraries that form a program (Chapter 10), file systems (Chapter 18), long-term mass storage (Chapter 17), and components for the visualization of results (Chapter 12). Standard benchmarks for a system give an indication of how well your parallel program is likely to run (Chapter 4). Chapters 16 and 17 introduce and describe the techniques for controlling accelerators and special hardware cores, especially GPUs, attached to nodes to provide two orders of magnitude more processing per node. These attachments are an alternative to the vector processing units of the Cray era, and are typified by the Compute Unified Device Architecture (CUDA), NVIDIA’s programming model for encapsulating parallelism on its GPUs.
Unlike programs that are created, debugged, and executed interactively on a personal computer, on a smartphone, or within a browser, supercomputer programs are submitted through batch processing control. Running a program requires specifying to the computer the resources and conditions for controlling your program with batch control languages and commands (Chapter 5), getting the program into a reliable and dependable state through debugging (Chapter 14), checkpointing, i.e., saving intermediate results on a timely basis as insurance for the computational investment (Chapter 20), and evolving and enhancing a program’s efficacy through performance monitoring (Chapter 13).
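Checkpointing can be illustrated with a small application-level sketch in Python. It is hypothetical: `run` stands in for a long computation and a small JSON file stands in for the program state, whereas real HPC codes save far larger state, often through parallel I/O libraries. The general pattern is what matters: save state periodically, write it atomically (temporary file plus rename) so a crash mid-write cannot corrupt the last good checkpoint, and on restart resume from the last saved step.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write `state` to `path` atomically: dump it to a temporary file in the
    same directory, flush it to disk, then rename it over the old checkpoint.
    A crash mid-write therefore never leaves a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is an atomic rename on POSIX when both paths share a filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the most recently saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path: str, total_steps: int, every: int,
        stop_after: Optional[int] = None) -> dict:
    """Hypothetical long computation (a running sum over step indices) that
    checkpoints every `every` steps and resumes from the last checkpoint.
    `stop_after` simulates the job being killed after that many steps."""
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed == stop_after:
            # Simulated failure: work done since the last checkpoint is lost.
            return state
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        executed += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

For example, `run("job.ckpt", 100, 10, stop_after=37)` stops with its last checkpoint at step 30; calling `run("job.ckpt", 100, 10)` again resumes from step 30 rather than step 0, redoing only the seven unsaved steps. The checkpoint interval trades the cost of writing state against the amount of work repeated after a failure.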
Chapter 21 concludes with a forward look at the problems and alternatives for moving supercomputers, and the ability to use them, to exascale and beyond. In fact, the only part of HPC not described in this book is the incredible teamwork, and the evolution of team sizes, involved in writing and managing HPC codes. However, the most critical aspect of teamwork resides with the competence of the individual members. This book is your guide.
Gordon Bell
October 2017
BY E. C. SPITZKA, M.D.
As the symptoms of the regular affections of the cord are by far the
most readily recognizable, and a preliminary knowledge of them will
facilitate the better understanding of the irregular forms, we shall
consider the former first. They may be subdivided into two groups.
The largest, longest known, and best studied consists of acquired disorders; the other, containing less numerous cases and varieties, and rendered familiar to the profession only within the last decade, comprises the spinal disorders due to defective development of the cerebro-spinal and spinal-fibro systems.
Tabes Dorsalis.
While some patients escape these pains almost entirely,2 others are
tormented with them at intervals for years, their intensity usually
diminishing when the ataxic period is reached. There is little question
among those who have watched patients in this condition that their
pains are probably the most agonizing which the human frame is
ever compelled to endure. That some of the greatest sufferers
survive their martyrdom appears almost miraculous to themselves.
Thus, in one case the patient, who had experienced initial symptoms
for a year, woke up at night with a fulminating pain in the heels which
recurred with the intensity of a hot spear-thrust and the rapidity of a
flash every seven minutes; then it jumped to other spots, none of
which seemed larger than a pin's head, till the patient, driven to the
verge of despair and utterly beside himself with agony, was in one
continued convulsion of pain, and repeatedly—against his conviction
—felt for the heated needles that were piercing him. In another case
the patient, with the pathetic picturesqueness of invalid misery,
compared his fulminating pains to strokes of lightning, “but not,” he
added, “as they used to appear, like lightning out of a clear sky, but
with the background of a general electrical storm flashing and
playing through the limbs.”
2 I have at present under observation two intelligent patients (one of whom had been
hypochondriacally observant of himself for years) who experienced not a single pain,
as far as they could remember, and who have developed none while under
observation. Seguin mentioned a case at a meeting of the Neurological Society with a
record of but a single paroxysm of the fulgurating variety. Bramwell (Brit. Med. Journ.,
Jan. 2, 1886) relates another in which the pains were entirely absent.
Either while the pains are first noticed or somewhat later other signs
of disturbed sensation are noted. Certain parts of the extremities feel
numb or are the site of perverted feelings. The soles of the feet, the
extremities of the toes, the region about the knee-pan, and the
peroneal distribution, and, more rarely, the perineum and gluteal
region, are the localities usually affected.3 In a considerable
percentage of cases the numbness and tingling are noted in the little
finger and the ulnar side of the ring finger; that is, in the digital
distribution of the ulnar nerve. The early appearance of this symptom
indicates an early involvement of the cord at a high level. Some
parallelism is usually observable between the distribution of the
lightning-like pains when present and the anæsthesia and
paræsthesia if they follow them. With these signs there is almost
invariably found a form of illusive sensation known as the belt
sensation. The patient feels as if a tight band were drawn around his
body or as if a pressure were exerted on it at a definite point. This
sensation is found in various situations, according as the level of the
diseased part of the cord be a low or high one. Thus, when the lower
limbs are exclusively affected or nearly so the belt will be in the
hypogastric or umbilical region; if the upper limbs be much involved,
in the thoracic region; and if occipital pain, anæsthesia of the
trigeminus, and laryngeal crises are present, it may even be in the
neck. Correspondingly, it is found in the history of one and the same
patient: if there be a marked ascent—that is, a successive
involvement of higher levels in the cord—the belt will move up with
the progressing disease. This occurrence, however, is less
frequently witnessed than described. In the majority of cases of
tabes disturbances of the bladder function occur very early in the
disease. Hammond indeed claims that in the shape of incontinence it
may be the only prodromal symptom for a long period.4
3 In the exceptional cases where the initial sensory disturbance is marked in the
perineal and scrotal region I have found that the antecedent fulminating pains had
been attributed to the penis, rectum, and anal region; and in one case the subjective
sense of a large body being forcibly pressed through the rectum was a marked early
sign.
7 Not even the absence of the knee-jerk ranks as high as these two signs. Aside from
the fact that this is a negative symptom, it is not even a constant feature in advanced
tabes.
8 It does not seem as if the disturbance of static equilibrium were due merely to the
removal of the guide afforded by the eyes, for it is noted not alone in patients who are
able to carry out the average amount of locomotion in the dark, but also in those who
have complete amaurosis. Leyden (loc. cit., p. 334) and Westphal (Archiv für
Psychiatrie, xv. p. 733) describe such cases. The act of shutting the eyes alone,
whether through a psychical or some occult automatic influence, seems to be the
main factor.
In most cases of early tabes it is found that the pupil does not
respond to light; it may be contracted or dilated, but it does not
become wider in the dark nor narrower under the influence of light.
At the same time, it does contract under the influence of the
accommodative as well as the converging efforts controlled by the
third pair, and in these respects acts like the normal pupil. It is
paralyzed only in one sense—namely, in regard to the reflex to light;
just as the muscles which extend the leg upon the thigh may be as
powerful as in health, but fail to contract in response to the reflex
stimulus applied when the ligamentum patellæ is struck. For this
reason it is termed reflex iridoplegia.9 It is, when once established,
the most permanent and unvarying evidence of the disease, and is
of great differential diagnostic value, because it is found in
comparatively few other conditions.
9 It is also known as the Argyll-Robertson pupil. Most of the important symptoms of
tabes are known by the names of their discoverers and interpreters. Thus, the
swaying with the eyes closed is the Romberg or Brach-Romberg symptom; the
absence of the knee-phenomenon, Westphal's or the Westphal-Erb symptom; and the
arthropathies are collectively spoken of as Charcot's joint disease.
18 Loc. cit.
While the symptoms thus far considered as marking the origin and
progress of tabes dorsalis are more or less constant, and although
some of them show remarkable remissions and exacerbations, yet
may in their entity be regarded as a continuous condition slowly and
surely increasing in severity, there are others which constitute
episodes of the disease, appearing only to disappear after a brief
duration varying from a few hours to a few days: they have been
termed the crises of tabes dorsalis. These crises consist in
disturbances of the functions of one or several viscera, and are
undoubtedly due to an error in innervation provoked by the
progressing affection of the spinal marrow and oblongata. The most
frequent and important are the gastric crises. In the midst of
apparent somatic health, without any assignable cause, the patient is
seized with a terrible distress in the epigastric region, accompanied
by pain which may rival in severity the fulgurating pains of another
phase of the disease, and by uncontrollable vomiting. Usually, these
symptoms are accompanied by disturbances of some other of the
organs under the influence of the pneumogastric and sympathetic
nerves. The heart is agitated by violent palpitations, a cold sweat
breaks out, and a vertigo may accompany it, which, but for the fact
that it is not relieved by the vomiting and from its other associations,
might mislead the physician into regarding it as a reflex symptom. In
other cases the symptoms of disturbed cardiac innervation or those
of respiration are in the foreground, constituting respectively the
cardiac and bronchial crises. Laryngeal crises are marked by a
tickling and strangling sensation in the throat, and in their severer
form, which is associated with spasm of the glottis, a crowing cough
is added.22 Enteric crises, which sometimes coexist with gastric
crises, at others follow them, and occasionally occur independently,
consist in sudden diarrhœal movements, with or without pain, and
may continue for several days. Renal or nephritic crises are
described23 as resembling an attack of renal colic. The sudden