OXFORD TEXTS IN APPLIED AND ENGINEERING MATHEMATICS
* G. D. Smith: Numerical Solution of Partial Differential Equations
3rd Edition
* R. Hill: A First Course in Coding Theory
* I. Anderson: A First Course in Combinatorial Mathematics 2nd Edition
* D. J. Acheson: Elementary Fluid Dynamics
* S. Barnett: Matrices: Methods and Applications
* L. M. Hocking: Optimal Control: An Introduction to the Theory with
Applications
* D. C. Ince: An Introduction to Discrete Mathematics, Formal System
Specification, and Z 2nd Edition
* O. Pretzel: Error-Correcting Codes and Finite Fields
* P. Grindrod: The Theory and Applications of Reaction–Diffusion Equations: Patterns
and Waves 2nd Edition
1. Alwyn Scott: Nonlinear Science: Emergence and Dynamics of Coherent Structures
2. D. W. Jordan and P. Smith: Nonlinear Ordinary Differential Equations:
An Introduction to Dynamical Systems 3rd Edition
3. I. J. Sobey: Introduction to Interactive Boundary Layer Theory
4. A. B. Tayler: Mathematical Models in Applied Mechanics (reissue)
5. L. Ramdas Ram-Mohan: Finite Element and Boundary Element Applications in
Quantum Mechanics
6. Lapeyre, et al.: Monte Carlo Methods for Transport and Diffusion Equations
7. I. Elishakoff and Y. Ren: Finite Element Methods for Structures with Large Stochastic
Variations
8. Alwyn Scott: Nonlinear Science: Emergence and Dynamics of Coherent Structures 2nd
Edition
9. W. P. Petersen and P. Arbenz: Introduction to Parallel Computing
Titles marked with an asterisk (*) appeared in the Oxford Applied Mathematics
and Computing Science Series, which has been folded into, and is continued by,
the current series.
Introduction to Parallel
Computing
W. P. Petersen
Seminar for Applied Mathematics
Department of Mathematics, ETHZ, Zurich
wpp@math.ethz.ch
P. Arbenz
Institute for Scientific Computing
Department Informatik, ETHZ, Zurich
arbenz@inf.ethz.ch
Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide in
Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi
Kuala Lumpur Madrid Melbourne Mexico City Nairobi
New Delhi Shanghai Taipei Toronto
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam
Oxford is a registered trade mark of Oxford University Press
in the UK and in certain other countries
Published in the United States
by Oxford University Press Inc., New York
© Oxford University Press 2004
The moral rights of the author have been asserted
Database right Oxford University Press (maker)
First published 2004
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
without the prior permission in writing of Oxford University Press,
or as expressly permitted by law, or under terms agreed with the appropriate
reprographics rights organization. Enquiries concerning reproduction
outside the scope of the above should be sent to the Rights Department,
Oxford University Press, at the address above
You must not circulate this book in any other binding or cover
and you must impose the same condition on any acquirer
A catalogue record for this title is available from the British Library
Library of Congress Cataloging in Publication Data
(Data available)
Typeset by Newgen Imaging Systems (P) Ltd., Chennai, India
Printed in Great Britain
on acid-free paper by
Biddles Ltd., King’s Lynn, Norfolk
ISBN 0 19 851576 6 (hbk)
0 19 851577 4 (pbk)
10 9 8 7 6 5 4 3 2 1
PREFACE
The contents of this book are a distillation of many projects which have sub-
sequently become the material for a course on parallel computing given for several
years at the Swiss Federal Institute of Technology in Zürich. Students in this
course have typically been in their third or fourth year, or graduate students,
and have come from computer science, physics, mathematics, chemistry, and pro-
grams for computational science and engineering. Student contributions, whether
large or small, critical or encouraging, have helped crystallize our thinking in a
quickly changing area. It is, alas, a subject which overlaps with all scientific
and engineering disciplines. Hence, the problem is not a paucity of material but
rather the distillation of an overflowing cornucopia. One of the students' most
often voiced complaints has concerned organization and information overload. It is
thus the point of this book to attempt some organization within a quickly chan-
ging interdisciplinary topic. In all cases, we will focus our energies on floating
point calculations for science and engineering applications.
Our own thinking has evolved as well: A quarter of a century of experi-
ence in supercomputing has been sobering. One source of amusement as well as
amazement to us has been that the power of 1980s supercomputers has been
brought in abundance to PCs and Macs. Who would have guessed that vector
processing computers can now be easily hauled about in students’ backpacks?
Furthermore, the early 1990s dismissive sobriquets about dinosaurs led us to
chuckle that the most elegant of creatures, birds, are those ancients' successors.
Likewise, those early 1990s contemptuous dismissals of magnetic storage media
must now be held up to the fact that 2 GB disk drives are now 1 in. in diameter
and mounted in PC-cards. Thus, we have to proceed with what exists now and
hope that these ideas will have some relevance tomorrow.
For the three years up to the end of 2004, the top entry of the famous
Top 500 list of supercomputers [143] was the Yokohama Earth Simulator. Currently,
the top three entries in the list rely on large numbers of commodity processors:
65536 IBM PowerPC 440 processors at Livermore National Laboratory; 40960
IBM PowerPC processors at the IBM Research Laboratory in Yorktown Heights;
and 10160 Intel Itanium II processors connected by an Infiniband Network [75]
and constructed by Silicon Graphics, Inc. at the NASA Ames Research Centre.
The Earth Simulator is now number four and has 5120 SX-6 vector processors
from NEC Corporation. Here are some basic facts to consider for a truly high
performance cluster:
1. Modern computer architectures run internal clocks with cycles less than a
nanosecond. This defines the time scale of floating point calculations.
2. For a processor to get a datum within a node, which sees a coherent
memory image but on a different processor’s memory, typically requires a
delay of order 1 µs. Note that this is 1000 or more clock cycles.
3. For a node to get a datum which is on a different node by using message
passing typically takes 100 or more µs.
Thus we have the following not particularly profound observations: if the data are
local to a processor, they may be used very quickly; if the data are on a tightly
coupled node of processors, there should be roughly a thousand or more data
items to amortize the delay of fetching them from other processors’ memories;
and finally, if the data must be fetched from other nodes, there should be
100 times more than that if we expect to write off the delay in getting them. So
it is that NEC and Cray have moved toward strong nodes, with even stronger
processors on these nodes. They have to expect that programs will have blocked
or segmented data structures. As we will clearly see, getting data from memory
to the CPU is the problem of high speed computing, not only for NEC and Cray
machines, but even more so for the modern machines with hierarchical memory.
It is almost as if floating point operations take insignificant time, while data
access is everything. This is hard to swallow: The classical books go on in depth
about how to minimize floating point operations, but a floating point operation
(flop) count is only an indirect measure of an algorithm’s efficiency. A lower flop
count only approximately reflects that fewer data are accessed. Therefore, the
best algorithms are those which encourage data locality. One cannot expect a
summation of elements in an array to be efficient when each element is on a
separate node.
This is why we have organized the book in the following manner. Basically,
we start from the lowest level and work up.
1. Chapter 1 contains a discussion of memory and data dependencies. When
one result is written into a memory location subsequently used/modified
by an independent process, who updates what and when becomes a matter
of considerable importance.
2. Chapter 2 provides some theoretical background for the applications and
examples used in the remainder of the book.
3. Chapter 3 discusses instruction level parallelism, particularly vectoriza-
tion. Processor architecture is important here, so the discussion is often
close to the hardware. We take close looks at the Intel Pentium III,
Pentium 4, and Apple/Motorola G-4 chips.
4. Chapter 4 concerns shared memory parallelism. This mode assumes that
data are local to nodes or at least part of a coherent memory image shared
by processors. OpenMP will be the model for handling this paradigm.
5. Chapter 5 is at the next higher level and considers message passing. Our
model will be the message passing interface, MPI, and variants and tools
built on this system.
Finally, a very important decision was made to use explicit examples to show
how all these pieces work. We feel that one learns by examples and by proceeding
from the specific to the general. Our choices of examples are mostly basic and
familiar: linear algebra (direct solvers for dense matrices, iterative solvers for
large sparse matrices), Fast Fourier Transform, and Monte Carlo simulations. We
hope, however, that some less familiar topics we have included will be edifying.
For example, how does one do large problems, or high dimensional ones? It is also
not enough to show program snippets. How does one compile these things? How
does one specify how many processors are to be used? Where are the libraries?
Here, again, we rely on examples.
W. P. Petersen and P. Arbenz
Authors’ comments on the corrected second printing
We are grateful to many students and colleagues who have found errata in the
one and a half years since the first printing. In particular, we would like to thank
Christian Balderer, Sven Knudsen, and Abraham Nieva, who took the time to
carefully list errors they discovered. It is a difficult matter to keep up with such
a quickly changing area as high performance computing, both regarding
hardware developments and algorithms tuned to new machines. Thus we are
indeed thankful to our colleagues for their helpful comments and criticisms.
July 1, 2005.
ACKNOWLEDGMENTS
Our debt to our students, assistants, system administrators, and colleagues
is awesome. Former assistants have made significant contributions and include
Oscar Chinellato, Dr Roman Geus, and Dr Andrea Scascighini—particularly for
their contributions to the exercises. The help of our system gurus cannot be over-
stated. George Sigut (our Beowulf machine), Bruno Loepfe (our Cray cluster),
and Tonko Racic (our HP9000 cluster) have been cheerful, encouraging, and
at every turn extremely competent. Other contributors who have read parts of
an always changing manuscript and who tried to keep us on track have been
Prof. Michael Mascagni and Dr Michael Vollmer. Intel Corporation’s Dr Vollmer
did so much to provide technical material, examples, advice, as well as trying
hard to keep us out of trouble by reading portions of an evolving text, that
a “thank you” hardly seems enough. Other helpful contributors were Adrian
Burri, Mario Rütti, Dr Olivier Byrde of Cray Research and ETH, and Dr Bruce
Greer of Intel. Despite their valiant efforts, doubtless errors still remain for which
only the authors are to blame. We are also sincerely thankful for the support and
encouragement of Professors Walter Gander, Gaston Gonnet, Martin Gutknecht,
Rolf Jeltsch, and Christoph Schwab. Having colleagues like them helps make
many things worthwhile. Finally, we would like to thank Alison Jones, Kate
Pullen, Anita Petrie, and the staff of Oxford University Press for their patience
and hard work.
CONTENTS
List of Figures xv
List of Tables xvii
1 BASIC ISSUES 1
1.1 Memory 1
1.2 Memory systems 5
1.2.1 Cache designs 5
1.2.2 Pipelines, instruction scheduling, and loop unrolling 8
1.3 Multiple processors and processes 15
1.4 Networks 15
2 APPLICATIONS 18
2.1 Linear algebra 18
2.2 LAPACK and the BLAS 21
2.2.1 Typical performance numbers for the BLAS 22
2.2.2 Solving systems of equations with LAPACK 23
2.3 Linear algebra: sparse matrices, iterative methods 28
2.3.1 Stationary iterations 29
2.3.2 Jacobi iteration 30
2.3.3 Gauss–Seidel (GS) iteration 31
2.3.4 Successive and symmetric successive overrelaxation 31
2.3.5 Krylov subspace methods 34
2.3.6 The generalized minimal residual method (GMRES) 34
2.3.7 The conjugate gradient (CG) method 36
2.3.8 Parallelization 39
2.3.9 The sparse matrix vector product 39
2.3.10 Preconditioning and parallel preconditioning 42
2.4 Fast Fourier Transform (FFT) 49
2.4.1 Symmetries 55
2.5 Monte Carlo (MC) methods 57
2.5.1 Random numbers and independent streams 58
2.5.2 Uniform distributions 60
2.5.3 Non-uniform distributions 64
3 SIMD, SINGLE INSTRUCTION MULTIPLE DATA 85
3.1 Introduction 85
3.2 Data dependencies and loop unrolling 86
3.2.1 Pipelining and segmentation 89
3.2.2 More about dependencies, scatter/gather operations 91
3.2.3 Cray SV-1 hardware 92
3.2.4 Long memory latencies and short vector lengths 96
3.2.5 Pentium 4 and Motorola G-4 architectures 97
3.2.6 Pentium 4 architecture 97
3.2.7 Motorola G-4 architecture 101
3.2.8 Branching and conditional execution 102
3.3 Reduction operations, searching 105
3.4 Some basic linear algebra examples 106
3.4.1 Matrix multiply 106
3.4.2 SGEFA: The Linpack benchmark 107
3.5 Recurrence formulae, polynomial evaluation 110
3.5.1 Polynomial evaluation 110
3.5.2 A single tridiagonal system 112
3.5.3 Solving tridiagonal systems by cyclic reduction. 114
3.5.4 Another example of non-unit strides to achieve
parallelism 117
3.5.5 Some examples from Intel SSE and Motorola Altivec 122
3.5.6 SDOT on G-4 123
3.5.7 ISAMAX on Intel using SSE 124
3.6 FFT on SSE and Altivec 126
4 SHARED MEMORY PARALLELISM 136
4.1 Introduction 136
4.2 HP9000 Superdome machine 136
4.3 Cray X1 machine 137
4.4 NEC SX-6 machine 139
4.5 OpenMP standard 140
4.6 Shared memory versions of the BLAS and LAPACK 141
4.7 Basic operations with vectors 142
4.7.1 Basic vector operations with OpenMP 143
4.8 OpenMP matrix vector multiplication 146
4.8.1 The matrix–vector multiplication with OpenMP 147
4.8.2 Shared memory version of SGEFA 149
4.8.3 Shared memory version of FFT 151
4.9 Overview of OpenMP commands 152
4.10 Using Libraries 153
5 MIMD, MULTIPLE INSTRUCTION, MULTIPLE DATA 156
5.1 MPI commands and examples 158
5.2 Matrix and vector operations with PBLAS and BLACS 161
5.3 Distribution of vectors 165
5.3.1 Cyclic vector distribution 165
5.3.2 Block distribution of vectors 168
5.3.3 Block–cyclic distribution of vectors 169
5.4 Distribution of matrices 170
5.4.1 Two-dimensional block–cyclic matrix distribution 170
5.5 Basic operations with vectors 171
5.6 Matrix–vector multiply revisited 172
5.6.1 Matrix–vector multiplication with MPI 172
5.6.2 Matrix–vector multiply with PBLAS 173
5.7 ScaLAPACK 177
5.8 MPI two-dimensional FFT example 180
5.9 MPI three-dimensional FFT example 184
5.10 MPI Monte Carlo (MC) integration example 187
5.11 PETSc 190
5.11.1 Matrices and vectors 191
5.11.2 Krylov subspace methods and preconditioners 193
5.12 Some numerical experiments with a PETSc code 194
APPENDIX A SSE INTRINSICS FOR FLOATING POINT 201
A.1 Conventions and notation 201
A.2 Boolean and logical intrinsics 201
A.3 Load/store operation intrinsics 202
A.4 Vector comparisons 205
A.5 Low order scalar in vector comparisons 206
A.6 Integer valued low order scalar in vector comparisons 206
A.7 Integer/floating point vector conversions 206
A.8 Arithmetic function intrinsics 207
APPENDIX B ALTIVEC INTRINSICS FOR FLOATING
POINT 211
B.1 Mask generating vector comparisons 211
B.2 Conversion, utility, and approximation functions 212
B.3 Vector logical operations and permutations 213
B.4 Load and store operations 214
B.5 Full precision arithmetic functions on vector operands 215
B.6 Collective comparisons 216
APPENDIX C OPENMP COMMANDS 218
APPENDIX D SUMMARY OF MPI COMMANDS 220
D.1 Point to point commands 220
D.2 Collective communications 226
D.3 Timers, initialization, and miscellaneous 234
APPENDIX E FORTRAN AND C COMMUNICATION 235
APPENDIX F GLOSSARY OF TERMS 240
APPENDIX G NOTATIONS AND SYMBOLS 245
References 246
Index 255
LIST OF FIGURES
1.1 Intel microprocessor transistor populations since 1972. 2
1.2 Linpack benchmark optimal performance tests. 2
1.3 Memory versus CPU performance. 3
1.4 Generic machine with cache memory. 4
1.5 Caches and associativity. 5
1.6 Data address in set associative cache memory. 7
1.7 Pipelining: a pipe filled with marbles. 9
1.8 Pre-fetching 2 data one loop iteration ahead (assumes 2|n). 11
1.9 Aligning templates of instructions generated by unrolling loops. 13
1.10 Aligning templates and hiding memory latencies by pre-fetching data. 13
1.11 Ω-network. 15
1.12 Ω-network switches. 16
1.13 Two-dimensional nearest neighbor connected torus. 17
2.1 Gaussian elimination of an M ×N matrix based on Level 2 BLAS
as implemented in the LAPACK routine dgetrf. 24
2.2 Block Gaussian elimination. 26
2.3 The main loop in the LAPACK routine dgetrf, which is
functionally equivalent to dgefa from LINPACK. 27
2.4 Stationary iteration for solving Ax = b with preconditioner M. 33
2.5 The preconditioned GMRES(m) algorithm. 37
2.6 The preconditioned conjugate gradient algorithm. 38
2.7 Sparse matrix–vector multiplication y = Ax with the matrix A
stored in the CSR format. 40
2.8 Sparse matrix with band-like nonzero structure row-wise block
distributed on six processors. 41
2.9 Sparse matrix–vector multiplication y = A^T x with the matrix A
stored in the CSR format. 42
2.10 9×9 grid and sparsity pattern of the corresponding Poisson matrix
if grid points are numbered in lexicographic order. 44
2.11 9×9 grid and sparsity pattern of the corresponding Poisson matrix
if grid points are numbered in checkerboard (red-black) ordering. 44
2.12 9×9 grid and sparsity pattern of the corresponding Poisson matrix
if grid points are arranged in checkerboard (red-black) ordering. 45
2.13 Overlapping domain decomposition. 46
2.14 The incomplete Cholesky factorization with zero fill-in. 48
2.15 Graphical argument why parallel RNGs should generate parallel
streams. 63
2.16 Timings for Box–Muller method vs. polar method for generating
univariate normals. 67
2.17 AR method. 68
2.18 Polar method for normal random variates. 70
2.19 Box–Muller vs. Ziggurat method. 72
2.20 Timings on NEC SX-4 for uniform interior sampling of an n-sphere. 74
2.21 Simulated two-dimensional Brownian motion. 79
2.22 Convergence of the optimal control process. 80
3.1 Four-stage multiply pipeline: C = A∗ B. 90
3.2 Scatter and gather operations. 91
3.3 Scatter operation with a directive telling the C compiler to ignore
any apparent vector dependencies. 91
3.4 Cray SV-1 CPU diagram. 93
3.5 saxpy operation by SIMD. 94
3.6 Long memory latency vector computation. 96
3.7 Four-stage multiply pipeline: C = A∗ B with out-of-order
instruction issue. 98
3.8 Another way of looking at Figure 3.7. 99
3.9 Block diagram of Intel Pentium 4 pipelined instruction execution
unit. 100
3.10 Port structure of Intel Pentium 4 out-of-order instruction core. 100
3.11 High level overview of the Motorola G-4 structure, including the
Altivec technology. 101
3.12 Branch processing by merging results. 102
3.13 Branch prediction best when e(x) > 0. 104
3.14 Branch prediction best if e(x) ≤ 0. 104
3.15 Simple parallel version of SGEFA. 108
3.16 Times for cyclic reduction vs. the recursive procedure. 115
3.17 In-place, self-sorting FFT. 120
3.18 Double “bug” for in-place, self-sorting FFT. 121
3.19 Data misalignment in vector reads. 123
3.20 Workspace version of self-sorting FFT. 127
3.21 Decimation in time computational “bug”. 127
3.22 Complex arithmetic for d = w^k (a − b) on SSE and Altivec. 128
3.23 Intrinsics, in-place (non-unit stride), and generic FFT. Ito: 1.7 GHz
Pentium 4 130
3.24 Intrinsics, in-place (non-unit stride), and generic FFT. Ogdoad:
1.25 GHz Power Mac G-4. 133
4.1 One cell of the HP9000 Superdome. 136
4.2 Crossbar interconnect architecture of the HP9000 Superdome. 137
4.3 Pallas EFF BW benchmark. 137
4.4 EFF BW benchmark on Stardust. 138
4.5 Cray X1 MSP. 138
4.6 Cray X1 node (board). 139
4.7 NEC SX-6 CPU. 140
4.8 NEC SX-6 node. 140
4.9 Global variable dot unprotected, and thus giving incorrect results
(version I). 144
4.10 OpenMP critical region protection for global variable dot
(version II). 144
4.11 OpenMP critical region protection only for local accumulations
local dot (version III). 145
4.12 OpenMP reduction syntax for dot (version IV). 146
4.13 Times and speedups for parallel version of classical Gaussian
elimination, SGEFA. 150
4.14 Simple minded approach to parallelizing one n = 2^m FFT using
OpenMP on Stardust. 152
4.15 Times and speedups for the Hewlett-Packard MLIB version
LAPACK routine sgetrf. 154
5.1 Generic MIMD distributed-memory computer (multiprocessor). 157
5.2 Network connection for ETH Beowulf cluster. 157
5.3 MPI status struct for send and receive functions. 159
5.4 MPICH compile script. 162
5.5 MPICH (PBS) batch run script. 162
5.6 LAM (PBS) run script. 163
5.7 The ScaLAPACK software hierarchy. 163
5.8 Initialization of a BLACS process grid. 167
5.9 Eight processes mapped on a 2 × 4 process grid in row-major order. 167
5.10 Release of the BLACS process grid. 167
5.11 Cyclic distribution of a vector. 168
5.12 Block distribution of a vector. 168
5.13 Block–cyclic distribution of a vector. 169
5.14 Block–cyclic distribution of a 15 × 20 matrix on a 2 × 3 processor
grid with blocks of 2 × 3 elements. 171
5.15 The data distribution in the matrix–vector product A ∗ x = y with
five processors. 173
5.16 MPI matrix–vector multiply with row-wise block-distributed
matrix. 174
5.17 Block–cyclic matrix and vector allocation. 175
5.18 The 15 × 20 matrix A stored on a 2 × 4 process grid with big
blocks together with the 15-vector y and the 20-vector x. 175
5.19 Defining the matrix descriptors. 176
5.20 General matrix–vector multiplication with PBLAS. 176
5.21 Strip-mining a two-dimensional FFT. 180
5.22 Two-dimensional transpose for complex data. 181
5.23 A domain decomposition MC integration. 188
5.24 Cutting and pasting a uniform sample on the points. 188
5.25 The PETSc software building blocks. 190
5.26 Definition and initialization of a n ×n Poisson matrix. 191
5.27 Definition and initialization of a vector. 192
5.28 Definition of the linear solver context and of the Krylov subspace
method. 193
5.29 Definition of the preconditioner, Jacobi in this case. 194
5.30 Calling the PETSc solver. 194
5.31 Defining PETSc block sizes that coincide with the blocks of the
Poisson matrix. 195
LIST OF TABLES
1.1 Cache structures for Intel Pentium III, 4, and Motorola G-4. 4
2.1 Basic linear algebra subprogram prefix/suffix conventions. 19
2.2 Summary of the basic linear algebra subroutines. 20
2.3 Number of memory references and floating point operations for
vectors of length n. 22
2.4 Some performance numbers for typical BLAS in Mflop/s for a
2.4 GHz Pentium 4. 23
2.5 Times (s) and speed in Mflop/s of dgesv on a Pentium 4 (2.4 GHz, 1 GB). 28
2.6 Iteration steps for solving the Poisson equation on a 31 ×31 and on
a 63×63 grid. 33
2.7 Some Krylov subspace methods for Ax = b with nonsingular A. 35
2.8 Iteration steps for solving the Poisson equation on a 31 ×31 and on
a 63 ×63 grid. 39
4.1 Times t in seconds (s) and speedups S(p) for various problem sizes
n and processor numbers p for solving a random system of
equations with the general solver dgesv of LAPACK on the HP
Superdome. 141
4.2 Some execution times in microseconds for the saxpy operation. 143
4.3 Execution times in microseconds for our dot product, using the C
compiler guidec. 145
4.4 Some execution times in microseconds for the matrix–vector
multiplication with OpenMP on the HP superdome. 148
5.1 Summary of the BLACS. 164
5.2 Summary of the PBLAS. 166
5.3 Timings of the ScaLAPACK system solver pdgesv on one processor
and on 36 processors with varying dimensions of the process grid. 179
5.4 Times t and speedups S(p) for various problem sizes n and
processor numbers p for solving a random system of equations with
the general solver pdgesv of ScaLAPACK on the Beowulf cluster. 179
5.5 Execution times for solving an n^2 × n^2 linear system from the
two-dimensional Poisson problem. 195
A.1 Available binary relations for the _mm_compbr_ps and
_mm_compbr_ss intrinsics. 203
B.1 Available binary relations for comparison functions. 212
B.2 Additional available binary relations for collective comparison
functions. 216
D.1 MPI datatypes available for collective reduction operations. 230
D.2 MPI pre-defined constants. 231
1
BASIC ISSUES
No physical quantity can continue to change exponentially forever.
Your job is delaying forever.
G. E. Moore (2003)
1.1 Memory
Since first proposed by Gordon Moore (an Intel founder) in 1965, his law [107]
that the number of transistors on microprocessors doubles roughly every one to
two years has proven remarkably astute. Its corollary, that central processing unit
(CPU) performance would also double every two years or so has also remained
prescient. Figure 1.1 shows Intel microprocessor data on the number of transist-
ors beginning with the 4004 in 1972. Figure 1.2 indicates that when one includes
multi-processor machines and algorithmic development, computer performance
is actually better than Moore’s 2-year performance doubling time estimate. Alas,
however, in recent years there has developed a disagreeable mismatch between
CPU and memory performance: CPUs now outperform memory systems by
orders of magnitude according to some reckoning [71]. This is not completely
accurate, of course: it is mostly a matter of cost. In the 1980s and 1990s, Cray
Research Y-MP series machines had well balanced CPU to memory performance.
Likewise, NEC (Nippon Electric Corp.), using CMOS (see glossary, Appendix F)
and direct memory access, has well balanced CPU/Memory performance. ECL
(see glossary, Appendix F) and CMOS static random access memory (SRAM)
systems were and remain expensive and like their CPU counterparts have to
be carefully kept cool. Worse, because they have to be cooled, close packing is
difficult and such systems tend to have small storage per volume. Almost any per-
sonal computer (PC) these days has a much larger memory than supercomputer
memory systems of the 1980s or early 1990s. In consequence, nearly all memory
systems these days are hierarchical, frequently with multiple levels of cache.
Figure 1.3 shows the diverging trends between CPUs and memory performance.
Dynamic random access memory (DRAM) in some variety has become standard
for bulk memory. There are many projects and ideas about how to close this per-
formance gap, for example, the IRAM [78] and RDRAM projects [85]. We are
confident that this disparity between CPU and memory access performance will
eventually be tightened, but in the meantime, we must deal with the world as it
is. Anyone who has recently purchased memory for a PC knows how inexpensive
[Figure: number of transistors (log scale) versus year, 1970–2000, for Intel CPUs;
points labeled 4004, 8086, 80286, 80386, 80486, Pentium, Pentium-II–4, and Itanium;
the fitted line corresponds to a 2-year doubling time, with a 2.5-year doubling
indicated over part of the range.]
Fig. 1.1. Intel microprocessor transistor populations since 1972.
[Figure: Super Linpack Rmax performance in MFLOP/s (log scale) versus year,
1980–2000, with a fitted line corresponding to a 1-year doubling time; marked
machines include the Cray-1, Fujitsu VP2600, NEC SX-3, Cray T-3D, and ES.]
Fig. 1.2. Linpack benchmark optimal performance tests. Only some of the fastest
machines are indicated: Cray-1 (1984) had 1 CPU; Fujitsu VP2600 (1990)
had 1 CPU; NEC SX-3 (1991) had 4 CPUS; Cray T-3D (1996) had 2148
DEC α processors; and the last, ES (2002), is the Yokohama NEC Earth
Simulator with 5120 vector processors. These data were gleaned from various
years of the famous dense linear equations benchmark [37].
[Figure: data rate (MHz, log scale, 10–10,000) versus year, 1990–2005, for CPU,
SRAM, and DRAM.]
Fig. 1.3. Memory versus CPU performance: Samsung data [85]. Dynamic RAM
(DRAM) is commonly used for bulk memory, while static RAM (SRAM) is
more common for caches. Line extensions beyond 2003 for CPU performance
are via Moore’s law.
DRAM has become and how much it permits one to expand one's system. Eco-
nomics in part drives this gap juggernaut and diverting it will likely not occur
suddenly. However, it is interesting that the cost of microprocessor fabrication
has also grown exponentially in recent years, with some evidence of manufactur-
ing costs also doubling in roughly 2 years [52] (and related articles referenced
therein). Hence, it seems our first task in programming high performance com-
puters is to understand memory access. Computer architectural design almost
always assumes a basic principle —that of locality of reference. Here it is:
The safest assumption about the next data to be used is that they are
the same or nearby the last used.
Most benchmark studies have shown that 90 percent of the computing time is
spent in about 10 percent of the code. Whereas the locality assumption is usually
accurate regarding instructions, it is less reliable for other data. Nevertheless, it
is hard to imagine another strategy which could be easily implemented. Hence,
most machines use cache memory hierarchies whose underlying assumption is
that of data locality. Non-local memory access, in particular, in cases of non-unit
but fixed stride, are often handled with pre-fetch strategies—both in hardware
and software. In Figure 1.4, we show a more/less generic machine with two
levels of cache. As one moves up in cache levels, the cache becomes larger,
the level of associativity becomes higher (see Table 1.1 and Figure 1.5), and the
cache access bandwidth becomes lower. Additional levels are possible and often used, for
example, L3 cache in Table 1.1.
[Figure: generic machine block diagram: memory attaches via the system bus to a
bus interface unit, which connects to the L2 cache, the instruction cache, and the
L1 data cache; the fetch and decode unit and the execution unit sit behind these
caches.]
Fig. 1.4. Generic machine with cache memory.
Table 1.1 Cache structures for Intel Pentium III, 4, and Motorola G-4.
Pentium III memory access data
Channel: M ↔ L2 L2 ↔ L1 L1 ↔ Reg.
Width 64-bit 64-bit 64-bit
Size 256 kB (L2) 8 kB (L1) 8·16 B (SIMD)
Clocking 133 MHz 275 MHz 550 MHz
Bandwidth 1.06 GB/s 2.2 GB/s 4.4 GB/s
Pentium 4 memory access data
Channel: M ↔ L2 L2 ↔ L1 L1 ↔ Reg.
Width 64-bit 256-bit 256-bit
Size 256 kB (L2) 8 kB (L1) 8·16 B (SIMD)
Clocking 533 MHz 3.06 GHz 3.06 GHz
Bandwidth 4.3 GB/s 98 GB/s 98 GB/s
Power Mac G-4 memory access data
Channel: M ↔ L3 L3 ↔ L2 L2 ↔ L1 L1 ↔ Reg.
Width 64-bit 256-bit 256-bit 128-bit
Size 2 MB 256 kB 32 kB 32·16 B (SIMD)
Clocking 250 MHz 1.25 GHz 1.25 GHz 1.25 GHz
Bandwidth 2 GB/s 40 GB/s 40 GB/s 20.0 GB/s
[Figure: memory block 12 placed in an 8-block cache: anywhere in the fully
associative cache, in set 12 mod 4 = 0 of the 2-way set associative cache, and in
block 12 mod 8 = 4 of the direct mapped cache.]
Fig. 1.5. Caches and associativity. These very simplified examples have caches
with 8 blocks: a fully associative (same as 8-way set associative in this case),
a 2-way set associative cache with 4 sets, and a direct mapped cache (same
as 1-way associative in this 8 block example). Note that block 4 in memory
also maps to the same sets in each indicated cache design having 8 blocks.
1.2 Memory systems
In Figure 3.4 depicting the Cray SV-1 architecture, one can see that it is possible
for the CPU to have a direct interface to the memory. This is also true for other
supercomputers, for example, the NEC SX-4,5,6 series, Fujitsu AP3000, and
others. The advantage to this direct interface is that memory access is closer in
performance to the CPU. In effect, all the memory is cache. The downside is that
memory becomes expensive and because of cooling requirements, is necessarily
further away. Early Cray machines had twisted pair cable interconnects, all of
the same physical length. Light speed propagation delay is almost exactly 1 ns
in 30 cm, so a 1 ft waveguide forces a delay of order one clock cycle, assuming
a 1.0 GHz clock. Obviously, the further away the data are from the CPU, the
longer it takes to get. Caches, then, tend to be very close to the CPU—on-chip, if
possible. Table 1.1 indicates some cache sizes and access times for three machines
we will be discussing in the SIMD Chapter 3.
1.2.1 Cache designs
So what is a cache, how does it work, and what should we know to program it
intelligently? According to a French or English dictionary, it is a safe place
to hide things. This is perhaps not an adequate description of cache with regard
to a computer memory. More accurately, it is a safe place for storage that is close
by. Since bulk storage for data is usually relatively far from the CPU, the prin-
ciple of data locality encourages having a fast data access for data being used,
hence likely to be used next, that is, close by and quickly accessible. Caches,
then, are high speed CMOS or BiCMOS memory but of much smaller size than
the main memory, which is usually of DRAM type.
The idea is to bring data from memory into the cache where the CPU can
work on them, then modify and write some of them back to memory. According
to Hennessey and Patterson [71], about 25 percent of memory data traffic is
writes, and perhaps 9–10 percent of all memory traffic. Instructions are only
read, of course. The most common case, reading data, is the easiest. Namely,
data read but not used pose no problem about what to do with them—they are
ignored. A datum from memory to be read is included in a cacheline (block)
and fetched as part of that line. Caches can be described as direct mapped or
set associative:
• Direct mapped means a data block can go only one place in the cache.
• Set associative means a block can be anywhere within a set. If there are
m sets, the number of blocks in a set is
n = (cache size in blocks)/m,
and the cache is called an n−way set associative cache. In Figure 1.5 are
three types namely, an 8-way or fully associative, a 2-way, and a direct
mapped.
In effect, a direct mapped cache is set associative with each set consisting of
only one block. Fully associative means the data block can go anywhere in the
cache. A 4-way set associative cache is partitioned into sets each with 4 blocks;
an 8-way cache has 8 cachelines (blocks) in each set and so on. The set where
the cacheline is to be placed is computed by
(block address) mod (m = no. of sets in cache).
The machines we examine in this book have both 4-way set associative and
8-way set associative caches. Typically, the higher the level of cache, the larger
the number of sets. This follows because higher level caches are usually much
larger than lower level ones and search mechanisms for finding blocks within
a set tend to be complicated and expensive. Thus, there are practical limits on
the size of a set. Hence, the larger the cache, the more sets are used. However,
the block sizes may also change. The largest possible block size is called a page
and is typically 4 kilobytes (kB). In our examination of SIMD programming on
cache memory architectures (Chapter 3), we will be concerned with block sizes
of 16 bytes, that is, 4 single precision floating point words. Data read from cache
into vector registers (SSE or Altivec) must be aligned on cacheline boundaries.
Otherwise, the data will be mis-aligned and mis-read: see Figure 3.19. Figure 1.5
shows an extreme simplification of the kinds of caches: a cache block (number
12) is mapped into a fully associative, a 2-way set associative, or a direct mapped
cache [71]. This simplified illustration has a cache with 8 blocks, whereas a real
8 kB, 4-way cache with 16-byte blocks (cachelines) has 128 sets, each containing
4 blocks.
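As a small, concrete illustration of the set mapping rule above (a sketch only, with the cache geometry assumed to be the 8 kB, 4-way, 16-byte-block example just mentioned rather than any particular machine), the following C fragment computes the set a byte address falls into:

#include <stdio.h>

/* assumed cache geometry: 8 kB, 4-way set associative, 16-byte blocks */
#define CACHE_SIZE 8192
#define LINE_SIZE    16
#define WAYS          4
#define NSETS (CACHE_SIZE/(LINE_SIZE*WAYS))   /* = 128 sets */

int main(void)
{
   unsigned long addr  = 0x12345678UL;        /* an arbitrary byte address */
   unsigned long block = addr / LINE_SIZE;    /* block (cacheline) address */
   unsigned long set   = block % NSETS;       /* (block address) mod (no. of sets) */
   printf("address 0x%lx -> block %lu -> set %lu of %d\n",
          addr, block, set, NSETS);
   return 0;
}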
Now we ask: where does the desired cache block actually go within the set?
Two choices are common:
1. The block is placed in the set in a random location. Usually, the random
location is selected by a hardware pseudo-random number generator. This
location depends only on the initial state before placement is requested,
hence the location is deterministic in the sense it will be reproducible.
Reproducibility is necessary for debugging purposes.
2. The block is placed in the set according to a least recently used (LRU)
algorithm. Namely, the block location in the set is picked which has not
been used for the longest time. The algorithm for determining the least
recently used location can be heuristic.
The machines we discuss in this book use an approximate LRU algorithm
which is more consistent with the principle of data locality. A cache miss rate
is the fraction of data requested in a given code which are not in cache and must
be fetched from either higher levels of cache or from bulk memory. Typically it
takes c_M = O(100) cycles to get one datum from memory, but only c_H = O(1)
cycles to fetch it from low level cache, so the penalty for a cache miss is high
and a few percent miss rate is not inconsequential.
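For example (an illustrative estimate only, using the orders of magnitude of c_H and c_M above), a miss rate of just 2 percent already triples the average cost of a memory access relative to a pure cache hit:

t_{\mathrm{avg}} = 0.98\,c_H + 0.02\,c_M = 0.98 \cdot 1 + 0.02 \cdot 100 \approx 3 \ \mathrm{cycles}.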
To locate a datum in memory, an address format is partitioned into two parts
(Figure 1.6):
• A block address which specifies which block of data in memory contains
the desired datum; this is itself divided into two parts:
— a tag field which is used to determine whether the request is a hit or
a miss,
— an index field which selects the set possibly containing the datum.
• An offset which tells where the datum is relative to the beginning of the
block.
Only the tag field is used to determine whether the desired datum is in cache or
not. Many locations in memory can be mapped into the same cache block, so in
order to determine whether a particular datum is in the block, the tag portion
[Figure: an address split into a block address, consisting of a tag field followed by
an index field, and a block offset.]
Fig. 1.6. Data address in set associative cache memory.
of the block is checked. There is little point in checking any other field since the
index field was already determined before the check is made, and the offset will
be unnecessary unless there is a hit, in which case the whole block containing
the datum is available. If there is a hit, the datum may be obtained immediately
from the beginning of the block using this offset.
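The following sketch shows how the three fields would be peeled off an address with shifts and masks; the field widths (4 offset bits for 16-byte blocks, 7 index bits for 128 sets) are assumptions matching the cache example used earlier, not any particular machine:

#include <stdio.h>

#define OFFSET_BITS 4      /* 16-byte blocks -> 4 offset bits (assumed) */
#define INDEX_BITS  7      /* 128 sets       -> 7 index bits  (assumed) */

int main(void)
{
   unsigned long addr   = 0x12345678UL;
   unsigned long offset = addr & ((1UL << OFFSET_BITS) - 1);
   unsigned long index  = (addr >> OFFSET_BITS) & ((1UL << INDEX_BITS) - 1);
   unsigned long tag    = addr >> (OFFSET_BITS + INDEX_BITS);
   /* tag decides hit or miss, index picks the set,
      offset locates the datum within the block      */
   printf("tag = 0x%lx, index = %lu, offset = %lu\n", tag, index, offset);
   return 0;
}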
1.2.1.1 Writes
Writing data into memory from cache is the principal problem, even though it
occurs only roughly one-fourth as often as reading data. It will be a common
theme throughout this book that data dependencies are of much concern
in parallel computing. In writing modified data back into memory, these data
cannot be written onto old data which could be subsequently used for processes
issued earlier. Conversely, if the programming language ordering rules dictate
that an updated variable is to be used for the next step, it is clear this variable
must be safely stored before it is used. Since bulk memory is usually far away
from the CPU, why write the data all the way back to their rightful memory
locations if we want them for a subsequent step to be computed very soon? Two
strategies are in use.
1. A write through strategy automatically writes back to memory any
modified variables in cache. A copy of the data is kept in cache for sub-
sequent use. This copy might be written over by other data mapped to
the same location in cache without worry. A subsequent cache miss on
the written through data will be assured to fetch valid data from memory
because the data are freshly updated on each write.
2. A write back strategy skips the writing to memory until: (1) a subsequent
read tries to replace a cache block which has been modified, or (2) these
cache resident data are again to be modified by the CPU. These two
situations are more/less the same: cache resident data are not written
back to memory until some process tries to modify them. Otherwise,
the modification would write over computed information before it is
saved.
It is well known [71] that certain processes, I/O and multi-threading, for example,
want it both ways. In consequence, modern cache designs often permit both
write-through and write-back modes [29]. Which mode is used may be controlled
by the program.
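As a rough software picture of the bookkeeping involved (a toy sketch only: real caches do this in hardware, and the direct tag-to-block mapping below is a simplification), a write-back line carries a dirty flag so that memory is updated only when the line is evicted:

#include <stdio.h>
#include <string.h>

#define LINE 16

typedef struct {                 /* one cacheline of bookkeeping */
   unsigned long tag;
   int valid, dirty;
   char data[LINE];
} cacheline;

static char memory[1024];        /* toy bulk memory */

void store_wb(cacheline *c, unsigned long addr, char v)
{
   c->data[addr % LINE] = v;     /* write lands in cache only   */
   c->dirty = 1;                 /* memory update is deferred   */
}

void evict(cacheline *c)
{
   if (c->valid && c->dirty)     /* write back only dirty lines */
      memcpy(&memory[c->tag * LINE], c->data, LINE);
   c->valid = c->dirty = 0;
}

int main(void)
{
   cacheline c = { .tag = 2, .valid = 1, .dirty = 0 };
   store_wb(&c, 2*LINE + 3, 'x');
   evict(&c);                    /* only now does memory see the write */
   printf("%c\n", memory[2*LINE + 3]);
   return 0;
}

A write-through policy would instead copy the datum to memory inside store_wb itself and would never need the dirty flag.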
1.2.2 Pipelines, instruction scheduling, and loop unrolling
For our purposes, the memory issues considered above revolve around the same
basic problem—that of data dependencies. In Section 3.2, we will explore in more
detail some coding issues when dealing with data dependencies, but the idea, in
principle, is not complicated. Consider the following sequence of C instructions.
a[1] = f1(a[0]);
.
.
a[2] = f2(a[1]);
.
.
a[3] = f3(a[2]);
Array element a[1] is to be set when the first instruction is finished. The second,
f2(a[1]), cannot issue until the result a[1] is ready, and likewise f3(a[2]) must
wait until a[2] is finished. Computations f1(a[0]), f2(a[1]), and f3(a[2])
are not independent. There are data dependencies: the first, second, and last
must run in a serial fashion and not concurrently. However, the computation of
f1(a[0]) will take some time, so it may be possible to do other operations while
the first is being processed, indicated by the dots. The same applies to computing
the second f2(a[1]). On modern machines, essentially all operations are
pipelined: several hardware stages are needed to do any computation. That
multiple steps are needed to do arithmetic is ancient history, for example, from
grammar school. What is more recent, however, is that it is possible to do
multiple operands concurrently: as soon as a low order digit for one operand
pair is computed by one stage, that stage can be used to process the same low
order digit of the next operand pair. This notion of pipelining operations was also
not invented yesterday: the University of Manchester Atlas project implemented
[Figure: a pipe of L stages (steps 1, 2, . . . , L, L+1, . . .) being filled with numbered
marbles (operands 1, 2, 3, . . .); once the pipe is full after L steps, one marble
emerges per step.]
Fig. 1.7. Pipelining: a pipe filled with marbles.
such arithmetic pipelines as early as 1962 [91]. The terminology is an analogy to
a short length of pipe into which one starts pushing marbles, Figure 1.7. Imagine
that the pipe will hold L marbles, which will be symbolic for stages necessary to
process each operand. To do one complete operation on one operand pair takes
L steps. However, with multiple operands, we can keep pushing operands into
the pipe until it is full, after which one result (marble) pops out the other end
at a rate of one result/cycle. By this simple device, instead of n operands taking
L cycles each, that is, a total of n · L cycles, only L + n cycles are required
as long as the last operands can be flushed from the pipeline once they have
started into it. The resulting speedup is n · L/(n + L), that is, L for large n.
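As a quick numerical check of this estimate (values chosen only for the arithmetic): with L = 4 pipeline stages and n = 1000 operand pairs, the pipelined version needs n + L = 1004 cycles instead of nL = 4000, so

\mathrm{speedup} = \frac{nL}{n+L} = \frac{4000}{1004} \approx 3.98 \approx L.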
To program systems with pipelined operations to advantage, we will need to
know how instructions are executed when they run concurrently. The schema is
in principle straightforward and shown by loop unrolling transformations done
either by the compiler or by the programmer. The simple loop
for(i=0;i<n;i++){
b[i] = f(a[i]);
}
is expanded into segments of length (say) m i’s.
nms = n/m;          /* number of full segments of length m */
for(i=0;i<nms*m;i+=m){
   b[i  ] = f(a[i  ]);
   b[i+1] = f(a[i+1]);
   .
   .
   b[i+m-1] = f(a[i+m-1]);
}
/* residual segment res = n mod m */
res = n%m;
for(i=nms*m;i<nms*m+res;i++){
   b[i] = f(a[i]);
}
The first loop processes nms segments, each of which does m operations f(a[i]).
Our last loop cleans up the remaining i's when n ≠ nms · m, that is, a residual
segment. Sometimes this residual segment is processed first, sometimes last (as
shown) or for data alignment reasons, part of the res first, the rest last. We
will refer to the instructions which process each f(a[i]) as a template. The
problem of optimization, then, is to choose an appropriate depth of unrolling
m which permits squeezing all the m templates together into the tightest time
grouping possible. The most important aspect of this procedure is pre-
fetching data within the segment which will be used by subsequent
segment elements in order to hide memory latencies. That is, one wishes
to hide the time it takes to get the data from memory into registers. Such data
pre-fetching was called bottom loading in former times. Pre-fetching in its
simplest form is for m = 1 and takes the form
t = a[0]; /* prefetch a[0] */
for(i=0;i<n-1; ){
b[i] = f(t);
t = a[++i]; /* prefetch a[i+1] */
}
b[n-1] = f(t);
where one tries to hide the next load of a[i] under the loop overhead. We can go
one or more levels deeper, as in Figure 1.8 or more. If the computation of f(ti)
does not take long enough, not much memory access latency will be hidden under
f(ti). In that case, the loop unrolling level m must be increased. In every case,
we have the following highlighted purpose of loop unrolling:
The purpose of loop unrolling is to hide latencies, in particular, the delay
in reading data from memory.
Unless the stored results will be needed in subsequent iterations of the loop
(a data dependency), these stores may always be hidden: their meanderings into
memory can go at least until all the loop iterations are finished. The next section
illustrates this idea in more generality, but graphically.
1.2.2.1 Instruction scheduling with loop unrolling
Here we will explore only instruction issue and execution where these processes
are concurrent. Before we begin, we will need some notation. Data are loaded into
and stored from registers. We denote these registers by {R_i, i = 0, . . .}. Differ-
ent machines have varying numbers and types of these: floating point registers,
integer registers, address calculation registers, general purpose registers; and
anywhere from say 8 to 32 of each type or sometimes blocks of such registers
t0 = a[0]; /* prefetch a[0] */
t1 = a[1]; /* prefetch a[1] */
for(i=0;i<n-3;i+=2){
b[i ] = f(t0);
b[i+1] = f(t1);
t0 = a[i+2]; /* prefetch a[i+2] */
t1 = a[i+3]; /* prefetch a[i+3] */
}
b[n-2] = f(t0);
b[n-1] = f(t1);
Fig. 1.8. Prefetching 2 data one loop iteration ahead (assumes 2|n)
which may be partitioned in different ways. We will use the following simplified
notation for the operations on the contents of these registers:
R1 ← M : loads a datum from memory M into register R1.
M ← R1 : stores content of register R1 into memory M.
R3 ← R1 + R2 : add contents of R1 and R2 and store the result into R3.
R3 ← R1 ∗ R2 : multiply contents of R1 by R2, and put the result into R3.
More complicated operations are successive applications of basic ones. Consider
the following operation to be performed on an array A: B = f(A), where f(·)
is in two steps:
B_i = f2(f1(A_i)).
Each step of the calculation takes some time and there will be latencies in
between them where results are not yet available. If we try to perform multiple
i's together, however, say two, B_i = f(A_i) and B_{i+1} = f(A_{i+1}), the various
operations, memory fetch, f1, and f2, might run concurrently, and we could set up
two templates and try to align them. Namely, by starting the f(A_{i+1}) operation
steps one cycle after the f(A_i), the two templates can be merged together. In
Figure 1.9, each calculation f(A_i) and f(A_{i+1}) takes some number (say m) of
cycles (m = 11 as illustrated). If these two calculations ran sequentially, they
would take twice what each one requires, that is, 2 · m. By merging the two
together and aligning them to fill in the gaps, they can be computed in m+ 1
cycles. This will work only if: (1) the separate operations can run independently
and concurrently, (2) if it is possible to align the templates to fill in some of the
gaps, and (3) there are enough registers. As illustrated, if there are only eight
registers, alignment of two templates is all that seems possible at compile time.
More than that and we run out of registers. As in Figure 1.8, going deeper shows
us how to hide memory latencies under the calculation. By using look-ahead
(prefetch) memory access when the calculation is long enough, memory latencies
may be significantly hidden.
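In C, the merged schedule of Figure 1.9 corresponds roughly to a two-way unrolled loop with separate temporaries playing the role of the registers. The sketch below assumes 2|n and uses trivial stand-ins for the one-step functions f1 and f2; the actual register assignment shown in the figure is, of course, done by the compiler:

/* placeholder one-step functions, stand-ins for f1 and f2 of the text */
static float f1(float x) { return 2.0f*x; }
static float f2(float x) { return x + 1.0f; }

/* two-way unrolled evaluation of B[i] = f2(f1(A[i])), assuming 2|n */
void eval(int n, const float *A, float *B)
{
   int i;
   float r0, r1, r2, r4, r5, r6;       /* temporaries ~ registers R0..R6 */
   for (i = 0; i < n; i += 2) {
      r0 = A[i];                       /* R0 <- A_i       */
      r4 = A[i+1];                     /* R4 <- A_{i+1}   */
      r1 = f1(r0);                     /* R1 <- f1(R0)    */
      r5 = f1(r4);                     /* R5 <- f1(R4)    */
      r2 = f2(r1);                     /* R2 <- f2(R1)    */
      r6 = f2(r5);                     /* R6 <- f2(R5)    */
      B[i]   = r2;                     /* B_i     <- R2   */
      B[i+1] = r6;                     /* B_{i+1} <- R6   */
   }
}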
Our illustration is a dream, however. Usually it is not that easy. Several
problems raise their ugly heads.
1. One might run out of registers. No matter how many there are, if the
calculation is complicated enough, we will run out and no more unrolling
is possible without going up in levels of the memory hierarchy.
2. One might run out of functional units. This just says that one of the
{f_i, i = 1, . . .} operations might halt awaiting hardware that is busy. For
example, if the multiply unit is busy, it may not be possible to use it until
it is finished with multiplies already started.
3. A big bottleneck is memory traffic. If memory is busy, it may not be
possible to access data to start another template.
4. Finally, finding an optimal algorithm to align these templates is no small
matter. In Figures 1.9 and 1.10, everything fit together quite nicely. In
general, this may not be so. In fact, it is known that finding an optimal
algorithm is an NP-complete problem. This means there is no algorithm
which can compute an optimal alignment strategy in a time t which can be
represented by a polynomial in the number of steps.

[Figure 1.9: the two instruction templates

   R0 ← A_i              R4 ← A_{i+1}
   wait                  wait
   R1 ← f1(R0)           R5 ← f1(R4)
   R2 ← f2(R1)           R6 ← f2(R5)
   B_i ← R2              B_{i+1} ← R6

are merged, the second starting one cycle after the first, into the single
aligned schedule

   R0 ← A_i
   R4 ← A_{i+1}
   wait
   R1 ← f1(R0)
   R5 ← f1(R4)
   R2 ← f2(R1)
   R6 ← f2(R5)
   B_i ← R2
   B_{i+1} ← R6 ]

Fig. 1.9. Aligning templates of instructions generated by unrolling loops. We
assume 2|n, while loop variable i is stepped by 2.

[Figure 1.10: the same two templates, now with A_{i+2} and A_{i+3} pre-fetched
into R3 and R7 while the current pair is processed,

   R0 ← R3               R4 ← R7
   R3 ← A_{i+2}          R7 ← A_{i+3}
   R1 ← f1(R0)           R5 ← f1(R4)
   R2 ← f2(R1)           R6 ← f2(R5)
   B_i ← R2              B_{i+1} ← R6

merged into the single aligned schedule

   R0 ← R3
   R3 ← A_{i+2}
   R4 ← R7
   R7 ← A_{i+3}
   R1 ← f1(R0)
   R5 ← f1(R4)
   R2 ← f2(R1)
   R6 ← f2(R5)
   B_i ← R2
   B_{i+1} ← R6 ]

Fig. 1.10. Aligning templates and hiding memory latencies by pre-fetching data.
Again, we assume 2|n and the loop variable i is stepped by 2: compare with
Figure 1.9.
So, our little example is fun, but is it useless in practice? Fortunately the
situation is not at all grim. Several things make this idea extremely useful.
1. Modern machines usually have multiple copies of each functional unit: add,
multiply, shift, etc. So running out of functional units is only a bother but
not fatal to this strategy.
2. Modern machines have lots of registers, even if only temporary storage
registers. Cache can be used for this purpose if the data are not written
through back to memory.
3. Many machines allow renaming registers. For example, in Figure 1.9, as
soon as R0 is used to start the operation f1(R0), its data are in the f1
pipeline and so R0 is not needed anymore. It is possible to rename R5,
which was assigned by the compiler, and call it R0, thus providing us more
registers than we thought we had.
4. While it is true that there is no optimal algorithm for unrolling loops
into such templates and dovetailing them perfectly together, there are
heuristics for getting a good algorithm, if not an optimal one. Here is
the art of optimization in the compiler writer’s work. The result may not
be the best possible, but it is likely to be very good and will serve our
purposes admirably.
1.3 Multiple processors and processes
In the SIMD Section 3.2, we will return to loop unrolling and multiple data
processing. There the context is vector processing as a method by which machines
can concurrently compute multiple independent data. The above discussion
about loop unrolling applies in an analogous way for that mode. Namely, special
vector registers are used to implement the loop unrolling and there is a lot
of hardware support. To conclude this section, we outline the considerations for
multiple independent processors, each of which uses the same lower level instruction
level parallelism discussed in Section 1.2.2. Generally, our programming
methodology reflects the following viewpoint.
• On distributed memory machines (e.g. on ETH's Beowulf machine), the work
done by each independent processor is either a subset of iterations of an outer
loop, a task, or an independent problem.
— Outer loop level parallelism will be discussed in Chapter 5, where MPI
will be our programming choice. Control of the data is direct.
— Task level parallelism refers to large chunks of work on independent data.
As in the outer-loop level paradigm, the programmer could use MPI; or
alternatively, PVM, or pthreads.
— On distributed memory machines or networked heterogeneous systems, by
far the best mode of parallelism is by distributing independent
problems. For example, one job might be to run a simulation for one
set of parameters, while another job does a completely different set. This
mode is not only the easiest to parallelize, but is the most efficient. Task
assignments and scheduling are done by the batch queueing system.
• On shared memory machines, for example, on ETH's Cray SV-1 cluster,
or our Hewlett-Packard HP9000 Superdome machine, both task level and
outer loop level parallelism are natural modes. The programmer's job is
to specify the independent tasks by various compiler directives (e.g., see
Appendix C and the sketch following this list), but data management is done
by system software. This mode of using directives is relatively easy to program,
but has the disadvantage that parallelism is less directly controlled.
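A minimal illustration of this directive style (a toy loop assumed purely for illustration, not code from the book; compile with an OpenMP-capable C compiler, e.g. with -fopenmp or equivalent):

#include <stdio.h>

int main(void)
{
   int n = 1000000;
   double sum = 0.0;
   /* each iteration is independent, so the compiler/runtime may split the
      loop among threads; the reduction clause protects the shared sum   */
#pragma omp parallel for reduction(+:sum)
   for (int i = 0; i < n; i++) {
      sum += 1.0/((double)i + 1.0);    /* partial harmonic sum */
   }
   printf("sum = %f\n", sum);
   return 0;
}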
1.4 Networks
Two common network configurations are shown in Figures 1.11–1.13. Variants
of Ω-networks are very commonly used in tightly coupled clusters and relatively
modest sized multiprocessing systems. For example, in Chapter 4 we discuss the
NEC SX-6 (Section 4.4) and Cray X1 (Section 4.3) machines which use such
log(NCPUs) stage networks for each board (node) to tightly couple multiple
CPUs in a cache coherent memory image. In other flavors, instead of 2 → 2
switches as illustrated in Figure 1.12, these may be 4 → 4 (see Figure 4.2) or
higher order. For example, the former Thinking Machines C-5 used a quadtree
network and likewise the HP9000 we discuss in Section 4.2. For a large number of
[Figure: an 8-input, 8-output Ω-network: ports 0–7 on each side connected through
three stages of four 2 × 2 switches (sw) each.]
Fig. 1.11. Ω-network.
[Figure: the four settings of the 2 × 2 switches: straight through, cross-over, upper
broadcast, and lower broadcast.]
Fig. 1.12. Ω-network switches from Figure 1.11.
processors, cross-bar arrangements of this type can become unwieldy simply due
to the large number of switches necessary and the complexity of wiring arrange-
ments. As we will see in Sections 4.4 and 4.3, however, very tightly coupled nodes
with say 16 or fewer processors can provide extremely high performance. In our
view, such clusters will likely be the most popular architectures for supercom-
puters in the next few years. Between nodes, message passing on a sophisticated
bus system is used. Between nodes, no memory coherency is available and data
dependencies must be controlled by software.
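A convenient property of Ω-networks like the one in Figure 1.11 is self-routing: at each of the log2(N) switch stages, one bit of the destination port number selects the upper or lower switch output. The sketch below prints such a destination-tag route for N = 8 (a generic illustration of the standard scheme, not a description of any particular machine discussed in the text):

#include <stdio.h>

int main(void)
{
   int dest   = 5;                       /* destination port 5 = binary 101 */
   int stages = 3;                       /* log2(8) switch stages           */
   for (int s = 0; s < stages; s++) {
      /* most significant destination bit first: 0 -> upper, 1 -> lower */
      int bit = (dest >> (stages - 1 - s)) & 1;
      printf("stage %d: take the %s output\n", s, bit ? "lower" : "upper");
   }
   return 0;
}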
Another approach, which places processors on a tightly coupled grid, is more
amenable to a larger number of CPUs. The very popular Cray T3-D, T3-E
machines used a three-dimensional grid with the faces connected to their opposite
faces in a three-dimensional torus arrangement. A two-dimensional illustration is
shown in Figure 1.13. The generalization to three dimensions is not hard to ima-
gine, but harder to illustrate in a plane image. A problem with this architecture
was that the nodes were not very powerful. The network, however, is extremely
powerful and the success of the machine reflects this highly effective design. Mes-
sage passing is effected by very low latency primitives (shmemput, shmemget,
etc.). This mode has shown itself to be powerful and effective, but lacks portab-
ility. Furthermore, because the system does not have a coherent memory image,
compiler support for parallelism is necessarily limited. A great deal was learned
from this network’s success.
Exercise 1.1 Cache effects in FFT The point of this exercise is to get you
started: to become familiar with certain Unix utilities tar, make, ar, cc; to pick
an editor; to set up a satisfactory work environment for yourself on the machines
you will use; and to measure cache effects for an FFT.
The transformation in the problem is

    y_l = Σ_{k=0}^{n−1} ω^{±kl} x_k,    for l = 0, . . . , n − 1,    (1.1)
Fig. 1.13. Two-dimensional nearest neighbor connected torus (a 4 × 4 grid of
processing elements, PE). A three-dimensional torus has six nearest neighbors
instead of four.
with ω = e^{2πi/n} the nth root of unity. The sign in Equation (1.1) is
given by the sign argument in cfft2 and is a float. A sufficient background
for this computation is given in Section 2.4.
What is to be done? From our anonymous ftp server
http://www.inf.ethz.ch/˜arbenz/book,
download the tar file uebung1.tar from the directory Chapter1/uebung1 (using get).
1. Un-tar this file into five source files, a makefile, and an NQS batch script
(may need slight editing for different machines).
2. Execute make to generate:
(a) cfftlib.a = library of modules for the FFT (make lib).
(b) cfftst = test program (make cfftst).
3. Run this job on ALL MACHINES using the batch submission script
(via qsub).
4. From the output on each, plot Mflops (million floating pt. operations/
second) vs. problem size n. Use your favorite plotter—gnuplot, for
example, or plot by hand on graph paper.
5. Interpret the results in light of Table 1.1.
2
APPLICATIONS
I am afraid that I rather give myself away when I explain. Results without
causes are much more impressive.
Arthur Conan Doyle (Sherlock Holmes, The Stockbroker’s Clerk)
2.1 Linear algebra
Linear algebra forms the kernel of most numerical computations. It deals with
vectors and matrices and simple operations like addition and multiplication on
these objects.
Vectors are one-dimensional arrays of say n real or complex numbers
x_0, x_1, . . . , x_{n−1}. We denote such a vector by x and think of it as a column
vector,

    x = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{pmatrix}.    (2.1)
On a sequential computer, these numbers occupy n consecutive memory loca-
tions. This is also true, at least conceptually, on a shared memory multiprocessor
computer. On distributed memory multicomputers, the primary issue is how to
distribute vectors on the memory of the processors involved in the computation.
Matrices are two-dimensional arrays of the form

    A = \begin{pmatrix}
        a_{00}    & a_{01}    & \cdots & a_{0,m-1}   \\
        a_{10}    & a_{11}    & \cdots & a_{1,m-1}   \\
        \vdots    & \vdots    &        & \vdots      \\
        a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,m-1}
        \end{pmatrix}.    (2.2)
The n · m real (complex) matrix elements a_{ij} are stored in n · m (respectively
2 · n · m if a complex datatype is available) consecutive memory locations. This is
achieved by either stacking the columns on top of each other or by appending row
after row. The former is called column-major, the latter row-major order. The
actual procedure depends on the programming language. In Fortran, matrices
are stored in column-major order, in C in row-major order. There is no difference
in principle, but for writing efficient programs one has to respect how matrices
Table 2.1 Basic linear algebra subprogram prefix/suffix conventions.
Prefixes
S Real
D Double precision
C Complex
Z Double complex
Suffixes
U Transpose
C Hermitian conjugate
are laid out. To be consistent with the libraries that we will use, which are mostly
written in Fortran, we will explicitly program in column-major order. Thus, the
matrix element a_{ij} of the m × n matrix A is located i + j · m memory locations
after a_{00}. Therefore, in our C codes we will write a[i+j*m]. Notice that there
is no such simple procedure for determining the memory location of an element
of a sparse matrix. In Section 2.3, we outline data descriptors to handle sparse
matrices.
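As a small, purely illustrative example of this addressing scheme, the following
C program stores a 3 × 2 matrix column by column and reads one element back;
the macro name A_ELEM and the matrix sizes are our own choices, not part of
any library.

#include <stdio.h>

#define A_ELEM(a, i, j, m) ((a)[(i) + (j) * (m)])   /* a_ij, column-major */

int main(void)
{
    int m = 3, n = 2;                 /* a 3 x 2 matrix                    */
    double a[6];
    for (int j = 0; j < n; j++)       /* columns are contiguous in memory  */
        for (int i = 0; i < m; i++)
            A_ELEM(a, i, j, m) = 10.0 * i + j;
    printf("a_21 = %g\n", A_ELEM(a, 2, 1, m));   /* element in row 2, col 1 */
    return 0;
}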
In this and later chapters we deal with one of the simplest operations one
wants to do with vectors and matrices: the so-called saxpy operation (2.3). In
Tables 2.1 and 2.2 are listed some of the acronyms and conventions for the basic
linear algebra subprograms discussed in this book. The operation is one of the
more basic, albeit most important of these:
y = αx +y. (2.3)
Other common operations we will deal with in this book are the scalar (inner,
or dot) product (Section 3.5.6) sdot,

    s = x · y = (x, y) = Σ_{i=0}^{n−1} x_i y_i,    (2.4)
matrix–vector multiplication (Section 5.6),
y = Ax, (2.5)
and matrix–matrix multiplication (Section 3.4.1),
C = AB. (2.6)
In Equations (2.3)–(2.6), equality denotes assignment, that is, the variable on
the left side of the equality sign is assigned the result of the expression on the
right side.
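For concreteness, the axpy (2.3) and dot product (2.4) operations can be written
as the following plain C loops. These are illustrations only; in practice one calls
the corresponding optimized BLAS routines daxpy and ddot, and the function
names below are ours.

void my_daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];        /* y <- alpha x + y */
}

double my_ddot(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];                  /* s <- x . y */
    return s;
}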
Table 2.2 Summary of the basic linear algebra subroutines.

Level 1 BLAS
  ROTG, ROT                Generate/apply plane rotation
  ROTMG, ROTM              Generate/apply modified plane rotation
  SWAP                     Swap two vectors: x ↔ y
  SCAL                     Scale a vector: x ← αx
  COPY                     Copy a vector: x ← y
  AXPY                     axpy operation: y ← y + αx
  DOT                      Dot product: s ← x · y = x^T y
  NRM2                     2-norm: s ← ‖x‖_2
  ASUM                     1-norm: s ← ‖x‖_1
  I_AMAX                   Index of largest vector element:
                           first i such that |x_i| ≥ |x_k| for all k

Level 2 BLAS
  GEMV, GBMV               General (banded) matrix–vector multiply:
                           y ← αAx + βy
  HEMV, HBMV, HPMV         Hermitian (banded, packed) matrix–vector
                           multiply: y ← αAx + βy
  SYMV, SBMV, SPMV         Symmetric (banded, packed) matrix–vector
                           multiply: y ← αAx + βy
  TRMV, TBMV, TPMV         Triangular (banded, packed) matrix–vector
                           multiply: x ← Ax
  TRSV, TBSV, TPSV         Triangular (banded, packed) system solves
                           (forward/backward substitution): x ← A^{−1}x
  GER, GERU, GERC          Rank-1 updates: A ← αxy^T + A
  HER, HPR, SYR, SPR       Hermitian/symmetric (packed) rank-1 updates:
                           A ← αxx^T + A
  HER2, HPR2, SYR2, SPR2   Hermitian/symmetric (packed) rank-2 updates:
                           A ← αxy^T + αyx^T + A

Level 3 BLAS
  GEMM, SYMM, HEMM         General/symmetric/Hermitian matrix–matrix
                           multiply: C ← αAB + βC
  SYRK, HERK               Symmetric/Hermitian rank-k update:
                           C ← αAA^T + βC
  SYR2K, HER2K             Symmetric/Hermitian rank-2k update:
                           C ← αAB^T + αBA^T + βC
  TRMM                     Multiple triangular matrix–vector multiplies:
                           B ← αAB
  TRSM                     Multiple triangular system solves: B ← αA^{−1}B
An important topic of this and subsequent chapters is the solution of the
system of linear equations
Ax = b (2.7)
by Gaussian elimination with partial pivoting. Further issues are the solution of
least squares problems, Gram–Schmidt orthogonalization, and QR factorization.
2.2 LAPACK and the BLAS
By 1976 it was clear that some standardization of basic computer operations on
vectors was needed [92]. By then it was already known that coding procedures
that worked well on one machine might work very poorly on others [125]. In con-
sequence of these observations, Lawson, Hanson, Kincaid, and Krogh proposed
a limited set of Basic Linear Algebra Subprograms (BLAS) to be optimized by
hardware vendors, implemented in assembly language if necessary, that would
form the basis of comprehensive linear algebra packages [93]. These so-called
Level 1 BLAS consisted of vector operations and some attendant co-routines.
The first major package which used these BLAS kernels was LINPACK [38].
Soon afterward, other major software libraries such as the IMSL library [146]
and NAG [112] rewrote portions of their existing codes and structured new
routines to use these BLAS. Early in their development, vector computers
(e.g. [125]) saw significant optimizations using the BLAS. Soon, however, such
machines were clustered together in tight networks (see Section 1.3) and some-
what larger kernels for numerical linear algebra were developed [40, 41] to include
matrix–vector operations. Additionally, Fortran compilers were by then optim-
izing vector operations as efficiently as hand coded Level 1 BLAS. Subsequently,
in the late 1980s, distributed memory machines were in production and shared
memory machines began to have significant numbers of processors. A further
set of matrix–matrix operations was proposed [42] and soon standardized [39]
to form a Level 3. The first major package for linear algebra which used the
Level 3 BLAS was LAPACK [4] and subsequently a scalable (to large numbers
of processors) version was released as ScaLAPACK [12]. Vendors focused on
Level 1, Level 2, and Level 3 BLAS which provided an easy route to optimizing
LINPACK, then LAPACK. LAPACK not only integrated pre-existing solvers
and eigenvalue routines found in EISPACK [134] (which did not use the BLAS)
and LINPACK (which used Level 1 BLAS), but incorporated the latest dense
and banded linear algebra algorithms available. It also used the Level 3 BLAS
which were optimized by much vendor effort. In subsequent chapters, we will
illustrate several BLAS routines and considerations for their implementation on
some machines. Conventions for different BLAS are indicated by
• A root operation. For example, axpy (2.3).
• A prefix (or combination prefix) to indicate the datatype of the operands,
for example, saxpy for single precision axpy operation, or isamax for the
index of the maximum absolute element in an array of type single.
• A suffix if there is some qualifier, for example, cdotc or cdotu for the
conjugated or unconjugated complex dot product, respectively:

    cdotc(n,x,1,y,1) = Σ_{i=0}^{n−1} x_i ȳ_i,
    cdotu(n,x,1,y,1) = Σ_{i=0}^{n−1} x_i y_i,

where both x and y are vectors of complex elements.
Tables 2.1 and 2.2 give the prefix/suffix and root combinations for the BLAS,
respectively.
2.2.1 Typical performance numbers for the BLAS
Let us look at typical representations of all three levels of the BLAS, daxpy, ddot,
dgemv, and dgemm, that perform the basic operations (2.3)–(2.6). Additionally,
we look at the rank-1 update routine dger. An overview on the number of
memory accesses and floating point operations is given in Table 2.3. The Level 1
BLAS comprise basic vector operations. A call of one of the Level 1 BLAS thus
gives rise to O(n) floating point operations and O(n) memory accesses. Here, n is
the vector length. The Level 2 BLAS comprise operations that involve matrices
and vectors. If the involved matrix is n × n, then both the memory accesses and
the floating point operations are of O(n^2). In contrast, the Level 3 BLAS have a
higher order of floating point operations than memory accesses. The most
prominent operation of the Level 3 BLAS, matrix–matrix multiplication, costs
O(n^3) floating point operations while there are only O(n^2) reads and writes. The
last column in Table 2.3 shows the crucial difference between the Level 3 BLAS
and the rest.
Table 2.4 gives some performance numbers for the five BLAS of Table 2.3.
Notice that the timer has a resolution of only 1 µs! Therefore, the numbers in
Table 2.4 have been obtained by timing a loop inside of which the respective
function is called many times. The Mflop/s rates of the Level 1 BLAS ddot
Table 2.3 Number of memory references and floating point operations
for vectors of length n.

            Read       Write   Flops   Flops/mem access
  ddot      2n         1       2n      1
  daxpy     2n         n       2n      2/3
  dgemv     n^2 + n    n       2n^2    2
  dger      n^2 + 2n   n^2     2n^2    1
  dgemm     2n^2       n^2     2n^3    2n/3
Table 2.4 Some performance numbers for typ-
ical BLAS in Mflop/s for a 2.4 GHz Pentium 4.
n = 100 500 2000 10,000,000
ddot 1480 1820 1900 440
daxpy 1160 1300 1140 240
dgemv 1370 740 670 —
dger 670 330 320 —
dgemm 2680 3470 3720 —
and daxpy quite precisely reflect the ratios of the memory accesses of the two
routines, 2n vs. 3n. The high rates are for vectors that can be held in the on-chip
cache of 512 KB. The low 240 and 440 Mflop/s with the very long vectors are
related to the memory bandwidth of about 1900 MB/s (cf. Table 1.1).
The Level 2 BLAS dgemv has about the same performance as daxpy if the
matrix can be held in cache (n = 100). Otherwise, it is considerably reduced.
dger has a high volume of read and write operations, while the number of floating
point operations is limited. This leads to a very low performance rate. The
Level 3 BLAS dgemm performs at a good fraction of the peak performance of the
processor (4.8 Gflop/s). The performance increases with the problem size. We
see from Table 2.3 that the ratio of computation to memory accesses increases
with the problem size. This ratio is analogous to a volume-to-surface area effect.
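The way such Mflop/s numbers are measured can be sketched as follows. Because
the timer resolution is only about 1 µs, the BLAS call is repeated many times
inside a loop. The repetition scheme, the choice of α, and the use of
gettimeofday() below are our own choices for this illustration, not necessarily
those of the original test codes.

#include <sys/time.h>

/* Fortran BLAS routine; note the trailing underscore and call by reference. */
void daxpy_(const int *n, const double *alpha, const double *x,
            const int *incx, double *y, const int *incy);

double mflops_daxpy(int n, int repeats, double *x, double *y)
{
    const int one = 1;
    const double alpha = 1.000001;     /* keep y from growing too quickly */
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (int r = 0; r < repeats; r++)
        daxpy_(&n, &alpha, x, &one, y, &one);
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + 1.0e-6 * (t1.tv_usec - t0.tv_usec);
    return 2.0 * n * repeats / (secs * 1.0e6);   /* daxpy performs 2n flops */
}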
2.2.2 Solving systems of equations with LAPACK
2.2.2.1 Classical Gaussian elimination
Gaussian elimination is probably the most important algorithm in numerical lin-
ear algebra. In Chapter 3, in particular Section 3.4.2, we will review the classical
form of Gaussian elimination, because there we will be concerned more deeply with
hardware considerations, for which the classical representation is more appropriate.
Here and in subsequent sections, we hope it will be clear why Level 2 and Level 3
BLAS are usually more efficient than the vector operations found in the Level 1
BLAS. Given a rectangular m × n matrix A, it computes a lower triangular
matrix L and an upper triangular matrix U, such that
PA = LU, (2.8)
where P is an m×m permutation matrix, L is an m×m lower triangular matrix
with ones on the diagonal and U is an m×n upper triangular matrix, that is, a
matrix with nonzero elements only on and above its diagonal.
The algorithm is obviously well known [60] and we review it more than once
due to its importance. The main code portion of an implementation of Gaussian
elimination is found in Figure 2.1. The algorithm factors an m × n matrix in
info = 0;
for (j=0; j < min(M,N); j++){
   /* Find pivot and test for singularity */
   tmp = M-j;
   jp  = j - 1 + idamax_(&tmp,&A[j+j*M],&ONE);
   ipiv[j] = jp;
   if (A[jp+j*M] != 0.0){
      /* Interchange actual with pivot row */
      if (jp != j) dswap_(&N,&A[j],&M,&A[jp],&M);
      /* Compute j-th column of L-factor */
      s = 1.0/A[j+j*M]; tmp = M-j-1;
      if (j < M) dscal_(&tmp,&s,&A[j+1+j*M],&ONE);
   }
   else if (info == 0) info = j;
   /* Rank-one update trailing submatrix */
   if (j < min(M,N)) {
      tmp = M-j-1; tmp2 = N-j-1;
      dger_(&tmp,&tmp2,&MDONE,&A[j+1+j*M],&ONE,
            &A[j+(j+1)*M],&M,&A[j+1+(j+1)*M],&M);
   }
}
Fig. 2.1. Gaussian elimination of an M ×N matrix based on Level 2 BLAS as
implemented in the LAPACK routine dgetrf.
min(m, n) − 1 steps. In the algorithm of Figure 2.1, the step number is j. For
illustrating a typical step of the algorithm, we chose m = n + 1 = 6. After
completing the first two steps of Gaussian elimination, the first two columns of
the lower triangular factor L and the first two rows of the upper triangular
factor U are computed:

    \begin{pmatrix}
    1      &        &   &   &   &   \\
    l_{10} & 1      &   &   &   &   \\
    l_{20} & l_{21} & 1 &   &   &   \\
    l_{30} & l_{31} &   & 1 &   &   \\
    l_{40} & l_{41} &   &   & 1 &   \\
    l_{50} & l_{51} &   &   &   & 1
    \end{pmatrix}
    \begin{pmatrix}
    u_{00} & u_{01} & u_{02} & u_{03} & u_{04} \\
           & u_{11} & u_{12} & u_{13} & u_{14} \\
           &        & â_{22} & â_{23} & â_{24} \\
           &        & â_{32} & â_{33} & â_{34} \\
           &        & â_{42} & â_{43} & â_{44} \\
           &        & â_{52} & â_{53} & â_{54}
    \end{pmatrix}
    ≡ L_1 U_1 = P_1 A.    (2.9)
In the third step of Gaussian elimination, zeros are introduced below the
first element in the reduced matrix â_{ij}. Column three of L is filled below its
diagonal element. Previously computed elements l_{ij} and u_{ij} are not affected:
their elements in L may be permuted, however!
The j-th step of Gaussian elimination has three substeps:

1. Find the first index j_p of the largest element (in modulus) in pivot
   column j, that is, |â_{j_p,j}| ≥ |â_{ij}| for all i ≥ j. This is found using function
   idamax in Figure 2.1. Let P̃_j be the m-by-m permutation matrix that swaps
   entries j and j_p in a vector of length m. Notice that P̃_j is the identity
   matrix if j_p = j.

   In the example in (2.9) we have j = 2. Applying P̃_2 yields

       P̃_2 L_1 P̃_2^T · P̃_2 U_1 = P̃_2 P_1 A = P_2 A,    P_2 = P̃_2 P_1.    (2.10)

   Notice that P̃_2 swaps row j = 2 with row j_p ≥ 2. Therefore, P̃_2 L_1 P̃_2^T differs
   from L_1 only in swapped elements on columns 0 and 1 below row 1.
   P̃_2 applied to U_1 swaps two of the rows with elements â_{ik}. After the
   permutation, â_{jj} is (one of) the largest element(s) of the first column of the
   reduced matrix. Therefore, if â_{jj} = 0, then the whole column is zero.

   In the code fragment in Figure 2.1 the lower triangle of L without the
   unit diagonal and the upper triangle of U are stored in place onto matrix
   A. Evidently, from (2.9), the elements of L_k below the diagonal can be
   stored where U_k has zero elements. Therefore, applying P̃_j corresponds to
   a swap of two complete rows of A.

2. Compute the jth column of L,

       l_{ij} = â_{ij}/â_{jj},    i > j.

   This is executed in the Level 1 BLAS dscal.

3. Update the remainder of the matrix,

       â_{ik} = â_{ik} − l_{ij} â_{jk},    i, k > j.

   This operation is a rank-one update that can be done with the Level 2
   BLAS dger.
As we see in Table 2.4, the BLAS dger has quite a low performance. Also, please
look at Figure 3.15. To use the high performance Level 3 BLAS routine dgemm,
the algorithm has to be blocked!
2.2.2.2 Block Gaussian elimination
The essential idea behind a blocked algorithm is to collect a number of steps
of Gaussian elimination; this number is called the block size, say b. Assume
that we have arrived in the Gaussian elimination at a certain stage j > 0 as
indicated in Figure 2.2(a). That is, we have computed (up to row permutations)
the first j columns of the lower triangular factor L, indicated by L_0, and the first
j rows of the upper triangular factor U, indicated by U_0. We wish to eliminate
the next panel of b columns of the reduced matrix Â. To that end, split Â into four
Fig. 2.2. Block Gaussian elimination. (a) Before and (b) after an intermediate
block elimination step. (Panel (a) shows L_0, U_0 and the blocks Â_11, Â_12,
Â_21, Â_22; panel (b) shows L_0, U_0, L_11, L_21, U_11, U_12, and the updated
block Ã_22.)
portions, wherein Â_11 comprises the b × b block in the upper left corner of Â.
The sizes of the further three blocks become clear from Figure 2.2(a). The block
Gaussian elimination step now proceeds similarly to the classical algorithm (see
Figure 3.15 and p. 108) in three substeps
1. Factor the first panel of Â by classical Gaussian elimination with column
   pivoting,

       P \begin{pmatrix} Â_{11} \\ Â_{21} \end{pmatrix}
         = \begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix} U_{11}.

   Apply the row interchanges P on the “rectangular portion” of L_0 and
   \begin{pmatrix} Â_{12} \\ Â_{22} \end{pmatrix}. (The rectangular portion of L_0
   is the portion of L_0 on the left of Â.)

2. Compute U_12 by forward substitution, U_12 = L_11^{−1} Â_12. This can be done
   efficiently by a Level 3 BLAS as there are multiple right hand sides.

3. Update the rest of the matrix by a rank-b update,

       Ã_22 = Â_22 − L_21 U_12,

   by using the matrix–matrix multiplication Level 3 BLAS routine dgemm.
Block Gaussian elimination as implemented in the LAPACK routine, dgetrf,
is available from the NETLIB [111] software repository. Figure 2.3 shows
the main loop in the Fortran subroutine dgetrf for factoring an M × N mat-
rix. The block size is NB; it is chosen by LAPACK based on hardware
characteristics rather than on the problem size.
The motivation for blocking Gaussian elimination is that blocking makes it
possible to use the higher performance Level 3 BLAS. The advantage of the
Level 3 BLAS over the Level 1 and Level 2 BLAS is the higher ratio of
computations to memory accesses, see Tables 2.3 and 2.4. Let us examine the
situation for Gaussian elimination. For simplicity let us assume that the cache
can hold three b × b blocks, that is, 3b^2 floating point numbers. Furthermore, we
assume that n is divisible by b, n = bm. Then we investigate the jth block step,
1 ≤ j ≤ m. We use the notation of Figure 2.2.
      DO 20 J=1, MIN(M,N), NB
         JB = MIN(MIN(M,N)-J+1,NB)
*        Factor diagonal and subdiagonal blocks and test for
*        exact singularity.
         CALL DGETF2(M-J+1, JB, A(J,J), LDA, IPIV(J), IINFO)
*        Adjust INFO and the pivot indices.
         IF(INFO.EQ.0 .AND. IINFO.GT.0)
     $      INFO = IINFO + J - 1
         DO 10 I = J, MIN(M,J+JB-1)
            IPIV(I) = J - 1 + IPIV(I)
   10    CONTINUE
*        Apply interchanges to columns 1:J-1.
         CALL DLASWP(J-1, A, LDA, J, J+JB-1, IPIV, 1)
         IF(J+JB.LE.N) THEN
*           Apply interchanges to columns J+JB:N.
            CALL DLASWP(N-J-JB+1, A(1,J+JB), LDA, J, J+JB-1,
     $                  IPIV, 1)
*           Compute block row of U.
            CALL DTRSM('Left', 'Lower', 'No transpose', 'Unit',
     $                 JB, N-J-JB+1, ONE, A(J,J), LDA,
     $                 A(J,J+JB), LDA)
            IF(J+JB.LE.M) THEN
*              Update trailing submatrix.
               CALL DGEMM('No transpose', 'No transpose',
     $                    M-J-JB+1, N-J-JB+1, JB, -ONE,
     $                    A(J+JB,J), LDA, A(J,J+JB), LDA,
     $                    ONE, A(J+JB,J+JB), LDA)
            END IF
         END IF
   20 CONTINUE
Fig. 2.3. The main loop in the LAPACK routine dgetrf, which is functionally
equivalent to dgefa from LINPACK.
1. The in-place factorization Â_11 = L_11 U_11 costs b^2 reads and writes.
2. Keeping L_11/U_11 in cache, the backward substitution L_21 ← Â_21 U_11^{−1}
   needs the reading and storing of b · (m−j)b floating point numbers.
3. The same holds for the forward substitution U_12 ← L_11^{−1} Â_12.
4. The rank-b update can be made panel by panel. Keeping the required
   block of U_12 in cache, L_21 and the panel are read; after the update, the
   panel is stored in memory. Neglecting the access of U_12, this consumes
   2b · (m−j)b read and 2b · (m−j) write operations for each of the m−j
   panels.
Summing all the contributions of items (1)–(4) and taking n = bm into
account gives

    Σ_{j=1}^{m} [2b^2(m−j) + 3b^2(m−j)^2] = n^3/b + O(n^2).

Thus, the blocked algorithm executes 2b/3 flops per memory access. This raw
formula seems to show that b should be chosen as big as possible. However, we
have derived it under the assumption that 3b^2 floating point numbers can be
stored in cache. This assumption gives an upper bound for b.
In Table 2.5 we have collected some timings of the LAPACK routine dgesv
that comprises a full system solve of order n including factorization, forward
and backward substitution. Our numbers show that it is indeed wise to block
Gaussian elimination. A factor of three or more improvement in performance
over the classical Gaussian elimination (see p. 108) can be expected if the block
size is chosen properly. However, we get nowhere close to the performance of
dgemm. The performance depends on the block size. The default block size on
the Pentium 4 is 64. In Chapter 4, we will see that on shared memory machines,
the block algorithm even more clearly shows its superiority.
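As an illustration of how dgesv (the routine timed in Table 2.5) is called from
C, consider the following sketch. It assumes the standard LAPACK Fortran
interface with the usual trailing-underscore naming; linking (e.g. against
-llapack) is system dependent.

#include <stdio.h>
#include <stdlib.h>

void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
            int *ipiv, double *b, const int *ldb, int *info);

int main(void)
{
    const int n = 3, nrhs = 1;
    /* A stored column by column: columns (2,1,0), (1,2,1), (0,1,2). */
    double a[9] = {2, 1, 0,  1, 2, 1,  0, 1, 2};
    double b[3] = {1, 2, 3};          /* right hand side, overwritten by x */
    int ipiv[3], info;

    dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
    if (info != 0) {
        fprintf(stderr, "dgesv failed, info = %d\n", info);
        return EXIT_FAILURE;
    }
    printf("x = %g %g %g\n", b[0], b[1], b[2]);
    return 0;
}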
2.3 Linear algebra: sparse matrices, iterative methods
In many situations, either an explicit representation of matrix A or its fac-
torization is extremely large and thus awkward to handle, or such explicit
representations do not exist at all. For example, A may be a general linear
Table 2.5 Times (s) and speed in Mflop/s of dgesv on a Pentium 4
(2.4 GHz, 1 GB). There is little improvement when b ≥ 16.
n = 500 n = 1000 n = 2000
b Time (s) Speed b Time (s) Speed b Time (s) Speed
1 0.27 308 1 2.17 307 1 16.57 322
2 0.18 463 2 1.44 463 2 11.06 482
4 0.13 641 4 1.08 617 4 8.16 654
8 0.11 757 8 0.89 749 8 6.73 792
16 0.10 833 16 0.80 833 16 6.42 831
32 0.09 926 32 0.78 855 32 7.74 689
64 0.10 833 64 0.96 694 64 10.16 525
operator whose only definition is its effect on an explicit representation of a
vector x. In these situations, iterative methods for the solution of
Ax = b (2.11)
are used. Direct methods such as Gaussian elimination may be suitable for solving
problems of size n ≤ 10^5. If A is very large and sparse, that is, most of A’s elements
are zero, then the factorization A = LU in a direct solution may generate
nonzero numbers in the triangular factors L and U in places that are zero in the
original matrix A. This so-called fill-in can be so severe that storing the factors
L and U requires much more memory than A, so that a direct method becomes very
expensive or even impossible for lack of memory. In such cases, iterative
methods are required to solve (2.11). Likewise, when no explicit representation of
A exists, iterative methods are the only option. In iterative methods, a sequence
x^(0), x^(1), x^(2), . . . is constructed that converges to the solution of (2.11) for some
initial vector x^(0).
A second reason for choosing an iterative solution method is if it is not
necessary to have a highly accurate solution. As the sequence x^(0), x^(1), x^(2), . . .
converges to the solution, it is possible to discontinue the iterations if one
considers one of the x^(k) accurate enough. For example, if A is a discretization
of, say, a differential operator, then it makes no sense to solve the system of
equations more accurately than the error that is introduced by the discretization.
Or, often the determination of the elements of A is not highly accurate,
so a highly accurate solution of (2.11) would be foolish. In the next section, we
examine the so-called stationary iterative and conjugate-gradient methods.
2.3.1 Stationary iterations
Assume x^(k) is an approximation to the solution x of (2.11), Ax = b. Then

    x = x^(k) + e^(k),    (2.12)

where e^(k) is called the error. Multiplying (2.12) by A gives

    A e^(k) = A x − A x^(k) = b − A x^(k) =: r^(k).    (2.13)

Vector r^(k) is called the residual at x^(k). Determining the error e^(k) would require
knowing x, whereas the residual r^(k) is easily computed using known information
once the approximation x^(k) is available. From (2.12) and (2.13) we see that

    x = x^(k) + A^{−1} r^(k).    (2.14)
Of course, determining e^(k) = A^{−1} r^(k) is as difficult as solving the original (2.11),
so (2.14) is only of theoretical interest. Nevertheless, if A in (2.14) is replaced by
a matrix, say M, where (1) M is a good approximation of A and (2) the system
Mz = r can be solved cheaply, then one may expect that

    x^(k+1) = x^(k) + M^{−1} r^(k),    r^(k) = b − A x^(k)    (2.15)

gives a convergent sequence {x^(k)}. Another way to write (2.15) is

    M x^(k+1) = N x^(k) + b,    N = M − A.    (2.16)

Iterations (2.15) and (2.16) are called stationary iterations because the rules
to compute x^(k+1) from x^(k) do not depend on k. Matrix M is called a
preconditioner. The error changes in each iteration according to

    e^(k+1) = x − x^(k+1) = x − x^(k) − M^{−1} r^(k)
            = e^(k) − M^{−1}(b − A x^(k) − b + A x)
            = e^(k) − M^{−1} A e^(k) = (I − M^{−1}A) e^(k).

The matrix G = I − M^{−1}A is called the iteration matrix. The residuals satisfy

    r^(k+1) = (I − A M^{−1}) r^(k).    (2.17)

The matrix I − A M^{−1} is similar to G = I − M^{−1}A, which implies that it has
the same eigenvalues as G. In fact, the eigenvalues of the iteration matrix are
decisive for the convergence of the stationary iteration (2.15). Let ρ(G) denote
the spectral radius of G, that is, the largest eigenvalue of G in absolute value.
Then the following statement holds [60].

The stationary iteration (2.15) converges for any x^(0) if and only if ρ(G) < 1.
Let us now look at the most widely employed stationary iterations.
2.3.2 Jacobi iteration
By splitting A into pieces, A = L + D + U, where D is diagonal, L is strictly lower
triangular, and U is strictly upper triangular, the Jacobi iteration is obtained if
one sets M = D in (2.15),

    x^(k+1) = x^(k) + D^{−1} r^(k) = −D^{−1}(L + U) x^(k) + D^{−1} b.    (2.18)

Component-wise we have

    x_i^(k+1) = (1/a_{ii}) (b_i − Σ_{j=1, j≠i}^{n} a_{ij} x_j^(k))
              = x_i^(k) + (1/a_{ii}) (b_i − Σ_{j=1}^{n} a_{ij} x_j^(k)).
Not surprisingly, the Jacobi iteration converges if A is strictly diagonally
dominant, that is, if

    |a_{ii}| > Σ_{j≠i} |a_{ij}|,    for all i.

For a proof of this and similar statements, see [137]. A block Jacobi iteration
is obtained if the diagonal matrix D in (2.18) is replaced by a block diagonal
matrix.
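To make the component-wise formula concrete, one Jacobi sweep for a dense
matrix in column-major storage might look as follows. This is a sketch for
illustration, not a tuned kernel, and the function name is ours; xnew and x
must not alias.

void jacobi_sweep(int n, const double *a, const double *b,
                  const double *x, double *xnew)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < n; j++)
            if (j != i)
                s -= a[i + j * n] * x[j];     /* subtract a_ij x_j^(k) */
        xnew[i] = s / a[i + i * n];           /* divide by a_ii       */
    }
}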
2.3.3 Gauss–Seidel (GS) iteration
The GS iteration is obtained if we set M = D + L in (2.15) with D and L as in
the previous section. Thus, the iteration is

    (D + L) x^(k+1) = −U x^(k) + b.    (2.19)

Component-wise we have

    x_i^(k+1) = (1/a_{ii}) (b_i − Σ_{j<i} a_{ij} x_j^(k+1) − Σ_{j>i} a_{ij} x_j^(k)).

The GS iteration converges if A is symmetric and positive definite [60]. A block
GS iteration is obtained if the diagonal matrix D in (2.19) is replaced by a block
diagonal matrix.
2.3.4 Successive and symmetric successive overrelaxation
Successive overrelaxation (SOR) is a modification of the GS iteration and has
an adjustable parameter ω. Component-wise it is defined by

    x_i^(k+1) = (ω/a_{ii}) (b_i − Σ_{j<i} a_{ij} x_j^(k+1) − Σ_{j>i} a_{ij} x_j^(k))
                + (1 − ω) x_i^(k),    0 < ω < 2.

In matrix notation, this is

    M_ω x^(k+1) = N_ω x^(k) + b,    M_ω = (1/ω) D + L,
    N_ω = ((1 − ω)/ω) D − U,    (2.20)

where again D is the diagonal, L is the strictly lower triangular, and U is
the strictly upper triangular portion of A. Usually ω ≥ 1, whence the term
overrelaxation. SOR becomes the GS iteration if ω = 1. If A is symmetric, that
is, if L = U^T, SOR can be symmetrized to become Symmetric Successive
Overrelaxation (SSOR). To that end, Equation (2.20) is defined as the first
half-step

    M_ω x^(k+1/2) = N_ω x^(k) + b,    (2.21)

of an iteration rule that is complemented by a backward SOR step

    M_ω^T x^(k+1) = N_ω^T x^(k+1/2) + b.    (2.22)

Combining (2.21) and (2.22) yields

    (ω/(2 − ω)) M_ω D^{−1} M_ω^T x^(k+1) = (ω/(2 − ω)) N_ω^T D^{−1} N_ω x^(k) + b.    (2.23)

The symmetric GS iteration is obtained if ω = 1 is set in the SSOR
iteration (2.23),

    (D + L) D^{−1} (D + L^T) x^(k+1) = L D^{−1} L^T x^(k) + b.    (2.24)

(S)SOR converges if A is symmetric positive definite (spd) and 0 < ω < 2,
see [137] or [113, p. 253ff.].
Remark 2.3.1 With the definitions in (2.20) we can rewrite (2.22),

    M_ω^T x^(k+1) = ((2 − ω)/ω) D x^(k+1/2) − M_ω x^(k+1/2) + b.

As M_ω x^(k+1/2) is known from (2.21) we can omit the multiplication by L
in (2.22). A similar simplification can be made to (2.21). These savings are called
the Conrad–Wallach trick. Ortega [113] gives a different but enlightening
presentation.
Example 2.3.2 Let A be the Poisson matrix for an n × n grid. A is symmetric
positive definite and has order n^2. A has a block-tridiagonal structure where
the blocks have order n. The off-diagonal blocks are minus the identity. The
diagonal blocks are themselves tridiagonal with diagonal elements 4 and off-
diagonal elements −1. Table 2.6 gives the number of iteration steps needed to
reduce the residual by the factor 10^6 by different stationary iteration methods
and two problem sizes. The algorithm used for Example 2.3.2 takes the form
given in Figure 2.4.
Test numbers for this example show typical behavior. GS iterations reduce the
number of iterations to achieve convergence roughly by a factor of 2 compared
Table 2.6 Iteration steps for solving the Poisson equation on a 31 × 31 and
on a 63 × 63 grid with a relative residual accuracy of 10^{−6}.

  Solver                   n = 31^2    n = 63^2
  Jacobi                   2157        7787
  Block Jacobi             1093        3943
  Gauss–Seidel             1085        3905
  Block Gauss–Seidel        547        1959
  SSOR (ω = 1.8)             85         238
  Block SSOR (ω = 1.8)       61         132
Choose initial vector x_0 and convergence tolerance τ.
Set r_0 = b − A x_0, ρ_0 = ‖r_0‖, k = 0.
while ρ_k > τ ρ_0 do
    Solve M z_k = r_k.
    x_{k+1} = x_k + z_k.
    r_{k+1} = b − A x_{k+1}, ρ_{k+1} = ‖r_{k+1}‖.
    k = k + 1.
endwhile

Fig. 2.4. Stationary iteration for solving Ax = b with preconditioner M.
with the Jacobi method. So does blocking of Jacobi and GS iterations. SSOR
further reduces the number of iterations. However, for hard problems it is
often difficult or impossible to determine the optimal value of the relaxation
parameter ω.
Notice that we have listed only iteration counts in Table 2.6 and not execution
times. Each iteration of a stationary iteration consists of at least one multiplica-
tion with a portion of the system matrix A and the same number of solutions
with the preconditioner M. Except for the Jacobi iteration, the solutions consist
of solving one (or several as with SSOR) triangular systems of equations. Usually,
these linear system solutions are the most time-consuming part of the algorithm.
Another typical behavior of stationary iterations is the growth of iteration
counts as the problem size increases. When the problem size grows by a factor 4,
the iteration count using the Jacobi or GS iterations grows by the same factor.
The growth in iteration counts is not so large when using SSOR methods. Nev-
ertheless, we conclude that stationary iterations are not usually viable solution
methods for really large problems.
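A possible C rendering of the loop of Figure 2.4 is sketched below. The
matrix–vector product and the preconditioner solve are passed in as function
pointers whose signatures are our own choice for this sketch; the stopping test
follows the figure.

#include <math.h>

typedef void (*matvec_fn)(int n, const double *x, double *y);    /* y = A x */
typedef void (*precsolve_fn)(int n, const double *r, double *z); /* M z = r */

static double norm2(int n, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i] * v[i];
    return sqrt(s);
}

int stationary_solve(int n, matvec_fn Amult, precsolve_fn Msolve,
                     const double *b, double *x, double tau, int maxit,
                     double *r, double *z)        /* r, z: work vectors */
{
    Amult(n, x, r);                               /* r = b - A x        */
    for (int i = 0; i < n; i++) r[i] = b[i] - r[i];
    double rho0 = norm2(n, r), rho = rho0;

    int k = 0;
    while (rho > tau * rho0 && k < maxit) {
        Msolve(n, r, z);                          /* solve M z = r      */
        for (int i = 0; i < n; i++) x[i] += z[i]; /* x <- x + z         */
        Amult(n, x, r);                           /* r = b - A x        */
        for (int i = 0; i < n; i++) r[i] = b[i] - r[i];
        rho = norm2(n, r);
        k++;
    }
    return k;                                     /* iterations used    */
}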
2.3.5 Krylov subspace methods
It can be easily deduced from (2.15) and (2.17) that

    x^(k) = x^(0) + M^{−1} r^(k−1) + M^{−1} r^(k−2) + · · · + M^{−1} r^(0)
          = x^(0) + M^{−1} Σ_{j=0}^{k−1} (I − A M^{−1})^j r^(0)
          = x^(0) + Σ_{j=0}^{k−1} G^j z^(0),    z^(j) = M^{−1} r^(j).    (2.25)

Here, G = I − M^{−1}A is the iteration matrix defined on p. 30. The vectors z^(j) are
called preconditioned residuals. Thus, the kth iterate x^(k) shifted by x^(0) is
a particular element of the Krylov subspace K_k(G, z^(0)), which is defined as the
linear space spanned by these powers of G applied to z^(0):

    K_k(G, z^(0)) := span{z^(0), Gz^(0), . . . , G^{k−1} z^(0)}.
In Krylov subspace methods, the approximations x^(k) are found by some
criterion applied to all vectors in K_k(G, z^(0)). In this book, we consider only
the two most important Krylov subspace methods: the Generalized Minimal
RESidual method (GMRES) for solving arbitrary (nonsingular) systems and
the Preconditioned Conjugate Gradient (PCG) method for solving symmetric
positive definite systems of equations. In GMRES, x^(k) is chosen to minimize
‖z^(k)‖_2. The preconditioned conjugate gradient method determines x^(k) such
that the residual r^(k) is orthogonal to K_k(G, z^(0)).
If the vectors G^j z^(0) become closely aligned with an eigenvector corresponding
to the largest eigenvalue of G in absolute value, the set {z^(0), Gz^(0), . . . , G^{k−1} z^(0)}
becomes a poorly conditioned basis for the Krylov subspace K_k(G, z^(0)). Therefore,
a considerable amount of work in Krylov subspace methods is devoted
to the construction of a good basis of K_k(G, z^(0)). In GMRES and PCG this
is done by a modified Gram–Schmidt orthogonalization procedure [63]. In the
context of Krylov subspaces, the Gram–Schmidt orthogonalization procedure is
called Lanczos’ algorithm if G is symmetric and positive definite, and Arnoldi’s
algorithm otherwise.

For a more comprehensive discussion of Krylov subspace methods, the reader
is referred to [36]. Table 2.7 lists the most important of such methods, together
with the properties of the system matrix required for their applicability.
2.3.6 The generalized minimal residual method (GMRES)
The GMRES is often the algorithm of choice for solving the equation Ax = b.
Here, we consider a variant, the preconditioned equation

    M^{−1} A x = M^{−1} b,    (2.26)
Table 2.7 Some Krylov subspace methods for Ax = b with
nonsingular A. On the left is its name and on the right the
matrix properties it is designed for [11, 63, 128].
Algorithm Matrix type
GMRES Arbitrary
Bi-CG, Bi-CGSTAB, QMR Arbitrary
CG Symmetric positive definite
MINRES, SYMMLQ Symmetric
QMRS J-symmetric
where the preconditioner M is in some sense a good approximation of A and
systems of the form Mz = r can be solved much more easily than Ax = b.

In GMRES, for k = 1, 2, . . . , an orthogonal basis {q_1, q_2, . . . , q_k} is constructed
from the Krylov subspace K_k(z_0, M^{−1}A). Notice that K_k(z_0, M^{−1}A) =
K_k(z_0, G). Here, z_0 = M^{−1} r_0, where r_0 = b − A x_0 for some initial vector
x_0. The algorithm proceeds recursively in that the computation of the orthogonal
basis of K_{k+1}(z_0, M^{−1}A) makes use of the already computed orthogonal basis of
K_k(z_0, M^{−1}A).
To see how this works, assume q_1, q_2, . . . , q_k have already been computed.
The normalized vector q_1 is chosen by q_1 = z_0/‖z_0‖_2. We get q_{k+1} in the
following way:

• Compute the auxiliary vector y_k = M^{−1} A q_k. It can be shown that y_k
  is a linear combination of (M^{−1}A)^k q_1 and of q_1, . . . , q_k.
• Construct a new vector y'_k from y_k, orthogonal to q_1, q_2, . . . , q_k, by

      y'_k = y_k − Σ_{j=1}^{k} q_j h_{jk},    h_{jk} = q_j^T y_k.    (2.27)

• Normalize y'_k to make q_{k+1},

      h_{k+1,k} = ‖y'_k‖_2,    q_{k+1} = y'_k / h_{k+1,k}.    (2.28)

Let Q_k = [q_1, q_2, . . . , q_k]. Then, (2.27) and (2.28) for j = 1, . . . , k can be
collected in the form

    M^{−1} A Q_k = Q_{k+1} H_{k+1,k} = Q_k H_{k,k} + h_{k+1,k} q_{k+1} e_k^T    (2.29)

with

    H_{k+1,k} = \begin{pmatrix}
    h_{11} & h_{12} & \cdots & h_{1,k}   \\
    h_{21} & h_{22} & \cdots & h_{2,k}   \\
           & h_{32} & \cdots & h_{3,k}   \\
           &        & \ddots & \vdots    \\
           &        &        & h_{k+1,k}
    \end{pmatrix}.
H_{k,k} is obtained by deleting the last row from H_{k+1,k}. The unit vector e_k is the
last row of the k × k identity matrix. H_{k+1,k} and H_{k,k} are called Hessenberg
matrices [152]. They are upper triangular matrices with an additional nonzero
first lower off-diagonal.

If h_{k+1,k} = ‖y'_k‖_2 is very small we may declare the iterations to have
converged. Then M^{−1}A Q_k = Q_k H_{k,k} and x_k is in fact the solution x of (2.26).
In general, the approximate solution x_k = Q_k y_k is determined such that
‖M^{−1}b − M^{−1}A x_k‖_2 becomes minimal. We have

    ‖M^{−1}(b − A x_k)‖_2 = ‖ ‖b‖_2 q_1 − M^{−1}A Q_k y_k ‖_2
                          = ‖ ‖b‖_2 Q_{k+1} e_1 − Q_{k+1} H_{k+1,k} y_k ‖_2
                          = ‖ ‖b‖_2 e_1 − H_{k+1,k} y_k ‖_2,    (2.30)

where the last equality holds since Q_{k+1} has orthonormal columns. Thus, y_k is
obtained by solving the small (k + 1) × k least squares problem

    H_{k+1,k} y_k = ‖b‖_2 e_1.    (2.31)
Since H_{k+1,k} is a Hessenberg matrix, the computation of its QR factorization
only costs k Givens rotations. Furthermore, the QR factorization of H_{k+1,k}
can be computed in O(k) floating point operations from the QR factorization
of H_{k,k−1}.
The GMRES method is very memory consumptive. In addition to the
matrices A and M, m + O(1) vectors of length n have to be stored because
all the vectors q_j are needed to compute x_k from y_k. A popular way to limit
the memory consumption of GMRES is to restart the algorithm after a certain
number, say m, of iteration steps. In the GMRES(m) algorithm, the approximate
solution x_m is used as the initial approximation x_0 of a completely new
GMRES run. Independence of the GMRES runs gives additional freedom. For
example, it is possible to change the preconditioner every iteration, see [36].

We give a possible implementation of GMRES in Figure 2.5. Notice that the
application of the Givens rotations sets to zero the nonzeros below the diagonal of
H_{k,k}. Vector s_{1,...,m} denotes the first m elements of s. The absolute value |s_m|
of the last component of s equals the norm of the preconditioned residual. If one
checks this norm as a convergence criterion, it is easy to test for convergence at
each iteration step [11].
2.3.7 The conjugate gradient (CG) method
If the system matrix A and the preconditioner M are both symmetric and
positive definite, then M^{−1}A appearing in the preconditioned system (2.26) is
symmetric and positive definite with respect to the M-inner product
⟨x, y⟩ ≡ x^T M y.
Choose an initial guess x_0, r_0 = b − A x_0, ρ_0 = ‖r_0‖.
while 1 do
    Compute preconditioned residual z_0 from M z_0 = r_0.
    q_1 = z_0/‖z_0‖_2, s = ‖z_0‖_2 e_1.
    for k = 1, 2, . . . , m do
        Solve M y_k = A q_k.
        for j = 1, . . . , k do
            h_{jk} = q_j^T y_k, y_k = y_k − q_j h_{jk}
        end
        h_{k+1,k} = ‖y_k‖_2, q_{k+1} = y_k/h_{k+1,k}.
        Apply J_1, . . . , J_{k−1} on the last column of H_{k+1,k}.
        Generate and apply a Givens rotation J_k acting on the last
            two rows of H_{k+1,k} to annihilate h_{k+1,k}.
        s = J_k s
    end
    Solve the upper triangular system H_{m,m} t = s_{1..m}.
    x_m = x_0 + [q_1, q_2, . . . , q_m] t, r_m = b − A x_m
    if ‖r_m‖_2 < ε · ρ_0 then
        Return with x = x_m as the approximate solution
    end
    x_0 = x_m, r_0 = r_m
end

Fig. 2.5. The preconditioned GMRES(m) algorithm.
In fact, for all x, y, we have

    ⟨x, M^{−1}Ay⟩_M = x^T M (M^{−1}A y) = x^T A y
                    = x^T A M^{−1} M y = (M^{−1}A x)^T M y = ⟨M^{−1}Ax, y⟩_M.
If we execute the GMRES algorithm using this M-inner product, then the
Hessenberg matrix H_{k,k} becomes symmetric, that is, tridiagonal. This implies
that y_k in (2.27) has to be explicitly made orthogonal only to q_k and q_{k−1}
to become orthogonal to all previous basis vectors q_1, q_2, . . . , q_k. So, for the
expansion of the sequence of Krylov subspaces only three basis vectors have to
be stored in memory at a time. This is the Lanczos algorithm.

Like GMRES, the Lanczos algorithm constructs an (M-)orthonormal basis
of a Krylov subspace where the generating matrix is symmetric. In contrast, the
CG method, first suggested by Hestenes and Stiefel [72], directly targets improving
the approximations x_k of the solution x of (2.26). x_k is improved along a search
direction, p_{k+1}, such that the functional ϕ = (1/2) x^T A x − x^T b becomes minimal
at the new approximation x_{k+1} = x_k + α_k p_{k+1}. ϕ has its global minimum
Choose x_0 and a convergence tolerance ε.
Set r_0 = b − A x_0 and solve M z_0 = r_0.
Set ρ_0 = z_0^T r_0, p_1 = z_0.
for k = 1, 2, . . . do
    q_k = A p_k.
    α_k = ρ_{k−1}/p_k^T q_k.
    x_k = x_{k−1} + α_k p_k.
    r_k = r_{k−1} − α_k q_k.
    Solve M z_k = r_k.
    if ‖r_k‖ < ε ‖r_0‖ then exit.
    ρ_k = z_k^T r_k.
    β_k = ρ_k/ρ_{k−1}.
    p_{k+1} = z_k + β_k p_k.
end

Fig. 2.6. The preconditioned conjugate gradient algorithm.
at x. The preconditioned conjugate gradient (PCG) algorithm is shown in
Figure 2.6. In the course of the algorithm, residual and preconditioned residual
vectors r_k = b − A x_k and z_k = M^{−1} r_k, respectively, are computed. The
iteration is usually stopped if the norm of the actual residual r_k has dropped
below a small fraction ε of the initial residual r_0. The Lanczos and CG algorithms
are closely related. An elaboration of this relationship is given in Golub and Van
Loan [60].
Both GMRES and CG have a finite termination property. That is, at step n,
the algorithms will terminate. However, this is of little practical value since to
be useful the number of iterations needed for convergence must be much lower
than n, the order of the system of equations. The convergence rate of the CG
algorithm is given by

    ‖x − x_k‖_A ≤ ‖x − x_0‖_A ((√κ − 1)/(√κ + 1))^k,

where κ = ‖M^{−1}A‖ ‖A^{−1}M‖ is the condition number of M^{−1}A. In general, κ is
close to unity only if M is a good preconditioner for A.
Example 2.3.3 We consider the same problem as in Example 2.3.2 but now
use the conjugate gradient method as our solver, preconditioned by one step of
a stationary iteration. Table 2.8 lists the number of iteration steps needed to
reduce the residual by the factor ε = 10^{−6}. The numbers of iteration steps have
been reduced by at least a factor of 10. Because the work per iteration step is
not much larger with PCG than with stationary iterations, the execution times
are similarly reduced by large factors.
Table 2.8 Iteration steps for solving the Poisson equation on a 31 × 31 and
on a 63 × 63 grid with a relative residual accuracy of 10^{−6}. PCG with
preconditioner as indicated.

  Preconditioner                  n = 31^2    n = 63^2
  Jacobi                          76          149
  Block Jacobi                    57          110
  Symmetric Gauss–Seidel          33           58
  Symmetric block Gauss–Seidel    22           39
  SSOR (ω = 1.8)                  18           26
  Block SSOR (ω = 1.8)            15           21
2.3.8 Parallelization
The most time consuming portions of PCG and GMRES are the matrix–
vector multiplications by system matrix A and solving linear systems with
the preconditioner M. Computations of inner products and vector norms can
become expensive, however, when vectors are distributed over memories with
weak connections. Thus, in this section we consider parallelizing these three
crucial operations. We assume that the sparse matrices are distributed block
row-wise over the processors. Accordingly, vectors are distributed in blocks.
While we fix the way we store our data, we do not fix in advance the number-
ing of rows and columns in matrices. We will see that this numbering can strongly
affect the parallelizability of the crucial operations in PCG and GMRES.
2.3.9 The sparse matrix vector product
There are a number of ways sparse matrices can be stored in memory. For an
exhaustive overview on sparse storage formats see [73, p. 430ff.]. Here, we only
consider the popular compressed sparse row (CSR, or Yale storage) format. It
is the basic storage mechanism used in SPARSKIT [129] or PETSc [9] where it
is called generalized sparse AIJ format. Another convenient format, particularly
when A is structurally symmetric (i.e. if a_{ij} ≠ 0 → a_{ji} ≠ 0), is the BLSMP
storage from Kent Smith [95].
In the CSR format, the nonzero elements of an n × m matrix are stored
row-wise in a list val of length nnz equal to the number of nonzero elements
of the matrix. Together with the list val there is an index list col ind, also
of length nnz, and another index list col ptr of length n + 1. List element
col ind[i] holds the column index of the matrix element stored in val[i]. List
item col ptr[j] points to the location in val where the first nonzero matrix
element of row j is stored, while col ptr[n] points to the memory location right
for (i=0; i < n; i++){
   y[i] = 0.0;
   for (j=col_ptr[i]; j < col_ptr[i+1]; j++)
      y[i] = y[i] + val[j]*x[col_ind[j]];
}

Fig. 2.7. Sparse matrix–vector multiplication y = Ax with the matrix A stored
in the CSR format.
behind the last element in val. For illustration, let

    A = \begin{pmatrix}
    1 & 0 & 0 & 0 & 0  \\
    0 & 2 & 5 & 0 & 0  \\
    0 & 3 & 6 & 0 & 9  \\
    0 & 4 & 0 & 8 & 0  \\
    0 & 0 & 7 & 0 & 10
    \end{pmatrix}.

Then,

    val     = [1, 2, 5, 3, 6, 9, 4, 8, 7, 10],
    col_ind = [1, 2, 3, 2, 3, 5, 2, 4, 3, 5],
    col_ptr = [1, 2, 4, 7, 9, 11].

If A is symmetric, then it is only necessary to store either the upper or the lower
triangle.
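For direct use with the C code of Figure 2.7, which indexes from 0, the same
example reads as follows; the shift to 0-based indices is our translation of the
1-based lists above.

int    n         = 5;
int    nnz       = 10;
double val[]     = {1, 2, 5, 3, 6, 9, 4, 8, 7, 10};
int    col_ind[] = {0, 1, 2, 1, 2, 4, 1, 3, 2, 4};
int    col_ptr[] = {0, 1, 3, 6, 8, 10};   /* length n+1; col_ptr[n] = nnz */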
A matrix–vector multiplication y = Ax can be coded in the way shown
in Figure 2.7. Here, the elements of the vector x are addressed indirectly via
the index array col_ind. We will revisit such indirect addressing in Chapter 3,
Section 3.2.2, where the x[col_ind[j]] fetches are called gather operations.
Indirect addressing can be very slow due to the irregular memory access pat-
tern. Block CSR storage, however, can improve performance significantly [58].
Parallelizing the matrix–vector product is straightforward. Using OpenMP on
a shared memory computer, the outer loop is parallelized with a compiler dir-
ective just as in the dense matrix–vector product, for example, Section 4.8.1. On
a distributed memory machine, each processor gets a block of rows. Often each
processor gets ⌈n/p⌉ rows, so that the last processor may have less work
to do. Sometimes, for example, in PETSc [9], the rows are distributed so that
each processor holds (almost) the same number of rows or nonzero elements.
The vectors x and y are distributed accordingly. Each processor executes a code
segment as given in Figure 2.7 to compute its segment of y. However, there will
be elements of vector x stored on other processors that may have to be gathered.
Unlike the dense case, where each processor needs to get data from every other
processor and thus calls MPI_Allgather, in the sparse case there are generally
Fig. 2.8. Sparse matrix with band-like nonzero structure row-wise block
distributed on six processors P_0, . . . , P_5.
just a few nonzero elements required from the other processor memories. Fur-
thermore, not all processors need the same, or same amount of, data. Therefore,
a call to MPI_Alltoallv may be appropriate.
Very often communication can be reduced by reordering the matrix. For
example, if the nonzero elements of the matrix all lie inside a band as depicted
in Figure 2.8, then a processor needs only data from nearest neighbors to be able
to form its portion of Ax.
It is in principle possible to hide some communication latencies by a careful
coding of the necessary send and receive commands. To see this, we split the
portion A_i of the matrix A that is local to processor i into three submatrices. A_{ii}
is the diagonal block that is to be multiplied with the local portion x_i of x.
A_{i,i−1} and A_{i,i+1} are the submatrices of A_i that are to be multiplied with x_{i−1}
and x_{i+1}, the portions of x that reside on processors i − 1 and i + 1, respectively.
Then, by proceeding in the following three steps

• Step 1: Form y_i = A_{ii} x_i (these are all local data) and concurrently
  receive x_{i−1}.
• Step 2: Update y_i ← y_i + A_{i,i−1} x_{i−1} and concurrently receive x_{i+1}.
• Step 3: Update y_i ← y_i + A_{i,i+1} x_{i+1}.

some fraction of the communication can be hidden under the computation. This
technique is called latency hiding and exploits non-blocking communication as
implemented in MPI_Isend and MPI_Irecv (see Appendix D). An analogous
latency hiding in the SIMD mode of parallelism is given in Figure 3.6, where
it is usually more effective.
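The three-step scheme can be coded with non-blocking MPI calls roughly as
follows. This is a sketch: the helper spmv_block() stands for a local sparse
matrix–vector product with one of the three submatrices (adding into y_i) and
is not shown, all neighbor pieces of x are assumed to have length nloc, and
boundary processors without a left or right neighbor are not handled.

#include <mpi.h>

void matvec_overlap(int nloc, const double *xi, double *yi,
                    double *xim1, double *xip1,   /* receive buffers */
                    int left, int right,          /* neighbor ranks  */
                    void (*spmv_block)(const char *which,
                                       const double *x, double *y))
{
    MPI_Request req[4];

    /* Post receives for the neighbors' pieces of x ...                  */
    MPI_Irecv(xim1, nloc, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(xip1, nloc, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    /* ... and send our own piece to both neighbors.                     */
    MPI_Isend(xi, nloc, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(xi, nloc, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);

    /* Step 1: local part y_i = A_ii x_i while messages are in flight.   */
    spmv_block("diag", xi, yi);

    /* Step 2: wait for x_{i-1}, then y_i += A_{i,i-1} x_{i-1}.          */
    MPI_Wait(&req[0], MPI_STATUS_IGNORE);
    spmv_block("lower", xim1, yi);

    /* Step 3: wait for x_{i+1}, then y_i += A_{i,i+1} x_{i+1}.          */
    MPI_Wait(&req[1], MPI_STATUS_IGNORE);
    spmv_block("upper", xip1, yi);

    /* Make sure the sends have completed before xi is reused.           */
    MPI_Waitall(2, &req[2], MPI_STATUSES_IGNORE);
}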
In Figure 2.7 we have given a code snippet for forming y = Ax. We have seen
that the vector x was accessed indirectly. If one multiplies with A^T, that is, if one
for (j=0; j < m; j++) y[j] = 0.0;
for (i=0; i < n; i++){
   for (j=col_ptr[i]; j < col_ptr[i+1]; j++)
      y[row_ind[j]] = y[row_ind[j]] + val[j]*x[i];
}

Fig. 2.9. Sparse matrix–vector multiplication y = A^T x with the matrix A stored
in the CSR format.
forms y = A^T x, then the result vector is accessed indirectly, see Figure 2.9. In
this case, the y[row_ind[j]] elements are scattered in a nonuniform way back into
memory, an operation we revisit in Section 3.2.2. Since A^T is stored column-wise,
we first form local portions y_i = A_i^T x_i of y without communication. If A has
a banded structure as indicated in Figure 2.8, then forming y = Σ_i y_i involves
only the nearest neighbor communications.

If only the upper or lower triangle of a symmetric matrix is stored, then both
matrix–vector multiplies have to be used, one for multiplying with the original
matrix, one for multiplying with its transpose.
Remark 2.3.4 The above matrix–vector multiplications should, of course, be
constructed on a set of sparse BLAS. Work on a set of sparse BLAS is in
progress [43].
2.3.10 Preconditioning and parallel preconditioning
2.3.10.1 Preconditioning with stationary iterations
As you can see in Figures 2.5 and 2.6, both GMRES and PCG require the
solution of a system of equations Mz = r, where M is the preconditioner. Since
M approximates A it is straightforward to execute a fixed number of steps of
stationary iterations to approximately solve Az = r.
Let A = M_1 − N_1 be a splitting of A, where M_1 is an spd matrix. As we
derived in Section 2.3.1, the corresponding stationary iteration for solving Az = r
is given by

    M_1 z^(k) = N_1 z^(k−1) + r    or    z^(k) = G z^(k−1) + c,    (2.32)

where G = M_1^{−1} N_1 = I − M_1^{−1} A is the iteration matrix and c = M_1^{−1} r. If the
iteration is started with z^(0) = 0, then we have seen in (2.25) that

    z^(k) = Σ_{j=0}^{k−1} G^j c = Σ_{j=0}^{k−1} G^j M_1^{−1} r,    k > 0.    (2.33)

The sum Σ_{j=0}^{k−1} G^j is a truncated approximation to
A^{−1} M_1 = (I − G)^{−1} = Σ_{j=0}^{∞} G^j.
Thus, if the preconditioner M is defined by applying m steps of a stationary
iteration, then

    z = M^{−1} r = (I + G + · · · + G^{m−1}) M_1^{−1} r.    (2.34)
This preconditioner is symmetric if A and M_1 are also symmetric.

We note that the preconditioner is not applied in the form (2.34), but as the
iteration given in (2.32). Frequently, the preconditioner consists of only one step
of the iteration (2.32). In that case, M = M_1.
2.3.10.2 Jacobi preconditioning
In Jacobi preconditioning, we set M_1 = D = diag(A). If m = 1 in (2.32)–(2.34),
then

    z = z^(1) = D^{−1} r,

which is easy to parallelize. D^{−1} r^(k) corresponds to an elementwise vector–vector
or Hadamard product. To improve its usually mediocre quality without degrading
the parallelization property, Jacobi preconditioning can be replaced by block
Jacobi preconditioning. The diagonal M_1 is replaced by a block diagonal M_1
such that each block resides entirely on one processor.
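As a small sketch (the function name is ours, and d is assumed to hold the
diagonal of A), the application of the plain Jacobi preconditioner is just an
elementwise division, which parallelizes trivially, for example with an OpenMP
directive:

void jacobi_precond(int n, const double *d, const double *r, double *z)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        z[i] = r[i] / d[i];     /* z = D^{-1} r, a Hadamard-type operation */
}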
2.3.10.3 Parallel Gauss–Seidel preconditioning
Gauss–Seidel preconditioning is defined by

    z_i^(k+1) = (1/a_{ii}) (c_i − Σ_{j<i} a_{ij} z_j^(k+1) − Σ_{j>i} a_{ij} z_j^(k)),    (2.35)
from Section 2.3.3. Each step of the GS iteration requires the solution of a trian-
gular system of equations. From the point of view of computational complexity
and parallelization, SOR and GS are very similar.
If the lower triangle of L is dense, then GS is tightly recursive. In the case
where L is sparse, the degree of parallelism depends on the sparsity pattern. Just
like solving tridiagonal systems, the degree of parallelism can be improved by
permuting A. Note that the original matrix A is permuted and not L. L inherits
the sparsity pattern of the lower triangle of A.
The best known permutation is probably the checkerboard or red-black order-
ing for problems defined on rectangular grids. Again (Example 2.3.2), let A be
the matrix that is obtained by discretizing the Poisson equation −∆u = f (∆ is
the Laplace operator) on a square domain by 5-point finite differences in a 9×9
grid. The grid and the sparsity pattern of the corresponding matrix A is given
in Figure 2.10 for the case where the grid points are numbered in lexicographic
order, that is, row by row. Since there are elements in the first lower off-diagonal
of A, and thus of L, the update of z
(k+1)
i
in (2.35) has to wait until z
(k+1)
i−1
is
known. If the unknowns are colored as indicated in Figure 2.11(left) and first the
white (red) unknowns are numbered and subsequently the black unknowns, the
Fig. 2.10. 9 × 9 grid (left) and sparsity pattern of the corresponding Poisson
matrix (right) if grid points are numbered in lexicographic order.
Fig. 2.11. 9 × 9 grid (left) and sparsity pattern of the corresponding Poisson
matrix (right) if grid points are numbered in checkerboard (often called red-
black) ordering. Note that the color red is indicated as white in the figure.
sparsity pattern is changed completely, cf. Figure 2.11(right). Now the matrix
has a 2 × 2 block structure,

    A = \begin{pmatrix} D_w & A_{wb} \\ A_{bw} & D_b \end{pmatrix},

where the diagonal blocks are diagonal. Here, the indices “w” and “b” refer to
white (red) and black unknowns. With this notation, one step of SOR becomes

    z_w^(k+1) = D_w^{−1} (b_w − A_{wb} z_b^(k)),
    z_b^(k+1) = D_b^{−1} (b_b − A_{bw} z_w^(k+1)).    (2.36)
This structure admits parallelization if both the black and the white (red)
unknowns are distributed evenly among the processors. Each of the two
steps in (2.36) then comprises a sparse matrix–vector multiplication and a
multiplication with the inverse of a diagonal matrix. The first has been discussed
in the previous section; the latter is easy to parallelize, again as a Hadamard
product.
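A sketch of the two half-steps (2.36), specialized to the 5-point Poisson matrix
on an nx × ny grid and written in the Gauss–Seidel (ω = 1) form, is the
following. The data layout (u[i + j*nx]) and the function name are our choices,
and zero Dirichlet boundary values are assumed to be eliminated into b; points
with i + j even play the role of the white (red) unknowns, the others are black.

void redblack_sweep(int nx, int ny, const double *b, double *u)
{
    for (int color = 0; color < 2; color++) {       /* 0: red, 1: black */
        for (int j = 0; j < ny; j++) {
            for (int i = 0; i < nx; i++) {
                if (((i + j) & 1) != color) continue;
                double s = b[i + j * nx];
                if (i > 0)      s += u[(i - 1) + j * nx];
                if (i < nx - 1) s += u[(i + 1) + j * nx];
                if (j > 0)      s += u[i + (j - 1) * nx];
                if (j < ny - 1) s += u[i + (j + 1) * nx];
                u[i + j * nx] = s / 4.0;             /* diagonal element 4 */
            }
        }
    }
}

All points of one color depend only on points of the other color, so each of the
two inner sweeps can be distributed over the processors.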
2.3.10.4 Domain decomposition
Instead of coloring single grid points, one may prefer to color groups of grid
points or grid points in subdomains as indicated in Figure 2.12—again for the
Poisson equation on the square grid. To formalize this approach, let us denote the
whole computational domain Ω and the subdomains Ω
i
, i = 1, . . . , d. We require
that Ω is covered by the Ω
i
, Ω = ∪
i

i
. This means that each point x ∈ Ω: in
particular, each grid point is an element of at least one of the subdomains. This
approach is called domain decomposition [133]. In the example in Figure 2.12,
all subdomains are mutually disjoint: Ω
i
∩ Ω
j
= ∅, if i = j. In Figure 2.13, the
subdomains overlap.
Let R_j be the matrix consisting of those columns of the identity matrix
that correspond to the numbers of the grid points in Ω_j. (R_j^T extracts from
a vector those components that belong to subdomain Ω_j.) Then Ω = ∪_j Ω_j
implies that each column of the identity matrix appears in at least one of the R_j.
Furthermore, if the subdomains do not overlap, then each column of the identity
matrix appears in exactly one of the R_j. By means of the R_j, the blocks on the
diagonal of A and the residuals corresponding to the jth subdomain are

    A|_{Ω_j} = A_j = R_j^T A R_j,
    (b − A z^(k))|_{Ω_j} = R_j^T (b − A z^(k)).    (2.37)
When we step through the subdomains, we update those components of
the approximation z^(k) corresponding to the respective domain. This can be
Fig. 2.12. 9 × 9 grid (left) and sparsity pattern of the corresponding Pois-
son matrix (right) if grid points are arranged in checkerboard (red-black)
ordering.

Fig. 2.13. Overlapping domain decomposition (overlapping subdomains
Ω_1, Ω_2, Ω_3, . . . , Ω_j).
written as

    z^(k+j/d) = z^(k+(j−1)/d) + R_j A_j^{−1} R_j^T (b − A z^(k+(j−1)/d))
              ≡ z^(k+(j−1)/d) + B_j r^(k+(j−1)/d),    j = 1, . . . , d.    (2.38)

We observe that P_j := B_j A is an A-orthogonal projector, that is,

    P_j = P_j^2,    A P_j = P_j^T A.

Iteration (2.38) is called a multiplicative Schwarz procedure [131]. The
adjective “multiplicative” comes from the behavior of the iteration matrices.
From (2.38) we see that

    e^(k+j/d) = (I − R_j (R_j^T A R_j)^{−1} R_j^T A) e^(k+(j−1)/d) = (I − B_j A) e^(k+(j−1)/d).

Thus,

    e^(k+1) = (I − B_d A) · · · (I − B_1 A) e^(k).

The iteration matrix of the whole step is the product of the iteration matrices of
the single steps.
The multiplicative Schwarz procedure is related to GS iteration wherein only
the most recent values are used for computing the residuals. In fact, it is block
GS iteration if the subdomains do not overlap. As a stationary method, the
multiplicative Schwarz procedure converges for spd matrices.
The problems with parallelizing multiplicative Schwarz are related to those of
parallelizing GS, and the solution is the same: coloring. Let us color the sub-
domains such that domains that have a common edge (or even overlap) do not
have the same color. From Figure 2.12 we see that two colors are enough in the
case of the Poisson matrix on a rectangular grid. More generally, if we have q
colors, then

    x^(k+1/q) = x^(k)         + Σ_{j∈color 1} B_j r^(k),
    x^(k+2/q) = x^(k+1/q)     + Σ_{j∈color 2} B_j r^(k+1/q),
        ⋮
    x^(k+1)   = x^(k+1−1/q)   + Σ_{j∈color q} B_j r^(k+1−1/q).
As subdomains with the same color do not overlap, the updates corresponding
to domains that have the same color can be computed simultaneously.
An additive Schwarz procedure is given by

    z^(k+1) = z^(k) + ∑_{j=1}^{d} B_j r^(k),    r^(k) = b − A z^(k).      (2.39)
If the domains do not overlap, then the additive Schwarz preconditioner coincides
with the block Jacobi preconditioner diag(A_1, . . . , A_d). If the Ω_j do overlap,
as in Figure 2.13, the additive Schwarz iteration may diverge as a stationary
iteration but can be useful as a preconditioner in a Krylov subspace method.
Additive Schwarz preconditioners are, like the related Jacobi iterations, easily
parallelized. Nevertheless, in (2.39) care has to be taken when updating grid
points corresponding to overlap regions.
2.3.10.5 Incomplete Cholesky factorization/ICCG(0)
Very popular preconditioners are obtained by incomplete factorizations [130].
A direct solution of Az = r, when A is spd, involves finding a version of the
square root of A, A = L L^T, which in most situations will generate a lot of fill-in.
This may render the direct solution infeasible with regard to operations count
and memory consumption.

An incomplete LU or Cholesky factorization can help in this situation.
Incomplete factorizations are obtained by ignoring some of the fill-in. Thus,

    A = L L^T + R,    R ≠ O.

R is nonzero at places where A is zero. One of the most used incomplete Cholesky
factorizations is called IC(0), see Figure 2.14. Here, the sparsity structure of the
lower triangular portion of the incomplete Cholesky factor is specified a priori
to be the structure of the original matrix A. The incomplete LU factorization
ILU(0) for nonsymmetric matrices is defined in a similar way.

The incomplete Cholesky (or LU) factorization can be parallelized much like the
original Cholesky (or LU) factorization. It is simplified because the zero structure
of the triangular factor is known in advance.
for k = 1, . . . , n do
    l_kk = √a_kk;
    for i = k + 1, . . . , n do
        l_ik = l_ik / l_kk;
        for j = k + 1, . . . , n do
            if a_ij = 0 then l_ij = 0 else a_ij = a_ij − l_ik l_kj; endif
        endfor
    endfor
endfor
Fig. 2.14. The incomplete Cholesky factorization with zero fill-in.
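As an illustration, here is a small C routine that carries out the IC(0) recipe of Figure 2.14 on a dense, row-major copy of a symmetric matrix. It is only a sketch under that assumption; a production code would of course use a sparse storage format. The lower triangle is overwritten in place with the factor, following the figure.

#include <math.h>

/* IC(0) on a dense n x n array a (row-major): the lower triangle is overwritten
 * with the incomplete Cholesky factor L. Entries that are zero in A stay zero
 * in L (zero fill-in). Dense storage is used only to keep the sketch short. */
void ic0(int n, double *a)
{
    for (int k = 0; k < n; k++) {
        a[k*n + k] = sqrt(a[k*n + k]);                   /* l_kk = sqrt(a_kk)       */
        for (int i = k + 1; i < n; i++)
            if (a[i*n + k] != 0.0)
                a[i*n + k] /= a[k*n + k];                /* l_ik = a_ik / l_kk      */
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j <= i; j++)
                if (a[i*n + j] != 0.0)                   /* update only A's pattern */
                    a[i*n + j] -= a[i*n + k] * a[j*n + k];
    }
}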
2.3.10.6 Sparse approximate inverses (SPAI)
SPAI preconditioners are obtained by finding the solution of

    min_K ‖I − A K‖_F^2 = ∑_{i=1}^{n} ‖e_i − A k_i‖_2^2,

where K has a given sparsity structure. Here, k_i = K e_i is the ith column of K.

SPAI preconditioners are quite expensive to compute. However, their set-up
parallelizes nicely since the columns of K can be computed independently of each
other. Likewise, SPAI preconditioners also parallelize well in their application.

As the name says, SPAI is an approximate inverse of A. Therefore, in the
iterative solver, we multiply with K. With our previous notation, K = M^{−1}:

    M z = r  ⇐⇒  z = M^{−1} r = K r.

There are variants of SPAI in which the number of nonzeros per column is
increased until ‖e_i − A k_i‖ < τ for some τ [65]. There are also symmetric SPAI
preconditioners [88] that compute a sparse approximate inverse of a Cholesky
factor.
2.3.10.7 Polynomial preconditioning
A preconditioner of the form

    M = p(A) = ∑_{j=0}^{m−1} γ_j A^j

is called a polynomial preconditioner. The polynomial p(A) is determined to
approximate A^{−1}, that is, p(λ) ≈ λ^{−1} for λ ∈ σ(A). Here, σ(A) denotes the set
of eigenvalues of A. See Saad [130] for an overview.
Polynomial preconditioners are easy to implement, in particular on parallel
or vector processors, as they just require matrix–vector multiplication and
no inner products. However, iterative Krylov subspace methods such as CG
or GMRES construct approximate solutions in a Krylov subspace with the
same number of matrix–vector products, and these approximations satisfy some
optimality properties. Therefore, an (inner) Krylov subspace method is generally
more effective than a polynomial preconditioner.
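Applying M = p(A) to a vector needs nothing beyond repeated matrix–vector products; the sketch below evaluates z = p(A) r by Horner's rule. It is only illustrative: the coefficient array gamma[] and the user-supplied matvec routine are placeholders, not anything prescribed by the text.

#include <stdlib.h>

/* z = p(A) r with p(A) = gamma[0] I + gamma[1] A + ... + gamma[m-1] A^(m-1),
 * evaluated by Horner's rule: z = gamma[m-1] r; then z = A z + gamma[j] r for
 * j = m-2,...,0. Only matrix-vector products are needed, which vectorize well. */
void poly_precond(int n, int m, const double *gamma, const double *r, double *z,
                  void (*matvec)(int, const double *, double *))
{
    double *t = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) z[i] = gamma[m-1] * r[i];
    for (int j = m - 2; j >= 0; j--) {
        matvec(n, z, t);                          /* t = A z                    */
        for (int i = 0; i < n; i++) z[i] = t[i] + gamma[j] * r[i];
    }
    free(t);
}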
2.4 Fast Fourier Transform (FFT)
According to Gilbert Strang, the idea of the FFT originated with Carl Friedrich
Gauss around 1805. However, the algorithm suitable for digital computers was
invented by Cooley and Tukey in 1965 [21]. It is this latter formulation which
became one of the most famous algorithms in computation developed in the
twentieth century. There is a vast literature on the FFT and readers serious about
this topic should consult Van Loan's book [96], not only for its comprehensive
treatment but also for the extensive bibliography. Additionally, E. Oran
Brigham's two books [15] and [16] have clear-headed, concise discussions of
symmetries and, perhaps more important, of how to use the FFT.

Our intent here is far more modest. Even the topic of parallel algorithms
for the FFT has an enormous literature—and again, Van Loan's book [96] does well
in the bibliographic enumeration of this narrower topic. In fact, it was known
to Gauss that the splitting we show below was suitable for parallel computers.
Computers in his case were women, who, he acknowledged, were not only fast
but accurate (schnell und präzis).
Written in its most basic form, the discrete Fourier Transform is simply
a linear transformation

    y = W_n x,

where x, y ∈ C^n are complex n-vectors and n^{−1/2} W_n is a unitary matrix. The
matrix elements of W_n are

    w_{p,q} = e^{(2πi/n)pq} ≡ ω^{pq},

where ω = exp(2πi/n) is the nth root of unity. For convenience, we number
the elements according to the C numbering convention: 0 ≤ p, q ≤ n − 1. As
written, the amount of computation is that of a matrix times vector multiply,
O(n^2). What makes the FFT possible is that W_n has many symmetries. In our
superficial survey here, we treat only the classic case where n, the dimension of
the problem, is a power of 2: n = 2^m for some integer m ≥ 0. If n is a product
of small prime factors, n = 2^r 3^s 5^t · · · , the generalized prime factor algorithm
can be used [142]. If n contains large primes, the usual procedure is to pad the
vector to a convenient larger size and use a convolution procedure. For example,
see Van Loan [96, section 4.2.6].
Here are some obvious symmetries:

    w_{p,n−q} = w̄_{p,q},           w_{n−p,q} = w̄_{p,q},            (2.40)
    w_{p,q+n/2} = (−1)^p w_{p,q},   w_{p+n/2,q} = (−1)^q w_{p,q},    (2.40a)
    w_{p,q+n/4} = i^p w_{p,q},      w_{p+n/4,q} = i^q w_{p,q}.       (2.40b)

Because of the first pair of symmetries, only n/2 elements of W are independent,
the others being linear combinations (frequently very simple) of these n/2. As
above, ω = exp(2πi/n) is the nth root of unity. The first row of the matrix
W_n is all ones. Now call the set w = {1, ω, ω^2, ω^3, . . . , ω^{n/2−1}}; then the second
row of W_n is w_{1,q=0...n−1} = {w, −w}. The first half of the third row is every
other element of the second row, while the next half of the third row is just
the negative of the first half, and so forth. The point is that only w, the first
n/2 powers of ω, are needed. Historically, the pre-computed roots of unity, w =
(1, ω, ω^2, . . . , ω^{n/2−1}), with ω = exp(2πi/n), were called “twiddle factors.”
Let us see how the FFT works: the output is

    y_p = ∑_{q=0}^{n−1} ω^{pq} x_q,    p = 0, . . . , n − 1.

Gauss's idea is to split this sum into even/odd parts:

    y_p = ∑_{q even} ω^{p·q} x_q + (odd)
        = ∑_{r=0}^{n/2−1} ω_n^{p·(2r)} x_{2r} + ∑_{r=0}^{n/2−1} ω_n^{p·(2r+1)} x_{2r+1}
        = ∑_{r=0}^{n/2−1} ω_{n/2}^{p·r} x_{2r} + ω_n^p ∑_{r=0}^{n/2−1} ω_{n/2}^{p·r} x_{2r+1},

    y       = W_{n/2} x_even + diag(w_n) W_{n/2} x_odd,
    y_{n/2} = W_{n/2} x_even − diag(w_n) W_{n/2} x_odd,

where diag(w_n) = diag(1, ω, ω^2, . . . , ω^{n/2−1}). Here y_{n/2} is the second half of y,
and the minus sign is a result of (2.40a). Already you should be able to see what
happens to the operation count: y = W_n x has 8n^2 real operations (*, +). Splitting
gives

    8(n/2)^2 [even] + 6n [diag ∗] + 8(n/2)^2 [odd] = 4n^2 + 6n,
a reduction in floating point operations of nearly 1/2. A variant of this idea
(called splitting in the time domain) is to split the output, and is called decimation
in the frequency domain. We get

    y_{2r}   = ∑_{q=0}^{n−1} ω_n^{2·r·q} x_q
             = ∑_{q=0}^{n/2−1} ω_{n/2}^{r·q} (x_q + x_{n/2+q}),

    y_even = W_{n/2} (x + x_{n/2}),                                        (2.41)

    y_{2r+1} = ∑_{q=0}^{n−1} ω_n^{(2r+1)·q} x_q
             = ∑_{q=0}^{n/2−1} ω_{n/2}^{r·q} ω_n^q (x_q − x_{(n/2)+q}),

    y_odd = W_{n/2} [ diag(w_n) (x − x_{n/2}) ].                           (2.42)
In both forms, time domain or frequency domain, the idea of separating the
calculation into even/odd parts can then be repeated. In the first stage, we
end up with two n/2 × n/2 transforms and a diagonal scaling. Each of the
two transforms is of O((n/2)^2), hence the savings of 1/2 for the calculation.
Now we split each W_{n/2} transformation portion into two W_{n/4} operations (four
altogether), then into eight W_{n/8}, and so forth until the transforms become W_2
and we are done.
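To see the recursion of (2.41) and (2.42) in code, here is a compact decimation-in-frequency FFT in C using the sign convention ω = exp(2πi/n) from above. It is a didactic sketch only (out of place, with scratch buffers allocated at every level), not a tuned library routine.

#include <complex.h>
#include <math.h>
#include <stdlib.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Recursive decimation-in-frequency FFT following (2.41)-(2.42):
 * y[p] = sum_q omega^(p q) x[q], omega = exp(2 pi i/n), n a power of 2. */
void dif_fft(int n, const double complex *x, double complex *y)
{
    if (n == 1) { y[0] = x[0]; return; }
    int h = n / 2;
    double complex *a  = malloc(h * sizeof(double complex)); /* x_first + x_second      */
    double complex *b  = malloc(h * sizeof(double complex)); /* diag(w_n)(x_f - x_s)    */
    double complex *ya = malloc(h * sizeof(double complex));
    double complex *yb = malloc(h * sizeof(double complex));
    for (int q = 0; q < h; q++) {
        double complex w = cexp(2.0 * M_PI * I * q / n);     /* twiddle factor omega^q  */
        a[q] = x[q] + x[q + h];
        b[q] = w * (x[q] - x[q + h]);
    }
    dif_fft(h, a, ya);                    /* y_even = W_{n/2} a                          */
    dif_fft(h, b, yb);                    /* y_odd  = W_{n/2} b                          */
    for (int r = 0; r < h; r++) {         /* interleave: y[2r] = even, y[2r+1] = odd     */
        y[2*r]     = ya[r];
        y[2*r + 1] = yb[r];
    }
    free(a); free(b); free(ya); free(yb);
}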
Let us expand on this variant, decimation in frequency, in order to see how it
works in more general cases. For example, if n = 2^{m_1} 3^{m_2} 5^{m_3} · · · can be factored
into products of small primes, the resulting algorithm is called the general prime
factor algorithm (GPFA) [142]. To do this, we write the above splitting in matrix
form: as above,

    y_even = {y_{2r}   | r = 0, . . . , n/2 − 1},
    y_odd  = {y_{2r+1} | r = 0, . . . , n/2 − 1},
    x_+    = {x_r + x_{r+n/2} | r = 0, . . . , n/2 − 1},
    x_−    = {x_r − x_{r+n/2} | r = 0, . . . , n/2 − 1},

and

    D = diag(w_n) = diag(1, ω, ω^2, . . . , ω^{n/2−1}),
to get

    [ y_even ]   [ W_{n/2}     0     ] [ I_{n/2}     0     ] [ x_+ ]
    [ y_odd  ] = [    0     W_{n/2}  ] [    0     D_{n/2}  ] [ x_− ] .      (2.43)
Now some notation is required. An operation ⊗ between two matrices is called a
Kronecker (or “outer”) product, which for a matrix A (p × p) and a matrix B (q × q)
is written as

    A ⊗ B = [ a_{0,0} B     a_{0,1} B     · · ·   a_{0,p−1} B
              a_{1,0} B     a_{1,1} B     · · ·   a_{1,p−1} B
                · · ·
              a_{p−1,0} B   a_{p−1,1} B   · · ·   a_{p−1,p−1} B ],

and of course A ⊗ B ≠ B ⊗ A in general. A little counting shows that, where
l = l_0 + l_1 p and m = m_0 + m_1 p with 0 ≤ m_0, l_0 < p and 0 ≤ m_1, l_1 < q,

    (A_p ⊗ B_q)_{lm} = A_{l_0 m_0} B_{l_1 m_1}.

Using this notation, we may write (2.43) in a slightly more compact
form:

    [ y_even ]
    [ y_odd  ] = ( I_2 ⊗ W_{n/2} ) diag(I_{n/2}, D) [ x_+ ]
                                                    [ x_− ] ,      (2.44)

where I_2 is the 2 × 2 unit matrix. Furthermore, noting that

    W_2 = [ 1    1
            1   −1 ],

we get an even more concise form

    [ y_even ]
    [ y_odd  ] = ( I_2 ⊗ W_{n/2} ) diag(I_{n/2}, D) ( W_2 ⊗ I_{n/2} ) x.      (2.45)
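Read as operators, the two Kronecker factors in (2.45) are just loops: W_2 ⊗ I_{n/2} forms the half-sums and half-differences (x_+, x_−), and I_2 ⊗ W_{n/2} applies W_{n/2} independently to each half. The small C sketch below is ours, with a user-supplied apply_W routine standing in for W_{n/2}.

#include <complex.h>

/* Apply the two Kronecker factors of (2.45) to x (length n, n even).
 * apply_W(m, in, out) is assumed to compute out = W_m * in (dense or FFT). */
void kron_factors(int n, const double complex *x, double complex *y,
                  void (*apply_W)(int, const double complex *, double complex *))
{
    int h = n / 2;
    double complex xp[h], xm[h];            /* C99 variable-length arrays          */
    for (int i = 0; i < h; i++) {           /* (W_2 ⊗ I_{n/2}) x: form x_+ and x_− */
        xp[i] = x[i] + x[i + h];
        xm[i] = x[i] - x[i + h];
    }
    /* diag(I_{n/2}, D) would scale xm by the twiddle factors at this point. */
    apply_W(h, xp, y);                      /* (I_2 ⊗ W_{n/2}): two independent    */
    apply_W(h, xm, y + h);                  /* half-length transforms              */
}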
All that remains is to get everything into the right order by permuting
(y_even, y_odd) → y. Such permutations are called index digit permutations and
the idea is as follows. If the dimension of the transform can be written in the
form n = p · q (only two factors for now), then a permutation P^p_q takes an index
l = l_0 + l_1 p → l_1 + l_0 q, where 0 ≤ l_0 < p and 0 ≤ l_1 < q. We have

    ( P^p_q )_{kl} = δ_{k_0 l_1} δ_{k_1 l_0},      (2.46)

where k = k_0 + k_1 q and l = l_0 + l_1 p, with the ranges of the digits 0 ≤ k_1, l_0 < p
and 0 ≤ k_0, l_1 < q. Notice that the decimation for k, l is switched, although they
both have the same range 0 ≤ k, l < n. With this definition of P^p_q, we get a very
compact form for (2.43):

    y = P^{n/2}_2 ( I_2 ⊗ W_{n/2} ) diag(I_{n/2}, D) ( W_2 ⊗ I_{n/2} ) x.      (2.47)
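As a concrete illustration of (2.46), the following C fragment applies the index digit permutation P^p_q to a vector of length n = p·q, sending the entry at l = l_0 + l_1 p to position l_1 + l_0 q. The out-of-place formulation and names are ours, for illustration only.

#include <complex.h>

/* Apply P^p_q of (2.46) to x (length n = p*q): the entry at index
 * l = l0 + l1*p (0 <= l0 < p, 0 <= l1 < q) moves to index l1 + l0*q. */
void digit_permute(int p, int q, const double complex *x, double complex *y)
{
    for (int l1 = 0; l1 < q; l1++)
        for (int l0 = 0; l0 < p; l0++)
            y[l1 + l0*q] = x[l0 + l1*p];
}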
To finish up our matrix formulation, we state without proof the following identity,
which may be shown from (2.46) [142]. Using the above dimensions of A and B
(p × p and q × q, respectively),

    (A ⊗ B) P^p_q = P^p_q (B ⊗ A).      (2.48)

So, finally, we get a commuted form

    y = ( W_{n/2} ⊗ I_2 ) P^{n/2}_2 diag(I_{n/2}, D) ( W_2 ⊗ I_{n/2} ) x.      (2.49)
As in (2.41) and (2.42), the O(n^2) computation has been reduced by this procedure
to two O((n/2)^2) computations, plus another O(n) due to the diagonal
factor. This is easy to see from (2.47) because I_2 ⊗ W_{n/2} is a direct sum of
two W_{n/2} multiplies, each of which is O((n/2)^2), while the diagonal diag(1, D) factor
is O(n). The remaining two W_{n/2} operations may be similarly factored, the next
four W_{n/4} similarly, until n/2 independent W_2 operations are reached and the
factorization is complete. Although this factorized form looks more complicated
than (2.41) and (2.42), it generalizes to the case n = n_1 · n_2 · · · n_k, where any n_j
may be repeated (e.g. n_j = 2 for the binary radix case n = 2^m).
Here is a general result, which we do not prove [142]. Having outlined the
method of factorization, interested readers should hopefully be able to follow the
arguments in [96] or [142] without much trouble. A general formulation can be
written in the above notation plus two further notations, where n = ∏_{i=1}^{k} n_i:

    m_j = ∏_{i=1}^{j−1} n_i    and    l_j = ∏_{i=j+1}^{k} n_i,

which are the products of all the previous factors up to j and of all the remaining
factors after j, respectively. One variant is [142]:

    W_n = ∏_{j=k}^{1} ( P^{n_j}_{l_j} ⊗ I_{m_j} ) ( D^{n_j}_{l_j} ⊗ I_{m_j} ) ( W_{n_j} ⊗ I_{n/n_j} ).      (2.50)
In this expression, the diagonal matrix D^{n_j}_{l_j} generalizes diag(1, D) in (2.49),
appropriate for the next recursive steps:

    D^p_q = ⊕_{l=0}^{p−1} (w_q)^l = diag( (w_q)^0, (w_q)^1, . . . , (w_q)^{p−1} ),

where w_q = diag(1, ω, ω^2, . . . , ω^{q−1}). Not surprisingly, there is a plethora of
similar representations, frequently altering the order of reads/writes for various
desired properties—in-place, unit-stride, in order, and so forth [96].
In any case, the point of all this is that by using such factorizations one
can reduce the floating point operation count from O(n^2) to O(n · log(n)), a
significant savings. Perhaps more to the point for the purposes in this book is
that they also reduce the memory traffic by the same amount—from O(n^2) to
O(n · log(n)). This is evident from (2.50) in that there are k steps, each of which
requires O(n) operations: each W_{n_j} is independent of n and may be optimally
hand-coded.
Now we wish to answer the following questions:

1. Is this recursive factorization procedure stable?
2. What about the efficiency of more general n-values than n = 2^m?

The answers are clear and well tested by experience:

1′. The FFT is very stable: first because W_n is unitary up to a scaling, but
    also because the number of floating point operations is sharply reduced
    from O(n^2) to O(n · log n), thus reducing rounding errors.
2′. It turns out that small prime factors, n_j > 2, in n = n_1 · n_2 · · · n_k are
    often more efficient than for n_j = 2. Figure 3.21 shows that the computation
    (“volume”) for n_j = 2 is very simple, whereas the input/output
    data (“surface area”) is high relative to the computation. The (memory
    traffic)/(floating point computation) ratio is higher for n_j = 2
    than for n_j > 2. Unfortunately, if some n_j is too large, coding for this large
    prime factor becomes onerous. Additionally, large prime factor codings
    also require a lot of registers—so some additional memory traffic (even if
    only to cache) results. Frequently, only radices 2–5 are available and larger
    prime factor codings use variants of Rader's ([96], theorem 4.2.9)
    or Bluestein's ([96], section 4.2.3) methods. These involve computing
    a slightly larger composite value than n_j which contains smaller prime
    factors. Optimal choices of algorithms depend on the hardware available
    and problem size [55].
What we have not done in enough detail is to show where the partial results
at each step are stored. This is important. In Cooley and Tukey’s formula-
tion, without permutations, the partial results are stored into the same locations
as the operands. Although convenient operationally, this may be inconvenient
to use. What happens in the classical Cooley–Tukey algorithm is that the
results come out numbered in bit-reversed order. To be specific: when n = 8,
the input vector elements are numbered 0, 1, 2, 3, 4, 5, 6, 7 or by their bit pat-
terns 000, 001, 010, 011, 100, 101, 110, 111; and the results come out numbered
0, 4, 2, 6, 1, 5, 3, 7. The output bit patterns of the indices of these re-ordered ele-
ments are bit-reversed: 001 → 100, 011 → 110, . . . , etc. This is not always
inconvenient, for if one’s larger calculation is a convolution, both operand vectors
come out in the same bit-reversed order so multiplying them element by element
does not require any re-ordering. Furthermore, the inverse operation can start
from this weird order and get back to the original order by reversing the pro-
cedure. However, when using FFT for filtering, solving PDEs, and most other
situations, it is well to have an ordered-in, ordered-out algorithm. Hence, that
is what we will do in Sections 3.5.4 and 3.6. Clever choices of P^p_q in (2.50) for
each recursive factorization will result in an ordered-in/ordered-out algorithm. In
addition, these algorithms illustrate some important points about data depend-
encies and strides in memory which we hope our dear readers will find edifying.
It turns out that a single post-processing step to re-order the bit-reversed output
([141]) is nearly as expensive as the FFT itself—a reflection on the important
point we made in the introduction (Section 1.1): memory traffic is the most
important consideration on today’s computing machinery.
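For reference, here is a small C routine that produces the bit-reversal permutation just described for n = 2^m; applying it as a separate post-processing pass is exactly the extra O(n) memory traffic the text warns about.

/* Fill perm[0..n-1] with the bit-reversal permutation for n = 2^m:
 * perm[j] is j with its m-bit pattern reversed (e.g. n = 8: 0,4,2,6,1,5,3,7). */
void bit_reverse_perm(int n, int m, int *perm)
{
    for (int j = 0; j < n; j++) {
        int r = 0, t = j;
        for (int b = 0; b < m; b++) {      /* reverse the m bits of j */
            r = (r << 1) | (t & 1);
            t >>= 1;
        }
        perm[j] = r;
    }
}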
2.4.1 Symmetries
As we have just shown, by splitting either the input or output into even and
odd parts, considerable savings in computational effort ensue. In fact, we can go
further: if the input sequence has certain symmetries, Equations (2.40) may be
used to reduce the computational effort even more. Consider first the case when
the n-dimensional input contains only real elements. It follows that the output
y = W_n x satisfies a complex conjugate condition: we have

    ȳ_k = ∑_{j=0}^{n−1} ω̄^{jk} x̄_j = ∑_{j=0}^{n−1} ω̄^{jk} x_j
        = ∑_{j=0}^{n−1} ω^{−jk} x_j = ∑_{j=0}^{n−1} ω^{(n−j)k} x_j
        = y_{n−k},      (2.51)
because x is real. Thus, the second half of the output sequence y is the complex
conjugate of the first half read in backward order. This is entirely sensible because
there are only n independent real input elements, so there will be only n independent
real output elements in the complex array y. A procedure first outlined
by Cooley et al. (see [139]) reduces an n-dimensional FFT to an n/2-dimensional
complex FFT plus a post-processing step of O(n) operations. When n is even, let
z = W_{n/2} x, where we pretend x ∈ C^{n/2}; then

    y_k = (1/4)(z_k + z̄_{(n/2)−k}) + (1/(4i)) ω^k (z_k − z̄_{(n/2)−k}),    for k = 0, . . . , n/2.      (2.52)

Another helpful feature of this algorithm is that the complex
portion z = W̄_{n/2} x simply pretends that the real input vector x is complex
(i.e. that storage follows the Fortran convention for complex arrays, so consecutive
pairs (x_0, x_1), (x_2, x_3), . . . are taken as real and imaginary parts)
and processing proceeds on that pretense, that is, no re-ordering is needed.
The inverse operation is similar but has a pre-processing step followed by an
n/2-dimensional complex FFT. The un-normalized inverse is x = W̄_{n/2} y: for
j = 0, . . . , n/2 − 1,

    z_j = y_j + ȳ_{(n/2)−j} + i ω̄^j (y_j − ȳ_{(n/2)−j}),      (2.53)

    x = W̄_{n/2} z,    again pretending z ∈ C^{n/2}.

Cost savings reduce the O(n · log(n)) operation count to

    O( (n/2) log(n/2) ) + O(n).

Both parts, the n/2-dimensional FFT and the O(n) post(pre)-processing steps,
are parallelizable and may be done in place if the W_{n/2} FFT algorithm
is in place. Even more savings are possible if the input sequence x is not only
real, but either symmetric or skew-symmetric.
If the input sequence is real, x ∈ R^n, and satisfies

    x_{n−j} = x_j,    for j = 0, . . . , n/2 − 1,

then Dollimore's method [139] splits the n-dimensional FFT into three parts: an
O(n) pre-processing step, an n/4-dimensional complex FFT, and finally another
O(n) post-processing step. We get, for j = 0, . . . , n/2 − 1,

    e_j = (x_j + x_{(n/2)−j}) − 2 sin(2πj/n) (x_j − x_{(n/2)−j}),
    y   = RCFFT_{n/2}(e),
    y_j = y_{j−2} − 2 y_j,    for j = 3, 5, 7, . . . , n/2 − 1.

Here RCFFT_{n/2} is an n/2-dimensional FFT done by the CTLW procedure (2.52).
This is equivalent to a cosine transform [96, section 4.4]. If the input sequence
is real and skew-symmetric,

    x_j = −x_{n−j},
the appropriate algorithm is again from Dollimore [139]. The computation is, for
j = 0, . . . , n/2 − 1, first a pre-processing step, then an n/4-dimensional complex
FFT, and another O(n) post-processing step:

    o_j = (x_j − x_{(n/2)−j}) − sin(2πj/n) (x_j + x_{(n/2)−j}),
    y   = CRFFT_{n/2}(o),
    Re(y) ↔ Im(y),    (swap real and imaginary parts),
    y_j = 2 y_j + y_{j−2},    j = 3, 5, 7, . . . , n/2 − 1.

The routine CRFFT_{n/2} is an n/2-dimensional half-complex inverse to the
real → half-complex routine RCFFT (2.53). The expression half-complex
means that the sequence satisfies the complex-conjugate symmetry (2.51) and
thus only n/4 elements of the complete n/2 total are needed. The skew-symmetric
transform is equivalent to representation as a Fourier sine series.
A complete test suite for each of these four symmetric FFT cases can be down-
loaded from our web-server [6] in either Fortran (master.f) or C (master.c).
An important point about each of these symmetric cases is that each step is
parallelizable: either vectorizable, or by sectioning (see Sections 3.2 and 3.2.1)
both parallelizable and vectorizable.
2.5 Monte Carlo (MC) methods
In the early 1980s, it was widely thought that MC methods were ill-suited for
both vectorization (SIMD) and parallel computing generally. This perspective
considered existing codes whose logical construction had an outermost loop which
counted the samples. All internal branches to simulated physical components
were like goto statements, very unparallel in viewpoint. To clarify: a sample
path (a particle, say), started at its source, was simulated to its end through
various branches and decisions. Because of the many branches, these codes were
hard to vectorize. Worse, because they ran one particle to completion in the
outer loop, distributing the sample at one particle per CPU on distributed
memory machines resulted in a poorly load-balanced simulation. After a little
more thought, however, it has become clear that MC methods are nearly ideal
simulations for parallel environments (e.g. see [121]). Not only can such simula-
tions often vectorize in inner loops, but by splitting up the sample N into pieces,
the pieces (subsamples) may be easily distributed to independently running pro-
cessors. Each subsample is likely to have relatively the same distribution of short
and long running sample data, hence improving the load balance. Furthermore,
in many cases, integration problems particularly, domains may be split into sub-
domains of approximately equal volume. After this “attitude readjustment,” it
is now generally agreed that MC simulations are nearly an ideal paradigm for
parallel computing. A short list of advantages shows why:
1. The intersection between the concept of statistical independence of data
streams and the concept of data independence required for parallel execu-
tion is nearly inclusive. Independently seeded data should be statistically
independent and may be run as independent data streams.
2. Because of the independence of each sample path (datum), inter pro-
cessor communication is usually very small. Only final statistics require
communication.
3. Distributed memory parallelism (MIMD) is complemented by instruction
level parallelism (SIMD): there is no conflict between these modes. Vector-
izing tasks running on each independent processor only makes the whole
job run faster.
4. If subsamples (say N/ncpus) are reasonably large, load balancing is also
good. Although one particular path may take a longer (or shorter)
time to run to termination, each subsample will on average take nearly the
same time as other independent subsamples running on other independent
processors.
5. If a processor fails to complete its task, the total sample size is reduced but
the simulation can be run to completion anyway. For example, if one CPU
fails to complete its N/ncpus subsample, the final statistic will consist of a
sample of size N· (1−1/ncpus) not N, but this is likely to be large enough
to get a good result. Hence such MC simulations can be fault tolerant.
It is not fault tolerant to divide an integration domain into independent
volumes, however.
2.5.1 Random numbers and independent streams
One of the most useful purposes for MC is numerical integration. In one dimension,
there are far more accurate procedures [33] than MC. However, MC in
one dimension provides some useful illustrations. Let us look at the bounded
interval case,

    F = ∫_a^b f(x) dx,

which we approximate by the estimate

    F = (b − a) ⟨f⟩ ≡ (b − a) Ef ≈ |A| (1/N) ∑_{i=1}^{N} f(x_i).      (2.54)
Here we call |A| = (b − a) the “volume” of the integration domain, and all the x_i
are uniformly distributed random numbers in the range a < x_i < b. The expectation
value Ef = ⟨f⟩ means the average value of f in the range a < x < b. Writing

    x = a + (b − a) z,    g(z) = f(a + (b − a) z)

gives us (2.54) again with F = |A| Eg, where g(z) is averaged over the unit
interval (i.e. z ∈ U(0, 1)). By other trickery, it is easy to do infinite ranges as
well. For example,

    H = ∫_0^∞ h(x) dx

may also be transformed into the bounded interval by

    x = z/(1 − z),    0 < z < 1,

to get

    H = ∫_0^1 (1/(1 − z)^2) h( z/(1 − z) ) dz.      (2.55)
The doubly infinite case is almost as easy: with x = (u − 1)/u − u/(u − 1),

    K = ∫_{−∞}^{∞} k(x) dx = ∫_0^1 (2u^2 − 2u + 1)/(u^2 (1 − u)^2) · k( (2u − 1)/(u(1 − u)) ) du.      (2.56)

To be useful in (2.55), we need h(x)/(1 − z)^2 to be bounded as z → 1. Likewise,
in (2.56) we need that k(x)/u^2 and k(x)/(1 − u)^2 be bounded as u → 0
and u → 1, respectively. With care, one-dimensional integrals (finite or infinite
ranges) may be scaled to 0 < z < 1. Not surprisingly, range rescaling is appropriate
for multiple dimension problems as well—provided the ranges of integration are
fixed. If the ranges of some coordinates depend on the value of others, the integrand
may have to be replaced by f(x) → f(x) χ_A(x), where χ_A(x) is the indicator
function: χ_A(x) = 1 if x ∈ A, and zero otherwise (x ∉ A).
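As a small worked example (ours, not from the text), the following C program estimates H = ∫_0^∞ e^{−x} dx = 1 using the transformation (2.55); the POSIX drand48 stands in for whichever uniform generator is actually used.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* MC estimate of H = int_0^inf h(x) dx via (2.55): x = z/(1-z), 0 < z < 1,
 * H ~ (1/N) sum_i h(z_i/(1-z_i)) / (1-z_i)^2 with z_i uniform in (0,1).
 * Here h(x) = exp(-x), so the exact answer is 1. */
static double h(double x) { return exp(-x); }

int main(void)
{
    long   N   = 1000000;
    double sum = 0.0;
    srand48(12345L);                        /* any seed; drand48 is a stand-in RNG */
    for (long i = 0; i < N; i++) {
        double z = drand48();
        double w = 1.0 - z;
        sum += h(z / w) / (w * w);          /* integrand times Jacobian 1/(1-z)^2  */
    }
    printf("estimate = %f (exact 1.0)\n", sum / N);
    return 0;
}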
Quite generally, a multiple dimension integral of f(x) over a measurable
Euclidean subset A ⊂ R^n is computed by

    I = ∫_A f(x) d^n x = |A| Ef ≈ |A| (1/N) ∑_{i=1}^{N} f(x_i),      (2.57)

where |A| is again the volume of the integration region, when this is appropriate
(i.e. A is bounded). Also, Ef is the expected value of f in A, estimated by MC
as an average over uniformly distributed sample points {x_i | i = 1, . . . , N}. For
an infinite volume, the problem is usually written as

    Ef = ∫ f(x) p(x) d^n x = ⟨f⟩_p ≈ (1/N) ∑_{i=1}^{N} f(x_i).      (2.58)
In this case, the sample points are not drawn uniformly but rather from the
distribution density p, which is a weighting function of x = (x_1, x_2, . . . , x_n).
The volume element is d^n x = dx_1 dx_2 · · · dx_n. Density p is subject to the
normalization condition

    ∫ p(x) d^n x = 1.
If p(x) = χ_A(x)/|A|, we get the previous bounded range situation and uniformly
distributed points inside domain A. In a common case, p takes the form

    p(x) = e^{−S(x)} / ∫ e^{−S(x)} d^n x,      (2.59)

which we discuss in more detail in Section 2.5.3.2.
Sometimes x may be high dimensional and sampling it by acceptance/rejection
(see Section 2.5.3.1) from the probability density p(x) becomes difficult. A
good example is taken from statistical mechanics: S = E/kT, where E is the
system energy for a configuration x, k is Boltzmann's constant, and T is the temperature.
The denominator in this Boltzmann case (2.59), Z = ∫ exp(−S) d^n x,
is called the partition function. We will consider simulations of such cases
later in Section 2.5.3.2.
2.5.2 Uniform distributions
From above we see that the first task is to generate random numbers in the
range 0.0 < x < 1.0. Most computer mathematics libraries have generators for
precisely this uniformly distributed case. However, several facts about random
number generators should be pointed out immediately.
1. Few general purpose computer mathematical library packages provide
quality parallel or vectorized random number generators.
2. In general purpose libraries, routines are often function calls which return
one random number 0.0 < x < 1.0 at a time. In nearly all parallel applica-
tions, this is very inefficient: typical function call overheads are a few
microseconds whereas the actual computation requires only a few clock
cycles. That is, computation takes only a few nanoseconds. As we will dis-
cuss later, the computations are usually simple, so procedure call overhead
will take most of the CPU time if used in this way.
3. Many random number generators are poor. Do not trust any of them
completely. More than one should be tried, with different initializations,
and the results compared. Your statistics should be clearly reproducible.
4. Parallel random number generators are available and, if used with care, will
give good results [102, 121, 135]. In particular, the SPRNG generator
suite provides many options in several data formats [135].
Two basic methods are in general use, with many variants of them in practice.
An excellent introduction to the theory of these methods is Knuth [87, vol. 2].
Here, our concern will be parallel generation using these two basics. They are
1. Linear congruential (LC) generators, which use integer sequences

       x_n = a · x_{n−1} + b  mod wordsize (or a prime)

   to return a value ((float)x_n)/((float)wordsize). For example, wordsize =
   2^24 (single precision IEEE arithmetic) or wordsize = 2^53 (double
   precision IEEE). Additionally, if a is chosen to be odd, b may be set to
   zero. The choice of the multiplier a is not at all trivial, and unfortunately
   there are not many really good choices [87].

2. Lagged Fibonacci (LF) sequences

       x_n = x_{n−p} ⊗ x_{n−q}  mod 1.0,      (2.60)

   where the lags p, q must be chosen carefully, and the operator ⊗ can be
   one of several variants. Some simple choices are p = 607, q = 273, and ⊗ is
   floating point add (+). There are better choices for both the lags and the
   operator ⊗, but they use more memory or more computation. Other operators
   have been chosen to be subtract, multiply, bit-wise exclusive or (⊕),
   and structures like

       op(x_{n−p}, x_{n−q}, x_{n−q+1}) ≡ x_{n−p} ⊕ ((x^u_{n−q} | x^l_{n−q+1}) M),

   where M is a bit-matrix, and (x^u_· | x^l_·) means the concatenation of the u
   upper bits of the first argument and the l bits of the second argument [103]. In
   these integer cases, the routine returns a floating point result normalized
   to the wordsize, that is 0.0 < x < 1.0.
The LC methods require only one initial value (x_0). Because the generator is
a one-step recurrence, if any x_i in the sequence repeats, the whole series repeats
in exactly the same order. Notice that the LC method is a one-step recursion and
will not parallelize easily as written above. It is easy to parallelize by creating
streams of such recurrences, however. The so-called period of the LC method
is less than wordsize. This means that a sequence generated by this procedure
repeats itself after this number (the period). To put this in perspective, using
IEEE single precision floating point arithmetic, the maximal period of an LC
generator is 2^24. On a machine with a 1 GHz clock and assuming that the basic
LC operation takes 10 cycles, the random number stream repeats itself every
10 s! This says that the LC method cannot be used with single precision IEEE
wordsize for any serious MC simulation. Repeating the same random sample
statistics over and over again is a terrible idea. Clearly, in IEEE double precision
(53 bits of mantissa, not 24), the situation is much better since 2^53 is a very large
number. Other more sophisticated implementations can give good results. For
example, the 59-bit NAG [68] C05AGF library routine generates good statistics,
but returns only one result per call.
The LF method, and its generalizations (see p. 63), requires memory space for
a buffer. Using a circular buffer, max(p, q) is the storage required for this procedure.
The LF method can be vectorized along the direction of the recurrence to a
maximum segment length (VL in Chapter 3) of min(p, q) at a time. Here is a code
fragment to show how the buffer is filled. We assume LAGP = p > q = LAGQ
for convenience [14, 104].
kP = 0;
kQ = LAGP - LAGQ;
for (i = 0; i < LAGQ; i++) {          /* x_n = x_{n-p} + x_{n-q} mod 1.0      */
    t       = buff[i] + buff[kQ + i]; /* these LAGQ updates are independent   */
    buff[i] = t - (double)((int) t);  /* keep only the fractional part        */
}
kP = LAGQ;
kQ = 0;
for (i = 0; i < LAGP - LAGQ; i++) {   /* the remaining LAGP-LAGQ updates      */
    t            = buff[i + kP] + buff[i + kQ];
    buff[i + kP] = t - (double)((int) t);
}
Uniformly distributed numbers x are taken from buff, which is refilled
as needed. The maximal periods of LF sequences are extremely large,
(1/2) wordsize · (2^{p∨q} − 1), because the whole buffer of length max(p, q) must
repeat for the whole sequence to repeat. These sequences are not without their
problems, however. Hypercubes in dimension p ∨ q can exhibit correlations, so
long lags are recommended—or a more sophisticated operator ⊗. In our tests
[62, 124], the Mersenne Twister [103] shows good test results and the latest
implementations are sufficiently fast. M. Lüscher's RANLUX [98] in F. James'
implementation [79] also tests well at high “luxury” levels (this randomly throws
away some results to reduce correlations). Furthermore, this routine sensibly
returns many random numbers per call.
In our parallel thinking mode, we see obvious modifications of the above procedures
for parallel (or vector) implementations of i = 1, 2, . . . , VL independent
streams:

    x_n^[i] = a · x_{n−1}^[i]  mod wordsize,      (2.61)

or the Percus–Kalos [119] approach, where the p^[i] are independent primes for each
stream [i]:

    x_n^[i] = a · x_{n−1}^[i] + p^[i]  mod wordsize,      (2.62)

and

    x_n^[i] = x_{n−p}^[i] ⊗ x_{n−q}^[i]  mod 1.0.      (2.63)
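A minimal C sketch of (2.62) follows: VL independent streams advance in a single branch-free loop. The multiplier A, the 48-bit wordsize, and the per-stream primes prime[i] are placeholders chosen only for illustration, not recommended parameters.

#include <stdint.h>

#define VL 64                              /* number of independent streams (illustrative) */

/* One step of VL Percus-Kalos style LC streams, cf. (2.62):
 * x[i] <- (A*x[i] + prime[i]) mod 2^48, returned scaled to (0,1).
 * The loop body has no branches and no cross-stream dependence,
 * so it vectorizes/parallelizes trivially. */
void lc_streams_step(uint64_t x[VL], const uint64_t prime[VL], double u[VL])
{
    const uint64_t A    = 44485709377909ULL;    /* placeholder multiplier        */
    const uint64_t MASK = (1ULL << 48) - 1;     /* reduction mod 2^48            */
    for (int i = 0; i < VL; i++) {
        x[i] = (A * x[i] + prime[i]) & MASK;
        u[i] = (double)x[i] / (double)(1ULL << 48);
    }
}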
The difficulty is in verifying whether the streams are statistically independent.
It is clear that each i counts a distinct sample datum from an independent
initial seed (LC), or seeds (LF). A simple construction shows why one should
expect that parallel implementations of (2.61) and (2.63) should give independent
streams, however. Crucial to this argument is that the generator should
have a very long period, much longer than the wordsize. Here is how it goes:
if the period of the random number generator (RNG) is extremely long (e.g.
2^23 · (2^{p∧q} − 1) for a p, q lagged Fibonacci generator in single precision IEEE
Fig. 2.15. Graphical argument why parallel RNGs should generate parallel
streams: if the cycle is astronomically large, none of the user sequences will
overlap.
arithmetic), then any possible overlap between initial buffers (of length p ∨ q)
is very unlikely. Furthermore, since the parallel simulations are also unlikely to
use even an infinitesimal fraction of the period, the likelihood of overlap of the
independent sequences is also very small. Figure 2.15 shows the reasoning. If
the parallel sequences (labeled 1, 2, 3 in Figure 2.15) are sufficiently small relat-
ive to the period (cycle), the probability of overlap is tiny, hence the sequences
can be expected to be independent. One cannot be convinced entirely by this
argument and much testing is therefore required [62, 135]. Very long period gen-
erators of the LF type [124, 135] and variants [61, 79], when used in parallel with
initial buffers filled by other generators (e.g. ggl) seem to give very satisfactorily
independent streams.
An inherently vector procedure is a matrix method [66, 67] for an array of
results x:

    x_n = M x_{n−1}  mod p,

where p is prime and M is a matrix of “multipliers” whose elements must be selected
by a spectral test [87]. For vectors of large size, finding a suitable matrix M is a
formidable business, however.
More general procedures than (2.60) can be constructed which use more
terms. At the time of this 2nd printing, apparently the best generators are
due to Panneton, L’Ecuyer, and Matsumoto, called by the acronym WELL
(Well Equidistributed Long-period Linear). The full complexity of this set of
WELL generators is beyond the scope of this text but curious readers should
refer to reference [117]. In addition, the web-site given in this reference [117],
also contains an extensive set of test routines, which like the WELL generators
themselves, can be downloaded.
For our purposes, a particularly useful package, SPRNG, written by Mascagni
and co-workers, contains parallel versions of six pseudorandom number generators
[135]. This suite contains parallel generators of the following types (p is a
user selected prime):

• Combined multiple recursive generator:
      x_n = a · x_{n−1} + p  mod 2^64,
      y_n = 107374182 · y_{n−1} + 104480 · y_{n−5}  mod 2147483647,
      z_n = (x_n + 2^32 · y_n)  mod 2^64    (nth result)

• Linear congruential 48-bit (2.61): x_n = a · x_{n−1} + b  mod 2^48

• LC 64-bit (2.61): x_n = a · x_{n−1} + b  mod 2^64

• LC with prime modulus (2.61): x_n = a · x_{n−1}  mod (2^61 − 1)

• Modified LF (2.60):
      x_n = x_{n−p} + x_{n−q}  mod 2^32    (1st sequence)
      y_n = y_{n−p} + y_{n−q}  mod 2^32    (2nd sequence)
      z_n = x_n ⊕ y_n                      (nth result)

• LF with integer multiply (2.60): x_n = x_{n−p} × x_{n−q}  mod 2^64.

This suite has been extensively tested [35] and ported to a wide variety of
machines including the Cray T-3E, IBM SP2, HP Superdome, SGI Origin2000, and
all Linux systems.
2.5.3 Non-uniform distributions
Although uniform generators from Section 2.5.2 form the base routine for most
samplings, other non-uniformly distributed sequences must be constructed from
uniform ones. The simplest method, when it is possible, is by direct inversion.
Imagine that we wish to generate sequences distributed according to the density
p(x), where x is a real scalar. The cumulative probability distribution is

    P[x < z] = P(z) = ∫_{−∞}^{z} p(t) dt.      (2.64)

Direct inversion may be used when the right-hand side of this expression (2.64)
is known and can be inverted. To be more precise: since 0 ≤ P(x) ≤ 1 is
monotonically increasing (p(x) ≥ 0), if u = P(x), then P^{−1}(u) exists. The
issue is whether this inverse is easily found. If it is the case that P^{−1} is easily
computed, then to generate a sequence of mutually independent, non-uniformly
distributed random numbers, one computes

    x_i = P^{−1}(u_i),      (2.65)

where u_1, u_2, . . . is a sequence of uniformly distributed, mutually independent,
real numbers 0.0 < u_i < 1.0: see [34, theorem 2.1].
Example 2.5.1 An easy case is the exponential density p(x) = λ exp(−λx),
x > 0, because P(x) = u = 1 − exp(−λx) is easily inverted for
x = P^{−1}(u). The sequence generated by x_i = −log(u_i)/λ will be exponentially
distributed. This follows because if {u_i} are uniformly distributed in (0, 1), so
are {1 − u_i}.
Things are not usually so simple because computing P^{−1} may be painful.
A normally distributed sequence is a common case:

    P[x < z] = (1/√(2π)) ∫_{−∞}^{z} e^{−t^2/2} dt = 1/2 + (1/2) erf( z/√2 ).
Here the inversion of the error function erf is no small amount of trouble, particularly
in parallel. Thus, a better method for generating normally distributed
random variables is required. Fortunately a good procedure, the Box–Muller
method, is known and even better suited for our purposes since it is easily
parallelized. It generates two normals at a time:

    t_1 = 2π u_1,            // u_1 ∈ U(0, 1)
    t_2 = √(−2 ln(u_2)),     // u_2 ∈ U(0, 1)
    z_1 = cos(t_1) · t_2,
    z_2 = sin(t_1) · t_2.      (2.66)
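In code, (2.66) becomes a branch-free loop over pairs of uniforms, which is what makes it attractive for SIMD/vector execution. The sketch below is ours; u1[] and u2[] are assumed to have been filled by whatever parallel uniform generator is in use.

#include <math.h>

/* Box-Muller, cf. (2.66): turn vl pairs of uniforms (u1[i], u2[i]) in (0,1)
 * into vl pairs of independent standard normals (z1[i], z2[i]).
 * No branches: every iteration does the same work, so the loop vectorizes. */
void box_muller(int vl, const double *u1, const double *u2,
                double *z1, double *z2)
{
    for (int i = 0; i < vl; i++) {
        double t1 = 2.0 * M_PI * u1[i];
        double t2 = sqrt(-2.0 * log(u2[i]));
        z1[i] = cos(t1) * t2;
        z2[i] = sin(t1) * t2;
    }
}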
An essential point here is that the library functions sqrt, log, cos, and sin must
have parallel versions. Usually, this means SIMD (see Chapter 3) modes wherein
multiple streams of operands may be computed concurrently. We have already
briefly discussed parallel generation of the uniformly distributed variables u_1, u_2
(2.60), that is, p. 61, and (2.61), (2.63). Function sqrt may be computed in
various ways, but the essence of the thing starts with a range reduction: writing
x = 2^k x_0 with 1/2 < x_0 ≤ 1,

    √x = 2^{k/2} √x_0 = 2^{(k−1)/2} √(2 · x_0),
where its second form is used when k is odd [49]. Since the floating point repre-
sentation of x is x = [k : x
0
] (in some biased exponent form), computing the new
exponent, either k/2 or (k−1)/2, is effected by masking out the biased exponent,
removing the bias (see Overton [114], Chapter 4) and shifting it right one bit.
The new mantissa, either √x_0 or √(2 · x_0), involves computing only the square
root of an argument between 1/2 and 2. A Newton method is usually chosen to
find the new mantissa: where we want y = √x_0, this takes the recursive form

    y_{n+1} = (1/2) ( y_n + x_0 / y_n ).
For purposes of parallelization, a fixed number of Newton steps is used (say
n = 0, . . . , 3, the number depending on the desired precision and the accuracy of
the initial approximation y_0). Newton steps are then repeated on multiple data:

    y_{n+1}^[i] = (1/2) ( y_n^[i] + x_0^[i] / y_n^[i] ),
where i = 1, . . . , V L, the number of independent square roots to be calculated,
see Section 3.2.
Computing log x is also easily parallelized. Again, writing x = 2^k · x_0, one
computes log x = log_e(2) · log_2(x) from log_2(x) by

    log_2(x) = k + log_2(x_0).
The range of x_0 is 1/2 < x_0 ≤ 1. The log function is slowly varying, so
either polynomial or rational approximations may be used. Assuming a rational
approximation, parallel implementations for VL independent arguments (i =
1, . . . , VL) would look like:
    log_2( x^[i] ) = k^[i] + log_2( x_0^[i] ),
    log_2( x_0^[i] ) = P( x_0^[i] ) / Q( x_0^[i] ),
where P and Q are relatively low order polynomials (about 5–6, see Hart
et al. [49]). Also, see Section 3.5.
Finally, cos and sin: because standard Fortran libraries support complex
arithmetic and exp(z) for a complex argument z requires both cos and sin, most
intrinsic libraries have a function which returns both given a common argument.
For example, the Linux routine in /lib/libm.so sincos returns both sin and
cos of one argument, and the Cray intrinsic Fortran library contains a routine
COSS which does the same thing. In the Intel MKL library [28], the relevant
functions are vsSinCos and vdSinCos for single and double, respectively. It costs
nearly the same to compute both cos and sin as it does to compute either. To
do this, first the argument x is ranged to 0 ≤ x_1 < 2π: x = x_1 + m · 2π, by
division modulo 2π. In the general case, for large arguments this ranging is the
biggest source of error. However, for our problem, x is always in the range 0 < x < 2π.
The remaining computation cos x_1 (likewise sin x_1) is then ranged to 0 ≤ x_1 < π/2
using modular arithmetic and identities among the circular functions. Once this
ranging has been done, a sixth to seventh order polynomial in x_1^2 is used: in
parallel form (see Section 3.5),

    cos( x_1^[i] ) = P( (x_1^[i])^2 ),    for i = 1, . . . , VL.

By appropriate ranging and circular function identities, computing cos is enough
to get both cos and sin; see Hart et al. [49, section 6.4]. Also see Luke [97,
section 3.3, particularly table 3.6].
So, the Box–Muller method illustrates two features: (1) generating normal
random variables is easily done by elementary functions, and (2) the procedure
is relatively easy to parallelize if appropriate approximations for the elementary
functions are used. These approximations are not novel to the parallel world, but
are simply selections from well-known algorithms that are appropriate for parallel
implementations. Finally, an interesting number: univariate normal generation
computed on the NEC SX-4 in vector mode costs
(clock cycles)/(normal) ≈ 7.5,
which includes uniformly distributed variables, see Section 2.5.2 [124]. To us, this
number is astonishingly small but apparently stems from the multiplicity of inde-
pendent arithmetic units: 8 f.p. add and 8 f.p. multiply units. In any case, the
Box–Muller method is faster than the polar method [87] which depends on accept-
ance rejection, hence if/else processing, see Section 3.2.8. On the machines we
tested (Macintosh G-4, HP9000, Pentium III, Pentium 4, Cray SV-1, and NEC
SX-4), Box–Muller was always faster and two machine results are illustrated in
Figure 2.16. Label pv BM means that ggl was used to generate two arrays
of uniformly distributed random variables which were subsequently used in a
vectorized loop. Intel compiler icc vectorized the second part of the split loop
Fig. 2.16. Timings for Box–Muller method vs. polar method for generating
univariate normals. Machines are PA8700, Pentium III, and Pentium 4. The
label Polar indicates the polar method, while pv BM means a “partially”
vectorized Box–Muller method. Uniforms were generated by ggl which does
not vectorize easily, see Section 2.5.2. The compilers were icc on the
Pentiums and gcc on the PA8700.
nicely but the library does not seem to include a vectorized RNG. On one CPU of
Stardust (an HP9000), gcc -O3 produced faster code than both the native cc
-O3 and guidec -O4 [77]. On Cray SV-2 machines using the ranf RNG, the spee-
dups are impressive—Box–Muller is easily a factor of 10 faster. See Section 3.5.7
for the relevant compiler switches used in Figure 2.16.
2.5.3.1 Acceptance/rejection methods
In this book we wish to focus on those aspects of MC relating to parallel computing.
In Equation (2.66), we showed the Box–Muller method for normals and remarked
that this procedure is superior to the so-called polar method for parallel computing.
While there remains some dispute about the tails of the distributions of
the Box–Muller and polar methods, experience shows that if the underlying uniform
generator is of high quality, the results of these two related procedures are quite
satisfactory; see Gentle [56, section 3.1.2]. Here, we show why the Box–Muller
method is superior to the acceptance/rejection polar method, while at the same
time emphasizing why branching (if/else processing) is the antithesis of parallelism
[57]. First we need to know what the acceptance/rejection (AR) method
is, and then we give some examples.
The idea is due to von Neumann and is illustrated in Figure 2.17. We wish to
sample from a probability density function p(x), whose cumulative distribution
is difficult to invert (2.65). This is done by finding another distribution function,
say q(x), which is easy to sample from and for which there exists a constant
Fig. 2.17. AR method: the lower curve represents the desired distribution density
p(x), while the covering is cq(x). Distribution density q(x) must be easy to
sample from and c > 1. The set S = {(x, y) : 0 ≤ y ≤ cq(x)} covers the set
A = {(x, y) : 0 ≤ y ≤ p(x)} and points (x, y) are uniformly distributed in
both A and S.
1 < c < ∞ such that 0 ≤ p(x) ≤ cq(x) for all x in the domain of functions
p(x), q(x). Here is the general algorithm:
P1: pick x from p.d.f. q(x)
    pick u ∈ U(0, 1)
    y = u · c · q(x)
    if (y ≤ p(x)) then
        accept x
    else
        goto P1
    endif
Here is the proof that this works [99].
Theorem 2.5.2 Acceptance/rejection samples according to the probability
density p(x).
Proof Given that x is taken from the distribution q, y is thus uniformly distributed
in the range 0 ≤ y ≤ cq(x). The joint probability density function is therefore

    p_{[xy]}(x, y) = q(x) h(y|x),

where h(y|x) is the conditional probability density for y given x. Since y is uniformly
distributed, h(y|x) does not depend on y, and is then h(y|x) = 1/(cq(x)),
so that ∫_0^{cq(x)} h dy = 1. We get p_{[xy]}(x, y) = 1/c, a constant. For this reason, these
(x, y) points are uniformly distributed in the set

    S = {(x, y) : 0 ≤ y ≤ cq(x)}.

Now, let A ⊂ S,

    A = {(x, y) : 0 ≤ y ≤ p(x)},

which contains the (x, y) ∈ S under the curve p as shown in Figure 2.17. Since
A ⊂ S, the points (x, y) ∈ A are uniformly distributed, hence the probability
density function for these (x, y) is a constant (the X → ∞ limit of the next
expression shows it must be 1). Therefore,

    P[x ≤ X] = ∫_{−∞}^{X} dx ∫_0^{p(x)} dy = ∫_{−∞}^{X} p(x) dx.

Hence x is distributed according to the density p(x), which is von Neumann's
result.  □
The generalization of AR to higher dimensional spaces, from scalar x to vector x,
involves only a change in notation, and readers are encouraged to look at Madras'
proof [99].
The so-called acceptance ratio is given by

    acceptance ratio = (number of x accepted)/(number of x tried) = 1/c < 1.
The polar method, which again generates two univariate normals (z_1, z_2), is
a good illustration. Here is the algorithm:

    P1: pick u_1, u_2 ∈ U(0, 1)
        v_1 = 2u_1 − 1
        v_2 = 2u_2 − 1
        S = v_1^2 + v_2^2
        if (S < 1) then
            z_1 = v_1 √(−2 log(S)/S)
            z_2 = v_2 √(−2 log(S)/S)
        else
            goto P1
        endif
and the inscription method is shown in Figure 2.18. This is a simple example
of AR, and the acceptance ratio is π/4. That is, for any given randomly chosen
point in the square, the chance that (v_1, v_2) will fall within the circle is the ratio
of the areas of the circle to the square.
The basic antithesis is clear: if processing on one independent datum fol-
lows one instruction path while another datum takes an entirely different one,
each may take different times to complete. Hence, they cannot be processed
Fig. 2.18. Polar method for normal random variates. A circle of radius one is
inscribed in a square with sides equal to two. Possible (v_1, v_2) coordinates are
randomly selected inside the square: if they fall within the circle, they are
accepted; if they fall outside the circle, they are rejected.
concurrently and have the results ready at the same time. Our essential point
regarding parallel computing here is that independent possible (v_1, v_2) pairs may
take different computational paths, thus they cannot be computed concurrently
in a synchronous way. On independent processors, of course, the lack of synchrony
is not a problem, but requesting one z_1, z_2 pair from an independent
CPU is asking for a very small calculation that requires resource management,
for example, a job scheduling operating system intervention. On older architectures
where cos, sin, and other elementary functions were expensive, the polar
method was superior. We will be more expansive about branch processing (the
if statement above) in Chapter 3.
In a general context, AR should be avoided for small tasks. For larger tasks,
distributed over multiple independent processors, the large inherent latencies (see
Section 3.2.1) of task assignment become insignificant compared to the (larger)
computation. Conversely, the tiny computation of the z_1, z_2 pair by the polar
method is too small compared to the overhead of task assignment. In some
situations, AR cannot be avoided. An indirect address scheme (see Section 3.2.2
regarding scatter/gather operations) can be a way to index accepted points but
not rejected ones. An integer routine fische, which returns values distributed
according to a Poisson distribution, p(n, µ) = e^{−µ} µ^n / n!, uses this idea. You
can download this from the zufall package on NETLIB [111] or from our ftp
server [6].
If a branch is nearly always taken (i.e. c ∼ 1), the results using AR for small
tasks can be effective. For example, Figure 2.19 shows a stratified sampling
method for Gaussian random variables by Marsaglia and Tsang [100]. Their
idea is to construct a set of strips which cover one half of the symmetric distribution,
each with the same area: see Figure 2.19(a). Unlike the diagram, the
code actually has 256 strips, not 8. The lowest lying strip, handling the tail, uses
Marsaglia's method [101]. The returned normal is given a random sign. Each
strip area is the same, so the strips may be picked uniformly. The rejection rate
is very small. The computation is in effect a table lookup to find the strip parameters,
which are pre-initialized. Their method is extremely fast, as Figure 2.19
shows, but there are problems with the existing code [100]: the uniform random
number generator is in-line and thus hard to change.
There is an important point here, apparently not lost on Marsaglia and
Tsang, for a procedure call costs at least 1–5 µs. Furthermore, as written, the
code is only suitable for 32-bit floating point, again not easily modified. The
speed performance is quite impressive, however, so this method holds consider-
able promise. To us, the lesson to be learned here is that a parallel method may
be faster than a similar sequential one, but a better algorithm wins.
Not to put too fine a point on possible AR problems, but the following calcula-
tion shows that completely aside from parallel computing, AR can occasionally
fail miserably when the dimension of the system is high enough that the accept-
ance ratio becomes small. Our example problem here is to sample uniformly
inside, or on the surface of, a unit n-sphere. Here is the AR version for
interior sampling:

    P1: pick u_1, u_2, . . . , u_n, each u_j ∈ U(0, 1)
        v_1 = 2u_1 − 1
        v_2 = 2u_2 − 1
        . . .
        v_n = 2u_n − 1
        S = v_1^2 + v_2^2 + · · · + v_n^2
        if (S < 1) then
            accept points {v_1, v_2, . . . , v_n}
        else
            goto P1
        endif
Fig. 2.19. Box–Muller vs. Ziggurat method. (top) A simplified diagram of the
stratified sampling used in the procedure [100]. In the actual code, there
are 256 strips, not 8. (bottom) Comparative performance of Box–Muller vs.
Ziggurat on a Pentium III and Pentium 4.
Alas, here the ratio of the volume of an inscribed unit radius n-sphere to the volume
of a side = 2 covering n-cube (see Figure 2.18) gets small fast:

    (volume of unit n-sphere)/(volume of n-cube) = Ω_n ∫_0^1 dr r^{n−1} / 2^n
                                                 ≈ (1/√(2π)) α^n (n/2)^{−(n+1)/2},

for large n. Constant α = √(eπ)/2 ≈ 1.46 and Ω_n is the solid angle volume of an
n-sphere [46, p. 234],

    Ω_n = 2 π^{n/2} / Γ(n/2).

Hence, using Stirling's approximation, we see a rapidly decreasing acceptance
ratio for large n. The following table shows how bad the acceptance ratio gets
even for relatively modest n values.

    n       2       3       4       10        20
    Ratio   0.785   0.524   0.308   0.0025    0.25 × 10^−7
A vastly better approach is given in Devroye [34, section 4.2]. This is an
isotropic algorithm for both surface and interior sampling. If interior points are
desired, the density function for isotropically distributed points is p(r) = k · r^{n−1}.
Since ∫_0^1 p(r) dr = 1, then k = n and P[|x| < r] = r^n. Sampling of |x| takes the
form |x| = u^{1/n} from (2.65).
    for i = 1, . . . , n {
        z_i = box-muller
    }
    r = sqrt( ∑_{i=1}^{n} z_i^2 )
    for i = 1, . . . , n {          // project onto surface
        x_i = z_i / r
    }
    if (interior points) {
        // sample radius according to P = r^n
        u ∈ U(0, 1)
        r = u^{1/n}
        for i = 1, . . . , n {
            x_i = r x_i
        }
    }
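For completeness, here is the same isotropic procedure as a small C routine (our sketch); normal01() and uniform01() stand for any source of standard normals and uniforms, for instance the Box–Muller code shown earlier and one of the uniform generators of Section 2.5.2.

#include <math.h>

double normal01(void);    /* any N(0,1) generator, e.g. Box-Muller; assumed given */
double uniform01(void);   /* any U(0,1) generator; assumed given                  */

/* Isotropic sampling for the unit n-sphere: fills x[0..n-1] with a point
 * uniformly distributed on the surface (interior = 0) or in the interior
 * (interior = 1), following Devroye's method described above. */
void sphere_sample(int n, int interior, double *x)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) { x[i] = normal01(); s += x[i] * x[i]; }
    double r = sqrt(s);
    for (int i = 0; i < n; i++) x[i] /= r;             /* project onto surface    */
    if (interior) {
        double rad = pow(uniform01(), 1.0 / n);        /* P[|x| < r] = r^n        */
        for (int i = 0; i < n; i++) x[i] *= rad;
    }
}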
A plot of the timings for both the AR and isotropic procedures is shown in
Figure 2.20.
Fig. 2.20. Timings on NEC SX-4 for uniform interior sampling of an n-sphere.
Lower curve (isotropic method) uses n normals z, while the absurdly rising
curve shows the inscription method.
2.5.3.2 Langevin methods
Earlier in this chapter, we mentioned a common computation in quantum and
statistical physics—that of finding the partition function and its associated
Boltzmann distribution, Equation (2.59). The usual problem of interest is to
find the expected (mean) value of a physical variable, say f(x), where x may
be very high dimensional. In an extreme case, n = dim[x] ∼ 6 · N_A, where
N_A is Avogadro's number (N_A ≈ 6.022 × 10^23) and the 6 refers to the 3
space + 3 momentum degrees of freedom of the N_A particles. For the foreseeable
future, computing machinery will be unlikely to handle N_A particles, but
already astrophysical simulations are doing O(10^8)—an extremely large number
of particles whose positions and velocities must each be integrated over many timesteps.
Frequently, physicists are more interested in statistical properties than in
detailed orbits of a large number of arbitrarily labeled particles. In that case,
the problem is computing the expected value of f, written (see (2.59))

    Ef = ∫ e^{−S(x)} f(x) d^n x / Z,      (2.67)
where the partition function Z = ∫ e^{−S(x)} d^n x normalizes e^{−S}/Z in (2.59). For
our simplified discussion, we assume the integration region is all of E^n, the
Euclidean space in n dimensions. For an arbitrary action S(x), generating configurations
of x according to the density p(x) = e^{−S(x)}/Z can be painful unless
S is, say, Gaussian, which basically means the particles have no interactions. A
relatively general method exists, however, when S(x) > 0 has the contraction
property that when |x| is large enough,

    x · ∂S/∂x > 1.      (2.68)
In this situation, it is possible to generate samples of x according to p(x) by
using Langevin's equation [10]. If S has the above contraction property and
some smoothness requirements [86, 106], which usually turn out to be physically
reasonable, a Langevin process x(t) given by the following equation will converge
to a stationary state:

    dx(t) = −(1/2) (∂S/∂x) dt + dw(t),      (2.69)

where w is a Brownian motion, and the stationary distribution will be p(x).
Equation (2.69) follows from the Fokker–Planck equation (see [10], equation
11.3.15) and (2.79) below. What are we to make of w? Its properties are as
follows.
1. The probability density for w satisfies the heat equation:

       ∂p(w, t)/∂t = (1/2) ∆p(w, t),

   where ∆ is the Laplace operator in n dimensions, that is,

       ∆p = Σ_{i=1}^{n} ∂²p/∂w_i².
2. The t = 0 initial distribution for p is p(w, t = 0) = δ(w).
3. Its increments dw (when t → t + dt) satisfy the equations

       E dw_i(t) = 0,
       E dw_i(t) dw_j(s) = δ_ij δ(s − t) dt ds.

   In small but finite form, these equations are

       E ∆w_i(t_k) = 0,                              (2.70)
       E ∆w_i(t_k) ∆w_j(t_l) = h δ_ij δ_kl,          (2.71)

   where h is the time step h = ∆t.
In these equations, δ_ij = 1 if i = j and zero otherwise: the δ_ij are the elements of the identity matrix in n dimensions; and δ(x) is the Dirac delta function—zero everywhere except x = 0, but ∫ δ(x) dⁿx = 1. Item (1) says w(t) is a diffusion process, which by item (2) means w(0) starts at the origin and diffuses out with a mean square E|w|² = n · t. Item (3) says increments dw are only locally correlated.
There is an intimate connection between parabolic (and elliptic) partial differ-
ential equations and stochastic differential equations. Here is a one-dimensional
version for the case of a continuous Markov process y,
p(y|x, t) = transition probability of x → y, after time t. (2.72)
The important properties of p(y|x, t) are

    ∫ p(y|x, t) dy = 1,                                  (2.73)
    p(z|x, t₁ + t₂) = ∫ p(z|y, t₂) p(y|x, t₁) dy,        (2.74)
where (2.73) says that x must go somewhere in time t, and (2.74) says that after t₁, x must have gone somewhere (i.e. y) and if we sum over all possible ys, we get all probabilities. Equation (2.74) is known as the Chapman–Kolmogorov–Smoluchowski equation [10]; its variants extend to quantum probabilities and gave rise to Feynman's formulation of quantum mechanics,
see Feynman–Hibbs [50]. We would like to find an evolution equation for the
transition probability p(y|x, t) as a function of t. To do so, let f(y) be an arbitrary
smooth function of y, say f ∈ C^∞. Then using (2.74)

    ∫ f(y) ∂p(y|x, t)/∂t dy
        = lim_{∆t→0} (1/∆t) ∫ f(y) [ p(y|x, t + ∆t) − p(y|x, t) ] dy
        = lim_{∆t→0} (1/∆t) ∫ f(y) [ ∫ p(y|z, ∆t) p(z|x, t) dz − p(y|x, t) ] dy.
For small changes ∆t, most of the contribution to p(y|z, ∆t) will be near y ∼ z, so we expand f(y) = f(z) + f_z(z)(y − z) + (1/2) f_zz(z)(y − z)² + · · · to get

    ∫ f(y) ∂p(y|x, t)/∂t dy
        = lim_{∆t→0} (1/∆t) { ∫∫ f(z) p(y|z, ∆t) p(z|x, t) dz dy
          + ∫∫ (∂f(z)/∂z) (y − z) p(y|z, ∆t) p(z|x, t) dz dy
          + ∫∫ (1/2) (∂²f(z)/∂z²) (y − z)² p(y|z, ∆t) p(z|x, t) dz dy
          + O((y − z)³) − ∫ f(y) p(y|x, t) dy }.                      (2.75)
If most of the contributions to the z → y transition in small time increment ∆t are around y ∼ z, one makes the following assumptions [10]

    ∫ p(y|z, ∆t)(y − z) dy = b(z)∆t + o(∆t),              (2.76)
    ∫ p(y|z, ∆t)(y − z)² dy = a(z)∆t + o(∆t),             (2.77)
    ∫ p(y|z, ∆t)(y − z)^m dy = o(∆t),   for m > 2.
Substituting (2.76) and (2.77) into (2.75) and using (2.73) together with an integration-variable substitution, we get, in the ∆t → 0 limit,

    ∫ f(z) ∂p(z|x, t)/∂t dz = ∫ p(z|x, t) [ (∂f/∂z) b(z) + (1/2) (∂²f/∂z²) a(z) ] dz
                            = ∫ f(z) { −(∂/∂z)[b(z) p(z|x, t)] + (1/2)(∂²/∂z²)[a(z) p(z|x, t)] } dz.
                                                                      (2.78)
In the second line of (2.78), we integrated by parts and assumed that the large |y|
boundary terms are zero. Since f(y) is perfectly arbitrary, it must be true that
    ∂p(y|x, t)/∂t = −(∂/∂y)[b(y) p(y|x, t)] + (1/2)(∂²/∂y²)[a(y) p(y|x, t)].        (2.79)
Equation (2.79) is the Fokker–Planck equation and is the starting point for many
probabilistic representations of solutions for partial differential equations. Gener-
alization of (2.79) to dimensions greater than one is not difficult. The coefficient
b(y) becomes a vector (drift coefficients) and a(y) becomes a symmetric matrix
(diffusion matrix).
In our example (2.69), the coefficients are b(x) = −(1/2) ∂S/∂x and a(x) = I, the identity matrix.
The Fokker–Planck equation for (2.69) is thus

    ∂p/∂t = (1/2) (∂/∂x) · [ (∂S/∂x) p + ∂p/∂x ].
And the stationary state is when ∂p/∂t → 0, so as t → ∞

    (∂S/∂x) p + ∂p/∂x = constant = 0,
if p goes to zero at |x| → ∞. Integrating once more,

    p(x, t → ∞) = e^(−S(x)) / Z,

where 1/Z is the normalization constant, hence we get (2.68).
How, then, do we simulate Equation (2.69)? We only touch on the easiest
way here, a simple forward Euler method:
    ∆w = √h ξ,
    ∆x = −(h/2) (∂S/∂x)_k + ∆w,
    x_{k+1} = x_k + ∆x.                                   (2.80)
At each timestep of size h, a new vector ξ of mutually independent, univariate, normally distributed random numbers is generated by the Box–Muller method (see Equation (2.66)), and the increments are simulated by ∆w = √h ξ.
The most useful parallel form is to compute a sample of N such vectors, i = 1, . . . , N, per time step:

    ∆w^[i] = √h ξ^[i],
    ∆x^[i] = −(h/2) (∂S/∂x)^[i]_k + ∆w^[i],
    x^[i]_{k+1} = x^[i]_k + ∆x^[i].                       (2.81)
If S is contracting, that is, (2.68) holds when |x| is large, then x(t) will converge to a stationary process and Ef in (2.67) may be computed by a long time average [118]

    Ef ≈ (1/T) ∫₀^T f(x(t)) dt ≈ (1/(mN)) Σ_{k=1}^{m} Σ_{i=1}^{N} f(x^[i]_k).        (2.82)
Two features of (2.82) should be noted: (1) the number of time steps chosen is
m, that is, T = m· h; and (2) we can accelerate the convergence by the sample
average over N. The convergence rate, measured by the variance, is O(1/mN):
the statistical error is O(1/√(mN)). Process x(t) will become stationary when m is large enough that T = h · m exceeds the relaxation time of the dynamics: to lowest order this means that, relative to the smallest eigenvalue λ_small > 0 of the Jacobian [∂²S/∂x_i ∂x_j], we need Tλ_small ≫ 1.
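As a small, serial illustration of (2.80)–(2.82) (added here, not from the original text; the quadratic action S(x) = x²/2 and the observable f(x) = x² are assumptions chosen so that the exact answer Ef = 1 is known), the N-sample Euler scheme can be sketched in C as follows; the inner loop over i is the part that vectorizes or parallelizes.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double box_muller(void)           /* standard normal via a Box-Muller transform */
{
    const double pi = 4.0 * atan(1.0);
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * pi * u2);
}

int main(void)
{
    const int    N = 1024;                /* parallel realizations, i = 1..N     */
    const int    m = 10000;               /* time steps, T = m*h                 */
    const int    burn = 1000;             /* discard early, non-stationary steps */
    const double h = 0.01;
    double x[1024] = {0.0};               /* all chains start at the origin      */
    double sum = 0.0;

    for (int k = 0; k < m; k++) {
        for (int i = 0; i < N; i++) {           /* the SIMD/parallel loop        */
            double dw = sqrt(h) * box_muller(); /* dw = sqrt(h)*xi, cf. (2.80)   */
            x[i] += -0.5 * h * x[i] + dw;       /* dS/dx = x for S = x^2/2       */
            if (k >= burn) sum += x[i] * x[i];  /* accumulate f(x) = x^2, (2.82) */
        }
    }
    printf("Ef = %g  (exact 1 for this S and f)\n",
           sum / ((double)(m - burn) * N));
    return 0;
}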
There are several reasons to find (2.82) attractive. If n, the size of the vectors x and w, is very large, N = 1 may be chosen to save memory. The remaining i = 1 equation is also very likely to be easy to parallelize by usual matrix methods for large n. Equation (2.80) is an explicit equation—only old data x_k appear on the right-hand side. The downside of such an explicit formulation is that Euler methods can be unstable for a large stepsize h. In that situation, higher order methods, implicit or semi-implicit methods, or smaller time steps can be used [86, 106, 124]. In each of these modifications, parallelism is preserved.
To end this digression on Langevin equations (stochastic differential equations), we show examples of two simple simulations. In Figure 2.21, we show 32 simulated paths of a two-dimensional Brownian motion. All paths begin at the origin. Writing

    w = (w₁, w₂)ᵀ,

the updates at each time step are

    w_{k+1} = w_k + √h (ξ₁, ξ₂)ᵀ.

The pair ξ = (ξ₁, ξ₂) is generated at each step by (2.66), and h = 0.01.
[Plot: simulated paths in the (x, y) plane; both axes run from −10 to 10.]
Fig. 2.21. Simulated two-dimensional Brownian motion. There are 32 simulated
paths, with timestep h = 0.1 and 100 steps: that is, the final time is t = 100h.
[Histogram: frequency (log scale, 10⁻¹–10⁴) versus process value x(t) from −5 to 5; annotation: dx = −(x/|x|) dt + dw, t = 100; expl. TR, x₀ = 0, N = 10,240, h = 0.01.]
Fig. 2.22. Convergence of the optimal control process. The parameters are N =
10, 240 and time step h = 0.01. Integration was by an explicit trapezoidal
rule [120].
And finally, in Figure 2.22, we show the distribution of a converged
optimal control process. The Langevin equation (2.80) for this one-dimensional
example is
dx = −sign(x) dt + dw.
It may seem astonishing at first to see that the drift term, −sign(x) dt, is suf-
ficiently contracting to force the process to converge. But it does converge and
to a symmetric exponential distribution, p ∼ exp(−2|x|). Again, this is the
Fokker–Planck equation (2.79), which in this case takes the form

    ∂p/∂t = (∂/∂x)[sign(x) p] + (1/2) ∂²p/∂x²  →  0  as t → ∞.
The evolution ∂p/∂t → 0 means the distribution becomes stationary (time independent), and p = e^(−2|x|) follows from two integrations. The first is to a constant in terms of the x = ±∞ boundary values of p and p′, which must both be zero if p is normalizable, that is, ∫ p(x, t) dx = 1. The second integration constant is determined by the same normalization.
Exercise 2.1 MC integration in one dimension. As we discussed in Section 2.5, MC methods are particularly useful for doing integrations. However, we also pointed out that the statistical error is proportional to N^(−1/2), where N is the sample size. This statistical error can be large since N^(−1/2) = 1 percent only when N = 10⁴. The keyword here is proportional. That is, although the statistical error is k · N^(−1/2) for some constant k, this constant can be reduced significantly.
In this exercise, we want you to try some variance reduction schemes.
What is to be done?
The following integral defines I₀(x), the zeroth order modified Bessel function with argument x:

    I₀(x) = (1/π) ∫₀^π e^(−x cos ζ) dζ;

see, for example, Abramowitz and Stegun [2], or any comparable table.
1. The method of antithetic variables applies in any given domain of integra-
tion where the integrand is monotonic [99]. Notice that for all positive x,
the integrand exp(−xcos ζ) is monotonically decreasing on 0 ≤ ζ ≤ π/2
and is symmetric around ζ = π/2. So, the antithetic variates method
applies if we integrate only to ζ = π/2 and double the result. For x = 1,
the antithetic variates method is

    I₀(1) ≈ I₊ + I₋ = (1/N) Σ_{i=1}^{N} e^(−cos(πu_i/2)) + (1/N) Σ_{i=1}^{N} e^(−cos(π(1−u_i)/2)).
Try the I₊ + I₋ antithetic variates method for I₀(1) and compute the variance (e.g. do several runs with different seeds) for both this method and a raw integration of the same size N—and compare the results.
2. The method of control variates also works well here. The idea is that an
integral

    I = ∫₀¹ g(u) du

can be rewritten for a control variate φ as

    I = ∫₀¹ [g(u) − φ(u)] du + ∫₀¹ φ(u) du
      ≈ (1/N) Σ_{i=1}^{N} [g(u_i) − φ(u_i)] + I_φ,

where I_φ is the same as I except that g is replaced by φ. The method consists in picking a useful φ whose integral I_φ is known. A good one for
the modified Bessel function is
    φ(u) = 1 − cos(πu/2) + (1/2) cos²(πu/2),
the first three terms of the Taylor series expansion of the integrand. Again, we have chosen x = 1 to be specific. For this φ, compute I₀(1) and compare with the results from a raw procedure without the control variate. By doing several runs with different seeds, you can get variances. (A short C code sketch illustrating both variance-reduction methods is given below.)
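The following is one possible C sketch of the three estimators (raw, antithetic, and control variate) for I₀(1); it is an added illustration, not the exercise's reference solution. It uses the full-range substitution ζ = πu, so the antithetic pair is (u, 1 − u) and the control variate is the analogous Taylor polynomial 1 − cos(πu) + (1/2) cos²(πu), whose exact integral over (0, 1) is 5/4; these normalization choices are assumptions of this sketch and differ slightly from the half-range formulas quoted above.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265358979323846

static double g(double u)   { return exp(-cos(PI * u)); }   /* integrand after zeta = pi*u */
static double phi(double u) { double c = cos(PI * u);       /* control variate             */
                              return 1.0 - c + 0.5 * c * c; }

int main(void)
{
    const int N = 100000;
    const double I_phi = 1.25;               /* exact integral of phi over (0,1) */
    double raw = 0.0, anti = 0.0, ctrl = 0.0;

    for (int i = 0; i < N; i++) {
        double u = (rand() + 0.5) / (RAND_MAX + 1.0);
        raw  += g(u);
        anti += 0.5 * (g(u) + g(1.0 - u));   /* antithetic pair u, 1-u */
        ctrl += g(u) - phi(u);               /* control variate        */
    }
    printf("raw         %.5f\n", raw  / N);
    printf("antithetic  %.5f\n", anti / N);
    printf("control     %.5f\n", ctrl / N + I_phi);
    printf("exact I0(1) = 1.26607\n");
    return 0;
}

Running the program several times with different srand() seeds gives the variances the exercise asks for; both refinements should show a visibly smaller spread than the raw estimate.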
Exercise 2.2 Solving partial differential equations by MC. This assignment is to solve a three-dimensional partial differential equation inside an ellipsoidal region by MC simulations of the Feynman–Kac formula. The following partial differential equation is defined in a three-dimensional ellipsoid D:
    (1/2) ∆u(x, y, z) − v(x, y, z) u(x, y, z) = 0,        (2.83)
where the ellipsoidal domain is

    D = { (x, y, z) ∈ R³ : x²/a² + y²/b² + z²/c² < 1 }.
The potential is

    v(x, y, z) = 2 (x²/a⁴ + y²/b⁴ + z²/c⁴) + 1/a² + 1/b² + 1/c².

Our Dirichlet boundary condition is

    u = g(x) = 1   when x ∈ ∂D.
The goal is to solve this boundary value problem at x = (x, y, z) by an MC simulation. At the heart of the simulation lies the Feynman–Kac formula, which in our case (g = 1) is

    u(x, y, z) = E [ g(X(τ)) exp( −∫₀^τ v(X(s)) ds ) ]
               = E exp( −∫₀^τ v(X(s)) ds )
               = E Y(τ),

which describes the solution u in terms of an expectation value of a stochastic process Y. Here X(s) is a Brownian motion starting from X(0) = x, τ is its exit time from D (see Section 2.5.3.2), and for convenience Y is defined as the exponential of the integral −∫₀^t v(X(s)) ds.
The simulation proceeds as follows.
• Start at some interior point X(0) = x ∈ D.
• Generate N realizations of X(t) by integrating the following system of stochastic differential equations (W is a Brownian motion):

      dX = dW,
      dY = −v(X(t)) Y dt.
• Integrate this set of equations until X(t) exits at time t = τ.
• Compute u(x) = EY (X(τ)) to get the solution.
The initial values for each realization are X(0) = (x, y, z) (for X) and Y (0) = 1
(for Y ), respectively. A simple integration procedure uses an explicit trapezoidal
rule for each step n →n + 1 (i.e. t →t + h):
    X_n = X_{n−1} + √h ξ,
    Y_e = (1 − v(X_{n−1}) h) Y_{n−1}                                 (an Euler step),
    Y_n = Y_{n−1} − (h/2) [ v(X_n) Y_e + v(X_{n−1}) Y_{n−1} ]        (trapezium),
where h is the step size and ξ is a vector of three roughly normally distributed independent random variables, each with mean 0 and variance 1. An inexpensive procedure is (three new ones, ξ = (ξ₁, ξ₂, ξ₃), one for each coordinate k = 1, . . . , 3, at each time step)

    ξ_k = ±√3,  each with probability 1/6,
        = 0,    with probability 2/3.
At the end of each time step, check whether X_n ∈ D: if (X₁/a)² + (X₂/b)² + (X₃/c)² ≥ 1, this realization of X has exited the domain D. Label this realization i. For each i = 1, . . . , N, save the value Y^(i) = Y(τ). When all N realizations i have exited, compute
    u_mc(x) = (1/N) Σ_{i=1}^{N} Y^(i).

This is the MC solution u_mc(x) ≈ u(x) to (2.83) for x = (x, y, z).
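The whole procedure fits in a few dozen lines of C. The sketch below is an added illustration, not the assignment's reference code: it uses the ellipsoid of hint (e), the cheap three-point ξ above, and the explicit trapezoidal rule; the starting point, N, and h are arbitrary choices within the hints.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static const double A = 3.0, B = 2.0, C = 1.0;       /* semi-axes, hint (e) */

static double v(const double *x)                     /* potential in (2.83) */
{
    return 2.0 * (x[0]*x[0]/(A*A*A*A) + x[1]*x[1]/(B*B*B*B) + x[2]*x[2]/(C*C*C*C))
         + 1.0/(A*A) + 1.0/(B*B) + 1.0/(C*C);
}

static double xi3(void)   /* +-sqrt(3) with prob. 1/6 each, else 0 (modulo bias ignored) */
{
    int r = rand() % 6;
    if (r == 0) return  sqrt(3.0);
    if (r == 1) return -sqrt(3.0);
    return 0.0;
}

static double u_mc(const double *x0, int N, double h)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        double X[3] = { x0[0], x0[1], x0[2] }, Y = 1.0;
        for (;;) {
            double vold = v(X);
            for (int k = 0; k < 3; k++) X[k] += sqrt(h) * xi3();
            double Ye = (1.0 - vold * h) * Y;             /* Euler predictor     */
            Y -= 0.5 * h * (v(X) * Ye + vold * Y);        /* trapezoidal step    */
            if (X[0]*X[0]/(A*A) + X[1]*X[1]/(B*B) + X[2]*X[2]/(C*C) >= 1.0)
                break;                                    /* exited D at t = tau */
        }
        sum += Y;                                         /* Y^(i) = Y(tau)      */
    }
    return sum / N;
}

int main(void)
{
    double x0[3] = {1.0, 0.0, 0.0};
    double u  = u_mc(x0, 10000, 0.001);
    double ue = exp(x0[0]*x0[0]/(A*A) + x0[1]*x0[1]/(B*B) + x0[2]*x0[2]/(C*C) - 1.0);
    printf("u_mc = %g   exact (2.84) = %g\n", u, ue);
    return 0;
}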
What is to be done?
The assignment consists of
(1) implement the MC simulation on a serial machine and test your code;
(2) for several initial values of x ∈ D, compute the RMS error compared
to the known analytic solution given below. For example, m = 50–100
values x = (x, 0, 0), −a < x < a. See note (b).
Note: This problem will be revisited in Chapter 5.
Hints
(a) The exact solution of the partial differential equation is (easily seen by
two differentiations)
    u(x, y, z) = exp( x²/a² + y²/b² + z²/c² − 1 ).        (2.84)
(b) Measure the accuracy of your solution by computing the root mean square error between the exact solution u(x) (2.84) and the numerical approximation u_mc(x) given by the MC integration method for m (say m = 50 or 100) starting points x_i ∈ D:

        rms = sqrt( (1/m) Σ_{i=1}^{m} ( u(x_i) − u_mc(x_i) )² ).
(c) For this problem, a stepsize of 1/1000 ≤ h ≤ 1/100 should give a
reasonable result.
(d) A sensible value for N is 10,000 (∼ 1–2 percent statistical error).
(e) Choose some reasonable values for the ellipse (e.g. a = 3, b = 2, c = 1).
3
SIMD, SINGLE INSTRUCTION MULTIPLE DATA
Pluralitas non est ponenda sine neccesitate
William of Occam (1319)
3.1 Introduction
The single instruction, multiple data (SIMD) mode is the simplest method of parallelism and is now becoming the most common. In most cases this SIMD
mode means the same as vectorization. Ten years ago, vector computers were
expensive but reasonably simple to program. Today, encouraged by multimedia
applications, vector hardware is now commonly available in Intel Pentium III
and Pentium 4 PCs, and Apple/Motorola G-4 machines. In this chapter, we
will cover both old and new and find that the old paradigms for programming
were simpler because CMOS or ECL memories permitted easy non-unit stride
memory access. Most of the ideas are the same, so the simpler programming
methodology makes it easy to understand the concepts. As PC and Mac com-
pilers improve, perhaps automatic vectorization will become as effective as on
the older non-cache machines. In the meantime, on PCs and Macs we will often
need to use intrinsics ([23, 22, 51]).
It seems at first that the intrinsics keep a programmer close to the hardware,
which is not a bad thing, but this is somewhat misleading. Hardware control
in this method of programming is only indirect. Actual register assignments
are made by the compiler and may not be quite what the programmer wants.
The SSE2 or Altivec programming serves to illustrate a form of instruction level
parallelism we wish to emphasize. This form, SIMD or vectorization, has single
instructions which operate on multiple data. There are variants on this theme
which use templates or macros which consist of multiple instructions carefully
scheduled to accomplish the same objective, but are not strictly speaking SIMD,
for example see Section 1.2.2.1. Intrinsics are C macros which contain one or
more SIMD instructions to execute certain operations on multiple data, usually
4-words/time in our case. Data are explicitly declared __m128 datatypes in the
Intel SSE case and vector variables using the G-4 Altivec. Our examples will
show you how this works.
Five basic concepts are important:
• memory access
• data dependencies: Sections 3.2, 3.2.2, 3.5.1, and 3.5.2
• pipelining and unrolling loops: Sections 3.2, 3.2.1, and 1.2.2
• branch execution: Section 3.2.8
• reduction operations: Section 3.3
Consistent with our notion that examples are the best way to learn, several
will be illustrated:
• from linear algebra, the Level 1 basic linear algebra subprograms (BLAS)
— vector updates (-axpy)
— reduction operations and linear searches
• recurrence formulae and polynomial evaluations
• uniform random number generation
• FFT.
First we review the notation and conventions of Chapter 1, p. 11. Generic seg-
ments of size V L data are accumulated and processed in multi-word registers.
Some typical operations are denoted:
    V₁ ← M :        loads VL data from memory M into register V₁,
    M ← V₁ :        stores contents of register V₁ into memory M,
    V₃ ← V₁ + V₂ :  adds contents of V₁ and V₂ and stores the results into V₃, and
    V₃ ← V₁ ∗ V₂ :  multiplies contents of V₁ by V₂, and stores these into V₃.
3.2 Data dependencies and loop unrolling
We first illustrate some basic notions of data independence. Look at the fol-
lowing loop in which f(x) is some arbitrary function (see Section 3.5 to see more
such recurrences),
for(i=m;i<n;i++){
x[i]=f(x[i-k]);
}
If the order of execution follows C rules, x[i-k] must be computed before x[i]
when k > 0. We will see that the maximum number that may be computed in
parallel is
number of x[i]’s computed in parallel ≤ k.
For example, for k = 2 the order of execution goes
x[m ]=f(x[m-2]);
x[m+1]=f(x[m-1]);
x[m+2]=f(x[m ]); /* x[m] is new data! */
x[m+3]=f(x[m+1]); /* likewise x[m+1] */
x[m+4]=f(x[m+2]); /* etc. */
. . .
By line three, x[m] must have been properly updated from the first line or x[m]
will be an old value, not the updated value C language rules prescribe. This
data dependency in the loop is what makes it complicated for a compiler or
the programmer to vectorize. This would not be the case if the output values
were stored into a different array y, for example:
for(i=m;i<n;i++){
y[i]=f(x[i-k]);
}
There is no overlap between the values read in segments (groups) and those stored in that situation. Since none of the values of x would be modified in the loop, either we
or the compiler can unroll the loop in any way desired. If n −m were divisible
by 4, consider unrolling the loop above with a dependency into groups of 2,
for(i=m;i<n;i+=2){
x[i ]=f(x[i-k ]);
x[i+1]=f(x[i-k+1]);
}
or groups of 4,
for(i=m;i<n;i+=4){
x[i ]=f(x[i-k ]);
x[i+1]=f(x[i-k+1]);
x[i+2]=f(x[i-k+2]);
x[i+3]=f(x[i-k+3]);
}
If k = 2, unrolling the loop to a depth of 2 to do two elements at a time would
have no read/write overlap. For the same value of k, however, unrolling to depth
4 would not permit doing all 4 values independently of each other: The first pair
must be finished before the second can proceed.
Conversely, according to the C ordering rules,
for(i=m;i<n;i++){
x[i]=f(x[i+k]);
}
when k > 0 all n − m could be computed as vectors as long as the sequential i
ordering is preserved. Let us see what sort of instructions a vectorizing compiler
would generate to do V L operands.
    V₁ ← [x_{m+k}, x_{m+k+1}, x_{m+k+2}, . . .]    // VL of these
    V₂ ← f(V₁)                                     // vector of results
    [x_m, x_{m+1}, . . .] ← V₂                     // store VL results
Registers V_r are purely symbolic and may be one of
(1) V_r is a vector register (Cray, NEC, Fujitsu); or Motorola/Apple G-4 Altivec register; SSE2 register, Appendix B; AMD 3DNow technology [1]; or
(2) V_r is a collection of registers, say R₁, R₇, R₃, R₅, . . . , where each R_j stores
only one word. This would be the type of optimization a compiler would
generate for superscalar chips. The term superscalar applies to single
word register machines which permit concurrent execution of instruc-
tions. Hence, several single word registers can be used to form multi-word
operations, see Section 1.2.2.1. Although this mode is instruction level
parallelism, it is not SIMD because multiple data are processed each by
individual instructions.
Although the segments [x_{m+k}, x_{m+k+1}, . . .] and [x_m, x_{m+1}, . . .] overlap, V₁ has a copy of the old data, so no x[i] is ever written onto before its value is copied,
then used. At issue are the C (or Fortran) execution order rules of the expres-
sions. Such recurrences and short vector dependencies will be discussed further
in Section 3.5. In sequential fashion, for the pair
x[1]=f(x[0]);
x[2]=f(x[1]); /* x[1] must be new data */
the value of x[1] must be computed in the first expression before its new value
is used in the second. Its old value is lost forever. Whereas, in
x[0]=f(x[1]); /* x[1] is old data */
x[1]=f(x[2]); /* x[1] may now be clobbered */
the old value of x[1] has already been copied into a register and used, so what
happens to it afterward no longer matters.
An example. In the applications provided in Chapter 2, we pointed out
that the lagged Fibonacci sequence (2.60)
    x_n = (x_{n−p} + x_{n−q}) mod 1.0
can be SIMD parallelized with a vector length V L ≤ min(p, q). According to
the discussion given there, the larger min(p, q) is, the better the quality of the
generator. On p. 61, we showed code which uses a circular buffer (buff) of
length max(p, q) to compute this long recurrence formula. In that case, better
performance and higher quality go together.
The basic procedure of unrolling loops is the same: A loop is broken into
segments. For a loop of n iterations, one unrolls the loop into segments of length
V L by
n = q · V L +r
where q = the number of full segments, and 0 ≤ r < V L is a residual. The total
number of segments processed is q if r = 0, otherwise q + 1 when r > 0. To
return to a basic example, the loop
for(i=p;i<n;i++){
y[i]=f(x[i]);
}
may be unrolled into groups of 4. Let p = n mod 4; either we programmers, or
better, the compiler, unrolls this loop into:
y[0]=f(x[0]); ... y[p-1]=f(x[p-1]); /* p<4 */
for(i=p;i<n;i+=4){
y[i ]=f(x[i]);
y[i+1]=f(x[i+1]);
y[i+2]=f(x[i+2]);
y[i+3]=f(x[i+3]);
}
These groups (vectors) are four at a time. The generalization to say m at a time
is obvious. It must be remarked that sometimes the residual segment is processed
first, but sometimes after all the q full V L segments. In either case, why would
we do this, or why would a compiler do it? The following section explains.
3.2.1 Pipelining and segmentation
The following analysis, and its subsequent generalization in Section 3.2.3, is
very simplified. Where it may differ significantly from real machines is where
there is out-of-order execution, for example, on Intel Pentium III and Pentium 4
machines. On out-of-order execution machines, a general analysis of speedup
becomes difficult. The terminology “pipeline” is easy enough to understand:
One imagines a small pipe to be filled with balls of slightly smaller diameter.
The overhead (same as latency in this case) is how many balls will fit into the pipe
before it is full, Figure 3.1. Subsequently, pushing more balls into the pipe causes
balls to emerge from the other end at the same rate they are pushed in. Today
nearly all functional units are pipelined, the idea apparently having originated
in 1962 with the University of Manchester’s Atlas project [91]. Hardware rarely
does any single operation in one clock tick. For example, imagine we wish to
multiply an array of integers A times B to get array C: for i ≥ 0, C_i = A_i ∗ B_i.
This is sometimes called a Hadamard or element by element product. A special
notation is used in MatLab for these products: A.*B. We indicate the successive
bytes of each element by numbering them 3,2,1,0, that is, little-endian,
    A_i = [A_{i3}, A_{i2}, A_{i1}, A_{i0}],
    B_i = [B_{i3}, B_{i2}, B_{i1}, B_{i0}],
    C_i = [C_{i3}, C_{i2}, C_{i1}, C_{i0}].
Byte numbered j must wait until byte j − 1 is computed via

    C_{ij} = A_{ij} ∗ B_{ij} + carry from A_{i,j−1} ∗ B_{i,j−1}.        (3.1)

When stage j is done with A_{i,j} ∗ B_{i,j}, it (stage j) can be used for A_{i+1,j} ∗ B_{i+1,j}, until reaching the last i. After four cycles the pipeline is full. Look at the time flow in Figure 3.1 to see how it works. Subsequently, we get one result/clock-cycle. It takes 4 clock ticks to do one multiply, while in the pipelined case it takes 4 + n ticks to do n; the speedup is thus

    speedup = 4n / (4 + n),        (3.2)

which for a large number of elements n ≤ VL approaches the pipeline length (= 4 in this case).
[Diagram: successive byte products A_ij ∗ B_ij advancing through the four pipeline stages over time; results C0, C1, C2, . . . emerge once the pipe is full.]
Fig. 3.1. Four-stage multiply pipeline: C = A ∗ B. In this simplified example,
four cycles (the pipeline latency) are needed to fill the pipeline after which
one result is produced per cycle.
3.2.2 More about dependencies, scatter/gather operations
From our examples above, it is hopefully clear that reading data from one portion
of an array and storing updated versions of it back into different locations of the
same array may lead to dependencies which must be resolved by the programmer,
or by the compiler, perhaps with help from the programmer. In the latter case,
certain compiler directives are available. These directives, pragmas in C and
their Fortran (e.g. cdir$) equivalents, treated in more detail in Chapter 4 on
shared memory parallelism, are instructions whose syntax looks like code com-
ments but give guidance to compilers for optimization within a restricted scope.
Look at the simple example in Figure 3.3. One special class of these dependencies
garnered a special name—scatter/gather operations. To illustrate, let index be
an array of integer indices whose values do not exceed the array bounds of our
arrays. In their simplest form, the two operations are in Figure 3.2. It is not hard
to understand the terminology: scatter takes a segment of elements (e.g. the
contiguous x_i above) and scatters them into random locations; gather collects
elements from random locations. In the scatter case, the difficulty for a com-
piler is that it is impossible to know at compile time that all the indices (index)
are unique. If they are unique, all n may be processed in any order, regardless of
how the loop count n is segmented (unrolled). The gather operation has a sim-
ilar problem, and another when w and z overlap, which is frequently the case.
Let us beat this to death so that there is no confusion. Take the case that two
of the indices are not unique:
index[1]=3
index[2]=1
index[3]=1
For n = 3, after the loop in Figure 3.3 is executed, we should have array y with
the values
for(i=0;i<n;i++){
y[index[i]] = x[i]; /* scatter */
w[i] = z[index[i]]; /* gather */
}
Fig. 3.2. Scatter and gather operations.
#pragma ivdep
for(i=0;i<n;i++){
y[index[i]] = x[i];
}
Fig. 3.3. Scatter operation with a directive telling the C compiler to ignore any
apparent vector dependencies. If any two array elements of index have the
same value, there would be such a dependency.
y[1]=x[3]
y[2]=unchanged
y[3]=x[1]
where perhaps y[1] was set to x[2], then reset to x[3]. If all n calculations were done asynchronously, y[1] might end up containing x[2], an incorrect value
according to the C execution rules. What is the unfortunate compiler to do?
Obviously, it has to be conservative and generate only one/time code. There
could be no parallel execution in that case. Not only must the results be those
specified by the one after another C execution rules, but the compiler can-
not assume that all the indices (index) are unique. After all, maybe there is
something subtle going on here that the compiler cannot know about. Begin-
ning in the late 1970s, Cray Research invented compiler directives to help
the frustrated compiler when it is confused by its task. The syntax is shown in
Figure 3.3.
The pragma tells the compiler that indeed Ms. Programmer knows what she
is doing and to ignore the vector dependency. Now everybody is happy: The
programmer gets her vectorized code and the compiler does the right optim-
ization it was designed to do. And as icing on the cake, the resultant code is
approximately portable. The worst that could happen is that another compiler
may complain that it does not understand the pragma and either ignore it, or
(rarely) give up compiling. If done in purely sequential order, the results should
agree with SIMD parallelized results.
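Before asserting the pragma, a programmer can of course verify the claim in a debug build. The helper below is an added sketch (not from the book) that scans index for duplicates in O(n) time with a scratch array; if it returns 0, the scatter loop must not be vectorized blindly.

#include <stdlib.h>
#include <assert.h>

/* check that all entries of index (each in [0, m)) are unique, so that
   '#pragma ivdep' on the scatter loop is actually safe */
static int indices_unique(const int *index, int n, int m)
{
    char *seen = calloc((size_t)m, 1);
    int ok = 1;
    assert(seen != NULL);
    for (int i = 0; i < n && ok; i++) {
        assert(index[i] >= 0 && index[i] < m);
        if (seen[index[i]]) ok = 0;        /* duplicate: scatter order matters */
        else seen[index[i]] = 1;
    }
    free(seen);
    return ok;
}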
In C and Fortran where pointers are permitted, the gather operation in
Figure 3.2 can be problematic, too. Namely if w and z are pointers to a memory
region where gathered data (w) are stored into parts of z, again the com-
piler must be conservative and generate only one/time code. In the event that
out-of-sequence execution would affect the results, the dependencies are called
antidependencies [147]. For example, if a loop index i + 1 were executed before
the i-th, this would be contrary to the order of execution rules. As an asynchron-
ous parallel operation, the order might be arbitrary. There are many compiler
directives to aid compilation; we will illustrate some of them and enumerate the
others.
3.2.3 Cray SV-1 hardware
To better understand these ideas about pipelining and vectors, some illustra-
tion about the hardware should be helpful. Bear in mind that the analysis only
roughly applies to in-order execution with fixed starting overhead (laten-
cies). Out-of-order execution [23] can hide latencies, thus speedups may be
higher than (3.5). Conversely, since memory architectures on Pentiums and
Motorola G-4s have multiple layers of cache, speedups may also be lower than
our analysis. Figure 3.4 shows the block diagram for the Cray SV-1 central
processing unit. This machine has fixed overheads (latencies) for the func-
tional units, although the processing rate may differ when confronted with
[Block diagram: eight 64-word vector registers V0–V7, scalar registers S0–S7, save-scalar registers T00–T77, and a vector mask; pipelined vector and scalar functional units (integer add, shift, logical, population/leading zero, floating point add, multiply, reciprocal); cache and one to three ports to central memory.]
Fig. 3.4. Cray SV-1 CPU diagram.
memory bank conflicts. These conflicts occur when bulk memory is laid
out in banks (typically up to 1024 of these) and when successive requests to
the same bank happen faster than the request/refresh cycle of the bank. The
only parts of interest to us here are the eight 64-word floating point vector
registers and eight 64-bit scalar registers. The eight vector registers, numbered
V₀, . . . , V₇, are shown in the upper middle portion. The pipelines of the arith-
metic units appear in the upper right—add, multiply, reciprocal approximation,
logical.
There are many variants on this theme. NEC SX-4, SX-5, and SX-6 machines
have reconfigurable blocks of vector registers. Cray C-90 has registers with
128, not 64, floating point words and so forth. Intel began with the Pentium
featuring MMX integer SIMD technology. Later, the Pentium III featured sim-
ilar vector registers called XMM (floating point) and again included the integer
MMX hardware. The three letter acronym SSE refers to this technology and
means streaming SIMD extensions; its successor SSE2 includes double pre-
cision (64 bit) operations. Not surprisingly, the “MM” stands for multimedia.
Furthermore, if you, gentle reader, wonder how any of the Jaguar swoosh-
ing windows can happen in Macintosh OS-X, some of it is the power of the
// this may have to be expanded to a vector on some machines
S₁ ← a                          // put a into a scalar register
// operation 1
V₁ ← [y₀, y₁, y₂, . . .]        // read VL of y
// operation 2
V₂ ← [x₀, x₁, x₂, . . .]        // read VL of x
V₃ ← S₁ ∗ V₂                    // a*x
V₄ ← V₁ + V₃                    // y + a*x
// operation 3
[y₀, y₁, . . .] ← V₄            // store updated y's
Fig. 3.5. saxpy operation by SIMD.
Altivec hardware on G-4 that makes this possible. We will explore this further
in Sections 3.2.4 and 3.2.5. The vector registers on the Pentium and G-4 are 4
length 32-bit words, reconfigurable up to 2 doubles (64-bit) or down to 16 1-byte
and 8 2-byte sizes. For our purposes, the four 32-bit configuration is the most
useful.
To explore this idea of pipelining and vector registers further, let us digress
to our old friend the saxpy operation, shown in Figure 3.5, y ← a · x+y. If there
is but one path to memory, three pieces are needed (see Section 3.6 concerning
the first comment in the pseudo-code):
Why are the three parts of operation 2 in Figure 3.5 considered only one operation? Because the functional units used are all independent of each other: as soon as the first word (containing x₀) arrives in V₂ after π_memory cycles, the multiply operation can start, similar to Figure 3.1. Memory is also a pipelined functional unit, see for example [23]. As soon as the first result (a · x₀) of the a · x operation arrives (π_multiply cycles later) in V₃, the add operation with the y elements already in V₁ may begin. Hence, the read of x, multiplication by a, and subsequent add to y may all run concurrently as soon as the respective operations' pipelines are full. Subsequently, (one result)/(clock tick) is the computational rate. Since there is only one port to memory, however, the final results now in V₄ cannot be stored until all of V₄ is filled. This is because memory is busy streaming data into V₂ and is not available to stream the results into V₄. Thus, with only one port to memory, the processing rate is approximately (1 saxpy result)/(3 clock ticks).
However, if, as in Figure 3.4, there are three ports to memory, the three operations of Figure 3.5 collapse into only one: read, read, multiply, add, and
store—all running concurrently [80, 123].
To be more general than the simple speedup formula (3.2), a segmented vector timing formula can be written for a segment length n ≤ VL, where for each pipe of length π_i it takes π_i + n cycles to do n operand pairs. With the
notation,

    n = number of independent data,
    π_i = pipeline overhead for linked operation i,
    α = number of linked operations,
    R_vector = processing rate (data/clock period),

we get the following processing rate, which applies when π_memory is not longer than the computation,

    R_vector = n / Σ_{i=1}^{α} (π_i + n).                          (3.3)

In one/time mode, the number of cycles to compute one result is

    1/R_scalar = Σ_{i=1}^{α} π_i.

The speedup, comparing the vector (R_vector) to scalar (R_scalar) processing rates, is

    speedup = R_vector / R_scalar = n Σ_{i=1}^{α} π_i / Σ_{i=1}^{α} (π_i + n).        (3.4)
If n ≤ V L is small, the speedup ratio is roughly 1; if n is large compared to the
overheads π_i, however, this ratio becomes (as n gets large)

    speedup ≈ (Σ_{i=1}^{α} π_i) / α = π̄,
that is, the average pipeline overhead (latency). It may seem curious at first,
but the larger the overhead, the higher the speedup is. Typically, these startup
latency counts are 3 to 15 cycles for arithmetic operations, which is similar
to π_memory when CMOS or ECL memory is used. On machines with cache memories, it may be hard to determine a priori the length of π_memory, the memory pipeline, because of multiple cache levels (up to three). In that case,
the next Section 3.2.4 applies. Hardware reference manuals and experimentation
are the ultimate sources of information on these quantities. The number of linked
operations is usually much easier to determine:
The number α of linked operations is determined by the number of independent
functional units or the number of registers. When an instruction sequence exhausts
the number of resources, α must be increased. Such a resource might be the number
of memory ports or the number of add or multiply units. No linked operation can
exist if the number of functional units of one category is exceeded. For example,
two multiplies if there is only one multiply unit mean two separate operations.
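To make the formulas concrete, the small C function below (an added illustration with assumed example overheads) evaluates the speedup (3.4) for a set of pipe overheads π_i; with π = {3, 4, 5} it approaches the average overhead of 4 as n grows, as predicted by the large-n limit above.

#include <stdio.h>

/* speedup (3.4): n * sum(pi_i) / sum(pi_i + n) over alpha linked operations */
static double speedup(const int *pi, int alpha, int n)
{
    double sum_pi = 0.0, sum_pin = 0.0;
    for (int i = 0; i < alpha; i++) {
        sum_pi  += pi[i];
        sum_pin += pi[i] + n;
    }
    return n * sum_pi / sum_pin;
}

int main(void)
{
    int pi[] = {3, 4, 5};                     /* assumed pipeline overheads */
    for (int n = 4; n <= 64; n *= 2)
        printf("n = %2d   speedup = %.2f\n", n, speedup(pi, 3, n));
    return 0;                                 /* tends to (3+4+5)/3 = 4     */
}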
3.2.4 Long memory latencies and short vector lengths
On systems with cache, the most important latency to be reckoned with is waiting
to fetch data from bulk memory. Typical memory latencies may be 10 times that
of a relatively simple computation, for example saxpy. This can be longer than
the inner portion of a loop. Thus, unrolling loops by using the vector hardware
to compute
B = f(A)
takes on the forms shown in Figure 3.6. There are two variants which loop through the computation 4/time. In the first, without any prefetch, the vector register V₀ (of size four here) is loaded with A_i, A_{i+1}, A_{i+2}, A_{i+3} and, after the time needed for these data to arrive, f(A) for these four elements can be computed. These data are then stored in a segment (B_i, B_{i+1}, B_{i+2}, B_{i+3}) of B. In the second variant, A_i, A_{i+1}, A_{i+2}, A_{i+3} are prefetched into V₀ before entering the loop, these data are saved in V₁ as the first instruction of the loop, and then the next segment fetch (A_{i+4}, A_{i+5}, A_{i+6}, A_{i+7}) is begun before the i, i+1, i+2, i+3 elements of f(A) are computed. The advantage of this prefetching strategy is that part of the memory latency may be hidden behind the computation of f for the VL = 4 elements of the current segment. That is, instead of a delay of T_wait memory, the latency is reduced to T_wait memory − T_f − T_j, where T_f is the time for the VL = 4 computations of f() for the current segment, and T_j is the cost of the jump back to the beginning of the loop. It is hard to imagine how to prefetch more than one segment ahead, that is, more than one i, i+1, i+2, i+3 loop count, because the prefetch is pipelined and not readily influenced by software control. How does
Without prefetch:

    for i = 0, n − 1 by 4 {
        V₀ ← A_i, . . . , A_{i+3}            (wait memory)
        V₁ ← f(V₀)
        B_i, . . . , B_{i+3} ← V₁
    }

With prefetch:

    V₀ ← A₀, . . . , A₃
    for i = 0, n − 5 by 4 {
        V₁ ← V₀
        V₀ ← A_{i+4}, . . . , A_{i+7}        (wait memory) − T_f − T_j
        V₂ ← f(V₁)
        B_i, . . . , B_{i+3} ← V₂
    }
    V₂ ← f(V₀)
    B_{i+4}, . . . , B_{i+7} ← V₂

Fig. 3.6. Long memory latency vector computation, without prefetch (top) and with prefetch (bottom): T_f is the time for function f and T_j is the cost of the jump to the top of the loop.
one save data that have not yet arrived? It is possible to unroll the loop to do
2 · V L or 3 · V L, etc., at a time in some situations, but not more than one loop
iteration ahead. However, since higher levels of cache typically have larger block
sizes (see Table 1.1) than L1, fetching A_i, A_{i+1}, A_{i+2}, A_{i+3} into L1 also brings A_{i+4}, A_{i+5}, A_{i+6}, . . . , A_{i+L2B} into L2, where L2B is the L2 cache block size. This
foreshortens the next memory fetch latency for successive V L = 4 segments. The
result is, in effect, the same as aligning four templates as in Figures 1.9, 1.10
from Chapter 1. In both cases of Figure 3.6, we get the following speedup from
unrolling the loop into segments of size V L:
    speedup = R_vector / R_scalar ≤ VL.        (3.5)
In summary, then, on SSE2 and Altivec hardware, one can expect speedup ≤ 4 (single precision) or speedup ≤ 2 (double precision). Experimentally, these inequalities are fairly sharp; namely, speedups of 4 (or 2 in double) are often closely reached. To be clear: unrolling with this mechanism is powerful because of the principle of data locality from Chapter 1, Section 1.1. Namely, once datum A_i is fetched, the other data A_{i+1}, A_{i+2}, A_{i+3} are in the same cacheline and come almost gratis.
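A scalar C rendering of the prefetch variant of Figure 3.6 might look as follows. This is an added sketch: __builtin_prefetch is a GCC/Clang extension assumed here (other compilers can simply drop the hint), and expf stands in for an arbitrary f; as the text notes, the hardware prefetch itself is not under portable software control.

#include <math.h>

/* B = f(A) unrolled by VL = 4, with a software prefetch of the next segment */
void map_f(const float *A, float *B, int n)
{
    int i;
    for (i = 0; i + 7 < n; i += 4) {
#if defined(__GNUC__)
        __builtin_prefetch(&A[i + 4], 0, 1);   /* hint: next segment A[i+4..i+7] */
#endif
        B[i]     = expf(A[i]);
        B[i + 1] = expf(A[i + 1]);
        B[i + 2] = expf(A[i + 2]);
        B[i + 3] = expf(A[i + 3]);
    }
    for (; i < n; i++)                         /* residual segment, r = n mod 4 */
        B[i] = expf(A[i]);
}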
3.2.5 Pentium 4 and Motorola G-4 architectures
In this book, we can only give a very superficial overview of the architectural
features of both Intel Pentium 4 and Motorola G-4 chips. This overview will be
limited to those features which pertain to the SIMD hardware available (SSE on
the Pentium, and Altivec on G-4). In this regard, we will rely on diagrams to
give a flavor of the appropriate features. Both machines permit modes of out-
of-order execution. On the Pentium, the out of order core can actually reorder
instructions, whereas on the G-4 this does not happen. Instructions may issue
before others are completed, however, thus permitting arithmetic pipeline filling
before all the results are stored, as in Figures 3.7 and 3.8. For our purposes,
this overlap of partially completed instructions is enough to outline the SIMD
behavior.
3.2.6 Pentium 4 architecture
Instruction execution on the Pentium 4 (and on the Pentium III, it must be
remarked) is deeply pipelined. On a Pentium 4, an instruction cache (Trace
cache) of 12k µops (micro-operations) provides instructions to an out-of-order
core (previously Dispatch/execute unit on Pentium III) from a pool of machine
instructions which may be reordered according to available resources. The gen-
eral case of reordering is beyond the scope of this text, so we will only be
concerned with instruction issues which proceed before previous instructions
have finished. See our example in Figures 3.7 and 3.8. Figure 3.9 shows a block
diagram of the instruction pipeline. The out of order core has four ports to
[Diagram, two panels: in-order execution versus pipelining with out-of-order execution; byte products A_ij ∗ B_ij from instructions one and two fill the four-stage pipeline, with results C0, C1, C2, . . . emerging.]
Fig. 3.7. Four-stage multiply pipeline: C = A ∗ B with out-of-order instruc-
tion issue. The left side of the figure shows in-order pipelined execution for
V L = 4. On the right, pipelined execution (out-of-order) shows that the next
segment of V L = 4 can begin filling the arithmetic pipeline without flushing,
see Figures 3.1 and 3.3.
various functional units—integer, floating point, logical, etc. These are shown in
Figure 3.10. In brief, then, the various parts of the NetBurst architecture are:
1. Out-of-order core can dispatch up to 6 µops/cycle. There are four ports
(0,1,2,3) to the execution core: see Figure 3.10.
2. Retirement unit receives results from the execution unit and updates
the system state in proper program order.
3. Execution trace cache is the primary instruction cache. This cache can
hold up to 12k µops and can deliver up to 3 µops/cycle.
[Diagram: in-order versus out-of-order execution of pipelined arithmetic over time; instructions 1, 2, 3, . . . each produce results C0–C3, C4–C7, C8–C11, . . . , overlapping in the out-of-order case.]
Fig. 3.8. Another way of looking at Figure 3.7: since instruction (2) may begin
before (1) is finished, (2) can start filling the pipeline as soon as the first
result from instruction (1) C
0
emerges. In consequence, there is an overlap
of instruction execution. In this instance, out-of-order means that (2), sub-
sequently (3) and (4), can begin before earlier issues have completed all their
results.
4. Branch prediction permits beginning execution of instructions before
the actual branch information is available. There is a delay for failed pre-
diction: typically this delay is at least the pipeline length π_i as in (3.5)
plus a penalty for any instruction cache flushing. A history of previous
branches taken is kept to predict the next branch. This history is updated
and the order of execution adjusted accordingly, see Figure 3.9. Intel’s
branch prediction algorithm is heuristic but they claim it is more than
95 percent correct.
Instruction fetches (see Fetch/decode unit in Figure 3.9) are always in-order
and likewise retirement and result storage. In between, however, the instructions
are in a pool of concurrent instructions and may be reordered by the Execu-
tion unit. Other important architectural features are the register sets. On the
Pentium 4, there are
1. Eight scalar floating point registers (80 bits).
2. Eight 128-bit floating point vector registers. Each may be partitioned in
various ways: 16 bytes/register, 4 single precision 32-bit words, or 2 double
precision 64-bit words, for example. Our principal interest here will be in
the 32-bit single precision partitioning. These registers are called XMM
registers, and are numbered XMM0, XMM1, . . . , XMM7.
[Block diagram components: system bus and bus unit; L1 cache (4-way assoc.); L2 cache (8-way assoc.); L3 cache (server only); front end with fetch/decode, trace cache and microcode ROM, BTBs/branch prediction and branch history update; out-of-order execution unit; retirement unit; light and heavy data traffic paths.]
Fig. 3.9. Block diagram of Intel Pentium 4 pipelined instruction execution
unit. Intel calls this system the NetBurst Micro-architecture. Loads may be
speculative, but stores are in-order. The trace cache replaced the older P-3
instruction cache and has a 12k µops capacity, see: [23].
[Port structure: Port 0 has a double-speed ALU 0 (add, subtract, logical) plus FP move/FP store data/branches; Port 1 has a double-speed ALU 1 (add, subtract), a normal-speed integer unit (shift), and FP execute (FP add, multiply, divide, SIMD integer, SIMD XMM); Port 2 handles memory load (load and prefetch); Port 3 handles memory store (store address).]
Fig. 3.10. Port structure of Intel Pentium 4 out-of-order instruction core. Port 1
handles the SIMD instructions, see: [23].
3. Eight 64-bit integer vector registers. These are the integer counterpart to
the XMM registers, called MMX, and numbered MM0, MM1, . . ., MM7.
Originally, these labels meant “multimedia,” but because of their great
flexibility Intel no longer refers to them in this way.
3.2.7 Motorola G-4 architecture
The joint Motorola/IBM/Apple G-4 chip is a sophisticated RISC processor. Its
instruction issue is always in-order, but instruction execution may begin prior to
the completion of earlier ones. For our examples here, it is similar to the more
deeply pipelined Intel Pentium 4 chip. The G-4 chip has a rich set of registers
(see Figure 3.11):
1. Thirty-two general purpose registers (GPRs), of 64-bit length.
2. Thirty-two floating point registers (FPRs), of 64-bit length. Floating point
arithmetic is generally done in 80-bit precision on G-4 [48], [25].
[Block diagram components: instruction fetch with branch unit and BHT/BTIC; dispatch unit and sequencer/completion unit; separate AltiVec, GPR, and FPR issue queues; integer units CFX and SFX0–SFX2 with GPRs and rename buffers; LSU; FPU with FPRs and rename buffers; AltiVec engine (float, complex, simple, permute) with VRs and rename buffers; 32 kb instruction and data caches; unified 256 kb L2 cache with L3 cache tag control; system interface unit to the MPX system bus and off-chip L3 cache.]
Fig. 3.11. High level overview of the Motorola G-4 structure, including the
Altivec technology, see: [29].
3. Thirty-two Altivec registers, each of 128-bit length. These may be parti-
tioned in various ways: 16 bytes/register, 4 single precision 32-bit floating
point words/register, or 2 double precision 64-bit words/register, for
example.
3.2.8 Branching and conditional execution
Using the SIMD paradigm for loops with if statements or other forms of branch-
ing is tricky. Branch prediction has become an important part of hardware design
(e.g. see [23] and [71]). It is impossible, almost by definition, to compute if’s
or branches in SIMD efficiently. Indeed, a conditional execution with even two
choices (f or g) cannot be done directly with one instruction on multiple data.
Look at the following example.
for(i=0;i<n;i++){
if(e(a[i])>0.0){
c[i]=f(a[i]);
} else {
c[i]=g(a[i]);
}
}
Branch condition e(x) is usually simple but is likely to take a few cycles.
The closest we can come to vectorizing this is to execute either both f and
g and merge the results, or alternatively parts of both. Clearly, there are
problems here:
• if one of the f or g is very expensive, why do both or even parts of both?
• one of f(a [i]) or g(a [i]) may not even be defined.
Regardless of possible inefficiencies, Figure 3.12 shows one possible selection
process.
If e(a_i) > 0, the corresponding mask bit ‘i’ is set, otherwise it is not set. The selection chooses f(a_i) if the ith bit is set, otherwise g(a_i) is chosen.
    V₀ ← [a₀, a₁, . . .]
    V₇ ← e(V₀)
    VM ← (V₇ > 0?)          // forms a mask if e(a_i) > 0
    V₁ ← f(V₀)              // compute f(a)
    V₂ ← g(V₀)              // compute g(a)
    V₃ ← VM ? V₁ : V₂       // selects f if VM bit set
Fig. 3.12. Branch processing by merging results.
The following is another example of such a branching, but is efficient in
this case,
d[i]=(e(a[i])>0.0)?b[i]:c[i];
because b_i and c_i require no calculation—only selection. An alternative to
possible inefficient procedures is to use scatter/gather:
npicka=0; npickb=0;
for(i=0;i<n;i++){
if(e(a[i])>0.0){
picka[npicka++]=i;
} else {
pickb[npickb++]=i;
}
}
for(k=0;k<npicka;k++) c[picka[k]]=f(a[picka[k]]);
for(k=0;k<npickb;k++) c[pickb[k]]=g(a[pickb[k]]);
Either way, you will probably have to work to optimize (vectorize) such branching
loops. In the example of isamax later in Section 3.5.7, we will see how we effect
a merge operation on Intel Pentium’s SSE hardware.
Branch prediction is now a staple feature of modern hardware. A simplified
illustration can be shown when the operations, f(x), g(x), in the computation of
the loop on p. 102 take multiple steps. Imagine that at least two operations are
necessary to compute these functions:
f(x) = f
2
(f
1
(x)),
g(x) = g
2
(g
1
(x)).
If e(x) > 0 is more probable, the following branch prediction sequence (Figure 3.13) will be efficient (see p. 86 for register conventions). Conversely, what if the branch e(x) ≤ 0 were more probable? In that case, a more efficient branch prediction would be Figure 3.14. Since the computation of f₁ in Figure 3.13 (or g₁(x) in the case shown in Figure 3.14) is a pipelined operation, it may run concurrently with the steps needed to determine the if(e(R₀) > 0) test. In Figure 3.13, if it turns out R₇ ≤ 0, the prediction fails and there will be penalties, obviously at least the cost of computing R₁ = f₁(R₀) because it is not used. Instruction reordering is done for optimization, so a missed branch prediction also forces an instruction cache flush—roughly 10 clock cycles on Pentium III and 20 cycles on Pentium 4. As many as 126 instructions can be in the instruction pipe on P-4. In Figure 3.14, if R₇ > 0 the prediction likewise fails and penalties will be exacted. Regardless of these penalties, branch prediction failure penalties are almost always far less costly than the merging results procedure illustrated in Figure 3.12.
Now imagine that the hardware keeps a history which records the last selection of this branch. If it turns out that R₇ ≤ 0 seems more likely, according
    R₀ ← x
    R₇ ← e(x)
    R₁ ← f₁(R₀)           // compute part of f(x)
    if (R₇ > 0) {
        R₃ ← f₂(R₁)       // predicted f if e(x) > 0
    } else {
        R₄ ← g₁(R₀)       // picks g otherwise
        R₃ ← g₂(R₄)
    }
Fig. 3.13. Branch prediction best when e(x) > 0.
    R₀ ← x
    R₇ ← e(x)
    R₁ ← g₁(R₀)           // compute part of g(x)
    if (R₇ > 0) {
        R₄ ← f₁(R₀)       // picks f if e(x) > 0
        R₃ ← f₂(R₄)
    } else {
        R₃ ← g₂(R₁)       // predicted g
    }
Fig. 3.14. Branch prediction best if e(x) ≤ 0.
to that history, then the second sequence Figure 3.14 should be used. With
out-of-order execution possible, the instruction sequence could be reordered
such that the g(x) path is the most likely branch. With some examination,
it is not hard to convince yourself that the second expression (Figure 3.14)
is just a reordering of the first (Figure 3.13); particularly if the registers can
be renamed. The Pentium 4 allows both reordering (see Figure 3.9) and renam-
ing of registers [23, 132]. Likewise, Power-PC (Motorola G-4) allows renaming
registers [132].
Thus, in loops, the history of the previously chosen branches is used to pre-
dict which branch to choose the next time, usually when the loop is unrolled (see
Section 3.2). Solid advice is given on p. 12 of reference [23] about branch pre-
diction and elimination of branches, when the latter is possible. A great deal of
thought has been given to robust hardware determination of branch prediction:
again, see [23] or [71]. The isamax0 example given on p. 125 is more appropriate
for our SIMD discussion and the decision about which branch to choose is done
by the programmer, not by hardware.
3.3 Reduction operations, searching
The astute reader will have noticed that only vector → vector operations have
been discussed above. Operations which reduce the number of elements of oper-
ands to a scalar or an index of a scalar are usually not vector operations in the
same sense. In SIMD, shared memory parallelism, and distributed memory par-
allelism, reduction operations can be painful. For example, unless a machine has
special purpose hardware for an inner product, that operation requires clever
programming to be efficient. Let us consider the usual inner product operation.
The calculation is
    dot = (x, y) = Σ_{i=1}^{n} x_i y_i,
where the result dot is a simple scalar element. It is easy enough to do a partial
reduction to V L elements. Assume for simplicity that the number n of elements
is a multiple of V L (the segment length): n = q · V L,
    V₀ ← [0, 0, 0, . . .]                           // initialize V₀ to zero
    // loop over segments: m += VL each time
    V₁ ← [x_m, x_{m+1}, x_{m+2}, . . .]             // read VL of x
    V₂ ← [y_m, y_{m+1}, y_{m+2}, . . .]             // read VL of y
    V₃ ← V₁ ∗ V₂                                    // segment of x*y
    V₀ ← V₀ + V₃                                    // accumulate partial result
which yields a segment of length VL of partial sums in V₀:

    partial_i = Σ_{k=0}^{q−1} x_{i+k·VL} · y_{i+k·VL},

which must be further reduced to get the final result,

    dot = Σ_{i=1}^{VL} partial_i.
Not surprisingly, the last reduction of V L partial sums may be expensive and
given all the special cases which have to be treated (e.g. n ≠ q · VL), it is likely
less efficient than using pure vector operations. Fortunately for the programmer,
many compilers look for such obvious constructions and insert optimized routines
to do reduction operations such as inner products, for example, the BLAS routine
sdot. Obviously, no matter how clever the compiler is, there will be reduction
operations it cannot recognize and the resultant code will not parallelize well.
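In plain C the partial-reduction idea looks like the sketch below (an added, unit-stride illustration): the four scalar accumulators play the role of one VL = 4 register of partial sums, and the final reduction of the partials is the scalar tail discussed above.

/* inner product with VL = 4 partial sums, unit stride */
float sdot4(int n, const float *x, const float *y)
{
    float p0 = 0.0f, p1 = 0.0f, p2 = 0.0f, p3 = 0.0f;
    float dot;
    int i, q = n - n % 4;
    for (i = 0; i < q; i += 4) {           /* segments of length VL = 4   */
        p0 += x[i]     * y[i];
        p1 += x[i + 1] * y[i + 1];
        p2 += x[i + 2] * y[i + 2];
        p3 += x[i + 3] * y[i + 3];
    }
    dot = (p0 + p1) + (p2 + p3);           /* reduce the VL partial sums  */
    for (; i < n; i++)                     /* residual, r = n mod 4       */
        dot += x[i] * y[i];
    return dot;
}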
Another important operation is a maximum element search. We only illustrate
a unit stride version. The BLAS version of isamax has an arbitrary but fixed
stride, ours is
    isamax0 = inf_i { i : |x_i| ≥ |x_k|, ∀k },
that is, the first index i for which |x_i| ≥ |x_k| for all 1 ≤ k ≤ n. The Linpack benchmark uses sgefa to factor a matrix A = LU by partial pivoting Gaussian
elimination, and requires such a search for the element of maximum absolute
value. In examples later in this chapter, we will show how to use the SSE
(Pentium III and 4) and Altivec (Apple G-4) hardware to efficiently do both
sdot and the isamax operations (see p. 124, 125). Meanwhile we will show that
it is sometimes possible to avoid some reductions.
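For reference, before the SSE and Altivec versions promised above, a plain unit-stride C version of isamax0 is sketched here (an added illustration; it assumes n ≥ 1 and returns a 0-based index).

#include <math.h>

/* index of the first element of maximum absolute value, unit stride */
int isamax0(int n, const float *x)
{
    int imax = 0;
    float xmax = fabsf(x[0]);
    for (int i = 1; i < n; i++) {
        if (fabsf(x[i]) > xmax) {      /* strict '>' keeps the first such index */
            xmax = fabsf(x[i]);
            imax = i;
        }
    }
    return imax;
}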
3.4 Some basic linear algebra examples
We have repeatedly flogged the saxpy operation because it appears in several important tasks. We will now illustrate two: matrix × matrix multiply and
Gaussian elimination. In the first case, we show that a loop interchange can be
helpful.
3.4.1 Matrix multiply
In this example, we want the matrix product C = AB, where matrices A, B, C
are assumed to be square and of size n × n. Here is the usual textbook way
of writing the operation (e.g. [89], section 7.3). We use the C ordering wherein
columns of B are assumed stored consecutively and that successive rows of A
are stored n floating point words apart in memory:
for (i=0;i<n;i++){ /* mxm by dot product */
for (j=0;j<n;j++){
c[i][j]=0.;
for (k=0;k<n;k++){
c[i][j] += a[i][k]*b[k][j];
}
}
}
It is important to observe that the inner loop counter k does not index elements of
the result (c[i][j]): that is, C_{i,j} is a fixed location with respect to the index k.
We could as well write this as
for (i=0;i<n;i++){ /* mxm by expl. dot product */
for (j=0;j<n;j++){
c[i][j]=sdot(n,&a[i][0],1,&b[0][j],n);
}
}
An especially good compiler might recognize the situation and insert a high-
performance sdot routine in place of the inner loop in our first way of expressing
the matrix multiply. Alternatively, we might turn the two inner loops inside out
to get
for (i=0;i<n;i++){ /* mxm by outer product */
for (j=0;j<n;j++){
c[i][j]=0.;
}
for (k=0;k<n;k++){
for (j=0;j<n;j++){
c[i][j] += a[i][k]*b[k][j];
}
}
}
The astute reader will readily observe that we could be more effective by ini-
tializing the i-th row, C_{i,∗}, with the first term a_{i,0} b_{0,∗} rather than zero. However, the
important thing here is that the inner loop is now a saxpy operation, namely,
repeatedly for k → k + 1,
    C_{i,∗} ← C_{i,∗} + a_{i,k} · b_{k,∗} .
Here, the ‘∗’ means all j = 0, . . . , n − 1 elements in the row i. This trick of
inverting loops can be quite effective [80] and is often the essence of shared
memory parallelism. Similar examples can be found in digital filtering [123]. In
a further example, that of Gaussian elimination, the order of the two inner loops
is entirely arbitrary. We will use this arbitrariness to introduce shared memory
parallelism.
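For completeness, here is the outer-product form with its inner loop written as an explicit saxpy call (a sketch; we assume a BLAS-style prototype saxpy(n, alpha, x, incx, y, incy) computing y ← αx + y, and row-major n × n matrices stored flat, so row i of C starts at c + i*n):

void saxpy(int n, float alpha, float *x, int incx, float *y, int incy);

void mxm_saxpy(int n, float *a, float *b, float *c)
{ /* C = A*B with the inner loop as a saxpy on rows of C (C ordering) */
   int i, j, k;
   for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) c[i*n + j] = 0.0f;
      for (k = 0; k < n; k++)
         saxpy(n, a[i*n + k], b + k*n, 1, c + i*n, 1);  /* C(i,*) += a(i,k)*B(k,*) */
   }
}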
3.4.2 SGEFA: The Linpack benchmark
It is certain that the Linpack [37] benchmark is the most famous computer
floating point performance test. Based on J. Dongarra’s version of SGEFA [53]
which appeared in Linpack [38], this test has been run on every computer
supporting a Fortran compiler and in C versions [144] on countless others. The
core operation is an LU factorization
A →PLU
where L is lower triangular with unit diagonals (these are not stored), U is upper
triangular (diagonals stored), and P is a permutation matrix which permits
the factorization to be done in-place but nevertheless using partial pivoting.
Since the permutation P involves row exchanges, only a vector (ip on p. 108)
is necessary for the permutation information. Our version is readily adapted
to shared memory parallelism. That is, either of the inner loops (i, j) may be
chosen for vectorization. In this example, we assume Fortran ordering (column
major order, where elements are stored in sequential words, see Appendix E) to be
consistent with the Linpack benchmark. A reformulation of Gaussian elimination
as matrix–vector and matrix–matrix multiplications was given in Section 2.2.2.1.
Heretofore, we have not discussed two simple BLAS operations: sswap, a simple
vector swap (y ↔x), and sscal, which simply scales a vector (x ←a· x). These
may be fetched from [111]. Here is the classical algorithm:
for 0 ≤ k ≤ n − 1 {
    l ← max_l |a_{lk}|                       // index of the pivot (largest |a_{lk}|, l ≥ k)
    p ← a_{lk}
    ip_k ← l
    if (l ≠ k)  a_{kk} ↔ a_{lk}
    for (i = k + 1, . . . , n − 1)  a_{ik} ← −a_{ik}/p
    if (l ≠ k)  for (j = k + 1, . . . , n − 1)  a_{kj} ↔ a_{lj}
    // a_{kj} independent of i, or a_{ik} indep. of j
    for (k + 1 ≤ i, j ≤ n − 1) {
        a_{ij} ← a_{ij} + a_{ik} · a_{kj}
    }
}
Fig. 3.15. Simple parallel version of SGEFA. (Diagram: the shaded active area of the
reduction spans rows and columns k + 1 to n − 1; l marks the pivot row.)
To hopefully clarify the structure, Figure 3.15 shows a diagram which indicates
that the lower shaded portion (pushed down as k increases) is the active portion
of the reduction. Here is the code for the PLU reduction. Vector ip contains
pivot information: the permutation matrix P is computed from this vector [38].
Macro am is used for Fortran or column major ordering, see Appendix E.
#define am(p,q) (a+p*la+q)
void sgefa(float *a,int la,int n,int *ip,int *info)
{
/* parallel C version of SGEFA: wpp 18/4/2000 */
int j,k,kp1,l,nm1,isamax();
float *pa1,*pa2,t;
void saxpy(),sscal(),sswap();
*info=0;
nm1 =n-1;
for(k=0;k<nm1;k++){
kp1=k+1; pa1=am(k,k);
l =isamax(n-k,pa1,1)+k; ip[k]=l;
if((*am(l,k))==0.0){*info=k; return;}
if(l!=k){
t =*am(k,l);
*am(k,l)=*am(k,k);
*am(k,k)=t;
}
t =-1.0/(*am(k,k)); pa1=am(k,kp1);
sscal(n-k-1,t,pa1,1);
if(l!=k){
pa1=am(kp1,l); pa2=am(kp1,k);
sswap(n-k-1,pa1,la,pa2,la);
}
for(j=kp1;j<n;j++){ /* msaxpy, sger */
pa1=am(k,kp1);
pa2=am(j,kp1);
t =*am(j,k);
saxpy(n-k-1,t,pa1,1,pa2,1);
} /* end msaxpy, sger */
}
}
#undef am
A variant may be effected by replacing the lines between and including msaxpy
and end msaxpy with
msaxpy(n-k-1,n-k-1,am(kp1,k),n,am(k,kp1),am(kp1,kp1));
where msaxpy is shown below (also see Appendix E for a review of column major
ordering). We will review this update again in Chapter 4, Section 4.8.2 but note
in passing that it is equivalent to the Level 2 BLAS routine sger, a rank-1
update.
#define ym(p,q) (y+p+q*n)
void msaxpy(nr,nc,a,n,x,y)
int nr,nc,n;
float *a,*x,*y;
{
/* multiple SAXPY: y(*,j) <- a(j)*x[*]+y(*,j)
wpp 29/01/2003 */
int i,j,ka;
ka=0;
for(j=0;j<nc;j++){
for(i=0;i<nr;i++){
*ym(i,j) += a[ka]*x[i];
}
ka += n;
}
}
#undef ym
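For comparison, the same rank-1 update can be written as a single CBLAS call (a sketch assuming a CBLAS installation providing cblas.h; update_sger is our own wrapper name, and a is stored column-major with leading dimension la as in sgefa above):

#include <cblas.h>

void update_sger(float *a, int la, int n, int k)
{ /* A(k+1:n-1, k+1:n-1) += (multiplier column) * (pivot row) */
   int kp1 = k + 1, m = n - k - 1;
   if (m <= 0) return;
   cblas_sger(CblasColMajor, m, m, 1.0f,
              &a[k*la + kp1],   1,     /* column k, rows k+1..n-1 (multipliers) */
              &a[kp1*la + k],   la,    /* row k, columns k+1..n-1 (pivot row)   */
              &a[kp1*la + kp1], la);   /* trailing submatrix                    */
}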
3.5 Recurrence formulae, polynomial evaluation
In the previous Sections 3.4.1, 3.4.2, we have shown how relatively simple loop
interchanges can facilitate vectorizing loops by avoiding reductions. Usually, how-
ever, we are not so lucky and must use more sophisticated algorithms. We now
show two examples: (1) Polynomial evaluation (P_n(x) = Σ_{k=0}^{n} a_k x^k), and (2)
cyclic reduction for the solution of tridiagonal linear systems.
Polynomials are usually evaluated by Horner’s rule (initially, p^{(0)} = a_n) as a
recurrence formula
    p^{(k)} = a_{n−k} + p^{(k−1)} · x,                              (3.6)
where the result is P_n = p^{(n)}. For a single polynomial of one argument x, Horner’s
rule is strictly recursive and not vectorizable. Instead, for long polynomials, we
evaluate the polynomial by recursive doubling.
3.5.1 Polynomial evaluation
From the recurrence (3.6), it is clear that p^{(k−1)} must be available before the
computation of p^{(k)} by (3.6) begins. The following is not usually an effective way
to compute P_n(x), but may be useful in instances when n is large.
One scheme for computing x, x^2, x^3, x^4, . . . uses recursive doubling. The first step is
    (v_1, v_2)^T ← (x, x^2)^T ,
the next two elements are
    (v_3, v_4)^T ← v_2 · (v_1, v_2)^T ,
and the next four,
    (v_5, v_6, v_7, v_8)^T ← v_4 · (v_1, v_2, v_3, v_4)^T .
The general scheme is
    (v_{2^k+1}, v_{2^k+2}, . . . , v_{2^{k+1}})^T ← v_{2^k} · (v_1, v_2, . . . , v_{2^k})^T .
The polynomial evaluation may now be written as a dot product.
float poly(int n,float *a,float *v,float x)
{
int i,one,id,nd,itd;
fortran float SDOT();
float xi,p;
v[0]=1; v[1] =x; v[2] =x*x;
id =2; one =1; nd =id;
while(nd<n){
itd=(id <= n)?id:(n-nd);
xi =v[id];
#pragma ivdep
for (i=1;i<=itd;i++) v[id+i]=xi*v[i];
id=id+id; nd=nd+itd;
}
nd=n+1;
p =SDOT(&nd,a,&one,v,&one);
return(p);
}
Obviously, n must be fairly large for all this to be efficient. In most instances,
the actual problem at hand will be to evaluate the same polynomial for multiple
arguments. In that case, the problem may be stated as
    for i = 1, . . . , m,   P_n(x_i) = Σ_{k=0}^{n} a_k x_i^k ,
that is, i = 1, . . . , m independent evaluations of P_n(x_i). For this situation, it is
more efficient to again use Horner’s rule, but for m independent x_i, i = 1, . . . , m,
and we get
void multihorner(float *a,int m,int n,float *x,float *p)
{ /* multiple (m) polynomial evaluations */
int i,j;
for (j=0;j<=m;j++){
p[j] =a[n];
}
for (i=n-1;i>=0;i--){
for (j=0;j<=m;j++){
p[j]=p[j]*x[j]+a[i];
}
}
return;
}
3.5.2 A single tridiagonal system
In this example, we illustrate using non-unit stride algorithms to permit paral-
lelism. The system to be solved is Tx = b, where T is a tridiagonal matrix. For
the purpose of illustration we consider a matrix of order n = 7,
        | d_0  f_0                          |
        | e_1  d_1  f_1                     |
        |      e_2  d_2  f_2                |
    T = |           e_3  d_3  f_3           | ,                    (3.7)
        |                e_4  d_4  f_4      |
        |                     e_5  d_5  f_5 |
        |                          e_6  d_6 |
For bookkeeping convenience we define e_0 = 0 and f_{n−1} = 0. If T is diagonally
dominant, that is, if |d_i| > |e_i| + |f_i| holds for all i, then Gaussian elimination
without pivoting provides a stable triangular decomposition,
    T = LU ≡

    | 1                                |   | c_0  f_0                          |
    | l_1  1                           |   |      c_1  f_1                     |
    |      l_2  1                      |   |           c_2  f_2                |
    |           l_3  1                 | · |                c_3  f_3           | .     (3.8)
    |                l_4  1            |   |                     c_4  f_4      |
    |                     l_5  1       |   |                          c_5  f_5 |
    |                          l_6  1  |   |                               c_6 |
Notice that the ones on the diagonal of L imply that the upper off-diagonal
elements of U are equal to those of T. Comparing element-wise the two
representations of T in (3.7) and (3.8) yields the usual recursive algorithm for
determining the unknown entries of L and U [54]. The factorization portion could
be made in-place, that is, the vectors l and c could overwrite e and d such that
there would be no workspace needed. The following version saves the input and
computes l and c as separate arrays, however. After having computed the LU
void tridiag(int n,float *d,float *e,float *f,
float *x,float *b,float *l,float *c)
{
/* solves a single tridiagonal system Tx=b.
   l=lower, c=diagonal of the factorization; note that as used
   here, e[] holds the diagonal of T and d[] its sub-diagonal */
int i;
c[0] =e[0]; /* factorization */
for(i=1;i<n;i++){
l[i]=d[i]/c[i-1];
c[i]=e[i] - l[i]*f[i-1];
}
x[0] =b[0]; /*forward to solve: Ly=b */
for(i=1;i<n;i++){
x[i]=b[i]-l[i]*x[i-1];
}
x[n-1] =x[n-1]/c[n-1]; /* back: Ux=y */
for(i=n-2;i>=0;i--){
x[i]=(x[i]-f[i]*x[i+1])/c[i];
}
}
factorization, the tridiagonal system Tx = LUx = b is solved by first find-
ing y from Ly = b going forward, then computing x from Ux = y by going
backward. By Gaussian elimination, diagonally dominant tridiagonal systems
can be solved stably in as few as 8n floating point operations. The LU fac-
torization is not parallelizable because the computation of c requires that c_{i−1}
is updated before c_i can be computed. Likewise, in the following steps, x_{i−1}
has to have been computed in the previous iteration of the loop before x_i.
Finally, in the back solve x_{i+1} must be computed before x_i. For all three
tasks, in each loop iteration array elements are needed from the previous
iteration.
Because of its importance, many attempts have been made to devise parallel
tridiagonal system solvers [3, 13, 45, 69, 70, 81, 138, 148]. The algorithm that
is closest in operation count to the original LU factorization is the so-called
twisted factorization [145], where the factorization is started at both ends of
the diagonal. In that approach, only two elements can be computed in parallel.
Other approaches provide parallelism but require more work. The best known of
these is cyclic reduction, which we now discuss.
3.5.3 Solving tridiagonal systems by cyclic reduction.
Cyclic reduction is an algorithm for solving tridiagonal systems that is paral-
lelizable albeit at the expense of some redundant work. It was apparently first
presented by Hockney [74]. We show the classical cyclic reduction for diagon-
ally dominant systems. For a cyclic reduction for arbitrary tridiagonal systems
see [5]. Let us return to the system of equations Tx = b with the system matrix
T given in (3.7). We rearrange x and b such that first the even and then the odd
elements appear. In this process, T is permuted to become
         | d_0                          f_0             |
         |       d_2                    e_2   f_2       |
         |             d_4                    e_4   f_4 |
    T′ = |                   d_6                    e_6 | .            (3.9)
         | e_1   f_1                    d_1             |
         |       e_3   f_3                    d_3       |
         |             e_5   f_5                    d_5 |
T′ is still diagonally dominant. It can be factored stably in the form
    T′ =

    | 1                                 |   | d_0                          f_0                |
    |      1                            |   |       d_2                    e_2    f_2         |
    |           1                       |   |             d_4                     e_4    f_4  |
    |                1                  | · |                   d_6                      e_6  | .  (3.10)
    | l_1  m_1            1             |   |                          d′_1  f′_1             |
    |      l_3  m_3            1        |   |                          e′_3  d′_3  f′_3       |
    |           l_5  m_5            1   |   |                                e′_5  d′_5       |
This is usually called a 2×2 block LU factorization. The diagonal blocks differ
in their order by one if T has odd order. Otherwise, they have the same order.
More precisely, the lower left block of L is ⌊n/2⌋ × ⌈n/2⌉. The elements of L and
U are given by

    l_{2k+1} = e_{2k+1} / d_{2k} ,                                    0 ≤ k < ⌊n/2⌋,
    m_{2k+1} = f_{2k+1} / d_{2k+2} ,                                  0 ≤ k < ⌊n/2⌋,
    d′_{2k+1} = d_{2k+1} − l_{2k+1} f_{2k} − m_{2k+1} e_{2k+2} ,      0 ≤ k < ⌊n/2⌋,
    e′_{2k+1} = −l_{2k+1} e_{2k} ,                                    1 ≤ k < ⌊n/2⌋,
    f′_{2k+1} = −m_{2k+1} f_{2k+2} ,                                  0 ≤ k < ⌊n/2⌋ − 1,
where we used the conventions e_0 = 0 and f_{n−1} = 0. It is important to observe
the ranges of k defined by the floor ⌊n/2⌋ and ceiling ⌈n/2⌉ brackets: the greatest
integer less than or equal to n/2 and the least integer greater than or equal to n/2,
respectively (e.g. the standard C functions floor and ceil may be used [84]).
The reduced system in the (2,2) block of U, made up of the elements d′_i, e′_i,
and f′_i, is again a diagonally dominant tridiagonal matrix. Therefore the previous
procedure can be repeated until a 1×1 system of equations remains.
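Written directly from the formulas above, one reduction level might look like the following C sketch (cr_level is our own helper, assuming odd order n so that the reduced order is n/2; the analogous right-hand-side update is omitted):

void cr_level(int n, const double *d, const double *e, const double *f,
              double *lp, double *mp, double *dp, double *ep, double *fp)
{ /* one level of cyclic reduction; e[0] = 0 and f[n-1] = 0 are assumed */
   int k, nh = n/2;                        /* number of odd-indexed unknowns */
   for (k = 0; k < nh; k++) {
      lp[k] = e[2*k+1] / d[2*k];
      mp[k] = f[2*k+1] / d[2*k+2];
      dp[k] = d[2*k+1] - lp[k]*f[2*k] - mp[k]*e[2*k+2];
      ep[k] = (k > 0)      ? -lp[k]*e[2*k]   : 0.0;
      fp[k] = (k < nh - 1) ? -mp[k]*f[2*k+2] : 0.0;
   }
}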
Fig. 3.16. Times for cyclic reduction vs. the recursive procedure of Gaussian
Elimination (GE) on pages 112, 113 (log–log plot of time (s) vs. n = 2^m).
Machine is Athos, a Cray SV-1. Speedup is about five for large n.
Cyclic reduction needs two additional vectors l and m. The reduced matrix
containing d′_i, e′_i, and f′_i can be stored in place of the original matrix T, that is,
in the vectors d, e, and f . This classical way of implementing cyclic reduction
evidently saves memory but it has the disadvantage that the memory is accessed
in large strides. In the first step of cyclic reduction, the indices of the entries
of the reduced system are two apart. Applied recursively, the strides in cyclic
reduction get larger by powers of two. This inevitably means cache misses on
machines with memory hierarchies. Already on the Cray vector machines, large
tridiagonal systems caused memory bank conflicts. Another way of storing the
reduced system is by appending the d′_i, e′_i, and f′_i to the original elements d_i, e_i, and f_i
. In the following C code, cyclic reduction is implemented in this way.
Solving tridiagonal systems of equations by cyclic reduction costs 19n floating
point operations, 10n for the factorization and 9n for forward and backward
substitution. Thus the work redundancy of the cyclic reduction algorithm is about 2.5
because the Gaussian elimination of the previous section only needs 8n flops: see
Figures 3.15, 3.16.
void fcr(int n,double *d,double *e,
double *f,double *x,double *b)
{
int k,kk,i,j,jj,m,nn[21];
/* Initializations */
m=n; nn[0]=n; i=0; kk=0; e[0]=0.0; f[n-1]=0.0;
while (m>1){ /* Gaussian elimination */
k=kk; kk=k+m; nn[++i]=m-m/2; m=m/2;
e[kk]=e[k+1]/d[k];
#pragma ivdep
for (j=0; j<m-1; j++){
jj =2*j+k+2;
f[kk+j] =f[jj-1]/d[jj];
e[kk+j+1]=e[jj+1]/d[jj];
}
if (m != nn[i]) f[kk+m-1]=f[kk-2]/d[kk-1];
#pragma ivdep
for (j=0; j<m; j++){
jj=k+2*j+1;
b[kk+j]=b[jj] - b[jj-1]*e[kk+j]
- b[jj+1]*f[kk+j];
d[kk+j]=d[jj] - f[jj-1]*e[kk+j]
- e[jj+1]*f[kk+j];
f[kk+j]=-f[jj+1]*f[kk+j];
e[kk+j]=-e[jj-1]*e[kk+j];
}
}
x[kk]=b[kk]/d[kk]; /* Back substitution */
while (i>0){
#pragma ivdep
for (j=0; j<m; j++){
jj =k+2*j;
x[jj] =(b[jj] - e[jj]*x[kk+j-1]
- f[jj]*x[kk+j])/d[jj];
x[jj+1]=x[kk+j];
}
if (m != nn[i])
x[kk-1]=(b[kk-1]-e[kk-1]*x[kk+m-1])/d[kk-1];
m=m+nn[i]; kk=k; k -= (nn[--i]+m);
}
}
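A minimal calling sketch (solve_tridiag_cr is our own wrapper name): because each reduction level is appended to the original arrays, d, e, f, b, and x need room for all levels; 2n entries suffice since n + n/2 + n/4 + · · · < 2n.

#include <stdlib.h>

void fcr(int n, double *d, double *e, double *f, double *x, double *b);

void solve_tridiag_cr(int n, const double *diag, const double *sub,
                      const double *sup, const double *rhs, double *sol)
{ /* copy input (fcr overwrites its arguments), call fcr, copy out the solution */
   int i;
   double *d = malloc(2*n*sizeof(double)), *e = malloc(2*n*sizeof(double));
   double *f = malloc(2*n*sizeof(double)), *b = malloc(2*n*sizeof(double));
   double *x = malloc(2*n*sizeof(double));
   for (i = 0; i < n; i++) {
      d[i] = diag[i]; e[i] = sub[i]; f[i] = sup[i]; b[i] = rhs[i];
   }
   fcr(n, d, e, f, x, b);
   for (i = 0; i < n; i++) sol[i] = x[i];
   free(d); free(e); free(f); free(b); free(x);
}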
A superior parallelization is effected in the case of multiple right-hand sides.
Namely, if X and B consist of L columns,
    T X = B ,       (X and B each with L columns)
then we can use the recursive method on pages 112, 113. In this situation, there is
an inner loop which counts the L columns of X: the systems x_i, b_i, i = 0, . . . , L − 1.
Again, the xm and bm macros are for Fortran ordering.
#define xm(p,q) *(x+q+p*n)
#define bm(p,q) *(b+q+p*n)
void multidiag(int n,int L,float *d,float *e,
float *f,float *m,float *u,float *x,float *b)
{
/* solves tridiagonal system with multiple
right hand sides: one T matrix, L solutions X */
int i,j;
u[0]=e[0];
for(j=0;j<L;j++) xm(0,j)=bm(0,j);
for(i=1;i<n;i++){
m[i]=d[i]/u[i-1];
u[i]=e[i] - m[i]*f[i-1];
for(j=0;j<L;j++){
xm(i,j)=bm(i,j) - m[i]*xm(i-1,j);
}
}
for(j=0;j<L;j++) xm(n-1,j)=xm(n-1,j)/u[n-1];
for(i=n-2;i>=0;i--){
for(j=0;j<L;j++){
xm(i,j)=(xm(i,j)-f[i]*xm(i+1,j))/u[i];
}
}
}
#undef xm
#undef bm
3.5.4 Another example of non-unit strides to achieve parallelism
From the above tridiagonal system solution, we saw that doubling strides may
be used to achieve SIMD parallelism. On most machines, it is less efficient to
use some non-unit strides: for example, there may be memory bank conflicts
on cacheless machines [123], or cache misses for memory systems with cache.
Memory bank conflicts may also occur in cache memory architectures and result
from the organization of memory into banks wherein successive elements are
stored in successive banks. For example, a[i] and a[i+1] will be stored in
successive banks (or columns). Upon accessing these data, a bank requires a
certain time to read (or store) them and refresh itself for subsequent reads (or
stores). Requesting data from the same bank successively forces a delay until
the read/refresh cycle is completed. Memory hardware often has mechanisms to
anticipate constant stride memory references, once they have been previously
used (locally). Intel features a hardware prefetcher: When a cache miss occurs
twice for regularly ordered data, this prefetcher is started. Even so, basic cache
structure assumes locality: if a word is fetched from memory, the safest assump-
tion is that other words nearby will also be used. Hence an associated cache
line is loaded from memory, even though parts of the line might not be used.
Cray, Fujitsu, and NEC vector hardware does not use caches for their vector
memory references. Thus, non-unit stride does not carry the same penalty as its
cache-miss equivalents on cache memory machines.
In this section, another example of doubling stride memory access is shown:
an in-place, in-order binary radix FFT. The algorithm is most appropriate for
cacheless machines like the Cray SV-1 or NEC SX-5, although the performance
on Pentium 4 is quite respectable due to the anticipation of non-unit stride
memory hardware. The point of using non-unit strides is to avoid data depend-
encies wherein memory locations could be written over before the previous (old)
data are used. To look a few pages ahead, examine the signal flow diagrams in
Figures 3.20 and 3.21 (which defines the a, b, c, and d variables). Notice that
if the d result of the first (k = 0) computational box is written before the first
k = 1 a input for the second box is used, the previous value of that a will be
lost forever and the final results will be incorrect. Likewise for the c result of
the second box (k = 1), which will be written onto the a input for the third box
(k = 2), and so on. The algorithm of Figure 3.20 cannot be done in-place. That
is, the intermediate stages cannot write to the same memory locations as their
inputs. An astute observer will notice that the last stage can be done in-place,
however, since the input locations are exactly the same as the outputs and do
not use w. One step does not an algorithm make, however, so one has to be
clever to design a strategy which will group the “boxes” (as in Figure 3.20) so
there are no data dependencies.
Fortunately, there are such clever strategies and here we illustrate one—
Temperton’s in-place, in-order procedure. It turns out that there are in-place,
in-order algorithms, with unit stride, but these procedures are more appropriate
for large n (n = 2^m is the dimension of the transform) and more complicated
than we have space for here ([96]). Figure 3.17 shows the signal flow diagram for
n = 16. The general strategy is as follows.
• The first “half” of the m steps (recall n = 2^m) are from Cooley and
  Tukey [21]. There are two loops: one over the number of twiddle factors
  ω^k, and a second over the number of “boxes” that use any one particular
  twiddle factor, indexed by k in Figures 3.17 and 3.20.
• The second “half” of the m steps also re-order the storage by using three
loops instead of two.
• The final m/2 steps ((m − 1)/2 if m is odd) group the “boxes” in pairs so
that storing results does not write onto inputs still to be used. Figure 3.18
shows the double boxes.
Finally, here is the code for the driver (cfft2) showing the order in which
step1 (Cooley–Tukey step) and step2 (step plus ordering) are referenced.
Array w contains the pre-computed twiddle factors w_k = exp(2πik/n) for
k = 0, . . . , n/2 − 1:
void cfft2(n,x,w,iw,sign)
int n,iw,sign;
float x[][2],w[][2];
{
/*
n=2**m FFT from C. Temperton, SIAM J. on Sci.
and Stat. Computing, 1991. WPP: 25/8/1999, ETHZ
*/
int n2,m,j,mj,p2,p3,p4,BK;
void step1(), step2();
m = (int) (log((float) n)/log(1.999));
mj = 1; n2 = n/2;
for(j=0;j<m;j++){
if(j < (m+1)/2){ p2 = n2/mj;
step1(n,mj,&x[0][0],&x[p2][0],w,iw,sign);
} else{ p2 = n2/mj; p3 = mj; p4 = p2+mj;
step2(n,mj,&x[0][0],&x[p2][0],
&x[p3][0],&x[p4][0],w,iw,sign);
}
mj = 2*mj;
}
}
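The twiddle array w is assumed to be precomputed before cfft2 is called; a minimal setup sketch (gen_w is our own helper name, for the stride parameter iw = 1, with w[k][0] holding the cosine and w[k][1] the sine, matching the definition above) is:

#include <math.h>

void gen_w(int n, float w[][2])
{ /* w[k] = exp(2*pi*i*k/n) for k = 0..n/2-1 */
   int k;
   double twopi = 8.0*atan(1.0);
   for (k = 0; k < n/2; k++) {
      w[k][0] = (float) cos(twopi*k/n);
      w[k][1] = (float) sin(twopi*k/n);
   }
}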
Here is the Cooley–Tukey step (step1):
void step1(n,mj,a,b,w,iw,sign)
int iw,n,mj,sign;
float a[][2],b[][2],w[][2];
{
float wkr,wku,wambr,wambu;
int i,k,ks,kw,lj,ii,ij;
lj = n/(2*mj); ij = n/mj; ks = iw*mj;
for(i=0;i<mj;i++){
ii = i*ij;
if(sign > 0){
#pragma ivdep
for(k=0;k<lj;k++){
kw=k*ks; wkr=w[kw][0]; wku=w[kw][1];
wambr = wkr*(a[ii+k][0]-b[ii+k][0])
- wku*(a[ii+k][1]-b[ii+k][1]);
wambu = wku*(a[ii+k][0]-b[ii+k][0])
+ wkr*(a[ii+k][1]-b[ii+k][1]);
a[ii+k][0] = a[ii+k][0]+b[ii+k][0];
a[ii+k][1] = a[ii+k][1]+b[ii+k][1];
b[ii+k][0] = wambr; b[ii+k][1] = wambu;
}
} else {
#pragma ivdep
for(k=0;k<lj;k++){
kw=k*ks; wkr=w[kw][0]; wku=-w[kw][1];
wambr = wkr*(a[ii+k][0]-b[ii+k][0])
- wku*(a[ii+k][1]-b[ii+k][1]);
wambu = wku*(a[ii+k][0]-b[ii+k][0])
+ wkr*(a[ii+k][1]-b[ii+k][1]);
a[ii+k][0] = a[ii+k][0]+b[ii+k][0];
a[ii+k][1] = a[ii+k][1]+b[ii+k][1];
b[ii+k][0] = wambr; b[ii+k][1] = wambu;
}
}
}
}
Fig. 3.17. In-place, self-sorting FFT (signal flow diagram for n = 16). Also see Fig. 3.20.
Fig. 3.18. Double “bug” for in-place, self-sorting FFT: each box takes inputs a, b, c, d
(indexed by k) and produces a + b, c + d, w^k(a − b), and w^k(c − d). Also see Fig. 3.21.
Next we have the routine for the last m/2 steps, which permutes everything so
that the final result comes out in the proper order. Notice that there are three
loops and not two. Not surprisingly, the k loop which indexes the twiddle factors
(w) may be made into an outer loop with the same j, i loop structure (see also
Chapter 4).
void step2(n,mj,a,b,c,d,w,iw,sign)
int iw,n,mj;
float a[][2],b[][2],c[][2],d[][2],w[][2];
int sign;
{
float wkr,wku,wambr,wambu,wcmdr,wcmdu;
int mj2,i,j,k,ks,kw,lj,ii;
mj2=2*mj; lj=n/mj2; ks=iw*mj;
for(k=0;k<lj;k++){
kw = k*ks; wkr = w[kw][0];
wku=(sign>0)?w[kw][1]:(-w[kw][1]);
for(i=0;i<lj;i++){
ii = i*mj2;
#pragma ivdep
for(j=k;j<mj;j+=n/mj){
wambr = wkr*(a[ii+j][0]-b[ii+j][0])
- wku*(a[ii+j][1]-b[ii+j][1]);
wambu = wku*(a[ii+j][0]-b[ii+j][0])
+ wkr*(a[ii+j][1]-b[ii+j][1]);
a[ii+j][0] = a[ii+j][0]+b[ii+j][0];
a[ii+j][1] = a[ii+j][1]+b[ii+j][1];
b[ii+j][0] = c[ii+j][0]+d[ii+j][0];
b[ii+j][1] = c[ii+j][1]+d[ii+j][1];
wcmdr = wkr*(c[ii+j][0]-d[ii+j][0])
- wku*(c[ii+j][1]-d[ii+j][1]);
wcmdu = wku*(c[ii+j][0]-d[ii+j][0])
+ wkr*(c[ii+j][1]-d[ii+j][1]);
c[ii+j][0] = wambr; c[ii+j][1] = wambu;
d[ii+j][0] = wcmdr; d[ii+j][1] = wcmdu;
}
}
}
}
3.5.5 Some examples from Intel SSE and Motorola Altivec
In what follows, we proceed a little deeper into vector processing. The Intel
Pentium III and Pentium 4 chips have a set of eight vector registers called
XMM (see Section 3.2.5). Under the generic title SSE (for Streaming SIMD
Extensions), these eight 4-word registers (32 bits per word) are well adapted to
vector processing in scientific computations. Similarly, the Apple/Motorola modification
of the IBM G-4 chip has additional hardware called Altivec: thirty-two 4-word
registers, see Section 3.2.7. Because the saxpy operation is so simple and is part of
an exercise, we examine sdot and isamax on these two platforms. Subsequently,
in Section 3.6 we will cover FFT on Intel P-4 and Apple G4.
One extremely important consideration in using the Altivec and SSE
hardware is data alignment. As we showed in Section 1.2, cache loadings are
associative. Namely, an address in memory is loaded into cache modulo cache
associativity: that is, in the case of 4-way associativity, a cacheline is 16 bytes
long (4 words) and the associated memory loaded on 4-word boundaries. When
loading a scalar word from memory at address m, this address is taken modulo
16 (bytes) and loading begins at address (m/16) · 16. As in Figure 3.19, if we
wish to load a 4-word segment beginning at A_3, two cachelines (cache blocks)
must be loaded to get the complete 4-word segment. When using the 4-word
(128 bit) vector registers, in this example a misalignment occurs: the data are
not on 16-byte boundaries. This misalignment must be handled by the program;
it is not automatically done properly by the hardware as it is in the case of
simple scalar (integer or floating point) data. Possible misalignment may occur
for both loads and stores. The vector data are loaded (stored) from (into) cache
when using vector registers (V_i in our examples) on these cacheline boundaries.
In our BLAS examples below, therefore, we assume the data are properly
ies. In our BLAS examples below, therefore, we assume the data are properly
4-word aligned. Our website BLAS code examples do not assume proper align-
ment but rather treat 1–3 word initial and end misalignments as special cases.
Fig. 3.19. Data misalignment in vector reads: loading a 4-word segment starting at A_3
crosses a 4-word (cacheline) boundary, so two cachelines must be read to obtain the
desired words A_3–A_6.
See [20], or section 2.22 in [23]. Our FFT examples use 4-word aligned memory
allocation: _mm_malloc or valloc for SSE and Altivec, respectively. One should
note that valloc is a Posix standard and could be used for either machine, that
is, instead of _mm_malloc on Pentium 4.
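A minimal allocation sketch (alloc_cvec is our own helper name; _mm_malloc comes with the Intel/gcc SSE headers, and valloc is declared in <stdlib.h> or <unistd.h> on most Unix systems):

#include <stdlib.h>
#ifdef __SSE__
#include <xmmintrin.h>                 /* _mm_malloc / _mm_free */
#endif

float *alloc_cvec(int n)
{ /* 16-byte aligned space for n complex (2-float) elements */
#ifdef __SSE__
   return (float *) _mm_malloc(2*n*sizeof(float), 16);   /* free with _mm_free */
#else
   return (float *) valloc(2*n*sizeof(float));           /* page aligned (Posix) */
#endif
}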
3.5.6 SDOT on G-4
The basic procedure has been outlined in Section 3.3. There, only the reduction
from the partial accumulation of VL (= 4 in the present case, see page 87)
remains to be described. We only do a simple obvious variant. Both Intel
and Apple have chosen intrinsics for their programming paradigm on these
platforms. Somewhat abbreviated summaries of these intrinsics are given in
Appendix A and Appendix B, respectively. Assembly language programming
provides much more precise access to this hardware, but using intrinsics is easier
and allows comparison between the machines. Furthermore, in-line assembly
language segments in the code cause the C optimizer serious problems. In what
follows, keep in mind that the “variables” V_i are only symbolic and actual register
assignments are determined by the compiler. Hence, V_i does not refer directly to
any actual hardware assignments. One considers what one wants the compiler
to do and hopes it actually does something similar.
Our web-server contains the complete code for both systems including simple
test programs. Hopefully edifying explanations for the central ideas are given
here. An important difference between the intrinsics versions and their assembler
counterparts is that many details may be omitted in the C intrinsics. In par-
ticular, register management and indexing are handled by the C compiler when
using intrinsics. For purposes of exposition, we make the simplifying assumption
that the number of elements in sdot n is a multiple of 4 (i.e. 4|n),
n = q · 4.
The vec_ld operations simply load a 4-word segment, while the vec_madd operations
multiply the first two arguments and add the 4-word result to the third.
Conversely, vec_st stores the partial accumulation into memory. First we show
an Altivec version of sdot0, that is, sdot without strides. The accumulation is
in variable V7, which may actually be any Altivec register chosen by the gcc
compiler.
The choice of switches is given by
gcc -O3 -faltivec sdottest.c -lm
The gcc version of the C compiler is that released by Apple in their free
Developers’ Kit. Our tests used gcc3, version 1161, April 4, 2003. In Appendix B,
we include a slightly abbreviated list of vec operations.
float sdot0(int n, float *x, float *y)
{ /* no x,y strides */
float *xp,*yp,sum=0.0;
int i,ii,nres,nsegs; /* nsegs = q */
vector float V7 = (vector float)(0.0,0.0,0.0,0.0);
vector float V0,V1;
float psum[4];
xp = x; yp = y;
V0 = vec_ld(0,xp); xp += 4; /* load x */
V1 = vec_ld(0,yp); yp += 4; /* load y */
nsegs = (n >> 2) - 1;
nres = n - ((nsegs+1) << 2); /* nres=n mod 4 */
for(i=0;i<nsegs;i++){
V7 = vec_madd(V0,V1,V7); /* part sum of 4 */
V0 = vec_ld(0,xp); xp += 4; /* load next 4 x */
V1 = vec_ld(0,yp); yp += 4; /* load next 4 y */
}
V7 = vec_madd(V0,V1,V7); /* final part sum */
/* Now finish up: v7 contains partials */
vec_st(V7,0,psum); /* store partials to memory */
for(i=0;i<4;i++){
sum += psum[i];
}
return(sum);
}
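When 4 does not divide n, the leftover 1–3 elements (the special cases mentioned above, handled in our website code) can be added with a plain scalar loop; a sketch:

float sdot_scalar(int n, const float *x, const float *y)
{ /* scalar reference version, also usable for the n mod 4 cleanup */
   int i;
   float sum = 0.0f;
   for (i = 0; i < n; i++) sum += x[i]*y[i];
   return sum;
}
/* e.g.  sum += sdot_scalar(n % 4, x + (n & ~3), y + (n & ~3));  after the vector loop */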
3.5.7 ISAMAX on Intel using SSE
The index of maximum element search is somewhat trickier. Namely, the merge
operation referenced in Section 3.2.8 can be implemented in various ways on
different machines, or a well scheduled branch prediction scheme is required.
We show here how it works in both the Pentium III and Pentium 4 machines.
Our website has code for the Apple G-4. The appropriate compiler, which
supports this hardware on Pentium III or Pentium 4, is icc (Intel also provides
a Fortran compiler [24]):
icc -O3 -axK -vec_report3 isamaxtest.c -lm
On the Pentium, the 4-word SSE arrays are declared by __m128 declarations.
Other XMM (SSE) instructions in the isamax routine are described in refer-
ence [27] and in slightly abbreviated form in Appendix A.
• _mm_set_ps sets the result array to the contents of the parenthetical
  constants.
• _mm_set_ps1 sets the entire array to the content of the parenthetical constant.
• _mm_load_ps loads 4 words beginning at its pointer argument.
• _mm_andnot_ps is used to compute the absolute values of the first argument
  by abs(a) = a and not(−0.0). This is a mask of all bits except the sign
  (see the small sketch after this list).
• _mm_cmpnle_ps compares the first argument with the second, to form a
  mask (in V3 here) if it is larger.
• _mm_movemask_ps counts the number which are larger, if any.
• _mm_add_ps in this instance increments the index count by 4.
• _mm_max_ps selects the larger of the two arguments.
• _mm_and_ps performs an and operation.
• _mm_store_ps stores 4 words beginning at the location specified by the first
  argument.
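The absolute-value idiom mentioned for _mm_andnot_ps deserves a line of its own; a small sketch (abs4 is our own name):

#include <xmmintrin.h>

static __m128 abs4(__m128 v)
{ /* -0.0f is just the sign bit, so andnot clears the sign of every element */
   __m128 signbit = _mm_set_ps1(-0.0f);
   return _mm_andnot_ps(signbit, v);      /* (~signbit) & v  ->  |v| */
}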
In this example, we have assumed the x data are aligned on 4-word boundaries
and that n = q · 4 (i.e. 4|n). Data loads of x are performed for a segment
(of 4) ahead. This routine is a stripped-down variant of that given in an Intel
report [26].
int isamax0(int n, float *x) /* no stride for x */
{
float big,*xp;
int i,ibig,nsegs,mb;              /* nsegs = q */
__m128 offset4,V0,V1,V2,V3,V6,V7;
__declspec (align(16)) float xbig[4],indx[4];
V7 = _mm_set_ps(3.0,2.0,1.0,0.0);
V2 = _mm_set_ps(3.0,2.0,1.0,0.0);
V6 = _mm_set_ps1(-0.0);
offset4 = _mm_set_ps1(4.0);
xp = x; nsegs = (n >> 2) - 2;
V0 = _mm_load_ps(xp); xp += 4; /* 1st 4 */
V1 = _mm_load_ps(xp); xp += 4; /* next 4 */
V0 = _mm_andnot_ps(V6,V0); /* abs. value */
for(i=0;i<nsegs;i++){
V1 = _mm_andnot_ps(V6,V1); /* abs. value */
V3 = _mm_cmpnle_ps(V1,V0); /* old vs new */
mb = _mm_movemask_ps(V3); /* any bigger */
V2 = _mm_add_ps(V2,offset4); /* add offset */
if(mb > 0){V0 = _mm_max_ps(V0,V1); /* BRANCH */
V3=_mm_and_ps(V2,V3); V7=_mm_max_ps(V7,V3);}
V1 = _mm_load_ps(xp); xp += 4; /* load new 4 */
}
/* process last segment of 4 */
V1 = _mm_andnot_ps(V6,V1); /* abs. value */
V3 = _mm_cmpnle_ps(V1,V0); /* old vs new */
mb = _mm_movemask_ps(V3); /* any bigger */
V2 = _mm_add_ps(V2,offset4); /* add offset */
if(mb > 0){V0 = _mm_max_ps(V0,V1);
V3=_mm_and_ps(V2,V3); V7=_mm_max_ps(V7,V3);}
/* Finish up: maxima are in V0, indices in V7 */
_mm_store_ps(xbig,V0); _mm_store_ps(indx,V7);
big = 0.0; ibig = 0;
for(i=0;i<4;i++){
if(xbig[i]>big){big=xbig[i]; ibig=(int)indx[i];}
}
return(ibig);
}
3.6 FFT on SSE and Altivec
For our last two examples of SIMD programming, we turn again to Fast Fourier
Transform, this time using the Intel SSE and Apple/Motorola Altivec hardware.
For faster but more complicated variants, see references [28, 30, 31, 108]. Because
these machines favor unit stride memory access, we chose a form which restricts
the inner loop access to stride one in memory. However, because our preference
has been for self-sorting methods, we chose an algorithm which uses a workspace.
In Figure 3.20, one can see the signal flow diagram for n = 8. If you look carefully,
it is apparent why a workspace is needed. Output from the bugs (Figure 3.21)
will write into areas where the next elements are read—a dependency situation if
the output array is the same as the input. As in Section 3.2, if the output array is
different, this data dependency goes away. An obvious question is whether there
is an in-place and self-sorting method. The answer is yes, but it is complicated
to do with unit stride. Hence, for reasons of simplicity of exposition, we use a
self-sorting algorithm with unit stride but this uses a workspace [141, 123].
The workspace version simply toggles back and forth between the input (x)
and the workspace/output (y), with some logic to assign the result to the
output y. In fact, the ccopy (copies one n–dimensional complex array into
another, see Section 2.2) is not needed except to adhere to the C language rules.
Without it, some compiler optimizations could cause trouble. A careful examin-
ation of Figure 3.20 shows that the last pass could in principle be done in-place.
The following code is the driver routine, a complex binary radix (n = 2^m) FFT.
Fig. 3.20. Workspace version of self-sorting FFT (signal flow diagram for n = 8).
Fig. 3.21. Decimation in time computational “bug”: inputs a and b produce
c = a + b and d = w^k(a − b), where w = e^{2πi/n}.
One step through the data is represented in Figure 3.20 by a column of boxes
(bugs) shown in Figure 3.21. Each “bug” computes c = a + b and d = w^k(a − b),
where w = the nth root of unity. Since complex data in Fortran and similar vari-
ants of C supporting complex data type are stored Re x_0, Im x_0, Re x_1, . . . , the
    V3 = ( (a−b)_r, (a−b)_i, (a−b)_r, (a−b)_i )^T ,
    V6 = ( ω_r, ω_r, ω_r, ω_r )^T ,
    V7 = ( −ω_i, ω_i, −ω_i, ω_i )^T ,

    V4 = shuffle(V3) = ( (a−b)_i, (a−b)_r, (a−b)_i, (a−b)_r )^T ,

    V0 = V6 · V3 = ( ω_r(a−b)_r, ω_r(a−b)_i, ω_r(a−b)_r, ω_r(a−b)_i )^T ,
    V1 = V7 · V4 = ( −ω_i(a−b)_i, ω_i(a−b)_r, −ω_i(a−b)_i, ω_i(a−b)_r )^T ,

    d = V2 = V0 + V1.       (result)

Fig. 3.22. Complex arithmetic for d = w^k(a − b) on SSE and Altivec. In
Figure 3.21, c is easy.
arithmetic for d (now a vector d with two complex elements) is slightly involved.
Figure 3.22 indicates how _mm_shuffle_ps is used for this calculation.
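As a standalone illustration of Figure 3.22 (a sketch; wamb2 is our own helper name and the data are assumed 16-byte aligned), the shuffle-based arithmetic for two packed complex values is:

#include <xmmintrin.h>

static void wamb2(const float *a, const float *b, const float *w, float *d)
{ /* d = w*(a-b) for two complex numbers stored {re0,im0,re1,im1}; w = {wr,wi} */
   __m128 va  = _mm_load_ps(a);
   __m128 vb  = _mm_load_ps(b);
   __m128 amb = _mm_sub_ps(va, vb);                              /* (a-b): r i r i */
   __m128 swp = _mm_shuffle_ps(amb, amb, _MM_SHUFFLE(2,3,0,1));  /* i r i r        */
   __m128 wr  = _mm_set_ps1(w[0]);                               /* wr wr wr wr    */
   __m128 wi  = _mm_set_ps(w[1], -w[1], w[1], -w[1]);            /* -wi wi -wi wi  */
   _mm_store_ps(d, _mm_add_ps(_mm_mul_ps(wr, amb), _mm_mul_ps(wi, swp)));
}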
void cfft2(n,x,y,w,sign)
int n;
float x[][2],y[][2],w[][2],sign;
{ /* x=in, y=out, w=exp(2*pi*i*k/n), k=0..n/2-1 */
int jb, m, j, mj, tgle;
void ccopy(),step();
m = (int)(log((float)n)/log(1.99)); mj=1; tgle=1;
step(n,mj,&x[0][0],&x[n/2][0],
&y[0][0],&y[mj][0],w,sign);
for(j=0;j<m-2;j++){
mj *= 2;
if(tgle){
step(n,mj,&y[0][0],&y[n/2][0],
&x[0][0],&x[mj][0],w,sign); tgle = 0;
} else {
step(n,mj,&x[0][0],&x[n/2][0],
&y[0][0],&y[mj][0],w,sign); tgle = 1;
}
}
if(tgle){ccopy(n,y,x);} /* if tgle: y -> x */
mj = n/2;
step(n,mj,&x[0][0],&x[n/2][0],
&y[0][0],&y[mj][0],w,sign);
}
The next code is the Intel SSE version of step [27]. Examining the driver (cfft2,
above), the locations of half-arrays (a, b, c, d) are easy to determine: b is located
n/2 complex elements from a, while the output d is mj = 2^j complex locations
from c, where j = 0, . . . , log_2(n) − 1 is the step number.
void step(n,mj,a,b,c,d,w,sign)
int n, mj;
float a[][2],b[][2],c[][2],d[][2],w[][2],sign;
{
int j,k,jc,jw,l,lj,mj2,mseg;
float rp,up;
__declspec (align(16)) float wr[4],wu[4];
__m128 V0,V1,V2,V3,V4,V6,V7;
mj2 = 2*mj; lj = n/mj2;
for(j=0; j<lj; j++){
jw = j*mj; jc = j*mj2;
rp = w[jw][0]; up = w[jw][1];
if(sign<0.0) up = -up;
if(mj<2){ /* special case mj=1 */
d[jc][0] = rp*(a[jw][0] - b[jw][0])
- up*(a[jw][1] - b[jw][1]);
d[jc][1] = up*(a[jw][0] - b[jw][0])
+ rp*(a[jw][1] - b[jw][1]);
c[jc][0] = a[jw][0] + b[jw][0];
c[jc][1] = a[jw][1] + b[jw][1];
} else { /* mj > 1 cases */
wr[0] = rp; wr[1] = rp;
wr[2] = rp; wr[3] = rp;
wu[0] = -up; wu[1] = up;
wu[2] = -up; wu[3] = up;
V6 = _mm_load_ps(wr);
V7 = _mm_load_ps(wu);
for(k=0; k<mj; k+=2){
V0 = _mm_load_ps(&a[jw+k][0]);
V1 = _mm_load_ps(&b[jw+k][0]);
V2 = _mm_add_ps(V0,V1); /* a+b */
_mm_store_ps(&c[jc+k][0],V2); /* c to M */
V3 = _mm_sub_ps(V0,V1); /* a-b */
V4 = _mm_shuffle_ps(V3,V3,
_MM_SHUFFLE(2,3,0,1));
V0 = _mm_mul_ps(V6,V3);
V1 = _mm_mul_ps(V7,V4);
V2 = _mm_add_ps(V0,V1); /* w*(a-b) */
_mm_store_ps(&d[jc+k][0],V2); /* d to M */
}
}
}
}
In Figure 3.23, we show the performance of three FFT variants: the in-place
(Section 3.5.4), the SSE intrinsics (Section 3.6), and a generic version of the
same workspace algorithm on a 1.7 GHz Pentium 4 running Linux version 2.4.18.
There are many possible improvements: both split-radix [44] and radix 4 are
faster than radix 2. Because of symmetries in the twiddle factors, fewer need to be
accessed from memory. Also, competing with professionals is always hard. How-
ever, notice that the improvement using SSE is a factor of 2.35 faster than the
generic version, which uses the same algorithm. In our Chapter 4 (Section 4.8.3)
on shared memory parallelism we show a variant of step used here. In that case,
the work of the outer loop (over twiddle factors, the w array elements) may
be distributed over multiple processors. The work done by each computation
box, labeled k in Figure 3.21, is independent of every other k. Thus, these tasks
may be distributed over multiple CPUs. Multiple processors may be used when
available, and MKL [28] uses the pthreads library to facilitate this.
Fig. 3.23. Intrinsics, in-place (non-unit stride), and generic FFT (Mflops vs.
n = 2^m). Ito: 1.7 GHz Pentium 4.

Here is step for the G-4 Altivec. Older versions of Apple Developers’ kit
C compiler, gcc, supported complex arithmetic, cplx data type, but we do not
use this typing here. Newer versions apparently do not encourage this complex
typing.
#define _cvf const vector float
#define _cvuc const vector unsigned char
void step(int n,int mj, float a[][2], float b[][2],
float c[][2], float d[][2],float w[][2], float sign)
{
int j,k,jc,jw,l,lj,mj2;
float rp,up;
float wr[4], wu[4];
_cvf vminus = (vector float)(-0.,0.,-0.,0.);
_cvf vzero = (vector float)(0.,0.,0.,0.);
_cvuc pv3201 = (vector unsigned char)
(4,5,6,7,0,1,2,3,12,13,14,15,8,9,10,11);
vector float V0,V1,V2,V3,V4,V5,V6,V7;
mj2 = 2*mj;
lj = n/mj2;
for(j=0; j<lj; j++){
jw = j*mj; jc = j*mj2;
rp = w[jw][0];
up = w[jw][1];
if(sign<0.0) up = -up;
if(mj<2){
/* special case mj=1 */
d[jc][0] = rp*(a[jw][0] - b[jw][0])
- up*(a[jw][1] - b[jw][1]);
d[jc][1] = up*(a[jw][0] - b[jw][0])
+ rp*(a[jw][1] - b[jw][1]);
c[jc][0] = a[jw][0] + b[jw][0];
c[jc][1] = a[jw][1] + b[jw][1];
} else {
/* mj>=2 case */
wr[0]=rp; wr[1]=rp; wr[2]=rp; wr[3]=rp;
wu[0]=up; wu[1]=up; wu[2]=up; wu[3]=up;
V6 = vec_ld(0,wr);
V7 = vec_ld(0,wu);
V7 = vec_xor(V7,vminus);
for(k=0; k<mj; k+=2){ /* read a,b */
V0 = vec_ld(0,(vector float *)&a[jw+k][0]);
V1 = vec_ld(0,(vector float *)&b[jw+k][0]);
V2 = vec_add(V0, V1); /* c=a+b */
vec_st(V2,0,(vector float *)&c[jc+k][0]);
V3 = vec_sub(V0, V1);
V4 = vec_perm(V3,V3,pv3201); /* shuffle */
V0 = vec_madd(V6,V3,vzero);
V1 = vec_madd(V7,V4,vzero);
V2 = vec_add(V0,V1); /* d=w*(a-b) */
vec_st(V2,0,(vector float *)&d[jc+k][0]);
}
}
}
}
Although the Altivec instructions used here are similar to the SSE variant,
here is a brief summary of their functionality [30]. A summary is also given in
Appendix B.
• V0 = vec_ld(0,ptr) loads its result V0 with four floating point words
  beginning at pointer ptr and block offset 0.
• V2 = vec_add(V0,V1) performs an element by element floating point add
  of V0 and V1 to yield result vector V2.
• vec_st(V2,0,ptr) stores the four word contents of vector V2 beginning at
  pointer ptr with block offset 0.
• V3 = vec_sub(V0,V1) subtracts the four floating point elements of V1
  from V0 to yield result V3.
• V2 = vec_perm(V0,V1,pv) selects bytes from V0 or V1 according to the
  permutation vector pv to yield a permuted vector V2. These are arranged
  in little endian fashion [30].
• V3 = vec_madd(V0,V1,V2) is a multiply–add operation: V3 = V0*V1+V2,
  where the multiply is element by element, and likewise the add.
In Figure 3.24, we show our results on a Power Mac G-4. This is a 1.25 GHz
machine running OS-X (version 10.2) and the Developers’ kit gcc compiler (ver-
sion 1161, April 4, 2003). The Altivec version using intrinsics is three times faster
than the generic one using the same algorithm.
Exercise 3.1 Variants of matrix–matrix multiply In Section 3.4.1 are
descriptions of two variations of matrix–matrix multiply: C = AB. The two
are (1) the textbook method, which computes c_{ij} = Σ_k a_{ik} b_{kj}, and (2) the
outer product variant c_{∗j} = Σ_k a_{∗k} b_{kj}, which processes whole columns at a time.
The first variant is a dot-product (sdot), while the second uses repeated saxpy
operations.
What is to be done?
The task here is to code these two variants in several ways:
1. First, program the dot-product variant using three nested loops, then
instead of the inner-most loop, substitute sdot.
Fig. 3.24. Intrinsics, in-place (non-unit stride), and generic FFT (Mflops vs.
n = 2^m). Tests are from a machine named Ogdoad: 1.25 GHz Power Mac G-4.
2. Next, program the outer-product variant using three nested loops, then
replace the inner-most loop with saxpy.
3. Using large enough matrices to get good timing results, compare the per-
formances of these variants. Again, we recommend using the clock func-
tion, except on Cray platforms where second gives better resolution.
4. By using multiple repetitions of the calculations with BLAS routines sdot
and saxpy, see if you can estimate the function call overhead. That is,
how long does it take to call either sdot or saxpy even if n = 0? This is
trickier than it seems: some timers only have 1/60 second resolution, so
many repetitions may be required.
5. Finally, using the outer-product method (without substituting saxpy),
see if you can unroll the first outer loop to a depth of 2—hence using
two columns of B as operands at a time (Section 3.2). To make your life
simpler, choose the leading dimension of C (also A) to be even. Do you
see any cache effects for large n (the leading dimension of A and C)?
Helpful hints: Look in /usr/local/lib for the BLAS: for example, libblas.a
or as a shared object libblas.so. The utilities nm or ar (plus grep) may make
it easier to find the BLAS routines you want: ar t libblas.a |grep -i dot,
or nm libblas.so |grep -i axpy. These are likely to be Fortran routines,
so consult Appendix E to get the Fortran-C communication correct and the
proper libraries loaded.
Exercise 3.2 Using intrinsics and SSE or Altivec In Section 3.5.5 we
gave descriptions of VL = 4-vector programming on Intel Pentium III and
Pentium 4 using SSE (streaming SIMD extensions), and Apple/Motorola G-4
Altivec. The programming mode uses intrinsics. In this exercise, we ask you to
program two basic functions in this mode: the saxpy operation (y ← α · x +y),
and ssum a summation (s ←

n−1
i=0
x
i
).
To make your job easier, assume that the elements of the appropriate arrays
(x, y) have unit spacing between them. For the squeamish, you can also assume
that 4|n (n is divisible by 4), as below, although the more general case is a
useful exercise. The non-unit stride case is an exercise for real gurus. Here are
two examples, for SSE (Intel) and Altivec (Apple), respectively, doing the simpler
sscal problem (x ← α · x).
SSE version:
void sscal(int n, float alpha, float *x) {
/* SSE version of sscal */
int i,ns;
float alpha_vec[4];
__m128 tmm0,tmm1;
for (i = 0; i < 4; i ++){
alpha_vec[i] = alpha; /* 4 copies of alpha */
}
tmm0 = _mm_load_ps(alpha_vec); /* alphas in tmm0 */
ns = n/4;
for (i = 0; i < ns; i ++) {
tmm1 = _mm_load_ps(&x[4*i]); /* load 4 x’s */
tmm1 = _mm_mul_ps(tmm1, tmm0); /* alpha*x’s */
_mm_store_ps(&x[4*i],tmm1); /* store x’s */
}
}
Altivec version:
void sscal(int n, float alpha, float *x) {
/* Altivec version of sscal */
int i,ns;
float alpha_vec[4];
const vector float vzero =
(vector float)(0.,0.,0.,0.);
vector float V0,V1;
for (i = 0; i < 4; i ++){
alpha_vec[i] = alpha; /* 4 copies of alpha */
}
V0 = vec_ld(0,alpha_vec); /* copies into V0 */
ns = n/4;
for (i = 0; i < ns; i ++) {
V1 = vec_ld(0,(vector float *)&x[4*i]); /* load */
V1 = vec_madd(V0,V1,vzero); /* a*x */
vec_st(V1,0,(vector float *)&x[4*i]); /* store */
}
}
What is to be done?
Using one/other of the above sscal routines as examples, program both saxpy
and ssum using either the SSE or Altivec intrinsics. You may choose one machine
or the other. The examples isamax from Section 3.5.7 or sdot from Section 3.5.6
may be helpful for coding the reduction operation ssum.
1. Write test programs for these routines—with a large number of iter-
ations and/or big vectors—to get good timings for your results. Use
the system timer time = (double) clock(); and scale your timings by
CLOCKS_PER_SEC defined in <time.h>.
2. Modify your code to use a pre-fetch strategy to load the next segment to
be used while computing the current segment.
3. Compare the pre-fetch strategy to the vanilla version.
If your local machines do not have the appropriate compilers, they are available
(free, without support) from
• The gcc Apple compiler is available from the Apple developer web-site
http://developer.apple.com/.
• The Intel icc compiler is available from
http://developer.intel.com/.
Our Beowulf machine Asgard has an up-to-date version of this compiler.
References: Technical report [26] and website www.simdtech.org.
4
SHARED MEMORY PARALLELISM
I think there’s a world market for about five computers.
Th. J. Watson (1943)
4.1 Introduction
Shared memory machines typically have relatively few processors, say 2–128.
An intrinsic characteristic of these machines is a strategy for memory coherence
and a fast tightly coupled network for distributing data from a commonly access-
ible memory system. Our test examples were run on two HP Superdome clusters:
Stardust is a production machine with 64 PA-8700 processors, and Pegasus is a
32 CPU machine with the same kind of processors.
4.2 HP9000 Superdome machine
The HP9000 is grouped into cells (Figure 4.1), each with 4 CPUs, a common
memory/cell, and connected to a CCNUMA crossbar network. The network con-
sists of sets of 4×4 crossbars and is shown in Figure 4.2. An effective bandwidth
test, the EFF_BW benchmark [116], groups processors into two equally sized sets.
Fig. 4.1. One cell of the HP9000 Superdome. Each cell has 4 PA-8700 CPUs
and a common memory. See Figure 4.2.
Fig. 4.2. Crossbar interconnect architecture of the HP9000 Superdome.
Fig. 4.3. Pallas EFF_BW benchmark. The processors are divided into two
equally sized groups and arbitrary pairwise connections are made between
processors from each group: simultaneous messages from each pair are sent
and received.
Arbitrary pairings are made between elements from each group, Figure 4.3, and
the cross-sectional bandwidth of the network is measured for a fixed number
of processors and varying message sizes. The results from the HP9000 machine
Stardust are shown in Figure 4.4. It is clear from this figure that the cross-
sectional bandwidth of the network is quite high. Although not apparent from
Figure 4.4, the latency for this test (the intercept near Message Size = 0) is
not high. Due to the low incremental resolution of MPI_Wtime (see p. 234), mul-
tiple test runs must be done to quantify the latency. Dr Byrde’s tests show that
minimum latency is 1.5 µs.
4.3 Cray X1 machine
A clearer example of a shared memory architecture is the Cray X1 machine,
shown in Figures 4.5 and 4.6. In Figure 4.6, the shared memory design is obvious.
Each multi-streaming processor (MSP) shown in Figure 4.5 has 4 processors
(custom designed processor chips forged by IBM), and 4 corresponding caches.
Fig. 4.4. EFF_BW benchmark on Stardust (bandwidth in MB/s vs. message size
in bytes, for 2–32 CPUs). These data are courtesy of Olivier Byrde. The collapse
of the bandwidth for large message lengths seems to be cache/buffer management
configuration.
Fig. 4.5. Cray X1 MSP. The only memory in each of these MSPs consists of
four sets of 1/2 MB caches (called Ecache). There are four of these MSPs
per node, Figure 4.6.
Fig. 4.6. Cray X1 node (board): each board has 4 groups of 4 MSPs (Figure 4.5),
16 processors total. Cache coherency is maintained only within one node.
Between nodes, MPI is used.

Although not clear from available diagrams, vector memory access apparently
permits cache by-pass; hence the term streaming in MSP. That is, vector registers
are loaded directly from memory: see, for example, Figure 3.4. On each board
(called nodes) are 4 such MSPs and 16 memory modules which share a common
(coherent) memory view. Coherence is only maintained on each board, but not
across multiple board systems. The processors have 2 complete sets of vector
units, each containing 32 vector registers of 64 words each (each word is 64 bits),
and attendant arithmetic/logical/shift functional units. Message passing is used
between nodes: see Chapter 5.
4.4 NEC SX-6 machine
Another example of vector CPUs tightly coupled to a common memory system is
the Nippon Electric Company (NEC) SX-6 series (Figure 4.7). The evolution of
this vector processing line follows from earlier Watanabe designed SX-2 through
SX-5 machines, and recent versions are extremely powerful. Our experience
has shown that NEC’s fast clocks, made possible by CMOS technology, and
multiple vector functional unit sets per CPU make these machines formidable.
The Yokohama Earth Simulator is a particularly large version (5120 nodes) from
this line of NEC distributed node vector machines. At least two features distin-
guish it from Cray designs: (1) reconfigurable vector register files, and (2) there
are up to 8 vector units/CPU [139]. Reconfigurable vector registers means that
the 144 kB register file may be partitioned into a number of fixed sized vector
registers whose total size remains 144 kB. Up to 128 nodes (each with 8 CPUs)
are available in the SX-6/128M models (Figure 4.8). Between nodes, message
passing is used, as described in Chapter 5.
Fig. 4.7. NEC SX-6 CPU: each node contains multiple vector units, a scalar
unit, and a scalar cache. Each CPU may have between 4 and 8 vector register
sets and corresponding vector functional unit sets. Vector registers are
144 kB register files of 64-bit words. These vector registers are reconfigurable
with different numbers of words/vector. See Figure 4.8.
Fig. 4.8. NEC SX-6 node: each contains multiple vector CPUs. Each node may
have between 4 and 8 CPUs which are connected to a common memory system.
Up to 16 such nodes may be connected to each other by an Internode crossbar
switch: up to 128 CPUs total, see Figure 4.7.
4.5 OpenMP standard
In this book, we use OpenMP [17] because it seems to have become a standard
shared memory programming paradigm. Appendix C gives a summary of its func-
tionality. A Posix standard, pthreads, allows tighter control of shared memory
parallelism but is more painful to program. Namely, because it allows so much
user control, more work is required for the programmer. In our examples from
the Hewlett-Packard Superdome machine at ETH Zürich, OpenMP is implemen-
ted using pthreads. This is also true on many Intel PC clusters. From p. 108, we
saw a straightforward parallelized version of the Linpack benchmark sgefa. In
this section, we explore the outer-loop parallelism of this straightforward imple-
mentation and find its efficiency limited by poor scaling: it works well on a few
processors, but scaling degrades when the number of CPUs exceeds eight. The
speedup is 7.4 on 8 CPUs, but flattens to 10 on 16. Similar scaling to 8 CPUs
was reported by Röllin and Fichtner [126], but they do not show data for any
larger number of CPUs.
4.6 Shared memory versions of the BLAS and LAPACK
Loop structures of linear algebra tasks are the first places to look for parallelism.
Time critical tasks—matrix–vector multiplication, matrix–matrix multiplication,
and LU factorization—have nested loop structures. In a shared memory envir-
onment, it is usually straightforward to parallelize these loop-based algorithms
by means of compiler directives from OpenMP. As the important basic tasks are
embedded in basic linear algebra subroutines (BLAS) it often suffices to paral-
lelize these in order to get good parallel performance. Looking at the triangular
factorization in Figure 2.3, one can see that in dgetrf the routines to parallelize
are dtrsm and dgemm (see Table 2.2). The mathematical library MLIB from
Hewlett-Packard [73] conforms to the public domain version 3.0 of LAPACK in
all user-visible usage conventions. But the internal workings of some subpro-
grams have been tuned and optimized for HP computers. In Table 4.1 we list
execution times for solving systems of equations of orders from n = 500–5000.
For problem sizes up to 1000, it evidently does not make sense to employ more
Table 4.1 Times t in seconds (s) and speedups S(p) for various problem sizes
n and processor numbers p for solving a random system of equations with
the general solver dgesv of LAPACK on the HP Superdome. The execution
times are the best of three measurements.

           n = 500         n = 1000        n = 2000        n = 5000
  p       t(s)   S(p)     t(s)   S(p)     t(s)   S(p)     t(s)   S(p)
  1       0.08   1        0.66   1        4.62   1        72.3   1
  2       0.05   1.6      0.30   2.2      2.15   2.2      32.3   2.2
  4       0.03   2.7      0.16   4.1      1.09   4.2      16.4   4.4
  8       0.02   4.0      0.09   7.3      0.61   7.6       8.3   8.7
 12       0.02   4.0      0.08   8.3      0.45  10.3       5.9  12.3
 16       0.02   4.0      0.08   8.3      0.37  12.5       4.6  15.7
 24                       0.08   8.3      0.32  14.4       3.3  21.9
 32                                       0.29  15.9       3.0  24.1
than a very few processors. The overhead for the management of the various
threads (startup, synchronization) soon outweighs the gain from splitting the
work among processors. The execution times and speedups for these small prob-
lems level off so quickly that it appears the operating system makes available
only a limited number of processors. This is not the case for n = 2000. The
speedup is very good up to 8 processors and good up to 16 processors. Beyond
this number, the speedup again deteriorates. Also, the execution times are no
longer reproducible. The operating system apparently finds it hard to allocate
threads to equally (un)loaded processors. This effect is not so evident in our
largest problem size tests.
Performance of the BLAS routines largely depend on the right choices of
block (or panel) sizes. These depend on the algorithm but even more on the
cache and memory sizes. The authors of LAPACK and previously of LINPACK
initially hoped that independent computer vendors would provide high perform-
ing BLAS. Quickly changing hardware, however, made this approach less than
satisfactory. Often, only the fraction of the BLAS used in the LINPACK bench-
mark was properly tuned. Faced with this misery, an ambitious project was
started in the late 1990s with the aim of letting the computer do this tedious
work. In the ATLAS project [149, 150], a methodology was developed for the
automatic generation of efficient basic linear algebra routines for computers with
hierarchical memories. The idea in principle is fairly simple. One just measures
the execution times of the building blocks of the BLAS with varying parameters
and chooses those settings that provide the fastest results. This approach was
successful for the matrix–matrix multiplication [149]. On many platforms, the
automated tuning of Level 3 BLAS dgemm outperformed the hand-tuned version
from computer vendors. Tuning all of the BLAS takes hours, but the benefit
remains as long as the hardware is not modified.
4.7 Basic operations with vectors
In Sections 2.1 and 3.2.3, we discussed both the saxpy operation (2.3)
y = αx +y,
and inner product sdot (2.4)
s = x · y.
In the following, these operations are reprogrammed for shared memory
machines using OpenMP. In coding such shared memory routines, the program-
mer’s first task is data scoping. Within an OpenMP parallel region, data are
either private (meaning local to the processor computing the inner portion of
the parallel region) or shared (meaning data shared between processors). Loop
variables and temporary variables are local to a processor and should be declared
private, which means their values may be changed without modifying other
private copies. Even though shared data may be modified by any processor, the
computation is not parallel unless only independent regions are modified. Otherwise
there is a data dependency (see Section 3.2). Such dependencies can be controlled
with locks and synchronization barriers, but the protected regions are then no
longer really parallel (see Appendix C). Hence, private data may be modified only
locally by one processor (or by a group of processors working on a common task),
while shared data are globally accessible; any region of shared data that is modified
should be independent of the other processors working on the same shared data.
Several examples of this scoping will be given below.
4.7.1 Basic vector operations with OpenMP
4.7.1.1 SAXPY
The parallelization of the saxpy operation is simple. An OpenMP for directive
generates a parallel region and assigns the vectors to the p processors in blocks
(or chunks) of size N/p.
#pragma omp parallel for
for (i=0; i< N; i++){
y[i] += alpha*x[i];
}
When we want the processors to get blocks of smaller size, we can do so with
the schedule option. If a block size of 100 is desired, we can change the above
code fragment by adding a schedule qualifier:
#pragma omp parallel for schedule(static,100)
for (i=0; i< N; i++){
y[i] += alpha*x[i];
}
Now chunks of size 100 are cyclically assigned to each processor. In Table 4.2,
timings on the HP Superdome are shown for N = 10^6 and chunks of size N/p,
100, 4, and 1. It is evident that large chunk sizes are to be preferred for vector
operations. Both chunk sizes N/p and 100 give good timings and speedups,
although the latter causes somewhat more overhead from OpenMP's thread
management. This overhead
becomes quite overwhelming for block size 4: here, 2.5 · 10^5 loops of only
length 4 are to be issued.
Table 4.2 Some execution times in microseconds for the
saxpy operation.
Chunk size p = 1 2 4 6 8 12 16
N/p 1674 854 449 317 239 176 59
100 1694 1089 601 405 317 239 166
4 1934 2139 1606 1294 850 742 483
1 2593 2993 3159 2553 2334 2329 2129
The block size 1 is of course ridiculous. Here, memory traffic is the bottleneck:
each of the four double words of a cacheline is handled by a different processor
(if p ≥ 4), so the speed of the computation is determined by the memory
bandwidth and there is no speedup at all.
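A timing harness for experiments such as those of Table 4.2 need not be elaborate.
The following sketch is one possible way to generate such numbers, with the schedule
clause and N as the knobs to vary; omp_get_wtime from <omp.h> is assumed available.

#include <stdio.h>
#include <omp.h>
#define N 1000000

int main(void)
{
   static float x[N], y[N];
   float alpha = 2.0;
   int i;
   double t0;
   for (i = 0; i < N; i++){ x[i] = 1.0; y[i] = 2.0; }   /* initialize operands */
   t0 = omp_get_wtime();
#pragma omp parallel for schedule(static,100)           /* vary: N/p, 100, 4, 1 */
   for (i = 0; i < N; i++){
      y[i] += alpha*x[i];
   }
   printf("saxpy time = %.1f microseconds\n", 1.0e6*(omp_get_wtime() - t0));
   return 0;
}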
4.7.1.2 Dot product
OpenMP implements a fork-join parallelism on a shared memory multicomputer.
Therefore the result of the dot product will be stored in a single variable in the
master thread. In a first approach we proceed in a similar fashion to saxpy, see
Figure 4.9. By running this code segment on multiple processors, we see that the
result is only correct using one processor. In fact, the results are not even repro-
ducible when p > 1. To understand this we have to remember that all variables
except the loop counter are shared among the processors. Here, the variable dot
is read and updated in an asynchronous way by all the processors. This phe-
nomenon is known as a race condition, in which the precise timing of instruction
execution affects the results [125]. In order to prevent untimely accesses of dot
we have to protect reading and storing it. This can be done by protecting the
statement that modifies dot from being executed by multiple threads at the
same time. This mutual exclusion synchronization is enforced by the critical
construct in OpenMP (Figure 4.10). While this solution is correct, it is intoler-
ably slow because it is really serial execution. To show this, we list the execution
times for various numbers of processors in row II of Table 4.3. There is a bar-
rier in each iteration that serializes the access to the memory cell that stores
dot. In this implementation of the dot product, the variable dot is written N
times. To prevent data dependencies among threads, we introduce a local variable
dot = 0.0;
#pragma omp parallel for
for (i=0; i< N; i++){
dot += x[i]*y[i];
}
Fig. 4.9. Global variable dot unprotected, and thus giving incorrect results
(version I).
dot = 0.0;
#pragma omp parallel for
for (i=0; i< N; i++){
#pragma omp critical
{ dot += x[i]*y[i]; }
}
Fig. 4.10. OpenMP critical region protection for global variable dot
(version II).
Table 4.3 Execution times in microseconds for our dot product, using the C
compiler guidec. Line I means no protection for the global variable dot, from
Figure 4.9; line II means the critical region for the global variable dot, from
Figure 4.10; line III means the critical region for the local variable local_dot,
from Figure 4.11; and line IV means the parallel reduction from Figure 4.12.
Version p = 1 2 4 6 8 12 16
I 1875 4321 8799 8105 6348 7339 6538
II 155,547 121,387 139,795 140,576 171,973 1,052,832 3,541,113
III 1387 728 381 264 225 176 93
IV 1392 732 381 269 220 176 88
dot = 0.0;
#pragma omp parallel shared(dot,p,N,x,y) \
private(local_dot,i,k,offset)
#pragma omp for
for(k=0;k<p;k++){
offset = k*(N/p);
local_dot = 0.0;
for (i=offset; i< offset+(N/p); i++){
local_dot += x[i]*y[i];
}
#pragma omp critical
{ dot += local_dot; }
}
Fig. 4.11. OpenMP critical region protection only for local accumulations
local_dot (version III).
local_dot for each thread, which keeps a partial result (Figure 4.11). To get the
full result, we form

    dot = Σ_{k=0}^{p−1} local_dot_k ,

where local_dot_k is the portion of the inner product that is computed by pro-
cessor k. Each local_dot_k can be computed independently of every other. Each
thread has its own instance of the private variable local_dot. Only at the end
are the p individual local results added. These are just p accesses to the global
variable dot. Row III in Table 4.3 shows that this local accumulation reduces
the execution time significantly.
In Chapter 3, we distinguished between vector operations whose results are
the same size as an input vector and reductions, which typically yield a single
element; inner products and maximum element searches (Section 3.3) are
examples of reductions.
dot = 0.0;
#pragma omp parallel for reduction(+ : dot)
for (i=0; i< N; i++){
dot += x[i]*y[i];
}
Fig. 4.12. OpenMP reduction syntax for dot (version IV).
OpenMP has built-in mechanisms for certain reduction
operations. Figure 4.12 is an inner product example. Here, dot is the
reduction variable and the plus sign “+” is the reduction operation. There are a
number of other reduction operations besides addition. The OpenMP standard
does not specify how a reduction has to be implemented. Actual implementations
can be adapted to the underlying hardware.
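Not every reduction fits the built-in operators of the C reduction clause (a maximum
element search, for instance, is not covered in older OpenMP C bindings). In such a
case the local-accumulation pattern of Figure 4.11 can be reused. The following
fragment is only a sketch, with x, N, and i assumed declared as in the earlier
examples and fabsf taken from <math.h>.

float gmax = 0.0;
#pragma omp parallel shared(gmax,x,N) private(i)
{
   float lmax = 0.0;                     /* private: declared inside the region */
#pragma omp for
   for (i=0; i< N; i++){
      if (fabsf(x[i]) > lmax) lmax = fabsf(x[i]);
   }
#pragma omp critical
   { if (lmax > gmax) gmax = lmax; }     /* p updates of the shared maximum */
}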
4.8 OpenMP matrix vector multiplication
As we discussed in Chapter 3 regarding matrix multiplication (particularly
Section 3.4.1), there are at least two distinct ways to effect this operation. Likewise,
its sub-operation, matrix–vector multiplication, can be arranged as the textbook
method, which uses an inner product; or alternatively as an outer product which
preserves vector length. If A is an m×n matrix and x a vector of length n, then
A times the vector x is a vector of length m, and with indices written out
    y = Ax,   or   y_k = Σ_{i=0}^{n−1} a_{k,i} x_i,   0 ≤ k < m.
A C code fragment for the matrix–vector multiplication can be written as a
reduction
/* dot product variant of matrix-vector product */
for (k=0; k<m; k++){
y[k] = 0.0;
for (i=0; i<n; i++)
y[k] += A[k+i*m]*x[i];
}
Here again, we assume that the matrix is stored in column (Fortran, or column
major) order. In this code fragment, each y_k is computed as the dot product of
the kth row of A with the vector x. Alternatively, an outer product loop ordering
is based on the saxpy operation (Section 3.4.1)
/* saxpy variant of matrix-vector product */
for (k=0; k<m; k++) y[k] = 0.0;
for (i=0; i<n; i++)
for (k=0; k<m; k++)
y[k] += a[k+i*m]*x[i];
Here, the ith column of A takes the role of x in (2.3), while x_i takes the role of
the scalar α, and y is the y in saxpy (2.3), as usual.
4.8.1 The matrix–vector multiplication with OpenMP
In OpenMP the goal is usually to parallelize loops. There are four options to
parallelize the matrix–vector multiplication. We can parallelize either the inner
or the outer loop of one of the two variants given above. The outer loop in
the dot product variant can be parallelized without problem by just prepending
an omp parallel for compiler directive to the outer for loop. It is important
not to forget to declare the loop counter i of the inner loop private! Likewise,
parallelization of the inner loop in the saxpy variant of the matrix–vector product
is straightforward.
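For concreteness, the first of these two easy cases might look as follows; the
saxpy-variant inner loop is handled analogously.

/* dot product variant, outer loop parallelized; note private(i) */
#pragma omp parallel for private(i)
for (k=0; k<m; k++){
   y[k] = 0.0;
   for (i=0; i<n; i++)
      y[k] += A[k+i*m]*x[i];
}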
The other two parallelization possibilities are more difficult because they
involve reductions. The parallelization of the inner loop of the dot product
variant can be handled like the simple dot product reduction of Figure 4.12:
for (k=0; k<m; k++){
tmp = 0.0;
#pragma omp parallel for reduction(+ : tmp)
for (i=0; i<n; i++){
tmp += A[k+i*m]*x[i];
}
y[k] = tmp;
}
Parallelization of the outer loop of the saxpy variant amounts to a reduction
into the whole array y. As OpenMP does not permit reduction variables to be of
pointer (array) type, we have to resort to the local-accumulation approach of the
code fragment in Figure 4.11 and insert a critical section into the parallel region:
#pragma omp parallel private(j,z)
{
for (j=0; j<M; j++) z[j] = 0.0;
#pragma omp for
for (i=0; i<N; i++){
for (j=0; j<M; j++){
z[j] += A[j+i*M]*x[i];
}
}
#pragma omp critical
for (j=0; j<M; j++) y[j] += z[j];
}
We compared the four approaches. Execution times obtained on the HP
Superdome are listed in Table 4.4. First of all, the execution times show that
parallelizing the inner loop is obviously a bad idea: OpenMP implicitly sets a
synchronization point at the end of each parallel region.
Table 4.4 Some execution times in microseconds for the matrix–vector
multiplication with OpenMP on the HP superdome.
Variant   Loop parallelized   p = 1      2       4       8       16      32
n = 100
dot Outer 19.5 9.76 9.76 39.1 78.1 146
dot Inner 273 420 2256 6064 13,711 33,056
saxpy Outer 9.76 9.76 19.5 68.4 146 342
saxpy Inner 244 322 420 3574 7128 14,912
n = 500
dot Outer 732 293 146 97.7 146 196
dot Inner 1660 2197 12,109 31,689 68,261 165,185
saxpy Outer 732 146 48.8 97.7 195 488
saxpy Inner 2050 1904 2539 17,725 35,498 72,656
n = 1000
dot Outer 2734 1367 684 293 196 293
dot Inner 4199 4785 23,046 61,914 138,379 319,531
saxpy Outer 2930 1464 781 195 391 977
saxpy Inner 6055 4883 5078 36,328 71,777 146,484
Therefore, the parallel threads are synchronized for each iteration step of the
outer loop. Evidently, this produces a large overhead.
Table 4.4 shows that parallelizing the outer loop gives good performance
and satisfactory speedups as long as the number of processors is not large. There
is no clear winner between the dot product and saxpy variants. A slightly faster
alternative to the dot product variant is the following code segment where the
dot product of a single thread is collapsed into one call to the Level 2 BLAS
subroutine for matrix–vector multiplication dgemv.
n0 = (m - 1)/p + 1;
#pragma omp parallel for private(blksize)
for (i=0; i<p; i++){
blksize = min(n0, m-i*n0);
if (blksize > 0)
dgemv_("N", &blksize, &n, &DONE, &A[i*n0],
&m, x, &ONE, &DZERO, &y[i*n0], &ONE);
}
The variable blksize holds the number of A’s rows that are computed by the
respective threads. The variables ONE, DONE, and DZERO are used to pass by
address the integer value 1, and the double values 1.0 and 0.0 to the Fortran
subroutine dgemv (see Appendix E).
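For completeness, the declarations behind that fragment might look like the sketch
below. The exact Fortran interface is system dependent (some compilers, for
instance, append hidden string-length arguments for the character parameter), so
treat the prototype as an assumption to be checked against Appendix E and your
local headers.

/* constants passed by address to the Fortran subroutine dgemv */
int    ONE   = 1;
double DONE  = 1.0, DZERO = 0.0;
/* assumed prototype of the Fortran BLAS routine */
void dgemv_(char *trans, int *m, int *n, double *alpha, double *a, int *lda,
            double *x, int *incx, double *beta, double *y, int *incy);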
4.8.2 Shared memory version of SGEFA
The basic coding below is a variant from Section 3.4.2. Again, we have used
the Fortran or column major storage convention wherein columns are stored in
sequential memory locations, not by the C row-wise convention. Array y is the
active portion of the reduction in Figure 3.15. The layout for the multiple saxpy
operation (msaxpy) is
              a_0         a_1         a_2        ...   a_{m−1}
    x_0       y_{0,0}     y_{0,1}     y_{0,2}    ...   y_{0,m−1}
    x_1       y_{1,0}     y_{1,1}     y_{1,2}    ...   y_{1,m−1}
     .          .           .           .                 .
     .          .           .           .                 .
     .          .           .           .                 .
    x_{m−1}   y_{m−1,0}   y_{m−1,1}   y_{m−1,2}  ...   y_{m−1,m−1}
And the calculation is, for j = 0, . . . , m−1,

    y_j ← a_j x + y_j,                                    (4.1)

where y_j is the jth column in the region of active reduction in Figure 3.15.
Column x is the first column of the augmented system, and a is the first row
of that system. Neither a nor x is modified by msaxpy. Column y_j in msaxpy
(4.1) is modified only by the processor which reduces this jth column and no
other. In BLAS language, the msaxpy computation is a rank-1 update [135,
section 4.9],

    Y ← x a^T + Y.                                        (4.2)
Revisit Figure 2.1 to see the Level 2 BLAS variant.
#define ym(p,q) (y+p+q*n)
void msaxpy(nr,nc,a,n,x,y)
int nr,nc,n;
float *a,*x,*y;
{
/* multiple SAXPY operation, wpp 29/01/2003 */
int i,j;
#pragma omp parallel shared(nr,nc,a,x,y) private(i,j)
#pragma omp for schedule(static,10) nowait
for(j=0;j<nc;j++){
for(i=0;i<nr;i++){
*ym(i,j) += a[j*n]*x[i];
}
}
}
#undef ym
[Plots: wall clock time (s) vs. NCPUs, and speedup vs. NCPUs.]
Fig. 4.13. Times and speedups for parallel version of classical Gaussian elim-
ination, SGEFA, on p. 108, [37]. The machine is Stardust, an HP 9000
Superdome cluster with 64 CPUs. Data is from a problem matrix of size
1000 × 1000. Chunk size specification of the for loop is (static,32), and
shows how scaling is satisfactory only to about NCPUs ≤ 16, after which
system parameters discourage more NCPUs. The compiler was cc.
In preparing the data for Figure 4.13, several variants of the OpenMP
parameters were tested. These variants were all modifications of the
#pragma omp for schedule(static,10) nowait
line. The choice shown, schedule(static,10) nowait, gave the best results of
several experiments. The so-called chunksize numerical argument to schedule
refers to the depth of unrolling of the outer j-loop, in this case the unrolling is to
10 j-values/loop; for example, see Section 3.2. The following compiler switches
and environmental variables were used. Our best results used HP’s compiler cc
cc +O4 +Oparallel +Oopenmp filename.c -lcl -lm -lomp
-lcps -lpthread
with guidec giving much less satisfactory results. An environmental variable
specifying the maximum number of processors must be set:
setenv OMP_NUM_THREADS 8
In this expression, OMP_NUM_THREADS is set to eight; any number larger than zero
but no more than the total number of processors could be specified.
4.8.3 Shared memory version of FFT
Another recurring example in this book, the binary radix FFT, shows a less
desirable property of this architecture or of the OpenMP implementation. Here
is an OpenMP version of the routine step used in Section 3.6, specifically by the
driver given on p. 128. Several variants of the omp for loop were tried, and the
schedule(static,16) nowait worked better than the others.
void step(n,mj,a,b,c,d,w,sign)
int n,mj;
float a[][2],b[][2],c[][2],d[][2],w[][2];
float sign;
{
float ambr, ambu, wjw[2];
int j, k, ja, jb, jc, jd, jw, lj, mj2;
/* one step of workspace version of CFFT2 */
mj2 = 2*mj; lj = n/mj2;
#pragma omp parallel shared(w,a,b,c,d,lj,mj,mj2,sign) \
private(j,k,jw,ja,jb,jc,jd,ambr,ambu,wjw)
#pragma omp for schedule(static,16) nowait
for(j=0;j<lj;j++){
jw=j*mj; ja=jw; jb=ja; jc=j*mj2; jd=jc;
wjw[0] = w[jw][0]; wjw[1] = w[jw][1];
if(sign<0) wjw[1]=-wjw[1];
for(k=0; k<mj; k++){
c[jc + k][0] = a[ja + k][0] + b[jb + k][0];
c[jc + k][1] = a[ja + k][1] + b[jb + k][1];
ambr = a[ja + k][0] - b[jb + k][0];
ambu = a[ja + k][1] - b[jb + k][1];
d[jd + k][0] = wjw[0]*ambr - wjw[1]*ambu;
d[jd + k][1] = wjw[1]*ambr + wjw[0]*ambu;
}
}
}
An examination of Figure 3.20 shows that the computations of all the compu-
tation boxes (Figure 3.21) using each twiddle factor are independent—at an
inner loop level. Furthermore, each set of computations using twiddle factor
ω_k = exp(2πik/n) is also independent of every other k—at an outer loop level
counted by k = 0, . . . . Hence, the computation of step may be parallelized
at an outer loop level (over the twiddle factors) and vectorized at the inner
loop level (over the number of independent boxes using ω_k). The results are not so
satisfying, as we see in Figure 4.14. There is considerable improvement in the
processing rate for large n, which falls dramatically on one CPU when L1 cache
is exhausted. However, there is no significant improvement near the peak rate as
NCPU increases and, worse, the small n rates are significantly worse.
[Plot: Mflops vs. N for schedule (static,16), with curves for np = 1, 4, 8, and 32.]
Fig. 4.14. Simple-minded approach to parallelizing one n = 2^m FFT using
OpenMP on Stardust. The algorithm is the generic one from Figure 3.20,
parallelized using OpenMP for the outer loop on the twiddle factors ω_k. The
compiler was cc. np = number of CPUs.
These odd results seem to be a consequence of cache coherency enforcement of the
memory image over the whole machine. An examination of Figures 4.1 and 4.2 shows that when
data are local to a cell, memory access is fast: each memory module is local to
the cell. Conversely, when data are not local to a cell, the ccnuma architecture
seems to exact a penalty to maintain the cache coherency of blocks on other cells.
Writes on these machines are write-through (see Section 1.2.1.1), which means
that the cache blocks corresponding to these new data must be marked invalid
on every processor but the one writing these updates. It seems, therefore, that
the overhead for parallelizing at an outer loop level when the inner loop does so
little computation is simply too high. If independent tasks are assigned to
independent CPUs, they should be large enough to amortize this latency.
4.9 Overview of OpenMP commands
The OpenMP standard is, thankfully, a relatively limited set. It is designed
for shared memory parallelism, but lacks any vectorization features. On many
machines (e.g. Intel Pentium, Crays, SGIs), the pragma directives in C, or func-
tionally similar directives (cdir$, *vdir, etc.) in Fortran, can be used to control
vectorization: see Section 3.2.2. OpenMP is a relatively sparse set of commands
(see Appendix C) and unfortunately has no SIMD vector control structures.
Alas, this means that no non-vendor-specific vectorization standard exists at all.
For the purposes of our examples, the #pragma ivdep directives are satisfactory, but more
general structures are not standard. A summary of these commands is given in
Appendix C.
4.10 Using Libraries
Classical Gaussian elimination is inferior to a blocked procedure as we indicated
in Sections 2.2.2.1 and 2.2.2.2 because matrix multiplication is an accumulation
of vectors (see Section 3.4.1). Hence, dgemm, Section 2.2.2.2, should be superior
to variants of saxpy. This is indeed the case as we now show. The test results
are from HP’s MLIB, which contains LAPACK. Our numbers are from a Fortran
version
f90 +O3 +Oparallel filename.f -llapack
and the LAPACK library on our system is to be found in
-llapack → /opt/mlib/lib/pa2.0/liblapack.a.
This specific path to LAPACK may well be different on another machine. To spe-
cify the maximum number of CPUs you might want for your job, the following
permits you to control this
setenv MLIB_NUMBER_OF_THREADS ncpus
where ncpus (say eight) is the maximum number. On a loaded machine, you
can expect less than optimal performance from your job when ncpus becomes
comparable to the total number of processors on the system. The ETH Superdome
has 64 processors and when ncpus is larger than about 1/4 of this number, scal-
ing suffers. Both Figures 4.13 and 4.15 show that when ncpus > 8, performance
no longer scales linearly with ncpus.
We feel there are some important lessons to be learned from these examples.
• Whenever possible, use high quality library routines: reinventing the wheel
rarely produces a better one. In the case of LAPACK, an enormous amount
of work has produced a high performance and well-documented collection of
the best algorithms.
• A good algorithm is better than a mediocre one. sgetrf uses a blocked
algorithm for Gaussian elimination, while the classical method sgefa involves
too much data traffic.
• Even though OpenMP can be used effectively for parallelization, its indirect-
ness makes tuning necessary. Table 4.2 shows that codes can be sensitive to
chunk sizes and types of scheduling.
Exercise 4.1 Parallelize the Linpack benchmark. The task here is to
parallelize the Linpack benchmark described in Section 3.4.2, p. 108, and
Section 4.8.2. Start with Bonnie Toy's C version of the benchmark, which you
can download from NETLIB [111]:
http://www.netlib.org/benchmark/linpackc
[Plots: wall clock time (s) vs. NCPUs, and speedup vs. NCPUs.]
Fig. 4.15. Times and speedups for the Hewlett-Packard MLIB version LAPACK
routine sgetrf. The test was in Fortran, compiled with f90. Scaling is
adequate for NCPUs < 8, but is less than satisfactory for larger NCPUs.
However, the performance of the MLIB version of sgetrf is clearly superior
to classical Gaussian elimination sgefa shown in Figure 4.13. The problem
size is 1000 ×1000.
The complete source code we will call clinpack.c. In Section 4.8.2, we saw a
version of sgefa.c which used the OpenMP “omp for” construct. In clinpack.c,
the relevant section is in the double version dgefa (an int procedure there).
Beginning with the comment row elimination with column indexing, you
can see two code fragments in the for (j = kp1; j < n; j++) loop. The first
swaps row elements a(k+1+i,l) for a(k+1+i,k) for i=0..n-k-1. In fact, as we
noted on p. 108, this can be pulled out of the loop. Look at the if(l!=k) section
of p. 108 before the corresponding for(j=kp1;j<n;j++) loop. Your task, after
moving out this swap, is to parallelize the remaining for loop. One way to do
this is by using the OMP parallel for directive to parallelize the corresponding
daxpy call in Toy’s version.
What is to be done?
1. From the NETLIB site (above), download the clinpack.c version of the
Linpack benchmark.
2. With some effort, port this code to the shared memory machine of your
choice. The only real porting problem here is likely to be the timer—
second. We recommend replacing second with the walltime routine
(returns a double) given below. On Cray hardware, use timef instead
of walltime for multi-CPU tests.
3. Once you get the code ported and the walltime clock installed, then
proceed to run the code to get initial 1-CPU timing(s).
4. Now modify the dgefa routine in the way described above to use the
OpenMP “omp parallel for” directive to call daxpy independently for
each processor available. You need to be careful about the data scoping
rules.
5. Run your modified version, with various chunksizes if need be and plot
the multiple-CPU timing results compared to the initial 1-CPU value(s).
Be sure that the numerical solutions are correct. Vary the number of
CPUs by using the setenv OMP_NUM_THREADS settings from Section 4.8.2.
Also compare your initial 1-CPU result(s) to setenv OMP_NUM_THREADS 1
values. Also, try different matrix sizes other than the original 200×200.
For the wallclock timing, use the following (t0 = 0.0 initially),
tw = walltime(&t0); /* start timer */
sgefa(a,n,n,ipvt,&info); /* factor A -> LU */
t0 = walltime(&tw); /* time for sgefa */
where variables tw,t0 start the timer and return the difference between the start
and end. walltime is
#include <sys/time.h>
double walltime(double *t0)
{
double mic, time, mega=0.000001;
struct timeval tp;
struct timezone tzp;
static long base_sec = 0, base_usec = 0;
(void) gettimeofday(&tp, &tzp);
if (base_sec == 0) {
base_sec = tp.tv_sec; base_usec = tp.tv_usec;
}
time = (double)(tp.tv_sec - base_sec);
mic = (double)(tp.tv_usec - base_usec);
time = (time + mic * mega) - *t0;
return(time);
}
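As a hint for step 4, the modified elimination loop inside dgefa might end up
looking roughly like the sketch below. The names (a, lda, n, k, kp1, t, daxpy)
follow clinpack.c, and the fragment assumes the row swap has already been moved
out of the loop as described above; it is an illustration only, not the benchmark's
actual code.

/* row swap for column k already done outside this loop */
#pragma omp parallel for private(j,t)
for (j = kp1; j < n; j++) {
   t = a[lda*j+k];                                   /* multiplier for column j */
   daxpy(n-(k+1), t, &a[lda*k+k+1], 1, &a[lda*j+k+1], 1);
}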
Benchmark report: from the NETLIB server [111], download the latest ver-
sion of performance.ps just to see how you are doing. Do not expect your
results to be as fast as those listed in this report. Rather, the exercise is
just to parallelize dgefa. Any ideas how to modify Toy’s loop unrolling of
daxpy?
5
MIMD, MULTIPLE INSTRUCTION, MULTIPLE DATA
Never trust a computer you can’t throw out a window.
S. Wozniak
The Multiple instruction, multiple data (MIMD) programming model
usually refers to computing on distributed memory machines with multiple
independent processors. Although processors may run independent instruction
streams, we are interested in streams that are always portions of a single pro-
gram. Between processors which share a coherent memory view (within a node),
data access is immediate, whereas between nodes data access is effected by mess-
age passing. In this book, we use MPI for such message passing. MPI has
emerged as a more or less standard message passing system used on both shared
memory and distributed memory machines.
It is often the case that although the system consists of multiple independent
instruction streams, the programming model is not too different from SIMD.
Namely, the totality of a program is logically split into many independent tasks
each processed by a group (see Appendix D) of processes—but the overall pro-
gram is effectively single threaded at the beginning, and likewise at the end. The
MIMD model, however, is extremely flexible in that no one process is always
master and the other processes slaves. A communicator group of processes
performs certain tasks, usually with an arbitrary master/slave relationship. One
process may be assigned to be master (or root) and coordinates the tasks of
others in the group. We emphasize that the assignment of which process is root is
arbitrary—any processor may be chosen. Frequently, however, this choice is one
of convenience—a file server node, for example.
Processors and memory are connected by a network, for example, Figure 5.1.
In this form, each processor has its own local memory. This is not always the
case: The Cray X1 (Figures 4.5 and 4.6), and NEC SX-6 through SX-8 series
machines (Figures 4.7 and 4.8), have common memory within nodes. Within
a node, memory coherency is maintained within local caches. Between nodes,
it remains the programmer’s responsibility to assure a proper read–update rela-
tionship in the shared data. Data updated by one set of processes should not be
clobbered by another set until the data are properly used.
[Diagram: processors (P), each with a local memory (M), connected by an interconnection network.]
Fig. 5.1. Generic MIMD distributed-memory computer (multiprocessor).
[Diagram: nodes with 4.2 GB interfaces attached to a 1 GB bus for 8 cabinets.]
Fig. 5.2. Network connection for ETH Beowulf cluster.
Although a much more comprehensive review of the plethora of networks in
use is given in section 2.4.2. of Hwang [75], we note here several types of networks
used on distributed memory machines:
1. Bus: a congestion-prone but cheap network that scales well with respect to
cost (like on our Beowulf cluster, Asgard, Figure 5.2).
2. Dynamic networks: for example, a crossbar which is hard to scale and
has very many switches, is expensive, and is typically used for only a limited
number of processors.
3. Static networks: for example, Ω-networks, arrays, rings, meshes, tori,
hypercubes. In Chapter 1, we showed an Ω-network in Figures 1.11 and
1.12. That example used 2 × 2 switches, but we also noted in Chapter 4
regarding the HP9000 machine that higher order switches (e.g. 4 ×4) are
also used.
4. Combinations: for example, a bus connecting clusters which are connec-
ted by a static network: see Section 5.2.
An arbitrarily selected node, node_i, is chosen to coordinate the others. In MPI,
the numbering of each node is called its rank. In many situations, a selected
master node is numbered rank = 0. The physical assignment of nodes is
in an environmental list which maps the rank number to the physical nodes
the system understands. For example, processor node_i = rank = 0 might be
mapped rank = 0 → n038. In the Portable Batch System (PBS), this list is
PBS_NODEFILE, and is defined for the shell running your batch job.
5.1 MPI commands and examples
Before we do an example, we need to show the basic forms of MPI commands.
In Appendix D, a more complete tabulation of MPI procedures is given, but
here are three of the most basic: a message send, MPI_Send; a message receive,
MPI_Recv; and a broadcast, MPI_Bcast. These functions return an integer value
which is either a zero (0) indicating there was no error, or a nonzero from
which one can determine the nature of the error. The default behavior of MPI
is to abort the job if an error is detected, but this may be controlled using
MPI_Errhandler_set [109].
For MPI_Recv (and others in Appendix D), a status structure, Figure 5.3,
is used to check the count of received data (a private variable accessible only by
using MPI_Get_count), the source (rank) of the data being received, the mess-
age tag, and any error message information. On some systems, private_count
is an integer or array of integers, but this is seldom used. For your local
implementation, look at the include file mpidefs.h. On our system, this is in
directory /usr/local/apli/mpich-1.2.3/include.
These fields are as follows for MPI_Send and MPI_Recv, respectively [115].
1. send_data is a pointer to the data to be sent; recv_data is the pointer to
where the received data are to be stored.
2. count is the number of items to send, or expected, respectively.
3. datatype is the type of each item in the message; for example, type
MPI_FLOAT, which means a single precision floating point datum.
4. dest_rank is the rank of the node to which the data are to be sent;
source_rank is the rank of the node which is expected to be sending
data. Frequently, MPI_Recv will accept data from any source and uses
MPI_ANY_SOURCE to do so. This variable is defined in the mpi.h include
file.
5. tag_ident is a tag or label on the message. Often the MPI_Recv procedure
would use MPI_ANY_TAG to accept a message with any tag. This is defined
in mpi.h.
6. comm defines a set allowed to send/receive data. When communicating
between all parts of your code, this is usually MPI_COMM_WORLD, meaning
any source or destination. When using libraries, parts of code which were
written by someone else, or even parts of your own code which do distinctly
different calculations, this mechanism can be useful. Distinct portions of
code may send the same tag, and must be classified into communicators
to control their passing messages to each other.
7. status is a structure (see Figure 5.3) used in MPI_Recv that contains
information from the source about how many elements were sent, the
source's rank, the tag of the message, and any error information. The
count is accessible only through MPI_Get_count. On some systems, more
information is returned in private_count. See the MPI_Status struct
above.
Commands MPI_Send and MPI_Recv are said to be blocking. This means the
functions do not return to the routine which called them until the messages are
sent or received, respectively [115]. There are also non-blocking equivalents of
these send/receive commands. For example, the pair MPI_Isend and MPI_Irecv
initiate sending and receiving a message, respectively, but MPI_Isend returns to
the calling procedure as soon as the transmission is initialized—irrespective of
whether the transmission has actually occurred.
int MPI_Send( /* returns 0 if success */
void* send_data, /* message to send */
int count, /* input */
MPI_Datatype datatype, /* input */
int dest_rank, /* input */
int tag_ident, /* input */
MPI_Comm comm) /* input */
typedef struct {
int count;
int MPI_SOURCE;
int MPI_TAG;
int MPI_ERROR;
/* int private_count; */
} MPI_Status;
int MPI_Recv( /* returns 0 if success */
void* recv_data, /* message to receive */
int count, /* input */
MPI_Datatype datatype, /* input */
int source_rank, /* input */
int tag_ident, /* input */
MPI_Comm comm, /* input */
MPI_Status* status) /* struct */
int MPI_Bcast( /* returns 0 if success */
void* message, /* output/input */
int count, /* input */
MPI_Datatype datatype, /* input */
int root, /* input */
MPI_Comm comm) /* input */
Fig. 5.3. MPI status struct for send and receive functions.
The danger is that the associated
send/receive buffers might be modified before either operation is explicitly com-
pleted. We do not illustrate these non-blocking commands here because on our
Beowulf system they are effectively blocking anyway, since all data traffic shares
the same memory channel. In addition, they can be dangerous: the buffers may
be modified and thus updated information lost. Both forms (non-blocking and
blocking) of these commands are point-to-point. This means data are transmitted
only from one node to another.
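To make the parameter lists above concrete, here is a minimal two-process exchange;
it is only a sketch (error checking omitted), not part of the examples that follow.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
   int rank, n = 4;
   float x[4] = {0.0, 1.0, 2.0, 3.0};
   MPI_Status status;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (rank == 0)            /* rank 0 sends 4 floats to rank 1, tag 99 */
      MPI_Send(x, n, MPI_FLOAT, 1, 99, MPI_COMM_WORLD);
   else if (rank == 1) {     /* rank 1 receives them from rank 0 */
      MPI_Recv(x, n, MPI_FLOAT, 0, 99, MPI_COMM_WORLD, &status);
      printf("rank 1 received %g %g %g %g\n", x[0], x[1], x[2], x[3]);
   }
   MPI_Finalize();
   return 0;
}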
Another commonly used command involves a point-to-many broadcast. For
example, in our y = Ax matrix–vector multiply example discussed later, it is
convenient to distribute vector x to every processor available. Vector x is unmodi-
fied by these processors, but is needed in its entirety to compute the segment of
the result y computed on each node. This broadcast command, however, has the
peculiarity that it is both a send and a receive: from the initiator (say root = 0),
it is a broadcast, while for a slave with rank ≠ root it is a receive function. The
parameters in MPI_Bcast, as typed in the template above, are as follows (a
one-line usage sketch is given after the list).
1. message is a pointer to the data to be sent to all elements in the communic-
ator comm (as input) if the node processing the MPI_Bcast command has
rank = root. If rank ≠ root, then the command is processed as a receive
and message (output) is a pointer to where the received data broadcast
from root are to be stored.
2. count is the integer number of elements of type datatype to be broadcast
(input), or the number received (output).
3. datatype specifies the datatype of the received message. These may be
one of the usual MPI datatypes, for example, MPI_FLOAT (see Table D.1).
4. root is an integer parameter specifying the rank of the initiator of the
broadcast. If the rank of the node is not root, MPI Bcast is a command
to receive data sent by root.
5. comm is of MPI_Comm type and specifies the communicator group to
which the broadcast is to be distributed. Other communicators are not
affected.
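In the y = Ax setting mentioned above, distributing x is then a single call on every
process. This one-line sketch assumes x and N are declared identically on all ranks
and that rank 0 holds the initial data.

/* rank 0 (root) sends its copy of x; every other rank receives into x */
MPI_Bcast(x, N, MPI_FLOAT, 0, MPI_COMM_WORLD);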
A more complete explanation of MPI functions and procedures is given in Peter
Pacheco’s book [115], and the definitive reference is from Argonne [110]. In For-
tran, an important reference is Gropp et al. [64]. Otherwise, download the MPI-2
standard reference guide from NETLIB [111]. In order to supplement our less
complete treatment, Appendix D contains a useful subset of MPI functionality.
Before we proceed to some examples, we would like to show the way to
compile and execute MPI programs. Included are compile and PBS scripts. You
may have to modify some details for your local environment, but the general
features are hopefully correct.
To submit a batch job, you will need PBS commands. Many operations are
better done in command line mode, however: for example, compilations and small
jobs. Since compilers are frequently unavailable on every node, batch jobs for
compilation may fail. At your discretion are the number of nodes and wallclock
time limit.
1. PBS = “Portable Batch System” commands:
• qsub submits your job script:
qsub qjob
• qstat gets the status of the previously submitted qjob (designated,
say, 123456.gate01 by PBS):
qstat 123456.gate01
• qdel deletes a job from PBS queue:
qdel 123456
2. An available nodefile, PBS_NODEFILE, is assigned to the job qjob you sub-
mit. This nodefile mapping can be useful for other purposes. For example,
to get the number of nodes available to your job qjob:
nnodes=`wc $PBS_NODEFILE | awk '{print $1}'`
3. mpicc = C compiler for MPICH/LAM version of MPI:
mpicc -O3 yourcode.c -lm -lblas -lf2c
the compile includes a load of the BLAS library and libf2c.a for
Fortran–C communication.
4. mpirun = run command MPICH/LAM with options for NCPUs:
mpirun -machinefile $PBS_NODEFILE -np $nnodes a.out
Next, you will likely need a batch script to submit your job. On our sys-
tems, two choices are available: MPICH [109] and LAM [90]. The following
scripts will be subject to some modifications for your local system: in particular,
the pathname $MPICHBIN, which is the directory where the compiler mpicc and
supervisory execution program mpirun are located. On our systems, the compiler
runs only on the service nodes (three of these)—so you need a compile script
to generate the desired four executables (run128, run256, run512, run1024).
Figure 5.4 shows the compile script which prepares run128,..., run1024 and
runs interactively (not through PBS).
And Figure 5.5 is the PBS batch script (submitted by qsub) to run the
run128,..., run1024 executables.
The compile script Figure 5.4 is easily modified for LAM by changing the
MPICH path to the “commented out” LAM path (remove the # character). A
PBS script for LAM is similar to modify and is shown in Figure 5.6.
5.2 Matrix and vector operations with PBLAS and BLACS
Operations as simple as y = Ax for distributed memory computers have
obviously been written into libraries, and not surprisingly a lot more.
#!/bin/bash
# LAM path
# inc=/usr/local/apli/lam-6.5.6/include
# MPICC=/usr/local/apli/lam-6.5.6/mpicc
# MPICH path
inc=/usr/local/apli/mpich-1.2.3/include
MPICC=/usr/local/apli/mpich-1.2.3/bin/mpicc
echo "$inc files" >comp.out
for sz in 128 256 512 1024
do
echo "${sz} executable"
which $MPICC
sed "s/MAX_X=64/MAX_X=$sz/" <time3D.c >timeit.c
$MPICC -I${inc} -O3 -o run${sz} timeit.c -lm
done
Fig. 5.4. MPICH compile script.
#!/bin/bash
# MPICH batch script
#
#PBS -l nodes=64,walltime=200
set echo
date
MRUN=/usr/local/apli/mpich-1.2.3/bin/mpirun
nnodes=`wc $PBS_NODEFILE | awk '{print $1}'`
echo "=== QJOB submitted via MPICH ==="
for sz in 128 256 512 1024
do
echo "=======N=$sz timings======"
$MRUN -machinefile $PBS_NODEFILE -np $nnodes run${sz}
echo "=========================="
done
Fig. 5.5. MPICH (PBS) batch run script.
#!/bin/bash
# LAM batch script
#
LAMBIN=/usr/local/apli/lam-6.5.6/bin
machinefile=`basename $PBS_NODEFILE.tmp`
uniq $PBS_NODEFILE > $machinefile
nnodes=`wc $machinefile | awk '{print $1}'`
nnodes=$(( nnodes - 1 ))
$LAMBIN/lamboot -v $machinefile
echo "=== QJOB submitted via LAM ==="
# yourcode is your sourcefile
$LAMBIN/mpicc -O3 yourcode.c -lm
$LAMBIN/mpirun n0-$nnodes -v a.out
$LAMBIN/lamwipe -v $machinefile
rm -f $machinefile
Fig. 5.6. LAM (PBS) run script.
[Diagram: the ScaLAPACK software hierarchy—ScaLAPACK on top, built on the PBLAS and
BLACS (global) and on LAPACK and the BLAS (local), with the BLACS resting on message
passing primitives (MPI, PVM, ...).]
Fig. 5.7. The ScaLAPACK software hierarchy.
ScaLAPACK [12] is the standard library for the basic algorithms from lin-
ear algebra on MIMD distributed memory parallel machines. It builds on the
algorithms implemented in LAPACK that are available on virtually all shared
memory computers. The algorithms in LAPACK are designed to provide accur-
ate results and high performance. High performance is obtained through the
three levels of basic linear algebra subroutines (BLAS), which are discussed in
Section 2.2. Level 3 BLAS routines make it possible to exploit the hierarch-
ical memories partly described in Chapter 1. LAPACK and the BLAS are used
in ScaLAPACK for the local computations, that is, computations that are
bound to one processor, see Figure 5.7. Global aspects are dealt with by using
the Parallel Basic Linear Algebra Subroutines (PBLAS), see Table 5.2, that
are an extension of the BLAS for distributed memories [19]. In Section 5.4 we
show how the matrices accessed by the PBLAS are distributed among the processors
in a two-dimensional block cyclic data layout. Recall that vectors and scalars are
considered special cases of matrices. The actual communication is implemen-
ted in the Basic Linear Algebra Communication Subprograms (BLACS) [149].
The BLACS are high level communication functions that provide primitives for
transferring matrices that are distributed on the process grid according to the
two-dimensional block cyclic data layout of Section 5.4, see Table 5.1. Besides
general rectangular matrices, the BLACS also provide communication routines
for trapezoidal and triangular matrices. In particular, they provide the so-called
scoped operations, that is, collective communication routines that apply only
to process columns or rows.
Table 5.1 Summary of the BLACS. The context of many of these
routines can be a column, a row, or the whole process grid.
Support routines
Support routines
BLACS_PINFO      Number of processes available for BLACS use
BLACS_SETUP      Number of processes to create
SETPVMTIDS       Number of PVM tasks the user has spawned
BLACS_GET        Get some BLACS internals
BLACS_SET        Set some BLACS internals
BLACS_GRIDINIT   Initialize the BLACS process grid
BLACS_GRIDMAP    Initialize the BLACS process grid
BLACS_FREEBUF    Release the BLACS buffer
BLACS_GRIDEXIT   Release process grid
BLACS_ABORT      Abort all BLACS processes
BLACS_EXIT       Exit BLACS
BLACS_GRIDINFO   Get grid information
BLACS_PNUM       Get system process number
BLACS_PCOORD     Get row/column coordinates in the process grid
BLACS_BARRIER    Set a synchronization point
Point-to-point communication
GESD2D, TRSD2D   General/trapezoidal matrix send
GERV2D, TRRV2D   General/trapezoidal matrix receive
Broadcasts
GEBS2D, TRBS2D   General/trapezoidal matrix broadcast (sending)
GEBR2D, TRBR2D   General/trapezoidal matrix broadcast (receiving)
Reduction (combine) operations
GSUM2D, GAMX2D, GAMN2D   Reduction with sum/max/min operator
The BLACS are implemented by some lower level message passing interface.
Figure 5.7 indicates that the BLACS are built on top of MPI. The BLACS have
synchronous send and receive routines to communicate matrices or submatrices
from one process to another; functions to broadcast submatrices to many pro-
cesses; and others to compute global reductions. The PBLAS and the BLACS
relieve the application programmer from the writing of functions for moving
matrices and vectors. Because of the standardized conventions, code becomes
more portable and is easier to write. As with the BLAS, the basic operations are
defined for a relatively few basic functions that can be optimized for all com-
puting platforms. The PBLAS provide most of the functionality of the BLAS
for two-dimensional block cyclically distributed matrices shown in Section 5.4.
Matrix transposition is also available.
Besides its communication functions, the BLACS have features to initialize,
change, and query process grids. An application programmer will usually use
the BLACS directly only through these routines. In the code fragment in Figure 5.8, it
is shown how a process grid is initialized after MPI has been started. The
invocation of Cblacs_get returns a handle (ctxt) to a process grid, called
context. A context is analogous to an MPI communicator, that is, it is the
“world” in which communication takes place. The grid is initialized by the call
to Cblacs_gridinit, which determines the size of the process grid and the
numbering order of process rows or columns. Figure 5.9 shows the grid that
is obtained by this code when p = 8. The eight processes are numbered row
by row, that is, in row-major ordering. Finally, the call to Cblacs_pcoord in
Figure 5.8 returns the row and column indices of the actual process myid in
the process grid, 0 ≤ myrow < pr and 0 ≤ mycol < pc. If the number of pro-
cessors available is larger than the number of processes in the process grid, then
mycol = myrow = −1 for the superfluous processes.
After the completion of the parallel computation, the allocated memory space
needs to be freed, the process grid released, and MPI should be shut down,
see Figure 5.10.
After having initialized the process grid, the arrays are to be allocated and
initialized. The initialization of matrices and vectors is the responsibility of the
application programmer. This can be a tedious and error-prone task, and there
are many ways to proceed.
5.3 Distribution of vectors
To be more specific about arrangements of array distributions, let us first ponder
the problem of distributing a vector x with n elements x_0, x_1, . . . , x_{n−1} on p
processors.
5.3.1 Cyclic vector distribution
A straightforward distribution is obtained in a manner similar to dealing a deck
of cards: the first number x_0 is stored on processor 0, the second number x_1
on processor 1, and so on, until the pth number x_{p−1}, which is stored on
processor p−1.
Table 5.2 Summary of the PBLAS.

Level 1 PBLAS
P_SWAP                    Swap two vectors: x ↔ y
P_SCAL                    Scale a vector: x ← αx
P_COPY                    Copy a vector: x ← y
P_AXPY                    axpy operation: y ← y + αx
P_DOT, P_DOTU, P_DOTC     Dot product: s ← x · y = x^T y
P_NRM2                    2-norm: s ← ||x||_2
P_ASUM                    1-norm: s ← ||x||_1
P_AMAX                    Index of largest vector element:
                          first i such that |x_i| ≥ |x_k| for all k

Level 2 PBLAS
P_GEMV                    General matrix–vector multiply: y ← αAx + βy
P_HEMV                    Hermitian matrix–vector multiply: y ← αAx + βy
P_SYMV                    Symmetric matrix–vector multiply: y ← αAx + βy
P_TRMV                    Triangular matrix–vector multiply: x ← Ax
P_TRSV                    Triangular system solve
                          (forward/backward substitution): x ← A^{-1} x
P_GER, P_GERU, P_GERC     Rank-1 updates: A ← αxy^T + A
P_HER, P_SYR              Hermitian/symmetric rank-1 updates: A ← αxx^H + A
P_HER2, P_SYR2            Hermitian/symmetric rank-2 updates:
                          A ← αxy^H + ᾱyx^H + A

Level 3 PBLAS
P_GEMM, P_SYMM, P_HEMM    General/symmetric/Hermitian matrix–matrix
                          multiply: C ← αAB + βC
P_SYRK, P_HERK            Symmetric/Hermitian rank-k update: C ← αAA^T + βC
P_SYR2K, P_HER2K          Symmetric/Hermitian rank-2k update:
                          C ← αAB^T + ᾱBA^T + βC
P_TRAN, P_TRANU, P_TRANC  Matrix transposition: C ← βC + αA^T
P_TRMM                    Multiple triangular matrix–vector multiplies: B ← αAB
P_TRSM                    Multiple triangular system solves: B ← αA^{-1} B
/* Start the MPI engine */
MPI_Init(&argc, &argv);
/* Find out number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &p);
/* Find out process rank */
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
/* Get a BLACS context */
Cblacs_get(0, 0, &ctxt);
/* Determine pr and pc for the pr x pc grid */
for (pc=p/2; p%pc; pc--);
pr = p/pc;
if (pr > pc){pc = pr; pr = p/pc;}
/* Initialize the pr x pc process grid */
Cblacs_gridinit(&ctxt, "Row-major", pr, pc);
Cblacs_pcoord(ctxt, myid, &myrow, &mycol);
Fig. 5.8. Initialization of a BLACS process grid.
        0   1   2   3
    0   0   1   2   3
    1   4   5   6   7
Fig. 5.9. Eight processes mapped on a 2 ×4 process grid in row-major order.
/* Release process grid */
Cblacs_gridexit(ctxt);
/* Shut down MPI */
MPI_Finalize();
Fig. 5.10. Release of the BLACS process grid.
The next number x_p is stored again on processor 0, and so on. Thus,

    x_i is stored at position j on processor k if i = j·p + k, 0 ≤ k < p.    (5.1)
As an example, let x be a vector with 22 elements x_0, x_1, . . . , x_21. If x is
distributed cyclically, the elements are distributed over processors 0 to 4 like

    0 : x_0   x_5   x_10  x_15  x_20
    1 : x_1   x_6   x_11  x_16  x_21
    2 : x_2   x_7   x_12  x_17
    3 : x_3   x_8   x_13  x_18
    4 : x_4   x_9   x_14  x_19                      (5.2)
The cyclic distribution is displayed in Figure 5.11. Each of the 22 vector elements
is given a “color” indicating on which processor it is stored.
Fig. 5.11. Cyclic distribution of a vector.
From the layout in (5.2) we see that two processors hold five elements while
three processors hold only four elements. It is evidently not possible that all
processors hold the same number of vector elements. Under these circumstances,
the element distribution over the processors is optimally balanced.
5.3.2 Block distribution of vectors
Due to the principle of locality of reference (Section 1.1), it may be more
efficient to distribute the vector in p big blocks of length b = ⌈n/p⌉. In the
example above, we could store the first ⌈22/5⌉ = 5 elements of x on processor 0,
the next 5 elements on processor 1, and so on. With this block distribution, the
22-vector from before is distributed in the following way on 5 processors.
    0 : x_0   x_1   x_2   x_3   x_4
    1 : x_5   x_6   x_7   x_8   x_9
    2 : x_10  x_11  x_12  x_13  x_14
    3 : x_15  x_16  x_17  x_18  x_19
    4 : x_20  x_21                                  (5.3)
For vectors of length n, we have:

    x_i is stored at position k on processor j if i = j·b + k, 0 ≤ k < b.    (5.4)
The block distribution is displayed in Figure 5.12. Evidently, this procedure does
not distribute the elements as equally as the cyclic distribution in Figure 5.11.
Fig. 5.12. Block distribution of a vector.
The first p−1 processors all get ⌈n/p⌉ elements and the last one gets the remain-
ing elements, which may be as few as one element. The difference can thus be up
to p − 1 elements. If n is much larger than p, this issue is not so important.
Nevertheless, let n = 1,000,001 and p = 1000. Then ⌈n/p⌉ = 1001. The first
999 processors store 999 × 1001 = 999,999 elements, while the last processor
stores just 2 elements.
An advantage of the block distribution is the fast access of elements that are
stored contiguously in the memory of cache-based computers, that is, the locality
of reference principle of Section 1.1. Thus, the block distribution supports block
algorithms. Often algorithms are implemented in blocked versions to enhance
performance by improving the ratio of floating point operations to memory
accesses.
5.3.3 Block–cyclic distribution of vectors
A block–cyclic distribution is a compromise between the two previous distribu-
tions. It improves upon the badly balanced block distribution of Section 5.3.2,
but also retains some of the locality of reference in memory access that the
cyclic distribution lacks.
If b is a block size, usually a small integer like 2, 4, or 8, then partition the
n-vector x into n_b = ⌈n/b⌉ blocks, the first n_b − 1 of which consist of b elements,
while the last has length n − (n_b − 1)b (= n mod b when b does not divide n).
The blocks, which should be bigger than a cacheline, are now distributed
cyclically over the p processors. Set i = j·b + k; then from (5.4) we know that
x_i is the kth element in block j. Let now j = l·p + m. Then, interpreting (5.1)
for blocks, we see that global block j is the lth block on processor m. Thus, we
get the following distribution scheme:

    x_i is stored at position l·b + k on processor m
    if i = j·b + k, 0 ≤ k < b and j = l·p + m, 0 ≤ m < p.        (5.5)
Therefore, the block–cyclic distribution generalizes both the cyclic and block
distributions. The cyclic distribution is obtained if b = 1, while the block
distribution is obtained when b = ⌈n/p⌉.
With block size b = 2 and p = 5 processors, the 22 elements of vector x are
stored according to
    0 : x_0   x_1   x_10  x_11  x_20  x_21
    1 : x_2   x_3   x_12  x_13
    2 : x_4   x_5   x_14  x_15
    3 : x_6   x_7   x_16  x_17
    4 : x_8   x_9   x_18  x_19                      (5.6)
(5.6)
Figure 5.13 shows a diagram for this distribution.
Fig. 5.13. Block–cyclic distribution of a vector.
The number of blocks assigned to each processor cannot differ by more than 1.
Therefore, the difference in the number of elements per processor cannot exceed
b, the number of elements per block.
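The mapping (5.5) is easily made concrete in a few lines of C. The helper below is
only an illustration; note that b = 1 reproduces the cyclic distribution (5.1) and
b = ⌈n/p⌉ the block distribution (5.4).

/* For global index i, block size b, and p processors, return the owning
   processor and the local position l*b + k according to (5.5). */
void block_cyclic_map(int i, int b, int p, int *proc, int *pos)
{
   int k = i % b;          /* offset within block  */
   int j = i / b;          /* global block number  */
   int m = j % p;          /* owning processor     */
   int l = j / p;          /* local block number   */
   *proc = m;
   *pos  = l*b + k;
}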
5.4 Distribution of matrices
Columns or rows of matrices can be distributed according to one of the three
vector distributions (cyclic, block, and block–cyclic) in a straightforward manner.
Column cyclic, column block, and column block–cyclic distributions are obtained
if the columns of matrices are distributed in the way elements of vectors are
distributed. Analogously, row cyclic, row block, and row block–cyclic distributions
are obtained.
5.4.1 Two-dimensional block–cyclic matrix distribution
Can we distribute both rows and columns simultaneously according to one of the
vector distributions? Let us assume that we are given a matrix A with n_r rows
and n_c columns. Where, that is, at which position on which processor, should
element a_{i_r,i_c} of A go? If we want to decide on this by means of the formula
in (5.5) individually and independently for i_r and i_c, we need to have values for
the block size and the processor number for rows and columns of A. If we have
block sizes b_r and b_c and processor numbers p_r and p_c available, we can formally
proceed as follows.
    a_{i_r,i_c} is stored at position (l_r·b_r + k_r, l_c·b_c + k_c) on processor (m_r, m_c)
    if i_r = j_r·b_r + k_r, 0 ≤ k_r < b_r with j_r = l_r·p_r + m_r, 0 ≤ m_r < p_r,
    and i_c = j_c·b_c + k_c, 0 ≤ k_c < b_c with j_c = l_c·p_c + m_c, 0 ≤ m_c < p_c.    (5.7)
How do we interpret this assignment? The n_r × n_c matrix is partitioned into
small rectangular matrix blocks of size b_r × b_c. These blocks are distributed over
a rectangular grid of processors of size p_r × p_c. This processor grid is a logical
arrangement of the p = p_r · p_c processors that are available to us. Notice that
some of the numbers b_r, b_c, p_r, p_c can (and sometimes must) take on the value 1.
In Figure 5.14, we show an example of a 15 × 20 matrix that is partitioned
into a number of blocks of size 2 × 3. These blocks are distributed over 6 = 2 · 3
processors. Processor (0, 0) stores the white blocks. Notice that at the right and
lower border of the matrix, the blocks may be smaller: two wide, not three;
and one high, not two, respectively. Further, not all processors may get the
same number of blocks. Here, processor (0,0) gets 12 blocks with altogether 64
elements while processor (0,2) only gets 8 blocks with 48 elements.
Such a two-dimensional block–cyclic matrix distribution is the method by
which matrices are distributed in ScaLAPACK [12]. Vectors are treated as special
cases of matrices. Column vectors are thus stored in the first processor column;
row vectors are stored in the first processor row. In the ScaLAPACK convention,
Fig. 5.14. Block–cyclic distribution of a 15×20 matrix on a 2×3 processor grid
with blocks of 2 × 3 elements.
a grid of processors is called the process grid. Additionally, an assumption is made
that there is just one (application) process running per processor.
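Since (5.7) is just the one-dimensional mapping (5.5) applied to rows and columns
independently, the placement of a single matrix element can be sketched as below;
again this is an illustration of the formula, not ScaLAPACK code.

/* Map element (ir,ic) of an nr x nc matrix onto a pr x pc process grid
   with br x bc blocks, following (5.7). */
void block_cyclic_map_2d(int ir, int ic, int br, int bc, int pr, int pc,
                         int *mr, int *mc, int *posr, int *posc)
{
   int kr = ir % br, jr = ir / br;     /* row: offset in block, block number    */
   int kc = ic % bc, jc = ic / bc;     /* column: offset in block, block number */
   *mr = jr % pr;  *posr = (jr / pr)*br + kr;   /* processor row, local row       */
   *mc = jc % pc;  *posc = (jc / pc)*bc + kc;   /* processor column, local column */
}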
5.5 Basic operations with vectors
In Sections 2.1 and 3.2.3, we discussed both the saxpy (2.3)
y = αx +y,
and inner product operation sdot (2.4),
s = x · y.
These operations were again reviewed in Chapter 4, particularly Section 4.7,
regarding shared memory machines. In this section, we review these basic opera-
tions again in the context of distributed memory systems. What is difficult in
this situation is careful management of the data distribution.
Each component of y can be treated independently as before, provided that
the vectors x and y are distributed in the same layout on the processors. If
vectors are not distributed in the same layout, x or y have to be aligned. Align-
ing vectors involves expensive communication. Comparing (5.2) and (5.3) we see
that rearranging a cyclically distributed vector into a block distributed one cor-
responds to a matrix transposition. Matrix transposition also appears in the
multi-dimensional Fourier transforms (FFT) of Sections 5.8 and 5.9.
However, in computing sdot we have some freedom in how we implement the
sum over the $n$ products $x_i y_i$. In a distributed memory environment, one has
to think about where the sum $s$ should be stored: that is, on which processor is
the result to be made available?
For simplicity, assume that all processors hold exactly n/p elements. Then
the parallel execution time for saxpy (2.3) on $p$ processors is given by
\[
T_p^{\text{saxpy}} = \frac{n}{p}\, T_{\text{flop}},
\tag{5.8}
\]
such that the speedup is ideal,
\[
S_p^{\text{saxpy}} = \frac{T_1^{\text{saxpy}}}{T_p^{\text{saxpy}}} = p.
\tag{5.9}
\]
Here, $T_{\text{flop}}$ is the average time to execute a floating point operation (flop). Under
the same assumptions as those for computing saxpy, plus a binary tree reduction
schema, we get an expression for the time to compute the dot product,
\[
T_p^{\text{dot}} = 2\frac{n}{p}\, T_{\text{flop}} + T_{\text{reduce}}(p)(1)
            = 2\frac{n}{p}\, T_{\text{flop}} + \log_2(p)\,\bigl(T_{\text{startup}} + T_{\text{word}}\bigr).
\tag{5.10}
\]
Here, $T_{\text{reduce}}(p)(1)$ means the time a reduction on $p$ processors takes for one item,
$T_{\text{startup}}$ is the communication startup time, and $T_{\text{word}}$ is the time to transfer
a single data item. Thus, the speedup becomes
\[
S_p^{\text{dot}} = \frac{T_1^{\text{dot}}}{T_p^{\text{dot}}}
 = \frac{p}{1 + \bigl(\log_2(p)/2n\bigr)\,\bigl(T_{\text{startup}} + T_{\text{word}}\bigr)/T_{\text{flop}}}.
\tag{5.11}
\]
In general, $n$ would have to be much larger than $\log p$ to get a good speedup
because $T_{\text{startup}} \gg T_{\text{flop}}$.
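To make the structure of (5.10) concrete, here is a minimal sketch (ours, not code from the book's distribution) of the distributed dot product: a local loop of $2n/p$ flops followed by one reduction, for which MPI_Allreduce makes the result available on every process.

/* Sketch: distributed dot product.  Each process holds n_loc = n/p
   elements of x and y; the global sum is formed with MPI_Allreduce,
   which corresponds to the log2(p)(T_startup + T_word) term of (5.10). */
#include "mpi.h"

double pddot(const double *x_loc, const double *y_loc, int n_loc,
             MPI_Comm comm)
{
    double s_loc = 0.0, s;
    int i;
    for (i = 0; i < n_loc; i++)      /* 2*n/p flops on each process */
        s_loc += x_loc[i] * y_loc[i];
    MPI_Allreduce(&s_loc, &s, 1, MPI_DOUBLE, MPI_SUM, comm);
    return s;                        /* result known on every process */
}

If the sum $s$ is needed on one process only, MPI_Reduce can be used instead, saving part of the communication.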
5.6 Matrix–vector multiply revisited
In Section 4.8, we discussed the matrix–vector multiply operation on shared
memory machines in OpenMP. In the next two sections, we wish to revisit this
operation—first in MPI, then again using the PBLAS.
5.6.1 Matrix–vector multiplication with MPI
To give more meat to our discussion of Sections 5.3 and 5.4 on data distribution
and its importance on distributed memory machines, let us revisit our matrix–
vector multiply example of Section 4.8 and give a complete example of MPI
programming for the y = Ax problem. In a distributed memory environment,
we first have to determine how to distribute A, x, and y. Assume that x is
an N-vector and y is an M-vector. Hence A is M ×N. These two vectors will
be distributed in blocks, as discussed in Section 5.3.2. Row and column block
sizes are $b_r = \lceil M/p \rceil$ and $b_c = \lceil N/p \rceil$, respectively. We will call the numbers of
elements actually used by a processor $m$ and $n$. Clearly, $m = b_r$ and $n = b_c$ except
perhaps on the last processor, where $m = M - (p-1)\, b_r$ and $n = N - (p-1)\, b_c$.
Fig. 5.15. The data distribution in the matrix–vector product $A * x = y$ with
five processors. Here the matrix $A$ is square, that is, $M = N$.
For simplicity, we allocate the same amount of memory on each processor; see
the program excerpt in Figure 5.16. We distribute the matrix $A$ in block rows,
see Figure 5.15, such that each processor holds a $b_r$-by-$N$ array of contiguous
rows of $A$. This data layout is often referred to as a one-dimensional (1D) block row
distribution, or strip mining. The rows of $A$ are thus aligned with the elements of
$y$, meaning that, for all $k$, the $k$-th element of $y$ and the $k$-th row of $A$ reside on
the same processor. Again, for simplicity, we allocate an equal amount of data
on each processor; only $m$ rows of $A$ are actually accessed.
Since each processor owns complete rows of $A$, each processor needs all of
vector $x$ to form its local portion of $y$. In MPI, gathering $x$ on all processors is
accomplished by MPI_Allgather. Process $j$, say, sends its portion of $x$, stored
in x_local, to every other process, which stores it in the $j$-th block of x_global.
Finally, in a local operation, the local portion of $A$ is multiplied with
$x$ by invoking the level-2 BLAS routine dgemv.
Notice that the matrix–vector multiplication changes considerably if the matrix
$A$ is distributed in block columns instead of block rows. In that case the
local computation precedes the communication phase, which becomes a vector
reduction; a sketch of this variant is given below.
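As a sketch of that variant (ours, not code from the book; the routine name matvec_cols and its argument conventions are our own assumptions), each process forms a full-length partial product from its columns, and the partial vectors are then combined, here with MPI_Reduce_scatter so that every process ends up holding its own block of y.

/* Sketch: y = A*x with A distributed in block columns.  A_loc is the
   local M x n_loc block (column major, leading dimension M) and x_loc
   holds the matching n_loc elements of x.  recvcounts[i] is the number
   of y-elements owned by process i (they sum to M); MPI_Reduce_scatter
   adds the partial vectors and hands each process its block of y. */
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

void matvec_cols(int M, int n_loc, const double *A_loc,
                 const double *x_loc, double *y_loc,
                 int *recvcounts, MPI_Comm comm)
{
    double *y_part = (double *) malloc(M * sizeof(double));
    int i, j;
    memset(y_part, 0, M * sizeof(double));
    for (j = 0; j < n_loc; j++)          /* local partial product */
        for (i = 0; i < M; i++)
            y_part[i] += A_loc[i + j * M] * x_loc[j];
    MPI_Reduce_scatter(y_part, y_loc, recvcounts,
                       MPI_DOUBLE, MPI_SUM, comm);
    free(y_part);
}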
5.6.2 Matrix–vector multiply with PBLAS
As a further and fruitful examination of the MIMD calculation, let us revisit
the example of the matrix–vector product, this time using the PBLAS. We are
to compute the vector $y = Ax$ of size $M$. $A$, $x$, and $y$ are distributed over the
two-dimensional process grid previously defined in Figure 5.8. $A$ assumes a true
two-dimensional block–cyclic distribution, $x$ is stored as a $1 \times N$ matrix on the
first process row, and $y$ is stored as an $M \times 1$ matrix on the first process column, as in
Figure 5.13. Again, it is advantageous to choose the block sizes as large as possible,
whence the block–cyclic distributions become the simple block distributions of
Figure 5.18. Blocks of $A$ have size $\lceil M/p_r \rceil \times \lceil N/p_c \rceil$. The code fragment in
Figure 5.17 shows how space is allocated for the three arrays. $A$ has size $15 \times 20$.
#include <stdio.h>
#include <stdlib.h>   /* for malloc() */
#include "mpi.h"
int N=374, M=53, one=1; /* matrix A is M x N */
double dzero=0.0, done=1.0;
main(int argc, char* argv[]) {
/* matvec.c -- Pure MPI matrix vector product */
int myid, p; /* rank, no. of procs. */
int m, mb, n, nb, i, i0, j, j0;
double *A, *x_global, *x_local, *y;
MPI_Init(&argc, &argv); /* Start up MPI */
MPI_Comm_rank(MPI_COMM_WORLD, &myid); /* rank */
MPI_Comm_size(MPI_COMM_WORLD, &p); /* number */
/* Determine block sizes */
mb = (M - 1)/p + 1; /* mb = ceil(M/p) */
nb = (N - 1)/p + 1; /* nb = ceil(N/p) */
/* Determine true local sizes */
m = mb; n = nb;
if (myid == p-1) m = M - myid*mb;
if (myid == p-1) n = N - myid*nb;
/* Allocate memory space */
A = (double*) malloc(mb*N*sizeof(double));
y = (double*) malloc(mb*sizeof(double));
x_local = (double*) malloc(nb*sizeof(double));
x_global = (double*) malloc(N*sizeof(double));
/* Initialize matrix A and vector x */
for(j=0;j<N;j++)
for(i = 0; i < m; i++){
A[i+j*mb] = 0.0;
if (j == myid*mb+i) A[i+j*mb] = 1.0;
}
for(j=0;j<n;j++) x_local[j] = (double)(j+myid*nb);
/* Parallel matrix - vector multiply */
MPI_Allgather(x_local, n, MPI_DOUBLE, x_global,
nb, MPI_DOUBLE, MPI_COMM_WORLD);
dgemv_("N", &m, &N, &done, A, &mb, x_global,
&one, &dzero, y, &one);
for(i=0;i<m;i++)
printf("y[%3d] = %10.2f\n", myid*mb+i,y[i]);
MPI_Finalize(); /* Shut down MPI */
}
Fig. 5.16. MPI matrix–vector multiply with row-wise block-distributed matrix.
int M=15, N=20, ZERO=0, ONE=1;
/* Determine block sizes */
br = (M - 1)/pr + 1; /* br = ceil(M/pr) */
bc = (N - 1)/pc + 1; /* bc = ceil(N/pc) */
/* Allocate memory space */
x = (double*) malloc(bc*sizeof(double));
y = (double*) malloc(br*sizeof(double));
A = (double*) malloc(br*bc*sizeof(double));
/* Determine local matrix sizes and base indices */
i0 = myrow*br; j0 = mycol*bc;
m = br; n = bc;
if (myrow == pr-1) m = M - i0;
if (mycol == pc-1) n = N - j0;
Fig. 5.17. Block–cyclic matrix and vector allocation.
Fig. 5.18. The 15 × 20 matrix A stored on a 2 × 4 process grid with big blocks
together with the 15-vector y (left) and the 20-vector x (top).
By consequence, $b_r = \lceil M/p_r \rceil = \lceil 15/2 \rceil = 8$ and $b_c = \lceil N/p_c \rceil = 20/4 = 5$.
Thus, an $8 \times 5$ matrix is allocated on each processor. But on the last (second)
process row only $m = 7$ rows of the local matrices $A$ are used, as in Figure 5.18.
On the first process row $m = b_r = 8$; on all processes $n = b_c = 5$. The code
fragment is similar to the one in Figure 5.16, where the data are distributed
differently. The values $i_0$ and $j_0$ hold the global indices of the $(0,0)$ element of
the local portion of $A$.
Until this point of our discussion, the processes have only had a local view
of the data, matrices and vectors, although the process grid was known. Before
a PBLAS or ScaLAPACK routine can be invoked, a global view of the data has
to be defined. This is done by array descriptors. An array descriptor contains
the following parameters:
(1) the number of rows in the distributed matrix (M);
(2) the number of columns in the distributed matrix (N);
(3) the row block size ($b_r$);
(4) the column block size ($b_c$);
(5) the process row over which the first row of the matrix is distributed;
(6) the process column over which the first column of the matrix is distrib-
uted;
(7) the BLACS context;
(8) the leading dimension of the local array storing the local blocks.
In Figure 5.19 we show how array descriptors are defined by a call to the
ScaLAPACK descinit for the example matrix–vector multiplication. Notice
that descinit is a Fortran subroutine such that information on the success
of the call is returned through a variable, here info. Attributes of the array
descriptors are evident, except perhaps attributes (5) and (6). These give the
coordinates of the process in the process grid on which the first element of the
respective array resides. This need not always be process (0, 0). This flexibility
is particularly important when submatrices are accessed.
All that remains is to do the actual matrix–vector multiplication by a call to
the appropriate PBLAS routine pdgemv. This is shown in Figure 5.20.
Notice the similarity to calling the BLAS dgemv in Figure 5.16. However,
instead of the single reference to A, four arguments have to be transferred to
pdgemv: these reference the local portion of A, the row and column indices of the
descinit_(descA, &M, &N, &br, &bc, &ZERO, &ZERO,
&ctxt, &br, &info);
descinit_(descx, &ONE, &N, &ONE, &bc, &ZERO, &ZERO,
&ctxt, &ONE, &info);
descinit_(descy, &M, &ONE, &br, &ONE, &ZERO, &ZERO,
&ctxt, &br, &info);
Fig. 5.19. Defining the matrix descriptors.
/* Multiply y = alpha*A*x + beta*y */
alpha = 1.0; beta = 0.0;
pdgemv_("N", &M, &N, &alpha, A, &ONE, &ONE, descA,
x, &ONE, &ONE, descx, &ONE, &beta, y, &ONE,
&ONE, descy, &ONE);
Fig. 5.20. General matrix–vector multiplication with PBLAS.
“origin” of the matrix to be worked on (here twice ONE), and the corresponding
array descriptor. Notice that, according to the Fortran convention, the first row and
column of a matrix have index 1. Because this is a call to a Fortran routine,
this holds even if the array was defined in C starting with index 0.
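These global indices also allow operations on submatrices without copying. As a hedged sketch (ours, reusing the variables and descriptors of Figures 5.19 and 5.20; the variable TWO is hypothetical and the exact PBLAS alignment rules should be checked in [12]), the following call would overwrite y(2:M) with A(2:M, 1:N) x:

/* sketch only: act on the submatrix A(2:M,1:N) and the subvector y(2:M) */
int TWO = 2, Mm1 = M - 1;
pdgemv_("N", &Mm1, &N, &alpha, A, &TWO, &ONE, descA,
        x, &ONE, &ONE, descx, &ONE, &beta, y, &TWO, &ONE,
        descy, &ONE);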
5.7 ScaLAPACK
ScaLAPACK routines are parallelized versions of LAPACK routines [12]. In par-
ticular, the LAPACK block size equals the block size of the two-dimensional
block–cyclic matrix distribution, $b = b_r = b_c$. We again consider the block LU
factorization as discussed in Section 2.2.2.2. Three essential steps are needed:
1. The LU factorization of the actual panel. This task requires communica-
tion in one process column only for (a) determination of the pivots, and
(b) for the exchange of pivot row with other rows.
2. The computation of the actual block row. This requires the broadcast of
the $b \times b$ factor $L_{11}$ of the actual pivot block, in a single collective communication
step along the pivot row. The broadcast of $L_{11}$ is combined with
the broadcast of $L_{21}$ to reduce the number of communication startups.
3. The rank-$b$ update of the remaining matrix. This is the computationally
intensive portion of the code. Before computations can start, the pivot
block row ($U_{12}$) has to be broadcast along the process columns.
The complexity of the distributed LU factorization is considered carefully in
reference [18]. Let the number of processors be $p$, arranged in a $p_r \times p_c$ process
grid, where $p = p_r \cdot p_c$. Then the execution time for the LU factorization can be
estimated by the formula
\[
T_{LU}(p_r, p_c, b) \approx
\left( 2 n \log p_r + 2\frac{n}{b} \log p_c \right) T_{\text{startup}}
 + \frac{n^2}{2p}\left( 4 p_r + p_r \log p_r + p_c \log p_c \right) T_{\text{word}}
 + \frac{2n^3}{3p}\, T_{\text{flop}}.
\tag{5.12}
\]
Again, we have used the variables $T_{\text{startup}}$, $T_{\text{word}}$, and $T_{\text{flop}}$ introduced earlier in
Section 4.7. The time to send a message of length $n$ from one process to another
can be written as
\[
T_{\text{startup}} + n\, T_{\text{word}}.
\]
Formula (5.12) is derived by making simplifying assumptions. In particular, we
assumed that a broadcast of $n$ numbers to $p$ processors costs
\[
\log_2 p \,\bigl(T_{\text{startup}} + n\, T_{\text{word}}\bigr).
\]
In fact, $T_{\text{flop}}$ is not constant but depends on the block size $b$. On our Beowulf
cluster, these three quantities are
1. $T_{\text{startup}} \approx 175 \cdot 10^{-6}\,\mathrm{s} = 175\,\mu\mathrm{s}$.
2. $T_{\text{word}} \approx 9.9 \cdot 10^{-8}\,\mathrm{s}$, corresponding to a bandwidth of 10.5 MB/s. This
means that sending a message of length 2000 takes roughly twice as long
as sending an empty message.
3. $T_{\text{flop}} \approx 2.2 \cdot 10^{-8}\,\mathrm{s}$. This means that about 8000 floating point operations
could be executed at the speed of the LU factorization during $T_{\text{startup}}$
(see the short computation below). If we compare with the Mflop/s rate
that dgemm provides, this number is even higher.
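The figure of about 8000 floating point operations quoted in item 3 is simply the ratio of the two machine constants:
\[
\frac{T_{\text{startup}}}{T_{\text{flop}}} \approx
\frac{175 \cdot 10^{-6}\,\mathrm{s}}{2.2 \cdot 10^{-8}\,\mathrm{s}} \approx 8000 .
\]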
Floating point performance predicted by (5.12) is probably too good for small $b$
because, in that case, $T_{\text{word}}$ is below average. In deriving the formula, we assumed
that the load is balanced. But this is not likely to be even approximately true if
the block size $b$ is big: then a single process has to do a lot of work near
the end of the factorization, leading to a severe load imbalance. Thus, one must
be cautious when using a formula like (5.12) to predict the execution time on
a real machine.
Nevertheless, we would like to extract some information from this formula.
Notice that the number of process columns $p_c$ has a slightly smaller weight than
$p_r$ in the $T_{\text{startup}}$ and in the $T_{\text{word}}$ term. This is caused by the pivot search, which
is restricted to a process column. Therefore, it might be advisable to choose
the number of process columns $p_c$ slightly bigger than the number of process
rows $p_r$ [18]. We have investigated this on our Beowulf cluster. In Table 5.3 we
show timings of the ScaLAPACK routine pdgesv, which implements the solution
of a linear system $Ax = b$ by Gaussian elimination plus forward and backward
substitution. The processor number was fixed at $p = 36$, but we allowed the
size of the process grid to vary. Our Beowulf has dual processor nodes, so
only 18 nodes were used in these experiments. In fact, we observed only small
differences in the timings compared with using just one processor on each of $p$ (dedicated) nodes.
Each of the two processors has its own cache, but they compete for memory
access. Again, this indicates that the BLAS exploit the cache economically. The
timings show that it is indeed useful to choose $p_r \le p_c$. The smaller the problem,
the smaller the ratio $p_r/p_c$ should be. For larger problems it is advantageous to
have a more square-shaped process grid.
In Table 5.4 we present speedup numbers obtained on our Beowulf. These
timings should be compared with those from the Hewlett-Packard Superdome,
see Table 4.1. Notice that the floating point performance of a single Beowulf pro-
cessor is lower than the corresponding performance of a single HP processor. This
can be seen from the one-processor numbers in these two tables. Furthermore,
the interprocessor communication bandwidth is lower on the Beowulf not only
in an absolute sense, but also relative to the processor performance. Therefore,
the speedups on the Beowulf are very low for small problem sizes. For problem
sizes n = 500 and n = 1000, one may use two or four processors, respectively, for
Table 5.3 Timings of the ScaLAPACK
system solver pdgesv on one processor
and on 36 processors with varying
dimensions of the process grid.
p_r  p_c  b           Time (s)
                 n = 1440   2880   5760   11,520
1 1 20 27.54 347.6 2537 71,784
2 18 20 2.98 17.5 137 1052
3 12 20 3.04 12.1 103 846
4 9 20 3.49 11.2 359 653
6 6 20 5.45 14.2 293 492
1 1 80 54.60 534.4 2610 56,692
2 18 80 4.69 26.3 433 1230
3 12 80 3.43 18.7 346 1009
4 9 80 4.11 15.7 263 828
6 6 80 5.07 15.9 182 639
Table 5.4 Times t and speedups S(p) for various problem
sizes n and processor numbers p for solving a random
system of equations with the general solver pdgesv of
ScaLAPACK on the Beowulf cluster. The block size for
n ≤ 1000 is b = 32, the block size for n > 1000 is b = 16.
p n = 500 n = 1000 n = 2000 n = 5000
t(s) S(p) t(s) S(p) t(s) S(p) t(s) S(p)
1 0.959 1 8.42 1.0 121 1 2220 1
2 0.686 1.4 4.92 1.7 47.3 2.7 1262 1.8
4 0.788 1.2 3.16 2.7 17.7 6.9 500 4.4
8 0.684 1.4 2.31 3.7 10.8 11 303 7.3
16 1.12 0.86 2.45 3.4 7.43 16 141 15
32 1.12 0.86 2.27 3.7 6.53 19 48 46
solving the systems of equations. The efficiency drops below 50 percent if more
processors are employed. For larger problem sizes, n ≥ 2000, we observe super-
linear speedups. These are caused by the fact that the traffic to main memory
decreases with an increasing number of processors due to the growing size of the
cumulated main memory. They are real effects and can be an important reason
for using a large machine.
5.8 MPI two-dimensional FFT example
We now sally forth into a more completely written out example—a two-
dimensional FFT. More sophisticated versions for general n exist in packaged
software form (FFTW) [55], so we restrict ourselves again to the $n = 2^m$ binary
radix case for illustration. For more background, review Section 2.4. The essence
of the matter is the independence of row and column transforms: we want
\[
y_{p,q} = \sum_{s=0}^{n-1} \sum_{t=0}^{n-1} \omega^{ps+qt}\, x_{s,t},
\]
which we process first by the independent rows, then the independent columns
($Z$ represents a temporary state of $x$),
\[
\forall s:\quad Z_{s,q} = \sum_{t=0}^{n-1} \omega^{qt}\, x_{s,t},
\]
\[
\forall q:\quad y_{p,q} = \sum_{s=0}^{n-1} \omega^{ps}\, Z_{s,q}.
\]
In these formulae, $\omega = e^{2\pi i/n}$ is again (see Chapter 2, Section 2.4) the $n$th root
of unity. Instead of trying to parallelize one row or column of the FFT, it is far
simpler to utilize the independence of the transforms of each row, or column,
respectively. Let us be quite clear about this: in computing the $Z_{s,q}$s, each row
($s$) is totally independent of every other; while after the $Z_{s,q}$s are computed,
each column ($q$) of $y$ ($y_{*,q}$) may be computed independently. Thus, the simplest
parallelization scheme is by strip-mining (Section 5.6). Figure 5.21 illustrates the
method. The number of columns in each strip is $n/p$, where again $p$ = NCPUs,
Fig. 5.21. Strip-mining a two-dimensional FFT. (a) Transform rows of $X$:
$x_{s,*}$ for $s = 0, \dots, n-1$, and (b) transform columns of $Z$: $Z_{*,q}$ for
$q = 0, \dots, n-1$. Each strip has width $n/p$, and the two stages are connected
by a transpose (Xpose).
Fig. 5.22. Two-dimensional transpose for complex data. (a) Global transpose of
blocks and (b) local transposes within blocks.
and each strip is stored on an independent processor. An observant reader will
notice that this is fine for transforming the columns, but what about the rows? As
much as we dislike it, a transposition is required. Worse, two are required if
we want the original order. A basic transposition is shown in Figure 5.22, and
the code for Xpose is shown beginning on p. 182. The transposition is done first by
blocks, each block containing $(n/p) \cdot (n/p)$ complex elements. Diagonal blocks are
done in-place, that is, on the same processor. Other blocks must be moved to the corresponding
transposed positions, where they are subsequently locally transposed.
The following algorithm works only for $p = 2^q$ processors, where $q < m = \log_2(n)$.
Let $r$ be the rank of the processor containing the block to be transposed with its
corresponding other (as in Figure 5.22); to find other, where $\oplus$ is exclusive
or, we use a trick shown to one of the authors by Stuart Hawkinson [76] (look
for the “XOR trick” comment in the Xpose code),
for s = 1, . . . , p − 1 {
    other = r ⊕ s
    other block ↔ current block
}.
It is crucial that the number of processors is $p = 2^q$. To show this works, two points
must be made: (1) the selection other never references diagonal blocks, and (2)
the other $s$ values select all the other non-diagonal blocks. Here is a proof:
1. It can only happen that $s \oplus r = r$ if $s = 0$, but the range of $s$ is $s \ge 1$.
Therefore $s \oplus r = r$ is excluded.
2. To show that $j = s \oplus r$ exhausts the other values $0 \le j \le 2^q - 1$ except
$j = r$, expand $s = (s_{q-1}, s_{q-2}, \dots, s_0)$, where each $s_k$ is a binary digit,
that is, 0 or 1. If $s^{(1)} \oplus r = s^{(2)} \oplus r$, since $\oplus$ is exclusive it must be that
$s^{(1)}_k = s^{(2)}_k$ for all $0 \le k \le q-1$. For otherwise, by examining the $2^k$ term,
we would need $s^{(1)}_k \oplus r_k = s^{(2)}_k \oplus r_k$ but $s^{(1)}_k \ne s^{(2)}_k$. By writing out the
1-bit $\oplus$ logic tableau, it is clear this cannot happen for either $r_k = 0$ or
$r_k = 1$. Hence, for each of the $2^q - 1$ values of $s$, $j = s \oplus r$ is unique and
therefore the $s$ values exhaust all the indices $j \ne r$.
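For readers who prefer to see the argument verified by machine, the following tiny C check (ours, not from the book) enumerates the pairings for one power-of-two $p$:

/* Sketch: brute-force check of the XOR pairing for p = 2^q processes.
   For every rank r, the set { r ^ s : s = 1..p-1 } must consist of
   exactly the ranks j != r. */
#include <stdio.h>

int main(void)
{
    int p = 8;                     /* any power of two up to 64 here */
    int r, s, j, seen[64];
    for (r = 0; r < p; r++) {
        for (j = 0; j < p; j++) seen[j] = 0;
        for (s = 1; s < p; s++) seen[r ^ s] = 1;
        for (j = 0; j < p; j++)
            if (seen[j] != (j != r))
                printf("pairing failed for r=%d, j=%d\n", r, j);
    }
    printf("XOR pairing verified for p=%d\n", p);
    return 0;
}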
For more general cases, where $p \ne 2^q$, the index digit permutation $ij = i \cdot p + j \to
ji = j \cdot p + i$ will do the job, even if not quite as charmingly as the above exclusive
or. If $p$ does not divide $n$ ($p \nmid n$), life becomes more difficult and one has to
be clever to get a good load balance [55]. The MPI command MPI_Alltoall
can be used here, but this command is implementation dependent and may not
always be efficient [59]. In this routine, MPI_Sendrecv_replace does just what
it says: buf_io on other and its counterpart on rk=rank are swapped. Please
see Appendix D for more details on this routine.
void Xpose(float *a, int n) {
float t0,t1;
static float *buf_io;
int i,ij,is,j,step,n2,nn,size,rk,other;
static int init=-1;
MPI_Status stat;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rk);
/* number of local rows of 2D array */
nn = n/size;
n2 = 2*nn;
if(init!=n){
buf_io = (float *)malloc(nn*n2*sizeof(float));
init = n;
}
/* local transpose of first block (in-place) */
for(j = 0; j < nn; j ++){
for(i = 0; i < j; i++) {
t0 = a[rk*n2+i*2*n+j*2];
t1 = a[rk*n2+i*2*n+j*2+1];
a[rk*n2+i*2*n+j*2] = a[rk*n2+j*2*n+2*i];
a[rk*n2+i*2*n+j*2+1] = a[rk*n2+j*2*n+2*i+1];
a[rk*n2+j*2*n+2*i] = t0;
a[rk*n2+j*2*n+2*i+1] = t1;
}
}
/* size-1 communication steps */
for (step = 1; step < size; step ++) {
other = rk ^ step; /* XOR trick */
ij = 0;
for(i=0;i<nn;i++){ /* fill send buffer */
is = other*n2 + i*2*n;
for(j=0;j<n2;j++){
buf_io[ij++] = a[is + j];
}
}
/* exchange data */
MPI_Sendrecv_replace(buf_io,2*nn*nn,MPI_FLOAT,
other,rk,other,other,MPI_COMM_WORLD,&stat);
/* write back recv buffer in transposed order */
for(i = 0; i < nn; i ++){
for(j = 0; j < nn; j ++){
a[other*n2+j*2*n+i*2] = buf_io[i*n2+j*2];
a[other*n2+j*2*n+i*2+1] = buf_io[i*n2+j*2+1];
}
}
}
}
Using this two-dimensional transposition procedure, we can now write down the
two-dimensional FFT itself. From Section 3.6, we use the one-dimensional FFT,
cfft2, which assumes that array w is already filled with powers of the roots of
unity: w = {exp(2πik/n), k = 0, . . . , n/2 − 1}.
void FFT2D(float *a,float *w,float sign,int ny,int n)
{
int i,j,off;
float *pa;
void Xpose();
void cfft2();
for(i=0;i<ny;i++){
off = 2*i*n;
pa = a + off;
cfft2(n,pa,w,sign);
}
Xpose(a,n);
for(i=0;i<ny;i++){
off = 2*i*n;
pa = a + off;
cfft2(n,pa,w,sign);
}
Xpose(a,n);
}
A machine readable version of this two-dimensional FFT and the following
three-dimensional FFT may be downloaded from our ftp site [6] with all needed
co-routines.
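A minimal driver for FFT2D might look as follows. This is our sketch, not part of the book's distribution: it assumes the FFT2D, Xpose, and cfft2 routines above are compiled in, that the number of processes is a power of two dividing n, and that sign = 1.0 selects the forward transform (the sign convention here is an assumption).

/* sketch: driver for the strip-mined 2D FFT */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

void FFT2D(float *a, float *w, float sign, int ny, int n);

int main(int argc, char *argv[])
{
   int i, k, n = 256, nn, rank, size;
   float *a, *w, pi = 3.14159265f;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   nn = n/size;                              /* local rows (strip width)  */
   a  = (float *) malloc(2*nn*n*sizeof(float));  /* nn x n complex data   */
   w  = (float *) malloc(n*sizeof(float));       /* n/2 complex twiddles  */
   for (k = 0; k < n/2; k++) {               /* w_k = exp(2 pi i k/n)     */
      w[2*k]   = (float) cos(2.0*pi*k/n);
      w[2*k+1] = (float) sin(2.0*pi*k/n);
   }
   for (i = 0; i < 2*nn*n; i++)              /* arbitrary test data       */
      a[i] = (float)(rank + i % 7);
   FFT2D(a, w, 1.0f, nn, n);                 /* transform the local strip */
   if (rank == 0) printf("a[0] = %g\n", a[0]);
   free(a); free(w);
   MPI_Finalize();
   return 0;
}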
5.9 MPI three-dimensional FFT example
In higher dimensions, the independence of orthogonal directions again makes par-
allelization conceptually clear. Now, instead of transposing blocks, rectangular
parallelepipeds (pencils) are moved. Imagine extending Figure 5.22 into the paper
and making a cube of it. The three-dimensional transform is then computed in
three stages (corresponding to each $x_1$, $x_2$, $x_3$ direction), parallelizing by indexing
the other two directions (arrays $Z^{[1]}$ and $Z^{[2]}$ are temporary):
\[
y_{p,q,r} = \sum_{s=0}^{n-1} \sum_{t=0}^{n-1} \sum_{u=0}^{n-1}
\omega^{ps+qt+ru}\, x_{s,t,u},
\]
\[
\forall s,t:\quad Z^{[1]}_{s,t,r} = \sum_{u=0}^{n-1} \omega^{ru}\, x_{s,t,u},
\]
\[
\forall s,r:\quad Z^{[2]}_{s,q,r} = \sum_{t=0}^{n-1} \omega^{qt}\, Z^{[1]}_{s,t,r},
\]
\[
\forall q,r:\quad y_{p,q,r} = \sum_{s=0}^{n-1} \omega^{ps}\, Z^{[2]}_{s,q,r}.
\]
Be aware that, as written, the transforms are un-normalized. This means that
after computing first the above three-dimensional transform and then the inverse
($\omega \to \bar{\omega}$ in the above), the result will equal the input scaled by $n^3$, that is,
we get $n^3 x_{s,t,u}$. First we show the transpose, then the three-dimensional code.
void Xpose(float *a, int nz, int nx) {
/* 3-D transpose for cubic FFT: nx = the thickness
of the slabs, nz=dimension of the transform */
float t0,t1;
static float *buf_io;
int i, ijk, j, js, k, step, n2, nb, np, off;
static int init=-1;
int size, rk, other;
MPI_Status stat;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rk);
/* number of local planes of 3D array */
n2 = 2*nz; np = 2*nx*nx; nb = nz*n2*nx;
if(init!=nx){
if(init>0) free(buf_io);
buf_io = (float *)malloc(nb*sizeof(float));
init = nx;
}
/* local transpose of first block (in-place) */
for(j = 0; j < nx; j++){
off = j*2*nx + rk*n2;
for(k = 0; k < nz; k ++){
for(i = 0; i < k; i++) {
t0 = a[off + i*np + k*2];
t1 = a[off + i*np + k*2+1];
a[off+i*np+k*2] = a[off+k*np+2*i];
a[off+i*np+k*2+1] = a[off+k*np+2*i+1];
a[off+k*np+2*i] = t0;
a[off+k*np+2*i+1] = t1;
}
}
}
/* size-1 communication steps */
for (step = 1; step < size; step ++) {
other = rk ^ step;
/* fill send buffer */
ijk = 0;
for(j=0;j<nx;j++){
for(k=0;k<nz;k++){
off = j*2*nx + other*n2 + k*np;
for(i=0;i<n2;i++){
buf_io[ijk++] = a[off + i];
}
}
}
/* exchange data */
MPI_Sendrecv_replace(buf_io,n2*nz*nx,MPI_FLOAT,
other,rk,other,other,MPI_COMM_WORLD,&stat);
/* write back recv buffer in transposed order */
ijk = 0;
for(j=0;j<nx;j++){
off = j*2*nx + other*n2;
for(k=0;k<nz;k++){
for(i=0;i<nz;i++){
a[off+i*np+2*k] = buf_io[ijk];
a[off+i*np+2*k+1] = buf_io[ijk+1];
ijk += 2;
}
}
}
}
}
Recalling cfft2 from Section 3.5.4, we assume that array w has been initialized
with the $n/2$ powers of the roots of unity $\omega$, $w = \{\exp(2\pi i k/n),\ k = 0, \dots, n/2 - 1\}$.
Here is the transform itself:
void FFT3D(float *a,float *w,float sign,int nz,int nx)
{
int i,j,k,off,rk;
static int nfirst=-1;
static float *pw;
float *pa;
void Xpose();
void cfft2();
MPI_Comm_rank(MPI_COMM_WORLD,&rk);
if(nfirst!=nx){
if(nfirst>0) free(pw);
pw = (float *) malloc(2*nx*sizeof(float));
nfirst = nx;
}
/* X-direction */
for(k=0;k<nz;k++){
off = 2*k*nx*nx;
for(j=0;j<nx;j++){
pa = a + off + 2*nx*j;
cfft2(nx,pa,w,sign);
}
}
/* Y-direction */
for(k=0;k<nz;k++){
for(i=0;i<nx;i++){
off = 2*k*nx*nx+2*i;
for(j=0;j<nx;j++){
*(pw+2*j) = *(a+2*j*nx+off);
*(pw+2*j+1) = *(a+2*j*nx+1+off);
}
cfft2(nx,pw,w,sign);
for(j=0;j<nx;j++){
*(a+2*j*nx+off) = *(pw+2*j);
*(a+2*j*nx+1+off) = *(pw+2*j+1);
}
}
}
/* Z-direction */
Xpose(a,nz,nx);
for(k=0;k<nz;k++){
off = 2*k*nx*nx;
for(j=0;j<nx;j++){
pa = a + off + 2*nx*j;
cfft2(nx,pa,w,sign);
}
}
Xpose(a,nz,nx);
}
5.10 MPI Monte Carlo (MC) integration example
In this section we illustrate the flexibility of MC simulations on multiple
CPUs. In Section 2.5, it was pointed out that the simplest way of distributing
work across multiple CPUs in MC simulations is to split the sample into
$p$ = NCPUs pieces: $N = p \cdot (N/p)$, and each CPU computes a sample of size
$N/p$. Another approach is a domain decomposition, where various parts of an
integration domain are assigned to each processor. (See also Section 2.3.10.4 for
another example of domain decomposition.) Integration is an additive procedure:
\[
\int_A f(x)\, dx = \sum_i \int_{A_i} f(x)\, dx,
\]
so the integration domain can be divided into disjoint sets $\{A_i\}$ with $A = \cup_i A_i$,
where $A_i \cap A_j = \emptyset$ when $i \ne j$. Very little communication is required. One only
has to gather the final partial results (numbered i) and add them up to finish
the integral. Beginning on p. 187 we show an example of integration of a more
or less arbitrary function f(x, y) over the star-shaped region in Figure 5.23.
If $f$ has singularities or very rapid variation in one segment of the domain,
these features would obviously have to be taken into account when dividing up
the work. The star-shaped
domain in Figure 5.23 can easily be divided into five pieces, which we assign
to five CPUs. To effect uniform sampling on each point of the star, we do a
little cutting and pasting. Each point has the same area as the center square,
so we sample uniformly on a like-area square and rearrange these snips off this
square to make the points as in Figure 5.24. Array seeds contains a set of initial
random number seeds (Section 2.5) to prepare independent streams for each
processor.
Fig. 5.23. A domain decomposition MC integration: the 1 × 1 center square C
and the four 2 × 1 triangular points N, S, E, W of the star.
Fig. 5.24. Cutting and pasting a uniform sample on the points. Sampling is
uniform in the unit square; the corners are removed, flipped around the y-axis,
and pasted into place.
#include <stdio.h>
#include <math.h>
#include "mpi.h"
#define NP 10000
/* f(x,y) is any "reasonable" function */
#define f(x,y) exp(-x*x-0.5*y*y)*(10.0*x*x-x)*(y*y-y)
main(int argc, char **argv)
{
/* Computes integral of f() over star-shaped
region: central square is 1 x 1, each of
N,S,E,W points are 2 x 1 triangles of same
area as center square */
int i,ierr,j,size,ip,master,rank;
/* seeds for each CPU: */
float seeds[]={331.0,557.0,907.0,1103.0,1303.0};
float x,y,t,tot;
static float seed;
float buff[1]; /* buff for partial sums */
float sum[4]; /* part sums from "points" */
float ggl(); /* random number generator */
MPI_Status stat;
MPI_Init(&argc,&argv); /* initialize MPI */
MPI_Comm_size(MPI_COMM_WORLD,&size); /* 5 cpus */
MPI_Comm_rank(MPI_COMM_WORLD,&rank); /* rank */
master = 0; /* master */
ip = rank;
if(ip==master){
/* master computes integral over center square */
tot = 0.0;
seed = seeds[ip];
for(i=0;i<NP;i++){
x = ggl(&seed)-0.5;
y = ggl(&seed)-0.5;
tot += f(x,y);
}
tot *= 1.0/((float) NP); /* center part */
} else {
/* rank != 0 computes E,N,W,S points of star */
seed = seeds[ip];
sum[ip-1] = 0.0;
for(i=0;i<NP;i++){
x = ggl(&seed);
y = ggl(&seed) - 0.5;
if(y > (0.5-0.25*x)){
x = 2.0 - x; y = -(y-0.5);
}
if(y < (-0.5+0.25*x)){
x = 2.0 - x; y = -(y+0.5);
}
x += 0.5;
if(ip==2){
t = x; x = y; y = -t;
} else if(ip==3){
x = -x; y = -y;
} else if(ip==4){
t = x; x = -y; y = t;
}
sum[ip-1] += f(x,y);
}
sum[ip-1] *= 1.0/((float) NP);
buff[0] = sum[ip-1];
MPI_Send(buff,1,MPI_FLOAT,0,0,MPI_COMM_WORLD);
}
if(ip==master){ /* get part sums of other cpus */
for(i=1;i<5;i++){
MPI_Recv(buff,1,MPI_FLOAT,MPI_ANY_SOURCE,
MPI_ANY_TAG,MPI_COMM_WORLD,
&stat);
tot += buff[0];
}
printf(" Integral = %e\n",tot);
}
MPI_Finalize();
}
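Since the five partial results are simply added, the explicit MPI_Send/MPI_Recv bookkeeping at the end of the program could also be replaced by a single collective call. A sketch of this alternative (ours, not the book's version; it reuses the variables rank, ip, tot, and sum of the listing above):

/* sketch: combine the partial integrals with one reduction instead of
   explicit sends and receives; the sum lands on the master (rank 0). */
float part, total;
part = (rank == 0) ? tot : sum[ip-1];
MPI_Reduce(&part, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) printf(" Integral = %e\n", total);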
5.11 PETSc
In this final section we give a brief summary of the Portable Extensible Toolkit for
Scientific computation (PETSc) [7]–[9]. PETSc is a respectably sized collection
of C routines for solving large sparse linear and nonlinear systems of equations
on parallel (and serial) computers. PETSc's building blocks (or libraries) of data
structures and routines are depicted in Figure 5.25. PETSc uses the MPI standard
for all message-passing communication. Here, we mainly want to experiment with
various preconditioners (from p. 30) that complement PETSc's parallel linear
equation solvers.
Each part of the PETSc library manipulates a particular family of objects
(vectors, matrices, Krylov subspaces, preconditioners, etc.) and operations that
can be performed on them. We will first discuss how data, vectors and matrices,
Fig. 5.25. The PETSc software building blocks, ordered by level of abstraction:
BLAS, LAPACK, and MPI at the bottom; matrices, vectors, and index sets;
Krylov subspace methods (KSP), preconditioners (PC), and Draw; linear
equation solvers (SLES); and, at the top, nonlinear equation solvers (SNES),
time stepping (TS), and PDE solvers.
are distributed by PETSc. Subsequently, we deal with the invocation of the
iterative solvers, after choosing the preconditioner. We will stay close to the
examples that are given in the PETSc manual [9] and the tutorial examples that
come with the software [7]. The problem that we illustrate is the two-dimensional
Poisson equation on an n×n grid. It is easy enough to understand how this
matrix is constructed and distributed, so we will not be much concerned
with its construction and can concentrate on the parallel solvers
and preconditioners.
5.11.1 Matrices and vectors
On the lowest level of abstraction, PETSc provides matrices, vectors, and index
sets. The latter are used for storing permutations originating from reordering due
to pivoting or reducing fill-in (e.g. [32]). We do not manipulate these explicitly.
To see how a matrix can be built, look at Figure 5.26. The building of vectors
proceeds in a way similar to Figure 5.27. After defining the matrix object A, it
is created by the command
MatCreate(MPI_Comm comm, int m, int n, int M, int N, Mat *A).
The default communicator is PETSC_COMM_WORLD, defined in an earlier call
to PetscInitialize. (Note that PetscInitialize calls MPI_Init and sets
Mat A; /* linear system matrix */
...
ierr = MatCreate(PETSC_COMM_WORLD,PETSC_DECIDE,
PETSC_DECIDE,n*n,n*n,&A);
ierr = MatSetFromOptions(A);
/* Set up the system matrix */
ierr = MatGetOwnershipRange(A,&Istart,&Iend);
for (I=Istart; I<Iend; I++) {
v = -1.0; i = I/n; j = I - i*n;
if (i>0) {J = I - n;
ierr = MatSetValues(A,1,&I,1,&J,&v,
INSERT_VALUES);
}
...
v = 4.0;
ierr = MatSetValues(A,1,&I,1,&I,&v,
INSERT_VALUES);
}
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);
Fig. 5.26. Definition and initialization of a n ×n Poisson matrix.
Vec u;
...
ierr = VecCreate(PETSC_COMM_WORLD,&u);
ierr = VecSetSizes(u,PETSC_DECIDE,n*n);
ierr = VecSetFromOptions(u);
Fig. 5.27. Definition and initialization of a vector.
PETSC_COMM_WORLD to MPI_COMM_WORLD.) Parameters m and n are the number of
local rows and (respectively) columns on the actual processor. Parameters M and
N denote the global size of the matrix A. In our example shown in Figure 5.26,
PETSc decides on how the matrix is distributed. Matrices in PETSc are always
stored block row-wise as in Section 5.4. Internally, PETSc also blocks the rows.
The usually square diagonal block plays an important role in Jacobi and domain
decomposition preconditioners, p. 30. Two of the three authors of Eispack [133]
are the principal designers of PETSc, so PETSc was used for solving the examples
in [132]. The column block sizes come into
play when a matrix A is multiplied with a vector x. The PETSc function
MatMult(Mat A, Vec x, Vec y), that forms y = Ax, requires that the
column blocks of A match the block distribution of vector x. The vector that
stores the matrix–vector product, here y, has to be distributed commensurate
with A, that is, their block row distributions have to match.
The invocation of MatSetFromOptions in Figure 5.26 sets the actual matrix
format. This format can be altered by runtime options. Otherwise, the default
matrix format MPIAIJ is set, which denotes a “parallel” matrix that is
distributed block row-wise and stored in the CSR format (Section 2.3.9) on the
processors.
After the formal definition of A, matrix elements are assigned values. Our
example shows the initialization of the Poisson matrix on an n×n grid, that is, by
the 5-point stencil [135]. We give the portion of the code that defines the leftmost
nonzero off-diagonal and the diagonal. This is done element by element with a call to
MatSetValues(Mat A, int m, int *idxm, int n,
    int *idxn, PetscScalar *vals, InsertMode insmod).
Here vals is an m×n array of values to be inserted into A at the rows and columns
given by the index vectors idxm and idxn. The InsertMode parameter can be either
INSERT_VALUES or ADD_VALUES; several entries can thus be set with a single call,
as in the sketch below.
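For instance, a complete interior stencil row could be inserted with a single call. This is our sketch, reusing the variables A, I, n, and ierr of Figure 5.26:

/* sketch: set the whole 5-point stencil row I (interior grid point)
   of the Poisson matrix with one MatSetValues call */
int         idxn[5];
PetscScalar vals[5] = { -1.0, -1.0, 4.0, -1.0, -1.0 };
idxn[0] = I - n;  idxn[1] = I - 1;  idxn[2] = I;
idxn[3] = I + 1;  idxn[4] = I + n;
ierr = MatSetValues(A, 1, &I, 5, idxn, vals, INSERT_VALUES);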
The numbers of the rows that will go on a particular processor are obtained
through
MatGetOwnershipRange(Mat A, int *is, int *ie).
It is more economical if the elements are inserted into the matrix by the
processors to which they are assigned. Otherwise, these values have to be
transferred to the correct processor. This is done by the pair of commands
MatAssemblyBegin and MatAssemblyEnd. These routines bring the matrix into
the final format, which is designed for matrix operations. The reason for hav-
ing two functions for assembling matrices is that during the assembly, which
may involve communication, other matrices can be constructed or computation
can take place. This is latency hiding, which we have encountered several times,
beginning with Chapter 1, Section 1.2.2.
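In code, this latency hiding simply amounts to keeping independent work between the two assembly calls; a sketch (ours):

/* sketch: overlap assembly communication with other, independent work */
ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
/* ... other independent work can be done here, e.g. generating the
   entries of the right-hand side, while the values inserted on
   "foreign" processors are being communicated ... */
ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);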
The definition and initialization of the distributed matrix A does not expli-
citly refer to the number of processors involved. This makes it easier to investigate
the behavior of algorithms when the number of processors is varied.
The definition of vectors follows a similar line as the definition of matrices
shown in Figure 5.27.
5.11.2 Krylov subspace methods and preconditioners
PETSc provides a number of Krylov subspace (Section 2.3.5) procedures for solv-
ing linear systems of equations. The default solver is restarted GMRES [11].
Here we are solving a symmetric positive definite system, so the preconditioned
conjugate gradient algorithm, see Figure 2.6, is more appropriate [129]. In
PETSc, a pointer (called a context, Section 5.2) has to be defined in order
to access the linear system solver object. The Krylov subspace method [129]
and the preconditioner will then be properties of the solver as in Figure 5.28.
After the definition of the pointer, sles, the system matrix and the matrix that
defines the preconditioner are set. These matrices can be equal if the precon-
ditioner can be extracted from the system matrix, for example, using Jacobi
(Section 2.3.2), Gauss–Seidel (GS) (Section 2.3.3), or similar preconditioners, or
if the preconditioner is obtained by an incomplete factorization of the system
matrix. In this case, PETSc allocates the memory needed to store the precondi-
tioner. The application programmer can help PETSc by providing information
on the amount of additional memory needed. This information will enhance
performance in the setup phase.
/* Create linear solver context */
ierr = SLESCreate(PETSC_COMM_WORLD,&sles);
/* Set operator */
ierr = SLESSetOperators(sles,A,A,
DIFFERENT_NONZERO_PATTERN);
/* Set Krylov subspace method */
ierr = SLESGetKSP(sles,&ksp);
ierr = KSPSetType(ksp,KSPCG);
ierr = KSPSetTolerances(ksp,1.e-6,PETSC_DEFAULT,
PETSC_DEFAULT,PETSC_DEFAULT);
Fig. 5.28. Definition of the linear solver context and of the Krylov subspace
method.
ierr = SLESGetPC(sles,&pc);
ierr = PCSetType(pc,PCJACOBI);
Fig. 5.29. Definition of the preconditioner, Jacobi in this case.
ierr = SLESSolve(sles,b,x,&its);
Fig. 5.30. Calling the PETSc solver.
If the preconditioner is the simple Jacobi preconditioner, then its definition
only requires getting a pointer to the preconditioner and setting its type, as
shown in Figure 5.29.
Finally, the PETSc solver is invoked by calling SLESSolve with two input
parameters: the solver context, and the right-hand side vector; and two output
parameters: the solution vector and the number of iterations (see Figure 5.30).
Most of the properties of the linear system solver can be chosen by options [7]–[9].
5.12 Some numerical experiments with a PETSc code
In this section, we discuss a few experiments that we executed on our Beowulf
PC cluster for solving the Poisson equation −∆u = f on a square domain by
using a 5-point finite difference scheme [136]. This equation is simple and we
encountered it when discussing the red–black reordering to enhance parallelism
when using the GS preconditioner in Chapter 2, Section 2.3.10.3.
In Table 5.5, we list solution times and, in parentheses, the number of iteration
steps needed to reduce the initial residual by a factor $10^{-16}$. Data are listed for three
problem sizes. The number of grid points in one axis direction is $n$, so the order
of the linear system is $n^2$. In consequence, we are solving problems of order 16,129,
65,025, and 261,121 corresponding to n = 127, 255, and 511, respectively. In
the left-hand column of Table 5.5 is given the data for the conjugate gradient
method solution when using a Jacobi preconditioner.
In the second column, indicated by “Block Jacobi (1),” the data are obtained
using a tridiagonal preconditioner obtained by dropping the outermost two
diagonals of the Poisson matrix shown in Figure 2.10. To that end, we assemble
this tridiagonal matrix, say M, the same way as in Figure 5.26. However, we now
assign the rows to each processor (rank). Here, procs denotes the number of pro-
cessors involved and myid is the rank of the actual processor (Figure 5.31). These
numbers can be obtained by calling MPI functions or the corresponding func-
tions of PETSc. Parameter nlocal is a multiple of n and is equal to blksize on
all but possibly the last processor. The function that sets the involved operators,
cf. Figure 5.28, then reads
ierr = SLESSetOperators(sles,A,M,
DIFFERENT_NONZERO_PATTERN);
The third column in Table 5.5, indicated by “Block Jacobi (2),” is obtained
by replacing M again by A. In this way the whole Poisson matrix is taken
Table 5.5 Execution times in seconds (iteration steps) for solving an
$n^2 \times n^2$ linear system from the two-dimensional Poisson problem using
a preconditioned conjugate gradient method.
n p Jacobi Block Jacobi (1) Block Jacobi (2) IC (0)
127 1 2.95 (217) 4.25 (168) 0.11 (1) 5.29 (101)
2 2.07 (217) 2.76 (168) 0.69 (22) 4.68 (147)
4 1.38 (217) 1.66 (168) 0.61 (38) 3.32 (139)
8 1.32 (217) 1.53 (168) 0.49 (30) 3.12 (137)
16 1.51 (217) 1.09 (168) 0.50 (65) 2.02 (128)
255 1 29.0 (426) 34.6 (284) 0.50 (1) 42.0 (197)
2 17.0 (426) 22.3 (284) 4.01 (29) 31.6 (263)
4 10.8 (426) 12.5 (284) 3.05 (46) 20.5 (258)
8 6.55 (426) 6.91(284) 2.34 (66) 14.6 (243)
16 4.81 (426) 4.12(284) 16.0 (82) 10.9 (245)
32 4.23 (426) 80.9 (284) 2.09 (113) 21.7 (241)
511 1 230.0 (836) 244.9 (547) 4.1 (1) 320.6 (384)
2 152.2 (836) 157.3 (547) 36.2 (43) 253.1 (517)
4 87.0 (836) 86.2 (547) 25.7 (64) 127.4 (480)
8 54.1 (836) 53.3 (547) 15.1 (85) 66.5 (436)
16 24.5 (836) 24.1 (547) 21.9 (110) 36.6 (422)
32 256.8 (836) 17.7 (547) 34.5 (135) 107.9 (427)
blksize = (1 + (n-1)/procs)*n;
nlocal = min((myid+1)*blksize,n*n) - myid*blksize;
ierr = MatCreate(PETSC_COMM_WORLD,nlocal,nlocal,
n*n,n*n,&A);
Fig. 5.31. Defining PETSc block sizes that coincide with the blocks of the Poisson
matrix.
into account when the diagonal blocks are determined. Thus, the preconditioner
changes as the processor number changes. For p = 1 we have M = A such that
the iteration converges in one iteration step. This shows how much time a direct
solution would take. The last column in Table 5.5, indicated by “IC (0)” is from
an incomplete Cholesky factorization preconditioner with zero fill-in as discussed
in Section 2.3.10.5. To use this preconditioner, the matrices have to be stored in
the MPIRowbs format,
ierr = MatCreateMPIRowbs(PETSC_COMM_WORLD,
PETSC_DECIDE, n*n, PETSC_DEFAULT,
PETSC_NULL, &A);
The MPIRowbs matrix format is the format that BlockSolve95 [82] uses to store
matrices. PETSc can compute incomplete Cholesky factorizations, but only the
diagonal blocks. In contrast, BlockSolve95 provides a global ICC factorization
that we want to exploit here. PETSc provides an interface to BlockSolve95
that is accessible just through the above command. To run the code, we have
to indicate to BlockSolve95 that our system is a scalar system, meaning that
there is one degree of freedom per grid point. This can be done by the option
-mat_rowbs_no_inode, which can also be set by
PetscOptionsSetValue("-mat_rowbs_no_inode",0);
This statement has to appear after PetscInitialize() but before any other
PETSc command.
Now let us look more closely at the numbers in Table 5.5. The first two
columns show preconditioners that do not depend on the processor number.
Therefore the iteration counts are constant for a problem size. With both Jacobi
and Block Jacobi (1), the iteration count increases in proportion to the number
of grid points n in each space dimension. Notice that 1/n is the mesh width.
In the cases Block Jacobi (2) and IC (0), the preconditioners depend on the
processor number, so the iteration count also changes with p. Clearly, with
Block Jacobi (2), the preconditioner gets weaker as p grows, so the iteration
count increases. With IC (0) the iteration count stays relatively constant for one
problem size. The incomplete factorization is determined not only by matrix A
but also by p. Communication issues must also be taken into account.
As with dense systems, the speedups are small for small problem sizes.
While a problem size of 16,129 may be considered large in the dense case, this is
not so for sparse matrices: the matrix A for n = 127 only has
about 5 · 127² = 80,645 nonzero elements. For the smallest problem size one
might justify the use of up to 4 processors, but for the larger problems up to 16
processors still give increasing speedups. Recall that we used only one processor
of the dual processor nodes of our Beowulf. The execution times for p ≤ 16
were obtained on processors from one frame of 24 processors. Conversely, the
times with 32 processors involved the interframe network with lower (aggregated)
bandwidth. At some places the loss in performance when going from 16 to 32
processors is huge while at others it is hardly noticeable.
The execution times indicate that on our Beowulf the Block Jacobi (2) pre-
conditioner worked best. It takes all of the local information into account and
does the most it can with it, that is, does a direct solve. This clearly gave the
lowest iteration counts. IC (0) is not as efficient in reducing the iteration counts,
but of course it requires much less memory than Block Jacobi (2). On the other
hand, IC (0) does not reduce the iteration numbers so much that it can compete
with the very simple Jacobi and Block Jacobi (1) preconditioners.
In summary, on a machine with a weak network, it is important to reduce
communications as much as possible. This implies that the number of iterations
needed to solve a problem should be small. Of the four preconditioners that we
illustrate here, the one that consumes the most memory but makes the most out
of the local information performed best.
Exercise 5.1 Effective bandwidth test From Pallas,
http://www.pallas.com/e/products/pmb/download.htm,
download the EFF BW benchmark test. See Section 4.2 for a brief description
of EFF BW, in particular, Figures 4.3 and 4.4. Those figures are given for the
HP9000 cluster, a shared memory machine. However, the EFF BW test uses MPI,
so is completely appropriate for a distributed memory system. What is to be
done? Unpack the Pallas EFF BW test, edit the makefile for your machine, and
construct the benchmark. Run the test for various message lengths, different
numbers of CPUs, and determine: (1) the bandwidth for various message sizes,
and (2) the latency just to do the handshake, that is, extrapolate to zero message
length to get the latency.
Exercise 5.2 Parallel MC integration In Section 5.10 is described
a parallel MC numerical integration of an arbitrary function f(x, y) for five
CPUs. Namely, for the star-shaped region, each contribution—the central
square and four points of the star—is computed on a different processor.
A copy of the code can be downloaded from our website, the filename is
Chapter5/uebung5.c:
www.inf.ethz.ch/˜arbenz/book
What is to be done? In this exercise, we would like you to modify this code in
several ways:
1. To test the importance of independent random number streams:
(a) noting the commented seeds for each cpu line in the code, you
can see five random number seeds, 331, 557, etc. This is crude and
the numbers are primes. Change these to almost anything other than
zero, however, and look for effects on the resulting integration.
(b) In fact, try the same seed on each processor and see if there are any
noticeable effects on the integration result.
(c) What can you conclude about the importance of independent
streams for a numerical integration? Integration is an additive
process, so there is little communication.
2. By closer examination of Figure 5.24, find a modification to further sub-
divide each independent region. There are some hints below on how to do
this. Modify the code to run on 20 processors. Our Beowulf machine is
a perfect platform for these tests.
3. Again run the integration on 20 CPUs and compare your results with
the 5-CPU version. Also, repeat the test for dependence on the random
number streams.
Hint: To modify the code for 20 CPUs, refer to Figure 5.24. The points of the
star, labeled N,S,E,W, were done with cut and paste. We used the following
initial sampling for these points:
\[
x = \mathrm{ran3} - \tfrac{1}{2}, \qquad y = \mathrm{ran3} - \tfrac{1}{2}.
\]
The corners of the distribution were cut, rotated around the y-axis, and shifted
up or down and inserted into place to make the triangles. The N, W, and S
points are simply rotations of the E-like sample in Figure 5.24:
\[
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix},
\]
where θ = π/2, π, 3π/2 respectively. The points of the star-shaped region were
indexed by rank= 1, 2, 3, 4 in counterclockwise order. The central region can be
simply subdivided into four equal squares, each sampled analogous to the (x, y)
sampling above. To do the points, dividing the base = 1, length = 2 triangles
into four similar ones is an exercise in plane geometry. However, you may find
the algebra of shifts and rotations a little messy. Checking your result against
the 5-CPU version is but one way to test your resulting code. Reference: P.
Pacheco [115].
Exercise 5.3 Solving partial differential equations by MC: Part II In
Chapter 2, Exercise 2.1, a solution method was given for solving elliptic partial
differential equations by an MC procedure. Using your solution to that exer-
cise, the current task is to parallelize the method using MPI. What is to be
done?
1. Starting with your solution from Chapter 2, the simplest parallelization
procedure is to compute multiple initial x values in parallel. That is, given
x to start the random walk, a solution u(x) is computed. Since each x is
totally independent of any other, you should be able to run several different
x values each on a separate processor.
2. Likewise, either as a Gedanken experiment or by writing a PBS script,
you should be able to imagine that several batch jobs each computing
a different u(x) value is also a perfectly sensible way to do this in parallel.
After all, no x value computation communicates with any other.
Exercise 5.4 Matrix–vector multiplication with the PBLAS The pur-
pose of this exercise is to initialize distributed data, a matrix and two vectors,
and to actually use one of the routines of the parallel BLAS. We have indicated
in Section 5.6 how the matrix–vector product is implemented in PBLAS, more
information can be found in the ScaLAPACK users guide [12] which is available
on-line, too.
What is to be done?
Initialize a pr ×pc process grid as square as possible, cf. Figure 5.8, such that
pr · pc = p, the number of processors that you want to use. Distribute a 50 ×100
matrix, say A, in a cyclic blockwise fashion with 5 × 5 blocks over the process
grid. Initialize the matrix such that
\[
a_{i,j} = |i - j|.
\]
Distribute the vector x over the first row of the process grid, and initialize it
such that $x_i = (-1)^i$. The result shall be put in vector y that is distributed over
the first column of the process grid. You do not need to initialize it.
Then call the PBLAS matrix–vector multiply, pdgemv, to compute y = Ax.
If all went well, the elements of y all have the same value.
Hints
• Use the tar file Chapter 5/uebung6a.tar as a starting point for your work.
It contains a skeleton and a make file which is necessary for correct compila-
tion. We included a new qsub mpich which you should use for submission.
Check out the parameter list by just calling qsub mpich (without parame-
ters) before you start. Note that your C code is supposed to read and use
those parameters.
Exercise 5.5 Solving Ax = b using ScaLAPACK This exercise contin-
ues the previous Exercise 5.4, but now we want to solve a system of equations
whose right-hand side comes from a matrix–vector product. So, the matrix A must
be square (n × n) and nonsingular, and the two vectors x and y have the same
length n. To make things simple, initialize the elements of A randomly (e.g.
by rand), except the diagonal elements, which you set to n. Set all elements of x
equal to one.
What is to be done?
Proceed as in the previous Exercise 5.4 to get y = Ax. Then call the
ScaLAPACK subroutine pdgesv to solve Ax = y. Of course, this should give
back the vector x you started from. Check this by an appropriate function that
you find in Table 5.2.
Exercise 5.6 Distribute a matrix by a master–slave procedure In the
examples before, each processor computed precisely those entries of the matrix
A that it needs to store for the succeeding computation. Now, we assume that
the matrix is read by the “master” process 0 from a file and distributed to the
other processes 1 to p, the “slaves.”
What is to be done?
Write a function (using MPI primitives) that distributes the data from pro-
cess rank zero to all involved processes. You are given the numbers pr, pc, br,
bc, and N, a two-dimensional array Aglob[N][N] and finally an array bglob[N]
containing the data to be spread.
Try not to waste much memory on the different processes for the local storage.
ScaLAPACK requires that all the local memory blocks are of the same size.
Consult the manual and the example.
Hints
(a) Use the tar file Chapter5/uebung6.tar as a starting point for your work.
It contains a skeleton and a make file which is necessary for correct
compilation. We included a new qsub mpich which you should use for sub-
mission. Check out the parameter list by just calling qsub mpich (without
parameters) before you start. Note that your C code is supposed to read
and use those parameters.
(b) The manuals, reference cards and source code needed for this assignment
(uebung6.tar) can be downloaded from our website [6].
Exercise 5.7 More of ScaLAPACK This is a follow-up of Exercise 5.5.
The subroutine pdgesv for solving the systems of equations calls the subroutine
pdgetrf for the factorization and the subroutine pdgetrs for forward and back-
ward substitution. Measure the execution times and speedups of pdgetrf and
pdgetrs separately. Why are they so different?
APPENDIX A
SSE INTRINSICS FOR FLOATING POINT
A.1 Conventions and notation
Intel icc Compiler version 4.0 names reflect the following naming conventions:
an “_mm” prefix indicates an SSE2 vector operation, and is followed by a plain
spelling of the operation or the actual instruction's mnemonic. Suffixes indicating
datatype are: s = scalar, p = packed (means full vector), i = integer, with u
indicating “unsigned.” Datatypes are __m128, a vector of four 32-bit floating
point words or two 64-bit double precision words, and __m64, a vector of four
16-bit integers or two 32-bit integers. The set below is not a complete set of the
intrinsics: only those relevant to single precision floating point operations are
listed. The complete set of intrinsics is given in the primary reference [27], in
particular volume 3. The other two volumes contain more detailed information
on the intrinsic to hardware mappings.
Compilation of C code using these intrinsics requires the Intel icc compiler,
available from their web-site [24]. This compiler does some automatic vector-
ization and users are advised to turn on the diagnostics via the -vec_report
switch:
icc -O3 -axK -vec_report3 yourjob.c -lm.
Very important: On Intel Pentium III, 4, and Itanium chips, bit ordering is
little endian. This means that bits (and bytes, and words in vectors) in a datum
are numbered from right to left: the least significant bit is numbered 0 and the
numbering increases as the bit significance increases toward higher order.
A.2 Boolean and logical intrinsics
• __m128 _mm_andnot_ps(__m128, __m128)
Synopsis: Computes the bitwise AND–NOT of four operand pairs. d =
_mm_andnot_ps(a,b) computes, for $i = 0, \dots, 3$, the and of the complement
of $a_i$ and $b_i$: $d_i \leftarrow \neg a_i\ \text{and}\ b_i$. That is, where $a = (a_3, a_2, a_1, a_0)$ and
$b = (b_3, b_2, b_1, b_0)$, the result is
\[
d \leftarrow (\neg a_3\ \text{and}\ b_3,\ \neg a_2\ \text{and}\ b_2,\ \neg a_1\ \text{and}\ b_1,\ \neg a_0\ \text{and}\ b_0).
\]
• __m128 _mm_or_ps(__m128, __m128)
Synopsis: Vector bitwise or of four operand pairs. d = _mm_or_ps(a,b)
computes, for $i = 0, \dots, 3$, the or of $a_i$ and $b_i$: $d_i \leftarrow a_i\ \text{or}\ b_i$. That is,
$d \leftarrow a\ \text{or}\ b$.
• __m128 _mm_shuffle_ps(__m128, __m128, unsigned int)
Synopsis: Vector shuffle of operand pairs. d = _mm_shuffle_ps(a,b,c)
selects any two of the elements of a to set the lower two words of d;
likewise, any two words of b may be selected and stored into the upper
two words of d. The selection mechanism is encoded into c. The first bit
pair sets $d_0$ from a, the second bit pair sets $d_1$ from a, the third bit pair
sets $d_2$ from b, and the fourth bit pair sets $d_3$ from b. For example, c =
_MM_SHUFFLE(2,3,0,1) = 0xb1 = 10 11 00 01 selects (remember little
endian goes right to left):
    element 01 of a for $d_0$: $d_0 \leftarrow a_1$
    element 00 of a for $d_1$: $d_1 \leftarrow a_0$
    element 11 of b for $d_2$: $d_2 \leftarrow b_3$
    element 10 of b for $d_3$: $d_3 \leftarrow b_2$
with the result that $d \leftarrow (b_2, b_3, a_0, a_1)$, where $d = (d_3, d_2, d_1, d_0)$. As a
second example, using the same shuffle encoding as above (c = 0xb1), z =
_mm_shuffle_ps(y,y,c) gives $z \leftarrow (y_2, y_3, y_0, y_1)$, the gimmick we used to
turn the real and imaginary parts around in Figure 3.22 (a small compilable
example is given at the end of this section).
A final set of comparisons return integers and test only low order pairs. The general form for these is as follows.
• int mm ucomibr ss( m128, m128)
Synopsis: Vector binary relation (br) comparison of low order operand pairs. d = mm ucomibr ss(a,b) tests a_0 br b_0. If the binary relation br is satisfied, one (1) is returned; otherwise zero (0).
The set of possible binary relations (br) is shown in Table A.1.
• m128 mm xor ps( m128, m128)
Synopsis: Vector bitwise exclusive or (xor). d = mm xor ps(a,b) computes, for i = 0, . . . , 3, d_i ← a_i ⊕ b_i. That is, d ← a ⊕ b.
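To make the shuffle selection mechanism concrete, here is a minimal sketch (our own illustration, not part of the reference list above) that swaps the real and imaginary parts of four packed floats, exactly the Figure 3.22 gimmick. In actual source the names carry underscores (__m128, _mm_shuffle_ps, _MM_SHUFFLE), and the alignment attribute shown is the gcc/icc extension.

/* swap (re,im) pairs with a shuffle: z <- (y2,y3,y0,y1) */
#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
   float y[4] __attribute__ ((aligned (16))) = {0.0f, 1.0f, 2.0f, 3.0f};
   float z[4] __attribute__ ((aligned (16)));
   __m128 vy = _mm_load_ps(y);                                /* (y3,y2,y1,y0) */
   __m128 vz = _mm_shuffle_ps(vy, vy, _MM_SHUFFLE(2,3,0,1));  /* (y2,y3,y0,y1) */
   _mm_store_ps(z, vz);
   printf("%g %g %g %g\n", z[0], z[1], z[2], z[3]);           /* prints 1 0 3 2 */
   return 0;
}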
A.3 Load/store operation intrinsics
• m128 mm load ps(float*)
Synopsis: Load Aligned Vector Single Precision, d = mm load ps(a) loads, for i = 0, . . . , 3, a_i into d_i: d_i ← a_i. Pointer a must be 16-byte aligned. See Section 1.2, and particularly Section 3.19: this alignment is very important.
Table A.1 Available binary relations for the mm cmpbr ps and mm cmpbr ss intrinsics. Comparisons 1 through 6 (eq, lt, le, gt, ge, and neq) are the only ones available for the mm comibr collective comparisons. It should be noted that eq is nearly the same as unord except that the results are different if one operand is NaN [114].

Binary relation br   Description                  Mathematical expression   Result if operand NaN
eq                   Equality                     a = b                     False
lt                   Less than                    a < b                     False
le                   Less than or equal           a ≤ b                     False
gt                   Greater than                 a > b                     False
ge                   Greater than or equal        a ≥ b                     False
neq                  Not equal                    a ≠ b                     True
nge                  Not greater than or equal    a < b                     True
ngt                  Not greater than             a ≤ b                     True
nlt                  Not less than                a ≥ b                     True
nle                  Not less than or equal       a > b                     True
ord                  One is larger                a ≶ b                     False
unord                Unordered                    ¬ (a ≶ b)                 True
• void mm store ps(float*, m128)
Synopsis: Store Vector of Single Precision Data to Aligned Location, mm store ps(a,b) stores, for i = 0, . . . , 3, b_i into a_i: a_i ← b_i. Pointer a must be 16-byte aligned. See Section 1.2, and particularly Section 3.19: this alignment is very important.
• m128 mm movehl ps( m128, m128)
Synopsis: Move High to Low Packed Single Precision, d = mm movehl ps(a,b) moves the two high order words of b into the two low order positions of d and passes the two high order words of a into the high order positions of d. That is, where a = (a_3, a_2, a_1, a_0) and b = (b_3, b_2, b_1, b_0), mm movehl ps(a,b) sets d ← (a_3, a_2, b_3, b_2).
• m128 mm loadh pi( m128, m64*)
Synopsis: Load High Packed Single Precision, d = mm loadh pi(a,p) sets the two upper words of d with the 64 bits of data loaded from address p and passes the two low order words of a to the low order positions of d. That is, where the contents of the two words beginning at the memory location pointed to by p are (b_1, b_0) and a = (a_3, a_2, a_1, a_0), then mm loadh pi(a,p) sets d ← (b_1, b_0, a_1, a_0).
• void mm storeh pi( m64*, m128)
Synopsis: Store High Packed Single Precision, mm storeh pi(p,b) stores the two upper words of b into memory beginning at the address p. That is, if b = (b_3, b_2, b_1, b_0), then mm storeh pi(p,b) stores (b_3, b_2) into the memory locations pointed to by p.
• m128 mm loadl pi( m128, m64*)
Synopsis: Load Low Packed Single Precision, d = mm loadl pi(a,p) sets the two lower words of d with the 64 bits of data loaded from address p and passes the two high order words of a to the high order positions of d. That is, where the contents of the two words beginning at the memory location pointed to by p are (b_1, b_0) and a = (a_3, a_2, a_1, a_0), then mm loadl pi(a,p) sets d ← (a_3, a_2, b_1, b_0).
• void mm storel pi( m64*, m128)
Synopsis: Store Low Packed Single Precision, mm storel pi(p,b) stores the two lower words of b into memory beginning at the address p. That is, if b = (b_3, b_2, b_1, b_0), then mm storel pi(p,b) stores (b_1, b_0) into the memory locations pointed to by p.
• m128 mm load ss(float*)
Synopsis: Load Low Scalar Single Precision, d = mm load ss(p) loads the
lower word of d with the 32 bits of data from address p and zeros the three
high order words of d. That is, if the content of the memory location at p
is b, then mm load ss(p) sets d ← (0, 0, 0, b).
• void mm store ss(float*, m128)
Synopsis: Store Low Scalar Single Precision, mm store ss(p,b) stores the low order word of b into memory at the address p. That is, if b = (b_3, b_2, b_1, b_0), then mm store ss(p,b) stores b_0 into the memory location pointed to by p.
• m128 mm move ss( m128, m128)
Synopsis: Move Scalar Single Precision, d = mm move ss(a,b) moves the low order word of b into the low order position of d and passes the three high order words of a into d. That is, where a = (a_3, a_2, a_1, a_0) and b = (b_3, b_2, b_1, b_0), mm move ss(a,b) sets d ← (a_3, a_2, a_1, b_0).
• m128 mm loadu ps(float*)
Synopsis: Load Unaligned Vector Single Precision, d = mm loadu ps(a) loads, for i = 0, . . . , 3, a_i into d_i: d_i ← a_i. Pointer a need not be 16-byte aligned. See Section 1.2, and particularly Section 3.19.
• void mm storeu ps(float*, m128)
Synopsis: Store Vector of Single Precision Data to Unaligned Location, mm storeu ps(a,b) stores, for i = 0, . . . , 3, b_i into a_i: a_i ← b_i. Pointer a need not be 16-byte aligned. See Section 1.2, and particularly Section 3.19.
• m128 mm unpackhi ps( m128, m128)
Synopsis: Interleaves the two upper words, d = mm unpackhi ps(a,b) selects (a_3, a_2) from a and (b_3, b_2) from b and interleaves them. That is, d = mm unpackhi ps(a,b) sets d ← (b_3, a_3, b_2, a_2).
• m128 mm unpacklo ps( m128, m128)
Synopsis: Interleaves the two lower words, d = mm unpacklo ps(a,b) selects (a_1, a_0) from a and (b_1, b_0) from b and interleaves them. That is, d = mm unpacklo ps(a,b) sets d ← (b_1, a_1, b_0, a_0).
There is a final set of load/store intrinsics which are said to be composite, meaning that they cannot be mapped onto single hardware instructions. They are, however, extremely useful and we have used them in Figure 3.22.
• m128 mm set ps1(float)
Synopsis: Sets the four floating point elements to the same scalar input,
d = mm set ps1(a) sets d ← (a, a, a, a).
• m128 mm set ps(float,float,float,float)
Synopsis: Sets the four floating point elements to the array specified, d =
mm set ps(a3,a2,a1,a0) sets d ← (a3, a2, a1, a0).
• m128 mm setr ps(float,float,float,float)
Synopsis: Sets the four floating point elements to the array specified, but in reverse order: d = mm setr ps(a3,a2,a1,a0) sets d ← (a0, a1, a2, a3).
• m128 mm setzero ps(void)
Synopsis: Sets four elements to zero, d = mm setzero ps() sets d ←
(0, 0, 0, 0).
• m128 mm load ps1(float*)
Synopsis: Sets the four floating point elements to a scalar taken from
memory, d = mm load ps1(p) sets d ← (∗p, ∗p, ∗p, ∗p).
• m128 mm loadr ps(float*)
Synopsis: Sets the four floating point elements to a vector of elements taken from memory, but in reverse order. d = mm loadr ps(p) sets d ← (∗p, ∗(p + 1), ∗(p + 2), ∗(p + 3)).
• void mm store ps1(float*, m128)
Synopsis: Stores the low order word into four locations, mm store ps1(p,a) stores a_0 into locations p+3, p+2, p+1, p. That is, p[3] = a[0], p[2] = a[0], p[1] = a[0], p[0] = a[0].
• void mm storer ps(float*, m128)
Synopsis: Stores four floating point elements into memory, but in reverse order: mm storer ps(p,b) sets p[3] = b[0], p[2] = b[1], p[1] = b[2], p[0] = b[3].
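As a small usage sketch of these load/store intrinsics (our own illustration with hypothetical names; the multiply intrinsic appears in Section A.8), the routine below reads four floats from a 16-byte aligned source, scales them by a constant replicated with mm set ps1, and writes to a possibly unaligned destination. Real code spells the names with underscores.

#include <xmmintrin.h>

/* out need not be aligned; in is assumed 16-byte aligned */
void scale4(float *out, float *in)
{
   __m128 v = _mm_load_ps(in);        /* aligned load       */
   __m128 c = _mm_set_ps1(2.0f);      /* (2,2,2,2)          */
   v = _mm_mul_ps(v, c);              /* elementwise 2*in_i */
   _mm_storeu_ps(out, v);             /* unaligned store    */
}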
A.4 Vector comparisons
The general form for comparison operations using binary relations is as
follows.
• m128 mm cmpbr ps( m128, m128)
Synopsis: Test Binary Relation br, d = mm cmpbr ps(a,b) tests, for i = 0, . . . , 3, a_i br b_i, and if the br relation is satisfied d_i = all 1s; d_i = 0 otherwise.
The binary relation br must be one of those listed in Table A.1.
For example, if br is lt (less than),
• m128 mm cmplt ps( m128, m128)
Synopsis: Compare for Less Than, d = mm cmplt ps(a,b) compares, for each i = 0, . . . , 3, the pairs (a_i, b_i), and sets d_i = all 1s if a_i < b_i; otherwise d_i = 0.
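The all-1s/all-0s masks produced by these comparisons combine naturally with the boolean intrinsics of Section A.2 to give branch-free selection. The sketch below (our own illustration; _mm_and_ps is the bitwise AND companion of the intrinsics listed in Section A.2) computes d_i = (a_i < b_i) ? x_i : y_i.

#include <xmmintrin.h>

__m128 select_lt(__m128 a, __m128 b, __m128 x, __m128 y)
{
   __m128 mask = _mm_cmplt_ps(a, b);           /* all 1s where a_i < b_i  */
   return _mm_or_ps(_mm_and_ps(mask, x),       /* keep x_i where mask set */
                    _mm_andnot_ps(mask, y));   /* keep y_i elsewhere      */
}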
A.5 Low order scalar in vector comparisons
Other comparisons, with suffix ss, compare low order scalar elements within a vector. Table A.1 indicates which br binary relations are available for ss suffix comparisons. As an example of an ss comparison, if br is less than (<),
• m128 mm cmplt ss( m128, m128)
Synopsis: Compare Scalar Less Than, d = mm cmplt ss(a,b) computes a mask of 32 bits in d_0: if the low order pair satisfies a_0 < b_0, then d_0 = all 1s; otherwise d_0 = 0. That is, if a = (a_3, a_2, a_1, a_0) and b = (b_3, b_2, b_1, b_0), then mm cmplt ss(a,b) only compares the (a_0, b_0) pair for less than, and sets only d_0 accordingly. The other elements of a and b are not compared and the corresponding elements of d are not reset.
The general form for these ss comparisons is as follows.
• m128 mm cmpbr ss( m128, m128)
Synopsis: Compare Scalar binary relation, d = mm cmpbr ss(a,b) computes a mask of 32 bits in d_0: if the low order pair satisfies a_0 br b_0, then d_0 = all 1s; otherwise d_0 = 0. The remaining elements of a and b are not compared and the higher order elements of d are not reset.
A.6 Integer valued low order scalar in vector comparisons
Comparisons labeled comi, with suffix ss, also only compare low order scalar elements within a vector. The first six entries in Table A.1 indicate which br binary relations are available for comi ss suffix comparisons.
The general form for these comi ss comparisons is as follows.
• int mm comibr ss( m128, m128)
Synopsis: Scalar binary relation (br), d = mm comibr ss(a,b) compares the low order pair with binary relation br and if a_0 br b_0, then d = 1; otherwise d = 0.
For example, if br is ge (greater than or equal), then d = mm comige ss(a,b) compares the low order pair (a_0, b_0) and if a_0 ≥ b_0 then d = 1; otherwise d = 0 is returned.
A.7 Integer/floating point vector conversions
• m128 mm cvt pi2ps( m128, m64)
Synopsis: Packed Signed Integer to Packed Floating Point Conversion, d = mm cvt pi2ps(a,b) converts the two 32-bit integers in b into two 32-bit floating point words and returns the result in d. The high order words of a are passed to d. That is, if the m64 vector b = (b_1, b_0) and the m128 vector a = (a_3, a_2, a_1, a_0), then d = (a_3, a_2, (float)b_1, (float)b_0).
• m64 mm cvt ps2pi( m128)
Synopsis: Packed Floating Point to Signed Integer Conversion, d = mm cvt ps2pi(a) converts the two low order 32-bit floating point words in a into two 32-bit integer words in the current rounding mode and returns the result in d. The two high order words of a are ignored. That is, if the m128 vector is a = (a_3, a_2, a_1, a_0), then the m64 vector d = ((int)a_1, (int)a_0).
• m128 mm cvt si2ss( m128, int)
Synopsis: Scalar Signed Integer to Single Floating Point Conversion, d = mm cvt si2ss(a,b) converts the 32-bit signed integer b to a single precision floating point entry which is stored in the low order word d_0. The three high order elements of a, (a_3, a_2, a_1), are passed through into d. That is, if the m128 vector a = (a_3, a_2, a_1, a_0), then d ← (a_3, a_2, a_1, (float)b).
• int mm cvtt ss2si( m128)
Synopsis: Scalar Single Floating Point to Signed Integer Conversion with truncation, d = mm cvtt ss2si(a) converts the low order 32-bit floating point element a_0 to a signed integer result d with truncation. If a = (a_3, a_2, a_1, a_0), then d ← (int)a_0.
• m64 mm cvtt ps2pi( m128)
Synopsis: Packed Single Precision to Packed Integer Conversion with truncation, d = mm cvtt ps2pi(a) converts the two lower order single precision floating point elements of a into two 32-bit integers with truncation. These are stored in d. That is, if the m128 vector a = (a_3, a_2, a_1, a_0), then (d_1, d_0) = d ← ((int)a_1, (int)a_0).
• int mm cvt ss2si( m128)
Synopsis: Scalar Single Floating Point to Signed Integer Conversion, d = mm cvt ss2si(a) converts the low order 32-bit floating point element a_0 to a signed integer result d using the current rounding mode. If a = (a_3, a_2, a_1, a_0), then d ← (int)a_0.
A.8 Arithmetic function intrinsics
• m128 mm add ps( m128, m128)
Synopsis: Add Four Floating Point Values, d = mm add ps(a,b) computes, for i = 0, . . . , 3, the sums d_i = a_i + b_i.
• m128 mm add ss( m128, m128)
Synopsis: Add lowest numbered floating point elements, pass remaining from the first operand. d = mm add ss(a,b) computes a_0 + b_0 and passes the remaining a_i to d. That is, where a = (a_3, a_2, a_1, a_0) and b = (b_3, b_2, b_1, b_0), the result is d ← (a_3, a_2, a_1, a_0 + b_0).
• m128 mm div ps( m128, m128)
Synopsis: Vector Single Precision Divide, d = mm div ps(a,b) computes d ← (a_3/b_3, a_2/b_2, a_1/b_1, a_0/b_0).
• m128 mm div ss( m128, m128)
Synopsis: Scalar Single Precision Divide, d = mm div ss(a,b) computes the division of the low order floating point pair a_0/b_0 and passes the remaining elements of a to d. That is, d ← (a_3, a_2, a_1, a_0/b_0).
• m128 mm max ps( m128, m128)
Synopsis: Vector Single Precision Maximum, d = mm max ps(a,b) computes, for i = 0, . . . , 3, the respective maxima d_i ← a_i ∨ b_i. That is, d ← (a_3 ∨ b_3, a_2 ∨ b_2, a_1 ∨ b_1, a_0 ∨ b_0).
• m128 mm max ss( m128, m128)
Synopsis: Scalar Single Precision Maximum, d = mm max ss(a,b) computes a_0 ∨ b_0 and passes the remaining elements of a to d. That is, d ← (a_3, a_2, a_1, a_0 ∨ b_0).
• m128 mm min ps( m128, m128)
Synopsis: Vector Single Precision Minimum, d = mm min ps(a,b) computes, for i = 0, . . . , 3, the respective minima d_i ← a_i ∧ b_i. That is, d ← (a_3 ∧ b_3, a_2 ∧ b_2, a_1 ∧ b_1, a_0 ∧ b_0).
• m128 mm min ss( m128, m128)
Synopsis: Scalar Single Precision Minimum, d = mm min ss(a,b) computes a_0 ∧ b_0 and passes the remaining elements of a to d. That is, d ← (a_3, a_2, a_1, a_0 ∧ b_0).
• m128 mm mul ps( m128, m128)
Synopsis: Vector Single Precision Multiply, d = mm mul ps(a,b) computes, for i = 0, . . . , 3, the single precision floating point products a_i · b_i. That is,
d ← (a_3 · b_3, a_2 · b_2, a_1 · b_1, a_0 · b_0).
• m128 mm mul ss( m128, m128)
Synopsis: Scalar Single Precision Multiply, d = mm mul ss(a,b) computes the single precision floating point product a_0 · b_0 and stores this in d_0; the remaining elements of d are set with the high order elements of a. That is,
d ← (a_3, a_2, a_1, a_0 · b_0).
• m128 mm rcp ps( m128)
Synopsis: Vector Reciprocal Approximation, d = mm rcp ps(a) computes approximations to the single precision reciprocals 1.0/a_i. That is,
d ← (1.0/a_3, 1.0/a_2, 1.0/a_1, 1.0/a_0).
The maximum error is
|d_i · a_i − 1.0| ≤ 1.5 × 2^−12
(a usage sketch, including one Newton refinement step, appears at the end of this appendix).
• m128 mm rcp ss( m128)
Synopsis: Scalar Reciprocal Approximation, d = mm rcp ss(a) computes an approximation to the single precision reciprocal 1.0/a_0 which is stored in d_0; the remaining elements of a are passed through. That is,
d ← (a_3, a_2, a_1, 1.0/a_0).
The maximum error is
|d_0 · a_0 − 1.0| ≤ 1.5 × 2^−12.
• m128 mm rsqrt ps( m128)
Synopsis: Approximation of Vector Reciprocal Square Root, d = mm rsqrt ps(a) computes approximations to the single precision reciprocal square roots 1.0/√a_i. That is,
d ← (1/√a_3, 1/√a_2, 1/√a_1, 1/√a_0).
The maximum error is
|d_i · √a_i − 1.0| ≤ 1.5 × 2^−12.
• m128 mm rsqrt ss( m128)
Synopsis: Approximation of Scalar Reciprocal Square Root, d = mm rsqrt ss(a) computes an approximation to the single precision reciprocal square root 1.0/√a_0, and the remaining values of a are passed through. That is,
d ← (a_3, a_2, a_1, 1/√a_0).
The maximum error is
|d_0 · √a_0 − 1.0| ≤ 1.5 × 2^−12.
• m128 mm sqrt ps( m128)
Synopsis: Vector Square Root, d = mm sqrt ps(a) computes, for i = 0, . . . , 3, the square roots d_i = √a_i. That is,
d ← (√a_3, √a_2, √a_1, √a_0).
• m128 mm sqrt ss( m128)
Synopsis: Scalar Square Root, d = mm sqrt ss(a) computes the square root d_0 = √a_0; the remaining elements of a are passed through. That is,
d ← (a_3, a_2, a_1, √a_0).
• m128 mm sub ps( m128, m128)
Synopsis: Vector Subtract, d = mm sub ps(a,b) computes, for i = 0, . . . , 3, the differences a_i − b_i. That is,
d ← (a_3 − b_3, a_2 − b_2, a_1 − b_1, a_0 − b_0).
• m128 mm sub ss( m128, m128)
Synopsis: Scalar Subtract, d = mm sub ss(a,b) computes the difference d_0 = a_0 − b_0, and the remaining values of a are passed through. That is,
d ← (a_3, a_2, a_1, a_0 − b_0).
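The following sketch (our own illustration with hypothetical names; arrays assumed 16-byte aligned and n a multiple of 4) shows two typical uses of the arithmetic intrinsics above: a vectorized saxpy step y ← αx + y, and the Newton iteration r ← r(2 − ar) which sharpens the 12-bit reciprocal estimate of mm rcp ps to nearly full single precision.

#include <xmmintrin.h>

void saxpy4(int n, float alpha, float *x, float *y)
{
   __m128 va = _mm_set_ps1(alpha);
   int i;
   for (i = 0; i < n; i += 4) {
      __m128 vx = _mm_load_ps(x + i);
      __m128 vy = _mm_load_ps(y + i);
      vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));   /* y_i + alpha*x_i */
      _mm_store_ps(y + i, vy);
   }
}

__m128 refined_recip(__m128 a)                    /* ~single precision 1/a_i */
{
   __m128 r   = _mm_rcp_ps(a);                    /* 12-bit estimate */
   __m128 two = _mm_set_ps1(2.0f);
   return _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(a, r)));
}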
APPENDIX B
ALTIVEC INTRINSICS FOR FLOATING POINT
The Apple/Motorola Altivec intrinsics are invoked using the Apple developers’
kit C compiler [22]. This version of gcc has been extensively modified by Apple
within the Darwin program and now effects some automatic vectorization. The
use of Altivec intrinsics requires the -faltivec switch:
gcc -O3 -faltivec yourcode.c -lm.
Very important: On the Motorola/IBM/Apple G-4 chips, the bit ordering
(also byte, and word for vector data) convention is big endian. This means the
highest order (most significant) bit is numbered 0 and the numbering increases
toward less significant bits—numbering is left to right. The set of intrinsics given
below is not complete: only those intrinsics relevant to single precision floating
point operations are given. A more detailed and complete set is given in the
Motorola manual [22].
B.1 Mask generating vector comparisons
• vector signed int vec cmpb (vector float, vector float)
Synopsis: Vector Compare Bounds Floating Point, d = vec cmpb(a,b) computes the two-bit results d_i[0, 1] = [a_i > b_i, a_i < −b_i], for i = 0, . . . , 3. For example, if a_i ≤ b_i, bit 0 of d_i will be set to 0, otherwise bit 0 of d_i will be set to 1; and if a_i ≥ −b_i, bit 1 of d_i will be set to 0, otherwise bit 1 will be set to 1.
Other comparisons are based on simple binary relations of the following form.
See Table B.1 for a list of available brs.
• vector bool int vec cmpbr(vector float, vector float)
Synopsis: Vector Compare binary relation, d = vec cmpbr(a,b) sets d_i = all 1s if the respective floating point elements satisfy a_i br b_i, but d_i = 0 otherwise. For example, if br is equality (=), then when a_i = b_i, the corresponding d_i of vector d will be set to a mask of all 1s, otherwise d_i = 0.
There are other comparison functions which represent collective comparisons with binary relations br: vec all br (collective all) and vec any br (collective any). These are described in Section B.6 on Collective Comparisons.
Table B.1 Available binary relations for comparison functions. For additional relations applicable to Collective Operations, see Table B.2.

All comparison functions
Binary relation br   Description              Mathematical expression
eq                   Equality                 a = b
ge                   Greater than or equal    a ≥ b
gt                   Greater than             a > b
le                   Less than or equal       a ≤ b
lt                   Less than                a < b
The mask (results d above) is used with the following selection functions.
• vector float vec sel(vector float, vector float, vector bool int)
Synopsis: Vector Select, d = vec sel(a,b,c) sets successive bits of d
either to the corresponding bit of b or to that of a according to whether
the indexing bit of c is set. Perhaps more precisely, if i = 0, . . . , 127
indexes the bits of d, a, b, then d[i] = (c[i] == 1)?b[i]:a[i], see [84],
Section 2.11.
• vector float vec splat(vector float, unsigned literal)
Synopsis: Vector Splat, d = vec splat(a,b) selects element b mod 4 from a and distributes it to each element of d. For example, d = vec splat(a,7) chooses element a_3 from a because 3 = 7 mod 4, so d ← (a_3, a_3, a_3, a_3). If d and a are not vector float, then the index is b mod n, where n is the number of elements of a and d. If these were byte vectors, the index will be k = b mod 16 and the corresponding byte k of a will be distributed to d.
B.2 Conversion, utility, and approximation functions
• vector float vec ctf(vector int, unsigned literal)
Synopsis: Vector Convert from Fixed Point Word, d = vec ctf(a,b) computes d_i = (float)a_i · 2^b, where b is a 5-bit unsigned literal. For example, if a_i = 7.1 and b = 3, then d = vec ctf(a,3) gives element d_i = 56.8 = 7.1 · 2^3.
• vector float vec expte(vector float)
Synopsis: Vector 2 Raised to the Exponent Estimate Floating Point, d = vec expte(a) computes d_i = 2^{a_i} to a 3-bit accurate result. For example, if a_i = 0.5, then √2 ≈ d_i = 1.5 = 2^1 · (1 · 2^−1 + 1 · 2^−2 + 0 · 2^−3).
• vector float vec floor(vector float)
Synopsis: Vector Floor, d = vec floor(a) computes d_i = ⌊a_i⌋, for i = 0, . . . , 3, where ⌊a_i⌋ is the floating point representation of the largest integer less than or equal to a_i. For example, in the case a_2 = 37.3, then d_2 ← 37.0.
• vector float vec loge(vector float)
Synopsis: Vector log_2 Estimate Floating Point, d = vec loge(a) computes d_i = log_2 a_i to a 3-bit accurate result. For example, if a_i = e, then log_2(a_i) = 1/log(2) ≈ d_i = 1.5 = 2^1 · (1 · 2^−1 + 1 · 2^−2 + 0 · 2^−3).
• vector float vec mergeh(vector float, vector float)
Synopsis: Vector Merge High, d = vec mergeh(a,b) merges a and b according to d = (d_0, d_1, d_2, d_3) = (a_0, b_0, a_1, b_1).
• vector float vec mergel(vector float, vector float)
Synopsis: Vector Merge Low, d = vec mergel(a,b) merges a and b according to d = (d_0, d_1, d_2, d_3) = (a_2, b_2, a_3, b_3).
• vector float vec trunc(vector float)
Synopsis: Vector Truncate, d = vec trunc(a) computes, for i = 0, . . . , 3, d_i = ⌊a_i⌋ if a_i ≥ 0.0, or d_i = ⌈a_i⌉ if a_i < 0.0. That is, each a_i is rounded to an integer (floating point representation) toward zero in the IEEE round-to-zero mode.
• vector float vec re(vector float)
Synopsis: Vector Reciprocal Estimate, d = vec re(a) computes a reciprocal approximation d_i ≈ 1.0/a_i for i = 0, . . . , 3. This approximation is accurate to 12 bits: that is, the maximum error of the estimate d_i satisfies
|d_i · a_i − 1| ≤ 2^−12.
• vector float vec round(vector float)
Synopsis: Vector Round, d = vec round(a) computes, for each i = 0, . . . , 3, d_i as the nearest integer (in floating point representation) to a_i in IEEE round-to-nearest mode. For example, a_i = 1.499 yields d_i = 1.0. If ⌊a_i⌋ and ⌈a_i⌉ are equally near, rounding is to the even integer: a_i = 1.5 yields d_i = 2.0.
• vector float vec rsqrte(vector float)
Synopsis: Vector Reciprocal Square Root Estimate, d = vec rsqrte(a) computes, for i = 0, . . . , 3, an approximation to each reciprocal square root 1/√a_i to 12 bits of accuracy. That is, for each a_i, i = 0, . . . , 3,
|d_i · √a_i − 1| ≤ 2^−12.
B.3 Vector logical operations and permutations
• vector float vec and(vector float, vector float)
Synopsis: Vector Logical AND, d = vec and(a,b) computes, for i = 0, . . . , 3, d_i = a_i and b_i bitwise.
• vector float vec andc(vector float, vector float)
Synopsis: Vector Logical AND with 1s Complement, d = vec andc(a,b) computes d_i = a_i and ¬ b_i for i = 0, . . . , 3 bitwise.
• vector float vec ceil(vector float)
Synopsis: Vector Ceiling, d = vec ceil(a) computes d_i = ⌈a_i⌉, for i = 0, . . . , 3, where d_i is the floating point representation of the smallest integer greater than or equal to a_i. For example, if a_2 = 37.3, then d_2 ← 38.0.
• vector float vec nor(vector float, vector float)
Synopsis: Vector Logical NOR, d = vec nor(a,b) computes the bitwise or of each pair of elements a_i or b_i, then takes the 1s complement of that result: d_i = ¬ (a_i or b_i). In other words, vec nor(a,b) computes d = ¬ (a or b) considered as the negation of the 128-bit or of boolean operands a and b.
• vector float vec or(vector float, vector float)
Synopsis: Vector Logical OR, d = vec or(a,b) computes the bitwise or of each pair (a_i, b_i) and stores the results into the corresponding d_i for i = 0, . . . , 3. That is, d = a or b as a 128-bit boolean operation.
• vector float vec perm(vector float, vector float, vector
unsigned char)
Synopsis: Vector Permute, d = vec perm(a,b,c) permutes bytes of vectors a, b according to permutation vector c. Here is the schema: for the bytes of c, call these {c_i : i = 0, . . . , 15}, low order bits 4–7 of c_i index a byte of either a or b for selecting d_i: j = c_i[4 : 7]. Selection bit 3 of c_i, that is, c_i[3], picks either a_j or b_j according to c_i[3] = 0 (sets d_i ← a_j), or c_i[3] = 1 (sets d_i ← b_j). For example, if byte c_2 = 00011001, then j = 1001 = 9, so d_2 = b_9 because bit 3 of c_2 is 1, whereas if c_2 = 00001001, then d_2 = a_9 because bit 3 of c_2, that is, c_2[3] = 0 in that case. Examine variable pv3201 in the Altivec FFT of Section 3.6 for a more extensive example.
• vector float vec xor(vector float, vector float)
Synopsis: Vector Logical XOR, d = vec xor(a,b) computes d = a ⊕ b.
That is, the exclusive or (xor) of 128-bit quantities a and b is taken and
the result stored in d.
B.4 Load and store operations
• vector float vec ld(int, float*)
Synopsis: Vector Load Indexed, d = vec ld(a,b) loads vector d with four
elements beginning at the memory location computed as the largest 16-byte
aligned location less than or equal to a+b, where b is a float* pointer.
• vector float vec lde(int, float*)
Synopsis: Vector Load Element Indexed, d = vec lde(a,b) loads an element d_k from the location computed as the largest 16-byte aligned location less than or equal to a + b which is a multiple of 4. Again, b is a float* pointer. Index k is computed from the aligned address mod 16, then divided by 4. All other d_i values for i ≠ k are undefined.
• vector float vec ldl(int, float*)
Synopsis: Vector Load Indexed least recently used (LRU), d = vec ldl(a,b) loads vector d with four elements beginning at the memory location computed as the largest 16-byte aligned location less than or equal to a + b, where b is a float* pointer. vec ldl is the same as vec ld except the load marks the cache line as least recently used: see Section 1.2.1.
• void vec st(vector float, int, float*)
Synopsis: Vector Store Indexed, vec st(a,b,c) stores the 4-word vector a beginning at the first 16-byte aligned address less than or equal to c + b. For example, vec st(a,4,c) will store a 16-byte aligned vector (a_0, a_1, a_2, a_3) into locations c_4, c_5, c_6, c_7, that is, at location c + 4.
• void vec ste(vector float, int, float*)
Synopsis: Vector Store Element Indexed, vec ste(a,b,c) stores a single floating point element a_k of a at the largest 16-byte aligned effective address (EA) less than or equal to b + c which is a multiple of 4. The indexed element a_k is chosen by k = (EA mod 16)/4.
• void vec stl(vector float, int, float*)
Synopsis: Vector Store Indexed LRU, vec stl(a,b,c) stores a at largest
16-byte aligned location less than or equal to b + c. The cache line stored
into is marked LRU.
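As a usage sketch of the indexed loads and stores (our own illustration with hypothetical names), the loop below copies an array four floats at a time; the second argument of vec ld and vec st is a byte offset, both arrays are assumed 16-byte aligned, and n is a multiple of 4. Compile with gcc -faltivec (or include altivec.h with other gcc ports).

void vcopy(int n, float *src, float *dst)
{
   int i;
   for (i = 0; i < n; i += 4) {
      vector float v = vec_ld(4*i, src);   /* byte offset 4*i */
      vec_st(v, 4*i, dst);
   }
}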
B.5 Full precision arithmetic functions on vector operands
• vector float vec abs(vector float)
Synopsis: Vector Absolute Value, d = vec abs(a) computes d_i ← |a_i| for i = 0, . . . , 3.
• vector float vec add(vector float, vector float)
Synopsis: Vector Add, d = vec add(a,b) computes d_i ← a_i + b_i for i = 0, . . . , 3.
• vector float vec madd(vector float, vector float, vector float)
Synopsis: Vector Multiply Add, d = vec madd(a,b,c) computes, for i = 0, . . . , 3, d_i = a_i · b_i + c_i. For example, if a scalar a → (a, a, a, a) (scalar propagated to a vector), then y = vec madd(a,x,y) is a saxpy operation, see Section 2.1 and the sketch following this list.
• vector float vec max(vector float, vector float)
Synopsis: Vector Maximum, d = vec max(a,b) computes, for i = 0, . . . , 3, d_i = a_i ∨ b_i, so each d_i is set to the larger of the pair (a_i, b_i).
• vector float vec min(vector float, vector float)
Synopsis: Vector Minimum, d = vec min(a,b) computes, for i = 0, . . . , 3, d_i = a_i ∧ b_i, so each d_i is set to the smaller of the pair (a_i, b_i).
• vector float vec nmsub(vector float, vector float, vector float)
Synopsis: Vector Multiply Subtract, d = vec nmsub(a,b,c) computes, for i = 0, . . . , 3, d_i = a_i · b_i − c_i. This intrinsic is similar to vec madd except d ← a · b − c with a minus sign.
• vector float vec sub(vector float, vector float)
Synopsis: Vector Subtract, d = vec sub(a,b) computes, for i = 0, . . . , 3, the elements of d by d_i = a_i − b_i.
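The saxpy idiom mentioned under vec madd can be sketched as follows (our own illustration with hypothetical names): the scalar alpha is placed in a 16-byte aligned temporary, loaded, and propagated to all four elements with vec splat; arrays are assumed 16-byte aligned and n a multiple of 4.

void vsaxpy(int n, float alpha, float *x, float *y)
{
   float atmp[4] __attribute__ ((aligned (16)));
   vector float va, vx, vy;
   int i;
   atmp[0] = alpha;
   va = vec_splat(vec_ld(0, atmp), 0);          /* (alpha,alpha,alpha,alpha) */
   for (i = 0; i < n; i += 4) {
      vx = vec_ld(4*i, x);
      vy = vec_ld(4*i, y);
      vec_st(vec_madd(va, vx, vy), 4*i, y);     /* y <- alpha*x + y */
   }
}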
B.6 Collective comparisons
The following comparison functions return a single integer result depending upon
whether either (1) all pairwise comparisons between two operand vectors sat-
isfy a binary relation, or (2) if any corresponding pair satisfy a binary relation
or bound when comparing two operand vectors. Tables B.1 and B.2 show the
available binary relations.
The general form for the vec all br integer functions is as follows.
• int vec all br(vector float, vector float)
Synopsis: All Elements Satisfy a binary relation, d = vec all br(a,b) returns one (1) if a_i br b_i for every i = 0, . . . , 3, but zero (0) otherwise. For example, if br is equality (=), then if each a_i = b_i, then d = 1; otherwise d = 0.
In addition to the available binary relations given in Table B.1, there are addi-
tional functions shown in Table B.2. There are also collective unary all operations
shown below.
• int vec all nan(vector float)
Synopsis: All Elements Not a Number, d = vec all nan(a) returns one (1) if, for every i = 0, . . . , 3, a_i is not a number (NaN) (see [114]); but zero (0) otherwise.
Table B.2 Additional available binary relations for collective comparison functions. All binary relations shown in Table B.1 are also applicable. The distinction between, for example, vec any nge and vec any lt is in the way NaN operands are handled. Otherwise, for valid IEEE floating point representations apparently similar functions are equivalent.

Collective comparison functions
Binary relation br   Description                  Mathematical expression   Result if operand NaN
ne                   Not equal                    a ≠ b
nge                  Not greater than or equal    a < b                     True
ngt                  Not greater than             a ≤ b                     True
nle                  Not less than or equal       a > b                     True
nlt                  Not less than                a ≥ b                     True

Collective all comparisons only
in                   Within bounds                ∀i, |a_i| ≤ |b_i|

Collective any comparisons only
out                  Out of bounds                ∃i, |a_i| > |b_i|
• int vec all numeric(vector float)
Synopsis: All Elements Numeric, d = vec all numeric(a) returns one (1) if every a_i, for i = 0, . . . , 3, is a valid floating point numerical value; but returns zero (0) otherwise.
Additionally, there are vector comparisons which return an integer one (1) if any (a_i, b_i) pair satisfies a binary relation, or a zero (0) otherwise. Available collective binary relations are shown in Table B.2. The general form for the any functions is as follows.
• int vec any br(vector float, vector float)
Synopsis: Any Element Satisfies br, d = vec any br(a,b) returns one (1) if any a_i br b_i for i = 0, . . . , 3, but returns zero (0) otherwise. For example, if br is equality (=), then for i = 0, . . . , 3, if any a_i = b_i, then d = vec any eq(a,b) returns one (1); if no pair (a_i, b_i) is equal, d = vec any eq(a,b) returns zero (0).
Finally, there are unary any operations. These are as follows.
• int vec any nan(vector float)
Synopsis: Any Element Not a Number, d = vec any nan(a) returns one (1) if any a_i is not a number (NaN) for i = 0, . . . , 3, but returns zero (0) otherwise [114].
• int vec any numeric(vector float)
Synopsis: Any Element Numeric, d = vec any numeric(a) returns one (1) if any a_i for i = 0, . . . , 3 is a valid floating point numerical value, but returns zero (0) otherwise.
APPENDIX C
OPENMP COMMANDS
Detailed descriptions of the OpenMP commands can be found in Chandra et al.
[17]. On our HP9000, we used the following guidec compiler switches [77]:
guidec +O3 +Oopenmp filename.c -lm -lguide.
Library /usr/local/KAI/guide/lib/32/libguide.a was specified by -lguide.
A maximum of eight CPUs (Section 4.8.2) is requested (C-shell) by
setenv OMP NUM THREADS 8.
C/C++ Open MP syntax
Define parallel region
#pragma omp parallel [clause] ...
structured block
Work-sharing
#pragma omp for [clause] ...
for loop
#pragma omp sections [clause] ...
{
[#pragma omp section
structured block]
}
#pragma omp single [clause] ...
structured block
Combination parallel/work-sharing
#pragma omp parallel for [clause] ...
for loop
#pragma omp parallel sections [clause] ...
{
[#pragma omp section
structured block]
}
C/C++ Open MP syntax (cont.)
Synchronization
#pragma omp master
structured block
#pragma omp critical [(name)]
structured block
#pragma omp barrier
#pragma omp atomic
expression
#pragma omp flush [(list)]
#pragma omp ordered
structured block
Data environment
#pragma omp threadprivate (vbl1, vbl2, ...)
C/C++ clause                      Parallel  for  Sections  Single  Parallel  Parallel
                                  region                           for       sections
shared(list)                         y                               y         y
private(list)                        y      y       y        y       y         y
firstprivate(list)                   y      y       y        y       y         y
lastprivate(list)                           y       y                y         y
default(private | shared | none)
default(shared | none)               y                               y         y
reduction(operator |
  intrinsic : list)                  y      y       y                y         y
copyin(list)                         y                               y         y
if(expression)                       y                               y         y
ordered                                     y                        y
schedule(type[,chunk])                      y                        y
nowait                                      y       y        y
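As a brief illustration of how the directives and clauses above combine (our own sketch, not taken from [17]), the program below forms a dot product with a combined parallel work-sharing construct and a reduction clause; it can be compiled, for example, with the guidec command shown at the start of this appendix.

#include <stdio.h>
#define N 1000000

int main(void)
{
   static double x[N], y[N];
   double dot = 0.0;
   int i;

   for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

#pragma omp parallel for shared(x,y) reduction(+ : dot) schedule(static)
   for (i = 0; i < N; i++)
      dot += x[i] * y[i];

   printf("dot = %e\n", dot);   /* 2.0e+06 */
   return 0;
}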
APPENDIX D
SUMMARY OF MPI COMMANDS
A complete description for the MPI commands can be found on the Argonne
National Laboratory website [110] and most are listed, with some examples, in
Pacheco [115]. Our list is not complete: only those relevant to our discussion and
examples are described. See Chapter 5.
D.1 Point to point commands
Blocking sends and receives
MPI Get count: This function returns the number of elements received by the
operation which initialized status, in particular MPI Recv.
int MPI_Get_count(
MPI_Status *status, /* input */
MPI_Datatype datatype, /* input */
int *count) /* output */
MPI Recv: This function begins receiving data sent by rank source and stores
this into memory locations beginning at the location pointed to by message.
int MPI_Recv(
void *message, /* output */
int count, /* input */
MPI_Datatype datatype, /* input */
int source, /* input */
int tag, /* input */
MPI_Comm comm, /* input */
MPI_Status *status) /* output */
MPI Send: This function initiates transmission of data message to process
dest.
int MPI_Send(
void *message, /* input */
int count, /* input */
MPI_Datatype datatype, /* input */
int dest, /* input */
int tag, /* input */
MPI_Comm comm) /* input */
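A minimal two-process exchange using these blocking calls might look as follows (our own sketch; run with at least two processes). Rank 1 sends four doubles to rank 0, which also checks the received count with MPI Get count.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
   int rank, count;
   double buf[4] = {0.0, 0.0, 0.0, 0.0};
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (rank == 1) {
      buf[0] = 3.14159;
      MPI_Send(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
   } else if (rank == 0) {
      MPI_Recv(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD, &status);
      MPI_Get_count(&status, MPI_DOUBLE, &count);
      printf("got %d doubles, buf[0] = %f\n", count, buf[0]);
   }
   MPI_Finalize();
   return 0;
}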
Buffered point to point communications
MPI Bsend is a buffered send. Buffer allocation is done by MPI Buffer attach
and de-allocated by MPI Buffer detach. Again, message is the starting loca-
tion for the contents of the send message and count is the number of datatype
items to be sent to dest.
int MPI_Bsend(
void *message, /* input */
int count, /* input */
MPI_Datatype datatype, /* input */
int dest, /* input */
int tag, /* input */
MPI_Comm comm) /* input */
MPI Bsend is usually received by MPI Recv.
Buffer allocation/de-allocation functions
MPI Buffer attach indicates that memory space beginning at buffer should
be used as buffer space for outgoing messages.
int MPI_Buffer_attach(
void *buffer, /* input */
int size) /* input */
MPI Buffer detach indicates that previously attached memory locations
should be de-allocated for buffer use. This function returns the address of the
pointer to previously allocated space and location of the integer containing its
size. This is useful for nested library replace/restore.
int MPI_Buffer_detach(
void *buffer, /* output */
int *size) /* output */
Non-blocking communication routines
MPI Ibsend is a non-blocking buffered send.
int MPI_Ibsend(
void *message, /* input */
int count, /* input */
MPI_Datatype datatype, /* input */
int dest, /* input */
int tag, /* input */
MPI_Comm comm, /* input */
MPI_Request *request) /* output */
MPI Irecv is a non-blocking receive. Just because the function has returned
does not mean the message (buffer) information is available. MPI Wait
(page 223) may be used with the request argument to assure completion.
MPI Test may be used to determine the status of completion.
int MPI_Irecv(
void *message, /* output */
int count, /* input */
MPI_Datatype datatype, /* input */
int source, /* input */
int tag, /* input */
MPI_Comm comm, /* input */
MPI_Request *request) /* output */
MPI Isend is a non-blocking normal send.
int MPI_Isend(
void *message, /* input */
int count, /* input */
MPI_Datatype datatype, /* input */
int dest, /* input */
int tag, /* input */
MPI_Comm comm, /* input */
MPI_Request *request) /* output */
MPI Request free functions somewhat like dealloc: the memory ref-
erenced by request is marked for de-allocation and request is set to
MPI REQUEST NULL.
int MPI_Request_free(
MPI_Request *request) /* input/output */
MPI Test Tests the completion of a non-blocking operation associated with
request.
int MPI_Test(
MPI_Request *request, /* input/output */
int *flag, /* output */
MPI_Status *status) /* output */
MPI Testall This is similar to MPI Test except that it tests whether all the
operations associated with a whole array of requests are finished.
int MPI_Testall(
int array_size, /* input */
MPI_Request *requests, /* input/output */
int *flag, /* output */
MPI_Status *statii) /* output array */
MPI Testany This is similar to MPI Testall except that it only tests if at
least one of the requests has finished.
int MPI_Testany(
int array_size, /* input */
MPI_Request *requests, /* input/output */
int *done_index, /* output */
int *flag, /* output */
MPI_Status *status) /* output array */
MPI Testsome This is similar to MPI Testany except that it determines
how many of the requests have finished. This count is returned in done count
and the indices of which of the requests have been completed is returned in
done ones.
int MPI_Testsome(
int array_size, /* input */
MPI_Request *requests, /* input/output */
int *done_count, /* output */
int *done_ones, /* output array */
MPI_Status *statii) /* output array */
MPI Wait This function only returns when request has completed.
int MPI_Wait(
MPI_Request *request, /* input/output */
MPI_Status *status) /* output */
MPI Waitall This function only returns when all the requests have been
completed. Some may be null.
int MPI_Waitall(
int array_size, /* input */
MPI_Request *requests, /* input/output */
MPI_Status *statii) /* output array */
MPI Waitany This function blocks until at least one of the requests,
done one, has been completed.
int MPI_Waitany(
int array_size, /* input */
MPI_Request *requests, /* input/output */
int *done_one, /* output */
MPI_Status *status) /* output */
MPI Waitsome This function only returns when at least one of the requests
has completed. The number of requests finished is returned in done count and
the indices of these are returned in array done ones.
int MPI_Waitsome(
int array_size, /* input */
MPI_Request *requests, /* input/output */
int *done_count, /* output */
int *done_ones, /* output array */
MPI_Status *statii) /* output array */
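The non-blocking calls are typically paired with MPI Wait so that communication can overlap local work, as in this sketch (our own illustration with hypothetical names); the buffers must not be touched until the corresponding wait has returned.

#include "mpi.h"

void exchange(int partner, double *outbuf, double *inbuf, int n)
{
   MPI_Request rreq, sreq;
   MPI_Status  status;

   MPI_Irecv(inbuf,  n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &rreq);
   MPI_Isend(outbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &sreq);

   /* ... independent local computation may proceed here ... */

   MPI_Wait(&sreq, &status);   /* outbuf may be reused after this */
   MPI_Wait(&rreq, &status);   /* inbuf is valid only after this  */
}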
Test and delete operations
MPI Cancel assigns a cancellation request for an operation.
int MPI_Cancel(
MPI_Request *request) /* input */
MPI Iprobe Determine whether a message matching the arguments specified
in source, tag, and comm can be received.
int MPI_Iprobe(
int source, /* input */
int tag, /* input */
MPI_Comm comm, /* input */
int *flag, /* output */
MPI_Status *status) /* output struct */
MPI Probe Block a request until a message matching the arguments specified
in source, tag, and comm is available.
int MPI_Probe(
int source, /* input */
int tag, /* input */
MPI_Comm comm, /* input */
MPI_Status *status) /* output struct */
MPI Test canceled determines whether an operation associated with status
was successfully canceled.
int MPI_Test_canceled(
MPI_Status *status, /* input struct */
int *flag) /* input */
Persistent communication requests
MPI Bsend init creates send request for persistent and buffered message.
int MPI_Bsend_init(
void *message, /* input */
int count, /* input */
MPI_Datatype datatype, /* input */
int dest, /* input */
int tag, /* input */
MPI_Comm comm, /* input */
MPI_Request *request) /* output struct */
MPI Recv init creates a request for a persistent buffered receive.
int MPI_Recv_init(
void *message, /* output */
int count, /* input */
MPI_Datatype datatype, /* input */
int source, /* input */
int tag, /* input */
MPI_Comm comm, /* input */
MPI_Request *request) /* output struct */
MPI Send init creates a persistent standard send request.
int MPI_Send_init(
void *message, /* input */
int count, /* input */
MPI_Datatype datatype, /* input */
int dest, /* input */
int tag, /* input */
MPI_Comm comm, /* input */
MPI_Request *request) /* output struct */
MPI Start initiates the non-blocking operation associated with request.
int MPI_Start(
MPI_Request *request) /* input/output */
MPI Startall initiates a set of non-blocking operations associated with
requests.
int MPI_Startall(
int array_size, /* input */
MPI_Request *requests) /* input/output */
Combined send and receive routines
MPI Sendrecv sends the contents of send data to dest and gets data from
source and stores these in recv data.
int MPI_Sendrecv(
void *send_data, /* input */
int sendcount, /* input */
MPI_Datatype sendtype, /* input */
int dest, /* input */
int sendtag, /* input */
void *recv_data, /* output */
int recvcount, /* input */
MPI_Datatype recvtype, /* input */
int source, /* input */
int recvtag, /* input */
MPI_Comm comm, /* input */
MPI_Status *status) /* output */
MPI Sendrecv replace sends the contents of buffer to dest and replaces
these data from source.
int MPI_Sendrecv_replace(
void *message, /* input/output */
int count, /* input */
MPI_Datatype datatype, /* input */
int dest, /* input */
int sendtag, /* input */
int source, /* input */
int recvtag, /* input */
MPI_Comm comm, /* input */
MPI_Status *status) /* output */
D.2 Collective communications
Broadcasts and barriers
MPI Barrier blocks all processes in comm until each process has called it
(MPI Barrier).
int MPI_Barrier(
MPI_Comm comm) /* input */
MPI Bcast sends the contents of send data with rank root to every process
in comm, including root.
int MPI_Bcast(
void *send_data, /* input/output */
int count, /* input */
MPI_Datatype datatype, /* input */
int root, /* input */
MPI_Comm comm) /* input */
Scatter and gather operations (see Section 3.2.2)
MPI Gather gathers all the data send data sent from each process in the
communicator group comm into recv data of processor root.
int MPI_Gather(
void *send_data, /* input */
int sendcount, /* input */
MPI_datatype sendtype, /* input */
void *recv_data, /* output */
int recvcount, /* input */
MPI_datatype recvtype, /* input */
int root, /* input */
MPI_Comm comm) /* input */
MPI Gatherv gathers all the sent data send data from each process in the
communicator group comm into recv data of processor root, but with the
additional capability of different type signatures.
int MPI_Gatherv(
void *send_data, /* input */
int sendcount, /* input */
MPI_datatype sendtype, /* input */
void *recv_data, /* output */
int *recvcounts, /* input array */
int *recvoffsets, /* input array */
MPI_datatype recvtype, /* input */
int root, /* input */
MPI_Comm comm) /* input */
MPI Scatter scatters the data send data from process root to each of the
processes in the communicator group comm.
int MPI_Scatter(
void *send_data, /* input */
int sendcount, /* input array */
MPI_datatype sendtype, /* input */
void *recv_data, /* output */
int recvcount, /* input */
MPI_datatype recvtype, /* input */
int root, /* input */
MPI_Comm comm) /* input */
MPI Scatterv scatters the data send data from process root to each of the
processes in the communicator group comm, but with the additional capability
of different type signatures.
int MPI_Scatterv(
void *send_data, /* input */
int *sendcounts, /* input array */
int *sendoffsets, /* input array */
MPI_datatype sendtype, /* input */
void *recv_data, /* output */
int recvcount, /* input */
MPI_datatype recvtype, /* input */
int root, /* input */
MPI_Comm comm) /* input */
MPI Allgather collects all the processes’ send data messages from all the
processors in comm communicator group.
int MPI_Allgather(
void *send_data, /* input */
int sendcount, /* input */
MPI_Datatype sendtype, /* input */
void *recv_data, /* output */
int recvcount, /* input */
MPI_Datatype recvtype, /* input */
MPI_Comm comm) /* input */
MPI Allgatherv collects all the processes’ send data messages from all the
processors in comm communicator group as in MPI Allgather, but permits
different type signatures.
int MPI_Allgatherv(
void *send_data, /* input */
int sendcount, /* input */
MPI_datatype sendtype, /* input */
void *recv_data, /* output */
int *recvcounts, /* input array */
int *offsets, /* input array */
MPI_datatype recvtype, /* input */
MPI_Comm comm) /* input */
MPI Alltoall is an all-to-all scatter/gather operation. All the processors in
comm communicator group share each others’ send data.
int MPI_Alltoall(
void *send_data, /* input */
int sendcount, /* input */
MPI_datatype sendtype, /* input */
void *recv_data, /* output */
int recvcount, /* input */
MPI_datatype recvtype, /* input */
MPI_Comm comm) /* input */
MPI Alltoallv is an all-to-all scatter/gather operation with the additional
facility of allowing different type signatures. All the processors in comm
communicator group share each others’ send data.
int MPI_Alltoallv(
void *send_data, /* input */
int *sendcounts, /* input array */
int *sendoffsets, /* input array */
MPI_datatype sendtype, /* input */
void *recv_data, /* output */
int *recvcounts, /* input array */
int *recvoffsets, /* input array */
MPI_datatype recvtype, /* input */
MPI_Comm comm) /* input */
Reduction operations (see Section 3.3)
MPI Reduce is a generic reduction operation. Each data segment from each
of the processes in the communicator group comm is combined according to the
rules of operator (see Table D.1). The result of this reduction is stored in the
process root.
int MPI_Reduce(
void *segment, /* input */
void *result, /* output */
int count, /* input */
MPI_datatype datatype, /* input */
MPI_op operator, /* input */
int root, /* input */
MPI_Comm comm) /* input */
MPI Allreduce is a generic reduction operation. Each data segment segment
from each of the processes in the communicator group comm is combined accord-
ing to the rules of operator operator (see Table D.1). The result of this reduction
is stored on each process of the comm group in result.
int MPI_Allreduce(
void *segment, /* input */
void *result, /* output */
int count, /* input */
MPI_datatype datatype, /* input */
MPI_op operator, /* input */
MPI_Comm comm) /* input */
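For example (our own sketch with hypothetical names), a global dot product reduces the local partial sums with the MPI SUM operation of Table D.1, and every process receives the full result.

#include "mpi.h"

double pdot(int nloc, double *x, double *y)
{
   double local = 0.0, global;
   int i;
   for (i = 0; i < nloc; i++)
      local += x[i] * y[i];
   MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
   return global;
}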
MPI Op create creates an operation for MPI Allreduce. fcn is a pointer to
a function which returns void and its template is given in Section D.3 on p. 234.
Integer variable commute should be 1 (true) if the operands commute, but 0
(false) otherwise.
Table D.1 MPI datatypes available for collective reduction operations.
Pre-defined reduction operations
MPI operations Operation
MPI MAX Maximum (∨)
MPI MIN Minimum (∧)
MPI SUM Summation (Σ)
MPI PROD Product (Π)
MPI BAND Boolean AND (and)
MPI LAND Logical AND (&&)
MPI BOR Boolean OR (or)
MPI LOR Logical OR (||)
MPI BXOR Boolean exclusive OR (⊗)
MPI LXOR Logical exclusive OR
MPI MAXLOC Maximum and its location
MPI MINLOC Minimum and its location
MPI datatypes for reductions
MPI datatype Equiv. C datatype
MPI CHAR Signed char
MPI SHORT Signed short int
MPI INT Signed int
MPI LONG Signed long int
MPI FLOAT Float
MPI DOUBLE Double
MPI LONG DOUBLE Long double
int MPI_Op_create(
MPI_User_function* fcn, /* input */
int commute, /* input */
MPI_Op* operator) /* output */
MPI Op free frees the operator definition defined by MPI Op create.
int MPI_Op_free(
MPI_Op* operator) /* input/output */
MPI Reduce scatter is a generic reduction operation which scatters its results
to each processor in the communication group comm. Each data segment seg-
ment from each of the processes in the communicator group comm is combined
according to the rules of operator operator (see Table D.1). The result of this
reduction is scattered to each process in comm.
int MPI_Reduce_scatter(
void *segment, /* input */
void *recv_data, /* output */
int *recvcounts, /* input */
MPI_datatype datatype, /* input */
MPI_op operator, /* input */
MPI_Comm comm) /* input */
MPI Scan computes a partial reduction operation on a collection of processes
in comm. On each process k in comm, MPI Scan combines the results of the
reduction of all the segments of processes of comm with rank less than or equal
to k and stores those results on k.
int MPI_Scan(
void *segment, /* input */
void *result, /* output */
int count, /* input */
MPI_datatype datatype, /* input */
MPI_op operator, /* input */
MPI_Comm comm) /* input */
Communicators and groups of communicators
MPI Comm group accesses the group associated with a given communicator.
If comm is an inter-communicator, MPI Comm group returns (in group)
the local group.
int MPI_Comm_group(
MPI_comm comm, /* input */
MPI_Group* group) /* output */
MPI Group compare compares group1 with group2: the result of the com-
parison indicates that the two groups are same (both order and members), similar
(only members the same), or not equal. See Table D.2 for result definitions.
Table D.2 MPI pre-defined constants.
Pre-defined MPI constants (in mpi.h)
MPI constants Usage
MPI ANY SOURCE Wildcard source for receives
MPI ANY TAG Wildcard tag for receives
MPI UNDEFINED Any MPI constant undefined
MPI COMM WORLD Communicator containing all processes
MPI COMM SELF MPI communicator for self
MPI IDENT Groups/communicators identical
MPI CONGRUENT Groups congruent
MPI SIMILAR Groups similar
MPI UNEQUAL Groups/communicators not equal
int MPI_Group_compare(
MPI_Group group1, /* input */
MPI_Group group2, /* input */
int *result) /* output */
MPI Group difference compares group1 with group2 and forms a new
group which consists of the elements of group1 which are not in group2. That
is, group3 = group1 \ group2.
int MPI_Group_difference(
MPI_Group group1, /* input */
MPI_Group group2, /* input */
MPI_Group* group3) /* output */
MPI Group excl forms a new group group out which consists of the elements
of group by excluding those whose ranks are listed in ranks (the size of ranks
is nr).
int MPI_Group_excl(
MPI_Group group, /* input */
int nr, /* input */
int *ranks, /* input array */
MPI_Group *group_out) /* output */
MPI Group free frees (releases) group group. Group handle group is reset
to MPI GROUP NULL.
int MPI_Group_free(
MPI_Group *group) /* input/output */
MPI Group incl examines group and forms a new group group out whose
members have ranks listed in array ranks. The size of ranks is nr.
int MPI_Group_incl(
MPI_Group group, /* input */
int nr, /* input */
int *ranks, /* input */
MPI_Group *group_out) /* output */
MPI Group intersection forms the intersection of group1 and group2. That is, MPI Group intersection forms a new group group out whose members consist of processes of both input groups but ordered the same way as group1.
int MPI_Group_intersection(
MPI_Group group1, /* input */
MPI_Group group2, /* input */
MPI_Group *group_out) /* output */
MPI Group rank returns the rank of the calling processes in a group. See also
MPI Comm rank.
int MPI_Group_rank(
MPI_Group group, /* input */
int *rank) /* output */
MPI Group size returns the number of elements (processes) in a group.
int MPI_Group_size(
MPI_Group group, /* input */
int *size) /* output */
MPI Group union forms the union of group1 and group2. That is,
group out consists of group1 followed by the processes of group2 which do
not belong to group1.
int MPI_Group_union(
MPI_Group group1, /* input */
MPI_Group group2, /* input */
MPI_Group *group_out) /* output */
Managing communicators
MPI Comm compare compares the communicators comm1 and comm2 and returns result, which is identical if the contexts and groups are the same; congruent, if the groups are the same but the contexts differ; similar, if the groups are similar but the contexts are different; and unequal, otherwise. The values for these results are given in Table D.2.
int MPI_Comm_compare(
MPI_Comm comm1, /* input */
MPI_Comm comm2, /* input */
int *result) /* output */
MPI Comm create creates a new communicator comm out from the input
comm and group.
int MPI_Comm_create(
MPI_Comm comm, /* input */
MPI_Group group, /* input */
MPI_Comm *comm_out) /* output */
MPI Comm free frees (releases) a communicator.
int MPI_Comm_free(
MPI_Comm *comm) /* input/output */
Communication status struct
typedef struct {
int count;
int MPI_SOURCE;
int MPI_TAG;
int MPI_ERROR;
int private_count;
} MPI_Status;
D.3 Timers, initialization, and miscellaneous
Timers
MPI Wtick returns the resolution (precision) of MPI Wtime in seconds.
double MPI_Wtick(void)
MPI Wtime returns the wallclock time in seconds, measured from some fixed point in the past. This is a local, not global, timer: a previous call to this function by another process does not interfere with a local timing.
double MPI_Wtime(void)
Startup and finish
MPI Abort terminates all the processes in comm and returns the error code error_code to the invoking environment.
int MPI_Abort(
MPI_Comm comm, /* input */
int error_code) /* input */
MPI Finalize terminates the current MPI threads and cleans up memory
allocated by MPI.
int MPI_Finalize(void)
MPI Init starts up MPI. This procedure must be called before any other MPI function may be used. The arguments mimic those of C main() except argc is a pointer since it is both an input and output.
int MPI_Init(
int *argc, /* input/output */
char ***argv) /* input/output */
Prototypes for user-defined functions
MPI User function defines the basic template of an operation to be created
by MPI Op create.
typedef void MPI_User_function(
void *invec, /* input vector */
void *inoutvec, /* input/output vector */
int *length, /* length of vectors */
MPI_Datatype *datatype) /* type of vector elements */
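To tie these pieces together, here is a sketch (our own illustration with hypothetical names) of a user-defined reduction: maxpair follows the MPI User function template above, is registered with MPI Op create, and combines (value, rank) pairs much as the pre-defined MPI MAXLOC does.

#include <stdio.h>
#include "mpi.h"

void maxpair(void *invec, void *inoutvec, int *length, MPI_Datatype *datatype)
{
   double *in = (double *) invec, *inout = (double *) inoutvec;
   int i;
   for (i = 0; i < *length; i += 2)        /* elements come in (value,rank) pairs */
      if (in[i] > inout[i]) { inout[i] = in[i]; inout[i+1] = in[i+1]; }
}

int main(int argc, char **argv)
{
   int rank;
   double pair[2], best[2];
   MPI_Op op;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   pair[0] = (double)((7*rank + 3) % 11);  /* some local value */
   pair[1] = (double) rank;                /* its owner        */
   MPI_Op_create(maxpair, 1, &op);         /* commute = 1      */
   MPI_Allreduce(pair, best, 2, MPI_DOUBLE, op, MPI_COMM_WORLD);
   if (rank == 0)
      printf("max %g owned by rank %g\n", best[0], best[1]);
   MPI_Op_free(&op);
   MPI_Finalize();
   return 0;
}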
APPENDIX E
FORTRAN AND C COMMUNICATION
In this book, essentially all code examples are written in ANSI Standard C. There
are two important reasons for this: (1) C is a low level language that lends itself
well to bit and byte manipulations, but at the same time has powerful macro and
pointer facilities and (2) C support on Intel Pentium and Motorola G-4 machines
is superior to Fortran, and we spend considerable time discussing these chips and
their instruction level parallelism. In brief, the most important characteristics of
Fortran and C are these:
1. Fortran passes scalar information to procedures (subroutines and func-
tions) by address (often called by reference), while C passes scalars by
value.
2. Fortran arrays have contiguous storage on the first index (e.g. a(2,j)
immediately follows a(1,j)), while C has the second index as the “fast”
one (e.g. a[i][2] is stored immediately after a[i][1]). Furthermore,
indexing is normally numbered from 0 in C, but from 1 in Fortran.
3. Fortran permits dynamically dimensioned arrays, while in C this is not
automatic but must be done by hand.
4. Fortran supports complex datatype: complex z is actually a two-
dimensional array (z, z). C in most flavors does not support complex
type. Cray C and some versions of the Apple developers kit gcc do support
complex type, but they do not use a consistent standard.
5. C has extensive support for pointers. Cray-like pointers in Fortran are now
fairly common, but g77, for example, does not support Fortran pointers.
Fortran 90 supports a complicated construction for pointers [105], but is
not available on many Linux machines.
Now let us examine these issues in more detail. Item 1 is illustrated by the fol-
lowing snippets of code showing that procedure subr has its reference to variable
x passed by address, so the set value (π) of x will be returned to the program.
1. Fortran passes all arguments to procedures by address
program address
real x
call subr(x)
print *," x=",x
stop
end
subroutine subr(y)
real y
y = 3.14159
return
end
Conversely, C passes information to procedures by value, except arrays
which are passed by address.
#include <stdio.h>
int main()
{
   float x=0.0;
   float y[1],z[1];
   void subrNOK(float),subrOK(float*,float*);
   subrNOK(x);             /* x will not be set */
   printf(" x= %e\n",x);
   subrOK(y,z);            /* y[0],z[0] are set */
   printf(" y[0] = %e, z = %e\n",y[0],*z);
   return 0;
}
This version is incorrect and x in main is not set:
void subrNOK(float x)
{ x = 3.14159; }
but y[0] and z[0] in main are properly set here:
void subrOK(float *x,float *y)
{ *x = 3.14159; y[0] = 2.71828; }
2. A 3 × 3 matrix A given by

   A = [ a_00  a_01  a_02 ]
       [ a_10  a_11  a_12 ]
       [ a_20  a_21  a_22 ]

is stored in C according to the scheme a_ij = a[j][i], whereas the
analogous scheme for the 1 ≤ i, j ≤ 3 numbered array in Fortran

   A = [ a_11  a_12  a_13 ]
       [ a_21  a_22  a_23 ]
       [ a_31  a_32  a_33 ]

is a_ij = a(i,j).
3. Fortran allows dynamically dimensioned arrays.
program dims
c x is a linear array when declared
real x(9)
call subr(x)
print *,x
stop
end
subroutine subr(x)
c but x is two dimensional here
real x(3,3)
do i=1,3
do j=1,3
x(j,i) = float(i+j)
enddo
enddo
return
end
C does not allow dynamically dimensioned arrays. Often, by using #define,
macros can be employed to work around this restriction. We do
this on page 108, for example, with the am macro. For example,
#define am(i,j) *(a+i+lda*j)
void proc(int lda, float *a)
{
   float seed=331.0, ggl(float*);
   int i,j;
   for(j=0;j<lda;j++){
      for(i=0;i<lda;i++){
         am(i,j) = ggl(&seed);   /* Fortran order */
      }
   }
}
#undef am
will treat array a in proc just as its Fortran counterpart—with the leading
index i of a(i,j) being the “fast” one.
4. In this book, we adhere to the Fortran storage convention for com-
plex datatype. Namely, if an array complex z(m), declared in For-
tran, is accessed in a C procedure as an array float z[m][2], then
ℜz_k = real(z(k)) = z[k-1][0], and ℑz_k = aimag(z(k)) = z[k-1][1]
for each 1 ≤ k ≤ m.
Although it is convenient and promotes consistency with Fortran, use of this
complex convention may exact a performance penalty. Since storage in memory
between two successive elements z(k),z(k+1) is two floating point words apart,
special tricks are needed to do complex arithmetic (see Figure 3.22). These are
unnecessary if all the arithmetic is on float data.
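A small sketch of this convention (cscale_ is a hypothetical routine name of our
own; the trailing underscore anticipates the calling convention discussed below)
scales a Fortran array complex z(m) by a real factor s from within C:
/* Fortran side: call cscale(z,m,s) with complex z(m), integer m, real s */
void cscale_(float z[][2], int *m, float *s)
{
   int k;
   for (k = 0; k < *m; k++) {
      z[k][0] *= *s;     /* real part of z(k+1)      */
      z[k][1] *= *s;     /* imaginary part of z(k+1) */
   }
}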
5. C uses &x to indicate the address of x. By this device, x can be set within
a procedure by *px = 3.0, for example, where px holds the address of x
(px points to x). Thus,
px = &x;
*px = y;
/* is the same as */
x = y;
Now let us give some examples of C ↔ Fortran communication. Using rules 1
and 5, we get the following for calling a Fortran routine from C. This is the most
common case for our needs since so many high performance numerical libraries
are written in Fortran.
1. To call a Fortran procedure from C, one usually appends an underscore to
the Fortran routine's name. Cray platforms do not use this convention.
The following example shows how scalar and array arguments are usually
communicated.
int n=2;
unsigned char c1=’C’,c2;
float x[2];
/* NOTE underscore */
void subr_(int*,char*,char*,float*);
subr_(&n,&c1,&c2,x);
printf("n=%d,c1=%c,c2=%c,x=%e,%e\n",
n,c1,c2,x[0],x[1]);
...
subroutine subr(n,c1,c2,x)
integer n
character*1 c1,c2
real x(n)
print *,"in subr: c1=",c1
c2=’F’
do i=1,n
x(i) = float(i)
enddo
return
end
For C calling Fortran procedures, you will likely need some libraries
libf2c.a (or libf2c.so), on some systems libftn.a (or libftn.so), or
some variant. For example, on Linux platforms, you may need libg2c.a
(g2c instead of f2c). To determine if an unsatisfied external is in one
of these libraries, “ar t libf2c.a” lists all compiled modules in archive
libf2c.a. In the shared object case, “nm libftn.so” will list all the
named modules in the .so object. Be aware that both ar t and nm may
produce voluminous output, so judicious use of grep is advised.
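For example (a sketch of our own, not taken from a particular library header),
the single precision dot product sdot from a Fortran BLAS library may be called
from C as follows. Note that every argument, including the scalars, is passed by
address; on some systems the Fortran REAL function value is actually returned
as a C double, so check your platform's conventions.
float sdot_(int *n, float *x, int *incx, float *y, int *incy);
float dot3(void)
{
   float x[3] = {1.0, 2.0, 3.0};
   float y[3] = {4.0, 5.0, 6.0};
   int   n = 3, one = 1;
   return sdot_(&n, x, &one, y, &one);   /* 1*4 + 2*5 + 3*6 = 32 */
}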
2. Conversely, to call a C procedure from Fortran:
program foo
integer n
real x(2)
character*1 c1,c2
c1 = ’F’
n = 2
call subr(c1,c2,n,x)
print *,"c1=",c1,", c2=",c2,", n=",n,
* ", x=",x(1),x(2)
...
/* NOTE underscore in Fortran called module */
void subr_(char *c1,char *c2,int *n,float *x)
{
int i;
printf("in subr: c1=%c\n",*c1);
*c2 = ’C’;
for(i=0;i<*n;i++) x[i] = (float)i;
}
If you have Fortran calling C procedures, you may need libraries libc.a,
libm.a, or their .so shared object variants. Again, to determine if an
unsatisfied external is in one of these libraries, ar or nm may be help-
ful. See the description for C calling Fortran item above. In this case,
it is generally best to link the C modules to Fortran using the Fortran
compiler/linker, to wit:
g77 -o try foo.o subr.o
APPENDIX F
GLOSSARY OF TERMS
This glossary includes some specific terms and acronyms used to describe parallel
computing systems. Some of these are to be found in [27] and [22], others in
Computer Dictionary.
• Align refers to data placement in memory wherein a datum's address is an
exact multiple of the block size, that is, its offset within the block is zero.
• Architecture is a detailed specification for a processor or computer
system.
• Bandwidth is a somewhat erroneously applied term which means a data
transfer rate, in bits/second, for example. Previously, the term meant the
range of frequencies available for analog data.
• Biased exponent is a binary exponent whose positive and negative range
of values is shifted by a constant (bias). This bias is designed to avoid two
sign bits in a floating point word—one for the overall sign and one for the
exponent. Instead, a floating point word takes the form (ex is the exponent,
and 1/2 ≤ x_0 < 1 is the mantissa),

x = ±2^ex · x_0,

where ex is represented as exponent = ex + bias, so the data are stored as
[sign bit | exponent | mantissa].
• Big endian is a bit numbering convention wherein the bits (also bytes)
are numbered from left to right—the high order bit is numbered 0, and the
numbering increases toward the lowest order.
• Block refers to data of fixed size and often aligned (see cache block).
• Branch prediction at a data dependent decision point in code is a com-
bination of hardware and software arrangements of instructions constructed
to make a speculative prediction about which branch will be taken.
• Burst is a segment of data transferred, usually a cache block.
• Cache block at the lowest level (L1), is often called a cache line (16 bytes),
whereas in higher levels of cache, the size can be as large as a page of 4 KB.
See Table 1.1.
• Cache coherency means providing a memory system with a common view
of the data. Namely, modified data can appear only in the local cache in
which they were stored, so fetching data from memory might not get the most
up-to-date version. A hardware mechanism presents the whole system with
the most recent version.
• Cache flush of data means they are written back to memory and the
cache lines are removed or marked invalid.
• Cache is a locally accessible high speed memory for instructions or data.
See Section 1.2.
• Cache line is a cache block at the lowest level L1.
• CMOS refers to complementary metal oxide semiconductor. Today, most
memory which requires power for viability is CMOS.
• Communication for our purposes means transferring data from one
location to another, usually from one processor’s memory to others’.
• Cross-sectional bandwidth refers to a maximum data rate possible
between halves of a machine.
• Direct mapped cache is a system where a datum may be written only
into one cache block location and no other. See Section 1.2.1.
• Dynamic random access memory (DRAM) Voltage applied to the
base of a transistor turns on the current which charges a capacitor. This
charge represents one storage bit. The capacitor charge must be refreshed
regularly (every few milliseconds). Reading the data bit destroys it.
• ECL refers to emitter coupled logic, a type of bipolar transistor with very
high switching speeds but also high power requirements.
• Effective address (EA) of a memory datum is the cache address plus
offset within the block.
• Exponent means the binary exponent of a floating point number x = 2^ex · x_0,
where ex is the exponent. The mantissa is 1/2 ≤ x_0 < 1. In IEEE
arithmetic, the high order bit of x_0 is not stored and is assumed to be 1,
see [114].
• Fast Fourier Transform (FFT) is an algorithm for computing y = Wx,
where W_jk = ω^{jk} is the jkth power of the nth root of unity ω, and x, y
are complex n-vectors.
• Fetch means getting data from memory to be used as an instruction
or for an operand. If the data are already in cache, the process can be
foreshortened.
• Floating-point register (FPR) refers to one register within a set used
for floating point data.
• Floating-point unit refers to the hardware for arithmetic operations on
floating point data.
• Flush means that when data are to be modified, the old data may have
to be stored to memory to prepare for the new. Cache flushes store local
copies back to memory (if already modified) and mark the blocks invalid.
• Fully associative cache designs permit data from memory to be stored
in any available cache block.
• Gaussian elimination is an algorithm for solving a system of linear
equations Ax = b by using row or column reductions.
• General purpose register (GPR) usually refers to a register used for
immediate storage and retrieval of integer operations.
• Harvard architecture is distinct from the original von Neumann design,
which had no clear distinction between instruction data and arithmetic
data. The Harvard architecture keeps distinct memory resources (like
caches) for these two types of data.
• IEEE 754 specifications for floating point storage and operations were
inspired and encouraged by W. Kahan. The IEEE refined and adopted
this standard, see Overton [114].
• In-order means instructions are issued and executed in the order in which
they were coded, without any re-ordering or rearrangement.
• Instruction latency is the total number of clock cycles necessary to
execute an instruction and make ready the results of that instruction.
• Instruction parallelism refers to concurrent execution of hardware
machine instructions.
• Latency is the amount of time from the initiation of an action until the
first results begin to arrive. For example, the number of clock cycles a
multiple data instruction takes from the time of issue until the first result
is available is the instruction’s latency.
• Little endian is a numbering scheme for binary data in which the lowest
order bit is numbered 0 and the numbering increases as the significance
increases.
• Loop unrolling hides pipeline latencies by processing segments of data
rather than one element at a time. Vector processing represents one hardware mode for
this unrolling, see Section 3.2, while template alignment is more a software
method, see Section 3.2.4.
• Mantissa means the x_0 portion of a floating point number x = 2^ex · x_0,
where 1/2 ≤ x_0 < 1.
• Monte Carlo (MC) simulations are mathematical experiments which use
random numbers to generate possible configurations of a model system.
• Multiple instruction, multiple data (MIMD) mode of parallelism
means that more than one CPU is used, each working on independent
parts of the data to be processed and further that the machine instruction
sequence on each CPU may differ from every other.
• NaN in the IEEE floating point standard is an abbreviation for a particular
unrepresentable datum. Often there is more than one such NaN. For
example, some cause exceptions and others are tagged but ignored.
• No-op is an old concept wherein cycles are wasted for synchronization
purposes. The “no operation” neither modifies any data nor generates bus
activity, but a clock cycle of time is taken.
• Normalization in our discussions means two separate things. (1) In
numerical floating point representations, normalization means that the
highest order bit in a mantissa is set (in fact, or implied as in IEEE),
and the exponent is adjusted accordingly. (2) In our MC discussions,
normalization means that a probability density distribution p(x) is mul-
tiplied by a positive constant such that ∫ p(x) dx = 1, that is, the total
probability for x having some value is unity.
• Out of order execution refers to hardware rearrangement of instructions
from computer codes which are written in-order.
• Page is a 4 KB aligned segment of data in memory.
• Persistent data are those that are expected to be loaded frequently.
• Pipelining is a familiar idea from childhood rules for arithmetic. Each
arithmetic operation requires multiple stages (steps) to complete, so
modern computing machines allow multiple operands sets to be computed
simultaneously by pushing new operands into the lowest stages as soon as
those stages are available from previous operations.
• Pivot in our discussion means the element of maximum absolute size in
a matrix row or column which is used in Gaussian elimination. Pivoting
usually improves the stability of this algorithm.
• Quad pumped refers to a clocking mechanism in computers which
involves two overlapping signals both of whose leading and trailing edges
turn switches on or off.
• Quad word is a group of four 32-bit floating point words.
• Reduced instruction set computer (RISC) means one with fixed
instruction length (usually short) operations and typically relatively few
data access modes. Complex operations are made up from these.
• Redundancy in this book means the extra work that ensues as a result
of using a parallel algorithm. For example, a simple Gaussian elimination
tridiagonal system solver requires fewer floating point operations than a
cyclic reduction method, although the latter may be much faster. The
extra operations represent a redundancy. In the context of branch predic-
tion, instructions which issue and start to execute but whose results are
subsequently discarded due to a missed prediction are redundant.
• Rename registers are those whose conventional numbering sequence is
reordered to match the numbered label of an instruction. See Figures 3.13
and 3.14 and attendant discussion.
• Set associative cache design means that storage is segmented into sets.
A datum from memory is assigned to its associated set according to its
address.
• Shared memory modes of parallelism mean that each CPU (processor)
has access to data stored in a common memory system. In fact, the memory
system may be distributed but read/write conflicts are resolved by the
intercommunication network.
• Single instruction stream, multiple data streams (SIMD) usually
means vector computing. See Chapter 3.
• Slave typically means an arbitrarily ranked processor assigned a task by
an equally arbitrarily chosen master.
• Snooping monitors addresses by a bus master to assure data coherency.
• Speedup refers to the ratio of the execution time of a program before
parallelization to the execution time of the parallelized version, run on
the same type of processors. Equivalently, it is the ratio of the processing
rate of the parallel version to the processing rate of the serial version
of the same code on the same machine.
• Splat operations take a scalar value and store it in all elements of a vector
register. For example, the saxpy operation (y ← a · x + y) on SSE and
Altivec hardware is done by splatting the constant a into all the elements of
a vector register and doing the a · x multiplications as Hadamard products
a_i · x_i, where each a_i = a. See Equation (2.3).
• Stage in our discussion means a step in a sequence of arithmetic opera-
tions which may subsequently be used concurrently with successive steps
of an operation which has not yet finished. This stage will be used for
new operands while the subsequent stages work on preceding operands.
For example, in multiplication, the lowest order digit stage may be used
for the next operand pair while higher order multiply/carry operations on
previous operands are still going on.
• Startup for our purposes refers to the amount of time required to establish
a communications link before any actual data are transferred.
• Static random access memory (SRAM) does not need to be refreshed
like DRAM and reading the data is not destructive. However, the storage
mechanism is flip-flops and requires either four transistors and two resist-
ors, or six transistors. Either way, SRAM cells are more complicated and
expensive than DRAM.
• Superscalar machine means one which permits multiple instructions to
run concurrently with earlier instruction issues.
• Synchronization in parallel execution forces unfinished operations to
finish before the program can continue.
• Throughput is the number of concurrent instructions which are running
per clock cycle.
• Vector length (VL) is the number of elements in a vector register, or
more generally the number in the register to be processed. For example,
VL ≤ 64 on Cray SV-1 machines, VL = 4 for single precision data on SSE
hardware (Pentium III or 4) and Altivec hardware (Macintosh G-4).
• Vector register is a set of registers whose multiple data may be processed
by invoking a single instruction.
• Word is a floating point datum, either 32-bit or 64-bit for our examples.
• Write back is a cache write strategy in which data to be stored in memory
are written only to cache until the corresponding cache lines are again to
be modified, at which time they are written to memory.
• Write through is a cache write strategy in which modified data are
immediately written into memory as they are stored into cache.
APPENDIX G
NOTATIONS AND SYMBOLS
and Boolean and: i and j = 1 if i = j = 1, 0 otherwise.
a ∨ b means the maximum of a, b: a ∨ b = max(a,b).
a ∧ b means the minimum of a, b: a ∧ b = min(a,b).
∀x_i means for all x_i.
A^{-1} is the inverse of matrix A.
A^T is a matrix transpose: [A^T]_{ij} = A_{ji}.
Ex is the expectation value of x: for a discrete sample of x,
Ex = (1/N) Σ_{i=1}^N x_i. For continuous x, Ex = ∫ p(x) x dx.
⟨x⟩ is the average value of x, that is, physicists' notation ⟨x⟩ = Ex.
∃x_i means there exists an x_i.
ℑz is the imaginary part of z: if z = x + iy, then ℑz = y.
(x, y) is the usual vector inner product: (x, y) = Σ_i x_i y_i.
m|n says integer m divides integer n exactly.
ā is the Boolean complement of a: bitwise, the complement of 1 is 0
and the complement of 0 is 1.
‖x‖ is some vector norm: for example, ‖x‖ = (x, x)^{1/2} is an L_2 norm.
⊕ When applied to binary data, this is a Boolean exclusive OR:
for each independent bit, i ⊕ j = 1 if only one of i = 1 or
j = 1 is true, but is zero otherwise.
When applied to matrices, this is a direct sum: A ⊕ B is
a block diagonal matrix with A, then B, along the diagonal.
or Boolean OR operation.
⊗ Kronecker product of matrices: when A is p × p and B is q × q,
A ⊗ B is a pq × pq matrix whose (i, j)th q × q block is a_{i,j}B.
p(x) is a probability density: P{x ≤ X} = ∫_{x≤X} p(x) dx.
p(x|y) is a conditional probability density: ∫ p(x|y) dx = 1.
ℜz is the real part of z: if z = x + iy, then ℜz = x.
x ← y means that the current value of x (if any) is replaced by y.
U(0, 1) means a uniformly distributed random number between 0 and 1.
VL the vector length: number of elements processed in SIMD mode.
VM is a vector mask: a set of flags (bits) within a register, each
corresponding to a test condition on words in a vector register.
w(t) is a vector of independent Brownian motions: see Section 2.5.3.2.
REFERENCES
1. 3DNow: Technology Manual, No. 21928G/0, March 2000. Advanced Micro
Devices Version of XMM. Available from URL http://www.amd.com/
gb-uk/Processors/SellAMDProducts/ 0,, 30 177 5274 5284% 5E992% 5E1144,
00.html.
2. M. Abramowitz and I. Stegun (eds). Handbook of Mathematical Functions. US
National Bureau of Standards, Washington, DC, 1964.
3. J. C. Agüí and J. Jiménez. A binary tree implementation of a parallel distributed
tridiagonal solver. Parallel Comput., 21:233–241, 1995.
4. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz et al.
LAPACK Users’ Guide—Release 2.0. Society for Industrial and Applied Math-
ematics, Philadelphia, PA, 1994. (Software and guide are available from Netlib
at URL: http://www.netlib.org/lapack/.)
5. P. Arbenz and M. Hegland. On the stable parallel solution of general narrow
banded linear systems. In P. Arbenz, M. Paprzycki, A. Sameh, and V. Sarin
(eds), High Performance Algorithms for Structured Matrix Problems, pp. 47–73.
Nova Science Publishers, Commack, NY, 1998.
6. P. Arbenz and W. Petersen. http://www.inf.ethz.ch/˜arbenz/book.
7. S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, L. C. McInnes, and
B. F. Smith. PETSc home page. http://www.mcs.anl.gov/petsc, 2001.
8. Satish Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient manage-
ment of parallelism in object oriented numerical software libraries. In E. Arge,
A. M. Bruaset, and H. P. Langtangen (eds), Modern Software Tools in Scientific
Computing, pp. 163–202. Birkhäuser Press, Basel, 1997.
9. S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. PETSc Users Manual.
Technical Report ANL-95/11 - Revision 2.1.5, Argonne National Laboratory,
2003.
10. R. Balescu. Equilibrium and NonEquilibrium Statistical Mechanics. Wiley-
Interscience, New York, 1975.
11. R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout,
R. Pozo, Ch. Romine, and H. van der Vorst. Templates for the Solution of Lin-
ear Systems: Building Blocks for Iterative Methods. Society for Industrial and
Applied Mathematics, Philadelphia, PA, 1994. Available from Netlib at URL
http://www.netlib.org/templates/index.html.
12. L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon,
J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and
R. C. Whaley. ScaLAPACK Users’ Guide. Society for Industrial and Applied
Mathematics, Philadelphia, PA, 1997. (Software and guide are available from
Netlib at URL http://www.netlib.org/scalapack/.)
13. S. Bondeli. Divide and conquer: A parallel algorithm for the solution of a
tridiagonal linear system of equations. Parallel Comput., 17:419–434, 1991.
14. R. P. Brent. Random number generation and simulation on vector and parallel
computers. In D. Pritchard and J. Reeve (eds), Euro-Par ’98 Parallel Pro-
cessing, pp. 1–20. Springer, Berlin, 1998. (Lecture Notes in Computer Science,
1470.)
15. E. O. Brigham. The Fast Fourier Transform. Prentice Hall, Englewood Cliffs,
NJ, 1974.
16. E. O. Brigham. The Fast Fourier Transform and its Applications. Prentice Hall,
Englewood Cliffs, NJ, 1988.
17. R. Chandra, R. Menon, L. Dagum, D. Kohr, and D. Maydan. Parallel
Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, 2001.
18. J. Choi, J. Demmel, I. Dhillon, J. J. Dongarra, S. Ostrouchov, A. P. Petitet,
K. Stanley, D. W. Walker, and R. C. Whaley. ScaLAPACK: A portable linear
algebra library for distributed memory computers—design issues and performance.
LAPACK Working Note 95, University of Tennessee, Knoxville, TN, March 1995.
Available from http://www.netlib.org/lapack/lawns/.
19. J. Choi, J. J. Dongarra, S. Ostrouchov, A. P. Petitet, D. W. Walker, and
R. C. Whaley. A proposal for a set of parallel basic linear algebra subprograms.
LAPACK Working Note 100, University of Tennessee, Knoxville, TN, May 1995.
Available from the Netlib software repository.
20. Apple Computer Company. Altivec Address Alignment.
http://developer.apple.com/hardware/ve/alignment.html.
21. J. W. Cooley and J. W. Tukey. An algorithm for the machine calculations of
complex Fourier series. Math. Comp., 19:297–301, 1965.
22. Apple Computer Corp. Power PC Numerics. Addison Wesley Publication Co.,
Reading, MA, 1994.
23. Intel Corporation. Intel Developers’ Group. http://developer.intel.com.
24. Intel Corporation. Integer minimum or maximum element search using streaming
SIMD extensions. Technical Report, Intel Corporation, 1 Jan. 1999. AP-804,
Order No. 243638-002.
25. Intel Corporation. Intel Architecture Software Developer’s Manual: Vol 2
Instruction Set Reference. Intel Corporation, 1999. Order No. 243191.
http://developer.intel.com.
26. Intel Corporation. Intel Math Kernel Library, Reference Manual. Intel Corpora-
tion, 2001. Order No. 630813-011. http://www.intel.com/software/products/
mkl/mkl52/, go to Technical Information, and download User Guide/Reference
Manual.
27. Intel Corporation. Intel Pentium 4 and Intel Xeon Processor Optimization
Manual. Intel Corporation, 2001. Order No. 248966-04,
http://developer.intel.com.
28. Intel Corporation. Split radix fast Fourier transform using streaming SIMD exten-
sions, version 2.1. Technical Report, Intel Corporation, 28 Jan. 1999. AP-808,
Order No. 243642-002.
29. Motorola Corporation. MPC7455 RISC Microprocessor hardware specifications.
http://e-www.motorola.com/brdata/PDFDB/docs/MPC7455EC.pdf.
30. Motorola Corporation. Altivec Technology Programming Environments Manual,
Rev. 0.1. Motorola Corporation, 1998. Available as ALTIVECPIM.pdf, document
ALTIVECPEM/D, http://www.motorola.com.
31. R. Crandall and J. Klivington. Supercomputer-style FFT library for Apple
G-4. Technical Report, Adv. Computation Group, Apple Computer Company,
6 Jan. 2000.
32. E. Cuthill. Several strategies for reducing the bandwidth of matrices. In D. J. Rose
and R. Willoughby (eds), Sparse Matrices and their Applications. Plenum Press,
New York, 1972.
33. P. J. Davis and P. Rabinowitz. Methods of Numerical Integration. Academic Press,
Orlando, FL, 1984.
34. L. Devroye. Non-Uniform Random Variate Generation. Springer, New York,
1986.
35. Diehard rng tests. George Marsaglia’s diehard random number test suite.
Available at URL http://stat.fsu.edu/pub/diehard.
36. J. J. Dongarra. Performance of various computers using standard lin-
ear equations software. Technical Report, NETLIB, Sept. 8, 2002.
http://netlib.org/benchmark/performance.ps.
37. J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. LINPACK
Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA,
1979.
38. J. J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A proposal for a set of
level 3 basic linear algebra subprograms. ACM SIGNUM Newslett., 22(3):2–14,
1987.
39. J. J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A proposal for a set of
level 3 basic linear algebra subprograms. ACM Trans. Math. Software, 16:1–17,
1990.
40. J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of
fortran basic linear algebra subprograms. ACM Trans. Math. Software, 14:1–17,
1988.
41. J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended
set of fortran basic linear algebra subprograms: Model implementation and test
programs. ACM Trans Math Software, 14:18–32, 1988.
42. J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Numer-
ical Linear Algebra for High-Performance Computers. Society for Industrial and
Applied Mathematics, Philadelphia, PA, 1998.
43. I. S. Duff, M. A. Heroux, and R. Pozo. An overview of the sparse basic linear
algebra subprograms: The new standard from the BLAS technical forum. ACM
Trans. Math. Software, 28(2):239–267, 2002.
44. P. Duhamel and H. Hollmann. Split-radix fft algorithm. Electr. Lett., 1:14–16,
1984.
45. C. Dun, M. Hegland, and M. Osborne. Parallel stable solution methods for tri-
diagonal linear systems of equations. In R. L. May and A. K. Easton (eds),
Computational Techniques and Applications: CTAC-95, pp. 267–274. World
Scientific, Singapore, 1996.
46. A. Erdélyi, et al. Higher Transcendental Functions, the Bateman Manuscript
Project, 3 vols. Robert E. Krieger Publ., Malabar, FL, 1981.
47. B. W. Chars, et al. Maple, version 8. Symbolic computation.
http://www.maplesoft.com/.
48. C. May, et al. (eds). Power PC Architecture. Morgan Kaufmann, San Francisco,
CA, 1998.
49. J. F. Hart, et al. Computer Approximations. Robert E. Krieger Publ. Co.,
Huntington, New York, 1978.
50. R. P. Feynman and A. R. Hibbs. Quantum Mechanics and Path Integrals.
McGraw-Hill, New York, 1965.
51. R. J. Fisher and H. G. Dietz. Compiling for SIMD within a register. In
S. Chatterjee (ed.), Workshop on Languages and Compilers for Parallel Comput-
ing, Univ. of North Carolina, August 7–9, 1998, pp. 290–304. Springer, Berlin,
1999. http://www.shay.ecn.purdue.edu/˜swar/.
52. Nancy Forbes and Mike Foster. The end of Moore's law? Comput. Sci. Eng.,
5(1):18–19, 2003.
53. G. E. Forsythe, M. A. Malcom, and C. B. Moler. Computer Methods for
Mathematical Computations. Prentice-Hall, Englewood Cliffs, NJ, 1977.
54. G. E. Forsythe and C. B. Moler. Computer Solution of Linear Algebraic Systems.
Prentice-Hall, Englewood Cliffs, NJ, 1967.
55. M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the
FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP ’98), Vol. 3, pp. 1381–1384. IEEE Service
Center, Piscataway, NJ, 1998. Available at URL http://www.fftw.org.
56. J. E. Gentle. Random Number Generation and Monte Carlo Methods. Springer
Verlag, New York, 1998.
57. R. Gerber. The Software Optimization Cookbook. Intel Press, 2002.
http://developer.intel.com/intelpress.
58. R. Geus and S. Röllin. Towards a fast parallel sparse matrix-vector multiplication.
Parallel Comput., 27(7):883–896, 2001.
59. S. Goedecker and A. Hoisie. Performance Optimization of Numerically Intensive
Codes. Software, Environments, Tools. SIAM Books, Philadelphia, 2001.
60. G. H. Golub and C. F. van Loan. Matrix Computations, 3rd edn. The Johns
Hopkins University Press, Baltimore, MD, 1996.
61. G. H. Gonnet. Private communication.
62. D. Graf. Pseudo random randoms—generators and tests. Technical Report,
ETHZ, 2002. Semesterarbeit.
63. A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM,
Philadelphia, PA, 1997.
64. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming
with the Message-Passing Interface. MIT Press, Cambridge, MA, 1995.
65. M. J. Grote and Th. Huckle. Parallel preconditioning with sparse approximate
inverses. SIAM J. Sci. Comput., 18(3):838–853, 1997.
66. H. Grothe. Matrix generators for pseudo-random vector generation. Statistical
Lett., 28:233–238, 1987.
67. H. Grothe. Matrixgeneratoren zur Erzeugung gleichverteilter Pseudozufalls-
vektoren. PhD Thesis, Technische Hochschule Darmstadt, 1988.
68. Numerical Algorithms Group. G05CAF, 59-bit random number generator.
Technical Report, Numerical Algorithms Group, 1985. NAG Library Mark 18.
69. M. Hegland. On the parallel solution of tridiagonal systems by wrap-around
partitioning and incomplete LU factorization. Numer. Math., 59:453–472,
1991.
70. D. Heller. Some aspects of the cyclic reduction algorithm for block tridiagonal
linear systems. SIAM J. Numer. Anal., 13:484–496, 1976.
71. J. L. Hennessy and D. A. Patterson. Computer Architecture A Quantitative
Approach, 2nd edn. Morgan Kaufmann, San Francisco, CA, 1996.
72. M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear
systems. J. Res. Nat. Bur. Standards, 49:409–436, 1952.
73. Hewlett-Packard Company. HP MLIB User’s Guide. VECLIB, LAPACK,
ScaLAPACK, and SuperLU, 5th edn., September 2002. Document number
B6061-96020. Available at URL http://www.hp.com/.
74. R. W. Hockney. A fast direct solution of Poisson’s equation using Fourier analysis.
J. ACM, 12:95–113, 1965.
75. Kai Hwang. Advanced Computer Architecture: Parallelism, Scalability, Program-
mability. McGraw-Hill, New York, 1993.
76. Intel Corporation, Beaverton, OR. Paragon XP/S Supercomputer, Intel Corpora-
tion Publishing, June 1992.
77. Intel Corporation, Beaverton, OR. Guide Reference Manual (C/C++
Edition), March 2000. http://developer.intel.com/software/products/
trans/kai/.
78. The Berkeley Intelligent RAM Project: A study project about integration of
memory and processors. See http://iram.cs.berkeley.edu/.
79. F. James. RANLUX: A Fortran implementation of the high-quality pseudo-
random number generator of Lüscher. Comput. Phys. Commun., 79(1):111–114,
1994.
80. P. M. Johnson. Introduction to vector processing. Comput. Design, 1978.
81. S. L. Johnsson. Solving tridiagonal systems on ensemble architectures. SIAM
J. Sci. Stat. Comput., 8:354–392, 1987.
82. M. T. Jones and P. E. Plassmann. BlockSolve95 Users Manual: Scalable lib-
rary software for the parallel solution of sparse linear systems. Technical Report
ANL-95/48, Argonne National Laboratory, December 1995.
83. K. Kankaala, T. Ala-Nissala, and I. Vattulainen. Bit level correlations in some
pseudorandom number generators. Phys. Rev. E, 48:4211–4216, 1993.
84. B. W. Kernighan and D. M. Ritchie. The C Programming Language: ANSI C Ver-
sion, 2nd edn. Prentice Hall Software Series. Prentice-Hall, Englewood Cliffs, NJ,
1988.
85. Ch. Kim. RDRAM Project, development status and plan. RAMBUS developer
forum, Japan, July 2–3, http://www.rambus.co.jp/forum/downloads/MB/
2samsung chkim.pdf, also information about RDRAM development http://
www.rdram.com/, another reference about the CPU-Memory gap is http://
www.acuid.com/memory io.html.
86. P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential
Equations. Springer, New York, 1999.
87. D. Knuth. The Art of Computer Programming, vol 2: Seminumerical Algorithms.
Addison Wesley, New York, 1969.
88. L. Yu. Kolotilina and A. Yu. Yeremin. Factorized sparse approximate inverse
preconditionings I. Theory. SIAM J. Matrix Anal. Appl., 14:45–58, 1993.
89. E. Kreyszig. Advanced Engineering Mathematics, John Wiley, New York, 7th edn,
1993.
90. LAM: An open cluster environment for MPI. http://www.lam-mpi.org.
91. S. Lavington. A History of Manchester Computers. British Computer Society,
Swindon, Wiltshire, SN1 1BR, UK, 1998.
92. C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic linear algebra sub-
programs for Fortran usage. Technical Report SAND 77-0898, Sandia National
Laboratory, Albuquerque, NM, 1977.
93. C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic linear algebra
subprograms for Fortran usage. ACM Trans. Math. Software, 5:308–325, 1979.
94. A. K. Lenstra, H. W. Lenstra, and L. Lovász. Factoring polynomials with rational
coefficients. Math. Ann., 261:515–534, 1982.
95. A. Liegmann. Efficient solution of large sparse linear systems. PhD Thesis
No. 11105, ETH Zurich, 1995.
96. Ch. van Loan. Computational Frameworks for the Fast Fourier Transform. Soci-
ety for Industrial and Applied Mathematics, Philadelphia, PA, 1992. (Frontiers
in Applied Mathematics, 10.)
97. Y. L. Luke. Mathematical Functions and their Approximations. Academic Press,
New York, 1975.
98. M. Lüscher. A portable high-quality random number generator for lattice field
theory simulations. Comput. Phys. Commun., 79(1):100–110, 1994.
99. Neal Madras. Lectures on Monte Carlo Methods. Fields Institute Monographs,
FIM/16. American Mathematical Society books, Providence, RI, 2002.
100. George Marsaglia. Generating a variable from the tail of the normal distribution.
Technometrics, 6:101–102, 1964.
101. George Marsaglia and Wai Wan Tsang. The ziggurat method for gener-
ating random variables. J. Statis. Software, 5, 2000. Available at URL
http://www.jstatsoft.org/v05/i08/ziggurat.pdf.
102. N. Matsuda and F. Zimmerman. PRNGlib: A parallel random number generator
library. Technical Report, Centro Svizzero Calculo Scientifico, 1996. Technical
Report TR-96-08, http://www.cscs.ch/pubs/tr96abs.html#TR-96-08-ABS/.
103. M. Matsumoto and Y. Kurita. Twisted GFSR generators. ACM Trans. Model.
Comput. Simulation, pages part I: 179–194, part II: 254–266, 1992 (part I), 1994
(part II). http://www.math.keio.ac.jp/˜matumoto/emt.html.
104. J. Mauer and M. Troyer. A proposal to add an extensible random number facility
to the standard library. ISO 14882:1998 C++ Standard.
105. M. Metcalf and J. Reid. Fortran 90/95 Explained. Oxford Science Publications,
Oxford, 1996.
106. G. N. Milstein. Numerical Integration of Stochastic Differential Equations. Kluwer
Academic Publishers, Dordrecht, 1995.
107. G. E. Moore, Cramming more components into integrated circuits. Elec-
tronics, 38(8):114–117, 1965. Available at URL: ftp://download.intel.com/
research/silicon/moorsepaper.pdf.
108. Motorola. Complex floating point fast Fourier transform for altivec. Technical
Report, Motorola Inc., Jan. 2002. AN21150, Rev. 2.
109. MPI routines. http://www-unix.mcs.anl.gov/mpi/www/www3/.
110. MPICH—A portable implementation of MPI. http://www-unix.mcs.anl.gov/
mpi/mpich/.
111. Netlib. A repository of mathematical software, data, documents, address lists,
and other useful items. Available at URL http://www.netlib.org.
112. Numerical Algorithms Group, Wilkinson House, Jordan Hill, Oxford. NAG For-
tran Library Manual. BLAS usage began with Mark 6, the Fortran 77 version is
currently Mark 20, http://www.nag.co.uk/.
113. J. Ortega. Introduction to Parallel and Vector Solution of Linear Systems. Plenum
Press, New York, 1998.
114. M. L. Overton. Numerical Computing with IEEE Floating Point Arithmetic.
SIAM Books, Philadelphia, PA, 2001.
115. P. S. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, San Fran-
cisco, CA, 1997.
116. MPI effective bandwidth test. Available at URL http://www.pallas.de/pages/
pmbd.htm.
117. F. Panneton and P. L’Ecuyer and M. Matsumoto. Improved long-period gener-
ators based on linear recurrences modulo 2. ACM Trans. Math. Software, 2005.
Submitted in 2005, but available at http://www.iro.umontreal.ca/∼lecuyer/.
118. G. Parisi. Statistical Field Theory. Addison-Wesley, New York, 1988.
119. O. E. Percus and M. H. Kalos. Random number generators for mimd processors.
J. Parallel distribut. Comput., 6:477–497, 1989.
120. W. Petersen. Basic linear algebra subprograms for CFT usage. Technical Note
2240208, Cray Research, 1979.
121. W. Petersen. Lagged fibonacci series random number generators for the NEC
SX-3. Int. J. High Speed Comput., 6(3):387–398, 1994. Software available at
http://www.netlib.org/random.
122. W. P. Petersen. Vector Fortran for numerical problems on Cray-1. Comm. ACM,
26(11):1008–1021, 1983.
123. W. P. Petersen. Some evaluations of random number generators in REAL*8
format. Technical Report, TR-96-06, Centro Svizzero Calculo Scientifico, 1996.
124. W. P. Petersen. General implicit splitting for numerical simulation of stochastic
differential equations. SIAM J. Num. Anal., 35(4):1439–1451, 1998.
125. W. P. Petersen, W. Fichtner, and E. H. Grosse. Vectorized Monte Carlo calcula-
tion for ion transport in amorphous solids. IEEE Trans. Electr. Dev., ED-30:1011,
1983.
126. A. Ralston, E. D. Reilly, and D. Hemmendinger (eds). Encyclopedia of Computer
Science. Nature Publishing Group, London, 4th edn, 2000.
127. S. Röllin and W. Fichtner. Parallel incomplete lu-factorisation on shared memory
multiprocessors in semiconductor device simulation. Technical Report 2003/1,
ETH Zürich, Inst. für Integrierte Systeme, Feb 2003. Parallel Matrix Algorithms
and Applications (PMAA’02), Neuchatel, Nov. 2002, to appear in special issue
on parallel computing.
128. Y. Saad. Krylov subspace methods on supercomputers. SIAM J. Sci. Stat.
Comput., 10:1200–1232, 1989.
129. Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical
Report 90-20, Research Institute for Advanced Computer Science, NASA Ames
Research Center, Moffet Field, CA, 1990.
130. Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company,
Boston, MA, 1996.
131. H. A. Schwarz. Ueber einen Grenzübergang durch alternirendes Verfahren.
Vierteljahrsschrift Naturforsch. Ges. Zürich, 15:272–286, 1870. Reprinted in:
Gesammelte Mathematische Abhandlungen, vol. 2, pp. 133-143, Springer, Berlin,
1890.
132. D. Sima. The design space of register renaming techniques. IEEE Micro.,
20(5):70–83, 2000.
133. B. F. Smith, P. E. Bjørstad, and W. D. Gropp. Domain Decomposition: Par-
allel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge
University Press, Cambridge, 1996.
134. B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema
et al. Matrix Eigensystem Routines—EISPACK Guide. Lecture Notes in Com-
puter Science 6. Springer, Berlin, 2nd edn, 1976.
135. SPRNG: The scalable parallel random number generators library for ASCI Monte
Carlo computations. See http://sprng.cs.fsu.edu/.
136. J. Stoer and R. Bulirsch. Einf¨ uhrung in die Numerische Mathematik II. Springer,
Berlin, 2nd edn, 1978.
137. J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer, New York,
2nd edn, 1993.
138. H. Stone. Parallel tridiagonal equation solvers. ACM Trans. Math. Software,
1:289–307, 1975.
139. P. N. Swarztrauber. Symmetric FFTs. Math. Comp., 47:323–346, 1986.
140. NEC SX-6 system specifications. Available at URL http://www.sw.nec.co.jp/hpc/
sx-e/sx6.
141. C. Temperton. Self-sorting fast Fourier transform. Technical Report, No. 3,
European Centre for Medium Range Weather Forecasting (ECMWF), 1977.
142. C. Temperton. Self-sorting mixed radix fast Fourier transforms. J. Comp. Phys.,
52:1–23, 1983.
143. Top 500 supercomputer sites. A University of Tennessee, University of Mannheim,
and NERSC/LBNL frequently updated list of the world’s fastest computers.
Available at URL http://www.top500.org.
144. B. Toy. The LINPACK benchmark program done in C, May 1988. Available from
URL http://netlib.org/benchmark/.
145. H. A. van der Vorst. Analysis of a parallel solution method for tridiagonal linear
systems. Parallel Comput., 5:303–311, 1987.
146. Visual Numerics, Houston, TX. International Mathematics and Statistics Library.
http://www.vni.com/.
147. K. R. Wadleigh and I. L. Crawford. Software Optimization for High Perform-
ance Computing. Hewlett-Packard Professional Books, Upper Saddle River, NJ,
2000.
148. H. H. Wang. A parallel method for tridiagonal equations. ACM Trans. Math.
Software, 7:170–183, 1981.
149. R. C. Whaley. Basic linear algebra communication subprograms: Analysis
and implementation across multiple parallel architectures. LAPACK Working
Note 73, University of Tennessee, Knoxville, TN, June 1994. Available at URL
http://www.netlib.org/lapack/lawns/.
150. R. C. Whaley and J. Dongarra. Automatically tuned linear algebra software.
LAPACK Working Note 131, University of Tennessee, Knoxville, TN, December
1997. Available at URL http://www.netlib.org/lapack/lawns/.
151. R. C. Whaley, A. Petitet, and J. Dongarra. Automated empirical optim-
ization of software and the ATLAS project. LAPACK Working Note 147,
University of Tennessee, Knoxville, TN, September 2000. Available at URL
http://www.netlib.org/lapack/lawns/.
152. J. H. Wilkinson and C. Reinsch (eds). Linear Algebra. Springer, Berlin, 1971.
INDEX
3DNow, see SIMD, 3Dnow
additive Schwarz procedure, 47
Altivec, see SIMD, Altivec
and, Boolean and, 125, 245
Apple
Developers’ Kit, 130
G-4, see Motorola, G-4
ATLAS project
LAPACK, see LAPACK, ATLAS
project
U. Manchester, see pipelines,
U. Manchester
BLACS, 161, 164, 165
Cblacs gridinit, 165
block cyclic distribution, 169
block distribution, 168
cyclic vector distribution, 165, 168
process grid, see LAPACK,
ScaLAPACK, process grid
release, 167
scoped operations, 164
BLAS, 21–23, 141, 142, 163, 165, 178
Level 1 BLAS, 21
ccopy, 126
cdot, 22
daxpy, 23
dscal, 25
isamax, 21, 106, 124–126
saxpy, 19, 94, 96, 106–109, 142–143,
146, 147, 172
sdot, 19, 105, 123–124, 142, 144–146
sscal, 108
sswap, 108
Level 2 BLAS, 21
dgemv, 23, 148, 176
dger, 25
sger, 109, 149
Level 3 BLAS, 21, 26, 163
dgemm, 23, 25, 26, 141, 142
dgetrf, 27
dtrsm, 141
NETLIB, 108
PBLAS, 161, 163, 166
pdgemv, 176
root names, 20
suffixes, prefixes, 19
block cyclic, see LAPACK, block cyclic
layout
blocking, see MPI, blocking
Brownian motion, see Monte Carlo
bus, 157
cache, 3–8
block, 6
block address, 6
cacheline, 6, 97
data alignment, 122
direct mapped, 6
least recently used (LRU), 7
misalignment, 122, 123
miss rate, 7
page, 6
set associative, 6
write back, 8
write through, 8
ccnuma, see memory, ccnuma
ccopy, see BLAS, Level 1 BLAS, ccopy
cfft2, see FFT, cfft2
clock period, 89
clock tick, see clock period
compiler
cc, 153
compiler directive, 14, 15, 91–92
OpenMP, see OpenMP
f90, 153
gcc, 124, 130
-faltivec, 124
switches, 124
guidec, 145, 150
icc, 125
switches, 125
compiler (Cont.)
mpicc, 161
cos function, 65, 66–68
CPU
CPU vs. memory, 1
Cray Research
C-90, 93
compiler directives, 91
SV-1, 67, 92
X1, 137–139
cross-sectional bandwidth, see Pallas,
EFF BW
crossbar, see networks, crossbar
data dependencies, 86–89
data locality, 3
δ(x) function, 75
domain decomposition, 45, 57
dot products, see BLAS, Level 1 BLAS,
sdot
dynamic networks, see networks, dynamic
EFF BW, see Pallas, EFF BW
EISPACK, 21, 192
Feynman, Richard P., 76
FFT, 49–57, 86, 126–132
bit reversed order, 54
bug, 126, 127
cfft2, 118, 128, 129
Cooley–Tukey, 49, 54, 118, 119
fftw, 180
in-place, 117–122
OpenMP version, 151
real, 55
real skew-symmetric, 56–57
real symmetric, 56–57
signal flow diagram, 120, 129
Strang, Gilbert, 49
symmetries, 49
transpose, 181, 184
twiddle factors, w, 50, 118, 121, 130,
151
Fokker–Planck equation, 77
gather, 40, 91–92
Gauss-Seidel iteration, see linear algebra,
iterative methods
Gaussian elimination, see linear algebra
gcc, see compiler, gcc
Gonnet, Gaston H., see random numbers
goto, 57
Hadamard product, 43, 45, 89
Hewlett-Packard
HP9000, 67, 136–137
cell, 152
MLIB, see libraries, MLIB
hypercube, see networks, hypercube
icc, see compiler, icc
if statements
branch prediction, 103–104
vectorizing, 102–104
by merging, 102–103
instructions, 8–14
pipelined, 86
scheduling, 8
template, 10
Intel, 1
Pentium 4, 67, 85, 130
Pentium III, 67, 85
intrinsics, see SIMD, intrinsics
irreducible polynomials, see LLL
algorithm
ivdep, 91
Jacobi iteration, see linear algebra,
iterative methods
Knuth, Donald E., 60
⊗, Kronecker product, 53, 245
⊕, Kronecker sum, 53, 245
Langevin’s equation, see Monte Carlo
LAPACK, 21–28, 153, 161, 163
ATLAS project, 142
block cyclic layout, 164
dgesv, 28
dgetrf, 24
dgetrf, 26, 141
ScaLAPACK, 21, 161, 163, 170, 177
array descriptors, 176
block cyclic distribution, 169
context, 165
descinit, 176
info, 176
process grid, 164, 165, 167, 170, 171,
176–178
sgesv, 23
sgetrf, 23, 153
latency, 11, 191, 193
communication, 41
memory, 11, 96–97, 152, 191, 193
message passing, 137
pipelines, 89, 90, 92, 95
libraries, 153
EISPACK, see EISPACK
LAPACK, see LAPACK
LINPACK, see LINPACK
MLIB, 141
MLIB NUMBER OF THREADS, 153
NAG, 61
linear algebra, 18–28
BLSMP sparse format, 39
CSR sparse format, 39, 40, 42
cyclic reduction, 112–118
Gaussian elimination, 21, 23–28, 112,
141, 149
blocked, 25–28
classical, 23, 141–150
iterative methods, 29–49
coloring, 46
Gauss–Seidel iteration, 31
GMRES, 34–36, 39
iteration matrix, 30, 34, 42, 46
Jacobi iteration, 30
Krylov subspaces, 34
PCG, 34, 36–39
preconditioned residuals, 34
residual, 29–31
SOR iteration, 31
spectral radius, 29
SSOR iteration, 32
stationary iterations, 29–33
LU, 141
multiple RHS, 116
LINPACK, 142
dgefa, 27
Linpack Benchmark, 1, 107, 142, 149
sgefa, 106, 108, 141, 149
little endian, 90, 132
LLL algorithm, 63
log function, 65–68
loop unrolling, 8–14, 86–89
Marsaglia, George, see random numbers
MatLab, 89
matrix
matrix–matrix multiply, 19
matrix–vector multiply, 19, 146, 147
tridiagonal, 112–117
∨, maximum, 62, 245
memory, 5–8
banks, 93, 117
BiCMOS, 6
ccnuma, 136, 152
CMOS, 1, 6, 85, 95
CPU vs. memory, 1
data alignment, 122
DRAM, 1
ECL, 1, 85, 95
IRAM, 1
latency, 96
RAM, 6
RDRAM, 1
SRAM, 1
message passing
MPI, see MPI
pthreads, see pthreads
PVM, see PVM
MIMD, 156
∧, minimum, 62, 245
MMX, see SIMD, SSE, MMX
Monte Carlo, 57–60, 68–80
acceptance/rejection, 60, 68, 73
acceptance ratio, 70
if statements, 68
von Neumann, 73
fault tolerance, 58
Langevin methods, 74
Brownian motion, 74, 75, 79
stochastic differential equations, 76
random numbers, see random numbers
Moore, Gordon E., 1
Moore’s law, 1–3
Motorola, 5
G-4, 4, 67, 85, 130
MPI, 157, 161, 165
MPI ANY SOURCE, 158
MPI ANY TAG, 158
blocking, 159
MPI Bcast, 158, 160
commands, 158
communicator, 158, 160, 165
MPI COMM WORLD, 158
mpidefs.h, 158
MPI FLOAT, 158, 160
Get count, 158
MPI Irecv, 41, 159
MPI Isend, 41, 159
mpi.h, 158
MPI Allgather, 41
MPI Alltoall, 41
MPI (Cont.)
mpicc, 161
mpirun, 161
NETLIB, 160
node, 157
non-blocking, 159
rank, 158, 194
MPI Recv, 158
root, 160
MPI Send, 158, 159
shut down, 165
status, 158
tag, 158
MPI Wtime, 137
multiplicative Schwarz procedure, 46
NEC
Earth Simulator, 2
SX-4, 67, 74, 93
SX-5, 93
SX-6, 93, 139
NETLIB, 26, 71, 160
BLAS, 108
zufall, 71
networks, 15–17, 156–157
crossbar, 157
dynamic networks, 157
hypercube, 157
Ω networks, 157
static, 157
switches, 15, 16
tori, 157
OpenMP, 140–143
BLAS, 141
critical, 144, 145
data scoping, 149
FFT, 151
for, 143, 145, 147
matrix–vector multiply, 147
OMP NUM THREADS, 150
private variables, 145, 149, 151
reduction, 146
shared variables, 149, 151
or, Boolean or, 245
Pallas
EFF BW, 136–137
Parallel Batch System (PBS), 158, 160,
161
PBS NODEFILE, 158, 161
qdel, 161
qstat, 161
qsub, 161
Parallel BLAS, see BLAS, PBLAS
partition function, 60
pdgemv, see BLAS, PBLAS, pdgemv
PETSc, 190–197
MatAssemblyEnd, 191
MatCreate, 191
MatSetFromOptions, 191
MatSetValues, 191
MPIRowbs, 195
PetscOptionsSetValue, 196
pipelines, 8–14, 86, 89–104
U. Manchester, Atlas Project, 89
polynomial evaluation, see SIMD
pragma, 91, 92, 143–153
pthreads, 14, 15, 130, 140
PVM, 14, 15, 164
random numbers, 58, 60
Box–Muller, 65–68
Gonnet, Gaston H., 63
high dimensional, 72–80
isotropic, 73
Marsaglia, George, 71
non-uniform, 64
polar method, 67, 68, 70
SPRNG, 64
uniform, 60–64, 86
lagged Fibonacci, 61–88
linear congruential, 60
Mersenne Twister, 62
ranlux, 62
recursive doubling, 110
red-black ordering, 43
scatter, 91–92
sdot, see BLAS, Level 1 BLAS, sdot
sgesv, see LAPACK, sgesv
sgetrf, see LAPACK, sgetrf
shared memory, 136–153
FFT, 151
SIMD, 85–132
3DNow, 88
Altivec, 85, 88, 97, 122–132
valloc, 123
vec , 126
vec , 123–124, 132
Apple G-4, 85
SIMD (Cont.)
compiler directives, 91
FFT, 126
gather, 91
Intel Pentium, 85
intrinsics, 122–132
polynomial evaluation, 110–112
reduction operations, 105–107
registers, 86
scatter, 91
segmentation, 89
speedup, 90, 92, 94, 95, 141, 142, 148
SSE, 85, 88, 93, 97, 103, 122–132
__m128, 124–130
_mm_malloc, 123
_mm_, 123–130
MMX, 93
XMM, 93, 99, 122
SSE2, 93
vector register, 88
sin function, 65, 66, 68
SOR, see linear algebra, iterative
methods
speedup, 10, 68, 89, 90, 92, 94, 95, 141,
142, 148
superlinear, 178
sqrt function, 65–68
SSE, see SIMD, SSE
static networks, see networks, static
stochastic differential equations, see
Monte Carlo
Strang, Gilbert, see FFT
strip mining, 173, 180
Superdome, see Hewlett-Packard, HP9000
superscalar, 88
transpose, 181, 184
vector length VL, 61, 62, 66, 86–90,
94–97, 105, 123
vector mask VM, 102, 103
vectorization, 85, 93, 107
Cray Research, 93
intrinsics, see SIMD, intrinsics
NEC, 93
VL, see vector length VL
VM, see vector mask VM
Watanabe, Tadashi, 139
XMM, see SIMD, SSE, XMM
⊕, exclusive or, 61, 64, 181, 245

Having colleagues like them helps make many things worthwhile. doubtless errors still remain for which only the authors are to blame. advice.ACKNOWLEDGMENTS Our debt to our students. Gaston Gonnet. Kate Pullen. George Sigut (our Beowulf machine). and Dr Bruce u Greer of Intel. Dr Olivier Byrde of Cray Research and ETH. Mario R¨tti. as well as trying hard to keep us out of trouble by reading portions of an evolving text. and Christoph Schwab. Rolf Jeltsch. that a “thank you” hardly seems enough. Anita Petrie. and the staff of Oxford University Press for their patience and hard work. Despite their valiant efforts. We are also sincerely thankful for the support and encouragement of Professors Walter Gander. and at every turn extremely competent. system administrators. and Tonko Racic (our HP9000 cluster) have been cheerful. The help of our system gurus cannot be overstated. examples. Finally. Dr Roman Geus. and colleagues is awesome. encouraging. Michael Mascagni and Dr Michael Vollmer. Intel Corporation’s Dr Vollmer did so much to provide technical material. assistants. Former assistants have made significant contributions and include Oscar Chinellato. . Bruno Loepfe (our Cray cluster). Other helpful contributors were Adrian Burri. we would like to thank Alison Jones. Other contributors who have read parts of an always changing manuscript and who tried to keep us on track have been Prof. and Dr Andrea Scascighini—particularly for their contributions to the exercises. Martin Gutknecht.

10 Preconditioning and parallel preconditioning 2.2.1 Memory 1. instruction scheduling.3.5.2 Memory systems 1.3.2 Uniform distributions 2.5.3.3.7 The conjugate gradient (CG) method 2.2 Jacobi iteration 2.2.5 Krylov subspace methods 2.2.2 Solving systems of equations with LAPACK 2.2.3 Gauss–Seidel (GS) iteration 2.3. iterative methods 2.2 LAPACK and the BLAS 2.3 Multiple processors and processes 1.1 Typical performance numbers for the BLAS 2.1 Linear algebra 2.9 The sparse matrix vector product 2.4.1 Symmetries 2.4 Successive and symmetric successive overrelaxation 2.3.CONTENTS List of Figures List of Tables 1 BASIC ISSUES 1.3 Linear algebra: sparse matrices.3.5 Monte Carlo (MC) methods 2.2 Pipelines.1 Cache designs 1.6 The generalized minimal residual method (GMRES) 2.1 Random numbers and independent streams 2.3.5.4 Networks 2 APPLICATIONS 2.8 Parallelization 2.3 Non-uniform distributions xv xvii 1 1 5 5 8 15 15 18 18 21 22 23 28 29 30 31 31 34 34 36 39 39 42 49 55 57 58 60 64 . and loop unrolling 1.4 Fast Fourier Transform (FFT) 2.1 Stationary iterations 2.3.3.

2.5.2 HP9000 Superdome machine 4.7 Motorola G-4 architecture 3. polynomial evaluation 3.10 Using Libraries 85 85 86 89 91 92 96 97 97 101 102 105 106 106 107 110 110 112 114 117 122 123 124 126 136 136 136 137 139 140 141 142 143 146 147 149 151 152 153 .1 Basic vector operations with OpenMP 4.2 A single tridiagonal system 3. searching 3.5 OpenMP standard 4.1 The matrix–vector multiplication with OpenMP 4.5.5 Some examples from Intel SSE and Motorola Altivec 3.3 Cray X1 machine 4. 3.5.6 Pentium 4 architecture 3.2.3 Solving tridiagonal systems by cyclic reduction.5 Pentium 4 and Motorola G-4 architectures 3.7 ISAMAX on Intel using SSE 3.8.x CONTENTS 3 SIMD.2.8 Branching and conditional execution 3.4.4 Long memory latencies and short vector lengths 3.4.3 Cray SV-1 hardware 3.5.2.9 Overview of OpenMP commands 4.2.8 OpenMP matrix vector multiplication 4.4 NEC SX-6 machine 4.5. scatter/gather operations 3.2 More about dependencies.7.1 Pipelining and segmentation 3. SINGLE INSTRUCTION MULTIPLE DATA 3.5.8.4 Some basic linear algebra examples 3.2.2 SGEFA: The Linpack benchmark 3.6 SDOT on G-4 3.7 Basic operations with vectors 4.6 Shared memory versions of the BLAS and LAPACK 4.1 Polynomial evaluation 3.1 Introduction 4.2.4 Another example of non-unit strides to achieve parallelism 3.2 Shared memory version of SGEFA 4.5 Recurrence formulae.1 Matrix multiply 3.1 Introduction 3.3 Reduction operations.6 FFT on SSE and Altivec 4 SHARED MEMORY PARALLELISM 4.5.3 Shared memory version of FFT 4.2 Data dependencies and loop unrolling 3.2.8.

9 MPI three-dimensional FFT example 5.3. utility.11.3.3.4 Vector comparisons A.10 MPI Monte Carlo (MC) integration example 5.1 Matrix–vector multiplication with MPI 5.4.8 MPI two-dimensional FFT example 5.2 Krylov subspace methods and preconditioners 5.3 Vector logical operations and permutations B.3 Load/store operation intrinsics A.5 Full precision arithmetic functions on vector operands B.2 Boolean and logical intrinsics A.2 Matrix–vector multiply with PBLAS 5.7 ScaLAPACK 5.3 Block–cyclic distribution of vectors 5.4 Distribution of matrices 5.1 Cyclic vector distribution 5.5 Basic operations with vectors 5.4 Load and store operations B.1 MPI commands and examples 5.11.2 Matrix and vector operations with PBLAS and BLACS 5.6 Integer valued low order scalar in vector comparisons A.6 Matrix–vector multiply revisited 5.6.7 Integer/floating point vector conversions A.1 Matrices and vectors 5. and approximation functions B.11 PETSc 5.8 Arithmetic function intrinsics APPENDIX B ALTIVEC INTRINSICS FOR FLOATING POINT B.2 Conversion. MULTIPLE INSTRUCTION. MULTIPLE DATA 5.6.3 Distribution of vectors 5.12 Some numerical experiments with a PETSc code APPENDIX A SSE INTRINSICS FOR FLOATING POINT A.CONTENTS xi 5 MIMD.6 Collective comparisons 156 158 161 165 165 168 169 170 170 171 172 172 173 177 180 184 187 190 191 193 194 201 201 201 202 205 206 206 206 207 211 211 212 213 214 215 216 .1 Conventions and notation A.5 Low order scalar in vector comparisons A.1 Two-dimensional block–cyclic matrix distribution 5.1 Mask generating vector comparisons B.2 Block distribution of vectors 5.

2 Collective communications D.1 Point to point commands D.xii CONTENTS APPENDIX C OPENMP COMMANDS 218 220 220 226 234 235 240 245 246 255 APPENDIX D SUMMARY OF MPI COMMANDS D.3 Timers. initialization. and miscellaneous APPENDIX E APPENDIX F APPENDIX G References Index FORTRAN AND C COMMUNICATION GLOSSARY OF TERMS NOTATIONS AND SYMBOLS .

Aligning templates of instructions generated by unrolling loops. Block Gaussian elimination. The preconditioned GMRES(m) algorithm. 9×9 grid and sparsity pattern of the corresponding Poisson matrix if grid points are numbered in checkerboard (red-black) ordering. Sparse matrix–vector multiplication y = AT x with the matrix A stored in the CSR format.2 2.1 1. Ω-network. Pipelining: a pipe filled with marbles.10 1. Pre-fetching 2 data one loop iteration ahead (assumes 2|n). Ω-network switches.8 1. Two-dimensional nearest neighbor connected torus. Data address in set associative cache memory. 2 2 3 4 5 7 9 11 13 13 15 16 17 24 26 27 33 37 38 40 41 42 44 44 45 46 48 63 . Gaussian elimination of an M × N matrix based on Level 2 BLAS as implemented in the LAPACK routine dgetrf.10 2.7 2. Sparse matrix with band-like nonzero structure row-wise block distributed on six processors. Caches and associativity. Overlapping domain decomposition.6 1.1 2.4 1.5 1.9 2. Sparse matrix–vector multiplication y = Ax with the matrix A stored in the CSR format.3 2.LIST OF FIGURES 1. Memory versus CPU performance.13 2. Graphical argument why parallel RNGs should generate parallel streams.11 2.15 Intel microprocessor transistor populations since 1972.12 1. Linpack benchmark optimal performance tests. 9×9 grid and sparsity pattern of the corresponding Poisson matrix if grid points are arranged in checkerboard (red-black) ordering.6 2.5 2.14 2. Generic machine with cache memory. 9×9 grid and sparsity pattern of the corresponding Poisson matrix if grid points are numbered in lexicographic order.3 1. which is functionally equivalent to dgefa from LINPACK.13 2.9 1.2 1.4 2. The main loop in the LAPACK routine dgetrf. The preconditioned conjugate gradient algorithm.12 2.8 2. The incomplete Cholesky factorization with zero fill-in.11 1.7 1. Stationary iteration for solving Ax = b with preconditioner M . Aligning templates and hiding memory latencies by pre-fetching data.

24 Intrinsics. 3. 3.1 One cell of the HP9000 Superdome. 3.5 Cray X1 MSP.18 Double “bug” for in-place.7 Four-stage multiply pipeline: C = A ∗ B with out-of-order instruction issue.13 Branch prediction best when e(x) > 0. 4. 2.12 Branch processing by merging results. 3.7. and generic FFT.5 saxpy operation by SIMD.19 Box–Muller vs.2 Scatter and gather operations. polar method for generating univariate normals.6 Cray X1 node (board).11 High level overview of the Motorola G-4 structure.20 Timings on NEC SX-4 for uniform interior sampling of an n-sphere. 3. self-sorting FFT. 4. 3.4 Cray SV-1 CPU diagram. 3.21 Simulated two-dimensional Brownian motion.18 Polar method for normal random variates. 3.15 Simple parallel version of SGEFA.10 Port structure of Intel Pentium 4 out-of-order instruction core. 2. and generic FFT.4 EFF BW benchmark on Stardust. 2.7 NEC SX-6 CPU. 2. 67 68 70 72 74 79 80 90 91 91 93 94 96 98 99 100 100 101 102 104 104 108 115 120 121 123 127 127 128 130 133 136 137 137 138 138 139 140 .17 In-place. 3. 3. 2.9 Block diagram of Intel Pentium 4 pipelined instruction execution unit. 4. Ziggurat method.14 Branch prediction best if e(x) ≤ 0. in-place (non-unit stride).8 Another way of looking at Figure 3. 3. Ogdoad: 1.21 Decimation in time computational “bug”. Ito: 1.16 Timings for Box–Muller method vs.3 Pallas EFF BW benchmark. 3.6 Long memory latency vector computation. 3. 4. 4.xiv LIST OF FIGURES 2. 3. 3.1 Four-stage multiply pipeline: C = A ∗ B. 3. 4.23 Intrinsics.2 Crossbar interconnect architecture of the HP9000 Superdome. including the Altivec technology. the recursive procedure.20 Workspace version of self-sorting FFT. 3.16 Times for cyclic reduction vs.19 Data misalignment in vector reads. 3.22 Complex arithmetic for d = wk (a − b) on SSE and Altivec. 3.25 GHz Power Mac G-4. self-sorting FFT. 3. 4.7 GHz Pentium 4 3. 3.22 Convergence of the optimal control process.17 AR method. 2. 3. 3.3 Scatter operation with a directive telling the C compiler to ignore any apparent vector dependencies. in-place (non-unit stride).

3 5. SGEFA.4 5. MPICH compile script. Strip-mining a two-dimensional FFT. The 15 × 20 matrix A stored on a 2 × 4 process grid with big blocks together with the 15-vector y and the 20-vector x.23 5.9 4. Two-dimensional transpose for complex data.13 4. A domain decomposition MC integration. Times and speedups for parallel version of classical Gaussian elimination.5 5. Global variable dot unprotected.20 5.13 5.22 5.26 5. Eight processes mapped on a 2 × 4 process grid in row-major order.15 5. MPI matrix–vector multiply with row-wise block-distributed matrix. MPI status struct for send and receive functions. OpenMP reduction syntax for dot (version IV).1 5.11 4. 140 144 144 145 146 150 152 154 157 157 159 162 162 163 163 167 167 167 168 168 169 171 173 174 175 175 176 176 180 181 188 188 190 191 192 .15 5. Simple minded approach to parallelizing one n = 2m FFT using OpenMP on Stardust. OpenMP critical region protection for global variable dot (version II).6 5. MPICH (PBS) batch run script. OpenMP critical region protection only for local accumulations local dot (version III). Initialization of a BLACS process grid. Network connection for ETH Beowulf cluster. Generic MIMD distributed-memory computer (multiprocessor).8 5. Cutting and pasting a uniform sample on the points. The ScaLAPACK software hierarchy. Block distribution of a vector.10 5.7 5. Block–cyclic distribution of a vector. Cyclic distribution of a vector.16 5. Definition and initialization of a vector. Block–cyclic distribution of a 15 × 20 matrix on a 2 × 3 processor grid with blocks of 2 × 3 elements. Block–cyclic matrix and vector allocation. and thus giving incorrect results (version I).10 4.27 NEC SX-6 node. LAM (PBS) run script.14 5. The PETSc software building blocks.11 5.25 5. The data distribution in the matrix–vector product A ∗ x = y with five processors. General matrix–vector multiplication with PBLAS. Defining the matrix descriptors.8 4.19 5.21 5. Definition and initialization of a n × n Poisson matrix.12 4.LIST OF FIGURES xv 4.18 5.2 5.9 5. Times and speedups for the Hewlett-Packard MLIB version LAPACK routine sgetrf.14 4.12 5.24 5.17 5. Release of the BLACS process grid.

5. 5.28 Definition of the linear solver context and of the Krylov subspace method.29 Definition of the preconditioner.30 Calling the PETSc solver. Jacobi in this case. 5.xvi LIST OF FIGURES 5.31 Defining PETSc block sizes that coincide with the blocks of the Poisson matrix. 193 194 194 195 .

7 2.1 2. 4 19 20 22 23 28 33 35 39 4. Iteration steps for solving the Poisson equation on a 31 × 31 and on a 63 × 63 grid.2 D.5 2.LIST OF TABLES 1. Summary of the PBLAS.1 D. Additional available binary relations for collective comparison functions. MPI datatypes available for collective reduction operations.6 2. Timings of the ScaLAPACK system solver pdgesv on one processor and on 36 processors with varying dimensions of the process grid.2 4. MPI pre-defined constants. Some Krylov subspace methods for Ax = b with nonsingular A. Some execution times in microseconds for the matrix–vector multiplication with OpenMP on the HP superdome. Summary of the BLACS. Execution times in microseconds for our dot product. Some execution times in microseconds for the saxpy operation.2 .1 2.2 5. and Motorola G-4. using the C compiler guidec. Available binary relations for the mm compbr ps and mm compbr ss intrinsics.3 5.4 GHz Pentium 4. Iteration steps for solving the Poisson equation on a 31 × 31 and on a 63×63 grid.5 A.4 2. Times t in seconds (s) and speedups S(p) for various problem sizes n and processor numbers p for solving a random system of equations with the general solver dgesv of LAPACK on the HP Superdome. Available binary relations for comparison functions.4 141 143 145 148 164 166 179 179 195 203 212 216 230 231 5.1 B. Times (s) and speed in Mflop/s of dgesv on a P 4 (2. 1 GB).3 2. Times t and speedups S(p) for various problem sizes n and processor numbers p for solving a random system of equations with the general solver pdgesv of ScaLAPACK on the Beowulf cluster. Execution times for solving an n2 × n2 linear system from the two-dimensional Poisson problem. Summary of the basic linear algebra subroutines.4 5.4 GHz. Basic linear algebra subprogram prefix/suffix conventions. Number of memory references and floating point operations for vectors of length n.2 2.1 5.3 4.1 Cache structures for Intel Pentium III.1 B. 4. Some performance numbers for typical BLAS in Mflop/s for a 2.8 4.

This page intentionally left blank .

Anyone who has recently purchased memory for a PC knows how inexpensive . his law [107] that the number of transistors on microprocessors doubles roughly every one to two years has proven remarkably astute. the IRAM [78] and RDRAM projects [85].). Likewise. In consequence. Worse. that central processing unit (CPU) performance would also double every two years or so has also remained prescient. This is not completely accurate. however. Appendix F) and direct memory access.1 Memory Since first proposed by Gordon Moore (an Intel founder) in 1965.1 BASIC ISSUES No physical quantity can continue to change exponentially forever. Figure 1. for example. We are confident that this disparity between CPU and memory access performance will eventually be tightened. Figure 1. frequently with multiple levels of cache. Your job is delaying forever. we must deal with the world as it is. E. nearly all memory systems these days are hierarchical. Figure 1. Moore (2003) 1. Alas. Cray Research Y-MP series machines had well balanced CPU to memory performance. of course: it is mostly a matter of cost. close packing is difficult and such systems tend to have small storage per volume. but in the meantime. Dynamic random access memory (DRAM) in some variety has become standard for bulk memory. There are many projects and ideas about how to close this performance gap. Its corollary. Almost any personal computer (PC) these days has a much larger memory than supercomputer memory systems of the 1980s or early 1990s. In the 1980s and 1990s. ECL (see glossary. G. in recent years there has developed a disagreeable mismatch between CPU and memory performance: CPUs now outperform memory systems by orders of magnitude according to some reckoning [71].2 indicates that when one includes multi-processor machines and algorithmic development. computer performance is actually better than Moore’s 2-year performance doubling time estimate.3 shows the diverging trends between CPUs and memory performance.1 shows Intel microprocessor data on the number of transistors beginning with the 4004 in 1972. using CMOS (see glossary. has well balanced CPU/Memory performance. because they have to be cooled. NEC (Nippon Electric Corp. Appendix F) and CMOS static random access memory (SRAM) systems were and remain expensive and like their CPU counterparts have to be carefully kept cool.

Only some of the fastest machines are indicated: Cray-1 (1984) had 1 CPU. Intel microprocessor transistor populations since 1972. Fujitsu VP2600 (1990) had 1 CPU.5 year 80486 doubling) 80386 102 8086 101 4004 100 1970 1980 Year 1990 2000 (Line fit = 2 year doubling) 80286 103 Fig. 1.2 105 BASIC ISSUES Intel CPUs 104 Number of transistors Itanium Pentium-II-4 Pentium (2. Linpack benchmark optimal performance tests. 1. ES (2002). These data were gleaned from various years of the famous dense linear equations benchmark [37]. 108 10 7 Super Linpack Rmax [fit: 1 year doubling] ES 106 MFLOP/s 105 104 103 102 Cray–1 101 1980 1990 Year 2000 Cray T–3D NEC SX–3 Fujitsu VP2600 Fig. . and the last.1.2. NEC SX-3 (1991) had 4 CPUS. Cray T-3D (1996) had 2148 DEC α processors. is the Yokohama NEC Earth Simulator with 5120 vector processors.

000 = CPU = SRAM = DRAM Data rate (MHz) 1000 3 100 10 1990 1995 Year 2000 2005 Fig. in cases of non-unit but fixed stride. Computer architectural design almost always assumes a basic principle —that of locality of reference. most machines use cache memory hierarchies whose underlying assumption is that of data locality. Hence. it seems our first task in programming high performance computers is to understand memory access. Additional levels are possible and often used. . the larger the cache becomes.1 and Figure 1. Nevertheless.5). Non-local memory access. it is interesting that the cost of microprocessor fabrication has also grown exponentially in recent years. DRAM has become and how large it permits one to expand their system. Here it is: The safest assumption about the next data to be used is that they are the same or nearby the last used. 1. it is hard to imagine another strategy which could be easily implemented. the higher the level of associativity (see Table 1. As one moves up in cache levels. In Figure 1.MEMORY 10. Memory versus CPU performance: Samsung data [85]. we show a more/less generic machine with two levels of cache. L3 cache in Table 1. Whereas the locality assumption is usually accurate regarding instructions. Line extensions beyond 2003 for CPU performance are via Moore’s law. Most benchmark studies have shown that 90 percent of the computing time is spent in about 10 percent of the code. Economics in part drives this gap juggernaut and diverting it will likely not occur suddenly. are often handled with pre-fetch strategies—both in hardware and software. and the lower the cache access bandwidth.1. Dynamic RAM (DRAM) is commonly used for bulk memory. Hence. it is less reliable for other data. in particular. for example. while static RAM (SRAM) is more common for caches. with some evidence of manufacturing costs also doubling in roughly 2 years [52] (and related articles referenced therein). However.4.3.

Table 1. 64-bit 8·16 B (SIMD) 550 MHz 4.2 GB/s L1 ↔ Reg. 4. Pentium III memory access data Channel: Width Size Clocking Bandwidth Channel: Width Size Clocking Bandwidth Channel: Width Size Clocking Bandwidth M ↔ L2 64-bit 256 kB (L2) 133 MHz 1.25 GHz 2 GB/s 40 GB/s access data L2 ↔ L1 256-bit 32 kB 1.06 GHz 98 GB/s L1 ↔ Reg. 128-bit 32·16 B (SIMD) 1. 1.25 GHz 40 GB/s . and Motorola G-4. 256-bit 8·16 B (SIMD) 3.1 Cache structures for Intel Pentium III.4.3 GB/s 98 GB/s Power Mac G-4 memory M ↔ L3 L3 ↔ L2 64-bit 256-bit 2 MB 256 kB 250 MHz 1.4 GB/s L1 ↔ Reg.4 BASIC ISSUES Memory System bus L2 cache Bus interface unit Instruction cache L1 data cache Fetch and decode unit Execution unit Fig.0 GB/s Pentium 4 memory access data M ↔ L2 L2 ↔ L1 64-bit 256-bit 256 kB (L2) 8 kB (L1) 533 MHz 3. Generic machine with cache memory.06 GHz 4.25 GHz 20.06 GB/s L2 ↔ L1 64-bit 8 kB (L1) 275 MHz 2.

Light speed propagation delay is almost exactly 1 ns in 30 cm. it is a safe place to hide things. all the memory is cache. Caches. then. for example. and others. The advantage to this direct interface is that memory access is closer in performance to the CPU. if possible. 1. and a direct mapped cache (same as 1-way associative in this 8 block example). These very simplified examples have caches with 8 blocks: a fully associative (same as 8-way set associative in this case). 1. Early Cray machines had twisted pair cable interconnects. a 2-way set associative cache with 4 sets.6 series. the NEC SX-4. is necessarily further away. one can see that it is possible for the CPU to have a direct interface to the memory. The downside is that memory becomes expensive and because of cooling requirements. assuming a 1. Note that block 4 in memory also maps to the same sets in each indicated cache design having 8 blocks.MEMORY SYSTEMS Fully associative 2-way associative 5 Direct mapped 12 mod 4 12 mod 8 0 4 Memory 12 Fig. In effect.5. This is also true for other supercomputers. Fujitsu AP3000.1 Cache designs So what is a cache. tend to be very close to the CPU—on-chip. 1.2 Memory systems In Figure 3. and what should we know to intelligently program? According to a French or English dictionary. the further away the data are from the CPU. Caches and associativity.0 GHz clock.5.4 depicting the Cray SV-1 architecture. This is perhaps not an adequate description of cache with regard . how does it work.2. all of the same physical length.1 indicates some cache sizes and access times for three machines we will be discussing in the SIMD Chapter 3. Table 1. the longer it takes to get. so a 1 ft waveguide forces a delay of order one clock cycle. Obviously.

The set where the cacheline is to be placed is computed by (block address) mod (m = no. According to Hennessey and Patterson [71]. and the cache is called an n−way set associative cache. Caches. If there are m sets. Instructions are only read. there are practical limits on the size of a set. In Figure 1. reading data. the higher the level of cache. Fully associative means the data block can go anywhere in the cache. data read but not used pose no problem about what to do with them—they are ignored. More accurately. A datum from memory to be read is included in a cacheline (block) and fetched as part of that line. In effect. a direct mapped cache is set associative with each set consisting of only one block. the block sizes may also change. is the easiest. The most common case. and perhaps 9–10 percent of all memory traffic. the larger the cache. Thus. the more sets are used. the larger the number of sets. the data will be mis-aligned and mis-read: see Figure 3. of sets in cache). that is. This follows because higher level caches are usually much larger than lower level ones and search mechanisms for finding blocks within a set tend to be complicated and expensive. and a direct mapped. • Set associative means a block can be anywhere within a set. Caches can be described as direct mapped or set associative: • Direct mapped means a data block can go only one place in the cache. the number of blocks in a set is n = (cache size in blocks)/m. Typically. hence likely to be used next. about 25 percent of memory data traffic is writes. The largest possible block size is called a page and is typically 4 kilobytes (kb).5 . Data read from cache into vector registers (SSE or Altivec) must be aligned on cacheline boundaries. which is usually of DRAM type. a 2-way. are high speed CMOS or BiCMOS memory but of much smaller size than the main memory. the principle of data locality encourages having a fast data access for data being used. close by and quickly accessible. Otherwise. Hence.19. of course. that is. Figure 1. an 8-way or fully associative. an 8-way cache has 8 cachelines (blocks) in each set and so on. In our examination of SIMD programming on cache memory architectures (Chapter 3).5 are three types namely. we will be concerned with block sizes of 16 bytes. A 4-way set associative cache is partitioned into sets each with 4 blocks. Namely. 4 single precision floating point words.6 BASIC ISSUES to a computer memory. The machines we examine in this book have both 4-way set associative and 8-way set associative caches. The idea is to bring data from memory into the cache where the CPU can work on them. Since bulk storage for data is usually relatively far from the CPU. then modify and write some of them back to memory. However. then. it is a safe place for storage that is close by.

the block location in the set is picked which has not been used for the longest time. A cache miss rate is the fraction of data requested in a given code which are not in cache and must be fetched from either higher levels of cache or from bulk memory. 1. Usually. or a direct mapped cache [71]. Data address in set associative cache memory.6): • A block address which specifies which block of data in memory contains the desired datum. Only the tag field is used to determine whether the desired datum is in cache or not. The machines we discuss in this book use an approximate LRU algorithm which is more consistent with the principle of data locality. — a index field selects the set possibly containing the datum. Many locations in memory can be mapped into the same cache block. The block is placed in the set according to a least recently used (LRU) algorithm. This location depends only on the initial state before placement is requested. each containing 128 16-byte blocks (cachelines).MEMORY SYSTEMS 7 shows an extreme simplification of the kinds of caches: a cache block (number 12) is mapped into a fully associative. Typically it takes cM = O(100) cycles to get one datum from memory. .6. Now we ask: where does the desired cache block actually go within the set? Two choices are common: 1. but only cH = O(1) cycles to fetch it from low level cache. Namely. • An offset which tells where the datum is relative to the beginning of the block. The algorithm for determining the least recently used location can be heuristic. this is itself divided into two parts. The block is placed in the set in a random location. whereas a real 8 kB. — a tag field which is used to determine whether the request is a hit or a miss. an address format is partitioned into two parts (Figure 1. a 4-way associative. 4-way cache will have sets with 2 kB. hence the location is deterministic in the sense it will be reproducible. 2. This simplified illustration has a cache with 8 blocks. the random location is selected by a hardware pseudo-random number generator. so in order to determine whether a particular datum is in the block. so the penalty for a cache miss is high and a few percent miss rate is not inconsequential. To locate a datum in memory. the tag portion Block address Tag Index Block offset Fig. Reproducibility is necessary for debugging purposes.

2.2. Since bulk memory is usually far away from the CPU. why write the data all the way back to their rightful memory locations if we want them for a subsequent step to be computed very soon? Two strategies are in use.2 Pipelines. it is clear this variable must be safely stored before it is used. In Section 3. It is well known [71] that certain processes. in principle. but the idea. In writing modified data back into memory. is not complicated. the datum may be obtained immediately from the beginning of the block using this offset.8 BASIC ISSUES of the block is checked. 1. A write through strategy automatically writes back to memory any modified variables in cache.1. in which case the whole block containing the datum is available. It will be a common theme throughout this book that data dependencies are of much concern in parallel computing. the memory issues considered above revolve around the same basic problem—that of data dependencies. we will explore in more detail some coding issues when dealing with data dependencies. want it both ways. or (2) these cache resident data are again to be modified by the CPU. There is little point in checking any other field since the index field was already determined before the check is made. A write back strategy skips the writing to memory until: (1) a subsequent read tries to replace a cache block which has been modified. .2. 1. modern cache designs often permit both write-through and write-back modes [29]. and the offset will be unnecessary unless there is a hit. these data cannot be written onto old data which could be subsequently used for processes issued earlier. Otherwise. instruction scheduling. 1. I/O and multi-threading. In consequence. Consider the following sequence of C instructions. If there is a hit. A copy of the data is kept in cache for subsequent use. A subsequent cache miss on the written through data will be assured to fetch valid data from memory because the data are freshly updated on each write. even though it occurs only roughly one-fourth as often as reading data. for example.2. These two situations are more/less the same: cache resident data are not written back to memory until some process tries to modify them.1 Writes Writing data into memory from cache is the principal problem. Which mode is used may be controlled by the program. the modification would write over computed information before it is saved. This copy might be written over by other data mapped to the same location in cache without worry. and loop unrolling For our purposes. Conversely. if the programming language ordering rules dictate that an updated variable is to be used for the next step.

What is more recent. so it may be possible to do other operations while the first is being processed. from grammar school. Array element a[1] is to be set when the first instruction is finished. second. . . . The second. . a[2] = f2(a[1]). L+1 L . On modern machines. Computations f1(a[0]). the computation of f1(a[0]) will take some time. cannot issue until the result a[1] is ready. for example. There are data dependencies: the first. . and f3(a[2]) are not independent. indicated by the dots. Pipelining: a pipe filled with marbles. The same applies to computing the second f2(a[1]). . f2(a[1]). and last must run in a serial fashion and not concurrently.MEMORY SYSTEMS 9 a[1] = f1(a[0]). f2(a[1]). 1. a[3] = f3(a[2]). This notion of pipelining operations was also not invented yesterday: the University of Manchester Atlas project implemented 2 1 Step 1 3 2 1 Step 2 . however. That multiple steps are needed to do arithmetic is ancient history. . . is that it is possible to do multiple operands concurrently: as soon as a low order digit for one operand pair is computed by one stage.7. However. and likewise f3(a[2]) must wait until a[2] is finished. 3 2 1 Step L L+2 L+1 L . essentially all operations are pipelined: several hardware stages are needed to do any computation. 3 2 1 Step L + 1 Fig. . that stage can be used to process the same low order digit of the next operand pair.

for(i=nms*m. after which one result (marble) pops out the other end at a rate of one result/cycle. The simple loop for(i=0. The resulting speedup is n · L/(n + L). with multiple operands. each of which does m operations f(a[i]). To program systems with pipelined operations to advantage. . a total of n · L cycles. Imagine that the pipe will hold L marbles. . which will be symbolic for stages necessary to process each operand. res = n%m. one wishes to hide the time it takes to get the data from memory into registers. only L + n cycles are required as long as the last operands can be flushed from the pipeline once they have started into it. We will refer to the instructions which process each f(a[i]) as a template. a residual segment. Our last loop cleans up the remaining is when n = nms · m. is to choose an appropriate depth of unrolling m which permits squeezing all the m templates together into the tightest time grouping possible. That is. Such data . b[i+1] = f(a[i+1]). } /* residual segment res = n mod m */ nms = n/m. we can keep pushing operands into the pipe until it is full. The schema is in principle straightforward and shown by loop unrolling transformations done either by the compiler or by the programmer. the rest last. we will need to know how instructions are executed when they run concurrently. b[i+m-1] = f(a[i+m-1]). However.i++){ b[i] = f(a[i]). sometimes last (as shown) or for data alignment reasons. L for large n. then.i<n. part of the res first.i<nms*m+res. for(i=0.10 BASIC ISSUES such arithmetic pipelines as early as 1962 [91]. By this simple device. } is expanded into segments of length (say) m i’s. To do one complete operation on one operand pair takes L steps. that is. The most important aspect of this procedure is prefetching data within the segment which will be used by subsequent segment elements in order to hide memory latencies. instead n operands taking L cycles each. } The first loop processes nms segments. The terminology is an analogy to a short length of pipe into which one starts pushing marbles. that is. that is.7.i<n. The problem of optimization. Figure 1. Sometimes this residual segment is processed first.i+=m){ b[i ] = f(a[i ]).i++){ b[i] = f(a[i]).

1. If the computation of f(ti) does not take long enough. The next section illustrates this idea in more generality. 1.2. b[n-1] = f(t1).}. Pre-fetching in its simplest form is for m = 1 and takes the form t = a[0]. t = a[++i]. /* for(i=0. Data are loaded into and stored from registers. .8 or more. In that case. ){ b[i] = f(t). t0 = a[i+2]. /* prefetch a[i+1] */ } b[n-1] = f(t). /* prefetch a[0] */ for(i=0. /* } b[n-2] = f(t0). In every case. where one tries to hide the next load of a[i] under the loop overhead. . /* t1 = a[1]. in particular. as in Figure 1. We can go one or more levels deeper.i<n-1. integer registers. Different machines have varying numbers and types of these: floating point registers.MEMORY SYSTEMS 11 pre-fetching was called bottom loading in former times. /* t1 = a[i+3]. but graphically. and anywhere from say 8 to 32 of each type or sometimes blocks of such registers t0 = a[0]. we have the following highlighted purpose of loop unrolling: The purpose of loop unrolling is to hide latencies. the loop unrolling level m must be increased. general purpose registers. Unless the stored results will be needed in subsequent iterations of the loop (a data dependency). we will need some notation.2.i<n-3. these stores may always be hidden: their meanderings into memory can go at least until all the loop iterations are finished. Prefetching 2 data one loop iteration ahead (assumes 2|n) .i+=2){ b[i ] = f(t0). address calculation registers. Before we begin. i = 0 . the delay in reading data from memory. not much memory access latency will be hidden under f(ti). prefetch a[0] */ prefetch a[1] */ prefetch a[i+2] */ prefetch a[i+3] */ Fig. We denote these registers by {Ri .1 Instruction scheduling with loop unrolling Here we will explore only instruction issue and execution where these processes are concurrent. b[i+1] = f(t1).8.

Consider the following operation to be performed on an array A: B = f (A). This just says that one of the {fi . In Figures 1. 2 · m. and (3) there are enough registers. For example. and put result into R3 .8. R3 ← R1 ∗ R2 : multiply contents of R1 by R2 . it is known that finding an optimal . As illustrated. Our illustration is a dream. One might run out of functional units. Usually it is not that easy. As in Figure 1. and we could set up two templates and try to align them. By merging the two together and aligning them to fill in the gaps. however. In fact. Several problems raise their ugly heads. One might run out of registers. if there are only eight registers. finding an optimal algorithm to align these templates is no small matter. M ← R1 : R3 ← R1 + R2 : add contents of R1 and R2 and store results into R3 . say two. . More complicated operations are successive applications of basic ones. 2. memory fetch.9.12 BASIC ISSUES which may be partitioned in different ways. 4. alignment of two templates is all that seems possible at compile time.} operations might halt awaiting hardware that is busy. Bi = f (Ai ) and Bi+1 = f (Ai+1 ). . This will work only if: (1) the separate operations can run independently and concurrently. . going deeper shows us how to hide memory latencies under the calculation. i = 1. that is. In general. we will run out and no more unrolling is possible without going up in levels of the memory hierarchy. if the multiply unit is busy. Each step of the calculation takes some time and there will be latencies in between them where results are not yet available. If memory is busy. If these two calculations ran sequentially. 3. where f (·) is in two steps: Bi = f2 (f1 (Ai )). 1. they would take twice what each one requires. We will use the following simplified notation for the operations on the contents of these registers: loads a datum from memory M into register R1 . might run concurrently. Finally.10. everything fit together quite nicely. the two templates can be merged together. it may not be possible to use it until it is finished with multiplies already started. (2) if it is possible to align the templates to fill in some of the gaps. the various operations. R1 ← M : stores content of register R1 into memory M . A big bottleneck is memory traffic. f1 and f2 . No matter how many there are. this may not be so. each calculation f (Ai ) and f (Ai+1 ) takes some number (say m) of cycles (m = 11 as illustrated). however. Namely. In Figure 1. it may not be possible to access data to start another template. memory latencies may be significantly hidden. they can be computed in m + 1 cycles.9 and 1. More than that and we run out of registers. By using look-ahead (prefetch) memory access when the calculation is long enough. If we try to perform multiple is together. by starting the f (Ai+1 ) operation steps one cycle after the f (Ai ). if the calculation is complicated enough.

Aligning templates and hiding memory latencies by pre-fetching data. R1 ← f1 (R0 ) . We assume 2|n. R6 ← f2 (R5 ) . . Bi+1 ← R6 = R0 ← R 3 R3 ← Ai+2 R4 ← R 7 R7 ← Ai+3 R1 ← f1 (R0 ) R5 ← f1 (R4 ) R2 ← f2 (R1 ) R3 ← f2 (R5 ) Bi ← R2 Bi+1 ← R6 Fig. .MEMORY SYSTEMS 13 R0 ← Ai . Bi+1 ← R6 = R0 ← Ai R4 ← Ai+1 wait R1 ← f1 (R0 ) R5 ← f1 (R4 ) R2 ← f2 (R1 ) R3 ← f2 (R5 ) Bi ← R2 Bi+1 ← R6 Fig. R0 ← R3 R3 ← Ai+2 . wait R5 ← f1 (R4 ) . R5 ← f1 (R4 ) .9. we assume 2|n and the loop variable i is stepped by 2: compare with Figure 1. . R4 ← R 7 R7 ← Ai+3 . Bi ← R2 .9. R2 ← f2 (R1 ) . algorithm is an NP-complete problem. This means there is no algorithm which can compute an optimal alignment strategy and be computed in a time t which can be represented by a polynomial in the number of steps. 1.10. . wait R1 ← f1 (R0 ) . 1. R4 ← Ai+1 . R2 ← f2 (R1 ) . Bi ← R2 . Again. . Aligning templates of instructions generated by unrolling loops. R6 ← f2 (R5 ) . while loop variable i is stepped by 2.

g. Control of the data is direct. but has the disadvantage that parallelism is less directly controlled. While it is true that there is no optimal algorithm for unrolling loops into such templates and dovetailing them perfectly together. — Task level parallelism refers to large chunks of work on independent data. Cache can be used for this purpose if the data are not written through back to memory. shift.. For example. . but is it useless in practice? Fortunately the situation is not at all grim. 3. but it is likely to be very good and will serve our purposes admirably. in Figure 1. • On distributed memory machines (e. — On distributed memory machines or networked heterogeneous systems. Modern machines usually have multiple copies of each functional unit: add. there are heuristics for getting a good algorithm. Modern machines have lots of registers. where MPI will be our programming choice. see Appendix C). Many machines allow renaming registers. PVM. the work done by each independent processor is either a subset of iterations of an outer loop. but data management is done by system software. The result may not be the best possible.14 BASIC ISSUES So.g. As in the outer-loop level paradigm. but is the most efficient. even if only temporary storage registers. 1. Task assignments and scheduling are done by the batch queueing system. 2. The programmer’s job is to specify the independent tasks by various compiler directives (e. — Outer loop level parallelism will be discussed in Chapter 5. one job might be to run a simulation for one set of parameters. or alternatively. by far the best mode of parallelism is by distributing independent problems. while another job does a completely different set. the programmer could use MPI. on ETH’s Beowulf machine). This mode of using directives is relatively easy to program. 4. or our Hewlett-Packard HP9000 Superdome machine. This mode is not only the easiest to parallelize. • On shared memory machines. It is possible to rename R5 which was assigned by the compiler and call it R0 . or an independent problem. a task. if not an optimal one. on ETH’s Cray SV-1 cluster. our little example is fun. both task level and outer loop level parallelism are natural modes. its data are in the f1 pipeline and so R0 is not needed anymore. or pthreads. multiply. So running out of functional units is only a bother but not fatal to this strategy. as soon as R0 is used to start the operation f1 (R0 ). for example. Here is the art of optimization in the compiler writer’s work. For example. etc. Several things make this idea extremely useful. thus providing us more registers than we thought we had.9.

12. For example. each of which uses the same lower level instruction level parallelism discussed in Section 1.11.2.MULTIPLE PROCESSORS AND PROCESSES 15 1. our programming methodology reflects the following viewpoint. The above discussion about loop unrolling applies in an analogous way for that mode.11–1.4 Networks Two common network configurations are shown in Figures 1. Ω-network. For example.13.2) or higher order. 1.2. . For a large number of 0 sw 1 2 sw 3 sw sw sw sw 0 1 2 3 4 sw 5 sw sw 4 5 6 sw sw sw 7 6 7 Fig. There the context is vector processing as a method by which machines can concurrently compute multiple independent data. 1.2. we outline the considerations for multiple independent processors. these may be 4 → 4 (see Figure 4. in Chapter 4 we discuss the NEC SX-6 (Section 4.4) and Cray X1 (Section 4. Variants of Ω-networks are very commonly used in tightly coupled clusters and relatively modest sized multiprocessing systems. special vector registers are used to implement the loop unrolling and there is a lot of hardware support. the former Thinking Machines C-5 used a quadtree network and likewise the HP9000 we discuss in Section 4. Namely. In other flavors.3 Multiple processors and processes In the SIMD Section 3. To conclude this section.3) machines which use such log(N CP U s) stage networks for each board (node) to tightly couple multiple CPUs in a cache coherent memory image. Generally. instead of 2 → 2 switches as illustrated in Figure 1. we will return to loop unrolling and multiple data processing.2.

Between nodes.11. Message passing is effected by very low latency primitives (shmemput. processors. for l = 0.13. is more amenable to a larger number of CPUs. no memory coherency is available and data dependencies must be controlled by software. to set up a satisfactory work environment for yourself on the machines you will use. however. however.1) . which places processors on tightly coupled grid.12. 1. . T3-E machines used a three-dimensional grid with the faces connected to their opposite faces in a three-dimensional torus arrangement.4 and 4. cross-bar arrangements of this type can become unwieldy simply due to the large number of switches necessary and the complexity of wiring arrangements. . A two-dimensional illustration is shown in Figure 1. to pick an editor. but harder to illustrate in a plane image. A problem with this architecture was that the nodes were not very powerful. Furthermore. ar. image compiler support for parallelism is necessarily limited. n − 1 (1. As we will see in Sections 4. Between nodes. because the system does not have a coherent memory. cc. shmemget.3. In our view.). The network.16 BASIC ISSUES Straight through Cross-over Upper broadcast Lower broadcast Fig. The very popular Cray T3-D.1 Cache effects in FFT The point of this exercise is to get you started: to become familiar with certain Unix utilities tar. A great deal was learned from this network’s success. is extremely powerful and the success of the machine reflects this highly effective design. very tightly coupled nodes with say 16 or fewer processors can provide extremely high performance. such clusters will likely be the most popular architectures for supercomputers in the next few years. The transformation in the problem is n−1 yl = k=0 ω ±kl xk . Exercise 1. but lacks portability. . Ω-network switches from Figure 1. The generalization to three dimensions is not hard to imagine. Another approach. message passing on a sophisticated bus system is used. . etc. make. and to measure cache effects for an FFT. This mode has shown itself to be powerful and effective.

A three- with ω = e2πi/n equal to the nth root of unity. dimensional torus has six nearest neighbors instead of four. 5. using get download the tar file uebung1. operations/ second) vs.inf.NETWORKS 17 PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE Fig. 1. plot Mflops (million floating pt.4. in directory Chapter1/uebung1.a = library of modules for the FFT (make lib). What is to be done? From our anonymous ftp server http://www.ch/˜arbenz/book. 2. or plot by hand on graph paper.ethz. (b) cfftst = test program (make cfftst). Run this job on ALL MACHINES using (via qsub) the batch sumission script. Two-dimensional nearest neighbor connected torus. a makefile.1) is given by the sign argument in cfft2 and is a float.tar 1. problem size n. Interpret the results in light of Table 1. The sign in Equation (1. Execute make to generate: (a) cfftlib. A sufficient background for this computation is given in Section 2. 3. Un-tar this file into five source files.1. . Use your favorite plotter—gnuplot. and an NQS batch script (may need slight editing for different machines). 4. From the output on each. for example.13.

0 an−1. . matrices are stored in column-major order. . This is also true. an−1.. On distributed memory multicomputers. in C in row-major order.. This is achieved by either stacking the columns on top of each other or by appending row after row.  . the latter row-major order. We denote such a vector by x and think of it as a column vector. . Vectors are one-dimensional arrays of say n real or complex numbers x0 .2) .1) x =  . .  . on a shared memory multiprocessor computer. . .  . these numbers occupy n consecutive memory locations. xn−1 . The actual procedure depends on the programming language.m−1  a10 a11 . xn−1 On a sequential computer.1 Linear algebra Linear algebra is often the kernel of most numerical computations.. (2.  .   x0  x1    (2. It deals with vectors and matrices and simple operations like addition and multiplication on these objects. Results without causes are much more impressive. a1.1 .2 APPLICATIONS I am afraid that I rather give myself away when I explain. Arthur Conan Doyle (Sherlock Holmes. . There is no principal difference.. . . but for writing efficient programs one has to respect how matrices .m−1    A= . the primary issue is how to distribute vectors on the memory of the processors involved in the computation. an−1. . Matrices are two-dimensional arrays of the form   a00 a01 . . The former is called column-major. . . at least conceptually. In Fortran.m−1 The n · m real (complex) matrix elements aij are stored in n · m (respectively 2 · n · m if complex datatype is available) consecutive memory locations. The Stockbroker’s Clerk) 2. a0. x1 .

3)–(2.3) Other common operations we will deal with in this book are the scalar (inner.3.2 are listed some of the acronyms and conventions for the basic linear algebra subprograms discussed in this book.6). we will explicitly program in column-major order. y) = i=0 xi yi . The operation is one of the more basic.1). Notice that there is no such simple procedure for determining the memory location of an element of a sparse matrix. S D C Z U C Prefixes Real Double precision Complex Double complex Suffixes Transpose Hermitian conjugate are laid out.4) matrix–vector multiplication (Section 5. .6). and matrix–matrix multiplication (Section 3. we outline data descriptors to handle sparse matrices. the matrix element aij of the m × n matrix A is located i + j · m memory locations after a00 . In Tables 2.LINEAR ALGEBRA 19 Table 2.3). y = A x.4. that is. Therefore. in our C codes we will write a[i+j*m]. (2.6) sdot. or dot) product (Section 3.5. In Section 2.1 Basic linear algebra subprogram prefix/suffix conventions.1 and 2. (2. To be consistent with the libraries that we will use that are mostly written in Fortran.5) In Equations (2. albeit most important of these: y = αx + y. Thus. equality denotes assignment. In this and later chapters we deal with one of the simplest operations one wants to do with vectors and matrices: the so-called saxpy operation (2.6) (2. C = A B. (2. n−1 s = x · y = (x. the variable on the left side of the equality sign is assigned the result of the expression on the right side.

SYR. packed) system solves (forward/backward substitution): x ← A−1 x GER. Generate/apply plane rotation Generate/apply modified plane rotation Swap two vectors: x ↔ y Scale a vector: x ← αx Copy a vector: x ← y axpy operation: y ← y + αx Dot product: s ← x · y = x∗ y 2-norm: s ← x 2 1-norm: s ← x 1 Index of largest vector element: first i such that |xi | ≥ |xk | for all k Level 1 BLAS ROTG.20 APPLICATIONS Table 2. packed) matrix–vector multiply: y ← αAx + βy TRMV. HPMV Hermitian (banded. GERC Rank-1 updates: A ← αxy∗ + A HER. HPR. ROT ROTMG. TBMV.2 Summary of the basic linear algebra subroutines. SPMV Symmetric (banded. SYMM. SBMV. HEMM SYRK. SPR Hermitian/symmetric (packed) rank-1 updates: A ← αxx∗ + A HER2. HERK SYR2K. HER2K TRMM TRSM General/symmetric/Hermitian matrix–matrix multiply: C ← αAB + βC Symmetric/Hermitian rank-k update: C ← αAA∗ + βC Symmetric/Hermitian rank-k update: C ← αAB ∗ + α∗ BA∗ + βC Multiple triangular matrix–vector multiplies: B ← αAB Multiple triangular system solves: B ← αA−1 B . packed) matrix–vector multiply: y ← αAx + βy SEMV. SPR2 Hermitian/symmetric (packed) rank-2 updates: A ← αxy∗ + α∗ yx∗ + A Level 3 BLAS GEMM. packed) matrix–vector multiply: x ← Ax TRSV. SYR2. HBMV. TPMV Triangular (banded. HPR2. TPSV Triangular (banded. GERU. ROTM SWAP SCAL COPY AXPY DOT NRM2 ASUM I AMAX Level 2 BLAS GEMV. GBMV General (banded) matrix–vector multiply: y ← αAx + βy HEMV. TBSV.

and Level 3 BLAS which provided an easy route to optimizing LINPACK. other major software libraries such as the IMSL library [146] and NAG [112] rewrote portions of their existing codes and structured new routines to use these BLAS. Lawson. Soon afterward. for example.3) and somewhat larger kernels for numerical linear algebra were developed [40. In consequence of these observations.7) by Gaussian elimination with partial pivoting. • A prefix (or combination prefix) to indicate the datatype of the operands. implemented in assembly language if necessary. The first major package which used these BLAS kernels was LINPACK [38]. Kincaid. 41] to include matrix–vector operations. Vendors focused on Level 1. vector computers (e. distributed memory machines were in production and shared memory machines began to have significant numbers of processors. Soon. For example. we will illustrate several BLAS routines and considerations for their implementation on some machines. axpy (2. Further issues are the solution of least squares problems. . but incorporated the latest dense and banded linear algebra algorithms available. and QR factorization.3).LAPACK AND THE BLAS 21 An important topic of this and subsequent chapters is the solution of the system of linear equations Ax = b (2. in the late 1980s. It also used the Level 3 BLAS which were optimized by much vendor effort. Fortran compilers were by then optimizing vector operations as efficiently as hand coded Level 1 BLAS. 2. Subsequently. or isamax for the index of the maximum absolute element in an array of type single. Level 2. The first major package for linear algebra which used the Level 3 BLAS was LAPACK [4] and subsequently a scalable (to large numbers of processors) version was released as ScaLAPACK [12]. A further set of matrix–matrix operations was proposed [42] and soon standardized [39] to form a Level 2.2 LAPACK and the BLAS By 1976 it was clear that some standardization of basic computer operations on vectors was needed [92]. These so-called Level 1 BLAS consisted of vector operations and some attendant co-routines. and Krogh proposed a limited set of Basic Linear Algebra Subprograms (BLAS) to be optimized by hardware vendors.g. that would form the basis of comprehensive linear algebra packages [93]. By then it was already known that coding procedures that worked well on one machine might work very poorly on others [125]. Gram–Schmidt orthogonalization. then LAPACK. Additionally. In subsequent chapters. however. Hanson. Early in their development. LAPACK not only integrated pre-existing solvers and eigenvalue routines found in EISPACK [134] (which did not use the BLAS) and LINPACK (which used Level 1 BLAS). saxpy for single precision axpy operation. [125]) saw significant optimizations using the BLAS. Conventions for different BLAS are indicated by • A root operation. such machines were clustered together in tight networks (see Section 1.

Additionally. the Level 3 BLAS have a higher order floating point operations than memory accesses. then both the memory accesses and the floating point operations are of O(n2 ).1.y.2 give the prefix/suffix and root combinations for the BLAS.3 shows the crucial difference between the Level 3 BLAS and the rest. A call of one of the Level 1 BLAS thus gives rise to O(n) floating point operations and O(n) memory accesses. ddot. In contrast.4 gives some performance numbers for the five BLAS of Table 2.2. 2. dgemv.22 APPLICATIONS c u • A suffix if there is some qualifier. the numbers in Table 2. where both x and y are vectors of complex elements. cdotc or cdotu for conjugated or unconjugated complex dot product.3)–(2. Here.4 have been obtained by timing a loop inside of which the respective function is called many times.1 and 2. that perform the basic operations (2. for example. and dgemm. Notice that the timer has a resolution of only 1 µs! Therefore. n is the vector length.3. The Level 1 BLAS comprise basic vector operations. Tables 2. Read ddot daxpy dgemv dger dgemm Write Flops Flops/mem access 2n 2n 2n2 2n2 2n3 1 2/3 2 1 2n/3 2n 1 2n n n2 + n n n2 + 2n n2 n2 2n2 . The Level 2 BLAS comprise operations that involve matrices and vectors. we look at the rank-1 update routine dger. respectively. matrix–matrix multiplication costs O(n3 ) floating point operations while there are only O(n2 ) reads and writes. An overview on the number of memory accesses and floating point operations is given in Table 2. If the involved matrix is n × n.1 Typical performance numbers for the BLAS Let us look at typical representations of all three levels of the BLAS. The Mflop/s rates of the Level 1 BLAS ddot Table 2. daxpy.3.1) = i=0 xi yi .1) = i=0 n−1 xi yi .1. The most prominent operation of the Level 3 BLAS. Table 2.x.3 Number of memory references and floating point operations for vectors of length n. ¯ cdotu(n. The last column in Table 2.y. respectively: n−1 cdotc(n.x.6).
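The ratios of Table 2.3 can be read off from naive reference loops. The following sketch is not an optimized BLAS but only a counting aid: daxpy performs 2n floating point operations against roughly 3n memory accesses, whereas a triple-loop matrix–matrix multiply performs 2n^3 operations on only 3n^2 matrix elements.

/* daxpy: per element one multiply and one add (2n flops) against a
   read of x[i], a read of y[i], and a write of y[i] (3n accesses).   */
void ref_daxpy(int n, double alpha, const double *x, double *y)
{
   for (int i = 0; i < n; i++)
      y[i] += alpha * x[i];
}

/* C <- C + A*B with n x n row-major matrices: 2n^3 flops on only
   3n^2 distinct matrix elements, hence O(n) flops per memory access
   once blocks are reused from cache (the Level 3 effect).            */
void ref_dgemm(int n, const double *A, const double *B, double *C)
{
   for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++) {
         double aik = A[i*n + k];
         for (int j = 0; j < n; j++)
            C[i*n + j] += aik * B[k*n + j];
      }
}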

The Level 3 BLAS dgemm performs at a good fraction of the peak performance of the processor (4.LAPACK AND THE BLAS 23 Table 2. it computes a lower triangular matrix L and an upper triangular matrix U . n = 100 ddot daxpy dgemv dger dgemm 1480 1160 1370 670 2680 500 1820 1300 740 330 3470 2000 1900 1140 670 320 3720 10.8) where P is an m × m permutation matrix. we hope it will be clear why Level 2 and Level 3 BLAS are usually more efficient than the vector operations found in the Level 1 BLAS. it is considerably reduced. we will review the classical form for Gaussian elimination because there we will be deeper into hardware considerations wherein the classical representation will seem more appropriate. The high rates are for vectors that can be held in the on-chip cache of 512 MB. that is. 2n vs. The low 240 and 440 Mflop/s with the very long vectors are related to the memory bandwidth of about 1900 MB/s (cf.2.1 Classical Gaussian elimination Gaussian elimination is probably the most important algorithm in numerical linear algebra.4 GHz Pentium 4.2. We see from Table 2. The main code portion of an implementation of Gaussian elimination is found in Figure 2.3 that the ratio of computation to memory accesses increases with the problem size.8 Gflop/s). The algorithm is obviously well known [60] and we review it more than once due to its importance.1. 2. L is an m × m lower triangular matrix with ones on the diagonal and U is an m × n upper triangular matrix.2.000 440 240 — — — and daxpy quite precisely reflect the ratios of the memory accesses of the two routines.4. such that P A = LU. The Level 2 BLAS dgemv has about the same performance as daxpy if the matrix can be held in cache (n = 100). Table 1.1). Otherwise. in particular Section 3. This leads to a very low performance rate. a matrix with nonzero elements only on and above its diagonal.2. The algorithm factors an m × n matrix in . Given a rectangular m × n matrix A.2 Solving systems of equations with LAPACK 2.4 Some performance numbers for typical BLAS in Mflop/s for a 2. dger has a high volume of read and write operations. The performance increases with the problem size. while the number of floating point operations is limited.000. 3n. Here and in subsequent sections. In Chapter 3. (2. This ratio is analogous to a volume-to-surface area effect.

tmp2 = N-j-1. Previously computed elements lij and uij are not affected: Their elements in L may be permuted. we chose m = n + 1 = 6. /* Rank-one update trailing submatrix */ if (j < min(M. } } Fig. a34  ˆ  a44  ˆ a54 ˆ l10  l20  l30  l40 l50 1 l21 l31 l41 l51 1 1 1 (2. In the algorithm of Figure 2.&ONE). if (j < M) dscal_(&tmp.  1  u00        1 u01 u11 u02 u12 a22 ˆ a32 ˆ a42 ˆ a52 ˆ u03 u13 a23 ˆ a33 ˆ a43 ˆ a53 ˆ  u04 u14   a24  ˆ  ≡ L1 U1 = P1 A. min(m.&A[jp]. j++){ /* Find pivot and test for singularity */ tmp = M-j.&M. tmp = M-j-1.N)) { tmp = M-j-1.&A[j+1+j*M].&MDONE.1 + idamax_(&tmp.&M). zeros are introduced below the first element in the reduced matrix aij . /* Compute j-th column of L-factor */ s = 1.&ONE ).0){ /* Interchange actual with pivot row */ if (jp != j) dswap_(&N.&s.&tmp2. n) − 1 steps. the first two columns of the lower triangular factor L and the two rows of the upper triangular factor U are computed.&ONE. Column three of L is filled below its ˆ diagonal element. } else if (info == 0) info = j. 2. After completing the first two steps of Gaussian elimination. if (A[jp+j*M] != 0. For illustrating a typical step of the algorithm.&A[j+1+(j+1)*M]. dger_(&tmp.9) In the third step of Gaussian elimination.&A[j+j*M]. ipiv[j] = jp.&A[j].N).1. jp = j . the step number is j.0/A[j+j*M].&M.24 APPLICATIONS info = 0. however! . Gaussian elimination of an M × N matrix based on Level 2 BLAS as implemented in the LAPACK routine dgetrf. &A[j+(j+1)*M].1. j < min(M.&A[j+1+j*M]. for (j=0.&M).

that Pj is the identity matrix if jp = j. Let Pj be the m-by-m permutation matrix that swaps entries j and jp in a vector of length m. In the example in (2. The latter number is called block size.1.2(a). then the whole column is zero. indicated by U0 . the algorithm has to be blocked! 2. |ˆjp . Notice. That is.9). Therefore. To that end. and the first j rows of the upper triangular factor U . Applying P2 yields T P2 L1 P2 P2 U1 = P2 P1 A = P2 A.LAPACK AND THE BLAS 25 The j-th step of Gaussian elimination has three substeps: 1.4. ˆ a lij = aij /ˆjj . ˆ In the code fragment in Figure 2. Compute the jth column of L.2. Therefore. (2. After the perˆ mutation.2 Block Gaussian elimination The essential idea to get a blocked algorithm is in collecting a number of steps of Gaussian elimination. To use the high performance Level 3 BLAS routine dgemm. P2 L1 P2 differs from L1 only in swapped elements on columns 0 and 1 below row 1.2. P2 applied to U1 swaps two of the rows with elements aik . This is executed in the Level 1 BLAS dscal. Assume that we have arrived in the Gaussian elimination at a certain stage j > 0 as indicated in Figure 2.10) T Notice that P2 swaps row j = 2 with row jp ≥ 2. indicated by L0 . 2. applying Pj corresponds to a swap of two complete rows of A. P2 = P2 P1 . we have computed (up to row permutations) the first j columns of the lower triangular factor L.j | ≥ |ˆij | for all i ≥ j. We wish to eliminate ˆ ˆ the next panel of b columns of the reduced matrix A. k > j. Evidently. Update the remainder of the matrix. Also. please look at Figure 3. 3.15. As we see in Table 2. ˆ ˆ aik = aik − lij ajk .9) we have j = 2. ˆ i.1 the lower triangle of L without the unit diagonal and the upper triangle of U are stored in-place onto matrix A. This is found using function a idamax in Fig. This operation is a rank-one update that can be done with the Level 2 BLAS dger. ajj is (one of) the largest element(s) of the first column of the ˆ reduced matrix. say b. Find the first index jp of the largest element (in modulus) in pivot a column j. that is. from (2. Therefore. the elements of Lk below the diagonal can be stored where Uk has zero elements. split A in four . 2. if ajj = 0. the BLAS dger has quite a low performance. i > j.

ˆ A L P ˆ11 = 11 U11 . and is determined by LAPACK and is determined by hardware characteristics rather than problem size.3 shows the main loop in the Fortran subroutine dgetrf for factoring an M × N matrix. dgetrf. U12 = L−1 A12 . is available from the NETLIB [111] software repository. wherein A11 comprises the b × b block in the upper left corner of A. We use the notation of Figure 2. Let us examine the situation for Gaussian elimination. ˆ ˆ portions. by using matrix–matrix multiplication Level 3 BLAS routine dgemm. Factor the first panel of A by classical Gaussian elimination with column pivoting.2(a).26 (a) APPLICATIONS (b) U0 Â11 L0 Â21 Â22 Â12 L0 L21 Ã22 U11 L11 U0 U12 Fig. 3b2 floating point numbers. . Update the rest of the matrix by a rank-b update. that is. 108) in three substeps ˆ 1. Advantages of Level 3 BLAS over the Level 1 and Level 2 BLAS are the higher ratios of computations to memory accesses. n = bm. Block Gaussian elimination as implemented in the LAPACK routine.4. Then we investigate the jth block step. The sizes of the further three blocks become clear from Figure 2. This can be done 11 efficiently by a Level 3 BLAS as there are multiple right hand sides. 3. Block Gaussian elimination. we assume that n is divisible by b. 1 ≤ j ≤ m.15 and p.2.3 and 2. see Tables 2. (The rectangular portion of L0 is the portion of L0 on the left A ˆ of A. L21 A21 Apply the row interchanges P on the “rectangular portion” of L0 and ˆ A12 ˆ22 . Furthermore. The block Gaussian elimination step now proceeds similarly to the classical algorithm (see Figure 3. Compute U12 by forward substitution. 2. For simplicity let us assume that the cache can hold three b × b blocks. ˆ A22 = A22 − L21 U12 . The motivation for the blocking Gaussian elimination was that blocking makes possible using the higher performance Level 3 BLAS. Figure 2. The block size is NB. (a) Before and (b) after an intermediate block elimination step.2.) ˆ 2.

LDA) END IF END IF 20 CONTINUE Fig.N) THEN * $ * $ $ Apply interchanges to columns J+JB:N. $ M-J-JB+1. A(J. The in-place factorization A11 = L11 U11 costs b2 reads and writes. A(1.J+JB). ˆ 1. N-J-JB+1. 1) IF(J+JB. Keeping the required block of U12 in cache. L21 and the panel are read. $ A(J+JB. J+JB-1. IINFO) Adjust INFO and the pivot indices. CALL DTRSM(’Left’. $ ONE.3. A(J.M) THEN 27 * $ 10 * * Update trailing submatrix. J.0 . ONE. ’Lower’.N). after the update.LE. A(J+JB.J+JB). CALL DLASWP(J-1.J). NB JB = MIN(MIN(M.J+JB-1) IPIV(I) = J . MIN(M.J). IPIV(J). ’No transpose’. LDA.LE. A(J. the .NB) * * Factor diagonal and subdiagonal blocks and test for exact singularity.AND.GT. the backward substitution L21 ← A21 U11 needs the reading and storing of b · (m − j)b floating point numbers. which is functionally equivalent to dgefa from LINPACK. -ONE. 1) Compute block row of U. JB.0) INFO = IINFO + J .J+JB).1 + IPIV(I) CONTINUE Apply interchanges to columns 1:J-1. ’No transpose’.N)-J+1.EQ.J).1 DO 10 I = J. J+JB-1. ˆ 3. The main loop in the LAPACK routine dgetrf. N-J-JB+1. A(J. LDA. The rank-b update can be made panel by panel. LDA. Keeping L11 /U11 in cache. IPIV. CALL DGETF2(M-J+1. LDA) IF(J+JB. CALL DGEMM(’No transpose’. LDA. LDA. MIN(M. LDA. The same holds for the forward substitution U12 ← L−1 A12 . IPIV. CALL DLASWP(N-J-JB+1.LAPACK AND THE BLAS DO 20 J=1. 11 4. ’Unit’. JB. −1 ˆ 2. J. IINFO. 2. JB. A.J+JB). IF(INFO.

16 .13 0. n = 500 b 1 2 4 8 16 32 64 Time (s) Speed 0. The performance depends on the block size.78 0.44 1. However.89 0.5 Times (s) and speed in Mflop/s of dgesv on a P 4 (2.5 we have collected some timings of the LAPACK routine dgesv that comprises a full system solve of order n including factorization.10 308 463 641 757 833 926 833 n = 1000 b 1 2 4 8 16 32 64 Time (s) Speed 2. For example. This raw 2 formula seems to show that b should be chosen as big as possible. The default block size on the Pentium 4 is 64. A factor of three or more improvement in performance over the classical Gaussian elimination (see p. b Thus.11 0.10 0. This assumption gives an upper bound for b.18 0. In Table 2. this consumes 2b · (m − j)b read and 2b · (m − j) write operations for each of the m − j panels. the block algorithm even more clearly shows its superiority. we have derived it under the assumption that 3b2 floating point numbers can be stored in cache.4 GHz.16 8 6.74 64 10. There is little improvement when b ≥ 16.06 4 8.27 0. However. Neglecting the access of U12 .17 1. 2.57 2 11. 108) can be expected if the block size is chosen properly.42 32 7.28 APPLICATIONS panel is stored in memory.80 0. forward and backward substitution.3 Linear algebra: sparse matrices. 1 GB). In Chapter 4.96 307 463 617 749 833 855 694 n = 2000 b Time (s) Speed 322 482 654 792 831 689 525 1 16. A may be a general linear Table 2. Our numbers show that it is indeed wise to block Gaussian elimination. iterative methods In many situations. either an explicit representation of matrix A or its factorization is extremely large and thus awkward to handle. or such explicit representations do not exist at all. we will see that on shared memory machines. we get nowhere close to the performance of dgemm.09 0. Summing all the contributions of items (1)–(4) and taking n = bm into account gives m 2b2 (m − j) + 3b2 (m − j)2 = j=1 n3 + O(n2 ). the blocked algorithm executes 3 b flops per memory access.08 0.73 16 6.
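The timings of Table 2.5 refer to the driver routine dgesv, which factors A and solves for the right-hand sides in a single call. A minimal sketch of its use from C follows, again assuming the Fortran-style binding with a trailing underscore and column-major storage; the argument names follow the LAPACK documentation.

#include <stdio.h>

/* Assumed Fortran binding of the LAPACK driver: factor the n x n,
   column-major matrix A and solve A*X = B for nrhs right-hand sides. */
void dgesv_(int *n, int *nrhs, double *a, int *lda, int *ipiv,
            double *b, int *ldb, int *info);

int main(void)
{
   int    n = 3, nrhs = 1, info;
   int    ipiv[3];
   /* A stored column by column: A = [4 1 0; 1 4 1; 0 1 4] */
   double A[9] = {4.0, 1.0, 0.0,  1.0, 4.0, 1.0,  0.0, 1.0, 4.0};
   double b[3] = {1.0, 2.0, 3.0};   /* overwritten by the solution x */

   dgesv_(&n, &nrhs, A, &n, ipiv, b, &n, &info);
   if (info != 0)
      printf("dgesv failed, info = %d\n", info);
   else
      printf("x = [%g %g %g]\n", b[0], b[1], b[2]);
   return 0;
}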

when no explicit representation of A exists. where e(k) is called the error. then it makes no sense to solve the system of equations more accurately than the error that is introduced by the discretization. . it is possible to discontinue the iterations if one considers one of the x(k) accurate enough. then the factorization A = LU in a direct solution may generate nonzero numbers in the triangular factors L and U in places that are zero in the original matrix A. In these situations. often the determination of the elements of A is not highly accurate. x(1) .11) would be foolish. a differential operator. . Multiplying (2. iterative methods are required to solve (2. is constructed that converges to the solution of (2.11). we examine the so-called stationary iterative and conjugate-gradient methods. so a highly accurate solution of (2.3. x(2) . .11) for some initial vector x(0) . This so-called fill-in can be so severe that storing the factors L and U needs so much more memory than A.11) are used. Determining the error e(k) would require knowing x. whereas the residual r(k) is easily computed using known information once the approximation x(k) is available. x(2) . 2. A second reason for choosing an iterative solution method is if it is not necessary to have a highly accurate solution. . Then x = x(k) + e(k) .1 Stationary iterations Assume x(k) is an approximation to the solution x of (2. most of A’s elements are zero. a sequence x(0) . If A is very large and sparse. ITERATIVE METHODS 29 operator whose only definition is its effect on an explicit representation of a vector x.LINEAR ALGEBRA: SPARSE MATRICES. say. In such cases.12) by A gives Ae(k) = Ax − Ax(k) = b − Ax(k) =: r(k) . so a direct method becomes very expensive or may even be impossible for lack of memory. Likewise. In iterative methods.13) (2. (2. From (2. converges to the solution. . x(1) . For example. that is.13) we see that x = x(k) + A−1 r(k) . In the next section. if A is a discretization of. As the sequence x(0) . Or.12) and (2. Direct methods such as Gaussian elimination may be suitable to solve problems of size n ≤ 105 .11). Ax = b. iterative methods for the solution of Ax = b (2.14) . iterative methods are the only option. .12) Vector r(k) is called the residual at x(k) . (2.

The error changes in each iteration according to e(k+1) = x − x(k+1) = x − x(k) − M −1 r(k) . the Jacobi iteration is obtained if one sets M = D in (2. the largest eigenvalue of G in absolute value.15) converges for any x(0) if and only if ρ(G) < 1. Another way to write (2. A = L+D+U . so (2.15). if A in (2. Matrix M is called a preconditioner.11). L is strictly lower triangular.15). the eigenvalues of the iteration matrix are decisive for the convergence of the stationary iteration (2. The matrix G = I − M −1 A is called the iteration matrix . x(k+1) = x(k) + D−1 r(k) = −D−1 (L + U )x(k) + D−1 b. The stationary iteration (2.3.j=i   n 1  (k) (k) (k) bi − aij xj  = xi + aij xj  .2 Jacobi iteration By splitting A into pieces.30 APPLICATIONS Of course. N = M − A. Component-wise we have  1  (k+1) bi − = xi aii  (2. The residuals satisfy r(k+1) = (I − AM −1 )r(k) . Let ρ(G) denote the spectral radius of G.17) The matrix I − AM −1 is similar to G = I − M −1 A which implies that it has the same eigenvalues as G.15) gives a convergent sequence {x(k) }.14) is only of theoretical interest. Nevertheless. that is. = e(k) − M −1 Ae(k) = (I − M −1 A)e(k) .14) is replaced by a matrix. where (1) M is a good approximation of A and (2) the system M z = r can be solved cheaply.18) n j=1.16) are called stationary iterations because the rules to compute x(k+1) from x(k) do not depend on k. 2.16) Iterations (2. determining e(k) = A−1 r(k) is as difficult as solving the original (2.15) is M x(k+1) = N x(k) + b. aii j=1 . Then the following statement holds [60]. and U is strictly upper triangular. = e(k) − M −1 (b − Ax(k) − b + Ax). say M . (2. In fact. where D is diagonal. then one may expect that x(k+1) = x(k) + M −1 r(k) .15) and (2. (2. Let us now look at the most widely employed stationary iterations. r(k) = b − Ax(k) (2.
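A single Jacobi sweep (2.18) for a dense n × n matrix might be sketched as follows; x holds x^(k), xnew receives x^(k+1), and all diagonal entries a_ii are assumed to be nonzero.

/* One Jacobi iteration x_new = D^{-1}(b - (L+U)x) for a dense,
   row-major matrix a[i*n+j]; assumes a[i*n+i] != 0.              */
void jacobi_sweep(int n, const double *a, const double *b,
                  const double *x, double *xnew)
{
   for (int i = 0; i < n; i++) {
      double s = b[i];
      for (int j = 0; j < n; j++)
         if (j != i)
            s -= a[i*n + j] * x[j];
      xnew[i] = s / a[i*n + i];
   }
}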

A block Jacobi iteration is obtained if the diagonal matrix D in (2.18) is replaced by a block diagonal matrix.3. SOR can be symmetrized to become Symmetric Successive . and U is the strictly upper triangular portion of A. if aii > j=i |aij |. For a proof of this and similar statements. that is. j>i (2. SOR becomes GS iteration if ω = 1. this is Mω x(k+1) = Nω x(k) + b. In matrix notation.LINEAR ALGEBRA: SPARSE MATRICES. Usually ω ≥ 1.15) with D and L as in the previous section. the iteration is (D + L)x(k+1) = −U x(k) + b. Component-wise we have  1  bi − = aii  (k+1) aij xj j<i (k) aij xj  . if L = U T . ITERATIVE METHODS 31 Not surprisingly.4 Successive and symmetric successive overrelaxation Successive overrelaxation (SOR) is a modification of the GS iteration and has an adjustable parameter ω. If A is symmetric. see [137]. L is the strictly lower triangular. for all i. that is. ω (2. ω Nω = 1−ω D − U. A block GS iteration is obtained if the diagonal matrix D in (2. Component-wise it is defined by  ω  bi − = aii  (k+1) aij xj j<i (k) aij xi  j>i (k+1) xi − + (1 − ω)xi . Mω = 1 D + L. whence the term over relaxation.20) where again D is the diagonal. (k) 0 < ω < 2. the Jacobi iteration converges if A is strictly diagonally dominant.19) (k+1) xi − The GS iteration converges if A is symmetric and positive definite [60]. 2.19) is replaced by a block diagonal matrix.3. 2. Thus.3 Gauss–Seidel (GS) iteration The GS iteration is obtained if we set M = D + L in (2.
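One forward SOR sweep might be coded as in the sketch below; with omega = 1.0 it reduces to the GS update (2.19). The update is done in place, which is exactly the recursive dependence that makes GS and SOR harder to parallelize than Jacobi.

/* One forward SOR sweep for a dense, row-major matrix; x is updated
   in place, so the new values x[j], j < i, are already used in row i. */
void sor_sweep(int n, const double *a, const double *b,
               double *x, double omega)
{
   for (int i = 0; i < n; i++) {
      double s = b[i];
      for (int j = 0; j < n; j++)
         if (j != i)
            s -= a[i*n + j] * x[j];
      x[i] = (1.0 - omega) * x[i] + omega * s / a[i*n + i];
   }
}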

To that end. Equation (2.22) Combining (2.3. The algorithm used for Example 2. ω As Mω x(k+ 2 ) is known from (2.3. (D + L)D−1 (D + LT )x(k+1) = LD−1 LT x(k) + b.2 Let A be the Poisson matrix for an n × n grid. GS iterations reduce the number of iterations to achieve convergence roughly by a factor of 2 compared . The diagonal blocks are themselves tridiagonal with diagonal elements 4 and offdiagonal elements −1. 253ff. see [137] or [113. 1 1 (2. (2. The off-diagonal blocks are minus the identity. p. A has a block-tridiagonal structure where the blocks have order n.22).3.23).21) (2.21) and (2.].6 gives the number of iteration steps needed to reduce the residual by the factor 106 by different stationary iteration methods and two problem sizes. Remark 2. A is symmetric positive definite and has the order n2 .22) yields ω ω T Mω D−1 Mω x(k+1) = N T D−1 Nω x(k) + b. T Mω x(k+1) = 1 1 1 2−ω Dx(k+ 2 ) − Mω x(k+ 2 ) + b. Test numbers for this example show typical behavior.24) (S)SOR converges if A is symmetric positive definite (spd) and 0 < ω < 2.2 takes the form given in Figure 2.20) is defined as the first half-step Mω x(k+ 2 ) = Nω x(k) + b. Example 2.4.1 With the definitions in (2. These savings are called the Conrad–Wallach trick. Ortega [113] gives a different but enlightening presentation.22). of an iteration rule that is complemented by a backward SOR step T T Mω x(k+1) = Nω x(k+ 2 ) + b.32 APPLICATIONS Overrelaxation (SSOR).21) we can omit the multiplication by L in (2. 2−ω 2−ω ω (2.23) The symmetric GS iteration is obtained if ω = 1 is set in the SSOR iteration (2.21). Table 2.20) we can rewrite (2. A similar simplification can be made to (2.

rk+1 = b − Axk+1 . ρ0 = r0 . k = 0. we conclude that stationary iterations are not usually viable solution methods for really large problems. the iteration count using the Jacobi or GS iterations grows by the same factor. 2. Notice that we have listed only iteration counts in Table 2. Except for the Jacobi iteration.6 and not execution times. Another typical behavior of stationary iterations is the growth of iteration counts as the problem size increases. with the Jacobi method. these linear system solutions are the most time-consuming part of the algorithm.6 Iteration steps for solving the Poisson equation on a 31×31 and on a 63×63 grid with a relative residual accuracy of 10−6 . Set r0 = b − Ax0 . xk+1 = xk + zk . the solutions consist of solving one (or several as with SSOR) triangular systems of equations. Solver Jacobi Block Jacobi Gauss–Seidel Block Gauss–Seidel SSOR (ω = 1.LINEAR ALGEBRA: SPARSE MATRICES. When the problem size grows by a factor 4.8) n = 312 2157 1093 1085 547 85 61 n = 632 7787 3943 3905 1959 238 132 Choose initial vector x0 and convergence tolerance τ . it is often difficult or impossible to determine the optimal value of the relaxation parameter ω. The growth in iteration counts is not so large when using SSOR methods. k = k + 1. while ρk > τ ρ0 do Solve M zk = rk . .4. ρk = rk .8) Block SSOR (ω = 1. Each iteration of a stationary iteration consists of at least one multiplication with a portion of the system matrix A and the same number of solutions with the preconditioner M . So does blocking of Jacobi and GS iterations. Usually. Nevertheless. However. ITERATIVE METHODS 33 Table 2. endwhile Fig. SSOR further reduces the number of iterations. Stationary iteration for solving Ax = b with preconditioner M . for difficult problems.

k−1 = x(0) + j=0 Gj z(0) . Table 2. For a more comprehensive discussion of Krylov subspace methods. the Gram–Schmidt orthogonalization procedure is called Lanczos’ algorithm if G is symmetric and positive definite. the kth iterate x(k) shifted by x(0) is a particular element of the Krylov subspace Kk (G. Gk−1 z(0) } become a poorly conditioned basis for the Krylov subspace Kk (G. the approximations x(k) are found by some criterion applied to all vectors in Kk (G.26) . z(0) ) := span{z(0) . 30.15) and (2.3. Vectors z(j) are called preconditioned residuals. Gz(0) . z(0) ). and Arnoldi’s algorithm otherwise. the reader is referred to [36]. The preconditioned conjugate gradient method determines x(k) such that the residual r(k) is orthogonal to Kk (G. a considerable amount of work in Krylov subspace methods is devoted to the construction of a good basis of Kk (G. In Krylov subspace methods. z(0) ). . z(0) ). . Therefore. Gk−1 z(0) }.17) that x(k) = x(0) + M −1 r(k−1) + M −1 r(k−2) + · · · + M −1 r0 . z(0) ). z(0) ) which is defined as the linear space spanned by these powers of G applied to z(0) : Kk (G. Here. together with the properties of the system matrix required for their applicability. x(k) is chosen to minimize z(k) 2 . Thus.25) Here.3. In GMRES and PCG this is done by a modified Gram–Schmidt orthogonalization procedure [63]. (2. In this book. . Gz(0) . (2. In the context of Krylov subspaces.34 APPLICATIONS 2.6 The generalized minimal residual method (GMRES) The GMRES is often the algorithm of choice for solving the equation Ax = b. the preconditioned equation M −1 Ax = M −1 b. we consider only the two most important Krylov subspace methods: the Generalized Minimal RESidual method (GMRES) for solving arbitrary (nonsingular) systems and the Preconditioned Conjugate Gradient (PCG) method for solving symmetric positive definite systems of equations. .7 lists the most important of such methods. . If the vectors Gj z(0) become closely aligned with an eigenvector corresponding to the absolute largest eigenvalue of G. z(j) = M −1 r(j) . 2.5 Krylov subspace methods It can be easily deduced from (2. =x (0) +M −1 k−1 j=0 (I − AM −1 )j r(0) . . In GMRES. G = I − M −1 A is the iteration matrix defined on p. we consider a variant. the set {z(0) . . .

q2 . . for k = 1. where r0 = b − Ax0 for some initial vector x0 .k (2. . The algorithm proceeds recursively in that computation of the orthogonal basis of Kk+1 (z0 .k h3. z0 = M −1 r0 . M −1 A). . . hjk = qT yk .k + hk+1. by k yk = yk − j=1 qj hjk . qk ]. M −1 A). . 63. . h1. ITERATIVE METHODS 35 Table 2.7 Some Krylov subspace methods for Ax = b with nonsingular A. qk .2 ··· ··· ··· . qk have already been computed. q2 . . (2. • Construct a new vector yk from yk . . .29) Hk+1. It can be shown that yk is linear combination of (M −1 A)k qk and of q1 . On the left is its name and on the right the matrix properties it is designed for [11. . . qk } is constructed from the Krylov subspace Kk (z0 . . Bi-CGSTAB. SYMMLQ QMRS Matrix type Arbitrary Arbitrary Symmetric positive definite Symmetric J-symmetric where the preconditioner M is in some sense a good approximation of A and systems of the form M z = r can be solved much more easily than Ax = b. M −1 A) = Kk (z0 .LINEAR ALGEBRA: SPARSE MATRICES. . . 2. 128]. k can be collected in the form M −1 AQk = Qk+1 Hk+1. . hk+1. M −1 A) makes use of the already computed orthogonal basis of Kk (z0 . . . . j (2.. . Notice that Kk (z0 .k     . qk+1 = yk /hk+1. . .   (2. To see how this works.27) and (2. Then. QMR CG MINRES. Here. Algorithm GMRES Bi-CG. . . . We get qk+1 in the following way: • Compute the auxiliary vector yk = M −1 Aqk .k h2. q2 . hk+1. .k = yk 2.k . qk . G).28) Let Qk = [q1 . In GMRES. q2 . assume q1 . an orthogonal basis {q1 . .27) • Normalize yk to make qk+1 . .k .k qk+1 eT k with  h11 h21   =   h12 h22 h3. . .k = Qk Hk. . orthogonal to q1 . . Normalized vector q1 is chosen by q1 = z0 / z0 2 .28) for j = 1.
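One step of the preconditioned Arnoldi process (2.27)–(2.29) with modified Gram–Schmidt might be sketched as follows. The helpers matvec and precsolve (y = Ax and the solution of Mz = r) are assumptions of this sketch; the orthonormal basis vectors q_1, ..., q_k are stored contiguously, one after the other, in the array Q, and the new Hessenberg column goes into h.

#include <math.h>

/* Assumed helpers. */
void matvec(int n, const double *x, double *y);     /* y = A x      */
void precsolve(int n, const double *r, double *z);  /* solve M z = r */

/* One Arnoldi step with modified Gram-Schmidt: given q_1..q_k stored
   in Q[0..k*n-1] (each vector of length n), compute the Hessenberg
   entries h[0..k] = h_{1,k}..h_{k+1,k} and the next vector q_{k+1}
   in Q[k*n..(k+1)*n-1].  w and y are work arrays of length n.        */
double arnoldi_step(int n, int k, double *Q, double *h,
                    double *w, double *y)
{
   matvec(n, &Q[(k-1)*n], w);       /* w = A q_k           */
   precsolve(n, w, y);              /* y = M^{-1} A q_k    */

   for (int j = 0; j < k; j++) {    /* orthogonalize against q_1..q_k */
      double hjk = 0.0;
      for (int i = 0; i < n; i++) hjk += Q[j*n + i] * y[i];
      h[j] = hjk;
      for (int i = 0; i < n; i++) y[i] -= hjk * Q[j*n + i];
   }

   double nrm = 0.0;                /* h_{k+1,k} = ||y||_2 */
   for (int i = 0; i < n; i++) nrm += y[i]*y[i];
   nrm = sqrt(nrm);
   h[k] = nrm;

   if (nrm > 0.0)                   /* q_{k+1} = y / h_{k+1,k} */
      for (int i = 0; i < n; i++) Q[k*n + i] = y[i] / nrm;
   return nrm;
}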

Notice that the application of Givens rotations set to zero the nonzeros below the diagonal of Hk.k yk (2. We give a possible implementation of GMRES in Figure 2.k and Hk. b 2 Qk+1 e1 − Qk+1 Hk+1. For example. y ≡ xT M y. Hk+1. Then M −1 AQk = Qk Hk. it is possible to change the preconditioner every iteration.k−1 . . If hk+1..k . m + O(1) vectors of length n have to be stored because all the vectors qj are needed to compute xk from yk . the approximate solution xm is used as the initial approximation x0 of a completely new GMRES run. Independence of the GMRES runs gives additional freedom. say m. Unit vector ek is the last row of the k × k identity matrix. In addition to the matrices A and M . In the GMRES(m) algorithm.k can be computed in O(k) floating point operations from the QR factorization of Hk. 2. The GMRES method is very memory consumptive.. see [36]. 2. 2. The absolute value |sm | of the last component of s equals the norm of the preconditioned residual. They are upper triangular matrices with an additional nonzero first lower off-diagonal. it is easy to test for convergence at each iteration step [11]. (2. We have M −1 (b − Axk ) 2 = = = b 2 q1 − M −1 AQk yk b 2 e1 − Hk+1..26).k = yk 2 is very small we may declare the iterations to have converged. Thus.m denotes the first m elements of s.k and xk is in fact the solution x of (2.k is obtained by deleting the last row from Hk+1. Furthermore. the approximate solution xk = Qk yk is determined such that M −1 b − M −1 Axk 2 becomes minimal. then M −1 A appearing in the preconditioned system (2.k yk = b 2 e1 . the computation of its QR factorization only costs k Givens rotations. A popular way to limit the memory consumption of GMRES is to restart the algorithm after a certain number.36 APPLICATIONS Hk.k is a Hessenberg matrix.3. In general.. If one checks this norm as a convergence criterion. the QR factorization of Hk+1.26) is symmetric positive and definite with respect to the M -inner product x.31) Since Hk+1. of iteration steps.k yk 2.k .5.k are called Hessenberg matrices [152].30) where the last equality holds since Qk+1 has orthonormal columns. yk is obtained by solving the small k + 1 × k least squares problem Hk+1. Vector s1.7 The conjugate gradient (CG) method If the system matrix A and the preconditioner M are both symmetric and positive definite.
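A single Givens rotation might be generated and applied as in the following sketch: the rotation annihilates h_{k+1,k} against h_{k,k} and is applied to the two affected entries of the right-hand side of the small least squares problem (2.31). In a complete GMRES implementation the previously generated rotations are first applied to the new Hessenberg column, as in Figure 2.5; the degenerate case r = 0 is not treated here.

#include <math.h>

/* Generate the rotation (c,s) that zeros h_{k+1,k}, and apply it to
   the Hessenberg entries and to the entries s_k, s_{k+1} of the
   right-hand side of the small least squares problem.               */
void apply_givens(double *hkk, double *hk1k, double *sk, double *sk1)
{
   double r = sqrt((*hkk) * (*hkk) + (*hk1k) * (*hk1k));
   double c = *hkk  / r;
   double s = *hk1k / r;

   *hkk  = r;       /* rotated diagonal entry         */
   *hk1k = 0.0;     /* subdiagonal entry annihilated  */

   double t = c * (*sk) + s * (*sk1);
   *sk1     = -s * (*sk) + c * (*sk1);
   *sk      = t;
}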

. tridiagonal. In contrast. for k = 1. Apply J1 . s = Jk s end Solve the upper triangular system Hm. r0 = b − Ax0 . xm = x0 + [q1 . . s = z0 2 e1 . while 1 do Compute preconditioned residual z0 from M z0 = r0 . qm ] t. . = xT AM −1 M y = (M −1 Ax)T M y = M −1 Ax. then the Hessenberg matrix Hk.k . In fact. 2. ϕ has its global minimum . first suggested by Hestenes and Stiefel [72]. xk is improved along a search direction. . . we have x.5.k = yk 2 . . Generate and apply a Givens rotation Jk acting on the last two rows of Hk+1.26). . M −1 Ay M = xT M (M −1 Ay) = xT Ay.k ..k to annihilate hk+1. . . qk . the Lanczos algorithm constructs an (M -)orthonormal basis of a Krylov subspace where the generating matrix is symmetric.k . If we execute the GMRES algorithm using this M -inner product. r0 = rm end Fig. . y. q1 = z0 / z0 . The preconditioned GMRES(m) algorithm. This implies that yk in (2.LINEAR ALGEBRA: SPARSE MATRICES. .27) has to be explicitly made orthogonal only to qk and qk−1 to become orthogonal to all previous basis vectors q1 . ρ0 = r0 . . ITERATIVE METHODS 37 Choose an initial guess x0 .k becomes symmetric. for the expansion of the sequence of Krylov subspaces only three basis vectors have to be stored in memory at a time. k do hjk = qT yk . that is. m do Solve M yk = Aqk . yk = yk − qj hjk j end qk+1 = yk /hk+1.m . y M. hk+1. . . . targets directly improving approximations xk of the solution x of (2. . 2.m t = s1. q2 . This is the Lanczos algorithm. Like GMRES. Jk−1 on the last column of Hk+1. for all x. pk+1 . . . rm = b − Axm if rm 2 < · ρ0 then Return with x = xm as the approximate solution end x0 = xm . So. . such that the functional ϕ = 1 xT Ax − xT b becomes minimal 2 at the new approximation xk+1 = xk + αk pk+1 . the CG method. for j = 1. q2 . .

Example 2. The preconditioned conjugate gradient (PCG) algorithm is shown in Figure 2.3. The numbers of iteration steps have been reduced by at least a factor of 10. However.6. are computed. at x. In the course of the algorithm. In general. Both GMRES and CG have a finite termination property. . Set r0 = b − Ax0 and solve M z0 = r0 . this is of little practical value since to be useful the number of iterations needed for convergence must be much lower than n.8 lists the number of iteration steps needed to reduce the residual by the factor ε = 10−6 .6. αk = ρk−1 /pT qk . respectively. end Fig.2 but now use the conjugate gradient method as our solver preconditioned by one step of a stationary iteration. the order of the system of equations. at step n. p1 = z0 . k xk = xk−1 + αk pk . The iteration is usually stopped if the norm of the actual residual rk has dropped below a small fraction ε of the initial residual r0 . An elaboration of this relationship is given in Golub and Van Loan [60]. Set ρ0 = zT r0 . The convergence rate of the CG algorithm is given by x − xk ≤ x − x0 √ κ−1 √ κ+1 k A A . where κ = M −1 A A−1 M is the condition number of M −1 A. 2. Solve M zk = rk . k βk = ρk /ρk−1 . κ is close to unity only if M is a good preconditioner for A. . 0 for k = 1. That is. . if rk < ε r0 then exit. the algorithms will terminate. rk = rk−1 − αk qk . Lanczos and CG algorithms are closely related. Because the work per iteration step is not much larger with PCG than with stationary iterations.3 We consider the same problem as in Example 2. The preconditioned conjugate gradient algorithm. .38 APPLICATIONS Choose x0 and a convergence tolerance ε.3. the execution times are similarly reduced by large factors. ρk = zT rk . Table 2. 2. residual and preconditioned residual vectors rk = b − Axk and zk = M −1 rk . do qk = Apk . pk+1 = zk + βk pk .
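A compact sketch of the PCG loop of Figure 2.6 is given below; matvec and precsolve (q = Ap and the solution of Mz = r) are assumptions of this sketch, and no attempt is made to use optimized BLAS for the vector operations.

#include <math.h>
#include <stdlib.h>

void matvec(int n, const double *x, double *y);     /* y = A x, assumed  */
void precsolve(int n, const double *r, double *z);  /* M z = r, assumed  */

static double dot(int n, const double *x, const double *y)
{
   double s = 0.0;
   for (int i = 0; i < n; i++) s += x[i]*y[i];
   return s;
}

/* Preconditioned conjugate gradients as in Figure 2.6; x contains the
   initial guess on entry and the approximate solution on return.      */
int pcg(int n, const double *b, double *x, double eps, int maxit)
{
   double *r = malloc(n*sizeof(double)), *z = malloc(n*sizeof(double));
   double *p = malloc(n*sizeof(double)), *q = malloc(n*sizeof(double));

   matvec(n, x, r);                               /* r0 = b - A x0 */
   for (int i = 0; i < n; i++) r[i] = b[i] - r[i];
   precsolve(n, r, z);                            /* M z0 = r0     */
   for (int i = 0; i < n; i++) p[i] = z[i];
   double rho = dot(n, z, r), res0 = sqrt(dot(n, r, r));

   int k;
   for (k = 1; k <= maxit; k++) {
      matvec(n, p, q);                            /* q_k = A p_k   */
      double alpha = rho / dot(n, p, q);
      for (int i = 0; i < n; i++) { x[i] += alpha*p[i]; r[i] -= alpha*q[i]; }
      if (sqrt(dot(n, r, r)) < eps*res0) break;   /* ||r_k|| < eps ||r_0|| */
      precsolve(n, r, z);                         /* M z_k = r_k   */
      double rho_new = dot(n, z, r);
      double beta = rho_new / rho;  rho = rho_new;
      for (int i = 0; i < n; i++) p[i] = z[i] + beta*p[i];
   }
   free(r); free(z); free(p); free(q);
   return k;                                      /* iterations used */
}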

also of length nnz. vectors are distributed in blocks. Together with the list val there is an index list col ind. It is the basic storage mechanism used in SPARSKIT [129] or PETSc [9] where it is called generalized sparse AIJ format.]. while col ptr[n] points to the memory location right . In the CSR format. particularly when A is structurally symmetric (i. We assume that the sparse matrices are distributed block row-wise over the processors. Here.9 The sparse matrix vector product There are a number of ways sparse matrices can be stored in memory. or Yale storage) format. however. in this section we consider parallelizing these three crucial operations. if aij = 0 → aji = 0) is the BLSMP storage from Kent Smith [95]. ITERATIVE METHODS 39 Table 2. List element col ind[i] holds the column index of the matrix element stored in val[i]. Thus. While we fix the way we store our data. PCG with preconditioner as indicated. List item col ptr[j] points to the location in val where the first nonzero matrix element of row j is stored.8 Parallelization The most time consuming portions of PCG and GMRES are the matrix– vector multiplications by system matrix A and solving linear systems with the preconditioner M . we only consider the popular compressed sparse row (CSR.8) Block SSOR (ω = 1.8 Iteration steps for solving the Poisson equation on a 31 × 31 and on a 63 × 63 grid with an relative residual accuracy of 10−6 . Accordingly. We will see that this numbering can strongly affect the parallelizability of the crucial operations in PCG and GMRES. Computations of inner products and vector norms can become expensive. when vectors are distributed over memories with weak connections. the nonzero elements of an n × m matrix are stored row-wise in a list val of length nnz equal to the number of nonzero elements of the matrix. 2.LINEAR ALGEBRA: SPARSE MATRICES. Another convenient format. For an exhaustive overview on sparse storage formats see [73. p.e.3.3. we do not fix in advance the numbering of rows and columns in matrices. Preconditioner Jacobi Block Jacobi Symmetric Gauss–Seidel Symmetric block Gauss–Seidel SSOR (ω = 1. 430ff.8) n = 312 76 57 33 22 18 15 n = 632 149 110 58 39 26 21 2. and another index list col ptr of length n + 1.

Parallelizing the matrix–vector product is straightforward. Indirect addressing can be very slow due to the irregular memory access pattern. then it is only necessary to store either the upper or the lower triangle. 2.2 where the x[col ptr[j]] fetches are called gather operations. the outer loop is parallelized with a compiler directive just as in the dense matrix–vector product. 7. col ptr = [1. For illustration. 7.0. for example. 2. val = [1. where each processor needs to get data from every other processor and thus calls MPI Allgather. Section 3.7. j++) y[i] = y[i] + val[j]*x[col_ind[j]]. Section 4. If A is symmetric. Sometimes. On a distributed memory machine. 5]. 2. the elements of the vector x are addressed indirectly via the index array col ptr. Using OpenMP on a shared memory computer. behind the last element in val. can improve performance significantly [58]. however. 4. 4. 8.7 to compute its segment of y. for example. The vectors x and y are distributed accordingly. 3. 5. 2. col ind = [1.2. each processor gets a block of rows. in the sparse case there are generally 0 2 3 4 0 0 5 6 0 7  0 0 0 0  0 9 . in PETSc [9]. Here. i < n. there will be elements of vector x stored on other processors that may have to be gathered. let  1 0  A = 0  0 0 Then. We will revisit such indirect addressing in Chapter 3. 6. 5. j < col_ptr[i+1].40 APPLICATIONS for (i=0. } Fig. Block CSR storage.1. the rows are distributed so that each processor holds (almost) the same number of rows or nonzero elements. Sparse matrix–vector multiplication y = Ax with the matrix A stored in the CSR format. i++){ y[i] = 0. for (j=col_ptr[i]. A matrix–vector multiplication y = Ax can be coded in the way shown in Figure 2. Each processor executes a code segment as given in Figure 2.7. Unlike the dense case. Often each processor gets n/p rows such that the last processor possible has less work to do. 10].8. 2. 4. 9. 9]. 3. 3.  8 0 0 10 . However. 2. 3.
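On a shared memory machine the outer loop of Figure 2.7 can indeed be parallelized with a single OpenMP directive, as in the following sketch (0-based indices are assumed here, whereas the example arrays above are written 1-based; col_ptr again plays the role of the row pointer). Each row is handled by exactly one thread, so no two threads write the same y[i].

/* CSR matrix-vector product y = A x with the outer loop parallelized
   by OpenMP (compile with OpenMP enabled).                           */
void csr_matvec(int n, const int *col_ptr, const int *col_ind,
                const double *val, const double *x, double *y)
{
   #pragma omp parallel for schedule(static)
   for (int i = 0; i < n; i++) {
      double s = 0.0;
      for (int j = col_ptr[i]; j < col_ptr[i+1]; j++)
         s += val[j] * x[col_ind[j]];
      y[i] = s;
   }
}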

It is in principle possible to hide some communication latencies by a careful coding of the necessary send and receive commands. Furthermore. where it is usually more effective. Ai.i−1 and Ai. respectively. 2. • Step 2: Update yi ← yi +Ai. Sparse matrix with band-like nonzero structure row-wise block distributed on six processors. by proceeding in the following three steps • Step 1: Form yi = Aii xi (these are all local data) and concurrently receive xi−1 .i+1 are the submatrices of Ai that are to be multiplied with xi−1 and xi+1 . then a processor needs only data from nearest neighbors to be able to form its portion of Ax. just a few nonzero elements required from the other processor memories.i+1 xi+1 . that the vector x was accessed indirectly. To see this. some fraction of the communication can be hidden under the computation. ITERATIVE METHODS 41 P_0 P_1 P_2 P_3 P_4 P_5 Fig. • Step 3: Update yi ← yi + Ai. For example.LINEAR ALGEBRA: SPARSE MATRICES. This technique is called latency hiding and exploits non-blocking communication as implemented in MPI Isend and MPI Irecv (see Appendix D). if one . Very often communication can be reduced by reordering the matrix. We have seen. In Figure 2. An analogous latency hiding in the SIMD mode of parallelism is given in Figure 3.7 we have given a code snippet for forming y = Ax.8. if the nonzero elements of the matrix all lie inside a band as depicted in Figure 2. that is. data. we split the portion Ai of the matrix A that is local to processor i in three submatrices. or same amount of. Then.i−1 xi−1 and concurrently receive xi+1 .6.8. Aii is the diagonal block that is to be multiplied with the local portion xi of x. the portions of x that reside on processors i − 1 and i + 1. not all processors need the same. If one multiplies with AT . a call to MPI Alltoallv may be appropriate. Therefore.
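The three steps above might be arranged around nonblocking MPI calls as in the following sketch. The neighbor ranks myid-1 and myid+1, the ghost block length n_ghost, and the helpers local_matvec_* for the three submatrix products are assumptions of this sketch; processes without a left or right neighbor are not treated, and the local matrix data is assumed to be accessible to the helpers.

#include <mpi.h>

/* Assumed local kernels: y = A_ii x, y += A_{i,i-1} x_{i-1}, y += A_{i,i+1} x_{i+1}. */
void local_matvec_diag (int n, const double *x,  double *y);
void local_matvec_left (int n, const double *xl, double *y);
void local_matvec_right(int n, const double *xr, double *y);

void banded_matvec(int n_loc, int n_ghost, double *x_loc, double *y_loc,
                   double *x_left, double *x_right, MPI_Comm comm)
{
   int myid;  MPI_Comm_rank(comm, &myid);
   MPI_Request req[4];

   /* post receives for the neighbor pieces and send our own pieces */
   MPI_Irecv(x_left,  n_ghost, MPI_DOUBLE, myid-1, 0, comm, &req[0]);
   MPI_Irecv(x_right, n_ghost, MPI_DOUBLE, myid+1, 1, comm, &req[1]);
   MPI_Isend(x_loc,   n_ghost, MPI_DOUBLE, myid-1, 1, comm, &req[2]);
   MPI_Isend(x_loc + n_loc - n_ghost,
             n_ghost, MPI_DOUBLE, myid+1, 0, comm, &req[3]);

   local_matvec_diag(n_loc, x_loc, y_loc);      /* step 1: y_i = A_ii x_i          */

   MPI_Wait(&req[0], MPI_STATUS_IGNORE);
   local_matvec_left(n_loc, x_left, y_loc);     /* step 2: y_i += A_{i,i-1} x_{i-1} */

   MPI_Wait(&req[1], MPI_STATUS_IGNORE);
   local_matvec_right(n_loc, x_right, y_loc);   /* step 3: y_i += A_{i,i+1} x_{i+1} */

   MPI_Wait(&req[2], MPI_STATUS_IGNORE);
   MPI_Wait(&req[3], MPI_STATUS_IGNORE);
}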

of course. As we derived in Section 2. for (i=0.33) ∞ Sum j=0 Gj is a truncated approximation to A−1 M1 = (I − G)−1 = j=0 Gj .5 and 2.2. If only the upper or lower triangle of a symmetric matrix is stored.1. i < n. j++) y[row_ind[j]] = y[row_ind[j]] + val[j]*x[i]. Sparse matrix–vector multiplication y = AT x with the matrix A stored in the CSR format.4 The above matrix–vector multiplications should. an operation we revisit in Section 3.8.9.9.2. then we have seen in (2.3.3. then both matrix–vector multiplies have to be used. Thus. In this case. be constructed on a set of sparse BLAS.34) . then −1 z = M −1 r = (I + G + · · · + Gm−1 )M1 r. (2. forms y = AT x. Since M approximates A it is straightforward to execute a fixed number of steps of stationary iterations to approximately solve Az = r.3. If the (0) iteration is started with z = 0. Let A = M1 − N1 be a splitting of A. Work on a set of sparse BLAS is in progress [43]. y[row ind[j]] elements are scattered in a nonuniform way back into memory. 2. where M1 is an spd matrix.25) that k−1 k−1 z(k) = j=0 k−1 Gj c = j=0 −1 Gj M1 r.10 Preconditioning and parallel preconditioning 2. we first form local portions yi = AT xi of y without communication. if the preconditioner M is defined by applying m steps of a stationary iteration.42 APPLICATIONS for (j=0. one for multiplying with the original matrix. (2. k > 0. } Fig.0. Remark 2. then the result vector is accessed indirectly.32) −1 −1 −1 where G = M1 N1 = I − M1 A is the iteration matrix and c = M1 r. where M is the preconditioner. j < col_ptr[i+1]. i++){ for (j=col_ptr[i]. 2. then forming y = yi involves only the nearest neighbor communications. see Figure 2. the corresponding stationary iteration for solving Az = r is given by M1 z(k) = N1 z(k−1) + r or z(k) = Gz(k−1) + c. j++) y[i] = 0. (2. both GMRES and PCG require the solution of a system of equations M z = r.3.1 Preconditioning with stationary iterations As you can see in Figures 2. j < m. If A has i a banded structure as indicated in Figure 2.10. Since AT is stored column-wise.6. one for multiplying with its transpose.
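Applying the preconditioner (2.33)–(2.34), that is, m steps of a stationary iteration started with z^(0) = 0, might be sketched as follows for the Jacobi splitting M_1 = D and a matrix stored in the CSR format of Figure 2.7 (0-based indices assumed; diag[i] holds a_ii).

/* z ~= M^{-1} r, where M corresponds to m steps of the Jacobi
   iteration:  z^{(k)} = z^{(k-1)} + D^{-1}(r - A z^{(k-1)}),
   started with z^{(0)} = 0, so that z^{(1)} = D^{-1} r.           */
void jacobi_precond(int n, int m, const int *col_ptr, const int *col_ind,
                    const double *val, const double *diag,
                    const double *r, double *z, double *w)
{
   for (int i = 0; i < n; i++) z[i] = r[i] / diag[i];

   for (int k = 1; k < m; k++) {
      for (int i = 0; i < n; i++) {         /* w = r - A z */
         double s = r[i];
         for (int j = col_ptr[i]; j < col_ptr[i+1]; j++)
            s -= val[j] * z[col_ind[j]];
         w[i] = s;
      }
      for (int i = 0; i < n; i++) z[i] += w[i] / diag[i];
   }
}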

In that case.3 Parallel Gauss–Seidel preconditioning Gauss–Seidel preconditioning is defined by  zi (k+1)  aij zj j<i (k+1) = 1  ci − aii − j>i (k) aij zi  . that is. If m = 1 in (2. Each step of the GS iteration requires the solution of a triangular system of equations.32). Again (Example 2. we set M1 = D = diag(A). the update of zi in (2.3. the degree of parallelism can be improved by permuting A. From the point of view of computational complexity and parallelization.3. L inherits the sparsity pattern of the lower triangle of A. 2. the degree of parallelism depends on the sparsity pattern.2 Jacobi preconditioning In Jacobi preconditioning.35) from Section 2. The grid and the sparsity pattern of the corresponding matrix A is given in Figure 2. ITERATIVE METHODS 43 This preconditioner is symmetric if A and M1 are also symmetric.3. The best known permutation is probably the checkerboard or red-black ordering for problems defined on rectangular grids. let A be the matrix that is obtained by discretizing the Poisson equation −∆u = f (∆ is the Laplace operator) on a square domain by 5-point finite differences in a 9×9 grid. 2. but as the iteration given in (2. To improve its usually mediocre quality without degrading the parallelization property. row by row.34).2). (2. Frequently. Note that the original matrix A is permuted and not L.35) has to wait until zi−1 is known. D−1 r(k) corresponds to an elementwise vector–vector or Hadamard product.10.10. Just like solving tridiagonal systems. In the case where L is sparse. The diagonal M1 is replaced by a block diagonal M1 such that each block resides entirely on one processor. the preconditioner consists of only one step of the iteration (2. M = M1 . If the unknowns are colored as indicated in Figure 2.32)–(2.LINEAR ALGEBRA: SPARSE MATRICES. We note that the preconditioner is not applied in the form (2. Since there are elements in the first lower off-diagonal (k+1) (k+1) of A.10 for the case where the grid points are numbered in lexicographic order. the . and thus of L. then z = z(1) = D−1 r. If the lower triangle of L is dense.11(left) and first the white (red) unknowns are numbered and subsequently the black unknowns.3. which is easy to parallelize.32). then GS is tightly recursive. SOR and GS are very similar. Jacobi preconditioning can be replaced by block Jacobi preconditioning.34).3.

11(right).44 APPLICATIONS Fig. Note that the color red is indicated as white in the figure. 9 × 9 grid (left) and sparsity pattern of the corresponding Poisson matrix (right) if grid points are numbered in lexicographic order.10. 2. w (2. cf. 9 × 9 grid (left) and sparsity pattern of the corresponding Poisson matrix (right) if grid points are numbered in checkerboard (often called redblack) ordering. Each of the two .11.36) This structure admits parallelization if both the black and the white (red) unknowns are distributed evenly among the processors. w (k) zb (k+1) −1 = Db (bb − Abw z(k+1) ). Here. sparsity pattern is changed completely. 2. one step of SOR becomes −1 z(k+1) = Dw (bw − Awb zb ). A= Dw Abw Awb . Fig. Now the matrix has a 2×2 block structure. Figure 2. the indices “w” and “b” refer to white (red) and black unknowns. With this notation. Db where the diagonal blocks are diagonal.
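For the 5-point Poisson stencil, one GS sweep (ω = 1) in red-black ordering might be written directly on the n × n grid as in the sketch below. Within one color all updates are independent, so each of the two loop nests can be parallelized, for instance with the OpenMP directive shown.

/* One red-black Gauss-Seidel sweep for the 5-point discretization of
   -Laplace(u) = f on an n x n interior grid with a surrounding layer
   of boundary values; u and f are stored row-major as v[i*(n+2)+j],
   0 <= i,j <= n+1, and h2 is the squared mesh width.                  */
void redblack_gs_sweep(int n, double *u, const double *f, double h2)
{
   for (int color = 0; color < 2; color++) {        /* 0: red, 1: black */
      #pragma omp parallel for
      for (int i = 1; i <= n; i++)
         for (int j = 1 + (i + color) % 2; j <= n; j += 2)
            u[i*(n+2)+j] = 0.25 * (h2 * f[i*(n+2)+j]
                          + u[(i-1)*(n+2)+j] + u[(i+1)*(n+2)+j]
                          + u[i*(n+2)+j-1]   + u[i*(n+2)+j+1]);
   }
}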

) Then Ω = ∪j Ωj implies that each column of the identity matrix appears in at least one of the Rj .12. In the example in Figure 2.36) then comprises a sparse matrix–vector multiplication and a multiplication with the inverse of a diagonal matrix.10. . Furthermore. The first has been discussed in the previous section. (2. d.13. the latter is easy to parallelize. Ω = ∪i Ωi . one may prefer to color groups of grid points or grid points in subdomains as indicated in Figure 2. . This can be Fig. (Rj extracts from a vector those components that belong to subdomain Ωj .3. This means that each point x ∈ Ω: in particular. the subdomains overlap.37) When we step through the subdomains. ITERATIVE METHODS 45 steps in (2. 2. . In Figure 2. all subdomains are mutually disjoint: Ωi ∩ Ωj = ∅. T (b − Az(k) )|Ωj = Rj (b − Az(k) ). then each column of the identity matrix appears in exactly one of the Rj . . again as a Hadamard product. if the subdomains do not overlap. if i = j.12. the blocks on the diagonal of A and the residuals corresponding to the jth subdomain are T A|Ωj = Aj = Rj ARj . 9 × 9 grid (left) and sparsity pattern of the corresponding Poisson matrix (right) if grid points are arranged in checkerboard (red-black) ordering. This approach is called domain decomposition [133].12—again for the Poisson equation on the square grid. each grid point is an element of at least one of the subdomains. By means of the Rj s. To formalize this approach. i = 1. 2. we update those components of the approximation z(k) corresponding to the respective domain. Let Rj be the matrix consisting of those columns of the identity matrix T that correspond to the numbers of the grid points in Ωj . We require that Ω is covered by the Ωi . let us denote the whole computational domain Ω and the subdomains Ωi . .4 Domain decomposition Instead of coloring single grid points.LINEAR ALGEBRA: SPARSE MATRICES.

written as T z(k+i/d) = z(k+(i−1)/d) + Rj A−1 Rj (b − Az(k+(i−1)/d) ). . Let us color the subdomains such that domains that have a common edge (or even overlap) do not have the same color. . i = 1.13. Overlapping domain decomposition. e(k+1) = (I − Bd A) · · · (I − B1 A)e(k) . In fact.46 APPLICATIONS Ω3 Ω2 Ω1 Ωj Fig. More generally. APj = PjT A. coloring. Thus. if we have q .12 we see that two colors are enough in the case of the Poisson matrix on a rectangular grid. j ≡ z(k+(i−1)/d) + Bj r(k+(i−1)/d) . The iteration matrix of the whole step is the product of the iteration matrices of the single steps. From (2. that is Pj = Pj2 . .38) We observe that Pj := Bj A is an A-orthogonal projector. 2. As a stationary method. the multiplicative Schwarz procedure converges for spd matrices.38) is called a multiplicative Schwarz procedure [131]. The problems with parallelizing multiplicative Schwarz are related to those parallelizing GS and the solution is the same. Iteration (2. .38) we see that T T e(k+j/d) = (I − Rj (Rj ARj )−1 Rj A)e(k+(j−1)/d) = (I − Bj A)e(k+(j−1)/d) . d. The multiplicative Schwarz procedure is related to GS iteration wherein only the most recent values are used for computing the residuals. (2. The adjective “multiplicative” comes from the behavior of the iteration matrices. it is block GS iteration if the subdomains do not overlap. From Figure 2.

3. The incomplete Cholesky LU factorization can be parallelized quite like the original LU or Cholesky factorization.5 Incomplete Cholesky factorization/ICCG(0) Very popular preconditioners are obtained by incomplete factorizations [130]. R is nonzero at places where A is zero. Ad ). If the Ωj do overlap as in Figure 2. A = LLT . As subdomains with the same color do not overlap. . x(k+1) = x(k+1−1/q) j∈colorq Bj r(k+1−1/q) .13. . as the related Jacobi iterations. One of the most used incomplete Cholesky factorizations is called IC(0). Nevertheless. . It is simplified because the zero structure of the triangular factor is known in advance. Additive Schwarz preconditioners are.LINEAR ALGEBRA: SPARSE MATRICES.39) care has to be taken when updating grid points corresponding to overlap regions. in (2. the additive Schwarz iteration may diverge as a stationary iteration but can be useful as a preconditioner in a Krylov subspace method.14. . see Figure 2. . Thus. involves finding a version of the square root of A. easily parallelized. (2. the sparsity structure of the lower triangular portion of the incomplete Cholesky factor is a priori specified to be the structure of the original matrix A. . An incomplete LU or Cholesky factorization can help in this situation. The incomplete LU factorization ILU(0) for nonsymmetric matrices is defined in a similar way. ITERATIVE METHODS 47 colors. A = LLT + R. j∈color2 x(k+2/q) = x(k+1/q) + . This may render the direct solution infeasible with regard to operations count and memory consumption. 2. Incomplete factorizations are obtained by ignoring some of the fill-in. An additive Schwarz procedure is given by d z(k+1) = z(k) + j=1 Bj r(k) .39) If the domains do not overlap. R = O. . A direct solution of Az = r. Here. which in most situations will generate a lot of fill-in. r(k) = b − Az(k) . then x(k+1/q) = x(k) + j∈color1 Bj r(k) . the updates corresponding to domains that have the same color can be computed simultaneously. Bj r(k+1/q) .10. when A is spd. then the additive Schwarz preconditioner coincides with the block Jacobi preconditioner diag(A1 .

48 APPLICATIONS for k = 1. K = M −1 . . for i = k + 1. . . . With our previous notation. The incomplete Cholesky factorization with zero fill-in. ki = Kei is the ith column of K. See Saad [130] for an overview. . their set-up parallelizes nicely since the columns of K can be computed independent of each other.6 Sparse approximate inverses (SPAI) SPAI preconditioners are obtained by finding the solution of n min I − AK K 2 F = i=1 ei − Aki . . .14. There are variants of SPAI in which the number of the zeros per column are increased until ei − Aki < τ for some τ [65]. 2. SPAI is an approximate inverse of A.3. we multiply with K. SPAI preconditioners are quite expensive to compute. There are also symmetric SPAI preconditioners [88] that compute a sparse approximate inverse of a Cholesky factor. n do √ lkk = akk . The polynomial p(A) is determined to approximate A−1 . 2. . SPAI preconditioners also parallelize well in their application. p(λ) ≈ λ−1 for λ ∈ σ(A). Therefore. n do if aij = 0 then lij = 0 else aij = aij − lik lkj .3. . However. Likewise. σ(A) denotes the set of eigenvalues of A. Here.10. . 2. for j = k + 1. that is. . M z = r ⇐⇒ z = M −1 r = Kr. n do lik = lik /lkk . endif endfor endfor endfor Fig. Here. . .10. where K has a given sparsity structure.7 Polynomial preconditioning A preconditioner of the form m−1 M = p(A) = j=0 γj Aj is called a polynomial preconditioner. in the iterative solver. As the name says.
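Applying a polynomial preconditioner requires nothing but repeated matrix–vector products. A sketch for z = p(A)r evaluated by Horner's rule follows; matvec (y = Ax) is an assumption of this sketch and the coefficients γ_j are taken as given.

void matvec(int n, const double *x, double *y);   /* y = A x, assumed */

/* z = p(A) r = (gamma_0 I + gamma_1 A + ... + gamma_{m-1} A^{m-1}) r,
   evaluated by Horner's rule: z <- gamma_{m-1} r, then repeatedly
   z <- A z + gamma_j r for j = m-2, ..., 0.  w is a work array.       */
void poly_precond(int n, int m, const double *gamma,
                  const double *r, double *z, double *w)
{
   for (int i = 0; i < n; i++) z[i] = gamma[m-1] * r[i];

   for (int j = m - 2; j >= 0; j--) {
      matvec(n, z, w);                             /* w = A z */
      for (int i = 0; i < n; i++) z[i] = w[i] + gamma[j] * r[i];
   }
}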

In our superficial survey here. it was known to Gauss that the splitting we show below was suitable for parallel computers. However. is a power of 2: n = 2m for some integer m ≥ 0. y ∈ Cn are complex n-vectors. As written. and use a convolution procedure. the discrete Fourier Transform is simply a linear transformation y = Wn x. we number the elements according to the C numbering convention: 0 ≤ p. . For convenience. n = 2r 3s 5t . . The matrix elements of Wn are wp. the idea of FFT originated with Karl Friedrich Gauss around 1805. It is this latter formulation which became one of the most famous algorithms in computation developed in the twentieth century. and n−1/2 Wn is a unitary matrix.FAST FOURIER TRANSFORM (FFT) 49 Polynomial preconditioners are easy to implement. In fact. how to use the FFT. There is a vast literature on FFT and readers serious about this topic should consult Van Loan’s book [96] not only for its comprehensive treatment but also for the extensive bibliography. . have clear-headed concise discussions of symmetries. section 4. Therefore. on parallel or vector processors as they just require matrix–vector multiplication and no inner products. a Written in its most basic form. Additionally.q = e(2πi/n)pq ≡ ω pq . 2. iterative Krylov subspace methods such as CG or GMRES construct approximate solutions in a Krylov subspace with the same number of matrix–vector products that satisfy some optimality properties. q ≤ n − 1. Oran Brigham’s two books [15] and [16].6]. the usual procedure is to pad the vector to a convenient larger size. . and perhaps more important. However. For example see Van Loan’s [96. What makes the FFT possible is that Wn has many symmetries. whom he acknowledged were not only fast but accurate (schnell und pr¨zis). the algorithm suitable for digital computers was invented by Cooley and Tukey in 1965 [21]. where ω = exp(2πi/n) is the nth root of unity. in particular. the amount of computation is that of matrix times vector multiply. Our intent here is far more modest. Even the topic of parallel algorithms for FFT has an enormous literature—and again. O(n2 ).4 Fast Fourier Transform (FFT) According to Gilbert Strang. Computers in his case were women. we treat only the classic case where n. the dimension of the problem. where x. If n contains large primes. the generalized prime factor algorithm can be used [142]. E. a (inner) Krylov subspace method is generally more effective than a polynomial preconditioner.2. If n is a product of small prime factors. Van Loan’s book [96] does well in the bibliographic enumeration of this narrower topic.

are needed. Splitting gives 8 n 2 even 2 + 6n + 8 diag * n 2 odd 2 = 4n2 + 6n. wp. .q . ω. ω = exp(2πi/n) is the nth root of unity. q (2.q = iq wp. .” Let us see how the FFT works: the output is n−1 yp = q=0 ω pq xq . Now call the set w = {1. ¯ wp+n/2. yn/2 = Wn/2 xeven − diag(wn )Wn/2 xodd . .q . . p wn−p. p = 0. wp+n/4. . ¯ wp. the first n/2 powers of ω.q = wp.40) (2. . with ω = exp(2πi/n). . . . Already you should be able to see what happens: the operation count: y = Wn x has 8n2 real operations (*.40b) Because of the first pair of symmetries. the others being linear combinations (frequently very simple) of these n/2. The first row of the matrix Wn is all ones.n−1 = {w.q=0. r=0 n/2−1 p·r p ωn/2 x2r + ωn r=0 r=0 p·r ωn/2 x2r+1 . The yn/2 is the second half of y and the minus sign is a result of (2. The first half of the third row is every other element of the second row. The point is that only w. .n−q = wp. then the second row of Wn is w1. ω 2 . and so forth.. the pre-computed roots of unity. Historically. were called “twiddle factors.q .q = (−1) wp. .40a).40a) (2. ω. ω 2 . n − 1. w = (1. . +). . ω n/2−1 ). . = + = y = Wn/2 xeven + diag(wn )Wn/2 xodd .q . As above.. .q . while the next half of the third row is just the negative of the first half.q+n/4 = ip wp.50 APPLICATIONS Here are some obvious symmetries: wp. −w}. ω 2 . where diag(wn ) = diag(1. ω n/2−1 ).q+n/2 = (−1) wp. . ω p·q xq + (odd). ω n/2−1 }. ω 3 . ω. .q . only n/2 elements of W are independent. Gauss’s idea is to split this sum into even/odd parts: yp = q even n/2−1 n/2−1 p·(2r) ωn x2r r=0 n/2−1 p·(2r+1) ωn x2r+1 .

We get n−1 y2·r = q=0 2·r·q ωn xq . and so forth until the transforms become W2 and we are done. n/2 − 1}. To do this. . .FAST FOURIER TRANSFORM (FFT) 51 a reduction in floating point operations of nearly 1/2. . x− = {xr − xr+n/2 |r = 0. yeven = {y2·r |r = 0. . the resulting algorithm is called the general prime factor algorithm (GPFA) [142]. and D = diag(wn ) = diag(1. yeven = Wn/2 (x + xn/2 ). In the first stage.41) y2·r+1 = q=0 (2r+1)·q ωn xq . we write the above splitting in matrix form: as above. For example. we end up with two n/2 × n/2 transforms and a diagonal scaling. n/2 − 1}. Each of the two transforms is of O((n/2)2 ). the idea of separating the calculation into even/odd parts can then be repeated. . . . Now we split each Wn/2 transformation portion into two Wn/4 operations (four altogether). ω 2 . . . n/2−1 = q=0 r·q q ωn/2 ωn (xq − x(n/2)+q ). n/2−1 = q=0 r·q ωn/2 (xq + xn/2+q ). yodd = {y2·r+1 |r = 0. decimation in frequency. hence the savings of 1/2 for the calculation. . A variant of this idea (called splitting in the time domain) is to split the output and is called decimation in the frequency domain. n/2 − 1}. . x+ = {xr + xr+n/2 |r = 0. time domain or frequency domain. . . . n−1 (2. . then into eight Wn/8 . . . Let us expand on this variant. yodd = Wn/2 diag(wn )(x − xn/2 ) . if n = 2m1 3m2 5m3 · · · can be factored into products of small primes. (2. ω n/2−1 ). in order to see how it works in more general cases. . . .42) In both forms. . ω. n/2 − 1}.
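The even/odd splitting translates directly into a recursive FFT. The following sketch implements the decimation in frequency form (2.41)–(2.42) with ω = exp(2πi/n) as above; it is not optimized (it allocates O(n log n) workspace), assumes n is a power of two, and uses C99 complex arithmetic.

#include <complex.h>
#include <stdlib.h>

/* Recursive decimation-in-frequency FFT:
   y[2r]   = FFT_{n/2}( x[q] + x[q+n/2] ),
   y[2r+1] = FFT_{n/2}( omega^q (x[q] - x[q+n/2]) ).                 */
void fft_rec(int n, const double complex *x, double complex *y)
{
   const double twopi = 6.283185307179586;
   if (n == 1) { y[0] = x[0]; return; }

   int h = n / 2;
   double complex *xp = malloc(h * sizeof(double complex));
   double complex *xm = malloc(h * sizeof(double complex));
   double complex *yp = malloc(h * sizeof(double complex));
   double complex *ym = malloc(h * sizeof(double complex));

   for (int q = 0; q < h; q++) {
      double complex w = cexp(I * (twopi * q / n));   /* omega^q */
      xp[q] = x[q] + x[q + h];
      xm[q] = w * (x[q] - x[q + h]);
   }
   fft_rec(h, xp, yp);            /* even-indexed outputs */
   fft_rec(h, xm, ym);            /* odd-indexed outputs  */

   for (int r = 0; r < h; r++) {
      y[2*r]     = yp[r];
      y[2*r + 1] = ym[r];
   }
   free(xp); free(xm); free(yp); free(ym);
}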

we may write (2. l1 < p and 0 ≤ k1 . D) W2 ⊗ In/2 x.1 B ··· a0. noting that W2 = we get an even more concise form yeven yodd = I2 ⊗ Wn/2 diag(In/2 . (2. l0 < p and 0 ≤ m1 .p−1 B a0. that (Ap ⊗ Bq )lm = Al0 m0 Bl1 m1 .p−1 B   A ⊗ B =   ··· p×p q×q ap−1. D) x+ x− . All that remains is to get everything into the right order by permuting (yeven . l < n range. yodd ) → y.44) where I2 is the 2 × 2 unit matrix.43) y = P2 n/2 I2 ⊗ Wn/2 diag(In/2 . An operation ⊗ between two matrices is called a Kronecker (or “outer”) product. l0 < q. where 0 ≤ l0 < p and 0 ≤ l1 < q.47) .0 B a1.52 APPLICATIONS to get yeven yodd = Wn/2 Wn/2 In/2 Dn/2 x+ x− . Using this notation. with the ranges of the digits 0 ≤ k0 . With this definition of Pq . Notice that the decimation for k. D) W2 ⊗ In/2 x. Such permutations are called index digit permutations and the idea is as follows. l = l0 + l1 p. l is switched. Furthermore. (2. which for matrix A (p×p) and matrix B (q ×q). If the dimension of the transform can be written in the p form n = p · q (only two factors for now).43) in a slightly more compact form: yeven yodd = I2 ⊗ Wn/2 diag(In/2 .p−1 B and of course A ⊗ B = B ⊗ A. l1 < q.43) Now some notation is required. although they p both have the same 0 ≤ k.0 B  a1.1 B · · · ap−1.46) where k = k0 + k1 q. (2. (2. We have p Pq kl = δ k 0 l 1 δ k 1 l0 .45) 1 1 1 −1 .1 B ··· a1. is written as   a0. we get a very compact form for (2.0 B ap−1. A little counting shows that where l = l0 + l1 p and m = m0 + m1 p with 0 ≤ m0 . (2. then a permutation Pq takes an index l = l0 + l1 p → l1 + l0 q.

Having outlined the method of factorization. ω q−1 ). D) W2 ⊗ In/2 x. where any nj may be repeated (e. One variant is [142]: 1 Wn = j=k Plj j ⊗ Imj n Dljj ⊗ Imj n n Wnj ⊗ In/nj .47) because the I2 ⊗ Wn/2 is a direct sum of two Wn/2 multiplies. (2.. l=0 (wq )p−1 where wq = diag(1. where n = i=1 ni : j−1 k mj = i=1 ni and lj = i=j+1 ni . respectively).48) So.46) [142].42). which we do not prove [142]. finally we get a commuted form y = Wn/2 ⊗ I2 P2 n/2 diag(In/2 . the O(n2 ) computation has been reduced by this procedure to two O((n/2)2 ) computations. The remaining two Wn/2 operations may be similarly factored. the diagonal diag(1. .49) As in (2. . there is a plethora of similar representations.41) and (2. (2. we state without proof the following identity which may be shown from (2.   . the next four Wn/4 similarly until n/2 independent W2 operations are reached and the factorization is complete. p p (A ⊗ B) Pq = Pq (B ⊗ A) .42). each of which is O((n/2)2 ). ω. which are the products of all the previous factors up to j and of all the remaining factors after j. Not surprisingly. A general formulation can be k written in the above notation plus two further notations. Although this factorized form looks more complicated than (2. . Here is a general result. (2. and so forth [96].g. plus another O(n) due to the diagonal factor. D) in (2.41) and (2. unit-stride. Using the above dimensions of A and B (p × p and q × q. it generalizes to the case n = n1 · n2 · · · nk . ω 2 . the diagonal matrix Dljj generalizes diag(1. D) factor is O(n).FAST FOURIER TRANSFORM (FFT) 53 To finish up our matrix formulation. respectively. in order.49) appropriate for the next recursive steps:   (wq )0 p−1   (wq )1   p Dq = (wq )l =  . . This is easy to see from (2. frequently altering the order of reads/writes for various desired properties—in-place. . nj = 2 for the binary radix case n = 2m ). interested readers should hopefully be able to follow the arguments in [96] or [142] without much trouble.50) In this expression. .

Frequently. Perhaps more to the point for the purposes in this book is that they also reduce the memory traffic by the same amount—from O(n2 ) to O(n · log(n)).9) or Bluestein’s ([96]. 4. . 3. Additionally. 111. thus reducing rounding errors. This is important. This is evident from (2. Unfortunately. 001. if some nj is too large. for if one’s larger calculation is a convolution. 100. 1. In Cooley and Tukey’s formulation. Furthermore. This is not always inconvenient.2. large prime factor codings also require a lot of registers—so some additional memory traffic (even if only to cache) results. 011. Is this recursive factorization procedure stable? 2. What happens in the classical Cooley–Tukey algorithm is that the results come out numbered in bit-reversed order. The FFT is very stable: first because Wn is unitary up to a scaling. . Although convenient operationally. 2 .54 APPLICATIONS In any case. the partial results are stored into the same locations as the operands. both operand vectors come out in the same bit-reversed order so multiplying them element by element does not require any re-ordering.3) methods. nj > 2. the point of all this is that by using such factorizations one can reduce the floating point operation count from O(n2 ) to O(n · log(n)). whereas the input/output data (“surface area”) is high relative to the computation. without permutations. 011 → 110. a significant savings. but also because the number of floating point operations is sharply reduced from O(n2 ) to O(n · log n). Figure 3. It turns out that small prime factors. etc. section 4. The (memory traffic)/(floating point computation) ratio is higher for nj = 2 than for nj > 2. . and the results come out numbered 0. in n = n1 · n2 · · · nk . What we have not done in enough detail is to show where the partial results at each step are stored. only radices 2–5 are available and larger prime factor codings use variants of Rader’s ([96]. . 6. the inverse operation can start . 2. 5. 110. theorem 4.2. coding for this large prime factor becomes onerous. To be specific: when n = 8. 4. are often more efficient than for nj = 2. These involve computing a slightly larger composite value than nj which contains smaller prime factors. the input vector elements are numbered 0. 5.50) in that there are k steps. this may be inconvenient to use. The output bit patterns of the indices of these re-ordered elements are bit-reversed: 001 → 100. 2. 7. Optimal choices of algorithms depend on the hardware available and problem size [55]. 010. 7 or by their bit patterns 000. 101. Now we wish to answer the following questions: 1. What about the efficiency of more general n-values than n = 2m ? The answers are clear and well tested by experience: 1 . 1. each of which requires O(n) operations: each Wnj is independent of n and may be optimally hand-coded. 6. 3.21 shows that the computation (“volume”) for nj = 2 is very simple.

that storage follows the Fortran convention x = ( x0 . However. )) and processing proceeds on that pretense. that is.5.1): memory traffic is the most important consideration on today’s computing machinery.e. When n is even. Thus. we can go further: if the input sequence has certain symmetries. x0 . The inverse operation is similar but has a pre-processing step followed by an . then z = Wn/2 x. .50) for each recursive factorization will result in an ordered-in/ordered-out algorithm. x1 . Consider first the case when the n-dimensional input contains only real elements. that p is what we will do in Sections 3. Another helpful feature of this algorithm is that the complex ¯ portion z = Wn/2 x simply pretends that the real input vector x is complex (i. . .4 and 3. when using FFT for filtering. it is well to have an ordered-in. .51) because x is real. . A procedure first outlined by Cooley et al. . so there will be only n independent real output elements in the complex array y. ¯ ¯ 4 4i (2. This is entirely sensible because there are only n independent real input elements. solving PDEs.1 Symmetries As we have just shown. Hence.4. j=0 = yn−k .40) may be used to reduce the computational effort even more. and most other situations. we pretend x ∈ Cn/2 yk = 1 1 (zk + z(n/2)−k ) + ω k (zk − z(n/2)−k ). no re-ordering is needed. It turns out that a single post-processing step to re-order the bit-reversed output ([141]) is nearly as expensive as the FFT itself—a reflection on the important point we made in the introduction (Section 1. Equations (2. Clever choices of Pq (2. by splitting either the input or output into even and odd parts. (see [139]) reduces an n-dimensional FFT to an n/2-dimensional complex FFT plus post-processing step of O(n) operations. ¯ n−1 = j=0 ω −jk xj = ω (n−j)k xj . .FAST FOURIER TRANSFORM (FFT) 55 from this weird order and get back to the original order by reversing the procedure.6. the second half of the output sequence y is the complex conjugate of the first half read in backward order. It follows that the output y = Wn x satisfies a complex conjugate condition: we have n−1 n−1 yk = ¯ j=0 n−1 ω jk xj = ¯ ¯ j=0 ω jk xj . (2. x1 . In addition. 2. considerable savings in computational effort ensue. In fact.52) for k = 0. these algorithms illustrate some important points about data dependencies and strides in memory which we hope our dear readers will find edifying. ordered-out algorithm. n/2.

again pretend z ∈ Cn/2 . the n/2-dimensional FFT and O(n) operation post(pre)-processing steps are both parallelizable and may be done in place if the Wn/2 FFT algorithm is in place. and another O(n) post-processing step: oj = (xj − x(n/2)−j ) − sin y = CRFFTn/2 (o). but either symmetric or skew-symmetric. and finally another O(n) dimensional post-processing step. . Even more savings are possible if the input sequence x is not only real. . n 2πj (xj − x(n/2)−j ). We get. for j = 3. . . 5. 5. Here RCFFTn/2 is an n/2-dimensional FFT done by the CTLW procedure (2. Cost savings reduce the O(n · log (n)) operation count to O n n log 2 2 + O(n). for j = 0. . If the input sequence is real x ∈ Rn and satisfies xn−j = xj . If the input sequence is real and skew-symmetric xj = −xn−j the appropriate algorithm is again from Dollimore [139]. zj = yj + y(n/2)−j + i¯ j (yj − y(n/2)−j ) ¯ ω ¯ ¯ x = Wn/2 z. n The routine CRFFTn/2 is an n/2-dimensional half-complex inverse to the real → half-complex routine RCFFT (2. . n/2 − 1. yj = 2yj + yj−2 . The computation is first for j = 0. .56 APPLICATIONS ¯ n/2-dimensional complex FFT. . ej = (xj + x(n/2)−j ) − 2 sin y = RCFFTn/2 (e). . The un-normalized inverse is x = Wn/2 y: for j = 0. . . . yj = yj−2 − 2yj . 2πj (xj + x(n/2)−j ). . (y) ↔ (y). . .53) Both parts. . n/2 − 1. satisfies for j = 0. . . . 7. an O(n/4) dimensional complex FFT. . an O(n) pre-processing. (2. n/2 − 1. . then Dollimore’s method [139] splits the n-dimensional FFT into three parts. a pre-processing step.53). j = 3. . swap real and imaginary parts . . . n/2 − 1.52). n/2 − 1. The expression half-complex . n/2 − 1. 7.4). then an n/4-dimensional complex FFT. This is equivalent to a cosine transform [96] (Section 4.

If subsamples (say N/ncpus) are reasonably large. Each subsample is likely to have relatively the same distribution of short and long running sample data. Furthermore. the pieces (subsamples) may be easily distributed to independently running processors. or by sectioning (see Sections 3.” it is now generally agreed that MC simulations are nearly an ideal paradigm for parallel computing.f) or C (master.c).2. very unparallel in viewpoint. Vectorizing tasks running on each independent processor only makes the whole job run faster. This perspective considered existing codes whose logical construction had an outermost loop which counted the sample. Distributed memory parallelism (MIMD) is complemented by instruction level parallelism (SIMD): there is no conflict between these modes. see [121]). inter processor communication is usually very small. it has become clear that MC methods are nearly ideal simulations for parallel environments (e.g. Because of the many branches. 4. After this “attitude readjustment. distributing a sample datum of one particle/CPU on distributed memory machines resulted in a poorly load balanced simulation. Because of the independence of each sample path (datum). Just because one particular path may take a longer (or shorter) . The skew-symmetric transform is equivalent to representation as a Fourier sine series. Worse.MONTE CARLO (MC) METHODS 57 means that the sequence satisfies the complex-conjugate symmetry (2. say) started at its source was simulated to its end through various branches and decisions.51) and thus only n/4 elements of the complete n/2 total are needed. in many cases. domains may be split into subdomains of approximately equal volume. 3. Independently seeded data should be statistically independent and may be run as independent data streams. An important point about each of these symmetric cases is that each step is parallelizable: either vectorizable. A complete test suite for each of these four symmetric FFT cases can be downloaded from our web-server [6] in either Fortran (master. 2.1) both parallelizable and vectorizable. but by splitting up the sample N into pieces.5 Monte Carlo (MC) methods In the early 1980s. it was widely thought that MC methods were ill-suited for both vectorization (SIMD) and parallel computing generally. After a little more thought. load balancing is also good. The intersection between the concept of statistical independence of data streams and the concept of data independence required for parallel execution is nearly inclusive. however. hence improving the load balance. Not only can such simulations often vectorize in inner loops. Only final statistics require communication. A short list of advantages shows why: 1.2 and 3. 2. To clarify: a sample path (a particle. integration problems particularly. because they ran one particle to completion as an outer loop. these codes were hard to vectorize. All internal branches to simulated physical components were like goto statements.

i=1 (2. b F = a f (x) dx. Expectation values Ef = f mean the average value of f in the range a < x < b. ∞ H= 0 h(x) dx may also be transformed into the bounded interval by x= to get 1 z . on z ∈ U (0. ≈ |A| 1 N N f (xi ).e. there are far more accurate procedures [33] than MC. Writing x = (b − a)z. 2. it is easy to do infinite ranges as well. each subsample will on average take nearly the same time as other independent subsamples running on other independent processors.54) again with F = |A|Eg. 1)). the final statistic will consist of a sample of size N ·(1−1/ncpus) not N . In one dimension. which we approximate by the estimate F = (b − a) f ≡ (b − a)Ef. MC in one dimension provides some useful illustrations. For example. and all the xi s are uniformly distributed random numbers in the range a < xi < b. but this is likely to be large enough to get a good result.54) Here we call |A| = (b−a) the “volume” of the integration domain. If a processor fails to complete its task. For example. H= 0 1 h (1 − z)2 z 1−z dz. where g(z) is averaged over the unit interval (i. if one CPU fails to complete its N/ncpus subsample. Let us look at the bounded interval case. 1−z 0 < z < 1. Hence such MC simulations can be fault tolerant. It is not fault tolerant to divide an integration domain into independent volumes.58 APPLICATIONS time to run to termination.5.55) . However. g(z) = f ((a − b)z) gives us (2. however. (2. By other trickery. 5.1 Random numbers and independent streams One of the most useful purposes for MC is numerical integration. the total sample size is reduced but the simulation can be run to completion anyway.

x2 .56) To be useful in (2.55). in 2. k( u2 (1 − u)2 u(1 − u) (2. range rescaling is appropriate for multiple dimension problems as well—provided the ranges of integration are fixed. Quite generally. when this is appropriate (i.58) In this case. estimated by MC as an average over uniformly distributed sample points {xi |i = 1. a multiple dimension integral of f (x) over a measurable Euclidean subset A ⊂ Rn is computed by I= A f (x) dn x. . xn ). Density p is subject to the normalization condition p(x) dn x = 1. the problem is usually written as Ef = f (x)p(x) dn x. If the ranges of some coordinates depend on the value of others. Not surprisingly. and zero otherwise (x ∈ A). For an infinite volume. N }. . the integrand may have to be replaced by f (x) → f (x)χA (x).e. i=1 (2. . . ≈ |A| 1 N N f (xi ). ∞ 1 K= −∞ k(x)dx = 0 2u2 − 2u + 1 2u − 1 )du. . ≈ 1 N N f (xi ). where χA (x) is the indicator function: χA (x) = 1 if x ∈ A. Also. i=1 (2. Ef is the expected value of f in A. . . . With care. = |A|Ef.MONTE CARLO (MC) METHODS 59 The doubly infinite case is almost as easy. where x = (u − 1)/u − u/(u − 1).57) where |A| is again the volume of the integration region. A is bounded). . Likewise. one-dimensional integrals (finite or infinite ranges) may be scaled 0 < z < 1. the sample points are not drawn uniformly but rather from the distribution density p which is a weighting function of x = (x1 . = f p. The volume element is dn x = dx1 dx2 · · · dxn . we need h(x)/(1 − z)2 to be bounded as z → 1.56 we need that k(x)/u2 and k(x)/(1 − u)2 must be bounded as u → 0 and u → 1 respectively.

2]. However. An excellent introduction to the theory of these methods is Knuth [87.3. 1. 3. Do not trust any of them completely. k is Boltzmann’s constant. our concern will be parallel generation using these two basics. Sometimes x may be high dimensional and sampling it by acceptance/ rejection (see Section 2. For example. As we will discuss later. More than one should be tried. In general purpose libraries. In nearly all parallel applications.0 < x < 1. In a common case. this is very inefficient: typical function call overheads are a few microseconds whereas the actual computation requires only a few clock cycles. so procedure call overhead will take most of the CPU time if used in this way. Few general purpose computer mathematical library packages provide quality parallel or vectorized random number generators.60 APPLICATIONS If p(x) = χA (x)/|A|. the SPRNG generator suite provides many options in several data formats [135]. the computations are usually simple. Your statistics should be clearly reproducible.59). several facts about random number generators should be pointed out immediately. Most computer mathematics libraries have generators for precisely this uniformly distributed case. e−S(x) dn x (2. Z = exp(−S) dn x.2. That is.2. and T is the temperature. with many variants of them in practice. A good example is taken from statistical mechanics: S = E/kT . p takes the form p(x) = e−S(x) . wordsize = 224 (single precision IEEE arithmetic) or wordsize = 253 (double . The denominator in this Boltzmann case (2. with different initializations.59) which we discuss in more detail in Section 2. we get the previous bounded range situation and uniformly distributed points inside domain A. 121. Two basic methods are in general use. Many random number generators are poor. 2.5. where E is the system energy for a configuration x.5. Here.5.2 Uniform distributions From above we see that the first task is to generate random numbers in the range 0. 2. Parallel random number generators are available and if used with care will give good results [102. vol.0 < x < 1. 5.3. Linear congruential (LC) generators which use integer sequences xn = a · xn−1 + b mod wordsize (or a prime) to return a value ((float)xn )/((float)wordsize). computation takes only a few nanoseconds. 135]. They are 1. routines are often function calls which return one random number 0. and the results compared.0.1) from the probability density p(x) becomes difficult. In particular.0 at a time.3. We will consider simulations of such cases later in Section 2.5. is called the partition function.

requires memory space for a buffer.i<LAGQ. and the operator ⊗ can be one of several variants.LAGQ. and ⊗ is floating point add (+). For example.MONTE CARLO (MC) METHODS 61 precision IEEE).0.0 < x < 1. . q) at a time. 63). in IEEE double precision (53 bits of mantissa. but returns only one result/call. and unfortunately there are not many really good choices [87]. multiply. max(p. if any xi in the sequence repeats. 2. Here is a code fragment to show how the buffer is filled. Other more sophisticated implementations can give good results. but they use more memory or more computation. upper bits of the first argument. and (xu |xl ) means the concatenation of the u . The so-called period of the LC method is less than wordsize. q = 273. On a machine with a 1 GHz clock and assuming that the basic LC operation takes 10 cycles. Notice that the LC method is a one-step recursion and will not parallelize easily as written above. and l bits of the second argument [103]. We assume LAGP = p > q = LAGQ for convenience [14. and structures like op(xn−p . xn−q+1 ) ≡ xn−p ⊕ ((xu |xl n−q m−q+1 )M ) where M is a bit-matrix. . Additionally.0. In these integer cases. using IEEE single precision floating point arithmetic. (2. the whole series repeats in exactly the same order. that is 0. There are better choices for both the lags and the operator ⊗. Some simple choices are p = 607.60) where the lags p. and its generalizations (see p. the random number stream repeats itself every 10 s! This says that the LC method cannot be used with single precision IEEE wordsize for any serious MC simulation. q) is the storage required for this procedure. kP = 0. The LF method. Repeating the same random sample statistics over and over again is a terrible idea. if a is chosen to be odd. bit-wise exclusive or (⊕). kQ = LAGP . not 24). Lagged Fibonacci (LF) sequences xn = xn−p ⊗ xp−q mod 1. Other operators have been chosen to be subtract. the 59-bit NAG [68] C05AGF library routine generates good statistics. Clearly. b may be set to zero. To put this in perspective. Because the generator is a one-step recurrence. the routine returns a floating point result normalized to the wordsize. It is easy to parallelize by creating streams of such recurrences. Using a circular buffer. The LF method. The LC methods require only one initial value (x0 ). The choice of the multiplier a is not at all trivial. however. can be vectorized along the direction of the recurrence to a maximum segment length (V L in Chapter 3) of min(p. the maximal period of an LC generator is 224 . the situation is much better since 253 is a very large number. q must be chosen carefully. This means that a sequence generated by this procedure repeats itself after this number (period).i++){ t = buff[i] + buff[kQ+i]. for(i=0. xn−q . 104].

M. V L independent streams x[i] = a · xn−1 mod wordsize. . } Uniformly distributed numbers x are taken from buff which is refilled as needed. In our tests [62. 2.i<LAGP-LAGQ.62 APPLICATIONS buff[i] = t . L¨scher’s RANLUX [98] in F.63) should give independent streams. so long lags are recommended—or a more sophisticated operator ⊗. or seeds (LF).0. Here is how it goes: if the period of the random number generator (RNG) is extremely long (e.g. kQ = 0.(double)((int) t). n [i] [i] [i] (2. we see obvious modifications of the above procedures for parallel (or vector) implementations of i = 1.61) and (2.61) or the Percus–Kalos [119] approach. James’ u implementation [79] also tests well at high “luxury” levels (this randomly throws away some results to reduce correlations). . where p[i] are independent primes for each stream [i]: x[i] = a · xn−1 + p[i] mod wordsize n and x[i] = xn−p ⊗ xn−q mod 1. . Crucial to this argument is that the generator should have a very long period. q) must 2 wordsize · (2 repeat for the whole sequence to repeat.i++){ t = buff[i+kP] + buff[i+kQ]. . the Mersenne Twister [103] shows good test results and the latest implementations are sufficiently fast. this routine sensibly returns many random numbers per call. In our parallel thinking mode.63) The difficulty is in verifying whether the streams are statistically independent.(double)((int) t). much longer than the wordsize. however. The maximal periods of LF sequences are extremely large 1 p∨q − 1) because the whole buffer of length max(p.62) (2. n [i] (2. 124]. however. These sequences are not without their problems. } kP = LAGQ. Furthermore. It is clear that each i counts a distinct sample datum from an independent initial seed (LC). Hypercubes in dimension p ∨ q can exhibit correlations. A simple construction shows why one should expect that parallel implementations of (2. 223 · (2p∧q − 1) for a p. q lagged Fibonacci generator in single precision IEEE . for(i=0. buff[i + kP] = t .

ggl) seem to give very satisfactorily independent streams. since the parallel simulations are also unlikely to use even an infinitesimal fraction of the period. which like the WELL generators themselves. In addition. called by the acronym WELL (Well Equidistributed Long-period Linear). however.60) can be constructed which use more terms. One cannot be convinced entirely by this argument and much testing is therefore required [62. 79]. finding a suitable matrix M is a formidable business.MONTE CARLO (MC) METHODS Sequence3 Sequence1 63 Cycle for RNG generator Sequence2 Fig. and Matsumoto. At the time of this 2nd printing. then any possible overlap between initial buffers (of length p ∨ q) is very unlikely. 3 in Figure 2.15) are sufficiently small relative to the period (cycle). arithmetic). Graphical argument why parallel RNGs should generate parallel streams: if the cycle is astronomically large. 135]. where p is prime. apparently the best generators are due to Panneton. Furthermore. hence the sequences can be expected to be independent. If the parallel sequences (labeled 1.15 shows the reasoning.g. The full complexity of this set of WELL generators is beyond the scope of this text but curious readers should refer to reference [117]. More general procedures than (2. when used in parallel with initial buffers filled by other generators (e. . the probability of overlap is tiny. Figure 2. For vectors of large size.15. can be downloaded. An inherently vector procedure is a matrix method [66. none of the user sequences will overlap. 2. L’Ecuyer. M is a matrix of “multipliers” whose elements must be selected by a spectral test [87]. 67] for an array of results x: xn = M xn−1 mod p. the likelihood of overlap of the independent sequences is also very small. 2. 135] and variants [61. Very long period generators of the LF type [124. the web-site given in this reference [117]. also contains an extensive set of test routines.

SFI Origin2000. IBM SP2. if u = P (x). non-uniformly distributed random numbers.2 form the base routine for most samplings. This suite has been extensively tested [35] and ported to a wide variety of machines including Cray T-3E.65) where u1 . .60): xn = xn−p × xn−q mod 264 .0: see [34. where x is a real scalar. (2. If it is the case that P −1 is easily computed. The simplest method.61): xn = a · xn−1 mod (261 − 1) Modified LF (2. when it is possible.1]. . Imagine that we wish to generate sequences distributed according to the density p(x). HP Superdome.5. a particularly useful package SPRNG written by Mascagni and co-workers contains parallel versions of six pseudorandom number generators [135].5.64 APPLICATIONS For our purposes.3 Non-uniform distributions Although uniform generators from Section 2. is a sequence of uniformly distributed. mutually independent. The cumulative probability distribution is z P [x < z] = P (z) = −∞ p(t) dt. then P −1 (u) exists.64) Direct inversion may be used when the right-hand side of this expression (2. u2 . is by direct inversion. and all Linux systems. theorem 2. then to generate a sequence of mutually independent. To be more precise: since 0 ≤ P (x) ≤ 1 is monotonically increasing (p(x) ≥ 0). one computes xi = P −1 (ui ).0 < ui < 1.60): xn = xn−p + xn−q mod 232 (1st sequence) yn = yn−p + yn−q mod 232 (2nd sequence) (nth result) zn = xn ⊕ yn • LF with integer multiply (2.61): xn = a · xn−1 + b mod 264 LC with prime modulus (2. The issue is whether this inverse is easily found. . other non-uniformly distributed sequences must be constructed from uniform ones.64) is known and can be inverted. .61): xn = a · xn−1 + b mod 248 LC 64–bit (2. 2. This suite contains parallel generators of the following types (p is a user selected prime): • Combined multiple recursive generator: xn = a · xn−1 + p mod 264 yn = 107374182 · yn−1 + 104480 · yn−5 mod 2147483647 zn = (xn + 232 · yn ) mod 264 • • • • (nth result) Linear congruential 48–bit (2. (2. real numbers 0.

x > 0 because P (x) = u = 1 − exp(−λx) is easily inverted for x = P −1 (u). cos. 1). either k/2 or (k −1)/2. The sequence generated by xi = − log(ui /λ) will be exponentially distributed. 1/2 < x0 ≤ 1.5. = −2 ln(u2 ). involves computing only the square root of an argument between 1/2 and 2.1 An easy case is the exponential density p(x) = λ exp(−λx). this means SIMD (see Chapter 3) modes wherein multiple streams of operands may be computed concurrently. this takes the recursive form yn+1 = 1 2 yn + x0 yn . where its second form is used when k is odd [49]. . Chapter 4) and shifting it right one bit. particularly in parallel. a fixed number of Newton steps is used (say n = 0. = sin(t1 ) · t2 . . // u1 ∈ U (0. 3. u2 (2.66) An essential point here is that the library functions sqrt. // u2 ∈ U (0. is effected by masking out the biased exponent. that is. p.60). A normally distributed sequence is a common case: 1 P [x < z] = √ 2π z −∞ e−t 2 /2 dt = 1 1 + erf 2 2 z √ 2 .MONTE CARLO (MC) METHODS 65 Example 2. so are {1 − ui }. then √ √ √ x = 2k/2 x0 = 2(k−1)/2 2 · x0 . the Box–Muller method. A Newton method is usually chosen to √ find the new mantissa: where we want y = x0 . the number depending on the desired precision and the accuracy of . 1). Fortunately a good procedure. and (2. Things are not usually so simple because computing P −1 may be painful. but the essence of the thing starts with a range reduction: where x = 2k x0 . either x0 or 2 · x0 . (2. removing the bias (see Overton [114]. and sin must have parallel versions. We have already briefly discussed parallel generation of the uniformly distributed variables u1 . Since the floating point representation of x is x = [k : x0 ] (in some biased exponent form). . This follows because if {ui } are uniformly distributed in (0.63). Here the inversion of the error function erf is no small amount of trouble. Usually. 1). computing the new exponent. . 61. = cos(t1 ) · t2 . Thus. Function sqrt may be computed in various ways. is known and even better suited for our purposes since it is easily parallelized. (2. log. It generates two normals at a time t1 t2 z1 z2 = 2πu1 . √ √ The new mantissa. a better method for generating normally distributed random variables is required. For purposes of parallelization.61).

parallel implementations for V L independent arguments (i = 1. In the Intel MKL library [28]. However. . Also. for large arguments this ranging is the biggest source of error. one computes log x = log2 (e) log2 (x) from log2 (x) by log2 (x) = k + log2 (x0 ). . It costs nearly the same to compute both cos and sin as it does to compute either. [49]). by division modulo 2π. x is always ranged 0 < x < 2π. Once this ranging has been done. [49. . Assuming a rational approximation. . By appropriate ranging and circular function identities. respectively. For example. section 6. cos x1 [i] =P x1 [i] 2 . Also.5.66 APPLICATIONS the initial approximation y0 ). log2 [i] x0 [i] = P x0 [i] [i] . for our problem. . a sixth to seventh order polynomial in x2 is used: in 1 parallel form (see Section 3. where i = 1. The range of x0 is 1/2 < x0 ≤ 1. V L) would look like: log2 x[i] = k [i] + log2 x0 . see Hart et al. . see Luke [97. V L.2.4]. Newton steps are then repeated on multiple data: yn+1 = [i] 1 2 [i] yn + x0 [i] yn [i] . the Linux routine in /lib/libm. see Hart et al. . The remaining computation cos x1 (likewise sin x1 ) is then ranged 0 ≤ x1 < π 2 using modular arithmetic and identities among the circular functions. see Section 3. . see Section 3. Q x0 where P and Q are relatively low order polynomials (about 5–6.5). for i = 1. In the general case. particularly table 3.so sincos returns both sin and cos of one argument. .6].3. section 3. most intrinsic libraries have a function which returns both given a common argument. Finally. . writing x = 2k · x0 . Again. first the argument x is ranged to 0 ≤ x1 < 2π: x = x1 + m · 2π. cos and sin: because standard Fortran libraries support complex arithmetic and exp(z) for a complex argument z requires both cos and sin. the number of independent square roots to be calculated. To do this. . and the Cray intrinsic Fortran library contains a routine COSS which does the same thing. the relevant functions are vsSinCos and vdSinCos for single and double. Computing log x is also easily parallelized. computing cos is enough to get both cos and sin. V L. The log function is slowly varying so either polynomial or rational approximations may be used. . .

see Section 3. To us. The label Polar indicates the polar method. Machines are PA8700.5. and NEC SX-4). an interesting number: univariate normal generation computed on the NEC SX-4 in vector mode costs (clock cycles)/(normal) ≈ 7. but are simply selections from well-known algorithms that are appropriate for parallel implementations. which includes uniformly distributed variables.2. while pv BM means a “partially” vectorized Box–Muller method. HP9000. Cray SV-1. see Section 2. On the machines we tested (Macintosh G-4. Finally.2 [124]. PA8700 BM P3 Polar P4 Polar P3 pv BM P4 pv BM 106 0. In any case. The compilers were icc on the Pentiums and gcc on the PA8700. 2. Label pv BM means that ggl was used to generate two arrays of uniformly distributed random variables which were subsequently used in a vectorized loop. These approximations are not novel to the parallel world.p. Pentium III. Box–Muller was always faster and two machine results are illustrated in Figure 2.16. add and 8 f.2.16. Uniforms were generated by ggl which does not vectorize easily. multiply units.001 104 Fig. the Box–Muller method illustrates two features: (1) generating normal random variables is easily done by elementary functions.1 0. Pentium III.8. see Section 2.01 (1) (2) (3) (4) (5) (6) 105 n PA8700 Pol. and Pentium 4. Pentium 4. this number is astonishingly small but apparently stems from the multiplicity of independent arithmetic units: 8 f. and (2) the procedure is relatively easy to parallelize if appropriate approximations for the elementary functions are used.p. the Box–Muller method is faster than the polar method [87] which depends on acceptance rejection.MONTE CARLO (MC) METHODS 67 So. . polar method for generating univariate normals.5.5. Intel compiler icc vectorized the second part of the split loop (1) (2) 1 (3) (4) (5) (6) Time (s) 0. hence if/else processing. Timings for Box–Muller method vs.

While there remains some dispute about the tails of the distributions of Box–Muller and polar methods.17. This is done by finding another distribution function. . y) : 0 ≤ y ≤ cq(x)} covers the set A = {(x. Here.3. and then we give some examples. y) are uniformly distributed in both A and S.2].7 for the relevant compiler switches used in Figure 2. 2. Distribution density q(x) must be easy to sample from and c > 1. gcc -O3 produced faster code than both the native cc -O3 and guidec -O4 [77]. the results of these two related procedures are quite satisfactory. we showed the Box–Muller method for normals and remarked that this procedure is superior to the so-called polar method for parallel computing. In Equation (2. which is easy to sample from and for which there exists a constant S \A Frequency p(x) A cq(x) x Fig. whose cumulative distribution is difficult to invert (2. section 3. the speedups are impressive—Box–Muller is easily a factor of 10 faster. First we need to know what the acceptance/rejection (AR) method is. 2.16. say q(x). We wish to sample from a probability density function p(x).5. The idea is due to von Neumann and is illustrated in Figure 2. see Gentle [56.66). we show why the Box–Muller method is superior to acceptance/rejection polar method.1 Acceptance/rejection methods In this book we wish to focus on those aspects of MC relating parallel computing.1. y) : 0 ≤ y ≤ p(x)} and points (x.65). On Cray SV-2 machines using the ranf RNG.5. On one CPU of Stardust (an HP9000).17. while at the same time emphasize why branching (if/else processing) is the antithesis of parallelism [57]. while the covering is cq(x). AR method: the lower curve represents the desired distribution density p(x). experience shows that if the underlying uniform generator is of high quality. The set S = {(x.68 APPLICATIONS nicely but the library does not seem to include a vectorized RNG. See Section 3.

y) : 0 ≤ y ≤ p(x)}. For this reason. q(x). . A = {(x. Hence x is distributed according to the density p(x). x → x. Since y is uniformly distributed. y) = q(x)h(y|x). y) points are uniformly distributed in the set S = {(x. where h(y|x) is the conditional probability density for y given x. q(x) pick u ∈ U (0. y) ∈ S under the curve p as shown in Figure 2. We get p[xy] (x. which contains the (x. Theorem 2. cq(x) h dy = 1.17. let A ⊂ S. y is thus uniformly distributed in the range 0 ≤ y ≤ cq(x). a constant. hence the probability density function for these (x. Here is the general algorithm: P1: pick x from p. 2 The generalization of AR to higher dimensional spaces. y) : 0 ≤ y ≤ cq(x)}. Proof Given x is taken from the distribution q. P [x ≤ X] = X p(x) X dx −∞ 0 dy = −∞ p(x) dx. y) = 1/c. and is then h(y|x) = 1/(cq(x)).2 Acceptance/rejection samples according to the probability density p(x). involves only a change in notation and readers are encouraged to look at Madras’ proof [99]. y) is a constant (the X → ∞ limit of the next expression shows it must be 1). 1) y = u · c · q(x) if (y ≤ p(x)) then accept x else goto P1 fi Here is the proof that this works [99].f. Since A ⊂ S. these for 0 (x.5. The joint probability density function is therefore p[xy] (x. h(y|x) does not depend on y. which is von Neumann’ result.d. Now. the points (x.MONTE CARLO (MC) METHODS 69 1 < c < ∞ such that 0 ≤ p(x) ≤ cq(x) for all x in the domain of functions p(x). Therefore. y) ∈ A are uniformly distributed.

The basic antithesis is clear: if processing on one independent datum follows one instruction path while another datum takes an entirely different one. which again generates two univariate normals (z1 . Possible (v1 . the chance that (v1 . 2. Number of x tried c The polar method. v2 ) will fall within the circle is the ratio of the areas of the circle to the square. each may take different times to complete. v2 ) coordinates are randomly selected inside the square: if they fall within the circle. they cannot be processed REJECT ACCEPT |S| (υ1. z2 ) is a good illustration. A circle of radius one is inscribed in a square with sides equal to two.70 APPLICATIONS The so-called acceptance ratio is given by Acceptance ratio = 1 Number of x accepted = < 1. they are accepted. Here is the algorithm: P1: pick u1 . That is. u2 ∈ U (0. if the fall outside the circle. 1) v1 = 2u1 − 1 v2 = 2u2 − 1 2 2 S = v1 + v 2 if (S < 1)then z1 = v1 −2 log(S)/S z2 = v2 −2 log(S)/S else goto P1 fi and the inscription method is shown in Figure 2. .18. This is a simple example of AR. they are rejected. Polar method for normal random variates. for any given randomly chosen point in the square.18. Hence. and the acceptance ratio is π/4.υ2) Fig.

the code is only suitable for 32-bit floating point. distributed over multiple independent processors.2 regarding scatter/gather operations) can be a way to index accepted points but not rejected ones. An indirect address scheme (see Section 3. uses this idea. For larger tasks. c ∼ 1). each with the same area: see Figure 2.19 shows. AR can occasionally fail miserably when the dimension of the system is high enough that the acceptance ratio becomes small. Our essential point regarding parallel computing here is that independent possible (v1 . Their idea is to construct a set of strips which cover one half of the symmetric distribution. Each strip area is the same. For example. AR should be avoided for small tasks. In a general context. of course. Conversely. On independent processors. p(n. the code actually has 256 strips. uses Marsaglia’s method [101]. On older architectures where cos. the lack of synchrony is not a problem but requesting one z1 . The computation is in effect a table lookup to find the strip parameters which are pre-initialized. the large inherent latencies (see Section 3. z2 pair from an independent CPU is asking for a very small calculation that requires resource management. If a branch is nearly always taken (i. v2 ) pairs may take different computational paths. Figure 2. but a better algorithm wins. sin. apparently not lost on Marsaglia and Tsang. The returned normal is given an random sign. a job scheduling operating system intervention.2. but the following calculation shows that completely aside from parallel computing. however. In some situations. To us. so the strips may be picked uniformly. the polar method was superior. again not easily modified. the results using AR for small tasks can be effective.19(a).MONTE CARLO (MC) METHODS 71 concurrently and have the results ready at the same time. µ) = e−mu µn /n!. We will be more expansive about branch processing (the if statement above) in Chapter 3. z2 pair by the polar method is too small compared to the overhead of task assignment. The speed performance is quite impressive. An integer routine fische which returns values distributed according to a Poisson distribution. the tiny computation of the z1 . so this method holds considerable promise. AR cannot be avoided. You can download this from the zufall package on NETLIB [111] or from our ftp server [6] . for example. The lowest lying strip. for a procedure call costs at least 1–5 µs. Unlike the diagram. There is an important point here. handing the tail. Not to put too fine a point on possible AR problems. thus they cannot be computed concurrently in a synchronous way.e. Their method is extremely fast as Figure 2. The rejection rate is very small. Furthermore.2. the lesson to be learned here is that a parallel method may be faster than a similar sequential one. as written. not 8.19 shows a stratified sampling method for Gaussian random variables by Marsaglia and Tsang [100]. but there are problems with the existing code [100]: the uniform random number generator is in-line thus hard to change.1) of task assignment become insignificant compared to the (larger) computation. Our example problem here is to sample uniformly . and other elementary functions were expensive.

72 APPLICATIONS inside. the surface of. . vn = 2un − 1 2 2 2 S = v1 + v2 + · · · + vn if (S < 1)then accept points {v1 . (top) A simplified diagram of the stratified sampling used in the procedure [100]. .. . 2. . un each uj ∈ U (0. not 8.. u2 . there are 256 strips. Ziggurat method. vn } else goto P1 fi 1 2 100 (1) (2) 10–1 (3) (4) 4 x Time (s) 10–2 (1) P3 pv BM (2) P4 pv BM (3) P3 Zigrt. .19. Box–Muller vs. a unit n-sphere. Here is the AR version for interior sampling: P1: pick u1 . In the actual code. . Ziggurat on a Pentium III and Pentium 4. (4) P4 Zigrt. . or on. 1) v1 = 2u1 − 1 v2 = 2u2 − 1 . 10–3 10–4 104 105 n 106 Fig. v2 . (bottom) Comparative performance of Box–Muller vs. . .

∼ √ αn 2 2π √ for large n.65).. p. 1 Since 0 p(r) dr = 1.25×10−7 A vastly better approach is given in Devroye [34. the density function for isotropically distributed points is p(r) = k · rn−1 .46 and Ωn is the solid angle volume of an n-sphere [46. .18) gets small fast: Ωn 0 drrn−1 Volume of unit n-sphere .. n { xi = rxi } } 1 A plot of the timings for both the AR and isotropic procedures is shown in Figure 2.2]. n { xi = zi /r } if (interior points) { // sample radius according to P = rn u ∈ U (0.308 0. n { zi = box-muller } n 2 r= i=1 zi // project onto surface for i = 1. = Volume of n-cube 2n n −(n+1)/2 1 ..0025 0.524 0. .20. for i =1. section 4. Constant α = eπ/2 ≈ 1.. The following table shows how bad the acceptance ratio gets even for relatively modest n values.. 2π n/2 . . we see a rapidly decreasing acceptance ratio for large n.785 0. 1) r = u1/n for i = 1. This is an isotropic algorithm for both surface or interior sampling... then k = n and P [|x| < r] = rn . n 2 3 4 10 20 Ratio 0. . Ωn = Γ(n/2) Hence.. Sampling of |x| takes the form |x| = u1/n from (2. 234].MONTE CARLO (MC) METHODS 73 Alas.. using Stirling’s approximation. here the ratio of volume of an inscribed unit radius n-sphere to the volume of a side = 2 covering n-cube (see Figure 2. If interior points are desired.

when S(x) > 0 has the contraction . generating configurations of x according to the density p(x) = e−S(x) /Z can be painful unless S is say Gaussian. The usual problem of interest is to find the expected (mean) value of a physical variable. For an arbitrary action S(x). physicists are more interested in statistical properties than about detailed orbits of a large number of arbitrarily labeled particles.59). the Euclidean space in n-dimensions.20. 2.59). we mentioned a common computation in quantum and statistical physics—that of finding the partition function and its associated Boltzmann distribution. In that case. 2. while the absurdly rising curve shows the inscription method. but already astrophysical simulations are doing O(108 )—an extremely large number whose positions and velocities each must be integrated over many timesteps. say f (x). Timings on NEC SX-4 for uniform interior sampling of an n-sphere. For the foreseeable future.5. Lower curve (isotropic method) uses n normals z. For our simplified discussion. Equation (2. computing machinery will be unlikely to handle NA particles.74 APPLICATIONS 104 Time (s) for 643 points Inscribe Isotropic x x 102 100 10–2 0 50 n 100 Fig.022 × 1023 ) and the 6 refers to the 3 space + 3 momentum degrees of freedom of the NA particles. however. In an extreme case. where NA is the Avogadro’s number (NA ≈ 6. the problem is computing the expected value of f . we assume the integration region is all of En . which basically means the particles have no interactions. A relatively general method exists.3. Frequently. where x may be very high dimensional.2 Langevin methods Earlier in this chapter.59)) Ef = e−S(x) f (x) dn x . written (see (2. Z (2. n = dim [x] ∼ 6 · NA .67) where the partition function Z = e−S(x) dn x normalizes e−S /Z in (2.

Item (1) says w(t) is a diffusion process. δij = 1. is the Laplace operator in n-dimensions. where h is the time step h = ∆t. In small but finite form.MONTE CARLO (MC) METHODS 75 property that when |x| is large enough. E∆wi (tk )∆wj (tl ) = hδij δkl . it is possible to generate samples of x according to p(x) by using Langevin’s equation [10].70) (2.69) 2 ∂x where w is a Brownian motion. x· ∂S > 1.68) In this situation. Item (3) says increments dw are only locally correlated.15) and (2. t = 0) = δ(x) 3. 1. The probability density for w satisfies the heat equation: 1 ∂p(w. What are we to make of w? Its properties are as follows. The t = 0 initial distribution for p is p(x. t) = ∂t 2 where p(w. and the stationary distribution will be p(x). these equations are E∆wi (tk ) = 0. (2. ∂x (2.79) below. (2.3.71) . 106]. if i = j and zero otherwise: the δij are the elements of the identity matrix in n-dimensions. but δ(x) dn x = 1. equation 11. ∂wi 2. and δ(x) is the Dirac delta function—zero everywhere except x = 0. If S has the above contraction property and some smoothness requirements [86. That is n p= i=1 ∂2p 2. t). Equation (2. Edwi (t) dwj (s) = δij δ(s − t) dt ds. which usually turn out to be physically reasonable. a Langevin process x(t) given by the following equation will converge to a stationary state: 1 ∂S dx(t) = − dt + dw(t). In these equations.69) follows from the Fokker–Planck equation (see [10]. which by item (2) means w(0) starts at the origin and diffuses out with a mean square E|w|2 = n · t. Its increments dw (when t → t + dt) satisfy the equations: Edwi (t) = 0.

Here is a one-dimensional version for the case of a continuous Markov process y. ∆t) will be near y ∼ z. t) as a function of t. For small changes ∆t.e. ∆t)p(z|x.74) is known as the Chapman– Kolmogorov–Schmoluchowski equation [10] and its variants extend to quantum probabilities and gave rise to Feynman’s formulation of quantum mechanics. we will get all probabilities. t) dy = 1. t) dz 2 ∂z 2 f (y)p(y|x. p(z|x. t1 + t2 ) = p(z|y. after time t. To do so. Equation (2. see Feynman–Hibbs [50]. say f ∈ C ∞ . most of the contribution to p(y|z.75) = lim ∆t→0 1 ∂ 2 f (z) (y − z)2 p(y|z. ∆t)p(z|x. The important properties of p(y|x. We would like to find an evolution equation for the transition probability p(y|x. t) dy ∂t 1 f (y) (p(y|x. let f (y) be an arbitrary smooth function of y. p(y|x. t) = transition probability of x → y. t) dz) dy ∂f (z) (y − z)p(y|z. = lim ∆t→0 ∆t f (y) = lim ∆t→0 1 ∆t f (y) p(y|z.74) says that after t1 . t) dz ∂z dy dy (2. t + ∆t) − p(y|x. t) dy ∂t 1 ∆t + + (f (z)p(y|z. (2. t) dy. Then using (2. ∆t)p(z|x. y) and if we sum over all possible ys.74) (2. + O((y − z)3 ) − .73) says that x must go somewhere in time t. ∆t)p(z|x.74) ∂p(y|x. t2 )p(y|x.76 APPLICATIONS There is an intimate connection between parabolic (and elliptic) partial differential equations and stochastic differential equations. t1 ) dy.72) where (2. x must have gone somewhere (i. and (2.73) (2. t) dy . t)) dy. t) dz − p(y|x. so we expand f (y) = f (z) + fz (z)(z − y) + 1 fzz (z)(y − z)2 + · · · to get 2 f (y) ∂p(y|x. t) are p(y|x.

one makes the following assumptions [10] p(y|z. ∂t ∂y 2 ∂y 2 (2. Since f (y) is perfectly arbitrary. t)] [b(z)p(z|x.76) (2.76) and (2. the coefficients are b(x) = − and a(x) = I.79) (2.69) is thus ∂p 1 ∂ = · ∂t 2 ∂x ∂S ∂p p+ ∂x ∂x .79) is the Fokker–Planck equation and is the starting point for many probabilistic representations of solutions for partial differential equations. in the ∆t → 0 limit f (z) ∂p(z|x. (2. t) f (z) − ∂f 1 ∂2f b(z) + a(z) ∂z 2 ∂z 2 dz. ∆t)(y − z) dy = b(z)∆t + o(∆t). Substituting (2. p(y|z. In our example (2.MONTE CARLO (MC) METHODS 77 If most of the contributions to the z → y transition in small time increment ∆t are around y ∼ z.77) ∂ 1 ∂2 [a(z)p(z|x. ∂x ∂x . t)] + [a(y)p(y|x. for m > 2. ∆t)(y − z)m dy = o(∆t). so as t → ∞ ∂S ∂p p+ = constant = 0. The Fokker–Planck equation for (2. Generalization of (2. The coefficient b(y) becomes a vector (drift coefficients) and a(y) becomes a symmetric matrix (diffusion matrix).77) into (2. we integrated by parts and assumed that the large |y| boundary terms are zero. 1 ∂S 2 ∂x And the stationary state is when ∂p/∂t → 0.78) In the second line of (2. we get using (2.73) and an integration variable substitution.69). dz.79) to dimensions greater than one is not difficult. p(y|z.75). ∆t)(y − z)2 dy = a(z)∆t + o(∆t). t) dz = ∂t = p(z|x.78). it must be true that ∂ 1 ∂2 ∂p(y|x. t)] + ∂z 2 ∂z 2 Equation (2. t)]. t) = − [b(y)p(y|x. the identity matrix.

T λsmall .68) is positive when |x| is large. and normally distributed random numbers ξ is generated each by the Box–Muller √ method (see Equation (2. N per time step: √ ∆w[i] = hξ [i] .68). k xk+1 = xk + ∆x.67) may be computed by a long time average [118] Ef ≈ ≈ 1 T T f (x(t)) dt. hence we get (2. Process x(t) will become stationary when m is large enough that T = h · m will exceed the relaxation time of the dynamics: to lowest order this means that relative to the smallest eigenvalue λsmall > 0 of 1.78 APPLICATIONS if p goes to zero at |x| → ∞. and (2) we can accelerate the convergence by the sample average over N . (2. . The convergence rate. (2. then. . The most useful parallel form is to compute a sample N of such vectors. p(x.69)? We only touch on the easiest way here.82) Two features of (2. How. a new vector of mutually independent. k [i] = [i] xk + ∆x . that is. is O(1/mN ): √ the statistical error is O(1/ mN ). measured by the variance. ∆x = − h 2 ∂S ∂x + ∆w. then x(t) will converge to a stationary process and Ef in (2. i = 1. Z where 1/Z is the normalization constant. 0 m N 1 mN f xk k=1 i=1 [i] .81) If S is contracting. univariate. a simple forward Euler method: √ ∆w = hξ. t → ∞) = e−S(x) . the Jacobian [∂ 2 S/∂xi ∂xj ]. . (2. (2. that is.66)) and the increments are simulated by ∆w = hξ. . ∆x[i] = − [i] xk+1 h 2 ∂S ∂x [i] + ∆w[i] . Integrating once more. T = m · h. do we simulate Equation (2.80) At each timestep of size h.82) should be noted: (1) the number of time steps chosen is m.

The downside of such an explicit formulation is that Euler methods can be unstable for a large stepsize h. parallelism is preserved. N = 1 may be chosen to save memory. The remaining i = 1 equation is also very likely to be easy to parallelize by usual matrix methods for large n. Simulated two-dimensional Brownian motion. To end this digression on Langevin equations (stochastic differential equations).21. and the updates per time step take the following form.01. . 10 5 0 y –5 –10 –10 –5 0 x 5 10 Fig. There are 32 simulated paths. is very large. In that situation. higher order methods. we show 32 simulated paths of a two-dimensional Brownian motion. with timestep h = 0 .MONTE CARLO (MC) METHODS 79 There are several reasons to find (2. 2.1 and 100 steps: that is.82) attractive. the size of the vectors x and w. implicit or semi-implicit methods.80) is an explicit equation—only old data xk appears on the right-hand side. w2 The pair ξ are generated at each step by (2. w= the updates at each time step are wk+1 = wk + √ h ξ1 . If n. ξ2 w1 . 106. All paths begin at the origin. 124]. or small times steps can be used [86. In each of these modifications. Where. Equation (2.66) and h = 0. the final time is t = 100 h. In Figure 2. we show examples of two simple simulations.21.

It may seem astonishing at first to see that the drift term. and p = e−2|x| follows from two integrations.1 MC integration in one-dimension. Integration was by an explicit trapezoidal rule [120]. p ∼ exp(−2|x|). in Figure 2. The parameters are N = 10 . The Langevin equation (2. −→ 0. The second integration constant is determined by the same normalization.80 104 APPLICATIONS dx = –(x/|x|) dt + dw. However. The first is to a constant in terms of x = ±∞ boundary values of p and p .22. where N is the sample size. ∂t ∂x 2 ∂x2 t→∞ The evolution ∂p/∂t → 0 means the distribution becomes stationary (time independent). And finally.01 . N = 10.5. Convergence of the optimal control process. that is. is sufficiently contracting to force the process to converge. As we discussed in Section 2.80) for this one-dimensional example is dx = −sign(x) dt + dw. x0 = 0.79) which in this case takes the form: ∂p ∂ 1 ∂2 = [sign(x)p ] + p. which must both be zero if p is normalizable. MC methods are particularly useful to do integrations. Again. this is the Fokker–Planck equation (2. t) dx = 1. 2. we show the distribution of a converged optimal control process. TR. h = 0.01] 103 Frequency 102 101 100 10–1 –5 0 Process value x(t) 5 Fig. Exercise 2.240.22. −sign(x) dt. p(x. 240 and time step h = 0 . But it does converge and to a symmetric exponential distribution. we also pointed out that the statistical error is proportional to N −1/2 . t = 100 [expl. This statistical error can be large since N −1/2 = 1 percent .

For x = 1. The method consists in picking a useful φ whose integral Iφ is known. i=1 where Iφ is the same as I except that g is replaced by φ. That is. Notice that for all positive x. A good one for the modified Bessel function is φ(u) = 1 − cos(πu/2) + 1 (cos(πu/2))2 . 2. this constant can be reduced significantly.MONTE CARLO (MC) METHODS 81 when N = 104 . we want you to try some variance reduction schemes. do several runs with different seeds) for both this method and a raw integration for the same size N —and compare the results. the zeroth order modified Bessel function with argument x. The idea is that an integral 1 I= 0 g(u) du can be rewritten for a control variate φ as 1 1 I= 0 (g(u) − φ(u)) du + 0 N φ(u) du. although the statistical error is k · N −1/2 for some constant k.g. Try the I+ + I− antithetic variates method for I0 (1) and compute the variance (e. ≈ 1 N [g(ui ) − φ(ui )] + Iφ . What is to be done? The following integral I0 (x) = 1 π 0 π e−x cos ζ dζ is known to have a solution I0 (x). The method of antithetic variables applies in any given domain of integration where the integrand is monotonic [99]. the antithetic variates method is I0 (1) ≈ I+ + I− = 1 N N i=1 e− cos πui /2 + 1 N N i=1 e− cos π(1−ui )/2 . Abramowitz and Stegun [2]. the antithetic variates method applies if we integrate only to ζ = π/2 and double the result. So. see. for example. or any comparable table. 1. The method of control variates also works well here. In this exercise. the integrand exp(−x cos ζ) is monotonically decreasing on 0 ≤ ζ ≤ π/2 and is symmetric around ζ = π/2. The keyword here is proportional. 2 .

2 Solving partial differential equations by MC. y. z) = 0. Exercise 2. z) = Eg(X(τ )) exp − 0 τ v(X(s)) ds .3. x2 y2 z2 + 2 + 2 <1 . y. y. For this φ compute I0 (1). = EY (τ ). Our Dirichlet 2 a b c The potential is v(x. • Generate N realizations of X(t) and integrate the following system of stochastic differential equations (W is a Brownian motion) dX = dW. At the heart of the simulation lies the Feynman–Kac formula.2) and for convenience. Again. y. (2.82 APPLICATIONS the first three terms of the Taylor series expansion of the integrand. y. z) − v(x. 2 a b c + 1 1 1 + 2 + 2 . y. y. The following partial differential equation is defined in a three-dimensional ellipsoid E: 1 2 ∆u(x. you can get variances. By doing several runs with different seeds. Y is defined as the t exponential of the integral − 0 v(X(s)) ds. The simulation proceeds as follows. which in our case (g = 1) is u(x. Again compare with the results from a raw procedure without the control variate. z)u(x. This assignment is to solve a three-dimensional partial differential equation inside an elliptical region by MC simulations of the Feynman–Kac formula. = E exp − 0 τ v(X(s)) ds .83) where the ellipsoidal domain is D= (x. we have chosen x = 1 to be specific. The goal is to solve this boundary value problem at x = (x. z) ∈ R3 . z) by a MC simulation. • Starting at some interior point X(0) = x ∈ D.5. z) = 2 boundary condition is x2 y2 z2 + 4 + 4 a4 b c when u = g(x) = 1 x ∈ ∂D. dY = −v(X(t))Y dt. . Here X(s) is a Brownian motion starting from X(0) = x and τ is its exit time from D (see Section 2. which describes the solution u in terms of an expectation value of a stochastic process Y .

ξ3 ) at each time step) √ ξk = ± 3. compute the RMS error compared to the known analytic solution given below. For example. 2 a b c (2. See note (b).MONTE CARLO (MC) METHODS 83 • Integrate this set of equations until X(t) exits at time t = τ . • Compute u(x) = EY (X(τ )) to get the solution. y. . 3: ξ = (ξ1 . this realization of X exited the domain D. t → t + h): √ Xn = Xn−1 + hξ. . For each i = 1. . each with probability 1/6. Hints (a) The exact solution of the partial differential equation is (easily seen by two differentiations) u(x. h [v(Xn )Ye + v(Xn−1 )Yn−1 ] 2 where h is the step size and ξ is a vector of three roughly normally distributed independent random variables each with mean 0 and variance 1. compute N 1 umc (x) = Y (i) . with probability 2/3. ξ2 . 0). Label this realization i. z) = exp y2 z2 x2 + 2 + 2 −1 . z) (for X) and Y (0) = 1 (for Y ). save the values Y (i) = Y (τ ). A simple integration procedure uses an explicit trapezoidal rule for each step n → n + 1 (i.83) for x = (x. Yn = Yn−1 − (an Euler step). When all N realizations i have exited. (2) for several initial values of x ∈ D. .e. . N . . A inexpensive procedure is (three new ones for each coordinate k = 1. 0. . . The initial values for each realization are X(0) = (x. z). (trapezium). Note: This problem will be revisited in Chapter 5.84) . m = 50–100 values x = (x. N i=1 This is the MC solution umc (x) ≈ u(x) to (2. At the end of each time step. What is to be done? The assignment consists of (1) implement the MC simulation on a serial machine and test your code. respectively. y. y. −a < x < a. Ye = (1 − v(Xn−1 )h)Yn−1 . = 0. check if Xn ∈ D: if (X1 /a)2 +(X2 /b)2 +(X3 /c)2 ≥ 1.

i=1 (c) For this problem.84) and the numerical approximation umc (x) given by the MC integration method for m (say m = 50 or 100) starting points xi ∈ D: rms = 1 m m (u(xi ) − umc (xi ))2 . c = 1).g. (e) Choose some reasonable values for the ellipse (e.84 APPLICATIONS (b) Measure the accuracy of your solution by computing the root mean square error between the exact solution u(x) (2. a = 3. 000 (∼ 1–2 percent statistical error). b = 2. (d) A sensible value for N is 10. a stepsize of 1/1000 ≤ h ≤ 1/100 should give a reasonable result. .

we will cover both old and new and find that the old paradigms for programming were simpler because CMOS or ECL memories permitted easy non-unit stride memory access. Most of the ideas are the same. vector hardware is now commonly available in Intel Pentium III and Pentium 4 PCs. but this is somewhat misleading. Today. multiple data (SIMD) mode is the simplest method of parallelism and now becoming the most common. SINGLE INSTRUCTION MULTIPLE DATA Pluralitas non est ponenda sine neccesitate William of Occam (1319) 3. which is not a bad thing. perhaps automatic vectorization will become as effective as on the older non-cache machines. 22. Intrinsics are C macros which contain one or more SIMD instructions to execute certain operations on multiple data. It seems at first that the intrinsics keep a programmer close to the hardware. 51]). As PC and Mac compilers improve. .1 Introduction The single instruction. In this chapter. but are not strictly speaking SIMD. so the simpler programming methodology makes it easy to understand the concepts. for example see Section 1. This form. Hardware control in this method of programming is only indirect. Our examples will show you how this works. and Apple/Motorola G-4 machines. Data are explicitly declared mm128 datatypes in the Intel SSE case and vector variables using the G-4 Altivec. In the meantime.1. on PCs and Macs we will often need to use intrinsics ([23. usually 4-words/time in our case. vector computers were expensive but reasonably simple to program. In most cases this SIMD mode means the same as vectorization. Ten years ago.3 SIMD. SIMD or vectorization. The SSE2 or Altivec programming serves to illustrate a form of instruction level parallelism we wish to emphasize.2. Actual register assignments are made by the compiler and may not be quite what the programmer wants. encouraged by multimedia applications. There are variants on this theme which use templates or macros which consist of multiple instructions carefully scheduled to accomplish the same objective.2. has single instructions which operate on multiple data.

and 3. Look at the following loop in which f (x) is some arbitrary function (see Section 3. 3.3 Consistent with our notion that examples are the best way to learn.5 to see more such recurrences).1. and ← V1 ∗ V2 : multiplies contents of V1 by V2 . ← V1 : ← V1 + V2 : adds contents of V1 and V2 and store results into V3 . stores contents of register V1 into memory M . .8 reduction operations: Section 3. SINGLE INSTRUCTION MULTIPLE DATA Four basic concepts are important: • • • • • Memory access data dependencies: Sections 3. Some typical operations are denoted: V1 M V3 V3 3. Data dependencies and loop unrolling We first illustrate some basic notions of data independence.2. 3.2 branch execution: Section 3. • FFT First we review the notation and conventions of Chapter 1. We will see that the maximum number that may be computed in parallel is number of x[i]’s computed in parallel ≤ k.5.i++){ x[i]=f(x[i-k]). and stores these into V3 .5. x[m+1]=f(x[m-1]).2. Generic segments of size V L data are accumulated and processed in multi-word registers. 11.2.2 pipelining and unrolling loops: Sections 3.i<n. for(i=m. 3. p.86 SIMD. For example.1. the Level 1 basic linear algebra subprograms (BLAS) — vector updates (-axpy) — reduction operations and linear searches • recurrence formulae and polynomial evaluations • uniform random number generation. for k = 2 the order of execution goes x[m ]=f(x[m-2]).2.2.2 ← M: loads V L data from memory M into register V1 . and 1.2.2. x[i-k] must be computed before x[i] when k > 0. several will be illustrated: • from linear algebra. } If the order of execution follows C rules.

not the updated value C language rules prescribe. . This would not be the case if the output values were stored into a different array y. however. Conversely. unrolling the loop to a depth of 2 to do two elements at a time would have no read/write overlap. for(i=m. x[m+3]=f(x[m+1]). */ By line three. x[i+1]=f(x[i-k+1]).i++){ x[i]=f(x[i+k]). } or groups of 4.i<n. x[m] must have been properly updated from the first line or x[m] will be an old value. .i++){ y[i]=f(x[i-k]). This data dependency in the loop is what makes it complicated for a compiler or the programmer to vectorize. unrolling to depth 4 would not permit doing all 4 values independently of each other: The first pair must be finished before the second can proceed. x[i+3]=f(x[i-k+3]). consider unrolling the loop above with a dependency into groups of 2. } when k > 0 all n − m could be computed as vectors as long as the sequential i ordering is preserved. x[m+4]=f(x[m+2]). /* x[m] is new data! */ /* likewise x[m+1] */ /* etc. For the same value of k. x[i+1]=f(x[i-k+1]). either we or the compiler can unroll the loop in any way desired. Since none of the values of x would be modified in the loop. If n − m were divisible by 4.i<n. for(i=m.i+=4){ x[i ]=f(x[i-k ]).i+=2){ x[i ]=f(x[i-k ]). x[i+2]=f(x[i-k+2]).i<n. according to the C ordering rules. for example: for(i=m. . . } If k = 2.i<n. } There is no overlap in values read in segments (groups) and those stored in that situation. for(i=m.DATA DEPENDENCIES AND LOOP UNROLLING 87 x[m+2]=f(x[m ]). Let us see what sort of instructions a vectorizing compiler would generate to do V L operands.

then used. xm+1 . R3 . An example. q) is. say R1 . In the applications provided in Chapter 2. x[1]=f(x[2]). so what happens to it afterward no longer matters. 61..88 SIMD. we pointed out that the lagged Fibonacci sequence (2. . /* x[1] must be new data */ the value of x[1] must be computed in the first expression before its new value is used in the second.] overlap. see Section 1. better performance and higher quality go together. where each Rj stores only one word. q). SSE2 register. we showed code which uses a circular buffer (buff) of length max(p. AMD 3DNow technology [1]. or (2) Vr is a collection of registers. so no x[i] is ever written onto before its value is copied. On p.0. Fujitsu). Whereas. in x[0]=f(x[1]). . Hence.] ← V2 // store VL results. q) to compute this long recurrence formula. .. Its old value is lost forever. .1. Although this mode is instruction level parallelism. it is not SIMD because multiple data are processed each by individual instructions.60) xn = xn−p + xn−q mod 1.. SINGLE INSTRUCTION MULTIPLE DATA V1 ← [xm+k . At issue are the C (or Fortran) execution order rules of the expressions. In sequential fashion. .. can be SIMD parallelized with a vector length V L ≤ min(p. The term superscalar applies to single word register machines which permit concurrent execution of instructions.2. Such recurrences and short vector dependencies will be discussed further in Section 3.2. R5 . Although the segments [xm+k . xm+k+1 .] // VL of these V2 ← f (V1 ) // vector of results [xm . the larger min(p. . NEC. the better the quality of the generator. xm+k+2 . xm+1 . . According to the discussion given there.. x[2]=f(x[1]). This would be the type of optimization a compiler would generate for superscalar chips.5. for the pair x[1]=f(x[0]). or Motorola/Apple G-4 Altivec register..] and [xm . /* x[1] is old data */ /* x[1] may now be clobbered */ the old value of x[1] has already been copied into a register and used. xm+k+1 . Appendix B. . Registers Vr are purely symbolic and may be one of (1) Vr is a vector register (Cray. several single word registers can be used to form multi-word operations. In that case. V1 has a copy of the old data. R7 . .

Let p = n mod 4. on Intel Pentium III and Pentium 4 machines.i<n. 3. y[i+1]=f(x[i+1]). pushing more balls into the pipe causes balls to emerge from the other end at the same rate they are pushed in. On out-of-order execution machines. In either case.1 Pipelining and segmentation The following analysis. For example.. } may be unrolled into groups of 4. but sometimes after all the q full V L segments. is very simplified. y[p-1]=f(x[p-1]).3. To return to a basic example.i<n. and its subsequent generalization in Section 3.i++){ y[i]=f(x[i]). The overhead (same as latency in this case) is how many balls will fit into the pipe before it is full. y[i+3]=f(x[i+3]). or better. the idea apparently having originated in 1962 with the University of Manchester’s Atlas project [91]. The total number of segments processed is q if r = 0. y[i+2]=f(x[i+2]). or why would a compiler do it? The following section explains. A special notation is used in MatLab for these products: A. The generalization to say m at a time is obvious. It must be remarked that sometimes the residual segment is processed first. a general analysis of speedup becomes difficult. Where it may differ significantly from real machines is where there is out-of-order execution. for example. otherwise q + 1 when r > 0.DATA DEPENDENCIES AND LOOP UNROLLING 89 The basic procedure of unrolling loops is the same: A loop is broken into segments. This is sometimes called a Hadamard or element by element product. The terminology “pipeline” is easy enough to understand: One imagines a small pipe to be filled with balls of slightly smaller diameter.2.*B.. Figure 3. and 0 ≤ r < V L is a residual. Hardware rarely does any single operation in one clock tick.2. the loop for(i=p. unrolls this loop into: y[0]=f(x[0]). why would we do this. imagine we wish to multiply an array of integers A times B to get array C: for i ≥ 0. . For a loop of n iterations. for(i=p.i+=4){ y[i ]=f(x[i]). either we programmers. Subsequently. one unrolls the loop into segments of length V L by n=q·VL+r where q = the number of full segments. Ci = Ai ∗ Bi . We indicate the successive . Today nearly all functional units are pipelined.1. the compiler. } /* p<4 */ These groups (vectors) are four at a time.

1 to see how it works. Ci2 .1. After four cycles the pipeline is full. Ci1 .2) which for large number of elements n ≤ V L is the pipeline length (=4 in this case). it (stage j) can be used for Ai+1. while in the pipelined case it takes 4 + n to do n. Bi0 ]. Bi1 .2. Ci0 ].90 SIMD. until reaching the last i. (3.1) When stage j is done with Ai. little-endian.1. Look at the time flow in Figure 3.j ∗ Bi+1. Ai2 .j . we get one result/clock-cycle. four cycles (the pipeline latency) are needed to fill the pipeline after which one result is produced per cycle.j . It takes 4 clock ticks to do one multiply. Fig.j−1 . Subsequently. Four-stage multiply pipeline: C = A ∗ B. Ai1 . that is. . In this simplified example. the speedup is thus speedup = 4n 4+n (3. Bi = [Bi3 . Ci = [Ci3 . 3. Ai = [Ai3 . SINGLE INSTRUCTION MULTIPLE DATA bytes of each element by numbering them 3.j−1 ∗ Bi.0. Ai0 ].j ∗ Bi. Byte numbered j must wait until byte j − 1 is computed via Cij = Aij ∗ Bij + carry from Ai. Bi2 . A00*B00 A01*B01 A10*B10 A02*B02 A11*B11 A20*B20 A03*B03 A12*B12 A21*B21 A30*B30 C0 Time A13*B13 A22*B22 A31*B31 A40*B40 C1 A23*B23 A32*B32 A41*B41 A50*B50 C2 A33*B33 A42*B42 A51*B51 A60*B60 etc.

To illustrate.i++){ y[index[i]] = x[i]. the difficulty for a compiler is that it is impossible to know at compile time that all the indices (index) are unique. If they are unique. certain compiler directives are available. In their simplest form. 3. pragma’s in C and their Fortran (e.3.2. gather collects elements from random locations. The gather operation has a similar problem. Scatter operation with a directive telling the C compiler to ignore any apparent vector dependencies. after the loop in Figure 3.3.g. Take the case that two of the indices are not unique: index[1]=3 index[2]=1 index[3]=1 For n = 3.i<n. } Fig. the contiguous xi above) and scatters them into random locations. or by the compiler. In the latter case. w[i] = z[index[i]]. treated in more detail in Chapter 4 on shared memory parallelism. If any two array elements of index have the same value. Let us beat this to death so that there is no confusion. One special class of these dependencies garnered a special name—scatter/gather operations. These directives.2 More about dependencies. Scatter and gather operations. cdir$) equivalents. In the scatter case.3 is executed.DATA DEPENDENCIES AND LOOP UNROLLING 91 3.g. and another when w and z overlap. It is not hard to understand the terminology: scatter takes a segment of elements (e.i++){ y[index[i]] = x[i]. we should have array y with the values for(i=0.i<n. let index be an array of integer indices whose values do not exceed the array bounds of our arrays. perhaps with help from the programmer. /* scatter */ /* gather */ . regardless of how the loop count n is segmented (unrolled). which is frequently the case. scatter/gather operations From our examples above. all n may be processed in any order. the two operations are in Figure 3.2. } Fig. there would be such a dependency.2. it is hopefully clear that reading data from one portion of an array and storing updated versions of it back into different locations of the same array may lead to dependencies which must be resolved by the programmer. 3. Look at the simple example in Figure 3. #pragma ivdep for(i=0. are instructions whose syntax looks like code comments but give guidance to compilers for optimization within a restricted scope.

it has to be conservative and generates only one/time code. thus speedups may be higher than (3. And as icing on the cake. Now everybody is happy: The programmer gets her vectorized code and the compiler does the right optimization it was designed to do. some illustration about the hardware should be helpful.4 shows the block diagram for the Cray SV-1 central processing unit. Cray Research invented compiler directives to help the frustrated compiler when it is confused by its task.3. The worst that could happen is that another compiler may complain that it does not understand the pragma and either ignore it. the order might be arbitrary. In the event that out-of-sequence execution would effect the results. or (rarely) give up compiling. Conversely. too. This machine has fixed overheads (latencies) for the functional units.2 can be problematic.2. If all n calculations were done asynchronously. the resultant code is approximately portable. There could be no parallel execution in that case. since memory architectures on Pentiums and Motorola G-4s have multiple layers of cache.5). 3. Programmer knows what she is doing and to ignore the vector dependency. Not only must the results be those specified by the one after another C execution rules. The syntax is shown in Figure 3. the results should agree with SIMD parallelized results. Bear in mind that the analysis only roughly applies to in-order execution with fixed starting overhead (latencies). Namely if w and z are pointers to a memory region where gathered data (w) are stored into parts of z. The pragma tells the compiler that indeed Ms. Beginning in the late 1970s. y[1] might end up containing x[2]. again the compiler must be conservative and generate only one/time code. What is the unfortunate compiler to do? Obviously. For example. There are many compiler directives to aid compilation. an incorrect value according to the C execution rules.92 SIMD. then reset to x[1]. the dependencies are called antidependencies [147]. As an asynchronous parallel operation. this would be contrary to the order of execution rules. if a loop index i + 1 were executed before the i-th. speedups may also be lower than our analysis. we will illustrate some of them and enumerate the others. the gather operation in Figure 3. although the processing rate may differ when confronted with .3 Cray SV-1 hardware To better understand these ideas about pipelining and vectors. Out-of-order execution [23] can hide latencies. Figure 3. SINGLE INSTRUCTION MULTIPLE DATA y[1]=x[3] y[2]=unchanged y[3]=x[1] where perhaps y[1] was set to x[2]. In C and Fortran where pointers are permitted. maybe there is something subtle going on here that the compiler cannot know about. but the compiler cannot assume that all the indices (index) are unique. If done in purely sequential order. After all.

reciprocal approximation. . Furthermore. are shown in the upper middle portion. the “MM” stands for multimedia. add Scalar integer Fig. . . wonder how any of the Jaguar swooshing windows can happen in Macintosh OS-X. These conflicts occur when bulk memory is laid out in banks (typically up to 1024 of these) and when successive requests to the same bank happen faster than the request/refresh cycle of the bank.p. Intel began with the Pentium featuring MMX integer SIMD technology. logical.p. its successor SSE2 includes double precision (64 bit) operations. add Vector f. some of it is the power of the . lz. Shift Logical Int. The only parts of interest to us here are the eight 64-word floating point vector registers and eight 64-bit scalar registers. Vector integer Population. 3. multiply. There are many variants on this theme. if you.DATA DEPENDENCIES AND LOOP UNROLLING Vector mask Si One to three ports to memory 1 2 3 4 5 6 7 Sj Shift Integer add Vj Vk Vi Central memory Vector registers Vk Cache Vi Sj Sj Reciprocal Multiply f. . lz. V7 . memory bank conflicts. Not surprisingly. The eight vector registers. floating point words and so forth. the Pentium III featured similar vector registers called XMM (floating point) and again included the integer MMX hardware. gentle reader. Cray SV-1 CPU diagram. numbered V0 . NEC SX-4. Vj Reciprocal Multiply f. and SX-6 machines have reconfigurable blocks of vector registers. The pipelines of the arithmetic units appear in the upper right—add.4.p. not 64.p. Later. SX-5. Logical 93 V0 S7 Si T77 T00 Save scalar registers Tjk S0 Scalar registers Sk Si Population. Cray C-90 has registers with 128. add Scalar f. The three letter acronym SSE refers to this technology and means streaming SIMD extensions.

Hence. y2 . With the . . SINGLE INSTRUCTION MULTIPLE DATA // this may have to be expanded to a vector on some machines // put a into a scalar register S1 ← a // operation 1 V1 ← [y0 . Since there is only one port to memory. To be more general than the simple speedup formula 3. let us digress to our old friend the saxpy operation.5. and subsequent add to y may all run concurrently as soon as the respective operations’ pipelines are full. the final results now in V4 cannot be stored until all of V4 is filled. .5.] // read V L of x // a * x V3 ← S1 ∗ V 2 V4 ← V 1 + V3 // y + a * x // operation 3 [y0 . add. saxpy operation by SIMD.2. Memory is also a pipelined functional unit. This is because memory is busy streaming data into V2 and is not available to stream the results into V4 .] // read V L of y // operation 2 V2 ← [x0 . . and store—all running concurrently [80. y ← a · x+y. . For our purposes. However. y1 . a segmented vector timing formula can be written for a segment length n ≤ V L. reconfigurable up to 2 doubles (64-bit) or down to 16 1-byte and 8 2-byte sizes. see for example [23].5 considered only one operation? Because the functional units used are all independent of each other: as soon as the first word (containing y0 ) arrives in V2 after πmemory cycles. (one result)/(clock tick) is the computational rate. .94 SIMD.1. the processing rate is approximately (1 saxpy result)/(3 clock ticks). multiplication by a. if as in Figure 3.] ← V4 // store updated y’s Fig. . 123]. . multiply. To explore this idea of pipelining and vector registers further. . x1 . read.6 concerning the first comment in the pseudo-code): Why are the three parts of operation 2 in Figure 3. the three operations of Figure 3. y1 . the four 32-bit configuration is the most useful. As soon as the first result (a · x0 ) of the a · x operation arrives (πmultiply cycles later) in V3 .4. the read of x. where for each pipe of length πi it takes πi + n cycles to do n operandpairs. We will explore this further in Sections 3.2. If there is but one path to memory. the add operation with the y elements already in V1 may begin. 3.1 collapse into only one: read.2. three pieces are needed (see Section 3. Subsequently. there are three ports to memory. however. .4 and 3. the multiply operation can start. similar to Figure 3. The vector registers on the Pentium and G-4 are 4 length 32-bit words. shown in Figure 3. with only one port to memory. Altivec hardware on G-4 that makes this possible. Thus.5. x2 .

which is similar to πmemory when CMOS or ECL memory is used. because of multiple cache levels (up to three). these startup latency counts are 3 to 15 cycles for arithmetic operations. For example. The number of linked operations is usually much easier to determine: The number α of linked operations is determined by the number of independent functional units or the number of registers. (3.4 applies.2. that is. the number cycles to compute one result is. the speedup ratio is roughly 1. the average pipeline overhead (latency). 1 Rscalar α = i=1 πi . On machines with cache memories. When an instruction sequence exhausts the number of resources.4) speedup = = α i=1 Rscalar {πi + n} i=1 If n ≤ V L is small. is α n πi Rvector . No linked operation can exist if the number of functional units of one category is exceeded. Such a resource might be the number of memory ports or the number of add or multiply units. Hardware reference manuals and experimentation are the ultimate sources of information on these quantities. this ratio becomes (as n gets large) speedup ≈ α i=1 πi α = π . n = number of independent data πi = pipeline overhead for linked operation i α = number of linked operations Rvector = processing rate (data/clock period) we get the following processing rate which applies when πmemory is not longer than the computation. however. the higher the speedup is. (3. the next Section 3. It may seem curious at first. . In that case. comparing the vector (Rvector ) to scalar (Rscalar ) processing rates. n . the pipeline length for memory. α must be increased. two multiplies if there is only one multiply unit mean two separate operations. but the larger the overhead. Typically. it may be hard to determine a priori how long is πmemory .DATA DEPENDENCIES AND LOOP UNROLLING 95 notation. The speedup.3) Rvector = α {πi + n} i=1 In one/time mode. if n is large compared to the overheads πi .

.96 SIMD. Ai+1 . the vector register V0 (size four here) is loaded with Ai . . . without any prefetch. It is hard to imagine how to prefetch more than one segment ahead. Ai+7 (wait memory) − Tf − Tj V2 ← f (V1 ) Bi .2. instead of a delay of Twait memory . .6. . and Tj is the cost of the jump back to the beginning of the loop. without prefetch on left. . . Ai+1 .4 Long memory latencies and short vector lengths On systems with cache. the latency is reduced to Twait memory − Tf − Tj where Tf is the time for the V L = 4 computations of f () for the current segment. . Bi+7 ← V2 for i = 0. Ai+3 are prefetched into V0 before entering the loop. This can be longer than the inner portion of a loop. V L = 4 times for the current segment. Ai+3 and after the time needed for these data to arrive. Ai+7 ) is begun before the i. The advantage of this prefetching strategy is that part of the memory latency may be hidden behind the computation f . . i + 3 elements of f (A) are computed. i + 1. . then the next segment fetch (Ai+4 . Ai . . . Bi+3 ← V2 } ← f (V0 ) V2 Bi+4 . Bi+3 ← V1 } Fig. Ai+3 wait memory . In the first. Bi+1 . . That is. Ai+5 . Bi+2 . . n − 5 by 4 { V 1 ← V0 V0 ← Ai+4 . i + 2. SINGLE INSTRUCTION MULTIPLE DATA 3. . . unrolling loops by using the vector hardware to compute B = f (A) takes on forms shown in Figure 3. V1 ← f (V0 ) Bi . Typical memory latencies may be 10 times that of a relatively simple computation. Ai+6 . . n − 1 by 4 { V0 ← Ai . Thus. . . A3 for i = 0. more than one i. f (A) for these four elements can be computed. i + 3 loop count because the prefetch is pipelined and not readily influenced by software control. 3. with prefetch on the right: Tf is the time for function f and Tj is the cost of the jump to the top of the loop. Bi+3 ) of B. Ai+2 . . . . . i + 2. In the second variant. i + 1. There are two variants which loop through the computation 4/time. . Ai+2 . Long memory latency vector computation. . for example saxpy. . These data are then stored in a segment (Bi . the most important latency to be reckoned with is waiting to fetch data from bulk memory. these data saved in V1 as the first instruction of the loop. .6. How does V0 ← A0 . that is.

7 and 3. we will rely on diagrams to give a flavor of the appropriate features. 3. these inequalities are fairly sharp.9 shows a block diagram of the instruction pipeline. so we will only be concerned with instruction issues which proceed before previous instructions have finished. On the Pentium. Ai+3 into L1 also brings Ai+4 .9. Ai+2 . an instruction cache (Trace cache) of 12k µops (micro-operations) provides instructions to an out-of-order core (previously Dispatch/execute unit on Pentium III) from a pool of machine instructions which may be reordered according to available resources. Ai+L2B into L2. Ai+1 . namely speedups of 4 (or 2 in double) are often closely reached. once datum Ai is fetched.5 Pentium 4 and Motorola G-4 architectures In this book. The result is. it must be remarked) is deeply pipelined. Ai+6 . where L2B is the L2 cache block size. since higher levels of cache typically have larger block sizes (see Table 1. The out of order core has four ports to . In both cases of Figure 3.1) than L1.2. For our purposes.DATA DEPENDENCIES AND LOOP UNROLLING 97 one save data that have not yet arrived? It is possible to unroll the loop to do 2 · V L or 3 · V L. However. To be clear: unrolling with this mechanism is powerful because of the principle of data locality from Chapter 1. as in Figures 3. . but not more than one loop iteration ahead. Ai+2 . On a Pentium 4.7 and 3. fetching Ai . Experimentally. 3.8. This overview will be limited to those features which pertain to the SIMD hardware available (SSE on the Pentium. thus permitting arithmetic pipeline filling before all the results are stored.10 from Chapter 1. we get the following speedup from unrolling the loop into segments of size V L: speedup = Rvector ≤VL Rscalar (3. one can expect speedup ≤ 4 (single precision) or speedup ≤ 2 (double precision). then. 1.2. we can only give a very superficial overview of the architectural features of both Intel Pentium 4 and Motorola G-4 chips.8. The general case of reordering is beyond the scope of this text. at a time in some situations.1.6. . This foreshortens the next memory fetch latency for successive V L = 4 segments. however. Ai+5 .5) In summary. Ai+3 are in the same cacheline and come almost gratis. Both machines permit modes of outof-order execution. and Altivec on G-4). on SSE2 and Altivec hardware. See our example in Figures 3. Namely. Section 1. In this regard. etc. the other date Ai+1 . the out of order core can actually reorder instructions.6 Pentium 4 architecture Instruction execution on the Pentium 4 (and on the Pentium III. . Figure 3. this overlap of partially completed instructions is enough to outline the SIMD behavior. in effect. Instructions may issue before others are completed. whereas on the G-4 this does not happen. the same as aligning four templates as in Figures 1.. .

10.3. In brief. 3. There are four ports (0.98 SIMD.10. one C0 A13*B13 A22*B22 A31*B31 C0 A13*B13 A22*B22 A31*B31 A40*B40 C1 A23*B23 A32*B32 C1 A23*B23 A32*B32 A41*B41 A50*B50 C2 A33*B33 C2 A33*B33 A42*B42 A51*B51 A60*B60 C3 C3 instr. see Figures 3. Retirement unit receives results from the execution unit and updates the system state in proper program order. two A42*B42 A51*B51 A60*B60 Pipelining with out-of-order execution.7. etc. These are shown in Figure 3. then.1 and 3. A43*B43 A52*B52 A61*B61 A70*B70 In-order execution Fig. one A03*B03 A12*B12 A21*B21 A30*B30 A03*B03 A12*B12 A21*B21 A30*B30 . instr. floating point. two A43*B43 A52*B52 A61*B61 A70*B70 C4 A40*B40 A53*B53 A62*B62 A71*B71 A80*B80 A41*B41 A50*B50 etc. Execution trace cache is the primary instruction cache. 2. The left side of the figure shows in-order pipelined execution for V L = 4 . various functional units—integer.1. This cache can hold up to 12k µops and can deliver up to 3 µops/cycle. the various parts of the NetBurst architecture are: 1.3) to the execution core: see Figure 3. Out-of-order core can dispatch up to 6 µops/cycle. Four-stage multiply pipeline: C = A ∗ B with out-of-order instruction issue. SINGLE INSTRUCTION MULTIPLE DATA A00*B00 A00*B00 A01*B01 A10*B10 A01*B01 A10*B10 A02*B02 A11*B11 A20*B20 A02*B02 A11*B11 A20*B20 instr. instr. pipelined execution (out-of-order) shows that the next segment of V L = 4 can begin filling the arithmetic pipeline without flushing. On the right.2. 3. logical.

can begin before earlier issues have completed all their results. . there are 1.2 Instr. Another way of looking at Figure 3. (1) (2) In-order execution of pipelined arithmetic Time Instr.3 Instr.9. .DATA DEPENDENCIES AND LOOP UNROLLING Instr. On the Pentium 4. the instructions are in a pool of concurrent instructions and may be reordered by the Execution unit.1 Instr.5) plus a penalty for any instruction cache flushing. In consequence.4 Instr. XMM1. 4 single precision 32-bit words. for example.9) are always in-order and likewise retirement and result storage. Branch prediction permits beginning execution of instructions before the actual branch information is available. 4.1 Instr.7: since instruction (2) may begin before (1) is finished. .3 etc.2 Instr. A history of previous branches taken is kept to predict the next branch. This history is updated and the order of execution adjusted accordingly. XMM7. In this instance. see Figure 3. In between. These registers are called XMM registers: and are numbered XMM0. 2. Other important architectural features are the register sets. There is a delay for failed prediction: typically this delay is at least the pipeline length πi as in (3. Instruction fetches (see Fetch/decode unit in Figure 3. out-of-order means that (2). however. Each may be partitioned in various ways: 16 bytes/register. or 2 double precision 64-bit words. 3. Our principle interest here will be in the 32-bit single precision partitioning. Eight 128-bit floating point vector registers. Intel’s branch prediction algorithm is heuristic but they claim it is more than 95 percent correct. (2) can start filling the pipeline as soon as the first result from instruction (1) C0 emerges.5 etc. . (1) C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 (3) C0 C1 C2 C3 C4 C5 C6 C7 99 (2) (3) Out-of-order execution of pipelined arithmetic Fig. there is an overlap of instruction execution.8. . subsequently (3) and (4). Eight scalar floating point registers (80 bits).

The trace cache replaced the older P-3 instruction cache and has a 12k µops capacity. Intel calls this system the NetBurst Micro-architecture.100 SIMD. SIMD (XMM) Memory load Load and prefetch Memory store Store address Fig. 8 way assoc. but stores are in-order. Port 1 handles the SIMD instructions. server only L2 Cache..10. SINGLE INSTRUCTION MULTIPLE DATA System bus Light data traffic Heavy data traffic Bus unit L3 cache. 3. . Front end Fetch/ decode Trace cache microcode (ROM) Execution (out of order) Branch history update BTBs/branch prediction Retirement Fig. Port 0 Port 1 Port 2 Port 3 ALU 0 Double speed Add. FP store data branches Integer FP operation Execute Normal speed Shift FP: add.. subtract FP move. subract. see: [23]. Port structure of Intel Pentium 4 out-of-order instruction core. L1 Cache 4-way assoc.9. mult. SIMD int. logical FP move ALU 1 Double speed Add. Block diagram of Intel Pentium 4 pipelined instruction execution unit. div.. see: [23]. 3. Loads may be speculative.

see: [29].11): 1. 3. .” but because of their great flexibility Intel no longer refers to them in this way. For our examples here.7 Motorola G-4 architecture The joint Motorola/IBM/Apple G-4 chip is a sophisticated RISC processor. Originally.11. of 64-bit length. called MMX. The G-4 chip has a rich set of registers (see Figure 3. and numbered MM0.DATA DEPENDENCIES AND LOOP UNROLLING 101 3. High level overview of the Motorola G-4 structure. including the Altivec technology. .. but instruction execution may begin prior to the completion of earlier ones. . Thirty-two general purpose registers (GPUs). . [25]. Floating point arithmetic is generally done in 80-bit precision on G-4 [48]. Instruction fetch Branch unit Completion unit Instruction cache (32 kb) BHT/BTIC SEQUENCER UNIT Dispatch unit AltiVec issue GPR issue FPR issue GPRs Rename buffers FPRs LSU Rename buffers FPU SFX0 SFX1 Complex SFX2 CFX Permute VRs Rename buffers Simple Float Data cache (32 kb) Interface to memory sub-system L3 cache System bus ALTIVEC ENGINE Unified L2 cache (256 kb) and L3 cache tag control System interface unit MPX Fig. these labels meant “multimedia. of 64-bit length. Thirty-two floating point registers (FPRs). it is similar to the more deeply pipelined Intel Pentium 4 chip. MM1. Its instruction issue is always in-order. 3.2. MM7. 2. Eight 64-bit integer vector registers. These are the integer counterpart to the XMM registers.

} else { c[i]=g(a[i]).12.8 Branching and conditional execution Using the SIMD paradigm for loops with if statements or other forms of branching is tricky. . Thirty-two Altivec registers. otherwise it is not set.i++){ if(e(a[i])>0. a conditional execution with even two choices (f or g) cannot be done directly with one instruction on multiple data. .2.g. or alternatively parts of both. 3. Clearly. The selection chooses f (ai ) if the ith bit is set. Branch processing by merging results. see [23] and [71]). why do both or even parts of both? • one of f (a [i]) or g(a [i]) may not even be defined. V0 ← [a0 . 3.i<n.102 SIMD. a1 . Regardless of possible inefficiencies. to compute if’s or branches in SIMD efficiently. Indeed. otherwise g(ai ) is chosen. the corresponding mask bit ‘i’ is set.12 shows one possible selection process. for example.] V7 ← e(V0 ) V M ← (V7 > 0?) forms a mask if e(ai ) > 0 V1 ← f (V0 ) compute f(a) V2 ← g(V0 ) compute g(a) V3 ← V M ?V1 : V2 selects f if VM bit set Fig. It is impossible.0){ c[i]=f(a[i]). almost by definition. If e(ai ) > 0. each of 128-bit length. The closest we can come to vectorizing this is to execute either both f and g and merge the results. Branch prediction has become an important part of hardware design (e. These may be partitioned in various ways: 16 bytes/register. . SINGLE INSTRUCTION MULTIPLE DATA 3. } } Branch condition e(x) is usually simple but is likely to take a few cycles. there are problems here: • if one of the f or g is very expensive. for(i=0. 4 single precision 32-bit floating point words/register. Look at the following example. or 2 double precision 64-bit words/register. Figure 3. .

i++){ if(e(a[i])>0. In Figure 3. If e(x) > 0 is more probable.13. npickb=0. In the example of isamax later in Section 3. obviously at least the cost of computing R1 = f1 (R0 ) because it is not used.14. Since the computation of f1 in Figure 3. what if the branch e(x) ≤ 0 were more probable? In that case.13) will be efficient (see p.k<npickb. branch prediction failure penalties are almost always far less costly than the merging results procedure illustrated in Figure 3. Now imagine that the hardware keeps a history which records the last selection of this branch.0)?b[i]:c[i]. Instruction reordering is done for optimization.14. a more efficient branch prediction would be Figure 3. } else { pickb[npickb++]=i.0){ picka[npicka++]=i. g(x) = g2 (g1 (x)). it may run concurrently with the steps needed to determine the if(e(R0 ) > 0) test.k++) c(pickb[k])=g(b[pickb[k]]).5. 86 for register conventions). An alternative to possible inefficient procedures is to use scatter/gather: npicka=0. g(x).12. so a missed branch prediction also forces an instruction cache flush—roughly 10 clock cycles on Pentium III and 20 cycles on Pentium 4. because bi and ci require no calculation—only selection.14) is a pipelined operation.k<npicka.DATA DEPENDENCIES AND LOOP UNROLLING 103 The following is another example of such a branching. for(i=0.i<n. but is efficient in this case. Conversely. the following branch prediction sequence (Figure 3. the prediction fails and there will be penalties. f (x).13 (or g1 (x) in the case shown in Figure 3. according . d[i]=(e(a[i])>0. you will probably have to work to optimize (vectorize) such branching loops.7. if R7 > 0 the prediction likewise fails and penalties will be exacted. If it turns out that R7 ≤ 0 seems more likely. Regardless of these penalties. In Figure 3. in the computation of the loop on p. As many as 126 instructions can be in the instruction pipe on P-4. Imagine that at least two operations are necessary to compute these functions: f (x) = f2 (f1 (x)). Branch prediction is now a staple feature of modern hardware.k++) c(picka[k])=f(a[picka[k]]). we will see how we effect a merge operation on Intel Pentium’s SSE hardware. for(k=0. Either way. } } for(k=0. if it turns out R7 ≤ 0. A simplified illustration can be shown when the operations. 102 take multiple steps.

Branch prediction best if e(x) ≤ 0 . 3.2).14. 132]. particularly if the registers can be renamed. the history of the previously chosen branches is used to predict which branch to choose the next time.14 should be used. then the second sequence Figure 3. SINGLE INSTRUCTION MULTIPLE DATA R0 ← x ← e(x) R7 R1 ← f1 (R0 ) if (R7 > 0) { R3 ← f2 (R1 ) } else { R4 ← g1 (R0 ) R3 ← g2 (R4 ) } // compute part of f (x) // predicted f if e(x) > 0 // picks g otherwise Fig. Solid advice is given on p.14) is just a reordering of the first (Figure 3.13.104 SIMD. Thus. when the latter is possible. 125 is more appropriate for our SIMD discussion and the decision about which branch to choose is done by the programmer. . R0 ← x ← e(x) R7 ← g1 (R0 ) R1 if (R7 > 0) { R4 ← f1 (R0 ) R3 ← f2 (R4 ) } else { R3 ← g2 (R1 ) } Fig. see [23] or [71]. Likewise. in loops. Power-PC (Motorola G-4) allows renaming registers [132]. A great deal of thought has been given to robust hardware determination of branch prediction: again. With some examination. 3. With out-of-order execution possible. it is not hard to convince yourself that the second expression (Figure 3. not by hardware. Branch prediction best when e(x) > 0 .13). // compute part of g(x) // picks f if e(x) > 0 // predicted g to that history.9) and renaming of registers [23. 12 of reference [23] about branch prediction and elimination of branches. usually when the loop is unrolled (see Section 3. the instruction sequence could be reordered such that the g(x) path is the most likely branch. The isamax0 example given on p. The Pentium 4 allows both reordering (see Figure 3.

Not surprisingly. V0 ← [0. reduction operations can be painful.3 Reduction operations. 0. 0. The calculation is n dot = (x. . ym+2 . ∀k} i .] // read V L of x V2 ← [ym . The BLAS version of isamax has an arbitrary but fixed stride.g. . no matter how clever the compiler is. . xm+1 . .REDUCTION OPERATIONS. Obviously. Another important operation is a maximum element search. it is likely less efficient than using pure vector operations. . ours is isamax0 = inf {i | |xi | ≥ |xk |. many compilers look for such obvious constructions and insert optimized routines to do reduction operations such as inner products. y) = i=1 xi yi where the result dot is a simple scalar element. In SIMD. VL dot = i=1 partiali . SEARCHING 105 3. . We only illustrate a unit stride version. xm+2 .] // read V L of y // segment of x*y V3 ← V 1 ∗ V 2 // accumulate partial result V0 ← V0 + V 3 which yields a segment of length V L of partial sums in V0 : q−1 partiali = k=0 xi+k·V L · yi+k·V L . searching The astute reader will have noticed that only vector → vector operations have been discussed above. the last reduction of V L partial sums may be expensive and given all the special cases which have to be treated (e. Fortunately for the programmer. Operations which reduce the number of elements of operands to a scalar or an index of a scalar are usually not vector operations in the same sense. Assume for simplicity that the number n of elements is a multiple of V L (the segment length): n = q · V L. It is easy enough to do a partial reduction to V L elements. shared memory parallelism. For example. . there will be reduction operations it cannot recognize and the resultant code will not parallelize well. that operation requires clever programming to be efficient. and distributed memory parallelism. .] // initialize V0 to zero // loop over segments: m += VL each time V1 ← [xm . ym+1 . the BLAS routine sdot. which must be further reduced to get the final result. for example. n = q · V L). . Let us consider the usual inner product operation. unless a machine has special purpose hardware for an inner product.

k++){ c[i][j] += a[i][k]*b[k][j]. [89]. section 7. We will now illustrate two: matrix × matrix multiply and Gaussian elimination. We use the C ordering wherein columns of B are assumed stored consecutively and that successive rows of A are stored n floating point words apart in memory: for (i=0.3).i<n. Here is the usual textbook way of writing the operation (e. 125).106 SIMD.4 Some basic linear algebra examples We have repeatedly flogged the saxpy operation because it is important in several important tasks. the first index i for which |xi | ≥ |xk | for all 1 ≤ k ≤ n. and requires such a search for the element of maximum absolute value. for (k=0. } } } It is important to observe that the inner loop counter k does not index elements of the result (c[i][j]): that is.4.1. where matrices A. In the first case. 3. C are assumed to be square and of size n × n.g. 124.. we show that a loop interchange can be helpful.i++){ /* mxm by expl. } } An especially good compiler might recognize the situation and insert a highperformance sdot routine in place of the inner loop in our first way of expressing the matrix multiply. 3.&b[0][j].i<n. dot product */ for (j=0. Meanwhile we will show that it is sometimes possible to avoid some reductions. The Linpack benchmark uses gefa to factor a matrix A = LU by partial pivoting Gaussian elimination. SINGLE INSTRUCTION MULTIPLE DATA that is.i++){ /* mxm by dot product */ for (j=0.j<n.j++){ c[i][j]=0.1 Matrix multiply In this example. In examples later in this chapter.&a[i][0]. we want the matrix product C = AB. we will show how to use the SSE (Pentium III and 4) and Altivec (Apple G-4) hardware to efficiently do both sdot and the isamax operations (see p. we might turn the two inner loops inside out .j++){ c[i][j]=sdot(n.j<n. B. We could as well write this as for (i=0. Alternatively. Ci.j is a fixed location with respect to the index k.k<n.n).

where elements are stored in sequential words. The core operation is an LU factorization A → P LU where L is lower triangular with unit diagonals (these are not stored).∗ ← Ci.j++){ c[i][j] += a[i][k]*b[k][j]. 3. We will use this arbitrariness to introduce shared memory parallelism. } for (k=0. That is.2 SGEFA: The Linpack benchmark It is certain that the Linpack [37] benchmark is the most famous computer floating point performance test. .∗ with the first ai. see Appendix E)to be consistent with the Linpack benchmark. This trick of inverting loops can be quite effective [80] and is often the essence of shared memory parallelism.4. } } } The astute reader will readily observe that we could be more effective by initializing the i-th row. j) may be chosen for vectorization. However. repeatedly for k → k + 1. n − 1 elements in the row i. we assume Fortran ordering (column major order. the ‘∗’ means all j = 0.j<n.. Based on J. . that of Gaussian elimination. Ci.2. either of the inner loops (i. 108) is necessary for the permutation information. Since the permutation P involves row exchanges.SOME BASIC LINEAR ALGEBRA EXAMPLES 107 to get for (i=0.i<n.j++){ c[i][j]=0.2. In this example. Ci. namely.1. A reformulation of Gaussian elimination as matrix–vector and matrix–matrix multiplications was given in Section 2.i++){ /* mxm by outer product */ for (j=0. U is upper triangular (diagonals stored).j<n.k++){ for (j=0. this test has been run on every computer supporting a Fortran compiler and in C versions [144] on countless others.∗ rather than zero. the important thing here is that the inner loop is now a saxpy operation. Here.k · bk. In a further example. Dongarra’s version of SGEFA [53] which appeared in Linpack [38].0 b0. Similar examples can be found in digital filtering [123]. the order of the two inner loops is entirely arbitrary. and P is a permutation matrix which permits the factorization to be done in-place but nevertheless using partial pivoting. .∗ .k<n. Our version is readily adapted to shared memory parallelism. only a vector (ip on p.∗ + ai. . .

3. Vector ip contains pivot information: the permutation matrix P is computed from this vector [38]. Figure 3. see Appendix E. of j for(k + 1 ≤ i.108 SIMD. To hopefully clarify the structure. .15. a simple vector swap (y ↔ x). . n − 1) akj ↔ alj independent of i. Simple parallel version of SGEFA. . j ≤ n − 1){ aij ← aij + aik · akj } } // akj 0 Pivot k k+1 Area of active reduction 0 k k+1 l n–1 n–1 Fig. we have not discussed two simple BLAS operations: sswap. Here is the code for the P LU reduction. n − 1) aik ← −aik /p if(l = k) for(j = k + 1 . SINGLE INSTRUCTION MULTIPLE DATA Heretofore. . or aik indep. Here is the classical algorithm: for 0 ≤ k≤ n − 1 { l ← maxl |akl | p ← akl ipk ← l if(l = k) akk ↔ alk for(i = k . and sscal. . Macro am is used for Fortran or column major ordering. These may be fetched from [111]. which simply scales a vector (x ← a · x).15 shows a diagram which indicates that the lower shaded portion (pushed down as k increases) is the active portion of the reduction.

j++){ /* msaxpy. if((*am(l.isamax().k<nm1.kp1).k++){ kp1=k+1. *am(k.SOME BASIC LINEAR ALGEBRA EXAMPLES 109 #define am(p. *info=0. sger */ pa1=am(k.k).0){*info=k.int info) { /* parallel C version of SGEFA: wpp 18/4/2000 */ int j.pa2. pa1=am(k. pa2=am(j.am(k.t.kp1.1).l).pa1.int n. t =*am(j.j<n.l)=*am(k.k))==0.1)+k.k).k. Section 4.kp1)).t. } /* end msaxpy.la).pa1.2 but note in passing that it is equivalent to the Level 2 BLAS routine sger. .q) (a+p*la+q) void sgefa(float *a.l). nm1 =n-1. sger */ } } #undef am A variant may be effected by replacing the lines between and including msaxpy and end msaxpy with saxpy msaxpy(n-k-1. l =isamax(n-k. if(l!=k){ pa1=am(kp1. float *pa1.t.pa1.k)).int la. pa1=am(k. sswap(n-k-1.sscal().1).n.0/(*am(k. for(k=0. } for(j=kp1. *am(k. where msaxpy is shown below (also see Appendix E for a review of column major ordering).pa2. } t =-1.l. return.k).8. saxpy(n-k-1.} if(l!=k){ t =*am(k. void saxpy().kp1).am(kp1. We will review this update again in Chapter 4.1. pa2=am(kp1.n-k-1.*pa2.pa1.kp1).k).nm1.k)=t.sswap().int *ip. ip[k]=l. a rank-1 update. sscal(n-k-1.la.am(kp1.kp1).k).

j<nc.i++){ *ym(i.5. . and (2) cyclic reduction for the solution of tridiagonal linear systems.1 Polynomial evaluation From the recurrence (3. (3.x.2. v2 x The next two are v3 v ← v2 1 . however. it is clear that p(k−1) must be available before the computation of p(k) by (3. Polynomial evaluations are usually evaluated by Horner’s rule (initially.j) += a[ka]*x[i].a(j)*x[*]+y(*. One scheme for x.q) (y+p+q*n) void msaxpy(nr. for(j=0.6).6) begins. polynomial evaluation In the previous Sections 3.y) int nr.j++){ for(i=0.6) where the result is Pn = p(n) . .j) wpp 29/01/2003 */ int i. but may be useful in instances when n is large.j. for long polynomials. Usually.nc.nc. } ka += n.*y. x3 .110 SIMD.j) <.n. Horner’s rule is strictly recursive and not vectorizable. p(0) = an ) as a recurrence formula p(k) = an−k + p(k−1) · x.4.a. SINGLE INSTRUCTION MULTIPLE DATA #define ym(p. we evaluate the polynomial by recursive doubling. float *a.i<nr. Instead. uses recursive doubling.ka. we are not so lucky and must use more sophisticated algorithms.*x. The following is not usually an effective way to compute Pn (x). we have shown how relatively simple loop interchanges can facilitate vectorizing loops by avoiding reductions. x2 .4. For a single polynomial of one argument x. We now n show two examples: (1) Polynomial evaluation (Pn (x) = k=0 ak xk ). ka=0. 3. x4 . with the first step.1. v4 v2 . } } #undef ym 3. .n..5 Recurrence formulae. { /* multiple SAXY: y(*. x v1 ← 2 . 3.

RECURRENCE FORMULAE, POLYNOMIAL EVALUATION

111

and the next four,

    v5 v1 v6  v2    ← v4   . v7  v3  v8 v4     v2k +1 v1  v2  v2k +2       ·  ← v2k  ·  .      ·   ·  v2k+1 v2k

The general scheme is

The polynomial evaluation may now be written as a dot product. float poly(int n,float *a,float *v,float x) { int i,one,id,nd,itd; fortran float SDOT(); float xi,p; v[0]=1; v[1] =x; v[2] =x*x; id =2; one =1; nd =id; while(nd<n){ itd=(id <= n)?id:(n-nd); xi =v[id]; #pragma ivdep for (i=1;i<=itd;i++) v[id+i]=xi*v[i]; id=id+id; nd=nd+itd; } nd=n+1; p =SDOT(&nd,a,&one,v,&one); return(p); } Obviously, n must be fairly large for all this to be efficient. In most instances, the actual problem at hand will be to evaluate the same polynomial for multiple arguments. In that case, the problem may be stated as
n

for i = 1, . . . , m, Pn (xi ) =
k=0

ak xk , i

that is, i = 1, . . . , m independent evaluations of Pn (xi ). For this situation, it is more efficient to again use Horner’s rule, but for m independent xi , i = 1, . . . , m, and we get

112

SIMD, SINGLE INSTRUCTION MULTIPLE DATA

void multihorner(float *a,int m,int n,float *x,float *p) { /* multiple (m) polynomial evaluations */ int i,j; for (j=0;j<=m;j++){ p[j] =a[n]; } for (i=n-1;i>=0;i--){ for (j=0;j<=m;j++){ p[j]=p[j]*x[j]+a[i]; } } return; } 3.5.2 A single tridiagonal system In this example, we illustrate using non-unit stride algorithms to permit parallelism. The solution to be solved is T x = b, where T is a tridiagonal matrix. For the purpose of illustration we consider a matrix of order n = 7,   d0 f0  e1 d1 f1     e2 d2 f2    , e3 d3 f3 (3.7) T =    e4 d4 f4    e5 d5 f5  e6 f6 For bookkeeping convenience we define e0 = 0 and fn−1 = 0. If T is diagonally dominant, that is, if di | > ei | + fi | holds for all i, then Gaussian elimination without pivoting provides a stable triangular decomposition,    1 c0 f0   1 1  c1 f1       1 c2 f2 2     .  1 c3 f3 T = LU ≡  3      1 c4 f4 4      1 c5 f5  5 1 c6 6 (3.8) Notice that the ones on the diagonal of L imply that the upper off-diagonal elements of U are equal to those of T . Comparing element—wise the two representations of T in (3.7) and (3.8) yields the usual recursive algorithm for determining the unknown entries of L and U [54]. The factorization portion could

RECURRENCE FORMULAE, POLYNOMIAL EVALUATION

113

be made in-place, that is, the vectors l and c could overwrite e and d such that there would be no workspace needed. The following version saves the input and computes l and c as separate arrays, however. After having computing the LU

void tridiag(int n,float *d,float *e,float *f, float *x,float *b,float *l,float *c) { /* solves a single tridiagonal system Tx=b. l=lower, c=diagonal */ int i; c[0] =e[0]; /* factorization */ for(i=1;i<n;i++){ l[i]=d[i]/c[i-1]; c[i]=e[i] - l[i]*f[i-1]; } x[0] =b[0]; /*forward to solve: Ly=b */ for(i=1;i<n;i++){ x[i]=b[i]-l[i]*x[i-1]; } x[n-1] =x[n-1]/c[n-1]; /* back: Ux=y */ for(i=n-2;i>=0;i--){ x[i]=(x[i]-f[i]*x[i+1])/c[i]; } } factorization, the tridiagonal system T x = LU x = b is solved by first finding y from Ly = b going forward, then computing x from U x = y by going backward. By Gaussian elimination, diagonally dominant tridiagonal systems can be solved stably in as few as 8n floating point operations. The LU factorization is not parallelizable because the computation of c requires that ci−1 is updated before ci can be computed. Likewise, in the following steps, xi−1 has to have been computed in the previous iteration of the loop before xi . Finally, in the back solve xi+1 must be computed before xi . For all three tasks, in each loop iteration array elements are needed from the previous iteration. Because of its importance, many attempts have been made to devise parallel tridiagonal system solvers [3, 13, 45, 69, 70, 81, 138, 148]. The algorithm that is closest in operation count to the original LU factorization is the so-called twisted factorization [145], where the factorization is started at both ends of the diagonal. In that approach, only two elements can be computed in parallel. Other approaches provide parallelism but require more work. The best known of these is cyclic reduction, which we now discuss.

114

SIMD, SINGLE INSTRUCTION MULTIPLE DATA

3.5.3 Solving tridiagonal systems by cyclic reduction. Cyclic reduction is an algorithm for solving tridiagonal systems that is parallelizable albeit at the expense of some redundant work. It was apparently first presented by Hockney [74]. We show the classical cyclic reduction for diagonally dominant systems. For a cyclic reduction for arbitrary tridiagonal systems see [5]. Let us return to the system of equations T x = b with the system matrix T given in (3.7). We rearrange x and b such that first the even and then the odd elements appear. In this process, T is permuted to become   d0 f0   d2 e2 f2    d4 e4 f4    d6 e6  . (3.9) T =     e1 f1 d1     e3 f3 d3 e5 f5 d5 T is still diagonally dominant. It can be factored stably in the form   d0 f0 1   1 d2 e2 f2     e4 1 d4    1 d6 T =     1 m1 1 d1 f1     m3 1 e3 d3 3 m5 1 e5 5    f4   e6  .    f3  d5 (3.10)

This is usually called a 2×2 block LU factorization. The diagonal blocks differ in their order by one if T has odd order. Otherwise, they have the same order. More precisely, the lower left block of L is n/2 × n/2 . The elements of L and U are given by
2k+1 = e2k+1 /d2k , m2k+1 = f2k+1 /d2k+2 , = d2k+1 − l2k+1 f2k − m2k+1 e2k+2 , dk e2k+1 = −l2k+1 e2k , d2k+1 = −m2k+1 d2k+2 ,

0≤k 0≤k 0≤k 1≤k 0≤k

< < < < <

n/2 n/2 n/2 n/2 n/2

, , , , − 1,

where we used the conventions e0 = 0 and fn−1 = 0. It is important to observe the ranges of k defined by the floor n/2 and ceiling n/2 brackets: the greatest integer less than or equal to n/2 and least integer greater than or equal to n/2, respectively (e.g. the standard C functions, floor and ceil, may be used [84]). The reduced system in the (2,2) block of U made up of the element di , ei , and fi is again a diagonally dominant tridiagonal matrix. Therefore the previous procedure can be repeated until a 1×1 system of equations remains.

Fig. 3.16. Times for cyclic reduction vs. the recursive procedure of Gaussian Elimination (GE) on pages 112, 113. Machine is Athos, a Cray SV-1. Speedup is about five for large n.

Cyclic reduction needs two additional vectors l and m. The reduced matrix containing the d'_i, e'_i, and f'_i can be stored in place of the original matrix T, that is, in the vectors d, e, and f. This classical way of implementing cyclic reduction evidently saves memory, but it has the disadvantage that the memory is accessed in large strides. In the first step of cyclic reduction, the indices of the entries of the reduced system are two apart. Applied recursively, the strides in cyclic reduction get larger by powers of two. This inevitably means cache misses on machines with memory hierarchies. Already on the Cray vector machines, large tridiagonal systems caused memory bank conflicts. Another way of storing the reduced system is by appending the d'_i, e'_i, and f'_i to the original elements d_i, e_i, and f_i. In the following C code, cyclic reduction is implemented in this way. Solving tridiagonal systems of equations by cyclic reduction costs 19n floating point operations: 10n for the factorization and 9n for forward and backward substitution. Thus the work redundancy of the cyclic reduction algorithm is about 2.5, because the Gaussian elimination of the previous section needs only 8n flops: see Figures 3.15, 3.16.

void fcr(int n, double *d, double *e, double *f, double *x, double *b)
{
   int k,kk,i,j,jj,m,nn[21];
   /* Initializations */
   m=n; nn[0]=n; i=0; kk=0;
   e[0]=0.0; f[n-1]=0.0;
   while (m>1){                      /* Gaussian elimination */
      k=kk; kk=k+m;
      nn[++i]=m-m/2;
      m=m/2;
      e[kk]=e[k+1]/d[k];

there is an inner loop. .e[jj+1]*f[kk+j]. Again. if X and B consists of L columns. Namely. e[kk+j+1]=e[jj+1]/d[jj]. . f[kk+j]=-f[jj+1]*f[kk+j]. k -= (nn[--i]+m). L − 1 systems. b[kk+j]=b[jj] . /* Back substitution */ while (i>0){ #pragma ivdep for (j=0. m=m+nn[i]. x[jj+1]=x[kk+j]. T L X columns = L B columns then we can use the recursive method on pages 112. kk=k. j<m. j++){ jj =k+2*j.f[jj]*x[kk+j])/d[jj]. #pragma ivdep for (j=0. 113. . In this situation. } if (m != nn[i]) x[kk-1]=(b[kk-1]-e[kk-1]*x[kk+m-1])/d[kk-1]. j<m-1. j<m. e[kk+j]=-e[jj-1]*e[kk+j].b[jj-1]*e[kk+j] .f[jj-1]*e[kk+j] .e[jj]*x[kk+j-1] .b[jj+1]*f[kk+j]. } } x[kk]=b[kk]/d[kk]. i = 0. SINGLE INSTRUCTION MULTIPLE DATA #pragma ivdep for (j=0. f[kk+j] =f[jj-1]/d[jj]. x[jj] =(b[jj] . . d[kk+j]=d[jj] . } if (m != nn[i]) f[kk+m-1]=f[kk-2]/d[kk-1]. bi . } } A superior parallelization is effected in the case of multiple right-hand sides.116 SIMD. j++){ jj =2*j+k+2. which counts the L columns of X: xi . . j++){ jj=k+2*j+1. the xm and bm macros are for Fortran ordering.

it is less efficient to use some non-unit strides: for example.j<L.float *d. we saw that doubling strides may be used to achieve SIMD parallelism. Hence an associated cache .j++){ xm(i. POLYNOMIAL EVALUATION 117 #define xm(p.i--){ for(j=0. Even so.j)=bm(0. a[i] and a[i+1] will be stored in successive banks (or columns).m[i]*f[i-1].float *u.5. On most machines. for(j=0.j++){ xm(i.j)-f[i]*xm(i+1. there may be memory bank conflicts on cacheless machines [123].j) . Memory bank conflicts may also occur in cache memory architectures and result from the organization of memory into banks wherein successive elements are stored in successive banks. this prefetcher is started.j)=xm(n-1.j))/u[i]. Intel features a hardware prefetcher: When a cache miss occurs twice for regularly ordered data.j<L.j)=bm(i.int L.q) *(x+q+p*n) #define bm(p.j. a bank requires a certain time to read (or store) them and refresh itself for subsequent reads (or stores). } } } #undef xm #undef bm 3.j++) xm(0.j++) xm(n-1.j)=(xm(i.j)/u[n-1]. the safest assumption is that other words nearby will also be used. basic cache structure assumes locality: if a word is fetched from memory.4 Another example of non-unit strides to achieve parallelism From the above tridiagonal system solution.j<L. once they have been previously used (locally).j<L.RECURRENCE FORMULAE. u[0]=e[0].float *x. for(i=n-2.i++){ m[i]=d[i]/u[i-1]. Requesting data from the same bank successively forces a delay until the read/refresh cycle is completed.float *b) { /* solves tridiagonal system with multiple right hand sides: one T matrix. u[i]=e[i] .i>=0. for(j=0. for(i=1. Upon accessing these data.i<n.float *e. For example. or cache misses for memory systems with cache.j).q) *(b+q+p*n) void multidiag(int n.float *m.m[i]*xm(i-1. Memory hardware often has mechanisms to anticipate constant stride memory references. } } for(j=1. float *f.j). L solutions X */ int i.

the previous value of that a will be lost forever and the final results will be incorrect. non-unit stride does not carry the same penalty as its cache-miss equivalents on cache memory machines. although the performance on Pentium 4 is quite respectable due to the anticipation of non-unit stride memory hardware.21 (which defines the a. .18 shows the double boxes.20 cannot be done in-place.17 and 3. and NEC vector hardware does not use caches for their vector memory references. examine the signal flow diagrams in Figures 3. Cray. Thus. That is. In this section. however. Array w contains the pre-computed twiddle factors: wk = exp(2πik/n) for k = 0.20 and 3. . there are such clever strategies and here we illustrate one— Temperton’s in-place. in-order algorithms. One step does not an algorithm make. . An astute observer will notice that the last stage can be done in-place. Finally.20. and d variables). but these procedures are more appropriate for large n (n = 2m is the dimension of the transform) and more complicated than we have space for here ([96]).17 shows the signal flow diagram for n = 16. The algorithm of Figure 3. There are two loops: one over the number of twiddle factors ω k . The algorithm is most appropriate for cacheless machines like the Cray SV-1 or NEC SX-5. with unit stride. . in-order binary radix FFT. the intermediate stages cannot write to the same memory locations as their inputs. To look a few pages ahead. another example of doubling stride memory access is shown: an in-place. Fortunately. Figure 3. even though parts of the line might not be used. c. Fujitsu. . Likewise for the c result of the second box (k = 1). and a second over the number of “boxes” that use any one particular twiddle factor. Notice that if the d result of the first (k = 0) computational box is written before the first k = 1 a input for the second box is used. • The first “half” of the m steps (recall n = 2m ) are from Cooley and Tukey [21]. Figure 3. n/2 − 1. The general strategy is as follows. and so on. here is the code for the driver (cfft 2) showing the order in which step1 (Cooley–Tukey step) and step2 (step plus ordering) are referenced. It turns out that there are in-place. • The final m/2 steps ((m − 1)/2 if m is odd) group the “boxes” in pairs so that storing results does not write onto inputs still to be used. in-order procedure. • The second “half” of the m steps also re-order the storage by using three loops instead of two. however. indexed by k in Figures 3. which will be written onto the a input for the third box (k = 2). SINGLE INSTRUCTION MULTIPLE DATA line is loaded from memory.118 SIMD.20) so there are no data dependencies. so one has to be clever to design a strategy. b. which will group the “boxes” (as in Figure 3. since the input locations are exactly the same as the outputs and do not use w. The point of using non-unit strides is to avoid data dependencies wherein memory locations could be written over before the previous (old) data are used.

ks = iw*mj.sign) int n.&x[p4][0].a. lj = n/(2*mj). WPP: 25/8/1999.mj. for(j=0.iw.w. mj = 1.&x[p2][0]. POLYNOMIAL EVALUATION 119 void cfft2(n. wkr=w[kw][0].k<lj. ij = n/mj.p4.wambr.wku*(a[ii+k][1]-b[ii+k][1]).w[][2].999)). wku=w[kw][1]. Computing.BK. wambu = wku*(a[ii+k][0]-b[ii+k][0]) + wkr*(a[ii+k][1]-b[ii+k][1]).lj.w[][2]. ETHZ */ int n2.w.mj.w.b[][2].sign).x.sign. wambr = wkr*(a[ii+k][0]-b[ii+k][0]) .iw.sign). m = (int) (log((float) n)/log(1.mj. step2(). if(sign > 0){ #pragma ivdep for(k=0. int i. a[ii+k][0] = a[ii+k][0]+b[ii+k][0]. for(i=0. p3 = mj.mj. &x[p3][0].iw. SIAM J.ks.mj.j++){ if(j < (m+1)/2){ p2 = n2/mj.w.k.wambu. step2(n. { float wkr.sign) int iw.kw. } else{ p2 = n2/mj.ij.m.&x[p2][0].RECURRENCE FORMULAE.n.i<mj. float a[][2].&x[0][0].iw.j<m. void step1(). .b.j.p2.ii. 1991. { /* n=2**m FFT from C.p3. float x[][2].int sign. step1(n.&x[0][0].i++){ ii = i*ij. n2 = n/2. } } Here is the Cooley–Tukey step (step1): void step1(n.wku. p4 = p2+mj. Temperton. on Sci. } mj = 2*mj.k++){ kw=k*ks. and Stat.iw.

3. wambu = wku*(a[ii+k][0]-b[ii+k][0]) + wkr*(a[ii+k][1]-b[ii+k][1]). self-sorting FFT. wambr = wkr*(a[ii+k][0]-b[ii+k][0]) .k++){ kw=k*ks. b[ii+k][0] = wambr. In-place. a[ii+k][0] = a[ii+k][0]+b[ii+k][0].wku*(a[ii+k][1]-b[ii+k][1]). b[ii+k][1] = wambu. wku=-w[kw][1]. b[ii+k][0] = wambr. 3. b[ii+k][1] = wambu.17. } } else { #pragma ivdep for(k=0.20. Also see Fig. } } } } 0 0 1 2 1 3 4 2 5 6 3 7 8 4 9 10 5 11 12 6 13 14 7 15 6 4 0 15 4 0 0 13 14 2 4 0 11 12 0 0 0 9 10 6 4 0 7 8 4 0 0 5 6 2 4 0 3 4 0 0 0 1 2 0 Fig. a[ii+k][1] = a[ii+k][1]+b[ii+k][1]. . wkr=w[kw][0].k<lj.120 SIMD. SINGLE INSTRUCTION MULTIPLE DATA a[ii+k][1] = a[ii+k][1]+b[ii+k][1].

Double “bug” for in-place.wcmdu. { float wkr.18. int sign.n.mj.sign) int iw. Not surprisingly.i.ii.c[][2].b[][2].a.d[][2].j+=n/mj){ wambr = wkr*(a[ii+j][0]-b[ii+j][0]) . Also see Fig. wkr = w[kw][0].k. mj2=2*mj. wku=(sign>0)?w[kw][1]:(-w[kw][1]).21.j.d. Next we have the routine for the last m/2 steps.wku.wambr.ks. for(i=0.lj.wcmdr.RECURRENCE FORMULAE. self-sorting FFT. wambu = wku*(a[ii+j][0]-b[ii+j][0]) + wkr*(a[ii+j][1]-b[ii+j][1]).kw.i<lj.k++){ kw = k*ks.i++){ ii = i*mj2. 3. POLYNOMIAL EVALUATION a a+b 121 k c+d b c w k (a–b) k d w k (c–d) Fig. void step2(n.c.mj.k<lj. for(k=0.wambu. ks=iw*mj. i loop structure (see also Chapter 4).wku*(a[ii+j][1]-b[ii+j][1]). .iw. float a[][2]. the k loop which indexes the twiddle factors (w) may be made into an outer loop with the same j. #pragma ivdep for(j=k. Notice that there are three loops and not two. int mj2. lj=n/mj2. 3.b. which permutes everything so that the final result comes out in the proper order.j<mj.w.w[][2].

c[ii+j][1] = wambu. d[ii+j][0] = wcmdr. these eight 4-word registers (of 32 bits) are well adapted to vector processing scientific computations. the Apple/Motorola modification of the IBM G-4 chip has additional hardware called Altivec: thirty-two 4-word registers. Our website BLAS code examples do not assume proper alignment but rather treat 1–3 word initial and end misalignments as special cases. a cacheline is 16 bytes long (4 words) and the associated memory loaded on 4-word boundaries. Because the saxpy operation is so simple and is part of an exercise. and Pentium 4 chips have a set of eight vector registers called XMM. it is not automatically done properly by the hardware as it is in the case of simple scalar (integer or floating point) data. The vector data are loaded (stored) from (into) cache when using vector registers (Vi in our examples) on these cacheline boundaries.6 we will cover FFT on Intel P-4 and Apple G4. therefore. we examine sdot and isamax on these two platforms. This misalignment must be handled by the program. wcmdu = wku*(c[ii+j][0]-d[ii+j][0]) + wkr*(c[ii+j][1]-d[ii+j][1]). Possible misalignment may occur for both loads and stores.2. When loading a scalar word from memory at address m. } } } } 3.2. As in Figure 3. if we wish to load a 4-word segment beginning at A3 . Similarly. from Section 3. see Section 3. this address is taken modulo 16 (bytes) and loading begins at address (m/16) · 16. wcmdr = wkr*(c[ii+j][0]-d[ii+j][0]) . c[ii+j][0] = wambr. One extremely important consideration in using the Altivec and SSE hardware is data alignment.5. Namely. As we showed in Section 1. Subsequently. The Intel Pentium III.19. b[ii+j][1] = c[ii+j][1]+d[ii+j][1]. b[ii+j][0] = c[ii+j][0]+d[ii+j][0]. we proceed a little deeper into vector processing.2. in the case of 4-way associativity. two cachelines (cache blocks) must be loaded to get the complete 4-word segment. in Section 3.7. d[ii+j][1] = wcmdu. in this example a misalignment occurs: the data are not on 16-byte boundaries.122 SIMD.wku*(c[ii+j][1]-d[ii+j][1]). In our BLAS examples below. cache loadings are associative. . an address in memory is loaded into cache modulo cache associativity: that is.5. When using the 4-word (128 bit) vector registers.5 Some examples from Intel SSE and Motorola Altivec In what follows. SINGLE INSTRUCTION MULTIPLE DATA a[ii+j][0] = a[ii+j][0]+b[ii+j][0]. we assume the data are properly 4-word aligned. a[ii+j][1] = a[ii+j][1]+b[ii+j][1]. Under the generic title SSE (for Streaming SIMD Extensions).

RECURRENCE FORMULAE. or section 2. . In particular. instead of mm malloc on Pentium 4. Hopefully edifying explanations for the central ideas are given here. Our web-server contains the complete code for both systems including simple test programs. For purposes of exposition. Vi does not refer directly to any actual hardware assignments. but using intrinsics is easier and allows comparison between the machines. POLYNOMIAL EVALUATION 4-word bndry. There. We only do a simple obvious variant. 4|n). we make the simplifying assumption that the number of elements in sdot n is a multiple of 4 (i. respectively. An important difference between the intrinsics versions and their assembler counterparts is that many details may be omitted in the C intrinsics. keep in mind that the “variables” Vi are only symbolic and actual register assignments are determined by the compiler. n = q · 4.3. One should note that valloc is a Posix standard and could be used for either machine. Somewhat abbreviated summaries of these intrinsics are given in Appendix A and Appendix B. Both Intel and Apple have chosen intrinsics for their programming paradigm on these platforms. Data misalignment in vector reads. register management and indexing are handled by the C compiler when using intrinsics. Assembly language programming provides much more precise access to this hardware.19. that is. only the reduction from the partial accumulation of V L (= 4 in the present case. 3. Furthermore.22 in [23]. in-line assembly language segments in the code cause the C optimizer serious problems. See [20]. 123 A0 A1 A2 A3 A4 A5 A6 A7 A-array in memory A0 A1 A2 A3 Cacheline 1 Load Vi A4 A5 A6 A7 Cacheline 2 X A0 A1 A2 A3 What we get A3 A4 A5 A6 What we wanted Fig. 3. Our FFT examples use 4-word alignment memory allocation— mm malloc or valloc for SSE and Altivec. One considers what one wants the compiler to do and hope it actually does something similar.6 SDOT on G-4 The basic procedure has been outlined in Section 3. see page 87) remains to be described. respectively. In what follows.e. Hence.5.

0.5. or a well scheduled branch prediction scheme is required.0. } 3. /* part sum of 4 */ V0 = vec_ld(0.yp). xp = x. vector float V0.V1. Namely.0.xp). We show here how it works in both the Pentium III and Pentium 4 machines. float *y) { /* no x. /* load x */ V1 = vec_ld(0. xp += 4.0).sum=0.i++){ V7 = vec_madd(V0. SINGLE INSTRUCTION MULTIPLE DATA The vec ld operations simply load a 4-word segment. V0 = vec_ld(0. /* load y */ nsegs = (n >> 2) .y strides */ float *xp.((nsegs+1) << 2). In Appendix B.nres. Our tests used gcc3.2.0. float psum[4]. /* nsegs = q */ vector float V7 = (vector float)(0.xp). The accumulation is in variable V7 .1.i<4. the merge operation referenced in Section 3. float *x.i++){ sum += psum[i].0. version 1161. April 4. Conversely.ii. while the vec madd operations multiply the first two arguments and add the 4-word result to the third. yp += 4.0.0. /* final part sum */ /* Now finish up: v7 contains partials */ vec_st(V7.psum).V1. float sdot0(int n. /* load next 4 x */ V1 = vec_ld(0. yp += 4.7 ISAMAX on Intel using SSE The index of maximum element search is somewhat trickier. First we show a Altivec version of sdot0. nres = n . 2003. yp = y.yp). .8 can be implemented in various ways on different machines. xp += 4. which may actually be any Altivec register chosen by the gcc compiler.0.124 SIMD.V7).nsegs. /* nres=n mod 4 */ for(i=0. that is sdot without strides. we include a slightly abbreviated list of vec operations.V1.V7). The choice of switches is given by gcc -O3 -faltivec sdottest. vec st stores the partial accumulation into memory. /* store partials to memory */ for(i=0. /* load next 4 y */ } V7 = vec_madd(V0.i<nsegs. } return(sum). int i.*yp.c -lm The gcc version of the C compiler is that released by Apple in their free Developers’ Kit.

• mm movemask ps counts the number which are larger.0). value */ for(i=0.V3.0).e. • mm andnot ps is used to compute the absolute values of the first argument by abs(a) = a and not(−0. float *x) /* no stride for x */ { float ebig. • mm add ps in this instance increments the index count by 4. /* old vs new */ . xp += 4. the 4-word SSE arrays are declared by m128 declarations. • mm cmpnle ps compares the first argument with the second. value */ V3 = _mm_cmpnle_ps(V1.V0). • mm set ps1 sets entire array to content of parenthetical constant. This routine is a stripped-down variant of that given in an Intel report [26].*xp.0).0.c -lm On the Pentium.0. we have assumed the x data are aligned on 4-word boundaries and that n = q · 4 (i. which supports this hardware on Pentium III or Pentium 4 is icc (Intel also provides a Fortran [24]): icc -O3 -axK -vec report3 isamaxtest.0. • mm and ps performs an and operation.nn.0. /* abs. /* abs.V0. xp = x.i++){ V1 = _mm_andnot_ps(V6.2. /* next 4 */ V0 = _mm_andnot_ps(V6.0.1.2.0. /* 1st 4 */ V1 = _mm_load_ps(xp). • mm load ps loads 4 words beginning at its pointer argument.mb. Data loads of x are performed for a segment (of 4) ahead. POLYNOMIAL EVALUATION 125 Our website has code for the Apple G-4.V1.0.ibig.RECURRENCE FORMULAE.2.V7. This is a mask of all bits except the sign. int isamax0(int n. • mm set ps sets the result array to the contents of the parenthetical constants.i<nsegs. The appropriate compiler. offset4 = _mm_set_ps1(4.0).V0). /* nsegs = q */ _m128 offset4.V6. • mm max ps selects the larger of the two arguments. In this example. Other XMM (SSE) instructions in the isamax routine are described in reference [27] and in slightly abbreviated form in Appendix A. _declspec (align(16)) float xbig[4].V2. if any. 4|n). nsegs = (nn >> 2) .V1). V7 = _mm_set_ps(3.0). • mm store ps stores 4 words beginning at the location specified by the first argument. V6 = _mm_set_ps1(-0. V0 = _mm_load_ps(xp).1. V2 = _mm_set_ps(3. to form a mask (in V3 here) if it is larger. int i. xp += 4.nsegs.0.indx[4].

/* any bigger */ V2 = _mm_add_ps(V2. The workspace version simply toggles back and forth between the input (x) and the workspace/output (y). However. /* load new 4 */ } /* process last segment of 4 */ V1 = _mm_andnot_ps(V6. /* abs.2. see: Section 2.V3). In fact.} V1 = _mm_load_ps(xp). a complex binary radix (n = 2m ) FFT. 108].20 shows that the last pass could in principle be done in-place. big = 0. value */ V3 = _mm_cmpnle_ps(V1.20. 30. /* add offset */ if(mb > 0){V0 = _mm_max_ps(V0. which uses a workspace.6 FFT on SSE and Altivec For our last two examples of SIMD programming.V7). some compiler optimizations could cause trouble.i++){ if(xbig[i]>big){big=xbig[i]. with some logic to assign the result to the output y.V3).V3).V1).offset4).126 SIMD. for(i=0. In Figure 3. if the output array is different. ibig = 0.2) is not needed except to adhere to the C language rules. Output from the bugs (Figure 3. the ccopy (copies one n–dimensional complex array into another. V7=_mm_max_ps(V7. /* BRANCH */ V3=_mm_and_ps(V2. Without it. The following code is the driver routine.21) will write into areas where the next elements are read—a dependency situation if the output array is the same as the input. As in Section 3. Because these machines favor unit stride memory access. 123]. we turn again to Fast Fourier Transform. The answer is yes. . see references [28. For faster but more complicated variants. because our preference has been for self-sorting methods we chose an algorithm. A careful examination of Figure 3. 31. } 3. we use a self-sorting algorithm with unit stride but this uses a workspace [141. one can see the signal flow diagram for n = 8. /* any bigger */ V2 = _mm_add_ps(V2. V3=_mm_and_ps(V2. /* add offset */ if(mb > 0){V0 = _mm_max_ps(V0. it is apparent why a workspace is needed. for reasons of simplicity of exposition.V1). but it is complicated to do with unit stride. _mm_store_ps(indx. SINGLE INSTRUCTION MULTIPLE DATA mb = _mm_movemask_ps(V3). Hence. xp += 4. /* old vs new */ mb = _mm_movemask_ps(V3). ibig=(int)indx[i].0.i<4.V0). If you look carefully.} /* Finish up: maxima are in V0.V0).offset4). indices in V7 */ _mm_store_ps(xbig. this time using the Intel SSE and Apple/Motorola Altivec hardware.} } return(ibig).V3). V7=_mm_max_ps(V7. we chose a form which restricts the inner loop access to stride one in memory.V1). this data dependency goes away. An obvious question is whether there is an in-place and self-sorting method.

where w = the nth root of unity. . the .FFT ON SSE AND ALTIVEC 0 0 1 0 0 1 0 127 2 1 3 0 0 2 3 4 2 5 1 0 4 5 6 3 7 1 0 6 7 Fig. Im x0 .20. Each “bug” computes: c = a+b and d = wk (a−b). Since complex data in Fortran and similar variants of C supporting complex data type are stored Re x0 . Workspace version of self-sorting FFT. Decimation in time computational “bug.21. a FFT “bug” c=a+b k w=e n b 2πi d = w k(a – b) Fig. 3.21. .” One step through the data is represented in Figure 3. . Re x1 . .20 by a column of boxes (bugs) shown in Figure 3. 3.

sign).99)). tgle = 0. c is easy. V6 =  ωr  . &y[0][0]. (result) ωi (a − b)r    .&y[mj][0].22. float x[][2]. if(tgle){ step(n. { /* x=in.  Fig.&x[0][0]. m.&y[n/2][0]. step(n. w=exp(2*pi*i*k/n). void ccopy().y. m = (int)(log((float)n)/log(1.21. V7 =  −ωi  .22 indicates how mm shuffle ps is used for this calculation. In Figure 3.mj.} /* if tgle: y -> x */ mj = n/2.sign). tgle.step().sign). tgle=1. j. Complex arithmetic for d = wk (a − b) on SSE and Altivec.sign. &y[0][0].mj. mj=1.w.x. &x[0][0]. Figure 3.j<m-2.  V1 = V 7 · V4 =   −ωi (a − b)i  .w. 3. } } if(tgle){ccopy(n.w[][2]. . V0 = V6 · V3 =  ωr (a − b)r (a − b)r ωr (a − b)i   −ωi (a − b)i  ωi (a − b)r  d = V 2 = V0 + V1 .&x[n/2][0].j++){ mj *= 2.&y[mj][0]. SINGLE INSTRUCTION MULTIPLE DATA      (a − b)r ωr −ωi  (a − b)i   ωr   ωi       V3 =   (a − b)r  .y[][2]. tgle = 1.128 SIMD. y=out..y.x). for(j=0.&x[0][0]. arithmetic for d (now a vector d with two complex elements) is slightly involved.&x[n/2][0].sign) int n. } else { step(n.w. k=0. void cfft2(n.&y[0][0].&x[mj][0].n/2-1 */ int jb.mj.w. mj. (a − b)i ωr ωi    (a − b)i ωr (a − b)r  (a − b)r   ωr (a − b)i shuf f le   V3 −→ V4 =   (a − b)i  .

k<mj. { int j. /* c to M */ V3 = _mm_sub_ps(V0.jc. c[jc][1] = a[jw][1] + b[jw][1].&x[0][0]. lj = n/mj2.mj.0.b. wu[2] = -up.V4. float rp.jw.c. j<lj.d. V6 = _mm_load_ps(wr). .b[jw][1]). wr[2] = rp. _m128 V0.V1).a. . wr[3] = rp. jc = j*mj2. V2 = _mm_add_ps(V0. wr[1] = rp. V7 = _mm_load_ps(wu). } The next code is the Intel SSE version of step [27]. d[jc][1] = up*(a[jw][0] .mj2. up = w[jw][1]. float a[][2]. . _MM_SHUFFLE(2.V2.b[jw][0]) .mseg.wu[4].up*(a[jw][1] . j++){ jw = j*mj. the locations of half-arrays (a. if(mj<2){ /* special case mj=1 */ d[jc][0] = rp*(a[jw][0] .V7.&x[n/2][0].V3).FFT ON SSE AND ALTIVEC 129 step(n. mj.0) up = -up. } else { /* mj > 1 cases */ wr[0] = rp. V1 = _mm_load_ps(&b[jw+k][0]).l. mj2 = 2*mj.w. wu[1] = up. b.sign) int n.b[jw][1]). /* a-b */ V4 = _mm_shuffle_ps(V3.1)).3.lj.V6.k.&y[mj][0].V1. /* a+b */ _mm_store_ps(&c[jc+k][0].b[][2]. _declspec (align(16)) float wr[4]. void step(n.sign). c[jc][0] = a[jw][0] + b[jw][0]. for(j=0. k+=2){ V0 = _mm_load_ps(&a[jw+k][0]). .c[][2]. if(sign<0. for(k=0.V3.w. c.V1). d) are easy to determine: b is located n/2 complex elements from a.d[][2]. wu[3] = up. log2 (n) − 1 is the step number.V2).w[][2]. &y[0][0]. above).up. while the output d is mj = 2j complex locations from c where j = 0. rp = w[jw][0]. wu[0] = -up. . V0 = _mm_mul_ps(V6.mj.b[jw][0]) + rp*(a[jw][1] .V3.sign. Examining the driver (cfft2.

6). and MKL [28] uses the pthreads library to facilitate this. Multiple processors may be used when available. the SSE intrinsics (Section 3. notice that the improvement using SSE is a factor of 2. cplx data type. labeled k in Figure 3. the work of the outer loop (over twiddle factors. 3. fewer need to be accessed from memory. Also.4.8. In that case. /* w*(a-b) */ _mm_store_ps(&d[jc+k][0]. Older versions of Apple Developers’ kit C compiler.V4). is independent of every other k. Ito: 1. /* d to M */ } } } } In Figure 3. Because of symmetries in the twiddle factors.7 GHz Pentium 4 running Linux version 2.130 SIMD. and generic FFT. Intrinsics. Thus.23.V2). which uses the same algorithm. V2 = _mm_add_ps(V0. However.35 faster than the generic version. . but we do not 1500 n = 2m FFT.V1). supported complex arithmetic. Ito: Pentium 4 1000 Mflops cfft2 (SSE) In-place 500 Generic 0 101 102 n 103 104 Fig. and a generic version of the same workspace algorithm on a 1. these tasks may be distributed over multiple CPUs. In our Chapter 4 (Section 4.5. the w array elements) may be distributed over multiple processors.4). The work done by each computation box. SINGLE INSTRUCTION MULTIPLE DATA V1 = _mm_mul_ps(V7.7 GHz Pentium 4. we show the performance of three FFT variants: the in-place (Section 3.21.3) on shared memory parallelism we show a variant of step used here. competing with professionals is always hard.23. Here is step for the G-4 Altivec. There are many possible improvements: both split-radix [44] and radix 4 are faster than radix 2. gcc.18. in-place (non-unit stride).

V4. vector float V0. V7 = vec_ld(0. wu[1]=up. for(k=0. rp = w[jw][0].).8. V7 = vec_xor(V7.FFT ON SSE AND ALTIVEC 131 use this typing here.. V1). k<mj.wu)... if(sign<0.0.float w[][2].. for(j=0. wu[0]=up.vminus).int mj.V5.10. float c[][2]. wu[2]=up.0.6.V6.lj. wr[3]=rp. float b[][2]. wu[4]. wr[2]=rp. c[jc][1] = a[jw][1] + b[jw][1]. wu[3]=up.0. V6 = vec_ld(0. wr[1]=rp.V2.1.).wr).b */ V0 = vec_ld(0. k+=2){ /* read a.11). up = w[jw][1].. /* c=a-b */ .mj2.V1.0.2.k.9.(vector float *)&a[jw+k][0]).b[jw][0]) .b[jw][1]). V2 = vec_add(V0.up. float a[][2].7. if(mj<2){ /* special case mj=1 */ d[jc][0] = rp*(a[jw][0] .l. mj2 = 2*mj. float d[][2].15. V1 = vec_ld(0.(vector float *)&b[jw+k][0]). j<lj. jc = j*mj2. _cvf vminus = (vector float)(-0.V7.5. #define _cvf constant vector float #define _cvuc constant vector unsigned char void step(int n.14. c[jc][0] = a[jw][0] + b[jw][0].b[jw][0]) + rp*(a[jw][1] .jw.b[jw][1]).-0.up*(a[jw][1] .0.. _cvuc pv3201 = (vector unsigned char) (4. j++){ jw = j*mj. d[jc][1] = up*(a[jw][0] . Newer versions apparently do not encourage this complex typing.13.0. _cvf vzero = (vector float)(0.V3. float sign) { int j.jc.12.3. float wr[4]. float rp. } else { /* mj>=2 case */ wr[0]=rp. lj = n/mj2.0) up = -up.

• V2 = vec add(V0. and (2) the a∗k bkj .V1. V1 = vec_madd(V7. • V3 = vec sub(V0. .0.V3.ptr) stores the four word contents of vector V2 beginning at pointer ptr with block offset 0.2) and the Developers’ kit gcc compiler (version 1161. V1). SINGLE INSTRUCTION MULTIPLE DATA vec_st(V2.pv) selects bytes from V0 or V1 according to the permutation vector pv to yield a permuted vector V2. which processes whole columns/time. This is a 1. V2 = vec_add(V0. V1) subtracts the four floating point elements of V1 from V0 to yield result V3.0. These are arranged in little endian fashion [30]. • vec st(V2. Exercise 3. we show our results on a Power Mac G-4. program the dot-product variant using three nested loops. while the second uses repeated saxpy operations.24. substitute sdot.0. First. • V2 = vec perm(V0.vzero). /* d=w*(a-b) */ vec_st(v2.pv3201). What is to be done? The task here is to code these two variants in several ways: 1. V1) performs an element by element floating point add of V0 and V1 to yield result vector V2. and likewise the add.V4. V4 = vec_perm(V3.(vector float *)&d[jc+k][0]).1 are descriptions of two variations of matrix–matrix multiply: C = AB. then instead of the inner-most loop. A summary is also given in Appendix B. /* shuffle */ V0 = vec_madd(V6. V3 = vec_sub(V0. The Altivec version using intrinsics is three times faster than the generic one using the same algorithm.(vector float *)&c[jc+k][0]). The two are (1) the text book method which computes cij = k aik bkj .V1).1 Variants of matrix–matrix multiply In Section 3.25 GHz machine running OS-X (version 10.V2) is a multiply–add operation: V3 = V0*V1+V2.4.132 SIMD. In Figure 3. outer product variant c∗j = k The first variant is a dot-product (sdot). April 4.V1. • V0 = vec ld(0. 2003). } } } } Although the Altivec instructions used here are similar to the SSE variant.V3. where the multiply is element by element. • V3 = vec madd(V0.ptr) loads its result V0 with four floating point words beginning at pointer ptr and block offset 0. here is a brief summary of their functionality 30.vzero).

Using large enough matrices to get good timing results. and generic FFT. Finally. The utilities nm or ar (plus grep) may make it easier to find the BLAS routines you want: ar t libblas.so |grep -i axpy.FFT ON SSE AND ALTIVEC 1000 n = 2m FFT. libblas. To make your life simpler. choose the leading dimension of C (also A) to be even. so many repetitions may be required. Again. then replace the inner-most loop with saxpy. Ogdoad: Mac G–4 800 cffts (Altivec) 133 600 Mflops In-place 400 200 Generic 0 100 101 102 103 n 104 105 106 Fig. so consult Appendix E to get the Fortran-C communication correct and the proper libraries loaded. That is. except on Cray platforms where second gives better resolution. see if you can unroll the first outer loop to a depth of 2—hence using two columns of B as operands at a time (Section 3.25 GHz Power Mac G-4. These are likely to be Fortran routines. 3. 3.a or as a shared object libblas. in-place (non-unit stride). .so.a |grep -i dot. 2.24. Next. 5. Tests are from a machine named Ogdoad: 1. see if you can estimate the function call overhead. program the outer-product variant using three nested loops. Do you see any cache effects for large n (the leading dimension of A and C)? Helpful hints: Look in /usr/local/lib for the BLAS: for example.2). Intrinsics. using the outer-product method (without substituting saxpy). or nm libblas. 4. we recommend using the clock function. how long does it take to call either sdot or saxpy even if n = 0? This is trickier than it seems: some timers only have 1/60 second resolution. compare the performances of these variants. By using multiple repetitions of the calculations with BLAS routines sdot and saxpy.

respectively. tmm0). float *x) { /* SSE version of sscal */ int i. float alpha. Here are two examples. i < 4. float alpha. _m128 tmm0. const vector float vzero = (vector float)(0. for (i = 0.ns. /* load 4 x’s */ tmm1 = _mm_mul_ps(tmm1. float *x) { /* Altivec version of sscal */ int i. for (i = 0. for (i = 0.V1. float alpha_vec[4]. i < ns. we ask you to program two basic functions in this mode: the saxpy operation (y ← α · x + y). i ++){ alpha_vec[i] = alpha. float alpha_vec[4].. /* 4 copies of alpha */ } tmm0 = _mm_load_ps(alpha_vec). The non-unit stride case is an exercise for real gurus. /* store x’s */ } } Altivec version: void sscal(int n.. doing the simpler sscal problem (x ← α · x). SINGLE INSTRUCTION MULTIPLE DATA Exercise 3.ns. i < 4.5 we gave descriptions of VL = 4-vector programming on Intel Pentium III and Pentium 4 using SSE (software streaming extensions).0.134 SIMD.tmm1. i ++){ alpha_vec[i] = alpha. n−1 and ssum a summation (s ← i=0 xi ). i ++) { tmm1 = _mm_load_ps(&x[4*i]).0. /* alpha*x’s */ _mm_store_ps(&x[4*i]. /* alphas in tmm0 */ ns = n/4. for SSE (Intel) and Altivec (Apple). To make your job easier. In this exercise. The programming mode uses intrinsics. as below.5.0. you can also assume that 4|n (n is divisible by 4). SSE version: void sscal(int n. and Apple/Motorola G-4 Altivec.2 Using intrinsics and SSE or Altivec In Section 3. although the more general case is a useful exercise. For the squeemish. /* 4 copies of alpha */ .tmm1).. vector float V0.). assume that the elements of the appropriate arrays (x. y) have unit spacing between them.

h>. • The Intel icc compiler is available from http://developer. and scale your timings by CLOCKS PER SEC defined in <time.0. 1. program both saxpy and ssum using either the SSE or Altivec intrinsics.(vector float *)&x[4*i]).com/.simdtech.FFT ON SSE AND ALTIVEC 135 } V0 = vec_ld(0.org. Modify your code to use a pre-fetch strategy to load the next segment to be used while computing the current segment. If your local machines do not have the appropriate compilers. without support) from • The gcc Apple compiler is available from the Apple developer web-site http://developer. 2. References: Technical report [26] and website www. i ++) { V1 = vec_ld(0. /* load */ V1 = vec_madd(V0.7 or sdot from Section 3.intel. they are available (free. i < ns. /* a*x */ vec_st(V1. for (i = 0. Use the system timer time = (double) clock(). The examples isamax from Section 3. .alpha_vec).com/. Our Beowulf machine Asgard has an up-to-date version of this compiler.apple.5.5. You may choose one machine or the other.vzero). /* copies into V0 */ ns = n/4. 3.(vector float *)&x[4*i]). Compare the pre-fetch strategy to the vanilla version. /* store */ } } What is to be done? Using one/other of the above sscal routines as examples.V1. Write test programs for these routines—with a large number of iterations and/or big vectors—to get good timings for your results.6 may be helpful for coding the reduction operation ssum.

a common memory/cell. Th.. E PCI Six I/O slots Crossbar Fig. An effective bandwidth test.. .4 SHARED MEMORY PARALLELISM I think there’s a world market for about five computers. and connected to a CCNUMA crossbar network. Six I/O slots PCI Memory PA 8700 PA 8700 Cell control PA 8700 PA 8700 I/O control E . PCI E E PCI .2.1).2... each with 4 CPUs. One cell of the HP9000 Superdome. 4. The network consists of sets of 4 × 4 crossbars and is shown in Figure 4... and Pegasus is a 32 CPU machine with the same kind of processors.. J. groups processors into two equally sized sets. say 2–128.. An intrinsic characteristic of these machines is a strategy for memory coherence and a fast tightly coupled network for distributing data from a commonly accessible memory system.. 4.. the EFF BW benchmark [116]. Each cell has 4 PA-8700 CPUs and a common memory.1 Introduction Shared memory machines typically have relatively few processors. Watson (1943) 4.1. Our test examples were run on two HP Superdome clusters: Stardust is a production machine with 64 PA-8700 processors. See Figure 4.2 HP9000 Superdome machine The HP9000 is grouped into cells (Figure 4.

and 4 corresponding caches. Each multi-streaming processor (MSP) shown in Figure 4.5 µs. It is clear from this figure that the crosssectional bandwidth of the network is quite high.3 Cray X1 machine A clearer example of a shared memory architecture is the Cray X1 machine. Due to the low incremental resolution of MPI Wtime (see p.4. The results from the HP9000 machine Stardust are shown in Figure 4. the shared memory design is obvious.CRAY X1 MACHINE 137 Cell 1 crossbar Cell 1 Cell 4 Cell 2 crossbar Cell 2 Cell 3 Fig.3.5 has 4 processors (custom designed processor chips forged by IBM).5 and 4. shown in Figures 4. 4. multiple test runs must be done to quantify the latency. 234). 4. In Figure 4. Although not apparent from Figure 4. Crossbar interconnect architecture of the HP9000 Superdome.6.6. Cell 3 crossbar Cell 4 crossbar . the latency for this test (the intercept near Message Size = 0) is not high. Arbitrary pairings are made between elements from each group. The processors are divided into two equally sized groups and arbitrary pairwise connections are made between processors from each group: simultaneous messages from each pair are sent and received.2. Figure 4.4. Dr Byrde’s tests show that minimum latency is 1. Group 1 Effective BW Group 2 Fig. Pallas EFF BW benchmark. and the cross-sectional bandwidth of the network is measured for a fixed number of processors and varying message sizes. 4.3.

Figure 4. That is. These data are courtesy of Olivier Byrde. The collapse of the bandwidth for large message lengths seems to be cache/buffer management configuration.4. The only memory in each of these MSPs consists of four sets of 1/2 MB caches (called Ecache). EFF BW benchmark on Stardust.138 SHARED MEMORY PARALLELISM Pallas eff_Bw Benchmark 4500 4000 3500 2 CPUs 4 CPUs 8 CPUs 16 CPUs 24 CPUs 32 CPUs Bandwidth (MB/s) 3000 2500 2000 1500 1000 500 0 64 128 256 512 1064 2048 4096 8192 16384 32768 65536 IE + 05 3E + 05 5E + 05 1E + 06 2E + 06 4E + 06 8E + 06 Message size (Bytes) Fig.6. hence the term streaming in MSP. vector memory access apparently permits cache by-pass. Cray X1 MSP.4.5. Figure 3. 4. On each board (called nodes) are 4 such MSPs and 16 memory modules which share a common 2E + 07 . Although not clear from available diagrams. There are four of these MSPs per node. 4. for example. Scalar Vector Vector Scalar Vector Vector Scalar Vector Vector Scalar Vector Vector 1/2 MB cache 1/2 MB cache 1/2 MB cache 1/2 MB cache To node memory system crossbar Fig. vector registers are loaded directly from memory: see.

7). made possible by CMOS technology.6. MSP P C P C P C P C 4 proc. The processors have 2 complete sets of vector units. and attendant arithmetic/logical/shift functional units. 4. MPI is used. Cache coherency is maintained only within one node. but not across multiple board systems.NEC SX-6 MACHINE 4 proc. Between nodes. Message passing is used between nodes: see Chapter 5. and (2) there are up to 8 vector units/CPU [139].4 NEC SX-6 machine Another example of vector CPUs tightly coupled to a common memory system is the Nippon Electric Company (NEC) SX-6 series (Figure 4. MSP P C P C P C P C 4 proc. Our experience has shown that NEC’s fast clocks. Reconfigurable vector registers means that the 144 kB register file may be partitioned into a number of fixed sized vector registers whose total size remains 144 kB.8). At least two features distinguish it from Cray designs: (1) reconfigurable vector register files. and recent versions are extremely powerful. and multiple vector functional unit sets per CPU make these machines formidable. Up to 128 nodes (each with 8 CPUs) are available in the SX-6/128M models (Figure 4. 4. 16 processors total. MSP P C P C P C P C 139 Node memory system crossbar Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Fig. Between nodes. as described in Chapter 5. Cray X1 node (board): each board has 4 groups of 4 MSPs (Figure 4. The Yokohama Earth Simulator is a particularly large version (5120 nodes) from this line of NEC distributed node vector machines. The evolution of this vector processing line follows from earlier Watanabe designed SX-2 through SX-5 machines. each containing 32 vector registers of 64 words/each (each word is 64-bits). Memory . MSP P C P C P C P C 4 proc.5). message passing is used. (coherent) memory view. Coherence is only maintained on each board.

. A Posix standard. allows tighter control of shared memory parallelism but is more painful to program.controllers . . because it allows so much . 4. . . .. . . Namely. Mask unit Logical CPU load/ store unit To mem.. ..140 SHARED MEMORY PARALLELISM Vector unit 8x Mask reg. .. . ...8. . we use OpenMP [17] because it seems to have become a standard shared memory programming paradigm.. . These vector registers are reconfigurable with different numbers of words/vector. NEC SX-6 node: each contains multiple vector CPUs. Appendix C gives a summary of its functionality. a scalar unit. pthreads. .5 OpenMP standard In this book. .. . see Figure 4. Each node may have between 4 and 8 CPUs which are connected to a common memory system. . 4. . NEC SX-6 CPU: each node contains multiple vector units...8. and a scalar cache. See Figure 4. . 4.. . CPU 0 CPU 1 CPU 2 . . Vector registers are 144 kB register files of 64-bit words..4 . CPU 7 Single node (8 CPU) memory crossbar IOP 0 .. bus Scalar unit Scalar cache Vector register block Multiply Add/shift Divide Scalar registers Scalar pipeline Fig..7.7. IOP 3 Operator I/O devices station other HIPPI I/O network Fig. Up to 16 such nodes may be connected to each other by a Internode crossbar switch: up to 128 CPUs total. Each CPU may have between 4 and 8 vector register sets and corresponding vector functional units sets. I/O . .

but scaling degrades when the number of CPUs exceeds eight. 108.6 Shared memory versions of the BLAS and LAPACK Loop structures of linear algebra tasks are the first places to look for parallelism.4 8. From p.29 S(p) 1 2. matrix–matrix multiplication. 4.61 0. but flattens to 10 on 16.03 0.3 16.2 4.6 10.0 4.62 2.02 0.09 0. OpenMP is implemenu ted using pthreads.4 8.0 n = 1000 t(s) 0.0 S(p) 1 2. The execution times are the best of three measurements.15 1. In our examples from the Hewlett-Packard Superdome machine at ETH Z¨rich.0 4.08 S(p) 1 2. more work is required for the programmer. and LU factorization—have nested loop structures.16 0.7 4.3 8.9 4.6 3. In Table 4.32 0.3 15. Similar scaling to 8 CPUs was reported by R¨llin and Fichtner [126].08 0.3 32.2). The speedup is 7.02 0.2 4.3 3.2 4.08 0.08 0.3 12.30 0.9 n = 5000 t(s) 72.45 0. one can see that in dgetrf the routines to parallelize are dtrsm and dgemm (see Table 2. we explore the outer-loop parallelism of this straightforward implementation and find its efficiency limited by poor scaling: it works well on a few processors.5 14.09 0.3 5. The mathematical library MLIB from Hewlett-Packard [73] conforms to the public domain version 3. it is usually straightforward to parallelize these loop-based algorithms by means of compiler directives from OpenMP. Looking at the triangular factorization in Figure 2. In this section.05 0. This is also true on many Intel PC clusters.3 8.66 0.37 0.3.6 2.4 15.0 of LAPACK in all user-visible usage conventions. In a shared memory environment. Time critical tasks—matrix–vector multiplication. As the important basic tasks are embedded in basic linear algebra subroutines (BLAS) it often suffices to parallelize these in order to get good parallel performance.7 21.1 we list execution times for solving systems of equations of orders from n = 500–5000.1 . p n = 500 t(s) 1 2 4 8 12 16 24 32 0. But the internal workings of some subprograms have been tuned and optimized for HP computers.02 S(p) 1 1.2 7.3 8. For problem sizes up to 1000. we saw a straightforward parallelized version of the Linpack benchmark sgefa.4 on 8 CPUs.SHARED MEMORY VERSIONS OF THE BLAS AND LAPACK 141 user control. but they do not show data for any o larger number of CPUs.1 7.1 Times t in seconds (s) and speedups S(p) for various problem sizes n and processor numbers p for solving a random system of equations with the general solver dgesv of LAPACK on the HP Superdome.9 24.7 12.3 n = 2000 t(s) 4. it evidently does not make sense to employ more Table 4.

synchronization) soon outweighs the gain from splitting the work among processors. a methodology was developed for the automatic generation of efficient basic linear algebra routines for computers with hierarchical memories. Within an OpenMP parallel region. Often. the programmer’s first task is data scoping. These depend on the algorithm but even more on the cache and memory sizes.3. On many platforms. 4. Performance of the BLAS routines largely depend on the right choices of block (or panel) sizes. Also. but the benefit remains as long as the hardware is not modified. which means their values may be changed without modifying other private copies. The overhead for the management of the various threads (startup. however.7 Basic operations with vectors In Sections 2. 150]. This approach was successful for the matrix–matrix multiplication [149]. the speedup again deteriorates. we discussed both the saxpy operation (2.2. This is not the case for n = 2000. The operating system apparently finds it hard to allocate threads to equally (un)loaded processors. made this approach less than satisfactory. The speedup is very good up to 8 processors and good up to 16 processors. only the fraction of the BLAS used in the LINPACK benchmark were properly tuned. an ambitious project was started in the late 1990s with the aim of letting the computer do this tedious work. these operations are reprogrammed for shared memory machines using OpenMP. the execution times are no longer reproducible.142 SHARED MEMORY PARALLELISM than a very few processors. In coding such shared memory routines. Even though shared data may be modified by any processor. the automated tuning of Level 3 BLAS dgemm outperformed the hand-tuned version from computer vendors.3) y = αx + y. The authors of LAPACK and previously of LINPACK initially hoped that independent computer vendors would provide high performing BLAS. This effect is not so evident in our largest problem size tests. The execution times and speedups for these small problems level off so quickly that it appears the operating system makes available only a limited number of processors. Beyond this number. and inner product sdot (2. The idea in principle is fairly simple.1 and 3. data are either private (meaning local to the processor computing the inner portion of the parallel region) or shared (meaning data shared between processors). Quickly changing hardware.4) s = x · y. In the ATLAS project [149. In the following. Loop variables and temporary variables are local to a processor and should be declared private. Tuning all of the BLAS takes hours. the . Faced with this misery. One just measures the execution times of the building blocks of the BLAS with varying parameters and chooses those settings that provide the fastest results.

we can change the above code fragment by adding a schedule qualifier: #pragma omp parallel for schedule(static.2 Some execution times in microseconds for the saxpy operation.1 SAXPY The parallelization of the saxpy operation is simple. An OpenMP for directive generates a parallel region and assigns the vectors to the p processors in blocks (or chunks) of size N/p. 4. private data may be modified only locally by one processor (or group of processors working on a common task). The latter causes more overhead of OpenMP’s thread management. the parallel regions are not really parallel.7. In Table 4.7. timings on the HP Superdome are shown for N = 106 and chunks of size N/p. Otherwise. see Section 3.100) for (i=0. 100. Both chunk sizes N/p and 100 give good timings and speedups. } Now chunks of size 100 are cyclically assigned to each processor. In that case.BASIC OPERATIONS WITH VECTORS 143 computation is not parallel unless only independent regions are modified. there will be a data dependency. Here 2.2. 4. i++){ y[i] += alpha*x[i]. #pragma omp parallel for for (i=0. while shared data are globally accessible but each region to be modified should be independent of other processors working on the same shared data.1. This overhead becomes quite overwhelming for the block size 4. i++){ y[i] += alpha*x[i]. } When we want the processors to get blocks of smaller sizes we can do this by the schedule option.5 · 105 loops of only Table 4.1 Basic vector operations with OpenMP 4. Several examples of this scoping will be given below.2. i< N. and 1. see Appendix C. It is evident that large chunk sizes are to be preferred for vector operations. Chunk size N/p 100 4 1 p=1 1674 1694 1934 2593 2 4 6 8 12 16 854 449 317 239 176 59 1089 601 405 317 239 166 2139 1606 1294 850 742 483 2993 3159 2553 2334 2329 2129 . If a block size of 100 is desired. There are ways to control such dependencies: locks and synchronization barriers are necessary. i< N. Hence.

the variable dot is written N times. In this implementation of the dot product. we introduce a local variable dot = 0. There is no speedup at all. Global variable dot unprotected. = 0. Here. 4. and thus giving incorrect results (version I). i< N. the memory traffic is the bottleneck. To understand this we have to remember that all variables except the loop counter are shared among the processors. This can be done by protecting the statement that modifies dot from being executed by multiple threads at the same time. #pragma omp parallel for for (i=0.3.144 SHARED MEMORY PARALLELISM length 4 are to be issued. i++){ dot += x[i]*y[i]. Figure 4.2 Dot product OpenMP implements a fork-join parallelism on a shared memory multicomputer. we see that the result is only correct using one processor. the results are not even reproducible when p > 1.10: While this solution is correct. OpenMP critical region protection for global variable dot (version II). the variable dot is read and updated in an asynchronous way by all the processors. } Fig. Therefore the result of the dot product will be stored in a single variable in the master thread. see Figure 4. Each of the four double words of a cacheline are handled by a different processor (if p ≥ 4). To show this.10.0.7. i++){ omp critical { dot += x[i]*y[i]. To prevent data dependencies among threads. Here. The block size 1 is of course ridiculous.0. it is intolerably slow because it is really serial execution. There is a barrier in each iteration that serializes the access to the memory cell that stores dot. } . dot #pragma for #pragma } Fig. In fact. 4. In order to prevent untimely accesses of dot we have to protect reading and storing it. The speed of the computation is determined by the memory bandwidth. This mutual exclusion synchronization is enforced by the critical construct in OpenMP.9. i< N. 4. In a first approach we proceed in a similar fashion to saxpy. By running this code segment on multiple processors. This phenomenon is known as a race condition in which the precise timing of instruction execution effects the results [125].9. omp parallel for (i=0.1. we list the execution times for various numbers of processors in row II of Table 4.

k. To get the full result. OpenMP critical region protection only for local accumulations local dot (version III).576 171.12. local_dot = 0. and line IV means parallel reduction for from Figure 4.0.10.N. i< offset+(N/p). Line I means no protection for global variable dot.offset) #pragma omp for for(k=0. inner products and maximum element searches.11). using the C compiler guidec.y) \ private(local_dot. In Chapter 3.832 3. we form p−1 dot = k=0 local dotk . } } Fig. or those reductions which typically yield a single element.BASIC OPERATIONS WITH VECTORS 145 Table 4.052. For example.p. These are just p accesses to the global variable dot.3 shows that this local accumulation reduces the execution time significantly.9. Row III in Table 4. where local dotk is the portion of the inner product that is computed by processor k. for (i=offset. Only at the end are the p individual local results added. 4. line II means the critical region for global variable variant dot from Figure 4.795 140. i++){ local_dot += x[i]*y[i]. we distinguished between vector operations whose results were either the same size as an input vector.113 1387 728 381 264 225 176 93 1392 732 381 269 220 176 88 dot = 0.11. Each thread has its own instance of private variable local dot.k++){ offset = k*(N/p).x.0. #pragma omp parallel shared(dot. Each local dotk can be computed independently of every other.541. local dot for each thread which keeps a partial result (Figure 4.3 Execution times in microseconds for our dot product.547 121.387 139.11.973 1. from Figure 4. line III means critical region for local variable local dot from Figure 4.i. } #pragma omp critical { dot += local_dot. Version p = 1 I II III IV 2 4 6 8 12 16 1875 4321 8799 8105 6348 7339 6538 155.k<p. .

its sub-operation.0. i< N. matrix–vector multiplication. or alternatively as an outer product which preserves vector length.i xi . and with indices written out n−1 y = Ax or yk = i=0 ak. are reductions. Alternatively. i++) y[k] += A[k+i*m]*x[i]. 0 ≤ k < m. In this code fragment. k<m. Actual implementations can be adapted to the underlying hardware. A C code fragment for the matrix–vector multiplication can be written as a reduction /* dot product variant of matrix-vector product */ for (k=0. there are at least two distinct ways to effect this operation. Figure 4. There are a number of other reduction operations besides addition. 4. If A is an m × n matrix and x a vector of length n. k++) y[k] += a[k+i*m]*x[i]. i<n. then A times the vector x is a vector of length m. Likewise. or column major) order.4.8 OpenMP matrix vector multiplication As we discussed in Chapter 3 regarding matrix multiplication.146 SHARED MEMORY PARALLELISM dot = 0. particularly Section 3. can be arranged as the textbook method. k<m. k++){ y[k] = 0. The OpenMP standard does not specify how a reduction has to be implemented. 4.1) /* saxpy variant of matrix-vector product */ for (k=0. OpenMP has built-in mechanisms for certain reduction operations. } Here again. } Fig. #pragma omp parallel for reduction(+ : dot) for (i=0. an outer product loop ordering is based on the saxpy operation (Section 3. which uses an inner product. we assume that the matrix is stored in column (Fortran. k++) y[k] = 0.12 is an inner product example.0. dot is the reduction variable and the plus sign “+” is the reduction operation.1. OpenMP reduction syntax for dot (version IV). for (i=0. i<n. i++){ dot += x[i]*y[i]. each yk is computed as the dot product of the kth row of A with the vector x. Here. Section 3. for (i=0.3. k<m. i++) for (k=0.0.12.4. .


Here, the ith column of A takes the role of x in (2.3), while x_i takes the role of the scalar α and y is the y in saxpy, as usual (2.3).

4.8.1 The matrix–vector multiplication with OpenMP
In OpenMP the goal is usually to parallelize loops. There are four options to parallelize the matrix–vector multiplication: we can parallelize either the inner or the outer loop of one of the two variants given above. The outer loop in the dot product variant can be parallelized without problem by just prepending an omp parallel for compiler directive to the outer for loop; a sketch of this simplest option is given after Table 4.4. It is important not to forget to declare the loop counter i of the inner loop private! Likewise, parallelization of the inner loop in the saxpy variant of the matrix–vector product is straightforward. The other two parallelization possibilities are more difficult because they involve reductions. The parallelization of the inner loop of the dot product variant can be done as a simple dot product

for (k=0; k<m; k++){
   tmp = 0.0;
   #pragma omp parallel for reduction(+ : tmp)
   for (i=0; i<n; i++){
      tmp += A[k+i*m]*x[i];
   }
   y[k] = tmp;
}

Parallelization of the outer loop of the saxpy variant is a reduction operation to compute the vector y. As OpenMP does not permit reduction variables to be of pointer type, we have to resort to the approach of the code fragment in Figure 4.12 and insert a critical section into the parallel region

#pragma omp parallel private(j,z)
{
   for (j=0; j<M; j++) z[j] = 0.0;
   #pragma omp for
   for (i=0; i<N; i++){
      for (j=0; j<M; j++){
         z[j] += A[j+i*M]*x[i];
      }
   }
   #pragma omp critical
   for (j=0; j<M; j++) y[j] += z[j];
}

We compared the four approaches. Execution times obtained on the HP Superdome are listed in Table 4.4. First of all, the execution times show that parallelizing the inner loop is obviously a bad idea. OpenMP implicitly sets a


Table 4.4 Some execution times in microseconds for the matrix–vector multiplication with OpenMP on the HP superdome. Variant Loop parallelized P 1 n = 100 dot dot saxpy saxpy n = 500 dot dot saxpy saxpy n = 1000 dot dot saxpy saxpy Outer Inner Outer Inner Outer Inner Outer Inner Outer Inner Outer Inner 19.5 273 9.76 244 2 9.76 420 9.76 322 4 9.76 2256 19.5 420 8 39.1 6064 68.4 3574 16 78.1 13,711 146 7128 32 146 33,056 342 14,912 196 165,185 488 72,656

732 293 146 1660 2197 12,109 732 146 48.8 2050 1904 2539 2734 4199 2930 6055

97.7 146 31,689 68,261 97.7 195 17,725 35,498

1367 684 293 196 293 4785 23,046 61,914 138,379 319,531 1464 781 195 391 977 4883 5078 36,328 71,777 146,484

synchronization point at the end of each parallel region. Therefore, the parallel threads are synchronized for each iteration step of the outer loop. Evidently, this produces a large overhead. Table 4.4 shows that parallelizing the outer loop gives good performance and satisfactory speedups as long as the processor number is not large. There is no clear winner between the dot product and saxpy variants. A slightly faster alternative to the dot product variant is the following code segment, where the dot product of a single thread is collapsed into one call to the Level 2 BLAS subroutine for matrix–vector multiplication, dgemv.

n0 = (m - 1)/p + 1;
#pragma omp parallel for private(blksize)
for (i=0; i<p; i++){
   blksize = min(n0, m-i*n0);
   if (blksize > 0)
      dgemv_("N", &blksize, &n, &DONE, &A[i*n0], &m,
             x, &ONE, &DZERO, &y[i*n0], &ONE);
}

The variable blksize holds the number of A's rows that are computed by the respective threads. The variables ONE, DONE, and DZERO are used to pass by address the integer value 1, and the double values 1.0 and 0.0 to the Fortran subroutine dgemv (see Appendix E).
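For reference, here is the sketch promised above of the simplest of the four options, the outer loop of the dot product variant parallelized by a single directive. The function name mxv_outer and its exact argument list are our own illustration, not a listing from the text; the column-major indexing matches the fragments above, and only OpenMP compilation flags are needed.

   /* dot product variant, outer loop parallelized: y = A*x,
      with A an m x n matrix in column-major storage (illustrative sketch) */
   void mxv_outer(int m, int n, const float *A, const float *x, float *y)
   {
      int i, k;
   #pragma omp parallel for private(i)     /* k is private automatically */
      for (k = 0; k < m; k++){
         y[k] = 0.0f;
         for (i = 0; i < n; i++)
            y[k] += A[k + i*m]*x[i];       /* dot product of row k with x */
      }
   }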


4.8.2 Shared memory version of SGEFA

The basic coding below is a variant from Section 3.4.2. Again, we have used the Fortran (column major) storage convention, wherein columns are stored in sequential memory locations, not the C row-wise convention. Array y is the active portion of the reduction in Figure 3.15. The layout for the multiple saxpy operation (msaxpy) is

            a0        a1        a2       ...    am-1
   x0       y0,0      y0,1      y0,2     ...    y0,m-1
   x1       y1,0      y1,1      y1,2     ...    y1,m-1
   ...      ...       ...       ...             ...
   xm-1     ym-1,0    ym-1,1    ym-1,2   ...    ym-1,m-1

And the calculation is, for j = 0, . . . , m − 1,

   yj ← aj x + yj ,                                         (4.1)

where yj is the jth column in the region of active reduction in Figure 3.15. Column x is the first column of the augmented system, and a is the first row of that system. Neither a nor x is modified by msaxpy. Column yj in msaxpy (4.1) is modified only by the processor which reduces this jth column and no other. In BLAS language, the msaxpy computation is a rank-1 update [135, section 4.9],

   Y ← x a^T + Y.                                           (4.2)

Revisit Figure 2.1 to see the Level 2 BLAS variant.

#define ym(p,q) (y+p+q*n)
void msaxpy(nr,nc,a,n,x,y)
int nr,nc,n;
float *a,*x,*y;
{
/* multiple SAXPY operation, wpp 29/01/2003 */
   int i,j;
#pragma omp parallel shared(nr,nc,a,x,y) private(i,j)
#pragma omp for schedule(static,10) nowait
   for(j=0;j<nc;j++){
      for(i=0;i<nr;i++){
         *ym(i,j) += a[j*n]*x[i];
      }
   }
}
#undef ym
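As an aside, the whole of msaxpy is exactly one such rank-1 update, so it could equally be handed to the Level 2 BLAS routine sger (A ← αxy^T + A). The following mapping of arguments is a sketch of ours, not the text's, but it keeps the same column-major layout:

/* msaxpy expressed as a single BLAS rank-1 update, Y <- x a^T + Y.
   sger's vector "x" is our x (length nr, stride 1); sger's vector
   "y" is the row a[0], a[n], a[2n], ... (length nc, stride n); the
   updated matrix is y with leading dimension n.                    */
void sger_(int *m, int *n, float *alpha, float *x, int *incx,
           float *y, int *incy, float *a, int *lda);

void msaxpy_blas(int nr, int nc, float *a, int n, float *x, float *y)
{
   float one  = 1.0f;
   int   ione = 1;
   sger_(&nr, &nc, &one, x, &ione, a, &n, y, &n);
}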


Fig. 4.13. Times and speedups for the parallel version of classical Gaussian elimination, SGEFA, on p. 108, [37]. The machine is Stardust, an HP 9000 Superdome cluster with 64 CPUs. Data are from a problem matrix of size 1000 × 1000. The chunk size specification of the for loop is (static,32). Scaling is satisfactory only up to about NCPUs ≤ 16, after which system parameters discourage the use of more CPUs. The compiler was cc.

In preparing the data for Figure 4.13, several variants of the OpenMP parameters were tested. These variants were all modifications of the

#pragma omp for schedule(static,10) nowait

line. The choice shown, schedule(static,10) nowait, gave the best results of several experiments. The so-called chunksize numerical argument to schedule refers to the depth of unrolling of the outer j-loop; in this case the unrolling is to 10 j-values per loop, see, for example, Section 3.2. The following compiler switches and environment variables were used. Our best results used HP's compiler cc,

cc +O4 +Oparallel +Oopenmp filename.c -lcl -lm -lomp -lcps -lpthread

with guidec giving much less satisfactory results. An environment variable specifying the maximum number of processors must be set:

setenv OMP_NUM_THREADS 8

In this example, OMP_NUM_THREADS is set to eight; any number larger than zero but no more than the total number of processors may be specified.
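The thread count can also be set programmatically with the standard OpenMP runtime routine omp_set_num_threads; a minimal sketch (the value eight is only an example):

#include <omp.h>

int main(void)
{
   omp_set_num_threads(8);   /* programmatic alternative to
                                setenv OMP_NUM_THREADS 8     */
   /* ... parallel loops as in the preceding sections ...    */
   return 0;
}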


4.8.3 Shared memory version of FFT

Another recurring example in this book, the binary radix FFT, shows a less desirable property of this architecture, or of the OpenMP implementation. Here is an OpenMP version of the routine step used in Section 3.6, specifically by the driver given on p. 128. Several variants of the omp for loop were tried, and schedule(static,16) nowait worked better than the others.

void step(n,mj,a,b,c,d,w,sign)
int n,mj;
float a[][2],b[][2],c[][2],d[][2],w[][2];
float sign;
{
   float ambr, ambu, wjw[2];
   int j, k, ja, jb, jc, jd, jw, lj, mj2;
/* one step of workspace version of CFFT2 */
   mj2 = 2*mj;
   lj  = n/mj2;
#pragma omp parallel shared(w,a,b,c,d,lj,mj,mj2,sign) \
        private(j,k,jw,ja,jb,jc,jd,ambr,ambu,wjw)
#pragma omp for schedule(static,16) nowait
   for(j=0;j<lj;j++){
      jw=j*mj; ja=jw; jb=ja; jc=j*mj2; jd=jc;
      wjw[0] = w[jw][0]; wjw[1] = w[jw][1];
      if(sign<0) wjw[1]=-wjw[1];
      for(k=0; k<mj; k++){
         c[jc + k][0] = a[ja + k][0] + b[jb + k][0];
         c[jc + k][1] = a[ja + k][1] + b[jb + k][1];
         ambr = a[ja + k][0] - b[jb + k][0];
         ambu = a[ja + k][1] - b[jb + k][1];
         d[jd + k][0] = wjw[0]*ambr - wjw[1]*ambu;
         d[jd + k][1] = wjw[1]*ambr + wjw[0]*ambu;
      }
   }
}

An examination of Figure 3.20 shows that the computations in all the boxes (Figure 3.21) using a given twiddle factor are independent—at an inner loop level. Furthermore, each set of computations using twiddle factor ω^k = exp(2πik/n) is also independent of every other k—at an outer loop level counted by k = 0, . . . . Hence, the computation of step may be parallelized at the outer loop level (over the twiddle factors) and vectorized at the inner loop level (over the number of independent boxes using ω^k). The results are not so satisfying, as we see in Figure 4.14. There is considerable improvement in the processing rate for large n, which on one CPU falls dramatically when L1 cache is exhausted. However, there is no significant improvement near the peak rate as NCPU increases and, worse, the small n rates are significantly worse.
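For reference, the twiddle factor table w consumed by step could be filled as follows. This is a sketch based on the definition ω^k = exp(2πik/n) quoted above, assuming a table of n/2 complex factors stored as w[k][0] = Re, w[k][1] = Im to match the code:

#include <math.h>

void make_twiddles(int n, float w[][2])
{
   const double twopi = 6.283185307179586;
   int k;
   for (k = 0; k < n/2; k++){
      w[k][0] = (float) cos(twopi*k/n);   /* real part      */
      w[k][1] = (float) sin(twopi*k/n);   /* imaginary part */
   }
}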

Fig. 4.14. Simple-minded approach to parallelizing one n = 2^m FFT using OpenMP on Stardust. The algorithm is the generic one from Figure 3.20, parallelized using OpenMP for the outer loop on the twiddle factors ω^k. The compiler was cc; np = number of CPUs (np = 1, 4, 8, 32), with loop schedule (static,16).

These odd results seem to be a consequence of cache coherency enforcement of the memory image over the whole machine. An examination of Figures 4.1 and 4.2 shows that when data are local to a cell, memory access is fast: each memory module is local to the cell. Conversely, when data are not local to a cell, the ccnuma architecture seems to exact a penalty to maintain the cache coherency of blocks on other cells. Writes on these machines are write-through (see Section 1.2.1.1), which means that the cache blocks corresponding to these new data must be marked invalid on every processor but the one writing these updates. It seems, therefore, that the overhead for parallelizing at an outer loop level when the inner loop does so little computation is simply too high. If independent tasks are assigned to independent CPUs, they should be large enough to amortize this latency.

4.9 Overview of OpenMP commands

The OpenMP standard is, thankfully, a relatively limited set. It is designed for shared memory parallelism, but lacks any vectorization features. On many machines (e.g. Intel Pentium, Crays, SGIs), pragma directives in C, or functionally similar directives (cdir$, *vdir, etc.) in Fortran, can be used to control vectorization: see Section 3.2.2. OpenMP is a relatively sparse set of commands (see Appendix C) and unfortunately has no SIMD vector control structures. Alas, this means that no non-vendor-specific vectorization standard exists at all. For the purposes of our examples, directives such as #pragma ivdep are satisfactory, but more general structures are not standard.
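To illustrate the idiom (the directive spelling varies from compiler to compiler, and this loop is our own example, not one from the text):

void scaled_add(int n, float alpha, float *x, float *y)
{
   int i;
/* assert to a vectorizing compiler that the iterations are
   independent, so the loop may run as vector operations     */
#pragma ivdep
   for (i = 0; i < n; i++)
      y[i] += alpha*x[i];
}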

A summary of these commands is given in Appendix C.

4.10 Using Libraries

Classical Gaussian elimination is inferior to a blocked procedure, as we indicated in Sections 2.2.2.1 and 2.2.2.2, because matrix multiplication is an accumulation of vectors (see Section 3.4.1). Hence, a blocked procedure based on dgemm should be superior to variants of saxpy. This is indeed the case, as we now show. The test results are from HP's MLIB, which contains LAPACK. Our numbers are from a Fortran version,

f90 +O3 +Oparallel filename.f -llapack

and the LAPACK library on our system is to be found in

-llapack → /opt/mlib/lib/pa2.0/liblapack.a

This specific path to LAPACK may well be different on another machine. To specify the maximum number of CPUs you might want for your job, the following permits you to control this:

setenv MLIB_NUMBER_OF_THREADS ncpus

where ncpus (say eight) is the maximum number. Both Figures 4.13 and 4.15 show that when ncpus > 8, performance no longer scales linearly with ncpus. The ETH Superdome has 64 processors, and when ncpus is larger than about 1/4 of this number, scaling suffers. On a loaded machine, you can expect less than optimal performance from your job when ncpus becomes comparable to the total number of processors on the system.

We feel there are some important lessons to be learned from these examples.
• Whenever possible, use high quality library routines: reinventing the wheel rarely produces a better one. In the case of LAPACK, an enormous amount of work has produced a high performance and well-documented collection of the best algorithms.
• A good algorithm is better than a mediocre one. sgetrf uses a blocked algorithm for Gaussian elimination, while the classical method sgefa involves too much data traffic.
• Even though OpenMP can be used effectively for parallelization, its indirectness makes tuning necessary. Table 4.2 shows that codes can be sensitive to chunk sizes and types of scheduling.

Exercise 4.1 Parallelize the Linpack benchmark. The task here is to parallelize the Linpack benchmark described in Section 3.4.2. Start with Bonnie Toy's C version of the benchmark, which you can download from NETLIB [111]:

http://www.netlib.org/benchmark/linpackc

The complete source code we will call clinpack.c.

Fig. 4.15. Times and speedups for the Hewlett-Packard MLIB version of the LAPACK routine sgetrf. The test was in Fortran, compiled with f90. The problem size is 1000 × 1000. Scaling is adequate for NCPUs < 8, but is less than satisfactory for larger NCPUs. However, the performance of the MLIB version of sgetrf is clearly superior to classical Gaussian elimination sgefa shown in Figure 4.13.

What is to be done?
1. From the NETLIB site (above), download the clinpack.c version of the Linpack benchmark. The only real porting problem here is likely to be the timer—second. We recommend replacing second with the walltime routine (returns a double) given below. On Cray hardware, use timef instead of walltime for multi-CPU tests. With some effort, port this code to the shared memory machine of your choice.
2. In Section 4.8.2, we saw a version of sgefa.c which used the OpenMP "omp for" construct. In clinpack.c, the relevant section is in the double version dgefa (an int procedure there). Beginning with the comment row elimination with column indexing, you can see two code fragments in the for (j = kp1; j < n; j++) loop. The first swaps row elements a(k+1+i,l) for a(k+1+i,k) for i=0,...,n-k-1. In fact, as we noted on p. 108, this can be pulled out of the loop: look at the if(l!=k) section on p. 108 before the corresponding for(j=kp1;j<n;j++) loop. Your task, after moving out this swap, is to parallelize the remaining for loop. One way to do this is by using the OMP parallel for directive to parallelize the corresponding daxpy call in Toy's version.
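To fix ideas, the loop body inside dgefa might end up with roughly the following shape once the swap has been moved out. This is a sketch of one possibility, not Toy's code; the names kp1, n, k, lda, a, t, and daxpy are those of clinpack.c:

/* elimination loop of dgefa with the row swap already performed;
   each column j can now be updated independently                 */
#pragma omp parallel for private(j,t)
   for (j = kp1; j < n; j++) {
      t = a[lda*j+k];                  /* multiplier for column j */
      daxpy(n-(k+1), t, &a[lda*k+k+1], 1, &a[lda*j+k+1], 1);
   }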

For the wallclock timing, use the following (t0 = 0.0 initially):

   t0 = walltime(&tw);          /* start timer     */
   sgefa(a,n,n,ipvt,&info);     /* factor A -> LU  */
   tw = walltime(&t0);          /* time for sgefa  */

where the variables tw, t0 start the timer and return the difference between the start and the end. The routine walltime is

#include <sys/time.h>
double walltime(double *t0)
{
   double mic, time, mega=0.000001;
   struct timeval tp;
   struct timezone tzp;
   static long base_sec  = 0;
   static long base_usec = 0;

   (void) gettimeofday(&tp, &tzp);
   if (base_sec == 0) {
      base_sec  = tp.tv_sec;
      base_usec = tp.tv_usec;
   }
   time = (double)(tp.tv_sec  - base_sec);
   mic  = (double)(tp.tv_usec - base_usec);
   time = (time + mic * mega) - *t0;
   return(time);
}

3. Once you get the code ported and the walltime clock installed, then proceed to run the code to get initial 1-CPU timing(s). Now modify the dgefa routine in the way described above to use the OpenMP "omp parallel for" directive to call daxpy independently for each processor available. You need to be careful about the data scoping rules.
4. Run your modified version, with various chunksizes if need be, and plot the multiple-CPU timing results compared to the initial 1-CPU value(s). Be sure that the numerical solutions are correct. Also compare your initial 1-CPU result(s) to the setenv OMP_NUM_THREADS 1 values. Vary the number of CPUs by using the setenv OMP_NUM_THREADS settings from Section 4.8.2. Also, try different matrix sizes other than the original 200×200.
5. Benchmark report: from the NETLIB server [111], download the latest version of performance.ps just to see how you are doing. Do not expect your results to be as fast as those listed in this report. Rather, the exercise is just to parallelize dgefa. Any ideas how to modify Toy's loop unrolling of daxpy?

5 MIMD, MULTIPLE INSTRUCTION, MULTIPLE DATA

Never trust a computer you can't throw out a window.
S. Wozniak

The multiple instruction, multiple data (MIMD) programming model usually refers to computing on distributed memory machines with multiple independent processors. Although processors may run independent instruction streams, we are interested in streams that are always portions of a single program. Between processors which share a coherent memory view (within a node), data access is immediate, whereas between nodes data access is effected by message passing. In this book, we use MPI for such message passing. MPI has emerged as a more or less standard message passing system used on both shared memory and distributed memory machines.

It is often the case that although the system consists of multiple independent instruction streams, the programming model is not too different from SIMD. Namely, the totality of a program is logically split into many independent tasks, each processed by a group (see Appendix D) of processes—but the overall program is effectively single threaded at the beginning, and likewise at the end. The MIMD model, however, is extremely flexible in that no one process is always master and the other processes slaves. A communicator group of processes performs certain tasks, usually with an arbitrary master/slave relationship. One process may be assigned to be master (or root) and coordinates the tasks of the others in the group. We emphasize that the assignment of which process is root is arbitrary—any processor may be chosen. Frequently, however, this choice is one of convenience—a file server node, for example.

Processors and memory are connected by a network, for example, Figure 5.1. In this form, each processor has its own local memory. This is not always the case: the Cray X1 (Figures 4.5 and 4.6), and the NEC SX-6 through SX-8 series machines (Figures 4.7 and 4.8), have common memory within nodes. Within a node, memory coherency is maintained within local caches. Between nodes, it remains the programmer's responsibility to assure a proper read–update relationship in the shared data. Data updated by one set of processes should not be clobbered by another set until the data are properly used.

Fig. 5.1. Generic MIMD distributed-memory computer (multiprocessor).

Fig. 5.2. Network connection for ETH Beowulf cluster, Asgard (nodes on 4.2 GB interfaces; 1 GB bus for 8 cabinets).

Although a much more comprehensive review of the plethora of networks in use is given in section 2.4 of Hwang [75], we note here several types of networks used on distributed memory machines:
1. Bus: a congestion-bound, but cheap, and scalable with respect to cost network (like on our Beowulf cluster, Asgard, Figure 5.2).
2. Dynamic networks: for example, a crossbar, which is hard to scale, has very many switches, is expensive, and is typically used for only a limited number of processors.
3. Static networks: for example, Ω-networks, arrays, rings, meshes, tori, hypercubes. In Chapter 1, we showed an Ω-network in Figures 1.11 and 1.12. That example used 2 × 2 switches, but we also noted in Chapter 4, regarding the HP9000 machine, that higher order switches (e.g. 4 × 4) are also used.
4. Combinations: for example, a bus connecting clusters which are connected by a static network: see Section 5.2.

In MPI, the numbering of each node is called its rank. In many situations, an arbitrarily selected node, say nodei, is chosen to coordinate the others; a selected master node is numbered rank = 0.

The physical assignment of nodes is in an environmental list which maps the rank number to the physical nodes the system understands. In the Parallel Batch System (PBS), this list is PBS_NODEFILE, and is defined for the shell running your batch job. For example, processor nodei = rank = 0 might be mapped rank = 0 → n038.

5.1 MPI commands and examples

Before we do an example, we need to show the basic forms of MPI commands. In Appendix D, a more complete tabulation of MPI procedures is given, but here are three of the most basic: a message send, MPI_Send; a message receive, MPI_Recv; and a broadcast, MPI_Bcast. These functions return an integer value which is either a zero (0), indicating there was no error, or a nonzero from which one can determine the nature of the error. The default behavior of MPI is to abort the job if an error is detected, but this may be controlled using MPI_Errhandler_set [109].

The fields of MPI_Send and MPI_Recv, respectively, are as follows [115].
1. send_data is a pointer to the data to be sent; recv_data is the pointer to where the received data are to be stored.
2. count is the number of items to send, or the number expected, respectively.
3. datatype is the type of each item in the message; for example, type MPI_FLOAT, which means a single precision floating point datum.
4. dest_rank is the rank of the node to which the data are to be sent; source_rank is the rank of the node which is expected to be sending data. Frequently, MPI_Recv will accept data from any source and uses MPI_ANY_SOURCE to do so. This variable is defined in the mpi.h include file.
5. tag_ident is a tag or label on the message. Often the MPI_Recv procedure would use MPI_ANY_TAG to accept a message with any tag. Distinct portions of code may send the same tag: for example, parts of code which were written by someone else, or even parts of your own code which do distinctly different calculations. When using libraries, this mechanism can be useful; such portions must be classified into communicators to control their passing of messages to each other.
6. comm defines a set allowed to send/receive data. When communicating between all parts of your code, this is usually MPI_COMM_WORLD, meaning any source or destination. This is defined in mpi.h. For your local implementation, look at the include file mpidefs.h; on our system, this is in directory /usr/local/apli/mpich-1.2.5.3/include.

7. For MPI_Recv (and others in Appendix D), status is a structure (see Figure 5.3) used in MPI_Recv that contains information from the source about how many elements were sent, the source's rank, the tag of the message, and any error information. On some systems, more information is returned in private_count, an integer or array of integers; the count is accessible only by using MPI_Get_count.

int MPI_Send(                  /* returns 0 if success */
    void*         send_data,   /* message to send      */
    int           count,       /* input */
    MPI_Datatype  datatype,    /* input */
    int           dest_rank,   /* input */
    int           tag_ident,   /* input */
    MPI_Comm      comm)        /* input */

int MPI_Recv(                  /* returns 0 if success */
    void*         recv_data,   /* message to receive   */
    int           count,       /* input */
    MPI_Datatype  datatype,    /* input */
    int           source_rank, /* input */
    int           tag_ident,   /* input */
    MPI_Comm      comm,        /* input */
    MPI_Status*   status)      /* struct */

int MPI_Bcast(                 /* returns 0 if success */
    void*         message,     /* output/input */
    int           count,       /* input */
    MPI_Datatype  datatype,    /* input */
    int           root,        /* input */
    MPI_Comm      comm)        /* input */

typedef struct {
    int count;
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
    /* int private_count; */
} MPI_Status;

Fig. 5.3. MPI status struct for send and receive functions.

Commands MPI_Send and MPI_Recv are said to be blocking. This means the functions do not return to the routine which called them until the messages are sent or received, respectively [115].
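Putting the two point-to-point calls together, here is a minimal complete example (ours, not from the text) in which rank 1 sends four floats to rank 0:

#include <mpi.h>

int main(int argc, char *argv[])
{
   float buf[4] = {1.0f, 2.0f, 3.0f, 4.0f};
   int rank;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (rank == 1) {                        /* sender   */
      MPI_Send(buf, 4, MPI_FLOAT, 0, 99, MPI_COMM_WORLD);
   } else if (rank == 0) {                 /* receiver */
      MPI_Recv(buf, 4, MPI_FLOAT, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &status);
   }
   MPI_Finalize();
   return 0;
}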

There are also non-blocking equivalents of these send/receive commands. For example, the pair MPI_Isend and MPI_Irecv initiate sending a message and receiving it, respectively, but MPI_Isend immediately returns to the calling procedure once the transmission is initialized—irrespective of whether the transmission actually occurred. The danger is that the associated send/receive buffers might be modified before either operation is explicitly completed: the buffers may be modified and thus updated information lost. We do not illustrate these non-blocking commands here because on our Beowulf system they are effectively blocking anyway, since all data traffic share the same memory channel. Both forms (non-blocking and blocking) of these commands are point to point. This means data are transmitted only from one node to another.

Another commonly used command involves a point-to-many broadcast. For example, in our y = Ax matrix–vector multiply example discussed later, it is convenient to distribute vector x to every processor available. Vector x is unmodified by these processors, but is needed in its entirety to compute the segment of the result y computed on each node. This broadcast command, MPI_Bcast, has the peculiarity that it is both a send and a receive: from the initiator (say root = 0), it is a broadcast, while for a slave, rank ≠ root, it is a receive function. The parameters in MPI_Bcast, as typed in the template above, are as follows.
1. message is a pointer to the data to be sent to all elements in the communicator comm (as input) if the node processing the MPI_Bcast command has rank = root. If the rank of the node is not root, then the command is processed as a receive and message (output) is a pointer to where the received data broadcast from root are to be stored.
2. count is the integer number of elements of type datatype to be broadcast (input), or the number received (output).
3. datatype specifies the datatype of the received message. These may be one of the usual MPI datatypes, for example, MPI_FLOAT (see Table D.1).
4. root is an integer parameter specifying the rank of the initiator of the broadcast.
5. comm is of MPI_Comm type and specifies the communicator group to which the broadcast is to be distributed. Other communicators are not affected.
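Connecting these parameters to the y = Ax example mentioned above, a small sketch of how every rank might obtain the full vector x (the function and variable names are ours):

#include <mpi.h>

/* after this call, every rank in MPI_COMM_WORLD holds the same x */
void distribute_x(float *x, int n)
{
   int root = 0;                 /* rank that owns the initial x */
   MPI_Bcast(x, n, MPI_FLOAT, root, MPI_COMM_WORLD);
}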

A more complete explanation of MPI functions and procedures is given in Peter Pacheco's book [115], and the definitive reference is from Argonne [110]. In addition, an important reference is Gropp et al. [64]. In order to supplement our less complete treatment, download the MPI-2 standard reference guide from NETLIB [111]. Appendix D contains a useful subset of MPI functionality.

Before we proceed to some examples, we would like to show the way to compile and execute MPI programs. Included are compile and PBS scripts. You may have to modify some details for your local environment, but the general features are hopefully correct. Many operations are better done in command line mode: for example, compilations and small jobs. Since compilers are frequently unavailable on every node, batch jobs for compilation may fail. To submit a batch job, you will need PBS commands. On our systems, two choices are available: MPICH [109] and LAM [90]. The following scripts will be subject to some modifications for your local system: in particular, the pathname $MPICHBIN, which is the directory where the compiler mpicc and the supervisory execution program mpirun are located.

1. PBS = "parallel batch system" commands:
   • qsub submits your job script: qsub qjob
   • qstat gets the status of the previously submitted qjob (designated, say, 123456.gate01 by PBS): qstat 123456.gate01
   • qdel deletes a job from the PBS queue: qdel 123456
2. An available nodefile, PBS_NODEFILE, is assigned to the job qjob you submit. This nodefile mapping can be useful for other purposes: for example, to get the number of nodes available to your job qjob,
   nnodes=`wc $PBS_NODEFILE|awk '{print $1}'`
3. mpicc = C compiler for the MPICH/LAM version of MPI:
   mpicc -O3 yourcode.c -lm -lblas -lf2c
   The compile includes a load of the BLAS library and libf2c.a for Fortran–C communication.
4. mpirun = run command for MPICH/LAM, with options for NCPUs:
   mpirun -machinefile $PBS_NODEFILE -np $nnodes a.out

On our systems, the compiler runs only on the service nodes (three of these)—so you need a compile script to generate the desired four executables (run128, run256, run512, run1024). Figure 5.4 shows the compile script which prepares run128, ..., run1024 and runs interactively (not through PBS). Figure 5.5 is the PBS batch script (submitted by qsub) to run the run128, ..., run1024 executables. The compile script in Figure 5.4 is easily modified for LAM by changing the MPICH path to the "commented out" LAM path (remove the # character). A PBS script for LAM is similar to modify and is shown in Figure 5.6. At your discretion are the number of nodes and the wallclock time limit.

#!/bin/bash
# LAM path
# inc=/usr/local/apli/lam-6.5.6/include
# MPICC=/usr/local/apli/lam-6.5.6/mpicc
# MPICH path
inc=/usr/local/apli/mpich-1.2.5.3/include
MPICC=/usr/local/apli/mpich-1.2.5.3/bin/mpicc
echo "$inc files" >comp.out
for sz in 128 256 512 1024
do
  echo "${sz} executable"
  which $MPICC
  sed "s/MAX_X=64/MAX_X=$sz/" <time3D.c >timeit.c
  $MPICC -I${inc} -O3 -o run${sz} timeit.c -lm
done

Fig. 5.4. MPICH compile script.

#!/bin/bash
# MPICH batch script
#
#PBS -l nodes=64,walltime=200
set echo
date
MRUN=/usr/local/apli/mpich-1.2.5.3/bin/mpirun
nnodes=`wc $PBS_NODEFILE|awk '{print $1}'`
echo "=== QJOB submitted via MPICH ==="
for sz in 128 256 512 1024
do
  echo "=======N=$sz timings======"
  $MRUN -machinefile $PBS_NODEFILE -np $nnodes run${sz}
  echo "=========================="
done

Fig. 5.5. MPICH (PBS) batch run script.

#!/bin/bash
# LAM batch script
#
LAMBIN=/usr/local/apli/lam-6.5.6/bin
machinefile=`basename $PBS_NODEFILE.tmp`
uniq $PBS_NODEFILE > $machinefile
nnodes=`wc $machinefile|awk '{print $1}'`
nnodes=$(( nnodes - 1 ))
$LAMBIN/lamboot -v $machinefile
echo "=== QJOB submitted via LAM ==="
# yourcode is your sourcefile
$LAMBIN/mpicc -O3 yourcode.c -lm
$LAMBIN/mpirun n0-$nnodes -v a.out
$LAMBIN/lamwipe -v $machinefile
rm -f $machinefile

Fig. 5.6. LAM (PBS) run script.

5.2 Matrix and vector operations with PBLAS and BLACS

Operations as simple as y = Ax for distributed memory computers have obviously been written into libraries, and not surprisingly a lot more. ScaLAPACK [12] is the standard library for the basic algorithms from linear algebra on MIMD distributed memory parallel machines. It builds on the algorithms implemented in LAPACK that are available on virtually all shared memory computers. The algorithms in LAPACK are designed to provide accurate results and high performance. High performance is obtained through the three levels of basic linear algebra subroutines (BLAS), which are discussed in Section 2.2. Level 3 BLAS routines make it possible to exploit the hierarchical memories partly described in Chapter 1. LAPACK and the BLAS are used in ScaLAPACK for the local computations, that is, computations that are bound to one processor, see Figure 5.7. Global aspects are dealt with by using the Parallel Basic Linear Algebra Subroutines (PBLAS), see Table 5.2, that are an extension of the BLAS for distributed memories [19].

Fig. 5.7. The ScaLAPACK software hierarchy: ScaLAPACK and the PBLAS act globally; LAPACK, the BLAS, the BLACS, and the message passing primitives (MPI, PVM, ...) act locally.

The actual communication is implemented in the Basic Linear Algebra Communication Subprograms (BLACS) [149], see Table 5.1. The BLACS are high level communication functions that provide primitives for transferring matrices that are distributed on the process grid according to the two-dimensional block cyclic data layout of Section 5.4. Recall that vectors and scalars are considered special cases of matrices. Besides general rectangular matrices, the BLACS also provide communication routines for trapezoidal and triangular matrices. In particular, they provide the so-called scoped operations, that is, collective communication routines that apply only to process columns or rows. The context of many of these routines can be a column, a row, or the whole process grid. In Section 5.4 we show how the PBLAS access matrices are distributed among the processors in a two-dimensional block cyclic data layout.

Table 5.1 Summary of the BLACS.

Support routines
BLACS_PINFO      Number of processes available for BLACS use
BLACS_SETUP      Number of processes to create
SETPVMTIDS       Number of PVM tasks the user has spawned
BLACS_GET        Get some BLACS internals
BLACS_SET        Set some BLACS internals
BLACS_GRIDINIT   Initialize the BLACS process grid
BLACS_GRIDMAP    Initialize the BLACS process grid
BLACS_FREEBUF    Release the BLACS buffer
BLACS_GRIDEXIT   Release process grid
BLACS_ABORT      Abort all BLACS processes
BLACS_EXIT       Exit BLACS
BLACS_GRIDINFO   Get grid information
BLACS_PNUM       Get system process number
BLACS_PCOORD     Get row/column coordinates in the process grid
BLACS_BARRIER    Set a synchronization point

Point-to-point communication
GESD2D, TRSD2D   General/trapezoidal matrix send
GERV2D, TRRV2D   General/trapezoidal matrix receive

Broadcasts
GEBS2D, TRBS2D   General/trapezoidal matrix broadcast (sending)
GEBR2D, TRBR2D   General/trapezoidal matrix broadcast (receiving)

Reduction (combine) operations
GSUM2D, GAMX2D, GAMN2D   Reduction with sum/max/min operator
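As an illustration of the support routines in Table 5.1, a typical grid set-up and tear-down sequence might look as follows. This is only a sketch: it uses the C interface names (the Cblacs_ prefix), and the 2 × 2 grid shape is an arbitrary choice of ours.

#include <stdio.h>

/* C interface to the BLACS support routines of Table 5.1 */
void Cblacs_pinfo(int *mypnum, int *nprocs);
void Cblacs_get(int context, int request, int *value);
void Cblacs_gridinit(int *context, char *order, int np_row, int np_col);
void Cblacs_gridinfo(int context, int *np_row, int *np_col,
                     int *my_row, int *my_col);
void Cblacs_gridexit(int context);
void Cblacs_exit(int status);

int main(void)
{
   int mypnum, nprocs, context;
   int np_row = 2, np_col = 2, my_row, my_col;

   Cblacs_pinfo(&mypnum, &nprocs);          /* who am I, how many are we */
   Cblacs_get(-1, 0, &context);             /* get the default context   */
   Cblacs_gridinit(&context, "Row", np_row, np_col);   /* 2 x 2 grid     */
   Cblacs_gridinfo(context, &np_row, &np_col, &my_row, &my_col);
   printf("process %d of %d sits at (%d,%d)\n",
          mypnum, nprocs, my_row, my_col);
   /* ... PBLAS/ScaLAPACK calls would go here ... */
   Cblacs_gridexit(context);                /* release the process grid  */
   Cblacs_exit(0);                          /* exit the BLACS            */
   return 0;
}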

Like the MPI communicators, the BLACS have features to initialize, query, and release process grids. A process grid is identified by its context, which is the "world" in which communication takes place.